The US Federal Highway Administration collected the following data in order to understand the effect of state gasoline tax on fuel consumption.
For each state (plus the District of Columbia) they recorded the state gasoline tax in cents per gallon (TAX), the number of licensed drivers per 1000 population (DLIC), per-capita income (INC), the length of the federal-aid highway network (ROAD) and per-capita fuel consumption (FUEL).
Let's start by downloading the data:
filepath <- "https://www.maths.nottingham.ac.uk/personal/pmzrdw/FuelData.txt"
# Download the data from the internet (method = "curl" requires curl to be
# installed; omit the method argument to let R pick a default)
download.file(filepath, destfile = "FuelData.txt", method = "curl")
FuelData <- read.table(file = "FuelData.txt", header = TRUE, sep = "&")
FuelData[1:10,]
## State TAX DLIC INC ROAD FUEL
## 1 Alabama 18.0 1031.38 23.471 94.440 690.26
## 2 Alaska 8.0 1031.64 30.064 13.628 514.27
## 3 Arizona 18.0 908.59 25.578 55.245 621.47
## 4 Arkansas 21.7 946.57 22.257 98.132 655.29
## 5 California 18.0 844.70 32.275 168.771 573.91
## 6 Colorado 22.0 989.60 32.949 85.854 616.61
## 7 Connecticut 25.0 999.59 40.640 20.910 549.99
## 8 Delaware 23.0 924.34 31.255 5.814 626.02
## 9 Dist of Col 20.0 700.19 37.383 1.534 317.49
## 10 Florida 13.6 1000.12 28.145 117.299 586.34
str(FuelData) # look at the data structure
## 'data.frame': 51 obs. of 6 variables:
## $ State: Factor w/ 51 levels "Alabama ",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ TAX : num 18 8 18 21.7 18 22 25 23 20 13.6 ...
## $ DLIC : num 1031 1032 909 947 845 ...
## $ INC : num 23.5 30.1 25.6 22.3 32.3 ...
## $ ROAD : num 94.4 13.6 55.2 98.1 168.8 ...
## $ FUEL : num 690 514 621 655 574 ...
Let's first visualise the data with a scatter-plot matrix:
plot(FuelData[,-1]) ## -1 so as not to plot the state name
Alternatively, we can use the following, which adds a smooth curve to each panel to suggest the general trend (useful when \(n\) is large).
require(car)
scatterplotMatrix(FuelData[,-1]) ## gives slightly more info
The plots give the impression that FUEL decreases on average with TAX, but it is hard to say anything for certain as there is a lot of variation. Overall, FUEL appears at best weakly related to the other variables.
However, the scatter-plot matrix just shows marginal relationships between pairs of variables (i.e. FUEL vs TAX ignores the information in DLIC, ROAD and INC). It doesn’t help us to understand how fuel is related to all four predictors simultaneously.
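These marginal pairwise relationships can also be summarised numerically with the sample correlation matrix, as a quick complement to the scatter-plot matrix:

```r
# Pairwise correlations between the numeric variables
# (column 1, the state name, is dropped as before)
round(cor(FuelData[, -1]), 2)
```

Like the scatter-plots, these correlations only describe two-variable relationships, ignoring the remaining predictors.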
Consider the multiple linear regression model
\[FUEL_i=\beta_0 + \beta_1 TAX_i+\beta_2 DLIC_i+\beta_3 INC_i+ \beta_4 ROAD_i+\epsilon_i\]
fit <- lm(FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData )
coef(fit)
## (Intercept) TAX DLIC INC ROAD
## 383.9544095 -4.1145064 0.5352993 -7.1373711 0.4015963
The coefficients tell us the estimated change in fuel consumption per unit change in each variable, holding the others fixed, but it's always nice to visualise things when possible. Visualising a model with four inputs isn't easy, but the R package visreg helps.
library(visreg) # You will have to install the package first time you use it
# install.packages('visreg')
par(mfrow=c(2,2))
visreg(fit)
The key question we want to answer:
Test \(H_{0}:\beta_{1}=0;\) vs \(H_{1}: \beta _{1}\neq 0;\)
In this case we can use a t-test or an F-test as there is just a single constraint, and \(F_{1, n-p} = (t_{n-p})^2\). The test statistic is: \[ T=\frac{\widehat{\beta}_1}{\textrm{std.error}( \widehat{\beta}_{1}) }\sim t_{n-p}\quad\text{under }H_{0}. \] i.e. we should reject \(H_{0}\) at the \(100 \alpha\)% level if
\[ \left| T_{obs}\right| \geqslant t_{n-p}\left( 1-\alpha /2\right), \] where here \(n-p = 51-5 = 46\).
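As a sketch, we can compute the critical value and the test statistic for TAX by hand, using the estimate and standard error from the summary table shown below:

```r
# Two-sided 5% critical value with n - p = 46 degrees of freedom
t_crit <- qt(1 - 0.05 / 2, df = 46)   # approximately 2.01
# Test statistic for TAX: estimate divided by its standard error
T_obs <- -4.1145 / 2.1074             # approximately -1.95
abs(T_obs) >= t_crit                  # FALSE: do not reject H0 at the 5% level
```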
R automatically carries out this test when you run the summary command.
summary(fit)
##
## Call:
## lm(formula = FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.900 -35.945 5.394 36.820 180.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 383.9544 165.7540 2.316 0.025044 *
## TAX -4.1145 2.1074 -1.952 0.056988 .
## DLIC 0.5353 0.1373 3.898 0.000313 ***
## INC -7.1374 2.2054 -3.236 0.002247 **
## ROAD 0.4016 0.1873 2.144 0.037315 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.17 on 46 degrees of freedom
## Multiple R-squared: 0.4755, Adjusted R-squared: 0.4299
## F-statistic: 10.43 on 4 and 46 DF, p-value: 4.271e-06
We can simply read off the t-statistic and the corresponding p-value. R also provides a visual indication of significance: the '.' next to TAX shows that its p-value lies between 0.05 and 0.1, i.e., there is not enough evidence to reject \(H_0\) at the 5% level.
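The reported p-value for TAX can be reproduced directly from its t-statistic:

```r
T_obs <- -1.952                 # t value for TAX from the summary table
2 * pt(-abs(T_obs), df = 46)    # two-sided p-value, approximately 0.057
```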
This table suggests that TAX is the only variable that does not contain much information about the response once the other variables have been considered.
It is important to note that the \(t\)-tests on each \(\beta_{j}\) for including that parameter are not independent. For example if two of the input variables were not significant then leaving just one of them out of the model may cause the other one to become significant.
Conversely, if only TAX and the intercept are included in the model, then TAX might well be significant.
This contains all of the information in a concise format. You should make sure you understand what every number in this output means, how to interpret it, and how to calculate it. If you wanted to carry out the same test of \(H_0:\beta_1=0\) using the F-test, you could compare the full model with the reduced model that omits TAX:
fit2 <- lm(FUEL ~ DLIC + INC + ROAD, data = FuelData)
anova(fit2, fit)  ## reduced model first, then the full model
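Since only a single constraint is being tested, the F statistic from this comparison is exactly the square of the TAX t-statistic, with an identical p-value:

```r
# F = t^2 when testing one coefficient; compare with the t value for TAX
(-4.1145 / 2.1074)^2   # approximately 3.81
```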
We want to test whether the full model is a significant improvement over the null model, i.e., test \[H_{0}:\beta _{1}=\beta_2=\beta_3=\beta_4=0;\,\beta _{0}\;\,% \text{arbitrary}\] vs \[H_{1}:\beta _{0},\, \beta_1,\,\beta_{2},\,\beta _{3},\,\beta _{4}\;\,\text{arbitrary}\]
fit0<-lm(FUEL~1, data=FuelData)
anova(fit0, fit) ## compares the full model with the null model
## Analysis of Variance Table
##
## Model 1: FUEL ~ 1
## Model 2: FUEL ~ TAX + DLIC + INC + ROAD
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 50 395700
## 2 46 207533 4 188167 10.427 4.271e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can read from the table that \(F=10.43\) and \(p<0.001\).
This implies that the full model is a significant improvement on the null model, i.e. that at least one of the input variables is informative about the response variable.
It does not imply that all of the input variables are informative though.
Since \(F>F_{4,46}(0.99)=3.76\) we have strong evidence that at least some of the explanatory variables are useful for prediction.
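As a check, the F statistic can be reproduced from the residual sums of squares in the ANOVA table, and the critical value computed with qf:

```r
RSS0 <- 395700; RSS1 <- 207533              # null and full model RSS
F_obs <- ((RSS0 - RSS1) / 4) / (RSS1 / 46)  # (drop in RSS / df) / (RSS / df)
F_obs                                       # approximately 10.43
qf(0.99, df1 = 4, df2 = 46)                 # approximately 3.76
```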
Next we check for observations that are outlying or have an undue influence on the fit.
require(car)
## Loading required package: car
influenceIndexPlot(fit, id.n=2, labels=FuelData$State) # index plots of Cook's distance, leverage, etc.
influencePlot(fit, id.n=2, labels=FuelData$State)      # studentised residuals against leverage
## StudRes Hat CookD
## Alaska -3.3982194 0.23636154 0.76257332
## Dist of Col -1.6753422 0.29343215 0.47361957
## Texas 0.1954799 0.38048945 0.06923925
## Wyoming 3.0497586 0.08664797 0.38664771
These suggest that Alaska is by far the most influential point as it is a large outlier and has reasonably high leverage. It could be argued that it is an unusual state and should be left out of the analysis.
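The numbers behind these plots can also be extracted directly; for example, the observations with the largest Cook's distances:

```r
# Cook's distances for the fitted model, labelled by state
cd <- cooks.distance(fit)
names(cd) <- FuelData$State
round(sort(cd, decreasing = TRUE)[1:3], 3)   # Alaska stands out
```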
FuelData2 <- FuelData[-2,] # remove Alaska (row 2) - alternatively use dplyr::slice(FuelData, -2)
fit3 <- lm(FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData2)
compareCoefs(fit, fit3)
##
## Call:
## 1: lm(formula = FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData)
## 2: lm(formula = FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData2)
## Est. 1 SE 1 Est. 2 SE 2
## (Intercept) 383.954 165.754 354.986 149.741
## TAX -4.115 2.107 -6.826 2.061
## DLIC 0.535 0.137 0.626 0.127
## INC -7.137 2.205 -6.657 1.994
## ROAD 0.402 0.187 0.311 0.171
summary(fit3)
##
## Call:
## lm(formula = FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.115 -29.596 4.543 31.075 148.863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 354.9862 149.7406 2.371 0.02210 *
## TAX -6.8258 2.0614 -3.311 0.00184 **
## DLIC 0.6256 0.1267 4.939 1.13e-05 ***
## INC -6.6575 1.9941 -3.339 0.00170 **
## ROAD 0.3113 0.1710 1.821 0.07527 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.58 on 45 degrees of freedom
## Multiple R-squared: 0.5718, Adjusted R-squared: 0.5338
## F-statistic: 15.03 on 4 and 45 DF, p-value: 7.127e-08
This has made us much more certain that TAX has an impact on FUEL usage. We can check whether removing any of the other points with a large Cook's distance has much effect, though arguably only Dist. of Col. and Hawaii can be justified as removals on the grounds of being unusual in some way.
FuelData3 <- FuelData[-c(2,9,12),] # remove Alaska, Dist of Col and Hawaii
fit4 <- lm(FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData3)
compareCoefs(fit, fit3, fit4, se = FALSE)
##
## Call:
## 1: lm(formula = FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData)
## 2: lm(formula = FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData2)
## 3: lm(formula = FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData3)
## Est. 1 Est. 2 Est. 3
## (Intercept) 383.9544 354.9862 604.7502
## TAX -4.1145 -6.8258 -8.3581
## DLIC 0.5353 0.6256 0.4021
## INC -7.1374 -6.6575 -6.2424
## ROAD 0.4016 0.3113 0.0387
summary(fit4)
##
## Call:
## lm(formula = FUEL ~ TAX + DLIC + INC + ROAD, data = FuelData3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -135.706 -29.059 0.283 26.639 133.670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 604.75021 144.65077 4.181 0.00014 ***
## TAX -8.35815 1.83258 -4.561 4.2e-05 ***
## DLIC 0.40207 0.12623 3.185 0.00269 **
## INC -6.24245 1.76758 -3.532 0.00100 ***
## ROAD 0.03872 0.16348 0.237 0.81388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.63 on 43 degrees of freedom
## Multiple R-squared: 0.5358, Adjusted R-squared: 0.4926
## F-statistic: 12.41 on 4 and 43 DF, p-value: 8.545e-07
Again, removing these two additional regions has strengthened the evidence against \(H_0\) for TAX, although note that ROAD is no longer significant. Finally, to be confident in our conclusion, we should check that there are no obvious violations of the modelling assumptions.
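A minimal sketch of such a check, using R's standard diagnostic plots for the final model:

```r
par(mfrow = c(2, 2))
plot(fit4)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
```

We would look for non-random patterns in the residuals, clear departures from normality in the Q-Q plot, and any remaining high-influence points.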