Linear Regression
0.1 Packages
library(data.table)
library(dplyr)
library(ggplot2)
1 Faithful
So we’ll start by looking at the Old Faithful dataset, faithful. The two variables we have are waiting, which is the amount of time between eruptions, compared to eruptions, which is the duration in minutes of each eruption.
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
By plotting the data we can see that there does appear to be a linear relationship between the two variables.
p <- ggplot(faithful, aes(waiting, eruptions))
p + geom_point()
1.1 Find the correlation coefficient
After plotting the data we need to find r and to do that we use this formula for Pearson’s Correlation:
\[r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}\]
Where r can be between -1 and 1.
A score close to 1 or -1 means there is a strong correlation between the two variables, an r close to 0 means there is little or no correlation.
In this scenario we’ll have x represent waiting time between eruptions and y represent the duration of the eruption in minutes.
correlation <- data.table(waiting_x = faithful$waiting, eruptions_y = faithful$eruptions)
This set of operations will find the values for \(xy\), \(x^2\), and \(y^2\).
correlation[,xy := correlation[,waiting_x] * correlation[,eruptions_y]] #Find x * y
correlation[,x_squared := correlation[,waiting_x]^2] #Find x ^ 2
correlation[,y_squared := correlation[,eruptions_y]^2] #Find y ^ 2
correlation[1:5]
## waiting_x eruptions_y xy x_squared y_squared
## 1: 79 3.600 284.400 6241 12.960000
## 2: 54 1.800 97.200 2916 3.240000
## 3: 74 3.333 246.642 5476 11.108889
## 4: 62 2.283 141.546 3844 5.212089
## 5: 85 4.533 385.305 7225 20.548089
And this set of operations will create a table that gives us the sums of all of those values, \(x, y, xy, x^2, \text{ and } y^2\).
colsToSum <- names(correlation)
correlationSums <- correlation[, lapply( .SD,sum), .SDcols = colsToSum]
correlationSums
## waiting_x eruptions_y xy x_squared y_squared
## 1: 19284 948.677 71046.4 1417266 3661.819
After that we just punch it into the formula:
\[r = \frac{272(71046.4) - (19284)(948.677)}{\sqrt{[272(1417266) - (19284)^2][272(3661.819) - (948.677)^2]}} \approx .901\]
2 Using cor() to find the correlation coefficient.
So obviously this is a lot easier to just punch the numbers into R.
cor(faithful$waiting, faithful$eruptions)
## [1] 0.9008112
3 Significance of Correlation Coefficient: test value and p-value
To see if this result has any significance we use the formula for the t test of the correlation coefficient.
3.1 First we state our hypothesis:
\(H_0: \rho = 0\)
\(H_1: \rho \neq 0\)
3.2 Then we find the critical values:
To find the critical value from the student’s t-test we use the qt() function like this:
qt(probability, degrees_of_freedom*, lower.tail = FALSE)
- degrees of freedom = n-2
qt(.025, 270, lower.tail = FALSE)
## [1] 1.968789
So for the result to be significant we’re looking for a score with an absolute value that is greater than 1.969.
3.3 Students Test Value
To find the test value for this data set we use this formula:
\[t = r \sqrt{\frac{n-2}{1-r^2}}\] So for our faithful data:
\[t = 0.901 \sqrt{\frac{272-2}{1-0.901^2}} \approx 34.089\]
This is quite a bit greater than 1.969 suggesting that we should not accept the null hypothesis \(H_0 : \rho = 0\) and that there is a significant relationship between wait time and the duration of each eruption.
3.4 P Value
To find the p value we use pt()
pt(quantile, degrees_of_freedom, lower.tail=FALSE)
In this case the p-value is incredibly small.
pt(34.089, 270, lower.tail = FALSE)
## [1] 4.066021e-100
4 Use cor.test()
Using cor.test() we can find all of that data very quickly.
with(faithful, cor.test(waiting, eruptions, method = "pearson"))
##
## Pearson's product-moment correlation
##
## data: waiting and eruptions
## t = 34.089, df = 270, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8756964 0.9210652
## sample estimates:
## cor
## 0.9008112
5 Using lm()
We can use lm()
to find the formula \(y' = a + bx\).
lm(eruptions ~ waiting, data = faithful)
##
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
##
## Coefficients:
## (Intercept) waiting
## -1.87402 0.07563
This tells us that the y-intercept is -1.87402 and the value for the slope b
is 0.07563.
We can get even more data from lm()
by adding the summary()
function.
Faithful.lm <- lm(eruptions ~ waiting, data = faithful)
FaithfulFit <- summary(Faithful.lm)
FaithfulFit
##
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29917 -0.37689 0.03508 0.34909 1.19329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
## waiting 0.075628 0.002219 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
This data is stored as a list and we can access this data by calling the names for each item.
names(FaithfulFit)
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
To get the coefficients we can use either of these methods:
FaithfulFit[["coefficients"]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.87401599 0.160143302 -11.70212 7.359171e-26
## waiting 0.07562795 0.002218541 34.08904 8.129959e-100
coefficients(FaithfulFit)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.87401599 0.160143302 -11.70212 7.359171e-26
## waiting 0.07562795 0.002218541 34.08904 8.129959e-100
And to get a and b values for the formula \(y' = a + bx\) we can do either of these functions:
FaithfulFit[["coefficients"]][,"Estimate"]
## (Intercept) waiting
## -1.87401599 0.07562795
coefficients(Faithful.lm)
## (Intercept) waiting
## -1.87401599 0.07562795
So let’s compare that to the plot with a regression line below.
p <- ggplot(faithful, aes(waiting, eruptions))
p + geom_point() + geom_smooth(method = "lm")
Using data from the lm()
function we can put this into slope intercept form and by adding a variable, waitTime
we can find roughly the duration of the eruption based on the number of minutes spent waiting.
coeffs = coefficients(Faithful.lm); coeffs
## (Intercept) waiting
## -1.87401599 0.07562795
waitTime <- 75
eruptionLength <- coeffs[1] + coeffs[2]*waitTime
eruptionLength
## (Intercept)
## 3.79808
So with a 75 minute wait time, then Old Faithful should erupt for around 3.8 minutes.
6 Variation
6.1 Coefficient of Determination
So our formula for the regression line for the faithful
data set looks like this:
\[y' = -1.87 + 0.076x\] y’ is the estimate for y for each value of x in the dataset.
So we can add the y’ value for each value x.
correlation[, y_prime := -1.87402 + 0.07563 * waiting_x]
6.2 Total Variation
The total variation is equal to:
\[\text{total variation} = (y-\bar{y})^2\]
correlation[, total_var := (eruptions_y - mean(eruptions_y))^2]
6.3 Explained Variation
Explained variation is equal to:
\[\text{explained variation} = (y' - \bar{y})^2\]
correlation[, explained_var := (y_prime - mean(eruptions_y))^2]
6.4 Unexplained Variation
The unexplained variation is equal to:
\[\text{unexplained variation} = (y - y')^2\]
correlation[, unexplained_var := (eruptions_y - y_prime)^2]
correlation[1:5, .(waiting_x, eruptions_y, y_prime, total_var, explained_var, unexplained_var) ]
## waiting_x eruptions_y y_prime total_var explained_var unexplained_var
## 1: 79 3.600 4.10075 0.01259264 0.37572843 0.2507505625
## 2: 54 1.800 2.21000 2.84861175 1.63272962 0.1681000000
## 3: 74 3.333 3.72260 0.02395780 0.05513898 0.1517881600
## 4: 62 2.283 2.81504 1.45150229 0.45258326 0.2830665616
## 5: 85 4.533 4.55453 1.09247839 1.13794897 0.0004635409
Faithful_total_var = sum(correlation[,total_var])
Faithful_explained_var = sum(correlation[,explained_var])
Faithful_unexplained_var = sum(correlation[,unexplained_var])
Faithful_total_var
## [1] 353.0394
Faithful_explained_var
## [1] 286.4932
Faithful_unexplained_var
## [1] 66.56178
What is worth noting is that the total variation is equal to the sum of the unexplained variation plus the unexplained variation:
\[\text{total variation = unexplained variation + explained variation}\]
Faithful_total_var
## [1] 353.0394
Faithful_explained_var + Faithful_unexplained_var
## [1] 353.0549
6.5 Coefficient of Determination
In the case of our faithful dataset, the y value eruptions
, or the duration of eruptions in minutes, ranges between 1.6 and 5.1 minutes. We want to know what percentage of the variation in x, the waiting time, can be used to explain this variation in y, the duration of eruptions.
The Coefficient of Determination tells us the variation of the dependent variable that is directly related to the variation of the dependent variable. As \(r^2\) approaches 1, the variation of the dependent variable is more directly related to the variation of the independent variable. This indicates that there is a stronger correlation between the two.
This is the formula for the coefficient of determination:
\[r^2 = \frac{\text{explained variation}}{\text{total variation}}\]
Of course the coefficient of determination is easier found by taking the correlation coefficient and squaring it.
Remember that r for this dataset is equal to 0.9008112 and squaring it gives us 0.8114608.
Faithful_explained_var/Faithful_total_var
## [1] 0.8115048
This means that 81.2% of the variation in the dependent variable eruptions_y
can be explained by the variations in the independent variable waiting_x
.
6.6 The Coefficient of Nondetermination:
\[1.00 - r^2\]
The rest of the variation is unexplained and is called the coefficient of nondetermination. So around 18.8% of the variation in the duration of eruptions cannot be explained by the independent variable waiting.
7 Standard Error of the Estimate
The Standard Error of the Estimate is the standard deviation of the observed y values about the predicted y’ values. So this is the distance or error between the actual oberved values for y and the estimated values (y’) of y.
This is the formula:
\[s_{est} = \sqrt{\frac{\sum(y-y')^2}{n-2}}\]
\[s_{est} = \sqrt{\frac{66.56178}{272-2}} = 0.4965129516\]
Another formula for the standard error of the estimate:
\[s_{est} = \sqrt{\frac{\sum{y^2} - a \sum{y}-b \sum{xy}}{n-2}}\]
To get the standard error of the estimate we use the summary function on the results from lm()
FaithfulFit
##
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29917 -0.37689 0.03508 0.34909 1.19329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
## waiting 0.075628 0.002219 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
FaithfulFit$sigma
## [1] 0.4965129
8 Prediction Interval
We can use the standard error of the estimate to find a prediction interval around the y’ variable. This prediction interval will give us the range of values that an observed y value will fall within 95% of the time, or whatever percentage interval is used.
The y’ value that we found with the equation \(y' = -1.87 + 0.076x\) is a point estimate, we don’t know how accurate that estimate is, but we can create an interval around this estimate.
Formula for the Prediction Error around the y’ estimate
\[y' - t_{\alpha/2}s_{est} \sqrt{1+\frac{1}{n} + \frac{n(x-\bar{X})^2}{n\sum x^2-(\sum x)^2}} < y < y' + t_{\alpha/2}s_{est} \sqrt{1+\frac{1}{n} + \frac{n(x-\bar{X})^2}{n\sum x^2-(\sum x)^2}}\] * with degrees of freedeom = n-2
8.1 Using predict()
Using predict()
we can find the prediction interval for a particular value of x.
So this gives us the interval within which 95% of the values of y will when faithful erupts at 80 minutes.
prediction = data.frame(waiting = 80)
predict(Faithful.lm, prediction, interval="predict", level = .95)
## fit lwr upr
## 1 4.17622 3.196089 5.156351
Another prediction interval but for 99% confidence.
predict(Faithful.lm, prediction, interval="predict", level = .99)
## fit lwr upr
## 1 4.17622 2.884757 5.467683
8.2 Plotting Prediction Values
If we wanted to add these prediction values to the linear regression plot of the faithful
data we can do this:
prediction = data.frame(waiting = c(seq(43,96, by = 1)))
prediction <- cbind(prediction, predict(Faithful.lm, prediction, interval="predict", level = .95))
head(prediction)
## waiting fit lwr upr
## 1 43 1.377986 0.3911100 2.364862
## 2 44 1.453614 0.4672677 2.439960
## 3 45 1.529242 0.5434063 2.515077
## 4 46 1.604870 0.6195259 2.590213
## 5 47 1.680498 0.6956263 2.665369
## 6 48 1.756126 0.7717076 2.740543
p <- ggplot(faithful, aes(waiting, eruptions))
p + geom_point() +
geom_smooth(method="lm") +
geom_path(data = prediction, aes(waiting,lwr), color = "navy") +
geom_path(data = prediction, aes(waiting,upr), color = "navy")
9 Confidence Interval
The confidence interval is the range of values within which the linear regression line is most likely to fall. It shows the amount of uncertainty about the regression line.
9.1 confint()
Using confint()
we can get the \(a + bx\) formulas for both the upper and lower confidence intervals.
confint(Faithful.lm, level = .95)
## 2.5 % 97.5 %
## (Intercept) -2.18930436 -1.55872761
## waiting 0.07126011 0.07999579
9.2 predict() for Confidence Intervals
To get the confidence intervals just change interval="confidence"
.
confidence = data.frame(waiting = c(seq(43,96, by = 1)))
confidence <- cbind(confidence, predict(Faithful.lm, confidence, interval="confidence", level = .95))
head(confidence)
## waiting fit lwr upr
## 1 43 1.377986 1.242485 1.513487
## 2 44 1.453614 1.322027 1.585201
## 3 45 1.529242 1.401539 1.656944
## 4 46 1.604870 1.481019 1.728720
## 5 47 1.680498 1.560464 1.800531
## 6 48 1.756126 1.639870 1.872381
p <- ggplot(faithful, aes(waiting, eruptions))
p + geom_point() +
geom_smooth(method="lm") +
geom_path(data = prediction, aes(waiting,lwr), color = "navy") +
geom_path(data = prediction, aes(waiting,upr), color = "navy") +
geom_path(data = confidence, aes(waiting,lwr), color = "red") +
geom_path(data = confidence, aes(waiting,upr), color = "red")
10 References
R-Tutor: Estimated Simple Regression Equation
Cross Validated Stack Exchange: Interpretation of R’s lm() output