Linear Regression

0.1 Packages

library(data.table)
library(dplyr)
library(ggplot2)


1 Faithful


So we’ll start by looking at the Old Faithful dataset, faithful. The two variables we have are waiting, which is the amount of time between eruptions, compared to eruptions, which is the duration in minutes of each eruption.

head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

By plotting the data we can see that there does appear to be a linear relationship between the two variables.

p <- ggplot(faithful, aes(waiting, eruptions))
p + geom_point()


1.1 Find the correlation coefficient

After plotting the data we need to find r and to do that we use this formula for Pearson’s Correlation:

\[r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}\]

Where r can be between -1 and 1.

A score close to 1 or -1 means there is a strong correlation between the two variables, an r close to 0 means there is little or no correlation.

In this scenario we’ll have x represent waiting time between eruptions and y represent the duration of the eruption in minutes.

correlation <- data.table(waiting_x = faithful$waiting, eruptions_y = faithful$eruptions)


This set of operations will find the values for \(xy\), \(x^2\), and \(y^2\).

correlation[,xy := correlation[,waiting_x] * correlation[,eruptions_y]]    #Find x * y
correlation[,x_squared := correlation[,waiting_x]^2]                       #Find x ^ 2
correlation[,y_squared := correlation[,eruptions_y]^2]                     #Find y ^ 2
correlation[1:5]
##    waiting_x eruptions_y      xy x_squared y_squared
## 1:        79       3.600 284.400      6241 12.960000
## 2:        54       1.800  97.200      2916  3.240000
## 3:        74       3.333 246.642      5476 11.108889
## 4:        62       2.283 141.546      3844  5.212089
## 5:        85       4.533 385.305      7225 20.548089


And this set of operations will create a table that gives us the sums of all of those values, \(x, y, xy, x^2, \text{ and } y^2\).

colsToSum <- names(correlation)
correlationSums <- correlation[, lapply( .SD,sum), .SDcols = colsToSum]
correlationSums
##    waiting_x eruptions_y      xy x_squared y_squared
## 1:     19284     948.677 71046.4   1417266  3661.819


After that we just punch it into the formula:

\[r = \frac{272(71046.4) - (19284)(948.677)}{\sqrt{[272(1417266) - (19284)^2][272(3661.819) - (948.677)^2]}} \approx .901\]


2 Using cor() to find the correlation coefficient.

So obviously this is a lot easier to just punch the numbers into R.


cor(faithful$waiting, faithful$eruptions)
## [1] 0.9008112


3 Significance of Correlation Coefficient: test value and p-value

To see if this result has any significance we use the formula for the t test of the correlation coefficient.

3.1 First we state our hypothesis:


\(H_0: \rho = 0\)

\(H_1: \rho \neq 0\)


3.2 Then we find the critical values:


To find the critical value from the student’s t-test we use the qt() function like this:

qt(probability, degrees_of_freedom*, lower.tail = FALSE)

  • degrees of freedom = n-2
qt(.025, 270, lower.tail = FALSE)
## [1] 1.968789

So for the result to be significant we’re looking for a score with an absolute value that is greater than 1.969.


3.3 Students Test Value

To find the test value for this data set we use this formula:

\[t = r \sqrt{\frac{n-2}{1-r^2}}\] So for our faithful data:

\[t = 0.901 \sqrt{\frac{272-2}{1-0.901^2}} \approx 34.089\]


This is quite a bit greater than 1.969 suggesting that we should not accept the null hypothesis \(H_0 : \rho = 0\) and that there is a significant relationship between wait time and the duration of each eruption.


3.4 P Value

To find the p value we use pt()

pt(quantile, degrees_of_freedom, lower.tail=FALSE)

In this case the p-value is incredibly small.

pt(34.089, 270, lower.tail = FALSE)
## [1] 4.066021e-100

4 Use cor.test()


Using cor.test() we can find all of that data very quickly.

with(faithful, cor.test(waiting, eruptions, method = "pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  waiting and eruptions
## t = 34.089, df = 270, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8756964 0.9210652
## sample estimates:
##       cor 
## 0.9008112


5 Using lm()


We can use lm() to find the formula \(y' = a + bx\).

lm(eruptions ~ waiting, data = faithful)
## 
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
## 
## Coefficients:
## (Intercept)      waiting  
##    -1.87402      0.07563

This tells us that the y-intercept is -1.87402 and the value for the slope b is 0.07563.

We can get even more data from lm() by adding the summary() function.

Faithful.lm <- lm(eruptions ~ waiting, data = faithful)
FaithfulFit <- summary(Faithful.lm)
FaithfulFit
## 
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
## waiting      0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16


This data is stored as a list and we can access this data by calling the names for each item.

names(FaithfulFit)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"


To get the coefficients we can use either of these methods:

FaithfulFit[["coefficients"]]
##                Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -1.87401599 0.160143302 -11.70212  7.359171e-26
## waiting      0.07562795 0.002218541  34.08904 8.129959e-100
coefficients(FaithfulFit)
##                Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -1.87401599 0.160143302 -11.70212  7.359171e-26
## waiting      0.07562795 0.002218541  34.08904 8.129959e-100


And to get a and b values for the formula \(y' = a + bx\) we can do either of these functions:


FaithfulFit[["coefficients"]][,"Estimate"]
## (Intercept)     waiting 
## -1.87401599  0.07562795
coefficients(Faithful.lm)
## (Intercept)     waiting 
## -1.87401599  0.07562795


So let’s compare that to the plot with a regression line below.

p <- ggplot(faithful, aes(waiting, eruptions))
p + geom_point() + geom_smooth(method = "lm")


Using data from the lm() function we can put this into slope intercept form and by adding a variable, waitTime we can find roughly the duration of the eruption based on the number of minutes spent waiting.


coeffs = coefficients(Faithful.lm); coeffs 
## (Intercept)     waiting 
## -1.87401599  0.07562795

waitTime <- 75
eruptionLength <- coeffs[1] + coeffs[2]*waitTime
eruptionLength            
## (Intercept) 
##     3.79808


So with a 75 minute wait time, then Old Faithful should erupt for around 3.8 minutes.


6 Variation

6.1 Coefficient of Determination

So our formula for the regression line for the faithful data set looks like this:

\[y' = -1.87 + 0.076x\] y’ is the estimate for y for each value of x in the dataset.

So we can add the y’ value for each value x.

correlation[, y_prime := -1.87402 + 0.07563 * waiting_x]


6.2 Total Variation

The total variation is equal to:

\[\text{total variation} = (y-\bar{y})^2\]

correlation[, total_var := (eruptions_y - mean(eruptions_y))^2]


6.3 Explained Variation

Explained variation is equal to:

\[\text{explained variation} = (y' - \bar{y})^2\]

correlation[, explained_var := (y_prime - mean(eruptions_y))^2]


6.4 Unexplained Variation

The unexplained variation is equal to:

\[\text{unexplained variation} = (y - y')^2\]

correlation[, unexplained_var := (eruptions_y - y_prime)^2]
correlation[1:5, .(waiting_x, eruptions_y, y_prime, total_var, explained_var, unexplained_var) ]
##    waiting_x eruptions_y y_prime  total_var explained_var unexplained_var
## 1:        79       3.600 4.10075 0.01259264    0.37572843    0.2507505625
## 2:        54       1.800 2.21000 2.84861175    1.63272962    0.1681000000
## 3:        74       3.333 3.72260 0.02395780    0.05513898    0.1517881600
## 4:        62       2.283 2.81504 1.45150229    0.45258326    0.2830665616
## 5:        85       4.533 4.55453 1.09247839    1.13794897    0.0004635409


Faithful_total_var = sum(correlation[,total_var])
Faithful_explained_var = sum(correlation[,explained_var])
Faithful_unexplained_var = sum(correlation[,unexplained_var])
Faithful_total_var
## [1] 353.0394
Faithful_explained_var
## [1] 286.4932
Faithful_unexplained_var
## [1] 66.56178


What is worth noting is that the total variation is equal to the sum of the unexplained variation plus the unexplained variation:

\[\text{total variation = unexplained variation + explained variation}\]

Faithful_total_var
## [1] 353.0394
Faithful_explained_var + Faithful_unexplained_var
## [1] 353.0549


6.5 Coefficient of Determination


In the case of our faithful dataset, the y value eruptions, or the duration of eruptions in minutes, ranges between 1.6 and 5.1 minutes. We want to know what percentage of the variation in x, the waiting time, can be used to explain this variation in y, the duration of eruptions.


The Coefficient of Determination tells us the variation of the dependent variable that is directly related to the variation of the dependent variable. As \(r^2\) approaches 1, the variation of the dependent variable is more directly related to the variation of the independent variable. This indicates that there is a stronger correlation between the two.

This is the formula for the coefficient of determination:

\[r^2 = \frac{\text{explained variation}}{\text{total variation}}\]


Of course the coefficient of determination is easier found by taking the correlation coefficient and squaring it.

Remember that r for this dataset is equal to 0.9008112 and squaring it gives us 0.8114608.


Faithful_explained_var/Faithful_total_var
## [1] 0.8115048


This means that 81.2% of the variation in the dependent variable eruptions_y can be explained by the variations in the independent variable waiting_x.


6.6 The Coefficient of Nondetermination:

\[1.00 - r^2\]


The rest of the variation is unexplained and is called the coefficient of nondetermination. So around 18.8% of the variation in the duration of eruptions cannot be explained by the independent variable waiting.


7 Standard Error of the Estimate

The Standard Error of the Estimate is the standard deviation of the observed y values about the predicted y’ values. So this is the distance or error between the actual oberved values for y and the estimated values (y’) of y.

This is the formula:

\[s_{est} = \sqrt{\frac{\sum(y-y')^2}{n-2}}\]


\[s_{est} = \sqrt{\frac{66.56178}{272-2}} = 0.4965129516\]


Another formula for the standard error of the estimate:

\[s_{est} = \sqrt{\frac{\sum{y^2} - a \sum{y}-b \sum{xy}}{n-2}}\]

To get the standard error of the estimate we use the summary function on the results from lm()

FaithfulFit
## 
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
## waiting      0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
FaithfulFit$sigma
## [1] 0.4965129


8 Prediction Interval

We can use the standard error of the estimate to find a prediction interval around the y’ variable. This prediction interval will give us the range of values that an observed y value will fall within 95% of the time, or whatever percentage interval is used.

The y’ value that we found with the equation \(y' = -1.87 + 0.076x\) is a point estimate, we don’t know how accurate that estimate is, but we can create an interval around this estimate.

Formula for the Prediction Error around the y’ estimate

\[y' - t_{\alpha/2}s_{est} \sqrt{1+\frac{1}{n} + \frac{n(x-\bar{X})^2}{n\sum x^2-(\sum x)^2}} < y < y' + t_{\alpha/2}s_{est} \sqrt{1+\frac{1}{n} + \frac{n(x-\bar{X})^2}{n\sum x^2-(\sum x)^2}}\] * with degrees of freedeom = n-2


8.1 Using predict()


Using predict() we can find the prediction interval for a particular value of x.


So this gives us the interval within which 95% of the values of y will when faithful erupts at 80 minutes.

prediction = data.frame(waiting = 80)
predict(Faithful.lm, prediction, interval="predict", level = .95) 
##       fit      lwr      upr
## 1 4.17622 3.196089 5.156351


Another prediction interval but for 99% confidence.

predict(Faithful.lm, prediction, interval="predict", level = .99) 
##       fit      lwr      upr
## 1 4.17622 2.884757 5.467683


8.2 Plotting Prediction Values


If we wanted to add these prediction values to the linear regression plot of the faithful data we can do this:

prediction = data.frame(waiting = c(seq(43,96, by = 1)))
prediction <- cbind(prediction, predict(Faithful.lm, prediction, interval="predict", level = .95))
head(prediction)
##   waiting      fit       lwr      upr
## 1      43 1.377986 0.3911100 2.364862
## 2      44 1.453614 0.4672677 2.439960
## 3      45 1.529242 0.5434063 2.515077
## 4      46 1.604870 0.6195259 2.590213
## 5      47 1.680498 0.6956263 2.665369
## 6      48 1.756126 0.7717076 2.740543
p <- ggplot(faithful, aes(waiting, eruptions))
p + geom_point() + 
    geom_smooth(method="lm") + 
    geom_path(data = prediction, aes(waiting,lwr), color = "navy") + 
    geom_path(data = prediction, aes(waiting,upr), color = "navy")


9 Confidence Interval


The confidence interval is the range of values within which the linear regression line is most likely to fall. It shows the amount of uncertainty about the regression line.


9.1 confint()


Using confint() we can get the \(a + bx\) formulas for both the upper and lower confidence intervals.

confint(Faithful.lm, level = .95) 
##                   2.5 %      97.5 %
## (Intercept) -2.18930436 -1.55872761
## waiting      0.07126011  0.07999579


9.2 predict() for Confidence Intervals


To get the confidence intervals just change interval="confidence".

confidence = data.frame(waiting = c(seq(43,96, by = 1)))
confidence <- cbind(confidence, predict(Faithful.lm, confidence, interval="confidence", level = .95))
head(confidence)
##   waiting      fit      lwr      upr
## 1      43 1.377986 1.242485 1.513487
## 2      44 1.453614 1.322027 1.585201
## 3      45 1.529242 1.401539 1.656944
## 4      46 1.604870 1.481019 1.728720
## 5      47 1.680498 1.560464 1.800531
## 6      48 1.756126 1.639870 1.872381
p <- ggplot(faithful, aes(waiting, eruptions))
p + geom_point() + 
    geom_smooth(method="lm") + 
    geom_path(data = prediction, aes(waiting,lwr), color = "navy") + 
    geom_path(data = prediction, aes(waiting,upr), color = "navy") + 
    geom_path(data = confidence, aes(waiting,lwr), color = "red") + 
    geom_path(data = confidence, aes(waiting,upr), color = "red")