Confidence Intervals Part 2: Using T-Distributions When Sigma Is Unknown

0.1 Libraries

library(data.table)
library(ggplot2)
library(tigerstats)

This is nearly identical to finding a confidence interval when sigma is known, except that we can't plug in the population sigma because we don't know it.

Instead of the population sigma we use the sample standard deviation, and instead of a normal distribution we use a t distribution. The t distribution looks very similar to the normal distribution but has heavier tails.


1 T-Distribution

With a t distribution, also called Student's t distribution, we're calculating probabilities when the population standard deviation is unknown and must be estimated from the sample. Because of that extra uncertainty, the t distribution puts more probability in its tails than the normal distribution does.

2 Degrees of Freedom

Simply put, the degrees of freedom here are n-1, the sample size minus 1. The formula for degrees of freedom differs for other distributions, but for the t distribution it's n-1.

  • Degrees of freedom are the number of values that are free to vary once a sample statistic has been calculated.

So what the heck does that mean?

So if you have a vector of 10 numbers with a mean of 30, the first 9 numbers can be anything, but the 10th number must bring the sum of the vector to \(10 \cdot 30 = 300\).

Example

             Value1  Value2  Value3  Value4  Value5  Value6  Value7  Value8  Value9  Value10  Mean
    Vector 1      1      .5       8      83      16      45      90      35       8        ?    30
    Vector 2      2      91      45       3      86      41      32      21     102        ?    50
  • Vector 1

So with Vector 1, values 1 through 9 can be anything, but once those values are in place the 10th value must equal a specific number to give us a mean of 30. In this example the 10th value must be 13.5.

  • Vector 2

Likewise with Vector 2, values 1 through 9 can also be anything, but once those values are in place the 10th value in this example must be 77.

So in both of these examples there are 9 degrees of freedom.
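We can check Vector 1's forced tenth value directly: with a fixed mean and sample size, the last value is whatever makes the total come out right.

```r
# With a fixed mean of 30 and n = 10, only 9 values are free to vary;
# the 10th is forced to whatever makes the total equal n * mean.
FreeValues <- c(1, .5, 8, 83, 16, 45, 90, 35, 8)  # the 9 free values from Vector 1
TenthValue <- 10 * 30 - sum(FreeValues)
TenthValue
## [1] 13.5
mean(c(FreeValues, TenthValue))
## [1] 30
```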

Also:

  • As the sample size n increases, the degrees of freedom also increase and the distribution approaches the normal distribution.
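One quick way to see this convergence is to watch the 97.5% quantile shrink toward the normal distribution's 1.96 as the degrees of freedom grow:

```r
# qt(.975, df) approaches qnorm(.975) as the degrees of freedom increase
sapply(c(2, 5, 20, 50), function(df) qt(.975, df = df))
## [1] 4.302653 2.570582 2.085963 2.008559
qnorm(.975)
## [1] 1.959964
```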

3 Plotting T and Normal Distributions


This will create a data.table that contains a normal distribution and four t distributions with 2, 5, 20 and 50 degrees of freedom.

Distributions <- rbindlist(l = list(
    data.table(Distribution = "TDist DF = 2",  quantile = seq(-10, 10, by = .1), Density = dt(seq(-10, 10, by = .1), df = 2)),
    data.table(Distribution = "TDist DF = 5",  quantile = seq(-10, 10, by = .1), Density = dt(seq(-10, 10, by = .1), df = 5)),
    data.table(Distribution = "TDist DF = 20", quantile = seq(-10, 10, by = .1), Density = dt(seq(-10, 10, by = .1), df = 20)),
    data.table(Distribution = "TDist DF = 50", quantile = seq(-10, 10, by = .1), Density = dt(seq(-10, 10, by = .1), df = 50)),
    # the Normal rows must use the same column name, quantile, so rbindlist can bind by name
    data.table(Distribution = "Normal",        quantile = seq(-10, 10, by = .1), Density = dnorm(seq(-10, 10, by = .1)))
    ))


Plotting the distributions together will show that the t distribution approaches the normal distribution as the degrees of freedom increase.

ggplot(Distributions, aes(quantile, Density, colour = Distribution)) +
  geom_line()


And we can compare the different distributions at certain quantiles to see how they differ.

Distributions[quantile==-3]
##     Distribution quantile     Density
## 1:  TDist DF = 2       -3 0.027410122
## 2:  TDist DF = 5       -3 0.017292579
## 3: TDist DF = 20       -3 0.007963787
## 4: TDist DF = 50       -3 0.005831061
## 5:        Normal       -3 0.004431848
Distributions[quantile==-2]
##     Distribution quantile    Density
## 1:  TDist DF = 2       -2 0.06804138
## 2:  TDist DF = 5       -2 0.06509031
## 3: TDist DF = 20       -2 0.05808722
## 4: TDist DF = 50       -2 0.05577415
## 5:        Normal       -2 0.05399097
Distributions[quantile==-1]
##     Distribution quantile   Density
## 1:  TDist DF = 2       -1 0.1924501
## 2:  TDist DF = 5       -1 0.2196798
## 3: TDist DF = 20       -1 0.2360456
## 4: TDist DF = 50       -1 0.2395711
## 5:        Normal       -1 0.2419707
Distributions[quantile==0]
##     Distribution quantile   Density
## 1:  TDist DF = 2        0 0.3535534
## 2:  TDist DF = 5        0 0.3796067
## 3: TDist DF = 20        0 0.3939886
## 4: TDist DF = 50        0 0.3969527
## 5:        Normal        0 0.3989423

4 The Formula

The formula is nearly identical to the formula for a confidence interval using a normal distribution.

\[\overline{X} \pm t_{\alpha / 2} \left( \frac{s}{\sqrt{n}} \right)\]

OR

\[\overline{X} - t_{\alpha / 2} \left( \frac{s}{\sqrt{n}} \right) < \mu < \overline{X} + t_{\alpha / 2} \left( \frac{s}{\sqrt{n}} \right)\]

Where:

  • \(\overline{X}\) is the sample mean
  • \(\alpha\) is 1 minus the confidence level and represents the total area in both tails of the t distribution.
  • \(s\) is the sample standard deviation
  • \(n\) is the number of measures in the sample

Also

  • the degrees of freedom are n-1

  • \(t_{\alpha / 2} \left( \frac{s}{\sqrt{n}} \right)\) is called the maximum error of the estimate. It is represent by the variable E.

5 Example NYC Tree Data

5.1 Import Data

So again I’ll use data from the 2015 tree census from New York City’s Open Data website. Link to tree data.

First we load it into a data.table.

TreeCensus<- fread("data/2015_Street_Tree_Census_-_Tree_Data.csv")
## Read 468341 rows and 41 (of 41) columns from 0.130 GB file in 00:00:03


So we’ll pretend this isn’t census data, select a sample of 20 American Lindens, and draw some conclusions from the sample.

AmericanLindenSample<- TreeCensus[spc_common == "American Linden", sample(tree_dbh, 20)]

So, to start we must find \(t_{\alpha / 2}\)

If we want a 95% confidence interval from this sample we go through our usual gymnastics of converting a middle 95% into a cumulative probability: the middle 95% plus the 2.5% below it, or 97.5%. So a 95% confidence interval runs from the 2.5% quantile to the 97.5% quantile.

qt(.975, df=19)
## [1] 2.093024
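Because the t distribution is symmetric, the 2.5% and 97.5% quantiles are mirror images, so either tail gives the same \(t_{\alpha / 2}\):

```r
# the lower-tail and upper-tail quantiles only differ in sign
qt(.975, df = 19)
## [1] 2.093024
qt(.025, df = 19)
## [1] -2.093024
qt(.025, df = 19, lower.tail = FALSE)
## [1] 2.093024
```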

5.2 90% Confidence Interval

So this is the 90% confidence interval for a t distributed sample

MaximumErrorOfEstimate <- 
    qt(.95, df=19) * (sd(AmericanLindenSample)/sqrt(length(AmericanLindenSample)))

MaximumErrorOfEstimate
## [1] 2.670745
c(mean(AmericanLindenSample) - MaximumErrorOfEstimate,
mean(AmericanLindenSample) + MaximumErrorOfEstimate)
## [1]  7.479255 12.820745

5.3 95% Confidence Interval

MaximumErrorOfEstimate <- 
    qt(.975, df=19) * (sd(AmericanLindenSample)/sqrt(length(AmericanLindenSample)))

MaximumErrorOfEstimate
## [1] 3.232796
c(mean(AmericanLindenSample) - MaximumErrorOfEstimate,
mean(AmericanLindenSample) + MaximumErrorOfEstimate)
## [1]  6.917204 13.382796

5.4 99% Confidence Interval

MaximumErrorOfEstimate <- 
    qt(.995, df=19) * (sd(AmericanLindenSample)/sqrt(length(AmericanLindenSample)))

MaximumErrorOfEstimate
## [1] 4.418878
c(mean(AmericanLindenSample) - MaximumErrorOfEstimate,
mean(AmericanLindenSample) + MaximumErrorOfEstimate)
## [1]  5.731122 14.568878
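The three intervals above differ only in the confidence level, so they can be computed in one pass with sapply. This is a sketch continuing the example above; the exact numbers depend on the random sample drawn earlier.

```r
# one column per confidence level; wider intervals for higher confidence
sapply(c(.90, .95, .99), function(ConfLevel) {
  E <- qt(1 - (1 - ConfLevel) / 2, df = 19) *
       sd(AmericanLindenSample) / sqrt(length(AmericanLindenSample))
  c(ConfLevel = ConfLevel,
    Lower = mean(AmericanLindenSample) - E,
    Upper = mean(AmericanLindenSample) + E)
})
```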

And then compare with the actual population mean.

TreeCensus[spc_common == "American Linden", mean(tree_dbh)]
## [1] 8.545474

6 T Distribution Functions: dt, pt, qt, rt.

  • dt(x, df, ncp, log = FALSE)
  • pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)
  • qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)
  • rt(n, df, ncp)

6.1 dt

Give it a quantile and it returns the density of the t distribution at that quantile.

dt(c(-3,-2,-1,0,1,2,3), df=5)
## [1] 0.01729258 0.06509031 0.21967980 0.37960669 0.21967980 0.06509031
## [7] 0.01729258

6.2 qt

Give it a cumulative probability (the percent of values at or below a point) and it returns the corresponding quantile.

qt(.5, df=5)
## [1] 0
qt(.75, df=5)
## [1] 0.7266868
qt(.25, df=5, lower.tail = FALSE)
## [1] 0.7266868

6.3 pt

Give it a quantile and it returns the probability that a t distributed value falls at or below that quantile (or above it, with lower.tail = FALSE).

pt(0, df=5)
## [1] 0.5
pt(0.7266868, df=5)
## [1] 0.75
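As the outputs above suggest, qt and pt are inverses of each other: pt maps a quantile to a cumulative probability, and qt maps that probability back to the quantile.

```r
# composing the two functions gets you back where you started
pt(qt(.75, df = 5), df = 5)
## [1] 0.75
qt(pt(1.2, df = 5), df = 5)
## [1] 1.2
```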

6.4 rt

rt generates a vector of random values drawn from a t distribution.

rt(15,df=5)
##  [1] -0.1852286  0.4342204  0.2491056 -0.8443610  0.5887939 -0.3740037
##  [7] -0.4757799 -0.5448651  1.6344138  0.4776623 -0.2585830 -1.1545117
## [13]  0.5134461 -2.3323556  2.2158237

densityplot(rt(1599,df=5))
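Since ggplot2 is already loaded, we can make the same check by overlaying the theoretical density from dt on a histogram of the random draws. This is a sketch; the sample itself is random, so each run looks slightly different.

```r
# histogram of random t draws with the theoretical df = 5 density on top
RandomT <- data.table(x = rt(1599, df = 5))
ggplot(RandomT, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40) +
  stat_function(fun = dt, args = list(df = 5), colour = "red")
```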