Confidence Intervals When Sigma Is Known.

0.1 Libraries

library(data.table)
library(tigerstats)

1 Finding Confidence Intervals Using NY City Tree Data

Import Data

So this is data from a 2015 tree census from NY City’s Open Data website. Link to tree data.

First we load it into a data.table.

TreeCensus<- fread("data/2015_Street_Tree_Census_-_Tree_Data.csv")
## 
Read 47.0% of 468341 rows
Read 468341 rows and 41 (of 41) columns from 0.130 GB file in 00:00:03
dim(TreeCensus)
## [1] 468341     41

So this data set is 41 columns by 468,341 observations.

And for this example we want to find the mean diameter at breast height (DBH) of the Pin Oak (Quercus palustris). We could simply tell R to tell us this information:

TreeCensus[spc_common == "Pin Oak", mean(tree_dbh)]
## [1] 15.59482
TreeCensus[spc_latin == "Quercus palustris", mean(tree_dbh)]
## [1] 15.59482


It’s a complete census so that’s pretty easy, but if we didn’t have that census data we could go out and sample a bunch of Pin Oaks and then draw some conclusions from that sample. The problem is that if we sampled 50 Pin Oaks we couldn’t be 100% certain that the mean of that sample represents the mean of the population and it likely wouldn’t be the same because of the problem of sampling error. So, what we’re looking for is a range of possible values and a degree of confidence that the actual population mean is between those values.

Standard confidence levels are 90%, 95% and 99%. For this example we’ll do 95%

First let’s sample 50 Pin Oaks from the census:

PinOakSample<- TreeCensus[spc_latin == "Quercus palustris", sample(tree_dbh, 50)]
PinOakSample
##  [1]  2 20  2 13  5 11 12 26  3 28 19 13 11 18 21  5 17  4 28  9  2 35 28
## [24] 15 10 26  6 22 11  9 24 17  5 31  7 27  5 14 20 30 21  5  6  2 17 34
## [47]  8  7  5  8


So we can get the mean of our sample.

mean(PinOakSample)
## [1] 14.48


This is called a point estimate.


Point Estimate

  • A point estimate of a population parameter is a single value used to estimate the population parameter. For example, the sample mean x is a point estimate of the population mean μ.

Now this sample will have a sampling error so we won’t get the exact mean of the population of all trees that we would find in the data from the tree census. But it gets us fairly close.

The question is, how close does it get us? There is no way of knowing for sure how close this estimate is to the population mean without the census data. But we can use an interval estimate to find a range of values that the population mean falls in.

Interval Estimate

  • An interval estimate is defined by two numbers, between which a population parameter is said to lie. For example, a < μ < b is an interval estimate for the population mean μ. It indicates that the population mean is greater than a but less than b.

The population mean must be within a range of normally distributed values where the standard deviation is the standard deviation of the popuation and the mean is the mean of the sample. Our confidence in this range is the confidence level.

Confidence Level

  • In survey sampling, different samples can be randomly selected from the same population; and each sample can often produce a different confidence interval. Some confidence intervals include the true population parameter; others do not.

  • A confidence level refers to the percentage of all possible samples that can be expected to include the true population parameter. For example, suppose all possible samples were selected from the same population, and a confidence interval were computed for each sample. A 95% confidence level implies that 95% of the confidence intervals would include the true population parameter.

Confidence levels are usually 90%, 95% or 99%.

So if we want to be 95% confident that the population mean is within a range then it must be somewhere within a normally distributed range that has the mean of the sample and standard deviation of the population.

qnormGC(c(.95), region = "between", mean=mean(PinOakSample),sd=(TreeCensus[spc_common == "Pin Oak", sd(tree_dbh)]/sqrt(length(PinOakSample))), graph = TRUE)

## [1] 11.76816 17.19184

The confidence interval is the range of values around the mean where, in this example, 95% of the mean values would lie.

1.1 The Formula

The formula for finding this range:

\[\overline{X} \pm z_{\alpha / 2} \left( \frac{\sigma}{\sqrt{n}} \right)\]

OR

\[\overline{X} - z_{\alpha / 2} \left( \frac{\sigma}{\sqrt{n}} \right) < \mu < \overline{X} + z_{\alpha / 2} \left( \frac{\sigma}{\sqrt{n}} \right)\]

Where:

  • \(\overline{X}\) is the sample mean
  • \(\alpha\) is the confidence level and represents the total of the areas in both tails of the normal distribution.
  • \(\sigma\) is the population standard deviation
  • \(n\) is the number of measures in the sample

Also

  • \(z_{\alpha / 2} \left( \frac{\sigma}{\sqrt{n}} \right)\) is called the maximum error of the estimate. It is represent by the variable E.

The maximum error of the estimate is one-half the width of the confidence interval.

1.1.1 Finding Z Value

So first thing we do is take alpha and find the z score. Because we’re finding the middle values of a normal distribution on both sides of the mean we have to divide alpha in two. Depending on which z table we use there is a different method for finding the value of z. But both methods give the same answer.

The Z-Score tells us the number of standard deviations from the mean on which a particular value falls. In this case we’re find the number of standard deviations from the mean that contains 95% percent of our distribution.

1.1.1.1 Method #1

Cumulative Standard Normal Table

The Cumulateive Standard Normal Table gives cumulative percentages from 0% to 100% and from \(-\infty \text{ to} +\infty\) with 50% equal to the mean or 0 in a standard normal distribution.

Now since we are looking for a range of values in the middle of a distribution, or the middle 95%, and the cumulative standard normal table gives us values starting from \(- \infty\) we have to convert the 95% to the equivalent value in a cumulative standard normal table. In other words we want the middle 95% plus everything that falls below the distribution all the way to %-%.

\[\alpha = 1-.95 = .05\]

\[\alpha / 2 = .05 / 2 = .025\]

\[1-.025 = .975\]

Then using the Cumulative Standard Normal Table we find that:

\[z_{\alpha/2} = 1.96\]

OR
alpha <- 1-.95
z <- 1-(alpha/2)
z
## [1] 0.975
alpha
## [1] 0.05
alpha/2
## [1] 0.025

qnorm(z, lower.tail=TRUE)
## [1] 1.959964
qnormGC(z, region="below", graph = TRUE)

## [1] 1.959964


Of course this also means that we could find the top end of the range as well, or the everything from \(\infty\) to the top of the 95% range and get the same answer.

qnorm(alpha/2, lower.tail = FALSE)
## [1] 1.959964
qnormGC(alpha/2, region="above", graph = TRUE)

## [1] 1.959964


1.1.1.2 Method #2

The Cumulative From the Mean Standard Normal Table gives cumulative values from the mean to a percentage. So we can do this…

\[\alpha/2 = .475\]

And from the Cumulative From the Mean Standard Normal Table we know that a probability of 0.475 gives a z value of 1.96.

Using the Cumulative From the Mean Standard Normal Table we find that 0.475 gives a z value of 1.96. This method seems to get to the heart of \(z_{\alpha / 2}\) as alpha should be the total percentage of values grouped around the mean that one wants to estimate, and we just want half of that percentage so we can get the range above and below the mean.

There isn’t really a method for giving R a decimal percent of 0.475 and getting 1.96 in return.


1.1.1.3 Method #3 Just Use R

How you do it in R:

qnorm(c(.025,.975))
## [1] -1.959964  1.959964


Three Methods

Note that these all give the same answer.

qnormGC(.95, region="between", graph = TRUE)

## [1] -1.959964  1.959964

qnormGC(.975, region="below", graph = TRUE)

## [1] 1.959964

qnormGC(.025, region = "above", graph=TRUE)

## [1] 1.959964


1.1.2 Finding The Maximum Error Of The Estimate

So now we’re calculating the formula \(z_{\alpha / 2} \left( \frac{\sigma}{\sqrt{n}} \right)\)

We’ve got a value for \(z_{\alpha / 2}\) which is 1.96, now we just need to plug in \(\sigma\) and n.

z_value <- qnorm(z, lower.tail = TRUE)
z_value
## [1] 1.959964
MaximumErrorOfEstimate <- z_value*(TreeCensus[spc_common == "Pin Oak", sd(tree_dbh)]/sqrt(length(PinOakSample)))

MaximumErrorOfEstimate
## [1] 2.711837

Then we just have to find our range above and below the mean, which is:

mean(PinOakSample) - MaximumErrorOfEstimate
## [1] 11.76816
mean(PinOakSample) + MaximumErrorOfEstimate
## [1] 17.19184


So the actual mean DBH of the census is approximately 15.6 which falls within the confidence interval.


2 Just Do The Operation in R


The easier way to do this is to just load all this information in qnorm including both the upper and lower bounds of the distribution.

qnorm(c(alpha/2,(1-alpha/2)), mean=mean(PinOakSample), sd=(TreeCensus[spc_common == "Pin Oak", sd(tree_dbh)]/sqrt(length(PinOakSample))), lower.tail=TRUE)
## [1] 11.76816 17.19184

qnormGC(.95, mean=mean(PinOakSample), sd=(TreeCensus[spc_common == "Pin Oak", sd(tree_dbh)]/sqrt(length(PinOakSample))), region = "between", graph = TRUE)

## [1] 11.76816 17.19184

3 Range of Different Confidence Intervals

Now different levels of certain will change the size of the range. If you want more certainty then the range will be larger, if less certainty is ok then the range will be smaller. Here are the 90% and 99% confidence intervals. Note that the 99% confidence interval has a wider range.

qnormGC(c(.90), region = "between", mean=mean(PinOakSample),sd=(TreeCensus[spc_common == "Pin Oak", sd(tree_dbh)]/sqrt(length(PinOakSample))), graph = TRUE)

## [1] 12.20415 16.75585
qnormGC(c(.99), region = "between", mean=mean(PinOakSample),sd=(TreeCensus[spc_common == "Pin Oak", sd(tree_dbh)]/sqrt(length(PinOakSample))), graph = TRUE)

## [1] 10.91604 18.04396

4 Other Stuff

As a fun exercise (fun?) we can create a bunch of sample means and show that a certain number of them are actually greater than the 95% confidence interval.

This first bit of code creates a list of 1000 sample means.

PinOakSampleMeans <- vector('numeric')
for( i in 1:1000){
    PinOakSampleMeans<- append(PinOakSampleMeans,mean(TreeCensus[spc_latin == "Quercus palustris", sample(tree_dbh, 50)]))
}

So we’re finding the percentage of times where the population mean falls within the 95% confidence interval of a sample mean.

In other words, if this function indicates that the population mean is within a range of values 95% of the time, then we should find that these sample means fall within the 95% confidence interval of the population mean 95% of the time.

#95% confidence interval for population mean

qnorm(c(.025,.975), mean=TreeCensus[spc_common=="Pin Oak", mean(tree_dbh)], sd = TreeCensus[spc_common == "Pin Oak", sd(tree_dbh)]/sqrt(length(PinOakSample)))
## [1] 12.88298 18.30666


#The percentage of values outside of the confidence interval.

length(PinOakSampleMeans[PinOakSampleMeans > (TreeCensus[spc_common=="Pin Oak", mean(tree_dbh)] + MaximumErrorOfEstimate) | PinOakSampleMeans < (TreeCensus[spc_common=="Pin Oak", mean(tree_dbh)] - MaximumErrorOfEstimate)]) / length(PinOakSampleMeans)
## [1] 0.053
hist(PinOakSampleMeans)

5 Sample Size

This brings us to another issue. What is the minimum sample size if we want an accurate sample mean. The correct sample size depends on three things:

  • The maximum error estimate (E)
  • The population standard deviation \(\sigma\)
  • the degree of confidence (90%, 95%, or 99%)

\[E=z_{\alpha / 2} \left( \frac{\sigma}{\sqrt{n}} \right)\]

and solving for n gives us:

\[n = \left(\frac{z_{\alpha / 2} \cdot \sigma}{E} \right)^{2}\]

So in our example of our Pin Oaks with a 95% confidence level the math would look like this:

\[n = \left(\frac{1.959964 \cdot 9.783641}{2.711837} \right)^{2}\] \[n = 50\]

If there’s a fraction, round up to the next highest whole number.

Here are the minimum sample sizes for the three different confidence levels, 90%, 95% and 99%:

z_values <- c(qnorm((1-((1-.90)/2)), lower.tail = TRUE), qnorm((1-((1-.95)/2)), lower.tail = TRUE), qnorm((1-((1-.99)/2)), lower.tail = TRUE))

((z_values * TreeCensus[spc_common == "Pin Oak", sd(tree_dbh)])/MaximumErrorOfEstimate)^2
## [1] 35.21505 50.00000 86.35907

So, as the level of certainty increases, the number of samples must also increase.