Confidence Intervals Part 3: Confidence Intervals and Sample Size for Proportions

0.1 Libraries

library(data.table)
library(ggplot2)

1 Confidence Interval From a Proportion

Oh jeez, another confidence interval.

1.1 Formula

The formula for maximum error of the estimate is: \[E=z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]

The formula for the confidence interval for a proportion is: \[\hat{p} - z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]

Where:

  • n is the sample size
  • \(\hat{p}\) is the population proportion
  • \(\hat{q} = 1-\hat{p}\)
  • z is

Also:

  • when \(n\hat{p}\) and \(n\hat{q}\) are each greater than 5.
  • round off to 3 decimal places.

2 Example Tree Data again.

So we’re going to make a 95% confidence interval from a sampled proportion of this data set.

This is data from a 2015 tree census from NY City’s Open Data website. Link to tree data.


First we load it into a data.table.

TreeCensus<- fread("data/2015_Street_Tree_Census_-_Tree_Data.csv")
## 
Read 87.5% of 468341 rows
Read 468341 rows and 41 (of 41) columns from 0.130 GB file in 00:00:03


Now we’ll take 500 samples from the data, tally each species and load the resulting count of species into a data.table.

TreeSample <- TreeCensus[,sample(spc_common, 500)]
TreeSample <- table(TreeSample)
TreeSample <- data.table(TreeSample)
head(TreeSample)
##           TreeSample  N
## 1:                   16
## 2:      American Elm  8
## 3: American Hornbeam  1
## 4:   American Linden  4
## 5:        Amur Maple  2
## 6:      Bald Cypress  2


Now we can find the proportion of a species, in this case the London Planetree.

pHat <- TreeSample[TreeSample == "London Planetree",N]/sum(TreeSample$N)
qHat <- 1-pHat
pHat
## [1] 0.098
qHat
## [1] 0.902


Now using the formula we can find the 95% confidence interval for the population proportion.

\[\hat{p} - Z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + Z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]

c(
    pHat - qnorm(.975)*sqrt((pHat*qHat)/500),
    pHat + qnorm(.975)*sqrt((pHat*qHat)/500)
)
## [1] 0.0719397 0.1240603


We can verify that this proportion falls around the actual population proportion

TreeCount <- TreeCensus[,.N, by = spc_common]
TreeCount[spc_common=="London Planetree", N]/nrow(TreeCensus)
## [1] 0.1107953

2.1 Two More Methods

From R-Bloggers

simpasym <- function(n, p, z=1.96, cc=TRUE){
  out <- list()
  if(cc){
    out$lb <- p - z*sqrt((p*(1-p))/n) - 0.5/n
    out$ub <- p + z*sqrt((p*(1-p))/n) + 0.5/n
  } else {
    out$lb <- p - z*sqrt((p*(1-p))/n)
    out$ub <- p + z*sqrt((p*(1-p))/n)
  }
  out
}

simpasym(n=500, p=pHat, z=qnorm(.975), cc=FALSE)
## $lb
## [1] 0.0719397
## 
## $ub
## [1] 0.1240603

and

scoreint <- function(n, p, z=1.96, cc=TRUE){
  out <- list()
  q <- 1-p
  zsq <- z^2
  denom <- (2*(n+zsq))
  if(cc){ 
    numl <- (2*n*p)+zsq-1-(z*sqrt(zsq-2-(1/n)+4*p*((n*q)+1)))
    numu <- (2*n*p)+zsq+1+(z*sqrt(zsq+2-(1/n)+4*p*((n*q)-1)))
    out$lb <- numl/denom
    out$ub <- numu/denom
    if(p==1) out$ub <- 1
    if(p==0) out$lb <- 0
  } else {
    out$lb <- ((2*n*p)+zsq-(z*sqrt(zsq+(4*n*p*q))))/denom
    out$ub <- ((2*n*p)+zsq+(z*sqrt(zsq+(4*n*p*q))))/denom
  }
  out
}

scoreint(n=500, p=pHat, z=qnorm(.975), cc=FALSE)
## $lb
## [1] 0.07492392
## 
## $ub
## [1] 0.1272061

3 Sample Size For Proportions

So we just have to take this formula and solve for n.

\[E=z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]

Which looks like this:

\[n=\hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2\]

What Value of \(\hat{p}\) should be used?

For a new study we won’t have \(\hat{p}\) but a previous study on the same subject may give \(\hat{p}\) and that can be used to find n. However if there isn’t a previous study to borrow a value of \(\hat{p}\) from, then just use 0.5, this will give a sufficiently large sample size and will yield an accurate prediction.