Confidence Intervals Part 3: Confidence Intervals and Sample Size for Proportions
0.1 Libraries
library(data.table)
library(ggplot2)1 Confidence Interval From a Proportion
Oh jeez, another confidence interval.
1.1 Formula
The formula for maximum error of the estimate is: \[E=z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]
The formula for the confidence interval for a proportion is: \[\hat{p} - z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]
Where:
- n is the sample size
- \(\hat{p}\) is the population proportion
- \(\hat{q} = 1-\hat{p}\)
- z is
Also:
- when \(n\hat{p}\) and \(n\hat{q}\) are each greater than 5.
- round off to 3 decimal places.
2 Example Tree Data again.
So we’re going to make a 95% confidence interval from a sampled proportion of this data set.
This is data from a 2015 tree census from NY City’s Open Data website. Link to tree data.
First we load it into a data.table.
TreeCensus<- fread("data/2015_Street_Tree_Census_-_Tree_Data.csv")## 
Read 87.5% of 468341 rows
Read 468341 rows and 41 (of 41) columns from 0.130 GB file in 00:00:03Now we’ll take 500 samples from the data, tally each species and load the resulting count of species into a data.table.
TreeSample <- TreeCensus[,sample(spc_common, 500)]
TreeSample <- table(TreeSample)
TreeSample <- data.table(TreeSample)
head(TreeSample)##           TreeSample  N
## 1:                   16
## 2:      American Elm  8
## 3: American Hornbeam  1
## 4:   American Linden  4
## 5:        Amur Maple  2
## 6:      Bald Cypress  2Now we can find the proportion of a species, in this case the London Planetree.
pHat <- TreeSample[TreeSample == "London Planetree",N]/sum(TreeSample$N)
qHat <- 1-pHat
pHat## [1] 0.098qHat## [1] 0.902Now using the formula we can find the 95% confidence interval for the population proportion.
\[\hat{p} - Z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + Z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]
c(
    pHat - qnorm(.975)*sqrt((pHat*qHat)/500),
    pHat + qnorm(.975)*sqrt((pHat*qHat)/500)
)## [1] 0.0719397 0.1240603We can verify that this proportion falls around the actual population proportion
TreeCount <- TreeCensus[,.N, by = spc_common]
TreeCount[spc_common=="London Planetree", N]/nrow(TreeCensus)## [1] 0.11079532.1 Two More Methods
From R-Bloggers
simpasym <- function(n, p, z=1.96, cc=TRUE){
  out <- list()
  if(cc){
    out$lb <- p - z*sqrt((p*(1-p))/n) - 0.5/n
    out$ub <- p + z*sqrt((p*(1-p))/n) + 0.5/n
  } else {
    out$lb <- p - z*sqrt((p*(1-p))/n)
    out$ub <- p + z*sqrt((p*(1-p))/n)
  }
  out
}
simpasym(n=500, p=pHat, z=qnorm(.975), cc=FALSE)## $lb
## [1] 0.0719397
## 
## $ub
## [1] 0.1240603and
scoreint <- function(n, p, z=1.96, cc=TRUE){
  out <- list()
  q <- 1-p
  zsq <- z^2
  denom <- (2*(n+zsq))
  if(cc){ 
    numl <- (2*n*p)+zsq-1-(z*sqrt(zsq-2-(1/n)+4*p*((n*q)+1)))
    numu <- (2*n*p)+zsq+1+(z*sqrt(zsq+2-(1/n)+4*p*((n*q)-1)))
    out$lb <- numl/denom
    out$ub <- numu/denom
    if(p==1) out$ub <- 1
    if(p==0) out$lb <- 0
  } else {
    out$lb <- ((2*n*p)+zsq-(z*sqrt(zsq+(4*n*p*q))))/denom
    out$ub <- ((2*n*p)+zsq+(z*sqrt(zsq+(4*n*p*q))))/denom
  }
  out
}
scoreint(n=500, p=pHat, z=qnorm(.975), cc=FALSE)## $lb
## [1] 0.07492392
## 
## $ub
## [1] 0.12720613 Sample Size For Proportions
So we just have to take this formula and solve for n.
\[E=z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]
Which looks like this:
\[n=\hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2\]
What Value of \(\hat{p}\) should be used?
For a new study we won’t have \(\hat{p}\) but a previous study on the same subject may give \(\hat{p}\) and that can be used to find n. However if there isn’t a previous study to borrow a value of \(\hat{p}\) from, then just use 0.5, this will give a sufficiently large sample size and will yield an accurate prediction.