Confidence Intervals Part 3: Confidence Intervals and Sample Size for Proportions
0.1 Libraries
library(data.table)
library(ggplot2)
1 Confidence Interval From a Proportion
Oh jeez, another confidence interval.
1.1 Formula
The formula for maximum error of the estimate is: \[E=z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]
The formula for the confidence interval for a proportion is: \[\hat{p} - z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]
Where:
- n is the sample size
- \(\hat{p}\) is the population proportion
- \(\hat{q} = 1-\hat{p}\)
- z is
Also:
- when \(n\hat{p}\) and \(n\hat{q}\) are each greater than 5.
- round off to 3 decimal places.
2 Example Tree Data again.
So we’re going to make a 95% confidence interval from a sampled proportion of this data set.
This is data from a 2015 tree census from NY City’s Open Data website. Link to tree data.
First we load it into a data.table.
TreeCensus<- fread("data/2015_Street_Tree_Census_-_Tree_Data.csv")
##
Read 87.5% of 468341 rows
Read 468341 rows and 41 (of 41) columns from 0.130 GB file in 00:00:03
Now we’ll take 500 samples from the data, tally each species and load the resulting count of species into a data.table.
TreeSample <- TreeCensus[,sample(spc_common, 500)]
TreeSample <- table(TreeSample)
TreeSample <- data.table(TreeSample)
head(TreeSample)
## TreeSample N
## 1: 16
## 2: American Elm 8
## 3: American Hornbeam 1
## 4: American Linden 4
## 5: Amur Maple 2
## 6: Bald Cypress 2
Now we can find the proportion of a species, in this case the London Planetree.
pHat <- TreeSample[TreeSample == "London Planetree",N]/sum(TreeSample$N)
qHat <- 1-pHat
pHat
## [1] 0.098
qHat
## [1] 0.902
Now using the formula we can find the 95% confidence interval for the population proportion.
\[\hat{p} - Z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + Z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]
c(
pHat - qnorm(.975)*sqrt((pHat*qHat)/500),
pHat + qnorm(.975)*sqrt((pHat*qHat)/500)
)
## [1] 0.0719397 0.1240603
We can verify that this proportion falls around the actual population proportion
TreeCount <- TreeCensus[,.N, by = spc_common]
TreeCount[spc_common=="London Planetree", N]/nrow(TreeCensus)
## [1] 0.1107953
2.1 Two More Methods
From R-Bloggers
simpasym <- function(n, p, z=1.96, cc=TRUE){
out <- list()
if(cc){
out$lb <- p - z*sqrt((p*(1-p))/n) - 0.5/n
out$ub <- p + z*sqrt((p*(1-p))/n) + 0.5/n
} else {
out$lb <- p - z*sqrt((p*(1-p))/n)
out$ub <- p + z*sqrt((p*(1-p))/n)
}
out
}
simpasym(n=500, p=pHat, z=qnorm(.975), cc=FALSE)
## $lb
## [1] 0.0719397
##
## $ub
## [1] 0.1240603
and
scoreint <- function(n, p, z=1.96, cc=TRUE){
out <- list()
q <- 1-p
zsq <- z^2
denom <- (2*(n+zsq))
if(cc){
numl <- (2*n*p)+zsq-1-(z*sqrt(zsq-2-(1/n)+4*p*((n*q)+1)))
numu <- (2*n*p)+zsq+1+(z*sqrt(zsq+2-(1/n)+4*p*((n*q)-1)))
out$lb <- numl/denom
out$ub <- numu/denom
if(p==1) out$ub <- 1
if(p==0) out$lb <- 0
} else {
out$lb <- ((2*n*p)+zsq-(z*sqrt(zsq+(4*n*p*q))))/denom
out$ub <- ((2*n*p)+zsq+(z*sqrt(zsq+(4*n*p*q))))/denom
}
out
}
scoreint(n=500, p=pHat, z=qnorm(.975), cc=FALSE)
## $lb
## [1] 0.07492392
##
## $ub
## [1] 0.1272061
3 Sample Size For Proportions
So we just have to take this formula and solve for n.
\[E=z_{\alpha / 2} \sqrt{\frac{\hat{p}\hat{q}}{n}}\]
Which looks like this:
\[n=\hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2\]
What Value of \(\hat{p}\) should be used?
For a new study we won’t have \(\hat{p}\) but a previous study on the same subject may give \(\hat{p}\) and that can be used to find n. However if there isn’t a previous study to borrow a value of \(\hat{p}\) from, then just use 0.5, this will give a sufficiently large sample size and will yield an accurate prediction.