Normal Approximation to the Binomial Distribution
0.1 Libraries
library(tigerstats)
1 Why Use a Normal Approximation of a Binomial Distribution
The simple reason is that the binomial formula becomes unwieldy when n is large, roughly 100 or more.
For example, to find the probability of exactly 15 heads in 100 coin flips, the math looks like this:
\[P(\text{15 heads in 100 flips}) = \frac{100!}{(100-15)!\cdot15!}\cdot .5^{15} \cdot .5^{100-15}\]
Calculating 100! overwhelms most calculators, although R can compute this probability directly using dbinom:
dbinom(15,100,.5)
## [1] 1.998488e-13
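As a quick check, the formula above can also be evaluated term by term in base R. This is only a sketch: `choose()` and `factorial()` are base R functions, and the variable names are chosen here for illustration.

```r
# Probability of exactly 15 heads in 100 fair coin flips,
# computed from the binomial formula rather than with dbinom().
n <- 100   # number of flips
k <- 15    # number of heads
p <- 0.5   # probability of heads on each flip

# choose(n, k) is n! / ((n - k)! * k!)
by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)
by_dbinom  <- dbinom(k, size = n, prob = p)

print(by_formula)
print(by_dbinom)

# 100! fits in double precision, but only to about 16 significant digits.
print(factorial(100))
```

Both approaches should agree up to floating-point rounding, so in practice `dbinom` is the more convenient choice.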
1.1 Conditions
Several conditions must be met before using a normal distribution to approximate a binomial distribution:
- There must be a fixed number of trials.
- The outcome of each trial must be independent.
- Each trial can have only two outcomes.
- The probability of success must be the same for each trial.
- The approximation doesn't work if the probability of success is close to 0 or 1, or if the number of trials is small.
- \(n \cdot p\) and \(n \cdot q\) should both be greater than 5.
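The last condition is easy to check in code. The helper below is hypothetical (it is not part of tigerstats or base R) and uses the strict "greater than 5" rule stated above. Textbooks vary on the exact cutoff, so the threshold is left as an argument.

```r
# Hypothetical helper: TRUE when both n*p and n*q exceed the threshold.
normal_approx_ok <- function(n, p, threshold = 5) {
  q <- 1 - p
  (n * p > threshold) && (n * q > threshold)
}

print(normal_approx_ok(100, 0.5))   # n*p = 50 and n*q = 50
print(normal_approx_ok(100, 0.99))  # n*q is about 1
print(normal_approx_ok(10, 0.3))    # n*p = 3
```

This only covers the size condition. Independence and a constant probability of success still have to be judged from how the data were collected.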
2 Continuity Correction Factor
Approximating a binomial distribution with a normal distribution means using a continuous distribution to approximate a discrete one. A continuity correction factor accounts for this difference: the probability of a single discrete value is calculated over a range extending half a unit on either side of it, not at one point. See the table to figure out which correction factor to use.
Discrete
Discrete variables are values that can be counted.
Continuous
Continuous variables can assume an infinite number of values between any two values. They are often obtained by measuring and often include fractions or decimals.
Continuity Correction Factor Table
Binomial | Normal |
---|---|
If P(X = n) | use P(n – 0.5 < X < n + 0.5) |
If P(X > n) | use P(X > n + 0.5) |
If P(X ≤ n) | use P(X < n + 0.5) |
If P(X < n) | use P(X < n – 0.5) |
If P(X ≥ n) | use P(X > n – 0.5) |
So, for example, to calculate P(X = 55) you'd need the area under the normal curve between 54.5 and 55.5, that is, P(X < 55.5) minus P(X < 54.5).
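Each row of the table can be compared against the exact binomial answer. The sketch below assumes 100 fair coin flips and the value 55. Here `k` plays the role of the table's n, and `n_trials` is the number of trials. All names are illustrative.

```r
n_trials <- 100
p        <- 0.5
k        <- 55
mu       <- n_trials * p
sigma    <- sqrt(n_trials * p * (1 - p))

# Exact binomial probabilities for each row of the table
exact <- c(
  "P(X = k)"  = dbinom(k, n_trials, p),
  "P(X > k)"  = pbinom(k, n_trials, p, lower.tail = FALSE),
  "P(X <= k)" = pbinom(k, n_trials, p),
  "P(X < k)"  = pbinom(k - 1, n_trials, p),
  "P(X >= k)" = pbinom(k - 1, n_trials, p, lower.tail = FALSE)
)

# Normal approximations using the matching continuity correction
normal <- c(
  "P(X = k)"  = pnorm(k + 0.5, mu, sigma) - pnorm(k - 0.5, mu, sigma),
  "P(X > k)"  = pnorm(k + 0.5, mu, sigma, lower.tail = FALSE),
  "P(X <= k)" = pnorm(k + 0.5, mu, sigma),
  "P(X < k)"  = pnorm(k - 0.5, mu, sigma),
  "P(X >= k)" = pnorm(k - 0.5, mu, sigma, lower.tail = FALSE)
)

print(round(cbind(exact, normal, diff = normal - exact), 5))
```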
3 The Different Formulas
So the binomial distribution formula looks like this: \[P(\text{X successes in n trials}) = \frac{n!}{(n-X)!\cdot X!}\cdot p^{X} \cdot q^{n-X}\]
Where:
- n is the number of trials
- X is the number of successes
- p is the probability of success
- q is the probability of failure or 1-p
To use the normal distribution, convert the number of successes to a z-score:
\[z = \frac{X-\mu}{\sigma}\]
Where:
- X is the number of successes
- \(\mu\) is the mean of the distribution
- \(\sigma\) is the standard deviation of the distribution
- z is the z-score for X
To find \(\mu\) from the binomial distribution:
\[\mu = n \cdot p\]
To find \(\sigma\) from the binomial distribution:
\[\sigma = \sqrt{n \cdot p \cdot q}\]
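These three formulas combine into a short function. `binom_z` is a hypothetical name used only for this sketch.

```r
# Hypothetical helper: z-score of x under the normal approximation
# to a binomial with n trials and success probability p.
binom_z <- function(x, n, p) {
  q     <- 1 - p
  mu    <- n * p              # mean of the binomial
  sigma <- sqrt(n * p * q)    # standard deviation of the binomial
  (x - mu) / sigma
}

# z-scores for the two ends of a continuity-corrected interval
print(binom_z(c(46.5, 47.5), n = 100, p = 0.5))
```

Because the arithmetic is vectorised, both ends of an interval can be converted in one call.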
4 Example
As an example, consider flipping a coin 100 times and finding the probability of getting exactly 47 heads.
4.1 The Math
Binomial Distribution
\[P(\text{47 heads in 100 flips}) = \frac{100!}{(100-47)!\cdot47!}\cdot .5^{47} \cdot .5^{100-47}\]
Normal Distribution
First we have to verify that the number of trials is large enough and that the probability is not too close to 0 or 1. Both np and nq must be greater than 5.
\[np = 50\] \[nq = 50\]
Then:
\[\mu = n \cdot p = 50\]
\[\sigma = \sqrt{n \cdot p \cdot q} = 5\]
And since we're using a normal approximation of a binomial distribution, we apply the continuity correction and calculate the area from 46.5 to 47.5:
\[z_1 = \frac{46.5-50}{5} = -0.7\] \[z_2 = \frac{47.5-50}{5} = -0.5\]
And from a z-score table we know that:
\(z_1 = -0.7\) has a cumulative probability (area to the left) of 0.2420
\(z_2 = -0.5\) has a cumulative probability (area to the left) of 0.3085
Subtracting the first from the second gives a probability of 0.3085 - 0.2420 = 0.0665, or 6.65%.
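The z-table lookups can be reproduced with `pnorm()`, which returns the area to the left of a z-score under the standard normal curve when no mean or standard deviation is supplied. The variable names below are illustrative.

```r
z1 <- (46.5 - 50) / 5   # lower end of the interval
z2 <- (47.5 - 50) / 5   # upper end of the interval

area_z1 <- pnorm(z1)    # area to the left of z1
area_z2 <- pnorm(z2)    # area to the left of z2

print(round(c(area_z1, area_z2), 4))
print(area_z2 - area_z1)
```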
4.2 Using R
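# Exact binomial probability of exactly 47 heads
dbinom(x = 47, size = 100, prob = .5)
## [1] 0.0665905
# Normal density at 47: a density, not a probability, but close to the
# interval area here because the interval has width 1
dnorm(x = 47, mean = 50, sd = 5)
## [1] 0.06664492
# Normal approximation with the continuity correction (area from 46.5 to 47.5)
pnorm(q = 47.5, mean = 50, sd = 5) - pnorm(q = 46.5, mean = 50, sd = 5)
## [1] 0.06657389
# The same area using pnormGC from tigerstats, which also draws the shaded region
pnormGC(c(47.5,46.5), "between", mean = 50, sd = 5, graph = TRUE)
## [1] 0.06657389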
5 When a Normal Distribution Won’t Work
A normal distribution won't work when the probability p is close to 0 or 1 or when the number of trials n is small.
Notice that these plots don't quite line up. In these plots the bars are the binomial probabilities, and the red lines and blue curve are the normal approximation. In the left plot the probability is close to 1; in the right plot \(n \cdot p\) is less than 5.
In both cases the normal approximation doesn't quite match the probabilities of the same binomial distribution.
# Left plot: n = 100, p = 0.95 (p close to 1)
probs2 <- dbinom(1:100, size = 100, prob = .95)
barplot(probs2, names.arg = 1:100, space = 0, xlim = c(85, 100), ylim = c(0, 0.2))
# With space = 0 the bar for k successes is centred at k - 0.5, so the normal density is shifted to match
lines((1:100) - .5, dnorm(1:100, mean = 95, sd = sqrt(4.75)), type = "h", lwd = 2, col = "red")
curve(dnorm(x + .5, mean = 95, sd = sqrt(4.75)), from = 85, to = 101, add = TRUE, col = "blue")
# Right plot: n = 10, p = 0.3 (n * p = 3, less than 5)
probs2 <- dbinom(1:10, size = 10, prob = .3)
barplot(probs2, names.arg = 1:10, space = 0, xlim = c(0, 10), ylim = c(0, 0.3))
lines((1:10) - .5, dnorm(1:10, mean = 3, sd = sqrt(2.1)), type = "h", lwd = 2, col = "red")
curve(dnorm(x + .5, mean = 3, sd = sqrt(2.1)), from = 0, to = 10, add = TRUE, col = "blue")
But if we adjust the parameters, the fit improves. The left plot has a probability closer to 0.5, and in the right plot \(n \cdot p\) is equal to 5.
# Left plot: n = 100, p = 0.65 (p closer to 0.5)
probs2 <- dbinom(1:100, size = 100, prob = .65)
barplot(probs2, names.arg = 1:100, space = 0, xlim = c(47, 83), ylim = c(0, 0.09))
# With space = 0 the bar for k successes is centred at k - 0.5, so the normal density is shifted to match
lines((1:100) - .5, dnorm(1:100, mean = 65, sd = sqrt(22.75)), type = "h", lwd = 2, col = "red")
curve(dnorm(x + .5, mean = 65, sd = sqrt(22.75)), from = 47, to = 83, add = TRUE, col = "blue")
# Right plot: n = 10, p = 0.5 (n * p = 5)
probs2 <- dbinom(1:10, size = 10, prob = .5)
barplot(probs2, names.arg = 1:10, space = 0, xlim = c(0, 10), ylim = c(0, 0.3))
lines((1:10) - .5, dnorm(1:10, mean = 5, sd = sqrt(2.5)), type = "h", lwd = 2, col = "red")
curve(dnorm(x + .5, mean = 5, sd = sqrt(2.5)), from = 0, to = 10, add = TRUE, col = "blue")
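The visual comparison can also be put into numbers. The sketch below uses a hypothetical helper, `max_approx_error`, which is not from tigerstats. It reports the largest gap between the exact binomial probabilities and the continuity-corrected normal approximation for the four cases plotted above.

```r
# Hypothetical helper: largest absolute difference between the exact
# binomial probabilities P(X = k), k = 0..n, and the normal areas
# between k - 0.5 and k + 0.5.
max_approx_error <- function(n, p) {
  k      <- 0:n
  mu     <- n * p
  sigma  <- sqrt(n * p * (1 - p))
  exact  <- dbinom(k, size = n, prob = p)
  normal <- pnorm(k + 0.5, mu, sigma) - pnorm(k - 0.5, mu, sigma)
  max(abs(exact - normal))
}

# The four cases from the plots, in the order they appear
cases <- data.frame(n = c(100, 10, 100, 10),
                    p = c(0.95, 0.3, 0.65, 0.5))
cases$np        <- cases$n * cases$p
cases$max_error <- mapply(max_approx_error, cases$n, cases$p)
print(cases)
```

A smaller `max_error` means the normal areas track the binomial bars more closely. The first two rows (p close to 1, and n * p less than 5) should show larger errors than the last two.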