Normal Distribution
0.0.1 Libraries
library(data.table)
library(tigerstats)
library(mosaic)
1 Normal Distribution
To start with, a normal distribution is:
- a continuous, symmetric, bell-shaped distribution of a variable
It has the following properties:
- A normal distribution is bell-shaped
- The mean, median, and mode are equal and are located at the center of the distribution
- A normal distribution curve is unimodal.
- The curve is symmetric about the mean, which is equivalent to saying that its shape is the same on both sides of a vertical line passing through the center.
- The curve is continuous; that is, there are no gaps or holes. For each value of x, there is a corresponding value of y.
- Theoretically, the curve never touches the x-axis, but it gets increasingly close to it.
- The total area under a normal distribution curve is equal to 1.00 or 100%. This fact may seem unusual, since the curve never touches the x-axis, but one can prove it mathematically by using calculus.
- The area under the part of a normal curve that lies within 1 standard deviation of the mean is approximately 0.68 or 68%; within two standard deviations, about 0.95 or 95%; and within 3 standard deviations, about 0.997 or 99.7%.
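The claim that the total area equals 1 even though the curve never reaches the x-axis can be checked numerically. The following is a minimal sketch using base R's integrate and dnorm; the standard normal (mean 0, standard deviation 1) and the cutoff of 5 standard deviations are chosen only for illustration.

```r
# Numerically integrate the standard normal density over the whole real line.
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total_area

# Almost all of that area already lies within 5 standard deviations of the mean,
# which is why the thin tails add so little.
area_within_5 <- integrate(dnorm, lower = -5, upper = 5)$value
area_within_5
```

The first result should be 1 up to numerical error, and the second should fall just short of 1.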
1.1 Normal Distribution Formula
The formula for the normal distribution looks like this:
\[y=\frac{e^{-(X-\mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}}\]
Where:
- \(e \approx\) 2.718
- \(\pi \approx\) 3.14
- \(\mu =\) population mean
- \(\sigma =\) the population standard deviation
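As a rough check of the formula, it can be written out directly and compared with R's built-in dnorm. This is only a sketch: normal_density is a hypothetical helper name introduced here, and the x values are arbitrary.

```r
# Hand-written version of the density formula above.
# normal_density is a hypothetical helper, not part of any package.
normal_density <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

x <- c(-2, 0, 1.5)                       # arbitrary points to evaluate
by_hand  <- normal_density(x, mu = 0, sigma = 1)
built_in <- dnorm(x, mean = 0, sd = 1)

# The two results should agree up to floating-point error.
all.equal(by_hand, built_in)
```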
2 Standard Normal Distribution
The Standard Normal Distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
To simulate a standard normal distribution we’ll make a data.table “standardNormal” that has 20,000 normally distributed random numbers with a mean of 0 and a standard deviation of 1.
standardNormal <- data.table(data=rnorm(20000, 0, 1))
We can print the data and some summary statistics:
print(standardNormal)
## data
## 1: 0.04099566
## 2: -0.47458254
## 3: 0.92621105
## 4: -0.71600523
## 5: 0.68300686
## ---
## 19996: -0.72051275
## 19997: -0.27691505
## 19998: 1.08561794
## 19999: 0.03206857
## 20000: 0.46978767
summary(standardNormal$data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.555000 -0.672600 -0.004642 0.000524 0.667000 3.839000
sd(standardNormal$data)
## [1] 0.9915133
And plot the data with a vertical line showing the mean.
ggplot(standardNormal, aes(data)) +
geom_density() + geom_vline(xintercept = c(mean(standardNormal$data)))
3 Variation in Normal Distributions
While all normal distributions are bell-shaped, they may differ in their mean and standard deviation.
The left distribution in this plot has a mean of -4 and a standard deviation of 1, and the right distribution has a mean of 15 and a standard deviation of 3. They are both normal distributions.
The notation for these two distributions is written like this:
\[N(\mu=-4, \sigma=1)\] \[N(\mu=15, \sigma=3)\]
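Two curves with these parameters can be drawn with base R graphics. This is a sketch, not the code behind the original plot; the x range and the line types are assumptions made for illustration.

```r
# Grid of x values wide enough to cover both curves (range chosen by eye).
x <- seq(-10, 30, length.out = 1000)

# Densities for N(mu = -4, sigma = 1) and N(mu = 15, sigma = 3).
left  <- dnorm(x, mean = -4, sd = 1)
right <- dnorm(x, mean = 15, sd = 3)

# Draw the taller, narrower curve first so the y-axis fits both.
plot(x, left, type = "l", xlab = "x", ylab = "density")
lines(x, right, lty = 2)
```

The curve with the smaller standard deviation is taller and narrower, because both curves must enclose the same total area of 1.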
4 Z-Scores : An Example
In this example we have two datasets: LakeHuron and Nile. LakeHuron gives the level of Lake Huron, measured in feet, over many years. Nile gives the annual flow of the Nile at Aswan over many years. Both datasets are roughly normally distributed. Here is some basic data about both datasets.
Dataset | StDev | Mean |
---|---|---|
Nile | 169.2275 | 919.35 |
LakeHuron | 1.318299 | 579.0041 |
So, assuming these datasets are normal, suppose that in one year the Nile has a flow of 1200 and in one year Lake Huron has a level of 580 feet. Which of these figures is relatively higher?
To find the answer, we compute the z-score for each figure: \[z=\frac{\text{value}-\text{mean}}{\text{standard deviation}}\] or \[z=\frac{X-\mu}{\sigma}\]
So
\[z_{Nile}=\frac{1200-919.35}{169.2275}=1.658\]
and
\[z_{LakeHuron}=\frac{580-579.0041}{1.318299}=0.755\]
So in this case the z-score for the Nile is larger, which means a flow of 1200 is higher relative to its dataset than a level of 580 feet is for Lake Huron.
If we wanted to do this in R:
z_huron <- (580-mean(LakeHuron))/sd(LakeHuron)
z_huron
## [1] 0.7554574
z_nile <- (1200-mean(Nile))/sd(Nile)
z_nile
## [1] 1.658418
4.1 Z Score Table
Using a z-score table we can find the percentage of values that fall between the mean of the dataset and the water level value.
Nile River
So the fraction of values that fall between the mean and 1200 on the Nile River is 0.45154 (the table entry for z = 1.66). This means that 95.154% of the time the river flow will be below 1200.
Lake Huron
The fraction of values that fall between the mean and 580 on Lake Huron is 0.27637, so 77.637% of the time the level of Lake Huron will be below 580 feet.
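The table lookup itself can be reproduced in R. A sketch for the Lake Huron figure, assuming the table lists the area between the mean and z, with z rounded to two decimals as printed tables usually require:

```r
# A z-table lists areas for z rounded to two decimals.
z_table <- round(0.7554574, 2)   # Lake Huron z-score from above, rounded to 0.76

# Area between the mean (z = 0) and z, as found in a "mean to z" table.
mean_to_z <- pnorm(z_table) - 0.5
mean_to_z

# Adding the 0.5 that lies below the mean gives the total area below z.
below_z <- mean_to_z + 0.5
below_z
```

Adding 0.5 works because the curve is symmetric, so exactly half of the area lies below the mean.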
4.2 pnorm
The nice part about R is that we can bypass using a lookup table and let R do that step with pnorm.
pnorm(z_nile)
## [1] 0.9513835
pnorm(z_huron)
## [1] 0.7750127
By default, pnorm uses a standard normal distribution (mean 0, standard deviation 1), the same distribution that a z-score lookup table is based on. We can also skip the z-score step entirely by passing the raw value together with the mean and standard deviation.
pnorm(1200, mean = mean(Nile), sd = sd(Nile))
## [1] 0.9513835
pnorm(580, mean = mean(LakeHuron), sd = sd(LakeHuron))
## [1] 0.7750127
4.3 Zscore Function
We can use the zscore function in the mosaic package to get the z-score for every value in a vector. For example:
huron <- data.table(level=LakeHuron, zscore=zscore(LakeHuron))
huron[1:20,]
## level zscore
## 1: 580.38 1.0437077
## 2: 581.86 2.1663670
## 3: 580.97 1.4912543
## 4: 580.80 1.3623002
## 5: 579.79 0.5961612
## 6: 580.39 1.0512933
## 7: 580.42 1.0740499
## 8: 580.82 1.3774713
## 9: 581.40 1.8174323
## 10: 581.32 1.7567481
## 11: 581.44 1.8477745
## 12: 581.68 2.0298273
## 13: 581.17 1.6429650
## 14: 580.53 1.1574908
## 15: 580.01 0.7630429
## 16: 579.91 0.6871876
## 17: 579.14 0.1031014
## 18: 579.16 0.1182724
## 19: 579.55 0.4141083
## 20: 579.67 0.5051347
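The same z-scores can be computed without mosaic. A minimal base R sketch, where the variable names level, z_manual, and z_scale are chosen here for illustration:

```r
# Base R version of the z-score calculation: subtract the mean, divide by the sd.
level <- as.numeric(LakeHuron)
z_manual <- (level - mean(level)) / sd(level)

# scale() does the same centering and scaling; as.numeric() drops its matrix shape.
z_scale <- as.numeric(scale(level))

head(z_manual, 3)
all.equal(z_manual, z_scale)
```

By construction, a vector of z-scores has a mean of 0 and a standard deviation of 1.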
5 Finding Area Under Normal Distribution Curve
The area of a section under a normal distribution curve tells you what percentage of values fall within that range. For example, if 80% of the area under a normal curve lies between two values on the x-axis, then 80% of the values in that distribution fall between those two values.
5.1 Example Using PnormGC Function
Using the pnormGC function, aka the Graphical Calculator for Normal Curve Probabilities, we can find the percentage of values that fall below 580 feet in Lake Huron assuming the lake levels have a normal distribution.
pnormGC(580, region="below", mean=mean(huron$level), sd=sd(huron$level), graph=TRUE)
## [1] 0.7750127
So according to this function, 77.5% of the values are in the shaded area and fall below 580. We can check this against the data by doing a little math.
sum(huron$level < 580)/length(huron$level)
## [1] 0.7959184
So according to the actual data, the percentage of values below 580 is closer to 80%. The two figures differ because the data are not perfectly normally distributed.
We can also do the same math using the Nile River data and our value of 1200.
pnormGC(1200, region="below", mean(Nile), sd(Nile), graph=TRUE)
## [1] 0.9513835
So we can see that, under a normal curve with the same mean and standard deviation as the Nile River dataset, about 95% of the values fall below 1200.
5.2 Example Using Pnorm Function
The pnorm function does the same math as pnormGC but without the graph.
pnorm(580, mean(huron$level), sd(huron$level))
## [1] 0.7750127
5.3 Using pnorm to find percent greater than a value.
So what if we wanted to find the percentage of values that are above 580 feet in the Lake Huron dataset?
pnorm(580, mean(huron$level), sd(huron$level), lower.tail = FALSE)
## [1] 0.2249873
So approximately 22.5% of values are higher than 580 feet, and we can show this with pnormGC:
pnormGC(580, region="above", mean=mean(huron$level), sd=sd(huron$level), graph=TRUE)
## [1] 0.2249873
We can also find the percentage of values between two points on the x-axis. These graphs use the standard normal distribution (mean 0, standard deviation 1) and find the following ranges:
- between 0 and 1
- greater than -3
pnormGC(c(0,1),region="between", mean=0, sd=1, graph=TRUE)
pnormGC(-3,region="above", mean=0, sd=1, graph=TRUE)
## [1] 0.3413447
## [1] 0.9986501
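The same two areas can be obtained from plain pnorm, without the graph. A short sketch (the variable names are chosen here for illustration):

```r
# Area between 0 and 1: area below 1 minus area below 0.
between_0_1 <- pnorm(1) - pnorm(0)
between_0_1

# Area above -3: the upper tail starting at -3.
above_minus_3 <- pnorm(-3, lower.tail = FALSE)
above_minus_3
```

A "between" area is always the difference of two "below" areas, which is why pnorm needs no separate option for it.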
5.4 Standard Deviation and Area of Normal Distribution
- The area within one standard deviation of the mean of a normal distribution is \(\approx\) 68%.
- Within two standard deviations it’s \(\approx\) 95%
- Within three standard deviations it’s \(\approx\) 99.7%
pnormGC(c(-1,1),region="between", mean=0, sd=1, graph=TRUE)
pnormGC(c(-2,2),region="between", mean=0, sd=1, graph=TRUE)
pnormGC(c(-3,3),region="between", mean=0, sd=1, graph=TRUE)
## [1] 0.6826895
## [1] 0.9544997
## [1] 0.9973002
6 Normal Probability Plot
Using a Normal Probability Plot, also called a Quantile-Quantile Plot, we can see how closely a dataset approaches a normal distribution with the qqnorm and qqline functions. The more closely the points follow the line, the more normal the data are. The density plots show how closely these datasets resemble a bell curve.
qplot(as.numeric(Nile), geom="density", main ="Nile River Annual River Flow \nDensity Plot")
qqnorm(Nile, main = "Nile River Annual River Flow \nNormal Probability Plot")
qqline(Nile)
qplot(as.numeric(LakeHuron), geom="density", main ="Lake Huron Water Level \n Density Plot")
qqnorm(LakeHuron, main = "Lake Huron Water Level \n Normal Probability Plot")
qqline(LakeHuron)
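To see what the normal probability plot is made of, the points can be built by hand: the sorted data are paired with standard normal quantiles at evenly spread probabilities. A sketch using the Nile data, with variable names chosen here for illustration:

```r
y <- as.numeric(Nile)
n <- length(y)

# Theoretical standard normal quantiles at evenly spread probabilities.
theoretical <- qnorm(ppoints(n))

# Pair the smallest observation with the smallest quantile, and so on.
plot(theoretical, sort(y),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# Once sorted, the x-coordinates match the ones qqnorm computes.
qq <- qqnorm(y, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
```

If the data were exactly normal, these points would fall on a straight line, which is the reference that qqline draws.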
7 How Normal Is A Dataset
One way to determine how normal a dataset is, is to plot it.
But we can also:
- Find its mean and standard deviation
- Find how many values fall within 1, 2 and 3 standard deviations of the mean.
- Then compare those values to 68%, 95% and 99.7% respectively.
That should give a rough idea of how normal the data is.
So with the airquality data set we have an air temperature variable:
head(airquality$Temp)
## [1] 67 72 74 62 56 66
Find the mean and standard deviation:
mean(airquality$Temp)
## [1] 77.88235
sd(airquality$Temp)
## [1] 9.46527
The proportion of values within 1 standard deviation is:
length(airquality$Temp[airquality$Temp > mean(airquality$Temp) - sd(airquality$Temp) & airquality$Temp < mean(airquality$Temp) + sd(airquality$Temp)]) / length(airquality$Temp)
## [1] 0.6666667
The proportion of values within 2 standard deviations is:
length(airquality$Temp[airquality$Temp > mean(airquality$Temp) - 2*sd(airquality$Temp) & airquality$Temp < mean(airquality$Temp) + 2*sd(airquality$Temp)]) / length(airquality$Temp)
## [1] 0.9542484
The proportion of values within 3 standard deviations is:
length(airquality$Temp[airquality$Temp > mean(airquality$Temp) - 3*sd(airquality$Temp) & airquality$Temp < mean(airquality$Temp) + 3*sd(airquality$Temp)]) / length(airquality$Temp)
## [1] 1
- The theoretical proportions to compare these against can be obtained with pnorm or pnormGC.
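The three calculations above can be wrapped in a small helper and set next to the theoretical values. This is a sketch; prop_within_sd is a hypothetical function name introduced here, not part of any package.

```r
# prop_within_sd is a hypothetical helper: the share of x lying strictly
# within k standard deviations of its mean.
prop_within_sd <- function(x, k) {
  mean(abs(x - mean(x)) < k * sd(x))
}

temp <- airquality$Temp

# Observed proportions in the data for k = 1, 2, 3.
observed <- sapply(1:3, function(k) prop_within_sd(temp, k))

# Proportions expected under a normal distribution.
expected <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))

round(rbind(observed, expected), 4)
```

Taking the mean of a logical vector gives the proportion of TRUE values, which replaces the length(...) / length(...) pattern.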
qqnorm(airquality$Temp)
qqline(airquality$Temp)
ggplot(airquality, aes(Temp)) + geom_density()
8 dnorm, pnorm, qnorm, rnorm
The four normal functions in R are:
- dnorm(x, mean = 0, sd = 1, log = FALSE)
- pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
- qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
- rnorm(n, mean = 0, sd = 1)
8.1 dnorm
dnorm gives the height of the normal density curve at a point. For example, if we plot the airquality temperature data and compare the height of the plot at 70 degrees with dnorm at 70 degrees, using the same mean and standard deviation, we find that the two values are approximately the same.
ggplot(airquality, aes(Temp)) + geom_density()
dnorm(70, mean=77.88, sd=9.47, log=FALSE)
## [1] 0.0297995
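The height of the plotted curve at 70 degrees can also be read off numerically rather than by eye. A sketch, assuming the plot uses the default kernel density settings, which density() in base R shares; the variable names are chosen here for illustration.

```r
temp <- airquality$Temp

# Kernel density estimate of the temperature data.
kde <- density(temp)

# Read the estimated density at 70 degrees by linear interpolation.
kde_at_70 <- approx(kde$x, kde$y, xout = 70)$y

# Normal density at 70 using the sample mean and standard deviation.
normal_at_70 <- dnorm(70, mean = mean(temp), sd = sd(temp))

c(kde = kde_at_70, normal = normal_at_70)
```

The two numbers should be close but not identical, since the kernel estimate follows the data while dnorm follows the idealized bell curve.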
8.2 pnorm
pnorm tells you the proportion of values in a normal distribution that lie above or below a specific value.
The first call below gives the proportion of values greater than 70 degrees in a normal distribution. Changing lower.tail = FALSE to TRUE gives the proportion less than 70 degrees.
pnorm(70, mean=77.88, sd=9.47, lower.tail = FALSE)
## [1] 0.7973241
pnorm(70, mean=77.88, sd=9.47, lower.tail = TRUE)
## [1] 0.2026759
8.3 qnorm
qnorm is the inverse of pnorm. pnorm gives the proportion of values greater than or less than a value on the x-axis; qnorm gives the value on the x-axis that corresponds to a given proportion.
qnorm(0.7973241, mean=77.88, sd=9.47, lower.tail = FALSE)
## [1] 70
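The inverse relationship can be shown as a round trip: feed a value to pnorm, then feed the result to qnorm with the same parameters. A short sketch, with variable names chosen here for illustration:

```r
# Proportion of values below 70 in a normal distribution with these parameters.
p <- pnorm(70, mean = 77.88, sd = 9.47)

# qnorm maps that proportion back to the original value.
x_back <- qnorm(p, mean = 77.88, sd = 9.47)
x_back

# A common use: the z value with 97.5% of the standard normal area below it.
z_975 <- qnorm(0.975)
z_975
```

The last value, roughly 1.96, is the more precise cutoff behind the "about 95% within two standard deviations" rule.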
8.4 rnorm
rnorm generates random numbers which have a normal distribution.
x <- rnorm(1000, 0, 1)
qplot(x,geom="density")
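Because rnorm draws random numbers, each run gives different values unless a seed is set first. A sketch of making the draws reproducible and checking the sample against the requested parameters; the seed value 42 is arbitrary.

```r
# Setting the same seed before each call reproduces the same draws.
set.seed(42)   # arbitrary seed, chosen only for illustration
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)

# The sample mean and sd only approximate the requested 0 and 1.
x <- rnorm(1000, mean = 0, sd = 1)
c(mean = mean(x), sd = sd(x))
```

Larger samples tend to land closer to the requested mean and standard deviation.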