Using Dplyr
dplyr package is about data frames. Each observation is one row and each column is one variable or measure
dplyr Properties
The first argument is a data frame.
The subsequent arguments describe what to do with it, and you can refer to columns in the data frame directly without using the $ operator (just use the names).
The result is a new data frame
Data frames must be properly formatted and annotated for this to all be useful
Using The Functions
First download data from: https://raw.github.com/DataScienceSpecialization/courses/master/03_GettingData/dplyr/chicago.rds
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
options(width=105)
chicago <- readRDS("./data/chicago.rds")
str(chicago)
## 'data.frame': 6940 obs. of 8 variables:
## $ city : chr "chic" "chic" "chic" "chic" ...
## $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
## $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
## $ date : Date, format: "1987-01-01" "1987-01-02" "1987-01-03" ...
## $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
## $ pm10tmean2: num 34 NA 34.2 47 NA ...
## $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
## $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
Use the names function to get the names of the columns in the data.frame
names(chicago)
## [1] "city" "tmpd" "dptp" "date" "pm25tmean2" "pm10tmean2" "o3tmean2"
## [8] "no2tmean2"
The functions are:
select: return a subset of the columns of a data frame
So here I can return the head of the data.frame using select and showing the head of the columns from city to dptp.
head(select(chicago, city:dptp))
## city tmpd dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125
You can also use the minus sign to show all columns except those in the range specified with select
head(select(chicago,- (city:dptp)))
## date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01 NA 34.00000 4.250000 19.98810
## 2 1987-01-02 NA NA 3.304348 23.19099
## 3 1987-01-03 NA 34.16667 3.333333 23.81548
## 4 1987-01-04 NA 47.00000 4.375000 30.43452
## 5 1987-01-05 NA NA 4.750000 30.33333
## 6 1987-01-06 NA 48.00000 5.833333 25.77233
To do this without dplyr takes a few extra steps:
i <- match("city", names(chicago))
j <- match("dptp", names(chicago))
head(chicago[, -(i:j)])
## date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01 NA 34.00000 4.250000 19.98810
## 2 1987-01-02 NA NA 3.304348 23.19099
## 3 1987-01-03 NA 34.16667 3.333333 23.81548
## 4 1987-01-04 NA 47.00000 4.375000 30.43452
## 5 1987-01-05 NA NA 4.750000 30.33333
## 6 1987-01-06 NA 48.00000 5.833333 25.77233
filter: extract a subset of rows from a data frame based on logical conditions
For example you might want to take all rows in the chicago dataset where PM2.5 is > 30
chic.f <- filter(chicago, pm25tmean2 > 30)
head(chic.f, 10)
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 23 21.9 1998-01-17 38.10 32.46154 3.180556 25.30000
## 2 chic 28 25.8 1998-01-23 33.95 38.69231 1.750000 29.37630
## 3 chic 55 51.3 1998-04-30 39.40 34.00000 10.786232 25.31310
## 4 chic 59 53.7 1998-05-01 35.40 28.50000 14.295125 31.42905
## 5 chic 57 52.0 1998-05-02 33.30 35.00000 20.662879 26.79861
## 6 chic 57 56.0 1998-05-07 32.10 34.50000 24.270422 33.99167
## 7 chic 75 65.8 1998-05-15 56.50 91.00000 38.573007 29.03261
## 8 chic 61 59.0 1998-06-09 33.80 26.00000 17.890810 25.49668
## 9 chic 73 60.3 1998-07-13 30.30 64.50000 37.018865 37.93056
## 10 chic 78 67.1 1998-07-14 41.40 75.00000 40.080902 32.59054
You can add additional conditions:
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(chic.f, 10)
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 81 71.2 1998-08-23 39.6000 59.0 45.86364 14.32639
## 2 chic 81 70.4 1998-09-06 31.5000 50.5 50.66250 20.31250
## 3 chic 82 72.2 2001-07-20 32.3000 58.5 33.00380 33.67500
## 4 chic 84 72.9 2001-08-01 43.7000 81.5 45.17736 27.44239
## 5 chic 85 72.6 2001-08-08 38.8375 70.0 37.98047 27.62743
## 6 chic 84 72.6 2001-08-09 38.2000 66.0 36.73245 26.46742
## 7 chic 82 67.4 2002-06-20 33.0000 80.5 47.42673 30.76703
## 8 chic 82 63.5 2002-06-23 42.5000 65.0 54.88043 30.03913
## 9 chic 81 70.4 2002-07-08 33.1000 64.0 45.34969 27.67857
## 10 chic 82 66.2 2002-07-18 38.8500 72.5 44.98045 26.06905
arrange: reorder rows of a data frame
This will arrange the variables according to the date variable:
chicago <- arrange(chicago, date)
head(chicago)
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810
## 2 chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099
## 3 chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548
## 4 chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452
## 5 chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333
## 6 chic 40.0 35.125 1987-01-06 NA 48.00000 5.833333 25.77233
tail(chicago)
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 6935 chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944
## 6936 chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000
## 6937 chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563
## 6938 chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222
## 6939 chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556
## 6940 chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000
This will arrange the variables in descending order:
chicago <- arrange(chicago, desc(date))
head(chicago)
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000
## 2 chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556
## 3 chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222
## 4 chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563
## 5 chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000
## 6 chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944
rename: rename variables in a data frame
If you want to rename the pm25tmean2 you can rename that variable:
chicago <- rename(chicago, pm25 = pm25tmean2, dewpoint = dptp)
names(chicago)
## [1] "city" "tmpd" "dewpoint" "date" "pm25" "pm10tmean2" "o3tmean2"
## [8] "no2tmean2"
mutate: add new variables / columns or transform existing variables
create or transform existing variables. So this will create a column showing how much pm25 emissions deviate from the mean.
chicago <- mutate(chicago, pm25detrend = pm25-mean(pm25, na.rm=TRUE))
head(select(chicago, pm25, pm25detrend))
## pm25 pm25detrend
## 1 15.00000 -1.230958
## 2 15.05714 -1.173815
## 3 7.45000 -8.780958
## 4 17.75000 1.519042
## 5 23.56000 7.329042
## 6 8.40000 -7.830958
This creates a factor variable indicating whether the temperatures are hot or cold.
chicago <- mutate(chicago, tempcat = factor(1 * (tmpd > 80), labels = c("cold", "hot")))
hotcold <- group_by(chicago, tempcat)
hotcold
## Source: local data frame [6,940 x 10]
## Groups: tempcat
##
## city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2 pm25detrend tempcat
## 1 chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000 -1.230958 cold
## 2 chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556 -1.173815 cold
## 3 chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222 -8.780958 cold
## 4 chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563 1.519042 cold
## 5 chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000 7.329042 cold
## 6 chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944 -7.830958 cold
## 7 chic 35 32.1 2005-12-25 6.70000 8.0 14.354167 13.79167 -9.530958 cold
## 8 chic 37 35.2 2005-12-24 30.77143 25.2 1.770833 31.98611 14.540471 cold
## 9 chic 41 32.6 2005-12-23 32.90000 34.5 6.906250 29.08333 16.669042 cold
## 10 chic 22 23.3 2005-12-22 36.65000 42.5 5.385417 33.73026 20.419042 cold
## .. ... ... ... ... ... ... ... ... ... ...
summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata
This gives mean pm25, maxo 3tmean2, and median no2tmean2 for both hot and cold days. There are NA values for pm25 so na.rm has to be added to pm25
summarize(hotcold, pm25 = mean(pm25, na.rm=TRUE), o3 = max(o3tmean2), no2 = median(no2tmean2))
## Source: local data frame [3 x 4]
##
## tempcat pm25 o3 no2
## 1 cold 15.97807 66.587500 24.54924
## 2 hot 26.48118 62.969656 24.93870
## 3 NA 47.73750 9.416667 37.44444
Might want a summary for every year. Using mutate we can add a year variable. This will show us mean pm25, max o3, and median no2 for each year in the dataset.
chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago, year)
summarize(years, pm25=mean(pm25, na.rm = TRUE), o3 = max(o3tmean2), no2 = median(no2tmean2))
## Source: local data frame [19 x 4]
##
## year pm25 o3 no2
## 1 1987 NaN 62.96966 23.49369
## 2 1988 NaN 61.67708 24.52296
## 3 1989 NaN 59.72727 26.14062
## 4 1990 NaN 52.22917 22.59583
## 5 1991 NaN 63.10417 21.38194
## 6 1992 NaN 50.82870 24.78921
## 7 1993 NaN 44.30093 25.76993
## 8 1994 NaN 52.17844 28.47500
## 9 1995 NaN 66.58750 27.26042
## 10 1996 NaN 58.39583 26.38715
## 11 1997 NaN 56.54167 25.48143
## 12 1998 18.26467 50.66250 24.58649
## 13 1999 18.49646 57.48864 24.66667
## 14 2000 16.93806 55.76103 23.46082
## 15 2001 16.92632 51.81984 25.06522
## 16 2002 15.27335 54.88043 22.73750
## 17 2003 15.23183 56.16608 24.62500
## 18 2004 14.62864 44.48240 23.39130
## 19 2005 16.18556 58.84126 22.62387
Pipeline Operator %>%
dplyr has a function that allows you to chain multiple functions together and it allows you see what is happening in a readable way. It uses these symbols: %>%
This will mutate it by month, then take that output and group_by it with the month variable then take the output of that and summarize. This shows a summary of three pollutants grouped by the month of the year.
chicago %>% mutate(month = as.POSIXlt(date)$mon + 1) %>% group_by(month) %>% summarize(pm25 = mean(pm25, na.rm=TRUE), o3 = max(o3tmean2), no2 = median(no2tmean2))
## Source: local data frame [12 x 4]
##
## month pm25 o3 no2
## 1 1 17.76996 28.22222 25.35417
## 2 2 20.37513 37.37500 26.78034
## 3 3 17.40818 39.05000 26.76984
## 4 4 13.85879 47.94907 25.03125
## 5 5 14.07420 52.75000 24.22222
## 6 6 15.86461 66.58750 25.01140
## 7 7 16.57087 59.54167 22.38442
## 8 8 16.93380 53.96701 22.98333
## 9 9 15.91279 57.48864 24.47917
## 10 10 14.23557 47.09275 24.15217
## 11 11 15.15794 29.45833 23.56537
## 12 12 17.52221 27.70833 24.45773