Merging Data
If you load two datasets into R and want to merge them together. What you’ll want to match those datasets together using an ID. This is an example from JTLeek
if(!file.exists("./data")){dir.create("./data")}
fileUrl1 = "https://dl.dropboxusercontent.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 = "https://dl.dropboxusercontent.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile="./data/reviews.csv", method="curl")
download.file(fileUrl2, destfile="./data/solutions.csv", method="curl")
reviews <- read.csv("./data/reviews.csv"); solutions <- read.csv("./data/solutions.csv")
head(reviews,2)
## id solution_id reviewer_id start stop time_left accept
## 1 1 3 27 1304095698 1304095758 1754 1
## 2 2 4 22 1304095188 1304095206 2306 1
head(solutions, 2)
## id problem_id subject_id start stop time_left answer
## 1 1 156 29 1304095119 1304095169 2343 B
## 2 2 269 25 1304095119 1304095183 2329 C
In the reviews dataset there is a solution_id variable which corresponds to the id variable in the solutions dataset.
Merging data - merge()
-Merges data frames - Important parameters: x,y,by,by.x,by.y,all
names(reviews)
## [1] "id" "solution_id" "reviewer_id" "start" "stop"
## [6] "time_left" "accept"
names(solutions)
## [1] "id" "problem_id" "subject_id" "start" "stop"
## [6] "time_left" "answer"
We have to tell merge which data it has to merge. This tells merge solution_id and id are the same all = TRUE will add variables that show up in one dataset but not in the other.
mergedData = merge(reviews, solutions, by.x="solution_id", by.y="id", all=TRUE)
head(mergedData)
## solution_id id reviewer_id start.x stop.x time_left.x accept
## 1 1 4 26 1304095267 1304095423 2089 1
## 2 2 6 29 1304095471 1304095513 1999 1
## 3 3 1 27 1304095698 1304095758 1754 1
## 4 4 2 22 1304095188 1304095206 2306 1
## 5 5 3 28 1304095276 1304095320 2192 1
## 6 6 16 22 1304095303 1304095471 2041 1
## problem_id subject_id start.y stop.y time_left.y answer
## 1 156 29 1304095119 1304095169 2343 B
## 2 269 25 1304095119 1304095183 2329 C
## 3 34 22 1304095127 1304095146 2366 C
## 4 19 23 1304095127 1304095150 2362 D
## 5 605 26 1304095127 1304095167 2345 A
## 6 384 27 1304095131 1304095270 2242 C
Default - merge all common column names
The default is to merge on all common column names. Intersect shows us that the two data sets have these four variables in common: “id” “start” “stop” and “time_left”
intersect(names(solutions), names(reviews))
## [1] "id" "start" "stop" "time_left"
So sometimes the data variables will match up and other times it won’t. So merge will create multiple rows for each variable making a much larger data set than is necessary
mergedData2 = merge(reviews,solutions,all=TRUE)
head(mergedData2)
## id start stop time_left solution_id reviewer_id accept
## 1 1 1304095119 1304095169 2343 NA NA NA
## 2 1 1304095698 1304095758 1754 3 27 1
## 3 2 1304095119 1304095183 2329 NA NA NA
## 4 2 1304095188 1304095206 2306 4 22 1
## 5 3 1304095127 1304095146 2366 NA NA NA
## 6 3 1304095276 1304095320 2192 5 28 1
## problem_id subject_id answer
## 1 156 29 B
## 2 NA NA <NA>
## 3 269 25 C
## 4 NA NA <NA>
## 5 34 22 C
## 6 NA NA <NA>
Using join in the plyr package
join is a bit faster but less full featured - default to left join, see help file for more info.
library(plyr)
df1 = data.frame(id=sample(1:10), x=rnorm(10))
df2 = data.frame(id=sample(1:10), y=rnorm(10))
arrange(join(df1,df2),id)
## Joining by: id
## id x y
## 1 1 -0.8705105 -0.6440821
## 2 2 -1.0105164 -0.9986890
## 3 3 1.3798872 0.1731997
## 4 4 0.3262530 -0.4339347
## 5 5 0.4906615 0.5439697
## 6 6 -0.9870139 -0.8324297
## 7 7 0.5041528 -0.7852223
## 8 8 1.1495053 -1.8563536
## 9 9 -1.1025739 -0.1505799
## 10 10 0.6095613 0.1015107
It can only merge based on common names between two datasets. It cannot work with two different names like solution_id and id.
If you have multiple data frames
It’s harder to do this with merge but it works fairly well with join_all from plyr. Create a list and then join by pointing to the list.
df1 = data.frame(id=sample(1:10), x=rnorm(10))
df2 = data.frame(id=sample(1:10), y=rnorm(10))
df3 = data.frame(id=sample(1:10), z=rnorm(10))
dfList=list(df1,df2,df3)
join_all(dfList)
## Joining by: id
## Joining by: id
## id x y z
## 1 7 -1.6366161 -0.3919997 1.535524347
## 2 5 0.9312909 -0.2207783 1.180974984
## 3 3 -2.2642911 -0.1860606 0.748630263
## 4 6 0.7361912 2.7198882 0.152189871
## 5 9 1.3186331 0.7091286 -2.179985754
## 6 1 0.9866399 1.3522396 -0.592890273
## 7 2 0.4195678 0.4007915 0.305412005
## 8 8 -0.4737391 1.0069134 -0.326728918
## 9 4 0.5716195 0.3834385 0.004047811
## 10 10 -0.7917472 0.3496248 -0.620695865
More on merging data
- The quick R data merging page http://www.statmethods.net/management/merging.html
- plyr information http://plyr.had.co.nz
- Types of joins http://en.wikipedia.org/wiki/Join_(SQL)