Merging Data

If you load two datasets into R and want to merge them together. What you’ll want to match those datasets together using an ID. This is an example from JTLeek

if(!file.exists("./data")){dir.create("./data")}
fileUrl1 = "https://dl.dropboxusercontent.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 = "https://dl.dropboxusercontent.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile="./data/reviews.csv", method="curl")
download.file(fileUrl2, destfile="./data/solutions.csv", method="curl")
reviews <- read.csv("./data/reviews.csv"); solutions <- read.csv("./data/solutions.csv")
head(reviews,2)
##   id solution_id reviewer_id      start       stop time_left accept
## 1  1           3          27 1304095698 1304095758      1754      1
## 2  2           4          22 1304095188 1304095206      2306      1
head(solutions, 2)
##   id problem_id subject_id      start       stop time_left answer
## 1  1        156         29 1304095119 1304095169      2343      B
## 2  2        269         25 1304095119 1304095183      2329      C

In the reviews dataset there is a solution_id variable which corresponds to the id variable in the solutions dataset.

Merging data - merge()

-Merges data frames - Important parameters: x,y,by,by.x,by.y,all

names(reviews)
## [1] "id"          "solution_id" "reviewer_id" "start"       "stop"       
## [6] "time_left"   "accept"
names(solutions)
## [1] "id"         "problem_id" "subject_id" "start"      "stop"      
## [6] "time_left"  "answer"

We have to tell merge which data it has to merge. This tells merge solution_id and id are the same all = TRUE will add variables that show up in one dataset but not in the other.

mergedData = merge(reviews, solutions, by.x="solution_id", by.y="id", all=TRUE)
head(mergedData)
##   solution_id id reviewer_id    start.x     stop.x time_left.x accept
## 1           1  4          26 1304095267 1304095423        2089      1
## 2           2  6          29 1304095471 1304095513        1999      1
## 3           3  1          27 1304095698 1304095758        1754      1
## 4           4  2          22 1304095188 1304095206        2306      1
## 5           5  3          28 1304095276 1304095320        2192      1
## 6           6 16          22 1304095303 1304095471        2041      1
##   problem_id subject_id    start.y     stop.y time_left.y answer
## 1        156         29 1304095119 1304095169        2343      B
## 2        269         25 1304095119 1304095183        2329      C
## 3         34         22 1304095127 1304095146        2366      C
## 4         19         23 1304095127 1304095150        2362      D
## 5        605         26 1304095127 1304095167        2345      A
## 6        384         27 1304095131 1304095270        2242      C


Default - merge all common column names

The default is to merge on all common column names. Intersect shows us that the two data sets have these four variables in common: “id” “start” “stop” and “time_left”

intersect(names(solutions), names(reviews))
## [1] "id"        "start"     "stop"      "time_left"

So sometimes the data variables will match up and other times it won’t. So merge will create multiple rows for each variable making a much larger data set than is necessary

mergedData2 = merge(reviews,solutions,all=TRUE)
head(mergedData2)
##   id      start       stop time_left solution_id reviewer_id accept
## 1  1 1304095119 1304095169      2343          NA          NA     NA
## 2  1 1304095698 1304095758      1754           3          27      1
## 3  2 1304095119 1304095183      2329          NA          NA     NA
## 4  2 1304095188 1304095206      2306           4          22      1
## 5  3 1304095127 1304095146      2366          NA          NA     NA
## 6  3 1304095276 1304095320      2192           5          28      1
##   problem_id subject_id answer
## 1        156         29      B
## 2         NA         NA   <NA>
## 3        269         25      C
## 4         NA         NA   <NA>
## 5         34         22      C
## 6         NA         NA   <NA>


Using join in the plyr package

join is a bit faster but less full featured - default to left join, see help file for more info.

library(plyr)
df1 = data.frame(id=sample(1:10), x=rnorm(10))
df2 = data.frame(id=sample(1:10), y=rnorm(10))
arrange(join(df1,df2),id)
## Joining by: id
##    id          x          y
## 1   1 -0.8705105 -0.6440821
## 2   2 -1.0105164 -0.9986890
## 3   3  1.3798872  0.1731997
## 4   4  0.3262530 -0.4339347
## 5   5  0.4906615  0.5439697
## 6   6 -0.9870139 -0.8324297
## 7   7  0.5041528 -0.7852223
## 8   8  1.1495053 -1.8563536
## 9   9 -1.1025739 -0.1505799
## 10 10  0.6095613  0.1015107

It can only merge based on common names between two datasets. It cannot work with two different names like solution_id and id.


If you have multiple data frames

It’s harder to do this with merge but it works fairly well with join_all from plyr. Create a list and then join by pointing to the list.

df1 = data.frame(id=sample(1:10), x=rnorm(10))
df2 = data.frame(id=sample(1:10), y=rnorm(10))
df3 = data.frame(id=sample(1:10), z=rnorm(10))
dfList=list(df1,df2,df3)
join_all(dfList)
## Joining by: id
## Joining by: id
##    id          x          y            z
## 1   7 -1.6366161 -0.3919997  1.535524347
## 2   5  0.9312909 -0.2207783  1.180974984
## 3   3 -2.2642911 -0.1860606  0.748630263
## 4   6  0.7361912  2.7198882  0.152189871
## 5   9  1.3186331  0.7091286 -2.179985754
## 6   1  0.9866399  1.3522396 -0.592890273
## 7   2  0.4195678  0.4007915  0.305412005
## 8   8 -0.4737391  1.0069134 -0.326728918
## 9   4  0.5716195  0.3834385  0.004047811
## 10 10 -0.7917472  0.3496248 -0.620695865


More on merging data