Reshape
Melt
melt() converts multiple variable columns into two columns: one holding the variable name and the other holding the value of that variable. This is an example with the mtcars data.
library(reshape2)
mtcars$cars <- rownames(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                                cars
## Mazda RX4                 Mazda RX4
## Mazda RX4 Wag         Mazda RX4 Wag
## Datsun 710               Datsun 710
## Hornet 4 Drive       Hornet 4 Drive
## Hornet Sportabout Hornet Sportabout
## Valiant                     Valiant
Notice the variables “wt”, “gear”, and “carb”.
carsData <- melt(mtcars, id.vars = c("cars", "mpg", "cyl", "disp", "hp"), measure.vars = c("wt","gear", "carb"))
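After the data has been melted, notice the new variable and value columns: each car now has three rows, one for “wt”, one for “gear”, and one for “carb”. Columns listed in neither id.vars nor measure.vars (drat, qsec, vs, and am) are dropped from the result.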
head(carsData)
##                cars  mpg cyl disp  hp variable value
## 1         Mazda RX4 21.0   6  160 110       wt 2.620
## 2     Mazda RX4 Wag 21.0   6  160 110       wt 2.875
## 3        Datsun 710 22.8   4  108  93       wt 2.320
## 4    Hornet 4 Drive 21.4   6  258 110       wt 3.215
## 5 Hornet Sportabout 18.7   8  360 175       wt 3.440
## 6           Valiant 18.1   6  225 105       wt 3.460
carsData[carsData$cars == "Mazda RX4",]
##         cars mpg cyl disp  hp variable value
## 1  Mazda RX4  21   6  160 110       wt  2.62
## 33 Mazda RX4  21   6  160 110     gear  4.00
## 65 Mazda RX4  21   6  160 110     carb  4.00
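To see what id.vars and measure.vars do in isolation, here is a minimal sketch on a small made-up data frame. The scores data and its column names are hypothetical and are not part of mtcars; the point is only that columns named in neither argument are left out of the melted result.

```r
library(reshape2)

# Hypothetical toy data: one row per student, one column per test.
scores <- data.frame(
  student = c("a", "b"),
  math    = c(90, 70),
  art     = c(60, 80),
  notes   = c("x", "y"),  # listed in neither id.vars nor measure.vars
  stringsAsFactors = FALSE
)

scoresLong <- melt(scores, id.vars = "student", measure.vars = c("math", "art"))

# Two students x two measured columns = four rows; "notes" is not carried over.
dim(scoresLong)
names(scoresLong)
```

If notes should be kept, it would need to be added to id.vars so that it is repeated alongside each melted row.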
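The same approach works with the airquality data. Here only the id columns (month and day) are given, so every other column is melted, and na.rm=TRUE drops the rows whose value is NA.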
names(airquality) <- tolower(names(airquality))
head(airquality)
##   ozone solar.r wind temp month day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
aqm <- melt(airquality, id.vars = c("month", "day"), na.rm = TRUE)
head(aqm)
##   month day variable value
## 1     5   1    ozone    41
## 2     5   2    ozone    36
## 3     5   3    ozone    12
## 4     5   4    ozone    18
## 6     5   6    ozone    28
## 7     5   7    ozone    23
acast & dcast
acast and dcast do the same job and differ only in what they return: acast returns a vector, matrix, or array, while dcast returns a data.frame.
Both take a melted data set, apply an aggregation function (such as mean) to the values in each group, and return the reshaped result.
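The difference in return type can be checked directly. The sketch below uses a small made-up long-format data frame (the name long and its values are hypothetical) and casts it both ways with the same formula.

```r
library(reshape2)

# Hypothetical long-format data: two groups, two variables, two values each.
long <- data.frame(
  group    = rep(c("g1", "g2"), each = 4),
  variable = rep(c("x", "y"), times = 4),
  value    = c(1, 10, 3, 30, 5, 50, 7, 70)
)

asFrame  <- dcast(long, group ~ variable, mean)  # data.frame; group is a column
asMatrix <- acast(long, group ~ variable, mean)  # matrix; group becomes row names

is.data.frame(asFrame)
is.matrix(asMatrix)
asMatrix["g1", "x"]  # mean of the two g1/x values, 1 and 3
```

The matrix form is convenient for numeric work such as matrix algebra, while the data.frame form keeps the grouping variable as an ordinary column.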
levels(aqm$variable)
## [1] "ozone" "solar.r" "wind" "temp"
airqualityDcast <- dcast(aqm, formula = month ~ variable, mean, margins = c("month", "variable"))
class(airqualityDcast) ; airqualityDcast
## [1] "data.frame"
##   month    ozone  solar.r      wind     temp    (all)
## 1     5 23.61538 181.2963 11.622581 65.54839 68.70696
## 2     6 29.44444 190.1667 10.266667 79.10000 87.38384
## 3     7 59.11538 216.4839  8.941935 83.90323 93.49748
## 4     8 59.96154 171.8571  8.793548 83.96774 79.71207
## 5     9 31.44828 167.4333 10.180000 76.90000 71.82689
## 6 (all) 42.12931 185.9315  9.957516 77.88235 80.05722
The formula argument in dcast/acast, e.g. “month ~ variable”, determines how the data is grouped and laid out: the variables on the left of ~ define the rows, the variables on the right define the columns, and the aggregation function (here mean) is applied to the values that fall in each cell. In the example above, month ~ variable returns the mean of ozone, solar.r, wind, and temp for each month.
This does the same thing as above but groups by the day of the month instead of by month.
aqmDayDcast <- dcast(aqm, day ~ variable, mean, margins = c("day", "variable"))
head(aqmDayDcast)
##   day    ozone  solar.r  wind temp    (all)
## 1   1 77.75000 199.0000  6.78 80.2 91.62632
## 2   2 43.00000 174.8000  9.16 80.8 78.72632
## 3   3 33.25000 177.4000  9.62 79.4 77.11053
## 4   4 62.33333 197.2500  8.62 81.8 84.00588
## 5   5 48.66667 163.3333  8.46 79.2 67.14375
## 6   6 41.50000 223.3333 12.04 79.8 76.18824
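The margins argument adds summaries computed with the same function (mean here) across all values in the rows and/or columns. Naming the column variable (“variable”) adds an “(all)” column, naming the row variable (“month”) adds an “(all)” row, and omitting margins adds neither. For example: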
dcast(aqm, month ~ variable, mean, margins = c("variable"))
##   month    ozone  solar.r      wind     temp    (all)
## 1     5 23.61538 181.2963 11.622581 65.54839 68.70696
## 2     6 29.44444 190.1667 10.266667 79.10000 87.38384
## 3     7 59.11538 216.4839  8.941935 83.90323 93.49748
## 4     8 59.96154 171.8571  8.793548 83.96774 79.71207
## 5     9 31.44828 167.4333 10.180000 76.90000 71.82689
dcast(aqm, month ~ variable, mean, margins = c("month"))
##   month    ozone  solar.r      wind     temp
## 1     5 23.61538 181.2963 11.622581 65.54839
## 2     6 29.44444 190.1667 10.266667 79.10000
## 3     7 59.11538 216.4839  8.941935 83.90323
## 4     8 59.96154 171.8571  8.793548 83.96774
## 5     9 31.44828 167.4333 10.180000 76.90000
## 6 (all) 42.12931 185.9315  9.957516 77.88235
dcast(aqm, month ~ variable, mean)
##   month    ozone  solar.r      wind     temp
## 1     5 23.61538 181.2963 11.622581 65.54839
## 2     6 29.44444 190.1667 10.266667 79.10000
## 3     7 59.11538 216.4839  8.941935 83.90323
## 4     8 59.96154 171.8571  8.793548 83.96774
## 5     9 31.44828 167.4333 10.180000 76.90000
Notice that the mean of all ozone values from the airquality dataset is the same as the ozone value in the “(all)” row returned by dcast(aqm, ...) with the margins = c("month") argument. The same holds for solar.r, wind, and temp.
mean(airquality$ozone, na.rm=TRUE); mean(airquality$solar.r, na.rm=TRUE); mean(airquality$wind, na.rm=TRUE); mean(airquality$temp, na.rm=TRUE)
## [1] 42.12931
## [1] 185.9315
## [1] 9.957516
## [1] 77.88235
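The same comparison can be made programmatically instead of by eye. This is a minimal sketch that rebuilds the melted data under new, hypothetical names (aq, aqLong, withMargin) so it runs on its own, and it assumes the “(all)” margin is the last row of the cast result, as in the output shown earlier.

```r
library(reshape2)

# Work on a copy so the built-in airquality data is left untouched.
aq <- airquality
names(aq) <- tolower(names(aq))
aqLong <- melt(aq, id.vars = c("month", "day"), na.rm = TRUE)

withMargin <- dcast(aqLong, month ~ variable, mean, margins = "month")

# Assumption: the "(all)" margin is the last row of the result.
vars      <- c("ozone", "solar.r", "wind", "temp")
marginRow <- unlist(withMargin[nrow(withMargin), vars])
direct    <- colMeans(aq[, vars], na.rm = TRUE)

all.equal(unname(marginRow), unname(direct))
```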
Plotting reshaped data
Melted data is a lot easier to plot with lattice and ggplot, because the variable column can be mapped directly to an aesthetic such as fill.
library(ggplot2)
ggplot(aqm, aes(x=month, y=value, fill=variable)) + geom_bar(stat = "identity")
Cast data isn’t as effective for showing multiple variables on one plot, because each variable sits in its own column and only one of them can be mapped to y at a time.
ggplot(airqualityDcast, aes(x=month, y=ozone)) + geom_bar(stat = "identity")
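One possible way around this limitation is to melt the cast data back into long form before plotting. The sketch below is an illustration under assumed names (aq, aqLong, monthly, monthlyLong, p are all hypothetical); it plots the monthly means of all four variables side by side.

```r
library(reshape2)
library(ggplot2)

# Rebuild the melted data on a copy of the built-in data set.
aq <- airquality
names(aq) <- tolower(names(aq))
aqLong <- melt(aq, id.vars = c("month", "day"), na.rm = TRUE)

# Wide monthly means: one column per variable, no margins.
monthly <- dcast(aqLong, month ~ variable, mean)

# Melt the wide table again so `variable` can drive the fill aesthetic.
monthlyLong <- melt(monthly, id.vars = "month")

p <- ggplot(monthlyLong, aes(x = factor(month), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge")
```

Printing p draws the chart. position = "dodge" places the bars for each variable next to each other rather than stacking them.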
head(aqm)
##   month day variable value
## 1     5   1    ozone    41
## 2     5   2    ozone    36
## 3     5   3    ozone    12
## 4     5   4    ozone    18
## 6     5   6    ozone    28
## 7     5   7    ozone    23
head(airqualityDcast)
##   month    ozone  solar.r      wind     temp    (all)
## 1     5 23.61538 181.2963 11.622581 65.54839 68.70696
## 2     6 29.44444 190.1667 10.266667 79.10000 87.38384
## 3     7 59.11538 216.4839  8.941935 83.90323 93.49748
## 4     8 59.96154 171.8571  8.793548 83.96774 79.71207
## 5     9 31.44828 167.4333 10.180000 76.90000 71.82689
## 6 (all) 42.12931 185.9315  9.957516 77.88235 80.05722
Error: Mapping a variable to y and also using stat="bin"
If geom_bar() is called with no stat argument, or with stat="bin", while a variable is mapped to y, ggplot2 throws an error. Set stat="identity" so that the bar heights are taken directly from the y values.
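As a small illustration of the fix, the sketch below builds a bar chart with stat = "identity" on a made-up data frame (means and its rounded values are hypothetical). Newer ggplot2 releases also provide geom_col(), which behaves like geom_bar(stat = "identity"); whether it is available depends on the installed version.

```r
library(ggplot2)

# Hypothetical pre-computed values: one bar height per month.
means <- data.frame(month = c(5, 6, 7), ozone = c(23.6, 29.4, 59.1))

# Bar heights come straight from the ozone column.
pIdentity <- ggplot(means, aes(x = month, y = ozone)) +
  geom_bar(stat = "identity")

# Equivalent shorthand, assuming a ggplot2 version that includes geom_col().
pCol <- ggplot(means, aes(x = month, y = ozone)) +
  geom_col()
```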