How To Use Data.Tables
- 1 Create a data.table
- 2 Listing data.tables
- 3 [i, j, by]
- 4 Select rows:
DT[i]
- 5 Select Columns:
DT[,j]
- 6 Selecting Rows and Columns
DT[i,j]
- 7 Perform a function on a column by the value of another column
DT[,j,by]
- 8 Perform A Function On A Column By Values In Another Column From A Subset of Rows
DT[i,j,by]
- 9 Modifying Data.Tables
DT[i,j := ]
- 10 Complete Cases - removing NA rows
- 11 Using
setkey()
to sort by a keyed column - 12 N
- 13 List
- 14 SD - ie Subset of x’s Data
- 15 Chaining multiple operations together
- 16 Set and looping in a data.table
- 17 Change column names using setnames()
- 18 Changing column order - setcolorder()
- 19 Unique
- 20 Reference Websites
There are significant speed advantages to using data.tables over data.frames. However the syntax is a bit confusing. This is an attempt to demistify data.tables.
library(data.table)
1 Create a data.table
Start by creating a few data tables.
set.seed(45L)
DT <- data.table(
V1=c(1L,2L),
V2=LETTERS[1:3],
V3=round(rnorm(4),4),
V4=1:12,
V5=21:32,
V6= LETTERS[1:12],
V7=101:112,
V8=121:132,
V9=rep(c(NA, 1,2,3), 3)
)
MTCarsDT <- data.table(mtcars)
DT
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1: 1 A 0.3408 1 21 A 101 121 NA
## 2: 2 B -0.7033 2 22 B 102 122 1
## 3: 1 C -0.3795 3 23 C 103 123 2
## 4: 2 A -0.7460 4 24 D 104 124 3
## 5: 1 B 0.3408 5 25 E 105 125 NA
## 6: 2 C -0.7033 6 26 F 106 126 1
## 7: 1 A -0.3795 7 27 G 107 127 2
## 8: 2 B -0.7460 8 28 H 108 128 3
## 9: 1 C 0.3408 9 29 I 109 129 NA
## 10: 2 A -0.7033 10 30 J 110 130 1
## 11: 1 B -0.3795 11 31 K 111 131 2
## 12: 2 C -0.7460 12 32 L 112 132 3
head(MTCarsDT)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
2 Listing data.tables
Data.table has a nice function that will list all tables in the global environment as well as their columns, and some extra information.
tables()
## NAME NROW NCOL MB COLS
## [1,] DT 12 9 1 V1,V2,V3,V4,V5,V6,V7,V8,V9
## [2,] MTCarsDT 32 11 1 mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
## KEY
## [1,]
## [2,]
## Total: 2MB
3 [i, j, by]
Data.tables allow you to find data in them and perform operations using the following syntax: DT[i, j, by]
, which means, “Take DT, subset rows using i
, then calculate j
grouped by by
.”
4 Select rows: DT[i]
How to select rows from a data.table
MTCarsDT[3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
MTCarsDT[3:5]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 2: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 3: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
MTCarsDT[c(3,5:8)]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
## 2: 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
## 3: 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
## 4: 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## 5: 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
Notice that this operation prints out the 3rd row for columns 5 through 8:
MTCarsDT[3,5:8]
## drat wt qsec vs
## 1: 3.85 2.32 18.61 1
4.1 Print rows by searching for a value in a column
Print rows with 6 cylinder cars
MTCarsDT[cyl == 6]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 5: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 6: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 7: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Print rows with 4 and 6 cylinder cars.
MTCarsDT[cyl %in% c(4,6)]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 6: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 7: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 8: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 9: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 10: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 11: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 12: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 13: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 14: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 15: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 16: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 17: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 18: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Print rows with cars that have 110 or 123 hp.
MTCarsDT[hp %in% c(110,123)]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 5: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
5 Select Columns: DT[,j]
Print hp column as a vector
MTCarsDT[,hp]
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230
## [18] 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
Print just the hp column as a data.table
head(MTCarsDT[,.(hp)])
## hp
## 1: 110
## 2: 110
## 3: 93
## 4: 110
## 5: 175
## 6: 105
Note that the expression .(hp)
is identical to list(hp)
. The period is equivalent to the list function.
head(MTCarsDT[,list(hp)])
## hp
## 1: 110
## 2: 110
## 3: 93
## 4: 110
## 5: 175
## 6: 105
Print multiple columns as a data.table
MTCarsDT[,.(hp, cyl)]
## hp cyl
## 1: 110 6
## 2: 110 6
## 3: 93 4
## 4: 110 6
## 5: 175 8
## 6: 105 6
## 7: 245 8
## 8: 62 4
## 9: 95 4
## 10: 123 6
## 11: 123 6
## 12: 180 8
## 13: 180 8
## 14: 180 8
## 15: 205 8
## 16: 215 8
## 17: 230 8
## 18: 66 4
## 19: 52 4
## 20: 65 4
## 21: 97 4
## 22: 150 8
## 23: 150 8
## 24: 245 8
## 25: 175 8
## 26: 66 4
## 27: 91 4
## 28: 113 4
## 29: 264 8
## 30: 175 6
## 31: 335 8
## 32: 109 4
## hp cyl
Note that we can also print columns by index using the index number and with=FALSE
head(MTCarsDT[, 1:3, with=FALSE])
## mpg cyl disp
## 1: 21.0 6 160
## 2: 21.0 6 160
## 3: 22.8 4 108
## 4: 21.4 6 258
## 5: 18.7 8 360
## 6: 18.1 6 225
head(MTCarsDT[, c(1:3,4,6), with=FALSE])
## mpg cyl disp hp wt
## 1: 21.0 6 160 110 2.620
## 2: 21.0 6 160 110 2.875
## 3: 22.8 4 108 93 2.320
## 4: 21.4 6 258 110 3.215
## 5: 18.7 8 360 175 3.440
## 6: 18.1 6 225 105 3.460
However this could create problems if used in code. This is from the setkey documentation:
It isn’t good programming practice, in general, to use column numbers rather than names. This is why setkey and setkeyv only accept column names. If you use column numbers then bugs (possibly silent) can more easily creep into your code as time progresses if changes are made elsewhere in your code; e.g., if you add, remove or reorder columns in a few months time, a setkey by column number will then refer to a different column, possibly returning incorrect results with no warning.
Return mean of the hp column
MTCarsDT[,mean(hp)]
## [1] 146.6875
Return mean hp and sd of mpg.
MTCarsDT[,.(mean(hp), sd(mpg))]
## V1 V2
## 1: 146.6875 6.026948
The same as above but the results have different names.
MTCarsDT[,.(horsepower=mean(hp), milespergallon=sd(mpg))]
## horsepower milespergallon
## 1: 146.6875 6.026948
This prints the horsepower for each row and repeats the values of the sd and mean of the hp column.
MTCarsDT[,.(hp, SD_HP=sd(hp), Mean_HP=mean(hp))]
## hp SD_HP Mean_HP
## 1: 110 68.56287 146.6875
## 2: 110 68.56287 146.6875
## 3: 93 68.56287 146.6875
## 4: 110 68.56287 146.6875
## 5: 175 68.56287 146.6875
## 6: 105 68.56287 146.6875
## 7: 245 68.56287 146.6875
## 8: 62 68.56287 146.6875
## 9: 95 68.56287 146.6875
## 10: 123 68.56287 146.6875
## 11: 123 68.56287 146.6875
## 12: 180 68.56287 146.6875
## 13: 180 68.56287 146.6875
## 14: 180 68.56287 146.6875
## 15: 205 68.56287 146.6875
## 16: 215 68.56287 146.6875
## 17: 230 68.56287 146.6875
## 18: 66 68.56287 146.6875
## 19: 52 68.56287 146.6875
## 20: 65 68.56287 146.6875
## 21: 97 68.56287 146.6875
## 22: 150 68.56287 146.6875
## 23: 150 68.56287 146.6875
## 24: 245 68.56287 146.6875
## 25: 175 68.56287 146.6875
## 26: 66 68.56287 146.6875
## 27: 91 68.56287 146.6875
## 28: 113 68.56287 146.6875
## 29: 264 68.56287 146.6875
## 30: 175 68.56287 146.6875
## 31: 335 68.56287 146.6875
## 32: 109 68.56287 146.6875
## hp SD_HP Mean_HP
We can use apply on a data.table to apply a function to each column
sapply(MTCarsDT, class)
## mpg cyl disp hp drat wt qsec
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## vs am gear carb
## "numeric" "numeric" "numeric" "numeric"
sapply(DT, mean) #returns errors if column is not numeric
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1.500 NA -0.372 6.500 26.500 NA 106.500 126.500 NA
The curly braces allow multiple functions. Notice that the separate functions must be placed on separate lines or separated by semicolons.
MTCarsDT[, {print(disp); plot(mpg, wt); sapply(MTCarsDT, mean)}]
## [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6
## [12] 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0
## [23] 304.0 350.0 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
## mpg cyl disp hp drat wt
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
## qsec vs am gear carb
## 17.848750 0.437500 0.406250 3.687500 2.812500
6 Selecting Rows and Columns DT[i,j]
Print rows 1:5 but only the hp, weight, and mpg columns.
MTCarsDT[1:5, .(hp,wt,mpg)]
## hp wt mpg
## 1: 110 2.620 21.0
## 2: 110 2.875 21.0
## 3: 93 2.320 22.8
## 4: 110 3.215 21.4
## 5: 175 3.440 18.7
Return cylinders, hp, mean hp and mpg for 6 cylinder cars.
MTCarsDT[cyl == 6,.(cyl,hp,MeanHorsepower = mean(hp),mpg)]
## cyl hp MeanHorsepower mpg
## 1: 6 110 122.2857 21.0
## 2: 6 110 122.2857 21.0
## 3: 6 110 122.2857 21.4
## 4: 6 105 122.2857 18.1
## 5: 6 123 122.2857 19.2
## 6: 6 123 122.2857 17.8
## 7: 6 175 122.2857 19.7
Note that the mean horsepower is only for 6 cylinder cars and doesn’t include 4 and 8 cylinder cars.
MTCarsDT[,mean(hp)]
## [1] 146.6875
7 Perform a function on a column by the value of another column DT[,j,by]
Gives the mean hp by the number of cylinders
MTCarsDT[,.(MeanHP = mean(hp)), by=cyl]
## cyl MeanHP
## 1: 6 122.28571
## 2: 4 82.63636
## 3: 8 209.21429
This will give mean time in seconds for 1/4 mile race based on both gears and cylinders.
MTCarsDT[,.(MeanQuarterMile = mean(qsec)), by=.(cyl,gear)]
## cyl gear MeanQuarterMile
## 1: 6 4 17.6700
## 2: 4 4 19.6125
## 3: 6 3 19.8300
## 4: 8 3 17.1425
## 5: 4 3 20.0100
## 6: 4 5 16.8000
## 7: 8 5 14.5500
## 8: 6 5 15.5000
7.1 Return The Number of Objects .N
Returns the number of cars based on the numbers of gears.
MTCarsDT[,.N, by=gear]
## gear N
## 1: 4 12
## 2: 3 15
## 3: 5 5
This also works.
MTCarsDT[,table(gear)]
## gear
## 3 4 5
## 15 12 5
8 Perform A Function On A Column By Values In Another Column From A Subset of Rows DT[i,j,by]
Gives the mean horsepower of the first ten cars by their number of cylinders.
MTCarsDT[1:10,.(mean_hp = mean(hp)), by=cyl]
## cyl mean_hp
## 1: 6 111.60000
## 2: 4 83.33333
## 3: 8 210.00000
9 Modifying Data.Tables DT[i,j := ]
The :=
opearator updates columns and does so invisibly. The assignment operator (DT <- DT[.....]
) is unnecessary. For more on :=.
DT[,V8]
## [1] 121 122 123 124 125 126 127 128 129 130 131 132
DT[, V8 := round(exp(V3),2)]
DT[,V8]
## [1] 1.41 0.49 0.68 0.47 1.41 0.49 0.68 0.47 1.41 0.49 0.68 0.47
Replace NA’s in V9 column with 0
DT[,V9]
## [1] NA 1 2 3 NA 1 2 3 NA 1 2 3
DT[is.na(V9), V9 := 0]
DT[1:5]
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1: 1 A 0.3408 1 21 A 101 1.41 0
## 2: 2 B -0.7033 2 22 B 102 0.49 1
## 3: 1 C -0.3795 3 23 C 103 0.68 2
## 4: 2 A -0.7460 4 24 D 104 0.47 3
## 5: 1 B 0.3408 5 25 E 105 1.41 0
Columns V7 and V8 are updated with the results of the functions after the := operator.
DT[,.(V7,V6)]
## V7 V6
## 1: 101 A
## 2: 102 B
## 3: 103 C
## 4: 104 D
## 5: 105 E
## 6: 106 F
## 7: 107 G
## 8: 108 H
## 9: 109 I
## 10: 110 J
## 11: 111 K
## 12: 112 L
DT[, c("V7","V6") := .(round(exp(V1),2), LETTERS [4:6])]
DT[,.(V7,V6)]
## V7 V6
## 1: 2.72 D
## 2: 7.39 E
## 3: 2.72 F
## 4: 7.39 D
## 5: 2.72 E
## 6: 7.39 F
## 7: 2.72 D
## 8: 7.39 E
## 9: 2.72 F
## 10: 7.39 D
## 11: 2.72 E
## 12: 7.39 F
Adding brackets ([]
) to the end of the operation will print the result automatically.
DT[, c("V7","V6") := .(round(exp(V1),2), LETTERS [4:6])][]
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1: 1 A 0.3408 1 21 D 2.72 1.41 0
## 2: 2 B -0.7033 2 22 E 7.39 0.49 1
## 3: 1 C -0.3795 3 23 F 2.72 0.68 2
## 4: 2 A -0.7460 4 24 D 7.39 0.47 3
## 5: 1 B 0.3408 5 25 E 2.72 1.41 0
## 6: 2 C -0.7033 6 26 F 7.39 0.49 1
## 7: 1 A -0.3795 7 27 D 2.72 0.68 2
## 8: 2 B -0.7460 8 28 E 7.39 0.47 3
## 9: 1 C 0.3408 9 29 F 2.72 1.41 0
## 10: 2 A -0.7033 10 30 D 7.39 0.49 1
## 11: 1 B -0.3795 11 31 E 2.72 0.68 2
## 12: 2 C -0.7460 12 32 F 7.39 0.47 3
This deletes the V1 column
DT[, V1 := NULL][]
## V2 V3 V4 V5 V6 V7 V8 V9
## 1: A 0.3408 1 21 D 2.72 1.41 0
## 2: B -0.7033 2 22 E 7.39 0.49 1
## 3: C -0.3795 3 23 F 2.72 0.68 2
## 4: A -0.7460 4 24 D 7.39 0.47 3
## 5: B 0.3408 5 25 E 2.72 1.41 0
## 6: C -0.7033 6 26 F 7.39 0.49 1
## 7: A -0.3795 7 27 D 2.72 0.68 2
## 8: B -0.7460 8 28 E 7.39 0.47 3
## 9: C 0.3408 9 29 F 2.72 1.41 0
## 10: A -0.7033 10 30 D 7.39 0.49 1
## 11: B -0.3795 11 31 E 2.72 0.68 2
## 12: C -0.7460 12 32 F 7.39 0.47 3
And this deletes V7 and V8
DT[, c("V7","V8") := NULL][]
## V2 V3 V4 V5 V6 V9
## 1: A 0.3408 1 21 D 0
## 2: B -0.7033 2 22 E 1
## 3: C -0.3795 3 23 F 2
## 4: A -0.7460 4 24 D 3
## 5: B 0.3408 5 25 E 0
## 6: C -0.7033 6 26 F 1
## 7: A -0.3795 7 27 D 2
## 8: B -0.7460 8 28 E 3
## 9: C 0.3408 9 29 F 0
## 10: A -0.7033 10 30 D 1
## 11: B -0.3795 11 31 E 2
## 12: C -0.7460 12 32 F 3
Find the values in column V2 = A and assign X to them.
DT[V2 == "A", V2 := "X"][]
## V2 V3 V4 V5 V6 V9
## 1: X 0.3408 1 21 D 0
## 2: B -0.7033 2 22 E 1
## 3: C -0.3795 3 23 F 2
## 4: X -0.7460 4 24 D 3
## 5: B 0.3408 5 25 E 0
## 6: C -0.7033 6 26 F 1
## 7: X -0.3795 7 27 D 2
## 8: B -0.7460 8 28 E 3
## 9: C 0.3408 9 29 F 0
## 10: X -0.7033 10 30 D 1
## 11: B -0.3795 11 31 E 2
## 12: C -0.7460 12 32 F 3
Add columns V7 and V8 back.
DT[,c("V7","V8") := .(101:112,121:132)][]
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: X 0.3408 1 21 D 0 101 121
## 2: B -0.7033 2 22 E 1 102 122
## 3: C -0.3795 3 23 F 2 103 123
## 4: X -0.7460 4 24 D 3 104 124
## 5: B 0.3408 5 25 E 0 105 125
## 6: C -0.7033 6 26 F 1 106 126
## 7: X -0.3795 7 27 D 2 107 127
## 8: B -0.7460 8 28 E 3 108 128
## 9: C 0.3408 9 29 F 0 109 129
## 10: X -0.7033 10 30 D 1 110 130
## 11: B -0.3795 11 31 E 2 111 131
## 12: C -0.7460 12 32 F 3 112 132
10 Complete Cases - removing NA rows
We can use complete cases
to delete all rows with NA values from any column. It’s not part of the data.table package, but it is handy.
First I’ll change the V8 column to include some NA’s, then delete all those rows with NA’s.
DT[,"V8" := rep(c(NA, 1,2,3), 3)]
DT <- DT[complete.cases(DT)]
head(DT)
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: B -0.7033 2 22 E 1 102 1
## 2: C -0.3795 3 23 F 2 103 2
## 3: X -0.7460 4 24 D 3 104 3
## 4: C -0.7033 6 26 F 1 106 1
## 5: X -0.3795 7 27 D 2 107 2
## 6: B -0.7460 8 28 E 3 108 3
11 Using setkey()
to sort by a keyed column
Setkey keys a column to sort by.
This operation sets the key to column V2 in the Data Table. Then it returns every row where V2 has the value of “B”.
setkey(DT,V2)
DT["B"]
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: B -0.7033 2 22 E 1 102 1
## 2: B -0.7460 8 28 E 3 108 3
## 3: B -0.3795 11 31 E 2 111 2
OR X and B
DT[c("B", "X")]
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: B -0.7033 2 22 E 1 102 1
## 2: B -0.7460 8 28 E 3 108 3
## 3: B -0.3795 11 31 E 2 111 2
## 4: X -0.7460 4 24 D 3 104 3
## 5: X -0.3795 7 27 D 2 107 2
## 6: X -0.7033 10 30 D 1 110 1
11.1 key()
key()
will tell us which column, if any, are set as the keyed column.
key(DT)
## [1] "V2"
haskey()
will return TRUE or FALSE if a data.table has a key assigned.
haskey(MTCarsDT)
## [1] FALSE
And our old friend tables()
will also tell us any keyed columns in any data.table in the global environment.
tables()
## NAME NROW NCOL MB COLS
## [1,] DT 9 8 1 V2,V3,V4,V5,V6,V9,V7,V8
## [2,] MTCarsDT 32 11 1 mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
## KEY
## [1,] V2
## [2,]
## Total: 2MB
11.2 Returning specified columns - mult()
The mult
command returns the row specified. The options are first, last and all. All is the default.
DT["X", mult="first"]
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: X -0.746 4 24 D 3 104 3
DT["X", mult="last"]
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: X -0.7033 10 30 D 1 110 1
DT["X", mult="all"]
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: X -0.7460 4 24 D 3 104 3
## 2: X -0.3795 7 27 D 2 107 2
## 3: X -0.7033 10 30 D 1 110 1
11.3 Nomatch
If you search for a value that doesn’t exist the data table will return with an NA row.
DT[c("X", "D")]
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: X -0.7460 4 24 D 3 104 3
## 2: X -0.3795 7 27 D 2 107 2
## 3: X -0.7033 10 30 D 1 110 1
## 4: D NA NA NA NA NA NA NA
However you can add the nomatch command and it won’t include the NA row.
DT[c("X", "D"), nomatch=0]
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: X -0.7460 4 24 D 3 104 3
## 2: X -0.3795 7 27 D 2 107 2
## 3: X -0.7033 10 30 D 1 110 1
11.4 Math Operations using setkey
Now we can perform specific operations on these rows that have been keyed. Like getting the mean hp for 6 cylinder cars.
setkey(MTCarsDT, cyl)
MTCarsDT[.(6), mean(hp)]
## [1] 122.2857
And we can return mean hp for both 4 and 6 cylinder cars…
MTCarsDT[.(c(4, 6)), mean(hp)]
## [1] 98.05556
11.5 by=.EACHI
But what if we want the mean hp for 6 cylinder cars and the mean hp for 4 cylinder cars returned separately. Use the by.=EACHI
command which will perform same operation separately for each entered value in the keyed column. For keyed columns that are characater data we can just enter a command like this example from DT:
DT[c("X", "B"), mean(V5), by=.EACHI]
## V2 V1
## 1: X 27
## 2: B 27
This however returns an error:
MTCarsDT[c(4, 6), mean(hp), by=.EACHI]
The problem is that when subsetting a keyed column that has integer data, the data.table syntax assumes that MTCarsDT[c(4, 6), mean(hp), by=.EACHI]
this is referring to the 4th and 6th rows, not to all 4 cylinder and 6 cylinder cars.
For example this operation gives the mean of rows 4 and 6.
MTCarsDT[c(4, 6), mean(hp)]
## [1] 65.5
The solution is to put the c(4,6)
into a list (see List below) which works like this:
MTCarsDT[.(c(4, 6)), mean(hp), by=.EACHI]
## cyl V1
## 1: 4 82.63636
## 2: 6 122.28571
So while this works with character data:
DT[c("X", "B"), mean(V5), by=.EACHI]
A list must be used for integer data:
MTCarsDT[.(c(4, 6)), mean(hp), by=.EACHI]
Notice also that I used this same syntax above in Math Operations using setkey
11.6 Using setkey() on Multiple Columns
We can key multiple columns, for example the cylinder and gear columns.
setkey(MTCarsDT, cyl, gear)
And then print the rows with 4 cylinders and 4 gears
MTCarsDT[.(4,4)]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 3: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 5: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 6: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 7: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 8: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Or 4 cylinders and 4 or 5 gears
MTCarsDT[.(4,c(4,5))]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 3: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 5: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 6: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 7: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 8: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## 9: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 10: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
And then return the mean hp for 4 cylinder 4 gear cars and for 4 cylinder 5 gear cars.
MTCarsDT[.(4, c(4,5)), mean(hp), by=.EACHI]
## cyl gear V1
## 1: 4 4 76
## 2: 4 5 102
12 N
.N
displays the last row.
MTCarsDT[.N]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 15 8 301 335 3.54 3.57 14.6 0 1 5 8
.N-1
displays the penultimate row and so on.
MTCarsDT[.N-1]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
MTCarsDT[.N-4]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
but ,.N
displays the number of rows
MTCarsDT[,.N]
## [1] 32
13 List
The .()
command is the same as list()
.
head(MTCarsDT[,.(cyl, disp)])
## cyl disp
## 1: 4 120.1
## 2: 4 108.0
## 3: 4 146.7
## 4: 4 140.8
## 5: 4 78.7
## 6: 4 75.7
head(MTCarsDT[,list(cyl, disp)])
## cyl disp
## 1: 4 120.1
## 2: 4 108.0
## 3: 4 146.7
## 4: 4 140.8
## 5: 4 78.7
## 6: 4 75.7
We can use this list function to get the mean hp for each combination of gears and cylinders
MTCarsDT[,mean(hp),.(cyl,gear)]
## cyl gear V1
## 1: 4 3 97.0000
## 2: 4 4 76.0000
## 3: 4 5 102.0000
## 4: 6 3 107.5000
## 5: 6 4 116.5000
## 6: 6 5 175.0000
## 7: 8 3 194.1667
## 8: 8 5 299.5000
14 SD - ie Subset of x’s Data
This will print every column except the cylinder column, but it will sort the data grouped by the number of cylinders.
MTCarsDT[,print(.SD), by=cyl]
## mpg disp hp drat wt qsec vs am gear carb
## 1: 21.5 120.1 97 3.70 2.465 20.01 1 0 3 1
## 2: 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1
## 3: 24.4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 4: 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2
## 5: 32.4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 6: 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 7: 33.9 71.1 65 4.22 1.835 19.90 1 1 4 1
## 8: 27.3 79.0 66 4.08 1.935 18.90 1 1 4 1
## 9: 21.4 121.0 109 4.11 2.780 18.60 1 1 4 2
## 10: 26.0 120.3 91 4.43 2.140 16.70 0 1 5 2
## 11: 30.4 95.1 113 3.77 1.513 16.90 1 1 5 2
## mpg disp hp drat wt qsec vs am gear carb
## 1: 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1
## 2: 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1
## 3: 21.0 160.0 110 3.90 2.620 16.46 0 1 4 4
## 4: 21.0 160.0 110 3.90 2.875 17.02 0 1 4 4
## 5: 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4
## 6: 17.8 167.6 123 3.92 3.440 18.90 1 0 4 4
## 7: 19.7 145.0 175 3.62 2.770 15.50 0 1 5 6
## mpg disp hp drat wt qsec vs am gear carb
## 1: 18.7 360.0 175 3.15 3.440 17.02 0 0 3 2
## 2: 14.3 360.0 245 3.21 3.570 15.84 0 0 3 4
## 3: 16.4 275.8 180 3.07 4.070 17.40 0 0 3 3
## 4: 17.3 275.8 180 3.07 3.730 17.60 0 0 3 3
## 5: 15.2 275.8 180 3.07 3.780 18.00 0 0 3 3
## 6: 10.4 472.0 205 2.93 5.250 17.98 0 0 3 4
## 7: 10.4 460.0 215 3.00 5.424 17.82 0 0 3 4
## 8: 14.7 440.0 230 3.23 5.345 17.42 0 0 3 4
## 9: 15.5 318.0 150 2.76 3.520 16.87 0 0 3 2
## 10: 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
## 11: 13.3 350.0 245 3.73 3.840 15.41 0 0 3 4
## 12: 19.2 400.0 175 3.08 3.845 17.05 0 0 3 2
## 13: 15.8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 14: 15.0 301.0 335 3.54 3.570 14.60 0 1 5 8
## Empty data.table (0 rows) of 1 col: cyl
And you can select all the data by cylinder.
MTCarsDT[.(6),print(.SD), by=cyl]
## mpg disp hp drat wt qsec vs am gear carb
## 1: 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1
## 2: 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1
## 3: 21.0 160.0 110 3.90 2.620 16.46 0 1 4 4
## 4: 21.0 160.0 110 3.90 2.875 17.02 0 1 4 4
## 5: 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4
## 6: 17.8 167.6 123 3.92 3.440 18.90 1 0 4 4
## 7: 19.7 145.0 175 3.62 2.770 15.50 0 1 5 6
## Empty data.table (0 rows) of 1 col: cyl
This prints the first and last row grouped by the number of cylinders in the data set.
MTCarsDT[,.SD[c(1,.N)], by=cyl]
## cyl mpg disp hp drat wt qsec vs am gear carb
## 1: 4 21.5 120.1 97 3.70 2.465 20.01 1 0 3 1
## 2: 4 30.4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 3: 6 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4: 6 19.7 145.0 175 3.62 2.770 15.50 0 1 5 6
## 5: 8 18.7 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6: 8 15.0 301.0 335 3.54 3.570 14.60 0 1 5 8
14.1 lapply
Using lapply() we can get the mean of every column grouped by the number of cylinders.
MTCarsDT[, lapply(.SD, mean), by=cyl]
## cyl mpg disp hp drat wt qsec vs
## 1: 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909
## 2: 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
## 3: 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
## am gear carb
## 1: 0.7272727 4.090909 1.545455
## 2: 0.4285714 3.857143 3.428571
## 3: 0.1428571 3.285714 3.500000
14.2 SDcols
We can select specifically the columns we want to calculate and display using .SDcols
.
MTCarsDT[, lapply(.SD,mean), by=cyl, .SDcols = c("wt","hp", "disp")]
## cyl wt hp disp
## 1: 4 2.285727 82.63636 105.1364
## 2: 6 3.117143 122.28571 183.3143
## 3: 8 3.999214 209.21429 353.1000
Or we can use .SDcols
to specify the column numbers that we want to display.
MTCarsDT[, lapply(.SD,mean), by=cyl, .SDcols = 3:7]
## cyl disp hp drat wt qsec
## 1: 4 105.1364 82.63636 4.070909 2.285727 19.13727
## 2: 6 183.3143 122.28571 3.585714 3.117143 17.97714
## 3: 8 353.1000 209.21429 3.229286 3.999214 16.77214
And we can specify a several columns in a sequence by using pasteO
as long as the columns are numbered sequentially.
DT[,lapply(.SD,sum), by=V2, .SDcols = paste0("V",3:5)]
## V2 V3 V4 V5
## 1: B -1.8288 21 81
## 2: C -1.8288 21 81
## 3: X -1.8288 21 81
15 Chaining multiple operations together
Chaining allows you to perform multiple functions in one statement.
This set of operations will print the mean horsepower for cars, grouped by cylinder, that have more than 100 hp. But this creates a new data.table and is unnecessary.
MTCarsDT2 <- MTCarsDT[, .(mean.hp = mean(hp)), by=cyl]
MTCarsDT2
## cyl mean.hp
## 1: 4 82.63636
## 2: 6 122.28571
## 3: 8 209.21429
MTCarsDT2[mean.hp > 100]
## cyl mean.hp
## 1: 6 122.2857
## 2: 8 209.2143
This is the easier way.
MTCarsDT[, .(mean.hp = mean(hp)), by=cyl][mean.hp > 100]
## cyl mean.hp
## 1: 6 122.2857
## 2: 8 209.2143
16 Set and looping in a data.table
Set can be used to assign values in a data.table. Normally the := operation is better but set yields faster results in a for loop than any other function, so if you must loop, use set().
Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more flexible than set() because := is intended to be combined with i and by in single queries on large datasets.
Remember our friend DT?
DT
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: B -0.7033 2 22 E 1 102 1
## 2: B -0.7460 8 28 E 3 108 3
## 3: B -0.3795 11 31 E 2 111 2
## 4: C -0.3795 3 23 F 2 103 2
## 5: C -0.7033 6 26 F 1 106 1
## 6: C -0.7460 12 32 F 3 112 3
## 7: X -0.7460 4 24 D 3 104 3
## 8: X -0.3795 7 27 D 2 107 2
## 9: X -0.7033 10 30 D 1 110 1
This is the syntax for set : for (i in from:to) set(DT, row, column, new value)
And this will renumber the V9 column:
for (i in 1:9) set(DT,i,"V9",i)
head(DT)
## V2 V3 V4 V5 V6 V9 V7 V8
## 1: B -0.7033 2 22 E 1 102 1
## 2: B -0.7460 8 28 E 2 108 3
## 3: B -0.3795 11 31 E 3 111 2
## 4: C -0.3795 3 23 F 4 103 2
## 5: C -0.7033 6 26 F 5 106 1
## 6: C -0.7460 12 32 F 6 112 3
An example of the speed of using set.
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
dim(DT)
## [1] 100000 100
Now we can run a speed test on the different methods of assigning the value of i to the i’th row of the first column.
system.time(for (i in 1:100000) DF[i,1] <- i)
## user system elapsed
## 40.674 50.405 93.783
system.time(for (i in 1:100000) DT[i,V1:=i])
## user system elapsed
## 31.554 0.328 33.284
system.time(for (i in 1:100000) set(DT,i,1L,i))
## user system elapsed
## 0.241 0.002 0.243
So you can see there are big speed advantages to using set() over the assignment operator <- or the assignment by reference operator := .
Also notice the the use of 1L
to select the first column in the set command of the final speed test. See here and here for a discussion of the use of 1 vs 1L for integers in R.
17 Change column names using setnames()
The syntax is setnames(DT, “oldname”, “newname”)
colnames(MTCarsDT)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
setnames(MTCarsDT, "hp", "horsepower")
colnames(MTCarsDT)
## [1] "mpg" "cyl" "disp" "horsepower" "drat"
## [6] "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
You can also change multiple columnames at the same time.
setnames(MTCarsDT, c("cyl", "disp"), c("cylinders", "displacement"))
colnames(MTCarsDT)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "drat" "wt" "qsec" "vs"
## [9] "am" "gear" "carb"
18 Changing column order - setcolorder()
We can create a new column order by
MTCarsColumns <- c("drat","wt","qsec","vs","am","gear","carb","mpg","cylinders","displacement","horsepower")
setcolorder(MTCarsDT, MTCarsColumns)
MTCarsDT[1]
## drat wt qsec vs am gear carb mpg cylinders displacement horsepower
## 1: 3.7 2.465 20.01 1 0 3 1 21.5 4 120.1 97
or by giving it a string of names
setcolorder(MTCarsDT, c("mpg","cylinders","displacement","horsepower","drat","wt","qsec","vs","am","gear","carb"))
MTCarsDT[1]
## mpg cylinders displacement horsepower drat wt qsec vs am gear carb
## 1: 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
19 Unique
Unique returns a data.table where duplicate data, by keyed row, are removed. So here’s a new data.table, notice that rows 3 and 12 are identical to rows that come before them.
DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3),
C = rep(1:2, 6), key = "B")
DT
## A B C
## 1: 1 1 1
## 2: 1 1 2
## 3: 1 1 1
## 4: 1 2 2
## 5: 2 2 1
## 6: 2 2 2
## 7: 2 3 1
## 8: 2 3 2
## 9: 3 3 1
## 10: 3 4 2
## 11: 3 4 1
## 12: 3 4 2
Using duplicated
we can see that there are two rows TRUE
that are identical to other rows that have already been displayed.
duplicated(DT)
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] TRUE
Unique
will return a DT without the duplicates.
unique(DT)
## A B C
## 1: 1 1 1
## 2: 1 1 2
## 3: 1 2 2
## 4: 2 2 1
## 5: 2 2 2
## 6: 2 3 1
## 7: 2 3 2
## 8: 3 3 1
## 9: 3 4 2
## 10: 3 4 1
uniqueN returns the number of unique rows.
uniqueN(DT)
## [1] 10
20 Reference Websites
More info and references: