How To Use Data.Tables


There are significant speed advantages to using data.tables over data.frames. However the syntax is a bit confusing. This is an attempt to demistify data.tables.

library(data.table)


1 Create a data.table


Start by creating a few data tables.


set.seed(45L)

DT <- data.table(
    V1=c(1L,2L),
    V2=LETTERS[1:3], 
    V3=round(rnorm(4),4), 
    V4=1:12, 
    V5=21:32, 
    V6= LETTERS[1:12], 
    V7=101:112, 
    V8=121:132, 
    V9=rep(c(NA, 1,2,3), 3)
    )
MTCarsDT <- data.table(mtcars)
DT
##     V1 V2      V3 V4 V5 V6  V7  V8 V9
##  1:  1  A  0.3408  1 21  A 101 121 NA
##  2:  2  B -0.7033  2 22  B 102 122  1
##  3:  1  C -0.3795  3 23  C 103 123  2
##  4:  2  A -0.7460  4 24  D 104 124  3
##  5:  1  B  0.3408  5 25  E 105 125 NA
##  6:  2  C -0.7033  6 26  F 106 126  1
##  7:  1  A -0.3795  7 27  G 107 127  2
##  8:  2  B -0.7460  8 28  H 108 128  3
##  9:  1  C  0.3408  9 29  I 109 129 NA
## 10:  2  A -0.7033 10 30  J 110 130  1
## 11:  1  B -0.3795 11 31  K 111 131  2
## 12:  2  C -0.7460 12 32  L 112 132  3
head(MTCarsDT)
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


2 Listing data.tables

Data.table has a nice function that will list all tables in the global environment as well as their columns, and some extra information.

tables()
##      NAME     NROW NCOL MB COLS                                        
## [1,] DT         12    9  1 V1,V2,V3,V4,V5,V6,V7,V8,V9                  
## [2,] MTCarsDT   32   11  1 mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
##      KEY
## [1,]    
## [2,]    
## Total: 2MB

3 [i, j, by]


Data.tables allow you to find data in them and perform operations using the following syntax: DT[i, j, by], which means, “Take DT, subset rows using i, then calculate j grouped by by.”

4 Select rows: DT[i]


How to select rows from a data.table

MTCarsDT[3]
##     mpg cyl disp hp drat   wt  qsec vs am gear carb
## 1: 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1
MTCarsDT[3:5]
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 2: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 3: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
MTCarsDT[c(3,5:8)]
##     mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## 1: 22.8   4 108.0  93 3.85 2.32 18.61  1  1    4    1
## 2: 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
## 3: 18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
## 4: 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## 5: 24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2


Notice that this operation prints out the 3rd row for columns 5 through 8:

MTCarsDT[3,5:8] 
##    drat   wt  qsec vs
## 1: 3.85 2.32 18.61  1


5 Select Columns: DT[,j]



Print hp column as a vector

MTCarsDT[,hp]
##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230
## [18]  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109

Print just the hp column as a data.table

head(MTCarsDT[,.(hp)])
##     hp
## 1: 110
## 2: 110
## 3:  93
## 4: 110
## 5: 175
## 6: 105

Note that the expression .(hp) is identical to list(hp). The period is equivalent to the list function.

head(MTCarsDT[,list(hp)])
##     hp
## 1: 110
## 2: 110
## 3:  93
## 4: 110
## 5: 175
## 6: 105

Print multiple columns as a data.table

MTCarsDT[,.(hp, cyl)]
##      hp cyl
##  1: 110   6
##  2: 110   6
##  3:  93   4
##  4: 110   6
##  5: 175   8
##  6: 105   6
##  7: 245   8
##  8:  62   4
##  9:  95   4
## 10: 123   6
## 11: 123   6
## 12: 180   8
## 13: 180   8
## 14: 180   8
## 15: 205   8
## 16: 215   8
## 17: 230   8
## 18:  66   4
## 19:  52   4
## 20:  65   4
## 21:  97   4
## 22: 150   8
## 23: 150   8
## 24: 245   8
## 25: 175   8
## 26:  66   4
## 27:  91   4
## 28: 113   4
## 29: 264   8
## 30: 175   6
## 31: 335   8
## 32: 109   4
##      hp cyl

Note that we can also print columns by index using the index number and with=FALSE

head(MTCarsDT[, 1:3, with=FALSE])
##     mpg cyl disp
## 1: 21.0   6  160
## 2: 21.0   6  160
## 3: 22.8   4  108
## 4: 21.4   6  258
## 5: 18.7   8  360
## 6: 18.1   6  225
head(MTCarsDT[, c(1:3,4,6), with=FALSE])
##     mpg cyl disp  hp    wt
## 1: 21.0   6  160 110 2.620
## 2: 21.0   6  160 110 2.875
## 3: 22.8   4  108  93 2.320
## 4: 21.4   6  258 110 3.215
## 5: 18.7   8  360 175 3.440
## 6: 18.1   6  225 105 3.460


However this could create problems if used in code. This is from the setkey documentation:

It isn’t good programming practice, in general, to use column numbers rather than names. This is why setkey and setkeyv only accept column names. If you use column numbers then bugs (possibly silent) can more easily creep into your code as time progresses if changes are made elsewhere in your code; e.g., if you add, remove or reorder columns in a few months time, a setkey by column number will then refer to a different column, possibly returning incorrect results with no warning.


Return mean of the hp column

MTCarsDT[,mean(hp)]
## [1] 146.6875

Return mean hp and sd of mpg.

MTCarsDT[,.(mean(hp), sd(mpg))]
##          V1       V2
## 1: 146.6875 6.026948

The same as above but the results have different names.

MTCarsDT[,.(horsepower=mean(hp), milespergallon=sd(mpg))]
##    horsepower milespergallon
## 1:   146.6875       6.026948

This prints the horsepower for each row and repeats the values of the sd and mean of the hp column.

MTCarsDT[,.(hp, SD_HP=sd(hp), Mean_HP=mean(hp))]
##      hp    SD_HP  Mean_HP
##  1: 110 68.56287 146.6875
##  2: 110 68.56287 146.6875
##  3:  93 68.56287 146.6875
##  4: 110 68.56287 146.6875
##  5: 175 68.56287 146.6875
##  6: 105 68.56287 146.6875
##  7: 245 68.56287 146.6875
##  8:  62 68.56287 146.6875
##  9:  95 68.56287 146.6875
## 10: 123 68.56287 146.6875
## 11: 123 68.56287 146.6875
## 12: 180 68.56287 146.6875
## 13: 180 68.56287 146.6875
## 14: 180 68.56287 146.6875
## 15: 205 68.56287 146.6875
## 16: 215 68.56287 146.6875
## 17: 230 68.56287 146.6875
## 18:  66 68.56287 146.6875
## 19:  52 68.56287 146.6875
## 20:  65 68.56287 146.6875
## 21:  97 68.56287 146.6875
## 22: 150 68.56287 146.6875
## 23: 150 68.56287 146.6875
## 24: 245 68.56287 146.6875
## 25: 175 68.56287 146.6875
## 26:  66 68.56287 146.6875
## 27:  91 68.56287 146.6875
## 28: 113 68.56287 146.6875
## 29: 264 68.56287 146.6875
## 30: 175 68.56287 146.6875
## 31: 335 68.56287 146.6875
## 32: 109 68.56287 146.6875
##      hp    SD_HP  Mean_HP


We can use apply on a data.table to apply a function to each column

sapply(MTCarsDT, class)
##       mpg       cyl      disp        hp      drat        wt      qsec 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        vs        am      gear      carb 
## "numeric" "numeric" "numeric" "numeric"
sapply(DT, mean) #returns errors if column is not numeric
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##      V1      V2      V3      V4      V5      V6      V7      V8      V9 
##   1.500      NA  -0.372   6.500  26.500      NA 106.500 126.500      NA


The curly braces allow multiple functions. Notice that the separate functions must be placed on separate lines or separated by semicolons.

MTCarsDT[, {print(disp); plot(mpg, wt); sapply(MTCarsDT, mean)}]
##  [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6
## [12] 275.8 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0
## [23] 304.0 350.0 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

##        mpg        cyl       disp         hp       drat         wt 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250 
##       qsec         vs         am       gear       carb 
##  17.848750   0.437500   0.406250   3.687500   2.812500

6 Selecting Rows and Columns DT[i,j]


Print rows 1:5 but only the hp, weight, and mpg columns.

MTCarsDT[1:5, .(hp,wt,mpg)]
##     hp    wt  mpg
## 1: 110 2.620 21.0
## 2: 110 2.875 21.0
## 3:  93 2.320 22.8
## 4: 110 3.215 21.4
## 5: 175 3.440 18.7

Return cylinders, hp, mean hp and mpg for 6 cylinder cars.

MTCarsDT[cyl == 6,.(cyl,hp,MeanHorsepower = mean(hp),mpg)]
##    cyl  hp MeanHorsepower  mpg
## 1:   6 110       122.2857 21.0
## 2:   6 110       122.2857 21.0
## 3:   6 110       122.2857 21.4
## 4:   6 105       122.2857 18.1
## 5:   6 123       122.2857 19.2
## 6:   6 123       122.2857 17.8
## 7:   6 175       122.2857 19.7

Note that the mean horsepower is only for 6 cylinder cars and doesn’t include 4 and 8 cylinder cars.

MTCarsDT[,mean(hp)]
## [1] 146.6875


7 Perform a function on a column by the value of another column DT[,j,by]


Gives the mean hp by the number of cylinders

MTCarsDT[,.(MeanHP = mean(hp)), by=cyl]
##    cyl    MeanHP
## 1:   6 122.28571
## 2:   4  82.63636
## 3:   8 209.21429

This will give mean time in seconds for 1/4 mile race based on both gears and cylinders.

MTCarsDT[,.(MeanQuarterMile = mean(qsec)), by=.(cyl,gear)]
##    cyl gear MeanQuarterMile
## 1:   6    4         17.6700
## 2:   4    4         19.6125
## 3:   6    3         19.8300
## 4:   8    3         17.1425
## 5:   4    3         20.0100
## 6:   4    5         16.8000
## 7:   8    5         14.5500
## 8:   6    5         15.5000


7.1 Return The Number of Objects .N

Returns the number of cars based on the numbers of gears.

MTCarsDT[,.N, by=gear]
##    gear  N
## 1:    4 12
## 2:    3 15
## 3:    5  5

This also works.

MTCarsDT[,table(gear)]
## gear
##  3  4  5 
## 15 12  5


8 Perform A Function On A Column By Values In Another Column From A Subset of Rows DT[i,j,by]

Gives the mean horsepower of the first ten cars by their number of cylinders.

MTCarsDT[1:10,.(mean_hp = mean(hp)), by=cyl]
##    cyl   mean_hp
## 1:   6 111.60000
## 2:   4  83.33333
## 3:   8 210.00000


9 Modifying Data.Tables DT[i,j := ]


The := opearator updates columns and does so invisibly. The assignment operator (DT <- DT[.....]) is unnecessary. For more on :=.

DT[,V8]
##  [1] 121 122 123 124 125 126 127 128 129 130 131 132
DT[, V8 := round(exp(V3),2)]

DT[,V8]
##  [1] 1.41 0.49 0.68 0.47 1.41 0.49 0.68 0.47 1.41 0.49 0.68 0.47

Replace NA’s in V9 column with 0

DT[,V9]
##  [1] NA  1  2  3 NA  1  2  3 NA  1  2  3
DT[is.na(V9), V9 := 0]
DT[1:5]
##    V1 V2      V3 V4 V5 V6  V7   V8 V9
## 1:  1  A  0.3408  1 21  A 101 1.41  0
## 2:  2  B -0.7033  2 22  B 102 0.49  1
## 3:  1  C -0.3795  3 23  C 103 0.68  2
## 4:  2  A -0.7460  4 24  D 104 0.47  3
## 5:  1  B  0.3408  5 25  E 105 1.41  0

Columns V7 and V8 are updated with the results of the functions after the := operator.

DT[,.(V7,V6)]
##      V7 V6
##  1: 101  A
##  2: 102  B
##  3: 103  C
##  4: 104  D
##  5: 105  E
##  6: 106  F
##  7: 107  G
##  8: 108  H
##  9: 109  I
## 10: 110  J
## 11: 111  K
## 12: 112  L
DT[, c("V7","V6") := .(round(exp(V1),2), LETTERS [4:6])]
DT[,.(V7,V6)]
##       V7 V6
##  1: 2.72  D
##  2: 7.39  E
##  3: 2.72  F
##  4: 7.39  D
##  5: 2.72  E
##  6: 7.39  F
##  7: 2.72  D
##  8: 7.39  E
##  9: 2.72  F
## 10: 7.39  D
## 11: 2.72  E
## 12: 7.39  F

Adding brackets ([]) to the end of the operation will print the result automatically.

DT[, c("V7","V6") := .(round(exp(V1),2), LETTERS [4:6])][]
##     V1 V2      V3 V4 V5 V6   V7   V8 V9
##  1:  1  A  0.3408  1 21  D 2.72 1.41  0
##  2:  2  B -0.7033  2 22  E 7.39 0.49  1
##  3:  1  C -0.3795  3 23  F 2.72 0.68  2
##  4:  2  A -0.7460  4 24  D 7.39 0.47  3
##  5:  1  B  0.3408  5 25  E 2.72 1.41  0
##  6:  2  C -0.7033  6 26  F 7.39 0.49  1
##  7:  1  A -0.3795  7 27  D 2.72 0.68  2
##  8:  2  B -0.7460  8 28  E 7.39 0.47  3
##  9:  1  C  0.3408  9 29  F 2.72 1.41  0
## 10:  2  A -0.7033 10 30  D 7.39 0.49  1
## 11:  1  B -0.3795 11 31  E 2.72 0.68  2
## 12:  2  C -0.7460 12 32  F 7.39 0.47  3

This deletes the V1 column

DT[, V1 := NULL][]
##     V2      V3 V4 V5 V6   V7   V8 V9
##  1:  A  0.3408  1 21  D 2.72 1.41  0
##  2:  B -0.7033  2 22  E 7.39 0.49  1
##  3:  C -0.3795  3 23  F 2.72 0.68  2
##  4:  A -0.7460  4 24  D 7.39 0.47  3
##  5:  B  0.3408  5 25  E 2.72 1.41  0
##  6:  C -0.7033  6 26  F 7.39 0.49  1
##  7:  A -0.3795  7 27  D 2.72 0.68  2
##  8:  B -0.7460  8 28  E 7.39 0.47  3
##  9:  C  0.3408  9 29  F 2.72 1.41  0
## 10:  A -0.7033 10 30  D 7.39 0.49  1
## 11:  B -0.3795 11 31  E 2.72 0.68  2
## 12:  C -0.7460 12 32  F 7.39 0.47  3

And this deletes V7 and V8

DT[, c("V7","V8") := NULL][]
##     V2      V3 V4 V5 V6 V9
##  1:  A  0.3408  1 21  D  0
##  2:  B -0.7033  2 22  E  1
##  3:  C -0.3795  3 23  F  2
##  4:  A -0.7460  4 24  D  3
##  5:  B  0.3408  5 25  E  0
##  6:  C -0.7033  6 26  F  1
##  7:  A -0.3795  7 27  D  2
##  8:  B -0.7460  8 28  E  3
##  9:  C  0.3408  9 29  F  0
## 10:  A -0.7033 10 30  D  1
## 11:  B -0.3795 11 31  E  2
## 12:  C -0.7460 12 32  F  3

Find the values in column V2 = A and assign X to them.

DT[V2 == "A", V2 := "X"][]
##     V2      V3 V4 V5 V6 V9
##  1:  X  0.3408  1 21  D  0
##  2:  B -0.7033  2 22  E  1
##  3:  C -0.3795  3 23  F  2
##  4:  X -0.7460  4 24  D  3
##  5:  B  0.3408  5 25  E  0
##  6:  C -0.7033  6 26  F  1
##  7:  X -0.3795  7 27  D  2
##  8:  B -0.7460  8 28  E  3
##  9:  C  0.3408  9 29  F  0
## 10:  X -0.7033 10 30  D  1
## 11:  B -0.3795 11 31  E  2
## 12:  C -0.7460 12 32  F  3

Add columns V7 and V8 back.

DT[,c("V7","V8") := .(101:112,121:132)][]
##     V2      V3 V4 V5 V6 V9  V7  V8
##  1:  X  0.3408  1 21  D  0 101 121
##  2:  B -0.7033  2 22  E  1 102 122
##  3:  C -0.3795  3 23  F  2 103 123
##  4:  X -0.7460  4 24  D  3 104 124
##  5:  B  0.3408  5 25  E  0 105 125
##  6:  C -0.7033  6 26  F  1 106 126
##  7:  X -0.3795  7 27  D  2 107 127
##  8:  B -0.7460  8 28  E  3 108 128
##  9:  C  0.3408  9 29  F  0 109 129
## 10:  X -0.7033 10 30  D  1 110 130
## 11:  B -0.3795 11 31  E  2 111 131
## 12:  C -0.7460 12 32  F  3 112 132

10 Complete Cases - removing NA rows

We can use complete cases to delete all rows with NA values from any column. It’s not part of the data.table package, but it is handy.


First I’ll change the V8 column to include some NA’s, then delete all those rows with NA’s.

DT[,"V8" := rep(c(NA, 1,2,3), 3)] 
DT <- DT[complete.cases(DT)]
head(DT)
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  B -0.7033  2 22  E  1 102  1
## 2:  C -0.3795  3 23  F  2 103  2
## 3:  X -0.7460  4 24  D  3 104  3
## 4:  C -0.7033  6 26  F  1 106  1
## 5:  X -0.3795  7 27  D  2 107  2
## 6:  B -0.7460  8 28  E  3 108  3

11 Using setkey() to sort by a keyed column

Setkey keys a column to sort by.
This operation sets the key to column V2 in the Data Table. Then it returns every row where V2 has the value of “B”.

setkey(DT,V2)
DT["B"]
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  B -0.7033  2 22  E  1 102  1
## 2:  B -0.7460  8 28  E  3 108  3
## 3:  B -0.3795 11 31  E  2 111  2


OR X and B


DT[c("B", "X")]
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  B -0.7033  2 22  E  1 102  1
## 2:  B -0.7460  8 28  E  3 108  3
## 3:  B -0.3795 11 31  E  2 111  2
## 4:  X -0.7460  4 24  D  3 104  3
## 5:  X -0.3795  7 27  D  2 107  2
## 6:  X -0.7033 10 30  D  1 110  1


11.1 key()

key() will tell us which column, if any, are set as the keyed column.

key(DT)
## [1] "V2"

haskey() will return TRUE or FALSE if a data.table has a key assigned.

haskey(MTCarsDT)
## [1] FALSE

And our old friend tables() will also tell us any keyed columns in any data.table in the global environment.

tables()
##      NAME     NROW NCOL MB COLS                                        
## [1,] DT          9    8  1 V2,V3,V4,V5,V6,V9,V7,V8                     
## [2,] MTCarsDT   32   11  1 mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
##      KEY
## [1,] V2 
## [2,]    
## Total: 2MB

11.2 Returning specified columns - mult()

The mult command returns the row specified. The options are first, last and all. All is the default.

DT["X", mult="first"]
##    V2     V3 V4 V5 V6 V9  V7 V8
## 1:  X -0.746  4 24  D  3 104  3
DT["X", mult="last"]
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  X -0.7033 10 30  D  1 110  1
DT["X", mult="all"]
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  X -0.7460  4 24  D  3 104  3
## 2:  X -0.3795  7 27  D  2 107  2
## 3:  X -0.7033 10 30  D  1 110  1


11.3 Nomatch


If you search for a value that doesn’t exist the data table will return with an NA row.


DT[c("X", "D")]
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  X -0.7460  4 24  D  3 104  3
## 2:  X -0.3795  7 27  D  2 107  2
## 3:  X -0.7033 10 30  D  1 110  1
## 4:  D      NA NA NA NA NA  NA NA


However you can add the nomatch command and it won’t include the NA row.


DT[c("X", "D"), nomatch=0]
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  X -0.7460  4 24  D  3 104  3
## 2:  X -0.3795  7 27  D  2 107  2
## 3:  X -0.7033 10 30  D  1 110  1


11.4 Math Operations using setkey


Now we can perform specific operations on these rows that have been keyed. Like getting the mean hp for 6 cylinder cars.

setkey(MTCarsDT, cyl)
MTCarsDT[.(6), mean(hp)]
## [1] 122.2857


And we can return mean hp for both 4 and 6 cylinder cars…

MTCarsDT[.(c(4, 6)), mean(hp)]
## [1] 98.05556


11.5 by=.EACHI


But what if we want the mean hp for 6 cylinder cars and the mean hp for 4 cylinder cars returned separately. Use the by.=EACHI command which will perform same operation separately for each entered value in the keyed column. For keyed columns that are characater data we can just enter a command like this example from DT:

DT[c("X", "B"), mean(V5), by=.EACHI]
##    V2 V1
## 1:  X 27
## 2:  B 27

This however returns an error:


MTCarsDT[c(4, 6), mean(hp), by=.EACHI]


The problem is that when subsetting a keyed column that has integer data, the data.table syntax assumes that MTCarsDT[c(4, 6), mean(hp), by=.EACHI] this is referring to the 4th and 6th rows, not to all 4 cylinder and 6 cylinder cars.

For example this operation gives the mean of rows 4 and 6.

MTCarsDT[c(4, 6), mean(hp)]
## [1] 65.5


The solution is to put the c(4,6) into a list (see List below) which works like this:

MTCarsDT[.(c(4, 6)), mean(hp), by=.EACHI]
##    cyl        V1
## 1:   4  82.63636
## 2:   6 122.28571


So while this works with character data:

  • DT[c("X", "B"), mean(V5), by=.EACHI]

A list must be used for integer data:

  • MTCarsDT[.(c(4, 6)), mean(hp), by=.EACHI]


Notice also that I used this same syntax above in Math Operations using setkey



11.6 Using setkey() on Multiple Columns


We can key multiple columns, for example the cylinder and gear columns.

setkey(MTCarsDT, cyl, gear)


And then print the rows with 4 cylinders and 4 gears

MTCarsDT[.(4,4)] 
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 2: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 3: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 4: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 5: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 6: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 7: 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 8: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2


Or 4 cylinders and 4 or 5 gears

MTCarsDT[.(4,c(4,5))]
##      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
##  1: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
##  2: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
##  3: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
##  4: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
##  5: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
##  6: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
##  7: 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
##  8: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##  9: 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 10: 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2


And then return the mean hp for 4 cylinder 4 gear cars and for 4 cylinder 5 gear cars.

MTCarsDT[.(4, c(4,5)), mean(hp), by=.EACHI]
##    cyl gear  V1
## 1:   4    4  76
## 2:   4    5 102


12 N


.N displays the last row.

MTCarsDT[.N]
##    mpg cyl disp  hp drat   wt qsec vs am gear carb
## 1:  15   8  301 335 3.54 3.57 14.6  0  1    5    8


.N-1 displays the penultimate row and so on.

MTCarsDT[.N-1]
##     mpg cyl disp  hp drat   wt qsec vs am gear carb
## 1: 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
MTCarsDT[.N-4]
##     mpg cyl disp  hp drat    wt qsec vs am gear carb
## 1: 15.2   8  304 150 3.15 3.435 17.3  0  0    3    2


but ,.N displays the number of rows

MTCarsDT[,.N]
## [1] 32

13 List

The .() command is the same as list().

head(MTCarsDT[,.(cyl, disp)])
##    cyl  disp
## 1:   4 120.1
## 2:   4 108.0
## 3:   4 146.7
## 4:   4 140.8
## 5:   4  78.7
## 6:   4  75.7
head(MTCarsDT[,list(cyl, disp)])
##    cyl  disp
## 1:   4 120.1
## 2:   4 108.0
## 3:   4 146.7
## 4:   4 140.8
## 5:   4  78.7
## 6:   4  75.7


We can use this list function to get the mean hp for each combination of gears and cylinders

MTCarsDT[,mean(hp),.(cyl,gear)]
##    cyl gear       V1
## 1:   4    3  97.0000
## 2:   4    4  76.0000
## 3:   4    5 102.0000
## 4:   6    3 107.5000
## 5:   6    4 116.5000
## 6:   6    5 175.0000
## 7:   8    3 194.1667
## 8:   8    5 299.5000


14 SD - ie Subset of x’s Data


This will print every column except the cylinder column, but it will sort the data grouped by the number of cylinders.

MTCarsDT[,print(.SD), by=cyl]
##      mpg  disp  hp drat    wt  qsec vs am gear carb
##  1: 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1
##  2: 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1
##  3: 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2
##  4: 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2
##  5: 32.4  78.7  66 4.08 2.200 19.47  1  1    4    1
##  6: 30.4  75.7  52 4.93 1.615 18.52  1  1    4    2
##  7: 33.9  71.1  65 4.22 1.835 19.90  1  1    4    1
##  8: 27.3  79.0  66 4.08 1.935 18.90  1  1    4    1
##  9: 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2
## 10: 26.0 120.3  91 4.43 2.140 16.70  0  1    5    2
## 11: 30.4  95.1 113 3.77 1.513 16.90  1  1    5    2
##     mpg  disp  hp drat    wt  qsec vs am gear carb
## 1: 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
## 2: 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1
## 3: 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
## 4: 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
## 5: 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4
## 6: 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4
## 7: 19.7 145.0 175 3.62 2.770 15.50  0  1    5    6
##      mpg  disp  hp drat    wt  qsec vs am gear carb
##  1: 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2
##  2: 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4
##  3: 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3
##  4: 17.3 275.8 180 3.07 3.730 17.60  0  0    3    3
##  5: 15.2 275.8 180 3.07 3.780 18.00  0  0    3    3
##  6: 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4
##  7: 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4
##  8: 14.7 440.0 230 3.23 5.345 17.42  0  0    3    4
##  9: 15.5 318.0 150 2.76 3.520 16.87  0  0    3    2
## 10: 15.2 304.0 150 3.15 3.435 17.30  0  0    3    2
## 11: 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4
## 12: 19.2 400.0 175 3.08 3.845 17.05  0  0    3    2
## 13: 15.8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 14: 15.0 301.0 335 3.54 3.570 14.60  0  1    5    8
## Empty data.table (0 rows) of 1 col: cyl


And you can select all the data by cylinder.

MTCarsDT[.(6),print(.SD), by=cyl]
##     mpg  disp  hp drat    wt  qsec vs am gear carb
## 1: 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
## 2: 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1
## 3: 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
## 4: 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
## 5: 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4
## 6: 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4
## 7: 19.7 145.0 175 3.62 2.770 15.50  0  1    5    6
## Empty data.table (0 rows) of 1 col: cyl


This prints the first and last row grouped by the number of cylinders in the data set.

MTCarsDT[,.SD[c(1,.N)], by=cyl]
##    cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
## 1:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1
## 2:   4 30.4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 3:   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
## 4:   6 19.7 145.0 175 3.62 2.770 15.50  0  1    5    6
## 5:   8 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6:   8 15.0 301.0 335 3.54 3.570 14.60  0  1    5    8


14.1 lapply

Using lapply() we can get the mean of every column grouped by the number of cylinders.

MTCarsDT[, lapply(.SD, mean), by=cyl]
##    cyl      mpg     disp        hp     drat       wt     qsec        vs
## 1:   4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909
## 2:   6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
## 3:   8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
##           am     gear     carb
## 1: 0.7272727 4.090909 1.545455
## 2: 0.4285714 3.857143 3.428571
## 3: 0.1428571 3.285714 3.500000


14.2 SDcols


We can select specifically the columns we want to calculate and display using .SDcols.

MTCarsDT[, lapply(.SD,mean), by=cyl, .SDcols = c("wt","hp", "disp")]
##    cyl       wt        hp     disp
## 1:   4 2.285727  82.63636 105.1364
## 2:   6 3.117143 122.28571 183.3143
## 3:   8 3.999214 209.21429 353.1000


Or we can use .SDcols to specify the column numbers that we want to display.

MTCarsDT[, lapply(.SD,mean), by=cyl, .SDcols = 3:7]
##    cyl     disp        hp     drat       wt     qsec
## 1:   4 105.1364  82.63636 4.070909 2.285727 19.13727
## 2:   6 183.3143 122.28571 3.585714 3.117143 17.97714
## 3:   8 353.1000 209.21429 3.229286 3.999214 16.77214


And we can specify a several columns in a sequence by using pasteO as long as the columns are numbered sequentially.

DT[,lapply(.SD,sum), by=V2, .SDcols = paste0("V",3:5)]
##    V2      V3 V4 V5
## 1:  B -1.8288 21 81
## 2:  C -1.8288 21 81
## 3:  X -1.8288 21 81


15 Chaining multiple operations together

Chaining allows you to perform multiple functions in one statement.


This set of operations will print the mean horsepower for cars, grouped by cylinder, that have more than 100 hp. But this creates a new data.table and is unnecessary.

MTCarsDT2 <- MTCarsDT[, .(mean.hp = mean(hp)), by=cyl]
MTCarsDT2
##    cyl   mean.hp
## 1:   4  82.63636
## 2:   6 122.28571
## 3:   8 209.21429
MTCarsDT2[mean.hp > 100]
##    cyl  mean.hp
## 1:   6 122.2857
## 2:   8 209.2143


This is the easier way.

MTCarsDT[, .(mean.hp = mean(hp)), by=cyl][mean.hp > 100]
##    cyl  mean.hp
## 1:   6 122.2857
## 2:   8 209.2143

16 Set and looping in a data.table

Set can be used to assign values in a data.table. Normally the := operation is better but set yields faster results in a for loop than any other function, so if you must loop, use set().

From inside-R:

Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more flexible than set() because := is intended to be combined with i and by in single queries on large datasets.

Remember our friend DT?

DT
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  B -0.7033  2 22  E  1 102  1
## 2:  B -0.7460  8 28  E  3 108  3
## 3:  B -0.3795 11 31  E  2 111  2
## 4:  C -0.3795  3 23  F  2 103  2
## 5:  C -0.7033  6 26  F  1 106  1
## 6:  C -0.7460 12 32  F  3 112  3
## 7:  X -0.7460  4 24  D  3 104  3
## 8:  X -0.3795  7 27  D  2 107  2
## 9:  X -0.7033 10 30  D  1 110  1


This is the syntax for set : for (i in from:to) set(DT, row, column, new value)

And this will renumber the V9 column:

for (i in 1:9) set(DT,i,"V9",i)
head(DT)
##    V2      V3 V4 V5 V6 V9  V7 V8
## 1:  B -0.7033  2 22  E  1 102  1
## 2:  B -0.7460  8 28  E  2 108  3
## 3:  B -0.3795 11 31  E  3 111  2
## 4:  C -0.3795  3 23  F  4 103  2
## 5:  C -0.7033  6 26  F  5 106  1
## 6:  C -0.7460 12 32  F  6 112  3


An example of the speed of using set.

m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)  
dim(DT)
## [1] 100000    100

Now we can run a speed test on the different methods of assigning the value of i to the i’th row of the first column.

system.time(for (i in 1:100000) DF[i,1] <- i)
##    user  system elapsed 
##  40.674  50.405  93.783
     
system.time(for (i in 1:100000) DT[i,V1:=i])
##    user  system elapsed 
##  31.554   0.328  33.284

system.time(for (i in 1:100000) set(DT,i,1L,i))
##    user  system elapsed 
##   0.241   0.002   0.243

So you can see there are big speed advantages to using set() over the assignment operator <- or the assignment by reference operator := .

Also notice the the use of 1L to select the first column in the set command of the final speed test. See here and here for a discussion of the use of 1 vs 1L for integers in R.

17 Change column names using setnames()


The syntax is setnames(DT, “oldname”, “newname”)

colnames(MTCarsDT)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
setnames(MTCarsDT, "hp", "horsepower")
colnames(MTCarsDT)
##  [1] "mpg"        "cyl"        "disp"       "horsepower" "drat"      
##  [6] "wt"         "qsec"       "vs"         "am"         "gear"      
## [11] "carb"

You can also change multiple columnames at the same time.

setnames(MTCarsDT, c("cyl", "disp"), c("cylinders", "displacement"))
colnames(MTCarsDT)
##  [1] "mpg"          "cylinders"    "displacement" "horsepower"  
##  [5] "drat"         "wt"           "qsec"         "vs"          
##  [9] "am"           "gear"         "carb"


18 Changing column order - setcolorder()

We can create a new column order by

MTCarsColumns <- c("drat","wt","qsec","vs","am","gear","carb","mpg","cylinders","displacement","horsepower") 

setcolorder(MTCarsDT, MTCarsColumns)
MTCarsDT[1]
##    drat    wt  qsec vs am gear carb  mpg cylinders displacement horsepower
## 1:  3.7 2.465 20.01  1  0    3    1 21.5         4        120.1         97


or by giving it a string of names

setcolorder(MTCarsDT, c("mpg","cylinders","displacement","horsepower","drat","wt","qsec","vs","am","gear","carb"))
MTCarsDT[1]
##     mpg cylinders displacement horsepower drat    wt  qsec vs am gear carb
## 1: 21.5         4        120.1         97  3.7 2.465 20.01  1  0    3    1


19 Unique

Unique returns a data.table where duplicate data, by keyed row, are removed. So here’s a new data.table, notice that rows 3 and 12 are identical to rows that come before them.

DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), 
                 C = rep(1:2, 6), key = "B")
DT
##     A B C
##  1: 1 1 1
##  2: 1 1 2
##  3: 1 1 1
##  4: 1 2 2
##  5: 2 2 1
##  6: 2 2 2
##  7: 2 3 1
##  8: 2 3 2
##  9: 3 3 1
## 10: 3 4 2
## 11: 3 4 1
## 12: 3 4 2

Using duplicated we can see that there are two rows TRUE that are identical to other rows that have already been displayed.

duplicated(DT)
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12]  TRUE


Unique will return a DT without the duplicates.

unique(DT)
##     A B C
##  1: 1 1 1
##  2: 1 1 2
##  3: 1 2 2
##  4: 2 2 1
##  5: 2 2 2
##  6: 2 3 1
##  7: 2 3 2
##  8: 3 3 1
##  9: 3 4 2
## 10: 3 4 1

uniqueN returns the number of unique rows.

uniqueN(DT)
## [1] 10