Reading

Please read and work through the examples in the following data.table vignettes on CRAN:

Additional (optional) resources are available at the end of this document.

About data.table

The data.table package in R provides an extension of thedata.frame class that aims to be both more computationally and memory efficient. It is particularly well suited for large in-memory data sets and utilizes indexed keys to allow quick search, subset, and aggregate by group operations.

Creating data.table objects

The data.table package provides a function fread for reading delimited files like read.table and readr::read_delim, but returns a data.table object instead. As with the tibble class, data.table inherits from data.frame. Here is an example using the 2014 New York City Flights data.

library(data.table)
nyc14 = fread('https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv')
class(nyc14)
## [1] "data.table" "data.frame"

You can also create a data.table like a data.frame using:

n = 1e3
data.table(a = 1:n, b=rnorm(n), c=sample(letters, n, replace=TRUE))
##          a           b c
##    1:    1  1.33346691 g
##    2:    2 -1.29141893 k
##    3:    3  0.12641955 n
##    4:    4 -0.07923505 a
##    5:    5  0.79729718 w
##   ---                   
##  996:  996 -0.91272866 i
##  997:  997  0.67084124 z
##  998:  998  0.12980401 s
##  999:  999 -2.31556698 h
## 1000: 1000 -0.38065645 p

Question: How can we view the print method for class data.table?

“Indexing” with the bracket function

The syntax for the data.tables package is inspired by the bracket ([]) notation for indexing matrices and data frames. At the same time, it aims to allow many common data operations (i.e. dplyr verbs) to be expressed within these brackets. These goals are expressed at the start of the “Introduction to data.table” vignettte.

The basic idea is DT[i, j, by] where we:

Additional operations can also be expressed within the brackets. Remember that, even for a data.frame or matrix the left bracket [ is actually a function:

`[`
## .Primitive("[")

Below are some examples on the nyc14 data.table created using fread() above:

nyc14[1:2,]
##    year month day dep_time dep_delay arr_time arr_delay cancelled carrier
## 1: 2014     1   1      914        14     1238        13         0      AA
## 2: 2014     1   1     1157        -3     1523        13         0      AA
##    tailnum flight origin dest air_time distance hour min
## 1:  N338AA      1    JFK  LAX      359     2475    9  14
## 2:  N335AA      3    JFK  LAX      363     2475   11  57

Subsetting

Find all flights from LGA to DTW:

lga_dtw = nyc14[origin == 'LGA' & dest == 'DTW',]

Get the first and last rows of lga_dtw:

lga_dtw[c(1, .N)]
##    year month day dep_time dep_delay arr_time arr_delay cancelled carrier
## 1: 2014     1   1      901        -4     1102       -11         0      DL
## 2: 2014    10  31     1106        -4     1325        15         0      MQ
##    tailnum flight origin dest air_time distance hour min
## 1:  N917DL    181    LGA  DTW       99      502    9   1
## 2:  N511MQ   3592    LGA  DTW       75      502   11   6

In the above, we used .N to index the last row. This is a special symbol defined by data.table to hold the number of rows or observations in the “current” group. “Current” here refers to the scope in which it is used.

Also, notice the difference from standard data.frame or matrix sub-setting:

# Indexing a matrix
ld_mat = as.matrix(lga_dtw)
ld_mat[c(1,nrow(ld_mat))]
## [1] "2014" "2014"
# Indexing a data.fame
ld_df = as.data.frame(lga_dtw)
dim(ld_df[1:2])  # What would happen if we called: ld_df[c(1,nrow(ld_df))] ?
## [1] 3663    2

You can also use the i clause to order a data.table:

lga_dtw[order(-month, -day, dep_time)]
##       year month day dep_time dep_delay arr_time arr_delay cancelled
##    1: 2014    10  31      718        -2      904        -9         0
##    2: 2014    10  31      900        -5     1051       -22         0
##    3: 2014    10  31      939        -6     1138        -2         0
##    4: 2014    10  31     1106        -4     1325        15         0
##    5: 2014    10  31     1113         8     1317         8         0
##   ---                                                               
## 3659: 2014     1   1     1302        17     1503        21         0
## 3660: 2014     1   1     1350        -5     1600         5         0
## 3661: 2014     1   1     1628        -1     1829        -8         0
## 3662: 2014     1   1     1920       130     2137       147         0
## 3663: 2014     1   1     2037        52     2242        52         0
##       carrier tailnum flight origin dest air_time distance hour min
##    1:      DL  N320US    831    LGA  DTW       74      502    7  18
##    2:      DL  N818DA    189    LGA  DTW       79      502    9   0
##    3:      MQ  N528MQ   3478    LGA  DTW       80      502    9  39
##    4:      MQ  N511MQ   3592    LGA  DTW       75      502   11   6
##    5:      DL  N3758Y   2449    LGA  DTW       74      502   11  13
##   ---                                                              
## 3659:      DL  N331NW   1131    LGA  DTW       98      502   13   2
## 3660:      MQ  N839MQ   3340    LGA  DTW      101      502   13  50
## 3661:      DL  N310NW   2231    LGA  DTW       98      502   16  28
## 3662:      MQ  N833MQ   3530    LGA  DTW      103      502   19  20
## 3663:      MQ  N856MQ   3603    LGA  DTW      105      502   20  37

Column Selection

Get the departure and arrival times, flight number, and carrier for all flights from LGA to DTW:

nyc14[origin == 'LGA' & dest == 'DTW', list(dep_time, arr_time, carrier, flight)]
##       dep_time arr_time carrier flight
##    1:      901     1102      DL    181
##    2:      555      745      DL    731
##    3:     1302     1503      DL   1131
##    4:     1628     1829      DL   2231
##    5:      849     1058      MQ   3478
##   ---                                 
## 3659:     1613     1757      DL   2231
## 3660:      939     1138      MQ   3478
## 3661:     1912     2104      MQ   3603
## 3662:     1346     1535      MQ   3631
## 3663:     1106     1325      MQ   3592

Notice the use of list() to select columns. A synonym for list() within data.table is .() to save typing and enhance readability:

nyc14[origin == 'LGA' & dest == 'DTW', .(dep_time, arr_time, carrier, flight)]
##       dep_time arr_time carrier flight
##    1:      901     1102      DL    181
##    2:      555      745      DL    731
##    3:     1302     1503      DL   1131
##    4:     1628     1829      DL   2231
##    5:      849     1058      MQ   3478
##   ---                                 
## 3659:     1613     1757      DL   2231
## 3660:      939     1138      MQ   3478
## 3661:     1912     2104      MQ   3603
## 3662:     1346     1535      MQ   3631
## 3663:     1106     1325      MQ   3592

Columns can also be selected using a character vector of column names.

nyc14[origin == 'LGA' & dest == 'DTW', c("dep_time", "arr_time", "carrier", "flight")]
##       dep_time arr_time carrier flight
##    1:      901     1102      DL    181
##    2:      555      745      DL    731
##    3:     1302     1503      DL   1131
##    4:     1628     1829      DL   2231
##    5:      849     1058      MQ   3478
##   ---                                 
## 3659:     1613     1757      DL   2231
## 3660:      939     1138      MQ   3478
## 3661:     1912     2104      MQ   3603
## 3662:     1346     1535      MQ   3631
## 3663:     1106     1325      MQ   3592

We can deselect columns using negation (- or !) if we also set with to false:

nyc14 = nyc14[, -c("tailnum"), with=FALSE]

nyc14 = nyc14[, !c("cancelled", "year", "day", "hour", "min"), with=FALSE]
nyc14
##         month dep_time dep_delay arr_time arr_delay carrier flight origin
##      1:     1      914        14     1238        13      AA      1    JFK
##      2:     1     1157        -3     1523        13      AA      3    JFK
##      3:     1     1902         2     2224         9      AA     21    JFK
##      4:     1      722        -8     1014       -26      AA     29    LGA
##      5:     1     1347         2     1706         1      AA    117    JFK
##     ---                                                                  
## 253312:    10     1459         1     1747       -30      UA   1744    LGA
## 253313:    10      854        -5     1147       -14      UA   1758    EWR
## 253314:    10     1102        -8     1311        16      MQ   3591    LGA
## 253315:    10     1106        -4     1325        15      MQ   3592    LGA
## 253316:    10      824        -5     1045         1      MQ   3599    LGA
##         dest air_time distance
##      1:  LAX      359     2475
##      2:  LAX      363     2475
##      3:  LAX      351     2475
##      4:  PBI      157     1035
##      5:  LAX      350     2475
##     ---                       
## 253312:  IAH      201     1416
## 253313:  IAH      189     1400
## 253314:  RDU       83      431
## 253315:  DTW       75      502
## 253316:  SDF      110      659

Computing with Columns

The j-clause can be used to compute with column variables like “dplyr::summarize”. Below, we find the upper and lower quintiles of departure delays for flights between LGA and DTW during this period:

lga_dtw[ , .(p20=quantile(dep_delay, .2), p80=quantile(dep_delay, .8))]
##    p20 p80
## 1:  -6  10

The j-clause can also be used to compute with column variables much like “dplyr::transmute”. Here, we create new columns indicating whether the arrival or departure delays were greater than 15 minutes:

nyc14[, .(delay15 = 1*{dep_delay > 15 | arr_delay > 15})]
##         delay15
##      1:       0
##      2:       0
##      3:       0
##      4:       0
##      5:       0
##     ---        
## 253312:       0
## 253313:       0
## 253314:       1
## 253315:       0
## 253316:       0

To get behavior like “dplyr::mutate” we need reference semantics.

First, using the special function “:=()” we can add the delay15 above.

nyc14[,`:=`(delay15 = 1L*{dep_delay > 15 | arr_delay > 15})]
nyc14
##         month dep_time dep_delay arr_time arr_delay carrier flight origin
##      1:     1      914        14     1238        13      AA      1    JFK
##      2:     1     1157        -3     1523        13      AA      3    JFK
##      3:     1     1902         2     2224         9      AA     21    JFK
##      4:     1      722        -8     1014       -26      AA     29    LGA
##      5:     1     1347         2     1706         1      AA    117    JFK
##     ---                                                                  
## 253312:    10     1459         1     1747       -30      UA   1744    LGA
## 253313:    10      854        -5     1147       -14      UA   1758    EWR
## 253314:    10     1102        -8     1311        16      MQ   3591    LGA
## 253315:    10     1106        -4     1325        15      MQ   3592    LGA
## 253316:    10      824        -5     1045         1      MQ   3599    LGA
##         dest air_time distance delay15
##      1:  LAX      359     2475       0
##      2:  LAX      363     2475       0
##      3:  LAX      351     2475       0
##      4:  PBI      157     1035       0
##      5:  LAX      350     2475       0
##     ---                               
## 253312:  IAH      201     1416       0
## 253313:  IAH      189     1400       0
## 253314:  RDU       83      431       1
## 253315:  DTW       75      502       0
## 253316:  SDF      110      659       0

Pay close attention to the first line of the code above. Why didn’t it print and why didn’t we need an assignment? Also, what is the role of that “L”?

Another way to accomplish this is shown below. Which do you find more intuitive?

nyc14[, delay30 := 1L*{dep_delay > 30 | arr_delay > 30}]
nyc14
##         month dep_time dep_delay arr_time arr_delay carrier flight origin
##      1:     1      914        14     1238        13      AA      1    JFK
##      2:     1     1157        -3     1523        13      AA      3    JFK
##      3:     1     1902         2     2224         9      AA     21    JFK
##      4:     1      722        -8     1014       -26      AA     29    LGA
##      5:     1     1347         2     1706         1      AA    117    JFK
##     ---                                                                  
## 253312:    10     1459         1     1747       -30      UA   1744    LGA
## 253313:    10      854        -5     1147       -14      UA   1758    EWR
## 253314:    10     1102        -8     1311        16      MQ   3591    LGA
## 253315:    10     1106        -4     1325        15      MQ   3592    LGA
## 253316:    10      824        -5     1045         1      MQ   3599    LGA
##         dest air_time distance delay15 delay30
##      1:  LAX      359     2475       0       0
##      2:  LAX      363     2475       0       0
##      3:  LAX      351     2475       0       0
##      4:  PBI      157     1035       0       0
##      5:  LAX      350     2475       0       0
##     ---                                       
## 253312:  IAH      201     1416       0       0
## 253313:  IAH      189     1400       0       0
## 253314:  RDU       83      431       1       0
## 253315:  DTW       75      502       0       0
## 253316:  SDF      110      659       0       0

by

To perform operations group-wise use a by argument after the j statement.

nyc14[ , "delay30" := 1L*{dep_delay > 30 | arr_delay > 30}]

# Find the percent of flights with delays of 30 minutes or more by carrier
nyc14[, .(del30_pct = 100*mean(delay30)), by=carrier]
##     carrier del30_pct
##  1:      AA  15.21177
##  2:      AS  13.93728
##  3:      B6  18.43117
##  4:      DL  15.69945
##  5:      EV  23.55157
##  6:      F9  34.88372
##  7:      FL  23.66107
##  8:      HA  15.00000
##  9:      MQ  18.32534
## 10:      VX  13.00813
## 11:      WN  21.30734
## 12:      UA  18.89684
## 13:      US  10.85373
## 14:      OO  18.00000

We can use a list to specify multiple grouping variables.

# Find percent of flights with delays of 30 minutes or more by carrier and origin.
nyc14[,.(del30_pct = 100*mean(delay30)), by=.(carrier, origin)]
##     carrier origin del30_pct
##  1:      AA    JFK  15.71752
##  2:      AA    LGA  14.39898
##  3:      AA    EWR  16.53454
##  4:      AS    EWR  13.93728
##  5:      B6    JFK  18.35769
##  6:      B6    EWR  17.59547
##  7:      B6    LGA  19.91224
##  8:      DL    LGA  15.09909
##  9:      DL    EWR  17.19239
## 10:      DL    JFK  15.96501
## 11:      EV    EWR  23.57032
## 12:      EV    LGA  23.69027
## 13:      F9    LGA  34.88372
## 14:      FL    LGA  23.66107
## 15:      HA    JFK  15.00000
## 16:      MQ    LGA  18.37349
## 17:      MQ    JFK  17.94636
## 18:      VX    JFK  12.93818
## 19:      VX    EWR  13.26034
## 20:      WN    EWR  21.05263
## 21:      WN    LGA  21.47992
## 22:      MQ    EWR  26.94611
## 23:      EV    JFK  21.70253
## 24:      UA    EWR  19.45798
## 25:      UA    LGA  18.08973
## 26:      UA    JFK  15.01019
## 27:      US    EWR  11.03448
## 28:      US    JFK  13.95085
## 29:      US    LGA  10.02353
## 30:      OO    LGA  18.09045
## 31:      OO    EWR   0.00000
## 32:      VX    LGA   0.00000
##     carrier origin del30_pct

Pay attention to how the result is ordered.

How does this compare to the dplyr group_by() %>% summarize() operation?

Using keyby

To order according to the values in the by argument, use keyby which sets a key with the data.table ordered by this key. More on keys can be found below.

delay_pct1 = nyc14[,.(del30_pct = 100*mean(delay30)), by=carrier]
key(delay_pct1)
## NULL
delay_pct2 = nyc14[,.(del30_pct = 100*mean(delay30)), keyby=carrier]
key(delay_pct2)
## [1] "carrier"
cbind(delay_pct1, delay_pct2)
##     carrier del30_pct carrier del30_pct
##  1:      AA  15.21177      AA  15.21177
##  2:      AS  13.93728      AS  13.93728
##  3:      B6  18.43117      B6  18.43117
##  4:      DL  15.69945      DL  15.69945
##  5:      EV  23.55157      EV  23.55157
##  6:      F9  34.88372      F9  34.88372
##  7:      FL  23.66107      FL  23.66107
##  8:      HA  15.00000      HA  15.00000
##  9:      MQ  18.32534      MQ  18.32534
## 10:      VX  13.00813      OO  18.00000
## 11:      WN  21.30734      UA  18.89684
## 12:      UA  18.89684      US  10.85373
## 13:      US  10.85373      VX  13.00813
## 14:      OO  18.00000      WN  21.30734

Chaining

As with standard data.frame indexing, we can compose data.table bracketed expressions using chaining.

## Find max departure delay by flight among all flights from LGA to DTW
## Then, select flights within the shortest 10% of max_delay
nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)), 
      by = .(carrier, flight)
     ][, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1)) 
     ][max_delay < max_delay_q10, -"max_delay_q10", with=FALSE
     ]
##    carrier flight max_delay
## 1:      DL   1107        -2
## 2:      EV   5405        -5
## 3:      DL    796        -8
## 4:      EV   5596       -10

If you prefer pipes %>% for clarity, you can use them by appending a . before the opening bracket:

# Same as above, but using magrittr::%>%
  nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)), by = .(carrier, flight)] %>%
    .[, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))] %>%
    .[max_delay < max_delay_q10, -"max_delay_q10", with=FALSE]
##    carrier flight max_delay
## 1:      DL   1107        -2
## 2:      EV   5405        -5
## 3:      DL    796        -8
## 4:      EV   5596       -10

.SD

Recall that the special symbol .N contains the number of rows in each subset defined using by or keyby.

# How many total flights by each carrier to DTW?
nyc14[dest=='DTW', .N, by=carrier]
##    carrier    N
## 1:      DL 3095
## 2:      EV 1584
## 3:      MQ 1331

There is another special symbol .SD which references the entire subset of data for each group. It is itself a data.table as we can see using the code below.

nyc14[dest=='DTW', .(rows=nrow(.SD), cols=ncol(.SD), n=.N, class = class(.SD)[1]), by=carrier]
##    carrier rows cols    n      class
## 1:      DL 3095   12 3095 data.table
## 2:      EV 1584   12 1584 data.table
## 3:      MQ 1331   12 1331 data.table

As a reminder, any valid R expression can be placed in j.

nyc14[dest=='DTW', print(.SD[1:2]), by=carrier]
##    month dep_time dep_delay arr_time arr_delay flight origin dest air_time
## 1:     1      901        -4     1102       -11    181    LGA  DTW       99
## 2:     1      555        -5      745        -7    731    LGA  DTW       93
##    distance delay30
## 1:      502       0
## 2:      502       0
##    month dep_time dep_delay arr_time arr_delay flight origin dest air_time
## 1:     1     1225        10     1428         9   4118    EWR  DTW      103
## 2:     1     2055        -4     2305        11   4247    EWR  DTW      102
##    distance delay30
## 1:      488       0
## 2:      488       0
##    month dep_time dep_delay arr_time arr_delay flight origin dest air_time
## 1:     1      849        -6     1058        -2   3478    LGA  DTW      103
## 2:     1     1920       130     2137       147   3530    LGA  DTW      103
##    distance delay30
## 1:      502       0
## 2:      502       1
## Empty data.table (0 rows) of 1 col: carrier
nyc14[dest=='DTW', cat(nrow(.SD), '\n'), by=carrier]
## 3095 
## 1584 
## 1331
## Empty data.table (0 rows) of 1 col: carrier

Notice that the grouping variale carrier is not a column in .SD.

We can pass an additional argument .SDcols to the bracketing function to limit the columns in .SD.

nyc14[dest=='DTW', .(rows=nrow(.SD), cols=ncol(.SD), n=.N), by=carrier,
      .SDcols = c("origin", "dest", "flight", "dep_time")]
##    carrier rows cols    n
## 1:      DL 3095    4 3095
## 2:      EV 1584    4 1584
## 3:      MQ 1331    4 1331
nyc14[dest=='DTW', print(.SD), by=carrier, .SDcols = c("origin", "dest", "flight", "dep_time")]
##       origin dest flight dep_time
##    1:    LGA  DTW    181      901
##    2:    LGA  DTW    731      555
##    3:    LGA  DTW   1131     1302
##    4:    JFK  DTW   2184     1601
##    5:    LGA  DTW   2231     1628
##   ---                            
## 3091:    EWR  DTW    825     1628
## 3092:    LGA  DTW    831      718
## 3093:    LGA  DTW   1131     1235
## 3094:    LGA  DTW   2131     1813
## 3095:    LGA  DTW   2231     1613
##       origin dest flight dep_time
##    1:    EWR  DTW   4118     1225
##    2:    EWR  DTW   4247     2055
##    3:    EWR  DTW   4381     1639
##    4:    EWR  DTW   5078     1250
##    5:    EWR  DTW   4246     1117
##   ---                            
## 1580:    EWR  DTW   4297      842
## 1581:    EWR  DTW   4911      910
## 1582:    EWR  DTW   4246      731
## 1583:    EWR  DTW   4247     2133
## 1584:    EWR  DTW   4911      913
##       origin dest flight dep_time
##    1:    LGA  DTW   3478      849
##    2:    LGA  DTW   3530     1920
##    3:    LGA  DTW   3603     2037
##    4:    LGA  DTW   3689     1128
##    5:    LGA  DTW   3340     1350
##   ---                            
## 1327:    LGA  DTW   3592     1058
## 1328:    LGA  DTW   3478      939
## 1329:    LGA  DTW   3603     1912
## 1330:    LGA  DTW   3631     1346
## 1331:    LGA  DTW   3592     1106
## Empty data.table (0 rows) of 1 col: carrier

This can be useful in the j statement because it allows us to use lapply or any other function that returns a list to compute on multiple columns.

# What is the mean departure & arrival delay for each flight to DTW?
nyc14[dest=='DTW', lapply(.SD, mean), by=.(origin, dest, carrier, flight),
      .SDcols = c("arr_delay", "dep_delay")]
##      origin dest carrier flight  arr_delay   dep_delay
##   1:    LGA  DTW      DL    181   5.694118  11.8176471
##   2:    LGA  DTW      DL    731  -2.637795  -0.6653543
##   3:    LGA  DTW      DL   1131   5.774648   9.3450704
##   4:    JFK  DTW      DL   2184  -8.000000  -4.0000000
##   5:    LGA  DTW      DL   2231  10.861486  17.9054054
##  ---                                                  
## 168:    EWR  DTW      DL   2509 -12.037037  -2.5555556
## 169:    EWR  DTW      EV   5283 -14.947368  -1.9473684
## 170:    EWR  DTW      EV   4886 -18.750000  -6.0000000
## 171:    LGA  DTW      EV   5596 -12.000000 -10.0000000
## 172:    EWR  DTW      EV   4911 -10.000000  -3.5000000

Columns can also be specified as ranges in .SDcols.

nyc14[dest=='DTW', lapply(.SD, mean), 
      by=.(origin, dest, carrier, flight), 
      .SDcols = arr_delay:dep_delay]
##      origin dest carrier flight  arr_delay  arr_time   dep_delay
##   1:    LGA  DTW      DL    181   5.694118 1120.9706  11.8176471
##   2:    LGA  DTW      DL    731  -2.637795  754.5906  -0.6653543
##   3:    LGA  DTW      DL   1131   5.774648 1450.4366   9.3450704
##   4:    JFK  DTW      DL   2184  -8.000000 1815.0000  -4.0000000
##   5:    LGA  DTW      DL   2231  10.861486 1872.9527  17.9054054
##  ---                                                            
## 168:    EWR  DTW      DL   2509 -12.037037 1412.0000  -2.5555556
## 169:    EWR  DTW      EV   5283 -14.947368 1239.1579  -1.9473684
## 170:    EWR  DTW      EV   4886 -18.750000 1398.2500  -6.0000000
## 171:    LGA  DTW      EV   5596 -12.000000 1018.0000 -10.0000000
## 172:    EWR  DTW      EV   4911 -10.000000 1087.0000  -3.5000000

Because .SDcols takes a character vector it is often useful to construct it programmatically from the names() of the data.table object.

delay_cols = names(nyc14)[ grep('delay', names(nyc14)) ]
delay_stats = 
  nyc14[dest=='DTW', c( lapply(.SD, mean),
                        lapply(.SD, sd) 
                      ), keyby = .(carrier), .SDcols = delay_cols]

delay_stats
##    carrier dep_delay arr_delay   delay30 dep_delay arr_delay   delay30
## 1:      DL  9.406785  3.074960 0.1350565  37.47738  40.83548 0.3418392
## 2:      EV 17.391414 13.052399 0.2468434  43.13941  46.21437 0.4313110
## 3:      MQ  4.903080  4.510143 0.1367393  26.47499  30.55465 0.3437011
new_names = 
 c( key(delay_stats),
    paste(delay_cols, 'mean', sep='_'),
    paste(delay_cols, 'sd', sep='_') 
 )
setnames(delay_stats,  new_names)

In this example, note that setnames() like all set* functions in data.table updates in place by reference.

Copies

One of the goals of the data.table package is to use memory efficiently. This is achieved in part by preferring “shallow” copies by reference over “deep copies” by value when appropriate. When an object is copied by reference it shares physical memory address with the object it is copied from. This is more efficient, but may lead to confusion as changing the value in memory also changes what is pointed to by both objects.

In the example below, we create a data.table DT1 and then assign it to DT2. Typical R objects would be copied by value using “copy on modify” semantics, but DT2 is copied by reference. We can ask for a copy by value explicitly using copy().

DT1 = data.table(A=5:1, B=letters[5:1])
DT2 = DT1         # Copy by reference
DT3 = copy(DT1)   # Copy by value
DT4 = DT1[,.SD]   # Value or reference? See below.
  
DT1[,C := 2*A]    # Create a new column 
DT1
##    A B  C
## 1: 5 e 10
## 2: 4 d  8
## 3: 3 c  6
## 4: 2 b  4
## 5: 1 a  2
DT2 
##    A B  C
## 1: 5 e 10
## 2: 4 d  8
## 3: 3 c  6
## 4: 2 b  4
## 5: 1 a  2
DT3
##    A B
## 1: 5 e
## 2: 4 d
## 3: 3 c
## 4: 2 b
## 5: 1 a
DT4                # What do we learn about the DT1[,.SD] syntax? 
##    A B
## 1: 5 e
## 2: 4 d
## 3: 3 c
## 4: 2 b
## 5: 1 a

After updating DT1 to include a new column, C, the column appears in DT2 as well because DT1 and DT2 refer to the same object.

How can we undestand the DT4 example? Does it make sense that it works as it does?

Reference Semantics

In the last example above we used reference semantics to create a new column in DT1 without copying the other other columns and reassigning to a new DT1 object.

A similar notation using := as a function is below.

DT1[ , `:=`( D = C + 1 ) ]

One way in which this is useful is to modify subsets of a data.table without re-allocating the entire thing.

# Truncate all arrival delays at 0
range(nyc14$arr_delay)
## [1] -112 1494
# Address for nyc14 location in memory
tracemem(nyc14$arr_delay)
## [1] "<0x10be25000>"
nyc14[arr_delay < 0, arr_delay := 0]
range(nyc14$arr_delay)
## [1]    0 1494
# Show memory location
tracemem(nyc14$arr_delay)
## [1] "<0x10be25000>"

If we replace nyc14$arr_delay with nyc14[, 'arr_delay'] or nyc14[,.(arr_delay)] will we see the same memory location from tracemem()? If not, why not? What does this tell us about these syntaxes?

nyc_df = as.data.frame(nyc14)
tracemem(nyc_df$arr_delay)
## [1] "<0x10dcf2000>"
nyc_df$arr_delay[which(nyc_df$arr_delay<0)] = 0
## tracemem[0x10dcf2000 -> 0x10e987000]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> 
## tracemem[0x10e987000 -> 0x10ec6f000]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
tracemem(nyc_df$arr_delay)
## [1] "<0x10ec6f000>"

We can also delete columns by reference using NULL:

nyc14[,c("month") := NULL]
# i.e. nyc14$month = NULL

We can use this with by to accomplish tasks such as adding a column showing the maximum departure delay by flight.

nyc14[,max_dep_delay := max(dep_delay), by=.(carrier, flight)][]
##         dep_time dep_delay arr_time arr_delay carrier flight origin dest
##      1:      914        14     1238        13      AA      1    JFK  LAX
##      2:     1157        -3     1523        13      AA      3    JFK  LAX
##      3:     1902         2     2224         9      AA     21    JFK  LAX
##      4:      722        -8     1014         0      AA     29    LGA  PBI
##      5:     1347         2     1706         1      AA    117    JFK  LAX
##     ---                                                                 
## 253312:     1459         1     1747         0      UA   1744    LGA  IAH
## 253313:      854        -5     1147         0      UA   1758    EWR  IAH
## 253314:     1102        -8     1311        16      MQ   3591    LGA  RDU
## 253315:     1106        -4     1325        15      MQ   3592    LGA  DTW
## 253316:      824        -5     1045         1      MQ   3599    LGA  SDF
##         air_time distance delay30 max_dep_delay
##      1:      359     2475       0           156
##      2:      363     2475       0           284
##      3:      351     2475       0           848
##      4:      157     1035       0            89
##      5:      350     2475       0           248
##     ---                                        
## 253312:      201     1416       0           385
## 253313:      189     1400       0            42
## 253314:       83      431       0           240
## 253315:       75      502       0            68
## 253316:      110      659       0           121

With the functional form you can create multiple columns by reference.

nyc14[, `:=`(max_dep_delay = max(dep_delay),
             min_dep_delay = min(dep_delay)        
            ),
      by=.(carrier, flight)
     ][]
##         dep_time dep_delay arr_time arr_delay carrier flight origin dest
##      1:      914        14     1238        13      AA      1    JFK  LAX
##      2:     1157        -3     1523        13      AA      3    JFK  LAX
##      3:     1902         2     2224         9      AA     21    JFK  LAX
##      4:      722        -8     1014         0      AA     29    LGA  PBI
##      5:     1347         2     1706         1      AA    117    JFK  LAX
##     ---                                                                 
## 253312:     1459         1     1747         0      UA   1744    LGA  IAH
## 253313:      854        -5     1147         0      UA   1758    EWR  IAH
## 253314:     1102        -8     1311        16      MQ   3591    LGA  RDU
## 253315:     1106        -4     1325        15      MQ   3592    LGA  DTW
## 253316:      824        -5     1045         1      MQ   3599    LGA  SDF
##         air_time distance delay30 max_dep_delay min_dep_delay
##      1:      359     2475       0           156           -12
##      2:      363     2475       0           284           -16
##      3:      351     2475       0           848           -11
##      4:      157     1035       0            89           -15
##      5:      350     2475       0           248           -10
##     ---                                                      
## 253312:      201     1416       0           385            -7
## 253313:      189     1400       0            42            -5
## 253314:       83      431       0           240           -16
## 253315:       75      502       0            68           -15
## 253316:      110      659       0           121           -20

The last set of empty brackets above is a short-hand for a subsequent call to print(nyc14).

Keys and Indexing

The material on in this section is based on the vignette here.

Above we used “indexing” in a generic sense to mean “subsetting”. What we mean by “indexing” here is more specific and technical: we create an indexed data table by designating specific columns as keys and sorting the table by these keys to create more efficient lookups and aggregations.

Keys

We can designate one or more columns as a key in data.table using setkey().

setkey(nyc14, origin) #also, setkeyv(nyc14, "origin") if character is preferred.
key(nyc14)
## [1] "origin"

After a key has been set, we can subset in the i-statement using lists:

nyc14[.("LGA")]
##        dep_time dep_delay arr_time arr_delay carrier flight origin dest
##     1:      722        -8     1014         0      AA     29    LGA  PBI
##     2:      553        -7      739         0      AA    301    LGA  ORD
##     3:      623        -7      815         0      AA    303    LGA  ORD
##     4:      652        -8      833         0      AA    305    LGA  ORD
##     5:      738        -2      940        15      AA    307    LGA  ORD
##    ---                                                                 
## 84429:      609        24      843         0      UA   1714    LGA  IAH
## 84430:     1459         1     1747         0      UA   1744    LGA  IAH
## 84431:     1102        -8     1311        16      MQ   3591    LGA  RDU
## 84432:     1106        -4     1325        15      MQ   3592    LGA  DTW
## 84433:      824        -5     1045         1      MQ   3599    LGA  SDF
##        air_time distance delay30 max_dep_delay min_dep_delay
##     1:      157     1035       0            89           -15
##     2:      142      733       0            65           -13
##     3:      143      733       0           126           -12
##     4:      139      733       0           263           -13
##     5:      145      733       0           140           -12
##    ---                                                      
## 84429:      198     1416       0            53            -8
## 84430:      201     1416       0           385            -7
## 84431:       83      431       0           240           -16
## 84432:       75      502       0            68           -15
## 84433:      110      659       0           121           -20

We can have more than one column contribute to the order used to form the key.

# key by origin and destination
setkey(nyc14, origin, dest)
key(nyc14)
## [1] "origin" "dest"
# Get flights of any origin to DTW
microbenchmark(
 nyc14[.(c("JFK", "LGA", "EWR"), 'DTW')],
 nyc14[.(dest="DTW")],
 nyc14[dest=='DTW']
)
## Unit: microseconds
##                                     expr     min        lq      mean
##  nyc14[.(c("JFK", "LGA", "EWR"), "DTW")] 831.689  917.4880 1229.2002
##                   nyc14[.(dest = "DTW")] 609.523  676.7805  900.4494
##                     nyc14[dest == "DTW"] 926.314 1052.4400 1476.0400
##     median       uq      max neval cld
##  1250.6140 1454.796 2112.365   100  b 
##   783.5985  943.367 7475.287   100 a  
##  1343.5760 1608.982 7913.950   100   c

We can of course combine this with j and by statements.

# Find the median departure delay for all flights to DTW
nyc14[.(unique(origin),'DTW'), 
      .(med_dep_delay = as.numeric(median(dep_delay)), n=.N),
      by=.(origin, dest, flight)
     ] %>%
  .[order(origin, med_dep_delay, -n)] %>% # Order
  .[n>42]  # Limit to at least ~1 flight per week
##     origin dest flight med_dep_delay   n
##  1:    EWR  DTW   5203          -3.5  70
##  2:    EWR  DTW   4297          -3.0 188
##  3:    EWR  DTW   5067          -3.0  67
##  4:    EWR  DTW   1080          -3.0  63
##  5:    EWR  DTW   1129          -3.0  47
##  6:    EWR  DTW   4246          -2.0  78
##  7:    EWR  DTW   1214          -2.0  45
##  8:    EWR  DTW   4118           0.0 196
##  9:    EWR  DTW   3837           0.0  55
## 10:    EWR  DTW   4382           1.0 213
## 11:    EWR  DTW   4898           5.0  63
## 12:    EWR  DTW   4247           6.0  88
## 13:    EWR  DTW   4381          12.0 146
## 14:    LGA  DTW   3478          -6.0 282
## 15:    LGA  DTW   2810          -6.0 184
## 16:    LGA  DTW   3340          -5.0  77
## 17:    LGA  DTW    731          -4.0 254
## 18:    LGA  DTW   3530          -4.0 148
## 19:    LGA  DTW   3631          -4.0 132
## 20:    LGA  DTW   3689          -4.0 131
## 21:    LGA  DTW   3603          -3.0 257
## 22:    LGA  DTW    831          -3.0 240
## 23:    LGA  DTW    181          -3.0 170
## 24:    LGA  DTW   2449          -3.0 120
## 25:    LGA  DTW   1131          -2.0 284
## 26:    LGA  DTW   2131          -2.0 247
## 27:    LGA  DTW   1231          -2.0 231
## 28:    LGA  DTW    189          -2.0  93
## 29:    LGA  DTW   2231          -1.0 296
## 30:    LGA  DTW   2331          -1.0 241
## 31:    LGA  DTW   3592           0.0 119
##     origin dest flight med_dep_delay   n

In data.table when we designate columns as keys, the rows are re-ordered by reference in increasing order. This physically reorders the rows but uses the same locations in memory.

Indices

In contrast to keys, we can use secondary indices to store the row order without physically reordering the entire data.table. With an index, the original columns are left as is in memory but an attribute index is created to store the order of the data.table.

This works something like this:

# Example of a secondary index on a data-frame
df = data.frame(ID = sample(LETTERS, 100, replace=TRUE), value=rpois(100, 10))
df$index = order(df$ID)
head(df)
##   ID value index
## 1  C    12    63
## 2  Z    11    91
## 3  R    12    46
## 4  E    12    66
## 5  Q     7    83
## 6  Q    10     1
head(df[df$index, ])
##    ID value index
## 63  A     9    85
## 91  A     9    74
## 46  B     7    51
## 66  B     8     6
## 83  B    11    88
## 1   C    12    63

Here is an example using the nyc14 data to count the flights to DTW using an index to subset.

setindex(nyc14, NULL)
setindex(nyc14, dest)
indices(nyc14)
## [1] "dest"
names( attributes(nyc14) )
## [1] "names"             "class"             "row.names"        
## [4] ".internal.selfref" "sorted"            "index"
nyc14["DTW",  .N, on = 'dest']
## [1] 6010

Comparisons

What is the advantage of indexing by setting a key?

# Generate two columns of data with 10 million rows
N = 1e8
DT = data.table(group = sample(1:26, N, replace=TRUE))
DT = DT[ ,.(count=rpois(.N,group)), keyby=group]

# See size of this data
print(object.size(DT), units='MB')
## 762.9 Mb
## Unkeyd approach
tm0 = system.time({
  ans0 <- DT[group==1 | group==26, .(lambda_hat = mean(count)), by=group]
})

## Alternate unkeyed approach
tm1 = system.time({
  ans1 <- DT[, .(lambda_hat = mean(count)), by=group][group==1 | group==26]
})

# Set index
tm_index = system.time({
 setindex(DT, group) 
})

# Indexed approach
tm2 = system.time({
   ans2 <- DT[ .( c(1,26) ), .(lambda_hat = mean(count)), by=group, on = 'group']
})

# Set key
tm_key = system.time({
 setkey(DT, group)  
})

# keyed approach
tm3 = system.time({
  ans3 <- DT[ .( c(1,26) ), .(lambda_hat = mean(count)), by=group]
})

# Compare timings (seconds)
rbind(subset_first = tm0, mean_first = tm1, index = tm2, key = tm3)[,'elapsed']
## subset_first   mean_first        index          key 
##        1.727        1.040        0.158        0.166
rbind(tm_key, tm_index)[,'elapsed']
##   tm_key tm_index 
##    0.418    0.416
rm(DT,tm0,tm1,tm2)

Joining data.table’s

There is often a need to join information stored across two or more data frames. In R we have previously used dplyr::left_join() or similar *_join() functions for this. In base R two data.frames can be joined using the S3 generic merge() which calls the merge.data.frame() method.

merge()

The data.table package defines a merge.data.table() method. In addition to the tables to join, there are two key parameters: by and all.

We use by to specify which columns to join on.

nyc14[ origin == 'JFK', .N, .(carrier)] %>% 
 merge( . , nycflights13::airlines, by = 'carrier', all = TRUE) %>%
  .[order(-N)]
##     carrier     N                        name
##  1:      B6 34220             JetBlue Airways
##  2:      DL 18860        Delta Air Lines Inc.
##  3:      AA 11923      American Airlines Inc.
##  4:      MQ  5444                   Envoy Air
##  5:      UA  3924       United Air Lines Inc.
##  6:      VX  3138              Virgin America
##  7:      US  2645             US Airways Inc.
##  8:      EV  1069    ExpressJet Airlines Inc.
##  9:      HA   260      Hawaiian Airlines Inc.
## 10:      9E    NA           Endeavor Air Inc.
## 11:      AS    NA        Alaska Airlines Inc.
## 12:      F9    NA      Frontier Airlines Inc.
## 13:      FL    NA AirTran Airways Corporation
## 14:      OO    NA       SkyWest Airlines Inc.
## 15:      WN    NA      Southwest Airlines Co.
## 16:      YV    NA          Mesa Airlines Inc.

The by variables must exist in both tables. If not, use by.x and by.y instead.

We an specify inner (all=FALSE), left (all.x=TRUE), right (all.y=TRUE), or full (all=TRUE) joins using the all* parameters.

nyc14[ origin == 'JFK', .N, .(carrier)] %>% 
 merge( . , nycflights13::airlines, by = 'carrier', all = FALSE) %>%
  .[order(-N)]
##    carrier     N                     name
## 1:      B6 34220          JetBlue Airways
## 2:      DL 18860     Delta Air Lines Inc.
## 3:      AA 11923   American Airlines Inc.
## 4:      MQ  5444                Envoy Air
## 5:      UA  3924    United Air Lines Inc.
## 6:      VX  3138           Virgin America
## 7:      US  2645          US Airways Inc.
## 8:      EV  1069 ExpressJet Airlines Inc.
## 9:      HA   260   Hawaiian Airlines Inc.

The resulting merge contains all columns from both tables with duplicate names not used in by renamed using a sufix, i.e. col.x or col.y.

x = data.table( id = 0:4, letter = letters[26 - 0:4])
y = data.table( id = 1:5, letter = LETTERS[26 - 1:5])
merge(x, y, by = 'id', all = TRUE)
##    id letter.x letter.y
## 1:  0        z       NA
## 2:  1        y        Y
## 3:  2        x        X
## 4:  3        w        W
## 5:  4        v        V
## 6:  5       NA        U

Joining with []

There may be times where you wish to perform some computation using columns from two tables without the need for an explicit merge first.

In these cases you can use the DT1[DT2,] syntax for joins.

x[y, , on = 'id']
##    id letter i.letter
## 1:  1      y        Y
## 2:  2      x        X
## 3:  3      w        W
## 4:  4      v        V
## 5:  5     NA        U
y[x, , on = 'id']
##    id letter i.letter
## 1:  0     NA        z
## 2:  1      Y        y
## 3:  2      X        x
## 4:  3      W        w
## 5:  4      V        v
x[y, .(id, letter), on = 'id']
##    id letter
## 1:  1      y
## 2:  2      x
## 3:  3      w
## 4:  4      v
## 5:  5     NA
x[y, .(id, letter), on = 'id', nomatch = 0L]
##    id letter
## 1:  1      y
## 2:  2      x
## 3:  3      w
## 4:  4      v

If we are mathching on set keys, we do not need to provide these as on.

setkey(x, id)
x[y]
##    id letter i.letter
## 1:  1      y        Y
## 2:  2      x        X
## 3:  3      w        W
## 4:  4      v        V
## 5:  5     NA        U
# y[x] ## Fails because we have no key set for y
setkey(y, id)
y[x]
##    id letter i.letter
## 1:  0     NA        z
## 2:  1      Y        y
## 3:  2      X        x
## 4:  3      W        w
## 5:  4      V        v

Here is the previous flights example using this syntax.

airlines = data.table(nycflights13::airlines)
nyc14[ origin == 'JFK', .N, keyby=carrier][airlines, nomatch=0L][order(-N)]
##    carrier     N                     name
## 1:      B6 34220          JetBlue Airways
## 2:      DL 18860     Delta Air Lines Inc.
## 3:      AA 11923   American Airlines Inc.
## 4:      MQ  5444                Envoy Air
## 5:      UA  3924    United Air Lines Inc.
## 6:      VX  3138           Virgin America
## 7:      US  2645          US Airways Inc.
## 8:      EV  1069 ExpressJet Airlines Inc.
## 9:      HA   260   Hawaiian Airlines Inc.

And here is some additional formatting …

nyc14[ origin == 'JFK', .N, keyby=carrier][airlines, nomatch=0L][order(-N)] %>%
  .[,.(Carrier = name, N = format(N, big.mark = ','))] %>%
  knitr::kable( align = 'r')
Carrier N
JetBlue Airways 34,220
Delta Air Lines Inc. 18,860
American Airlines Inc. 11,923
Envoy Air 5,444
United Air Lines Inc. 3,924
Virgin America 3,138
US Airways Inc. 2,645
ExpressJet Airlines Inc. 1,069
Hawaiian Airlines Inc. 260

Reshaping data tables

We may not cover this in class, but you should read this vignette.

The important points to take away from the vignette are that:

  1. melt() reshapes wide data into long format, and
  2. dcast() reshapes long data into wide format.

Resources