R’s data.table package

About `data.table`

The data.table package in R provides an extension of thedata.frame class that aims to be both more computationally and memory efficient. It is particularly well suited for large in-memory data sets and utilizes indexed keys to allow quick search, subset, and aggregate by group operations.

Creating data.table objects

The data.table package provides a function fread for reading delimited files like read.table and readr::read_delim, but returns a data.table object instead. As with the tibble class, data.table inherits from data.frame. Here is an example using the 2014 New York City Flights data.

library(data.table)
nyc14 = fread('https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv')
class(nyc14)

## [1] "data.table" "data.frame"

You can also create a data.table like a data.frame using:

n = 1e3
data.table(a = 1:n, b=rnorm(n), c=sample(letters, n, replace=TRUE))

##          a          b c
##    1:    1  1.2027769 t
##    2:    2 -1.8309957 n
##    3:    3 -0.7729531 h
##    4:    4 -1.1258606 o
##    5:    5  0.7653633 v
##   ---                  
##  996:  996 -0.8487881 x
##  997:  997 -1.8994197 t
##  998:  998 -1.6173459 n
##  999:  999  1.8103370 q
## 1000: 1000 -0.3518696 d

Question: How can we view the print method for class data.table?

“Indexing”

The syntax for the data.tables package is inspired by the bracket ([]) notation for indexing matrices and data frames. At the same time, it aims to allow many common data operations (i.e. dplyr verbs) to be expressed within these brackets. These goals are expressed at the start of the “Introduction to data.table” vignettte.

The basic idea is DT[i, j, by] where we: + subset or filter rows in the i statement, + select, transform, or create columns (variables) in the j statement + and group with the by statement.

Additional operations can also be expressed within the brackets. Remember that, even for a data.frame or matrix the left bracket [ is actually a function:

`[`

## .Primitive("[")

Below are some examples on the nyc14 data.table created using fread() above:

nyc14

##         year month day dep_time dep_delay arr_time arr_delay cancelled
##      1: 2014     1   1      914        14     1238        13         0
##      2: 2014     1   1     1157        -3     1523        13         0
##      3: 2014     1   1     1902         2     2224         9         0
##      4: 2014     1   1      722        -8     1014       -26         0
##      5: 2014     1   1     1347         2     1706         1         0
##     ---                                                               
## 253312: 2014    10  31     1459         1     1747       -30         0
## 253313: 2014    10  31      854        -5     1147       -14         0
## 253314: 2014    10  31     1102        -8     1311        16         0
## 253315: 2014    10  31     1106        -4     1325        15         0
## 253316: 2014    10  31      824        -5     1045         1         0
##         carrier tailnum flight origin dest air_time distance hour min
##      1:      AA  N338AA      1    JFK  LAX      359     2475    9  14
##      2:      AA  N335AA      3    JFK  LAX      363     2475   11  57
##      3:      AA  N327AA     21    JFK  LAX      351     2475   19   2
##      4:      AA  N3EHAA     29    LGA  PBI      157     1035    7  22
##      5:      AA  N319AA    117    JFK  LAX      350     2475   13  47
##     ---                                                              
## 253312:      UA  N23708   1744    LGA  IAH      201     1416   14  59
## 253313:      UA  N33132   1758    EWR  IAH      189     1400    8  54
## 253314:      MQ  N827MQ   3591    LGA  RDU       83      431   11   2
## 253315:      MQ  N511MQ   3592    LGA  DTW       75      502   11   6
## 253316:      MQ  N813MQ   3599    LGA  SDF      110      659    8  24

Subsetting

Find all flights from LGA to DTW:

lga_dtw = nyc14[origin == 'LGA' & dest == 'DTW',]

Get the first and last rows of lga_dtw:

lga_dtw[c(1,.N)]

##    year month day dep_time dep_delay arr_time arr_delay cancelled carrier
## 1: 2014     1   1      901        -4     1102       -11         0      DL
## 2: 2014    10  31     1106        -4     1325        15         0      MQ
##    tailnum flight origin dest air_time distance hour min
## 1:  N917DL    181    LGA  DTW       99      502    9   1
## 2:  N511MQ   3592    LGA  DTW       75      502   11   6

In the above, we used .N to index the last row. This is a special symbol defined by data.table to hold the number of rows or observations in the “current” group. “Current” here refers to the scope in which it is used.

Also, notice the difference from standard data.frame or matrix sub-setting:

# Indexing a matrix
ld_mat = as.matrix(lga_dtw)
ld_mat[c(1,nrow(ld_mat))]

## [1] "2014" "2014"

# Indexing a data.fame
ld_df = as.data.frame(lga_dtw)
dim(ld_df[1:2])  # What would happen if we called: ld_df[c(1,nrow(ld_df))] ?

## [1] 3663    2

You can also use the “i” clause to order a data.table:

lga_dtw[order(-month,-day,dep_time)]

##       year month day dep_time dep_delay arr_time arr_delay cancelled
##    1: 2014    10  31      718        -2      904        -9         0
##    2: 2014    10  31      900        -5     1051       -22         0
##    3: 2014    10  31      939        -6     1138        -2         0
##    4: 2014    10  31     1106        -4     1325        15         0
##    5: 2014    10  31     1113         8     1317         8         0
##   ---                                                               
## 3659: 2014     1   1     1302        17     1503        21         0
## 3660: 2014     1   1     1350        -5     1600         5         0
## 3661: 2014     1   1     1628        -1     1829        -8         0
## 3662: 2014     1   1     1920       130     2137       147         0
## 3663: 2014     1   1     2037        52     2242        52         0
##       carrier tailnum flight origin dest air_time distance hour min
##    1:      DL  N320US    831    LGA  DTW       74      502    7  18
##    2:      DL  N818DA    189    LGA  DTW       79      502    9   0
##    3:      MQ  N528MQ   3478    LGA  DTW       80      502    9  39
##    4:      MQ  N511MQ   3592    LGA  DTW       75      502   11   6
##    5:      DL  N3758Y   2449    LGA  DTW       74      502   11  13
##   ---                                                              
## 3659:      DL  N331NW   1131    LGA  DTW       98      502   13   2
## 3660:      MQ  N839MQ   3340    LGA  DTW      101      502   13  50
## 3661:      DL  N310NW   2231    LGA  DTW       98      502   16  28
## 3662:      MQ  N833MQ   3530    LGA  DTW      103      502   19  20
## 3663:      MQ  N856MQ   3603    LGA  DTW      105      502   20  37

Column Selection

Get the departure and arrival times, flight number, and carrier for all flights from LGA to DTW:

nyc14[origin == 'LGA' & dest == 'DTW', list(dep_time, arr_time, carrier, flight)]

##       dep_time arr_time carrier flight
##    1:      901     1102      DL    181
##    2:      555      745      DL    731
##    3:     1302     1503      DL   1131
##    4:     1628     1829      DL   2231
##    5:      849     1058      MQ   3478
##   ---                                 
## 3659:     1613     1757      DL   2231
## 3660:      939     1138      MQ   3478
## 3661:     1912     2104      MQ   3603
## 3662:     1346     1535      MQ   3631
## 3663:     1106     1325      MQ   3592

Notice the use of list() to select columns. A synonym for list() within data.table is .() to save typing and enhance readability:

nyc14[origin == 'LGA' & dest == 'DTW', .(dep_time, arr_time, carrier, flight)]

##       dep_time arr_time carrier flight
##    1:      901     1102      DL    181
##    2:      555      745      DL    731
##    3:     1302     1503      DL   1131
##    4:     1628     1829      DL   2231
##    5:      849     1058      MQ   3478
##   ---                                 
## 3659:     1613     1757      DL   2231
## 3660:      939     1138      MQ   3478
## 3661:     1912     2104      MQ   3603
## 3662:     1346     1535      MQ   3631
## 3663:     1106     1325      MQ   3592

Columns can also be selected using a character vector of column names.

nyc14[origin == 'LGA' & dest == 'DTW', c("dep_time", "arr_time", "carrier", "flight")]

##       dep_time arr_time carrier flight
##    1:      901     1102      DL    181
##    2:      555      745      DL    731
##    3:     1302     1503      DL   1131
##    4:     1628     1829      DL   2231
##    5:      849     1058      MQ   3478
##   ---                                 
## 3659:     1613     1757      DL   2231
## 3660:      939     1138      MQ   3478
## 3661:     1912     2104      MQ   3603
## 3662:     1346     1535      MQ   3631
## 3663:     1106     1325      MQ   3592

We can deselect columns using negation (- or !) if we also set “with” to false:

nyc14 = nyc14[, -c("tailnum"), with=FALSE]

nyc14 = nyc14[, !c("cancelled", "year", "day", "hour", "min"), with=FALSE]
nyc14

##         month dep_time dep_delay arr_time arr_delay carrier flight origin
##      1:     1      914        14     1238        13      AA      1    JFK
##      2:     1     1157        -3     1523        13      AA      3    JFK
##      3:     1     1902         2     2224         9      AA     21    JFK
##      4:     1      722        -8     1014       -26      AA     29    LGA
##      5:     1     1347         2     1706         1      AA    117    JFK
##     ---                                                                  
## 253312:    10     1459         1     1747       -30      UA   1744    LGA
## 253313:    10      854        -5     1147       -14      UA   1758    EWR
## 253314:    10     1102        -8     1311        16      MQ   3591    LGA
## 253315:    10     1106        -4     1325        15      MQ   3592    LGA
## 253316:    10      824        -5     1045         1      MQ   3599    LGA
##         dest air_time distance
##      1:  LAX      359     2475
##      2:  LAX      363     2475
##      3:  LAX      351     2475
##      4:  PBI      157     1035
##      5:  LAX      350     2475
##     ---                       
## 253312:  IAH      201     1416
## 253313:  IAH      189     1400
## 253314:  RDU       83      431
## 253315:  DTW       75      502
## 253316:  SDF      110      659

Computing with Columns

The “j”-clause can be used to compute with column variables like “dplyr::summarize”. Below, we find the upper and lower quintiles of departure delays for flights between LGA and DTW during this period:

lga_dtw[,.(p20=quantile(dep_delay,.2), p80=quantile(dep_delay,.8))]

##    p20 p80
## 1:  -6  10

The “j”-clause can also be used to compute with column variables much like “dplyr::transmute”. Here, we create new columns indicating whether the arrival or departure delays were greater than 15 minutes:

nyc14[,.(delay15 = 1*{dep_delay > 15 | arr_delay > 15})]

##         delay15
##      1:       0
##      2:       0
##      3:       0
##      4:       0
##      5:       0
##     ---        
## 253312:       0
## 253313:       0
## 253314:       1
## 253315:       0
## 253316:       0

To get behavior like “dplyr::mutate” we need reference semantics.

First, using the special function “:=()” we can add the delay15 above.

nyc14[,`:=`(delay15 = 1L*{dep_delay > 15 | arr_delay > 15})]
nyc14

##         month dep_time dep_delay arr_time arr_delay carrier flight origin
##      1:     1      914        14     1238        13      AA      1    JFK
##      2:     1     1157        -3     1523        13      AA      3    JFK
##      3:     1     1902         2     2224         9      AA     21    JFK
##      4:     1      722        -8     1014       -26      AA     29    LGA
##      5:     1     1347         2     1706         1      AA    117    JFK
##     ---                                                                  
## 253312:    10     1459         1     1747       -30      UA   1744    LGA
## 253313:    10      854        -5     1147       -14      UA   1758    EWR
## 253314:    10     1102        -8     1311        16      MQ   3591    LGA
## 253315:    10     1106        -4     1325        15      MQ   3592    LGA
## 253316:    10      824        -5     1045         1      MQ   3599    LGA
##         dest air_time distance delay15
##      1:  LAX      359     2475       0
##      2:  LAX      363     2475       0
##      3:  LAX      351     2475       0
##      4:  PBI      157     1035       0
##      5:  LAX      350     2475       0
##     ---                               
## 253312:  IAH      201     1416       0
## 253313:  IAH      189     1400       0
## 253314:  RDU       83      431       1
## 253315:  DTW       75      502       0
## 253316:  SDF      110      659       0

Pay close attention to the first line of the code above. Why didn’t it print and why didn’t we need an assignment? Also, what is the role of that “L”?

Another way to accomplish this is as below. Which do you find more intuitive?

nyc14[,"delay30" := 1L*{dep_delay > 30 | arr_delay > 30}]
nyc14

##         month dep_time dep_delay arr_time arr_delay carrier flight origin
##      1:     1      914        14     1238        13      AA      1    JFK
##      2:     1     1157        -3     1523        13      AA      3    JFK
##      3:     1     1902         2     2224         9      AA     21    JFK
##      4:     1      722        -8     1014       -26      AA     29    LGA
##      5:     1     1347         2     1706         1      AA    117    JFK
##     ---                                                                  
## 253312:    10     1459         1     1747       -30      UA   1744    LGA
## 253313:    10      854        -5     1147       -14      UA   1758    EWR
## 253314:    10     1102        -8     1311        16      MQ   3591    LGA
## 253315:    10     1106        -4     1325        15      MQ   3592    LGA
## 253316:    10      824        -5     1045         1      MQ   3599    LGA
##         dest air_time distance delay15 delay30
##      1:  LAX      359     2475       0       0
##      2:  LAX      363     2475       0       0
##      3:  LAX      351     2475       0       0
##      4:  PBI      157     1035       0       0
##      5:  LAX      350     2475       0       0
##     ---                                       
## 253312:  IAH      201     1416       0       0
## 253313:  IAH      189     1400       0       0
## 253314:  RDU       83      431       1       0
## 253315:  DTW       75      502       0       0
## 253316:  SDF      110      659       0       0

by

To perform operations group-wise use a by argument after the “j” statement.

nyc14[,"delay30" := 1L*{dep_delay > 30 | arr_delay > 30}]
# Find percent of flights with delays of 15 minutes or more by carrier
nyc14[,.(del30_pct = 100*mean(delay30)), by=carrier]

##     carrier del30_pct
##  1:      AA  15.21177
##  2:      AS  13.93728
##  3:      B6  18.43117
##  4:      DL  15.69945
##  5:      EV  23.55157
##  6:      F9  34.88372
##  7:      FL  23.66107
##  8:      HA  15.00000
##  9:      MQ  18.32534
## 10:      VX  13.00813
## 11:      WN  21.30734
## 12:      UA  18.89684
## 13:      US  10.85373
## 14:      OO  18.00000

We can use a list to specify multiple grouping variables.

# Find percent of flights with delays of 15 minutes or more by carrier and origin.
nyc14[,.(del30_pct = 100*mean(delay30)), by=.(carrier, origin)]

##     carrier origin del30_pct
##  1:      AA    JFK  15.71752
##  2:      AA    LGA  14.39898
##  3:      AA    EWR  16.53454
##  4:      AS    EWR  13.93728
##  5:      B6    JFK  18.35769
##  6:      B6    EWR  17.59547
##  7:      B6    LGA  19.91224
##  8:      DL    LGA  15.09909
##  9:      DL    EWR  17.19239
## 10:      DL    JFK  15.96501
## 11:      EV    EWR  23.57032
## 12:      EV    LGA  23.69027
## 13:      F9    LGA  34.88372
## 14:      FL    LGA  23.66107
## 15:      HA    JFK  15.00000
## 16:      MQ    LGA  18.37349
## 17:      MQ    JFK  17.94636
## 18:      VX    JFK  12.93818
## 19:      VX    EWR  13.26034
## 20:      WN    EWR  21.05263
## 21:      WN    LGA  21.47992
## 22:      MQ    EWR  26.94611
## 23:      EV    JFK  21.70253
## 24:      UA    EWR  19.45798
## 25:      UA    LGA  18.08973
## 26:      UA    JFK  15.01019
## 27:      US    EWR  11.03448
## 28:      US    JFK  13.95085
## 29:      US    LGA  10.02353
## 30:      OO    LGA  18.09045
## 31:      OO    EWR   0.00000
## 32:      VX    LGA   0.00000
##     carrier origin del30_pct

Chaining

As with standard data.frame indexing, we can compose data.table bracketed expressions using chaining.

## Find max departure delay by flight among all flights from LGA to DTW
## Then, select flights within the shortest 10% of max_delay
nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)), 
      by = .(carrier, flight)
     ][, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1)) 
     ][max_delay < max_delay_q10, -"max_delay_q10", with=FALSE
     ]

##    carrier flight max_delay
## 1:      DL   1107        -2
## 2:      EV   5405        -5
## 3:      DL    796        -8
## 4:      EV   5596       -10

If you prefer pipes %>% for clarity, you can use them by appending a . before the opening bracket:

# Same as above, but using magrittr::%>%
  nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)), by = .(carrier, flight)] %>%
    .[, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))] %>%
    .[max_delay < max_delay_q10, -"max_delay_q10", with=FALSE]

##    carrier flight max_delay
## 1:      DL   1107        -2
## 2:      EV   5405        -5
## 3:      DL    796        -8
## 4:      EV   5596       -10

Here is a quick timing comparison between chaining, piping, and an equivalent dplyr expression. The differences are negligible for this data size.

# data.table chaining
f_chain <- function(){
  nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)), 
      by = .(carrier, flight)
     ][, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1)) 
     ][max_delay < max_delay_q10, -"max_delay_q10", with=FALSE
     ]
}

# data.table with pipe
f_pipe <- function(){
  nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)), by = .(carrier, flight)] %>%
    .[, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))] %>%
    .[max_delay < max_delay_q10, -"max_delay_q10", with=FALSE]
}

# dplyr pipe
f_dplyr <- function(){
   nyc14 %>% 
    filter(origin=='LGA' & dest=='DTW') %>%
    group_by(carrier, flight) %>%
    summarize(max_delay = max(dep_delay)) %>%
    ungroup() %>%
    mutate(max_delay_q10 = quantile(max_delay, .1)) %>%
    filter(max_delay < max_delay_q10) %>%
    select(-max_delay_q10)
}

# compare timing
microbenchmark(f_chain, f_pipe, f_dplyr, times=1e3L)

## Unit: nanoseconds
##     expr min lq   mean median uq   max neval
##  f_chain  26 33 75.513     35 39 11949  1000
##   f_pipe  23 31 45.389     33 36   451  1000
##  f_dplyr  25 31 48.424     33 36  7176  1000

`.SD`

Recall that the special symbol .N contains the number of rows in each subset defined using by.

# How many total flights by each carrier to DTW?
nyc14[dest=='DTW', .N, by=carrier]

##    carrier    N
## 1:      DL 3095
## 2:      EV 1584
## 3:      MQ 1331

There is another special symbol .SD which references the entire subset of data for each group. It is itself a data.table as we can see using the code below.

nyc14[dest=='DTW', .(rows=nrow(.SD), cols=ncol(.SD), n=.N), by=carrier]

##    carrier rows cols    n
## 1:      DL 3095   10 3095
## 2:      EV 1584   10 1584
## 3:      MQ 1331   10 1331

nyc14[dest=='DTW', print(.SD), by=carrier]

##       month dep_time dep_delay arr_time arr_delay flight origin dest
##    1:     1      901        -4     1102       -11    181    LGA  DTW
##    2:     1      555        -5      745        -7    731    LGA  DTW
##    3:     1     1302        17     1503        21   1131    LGA  DTW
##    4:     1     1601        -4     1815        -8   2184    JFK  DTW
##    5:     1     1628        -1     1829        -8   2231    LGA  DTW
##   ---                                                               
## 3091:    10     1628        -2     1817        -5    825    EWR  DTW
## 3092:    10      718        -2      904        -9    831    LGA  DTW
## 3093:    10     1235         0     1451        10   1131    LGA  DTW
## 3094:    10     1813        -2     2010       -11   2131    LGA  DTW
## 3095:    10     1613        -2     1757       -23   2231    LGA  DTW
##       air_time distance
##    1:       99      502
##    2:       93      502
##    3:       98      502
##    4:       97      509
##    5:       98      502
##   ---                  
## 3091:       89      488
## 3092:       74      502
## 3093:       94      502
## 3094:       83      502
## 3095:       82      502
##       month dep_time dep_delay arr_time arr_delay flight origin dest
##    1:     1     1225        10     1428         9   4118    EWR  DTW
##    2:     1     2055        -4     2305        11   4247    EWR  DTW
##    3:     1     1639        39     1853        51   4381    EWR  DTW
##    4:     1     1250        10     1452        12   5078    EWR  DTW
##    5:     1     1117       170     1323       167   4246    EWR  DTW
##   ---                                                               
## 1580:    10      842        -8     1033       -17   4297    EWR  DTW
## 1581:    10      910        -5     1059       -18   4911    EWR  DTW
## 1582:    10      731         2      922        -7   4246    EWR  DTW
## 1583:    10     2133       169     2322       154   4247    EWR  DTW
## 1584:    10      913        -2     1115        -2   4911    EWR  DTW
##       air_time distance
##    1:      103      488
##    2:      102      488
##    3:      116      488
##    4:       97      488
##    5:      102      488
##   ---                  
## 1580:       79      488
## 1581:       78      488
## 1582:       80      488
## 1583:       83      488
## 1584:       86      488
##       month dep_time dep_delay arr_time arr_delay flight origin dest
##    1:     1      849        -6     1058        -2   3478    LGA  DTW
##    2:     1     1920       130     2137       147   3530    LGA  DTW
##    3:     1     2037        52     2242        52   3603    LGA  DTW
##    4:     1     1128        33     1330        40   3689    LGA  DTW
##    5:     1     1350        -5     1600         5   3340    LGA  DTW
##   ---                                                               
## 1327:    10     1058       -12     1252       -18   3592    LGA  DTW
## 1328:    10      939        -6     1138        -2   3478    LGA  DTW
## 1329:    10     1912        -3     2104        -6   3603    LGA  DTW
## 1330:    10     1346        -4     1535       -10   3631    LGA  DTW
## 1331:    10     1106        -4     1325        15   3592    LGA  DTW
##       air_time distance
##    1:      103      502
##    2:      103      502
##    3:      105      502
##    4:       98      502
##    5:      101      502
##   ---                  
## 1327:       76      502
## 1328:       80      502
## 1329:       82      502
## 1330:       82      502
## 1331:       75      502

## Empty data.table (0 rows) of 1 col: carrier

We can pass an additional argument .SDcols to the bracketing function to limit the columns in .SD. Note that .SD does not contain the grouping columns.

nyc14[dest=='DTW', .(rows=nrow(.SD), cols=ncol(.SD), n=.N), by=carrier,
      .SDcols = c("origin", "dest", "flight", "dep_time")]

##    carrier rows cols    n
## 1:      DL 3095    4 3095
## 2:      EV 1584    4 1584
## 3:      MQ 1331    4 1331

nyc14[dest=='DTW', print(.SD), by=carrier, .SDcols = c("origin", "dest", "flight", "dep_time")]

##       origin dest flight dep_time
##    1:    LGA  DTW    181      901
##    2:    LGA  DTW    731      555
##    3:    LGA  DTW   1131     1302
##    4:    JFK  DTW   2184     1601
##    5:    LGA  DTW   2231     1628
##   ---                            
## 3091:    EWR  DTW    825     1628
## 3092:    LGA  DTW    831      718
## 3093:    LGA  DTW   1131     1235
## 3094:    LGA  DTW   2131     1813
## 3095:    LGA  DTW   2231     1613
##       origin dest flight dep_time
##    1:    EWR  DTW   4118     1225
##    2:    EWR  DTW   4247     2055
##    3:    EWR  DTW   4381     1639
##    4:    EWR  DTW   5078     1250
##    5:    EWR  DTW   4246     1117
##   ---                            
## 1580:    EWR  DTW   4297      842
## 1581:    EWR  DTW   4911      910
## 1582:    EWR  DTW   4246      731
## 1583:    EWR  DTW   4247     2133
## 1584:    EWR  DTW   4911      913
##       origin dest flight dep_time
##    1:    LGA  DTW   3478      849
##    2:    LGA  DTW   3530     1920
##    3:    LGA  DTW   3603     2037
##    4:    LGA  DTW   3689     1128
##    5:    LGA  DTW   3340     1350
##   ---                            
## 1327:    LGA  DTW   3592     1058
## 1328:    LGA  DTW   3478      939
## 1329:    LGA  DTW   3603     1912
## 1330:    LGA  DTW   3631     1346
## 1331:    LGA  DTW   3592     1106

## Empty data.table (0 rows) of 1 col: carrier

This can be useful in the “j” statement because it allows use lapply or other functions returning a list to.

# What is the mean departure & arrival delay for each flight to DTW?
nyc14[dest=='DTW', lapply(.SD, mean), by=.(origin, dest, carrier, flight),
      .SDcols = c("arr_delay", "dep_delay")]

##      origin dest carrier flight  arr_delay   dep_delay
##   1:    LGA  DTW      DL    181   5.694118  11.8176471
##   2:    LGA  DTW      DL    731  -2.637795  -0.6653543
##   3:    LGA  DTW      DL   1131   5.774648   9.3450704
##   4:    JFK  DTW      DL   2184  -8.000000  -4.0000000
##   5:    LGA  DTW      DL   2231  10.861486  17.9054054
##  ---                                                  
## 168:    EWR  DTW      DL   2509 -12.037037  -2.5555556
## 169:    EWR  DTW      EV   5283 -14.947368  -1.9473684
## 170:    EWR  DTW      EV   4886 -18.750000  -6.0000000
## 171:    LGA  DTW      EV   5596 -12.000000 -10.0000000
## 172:    EWR  DTW      EV   4911 -10.000000  -3.5000000

Columns can also be specified as ranges in .SDcols.

nyc14[dest=='DTW', lapply(.SD, mean), 
      by=.(origin, dest, carrier, flight), 
      .SDcols = arr_delay:dep_delay]

##      origin dest carrier flight  arr_delay  arr_time   dep_delay
##   1:    LGA  DTW      DL    181   5.694118 1120.9706  11.8176471
##   2:    LGA  DTW      DL    731  -2.637795  754.5906  -0.6653543
##   3:    LGA  DTW      DL   1131   5.774648 1450.4366   9.3450704
##   4:    JFK  DTW      DL   2184  -8.000000 1815.0000  -4.0000000
##   5:    LGA  DTW      DL   2231  10.861486 1872.9527  17.9054054
##  ---                                                            
## 168:    EWR  DTW      DL   2509 -12.037037 1412.0000  -2.5555556
## 169:    EWR  DTW      EV   5283 -14.947368 1239.1579  -1.9473684
## 170:    EWR  DTW      EV   4886 -18.750000 1398.2500  -6.0000000
## 171:    LGA  DTW      EV   5596 -12.000000 1018.0000 -10.0000000
## 172:    EWR  DTW      EV   4911 -10.000000 1087.0000  -3.5000000

Copies

One of the goals of the data.table package is to use less memory. This is achieved in part by preferring “shallow” copies by reference over “deep copies” by value when appropriate. When an object is copied by reference it shares physical memory address with the object it is copied from. This is more efficient, but may lead to confusion as changing the value in memory also changes what is pointed to by both objects.

In the example below, we create a data.table DT1 and then create a copy DT2. Typical R objects would be copied by value, but DT2 is copied by reference. We can ask for a copy by value explicitly using copy().

DT1 = data.table(A=5:1, B=letters[5:1])
DT2 = DT1         # Copy by reference
DT3 = copy(DT1)   # Copy by value
DT4 = DT1[,.SD]
  
DT1[,C := 2*A]    # Create a new column 
DT1

##    A B  C
## 1: 5 e 10
## 2: 4 d  8
## 3: 3 c  6
## 4: 2 b  4
## 5: 1 a  2

DT2

##    A B  C
## 1: 5 e 10
## 2: 4 d  8
## 3: 3 c  6
## 4: 2 b  4
## 5: 1 a  2

DT3

##    A B
## 1: 5 e
## 2: 4 d
## 3: 3 c
## 4: 2 b
## 5: 1 a

After updating DT1 to include a new column, C, the column appears in DT2 as well because DT1 and DT2 refer to the same object.

Reference Semantics

In the last example above we used reference semantics to create a new column in DT1 without copying the other other columns and reassigning to a new DT1 object.

A similar notation using := as a function is below.

DT1[,`:=`(D=C+1)]

One way in which this is useful is to modify subsets of a data.table without re-allocating the entire thing.

# Truncated all arrival delays at 0
nyc14[arr_delay<0, arr_delay := 0][]

##         month dep_time dep_delay arr_time arr_delay carrier flight origin
##      1:     1      914        14     1238        13      AA      1    JFK
##      2:     1     1157        -3     1523        13      AA      3    JFK
##      3:     1     1902         2     2224         9      AA     21    JFK
##      4:     1      722        -8     1014         0      AA     29    LGA
##      5:     1     1347         2     1706         1      AA    117    JFK
##     ---                                                                  
## 253312:    10     1459         1     1747         0      UA   1744    LGA
## 253313:    10      854        -5     1147         0      UA   1758    EWR
## 253314:    10     1102        -8     1311        16      MQ   3591    LGA
## 253315:    10     1106        -4     1325        15      MQ   3592    LGA
## 253316:    10      824        -5     1045         1      MQ   3599    LGA
##         dest air_time distance
##      1:  LAX      359     2475
##      2:  LAX      363     2475
##      3:  LAX      351     2475
##      4:  PBI      157     1035
##      5:  LAX      350     2475
##     ---                       
## 253312:  IAH      201     1416
## 253313:  IAH      189     1400
## 253314:  RDU       83      431
## 253315:  DTW       75      502
## 253316:  SDF      110      659

range(nyc14$arr_delay)

## [1]    0 1494

We can also delete columns by reference using NULL:

nyc14[,c("month") := NULL]
# i.e. nyc14$month = NULL

We can use this with by to accomplish tasks such as adding a column showing the maximum departure delay by flight.

nyc14[,max_dep_delay := max(dep_delay), by=.(carrier, flight)][]

##         dep_time dep_delay arr_time arr_delay carrier flight origin dest
##      1:      914        14     1238        13      AA      1    JFK  LAX
##      2:     1157        -3     1523        13      AA      3    JFK  LAX
##      3:     1902         2     2224         9      AA     21    JFK  LAX
##      4:      722        -8     1014         0      AA     29    LGA  PBI
##      5:     1347         2     1706         1      AA    117    JFK  LAX
##     ---                                                                 
## 253312:     1459         1     1747         0      UA   1744    LGA  IAH
## 253313:      854        -5     1147         0      UA   1758    EWR  IAH
## 253314:     1102        -8     1311        16      MQ   3591    LGA  RDU
## 253315:     1106        -4     1325        15      MQ   3592    LGA  DTW
## 253316:      824        -5     1045         1      MQ   3599    LGA  SDF
##         air_time distance max_dep_delay
##      1:      359     2475           156
##      2:      363     2475           284
##      3:      351     2475           848
##      4:      157     1035            89
##      5:      350     2475           248
##     ---                                
## 253312:      201     1416           385
## 253313:      189     1400            42
## 253314:       83      431           240
## 253315:       75      502            68
## 253316:      110      659           121

With the functional form you can create multiple columns by reference.

nyc14[, `:=`(max_dep_delay = max(dep_delay),
             min_dep_delay = min(dep_delay)        
            ),
      by=.(carrier, flight)
     ][]

##         dep_time dep_delay arr_time arr_delay carrier flight origin dest
##      1:      914        14     1238        13      AA      1    JFK  LAX
##      2:     1157        -3     1523        13      AA      3    JFK  LAX
##      3:     1902         2     2224         9      AA     21    JFK  LAX
##      4:      722        -8     1014         0      AA     29    LGA  PBI
##      5:     1347         2     1706         1      AA    117    JFK  LAX
##     ---                                                                 
## 253312:     1459         1     1747         0      UA   1744    LGA  IAH
## 253313:      854        -5     1147         0      UA   1758    EWR  IAH
## 253314:     1102        -8     1311        16      MQ   3591    LGA  RDU
## 253315:     1106        -4     1325        15      MQ   3592    LGA  DTW
## 253316:      824        -5     1045         1      MQ   3599    LGA  SDF
##         air_time distance max_dep_delay min_dep_delay
##      1:      359     2475           156           -12
##      2:      363     2475           284           -16
##      3:      351     2475           848           -11
##      4:      157     1035            89           -15
##      5:      350     2475           248           -10
##     ---                                              
## 253312:      201     1416           385            -7
## 253313:      189     1400            42            -5
## 253314:       83      431           240           -16
## 253315:       75      502            68           -15
## 253316:      110      659           121           -20

The last set of empty brackets above is a short-hand for a subsequent call to print(nyc14).

Keys and Indexing

The material on in this section is based on the vignette here.

Above we used “indexing” in a generic sense to mean “subsetting”. What we mean by “indexing” here is more specific and technical: we create an indexed data table by designating specific columns as keys and sorting the table by these keys to create more efficient lookups and aggregations.

We can designate one or more columns as a key in data.table using setkey().

setkey(nyc14, origin) #also, setkeyv(nyc14, "origin") if character is preferred.
key(nyc14)

## [1] "origin"

In data.table when we designate columns as keys, the rows are re-ordered by reference in increasing order. In other words, the original columns are left as is in memory but a new column key is created to store the order of the data.table. This works something like this:

# Example of keys on a data-frame
df = data.frame(ID = sample(LETTERS,100,replace=TRUE), value=rpois(100, 10))
df$key = order(df$ID)
head(df)

##   ID value key
## 1  P     8   4
## 2  D    13  50
## 3  B    11  61
## 4  A     8  80
## 5  O     9  88
## 6  Y     5   3

head(df[df$key, ])

##    ID value key
## 4   A     8  80
## 50  A    12  73
## 61  A     8   1
## 80  A    17  48
## 88  A     8  17
## 3   B    11  61

After a key has been set, we can compute in the “i”-statement using lists:

nyc14[.("LGA")]

##        month dep_time dep_delay arr_time arr_delay carrier flight origin
##     1:     1      722        -8     1014       -26      AA     29    LGA
##     2:     1      553        -7      739        -6      AA    301    LGA
##     3:     1      623        -7      815         0      AA    303    LGA
##     4:     1      652        -8      833       -17      AA    305    LGA
##     5:     1      738        -2      940        15      AA    307    LGA
##    ---                                                                  
## 84429:    10      609        24      843        -5      UA   1714    LGA
## 84430:    10     1459         1     1747       -30      UA   1744    LGA
## 84431:    10     1102        -8     1311        16      MQ   3591    LGA
## 84432:    10     1106        -4     1325        15      MQ   3592    LGA
## 84433:    10      824        -5     1045         1      MQ   3599    LGA
##        dest air_time distance
##     1:  PBI      157     1035
##     2:  ORD      142      733
##     3:  ORD      143      733
##     4:  ORD      139      733
##     5:  ORD      145      733
##    ---                       
## 84429:  IAH      198     1416
## 84430:  IAH      201     1416
## 84431:  RDU       83      431
## 84432:  DTW       75      502
## 84433:  SDF      110      659

We can have more than one column contribute to the order used to form the key.

# key by origin and destination
setkey(nyc14,origin,dest)
key(nyc14)

## [1] "origin" "dest"

# Get flighs of any origin to DTW
nyc14[.(unique(origin),'DTW')]

##       month dep_time dep_delay arr_time arr_delay carrier flight origin
##    1:     1     1054        -6     1300         4      DL   2508    EWR
##    2:     1      905        -7     1121         3      DL   1807    EWR
##    3:     1     1225        10     1428         9      EV   4118    EWR
##    4:     1     2055        -4     2305        11      EV   4247    EWR
##    5:     1     1639        39     1853        51      EV   4381    EWR
##   ---                                                                  
## 6006:    10     1613        -2     1757       -23      DL   2231    LGA
## 6007:    10      939        -6     1138        -2      MQ   3478    LGA
## 6008:    10     1912        -3     2104        -6      MQ   3603    LGA
## 6009:    10     1346        -4     1535       -10      MQ   3631    LGA
## 6010:    10     1106        -4     1325        15      MQ   3592    LGA
##       dest air_time distance
##    1:  DTW      101      488
##    2:  DTW      102      488
##    3:  DTW      103      488
##    4:  DTW      102      488
##    5:  DTW      116      488
##   ---                       
## 6006:  DTW       82      502
## 6007:  DTW       80      502
## 6008:  DTW       82      502
## 6009:  DTW       82      502
## 6010:  DTW       75      502

We can of course combine this with ‘j’ and ‘by’ statements.

# Find the median departure delay for all flights to DTW
nyc14[.(unique(origin),'DTW'), 
      .(med_dep_delay = as.numeric(median(dep_delay)), n=.N),
      by=.(origin, dest, flight)
     ] %>%
  .[order(origin, med_dep_delay, -n)] %>% # Order
  .[n>42]  # Limit to at least ~1 flight per week

##     origin dest flight med_dep_delay   n
##  1:    EWR  DTW   5203          -3.5  70
##  2:    EWR  DTW   4297          -3.0 188
##  3:    EWR  DTW   5067          -3.0  67
##  4:    EWR  DTW   1080          -3.0  63
##  5:    EWR  DTW   1129          -3.0  47
##  6:    EWR  DTW   4246          -2.0  78
##  7:    EWR  DTW   1214          -2.0  45
##  8:    EWR  DTW   4118           0.0 196
##  9:    EWR  DTW   3837           0.0  55
## 10:    EWR  DTW   4382           1.0 213
## 11:    EWR  DTW   4898           5.0  63
## 12:    EWR  DTW   4247           6.0  88
## 13:    EWR  DTW   4381          12.0 146
## 14:    LGA  DTW   3478          -6.0 282
## 15:    LGA  DTW   2810          -6.0 184
## 16:    LGA  DTW   3340          -5.0  77
## 17:    LGA  DTW    731          -4.0 254
## 18:    LGA  DTW   3530          -4.0 148
## 19:    LGA  DTW   3631          -4.0 132
## 20:    LGA  DTW   3689          -4.0 131
## 21:    LGA  DTW   3603          -3.0 257
## 22:    LGA  DTW    831          -3.0 240
## 23:    LGA  DTW    181          -3.0 170
## 24:    LGA  DTW   2449          -3.0 120
## 25:    LGA  DTW   1131          -2.0 284
## 26:    LGA  DTW   2131          -2.0 247
## 27:    LGA  DTW   1231          -2.0 231
## 28:    LGA  DTW    189          -2.0  93
## 29:    LGA  DTW   2231          -1.0 296
## 30:    LGA  DTW   2331          -1.0 241
## 31:    LGA  DTW   3592           0.0 119
##     origin dest flight med_dep_delay   n

What is the advantage of indexing by setting a key?

# Generate two columns of data with 10 million rows
N = 1e8
DT = data.table(group = sample(1:26, N, replace=TRUE))
DT = DT[ ,.(count=rpois(.N,group)), keyby=group]

# See size of this data
print(object.size(DT),units='MB')

## 762.9 Mb

## Unkeyd approach
tm0 = system.time({
  ans0 <- DT[group==1 | group==26, .(lambda_hat = mean(count)), by=group]
})

## Alternate unkeyed approach
tm1 = system.time({
  ans1 <- DT[, .(lambda_hat = mean(count)), by=group][group==1 | group==26]
})

# Set key
setkey(DT,group)

# keyed approach
tm2 = system.time({
  ans2 <- DT[.(c(1,26)),.(lambda_hat = mean(count)), by=group]
})

# Compare timings (seconds)
rbind(tm0,tm1,tm2)[,'elapsed']

##   tm0   tm1   tm2 
## 1.633 0.971 0.153

rm(DT,tm0,tm1,tm2)

Reshaping data tables

We did not cover this in class, but you may wish to read this vignette.

Resources

Professor Shedden’s Large data sets in R
The data.table vignettes on CRAN
The project homepage for data.table
A short introduction from Dr. Clark’s workshop
See “Large Memory and out-of-memory data” on CRAN’s HPC task view
The dtplyr package provides dplyr methods for data.table objects (objects of class data.table). That is dtplyr is the data.table backend for dplyr – the part facing away from the interactive user.
Practice Exam Questions. A similar question will appear on the final exam. Here are example solutions.

R’s `data.table` package

Statistics 506, 2017

About `data.table`

Creating data.table objects

“Indexing”

Subsetting

Column Selection

Computing with Columns

by

Chaining

`.SD`

Copies

Reference Semantics

Keys and Indexing

Reshaping data tables

Resources

R’s data.table package

Statistics 506, 2017

About data.table

Creating data.table objects

“Indexing”

Subsetting

Column Selection

Computing with Columns

by

Chaining

.SD

Copies

Reference Semantics

Keys and Indexing

Reshaping data tables

Resources

R’s `data.table` package

About `data.table`

`.SD`