data.table
packagePlease read and work through the examples in the following data.table
vignettes on CRAN:
Additional resources are available at the end of this document.
data.table
The data.table
package in R provides an extension of the data.frame
class that aims to be both more computationally and memory efficient. It is particularly well suited for large in-memory data sets and utilizes indexed keys to allow quick search, subset, and aggregate by group operations. The package also automatically employs multicore computations through its back end, enabling a degree of no-hassle paralellism.
The data.table
package also provides an expressive, compact syntax for working with data. However, compared to dplyr
, this syntax is less literate and may be more difficult for a non-expert to read and make sense of.
The data.table
package provides a function fread()
for reading delimited files like read.table()
and readr::read_delim()
, but returns a data.table object instead. As with the tibble class, data.table inherits from data.frame. Here is an example using the 2014 New York City Flights data.
library(data.table)
url =
'https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv'
nyc14 = fread(url)
class(nyc14)
## [1] "data.table" "data.frame"
You can also create a data.table like a data.frame using:
## a b c
## 1: 1 -0.868832673 d
## 2: 2 -1.261828030 v
## 3: 3 0.078137783 j
## 4: 4 0.079168162 r
## 5: 5 0.587739342 l
## ---
## 996: 996 1.101397110 g
## 997: 997 0.007424156 a
## 998: 998 0.284551172 z
## 999: 999 1.181893569 s
## 1000: 1000 -0.183499179 u
How can we view the print method for class data.table?
The syntax for the data.table
package is inspired by the bracket ([]
) notation for indexing matrices and data frames. At the same time, it aims to allow many common data operations (i.e. dplyr
verbs) to be expressed within these brackets. These goals are expressed at the start of the “Introduction to data.table” vignettte.
The basic idea is DT[i, j, by]
where we:
i
statement,j
statementby
statement.Additional operations can also be expressed within the brackets. Remember that, even for a data.frame or matrix the left bracket [
is actually a function:
## .Primitive("[")
Below are some examples on the nyc14
data.table created using fread()
above:
## year month day dep_time dep_delay arr_time arr_delay cancelled carrier
## 1: 2014 1 1 914 14 1238 13 0 AA
## 2: 2014 1 1 1157 -3 1523 13 0 AA
## tailnum flight origin dest air_time distance hour min
## 1: N338AA 1 JFK LAX 359 2475 9 14
## 2: N335AA 3 JFK LAX 363 2475 11 57
Find all flights from LGA to DTW:
Get the first and last rows of lga_dtw
:
## year month day dep_time dep_delay arr_time arr_delay cancelled carrier
## 1: 2014 1 1 901 -4 1102 -11 0 DL
## 2: 2014 10 31 1106 -4 1325 15 0 MQ
## tailnum flight origin dest air_time distance hour min
## 1: N917DL 181 LGA DTW 99 502 9 1
## 2: N511MQ 3592 LGA DTW 75 502 11 6
In the above, we used .N
to index the last row. This is a special symbol defined by data.table
to hold the number of rows or observations in the “current” group. “Current” here refers to the scope in which it is used; in this example, that is the entire data.table
.
Also, notice the difference from standard data.frame or matrix sub-setting:
## [1] "2014" "2014"
# Indexing a data.fame
ld_df = as.data.frame(lga_dtw)
dim(ld_df[1:2]) # What would happen if we called: ld_df[c(1,nrow(ld_df))] ?
## [1] 3663 2
You can also use the i
clause to order a data.table:
## year month day dep_time dep_delay arr_time arr_delay cancelled carrier
## 1: 2014 10 31 718 -2 904 -9 0 DL
## 2: 2014 10 31 900 -5 1051 -22 0 DL
## 3: 2014 10 31 939 -6 1138 -2 0 MQ
## 4: 2014 10 31 1106 -4 1325 15 0 MQ
## 5: 2014 10 31 1113 8 1317 8 0 DL
## ---
## 3659: 2014 1 1 1302 17 1503 21 0 DL
## 3660: 2014 1 1 1350 -5 1600 5 0 MQ
## 3661: 2014 1 1 1628 -1 1829 -8 0 DL
## 3662: 2014 1 1 1920 130 2137 147 0 MQ
## 3663: 2014 1 1 2037 52 2242 52 0 MQ
## tailnum flight origin dest air_time distance hour min
## 1: N320US 831 LGA DTW 74 502 7 18
## 2: N818DA 189 LGA DTW 79 502 9 0
## 3: N528MQ 3478 LGA DTW 80 502 9 39
## 4: N511MQ 3592 LGA DTW 75 502 11 6
## 5: N3758Y 2449 LGA DTW 74 502 11 13
## ---
## 3659: N331NW 1131 LGA DTW 98 502 13 2
## 3660: N839MQ 3340 LGA DTW 101 502 13 50
## 3661: N310NW 2231 LGA DTW 98 502 16 28
## 3662: N833MQ 3530 LGA DTW 103 502 19 20
## 3663: N856MQ 3603 LGA DTW 105 502 20 37
Get the departure and arrival times, flight number, and carrier for all flights from LGA to DTW:
## dep_time arr_time carrier flight
## 1: 901 1102 DL 181
## 2: 555 745 DL 731
## 3: 1302 1503 DL 1131
## 4: 1628 1829 DL 2231
## 5: 849 1058 MQ 3478
## ---
## 3659: 1613 1757 DL 2231
## 3660: 939 1138 MQ 3478
## 3661: 1912 2104 MQ 3603
## 3662: 1346 1535 MQ 3631
## 3663: 1106 1325 MQ 3592
Notice the use of list()
to select columns. A synonym for list()
within data.table
is .()
to save typing and enhance readability:
## dep_time arr_time carrier flight
## 1: 901 1102 DL 181
## 2: 555 745 DL 731
## 3: 1302 1503 DL 1131
## 4: 1628 1829 DL 2231
## 5: 849 1058 MQ 3478
## ---
## 3659: 1613 1757 DL 2231
## 3660: 939 1138 MQ 3478
## 3661: 1912 2104 MQ 3603
## 3662: 1346 1535 MQ 3631
## 3663: 1106 1325 MQ 3592
Columns can also be selected using a character vector of column names.
## dep_time arr_time carrier flight
## 1: 901 1102 DL 181
## 2: 555 745 DL 731
## 3: 1302 1503 DL 1131
## 4: 1628 1829 DL 2231
## 5: 849 1058 MQ 3478
## ---
## 3659: 1613 1757 DL 2231
## 3660: 939 1138 MQ 3478
## 3661: 1912 2104 MQ 3603
## 3662: 1346 1535 MQ 3631
## 3663: 1106 1325 MQ 3592
We can deselect columns using negation (-
or !
). (Once, we also had to set with = FALSE
).
## month dep_time dep_delay arr_time arr_delay carrier flight origin dest
## 1: 1 914 14 1238 13 AA 1 JFK LAX
## 2: 1 1157 -3 1523 13 AA 3 JFK LAX
## 3: 1 1902 2 2224 9 AA 21 JFK LAX
## 4: 1 722 -8 1014 -26 AA 29 LGA PBI
## 5: 1 1347 2 1706 1 AA 117 JFK LAX
## ---
## 253312: 10 1459 1 1747 -30 UA 1744 LGA IAH
## 253313: 10 854 -5 1147 -14 UA 1758 EWR IAH
## 253314: 10 1102 -8 1311 16 MQ 3591 LGA RDU
## 253315: 10 1106 -4 1325 15 MQ 3592 LGA DTW
## 253316: 10 824 -5 1045 1 MQ 3599 LGA SDF
## air_time distance
## 1: 359 2475
## 2: 363 2475
## 3: 351 2475
## 4: 157 1035
## 5: 350 2475
## ---
## 253312: 201 1416
## 253313: 189 1400
## 253314: 83 431
## 253315: 75 502
## 253316: 110 659
The j
-clause can be used to compute with column variables like dplyr::summarize()
. Below, we find the upper and lower quintiles of departure delays for flights between LGA and DTW during this period:
## p20 p80
## 1: -6 10
The j
-clause can also be used to compute with column variables much like dplyr::transmute()
. Here, we create new columns indicating whether the arrival or departure delays were greater than 15 minutes:
## delay15
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## ---
## 253312: 0
## 253313: 0
## 253314: 1
## 253315: 0
## 253316: 0
To get behavior like dplyr::mutate()
we need reference semantics.
First, using the special function :=()
we can add the delay15
above.
## month dep_time dep_delay arr_time arr_delay carrier flight origin dest
## 1: 1 914 14 1238 13 AA 1 JFK LAX
## 2: 1 1157 -3 1523 13 AA 3 JFK LAX
## 3: 1 1902 2 2224 9 AA 21 JFK LAX
## 4: 1 722 -8 1014 -26 AA 29 LGA PBI
## 5: 1 1347 2 1706 1 AA 117 JFK LAX
## ---
## 253312: 10 1459 1 1747 -30 UA 1744 LGA IAH
## 253313: 10 854 -5 1147 -14 UA 1758 EWR IAH
## 253314: 10 1102 -8 1311 16 MQ 3591 LGA RDU
## 253315: 10 1106 -4 1325 15 MQ 3592 LGA DTW
## 253316: 10 824 -5 1045 1 MQ 3599 LGA SDF
## air_time distance delay15
## 1: 359 2475 0
## 2: 363 2475 0
## 3: 351 2475 0
## 4: 157 1035 0
## 5: 350 2475 0
## ---
## 253312: 201 1416 0
## 253313: 189 1400 0
## 253314: 83 431 1
## 253315: 75 502 0
## 253316: 110 659 0
> Pay close attention to the first line of the code above. > Why didn’t it print and > why didn’t we need an assignment? Also, what is the role of that “L”?
Another way to accomplish this is shown below. Which do you find more intuitive?
## month dep_time dep_delay arr_time arr_delay carrier flight origin dest
## 1: 1 914 14 1238 13 AA 1 JFK LAX
## 2: 1 1157 -3 1523 13 AA 3 JFK LAX
## 3: 1 1902 2 2224 9 AA 21 JFK LAX
## 4: 1 722 -8 1014 -26 AA 29 LGA PBI
## 5: 1 1347 2 1706 1 AA 117 JFK LAX
## ---
## 253312: 10 1459 1 1747 -30 UA 1744 LGA IAH
## 253313: 10 854 -5 1147 -14 UA 1758 EWR IAH
## 253314: 10 1102 -8 1311 16 MQ 3591 LGA RDU
## 253315: 10 1106 -4 1325 15 MQ 3592 LGA DTW
## 253316: 10 824 -5 1045 1 MQ 3599 LGA SDF
## air_time distance delay15 delay30
## 1: 359 2475 0 0
## 2: 363 2475 0 0
## 3: 351 2475 0 0
## 4: 157 1035 0 0
## 5: 350 2475 0 0
## ---
## 253312: 201 1416 0 0
## 253313: 189 1400 0 0
## 253314: 83 431 1 0
## 253315: 75 502 0 0
## 253316: 110 659 0 0
To perform operations group-wise use a by
argument after the j
statement.
nyc14[ , delay30 := 1L * {dep_delay > 30 | arr_delay > 30}]
# Find the percent of flights with delays of 30 minutes or more by carrier
nyc14[, .(del30_pct = 100 * mean(delay30)), by = carrier]
## carrier del30_pct
## 1: AA 15.21177
## 2: AS 13.93728
## 3: B6 18.43117
## 4: DL 15.69945
## 5: EV 23.55157
## 6: F9 34.88372
## 7: FL 23.66107
## 8: HA 15.00000
## 9: MQ 18.32534
## 10: VX 13.00813
## 11: WN 21.30734
## 12: UA 18.89684
## 13: US 10.85373
## 14: OO 18.00000
We can use a list to specify multiple grouping variables.
# Find percent of flights with delays of 30 minutes or more by carrier and origin.
nyc14[, .(del30_pct = 100 * mean(delay30)), by = .(carrier, origin)]
## carrier origin del30_pct
## 1: AA JFK 15.71752
## 2: AA LGA 14.39898
## 3: AA EWR 16.53454
## 4: AS EWR 13.93728
## 5: B6 JFK 18.35769
## 6: B6 EWR 17.59547
## 7: B6 LGA 19.91224
## 8: DL LGA 15.09909
## 9: DL EWR 17.19239
## 10: DL JFK 15.96501
## 11: EV EWR 23.57032
## 12: EV LGA 23.69027
## 13: F9 LGA 34.88372
## 14: FL LGA 23.66107
## 15: HA JFK 15.00000
## 16: MQ LGA 18.37349
## 17: MQ JFK 17.94636
## 18: VX JFK 12.93818
## 19: VX EWR 13.26034
## 20: WN EWR 21.05263
## 21: WN LGA 21.47992
## 22: MQ EWR 26.94611
## 23: EV JFK 21.70253
## 24: UA EWR 19.45798
## 25: UA LGA 18.08973
## 26: UA JFK 15.01019
## 27: US EWR 11.03448
## 28: US JFK 13.95085
## 29: US LGA 10.02353
## 30: OO LGA 18.09045
## 31: OO EWR 0.00000
## 32: VX LGA 0.00000
## carrier origin del30_pct
Pay attention to how the result is ordered.
> How does this compare to the dplyr group_by() %>% summarize()
operation?
To order according to the values in the by
argument, use keyby
which sets a key with the data.table ordered by this key. More on keys can be found below.
## NULL
## [1] "carrier"
## carrier del30_pct carrier del30_pct
## 1: AA 15.21177 AA 15.21177
## 2: AS 13.93728 AS 13.93728
## 3: B6 18.43117 B6 18.43117
## 4: DL 15.69945 DL 15.69945
## 5: EV 23.55157 EV 23.55157
## 6: F9 34.88372 F9 34.88372
## 7: FL 23.66107 FL 23.66107
## 8: HA 15.00000 HA 15.00000
## 9: MQ 18.32534 MQ 18.32534
## 10: VX 13.00813 OO 18.00000
## 11: WN 21.30734 UA 18.89684
## 12: UA 18.89684 US 10.85373
## 13: US 10.85373 VX 13.00813
## 14: OO 18.00000 WN 21.30734
As with standard data.frame
indexing, we can compose data.table
bracketed expressions using chaining.
## Find max departure delay by flight among all flights from LGA to DTW
## Then, select flights within the shortest 10% of max_delay
nyc14[origin == 'LGA' & dest == 'DTW', .(max_delay = max(dep_delay)),
by = .(carrier, flight)
][, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))
][max_delay < max_delay_q10, -"max_delay_q10", with=FALSE
]
## carrier flight max_delay
## 1: DL 1107 -2
## 2: EV 5405 -5
## 3: DL 796 -8
## 4: EV 5596 -10
If you prefer pipes %>%
for clarity, you can use them by appending a .
before the opening bracket:
# Same as above, but using magrittr::%>%
nyc14[origin == 'LGA' & dest == 'DTW',
.(max_delay = max(dep_delay)),
by = .(carrier, flight)
] %>%
.[, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))
] %>%
.[max_delay < max_delay_q10, -"max_delay_q10", with = FALSE]
## carrier flight max_delay
## 1: DL 1107 -2
## 2: EV 5405 -5
## 3: DL 796 -8
## 4: EV 5596 -10
.SD
Recall that the special symbol .N
contains the number of rows in each subset defined using by
or keyby
.
## carrier N
## 1: DL 3095
## 2: EV 1584
## 3: MQ 1331
There is another special symbol .SD
which references the entire subset of data for each group. It is itself a data.table
as we can see using the code below.
nyc14[dest == 'DTW',
.( rows = nrow(.SD),
cols = ncol(.SD),
n = .N,
class = class(.SD)[1]
), by = carrier]
## carrier rows cols n class
## 1: DL 3095 12 3095 data.table
## 2: EV 1584 12 1584 data.table
## 3: MQ 1331 12 1331 data.table
As a reminder, any valid R expression can be placed in j
.
## month dep_time dep_delay arr_time arr_delay flight origin dest air_time
## 1: 1 901 -4 1102 -11 181 LGA DTW 99
## 2: 1 555 -5 745 -7 731 LGA DTW 93
## distance delay15 delay30
## 1: 502 0 0
## 2: 502 0 0
## month dep_time dep_delay arr_time arr_delay flight origin dest air_time
## 1: 1 1225 10 1428 9 4118 EWR DTW 103
## 2: 1 2055 -4 2305 11 4247 EWR DTW 102
## distance delay15 delay30
## 1: 488 0 0
## 2: 488 0 0
## month dep_time dep_delay arr_time arr_delay flight origin dest air_time
## 1: 1 849 -6 1058 -2 3478 LGA DTW 103
## 2: 1 1920 130 2137 147 3530 LGA DTW 103
## distance delay15 delay30
## 1: 502 0 0
## 2: 502 1 1
## Empty data.table (0 rows and 1 cols): carrier
## 3095
## 1584
## 1331
## Empty data.table (0 rows and 1 cols): carrier
Notice that the grouping variable carrier
is not a column in .SD
.
We can pass an additional argument .SDcols
to the bracketing function to limit the columns in .SD
.
nyc14[dest == 'DTW',
.(rows = nrow(.SD), cols = ncol(.SD), n = .N),
by = carrier,
.SDcols = c("origin", "dest", "flight", "dep_time")
]
## carrier rows cols n
## 1: DL 3095 4 3095
## 2: EV 1584 4 1584
## 3: MQ 1331 4 1331
nyc14[dest == 'DTW', print(.SD), by = carrier,
.SDcols = c("origin", "dest", "flight", "dep_time")
]
## origin dest flight dep_time
## 1: LGA DTW 181 901
## 2: LGA DTW 731 555
## 3: LGA DTW 1131 1302
## 4: JFK DTW 2184 1601
## 5: LGA DTW 2231 1628
## ---
## 3091: EWR DTW 825 1628
## 3092: LGA DTW 831 718
## 3093: LGA DTW 1131 1235
## 3094: LGA DTW 2131 1813
## 3095: LGA DTW 2231 1613
## origin dest flight dep_time
## 1: EWR DTW 4118 1225
## 2: EWR DTW 4247 2055
## 3: EWR DTW 4381 1639
## 4: EWR DTW 5078 1250
## 5: EWR DTW 4246 1117
## ---
## 1580: EWR DTW 4297 842
## 1581: EWR DTW 4911 910
## 1582: EWR DTW 4246 731
## 1583: EWR DTW 4247 2133
## 1584: EWR DTW 4911 913
## origin dest flight dep_time
## 1: LGA DTW 3478 849
## 2: LGA DTW 3530 1920
## 3: LGA DTW 3603 2037
## 4: LGA DTW 3689 1128
## 5: LGA DTW 3340 1350
## ---
## 1327: LGA DTW 3592 1058
## 1328: LGA DTW 3478 939
## 1329: LGA DTW 3603 1912
## 1330: LGA DTW 3631 1346
## 1331: LGA DTW 3592 1106
## Empty data.table (0 rows and 1 cols): carrier
This can be useful in the j
statement because it allows us to use lapply
or any other function that returns a list to compute on multiple columns.
# What is the mean departure & arrival delay for each flight to DTW?
nyc14[dest == 'DTW', lapply(.SD, mean), by=.(origin, dest, carrier, flight),
.SDcols = c("arr_delay", "dep_delay")]
## origin dest carrier flight arr_delay dep_delay
## 1: LGA DTW DL 181 5.694118 11.8176471
## 2: LGA DTW DL 731 -2.637795 -0.6653543
## 3: LGA DTW DL 1131 5.774648 9.3450704
## 4: JFK DTW DL 2184 -8.000000 -4.0000000
## 5: LGA DTW DL 2231 10.861486 17.9054054
## ---
## 168: EWR DTW DL 2509 -12.037037 -2.5555556
## 169: EWR DTW EV 5283 -14.947368 -1.9473684
## 170: EWR DTW EV 4886 -18.750000 -6.0000000
## 171: LGA DTW EV 5596 -12.000000 -10.0000000
## 172: EWR DTW EV 4911 -10.000000 -3.5000000
Columns can also be specified as ranges in .SDcols
.
nyc14[dest == 'DTW', lapply(.SD, mean),
by = .(origin, dest, carrier, flight),
.SDcols = arr_delay:dep_delay
]
## origin dest carrier flight arr_delay arr_time dep_delay
## 1: LGA DTW DL 181 5.694118 1120.9706 11.8176471
## 2: LGA DTW DL 731 -2.637795 754.5906 -0.6653543
## 3: LGA DTW DL 1131 5.774648 1450.4366 9.3450704
## 4: JFK DTW DL 2184 -8.000000 1815.0000 -4.0000000
## 5: LGA DTW DL 2231 10.861486 1872.9527 17.9054054
## ---
## 168: EWR DTW DL 2509 -12.037037 1412.0000 -2.5555556
## 169: EWR DTW EV 5283 -14.947368 1239.1579 -1.9473684
## 170: EWR DTW EV 4886 -18.750000 1398.2500 -6.0000000
## 171: LGA DTW EV 5596 -12.000000 1018.0000 -10.0000000
## 172: EWR DTW EV 4911 -10.000000 1087.0000 -3.5000000
Because .SDcols
takes a character vector it is often useful to construct it programmatically from the names()
of the data.table object.
delay_cols = names(nyc14)[ grep('delay', names(nyc14)) ]
delay_stats =
nyc14[dest == 'DTW', c( lapply(.SD, mean),
lapply(.SD, sd)
), keyby = .(carrier), .SDcols = delay_cols]
delay_stats
## carrier dep_delay arr_delay delay15 delay30 dep_delay arr_delay
## 1: DL 9.406785 3.074960 0.2100162 0.1350565 37.47738 40.83548
## 2: EV 17.391414 13.052399 0.3358586 0.2468434 43.13941 46.21437
## 3: MQ 4.903080 4.510143 0.2389181 0.1367393 26.47499 30.55465
## delay15 delay30
## 1: 0.4073856 0.3418392
## 2: 0.4724389 0.4313110
## 3: 0.4265829 0.3437011
new_names =
c( key(delay_stats),
paste(delay_cols, 'mean', sep = '_'),
paste(delay_cols, 'sd', sep = '_')
)
setnames(delay_stats, new_names)
In this example, note that setnames()
like all set*
functions in data.table updates in place by reference.
One of the goals of the data.table
package is to use memory efficiently. This is achieved in part by preferring “shallow” copies by reference over “deep copies” by value when appropriate. When an object is copied by reference it shares physical memory address with the object it is copied from. This is more efficient, but may lead to confusion as changing the value in memory also changes what is pointed to by both objects.
In the example below, we create a data.table
DT1
and then assign it to DT2
. Typical R objects would be copied by value using “copy on modify” semantics, but DT2
is copied by reference. We can ask for a copy by value explicitly using copy()
.
DT1 = data.table(A = 5:1, B = letters[5:1])
DT2 = DT1 # Copy by reference
DT3 = copy(DT1) # Copy by value
DT4 = DT1[, .SD] # Value or reference? See below.
DT1[, C := 2 * A] # Create a new column
DT1
## A B C
## 1: 5 e 10
## 2: 4 d 8
## 3: 3 c 6
## 4: 2 b 4
## 5: 1 a 2
## A B C
## 1: 5 e 10
## 2: 4 d 8
## 3: 3 c 6
## 4: 2 b 4
## 5: 1 a 2
## A B
## 1: 5 e
## 2: 4 d
## 3: 3 c
## 4: 2 b
## 5: 1 a
## A B
## 1: 5 e
## 2: 4 d
## 3: 3 c
## 4: 2 b
## 5: 1 a
After updating DT1
to include a new column, C
, the column appears in DT2
as well because DT1
and DT2
refer to the same object.
> How can we understand the DT4
example? > Does it make sense that it works as it does?
In the last example above we used reference semantics to create a new column in DT1
without copying the other other columns and reassigning to a new DT1
object.
A similar notation using :=
as a function is below.
One way in which this is useful is to modify subsets of a data.table
without re-allocating the entire thing.
## [1] -112 1494
## [1] "<0x7f86026d2000>"
## [1] 0 1494
## [1] "<0x7f86026d2000>"
If we replace
nyc14$arr_delay
withnyc14[, 'arr_delay']
ornyc14[,.(arr_delay)]
will we see the same memory location fromtracemem()
? If not, why not? What does this tell us about this syntax?
## [1] "<0x7f85e959a000>"
## tracemem[0x7f85e959a000 -> 0x7f85ea53b000]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
## tracemem[0x7f85ea53b000 -> 0x7f85ea823000]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
## [1] "<0x7f85ea823000>"
We can also delete columns by reference using NULL
:
We can use this with by
to accomplish tasks such as adding a column showing the maximum departure delay by flight.
## dep_time dep_delay arr_time arr_delay carrier flight origin dest
## 1: 914 14 1238 13 AA 1 JFK LAX
## 2: 1157 -3 1523 13 AA 3 JFK LAX
## 3: 1902 2 2224 9 AA 21 JFK LAX
## 4: 722 -8 1014 0 AA 29 LGA PBI
## 5: 1347 2 1706 1 AA 117 JFK LAX
## ---
## 253312: 1459 1 1747 0 UA 1744 LGA IAH
## 253313: 854 -5 1147 0 UA 1758 EWR IAH
## 253314: 1102 -8 1311 16 MQ 3591 LGA RDU
## 253315: 1106 -4 1325 15 MQ 3592 LGA DTW
## 253316: 824 -5 1045 1 MQ 3599 LGA SDF
## air_time distance delay15 delay30 max_dep_delay
## 1: 359 2475 0 0 156
## 2: 363 2475 0 0 284
## 3: 351 2475 0 0 848
## 4: 157 1035 0 0 89
## 5: 350 2475 0 0 248
## ---
## 253312: 201 1416 0 0 385
## 253313: 189 1400 0 0 42
## 253314: 83 431 1 0 240
## 253315: 75 502 0 0 68
## 253316: 110 659 0 0 121
With the functional form you can create multiple columns by reference.
nyc14[, `:=`(max_dep_delay = max(dep_delay),
min_dep_delay = min(dep_delay)
),
by = .(carrier, flight)
][]
## dep_time dep_delay arr_time arr_delay carrier flight origin dest
## 1: 914 14 1238 13 AA 1 JFK LAX
## 2: 1157 -3 1523 13 AA 3 JFK LAX
## 3: 1902 2 2224 9 AA 21 JFK LAX
## 4: 722 -8 1014 0 AA 29 LGA PBI
## 5: 1347 2 1706 1 AA 117 JFK LAX
## ---
## 253312: 1459 1 1747 0 UA 1744 LGA IAH
## 253313: 854 -5 1147 0 UA 1758 EWR IAH
## 253314: 1102 -8 1311 16 MQ 3591 LGA RDU
## 253315: 1106 -4 1325 15 MQ 3592 LGA DTW
## 253316: 824 -5 1045 1 MQ 3599 LGA SDF
## air_time distance delay15 delay30 max_dep_delay min_dep_delay
## 1: 359 2475 0 0 156 -12
## 2: 363 2475 0 0 284 -16
## 3: 351 2475 0 0 848 -11
## 4: 157 1035 0 0 89 -15
## 5: 350 2475 0 0 248 -10
## ---
## 253312: 201 1416 0 0 385 -7
## 253313: 189 1400 0 0 42 -5
## 253314: 83 431 1 0 240 -16
## 253315: 75 502 0 0 68 -15
## 253316: 110 659 0 0 121 -20
The last set of empty brackets above is a short-hand for a subsequent call to print(nyc14)
.
The material on in this section is based on the vignette here.
Above we used “indexing” in a generic sense to mean “subsetting”.
What we mean by “indexing” here is more specific and technical: we create an indexed data table by designating specific columns as keys and sorting the table by these keys to create more efficient look-ups and aggregations.
We can designate one or more columns as a key in data.table
using setkey()
.
## [1] "origin"
After a key has been set, we can subset in the i
-statement using lists:
## dep_time dep_delay arr_time arr_delay carrier flight origin dest
## 1: 722 -8 1014 0 AA 29 LGA PBI
## 2: 553 -7 739 0 AA 301 LGA ORD
## 3: 623 -7 815 0 AA 303 LGA ORD
## 4: 652 -8 833 0 AA 305 LGA ORD
## 5: 738 -2 940 15 AA 307 LGA ORD
## ---
## 84429: 609 24 843 0 UA 1714 LGA IAH
## 84430: 1459 1 1747 0 UA 1744 LGA IAH
## 84431: 1102 -8 1311 16 MQ 3591 LGA RDU
## 84432: 1106 -4 1325 15 MQ 3592 LGA DTW
## 84433: 824 -5 1045 1 MQ 3599 LGA SDF
## air_time distance delay15 delay30 max_dep_delay min_dep_delay
## 1: 157 1035 0 0 89 -15
## 2: 142 733 0 0 65 -13
## 3: 143 733 0 0 126 -12
## 4: 139 733 0 0 263 -13
## 5: 145 733 0 0 140 -12
## ---
## 84429: 198 1416 1 0 53 -8
## 84430: 201 1416 0 0 385 -7
## 84431: 83 431 1 0 240 -16
## 84432: 75 502 0 0 68 -15
## 84433: 110 659 0 0 121 -20
We can have more than one column contribute to the order used to form the key.
## [1] "origin" "dest"
# Get flights of any origin to DTW
microbenchmark::microbenchmark(
nyc14[.(c("JFK", "LGA", "EWR"), 'DTW')],
nyc14[.(dest = "DTW")],
nyc14[dest == 'DTW']
)
## Unit: milliseconds
## expr min lq mean median
## nyc14[.(c("JFK", "LGA", "EWR"), "DTW")] 1.674798 2.458520 3.972620 3.122006
## nyc14[.(dest = "DTW")] 1.250904 1.993277 2.875099 2.496205
## nyc14[dest == "DTW"] 1.766600 2.520252 3.856505 3.119099
## uq max neval
## 4.394652 21.138023 100
## 3.195932 7.746685 100
## 4.234634 27.183198 100
We can of course combine this with j
and by
statements.
# Find the median departure delay for all flights to DTW
nyc14[.(unique(origin), 'DTW'),
.(med_dep_delay = as.numeric(median(dep_delay)), n = .N),
by = .(origin, dest, flight)
] %>%
.[order(origin, med_dep_delay, -n)] %>% # Order
.[n > 42] # Limit to at least ~1 flight per week
## origin dest flight med_dep_delay n
## 1: EWR DTW 5203 -3.5 70
## 2: EWR DTW 4297 -3.0 188
## 3: EWR DTW 5067 -3.0 67
## 4: EWR DTW 1080 -3.0 63
## 5: EWR DTW 1129 -3.0 47
## 6: EWR DTW 4246 -2.0 78
## 7: EWR DTW 1214 -2.0 45
## 8: EWR DTW 4118 0.0 196
## 9: EWR DTW 3837 0.0 55
## 10: EWR DTW 4382 1.0 213
## 11: EWR DTW 4898 5.0 63
## 12: EWR DTW 4247 6.0 88
## 13: EWR DTW 4381 12.0 146
## 14: LGA DTW 3478 -6.0 282
## 15: LGA DTW 2810 -6.0 184
## 16: LGA DTW 3340 -5.0 77
## 17: LGA DTW 731 -4.0 254
## 18: LGA DTW 3530 -4.0 148
## 19: LGA DTW 3631 -4.0 132
## 20: LGA DTW 3689 -4.0 131
## 21: LGA DTW 3603 -3.0 257
## 22: LGA DTW 831 -3.0 240
## 23: LGA DTW 181 -3.0 170
## 24: LGA DTW 2449 -3.0 120
## 25: LGA DTW 1131 -2.0 284
## 26: LGA DTW 2131 -2.0 247
## 27: LGA DTW 1231 -2.0 231
## 28: LGA DTW 189 -2.0 93
## 29: LGA DTW 2231 -1.0 296
## 30: LGA DTW 2331 -1.0 241
## 31: LGA DTW 3592 0.0 119
## origin dest flight med_dep_delay n
In data.table
when we designate columns as keys, the rows are re-ordered by reference in increasing order. This physically reorders the rows but uses the same locations in memory for the columns.
In contrast to keys, we can use secondary indices to store the row order without physically reordering the entire data.table
. With an index, the original columns are left as is in memory but an attribute index
is created to store the order of the data.table
.
This works something like this:
# Example of a secondary index on a data-frame
df = data.frame(
ID = sample(LETTERS, 100, replace = TRUE),
value = rpois(100, 10)
)
df$index = order(df$ID)
head(df)
## ID value index
## 1 U 13 34
## 2 E 18 49
## 3 J 15 60
## 4 B 6 78
## 5 F 14 98
## 6 T 8 4
## ID value index
## 34 A 8 8
## 49 A 8 9
## 60 A 8 44
## 78 A 9 17
## 98 A 11 80
## 4 B 6 78
Here is an example using the nyc14
data to count the flights to DTW using an index to subset.
## [1] "dest"
## [1] "row.names" "class" ".internal.selfref"
## [4] "names" "sorted" "index"
## [1] 6010
> What is the advantage of indexing by setting a key?
# Generate two columns of data with 10 million rows
N = 1e8
DT = data.table(group = sample(1:26, N, replace = TRUE))
DT = DT[ ,.(count=rpois(.N, group)), by = group]
# See size of this data
print(object.size(DT), units = 'MB')
## 762.9 Mb
## [1] FALSE
tm0 = system.time({
ans0 <- DT[group == 1 | group == 26, .(lambda_hat = mean(count)), by = group]
}, gcFirst = TRUE)
## Alternate unkeyed approach
indices(DT)
## NULL
tm1 = system.time({
ans1 <-
DT[, .(lambda_hat = mean(count)), by = group][group == 1 | group == 26]
}, gcFirst = TRUE)
# Set index
tm_index = system.time({
setindex(DT, group, verbose = TRUE)
}, gcFirst = TRUE)
## forder.c received 100000000 rows and 2 columns
## forder took 1.692 sec
## [1] "group"
# Indexed approach
tm2 = system.time({
ans2 <-
DT[ .( c(1, 26) ), .(lambda_hat = mean(count)), by = group, on = 'group']
}, gcFirst = TRUE)
# Set key
tm_key = system.time({
setkey(DT, group, verbose = TRUE)
})
## setkey on columns [group] using existing index 'group'
## reorder took 0.823s elapsed (1.033s cpu)
## [1] "group"
# keyed approach
tm3 = system.time({
ans3 <- DT[ .( c(1, 26) ), .(lambda_hat = mean(count)), by = group]
}, gcFirst = TRUE)
# Compare timings (seconds)
rbind(subset_first = tm0, mean_first = tm1, index = tm2, key = tm3)[, 'elapsed']
## subset_first mean_first index key
## 2.205 2.379 0.541 0.125
## tm_key tm_index
## 0.824 1.436
data.table
’sThere is often a need to join information stored across two or more data frames. In R we have previously used dplyr::left_join()
or similar *_join()
functions for this. In base R two data.frames can be joined using the S3 generic merge()
which dispatches the merge.data.frame()
method.
merge()
The data.table
package defines a merge.data.table()
method. In addition to the tables to join, there are two key parameters: by
and all
.
We use by
to specify which columns to join on.
nyc14[ origin == 'JFK', .N, .(carrier)] %>%
merge( . , nycflights13::airlines, by = 'carrier', all = TRUE) %>%
.[order(-N)]
## carrier N name
## 1: B6 34220 JetBlue Airways
## 2: DL 18860 Delta Air Lines Inc.
## 3: AA 11923 American Airlines Inc.
## 4: MQ 5444 Envoy Air
## 5: UA 3924 United Air Lines Inc.
## 6: VX 3138 Virgin America
## 7: US 2645 US Airways Inc.
## 8: EV 1069 ExpressJet Airlines Inc.
## 9: HA 260 Hawaiian Airlines Inc.
## 10: 9E NA Endeavor Air Inc.
## 11: AS NA Alaska Airlines Inc.
## 12: F9 NA Frontier Airlines Inc.
## 13: FL NA AirTran Airways Corporation
## 14: OO NA SkyWest Airlines Inc.
## 15: WN NA Southwest Airlines Co.
## 16: YV NA Mesa Airlines Inc.
The by
variables must exist in both tables. If not, use by.x
and by.y
instead.
We can specify inner (all=FALSE
), left (all.x=TRUE
), right (all.y=TRUE
), or full (all=TRUE
) joins using the all*
parameters.
nyc14[ origin == 'JFK', .N, .(carrier)] %>%
merge( . , nycflights13::airlines, by = 'carrier', all = FALSE) %>%
.[order(-N)]
## carrier N name
## 1: B6 34220 JetBlue Airways
## 2: DL 18860 Delta Air Lines Inc.
## 3: AA 11923 American Airlines Inc.
## 4: MQ 5444 Envoy Air
## 5: UA 3924 United Air Lines Inc.
## 6: VX 3138 Virgin America
## 7: US 2645 US Airways Inc.
## 8: EV 1069 ExpressJet Airlines Inc.
## 9: HA 260 Hawaiian Airlines Inc.
The resulting merge contains all columns from both tables with duplicate names not used in by
renamed using a suffix, i.e. col.x
or col.y
.
x = data.table( id = 0:4, letter = letters[26 - 0:4])
y = data.table( id = 1:5, letter = LETTERS[26 - 1:5])
merge(x, y, by = 'id', all = TRUE)
## id letter.x letter.y
## 1: 0 z <NA>
## 2: 1 y Y
## 3: 2 x X
## 4: 3 w W
## 5: 4 v V
## 6: 5 <NA> U
[]
There may be times where you wish to perform some computation using columns from two tables without the need for an explicit merge first.
In these cases you can use the DT1[DT2, ]
syntax for joins.
## id letter i.letter
## 1: 1 y Y
## 2: 2 x X
## 3: 3 w W
## 4: 4 v V
## 5: 5 <NA> U
## id letter i.letter
## 1: 0 <NA> z
## 2: 1 Y y
## 3: 2 X x
## 4: 3 W w
## 5: 4 V v
## id letter
## 1: 1 y
## 2: 2 x
## 3: 3 w
## 4: 4 v
## 5: 5 <NA>
## id letter
## 1: 1 y
## 2: 2 x
## 3: 3 w
## 4: 4 v
If we are matching on set keys, we do not need to provide these as on
.
## id letter i.letter
## 1: 1 y Y
## 2: 2 x X
## 3: 3 w W
## 4: 4 v V
## 5: 5 <NA> U
## id letter i.letter
## 1: 0 <NA> z
## 2: 1 Y y
## 3: 2 X x
## 4: 3 W w
## 5: 4 V v
Here is the previous flights example using this syntax.
airlines = data.table(nycflights13::airlines)
nyc14[origin == 'JFK', .N, keyby = carrier][airlines, nomatch = 0L][order(-N)]
## carrier N name
## 1: B6 34220 JetBlue Airways
## 2: DL 18860 Delta Air Lines Inc.
## 3: AA 11923 American Airlines Inc.
## 4: MQ 5444 Envoy Air
## 5: UA 3924 United Air Lines Inc.
## 6: VX 3138 Virgin America
## 7: US 2645 US Airways Inc.
## 8: EV 1069 ExpressJet Airlines Inc.
## 9: HA 260 Hawaiian Airlines Inc.
And here is some additional formatting …
nyc14[ origin == 'JFK', .N, keyby=carrier][airlines, nomatch=0L][order(-N)] %>%
.[, .(Carrier = name, N = format(N, big.mark = ','))] %>%
knitr::kable(align = 'r')
Carrier | N |
---|---|
JetBlue Airways | 34,220 |
Delta Air Lines Inc. | 18,860 |
American Airlines Inc. | 11,923 |
Envoy Air | 5,444 |
United Air Lines Inc. | 3,924 |
Virgin America | 3,138 |
US Airways Inc. | 2,645 |
ExpressJet Airlines Inc. | 1,069 |
Hawaiian Airlines Inc. | 260 |
The data.table
package has methods for reshaping data called melt()
and dcast()
which you will read about in the reshape vignette.
The important points to take away from the vignette are that:
melt()
reshapes wide data into long format, anddcast()
reshapes long data into wide format.Professor Shedden’s Large data sets in R
The homepage for the data.table projec.t
A short introduction from Dr. Clark’s workshop
See “Large Memory and out-of-memory data” on CRAN at HPC task view
The dtplyr
package provides dplyr
methods for data.table
objects (objects of class data.table). That is data.table
is the dtplyr
back end for dplyr
– the part facing away from the interactive user.
Additional exercises asking you to write data.table
syntax matching dplyr
(or SQL syntax) can be found in the linked workshop notes.