data.table
packagedata.table
The data.table
package in R provides an extension of thedata.frame
class that aims to be both more computationally and memory efficient. It is particularly well suited for large in-memory data sets and utilizes indexed keys to allow quick search, subset, and aggregate by group operations.
The data.table
package provides a function fread
for reading delimited files like read.table
and readr::read_delim
, but returns a data.table object instead. As with the tibble class, data.table inherits from data.frame. Here is an example using the 2014 New York City Flights data.
library(data.table)
nyc14 = fread('https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv')
class(nyc14)
## [1] "data.table" "data.frame"
You can also create a data.table like a data.frame using:
n = 1e3
data.table(a = 1:n, b=rnorm(n), c=sample(letters, n, replace=TRUE))
## a b c
## 1: 1 1.2027769 t
## 2: 2 -1.8309957 n
## 3: 3 -0.7729531 h
## 4: 4 -1.1258606 o
## 5: 5 0.7653633 v
## ---
## 996: 996 -0.8487881 x
## 997: 997 -1.8994197 t
## 998: 998 -1.6173459 n
## 999: 999 1.8103370 q
## 1000: 1000 -0.3518696 d
Question: How can we view the print method for class data.table?
The syntax for the data.tables
package is inspired by the bracket ([]
) notation for indexing matrices and data frames. At the same time, it aims to allow many common data operations (i.e. dplyr
verbs) to be expressed within these brackets. These goals are expressed at the start of the “Introduction to data.table” vignettte.
The basic idea is DT[i, j, by]
where we: + subset or filter rows in the i
statement, + select, transform, or create columns (variables) in the j
statement + and group with the by
statement.
Additional operations can also be expressed within the brackets. Remember that, even for a data.frame or matrix the left bracket [
is actually a function:
`[`
## .Primitive("[")
Below are some examples on the nyc14
data.table created using fread()
above:
nyc14
## year month day dep_time dep_delay arr_time arr_delay cancelled
## 1: 2014 1 1 914 14 1238 13 0
## 2: 2014 1 1 1157 -3 1523 13 0
## 3: 2014 1 1 1902 2 2224 9 0
## 4: 2014 1 1 722 -8 1014 -26 0
## 5: 2014 1 1 1347 2 1706 1 0
## ---
## 253312: 2014 10 31 1459 1 1747 -30 0
## 253313: 2014 10 31 854 -5 1147 -14 0
## 253314: 2014 10 31 1102 -8 1311 16 0
## 253315: 2014 10 31 1106 -4 1325 15 0
## 253316: 2014 10 31 824 -5 1045 1 0
## carrier tailnum flight origin dest air_time distance hour min
## 1: AA N338AA 1 JFK LAX 359 2475 9 14
## 2: AA N335AA 3 JFK LAX 363 2475 11 57
## 3: AA N327AA 21 JFK LAX 351 2475 19 2
## 4: AA N3EHAA 29 LGA PBI 157 1035 7 22
## 5: AA N319AA 117 JFK LAX 350 2475 13 47
## ---
## 253312: UA N23708 1744 LGA IAH 201 1416 14 59
## 253313: UA N33132 1758 EWR IAH 189 1400 8 54
## 253314: MQ N827MQ 3591 LGA RDU 83 431 11 2
## 253315: MQ N511MQ 3592 LGA DTW 75 502 11 6
## 253316: MQ N813MQ 3599 LGA SDF 110 659 8 24
Find all flights from LGA to DTW:
lga_dtw = nyc14[origin == 'LGA' & dest == 'DTW',]
Get the first and last rows of lga_dtw
:
lga_dtw[c(1,.N)]
## year month day dep_time dep_delay arr_time arr_delay cancelled carrier
## 1: 2014 1 1 901 -4 1102 -11 0 DL
## 2: 2014 10 31 1106 -4 1325 15 0 MQ
## tailnum flight origin dest air_time distance hour min
## 1: N917DL 181 LGA DTW 99 502 9 1
## 2: N511MQ 3592 LGA DTW 75 502 11 6
In the above, we used .N
to index the last row. This is a special symbol defined by data.table
to hold the number of rows or observations in the “current” group. “Current” here refers to the scope in which it is used.
Also, notice the difference from standard data.frame or matrix sub-setting:
# Indexing a matrix
ld_mat = as.matrix(lga_dtw)
ld_mat[c(1,nrow(ld_mat))]
## [1] "2014" "2014"
# Indexing a data.fame
ld_df = as.data.frame(lga_dtw)
dim(ld_df[1:2]) # What would happen if we called: ld_df[c(1,nrow(ld_df))] ?
## [1] 3663 2
You can also use the “i” clause to order a data.table:
lga_dtw[order(-month,-day,dep_time)]
## year month day dep_time dep_delay arr_time arr_delay cancelled
## 1: 2014 10 31 718 -2 904 -9 0
## 2: 2014 10 31 900 -5 1051 -22 0
## 3: 2014 10 31 939 -6 1138 -2 0
## 4: 2014 10 31 1106 -4 1325 15 0
## 5: 2014 10 31 1113 8 1317 8 0
## ---
## 3659: 2014 1 1 1302 17 1503 21 0
## 3660: 2014 1 1 1350 -5 1600 5 0
## 3661: 2014 1 1 1628 -1 1829 -8 0
## 3662: 2014 1 1 1920 130 2137 147 0
## 3663: 2014 1 1 2037 52 2242 52 0
## carrier tailnum flight origin dest air_time distance hour min
## 1: DL N320US 831 LGA DTW 74 502 7 18
## 2: DL N818DA 189 LGA DTW 79 502 9 0
## 3: MQ N528MQ 3478 LGA DTW 80 502 9 39
## 4: MQ N511MQ 3592 LGA DTW 75 502 11 6
## 5: DL N3758Y 2449 LGA DTW 74 502 11 13
## ---
## 3659: DL N331NW 1131 LGA DTW 98 502 13 2
## 3660: MQ N839MQ 3340 LGA DTW 101 502 13 50
## 3661: DL N310NW 2231 LGA DTW 98 502 16 28
## 3662: MQ N833MQ 3530 LGA DTW 103 502 19 20
## 3663: MQ N856MQ 3603 LGA DTW 105 502 20 37
Get the departure and arrival times, flight number, and carrier for all flights from LGA to DTW:
nyc14[origin == 'LGA' & dest == 'DTW', list(dep_time, arr_time, carrier, flight)]
## dep_time arr_time carrier flight
## 1: 901 1102 DL 181
## 2: 555 745 DL 731
## 3: 1302 1503 DL 1131
## 4: 1628 1829 DL 2231
## 5: 849 1058 MQ 3478
## ---
## 3659: 1613 1757 DL 2231
## 3660: 939 1138 MQ 3478
## 3661: 1912 2104 MQ 3603
## 3662: 1346 1535 MQ 3631
## 3663: 1106 1325 MQ 3592
Notice the use of list()
to select columns. A synonym for list()
within data.table
is .()
to save typing and enhance readability:
nyc14[origin == 'LGA' & dest == 'DTW', .(dep_time, arr_time, carrier, flight)]
## dep_time arr_time carrier flight
## 1: 901 1102 DL 181
## 2: 555 745 DL 731
## 3: 1302 1503 DL 1131
## 4: 1628 1829 DL 2231
## 5: 849 1058 MQ 3478
## ---
## 3659: 1613 1757 DL 2231
## 3660: 939 1138 MQ 3478
## 3661: 1912 2104 MQ 3603
## 3662: 1346 1535 MQ 3631
## 3663: 1106 1325 MQ 3592
Columns can also be selected using a character vector of column names.
nyc14[origin == 'LGA' & dest == 'DTW', c("dep_time", "arr_time", "carrier", "flight")]
## dep_time arr_time carrier flight
## 1: 901 1102 DL 181
## 2: 555 745 DL 731
## 3: 1302 1503 DL 1131
## 4: 1628 1829 DL 2231
## 5: 849 1058 MQ 3478
## ---
## 3659: 1613 1757 DL 2231
## 3660: 939 1138 MQ 3478
## 3661: 1912 2104 MQ 3603
## 3662: 1346 1535 MQ 3631
## 3663: 1106 1325 MQ 3592
We can deselect columns using negation (-
or !
) if we also set “with” to false:
nyc14 = nyc14[, -c("tailnum"), with=FALSE]
nyc14 = nyc14[, !c("cancelled", "year", "day", "hour", "min"), with=FALSE]
nyc14
## month dep_time dep_delay arr_time arr_delay carrier flight origin
## 1: 1 914 14 1238 13 AA 1 JFK
## 2: 1 1157 -3 1523 13 AA 3 JFK
## 3: 1 1902 2 2224 9 AA 21 JFK
## 4: 1 722 -8 1014 -26 AA 29 LGA
## 5: 1 1347 2 1706 1 AA 117 JFK
## ---
## 253312: 10 1459 1 1747 -30 UA 1744 LGA
## 253313: 10 854 -5 1147 -14 UA 1758 EWR
## 253314: 10 1102 -8 1311 16 MQ 3591 LGA
## 253315: 10 1106 -4 1325 15 MQ 3592 LGA
## 253316: 10 824 -5 1045 1 MQ 3599 LGA
## dest air_time distance
## 1: LAX 359 2475
## 2: LAX 363 2475
## 3: LAX 351 2475
## 4: PBI 157 1035
## 5: LAX 350 2475
## ---
## 253312: IAH 201 1416
## 253313: IAH 189 1400
## 253314: RDU 83 431
## 253315: DTW 75 502
## 253316: SDF 110 659
The “j”-clause can be used to compute with column variables like “dplyr::summarize”. Below, we find the upper and lower quintiles of departure delays for flights between LGA and DTW during this period:
lga_dtw[,.(p20=quantile(dep_delay,.2), p80=quantile(dep_delay,.8))]
## p20 p80
## 1: -6 10
The “j”-clause can also be used to compute with column variables much like “dplyr::transmute”. Here, we create new columns indicating whether the arrival or departure delays were greater than 15 minutes:
nyc14[,.(delay15 = 1*{dep_delay > 15 | arr_delay > 15})]
## delay15
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## ---
## 253312: 0
## 253313: 0
## 253314: 1
## 253315: 0
## 253316: 0
To get behavior like “dplyr::mutate” we need reference semantics.
First, using the special function “:=()
” we can add the delay15
above.
nyc14[,`:=`(delay15 = 1L*{dep_delay > 15 | arr_delay > 15})]
nyc14
## month dep_time dep_delay arr_time arr_delay carrier flight origin
## 1: 1 914 14 1238 13 AA 1 JFK
## 2: 1 1157 -3 1523 13 AA 3 JFK
## 3: 1 1902 2 2224 9 AA 21 JFK
## 4: 1 722 -8 1014 -26 AA 29 LGA
## 5: 1 1347 2 1706 1 AA 117 JFK
## ---
## 253312: 10 1459 1 1747 -30 UA 1744 LGA
## 253313: 10 854 -5 1147 -14 UA 1758 EWR
## 253314: 10 1102 -8 1311 16 MQ 3591 LGA
## 253315: 10 1106 -4 1325 15 MQ 3592 LGA
## 253316: 10 824 -5 1045 1 MQ 3599 LGA
## dest air_time distance delay15
## 1: LAX 359 2475 0
## 2: LAX 363 2475 0
## 3: LAX 351 2475 0
## 4: PBI 157 1035 0
## 5: LAX 350 2475 0
## ---
## 253312: IAH 201 1416 0
## 253313: IAH 189 1400 0
## 253314: RDU 83 431 1
## 253315: DTW 75 502 0
## 253316: SDF 110 659 0
Pay close attention to the first line of the code above. Why didn’t it print and why didn’t we need an assignment? Also, what is the role of that “L”?
Another way to accomplish this is as below. Which do you find more intuitive?
nyc14[,"delay30" := 1L*{dep_delay > 30 | arr_delay > 30}]
nyc14
## month dep_time dep_delay arr_time arr_delay carrier flight origin
## 1: 1 914 14 1238 13 AA 1 JFK
## 2: 1 1157 -3 1523 13 AA 3 JFK
## 3: 1 1902 2 2224 9 AA 21 JFK
## 4: 1 722 -8 1014 -26 AA 29 LGA
## 5: 1 1347 2 1706 1 AA 117 JFK
## ---
## 253312: 10 1459 1 1747 -30 UA 1744 LGA
## 253313: 10 854 -5 1147 -14 UA 1758 EWR
## 253314: 10 1102 -8 1311 16 MQ 3591 LGA
## 253315: 10 1106 -4 1325 15 MQ 3592 LGA
## 253316: 10 824 -5 1045 1 MQ 3599 LGA
## dest air_time distance delay15 delay30
## 1: LAX 359 2475 0 0
## 2: LAX 363 2475 0 0
## 3: LAX 351 2475 0 0
## 4: PBI 157 1035 0 0
## 5: LAX 350 2475 0 0
## ---
## 253312: IAH 201 1416 0 0
## 253313: IAH 189 1400 0 0
## 253314: RDU 83 431 1 0
## 253315: DTW 75 502 0 0
## 253316: SDF 110 659 0 0
To perform operations group-wise use a by
argument after the “j” statement.
nyc14[,"delay30" := 1L*{dep_delay > 30 | arr_delay > 30}]
# Find percent of flights with delays of 15 minutes or more by carrier
nyc14[,.(del30_pct = 100*mean(delay30)), by=carrier]
## carrier del30_pct
## 1: AA 15.21177
## 2: AS 13.93728
## 3: B6 18.43117
## 4: DL 15.69945
## 5: EV 23.55157
## 6: F9 34.88372
## 7: FL 23.66107
## 8: HA 15.00000
## 9: MQ 18.32534
## 10: VX 13.00813
## 11: WN 21.30734
## 12: UA 18.89684
## 13: US 10.85373
## 14: OO 18.00000
We can use a list to specify multiple grouping variables.
# Find percent of flights with delays of 15 minutes or more by carrier and origin.
nyc14[,.(del30_pct = 100*mean(delay30)), by=.(carrier, origin)]
## carrier origin del30_pct
## 1: AA JFK 15.71752
## 2: AA LGA 14.39898
## 3: AA EWR 16.53454
## 4: AS EWR 13.93728
## 5: B6 JFK 18.35769
## 6: B6 EWR 17.59547
## 7: B6 LGA 19.91224
## 8: DL LGA 15.09909
## 9: DL EWR 17.19239
## 10: DL JFK 15.96501
## 11: EV EWR 23.57032
## 12: EV LGA 23.69027
## 13: F9 LGA 34.88372
## 14: FL LGA 23.66107
## 15: HA JFK 15.00000
## 16: MQ LGA 18.37349
## 17: MQ JFK 17.94636
## 18: VX JFK 12.93818
## 19: VX EWR 13.26034
## 20: WN EWR 21.05263
## 21: WN LGA 21.47992
## 22: MQ EWR 26.94611
## 23: EV JFK 21.70253
## 24: UA EWR 19.45798
## 25: UA LGA 18.08973
## 26: UA JFK 15.01019
## 27: US EWR 11.03448
## 28: US JFK 13.95085
## 29: US LGA 10.02353
## 30: OO LGA 18.09045
## 31: OO EWR 0.00000
## 32: VX LGA 0.00000
## carrier origin del30_pct
As with standard data.frame
indexing, we can compose data.table
bracketed expressions using chaining.
## Find max departure delay by flight among all flights from LGA to DTW
## Then, select flights within the shortest 10% of max_delay
nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)),
by = .(carrier, flight)
][, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))
][max_delay < max_delay_q10, -"max_delay_q10", with=FALSE
]
## carrier flight max_delay
## 1: DL 1107 -2
## 2: EV 5405 -5
## 3: DL 796 -8
## 4: EV 5596 -10
If you prefer pipes %>%
for clarity, you can use them by appending a .
before the opening bracket:
# Same as above, but using magrittr::%>%
nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)), by = .(carrier, flight)] %>%
.[, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))] %>%
.[max_delay < max_delay_q10, -"max_delay_q10", with=FALSE]
## carrier flight max_delay
## 1: DL 1107 -2
## 2: EV 5405 -5
## 3: DL 796 -8
## 4: EV 5596 -10
Here is a quick timing comparison between chaining, piping, and an equivalent dplyr
expression. The differences are negligible for this data size.
# data.table chaining
f_chain <- function(){
nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)),
by = .(carrier, flight)
][, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))
][max_delay < max_delay_q10, -"max_delay_q10", with=FALSE
]
}
# data.table with pipe
f_pipe <- function(){
nyc14[origin=='LGA' & dest=='DTW', .(max_delay = max(dep_delay)), by = .(carrier, flight)] %>%
.[, .(carrier, flight, max_delay, max_delay_q10 = quantile(max_delay, .1))] %>%
.[max_delay < max_delay_q10, -"max_delay_q10", with=FALSE]
}
# dplyr pipe
f_dplyr <- function(){
nyc14 %>%
filter(origin=='LGA' & dest=='DTW') %>%
group_by(carrier, flight) %>%
summarize(max_delay = max(dep_delay)) %>%
ungroup() %>%
mutate(max_delay_q10 = quantile(max_delay, .1)) %>%
filter(max_delay < max_delay_q10) %>%
select(-max_delay_q10)
}
# compare timing
microbenchmark(f_chain, f_pipe, f_dplyr, times=1e3L)
## Unit: nanoseconds
## expr min lq mean median uq max neval
## f_chain 26 33 75.513 35 39 11949 1000
## f_pipe 23 31 45.389 33 36 451 1000
## f_dplyr 25 31 48.424 33 36 7176 1000
.SD
Recall that the special symbol .N
contains the number of rows in each subset defined using by
.
# How many total flights by each carrier to DTW?
nyc14[dest=='DTW', .N, by=carrier]
## carrier N
## 1: DL 3095
## 2: EV 1584
## 3: MQ 1331
There is another special symbol .SD
which references the entire subset of data for each group. It is itself a data.table
as we can see using the code below.
nyc14[dest=='DTW', .(rows=nrow(.SD), cols=ncol(.SD), n=.N), by=carrier]
## carrier rows cols n
## 1: DL 3095 10 3095
## 2: EV 1584 10 1584
## 3: MQ 1331 10 1331
nyc14[dest=='DTW', print(.SD), by=carrier]
## month dep_time dep_delay arr_time arr_delay flight origin dest
## 1: 1 901 -4 1102 -11 181 LGA DTW
## 2: 1 555 -5 745 -7 731 LGA DTW
## 3: 1 1302 17 1503 21 1131 LGA DTW
## 4: 1 1601 -4 1815 -8 2184 JFK DTW
## 5: 1 1628 -1 1829 -8 2231 LGA DTW
## ---
## 3091: 10 1628 -2 1817 -5 825 EWR DTW
## 3092: 10 718 -2 904 -9 831 LGA DTW
## 3093: 10 1235 0 1451 10 1131 LGA DTW
## 3094: 10 1813 -2 2010 -11 2131 LGA DTW
## 3095: 10 1613 -2 1757 -23 2231 LGA DTW
## air_time distance
## 1: 99 502
## 2: 93 502
## 3: 98 502
## 4: 97 509
## 5: 98 502
## ---
## 3091: 89 488
## 3092: 74 502
## 3093: 94 502
## 3094: 83 502
## 3095: 82 502
## month dep_time dep_delay arr_time arr_delay flight origin dest
## 1: 1 1225 10 1428 9 4118 EWR DTW
## 2: 1 2055 -4 2305 11 4247 EWR DTW
## 3: 1 1639 39 1853 51 4381 EWR DTW
## 4: 1 1250 10 1452 12 5078 EWR DTW
## 5: 1 1117 170 1323 167 4246 EWR DTW
## ---
## 1580: 10 842 -8 1033 -17 4297 EWR DTW
## 1581: 10 910 -5 1059 -18 4911 EWR DTW
## 1582: 10 731 2 922 -7 4246 EWR DTW
## 1583: 10 2133 169 2322 154 4247 EWR DTW
## 1584: 10 913 -2 1115 -2 4911 EWR DTW
## air_time distance
## 1: 103 488
## 2: 102 488
## 3: 116 488
## 4: 97 488
## 5: 102 488
## ---
## 1580: 79 488
## 1581: 78 488
## 1582: 80 488
## 1583: 83 488
## 1584: 86 488
## month dep_time dep_delay arr_time arr_delay flight origin dest
## 1: 1 849 -6 1058 -2 3478 LGA DTW
## 2: 1 1920 130 2137 147 3530 LGA DTW
## 3: 1 2037 52 2242 52 3603 LGA DTW
## 4: 1 1128 33 1330 40 3689 LGA DTW
## 5: 1 1350 -5 1600 5 3340 LGA DTW
## ---
## 1327: 10 1058 -12 1252 -18 3592 LGA DTW
## 1328: 10 939 -6 1138 -2 3478 LGA DTW
## 1329: 10 1912 -3 2104 -6 3603 LGA DTW
## 1330: 10 1346 -4 1535 -10 3631 LGA DTW
## 1331: 10 1106 -4 1325 15 3592 LGA DTW
## air_time distance
## 1: 103 502
## 2: 103 502
## 3: 105 502
## 4: 98 502
## 5: 101 502
## ---
## 1327: 76 502
## 1328: 80 502
## 1329: 82 502
## 1330: 82 502
## 1331: 75 502
## Empty data.table (0 rows) of 1 col: carrier
We can pass an additional argument .SDcols
to the bracketing function to limit the columns in .SD
. Note that .SD
does not contain the grouping columns.
nyc14[dest=='DTW', .(rows=nrow(.SD), cols=ncol(.SD), n=.N), by=carrier,
.SDcols = c("origin", "dest", "flight", "dep_time")]
## carrier rows cols n
## 1: DL 3095 4 3095
## 2: EV 1584 4 1584
## 3: MQ 1331 4 1331
nyc14[dest=='DTW', print(.SD), by=carrier, .SDcols = c("origin", "dest", "flight", "dep_time")]
## origin dest flight dep_time
## 1: LGA DTW 181 901
## 2: LGA DTW 731 555
## 3: LGA DTW 1131 1302
## 4: JFK DTW 2184 1601
## 5: LGA DTW 2231 1628
## ---
## 3091: EWR DTW 825 1628
## 3092: LGA DTW 831 718
## 3093: LGA DTW 1131 1235
## 3094: LGA DTW 2131 1813
## 3095: LGA DTW 2231 1613
## origin dest flight dep_time
## 1: EWR DTW 4118 1225
## 2: EWR DTW 4247 2055
## 3: EWR DTW 4381 1639
## 4: EWR DTW 5078 1250
## 5: EWR DTW 4246 1117
## ---
## 1580: EWR DTW 4297 842
## 1581: EWR DTW 4911 910
## 1582: EWR DTW 4246 731
## 1583: EWR DTW 4247 2133
## 1584: EWR DTW 4911 913
## origin dest flight dep_time
## 1: LGA DTW 3478 849
## 2: LGA DTW 3530 1920
## 3: LGA DTW 3603 2037
## 4: LGA DTW 3689 1128
## 5: LGA DTW 3340 1350
## ---
## 1327: LGA DTW 3592 1058
## 1328: LGA DTW 3478 939
## 1329: LGA DTW 3603 1912
## 1330: LGA DTW 3631 1346
## 1331: LGA DTW 3592 1106
## Empty data.table (0 rows) of 1 col: carrier
This can be useful in the “j” statement because it allows use lapply
or other functions returning a list to.
# What is the mean departure & arrival delay for each flight to DTW?
nyc14[dest=='DTW', lapply(.SD, mean), by=.(origin, dest, carrier, flight),
.SDcols = c("arr_delay", "dep_delay")]
## origin dest carrier flight arr_delay dep_delay
## 1: LGA DTW DL 181 5.694118 11.8176471
## 2: LGA DTW DL 731 -2.637795 -0.6653543
## 3: LGA DTW DL 1131 5.774648 9.3450704
## 4: JFK DTW DL 2184 -8.000000 -4.0000000
## 5: LGA DTW DL 2231 10.861486 17.9054054
## ---
## 168: EWR DTW DL 2509 -12.037037 -2.5555556
## 169: EWR DTW EV 5283 -14.947368 -1.9473684
## 170: EWR DTW EV 4886 -18.750000 -6.0000000
## 171: LGA DTW EV 5596 -12.000000 -10.0000000
## 172: EWR DTW EV 4911 -10.000000 -3.5000000
Columns can also be specified as ranges in .SDcols
.
nyc14[dest=='DTW', lapply(.SD, mean),
by=.(origin, dest, carrier, flight),
.SDcols = arr_delay:dep_delay]
## origin dest carrier flight arr_delay arr_time dep_delay
## 1: LGA DTW DL 181 5.694118 1120.9706 11.8176471
## 2: LGA DTW DL 731 -2.637795 754.5906 -0.6653543
## 3: LGA DTW DL 1131 5.774648 1450.4366 9.3450704
## 4: JFK DTW DL 2184 -8.000000 1815.0000 -4.0000000
## 5: LGA DTW DL 2231 10.861486 1872.9527 17.9054054
## ---
## 168: EWR DTW DL 2509 -12.037037 1412.0000 -2.5555556
## 169: EWR DTW EV 5283 -14.947368 1239.1579 -1.9473684
## 170: EWR DTW EV 4886 -18.750000 1398.2500 -6.0000000
## 171: LGA DTW EV 5596 -12.000000 1018.0000 -10.0000000
## 172: EWR DTW EV 4911 -10.000000 1087.0000 -3.5000000
One of the goals of the data.table
package is to use less memory. This is achieved in part by preferring “shallow” copies by reference over “deep copies” by value when appropriate. When an object is copied by reference it shares physical memory address with the object it is copied from. This is more efficient, but may lead to confusion as changing the value in memory also changes what is pointed to by both objects.
In the example below, we create a data.table
DT1
and then create a copy DT2
. Typical R objects would be copied by value, but DT2
is copied by reference. We can ask for a copy by value explicitly using copy()
.
DT1 = data.table(A=5:1, B=letters[5:1])
DT2 = DT1 # Copy by reference
DT3 = copy(DT1) # Copy by value
DT4 = DT1[,.SD]
DT1[,C := 2*A] # Create a new column
DT1
## A B C
## 1: 5 e 10
## 2: 4 d 8
## 3: 3 c 6
## 4: 2 b 4
## 5: 1 a 2
DT2
## A B C
## 1: 5 e 10
## 2: 4 d 8
## 3: 3 c 6
## 4: 2 b 4
## 5: 1 a 2
DT3
## A B
## 1: 5 e
## 2: 4 d
## 3: 3 c
## 4: 2 b
## 5: 1 a
After updating DT1
to include a new column, C
, the column appears in DT2
as well because DT1
and DT2
refer to the same object.
In the last example above we used reference semantics to create a new column in DT1
without copying the other other columns and reassigning to a new DT1
object.
A similar notation using :=
as a function is below.
DT1[,`:=`(D=C+1)]
One way in which this is useful is to modify subsets of a data.table
without re-allocating the entire thing.
# Truncated all arrival delays at 0
nyc14[arr_delay<0, arr_delay := 0][]
## month dep_time dep_delay arr_time arr_delay carrier flight origin
## 1: 1 914 14 1238 13 AA 1 JFK
## 2: 1 1157 -3 1523 13 AA 3 JFK
## 3: 1 1902 2 2224 9 AA 21 JFK
## 4: 1 722 -8 1014 0 AA 29 LGA
## 5: 1 1347 2 1706 1 AA 117 JFK
## ---
## 253312: 10 1459 1 1747 0 UA 1744 LGA
## 253313: 10 854 -5 1147 0 UA 1758 EWR
## 253314: 10 1102 -8 1311 16 MQ 3591 LGA
## 253315: 10 1106 -4 1325 15 MQ 3592 LGA
## 253316: 10 824 -5 1045 1 MQ 3599 LGA
## dest air_time distance
## 1: LAX 359 2475
## 2: LAX 363 2475
## 3: LAX 351 2475
## 4: PBI 157 1035
## 5: LAX 350 2475
## ---
## 253312: IAH 201 1416
## 253313: IAH 189 1400
## 253314: RDU 83 431
## 253315: DTW 75 502
## 253316: SDF 110 659
range(nyc14$arr_delay)
## [1] 0 1494
We can also delete columns by reference using NULL
:
nyc14[,c("month") := NULL]
# i.e. nyc14$month = NULL
We can use this with by
to accomplish tasks such as adding a column showing the maximum departure delay by flight.
nyc14[,max_dep_delay := max(dep_delay), by=.(carrier, flight)][]
## dep_time dep_delay arr_time arr_delay carrier flight origin dest
## 1: 914 14 1238 13 AA 1 JFK LAX
## 2: 1157 -3 1523 13 AA 3 JFK LAX
## 3: 1902 2 2224 9 AA 21 JFK LAX
## 4: 722 -8 1014 0 AA 29 LGA PBI
## 5: 1347 2 1706 1 AA 117 JFK LAX
## ---
## 253312: 1459 1 1747 0 UA 1744 LGA IAH
## 253313: 854 -5 1147 0 UA 1758 EWR IAH
## 253314: 1102 -8 1311 16 MQ 3591 LGA RDU
## 253315: 1106 -4 1325 15 MQ 3592 LGA DTW
## 253316: 824 -5 1045 1 MQ 3599 LGA SDF
## air_time distance max_dep_delay
## 1: 359 2475 156
## 2: 363 2475 284
## 3: 351 2475 848
## 4: 157 1035 89
## 5: 350 2475 248
## ---
## 253312: 201 1416 385
## 253313: 189 1400 42
## 253314: 83 431 240
## 253315: 75 502 68
## 253316: 110 659 121
With the functional form you can create multiple columns by reference.
nyc14[, `:=`(max_dep_delay = max(dep_delay),
min_dep_delay = min(dep_delay)
),
by=.(carrier, flight)
][]
## dep_time dep_delay arr_time arr_delay carrier flight origin dest
## 1: 914 14 1238 13 AA 1 JFK LAX
## 2: 1157 -3 1523 13 AA 3 JFK LAX
## 3: 1902 2 2224 9 AA 21 JFK LAX
## 4: 722 -8 1014 0 AA 29 LGA PBI
## 5: 1347 2 1706 1 AA 117 JFK LAX
## ---
## 253312: 1459 1 1747 0 UA 1744 LGA IAH
## 253313: 854 -5 1147 0 UA 1758 EWR IAH
## 253314: 1102 -8 1311 16 MQ 3591 LGA RDU
## 253315: 1106 -4 1325 15 MQ 3592 LGA DTW
## 253316: 824 -5 1045 1 MQ 3599 LGA SDF
## air_time distance max_dep_delay min_dep_delay
## 1: 359 2475 156 -12
## 2: 363 2475 284 -16
## 3: 351 2475 848 -11
## 4: 157 1035 89 -15
## 5: 350 2475 248 -10
## ---
## 253312: 201 1416 385 -7
## 253313: 189 1400 42 -5
## 253314: 83 431 240 -16
## 253315: 75 502 68 -15
## 253316: 110 659 121 -20
The last set of empty brackets above is a short-hand for a subsequent call to print(nyc14)
.
The material on in this section is based on the vignette here.
Above we used “indexing” in a generic sense to mean “subsetting”. What we mean by “indexing” here is more specific and technical: we create an indexed data table by designating specific columns as keys and sorting the table by these keys to create more efficient lookups and aggregations.
We can designate one or more columns as a key in data.table
using setkey()
.
setkey(nyc14, origin) #also, setkeyv(nyc14, "origin") if character is preferred.
key(nyc14)
## [1] "origin"
In data.table
when we designate columns as keys, the rows are re-ordered by reference in increasing order. In other words, the original columns are left as is in memory but a new column key
is created to store the order of the data.table
. This works something like this:
# Example of keys on a data-frame
df = data.frame(ID = sample(LETTERS,100,replace=TRUE), value=rpois(100, 10))
df$key = order(df$ID)
head(df)
## ID value key
## 1 P 8 4
## 2 D 13 50
## 3 B 11 61
## 4 A 8 80
## 5 O 9 88
## 6 Y 5 3
head(df[df$key, ])
## ID value key
## 4 A 8 80
## 50 A 12 73
## 61 A 8 1
## 80 A 17 48
## 88 A 8 17
## 3 B 11 61
After a key has been set, we can compute in the “i”-statement using lists:
nyc14[.("LGA")]
## month dep_time dep_delay arr_time arr_delay carrier flight origin
## 1: 1 722 -8 1014 -26 AA 29 LGA
## 2: 1 553 -7 739 -6 AA 301 LGA
## 3: 1 623 -7 815 0 AA 303 LGA
## 4: 1 652 -8 833 -17 AA 305 LGA
## 5: 1 738 -2 940 15 AA 307 LGA
## ---
## 84429: 10 609 24 843 -5 UA 1714 LGA
## 84430: 10 1459 1 1747 -30 UA 1744 LGA
## 84431: 10 1102 -8 1311 16 MQ 3591 LGA
## 84432: 10 1106 -4 1325 15 MQ 3592 LGA
## 84433: 10 824 -5 1045 1 MQ 3599 LGA
## dest air_time distance
## 1: PBI 157 1035
## 2: ORD 142 733
## 3: ORD 143 733
## 4: ORD 139 733
## 5: ORD 145 733
## ---
## 84429: IAH 198 1416
## 84430: IAH 201 1416
## 84431: RDU 83 431
## 84432: DTW 75 502
## 84433: SDF 110 659
We can have more than one column contribute to the order used to form the key.
# key by origin and destination
setkey(nyc14,origin,dest)
key(nyc14)
## [1] "origin" "dest"
# Get flighs of any origin to DTW
nyc14[.(unique(origin),'DTW')]
## month dep_time dep_delay arr_time arr_delay carrier flight origin
## 1: 1 1054 -6 1300 4 DL 2508 EWR
## 2: 1 905 -7 1121 3 DL 1807 EWR
## 3: 1 1225 10 1428 9 EV 4118 EWR
## 4: 1 2055 -4 2305 11 EV 4247 EWR
## 5: 1 1639 39 1853 51 EV 4381 EWR
## ---
## 6006: 10 1613 -2 1757 -23 DL 2231 LGA
## 6007: 10 939 -6 1138 -2 MQ 3478 LGA
## 6008: 10 1912 -3 2104 -6 MQ 3603 LGA
## 6009: 10 1346 -4 1535 -10 MQ 3631 LGA
## 6010: 10 1106 -4 1325 15 MQ 3592 LGA
## dest air_time distance
## 1: DTW 101 488
## 2: DTW 102 488
## 3: DTW 103 488
## 4: DTW 102 488
## 5: DTW 116 488
## ---
## 6006: DTW 82 502
## 6007: DTW 80 502
## 6008: DTW 82 502
## 6009: DTW 82 502
## 6010: DTW 75 502
We can of course combine this with ‘j’ and ‘by’ statements.
# Find the median departure delay for all flights to DTW
nyc14[.(unique(origin),'DTW'),
.(med_dep_delay = as.numeric(median(dep_delay)), n=.N),
by=.(origin, dest, flight)
] %>%
.[order(origin, med_dep_delay, -n)] %>% # Order
.[n>42] # Limit to at least ~1 flight per week
## origin dest flight med_dep_delay n
## 1: EWR DTW 5203 -3.5 70
## 2: EWR DTW 4297 -3.0 188
## 3: EWR DTW 5067 -3.0 67
## 4: EWR DTW 1080 -3.0 63
## 5: EWR DTW 1129 -3.0 47
## 6: EWR DTW 4246 -2.0 78
## 7: EWR DTW 1214 -2.0 45
## 8: EWR DTW 4118 0.0 196
## 9: EWR DTW 3837 0.0 55
## 10: EWR DTW 4382 1.0 213
## 11: EWR DTW 4898 5.0 63
## 12: EWR DTW 4247 6.0 88
## 13: EWR DTW 4381 12.0 146
## 14: LGA DTW 3478 -6.0 282
## 15: LGA DTW 2810 -6.0 184
## 16: LGA DTW 3340 -5.0 77
## 17: LGA DTW 731 -4.0 254
## 18: LGA DTW 3530 -4.0 148
## 19: LGA DTW 3631 -4.0 132
## 20: LGA DTW 3689 -4.0 131
## 21: LGA DTW 3603 -3.0 257
## 22: LGA DTW 831 -3.0 240
## 23: LGA DTW 181 -3.0 170
## 24: LGA DTW 2449 -3.0 120
## 25: LGA DTW 1131 -2.0 284
## 26: LGA DTW 2131 -2.0 247
## 27: LGA DTW 1231 -2.0 231
## 28: LGA DTW 189 -2.0 93
## 29: LGA DTW 2231 -1.0 296
## 30: LGA DTW 2331 -1.0 241
## 31: LGA DTW 3592 0.0 119
## origin dest flight med_dep_delay n
What is the advantage of indexing by setting a key?
# Generate two columns of data with 10 million rows
N = 1e8
DT = data.table(group = sample(1:26, N, replace=TRUE))
DT = DT[ ,.(count=rpois(.N,group)), keyby=group]
# See size of this data
print(object.size(DT),units='MB')
## 762.9 Mb
## Unkeyd approach
tm0 = system.time({
ans0 <- DT[group==1 | group==26, .(lambda_hat = mean(count)), by=group]
})
## Alternate unkeyed approach
tm1 = system.time({
ans1 <- DT[, .(lambda_hat = mean(count)), by=group][group==1 | group==26]
})
# Set key
setkey(DT,group)
# keyed approach
tm2 = system.time({
ans2 <- DT[.(c(1,26)),.(lambda_hat = mean(count)), by=group]
})
# Compare timings (seconds)
rbind(tm0,tm1,tm2)[,'elapsed']
## tm0 tm1 tm2
## 1.633 0.971 0.153
rm(DT,tm0,tm1,tm2)
We did not cover this in class, but you may wish to read this vignette.
Professor Shedden’s Large data sets in R
The data.table
vignettes on CRAN
The project homepage for data.table
A short introduction from Dr. Clark’s workshop
See “Large Memory and out-of-memory data” on CRAN’s HPC task view
The dtplyr
package provides dplyr
methods for data.table
objects (objects of class data.table). That is dtplyr
is the data.table
backend for dplyr
– the part facing away from the interactive user.
Practice Exam Questions. A similar question will appear on the final exam. Here are example solutions.