Instructions

Consider the 2014 flights data used for the data.table examples. For each code snippent first provide a single-setence summary of the value(s) being computed. Make your answer as concise and descriptive as possible.

Questions

  1. Consider the following data.table expression:
tab1 =
  nyc14[ , .(n = .N), by=.(origin, dest)] %>%
    .[, .(origin, n = n, pct = n / sum(n)), by=dest] %>%
    .[pct > .75] %>%
    .[order(-pct, dest)]
  1. Provide a one-sentence summary of what is being computed.

This code finds all destination cities for which more than 75% of the flights are from a single origin airport.

  1. Provide a translation using dplyr syntax.
nyc14 %>%
  group_by(dest, origin) %>%
  summarize(n=n()) %>%
  mutate(pct = n/sum(n)) %>%
  filter(pct > .75) %>%
  arrange(-pct, dest)
## # A tibble: 47 x 4
## # Groups:   dest [47]
##     dest origin     n   pct
##    <chr>  <chr> <int> <dbl>
##  1   ABQ    JFK   278     1
##  2   ACK    JFK   277     1
##  3   AGS    LGA     3     1
##  4   ALB    EWR   169     1
##  5   ANC    EWR    13     1
##  6   AVP    EWR     1     1
##  7   BDL    EWR   210     1
##  8   BGR    LGA   274     1
##  9   BHM    LGA   124     1
## 10   BUR    JFK   289     1
## # ... with 37 more rows

  1. Consider the dplyr code snippet below.
tab2 =
  nyc14 %>%
    group_by(origin, dest, carrier) %>%
    summarize(n = n()) %>%
    filter(n >= 80) %>%
    group_by(origin, carrier) %>%
    summarize(n = n()) %>%
    arrange(origin,-n)
  1. Provide a one-sentence summary of what is being computed. For each carrier and origin airport, this code counts the number of destination cities with at least 80 observed flights.

  2. Provide a translation using data.table syntax.

nyc14[, .(n=.N), by = .(origin, dest, carrier)
     ][n>=80
     ][, .(n=.N), by = .(origin, carrier)
     ][order(origin,-n)
     ]
##     origin carrier  n
##  1:    EWR      EV 47
##  2:    EWR      UA 29
##  3:    EWR      WN  8
##  4:    EWR      B6  7
##  5:    EWR      DL  4
##  6:    EWR      AA  3
##  7:    EWR      US  3
##  8:    EWR      VX  2
##  9:    EWR      AS  1
## 10:    EWR      MQ  1
## 11:    JFK      B6 42
## 12:    JFK      DL 22
## 13:    JFK      AA 15
## 14:    JFK      MQ 11
## 15:    JFK      VX  3
## 16:    JFK      UA  2
## 17:    JFK      US  2
## 18:    JFK      HA  1
## 19:    JFK      EV  1
## 20:    LGA      EV 26
## 21:    LGA      MQ 20
## 22:    LGA      DL 13
## 23:    LGA      WN  8
## 24:    LGA      B6  6
## 25:    LGA      AA  4
## 26:    LGA      US  4
## 27:    LGA      UA  3
## 28:    LGA      FL  2
## 29:    LGA      F9  1
## 30:    LGA      OO  1
##     origin carrier  n

  1. Consider the R code snippet below.
nyc14_df = as.data.frame(nyc14)
tab3 =
 with(
  with(nyc14_df,nyc14_df[grepl('HOU',dest),]),
      {
       keys = paste(carrier,month,sep=':')
       u = unique(keys)
       n = sapply(u, function(key) sum(key==keys))

       tmp = strsplit(u,':')
       carrier = sapply(tmp,function(x) x[1])
       month  = sapply(tmp,function(x) x[2])
       
       cr  = unique(carrier)
       ind = sapply(cr, function(x) grep(x,carrier, fixed=TRUE))

       data.frame(carrier,month,n)[ind,]
      }
 )
  1. Provide a one-sentence summary of what is being computed.

This code computes the number of flights from NYC to HOU each month by carrier.

  1. Provide a translation using data.table syntax.
setkey(nyc14, dest)
nyc14['HOU', .(n = .N), by=.(carrier, month)][order(carrier,month)]
##     carrier month   n
##  1:      B6     1  50
##  2:      B6     2  47
##  3:      B6     3  55
##  4:      B6     4  54
##  5:      B6     5  60
##  6:      B6     6  59
##  7:      B6     7  61
##  8:      B6     8  62
##  9:      B6     9  60
## 10:      B6    10  62
## 11:      WN     1 130
## 12:      WN     2 110
## 13:      WN     3 144
## 14:      WN     4 144
## 15:      WN     5 185
## 16:      WN     6 199
## 17:      WN     7 205
## 18:      WN     8 194
## 19:      WN     9 186
## 20:      WN    10 192
  1. Provide a translation using dplyr syntax.
tab3c = nyc14 %>%
  filter(dest=='HOU') %>%
  group_by(carrier, month) %>%
  summarize(n = n())