Question 1 [25 points]

A [15 points]

Solution

  1. c, 8 MB
  2. e, 68.9 kB
  3. a, 326 kB
  4. b, 80 kB
  5. d, 680 kB

B [10 points]

Solution

The largest amount of memory in use at one time occurs just after the creation of the arrays xb and yb. The other major objects in memory are xind and yind. Here are their approximate sizes:

  1. xind: \(30 \times 1e4 \times 1e3 = 3e8\) 4 B integers or about 1.2 GB
  2. yind: \(20 \times 1e4 \times 1e3 = 2e8\) 4 B integers or about 0.8 GB
  3. xb: \(30 \times 1e4 \times 1e3 = 3e8\) 8 B doubles or about 2.4 GB
  4. yb: \(20 \times 1e4 \times 1e3 = 2e8\) 8 B doubles or about 1.6 GB

Including the dimension attributes for the matrices, these four objects alone require a little over 6 GB of memory.

The handful of other objects, together with roughly 150 MB of additional overhead, mean we require (rounding up) 7 GB of memory to execute this script.
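
As a quick sanity check, the arithmetic above can be reproduced directly in R (a sketch; n_x and n_y are just the element counts from the list, using the decimal convention 1 GB = 1e9 bytes):

n_x = 30 * 1e4 * 1e3   # elements in xind and xb
n_y = 20 * 1e4 * 1e3   # elements in yind and yb
c(xind = 4 * n_x, yind = 4 * n_y, xb = 8 * n_x, yb = 8 * n_y) / 1e9
## xind yind   xb   yb
##  1.2  0.8  2.4  1.6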

Note: in the original homework solution, a missing L suffix when incrementing xind and yind caused them to be stored as doubles, so those four objects totaled 8 GB there.


Question 2 [20 points]

A [8 points]

Solution

  1. <task1>: Description: This script computes estimates and confidence intervals for the average maximum (sustained) wind speed and (air) pressure among storms classified as tropical depressions, tropical storms, or hurricanes category 1-5. Storms are classified into their highest category attained.

You should also add a date and author name to the header.

  2. <task2>: “average maximum, with confidence”

  3. <task3>: avg_max or ci_avg_max or compute_avg_max_ci or similar.

B [12 points]

Solution

  1. You are looking for new tables with a two-level name prefixed by mylib. These are all created within the macro: task3_wind and task3_pressure, both of which have 5 columns. You may have replaced task3 with the name you selected for the macro in part A. Note that storms is already there and is simply sorted.

  2. The tables created in work are: storms, max_cat, max_wind, summary_wind, max_pressure, and summary_pressure.


Question 3 [30 points]

A [10 points]

proc sql;

 create table storms as
  select *, max(category) as max_cat
  from mylib.storms
  group by id
  having category = max(category)
  order by category, id;

quit;
run;

B [20 points]

Here is one possible body for the revised macro. Other solutions are also possible.

proc sql;

 create table mylib.avg_max_&var._sql as
  select category, n, xbar as avg_max_&var.,
         xbar - &z.*sd/sqrt(n) as lwr,
         xbar + &z.*sd/sqrt(n) as upr
  from (
   select category, count(id) as n, mean(max_&var.) as xbar,
          std(max_&var.) as sd
   from (
     select category, id, max(&var.) as max_&var.
     from storms
     group by category, id
   )
   group by category
  );

quit;

Question 4 [30 points]

Solution

# Task A, use regex to create "left" and "freq" from thresh: -------------------
aud_long = aud_long %>%
  mutate( left = 1 * grepl('l$', thresh),
          freq = stringr::str_sub(thresh, 1, 1),
          freq = ifelse( freq == 5, 0.5, freq)
  )

# Task B, compute nominal confidence intervals: --------------------------------
#   comparing young to old for each ear/frequency

# The easiest way to do this is not to reshape but to use an additional summarize
aud_t = aud_long %>% 
  group_by(freq, left, old) %>% # will be ordered accordingly
  summarize( m = mean(auxu), v = var(auxu), N = n() ) %>%
  #group_by(freq, left) %>% ## this is implicit
  summarize( 
          myoung = m[1], mold = m[2],
          d = diff(m), 
          se = sqrt( sum( v / N) ),
          lower = d - 1.96*se,
          upper = d + 1.96*se
  ) 

# Task C, clean for presentation: ----------------------------------------------
# For the solution, it is sufficient to have the right columns with clean names,
# they need not be in the correct order, as long as the order is logical.
aud_t %>%
  transmute(
    Frequency = freq,
    Ear = ifelse(left, "left", "right"),
    Avg_Young = myoung, 
    Avg_Older = mold,
    Difference = d,
    lower, 
    upper
  )

Question 5 [30 points]

Question 3 (using data.table)

Solution

## part a, without remerging
cols = setdiff( names(storms), 'id')
storms = storms[ , c(.SD, .(max_cat = max(category))) , id, .SDcols = cols] %>%
 .[category == max_cat] %>%
 .[order(category, id)]

## part a, with remerging
max_cat = storms[ , .(max_cat = max(category)), id]
storms = merge(storms, max_cat, by = 'id', all.x = TRUE) %>% 
 .[category == max_cat]

## part b
### rewrite macro as function
comp_avg_max = function(storms, var = 'wind') {
 z = qnorm(.975)
 
 out = storms[ , .( max_var = max(.SD[[var]] ) ), .(category, id) ] %>%
 .[ , .(n = .N, xbar = mean(max_var), sd = sd(max_var) ), keyby = category] %>%
 .[ , .(category, n, xbar, 
        lwr = xbar - z*sd/sqrt(n), upr = xbar + z*sd/sqrt(n) ) 
  ]

 # Include "var" in names
 nms = names(out)
 nms[ nms == 'xbar' ] = paste0('avg_max_', var)
 setnames(out, nms)
 
 out
}
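
For example, assuming the storms data.table from part a is in memory and, as in the original macro, wind and pressure are the two measures of interest, the function could be called as:

avg_max_wind     = comp_avg_max(storms, var = 'wind')
avg_max_pressure = comp_avg_max(storms, var = 'pressure')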

Question 4 (using data.table)

Solution

# Task A, use regex to create "left" and "freq" from thresh: -------------------
aud_long[ , `:=`( left = 1 * grepl('l$', thresh),
                  freq = stringr::str_sub(thresh, 1, 1) )]
# freq is updated in a second step, since a column created by `:=` cannot be
# referenced later within the same call
aud_long[ , freq := ifelse( freq == 5, 0.5, freq) ]

# Task B, compute nominal confidence intervals: --------------------------------
#   comparing young to old for each ear/frequency

# The easiest way to do this is not to reshape but to use an additional summarize
aud_t = 
 aud_long[ , .(m = mean(auxu), v = var(auxu), .N), 
           keyby = .(freq, left, old)
 ] %>% # .[order(freq, left, old)] ## explicit order if by instead of keyby
.[ , .( myoung = m[1], mold = m[2],
        d = diff(m), 
        se = sqrt( sum( v / N) ),
        lower = d - 1.96*se,
        upper = d + 1.96*se
      ), keyby = .(freq, left)  ## regroup, as is implicit in the dplyr version
 ]

# Task C, clean for presentation: ----------------------------------------------
# For the solution, it is sufficient to have the right columns with clean names,
# they need not be in the correct order, as long as the order is logical.
aud_t[ , 
  .(
    Frequency = freq,
    Ear = ifelse(left, "left", "right"),
    Avg_Young = myoung, 
    Avg_Older = mold,
    Difference = d,
    lower, 
    upper
  )
]

Question 6 [10 points]

Consider the following code using data.table syntax.

med_data[ , `:=`(totpay = LINE_SRVC_CNT * AVERAGE_MEDICARE_PAYMENT_AMT)]
avg_pay = 
 med_data[ , .( n = sum(LINE_SRVC_CNT), totpay = sum(totpay), 
                desc = HCPCS_CODE_DESCRIPTION[1]
              ), .(HCPCS_CODE)]
avg_pay[ , `:=`(avg = totpay / n)]
top10 = avg_pay[ n > 1e5 ][order(-avg)][1:10]

Solution

  1. These are the top 10 medical procedures by average payment amount, among procedures billed to Medicare more than 100,000 times.

  2. An equivalent dplyr translation:

top10 = med_data %>% 
 mutate( totpay = LINE_SRVC_CNT * AVERAGE_MEDICARE_PAYMENT_AMT ) %>%
 group_by(HCPCS_CODE) %>%
 summarize( n = sum(LINE_SRVC_CNT), 
            totpay = sum(totpay), 
            desc = HCPCS_CODE_DESCRIPTION[1]
 ) %>%
 mutate( avg = totpay / n ) %>%
 filter( n > 1e5 ) %>%
 arrange( desc(avg) ) %>%
 slice(1:10) # or head(n = 10)

Question 7 [20 points]

Solution

The mc.preschedule settings shown below will typically complete sooner than the opposite setting.

library(parallel)
wait = function(x){
 # wait 1/x seconds unless x is integer-valued
 # when x is integer valued, wait x seconds if even or
 #  6 * x seconds if odd
 
 if ( x == floor(x) ) {
   y = ifelse( {x %% 2} == 0, x, 6 * x )
   Sys.sleep(y)
 } else {
   Sys.sleep( 1 / x )
 }
}

## a
system.time({
  parallel::mclapply(1:6, wait, mc.cores = 2, mc.preschedule = FALSE)
})
## elapsed time is roughly 40s; the same call with mc.preschedule = TRUE
## takes about 54s (observed elapsed 54.019s).
## b
system.time({
mclapply( runif(1000, min = 5e1, max = 1.5e2), wait, 
  mc.cores = 2, mc.preschedule = TRUE)
})
##    user  system elapsed 
##   0.030   0.023   5.966

For a, with mc.preschedule = TRUE the first worker gets the three longest tasks and takes just over \(6 \times (1 + 3 + 5) = 54\)s to complete. With the selected mc.preschedule = FALSE the long-running tasks are split between the two workers, with an approximate run time of \(\max(1 \times 6 + 4 + 5 \times 6, 2 + 3 \times 6 + 6) = 40\)s. The extra overhead of forking a separate process for each task is small relative to this 14s difference.
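
The per-task times behind these numbers can be tabulated directly; here is a small sketch of that arithmetic (not part of the graded solution):

x = 1:6
times = ifelse(x %% 2 == 0, x, 6 * x)     # 6  2 18  4 30  6 seconds
# mc.preschedule = TRUE gives tasks 1, 3, 5 to one worker and 2, 4, 6 to the other:
max( sum(times[c(1, 3, 5)]), sum(times[c(2, 4, 6)]) )   # 54
# mc.preschedule = FALSE hands out tasks as workers free up:
max( 6 + 4 + 30, 2 + 18 + 6 )                           # 40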

In contrast, for b, the 1000 tasks take about 0.01s each on average and are spread evenly between the workers with high probability. As a first-order approximation, we can expect this to take a bit more than \(500 \times 0.01 = 5\)s.

Note, the approximate time of each task is \(\mathbb E[1 / U]\) for \(U \sim \textrm{Uniform}(50, 150)\), so \(\mathbb E[1/U] = \log(3)/100 \approx 0.01\).
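
For reference, that expectation works out as

\[
\mathbb E[1/U] = \frac{1}{150 - 50} \int_{50}^{150} \frac{du}{u}
             = \frac{\log(150) - \log(50)}{100}
             = \frac{\log 3}{100} \approx 0.011.
\]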

Additional observed run time in b comes from the overhead of calling wait 500 times on each worker and from the fact that \(\mathbb{E}[\max( \sum_{i=1}^{500} 1/U_i, \sum_{i=501}^{1000} 1/U_i)] \ne \max( \mathbb{E}[\sum_{i=1}^{500} 1/U_i], \mathbb{E}[\sum_{i=501}^{1000} 1/U_i])\).
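
A small simulation (purely illustrative, not part of the graded solution; the seed is arbitrary) makes this last point concrete:

set.seed(42)
worker_max = replicate(1e4, {
  t = 1 / runif(1000, min = 50, max = 150)   # the 1000 task times
  max( sum(t[1:500]), sum(t[501:1000]) )     # the slower of the two workers
})
mean(worker_max)  # slightly above 500 * log(3) / 100, i.e. slightly above ~5.5s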


Question 8 [35 points]

In this question, you will write code to compare two machine learning regression models. Each model has a tuning parameter to be selected using cross validation.

You have 100,000 data points organized into a data.table object dt. This data table has a continuous response y and 200 features x1:x200.

Model 1 is based on kernel regression and has a bandwidth parameter h. Assume that you have already written a function fit_model1 which takes a data.table and a value of h and returns the fitted model object: fit_model1(dt, h).

Model 2 utilizes ridge regression and has a smoothing parameter lambda. As with model 1, assume you have written a function fit_model2(dt, lambda) which accepts an input data set and a value of lambda and returns the fitted model object.

Both model objects belong to classes with associated predict methods, predict(fitted_model, new_data) that return a prediction for y based on the fitted model and new data.

You need to use 20-fold cross validation to choose the best h and lambda from a set of previously determined values h_ops and lambda_ops. Each model takes around 1s to fit.
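
To see the scale of the computation, the total work is 20 folds times the number of candidate tuning values, at roughly 1s per fit. A rough budget, using hypothetical grid sizes n_h and n_lambda (the problem does not specify them):

n_h = 10; n_lambda = 10          # hypothetical grid sizes
fits = 20 * (n_h + n_lambda)     # 400 one-second model fits
fits / 60                        # roughly 6.7 minutes if run serially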

Solution

  1. Here is a solution using reference semantics; any correct solution is fine.

dt[ , test := 0]
dt[sample(1:.N, floor(.2 * .N), replace = FALSE), test := 1 ]
## order the data by reference
dt[ , new_ord := runif(.N) ]
setkey(dt, test, new_ord) 
dt[test == 0, fold := rep(1:20, each = {.N - sum(test)} / 20)]  # requires equi-split
dt[test == 1, fold := 21]
setkey(dt, fold, test)
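
A quick check, not required for credit, that the folds came out as intended:

dt[ , .N, keyby = fold]  # folds 1-20 equal sized; fold 21 holds the 20% test set
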
  2. The approach taken here uses futures to allow workers not occupied with fitting model 1 to begin working on model 2. In addition, we fit the models for each value of the tuning parameter serially, choosing instead to parallelize over the folds.
## functions for mse on one fold, for all h or lambda
mse_model = function(hold_out, dt, fit_model, hyper ) {
  mse = vector( mode = "double", length = length(hyper) )
  names(mse) = hyper
  y = dt[ fold == hold_out, y]
  for ( h in hyper ) {
     fit = fit_model( dt[ fold != hold_out & test == 0, ], h)
     mse[as.character(h)] = mean( {y - predict(fit, dt[ fold == hold_out] )}^2 )  # index by name
  } 
  mse
}

# futures run in parallel under a suitable plan, e.g.:
library(future)
plan(multisession)

# model futures
m1_futures = m2_futures = vector( mode = 'list', length = 20)

for ( hold_out in 1:20 ) {
 m1_futures[[hold_out]] = future({ 
   mse_model(hold_out, dt, fit_model1, h_ops) 
 })
}

for ( hold_out in 1:20 ) {
 m2_futures[[hold_out]] = future({
   mse_model(hold_out, dt, fit_model2, lambda_ops)
 })
}

# values
m1_xmse = colMeans( do.call("rbind", lapply(m1_futures, value)) )
m2_xmse = colMeans( do.call("rbind", lapply(m2_futures, value)) )

h_opt = h_ops[ which.min(m1_xmse) ]
lambda_opt = lambda_ops[ which.min(m2_xmse) ]
m1 = fit_model1(dt[ test == 0], h_opt)
m2 = fit_model2(dt[ test == 0], lambda_opt)

## this is sufficient for full points
mse1 = mean( {dt[test == 1, y] - predict(m1, dt[test == 1])}^2 )
mse2 = mean( {dt[test == 1, y] - predict(m2, dt[test == 1])}^2 )

result = data.table(
 model = c('kernel', 'ridge'), 
 tuning = sprintf('%s %4.1f', c('h', 'lambda'), c(h_opt, lambda_opt)),
 mse = sprintf('%4.1f', c(mse1, mse2) )            
)