The largest amount of memory in use at one time is just after the creation of the arrays xb and yb. The other major objects in memory are xind and yind. Here are their approximate sizes:

xind: \(30 \times 1e4 \times 1e3 = 3e8\) 4 B integers, or about 1.2 GB
yind: \(20 \times 1e4 \times 1e3 = 2e8\) 4 B integers, or about 0.8 GB
xb: \(30 \times 1e4 \times 1e3 = 3e8\) 8 B doubles, or about 2.4 GB
yb: \(20 \times 1e4 \times 1e3 = 2e8\) 8 B doubles, or about 1.6 GB

Including the dimension attributes for the matrices, these four objects alone require a little over 6 GB of memory. The handful of other objects, together with the ~150 MB of other memory in use, mean we require (rounding up) 7 GB of memory to execute this script.
Note, in the original homework solution, a missing L when incrementing xind and yind caused these to be stored as doubles, bringing the total for these four objects to 8 GB there.
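As a quick check, the size arithmetic, and the integer-to-double promotion caused by a missing L, can be reproduced in R:

# approximate sizes in GB (1 GB = 1e9 bytes), as computed above
sizes = c(xind = 3e8 * 4, yind = 2e8 * 4,  # 4 B integers
          xb   = 3e8 * 8, yb   = 2e8 * 8)  # 8 B doubles
sizes / 1e9       # 1.2 0.8 2.4 1.6
sum(sizes) / 1e9  # 6.0 GB, before attributes and the ~150 MB of other objects

# a missing "L" silently promotes integer arithmetic to double
typeof(1L + 1L)  # "integer"
typeof(1L + 1)   # "double"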
<task1>: Description: This script computes estimates and confidence intervals for the average maximum (sustained) wind speed and (air) pressure among storms classified as tropical depressions, tropical storms, or hurricanes category 1-5. Storms are classified into their highest category attained. You should also add a date and author name to the header.
<task2>: “average maximum, with confidence”
<task3>: avg_max, ci_avg_max, compute_avg_max_ci, or similar.
You are looking for new tables with a two-part file name prefixed by mylib. These are all created within the macro: task3_wind and task3_pressure, both of which have 5 columns. You may have chosen to replace task3 with the name you selected for the macro in part A. Note that storms is already there and simply sorted.
The tables created in work are: storms, max_cat, max_wind, summary_wind, max_pressure, and summary_pressure.
/* classify each storm by the highest category it attains */
proc sql;
  create table storms as
    select *, max(category) as max_cat
    from mylib.storms
    group by id
    having category = max(category)
    order by category, id;
quit;
Here is one possible body for the revised macro. Other solutions are also possible.
proc sql;
create table mylib.avg_max_&var._sql as
select category, n, xbar as avg_max_&var.,
xbar - &z.*sd/sqrt(n) as lwr,
xbar + &z.*sd/sqrt(n) as upr
from (
select category, count(id) as n, mean(max_&var.) as xbar,
std(max_&var.) as sd
from (
select category, id, max(&var.) as max_&var.
from storms
group by category, id
)
group by category
);
# Task A, use regex to create "left" and "freq" from thresh: -------------------
aud_long = aud_long %>%
mutate( left = 1 * grepl('l$', thresh),
freq = stringr::str_sub(thresh, 1, 1),
freq = ifelse( freq == 5, 0.5, freq)
)
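To illustrate the regular expression and substring extraction, here is a small example on hypothetical thresh values (the leading character codes the frequency, with 5 standing for 0.5, and the trailing character codes the ear):

# hypothetical thresh values, e.g. "5l" = frequency code 5 (0.5), left ear
thresh = c('5l', '1r', '2l')
1 * grepl('l$', thresh)         # 1 0 1: ends in "l" means left ear
stringr::str_sub(thresh, 1, 1)  # "5" "1" "2"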
# Task B, compute nominal confidence intervals: --------------------------------
# comparing young to old for each ear/frequency
# The easiest way to do this is not to reshape, but to use an additional summarize.
aud_t = aud_long %>%
group_by(freq, left, old) %>% # will be ordered accordingly
summarize( m = mean(auxu), v = var(auxu), N = n() ) %>%
#group_by(freq, left) %>% ## this is implicit
summarize(
myoung = m[1], mold = m[2],
d = diff(m),
se = sqrt( sum( v / N) ),
lower = d - 1.96*se,
upper = d + 1.96*se
)
# Task C, clean for presentation: ----------------------------------------------
# For the solution, it is sufficient to have the right columns with clean
# names; they need not be in the correct order, as long as the order is logical.
aud_t %>%
transmute(
Frequency = freq,
Ear = ifelse(left, "left", "right"),
Avg_Young = myoung,
Avg_Older = mold,
Difference = d,
lower,
upper
)
## part a, without remerging
cols = setdiff( names(storms), 'id')
storms = storms[ , c(.SD, .(max_cat = max(category))) , id, .SDcols = cols] %>%
.[category == max_cat] %>%
.[order(category, id)]
## part a, with remerging
max_cat = storms[ , .(max_cat = max(category)), id]
storms = merge(storms, max_cat, by = 'id', all.x = TRUE) %>%
.[category == max_cat]
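For comparison, a dplyr version of part (a) with remerging might look like the following sketch (not part of the original solution):

## part a, dplyr sketch with remerging via a grouped mutate
storms = storms %>%
  group_by(id) %>%
  mutate( max_cat = max(category) ) %>%
  ungroup() %>%
  filter( category == max_cat ) %>%
  arrange( category, id )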
## part b
### rewrite macro as function
comp_avg_max = function(storms, var = 'wind') {
z = qnorm(.975)
out = storms[ , .( max_var = max(.SD[[var]] ) ), .(category, id) ] %>%
.[ , .(n = .N, xbar = mean(max_var), sd = sd(max_var) ), keyby = category] %>%
.[ , .(category, n, xbar,
lwr = xbar - z*sd/sqrt(n), upr = xbar + z*sd/sqrt(n) )
]
  # include "var" in the name of the estimate column
  setnames(out, 'xbar', paste0('avg_max_', var))
out
}
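Hypothetical usage, assuming storms contains wind and pressure columns as in the SAS version:

avg_max_wind = comp_avg_max(storms, var = 'wind')
avg_max_pressure = comp_avg_max(storms, var = 'pressure')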
# Task A, use regex to create "left" and "freq" from thresh: -------------------
aud_long[ , `:=`( left = 1 * grepl('l$', thresh),
                  freq = stringr::str_sub(thresh, 1, 1)
)]
# a column created by `:=` can't be referenced in the same call,
# so recode freq in a second step
aud_long[ , freq := ifelse( freq == 5, 0.5, freq )]
# Task B, compute nominal confidence intervals: --------------------------------
# comparing young to old for each ear/frequency
# The easiest way to do this is not to reshape, but to use an additional summarize.
aud_t =
  aud_long[ , .(m = mean(auxu), v = var(auxu), .N),
              keyby = .(freq, left, old)
  ] %>% # .[order(freq, left, old)] ## explicit order if by instead of keyby
  .[ , .( myoung = m[1], mold = m[2],
          d = diff(m),
          se = sqrt( sum( v / N ) ),
          lower = d - 1.96*se,
          upper = d + 1.96*se
       ),
     # unlike dplyr, the second aggregation does not inherit the grouping
     keyby = .(freq, left)
  ]
# Task C, clean for presentation: ----------------------------------------------
# For the solution, it is sufficient to have the right columns with clean
# names; they need not be in the correct order, as long as the order is logical.
aud_t[ ,
.(
Frequency = freq,
Ear = ifelse(left, "left", "right"),
Avg_Young = myoung,
Avg_Older = mold,
Difference = d,
lower,
upper
)
]
Consider the following code using data.table syntax.
med_data[ , `:=`(totpay = LINE_SRVC_CNT * AVERAGE_MEDICARE_PAYMENT_AMT)]
avg_pay =
med_data[ , .( n = sum(LINE_SRVC_CNT), totpay = sum(totpay),
desc = HCPCS_CODE_DESCRIPTION[1]
), .(HCPCS_CODE)]
avg_pay[ , `:=`(avg = totpay / n)]
top10 = avg_pay[ n > 1e5 ][order(-avg)][1:10]
This is the top 10 medical procedures by average payment amount, among procedures billed to Medicare more than 100,000 times.
top10 = med_data %>%
mutate( totpay = LINE_SRVC_CNT * AVERAGE_MEDICARE_PAYMENT_AMT ) %>%
group_by(HCPCS_CODE) %>%
summarize( n = sum(LINE_SRVC_CNT),
totpay = sum(totpay),
desc = HCPCS_CODE_DESCRIPTION[1]
) %>%
mutate( avg = totpay / n ) %>%
filter( n > 1e5 ) %>%
arrange( desc(avg) ) %>%
slice(1:10) # equivalently, head(n = 10)
The selected options, mc.preschedule = FALSE for part (a) and mc.preschedule = TRUE for part (b), will typically complete sooner than the opposite mc.preschedule setting. For comparison, both timings below were run with mc.preschedule = TRUE.
library(parallel)
wait = function(x){
# wait 1/x seconds unless x is integer-valued
# when x is integer valued, wait x seconds if even or
# 6 * x seconds if odd
if ( x == floor(x) ) {
y = ifelse( {x %% 2} == 0, x, 6 * x )
Sys.sleep(y)
} else {
Sys.sleep( 1 / x )
}
}
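For reference, the sleep times in seconds for the integer inputs 1:6 used in part (a):

# sleep times for x = 1:6: x seconds when even, 6x seconds when odd
ifelse(1:6 %% 2 == 0, 1:6, 6 * (1:6)) # 6 2 18 4 30 6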
## a
system.time({
parallel::mclapply(1:6, wait, mc.cores = 2, mc.preschedule = TRUE)
})
## user system elapsed
## 0.010 0.008 54.019
## b
system.time({
mclapply( runif(1000, min = 5e1, max = 1.5e2), wait,
mc.cores = 2, mc.preschedule = TRUE)
})
## user system elapsed
## 0.030 0.023 5.966
For (a), with mc.preschedule = TRUE the first worker gets the three longest tasks and takes just over \(6 \times (1 + 3 + 5) = 54\) s to complete. With the selected mc.preschedule = FALSE, the long-running tasks are split between the two workers, with an approximate run time of \(\max(1 \times 6 + 4 + 5 \times 6,\ 2 + 3 \times 6 + 6) = 40\) s. The extra overhead of forking a new process for each task is small relative to this 14 s difference.
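For comparison, here is the part (a) call with the selected setting; by the argument above, we expect an elapsed time of roughly 40 s:

## part a with the selected setting
system.time({
  parallel::mclapply(1:6, wait, mc.cores = 2, mc.preschedule = FALSE)
})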
In contrast, for (b), the 1000 tasks take about 0.01 s on average and are spread evenly between the workers with high probability. As a first-order approximation, we can expect this to take a bit more than \(500 \times 0.01 = 5\) s.
Note, the approximate time of each task is \(\mathbb{E}[1/U]\) for \(U \sim \textrm{Uniform}(50, 150)\), so \(\mathbb{E}[1/U] = \log(3)/100 \approx 0.01\).
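A quick numeric check of this expectation:

# E[1/U] for U ~ Uniform(50, 150), three ways
log(3) / 100                                   # exact: ~0.011
integrate(function(u) (1 / u) / 100, 50, 150)  # numerical integration
mean(1 / runif(1e6, min = 50, max = 150))      # Monte Carlo approximation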
Additional observed run time in (b) comes from the overhead of calling wait 500 times on each worker and from the fact that \(\mathbb{E}\left[\max\left( \sum_{i=1}^{500} 1/U_i, \sum_{i=501}^{1000} 1/U_i \right)\right] \ne \max\left( \mathbb{E}\left[\sum_{i=1}^{500} 1/U_i\right], \mathbb{E}\left[\sum_{i=501}^{1000} 1/U_i\right] \right)\).
In this question, you will write code to compare two machine learning regression models. Each model has a tuning parameter to be selected using cross validation.
You have 100,000 data points organized into a data.table object dt. This data table has a continuous response y and 200 features x1:x200.
Model 1 is based on kernel regression and has a bandwidth parameter h. Assume that you have already written a function fit_model1 which takes a data.table and a value of h and returns the fitted model object: fit_model1(dt, h).
Model 2 utilizes ridge regression and has a smoothing parameter lambda. As with model 1, assume you have written a function fit_model2(dt, lambda) which accepts an input data set and a value of lambda and returns the fitted model object.
Both model objects belong to classes with associated predict methods, predict(fitted_model, new_data), that return a prediction for y based on the fitted model and new data.
You need to use 20-fold cross validation to choose the best h and lambda from sets of previously determined values h_ops and lambda_ops. Each model takes around 1 s to fit.
dt[ , test := 0]
dt[sample(1:.N, floor(.2 * .N), replace = FALSE), test := 1 ]
## order the data by reference
dt[ , new_ord := runif(.N) ]
setkey(dt, test, new_ord)
dt[test == 0, fold := rep(1:20, each = {.N - sum(test)} / 20)] # requires equi-split
dt[test == 1, fold := 21]
setkey(dt, fold, test)
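A quick check that the folds came out as intended (folds 1-20 should have 4,000 training rows each; fold 21 is the 20,000-row test set):

dt[ , .N, keyby = .(fold, test)]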
## function for mse on one fold, for all h or lambda
mse_model = function(hold_out, dt, fit_model, hyper) {
  mse = vector( mode = "double", length = length(hyper) )
  names(mse) = hyper
  y = dt[ fold == hold_out, y ]
  # loop over positions so that numeric hyperparameter values are not
  # mistaken for vector indices
  for ( i in seq_along(hyper) ) {
    fit = fit_model( dt[ fold != hold_out & test == 0, ], hyper[i] )
    mse[i] = mean( {y - predict(fit, dt[ fold == hold_out ])}^2 )
  }
  mse
}
# model futures; assumes the future package with, e.g., a multisession plan
library(future)
plan(multisession)
m1_futures = m2_futures = vector( mode = 'list', length = 20)
for ( hold_out in 1:20 ) {
m1_futures[[hold_out]] = future({
mse_model(hold_out, dt, fit_model1, h_ops)
})
}
for ( hold_out in 1:20 ) {
m2_futures[[hold_out]] = future({
mse_model(hold_out, dt, fit_model2, lambda_ops)
})
}
# values
m1_xmse = colMeans( do.call("rbind", lapply(m1_futures, value)) )
m2_xmse = colMeans( do.call("rbind", lapply(m2_futures, value)) )
h_opt = h_ops[ which.min(m1_xmse) ]
lambda_opt = lambda_ops[ which.min(m2_xmse) ]
m1 = fit_model1(dt[ test == 0], h_opt)
m2 = fit_model2(dt[ test == 0], lambda_opt)
## this is sufficient for full points
mse1 = mean( {dt[test == 1, y] - predict(m1, dt[test == 1])}^2 )
mse2 = mean( {dt[test == 1, y] - predict(m2, dt[test == 1])}^2 )
result = data.table(
model = c('kernel', 'ridge'),
tuning = sprintf('%s %4.1f', c('h', 'lambda'), c(h_opt, lambda_opt)),
mse = sprintf('%4.1f', c(mse1, mse2) )
)