This is an example solution to Problem Set 5 for Stats 506 in Fall 2020.
To build this document, run `Rscript -e "rmarkdown::render('./PS5_solution.Rmd')"` at the command line, or use `bash ps5_make.sh` to build it after running the solution scripts.
In this question, you were asked to pose and answer a question about US residences using the 2009 and 2015 Residential Energy Consumption Survey (RECS) data. Additionally, you were asked to use `data.table` for data manipulations. This is an example question; yours will differ.
Which Census Divisions, if any, have the largest disparity in home internet access between urban and rural residences?
I will answer this question using the variable `INTERNET` to identify home internet access, `DIVISION` to represent Census Division, and `UATYP10` for urban / rural status. In addition, I will use the survey weights `NWEIGHT` to estimate population proportions and the replicate weights `BRRWT1`-`BRRWT96` to compute standard errors.

The analysis for this question is in `ps5_q1.R`, which can be found in the Stats506_F20 repository.
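As a rough illustration of the point-estimation step with `data.table` (the graded code is in `ps5_q1.R`), the sketch below computes the weighted proportions by division and urbanicity. It assumes a data.table `recs` of the 2015 RECS microdata in which `INTERNET` has already been recoded to 0/1 and `UATYP10` collapsed to "Urban" / "Rural"; the object name and those preprocessing steps are assumptions, not part of the solution script.

```r
library(data.table)

## Weighted proportion of residences with home internet access, by Census
## Division and urban/rural status, using the survey weights NWEIGHT.
prop_internet = recs[,
  .(p_internet = sum(NWEIGHT * INTERNET) / sum(NWEIGHT)),
  by = .(DIVISION, UATYP10)
]

## Disparity within each division: urban proportion minus rural proportion.
disparity = dcast(prop_internet, DIVISION ~ UATYP10, value.var = 'p_internet')
disparity[, Disparity := Urban - Rural]
```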
```r
cap1 = paste(
  "**Figure 1.**",
  "*Proportion of residences with home internet access in 2015.*",
  "Values are estimated using the 2015 Residential Energy Consumption Survey",
  "data."
)
p1
```
```r
cap2 = paste(
  "**Figure 2.**",
  "*Proportion of residences with home internet access in 2015.*",
  "Values are estimated using the 2015 Residential Energy Consumption Survey",
  "data."
)
p2
```
```r
tabcap1 = paste(
  "**Table 1.** *Home internet access and urban/rural disparity by Census",
  "Division.*"
)
tab[.N:1] %>%
  knitr::kable(format = 'html', caption = tabcap1) %>%
  kableExtra::kable_styling("striped", full_width = TRUE) %>%
  kableExtra::add_header_above(
    header = c(' ' = 1, 'Home Internet Access, proportion (95% CI)' = 3)
  )
```
Census Division | Rural | Urban | Disparity (Urban - Rural) |
---|---|---|---|
Mountain South | 0.67 (0.58, 0.75) | 0.85 (0.81, 0.89) | 0.19 (0.07, 0.30) |
East South Central | 0.69 (0.63, 0.75) | 0.78 (0.71, 0.86) | 0.09 (-0.01, 0.20) |
West North Central | 0.80 (0.71, 0.89) | 0.88 (0.85, 0.91) | 0.08 (-0.02, 0.18) |
Mountain North | 0.82 (0.74, 0.90) | 0.87 (0.82, 0.93) | 0.05 (-0.06, 0.17) |
West South Central | 0.77 (0.72, 0.81) | 0.82 (0.76, 0.87) | 0.05 (-0.02, 0.13) |
Pacific | 0.85 (0.77, 0.93) | 0.89 (0.86, 0.91) | 0.03 (-0.05, 0.11) |
South Atlantic | 0.82 (0.76, 0.88) | 0.85 (0.83, 0.88) | 0.03 (-0.04, 0.10) |
New England | 0.86 (0.82, 0.89) | 0.88 (0.83, 0.93) | 0.02 (-0.02, 0.06) |
East North Central | 0.86 (0.82, 0.91) | 0.86 (0.84, 0.89) | 0.00 (-0.05, 0.05) |
Middle Atlantic | 0.91 (0.85, 0.97) | 0.89 (0.84, 0.95) | -0.02 (-0.09, 0.05) |
Based on these data, only the Mountain South division has an urban-rural disparity in home internet access whose 95% confidence interval lies entirely above zero. Its estimated disparity is 0.19 (0.07, 0.30) [Urban = 0.85 (0.81, 0.89); Rural = 0.67 (0.58, 0.75)], while all other divisions have estimated disparities of less than 10 percentage points.
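For completeness, here is a rough sketch of how the replicate-weight confidence intervals in Table 1 could be formed. It assumes `est` is a point estimate (e.g., a disparity) computed with `NWEIGHT` and `est_r` is the vector of the same quantity recomputed with each of `BRRWT1`-`BRRWT96`; the Fay coefficient of 0.5 follows the RECS replicate-weight documentation but should be verified for the survey year used.

```r
## Fay-adjusted balanced repeated replication (BRR) standard error and an
## approximate 95% confidence interval for a single estimate.
fay = 0.5
se = sqrt(sum((est_r - est)^2) / (length(est_r) * (1 - fay)^2))
ci = est + c(-1, 1) * qnorm(0.975) * se
```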
In this question, we use cross-validation to compare three models in terms of their out-of-sample predictions for the presence of a permanent tooth as a function of age, using four cohorts of continuous NHANES data (2011-2018). All three models use logistic regression and cubic smoothing splines to model the relationship of the outcome with age; however, the models differ in how they account for each subject contributing multiple teeth. Throughout, we focus on conditional predictions and therefore do not use the survey weights or design. All models are fit using `mgcv::bam()` in R:
- Model 1: `perm_tooth ~ tooth + s(age, bs = 'cs') + s(id, bs = 're')`
- Model 2: `perm_tooth ~ tooth + s(age, bs = 'cs', by = tooth) + s(id, bs = 're')`
- Model 3: `perm_tooth ~ s(age, bs = 'cs')`
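As an illustration only (the graded code is in `ps5_q2.R`), here is a minimal sketch of fitting Model 2 on three cohorts and predicting for a held-out cohort. The data.table `tooth_long`, its `cohort` column, and the cohort label are hypothetical names; `perm_tooth`, `tooth`, `age`, and `id` are the variables in the formulas above, with `id` assumed to be a factor as `s(id, bs = 're')` requires.

```r
library(data.table)
library(mgcv)

## Hold out one NHANES cohort for validation and train on the rest.
train = tooth_long[cohort != '2017-2018']
test  = tooth_long[cohort == '2017-2018']

## Model 2: tooth-specific age smooths plus a subject-level random effect.
fit2 = bam(
  perm_tooth ~ tooth + s(age, bs = 'cs', by = tooth) + s(id, bs = 're'),
  data = train,
  family = binomial(),
  discrete = TRUE  # discretized fitting speeds up bam() on large data
)

## Subjects in the held-out cohort are new, so exclude the random effect when
## predicting; `id` is set to an arbitrary training level only so that the
## model matrix can be built.
test[, id := factor(levels(train$id)[1], levels = levels(train$id))]
test[, yhat := predict(fit2, newdata = test, type = 'response',
                       exclude = 's(id)')]
```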
Parts a-c can be found in `ps5_q2.R` at the Stats506_F20 repo. Here is a comparison of the three models using the cross-entropy loss on the cross-validated predictions.
Examining Table 2 below, Models 2 and 3 both perform better than Model 1, indicating that not all teeth share a common relationship with age. Model 2, which makes predictions for all teeth jointly, performs marginally better than Model 3, though the difference is likely too small to be meaningful.
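For reference, here is a minimal sketch of the loss itself: the average cross-entropy (negative Bernoulli log-likelihood) of the observed outcomes given the held-out predicted probabilities. The names `y` and `yhat` are placeholders.

```r
## Average cross-entropy loss for a binary outcome y (0/1) and
## out-of-sample predicted probabilities yhat.
cross_entropy = function(y, yhat) {
  -mean(y * log(yhat) + (1 - y) * log(1 - yhat))
}
```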
```r
## Load the cross-validated losses computed in ps5_q2.R; the .RData file
## provides the objects `loss` and `loss_by_cohort` used below.
## (Requires data.table, stringr, and the %>% pipe.)
foo = load('ps5_q2_cross_entropy_loss.RData')

## Overall cross-entropy loss for each model
loss = data.table(
  Model = paste('Model', 1:3),
  loss = round(unlist(loss), 4)
)

## Approximate std error of the loss across the four cohorts (folds)
se = loss_by_cohort[,
  lapply(.SD, function(x) sd(x) / sqrt(.N)),
  .SDcols = grep('^yhat', names(loss_by_cohort))
] %>%
  melt(
    measure.vars = grep('^yhat', names(loss_by_cohort), value = TRUE),
    variable.name = 'Model',
    value.name = 'se'
  )
se[, Model := str_replace_all(Model, 'yhat_v', 'Model ')]

## Approximate confidence interval
ci_fmt = '%6.4f (%6.4f, %6.4f)'
m = qnorm(.975)
tab2 = merge(loss, se, by = 'Model') %>%
  .[, .(Model, Loss = sprintf(ci_fmt, loss, loss - m * se, loss + m * se))]

tabcap2 = paste(
  "**Table 2.** *Cross-validated cross-entropy loss for models predicting",
  "the presence of a permanent tooth.* Confidence intervals are approximate:",
  "the estimate plus or minus 1.96 times the standard error of the loss",
  "across the four cross-validation folds."
)
tab2 %>%
  knitr::kable(format = 'html', caption = tabcap2) %>%
  kableExtra::kable_styling("striped", full_width = TRUE)
```
Model | Loss |
---|---|
Model 1 | 0.3600 (0.3495, 0.3705) |
Model 2 | 0.3097 (0.2975, 0.3219) |
Model 3 | 0.3104 (0.2985, 0.3223) |