suppressPackageStartupMessages({
  library(tidyverse); library(data.table)
})

About

This is an example solution to Problem Set 4 for Stats 506 in Fall 2020.

To build this document, run Rscript -e "rmarkdown::render('./PS4_solution.Rmd')" at the command line, or run bash ps4_make.sh to execute the solution scripts and then render this document.

Question 1

In this question, we use the 2009 and 2015 Residential Energy Consumption Survey (RECS) data to profile the quantities and types of televisions in US homes, by Census Region.

Data Preparation

The script 0-ps4_q1_data.sh downloads the source data files from the RECS website. These files are then prepared for analysis in SAS by 1-ps4_q1_prepdata.sas. Use the links below to view these files or refer to the solutions at the Stats506_F20 GitHub repository.

0-ps4_q1_data.sh Click Code to view or Hide to collapse.

writeLines(c('```bash', readLines("./0-ps4_q1_data.sh"), '```'))
#!/usr/bin/env bash

# 79: -------------------------------------------------------------------------         
# Problem Set 4, Question 1
#
# Download 2009 and 2015 RECS microdata and 2009 replicate weights in SAS
# format for PS4, Q1. 
#
# Author: James Henderson (jbhender@umich.edu)
# Updated: Nov 4, 2020
# 79: -------------------------------------------------------------------------

# input parameters: -----------------------------------------------------------
base_url=https://www.eia.gov/consumption/residential/data

# download microdata: ---------------------------------------------------------

## 2009 RECS
if [ ! -f recs2009_public_v4.sas7bdat ]; then
    z=recs2009_public_v4.zip
    wget $base_url/2009/sas/$z
    unzip $z
    rm $z
fi

## 2009 RECS replicate weights
if [ ! -f recs2009_public_repweights.sas7bdat ]; then
    z=recs2009_public_repweights_sas7bdat.zip
    wget $base_url/2009/sas/$z
    unzip $z
    rm $z
fi

## 2015 RECS (includes replicate weights)
if [ ! -f recs2015_public_v4.sas7bdat ]; then
    z=recs2015_public_v4.zip
    wget $base_url/2015/sas/$z
    unzip $z
    rm $z    
fi

# 79: -------------------------------------------------------------------------

1-ps4_q1_prepdata.sas

writeLines(c('```SAS', readLines("./1-ps4_q1_prepdata.sas"), '```'))
/* ------------------------------------------------------------------------- *
 * Problem Set 4, Question 1 - Data Preparation
 * Stats 506, Fall 2020
 *
 * This script prepares data for Q1 by creating the following .sas7bdat's:
 *    - recs2009 - needed variables from 2009 RECS data
 *    - recs2015 - needed variables from 2015 RECS data
 *    - wt09_long - BRR weights for 2009 data in a long format
 *    - wt15_long - BRR weights for 2015 data in a long format
 *
 * The weights are sorted by id and replicate. 
 *
 * Author: James Henderson
 * Updated: November 23, 2020
 * ------------------------------------------------------------------------- *
*/

/* 79: --------------------------------------------------------------------- */

/* sas library: ------------------------------------------------------------ */    
libname mylib './'; 

/* formats: ---------------------------------------------------------------- */
proc format lib=mylib.recs_formats;  
value region 
 1="Northeast" 
 2="Midwest"
 3="South"
 4="West";

value type09_tv
 1="Standard Tube"
 2="LCD"
 3="Plasma"
 4="Projection"
 5="LED"
 -2="Not Applicable";

value type15_tv
 1="LCD"
 2="Plasma"
 3="LED"
 4="Projection"
 5="Standard Tube"
 -2="Not Applicable";
run;

options fmtsearch = (mylib.recs_formats work); 
run;

/* data preparation: ------------------------------------------------------- */
%let recs_vars = DOEID NWEIGHT REGIONC TVCOLOR TVTYPE1; 
%let recs_names = DOEID=id NWEIGHT=w REGIONC=region TVCOLOR=tv_n 
                  TVTYPE1=tv_type;

/* recs 2009 */
data mylib.recs2009; 
  set mylib.recs2009_public_v4;
  keep &recs_vars.;
  rename &recs_names.;
  format REGIONC region.; 
  format TVTYPE1 type09_tv.;
run; 

/* map 2009 types to 2015 types */
data mylib.recs2009; 
 set mylib.recs2009;
 tv_type15=-2;
 if tv_type=1 then tv_type15=5;
 if tv_type=2 then tv_type15=1;
 if tv_type=3 then tv_type15=2;
 if tv_type=4 then tv_type15=4;
 if tv_type=5 then tv_type15=3;
 drop tv_type;
 rename tv_type15=tv_type;
 format tv_type15 type15_tv.; 
run;

/* recs 2015 */
data mylib.recs2015; 
 set mylib.recs2015_public_v4;
 keep &recs_vars.;
 rename &recs_names.;
 format REGIONC region.; 
 format TVTYPE1 type15_tv.; 
run;


/* transform replicate weights to a long format: --------------------------- */

/* 2009 */
proc transpose 
  data=mylib.recs2009_public_repweights
  out=mylib.w09_long
  prefix=brr_weight_;
 by DOEID;
 var brr_weight_1-brr_weight_244; 
run;  

data mylib.w09_long;
 set mylib.w09_long;
 rename DOEID=id _NAME_=repl brr_weight_1=w;

proc sort data=mylib.w09_long out=mylib.w09_long;    
 by id repl; 

/* 2015 */
proc transpose
  data=mylib.recs2015_public_v4
  out=mylib.w15_long
  prefix=brrwt;
 by DOEID;
 var brrwt1-brrwt96;
run;

data mylib.w15_long;
 set mylib.w15_long;
 rename DOEID=id _NAME_=repl brrwt1=w; 

proc sort data=mylib.w15_long out=mylib.w15_long;
 by id repl; 

/* 79: --------------------------------------------------------------------- */

Analysis

The analyses for both parts of this question are carried out in the script 2-ps4_q1_analysis.sas, which can be viewed below. The script defines a number of SAS macros to facilitate the analysis. Briefly, these are:

  • %recs_sum() - a high level macro for summarizing a variable in the RECS data by group, with standard errors computed from the BRR replicate weights (see the sketch after this list).
  • %est_mean() - a macro used by %recs_sum() to specifically compute the mean of a variable by group.
  • %est_prop() - a macro, like %est_mean(), used by %recs_sum() to specifically compute the proportions for levels of a categorical variable by group.
  • %csvexport() - a macro for writing a SAS data set to csv. This macro is also used in the solution to question 2.
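
The confidence intervals these macros produce use balanced repeated replication (BRR) with Fay's method: a point estimate is computed with the full sample weights, the estimate is recomputed with each replicate weight, and the standard error is the root mean squared deviation of the replicate estimates from the point estimate, scaled by 1 / (1 - fay). A minimal R sketch of this calculation (not part of the SAS solution; pe is a point estimate and re a vector of replicate estimates):

brr_se = function(pe, re, fay = 0.5) {
  # root mean squared deviation of replicate estimates, Fay-adjusted
  sqrt(mean((re - pe)^2)) / (1 - fay)
}

brr_ci = function(pe, re, fay = 0.5, level = 0.95) {
  # normal-theory confidence bounds, as in the SAS macros
  z = qnorm(1 - (1 - level) / 2)
  se = brr_se(pe, re, fay)
  c(est = pe, lwr = pe - z * se, upr = pe + z * se)
}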

2-ps4_q1_analysis.sas

writeLines(c('```SAS', readLines("./2-ps4_q1_analysis.sas"), '```'))
/* ------------------------------------------------------------------------- *
 * Problem Set 4, Question 1
 * Stats 506, Fall 2020
 *
 * This is the solution to Q1. Data manipulations are collected into a 
 * macro to limit repetition. 
 *
 * See the link below for the source of testing the case when `catvar` is
 * an empty string (`%str()`):
 * > https://support.sas.com/resources/papers/proceedings09/022-2009.pdf
 * 
 * Author: James Henderson
 * Updated: November 23, 2020
 * ------------------------------------------------------------------------- *
*/

/* 79: --------------------------------------------------------------------- */

/* sas library: ------------------------------------------------------------ */    
libname mylib './'; 
options fmtsearch=(mylib.recs_formats work); 

/* macros: ----------------------------------------------------------------- */
/* normal quantile for 95% confidence level */
%let z = quantile('NORMAL', .975);      

/* estimate the mean of a continuous variable in recs, 
 *  see recs_sum macro below for argument documentation 
 */
%macro est_mean(est_dat, est_var, est_grp, est_w, est_out=pe, out_var=pe); 
 
 proc summary data=&est_dat.;
  by &est_grp.;
  var &est_var.;
  output out = &est_out.
    mean(&est_var.) = &out_var.;
  weight &est_w.;
 run;

%mend est_mean; 

/* estimate the proportion of a categorical variable in RECS,
 * see recs_sum macro below for argument documentation
 */
%macro est_prop(dat, var, catvar, grp, w, est_out=pe, out_var=pe);

 /* totals for each level of catvar */
 proc summary data=&dat.;
  by &grp. &catvar.;
  var &var.;
  output out = n_catvar
    sum(&var.) = n;
  weight &w.;
 run;

 /* totals for each group */
 proc summary data=&dat.;
  by &grp.;
  var &var.;
  output out = n_total
    sum(&var.) = total;
  weight &w.;
 run;

 /* merge and output proportion estimates */
 data &est_out.; 
  merge n_catvar n_total; 
  by &grp.;
  &out_var. = n / total; 
  drop n total; 
 run;

 /* clean up temporary datasets */
 proc datasets gennum=all;
  delete n_total;
  delete n_catvar; 
 run;
  
%mend est_prop;

/* aggregate RECS data by group and use BRR weights to compute CIs */
%macro recs_sum(out, dat, wt_long, lib, id, var, grp, catvar = %str(),
                post = %str(), dat_w = w, wt_w = w, wt_repl = repl, 
                fay = 0.5, level = 0.95);

 /* out - a name for the summary data set being created
  * dat - the primary dataset
  * wt_long - data set of weights
  * lib - library where dat and wt_long (below) can be found
  * id - variable(s) to merge data and wt_long on
  * var - variable (singular only) in dat to summarize
  * catvar - use with var identically 1 for proportions in a long format
  * grp - any grouping variables to use (required) 
  * post - optional postscript to use when naming variables in out
  * dat_w - the name of the weighting variable in dat
  * wt_w - the name of the variable with the (replicate) weights in wt_long
  * wt_repl - the name of the variable in wt_long identifying the replicates.
  * fay - the Fay coefficient for use in calculating standard errors
  * level - confidence level used for computing confidence intervals.
  */

 %let z = quantile('NORMAL', 1 - (1 - &level.) / 2); 

 /* point estimates */
 proc sort data=&lib..&dat. out=work.dat; 
  by &grp. &catvar.;
 run;

 data work.dat;
  set work.dat;
  all=1; 

 %if %sysevalf(%superq(catvar)=, boolean)
   %then %est_mean(work.dat, &var., &grp., &dat_w.);
   %else %est_prop(work.dat, all, &catvar., &grp., &dat_w.);

 /* replicate estimates */
 proc sort data=work.dat out=work.dat;
  by &id.;
 run;

 data work.dat_long; 
  merge work.dat(in=in_dat) &lib..&wt_long.;
   by &id.;
   if in_dat; 
  all=1; 
 run;

 proc sort data=work.dat_long out=work.dat_long;
  by &grp. &wt_repl. &catvar.; 
 run;

 %if %sysevalf(%superq(catvar)=, boolean)
   %then %est_mean(work.dat_long, &var., &grp. &wt_repl., &wt_w., 
           est_out=re, out_var=re);
   %else %est_prop(work.dat_long, all, &catvar., &grp. &wt_repl., 
               &wt_w., est_out=re, out_var=re);


 /* standard errors and confidence bounds */ 
 proc sort data=work.re out=work.re;
  by &grp. &catvar.;

 data re; 
  merge pe(keep=&grp. &catvar. pe) re(keep=&grp. &catvar. re);
  by &grp. &catvar.;
  d_sq = (re - pe)**2;
 run;

 proc summary data=re;
  by &grp. &catvar.;
  output out=mse
    mean(d_sq)=mse;
  run;

 data &out.;
  merge pe mse;
   by &grp.;
  se&post. = sqrt(mse) / (1 - &fay.);
  lwr&post. = pe - &z. * se&post.;
  upr&post. = pe + &z. * se&post.;
  rename pe=est&post.; 
  drop mse _TYPE_ _FREQ_;
 run; 

 /* clean up */
 proc datasets gennum=all; 
  delete pe;
  delete re;
  delete dat;
  delete dat_long; 
 run; 

%mend recs_sum;

%macro csvexport(dataset, lib=work);
 proc export
   data = &lib..&dataset
   outfile = "./&dataset..csv"
   dbms = dlm
   replace;
  delimiter=",";
%mend csvexport;

/* (a) number of tvs: ------------------------------------------------------ */

/* 2009 */
%recs_sum(
 out=n_tv09, 
 dat=recs2009, 
 wt_long=w09_long,
 lib=mylib,
 id=id,
 var=tv_n,
 grp=region,
 post=09
)

/* 2015 */
%recs_sum(
 out=n_tv15,
 dat=recs2015,
 wt_long=w15_long,
 lib=mylib,
 id=id,
 var=tv_n,
 grp=region,
 post=15
)

/* differences */
data n_tv;
 merge n_tv09 n_tv15; 
 by region;
 diff = est15 - est09; 
 se_diff = sqrt(se15**2 + se09**2);
 lwr_diff = diff - &z. * se_diff;
 upr_diff = diff + &z. * se_diff;

%csvexport(n_tv);
run;

/* (b) tv type: ------------------------------------------------------------ */

%recs_sum(
 out=type_tv09,
 dat=recs2009,
 wt_long=w09_long,
 lib=mylib,
 id=id,
 grp=region,
 catvar=tv_type,
 post=09
); 

%recs_sum(
 out=type_tv15,
 dat=recs2015,
 wt_long=w15_long,
 lib=mylib,
 id=id,
 grp=region,
 catvar=tv_type,
 post=15
);

/* differences */
data type_tv;
 merge type_tv09 type_tv15;
 by region tv_type;
 diff = est15 - est09;
 se_diff = sqrt(se15**2 + se09**2);
 lwr_diff = diff - &z. * se_diff;
 upr_diff = diff + &z. * se_diff;

%csvexport(type_tv);
run;

/* 79: --------------------------------------------------------------------- */

Part a

In part a we compare the average number of televisions (TVCOLOR) in US homes in 2009 and 2015 by Census Region. The data displayed below are read from the file n_tv.csv written by 2-ps4_q1_analysis.sas.
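
The differences reported below were computed in SAS as diff = est15 - est09 with standard error se_diff = sqrt(se09**2 + se15**2), treating the two survey years as independent. A hedged R equivalent of that step, recomputing the difference columns from the year-specific estimates in n_tv.csv:

n_tv_check = fread('./n_tv.csv')
# difference of independent estimates; variances add
n_tv_check[, `:=`(diff = est15 - est09, se_diff = sqrt(se09^2 + se15^2))]
# 95% confidence bounds
n_tv_check[, `:=`(lwr_diff = diff - qnorm(.975) * se_diff,
                  upr_diff = diff + qnorm(.975) * se_diff)]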

i - 2009 and 2015 estimates

n_tv = fread('./n_tv.csv')

rbind(
  n_tv[, .(Year = "2009", region, est = est09, lwr = lwr09, upr = upr09)],
  n_tv[, .(Year = "2015", region, est = est15, lwr = lwr15, upr = upr15)]
) %>%
  ggplot( aes(x = est, y = region, color = Year) ) +
   geom_point() +
   geom_errorbarh( aes(xmin = lwr, xmax = upr) ) +
   xlab('Average # of TVs') +
   ylab('') + 
   theme_bw() +
   scale_color_manual(values = c('darkblue', 'orange'))

ii - 2015 less 2009 differences

cap2 = paste(
  "**Figure 2.** *Differences (2015 less 2009) in Average # of Televisions", 
  "by Census Region.*"
)
n_tv %>%
  ggplot( aes(x = diff, y = region) ) +
  geom_point() +
  geom_errorbarh( aes(xmin = lwr_diff, xmax = upr_diff) ) +
  theme_bw() +
  ylab('') +
  xlab('Difference in Average # of Televisions (2015 less 2009)') +
  xlim(c(-.4, 0))
## Warning: Removed 1 rows containing missing values (geom_errorbarh).

Figure 2. Differences (2015 less 2009) in Average # of Televisions by Census Region.

iii - tables

tab1_cap = paste(
  "**Table 1.** *Average number of televisions in US homes in 2009 and 2015,",
  "by Census Region.*"
)
n_tv[, .(
  `Census Region` = region,
  `2009` = sprintf('%4.2f (%4.2f, %4.2f)', est09, lwr09, upr09),
  `2015` = sprintf('%4.2f (%4.2f, %4.2f)', est15, lwr15, upr15),
  `Difference` = sprintf('%4.2f (%4.2f, %4.2f)', diff, lwr_diff, upr_diff)
)] %>%
  knitr::kable(format = 'html', caption = tab1_cap, align = c('rccc')) %>%
  kableExtra::kable_styling("striped", full_width = TRUE)
Table 1. Average number of televisions in US homes in 2009 and 2015, by Census Region.

| Census Region | 2009 | 2015 | Difference |
|--------------:|:----:|:----:|:----------:|
| Northeast | 2.52 (2.47, 2.57) | 2.32 (2.19, 2.44) | -0.20 (-0.34, -0.07) |
| Midwest   | 2.65 (2.58, 2.72) | 2.34 (2.28, 2.40) | -0.31 (-0.40, -0.21) |
| South     | 2.59 (2.55, 2.64) | 2.42 (2.33, 2.51) | -0.18 (-0.28, -0.08) |
| West      | 2.38 (2.33, 2.43) | 2.18 (2.10, 2.26) | -0.20 (-0.29, -0.11) |

Part b

In part b we compare the proportion of primary (most-used) televisions of each display type (TVTYPE1) in US homes in 2009 and 2015, by Census Region. The data displayed below are read from the file type_tv.csv written by 2-ps4_q1_analysis.sas.

i - 2009 and 2015 estimates

cap3 = paste(
  "**Figure 3.** *Proportion of primary television type, by",
  "Census Region.* "             
)

# read in tv type results
tv_type = fread('./type_tv.csv')

# make tv_type a factor based on total change
tt_levels = tv_type[, .(dbar = mean(diff)), .(tv_type)][order(-dbar), tv_type]
tv_type[, tv_type := factor(tv_type, ..tt_levels)]
rbind(
  tv_type[, .(Year = "2009", region, tv_type, 
              est = est09 * 100, lwr = lwr09 * 100, upr = upr09 * 100)],
  tv_type[, .(Year = "2015", region, tv_type, 
              est = est15 * 100, lwr = lwr15 * 100, upr = upr15 * 100)]
) %>%
  ggplot( aes(x = est, y = region, color = Year) ) +
   geom_point() +
   geom_errorbarh( aes(xmin = lwr, xmax = upr) ) +
   facet_wrap(~tv_type) + 
   xlab('% of primary TVs') +
   ylab('') + 
   theme_bw() +
   scale_color_manual(values = c('darkblue', 'orange'))

Figure 3. Proportion of primary television type, by Census Region.

ii - 2015 less 2009 differences

cap4 = paste(
  "**Figure 4.** *Differences (2015 less 2009) in % of primary TVs for each", 
  "type by Census Region.*"
)
tv_type %>%
  ggplot( aes(x = diff * 100, y = region) ) +
  geom_point() +
  geom_errorbarh( aes(xmin = lwr_diff * 100, xmax = upr_diff * 100) ) +
  facet_wrap(~ tv_type) +
  geom_vline(xintercept = 0, lty = 'dashed') + 
  theme_bw() +
  ylab('') +
  xlab('Difference in % of primary TVs (2015 less 2009)') 

Figure 4. Differences (2015 less 2009) in % of primary TVs for each type by Census Region.

iii - table

tab2_cap = paste(
  "**Table 2.** *Percent of primary televisions by type in US homes in 2009",
  "and 2015 for each Census Region.*"
)

tab2 = tv_type[order(tv_type, region), .(
  `Census Region` = region,
  `TV Type` = tv_type, 
  `2009` = sprintf('%3.1f <br>(%3.1f, %3.1f)', 
                   est09 * 100, lwr09 * 100, upr09 * 100),
  `2015` = sprintf('%3.1f <br>(%3.1f, %3.1f)', 
                   est15 * 100, lwr15 * 100, upr15 * 100),
  `Difference` = sprintf('%3.1f <br>(%3.1f, %3.1f)', 
                         diff * 100, lwr_diff * 100, upr_diff * 100)
)] %>%
  knitr::kable(
    format = 'html', caption = tab2_cap, align = c('rrccc'), escape = FALSE
  ) %>%
  kableExtra::kable_styling("striped", full_width = TRUE)

for (i in 1:length(tt_levels)) {
  tab2 = kableExtra::group_rows(
    tab2, 
    group_label = tt_levels[i], 
    start_row = 1 + (i - 1) * 4, 
    end_row = 4 * i
  )
}
tab2
Table 2. Percent of primary televisions by type in US homes in 2009 and 2015 for each Census Region.
| Census Region | TV Type | 2009 | 2015 | Difference |
|--------------:|--------:|:----:|:----:|:----------:|
| Midwest | LED | 1.4 (0.8, 2.0) | 36.5 (33.9, 39.0) | 35.1 (32.5, 37.6) |
| Northeast | LED | 0.6 (0.3, 0.9) | 32.2 (28.5, 36.0) | 31.7 (27.9, 35.4) |
| South | LED | 1.0 (0.7, 1.4) | 34.7 (32.4, 37.0) | 33.7 (31.3, 36.0) |
| West | LED | 1.0 (0.5, 1.4) | 31.0 (28.1, 33.9) | 30.0 (27.1, 33.0) |
| Midwest | Plasma | 7.7 (6.6, 8.9) | 12.8 (11.0, 14.7) | 5.1 (2.9, 7.3) |
| Northeast | Plasma | 8.1 (6.6, 9.7) | 12.9 (10.9, 14.9) | 4.7 (2.2, 7.2) |
| South | Plasma | 8.8 (7.7, 9.9) | 15.1 (13.7, 16.5) | 6.3 (4.5, 8.1) |
| West | Plasma | 9.5 (8.4, 10.6) | 15.2 (12.0, 18.3) | 5.7 (2.3, 9.0) |
| Midwest | Not Applicable | 1.1 (0.5, 1.7) | 2.2 (1.1, 3.3) | 1.1 (-0.1, 2.4) |
| Northeast | Not Applicable | 1.9 (1.0, 2.7) | 3.6 (1.9, 5.4) | 1.8 (-0.2, 3.7) |
| South | Not Applicable | 1.1 (0.7, 1.6) | 2.5 (1.6, 3.3) | 1.3 (0.4, 2.3) |
| West | Not Applicable | 1.6 (1.1, 2.1) | 3.1 (2.3, 3.9) | 1.5 (0.5, 2.4) |
| Midwest | LCD | 40.0 (38.1, 41.9) | 38.0 (35.8, 40.1) | -2.0 (-4.9, 0.8) |
| Northeast | LCD | 42.6 (40.0, 45.1) | 39.1 (36.5, 41.7) | -3.5 (-7.1, 0.2) |
| South | LCD | 39.7 (37.9, 41.4) | 37.1 (35.1, 39.0) | -2.6 (-5.2, 0.0) |
| West | LCD | 40.5 (38.1, 42.9) | 40.1 (37.6, 42.7) | -0.4 (-3.9, 3.1) |
| Midwest | Projection | 3.6 (2.7, 4.5) | 1.7 (1.0, 2.4) | -1.9 (-3.0, -0.8) |
| Northeast | Projection | 2.9 (2.0, 3.8) | 2.7 (1.0, 4.3) | -0.2 (-2.1, 1.7) |
| South | Projection | 4.8 (4.1, 5.5) | 1.3 (1.0, 1.7) | -3.4 (-4.2, -2.7) |
| West | Projection | 5.9 (4.9, 6.8) | 2.0 (1.1, 3.0) | -3.8 (-5.2, -2.5) |
| Midwest | Standard Tube | 46.2 (43.9, 48.5) | 8.8 (7.0, 10.6) | -37.4 (-40.3, -34.5) |
| Northeast | Standard Tube | 44.0 (42.3, 45.7) | 9.5 (7.2, 11.7) | -34.5 (-37.3, -31.7) |
| South | Standard Tube | 44.6 (42.8, 46.5) | 9.3 (7.7, 10.9) | -35.3 (-37.7, -32.9) |
| West | Standard Tube | 41.5 (39.7, 43.4) | 8.6 (7.2, 10.0) | -32.9 (-35.3, -30.6) |

Question 2 [25 points]

In this question we use the NHANES dentition and demographics data from PS3 to model the probability that a permanent tooth is present as a (non-linear) function of age and other demographics. I chose to focus on tooth OHX03TC, the upper right 1st molar. The solution uses logistic regression, as required, with a cubic B-spline basis in age to allow the probability to vary smoothly with age. The full analysis can be viewed below.

3-ps4_q2.sas

writeLines(c('```SAS', readLines("./3-ps4_q2.sas"), '```'))
/* ------------------------------------------------------------------------- *
 * Problem Set 4, Question 2 
 * Stats 506, Fall 2020
 *
 * In this script, we examine the relationship between age and presence of 
 * a permanent upper right first molar using 4 NHANES cohorts from 
 * 2011-2018. 
 *  
 * Author: James Henderson
 * Updated: November 23, 2020
 * ------------------------------------------------------------------------- *
*/

/* 79: --------------------------------------------------------------------- */

/* sas library: ------------------------------------------------------------ */    
libname mylib './'; 

/* data preparation: ------------------------------------------------------- */

/* cleaned demographics from PS3 */
proc import out=mylib.demo datafile = "./demo_ps3.dta" replace;

proc sort data=mylib.demo out=mylib.demo;
 by id;
run;

/* survey design and weights */
proc import
 datafile="./nhanes_demo.csv"
 out=mylib.weights
 replace;
run;

data mylib.weights; 
 set mylib.weights;
 keep SEQN SDMVPSU SDMVSTRA WTMEC2YR;
 rename SEQN=id SDMVPSU=psu SDMVSTRA=strata WTMEC2YR=w;
run;

proc sort data=mylib.weights out=mylib.weights;
 by id;

/* merge weights to demo */
data mylib.demo; 
 merge mylib.demo mylib.weights;
 by id;
 w = w / 4; 
run;

proc contents data=mylib.demo;
run;

/* cleaned dentition data from ps3 */
proc import out=mylib.ohxden datafile = "./ohxden_ps3.dta" replace;

proc sort data=mylib.ohxden out=mylib.ohxden;
 by id;
run;

data mylib.ps4_q2;
 merge mylib.demo mylib.ohxden;
 by id;
run;

/* drop missing cases, flag tooth */
data mylib.ps4_q2a;
 set mylib.ps4_q2;
 where ohx=1; 
 y=0;
 if ohx03tc=2 then y=1;
 if college=. then college=0;
run;

/* center and scale age: -------------------------------------------------- */

/* compute mean */ 
proc means data=mylib.ps4_q2a;
 var age;
 output out = mean_age 
   mean = avg_age;
run;

/* create a macro variable with the mean */ 
data mean_age; 
 set mean_age;
 call symput('avg_age', avg_age); /* For exact centering. */
run;

/* center, scale to decades */ 
data mylib.ps4_q2a;
  set mylib.ps4_q2a;
  /*age_c = (age - &avg_age.) / 10;*/ /* If you want exact centering. */
  age_c = (age - 33) / 10; 
run;

/* logistic regression models: -------------------------------------------- */

/* macro for concision:
 *  arguments: 
 *    age_eff - specification of an age transform in the "effect" statement
 *      ivars - independent variables in the model statement
 *      class - the class statement
 *        out - name of sas dataset to create with OR and CI's
 *        dat - the name of the data to compute with
 */
%macro logit(age_eff, ivars, class, out, dat = mylib.ps4_q2a);
 proc logistic data=&dat.;
  class &class. / param=reference; 
  effect sm_age = &age_eff.; 
  model y(event='1') = sm_age &ivars. / clodds=WALD;
  store &out.; 
  ods output 
   CLOddsWald=&out._or 
   ParameterEstimates=&out._par
   FitStatistics=&out._ic;
 run;
%mend logit;

/* quadratic age and gender to establish a baseline */
%logit(poly(age_c / degree=2), gender, gender, mod1); 

proc print data = mod1_or;
run; 

/* splines */ 
%let sm_age = spline(age_c / basis=bspline knotmethod=equal(13) degree=3);
%logit(&sm_age., gender, gender, mod2); 

/* other demographics */
%logit(&sm_age., gender race college, gender race(ref="3"), mod3);

proc print data = mod3_or;
run;

/* survey models: ---------------------------------------------------------- */

proc surveylogistic data=mylib.ps4_q2a;
  strata strata;
  cluster psu;
  weight w; 
  class gender race(ref="3") / param=reference;
  effect sm_age = &sm_age.;
  model y(event='1') = sm_age gender race college;
  store mod4; 
  ods output 
    OddsRatios=mod4_or 
    ParameterEstimates=mod4_par
    FitStatistics=mod4_ic;
run;

proc print data=mod4_or; 
run;

/* select and merge key results: ------------------------------------------- */
data pars;
 set mod3_par; 
 keep Variable ClassVal0 Estimate StdErr;
 rename Estimate=mod3_est StdErr=mod3_se;
run;

proc sort data=pars out=pars;
 by Variable ClassVal0;

proc sort data=mod4_par out=mod4_par;
 by Variable ClassVal0;

data pars;
 merge pars mod4_par(keep = Variable ClassVal0 Estimate StdErr); 
 by Variable ClassVal0; 
 rename Estimate=mod4_est StdErr=mod4_se;
run; 

/* predictions at specified ages: ------------------------------------------ */
data key_age; 
 input age_c gender race college;
 datalines;
-3 2 1 0
-2.9 2 1 0
-2.8 2 1 0
-2.7 2 1 0
-2.6 2 1 0
-2.5 2 1 0
-2 2 1 0
-1 2 1 0
 0 2 1 0
 1 2 1 0
 2 2 1 0
 3 2 1 0
 4 2 1 0
; 

proc plm restore=mod3; 
 score data=key_age 
       out=yhat_mod3
       predicted lclm uclm / ilink;
run; 

proc plm restore=mod4;
 score data=key_age
       out=yhat_mod4
       predicted lclm uclm / ilink;
run;

data yhat_mod3; 
 set yhat_mod3;
run;

data yhat_mod4;
 set yhat_mod4;
run;

data pars;
 set pars;
run;

/* export results: --------------------------------------------------------- */
%macro csvexport(dataset, lib=work);
 proc export
   data = &lib..&dataset
   outfile = "./&dataset..csv"
   dbms = dlm
   replace;
  delimiter=",";
 run;
%mend csvexport;

%csvexport(mod1_ic); 
%csvexport(mod2_ic);
%csvexport(mod3_par);
%csvexport(mod3_ic);
%csvexport(mod4_par);
%csvexport(mod4_ic); 
%csvexport(pars); 
%csvexport(yhat_mod3);  
%csvexport(yhat_mod4);

/* 79: --------------------------------------------------------------------- */

(a) Presence of a Permanent Tooth as a Function of Age

In this portion, we treated the data as iid. For modeling purposes, age was approximately centered at 33 years and scaled to units of decades (10 years). Cubic B-splines for age were implemented with 13 knots (interior break points), resulting in 16 basis functions. In addition, demographic controls were included for gender, race, and college attendance (with or without graduation, and limited to those over 20).
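
For intuition, an approximately equivalent fit in R uses a cubic B-spline basis for the centered, decade-scaled age: 13 interior knots plus degree 3 gives the same 16 basis functions. A sketch, assuming a data frame dat holds y, age_c, gender, race, and college as constructed in the SAS script (dat and knots are illustrative names, not objects from the solution):

library(splines)
# 13 equally spaced interior knots, approximating knotmethod=equal(13)
knots = seq(min(dat$age_c), max(dat$age_c), length.out = 15)[2:14]
# logistic regression with a cubic B-spline in age plus demographics;
# reference levels match the SAS model: female (gender = 2), race = 3
fit_a = glm(
  y ~ bs(age_c, knots = knots, degree = 3) + relevel(factor(gender), '2') +
    relevel(factor(race), '3') + college,
  data = dat, family = binomial()
)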

The relationship with age is shown in Figure 5. See part c for a regression table.

cap5 = paste(
  "**Figure 5.** *Expected probability for presence of a permanent upper", 
  "right first molar.* This plot shows estimated probabilities at specific",
  "ages for the presence of a permanent upper right first molar with", 
  "(pointwise) 95% confidence intervals for these probabilities. The specific",
  "proabilities here are for a reference group of female gender, Mexican",
  "American race, and no college. The relationship with age is modeled as the",
  "same for all groups, so other groups would show the same shape with",
  "probabilities shifted up or down."
)
# predictions at specific ages
yhat_mod3 = fread('./yhat_mod3.csv')
ggplot(yhat_mod3, aes(x = 10 * age_c + 33, y = Predicted)) + 
  geom_point() + 
  geom_line() +
  geom_errorbar( aes(ymin = LCLM, ymax = UCLM) ) + 
  xlab('age') + 
  ylab('(modeled) probability that permanent tooth is present') + 
  theme_bw() 

Figure 5. Expected probability for presence of a permanent upper right first molar. This plot shows estimated probabilities at specific ages for the presence of a permanent upper right first molar with (pointwise) 95% confidence intervals for these probabilities. The specific probabilities here are for a reference group of female gender, Mexican American race, and no college. The relationship with age is modeled as the same for all groups, so other groups would show the same shape with probabilities shifted up or down.

(b) & (c) Comparison with Survey Weights

The model from part a was refit using the medical examination survey weights and the survey design. Weights were divided by the number of cohorts used, 4, so that results represent an average estimated relationship over the time period used. Table 3 below presents a comparison of the survey results to the results obtained in part a.
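
For reference, the analogous specification with R's survey package clusters on psu, stratifies on strata, and weights by the cohort-averaged w; a sketch reusing the illustrative dat and knots from the part a sketch above:

library(survey)
# survey design: strata, PSUs, and MEC weights (already divided by 4)
design = svydesign(ids = ~psu, strata = ~strata, weights = ~w,
                   data = dat, nest = TRUE)
# same linear predictor as part a; quasibinomial avoids warnings about
# non-integer weighted counts
fit_b = svyglm(
  y ~ bs(age_c, knots = knots, degree = 3) + relevel(factor(gender), '2') +
    relevel(factor(race), '3') + college,
  design = design, family = quasibinomial()
)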

# model comparison table 
pars = fread('./pars.csv')
pars = pars[!is.na(mod3_se), ]

# make nicer labels
race = c(
  'race-1' = 'Mexican American', 
  'race-2' = 'Other Hispanic',
  'race-3' = 'Non-Hispanic White',
  'race-4' = 'Non-Hispanic Black', 
  'race-6' = 'Non-Hispanic Asian',
  'race-7' = 'Other' 
)
age = sprintf('f%i(age)', 1:16)
names(age) = sprintf('sm_age-%i', 1:16)

vars = c('Intercept' = 'Intercept', 
         'college' = 'College', 
         'gender-1' = 'Male', 
         race, 
         age)

pars[, 
 Variable := ifelse(is.na(ClassVal0), 
                     Variable, 
                     sprintf('%s-%i', Variable, ClassVal0) ) ]
pars[, `:=`(Variable = factor(vars[Variable], vars), ClassVal0 = NULL)]
pars = pars[order(Variable)]

## format as a table
ci = function(est, se) {
  bnds = matrix(
    rep(est, each = 3) + c(0, -qnorm(.975), qnorm(.975)) * rep(se, each = 3),
    3, length(est)
  )
  sprintf('%4.2f <br>(%4.2f, %4.2f)', bnds[1, ], bnds[2, ], bnds[3, ])
}

## caption
tab3_cap = paste(
  "**Table 3.** *Coefficients for Logistic Regression Models.* Coefficients",
  "are given on the logistic scale owing to the large coefficents for the",
  "the first two basis functions for age."
)

## the table
tab3 = pars[, .(Variable, 
         'Part a, Coefficient <br> (95% CI)' = ci(mod3_est, mod3_se), 
         'Part b, Coefficient <br> (95% CI)' = ci(mod4_est, mod4_se)
         )]

knitr::kable(
  tab3, format = 'html', escape = FALSE, caption = tab3_cap, align = c('rcc')
) %>%
  kableExtra::kable_styling("striped", full_width = TRUE)
Table 3. Coefficients for Logistic Regression Models. Coefficients are given on the logistic scale owing to the large coefficients for the first two basis functions for age.
| Variable | Part a, Coefficient (95% CI) | Part b, Coefficient (95% CI) |
|---------:|:----------------------------:|:----------------------------:|
| Intercept | -0.41 (-3.16, 2.34) | -0.65 (-3.76, 2.46) |
| College | 0.90 (0.83, 0.98) | 1.17 (1.06, 1.28) |
| Male | -0.11 (-0.17, -0.04) | -0.09 (-0.17, -0.02) |
| Mexican American | 0.35 (0.24, 0.46) | 0.17 (0.02, 0.31) |
| Other Hispanic | -0.06 (-0.18, 0.05) | -0.23 (-0.41, -0.05) |
| Non-Hispanic Black | -0.56 (-0.64, -0.47) | -0.77 (-0.91, -0.64) |
| Non-Hispanic Asian | 0.23 (0.11, 0.35) | 0.01 (-0.15, 0.17) |
| Other | -0.46 (-0.63, -0.28) | -0.71 (-0.96, -0.46) |
| f1(age) | 14.52 (-2.37, 31.42) | 11.00 (-8.35, 30.35) |
| f2(age) | -15.61 (-18.80, -12.41) | -15.27 (-18.56, -11.99) |
| f3(age) | 5.85 (2.99, 8.71) | 5.91 (2.44, 9.38) |
| f4(age) | 7.38 (4.39, 10.37) | 7.91 (4.72, 11.09) |
| f5(age) | 4.98 (2.12, 7.84) | 5.16 (1.75, 8.56) |
| f6(age) | 3.39 (0.57, 6.21) | 3.70 (0.65, 6.74) |
| f7(age) | 2.86 (0.08, 5.65) | 3.42 (0.19, 6.65) |
| f8(age) | 2.15 (-0.63, 4.94) | 2.35 (-0.77, 5.46) |
| f9(age) | 1.90 (-0.86, 4.67) | 2.19 (-0.89, 5.27) |
| f10(age) | 1.47 (-1.31, 4.25) | 1.78 (-1.39, 4.95) |
| f11(age) | 0.79 (-1.95, 3.54) | 1.19 (-1.93, 4.31) |
| f12(age) | 0.69 (-2.10, 3.49) | 1.12 (-1.98, 4.21) |
| f13(age) | 0.13 (-2.58, 2.85) | 0.41 (-2.69, 3.51) |
| f14(age) | -0.10 (-2.97, 2.77) | 0.42 (-2.91, 3.75) |
| f15(age) | -0.21 (-2.74, 2.31) | 0.22 (-2.53, 2.97) |
| f16(age) | -0.70 (-4.14, 2.74) | -0.31 (-4.22, 3.59) |

Note that the large coefficients for the first two basis functions of age could likely be reduced by including additional knots in the ~3-6 year age range where this probability changes fastest.
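
In the part a sketch above, for example, this would amount to adding interior knots near those ages (hypothetical placements):

# extra knots at ages 3.5, 4.5, 5.5, mapped to the centered decade scale
extra_knots = sort(c(knots, (c(3.5, 4.5, 5.5) - 33) / 10))
fit_a2 = glm(
  y ~ bs(age_c, knots = extra_knots, degree = 3) +
    relevel(factor(gender), '2') + relevel(factor(race), '3') + college,
  data = dat, family = binomial()
)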

tab4_cap = paste(
  "**Table 4.** *Comparison of predicted probabilities at specific ages.*",
  "The table shows predictions from the model without (a) and with (b)",
  "use of the survey weights and design. The specific",
  "proabilities here are for a reference group of female gender, Mexican",
  "American race, and no college."
)

# merge probabilities for table 4
yhat_mod4 = fread('./yhat_mod4.csv')

m3 = 
  yhat_mod3[, .(
    Age = 10 * age_c + 33,
    `Part a, p` = sprintf('%4.2f (%4.2f, %4.2f)', Predicted, LCLM, UCLM)
  )]

m4 = 
  yhat_mod4[, .(
    Age = 10 * age_c + 33,
    `Part b, p` = sprintf('%4.2f (%4.2f, %4.2f)', Predicted, LCLM, UCLM)
  )]

tab4 = merge(m3, m4, by = 'Age')
knitr::kable(tab4, format = 'html', caption = tab4_cap, align = 'rcc')  %>%
  kableExtra::kable_styling("striped", full_width = TRUE)
Table 4. Comparison of predicted probabilities at specific ages. The table shows predictions from the model without (a) and with (b) use of the survey weights and design. The specific probabilities here are for a reference group of female gender, Mexican American race, and no college.
| Age | Part a, p | Part b, p |
|----:|:---------:|:---------:|
| 3 | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) |
| 4 | 0.02 (0.01, 0.03) | 0.02 (0.01, 0.02) |
| 5 | 0.19 (0.17, 0.22) | 0.16 (0.13, 0.19) |
| 6 | 0.74 (0.71, 0.77) | 0.69 (0.65, 0.73) |
| 7 | 0.96 (0.95, 0.97) | 0.95 (0.94, 0.96) |
| 8 | 0.99 (0.99, 0.99) | 0.99 (0.99, 0.99) |
| 13 | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00) |
| 23 | 0.97 (0.97, 0.98) | 0.97 (0.96, 0.98) |
| 33 | 0.91 (0.90, 0.93) | 0.91 (0.88, 0.92) |
| 43 | 0.84 (0.82, 0.86) | 0.82 (0.79, 0.85) |
| 53 | 0.68 (0.65, 0.71) | 0.68 (0.63, 0.71) |
| 63 | 0.53 (0.50, 0.57) | 0.51 (0.46, 0.56) |
| 73 | 0.43 (0.39, 0.47) | 0.44 (0.39, 0.50) |