In this question you will use SAS to fit mixed models to the audiometry data from the 2009 NHANES survey previously used for problem set 1, question 3. As before you may treat this as a simple random sample. You may wish to refer to the previous solution when specifying the mixed models.
An executable sas script for this question can be found here.
a. Determine how to load the audiometry and demographic data into SAS and then merge on the common identifier seqn. Drop all cases without audiometry data.
Here is the sas code I used for this.
libname mylib './data/';
/* (a) Read and merge data.
* I also do part of (b) here.
*/
libname demod xport './data/DEMO_D.XPT';
libname auxd xport './data/AUX_D.XPT';
data nhanes;
merge demod.DEMO_D auxd.AUX_D;
by seqn;
keep seqn riagendr ridageyr auxu:;
drop auxu1k2:;
if auxu1k1r = . then delete;
b. Produce a reduced data set in long format containing columns for:
For each person use just the first test at each frequency for each ear.
/* (b) Reshape to long format.
* The approach used here is:
* 1. separate demographics & AUXU*
* 2. convert AUXU* to long
* 3. Merge demographics back in.
*/
data nhanes_demo;
set nhanes(rename=(riagendr=gender ridageyr=age));
keep seqn gender age;
data nhanes_aux;
set nhanes(rename=(AUXU1K1R=AUXU1KR AUXU1K1L=AUXU1KL));
keep seqn auxu:;
/* drop additional missing values 888 or 666 */
proc transpose data=nhanes_aux out=aux_long;
by seqn;
var auxu:;
run;
data nhanes_long;
merge aux_long(drop=_LABEL_ rename=(_NAME_=test col1=thresh))
nhanes_demo;
by seqn;
ear = 'right';
if prxmatch("/L/",test) then ear = 'left';
right = 1;
if ear = 'left' then right=0;
freq = prxchange('s/AUXU(.)K(.)/$1/', 1, test);
if prxmatch("/U500/", test) then freq='5';
fr = input(freq, 1.);
if fr ne 5 then fr=10*fr;
drop freq;
if thresh = 888 then delete;
if thresh = 666 then delete;
older = 0;
if age ge 25 then older=1;
older_age = age*older;
female = 1;
if gender = 1 then female = 0;
When creating nhanes_long
we also create an explicit interaction older_age
between the age group indicator older
and age
for later use. The code below will export this data to csv for use in answering question 2.
proc export data=nhanes_long
outfile = '~/nhanes_long.csv'
dbms=dlm replace;
delimiter = ',';
run;
c. Filter your data to contain only the 1000 Hz test for the right ear so that each unique id appears just once. Use proc reg to fit regression models for answering the following questions
You can find additional discussion of parts c and d in the solution for question 2. The code for producing these estimates along with sas output is below.
i. At this frequency, is there a significant interaction between age group and gender in determining how well an individual hears form their right ear?
The data step below subsets the data and creates an explicit interaction between older
and female
.
data nhanes_1kr;
set nhanes_long;
older_female = older*female;
where ear='right' & fr=10;
Here is the proc reg
statement to carry out the regression.
proc reg data=nhanes_1kr outest=sum_1kr
rsquare;
model thresh = older female older_female;
And here is the output from proc print data=sum_1kr;
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 5.48000 0.37792 14.50 <.0001
older 1 23.17633 0.71545 32.39 <.0001
female 1 -0.14799 0.53326 -0.28 0.7814
older_female 1 1.38780 1.03789 1.34 0.1813
ii. After controlling for age group and gender, is age still important as a continuous variable?
We will first re-center age relative to the minimum age in each group.
data nhanes_1kr_ii;
set nhanes_1kr;
group_age = age - 12;
if older = 1 then group_age = age - 70;
Then we can run the regression.
proc reg data=nhanes_1kr_ii outest=sum_1kr_ii rsquare;
model thresh = older female group_age;
Examining the results below we can see that age within group remains important.
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 2.44549 0.41400 5.91 <.0001
older 1 20.38474 0.57773 35.28 <.0001
female 1 0.08467 0.44564 0.19 0.8493
group_age 1 0.83558 0.06806 12.28 <.0001
iii. Is the effect of age, as a continuous variable, significantly different among the older and/or younger age groups?
The SAS code below creates an explicit interaction between group_age
and older
, fits a model with this interaction, and then prints the results.
data nhanes_1kr_iii;
set nhanes_1kr_ii;
older_age = older*group_age;
proc reg data=nhanes_1kr_iii outest=sum_1kr_iii rsquare;
model thresh = older group_age older_age female;
proc print data = sum_1kr_iii;
Based on the results below, we see that the slope for age in the older group is significantly higher than for the younger group.
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 4.82370 0.51785 9.31 <.0001
older 1 15.13856 0.90257 16.77 <.0001
group_age 1 0.15745 0.11263 1.40 0.1623
older_age 1 1.05596 0.14054 7.51 <.0001
female 1 0.06412 0.44119 0.15 0.8845
d. Repeat part (c) using proc mixed
and data from both ears and all frequencies.
In each of the models below, the fixed effects in our model statement will be the same as above with the exception that we also account for the frequency of each hearing test.
There are various ways to specify the random effects to account for subject level differences. The approach below includes random intercepts for each person and for each ear within each person. We can speed up the fitting process by leaving the person level id seqn
as continuous and sorting on this.
First, we will add some derived variables to the full data set.
/* add derived variables to full data */
data nhanes_d;
set nhanes_long;
older_female = older*female;
group_age = age - 12;
if older = 1 then group_age = age - 70;
Next, we will sort on seqn
and ear
.
/* after sorting on SEQN we can use it as
* continuous in specifying random effects
*/
proc sort data=nhanes_d out=nd_sorted;
by seqn ear;
In the models below, the option subject = seqn
in the ranodm
statement ensures the random effects across subjects are independent. In the model statement, the options s cl
request a parameter summary table with confidence limits.
Here is the statement for (i).
proc mixed data=nd_sorted method=REML noclprint;
class ear;
model thresh = fr older female older_female /
s cl;
random intercept ear / subject=seqn;
run;
Here are the estimated fixed effects and variances for the random effects.
This output tells us that yes, there is a significant interaction between gender and age group when considering all frequencies with older women having average hearing threshold 8.2 (7.37 + .86) db lower than older men. We see also that the subject level intercepts have standard deviation \(\sqrt{81.1}\) = 9 db and that ear level effects have standard deviation \(\sqrt{13.3} = 3.6\) db.
ii. Here is the proc mixed
specification for part (ii).
proc mixed data=nd_sorted method=REML noclprint;
class ear;
model thresh = fr older female group_age /
s cl;
random intercept ear / subject=seqn;
run;
Here are the estimated fixed effects and variances for the random effects.
Based on these results we can answer that age is important over and above the differences between age groups with each additional year of age over the group minimum raising the expected hearing threshold by .8 db.
As an aside, notice the reduced variance for the random intercept on account of additional variance being explained by fixed effects.
iii. And finally, the call to part (iii).
proc mixed data=nd_sorted method=REML noclprint;
class ear;
model thresh = fr older female group_age older_age /
s cl;
random intercept ear / subject=seqn;
run;
The results, shown below, indicate that among the older group each additional year of age raises the expected hearing threshold by 1.05 + .13 = 1.2 db, as compared to only .13 db per year among the younger group. The difference in slopes represented by older_age
is highly significant.
For this question, we export the “long” data produced from parts a and b of question 1 for use in R. Here’s an approach to these questions using mixed models fit and specified by lme4
.
First some preliminaries.
#load packages
library(lme4)
## read the data
aud_long = data.table::fread('./nhanes_long.csv')
In part (c) we use a subset of data and linear models with fixed effects only so we create that subset here.
# Filter
aud_subset = dplyr::filter(aud_long, ear=='right' & fr == 10)
i. At this frequency, is there a significant interaction between age group and gender in determining how well an individual hears form their right ear?
To answer this question, we fit a model with our age group variable “older” and our gender variable “female”.
fit_ci = lm(thresh~older*female,data=aud_subset)
summary(fit_ci)
##
## Call:
## lm(formula = thresh ~ older * female, data = aud_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.896 -5.480 -0.480 4.668 89.520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.4800 0.3779 14.501 <2e-16 ***
## older 23.1763 0.7154 32.394 <2e-16 ***
## female -0.1480 0.5333 -0.278 0.781
## older:female 1.3878 1.0379 1.337 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.95 on 2729 degrees of freedom
## Multiple R-squared: 0.4369, Adjusted R-squared: 0.4363
## F-statistic: 705.9 on 3 and 2729 DF, p-value: < 2.2e-16
At a 5% significance level, there is no significant main effect for gender nor for the interaction between gender and age group.
ii. After controlling for age group and gender, is age still important as a continuous variable? To understand the “still” here consider the following model showing the importance of age after controlling for gender:
fit_cii0 = lm(thresh ~ age+gender, data=aud_subset)
summary(fit_cii0)
##
## Call:
## lm(formula = thresh ~ age + gender, data = aud_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.155 -5.973 -0.805 4.417 88.415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.991605 0.759531 -1.306 0.192
## age 0.389990 0.008129 47.976 <2e-16 ***
## gender 0.167264 0.448913 0.373 0.709
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.73 on 2730 degrees of freedom
## Multiple R-squared: 0.4576, Adjusted R-squared: 0.4572
## F-statistic: 1151 on 2 and 2730 DF, p-value: < 2.2e-16
A simple approach to the question as posed indicates that continuous age remains important. However, the intercept and the coefficient on “older” are difficult to interpret.
fit_cii1 = lm(thresh ~ age + older + gender, data=aud_subset)
summary(fit_cii1)
##
## Call:
## lm(formula = thresh ~ age + older + gender, data = aud_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.857 -6.623 -0.872 4.212 86.705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.66613 1.26211 -6.074 1.42e-09 ***
## age 0.83558 0.06806 12.276 < 2e-16 ***
## older -28.07887 4.25887 -6.593 5.15e-11 ***
## gender 0.08467 0.44564 0.190 0.849
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.64 on 2729 degrees of freedom
## Multiple R-squared: 0.4661, Adjusted R-squared: 0.4655
## F-statistic: 794 on 3 and 2729 DF, p-value: < 2.2e-16
We can make these easier to interpret by creating a new variable “group_age” giving the age above the minimum age in each group. This way the intercept represents the threshold for a hypothetical 12 year old and the coefficient on “older” represents the difference between hypothetical 12 and 70 year olds.
aud_subset$group_age = with(aud_subset, ifelse(older==1,age-70,age-12))
fit_cii2 = lm(thresh ~ older + group_age + gender, data=aud_subset)
summary(fit_cii2)
##
## Call:
## lm(formula = thresh ~ older + group_age + gender, data = aud_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.857 -6.623 -0.872 4.212 86.705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.36082 0.75119 3.143 0.00169 **
## older 20.38474 0.57773 35.284 < 2e-16 ***
## group_age 0.83558 0.06806 12.276 < 2e-16 ***
## gender 0.08467 0.44564 0.190 0.84933
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.64 on 2729 degrees of freedom
## Multiple R-squared: 0.4661, Adjusted R-squared: 0.4655
## F-statistic: 794 on 3 and 2729 DF, p-value: < 2.2e-16
Based on this model, we can say that the older group has a hearing threshold about 20 db higher than the younger group at 1000 kHz. Each additional year of age above the group minimums, 12 and 70, respectively, increases the expected threshold by about .8 db.
iii. Is the effect of age, as a continuous variable, significantly different among the older and/or younger age groups?
To answer this, we include an interaction between group_age
and older
. You could also include the interaction explicitly by creating separate age variables for each group.
fit_ciii = lm(thresh ~ older*group_age + female, data=aud_subset)
summary(fit_ciii)
##
## Call:
## lm(formula = thresh ~ older * group_age + female, data = aud_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.523 -5.675 -0.518 4.547 89.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.82370 0.51785 9.315 < 2e-16 ***
## older 15.13856 0.90257 16.773 < 2e-16 ***
## group_age 0.15745 0.11263 1.398 0.162
## female 0.06412 0.44119 0.145 0.884
## older:group_age 1.05596 0.14054 7.514 7.75e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.52 on 2728 degrees of freedom
## Multiple R-squared: 0.4769, Adjusted R-squared: 0.4761
## F-statistic: 621.7 on 4 and 2728 DF, p-value: < 2.2e-16
This output tells us that each year of age among the older group increases the expected hearing threshold by about 1.2 db (1.06 + .16). In contrast the expected increase for each year of age in the younger group is only around .16 and not significant at the 5% level.
d. Now we answer the three questions above using data from both ears and at all frequencies. The key here is to account for the nesting measurements within people and within ears. This question asks you use a mixed model for this and we will specify a random intercept for each ear within each person. We will allow the random effects for each ear to be correlated within a person.
Below is the model with just frequency, age group, and gender along with a random intercept for each person and ear.
fit_di0 = lmer(thresh ~ (1 | ear / SEQN) + older + fr + older + female, data=aud_long)
summary(fit_di0)
## Linear mixed model fit by REML ['lmerMod']
## Formula: thresh ~ (1 | ear/SEQN) + older + fr + older + female
## Data: aud_long
##
## REML criterion at convergence: 303330.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.5101 -0.5300 0.0063 0.5257 5.1068
##
## Random effects:
## Groups Name Variance Std.Dev.
## SEQN:ear (Intercept) 97.02 9.85
## ear (Intercept) 0.00 0.00
## Residual 129.15 11.36
## Number of obs: 38100, groups: SEQN:ear, 5458; ear, 2
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1.941084 0.238255 8.15
## older 40.191992 0.330247 121.70
## fr 0.190270 0.002314 82.22
## female -2.778436 0.291179 -9.54
##
## Correlation of Fixed Effects:
## (Intr) older fr
## older -0.386
## fr -0.340 0.002
## female -0.615 0.032 0.000
Based on the table summarizing the random effects we see that, as one might expect, there is more variance between individuals (SEQN
) than between ears (ear
) within an individual.
In the call to lmer
observe that fitting via REML is the default. The following links may be helpful in understanding the syntax further:
Below are models for answering the actual questions.
i. Is there an interaction between age group and gender?
fit_di = lmer(thresh ~ (1 | SEQN / ear) + older*female + fr, data=aud_long)
summary(fit_di)
## Linear mixed model fit by REML ['lmerMod']
## Formula: thresh ~ (1 | SEQN/ear) + older * female + fr
## Data: aud_long
##
## REML criterion at convergence: 301253
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.7456 -0.5358 0.0047 0.5281 4.8618
##
## Random effects:
## Groups Name Variance Std.Dev.
## ear:SEQN (Intercept) 13.31 3.649
## SEQN (Intercept) 81.08 9.005
## Residual 129.25 11.369
## Number of obs: 38100, groups: ear:SEQN, 5458; SEQN, 2735
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.980370 0.321876 3.05
## older 43.733411 0.590198 74.10
## female -0.861730 0.439354 -1.96
## fr 0.190235 0.002315 82.19
## older:female -7.367949 0.855824 -8.61
##
## Correlation of Fixed Effects:
## (Intr) older female fr
## older -0.511
## female -0.686 0.374
## fr -0.252 0.002 0.000
## older:femal 0.352 -0.690 -0.513 -0.001
When incorporating data for all frequencies we find that there is a meaningful interaction between age group and gender and that the audible thresholds for older women are are roughly 7 db lower than for older men on average.
ii. After controlling for age group and gender, is age still important as a continuous variable?
aud_long$group_age = with(aud_long, ifelse(older==1,age-70,age-12))
fit_dii = lmer(thresh ~ (1 | SEQN / ear) + older + group_age + female + fr, data=aud_long)
#fit_cii2 = lm(thresh ~ older + group_age + gender, data=aud_subset)
summary(fit_dii)
## Linear mixed model fit by REML ['lmerMod']
## Formula: thresh ~ (1 | SEQN/ear) + older + group_age + female + fr
## Data: aud_long
##
## REML criterion at convergence: 301136
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.7089 -0.5380 0.0020 0.5254 4.8902
##
## Random effects:
## Groups Name Variance Std.Dev.
## ear:SEQN (Intercept) 13.32 3.650
## SEQN (Intercept) 76.78 8.762
## Residual 129.25 11.369
## Number of obs: 38100, groups: ear:SEQN, 5458; SEQN, 2735
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.781370 0.352114 -2.22
## older 36.916476 0.478180 77.20
## group_age 0.802399 0.056343 14.24
## female -2.929512 0.368704 -7.95
## fr 0.190231 0.002315 82.18
##
## Correlation of Fixed Effects:
## (Intr) older grop_g female
## older -0.024
## group_age -0.546 -0.486
## female -0.513 0.040 -0.024
## fr -0.230 0.001 0.001 0.000
As before age remains a significant factor even after controlling for differences between these two age groups and genders.
iii. Is the effect of age, as a continuous variable, significantly different among the older and/or younger age groups?
fit_diii = lmer(thresh ~ (1 | SEQN / ear) + older*group_age + female + fr, data=aud_long)
summary(fit_diii)
## Linear mixed model fit by REML ['lmerMod']
## Formula: thresh ~ (1 | SEQN/ear) + older * group_age + female + fr
## Data: aud_long
##
## REML criterion at convergence: 301056.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.6988 -0.5377 0.0029 0.5266 4.8946
##
## Random effects:
## Groups Name Variance Std.Dev.
## ear:SEQN (Intercept) 13.33 3.651
## SEQN (Intercept) 74.06 8.606
## Residual 129.25 11.369
## Number of obs: 38100, groups: ear:SEQN, 5458; SEQN, 2735
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1.587259 0.434010 3.66
## older 31.685314 0.743421 42.62
## group_age 0.126829 0.092729 1.37
## female -2.949964 0.363273 -8.12
## fr 0.190233 0.002315 82.19
## older:group_age 1.052728 0.115747 9.10
##
## Correlation of Fixed Effects:
## (Intr) older grop_g female fr
## older -0.476
## group_age -0.742 0.435
## female -0.414 0.030 -0.009
## fr -0.186 0.000 0.000 0.000
## older:grp_g 0.600 -0.774 -0.801 -0.006 0.001
As in the previous model, we find that age is more impact among the older group than among the younger group.
You can view the SAS files for this exercise at the links below:
.sas7bdat
.For part a, use the file provided and edit the INFILE
statement to point to the text file. I also added a libname
statement and a data step to save the file for subsequent use.
libname mylib '/home/jbhender/ps4q3/data/';
/* INFILE data step provided with download */
/* edit this line to match linux file name (TXT to txt and correct path) */
INFILE './data/Medicare_Provider_Util_Payment_PUF_CY2014.txt'
/* Save as sas7bdat */
DATA mylib.Medicare_PS_PUF_2014;
set Medicare_PS_PUF;
The results shown here are for the 2014 data.
For (b) read in the data and create totpay
:
data da;
set mylib.medicare_ps_puf_2014;
totpay = line_srvc_cnt*average_medicare_payment_amt;
For part (c.i), use proc sql
to compute the average cost for each procedure and order to answer the questions.
proc sql;
create table dax as
select sum(totpay) as s, sum(line_srvc_cnt) as n, sum(totpay)/sum(line_srvc_cnt) as avg,
hcpcs_code as hcpcs, min(hcpcs_description) as service
from da
group by hcpcs_code
order by -avg;
create table daz as
select *
from dax
where n > 1e5;
quit;
/* Highest avg cost */
proc print data=dax(obs=1);
/* Highest avg cost among services with n > 1e5 */
proc print data=daz(obs=1);
For part (c.ii) begin by limiting the data to individual providers,
proc sql;
/* Limit the data to individual providers and summarize */
create table dai as
select npi, sum(totpay) as s, min(provider_type) as type
from da
where nppes_entity_code = "I"
group by npi;
quit;
You can then create a table with provider counts by type among those with more than $1 million in total charges.
proc sql;
create table hcf as
select count(npi) as n, type
from dai
where s > 1e6
group by type
order by -n;
quit;
proc print data=high_cost_freq(obs=10);
The top ten are:
Rank | N | Provider Type |
---|---|---|
1 | 1185 | Ophthalmology |
2 | 651 | Hematology/Oncology |
3 | 369 | Radiation Oncology |
4 | 293 | Rheumatology |
5 | 239 | Dermatology |
6 | 237 | Cardiology |
7 | 212 | Medical Oncology |
8 | 133 | Internal Medicine |
9 | 77 | Diagnostic Radiology |
10 | 75 | Nephrology |
Likewise, create a table with average total payments for each provider type.
create table prov_avg as
select avg(s) as avg, type
from dai
group by type
order by -avg;
proc print data=prov_avg(obs=2);
/* Create table with last two rows */
data low_prov_avg;
set prov_avg nobs=nobs;
if _n_ ge (nobs-2);
proc print data=low_prov_avg(obs=2);
The provider types with the two highest average total charges among individual providers in 2014 were Ophthalmology ($343,777) and Radiation Oncology ($338,148).
The provider types with the two lowest average total charges in 2014 were Mass Immunization Roster Biller ($4,047) and Anesthesiologist Assistants ($6,577).
In this question you were asked to use logistic regression to predict y
as a smooth function of x
from the provided data. Below, I show to to do this using the default ‘thin plate’ splines for the smooth function. I also provide a plot showing the estimated probabilities conditional on x
.
# Load libraries
library(mgcv); library(tidyverse); library(doParallel)
foo = load('~/ps506/PS4/ps4_q4.RData')
## Part a
fit = gam(y~s(x), family=binomial(link='logit'), data=get(foo))
# New data for plotting
new_data = with(get(foo), tibble(x=seq(min(x),max(x),length.out=1e3)))
pred = predict(fit,new_data, type='response', se.fit=TRUE)
new_data = new_data %>%
mutate(y = pred$fit, u = y + 2*pred$se.fit, l=y - 2*pred$se.fit)
new_data %>%
ggplot(aes(x=x,y=y)) +
geom_ribbon(aes(ymin=l,ymax=u),fill='grey') +
geom_line() +
theme_bw() +
ylab('P(Y=1|X)')
Estimated probability that Y=1 given X from a logistic model using a cubic regression spline to create a smooth function of X. The shaded area shows an approximate pointwise 95% confidence region using plus or minus two standard errors of the fit.
cap = 'Estimated probability that Y=1 given X from a logistic model using
a cubic regression spline to create a smooth function of X. The shaded area
shows an approximate pointwise 95% confidence region using plus or minus
two standard errors of the fit.'
Though you weren’t asked to we can check the in-sample error:
mean(sample_data$y != 1*{fit$fitted.values>.5})
## [1] 0.22
Here is a cross-validation function for part b making use of the predict.gam
method.
xvalidate_seq = function(df,folds=10){
# df - a data frame
n = nrow(df)
start = seq(0,n,length.out={folds+1})
n_right = 0
for(k in 1:folds){
ind = {start[k]+1}:{start[k+1]}
df_in = df[-ind,]
df_out = df[ind,]
fit = gam(y~s(x), data=df_in, family=binomial(link='logit'))
y_hat = {predict(fit,df_out,type='response') > .5}
n_right = n_right + sum(y_hat == df_out$y)
}
c('ErrorRate'=1- n_right / n)
}
xvalidate_seq(sample_data,folds=10)
## ErrorRate
## 0.2204
c. Now we modify to run in parallel when “cores > 1”.
xvalidate = function(df,folds=10,cores=1){
# df - a data frame
n = nrow(df)
start = seq(0,n,length.out={folds+1})
do_fold = function(k){
ind = {start[k]+1}:{start[k+1]}
df_in = df[-ind,]
df_out = df[ind,]
fit = gam(y~s(x), data=df_in, family=binomial(link='logit'))
y_hat = {predict(fit,df_out,type='response') > .5}
sum(y_hat == df_out$y)
}
## Compute serially or in parallel
if(cores == 1){
n_right = 0
for(k in 1:folds) n_right = n_right + do_fold(k)
} else{
#cat('Running in parallel...')
n_right = foreach(k=1:folds, .packages='mgcv', .combine='+') %dopar% {
do_fold(k)
}
}
c('ErrorRate'= 1 - n_right/n)
}
Below is a quick (local) test that this works.
ncores = 2
cl = makeCluster(ncores)
registerDoParallel(cl)
xvalidate(sample_data,folds=5,cores=ncores)
## ErrorRate
## 0.2209
stopCluster(cl)
d. To do this, I first saved the xvalidate
function from part c in a script xvalidate.R
. I then copied this script and the data to a folder in my Flux home directory:
scp xvalidate.R ps4_q4.RData flux-xfer.arc-ts.umich.edu:/home/jbhender/ps4q4/
The script below (shown without headers) will carry out the execution and is saved as P4Q4d.R
:
library(mgcv); library(doParallel)
load('/home/jbhender/ps4q4/ps4_q4.RData')
source('/home/jbhender/ps4q4/xvalidate.R')
ncores = 2
cl = makeCluster(ncores)
registerDoParallel(cl)
xvalidate(sample_data,folds=10,cores=ncores)
stopCluster(cl)
Here is a pbs script used to execute the code.
e. Here is a modified script which accepts command line arguments for the number of cores and folds.
# Libraries, data, source files
library(mgcv); library(doParallel)
load('/home/jbhender/ps4q4/ps4_q4.RData')
source('/home/jbhender/ps4q4/xvalidate.R')
# Get command line arguments and assign as global variables
# Use to assign "cores" and "folds"
ca = commandArgs()
ca = ca[grep('=',ca)]
ca = strsplit(ca,'=')
lapply(ca,function(x) assign(x[1],as.numeric(x[2]), envir=.GlobalEnv))
# Warn and quit if problem.
if(sum(c('cores','folds') %in% ls())<2) stop("Please specify 'cores' and 'folds'.")
# Setup cluster
cl = makeCluster(cores)
registerDoParallel(cl)
# Computation
xvalidate(sample_data,folds=folds,cores=cores)
# Close cluser
stopCluster(cl)
There are a number of ways you can pass the folds using $PBS_ARRAYID
. The way I show here uses a string in exponential notation. In the sh shell, we need to use double quotes “” to expand shell variables within a string:
R CMD BATCH --vanilla "--args cores=8 folds=1e${PBS_ARRAYID}" \
/home/jbhender/ps4q4/P4Q4e.R \
/home/jbhender/ps4q4/P4Q4e_Rout_${PBS_ARRAYID}_jbhender.txt
Here \
is used to split a long command over multiple lines for the purpose of display. Here is a copy of the pbs script.
I ran these jobs using 8 cores. The table below shows the walltime, CPU time, and reported prediction error for each number of folds.
Folds | Wall time | CPU time | Error | |
---|---|---|---|---|
10 | 11s | 27s | 22.04% | |
100 | 27s | 2m 36s | 22.07% | |
1000 | 3m 12s | 24m 31s | 22.04% | |
10000 | 31m 8s | 4h 7m 7s | 22.02% |
The out-of-sample accuracy is nearly constant, while the estimation time is approximately linear taking ~1.5s CPU time per fold.