Problem Set 3

Instructions

Submit the assignment by the due date via Canvas. Assignments may be submitted up to 72 hours late for a 5-point reduction.
All files read, sourced, or referred to within scripts should be assumed to be in the same working directory (./).
Your code should be clearly written and it should be possible to assess it by reading it. Use appropriate variable names and comments. Your style will be graded using the style rubric [15 points].
Some of these exercises may require you to use commands or techniques that were not covered in class or in the course notes. You can use the web as needed to identify appropriate approaches. Part of the purpose of these exercises is for you to learn to be resourceful and self-sufficient. Questions are welcome at all times, but please make an attempt to locate relevant information yourself first.
Please use the provided templates.
This assignment should be done primarily in Stata, with the exception that the write-up and any associated figures or tables may be produced in R. As always, you may use the Linux shell for data preparation and download documentation.
Your submission should include a write-up as a pdf or HTML document and all scripts needed to reproduce it. In your document, describe how the files submitted relate to one another and be sure to answer the questions.
For this assignment, you should submit: Stata scripts (.do), an Rmarkdown file (.Rmd, or .R with spin) for the write-up, the write-up itself (.pdf or .html), and (optionally) a shell script, ps3_make.sh, to build the assignment.

Questions

Question 1 [25 points]

In this question, you will construct a demographics balance table similar to the one constructed in the week6 class activity. Specifically, you should construct a balance table comparing demographics for those who are or are not missing the oral health examination in the NHANES data prepared in problem set 1. The data are available in the Stats506_F20 repository under problem_sets/data/. You will need the following two files:

nhanes_demo.csv
nhanes_ohxden.csv

Documentation for the included variables can be found at:

demo
dentition.

Organize your table to show the number and percent of participants with a complete dentition examination for each of the demographics identified below. Separate the presentation into those aged 20 or older and those younger than 20. As in the activity, you can exclude those who did not participate in any of the medical examinations (MEC), as identified by the RIDSTATR variable. For all others, consider the dentition examination non-missing only when OHDDESTS from the oral health examination data is non-missing and takes the value “complete”.

Your table or tables should compare balance for the demographics below. Use a chi-squared test to assess whether there are marginal associations between categorical variables and whether the dentition exam is complete. Use t-tests to compare continuous variables (e.g. age) and present such variables as mean (IQR) rather than n (%).

For those under 20 years old:

age (present as mean (IQR), use a t-test to compare)
gender
race/ethnicity from RIDRETH3

For those 20 or older:

age (present as mean (IQR), use a t-test to compare)
gender
race/ethnicity from RIDRETH3
college - with two levels, ‘some college/college graduate’ or ‘No college/Unknown’ where the latter category includes “Don’t Know” and all levels less than ‘some college’.
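
For reference, the arithmetic behind the chi-squared comparisons requested above can be sketched as follows. This is a minimal illustration in Python, not a substitute for the Stata work the assignment requires (in Stata, tabulate with the chi2 option reports the same statistic). The function name and the counts are hypothetical and are not taken from the NHANES data.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom
    for an r x c contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical counts: rows are two demographic levels,
# columns are (exam complete, exam missing).
toy_table = [[30, 10],
             [20, 40]]
stat, df = chi_squared_statistic(toy_table)
print(f"chi-squared = {stat:.3f} on {df} degree(s) of freedom")
```

The statistic is then compared against a chi-squared distribution with the reported degrees of freedom to obtain the p-value shown in the balance table.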

Question 2 [20 points]

Use logistic regression to model the probability that an NHANES participant has a complete dentition exam conditional on the demographics included in the table above. You may wish to use (e.g. quadratic) functions of age and to consider interactions. Use AIC for model selection. Present your final model as a regression table. In a separate table, report the AIC for each model considered.
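
As a reminder of the criterion used for model selection, AIC is commonly defined as

\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\]

where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood of the fitted model. Among the candidate models, the one with the smallest AIC is preferred. In Stata, estat ic run after a logit fit reports this quantity.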

Question 3 [40 points]

In this question you will compute descriptive statistics to investigate the relationship between a tooth’s status as captured by the ‘tooth count’ variables OHXxxTC and age. The goal is to quantify how teeth progress as we age.

First, divide participants into regularly spaced age groups of six consecutive years, e.g. \([0, 6), [6, 12), [12, 18), \cdots\). Then, separately for each (numbered) tooth, compute proportions for each level of the tooth count variable by age group. Although there are survey weights present, you can treat the data as iid for this analysis and compute confidence intervals for each proportion using the standard error \(\sqrt{\hat{p}(1 - \hat{p})/n}\). Your Stata code for this question should output a .csv or .xlsx file with the estimated proportions and standard errors for each tooth and age group.
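
The grouping and standard error calculation described above can be sketched as follows. This is a minimal illustration in Python of the arithmetic only; the assignment itself should be carried out in Stata. The function names, the status labels, and the toy records are hypothetical and stand in for a single tooth's data.

```python
import math
from collections import Counter, defaultdict


def age_group(age, width=6):
    """Lower bound of the [lower, lower + width) age group containing age."""
    return width * (int(age) // width)


def proportions_by_group(records, width=6):
    """records: iterable of (age, status) pairs for ONE tooth.
    Returns {(group_lower_bound, status): (p_hat, se, n)}, treating the data as iid."""
    counts = defaultdict(Counter)
    for age, status in records:
        counts[age_group(age, width)][status] += 1

    results = {}
    for lower, status_counts in counts.items():
        n = sum(status_counts.values())
        for status, k in status_counts.items():
            p_hat = k / n
            se = math.sqrt(p_hat * (1 - p_hat) / n)
            results[(lower, status)] = (p_hat, se, n)
    return results


# Hypothetical toy records for one tooth: (age in years, tooth status).
toy = [(3, "primary"), (5, "primary"), (4, "not present"), (2, "primary"),
       (7, "permanent"), (9, "primary"), (11, "permanent"), (10, "permanent")]

for (lower, status), (p_hat, se, n) in sorted(proportions_by_group(toy).items()):
    print(f"[{lower}, {lower + 6}) {status}: p = {p_hat:.2f}, se = {se:.3f}, n = {n}")
```

Note that a status level with no observations in an age group does not appear in the output of this sketch; in your own results you may want to report such levels explicitly with a proportion of zero.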

Read this file into R and construct a figure or figures presenting the results.

Notes:

Elements of style we will emphasize in your Stata code include:
- headers
- line length
- use of comments
- use of macros, globs (*), or built-in Stata commands to avoid excessive repetition (e.g. repeating the same block of code for each of the 32 teeth)
For question 1, you should use some combination of putexcel and/or export delimited to output your results from Stata so they can be read into your write-up document.
Consider using the user-written command outreg2 or the built-in putexcel to transfer results for question 2 to your Rmd report in a reproducible way. In a pinch, a screenshot will suffice.