Problem Set 4

Instructions

Submit the assignment by the due date via Canvas. Assignments may be submitted up to 72 hours late for a 5-point reduction.
All files read, sourced, or referred to within scripts should be assumed to be in the same working directory (./).
Your code should be clearly written and it should be possible to assess it by reading it. Use appropriate variable names and comments. Your style will be graded using the style rubric [10 points].
Some of these exercises may require you to use commands or techniques that were not covered in class or in the course notes. You can use the web as needed to identify appropriate approaches. Part of the purpose of these exercises is for you to learn to be resourceful and self-sufficient. Questions are welcome at all times, but please make an attempt to locate relevant information yourself first.
You should do the core computations for this assignment entirely in Stata, though you may use R Markdown to produce a write-up, format tables, or make plots.

Questions

Question 1 [10 points]

Repeat question 2, part d from problem set 2 using Stata.
You can use your output from part c of that same question as the input to your script here. Hint: Use the mixed or meglm commands.

Question 2 [25 points]

Use the RECS 2015 data to answer this question.

Which census division has the largest disparity between urban and rural areas in terms of the proportion of homes with internet access? For this question, you can treat “Urban Cluster” and “Urban” as the same.

Answer the question using Stata, programming the standard error computations yourself from the replicate weights (rather than using a command such as svy brr). Your script should output the key data summaries (estimates and CIs by division) to a CSV file.
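As a rough illustration of what "programming the standard errors yourself" involves, the sketch below applies the Fay-adjusted replicate-weight formula, Var = sum over r of (theta_r - theta_hat)^2 / (R * (1 - fay)^2), to a handful of made-up replicate estimates. The Fay coefficient of 0.5 is an assumption that should be checked against the RECS 2015 documentation, and the names theta_r, theta_hat, and fay are hypothetical.

```stata
* Toy replicate estimates: one row per replicate weight (values are made up)
clear
input double theta_r
0.10
0.14
0.12
0.16
end

* Full-sample point estimate (made up) and the ASSUMED Fay coefficient
scalar theta_hat = 0.13
scalar fay = 0.5

* Squared deviation of each replicate estimate from the full-sample estimate
gen double sq_dev = (theta_r - theta_hat)^2
quietly summarize sq_dev

* r(mean) equals sum / R, so this is sum / (R * (1 - fay)^2)
scalar se = sqrt(r(mean) / (1 - fay)^2)

* Normal-approximation 95% confidence limits
scalar lwr = theta_hat - invnormal(0.975) * se
scalar upr = theta_hat + invnormal(0.975) * se
display "est = " theta_hat "  se = " se "  95% CI = (" lwr ", " upr ")"
```

In the actual analysis the same calculation would be repeated within each division, once per replicate weight, before the results are exported.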

In your write-up, read the file output by your Stata script and then produce a nicely formatted graph or table presenting your results using R. You may create a graph in Stata if you prefer.

Question 3 [35 points]

In this question, you will use data from the 2005-2006 NHANES survey to answer the following question:

Are people in the US more likely to drink water on a weekday than a weekend day?

Use the “dietary interview, total nutrient intakes” data available here to answer this question. You will need both the day 1 and day 2 data. You will also need the demographic data available from this page. For the purposes of this assignment, you can ignore the survey weights and perform inference as though this were iid data.

[10 pts] Use the internet to figure out how to read the XPT data files into Stata. Then, prepare the data for analysis by creating a single data frame with the following format:
1. Each row represents a single response day for a study participant, i.e. the repeated measures from day 1 and day 2 are in “long” form.
2. You have variables representing the following quantities:
- respondent id
- survey day (1 or 2)
- intake day of the week and a binary indicator “weekday” (M-F)
- whether any water was drunk the previous day
- demographic controls: gender, age (use ridageyr), poverty income ratio (pir)
- whether the respondent's exam was done in “winter” (Nov 1 - Apr 30)
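The "long" layout described above can be sketched on a toy dataset. In Stata 16 or later, XPT (SAS transport version 5) files can be read with import sasxport5; earlier versions call the same command import sasxport. The sketch below skips the import step and uses made-up values with hypothetical variable names (water1, water2, dow1, dow2). It also assumes the day of the week is coded 1 = Sunday through 7 = Saturday, which should be confirmed in the NHANES codebook.

```stata
* Toy wide data: one row per respondent (all values are made up)
clear
input seqn water1 water2 dow1 dow2
1 1 0 2 7
2 0 1 6 3
3 1 1 1 4
end

* Wide to long: the trailing 1 or 2 in each variable name becomes svy_day
reshape long water dow, i(seqn) j(svy_day)

* ASSUMED coding: 1 = Sunday, ..., 7 = Saturday, so 2 through 6 are weekdays
gen byte weekday = inrange(dow, 2, 6)

list, sepby(seqn)
```

After the reshape there are two rows per respondent, one for each interview day.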
[5 pts] Further prepare the data for regression analysis by centering the continuous variables, age and pir, around the mean value for those with no missing data on any of the variables (including the response) used in the regression below. To do this, first create a variable “missing” which is 1 if any of the variables used in the regression are missing and 0 otherwise. What are the mean age and pir around which our regression variables are centered? Finally, change the units of the centered age variable to “decades” to make the regression coefficients easier to interpret.
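A minimal sketch of the centering step on made-up data follows. The variable names (water, age, pir, age_c, pir_c) are hypothetical, and only three variables are checked for missingness here, whereas the real indicator should cover every variable in the regression.

```stata
* Toy data with one incomplete row (all values are made up)
clear
input water age pir
1 30 2.0
0 50 .
1 40 4.0
0 20 3.0
end

* 1 if any listed variable is missing, 0 otherwise
gen byte missing = missing(water, age, pir)

* Center age on the complete-case mean, then rescale to decades
quietly summarize age if !missing
scalar mean_age = r(mean)
gen age_c = (age - mean_age) / 10

* Center pir on the complete-case mean
quietly summarize pir if !missing
scalar mean_pir = r(mean)
gen pir_c = pir - mean_pir

display "centered at age = " mean_age ", pir = " mean_pir
```

Note that the respondent aged 50 is excluded from both means because pir is missing for that row.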
[10 pts] Using only the day 1 data, use the logistic command (or, equivalently, glm with family(binomial) and link(logit)) to fit a logistic regression investigating how the odds of drinking water change for weekdays relative to weekends, while controlling for “winter”, age and age squared (using the centered version), gender, and (centered) pir. Then, use the margins command to determine the average marginal effect, on the probability scale, for each of the variables in the previous model. Create a summary table showing odds ratios for each term and marginal effects for each independent variable. Make sure to include CIs in your table. Hint: Use the dydx() option to margins.
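The general pattern of fitting a logistic model and then requesting average marginal effects can be sketched on simulated data. Everything below is illustrative: the data are generated rather than taken from NHANES, the coefficients are arbitrary, and only a subset of the covariates is included. Entering age as c.age_c##c.age_c (factor-variable notation) lets margins treat the linear and squared terms as one variable.

```stata
* Simulated data only; the coefficients below are arbitrary
clear
set seed 2718
set obs 500
gen age_c = rnormal()
gen byte weekday = runiform() < 5/7
gen byte water = runiform() < invlogit(0.5 + 0.3*weekday - 0.2*age_c + 0.1*age_c^2)

* Logistic regression reported as odds ratios; ## adds age_c and its square
logit water i.weekday c.age_c##c.age_c, or

* Average marginal effects on the probability scale, one per variable
margins, dydx(*)
```

For the factor variable i.weekday, margins reports the average discrete change in probability rather than a derivative.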
[10 pts] Fit a mixed logistic model similar to the one above using data from both interview days and including a random intercept for each respondent. You do not need any random slopes. Hint: use the meglm command and a syntax similar to that in question 1. As above, use the margins command to determine the average marginal effect, on the probability scale, for each of the variables. Once again, create a summary table with both odds ratios and marginal effects. Finally, briefly compare the results of the two logistic models.
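A random-intercept logistic model of this kind might look like the following sketch, again on simulated data with hypothetical names (id, svy_day, weekday, water) and a single covariate. It is meant only to show the shape of the meglm call, not the model required for the assignment.

```stata
* Simulated two-day panel; u is a respondent-level random intercept
clear
set seed 314
set obs 300
gen id = _n
gen u = rnormal(0, 1)

* Duplicate each respondent to create two interview days
expand 2
bysort id: gen byte svy_day = _n
gen byte weekday = runiform() < 5/7
gen byte water = runiform() < invlogit(0.5 + 0.3*weekday + u)

* Mixed logistic model with a random intercept for each respondent
meglm water i.weekday || id:, family(binomial) link(logit) or

* Average marginal effects on the probability scale
margins, dydx(*)
```

The `|| id:` term with nothing after the colon requests a random intercept only, with no random slopes.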