Statistics 507, Fall 2021

Instructions

Complete all questions of the assignment below and submit to Canvas by the due date. Remember, if you are using late days you should submit a draft of the assignment by the due date and leave a comment indicating how many late days you want to use.

For this problem set, you should submit source code as plain text Python scripts (with extension .py) and an associated Jupyter notebook (with extension .ipynb). Use Jupytext and (preferably) the light format to associate the two files.

Questions on this and future problem sets may ask you to use concepts or ideas that have not been discussed in class. One of the goals of these assignments is to help you learn to be independent, read documentation, and otherwise make reasonable decisions about how to analyze and present data, code, or other data science material.

You may discuss the problem set and its solution with your peers, but you are required to work independently on the files to be submitted and to submit your own original work. If you use or closely follow patterns or code from sources other than the course notes or texts you should cite the source.

In addition to the content of your submission, you will be graded on the quality of your source code and the professionalism of your notebook file.

Maintain a consistent and literate style in your Python script and work to make your notebook look professional and well polished. Follow all style rules from previous problem sets. Here are 3 additional style rules focused on naming and indentation you should follow to receive full credit:

Name all variables using all lower-case,
Use snake_case for variable names with multiple words,
When chaining 3 or more method calls, put each on its own line with leading periods aligned. (You may but are not required to do this for fewer method calls).

When asked for a nicely formatted table you should produce one using HTML (perhaps via Markdown) and it should follow standards suitable for publication, e.g. columns should be named in English rather than code_speak. Tables should also be produced using code so that minimal manual work would be needed to update the tables if the results were to change.

Addendum (October 5): You may have Markdown lines longer than 79 characters for long URLS if you use reference style links.

Question 0 - RECS and Replicate Weights [15 points]

In this problem set, you will analyze data from the 2009 and 2015 Residential Energy Consumption Surveys RECS put out by the US Energy Information Agency. In this warm up question, you will:

find URLs from which to download the data for subsequent questions,
determine variables you will need by examining the code-books, and
familiarize yourself with the replicate weight method used for computing standard errors and constructing confidence intervals.

Data Files

Find links to the 2009 and 2015 RECS micro-data files and report them here. In the 2009 data year the replicate weights (see below) are distributed in a separate file. Find and report the link to that file as well.

Variables

Using the code-books for the associated data files, determine what variables you will need to answer the following question. Be sure to include variables like the unit id and sample weights.

Estimate the average number of heating and cooling degree days (base temperature 65 °F) for residences in each Census region for both 2009 and 2015.

Weights and Replicate Weights

Residences participating in the RECS surveys are not an independent and identically distributed sample of residences from the population of all US residences in the survey years. Instead, a complex sampling procedure is used. The details of the sampling process are beyond the scope of this course and assignment. However, you do need to know that residences in the surveys are weighted and to make use of those weights in your solution. Moreover, these surveys include balanced repeated replicate (brr) weights to facilitate computation of standard errors and, in turn, construction of confidence intervals.

On the same page as the data, you will find a link explaining how to use the replicate weights. Report that link here; then, citing the linked document, briefly explain how the replicate weights are used to estimate standard errors for weighted point estimates. The standard error of an estimator is the square root of that estimator’s variance. Please note that I am asking you about the standard error and not the relative standard error. In your explanation please retype the key equation making sure to document what each variable in the equation is. Don’t forget to include the Fay coefficient and its value(s) for the specific replicate weights included with the survey data.

Question 1 - Data Preparation [20 points]

In this question you will use Pandas to download, read, and clean the RECS data needed for subsequent questions. Use pandas to read data directly from the URLs you documented in the warmup when local versions of the files are not available. After download and/or cleaning, write local copies of the datasets. Structure your code to use exists from the (built-in) os module so the data are only downloaded when not already available locally.

part a)

Separately for 2009 and 2015, construct datasets containing just the minimal necessary variables identified in the warmup, excluding the replicate weights. Choose an appropriate format for each of the remaining columns, particularly by creating categorical types where appropriate.

part b)

Separately for 2009 and 2015, construct datasets containing just the unique case ids and the replicate weights (not the primary final weight) in a “long” format with one weight and residence per row.

Question 2 - Construct and report the estimates [45 points]

part a)

Estimate the average number of heating and cooling degree days for residences in each Census region for both 2009 and 2015. You should construct both point estimates (using the weights) and 95% confidence intervals (using standard errors estimated with the replicate weights). Report your results in a nicely formatted table.

For this question, you should use pandas DataFrame methods wherever possible. Do not use a module specifically supporting survey weighting.

part b)

Using the estimates and standard errors from part a, estimate the change in heating and cooling degree days (base temperature °F) between 2009 and 2015 for each Census region. In constructing interval estimates, use the facts that the estimators for each year are independent and that,

\[ \mathrm{var}\left(\hat{\theta}_0, \hat{\theta}_1\right) = \mathrm{var}\left(\hat{\theta}_0\right) + \mathrm{var}\left(\hat{\theta}_1\right), \]

when the estimators \(\hat{\theta}_0\) and \(\hat{\theta}_1\) are independent.

Once again, report your results in a nicely formatted table.

Question 3 - [20 points]

Use pandas and/or matplotlib to create visualizations for the results reported as tables in parts a and b of question 2. As with the tables, your figures should be “polished” and professional in appearance, with well-chosen axis and tick labels, English rather than code_speak, etc. Use an adjacent markdown cell to write a caption for each figure.

Statistics 507, Fall 2021

Problem Set 3

Due Friday October 8 by 5pm.

Instructions

Question 0 - RECS and Replicate Weights [15 points]

Data Files

Variables

Weights and Replicate Weights

Question 1 - Data Preparation [20 points]

part a)

part b)

Question 2 - Construct and report the estimates [45 points]

part a)

part b)

Question 3 - [20 points]