Instructions

Question 1 [60 points]

First, repeat question 3 parts a-c from problem set 1 using data.table for all computations and data manipulations.

Then, formulate and state a question answerable using the RECS data. Your question should be similar in scope to (one of) parts a-c above and should rely on one or more variables not previously used. Answer your question (using data.table) and provide supporting evidence in the form of nicely formatted graphs and/or tables.

Question 2 [25 points]

In this question you will design a Monte Carlo study in R to compare the performance of different methods that adjust for multiple comparisons. You can read more about each of these methods by referring to help(p.adjust) in R and the references listed there.

Throughout this question, let \(n = 1000\), \(p = 100\) and

\[ \beta_i = \begin{cases} 1 \qquad i \in \{1, \dots, 10\}, \\ 0 \qquad \textrm{else}. \end{cases} \]

Let \(X \in \mathbb{R}^{n \times p}\) with \(X \sim N(0_p, \Sigma)\) and \(Y \sim N(X\beta, \sigma^2I_n)\) where \(I_n\) is an \(n \times n\) identify matrix and \(\Sigma\) is a \(p \times p\), symmetric, positive definite covariance matrix.

  1. Write a function that accepts matrices X and beta and returns a p by mc_rep matrix of p-values corresponding to p-values for the hypothesis tests: \[ H_0: \beta_i = 0, H_1: \beta_i \ne 0. \] In addition to X and beta your function should have arguments sigma (\(\sigma\)) and mc_rep controlling the error variance of \(Y\) and number of Monte Carlo replicates, respectively. Your function should solve the least-squares problems using the QR decomposition of \(X'X\). This decomposition should only be computed once each time your function is called.
    1. Refer to the course notes to find \(\hat \beta\).
    2. Use \(Y\) and \(\hat Y = X \hat \beta\) to estimate the error variance for each Monte Carlo trial \(m\): \[ \hat \sigma_m^2 = \frac{1}{n-p} \sum_{i = 1}^n (Y_{im} - \hat Y_{im})^2 \]
    3. Use the result from ii and the QR decomposition to find the variance of \(\hat \beta_i\), \(v_i = \hat\sigma^2(X'X)^{-1}_{ii}\). [Note: you will need to do some algebra to determine how to compute \((X'X)^{-1}\) using Q and R. Or you can use the function chol2inv().]
    4. Form \(Z_i = \hat \beta_i / \sqrt{v_i}\) and find \(p = 2(1 - \Phi^{-1}(|Z_i|)).\)

Test your function with a specific \(X\) and \(Y\) by comparing to the output from appropriate methods applied to the object returned by lm(Y ~ 0 + X). It’s okay if there is some finite precision error less than ~1e-3 in magnitude. Hint: use set.seed() to generate the same \(Y\) inside and outside the scope of the function for the purpose of testing.

  1. Choose \(\Sigma\) and \(\sigma\) as you like. Use the Cholesky factorization of \(\Sigma\) to generate a single \(X\). Pass \(X\), \(\beta\), and \(\sigma\) to your function from the previous part.

  2. Write a function evaluate that takes your results and a set of indices where \(\beta \ne 0\), and returns Monte Carlo estimates for the following quantities:

See this page for additional details.

  1. Apply your function from the previous part to the matrix of uncorrected P-values generated in part B. Use the function p.adjust() to correct these p-values for multiple comparisons using ‘Bonferroni’, ‘Holm’, ‘BH’ (Benjamini-Hochberg), and ‘BY’ (Benjamini-Yekuteli). Use your evaluate() function for each set of adjusted p-values.

  2. Produce one or more nicely formatted graphs or tables reporting your results. Briefly discuss what you found.

Question 3 (Optional) [30 points]

This is a bonus question related to problem 6 from the midterm. First, review the script written in Stata available here. In this question, you will work through various options for translating this analysis into R. You may submit all or some of these, but each part must be entirely correct to earn the points listed.

  1. Write a translation using data.table for the computations. [5 pts]

  2. Write a function to compute the univariate regression coefficient by group for arbitrary dependent, independent, and grouping variables. Use data.table for computations within your function. Test your function by showing it produces the same results as in part a. [10 pts]

  3. Compute the regression coefficients using the dplyr verb summarize_at(). [5 pts]

  4. Write a function similar to the one in part b to compute arbitrary univariate regression coefficients by group. Use dplyr for computations within your function. You should read the “Programming with dplyr” vignette at vignette('programming', 'dplyr') before attempting this. Warning: this may be difficult to debug! [10 points]