Submit the assignment by the due date via canvas. If you intend to utilize late days, please upload partial progress to Canvas and comment that you will utilize late days before the assignment is due. In your comment please indicate how many days you intend to use.
Use Rmarkdown to create and submit a single html or pdf with your answers to each question along with supporting evidence in the form of tables and graphs.
All tables and graphs should be neatly labeled and appear polished.
All question ask you to use R. You should submit your code for each problem as ps3_qX.R
.
.zip
) which contains the following files:
ps3.pdf
or ps3.html
ps3.Rmd
ps3_q1.R
ps3_q2.R
ps3_q3.R
(Optionally)All files should be executable without errors.
All files read, sourced, or referred to within scripts should be assumed to be in the same working directory (./
).
Your code should be clearly written and it should be possible to assess it by reading it. Use appropriate variable names and comments. Your style will be graded using the style rubric [15 points].
Some of these exercises may require you to use commands or techniques that were not covered in class or in the course notes. You can use the web as needed to identify appropriate approaches. Part of the purpose of these exercises is for you to learn to be resourceful and self sufficient. Questions are welcome at all times, but please make an attempt to locate relevant information yourself first.
First, repeat question 3 parts a-c from problem set 1 using data.table
for all computations and data manipulations.
Then, formulate and state a question answerable using the RECS data. Your question should be similar in scope to (one of) parts a-c above and should rely on one or more variables not previously used. Answer your question (using data.table) and provide supporting evidence in the form of nicely formatted graphs and/or tables.
In this question you will design a Monte Carlo study in R to compare the performance of different methods that adjust for multiple comparisons. You can read more about each of these methods by referring to help(p.adjust)
in R and the references listed there.
Throughout this question, let \(n = 1000\), \(p = 100\) and
\[ \beta_i = \begin{cases} 1 \qquad i \in \{1, \dots, 10\}, \\ 0 \qquad \textrm{else}. \end{cases} \]
Let \(X \in \mathbb{R}^{n \times p}\) with \(X \sim N(0_p, \Sigma)\) and \(Y \sim N(X\beta, \sigma^2I_n)\) where \(I_n\) is an \(n \times n\) identify matrix and \(\Sigma\) is a \(p \times p\), symmetric, positive definite covariance matrix.
X
and beta
and returns a p
by mc_rep
matrix of p-values corresponding to p-values for the hypothesis tests: \[ H_0: \beta_i = 0, H_1: \beta_i \ne 0. \] In addition to X
and beta
your function should have arguments sigma
(\(\sigma\)) and mc_rep
controlling the error variance of \(Y\) and number of Monte Carlo replicates, respectively. Your function should solve the least-squares problems using the QR decomposition of \(X'X\). This decomposition should only be computed once each time your function is called.
chol2inv()
.]Test your function with a specific \(X\) and \(Y\) by comparing to the output from appropriate methods applied to the object returned by lm(Y ~ 0 + X)
. It’s okay if there is some finite precision error less than ~1e-3 in magnitude. Hint: use set.seed()
to generate the same \(Y\) inside and outside the scope of the function for the purpose of testing.
Choose \(\Sigma\) and \(\sigma\) as you like. Use the Cholesky factorization of \(\Sigma\) to generate a single \(X\). Pass \(X\), \(\beta\), and \(\sigma\) to your function from the previous part.
Write a function evaluate
that takes your results and a set of indices where \(\beta \ne 0\), and returns Monte Carlo estimates for the following quantities:
See this page for additional details.
Apply your function from the previous part to the matrix of uncorrected P-values generated in part B. Use the function p.adjust()
to correct these p-values for multiple comparisons using ‘Bonferroni’, ‘Holm’, ‘BH’ (Benjamini-Hochberg), and ‘BY’ (Benjamini-Yekuteli). Use your evaluate()
function for each set of adjusted p-values.
Produce one or more nicely formatted graphs or tables reporting your results. Briefly discuss what you found.
This is a bonus question related to problem 6 from the midterm. First, review the script written in Stata available here. In this question, you will work through various options for translating this analysis into R. You may submit all or some of these, but each part must be entirely correct to earn the points listed.
Write a translation using data.table
for the computations. [5 pts]
Write a function to compute the univariate regression coefficient by group for arbitrary dependent, independent, and grouping variables. Use data.table
for computations within your function. Test your function by showing it produces the same results as in part a. [10 pts]
Compute the regression coefficients using the dplyr verb summarize_at()
. [5 pts]
Write a function similar to the one in part b to compute arbitrary univariate regression coefficients by group. Use dplyr
for computations within your function. You should read the “Programming with dplyr” vignette at vignette('programming', 'dplyr')
before attempting this. Warning: this may be difficult to debug! [10 points]