Instructions

Questions

Question 1 [20 points]

Repeat question 2 from problem set 4 using R and the data.table package.

Question 2 [50 points]

In this question, you will use data.table to clean and analyze the DNA methylation data available here. You can read more about the experiment and data here.

  1. Download the data and inspect the first 100 rows at the command line. Write the commands for doing so in your solution. How many lines of header information are there?

  2. Use fread to read the data into R as a data.table. Filter the data to those probes (ID_REF) corresponding to chromosomal locations beginning with “ch” and discard the sample with all missing values. Finally, pivot the data to a longer format using melt, with each row corresponding to a single sample-probe pair.

  3. Refer to the second link above to determine which samples correspond to individuals with Crohn’s disease and which to non-Crohn’s samples. Add a column sample_group to the data.table by reference recording this information.

  4. Create a new data.table by computing a t-statistic (with homogeneous/pooled variance) comparing the difference in means between groups for each unique probe. Hint: refer to this document if needed.

  5. Add a column probe_group by reference assigning probes to groups using the first 5 digits of the probe ID.

  6. Compute the proportion of probes within each probe_group that are nominally significant at the 5% level assuming a two-tailed test. Produce a figure comparing these proportions. Which group stands out as potentially over-represented?

  7. Next, we will use permutation tests to assess the statistical significance of each probe group, using one of the custom statistics below. Write a function taking three arguments: (1) the data.table produced in part 3, (2) a type (two-tailed, greater, or lesser), and (3) a logical flag “permute”. Your function should first permute the sample group labels when the permute flag is true, then compute t-statistics as in part 4, and finally compute the appropriate statistic from the list below. Here \(G\) is the number of probes in a given group, \(t^*_\alpha\) is the \(\alpha\)-quantile from a t-distribution (with appropriate degrees of freedom) and \(t_i\) is the t-statistic for probe \(i\).

    1. “two-tailed”: \(T_{abs} = \frac{1}{G}\sum_{i=1}^G |t_i|1[|t_i| > t_{1-\alpha/2}^*]\)
    2. “greater”: \(T_{up} = \frac{1}{G}\sum_{i=1}^G t_i1[t_i > t_{1-\alpha}^*]\)
    3. “lesser”: \(T_{down} = \frac{1}{G}\sum_{i=1}^G t_i1[t_i < t_\alpha^*]\)
  8. Use your function to compute the \(T_{abs}\) score for each probe group on the original data. Then, use your function to compute scores for each of 1,000 permutations and compute p-values for testing whether the observed \(T_{abs}\) score for each group is larger than expected under the null hypothesis that patterns of gene expression are the same across the Crohn’s and non-Crohn’s groups. Time how long it takes to compute the 1,000 permutations.

  9. Repeat the previous part using the \(T_{up}\) score and using mclapply for parallelism.

  10. Repeat the previous part using the \(T_{down}\) score and using futures for parallelism.
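
To make the pooled-variance t-statistic, the \(T_{abs}\) score, and the label permutation concrete, here is a minimal sketch on simulated data. It is not a solution to the question: the table `toy`, the column names `sample_id` and `value`, the group labels, and the helper names `pooled_t` and `t_abs_score` are all hypothetical, and the sketch assumes `probe_group` is already present in the long-format table. It uses 200 permutations rather than 1,000 to keep the run short, and the "+1" p-value correction is one common convention, not something the assignment specifies.

```r
library(data.table)

set.seed(42)

# Simulated long-format data (hypothetical): 6 probes x 8 samples.
toy <- CJ(probe = paste0("p", 1:6), sample_id = paste0("s", 1:8))
toy[, value := rnorm(.N)]
toy[, sample_group := fifelse(sample_id %in% paste0("s", 1:4),
                              "crohns", "non_crohns")]
toy[, probe_group := fifelse(probe %in% c("p1", "p2", "p3"), "A", "B")]

# Pooled-variance t-statistic per probe, optionally after permuting labels.
pooled_t <- function(dt, permute = FALSE) {
  d <- copy(dt)  # avoid modifying the caller's table by reference
  if (permute) {
    # Permute at the sample level so every probe sees the same relabeling.
    labels <- unique(d[, .(sample_id, sample_group)])
    labels[, sample_group := sample(sample_group)]
    d[labels, sample_group := i.sample_group, on = "sample_id"]
  }
  d[, {
    x <- value[sample_group == "crohns"]
    y <- value[sample_group == "non_crohns"]
    nx <- length(x)
    ny <- length(y)
    sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
    .(t_stat = (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny)),
      dof = nx + ny - 2)
  }, by = .(probe, probe_group)]
}

# T_abs: mean over probes in a group of |t_i| * 1[|t_i| > t*_{1 - alpha/2}].
t_abs_score <- function(dt, alpha = 0.05, permute = FALSE) {
  tt <- pooled_t(dt, permute = permute)
  tt[, .(t_abs = mean(abs(t_stat) * (abs(t_stat) > qt(1 - alpha / 2, dof)))),
     by = probe_group]
}

# Observed scores, then scores under permuted labels.
obs <- t_abs_score(toy)
setnames(obs, "t_abs", "t_obs")
perm <- rbindlist(
  lapply(1:200, function(b) t_abs_score(toy, permute = TRUE)),
  idcol = "b"
)

# Permutation p-value per probe group (with a +1 correction).
both <- merge(perm, obs, by = "probe_group")
pvals <- both[, .(p = (1 + sum(t_abs >= t_obs)) / (1 + .N)), by = probe_group]
print(pvals)
```

Because each permutation is independent, the `lapply` call is the natural place to swap in a parallel equivalent such as `parallel::mclapply`, and wrapping the call in `system.time()` is one way to time it.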