Instructions

Below, you will find parts of four questions that originally appeared on the midterm exam. Some of these parts have been slightly modified from the original version.

These questions should be considered extra credit. You are welcome to answer all or some of them. There is no penalty for not answering them, as the total points available in the course remains 1000.

Submit your answers as a single PDF document created using Rmarkdown to Canvas. Also, for questions which involve R, submit R scripts named “midterm_ec_qX.R” where “X” is the number of the question you are answering.

If you choose to do part of the assignment, a draft is due Tuesday October 29 by 5pm.

After the due date, you will be asked to review the submissions of two peers. Please follow the peer review guidelines available at: https://jbhender.github.io/Stats506/F19/peer_review.html. Peer reviews are due Thursday at midnight.

After peer review, you will have an opportunity to modify your submission. On your final submission, you will receive credit only for those questions for which you submitted reasonably complete solutions on the draft. The final submission will be due Tuesday November 5 at 5pm.


Part One

Question 2 - Regular Expressions [4 points]

c.

You have been asked to update a collection of related R scripts you wrote a few years ago. The scripts all use dplyr and tidyr and there are several thousand lines of code. Before making the requested updates, you decide it would be a good idea to replace instances of spread with pivot_wider and instances of gather with pivot_longer. You decide to start by using grep at the command line to create a checklist of files/lines to change.

After your first attempt, you notice you are getting a lot of lines where you used the words gather or spread in the comments. You would like to keep these lines from appearing in your search results.

Task: Write a call to grep that will find all lines in .R or .Rmd files in the local directory that use the functions gather or spread.

A directory of files to test against is available at the course git repo.
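
The difference between a function call and a mention in a comment can be explored in R before moving to the command line. The sketch below is an illustration rather than a solution: the sample lines and the pattern are assumptions made up for this example, and the pattern would still need to be adapted to grep's own syntax and options.

```r
# Hypothetical sample lines; these are not taken from the test directory.
lines <- c(
  "df_long = gather(df, key, value)",
  "# gather the results before plotting",
  "wide = df %>% spread(key, value)  # spread to wide",
  "x = 1  # then spread(x) later"
)

# Assumed pattern: from the start of the line, allow only characters that
# are not '#', then require the function name followed by an opening
# parenthesis.
pattern <- "^[^#]*\\b(gather|spread)\\("

is_call <- grepl(pattern, lines, perl = TRUE)
is_call
# TRUE FALSE TRUE FALSE
```

One limitation of this approach is that a '#' inside a quoted string earlier on the line would hide a genuine call, so the results are a checklist to review rather than a guarantee.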

Question 4 - dplyr and tidyr [6 points]

In this question you will write or interpret short dplyr pipes that explore or analyze the Orange data, which contains 35 rows and 3 columns recording the growth of orange trees. The dataset has three columns:

  • Tree: an ordered factor, identifying individual trees,
  • age: a numeric vector giving the number of days since the tree was planted,
  • circumference: a numeric vector recording the circumference of the trunk in mm.

The earliest age in the data is after planting (age 0). Where necessary, assume the circumference was zero at age 0.

  1. Write a dplyr pipe to calculate the average rate of growth (mm/day) between age 0 and the final measurement across all trees.
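
As a reminder of the general shape of a grouped dplyr pipe on this data, the sketch below extracts each tree's final measurement. It is not the requested calculation, and the names final_by_tree, final_age and final_circ are hypothetical. The call to as.data.frame() is a precaution, since Orange carries additional classes beyond data.frame.

```r
library(dplyr)

# One row per tree: the age and circumference at the last measurement.
final_by_tree <- as.data.frame(Orange) %>%
  group_by(Tree) %>%
  summarize(
    final_age  = max(age),
    final_circ = circumference[which.max(age)]
  )

final_by_tree
```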

Part Two

Question 5 - S3 Methods in R [20 points]

In this question, you will define a new S3 generic about for summarizing basic information about an R object in a compact fashion.

  1. Define a new S3 generic about. Use ... in the function template.

  2. Define a default method for the about generic. Your method should call the str generic with the option give.attr = FALSE.

  3. Define an about method for objects of class data.frame that prints the following information to the console:

    1. A string, “‘data.frame’: NN obs of PP variables:” with NN and PP replaced with appropriate numbers. This should be its own line. If the object being summarized is a member of one or more data.frame sub-classes replace ‘data.frame’ with the name of the primary class.

    2. A string " PP numeric variables: [V1, V2, V3, …]" giving the number of numeric columns (PP), and listing the names (V1, V2, …) as a comma-separated list.

    3. A string " PP factor variables: [V1, V2, V3, …]" giving the number of factor and/or character columns (PP), and listing the names (V1, V2, …) as a comma-separated list.

  4. Define an about method for objects of class tbl (a tibble, inheriting from the data.frame class) that calls the data.frame method above and then also prints, when applicable, the number and names of any “list” columns. Hint: use NextMethod().
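
The dispatch mechanics behind these steps can be seen in a small base R sketch. It deliberately uses a hypothetical generic describe and a hypothetical sub-class my_df instead of about and tbl, so it shows the pattern (generic, default method, class method, NextMethod()) without being a solution.

```r
# Hypothetical generic; `...` lets methods accept extra arguments.
describe <- function(x, ...) {
  UseMethod("describe")
}

# Fallback used when no class-specific method exists.
describe.default <- function(x, ...) {
  cat("object of class:", class(x)[1], "\n")
  invisible(x)
}

# Method for data frames; class(x)[1] reports the primary class,
# which may be a sub-class when reached through NextMethod().
describe.data.frame <- function(x, ...) {
  cat(sprintf("'%s': %d rows, %d columns\n", class(x)[1], nrow(x), ncol(x)))
  invisible(x)
}

# Method for the hypothetical sub-class: run the data.frame method
# first, then print the extra information.
describe.my_df <- function(x, ...) {
  NextMethod()
  cat("note:", attr(x, "note"), "\n")
  invisible(x)
}

df <- structure(
  data.frame(a = 1:3, b = letters[1:3]),
  class = c("my_df", "data.frame"),
  note = "toy example"
)
describe(df)
```

Because NextMethod() passes the original object along, the data.frame method still sees my_df as the first element of class(x).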

Question 6 - Vectorization [20 points]

On the course git repo, you will find an R script which implements a Monte Carlo study to estimate the number of games won by each of three teams in a hypothetical baseball league. Refer to the header and comments for additional context.

Within the script, there are three “tasks”, denoted <task X>, asking you either to rewrite and improve existing portions of the script or complete it by filling in missing code chunks.

Use vectorization wherever possible in your answers. Include only the code for each <task> in your PDF document. In your R script, clearly label the code related to each <task>.
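
As a generic illustration of what vectorization means here, the sketch below counts simulated wins first with a loop and then with a single vectorized expression. It is unrelated to the contents of the course script; n_games and p_win are hypothetical values chosen for the example.

```r
set.seed(42)
n_games <- 1000   # hypothetical number of games
p_win   <- 0.55   # hypothetical win probability

# Draw the uniform variates once so both versions use the same randomness.
u <- runif(n_games)

# Loop version: inspect one game at a time.
wins_loop <- 0
for (i in seq_len(n_games)) {
  if (u[i] < p_win) {
    wins_loop <- wins_loop + 1
  }
}

# Vectorized version: compare the whole vector at once and sum the results.
wins_vec <- sum(u < p_win)

wins_loop == wins_vec
```

Both versions give the same count; the vectorized form is shorter and avoids the per-iteration overhead of the loop.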