Instructions

Question 1

In this question, you will use command line tools to answer questions about the 2015 Residential Energy Consumption Survey (RECS 2015) data set.

In addition to your Rmd file, please submit a shell script ps1_q1.sh written in Bash using the “shebang” #!/bin/bash. Your script should assume the file recs2015_public_v3.csv is in the same directory and be executable as bash ps1_q1.sh.

Part A [5 points; 2.5 each]

In part A, your solution to each question should be a Linux “one-liner”, i.e. a series of one or more commands connected by pipes “|”. Please provide both your solution and the result. Your solution must be written in text so that it can be copied and pasted if needed.

  1. How many rows are there for region 3 in the RECS 2015 data set?

  2. Write a one-liner to create a compressed data set containing only the variables: DOEID, NWEIGHT, and BRRWT1-BRRWT96.
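The one-liner style asked for in Part A can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: toy_recs.csv and toy_weights.csv.gz are hypothetical stand-ins with made-up values, REGIONC is assumed to be the second column, fields are assumed to be quoted, and no field is assumed to contain an embedded comma. Check each assumption against the header of recs2015_public_v3.csv before reusing the pattern.

```shell
#!/bin/bash
# Toy stand-in for recs2015_public_v3.csv (hypothetical values; only the
# column names are taken from the assignment text).
cat > toy_recs.csv <<'EOF'
"DOEID","REGIONC","DIVISION","NWEIGHT","BRRWT1","BRRWT2"
"10001","4","10","100.5","90.1","95.2"
"10002","3","7","200.0","180.3","210.4"
"10003","3","5","150.2","140.9","160.0"
"10004","1","2","120.7","110.6","130.8"
EOF

# 1. Count rows for region 3: strip quotes, keep column 2 (assumed to be
#    REGIONC), and count the lines that are exactly "3".
region3=$(tr -d '"' < toy_recs.csv | cut -d, -f2 | grep -c '^3$')
echo "rows in region 3: ${region3}"

# 2. Keep only DOEID, NWEIGHT, and the BRRWT* columns, then compress.
#    The column positions are looked up by name from the header row.
cols=$(head -n 1 toy_recs.csv | tr ',' '\n' \
  | grep -n -E '^"?(DOEID|NWEIGHT|BRRWT[0-9]+)"?$' \
  | cut -d: -f1 | paste -sd, -)
cut -d, -f"$cols" toy_recs.csv | gzip > toy_weights.csv.gz
```

Looking the positions up by name avoids hard-coding where the 96 replicate-weight columns sit in the file. The second step can be collapsed into a single pipeline by substituting the `$(...)` expression directly into the `-f` argument of `cut`.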

Part B [10 points; 5 each]

  1. Write a Bash for loop to count and print the number of observations within each region.

  2. Produce a file region_division.txt providing a sorted list showing unique combinations of values from REGIONC and DIVISION. Include the contents of that file in your solution. Hint: See man uniq.
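The two Part B tasks might be approached along these lines. Again this is a sketch on a hypothetical toy file (toy_recs.csv, made-up values), assuming REGIONC and DIVISION are columns 2 and 3, that regions are coded 1 through 4, and that fields are quoted with no embedded commas.

```shell
#!/bin/bash
# Toy stand-in for recs2015_public_v3.csv (hypothetical values).
cat > toy_recs.csv <<'EOF'
"DOEID","REGIONC","DIVISION","NWEIGHT"
"10001","4","10","100.5"
"10002","3","7","200.0"
"10003","3","5","150.2"
"10004","1","2","120.7"
"10005","3","7","175.9"
EOF

# 1. Loop over the region codes and count the data rows in each.
#    awk prints 0 when a region has no rows (region 2 in this toy file).
for r in 1 2 3 4; do
  n=$(tr -d '"' < toy_recs.csv \
    | awk -F, -v r="$r" 'NR > 1 && $2 == r {n++} END {print n + 0}')
  echo "region ${r}: ${n}"
done

# 2. Unique REGIONC,DIVISION pairs: drop the header, strip quotes, keep
#    columns 2 and 3, sort numerically on both keys, then de-duplicate.
#    uniq only collapses adjacent lines, so the sort must come first.
tail -n +2 toy_recs.csv | tr -d '"' | cut -d, -f2,3 \
  | sort -t, -k1,1n -k2,2n | uniq > region_division.txt
cat region_division.txt
```

The numeric sort keys keep division 10 after division 9; a plain lexical `sort` would place "10" before "2".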

Question 2 [25 pts]

In this question, you will use R to answer questions about flights originating in New York City, NY (NYC) in 2013 and 2014. Data for 2013 can be found in the nycflights13 R package. Data through October 2014 is available here. Your answers should be submitted as nicely formatted tables produced using Rmarkdown.

  1. Which airlines were responsible for at least 1% of the flights departing any of the three NYC airports between January 1 and October 31, 2013?

  2. Among the airlines from part 1, compare the number and percent of flights in the first 10 months of 2013 and the first 10 months of 2014. Your table should include: the airline name (not carrier code), a nicely formatted number (see format()), percents for each year with 95% CI, and change in percent with 95% CI. Which airlines showed the largest increase and decrease? Why do some airlines show an increase in the percent of flights but a decrease in the number of flights?

  3. For each of the three NYC airports, produce a table showing the percent of flights each airline is responsible for. Limit the table to the airlines identified in part 1 and include confidence intervals for your estimates. Which airline is the largest carrier at each airport?

Question 3 [45 pts; 15 pts each]

In this question, you will use R to answer questions about the RECS 2015 data. You should read the section on computing standard errors available here. For each question, produce a nicely formatted table and graph to support your answer. In your tables and graphs, please provide standard errors for all point estimates.

  1. What percent of homes have stucco construction as the major outside wall material within each division? Which division has the highest proportion? Which the lowest?

  2. What is the average total electricity usage in kilowatt hours in each division? Answer the same question stratified by urban and rural status.

  3. Which division has the largest disparity between urban and rural areas in terms of the proportion of homes with internet access?
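
For the standard errors requested throughout Question 3, the replicate-weight calculation can be written out as below. This is a sketch under assumptions not stated in this handout: that the 96 replicate weights BRRWT1-BRRWT96 are used with Fay's balanced repeated replication and a Fay coefficient of 0.5. Confirm both against the standard-errors section linked above before relying on the constants.

```latex
% \hat{\theta}   : point estimate computed with the final weight NWEIGHT
% \hat{\theta}_r : same estimate recomputed with replicate weight BRRWT r
% R = 96 replicates, Fay coefficient \varepsilon = 0.5 (both assumed)
\[
  \widehat{\operatorname{Var}}(\hat{\theta})
    = \frac{1}{R\,(1-\varepsilon)^{2}}
      \sum_{r=1}^{R} \bigl(\hat{\theta}_r - \hat{\theta}\bigr)^{2},
  \qquad
  \operatorname{SE}(\hat{\theta})
    = \sqrt{\widehat{\operatorname{Var}}(\hat{\theta})}
\]
```

Under those assumptions the leading factor is 1 / (96 × 0.25) = 1/24. The same formula applies whether the estimate is a proportion (parts 1 and 3) or a mean (part 2); for the urban versus rural disparity in part 3, compute the difference within each replicate and apply the formula to the difference itself.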