Memory in R

Key Concepts

  1. storage for types: logical/integer 4B each, double/pointer 8B each
  2. SI units: “B”, “kB”, “MB”, “GB”, “TB”, …
  3. R’s global string pool: character vectors are pointers to this pool
  4. Empty vectors take 48B; sizes then grow in increments that are multiples of 8B
  5. type promotion
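
The arithmetic behind items 1, 4, and 5 can be sketched as follows. This is a rough sketch assuming a 64-bit build of R; `approx_bytes` is a hypothetical helper written for this illustration, not a base R function, and it ignores attributes and the string pool. For very short vectors R allocates in fixed size classes, so the helper can underestimate there.

```r
# Hypothetical helper: approximate a vector's size as a 48 B header plus
# n elements, rounded up to a multiple of 8 B. Assumes 64-bit R.
approx_bytes = function(n, type = c("logical", "integer", "double")) {
  type = match.arg(type)
  per_element = c(logical = 4, integer = 4, double = 8)[[type]]
  48 + 8 * ceiling(n * per_element / 8)
}

n = 1e4
est_int = approx_bytes(n, "integer")  # 48 + 4 * 1e4
est_dbl = approx_bytes(n, "double")   # 48 + 8 * 1e4

# Compare with what R reports for vectors of the same length.
obs_int = as.numeric(object.size(integer(n)))
obs_dbl = as.numeric(object.size(numeric(n)))

# Type promotion: combining an integer with a double yields a double,
# so storage per element goes from 4 B to 8 B.
promoted = c(1L, 2.5)
typeof(promoted)

# Report a size in SI units (1 kB = 1000 B).
format(object.size(numeric(n)), units = "kB", standard = "SI")
```

The same arithmetic applies to the example questions: count the elements, multiply by the bytes per element for the storage type, and then pick a unit that gives a readable number.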

Example Question

  1. Put the following in order from smallest to largest memory footprint:
x = rnorm(n = 5e3)
y = sample(LETTERS, size = 5e3, replace = TRUE)
z = factor(y)
  2. Approximately how much memory does the following matrix occupy? Pick appropriate units.
mat = matrix( 0L, nrow = 1e4, ncol = 5e3 )
  3. See review quiz.

SAS

Key Concepts

  1. libname statements and two part file names (active)
  2. macro variables: %let var = text; and "&var." (passive); note that quotes in a %let statement become part of the stored value
  3. proc summary (passive)
  4. proc sql (active)
  5. basic data step programming (active)
  6. proc transpose (passive)

Example Question

The script below reads in data related to hospital stays among Medicare patients in 2014 and 2016 and then limits to rows related to pneumonia or other respiratory infections. After sub-setting, each row represents a hospital (“hosp_id”) and a particular type of hospital stay (“drg”). The script computes z-scores for each hospital comparing the difference in proportions for one type of stay (“drg” = 178) to the others between the two years.

Note: This is based on question 4 from the 2018 final exam.

  1. List all tables written to disk (in non-temporary locations) as sas7bdat files after running the script below. For each table, also give the number of columns it contains.

  2. List all tables in the work library after running this script.

  3. Rewrite the chunks labeled “merge” and “Compute totals, remerge, and create pooled estimates” using proc sql.

  4. Find all missing comments indicated as [Comment X] and provide an appropriate, descriptive comment.

/* libraries: ------------------------------------------- */
%let datpath = ~/path/to/project/data;
libname mylib "&datpath.";
%let respath = ~/path/to/project/results;
libname reslib "&respath.";

/* Import the data: ------------------------------------- */
proc import datafile='./data2014.csv' out=ip14;
proc import datafile='./data2016.csv' out=ip16;
run;

/* [Comment 1]: ----------------------------------------- */
data mylib.p14;
 set ip14;
 where drg in (178, 179, 185);
 n14 = total_discharges;
 keep drg hosp_id n14;
 
data mylib.p16;
 set ip16;
 where drg in (178, 179, 185);
 n16 = total_discharges;
 keep drg hosp_id n16;
run;

/* Merge: ----------------------------------------------- */
data pneum; 
  merge mylib.p14 mylib.p16;
  by drg hosp_id;

/* Compute totals, remerge, and create pooled estimates:- */
proc summary data = pneum;
 by hosp_id;
 output out = totals
   sum(n14) = tot14
   sum(n16) = tot16;
   
/* Remerge and create pooled estimate*/
data pneum;
 merge pneum totals;
 by hosp_id;
 p_pool = (n14 + n16) / (tot14 + tot16);
run;

/* Compute z-scores: ------------------------------------ */
data result;
 set pneum;
 where drg = 178 and p_pool lt 1; /* lt = < */
 p14 = n14/tot14;
 p16 = n16/tot16;
 z = (p16 - p14)/sqrt(p_pool*(1-p_pool)*(1/tot14 + 1/tot16));
run;

/* [Comment 2]: ----------------------------------------- */
proc sort data=result out=reslib.result;
 by descending z; 
run;

/* 60: -------------------------------------------------- */

A solution to the proc sql question:

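/* merge: ----------------------------------------------- */
/* A data step merge keeps unmatched rows from both tables,
   so a full join with coalesced keys is the closest match. */
proc sql;
  create table pneum as
  select coalesce(a.drg, b.drg) as drg,
         coalesce(a.hosp_id, b.hosp_id) as hosp_id,
         a.n14, b.n16
  from mylib.p14 a
  full join mylib.p16 b
  on a.hosp_id = b.hosp_id and a.drg = b.drg;
quit;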

/* compute totals, remerge, and create pooled estimates:- */
proc sql;

 /* Compute totals */
 create table totals as
 select hosp_id, sum(n14) as tot14, sum(n16) as tot16
 from pneum
 group by hosp_id; 

 /* Remerge and create pooled estimate */
 create table pneum as
 select p.*, t.tot14, t.tot16,
        (p.n14 + p.n16) / (t.tot14 + t.tot16) as p_pool
 from pneum p
 left join totals t
 on p.hosp_id = t.hosp_id;

quit;

Stata

Key Concepts

  1. basic data management: generate, replace, keep, etc.
  2. “locals”: local myvars "a b c"; `myvars' (active)
  3. loops (passive)
  4. “one data principle” and use of preserve / restore (passive)
  5. collapse (active)
  6. pivoting with reshape (passive)

Example Question

See question 6 from the Fall 2018 midterm

SQL

Key Concepts

  1. canonical order: select, from, where, group by, having, order by
  2. joins: left, inner, outer, using ‘on’ to specify matching rows
  3. aliases: for tables and in select statements

Example Questions

  1. Question 2 from the Fall 2018 final.

  2. The example SAS question above.

  3. https://jbhender.github.io/data.table_intro.html#exercises

dplyr

Key Concepts

  1. You won’t be tested explicitly on dplyr, but will be asked to use dplyr/tidyr to demonstrate your knowledge of SAS/Stata/SQL/data.table.
  2. key verbs: filter, mutate, select, transmute, group_by, summarize, left_join (active)
  3. pivoting (passive)
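
As an illustration of the key verbs, here is a minimal sketch that redoes the "compute totals and remerge" step from the SAS example in dplyr. The data values are made up for this example and only the column names come from the SAS script; it assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up data using the column names from the SAS example above.
pneum = tibble(
  hosp_id = c(1, 1, 2, 2),
  drg     = c(178, 179, 178, 179),
  n14     = c(10, 30, 5, 15),
  n16     = c(20, 20, 10, 10)
)

# group_by + mutate plays the role of "compute totals and remerge".
result = pneum %>%
  group_by(hosp_id) %>%
  mutate(tot14 = sum(n14), tot16 = sum(n16)) %>%
  ungroup() %>%
  filter(drg == 178) %>%
  transmute(hosp_id, p14 = n14 / tot14, p16 = n16 / tot16)

# The same totals via summarize + left_join, closer to the SQL version.
totals = pneum %>%
  group_by(hosp_id) %>%
  summarize(tot14 = sum(n14), tot16 = sum(n16), .groups = "drop")
remerged = left_join(pneum, totals, by = "hosp_id")
```

Both routes give the same per-hospital totals; the grouped mutate avoids the explicit join.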

Example Questions

See above or any prior midterm/final exam.

data.table

Key Concepts

  1. dt[i, j, by] (active)
  2. by vs keyby (passive)
  3. When j is/returns a list, the result is a data.table. (concept)
  4. reference semantics: dt[ , new := comp(old) ] (active)
  5. special symbols: .N, .SD, and use of .SDcols argument (passive)
  6. pivot methods (passive): dcast (wider), melt (longer)
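
The concepts above can be sketched on a small table. The data and column names below are made up for illustration, and the sketch assumes the data.table package is installed.

```r
library(data.table)

# Made-up data for illustration.
dt = data.table(
  g = c("a", "a", "b", "b"),
  x = c(1, 2, 3, 4),
  y = c(10, 20, 30, 40)
)

# dt[i, j, by]: filter rows in i, compute in j, group with by.
# j returns a list via .(), so the result is a data.table.
sums = dt[x > 1, .(n = .N, total = sum(y)), by = g]

# keyby groups like by, but also sorts and keys the result by g.
# .SD holds the columns named in .SDcols for each group.
means = dt[, lapply(.SD, mean), keyby = g, .SDcols = c("x", "y")]

# Reference semantics: := adds a column in place, without copying dt.
dt[, z := x + y]

# Pivoting: melt to a longer format, dcast back to a wider one.
long = melt(dt, id.vars = "g", measure.vars = c("x", "y"))
wide = dcast(long, g ~ variable, value.var = "value", fun.aggregate = sum)
```

Note that `dt` itself changes after the `:=` line, while `sums`, `means`, `long`, and `wide` are new tables.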

Example Questions

  1. See the exam review quiz.
  2. https://jbhender.github.io/data.table_intro.html#exercises
  3. Consider question 5 from the Fall 2018 final. If given a partial solution, be able to fill in the missing pieces.

resampling methods

Key Concepts

  1. bootstrap resampling (active)
  2. cross validation (active)
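
A minimal base R sketch of both ideas, using simulated data. The data-generating model, the number of bootstrap replicates, and the number of folds are all assumptions chosen for illustration.

```r
set.seed(42)

# Made-up data: y depends linearly on x plus noise.
n = 100
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
dat = data.frame(x = x, y = y)

# Bootstrap: resample with replacement and recompute the statistic.
B = 1000
boot_means = replicate(B, mean(sample(y, size = n, replace = TRUE)))
ci = quantile(boot_means, probs = c(0.025, 0.975))  # percentile interval

# k-fold cross validation: each observation is held out exactly once.
k = 5
fold = sample(rep(1:k, length.out = n))
mse = vapply(1:k, function(i) {
  fit = lm(y ~ x, data = dat[fold != i, ])          # train on k - 1 folds
  pred = predict(fit, newdata = dat[fold == i, ])   # predict held-out fold
  mean((dat$y[fold == i] - pred)^2)
}, numeric(1))
cv_mse = mean(mse)
```

The loops over bootstrap replicates and over folds are independent across iterations, which is what makes them natural candidates for parallel execution.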

Example Question

  1. Question 3 from the Fall 2018 final

parallel or asynchronous computing

Key Concepts

  1. parallel computing with mclapply and the role of mc.preschedule (active)
  2. the difference between a future and its value, blocking (active)
  3. a conceptual understanding of communication overhead and the role of chunking/prescheduling in minimizing it
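
The role of mc.preschedule can be sketched with the parallel package that ships with R. The task function `slow_square` and the choice of two cores are hypothetical and chosen only for illustration; futures are not covered in this sketch.

```r
library(parallel)

# Hypothetical task: a slow function applied to each element.
slow_square = function(i) {
  Sys.sleep(0.01)  # stand-in for real work
  i^2
}

# mclapply forks the R process, so using more than one core requires a
# Unix-alike system; fall back to 1 core on Windows.
cores = if (.Platform$OS.type == "windows") 1L else 2L

# mc.preschedule = TRUE (the default) divides the tasks among the cores
# up front: less overhead, but poor balance if task times vary widely.
res_pre = mclapply(1:8, slow_square, mc.cores = cores,
                   mc.preschedule = TRUE)

# mc.preschedule = FALSE forks a new process for each task: better
# load balancing, at the cost of more forking overhead.
res_dyn = mclapply(1:8, slow_square, mc.cores = cores,
                   mc.preschedule = FALSE)
```

Both calls return the same list; they differ only in how the work is divided, which matters when there are many short tasks (prefer prescheduling) or a few tasks of very uneven length (prefer dynamic assignment).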

Example Question

  1. If given any of the loops in questions 3, 6, or 7 from the 2018 final, be able to rewrite it using mclapply or future.

  2. Consider the example here. Given a plan and a number of workers, be able to estimate how much time a block of code will take and to explain which processes are blocking.