Split, Apply, Combine and dplyr

One of the most frequently used patterns in data analysis and management is performing some operation repeatedly over subsets of data. This is often called “split-apply-combine” and is an approach emphasized by the R package dplyr.

Reading

Please read the introduction to dplyr and work through the examples there.

You should also read about tidyr in R for Data Science.

The two readings below are optional but you may find them useful.

Read more about tibbles here.

Read more about pipes here.

Case Study: Home types by state

In this case study, we will use the 2009 RECS data to demonstrate some of the functionality of dplyr. We’ll approach this as a case study in which we set out to answer the question:

Which state has the highest proportion of single-family attached homes?

Data cleaning

In this section we’ll cover the dplyr functions:

  • select(): choose a subs of variables
  • mutate(): transform variables or create new ones
  • transmute(): transform or select variables and drop others.

Let’s begin by creating a clean and tidy data frame with the necessary variables. We’ll need to keep the two variables of interest and the sample weight. Later we will also make use of the repeated weights to compute standard errors as well.

Here we read in the data, either from a local file or directly from the web.

> # 80:  -------------------------------------------------------------------------
> # Libraries: -------------------------------------------------------------------
> library(tidyverse)
> 
> # Get data: --------------------------------------------------------------------
> file = '../data/recs2009_public.csv'
> if (file.exists(file)){
+   recs_tib = readr::read_delim(file, delim=',')  
+ } else{
+  recs_tib = readr::read_delim(
+    "https://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv",
+     delim = ',' )
+  readr::write_delim(recs_tib, path = file, delim = ',')
+ }

Next, we’ll use transmute() to drop all but a subset of variables. You could use select() here as well.

> ## Subset recs to home-type, state, and sample weight: ------------------------
> recs_homes = transmute(recs_tib, 
+                        State = REPORTABLE_DOMAIN,
+                        Type = TYPEHUQ, 
+                        Weight = NWEIGHT)
> recs_homes
## # A tibble: 12,083 x 3
##    State  Type   Weight
##    <int> <int>    <dbl>
##  1    12     2  2471.68
##  2    26     2  8599.17
##  3     1     5  8969.92
##  4     7     2 18003.64
##  5     1     3  5999.61
##  6    10     2  4232.49
##  7     3     2  7862.34
##  8    17     2  6297.04
##  9     5     3 12156.72
## 10    12     2  3242.22
## # ... with 12,073 more rows

Next, we clean up the values to something more easily interpreted. We will write functions for the mappings so we can reuse them later. The functions will begin with the decode_ prefix so we can remember them as a group.

The values used here come from the code book available here.

> # Functions to decode states or reportable_domains from RECS: -----------------
> 
> ## States
> decode_state = function(x){
+   # Decodes numeric codes for REPORTABLE_DOMAIN to state labels
+   #
+   # Args: 
+   #   x: a single numeric code for a REPORTABLE_DOMAIN
+   #
+   # Returns: The state label
+   
+   # Throw an error if x isn't numeric
+   if(!is.numeric(x)) stop('decode_states expects numeric input indexed from 1!')
+   
+   switch(x,
+       "CT, ME, NH, RI, VT", "MA", "NY", "NJ", "PA", "IL", "IN, OH", "MI", "WI",
+       "IA, MN, ND, SD", "KS, NE", "MO", "VA", "DE, DC, MD, WV", "GA",
+       "NC, SC" , "FL", "AL, KY, MS", "TN", "AR, LA, OK",
+       "TX", "CO", "ID, MT, UT, WY", "AZ", "NV, NM",
+       "CA", "AK, HI, OR, WA"
+     )
+ }
> 
> decode_all_states = function(x){
+   # Vectorizes decode_state above
+   #
+   # Args: 
+   #  x: a vector of integer-valued reportable_domains
+   #
+   # Returns: A vector of state labes or a "list" if some are unmatched.
+   sapply(x, decode_state)
+ }
> ## Housing types
> decode_house_type = function(x){
+   # Decodes numeric codes for housing type
+   #
+   # Args: 
+   #   x: a single numeric code for a housing type
+   #
+   # Returns: The label for a housing type
+   
+   # Error checking
+   if(!is.numeric(x)) stop('decode_house_type expects numeric input indexed from 1!')
+   
+   switch(x,
+          'MobileHome',
+          'SingleFamilyDetached',
+          'SingleFamilyAttached',
+          'ApartmentFew',
+          'ApartmentMany'
+          )
+ }
> 
> decode_all_house_types = function(x){
+   # Vectorizes decode_houst_type above
+   #
+   # Args: 
+   #  x: a vector of integer-valued reportable_domains
+   #
+   # Returns: A vector of state labes or a "list" if some are unmatched.
+   
+   sapply(x, decode_house_type)
+ }

Next we mutate() our data frame, replacing our numeric codes with decoded values.

> recs_homes = mutate(recs_homes, 
+                     State = decode_all_states(State),
+                     Type = decode_all_house_types(Type)
+              )
> recs_homes
## # A tibble: 12,083 x 3
##                 State                 Type   Weight
##                 <chr>                <chr>    <dbl>
##  1                 MO SingleFamilyDetached  2471.68
##  2                 CA SingleFamilyDetached  8599.17
##  3 CT, ME, NH, RI, VT        ApartmentMany  8969.92
##  4             IN, OH SingleFamilyDetached 18003.64
##  5 CT, ME, NH, RI, VT SingleFamilyAttached  5999.61
##  6     IA, MN, ND, SD SingleFamilyDetached  4232.49
##  7                 NY SingleFamilyDetached  7862.34
##  8                 FL SingleFamilyDetached  6297.04
##  9                 PA SingleFamilyAttached 12156.72
## 10                 MO SingleFamilyDetached  3242.22
## # ... with 12,073 more rows

Exercise

  1. Replace the decode_all_* functions above with a call to factor() within the mutate() function.

  2. Optional challenge. Write a script that reads recs_2009_public_codebook.xlsx and creates a list in which each entry contains a vector of labels for the numeric values of all appropriate recs_2009 variables.

  3. Optional. Modify your answer to 1 above to use the list created in 2.

Aggregrate by group

In this section we’ll cover the dplyr() functions:

  • group_by(): to create groups (i.e. “split”),
  • summarize(): to aggregate (combine) by group .

Recall that we are interested in computing the proportion of each housing type by state or reportable domain. We can do this using the split, apply, combine, paradigm – in dplyr and other languages this is accomplished using a group operation with an aggregation function. In dplyr we use group_by to group and summarize to aggregate.

First the grouping:

> # Aggregrate weights by state and home type: ----------------------------------
> recs_homes_group_states = group_by(recs_homes, State, Type)
> recs_homes_group_states
## # A tibble: 12,083 x 3
## # Groups:   State, Type [134]
##                 State                 Type   Weight
##                 <chr>                <chr>    <dbl>
##  1                 MO SingleFamilyDetached  2471.68
##  2                 CA SingleFamilyDetached  8599.17
##  3 CT, ME, NH, RI, VT        ApartmentMany  8969.92
##  4             IN, OH SingleFamilyDetached 18003.64
##  5 CT, ME, NH, RI, VT SingleFamilyAttached  5999.61
##  6     IA, MN, ND, SD SingleFamilyDetached  4232.49
##  7                 NY SingleFamilyDetached  7862.34
##  8                 FL SingleFamilyDetached  6297.04
##  9                 PA SingleFamilyAttached 12156.72
## 10                 MO SingleFamilyDetached  3242.22
## # ... with 12,073 more rows

Now the aggregation:

> recs_type_state_sum = summarize( recs_homes_group_states, Homes=sum(Weight) )
> recs_type_state_sum
## # A tibble: 134 x 3
## # Groups:   State [?]
##             State                 Type     Homes
##             <chr>                <chr>     <dbl>
##  1 AK, HI, OR, WA         ApartmentFew  374742.9
##  2 AK, HI, OR, WA        ApartmentMany  946196.4
##  3 AK, HI, OR, WA           MobileHome  384298.0
##  4 AK, HI, OR, WA SingleFamilyAttached  189645.3
##  5 AK, HI, OR, WA SingleFamilyDetached 2833057.4
##  6     AL, KY, MS         ApartmentFew  183982.7
##  7     AL, KY, MS        ApartmentMany  201343.6
##  8     AL, KY, MS           MobileHome  422086.3
##  9     AL, KY, MS SingleFamilyAttached  192720.3
## 10     AL, KY, MS SingleFamilyDetached 3637140.9
## # ... with 124 more rows

Pay close attention to the change in grouping. When summarize() is called we lose the most nested group.

Style Note

In developing your own code, you may or may not want to print out the results after every step. If you do, you should comment these lines out or remove them in the final version so it can be executed without unnecessary printing to the console.

Reshaping and formatting results for presentation

In this section we will cover the functions:

  • tidyr::spread(): reshape data from long to wide,
  • dplyr::select(): choose a subset of variables,
  • dplyr::arrange(): reorder the rows in data frame.

To proceed, let’s reshape the data to have one row per state. We can do this using the tidyr::spread function.

> # Reshape to wide format for easy computation of proportions: -----------------
> recs_type_state = tidyr::spread(recs_type_state_sum, Type, Homes)
> 
> # Compute proportions: --------------------------------------------------------
> recs_type_state = mutate(recs_type_state,
+                          Total = ApartmentFew + ApartmentMany + MobileHome +
+                            SingleFamilyAttached + SingleFamilyDetached,
+                          ApartmentFew         = 100*ApartmentFew/Total,
+                          ApartmentMany        = 100*ApartmentMany/Total,
+                          MobileHome           = 100*MobileHome/Total,
+                          SingleFamilyAttached = 100*SingleFamilyAttached/Total,
+                          SingleFamilyDetached = 100*SingleFamilyDetached/Total
+                          )
> 
> # Drop total: ------------------------------------------------------------------
> recs_type_state = select(recs_type_state, -Total)
> 
> # Sort by Single Family Attached: ----------------------------------------------
> recs_type_state = arrange(recs_type_state, SingleFamilyAttached)
> 
> ## Use desc() to sort in descending order
> recs_type_state = arrange(recs_type_state, desc(SingleFamilyAttached))
> recs_type_state
## # A tibble: 27 x 6
## # Groups:   State [27]
##             State ApartmentFew ApartmentMany MobileHome
##             <chr>        <dbl>         <dbl>      <dbl>
##  1             PA    10.965041      11.47363   5.287897
##  2 DE, DC, MD, WV     3.207400      18.54110   7.422471
##  3             CO     7.706639      13.63568   5.099064
##  4             VA     4.092552      16.01914  10.639862
##  5             CA     8.466971      23.50956   3.226220
##  6         IN, OH     7.556989      12.47549   3.304245
##  7             NJ    11.303660      14.53719   1.880313
##  8             WI    11.340003      12.36640   2.186126
##  9             MA    24.400382      21.14616   1.584660
## 10 IA, MN, ND, SD     3.946439      14.28259   5.224894
## # ... with 17 more rows, and 2 more variables: SingleFamilyAttached <dbl>,
## #   SingleFamilyDetached <dbl>

A nicely formatted table would be rounded to a whole or tenth of a percent. In general, it is better to round only for display and not in the data itself to prevent rounding error from accumulating.

> # Output a markdown table: -----------------------------------------------------
> cap_title = '**Table 1.** *Proprtion of home types by State(s).*'
> cap_text0 = 'Each row shows the home type distribution for one or more states.'
> cap_text1 = 'Rows are sorted by the proportion of single family attached homes.'
> cap = paste(cap_title, cap_text0, cap_text1)
> 
> cols = c('State(s)', 'Apartment (2-5 Units)', 'Apartment (5+ units)', 
+          'Mobile Home', 'Single Family Attached', 'Single Family Detached')
> 
> knitr::kable(recs_type_state, digits=1, caption=cap, col.names = cols)
Table 1. Proprtion of home types by State(s). Each row shows the home type distribution for one or more states. Rows are sorted by the proportion of single family attached homes.
State(s) Apartment (2-5 Units) Apartment (5+ units) Mobile Home Single Family Attached Single Family Detached
PA 11.0 11.5 5.3 19.4 52.8
DE, DC, MD, WV 3.2 18.5 7.4 17.3 53.6
CO 7.7 13.6 5.1 10.7 62.9
VA 4.1 16.0 10.6 7.3 62.0
CA 8.5 23.5 3.2 7.0 57.8
IN, OH 7.6 12.5 3.3 6.3 70.4
NJ 11.3 14.5 1.9 5.9 66.3
WI 11.3 12.4 2.2 5.9 68.3
MA 24.4 21.1 1.6 5.7 47.2
IA, MN, ND, SD 3.9 14.3 5.2 5.7 70.9
NV, NM 6.8 9.9 5.0 5.7 72.7
NY 16.9 33.4 1.6 5.3 42.8
AR, LA, OK 7.6 14.3 5.6 5.1 67.4
MO 6.0 8.2 7.6 4.9 73.3
CT, ME, NH, RI, VT 13.9 16.5 1.5 4.8 63.3
KS, NE 6.3 13.6 6.8 4.6 68.7
ID, MT, UT, WY 2.2 5.8 7.2 4.6 80.3
AL, KY, MS 4.0 4.3 9.1 4.2 78.4
TX 5.4 16.8 7.2 4.1 66.5
AK, HI, OR, WA 7.9 20.0 8.1 4.0 59.9
FL 5.9 16.4 14.0 3.7 60.0
AZ 1.1 16.7 14.8 3.4 64.0
NC, SC 6.4 15.3 13.6 3.2 61.4
GA 3.6 13.4 3.7 2.9 76.5
MI 5.3 15.3 7.3 2.8 69.4
TN 4.5 18.4 9.7 1.9 65.5
IL NA NA NA NA NA

Subsetting rows

Next we take a quick look at just Michigan to demonstrate the use of filter().

> recs_type_state %>% filter( State == 'MI' )
## # A tibble: 1 x 6
## # Groups:   State [1]
##   State ApartmentFew ApartmentMany MobileHome SingleFamilyAttached
##   <chr>        <dbl>         <dbl>      <dbl>                <dbl>
## 1    MI     5.258925      15.27065   7.270111              2.77751
## # ... with 1 more variables: SingleFamilyDetached <dbl>

We might also want to find all states with at least 25% of people living in apartments,

> recs_type_state %>% filter( {ApartmentFew + ApartmentMany} >= 25 )
## # A tibble: 6 x 6
## # Groups:   State [6]
##                State ApartmentFew ApartmentMany MobileHome
##                <chr>        <dbl>         <dbl>      <dbl>
## 1                 CA     8.466971      23.50956   3.226220
## 2                 NJ    11.303660      14.53719   1.880313
## 3                 MA    24.400382      21.14616   1.584660
## 4                 NY    16.927978      33.41180   1.614217
## 5 CT, ME, NH, RI, VT    13.933993      16.52327   1.489281
## 6     AK, HI, OR, WA     7.926135      20.01287   8.128232
## # ... with 2 more variables: SingleFamilyAttached <dbl>,
## #   SingleFamilyDetached <dbl>

Caution

If you look at the bottom of the home-types by state table, you’ll notice Illinois has NA rather than values. What happened?

Looking at the state/type summaries we will notice that there is no mobile home type for Illinois.

> recs_type_state_sum %>% filter( State == 'IL' )
## # A tibble: 4 x 3
## # Groups:   State [1]
##   State                 Type     Homes
##   <chr>                <chr>     <dbl>
## 1    IL         ApartmentFew  546751.7
## 2    IL        ApartmentMany  944531.9
## 3    IL SingleFamilyAttached  144198.4
## 4    IL SingleFamilyDetached 3121970.1

In our previous call to mutate, we didn’t guard against this. We can do something safer to get more robust results. One way to do this would be to use the rowwise function which ensures factorized R functions such as sum are interpreted to act row by row. This illustrated below.

Pipes %>%

You may have notice an odd symbol %>% above. This “pipe” operator from the magritr package can be used to pass a data frame implicitly to dplyr functions. The utility of this lies in allowing us to string dplyr functions together in such a way that they can be read in the same order as they are performed rather from the inside out. This also prevents us from needing to keep track of so many intermediate objects. Together these two properties make our code cleaner and more literate – it easier to understand and debug.

Here is the entire example as a single piped chain,

> # Comptue the proportion of each home type in each reportable domain: ---------
> home_type_prop = recs_tib %>% 
+   transmute(State=REPORTABLE_DOMAIN, Type=TYPEHUQ, Weight = NWEIGHT) %>%  
+   mutate(State=decode_all_states(State), Type=decode_all_house_types(Type)) %>%
+   group_by(State, Type) %>%
+   summarize(Homes=sum(Weight)) %>%
+   tidyr::spread(Type,Homes) %>%
+   mutate(
+     Total = ApartmentFew + ApartmentMany + MobileHome + 
+            SingleFamilyAttached + SingleFamilyDetached,
+     ApartmentFew         = 100*ApartmentFew/Total,
+     ApartmentMany        = 100*ApartmentMany/Total,
+     MobileHome           = 100*MobileHome/Total,
+     SingleFamilyAttached = 100*SingleFamilyAttached/Total,
+     SingleFamilyDetached = 100*SingleFamilyDetached/Total
+   ) %>%
+   select(-Total) %>%
+   arrange( desc(SingleFamilyAttached) )
> # Change options for display
>  options( list('digits'=2, dplyr.width=Inf) )
>  home_type_prop
## # A tibble: 27 x 6
## # Groups:   State [27]
##             State ApartmentFew ApartmentMany MobileHome SingleFamilyAttached SingleFamilyDetached
##             <chr>        <dbl>         <dbl>      <dbl>                <dbl>                <dbl>
##  1             PA         11.0            11        5.3                 19.4                   53
##  2 DE, DC, MD, WV          3.2            19        7.4                 17.3                   54
##  3             CO          7.7            14        5.1                 10.7                   63
##  4             VA          4.1            16       10.6                  7.3                   62
##  5             CA          8.5            24        3.2                  7.0                   58
##  6         IN, OH          7.6            12        3.3                  6.3                   70
##  7             NJ         11.3            15        1.9                  5.9                   66
##  8             WI         11.3            12        2.2                  5.9                   68
##  9             MA         24.4            21        1.6                  5.7                   47
## 10 IA, MN, ND, SD          3.9            14        5.2                  5.7                   71
## # ... with 17 more rows

Here is the same example using rowwise to avoid NA from missing values in our sum (assuming that is the behavior we want.)

> # Comptue the proportion of each home type in each reportable domain: ----------
> 
> # with missing values assumed 0
> home_type_prop_rowwise = recs_tib %>% 
+   transmute(State=REPORTABLE_DOMAIN, Type=TYPEHUQ, Weight = NWEIGHT) %>%  
+   mutate(State=decode_all_states(State), Type=decode_all_house_types(Type)) %>%
+   group_by(State, Type) %>%
+   summarize(Homes=sum(Weight)) %>%
+   tidyr::spread(Type, Homes) %>%
+   rowwise() %>%  # to guard against NA when summing "Total"
+   mutate(
+     Total = sum(ApartmentFew, ApartmentMany, MobileHome,
+               SingleFamilyAttached, SingleFamilyDetached, na.rm=TRUE),
+          ApartmentFew         = 100*ApartmentFew/Total,
+          ApartmentMany        = 100*ApartmentMany/Total,
+          MobileHome           = 100*MobileHome/Total,
+          SingleFamilyAttached = 100*SingleFamilyAttached/Total,
+          SingleFamilyDetached = 100*SingleFamilyDetached/Total
+   ) %>%
+   select(-Total) %>%
+   arrange( desc(SingleFamilyAttached) )
> 
> home_type_prop_rowwise %>% filter( State == "IL" )
## # A tibble: 1 x 6
##   State ApartmentFew ApartmentMany MobileHome SingleFamilyAttached SingleFamilyDetached
##   <chr>        <dbl>         <dbl>      <dbl>                <dbl>                <dbl>
## 1    IL           11            20         NA                    3                   66

Extended Case Study

An extended version of the case study above is available as case_study_1_RECS.R in the Stats506_F18 repo. The extended example adds three things to the above:

  1. It more closely resembles my own working R style rather than the more pedantic example here.

  2. It demonstrates how to approach the standard error computations with replicate weights using dplyr operations.

  3. It has a basic example of using ggplot2.