Introduction to the tidyverse

Reading
Style Note
Data transformation with dplyr
Case Study: Home types by state
Graphics with ggplot2

Reading

Please read the introduction to dplyr and work through the examples there.
Sections 4 (“Pipes”) and 5 (“ggplot”) at https://style.tidyverse.org/
Optional readings from R for Data Science recommended for those new to the tidyverse:
- Data Transformation,
- Tibbles,
- Pipes,
- Visualization.

Style Note

Despite what is said about “pipes” in the style guide do not assign at the end of a pipe -> result. Always assign right to left at the start of the pipe result = ... %>%.

Data transformation with `dplyr`

When working with data it is often necessary to make transformations in order to:

define literate, easy to remember column names,
choose a subset of cases,
create derived variables based on one or more columns,
compute summary statistics, often by group,
combine information from multiple sources.

There are means for doing all of these things in the base R packages. However, many serious R users prefer to use the data transformation syntax provided in the package dplyr because it makes for code that is:

literate: easier for non-R users to make sense of and less prone to bugs,
portable: by abstracting over how the data is stored, the dplyr front end can be employed with a number of “back-ends”, e.g. data.frames, tibbles, SQL data bases, or data.tables.

Another package commonly used for manipulating, managing, transforming, and tidying data is data.table which we will learn more about later in the semester.

Case Study: Home types by state

In this case study, we will use the 2009 RECS data to demonstrate some of the functionality of dplyr. We’ll approach this as a case study in which we set out to answer the question:

Which state has the highest proportion of single-family attached homes?

A similar case study is also presented in the recorded lectures. There, however, it is done in a style closer to my everyday coding practice than to the more pedantic version here.

Data cleaning

In this section we’ll discuss the dplyr functions:

select(): choose a subset of variables
mutate(): transform variables or create new ones
transmute(): transform or select variables and drop others.

Let’s begin by creating a clean and tidy data frame with the necessary variables. We’ll need to keep the two variables of interest and the sample weight. Later we will also make use of the replicate weights to compute standard errors.

Here we read in the data, either from a local file or directly from the web.

# 79:  ------------------------------------------------------------------------
# libraries: ------------------------------------------------------------------
library(tidyverse)

# get data: -------------------------------------------------------------------
url = paste0(
  "https://www.eia.gov/consumption/residential/data/2009/csv/",
  "recs2009_public.csv"
)
file = './recs2009_public.csv'
if (file.exists(file)){
  recs_tib = readr::read_delim(file, delim=',')  
} else{
  recs_tib = readr::read_delim(url, delim = ',' )
  readr::write_delim(recs_tib, path = file, delim = ',')
}

Next, we’ll use transmute() to drop all but a subset of variables.You could use select() here as well.

## Subset recs to home-type, state, and sample weight: ------------------------
recs_homes = transmute(recs_tib, 
                       state = REPORTABLE_DOMAIN,
                       type = TYPEHUQ, 
                       weight = NWEIGHT)
recs_homes

## # A tibble: 12,083 x 3
##    state  type weight
##    <dbl> <dbl>  <dbl>
##  1    12     2  2472.
##  2    26     2  8599.
##  3     1     5  8970.
##  4     7     2 18004.
##  5     1     3  6000.
##  6    10     2  4232.
##  7     3     2  7862.
##  8    17     2  6297.
##  9     5     3 12157.
## 10    12     2  3242.
## # … with 12,073 more rows

Next, we clean up the values to something more easily interpreted. We will write functions for the mappings so we can reuse them later. The functions will begin with the decode_ prefix so we can remember them as a group. (These functions are more pedantic; the function in the lecture is more useful)

The values used here come from the code book available here.

# Functions to decode states or reportable_domains from RECS: -----------------

## states
decode_state = function(x) {
  # Decodes numeric codes for REPORTABLE_DOMAIN to state labels
  # Inputs: 
  #   x - a single numeric code for a REPORTABLE_DOMAIN
  # Returns: The state label
  
  # Throw an error if x isn't numeric
  if( !is.numeric(x) ) {
    stop('decode_states expects numeric input indexed from 1!')
  } 
  
  switch(x,
      "CT, ME, NH, RI, VT", "MA", "NY", "NJ", "PA", "IL", "IN, OH", "MI", "WI",
      "IA, MN, ND, SD", "KS, NE", "MO", "VA", "DE, DC, MD, WV", "GA",
      "NC, SC" , "FL", "AL, KY, MS", "TN", "AR, LA, OK",
      "TX", "CO", "ID, MT, UT, WY", "AZ", "NV, NM",
      "CA", "AK, HI, OR, WA"
    )
}

decode_all_states = function(x) {
  # Vectorizes decode_state above
  # Inputs: 
  #  x - a vector of integer-valued reportable_domains
  # Output: A vector of state labels or a "list" if some are unmatched.
  sapply(x, decode_state)
}

## Housing types
decode_house_type = function(x) {
  # Decodes numeric codes for housing type
  # Inputs: 
  #   x - a single numeric code for a housing type
  # Output: The label for a housing type
  
  # Error checking
  if ( !is.numeric(x) ) {
    stop('decode_house_type expects numeric input indexed from 1!')
  }
  
  switch(x,
         'mobile_home',
         'single_family_detached',
         'single_family_attached',
         'apartment_few',
         'apartment_many'
         )
}

decode_all_house_types = function(x) {
  # Vectorizes decode_house_type above
  # Inputs: 
  #  x - a vector of integer-valued reportable_domains
  # Outputs: A vector of state labels or a "list" if some are unmatched.
  
  sapply(x, decode_house_type)
}

Next we mutate() our data frame, replacing our numeric codes with decoded values.

recs_homes = mutate(recs_homes, 
                    state = decode_all_states(state),
                    type = decode_all_house_types(type)
             )
recs_homes

## # A tibble: 12,083 x 3
##    state              type                   weight
##    <chr>              <chr>                   <dbl>
##  1 MO                 single_family_detached  2472.
##  2 CA                 single_family_detached  8599.
##  3 CT, ME, NH, RI, VT apartment_many          8970.
##  4 IN, OH             single_family_detached 18004.
##  5 CT, ME, NH, RI, VT single_family_attached  6000.
##  6 IA, MN, ND, SD     single_family_detached  4232.
##  7 NY                 single_family_detached  7862.
##  8 FL                 single_family_detached  6297.
##  9 PA                 single_family_attached 12157.
## 10 MO                 single_family_detached  3242.
## # … with 12,073 more rows

Exercise

Replace the decode_all_* functions above with a call to factor() within the mutate() function.
Optional challenge. Write a script that reads recs_2009_public_codebook.xlsx and creates a list in which each entry contains a vector of labels for the numeric values of all appropriate recs_2009 variables. (Or do this for recs_2015 data instead.)
Optional. Modify your answer to 1 above to use the list created in 2.

Aggregrate by group

In this section we’ll cover the dplyr() functions:

group_by(): to create groups (i.e. “split”),
summarize(): to aggregate (combine) by group .

Recall that we are interested in computing the proportion of each housing type by state or reportable domain. We can do this using the split-apply-combine paradigm – in dplyr and other languages this is accomplished using a group operation with an aggregation function. In dplyr we use group_by() to group and summarize() to aggregate.

First the grouping:

# Aggregate weights by state and home type: ----------------------------------
recs_homes_group_states = group_by(recs_homes, state, type)
recs_homes_group_states

## # A tibble: 12,083 x 3
## # Groups:   state, type [134]
##    state              type                   weight
##    <chr>              <chr>                   <dbl>
##  1 MO                 single_family_detached  2472.
##  2 CA                 single_family_detached  8599.
##  3 CT, ME, NH, RI, VT apartment_many          8970.
##  4 IN, OH             single_family_detached 18004.
##  5 CT, ME, NH, RI, VT single_family_attached  6000.
##  6 IA, MN, ND, SD     single_family_detached  4232.
##  7 NY                 single_family_detached  7862.
##  8 FL                 single_family_detached  6297.
##  9 PA                 single_family_attached 12157.
## 10 MO                 single_family_detached  3242.
## # … with 12,073 more rows

Now the aggregation:

recs_type_state_sum = summarize(recs_homes_group_states, homes = sum(weight) )

## `summarise()` regrouping output by 'state' (override with `.groups` argument)

recs_type_state_sum

## # A tibble: 134 x 3
## # Groups:   state [27]
##    state          type                      homes
##    <chr>          <chr>                     <dbl>
##  1 AK, HI, OR, WA apartment_few           374743.
##  2 AK, HI, OR, WA apartment_many          946196.
##  3 AK, HI, OR, WA mobile_home             384298.
##  4 AK, HI, OR, WA single_family_attached  189645.
##  5 AK, HI, OR, WA single_family_detached 2833057.
##  6 AL, KY, MS     apartment_few           183983.
##  7 AL, KY, MS     apartment_many          201344.
##  8 AL, KY, MS     mobile_home             422086.
##  9 AL, KY, MS     single_family_attached  192720.
## 10 AL, KY, MS     single_family_detached 3637141.
## # … with 124 more rows

Pay close attention to the change in grouping. When summarize() is called we lose the most nested group with the default .groups = "drop_last".

Style Note

In developing your own code, you generally don’t want to print out the results after every step. If you do, you should comment these lines out or remove them in the final version so it can be executed without unnecessary printing to the console.

Reshaping and formatting results for presentation

In this section we will use the functions:

tidyr::pivot_wider(): reshape data from a longer to a wider format,
dplyr::select(): choose a subset of variables,
dplyr::arrange(): reorder the rows in data frame.

To proceed, let’s reshape the data to have one row per state. We can do this using the tidyr::pivot_wider() function.

# Reshape to a wide format for clear computation of proportions: --------------
recs_type_state = 
  tidyr::pivot_wider(
    recs_type_state_sum, 
    names_from = type, 
    values_from = homes
  )

# Compute proportions: --------------------------------------------------------
recs_type_state = 
  mutate(recs_type_state,
         Total = apartment_few + apartment_many + mobile_home +
                  single_family_attached +single_family_detached,
         apartment_few = 100 * apartment_few / Total,
         apartment_many = 100 * apartment_many / Total,
         mobile_home = 100 * mobile_home / Total,
         single_family_attached = 100 * single_family_attached / Total,
        single_family_detached = 100 *single_family_detached / Total
  )

# Drop total: -----------------------------------------------------------------
recs_type_state = select(recs_type_state, -Total)

# Sort by Single Family Attached: ---------------------------------------------
recs_type_state = arrange(recs_type_state, single_family_attached)

## Use desc() to sort in descending order
recs_type_state = arrange(recs_type_state, desc(single_family_attached))
recs_type_state

## # A tibble: 27 x 6
## # Groups:   state [27]
##    state apartment_few apartment_many mobile_home single_family_a…
##    <chr>         <dbl>          <dbl>       <dbl>            <dbl>
##  1 PA            11.0            11.5        5.29            19.4 
##  2 DE, …          3.21           18.5        7.42            17.3 
##  3 CO             7.71           13.6        5.10            10.7 
##  4 VA             4.09           16.0       10.6              7.29
##  5 CA             8.47           23.5        3.23             7.01
##  6 IN, …          7.56           12.5        3.30             6.29
##  7 NJ            11.3            14.5        1.88             5.93
##  8 WI            11.3            12.4        2.19             5.85
##  9 MA            24.4            21.1        1.58             5.70
## 10 IA, …          3.95           14.3        5.22             5.68
## # … with 17 more rows, and 1 more variable: single_family_detached <dbl>

A nicely formatted table would be rounded to a whole or tenth of a percent. In general, it is better to round only for display and not in the data itself to prevent rounding error from accumulating.

# Output a markdown table: ----------------------------------------------------
cap_title = '**Table 1.** *Proprtion of home types by state(s).*'
cap_text0 = 'Each row shows the home type distribution for one or more states.'
cap_text1 = 
  'Rows are sorted by the proportion of single family attached homes.'
cap = paste(cap_title, cap_text0, cap_text1)

cols = c('state(s)', 'Apartment (2-5 Units)', 'Apartment (5+ units)', 
         'Mobile Home', 'Single Family Attached', 'Single Family Detached')

knitr::kable(recs_type_state, digits = 1, caption = cap, col.names = cols)

**Table 1.** *Proprtion of home types by state(s).* Each row shows the home type distribution for one or more states. Rows are sorted by the proportion of single family attached homes.
state(s)	Apartment (2-5 Units)	Apartment (5+ units)	Mobile Home	Single Family Attached	Single Family Detached
PA	11.0	11.5	5.3	19.4	52.8
DE, DC, MD, WV	3.2	18.5	7.4	17.3	53.6
CO	7.7	13.6	5.1	10.7	62.9
VA	4.1	16.0	10.6	7.3	62.0
CA	8.5	23.5	3.2	7.0	57.8
IN, OH	7.6	12.5	3.3	6.3	70.4
NJ	11.3	14.5	1.9	5.9	66.3
WI	11.3	12.4	2.2	5.9	68.3
MA	24.4	21.1	1.6	5.7	47.2
IA, MN, ND, SD	3.9	14.3	5.2	5.7	70.9
NV, NM	6.8	9.9	5.0	5.7	72.7
NY	16.9	33.4	1.6	5.3	42.8
AR, LA, OK	7.6	14.3	5.6	5.1	67.4
MO	6.0	8.2	7.6	4.9	73.3
CT, ME, NH, RI, VT	13.9	16.5	1.5	4.8	63.3
KS, NE	6.3	13.6	6.8	4.6	68.7
ID, MT, UT, WY	2.2	5.8	7.2	4.6	80.3
AL, KY, MS	4.0	4.3	9.1	4.2	78.4
TX	5.4	16.8	7.2	4.1	66.5
AK, HI, OR, WA	7.9	20.0	8.1	4.0	59.9
FL	5.9	16.4	14.0	3.7	60.0
AZ	1.1	16.7	14.8	3.4	64.0
NC, SC	6.4	15.3	13.6	3.2	61.4
GA	3.6	13.4	3.7	2.9	76.5
MI	5.3	15.3	7.3	2.8	69.4
TN	4.5	18.4	9.7	1.9	65.5
IL	NA	NA	NA	NA	NA

Subsetting rows

Next we take a quick look at just Michigan to demonstrate the use of filter(). Observe that we pass a logical vector of the same length as the data to filter. Here, that logical vector is created by testing when the variable state is equal to “MI”.

recs_type_state %>% filter(state == 'MI' )

## # A tibble: 1 x 6
## # Groups:   state [1]
##   state apartment_few apartment_many mobile_home single_family_a…
##   <chr>         <dbl>          <dbl>       <dbl>            <dbl>
## 1 MI             5.26           15.3        7.27             2.78
## # … with 1 more variable: single_family_detached <dbl>

We might also want to find all states with at least 25% of people living in apartments,

recs_type_state %>% filter( {apartment_few + apartment_many} >= 25 )

## # A tibble: 6 x 6
## # Groups:   state [6]
##   state apartment_few apartment_many mobile_home single_family_a…
##   <chr>         <dbl>          <dbl>       <dbl>            <dbl>
## 1 CA             8.47           23.5        3.23             7.01
## 2 NJ            11.3            14.5        1.88             5.93
## 3 MA            24.4            21.1        1.58             5.70
## 4 NY            16.9            33.4        1.61             5.29
## 5 CT, …         13.9            16.5        1.49             4.75
## 6 AK, …          7.93           20.0        8.13             4.01
## # … with 1 more variable: single_family_detached <dbl>

Caution

If you look at the bottom of the home-types by state table, you’ll notice Illinois has NA rather than values. What happened?

Looking at the state/type summaries we will notice that there is no mobile home type for Illinois.

recs_type_state_sum %>% filter( state == 'IL' )

## # A tibble: 4 x 3
## # Groups:   state [1]
##   state type                      homes
##   <chr> <chr>                     <dbl>
## 1 IL    apartment_few           546752.
## 2 IL    apartment_many          944532.
## 3 IL    single_family_attached  144198.
## 4 IL    single_family_detached 3121970.

In our previous call to mutate, we didn’t guard against this.
We can do something safer to get more robust results. One way to do this would be to use the rowwise() function which ensures vectorized R functions such as sum are interpreted to act row by row. This is illustrated below. A better solution can be found in the lecture version of the case study – most operations are done more robustly in a longer format, though you may need to use tidyr::complete() to handle the implicit missing value made explicit here by the transformation to a wider format.

Pipes `%>%`

You may have noticed an odd symbol %>% above. This “pipe” operator from the dplyr package (adopted from the original in magritr) can be used to pass a data frame implicitly to dplyr functions. The utility of this lies in allowing us to string dplyr functions together in such a way that they can be read in the same order as they are performed rather from the inside out. This also prevents us from needing to keep track of as many intermediate objects. Together these two properties make our code cleaner and more literate – easier to understand and to debug.

Here is the entire example as a single piped chain,

# Compute the proportion of each home type in each reportable domain: ---------
home_type_prop = recs_tib %>% 
  transmute(state = REPORTABLE_DOMAIN, type = TYPEHUQ, weight = NWEIGHT) %>%  
  mutate(
    state = decode_all_states(state), 
    type = decode_all_house_types(type)
  ) %>%
  group_by(state, type) %>%
  summarize(homes = sum(weight), .groups = 'drop_last') %>%
  tidyr::pivot_wider(names_from = type, values_from = homes) %>%
  mutate(
    Total = apartment_few + apartment_many + mobile_home + 
            single_family_attached + single_family_detached,
    apartment_few = 100 * apartment_few / Total,
    apartment_many = 100* apartment_many / Total,
    mobile_home = 100 * mobile_home / Total,
    single_family_attached = 100 * single_family_attached / Total,
   single_family_detached = 100 *single_family_detached / Total
  ) %>%
  select(-Total) %>%
  arrange( desc(single_family_attached) )

# Change options for display
 options( list('digits' = 2, dplyr.width = Inf) )
 home_type_prop

## # A tibble: 27 x 6
## # Groups:   state [27]
##    state          apartment_few apartment_many mobile_home
##    <chr>                  <dbl>          <dbl>       <dbl>
##  1 PA                     11.0            11.5        5.29
##  2 DE, DC, MD, WV          3.21           18.5        7.42
##  3 CO                      7.71           13.6        5.10
##  4 VA                      4.09           16.0       10.6 
##  5 CA                      8.47           23.5        3.23
##  6 IN, OH                  7.56           12.5        3.30
##  7 NJ                     11.3            14.5        1.88
##  8 WI                     11.3            12.4        2.19
##  9 MA                     24.4            21.1        1.58
## 10 IA, MN, ND, SD          3.95           14.3        5.22
##    single_family_attached single_family_detached
##                     <dbl>                  <dbl>
##  1                  19.4                    52.8
##  2                  17.3                    53.6
##  3                  10.7                    62.9
##  4                   7.29                   62.0
##  5                   7.01                   57.8
##  6                   6.29                   70.4
##  7                   5.93                   66.3
##  8                   5.85                   68.3
##  9                   5.70                   47.2
## 10                   5.68                   70.9
## # … with 17 more rows

Here is the same example using tidyr::complete() to avoid NA from missing values in our sum (assuming that is the behavior we want.)

# Compute the proportion of each home type in each reportable domain: ---------

# with missing values assumed 0
home_type_prop_complete = recs_tib %>% 
  transmute(state = REPORTABLE_DOMAIN, type = TYPEHUQ, weight = NWEIGHT) %>%  
  mutate(
    state = decode_all_states(state),
    type = decode_all_house_types(type)
  ) %>%
  group_by(state, type) %>%
  summarize(homes = sum(weight), .groups = 'drop') %>% # don't complete by group
  tidyr::complete(state, type, fill = list(homes = 0)) %>%
  #filter(state == 'IL')
  pivot_wider(names_from = type, values_from = homes) %>%
  mutate(
    Total = apartment_few + apartment_many + mobile_home +
       single_family_attached + single_family_detached,
    apartment_few = 100 * apartment_few / Total,
    apartment_many = 100* apartment_many / Total,
    mobile_home = 100 * mobile_home / Total,
    single_family_attached = 100 * single_family_attached / Total,
    single_family_detached = 100 *single_family_detached / Total
  ) %>%
  select(-Total) %>%
  arrange( desc(single_family_attached) )

Alternate Case Study

An alternate case study similar to the one above is available as` at the Stats506_F20 repo.

Graphics with ggplot2

Learning ggplot2

The package ggplot2 is one of the most popular in R for producing high quality, professional, graphics. The “gg” at the start stands for the “grammar of graphics”. Much like “dplyr” part of the success of “ggplot” is that it provides a simple syntax for a number of powerful graphical concepts, but, at the same time, limits your options in order to force you to think about what you want to communicate rather than how you create the plot that occurred to you.

If you are new to ggplot2 the syntax can be frustrating and even counter intuitive at times. However, here are some (opinionated) tips for working with ggplot.

Treat ggplot2 as an imperative syntax, use this syntax do specify what you want your plot to look like and don’t think to much about how it accomplishes that.
Don’t ask ggplot to do what you can do yourself by changing the data, e.g. change a variable’s names or scale;
A corollary – pass the data you want to plot to ggplot rather than using geom_’s to compute summary statistics, e.g. use group_by() and summarize() with geom_col() rather than geom_bar().
Like all of programming, start simple and iterate, adding components one step at a time.
Read the help and pay attention to the required aesthetics.
At the console type ggplot2:: hit tab and scroll through the functions.

These are my suggestions, not style mandates.
Tip 1 is important because a ggplot is an object whose print method produces a plot as a side effect. This is unlike base R graphics where each function produce side effects that set up or add to the plotting area. New ggplot2 users (myself included once) often make the mistake of trying to use ggplot2 as a functional language.

ggplot basics

Set up a ggplot with a call to ggplot(), some data, and (usually) the x and y aesthetics, note the use of aes() to match aesthetics to columns in the data:

ggplot(home_type_prop_complete, aes(x = apartment_few, y = apartment_many))

Add (using +) the geom_’s you need for the data. I most commonly use geom_point(), geom_line(), and geom_col().

ggplot(home_type_prop_complete, aes(x = apartment_few, y = apartment_many)) +
  geom_point()

Constants do not go in the aes() call.

ggplot(home_type_prop_complete, aes(x = apartment_few, y = apartment_many)) +
  geom_point(col = 'darkblue')

Note the difference between the color and fill aesthetics:

p4a = pivot_longer(home_type_prop_complete, 
             contains('_'), 
             names_to = "Housing Type",
             values_to = "proportion"
) %>%
  ggplot( aes(x = state, y = proportion, fill = `Housing Type`) ) +
   geom_col( position = position_stack(), color = 'black' ) +
   coord_flip() 
p4a

p4b = pivot_longer(home_type_prop_complete, 
             contains('_'), 
             names_to = "Housing Type",
             values_to = "proportion"
) %>%
  ggplot( aes(x = state, y = proportion, color = `Housing Type`) ) +
   geom_col( position = position_stack()) +
   coord_flip() 
p4b

Use theme_*() for global aesthetics and x/ylab() to modify x/ylim() the axis labels or limits.

ggplot(home_type_prop_complete, aes(x = apartment_few, y = apartment_many)) +
  geom_point(col = 'darkblue') +
  theme_bw() +
  ylab("% Large Apartment Complexes") + 
  xlab("% Small Apartment Complexes") +
  xlim(c(0, 35)) + 
  ylim(c(0, 35)) +
  geom_abline(slope = 1)

Use scales_*_*() to change values mapped to aesthetics.

p4a + 
  scale_fill_brewer(palette = "OrRd")

Introduction to the tidyverse

Statistics 506, Fall 2020

Reading

Style Note

Data transformation with dplyr

Case Study: Home types by state

Data cleaning

Exercise

Aggregrate by group

Style Note

Reshaping and formatting results for presentation

Subsetting rows

Caution

Pipes %>%

Alternate Case Study

Graphics with ggplot2

Learning ggplot2

ggplot basics

Data transformation with `dplyr`

Pipes `%>%`