In this document we will use the RECS data as a case study for introducing “split, apply, combine” and several R packages from the tidyverse.
We will take a brief look at readr
and then illustrate some of the functionality of dplyr
and tidyr
. You should read the following introduction to dplyr and work through the examples there on your own.
The readr package is one of several alternatives to read.csv()
and read.table
in the base R package utils
. Let’s do a quick comparison of the two using system.time()
.
## load (install) readr
#install.packages(readr)
library(readr)
# setwd('~/Stats506/')
## Base R read.csv is a wrapper around read.table
system.time({
recs_df = read.table('./recs2009_public.csv',
stringsAsFactors = FALSE, header=TRUE,
sep=',')
}
)
## user system elapsed
## 5.535 0.173 5.863
class(recs_df)
## [1] "data.frame"
## readr
system.time({
suppressMessages({
recs_tib = read_delim('./recs2009_public.csv', delim=',',
col_names=TRUE)
})
})
## user system elapsed
## 1.008 0.042 1.071
class(recs_tib)
## [1] "tbl_df" "tbl" "data.frame"
In the example above, note the use of braces {}
to encapsulate an expression with assignment within the call to system.time()
.
One of the most frequently used patterns in data analysis and management is performing some operation repeatedly over subsets of data. This is often called “split-apply-combine” and is an approach emphasized by the R package dplyr. You should read the introduction to dplyr and work through the examples there.
We will use the RECS data imported above to demonstrate some of the functionality of dplyr. Let’s set out to answer the question, which state has the highest proportion of single-family attached homes?
Let’s begin by creating a clean data frame with the necessary variables. We’ll need to keep the two variables of interest and the sample weight.
## Subset recs to home-type, state, and sample weight.
recs_homes = transmute(recs_tib, State=REPORTABLE_DOMAIN, Type=TYPEHUQ,
Weight = NWEIGHT)
recs_homes
## # A tibble: 12,083 x 3
## State Type Weight
## <dbl> <dbl> <dbl>
## 1 12 2 2471.68
## 2 26 2 8599.17
## 3 1 5 8969.92
## 4 7 2 18003.64
## 5 1 3 5999.61
## 6 10 2 4232.49
## 7 3 2 7862.34
## 8 17 2 6297.04
## 9 5 3 12156.72
## 10 12 2 3242.22
## # ... with 12,073 more rows
Next, let’s clean up the values to something more easily interpreted. We will write functions for the mappings so we can reuse them later. The functions will begin with the decode_
prefix so we can remember them as a group.
# Functions to decode states or reportable_domains from RECS.
# decode_state returns a single value
decode_state = function(x){
if(!is.numeric(x)) stop('decode_states expects numeric input indexed from 1!')
switch(x,
"CT, ME, NH, RI, VT", "MA", "NY", "NJ", "PA", "IL", "IN, OH", "MI", "WI",
"IA, MN, ND, SD", "KS, NE", "MO", "VA", "DE, DC, MD, WV", "GA",
"NC, SC" , "FL", "AL, KY, MS", "TN", "AR, LA, OK",
"TX", "CO", "ID, MT, UT, WY", "AZ", "NV, NM",
"CA", "AK, HI, OR, WA"
)
}
# this is a wrapper to an apply call
decode_all_states = function(x){
sapply(x, decode_state)
}
# Functions to decode housing type
decode_house_type = function(x){
if(!is.numeric(x)) stop('decode_house_type expects numeric input indexed from 1!')
switch(x,
'MobileHome',
'SingleFamilyDetached',
'SingleFamilyAttached',
'ApartmentFew',
'ApartmentMany'
)
}
decode_all_house_types = function(x){
sapply(x,decode_house_type)
}
Next we mutate
our data frame, replacing our numeric codes with decoded values.
recs_homes = mutate(recs_homes,
State=decode_all_states(State),
Type=decode_all_house_types(Type))
recs_homes
## # A tibble: 12,083 x 3
## State Type Weight
## <chr> <chr> <dbl>
## 1 MO SingleFamilyDetached 2471.68
## 2 CA SingleFamilyDetached 8599.17
## 3 CT, ME, NH, RI, VT ApartmentMany 8969.92
## 4 IN, OH SingleFamilyDetached 18003.64
## 5 CT, ME, NH, RI, VT SingleFamilyAttached 5999.61
## 6 IA, MN, ND, SD SingleFamilyDetached 4232.49
## 7 NY SingleFamilyDetached 7862.34
## 8 FL SingleFamilyDetached 6297.04
## 9 PA SingleFamilyAttached 12156.72
## 10 MO SingleFamilyDetached 3242.22
## # ... with 12,073 more rows
Recall that we are interested in computing the proportion of each housing type by state. We can do this using the split, apply, combine, paradigm – in dplyr and other languages this is accomplished using a group operation with an aggregation function. In dplyr we use group_by
to group and summarize
to aggregate.
First the grouping,
recs_homes_group_states = group_by(recs_homes, State, Type)
recs_homes_group_states
## # A tibble: 12,083 x 3
## # Groups: State, Type [134]
## State Type Weight
## <chr> <chr> <dbl>
## 1 MO SingleFamilyDetached 2471.68
## 2 CA SingleFamilyDetached 8599.17
## 3 CT, ME, NH, RI, VT ApartmentMany 8969.92
## 4 IN, OH SingleFamilyDetached 18003.64
## 5 CT, ME, NH, RI, VT SingleFamilyAttached 5999.61
## 6 IA, MN, ND, SD SingleFamilyDetached 4232.49
## 7 NY SingleFamilyDetached 7862.34
## 8 FL SingleFamilyDetached 6297.04
## 9 PA SingleFamilyAttached 12156.72
## 10 MO SingleFamilyDetached 3242.22
## # ... with 12,073 more rows
and now the aggregation
recs_type_state_sum = summarize(recs_homes_group_states, Homes=sum(Weight))
recs_type_state_sum
## # A tibble: 134 x 3
## # Groups: State [?]
## State Type Homes
## <chr> <chr> <dbl>
## 1 AK, HI, OR, WA ApartmentFew 374742.9
## 2 AK, HI, OR, WA ApartmentMany 946196.4
## 3 AK, HI, OR, WA MobileHome 384298.0
## 4 AK, HI, OR, WA SingleFamilyAttached 189645.3
## 5 AK, HI, OR, WA SingleFamilyDetached 2833057.4
## 6 AL, KY, MS ApartmentFew 183982.7
## 7 AL, KY, MS ApartmentMany 201343.6
## 8 AL, KY, MS MobileHome 422086.3
## 9 AL, KY, MS SingleFamilyAttached 192720.3
## 10 AL, KY, MS SingleFamilyDetached 3637140.9
## # ... with 124 more rows
Pay close attention to the change in grouping. When summarize
is called we lose the most nested group.
To proceed, let’s reshape the data to have one row per state. We can do this using the tidyr::spread
function.
# Reshape to wide format for easy computation of proportions
recs_type_state = tidyr::spread(recs_type_state_sum,Type,Homes)
# Compute proportions
recs_type_state = mutate(recs_type_state,
Total = ApartmentFew + ApartmentMany + MobileHome +
SingleFamilyAttached + SingleFamilyDetached,
ApartmentFew = 100*ApartmentFew/Total,
ApartmentMany = 100*ApartmentMany/Total,
MobileHome = 100*MobileHome/Total,
SingleFamilyAttached = 100*SingleFamilyAttached/Total,
SingleFamilyDetached = 100*SingleFamilyDetached/Total
)
# Drop total
recs_type_state = select(recs_type_state,-Total)
# Sort by Single Family Attached
recs_type_state = arrange(recs_type_state, SingleFamilyAttached)
# Use desc() to sort in descending order
recs_type_state = arrange(recs_type_state, desc(SingleFamilyAttached))
recs_type_state
## # A tibble: 27 x 6
## # Groups: State [27]
## State ApartmentFew ApartmentMany MobileHome
## <chr> <dbl> <dbl> <dbl>
## 1 PA 10.965041 11.47363 5.287897
## 2 DE, DC, MD, WV 3.207400 18.54110 7.422471
## 3 CO 7.706639 13.63568 5.099064
## 4 VA 4.092552 16.01914 10.639862
## 5 CA 8.466971 23.50956 3.226220
## 6 IN, OH 7.556989 12.47549 3.304245
## 7 NJ 11.303660 14.53719 1.880313
## 8 WI 11.340003 12.36640 2.186126
## 9 MA 24.400382 21.14616 1.584660
## 10 IA, MN, ND, SD 3.946439 14.28259 5.224894
## # ... with 17 more rows, and 2 more variables: SingleFamilyAttached <dbl>,
## # SingleFamilyDetached <dbl>
A nicely formatted table would be rounded to a whole or tenth of a percent. In general, it is better to round only for display and not in the data itself to prevent rounding error from accumulating.
knitr::kable(recs_type_state,digits=1,caption='Proprtion of home types by State(s).')
State | ApartmentFew | ApartmentMany | MobileHome | SingleFamilyAttached | SingleFamilyDetached |
---|---|---|---|---|---|
PA | 11.0 | 11.5 | 5.3 | 19.4 | 52.8 |
DE, DC, MD, WV | 3.2 | 18.5 | 7.4 | 17.3 | 53.6 |
CO | 7.7 | 13.6 | 5.1 | 10.7 | 62.9 |
VA | 4.1 | 16.0 | 10.6 | 7.3 | 62.0 |
CA | 8.5 | 23.5 | 3.2 | 7.0 | 57.8 |
IN, OH | 7.6 | 12.5 | 3.3 | 6.3 | 70.4 |
NJ | 11.3 | 14.5 | 1.9 | 5.9 | 66.3 |
WI | 11.3 | 12.4 | 2.2 | 5.9 | 68.3 |
MA | 24.4 | 21.1 | 1.6 | 5.7 | 47.2 |
IA, MN, ND, SD | 3.9 | 14.3 | 5.2 | 5.7 | 70.9 |
NV, NM | 6.8 | 9.9 | 5.0 | 5.7 | 72.7 |
NY | 16.9 | 33.4 | 1.6 | 5.3 | 42.8 |
AR, LA, OK | 7.6 | 14.3 | 5.6 | 5.1 | 67.4 |
MO | 6.0 | 8.2 | 7.6 | 4.9 | 73.3 |
CT, ME, NH, RI, VT | 13.9 | 16.5 | 1.5 | 4.8 | 63.3 |
KS, NE | 6.3 | 13.6 | 6.8 | 4.6 | 68.7 |
ID, MT, UT, WY | 2.2 | 5.8 | 7.2 | 4.6 | 80.3 |
AL, KY, MS | 4.0 | 4.3 | 9.1 | 4.2 | 78.4 |
TX | 5.4 | 16.8 | 7.2 | 4.1 | 66.5 |
AK, HI, OR, WA | 7.9 | 20.0 | 8.1 | 4.0 | 59.9 |
FL | 5.9 | 16.4 | 14.0 | 3.7 | 60.0 |
AZ | 1.1 | 16.7 | 14.8 | 3.4 | 64.0 |
NC, SC | 6.4 | 15.3 | 13.6 | 3.2 | 61.4 |
GA | 3.6 | 13.4 | 3.7 | 2.9 | 76.5 |
MI | 5.3 | 15.3 | 7.3 | 2.8 | 69.4 |
TN | 4.5 | 18.4 | 9.7 | 1.9 | 65.5 |
IL | NA | NA | NA | NA | NA |
Next we take a quick look at just Michigan to demonstrate the use of filter
.
recs_type_state %>% filter(State=='MI')
## # A tibble: 1 x 6
## # Groups: State [1]
## State ApartmentFew ApartmentMany MobileHome SingleFamilyAttached
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MI 5.258925 15.27065 7.270111 2.77751
## # ... with 1 more variables: SingleFamilyDetached <dbl>
We might also want to find all states with at least 25% of people living in apartments,
recs_type_state %>% filter( {ApartmentFew+ApartmentMany} >= 25)
## # A tibble: 6 x 6
## # Groups: State [6]
## State ApartmentFew ApartmentMany MobileHome
## <chr> <dbl> <dbl> <dbl>
## 1 CA 8.466971 23.50956 3.226220
## 2 NJ 11.303660 14.53719 1.880313
## 3 MA 24.400382 21.14616 1.584660
## 4 NY 16.927978 33.41180 1.614217
## 5 CT, ME, NH, RI, VT 13.933993 16.52327 1.489281
## 6 AK, HI, OR, WA 7.926135 20.01287 8.128232
## # ... with 2 more variables: SingleFamilyAttached <dbl>,
## # SingleFamilyDetached <dbl>
If you look at the bottom of the home-types by state table, you’ll notice Illinois has NA
rather than values. What happened?
Looking at the state/type summaries we will notice that there is no mobile home type for Illinois.
recs_type_state_sum %>% filter(State=='IL')
## # A tibble: 4 x 3
## # Groups: State [1]
## State Type Homes
## <chr> <chr> <dbl>
## 1 IL ApartmentFew 546751.7
## 2 IL ApartmentMany 944531.9
## 3 IL SingleFamilyAttached 144198.4
## 4 IL SingleFamilyDetached 3121970.1
In our previous call to mutate, we didn’t guard against this. We can do something safer to get more robust results. One way to do this would be to use the rowwise
function which ensures vectorized R functions such as sum
are interpreted to act row by row. This illustrated under the “pipes” tab.
%>%
At then end of the case study, you may have notice an odd symbol %>%
. This “pipe” operator can be used to pass a data frame implicitly to dplyr functions. The utility of this lies in allowing us to string dplyr functions together in such a way that they read in the same order as performed rather from the inside out. This also prevents us from needing to keep track of so many intermediate objects. Together these two properites make our code cleaner – making it easier to understand and debug.
Here is the entire example as a single piped chain,
home_type_prop = recs_tib %>%
transmute(State=REPORTABLE_DOMAIN, Type=TYPEHUQ, Weight = NWEIGHT) %>%
mutate(State=decode_all_states(State), Type=decode_all_house_types(Type)) %>%
group_by(State, Type) %>%
summarize(Homes=sum(Weight)) %>%
tidyr::spread(Type,Homes) %>%
mutate(Total = ApartmentFew + ApartmentMany + MobileHome +
SingleFamilyAttached + SingleFamilyDetached,
ApartmentFew = 100*ApartmentFew/Total,
ApartmentMany = 100*ApartmentMany/Total,
MobileHome = 100*MobileHome/Total,
SingleFamilyAttached = 100*SingleFamilyAttached/Total,
SingleFamilyDetached = 100*SingleFamilyDetached/Total
) %>%
select(-Total) %>%
arrange(desc(SingleFamilyAttached))
# Change options for display
options(list('digits'=2,dplyr.width=Inf))
home_type_prop
## # A tibble: 27 x 6
## # Groups: State [27]
## State ApartmentFew ApartmentMany MobileHome SingleFamilyAttached SingleFamilyDetached
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PA 11.0 11 5.3 19.4 53
## 2 DE, DC, MD, WV 3.2 19 7.4 17.3 54
## 3 CO 7.7 14 5.1 10.7 63
## 4 VA 4.1 16 10.6 7.3 62
## 5 CA 8.5 24 3.2 7.0 58
## 6 IN, OH 7.6 12 3.3 6.3 70
## 7 NJ 11.3 15 1.9 5.9 66
## 8 WI 11.3 12 2.2 5.9 68
## 9 MA 24.4 21 1.6 5.7 47
## 10 IA, MN, ND, SD 3.9 14 5.2 5.7 71
## # ... with 17 more rows
Here is the same example using rowwise
to avoid NA
from missing values in our sum (assuming that is the behavior we want.)
home_type_prop_rowwise = recs_tib %>%
transmute(State=REPORTABLE_DOMAIN, Type=TYPEHUQ, Weight = NWEIGHT) %>%
mutate(State=decode_all_states(State), Type=decode_all_house_types(Type)) %>%
group_by(State, Type) %>%
summarize(Homes=sum(Weight)) %>%
tidyr::spread(Type,Homes) %>%
rowwise() %>% # to guard against NA when summing "Total"
mutate(Total = sum(ApartmentFew, ApartmentMany, MobileHome,
SingleFamilyAttached, SingleFamilyDetached, na.rm=TRUE),
ApartmentFew = 100*ApartmentFew/Total,
ApartmentMany = 100*ApartmentMany/Total,
MobileHome = 100*MobileHome/Total,
SingleFamilyAttached = 100*SingleFamilyAttached/Total,
SingleFamilyDetached = 100*SingleFamilyDetached/Total
) %>%
select(-Total) %>%
arrange(desc(SingleFamilyAttached))
## In an R script, options() sets global options.
## These seem to be wiped between code chunks in R-Markdown.
getOption('digits')
## [1] 7
options(list('digits'=2,dplyr.width=Inf))
getOption('digits')
## [1] 2
home_type_prop_rowwise %>% filter(State=="IL")
## # A tibble: 1 x 6
## State ApartmentFew ApartmentMany MobileHome SingleFamilyAttached SingleFamilyDetached
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 IL 11 20 NA 3 66