Resampling is an umbrella term for a number of statistical techniques used to estimate quantities related to the sampling distribution of an estimator.
Techniques that fall under this umbrella include:

- the bootstrap,
- permutation testing,
- cross-validation.
The utility of all of these techniques is greatly enhanced by the ability to automate the re-sampling process and subsequent computations.
The bootstrap is a generic statistical method for assessing the accuracy of an estimator. It is frequently used to find confidence intervals when exact or asymptotic analytic formulas are unavailable or unsatisfactory.
The basic idea of the bootstrap is to build up the sampling distribution of an estimator by re-sampling the data many times. In the non-parametric bootstrap this is done by drawing \(B\) copies of the data from the empirical distribution, \(\mathbb{P}\):
\[ X_1, \dots, X_n \sim_{iid} P, \qquad \mathbb{P}(t) = \frac{1}{n} \sum_{i=1}^n 1[X_i \le t] \] In the parametric bootstrap the data are re-sampled instead from an (assumed) parametric (e.g. Gaussian) estimate of \(P\). For a comparison of the two, see this talk by Robert Tibshirani.
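As an aside (an illustration added here, not part of the original notes), a minimal sketch of the parametric bootstrap under an assumed Gaussian model might look like the following:

## parametric bootstrap sketch: assume X_i ~iid N(mu, sigma^2)
x = rnorm(50, mean = 2, sd = 1.5)    ## observed data (simulated for illustration)
mu_hat = mean(x)                     ## fitted parametric estimate of P, used in
sigma_hat = sd(x)                    ##   place of the empirical distribution
boot_means = replicate(1e3, mean(rnorm(length(x), mu_hat, sigma_hat)))
quantile(boot_means, c(.025, .975))  ## percentile interval for the mean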
Continuing with the non-parametric bootstrap, the basic idea is to draw many artificial data sets from \(\mathbb{P}\),
\[ \{(X^*_1, \dots, X_n^*)_b\}_{b=1}^B \sim_{iid} \mathbb{P} \] which is equivalent to drawing a sample of size \(n\) from the reference data \(X_1, \dots, X_n\) with replacement.
There are various methods for constructing a confidence interval for an estimator \(\bar{\theta}(X)\) using a bootstrap sample. The most common is the percentile method, in which the confidence bounds are taken as sample quantiles from the bootstrapped distribution of the estimator. We continue with an example below.
Code for this example is available at Stats506_F18 as bootstarap_iqr_example.R.
Suppose we have iid data \(X_1, \dots, X_n\) from an unknown distribution and want to construct a confidence interval for the inter-quartile range.
n = 50
x = rgamma(n, shape = 3, rate = 1)  ## simulate data from a gamma distribution
boxplot(x, las = 1, main = 'Boxplot of Data')
iqr_est = unname(diff(quantile(x, c(.25, .75))))  ## point estimate of the IQR
iqr_est
## [1] 1.905841
To form a (non-parametric) bootstrap estimate, we draw \(B\) new data sets from the original data.
B = 1e3 ## Number of bootstrap samples
boot_samples = sample(x, n*B, replace=TRUE) ## Sample with replacement
dim(boot_samples) = c(n, B) ## each column is a dataset
For each bootstrap sample we compute the statistic of interest, in this case the IQR. This operation is not easy to vectorize so we use an implicit loop instead.
boot_iqr = apply(boot_samples, 2,
                 function(b) unname(diff(quantile(b, c(.25, .75))))
)
This gives us \(B\) samples of the IQR. We use these to estimate the sampling distribution and construct a 95% confidence interval.
hist(boot_iqr, las = 1, col = 'green4', xlab = 'Estimated IQR',
     cex.axis = 1.5, cex.lab = 1.5, main = '')
boot_q = quantile(boot_iqr, c(.025, .975))
abline(v=boot_q, lty='dashed', lwd=2)
boot_ci = sprintf('%3.1f (%3.1f, %3.1f)', iqr_est, boot_q[1], boot_q[2])
cat(boot_ci,'\n')
## 1.9 (1.5, 2.6)
Here is a function that collects the above computations into a single, repeatable unit.
iqr_ci = function(y, n_boot = 1e3) {
  # percentile bootstrap CI for the IQR; here each *row* of mat is one dataset
  m = length(y)
  mat = sample(y, n_boot * m, replace = TRUE)
  dim(mat) = c(n_boot, m)
  f = function(x) diff(quantile(x, c(.25, .75)))
  v = apply(mat, 1, f)
  lcb = quantile(v, 0.025)
  ucb = quantile(v, 0.975)
  return( c(lcb, ucb) )
}
iqr_ci(x)
## 2.5% 97.5%
## 1.452141 2.669621
Below, for comparison, is a function slightly modified from Professor Shedden’s 2016 notes on vectorization.
# Return a bootstrap 95% confidence interval for the
# q^th population quantile based on an iid sample.
quant_ci = function(y, q = 0.5, n_boot = 1000) {
  m = length(y)
  mat = sample(y, n_boot * m, replace = TRUE)
  dim(mat) = c(n_boot, m)
  f = function(x) quantile(x, q)
  v = apply(mat, 1, f)
  lcb = quantile(v, 0.025)
  ucb = quantile(v, 0.975)
  return( c(lcb, ucb) )
}
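For example, applied to the simulated data above, the call below returns a bootstrap 95% confidence interval for the median (output omitted, as it varies with the random draw):

quant_ci(x, q = 0.5)  ## bootstrap 95% CI for the median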
For a gentle introduction to the underlying theory, see this blog post from Larry Wasserman.
In the linked case study, we use the bootstrap and the data.table package to estimate confidence intervals for the median weight at 21 weeks of chicks on various types of feed.
The case study is available at the course materials repo: chick_weight_data.table.R.
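While the details are in the linked script, a minimal sketch of the idea might look like the following. This is an illustration added here, not the case study itself: it assumes the ChickWeight data from the datasets package (an assumption about the case study's data) and takes the weights at the final recorded time, grouped by Diet.

library(data.table)
cw = as.data.table(datasets::ChickWeight)
last = cw[Time == max(Time)]  ## weights at the final measurement time
boot_median = function(w) median(sample(w, replace = TRUE))
last[ , {
  b = replicate(1e3, boot_median(weight))  ## bootstrap medians within each diet
  as.list(quantile(b, c(.025, .975)))      ## percentile confidence bounds
}, by = Diet]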
Another versatile statistical method that relies heavily on computation is permutation testing. Permutation tests are often used with non-iid data or when comparing non-standard statistics. The general idea is to construct a reference distribution for determining the significance of an observed test statistic by repeatedly permuting group labels.
Suppose we have \(n\) total samples from 2 groups. Then there are \(m = \binom{n}{k}\) possible group assignments, where \(k\) is the size of the first group. Note that there are actually \(n!\) permutations, but not all of them produce distinct group assignments.
How many possible groupings are there if we have \(g\) groups of size \(k_1, \dots, k_g\)?
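(One way to count: choosing the groups sequentially gives \( \binom{n}{k_1}\binom{n-k_1}{k_2}\cdots\binom{k_g}{k_g} \), which simplifies to the multinomial coefficient \( \frac{n!}{k_1! \cdots k_g!} \).)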
If \(m\) is reasonably small all possible group assignments can be considered. Let \(\pi_i\) denote the \(i^{th}\) permutation and \(\bar \theta(X)\) be the statistic of interest. Then, the permutation p-value for a two-sided test is:
\[ p = \frac{1}{m} \sum_{i=1}^m 1[~|\bar\theta(\pi_i(X))|~ \ge ~|\bar\theta(X)|~], \] i.e. the proportion of permutations whose statistic is at least as extreme as the observed one.
More commonly, \(m\) is too large to enumerate and we approximate the sum above using a large but tractable number of random permutations.
To illustrate the basic idea behind permutation tests we consider the problem of comparing the weights of chicks fed on a variety of diets. The data for this example comes from the R datasets package and can be loaded using data(chickwts).
data(chickwts)
str(chickwts)
## 'data.frame': 71 obs. of 2 variables:
## $ weight: num 179 160 136 227 217 168 108 124 143 140 ...
## $ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
with(chickwts, boxplot(weight ~ feed, las = 1, main = 'Chick Weight Data'))
In this setting a permutation test is not necessary, and an ANOVA would likely be more appropriate.
anova(lm(weight~feed, data=chickwts))
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## feed 5 231129 46226 15.365 5.936e-10 ***
## Residuals 65 195556 3009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, if we were concerned about normality, or about dependence in similarly structured data, we could carry out a permutation test using the F statistic as above. First, here is a function to compute \(F\) from a data frame with two columns, ‘response’ and ‘group’:
library(dplyr)

# df should have exactly two columns "response" and "group"
compute_F = function(df) {
  # compute the grand mean
  gm = mean(df$response)
  # compute group means, variances, and sizes
  sum_stat = df %>%
    group_by(group) %>%
    summarize(xbar = mean(response), v = var(response), n = n())
  # compute the mean squares and return the F statistic
  ms_group = sum({sum_stat$xbar - gm}^2 * sum_stat$n) / {nrow(sum_stat) - 1}
  ms_error = sum(sum_stat$v * {sum_stat$n - 1}) / {nrow(df) - nrow(sum_stat)}
  ms_group / ms_error
}
Here’s a similar function that allows the user to pass ‘response’ and ‘group’ as strings, using dplyr’s programming (tidy evaluation) syntax.
compute_F2 = function(df, response, group) {
  # response and group should be strings that refer to columns in df
  stopifnot( length(response) == 1 & length(group) == 1 )
  stopifnot( all( c(response, group) %in% names(df) ) )
  # compute the grand mean
  gm = mean(df[[response]])
  # compute group means, variances, and sizes
  sum_stat = df %>%
    ungroup() %>%
    group_by(.data[[group]]) %>%
    summarize(xbar = mean(.data[[response]]),
              v = var(.data[[response]]),
              n = n()
    )
  # compute the mean squares and return the F statistic
  ms_group = with(sum_stat, {xbar - gm}^2 * n / {nrow(sum_stat) - 1})
  ms_error = with(sum_stat, v * {n - 1} / {sum(n) - nrow(sum_stat)})
  sum(ms_group) / sum(ms_error)
}
Now, let’s look at a single permutation:
df = chickwts %>% rename(response = weight, group = feed)
df_perm = df %>% mutate( group = group[ sample(1:nrow(df), replace=FALSE) ] )
head(df_perm)
## response group
## 1 179 meatmeal
## 2 160 linseed
## 3 136 meatmeal
## 4 227 horsebean
## 5 217 meatmeal
## 6 168 soybean
compute_F(df_perm)
## [1] 1.442968
Check that both functions give the same value.
near(compute_F(df), compute_F2(df, 'response', 'group'))
## [1] TRUE
We can write a function to compute the F statistic for the permuted sample:
permute_F = function(df){
compute_F({df %>%
mutate( group = group[ sample(1:nrow(df), replace=FALSE) ] )
}
)
}
permute_F(df)
## [1] 0.3568678
Now we are ready to carry out the permutation test. First, compute the observed statistic:
F_obs = compute_F(df)
Next we permute many times to build up the reference distribution,
nperm = 1e3
perm_F = sapply(1:nperm, function(i) permute_F(df) )
hist(perm_F, las = 1, main = 'F statistics for permuted Chick Weight Data',
     col = rgb(0, 0, 1, .5), xlab = 'F')
Finally we can report a p-value by comparing the observed F to the values simulated from the null distribution by permutation:
p = mean(F_obs <= perm_F)
pval_str = ifelse(p > 0, sprintf('p = %5.3f', p), sprintf('p < %5.3f', 1/nperm))
cat(pval_str)
## p < 0.001
In this example the observed F statistic was greater than all of the sampled permutation values, so we estimate the p-value to be less than one over the number of permutations.
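As an aside (a convention added here, not from the original notes), a common way to avoid reporting a p-value of exactly zero is to count the observed statistic as one of the permutations:

p_adj = {1 + sum(perm_F >= F_obs)} / {1 + nperm}  ## bounded below by 1/(nperm + 1)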
In some contexts, such as statistical learning or machine learning (ML), we prefer to evaluate a model based on the quality of its predictions rather than via inference on model parameters.
In ML, when building a model for some form of prediction task, the available data is typically split into:

- training data, used to fit the model(s);
- validation data, used to compare candidate models and tune them;
- test data, held out and used only to assess the final model.
We use out-of-sample data for validation and testing to get unbiased estimates of the out-of-sample error and to avoid overfitting.
In cross-validation we try to make efficient use of the available non-test data by repeatedly interchanging which observations are treated as training data and which as validation data. This is typically done by dividing the data into sub-groups called folds. If we have k of these groups we refer to it as k-fold cross-validation. The special case where k equals the total number of non-test samples is known as leave-one-out cross-validation.
When dividing data into folds – and also when generating the test-train split – it is important that observations be randomly distributed among the groups. For example, if you had previously sorted your data set you would not want to assign a contiguous block of rows to the same fold.
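As a minimal sketch of random fold assignment (an illustration added here, using the chickwts data from above):

## randomly assign n observations to k folds of roughly equal size
n_obs = nrow(chickwts)
k = 5
folds = sample(rep(1:k, length.out = n_obs))  ## randomly permuted fold labels
table(folds)  ## check the fold sizes
## training / validation split for, e.g., the first fold:
train = chickwts[folds != 1, ]
valid = chickwts[folds == 1, ]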
Here is an example using cross-validation in the context of ridge regression, from the course materials page.
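The linked example is more complete; as a minimal sketch of the idea (an illustration added here, using the mtcars data and the glmnet package rather than the linked script's setup):

library(glmnet)
X = model.matrix(mpg ~ ., data = mtcars)[, -1]  ## predictors, dropping the intercept column
y = mtcars$mpg
cv_fit = cv.glmnet(X, y, alpha = 0, nfolds = 10)  ## alpha = 0 corresponds to ridge regression
cv_fit$lambda.min  ## penalty minimizing the estimated out-of-sample error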