Resampling Techniques

Resampling is an umbrella term for a number of statistical techniques used to estimate quantities related to the sampling distribution of an estimator.

Techniques that fall under this umbrella include:

- the bootstrap,
- permutation tests, and
- cross validation.

The utility of all of these techniques is greatly enhanced by the ability to automate the re-sampling process and subsequent computations.

The Bootstrap

The bootstrap is a generic statistical method for estimating the variance of an estimator. It is frequently used to find confidence intervals when exact or asymptotic analytic formulas are unavailable or unsatisfactory.

The basic idea of the bootstrap is to build up the sampling distribution of an estimator by re-sampling the data many times. In the non-parametric bootstrap this is done by drawing \(B\) copies of the data from the empirical distribution, \(\mathbb{P}\):

\[ X_1, \dots, X_n \sim_{iid} P, \qquad \mathbb{P}(t) = \frac{1}{n} \sum_{i=1}^n 1[X_i \le t] \]
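
As an aside, the empirical distribution function \(\mathbb{P}\) defined above is available in base R as ecdf(); a small illustrative sketch (not from the notes):

x = rgamma(50, shape = 3, rate = 1)  # an illustrative sample
Fn = ecdf(x)                         # Fn(t) = proportion of sample values <= t
Fn(3)                                # estimate of P(X <= 3)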

In the parametric bootstrap the data are instead re-sampled from an (assumed) parametric (e.g. Gaussian) estimate of \(P\). For a comparison of the two see the slides from this talk by Robert Tibshirani.
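
To make the distinction concrete, here is a minimal sketch of a parametric bootstrap for the mean; the Gaussian model, the sample, and all names below are illustrative rather than taken from the notes:

set.seed(42)
y = rnorm(30, mean = 5, sd = 2)      # observed data (illustrative)
mu_hat = mean(y)                     # fitted Gaussian parameters
sd_hat = sd(y)
B = 1e3
boot_means = replicate(B, mean(rnorm(length(y), mu_hat, sd_hat)))
quantile(boot_means, c(.025, .975))  # 95% interval from the bootstrap quantiles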

Continuing with the non-parametric bootstrap, the basic idea is to draw many artificial data sets from \(\mathbb{P}\),

\[ \{(X_1^*, \dots, X_n^*)_b\}_{b=1}^B \sim_{iid} \mathbb{P} \] which is equivalent to drawing a sample of size \(n\) from the reference data \(X_1, \dots, X_n\) with replacement.

There are various methods for constructing a confidence interval for an estimator \(\bar{\theta}(X)\) using a bootstrap sample. The one I will emphasize in these notes is the percentile method, in which the confidence bounds are taken as sample quantiles from the bootstrapped distribution of the estimator. We continue with an example below.

Bootstrap Confidence Bounds for the IQR

Code for this example is available at Stats506_F20 as bootstarap_iqr_example.R.

Suppose we have iid data \(X_1, \dots, X_n\) from an unknown distribution and want to construct a confidence interval for the inter-quartile range.

n = 50
x = rgamma(n, shape = 3, rate = 1)
boxplot(x, las = 1, main = 'Boxplot of Data')

iqr_est = unname(diff(quantile(x, c(.25, .75))))
iqr_est
## [1] 2.535554

To form a (non-parametric) bootstrap estimate, we draw \(B\) new data sets from the original data.

B = 1e3  ## Number of bootstrap samples
boot_samples = sample(x, n * B, replace=TRUE)  ## Sample with replacement
dim(boot_samples) = c(n, B) ## each column is a dataset

For each bootstrap sample we compute the statistic of interest, in this case the IQR. This operation is not easy to vectorize, so we use an implicit loop (apply) instead.

boot_iqr = apply(boot_samples, 2, 
                function(b) { unname(diff(quantile(b, c(.25, .75)))) }
            )

This gives us \(B\) samples of the IQR. We use these to estimate the sampling distribution and construct a 95% confidence interval.

hist(boot_iqr, las = 1, col = 'green4', xlab = 'Estimated IQR', 
     cex.axis = 1.5, cex.lab = 1.5, main = '')

boot_q = quantile(boot_iqr, c(.025, .975))
abline(v = boot_q, lty = 'dashed', lwd = 2)

boot_ci = sprintf('%3.1f (%3.1f, %3.1f)', iqr_est, boot_q[1], boot_q[2])
cat(boot_ci,'\n')
## 2.5 (1.9, 3.1)

Here is a function that collects the above computations into one place so they can be reused.

# Return a percentile bootstrap 95% confidence interval for the IQR
# of an iid sample y, based on n_boot bootstrap samples.
iqr_ci = function(y, n_boot = 1e3){
  m = length(y)
  mat = sample(y, n_boot * m, replace=TRUE)
  dim(mat) = c(n_boot, m)
  f = function(x) { diff(quantile(x, c(.25, .75))) }
  v = apply(mat, 1, f)
  lcb = quantile(v, 0.025)
  ucb = quantile(v, 0.975)
  return( c(lcb, ucb) )
}
iqr_ci(x)
##     2.5%    97.5% 
## 1.879793 3.135028

Bootstrap function for quantile estimates

For comparison, below is a similar function, slightly modified from Professor Shedden’s 2016 notes on vectorization, that returns a bootstrap confidence interval for an arbitrary quantile.

# Return a bootstrap 95% confidence interval for the
# q^th population quantile based on an iid sample.
quant_ci = function(y, q = 0.5, n_boot = 1000) {
    m = length(y)
    mat = sample(y, n_boot * m, replace = TRUE)
    dim(mat) = c(n_boot, m)
    f = function(x) {quantile(x, q)}
    v = apply(mat, 1, f)
    lcb = quantile(v, 0.025)
    ucb = quantile(v, 0.975)
    return( c(lcb, ucb) )
}
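
For example (a usage sketch; output not shown), a bootstrap interval for the median of the gamma sample above can be obtained with:

quant_ci(x, q = 0.5)  # 95% bootstrap CI for the median of x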

When and why the bootstrap works

For a gentle introduction to the underlying theory, see this blog post from Larry Wasserman.

Case Study

We will consider the chick weight case study illustrating the use of the bootstrap to estimate the uncertainty in the median. This case study also introduces several concepts for programming with data.table.
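
To give a flavor of that case study (a minimal sketch, not the case study code itself; the names below are illustrative), a percentile bootstrap for the median weight within each feed group might look like:

library(data.table)
data(chickwts)
dt = as.data.table(chickwts)
boot_median_ci = function(y, n_boot = 1e3) {
  # percentile bootstrap 95% CI for the median of y
  meds = replicate(n_boot, median(sample(y, length(y), replace = TRUE)))
  quantile(meds, c(.025, .975))
}
dt[, as.list(boot_median_ci(weight)), by = feed]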

Permutation Tests

Another versatile statistical method that relies heavily on computation is permutation testing. Permutation tests are often used with non-iid data or when comparing non-standard statistics. The general idea is to construct a reference distribution for determining the significance of an observed test statistic by repeatedly permuting group labels.

Permutation tests are useful for testing a null hypothesis of no association, but require observations to be exchangeable under the null hypothesis.

Suppose we have \(n\) total samples from 2 groups. Then there are \(m_c = \binom{n}{k}\) possible group assignments, where \(k\) is the size of the first group. Note that there are actually \(m_p = n!\) permutations, but they are not all unique.

How many possible groupings are there if we have \(g\) groups of size \(k_1, \dots, k_g\)?

If \(m_c\) is reasonably small all possible group assignments can be considered. Let \(\kappa_i\) denote the \(i^{th}\) combination and \(\bar \theta(X)\) be the statistic of interest. Then, the permutation p-value for a two-sided test is:

\[ p = \frac{1}{m_c} \sum_{i=1}^{m_c} 1[~|\bar \theta(X)|~ \le ~|\bar\theta(\kappa_i(X))|~]. \]
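
For intuition, here is a tiny sketch (the data and group sizes are made up) that enumerates every combination \(\kappa_i\) for two small groups and applies the formula above with \(\bar\theta\) taken to be a difference in means:

z = c(1.2, 2.3, 1.9, 3.1, 2.8, 0.7)    # illustrative data from two groups
k = 3                                   # size of the first group
theta = function(idx) mean(z[idx]) - mean(z[-idx])
obs = abs(theta(1:k))                   # observed statistic (first k in group 1)
kappa = combn(length(z), k)             # all m_c possible group assignments
p_exact = mean(obs <= abs(apply(kappa, 2, theta)))  # exact permutation p-value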

More commonly, when \(m_c\) is large we approximate the sum above using a large but tractable number of permutations sampled at random. If we sample \(N\) permutations \(\{\tilde \pi_i(X)\}_{i=1}^N\), we can form a Monte Carlo approximation:

\[ \hat p = \frac{1}{N} \sum_{i=1}^N 1[~|\bar \theta(X)|~ \le ~|\bar\theta(\tilde\pi_i(X))|~]. \]

The count in the numerator follows a Binomial(\(N\), \(p\)) distribution, which can be used to estimate the Monte Carlo error in \(\hat p\). This is particularly important if our estimate is near a decision boundary, such as \(\alpha = 0.05\).
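
For example (a sketch; the values of \(\hat p\) and \(N\) below are illustrative), the Binomial standard error can be computed directly:

p_hat = 0.04                           # illustrative permutation p-value estimate
N = 1e3                                # number of sampled permutations
mc_se = sqrt(p_hat * (1 - p_hat) / N)  # Monte Carlo (Binomial) standard error
c(est = p_hat, lwr = p_hat - 2 * mc_se, upr = p_hat + 2 * mc_se)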

Because at least one permutation, the identity permutation \(\pi(X) = X\), has \(|\bar \theta(X)| \le |\bar\theta(\pi(X))|\), it is conventional to add one to both the numerator and denominator of the estimate above:

\[ \hat p = \frac{1}{N+1} \left(1 + \sum_{i=1}^N 1[~|\bar \theta(X)|~ \le ~|\bar\theta(\tilde\pi_i(X))|~]\right). \]

Example: Chick Weights

To illustrate the basic idea behind permutation tests we consider the problem of comparing the weights of chicks fed on a variety of diets. The data for this example comes from the R datasets package and can be loaded using data(chickwts).

data(chickwts)
str(chickwts)
## 'data.frame':    71 obs. of  2 variables:
##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
with(chickwts, boxplot(weight ~ feed, las = 1, main = 'Chick Weight Data'))

In this setting a permutation test is not necessary and an ANOVA would likely be more appropriate.

anova(lm(weight ~ feed, data = chickwts))
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## feed       5 231129   46226  15.365 5.936e-10 ***
## Residuals 65 195556    3009                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, if we were concerned about normality or dependence in similarly structured data we could carry out a permutation test using the F statistic as above. First, here is a function to compute \(F\) from a data.table.

compute_F = function(dt, response, group){
  # compute the ANOVA F statistic
  # dt - a data.table with columns response and group
  # response - the name of a (numeric) column in dt, used as the response (DV)
  # group - the name of a column in dt, used as the grouping factor (IV)

  stopifnot( c(response, group) %in% names(dt) )
  stopifnot( "data.table" %in% class(dt) )
  
  # compute the grand mean
  gm = dt[, mean(.SD[[response]])]
  
  # compute group means
  sum_stat = 
    dt[, .(N = .N, xbar = mean(.SD[[1]]), v = var(.SD[[1]])), 
       c(group), .SDcols = response]
  
  # compute the mean squares
  ms_group = sum_stat[, sum( (xbar - gm)^2 * N / (.N - 1) )]
  ms_error = sum_stat[,  sum(v * (N - 1))] / (nrow(dt) - nrow(sum_stat))
  
  # F
  ms_group / ms_error
}
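
As a quick sanity check (not in the original notes), applying compute_F to the observed grouping should reproduce the F value reported by anova() above:

library(data.table)  # compute_F relies on data.table syntax
compute_F(as.data.table(chickwts), 'weight', 'feed')  # should match F = 15.365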

Now, let’s look at a single permutation:

dt_chickwts = as.data.table(chickwts)
dt_chickwts[, group := feed[sample(.N, replace = FALSE)]]
head(dt_chickwts)
##    weight      feed     group
## 1:    179 horsebean  meatmeal
## 2:    160 horsebean    casein
## 3:    136 horsebean sunflower
## 4:    227 horsebean sunflower
## 5:    217 horsebean   soybean
## 6:    168 horsebean  meatmeal
compute_F(dt_chickwts, 'weight', 'group')
## [1] 0.9242074

We can write a function to compute the F statistic for a permuted sample:

permute_F = function(dt, response = 'weight', group = 'feed'){
  dt[, group := .SD[[..group]][sample(.N, replace = FALSE)]]
  compute_F(dt, response, "group")
}
permute_F(dt_chickwts, 'weight', 'feed')
## [1] 0.2730153

Now we are ready to carry out the permutation test. First, compute the observed statistic:

F_obs = compute_F(dt_chickwts, 'weight', 'feed')

Next we permute many times to build up the reference distribution,

nperm = 1e3
perm_F = sapply(1:nperm, function(i) permute_F(dt_chickwts, 'weight', 'feed') )
hist(perm_F, las = 1, main = 'F statistics for permuted Chick Weight Data', 
     col = rgb(0, 0, 1, .5), xlab = 'F')

Finally we can report an estimated p-value by comparing the observed F to the values simulated from the null distribution by permutation:

p = {1 + sum(F_obs <= perm_F)} / {1 + length(perm_F)}
pval_str = sprintf('p = %5.3f', p)
cat(pval_str)
## p = 0.001

Cross Validation

In some contexts, such as statistical or machine learning (ML), we prefer to evaluate a model by the quality of its predictions rather than by inference on model parameters or by its fit to the data used for estimation.

In ML, when building a model for some form of prediction task, available data is split as follows:

- training data, used to estimate (fit) the model;
- validation data, used to compare candidate models and to tune them; and
- testing data, held back and used only to assess the out-of-sample error of the final model.

The image here helps to clarify this distinction.

We use out-of-sample data for validation and testing to get unbiased estimates of the out of sample error and avoid over-fitting.
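
As a concrete illustration (a sketch with made-up sizes and proportions, not code from the notes), a random three-way split might look like:

n = 1000
idx = sample(n)              # random permutation of row indices
train_idx = idx[1:600]       # 60% training
valid_idx = idx[601:800]     # 20% validation
test_idx  = idx[801:1000]    # 20% testing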

In cross validation we try to make efficient use of our available non-test data by repeatedly interchanging which observations are considered training and which validation data. This is typically done by dividing the data into sub-groups called folds. If we have k of these groups we refer to it as k-fold cross validation. The special case when k equals the total number of non-test samples is known as leave-one-out cross-validation.

When dividing data into folds, and also when generating the test-train split, it is important that observations be randomly distributed among the groups. For example, if you had previously sorted your data set, you would not want to assign a contiguous block of rows to the same fold.
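
Here is a minimal sketch (not from the notes) of randomly assigning observations to folds:

n = 100
k = 10
fold = sample(rep(1:k, length.out = n))  # shuffled fold labels, one per observation
table(fold)                              # roughly n / k observations per fold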

Example: Ridge Regression

The code for this example can be found at xvalidate_ridge.R.
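
For reference, here is a minimal, self-contained sketch of what k-fold cross validation for choosing the ridge penalty might look like. This is not the code in xvalidate_ridge.R; the simulated data, the closed-form ridge solver, and all names below are illustrative:

set.seed(1)
n = 200; p = 10
X = matrix(rnorm(n * p), n, p)
beta = c(rep(1, 3), rep(0, p - 3))
y = X %*% beta + rnorm(n)

ridge_coef = function(X, y, lambda) {
  # closed-form ridge solution: (X'X + lambda * I)^{-1} X'y
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

k = 5
fold = sample(rep(1:k, length.out = n))   # random fold assignment
lambdas = 10^seq(-2, 2, length.out = 25)  # candidate penalties
cv_mse = sapply(lambdas, function(lam) {
  mean(sapply(1:k, function(f) {
    b = ridge_coef(X[fold != f, ], y[fold != f], lam)  # train on k - 1 folds
    mean((y[fold == f] - X[fold == f, ] %*% b)^2)      # validate on held-out fold
  }))
})
lambdas[which.min(cv_mse)]                # penalty minimizing estimated MSE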