Resampling Techniques

Resampling is an umbrella term for a number of statistical techniques used to estimate quantities related to the sampling distribution of an estimator.

Techniques that fall under this umbrella include the bootstrap, permutation tests, and cross-validation, each of which is covered below.

The utility of all of these techniques is greatly enhanced by the ability to automate the re-sampling process and subsequent computations.

Resources

The Bootstrap

The bootstrap is a generic statistical method for assessing the accuracy of an estimator. It is frequently used to find confidence intervals when exact or asymptotic analytic formulas are unavailable or unsatisfactory.

The basic idea of the bootstrap is to build up the sampling distribution of an estimator by re-sampling the data many times. In the non-parametric bootstrap this is done by drawing \(B\) copies of the data from the empirical distribution, \(\mathbb{P}\):

\[ X_1, \dots, X_n \sim_{iid} P, \qquad \mathbb{P}(t) = \frac{1}{n} \sum_{i=1}^n 1[X_i \le t]. \] In the parametric bootstrap the data are re-sampled instead from an (assumed) parametric (e.g. Gaussian) estimate of \(P\).
For a comparison of the two see this talk by Robert Tibshirani.

Continuing with the non-parametric bootstrap, the basic idea is to draw many artificial data sets from \(\mathbb{P}\),

\[ \{(X^*_1, \dots, X^*_n)_b\}_{b=1}^B \sim_{iid} \mathbb{P}, \] which is equivalent to drawing a sample of size \(n\) from the reference data \(X_1, \dots, X_n\) with replacement.

There are various methods for constructing a confidence interval for an estimator \(\bar{\theta}(X)\) using a bootstrap sample. The one I wish to emphasize is the percentile method, in which the confidence bounds are taken as sample quantiles from the bootstrapped distribution of the estimator. We continue with an example below.

Bootstrap Confidence Bounds for the IQR

Code for this example is available at Stats506_F19 as bootstarap_iqr_example.R.

Suppose we have iid data \(X_1, \dots, X_n\) from an unknown distribution and want to construct a confidence interval for the inter-quartile range.
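The code chunk for this step is not shown in these notes; here is a minimal sketch using simulated stand-in data, since the original data-generating step is unknown (the value printed below comes from the original data and will differ):

```r
# Simulated stand-in data; the original data source is not shown in the notes.
set.seed(42)
x <- rexp(100, rate = 0.5)

# Point estimate of the inter-quartile range.
iqr_hat <- unname(diff(quantile(x, probs = c(0.25, 0.75))))
iqr_hat
```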

## [1] 2.532801

To form a (non-parametric) bootstrap estimate, we draw \(B\) new data sets from the original data.
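A sketch of the resampling step, again using simulated stand-in data; each column of boot_samples is one bootstrap data set:

```r
set.seed(42)
x <- rexp(100, rate = 0.5)  # stand-in for the original data
n <- length(x)
B <- 1000

# Draw n * B values from x with replacement; column b is bootstrap data set b.
boot_samples <- matrix(sample(x, n * B, replace = TRUE), nrow = n, ncol = B)
dim(boot_samples)
```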

For each bootstrap sample we compute the statistic of interest, in this case the IQR. This operation is not easy to vectorize so we use an implicit loop instead.
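One way to write that implicit loop is apply over the columns of the resampling matrix (a sketch with stand-in data, since the original chunk is not shown):

```r
set.seed(42)
x <- rexp(100, rate = 0.5)  # stand-in for the original data
n <- length(x)
B <- 1000
boot_samples <- matrix(sample(x, n * B, replace = TRUE), n, B)

# Implicit loop: compute the IQR of each bootstrap data set.
boot_iqr <- apply(boot_samples, 2, function(xb) {
  unname(diff(quantile(xb, probs = c(0.25, 0.75))))
})
length(boot_iqr)
```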

This gives us \(B\) samples of the IQR. We use these to estimate the sampling distribution and construct a 95% confidence interval.
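A sketch of the percentile interval; the specific numbers printed below come from the original (unshown) data and will differ from what this stand-in example produces:

```r
set.seed(42)
x <- rexp(100, rate = 0.5)  # stand-in for the original data
n <- length(x)
B <- 1000
boot_samples <- matrix(sample(x, n * B, replace = TRUE), n, B)
boot_iqr <- apply(boot_samples, 2, function(xb) diff(quantile(xb, c(0.25, 0.75))))

# 95% percentile interval: the 2.5% and 97.5% quantiles of the bootstrap IQRs.
est <- diff(quantile(x, c(0.25, 0.75)))
ci <- quantile(boot_iqr, probs = c(0.025, 0.975))
sprintf('%3.1f (%3.1f, %3.1f)', est, ci[1], ci[2])
```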

## 2.5 (1.8, 3.1)

Here is a function that collects the above computations into one place and is repeatable.
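The original function is not shown; a minimal sketch of one way to package the computation (argument names and defaults are choices made here):

```r
boot_ci_iqr <- function(x, B = 1000, level = 0.95) {
  # Percentile bootstrap confidence interval for the IQR of x.
  n <- length(x)
  mat <- matrix(sample(x, n * B, replace = TRUE), n, B)
  boot_iqr <- apply(mat, 2, function(xb) diff(quantile(xb, c(0.25, 0.75))))
  alpha <- 1 - level
  quantile(boot_iqr, probs = c(alpha / 2, 1 - alpha / 2))
}

# Example call with simulated stand-in data
set.seed(42)
boot_ci_iqr(rexp(100, rate = 0.5))
```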

##     2.5%    97.5% 
## 1.789583 3.060147

When and why the bootstrap works

For a gentle introduction to the underlying theory, see this blog post from Larry Wasserman.

Case Study

We will consider the chick weight case study illustrating the use of the bootstrap to estimate the uncertainty in the median. This case study also introduces several concepts for programming with dplyr.

Permutation Tests

Another versatile statistical method heavily reliant on computation is permutation testing. Permutation tests are often used with non-iid data or when comparing non-standard statistics. The general idea is to construct a reference distribution for determining the significance of an observed test statistic by repeatedly permuting group labels.

Permutation tests are useful for testing a null hypothesis of no association, but require observations to be exchangeable under the null hypothesis.

Suppose we have \(n\) total samples from 2 groups. Then there are \(m = \binom{n}{k}\) possible group assignments, where \(k\) is the size of the first group. Note that there are actually \(n!\) permutations, but they are not all unique.

How many possible groupings are there if we have \(g\) groups of size \(k_1, \dots, k_g\)?
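For reference, the answer generalizes the binomial coefficient to the multinomial coefficient:

\[ m = \binom{n}{k_1, \dots, k_g} = \frac{n!}{k_1! \, k_2! \cdots k_g!}, \qquad n = k_1 + \dots + k_g. \]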

If \(m\) is reasonably small all possible group assignments can be considered. Let \(\pi_i\) denote the \(i^{th}\) permutation and \(\bar \theta(X)\) be the statistic of interest. Then, the permutation p-value for a two-sided test is:

\[ p = \frac{1}{m} \sum_{i=1}^m 1[~|\bar \theta(X)|~ \le ~|\bar\theta(\pi_i(X))|~]. \]
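When \(m\) is small this sum can be computed exactly. A sketch for two groups, using the absolute difference in means as a hypothetical choice of \(\bar\theta\) (the document does not fix a particular statistic) and enumerating the assignments with combn:

```r
# Toy data: n = 7 observations, the first k = 3 belong to group 1.
x <- c(1.2, 2.4, 1.9, 3.1, 0.7, 2.8, 1.5)
k <- 3
theta <- function(idx) abs(mean(x[idx]) - mean(x[-idx]))

obs <- theta(1:k)

# All m = choose(n, k) assignments of k observations to group 1.
assignments <- combn(length(x), k)
perm_stats <- apply(assignments, 2, theta)

# Exact permutation p-value; the identity assignment guarantees p >= 1/m.
p <- mean(obs <= perm_stats)
p
```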

More commonly, when \(m\) is large we approximate the sum above using some large but tractable number of permutations sampled at random. Sampling \(N\) permutations \(\tilde \pi_i(X)\) yields the Monte Carlo approximation:

\[ \hat p = \frac{1}{N} \sum_{i=1}^N 1[~|\bar \theta(X)|~ \le ~|\bar\theta(\tilde\pi_i(X))|~]. \]

The estimate, \(\hat p\), follows a Binomial distribution which can be used to estimate the Monte Carlo error in our estimate. This is particularly important if our estimate is near a decision boundary, such as \(\alpha = 0.05\).
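Since each comparison is a Bernoulli draw, the Monte Carlo standard error of \(\hat p\) can be sketched as follows (the values of N and p_hat here are hypothetical):

```r
N <- 1000
p_hat <- 0.04  # hypothetical estimate near the alpha = 0.05 boundary

# Binomial (Monte Carlo) standard error of the estimated p-value.
mc_se <- sqrt(p_hat * (1 - p_hat) / N)
mc_se
```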

Because at least one permutation, the identity permutation \(\pi(X) = X\), has \(|\bar \theta(X)| \le |\bar\theta(\pi(X))|\), it is conventional to add one to both the numerator and denominator of the estimate above:

\[ \hat p = \frac{1}{N+1} \left(1 + \sum_{i=1}^N 1[~|\bar \theta(X)|~ \le ~|\bar\theta(\tilde\pi_i(X))|~]\right). \]

Example: Chick Weights

To illustrate the basic idea behind permutation tests we consider the problem of comparing the weights of chicks fed on a variety of diets. The data for this example comes from the R datasets package and can be loaded using data(chickwts).
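Loading and inspecting the data:

```r
# chickwts ships with the (auto-loaded) datasets package.
data(chickwts)
str(chickwts)
```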

## 'data.frame':    71 obs. of  2 variables:
##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...

In this setting a permutation test is not necessary and an ANOVA would likely be more appropriate.
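The ANOVA table below can be reproduced with:

```r
data(chickwts)

# One-way ANOVA of weight on feed type.
anova(lm(weight ~ feed, data = chickwts))
```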

## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## feed       5 231129   46226  15.365 5.936e-10 ***
## Residuals 65 195556    3009                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, if we were concerned about normality or dependence in similarly structured data, we could carry out a permutation test using the F statistic as above. First, here is a function to compute \(F\) from a data frame with two columns, ‘response’ and ‘group’.
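The original function is not shown; a minimal sketch using lm and anova (the original may instead have computed \(F\) directly from group summaries):

```r
compute_F <- function(df) {
  # df must have columns 'response' and 'group'.
  anova(lm(response ~ group, data = df))[1, 'F value']
}

# Example: chick weights, renamed to match the expected columns.
data(chickwts)
df <- data.frame(response = chickwts$weight, group = chickwts$feed)
compute_F(df)
```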

Here’s a similar function that allows the user to pass ‘response’ and ‘group’ as a string written using dplyr programming syntax.
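A sketch of such a function using the .data pronoun, so that column names can be passed as strings (the original's dplyr programming syntax may have differed):

```r
library(dplyr)

compute_F_str <- function(df, response, group) {
  # 'response' and 'group' are column names supplied as strings.
  smry <- df %>%
    group_by(.data[[group]]) %>%
    summarize(
      n   = n(),
      gm  = mean(.data[[response]]),
      ssw = sum((.data[[response]] - mean(.data[[response]]))^2),
      .groups = 'drop'
    )
  grand <- sum(smry$n * smry$gm) / sum(smry$n)
  ss_between <- sum(smry$n * (smry$gm - grand)^2)
  g <- nrow(smry)
  n <- sum(smry$n)
  # F = mean square between / mean square within
  (ss_between / (g - 1)) / (sum(smry$ssw) / (n - g))
}

# Example call
data(chickwts)
compute_F_str(chickwts, 'weight', 'feed')
```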

Now, let’s look at a single permutation:
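A single permutation shuffles the group labels while leaving the responses fixed. A sketch (the output below came from the original notes with an unrecorded seed, so the exact rows and F value will differ):

```r
data(chickwts)
df <- data.frame(response = chickwts$weight, group = chickwts$feed)

set.seed(506)  # hypothetical seed; the original is not recorded
perm <- df
perm$group <- sample(perm$group)  # permute the labels only

head(perm)
anova(lm(response ~ group, data = perm))[1, 'F value']
```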

##   response     group
## 1      179 sunflower
## 2      160 horsebean
## 3      136 sunflower
## 4      227  meatmeal
## 5      217   soybean
## 6      168    casein
## [1] 0.7672322

Check that both functions give the same value.
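The check can be sketched by comparing an lm-based implementation against a manual computation (both are hypothetical stand-ins for the two functions described above):

```r
data(chickwts)

# lm-based one-way F statistic
F_lm <- function(df) anova(lm(weight ~ feed, data = df))[1, 'F value']

# Manual one-way ANOVA F statistic
F_manual <- function(df) {
  n <- nrow(df)
  g <- nlevels(df$feed)
  ng <- tapply(df$weight, df$feed, length)
  gm <- tapply(df$weight, df$feed, mean)
  grand <- mean(df$weight)
  ssb <- sum(ng * (gm - grand)^2)          # between-group sum of squares
  ssw <- sum((df$weight - gm[df$feed])^2)  # within-group sum of squares
  (ssb / (g - 1)) / (ssw / (n - g))
}

isTRUE(all.equal(F_lm(chickwts), F_manual(chickwts)))
```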

## [1] TRUE

We can write a function to compute the F statistic for the permuted sample:
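A sketch of such a function (the printed value below used the original notes' unrecorded seed, so it will differ):

```r
data(chickwts)

permuted_F <- function(df) {
  # Shuffle the feed labels, then recompute the one-way F statistic.
  df$feed <- sample(df$feed)
  anova(lm(weight ~ feed, data = df))[1, 'F value']
}

set.seed(1)  # hypothetical seed
permuted_F(chickwts)
```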

## [1] 0.7387909

Now we are ready to carry out the permutation test. First, compute the observed statistic:

Next we permute many times to build up the reference distribution,

Finally we can report an estimated p-value by comparing the observed F to the values simulated from the null distribution by permutation:
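Putting the three steps together (observed statistic, permuted reference distribution, +1-adjusted estimate); the number of permutations and the seed are choices made here, not taken from the original:

```r
data(chickwts)

# Step 1: observed F statistic
F_obs <- anova(lm(weight ~ feed, data = chickwts))[1, 'F value']

# Step 2: reference distribution from random permutations of the labels
set.seed(506)  # hypothetical seed
nperm <- 999
F_perm <- replicate(nperm, {
  df <- chickwts
  df$feed <- sample(df$feed)
  anova(lm(weight ~ feed, data = df))[1, 'F value']
})

# Step 3: +1-adjusted Monte Carlo p-value
p_hat <- (1 + sum(F_obs <= F_perm)) / (nperm + 1)
sprintf('p = %5.3f', p_hat)
```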

## p = 0.001

In this example the observed F statistic was greater than all sampled permuted values so we estimate it to be less than one over the number of permutations.

Cross Validation

In some contexts, such as statistical or machine learning (ML), we prefer to evaluate a model based on the quality of its predictions rather than via inference on model parameters or its fit to the data used for estimation.

In ML, when building a model for some form of prediction task, the available data is split into three parts: a training set used to fit the model, a validation set used to compare and tune candidate models, and a test set held out for a final assessment of out-of-sample performance.

The image here helps to clarify this distinction.

We use out-of-sample data for validation and testing to get unbiased estimates of the out-of-sample error and to avoid overfitting.

In cross-validation we try to make efficient use of our available non-test data by repeatedly interchanging which observations are treated as training data and which as validation data. This is typically done by dividing the data into sub-groups called folds. If we have \(k\) of these groups we refer to it as \(k\)-fold cross-validation. The special case where \(k\) equals the total number of non-test samples is known as leave-one-out cross-validation.

When dividing data into folds – and also when generating the test-train split – it is important that observations be randomly distributed among the groups. For example, if you had previously sorted your data set you would not want to assign a contiguous block of rows to the same fold.
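A sketch of random fold assignment (the values of n and k here are illustrative):

```r
set.seed(506)  # hypothetical seed
n <- 20  # number of non-test observations
k <- 4   # number of folds

# Shuffle a balanced fold vector so contiguous rows land in random folds.
folds <- sample(rep(1:k, length.out = n))
table(folds)
```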