Resampling Techniques

Resampling is an umbrella term for a number of statistical techniques used to estimate quantities related to the sampling distribution of an estimator.

Techniques that fall under this umbrella include the bootstrap, permutation tests, and cross validation, each discussed below.

The utility of all of these techniques is greatly enhanced by the ability to automate the re-sampling process and subsequent computations.

The Bootstrap

The bootstrap is a generic statistical method for assessing the accuracy of an estimator. It is frequently used to find confidence intervals when exact or asymptotic analytic formulas are unavailable or unsatisfactory.

The basic idea of the bootstrap is to build up the sampling distribution of an estimator by re-sampling the data many times. In the non-parametric bootstrap this is done by drawing \(B\) copies of the data from the empirical distribution, \(\mathbb{P}\):

\[ X_1, \dots, X_n \sim_{iid} P, \qquad \mathbb{P}(t) = \frac{1}{n} \sum_{i=1}^n 1[X_i \le t] \] In the parametric bootstrap the data are re-sampled instead from an (assumed) parametric (e.g. Gaussian) estimate of \(P\). For a comparison of the two see this talk by Robert Tibshirani.
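
For concreteness, here is a minimal sketch of a parametric bootstrap. It is not part of the example below; it assumes a gamma model whose shape and rate are estimated by the method of moments.

# Parametric bootstrap sketch (illustrative only): fit a gamma model to an
# iid sample y, then draw B new data sets from the fitted distribution.
y = rgamma(50, shape = 3, rate = 1)   ## hypothetical iid sample
rate_hat  = mean(y) / var(y)          ## method-of-moments estimates
shape_hat = mean(y) * rate_hat
B = 1e3
par_boot = matrix(rgamma(length(y) * B, shape = shape_hat, rate = rate_hat),
                  nrow = length(y))   ## each column is one parametric bootstrap sample

Each column of par_boot is then treated exactly like a non-parametric bootstrap sample when computing the statistic of interest.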

Continuing with the non-parametric bootstrap, the basic idea is to draw many artificial data sets from \(\mathbb{P}\),

\[ \{(X^*_1, \dots, X_n^*)_b\}_{b=1}^B \sim_{iid} \mathbb{P} \] which is equivalent to drawing a sample of size \(n\) from the reference data \(X_1, \dots, X_n\) with replacement.

There are various methods for constructing a confidence interval for an estimator \(\bar{\theta}(X)\) using a bootstrap sample. The most common is the percentile method, in which the confidence bounds are taken as sample quantiles of the bootstrapped distribution of the estimator. We continue with an example below.

Bootstrap Confidence Bounds for the IQR

Code for this example is available at Stats506_F18 as bootstarap_iqr_example.R.

Suppose we have iid data \(X_1, \dots, X_n\) from an unknown distribution and want to construct a confidence interval for the inter-quartile range.

n = 50
x = rgamma(n, shape=3, rate=1)
boxplot(x, las=1, main='Boxplot of Data')

iqr_est = unname(diff(quantile(x, c(.25, .75))))
iqr_est
## [1] 1.905841

To form a (non-parametric) bootstrap estimate, we draw \(B\) new data sets from the original data.

B = 1e3  ## Number of bootstrap samples
boot_samples = sample(x, n*B, replace=TRUE)  ## Sample with replacement
dim(boot_samples) = c(n, B) ## each column is a dataset

For each bootstrap sample we compute the statistic of interest, in this case the IQR. This operation is not easy to vectorize so we use an implicit loop instead.

boot_iqr = apply(boot_samples, 2, 
                function(b) unname( diff( quantile(b, c(.25, .75) ) ) ) 
            )

This gives us \(B\) samples of the IQR. We use these to estimate the sampling distribution and construct a 95% confidence interval.

hist(boot_iqr, las=1, col='green4', xlab='Estimated IQR', 
     cex.axis=1.5, cex.lab=1.5, main = '')

boot_q = quantile(boot_iqr, c(.025, .975))
abline(v=boot_q, lty='dashed', lwd=2)

boot_ci = sprintf('%3.1f (%3.1f, %3.1f)', iqr_est, boot_q[1], boot_q[2])
cat(boot_ci,'\n')
## 1.9 (1.5, 2.6)

Here is a function that collects the above computations into one place and is repeatable.

# Return a bootstrap 95% confidence interval for the IQR based on an iid sample.
iqr_ci = function(y, n_boot=1e3){
  m = length(y)
  mat = sample(y, n_boot*m, replace=TRUE)  ## resample with replacement
  dim(mat) = c(n_boot, m)                  ## each row is a bootstrap data set
  f = function(x) diff( quantile(x, c(.25, .75) ) )
  v = apply(mat, 1, f)                     ## IQR of each bootstrap sample
  lcb = quantile(v, 0.025)
  ucb = quantile(v, 0.975)
  return( c(lcb, ucb) )
}
iqr_ci(x)
##     2.5%    97.5% 
## 1.452141 2.669621

Bootstrap function for quantile estimates

Below is a similar function for comparison, slightly modified from Professor Shedden’s 2016 notes on vectorization.

# Return a bootstrap 95% confidence interval for the
# q^th population quantile based on an iid sample.
quant_ci = function(y, q=0.5, n_boot=1000) {
    m = length(y)
    mat = sample(y, n_boot*m, replace=TRUE)
    dim(mat) = c(n_boot, m)
    f = function(x) quantile(x, q)
    v = apply(mat, 1, f)
    lcb = quantile(v, 0.025)
    ucb = quantile(v, 0.975)
    return( c(lcb, ucb) )
}
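
For example, quant_ci(x, q = 0.5) gives a 95% bootstrap confidence interval for the median of the simulated gamma data above; the exact bounds will vary with the random draws.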

When and why the bootstrap works

For a gentle introduction to the underlying theory, see this blog post from Larry Wasserman.

Case Study

In the linked case study, we use the bootstrap and the data.table package to estimate confidence intervals for the median weight at 21 weeks of chicks on various types of feed.

The case study is available at the course materials repo: chick_weight_data.table.R.
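
The case study itself is not reproduced here, but a rough sketch of the same kind of computation might look like the following. It assumes the built-in ChickWeight data (which may differ from the data used in the case study) and 95% percentile intervals.

library(data.table)
cw = as.data.table(ChickWeight)[Time == 21]   ## last recorded weight for each chick

# Percentile bootstrap CI for the median of a numeric vector w
boot_median = function(w, n_boot = 1e3) {
  mat = matrix(sample(w, n_boot * length(w), replace = TRUE), nrow = n_boot)
  med = apply(mat, 1, median)                 ## median of each bootstrap sample
  list(est = median(w), lwr = quantile(med, 0.025), upr = quantile(med, 0.975))
}

cw[, boot_median(weight), by = Diet]          ## one interval per diet group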

Permutation Tests

Another versatile statistical method heavily reliant on computation is permutation testing. Permutation tests are often used with non-iid data or when comparing non-standard statistics. The general idea is to construct a reference distribution for assessing the significance of an observed test statistic by repeatedly permuting group labels.

Suppose we have \(n\) total samples from 2 groups. Then there are \(m = \binom{n}{k}\) possible group assignments, where \(k\) is the size of the first group. Note that there are actually \(n!\) permutations, but not all of them produce distinct group assignments.

How many possible groupings are there if we have \(g\) groups of size \(k_1, \dots, k_g\)?

If \(m\) is reasonably small all possible group assignments can be considered. Let \(\pi_i\) denote the \(i^{th}\) permutation and \(\bar \theta(X)\) be the statistic of interest. Then, the permutation p-value for a two-sided test is:

\[ p = \frac{1}{m} \sum_{i=1}^m 1[~|\bar\theta(\pi_i(X))|~ \ge ~|\bar \theta(X)|~]. \]

More commonly, when \(m\) is large we approximate the sum above using some large but tractable number of permutations.
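
For instance, here is a minimal sketch of an approximate permutation test for a difference in two group means, using made-up simulated data rather than anything from the notes:

# Approximate permutation test for a difference in means (illustrative data).
resp = c(rnorm(20, mean = 0), rnorm(25, mean = 0.5))   ## pooled responses
grp  = rep(c('a', 'b'), c(20, 25))                     ## group labels
theta_obs = abs(diff(tapply(resp, grp, mean)))         ## observed statistic

n_perm = 1e4
theta_perm = replicate(n_perm, abs(diff(tapply(resp, sample(grp), mean))))
mean(theta_perm >= theta_obs)                          ## approximate p-value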

Example: Chick Weights

To illustrate the basic idea behind permutation tests we consider the problem of comparing the weights of chicks fed on a variety of diets. The data for this example comes from the R datasets package and can be loaded using data(chickwts).

data(chickwts)
str(chickwts)
## 'data.frame':    71 obs. of  2 variables:
##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
with(chickwts, boxplot(weight ~ feed, las = 1, main = 'Chick Weight Data'))

In this setting a permutation test is not necessary, and an ANOVA would likely be more appropriate.

anova(lm(weight~feed, data=chickwts))
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## feed       5 231129   46226  15.365 5.936e-10 ***
## Residuals 65 195556    3009                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, if we were concerned about normality or dependence in similarly structured data, we could carry out a permutation test using the F statistic as above. First, here is a function to compute \(F\) from a data frame with two columns ‘response’ and ‘group’,

library(dplyr)  ## for the pipe, group_by, and summarize used below

# df should have exactly two columns "response" and "group"
compute_F = function(df){
  # compute the grand mean
  gm = mean(df$response)
  
  # compute group means
  sum_stat = df %>% group_by(group) %>% 
    summarize(xbar = mean(response), v = var(response), n=n())
  
  # compute the mean squares
  ms_group = sum({sum_stat$xbar - gm}^2*sum_stat$n) / {nrow(sum_stat) - 1}
  ms_error = sum(sum_stat$v*{sum_stat$n-1}) / {nrow(df) - nrow(sum_stat)}
  
  ms_group / ms_error
}

Here’s a similar function that allows the user to pass ‘response’ and ‘group’ as strings, written using dplyr’s programming (tidy evaluation) syntax.

compute_F2 = function(df, response, group) {
  # response and group should be strings that refer to columns in df
  stopifnot( length(response) == 1 & length(group) == 1)
  stopifnot( all( c(response, group) %in% names(df) ) )
  
  # Compute grand mean
  gm = mean(df[[response]])
  
  sum_stat = df %>% ungroup() %>%
    group_by( .data[[group]] ) %>%
    summarize( xbar = mean( .data[[response]] ), 
               v = var( .data[[response]] ), 
               n = n()
    )
  
  ms_group = with(sum_stat, {xbar-gm}^2*n / { nrow(sum_stat) - 1 } )
  ms_error = with(sum_stat, v * {n - 1} / { sum(n) - nrow(sum_stat) })
  
  sum(ms_group) / sum(ms_error)
}

Now, let’s look at a single permutation:

df = chickwts %>% rename(response = weight, group = feed)
df_perm = df %>% mutate( group = group[ sample(1:nrow(df), replace=FALSE) ] )
head(df_perm)
##   response     group
## 1      179  meatmeal
## 2      160   linseed
## 3      136  meatmeal
## 4      227 horsebean
## 5      217  meatmeal
## 6      168   soybean
compute_F(df_perm)
## [1] 1.442968

Check that both functions give the same value.

near(compute_F(df), compute_F2(df, 'response', 'group'))
## [1] TRUE

We can write a function to compute the F statistic for the permuted sample:

permute_F = function(df){
  compute_F({df %>%
              mutate( group = group[ sample(1:nrow(df), replace=FALSE) ] )
             }
           )
}
permute_F(df)
## [1] 0.3568678

Now we are ready to carry out the permutation test. First, compute the observed statistic:

F_obs = compute_F(df)

Next we permute many times to build up the reference distribution,

nperm = 1e3
perm_F = sapply(1:nperm, function(i) permute_F(df) )
hist(perm_F, las = 1, main='F statistics for permuted Chick Weight Data', 
     col=rgb(0,0,1,.5), xlab='F')

Finally we can report a p-value by comparing the observed F to the values simulated from the null distribution by permutation:

p = mean(F_obs <= perm_F)
pval_str = ifelse(p > 0, sprintf('p = %5.3f', p), sprintf('p < %5.3f', 1/nperm))
cat(pval_str)
## p < 0.001

In this example the observed F statistic was greater than all of the permuted values, so we report the p-value as being less than one over the number of permutations.

Cross Validation

In some contexts, such as statistical or machine learning (ML), we prefer to evaluate our model based on the quality of its predictions rather than via inference on model parameters.

In ML, when building a model for some form of prediction task, the available data is typically split into: training data, used to fit the model; validation data, used to compare models or tune parameters; and testing data, used to estimate the out-of-sample error of the final model.

The image here helps to clarify this distinction.

We use out-of-sample data for validation and testing to get unbiased estimates of the out-of-sample error and to avoid overfitting.

In cross validation we try to make efficient use of our available non-test data by repeatedly interchanging which observations are treated as training data and which as validation data. This is typically done by dividing the data into sub-groups called folds. If we have k of these groups we refer to it as k-fold cross validation. The special case when k equals the total number of non-test samples is known as leave-one-out cross-validation.

When dividing data into folds – and also when generating the test-train split – it is important that observations be randomly distributed among the groups. For example, if you had previously sorted your data set you would not want to assign a contiguous block of rows to the same fold.
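
As a minimal illustration of these ideas (using the built-in mtcars data and an ordinary linear model rather than the ridge regression example below), a k-fold cross-validation loop might look like:

# 5-fold cross-validation sketch: randomly assign folds, then estimate the
# out-of-sample mean squared error of a simple linear model.
k = 5
n_obs = nrow(mtcars)
fold = sample(rep(1:k, length.out = n_obs))   ## random fold assignment

cv_mse = sapply(1:k, function(f) {
  fit  = lm(mpg ~ wt + hp, data = mtcars[fold != f, ])  ## train on the other folds
  pred = predict(fit, newdata = mtcars[fold == f, ])    ## predict the held-out fold
  mean( {mtcars$mpg[fold == f] - pred}^2 )
})
mean(cv_mse)

Shuffling the fold labels with sample() is what protects against the sorting issue described above.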

Ridge Regression Example

Here is an example using cross-validation in the context of ridge regression from the course materials page.