The bootstrap is a generic statistical method for assessing the accuracy of an estimator. It is frequently used to find confidence intervals when exact or asymptotic analytic formulas are unavailable or unsatisfactory.
The basic idea of bootstrapping is to build up the sampling distribution of an estimator by re-sampling the data many times. In the non-parametric bootstrap this is done by drawing \(B\) copies of the data from the empirical distribution, \(\mathbb{P}\):
\[ X_1, \dots, X_n \sim_{iid} P, \qquad \mathbb{P}(t) = \frac{1}{n} \sum_{i=1}^n 1[X_i \le t] \] In the parametric boostrap the data are re-sampled instead from an (assumed) parametric (i.e. Gaussian) estimate of \(P\). For a comparison of the two see this talk by Robert Tibshirani.
Continuing with the non-parametric bootstrap, the basic idea is to draw many artificial data sets from \(\mathbb{P}\),
\[ \{(X^*_1, \dots, X_n^*)_b\}_{b=1}^B \sim_{iid} \mathbb{P} \] which is equivalent to drawing a sample of size \(n\) from the data \(X_1, \dots, X_n\) with replacement.
There are various methods for constructing a confidence interval for an estimator \(\bar{\theta}(X)\) using a bootstrap sample. The most common is the percentile method in which the confidence bounds are taken as samples quantiles from the bootstrapped distribution of the estimator. We continue with an example below.
Suppose we have iid data \(X_1, \dots, X_n\) from an unknown distribution and want to construct a confidence interval for the inter-quartile range.
n=50
x = rgamma(n,shape=3,rate=1)
boxplot(x, las=1, main='Boxplot of Data')
iqr_est = unname(diff(quantile(x, c(.25, .75))))
iqr_est
## [1] 2.558567
To form a (non-parametric) bootstrap estimate, we draw \(B\) new data sets from the original data.
B = 1e3 ## Number of bootstrap samples
boot_samples = sample(x, n*B, replace=TRUE) ## Sample with replacement
dim(boot_samples) = c(n, B) ## each column is a dataset
For each bootstrap sample we compute the statistic of interest, in this case the IQR. This operation is not easy to vectorize so we use an implicit loop instead.
boot_iqr = apply(boot_samples, 2, function(b) unname(diff(quantile(b, c(.25, .75)))))
This gives us \(B\) samples of the IQR which we use to estimate its sampling distribution and construct a 95% confidence interval.
hist(boot_iqr, las=1, col='green4', xlab='Estimated IQR',
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
boot_q = quantile(boot_iqr, c(.025, .975))
abline(v=boot_q, lty='dashed', lwd=2)
boot_ci = sprintf('%3.1f (%3.1f, %3.1f)', iqr_est, boot_q[1], boot_q[2])
cat(boot_ci,'\n')
## 2.6 (1.5, 3.3)
Below is a function for comparison slightly modified from Professor Shedden’s 2016 notes on vectorization.
# Return a bootstrap 95% confidence interval for the
# q^th population quantile based on an iid sample.
quant_ci = function(y, q=0.5, n_boot=1000) {
m = length(y)
mat = sample(y, n_boot*m, replace=TRUE)
dim(mat) = c(n_boot, m)
f = function(x)quantile(x, q)
v = apply(mat, 1, f)
lcb = quantile(v, 0.025)
ucb = quantile(v, 0.975)
return(c(lcb, ucb))
}
For a gentle introduction to the underlying theory, see this blog post by Professor Larry Wasserman.
Another versatile statistical method heavily reliant on computation is permutation testing. Permutation tests are often used with non iid data or when comparing non standard statistics. The general idea is to construct a reference distribution for determining the significance of an observed test statistic by repeatedly permuting group labels.
Suppose we have \(n\) total samples from 2 groups. Then there are \(m = \begin{pmatrix} n \\ k \end{pmatrix}\) possible group assignments where \(k\) is the size of the first group. Note that there are actually \(n!\) permutations, but they are not all unique. If this number is reasonably small all possible group assignments can be considered. Let \(\pi_i\) denote the \(i^{th}\) permutation and \(\bar \theta(X)\) be the statistic of interest. Then, the permutation p-value for a two-sided test is:
\[ p = \frac{1}{m} \sum_{i=1}^m 1[~|\bar \theta(X)|~ \ge ~|\bar\theta(\pi_i(X))|~]. \] ### Basic Example: Chick Weights
To illustrate the basic idea behind permutation tests we consider the problem of comparing the weights of chicks fed on a variety of diets. The data for this example comes from the R datasets
package and can be loaded using data(chickwts)
.
data(chickwts)
str(chickwts)
## 'data.frame': 71 obs. of 2 variables:
## $ weight: num 179 160 136 227 217 168 108 124 143 140 ...
## $ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
with(chickwts, boxplot(weight~feed, las=1), 'Chick Weight Data')
In this setting a permutation test is not necesssary and an anova would likely be more appropriate.
anova(lm(weight~feed, data=chickwts))
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## feed 5 231129 46226 15.365 5.936e-10 ***
## Residuals 65 195556 3009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, if we were concerned about normality or dependence in similarly sturcutured data we could carry out a permutation test using the F statistics as above. First, here is a function to compute \(F\) from a data frame with two columns ‘response’ and ‘group’,
# df should have exactly two columns "response" and "group"
compute_F = function(df){
# compute the grand mean
gm = mean(df$response)
# compute group means
sum_stat = df %>% group_by(group) %>%
summarize(xbar = mean(response), v = var(response), n=n())
# compute the mean squares
ms_group = sum({sum_stat$xbar - gm}^2*sum_stat$n) / {nrow(sum_stat) - 1}
ms_error = sum(sum_stat$v*{sum_stat$n-1}) / {nrow(df) - nrow(sum_stat)}
ms_group / ms_error
}
Now, let’s look at a single permutation:
df = chickwts %>% rename(response = weight, group = feed)
df_perm = df %>% mutate(group = group[sample(1:nrow(df), replace=FALSE)])
head(df_perm)
## response group
## 1 179 casein
## 2 160 horsebean
## 3 136 meatmeal
## 4 227 meatmeal
## 5 217 soybean
## 6 168 horsebean
compute_F(df_perm)
## [1] 0.2793237
We can write a function to compute the F statistic for the permuted sample:
permute_F = function(df){
compute_F({df %>% mutate(group = group[sample(1:nrow(df), replace=FALSE)])})
}
permute_F(df)
## [1] 1.667214
Now we are ready to carry out the permutation test. First, compute the observed statistic:
F_obs = compute_F(df)
Now we can permute many times to build up the reference distribution,
nperm = 1e3
perm_F = sapply(1:nperm, function(i) permute_F(df) )
hist(perm_F, las = 1, main='F statistics for permuted Chick Weight Data',
col=rgb(0,0,1,.5), xlab='F')
Finally we can report a p-value by comparing the observed F to the values simulated from the null distribution by permutation:
p = mean(F_obs <= perm_F)
pval_str = ifelse(p > 0, sprintf('p = %5.3f', p), sprintf('p < %5.3f', 1/nperm))
cat(pval_str)
## p < 0.001
In this example the observed F statistic was greater than all sampled permuted values so we estimate it to be less than one over the number of permutations.
In reality, we would prefer a vectorized version but this example is intended to illustrate the principle of permutation testing only.
The Kolmogorov-Smirnov (KS) test is a general non-parametric test for the equality of two distributions. Let \(X_1, \dots, X_n \sim F_x\) and \(Y_1, \dots, Y_n \sim F_y\).
The KS statistic for testing the equality of these two distributions is the maximum difference in the corresponding empirical distributions:
\[ D(X,Y) = \sup_t |\mathbb{F}_x(t) - \mathbb{F}_y(t)|. \] When \(X\) and \(Y\) are iid samples there is asymptotic theory for the distribution of \(D(X,Y)\) that can be used for assessing significance. However, non-iid data are frequently encountered in science such as in the case of repeated measures.
Suppose for instance we would like to compare the distribution of some structural property from a specific vertebra between two treatment groups in an animal study. In this example, we choose four location from each animal and then image each location three times to obtain our sample. Measurements are correlated within animals and within images but assumed independent between animals.
Below we first generate data from the structure above assuming a 50% increase in variance between treatment and control groups. Then we construct a permutation test that respects this dependence and apply it our synthetic data.
## Settings
nAn = 12 # anmimals
nLo = 4 # locations
nIm = 3 # images
rLo = .2 # correlation between locations
rIm = .7 # correlation between images
## Form the correlation matrix within an animal
R = matrix(rLo, nLo*nIm, nLo*nIm)
B = matrix(rIm, nIm, nIm)
for(i in 1:nLo){
R[seq((i-1)*nIm+1,i*nIm),seq((i-1)*nIm+1,i*nIm)] = B
}
diag(R) = 1
## Here is a quick image to illustrate
R_all = as.matrix(Matrix::bdiag(lapply(1:nAn, function(x) R)))
filled.contour(R_all[144:1,])
#my_colors = colorRampPalette(c('white','red'),alpha=.5)(10)
#image(R_all[144:1,], col=my_colors, breaks=seq(0,1,.1), xaxt='n', yaxt='n')
## Introduce correlation by right multiplying by Cholesky
## decomposition of correlation matrix.
chR = chol(R)
set.seed(103)
control = matrix(rnorm(nAn*nLo*nIm, 0, 1), nrow=nAn) %*% chR
treat = matrix(rnorm(nAn*nLo*nIm, 0, 1.5), nrow=nAn) %*% chR
# Convert back to vector i.e. "long form" keeping
# measurements from the same animal in blocks
control = as.vector(t(control))
treat = as.vector(t(treat))
# Compute the CDFs
F.t = ecdf(treat)
F.c = ecdf(control)
# Plot the CDFs
plot(F.t, col='blue', main='Cumulative Distribution Functions', las=1)
lines(F.c, col='black')
legend('topleft', legend=c('Treatment', 'Control'),
lty='solid', lwd=2, col=c('blue','black'), bty='n')
# Compute the test statistic and show as red-dashed line
z = c(treat,control)
z = z[order(z)]
delta = abs(F.t(z) - F.c(z))
x = z[which.max(delta)]
lines(c(x, x), c(F.t(x), F.c(x)), col='red', lty='dashed', lwd=2)
## Function to compute the KS statistic
ks_stat = function(x, y){
# form unique points and order
z = c(x, y)
# compute ecdfs and differences
max(abs(ecdf(x)(z) - ecdf(y)(z)))
}
## x and y here are vectors
## block_size tells us how to chunk x and y
permute_ks = function(x, y, block_size=FALSE){
## construct one block per column
z = c(x,y)
dim(z) = c(block_size, {length(x) + length(y)} / block_size)
## sample columns for group assignment
ind = sample(1:ncol(z), length(x)/block_size, replace=FALSE)
x = z[,ind]; dim(x) = NULL
y = z[,-ind]; dim(y) = NULL
# compute stat for permuted columns
ks_stat(x, y)
}
## ks_perm_test test the distribution of x versus y
## by permuting group labels in blocks of the indicated size.
ks_perm_test = function(x, y, nperm=1e3, block_size=nLo*nIm){
ks_obs = ks_stat(x, y)
ks_perm = sapply(1:nperm, function(i){permute_ks(x, y, block_size=block_size)})
mean(ks_obs <= ks_perm)
}
ks_perm_test(treat, control)
## [1] 0.716
Finally, let us compute the power of this test at the 5% significance level. Warning: this will take several minutes to run.
mcrep = 1e3
results = vector(length=mcrep, mode='logical')
for(i in 1:mcrep){
control = matrix(rnorm(nAn*nLo*nIm, 0, 1), nrow=nAn) %*% chR
treat = matrix(rnorm(nAn*nLo*nIm, 0, 1.5), nrow=nAn) %*% chR
results[i] = {ks_perm_test(treat, control) <= .05}
}
lcb = mean(results) - qnorm(.975)*.5/sqrt(mcrep)
ucb = mean(results) + qnorm(.975)*.5/sqrt(mcrep)
result_str = sprintf('Estimated power: %3.1f%% (%3.1f%%, %3.1f%%).\n',
100*mean(results), 100*lcb, 100*ucb)
cat(result_str)
## Estimated power: 72.9% (69.8%, 76.0%).