Introduction to Permutation Analysis (Resampling)

A permutation test is a non-parametric statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. (When the 'labels' are quantitative values, the values themselves are shuffled.)
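To make that definition concrete, here is a minimal sketch of an exact permutation test on made-up numbers (all values below are hypothetical, not from the dataset used later): the test statistic is the difference in group means, and we enumerate every possible reassignment of the group labels.

```python
# Exact permutation test on toy data: difference in group means,
# enumerating every reassignment of the group labels.
from itertools import combinations

group_a = [3.1, 4.2, 5.0]   # hypothetical measurements, group A
group_b = [6.8, 7.1, 6.5]   # hypothetical measurements, group B
pooled = group_a + group_b
n_a = len(group_a)

observed = sum(group_b) / len(group_b) - sum(group_a) / n_a

count = 0   # relabelings with a statistic at least as large as observed
total = 0
for idx in combinations(range(len(pooled)), n_a):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if sum(b) / len(b) - sum(a) / len(a) >= observed:
        count += 1

p_exact = count / total   # here 1/20 = 0.05
```

With 6 values split 3/3 there are only C(6, 3) = 20 distinct relabelings, so exhaustive enumeration is feasible; for larger datasets, enumerating all rearrangements becomes impractical and random permutations are used instead, as in this tutorial.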

Dataset used here

In this tutorial, we shall work on a dataset with 2 variables: Running distance and Gene Expression. This dataset, obtained from a laboratory at the University of Michigan, contains measurements from rats. For each rat, the investigators measured the maximum distance the rat could run and the expression level of a gene of interest. Our test statistic is the Pearson's correlation between the two variables. We will test the significance of this statistic by using permutations to obtain the null distribution of the correlation statistic.
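The statistic itself is worth unpacking before permuting anything. A small sketch on made-up numbers (the values below are illustrative, not the Michigan data) shows Pearson's r computed directly from its definition, cross-checked against numpy:

```python
# Pearson's r from its definition: the covariance of x and y divided by
# the product of their standard deviations.
import numpy as np

x = np.array([1.07, 1.02, 1.03, 1.10, 1.19])          # toy expression values
y = np.array([1137.6, 1192.8, 710.9, 781.0, 1146.0])  # toy distances

xd = x - x.mean()
yd = y - y.mean()
r_manual = np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))

r_numpy = np.corrcoef(x, y)[0, 1]   # the same quantity via numpy
```

Because r is bounded between -1 and 1 and has no units, it is a convenient statistic to compare across permutations.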

Running permutation analysis in Python (in this tutorial we used Python 3)

Running mathematical operations in Python is convenient with the numpy, pandas, and scipy packages, which provide pre-written functions similar to R's. First, we load all the libraries we need for our analysis:

import random
import pandas as pd
import numpy as np
import scipy.stats

We then read in the dataset as a pandas data frame. This is what the first 10 rows of the data look like:

samples = pd.read_table("samplegeneset.txt")
samples.head(10)
     GENE_A   DIST_m
0  1.069605  1137.60
1  1.022521  1192.80
2  1.034164   710.88
3  1.096667   781.00
4  1.190602  1146.00
5  1.177998   870.53
6  1.013153   768.50
7  1.060913  1086.58
8  1.165786   791.50
9  0.916338   909.47

First, we have to obtain the statistic (Pearson’s correlation coefficient) seen in the actual data.

#Obtaining the actual correlation seen in the dataset
result = scipy.stats.pearsonr(samples.GENE_A, samples.DIST_m)[0]
result
0.0369

Then, we permute the values of DIST_m (1,000 permutations) to see what the null distribution of the correlation coefficient looks like. The choice of 1,000 permutations is arbitrary; the variable NUM_PERMS can be modified to change it. In general, the more permutations the better, though you are limited by the size of your dataset (for example, with only 3 samples there are only 3! = 6 distinct orderings).

In Python, we use a for loop to iterate over the permutations: in each iteration we shuffle one of the variables and recalculate the test statistic.

#Permuting distances to obtain null distribution
NUM_PERMS = 1000

random_corr = []
for perm in range(NUM_PERMS):
    # Shuffle distances without replacement, then recompute the correlation
    y = random.sample(list(samples.DIST_m), k=len(samples.DIST_m))
    random_corr.append(scipy.stats.pearsonr(samples.GENE_A, y)[0])

The P-value from a permutation test is the fraction of times the permuted statistic is at least as extreme (in our case, at least as large) as the actual value of the statistic.

p_val = len([i for i in random_corr if i >= result])/NUM_PERMS

print(p_val)
0.225

We see that the correlation is not very significant: permutation testing gives a P-value of 0.225, which means that this correlation can be seen by chance about 1/5th of the time. This is not significant at a standard P-value cutoff of 0.05.

Visualizing the P-value

Python has plotting libraries which make it easy to visualize data. We shall be using matplotlib’s pyplot functionality to visualize our results.

#Add some plots for the p-value
import matplotlib.pyplot as plt

n, bins, patches = plt.hist(random_corr, bins=50, density=True, facecolor='g', alpha=0.75)
plt.axvline(x=result, color='r', linestyle='dashed', linewidth=2)
plt.title('Randomized correlation coefficients, with actual value shown in red',fontsize=9)
plt.show()

Running permutation analysis in R

First, we load the data into R using read.table, since the file is in tab-delimited text format.

samples<-read.table("samplegeneset.txt",header=TRUE)

This is what the first ten rows look like in R:

samples[1:10,]
      GENE_A  DIST_m
1  1.0696046 1137.60
2  1.0225211 1192.80
3  1.0341644  710.88
4  1.0966669  781.00
5  1.1906022 1146.00
6  1.1779982  870.53
7  1.0131527  768.50
8  1.0609133 1086.58
9  1.1657860  791.50
10 0.9163378  909.47

Second, we obtain the actual statistic (Pearson's correlation coefficient) seen in the data using the 'cor' function in R.

attach(samples)
actual_correlation<-cor(GENE_A,DIST_m,method = "pearson")
actual_correlation
[1] 0.03696368

Then, we permute the values of DIST_m (1,000 permutations) to see what the null distribution of the correlation coefficient is. We permute without replacement. As before, the number of permutations can be changed via the variable 'NUM_PERMS'. In R, we run the permutations with a for loop and the sample function; here is how:

NUM_PERMS=1e3
permuted_correlation<-rep(NA,NUM_PERMS)
for(i in 1:NUM_PERMS){
  resampled_DIST_m<-sample(DIST_m,replace=FALSE)
  permuted_correlation[i]<-cor(GENE_A,resampled_DIST_m,method="pearson")
}

Lastly, we calculate the P-value from the permutation test, which is the fraction of times the permuted statistic is at least as extreme (in our case, at least as large) as the actual value of the statistic. In R, we use the sum function to count the number of times the permuted statistic is greater than or equal to the actual statistic, then divide by the total number of permutations:

p_val=sum(permuted_correlation>=actual_correlation)/NUM_PERMS
p_val
[1] 0.199

The correlation is not very significant: permutation testing gives a P-value of 0.199, which means that this correlation can be seen by chance about 1/5th of the time.

Running permutation analysis in STATA

As with R and Python, we need to first load the data into STATA. We do this with import delimited.

import delimited "./samplegeneset.txt", clear

To view the first ten rows of the data, we use list.

list in 1/10

     +--------------------+
     |   gene_a    dist_m |
     |--------------------|
  1. | 1.069605    1137.6 |
  2. | 1.022521    1192.8 |
  3. | 1.034164    710.88 |
  4. | 1.096667       781 |
  5. | 1.190602      1146 |
     |--------------------|
  6. | 1.177998    870.53 |
  7. | 1.013153     768.5 |
  8. | 1.060913   1086.58 |
  9. | 1.165786     791.5 |
 10. | .9163378    909.47 |
     +--------------------+

Unlike R and Python, there are no 'libraries' to load in STATA. We use the pwcorr command to obtain the Pearson correlation coefficient (which is stored in r(rho)), and then run permutation tests using the permute command. We tell permute to do 1,000 replications, report the right-tailed P-value, and save the permuted correlation coefficients to a file called "out.dta". From the results, we see that the reported P-value is 0.2150, meaning that a Pearson correlation at least as large as 0.037 can be observed about 21.5% of the time if the null hypothesis is true.

permute dist_m r=r(rho), nodots reps(1000) right saving(out, replace): pwcorr dist_m gene_a 

Monte Carlo permutation results                 Number of obs     =        434

      command:  pwcorr dist_m gene_a
            r:  r(rho)
  permute var:  dist_m

------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
-------------+----------------------------------------------------------------
           r |   .0369637     215    1000  0.2150  0.0130  .1899067   .2417832
------------------------------------------------------------------------------
Note: Confidence interval is with respect to p=c/n.
Note: c = #{T >= T(obs)}

In order to graph our permuted values, we load the dataset “out.dta”, and build a histogram. The red line indicates where the observed value of 0.037 lies in this distribution.

use "./out.dta", clear
hist r, addplot(pci 0 0.0369637 10 0.0369637) legend(label(2 "Observed Value"))
Histogram of Permuted r values

Things to consider

It may not always be necessary to run a permutation analysis if a parametric test fits your dataset. The first step should always be to check your data for normality and the other standard assumptions, and to use a non-parametric test only if these are not met.
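As one possible version of that first step, a Shapiro-Wilk test on each variable is a common normality check. The sketch below uses synthetic stand-ins for the two columns, since this is a suggestion about the workflow rather than code from the tutorial:

```python
# Normality check sketch: Shapiro-Wilk test on synthetic stand-ins
# for the GENE_A and DIST_m columns.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
gene = rng.normal(loc=1.1, scale=0.08, size=50)   # stand-in for GENE_A
dist = rng.normal(loc=950, scale=180, size=50)    # stand-in for DIST_m

w_gene, p_gene = stats.shapiro(gene)
w_dist, p_dist = stats.shapiro(dist)
# Small p-values here would suggest departures from normality,
# pointing toward a non-parametric approach such as a permutation test.
```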

The number of permutations you run limits the P-value you can obtain. For example, we ran 1,000 permutations, so the smallest P-value we can resolve is 1/1000 = 0.001; it would be incorrect to report a P-value of 0 even if no permuted statistic matches or exceeds the observed one.
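One common convention (an addition here, not part of the tutorial's code) is to add 1 to both the exceedance count and the number of permutations, so the estimated P-value can never be exactly zero:

```python
# "+1" corrected permutation P-value: with zero exceedances out of
# N permutations, report 1/(N + 1) rather than 0.
def permutation_p_value(exceedances, num_perms):
    return (exceedances + 1) / (num_perms + 1)

p_floor = permutation_p_value(0, 1000)      # smallest reportable value, ~0.001
p_example = permutation_p_value(225, 1000)  # e.g. 225 exceedances -> ~0.226
```

The correction treats the observed arrangement as one of the permutations, which is why the floor is 1/(N + 1).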

References

Wikipedia: Permutation testing