Vectorization is a programming technique used to avoid explicit loops in order to improve the performance and readability of code.
In its simplest form, vectorization means using built-in vectorized functions. That is, rather than looping over a vector to compute its sum, we instead use the vectorized function sum().
## Not vectorized
x = -5e2:5e2
s0 = 0
for ( i in 1:length(x) ) {
s0 = s0 + x[i]
}
## Vectorized
s1 = sum(x)
## Do we get the same sum?
s0 == s1
## [1] TRUE
The vectorized code is not only easier to type and read but is also faster.
# Timing comparison for sums: -------------------------------------------------
library(tidyverse); library(microbenchmark)
# Functions to compare: -------------------------------------------------------
f0 = function(n) {
# Explicit looping with indexing
x = 0:n
s0 = 0
for ( i in 1:length(x) ) {
s0 = s0 + x[i]
}
s0
}
f1 = function(n) {
# Explicit looping without indexing
s0 = 0
for ( i in 1:n ) {
s0 = s0 + i
}
s0
}
f2 = function(n){
# Call sum
x = 0:n
s0 = sum(x)
s0
}
# Units here should be nanoseconds
cap = paste0(
'**Demonstration 2.**',
'*Comparing the run times of loops to a vectorized sum.*'
)
microbenchmark( f0 = f0(1e4), f1 = f1(1e4), f2 = f2(1e4), times = 1e3 ) %>%
summary() %>%
knitr::kable( digits = 1, caption = cap, format = 'html' ) %>%
kableExtra::kable_styling(full_width = TRUE)
**Demonstration 2.** *Comparing the run times of loops to a vectorized sum.* Times are in nanoseconds.

expr | min | lq | mean | median | uq | max | neval
---|---|---|---|---|---|---|---
f0 | 358966 | 360987 | 390435.1 | 362749.5 | 385318.0 | 4098913 | 1000
f1 | 203268 | 205251 | 222887.0 | 206092.5 | 212877.0 | 2942064 | 1000
f2 | 673 | 825 | 5063.5 | 994.5 | 1653.5 | 2262563 | 1000
The reason vectorization is faster has to do with the fact that R is an interpreted language and not a compiled one. The difference in time seen here is primarily due to two factors:

1. R needs to repeatedly interpret what your code means for each iteration of the loop;
2. each iteration of the for loop in f0 requires indexing into x using the subset operator [.

The vectorized sum() function also loops over and repeatedly indexes x, but it does so in the compiled language C and has been optimized to take advantage of the fact that the elements of a vector are contiguous in memory.
Writing efficient R code requires an understanding of these factors and how vectorization helps to overcome them.
Vectorization can be particularly useful in Monte Carlo studies where we might otherwise be inclined to use explicit loops. We will look at some examples after an introduction to Monte Carlo estimates.
In statistics and data science we are often interested in computing expectations (means) of random outcomes of various types.
When analytic expectations are unavailable or cumbersome to compute, it can be useful to obtain Monte Carlo approximations by simulating a random process and then directly averaging the values of interest.
This works because the sample average is generally a good estimate of the corresponding expectation:
\[ \bar{\theta}_{n} := \sum_{i=1}^n X_i / n \to_p \theta := E[X]. \]
In fact, assuming our data are independent and identically distributed (iid) from a distribution with finite variance, we can characterize the rate of convergence of a sample average to its population counterpart using the central limit theorem (CLT),
\[ \sqrt{n}(\bar{\theta}_n - \theta) \to_{d} N(0,\sigma^2) \]
where \(\sigma^2 = E[X^2] - E[X]^2\) is the variance of the underlying distribution from which \(X\) is drawn. This can be useful for constructing approximate confidence intervals for the Monte Carlo error.
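As a minimal sketch of how the CLT can be used in practice (this chunk is an addition for illustration; the exponential(1) target, with E[X] = 1, is an assumption chosen only as an example):

n = 1e4                                    # number of Monte Carlo replications
x = rexp(n)                                # draws from an assumed exponential(1) target
theta_bar = mean(x)                        # Monte Carlo estimate of E[X] = 1
mc_se = sd(x) / sqrt(n)                    # estimated Monte Carlo standard error
theta_bar + qnorm(c(.025, .975)) * mc_se   # approximate 95% interval for E[X]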
There are vectorized functions in R for simulating from many common distributions. Here are a few:
- rnorm() - Normal
- runif() - Uniform
- rt() - the t-distribution
- rexp() - Exponential
- rpois() - Poisson

Another useful function in R is sample() for sampling from a finite set of values, i.e. the discrete uniform distribution or any finite probability mass function.
As an aside, you should be aware that each of the distribution families above has corresponding d*, p*, and q* functions for computing densities, percentiles (CDF), or quantiles (inverse CDF) for each distribution.
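For example, for the standard normal distribution:

dnorm(0)      # density at zero, approximately 0.399
pnorm(1.96)   # CDF, approximately 0.975
qnorm(0.975)  # quantile (inverse CDF), approximately 1.96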
When we call one of the r* functions to generate random draws from a distribution, R relies on a pseudo-random number generator to generate from \(U(0, 1)\) and produce the results. Thus the outcome of these calls depends on the current state of the generator. It is sometimes desirable to reproduce exactly the same pseudo-random sequence. You can do this by fixing the random seed using set.seed(), which takes an integer argument. The function RNGkind() can be used to display or set the random number generator.
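The code that produced the output below is not shown in these notes; a sketch along the following lines (the seed value 42 is an arbitrary choice) would produce output of the same form:

RNGkind()                    # display the current random number generators
all( rnorm(3) == rnorm(3) )  # without fixing the seed, repeated calls differ
set.seed(42)                 # fix the seed
a = rnorm(3)
set.seed(42)                 # reset to the same seed
b = rnorm(3)
all( a == b )                # the same sequence is reproduced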
## [1] "Mersenne-Twister" "Inversion" "Rejection"
## [1] FALSE
## [1] TRUE
As a quick example, let’s use these functions to compute percentiles for t-distributions with various degrees of freedom. Let \(\theta_q\) be the parameter of interest,
\[ \theta_q := F(q) = \int_{-\infty}^{q} f(x) dx = \int_{-\infty}^{\infty} 1[x \le q]f(x) dx \]
where \(F(\cdot)\) is the CDF and \(f(\cdot)\) the PDF of a given t-distribution.
## simulation parameters
n = 1e4
df = 3
quantiles = c(-1.96, 1.96)
## simulate data and draw a histogram
dat = rt(n, df)
hist(dat, prob = TRUE, las = 1, col = 'darkgreen',
breaks = function(x) seq(min(x) - 1, max(x) + 1, 1),
ylim = c(0, ceiling(50 * dt(0, df)) / 50)
)
## distribution we drew from
curve(dt(x, df = df), -10, 10, add = TRUE, col = 'red')
abline(v = quantiles, col = 'red')
## Function(s) of interest
#theta_bar = c( mean(dat <= quantiles[1]), mean(dat <= quantiles[2]) )
theta_bar = vapply(quantiles, function(x) mean( dat <= x ), FUN.VALUE = 1.0 )
In this case, our Monte Carlo estimate of \((\theta_{-1.96}, \theta_{1.96})\) is \(\bar{\theta}=\) (0.0704, 0.9269). The actual values are \((\theta_{-1.96}, \theta_{1.96})\) = (0.0724261, 0.9275739).
Suppose we are interested in computing the following integral where \(\phi\) is the standard normal density function:
\[ \int_{-\infty}^\infty [\sin(x)-\cos(x)]^2\phi(x) dx. \]
We can recast this as the expectation below,
\[ E[h(X)], \quad h(x) = [\sin(x)-\cos(x)]^2, \quad X \sim N(0,1). \]
The following R code provides a Monte Carlo estimate,
n = 1e4 # number of Monte Carlo samples
x = rnorm(n) # Monte Carlo sample
mean({sin(x) - cos(x)}^2) # estimate
## [1] 1.00058
Compare this to an estimate using numerical integration,
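for which the code chunk is not shown here. A call along the following lines, using R's integrate() function, would produce output of this form:

integrate( function(x) {sin(x) - cos(x)}^2 * dnorm(x), lower = -Inf, upper = Inf )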
## 1 with absolute error < 9.4e-05
These values are fairly close to the analytic solution based on the identity \([\sin(x)-\cos(x)]^2 = 1 - \sin(2x)\) and the symmetry about zero of both \(\sin(\cdot)\) and \(\phi(\cdot)\). Suppose \(X \sim N(0,1)\), then
\[ \begin{align} E[(\sin(X)-\cos(X))^2] &= E[1-\sin(2X)] \\ &= 1 - E[\sin(2X)] \\ &= 1 - 0 \\ &= 1. \end{align} \]
The code below will produce a plot that illustrates why \(E[\sin(aX)] = 0\) for any constant \(a\) when \(X \sim N(0,1)\): the integrand \(\sin(ax)\phi(x)\) is an odd function, so the contributions on either side of zero cancel.
cap = "**Figure 1.** *Illustrating why E[sin(X)] = 0 for X ~ N(0,1).*"
curve(sin, -2 * pi, 2 * pi, n = 1e3 + 1, lwd = 2, las = 1, xaxt = 'n')
curve(dnorm, -2 * pi, 2 * pi, n = 1e3 + 1, lwd = 2, col = 'red', add = TRUE)
curve(dnorm(x) * sin(x), -2 * pi, 2 * pi, n = 1e3 + 1,
lwd = 2, col = 'blue', add = TRUE)
abline(h = 0, v = c(-pi, 0, pi), lty = 'dashed', col = 'grey')
legend('topright', col = c('black', 'red', 'blue'),
lwd = 2, bty = 'n', cex = 1.2,
legend = c('sin(x)', expression(phi*'(x)'), expression(phi*'(x)sin(x)'))
)
axis(1, at = pi * seq(-2, 2, 2),
labels = c(expression("-2" * pi), 0, expression("2" * pi)) )
axis(1, at = pi * seq(-2, 2, .5), labels = FALSE)
The examples in this section are taken directly from Professor Shedden’s 2016 course notes which you can find here.
We will investigate the coverage probability of nominal 95% confidence intervals when the data does not come from a Normal (Gaussian) distribution.
In the first example, we will assume the data come from an exponential distribution with mean one. The strategy here is to generate many (mcrep) data sets of size n.
For each data vector, we then calculate a nominal 95% confidence interval for the mean and check whether this interval contains the true value of one.
mcrep = 10000 # Simulation replications
n = 30 # Sample size we are studying
xmat = rexp(n * mcrep) # Simulate standard exponential data
dim(xmat) = c(mcrep, n) # Each row is a dataset
mn = apply(xmat, 1, mean) # Sample mean of each row
std = apply(xmat, 1, sd) # Sample SD of each row
se = std / sqrt(n) # Standard errors of the mean
conf_level = .95 # The nominal confidence level
m = qt( 1 - {1 - conf_level} / 2, df = n - 1) # Multiplier for confidence level
# m ~= 2.04
lcb = mn - m * se # lower confidence bounds
ucb = mn + m * se # upper confidence bounds
target = 1 # value we are estimating
cvrg_prob = mean( {lcb < target} & {ucb > target} ) # coverage probability
print(cvrg_prob)
## [1] 0.9308
Since coverage is binary with a fixed probability \(p\), the number of intervals that cover one (“successes”) in our study is Binomial(mcrep, \(p\)). We can use this fact to estimate the Monte Carlo error, which represents the uncertainty in our estimate from the chosen number of replications.
mc_se = sqrt(cvrg_prob * {1 - cvrg_prob} / mcrep)
cvrg_prob_str =
sprintf("%4.2f (%5.3f, %5.3f)",
cvrg_prob, cvrg_prob - m * mc_se, cvrg_prob + m * mc_se)
In this case the estimated coverage is 0.93 (0.926, 0.936).
We have previously discussed the need to be careful about R recycling values when performing arithmetic on arrays with differing dimensions. The formal name for this is broadcasting and it can be quite useful for vectorization.
Generally, when you perform a simple arithmetic operation "*" on arrays with the same dimensions, the operation is applied pointwise so that (A * B)[i, j] = A[i, j] * B[i, j]. However, if the objects A and B have different dimensions, but flattened lengths with one a multiple of the other, then the values can be broadcast like this:
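(The original code chunk is not shown; the sketch below, assuming a 3 by 2 matrix X and a length-3 vector v, reproduces output of this form.)

X = matrix(1:6, nrow = 3, ncol = 2)
X
v = 1:3
X - v   # v is recycled down the columns of X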
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2]
## [1,] 0 3
## [2,] 0 3
## [3,] 0 3
Compare this to:
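(Again the chunk is not shown; output of this form results from subtracting the same length-3 vector from a 2 by 3 matrix.)

matrix(1:6, nrow = 2, ncol = 3) - 1:3   # recycling follows the column-major order of the matrix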
## [,1] [,2] [,3]
## [1,] 0 0 3
## [2,] 0 3 3
In this case, if X has dimensions m by n and v has length k, then (X - v)[i, j] = X[i, j] - v[(((j - 1) * m + i - 1) mod k) + 1], because v is recycled along the column-major (flattened) order of X.
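As a quick check of this indexing rule, continuing the illustrative X and v from the sketch above:

i = 1; j = 2; m = 3; k = 3                      # element (1, 2) of the 3 by 2 example
X[i, j] - v[((j - 1) * m + i - 1) %% k + 1]     # equals (X - v)[1, 2], i.e. 3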
This is useful for centering each row of a matrix or computing row-wise variances.
We can compare this to direct computation using an implicit loop via the apply function:
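(The original comparison chunk is not shown; this minimal sketch, using a small matrix generated only for illustration, performs the same comparison.)

xmat = matrix(rnorm(10 * 5), nrow = 10)   # 10 rows, 5 columns of illustrative data
# center each row by broadcasting the vector of row means down the columns
xmat_c = xmat - rowMeans(xmat)
# row-wise variances via broadcasting, rescaled to the usual n - 1 denominator
rowvar_bcast = rowMeans(xmat_c^2) * ncol(xmat) / (ncol(xmat) - 1)
# the same computation using an implicit loop
rowvar_apply = apply(xmat, 1, var)
all.equal(rowvar_bcast, rowvar_apply)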
## [1] TRUE
Let’s compare the timing of these two approaches:
m = 100000
n = 30
xmat = rnorm(m * n)
dim(xmat) = c(m, n)
tm1 = proc.time()
rowvar1 = apply(xmat, 1, var)
tm1 = proc.time() - tm1
## Time a vectorized approach
tm2 = proc.time()
rowvar2 =
rowMeans( {xmat - rowMeans(xmat)}^2 ) * dim(xmat)[2] / {dim(xmat)[2] - 1}
tm2 = proc.time() - tm2
## Report on the difference
cat("Apply: ", tm1[3], "s, Vectorized: ", tm2[3], "s, Ratio: ",
round( tm1[3] / tm2[3], 1), '.\n', sep = '' )
## Apply: 1.488s, Vectorized: 0.034s, Ratio: 43.8.
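The check that produced the result below is not shown; presumably it compared the two sets of row variances:

all.equal(rowvar1, rowvar2)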
## [1] TRUE
In the next example, we consider the problem of computing the correlation coefficient between many vectors \(x\) and a single outcome \(y\). This can be useful when screening a large set of potential predictors for a relationship with an outcome of interest \(y\).
First, we generate data that has n observations, m predictors, and expected correlation r between each predictor and \(y\).
n = 30
m = 10000
r = 0.4
yvec = rnorm(n)
xmat = outer(array(1, m), yvec) # m x n
xmat = r * xmat + sqrt(1 - r^2) * rnorm(n * m)
Now, we can compute the correlations between each row of xmat and y and compare approaches.
# First approach, calculate as a loop
tm1 = proc.time()
r1 = NULL
for ( i in 1:m ) {
r1[i] = cor(xmat[i, ], yvec)
}
tm1 = proc.time() - tm1
# Second approach, functional style with apply
tm2 = proc.time()
r2 = apply(xmat, 1, function(v) cor(v, yvec) )
tm2 = proc.time() - tm2
all.equal(r1, r2)
## [1] TRUE
# Third approach, use linear algebra
tm3 = proc.time()
rmn = rowMeans(xmat)
xmat_c = xmat - outer(rmn, array(1, n))
rsd = apply(xmat, 1, sd)
xmat_s = xmat_c / outer(rsd, array(1, n))
yvec_s = {yvec - mean(yvec)} / sd(yvec)
r3 = xmat_s %*% yvec_s / {n - 1}
r3 = as.vector(r3)
tm3 = proc.time() - tm3
all.equal(r1, r3)
## [1] TRUE
# Fourth approach, use linear algebra with broadcasting
tm4 = proc.time()
rmn = rowMeans(xmat)
xmat_c = xmat - rmn
rvar = rowSums(xmat_c^2) / {dim(xmat)[2] - 1}
rsd = sqrt(rvar)
xmat_s = xmat_c / rsd
yvec_s = {yvec - mean(yvec)} / sd(yvec)
r4 = xmat_s %*% yvec_s / {n - 1}
r4 = as.vector(r4)
tm4 = proc.time() - tm4
all.equal(r1, r4)
## [1] TRUE
# Format and print the results
cat(sprintf("1: %5.3f s \n2: %5.3f s \n3: %5.3f s \n4: %5.3f s \n",
tm1[3], tm2[3], tm3[3], tm4[3]
)
)
## 1: 0.476 s
## 2: 0.367 s
## 3: 0.226 s
## 4: 0.020 s
The fourth approach using linear algebra and broadcasting is by far the most efficient here. All approaches are much more efficient than computing \({10,001 \choose 2}\) correlations when we only need 10,000.
While we should keep in mind that this was a single trial and not a formal comparison with replicates, a difference of this size is still meaningful. We should also be aware that one of the reasons the other approaches are slower is the time needed to allocate memory for (larger) intermediate objects.
The time needed to allocate memory can be influenced by something called garbage collection 🗑 ♻️. In languages like R that bind names to values, garbage collection refers to freeing up memory that is no longer needed because there are no longer any names pointing to it.
Garbage collection is triggered automatically when R needs more memory. As this could happen in the middle of something you are timing, you may get more consistent results if you explicitly force garbage collection prior to starting the timer. However, you rarely need to consider this elsewhere and not everyone agrees it is a best practice for timing comparisons.
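The output below comes from a call to gc(), which reports current memory use and triggers a collection before we start timing:

gc()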
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 1076738 57.6 2091621 111.8 NA 2091621 111.8
## Vcells 3168095 24.2 13240743 101.1 16384 12967962 99.0
# First approach, calculate as a loop
tm1 = proc.time()
r1 = NULL
for ( i in 1:m ) {
r1[i] = cor(xmat[i, ], yvec)
}
tm1 = proc.time() - tm1
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 1076712 57.6 2091621 111.8 NA 2091621 111.8
## Vcells 3168030 24.2 13240743 101.1 16384 12967962 99.0
# Second approach, functional style with apply
tm2 = proc.time()
r2 = apply(xmat, 1, function(v) cor(v, yvec))
tm2 = proc.time() - tm2
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 1076774 57.6 2091621 111.8 NA 2091621 111.8
## Vcells 3168152 24.2 13240743 101.1 16384 12967962 99.0
# Third approach, use linear algebra
tm3 = proc.time()
rmn = rowMeans(xmat)
xmat_c = xmat - outer(rmn, array(1, n))
rsd = apply(xmat, 1, sd)
xmat_s = xmat_c / outer(rsd, array(1, n))
yvec_s = {yvec - mean(yvec)} / sd(yvec)
r3 = xmat_s %*% yvec_s / {n - 1}
r3 = as.vector(r3)
tm3 = proc.time() - tm3
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 1076860 57.6 2091621 111.8 NA 2091621 111.8
## Vcells 3168246 24.2 13240743 101.1 16384 12967962 99.0
# Fourth approach, use linear algebra with broadcasting
tm4 = proc.time()
rmn = rowMeans(xmat)
xmat_c = xmat - rmn
rvar = rowSums(xmat_c^2) / {dim(xmat)[2] - 1}
rsd = sqrt(rvar)
xmat_s = xmat_c / rsd
yvec_s = {yvec - mean(yvec)} / sd(yvec)
r4 = xmat_s %*% yvec_s / {n - 1}
r4 = as.vector(r4)
tm4 = proc.time() - tm4
# Format and print the results
cat(sprintf("1: %5.3f s \n2: %5.3f s \n3: %5.3f s \n4: %5.3f s \n",
tm1[3],tm2[3],tm3[3],tm4[3]
)
)
## 1: 0.298 s
## 2: 0.326 s
## 3: 0.164 s
## 4: 0.020 s
Consider how these approaches would change if x had dimension n by m rather than m by n.

Recall our original example of computing coverage probabilities of nominal confidence intervals for non-Gaussian data.
If we wanted to carry out similar studies for many distributions, we may wish to write a function whose body carries out the simulation study using configurable parameters.
Here is the original example with a few changes.
mcrep = 10000 # Simulation replications
n = 30 # Sample size we are studying
target = 1 # value we are estimating
conf_level = .95 # Confidence level
xmat = rexp(n * mcrep) # Simulate standard exponential data
dim(xmat) = c(mcrep, n) # Each row is a dataset
mn = apply(xmat, 1, mean) # Sample mean of each row
std = apply(xmat, 1, sd) # Sample SD of each row
se = std / sqrt(n) # Standard errors of the mean
m = qt( 1 - {1 - conf_level} / 2, df = n - 1) # Multiplier for confidence level
lcb = mn - m * se # lower confidence bounds
ucb = mn + m * se # upper confidence bounds
cvrg_prob = mean( {lcb < target} & {ucb > target}) # coverage probability
Below we incorporate the simulation into a function with parameters for simulation settings. Here we use the special argument ... to pass additional parameters to rgen by name.
## Function to estimate nominal coverage probabilities
estimate_nominal_coverage =
function(rgen, target, mcrep = 1e4, n = 30, conf_level = .95, ...) {
# estimate nominal CI coverage for data generated by `rgen`
# Inputs:
# rgen - a function generating a vector of simulated data, e.g. rexp(),
# with length equal to its first argument.
# target - the actual expectation of rgen()
# mcrep, n - the number of Monte Carlo replications and sample size, respectively.
# conf_level - the nominal coverage probability
# ... - additional parameters to pass to rgen
# Outputs: A length 1 numeric vector with the estimated coverage probability.
xmat = rgen(n * mcrep, ...) # Simulate data
dim(xmat) = c(mcrep, n) # Each row is a dataset
mn = apply(xmat, 1, mean) # Sample mean of each row
std = apply(xmat, 1, sd) # Sample SD of each row
se = std / sqrt(n) # Standard errors of the mean
m = qt( 1 - {1 - conf_level} / 2, df = n - 1) # Multiplier for confidence level
lcb = mn - m * se # lower confidence bounds
ucb = mn + m * se # upper confidence bounds
mean((lcb < target) & (ucb > target)) # coverage probability
}
Now we can use estimate_nominal_coverage for multiple simulations.
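The specific calls behind the three results below are not shown in the text; calls of the following sort illustrate how the function and its ... argument can be used (the particular distributions and parameters here are assumptions for illustration only):

# standard exponential data, so the target mean is 1
estimate_nominal_coverage(rexp, target = 1)
# uniform data on (0, 10); the extra parameter max is passed to runif via ...
estimate_nominal_coverage(runif, target = 5, max = 10)
# Poisson(4) data; lambda is passed to rpois via ...
estimate_nominal_coverage(rpois, target = 4, lambda = 4)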
## [1] 0.9309
## [1] 0.9462
## [1] 0.9584
This could be useful, say, for exploring how the mean or another parameter impacts the coverage probability for a particular distribution.
Here we use our function from above to explore how the nominal coverage for the exponential distribution changes over a range of mean parameters.
# rate parameters to explore
lambdas = exp(-seq(1, 10, 1))
# store the results
coverage_probs = vector( length = length(lambdas), mode = 'numeric')
for ( i in 1:length(lambdas) ) {
rate = lambdas[i]
coverage_probs[i] =
estimate_nominal_coverage(rexp, target = 1 / rate, rate = rate)
}
plot(-seq(1, 10, 1), coverage_probs,
ylab = 'estimated coverage probability',
xlab = expression(log(rate)),
las = 1, pch = 15, cex = 1.2, ylim = c(.9, .95),
main = 'Nominal Coverage Probabilities for the Exponential Distribution'
)
# Add Monte Carlo confidence bounds
for ( i in 1:length(lambdas) ) {
x = rep(-seq(1, 10, 1)[i], 2)
y = coverage_probs[i] + qnorm(.975) * c(-1, 1) * .25 / sqrt(1e4)
lines(x, y)
}
# add a reference line for the global mean
abline(h = mean(coverage_probs), lty = 'dashed', col = 'grey')
Here we examine how the coverage probability depends on the degrees of freedom if the data come from a \(t_{df}\) distribution.
df = c(1, seq(2, 30, 2)) # df parameters to explore
coverage_probs = vector(mode = "numeric", length(df)) # store the results
for ( i in 1:length(df) ) {
coverage_probs[i] = estimate_nominal_coverage(rt, target = 0, df = df[i])
}
plot(df, coverage_probs,
ylab = 'estimated coverage probability',
xlab = 'degrees of freedom for t',
las = 1, pch = 15, cex = 1.2,
ylim = c( 0.93, 1),
main = 'Nominal Coverage Probabilities for the t Distribution')
# Add Monte Carlo confidence bounds
for ( i in seq_along(df) ) {
x = rep(df[i], 2 )
y = coverage_probs[i] + qnorm(.975) * c(-1, 1) * .25 / sqrt(1e4)
lines(x, y)
}
# add a reference line for the nominal coverage value
abline(h = 0.95, lty = 'dashed', col = 'grey')
1. Use vectorization to improve the efficiency of estimate_nominal_coverage(). Compare your improvements to the original.
2. Revise the “Exponential” and “t-distribution” examples to use dplyr and ggplot2.
3. Use the function estimate_nominal_coverage() to explore how nominal coverage changes with the sample size n for various distributions.