This document is split into two main parts:
Multicore computing on a single machine using mclapply from the parallel package.
Parallel computing on one or more machines using the %dopar% operator from the foreach package, with a parallel backend registered using the doParallel package.
In addition we will discuss:
Options for splitting computations across multiple processes and how to choose among these by considering both the structure of the tasks to be performed and the parallel architecture available.
How to obtain valid and reproducible random numbers in parallel computing.
Most modern computers, including laptops and desktops, come with multiple processors or cores - the distinction is not important for our purposes but see here for some discussion. The basic idea of multicore computing is to allow a single program, in this case R, to run multiple threads simultaneously in order to reduce the ‘walltime’ required for completion.
In R, this can be done using the parallel package, distributed with the base distribution since version 2.14.0. The multicore functions in this package (e.g. mclapply) achieve parallelism by forking, which is only available on Unix-like operating systems, i.e. Linux and macOS, but not Windows. In the image below, we see a parent process with three forked children.
When a process is forked, child processes have read-only access to objects in the global environment.
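As a minimal sketch of this idea (the object and core counts below are purely illustrative), children forked by mclapply can read objects from the parent's global environment without copying them:
library(parallel)
# how many cores does this machine have?
detectCores()
# an illustrative object in the parent's global environment
big_vector = rnorm(1e6)
# each forked child reads `big_vector` directly; memory is shared
# until a child tries to modify it (copy-on-write)
means = mclapply(1:4, function(i) mean(big_vector[1:(i * 1e5)]), mc.cores = 2)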
Long-running computations are also commonly run on specialized, multiuser servers. For instance, here at UM researchers have access to the Statistics & Computation Service (SCS) run by ITS. You can access these servers using ssh
from a terminal application, e.g.:
ssh scs.dsc.umich.edu
ssh luigi.dsc.umich.edu
In R, “apply” statements iterate over a data structure, applying a function to each subset, e.g. each row, column, or list item.
The lapply function is used to iterate over a list, and mclapply splits these iterations into multiple processes. Parallelizing sequential code with mclapply generally involves:
loading the parallel package with library(parallel),
replacing a call to lapply with a call to mclapply, with appropriate options.
Here are a couple of constructions that follow these steps.
library(parallel)
myFunc = function(case) {
  # ... do something with case ...
  case_result = case
  return(case_result)
}

# when the inputs to myFunc are determined by the case number
results = mclapply(1:nCases, myFunc)

# when myList holds data/parameters specific to each case
results = mclapply(myList, myFunc)
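A commonly used option is mc.cores, which controls how many child processes are forked; for example (the value 4 below is illustrative):
results = mclapply(myList, myFunc, mc.cores = 4)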
And here is a minimal example.
# Fit 4 linear models in parallel: --------------------------------------------
formulas = list(
  mpg ~ wt + cyl + carb,
  mpg ~ wt + cyl + carb + wt:carb,
  mpg ~ wt*cyl + carb,
  mpg ~ wt*cyl*carb
)
# Serial version: -------------------------------------------------------------
system.time({
  fits_serial = lapply(formulas, lm, data = mtcars)
})
## user system elapsed
## 0.006 0.000 0.006
# Parallel version: -----------------------------------------------------------
system.time({
  fits_parallel = parallel::mclapply(formulas, lm, data = mtcars)
})
## user system elapsed
## 0.007 0.007 0.023
As a running example, we will consider permutation tests for a simplified version of gene set enrichment analysis. Such analyses are used to interpret large-scale sequencing experiments in terms of sets of functionally related and possibly differentially expressed genes. For data, we will use a cohort of Triple Negative Breast Cancers consisting of two racial groups, African Americans (AA) and European Americans (EA). The data originally come from here and are saved as YaleTNBC.RData.
The gene sets in c6.unsigned.RData come from MSigDB.
At a high level, the analysis scores each gene set using the per-gene (log) fold-changes between groups and assesses significance by comparison against a permutation distribution obtained by permuting the group labels.
In Example 0 we introduce mclapply for computing the (log) fold-change between groups across these 18,000+ genes.
In Example 1 we examine sequential and parallel implementations for this example.
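As a rough sketch of the kind of computation in Example 0 (the simulated matrix and group labels below stand in for the actual data in YaleTNBC.RData):
library(parallel)
# simulated stand-ins for the expression matrix (genes x samples) and group labels
exprs = matrix(rexp(1000 * 20), nrow = 1000)
group = rep(c("AA", "EA"), each = 10)

# (log) fold-change between groups for a single gene
log_fold_change = function(gene_row, group) {
  log2( mean(gene_row[group == "AA"]) / mean(gene_row[group == "EA"]) )
}

# compute the fold-change for each gene in parallel
lfc = unlist(
  mclapply(seq_len(nrow(exprs)), function(i) log_fold_change(exprs[i, ], group))
)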
When thinking of parallelizing some portion of a program, it is important to remember that parallelism is not magic. There is some computational overhead involved in splitting the task, initializing child processes, communicating data, and collating results. For this reason, there is usually little to be gained from parallelizing computations that are already fast (less than a second), as the overhead may outweigh any reduction in run-time.
At the same time, the maximum impact of parallelism is generally the sequential walltime divided by the number of cores available and, in most cases, the actual impact will be even less. As such, very long running programs may require reconsidering the algorithm, implementation, or purpose in order to be computationally feasible even with parallelism.
An exception to this idea of linear improvement in run-time with the number of available cores can occur when the overall run-time is dominated by rare but long running sub-processes. In these cases, well designed parallelism can lead to substantial improvement by repeatedly re-using cores not occupied by the long-running sub-process.
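With mclapply, one way to exploit this is dynamic scheduling via mc.preschedule = FALSE, which hands out iterations one at a time as cores become free. A minimal sketch, with purely illustrative task times:
library(parallel)
# one rare, long-running task among many short ones
task_times = c(8, rep(0.5, 15))

# with mc.preschedule = FALSE, each iteration is dispatched as a core frees up,
# so the short tasks share the cores not tied up by the long-running one
res = mclapply(task_times, function(t) { Sys.sleep(t); t },
               mc.cores = 4, mc.preschedule = FALSE)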
The doParallel package is one of several ‘parallel backends’ for the foreach package. ‘Parallel backend’ is a catch-all term for establishing communication between multiple cores, perhaps on different physical “nodes” connected by a network. You don’t need to know a great deal about the details of the parallel backend to get started with parallel computing.
Below is a minimal example of how to set up a parallel backend using doParallel.
## (install and) load library ##
# install.packages("doParallel")
library(doParallel)

# how many cores to use in the cluster? #
ncores = 2

# set up a cluster called 'cl'
cl = makeCluster(ncores)

# register the cluster
registerDoParallel(cl)

## do some parallel computations with foreach
results = foreach(case = 1:10) %dopar% {
  ## what to do with each case? #
  result_case = case
  result_case
}

## Always shut the cluster down when done
stopCluster(cl)
In the example above, we also see the basic structure for performing computations in parallel with foreach. Compare the following two code snippets.
Snippet 1:
results = c()
for (case in 1:10) {
  result_case = case
  results[case] = result_case
}
Snippet 2:
results = foreach(case = 1:10) %dopar% {
  result_case = case
  result_case
}
The foreach function evaluates an expression for each value of the counter (iterator) ‘case’. The only difference is that in snippet 1 the expression modifies results[case] in the global environment, while in snippet 2 the expression is evaluated in a separate environment and its value is returned.
In the examples above, the %dopar% operator is used to execute the code in parallel. Using %do% instead would lead to sequential computation by the primary process.
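Here is a small illustration of the two operators, assuming a doParallel backend has been registered as above:
library(doParallel)
cl = makeCluster(2)
registerDoParallel(cl)

# %do%: the expression is evaluated sequentially by the primary process
seq_res = foreach(i = 1:4, .combine = c) %do% i^2

# %dopar%: the same expression is evaluated in parallel by the workers
par_res = foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(cl)
identical(seq_res, par_res)  # TRUE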
In Example 2 we revisit the earlier permutation tests using foreach and %dopar% for parallelization.
In Example 3 we see how to create reproducible results when using random number generation within foreach expressions using the %dorng% operator from package doRNG.
Finally, in Example 4 we explore foreach and the iterators package in greater detail.
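For a taste of what the iterators package provides, here is a brief sketch (the matrix below is illustrative): the iter function can split a matrix by row so each task receives a single row rather than the whole object.
library(doParallel)
library(iterators)

cl = makeCluster(2)
registerDoParallel(cl)

x = matrix(rnorm(20), nrow = 5)

# iterate over the rows of `x`, sending one row to each task
row_means = foreach(r = iter(x, by = "row"), .combine = c) %dopar% mean(r)

stopCluster(cl)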
Many statistical and machine learning applications rely on pseudo-random numbers for things like sampling from distributions and stochastic optimization. When child processes are spawned to compute in parallel, care needs to be taken to ensure random number streams behave as expected. This is particularly true for bootstrap inference, Monte Carlo simulations, and other applications where it is important that iterations are independent. Care is also needed to ensure results can be reliably reproduced using set.seed().
As seen in Example 1, this can be achieved in the multicore approach typified by calls to mclapply by instructing R to use the “L’Ecuyer-CMRG” generator for random numbers, i.e. by calling RNGkind("L'Ecuyer-CMRG").
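A minimal sketch of this approach (the seed and core counts are illustrative); given the same seed and the same mc.cores, the results are reproducible:
library(parallel)

# use the L'Ecuyer-CMRG generator so each child gets its own stream
RNGkind("L'Ecuyer-CMRG")
set.seed(77)
sim1 = mclapply(1:4, function(i) rnorm(2), mc.cores = 2)

set.seed(77)
sim2 = mclapply(1:4, function(i) rnorm(2), mc.cores = 2)

identical(sim1, sim2)  # TRUE, given the same seed and mc.cores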
In the foreach approach, this can be handled by using the doRNG library and replacing %dopar% calls with %dorng%.
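A minimal sketch with doRNG (the seed is illustrative); rerunning with the same seed reproduces the draws:
library(doParallel)
library(doRNG)

cl = makeCluster(2)
registerDoParallel(cl)

# %dorng% gives each iteration its own reproducible RNG stream
set.seed(42)
draws1 = foreach(i = 1:4) %dorng% rnorm(1)

set.seed(42)
draws2 = foreach(i = 1:4) %dorng% rnorm(1)

stopCluster(cl)
identical(unlist(draws1), unlist(draws2))  # TRUE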
For more details see the doRNG vignette.
The parallel, doParallel, and foreach packages greatly facilitate parallel computing in R. When used appropriately, performing computations in parallel can reduce the wait time needed to obtain results and make the overall research process more efficient. When deciding whether and how to parallelize, it is important to keep in mind the amount and architecture of available computing resources.