This document is split into two main parts:
1. Multicore computing on a single machine, using mclapply from the parallel package.
2. Parallel computing on one or more machines, using the %dopar% operator from the foreach package together with a parallel backend registered via doParallel.
In addition, we will discuss:
- options for splitting computations across multiple processes, and how to choose among them by considering both the structure of the tasks to be performed and the parallel architecture available;
- how to obtain valid and reproducible random numbers in parallel computations.
Most modern computers, including laptops and desktops, come with multiple processors or cores; the distinction between the two is not important for our purposes. The basic idea of multicore computing is to allow a single program, in this case R, to run multiple processes simultaneously in order to reduce the ‘walltime’ required for completion.
In R, this can be done using the parallel package, distributed with the base distribution since version 2.14.0. By default, the parallel package achieves parallelism by forking, which is only available on Unix-like operating systems, i.e. Linux and macOS, but not Windows. Picture a parent process with three forked children: when a process is forked, each child can read the objects in the parent's global environment, but any changes a child makes are not visible to the parent or to its siblings.
Long-running computations are also commonly run on specialized, multiuser servers. For instance, here at UM researchers have access to the Statistics & Computation Service (SCS) run by ITS. You can access these servers using ssh from a terminal application, e.g.:
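Here is an illustrative command; the host name below is a placeholder, so use the address given in the SCS documentation:

ssh uniqname@scs-host.umich.edu  # placeholder host name; substitute the actual SCS address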
In R, “apply” statements iterate over a data structure, applying a function to each subset, e.g. each row, column, or list item.
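For example, since a data frame is a list of columns, we can compute a per-column summary:

# mean of each selected column in a data frame, returned as a list
lapply(mtcars[, c("mpg", "wt")], mean)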
The lapply function is used to iterate over a list, and mclapply splits these iterations into multiple processes. Parallelizing sequential code with mclapply generally involves:
1. loading the parallel package with library(parallel),
2. writing a function to handle each case,
3. calling mclapply with appropriate options.
Here are a couple of constructions that follow these steps.
library(parallel)

myFunc = function(case) {
  # ... do something with case ...
  case_result = case
  return(case_result)
}

# when the inputs to myFunc are determined by a case index
results = mclapply(1:nCases, myFunc)

# when myList holds the data/parameters specific to each case
results = mclapply(myList, myFunc)
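By default, mclapply uses two child processes (or the value of the mc.cores option, if set). The mc.cores argument requests a specific number, and parallel::detectCores() reports how many cores the machine has:

library(parallel)
detectCores()                                     # total cores available on this machine
results = mclapply(myList, myFunc, mc.cores = 4)  # use 4 child processes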
And here is a minimal example.
# Fit 4 linear models in parallel: --------------------------------------------
formulas = list(
  mpg ~ wt + cyl + carb,
  mpg ~ wt + cyl + carb + wt:carb,
  mpg ~ wt*cyl + carb,
  mpg ~ wt*cyl*carb
)
# Serial version: -------------------------------------------------------------
system.time({
  fits_serial = lapply(formulas, lm, data = mtcars)
})
## user system elapsed
## 0.006 0.001 0.008
# Parallel version: -----------------------------------------------------------
system.time({
  fits_parallel = parallel::mclapply(formulas, lm, data = mtcars)
})
## user system elapsed
## 0.006 0.004 0.018

Notice that the parallel version is actually slower here: these models fit so quickly that the overhead of forking outweighs any gain, a point we return to below.
In this example, we will parallelize the cross-validation portion of our earlier ridge regression example.
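The original example's data and code are assumed to be defined elsewhere; below is a minimal self-contained sketch of the idea using simulated stand-ins for the original design matrix and response, with each value of the ridge penalty cross validated in its own child process:

library(parallel)

# simulated stand-ins for the original example's design matrix and response
set.seed(42)
n = 100; p = 10
X = matrix(rnorm(n * p), n, p)
y = X %*% rnorm(p) + rnorm(n)

lambdas = 10^seq(-2, 2, length.out = 25)   # penalty grid
folds = sample(rep(1:10, length.out = n))  # 10-fold assignment

# 10-fold cross-validated MSE for a single penalty value
cv_mse = function(lambda) {
  fold_mse = vapply(1:10, function(k) {
    train = folds != k
    beta = solve(crossprod(X[train, ]) + lambda * diag(p),
                 crossprod(X[train, ], y[train]))
    mean((y[!train] - X[!train, ] %*% beta)^2)
  }, numeric(1))
  mean(fold_mse)
}

# one child process per penalty value; mc.cores is machine dependent
cv_errors = unlist(mclapply(lambdas, cv_mse, mc.cores = 4))
lambdas[which.min(cv_errors)]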
When thinking of parallelizing some portion of a program, it is important to remember that parallelism is not magic. There is computational overhead involved in splitting the task, initializing child processes, communicating data, and collating results. For this reason, there is usually little to be gained by parallelizing already fast computations (those taking less than a second), as the overhead may outweigh the savings.
At the same time, the maximum speedup from parallelism is generally bounded by the sequential walltime divided by the number of cores available, and in most cases the actual speedup will be smaller. As such, very long-running programs may require reconsidering the algorithm, implementation, or purpose in order to be computationally feasible, even with parallelism.
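Here is a quick, hypothetical illustration of both points; timings assume a Unix-alike with at least four cores and will vary:

library(parallel)

slow_task = function(i) {
  Sys.sleep(0.5)  # stand-in for half a second of real work
  i
}

# four tasks at ~0.5s each: ~2s serially, close to ~0.5s with four cores
system.time(lapply(1:4, slow_task))
system.time(mclapply(1:4, slow_task, mc.cores = 4))

# for trivially fast tasks, the forking overhead dominates any gain
system.time(mclapply(1:4, identity, mc.cores = 4))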
The doParallel package is one of several ‘parallel backends’ for the foreach package. ‘Parallel backend’ is a catch-all term for the mechanism establishing communication between multiple cores, perhaps on different physical machines (“nodes”) connected over a network. You don't need to know a great deal about the details of the parallel backend to get started with parallel computing.
Below is a minimal example of how to set up a parallel backend using doParallel.
## (install and) load the library ##
# install.packages("doParallel")
library(doParallel)

# how many cores to use in the cluster?
ncores = 2

# set up a cluster called 'cl'
cl = makeCluster(ncores)

# register the cluster
registerDoParallel(cl)

## do some parallel computations with foreach
results = foreach(case = 1:10) %dopar% {
  ## what to do with each case? ##
  result_case = case
  result_case
}

## always shut the cluster down when done
stopCluster(cl)
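While a cluster is registered, you can confirm that foreach will actually use it: getDoParWorkers(), from the foreach package attached along with doParallel, reports the number of registered workers.

library(doParallel)
cl = makeCluster(2)
registerDoParallel(cl)
getDoParWorkers()  # 2, so %dopar% will use the cluster
stopCluster(cl)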
In the example above, we also see the basic structure for performing parallel computations with foreach. Compare the following two code snippets.
Snippet 1:
results = c()
for (case in 1:10) {
  result_case = case
  results[case] = result_case
}
Snippet 2:
results = foreach(case = 1:10) %dopar% {
  result_case = case
  result_case
}
The foreach function evaluates an expression for each value of the counter (iterator) case. The key difference is that in snippet 1 the expression modifies results[case] in the global environment, while in snippet 2 each expression is evaluated in a separate environment and its value is returned, with foreach collecting the returned values.
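By default, foreach collects the results into a list. The .combine argument tells it how to combine them into another structure, e.g. a vector:

# combine the iteration results with c() to get a vector instead of a list
results = foreach(case = 1:10, .combine = c) %dopar% {
  case
}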
In the examples above, the %dopar% operator is used to execute the code in parallel. Using %do% instead would lead to sequential computation by the primary process.
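One caveat when using a cluster backend: unlike forked child processes, cluster workers do not inherit the parent's loaded packages or global variables. The .packages and .export arguments of foreach handle this. A small sketch, assuming a cluster is registered as above (lm.ridge is simply a convenient example of a function from a package):

lambdas = c(0.1, 1, 10, 100)
fits = foreach(i = seq_along(lambdas), .packages = "MASS", .export = "lambdas") %dopar% {
  # each worker loads MASS and receives a copy of 'lambdas'
  lm.ridge(mpg ~ wt + cyl, data = mtcars, lambda = lambdas[i])
}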
Many statistical and machine learning applications rely on pseudorandom numbers for things like sampling from distributions and stochastic optimization. When child processes are spawned to compute in parallel, care must be taken to ensure the random number streams behave as expected. This is particularly true for bootstrap inference, Monte Carlo simulation, and other applications where it is important that iterations be independent. Care is also needed to ensure results can be reliably reproduced using set.seed().
Review the section “Random numbers” in the parallel package's help pages for more information. There we learn that, in the multicore approach typified by calls to mclapply, reproducible random numbers in parallel computations can be achieved by instructing R to use the “L'Ecuyer-CMRG” random number generator, via a call to RNGkind("L'Ecuyer-CMRG").
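A minimal sketch: once this generator is selected, set.seed() makes the streams handed to the child processes reproducible across runs.

library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(77)
draws1 = unlist(mclapply(1:4, function(i) rnorm(1)))
set.seed(77)
draws2 = unlist(mclapply(1:4, function(i) rnorm(1)))
identical(draws1, draws2)  # TRUE: each child draws from the same stream both times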
Similarly, to create reproducible results when using random number generation within foreach expressions, use the %dorng% operator from the doRNG package.
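A minimal sketch of this pattern, in the spirit of the doRNG vignette:

library(doParallel)
library(doRNG)
cl = makeCluster(2)
registerDoParallel(cl)
set.seed(9)
r1 = foreach(i = 1:4) %dorng% { rnorm(1) }
set.seed(9)
r2 = foreach(i = 1:4) %dorng% { rnorm(1) }
identical(r1, r2)  # TRUE: %dorng% seeds each iteration reproducibly
stopCluster(cl)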
For more details see the doRNG vignette.
The parallel, doParallel, and foreach packages greatly facilitate parallel computing in R. When used appropriately, performing computations in parallel can reduce the wait time needed to obtain results and make the overall research process more efficient. When deciding whether and how to parallelize, it is important to keep in mind the amount and architecture of available computing resources.