Where to find examples

The examples for this lecture can be found in the Stats506_F18 repository on GitHub.

To work with these from a Linux server:

# The first time ...
git clone https://github.com/jbhender/Stats506_F18.git

# Subsequently, cd to Stats506_F18 and pull any updates
cd Stats506_F18
git pull
cd Examples/r_batch/

You can also get them as a standalone tarball:

wget https://github.com/jbhender/Stats506_F18/raw/master/Examples/R/r_batch.tgz
tar xvfz r_batch.tgz
cd r_batch

Batch Computing

The term batch processing refers to running a computer program non-interactively. That is, rather than a prompt that waits for user-supplied commands (e.g. “>”, “$”), a series of commands in a script is executed according to the script's logic without further input from the user.

When you knit your R Markdown files, you are in essence running R in batch mode, as the Rmd file is executed without additional instructions from you.
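For instance, you can knit a document non-interactively from the command line using the Rscript command discussed below; the file name report.Rmd here is just a placeholder:

Rscript -e "rmarkdown::render('report.Rmd')"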

You’ve previously seen an example of how to run Stata scripts (.do) in batch mode:

stata -b my_script.do

R CMD BATCH

On Unix-alike machines, an R script my_script0.R can be run in batch mode from the command line using:

R CMD BATCH my_script0.R 

The script my_script0.R is known as the infile. An outfile my_script0.Rout will be created in the working directory, containing the output from stdout and stderr that would usually appear at the console. You can specify an explicit name for the outfile as in the example below:

R CMD BATCH my_script0.R ./output/results_my_script0.Rout

In the example, we wrote the R output to the folder ./output. Examining the outfile, we see the start-up message from R, the commands run and their results, and a final call to proc.time() letting us know how long the script took to run.

We can use options to control how R is run and what gets printed. This page describes some of the relevant options. Follow the instructions in my_script1.R for some examples.

I would recommend always passing --vanilla to make your scripts reproducible and to avoid saving objects inadvertently. A general principle of reproducible research is that results should not depend on local files not explicitly called in the script.
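For example, to run my_script0.R in batch mode without saving or restoring workspace, history, or site files:

R CMD BATCH --vanilla my_script0.R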

Rscript

The Rscript command is similar to R CMD BATCH except:

  • default options are --slave and --no-restore implying --no-save,
  • output is written to stdout rather than to an outfile,
  • you can pass an expression using -e rather than a file.

Here are a few examples:

Rscript my_script1.R
Rscript -e "rnorm(2, 0, sqrt(4))"

As a reminder, you can run one R script from within another using source().
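For example, a batch script might load shared helper functions from another file; the file name funcs.R here is hypothetical:

# Run another R script in the current session, e.g. to define helper functions
source("funcs.R")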

Command Line Arguments

When running in batch mode, it is sometimes useful to run the same script multiple times with small changes to key parameters. While it is often simpler to handle this within a single script using a for loop (or in parallel via foreach), for long-running or otherwise intensive jobs it can be convenient to run each job separately. In such cases, passing parameters at the command line and accessing them within an R script using commandArgs() is a useful construct.

See also help(commandArgs).

We will look at a minimal example by running my_script2.R at the command line using

Rscript my_script2.R --args 2 3 4
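The contents of my_script2.R are not reproduced here, but a minimal sketch of a script that reads its trailing arguments might look like the following (the parsing is illustrative, not the actual course script):

# Arguments passed after the script name; trailingOnly = TRUE drops
# the R options that precede them.
args <- commandArgs(trailingOnly = TRUE)
print(args)

# Convert to numeric, dropping a literal "--args" token if one was passed
vals <- as.numeric(args[args != "--args"])
cat("Sum of arguments:", sum(vals), "\n")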

For a more interesting example, we will examine the script GammaMLE_mc.R which contains code for a Monte Carlo experiment to compare the bias, variance, and MSE of maximum likelihood estimation for parameters of the Gamma distribution. The script GammaMLE_test.R tests the core functions for the Monte Carlo study.

Job Scheduling & the Flux Cluster

Flux

Flux is the University’s high performance computing cluster. We will review the information here to learn more.

Using the Flux Cluster

We will review the information at this page. If you have not yet enrolled in Duo two-factor authentication, please do so. If you do not yet have a Flux user account, please request one using the link from the page above.

If you have a flux user account, you can connect to the login nodes from a campus IP address via:

ssh flux-login.arc-ts.umich.edu

Just as when connecting to the scs servers via scs.dsc.umich.edu, you will be connected to a specific login node. If you use tmux to start an editing session, remember the specific node you will need to reconnect to.

In many ways the linux environment on the login nodes is similar to the one on the stats servers. However, there are some important differences:

  • These nodes are not designed for long-running or high memory (>8GB) computing jobs,
  • Statistical software like R is not loaded by default, but must be requested as modules,
  • The default directory is not your afs space, but a separate directory known as home.

You can access your afs space over the network from the login nodes, but the compute nodes are not able to access network files.

Software Modules

On Flux, software aside from the linux operating environment and associated tools must be explicitly requested using the module command. Type module --help from a login node to learn more. We will look at the following module sub-commands:

  • list for viewing currently loaded modules
  • load / unload for loading and unloading modules
  • spider for searching for available modules.
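For example, a typical sequence for making R available might look like the following; the exact module names and versions available on Flux may differ:

module list       # modules currently loaded
module spider R   # search for available R modules
module load R     # load the default R module
module unload R   # unload it when no longer needed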

Flux allocations

In order to submit jobs for computing on the Flux cluster, you must belong to a Flux allocation. You can think of a Flux allocation as a pool of resources available to and shared among a group of people. Statistics students have access to an allocation paid for by the Statistics department; there are also shared allocations for Engineering and LSA.

You can see which flux allocations you have access to using the mdiag command:

mdiag -u uniqname

After you have a user account, please log in and use the command above to check your allocations. If you do not have an allocation, please send me an email.

To check the jobs currently running or queued for an allocation use showq:

showq -w acct=stats_flux

Job Submission

To run computations in batch mode on the Flux cluster, you must specify both the resources your compute job requires and the code to be run. This is done using a file known as a “PBS script”, which is essentially a shell script with special comments beginning with #PBS that communicate with the scheduling system.

For more on PBS scripts, see item 7 from the flux user guide.
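The run_myScript0.pbs file used below is not reproduced here, but a minimal sketch of such a script might look like the following; the job name, resource values, and module call are placeholders:

#!/bin/bash
#PBS -N myScript0
#PBS -A stats_flux
#PBS -q flux
#PBS -l nodes=1:ppn=1,pmem=4gb,walltime=01:00:00,qos=flux
#PBS -j oe
#PBS -V

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Make R available on the compute node
module load R

# Run the R script in batch mode
R CMD BATCH --vanilla my_script0.R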

Once you have a valid PBS script in your home directory on flux, you can “submit” it to the scheduler using qsub:

qsub run_myScript0.pbs

You should think of “q” in commands discussed here as short for “queue”. Once your job is submitted, a “scheduler” will decide when to run it. This can happen either quickly, after a very long time, or never depending on what resources you request and what resources are currently available.

You can use showq to see your job’s status in the queue you submitted against, or get additional details using the qstat command. To see a list of all your jobs (potentially in more than one queue) use:

qstat -u uniqname

You can delete a specific job using the jobid. For example, if a job I no longer wish to run has jobid=25505879, I can delete it using:

qdel 25505879

Interactive jobs

While the compute nodes are designed and best used for batch computing, they can also be used interactively. This can be useful for debugging when a problem you encounter only occurs on the compute nodes. Interactive jobs can be submitted using qsub -I with a list of requested resources:

qsub -I -A stats_flux -q flux -l nodes=1:ppn=2,pmem=1gb,walltime=0:00:20,qos=flux

You can do this using a normal PBS file with the -I flag:

qsub -I run_myScript0.pbs

Another way to run jobs interactively is to use the ARC Connect service from within a web browser.

Job Arrays

To submit multiple jobs using the same PBS script, consider job arrays.

There is a special flag -t for specifying job arrays. You can access the value of the array id using the shell variable PBS_ARRAYID. Job arrays can be useful for splitting a long-running computation into smaller chunks. This can help with scheduling as it is often easier to get several short jobs to run than a single long job.
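As a sketch (the array range and script name are placeholders rather than the actual course example), the lines that turn a PBS script into a job array might look like:

# Request an array of 10 tasks numbered 1 through 10
#PBS -t 1-10

cd $PBS_O_WORKDIR
module load R

# Pass the task number to R so each task handles a different chunk
Rscript my_script2.R $PBS_ARRAYID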

Job arrays have a single job id with brackets [] used to denote elements of the array, e.g. 25505879[].

For an example, we will inspect runGammaMLE-mc.pbs.