R
R is available for statistical computing. There are R modules built with both the GCC and Intel compilers. We recommend the Intel builds, as they have shown the best performance in our benchmarks; however, some R packages do not compile correctly with the Intel compilers, in which case use the GCC build instead. All R modules have been built with OpenMP enabled and are linked against the Intel MKL to improve performance.
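Because these builds use threaded MKL and OpenMP, it can be worth matching the thread count to the number of CPUs in your Slurm allocation. Below is a minimal sketch of how to check and cap the thread counts from within R; it assumes the optional RhpcBLASctl package has been installed in your library path, which is not provided by the R modules themselves.
# Query and cap the threads used by the MKL BLAS and by OpenMP code.
# Assumes the RhpcBLASctl package is available; adjust the value 4 to match
# the number of CPUs requested from Slurm.
library(RhpcBLASctl)
blas_get_num_procs()     # threads currently available to the MKL BLAS
blas_set_num_threads(4)  # cap MKL threads
omp_set_num_threads(4)   # cap OpenMP threads used by compiled code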
The currently available R modules:
$ module avail R
------------------------- /software/modulefiles -----------------------------
R/2.15(default) R/2.15+intel-12.1 R/3.0 R/3.0+intel-12.1 R/3.1 R/3.1+intel-15.0
To install and use additional R packages in your home directory, set the environment variable R_LIBS_USER. For example:
export R_LIBS_USER=$HOME/R_libs
The directory specified should exist before trying to install R packages.
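For example, the directory can be created and a package installed into it from within R as sketched below; the package name and CRAN mirror are illustrative choices, not requirements.
# Create the user library directory named in R_LIBS_USER, if it does not exist yet
dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE, showWarnings = FALSE)
# Install an example package into that directory
install.packages("doParallel", lib = Sys.getenv("R_LIBS_USER"),
                 repos = "https://cloud.r-project.org")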
The RStudio IDE is also available as the rstudio module. It provides a graphical interface for developing and running R. To use R in this mode, connect to midway via ThinLinc (see Connecting with ThinLinc).
Serial Examples
Here is a simple “hello world” example of submitting an R job to Slurm. It is appropriate for an R job that expects to use a single CPU.
sbatch script Rhello.sbatch
#!/bin/sh
#SBATCH --ntasks=1
# load the appropriate R module
module load R/3.1+intel-15.0
# Use Rscript to run hello.R
# alternatively, this could be used:
# R --no-save < hello.R
Rscript hello.R
R script Rhello.R
print("Hello World")
Output:
[1] "Hello World"
Parallel Examples
For parallel use there are several options, depending on whether the parallel tasks should run on a single node or across multiple nodes, and on the level of flexibility required. There are more R packages for parallel programming than are covered here; below are some of the most frequently used ones.
Multicore
On a single node, you can use the doParallel and foreach packages.
sbatch script doParallel.sbatch
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
# --ntasks-per-node will be used in doParallel.R to specify the number of
# cores to use on the machine. Using 16 will allow us to use all cores
# on a sandyb node
module load R/3.1+intel-15.0
Rscript doParallel.R
R script doParallel.R
library(doParallel)
# use the environment variable SLURM_NTASKS_PER_NODE to set the number of cores
registerDoParallel(cores=as.integer(Sys.getenv("SLURM_NTASKS_PER_NODE")))
# Bootstrapping iteration example
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iterations <- 10000  # Number of iterations to run
# Parallel version of code
# Note the '%dopar%' instruction
parallel_time <- system.time({
r <- foreach(icount(iterations), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
# Show the number of parallel workers registered
getDoParWorkers()
# Print the elapsed time of the parallel run
parallel_time
Output:
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
[1] "16"
elapsed
5.157
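To gauge the speedup from the parallel backend, you can time the same loop serially by replacing %dopar% with %do%. A minimal sketch, reusing x and iterations from doParallel.R above:
# Serial version of the same bootstrap loop, for comparison with parallel_time
serial_time <- system.time({
r <- foreach(icount(iterations), .combine=cbind) %do% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
# Print the serial elapsed time; compare it against parallel_time
serial_time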
SNOW
For multiple nodes, you can use the SNOW package, which provides a small set of functions that simplify running R across a multi-node cluster.
sbatch script snow-test.sbatch
#!/bin/bash
#SBATCH --job-name=snow-test
#SBATCH --nodes=4
#SBATCH --time=10
#SBATCH --exclusive
module load R/3.1+intel-15.0
# the openmpi module is not loaded by default with R
module load openmpi/1.8+intel-15.0
# Always use -n 1 for the snow package. It uses Rmpi internally to spawn
# additional processes dynamically
mpirun -n 1 Rscript snow-test.R
R script snow-test.R
##
# Source: http://www.umbc.edu/hpcf/resources-tara/how-to-run-R.html
# filename: snow-test.R
#
# SNOW quick ref: http://www.sfu.ca/~sblay/R/snow.html
#
# Notes:
# - Library loading order matters
# - system.time([function]) is an easy way to test optimizations
# - parApply is snow parallel version of 'apply'
#
##
library(Rmpi)
library(snow)
# Initialize SNOW using MPI communication. The first line gets the number of worker
# processes available (the MPI universe size minus one for the master process).
# Everything else is standard SNOW
np <- mpi.universe.size() - 1
cluster <- makeMPIcluster(np)
# Print the hostname for each cluster member
sayhello <- function() {
info <- Sys.info()[c("nodename", "machine")]
paste("Hello from", info[1], "with CPU type", info[2])
}
names <- clusterCall(cluster, sayhello)
print(unlist(names))
# Compute row sums in parallel using all processes, then a grand sum at the end
# on the master process
parallelSum <- function(m, n) {
A <- matrix(rnorm(m*n), nrow = m, ncol = n)
# Parallelize the summation
row.sums <- parApply(cluster, A, 1, sum)
print(sum(row.sums))
}
# Time the row-sum operation on a 5000 x 5000 matrix
system.time(parallelSum(5000, 5000))
# Always stop your cluster and exit MPI to ensure resources are properly freed
stopCluster(cluster)
mpi.exit()
Output (trimmed for readability):
64 slaves are spawned successfully. 0 failed.
[1] "Hello from midway197 with CPU type x86_64"
[2] "Hello from midway197 with CPU type x86_64"
[3] "Hello from midway197 with CPU type x86_64"
...
[63] "Hello from midway200 with CPU type x86_64"
[64] "Hello from midway197 with CPU type x86_64"
[1] -9363.914
user system elapsed
3.988 0.443 5.553
[1] 1
[1] "Detaching Rmpi. Rmpi cannot be used unless relaunching R."
Rmpi
For multiple nodes, you can also use Rmpi, which is what snow uses internally when running over MPI. It is less convenient than snow, but more flexible.
This page has a number of useful Rmpi examples: http://www.umbc.edu/hpcf/resources-tara-2010/how-to-run-R.php
sbatch script Rmpi.sbatch
#!/bin/sh
#SBATCH --nodes=4
#SBATCH --time=1
#SBATCH --exclusive
module load R/3.1+intel-15.0
# the openmpi module is not loaded by default with R
module load openmpi/1.8+intel-15.0
# Always use -n 1 for the Rmpi package. It spawns additional processes dynamically
mpirun -n 1 Rscript Rmpi.R
R script Rmpi.R
library(Rmpi)
# initialize an Rmpi environment
ns <- mpi.universe.size() - 1
mpi.spawn.Rslaves(nslaves=ns)
# send these commands to the slaves
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( ns <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
# all slaves execute this command
mpi.remote.exec(paste("I am", id, "of", ns, "running on", host))
# close down the Rmpi environment
mpi.close.Rslaves(dellog = FALSE)
mpi.exit()
Output (trimmed for readability):
64 slaves are spawned successfully. 0 failed.
master (rank 0 , comm 1) of size 65 is running on: midway449
slave1 (rank 1 , comm 1) of size 65 is running on: midway449
slave2 (rank 2 , comm 1) of size 65 is running on: midway449
slave3 (rank 3 , comm 1) of size 65 is running on: midway449
... ... ...
slave63 (rank 63, comm 1) of size 65 is running on: midway452
slave64 (rank 64, comm 1) of size 65 is running on: midway449
$slave1
[1] "I am 1 of 65"
$slave2
[1] "I am 2 of 65"
...
$slave63
[1] "I am 63 of 65"
$slave64
[1] "I am 64 of 65"