R
R is available for statistical computing. There are R modules built with both the GCC and Intel compilers. We recommend the Intel builds, as they have shown the best performance in our benchmarks; however, some R packages do not compile correctly with the Intel compilers, in which case use the GCC build instead. All R modules have been built with OpenMP enabled and are linked against the Intel MKL to improve performance.
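Because these builds use threaded MKL and OpenMP, it can be worth matching the thread count to the number of CPUs in your Slurm allocation. Below is a minimal sketch of how to check and cap the thread counts from within R; it assumes the optional RhpcBLASctl package has been installed in your library path, which is not provided by the R modules themselves.
# Query and cap the threads used by the MKL BLAS and by OpenMP code.
# Assumes the RhpcBLASctl package is available; adjust the value 4 to match
# the number of CPUs requested from Slurm.
library(RhpcBLASctl)
blas_get_num_procs()     # threads currently available to the MKL BLAS
blas_set_num_threads(4)  # cap MKL threads
omp_set_num_threads(4)   # cap OpenMP threads used by compiled code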
The currently available R modules:
$ module avail R
------------------------- /software/modulefiles -----------------------------
R/2.15(default) R/2.15+intel-12.1 R/3.0 R/3.0+intel-12.1 R/3.1 R/3.1+intel-15.0
To install and use additional R packages in your home directory, set the environment variable R_LIBS_USER. For example:
export R_LIBS_USER=$HOME/R_libs
The directory specified should exist before trying to install R packages.
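For example, the directory can be created and a package installed into it from within R as sketched below; the package name and CRAN mirror are illustrative choices, not requirements.
# Create the user library directory named in R_LIBS_USER, if it does not exist yet
dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE, showWarnings = FALSE)
# Install an example package into that directory
install.packages("doParallel", lib = Sys.getenv("R_LIBS_USER"),
                 repos = "https://cloud.r-project.org")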
The RStudio IDE is also available as the rstudio module. It provides a graphical interface for developing and running R. To use R in this mode, connect to midway via ThinLinc (see Connecting with ThinLinc).
Serial Examples
Here is a simple “hello world” example of submitting an R job to Slurm. It is appropriate for an R job that expects to use a single CPU.
sbatch script Rhello.sbatch
#!/bin/sh
#SBATCH --ntasks=1
# load the appropriate R module
module load R/3.1+intel-15.0
# Use Rscript to run hello.R
# alternatively, this could be used:
# R --no-save < hello.R
Rscript hello.R
R script Rhello.R
print("Hello World")
Output:
[1] "Hello World"
Parallel Examples
For parallel use there are several options, depending on whether the parallel tasks should run on a single node or across multiple nodes, and on the level of flexibility required. There are more R packages for parallel programming than are covered here; below are some of the most frequently used ones.
Multicore
On a single node, you can use the doParallel and foreach packages.
sbatch script doParallel.sbatch
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
# --ntasks-per-node will be used in doParallel.R to specify the number of
# cores to use on the machine. Using 16 will allow us to use all cores
# on a sandyb node
module load R/3.1+intel-15.0
Rscript doParallel.R
R script doParallel.R
library(doParallel)
# use the environment variable SLURM_NTASKS_PER_NODE to set the number of cores
registerDoParallel(cores=as.integer(Sys.getenv("SLURM_NTASKS_PER_NODE")))
# Bootstrapping iteration example
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iterations <- 10000  # Number of iterations to run
# Parallel version of code
# Note the '%dopar%' instruction
parallel_time <- system.time({
r <- foreach(icount(iterations), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
# Show the number of parallel workers registered
getDoParWorkers()
# Print the elapsed time of the parallel run
parallel_time
Output:
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
[1] "16"
elapsed
5.157
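To gauge the speedup from the parallel backend, you can time the same loop serially by replacing %dopar% with %do%. A minimal sketch, reusing x and iterations from doParallel.R above:
# Serial version of the same bootstrap loop, for comparison with parallel_time
serial_time <- system.time({
r <- foreach(icount(iterations), .combine=cbind) %do% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
# Print the serial elapsed time; compare it against parallel_time
serial_time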
SNOW
For multiple nodes, you can use the SNOW package, which provides a small set of functions that simplify running R across a multi-node cluster.
sbatch script snow-test.sbatch
#!/bin/bash
#SBATCH --job-name=snow-test
#SBATCH --nodes=4
#SBATCH --time=10
#SBATCH --exclusive
module load R/3.1+intel-15.0
# the openmpi module is not loaded by default with R
module load openmpi/1.8+intel-15.0
# Always use -n 1 for the snow package. It uses Rmpi internally to spawn
# additional processes dynamically
mpirun -n 1 Rscript snow-test.R
R script snow-test.R
##
# Source: http://www.umbc.edu/hpcf/resources-tara/how-to-run-R.html
# filename: snow-test.R
#
# SNOW quick ref: http://www.sfu.ca/~sblay/R/snow.html
#
# Notes:
# - Library loading order matters
# - system.time([function]) is an easy way to test optimizations
# - parApply is snow parallel version of 'apply'
#
##
library(Rmpi)
library(snow)
# Initialize SNOW using MPI communication. The first line gets the number of worker
# processes available (the MPI universe size minus one for the master process).
# Everything else is standard SNOW
np <- mpi.universe.size() - 1
cluster <- makeMPIcluster(np)
# Print the hostname for each cluster member
sayhello <- function() {
info <- Sys.info()[c("nodename", "machine")]
paste("Hello from", info[1], "with CPU type", info[2])
}
names <- clusterCall(cluster, sayhello)
print(unlist(names))
# Compute row sums in parallel using all processes, then a grand sum at the end
# on the master process
parallelSum <- function(m, n) {
A <- matrix(rnorm(m*n), nrow = m, ncol = n)
# Parallelize the summation
row.sums <- parApply(cluster, A, 1, sum)
print(sum(row.sums))
}
# Time the row-sum operation on a 5000 x 5000 matrix
system.time(parallelSum(5000, 5000))
# Always stop your cluster and exit MPI to ensure resources are properly freed
stopCluster(cluster)
mpi.exit()
Output (trimmed for readability):
64 slaves are spawned successfully. 0 failed.
[1] "Hello from midway197 with CPU type x86_64"
[2] "Hello from midway197 with CPU type x86_64"
[3] "Hello from midway197 with CPU type x86_64"
...
[63] "Hello from midway200 with CPU type x86_64"
[64] "Hello from midway197 with CPU type x86_64"
[1] -9363.914
user system elapsed
3.988 0.443 5.553
[1] 1
[1] "Detaching Rmpi. Rmpi cannot be used unless relaunching R."
Rmpi
For multiple nodes, you can also use Rmpi, which is what snow uses internally when running over MPI. It is less convenient than snow, but more flexible.
This page has a number of useful Rmpi examples: http://www.umbc.edu/hpcf/resources-tara-2010/how-to-run-R.php
sbatch script Rmpi.sbatch
#!/bin/sh
#SBATCH --nodes=4
#SBATCH --time=1
#SBATCH --exclusive
module load R/3.1+intel-15.0
# the openmpi module is not loaded by default with R
module load openmpi/1.8+intel-15.0
# Always use -n 1 for the Rmpi package. It spawns additional processes dynamically
mpirun -n 1 Rscript Rmpi.R
R script Rmpi.R
library(Rmpi)
# initialize an Rmpi environment
ns <- mpi.universe.size() - 1
mpi.spawn.Rslaves(nslaves=ns)
# send these commands to the slaves
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( ns <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
# all slaves execute this command
mpi.remote.exec(paste("I am", id, "of", ns, "running on", host))
# close down the Rmpi environment
mpi.close.Rslaves(dellog = FALSE)
mpi.exit()
Output (trimmed for readability):
64 slaves are spawned successfully. 0 failed.
master (rank 0 , comm 1) of size 65 is running on: midway449
slave1 (rank 1 , comm 1) of size 65 is running on: midway449
slave2 (rank 2 , comm 1) of size 65 is running on: midway449
slave3 (rank 3 , comm 1) of size 65 is running on: midway449
... ... ...
slave63 (rank 63, comm 1) of size 65 is running on: midway452
slave64 (rank 64, comm 1) of size 65 is running on: midway449
$slave1
[1] "I am 1 of 65"
$slave2
[1] "I am 2 of 65"
...
$slave63
[1] "I am 63 of 65"
$slave64
[1] "I am 64 of 65"