Introduction to RCC for CSPP 51087

The following exercises will walk you through getting access and running simple jobs on Midway. Additional information about Midway and its environment can be found at this website. Contact us at help@rcc.uchicago.edu if you have any problems.

You should contact the TAs or the professor for course-related information on CSPP 51087.

Exercise 1: Log in to Midway

Access to RCC is provided via secure shell (SSH) login, a tool that allows you to connect securely from any computer (including most smartphones and tablets).

All users must have a UChicago CNetID to log in to any RCC systems. Your RCC account credentials are your CNetID and password:

Username CNetID
Password CNet password
Hostname midway.rcc.uchicago.edu

Note

RCC does not store your CNet password and we are unable to reset your password. If you require password assistance, please use the IT Services CNet Password Recovery tool.

Most UNIX-like operating systems (Mac OS X, Linux, etc.) provide an SSH utility by default that can be accessed by typing the command ssh in a terminal. To log in to Midway from a Linux/Mac computer, open a terminal and at the command line enter:

ssh <username>@midway.rcc.uchicago.edu

Windows users will first need to download an SSH client, such as PuTTY, which will allow you to interact with the remote Unix server. Use the hostname midway.rcc.uchicago.edu and your CNetID username and password to access the RCC login node.

Exercise 2: Explore the Module System

The module system is a script-based system for managing the user environment. Try running the commands below and review the output to learn more about the module system.

Basic module commands:

Command                 Description
module avail [name]     lists modules matching [name] (all if [name] is empty)
module load [name]      loads the named module
module unload [name]    unloads the named module
module list             lists the modules currently loaded for the user

Example - Matlab:

module list

Currently Loaded Modulefiles:
 1) slurm/2.4       3) subversion/1.6  5) env/rcc         7) tree/1.6.0
 2) vim/7.3         4) emacs/23.4      6) git/1.7

module avail

-------------------------------- /software/modulefiles ---------------------------------
Minuit2/5.28(default)                               intelmpi/4.0
Minuit2/5.28+intel-12.1                             intelmpi/4.0+intel-12.1(default)
R/2.15(default)                                     jasper/1.900(default)
...
ifrit/3.4(default)                                  x264/stable(default)
intel/11.1                                          yasm/1.2(default)
intel/12.1(default)
------------------------- /usr/share/Modules/modulefiles -------------------------------
dot         module-cvs  module-info modules     null        use.own
----------------------------------- /etc/modulefiles -----------------------------------
env/rcc            samba/3.6          slurm/2.3          slurm/2.4(default)
--------------------------------------- Aliases ----------------------------------------

module avail matlab

----------------------------------------------------------------------------------------
/software/modulefiles
----------------------------------------------------------------------------------------
matlab/2011b          matlab/2012a(default) matlab/2012b
----------------------------------------------------------------------------------------

module avail

which matlab
  <not found>

module load matlab

which matlab
/software/matlab-2012a-x86_64/bin/matlab

module list

Currently Loaded Modulefiles:
 1) slurm/2.4       3) subversion/1.6  5) env/rcc         7) tree/1.6.0
 2) vim/7.3         4) emacs/23.4      6) git/1.7         8) matlab/2012a

module unload matlab

which matlab
  <not found>

module load matlab/2011b

which matlab
/software/matlab-2011b-x86_64/bin/matlab

module list

Currently Loaded Modulefiles:
 1) slurm/2.4       3) subversion/1.6  5) env/rcc         7) tree/1.6.0
 2) vim/7.3         4) emacs/23.4      6) git/1.7         8) matlab/2011b
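
The which checks above can also be done from a script. The following is a minimal sketch, not an RCC-provided tool: have_cmd and ensure_module are hypothetical helper names, and the module command is assumed to exist only on systems that provide the module system, such as Midway.

```shell
# Hypothetical helper: succeed if a program can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: load module $2 only if program $1 is not yet available.
ensure_module() {
    if have_cmd "$1"; then
        echo "$1 already available"
    elif have_cmd module; then
        module load "$2" && echo "loaded $2"
    else
        echo "module command not found; cannot load $2" >&2
        return 1
    fi
}

# Example use on Midway:
#   ensure_module matlab matlab/2012a
```

Checking first avoids loading a second version of a program on top of one that is already loaded.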

Exercise 3: Set up an MPI Environment on Midway

The RCC provides several compiler suites on Midway, as well as several MPI environments. For most users these should be completely interchangeable; however, some codes perform differently or experience problems with certain combinations.

Compiler    Module(s)
Intel       intel/11.1 intel/12.1(default) intel/13.0
Portland    pgi/2012(default)
GNU         No module necessary

A complete MPI environment is composed of an MPI installation and a compiler. Each module has the naming convention [mpi]/[mpi version]+[compiler]-[compiler version]. Not all combinations of compiler and MPI environment are supported, but most are. Once you load a module, the standard MPI commands, such as mpicc and mpirun, will function normally.

MPI Environment   URL                                                    Modules
OpenMPI           http://www.openmpi.org                                 openmpi/1.6(default) openmpi/1.6+intel-12.1 openmpi/1.6+pgi-2012
Intel MPI         http://software.intel.com/en-us/intel-mpi-library      intelmpi/4.0 intelmpi/4.0+intel-12.1(default) intelmpi/4.1 intelmpi/4.1+intel-12.1 intelmpi/4.1+intel-13.0
Mvapich2          http://mvapich.cse.ohio-state.edu/overview/mvapich2/   mvapich2/1.8(default) mvapich2/1.8+intel-12.1 mvapich2/1.8+pgi-2012
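
The naming convention can be illustrated with a short sketch. mpi_module_name is a hypothetical helper written for this example; it only assembles the name and does not check that the combination is actually installed.

```shell
# Hypothetical helper: build a module name following the convention
#   [mpi]/[mpi version]+[compiler]-[compiler version]
# The compiler part is optional; without it the name is just [mpi]/[mpi version].
mpi_module_name() {
    if [ -n "${3:-}" ]; then
        printf '%s/%s+%s-%s\n' "$1" "$2" "$3" "${4:-}"
    else
        printf '%s/%s\n' "$1" "$2"
    fi
}

mpi_module_name openmpi 1.6 intel 12.1   # prints openmpi/1.6+intel-12.1
mpi_module_name mvapich2 1.8             # prints mvapich2/1.8
```

On Midway the result could be passed to module load, after confirming with module avail that the combination exists.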

Note

A code compiled with one MPI module will generally not run properly with another. If you try several MPI modules, be very careful to recompile your code. Each compiler uses different flags and default options; use mpicc -show to see the compiler and default command-line flags that MPI is passing to the compiler.

Exercise 4: Run a job on Midway

The Slurm scheduler is used to schedule jobs and manage resources. Jobs are either interactive, in which the user logs directly into a compute node and performs tasks directly, or batch, where a job script is executed by the scheduler on behalf of the user. Interactive jobs are useful during development and debugging, but users will need to wait for nodes to become available.

Interactive Use

To request one processor to use interactively, use the sinteractive command with no further options:

sinteractive

The sinteractive command provides many options for reserving processors. For example, two cores, instead of the default of one, could be reserved for four hours in the following manner:

sinteractive --ntasks-per-node=2 --time=4:00:00

The option --constraint=ib can be used to ensure that Infiniband-connected nodes are reserved. Infiniband is a fast networking option that permits up to 40x the bandwidth of gigabit ethernet on Midway. Multi-node jobs that use MPI must request Infiniband.

sinteractive --constraint=ib

Batch jobs

An example sbatch script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=sandyb
#SBATCH --constraint=ib
#SBATCH --account=CSPP51087

# load your modules here
module load intel

# execute your tasks here
mpirun hello_world
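
A script like the one above is submitted with the Slurm command sbatch, which replies with a line of the form "Submitted batch job <JobID>". The sketch below captures that job ID for later use with commands that take a job ID. job_id_from_sbatch is a hypothetical helper and test.sbatch is an assumed file name.

```shell
# Hypothetical helper: read sbatch's reply on standard input and print the job ID.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# On Midway (test.sbatch is an assumed file name for the script above):
#   jobid=$(sbatch test.sbatch | job_id_from_sbatch)
#   echo "submitted job $jobid"

# The helper can be tried without a scheduler by feeding it a sample reply:
echo "Submitted batch job 3560876" | job_id_from_sbatch   # prints 3560876
```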

When a scheduled job runs, Slurm sets many environment variables that may be helpful to query from within your job. You can explore the environment variables by adding

env | grep SLURM

to the sbatch script above. The output will be found in the file defined in your script header.
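
A slightly more defensive variant is sketched below. report_slurm_env is a hypothetical helper; it prints a short message instead of nothing when no SLURM_ variables are set, which makes it easier to tell whether the script actually ran under the scheduler.

```shell
# Hypothetical helper: list all SLURM_* variables, or say that none are set.
report_slurm_env() {
    # sed -n prints only matching lines and succeeds even when nothing matches.
    slurm_vars=$(env | sed -n '/^SLURM_/p' | sort)
    if [ -n "$slurm_vars" ]; then
        printf '%s\n' "$slurm_vars"
    else
        echo "no SLURM_ variables set (not running inside a job?)"
    fi
}

report_slurm_env
```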

Different types of hardware are usually organized by partition. Rules about job limits such as maximum wallclock time and maximum numbers of CPUs that may be requested are governed by a QOS (Quality of Service). You can target appropriate compute nodes for your job by specifying a partition and a qos in your batch script.

At the command prompt, you can run sinfo to get some information about the available partitions, and rcchelp qos to learn more about the qos.

#SBATCH --exclusive         Exclusive access to all nodes requested, no other jobs may run here
#SBATCH --partition         Specifies the partition (ex: 'sandyb', 'westmere')
#SBATCH --constraint=ib     For the sandyb partition, request nodes with infiniband
#SBATCH --qos               Quality of Service (ex: 'normal', 'debug')
#SBATCH --account           Sets the account to be charged for use (the course name CSPP51087)
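
As an illustration, the options above could be combined in a short single-task script. This is a sketch under assumptions: the job name and output file are made up, the debug qos is taken from the examples above, and the five-minute time limit (set with the standard sbatch option --time) is assumed to fit within that qos. Check rcchelp qos for the actual limits.

```shell
#!/bin/bash
# Assumed job name and output file, for illustration only.
#SBATCH --job-name=options-demo
#SBATCH --output=options-demo.out
#SBATCH --partition=sandyb
#SBATCH --qos=debug
#SBATCH --account=CSPP51087
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# Assumed to be within the limits of the chosen qos.
#SBATCH --time=00:05:00

# Record where the job ran; uname -n prints the node's host name.
node=$(uname -n)
echo "Job ${SLURM_JOB_ID:-<none>} running on $node"
```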

GPU

Midway has 4 GPU nodes, each with two GPUs. To run a job on these nodes, use the Slurm option:

--gres=gpu:<N>              Specifies the number of GPUs to use (N)

Please Contact RCC Staff prior to using GPU nodes

We will gladly assist you in making sure your sbatch script is written to properly use the nodes.

Exercise 5: Use rcchelp to download, compile, and submit a parallel job

rcchelp is a custom command-line program to provide online help and code examples. Help on software topics can be accessed with the rccsoftware shortcut. Run this command to see available topics:

rccsoftware

The output should look similar to this:

...
c                   Compile and run a C program []
fftw                Fastest Fourier Transform in the West []
gcc                 Compile and run a C program []
mpi                 Compile and run an MPI program []
namd                Submission script and sample files for NAMD []
...

The left column contains topics that can be passed to the rccsoftware command. Enter:

rccsoftware mpi

into the command line and follow the instructions. Choose Yes when you are given the option to download files to your home directory. The final output should look like:

The following files were copied locally to:
  /home/$HOME/rcchelp/software/mpi.rcc-docs
  hello.c
  mpi.sbatch
  README

The information that is printed to the screen can also be found in the README file. Follow the instructions to compile and run the parallel Hello World code.

Exercise 6: Interact With Your Submitted Jobs

The status of submitted jobs can be viewed and altered by several means. The primary command, squeue, is part of a versatile system of job monitoring.

Example:

squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
3518933   depablo polyA6.0   ccchiu  PD       0:00      1 (QOSResourceLimit)
3519981   depablo    R_AFM mcgovern  PD       0:00      1 (Resources)
3519987   depablo    R_AFM mcgovern  PD       0:00      1 (Priority)
3519988   depablo    R_AFM mcgovern  PD       0:00      1 (Priority)
3539126       gpu _interac  jcarmas   R      45:52      1 midway231
3538957       gpu test.6.3 jwhitmer   R      58:52      1 midway230
3535743      kicp Alleturb   agertz  PD       0:00      6 (QOSResourceLimit)
3535023      kicp hf.b64.L   mvlima   R    5:11:46      1 midway217
3525370  westmere phase_di   khaira   R    4:50:02      1 midway008
3525315  westmere phase_di   khaira   R    4:50:03      1 midway004
3525316  westmere phase_di   khaira   R    4:50:03      1 midway004

The above tells us:

Name                Description
JOBID               Job ID #, unique reference number for each job
PARTITION           Partition the job will run on
NAME                Name for the job, defaults to the name of the batch script
USER                User who submitted the job
ST                  State of the job
TIME                Time used by the job in D-HH:MM:SS
NODES               Number of nodes consumed
NODELIST(REASON)    List of nodes consumed, or reason the job has not started running
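
To compare run times it can help to convert the TIME column to seconds. The sketch below handles the forms that appear in the listing above (MM:SS and H:MM:SS) as well as D-HH:MM:SS. slurm_time_to_seconds is a hypothetical helper, and other values that Slurm may print elsewhere (for example in time limits) are not handled.

```shell
# Hypothetical helper: convert [D-]HH:MM:SS or MM:SS to a number of seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0
        t = $0
        # Split off the optional day count in front of the dash.
        if (index(t, "-") > 0) {
            split(t, d, "-")
            days = d[1]
            t = d[2]
        }
        n = split(t, p, ":")
        if (n == 3) {
            secs = p[1] * 3600 + p[2] * 60 + p[3]
        } else if (n == 2) {
            secs = p[1] * 60 + p[2]
        } else {
            print "unrecognized time: " $0 > "/dev/stderr"
            exit 1
        }
        print days * 86400 + secs
    }'
}

slurm_time_to_seconds 45:52        # prints 2752
slurm_time_to_seconds 5:11:46      # prints 18706
slurm_time_to_seconds 1-12:00:00   # prints 129600
```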

The output of squeue is customizable.

Example:

squeue --user CNet -i 5

The above shows only the jobs of user CNet and refreshes every 5 seconds.
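
The squeue output can also be post-processed with standard tools. As a sketch, the hypothetical helper count_states below counts jobs per state. It assumes the default column layout shown earlier and job names without spaces, since it picks the state from the fifth whitespace-separated column.

```shell
# Hypothetical helper: count jobs per state from default squeue output on stdin.
# Assumes the default column layout and job names without spaces.
count_states() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# On Midway:
#   squeue | count_states

# A small sample in the same layout, for trying the helper elsewhere:
count_states <<'EOF'
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
3519987   depablo    R_AFM mcgovern  PD       0:00      1 (Priority)
3539126       gpu _interac  jcarmas   R      45:52      1 midway231
3538957       gpu test.6.3 jwhitmer   R      58:52      1 midway230
EOF
```

For the sample this reports one pending job (PD 1) and two running jobs (R 2).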

6.1 Canceling Jobs

Cancel one job:

scancel <JobID>

or cancel all of your jobs at the same time:

scancel --user <User Name>

6.2 More Job Information

scontrol show job <JobID>

Example:

scontrol show job 3560876

JobId=3560876 Name=sleep
 UserId=dylanphall(1832378456) GroupId=dylanphall(1832378456)
 Priority=17193 Account=rcc-staff QOS=normal
 JobState=CANCELLED Reason=None Dependency=(null)
 Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
 RunTime=00:00:10 TimeLimit=1-12:00:00 TimeMin=N/A
 SubmitTime=2013-01-09T11:39:40 EligibleTime=2013-01-09T11:39:40
 StartTime=2013-01-09T11:39:40 EndTime=2013-01-09T11:39:50
 PreemptTime=None SuspendTime=None SecsPreSuspend=0
 Partition=sandyb AllocNode:Sid=midway-login2:24907
 ReqNodeList=(null) ExcNodeList=(null)
 NodeList=midway113
 BatchHost=midway113
 NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
 MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
 Features=(null) Gres=(null) Reservation=(null)
 Shared=OK Contiguous=0 Licenses=(null) Network=(null)
 Command=/bin/sleep
 WorkDir=/scratch/midway/dylanphall/repositories/pubsw/rccdocs.sphinx

sacct[-plus] gives detailed reports on user(s). sacct-plus is an RCC wrapper around sacct that offers easier manipulation options.

Exercise 7: Learn About Your Storage

Participants in CSPP51087 have access to two types of storage:

  1. Home - for personal configurations, private data, limited space: /home/[user name]
  2. Scratch - fast, for daily use: /scratch/midway/[user name]

Scratch is also accessible via a symlink in your home directory: $HOME/scratch-midway

To find the limits enter:

quota

Disk quotas for user dylanphall:
Filesystem       type          used     quota     limit      files    grace
----------       -------- --------- --------- --------- ---------- --------
home             USR         2.90 G   10.00 G   12.00 G      39987     none
midway-scratch   USR        24.26 G    5.00 T    6.00 T     106292     none

Descriptions of the fields:

Filesystem

This is the file system or file set where this quota is valid.

type

This is the type of quota. This can be USR for a user quota, GRP for a group quota, or FILESET for a file set quota. File set quotas can be considered a directory quota. USR and GRP quotas can exist within a FILESET quota to further limit a user or group quota inside a file set.

used

This is the amount of disk space used for the specific quota.

quota

This is the quota or soft quota. It is possible for usage to exceed the quota for the grace period or up to the hard limit.

limit

This is the limit or hard quota that is set. It is not possible for usage to exceed this limit.

files

This is the number of files currently counted in the quota. There are currently no quotas enforced on the number of files.

grace

This is the grace period which is the amount of time remaining that the quota can be exceeded. It is currently set to start at 7 days. The value none means that the quota is not exceeded. After the quota has been exceeded for longer than the grace period, it will no longer be possible to create new files.
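
To see at a glance how close you are to the soft quota, the used and quota columns can be turned into a percentage. The sketch below is not an RCC tool: quota_percent is a hypothetical helper, it assumes the exact column layout of the sample output above, and it assumes that the K, M, G and T units are binary multiples.

```shell
# Hypothetical helper: read quota output on stdin and print, for each file
# system, the percentage of the soft quota in use.
quota_percent() {
    awk '
        # Convert a value with unit K, M, G or T to G (binary multiples assumed).
        function to_g(value, unit) {
            if (unit == "T") return value * 1024
            if (unit == "G") return value + 0
            if (unit == "M") return value / 1024
            if (unit == "K") return value / (1024 * 1024)
            return -1
        }
        # Only data rows have a quota type in the second column.
        $2 == "USR" || $2 == "GRP" || $2 == "FILESET" {
            used = to_g($3, $4)
            soft = to_g($5, $6)
            if (used >= 0 && soft > 0)
                printf "%s %.1f%%\n", $1, 100 * used / soft
        }
    '
}

# On Midway:
#   quota | quota_percent
```

For the sample output above this would report 29.0% for home and 0.5% for midway-scratch.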

7.1 Explore File Backups & Restoration

Snapshots

Automated snapshots of users’ home directories are available in case of accidental file deletion or other problems. Currently snapshots are available for these time periods:

  • 4 hourly snapshots
  • 7 daily snapshots
  • 4 weekly snapshots

You can find snapshots in these directories:

  • /snapshots/*/home/CNetID – Home snapshots

The subdirectories refer to the frequency and time of the backup, e.g. daily-2013-10-04.06h15 or hourly-2013-10-09.11h00.

Try recovering a file from the snapshot directory.
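
One way to approach this is sketched below. list_snapshot_copies is a hypothetical helper that prints every snapshot copy of a file, so you can pick the one to restore. The file name notes.txt is an assumption, as is the idea that $USER holds your CNetID.

```shell
# Root of the snapshot tree; /snapshots is the location given above.
# It can be overridden, for example to try the helper on a test directory.
SNAPSHOT_ROOT=${SNAPSHOT_ROOT:-/snapshots}

# Hypothetical helper: print every snapshot copy of a file.
#   $1: CNetID   $2: path of the file relative to the home directory
list_snapshot_copies() {
    for copy in "$SNAPSHOT_ROOT"/*/home/"$1"/"$2"; do
        if [ -e "$copy" ]; then
            printf '%s\n' "$copy"
        fi
    done
}

# Example use on Midway (notes.txt is an assumed file name):
#   list_snapshot_copies "$USER" notes.txt
#   cp -p /snapshots/daily-2013-10-04.06h15/home/"$USER"/notes.txt "$HOME"/notes.txt.restored
```

Copying to a new name such as notes.txt.restored avoids overwriting the current version of the file before you have compared the two.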

Backup

Backups are performed on a nightly basis to a tape machine located in a different data center than the main storage system. These backups are meant to safeguard against disasters such as hardware failure or events that could cause the loss of the main data center. Users should make use of the snapshots described above to recover files.