Introduction to RCC Hands-on Session

These materials are intended to be used during the hands-on session of the Introduction to RCC workshop.

Exercise 1 Get access to RCC

1.1 Request an Account

Request a user account by completing the applicable form found at http://rcc.uchicago.edu/accounts.

  • If you are an eligible Principal Investigator, go directly to the PI Account Request form.
  • If you are a student, post-doc, or otherwise a member of a UChicago research group, complete the General User Account Request Form.
  • RCC will create these accounts during the workshop if possible.
  • PIs must acknowledge and approve the request for a user to be added to their projects, so in some cases the activation time will be longer.

1.2 Temporary Accounts

If you do not have an account we can give you temporary access. Let the instructor know you need a temporary account and they will give you a Yubikey. The Yubikey allows you to access guest accounts as indicated in the images below.

Identify your guest username, which is rccguest plus the last four digits of your Yubikey identification number.

Top of Yubikey with digits

Note the button on the bottom of the Yubikey, which will enter your password when pressed.

Bottom of Yubikey with pass phrase contact
  1. Insert the Yubikey into a USB slot and ssh to Midway
  2. Username : rccguest####
  3. Password : touch the contact
  4. Let the instructor know if you have trouble.

Exercise 2 Log in to Midway

Access to RCC is provided via secure shell (SSH) login, a tool that allows you to connect securely from any computer (including most smartphones and tablets).

All users must have a UChicago CNetID to log in to any RCC systems. Your RCC account credentials are your CNetID and password:

Username CNetID
Password CNetID password
Hostname midway.rcc.uchicago.edu

Note

RCC does not store your CNet password and we are unable to reset your password. If you require password assistance, please contact UChicago IT Services.

Most UNIX-like operating systems (Mac OS X, Linux, etc.) provide an SSH utility by default that can be accessed by typing the command ssh in a terminal. To log in to Midway from a Linux/Mac computer, open a terminal and enter at the command line:

ssh <username>@midway.rcc.uchicago.edu

Windows users will first need to download an SSH client, such as PuTTY, which will allow you to interact with the remote Unix server. Use the hostname midway.rcc.uchicago.edu and your CNetID username and password to access the Midway login node.

Exercise 3 Reserve a node to work on interactively

RCC uses the SLURM resource manager to schedule jobs. To request one processor to use interactively, use the sinteractive command with no further options:

sinteractive

Note

We have reserved nodes for use during this workshop. If Midway is under heavy use, include the option --reservation=intro in your SLURM commands to access nodes reserved for this workshop session. For example:

sinteractive --reservation=intro

The sinteractive command provides many more options for reserving processors. For example, two cores, instead of the default of one, could be reserved for four hours in the following manner:

sinteractive --nodes=1 --ntasks-per-node=2 --time=4:00:00

The option --constraint=ib can be used to ensure that an Infiniband connected node is reserved. Infiniband is a fast networking option that permits up to 40x the bandwidth of gigabit ethernet on Midway.

Note

Some examples related to MPI/parallel jobs will only work on Infiniband-capable nodes. Be sure to reserve an IB node for such exercises, e.g.:

sinteractive --constraint=ib

Exercise 4 Submit a job to the scheduler

RCC resources are shared by the entire University community. Sharing computational resources creates unique challenges:

  • Jobs must be scheduled in a fair manner.
  • Resource consumption needs to be accounted for.
  • Access needs to be controlled.

Thus, a scheduler is used to manage job submissions. RCC uses the SLURM scheduler to manage the cluster.

How to submit a job

  1. sinteractive - Gets the specified resources and logs the user onto them
  2. sbatch - Runs a script which defines resources and commands

SBATCH scripts contain two major elements. After the #!/bin/bash line, a series of #SBATCH parameters are defined. These are read by the scheduler, SLURM, and relay information about what specific hardware is required to execute the job, how long that hardware is required, and where the output and error (stdout and stderr streams) should be directed. If resources are available, the job may start less than one second after submission. When the queue is busy and the resource request is substantial, the job may be placed in line with other jobs awaiting execution.

The second major element of an sbatch script is the user defined commands. When the resource request is granted the script is executed just as if it were run interactively. The #SBATCH lines appear as comments to the bash interpreter.
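Because the #SBATCH lines are comments to bash, a script containing them also runs unchanged outside the scheduler. A quick local sketch of this behavior (demo.sbatch is just a throwaway filename):

```shell
# Write a minimal script containing an #SBATCH directive...
cat > demo.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
echo "bash ignored the #SBATCH line"
EOF

# ...and run it directly with bash: the directive is treated as a comment.
bash demo.sbatch
```

Running the same file via sbatch instead would hand the #SBATCH directives to the scheduler before the commands execute.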

An example sbatch script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

# load your modules here
module load intel

# execute your tasks here
echo "Hello, world"
date
ls
pwd
hostname
echo "Done with processing"

Copy and modify the script above, then submit it to the scheduler a few times. If a reservation is active for this workshop you can add #SBATCH --reservation=intro to the script header to ensure that the script runs on the reserved nodes. This reservation is not required.

When a scheduled job runs, SLURM sets many environment variables that may be helpful to query from within your job. You can explore these variables by adding

env | grep SLURM

to the sbatch script above. The output will be found in the output file(s) defined in your script header.
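A few of the variables you can expect to see are sketched below. The names are standard SLURM variables; the fallback text simply marks them as unset when the snippet is run outside a job:

```shell
# Inside a job these are set by SLURM; outside a job the :-unset fallback fires.
echo "SLURM_JOB_ID=${SLURM_JOB_ID:-unset}"
echo "SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST:-unset}"
echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}"
```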

Exercise 5 Use rcchelp to download, compile, and submit a parallel job

Note

This exercise should be run on an Infiniband-capable compute node. See the exercises above for information about how to reserve a compute node to work on interactively.

rcchelp is a custom command-line program to provide online help and code examples. Help on software topics can be accessed with the rccsoftware shortcut. Run this command to see available topics:

rccsoftware

The output should look similar to this:

...
c                   Compile and run a C program []
fftw                Fastest Fourier Transform in the West []
gcc                 Compile and run a C program []
mpi                 Compile and run an MPI program []
namd                Submission script and sample files for NAMD []
...

The left column contains topics that can be passed to the rccsoftware command. Enter:

rccsoftware mpi

into the command line and follow the instructions. Choose Yes when you are given the option to download files to your home directory. The final output should look like:

The following files were copied locally to:
  $HOME/rcchelp/software/mpi.rcc-docs
  hello.c
  mpi.sbatch
  README

The information that is printed to the screen can be found and reviewed in the README file. Follow the instructions to compile and run the parallel Hello World code.

Exercise 6 Target Hardware With Job Submission Scripts

Different types of hardware are usually organized by partition. Rules about job limits such as maximum wallclock time and maximum numbers of CPUs that may be requested are governed by a QOS (Quality of Service). You can target appropriate compute nodes for your job by specifying a partition and a qos in your batch script.

Note

Queue parameters change frequently. Always refer to the official documentation and test your submissions before submitting a large run of jobs.

#SBATCH --exclusive         Exclusive access to all nodes requested, no other jobs may run here
#SBATCH --partition         Specifies the partition (ex: 'sandyb', 'westmere')
#SBATCH --qos               Quality of Service (ex: 'normal', 'debug')

6.1 Specific Configurations

  • GPU

    --gres=gpu: Specifies number of GPUs to use

  • Big Memory - Nodes with >=256G RAM

    --partition=bigmem

    By default, SLURM allocates 2GB of memory per CPU core being used in a job. This follows from the fact that most Midway nodes contain 16 cores and 32GB of memory. If your job requires more than the default amount of memory per core, you must include the option --mem-per-cpu=<MB> in your sbatch job script. For example, to use 16 CPU cores and 256GB of memory on a bigmem node the required sbatch flags would be:

    --partition=bigmem --ntasks=16 --cpus-per-task=1 --mem-per-cpu=16000

  • Westmere - Not as fast as ‘sandyb’, 12 cores per node (versus 16)

    --partition=westmere
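The --mem-per-cpu figure in the bigmem example above is just the total memory target divided by the task count. A quick arithmetic check:

```shell
# 256 GB expressed in MB (the unit SLURM expects), split across 16 tasks.
total_mb=256000
ntasks=16
echo "--mem-per-cpu=$(( total_mb / ntasks ))"
```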

At the command prompt, you can run sinfo to get some information about the available partitions, and rcchelp qos to learn more about the qos.

Try to modify a submission script from one of the earlier examples to run on the bigmem partition.

Exercise 7 Interact With Your Submitted Jobs

The status of submitted jobs can be viewed and altered by several means. The primary command, squeue, is part of a versatile system of job monitoring.

Example:

squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
3518933   depablo polyA6.0   ccchiu  PD       0:00      1 (QOSResourceLimit)
3519981   depablo    R_AFM mcgovern  PD       0:00      1 (Resources)
3519987   depablo    R_AFM mcgovern  PD       0:00      1 (Priority)
3519988   depablo    R_AFM mcgovern  PD       0:00      1 (Priority)
3539126       gpu _interac  jcarmas   R      45:52      1 midway231
3538957       gpu test.6.3 jwhitmer   R      58:52      1 midway230
3535743      kicp Alleturb   agertz  PD       0:00      6 (QOSResourceLimit)
3535023      kicp hf.b64.L   mvlima   R    5:11:46      1 midway217
3525370  westmere phase_di   khaira   R    4:50:02      1 midway008
3525315  westmere phase_di   khaira   R    4:50:03      1 midway004
3525316  westmere phase_di   khaira   R    4:50:03      1 midway004

The above tells us:

Name Description
JOBID Job ID #, unique reference number for each job
PARTITION Partition job will run on
NAME Name for the job, defaults to slurm-JobID
USER User who submitted job
ST State of the job
TIME Time used by the job in D-HH:MM:SS
NODES Number of Nodes consumed
NODELIST(REASON) List of Nodes consumed, or reason the job has not started running

squeue’s output is verbose, but also customizable.

Example:

squeue --user CNet -i 5

The above will only show information for user CNet and will refresh every 5 seconds.

7.1 Canceling Jobs

Cancel a job:

scancel <JobID>

or cancel all of your jobs at the same time:

scancel --user <CNetID>

7.2 More Job Information

More information about a submitted job can be obtained through the scontrol command by specifying a JobID to query:

scontrol show job <JobID>

Example:

scontrol show job 3560876

JobId=3560876 Name=sleep
 UserId=dylanphall(1832378456) GroupId=dylanphall(1832378456)
 Priority=17193 Account=rcc-staff QOS=normal
 JobState=CANCELLED Reason=None Dependency=(null)
 Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
 RunTime=00:00:10 TimeLimit=1-12:00:00 TimeMin=N/A
 SubmitTime=2013-01-09T11:39:40 EligibleTime=2013-01-09T11:39:40
 StartTime=2013-01-09T11:39:40 EndTime=2013-01-09T11:39:50
 PreemptTime=None SuspendTime=None SecsPreSuspend=0
 Partition=sandyb AllocNode:Sid=midway-login2:24907
 ReqNodeList=(null) ExcNodeList=(null)
 NodeList=midway113
 BatchHost=midway113
 NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
 MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
 Features=(null) Gres=(null) Reservation=(null)
 Shared=OK Contiguous=0 Licenses=(null) Network=(null)
 Command=/bin/sleep
 WorkDir=/scratch/midway/dylanphall/repositories/pubsw/rccdocs.sphinx

Exercise 8 Explore the Module System

The module system is a script-based system for managing the user environment.

Why isn’t everything installed and available by default? The need for

  • multiple versions of software (version number, addons, custom software) and
  • multiple build configurations (compiler choice, options, and MPI library)

would lead to a hopelessly polluted namespace and PATH problems. Additionally, most of the applications used on HPC machines are research codes in a constant state of development. Testing, stability, compatibility, and other usability concerns are often not a primary consideration of the authors.

Try running the commands below and review the output to learn more about the module system.

Basic module commands:

Command Description
module avail [name] lists modules matching [name] (all if ‘name’ empty)
module load [name] loads the named module
module unload [name] unloads the named module
module list lists the modules currently loaded for the user
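Under the hood, loading a module mostly amounts to editing environment variables such as PATH. The sketch below only illustrates that idea; the install prefix is hypothetical, not a real Midway path:

```shell
# Hypothetical install prefix, for illustration only.
PREFIX=/software/exampleapp-1.0

# What a module load effectively does (simplified): put the app first on PATH.
export PATH="$PREFIX/bin:$PATH"
echo "$PATH" | cut -d: -f1   # first PATH entry is now the app's bin directory
```

Unloading a module reverses these edits, which is why `module unload` cleanly removes a version before you load another.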

8.1 Example - Find and load a particular Mathematica version

Obtain a list of the currently loaded modules:

$ module list

Currently Loaded Modulefiles:
1) slurm/2.4       3) subversion/1.6  5) env/rcc         7) tree/1.6.0
2) vim/7.3         4) emacs/23.4      6) git/1.7

Obtain a list of all available modules:

$ module avail

-------------------------------- /software/modulefiles ---------------------------------
Minuit2/5.28(default)                               intelmpi/4.0
Minuit2/5.28+intel-12.1                             intelmpi/4.0+intel-12.1(default)
R/2.15(default)                                     jasper/1.900(default)
...
ifrit/3.4(default)                                  x264/stable(default)
intel/11.1                                          yasm/1.2(default)
intel/12.1(default)
------------------------- /usr/share/Modules/modulefiles -------------------------------
dot         module-cvs  module-info modules     null        use.own
----------------------------------- /etc/modulefiles -----------------------------------
env/rcc            samba/3.6          slurm/2.3          slurm/2.4(default)
--------------------------------------- Aliases ----------------------------------------

Obtain a list of available versions of a particular piece of software:

$ module avail mathematica

----------------------------------------------------------------------------------------
mathematica/8.0(default)            mathematica/9.0
----------------------------------------------------------------------------------------

Load the default mathematica version:

$ module load mathematica

$ mathematica --version
8.0

List the currently loaded modules:

$ module list

Currently Loaded Modulefiles:
 1) slurm/2.4       3) subversion/1.6  5) env/rcc         7) tree/1.6.0
 2) vim/7.3         4) emacs/23.4      6) git/1.7         8) mathematica/8.0

Unload the mathematica/8.0 module and load mathematica version 9.0:

$ module unload mathematica/8.0

$ module load mathematica/9.0

$ mathematica --version
9.0

List the currently loaded modules:

$ module list

Currently Loaded Modulefiles:
 1) slurm/2.4       3) subversion/1.6  5) env/rcc         7) tree/1.6.0
 2) vim/7.3         4) emacs/23.4      6) git/1.7         8) mathematica/9.0

Exercise 9 Learn About Your Storage

There are three different types of storage:

  1. Home - for personal configurations, private data, limited space /home/[user name]
  2. Scratch - fast, for daily use /scratch/midway/[user name]
  3. Project - for common project files, project data, sharing /project/[PI name]

To find the limits enter:

quota

Disk quotas for user dylanphall:
Filesystem       type          used     quota     limit      files    grace
----------       -------- --------- --------- --------- ---------- --------
home             USR         2.90 G   10.00 G   12.00 G      39987     none
midway-scratch   USR        24.26 G    5.00 T    6.00 T     106292     none
project-abcdefg  FILESET    24.33 G  500.00 G  550.00 G       5488     none

Descriptions of the fields:

Filesystem

This is the file system or file set where this quota is valid.

type

This is the type of quota. This can be USR for a user quota, GRP for a group quota, or FILESET for a file set quota. File set quotas can be considered a directory quota. USR and GRP quotas can exist within a FILESET quota to further limit a user or group quota inside a file set.

used

This is the amount of disk space used for the specific quota.

quota

This is the quota or soft quota. It is possible for usage to exceed the quota for the grace period or up to the hard limit.

limit

This is the limit or hard quota that is set. It is not possible for usage to exceed this limit.

files

This is the number of files currently counted in the quota.

grace

This is the grace period which is the amount of time remaining that the quota can be exceeded. The value none means that the quota is not exceeded. After the quota has been exceeded for longer than the grace period, it will no longer be possible to create new files.

Default Storage:

Partition Available
Home 25 G
Scratch 5 T
Project 500 G

9.1 Explore File Backups & Restoration

Snapshots

Automated snapshots of the Home and Project areas are available in case of accidental file deletion or other problems. Currently snapshots are available for these time periods:

  • 4 hourly snapshots
  • 7 daily snapshots
  • 4 weekly snapshots

You can find snapshots in these directories:

  • /snapshots/home/ – Home snapshots
  • /snapshots/project – Project snapshots

The subdirectories within these locations specify the time of the backup.

For example, /snapshots/project/daily-2013-10-08.05h15/project/rcc contains an exact snapshot of the contents of /project/rcc as it appeared on October 8, 2013 at 5:15 am.

Try recovering a file from a snapshot of your home directory. If you are a brand-new RCC user, or using a guest account, you may not have any snapshots, as your home directory has not existed long enough.
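The recovery itself is an ordinary copy out of the read-only snapshot tree. The sketch below demonstrates the pattern with a stand-in directory; the real snapshot paths are the /snapshots/... locations listed above:

```shell
# Build a stand-in "snapshot" tree (the real ones live under /snapshots/home/).
mkdir -p snapdemo/daily-2013-10-08.05h15/home/alice
echo "recovered" > snapdemo/daily-2013-10-08.05h15/home/alice/notes.txt

# Restoring a file is a plain cp from the snapshot back into your directory.
cp snapdemo/daily-2013-10-08.05h15/home/alice/notes.txt ./notes.txt
cat notes.txt
```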

Backup

Backups are performed on a nightly basis to a tape machine located in a different data center than the main storage system. These backups are meant to safeguard against disasters such as hardware failure or events that could cause the loss of the main data center. Users should make use of the snapshots described above to recover files.

Exercise 10 Run a community code

Computing centers usually maintain the most popular community codes in a central repository. The physical and biological sciences in particular have a wealth of codes for simulating the natural world at a variety of scales. NAMD is an example of a code that can be run on hundreds of thousands of processors in order to study atomistic protein dynamics. Enter the command rcchelp namd and follow the instructions to download sample files and submission scripts. Run the parallel NAMD job.

Exercise 11 Putting it all together

Develop a simple application, compile it, and submit it to the scheduler to run. If you are not sure what to write and compile, try this simple C application:

#include <stdio.h>

int main( void ) {
    printf("Hello, World!\n");
    return 0;
}

Try compiling with both the GCC and Intel compilers. You will need to use the commands module load gcc and module load intel.

Exercise 12 Connect to the cluster with Samba

Samba allows users to connect to (or “mount”) their home and project directories on their local computer so that the file system on Midway appears as if it were directly connected to the local machine. This method of accessing your RCC home and project space is only available from within the UChicago campus network. From off campus, you will need to connect through the UChicago virtual private network.

Your Samba account credentials are your CNetID and password:

Username: ADLOCAL\<CNetID>
Password: CNet password
Hostname: midwaysmb.rcc.uchicago.edu

Note

Make sure to prefix your username with ADLOCAL\

On a Windows computer, use the “Map Network Drive” functionality and the following UNC paths:

Home:    \\midwaysmb.rcc.uchicago.edu\homes
Project: \\midwaysmb.rcc.uchicago.edu\project

On a Mac OS X, use these URLs to connect:

Home:    smb://midwaysmb.rcc.uchicago.edu/homes
Project: smb://midwaysmb.rcc.uchicago.edu/project

To connect on a Mac OS X computer you can either click the above URLs or follow these steps:

  • Use the Connect to Server utility in Finder
  • Enter one of the URLs from above in the input box for Server Address.
  • When prompted for a username and password, select Registered User.
  • Enter ADLOCAL\user for the username and enter your CNet password.

On Ubuntu, use these URLs to connect:

Home:    smb://midwaysmb.rcc.uchicago.edu/<CNetID>
Project: smb://midwaysmb.rcc.uchicago.edu/project
  • Use the Connect to Server utility in the File Manager
  • Enter one of the above URLs for Server Address.
  • Enter ADLOCAL\user for the username and enter your CNet password.

Exercise 13 Head to the User’s Guide

You are ready to work from the User’s Guide. Head to http://docs.rcc.uchicago.edu to get started. You may want to try:

  1. Requesting a user or PI account
  2. Connecting with NX, to get a graphical Linux desktop connected to the cluster
  3. Using Sviz, for demanding graphical applications that run on the cluster
  4. Transferring data to the cluster
  5. Running jobs

Exercise 14 Transfer files with Globus Online

Globus select Midway endpoint
  • Click the Get Globus Connect link on the transfer page and follow the installation instructions. The Endpoint Name you create in Step Two will refer to your local computer.
Globus install Globus Connect
  • Open Globus Connect on your local computer
  • Return to the transfer page, enter your local computer’s endpoint name in the second Endpoint field, then click the Go button.
Globus select local endpoint
  • Find a file within your local endpoint window, select it, then click the opposing arrow to transfer it to your Midway home directory. Click refresh list to view the transferred file.
Globus transfer from local to Midway