Introduction to RCC for CSPP 51087
The following exercises will walk you through getting access and running simple jobs on Midway. Additional information about Midway and its environment can be found at this website. Contact us at help@rcc.uchicago.edu if you have any problems.
You should contact the TAs or the professor for course-related information on CSPP 51087.
Exercise 1: Log in to Midway
Access to RCC is provided via secure shell (SSH) login, a tool that allows you to connect securely from any computer (including most smartphones and tablets).
All users must have a UChicago CNetID to log in to any RCC systems. Your RCC account credentials are your CNetID and password:
Username    CNetID
Password    CNet password
Hostname    midway.rcc.uchicago.edu
Note
RCC does not store your CNet password and we are unable to reset your password. If you require password assistance, please use the IT Services CNet Password Recovery tool.
Most UNIX-like operating systems (Mac OS X, Linux, etc.) provide an SSH utility by default that can be accessed by typing the command ssh in a terminal. To log in to Midway from a Linux/Mac computer, open a terminal and at the command line enter:
ssh <username>@midway.rcc.uchicago.edu
Windows users will first need to download an SSH client, such as PuTTY, which will
allow you to interact with the remote Unix server. Use the hostname midway.rcc.uchicago.edu
and your CNetID username and password to access the RCC login node.
Exercise 2: Explore the Module System
The module system is a script-based system for managing the user environment. Try running the commands below and review the output to learn more about the module system.
Basic module commands:
Command                 Description
module avail [name]     lists modules matching [name] (all if [name] is empty)
module load [name]      loads the named module
module unload [name]    unloads the named module
module list             lists the modules currently loaded for the user
Example - Matlab:
module list
Currently Loaded Modulefiles:
1) slurm/2.4 3) subversion/1.6 5) env/rcc 7) tree/1.6.0
2) vim/7.3 4) emacs/23.4 6) git/1.7
module avail
-------------------------------- /software/modulefiles ---------------------------------
Minuit2/5.28(default) intelmpi/4.0
Minuit2/5.28+intel-12.1 intelmpi/4.0+intel-12.1(default)
R/2.15(default) jasper/1.900(default)
...
ifrit/3.4(default) x264/stable(default)
intel/11.1 yasm/1.2(default)
intel/12.1(default)
------------------------- /usr/share/Modules/modulefiles -------------------------------
dot module-cvs module-info modules null use.own
----------------------------------- /etc/modulefiles -----------------------------------
env/rcc samba/3.6 slurm/2.3 slurm/2.4(default)
--------------------------------------- Aliases ----------------------------------------
module avail matlab
----------------------------------------------------------------------------------------
/software/modulefiles
----------------------------------------------------------------------------------------
matlab/2011b matlab/2012a(default) matlab/2012b
----------------------------------------------------------------------------------------
module avail
which matlab
<not found>
module load matlab
which matlab
/software/matlab-2012a-x86_64/bin/matlab
module list
Currently Loaded Modulefiles:
1) slurm/2.4 3) subversion/1.6 5) env/rcc 7) tree/1.6.0
2) vim/7.3 4) emacs/23.4 6) git/1.7 8) matlab/2012a
module unload matlab
which matlab
<not found>
module load matlab/2011b
which matlab
/software/matlab-2011b-x86_64/bin/matlab
module list
Currently Loaded Modulefiles:
1) slurm/2.4 3) subversion/1.6 5) env/rcc 7) tree/1.6.0
2) vim/7.3 4) emacs/23.4 6) git/1.7 8) matlab/2011b
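The which checks in the session above can be turned into a guard at the top of a script, so that a missing module load is caught early. This is a minimal sketch; require_cmd is a hypothetical helper name, not a command provided by RCC or by the module system.

```shell
# Succeed if the named command is on PATH; otherwise print a hint and fail.
# require_cmd is an illustrative helper, not part of the module system.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "command not found: $1 (did you load its module?)" >&2
  return 1
}
```

For example, a script could run module load matlab and then require_cmd matlab || exit 1 before doing any work.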
Exercise 3: Set up an MPI Environment on Midway
The RCC provides several compiler suites on Midway, as well as several MPI environments. For most users these should be completely interchangeable; however, some codes perform differently or run into problems with certain combinations.
Compiler    Module(s)
Intel       intel/11.1 intel/12.1(default) intel/13.0
Portland    pgi/2012(default)
GNU         No module necessary
A complete MPI environment is composed of an MPI installation and a compiler. Each module has the naming convention [mpi]/[mpi version]+[compiler]-[compiler version]. Not all combinations of compiler and MPI environment are supported, but most are. Once you load a module, all the standard MPI commands, such as mpicc and mpirun, will function normally.
MPI Environment   URL                                                    Modules
OpenMPI           http://www.openmpi.org                                 openmpi/1.6(default) openmpi/1.6+intel-12.1 openmpi/1.6+pgi-2012
Intel MPI         http://software.intel.com/en-us/intel-mpi-library      intelmpi/4.0 intelmpi/4.0+intel-12.1(default) intelmpi/4.1 intelmpi/4.1+intel-12.1 intelmpi/4.1+intel-13.0
Mvapich2          http://mvapich.cse.ohio-state.edu/overview/mvapich2/   mvapich2/1.8(default) mvapich2/1.8+intel-12.1 mvapich2/1.8+pgi-2012
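The naming convention [mpi]/[mpi version]+[compiler]-[compiler version] can be spelled out with a small helper. This is only a sketch: mpi_module_name is a hypothetical function name, and it does not check whether the resulting module actually exists on Midway (use module avail for that).

```shell
# Build a module name from its parts.
# Usage: mpi_module_name MPI MPI_VERSION [COMPILER COMPILER_VERSION]
# mpi_module_name is an illustrative helper, not an RCC command.
mpi_module_name() {
  if [ "$#" -eq 4 ]; then
    printf '%s/%s+%s-%s\n' "$1" "$2" "$3" "$4"
  else
    printf '%s/%s\n' "$1" "$2"
  fi
}
```

On Midway the result could be passed to the module command, for example module load "$(mpi_module_name openmpi 1.6 intel 12.1)", which loads openmpi/1.6+intel-12.1.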
Note
A code compiled with one MPI module will generally not run properly with another. If you try several MPI modules, be very careful to recompile your code. Each compiler uses different flags and default options; use mpicc -show to see the compiler and default command line flags that MPI is passing to the compiler.
Exercise 4: Run a job on Midway
The Slurm scheduler is used to schedule jobs and manage resources. Jobs are either interactive, in which the user logs directly into a compute node and performs tasks directly, or batch, where a job script is executed by the scheduler on behalf of the user. Interactive jobs are useful during development and debugging, but users will need to wait for nodes to become available.
Interactive Use
To request one processor to use interactively, use the sinteractive command with no further options:
sinteractive
The sinteractive command provides many options for reserving processors. For example, two cores, instead of the default of one, could be reserved for four hours in the following manner:
sinteractive --ntasks-per-node=2 --time=4:00:00
The option --constraint=ib can be used to ensure that Infiniband connected nodes are reserved. Infiniband is a fast networking option that permits up to 40x the bandwidth of gigabit ethernet on Midway. Multi-node jobs that use MPI must request Infiniband.
sinteractive --constraint=ib
Batch jobs
An example sbatch script:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=sandyb
#SBATCH --constraint=ib
#SBATCH --account=CSPP51087
# load your modules here
module load intel
# execute your tasks here
mpirun hello_world
When a scheduled job runs, Slurm sets many environment variables that may be helpful to query from within your job. You can explore the environment variables by adding
env | grep SLURM
to the sbatch script above. The output will be found in the file defined in your script header.
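Once you know which variables are set, a job script can record them in its output. The sketch below assumes the commonly documented Slurm variables SLURM_JOB_ID, SLURM_JOB_NUM_NODES, and SLURM_NTASKS; confirm against the env | grep SLURM output on Midway, since the available variables depend on the Slurm version. slurm_summary is a hypothetical helper name.

```shell
# Print a one-line summary of the current Slurm job.
# Each field falls back to "unset" when the variable is missing,
# for example when the function is run outside a Slurm job.
slurm_summary() {
  echo "job=${SLURM_JOB_ID:-unset} nodes=${SLURM_JOB_NUM_NODES:-unset} tasks=${SLURM_NTASKS:-unset}"
}
```

Calling slurm_summary near the top of the sbatch script writes the line to the file named by --output.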
Different types of hardware are usually organized by partition. Rules about job limits such as maximum wallclock time and maximum numbers of CPUs that may be requested are governed by a QOS (Quality of Service). You can target appropriate compute nodes for your job by specifying a partition and a qos in your batch script.
At the command prompt, you can run sinfo to get some information about the available partitions, and rcchelp qos to learn more about the qos.
#SBATCH --exclusive        Exclusive access to all nodes requested; no other jobs may run on them
#SBATCH --partition        Specifies the partition (ex: 'sandyb', 'westmere')
#SBATCH --constraint=ib    For the sandyb partition, request nodes with Infiniband
#SBATCH --qos              Quality of Service (ex: 'normal', 'debug')
#SBATCH --account          Sets the account to be charged for use (the course name CSPP51087)
GPU
Midway has 4 GPU nodes, each with two GPUs. To request a job to run on these nodes, use the Slurm option:
--gres=gpu:
Specifies the number of GPUs to use.
Please contact RCC staff prior to using GPU nodes. We will gladly assist you in making sure your sbatch script is written to properly use the nodes.
Exercise 5: Use rcchelp to download, compile, and submit a parallel job
rcchelp is a custom command-line program to provide online help and code examples. Help on software topics can be accessed with the rccsoftware shortcut. Run this command to see available topics:
rccsoftware
The output should look similar to this:
...
c Compile and run a C program []
fftw Fastest Fourier Transform in the West []
gcc Compile and run a C program []
mpi Compile and run an MPI program []
namd Submission script and sample files for NAMD []
...
The left column contains topics that can be passed to the rccsoftware command. Enter:
rccsoftware mpi
into the command line and follow the instructions. Choose Yes when you are given the option to download files to your home directory. The final output should look like:
The following files were copied locally to:
/home/$HOME/rcchelp/software/mpi.rcc-docs
hello.c
mpi.sbatch
README
The information that is printed to the screen can also be reviewed in the README file. Follow the instructions to compile and run the parallel Hello World code.
Exercise 6: Interact With Your Submitted Jobs
The status of submitted jobs can be viewed and altered in several ways. The primary command, squeue, is part of a versatile system of job monitoring.
Example:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3518933 depablo polyA6.0 ccchiu PD 0:00 1 (QOSResourceLimit)
3519981 depablo R_AFM mcgovern PD 0:00 1 (Resources)
3519987 depablo R_AFM mcgovern PD 0:00 1 (Priority)
3519988 depablo R_AFM mcgovern PD 0:00 1 (Priority)
3539126 gpu _interac jcarmas R 45:52 1 midway231
3538957 gpu test.6.3 jwhitmer R 58:52 1 midway230
3535743 kicp Alleturb agertz PD 0:00 6 (QOSResourceLimit)
3535023 kicp hf.b64.L mvlima R 5:11:46 1 midway217
3525370 westmere phase_di khaira R 4:50:02 1 midway008
3525315 westmere phase_di khaira R 4:50:03 1 midway004
3525316 westmere phase_di khaira R 4:50:03 1 midway004
The above tells us:
Name               Description
JOBID              Job ID #, unique reference number for each job
PARTITION          Partition job will run on
NAME               Name for the job, defaults to slurm-JobID
USER               User who submitted job
ST                 State of the job
TIME               Time used by the job in D-HH:MM:SS
NODES              Number of Nodes consumed
NODELIST(REASON)   List of Nodes consumed, or reason the job has not started running
The output of squeue is customizable.
Example:
squeue --user CNet -i 5
The above will show only the jobs of user CNet and will refresh every 5 seconds.
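Because squeue prints plain columns, its output can also be summarized with standard tools. The following sketch counts jobs per state using the ST column (the fifth field in the default layout shown above); count_states is a hypothetical helper name, and the field position is an assumption that breaks if a job name contains spaces or a custom output format is used.

```shell
# Read squeue-style output on stdin, skip the header line,
# and print how many jobs are in each state (ST column).
count_states() {
  awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}
```

On Midway this could be used as squeue --user CNet | count_states, printing one line per state such as PD or R followed by the number of jobs in that state.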
6.1 Canceling Jobs
Cancel one job:
scancel <JobID>
or cancel all of your jobs at the same time:
scancel --user <User Name>
6.2 More Job Information
scontrol show job <JobID>
Example:
scontrol show job 3560876
JobId=3560876 Name=sleep
UserId=dylanphall(1832378456) GroupId=dylanphall(1832378456)
Priority=17193 Account=rcc-staff QOS=normal
JobState=CANCELLED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
RunTime=00:00:10 TimeLimit=1-12:00:00 TimeMin=N/A
SubmitTime=2013-01-09T11:39:40 EligibleTime=2013-01-09T11:39:40
StartTime=2013-01-09T11:39:40 EndTime=2013-01-09T11:39:50
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=sandyb AllocNode:Sid=midway-login2:24907
ReqNodeList=(null) ExcNodeList=(null)
NodeList=midway113
BatchHost=midway113
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/bin/sleep
WorkDir=/scratch/midway/dylanphall/repositories/pubsw/rccdocs.sphinx
sacct[-plus] gives detailed reports on user(s). sacct-plus is an RCC wrapper around sacct that offers easier manipulation options.
Exercise 7: Learn About Your Storage
Participants in CSPP51087 have access to two types of storage:
- Home - for personal configurations, private data, limited space
/home/[user name]
- Scratch - fast, for daily use
/scratch/midway/[user name]
Scratch is also accessible via a symlink in your home directory: $HOME/scratch-midway
To find the limits, enter:
quota
Disk quotas for user dylanphall:
Filesystem type used quota limit files grace
---------- -------- --------- --------- --------- ---------- --------
home USR 2.90 G 10.00 G 12.00 G 39987 none
midway-scratch USR 24.26 G 5.00 T 6.00 T 106292 none
Descriptions of the fields:
- Filesystem - This is the file system or file set where this quota is valid.
- type - This is the type of quota. This can be USR for a user quota, GRP for a group quota, or FILESET for a file set quota. File set quotas can be considered a directory quota. USR and GRP quotas can exist within a FILESET quota to further limit a user or group quota inside a file set.
- used - This is the amount of disk space used for the specific quota.
- quota - This is the quota or soft quota. It is possible for usage to exceed the quota for the grace period or up to the hard limit.
- limit - This is the limit or hard quota that is set. It is not possible for usage to exceed this limit.
- files - This is the number of files currently counted in the quota. There are currently no quotas enforced on the number of files.
- grace - This is the grace period, which is the amount of time remaining that the quota can be exceeded. It is currently set to start at 7 days. The value none means that the quota is not exceeded. After the quota has been exceeded for longer than the grace period, it will no longer be possible to create new files.
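To see how close usage is to the soft quota, the quota output can be reduced to a percentage per file system. This is a rough sketch that assumes the column layout shown above (value and unit as separate fields) and assumes that T means 1024 G and M means 1/1024 G; the units and rounding used by the real quota tool may differ. quota_percent is a hypothetical helper name.

```shell
# Read quota-style output on stdin and print, for each quota line,
# the file system name and the percentage of the soft quota in use.
# Assumes fields: name type used unit quota unit limit unit files grace.
quota_percent() {
  awk '
    # Convert a value with unit M, G, or T to G (assumed factor 1024).
    function to_g(value, unit) {
      if (unit == "T") return value * 1024
      if (unit == "M") return value / 1024
      return value
    }
    $2 == "USR" || $2 == "GRP" || $2 == "FILESET" {
      used = to_g($3, $4)
      soft = to_g($5, $6)
      if (soft > 0) printf "%s %.1f\n", $1, 100 * used / soft
    }'
}
```

On Midway it could be used as quota | quota_percent. For the sample output above, home is at about 29% of its soft quota.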
7.1 Explore File Backups & Restoration
Snapshots
Automated snapshots of users’ home directories are available in case of accidental file deletion or other problems. Currently snapshots are available for these time periods:
- 4 hourly snapshots
- 7 daily snapshots
- 4 weekly snapshots
You can find snapshots in these directories:
/snapshots/*/home/CNetID - Home snapshots
The subdirectories refer to the frequency and time of the backup, e.g. daily-2013-10-04.06h15 or hourly-2013-10-09.11h00.
Try recovering a file from the snapshot directory.
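One way to approach the recovery is to copy the file out of a snapshot next to the live copy instead of overwriting it. The sketch below is an assumption about workflow, not an RCC tool: restore_from_snapshot is a hypothetical helper, and the .restored suffix is chosen only so the current file is left untouched.

```shell
# Copy one file from a snapshot of a home directory into the live one.
# $1: snapshot home directory, e.g. /snapshots/daily-2013-10-04.06h15/home/CNetID
# $2: path of the file relative to the home directory
# $3: live home directory (defaults to $HOME)
# The copy is written as <file>.restored so nothing is overwritten.
restore_from_snapshot() {
  snap_home=$1
  rel=$2
  dest=${3:-$HOME}
  if [ ! -f "$snap_home/$rel" ]; then
    echo "not found in snapshot: $rel" >&2
    return 1
  fi
  mkdir -p "$(dirname "$dest/$rel")"
  cp -p "$snap_home/$rel" "$dest/$rel.restored"
}
```

After checking the restored copy, it can be renamed over the original with mv if that is what you want.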
Backup
Backups are performed on a nightly basis to a tape machine located in a different data center than the main storage system. These backups are meant to safeguard against disasters such as hardware failure or events that could cause the loss of the main data center. Users should make use of the snapshots described above to recover files.