.. _intro-to-rcc-workshop:

.. index:: Workshops; Introduction to RCC

###################################
Introduction to RCC for CSPP 51087
###################################

The following exercises will walk you through getting access to Midway and
running simple jobs there. Additional information about Midway and its
environment can be found on the RCC website. Contact us at
help@rcc.uchicago.edu if you have any problems. For course-related
information on CSPP 51087, contact the TAs or the professor.

Exercise 1: Log in to Midway
============================

Access to RCC is provided via secure shell (SSH) login, a tool that allows
you to connect securely from any computer (including most smartphones and
tablets). All users must have a UChicago `CNetID`_ to log in to any RCC
systems. Your RCC account credentials are your CNetID and password:

+----------+-------------------------+
| Username | CNetID                  |
+----------+-------------------------+
| Password | CNetID password         |
+----------+-------------------------+
| Hostname | midway.rcc.uchicago.edu |
+----------+-------------------------+

.. note:: RCC does not store your CNetID password and cannot reset it for
   you. If you require password assistance, please use the `IT Services`_
   `CNet Password Recovery`_ tool.

Most UNIX-like operating systems (Mac OS X, Linux, etc.) provide an SSH
utility by default that can be accessed by typing the command
:command:`ssh` in a terminal. To log in to Midway from a Linux/Mac
computer, open a terminal and enter the following at the command line,
replacing ``<CNetID>`` with your CNetID:

.. code:: bash

   ssh <CNetID>@midway.rcc.uchicago.edu

Windows users will first need to download an SSH client, such as PuTTY_,
which will allow you to interact with the remote Unix server. Use the
hostname :file:`midway.rcc.uchicago.edu` and your CNetID username and
password to access the RCC login node.

Exercise 2: Explore the Module System
=====================================

The :command:`module` system is a script-based system for managing the
user environment. Try running the commands below and review the output to
learn more about the module system.

Basic ``module`` commands:

+----------------------+-----------------------------------------------------+
| Command              | Description                                         |
+======================+=====================================================+
| module avail [name]  | lists modules matching [name] (all if [name] empty) |
+----------------------+-----------------------------------------------------+
| module load [name]   | loads the named module                              |
+----------------------+-----------------------------------------------------+
| module unload [name] | unloads the named module                            |
+----------------------+-----------------------------------------------------+
| module list          | lists the modules currently loaded for the user     |
+----------------------+-----------------------------------------------------+

Example - Matlab:

:command:`module list`

.. code:: bash

   Currently Loaded Modulefiles:
     1) slurm/2.4        3) subversion/1.6   5) env/rcc          7) tree/1.6.0
     2) vim/7.3          4) emacs/23.4       6) git/1.7

:command:`module avail`

.. code:: bash

   -------------------------------- /software/modulefiles ---------------------------------
   Minuit2/5.28(default)              intelmpi/4.0
   Minuit2/5.28+intel-12.1            intelmpi/4.0+intel-12.1(default)
   R/2.15(default)                    jasper/1.900(default)
   ...
   ifrit/3.4(default)                 x264/stable(default)
   intel/11.1                         yasm/1.2(default)
   intel/12.1(default)
   ------------------------- /usr/share/Modules/modulefiles -------------------------------
   dot           module-cvs    module-info   modules       null          use.own
   ----------------------------------- /etc/modulefiles -----------------------------------
   env/rcc            samba/3.6          slurm/2.3          slurm/2.4(default)
   --------------------------------------- Aliases ----------------------------------------

:command:`module avail matlab`

.. code:: bash

   -------------------------------- /software/modulefiles ---------------------------------
   matlab/2011b           matlab/2012a(default)           matlab/2012b
   -----------------------------------------------------------------------------------------

Before a Matlab module is loaded, :command:`which matlab` finds nothing:

.. code:: bash

   which matlab

:command:`module load matlab`

.. code:: bash

   which matlab
   /software/matlab-2012a-x86_64/bin/matlab

:command:`module list`

.. code:: bash

   Currently Loaded Modulefiles:
     1) slurm/2.4        3) subversion/1.6   5) env/rcc          7) tree/1.6.0
     2) vim/7.3          4) emacs/23.4       6) git/1.7          8) matlab/2012a

:command:`module unload matlab`

.. code:: bash

   which matlab

:command:`module load matlab/2011b`

.. code:: bash

   which matlab
   /software/matlab-2011b-x86_64/bin/matlab

:command:`module list`

.. code:: bash

   Currently Loaded Modulefiles:
     1) slurm/2.4        3) subversion/1.6   5) env/rcc          7) tree/1.6.0
     2) vim/7.3          4) emacs/23.4       6) git/1.7          8) matlab/2011b
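
To put these commands together, here is a minimal sketch of a typical
module workflow (the version shown is just an example; check
:command:`module avail matlab` for what is currently installed):

.. code:: bash

   module avail matlab         # list the available Matlab versions
   module load matlab/2012a    # load a specific version
   module list                 # confirm that it is loaded
   which matlab                # the matlab binary is now on your PATH
   module unload matlab/2012a  # remove it from your environment again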

Exercise 3: Set up an MPI Environment on Midway
===============================================

RCC provides several compiler suites on Midway, as well as several MPI
environments. For most users these should be completely interchangeable;
however, some codes perform differently or experience problems with
certain combinations.

+----------+-------------------------------------------+
| Compiler | Module(s)                                 |
+----------+-------------------------------------------+
| Intel    | intel/11.1 intel/12.1(default) intel/13.0 |
+----------+-------------------------------------------+
| Portland | pgi/2012(default)                         |
+----------+-------------------------------------------+
| GNU      | No module necessary                       |
+----------+-------------------------------------------+

A complete MPI environment is composed of an MPI installation and a
compiler. Each module follows the naming convention
:command:`[mpi]/[mpi version]+[compiler]-[compiler version]`. Not all
combinations of compiler and MPI environment are supported, but most are.
Once you load a module, the standard MPI commands (:command:`mpicc`,
:command:`mpirun`, etc.) will function normally.

+-----------------+------------------------------------------------------+----------------------------------+
| MPI Environment | URL                                                  | Modules                          |
+-----------------+------------------------------------------------------+----------------------------------+
| OpenMPI         | http://www.openmpi.org                               | openmpi/1.6(default)             |
|                 |                                                      | openmpi/1.6+intel-12.1           |
|                 |                                                      | openmpi/1.6+pgi-2012             |
+-----------------+------------------------------------------------------+----------------------------------+
| Intel MPI       | http://software.intel.com/en-us/intel-mpi-library    | intelmpi/4.0                     |
|                 |                                                      | intelmpi/4.0+intel-12.1(default) |
|                 |                                                      | intelmpi/4.1                     |
|                 |                                                      | intelmpi/4.1+intel-12.1          |
|                 |                                                      | intelmpi/4.1+intel-13.0          |
+-----------------+------------------------------------------------------+----------------------------------+
| Mvapich2        | http://mvapich.cse.ohio-state.edu/overview/mvapich2/ | mvapich2/1.8(default)            |
|                 |                                                      | mvapich2/1.8+intel-12.1          |
|                 |                                                      | mvapich2/1.8+pgi-2012            |
+-----------------+------------------------------------------------------+----------------------------------+

.. note:: A code compiled with one MPI module will generally not run
   properly with another. If you try several MPI modules, be very careful
   to recompile your code. Each compiler uses different flags and default
   options; use :command:`mpicc -show` to see the compiler and the default
   command-line flags that MPI passes to it.

Exercise 4: Run a job on Midway
===============================

The Slurm_ scheduler is used to schedule jobs and manage resources. Jobs
are either **interactive**, in which the user logs directly into a compute
node and performs tasks directly, or **batch**, where a job script is
executed by the scheduler on behalf of the user. Interactive jobs are
useful during development and debugging, but users will need to wait for
nodes to become available.

Interactive Use
---------------

To request one processor to use interactively, use the
:command:`sinteractive` command with no further options:

.. code:: bash

   sinteractive

The :command:`sinteractive` command provides many options for reserving
processors. For example, two cores, instead of the default of one, could
be reserved for four hours in the following manner:

.. code:: bash

   sinteractive --ntasks-per-node=2 --time=4:00:00

The option :command:`--constraint=ib` can be used to ensure that
Infiniband-connected nodes are reserved. Infiniband is a fast networking
option that permits up to 40x the bandwidth of gigabit ethernet on Midway.
Multi-node jobs that use MPI **must** request Infiniband.

.. code:: bash

   sinteractive --constraint=ib

Batch jobs
----------

An example **sbatch** script:

.. code:: bash

   #!/bin/bash
   #SBATCH --job-name=test
   #SBATCH --output=test.out
   #SBATCH --error=test.err
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=16
   #SBATCH --partition=sandyb
   #SBATCH --constraint=ib
   #SBATCH --account=CSPP51087

   # load your modules here
   module load intel

   # execute your tasks here
   mpirun ./hello_world

When a scheduled job runs, Slurm sets many environment variables that may
be helpful to query from within your job. You can explore them by adding

.. code:: bash

   env | grep SLURM

to the sbatch script above. The output will be found in the file defined
in your script header.
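
Putting the pieces together, here is a minimal sketch of the full batch
cycle, assuming the script above is saved as :file:`test.sbatch` and your
MPI source file is :file:`hello_world.c` (both names are just examples):

.. code:: bash

   module load openmpi                  # load an MPI environment (default OpenMPI)
   mpicc hello_world.c -o hello_world   # compile with the MPI wrapper
   sbatch test.sbatch                   # submit the script to the scheduler
   cat test.out                         # inspect the output once the job ends

As noted in Exercise 3, make sure the module loaded inside the job script
matches the one used at compile time.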

Different types of hardware are usually organized by **partition**. Rules
about job limits, such as the maximum wallclock time and the maximum
number of CPUs that may be requested, are governed by a **QOS** (Quality
of Service). You can target appropriate compute nodes for your job by
specifying a partition and a QOS in your batch script. At the command
prompt, you can run :command:`sinfo` to get some information about the
available partitions, and :command:`rcchelp qos` to learn more about the
QOS.

.. code:: bash

   #SBATCH --exclusive       # exclusive access to all requested nodes; no other jobs may run there
   #SBATCH --partition       # specifies the partition (ex: 'sandyb', 'westmere')
   #SBATCH --constraint=ib   # for the sandyb partition, request nodes with Infiniband
   #SBATCH --qos             # Quality of Service (ex: 'normal', 'debug')
   #SBATCH --account         # sets the account to be charged for use (the course name CSPP51087)

GPU
---

Midway has 4 GPU nodes, each with two GPUs. To request a job to run on
these nodes, use the Slurm option:

:option:`--gres=gpu:<1,2>`
   Specifies the number of GPUs to use (1 or 2)

**Please contact RCC staff prior to using the GPU nodes.** We will gladly
assist you in making sure your sbatch script is written to properly use
the nodes.
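
As an illustration, the GPU-related header lines of an sbatch script might
look like the following sketch (the ``gpu`` partition name is taken from
the :command:`squeue` output shown in Exercise 6; confirm it with
:command:`sinfo` before relying on it):

.. code:: bash

   #SBATCH --partition=gpu   # the GPU nodes live in their own partition
   #SBATCH --gres=gpu:1      # request one of the two GPUs on a node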

Exercise 5: Use rcchelp to download, compile, and submit a parallel job
=======================================================================

:command:`rcchelp` is a custom command-line program that provides online
help and code examples. Help on software topics can be accessed with the
:command:`rccsoftware` shortcut. Run this command to see available topics:

.. code:: bash

   rccsoftware

The output should look similar to this:

.. code:: bash

   ...
   c        Compile and run a C program                    []
   fftw     Fastest Fourier Transform in the West          []
   gcc      Compile and run a C program                    []
   mpi      Compile and run an MPI program                 []
   namd     Submission script and sample files for NAMD    []
   ...

The left column contains topics that can be passed to the
:command:`rccsoftware` command. Enter:

.. code:: bash

   rccsoftware mpi

into the command line and follow the instructions. Choose :option:`Yes`
when you are given the option to download files to your home directory.
The final output should look like:

.. code:: bash

   The following files were copied locally to:
   $HOME/rcchelp/software/mpi.rcc-docs
     hello.c
     mpi.sbatch
     README

The information that is printed to the screen can also be found and
reviewed in the README file. Follow the instructions to compile and run
the parallel Hello World code.

Exercise 6: Interact With Your Submitted Jobs
=============================================

The status of submitted jobs can be viewed and altered by several means.
The primary command, **squeue**, is part of a versatile system of job
monitoring.

Example:

.. code:: bash

   squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   3518933   depablo polyA6.0   ccchiu PD       0:00      1 (QOSResourceLimit)
   3519981   depablo    R_AFM mcgovern PD       0:00      1 (Resources)
   3519987   depablo    R_AFM mcgovern PD       0:00      1 (Priority)
   3519988   depablo    R_AFM mcgovern PD       0:00      1 (Priority)
   3539126       gpu _interac  jcarmas  R      45:52      1 midway231
   3538957       gpu test.6.3 jwhitmer  R      58:52      1 midway230
   3535743      kicp Alleturb   agertz PD       0:00      6 (QOSResourceLimit)
   3535023      kicp hf.b64.L   mvlima  R    5:11:46      1 midway217
   3525370  westmere phase_di   khaira  R    4:50:02      1 midway008
   3525315  westmere phase_di   khaira  R    4:50:03      1 midway004
   3525316  westmere phase_di   khaira  R    4:50:03      1 midway004

The above tells us:

+------------------+-------------------------------------------------------------------+
| Name             | Description                                                       |
+==================+===================================================================+
| JOBID            | Job ID #, a unique reference number for each job                  |
+------------------+-------------------------------------------------------------------+
| PARTITION        | Partition the job will run on                                     |
+------------------+-------------------------------------------------------------------+
| NAME             | Name of the job, defaults to slurm-JobID                          |
+------------------+-------------------------------------------------------------------+
| USER             | User who submitted the job                                        |
+------------------+-------------------------------------------------------------------+
| ST               | State of the job                                                  |
+------------------+-------------------------------------------------------------------+
| TIME             | Time used by the job in D-HH:MM:SS                                |
+------------------+-------------------------------------------------------------------+
| NODES            | Number of nodes consumed                                          |
+------------------+-------------------------------------------------------------------+
| NODELIST(REASON) | List of nodes consumed, or reason the job has not started running |
+------------------+-------------------------------------------------------------------+

squeue's output is customizable. Example:

.. code:: bash

   squeue --user CNet -i 5

The above will only show jobs for user :option:`CNet` and will refresh
every :option:`5` seconds.
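
The standard Slurm ``--format`` option can tailor the columns further; a
small sketch (the format codes below are standard :command:`squeue`
specifiers for job id, partition, name, state, and elapsed time):

.. code:: bash

   # show only your own jobs, with a custom set of columns
   squeue --user $USER --format="%.10i %.12P %.20j %.8T %.12M"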

6.1 Canceling Jobs
------------------

Cancel one job:

:command:`scancel [JobID]`

or cancel all of your jobs at the same time:

:command:`scancel --user [CNetID]`

6.2 More Job Information
------------------------

:command:`scontrol show job [JobID]`

Example:

.. code:: bash

   scontrol show job 3560876
   JobId=3560876 Name=sleep
      UserId=dylanphall(1832378456) GroupId=dylanphall(1832378456)
      Priority=17193 Account=rcc-staff QOS=normal
      JobState=CANCELLED Reason=None Dependency=(null)
      Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
      RunTime=00:00:10 TimeLimit=1-12:00:00 TimeMin=N/A
      SubmitTime=2013-01-09T11:39:40 EligibleTime=2013-01-09T11:39:40
      StartTime=2013-01-09T11:39:40 EndTime=2013-01-09T11:39:50
      PreemptTime=None SuspendTime=None SecsPreSuspend=0
      Partition=sandyb AllocNode:Sid=midway-login2:24907
      ReqNodeList=(null) ExcNodeList=(null)
      NodeList=midway113
      BatchHost=midway113
      NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
      MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
      Features=(null) Gres=(null) Reservation=(null)
      Shared=OK Contiguous=0 Licenses=(null) Network=(null)
      Command=/bin/sleep
      WorkDir=/scratch/midway/dylanphall/repositories/pubsw/rccdocs.sphinx

:command:`sacct` gives detailed reports on user jobs. ``sacct-plus`` is an
RCC wrapper around ``sacct`` that provides easier manipulation options.

Exercise 7: Learn About Your Storage
====================================

Participants in CSPP51087 have access to two types of storage:

#. Home - for personal configurations, private data, limited space:
   :file:`/home/[user name]`

#. Scratch - fast, for daily use:
   :file:`/scratch/midway/[user name]`

Scratch is also accessible via a symlink in your home directory:
:file:`$HOME/scratch-midway`

To find the limits, enter:

.. code:: bash

   quota

   Disk quotas for user dylanphall:

   Filesystem       type      used     quota     limit    files  grace
   ----------   --------  --------  --------  --------  -------  -----
   home              USR    2.90 G   10.00 G   12.00 G    39987   none
   midway-scratch    USR   24.26 G    5.00 T    6.00 T   106292   none

Descriptions of the fields:

.. describe:: Filesystem

   The file system or file set where this quota is valid.

.. describe:: type

   The type of quota. This can be *USR* for a user quota, *GRP* for a
   group quota, or *FILESET* for a file set quota. File set quotas can be
   considered a directory quota. *USR* and *GRP* quotas can exist within a
   *FILESET* quota to further limit a user or group quota inside a file
   set.

.. describe:: used

   The amount of disk space used for the specific quota.

.. describe:: quota

   The quota or soft quota. It is possible for usage to exceed the quota
   for the grace period or up to the hard limit.

.. describe:: limit

   The limit or hard quota that is set. It is not possible for usage to
   exceed this limit.

.. describe:: files

   The number of files currently counted in the quota. There are
   currently no quotas enforced on the number of files.

.. describe:: grace

   The grace period, which is the amount of time remaining that the quota
   can be exceeded. It is currently set to start at 7 days. The value
   *none* means that the quota is not exceeded. After the quota has been
   exceeded for longer than the grace period, it will no longer be
   possible to create new files.

7.1 Explore File Backups & Restoration
--------------------------------------

Snapshots
~~~~~~~~~

Automated snapshots of users' home directories are available in case of
accidental file deletion or other problems. Snapshots are currently
available for these time periods:

* 4 hourly snapshots
* 7 daily snapshots
* 4 weekly snapshots

You can find snapshots in these directories:

* :file:`/snapshots/*/home/{CNetID}` -- Home snapshots

The subdirectories refer to the frequency and time of the backup, e.g.
:file:`daily-2013-10-04.06h15` or :file:`hourly-2013-10-09.11h00`.
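
A minimal sketch of recovering a file from a daily snapshot follows (the
snapshot name and the file name here are hypothetical; list
:file:`/snapshots` to see what is actually available):

.. code:: bash

   ls /snapshots                      # see which snapshots exist
   # copy a (hypothetical) deleted file back from a daily snapshot
   cp /snapshots/daily-2013-10-04.06h15/home/$USER/lost_file.txt $HOME/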

Try recovering a file from one of the snapshot directories yourself.

Backup
~~~~~~

Backups are performed on a nightly basis to a tape machine located in a
different data center than the main storage system. These backups are
meant to safeguard against disasters, such as hardware failure or events
that could cause the loss of the main data center. Users should make use
of the snapshots described above to recover files.