.. _intro-to-rcc-workshop:

.. index:: Workshops; Introduction to RCC

######################################
Introduction to RCC Hands-on Session
######################################

These materials are intended to be used during the hands-on session of the
**Introduction to RCC** workshop.

Exercise 1 Get access to RCC
=============================

1.1 Request an Account
----------------------

Request a user account by completing the applicable form found at
http://rcc.uchicago.edu/accounts.

* If you are an eligible Principal Investigator, go directly to the
  `PI Account Request`_ form.
* If you are a student, post-doc, or otherwise a member of a UChicago
  research group, complete the general `User Account Request`_ form.
* RCC will create these accounts during the workshop if possible.
* PIs must acknowledge and approve the request for a user to be added to
  their projects, so in some cases the activation time will be longer.

1.2 Temporary Accounts
----------------------

If you do not have an account, we can give you temporary access. Let the
instructor know you need a temporary account and they will give you a
Yubikey. The Yubikey allows you to access guest accounts as indicated in the
images below. Identify your guest username, which is **rccguest** plus the
last four digits of your Yubikey identification number.

.. image:: intro-to-rcc-workshop/yubikey-digit.png
   :scale: 35 %
   :alt: Top of Yubikey with digits

Note the button on the bottom of the Yubikey, which will enter your password
when pressed.

.. image:: intro-to-rcc-workshop/yubikey-contact.png
   :scale: 35 %
   :alt: Bottom of Yubikey with pass phrase contact

#. Insert the Yubikey into a USB slot and ssh to Midway.
#. Username: **rccguest####**
#. Password: **touch the contact**
#. Let the instructor know if you have trouble.

Exercise 2 Log in to Midway
=====================================================

Access to RCC is provided via secure shell (SSH) login, a tool that allows
you to connect securely from any computer (including most smartphones and
tablets). All users must have a UChicago CNetID to log in to any RCC systems.
Your RCC account credentials are your CNetID and password:

+----------+---------------------------+
| Username | CNetID                    |
+----------+---------------------------+
| Password | CNetID password           |
+----------+---------------------------+
| Hostname | midway.rcc.uchicago.edu   |
+----------+---------------------------+

.. note:: RCC does not store your CNetID password and we are unable to reset
   your password. If you require password assistance, please contact
   UChicago IT Services.

Most UNIX-like operating systems (Mac OS X, Linux, etc.) provide an SSH
utility by default that can be accessed by typing the command :command:`ssh`
in a terminal. To log in to Midway from a Linux/Mac computer, open a terminal
and at the command line enter:

.. code:: bash

   ssh <CNetID>@midway.rcc.uchicago.edu

Windows users will first need to download an SSH client, such as PuTTY_,
which will allow you to interact with the remote Unix server. Use the
hostname :file:`midway.rcc.uchicago.edu` and your CNetID username and
password to access the Midway login node.

Exercise 3 Reserve a node to work on interactively
===================================================

RCC uses the SLURM resource manager to schedule jobs. To request one
processor to use interactively, use the :command:`sinteractive` command with
no further options:

.. code:: bash

   sinteractive

.. note:: We have reserved nodes for use during this workshop. If Midway is
   under heavy use, include the option :file:`--reservation=intro` in your
   SLURM commands to access nodes reserved for this workshop session. For
   example:

   .. code:: bash

      sinteractive --reservation=intro
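Once the prompt returns, you are working on a compute node rather than the
login node. A quick way to confirm this is to check the hostname and the job
environment (a minimal sketch; the exact hostname and job ID will differ):

.. code:: bash

   hostname             # should print a compute node name, not a login node
   echo $SLURM_JOB_ID   # SLURM sets this inside every allocated job
   exit                 # release the node when you are done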
The :command:`sinteractive` command provides many more options for reserving
processors. For example, two cores, instead of the default of one, could be
reserved for four hours in the following manner:

.. code:: bash

   sinteractive --nodes=1 --ntasks-per-node=2 --time=4:00:00

The option :file:`--constraint=ib` can be used to ensure that an
Infiniband-connected node is reserved. Infiniband is a fast networking option
that permits up to 40x the bandwidth of gigabit ethernet on Midway.

.. note:: Some examples related to MPI/parallel jobs will only work on
   Infiniband-capable nodes. Be sure to reserve an IB node for such exercises
   below, e.g.:

   .. code:: bash

      sinteractive --constraint=ib

Exercise 4 Submit a job to the scheduler
==================================================

RCC resources are shared by the entire University community. Sharing
computational resources creates unique challenges:

* Jobs must be scheduled in a fair manner.
* Resource consumption needs to be accounted.
* Access needs to be controlled.

Thus, a **scheduler** is used to manage job submissions to the cluster. We
use the Slurm_ scheduler to manage our cluster.

How to submit a job
--------------------

#. ``sinteractive`` - Gets the specified resources and logs the user onto them
#. ``sbatch`` - Runs a script which defines resources and commands

.. #. ``srun`` - Runs a command on the specified resources

SBATCH scripts contain two major elements. After the **#!/bin/bash** line, a
series of **#SBATCH** parameters are defined. These are read by the
scheduler, SLURM, and relay information about what specific hardware is
required to execute the job, how long that hardware is required, and where
the output and error (stdout and stderr streams) should be filed. If
resources are available, the job may start less than one second following
submission. When the queue is busy and the resource request is substantial,
the job may be placed in line with other jobs awaiting execution.

The second major element of an sbatch script is the user-defined commands.
When the resource request is granted, the script is executed just as if it
were run interactively. The #SBATCH lines appear as comments to the bash
interpreter.

An example **sbatch** script:

.. code:: bash

   #!/bin/bash
   #SBATCH --job-name=test
   #SBATCH --output=test.out
   #SBATCH --error=test.err
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1

   # load your modules here
   module load intel

   # execute your tasks here
   echo "Hello, world"
   date
   ls
   pwd
   hostname
   echo "Done with processing"

Copy and modify the script above, then submit it to the scheduler a few
times. If a reservation is active for this workshop, you can add
:file:`#SBATCH --reservation=intro` to the script header to ensure that the
script runs on the reserved nodes. This reservation is not required.

When a scheduled job runs, SLURM sets many environment variables that may be
helpful to query from within your job. You can explore the environment
variables by adding

.. code:: bash

   env | grep SLURM

to the sbatch script above. The output will be found in the output file(s)
defined in your script header.
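A minimal submit-and-check workflow might look like the following (the script
file name ``test.sbatch`` is just an illustrative choice):

.. code:: bash

   sbatch test.sbatch    # prints the new job's ID on submission
   squeue --user=$USER   # watch the job move from PD (pending) to R (running)
   cat test.out          # inspect stdout once the job completes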
Exercise 5 Use rcchelp to download, compile, and submit a parallel job
======================================================================

.. note:: This exercise requires an Infiniband-capable compute node. See the
   exercises above for information about how to reserve a compute node to
   work on interactively.

:command:`rcchelp` is a custom command-line program that provides online help
and code examples. Help on software topics can be accessed with the
:command:`rccsoftware` shortcut. Run this command to see available topics:

.. code:: bash

   rccsoftware

The output should look similar to this:

.. code:: bash

   ...
   c      Compile and run a C program                  []
   fftw   Fastest Fourier Transform in the West        []
   gcc    Compile and run a C program                  []
   mpi    Compile and run an MPI program               []
   namd   Submission script and sample files for NAMD  []
   ...

The left column contains topics that can be passed to the
:command:`rccsoftware` command. Enter:

.. code:: bash

   rccsoftware mpi

into the command line and follow the instructions. Choose :option:`Yes` when
you are given the option to download files to your home directory. The final
output should look like:

.. code:: bash

   The following files were copied locally to:
   $HOME/rcchelp/software/mpi.rcc-docs
       hello.c
       mpi.sbatch
       README

The information that is printed to the screen can be found and reviewed in
the README file. Follow the instructions to compile and run the parallel
Hello World code.

Exercise 6 Target Hardware With Job Submission Scripts
=======================================================

Different types of hardware are usually organized by **partition**. Rules
about job limits, such as the maximum wallclock time and the maximum number
of CPUs that may be requested, are governed by a **QOS** (Quality of
Service). You can target appropriate compute nodes for your job by specifying
a partition and a qos in your batch script.

.. note:: Queue parameters change frequently. Always refer to the official
   documentation and test your submissions before submitting a large run of
   jobs.

.. code:: bash

   #SBATCH --exclusive   # Exclusive access to all requested nodes; no other jobs may run there
   #SBATCH --partition   # Specifies the partition (ex: 'sandyb', 'westmere')
   #SBATCH --qos         # Quality of Service (ex: 'normal', 'debug')

6.1 Specific Configurations
---------------------------

* GPU

  :option:`--gres=gpu:<1,2>` Specifies number of GPUs to use

* Big Memory - Nodes with >=256G RAM

  :option:`--partition=bigmem`

  By default, SLURM allocates 2GB of memory per CPU core being used in a job.
  This follows from the fact that most Midway nodes contain 16 cores and 32GB
  of memory. If your job requires more than the default amount of memory per
  core, you must include the option :option:`--mem-per-cpu` in your sbatch
  job script. For example, to use 16 CPU cores and 256GB of memory on a
  bigmem node, the required sbatch flags would be:
  :option:`--partition=bigmem --ntasks=16 --cpus-per-task=1 --mem-per-cpu=16000`

* Westmere - Not as fast as 'sandyb', 12 cores per node (vice 16)

  :option:`--partition=westmere`

At the command prompt, you can run :command:`sinfo` to get some information
about the available partitions, and :command:`rcchelp qos` to learn more
about the qos. Try to modify a submission script from one of the earlier
examples to run on the **bigmem** partition, as sketched below.
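One possible set of header lines for such a modification (the job name and
memory value here are illustrative; check :command:`rcchelp qos` for the
current limits):

.. code:: bash

   #!/bin/bash
   #SBATCH --job-name=bigmem-test
   #SBATCH --output=bigmem-test.out
   #SBATCH --partition=bigmem
   #SBATCH --ntasks=1
   #SBATCH --mem-per-cpu=32000   # request 32 GB for the single task

   # the job body can stay the same as in the earlier example
   hostname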
Exercise 7 Interact With Your Submitted Jobs
=============================================

The status of submitted jobs is viewable and alterable by several means. The
primary command **squeue** is part of a versatile system of job monitoring.

Example:

.. code:: bash

   squeue

     JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   3518933   depablo polyA6.0   ccchiu  PD       0:00      1 (QOSResourceLimit)
   3519981   depablo    R_AFM mcgovern  PD       0:00      1 (Resources)
   3519987   depablo    R_AFM mcgovern  PD       0:00      1 (Priority)
   3519988   depablo    R_AFM mcgovern  PD       0:00      1 (Priority)
   3539126       gpu _interac  jcarmas   R      45:52      1 midway231
   3538957       gpu test.6.3 jwhitmer   R      58:52      1 midway230
   3535743      kicp Alleturb   agertz  PD       0:00      6 (QOSResourceLimit)
   3535023      kicp hf.b64.L   mvlima   R    5:11:46      1 midway217
   3525370  westmere phase_di   khaira   R    4:50:02      1 midway008
   3525315  westmere phase_di   khaira   R    4:50:03      1 midway004
   3525316  westmere phase_di   khaira   R    4:50:03      1 midway004

The above tells us:

+------------------+-------------------------------------------------------------------+
| Name             | Description                                                       |
+==================+===================================================================+
| JOBID            | Job ID #, unique reference number for each job                    |
+------------------+-------------------------------------------------------------------+
| PARTITION        | Partition the job will run on                                     |
+------------------+-------------------------------------------------------------------+
| NAME             | Name for the job, defaults to slurm-JobID                         |
+------------------+-------------------------------------------------------------------+
| USER             | User who submitted the job                                        |
+------------------+-------------------------------------------------------------------+
| ST               | State of the job                                                  |
+------------------+-------------------------------------------------------------------+
| TIME             | Time used by the job in D-HH:MM:SS                                |
+------------------+-------------------------------------------------------------------+
| NODES            | Number of nodes consumed                                          |
+------------------+-------------------------------------------------------------------+
| NODELIST(REASON) | List of nodes consumed, or reason the job has not started running |
+------------------+-------------------------------------------------------------------+

squeue's output is verbose, but also customizable.

Example:

.. code:: bash

   squeue --user=CNet -i 5

The above will only show information for user :option:`CNet` and will refresh
every :option:`5` seconds.

7.1 Canceling Jobs
-------------------

Cancel a job:

.. code:: bash

   scancel <jobid>

or cancel all of your jobs at the same time:

.. code:: bash

   scancel --user=<CNetID>
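:command:`scancel` can also filter by job state. For instance, to cancel only
your jobs that are still waiting in the queue (a sketch; see ``man scancel``
for the full option list):

.. code:: bash

   scancel --user=<CNetID> --state=PENDING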
7.2 More Job Information
-------------------------

More information about a submitted job can be obtained through the
:command:`scontrol` command by specifying a JobID to query:

.. code:: bash

   scontrol show job <jobid>

Example:

.. code:: bash

   scontrol show job 3560876

   JobId=3560876 Name=sleep
      UserId=dylanphall(1832378456) GroupId=dylanphall(1832378456)
      Priority=17193 Account=rcc-staff QOS=normal
      JobState=CANCELLED Reason=None Dependency=(null)
      Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
      RunTime=00:00:10 TimeLimit=1-12:00:00 TimeMin=N/A
      SubmitTime=2013-01-09T11:39:40 EligibleTime=2013-01-09T11:39:40
      StartTime=2013-01-09T11:39:40 EndTime=2013-01-09T11:39:50
      PreemptTime=None SuspendTime=None SecsPreSuspend=0
      Partition=sandyb AllocNode:Sid=midway-login2:24907
      ReqNodeList=(null) ExcNodeList=(null)
      NodeList=midway113
      BatchHost=midway113
      NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
      MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
      Features=(null) Gres=(null) Reservation=(null)
      Shared=OK Contiguous=0 Licenses=(null) Network=(null)
      Command=/bin/sleep
      WorkDir=/scratch/midway/dylanphall/repositories/pubsw/rccdocs.sphinx

Exercise 8 Explore the Module System
====================================

The ``module`` system is a script-based system used to manage the user
environment. Why isn't everything installed and available by default? The
need for

* multiple versions of software (version number, addons, custom software) and
* multiple build configurations (compiler choice, options, and MPI library)

would lead to a hopelessly polluted namespace and PATH problems.
Additionally, most of the applications used on HPC machines are research
codes in a constant state of development. Testing, stability, compatibility,
and other usability concerns are often not a primary consideration of the
authors.

Try running the commands below and review the output to learn more about the
module system.

Basic ``module`` commands:

+----------------------+-----------------------------------------------------+
| Command              | Description                                         |
+======================+=====================================================+
| module avail [name]  | lists modules matching [name] (all if 'name' empty) |
+----------------------+-----------------------------------------------------+
| module load [name]   | loads the named module                              |
+----------------------+-----------------------------------------------------+
| module unload [name] | unloads the named module                            |
+----------------------+-----------------------------------------------------+
| module list          | lists the modules currently loaded for the user     |
+----------------------+-----------------------------------------------------+
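Loading a module mostly works by editing environment variables such as
``PATH``. A quick way to see this in action (the module name is illustrative;
use whatever :command:`module avail` lists on your system):

.. code:: bash

   module load gcc      # prepend the compiler's directories to your environment
   which gcc            # now resolves to the module's gcc rather than the system one
   module unload gcc    # put the environment back the way it was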
8.1 Example - Find and load a particular Mathematica version
------------------------------------------------------------

Obtain a list of the currently loaded modules:

.. code:: bash

   $ module list

   Currently Loaded Modulefiles:
     1) slurm/2.4        3) subversion/1.6   5) env/rcc          7) tree/1.6.0
     2) vim/7.3          4) emacs/23.4       6) git/1.7

Obtain a list of all available modules:

.. code:: bash

   $ module avail

   -------------------------------- /software/modulefiles ---------------------------------
   Minuit2/5.28(default)              intelmpi/4.0
   Minuit2/5.28+intel-12.1            intelmpi/4.0+intel-12.1(default)
   R/2.15(default)                    jasper/1.900(default)
   ...
   ifrit/3.4(default)                 x264/stable(default)
   intel/11.1                         yasm/1.2(default)
   intel/12.1(default)

   ------------------------- /usr/share/Modules/modulefiles -------------------------------
   dot          module-cvs   module-info  modules      null         use.own

   ----------------------------------- /etc/modulefiles -----------------------------------
   env/rcc             samba/3.6           slurm/2.3           slurm/2.4(default)

   --------------------------------------- Aliases -----------------------------------------

Obtain a list of available versions of a particular piece of software:

.. code:: bash

   $ module avail mathematica

   -----------------------------------------------------------------------------------------
   mathematica/8.0(default)   mathematica/9.0
   -----------------------------------------------------------------------------------------

Load the default mathematica version:

.. code:: bash

   $ module load mathematica
   $ mathematica --version
   8.0

List the currently loaded modules:

.. code:: bash

   $ module list

   Currently Loaded Modulefiles:
     1) slurm/2.4        3) subversion/1.6   5) env/rcc          7) tree/1.6.0
     2) vim/7.3          4) emacs/23.4       6) git/1.7          8) mathematica/8.0

Unload the mathematica/8.0 module and load mathematica version 9.0:

.. code:: bash

   $ module unload mathematica/8.0
   $ module load mathematica/9.0
   $ mathematica --version
   9.0

List the currently loaded modules:

.. code:: bash

   $ module list

   Currently Loaded Modulefiles:
     1) slurm/2.4        3) subversion/1.6   5) env/rcc          7) tree/1.6.0
     2) vim/7.3          4) emacs/23.4       6) git/1.7          8) mathematica/9.0

Exercise 9 Learn About Your Storage
=====================================

There are three different types of storage:

#. Home - for personal configurations, private data, limited space

   :file:`/home/[user name]`

#. Scratch - fast, for daily use

   :file:`/scratch/midway/[user name]`

#. Project - for common project files, project data, sharing

   :file:`/project/[PI name]`

To find the limits, enter:

.. code:: bash

   quota

   Disk quotas for user dylanphall:

   Filesystem       type        used     quota     limit   files  grace
   ---------------  -------  -------  --------  --------  ------  -----
   home             USR       2.90 G   10.00 G   12.00 G   39987   none
   midway-scratch   USR      24.26 G    5.00 T    6.00 T  106292   none
   project-abcdefg  FILESET  24.33 G  500.00 G  550.00 G    5488   none

Descriptions of the fields:

.. describe:: Filesystem

   This is the file system or file set where this quota is valid.

.. describe:: type

   This is the type of quota. This can be *USR* for a user quota, *GRP* for a
   group quota, or *FILESET* for a file set quota. File set quotas can be
   considered a directory quota. *USR* and *GRP* quotas can exist within a
   *FILESET* quota to further limit a user or group quota inside a file set.

.. describe:: used

   This is the amount of disk space used for the specific quota.

.. describe:: quota

   This is the quota or soft quota. It is possible for usage to exceed the
   quota for the grace period or up to the hard limit.

.. describe:: limit

   This is the limit or hard quota that is set. It is not possible for usage
   to exceed this limit.

.. describe:: files

   This is the number of files currently counted in the quota.

.. describe:: grace

   This is the grace period, which is the amount of time remaining that the
   quota can be exceeded. The value *none* means that the quota is not
   exceeded. After the quota has been exceeded for longer than the grace
   period, it will no longer be possible to create new files.

Default Storage:

+-----------+-----------+
| Partition | Available |
+===========+===========+
| Home      | 25 G      |
+-----------+-----------+
| Scratch   | 5 T       |
+-----------+-----------+
| Project   | 500 G     |
+-----------+-----------+

9.1 Explore File Backups & Restoration
--------------------------------------

Snapshots
---------

Automated snapshots of the Home and Project areas are available in case of
accidental file deletion or other problems. Currently snapshots are available
for these time periods:

* 4 hourly snapshots
* 7 daily snapshots
* 4 weekly snapshots

You can find snapshots in these directories:

* :file:`/snapshots/home/` -- Home snapshots
* :file:`/snapshots/project` -- Project snapshots

The subdirectories within these locations specify the time of the backup. For
example, :file:`/snapshots/project/daily-2013-10-08.05h15/project/rcc`
contains an exact snapshot of the contents of /project/rcc as it appeared on
October 8, 2013 at 5:15 am.

Try recovering a file from a snapshot of your home directory, as sketched
below. If you are a brand-new RCC user, or using a guest account, you may not
have any snapshots of your home directory, as it has not existed for long
enough.
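A recovery is just a copy out of the read-only snapshot tree. The snapshot
name and file name below are illustrative, and the home snapshot layout is
assumed to mirror the project example above; list :file:`/snapshots/home/` to
see what is actually available:

.. code:: bash

   ls /snapshots/home/    # see the available snapshot dates
   cp /snapshots/home/daily-2013-10-08.05h15/home/$USER/notes.txt ~/notes.txt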
Backup
------

Backups are performed on a nightly basis to a tape machine located in a
different data center than the main storage system. These backups are meant
to safeguard against disasters such as hardware failure or events that could
cause the loss of the main data center. Users should make use of the
snapshots described above to recover files.

Exercise 10 Run a community code
====================================

Computing centers usually maintain the most popular community codes in a
central repository. The physical and biological sciences in particular have a
wealth of codes for simulating the natural world at a variety of scales. NAMD
is an example of a code that can be run on hundreds of thousands of
processors in order to study atomistic protein dynamics.

Enter the command **rcchelp namd** and follow the instructions to download
sample files and submission scripts. Run the parallel NAMD job.

Exercise 11 Putting it all together
====================================

Develop a simple application, compile it, and submit it to the scheduler to
run. If you are not sure what to write and compile, try this simple C
application:

.. code:: c

   #include <stdio.h>

   int main(void)
   {
       printf("Hello, World!\n");
       return 0;
   }

Try compiling with both the GCC and Intel compilers. You will need to use the
commands :command:`module load gcc` and :command:`module load intel`.
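One possible sequence, assuming the source file is saved as ``hello.c`` (the
Intel C compiler is traditionally invoked as :command:`icc`; adjust if your
module provides a different command):

.. code:: bash

   module load gcc
   gcc hello.c -o hello_gcc && ./hello_gcc

   module load intel
   icc hello.c -o hello_intel && ./hello_intel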
Exercise 12 Connect to the cluster with Samba
==============================================

Samba allows users to connect to (or "mount") their home and project
directories on their local computer, so that the file system on Midway
appears as if it were directly connected to the local machine. This method of
accessing your RCC home and project space is only available from within the
UChicago campus network. From off-campus you will need to connect through the
`UChicago virtual private network`_.

.. _UChicago virtual private network: https://cvpn.uchicago.edu/

Your Samba account credentials are your CNetID and password::

   Username: ADLOCAL\CNetID
   Password: CNetID password
   Hostname: midwaysmb.rcc.uchicago.edu

.. note:: Make sure to prefix your username with **ADLOCAL**\\

On a Windows computer, use the "Map Network Drive" functionality and the
following UNC paths::

   Home:    \\midwaysmb.rcc.uchicago.edu\homes
   Project: \\midwaysmb.rcc.uchicago.edu\project

On Mac OS X, use these URLs to connect::

   Home:    smb://midwaysmb.rcc.uchicago.edu/homes
   Project: smb://midwaysmb.rcc.uchicago.edu/project

To connect on a Mac OS X computer, you can either click the above URLs or
follow these steps:

* Use the **Connect to Server** utility in Finder.

  .. image:: intro-to-rcc-workshop/finder-connect_to_server.jpg
     :width: 40 %

* Enter one of the URLs from above in the input box for **Server Address**.
* When prompted for a username and password, select **Registered User**.
* Enter :samp:`ADLOCAL\\{user}` for the username and enter your CNetID
  password.

On Ubuntu, use these URLs to connect::

   Home:    smb://midwaysmb.rcc.uchicago.edu/homes
   Project: smb://midwaysmb.rcc.uchicago.edu/project

* Use the **Connect to Server** utility in the File Manager.

  .. image:: intro-to-rcc-workshop/ubuntu-connect_to_server.png
     :width: 40 %

* Enter one of the above URLs for **Server Address**.
* Enter :samp:`ADLOCAL\\{user}` for the username and enter your CNetID
  password.

Exercise 13 Head to the User's Guide
===============================================

You are ready to work from the User's Guide. Head to
http://docs.rcc.uchicago.edu to get started. You may want to try:

#. Requesting a user or PI account
#. Connecting with NX, to get a graphical Linux desktop connected to the
   cluster
#. Using Sviz, for demanding graphical applications that run on the cluster
#. Transferring data to the cluster
#. Running jobs

Exercise 14 Transfer files with Globus Online
=========================================================

* Go to https://globus.rcc.uchicago.edu and follow the instructions to sign
  in to your Globus Online account
  (http://docs.rcc.uchicago.edu/user-guide.html#user-guide-globus-online).
* Once you have signed in, enter ucrcc#midway as the Endpoint, then click the
  Go button.

  .. image:: intro-to-rcc-workshop/Globus_tutorialMidwayEndpoint.png
     :scale: 70 %
     :alt: Globus select Midway endpoint

* Click the Get Globus Connect link on the transfer page and follow the
  installation instructions. The Endpoint Name you create in Step Two will
  refer to your local computer.

  .. image:: intro-to-rcc-workshop/Globus_tutorialConnect.png
     :scale: 70 %
     :alt: Globus install Globus Connect

* Open Globus Connect on your local computer.
* Return to the transfer page, enter your local computer's endpoint name in
  the second Endpoint field, then click the Go button.

  .. image:: intro-to-rcc-workshop/Globus_tutorialLocalEndpoint.png
     :scale: 70 %
     :alt: Globus select local endpoint

* Find a file within your local endpoint window, select it, then click the
  opposing arrow to transfer it to your Midway home directory. Click refresh
  list to view the transferred file.

  .. image:: intro-to-rcc-workshop/Globus_tutorialSuccess.png
     :scale: 70 %
     :alt: Globus transfer from local to Midway