What’s New on Midway2¶
As part of an ongoing effort to provide researchers at the University of Chicago with state-of-the-art computational systems, RCC has released a new High Performance Computing (HPC) system called Midway2. The new system offers more computing capacity than the previous Midway system (now Midway1), while also providing fast interconnects between all Midway2 compute nodes.
Overview¶
Midway1 now refers to RCC's previous HPC system (accessible via midway1.rcc.uchicago.edu), Midway2 refers to RCC's newly deployed HPC system (accessible via midway2.rcc.uchicago.edu), and Midway refers to the combined compute cluster.
Midway2 is set up in a similar way to Midway1; your workflow on Midway2, including connecting, using the "module" command to access different software, compiling your code, creating an interactive session, and submitting batch jobs, is similar or unchanged. However, since Midway2 uses a newer Intel computing architecture, RCC uses the latest versions of the compilers to build and optimize up-to-date software on Midway2. If you prefer to use an older version of a software package, you may send a request to help@rcc.uchicago.edu and ask for a specific version to be installed.
The /project and /home file spaces are accessible from the Midway2 login and compute nodes to allow users to easily access their files. For optimal performance, users are encouraged to run their jobs from the Midway2 scratch space.
Types of Compute Nodes on Midway2¶
Similar to Midway1, the Midway2 compute cluster contains a variety of node configurations. Each partition contains a (generally) homogeneous set of compute nodes. At the time of writing, the types of nodes available to RCC users on Midway2 are as follows:
| Partition | CPUs per Node | GPUs per Node | Number of Nodes | Memory per Node |
|---|---|---|---|---|
| broadwl | 28 x Intel E5-2680v4 @ 2.4 GHz | none | 370 | 64 GB |
| bigmem2 | 28 x Intel E5-2680v4 @ 2.4 GHz | none | 6 | 512 GB |
| gpu2 | 28 x Intel E5-2680v4 @ 2.4 GHz | 4 x NVIDIA K80 | 6 | 64 GB |
For more information about the compute nodes on Midway1 and Midway2, see Using Midway.
Midway2 Interconnects¶
Midway2 nodes are connected via a very fast interconnect called InfiniBand. The InfiniBand interconnect on Midway2 is faster than the Midway1 interconnect and comes in two variants: "Fourteen Data Rate" (FDR) and "Enhanced Data Rate" (EDR). EDR is the faster of the two, and almost half of the Midway2 nodes (185 nodes) use it; the remainder use FDR. When you run an interactive session or submit a job via sbatch that uses multiple compute nodes, you have the option of selecting the interconnect that carries message passing between nodes (see below for an example).
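For instance, an interactive session spanning two EDR-connected nodes can be requested with the constraint option (a sketch using RCC's sinteractive command; the same flag works with sbatch):

```shell
# Request 2 EDR-connected broadwl nodes for an interactive session
sinteractive --partition=broadwl --constraint=edr --nodes=2 --ntasks-per-node=28
```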
Service Units and Allocations¶
A "service unit" (SU), defined as the use of 1 compute core for 1 hour, is the "currency" used to account for computing resource expenditures. On Midway1, a job that requests the use of N cores for M hours costs N * M SUs; on Midway2, the same job will cost 1.5 * N * M SUs. For more information about SU allocations, see RCC Allocations.
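As a worked example of the cost formula (the job size here is illustrative):

```shell
# A job using 4 nodes x 28 cores (N = 112) for M = 6 hours:
echo $(( 112 * 6 ))          # Midway1 cost: N * M = 672 SUs
echo $(( 112 * 6 * 3 / 2 ))  # Midway2 cost: 1.5 * N * M = 1008 SUs
```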
Software Modules¶
By default, the modules available on Midway2 are optimized for Midway2's architecture. See Software for a list of the available software modules on Midway2.
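For example, you can browse and load modules with the usual module commands (the package name here is illustrative):

```shell
module avail openmpi   # list the available versions of a package
module load openmpi    # load the default (Midway2-optimized) build
module list            # show the modules currently loaded
```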
Submitting Jobs to Midway2¶
Submitting Single-Node Jobs to Midway2¶
Most jobs submitted to Midway2 will use the broadwl partition (--partition=broadwl). For example, to submit a job that uses 28 cores and 2 GB of memory per core for 2 hours on one Midway2 node, the following sample sbatch script can be used:
#!/bin/bash
#SBATCH --job-name=example_single_node_sbatch
#SBATCH --output=example_single_node_sbatch.out
#SBATCH --error=example_single_node_sbatch.err
#SBATCH --partition=broadwl
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=28
#SBATCH --mem-per-cpu=2000
# Load any necessary modules here
./myExecutable
If you save the above commands in a file called single_node.sbatch, running the following command will submit the job to the Midway2 queue:
$ sbatch single_node.sbatch
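After submitting, the job can be monitored or cancelled with the standard Slurm commands:

```shell
squeue -u $USER    # show the status of your pending and running jobs
scancel <jobid>    # cancel a job, where <jobid> is the ID reported by sbatch
```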
For more details on submitting jobs and SBATCH tags, please see Batch Jobs or Slurm.
Submitting Multi-Node Jobs to Midway2¶
To submit jobs to Midway2, specify the broadwl partition as an sbatch option. When running sbatch or sinteractive for distributed computing tasks, the use of EDR or FDR nodes can be specified with the constraint option. The following is a sample sbatch script to request 4 full compute nodes (28 cores on each node) for 6 hours on Midway2, connected via EDR:
#!/bin/bash
#SBATCH --job-name=example_sbatch
#SBATCH --output=example_sbatch.out
#SBATCH --error=example_sbatch.err
#SBATCH --partition=broadwl
#SBATCH --time=06:00:00
#SBATCH --constraint=edr
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=28
#SBATCH --mem-per-cpu=2000
module load intelmpi
mpirun ./myExecutable
Submitting Large-Memory Jobs to Midway2¶
If your job requires more than 64 GB of memory per node, a node in the bigmem2 partition can be used. The bigmem2 partition contains 6 nodes, each with 512 GB of memory and 28 Intel Broadwell CPU cores.
Note: Big memory nodes are connected via the FDR interconnect, so you should not specify the EDR constraint when requesting the bigmem2 partition.
The following is a sample sbatch script to request a big memory node for 4 hours:
#!/bin/bash
#SBATCH --job-name=example_bigmem_sbatch
#SBATCH --output=example_bigmem_sbatch.out
#SBATCH --error=example_bigmem_sbatch.err
#SBATCH --partition=bigmem2
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=28
#SBATCH --mem-per-cpu=18000
./my_bigmem_Executable
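Note that the per-core memory request in this script is sized to the node: 28 tasks at 18000 MB each sum to just under the node's 512 GB. A quick check:

```shell
echo $(( 28 * 18000 ))   # 504000 MB (about 504 GB), within the 512 GB available
```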
If you are not sure how much memory your job needs, you can learn this from a previous sbatch job by using the maxvmsize format option with sacct, as shown below, where <jobid> is the ID number of your job:
$ sacct -j <jobid> --format=jobid,jobname,account,alloccpus,state,cputime,maxvmsize
The last column of this output will be the maximum virtual memory size of all tasks in the job.
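MaxVMSize is typically reported in kilobytes with a K suffix. As a sketch (the 480000K figure is hypothetical), you can convert it to megabytes to help choose a --mem-per-cpu value with some headroom:

```shell
maxvmsize_kb=480000               # hypothetical value read from the sacct output
echo $(( maxvmsize_kb / 1024 ))   # 468 MB; round up and add headroom for --mem-per-cpu
```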
Submitting GPU Jobs to Midway2¶
The Midway2 cluster has 6 nodes for GPU (Graphics Processing Unit) computing. Each node in this partition has 28 Intel Broadwell cores as well as 4 NVIDIA K80 GPUs. The following is a sample sbatch script to request two GPUs (out of a possible 4) and 14 cores for 4 hours:
#!/bin/bash
#SBATCH --job-name=example_gpu_sbatch
#SBATCH --output=example_gpu_sbatch.out
#SBATCH --error=example_gpu_sbatch.err
#SBATCH --partition=gpu2
#SBATCH --gres=gpu:2
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=14
#SBATCH --mem-per-cpu=2000
./my_gpu_Executable
Because the --exclusive option was not specified, this script will allow other users to use the remaining 2 GPUs and 14 cores on the same node.
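Within a job that requests GPUs via --gres, Slurm exposes only the devices allocated to the job; you can confirm this from inside the job script (a sketch):

```shell
echo $CUDA_VISIBLE_DEVICES   # lists only the GPUs allocated to this job
nvidia-smi                   # inspect the state of the allocated GPUs
```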