Parallelism
Shared memory parallelism (threading)
In shared-memory parallelism (SM), an application achieves parallelism by executing more than one thread at a time across the cores of a single node. Each thread sees and manipulates the same memory that the initial process allocated. Threads are lightweight and fast, each independently processing a portion of the work. With SM parallelism, jobs are limited to the total memory and core count of one node.
To run a threaded application properly through Slurm, you need to specify the appropriate Slurm options. For example, to launch a Slurm job for a threaded application that will use 12 threads:
#!/bin/bash
#SBATCH --cpus-per-task=12 # same as -c 12
srun ./myapp
If you opt not to use srun to launch your application, you must add this line to your Slurm job script:
export OMP_NUM_THREADS=12
Note: This number MUST match --cpus-per-task.
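As a minimal sketch of that case (no srun), assuming your application reads OMP_NUM_THREADS as OpenMP programs do, you can use Slurm's SLURM_CPUS_PER_TASK variable so the thread count always matches --cpus-per-task (./myapp is a placeholder for your executable):
#!/bin/bash
#SBATCH --cpus-per-task=12 # same as -c 12
# Match the thread count to the CPUs Slurm allocated to this task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./myapp # launched directly, without srun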
Distributed memory parallelism (MPI)
In distributed-memory parallelism (DM), an application achieves parallelism by running multiple instances of itself across multiple nodes to solve the problem. Each instance is allocated its own chunk of virtual memory and communicates with the other instances via a message-passing interface such as MPI. DM jobs have several advantages, among them more cores to tackle the problem and a far greater amount of total memory.
To run DM jobs on Monsoon, Slurm needs only the total number of tasks, or instances, to be run. This number should correspond to the total number of processors you would like to use. For example, to use 64 processors, specify --ntasks=64. Slurm will then set aside 64 processors across the cluster dedicated to your job (just be sure to load an MPI module in your script).
#!/bin/bash
#SBATCH --ntasks=64 #same as -n 64
module load openmpi/1.6.5
srun ./myapp
One thing to keep in mind with DM jobs is that MPI communication can cause a performance hit when the application is spread over too many nodes. You can specify the number of nodes to use in your Slurm job script with --nodes=<number>.
For example, if you have a DM job that runs best with 48 cores, you can request that it run on only two nodes with --nodes=2. Of course, this extra constraint will likely make your job take longer to start, but it could be worth it for your analyses. The next best thing is to ask Slurm to use two nodes if possible, and otherwise three or four; you would specify this with --nodes=2-4. Without --nodes, your MPI job could run on any number of nodes across the cluster; if communication steps are minimal this may not be a concern. A sketch of such a script is shown below.
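Putting this together, a minimal sketch of the 48-core job above, constrained to between two and four nodes (./myapp again stands in for your MPI application):
#!/bin/bash
#SBATCH --ntasks=48 #same as -n 48
#SBATCH --nodes=2-4 #use 2 nodes if possible, up to 4 otherwise
module load openmpi/1.6.5
srun ./myapp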
DM + SM parallelism (hybrid)
In hybrid parallelism, applications achieve parallelism with the use of both threads and MPI tasks. This type of parallelization combines the features of the previous two models, allowing a program to use threads on multiple nodes across the cluster. Hybrid jobs can potentially run more efficiently (consuming less memory and scaling further) by reducing the MPI communication overhead of a job. This is done by substituting lightweight threads for MPI ranks. There is no absolute ratio of threads to ranks, but one approach that usually works well is to assign one MPI rank per socket. To set up jobs to use DM+SM, first make sure your app supports it (see the FAQ regarding this).
Constructing a hybrid Slurm script takes some careful planning. Consider, for example, running an application across two dual-socket nodes for a combined total of 48 CPUs. First, calculate the number of tasks (or ranks) by multiplying the desired node count by the number of sockets per node (2 in this example): 2 * 2 = 4. Next, divide the total CPU count by the number of tasks to get the CPUs per task: 48 / 4 = 12. Last, tell Slurm to run only 2 tasks on each node and one task per socket with --ntasks-per-node=2 and --ntasks-per-socket=1. Note that your total CPU count is the product of ntasks and cpus-per-task (4 * 12 = 48). The final script is shown below:
#!/bin/bash
#SBATCH --cpus-per-task=12 #same as -c 12
#SBATCH --ntasks=4 #same as -n 4
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
module load openmpi
srun ./myapp #or full path to myapp
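As a quick sanity check of the layout (purely illustrative, not needed for production runs), you can swap the application for a one-liner that prints where each rank lands; SLURM_PROCID, SLURMD_NODENAME, and SLURM_CPUS_PER_TASK are standard variables Slurm sets for each task:
srun bash -c 'echo "rank $SLURM_PROCID on $SLURMD_NODENAME with $SLURM_CPUS_PER_TASK CPUs"'
With the settings above you should see four lines, two per node, each reporting 12 CPUs.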