SLURM (Queueing System)

Rockfish uses SLURM (Simple Linux Utility for Resource Management) to manage resource scheduling and job submission. SLURM is an open-source application with active developers and a growing user community, and it has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing; that is, “interactive” use of login nodes for job processing is not allowed. Users who need to interact with their codes while they are running can request an interactive session using the “interact” script, which submits a request to the queuing system for interactive access to a compute node.

Partitions

SLURM uses “partitions” to divide types of jobs (partitions are called queues on other schedulers). Rockfish defines several partitions for sequential/shared computing, parallel computing (dedicated or exclusive nodes), GPU jobs, and large-memory jobs. The default partition is “parallel”.

The following table describes the attributes for the different partitions:

Partition	Available Nodes	Max Time (Hours)	Max Cores per Node	Max Memory per Node (MB)
parallel	768	1 / 72	48	192,000
a100	17	1 / 72	48	192,000
bigmem	28	1 / 48	48	1,537,000
v100	1	1 / 72	48	193,118
ica100	8	1 / 72	64	256,000
express	5	1 / 8	128	256,000
shared	41	1 / 24	64	256,000

SLURM COMMANDS

Here is a list of useful SLURM commands.

Description	SLURM Command
Submit a job script	sbatch script-name
Queue list and features	sinfo
Node list	sinfo
List all jobs	squeue
List jobs by user	squeue -u [userid] OR sqme
Check job status	squeue -j [job-id]
Show resource efficiency of job	seff [job-id]
Delete a job	scancel [job-id]
Graphical utility	sview
Hold a job	scontrol hold [job-id]
Release a held job	scontrol release [job-id]
Change job resources	scontrol update
Show finished jobs	sacct

Environment Variables

SLURM sets environment variables that can be used in your script. Here is a table with the most common variables; a LOG file of the SLURM variables set by a SLURM job is also available.

Description	Slurm Variable
JobID	$SLURM_JOBID
Submit Directory	$SLURM_SUBMIT_DIR (default)
Submit Host	$SLURM_SUBMIT_HOST
Node List	$SLURM_JOB_NODELIST
Job Array Index	$SLURM_ARRAY_TASK_ID
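
As a minimal sketch, the function below could be placed near the top of a job script to record these variables at the start of a run. The fallback values after ":-" (for example not-in-slurm) are placeholders added here for illustration so that the snippet also runs outside a job; they are not values that SLURM sets.

```shell
#!/bin/bash
# Print the common SLURM variables from the table above.
# The ":-" fallbacks are illustrative placeholders, not SLURM defaults.
log_slurm_env() {
    echo "JobID:            ${SLURM_JOBID:-not-in-slurm}"
    echo "Submit directory: ${SLURM_SUBMIT_DIR:-$PWD}"
    echo "Submit host:      ${SLURM_SUBMIT_HOST:-${HOSTNAME:-unknown}}"
    echo "Node list:        ${SLURM_JOB_NODELIST:-none}"
    echo "Array task index: ${SLURM_ARRAY_TASK_ID:-not-an-array-job}"
}

log_slurm_env
```

Inside a batch job the output ends up in the job's output file, which makes it easier to trace later where and how the job ran.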

Common Flags

This is a list of the most common flags that any user may include on scripts to request different resources and features for jobs.

Description	Job Specification
Script Directive	#SBATCH
Job Name	#SBATCH --job-name=My-Job_Name
Wall time hours	#SBATCH --time=24:0:0
Number of nodes requested	#SBATCH --nodes=1
Number of processes per node requested	#SBATCH --ntasks-per-node=24
Number of cores per task requested	#SBATCH --cpus-per-task=24
Send mail at the end of the job	#SBATCH --mail-type=end
User's email address	#SBATCH --mail-user=userid@jhu.edu
Copy user's environment	#SBATCH --export=[ALL|NONE|Variables]
Working Directory	#SBATCH --chdir=dir-name
Job Restart	#SBATCH --requeue
Share Nodes	#SBATCH --shared
Dedicated nodes	#SBATCH --exclusive
Memory Size	#SBATCH --mem=[mem][M|G|T] or --mem-per-cpu
Account to Charge	#SBATCH --account=[account]
Quality of Service	#SBATCH --qos=[name]
Job Arrays	#SBATCH --array=[array_spec]
Use specific resource	#SBATCH --constraint="XXX"

Important Flags for Your Jobs

Users need to pay special attention to these flags because proper management will benefit both the user and the scheduler.

Walltime requested

The walltime requested with --time should be larger than, but close to, the actual processing time. If the requested time is too short, the job will be aborted before the program finishes and results may be lost, while SUs will still be charged to your allocation. If the requested time is too long, the job may wait in the queue longer while the scheduler tries to allocate the resources needed. Once resources are allocated to your job, they are unavailable to other jobs, so an inflated request affects the scheduler’s ability to allocate resources efficiently for all users.
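
One way to pick a value that is "larger than, but close to" the real runtime is to time a representative run and add a safety margin. The helper below is a sketch of that arithmetic; the function name walltime_with_margin and the 20% default margin are assumptions made for illustration, not a site recommendation.

```shell
#!/bin/bash
# Turn a measured runtime (in seconds) into a --time value with a safety margin.
# Usage: walltime_with_margin <seconds> [margin_percent]   (margin defaults to 20)
walltime_with_margin() {
    local secs=$1
    local margin=${2:-20}
    # Integer arithmetic, rounded up to the next whole second
    local total=$(( (secs * (100 + margin) + 99) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))
}

# A test run that took 5 hours (18000 s) becomes a 6-hour request.
echo "#SBATCH --time=$(walltime_with_margin 18000)"
```

The margin is a trade-off: a larger one protects against slow runs, a smaller one helps the job start sooner.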

Nodes, tasks, and cpus

Dedicated nodes can be requested with the --exclusive flag, in which case all CPUs and memory on each node are allocated to the job. Programs that rely heavily on data transfer between tasks may be well suited to exclusive nodes. If exclusive nodes are not needed, either because the jobs are too small to fill a single node or because they do not leverage shared memory, the --shared flag designates that only a fraction of each node may be used.

Parallel processing may be done with multiple processes, multiple threads, or a combination of both. A single process may have multiple threads sharing memory. Multiple processes need some way to communicate, for example MPI. In SLURM, the number of processes is controlled by setting the number of “tasks”, while threads are controlled by the number of “cpus” (see below for relevant flags). To avoid confusion, --ntasks-per-node should only be used for MPI jobs; otherwise, use --cpus-per-task to specify the number of CPUs.

The number of nodes can be specified using the --nodes or -N flag and takes the form min-max (e.g. 2-4). If a single number is given, the scheduler will allocate exactly that number of nodes. You can also specify the resources needed by giving the number of tasks with --ntasks or -n along with --cpus-per-task, in which case the scheduler will decide on the appropriate number of nodes for your job. You may also specify --ntasks-per-node, which is multiplied by --cpus-per-task to give the CPUs used on each node if both are given. Be aware that if you ask for more CPUs than are available in a single node, the scheduler will reject your request with an error. Similarly, you may be denied the use of the “shared” partition if you ask for more than 1 node. Finally, you may also request a minimum number of CPUs per node with the --mincpus flag.
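
The sketch below combines these flags for a hybrid job (several MPI processes, each with several threads). The counts of 2 nodes, 4 tasks per node, and 12 CPUs per task are illustrative and chosen so that each node uses 48 CPUs, the per-node maximum listed for the parallel partition. Setting OMP_NUM_THREADS assumes the program uses OpenMP threads.

```shell
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12

# Inside a job SLURM sets these variables; the fallbacks mirror the directives
# above so the arithmetic can also be checked outside a job.
nodes=${SLURM_JOB_NUM_NODES:-2}
tasks_per_node=${SLURM_NTASKS_PER_NODE:-4}
cpus_per_task=${SLURM_CPUS_PER_TASK:-12}

# CPUs per node = tasks per node x CPUs per task; must not exceed the node size.
cpus_per_node=$(( tasks_per_node * cpus_per_task ))
total_cpus=$(( nodes * cpus_per_node ))
echo "CPUs per node: $cpus_per_node, total CPUs: $total_cpus"

# One thread per allocated CPU for each process (assumes an OpenMP program).
export OMP_NUM_THREADS=$cpus_per_task
```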

Memory

The --mem flag specifies the total amount of memory per node. The --mem-per-cpu specifies the amount of memory per allocated CPU. The two flags are mutually exclusive.

For the majority of nodes, each CPU requested reserves 5GB of memory, with a maximum of 120GB. If you use the --mem flag and the --cpus-per-task flag together, the greater of the resulting CPU counts will be charged to your account.
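
The charging rule can be read as: convert the memory request into an equivalent number of CPUs at 5GB each, then charge whichever is larger, that number or the CPUs requested. The helper below is a sketch of that reading only; the function name charged_cpus is hypothetical and the actual accounting is done by the scheduler.

```shell
#!/bin/bash
# Estimate the CPUs charged for a request, assuming 5 GB per CPU as stated above.
# Usage: charged_cpus <cpus_requested> <mem_gb_requested>
charged_cpus() {
    local cpus=$1 mem_gb=$2 gb_per_cpu=5
    # CPUs implied by the memory request, rounded up
    local mem_cpus=$(( (mem_gb + gb_per_cpu - 1) / gb_per_cpu ))
    if (( mem_cpus > cpus )); then
        echo "$mem_cpus"
    else
        echo "$cpus"
    fi
}

charged_cpus 4 60    # 60 GB implies 12 CPUs, more than the 4 requested
charged_cpus 16 20   # 20 GB implies 4 CPUs, fewer than the 16 requested
```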

If your job does not need a particular amount of memory, that is, it will run within the minimum amount of memory per node (120GB), use these lines in your script.

#SBATCH -p parallel
#SBATCH --mem=0

Adding these flags to your job submission will cause the job to be allocated all of the available memory on the node. Your account will be charged accordingly, that is, for all of the node’s available CPUs.

GPUs

Requesting GPUs requires both the right partition and the --gres flag. You must also request a total of 6 CPUs per GPU through a combination of --ntasks-per-node and --cpus-per-task. For example:

#SBATCH -p a100
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6

If you would like an interactive session, you can use the provided script:

$ interact -n 6 -p a100 -g 1

Note that the interact script has the shortcut -g flag for requesting GPUs and does not take “gres” as an input. In the example we requested -n 6 because each GPU is associated with 6 CPUs (cores). The environment variable that holds the devices assigned to your job is $CUDA_VISIBLE_DEVICES. For example, if you requested 2 GPUs, this variable may be set to “1,3”, which indicates the devices visible to your code. When selecting the CUDA device to use, you pass cudaSetDevice() a zero-based index into this list, and it is mapped to the node’s device numbering through $CUDA_VISIBLE_DEVICES. In this example, cudaSetDevice(0) selects device number 1 and cudaSetDevice(1) selects device number 3.
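
To see this mapping from inside a job, the sketch below splits $CUDA_VISIBLE_DEVICES on commas and prints each zero-based index next to the node device it refers to. The function name show_gpu_map is hypothetical, and the fallback value "1,3" only reproduces the example above for use outside a GPU job.

```shell
#!/bin/bash
# Show how zero-based CUDA device indices map to the node's device numbers.
# The fallback "1,3" reproduces the example in the text; in a GPU job the
# variable is already set for you.
show_gpu_map() {
    local visible=${CUDA_VISIBLE_DEVICES:-1,3}
    local -a devices
    IFS=',' read -r -a devices <<< "$visible"
    local i
    for i in "${!devices[@]}"; do
        echo "logical device $i -> node device ${devices[$i]}"
    done
}

show_gpu_map
```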

Interactive Processing

If users need to run applications that require interactive processing, such as applications with a GUI, post-processing, or visualization, Rockfish provides an “interact” command that requests an interactive session hosted on a compute node.

Below are two common interact commands, where “-p” is the partition, “-c” is the number of cores, and “-t” is the walltime requested:

$ interact -p debug -c 2 -t 120
$ interact -p express -c 6 -t 12:00:00

You can use the following command to find more information about interact.

$ interact --usage

Job Arrays

A job array can be specified in a script when submitted using sbatch. For example,

$ sbatch --array=0-15%4 script.sh

would submit script.sh 16 times, with IDs 0 through 15. The %4 is optional and allows at most 4 jobs to run concurrently.

Within script.sh, there are three environment variables that can be used: $SLURM_JOBID is sequential for each job and depends on the queue; $SLURM_ARRAY_JOB_ID is the same for all jobs in the array and equal to the $SLURM_JOBID of the first job; and $SLURM_ARRAY_TASK_ID is equal to the index specified with the array option (which could be for example --array=1,3,5,7 or --array=1-7:2 where 2 is the step size).

To specify slurm stdin, stdout, and stderr files, use %A instead of SLURM_ARRAY_JOB_ID and %a instead of SLURM_ARRAY_TASK_ID. For example:

$ sbatch -o slurm-%A_%a.out --array=0-15%4 script.sh

would output to files named slurm-45_0.out, slurm-45_1.out, slurm-45_2.out, … (assuming 45 is the id of the first job). The %A part is the same for every task in the array; only %a changes.
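
A common use of $SLURM_ARRAY_TASK_ID is to select a different input for each array task. The script below is a minimal sketch of that pattern; the file names and the helper pick_input are hypothetical, and the task index defaults to 0 only so that the script can be tried outside a job.

```shell
#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=0-3
#SBATCH --output=slurm-%A_%a.out

# Map an array index to an input file (hypothetical file names).
pick_input() {
    local inputs=(alpha.dat beta.dat gamma.dat delta.dat)
    echo "${inputs[$1]}"
}

# SLURM sets SLURM_ARRAY_TASK_ID for each array task; default to 0 outside a job.
task_id=${SLURM_ARRAY_TASK_ID:-0}
input=$(pick_input "$task_id")

echo "Array job ${SLURM_ARRAY_JOB_ID:-none}, task $task_id, processing $input"
```

The --array range should match the number of inputs, here indices 0 through 3 for four files.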

Example Script

A simple script to run an MPI job using 24 cores (a single node) would look like this:

#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --time=24:0:0
#SBATCH --partition=shared
#SBATCH --nodes=1
# number of tasks (processes) per node
#SBATCH --ntasks-per-node=24
#SBATCH --mail-type=end
#SBATCH --mail-user=userid@jhu.edu

#### load and unload modules you may need
# module unload openmpi/intel
# module load mvapich2/gcc/64/2.0b
module list

#### execute code and write output file to OUT-24log.
# time mpiexec ./code-mvapich.x > OUT-24log
echo "Finished with job $SLURM_JOBID"

#### mpiexec by default launches number of tasks requested

Displaying Jobs

The squeue command will display all jobs that have been submitted to the queues. The output is usually long due to the large number of jobs running or waiting to be executed. The “sqme” script will show jobs that belong to the user.

$ sqme
Wed Sep 21 11:53:58 2016
JOBID PARTITION NAME USER     STATE   TIME TIME_LIMI NODES NODELIST(REASON)
88791 parallel  Job1 jcombar1 RUNNING 1:22 1:10:00     4   compute[0301-0304]

The columns are self-explanatory. TIME indicates the time the job has consumed, TIME_LIMIT (truncated to TIME_LIMI in the header) the maximum amount of time requested, and NODELIST the nodes where the job is running.

Submitting/Canceling a Job

Jobs are usually submitted via a script file (see above). The sbatch command is used:

$ sbatch my-script.scr
Submitted batch job 88791

The number that shows after the script is submitted corresponds to the JobID. It can be used to cancel/kill the job:

$ scancel 88791
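
When submission is scripted, it can help to capture the JobID in a variable so it can be passed to scancel or squeue later. The sketch below assumes sbatch prints either its default message ("Submitted batch job 88791") or, with the --parsable option, just the ID optionally followed by ";" and a cluster name. The function name extract_job_id is hypothetical.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch output.
# Handles "Submitted batch job 88791", "88791", and "88791;clustername".
extract_job_id() {
    local out=$1
    out=${out#Submitted batch job }   # drop the default prefix if present
    out=${out%%;*}                    # drop a trailing ";cluster" if present
    echo "${out%% *}"                 # keep only the first word
}

# Only call sbatch where it exists; my-script.scr is the script name used above.
if command -v sbatch >/dev/null 2>&1 && [ -f my-script.scr ]; then
    jobid=$(extract_job_id "$(sbatch my-script.scr)")
    echo "Submitted job $jobid; cancel with: scancel $jobid"
fi
```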