SLURM (Queueing System)
Rockfish uses SLURM (Simple Linux Universal Resource Manager) to manage resource scheduling and job submission. SLURM is an open source application with active developers and an increasing user community. It has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing, that is “interactive” use of login nodes for job processing is not allowed. Users who need to interact with their codes while these are running can request an interactive session using the script “interact”, which will submit a request to the queuing system that will allow interactive access to the node.
SLURM uses “partitions” to divide types of jobs (partitions are called queues on other schedulers). Rockfish defines a few partitions that will allow sequential/shared computing and parallel (dedicated or exclusive nodes), GPU jobs and large memory jobs. The default partition is “defq”.
The following table describes the attributes for the different partitions:
|Partition||Available Nodes||Max Time (Hours)||Max Cores per Node||Max Memory per Node (MB)|
|defq||667||1 / 72||48||192,000|
|a100||17||1 / 72||48||192,000|
|bigmem||22||1 / 48||48||1,537,000|
|v100||1||1 / 72||48||193,118|
Here is a list of useful SLURM commands.
|Submit a job script||sbatch script-name|
|Queue list and features||sinfo|
|List all jobs||squeue|
|List jobs by user||squeue -u [userid] OR sqme|
|Check job status||squeue [job-id]|
|Show resource efficiency of job||seff [job-id]|
|Delete a job||scancel [job-id]|
|Hold a job||scontrol hold|
|Release a held job||scontrol release|
|Change job resources||scontrol update|
|Show finished jobs||sacct|
SLURM will set or preset environmental variables that can be used in your script. Here is a table with the most common variables and a LOG file of the SLURM variables et by a SLURM job.
|Submit Directory||$SLURM_SUBMIT_DIR (default)|
|Job Array Index||$SLURM_ARRAY_TASK_ID|
This is a list of the most common flags that any user may include on scripts to request different resources and features for jobs.
|Job Name||#SBATCH --job-name=My-Job_Name|
|Wall time hours||#SBATCH --time=24:0:0|
|Number of nodes requested||#SBATCH --nodes=1|
|Number of processes per node requested||#SBATCH --ntasks-per-node=24|
|Number of cores per task requested||#SBATCH --cpus-per-task=24|
|Send mail at the end of the job||#SBATCH --mail-type=end|
|User's email address||#SBATCH --email@example.com|
|Copy user's environment||#SBATCH --export=[ALL|NONE|Variables]|
|Working Directory||#SBATCH --workdir=dir-name|
|Job Restart||#SBATCH --requeue|
|Share Nodes||#SBATCH --shared|
|Dedicated nodes||#SBATCH --exclusive|
|Memory Size||#SBATCH --mem=[mem |M|G|T] or --mem-per-cpu|
|Account to Charge||#SBATCH --account=[account]|
|Quality of Service||#SBATCH --qos=[name]|
|Job Arrays||#SBATCH --array=[array_spec]|
|Use specific resource||#SBATCH --constraint="XXX"|
Important Flags for Your Jobs
Users need to pay special attention to these flags because proper management will benefit both the user and the scheduler.
Walltime requested using
--time should be larger than, but close to, actual processing time. If the requested time is not enough, the job will be aborted before the program finishes and results may be lost, while SU’s will still be charged from your allocation. On the other hand, if the requested time is too long, the job will remain in the queue for a longer time as the scheduler tries to allocate the resources needed. Once resources are allocated to your job these will be unavailable for other jobs and will affect the scheduler’s ability to most efficiently allocate resources for all users.
Nodes, tasks, and cpus
Dedicated nodes can be specified with the
--exclusive flag and all CPUs and memory for each node will be allocated. Programs that rely heavily on data transfer between tasks may be suited for exclusive nodes. If exclusive nodes are not needed, whether the jobs are too small for a single node or do not leverage shared memory, the
--shared flag will designate that a fraction of each node may be used.
Parallel processing may be done with either multiple processes, threads, or a combination of both. A single process may have multiple threads sharing memory. Multiple processes require some form to communicate, for example MPI. In SLURM, the number of processes is controlled by setting the number of “tasks”, while threads are controlled by the number of “cpus” (see below for relevant flags). –ntasks-per-node should only be used for MPI jobs to avoid confusion. Otherwise, –cpus-per-task should be used to specify the number of CPUs.
The number of nodes can be specified using the
-N flags and takes the form of min-max (e.g. 2-4). If a single number is given, the scheduler will only allocate that number of nodes. You can also specify the resources needed by giving the number of tasks with
-n along with the number of
--cpus-per-task, in which case the scheduler will decide on the appropriate number of nodes for your job. You may also specify the number
--ntasks-per-node, which will multiply
--cpus-per-task if both are used. Be aware that if you ask for more CPUs than are available in a single node, the scheduler will refuse your request and throw an exception. Similarly, you may be denied the use of the ‘shared’ partition if you try to ask for more than 1 node. Finally, you may also request a minimum number of CPUs with the
--mem flag specifies the total amount of memory per node. The
--mem-per-cpu specifies the amount of memory per allocated CPU. The two flags are mutually exclusive.
For the majority of nodes, each CPU requested reserves 5GB of memory, with a maximum of 120GB. If you use the
--mem flag and the
--cpus-per-task flag together, the greater value of resulting CPU’s will be charged to your account.
If your job does not need a particular amount of memory, that is it will run within the minimum amount of memory per node (120GB), use these lines in your script.
#SBATCH -p defq #SBATCH --mem=0
Adding these flags to your job submission will cause the job to use all of the available memory. Your account will be charged accordingly, meaning the entirety of the node’s available CPU’s.
Requesting gpus requires both the right partition as well as the “gres” flag. Also, one must request a total of 6 CPUs per GPU with a combination of
--cpus-per-task. For example:
#SBATCH -p a100 #SBATCH --gres=gpu:2 #SBATCH --ntasks-per-node=2 #SBATCH --cpus-per-task=6
If you would like an interactive session, you can use the provided script:
$ interact -n 6 -p a100 -g 1
Note that the
interact script has the shortcut
-g flag for requesting gpus and does not take “gres” as an input. In the example we requested
-n 6 because each gpu is associated with 6 cpus (cores).
The environment variable with the device your job is assigned to is
$CUDA_VISIBLE_DEVICES. For example, if you requested 2 gpus, this variable may be set to “1,3”, which indicates the devices visible to your code. When setting the cuda device to use, you provide to
cudaSetDevices() the index of the device number to use (beginning at zero) and it maps to the node’s device numbering using
$CUDA_VISIBLE_DEVICES. In this example, using
cudaSetDevices(0) sets device number 1 and
cudaSetDevices(1) sets device number 3.
If users need to run application that need interactive processing, like applications that need a GUI, post-processing, or visualization, MARCC provides an “interact” command that will allow users to request an interactive session hosted on a compute node.
Below are two common interact commands where “-p” is the partition, “-c” is number of cores, and “-t” is the walltime requested
$ interact -p debug -c 2 -t 120
$ interact -p express -c 6 -t 12:00:00
You can use the following command to find more information about interact.
$ interact --usage
A job array can be specified in a script when submitted using
sbatch. For example,
$ sbatch --array=0-15%4 script.sh
would submit script.sh 16 times, with id’s 0 through 15. The
%4 is optional and would only allow 4 jobs to run concurrently.
Within script.sh, there are three environment variables that can be used:
$SLURM_JOBID is sequential for each job and depends on the queue;
$SLURM_ARRAY_JOB_ID is the same for all jobs in the array and equal to the $SLURM_JOBID of the first job; and
$SLURM_ARRAY_TASK_ID is equal to the index specified with the array option (which could be for example
--array=1-7:2 where 2 is the step size).
To specify slurm stdin, stdout, and stderr files, use %A instead of SLURM_ARRAY_JOB_ID and %a instead of SLURM_ARRAY_TASK_ID. For example:
$ sbatch -o slurm-%A_%a.out --array=0-15%4 script.sh
would output to files named slurm-45_0.out, slurm-46_1.out, slurm-47_2.out, … (assuming 45 is the id of the first job).
A simple script to run an MPI job using 24 cores (a single node) would look like this:
#!/bin/bash #SBATCH --job-name=MyJob #SBATCH --time=24:0:0 #SBATCH --partition=shared #SBATCH --nodes=1 # number of tasks (processes) per node #SBATCH --ntasks-per-node=24 #SBATCH --mail-type=end #SBATCH --firstname.lastname@example.org #### load and unload modules you may need # module unload openmpi/intel # module load mvapich2/gcc/64/2.0b module list #### execute code and write output file to OUT-24log. # time mpiexec ./code-mvapich.x > OUT-24log echo "Finished with job $SLURM_JOBID" #### mpiexec by default launches number of tasks requested
The squeue command will display all jobs that have been submitted to the queues. The output is usually long due to the large number of jobs running or waiting to be executed. The “sqme” script will show jobs that belong to the user.
$ sqme Wed Sep 21 11:53:58 2016 JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON) 88791 parallel Job1 jcombar1 RUNNING 1:22 1:10:00 4 compute[0301-0304]
The columns are self-explanatory. TIME indicates the time the job has consumed, TIME_LIMIT the maximum amount of time requested and NODELIST shows the nodes where the job is running.
Submitting/Canceling a Job
Jobs are usually submitted via a script file (see above). The sbatch command is used:
$ sbatch my-script.scr 88791
The number that shows after the script is submitted corresponds to the JobID. It can be used to cancel/kill the job:
$ scancel 88791