Using the Slurm workload manager

Slurm is an open-source workload manager and job scheduler that has replaced PBS Torque/Maui on all SeaWulf queues. This FAQ explains how to use Slurm to submit jobs. It draws on several web resources; please see here and here for additional documentation.

Audience: Faculty, Researchers and Staff

This KB Article References: High Performance Computing
Last Updated: December 05, 2019

To use Slurm, first load the following modules:

module load shared
module load slurm

 

Slurm wrappers for Torque 

Users are encouraged to learn to use Slurm commands (see below) to take full advantage of Slurm's flexibility and to facilitate easier troubleshooting.  However, to ease the transition between the two workload systems, Slurm comes equipped with several wrapper scripts that will allow users to use many common Torque commands.  These include:

 qsub, qstat, qdel, qhold, qrls, and pbsnodes

 

Slurm commands

The following tables provide a list of HPC job-related functions and the equivalent Torque and Slurm commands needed to execute these functions.

User commands              PBS/Torque            SLURM
Job submission             qsub [filename]       sbatch [filename]
Job deletion               qdel [job_id]         scancel [job_id]
Job status (by job)        qstat [job_id]        squeue --job [job_id]
Full job status (by job)   qstat -f [job_id]     scontrol show job [job_id]
Job status (by user)       qstat -u [username]   squeue --user=[username]
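
The commands in the table can be chained in a small shell helper. The sketch below is illustrative, not a tool provided on SeaWulf: the function names job_id_from_sbatch and submit_and_show are hypothetical, and the parser assumes sbatch prints a confirmation line of the form "Submitted batch job [job_id]".

```shell
#!/bin/bash
# Hypothetical helpers (not part of Slurm) built on the commands listed above.

# Extract the numeric job ID from sbatch's confirmation line,
# assumed to look like: "Submitted batch job 123456".
job_id_from_sbatch() {
    local line="$1"
    # The job ID is taken to be the last space-separated field.
    local id="${line##* }"
    # Accept only purely numeric IDs; anything else is treated as a failure.
    case "$id" in
        ''|*[!0-9]*) return 1 ;;
        *) printf '%s\n' "$id" ;;
    esac
}

# Submit a job script, then show the queue entry for the new job.
submit_and_show() {
    local output id
    output="$(sbatch "$1")" || return 1
    id="$(job_id_from_sbatch "$output")" || return 1
    squeue --job "$id"
}
```

After loading the slurm module, running submit_and_show [filename] submits the script and prints its queue entry; scancel [job_id] would then remove the job.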

 

Environment variables   PBS/Torque       SLURM
Job ID                  $PBS_JOBID       $SLURM_JOBID
Submit Directory        $PBS_O_WORKDIR   $SLURM_SUBMIT_DIR
Node List               $PBS_NODEFILE    $SLURM_JOB_NODELIST
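
One practical difference is worth illustrating: $PBS_NODEFILE names a file with one host per line, whereas $SLURM_JOB_NODELIST holds a compact string that usually needs to be expanded before use. The sketch below reports the job context from inside a job script and expands the node list with scontrol show hostnames. The function name describe_job and the "not set" placeholder text are hypothetical choices for this example.

```shell
#!/bin/bash
# Hypothetical helper for use inside a job script: report the job context
# using the Slurm environment variables listed above.
describe_job() {
    # Fall back to placeholder text when run outside a Slurm job.
    echo "Job ID:           ${SLURM_JOBID:-not set}"
    echo "Submit directory: ${SLURM_SUBMIT_DIR:-not set}"
    echo "Node list:        ${SLURM_JOB_NODELIST:-not set}"

    # Unlike $PBS_NODEFILE (a file with one host per line), the Slurm
    # variable is a compact string; scontrol expands it to one host per line.
    if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$SLURM_JOB_NODELIST"
    fi
}
```

Calling describe_job near the top of a job script records where and how the job ran in the job's output file.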

 

Job specification       PBS/Torque                    SLURM
Script directive        #PBS                          #SBATCH
Job Name                -N [name]                     --job-name=[name] OR -J [name]
Node Count              -l nodes=[count]              --nodes=[min[-max]] OR -N [min[-max]]
CPU Count               -l ppn=[count]                --ntasks-per-node=[count]
CPUs Per Task           (none listed)                 --cpus-per-task=[count]
Memory Size             -l mem=[MB]                   --mem=[MB] OR --mem-per-cpu=[MB]
Wall Clock Limit        -l walltime=[hh:mm:ss]        --time=[min] OR --time=[days-hh:mm:ss]
Node Properties         -l nodes=4:ppn=8:[property]   --constraint=[list]
Standard Output File    -o [file_name]                --output=[file_name] OR -o [file_name]
Standard Error File     -e [file_name]                --error=[file_name] OR -e [file_name]
Combine stdout/stderr   -j oe (both to stdout)        (default if you don't specify --error)
Job Arrays              -t [array_spec]               --array=[array_spec] OR -a [array_spec]
Delay Job Start         -a [time]                     --begin=[time]
Select Queue            -q [queue_name]               -p [queue_name]
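
To make the mapping concrete, the sketch below prints a Slurm header corresponding to the Torque directives shown in its comments. The function name print_slurm_header and its argument order are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print a Slurm header equivalent to this Torque header:
#   #PBS -N [name]
#   #PBS -l nodes=[count]:ppn=[count]
#   #PBS -l walltime=[hh:mm:ss]
#   #PBS -q [queue_name]
# Arguments: job name, node count, tasks per node, wall clock limit, queue.
print_slurm_header() {
    local name="$1" nodes="$2" tasks="$3" walltime="$4" queue="$5"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=${name}
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${tasks}
#SBATCH --time=${walltime}
#SBATCH -p ${queue}
EOF
}
```

For example, print_slurm_header test 2 40 00:05:00 short-40core writes a header requesting 2 nodes with 40 tasks each for 5 minutes; the commands the job should run would follow that header in the script.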

 

Please note that when you submit jobs with Slurm, all of your environment variables are copied into the job's environment by default. This includes every module loaded on the login node at the time you submit the job. You can disable this with the --export=NONE flag to sbatch, which makes Slurm behave the same way as Torque: the job loads environment variables only from your ~/.bashrc and ~/.bash_profile, not from the current environment.
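
A minimal sketch of a clean-environment submission is shown here. The wrapper name submit_clean is hypothetical. Because nothing is inherited from the login session, the job script itself must load every module it needs.

```shell
#!/bin/bash
# Hypothetical wrapper: submit a job without copying the login-node
# environment (including any loaded modules) into the job.
submit_clean() {
    # --export=NONE is the sbatch flag described above; all other
    # arguments (script name, extra options) are passed through unchanged.
    sbatch --export=NONE "$@"
}
```

Running submit_clean [filename] is then equivalent to typing sbatch --export=NONE [filename].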

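By default, Slurm scripts also run in the directory from which you submit the job. You can change this with the -D <directory> or --chdir=<directory> flag to sbatch. For example, you could submit your script with

sbatch -D $HOME [filename]

to replicate the behavior of a PBS Torque script, which by default runs in your home directory. If you set this option on a #SBATCH line inside the script instead, write the directory out as a full path, because sbatch does not expand shell variables such as $HOME in #SBATCH directives.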

Submitting interactive jobs with Slurm

You may use the following to submit an interactive job:

srun -J [job_name] -N 1 -p [queue_name] --ntasks-per-node=28 --pty bash

This will start an interactive job on a single node with 28 tasks on that node (by default, one CPU per task). The key flag for creating an interactive job is --pty bash, which opens a bash session on the compute node so that you can issue commands interactively.

Example Slurm job script for 40-core queues

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --ntasks-per-node=40
#SBATCH --nodes=2
#SBATCH --time=05:00
#SBATCH -p short-40core

# In this example we're using the 2019 versions of the Intel compiler, MPI implementation, and math kernel library

module load shared
module load intel/compiler/64/2019/19.0.5
module load intel/mkl/64/2019/19.0.5
module load intel/mpi/64/2019/19.0.0

cd /gpfs/projects/samples/intel_mpi_hello/
mpiicc mpi_hello.c -o intel_mpi_hello

mpirun ./intel_mpi_hello

This job will use 2 nodes, with 40 tasks (one CPU each) per node, for up to 5 minutes in the short-40core queue to compile and run the intel_mpi_hello program.

If we named this script "test.slurm", we could submit the job using the following command:

sbatch test.slurm
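
Once the job is submitted, it can be followed with squeue, and its output is written to res.txt as requested by the --output directive. The sketch below polls the queue until the job is gone. The function name wait_for_job and the 30-second default interval are arbitrary choices for this example, and it assumes that squeue --noheader --job [job_id] prints nothing once the job is no longer pending or running.

```shell
#!/bin/bash
# Hypothetical helper: block until a job no longer appears in the queue.
# Usage: wait_for_job [job_id] [poll_seconds]
wait_for_job() {
    local id="$1" interval="${2:-30}"
    # --noheader suppresses the column titles, so any remaining output
    # means the job is still listed. Errors (for example an ID that
    # Slurm no longer knows) are discarded and treated as "not listed".
    while [ -n "$(squeue --noheader --job "$id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "Job $id has left the queue"
}
```

After wait_for_job [job_id] returns, cat res.txt shows what the job printed.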

Example Slurm job script for GPU queues

#!/bin/bash
#
#SBATCH --job-name=test-gpu
#SBATCH --output=res.txt
#SBATCH --ntasks-per-node=28
#SBATCH --nodes=1
#SBATCH --time=05:00
#SBATCH -p gpu

module load shared
module load anaconda/3
module load cuda91/toolkit/9.1
module load cudnn/6.0

source activate tensorflow1.4


cd /gpfs/projects/samples/tensorflow

python tensor_hello3.py

The documentation for Slurm can be found here.

Getting Help


The Division of Information Technology provides support on all of our services. If you require assistance please submit a support ticket through the IT Service Management system.

Supported By


IACS Support System