What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is one of the most widely used open-source job schedulers on HPC clusters. It is responsible for:
- Tracking available resources (nodes, cores, memory, GPUs, etc.)
- Accepting, queuing, and prioritizing user jobs
- Allocating resources for jobs
- Starting and stopping jobs on compute nodes
- Recording usage for accounting and reporting
On most modern clusters, interacting with SLURM is the main way you request and use compute resources.
Key ideas:
- You do not run heavy workloads directly on login nodes.
- You describe what you need (resources, time, etc.) to SLURM.
- SLURM finds a place and time for your job and runs it there.
SLURM’s Basic Components and Terminology
You will mainly use SLURM through a set of command-line tools. Some important terms:
- Job: A unit of work submitted to SLURM (batch job, interactive job, or job step).
- Partition: A logical grouping of nodes (e.g., `short`, `long`, `gpu`), often with different limits and policies.
- Node: A physical or virtual machine in the cluster.
- Task: A unit of execution, often one MPI process (`srun` launches tasks).
- Job ID: A unique identifier assigned to your job when submitted.
- Account / Project: Identifier for billing/usage tracking; often required with `--account` / `-A`.
Core daemons (for background understanding, not something you manage):
- `slurmctld`: Central controller, manages the job queue and scheduling.
- `slurmd`: Runs on each node, starts and stops job processes there.
Typical SLURM Workflow Overview
A minimal SLURM usage cycle looks like:
- (Optional) Test interactively with SLURM:
  - Use `srun` or `salloc` to get an interactive shell or run a simple command.
- Write a batch script:
  - A shell script with special `#SBATCH` lines describing resource requirements and job settings.
- Submit the script:
  - `sbatch my_job.sh`
- Monitor the job:
  - `squeue`, `sacct`, or site-specific tools.
- Inspect results:
  - Check output and error files produced by the job.
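Putting the cycle together, a minimal session might look like the sketch below (the script name and job ID are placeholders):

```bash
# (Optional) quick interactive sanity check
srun --time=00:05:00 --ntasks=1 hostname

# Write my_job.sh, then submit it
sbatch my_job.sh          # prints e.g. "Submitted batch job 123456"

# Monitor while queued or running
squeue -u $USER

# After completion, inspect the output file
# (slurm-<jobid>.out is the default name unless your script changes it)
less slurm-123456.out
```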
The details of writing scripts, submitting, monitoring, and modifying jobs are in later chapters; here we focus on how SLURM itself is used at a basic level.
Interactive vs Batch Use of SLURM
SLURM supports two main ways of running work:
Batch jobs (non-interactive)
- You prepare a script with `#SBATCH` directives.
- You submit it with `sbatch`.
- SLURM queues it, runs it when resources are available, and writes output to files.
- Best for production runs and long simulations.

Example (just the SLURM part, not a full explanation):

```bash
sbatch my_job.sh
```

You do not stay logged in waiting for it; SLURM handles it in the background.
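For orientation, a minimal `my_job.sh` might look like the sketch below; the resource values and program name are placeholders, and batch scripts are covered in detail in a later chapter:

```bash
#!/bin/bash
#SBATCH --job-name=example     # name shown in squeue
#SBATCH --time=00:30:00        # wall-clock limit: 30 minutes
#SBATCH --ntasks=1             # a single task
#SBATCH --mem=2G               # memory per node

# Everything below runs on the allocated compute node(s)
srun ./my_program              # ./my_program is a placeholder
```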
Interactive jobs
- You request resources and get an interactive shell or run a command directly on the compute node(s).
- Useful for development, debugging, testing, or running small experiments.
Two common tools:
- `salloc`: Allocate resources, then manually run commands inside that allocation.
- `srun`: Run a command under SLURM's control (within an existing allocation or creating a simple one).
Examples:
```bash
# Simple interactive shell on a compute node for 30 minutes
salloc --time=00:30:00 --ntasks=1 --mem=2G

# Run a single command directly under SLURM
srun --time=00:10:00 --ntasks=1 hostname
```

Core SLURM Commands You Will Encounter
These commands form the backbone of day-to-day SLURM usage. Later chapters will go into scripting and options in depth; here we introduce what each is for.
`sbatch`: Submit batch jobs
- Takes a script and sends it to the queue.
- Returns a job ID.
Typical pattern:
```bash
sbatch my_script.sh
# Submitted batch job 123456
```

You'll see this command used anytime you see "submit a job script."
`srun`: Run programs under SLURM
Two main roles:
- Launch tasks inside an existing allocation (e.g., MPI ranks, OpenMP processes).
- Or create a simple allocation and run a command immediately.
Example inside a job script:
```bash
srun ./my_program
```
Think of srun as “start my parallel job under SLURM’s control.”
`salloc`: Request an interactive allocation
- Allocates resources but does not immediately run a specific command.
- You get a shell; everything you run from that shell uses the allocated resources.
Example:
```bash
salloc --nodes=1 --ntasks=4 --time=01:00:00
# Now you're in a shell that has the allocation;
# you might then run:
srun ./debug_version
```

`scancel`: Cancel jobs
- Stop a queued or running job.
- You will use the job ID from `sbatch` or `squeue`.

Example:

```bash
scancel 123456
```

Monitoring and accounting commands
These are discussed more deeply in the monitoring chapter, but you should recognize them:
- `squeue`: View queued and running jobs.
- `sacct`: Show historical accounting info (once jobs have finished).
- `scontrol`: Advanced inspection and control (cluster admins and power users).
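A few typical invocations (the job ID is a placeholder):

```bash
# Your own queued and running jobs
squeue -u $USER

# Accounting summary for a finished job
sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS

# Detailed record of a job (while SLURM still has it in its active records)
scontrol show job 123456
```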
Basic Resource Requests in SLURM
SLURM commands share a common style for requesting resources. You will see the same options with `sbatch`, `srun`, and `salloc`.
Some commonly used options (names may vary between sites; consult your cluster docs):
- `--time` or `-t`: Wall-clock time limit, e.g. `--time=01:30:00` for 1.5 hours.
- `--nodes` or `-N`: Number of nodes.
- `--ntasks` or `-n`: Number of parallel tasks (often MPI ranks).
- `--cpus-per-task`: Number of CPU cores per task (often for OpenMP threads).
- `--mem`: Memory per node, e.g. `--mem=8G`.
- `--partition` or `-p`: Which partition/queue to use (e.g. `short`, `long`, `gpu`).
- `--gres`: Generic resources (e.g., GPUs): `--gres=gpu:2`.
- `--account` or `-A`: Which project/account to charge.
Example resource specification (not a complete script):
```bash
sbatch --time=02:00:00 --nodes=2 --ntasks=64 --mem=4G my_job.sh
```

The idea: you describe your resource needs, SLURM decides where and when to run your job.
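The same requests can equally be placed inside the script as `#SBATCH` lines; the following sketch is equivalent to the command above (`./my_program` is a placeholder):

```bash
#!/bin/bash
#SBATCH --time=02:00:00   # 2 hours of wall-clock time
#SBATCH --nodes=2         # 2 nodes
#SBATCH --ntasks=64       # 64 tasks in total
#SBATCH --mem=4G          # 4 GB of memory per node

srun ./my_program
```

Options given on the `sbatch` command line take precedence over the corresponding `#SBATCH` lines, so the script can carry sensible defaults.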
Understanding Partitions and Policies at a High Level
Every SLURM installation is configured by the site administrators and can look a bit different. However, there are common ideas:
- Partitions group nodes and define policies:
  - Limits on maximum wall time per job.
  - Who can use them.
  - Priority rules.
- Examples you might see:
  - `short` (small jobs, short max time, fast turnaround)
  - `long` (larger/longer jobs)
  - `gpu` (nodes with GPUs)
  - `debug` (short time limit, for testing and debugging)
To see partitions:
```bash
sinfo
```

You’ll typically pick a partition that matches the problem size and runtime you expect, following local guidelines.
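If you want partition limits rather than just names, a custom output format is a common pattern (the format string below is one possibility):

```bash
# Partition, time limit, availability, and node count
sinfo -o "%P %l %a %D"
```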
SLURM Job States (High-Level)
Jobs move through several states in SLURM; understanding these helps interpret squeue output:
Common states:
- `PD` (PENDING): Waiting in the queue; not yet running.
- `R` (RUNNING): Currently executing on compute nodes.
- `CG` (COMPLETING): Finishing up; tasks ending, resources being released.
- `CA` (CANCELLED): Cancelled by user or admin.
- `F` (FAILED): Job did not complete successfully.
- `CD` (COMPLETED): Job finished successfully.
You will see these in `squeue -u $USER` or in accounting output (`sacct`).
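For example (the job ID is a placeholder):

```bash
# Job ID, compact state code, and job name for your jobs
squeue -u $USER -o "%i %t %j"

# Final state and exit code of a finished job
sacct -j 123456 --format=JobID,State,ExitCode
```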
SLURM and Parallelism (Conceptual View)
SLURM works with the parallel programming models covered elsewhere:
- MPI jobs:
  - You request multiple tasks (`--ntasks`) and use `srun` or `mpirun` (cluster-dependent) to start MPI ranks.
- OpenMP or threaded jobs:
  - You request fewer tasks but more CPUs per task (`--cpus-per-task`) and set `OMP_NUM_THREADS` accordingly.
- Hybrid jobs (MPI + OpenMP):
  - Combination: `--ntasks` for MPI ranks and `--cpus-per-task` for threads per rank.
SLURM itself does not implement MPI or threading; it ensures that the requested processes get the appropriate resources on the cluster.
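As an illustration, the resource-request part of a hybrid job might look like the sketch below (8 MPI ranks with 4 threads each; `./hybrid_program` is a placeholder, and whether you launch with `srun` or `mpirun` is cluster-dependent):

```bash
#!/bin/bash
#SBATCH --ntasks=8            # 8 MPI ranks
#SBATCH --cpus-per-task=4     # 4 CPU cores per rank, for OpenMP threads

# Let OpenMP use exactly the cores SLURM assigned to each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./hybrid_program
```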
Site-Specific Differences and Documentation
While SLURM options are largely standardized, clusters often have:
- Different partition names and policies.
- Different defaults for memory, cores, time limits.
- Additional required options (e.g., `--account`, `--qos`).
Always:
- Check your site’s user guide.
- Look at example SLURM job scripts provided by your HPC center.
- Use `man` pages for details (e.g., `man sbatch`, `man srun`).
You will combine this general SLURM knowledge with site-specific rules when you start running real jobs.