
Introduction to SLURM

Getting Oriented with SLURM

SLURM is one of the most widely used job schedulers on modern HPC clusters. It manages which jobs run where and when, based on the resources they request and the policies of the system. In the broader context of job scheduling and resource management, SLURM is a concrete system you will interact with directly through the command line.

This chapter focuses on how SLURM is structured conceptually, the vocabulary it uses, and how you as a user will think about jobs and resources in SLURM. Detailed scripting, submission, and monitoring will be addressed in later chapters. Here the goal is to become comfortable with what SLURM is, what it controls, and which basic commands and concepts you must know from the start.

What SLURM Does for You

SLURM sits between you and the compute nodes of the cluster. You do not log in directly to compute nodes or run large workloads on login nodes. Instead, you describe your resource needs to SLURM, and SLURM runs your work on appropriate compute nodes when resources become available.

From a user perspective, SLURM provides three central capabilities. It lets you express resource requests, it decides when and where a job will run according to scheduling policies, and it launches and manages the processes that carry out your computation on the allocated resources.

SLURM is modular and can be configured differently on each cluster, but from your point of view the basic pattern is consistent. You log in, load your software environment if needed, submit jobs through SLURM, and then use SLURM commands to track and manage them.

Core SLURM Concepts and Vocabulary

To use SLURM effectively you must understand a few key terms that appear in commands, scripts, and system documentation.

A SLURM cluster is the whole managed system under SLURM’s control, often spanning many compute nodes. Each node is usually a physical server with many CPU cores, often spread across multiple sockets and NUMA domains. Within SLURM each node has a name, a set of resources, and a state such as idle, allocated, or down.

Jobs are user requests to run a specific task or workflow on the cluster. Each job has an identifier, the job ID, and a description of the resources it needs and how long it is expected to run. A job may contain one or more job steps, each step representing a specific executable invocation within the job’s allocation.

SLURM works with partitions, which are logical groupings of nodes. A partition typically corresponds to a queue or class, such as debug, long, gpu, or highmem. Partitions can differ in walltime limits, priority, hardware characteristics, or access restrictions. When you submit a job, you often specify which partition to use, and SLURM then chooses nodes from that partition for your job.

Accounts and quality of service, usually abbreviated QOS, are used for resource accounting and priority control. An account identifies a project or group that resources should be charged to. A QOS describes policy characteristics such as maximum number of running jobs, maximum time limits, or relative priority. You may need to set an account and possibly a QOS when submitting jobs, depending on your site configuration.

Resources in SLURM are described in several ways. The most common are number of nodes, number of tasks, CPUs per task, memory, and time. A task corresponds roughly to a process in a parallel job model, such as an MPI rank. CPUs per task indicate how many cores are reserved for each task, which matters when combining processes and threads. Memory can be requested per node or per CPU, depending on cluster configuration. Time is specified as a wall clock limit; when this limit is reached, SLURM cancels the job.

SLURM uses states to describe the life cycle of a job. Typical states you will see include PENDING, RUNNING, COMPLETED, CANCELLED, FAILED, and TIMEOUT. While details will appear in monitoring and debugging chapters, at this stage it is important to know that these states tell you why a job is waiting or how it finished.

Interactive Use and Batch Use

SLURM supports two main modes of use: interactive sessions and batch jobs. These modes share the same underlying resource allocation logic but provide different workflows suited to different tasks.

Interactive use is for work where you want a shell or a program running on the compute nodes while you interact with it. For example, to test small commands or to debug code you might request an interactive allocation. In this case SLURM reserves resources for you and then gives you an interactive shell on one of the allocated nodes. From that shell you can start your program, run short tests, and experiment.

Batch use is for production runs or any work that should run unattended. In this mode you place your resource requests and job instructions into a script, then ask SLURM to run that script. SLURM queues the job, starts it when resources are available, and records its output to log files. Batch jobs are the typical pattern for serious HPC workloads.

Even though the interaction style is different, both modes rely on the same concepts of partitions, accounts, and resource requests. The choice between interactive and batch is about convenience and scale, not fundamentally different types of computation.

Basic SLURM User Commands

SLURM provides a family of commands, most of which start with the letter s. While later chapters will give recipes for actual usage, it is useful here to know the basic purpose of each of the primary commands so that documentation and examples make sense.

The command sbatch submits a batch script to SLURM. You give SLURM a script that contains both the resource specification and the commands to run. SLURM responds with a job ID and places the job into the queue.
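As a sketch, a submission might look like the following; job.sh is a placeholder name for your batch script and the job ID in the response is illustrative:

    $ sbatch job.sh
    Submitted batch job 123456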

The command srun is SLURM’s parallel job launcher. Inside a batch job, srun starts tasks on the allocated resources according to the job’s resource specification. Outside of a job, on some systems, srun can also request a temporary allocation and launch tasks directly.
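For example, assuming a placeholder executable called my_program, the two uses might look like this sketch:

    # inside a batch script: launch tasks on the job's allocated resources
    srun ./my_program

    # outside a job, where the site permits it: request a small temporary
    # allocation and run a command as four tasks
    srun --ntasks=4 hostname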

The command salloc requests an interactive allocation. It reserves resources for you and then typically starts an interactive shell in which you can run your own commands. This is conceptually like sbatch, but instead of running a predefined script, you issue commands manually during the allocation.
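A minimal sketch of such a session is shown below; whether the shell runs on the login node or on an allocated compute node depends on the site configuration, and my_test is a placeholder:

    salloc --ntasks=4 --time=00:30:00   # request 4 tasks for 30 minutes
    # ... work interactively, for example: srun ./my_test ...
    exit                                # leaving the shell releases the allocation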

To inspect the state of the system and your jobs, you use squeue, scontrol, and sinfo. The command squeue lists jobs and their states in the scheduler’s queue. The command sinfo summarizes the state of partitions and nodes, including which nodes are idle or allocated and what configurations exist. For more detailed information about a particular job, node, or partition, scontrol show provides detailed, structured output.
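Typical invocations look like the following sketch; the job ID and partition name are illustrative:

    squeue -u $USER                 # list your own jobs and their states
    sinfo                           # summarize partitions and node states
    scontrol show job 123456        # detailed view of one job
    scontrol show partition debug   # detailed view of one partition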

To manage or change jobs, SLURM provides scancel to cancel a job and scontrol update to modify certain attributes of pending or running jobs. The precise extent to which you can alter jobs after submission depends on site policies, but the consistent pattern is that you use SLURM commands and never kill parallel jobs manually with generic system tools.
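For example, with an illustrative job ID:

    scancel 123456        # cancel one specific job
    scancel -u $USER      # cancel all of your own jobs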

Resource Requests in SLURM

The heart of interacting with SLURM is the resource request, which describes what your job needs from the cluster. You can think of this as a compact sentence describing nodes, tasks, CPUs, memory, and time.

Some of the most important resource controls appear throughout SLURM documentation and examples. For example, you can specify the partition with an option like --partition, often shortened to -p. The number of nodes can be specified with --nodes. The number of tasks, which you might equate with MPI ranks in a parallel program, is given with --ntasks. The CPUs per task, important for threaded programs, are specified with --cpus-per-task. Memory limits are often given with --mem for memory per node or --mem-per-cpu for memory per CPU. A time limit is set with --time, usually in a format like HH:MM:SS.

These options can appear either on the command line, for example when using srun or salloc interactively, or as directives in a batch script. Batch scripts typically include lines that start with #SBATCH followed by these options. SLURM collects these options and constructs an internal description of your job’s requirements.
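Putting these pieces together, the header of a batch script might look like the following sketch; the partition and account names are hypothetical and must be replaced with values that exist on your cluster, and my_program is a placeholder:

    #!/bin/bash
    #SBATCH --partition=debug          # hypothetical partition name
    #SBATCH --account=myproject        # hypothetical account name
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=8G                   # memory per node; interpretation is site specific
    #SBATCH --time=01:00:00            # wall clock limit, HH:MM:SS

    srun ./my_program                  # launch the tasks on the allocated resources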

There is an important conceptual relationship between tasks, CPUs, and nodes. If you request $N$ nodes, $T$ tasks per job, and $C$ CPUs per task, and the cluster supports it, your total core count will be approximately $T \times C$, distributed across the $N$ nodes. In practice, cluster constraints such as cores per node and memory per node must also be satisfied.
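As a concrete illustration of this arithmetic, consider a sketch of a request like the following:

    #SBATCH --nodes=2
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=4
    # 8 tasks x 4 CPUs per task = 32 cores in total, spread over 2 nodes
    # (16 cores per node if tasks are distributed evenly), which only works
    # if each node in the chosen partition has at least 16 cores available.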

A job will not start if your resource request cannot be satisfied by any combination of nodes in the chosen partition. Make sure that the product of requested tasks and CPUs per task fits within the hardware configuration and policies of that partition.

Some clusters require that you request entire nodes, while others allow you to share nodes with other users. Whether --nodes is required and how memory is interpreted depend on local configuration, so you should always check the site specific documentation. However, the names of the options and their conceptual meaning are consistent across SLURM installations.

Time, Priority, and Fairness

SLURM is not only a launcher; it is also a scheduler that tries to distribute resources fairly and efficiently. It does this through job priorities, time limits, and scheduling policies configured by administrators.

Your requested time limit plays two roles. First, it is a hard bound after which SLURM kills your job if it is still running. Second, it is often a key factor in how SLURM schedules your job relative to others. Shorter jobs may be backfilled, which means they are allowed to run in gaps between larger jobs if they can complete before a reserved start time for another job. As a result, accurate time estimates can improve throughput.

Job priority in SLURM is computed based on several components, which can include fair share usage, job age, partition or QOS, and sometimes job size. The exact formula is site dependent, but the qualitative idea is that heavy users see their priority reduced temporarily so others can run, and waiting jobs see their priority increase over time.

Although you do not control the scheduling policy, you can influence how your jobs interact with it by choosing sensible partitions, realistic time limits, and appropriate job sizes. The scheduler’s goal is to balance queue wait times, utilization, and fairness, and SLURM’s priority calculation is the mechanism that drives this behavior.

If you consistently overestimate your job’s time limit by large margins, you may experience longer queue times and reduce backfilling opportunities for yourself and others. Aim for a realistic upper bound rather than simply requesting the partition’s maximum.

SLURM and Parallel Programming Models

SLURM itself is not a parallel programming library. Instead, it works together with libraries such as MPI or threading models such as OpenMP. Your SLURM resource request must match the intended parallel structure of your code.

For MPI programs, --ntasks typically corresponds to the number of MPI ranks you want to run. SLURM can then launch the job so that each rank runs as a separate task, often mapped across nodes. For pure thread based codes using OpenMP, you normally request fewer tasks, often one task per node, and set --cpus-per-task to reflect the desired number of threads. Hybrid codes that combine MPI with threading will mix these concepts and must ensure that total cores requested match cores actually used.
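A sketch of a hybrid request, assuming a bash batch script and a placeholder executable called hybrid_program, might look like this:

    #!/bin/bash
    # hybrid MPI + OpenMP: 4 MPI ranks with 8 threads each = 32 cores
    #SBATCH --ntasks=4
    #SBATCH --cpus-per-task=8

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK       # match thread count to the request
    srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./hybrid_program
    # passing the CPU count to srun explicitly is a conservative choice,
    # since inheritance of --cpus-per-task varies between SLURM versions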

SLURM exports several environment variables into your job environment, such as SLURM_JOB_ID, SLURM_NTASKS, SLURM_CPUS_PER_TASK, and SLURM_JOB_NODELIST. MPI and other runtime systems can use these variables and SLURM’s process management infrastructure to optimize placement and startup. You can also use these variables manually in scripts to adapt behavior to the resources allocated.
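For example, a batch script can record its allocation in the job log with a sketch like the following; note that SLURM_CPUS_PER_TASK is only set if --cpus-per-task was requested:

    echo "Job ${SLURM_JOB_ID} allocated nodes: ${SLURM_JOB_NODELIST}"
    echo "Running ${SLURM_NTASKS} tasks with ${SLURM_CPUS_PER_TASK} CPUs per task"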

The key idea is that SLURM describes and enforces what resources your job gets, and your parallel program must be written to make use of those resources intentionally. Mismatches between resource requests and program configuration are a frequent source of inefficiency or failure.

SLURM Job Lifecycle

From the moment you submit a job until it completes, SLURM tracks its lifecycle. Understanding the typical sequence makes it easier to interpret states and logs.

Once you submit via sbatch, the job enters the PENDING state. While it is pending, SLURM evaluates when the job can start based on available resources, priority, and policies; a job may remain pending for seconds or hours, depending on cluster load. During this time SLURM shows a reason code, such as Resources, Priority, or Dependency, which explains why the job is waiting.

When resources become available and the scheduler chooses your job, SLURM allocates nodes for it. The job transitions to RUNNING, and SLURM launches the shell or script associated with the job on the first node in the allocation. That script then typically calls srun to start parallel tasks across nodes or simply runs serial commands.

As your program runs, SLURM monitors its processes, time usage, and resource consumption as configured. If the script finishes normally, all tasks exit and SLURM marks the job as COMPLETED. If your program or script crashes or exits with an error status, SLURM marks the job as FAILED. If you cancel it explicitly, it becomes CANCELLED. If the time limit is reached, TIMEOUT is recorded.

SLURM logs each job’s lifecycle information, including start and end times, used resources, and exit codes. System accounting tools and commands can query this history for reporting and performance analysis. For you as a user, the important point is that each job has a well defined path with visible states and that SLURM’s interpretation of success or failure is based on the exit status of the script and its tasks.

SLURM considers a job successful if the batch script exits with code 0, regardless of what individual internal steps did. If a critical srun command fails but the script does not propagate that error, SLURM may report the job as COMPLETED even though your computation did not run as expected.
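One minimal way to avoid that situation, assuming a bash batch script with placeholder program names, is to make the script stop and return a non-zero exit code as soon as a critical step fails:

    #!/bin/bash
    #SBATCH --time=01:00:00
    set -e                    # exit the script immediately if any command fails

    srun ./critical_step      # if this fails, the job ends here and is marked FAILED
    srun ./postprocessing     # only runs if the previous step succeeded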

Site Specific Variations and Documentation

Although SLURM is a standard, production quality system, every HPC center configures it differently. Option defaults, available partitions, naming of accounts and QOS, and limits on jobs all depend on local policies. For example, one cluster might require the use of an explicit --account flag, while another infers your account from your login identity. Some clusters provide separate partitions for GPU nodes, others mix GPU and CPU resources in the same partition but require a constraint flag to select GPU nodes.

The SLURM manual pages and official documentation describe the available flags and general behavior. In addition, each HPC site usually provides a user guide that explains how SLURM is configured there, including which partitions exist, which maximum time limits are permitted, and what constraints or features you should use.

You should view the generic SLURM knowledge presented here as a foundation, and the site specific guide as the source of authoritative details for your system. Combining both will allow you to craft job submissions that work well with the scheduler and make best use of the available resources.

Summary of Key SLURM Ideas

SLURM is the central tool that turns conceptual job scheduling and resource management into practical, day to day workflows in an HPC environment. It organizes nodes into partitions, accepts job submissions that specify resource needs, applies scheduling policies to decide when jobs run, and then launches and monitors them on the allocated resources.

Your interaction with SLURM revolves around a consistent set of concepts. You specify requests in terms of nodes, tasks, CPUs, memory, and time, select partitions and accounts, and choose interactive or batch modes that suit your workflow. You learn to read job states and reason codes to understand when and why your jobs run. You map SLURM’s resource model to your programming model, whether MPI, OpenMP, or hybrid, and you pay attention to time limits and system policies to participate fairly and efficiently in shared cluster use.

Later chapters will build on this introduction by showing exactly how to write SLURM batch scripts, how to submit and monitor jobs in practice, and how to debug and optimize workflows on SLURM managed systems.
