Historical motivation for batch systems
In early multiuser computers, interactive use by many people at once was slow and chaotic. Jobs competed for CPU and memory in an uncoordinated way, which led to poor throughput, unpredictable response times, and frequent system overloads. The solution was to collect user jobs into queues, decide an order, and run them one by one or in a controlled mix. This is the basic idea of batch processing.
Modern HPC clusters still use this idea, but with much more sophistication. Instead of a single CPU, a scheduler must juggle thousands of cores, multiple types of nodes, accelerators, and large memory nodes. Instead of manually deciding what runs next, a batch system automates these decisions to keep the cluster busy and to enforce policies of fair use.
A batch system, or batch scheduler, is therefore the central service that receives jobs, records their resource requirements, decides when and where they run, and tracks their lifecycle. It is the gatekeeper between users and the compute nodes.
A batch system is a software service that queues user jobs, allocates resources, and controls when and where jobs run on an HPC cluster, according to policies and priorities.
Basic architecture of batch systems
All common HPC batch systems share a similar logical structure, even if their internal implementations differ. At a high level, there are three main roles.
First, there are user interfaces. Users interact with the batch system through commands or APIs such as qsub, bsub, or sbatch, depending on the system. These commands submit jobs, query status, and request modifications. From the user’s point of view, this is the visible part of the batch system.
Second, there is the scheduler and controller. This central component receives job descriptions, stores them in a database or queue, and periodically decides which jobs should start or stop. It consults cluster policies, priorities, and the current state of all nodes. In some systems, the scheduler and the resource manager are distinct daemons, but conceptually they act together as the central decision maker.
Third, there is a set of worker agents running on the compute nodes. These are daemons that receive start and stop instructions from the controller and that actually launch the user’s job processes on the node. They also report back state information such as whether a job is running or has finished, and sometimes resource usage statistics.
In a typical interaction, the user submits a job from a login node, the controller records it in a queue, the job waits until resources match its request, and then the controller instructs one or more node daemons to start it. On completion, the results are written to files, and the batch system updates the job status.
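As a minimal sketch, assuming a SLURM-based cluster and a hypothetical job script named job.sh, this interaction looks roughly as follows from a login node (PBS and LSF offer equivalent commands):

    sbatch job.sh        # submit the job script; the command prints the assigned job ID
    squeue -u $USER      # check whether the job is still pending or already running
    scancel <jobid>      # remove the job from the queue if it is no longer needed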
General job lifecycle in a batch system
Although different systems use slightly different names, the typical job lifecycle follows a common pattern.
When you first submit a job, it is accepted by the batch system and placed into a queue. At this point, the job is in a pending or waiting state. The batch system validates the job description to ensure the resource request is syntactically correct and within allowed limits, for example not asking for more nodes than exist.
Next, the scheduler examines pending jobs and the available resources. If no resources match the job’s request, or if higher priority jobs should start first, the job remains pending. During this time, its priority can change and, in some systems, it can be routed between internal queues or partitions, but from the user’s perspective it is simply waiting.
When sufficient resources become available and policy conditions are satisfied, the scheduler dispatches the job to specific nodes. The job enters a starting state, as the node daemons prepare the environment, allocate resources, and launch the user’s program. Once the user’s code is executing, the job is in a running state.
The job continues running until one of several events occurs. It may complete normally when the program finishes. It may be cancelled by the user or an administrator. It may reach its time limit and be terminated. It may fail due to an error, such as attempting to access unavailable files, or due to node or network failures. In all of these cases, the batch system records a final state such as completed, cancelled, failed, or timed out.
Output and error streams are typically redirected to files specified by the user in the job description. Accounting records are updated, which can be used later for reporting and charging of usage.
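Assuming a SLURM installation, a job's final state and accounting record can be inspected roughly like this (field names and availability depend on the site's accounting configuration):

    sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS   # final state and basic usage
    cat slurm-<jobid>.out   # sbatch's default output file, unless redirected in the job script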
Batch queues, partitions, and policies
From the user viewpoint, a batch system exposes one or more queues or partitions. Each queue is a logical grouping of jobs, usually associated with a subset of hardware and with particular policies. For instance, a cluster might have a short queue for jobs with small time limits, a long queue for multi-day runs, and a GPU queue that gives access to GPU-equipped nodes.
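On a SLURM cluster, for example, the available partitions and their limits can be listed, and a particular partition requested at submission; the partition names below are hypothetical:

    sinfo                                        # list partitions, their time limits, and node states
    sbatch --partition=short job.sh              # submit to the short partition
    sbatch --partition=gpu --gres=gpu:1 job.sh   # request one GPU in the GPU partition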
Behind these names lie scheduling policies. A queue may limit the maximum wall time per job, the number of nodes per job, or the total number of concurrent jobs per user or group. Some partitions are reserved for specific projects, while others are open to all users.
Policies also govern how jobs in different queues are prioritized. A common concept is quality of service, or QoS. A high priority QoS might allow urgent jobs to start before others, perhaps at the cost of stricter limits or preemption. Conversely, low priority or scavenger queues may use idle resources but can be interrupted when higher priority work arrives.
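In SLURM, for instance, a QoS is requested at submission time, and the configured QoS levels can be listed through the accounting tools; the QoS name used here is a hypothetical example:

    sacctmgr show qos format=Name,Priority,MaxWall   # list configured QoS levels and their limits
    sbatch --qos=high job.sh                         # submit under the "high" QoS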
Batch systems implement fairness through rules such as fair share. In a fair share model, users or groups that have used many resources recently receive lower priority for new jobs, while those who have used less gain higher priority. This encourages balanced cluster usage and avoids situations where one user continuously monopolizes the system.
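In SLURM, your current fair-share standing can be inspected with the sshare command (the exact columns depend on how the site configures accounting):

    sshare -u $USER      # show recorded usage and the fair-share factor for your account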
Limits can be enforced at many levels. For example, there may be per-job limits on CPUs or memory, per-user limits on the number of running jobs, or global limits to reserve some capacity for specific projects or interactive work.
Resource requests and job descriptions
Instead of starting a program directly on a node, you describe the resources your program needs, and the batch system allocates matching resources. This description is usually given in a job script or as command line options to the submission command.
The core part of a job description is the resource request. Typical elements include the number of nodes, the number of CPUs per node, the amount of memory required per node or per CPU, and the maximum wall clock time. For accelerator jobs, you may also request GPUs per node. Some systems let you specify job arrays, which represent many similar tasks sharing the same resource pattern.
For example, a user might request 4 nodes with 32 tasks per node, 4 GB of memory per task, and a time limit of 2 hours. The batch system will only start this job when it can find 4 nodes that are free at the same time, and that meet any other constraints such as node features or partitions.
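A minimal SLURM job script expressing exactly this request might look like the following; the partition name and program are hypothetical placeholders:

    #!/bin/bash
    #SBATCH --nodes=4                 # 4 nodes
    #SBATCH --ntasks-per-node=32      # 32 tasks on each node
    #SBATCH --mem-per-cpu=4G          # 4 GB per task (one CPU per task by default)
    #SBATCH --time=02:00:00           # 2 hour wall clock limit
    #SBATCH --partition=standard      # hypothetical partition name

    srun ./my_program                 # launch 128 tasks across the allocation

The #SBATCH lines are directives read by the batch system before the script body runs; the same information could also be passed as command line options to sbatch.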
Besides resources, the job description also includes metadata such as job name, output and error file locations, the working directory, and email notification preferences. Some systems allow you to express node constraints, for example requesting only nodes with a specific CPU type, or excluding nodes with known issues.
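Continuing the SLURM sketch, such metadata and constraints are expressed with additional directives; all values below are illustrative placeholders:

    #SBATCH --job-name=diffusion_run
    #SBATCH --output=run_%j.out       # %j expands to the job ID
    #SBATCH --error=run_%j.err
    #SBATCH --mail-type=END,FAIL      # send email when the job ends or fails
    #SBATCH --constraint=skylake      # only nodes carrying this feature tag
    #SBATCH --array=1-10              # a job array of 10 similar tasks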
It is important to remember that the batch system relies on the user’s declared resource needs. If you underestimate time or memory, your job may be killed when it exceeds the limits. If you grossly overestimate, your job may wait longer than necessary because fewer sets of resources appear to match. This tension between safety and efficiency is an important practical aspect of working with batch systems.
Common batch systems in HPC
Several major batch and scheduling systems are in widespread use on HPC clusters. While each has its own commands and configuration details, at the conceptual level they all perform the same core functions described above.
Historically, PBS and its derivatives such as Torque were common. These provide commands like qsub for submission and qstat for monitoring. Another well-known commercial system is IBM Spectrum LSF, which uses commands such as bsub and bjobs. Many national supercomputing centers have used these or similar systems.
More recently, open source systems such as SLURM have become very popular, especially on large Linux clusters. In SLURM, submission is done with sbatch for batch jobs, while srun can be used for launching parallel tasks. Monitoring is usually done with squeue and sacct. SLURM combines resource management and scheduling functions in a modular, scalable design.
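For orientation, the roughly equivalent submit and monitor commands in these three families look like this; exact options, script formats, and defaults differ between sites:

    # PBS / Torque
    qsub job.pbs
    qstat -u $USER
    # IBM Spectrum LSF
    bsub < job.lsf
    bjobs
    # SLURM
    sbatch job.sh
    squeue -u $USER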
Other batch systems include Grid Engine variants and proprietary schedulers integrated into vendor solutions. Regardless of the specific software, you will typically see the same notions of queues or partitions, resource requests, job scripts, and job states.
As a learner, you should focus on understanding the general ideas of batch operation rather than any one system's syntax. Once you grasp the common concepts, you can adapt quickly to the particular batch system used on any cluster.
Scheduling strategies and priorities
Internally, a batch system periodically runs a scheduling cycle. During this cycle, it examines the set of pending jobs and the current state of cluster resources, and decides which jobs to start. Different sites configure different scheduling strategies, but some patterns appear frequently.
A basic approach is priority-based scheduling. Each pending job has a numeric priority that can depend on factors such as job age, user or group fair share, queue or QoS, job size, and sometimes project or account information. In each cycle, the scheduler attempts to start the highest priority jobs that fit into the currently available resources.
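In SLURM, for example, the contribution of each factor to a pending job's priority can be examined with sprio, provided the site uses the multifactor priority plugin:

    sprio -j <jobid>     # per-factor priority breakdown for one pending job
    sprio -l             # long listing for all pending jobs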
Some schedulers implement backfilling. In backfilling, the scheduler reserves resources in the future for large or high priority jobs, but then fills any gaps with shorter or lower priority jobs that can complete before the reservation starts. This improves overall utilization and reduces idle time, but requires accurate or conservative time limits.
There are also configurations that favor large parallel jobs, sometimes called large job bias. For example, the scheduler may reserve a subset of the cluster for jobs above a certain size, or give them extra priority. This helps ensure that rare but important large runs can be scheduled.
Preemption is another tool available to schedulers. In a preemptive setup, low priority jobs can be suspended or cancelled to make room for higher priority work. This is commonly used for urgent queues or for work with interactive, debugging, or real-time requirements. Whether preemption is allowed depends on the policies of the site.
All these strategies interact with user behavior. If users consistently request much longer time limits than needed, backfilling becomes less effective. If fair share policies are in place, heavy recent usage reduces priority, which helps balance access but can surprise users who suddenly see longer wait times.
Batch vs interactive use
Batch systems are primarily designed for noninteractive jobs. Once submitted, a batch job does not normally require user input, and its output is written to files. This is ideal for long running computations, parameter sweeps, and production simulations that can take hours or days.
However, batch systems often support special modes for more interactive or exploratory work. These modes still allocate resources through the scheduler, but allow users to connect to the allocated nodes and run commands directly, for example for debugging or profiling. From the scheduler’s perspective, these are just jobs with a particular allocation pattern and perhaps a shorter time limit.
The key difference is that in batch mode, you script everything in advance and submit it, while in interactive mode, you obtain resources through the scheduler but then type commands manually within that allocation. Both rely on the batch system to enforce limits and policies, and both are subject to queuing delays if the cluster is busy.
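On a SLURM cluster, an interactive allocation is typically obtained in one of two ways; the resource values here are illustrative:

    salloc --nodes=1 --time=00:30:00              # request an allocation, then run commands inside it
    srun --nodes=1 --time=00:30:00 --pty bash     # or start an interactive shell directly on a compute node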
For production workloads on shared clusters, batch mode is usually preferred because it is more predictable, easier to automate, and easier to account and audit.
Role of batch systems in resource management
Although this chapter focuses on batch concepts, it is important to see how batch systems connect to broader resource management. The batch scheduler is the central policy engine that implements institutional decisions about who can use what, when, and how much.
Through its configuration, administrators can set maximum and default job limits, enforce quotas, reserve capacity for specific projects, and gather detailed usage statistics. The scheduler also integrates with other parts of the cluster environment, such as environment modules and accounting systems, to provide controlled, auditable use of resources.
From a user perspective, the batch system is both an enabler and a constraint. It gives access to powerful hardware without needing to manage it manually. At the same time, it requires you to express your needs clearly and to work within site policies. Learning to write effective job descriptions and to interpret queue states is therefore a central skill in practical HPC work.