What is a Batch System?
In HPC, a batch system (or batch scheduler / resource manager) is the software layer that:
- Accepts job requests from users
- Queues them according to policies
- Starts jobs on available compute nodes
- Tracks resource usage and job states
- Records accounting information for reporting and billing
Unlike on an interactive desktop, where you launch programs directly, on a cluster you usually submit a job description to the batch system and let it decide when and where to run the job.
Typical batch systems in HPC include:
- SLURM (covered in detail in the next chapter)
- PBS Pro / Torque / OpenPBS
- LSF (Platform LSF, IBM Spectrum LSF)
- SGE / Son of Grid Engine / Univa Grid Engine
- HTCondor (often in HTC / data-intensive environments)
Each has its own syntax and tools, but they all share common concepts and workflow patterns.
Core Concepts in Batch Systems
Jobs
A job is a unit of work you ask the system to run. Conceptually, a job consists of:
- The program(s) to run
- Input data and parameters
- Resource requirements (CPUs, memory, time, etc.)
- Environment setup (modules, variables)
- Output handling (log files, error files)
Jobs are usually described in a job script: a text file with shell commands plus batch-system-specific directives.
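As a first taste (SLURM syntax is shown here; the next chapter covers it in detail), a minimal job script might look like the sketch below. The program `myprog`, its input file, and the module name are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=demo         # name shown in queue listings
#SBATCH --ntasks=1              # one task (process)
#SBATCH --time=00:10:00         # wall-clock limit: 10 minutes
#SBATCH --output=demo_%j.log    # log file (%j = job ID)

module load gcc                 # environment setup (module names are site-specific)
./myprog input.dat              # the program to run (placeholder)
```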
Typical job life cycle states:
- PENDING/QUEUED: waiting for resources or policy
- RUNNING: currently executing on compute nodes
- COMPLETED/DONE: finished (success or failure)
- FAILED/CANCELLED: ended prematurely
Queues / Partitions
Batch systems group jobs and resources into logical sets, often called:
- Queues (PBS, LSF, SGE)
- Partitions (SLURM)
- Pools or Classes (other systems)
Each queue/partition can have different:
- Limits (max run time, max nodes)
- Priorities
- Target hardware (e.g., GPU nodes vs CPU-only)
- Access restrictions (e.g., specific project or group)
Choosing the appropriate queue/partition is essential for getting good turnaround time and following site policies.
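In practice, the queue or partition is usually named directly in the job script. A SLURM-style sketch; partition names such as `debug` are site-specific examples, not universal names:

```bash
# Pick the partition that matches the job; "debug" partitions typically
# have tight time limits but short queue waits, which suits test runs.
#SBATCH --partition=debug
```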
Resources
The batch system manages and allocates resources such as:
- CPU cores
- GPUs / accelerators
- Memory (per node or per core)
- Wall-clock time
- Nodes (entire machines)
- Specialized hardware (e.g., high-memory nodes, large-storage nodes)
Users request resources quantitatively, for example:
- Number of tasks: “16 MPI tasks”
- Cores per task: “4 cores per task”
- GPUs: “2 GPUs per node”
- Memory: “8 GB per task”
- Time limit: “2 hours”
The scheduler uses these requests to find a suitable allocation on the cluster.
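For example, the requests listed above could be expressed roughly as the following SLURM directives (exact option names differ between batch systems, and GPU syntax varies even between SLURM versions):

```bash
#SBATCH --ntasks=16            # 16 MPI tasks
#SBATCH --cpus-per-task=4      # 4 cores per task
#SBATCH --gpus-per-node=2      # 2 GPUs per node (older sites may use --gres=gpu:2)
#SBATCH --mem-per-cpu=2G       # 2 GB per core x 4 cores = 8 GB per task
#SBATCH --time=02:00:00        # 2-hour wall-clock limit
```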
Schedulers vs Resource Managers
Many HPC batch systems have two interacting components:
- Resource manager: Tracks the state and availability of hardware resources (nodes, cores, memory).
- Scheduler: Decides which jobs run where and when, based on policies and priorities.
In some systems this is a single integrated component; in others, these are separate daemons that communicate with each other.
Common Scheduling Policies
Batch systems implement policies defined by the cluster administrators. Typical concepts include:
Priority
Every job has an internal priority value that determines its position in the queue. Priority is often influenced by:
- User, group, or project
- Job size (number of nodes/cores)
- Job age (how long it has been waiting)
- Fair-share usage (how many resources you’ve used recently)
- QoS or service class (e.g., “debug”, “normal”, “long”)
Higher-priority jobs are usually started earlier, subject to resource availability.
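On SLURM systems, for instance, you can inspect how these factors combine for a pending job (`12345` is a placeholder job ID):

```bash
# Show the priority components (age, fair-share, job size, QoS, ...) of a job
sprio -j 12345
```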
Fair-Share
Fair-share aims to balance the usage across users and projects over time.
- If you have used a lot of resources recently, your new jobs may get lower priority.
- If you have used little, your jobs may get boosted.
This encourages equitable usage and discourages a few users from monopolizing the system.
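SLURM, as one example, exposes this bookkeeping through `sshare`, which reports your recent usage and the resulting fair-share factor:

```bash
# Show fair-share usage and factor for your own account(s)
sshare -u $USER
```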
Limits and Quotas
To protect the system and enforce fairness, batch systems often impose:
- Maximum runtime per job (wall time limit)
- Maximum number of running jobs per user
- Maximum number of cores/GPUs per user or project
- Maximum memory per job or per node
If requested resources exceed site limits, job submission may fail or be rejected by policy.
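Limits are usually documented by the site, and many systems also let you query them directly; on SLURM, for instance:

```bash
# List each partition's configured limits (max wall time, node counts, ...)
scontrol show partition
```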
Backfilling
Backfilling is a scheduling strategy that improves utilization:
- The scheduler reserves resources for large, high-priority jobs that cannot start immediately.
- In the meantime, it runs smaller or shorter jobs in the “gaps,” provided they won’t delay the reserved job’s start.
From a user perspective, this means:
- Short jobs with accurate time limits may start sooner, even if they’re behind other jobs in the queue.
- Overestimating time limits can reduce backfilling opportunities and increase waiting time.
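Concretely, tightening the time limit is often all it takes to become a backfill candidate. A SLURM-style sketch; the run times are illustrative:

```bash
# A measured run takes ~35 minutes; 45 minutes leaves a safety margin
# without blocking backfill the way a blanket 24-hour request would.
#SBATCH --time=00:45:00
```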
Types of Jobs in Batch Systems
Batch (Non-Interactive) Jobs
The default and most common job type:
- Defined by a job script
- Runs without user interaction
- Sends output to log files
- Suitable for long simulations, parameter sweeps, production runs
The batch system launches the job, monitors it, and records the exit status when it finishes.
Interactive Jobs
Interactive jobs provide a shell or interactive session on compute nodes:
- Useful for debugging, profiling, or exploratory work
- Still managed by the batch system (you request resources and time)
- Typically subject to stricter limits (short runtimes, smaller size)
While they feel like normal logins, they are scheduled like any other job.
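For example, on SLURM an interactive shell on a compute node can be requested roughly like this (PBS offers a similar mode via `qsub -I`):

```bash
# Request one task for one hour and attach a terminal to it
srun --ntasks=1 --time=01:00:00 --pty bash
```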
Array Jobs
Array jobs (or job arrays) let you submit many similar jobs at once:
- One job script, many array elements with different indices
- Each element might use a different input file or parameter set
- Efficient for parameter sweeps, ensemble simulations, or processing many files
Advantages:
- Reduced scheduler overhead compared to thousands of single-job submissions
- Easier management: submit, monitor, and cancel a whole group with one command
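A sketch of a SLURM-style array job; the 1-100 range, `process_case`, and the input-file naming scheme are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-100               # 100 elements, indices 1..100
#SBATCH --time=00:30:00             # limit applies to each element
#SBATCH --output=sweep_%A_%a.log    # %A = array job ID, %a = element index

# Each element picks its own input file via the array index.
./process_case input_${SLURM_ARRAY_TASK_ID}.dat
```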
Typical Batch Workflow (Abstract)
Although command names differ between batch systems, the workflow pattern is similar:
1. Prepare a job script
   - Define resource requests using scheduler directives (prefixed by something like `#SBATCH`, `#PBS`, or `#BSUB`, depending on the system)
   - Set up the environment (modules, variables)
   - Run the application (e.g., `mpirun`, `srun`, or a direct executable)
2. Submit the job
   - Use a submission command (e.g., `sbatch`, `qsub`, `bsub`)
   - The system returns a job ID
3. Wait in the queue
   - The job remains `PENDING`/`QUEUED` until resources and policy allow it to start
4. Job runs
   - The system allocates the requested resources
   - It executes the job script on the compute nodes
5. Monitor progress
   - Use status commands (e.g., `squeue`, `qstat`, `bjobs`) to see state and resource usage
   - Check output/error log files
6. Completion and accounting
   - The job finishes or is canceled
   - Logs remain in your directory
   - The system may record accounting info (CPU-hours, memory, exit code) for reports or billing
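Put together, the whole cycle takes only a handful of commands. SLURM is shown here; PBS and LSF use `qsub`/`qstat`/`qdel` and `bsub`/`bjobs`/`bkill` respectively, and `12345` is a placeholder job ID:

```bash
sbatch job.sh       # submit; prints something like "Submitted batch job 12345"
squeue -u $USER     # monitor state (PD = pending, R = running)
scancel 12345       # cancel, if needed
sacct -j 12345      # accounting record after completion (CPU time, exit code)
```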
The following chapters on SLURM and job scripts will show concrete syntax and examples.
How Batch Systems Improve Cluster Usage
Batch systems are central to efficient and fair use of HPC clusters:
- High utilization: They continuously look for jobs to fill idle nodes.
- Fairness: They apply policies so many users can share the system effectively.
- Protection: They prevent users from overloading login nodes or bypassing limits.
- Reproducibility: Jobs run in controlled environments with recorded parameters.
- Scalability: They handle thousands of jobs and users across many nodes.
Understanding the general behavior of batch systems helps you:
- Form realistic expectations about queue wait times
- Request resources wisely
- Choose appropriate queues/partitions
- Interpret job states and scheduling decisions
Differences Between Common Batch Systems (Conceptual)
While the commands and directives vary, core features are broadly similar:
- SLURM
  - Open-source, widely used in modern HPC
  - Uses partitions instead of queues
  - Integrates resource management and scheduling
- PBS / Torque / OpenPBS
  - Older, still common
  - Uses `qsub`, `qstat`, etc.
  - Often combined with external schedulers in large deployments
- LSF
  - Commercial, used in many enterprise and some research centers
  - Uses `bsub`, `bjobs`, etc.
- SGE / Grid Engine variants
  - Historically popular; some sites still use derivatives
  - Uses `qsub` and `qstat`, with somewhat different semantics than PBS
- HTCondor
  - Optimized for high-throughput and opportunistic computing
  - Often used in data-intensive and grid environments
When moving between clusters, you will often:
- Recognize the same concepts (jobs, queues, resources, priorities)
- Adapt to different command names and option syntax
- Adjust to site-specific policies and queue configurations
User Responsibilities in Batch Environments
From the user’s perspective, effective use of a batch system involves:
- Accurate resource requests
  - Request enough to run reliably
  - Avoid large overestimates that increase wait times and reduce utilization
- Appropriate queue choice
  - Use “short” or “debug” queues for tests
  - Use production queues for large or long runs
  - Follow your site's documentation
- Respecting policies
  - Observe limits on maximum jobs, runtime, and resources
  - Avoid submission patterns that stress the scheduler (e.g., millions of tiny jobs instead of job arrays)
- Monitoring and debugging
  - Check logs for errors and performance issues
  - Cancel obviously misconfigured or stuck jobs promptly
These practices help both your own productivity and the overall health of the shared system.
How This Connects to Upcoming Chapters
Subsequent chapters in this section build on this overview:
- Introduction to SLURM: A concrete look at one popular batch system: commands, partitions, and basic usage.
- Writing job scripts: How to write portable, readable scripts that describe batch jobs.
- Submitting, monitoring, and cancelling jobs: Practical command-line workflows.
- Resource management details: How to get better throughput and performance by using the scheduler intelligently.
The concepts described here—jobs, queues/partitions, resources, priorities, and policies—underpin all of those practical steps.