Kahibaro

5.2 Batch systems overview

What is a Batch System?

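In HPC, a batch system (or batch scheduler / resource manager) is the software layer that accepts job requests, decides when and where each job runs, and allocates the cluster's resources to it.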

Unlike interactive desktops where you launch programs directly, on clusters you usually submit a job description to the batch system and let it decide when and where to run it.

Typical batch systems in HPC include SLURM, PBS, and LSF.

Each has its own syntax and tools, but they all share common concepts and workflow patterns.

Core Concepts in Batch Systems

Jobs

A job is a unit of work you ask the system to run. Conceptually, a job consists of the commands to execute and the resources requested to execute them.

Jobs are usually described in a job script: a text file with shell commands plus batch-system-specific directives.
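To make the idea of "shell commands plus directives" concrete, here is a minimal sketch that assembles the text of a job script. It assumes SLURM-style `#SBATCH` directives as one example; the job name, the resource values, and the application `./my_app` are hypothetical placeholders, and real sites may require additional options.

```python
def render_job_script(job_name, nodes, ntasks, time_limit, commands):
    """Return the text of a SLURM-style job script (illustrative values only)."""
    lines = [
        "#!/bin/bash",
        # Scheduler directives: comments to the shell, instructions to the batch system.
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        "",
    ]
    # Ordinary shell commands follow the directives.
    lines.extend(commands)
    return "\n".join(lines) + "\n"


# Hypothetical example: 1 node, 4 tasks, 10 minutes, one application launch.
script = render_job_script("demo", 1, 4, "00:10:00", ["srun ./my_app"])
print(script)
```

The directives look like comments to the shell, which is why the same file can be both a valid shell script and a job description.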

Typical job life cycle states are pending (queued), running, and completed or canceled.

Queues / Partitions

Batch systems group jobs and resources into logical sets, often called queues or partitions.

Each queue/partition can have different resource limits, priorities, and access policies.

Choosing the appropriate queue/partition is essential for getting good turnaround time and following site policies.
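As a rough illustration of choosing a queue, the sketch below filters partitions by their limits. The partition names and the limit values are invented for this example; real names and limits are set by each site.

```python
# Hypothetical partitions and limits; real sites define their own.
PARTITIONS = {
    "debug": {"max_minutes": 30, "max_nodes": 2},
    "standard": {"max_minutes": 24 * 60, "max_nodes": 64},
    "long": {"max_minutes": 7 * 24 * 60, "max_nodes": 16},
}


def candidate_partitions(minutes, nodes, partitions=PARTITIONS):
    """Return the names of partitions whose limits admit the request."""
    return [
        name
        for name, limits in partitions.items()
        if minutes <= limits["max_minutes"] and nodes <= limits["max_nodes"]
    ]


print(candidate_partitions(20, 1))           # short, small job
print(candidate_partitions(120, 32))         # wide job
print(candidate_partitions(3 * 24 * 60, 8))  # multi-day job
```

When several partitions qualify, the one with the tightest limits that still fits is often a reasonable first choice, but that depends on site policy.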

Resources

The batch system manages and allocates resources such as compute nodes, CPU cores, memory, and run time.

Users request resources quantitatively, for example a number of nodes, an amount of memory, and a maximum run time.

The scheduler uses these requests to find a suitable allocation on the cluster.
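The matching step can be pictured with a small first-fit search. This is a simplified sketch, not how any particular scheduler works: it considers only free cores per node, and the node names and core counts are made up.

```python
def find_allocation(free_cores, nodes_needed, cores_per_node):
    """First-fit search: return the first nodes that each have enough free cores.

    free_cores maps node name -> number of free cores.
    Returns a list of node names, or None if the request cannot be met now.
    """
    chosen = []
    for name in sorted(free_cores):
        if free_cores[name] >= cores_per_node:
            chosen.append(name)
            if len(chosen) == nodes_needed:
                return chosen
    return None


# Hypothetical cluster state.
cluster = {"n1": 8, "n2": 2, "n3": 16, "n4": 8}
print(find_allocation(cluster, 2, 8))  # enough nodes available
print(find_allocation(cluster, 4, 8))  # not enough nodes: job keeps waiting
```

A result of `None` corresponds to a job that stays in the queue until resources free up.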

Schedulers vs Resource Managers

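Many HPC batch systems have two interacting components: a scheduler, which decides which jobs run and when, and a resource manager, which tracks the nodes, allocates resources, and launches jobs.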

In some systems this is a single integrated component; in others, these are separate daemons that communicate with each other.

Common Scheduling Policies

Batch systems implement policies defined by the cluster administrators. Typical concepts include:

Priority

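Every job has an internal priority value that determines its position in the queue. Priority is often influenced by factors such as how long the job has waited, how many resources it requests, and the fair-share standing of its user or project.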

Higher-priority jobs are usually started earlier, subject to resource availability.
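One common way to think about priority is as a weighted sum of factors. The sketch below assumes just two factors (waiting time and a fair-share factor between 0 and 1) and arbitrary weights; real schedulers use site-configured factors and weights.

```python
def priority(wait_minutes, fairshare_factor, w_age=1.0, w_fairshare=1000.0):
    """Weighted-sum priority; factors and weights are assumptions for illustration."""
    return w_age * wait_minutes + w_fairshare * fairshare_factor


# Hypothetical pending jobs.
jobs = [
    {"id": 101, "wait_minutes": 300, "fairshare_factor": 0.1},
    {"id": 102, "wait_minutes": 30, "fairshare_factor": 0.9},
    {"id": 103, "wait_minutes": 600, "fairshare_factor": 0.5},
]

# Highest priority first: this is the order in which the scheduler considers jobs.
ordered = sorted(
    jobs,
    key=lambda j: priority(j["wait_minutes"], j["fairshare_factor"]),
    reverse=True,
)
order_ids = [j["id"] for j in ordered]
print(order_ids)
```

Note that job 102 overtakes job 101 despite waiting far less, because its owner has a much better fair-share factor in this example.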

Fair-Share

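Fair-share aims to balance usage across users and projects over time: those who have recently used many resources get lower priority, while those who have used few get higher priority.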

This encourages equitable usage and discourages a few users from monopolizing the system.
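A fair-share factor can be sketched as a function that shrinks as a user's recent usage grows relative to their target share. The exponential form below is only one possible choice, assumed here for illustration; the actual formula is specific to each scheduler and site.

```python
def fairshare_factor(used_share, target_share):
    """Return a value in (0, 1]: 1.0 for no recent usage, 0.5 when usage equals the target.

    used_share: fraction of recent cluster usage consumed by this user or project.
    target_share: fraction the user or project is entitled to (must be positive).
    """
    if target_share <= 0:
        raise ValueError("target_share must be positive")
    return 2.0 ** (-used_share / target_share)


print(fairshare_factor(0.0, 0.25))   # no recent usage
print(fairshare_factor(0.25, 0.25))  # exactly at target share
print(fairshare_factor(0.5, 0.25))   # twice the target share
```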

Limits and Quotas

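To protect the system and enforce fairness, batch systems often impose limits on, for example, the maximum run time of a job, the resources a single job may request, and the number of jobs a user may have in the queue.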

If requested resources exceed site limits, job submission may fail or be rejected by policy.
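The rejection at submission time can be pictured as a simple validation step. The limit names and values below are hypothetical; check your site's documentation for the real ones.

```python
# Hypothetical site limits.
LIMITS = {"max_nodes": 16, "max_minutes": 48 * 60}


def check_request(nodes, minutes, limits=LIMITS):
    """Return a list of limit violations; an empty list means the request is acceptable."""
    problems = []
    if nodes > limits["max_nodes"]:
        problems.append(f"nodes: requested {nodes}, limit {limits['max_nodes']}")
    if minutes > limits["max_minutes"]:
        problems.append(f"time: requested {minutes} min, limit {limits['max_minutes']} min")
    return problems


print(check_request(4, 60))         # within limits
print(check_request(32, 100 * 60))  # too many nodes and too much time
```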

Backfilling

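Backfilling is a scheduling strategy that improves utilization: while a large, high-priority job waits for enough resources to become free, the scheduler starts smaller or shorter jobs in the gaps, provided they do not delay the waiting job's expected start.

From a user perspective, this means that small jobs with short, realistic time limits often start sooner than their position in the queue suggests.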
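The following is a minimal sketch of one backfilling variant, in which only the first blocked job gets a reservation. It assumes a cluster described by a single node count and jobs that run for exactly their requested time; the job names and numbers are invented, and production schedulers are considerably more elaborate.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    walltime: int  # requested run time in minutes


def backfill(free_nodes, running, queue, now=0):
    """Return the names of queued jobs that can start now.

    running: list of (end_time, nodes) for jobs already executing.
    queue: pending jobs, highest priority first.
    """
    started = []
    queue = list(queue)
    # Start jobs in priority order for as long as they fit.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        running = running + [(now + job.walltime, job.nodes)]
        started.append(job.name)
    if not queue:
        return started

    # The head job is blocked: find when enough nodes will be free for it.
    head = queue[0]
    avail = free_nodes
    shadow_time = None
    for end_time, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow_time = end_time
            break
    if shadow_time is None:
        return started  # head job can never fit; this sketch does not backfill then

    extra = avail - head.nodes  # nodes the head job will not need at its start time
    for job in queue[1:]:
        if job.nodes > free_nodes:
            continue
        if now + job.walltime <= shadow_time:
            free_nodes -= job.nodes  # finishes before the head job is due to start
            started.append(job.name)
        elif job.nodes <= extra:
            free_nodes -= job.nodes  # runs longer, but only on nodes the head job leaves spare
            extra -= job.nodes
            started.append(job.name)
    return started


# 10-node cluster: 6 nodes busy until t=100, 4 nodes free now.
queue = [Job("A", 8, 60), Job("B", 2, 50), Job("C", 4, 200), Job("D", 2, 200)]
started = backfill(free_nodes=4, running=[(100, 6)], queue=queue)
print(started)
```

In this example job A must wait until t=100. Job B fits and finishes before then, job D runs longer but uses only nodes A does not need, and job C is skipped because it no longer fits.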

Types of Jobs in Batch Systems

Batch (Non-Interactive) Jobs

The default and most common job type is a job script that is submitted to the queue and runs without user interaction.

The batch system launches the job, monitors it, and records the exit status when it finishes.
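On a very small scale, "launch, monitor, record the exit status" looks like the sketch below. It only runs a local child process with Python's standard `subprocess` module and is meant as an analogy, not as a description of how a batch system is implemented.

```python
import subprocess
import sys


def run_and_record(command):
    """Run a command to completion and return its exit code and captured output."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


# A stand-in "job": a one-line Python program run non-interactively.
record = run_and_record([sys.executable, "-c", "print('hello from the job')"])
print(record["exit_code"], record["stdout"].strip())
```

A non-zero exit code is the usual signal that a job failed, which is why it is worth checking in accounting output or logs.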

Interactive Jobs

Interactive jobs provide a shell or interactive session on compute nodes.

While they feel like normal logins, they are scheduled like any other job.

Array Jobs

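Array jobs (or job arrays) let you submit many similar jobs at once, typically as a single submission whose tasks are distinguished by an index.

The main advantage is convenience: one script and one submission replace many nearly identical ones.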
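Inside an array task, the usual pattern is to read the task index and use it to pick that task's input. The sketch below assumes SLURM, which exposes the index in the `SLURM_ARRAY_TASK_ID` environment variable; other systems use different variable names, and the input file names are hypothetical.

```python
import os


def current_task_id(env=os.environ, default=0):
    """Read the array task index from the environment (SLURM variable name assumed)."""
    return int(env.get("SLURM_ARRAY_TASK_ID", default))


def input_for_task(task_id, inputs):
    """Map a task index to the input that this array task should process."""
    return inputs[task_id]


# Hypothetical input files, one per array task.
inputs = ["sample_a.dat", "sample_b.dat", "sample_c.dat"]

task_id = current_task_id()
print(f"task {task_id} would process {input_for_task(task_id, inputs)}")
```

Run outside a batch system, the variable is unset and the sketch falls back to index 0.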

Typical Batch Workflow (Abstract)

Although command names differ between batch systems, the workflow pattern is similar:

  1. Prepare a job script
    • Define resource requests using scheduler directives (prefixed by something like #SBATCH, #PBS, #BSUB, depending on system)
    • Set up the environment (modules, variables)
    • Run the application (e.g., mpirun, srun, or direct executable)
  2. Submit the job
    • Use a submission command (e.g., sbatch, qsub, bsub)
    • The system returns a job ID
  3. Wait in the queue
    • Job remains PENDING/QUEUED until resources and policy allow it to start
  4. Job runs
    • System allocates the requested resources
    • Executes the job script on the compute nodes
  5. Monitor progress
    • Use status commands (e.g., squeue, qstat, bjobs) to see state and resource usage
    • Check output/error log files
  6. Completion and accounting
    • Job finishes or is canceled
    • Logs remain in your directory
    • System may record accounting info (CPU-hours, memory, exit code) for reports or billing
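The life cycle implied by these steps can be sketched as a small state machine. The state names are a simplified set chosen for illustration; each batch system has its own, usually larger, set of states.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    CANCELED = "CANCELED"


# Which transitions are allowed from each state in this simplified model.
ALLOWED = {
    State.PENDING: {State.RUNNING, State.CANCELED},
    State.RUNNING: {State.COMPLETED, State.CANCELED},
    State.COMPLETED: set(),
    State.CANCELED: set(),
}


def advance(current, new):
    """Return the new state if the transition is allowed, otherwise raise ValueError."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new


state = State.PENDING                    # after submission
state = advance(state, State.RUNNING)    # resources allocated, script starts
state = advance(state, State.COMPLETED)  # script finished
print(state.value)
```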

The following chapters on SLURM and job scripts will show concrete syntax and examples.

How Batch Systems Improve Cluster Usage

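Batch systems are central to efficient and fair use of HPC clusters: they keep resources busy (for example through backfilling) and apply priority and fair-share policies so that users share the machine equitably.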

Understanding the general behavior of batch systems helps you request resources sensibly, choose suitable queues/partitions, and make sense of waiting times.

Differences Between Common Batch Systems (Conceptual)

While the commands and directives vary, core features are broadly similar across systems: jobs, queues/partitions, resource requests, and scheduling policies.

When moving between clusters, you will often need to learn new command names and directive syntax, while the underlying concepts stay the same.

User Responsibilities in Batch Environments

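From the user’s perspective, effective use of a batch system involves requesting only the resources you need, choosing an appropriate queue/partition, and respecting site limits and policies.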

These practices help both your own productivity and the overall health of the shared system.

How This Connects to Upcoming Chapters

Subsequent chapters in this section build on this overview, covering SLURM and job scripts in concrete detail.

The concepts described here—jobs, queues/partitions, resources, priorities, and policies—underpin all of those practical steps.
