What is a Batch System?
In HPC, a batch system (or batch scheduler / resource manager) is the software layer that:
- Accepts job requests from users
- Queues them according to policies
- Starts jobs on available compute nodes
- Tracks resource usage and job states
- Records accounting information for reporting and billing
Unlike on an interactive desktop, where you launch programs directly, on a cluster you usually submit a job description to the batch system and let it decide when and where to run the job.
Typical batch systems in HPC include:
- SLURM (covered in detail in the next chapter)
- PBS Pro / Torque / OpenPBS
- LSF (Platform LSF, IBM Spectrum LSF)
- SGE / Son of Grid Engine / Univa Grid Engine
- HTCondor (often in HTC / data-intensive environments)
Each has its own syntax and tools, but they all share common concepts and workflow patterns.
Core Concepts in Batch Systems
Jobs
A job is a unit of work you ask the system to run. Conceptually, a job consists of:
- The program(s) to run
- Input data and parameters
- Resource requirements (CPUs, memory, time, etc.)
- Environment setup (modules, variables)
- Output handling (log files, error files)
Jobs are usually described in a job script: a text file with shell commands plus batch-system-specific directives.
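As a first taste (SLURM syntax is shown here; the next chapter covers it in detail), a minimal job script might look like the sketch below. The program `myprog`, its input file, and the module name are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=demo         # name shown in queue listings
#SBATCH --ntasks=1              # one task (process)
#SBATCH --time=00:10:00         # wall-clock limit: 10 minutes
#SBATCH --output=demo_%j.log    # log file (%j = job ID)

module load gcc                 # environment setup (module names are site-specific)
./myprog input.dat              # the program to run (placeholder)
```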
Typical job life cycle states:
- PENDING/QUEUED: waiting for resources or policy
- RUNNING: currently executing on compute nodes
- COMPLETED/DONE: finished (success or failure)
- FAILED/CANCELLED: ended prematurely
Queues / Partitions
Batch systems group jobs and resources into logical sets, often called:
- Queues (PBS, LSF, SGE)
- Partitions (SLURM)
- Pools or Classes (other systems)
Each queue/partition can have different:
- Limits (max run time, max nodes)
- Priorities
- Target hardware (e.g., GPU nodes vs CPU-only)
- Access restrictions (e.g., specific project or group)
Choosing the appropriate queue/partition is essential for getting good turnaround time and following site policies.
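In practice, the queue or partition is usually named directly in the job script. A SLURM-style sketch; partition names such as `debug` are site-specific examples, not universal names:

```bash
# Pick the partition that matches the job; "debug" partitions typically
# have tight time limits but short queue waits, which suits test runs.
#SBATCH --partition=debug
```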
Resources
The batch system manages and allocates resources such as:
- CPU cores
- GPUs / accelerators
- Memory (per node or per core)
- Wall-clock time
- Nodes (entire machines)
- Specialized hardware (e.g., high-memory nodes, large-storage nodes)
Users request resources quantitatively, for example:
- Number of tasks: “16 MPI tasks”
- Cores per task: “4 cores per task”
- GPUs: “2 GPUs per node”
- Memory: “8 GB per task”
- Time limit: “2 hours”
The scheduler uses these requests to find a suitable allocation on the cluster.
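For example, the requests listed above could be expressed roughly as the following SLURM directives (exact option names differ between batch systems, and GPU syntax varies even between SLURM versions):

```bash
#SBATCH --ntasks=16            # 16 MPI tasks
#SBATCH --cpus-per-task=4      # 4 cores per task
#SBATCH --gpus-per-node=2      # 2 GPUs per node (older sites may use --gres=gpu:2)
#SBATCH --mem-per-cpu=2G       # 2 GB per core x 4 cores = 8 GB per task
#SBATCH --time=02:00:00        # 2-hour wall-clock limit
```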
Schedulers vs Resource Managers
Many HPC batch systems have two interacting components:
- Resource manager: Tracks the state and availability of hardware resources (nodes, cores, memory).
- Scheduler: Decides which jobs run where and when, based on policies and priorities.
In some systems this is a single integrated component; in others, these are separate daemons that communicate with each other.
Common Scheduling Policies
Batch systems implement policies defined by the cluster administrators. Typical concepts include:
Priority
Every job has an internal priority value that determines its position in the queue. Priority is often influenced by:
- User, group, or project
- Job size (number of nodes/cores)
- Job age (how long it has been waiting)
- Fair-share usage (how many resources you’ve used recently)
- QoS or service class (e.g., “debug”, “normal”, “long”)
Higher-priority jobs are usually started earlier, subject to resource availability.
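On SLURM systems, for instance, you can inspect how these factors combine for a pending job (`12345` is a placeholder job ID):

```bash
# Show the priority components (age, fair-share, job size, QoS, ...) of a job
sprio -j 12345
```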
Fair-Share
Fair-share aims to balance the usage across users and projects over time.
- If you have used a lot of resources recently, your new jobs may get lower priority.
- If you have used little, your jobs may get boosted.
This encourages equitable usage and discourages a few users from monopolizing the system.
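SLURM, as one example, exposes this bookkeeping through `sshare`, which reports your recent usage and the resulting fair-share factor:

```bash
# Show fair-share usage and factor for your own account(s)
sshare -u $USER
```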
Limits and Quotas
To protect the system and enforce fairness, batch systems often impose:
- Maximum runtime per job (wall time limit)
- Maximum number of running jobs per user
- Maximum number of cores/GPUs per user or project
- Maximum memory per job or per node
If requested resources exceed site limits, job submission may fail or be rejected by policy.
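Limits are usually documented by the site, and many systems also let you query them directly; on SLURM, for instance:

```bash
# List each partition's configured limits (max wall time, node counts, ...)
scontrol show partition
```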
Backfilling
Backfilling is a scheduling strategy that improves utilization:
- The scheduler reserves resources for large, high-priority jobs that cannot start immediately.
- In the meantime, it runs smaller or shorter jobs in the “gaps,” provided they won’t delay the reserved job’s start.
From a user perspective, this means:
- Short jobs with accurate time limits may start sooner, even if they’re behind other jobs in the queue.
- Overestimating time limits can reduce backfilling opportunities and increase waiting time.
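Concretely, tightening the time limit is often all it takes to become a backfill candidate. A SLURM-style sketch; the run times are illustrative:

```bash
# A measured run takes ~35 minutes; 45 minutes leaves a safety margin
# without blocking backfill the way a blanket 24-hour request would.
#SBATCH --time=00:45:00
```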
Types of Jobs in Batch Systems
Batch (Non-Interactive) Jobs
The default and most common job type:
- Defined by a job script
- Runs without user interaction
- Sends output to log files
- Suitable for long simulations, parameter sweeps, production runs
The batch system launches the job, monitors it, and records the exit status when it finishes.
Interactive Jobs
Interactive jobs provide a shell or interactive session on compute nodes:
- Useful for debugging, profiling, or exploratory work
- Still managed by the batch system (you request resources and time)
- Typically subject to stricter limits (short runtimes, smaller size)
While they feel like normal logins, they are scheduled like any other job.
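For example, on SLURM an interactive shell on a compute node can be requested roughly like this (PBS offers a similar mode via `qsub -I`):

```bash
# Request one task for one hour and attach a terminal to it
srun --ntasks=1 --time=01:00:00 --pty bash
```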
Array Jobs
Array jobs (or job arrays) let you submit many similar jobs at once:
- One job script, many array elements with different indices
- Each element might use a different input file or parameter set
- Efficient for parameter sweeps, ensemble simulations, or processing many files
Advantages:
- Reduced scheduler overhead compared to thousands of single-job submissions
- Easier management: submit, monitor, and cancel a whole group with one command
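A sketch of a SLURM-style array job; the 1-100 range, `process_case`, and the input-file naming scheme are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-100               # 100 elements, indices 1..100
#SBATCH --time=00:30:00             # limit applies to each element
#SBATCH --output=sweep_%A_%a.log    # %A = array job ID, %a = element index

# Each element picks its own input file via the array index.
./process_case input_${SLURM_ARRAY_TASK_ID}.dat
```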
Typical Batch Workflow (Abstract)
Although command names differ between batch systems, the workflow pattern is similar:
1. Prepare a job script
   - Define resource requests using scheduler directives (prefixed by something like `#SBATCH`, `#PBS`, or `#BSUB`, depending on the system)
   - Set up the environment (modules, variables)
   - Run the application (e.g., `mpirun`, `srun`, or a direct executable)
2. Submit the job
   - Use a submission command (e.g., `sbatch`, `qsub`, `bsub`)
   - The system returns a job ID
3. Wait in the queue
   - The job remains `PENDING`/`QUEUED` until resources and policy allow it to start
4. Job runs
   - The system allocates the requested resources
   - It executes the job script on the compute nodes
5. Monitor progress
   - Use status commands (e.g., `squeue`, `qstat`, `bjobs`) to see state and resource usage
   - Check output/error log files
6. Completion and accounting
   - The job finishes or is canceled
   - Logs remain in your directory
   - The system may record accounting info (CPU-hours, memory, exit code) for reports or billing
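Put together, the whole cycle takes only a handful of commands. SLURM is shown here; PBS and LSF use `qsub`/`qstat`/`qdel` and `bsub`/`bjobs`/`bkill` respectively, and `12345` is a placeholder job ID:

```bash
sbatch job.sh       # submit; prints something like "Submitted batch job 12345"
squeue -u $USER     # monitor state (PD = pending, R = running)
scancel 12345       # cancel, if needed
sacct -j 12345      # accounting record after completion (CPU time, exit code)
```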
The following chapters on SLURM and job scripts will show concrete syntax and examples.
How Batch Systems Improve Cluster Usage
Batch systems are central to efficient and fair use of HPC clusters:
- High utilization: They continuously look for jobs to fill idle nodes.
- Fairness: They apply policies so many users can share the system effectively.
- Protection: They prevent users from overloading login nodes or bypassing limits.
- Reproducibility: Jobs run in controlled environments with recorded parameters.
- Scalability: They handle thousands of jobs and users across many nodes.
Understanding the general behavior of batch systems helps you:
- Form realistic expectations about queue wait times
- Request resources wisely
- Choose appropriate queues/partitions
- Interpret job states and scheduling decisions
Differences Between Common Batch Systems (Conceptual)
While the commands and directives vary, core features are broadly similar:
- SLURM
  - Open-source, widely used in modern HPC
  - Uses partitions instead of queues
  - Integrates resource management and scheduling
- PBS / Torque / OpenPBS
  - Older, still common
  - Uses `qsub`, `qstat`, etc.
  - Often combined with external schedulers in large deployments
- LSF
  - Commercial, used in many enterprise and some research centers
  - Uses `bsub`, `bjobs`, etc.
- SGE / Grid Engine variants
  - Historically popular; some sites still use derivatives
  - Uses `qsub` and `qstat`, with somewhat different semantics than PBS
- HTCondor
  - Optimized for high-throughput and opportunistic computing
  - Often used in data-intensive and grid environments
When moving between clusters, you will often:
- Recognize the same concepts (jobs, queues, resources, priorities)
- Adapt to different command names and option syntax
- Adjust to site-specific policies and queue configurations
User Responsibilities in Batch Environments
From the user’s perspective, effective use of a batch system involves:
- Accurate resource requests
  - Request enough to run reliably
  - Avoid large overestimates that increase wait times and reduce utilization
- Appropriate queue choice
  - Use “short” or “debug” queues for tests
  - Use production queues for large or long runs
  - Follow your site's documentation
- Respecting policies
  - Observe limits on maximum jobs, runtime, and resources
  - Avoid submission patterns that stress the scheduler (e.g., millions of tiny jobs instead of job arrays)
- Monitoring and debugging
  - Check logs for errors and performance issues
  - Cancel obviously misconfigured or stuck jobs promptly
These practices help both your own productivity and the overall health of the shared system.
How This Connects to Upcoming Chapters
Subsequent chapters in this section build on this overview:
- Introduction to SLURM: A concrete look at one popular batch system: commands, partitions, and basic usage.
- Writing job scripts: How to write portable, readable scripts that describe batch jobs.
- Submitting, monitoring, and cancelling jobs: Practical command-line workflows.
- Resource management details: How to get better throughput and performance by using the scheduler intelligently.
The concepts described here—jobs, queues/partitions, resources, priorities, and policies—underpin all of those practical steps.