Big picture: why scheduling exists in HPC
On an HPC cluster, hundreds or thousands of users share a finite set of powerful nodes. Unlike a personal workstation, you normally cannot just start a large parallel job directly on the login node or choose arbitrary compute nodes for yourself. This would:
- Cause chaos (conflicting usage of the same cores, memory, GPUs).
- Make performance unpredictable (jobs slowing each other down).
- Make accounting and fair sharing impossible.
A job scheduler, together with a resource manager (the combination is often simply called a batch system), sits between users and the hardware. You describe what you want to run and what resources you need; the scheduler decides when, where, and for how long it will run.
This chapter focuses on:
- The role of schedulers and resource managers.
- Basic concepts shared by most HPC batch systems.
- How resource requests, queues/partitions, and job priorities interact.
- What you, as a user, should think about when asking the scheduler for resources.
Core concepts: jobs, resources, and policies
Most HPC schedulers share a common conceptual model, even if their commands differ:
- Job: A unit of work submitted to the scheduler (e.g. run a program with certain parameters).
- Resources: What the job needs from the cluster:
- CPU cores
- GPUs/accelerators
- Memory
- Wall-clock time
- Sometimes special features (e.g. “bigmem”, “GPU type X”, “fast interconnect”).
- Queues / partitions: Logical subdivisions of the cluster that group jobs and resources according to policies (short/long jobs, debug/production, GPU/CPU-only, etc.).
- Scheduler: Decides job order and placement subject to policies.
- Resource manager: Tracks actual resource usage on each node and enforces allocations.
In many systems, a single software stack (e.g. SLURM, PBS Pro, LSF) handles both scheduling and resource management; conceptually, the roles are still useful to distinguish:
- Resource manager ensures your job only uses the CPUs and memory assigned to it.
- Scheduler decides which job runs first and where.
Job lifecycle
A typical job’s lifecycle within the scheduler:
- Submission
  You submit a job description (job script or interactive job request). It specifies:
  - Resources needed (time, cores, nodes, memory, GPUs).
  - Which partition/queue to use.
  - The executable and arguments.
  - Optional: dependencies on other jobs, account to charge, etc.
- Queued / pending
  The job waits in a queue. It gets a status like “pending” while it is waiting for:
  - Matching resources (enough free cores/memory/GPU nodes).
  - Priority over other pending jobs.
  - Compliance with time-of-day or maintenance windows.
- Scheduling / starting
  The scheduler chooses nodes that meet the job’s requirements and marks those resources as allocated. The job’s startup scripts are run, the environment is set up, and your executable starts.
- Running
  While running:
  - Resource limits (time, memory, cores) are enforced.
  - Usage is accounted for (CPU time, energy, etc., depending on the system).
- Completion / termination
  The job finishes successfully, fails, or is killed (e.g. time limit exceeded, manual cancellation, node failure). The scheduler:
  - Releases the resources back to the pool.
  - Optionally records accounting information.
  - Writes your job’s output and error logs.
Understanding this lifecycle helps explain most of the behavior you see when interacting with a batch system (why jobs wait, why they get killed at the time limit, etc.).
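As a rough illustration of how this lifecycle looks from the command line, here is a SLURM-style sketch (state codes and commands are SLURM-specific; the job ID and script name are placeholders, and the scheduler-specific chapter covers the details):

```bash
# Submit a job script; the scheduler returns a job ID.
sbatch my_job.sh            # -> "Submitted batch job 123456"

# Watch the job move through its lifecycle states:
#   PD = pending (queued), R = running, CG = completing.
squeue -u $USER

# Completed or failed jobs leave the queue; their history stays in accounting.
sacct -j 123456 --format=JobID,State,Elapsed,ExitCode
```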
Types of jobs in batch systems
Schedulers typically support multiple job “modes” that suit different usage patterns:
- Batch (non-interactive) jobs
- Interactive jobs
- Array jobs
- Dependency-based workflows
These are all just variations on the central idea: describe what you need and what to run; the system executes when resources become available.
Batch (non-interactive) jobs
This is the most common type on HPC clusters.
Characteristics:
- You prepare a job script in advance.
- The script describes the resource request and the commands to run.
- You submit it; it runs unattended when scheduled.
- Output is written to log files for later inspection.
Typical use cases:
- Long simulations.
- Parameter sweeps.
- Production processing of large datasets.
Implications:
- You must think in advance about:
- How many resources you need.
- How long it will run.
- Where input/output will be stored.
- Good for reproducibility: job scripts can be archived and reused.
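For orientation, a batch job script on a SLURM-like system might look like the following minimal sketch; the partition name, module, program, and input file are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=my_simulation        # name shown in the queue
#SBATCH --partition=normal              # placeholder partition name
#SBATCH --time=02:00:00                 # wall-clock limit (2 hours)
#SBATCH --nodes=1
#SBATCH --ntasks=16                     # 16 parallel tasks (e.g. MPI ranks)
#SBATCH --mem-per-cpu=2G
#SBATCH --output=my_simulation_%j.log   # %j expands to the job ID

module load my_application              # placeholder environment setup
srun ./my_program input.dat             # runs unattended when scheduled
```

You would submit this with sbatch; the script itself can be archived and reused, which is exactly what makes batch jobs good for reproducibility.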
Interactive jobs
Interactive jobs give you a command-line shell on allocated compute resources rather than starting a pre-defined script. This differs from just working on the login node because the interactive session:
- Runs on compute nodes.
- Uses scheduled and tracked resources.
- Obeys cluster policies and time limits.
Typical use cases:
- Debugging parallel programs.
- Profiling and performance tuning.
- Exploratory work before writing a batch script.
From the scheduler’s point of view, these are still jobs with requested resources and time; they are simply driven by your shell rather than a pre-written batch script.
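On a SLURM-like system, an interactive session is requested with the same kinds of resource flags as a batch job; a sketch (flags and site defaults vary):

```bash
# Allocate 4 cores and 8 GB of memory for one hour, then work inside the allocation.
salloc --time=01:00:00 --cpus-per-task=4 --mem=8G

# Alternatively, launch a shell directly on the allocated resources.
srun --time=01:00:00 --cpus-per-task=4 --mem=8G --pty bash -i
```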
Array jobs
Array jobs are a way to submit many similar jobs as a single logical entity.
Characteristics:
- One job description, multiple tasks with varying parameters (e.g. different input files, different random seeds).
- The scheduler treats each array element like an individual job for resource allocation, but you get simpler submission and management.
Typical use cases:
- Parameter scans (e.g. varying one or more parameters across many values).
- Running the same program on many independent datasets.
- Monte Carlo simulations.
Scheduling advantage:
- Simpler for the scheduler to manage and for you to monitor.
- You can often run many small tasks under a single job ID, with each task identified by an internal index.
Algorithmically, array jobs align with embarrassingly parallel workloads: largely independent tasks that do not need to communicate.
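A sketch of an array job in SLURM syntax; the input naming scheme and the throttle value are illustrative:

```bash
#!/bin/bash
#SBATCH --job-name=param_scan
#SBATCH --array=1-100%10        # 100 tasks, at most 10 running at once
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --mem=4G

# Each array element receives its own index and picks its own input file.
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
```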
Job dependencies and workflows
Schedulers often support declaring dependencies between jobs, for example:
- “Do not start job B until job A completes successfully.”
- “Run job C after jobs D and E are done, regardless of whether they failed or succeeded.”
This enables simple workflow orchestration without external tools:
- Preprocessing → Simulation → Postprocessing.
- Multi-step pipelines where each step transforms data for the next.
At the scheduling level, dependencies:
- Control when a job becomes eligible to run.
- Do not change its resource requirements or priority, but can significantly affect end-to-end workflow time.
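A dependency chain can be declared at submission time. The following SLURM-style sketch wires a three-step pipeline together (the script names are placeholders):

```bash
# --parsable makes sbatch print only the job ID, so it can be captured in a variable.
pre_id=$(sbatch --parsable preprocess.sh)

# Start the simulation only if preprocessing finished successfully (afterok).
sim_id=$(sbatch --parsable --dependency=afterok:${pre_id} simulate.sh)

# Run postprocessing after the simulation ends, regardless of its outcome (afterany).
sbatch --dependency=afterany:${sim_id} postprocess.sh
```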
Resource types and how they’re expressed
Different schedulers use different flags, but conceptually you ask for:
- Time
- CPU cores or nodes
- Memory
- GPUs/accelerators
- Special features or constraints
Each of these affects both when your job starts and how it runs.
Wall-clock time
You tell the scheduler how long your job is allowed to run: the wall time or time limit.
- If the job exceeds this limit, it is usually killed automatically.
- Schedulers may charge or prioritize based on this time.
- Partitions/queues often have different maximum time limits (e.g. 1 hour in debug, several days in long).
Why it matters:
- If your time estimate is too low:
- The job may be killed shortly before completion, wasting both the compute time already used and the time you spent waiting in the queue.
- If your time estimate is too high:
- On many systems, large time requests reduce your priority or restrict you to slower queues.
- The scheduler may find it harder to fit your job into gaps, increasing wait time.
Practical strategy:
- Start with a slightly conservative estimate based on smaller test runs.
- Refine time requests as you learn how your application behaves.
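One practical way to refine estimates on a SLURM-like system is to compare the requested limit with the time actually used after each run (the job ID below is a placeholder):

```bash
# In the job script: request a 4-hour wall-clock limit.
#SBATCH --time=04:00:00

# After the job finishes: compare requested vs. actual run time.
sacct -j 123456 --format=JobID,Elapsed,Timelimit,State
```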
CPU cores, tasks, and nodes
Scheduler requests distinguish between:
- Cores (or “CPUs” in some schedulers): basic CPU execution units.
- Tasks (often “ranks” in MPI context): independent program instances cooperating in parallel.
- Nodes: physical machines.
At the scheduling level:
- You ask for a certain number of cores, tasks, and/or nodes.
- The scheduler maps tasks to cores and nodes to satisfy your request.
Implications:
- More cores do not always mean a faster job; your code must scale.
- Overspecifying cores wastes resources and may reduce your priority or cause longer wait times.
- Some schedulers support:
- Exclusive node allocation (no other jobs use your nodes).
- Shared node allocation (multiple small jobs share cores on a node).
Cluster policies often limit maximum cores per user or per job to avoid starvation of other users.
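Two common request patterns, shown as SLURM-style sketches (the numbers are examples, not recommendations; a real script would use only one of the two sets of directives):

```bash
# Pure MPI: 4 nodes, 32 ranks per node, one core per rank.
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1

# Hybrid MPI+OpenMP: 2 nodes, 4 ranks per node, 8 threads per rank,
# requesting the nodes exclusively if the site allows it.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --exclusive
```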
Memory
You typically request memory:
- Per node or per core.
- Sometimes as a total for the job.
The resource manager enforces memory limits:
- If your job exceeds its memory allocation, it may be killed by the system.
- Other jobs on the node rely on you respecting your memory reservation.
Implications:
- Underestimating memory causes failures.
- Overestimating memory reduces how many jobs can fit concurrently, leading to longer queues for everyone and often lower priority for your job.
Some clusters offer special high-memory partitions/nodes. These usually have:
- Much more memory per node.
- Tighter quotas or lower availability.
- Sometimes different accounting rates.
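Memory is typically requested per core or per node, and the peak usage can be checked afterwards to calibrate future requests (SLURM-style sketch; the job ID is a placeholder):

```bash
# Per-core request (scales automatically with the number of cores)...
#SBATCH --mem-per-cpu=4G
# ...or a per-node request (use one or the other, not both).
#SBATCH --mem=64G

# After the job: compare the requested memory with the peak actually used (MaxRSS).
sacct -j 123456 --format=JobID,ReqMem,MaxRSS,State
```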
GPUs and accelerators
When GPUs or other accelerators are present:
- You must request them explicitly.
- The scheduler ensures you get exclusive or controlled access to a given number of devices.
- The resource manager may set environment variables or device visibility to enforce allocation (e.g. controlling which GPUs your job can see).
From a scheduling perspective:
- GPU nodes are often in special partitions/queues.
- GPU resources are scarcer; request only what you genuinely need.
- Policies may restrict maximum GPUs per user or per job.
GPU allocation interacts with CPU and memory:
- Jobs on GPU nodes still need CPU cores and memory for host-side work.
- Some centers enforce specific CPU:GPU ratios.
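A GPU request on a SLURM-like system might look like this sketch; the partition name, GPU count, and CPU/memory pairing are site-specific examples:

```bash
#SBATCH --partition=gpu          # placeholder GPU partition
#SBATCH --gres=gpu:2             # two GPUs on one node
#SBATCH --cpus-per-task=8        # host-side cores for data loading etc.
#SBATCH --mem=64G

# The resource manager typically restricts device visibility (e.g. via
# CUDA_VISIBLE_DEVICES) so the job sees only its allocated GPUs.
srun ./my_gpu_program
```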
Special features and constraints
Complex clusters may tag nodes with features:
- Hardware characteristics:
- Specific CPU generation (e.g. AVX-512 capable).
- GPU model (e.g. V100 vs A100).
- Large-memory nodes.
- Software constraints:
- Nodes pre-configured for certain libraries or environments.
- Network topology:
- Different interconnects or fabric islands.
By specifying constraints (e.g. “need GPU type X”), you:
- Limit which nodes your job can run on.
- Potentially increase wait time, since fewer nodes match the request.
Schedulers attempt to balance these constraints with efficient resource usage and fairness among users.
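On SLURM-like systems, node features can be inspected up front and requested as constraints; the feature name below is a placeholder:

```bash
# List nodes together with the features they advertise.
sinfo -o "%N %f"

# In the job script: run only on nodes tagged with a given feature.
#SBATCH --constraint=a100
```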
Queues / partitions and cluster policies
Clusters organize their resources through queues (also called partitions, classes, or projects). Each queue has associated policies:
- Maximum runtime per job (e.g. 30 minutes in debug, 48 hours in normal).
- Maximum cores, nodes, or GPUs per job.
- Priority level relative to other queues.
- User eligibility (e.g., certain groups, projects).
Schedulers use queues to implement:
- Short vs. long jobs: Short jobs often have higher priority to reduce wait time.
- Debug vs. production: Debug queues allow quick turnaround but with tight time/resource limits.
- Special hardware: GPU queues, big-memory queues.
As a user, choosing the right queue:
- Can greatly reduce your wait time.
- Must align with your job’s time and resource needs.
- Often has a bigger effect on when your job starts than small changes to core counts do.
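On SLURM-like systems, the available partitions and their limits can be listed before choosing one (the partition name below is a placeholder):

```bash
# Show partitions, their time limits, and node availability.
sinfo

# In the job script: pick a partition that matches the job's needs.
#SBATCH --partition=debug
#SBATCH --time=00:30:00
```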
Priorities, fairness, and accounting
Schedulers have to juggle many users and jobs simultaneously while enforcing:
- Fair access to resources.
- Efficient utilization of the cluster.
They do this using priorities, fair-share policies, and sometimes accounting/billing models.
Priority factors
A job’s effective scheduling priority usually depends on several factors, such as:
- Fair-share usage:
- Users (or projects) who have used a lot of resources recently may get lower priority.
- Users who have used little recently may get a boost.
- Queue/partition priority:
- Jobs in some partitions may get preferential treatment.
- Job size and runtime:
- Very large jobs might be prioritized to avoid fragmentation of resources.
- Very long jobs might be deprioritized, or only allowed in specific queues.
- Age in queue:
- Jobs waiting a long time may gradually increase in priority.
Exact formulas are system-dependent, but the effect is that:
- Priority is dynamic: waiting increases your chance to run, even if you used a lot of resources previously.
- Reasonable resource requests and queue choices are rewarded with better turnaround.
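On SLURM systems that use the multifactor priority plugin, the individual components of a pending job's priority can be inspected; whether this is enabled depends on the site's configuration:

```bash
# Show the fair-share, age, job-size, and partition contributions
# to the priority of your pending jobs.
sprio -u $USER
```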
Fair-share and quotas
Fair-share aims to prevent a single user or group from monopolizing the cluster. It can be implemented via:
- Usage-weighted priorities: recently heavy users get lower priority.
- Quotas:
- Limits on number of running jobs.
- Limits on maximum simultaneous cores/GPUs.
- Limits per queue.
Understanding this helps you interpret scheduler behavior:
- If your jobs tend to wait while others start quickly, you may be using more than your fair share at that time.
- If you hit job or core limits, it is the policy enforcing fairness, not a technical error.
Many centers also track usage for reporting or billing (internal or external). That accounting may affect priority indirectly (e.g. projects that overrun their allocation may be deprioritized).
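On SLURM-like systems, your fair-share standing and the reasons jobs are still waiting can be checked directly (output details depend on site configuration):

```bash
# Show your recent usage and the resulting fair-share factor.
sshare -u $USER

# The reason column of the queue listing shows why a job is still pending
# (e.g. Priority, Resources, or a per-user/QOS limit).
squeue -u $USER
```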
Backfilling and utilization
A key scheduling strategy in HPC is backfilling:
- The scheduler plans when large jobs will start, often after current large jobs finish.
- While waiting for that future timeslot, smaller jobs that can fit into the gaps are allowed to run as long as they do not delay the planned start of already-scheduled higher-priority jobs.
Effects:
- Overall cluster utilization increases (fewer idle cores).
- Small, short jobs may start quickly even on busy systems.
Implications for users:
- Reasonable time estimates allow better backfilling.
- Smaller jobs and shorter time requests often get better turnaround.
- Overstated time limits or oversized resource requests can make your jobs harder to backfill, increasing your own wait time and reducing efficiency for everyone.
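Many schedulers can report their current start-time estimates for pending jobs, which makes the effect of backfilling visible (SLURM-style sketch; estimates change as the queue evolves):

```bash
# Show the scheduler's estimated start time for your pending jobs.
squeue -u $USER --start
```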
Preemption and job interruptions
Some systems use preemption:
- Higher-priority jobs are allowed to “take over” resources from lower-priority jobs.
- The lower-priority jobs are suspended or terminated, then restarted or resubmitted later.
Common scenarios:
- Special high-priority queues (e.g. urgent jobs, admin use) that can preempt standard jobs.
- Preemptible queues where jobs run only when resources are idle but can be stopped at any time.
From a resource-management viewpoint:
- Preemption is another tool to keep utilization high while guaranteeing that high-priority work can start quickly.
- Your job design should be resilient to interruptions when running in preemptible queues (e.g. use checkpointing; see the I/O and data management chapters).
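As a sketch of interruption-aware job design on a SLURM-like system: ask for a warning signal before the job is stopped and react to it. The program name, flags, and checkpoint logic below are purely illustrative; real checkpointing is application-specific.

```bash
#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --signal=B:USR1@300    # signal the batch script 300 s before the limit

checkpoint_and_exit() {
    echo "Warning signal received, triggering a checkpoint..."
    kill -USR1 "$APP_PID" 2>/dev/null   # only useful if the application handles USR1
}
trap checkpoint_and_exit USR1

./my_program --checkpoint latest.ckpt &   # placeholder application and flag
APP_PID=$!
wait "$APP_PID"
```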
User responsibilities in resource management
While schedulers automate placement and priority, effective resource management relies heavily on user behavior.
As a user, you are responsible for:
- Choosing appropriate resource requests:
- Estimate time, cores, and memory from test runs and experience.
- Avoid massive over-requests that cause long waits and waste capacity.
- Selecting the right queue/partition:
- Short tests in debug/short queues.
- Long production runs in appropriate long queues.
- Using job arrays for many similar runs:
- Reduce overhead in the scheduler.
- Improve manageability and efficiency.
- Applying dependencies for workflows:
- Prevents idle time between steps.
- Avoids manual re-submission of each stage.
- Monitoring and adapting:
- Observe actual usage (time, cores, memory) and adjust future requests.
- Respect cluster policies and quotas.
Doing these well improves:
- Your own throughput and turnaround time.
- Overall cluster efficiency and fairness.
- The reliability and predictability of your workflows.
Interaction with other parts of the HPC stack
Job scheduling and resource management form the bridge between:
- You (and your software environment, modules, containers).
- The hardware (nodes, interconnects, storage).
In this chapter we focused on:
- The concepts of jobs, resources, policies, and fairness.
- How schedulers decide when and where jobs run.
Other chapters will cover:
- The concrete commands and syntax for a specific scheduler (e.g. SLURM).
- How to write job scripts in detail.
- How to monitor, debug, and optimize jobs once they are running.
Together, these aspects will let you turn your code or application into reproducible, efficient workloads on real HPC systems.