Why Job Sizing Matters Ethically
On a shared HPC system, your jobs compete with others for finite resources: cores, memory, GPUs, and I/O bandwidth. “Job sizing” means choosing:
- How many nodes/cores/GPUs you request
- How much memory you request
- How long you ask to run (wall time)
These choices have direct ethical and practical implications:
- Over‑requesting wastes capacity and increases queue times for others.
- Under‑requesting leads to failed jobs, repeated runs, and extra energy use.
- Poor sizing can break “fair‑share” policies, even if unintentionally.
The goal is not “get as much as you can”, but “get what you need, efficiently and predictably, while respecting others”.
Principles of Good Job Sizing
Match resources to the problem, not wishful thinking
Avoid both extremes:
- “I’ll just ask for 1 core so it starts sooner” for a very heavy job → long wall‑times, inefficient cluster use.
- “I’ll ask for the whole node, just in case” for a small job → idle resources blocked from others.
Instead:
- Estimate runtime and memory from:
- Smaller test cases
- Previous runs
- Scaling tests (e.g., try 1, 2, 4, 8 nodes)
- Choose the smallest configuration that gives acceptable time‑to‑solution.
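The "smallest acceptable configuration" rule can be sketched as a tiny helper. The node counts and timings below are hypothetical scaling-test results, not real measurements:

```python
# Pick the smallest configuration whose time-to-solution is acceptable.
# Hypothetical scaling-test results: nodes -> runtime in minutes.
scaling_results = {1: 480, 2: 250, 4: 140, 8: 90}

def smallest_acceptable(results, max_minutes):
    """Return the fewest nodes that meet the time-to-solution target."""
    for nodes in sorted(results):
        if results[nodes] <= max_minutes:
            return nodes
    return None  # no tested configuration meets the target

print(smallest_acceptable(scaling_results, max_minutes=180))  # -> 4
```

With a 3-hour target, 4 nodes suffice; requesting 8 would roughly halve the runtime but block twice the hardware for modest gain.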
Choose realistic wall times
Most schedulers use requested wall time as a key scheduling input.
- Requesting far too long:
- Your job may wait longer in the queue.
- The scheduler may have trouble fitting your job into available gaps.
- Nodes can sit idle waiting for your oversized slot.
- Requesting too short a wall time:
- Your job may be killed by the scheduler when the limit is hit, losing any work not checkpointed.
- You end up re‑running the job, doubling energy use and queue load.
Ethical practice:
- Run short test jobs to get an order‑of‑magnitude estimate.
- Add a safety margin (e.g., 20–30%, not 300%).
- Adjust future requests based on observed runtimes.
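The "measure, then add a modest margin" practice can be expressed directly; the runtimes here are hypothetical test observations:

```python
import math

def walltime_request(observed_minutes, margin=0.25):
    """Round an observed runtime up with a modest safety margin (25% here)."""
    return math.ceil(observed_minutes * (1 + margin))

# Hypothetical: three test runs of the same case took 52, 55, and 58 minutes.
worst = max(52, 55, 58)
print(walltime_request(worst))  # -> 73
```

Requesting about 73 minutes, rather than "8 hours just in case", lets the scheduler backfill your job into small gaps and keeps your estimate honest for the next run.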
Right‑size CPU and GPU use
For parallel jobs, there is a point of diminishing returns:
- Past a certain core/GPU count, speedup flattens or even degrades.
- Excess resources then give poor scientific return per core‑hour or GPU‑hour.
Ethical job sizing:
- Use scaling studies to find:
- A “strong scaling limit” beyond which your code no longer benefits.
- A “sweet spot” where additional resources still produce meaningful speedup.
- Choose configurations near that sweet spot, not simply “maximum allowed”.
If you’re unsure, it’s more ethical to slightly under‑provision and test than to massively over‑provision “just to be safe”.
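A scaling study boils down to computing speedup and parallel efficiency relative to your smallest measured run. A minimal sketch, with hypothetical timings:

```python
def parallel_efficiency(timings):
    """Speedup and efficiency relative to the smallest measured core count."""
    base_cores = min(timings)
    base_time = timings[base_cores]
    out = {}
    for cores, t in sorted(timings.items()):
        speedup = base_time / t
        out[cores] = (speedup, speedup / (cores / base_cores))
    return out

# Hypothetical strong-scaling study: cores -> runtime in seconds.
timings = {16: 1000, 32: 520, 64: 300, 128: 220, 256: 210}
for cores, (s, e) in parallel_efficiency(timings).items():
    print(f"{cores:4d} cores: speedup {s:5.2f}, efficiency {e:4.0%}")
```

In this made-up study, 64 cores sit near the sweet spot (~83% efficiency), while 256 cores are past the strong scaling limit (~30% efficiency): most of those core‑hours would be wasted.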
Memory requests: neither starve nor hoard
Memory is often the tightest resource.
- Under‑requesting:
- Leads to out‑of‑memory crashes and wasted compute time.
- Over‑requesting:
- Can force your job onto scarce high‑memory nodes it does not actually need.
- Prevents others from using otherwise idle hardware.
Ethical approach:
- Use small test runs to measure memory usage (with system or profiling tools).
- Request a reasonable safety margin (e.g., 20–50% above measured peak).
- If your memory use is uncertain or dynamic, discuss with support staff rather than inflating requests arbitrarily.
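The memory rule follows the same pattern as wall time: measure the peak, then add a bounded margin. A sketch with a hypothetical measurement (the 41 GB figure is illustrative; on SLURM systems the peak can often be read from accounting fields such as `sacct`'s MaxRSS):

```python
import math

def memory_request_gb(measured_peak_gb, margin=0.3):
    """Request measured peak plus a margin (30% here), rounded up to whole GB."""
    return math.ceil(measured_peak_gb * (1 + margin))

# Hypothetical: a representative test run peaked at 41 GB of resident memory.
print(memory_request_gb(41))  # -> 54
```

A 54 GB request protects against moderate variation without hoarding a whole high‑memory node "just in case".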
Understanding Fair‑Share Policies
What “fair‑share” means
Most HPC centers implement a “fair‑share” scheduling policy:
- Usage by each user/group is tracked over time (e.g., recent CPU/GPU hours).
- If you have used more than your “share”, your job priority decreases.
- If you have used less, your priority increases.
Ethical implications:
- Fair‑share discourages chronic over‑consumption.
- It also means that your decisions about job size and frequency directly affect how others experience the system.
How job sizing interacts with fair‑share
Job size affects your fair‑share balance:
- Larger jobs consume more “credits” (e.g., core‑hours), lowering your priority for later jobs.
- Many small jobs can collectively consume just as much but may be scheduled more opportunistically.
Unethical patterns (even if unintentional):
- Submitting many large jobs at maximum size and wall time, forcing others’ jobs to wait.
- Continuously topping up the queue as soon as jobs finish, keeping your usage near the maximum sustained level.
Ethical patterns:
- Throttle submission when you are clearly above your fair‑share.
- Use queue status and accounting tools to understand your recent usage.
- Batch your work reasonably rather than flooding the scheduler.
Practical Guidelines for Ethical Job Sizing
Start small, then scale up
- Develop and debug locally or with very small jobs.
- Run short, reduced‑size production tests:
- Fewer time steps, smaller grid, fewer samples, etc.
- Measure:
- Runtime
- Memory usage
- Parallel efficiency
- Extrapolate cautiously to production size:
- Use scaling tests, not pure linear guesses.
This reduces:
- Wasted cluster time due to crashes or logic bugs.
- The need for repeated large production runs.
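To see why "pure linear guesses" mislead, consider extrapolating with Amdahl's law, T(p) = T1·(s + (1 − s)/p), where s is the serial fraction inferred from two test runs. The timings below are hypothetical:

```python
def amdahl_serial_fraction(t1, tp, p):
    """Infer the serial fraction s from timings on 1 and p nodes
    via Amdahl's law: tp = t1 * (s + (1 - s) / p)."""
    return (tp / t1 - 1 / p) / (1 - 1 / p)

def amdahl_predict(t1, s, p):
    """Predicted runtime on p nodes: T(p) = T1 * (s + (1 - s) / p)."""
    return t1 * (s + (1 - s) / p)

# Hypothetical small tests: 400 s on 1 node, 60 s on 8 nodes.
s = amdahl_serial_fraction(400, 60, 8)
print(f"serial fraction ~{s:.3f}")
print(f"naive linear guess for 64 nodes: {400 / 64:.1f} s")
print(f"Amdahl estimate for 64 nodes:   {amdahl_predict(400, s, 64):.1f} s")
```

Even a small serial fraction (~3% here) makes the 64-node run nearly three times slower than the linear guess, which is exactly the gap that leads to killed jobs and wasted reruns when wall times are sized from linear extrapolation.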
Use appropriate queues/partitions
Most clusters provide:
- “Short” queues for quick tests
- “Long” queues for extended runs
- “GPU” partitions, “high‑memory” partitions, etc.
Ethical usage:
- Put tests and small debugging jobs in short queues.
- Reserve long queues for truly long runs.
- Do not send test jobs to the largest or longest partitions “because they’re there”.
This helps the scheduler balance the different needs and timelines of all users.
Avoid resource hoarding and queue flooding
Common harmful behaviors include:
- Submitting hundreds of jobs at once at maximum allowed size.
- Requesting exclusive nodes for single‑threaded jobs without justification.
- Keeping a constant backlog of huge jobs so the cluster is effectively “yours”.
Better practices:
- Cap your own number of queued or running jobs, even if the system does not enforce it strictly.
- Group small serial jobs using job arrays or within a single allocation (if appropriate), rather than separate large requests.
- Use shared nodes where suitable for small jobs, if the system offers them.
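One common way to group small serial jobs is a job array whose index selects a chunk of work. A minimal sketch, assuming a SLURM-style scheduler that sets SLURM_ARRAY_TASK_ID for each array element (the task list and chunk size are hypothetical):

```python
import os

def my_chunk(tasks, chunk_size, task_id):
    """Return the slice of tasks that array element `task_id` should process."""
    start = task_id * chunk_size
    return tasks[start:start + chunk_size]

# 1000 hypothetical work items, 100 per array element -> a 10-element
# array (e.g. submitted with `sbatch --array=0-9`) instead of 1000 jobs.
tasks = list(range(1000))
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
for item in my_chunk(tasks, 100, task_id):
    pass  # process one work item here
```

Ten array elements place far less load on the scheduler than a thousand individual submissions, while still exposing enough parallelism.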
Coordinate within your group
Fair‑share is often enforced per group/project as well as per user.
- If one team member routinely over‑requests, others lose priority.
- Multiple users in a project can unintentionally overload the system at once.
Ethical team habits:
- Share guidelines for typical job sizes for your codes.
- Stagger or schedule large campaigns (e.g., parameter sweeps) among team members.
- Monitor group‑level usage, not only personal usage.
Balancing Time‑to‑Solution and Community Fairness
When large jobs are justified
There are legitimate reasons to run very large jobs:
- Simulations that simply do not fit into smaller configurations.
- Urgent deadlines (e.g., conference submissions, project milestones).
- Center‑approved large‑scale campaigns or leadership‑class runs.
Ethical considerations:
- Communicate with system staff when planning extremely large or unusual runs.
- Schedule them during lower‑usage periods if possible.
- Share performance and scaling results, so others can learn how to use the system better.
Avoid “deadline panic” waste
Rushing near a deadline often leads to:
- Over‑inflated wall times “just to be safe”
- Over‑sized jobs to “finish in time”, even with poor scaling
- Frequent resubmissions after premature job termination or logic errors
Mitigation:
- Start large campaigns early; use the system steadily rather than in bursts.
- Plan mid‑scale runs to verify correctness and scaling before a final large run.
- Treat your fair‑share as a limited budget you want to invest wisely, not burn.
Monitoring and Reflecting on Your Usage
Use accounting and reporting tools
Most systems provide tools (often sacct, sreport, web dashboards, or custom commands) to show:
- Core‑hours or GPU‑hours consumed
- Memory usage and wall‑time adherence
- Queue wait times and job efficiency
Use these metrics to:
- Identify chronic over‑ or under‑requesting of wall time.
- Spot jobs with very low utilization (e.g., many idle cores).
- Adjust job sizes and submission strategies over time.
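The checks above can be automated over accounting output. A sketch that flags poorly sized jobs from accounting-style fields (the thresholds and the example job are illustrative; on SLURM systems the inputs correspond roughly to `sacct` fields such as Timelimit, Elapsed, AllocCPUS, and TotalCPU):

```python
def job_report(req_minutes, used_minutes, cores_alloc, cpu_minutes_used):
    """Flag jobs that are far from their requests: low wall-time use
    or low CPU efficiency across the allocated cores."""
    walltime_use = used_minutes / req_minutes
    cpu_eff = cpu_minutes_used / (used_minutes * cores_alloc)
    flags = []
    if walltime_use < 0.5:
        flags.append("wall time over-requested")
    if cpu_eff < 0.5:
        flags.append("many cores mostly idle")
    return walltime_use, cpu_eff, flags

# Hypothetical job: asked for 24 h on 32 cores, ran 5 h, used 40 core-hours.
print(job_report(24 * 60, 5 * 60, 32, 40 * 60))
```

This job used about 21% of its wall-time request and kept only 25% of its cores busy: both flags fire, and both suggest a much smaller request next time.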
Aim for efficiency, not just completion
Ethical job sizing aligns with green computing goals:
- Higher job efficiency → fewer wasted core‑hours and GPU‑hours.
- Fewer failed or repeated runs → lower energy use and emissions.
- Better queue sharing → more science per watt at the system level.
A practical rule of thumb:
- Strive for jobs that:
- Run close to the requested wall time, but do not hit the limit.
- Use most of the allocated cores/GPUs actively.
- Use memory close to, but safely below, the requested amount.
When you notice a job is far from these goals, treat it as feedback to improve your sizing, not as “good enough because the job eventually finished”.
Summary: Ethical Job Sizing Behavior
- Request only the resources (cores, GPUs, memory, time) your job reasonably needs, with modest safety margins.
- Base requests on measurements from smaller or earlier runs, not guesses.
- Understand and respect your center’s fair‑share policy; monitor your usage and adjust.
- Avoid queue flooding and resource hoarding, even if policies technically allow it.
- Coordinate within your group so that shared allocations are used fairly and efficiently.
Thoughtful job sizing is a practical way to embody fairness, respect colleagues, and support sustainable, green HPC.