Understanding Job Sizing in an Ethical Context
Job sizing is the practice of matching the resources you request on an HPC system to what your application actually needs. It connects directly to ethics and sustainability, because overestimating resources wastes energy and cluster capacity, while underestimating can cause failures that also waste time and power.
In most schedulers you request at least three things for each job: number of nodes or tasks, number of cores or CPUs per task, and wall time. On GPU systems you typically also request a number of GPUs. Job sizing also includes memory per node or per task, and sometimes other specialized resources. Ethical job sizing means basing these requests on evidence, such as small test runs and performance measurements, instead of guessing very large numbers “just to be safe.”
Bad job sizing has clear consequences. If you ask for far more CPU cores or GPUs than your code can use, many allocated resources stay idle. The scheduler cannot give them to others, so overall throughput and energy efficiency drop. If you ask for an extremely long wall time, the scheduler often holds your job back because long jobs are harder to fit into the schedule, so you may block smaller jobs and delay other users. Conversely, if you underestimate wall time and your job times out and is killed, all the work it did is lost. You then need to resubmit, which repeats the same computation and energy use.
Ethical job sizing aims for realistic, slightly conservative requests: enough to finish reliably, not so much that you block or waste shared resources.
Ethical rule: Always base resource requests on measured needs, not on fear or convenience. Oversizing wastes energy and harms other users. Undersizing leads to failed runs that must be repeated, wasting cycles and power.
Practical Strategies for Right-Sizing Jobs
Right-sizing starts with exploratory runs. Before launching a very large production job, run your code on a small problem size and a few resource configurations. Measure run time for different numbers of cores, nodes, or GPUs. Observe memory usage and I/O patterns. This does not require deep performance expertise; you only need to identify an approximate sweet spot and avoid clearly wasteful choices.
A practical approach is to fix the problem size and test a sequence of core counts, for example 1, 2, 4, 8, and 16 cores. Record the run time for each. You can define approximate speedup as $S(p) = T(1) / T(p)$, where $T(p)$ is the run time on $p$ cores. Efficiency is $E(p) = S(p) / p$. When adding more cores yields little additional speedup and efficiency drops sharply, further increases in core count are usually not a good use of resources. The same idea applies to GPUs or nodes.
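To make this concrete, here is a minimal Python sketch that turns a handful of measured run times into speedup and efficiency figures. The timings in the dictionary are invented placeholders; substitute your own measurements.

```python
# Sketch: compute speedup S(p) and parallel efficiency E(p) from measured run times.
# The timings below are placeholders; replace them with your own measurements.

# Measured wall-clock times (seconds) for a fixed problem size, keyed by core count.
measured_times = {1: 1800.0, 2: 950.0, 4: 510.0, 8: 310.0, 16: 260.0}

t1 = measured_times[1]  # baseline: single-core run time
for cores, t in sorted(measured_times.items()):
    speedup = t1 / t              # S(p) = T(1) / T(p)
    efficiency = speedup / cores  # E(p) = S(p) / p
    print(f"{cores:>3} cores: time {t:7.1f} s, speedup {speedup:5.2f}, efficiency {efficiency:5.2f}")
```

In this made-up example, efficiency falls from roughly 0.73 at 8 cores to about 0.43 at 16 cores, which would suggest that 8 cores is a reasonable stopping point for production runs.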
For memory, start by running with moderate input sizes and monitor memory usage with tools provided on the system, for example with scheduler reports or basic tools on login or compute nodes, where allowed. If a test run uses 20 GB on a node with 64 GB, it is reasonable to request something like 32 GB per node for production runs of similar size. Requesting the entire 64 GB for safety, without justification, reduces flexibility for the scheduler and may block other users.
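A minimal sketch of that rule of thumb follows, assuming you can read a measured peak from a scheduler report or similar tool; the 1.5x safety factor and the 64 GB node size are placeholder assumptions to adjust for your system.

```python
import math

def suggest_memory_request(measured_peak_gb, safety_factor=1.5, node_memory_gb=64):
    """Suggest a memory request: measured peak times a safety factor,
    rounded up to a whole GB and capped at the node's physical memory.
    The safety factor and node size are assumptions; adjust for your system."""
    request = math.ceil(measured_peak_gb * safety_factor)
    return min(request, node_memory_gb)

# A test run that peaked at 20 GB leads to a request of about 30 GB,
# not the full 64 GB node.
print(suggest_memory_request(20))  # -> 30
```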
Wall time estimates can be scaled from test runs. If a small test problem of size $N_{\text{test}}$ takes time $T_{\text{test}}$, and you know your algorithm scales roughly as some power of $N$, you can estimate the time for a larger problem. Even without a precise model, you can extrapolate from observed ratios as a first approximation, for example if you have seen that doubling the resolution roughly quadruples the run time. Then request perhaps 20 to 50 percent more time than this estimate; the buffer protects you from minor variations without being excessively large.
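The following sketch shows one way to encode this estimate; the scaling exponent, the 30 percent buffer, and the example numbers are all assumptions to replace with values from your own tests.

```python
def estimate_walltime_hours(t_test_hours, n_test, n_prod, scaling_exponent=2.0, buffer=0.3):
    """Extrapolate production wall time from a test run.

    Assumes run time scales roughly as (n_prod / n_test) ** scaling_exponent;
    the exponent and the 30 percent buffer are placeholders to refine with your own data."""
    t_estimate = t_test_hours * (n_prod / n_test) ** scaling_exponent
    return t_estimate * (1.0 + buffer)

# Example: a 0.5-hour test at resolution 512, production at resolution 1024,
# assuming roughly quadratic scaling in resolution.
print(f"Request about {estimate_walltime_hours(0.5, 512, 1024):.1f} hours")  # ~2.6 hours
```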
Ethical job sizing is iterative. As you gain more data from real runs, adjust your future requests. Keep simple notes or a log of previous runs and their performance. This habit is lightweight yet powerful for both efficiency and fairness.
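Such a log needs no special tooling. A sketch like the following, with hypothetical field names and file location, captures enough to guide the next sizing decision.

```python
import csv, datetime, os

LOG_FILE = "job_log.csv"  # hypothetical location for your personal run log

def log_run(job_id, cores, mem_gb_requested, mem_gb_peak, hours_requested, hours_used, outcome):
    """Append one line per job to a simple CSV log.
    All field names here are illustrative; record whatever helps you size the next run."""
    new_file = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "job_id", "cores", "mem_req_gb", "mem_peak_gb",
                             "hours_req", "hours_used", "outcome"])
        writer.writerow([datetime.date.today().isoformat(), job_id, cores,
                         mem_gb_requested, mem_gb_peak, hours_requested, hours_used, outcome])

log_run("123456", 16, 32, 21.5, 4.0, 2.7, "completed")
```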
Practical guideline: Use short test runs to estimate CPU, memory, and wall time. Choose the smallest resource set that achieves acceptable time to solution, plus a modest safety margin.
Fair-Share Scheduling and Its Ethical Dimension
Many HPC centers use some form of fair-share policy in their schedulers. Fair-share tries to balance access to resources so that no single user or project dominates the cluster for long periods. The scheduler tracks the recent or historical usage of each account or user and adjusts job priority accordingly. If you have used a lot of CPU hours recently, your new jobs often receive a lower priority than jobs from users who have used less, so that jobs from those users get a chance to run sooner.
Fair-share usually depends on a share tree or weighting configuration defined by the center’s policies. Some projects or groups may have larger allocated shares based on funding or agreements, but within those shares the scheduler still aims for relative fairness using history. This mechanism is an attempt to operationalize the ethical principle that shared resources should benefit many users, not only those who submit the most or the largest jobs.
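As a toy illustration of the idea, not any center’s actual configuration, the sketch below mimics the exponential shape used by some classic fair-share implementations: using exactly your allocated share yields a factor of 0.5, using more pushes it toward 0, and using less pushes it toward 1. Real schedulers add usage decay, share trees, and site-specific weights.

```python
def toy_fairshare_factor(normalized_usage, normalized_share):
    """Toy fair-share factor between 0 and 1.

    Loosely modeled on the exponential form used by some schedulers:
    using exactly your share gives 0.5, using less pushes the factor toward 1,
    using more pushes it toward 0. Real systems add decay, share trees, and weights."""
    return 2.0 ** (-normalized_usage / normalized_share)

# A user entitled to 10 percent of the machine who recently consumed 20 percent of it:
print(f"fair-share factor: {toy_fairshare_factor(0.20, 0.10):.2f}")  # -> 0.25
```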
Your behavior directly influences how fair-share treats you and others. If you constantly oversize your jobs and keep many large jobs queued, you consume a disproportionate amount of capacity when your priority is high, and suffer longer waits when it drops. Meanwhile, other users see more fluctuations and unpredictability. If you right-size your jobs and schedule them thoughtfully, your usage pattern is smoother and the fair-share system can keep the cluster in a more stable and predictable state for everyone.
Fair-share is not only a technical feature, but a social contract. By using resources responsibly, you help the policy achieve its goal. Ignoring fair-share and attempting to game the system for personal benefit contradicts the spirit of ethical HPC use.
Ethical rule: Treat fair-share as a shared agreement, not an obstacle. Your resource usage history affects others. Avoid behaviors that try to bypass or exploit fair-share.
Behaviors That Undermine Fair-Share
Several common user habits conflict with fair-share and ethical usage. One such behavior is “hogging”: submitting a large number of very big jobs at once without considering whether they are all urgently needed. This can temporarily flood the queue, forcing other users to wait longer until fair-share reduces your priority. Even if the scheduler ultimately corrects this, during the adjustment other users may experience severe delays and the system may not be used in the most energy-efficient way.
Another problematic pattern is artificially slicing a single workflow into many tiny jobs to gain faster starts or higher apparent priority. For example, some users split a long run into dozens of shorter jobs with slightly different parameters to work around fair-share policies, or submit many trivial single-core jobs to bypass queues designed for large parallel jobs. This behavior can overload the scheduler with per-job overhead and reduce throughput. It also undermines the intended priority balance.
A third example is abusing interactive or debug queues. These queues are usually intended for short, small-scale tests, but some users try to run production workloads there because they start faster. This blocks interactive testing for others and violates policies designed for fairness and responsiveness.
The root of these patterns is often misunderstanding rather than malice. Still, from an ethical perspective, impact matters more than intention. Responsible use requires awareness of how your submissions appear to the scheduler, and how they influence others’ experience.
Fair-share anti-patterns:
- Flooding the queue with many large jobs that are not all urgent.
- Splitting production workloads into many small jobs solely to exploit scheduling behavior.
- Using special queues intended for testing or interactivity as production queues.
Aligning Job Sizing With Fair-Share Policies
Job sizing and fair-share interact in subtle ways. Right-sizing your jobs often leads to better fair-share outcomes for both you and others. Smaller, well-sized jobs are easier for the scheduler to place in gaps, which increases cluster utilization. This means the system can run more jobs overall without increasing hardware or energy consumption, which supports sustainability goals.
One helpful practice is to diversify your job sizes when you have flexibility. If your workflow allows you to run some analyses with fewer nodes, you can create a mix of jobs with different sizes and durations. This gives the scheduler more options to keep nodes busy. When the scheduler can slot smaller jobs into fragmented free spaces, it avoids leaving nodes idle while waiting for a large block for a big job. Ethically, this is a way to cooperate with the scheduling system, rather than always demanding the largest possible contiguous allocation.
It is also important to match your job sizes to the allocation policies of your system. Some centers prefer that users consolidate small, similar tasks into single multi-task jobs, for example using job arrays, which reduce scheduling overhead and make fair-share accounting clearer. Others prefer fewer but larger jobs to minimize overhead. Reading and following site-specific guidelines is part of ethical usage, because these rules are designed to align user behavior with fair-share and efficiency objectives.
Finally, remember that fair-share is usually based on resource consumption, not just job count. If you right-size the number of cores, nodes, and wall time, your recorded usage better reflects what you truly needed. Overestimating by large factors inflates your fair-share footprint, which may lower your future priority and at the same time distort the balance between users.
Practical guideline: Choose job sizes and structures that fit well with your center’s scheduling policies. Help the scheduler keep nodes busy with realistic and flexible requests.
Job Arrays, Checkpointing, and Ethical Throughput
Many workloads in HPC consist of many similar tasks that can run independently. Examples include parameter sweeps, ensemble simulations, and repeated runs with different random seeds. For such workloads, schedulers often provide job arrays. A job array lets you submit many related tasks with a single job script. Each array element uses the same resource request but a different index, which your program can read to choose input or parameters.
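As an illustration, assuming a Slurm-style scheduler that exposes the element index in the SLURM_ARRAY_TASK_ID environment variable, an array element might select its own input like this; the parameter list and the run_simulation function are placeholders.

```python
import os

# Sketch: one array element selects its own input, assuming a Slurm-style scheduler
# that provides the element index as SLURM_ARRAY_TASK_ID.
# The parameter list and run_simulation are placeholders for your own workload.

parameters = [0.1, 0.2, 0.5, 1.0, 2.0]  # one entry per array element

task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
my_parameter = parameters[task_id]

def run_simulation(value):
    # Stand-in for the real computation.
    print(f"Array task {task_id}: running with parameter {value}")

run_simulation(my_parameter)
```

The matching submission would then use the scheduler’s array option, for example sbatch --array=0-4 on Slurm, so that all five elements share one job script and one resource request.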
From an ethical and fair-share perspective, job arrays are preferable to manually submitting hundreds of nearly identical jobs, because they lower scheduler overhead and make accounting clearer. Most fair-share implementations correctly handle arrays as collections of tasks from one user, so you do not gain unfair advantage, but you also avoid burdening the system with excessive per-job management.
Checkpointing, that is, saving intermediate state so that a job can restart after an interruption, also plays a role in ethical job sizing. If your application supports checkpointing, you can request shorter wall times without risking complete data loss. Where policies allow it, the scheduler can then preempt or requeue long jobs in favor of higher-priority work with less wasted energy and time. This improves overall throughput and fairness, because long jobs are less likely to monopolize resources continuously.
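A minimal checkpoint-and-restart pattern might look like the sketch below; the file name, the state contents, and the checkpoint interval are assumptions to adapt to your application.

```python
import json, os

CHECKPOINT = "state.json"  # hypothetical checkpoint file

def load_state():
    """Resume from the last checkpoint if one exists, otherwise start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "result": 0.0}

def save_state(state):
    """Write the checkpoint atomically so an interrupted write cannot corrupt it."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_state()
for step in range(state["step"], 1000):
    state["result"] += step * 0.001   # stand-in for one unit of real work
    state["step"] = step + 1
    if state["step"] % 100 == 0:      # checkpoint every 100 steps
        save_state(state)
save_state(state)
```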
Using arrays and checkpointing wisely lets you transform a rigid, monolithic workload into more flexible units. This flexibility is valuable for the scheduler and for other users. It aligns your work with the dynamic nature of shared clusters, which is a practical expression of ethical cooperation.
Ethical rule: When possible, use job arrays and checkpointing to increase flexibility. Flexible workloads are easier to schedule fairly and efficiently.
Balancing Personal Deadlines With Community Fairness
Researchers and engineers often face tight deadlines, such as conference submissions, project milestones, or production schedules. It is tempting to treat the cluster as an infinite resource to be used aggressively to meet these deadlines. Ethical HPC use acknowledges these pressures but insists on balance with community fairness.
One aspect of this balance is communication. If you anticipate a period of intense usage, it can be appropriate to inform system administrators or your group’s allocation managers. Some centers provide mechanisms for “burst” usage, special reservations, or negotiated priority adjustments. Using such official channels is much more ethical than quietly overloading the queue or violating recommended job sizing practices.
Another aspect is prioritizing which jobs truly need the largest resources and longest wall times. Not every exploratory analysis requires maximum scale. By reserving the biggest and most demanding jobs for genuinely critical tasks, and conducting other work at more modest scales, you reduce the strain on the system. This selective approach helps maintain a fair environment even during busy periods.
Finally, fairness includes being transparent within your own team or group. When multiple people share the same allocation or account, job sizing decisions by one member affect everyone’s fair-share reputation and access. Discussing submission strategies and establishing internal guidelines is part of responsible collective behavior.
Ethical guideline: Urgent personal deadlines do not justify ignoring fair-share or responsible job sizing. Use official mechanisms and internal coordination instead of unilateral overuse.
Monitoring Your Impact and Adjusting Behavior
To use a fair-share system ethically, you need some visibility into your own impact. Many centers provide commands or web interfaces that show your recent usage, fair-share factor, and job statistics. Reviewing these regularly helps you notice patterns such as sustained very high usage, frequent job failures due to timeouts, or long queues caused by your submissions.
From these observations, you can adjust your job sizing practices. If many of your jobs finish with large unused wall time, reduce the requested time. If you see that you routinely request more memory than your application uses, decrease your memory requests to free capacity for others. If your fair-share priority is often very low, consider spacing out large runs or combining smaller tasks more efficiently, so that your usage curve is smoother.
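A small self-audit script can make these patterns visible. The job records below are invented placeholders; in practice you would export them from your center’s accounting or monitoring tools.

```python
# Sketch: spot oversized wall-time requests and timeouts in your own job history.
# The records below are placeholders; export real ones from your center's tools.

past_jobs = [
    {"id": "1001", "hours_requested": 24.0, "hours_used": 6.5,  "state": "COMPLETED"},
    {"id": "1002", "hours_requested": 24.0, "hours_used": 24.0, "state": "TIMEOUT"},
    {"id": "1003", "hours_requested": 12.0, "hours_used": 5.9,  "state": "COMPLETED"},
]

completed = [j for j in past_jobs if j["state"] == "COMPLETED"]
timeouts = [j for j in past_jobs if j["state"] == "TIMEOUT"]

if completed:
    avg_used = sum(j["hours_used"] / j["hours_requested"] for j in completed) / len(completed)
    print(f"Completed jobs used on average {avg_used:.0%} of requested wall time")
print(f"Jobs killed by the time limit: {len(timeouts)}")
```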
You can also perform informal self-audits. For a given month, ask: How many of my jobs failed due to resource limits? How many finished much earlier than requested? How often did I submit more than I realistically needed? Treat honest answers as input for improvement, not as blame. Ethical computing is an evolving practice, and feedback from your own usage is the most direct teacher available.
Over time, small corrections based on monitoring will significantly reduce wasted energy, improve your own throughput, and make the system more predictable for others. This incremental refinement is fully compatible with practical research and development, and it expresses ethics through everyday technical decisions.
Practical rule: Periodically examine your job history. Use evidence from past runs to refine future job sizes and submission patterns.
Connecting Job Sizing and Fair-Share to Sustainability
Job sizing and fair-share usage are not only about social fairness. They are also powerful levers for environmental sustainability. Every unnecessary core-hour or GPU-hour burned by oversized jobs translates into extra electricity consumption and associated emissions. Every failed job that must be rerun compounds that waste.
Schedulers that implement fair-share aim to keep the cluster as busy as possible with meaningful work. Well-sized jobs that are efficiently scheduled allow the system to reach high utilization without expanding hardware. Higher effective utilization per watt means more science and engineering output for the same environmental footprint. By cooperating with fair-share through ethical job sizing, you amplify this effect.
Viewed this way, each job submission is a small environmental decision. Choosing realistic resources, using arrays and checkpointing where appropriate, and respecting policies that distribute access fairly are all ways to align your computational work with broader sustainability goals. For an HPC beginner, adopting such habits early establishes a professional identity that integrates technical skill with ethical responsibility.
Core principle: Ethical, sustainable HPC use starts with each job you size and submit. Fair resource requests, aligned with fair-share policies, reduce waste and support both community and planet.