

Using HPC Resources Efficiently

Efficient resource usage is about getting the most useful work done per unit of time, energy, and shared capacity. In an HPC environment, every inefficient choice you make affects not only your own results, but also energy consumption and the waiting time of other users. This chapter focuses on practical habits and simple quantitative reasoning that help you use CPU time, memory, accelerators, and storage responsibly and effectively.

Efficient usage is not the same as “use as little as possible.” It means “use what you need, no more and no less, and use it well.”

Efficient resource usage means:

  1. Request only the resources you can actually use.
  2. Keep resources busy with useful work, not idle time.
  3. Match your job’s configuration to its scaling behavior.
  4. Avoid unnecessary data movement and storage usage.
  5. Prefer energy-efficient configurations when performance is similar.

Matching Requests to Actual Needs

On a shared system, the scheduler gives your job exclusive access to the resources you request. If you ask for more resources than you can use, you are blocking others from using them while providing no benefit to yourself. Efficient usage starts with realistic estimates.

When preparing a job script, you usually specify:

  1. The number of nodes, tasks, or cores.
  2. The memory per task or per node.
  3. The maximum walltime.
  4. Accelerators such as GPUs, if needed.

These parameters control how much of the machine you occupy and for how long.

Estimating CPU and Core Requirements

Efficient CPU usage means running with a number of cores that your code can actually use in parallel.

A useful way to reason about this is through speedup. If $T(p)$ is the runtime with $p$ cores and $T(1)$ is the runtime with one core, the speedup is

$$
S(p) = \frac{T(1)}{T(p)}.
$$

The parallel efficiency is

$$
E(p) = \frac{S(p)}{p}.
$$

If $E(p)$ is high, for example $E(p) \ge 0.7$, you are using the cores efficiently. If $E(p)$ becomes very low, for example $E(p) \le 0.3$, adding more cores mostly wastes resources.

In practice, you rarely compute $E(p)$ formally for every job, but you can:

  1. Run small test jobs at different core counts.
  2. Note how runtime changes.
  3. Select a core count where extra cores still give clear benefits.
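
The scaling-test reasoning above can be sketched in a few lines. The timings below are made-up illustration values, not measurements from any real code:

```python
def speedup(t1, tp):
    """Speedup S(p) = T(1) / T(p)."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Parallel efficiency E(p) = S(p) / p."""
    return speedup(t1, tp) / p

# Hypothetical timings from short scaling tests: cores -> runtime in seconds.
timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 170.0, 16: 130.0}

t1 = timings[1]
for p, tp in timings.items():
    print(f"p={p:2d}  S={speedup(t1, tp):5.2f}  E={efficiency(t1, tp, p):.2f}")
```

With these numbers, efficiency stays above 0.7 up to 8 cores but drops below 0.5 at 16, so 8 cores would be the reasonable production choice.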

If doubling cores only saves a few minutes on a long run but doubles your allocation usage, it may be inefficient. Sometimes it is reasonable to accept lower efficiency to meet a hard deadline, but you should make that choice consciously, not accidentally.

Rule of thumb:
Do short scaling tests. Do not blindly use the maximum number of cores or nodes available. Choose a configuration where both runtime and parallel efficiency are reasonable.

Estimating Memory Requirements

Requesting too little memory risks job failure. Requesting far too much ties up memory that could serve other jobs and may force the scheduler to place you on larger or more power-hungry nodes.

Efficient memory usage involves:

  1. Measuring or checking memory use.
    Use tools provided on your system or scheduler accounting to see how much memory your previous jobs actually consumed, including peak usage.
  2. Adding a safety margin, not a huge cushion.
    If a test job uses 12 GB per task, you might reasonably request 16 GB per task, not 64 GB. The exact margin depends on how variable your problem size is.
  3. Choosing between per-node and per-task requests.
    If your code uses memory per process, request memory per task. If it uses a single large shared memory region per node, request memory per node.

If your code’s memory usage grows with the problem size, consider simple models such as

$$
M(N) \approx a N + b,
$$

where $M$ is memory, $N$ is an input size parameter, and $a$ and $b$ are constants estimated from small tests. This can guide safe and efficient memory requests for larger production runs.
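
A minimal sketch of this approach, using a plain least-squares fit; the $(N, M)$ pairs below are made-up test measurements, not real data:

```python
def fit_linear(points):
    """Ordinary least-squares fit of y = a*x + b from (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Hypothetical small-test measurements: (input size N, peak memory in GB).
tests = [(1_000, 2.1), (2_000, 4.0), (4_000, 8.2)]
a, b = fit_linear(tests)

# Predict memory for a production-size run and add a modest safety margin.
N_production = 16_000
estimate = a * N_production + b
request = 1.25 * estimate  # ~25 percent cushion, not a huge one
print(f"estimated {estimate:.1f} GB, request about {request:.1f} GB")
```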

Choosing an Appropriate Walltime

Walltime is the maximum time your job is allowed to run. If your job finishes early, the remaining walltime is simply unused and wasted from a scheduling perspective. Overly long walltime requests make scheduling harder and may increase your queue time.

To choose efficient walltime:

  1. Time small or medium runs and estimate how runtime scales with input size or core count.
  2. Add a realistic safety margin, for example 20 to 50 percent, depending on variability.
  3. Avoid “infinite” walltimes like several days if your job usually finishes in a few hours.

Many schedulers prioritize shorter jobs because they fragment the schedule less. Efficient walltime requests often lead to faster turnaround for you and better cluster utilization overall.

Rule of thumb:
Estimate runtime from test runs, then request walltime = estimated runtime + a modest safety margin, not a multiple of it.
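
The rule of thumb above is easy to encode. This sketch turns an estimated runtime into a formatted walltime request with a 30 percent margin; the margin value is illustrative:

```python
def walltime_request(estimated_seconds, margin=0.3):
    """Estimated runtime plus a safety margin (30 percent by default),
    formatted as HH:MM:SS for a batch script."""
    total = int(estimated_seconds * (1 + margin))
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

# A run estimated at 2 hours becomes a request of 2 h 36 min, not "2 days".
print(walltime_request(2 * 3600))
```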

Keeping Allocated Resources Busy

Once resources are assigned to your job, the most efficient use is to keep them doing useful work. Idle or underutilized resources waste both time and energy.

CPU Utilization and Load Balancing

High CPU utilization means that most cores spend most of their time in user computations, not sleeping or spinning.

You can waste CPU cycles if some cores sit idle because of load imbalance, if threads spin while waiting for work, or if long serial sections leave most of the allocated cores unused.

You do not need to redesign algorithms in this chapter, but simple habits help: match the number of processes and threads to the number of cores you requested, avoid oversubscribing cores, and check utilization reports after your jobs finish.

Schedulers and job accounting can often report CPU utilization per job. If you see that average CPU usage is, for example, 20 percent for a multi-hour job, this is a sign of inefficiency that you should investigate.

Avoiding Idle Time Within Jobs

Idle time can come from non-computation sources that you can control as a user: jobs that wait for interactive input, slow or poorly configured I/O, and overly frequent checkpointing.

An efficient strategy is to:

  1. Design your workflow so that batch jobs run without human interaction.
  2. Use I/O settings that are known to be reasonable for your filesystem and problem size.
  3. Choose checkpoint intervals that balance safety and overhead. If a job spends a large fraction of its time on checkpointing, that can be very wasteful.
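
One common way to reason about checkpoint intervals is Young's approximation, which balances checkpoint cost against the expected cost of recomputation after a failure. This is an added illustration, not part of the chapter's own material, and the cost and failure-rate values are hypothetical:

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for a near-optimal checkpoint interval:
    sqrt(2 * checkpoint_cost * mean_time_between_failures)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

def overhead_fraction(checkpoint_cost_s, interval_s):
    """Fraction of job time spent writing checkpoints."""
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)

# Illustrative values: a checkpoint takes 60 s, failures roughly every 24 h.
interval = young_interval(60, 24 * 3600)
print(f"checkpoint every ~{interval / 60:.0f} min, "
      f"overhead {overhead_fraction(60, interval):.1%}")
```

With these numbers the overhead stays below 2 percent; checkpointing every few minutes instead would waste a much larger fraction of the allocation.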

Choosing Efficient Job Configurations

Your decisions about nodes, cores, GPUs, and job layout have a strong impact on both performance and resource efficiency.

Filling Nodes Effectively

Most clusters allocate resources in units of nodes or partial nodes. If you request a full node, but only use a fraction of its cores or memory, the unused capacity remains idle for the duration of your job.

Efficient node usage includes:

  1. Running as many processes or threads as the code can use on the cores you requested.
  2. Requesting a partial node when you need only a few cores and cluster policy allows it.
  3. Keeping your memory request in proportion to your core request, so neither sits idle.

If a node has 64 cores and you request 64, then only run 8 active threads, you are wasting 56 cores. If instead you truly need only 8 threads, request a smaller share of the node if the cluster policy supports that.

Reasonable Scaling Choices

Efficiency and scaling are closely connected. If a code shows poor scaling at a certain size, running at that size often wastes resources. Rather than always using the largest job possible, you can:

  1. Choose a job size where parallel efficiency is still reasonable.
  2. Run several medium-sized jobs instead of one very large, poorly scaling job.
  3. Reserve the largest runs for problems that genuinely need them.

This is particularly important in systems with long queues and strong resource contention. A large, inefficient job not only wastes energy, it can also delay many smaller, efficient jobs that could have run in the same time.

Efficient Use of Accelerators and Special Hardware

GPUs and other accelerators consume substantial power. If you request them, you should be sure that your application will actually benefit.

Requesting GPUs Responsibly

Many clusters provide GPU-specific partitions. If your code does not use GPUs, never submit to a GPU partition just because it is shorter or less busy. This blocks GPU resources for users with genuine accelerator workloads.

When you do use GPUs, consider:

  1. How many GPUs your application can actually keep busy.
  2. How many CPU cores and how much host memory each GPU needs to be fed with data.
  3. Whether a single GPU already delivers the speedup you need.

If an application only uses 1 GPU effectively, requesting 4 GPUs per node is wasteful. A short scaling test can reveal how many GPUs provide good efficiency.

GPU Utilization

Even if you have an application that supports GPUs, low GPU utilization is a sign of inefficiency. Common causes include CPU-side bottlenecks that starve the GPU, work units too small to fill the device, and frequent data transfers between host and device memory.

You can often monitor utilization with vendor tools. If you see a GPU mostly idle while the CPU is busy, you are not using the accelerator efficiently. Sometimes the best configuration is to run more GPU tasks, each with smaller work units, so that GPUs are kept busy without oversubscription.

Rule of thumb:
Use accelerators only when they provide significant speedup. Match the number of accelerators you request to the number your code can keep busy.

Storage, I/O, and Data Management Efficiency

Storage and I/O are shared resources that can become bottlenecks and energy sinks. Efficient usage reduces not only your own runtimes, but also avoids interfering with other users.

Avoiding Unnecessary Data and Files

Every file you create occupies space on a shared filesystem. Excessive data growth increases backup costs, slows down filesystem operations, and may lead to quota problems.

Practical habits include:

  1. Deleting temporary and intermediate files once they are no longer needed.
  2. Compressing or archiving results that you keep long term.
  3. Writing output only at the frequency and resolution your analysis requires.

In many workflows, the largest savings come from careful control of output frequency and resolution. For example, storing every time step of a simulation might be unnecessary if you only analyze results at coarser intervals.

Efficient I/O Patterns

While detailed parallel I/O strategies are covered elsewhere, a few simple principles improve resource efficiency: write fewer, larger files instead of many tiny ones; run active jobs on the fast scratch or work filesystem rather than in home directories; and buffer output instead of writing it line by line.

Inefficient I/O can cause long stalls. From a resource perspective, this means you are tying up compute nodes merely to wait for disk operations to complete.

Throughput Oriented Efficiency

Some workloads consist of many independent or loosely coupled tasks, such as parameter scans, ensemble runs, or Monte Carlo simulations. In these cases, the goal is often to maximize total throughput rather than minimize single job runtime.

Job Arrays and Packing Small Tasks

If your tasks are small and similar, using job arrays or bundling multiple tasks into a single job often leads to more efficient scheduling: the scheduler manages one array instead of thousands of individual submissions, short tasks can be backfilled into gaps in the schedule, and your own bookkeeping stays manageable.

Efficient usage in this case means designing job scripts and workflows that expose task parallelism to the scheduler without overwhelming it with thousands of tiny individual job submissions.
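
As a sketch of the packing idea, this groups many small tasks into a modest number of chunks, where each chunk would become one array element in a job script. The task names are hypothetical:

```python
def pack_tasks(tasks, chunk_size):
    """Split a task list into chunks; each chunk becomes one array element."""
    return [tasks[i:i + chunk_size] for i in range(0, len(tasks), chunk_size)]

# Hypothetical parameter-scan tasks.
tasks = [f"config_{i:04d}" for i in range(1000)]
chunks = pack_tasks(tasks, 50)  # 20 array elements of 50 tasks each

print(f"{len(tasks)} tasks -> {len(chunks)} array elements")
# Inside the job script, array element k would then loop over chunks[k],
# so the scheduler sees 20 entries instead of 1000.
```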

Balancing Turnaround Time and Resource Use

Users often face a choice between finishing a single job as quickly as possible on many resources and finishing many jobs at higher efficiency with fewer resources each.
For example, if you need to run 100 configurations, you might run 10 configurations at a time with good efficiency rather than 100 at a time on an overly large allocation with poor scaling. The more efficient configuration may actually finish sooner overall and uses fewer resources for the same scientific outcome.

A simple way to think about this is:

If a configuration takes time $T$ on $p$ cores and you have $K$ configurations, the total core time is $K \cdot p \cdot T$. You can reduce this product by choosing a $p$ where $T$ does not shrink significantly when you add more cores. That reduces both resource consumption and load on the system.
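
The core-time product $K \cdot p \cdot T$ is easy to tabulate from scaling tests. The runtimes below are made-up values showing typical diminishing returns:

```python
K = 100  # number of configurations to run

# cores -> runtime per configuration in hours (hypothetical measurements)
runtime = {8: 10.0, 16: 6.0, 32: 4.5, 64: 4.0}

for p, T in runtime.items():
    # Total core time K * p * T is what your allocation actually pays.
    print(f"p={p:3d}  T={T:4.1f} h  total = {K * p * T:8.0f} core-hours")
```

Here 64 cores per configuration is barely faster than 32 but more than triples the total core time compared with 8 cores, so a smaller per-job core count finishes the campaign with far less resource consumption.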

Energy Awareness and Green Choices

Efficient resource usage and sustainability are closely linked. Many of the practices described above reduce energy use automatically. You can also make some explicit energy aware choices.

Favoring Efficient Over Maximal Configurations

The simplest way to save energy is to stop using “maximum everything” as the default configuration.

If two configurations complete your work in similar walltime, but one uses fewer nodes or cores, prefer the one with lower total core time. The energy used is roughly proportional to total active compute time, although exact relationships depend on hardware and power management.

Suppose configuration A uses $p_A$ cores for time $T_A$, and configuration B uses $p_B$ cores for time $T_B$. If $p_A T_A > p_B T_B$ by a significant factor and runtimes are similar, configuration A is less energy efficient, even if it is not slower.

Scheduling Work in a Cluster Friendly Way

Some sites may encourage or require jobs in off-peak windows or particular partitions. Running flexible, non-urgent jobs when the system is less loaded can reduce queue congestion at busy times and smooth the overall load on the machine.

You can contribute to this by submitting non-urgent work with flexible walltimes, using low-priority or backfill partitions where they exist, and not insisting on an immediate start for jobs that can wait.

This type of cooperation improves the efficiency of the whole center and is part of ethical resource usage.

Monitoring and Improving Your Own Efficiency

You cannot improve what you do not measure. Most HPC systems provide some form of accounting or job statistics that you can access.

Learning from Job Accounting Data

After your jobs complete, check:

  1. How long the job ran compared with the requested walltime.
  2. Average and peak CPU utilization.
  3. Peak memory usage compared with the requested memory.
  4. GPU utilization, if you requested accelerators.

If you see patterns such as very low CPU utilization, memory usage far below requested, or GPUs mostly idle, treat this as feedback. Adjust your next job’s configuration accordingly.

Self check:
For each job, ask:

  1. Did I request more cores, memory, or GPUs than I used?
  2. Did my job finish much earlier than the walltime limit?
  3. Were CPU and GPU utilization high for most of the runtime?

Use the answers to refine future submissions.
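
These self-check questions reduce to a few ratios. The sketch below assumes you have already read the raw numbers (cores, elapsed time, consumed CPU time, memory, walltime request) from your scheduler's accounting output; all figures here are hypothetical:

```python
def job_report(cores, elapsed_h, cpu_time_h, mem_req_gb, mem_used_gb,
               walltime_req_h):
    """Utilization ratios from accounting numbers: CPU time over the
    core-hours held, memory used over requested, elapsed over walltime."""
    cpu_util = cpu_time_h / (cores * elapsed_h)
    mem_util = mem_used_gb / mem_req_gb
    walltime_util = elapsed_h / walltime_req_h
    return cpu_util, mem_util, walltime_util

# Hypothetical job: 32 cores held for 5 h, but only 48 core-hours of CPU
# time consumed, 18 of 64 GB used, finished 5 h into a 24 h request.
cpu, mem, wall = job_report(cores=32, elapsed_h=5.0, cpu_time_h=48.0,
                            mem_req_gb=64, mem_used_gb=18,
                            walltime_req_h=24.0)
print(f"CPU {cpu:.0%}, memory {mem:.0%}, walltime {wall:.0%} of request")
# 30 % CPU, 28 % memory, 21 % walltime: all three suggest the next
# submission should request fewer cores, less memory, and less time.
```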

Iterative Refinement Rather Than One Perfect Guess

Efficient resource usage is an iterative process:

  1. Start with conservative but reasonable settings, based on small tests and documentation.
  2. Inspect runtime and usage reports.
  3. Adjust resources, layout, and job size.
  4. Repeat until your jobs run reliably, with good performance and without obvious waste.

This cycle reduces failures, limits waste, and over time leads to more predictable and efficient production runs.

Shared Responsibility and Good Citizenship

Efficient resource usage is not only a technical issue. It is also a matter of fairness to other users and to the institution paying for and powering the system.

Good resource citizenship includes requesting only what you can use, keeping allocated resources busy, cleaning up storage you no longer need, and following the policies and recommendations of your computing center.

By using resources effectively, you improve your own productivity and contribute to a more sustainable and fair HPC environment for everyone.
