Using HPC Resources Efficiently
Efficient resource usage is about getting the most useful work done per unit of time, energy, and shared capacity. In an HPC environment, every inefficient choice you make affects not only your own results, but also energy consumption and the waiting time of other users. This chapter focuses on practical habits and simple quantitative reasoning that help you use CPU time, memory, accelerators, and storage responsibly and effectively.
Efficient usage is not the same as “use as little as possible.” It means “use what you need, no more and no less, and use it well.”
Efficient resource usage means:
- Request only the resources you can actually use.
- Keep resources busy with useful work, not idle time.
- Match your job’s configuration to its scaling behavior.
- Avoid unnecessary data movement and storage usage.
- Prefer energy-efficient configurations when performance is similar.
Matching Requests to Actual Needs
On a shared system, the scheduler gives your job exclusive access to the resources you request. If you ask for more resources than you can use, you are blocking others from using them while providing no benefit to yourself. Efficient usage starts with realistic estimates.
When preparing a job script, you usually specify:
- Number of nodes or tasks or GPUs.
- Number of cores per task or per node.
- Memory per task or per node.
- Walltime (maximum runtime).
These parameters control how much of the machine you occupy and for how long.
Estimating CPU and Core Requirements
Efficient CPU usage means running with a number of cores that your code can actually use in parallel.
A useful way to reason about this is through speedup. If $T(p)$ is the runtime with $p$ cores and $T(1)$ is the runtime with one core, the speedup is
$$
S(p) = \frac{T(1)}{T(p)}.
$$
The parallel efficiency is
$$
E(p) = \frac{S(p)}{p}.
$$
If $E(p)$ is high, for example $E(p) \ge 0.7$, you are using the cores efficiently. If $E(p)$ becomes very low, for example $E(p) \le 0.3$, adding more cores mostly wastes resources.
In practice, you rarely compute $E(p)$ formally for every job, but you can:
- Run small test jobs at different core counts.
- Note how runtime changes.
- Select a core count where extra cores still give clear benefits.
If doubling cores only saves a few minutes on a long run but doubles your allocation usage, it may be inefficient. Sometimes it is reasonable to accept lower efficiency to meet a hard deadline, but you should make that choice consciously, not accidentally.
Rule of thumb:
Do short scaling tests. Do not blindly use the maximum number of cores or nodes available. Choose a configuration where both runtime and parallel efficiency are reasonable.
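The scaling-test reasoning above is easy to apply to your own timing data. The following sketch computes $S(p)$ and $E(p)$ from a few test runs; the runtimes are hypothetical measurements, not real data:

```python
# Estimate speedup and parallel efficiency from short scaling tests.
# The timings below are hypothetical measurements for illustration.

def parallel_efficiency(t1, tp, p):
    """Speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p."""
    s = t1 / tp
    return s, s / p

# Hypothetical runtimes (seconds) from small test jobs.
timings = {1: 1000.0, 8: 140.0, 16: 85.0, 32: 60.0}

for p, tp in sorted(timings.items()):
    s, e = parallel_efficiency(timings[1], tp, p)
    print(f"p={p:3d}  S={s:5.1f}  E={e:.2f}")
```

With these numbers, 8 cores still run at high efficiency, while 32 cores fall near the point where extra cores mostly waste resources.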
Estimating Memory Requirements
Requesting too little memory risks job failure. Requesting far too much ties up memory that could serve other jobs and may force the scheduler to place you on larger or more power-hungry nodes.
Efficient memory usage involves:
- Measuring or checking memory use. Use tools provided on your system or scheduler accounting to see how much memory your previous jobs actually consumed, including peak usage.
- Adding a safety margin, not a huge cushion. If a test job uses 12 GB per task, you might reasonably request 16 GB per task, not 64 GB. The exact margin depends on how variable your problem size is.
- Choosing between per-node and per-task requests. If your code uses memory per process, request memory per task. If it uses a single large shared-memory region per node, request memory per node.
If your code’s memory usage grows with the problem size, consider simple models such as
$$
M(N) \approx a N + b,
$$
where $M$ is memory, $N$ is an input size parameter, and $a$ and $b$ are constants estimated from small tests. This can guide safe and efficient memory requests for larger production runs.
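The constants $a$ and $b$ can be estimated from two small test runs and then used to extrapolate. A minimal sketch, with hypothetical measurements and an assumed 30 percent safety margin:

```python
# Fit the linear memory model M(N) = a*N + b from two small test runs
# and extrapolate to a production size. All numbers are hypothetical.

def fit_linear(n1, m1, n2, m2):
    """Solve for a and b in M(N) = a*N + b from two measurements."""
    a = (m2 - m1) / (n2 - n1)
    b = m1 - a * n1
    return a, b

# Hypothetical peak memory (GB) measured at two input sizes.
a, b = fit_linear(1_000_000, 4.0, 2_000_000, 7.0)

# Predicted memory for a larger production run, plus a ~30% margin.
predicted = a * 10_000_000 + b
request = 1.3 * predicted
print(f"predicted = {predicted:.1f} GB, request = {request:.1f} GB")
```

More than two measurements allow a least-squares fit, but for a rough, safe request this two-point estimate is often enough.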
Choosing an Appropriate Walltime
Walltime is the maximum time your job is allowed to run. If your job finishes early, the remaining walltime is simply unused and wasted from a scheduling perspective. Overly long walltime requests make scheduling harder and may increase your queue time.
To choose efficient walltime:
- Time small or medium runs and estimate how runtime scales with input size or core count.
- Add a realistic safety margin, for example 20 to 50 percent, depending on variability.
- Avoid “infinite” walltimes like several days if your job usually finishes in a few hours.
Many schedulers prioritize shorter jobs because they fragment the schedule less. Efficient walltime requests often lead to faster turnaround for you and better cluster utilization overall.
Rule of thumb:
Estimate runtime from test runs, then request walltime = estimated runtime + a modest safety margin, not a multiple of it.
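This rule of thumb can be turned into a small helper. The 30 percent default margin here is an assumption; adjust it to your job's variability:

```python
# Turn a runtime estimate into a walltime request with a modest margin.
# The default 30% margin is an assumed value, not a universal rule.

def walltime_request(estimated_seconds, margin=0.3):
    """Estimated runtime plus a safety margin, as an HH:MM:SS string."""
    total = int(estimated_seconds * (1 + margin))
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

print(walltime_request(2 * 3600))  # 2 h estimate -> 02:36:00
```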
Keeping Allocated Resources Busy
Once resources are assigned to your job, the most efficient use is to keep them doing useful work. Idle or underutilized resources waste both time and energy.
CPU Utilization and Load Balancing
High CPU utilization means that most cores spend most of their time in user computations, not sleeping or spinning.
You can waste CPU cycles if:
- Some processes or threads finish early and wait for others.
- Your domain decomposition is poorly balanced across MPI ranks.
- You use fewer threads than cores you requested.
- Your program spends significant time in serial sections.
You do not need to redesign algorithms in this chapter, but you can adopt good habits:
- Ensure that the number of OpenMP threads matches the cores you requested, using environment variables and job script configuration.
- For MPI jobs, match `--ntasks` and `--cpus-per-task` to the parallel model the code expects.
- For parameter sweeps, spread work evenly among tasks, instead of overloading one task with many runs while others do little.
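One way to keep thread counts and allocated cores in sync is to derive the thread count from the scheduler's environment instead of hard-coding it. A sketch assuming Slurm, which exports `SLURM_CPUS_PER_TASK`; the fallback of 1 is an assumption for non-Slurm environments:

```python
# Sketch: make the OpenMP thread count follow the cores the scheduler
# actually granted, so threads and requested cores cannot drift apart.
import os

def configure_threads(env):
    """Set OMP_NUM_THREADS from the job's allocation; return the count."""
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    env["OMP_NUM_THREADS"] = str(cpus)
    return cpus

# In a real job script this would operate on os.environ:
# configure_threads(os.environ)
print(configure_threads({"SLURM_CPUS_PER_TASK": "8"}))  # -> 8
```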
Schedulers and job accounting can often report CPU utilization per job. If you see that average CPU usage is, for example, 20 percent for a multi-hour job, this is a sign of inefficiency that you should investigate.
Avoiding Idle Time Within Jobs
Idle time can come from non-computation sources that you can control as a user:
- Long waits for interactive input during batch jobs. Avoid anything that requires you to respond during a batch run.
- Inefficient I/O patterns that cause processes to stall while waiting for data.
- Overly frequent checkpoints that spend a large fraction of time writing to disk.
An efficient strategy is to:
- Design your workflow so that batch jobs run without human interaction.
- Use I/O settings that are known to be reasonable for your filesystem and problem size.
- Choose checkpoint intervals that balance safety and overhead. If a job spends a large fraction of its time on checkpointing, that can be very wasteful.
Choosing Efficient Job Configurations
Your decisions about nodes, cores, GPUs, and job layout have a strong impact on both performance and resource efficiency.
Filling Nodes Effectively
Most clusters allocate resources in units of nodes or partial nodes. If you request a full node, but only use a fraction of its cores or memory, the unused capacity remains idle for the duration of your job.
Efficient node usage includes:
- Requesting a full node only if your job can take advantage of most of its cores or memory.
- For smaller jobs, using node sharing if the system allows it, by requesting only as many cores and memory as needed.
- Aligning the number of MPI ranks and threads with the node’s hardware layout, so all cores are used and none are oversubscribed.
If a node has 64 cores and you request 64, then only run 8 active threads, you are wasting 56 cores. If instead you truly need only 8 threads, request a smaller share of the node if the cluster policy supports that.
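A quick arithmetic check catches both idle cores and oversubscription before you submit. The node size and layouts below are illustrative:

```python
# Sanity check that an MPI-ranks x threads layout fills a node
# without oversubscribing it. Numbers are illustrative.

def check_layout(cores_per_node, ranks_per_node, threads_per_rank):
    """Classify a layout as fully used, partly idle, or oversubscribed."""
    used = ranks_per_node * threads_per_rank
    if used > cores_per_node:
        return "oversubscribed"
    if used < cores_per_node:
        return f"{cores_per_node - used} cores idle"
    return "fully used"

print(check_layout(64, 8, 8))   # fully used
print(check_layout(64, 8, 1))   # 56 cores idle
```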
Reasonable Scaling Choices
Efficiency and scaling are closely connected. If a code shows poor scaling at a certain size, running at that size often wastes resources. Rather than always using the largest job possible, you can:
- Run several smaller, more efficient jobs instead of one very large and inefficient one, if your workflow allows it.
- For parameter studies, spread independent tasks across many nodes at modest core counts per task, instead of trying to parallelize one task to a huge scale where efficiency collapses.
This is particularly important in systems with long queues and strong resource contention. A large, inefficient job not only wastes energy, it can also delay many smaller, efficient jobs that could have run in the same time.
Efficient Use of Accelerators and Special Hardware
GPUs and other accelerators consume substantial power. If you request them, you should be sure that your application will actually benefit.
Requesting GPUs Responsibly
Many clusters provide GPU-specific partitions. If your code does not use GPUs, never submit to a GPU partition just because its queue is shorter or less busy. This blocks GPU resources for users with genuine accelerator workloads.
When you do use GPUs, consider:
- How many GPUs the application can use concurrently.
- Whether scaling from 1 to multiple GPUs yields a meaningful speedup.
- The balance between GPU tasks and CPU tasks, so that CPUs are not sitting idle for long periods waiting for GPUs or vice versa.
If an application only uses 1 GPU effectively, requesting 4 GPUs per node is wasteful. A short scaling test can reveal how many GPUs provide good efficiency.
GPU Utilization
Even if you have an application that supports GPUs, low GPU utilization is a sign of inefficiency. Causes include:
- Too little work per GPU.
- Excessive data transfers between host memory and GPU memory.
- Long periods where the GPU waits for CPU side work.
You can often monitor utilization with vendor tools. If you see a GPU mostly idle while the CPU is busy, you are not using the accelerator efficiently. Sometimes the best configuration is to run more GPU tasks, each with smaller work units, so that GPUs are kept busy without oversubscription.
Rule of thumb:
Use accelerators only when they provide significant speedup. Match the number of accelerators you request to the number your code can keep busy.
Storage, I/O, and Data Management Efficiency
Storage and I/O are shared resources that can become bottlenecks and energy sinks. Efficient usage reduces not only your own runtimes, but also avoids interfering with other users.
Avoiding Unnecessary Data and Files
Every file you create occupies space on a shared filesystem. Excessive data growth increases backup costs, slows down filesystem operations, and may lead to quota problems.
Practical habits include:
- Write only the data you truly need for analysis or reproducibility.
- Avoid extremely verbose logging, especially from large parallel jobs.
- Compress data when appropriate, particularly intermediate data that is large and not used frequently.
- Clean up temporary and intermediate files once they are no longer necessary.
In many workflows, the largest savings come from careful control of output frequency and resolution. For example, storing every time step of a simulation might be unnecessary if you only analyze results at coarser intervals.
Efficient I/O Patterns
While detailed parallel I/O strategies are covered elsewhere, a few simple principles improve resource efficiency:
- Coordinate I/O so that not every process writes its own large file when that is not required. Use collective or aggregated I/O patterns if available in your application.
- Avoid mixing many small writes to shared filesystems inside tight loops. Instead, buffer data in memory and write in larger blocks at appropriate intervals.
- Use the right filesystem for the job: scratch for temporary large data, home or project areas for important, curated results, according to your site guidelines.
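The buffering principle can be sketched in a few lines: accumulate records in memory and issue one larger write per block instead of one tiny write per record. The block size here is an arbitrary illustrative choice:

```python
# Sketch of buffered output: accumulate records in memory and flush
# them in larger blocks instead of issuing many tiny writes.
import io

def write_buffered(records, out, block_size=1000):
    """Write records in blocks of up to block_size lines per write call."""
    buf = []
    for r in records:
        buf.append(str(r))
        if len(buf) >= block_size:
            out.write("\n".join(buf) + "\n")
            buf.clear()
    if buf:  # flush the remainder
        out.write("\n".join(buf) + "\n")

out = io.StringIO()
write_buffered(range(2500), out, block_size=1000)
```

Here 2500 records reach the output in three write calls rather than 2500, which is exactly the kind of aggregation that keeps a shared filesystem responsive.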
Inefficient I/O can cause long stalls. From a resource perspective, this means you are tying up compute nodes merely to wait for disk operations to complete.
Throughput Oriented Efficiency
Some workloads consist of many independent or loosely coupled tasks, such as parameter scans, ensemble runs, or Monte Carlo simulations. In these cases, the goal is often to maximize total throughput rather than minimize single job runtime.
Job Arrays and Packing Small Tasks
If your tasks are small and similar, using job arrays or bundling multiple tasks into a single job often leads to more efficient scheduling:
- Fewer jobs to schedule means lower overhead for the scheduler.
- Packing many small tasks into one job decreases the number of times you pay startup costs.
- The system can keep nodes busy by running many independent tasks concurrently.
Efficient usage in this case means designing job scripts and workflows that expose task parallelism to the scheduler without overwhelming it with thousands of tiny individual job submissions.
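Packing tasks into chunks is a simple transformation of the task list; each chunk then becomes one submitted job (for example, one element of a job array). The chunk size below is an arbitrary choice:

```python
# Sketch: pack many small tasks into fewer, larger jobs so the scheduler
# handles a handful of submissions instead of thousands.

def pack_tasks(tasks, tasks_per_job):
    """Split a task list into chunks, one chunk per submitted job."""
    return [tasks[i:i + tasks_per_job]
            for i in range(0, len(tasks), tasks_per_job)]

jobs = pack_tasks(list(range(1000)), 100)
print(len(jobs))      # 10 jobs
print(len(jobs[0]))   # 100 tasks each
```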
Balancing Turnaround Time and Resource Use
Users often face a choice between:
- Running fewer, larger parallel jobs to get fast results per configuration.
- Running many smaller, more efficient jobs at moderate scale.
For example, if you need to run 100 configurations, you might run 10 configurations at a time with good efficiency rather than 100 at a time on an overly large allocation with poor scaling. The more efficient configuration may actually finish sooner overall and uses fewer resources for the same scientific outcome.
A simple way to think about this is:
If a configuration takes time $T$ on $p$ cores and you have $K$ configurations, the total core time is $K \cdot p \cdot T$. You can reduce this product by choosing a $p$ where $T$ does not shrink significantly when you add more cores. That reduces both resource consumption and load on the system.
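The product $K \cdot p \cdot T$ makes this trade-off concrete. A sketch with hypothetical scaling-test runtimes:

```python
# Compare total core time K * p * T for different per-configuration
# core counts. Timings are hypothetical scaling-test results.

def total_core_time(k, p, t):
    """Total core-seconds for k configurations run on p cores each."""
    return k * p * t

# Hypothetical runtime (seconds) per configuration at each core count.
runtimes = {16: 1000.0, 64: 400.0, 256: 300.0}

for p, t in sorted(runtimes.items()):
    core_s = total_core_time(100, p, t)
    print(f"p={p:4d}  T={t:6.0f}s  total = {core_s:,.0f} core-s")
```

In this example the 256-core runs barely beat the 64-core runs in time per configuration but consume several times the total core time, so the moderate core count is the efficient choice for the campaign.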
Energy Awareness and Green Choices
Efficient resource usage and sustainability are closely linked. Many of the practices described above reduce energy use automatically. You can also make some explicit energy aware choices.
Favoring Efficient Over Maximal Configurations
The simplest way to save energy is to stop using “maximum everything” as the default:
- Avoid using more cores per job than needed for acceptable runtime.
- Avoid requesting the maximum walltime.
- Avoid using specialized hardware that does not provide clear speed benefits.
If two configurations complete your work in similar walltime, but one uses fewer nodes or cores, prefer the one with lower total core time. The energy used is roughly proportional to total active compute time, although exact relationships depend on hardware and power management.
Suppose configuration A uses $p_A$ cores for time $T_A$, and configuration B uses $p_B$ cores for time $T_B$. If $p_A T_A > p_B T_B$ by a significant factor and runtimes are similar, configuration A is less energy efficient, even if it is not slower.
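As a concrete, hypothetical instance of this comparison:

```python
# Compare two configurations by the core-time proxy p * T.
# Both configurations finish in similar walltime; inputs are hypothetical.

def core_time(p, t):
    """Core-seconds consumed: cores times runtime."""
    return p * t

a = core_time(256, 3600)   # configuration A: many cores
b = core_time(64, 3900)    # configuration B: fewer cores, slightly slower
print(f"A/B core-time ratio: {a / b:.1f}")
```

Configuration A finishes only minutes earlier yet consumes several times the core time, so under the rough proportionality above it is the less energy-efficient choice.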
Scheduling Work in a Cluster Friendly Way
Some sites may encourage or require running jobs in off-peak windows or particular partitions. Running flexible, non-urgent jobs when the system is less loaded can:
- Improve overall utilization.
- Reduce the number of nodes that need to power up or down.
- Decrease queue times for users with urgent or interactive needs.
You can contribute to this by:
- Submitting large, low-urgency jobs with a flexible start time or lower priority if your scheduler supports it.
- Reserving urgent or interactive priority queues only for tasks that genuinely require them.
This type of cooperation improves the efficiency of the whole center and is part of ethical resource usage.
Monitoring and Improving Your Own Efficiency
You cannot improve what you do not measure. Most HPC systems provide some form of accounting or job statistics that you can access.
Learning from Job Accounting Data
After your jobs complete, check:
- CPU utilization: ratio of CPU time to walltime.
- Memory usage: peak memory compared to requested memory.
- GPU utilization: if available, average GPU activity.
- I/O statistics: total data read and written, if reported.
If you see patterns such as very low CPU utilization, memory usage far below requested, or GPUs mostly idle, treat this as feedback. Adjust your next job’s configuration accordingly.
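These checks reduce to a few ratios that you can compute from whatever numbers your accounting tools report. The input values below are hypothetical:

```python
# Sketch: simple efficiency ratios from job accounting numbers.
# The example values are hypothetical, not real accounting output.

def cpu_utilization(total_cpu_seconds, walltime_seconds, cores):
    """Fraction of allocated core time spent computing."""
    return total_cpu_seconds / (walltime_seconds * cores)

def memory_headroom(peak_gb, requested_gb):
    """Fraction of requested memory actually used at peak."""
    return peak_gb / requested_gb

print(f"CPU utilization: {cpu_utilization(20000, 3600, 16):.0%}")
print(f"Memory used:     {memory_headroom(12, 64):.0%}")
```

A multi-hour job showing 35 percent CPU utilization and 19 percent of its requested memory in use, as here, is a clear signal to shrink the next request.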
Self check:
For each job, ask:
- Did I request more cores, memory, or GPUs than I used?
- Did my job finish much earlier than the walltime limit?
- Were CPU and GPU utilization high for most of the runtime?
Use the answers to refine future submissions.
Iterative Refinement Rather Than One Perfect Guess
Efficient resource usage is an iterative process:
- Start with conservative but reasonable settings, based on small tests and documentation.
- Inspect runtime and usage reports.
- Adjust resources, layout, and job size.
- Repeat until your jobs run reliably, with good performance and without obvious waste.
This cycle reduces failures, limits waste, and over time leads to more predictable and efficient production runs.
Shared Responsibility and Good Citizenship
Efficient resource usage is not only a technical issue. It is also a matter of fairness to other users and to the institution paying for and powering the system.
Good resource citizenship includes:
- Respecting local usage policies and recommended practices.
- Avoiding speculative “just in case” massive jobs that you are not sure you need.
- Cleaning up after your projects complete, including data and unused environments.
- Communicating with support staff when you see unexpected behavior that may indicate inefficiency or misconfiguration.
By using resources effectively, you improve your own productivity and contribute to a more sustainable and fair HPC environment for everyone.