Introduction
Benchmarking is the practice of running controlled experiments on an application to quantify its performance. In high performance computing (HPC), benchmarking is not about getting the biggest number once, but about obtaining repeatable, comparable, and interpretable measurements that can guide design, optimization, and procurement decisions.
This chapter focuses on how to benchmark applications themselves, rather than hardware microbenchmarks or generic performance metrics. You will learn how to design a benchmarking campaign, what to measure, how to control experimental conditions, and how to interpret and report results in a way that is useful to you and to others.
Goals of Application Benchmarking
Benchmarking an application in HPC typically has four main goals.
First, you may want to understand current performance. You run the code on a realistic input and measure runtime, memory usage, and scaling. This gives you a baseline against which you can judge future changes.
Second, you may want to guide optimization work. A good benchmark suite reveals which parts of the code matter most, which input sizes are representative, and which hardware characteristics limit performance.
Third, you may need to compare configurations. Examples include comparing different compilers, libraries, or algorithmic settings, or evaluating how an application runs on two different clusters.
Fourth, you may have to communicate and document performance. This is important for papers, project reports, funding proposals, and software documentation. A well designed benchmark provides credible evidence that your code meets its performance requirements.
In all cases, benchmarking is an experimental activity. It requires a clear question, a controlled setup, repeated measurements, and careful analysis.
Choosing What to Measure
For application level benchmarking, you rarely care about a single raw number. Instead, you select metrics that connect to your goals. In this chapter, we only outline the most common application level metrics and how they fit into a benchmarking plan. Low level metrics and tools are discussed elsewhere.
The primary metric is usually wall clock runtime, measured from start to end of the relevant computation. From runtime you can derive several other metrics.
For codes that perform a known amount of work, you can compute throughput, such as elements per second, tasks per second, or time per iteration. These metrics help compare different input sizes and configurations that do different total work.
In numerical computing, you often want floating point performance. If you know or can estimate the number of floating point operations performed, $N_\text{flop}$, you can compute:
$$\text{Throughput (FLOP/s)} = \frac{N_\text{flop}}{T},$$
where $T$ is wall time. This can be converted to GFLOP/s or TFLOP/s for readability.
Memory and I/O related performance is also important. Given the number of bytes read or written, $B$, you can derive a bandwidth:
$$\text{Bandwidth} = \frac{B}{T}.$$
For parallel applications, you will often measure speedup and efficiency. Suppose $T_1$ is the runtime on one core or one process and $T_p$ is the runtime on $p$ cores or processes. Then
$$S_p = \frac{T_1}{T_p}$$
is the speedup and
$$E_p = \frac{S_p}{p} = \frac{T_1}{p T_p}$$
is the parallel efficiency. These quantities are central in strong and weak scaling benchmarks, which are treated in detail elsewhere, but they appear repeatedly in application benchmarking.
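To make these definitions concrete, the following minimal Python sketch computes the derived metrics from measured wall times; the function names and the example numbers are illustrative, not part of any particular application.

```python
def flops_per_second(n_flop: float, wall_time_s: float) -> float:
    """Throughput in FLOP/s, given an estimated operation count and wall time."""
    return n_flop / wall_time_s

def bandwidth_bytes_per_second(n_bytes: float, wall_time_s: float) -> float:
    """Effective bandwidth in bytes/s, given bytes moved and wall time."""
    return n_bytes / wall_time_s

def speedup(t_1: float, t_p: float) -> float:
    """Speedup S_p = T_1 / T_p."""
    return t_1 / t_p

def parallel_efficiency(t_1: float, t_p: float, p: int) -> float:
    """Parallel efficiency E_p = S_p / p."""
    return speedup(t_1, t_p) / p

# Example with made-up numbers: 2.5e12 FLOP executed in 40 s serially and 6 s on 8 cores.
print(f"{flops_per_second(2.5e12, 40.0) / 1e9:.1f} GFLOP/s")            # 62.5 GFLOP/s
print(f"S_8 = {speedup(40.0, 6.0):.2f}, E_8 = {parallel_efficiency(40.0, 6.0, 8):.2f}")
```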
Finally, you should always monitor resource usage that may limit performance: maximum memory footprint, file system usage, and sometimes energy consumption if your system provides reliable power measurements.
A useful benchmark always defines:
- Exactly what is being measured.
- Under which conditions it is measured.
- How the metric is computed from raw observations.
Defining Representative Workloads
An application benchmark is only meaningful if the workload resembles what the application does in real life. A common pitfall is to benchmark a trivial or artificial case that is convenient to run but does not exercise the important parts of the code or the system.
To define a representative workload, you begin from actual use cases. If users run your code to simulate weather, train a model, or solve a large sparse system, examine the typical domain size, problem parameters, and runtime. Then you derive benchmark cases that reflect these characteristics.
For each benchmark case, specify at least the input data, the configuration options or parameter files, and the target scale in terms of cores, memory, and job duration. Avoid using synthetic inputs that are easier but structurally different, such as random matrices instead of physical ones, unless you can justify that they behave similarly for performance purposes.
It is often useful to define more than one benchmark case. For instance, you might have a small case that runs in seconds for quick testing, a medium case for node level tuning, and a large case that reflects production scaling. The key is that each case has a clear purpose and is documented so that others can reproduce and interpret your results.
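One lightweight way to document such cases is a small machine readable description kept alongside the code. The sketch below is purely illustrative; the field names and values are hypothetical and not tied to any particular application or tool.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    """A hypothetical record of one benchmark case and its purpose."""
    name: str
    purpose: str
    input_file: str          # input data or parameter file
    parameters: dict         # configuration options that matter for performance
    nodes: int               # target scale
    expected_walltime: str   # rough job duration for scheduling

CASES = [
    BenchmarkCase("small",  "quick smoke test",   "inputs/small.cfg",  {"grid": 64},   1, "00:05:00"),
    BenchmarkCase("medium", "node level tuning",  "inputs/medium.cfg", {"grid": 256},  1, "01:00:00"),
    BenchmarkCase("large",  "production scaling", "inputs/large.cfg",  {"grid": 1024}, 8, "06:00:00"),
]
```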
If the application supports multiple algorithms or code paths, make sure your benchmark activates the paths used in production, or explicitly state which configurations are being evaluated. A benchmark that measures an unused algorithm can be misleading even if it is carefully executed.
Controlling the Experimental Environment
Benchmarking applications on shared clusters is tricky because the environment can change from run to run. If you want stable and comparable numbers, you must control what you can and document what you cannot.
First, fix the software environment. Use the same compiler, libraries, and environment modules for all runs in a benchmark series. Record version numbers and important build options. If you change one of these intentionally, such as comparing two compilers, treat this as a separate controlled parameter.
Second, pin down run time parameters that can influence performance. This includes thread counts, process placement, environment variables that affect OpenMP or MPI behavior, GPU usage, and any application level tuning knobs. Specify them explicitly in your job scripts and avoid relying on implicit defaults that may change.
Third, consider the placement of your job on the cluster. If the scheduler supports options to control node allocation or CPU binding, use them consistently. For node level benchmarks, request exclusive access to the node if possible, to avoid interference from other jobs.
Fourth, manage background variability. On a shared system you cannot fully eliminate interference from other jobs, but you can mitigate it by running each benchmark multiple times and using robust statistics to summarize results. If you see an outlier run that is much slower than the others, rerun the experiment and note the variability.
Finally, keep the environment constant between runs in a given campaign. If you change anything significant, such as loading a new module or modifying the code, treat it as a different experiment and label the results clearly.
Designing Benchmark Experiments
A benchmarking campaign is more than a single run. You design a set of experiments that explore the behavior of your application across relevant dimensions while keeping the total number of runs manageable.
The first dimension is problem size. You might run your application on several increasing input sizes to understand how runtime, memory usage, and performance scale. If the problem size has a natural parameter $N$, for instance grid points per dimension, you might pick values like $N, 2N, 4N$ and examine how runtime grows.
The second dimension is parallelism. You vary the number of nodes, cores, processes, or GPUs and observe speedup and efficiency. When you change parallel resources, you must decide whether to keep the problem size fixed or increase it with resources. These correspond to strong and weak scaling scenarios.
The third dimension is configuration choice. You may want to compare different algorithms, solvers, precision levels, or library backends. When you change such a parameter, try to hold other parameters fixed so that you can attribute performance differences to the change.
For each experiment, you should plan in advance which combinations of parameters you will run. A simple table that lists all test cases with their problem size, resource allocation, and configuration helps avoid confusion. This also prevents you from cherry picking a convenient subset of results after the fact.
It is not necessary to explore every possible combination. Instead, choose a minimal but informative set of points. For example, to study node level scaling you might run on 1, 2, 4, and 8 nodes for a fixed problem size rather than every intermediate value.
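As an illustration, such a plan can be written down before any job is submitted by enumerating the chosen combinations and storing them as a table; the parameter names and values below are hypothetical.

```python
import csv
from itertools import product

problem_sizes = [256, 512, 1024]   # e.g. grid points per dimension (illustrative)
node_counts = [1, 2, 4, 8]
repeats = 3                        # planned repetitions per combination

with open("experiment_plan.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["case_id", "problem_size", "nodes", "repeat"])
    case_id = 0
    for size, nodes, rep in product(problem_sizes, node_counts, range(1, repeats + 1)):
        case_id += 1
        writer.writerow([case_id, size, nodes, rep])
```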
Running Benchmarks Robustly
Once you have an experimental plan, you need to execute it in a way that yields reliable numbers.
Always use job scripts to run benchmarks on a cluster and keep them under version control together with the code. The script should include all relevant resource requests, environment modules, environment variables, and the exact command lines. This ensures that the benchmark is reproducible and that you can rerun it later.
For each planned experiment, run the benchmark multiple times, for example three to five repeats, and measure runtime for each. Do not rely on a single measurement, especially on a shared system. To reduce warmup effects, discard the first iteration of internal loops if relevant or include a warmup phase in the application itself.
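The sketch below illustrates the repetition idea for a command line application driven from Python; the executable and its arguments are placeholders, and on a real cluster the individual runs would typically be launched through the batch system and job scripts instead.

```python
import subprocess
import time

def time_command(cmd: list[str]) -> float:
    """Run one benchmark command and return its wall clock runtime in seconds."""
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    result.check_returncode()   # fail loudly if the run did not finish successfully
    return elapsed

# Placeholder command; replace with the real application and input.
cmd = ["./my_app", "--input", "inputs/medium.cfg"]

runtimes = [time_command(cmd) for _ in range(5)]   # e.g. five repeats
print("runtimes [s]:", [f"{t:.2f}" for t in runtimes])
```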
When measuring runtime, be explicit about what is timed. Avoid including setup or teardown unrelated to the computation, such as input generation or post processing, unless those costs are part of the user relevant notion of runtime. If the application prints its own timing information, ensure that it measures the part you care about and that you understand how it does so.
Make sure the application always finishes successfully and produces correct results for all benchmark runs. Performance numbers without verified correctness are not useful. If necessary, add quick consistency checks to the code that verify outputs for benchmark cases.
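Such a consistency check can be as simple as comparing a scalar summary of the output, for example a norm or checksum, against a stored reference value. The following sketch assumes a relative tolerance chosen purely for illustration.

```python
def check_result(value: float, reference: float, rel_tol: float = 1e-6) -> None:
    """Abort if a summary quantity of the output drifts from its reference value."""
    if abs(value - reference) > rel_tol * abs(reference):
        raise RuntimeError(f"verification failed: got {value}, expected {reference}")
```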
Store all raw logs, standard output, and error streams from the runs. Even if you only plan to publish summary statistics, having access to the raw data allows you to reanalyze results or investigate anomalies later.
Analyzing Runtime Data
After collecting runtime measurements, your first task is to summarize them in a way that reflects the underlying performance while accounting for variability.
Let $T_1, T_2, \dots, T_n$ be the runtimes of $n$ repeated runs of the same experiment. You can compute the sample mean
$$\bar{T} = \frac{1}{n}\sum_{i=1}^{n} T_i$$
and the sample standard deviation
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (T_i - \bar{T})^2}.$$
The mean gives a central value, while the standard deviation measures variability. For small $n$, however, the mean can be strongly influenced by a single outlier, so it is often better to also report the median, and sometimes the minimum as well.
Many practitioners in HPC use the minimum runtime across runs as a proxy for the least disturbed performance, especially on noisy shared systems. If you follow this practice, always state it clearly. You can report the minimum together with the median and range to give a fuller picture.
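A short sketch of such a summary, using only the Python standard library and made-up measurements:

```python
import statistics

def summarize(runtimes: list[float]) -> dict:
    """Summary statistics for repeated runtime measurements of one experiment."""
    return {
        "n": len(runtimes),
        "mean": statistics.mean(runtimes),
        "stdev": statistics.stdev(runtimes) if len(runtimes) > 1 else 0.0,
        "median": statistics.median(runtimes),
        "min": min(runtimes),
        "max": max(runtimes),
    }

# Example with invented measurements in seconds.
print(summarize([41.2, 40.8, 45.6, 41.0, 40.9]))
```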
Once you have a representative runtime for each experiment, compute derived metrics, such as throughput, speedup, and efficiency, using the formulas already introduced. For parallel benchmarks, plot speedup $S_p$ versus $p$ and efficiency $E_p$ versus $p$ rather than only listing numbers. This makes it easier to see trends, such as diminishing returns as you increase resources.
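A minimal plotting sketch, assuming one representative runtime per process count and that matplotlib is available; the numbers are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical representative runtimes (seconds) per process count.
procs = [1, 2, 4, 8, 16]
runtimes = [120.0, 62.0, 33.0, 19.0, 12.5]

speedups = [runtimes[0] / t for t in runtimes]
efficiencies = [s / p for s, p in zip(speedups, procs)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(procs, speedups, marker="o", label="measured")
ax1.plot(procs, procs, linestyle="--", label="ideal")
ax1.set_xlabel("processes p")
ax1.set_ylabel("speedup $S_p$")
ax1.legend()
ax2.plot(procs, efficiencies, marker="o")
ax2.set_xlabel("processes p")
ax2.set_ylabel("efficiency $E_p$")
fig.tight_layout()
fig.savefig("scaling.png", dpi=150)
```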
Look for systematic patterns rather than single data points. For example, if runtime grows faster than linearly with problem size, or if parallel efficiency drops sharply beyond a certain number of nodes, these observations can guide where to optimize or how to choose job sizes.
Comparing Configurations and Systems
A major use case of application benchmarking is comparing configurations, for example two compilers, two algorithm settings, or two machines. When making such comparisons, careful experimental design is crucial to avoid misleading conclusions.
First, when you compare two configurations, run them under as identical conditions as possible. This means the same problem size, same node allocation and binding, same software environment except for the component under test, and interleaved execution in time if the system load may change. Interleaving runs from both configurations reduces bias from time dependent background load.
Second, use relative metrics in addition to raw runtimes. A common approach is to define a baseline and compute a speedup factor. If configuration A is your reference and configuration B is the alternative, you can define
$$\text{Relative speedup} = \frac{T_\text{A}}{T_\text{B}}.$$
Values greater than 1 mean that B is faster. Such ratios often make tables and plots easier to interpret than raw seconds.
Third, consider whether differences are significant relative to the variability of your measurements. If each configuration has a standard deviation on the order of a few percent, and the mean runtimes differ by only 1 percent, you should be cautious in claiming a real improvement. Increasing the number of repeats can help, but often it is better to focus on larger, clearly visible effects.
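The sketch below compares two configurations using the relative speedup together with a crude variability check; the spread based threshold is an illustrative heuristic, not a formal statistical test.

```python
import statistics

def compare(runtimes_a: list[float], runtimes_b: list[float]) -> None:
    """Report the relative speedup of B over A and flag differences within the noise."""
    med_a, med_b = statistics.median(runtimes_a), statistics.median(runtimes_b)
    rel_speedup = med_a / med_b
    # Rough noise estimate: largest relative spread observed in either configuration.
    spread_a = (max(runtimes_a) - min(runtimes_a)) / med_a
    spread_b = (max(runtimes_b) - min(runtimes_b)) / med_b
    noise = max(spread_a, spread_b)
    print(f"relative speedup (A/B): {rel_speedup:.3f}, observed spread up to {noise:.1%}")
    if abs(rel_speedup - 1.0) < noise:
        print("difference is within run-to-run variability; be cautious about conclusions")

# Example with invented measurements in seconds.
compare([40.2, 41.0, 40.5], [38.9, 39.4, 39.1])
```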
Fourth, think about portability of performance. If configuration B is faster on one system but slower on another, your benchmark results should document both cases. Application benchmarking rarely yields a single absolute ranking that applies everywhere.
Finally, do not discard unfavorable results. If a configuration does not perform as expected, include it in your analysis and try to understand why. These insights can be valuable for future tuning and for users making informed choices.
Benchmarking Parallel Scaling
Parallel scaling benchmarks are a special but very common case of application benchmarking. Here, the primary question is how performance changes as you increase computational resources. While the concepts of strong and weak scaling are covered elsewhere, this section focuses on how to structure the experiments for an application.
For a strong scaling benchmark, you fix the problem size and vary the number of processes or threads. You measure runtime $T_p$ for each value of $p$ and compute speedup $S_p$ and efficiency $E_p$ using the formulas given earlier. The benchmark explores how much faster the application becomes as you devote more resources to the same amount of work.
For a weak scaling benchmark, you increase the problem size in proportion to the resources so that the work per process or per core stays roughly constant. The primary metric here is typically runtime as a function of $p$. Ideally, the runtime stays constant or grows slowly as you use more resources and solve a larger overall problem.
In both cases, it is important that the way you scale the problem size is well defined and documented. For weak scaling, for example, if you solve a three dimensional problem on a grid, and each process handles a subdomain of size $N^3$, then going from $p$ to $8p$ processes might correspond to doubling the grid extent in each dimension. Recording such relationships in your benchmark description is essential for correct interpretation.
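For the cubic subdomain example above, the relationship between process count and global grid extent can be stated explicitly. The sketch below assumes a cubic process grid, which is a simplification chosen for illustration.

```python
def global_extent(subdomain_n: int, p: int) -> int:
    """Global grid points per dimension when each of p processes owns an N^3 subdomain,
    assuming the processes form a cubic k x k x k grid (so p must be a perfect cube)."""
    k = round(p ** (1 / 3))
    if k ** 3 != p:
        raise ValueError("p is not a perfect cube in this simplified example")
    return subdomain_n * k

# Going from p to 8p processes doubles the global extent in each dimension:
print(global_extent(128, 8))    # 256
print(global_extent(128, 64))   # 512
```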
Parallel scaling benchmarks are particularly sensitive to system noise, network contention, and file system behavior. As before, repeat runs, examine variability, and if possible conduct scaling studies at times of lower cluster load or using dedicated reservations.
Avoiding Common Benchmarking Pitfalls
Several recurring mistakes can render application benchmarks less useful or even misleading.
One common pitfall is measuring unrealistic workloads, such as tiny test problems that fit entirely in cache, when production runs operate far beyond cache capacity. These tiny problems may exaggerate computation performance and underrepresent communication or I/O costs.
Another pitfall is mixing code testing and benchmarking. Debug builds, extra checks, or profiling instrumentation can significantly distort performance. For meaningful benchmarking, you should use optimized builds and disable heavy debugging features, while still ensuring correctness.
Reusing stale numbers without rerunning benchmarks is also problematic. Performance can change as compilers, libraries, or even system firmware are updated. Benchmark results should be tied to specific software versions and hardware configurations, and they should be refreshed when the environment changes.
Overfitting to a single benchmark case is another risk. If you tune your application aggressively for one particular input or hardware configuration, you might degrade performance for other realistic cases. A small but diverse benchmark suite, rather than a single case, reduces this risk.
Finally, there is a temptation to cherry pick the best runs or the most favorable configurations and ignore the rest. This weakens trust in your results and can lead to poor decisions. Being transparent about the full benchmark setup, variability, and any anomalies will make your conclusions more robust.
Documenting and Sharing Benchmark Results
The value of a benchmark increases when others can understand, reproduce, and build on your results. Good documentation is part of the benchmarking process, not an afterthought.
For each benchmark experiment or campaign, record at least:
- The application version or commit ID.
- The compilation settings: compiler, optimization flags, and linked libraries.
- The hardware description: CPU model, core count, memory, interconnect, and accelerator type, along with the operating system version and important runtime libraries, such as MPI or math libraries.
- The benchmark workloads: input data sets, problem sizes, and any pre- or post-processing.
- The exact run commands and job script fragments that specify resources, environment modules, and environment variables.
- The measurement methodology: what was timed, how many repetitions were run, and how summary statistics were computed.
Present results in a structured form such as tables or plots, clearly labeling axes, units, and legends. Whenever you show derived metrics, such as speedup, indicate the baseline. If you publish or share benchmark data, consider including machine readable files with raw measurements in addition to figures.
Keeping this information together in a benchmark report, a repository, or a lab notebook allows you or others to rerun the benchmarks later, verify claims, and extend the experiments to new systems or configurations.
Integrating Benchmarking into the Development Cycle
Benchmarking is most effective when it becomes a regular part of the development process rather than an occasional large effort. You can maintain a small, representative benchmark suite that you run after significant code changes or when porting to a new system. Over time, this helps you detect performance regressions and evaluate the impact of optimizations.
Some teams integrate benchmarks into automated workflows. For example, they run quick benchmarks on every major change and compare performance against a baseline. This requires care to manage variability and avoid overloading shared systems, but even a lightweight approach can provide early warnings if performance drops significantly.
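A lightweight regression check of this kind might compare the median runtime of a quick benchmark against a stored baseline and fail when the slowdown exceeds a chosen tolerance; the baseline file format and the 10 percent threshold below are arbitrary assumptions for illustration.

```python
import json
import statistics
import sys

def check_regression(runtimes: list[float], baseline_file: str, tolerance: float = 0.10) -> None:
    """Exit with an error if the median runtime regresses beyond the tolerance."""
    median = statistics.median(runtimes)
    with open(baseline_file) as f:
        baseline = json.load(f)["median_runtime_s"]
    if median > baseline * (1.0 + tolerance):
        sys.exit(f"performance regression: {median:.2f} s vs baseline {baseline:.2f} s")
    print(f"ok: {median:.2f} s (baseline {baseline:.2f} s)")

# Example: runtimes from a quick benchmark run after a code change.
check_regression([12.4, 12.6, 12.5], "baseline.json")
```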
By treating performance benchmarks similarly to correctness tests, you encourage a culture where performance expectations are explicit and monitored. This reduces surprises late in a project and leads to more predictable behavior when the application is deployed on production scale systems.
In summary, benchmarking applications in HPC is about careful experimental design, disciplined execution, and honest analysis. With representative workloads, controlled environments, and clear documentation, your benchmarks become a powerful tool to understand and improve performance across the full software and hardware stack.