
Measuring performance

Understanding What You Are Measuring

Before you can optimize an HPC program, you must be able to describe its behavior with numbers. Measuring performance is the step where vague impressions such as “it feels slow” are turned into concrete metrics such as “this simulation advances 3.2 time steps per second on 256 cores.”

In this chapter, the focus is on what and how to measure, not yet on how to change the code. You will see how to define useful performance metrics, how to collect them in a controlled and repeatable way, and how to interpret them at a basic level.

Basic Performance Metrics

At the most fundamental level, performance measurement starts with time. From time, you derive rates, efficiencies, and speedups that allow comparison between runs, systems, and algorithms.

Wall-clock time and CPU time

Wall-clock time, often called elapsed time, is the total real-world time between the start and end of your program. It is what a user experiences and usually what matters first.

CPU time is the time a CPU core spends executing your program’s instructions, excluding periods where the process is blocked or waiting. For parallel jobs that use many cores, aggregate CPU time can be much larger than wall time because multiple cores run simultaneously.

In HPC, the wall-clock time of the full application, or of critical phases such as a single time step, is usually the primary performance metric. Many measurement tools and batch systems can report both wall time and CPU time.

Key rule: For comparing end-to-end performance between different runs, always use wall-clock time measured in a consistent way.

Throughput and rates

Time alone often does not convey how much “work” was done. It is better to define a rate, that is, work per unit time.

Examples include:

Number of time steps per second.

Grid points updated per second.

Particles advanced per second.

Gigabytes read or written per second.

A common unit in HPC is FLOP/s, floating point operations per second. If you know that a kernel performs $N_{\text{flops}}$ operations and takes time $T$, then:

$$
\text{FLOP/s} = \frac{N_{\text{flops}}}{T}.
$$

For larger values, you use MFLOP/s, GFLOP/s, or TFLOP/s. The exact number of operations is not always easy to obtain, but the idea is the same for any kind of work: count units of useful work and divide by time.
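
As a brief worked example with illustrative numbers: a kernel that performs $5 \times 10^{10}$ floating point operations in 25 seconds achieves

$$
\text{FLOP/s} = \frac{5 \times 10^{10}}{25\ \text{s}} = 2 \times 10^{9}\ \text{FLOP/s} = 2\ \text{GFLOP/s}.
$$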

Speedup

When you change something, such as the number of processes or an algorithm, you want to quantify whether the change improved performance. Speedup compares two timings.

If a baseline version takes time $T_{\text{old}}$ and an improved version takes $T_{\text{new}}$, the speedup is

$$
S = \frac{T_{\text{old}}}{T_{\text{new}}}.
$$

If $S = 2$, the new version is twice as fast. For parallel scaling, the same formula is used, with $T_{\text{old}}$ often being the time on one process or one node.

Important:
Speedup $S > 1$ means the new configuration is faster.
Speedup $S = 1$ means no change.
Speedup $S < 1$ means the new configuration is slower.

Efficiency

Efficiency relates achieved performance to a reference. For parallel programs, you often measure parallel efficiency, which compares actual speedup to the ideal speedup.

If you run your code on $P$ processing elements and measure speedup $S(P)$ relative to a baseline, the (strong scaling) parallel efficiency is

$$
E(P) = \frac{S(P)}{P}.
$$

If $E(P) = 1$, the parallelization is ideal. In real programs, efficiencies are less than 1 and usually decrease as you increase $P$.
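
For example, with hypothetical numbers: if a run on $P = 16$ cores achieves a speedup of $S(16) = 12$ over the single-core baseline, the parallel efficiency is

$$
E(16) = \frac{12}{16} = 0.75,
$$

that is, 75% of the ideal speedup.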

Levels and Regions of Measurement

Measuring only “total runtime” is often not enough to understand performance problems. It is useful to think about different levels of detail and different program regions.

Whole-application timing

Whole-application timing measures from program start to program end. This is useful when you compare:

Different systems.

Different compilers or build options.

Different parallel configurations.

You can measure this with simple wrappers around your executable, using shell tools, or with the job scheduler’s own accounting. This is the most coarse-grained measurement and usually your starting point.

Phase and region timing

Real applications consist of phases, such as initialization, computation, communication, I/O, and finalization. Measuring each phase separately can show which part dominates and where to focus optimization.

At the simplest level, you insert timers around code regions of interest. Many languages and libraries provide high-resolution timers, as do MPI and OpenMP. Measuring at region level gives a breakdown like “60% in computation, 25% in I/O, 15% in communication,” which guides later analysis.

Loop and kernel timing

For hotspots such as inner loops or numerical kernels, you may want to measure very small code sections. These measurements are more sensitive to noise but give detailed insight. They are useful when you work on micro-optimizations, vectorization, or cache-oriented changes.

Because short regions may take only microseconds or less per call, you often measure them over many iterations and then compute average times.
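
A minimal sketch of this pattern in C, using the POSIX clock_gettime timer; the kernel here is only a stand-in for whatever code section you actually want to measure, and the warm-up and iteration counts are arbitrary:

#include <stdio.h>
#include <time.h>

/* Stand-in kernel: a small array update, purely illustrative. */
static double a[1024];

static void my_kernel(void) {
    for (int i = 0; i < 1024; i++)
        a[i] = a[i] * 0.5 + 1.0;
}

int main(void) {
    const int warmup = 10;         /* untimed warm-up calls            */
    const int iterations = 100000; /* timed calls, averaged at the end */
    struct timespec t0, t1;

    for (int i = 0; i < warmup; i++)
        my_kernel();

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++)
        my_kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    printf("average time per call: %g s\n", elapsed / iterations);
    printf("checksum: %g\n", a[0]);  /* keeps the compiler from removing the work */
    return 0;
}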

Measuring Time in Practice

On HPC systems you usually cannot use graphical tools interactively for each test. Simple, reproducible command line measurements are very effective.

Using time and scheduler information

Most Unix-like systems provide a time command that measures elapsed time and CPU time of a program:

time ./my_application

Many job schedulers also report how long a job ran, resources used, and average CPU utilization in job accounting outputs. This is a convenient way to collect coarse-grained timing information for production runs.

In-code timing

For phase or region timing, you often instrument your own code. The general pattern is:

Record start time.

Execute the code block.

Record end time.

Compute the difference.

For example, in MPI programs you can use the standard wall-clock routine MPI_Wtime, which returns a double-precision value in seconds, so the elapsed time of a region is simply:

$$
T_{\text{elapsed}} = t_{\text{end}} - t_{\text{start}}.
$$

If the code is parallel and you care about the slowest process, you usually measure on each process and then take the maximum over all of them. That maximum determines the actual wall time of the phase.

Rule for parallel regions: When measuring a parallel phase, use the maximum time among all ranks or threads, not the minimum or average, because the slowest participant dictates the global runtime.
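
A minimal C sketch of this maximum-over-ranks pattern, assuming an MPI program in which compute_phase is only a placeholder for the phase being timed:

#include <mpi.h>
#include <stdio.h>

/* Placeholder for the phase being measured. */
static void compute_phase(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; i++)
        x += 1e-9;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_start = MPI_Wtime();             /* wall-clock time in seconds */
    compute_phase();
    double t_local = MPI_Wtime() - t_start;

    /* The slowest rank determines the wall time of the phase. */
    double t_max;
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("phase time (max over ranks): %f s\n", t_max);

    MPI_Finalize();
    return 0;
}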

Reproducibility and Experimental Design

Performance measurements are experiments. You must control conditions to obtain meaningful and comparable results.

Controlling variables

When you compare runs, change only one factor at a time. For example, if you compare different thread counts, keep:

Problem size fixed.

Compiler version and flags fixed.

Input data fixed.

Node type and number of nodes fixed.

Otherwise, you cannot attribute performance differences to the factor you think you are studying.

Handling variability and noise

Shared systems can introduce variability because other jobs contend for shared resources such as network and filesystem. Modern CPUs may change frequency depending on load and temperature. To reduce the effect of this noise:

Run each configuration multiple times.

Discard outliers if justified.

Report at least the mean and some measure of spread, such as the minimum and maximum or the standard deviation; in practice, HPC users often focus on the best or the median time when the variation is known to come from system noise.

Whenever possible, run tests on dedicated nodes or at less busy times of day. Be aware that I/O and network performance may vary significantly between runs on a shared cluster.

Warm-up and steady state

For some applications, especially those that include just-in-time compilation, caching, or adaptive algorithms, the first iteration or time step behaves differently from later ones. If you want to measure steady-state performance, run a small number of warm-up iterations and then measure subsequent iterations.

Similarly, when measuring I/O performance, the operating system might cache data. You should be clear whether you are measuring cached reads or real disk access.

Choosing Useful Metrics for HPC Codes

Different codes require different metrics. The key is to select metrics that meaningfully reflect the scientific or engineering objective.

Application-specific metrics

For a climate model, time per simulation day is usually more meaningful than absolute FLOP/s. For a molecular dynamics code, nanoseconds of simulated time per day is a common metric. For a sparse linear solver, time per solve or solves per hour may be most useful.

When defining an application-specific metric, identify a natural unit of work in your workflow and then normalize by time.

Hardware-related metrics

Although detailed hardware metrics are discussed elsewhere, basic derived metrics are often needed when measuring performance. Common examples include:

Achieved memory bandwidth in GB/s.

I/O throughput in GB/s.

Core utilization percentage.

These can sometimes be inferred from timing and known data volumes. For example, if you know that a kernel reads and writes $V$ bytes over time $T$, then the average throughput is:

$$
\text{Throughput} = \frac{V}{T}.
$$

This simple calculation already tells you whether you are near the limits of the storage system or memory bandwidth.
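
For example, with invented numbers: a kernel that moves $V = 40$ GB of data in $T = 8$ seconds achieves

$$
\text{Throughput} = \frac{40\ \text{GB}}{8\ \text{s}} = 5\ \text{GB/s},
$$

which you can compare directly against the nominal bandwidth of the memory system or filesystem.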

Measuring Scaling Behavior

Performance measurement in HPC usually involves changing the scale of the run, either by changing the number of cores or the problem size. This connects to strong and weak scaling, but here the focus is on how to measure and record the data.

Strong scaling measurements

For strong scaling studies you keep the problem size fixed and vary the number of processes or threads. The steps are:

Choose a representative problem size that fits on the smallest configuration.

Run the problem with different process counts, such as $1, 2, 4, 8, 16, \dots$.

Measure wall-clock time for each run.

Compute speedup and efficiency using the formulas

$$
S(P) = \frac{T(1)}{T(P)}, \quad E(P) = \frac{S(P)}{P}.
$$

Record these values in a table or plot.

The resulting data shows how the program benefits from additional resources and where the benefits start to diminish.
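
As an illustration, the following sketch applies exactly these formulas to a set of hypothetical strong-scaling timings; the process counts and times are invented for the example:

#include <stdio.h>

int main(void) {
    /* Hypothetical strong-scaling measurements: process counts and
       the corresponding wall-clock times in seconds. */
    const int    procs[] = { 1,     2,     4,     8,     16 };
    const double times[] = { 800.0, 410.0, 215.0, 120.0, 75.0 };
    const int n = sizeof(procs) / sizeof(procs[0]);

    printf("%6s %10s %10s %10s\n", "P", "T(P) [s]", "S(P)", "E(P)");
    for (int i = 0; i < n; i++) {
        double speedup    = times[0] / times[i];  /* S(P) = T(1)/T(P) */
        double efficiency = speedup / procs[i];   /* E(P) = S(P)/P    */
        printf("%6d %10.1f %10.2f %10.2f\n",
               procs[i], times[i], speedup, efficiency);
    }
    return 0;
}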

Weak scaling measurements

For weak scaling studies you increase the problem size with the number of processes, aiming to keep the work per process approximately constant. The steps are:

Define a per-process workload.

Run with $P = 1, 2, 4, 8, \dots$ processes, and choose total problem size $N(P)$ so that $N(P)$ approximately equals $P$ times the per-process workload.

Measure wall-clock time for each $P$.

In ideal weak scaling, the time stays constant as $P$ increases. In practice, time usually grows because of communication, synchronization, or I/O costs. Recording and plotting these times allows you to see how additional resources affect overall throughput.

Separating Computation, Communication, and I/O

Even without full-featured profiling tools, you can manually measure where time is spent at a coarse level.

Computation vs communication

In distributed memory codes, you can group your code into parts that perform local computation and parts that involve communication. By placing timers around these regions, you can estimate:

Total time spent in computation.

Total time spent in communication.

This basic separation can show, for example, that communication occupies 40% of runtime at high core counts, which will inform later optimization efforts.
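
A sketch of this separation for an MPI code with a simple iteration loop, in which local_update stands in for the computation and an MPI_Allreduce stands in for the communication:

#include <mpi.h>
#include <stdio.h>

/* Placeholder for the local computation in each step. */
static double local_update(void) {
    double s = 0.0;
    for (long i = 0; i < 1000000L; i++)
        s += 1e-6;
    return s;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_steps = 100;
    double t_comp = 0.0, t_comm = 0.0;

    for (int step = 0; step < n_steps; step++) {
        double t0 = MPI_Wtime();
        double residual = local_update();           /* computation   */
        t_comp += MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        MPI_Allreduce(MPI_IN_PLACE, &residual, 1,   /* communication */
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_comm += MPI_Wtime() - t0;
    }

    /* Take the maximum over all ranks, since the slowest rank dominates. */
    double t_comp_max, t_comm_max;
    MPI_Reduce(&t_comp, &t_comp_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_comm, &t_comm_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("computation: %.3f s, communication: %.3f s\n",
               t_comp_max, t_comm_max);

    MPI_Finalize();
    return 0;
}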

I/O timing

I/O is often a performance bottleneck. You can measure:

Time to read input data.

Time per checkpoint write.

Time to write final results.

These measurements help you decide how frequently you can checkpoint, whether you need parallel I/O, and how to structure output for acceptable performance.

Recording, Organizing, and Reporting Measurements

Measuring performance is only useful if you keep your data organized and interpret it carefully.

Keeping a performance notebook

It is good practice to record every experiment in a structured way. For each run, keep track of:

Code version and build settings.

Input configuration and problem size.

Resource configuration such as nodes, cores per node, threads, and GPU usage.

Timing results and any derived metrics such as speedup or throughput.

You can do this in a simple text file, spreadsheet, or version-controlled notes. Over time, this history becomes valuable when you revisit old optimizations or compare systems.

Visualizing results

Even simple plots of time versus number of cores or problem size can reveal patterns that are hard to see in tables. For strong scaling, plotting speedup or efficiency against core count immediately shows where scaling starts to degrade. For weak scaling, plotting time against core count shows how close you are to ideal behavior.

When sharing performance results, always include enough information so that others can understand the context. At minimum, specify hardware, software environment, and the measurement method.

Good practice for reporting: Always state

1. What was measured,
2. How it was measured,
3. Under which conditions it was measured.

Without these three pieces, performance numbers are hard to interpret or reproduce.

Limitations and Common Pitfalls

Performance measurement is never perfect. Being aware of limitations helps you avoid incorrect conclusions.

Short regions can be dominated by timer overhead or by effects such as caching and branch prediction, which makes them hard to measure reliably. If possible, measure them over many iterations.

Comparing runs across different systems without accounting for differences in compiler flags, libraries, or problem sizes can be misleading.

Reporting only a single “best” time without mentioning variability can hide important performance issues.

Measuring only end-to-end time without any breakdown may leave you without actionable insight when you want to optimize.

From Measurement to Optimization

Performance analysis and optimization start with numbers, not guesses. Once you can reliably measure wall-clock time, throughput, speedup, and efficiency, and once you can separate time into broad categories such as computation, communication, and I/O, you have the basic tools needed to diagnose performance problems.

Later chapters will describe how to use profiling tools and hardware counters to obtain much more detailed views of your code. The methods in this chapter remain essential, because all advanced tools still build on simple and careful timing, clear definitions of metrics, and reproducible experiments.
