Why and How We Measure Performance
In HPC, “fast” is not a feeling; it is a measurable quantity. Measuring performance means turning runtime behavior into numbers you can compare, track, and optimize.
Performance measurement sits between raw execution and deep profiling: it answers “how fast/how efficient?” before “why?”. More advanced profiling tools and optimization strategies are covered in their own chapters; here we focus on what to measure and how to measure it in a simple, systematic way.
Key Performance Metrics
Wall-Clock Time
The most basic metric is wall-clock time: the elapsed real time from start to finish of your program or a specific region.
- Total runtime (end-to-end)
- Time for critical phases (e.g., initialization, computation, I/O)
You typically want:
- A baseline time (unoptimized or reference)
- Timings under different configurations (input size, cores, nodes, compiler flags, etc.)
At a high level, runtime is:
$$
T_\text{total} = T_\text{compute} + T_\text{communication} + T_\text{I/O} + T_\text{other}
$$
Performance analysis and later optimization try to reduce these components.
Throughput and Work Rate
Runtime alone is less meaningful without context. You need to relate time to “work done”:
- Grid updates per second
- Particles processed per second
- Time steps per day
- Iterations per second
A simple way to define a work rate is:
$$
\text{Throughput} = \frac{\text{Amount of work}}{\text{Elapsed time}}
$$
This helps you compare performance for different problem sizes or even across different applications that perform similar tasks.
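As an illustration with made-up numbers: a stencil code that updates a $512^3$ grid for 200 time steps in 40 s of wall-clock time achieves
$$
\text{Throughput} = \frac{512^3 \times 200\ \text{updates}}{40\ \text{s}} \approx 6.7 \times 10^{8}\ \text{updates/s}
$$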
FLOP Rate (Floating-Point Performance)
In numerical HPC codes, a common metric is floating-point operations per second (FLOP/s), usually reported in:
- MFLOP/s (millions per second)
- GFLOP/s (billions per second)
- TFLOP/s (trillions per second), etc.
If your algorithm performs $N_\text{flop}$ floating-point operations and the execution time is $T$ seconds:
$$
\text{FLOP/s} = \frac{N_\text{flop}}{T}
$$
Often you care about efficiency relative to theoretical peak:
$$
\text{Efficiency} = \frac{\text{Achieved FLOP/s}}{\text{Peak FLOP/s}} \times 100\%
$$
Peak FLOP/s depends on hardware and is typically given by vendor specs or derived from core count, frequency, and vector width.
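For example, with illustrative (hypothetical) hardware numbers: a CPU with 16 cores at 2.5 GHz, where each core can execute two 4-wide fused multiply-add (FMA) instructions per cycle (i.e., 16 double-precision FLOPs per cycle per core), has a theoretical peak of
$$
\text{Peak} = 16\ \text{cores} \times 2.5 \times 10^{9}\ \frac{\text{cycles}}{\text{s}} \times 16\ \frac{\text{FLOP}}{\text{cycle} \cdot \text{core}} = 640\ \text{GFLOP/s}
$$
A kernel sustaining 64 GFLOP/s on such a chip would thus run at 10% of peak, which is not unusual for memory-bound code.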
Bandwidth and I/O Throughput
Many HPC applications are limited not by arithmetic but by data movement:
- Memory bandwidth (GB/s between CPU and RAM)
- Interconnect bandwidth (GB/s between nodes)
- Storage throughput (MB/s or GB/s to/from filesystem)
If your program transfers $V$ bytes in time $T$:
$$
\text{Bandwidth} = \frac{V}{T}
$$
Again, you can compare against a theoretical or measured peak bandwidth to understand how close you are to hardware limits.
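As a minimal sketch of what such a measurement can look like, the C program below times a plain array copy and reports the effective bandwidth, assuming the copy moves $2 \times N \times \text{sizeof(double)}$ bytes (one read plus one write per element). This is only an illustration; dedicated benchmarks such as STREAM measure memory bandwidth far more carefully.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Current time in seconds from a monotonic clock. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const size_t n = 1u << 25;                 /* 32 Mi doubles = 256 MiB per array */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    if (!a || !b) { fprintf(stderr, "allocation failed\n"); return 1; }

    for (size_t i = 0; i < n; i++) a[i] = (double)i;   /* first touch / warm-up */

    double t0 = now();
    for (size_t i = 0; i < n; i++) b[i] = a[i];        /* timed copy */
    double t1 = now();

    /* One read and one write per element. */
    double bytes = 2.0 * (double)n * sizeof(double);
    printf("moved %.0f MB in %.4f s -> %.2f GB/s (b[0] = %g)\n",
           bytes / 1e6, t1 - t0, bytes / (t1 - t0) / 1e9, b[0]);

    free(a);
    free(b);
    return 0;
}
```

Printing an element of the destination array discourages the compiler from optimizing the copy away; comparing the reported GB/s against the platform's nominal memory bandwidth tells you how close the copy comes to the hardware limit.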
Parallel Speedup and Efficiency (Briefly)
Parallel performance metrics are covered in more depth elsewhere, but they are central to measurement:
- Speedup when using $p$ processors:
$$
S(p) = \frac{T(1)}{T(p)}
$$
where $T(1)$ is time on a single core or baseline, and $T(p)$ is time on $p$ cores.
- Parallel efficiency:
$$
E(p) = \frac{S(p)}{p}
$$
Measuring $T(p)$ accurately and consistently is the foundation for meaningful scaling studies.
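For example, with purely illustrative numbers: if the baseline takes $T(1) = 1200$ s and the same problem takes $T(16) = 100$ s on 16 cores, then
$$
S(16) = \frac{1200}{100} = 12, \qquad E(16) = \frac{12}{16} = 75\%
$$
i.e., the code runs 12 times faster but uses each of the 16 cores at only 75% effectiveness.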
What Exactly Should You Time?
End-to-End vs. Region-Based Measurements
You rarely want only a single “program took X seconds” number.
Common levels:
- Total runtime: useful for user-perceived performance and resource planning.
- Phase runtimes: initialization, main compute loop, I/O, communication phases, post-processing.
- Kernel-level timing: specific loops or computational kernels you suspect dominate runtime.
This decomposition helps you:
- Identify where time is actually spent (computation vs. I/O vs. communication).
- Focus optimization efforts where they matter.
Warm-Up and Measurement Noise
Real systems exhibit noise:
- The first run may be slower (cold caches, filesystem metadata lookups, JIT compilation, etc.).
- Other users’ jobs can interfere on shared systems.
- Power management or turbo modes may vary CPU frequency.
To mitigate:
- Discard the first run or treat it separately.
- Run each configuration multiple times (e.g., 3–10).
- Report statistics (min, median, mean; often the minimum is a good approximation of least-contended performance).
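A minimal sketch of this procedure, assuming the region of interest is wrapped in a hypothetical function `run_kernel()`: perform one warm-up run, then time several repetitions and report the minimum and the mean.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder for the code region you actually want to measure. */
static void run_kernel(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 50 * 1000 * 1000; i++) s += 1e-9 * i;
}

/* Current time in seconds from a monotonic clock. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    enum { NRUNS = 5 };
    double t[NRUNS];

    run_kernel();                       /* warm-up run, not recorded */

    for (int r = 0; r < NRUNS; r++) {   /* repeated measurements */
        double t0 = now();
        run_kernel();
        t[r] = now() - t0;
    }

    double tmin = t[0], tsum = 0.0;
    for (int r = 0; r < NRUNS; r++) {
        if (t[r] < tmin) tmin = t[r];
        tsum += t[r];
    }
    printf("min = %.4f s, mean = %.4f s over %d runs\n",
           tmin, tsum / NRUNS, NRUNS);
    return 0;
}
```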
Practical Timing Techniques in HPC
Timing from the Command Line
For coarse, whole-program measurement you can use shell tools.
The `time` Command
Most systems provide a `time` utility that reports:
- real: wall-clock time
- user: CPU time spent in user code
- sys: CPU time spent in kernel (system) calls
Example usage:
`time ./my_program input.dat`
You can also use `/usr/bin/time -v` (the standalone utility, often more detailed than the shell built-in) to see additional metrics such as maximum resident set size (memory), page faults, and more.
Use command-line timing for:
- Quick comparisons (different compilers, flags).
- End-to-end runtime under various core counts or nodes (in combination with the job scheduler).
In-Code Timing: Instrumenting Applications
To understand where time is spent inside your program, you insert timing calls around code regions.
Basic Principles
- Use a high-resolution, monotonic clock (does not go backward).
- Minimize intrusiveness (timing should not significantly change performance).
- Use a consistent timing API across all your measurements.
Examples of Common Timing APIs
You will see different timing calls depending on language and parallel programming model. At a conceptual level, all do the same thing:
- Get current time (as a double or timestamp).
- Run the code section.
- Get time again.
- Compute difference.
For example, conceptually:
t_start = now()
# ... code to be measured ...
t_end = now()
elapsed = t_end - t_start
In MPI codes, it is common to use the MPI-provided timer, `MPI_Wtime`, to ensure consistency across processes. Threaded models or GPU models have their own timing functions, discussed in their specific chapters.
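For example, a minimal MPI sketch using `MPI_Wtime` (the `compute_step()` function is a hypothetical stand-in for your own code):

```c
#include <stdio.h>
#include <mpi.h>

/* Hypothetical stand-in for the work you want to time. */
static void compute_step(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 10 * 1000 * 1000; i++) s += 1e-9 * i;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_start = MPI_Wtime();   /* wall-clock time in seconds */
    compute_step();
    double t_end = MPI_Wtime();

    if (rank == 0)
        printf("compute_step took %.6f s on rank 0\n", t_end - t_start);

    MPI_Finalize();
    return 0;
}
```

Compiled with an MPI compiler wrapper (e.g., `mpicc`) and launched with `mpirun` or `srun`, this reports the time measured on rank 0 only; the next subsection discusses how to account for all ranks.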
Timing Parallel Regions
An important detail in parallel programs: which process or thread reports what?
Guidelines:
- For MPI: often measure time on rank 0, but ensure all ranks reach synchronization points (e.g., with `MPI_Barrier`) so measurements correspond to the full global phase.
- For OpenMP: collect timing once per region, typically in the master thread.
- For hybrid codes: combine approaches carefully; decide whether you want node-level or global timings.
Make sure the measured region encompasses all relevant work, including communication and synchronization, not just the local computation.
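One common pattern is sketched below (again with a hypothetical `compute_step()` standing in for the measured phase): synchronize all ranks before starting the timer, then reduce the per-rank times with a maximum, since the slowest rank determines the duration of the global phase.

```c
#include <stdio.h>
#include <mpi.h>

/* Hypothetical stand-in for the phase being measured (compute + communication). */
static void compute_step(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 10 * 1000 * 1000; i++) s += 1e-9 * i;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* all ranks start together */
    double t_start = MPI_Wtime();

    compute_step();                       /* the phase of interest */

    double t_local = MPI_Wtime() - t_start;

    /* The slowest rank defines the duration of the global phase. */
    double t_max;
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("phase time: %.6f s (max over all ranks)\n", t_max);

    MPI_Finalize();
    return 0;
}
```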
Measuring Performance at Scale
Strong and Weak Scaling Experiments
Scaling concepts are detailed elsewhere; here is how they connect to measurement.
To evaluate parallel performance, you run systematic experiments:
- Strong scaling:
- Fix the problem size.
- Measure runtime $T(p)$ for different process counts $p$.
- Derive speedup $S(p)$ and efficiency $E(p)$.
- Weak scaling:
- Increase problem size in proportion to $p$.
- Measure how runtime changes (ideally remains approximately constant).
For each data point:
- Choose the job configuration (nodes, tasks per node, threads).
- Run the application multiple times.
- Use a consistent timing methodology (either inside the code or via scheduler logs / `time`).
- Record:
- Input size / problem parameters.
- Number of processes/threads.
- Wall-clock time of the primary work phase.
- Any additional metrics of interest (iterations, time steps, etc.).
Plotting:
- $T(p)$ vs. $p$ (runtime).
- $S(p)$ vs. $p$ (speedup).
- $E(p)$ vs. $p$ (efficiency).
These plots are more informative than standalone numbers and form the basis of a performance study.
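As a small sketch of how speedup and efficiency are derived from the measured runtimes (the process counts and times below are hypothetical placeholders, not real measurements):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical measured data: process counts and wall-clock times in seconds.
       Replace these with your own measurements. */
    const int    p[] = { 1, 2, 4, 8, 16 };
    const double t[] = { 1200.0, 640.0, 330.0, 180.0, 100.0 };
    const int    n   = (int)(sizeof p / sizeof p[0]);

    printf("%6s %12s %10s %12s\n", "p", "T(p) [s]", "S(p)", "E(p) [%]");
    for (int i = 0; i < n; i++) {
        double speedup    = t[0] / t[i];          /* S(p) = T(1) / T(p) */
        double efficiency = speedup / p[i];       /* E(p) = S(p) / p    */
        printf("%6d %12.1f %10.2f %12.1f\n",
               p[i], t[i], speedup, efficiency * 100.0);
    }
    return 0;
}
```

The resulting table is exactly what you would plot as $S(p)$ vs. $p$ and $E(p)$ vs. $p$.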
Averaging and Reporting
For each configuration, you may have several runs. Common practices:
- Report the minimum time to indicate best-case performance under low noise.
- Optionally report mean and standard deviation to show variability.
- If outliers occur (e.g., an unusually slow run due to filesystem hiccups), document and decide whether to discard them with justification.
When reporting strong/weak scaling:
- Clearly state:
- Hardware (CPU model, cores, memory, interconnect).
- Software environment (compiler and version, libraries, optimization flags).
- Exact run configuration (nodes, processes, threads, binding if relevant).
This is essential for reproducibility and fair comparison.
Common Pitfalls in Performance Measurement
Measuring the Wrong Thing
Typical mistakes:
- Including setup that is irrelevant to the algorithm’s steady-state performance (e.g., heavy one-off preprocessing) without separating it.
- Timing only a small toy problem where overheads dominate and behavior differs from production runs.
- Measuring only wall time once and ignoring variability.
Mitigation:
- Distinguish between initialization, compute, and finalization.
- Use realistic problem sizes.
- Repeat runs and look at variability.
Timer Placement Errors
Incorrect timer placement leads to misleading conclusions:
- Starting the timer before processes have synchronized: late-arriving processes can distort the measured durations.
- Stopping the timer too early (e.g., not including all communication or I/O).
- Double-counting or missing parts of the workload.
Mitigation:
- Carefully decide the region of interest.
- Use barriers or other synchronization constructs when needed to ensure all parallel workers have completed.
Perturbing the Program
Excessive timing can change behavior:
- Printing from all processes or threads (especially inside loops) drastically distorts performance.
- Very fine-grained timing in tight loops can introduce overhead larger than the code being measured.
Mitigation:
- Time coarse-grained regions first.
- Limit prints to a small subset of processes (e.g., rank 0) and to infrequent events.
- Use more advanced profiling tools (discussed elsewhere) when you need fine-grained details.
Integrating Measurement into Your Workflow
Establishing a Baseline
Before making any optimization:
- Choose a representative input.
- Measure:
- Total runtime.
- Critical phase runtimes.
- Basic throughput (e.g., iterations per second).
- Record system and environment information.
This baseline is your reference for all future changes.
Iterative Improvement
For each change (code modification, compiler flag, different node count):
- Rebuild with a clearly labeled version.
- Repeat the same measurement procedure.
- Compare against the baseline:
- Did performance improve?
- By how much, in percent (see the formula below)?
- Is the improvement consistent across runs?
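A simple way to quantify the change relative to the baseline is:
$$
\text{Improvement} = \frac{T_\text{baseline} - T_\text{new}}{T_\text{baseline}} \times 100\%
$$
Equivalently, you can report the speedup factor $T_\text{baseline} / T_\text{new}$.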
Never assume that a theoretically “better” optimization actually speeds up your code without measuring.
Automation
As projects grow, manual measurement becomes tedious and error-prone. Simple automation helps:
- Shell scripts to:
- Run the program at different core counts or input sizes.
- Collect and parse timing output into a table.
- Use environment variables or command-line options in your code (e.g., enabling/disabling timers or printing only summary statistics).
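As an example of the last bullet, here is a minimal sketch that toggles region timing via an environment variable; the variable name `MYAPP_TIMERS` and the function `run_phase()` are hypothetical placeholders.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a phase of your application. */
static void run_phase(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 20 * 1000 * 1000; i++) s += 1e-9 * i;
}

/* Current time in seconds from a monotonic clock. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* Enable timing only when MYAPP_TIMERS=1 is set in the environment,
       e.g. in the job script:  MYAPP_TIMERS=1 ./my_program input.dat */
    const char *env = getenv("MYAPP_TIMERS");
    int timers_on = (env != NULL && strcmp(env, "1") == 0);

    double t0 = timers_on ? now() : 0.0;
    run_phase();
    if (timers_on)
        printf("run_phase: %.4f s\n", now() - t0);

    return 0;
}
```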
Later chapters on performance analysis and optimization will integrate these measurements with profiling and more sophisticated tools, but the underlying pattern remains: measure, change, measure again.
Summary
- Performance measurement turns program behavior into quantitative metrics like wall time, throughput, FLOP/s, and bandwidth.
- Measure not only total runtime but also key phases and kernels.
- Use appropriate timing tools:
- External (`time`) for whole-program measurements.
- In-code timers for detailed region-level measurements.
- For parallel programs, design careful scaling studies (strong/weak) and derive speedup and efficiency from measured runtimes.
- Avoid common pitfalls: timing the wrong regions, poor timer placement, or perturbing performance with excessive output.
- Always establish a baseline, apply changes systematically, and re-measure under controlled, repeatable conditions.
These measurement practices are the foundation on which profiling, low-level optimization, and sophisticated performance tools build in the rest of the performance analysis and optimization material.