What Benchmarking Is (and Is Not)
Benchmarking in HPC is the systematic, repeatable measurement of application performance under controlled conditions.
You are not just running your code once and looking at how long it took. Proper benchmarking aims to:
- Measure performance in a way that is:
- Reproducible
- Comparable across systems, compilers, and configurations
- Informative for optimization and purchasing decisions
- Separate:
- Application behavior from transient system noise
- Performance issues from correctness issues
Benchmarking answers questions like:
- How fast does this application run for a given problem size?
- How does performance change when:
- I use more cores or nodes?
- I change compilers or optimization flags?
- I change algorithms or libraries?
- How efficient is this system compared to another?
It does not replace correctness testing or debugging.
Types of Benchmarks in HPC
Microbenchmarks
Microbenchmarks focus on a very narrow aspect of performance:
- Examples:
- Measuring memory bandwidth (e.g., STREAM benchmark)
- Measuring network latency and bandwidth (e.g., `osu_latency`, `osu_bw`)
- Measuring floating-point peak performance (simple dense matrix multiply kernels)
- Use cases:
- Characterizing hardware limits
- Explaining why an application cannot exceed certain performance levels
- Validating theoretical models (e.g., roofline model limits)
These do not tell you directly how your whole application will perform but give insight into bottlenecks.
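For example, a memory-bandwidth microbenchmark in the spirit of STREAM can be as small as a single "triad" loop. The following is a minimal C++ sketch, not the official STREAM benchmark; the array size, repeat count, and bandwidth accounting are illustrative choices:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Minimal STREAM-like "triad" sketch: a[i] = b[i] + s * c[i].
// Array size and repeat count are illustrative; a real microbenchmark
// (e.g., STREAM) fixes these carefully and verifies the results.
int main() {
    const std::size_t n = 1 << 24;   // ~16M doubles per array (illustrative)
    const int repeats = 10;
    const double s = 3.0;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    // Count read b, read c, write a: 3 arrays of 8-byte doubles per element.
    double gbytes = 3.0 * 8.0 * double(n) * repeats / 1e9;
    std::printf("triad: %.1f GB/s (check value %.1f)\n", gbytes / seconds, a[n / 2]);
    return 0;
}
```

Comparing the measured number against the node's nominal memory bandwidth gives a quick sense of the ceiling a memory-bound application can reach.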
Kernel / Mini-Application Benchmarks
Here you benchmark a representative kernel or mini-app that captures the dominant computation and communication pattern of your full application, but in a simplified form.
- Examples:
- A stencil update kernel from a CFD code
- A sparse matrix-vector multiply from a larger PDE solver
- Community mini-apps like:
- LULESH (hydrodynamics)
- miniMD (molecular dynamics)
Use cases:
- Rapid experimentation with algorithms or libraries
- Tuning for specific architectures without porting the full codebase
- Comparing systems with application-like behavior but less complexity
Full Application Benchmarks
This is your actual production application (or a close equivalent), run with realistic input data and configuration.
Characteristics:
- Include:
- I/O
- Communication
- Load imbalance
- Algorithmic setup phases
- Reflect “end-to-end” performance that real users care about
Use cases:
- Tuning production runs
- Capacity planning on clusters
- System acceptance testing using domain-specific codes
Synthetic vs Realistic Workloads
- Synthetic: Artificial inputs designed to stress specific aspects (e.g., worst-case communication, maximum memory usage).
- Realistic: Inputs that reflect actual scientific or industrial workloads.
Both have value:
- Synthetic workloads can reveal corner-case bottlenecks.
- Realistic workloads tell you what actually matters operationally.
Designing a Meaningful Benchmark
A useful benchmark must be:
- Well-defined
- Repeatable
- Relevant
Define the Benchmark Scenario Clearly
Specify:
- Problem type: e.g., 3D heat equation, molecular system with X atoms, matrix size $N \times N$
- Input data:
- How it is generated, or
- Where it is obtained (datasets, scripts)
- Numerical settings:
- Tolerances
- Iteration limits
- Solver/preconditioner choices
- Parallel configuration:
- Number of nodes
- Tasks (MPI ranks) per node
- Threads per task
- GPU usage if applicable
- Software environment:
- Compiler and version
- Key libraries and versions
- Important environment variables (e.g., thread affinity, OpenMP settings)
All of this should be documented so that someone else (or you in 6 months) can reproduce the benchmark.
Choose Metrics That Matter
Common metrics include:
- Wall-clock time $T$:
- Total time from start to end of the relevant computation
- Throughput:
- Jobs per hour
- Time steps per second
- Iterations per second
- Floating-point performance:
- FLOP/s or GFLOP/s (giga floating-point operations per second)
- Requires knowing or estimating operation counts
- Speedup:
- $$S(p) = \frac{T(1)}{T(p)}$$
- Where $T(1)$ is the baseline time and $T(p)$ the time on $p$ processing elements; a worked sketch computing $S(p)$ and $E(p)$ appears at the end of this subsection
- Parallel efficiency:
- $$E(p) = \frac{S(p)}{p} = \frac{T(1)}{p \cdot T(p)}$$
- Resource efficiency:
- Performance per core, per node, or per watt (if energy data available)
- I/O performance:
- MB/s read/write during relevant phases
- Memory footprint:
- Peak memory usage per process, per node
Pick metrics aligned with your goals:
- For production work: turnaround time and cost per simulation.
- For scaling studies: speedup and efficiency.
- For procurement: performance per node or per dollar.
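To make the derived metrics concrete, the sketch below applies the speedup and efficiency formulas above to a set of measured wall-clock times; the hard-coded numbers are hypothetical placeholders for your own measurements:

```cpp
#include <cstdio>
#include <vector>

// Derive speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p
// from measured wall-clock times. The times below are hypothetical.
int main() {
    struct Run { int p; double t_seconds; };
    const std::vector<Run> runs = { {1, 1200.0}, {2, 640.0}, {4, 340.0}, {8, 190.0} };

    const double t1 = runs.front().t_seconds;   // baseline time T(1)
    std::printf("%6s %12s %8s %8s\n", "p", "T(p) [s]", "S(p)", "E(p)");
    for (const Run& r : runs) {
        const double speedup = t1 / r.t_seconds;
        const double efficiency = speedup / r.p;
        std::printf("%6d %12.1f %8.2f %8.2f\n", r.p, r.t_seconds, speedup, efficiency);
    }
    return 0;
}
```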
Control Variables vs Tunables
Clearly separate:
- Fixed parameters (controlled):
- Problem size
- Numerical fidelity
- Hardware platform
- Tunables (what you are systematically varying):
- Number of cores/nodes
- Compiler flags
- Libraries (e.g., different BLAS implementations)
- Algorithmic choices (e.g., solver A vs solver B)
Change one category of tunables at a time to isolate their impact.
Strong and Weak Scaling Benchmarks
Scaling concepts themselves are covered elsewhere, but here is how they shape benchmarking setups.
Strong Scaling Benchmarks
Goal: measure how time decreases as you increase resources for a fixed problem size.
Setup:
- Fix the global problem size $N$.
- Run on $p = 1, 2, 4, 8, \dots$ processors (cores/nodes).
- Measure:
- $T(p)$
- Speedup $S(p)$ and efficiency $E(p)$
Use cases:
- Understand how fast this job can run if given more resources.
- Identify at what scale adding more resources stops being worth it.
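As a hypothetical illustration: if $T(1) = 1200$ s, $T(32) = 60$ s, and $T(64) = 45$ s, then $E(32) = \frac{1200}{32 \cdot 60} \approx 0.63$ while $E(64) = \frac{1200}{64 \cdot 45} \approx 0.42$, suggesting that for this problem size going beyond 32 processors is no longer a good use of resources.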
Weak Scaling Benchmarks
Goal: measure how time behaves as you increase resources and increase problem size so that work per resource stays constant.
Setup:
- Fix the problem size per process, e.g., $N_{\text{local}}$.
- Total work $N_{\text{global}}$ scales with $p$.
- Ideally, $T(p)$ remains close to constant.
Use cases:
- Understand how well your application handles larger and larger global problems.
- Evaluate scalability for large production runs.
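A common way to summarize this (a convention assumed here, not defined above) is the weak-scaling efficiency $E_{\text{weak}}(p) = \frac{T(1)}{T(p)}$, which stays close to 1 under ideal weak scaling; for example, with constant work per process, $T(1) = 100$ s and $T(64) = 125$ s give $E_{\text{weak}}(64) = 0.8$.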
Controlling the Benchmark Environment
To make benchmarks comparable and interpretable, the environment must be as stable as possible.
System Load and Interference
Cluster conditions vary:
- Other users’ jobs may:
- Share network links
- Contend for I/O bandwidth
- Induce OS noise
Mitigation strategies:
- Run benchmarks when the system is less loaded (e.g., off-peak hours).
- Request dedicated nodes or exclusive access if the scheduler supports it:
- For example (conceptually): `--exclusive` in SLURM job scripts.
- Avoid mixing benchmarks with other heavy tasks from your own jobs on the same nodes.
Process and Thread Affinity
To reduce variability:
- Pin processes/threads to cores consistently:
- Use your scheduler’s binding options.
- Use threading runtime options (e.g., OpenMP controls).
- Keep the mapping (e.g., ranks per node, threads per rank) documented and consistent across runs.
Affinity impacts:
- Cache utilization
- NUMA effects (local vs remote memory)
- Measured performance variation
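One way to confirm that the pinning you requested is actually in effect is to have each thread report where it runs. The sketch below is a Linux-specific example assuming OpenMP and the glibc `sched_getcpu()` call (compile with OpenMP enabled, e.g., `g++ -fopenmp`):

```cpp
#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc-specific

// Print which CPU each OpenMP thread is running on, to sanity-check
// the binding settings actually in effect during a benchmark run.
int main() {
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        const int cpu = sched_getcpu();   // returns -1 on failure
        #pragma omp critical
        std::printf("thread %2d running on CPU %d\n", tid, cpu);
    }
    return 0;
}
```

Running it under different binding settings, such as the standard OpenMP variables `OMP_PROC_BIND` and `OMP_PLACES`, makes it easy to record the mapping that was actually used.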
Repeating Runs and Statistical Treatment
Due to noise, single measurements are often misleading.
Basic practice:
- Repeat each configuration several times (e.g., 5–10 runs).
- Compute:
- Mean run time $\bar{T}$
- Standard deviation or at least min/max
Interpretation:
- Large variance suggests:
- Uncontrolled environmental factors
- I/O or network contention
- When reporting results, prefer:
- Mean plus variance, or
- Median (robust to outliers)
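A minimal sketch of this statistical treatment, assuming the timings have already been collected (the values below are placeholders), might look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Summarize repeated benchmark timings: mean, sample std dev, median, min/max.
// The timings below are hypothetical placeholders.
int main() {
    std::vector<double> t = {101.2, 99.8, 100.5, 107.9, 100.1};  // seconds

    double mean = 0.0;
    for (double x : t) mean += x;
    mean /= t.size();

    double var = 0.0;
    for (double x : t) var += (x - mean) * (x - mean);
    var /= (t.size() - 1);                 // sample variance
    const double stddev = std::sqrt(var);

    std::sort(t.begin(), t.end());
    const std::size_t m = t.size() / 2;
    const double median = (t.size() % 2 == 1) ? t[m] : 0.5 * (t[m - 1] + t[m]);

    std::printf("runs=%zu mean=%.2f s stddev=%.2f s median=%.2f s min=%.2f s max=%.2f s\n",
                t.size(), mean, stddev, median, t.front(), t.back());
    return 0;
}
```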
Instrumentation and Timing Techniques
Benchmarking requires reliable timing for the portions of the code that matter.
What to Time
Avoid timing everything from process start to exit if:
- There is significant setup overhead (reading large inputs, pre-processing).
- You want to separate:
- Initialization
- Core compute
- Output / checkpointing
Common practice:
- Time:
- The main compute loop
- Specific phases (solver, assembly, communication)
- Also record:
- Overall wall-clock time from job start, to capture “end-to-end” behavior separately.
Timing Tools and APIs
Depending on your application and environment, you might use:
- High-level tools (external):
- Time command wrappers (e.g., `/usr/bin/time`) for full-application timing.
- Job scheduler accounting for start/end times.
- In-code timing:
- Language-specific timers (e.g., `std::chrono` in C++, high-resolution timers in other languages).
- MPI timing functions for distributed regions:
- `MPI_Wtime()` across all processes for the same region of code.
- Library timers:
- Some numerical libraries expose internal timing information or logging.
Consistency is more important than the specific API chosen. Always:
- Use the same timing method for all runs of a given benchmark.
- Ensure the timed regions are well-defined and documented.
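As one example of a well-defined timed region, the sketch below times a single phase of an MPI program with `MPI_Wtime()`, synchronizing first and reporting the maximum over ranks (a common convention, since the slowest rank determines wall-clock progress). Here `compute_phase()` is a hypothetical stand-in for your solver or assembly step; in a serial code, `std::chrono::steady_clock` plays the same role.

```cpp
#include <cstdio>
#include <mpi.h>

// Hypothetical stand-in for the phase being benchmarked.
void compute_phase() { /* ... solver, assembly, communication ... */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Synchronize so all ranks start timing the same region together.
    MPI_Barrier(MPI_COMM_WORLD);
    const double t0 = MPI_Wtime();

    compute_phase();

    const double local = MPI_Wtime() - t0;

    // Report the slowest rank's time; it determines wall-clock progress.
    double t_max = 0.0;
    MPI_Reduce(&local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("compute phase: %.3f s (max over ranks)\n", t_max);

    MPI_Finalize();
    return 0;
}
```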
Benchmark Input Design and Warm-Up
Representative Inputs
The performance of many codes depends strongly on input characteristics, such as:
- Geometry and mesh structure
- Sparsity patterns in matrices
- Physical parameters (e.g., Reynolds number)
- Data distribution and skew
Guidelines:
- Include multiple test cases:
- Small / quick sanity run
- Medium / typical user case
- Large / production-scale case
- Avoid benchmarking solely on trivial problem sizes that:
- Fit entirely in cache
- Do not trigger realistic communication patterns
Warm-Up Runs
First runs often include:
- JIT compilations (for some environments)
- File caching effects
- Memory allocation effects
Methods:
- Perform one or more warm-up runs that you do not include in your measurements.
- Or, within a single long run:
- Exclude the first few iterations from timing.
This helps the benchmark reflect steady-state performance.
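A minimal sketch of excluding warm-up iterations within a single run, assuming a hypothetical `time_step()` routine and illustrative iteration counts:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for one iteration of the main compute loop.
void time_step() { /* ... */ }

int main() {
    const int warmup_steps = 5;      // excluded from measurement (illustrative)
    const int measured_steps = 100;  // illustrative

    for (int i = 0; i < warmup_steps; ++i)
        time_step();                 // caches, allocators, file buffers warm up here

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < measured_steps; ++i)
        time_step();
    auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("steady state: %.4f s/step over %d steps\n",
                seconds / measured_steps, measured_steps);
    return 0;
}
```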
Benchmarking Different Configurations
Benchmarking is often used to compare:
- Systems
- Compilers
- Algorithmic variants
- Parallel configurations
Hardware Comparisons
When comparing different systems:
- Keep the following identical, unless intentionally testing differences:
- Problem size
- Software stack (or as similar as possible)
- Numerical settings
- Report:
- Node specifications (CPU model, core count, memory per node, accelerator type)
- Network type (e.g., InfiniBand, Ethernet with given bandwidth)
Interpretation tips:
- If performance scales similarly with problem size and parallelism, but one system is faster by a nearly constant factor, then:
- It may reflect CPU frequency, cache size, or memory bandwidth differences.
- If differences appear only at larger scales:
- They may be due to network or memory subsystem differences.
Software and Compiler Comparisons
When comparing builds:
- Control:
- Same source code revision
- Same problem input
- Same node allocation
- Vary:
- Compiler (e.g., GCC vs Intel vs LLVM)
- Optimization flags (e.g., `-O2` vs `-O3`, vectorization options)
- Libraries (e.g., different BLAS/MPI implementations)
Document:
- Exact build commands used
- Relevant environment variables
- Any non-default runtime parameters for libraries
Algorithmic Variants
When benchmarking algorithmic changes:
- Ensure:
- Same numerical target accuracy or convergence criteria.
- Be explicit if:
- Algorithms converge in a different number of iterations.
- Sometimes you may need to separate:
- Time per iteration
- Total iterations to convergence
This allows fair comparisons when different algorithms have different convergence rates.
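As a hypothetical illustration: solver A might need $0.50$ s per iteration and 200 iterations ($100$ s total), while solver B needs $0.80$ s per iteration but only 80 iterations ($64$ s total); comparing time per iteration alone would rank them the wrong way around.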
Interpreting and Presenting Benchmark Results
Basic Data Handling
For each configuration (e.g., problem size, core count, compiler), you should at least have:
- Number of runs
- Mean or median time
- A measure of variability (std dev, range)
- Derived metrics (speedup, efficiency) where relevant
Avoid:
- Relying on a single “best” run; it may be a lucky outlier.
- Ignoring anomalously slow runs without understanding the cause.
Visualization
Common useful plots:
- Time vs. number of cores/nodes (for strong scaling)
- Speedup vs. number of cores/nodes
- Efficiency vs. number of cores/nodes
- Time vs. problem size (for weak scaling or complexity assessments)
- GFLOP/s vs. problem size (to see asymptotic performance)
Guidelines:
- Use logarithmic scales when spans are large.
- Clearly label:
- Axes (units!)
- Legend (system, compiler, configuration)
- Indicate:
- Error bars when variation is significant.
- Whether results are min/mean/median.
Identifying Performance Regimes and Limits
From benchmarking data, you can often identify regimes:
- Startup-limited:
- Small problem sizes, overhead dominates.
- Compute-bound:
- Performance scales with FLOP rate; good cache use, little communication overhead.
- Memory-bound:
- Time grows with memory traffic; little improvement from higher core counts.
- Communication-bound:
- Strong scaling degrades as communication costs grow with processor count.
- I/O-bound:
- Large datasets make disk/network I/O the dominant factor.
Recognizing which regime you are in informs which optimizations are likely to pay off.
Benchmarking Best Practices
To make your benchmarks robust and useful over time:
- Automate:
- Use scripts or simple workflows to launch benchmarks, collect timing, and post-process results.
- Version everything:
- Record code version (e.g., git commit hash), input version, and environment.
- Keep raw data:
- Store raw timing logs, not just aggregate tables or plots.
- Be explicit about conditions:
- Note any deviations from ideal conditions, such as partial node sharing or known system issues.
- Separate correctness from performance:
- Validate correctness on small/medium cases before any serious benchmarking.
- Document non-obvious choices:
- E.g., why a particular block size or solver was used.
Used systematically, benchmarking becomes a powerful tool to:
- Guide performance optimization efforts
- Justify resource requests and machine choices
- Track performance regressions or improvements over time