What Benchmarking Is (and Is Not)
Benchmarking in HPC is the systematic, repeatable measurement of application performance under controlled conditions.
You are not just running your code once and looking at how long it took. Proper benchmarking aims to:
- Measure performance in a way that is:
- Reproducible
- Comparable across systems, compilers, and configurations
- Informative for optimization and purchasing decisions
- Separate:
- Application behavior from transient system noise
- Performance issues from correctness issues
Benchmarking answers questions like:
- How fast does this application run for a given problem size?
- How does performance change when:
- I use more cores or nodes?
- I change compilers or optimization flags?
- I change algorithms or libraries?
- How efficient is this system compared to another?
It does not replace correctness testing or debugging.
Types of Benchmarks in HPC
Microbenchmarks
Microbenchmarks focus on a very narrow aspect of performance:
- Examples:
- Measuring memory bandwidth (e.g., STREAM benchmark)
- Measuring network latency and bandwidth (e.g., `osu_latency`, `osu_bw`)
- Measuring floating-point peak performance (simple dense matrix multiply kernels)
- Use cases:
- Characterizing hardware limits
- Explaining why an application cannot exceed certain performance levels
- Validating theoretical models (e.g., roofline model limits)
These do not tell you directly how your whole application will perform but give insight into bottlenecks.
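For example, a memory-bandwidth microbenchmark in the spirit of STREAM can be as small as a single "triad" loop. The following is a minimal C++ sketch, not the official STREAM benchmark; the array size, repeat count, and bandwidth accounting are illustrative choices:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Minimal STREAM-like "triad" sketch: a[i] = b[i] + s * c[i].
// Array size and repeat count are illustrative; a real microbenchmark
// (e.g., STREAM) fixes these carefully and verifies the results.
int main() {
    const std::size_t n = 1 << 24;   // ~16M doubles per array (illustrative)
    const int repeats = 10;
    const double s = 3.0;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    // Count read b, read c, write a: 3 arrays of 8-byte doubles per element.
    double gbytes = 3.0 * 8.0 * double(n) * repeats / 1e9;
    std::printf("triad: %.1f GB/s (check value %.1f)\n", gbytes / seconds, a[n / 2]);
    return 0;
}
```

Comparing the measured number against the node's nominal memory bandwidth gives a quick sense of the ceiling a memory-bound application can reach.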
Kernel / Mini-Application Benchmarks
Here you benchmark a representative kernel or mini-app that captures the dominant computation and communication pattern of your full application, but in a simplified form.
- Examples:
- A stencil update kernel from a CFD code
- A sparse matrix-vector multiply from a larger PDE solver
- Community mini-apps like:
- LULESH (hydrodynamics)
- miniMD (molecular dynamics)
Use cases:
- Rapid experimentation with algorithms or libraries
- Tuning for specific architectures without porting the full codebase
- Comparing systems with application-like behavior but less complexity
Full Application Benchmarks
This is your actual production application (or a close equivalent), run with realistic input data and configuration.
Characteristics:
- Include:
- I/O
- Communication
- Load imbalance
- Algorithmic setup phases
- Reflect “end-to-end” performance that real users care about
Use cases:
- Tuning production runs
- Capacity planning on clusters
- System acceptance testing using domain-specific codes
Synthetic vs Realistic Workloads
- Synthetic: Artificial inputs designed to stress specific aspects (e.g., worst-case communication, maximum memory usage).
- Realistic: Inputs that reflect actual scientific or industrial workloads.
Both have value:
- Synthetic workloads can reveal corner-case bottlenecks.
- Realistic workloads tell you what actually matters operationally.
Designing a Meaningful Benchmark
A useful benchmark must be:
- Well-defined
- Repeatable
- Relevant
Define the Benchmark Scenario Clearly
Specify:
- Problem type: e.g., 3D heat equation, molecular system with X atoms, matrix size $N \times N$
- Input data:
- How it is generated, or
- Where it is obtained (datasets, scripts)
- Numerical settings:
- Tolerances
- Iteration limits
- Solver/preconditioner choices
- Parallel configuration:
- Number of nodes
- Tasks (MPI ranks) per node
- Threads per task
- GPU usage if applicable
- Software environment:
- Compiler and version
- Key libraries and versions
- Important environment variables (e.g., thread affinity, OpenMP settings)
All of this should be documented so that someone else (or you in 6 months) can reproduce the benchmark.
Choose Metrics That Matter
Common metrics include:
- Wall-clock time $T$:
- Total time from start to end of the relevant computation
- Throughput:
- Jobs per hour
- Time steps per second
- Iterations per second
- Floating-point performance:
- FLOP/s or GFLOP/s (giga floating-point operations per second)
- Requires knowing or estimating operation counts
- Speedup:
- $$S(p) = \frac{T(1)}{T(p)}$$
- Where $T(1)$ is the baseline time and $T(p)$ the time on $p$ processing elements; a worked sketch computing $S(p)$ and $E(p)$ appears at the end of this subsection
- Parallel efficiency:
- $$E(p) = \frac{S(p)}{p} = \frac{T(1)}{p \cdot T(p)}$$
- Resource efficiency:
- Performance per core, per node, or per watt (if energy data available)
- I/O performance:
- MB/s read/write during relevant phases
- Memory footprint:
- Peak memory usage per process, per node
Pick metrics aligned with your goals:
- For production work: turnaround time and cost per simulation.
- For scaling studies: speedup and efficiency.
- For procurement: performance per node or per dollar.
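To make the derived metrics concrete, the sketch below applies the speedup and efficiency formulas above to a set of measured wall-clock times; the hard-coded numbers are hypothetical placeholders for your own measurements:

```cpp
#include <cstdio>
#include <vector>

// Derive speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p
// from measured wall-clock times. The times below are hypothetical.
int main() {
    struct Run { int p; double t_seconds; };
    const std::vector<Run> runs = { {1, 1200.0}, {2, 640.0}, {4, 340.0}, {8, 190.0} };

    const double t1 = runs.front().t_seconds;   // baseline time T(1)
    std::printf("%6s %12s %8s %8s\n", "p", "T(p) [s]", "S(p)", "E(p)");
    for (const Run& r : runs) {
        const double speedup = t1 / r.t_seconds;
        const double efficiency = speedup / r.p;
        std::printf("%6d %12.1f %8.2f %8.2f\n", r.p, r.t_seconds, speedup, efficiency);
    }
    return 0;
}
```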
Control Variables vs Tunables
Clearly separate:
- Fixed parameters (controlled):
- Problem size
- Numerical fidelity
- Hardware platform
- Tunables (what you are systematically varying):
- Number of cores/nodes
- Compiler flags
- Libraries (e.g., different BLAS implementations)
- Algorithmic choices (e.g., solver A vs solver B)
Change one category of tunables at a time to isolate their impact.
Strong and Weak Scaling Benchmarks
Scaling concepts themselves are covered elsewhere, but here is how they shape benchmarking setups.
Strong Scaling Benchmarks
Goal: measure how time decreases as you increase resources for a fixed problem size.
Setup:
- Fix the global problem size $N$.
- Run on $p = 1, 2, 4, 8, \dots$ processors (cores/nodes).
- Measure:
- $T(p)$
- Speedup $S(p)$ and efficiency $E(p)$
Use cases:
- Understand how fast this job can run if given more resources.
- Identify at what scale adding more resources stops being worth it.
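As a hypothetical illustration: if $T(1) = 1200$ s, $T(32) = 60$ s, and $T(64) = 45$ s, then $E(32) = \frac{1200}{32 \cdot 60} \approx 0.63$ while $E(64) = \frac{1200}{64 \cdot 45} \approx 0.42$, suggesting that for this problem size going beyond 32 processors is no longer a good use of resources.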
Weak Scaling Benchmarks
Goal: measure how time behaves as you increase resources and increase problem size so that work per resource stays constant.
Setup:
- Fix the problem size per process, e.g., $N_{\text{local}}$.
- Total work $N_{\text{global}}$ scales with $p$.
- Ideally, $T(p)$ remains close to constant.
Use cases:
- Understand how well your application handles larger and larger global problems.
- Evaluate scalability for large production runs.
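A common way to summarize this (a convention assumed here, not defined above) is the weak-scaling efficiency $E_{\text{weak}}(p) = \frac{T(1)}{T(p)}$, which stays close to 1 under ideal weak scaling; for example, with constant work per process, $T(1) = 100$ s and $T(64) = 125$ s give $E_{\text{weak}}(64) = 0.8$.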
Controlling the Benchmark Environment
To make benchmarks comparable and interpretable, the environment must be as stable as possible.
System Load and Interference
Cluster conditions vary:
- Other users’ jobs may:
- Share network links
- Contend for I/O bandwidth
- Induce OS noise
Mitigation strategies:
- Run benchmarks when the system is less loaded (e.g., off-peak hours).
- Request dedicated nodes or exclusive access if the scheduler supports it:
- For example (conceptually): `--exclusive` in SLURM job scripts.
- Avoid mixing benchmarks with other heavy tasks from your own jobs on the same nodes.
Process and Thread Affinity
To reduce variability:
- Pin processes/threads to cores consistently:
- Use your scheduler’s binding options.
- Use threading runtime options (e.g., OpenMP controls).
- Keep the mapping (e.g., ranks per node, threads per rank) documented and consistent across runs.
Affinity impacts:
- Cache utilization
- NUMA effects (local vs remote memory)
- Measured performance variation
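One way to confirm that the pinning you requested is actually in effect is to have each thread report where it runs. The sketch below is a Linux-specific example assuming OpenMP and the glibc `sched_getcpu()` call (compile with OpenMP enabled, e.g., `g++ -fopenmp`):

```cpp
#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc-specific

// Print which CPU each OpenMP thread is running on, to sanity-check
// the binding settings actually in effect during a benchmark run.
int main() {
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        const int cpu = sched_getcpu();   // returns -1 on failure
        #pragma omp critical
        std::printf("thread %2d running on CPU %d\n", tid, cpu);
    }
    return 0;
}
```

Running it under different binding settings, such as the standard OpenMP variables `OMP_PROC_BIND` and `OMP_PLACES`, makes it easy to record the mapping that was actually used.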
Repeating Runs and Statistical Treatment
Due to noise, single measurements are often misleading.
Basic practice:
- Repeat each configuration several times (e.g., 5–10 runs).
- Compute:
- Mean run time $\bar{T}$
- Standard deviation or at least min/max
Interpretation:
- Large variance suggests:
- Uncontrolled environmental factors
- I/O or network contention
- When reporting results, prefer:
- Mean plus variance, or
- Median (robust to outliers)
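A minimal sketch of this statistical treatment, assuming the timings have already been collected (the values below are placeholders), might look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Summarize repeated benchmark timings: mean, sample std dev, median, min/max.
// The timings below are hypothetical placeholders.
int main() {
    std::vector<double> t = {101.2, 99.8, 100.5, 107.9, 100.1};  // seconds

    double mean = 0.0;
    for (double x : t) mean += x;
    mean /= t.size();

    double var = 0.0;
    for (double x : t) var += (x - mean) * (x - mean);
    var /= (t.size() - 1);                 // sample variance
    const double stddev = std::sqrt(var);

    std::sort(t.begin(), t.end());
    const std::size_t m = t.size() / 2;
    const double median = (t.size() % 2 == 1) ? t[m] : 0.5 * (t[m - 1] + t[m]);

    std::printf("runs=%zu mean=%.2f s stddev=%.2f s median=%.2f s min=%.2f s max=%.2f s\n",
                t.size(), mean, stddev, median, t.front(), t.back());
    return 0;
}
```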
Instrumentation and Timing Techniques
Benchmarking requires reliable timing for the portions of the code that matter.
What to Time
Avoid timing everything from process start to exit if:
- There is significant setup overhead (reading large inputs, pre-processing).
- You want to separate:
- Initialization
- Core compute
- Output / checkpointing
Common practice:
- Time:
- The main compute loop
- Specific phases (solver, assembly, communication)
- Also record:
- Overall wall-clock time from job start, to capture “end-to-end” behavior separately.
Timing Tools and APIs
Depending on your application and environment, you might use:
- High-level tools (external):
- Time command wrappers (e.g., `/usr/bin/time`) for full-application timing.
- Job scheduler accounting for start/end times.
- In-code timing:
- Language-specific timers (e.g., `std::chrono` in C++, high-resolution timers in other languages).
- MPI timing functions for distributed regions:
- `MPI_Wtime()` across all processes for the same region of code.
- Library timers:
- Some numerical libraries expose internal timing information or logging.
Consistency is more important than the specific API chosen. Always:
- Use the same timing method for all runs of a given benchmark.
- Ensure the timed regions are well-defined and documented.
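As one example of a well-defined timed region, the sketch below times a single phase of an MPI program with `MPI_Wtime()`, synchronizing first and reporting the maximum over ranks (a common convention, since the slowest rank determines wall-clock progress). Here `compute_phase()` is a hypothetical stand-in for your solver or assembly step; in a serial code, `std::chrono::steady_clock` plays the same role.

```cpp
#include <cstdio>
#include <mpi.h>

// Hypothetical stand-in for the phase being benchmarked.
void compute_phase() { /* ... solver, assembly, communication ... */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Synchronize so all ranks start timing the same region together.
    MPI_Barrier(MPI_COMM_WORLD);
    const double t0 = MPI_Wtime();

    compute_phase();

    const double local = MPI_Wtime() - t0;

    // Report the slowest rank's time; it determines wall-clock progress.
    double t_max = 0.0;
    MPI_Reduce(&local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("compute phase: %.3f s (max over ranks)\n", t_max);

    MPI_Finalize();
    return 0;
}
```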
Benchmark Input Design and Warm-Up
Representative Inputs
The performance of many codes depends strongly on input characteristics, such as:
- Geometry and mesh structure
- Sparsity patterns in matrices
- Physical parameters (e.g., Reynolds number)
- Data distribution and skew
Guidelines:
- Include multiple test cases:
- Small / quick sanity run
- Medium / typical user case
- Large / production-scale case
- Avoid benchmarking solely on trivial problem sizes that:
- Fit entirely in cache
- Do not trigger realistic communication patterns
Warm-Up Runs
First runs often include:
- JIT compilations (for some environments)
- File caching effects
- Memory allocation effects
Methods:
- Perform one or more warm-up runs that you do not include in your measurements.
- Or, within a single long run:
- Exclude the first few iterations from timing.
This helps the benchmark reflect steady-state performance.
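A minimal sketch of excluding warm-up iterations within a single run, assuming a hypothetical `time_step()` routine and illustrative iteration counts:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for one iteration of the main compute loop.
void time_step() { /* ... */ }

int main() {
    const int warmup_steps = 5;      // excluded from measurement (illustrative)
    const int measured_steps = 100;  // illustrative

    for (int i = 0; i < warmup_steps; ++i)
        time_step();                 // caches, allocators, file buffers warm up here

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < measured_steps; ++i)
        time_step();
    auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("steady state: %.4f s/step over %d steps\n",
                seconds / measured_steps, measured_steps);
    return 0;
}
```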
Benchmarking Different Configurations
Benchmarking is often used to compare:
- Systems
- Compilers
- Algorithmic variants
- Parallel configurations
Hardware Comparisons
When comparing different systems:
- Keep the following identical, unless intentionally testing differences:
- Problem size
- Software stack (or as similar as possible)
- Numerical settings
- Report:
- Node specifications (CPU model, core count, memory per node, accelerator type)
- Network type (e.g., InfiniBand, Ethernet with given bandwidth)
Interpretation tips:
- If performance scales similarly with problem size and parallelism, but one system is faster by a nearly constant factor, then:
- It may reflect CPU frequency, cache size, or memory bandwidth differences.
- If differences appear only at larger scales:
- They may be due to network or memory subsystem differences.
Software and Compiler Comparisons
When comparing builds:
- Control:
- Same source code revision
- Same problem input
- Same node allocation
- Vary:
- Compiler (e.g., GCC vs Intel vs LLVM)
- Optimization flags (e.g., `-O2` vs `-O3`, vectorization options)
- Libraries (e.g., different BLAS/MPI implementations)
Document:
- Exact build commands used
- Relevant environment variables
- Any non-default runtime parameters for libraries
Algorithmic Variants
When benchmarking algorithmic changes:
- Ensure:
- Same numerical target accuracy or convergence criteria.
- Be explicit if:
- Algorithms converge in a different number of iterations.
- Sometimes you may need to separate:
- Time per iteration
- Total iterations to convergence
This allows fair comparisons when different algorithms have different convergence rates.
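As a hypothetical illustration: solver A might need $0.50$ s per iteration and 200 iterations ($100$ s total), while solver B needs $0.80$ s per iteration but only 80 iterations ($64$ s total); comparing time per iteration alone would rank them the wrong way around.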
Interpreting and Presenting Benchmark Results
Basic Data Handling
For each configuration (e.g., problem size, core count, compiler), you should at least have:
- Number of runs
- Mean or median time
- A measure of variability (std dev, range)
- Derived metrics (speedup, efficiency) where relevant
Avoid:
- Relying on a single “best” run; it may be a lucky outlier.
- Ignoring anomalously slow runs without understanding the cause.
Visualization
Common useful plots:
- Time vs. number of cores/nodes (for strong scaling)
- Speedup vs. number of cores/nodes
- Efficiency vs. number of cores/nodes
- Time vs. problem size (for weak scaling or complexity assessments)
- GFLOP/s vs. problem size (to see asymptotic performance)
Guidelines:
- Use logarithmic scales when spans are large.
- Clearly label:
- Axes (units!)
- Legend (system, compiler, configuration)
- Indicate:
- Error bars when variation is significant.
- Whether results are min/mean/median.
Identifying Performance Regimes and Limits
From benchmarking data, you can often identify regimes:
- Startup-limited:
- Small problem sizes, overhead dominates.
- Compute-bound:
- Performance scales with FLOP rate; good cache use, little communication overhead.
- Memory-bound:
- Time grows with memory traffic; little improvement from higher core counts.
- Communication-bound:
- Strong scaling degrades as communication costs grow with processor count.
- I/O-bound:
- Large datasets make disk/network I/O the dominant factor.
Recognizing which regime you are in informs which optimizations are likely to pay off.
Benchmarking Best Practices
To make your benchmarks robust and useful over time:
- Automate:
- Use scripts or simple workflows to launch benchmarks, collect timing, and post-process results.
- Version everything:
- Record code version (e.g., git commit hash), input version, and environment.
- Keep raw data:
- Store raw timing logs, not just aggregate tables or plots.
- Be explicit about conditions:
- Note any deviations from ideal conditions, such as partial node sharing or known system issues.
- Separate correctness from performance:
- Validate correctness on small/medium cases before any serious benchmarking.
- Document non-obvious choices:
- E.g., why a particular block size or solver was used.
Used systematically, benchmarking becomes a powerful tool to:
- Guide performance optimization efforts
- Justify resource requests and machine choices
- Track performance regressions or improvements over time