Why and How We Measure Performance
In HPC, “fast” is not a feeling; it is a measurable quantity. Measuring performance means turning runtime behavior into numbers you can compare, track, and optimize.
Performance measurement sits between raw execution and deep profiling: it answers “how fast/how efficient?” before “why?”. More advanced profiling tools and optimization strategies are covered in their own chapters; here we focus on what to measure and how to measure it in a simple, systematic way.
Key Performance Metrics
Wall-Clock Time
The most basic metric is wall-clock time: the elapsed real time from start to finish of your program or a specific region.
- Total runtime (end-to-end)
- Time for critical phases (e.g., initialization, computation, I/O)
You typically want:
- A baseline time (unoptimized or reference)
- Timings under different configurations (input size, cores, nodes, compiler flags, etc.)
At a high level, runtime is:
$$
T_\text{total} = T_\text{compute} + T_\text{communication} + T_\text{I/O} + T_\text{other}
$$
Performance analysis and later optimization try to reduce these components.
Throughput and Work Rate
Runtime alone is less meaningful without context. You need to relate time to “work done”:
- Grid updates per second
- Particles processed per second
- Time steps per day
- Iterations per second
A simple way to define a work rate is:
$$
\text{Throughput} = \frac{\text{Amount of work}}{\text{Elapsed time}}
$$
This helps you compare performance for different problem sizes or even across different applications that perform similar tasks.
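As an illustration with made-up numbers: a stencil code that updates a $512^3$ grid for 200 time steps in 40 s of wall-clock time achieves
$$
\text{Throughput} = \frac{512^3 \times 200\ \text{updates}}{40\ \text{s}} \approx 6.7 \times 10^{8}\ \text{updates/s}
$$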
FLOP Rate (Floating-Point Performance)
In numerical HPC codes, a common metric is floating-point operations per second (FLOP/s), usually reported in:
- MFLOP/s (millions per second)
- GFLOP/s (billions per second)
- TFLOP/s (trillions per second), etc.
If your algorithm performs $N_\text{flop}$ floating-point operations and the execution time is $T$ seconds:
$$
\text{FLOP/s} = \frac{N_\text{flop}}{T}
$$
Often you care about efficiency relative to theoretical peak:
$$
\text{Efficiency} = \frac{\text{Achieved FLOP/s}}{\text{Peak FLOP/s}} \times 100\%
$$
Peak FLOP/s depends on hardware and is typically given by vendor specs or derived from core count, frequency, and vector width.
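For example, with illustrative (hypothetical) hardware numbers: a CPU with 16 cores at 2.5 GHz, where each core can execute two 4-wide fused multiply-add (FMA) instructions per cycle (i.e., 16 double-precision FLOPs per cycle per core), has a theoretical peak of
$$
\text{Peak} = 16\ \text{cores} \times 2.5 \times 10^{9}\ \frac{\text{cycles}}{\text{s}} \times 16\ \frac{\text{FLOP}}{\text{cycle} \cdot \text{core}} = 640\ \text{GFLOP/s}
$$
A kernel sustaining 64 GFLOP/s on such a chip would thus run at 10% of peak, which is not unusual for memory-bound code.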
Bandwidth and I/O Throughput
Many HPC applications are limited not by arithmetic but by data movement:
- Memory bandwidth (GB/s between CPU and RAM)
- Interconnect bandwidth (GB/s between nodes)
- Storage throughput (MB/s or GB/s to/from filesystem)
If your program transfers $V$ bytes in time $T$:
$$
\text{Bandwidth} = \frac{V}{T}
$$
Again, you can compare against a theoretical or measured peak bandwidth to understand how close you are to hardware limits.
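As a minimal sketch of what such a measurement can look like, the C program below times a plain array copy and reports the effective bandwidth, assuming the copy moves $2 \times N \times \text{sizeof(double)}$ bytes (one read plus one write per element). This is only an illustration; dedicated benchmarks such as STREAM measure memory bandwidth far more carefully.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Current time in seconds from a monotonic clock. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const size_t n = 1u << 25;                 /* 32 Mi doubles = 256 MiB per array */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    if (!a || !b) { fprintf(stderr, "allocation failed\n"); return 1; }

    for (size_t i = 0; i < n; i++) a[i] = (double)i;   /* first touch / warm-up */

    double t0 = now();
    for (size_t i = 0; i < n; i++) b[i] = a[i];        /* timed copy */
    double t1 = now();

    /* One read and one write per element. */
    double bytes = 2.0 * (double)n * sizeof(double);
    printf("moved %.0f MB in %.4f s -> %.2f GB/s (b[0] = %g)\n",
           bytes / 1e6, t1 - t0, bytes / (t1 - t0) / 1e9, b[0]);

    free(a);
    free(b);
    return 0;
}
```

Printing an element of the destination array discourages the compiler from optimizing the copy away; comparing the reported GB/s against the platform's nominal memory bandwidth tells you how close the copy comes to the hardware limit.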
Parallel Speedup and Efficiency (Briefly)
Parallel performance metrics are covered in more depth elsewhere, but they are central to measurement:
- Speedup when using $p$ processors:
$$
S(p) = \frac{T(1)}{T(p)}
$$
where $T(1)$ is time on a single core or baseline, and $T(p)$ is time on $p$ cores.
- Parallel efficiency:
$$
E(p) = \frac{S(p)}{p}
$$
Measuring $T(p)$ accurately and consistently is the foundation for meaningful scaling studies.
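For example, with purely illustrative numbers: if the baseline takes $T(1) = 1200$ s and the same problem takes $T(16) = 100$ s on 16 cores, then
$$
S(16) = \frac{1200}{100} = 12, \qquad E(16) = \frac{12}{16} = 75\%
$$
i.e., the code runs 12 times faster but uses each of the 16 cores at only 75% effectiveness.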
What Exactly Should You Time?
End-to-End vs. Region-Based Measurements
You rarely want only a single “program took X seconds” number.
Common levels:
- Total runtime: useful for user-perceived performance and resource planning.
- Phase runtimes: initialization, main compute loop, I/O, communication phases, post-processing.
- Kernel-level timing: specific loops or computational kernels you suspect dominate runtime.
This decomposition helps you:
- Identify where time is actually spent (computation vs. I/O vs. communication).
- Focus optimization efforts where they matter.
Warm-Up and Measurement Noise
Real systems exhibit noise:
- The first run may be slower (cold caches, filesystem metadata lookups, JIT compilation, etc.).
- Other users’ jobs can interfere on shared systems.
- Power management or turbo modes may vary CPU frequency.
To mitigate:
- Discard the first run or treat it separately.
- Run each configuration multiple times (e.g., 3–10).
- Report statistics (min, median, mean; often the minimum is a good approximation of least-contended performance).
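A minimal sketch of this procedure, assuming the region of interest is wrapped in a hypothetical function `run_kernel()`: perform one warm-up run, then time several repetitions and report the minimum and the mean.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder for the code region you actually want to measure. */
static void run_kernel(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 50 * 1000 * 1000; i++) s += 1e-9 * i;
}

/* Current time in seconds from a monotonic clock. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    enum { NRUNS = 5 };
    double t[NRUNS];

    run_kernel();                       /* warm-up run, not recorded */

    for (int r = 0; r < NRUNS; r++) {   /* repeated measurements */
        double t0 = now();
        run_kernel();
        t[r] = now() - t0;
    }

    double tmin = t[0], tsum = 0.0;
    for (int r = 0; r < NRUNS; r++) {
        if (t[r] < tmin) tmin = t[r];
        tsum += t[r];
    }
    printf("min = %.4f s, mean = %.4f s over %d runs\n",
           tmin, tsum / NRUNS, NRUNS);
    return 0;
}
```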
Practical Timing Techniques in HPC
Timing from the Command Line
For coarse, whole-program measurement you can use shell tools.
The `time` Command
Most systems provide a `time` utility that reports:
- real: wall-clock time
- user: CPU time spent in user code
- sys: CPU time spent in kernel (system) calls
Example usage:
`time ./my_program input.dat`
You can also use `/usr/bin/time -v` (the standalone utility, often more detailed than the shell built-in) to see additional metrics such as maximum resident set size (memory), page faults, and more.
Use command-line timing for:
- Quick comparisons (different compilers, flags).
- End-to-end runtime under various core counts or nodes (in combination with the job scheduler).
In-Code Timing: Instrumenting Applications
To understand where time is spent inside your program, you insert timing calls around code regions.
Basic Principles
- Use a high-resolution, monotonic clock (does not go backward).
- Minimize intrusiveness (timing should not significantly change performance).
- Use a consistent timing API across all your measurements.
Examples of Common Timing APIs
You will see different timing calls depending on language and parallel programming model. At a conceptual level, all do the same thing:
- Get current time (as a double or timestamp).
- Run the code section.
- Get time again.
- Compute difference.
For example, conceptually:
t_start = now()
# ... code to be measured ...
t_end = now()
elapsed = t_end - t_start
In MPI codes, it is common to use the MPI-provided timer, `MPI_Wtime`, to ensure consistency across processes. Threaded models or GPU models have their own timing functions, discussed in their specific chapters.
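For example, a minimal MPI sketch using `MPI_Wtime` (the `compute_step()` function is a hypothetical stand-in for your own code):

```c
#include <stdio.h>
#include <mpi.h>

/* Hypothetical stand-in for the work you want to time. */
static void compute_step(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 10 * 1000 * 1000; i++) s += 1e-9 * i;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_start = MPI_Wtime();   /* wall-clock time in seconds */
    compute_step();
    double t_end = MPI_Wtime();

    if (rank == 0)
        printf("compute_step took %.6f s on rank 0\n", t_end - t_start);

    MPI_Finalize();
    return 0;
}
```

Compiled with an MPI compiler wrapper (e.g., `mpicc`) and launched with `mpirun` or `srun`, this reports the time measured on rank 0 only; the next subsection discusses how to account for all ranks.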
Timing Parallel Regions
An important detail in parallel programs: which process or thread reports what?
Guidelines:
- For MPI: often measure time on rank 0, but ensure all ranks reach synchronization points (e.g., with `MPI_Barrier`) so measurements correspond to the full global phase.
- For OpenMP: collect timing once per region, typically in the master thread.
- For hybrid codes: combine approaches carefully; decide whether you want node-level or global timings.
Make sure the measured region encompasses all relevant work, including communication and synchronization, not just the local computation.
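One common pattern is sketched below (again with a hypothetical `compute_step()` standing in for the measured phase): synchronize all ranks before starting the timer, then reduce the per-rank times with a maximum, since the slowest rank determines the duration of the global phase.

```c
#include <stdio.h>
#include <mpi.h>

/* Hypothetical stand-in for the phase being measured (compute + communication). */
static void compute_step(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 10 * 1000 * 1000; i++) s += 1e-9 * i;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* all ranks start together */
    double t_start = MPI_Wtime();

    compute_step();                       /* the phase of interest */

    double t_local = MPI_Wtime() - t_start;

    /* The slowest rank defines the duration of the global phase. */
    double t_max;
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("phase time: %.6f s (max over all ranks)\n", t_max);

    MPI_Finalize();
    return 0;
}
```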
Measuring Performance at Scale
Strong and Weak Scaling Experiments
Scaling concepts are detailed elsewhere; here is how they connect to measurement.
To evaluate parallel performance, you run systematic experiments:
- Strong scaling:
- Fix the problem size.
- Measure runtime $T(p)$ for different process counts $p$.
- Derive speedup $S(p)$ and efficiency $E(p)$.
- Weak scaling:
- Increase problem size in proportion to $p$.
- Measure how runtime changes (ideally remains approximately constant).
For each data point:
- Choose the job configuration (nodes, tasks per node, threads).
- Run the application multiple times.
- Use a consistent timing methodology (either inside the code or via scheduler logs / `time`).
- Record:
- Input size / problem parameters.
- Number of processes/threads.
- Wall-clock time of the primary work phase.
- Any additional metrics of interest (iterations, time steps, etc.).
Plotting:
- $T(p)$ vs. $p$ (runtime).
- $S(p)$ vs. $p$ (speedup).
- $E(p)$ vs. $p$ (efficiency).
These plots are more informative than standalone numbers and form the basis of a performance study.
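As a small sketch of how speedup and efficiency are derived from the measured runtimes (the process counts and times below are hypothetical placeholders, not real measurements):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical measured data: process counts and wall-clock times in seconds.
       Replace these with your own measurements. */
    const int    p[] = { 1, 2, 4, 8, 16 };
    const double t[] = { 1200.0, 640.0, 330.0, 180.0, 100.0 };
    const int    n   = (int)(sizeof p / sizeof p[0]);

    printf("%6s %12s %10s %12s\n", "p", "T(p) [s]", "S(p)", "E(p) [%]");
    for (int i = 0; i < n; i++) {
        double speedup    = t[0] / t[i];          /* S(p) = T(1) / T(p) */
        double efficiency = speedup / p[i];       /* E(p) = S(p) / p    */
        printf("%6d %12.1f %10.2f %12.1f\n",
               p[i], t[i], speedup, efficiency * 100.0);
    }
    return 0;
}
```

The resulting table is exactly what you would plot as $S(p)$ vs. $p$ and $E(p)$ vs. $p$.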
Averaging and Reporting
For each configuration, you may have several runs. Common practices:
- Report the minimum time to indicate best-case performance under low noise.
- Optionally report mean and standard deviation to show variability.
- If outliers occur (e.g., an unusually slow run due to filesystem hiccups), document and decide whether to discard them with justification.
When reporting strong/weak scaling:
- Clearly state:
- Hardware (CPU model, cores, memory, interconnect).
- Software environment (compiler and version, libraries, optimization flags).
- Exact run configuration (nodes, processes, threads, binding if relevant).
This is essential for reproducibility and fair comparison.
Common Pitfalls in Performance Measurement
Measuring the Wrong Thing
Typical mistakes:
- Including setup that is irrelevant to the algorithm’s steady-state performance (e.g., heavy one-off preprocessing) without separating it.
- Timing only a small toy problem where overheads dominate and behavior differs from production runs.
- Measuring only wall time once and ignoring variability.
Mitigation:
- Distinguish between initialization, compute, and finalization.
- Use realistic problem sizes.
- Repeat runs and look at variability.
Timer Placement Errors
Incorrect timer placement leads to misleading conclusions:
- Starting the timer before processes have synchronized: late-arriving processes can distort the measured durations.
- Stopping the timer too early (e.g., not including all communication or I/O).
- Double-counting or missing parts of the workload.
Mitigation:
- Carefully decide the region of interest.
- Use barriers or other synchronization constructs when needed to ensure all parallel workers have completed.
Perturbing the Program
Excessive timing can change behavior:
- Printing from all processes or threads (especially inside loops) drastically distorts performance.
- Very fine-grained timing in tight loops can introduce overhead larger than the code being measured.
Mitigation:
- Time coarse-grained regions first.
- Limit prints to a small subset of processes (e.g., rank 0) and to infrequent events.
- Use more advanced profiling tools (discussed elsewhere) when you need fine-grained details.
Integrating Measurement into Your Workflow
Establishing a Baseline
Before making any optimization:
- Choose a representative input.
- Measure:
- Total runtime.
- Critical phase runtimes.
- Basic throughput (e.g., iterations per second).
- Record system and environment information.
This baseline is your reference for all future changes.
Iterative Improvement
For each change (code modification, compiler flag, different node count):
- Rebuild with a clearly labeled version.
- Repeat the same measurement procedure.
- Compare against the baseline:
- Did performance improve?
- By how much, in percent (see the formula below)?
- Is the improvement consistent across runs?
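A simple way to quantify the change relative to the baseline is:
$$
\text{Improvement} = \frac{T_\text{baseline} - T_\text{new}}{T_\text{baseline}} \times 100\%
$$
Equivalently, you can report the speedup factor $T_\text{baseline} / T_\text{new}$.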
Never assume that a theoretically “better” optimization actually speeds up your code without measuring.
Automation
As projects grow, manual measurement becomes tedious and error-prone. Simple automation helps:
- Shell scripts to:
- Run the program at different core counts or input sizes.
- Collect and parse timing output into a table.
- Use environment variables or command-line options in your code (e.g., enabling/disabling timers or printing only summary statistics).
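As an example of the last bullet, here is a minimal sketch that toggles region timing via an environment variable; the variable name `MYAPP_TIMERS` and the function `run_phase()` are hypothetical placeholders.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a phase of your application. */
static void run_phase(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 20 * 1000 * 1000; i++) s += 1e-9 * i;
}

/* Current time in seconds from a monotonic clock. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* Enable timing only when MYAPP_TIMERS=1 is set in the environment,
       e.g. in the job script:  MYAPP_TIMERS=1 ./my_program input.dat */
    const char *env = getenv("MYAPP_TIMERS");
    int timers_on = (env != NULL && strcmp(env, "1") == 0);

    double t0 = timers_on ? now() : 0.0;
    run_phase();
    if (timers_on)
        printf("run_phase: %.4f s\n", now() - t0);

    return 0;
}
```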
Later chapters on performance analysis and optimization will integrate these measurements with profiling and more sophisticated tools, but the underlying pattern remains: measure, change, measure again.
Summary
- Performance measurement turns program behavior into quantitative metrics like wall time, throughput, FLOP/s, and bandwidth.
- Measure not only total runtime but also key phases and kernels.
- Use appropriate timing tools:
- External (`time`) for whole-program measurements.
- In-code timers for detailed region-level measurements.
- For parallel programs, design careful scaling studies (strong/weak) and derive speedup and efficiency from measured runtimes.
- Avoid common pitfalls: timing the wrong regions, poor timer placement, or perturbing performance with excessive output.
- Always establish a baseline, apply changes systematically, and re-measure under controlled, repeatable conditions.
These measurement practices are the foundation on which profiling, low-level optimization, and sophisticated performance tools build in the rest of the performance analysis and optimization material.