
Measuring Performance

Why and How We Measure Performance

In HPC, “fast” is not a feeling; it is a measurable quantity. Measuring performance means turning runtime behavior into numbers you can compare, track, and optimize.

Performance measurement sits between raw execution and deep profiling: it answers “how fast/how efficient?” before “why?”. More advanced profiling tools and optimization strategies are covered in their own chapters; here we focus on what to measure and how to measure it in a simple, systematic way.


Key Performance Metrics

Wall-Clock Time

The most basic metric is wall-clock time: the elapsed real time from start to finish of your program or a specific region.

You typically want both the end-to-end time of the whole run and the elapsed time of specific regions (initialization, main compute loop, I/O), measured the same way every time.

At a high level, runtime is:

$$
T_\text{total} = T_\text{compute} + T_\text{communication} + T_\text{I/O} + T_\text{other}
$$

Performance analysis identifies which of these components dominates; optimization then tries to reduce it.

Throughput and Work Rate

Runtime alone is hard to interpret without context; you need to relate time to the work done, for example iterations completed, time steps advanced, grid cells updated, or particles processed.

A simple way to define a work rate is:

$$
\text{Throughput} = \frac{\text{Amount of work}}{\text{Elapsed time}}
$$

This helps you compare performance for different problem sizes or even across different applications that perform similar tasks.
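
For example, with purely hypothetical numbers: a solver that advances 5,000 time steps of a $10^6$-cell mesh in 200 s sustains

$$
\text{Throughput} = \frac{5000 \times 10^{6}\ \text{cell updates}}{200\ \text{s}} = 2.5 \times 10^{7}\ \text{cell updates per second}.
$$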

FLOP Rate (Floating-Point Performance)

In numerical HPC codes, a common metric is floating-point operations per second (FLOP/s), usually reported in GFLOP/s ($10^9$ FLOP/s) or TFLOP/s ($10^{12}$ FLOP/s).

If your algorithm performs $N_\text{flop}$ floating-point operations and the execution time is $T$ seconds:

$$
\text{FLOP/s} = \frac{N_\text{flop}}{T}
$$

Often you care about efficiency relative to theoretical peak:

$$
\text{Efficiency} = \frac{\text{Achieved FLOP/s}}{\text{Peak FLOP/s}} \times 100\%
$$

Peak FLOP/s depends on hardware and is typically given by vendor specs or derived from core count, frequency, and vector width.
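
As a hypothetical worked example (cores × clock frequency × FLOPs per cycle; the per-cycle FLOP count depends on the microarchitecture): a CPU with 16 cores at 2.5 GHz that can retire 32 double-precision FLOPs per core per cycle has a nominal peak of

$$
16 \times 2.5 \times 10^{9} \times 32 = 1.28 \times 10^{12}\ \text{FLOP/s} = 1.28\ \text{TFLOP/s}.
$$

A code sustaining 160 GFLOP/s on such a machine runs at $160 / 1280 = 12.5\%$ of peak, which is not unusual for memory-bound kernels.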

Bandwidth and I/O Throughput

Many HPC applications are limited not by arithmetic but by data movement: memory bandwidth within a node, network bandwidth between nodes, and the throughput of the file system.

If your program transfers $V$ bytes in time $T$:

$$
\text{Bandwidth} = \frac{V}{T}
$$

Again, you can compare against a theoretical or measured peak bandwidth to understand how close you are to hardware limits.
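
As an illustration, the following C sketch measures the effective bandwidth of a plain array copy; the array size and the single repetition are arbitrary choices, and on a real system you would compare the result against a dedicated benchmark such as STREAM or the vendor's stated memory bandwidth.

/* Rough sketch: effective memory bandwidth of a plain copy loop. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 25)                      /* 32 Mi doubles, 256 MiB per array */

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (long i = 0; i < N; i++)
        a[i] = (double)i;                 /* touch the pages before timing */

    double t0 = now_seconds();
    for (long i = 0; i < N; i++)
        b[i] = a[i];                      /* moves V = 2 * N * 8 bytes */
    double t = now_seconds() - t0;

    double gbytes = 2.0 * N * sizeof(double) / 1e9;
    printf("copied %.2f GB in %.4f s -> %.2f GB/s (check %.1f)\n",
           gbytes, t, gbytes / t, b[N / 2]);

    free(a);
    free(b);
    return 0;
}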

Parallel Speedup and Efficiency (Briefly)

Parallel performance metrics are covered in more depth elsewhere, but they are central to measurement. The speedup on $p$ cores is

$$
S(p) = \frac{T(1)}{T(p)}
$$

where $T(1)$ is the time on a single core (or another agreed baseline) and $T(p)$ is the time on $p$ cores. The corresponding parallel efficiency is

$$
E(p) = \frac{S(p)}{p}
$$

Measuring $T(p)$ accurately and consistently is the foundation for meaningful scaling studies.
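
With hypothetical numbers: if the baseline run takes $T(1) = 400$ s and the same problem takes $T(16) = 40$ s on 16 cores, then

$$
S(16) = \frac{400}{40} = 10, \qquad E(16) = \frac{10}{16} = 62.5\%.
$$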


What Exactly Should You Time?

End-to-End vs. Region-Based Measurements

You rarely want only a single “program took X seconds” number.

Common levels:

  1. Total runtime:
    • Useful for user-perceived performance and resource planning.
  2. Phase runtimes:
    • Initialization, main compute loop, I/O, communication phases, post-processing.
  3. Kernel-level timing:
    • Specific loops or computational kernels you suspect dominate runtime.

This decomposition helps you see which phase dominates the total runtime and where optimization effort is likely to pay off.

Warm-Up and Measurement Noise

Real systems exhibit noise: operating-system jitter, interference from other jobs on shared resources, dynamic CPU frequency scaling, and cold caches or file-system buffers on the first run.

To mitigate this, perform an untimed warm-up run, repeat each measurement several times, and report a robust statistic (median or minimum) together with the observed spread.
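
A minimal sketch of this pattern in C (the kernel and the repetition count are placeholders):

/* One untimed warm-up execution, then several timed repetitions. */
#include <stdio.h>
#include <time.h>

#define N 5000000
#define REPS 5

static double a[N];

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void kernel(void)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 1.0000001 + 0.5;    /* stand-in for the real work */
}

int main(void)
{
    kernel();                             /* warm-up run, not timed */

    double best = 1e30;
    for (int r = 0; r < REPS; r++) {
        double t0 = now_seconds();
        kernel();
        double t = now_seconds() - t0;
        printf("run %d: %.4f s\n", r + 1, t);
        if (t < best)
            best = t;
    }
    printf("best of %d runs: %.4f s (check %.2f)\n", REPS, best, a[N / 2]);
    return 0;
}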

Practical Timing Techniques in HPC

Timing from the Command Line

For coarse, whole-program measurement you can use shell tools.

The `time` Command

Most systems provide a time utility that reports the elapsed (real), user, and system time of a command.

Example usage:

time ./my_program input.dat

You can also use /usr/bin/time -v (the standalone GNU version, more detailed than the shell built-in) to see additional metrics such as maximum resident set size (peak memory), page faults, and context switches.

Use command-line timing for quick whole-program baselines and coarse comparisons between runs; it cannot attribute time to specific regions inside the code.

In-Code Timing: Instrumenting Applications

To understand where time is spent inside your program, you insert timing calls around code regions.

Basic Principles

Whatever the language, a few principles apply: use a high-resolution, monotonic clock; time regions that are long compared to the timer's resolution; keep the instrumentation itself cheap so it does not distort what you measure; and in parallel code, decide explicitly which process or thread's time you report.

Examples of Common Timing APIs

You will see different timing calls depending on language and parallel programming model. At a conceptual level, all do the same thing:

  1. Get current time (as a double or timestamp).
  2. Run the code section.
  3. Get time again.
  4. Compute difference.

For example, conceptually:

t_start = now()
# ... code to be measured ...
t_end = now()
elapsed = t_end - t_start
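
One concrete realization of this pattern in C, using the POSIX monotonic clock and timing two phases separately (the phases themselves are placeholders):

#include <stdio.h>
#include <time.h>

#define N 10000000

static double a[N];

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now_seconds();
    for (int i = 0; i < N; i++)           /* "initialization" phase */
        a[i] = (double)i;
    double t1 = now_seconds();

    for (int i = 0; i < N; i++)           /* "compute" phase */
        a[i] = a[i] * 0.5 + 1.0;
    double t2 = now_seconds();

    printf("init:    %.4f s\n", t1 - t0);
    printf("compute: %.4f s\n", t2 - t1);
    printf("total:   %.4f s (check %.1f)\n", t2 - t0, a[N / 2]);
    return 0;
}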

In MPI codes, it is common to use the MPI-provided timer (`MPI_Wtime`) to ensure consistency across processes. Threaded and GPU programming models have their own timing functions, discussed in their specific chapters.

Timing Parallel Regions

An important detail in parallel programs: which process or thread reports what?

Guidelines: synchronize the processes before starting the timer; report the maximum time across processes, since the slowest rank determines the wall-clock time of the phase; and record the minimum or average as well if you want to quantify load imbalance.

Make sure the measured region encompasses all relevant work, including communication and synchronization, not just the local computation.
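
A hedged sketch of this approach with MPI (the work routine is a stand-in, deliberately imbalanced so the max/min difference is visible):

#include <stdio.h>
#include <mpi.h>

/* Placeholder for the real per-rank work. */
static void compute_step(int rank)
{
    volatile double x = 0.0;
    long n = 10000000L * (rank % 2 + 1);  /* odd ranks do twice the work */
    for (long i = 0; i < n; i++)
        x += 1e-9;
    (void)x;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
    double t_start = MPI_Wtime();
    compute_step(rank);                   /* region of interest; a real code
                                             would include its communication */
    double t_local = MPI_Wtime() - t_start;

    double t_max, t_min;
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_local, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("region time: max %.4f s, min %.4f s\n", t_max, t_min);

    MPI_Finalize();
    return 0;
}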


Measuring Performance at Scale

Strong and Weak Scaling Experiments

Scaling concepts are detailed elsewhere; here is how they connect to measurement.

To evaluate parallel performance, you run systematic experiments:

For each data point:

  1. Choose the job configuration (nodes, tasks per node, threads).
  2. Run the application multiple times.
  3. Use the same timing methodology for every run (in-code timers, the time command, or scheduler logs).
  4. Record:
    • Input size / problem parameters.
    • Number of processes/threads.
    • Wall-clock time of the primary work phase.
    • Any additional metrics of interest (iterations, time steps, etc.).

Plotting runtime, speedup, and parallel efficiency against the number of processes (often on logarithmic axes) makes trends and deviations from ideal scaling easy to see.

These plots are more informative than standalone numbers and form the basis of a performance study.

Averaging and Reporting

For each configuration, you may have several runs. Common practice is to report the median or minimum time, state the number of repetitions, and give some measure of spread (for example, minimum and maximum, or the standard deviation).

When reporting strong/weak scaling, also document the problem size, node and core counts, compiler and library versions, compiler flags, and the exact timing methodology.

This is essential for reproducibility and fair comparison.


Common Pitfalls in Performance Measurement

Measuring the Wrong Thing

Typical mistakes include timing a debug build instead of an optimized one, using an input too small to be representative, or folding one-time setup costs into the measurement of a compute kernel.

Mitigation: measure the build, input, and region that actually correspond to the question you are asking.

Timer Placement Errors

Incorrect timer placement leads to misleading conclusions: starting the timer after part of the work has already begun, stopping it before communication or synchronization has completed, or nesting timers so that the same work is counted twice.

Mitigation: place timers at well-defined phase boundaries, include the synchronization that logically belongs to a phase, and check that every timer start has exactly one matching stop.

Perturbing the Program

Excessive timing can change behavior: very fine-grained timers add overhead, can inhibit compiler optimizations, and may alter cache behavior, so the instrumented program no longer behaves like the original.

Mitigation: keep instrumentation coarse, time whole phases rather than individual statements, and occasionally compare total runtime with and without instrumentation enabled.

Integrating Measurement into Your Workflow

Establishing a Baseline

Before making any optimization:

  1. Choose a representative input.
  2. Measure:
    • Total runtime.
    • Critical phase runtimes.
    • Basic throughput (e.g., iterations per second).
  3. Record system and environment information.

This baseline is your reference for all future changes.

Iterative Improvement

For each change (code modification, compiler flag, different node count):

  1. Rebuild with a clearly labeled version.
  2. Repeat the same measurement procedure.
  3. Compare against the baseline:
    • Did performance improve?
    • By how much (as a percentage)?
    • Is the improvement consistent across runs?

Never assume that a theoretically “better” optimization actually speeds up your code without measuring.

Automation

As projects grow, manual measurement becomes tedious and error-prone. Simple automation helps: scripts that build, run, and collect timings in the same way every time, and that record the configuration, input, environment, and results in a structured file.

Later chapters on performance analysis and optimization will integrate these measurements with profiling and more sophisticated tools, but the underlying pattern remains: measure, change, measure again.


Summary

The measurement practices described here (wall-clock timing, throughput and efficiency metrics, and a careful, repeatable methodology) are the foundation on which the profiling, low-level optimization, and performance-tool chapters that follow build.
