
12 Performance Analysis and Optimization

The Role of Performance in HPC

High performance computing exists to solve large and complex problems under practical time and resource constraints. A program that produces correct results but wastes time, energy, or hardware capacity is usually not acceptable on shared HPC systems. Performance analysis and optimization provide a systematic way to understand how an application uses the machine and how to improve that usage.

In this chapter, the focus is on performance as a measurable, analyzable property of HPC workloads. Other chapters describe hardware, parallel programming models, and numerical libraries. Here the concern is how to quantify and improve behavior on a given system, independent of the specific domain science.

Performance should be viewed as a first-class aspect of the software lifecycle. It is not something to think about only after a code is finished. The most successful HPC projects integrate measurement and optimization from early prototypes through production runs.

Performance is not a single number. It is a combination of time to solution, resource usage, scalability, and efficiency with respect to both hardware and algorithms.

Basic Performance Metrics

In HPC, performance usually starts with time and cost. Many derived metrics are simply different ways to normalize or interpret these quantities.

Wall-clock time and speed

The most fundamental measurement is wall-clock time, often just called runtime. This is the elapsed real time from the start of a computation to its end, as would be seen on a clock on the wall.

From time, other metrics are defined, such as speed and throughput. For a problem that processes a known number of operations or items, speed can be described as:

$$
\text{Speed} = \frac{\text{Work done}}{\text{Time}}
$$

For instance, work can be a number of time steps, simulated particles, or floating point operations (FLOPs). In practice, users often think in terms of how long a simulation needs to run to reach a target time step or convergence criterion.

Floating point performance

In scientific computing, floating point operations per second (FLOP/s) are commonly used to describe computational performance. On large systems, this is usually expressed at giga- (GFLOP/s), tera- (TFLOP/s), or peta-scale (PFLOP/s).

If a kernel or routine performs $N_{\text{flop}}$ floating point operations and takes a time $T$, then the floating point rate is:

$$
\text{FLOP/s} = \frac{N_{\text{flop}}}{T}
$$

This definition is clear, but in real applications it is often hard to know $N_{\text{flop}}$ exactly. Performance analysis tools and hardware counters can provide estimates.

It is important to note that a high FLOP/s rate does not automatically mean a fast or efficient application. Many real codes are limited by memory or communication, not by the raw ability to perform floating point operations.
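The FLOP/s definition above can be applied directly to a kernel whose operation count is known. As a minimal sketch, a naive dot product performs one multiplication and one addition per element, so $N_{\text{flop}} = 2N$:

```python
import time

def dot(a, b):
    """Naive dot product: N multiplications and N additions, so 2N FLOPs."""
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

n = 100_000
a = [1.0] * n
b = [2.0] * n

t0 = time.perf_counter()
result = dot(a, b)
elapsed = time.perf_counter() - t0

n_flop = 2 * n  # known operation count for this kernel
rate = n_flop / elapsed
print(f"result={result:.1f}  n_flop={n_flop}  rate={rate:.2e} FLOP/s")
```

For real applications, where the operation count is not known analytically, hardware counters provide the estimate instead; the rate reported here for interpreted Python is of course far below what compiled code achieves.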

Parallel speedup and efficiency

For parallel programs, performance analysis often compares behavior on different numbers of cores or nodes. The basic quantity is speedup. If $T_1$ is the runtime on a reference configuration, and $T_p$ is the runtime on $p$ processing elements, then

$$
S_p = \frac{T_1}{T_p}
$$

is the speedup on $p$ processing elements.

Parallel efficiency normalizes speedup by the ideal linear speedup:

$$
E_p = \frac{S_p}{p} = \frac{T_1}{p \, T_p}
$$

An efficiency of 1 corresponds to perfect linear speedup. In practice, efficiencies decline as $p$ increases, due to overheads such as communication, synchronization, and load imbalance.

When a serial baseline $T_1$ is impractical to obtain, for example because a single-process run would take too long or the problem does not fit on one node, a different reference configuration can be used, and speedup is measured relative to that reference. The same formulas apply with $T_{\text{ref}}$ in place of $T_1$.
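The speedup and efficiency formulas translate directly into code. The timings below are illustrative values, not measurements from any particular system:

```python
def speedup(t_ref, t_p):
    """Speedup S_p = T_ref / T_p."""
    return t_ref / t_p

def efficiency(t_ref, t_p, p):
    """Parallel efficiency E_p = S_p / p = T_ref / (p * T_p)."""
    return speedup(t_ref, t_p) / p

# Hypothetical runtimes (seconds) for the same problem on p cores.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 15.0}
t1 = timings[1]
for p, tp in sorted(timings.items()):
    print(f"p={p:2d}  S_p={speedup(t1, tp):5.2f}  E_p={efficiency(t1, tp, p):4.2f}")
```

Printing such a table for a range of processor counts is one of the quickest ways to see where efficiency starts to fall off.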

Time to solution and throughput

Time to solution is the total wall-clock time needed to complete a meaningful scientific or engineering task. For HPC users, this is usually the main performance concern. Every optimization step and hardware choice should be evaluated in terms of its effect on time to solution.

Throughput refers to how many tasks or jobs can be completed per unit time. For production environments with many similar runs, such as parameter sweeps, throughput and job turnaround time matter more than the performance of a single run.

Throughput is influenced not only by code speed, but also by queuing policies, job sizing, and resource fragmentation on the cluster. Performance analysis must therefore consider the interaction between jobs and the scheduler.

Resource efficiency and cost

On shared systems, using resources efficiently is a matter of both fairness and cost. Two key ideas are:

  1. CPU efficiency or utilization. How much of the allocated CPU time is actually used for useful computations rather than idle waiting or overhead.
  2. Energy efficiency. How much energy is consumed per unit of useful work done, often expressed as Joules per simulation step or per FLOP.

Many supercomputing centers track job efficiency metrics and may warn or penalize consistently wasteful jobs. Good performance analysis practices help avoid such issues.

Understanding Performance Limits: Roofline Perspective

Modern HPC hardware has complex behaviors. Rather than focus on microarchitectural details, a high-level model is often more useful for beginners. One widely used conceptual model is the roofline model.

Arithmetic intensity

Arithmetic intensity is a key concept that links computation and data movement. It is defined as the number of floating point operations per byte of data moved from memory:

$$
I = \frac{\text{Number of FLOPs}}{\text{Bytes moved to or from memory}}
$$

An operation with high arithmetic intensity does a lot of computation for each byte it reads or writes, for example dense matrix multiplications. Operations with low intensity perform relatively little computation per byte, such as simple vector additions.

Arithmetic intensity helps predict whether an algorithm is limited by the speed of computation or by memory bandwidth.

Compute bound and memory bound regimes

For a given hardware platform, two main capabilities are relevant: the peak floating point rate $P_{\text{peak}}$ and the maximum memory bandwidth $B_{\text{max}}$. The highest achievable performance for a kernel with arithmetic intensity $I$ is approximately

$$
P_{\text{attainable}} = \min\left(P_{\text{peak}}, \, I \cdot B_{\text{max}} \right)
$$

This leads to two regimes:

If $I \cdot B_{\text{max}} < P_{\text{peak}}$, the kernel is memory bound. Performance is limited by memory bandwidth, and improving data locality or reducing memory traffic is likely more useful than increasing raw computational power.

If $I \cdot B_{\text{max}} \ge P_{\text{peak}}$, the kernel is compute bound. Performance is limited by how fast the processor can perform operations. Techniques like better vectorization, instruction-level optimization, or exploiting accelerators can help.

The roofline name comes from plotting performance as a function of arithmetic intensity. The plot has a slanted line corresponding to $I \cdot B_{\text{max}}$ that meets a horizontal line at $P_{\text{peak}}$. Every real measurement should fall below or on this roof.

If measured performance is far below both the memory roof $I B_{\text{max}}$ and the compute roof $P_{\text{peak}}$, then there is a significant opportunity for optimization.
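The roofline bound is simple enough to evaluate by hand or in a few lines of code. The peak and bandwidth numbers below are illustrative placeholders, not the specifications of any particular machine:

```python
def attainable_gflops(intensity, peak_gflops, bw_gbytes_per_s):
    """Roofline bound: min(P_peak, I * B_max)."""
    return min(peak_gflops, intensity * bw_gbytes_per_s)

# Hypothetical node: 2000 GFLOP/s peak, 200 GB/s memory bandwidth.
PEAK, BW = 2000.0, 200.0

# Vector add c[i] = a[i] + b[i] in double precision:
# 1 FLOP per element, 24 bytes moved (two loads, one store), so I ~ 0.042.
i_vecadd = 1 / 24

# Well-blocked dense matrix multiply can reach intensities of 10+ FLOP/byte.
i_gemm = 10.0

print(attainable_gflops(i_vecadd, PEAK, BW))  # memory bound, far below peak
print(attainable_gflops(i_gemm, PEAK, BW))    # compute bound, at peak
```

On these assumed numbers the vector add is capped near 8 GFLOP/s by bandwidth, while the matrix multiply can in principle reach the full 2000 GFLOP/s, which is exactly the memory-bound versus compute-bound distinction described above.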

Practical use of the roofline idea

Beginners do not need to construct a detailed roofline plot to benefit from the idea. A simple reasoning process is often enough:

Estimate whether your code moves much more data than it computes. If so, it is likely memory bound, and focus on memory access patterns.

Estimate whether there are parts that perform many operations on relatively small local data. Those parts may be compute bound and can benefit from vectorization and reduced overhead.

Observe performance changes when varying problem size. If doubling the floating point work without changing memory size increases runtime roughly 2x, computation is significant. If runtime changes less than expected, memory or communication may dominate.

This perspective helps avoid random low-level tweaks that do not address the main limiting factor.

Levels of Performance Analysis

Performance analysis can operate at different levels of detail, depending on the questions being asked and the available time.

High level: algorithm and scaling behavior

At the highest level, the focus is on how algorithms scale with problem size and with parallelism. This connects to concepts like strong and weak scaling, Amdahl’s Law, and Gustafson’s Law, covered in other chapters.

At this level, typical questions include:

What is the asymptotic complexity with respect to problem size, for example $O(n^2)$ or $O(n^3)$?

How does runtime change when the number of processors is increased, with fixed or varying problem size?

Are there regions of the algorithm that are inherently sequential and limit speedup?

Changes at this level often involve algorithmic choices or data structure redesign, which can yield the largest gains but may require significant code modifications.

Intermediate level: code structure and hotspots

At an intermediate level, analysis focuses on where most time is spent within the program. A hotspot is a function, loop, or code region that dominates total runtime.

It is common to find that a small fraction of the code is responsible for most of the time. For example, one inner loop or particular library call may account for 80 percent of the runtime.

Identifying hotspots is crucial, because optimizing cold regions has little impact on overall performance, even if local improvements seem dramatic.

At this level, profiling tools and timers are used to build a hierarchical picture of time distribution among functions, call paths, and phases of the computation.

Low level: hardware behavior and micro-optimization

At the lowest level, analysis involves detailed hardware behaviors such as cache miss rates, vectorization efficiency, branch mispredictions, and instruction-level parallelism.

This level is often necessary when a well-chosen algorithm and data structure are already in place, and the main kernel is still far from hardware limits. Tools like hardware performance counter profilers, vectorization reports from compilers, and specialized analysis utilities become important.

Low-level optimization should generally come after higher-level adjustments. Attempting to tweak instructions when the algorithm itself is suboptimal is rarely a good investment of effort.

A Systematic Performance Engineering Process

Performance engineering is more than occasional tuning. It is an iterative process that uses measurement and reasoning to guide changes.

Establishing a baseline

Before any optimization, it is essential to obtain a clear baseline. A baseline is a set of measurements, on a specific machine and input dataset, that characterize the current performance.

A good baseline typically includes:

Wall-clock time for a representative run.

Basic resource usage such as CPU utilization and memory footprint.

Scaling behavior with a small set of processor counts.

Key algorithmic parameters such as problem size and important tolerances.

Care must be taken to ensure that the baseline is reproducible. This usually means recording input parameters, code version, compiler flags, and environment settings such as loaded modules.
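One lightweight way to make a baseline reproducible is to store the measurements together with the conditions they were obtained under in a single machine-readable record. The field names below are hypothetical, a sketch rather than a standard format:

```python
import json
import platform
import sys
import time

def record_baseline(path, wall_time_s, metrics, config):
    """Write one baseline record: measurements plus the conditions
    under which they were obtained (hypothetical field names)."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "hostname": platform.node(),
        "python": sys.version.split()[0],
        "wall_time_s": wall_time_s,
        "metrics": metrics,   # e.g. memory footprint, CPU utilization
        "config": config,     # problem size, tolerances, code version, flags
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

rec = record_baseline(
    "baseline.json",
    wall_time_s=123.4,
    metrics={"max_rss_mb": 512, "cpu_util": 0.93},
    config={"n": 1_000_000, "tol": 1e-8, "git_rev": "abc1234", "flags": "-O3"},
)
```

In a real project the code version and compiler flags would be captured automatically (for example from the build system) rather than typed in by hand.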

Without a trustworthy baseline, it is impossible to know whether a modification is an improvement or just an artifact of changing conditions.

Formulating hypotheses

Random changes to the code are inefficient. Instead, performance optimization should be guided by hypotheses.

A hypothesis might be: The application is memory bound in the main stencil kernel, and reorganizing data will reduce cache misses and improve runtime by about 20 percent.

Or: The code spends too much time in global communication, and reducing the frequency of MPI collective operations will improve scaling beyond 64 nodes.

Each hypothesis is testable by specific experiments, such as measuring cache miss rates before and after a change, or profiling communication volumes as processor counts increase.

A hypothesis-driven approach reduces wasted effort and helps structure the analysis.

Using measurement to guide improvements

Performance measurement typically involves combinations of the following techniques:

Instrumentation. Adding timers and counters into the code, often with simple MPI_Wtime or high-resolution clock functions, to record durations of specific phases.

Profiling. Running the program under a profiler to collect statistics on where time is spent at function or line level.

Tracing. Recording detailed event timelines, such as when messages are sent and received or when threads are created and synchronized.

Counter sampling. Using hardware performance counters to approximate metrics like cache misses or branch mispredictions.

For beginners, the most important practice is to start with simple measurements and only introduce more complex tools if necessary. An initial division of time among a few high-level phases often reveals surprisingly clear bottlenecks.

After each change, the same measurement procedure should be used so that comparisons are valid.
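The instrumentation technique listed first above can be as simple as a reusable timer that accumulates wall-clock time per named phase. A minimal sketch, using Python's high-resolution clock in place of `MPI_Wtime`:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_times = defaultdict(float)

@contextmanager
def timed(phase):
    """Accumulate wall-clock time spent inside each named phase."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        phase_times[phase] += time.perf_counter() - t0

# Toy computation split into high-level phases.
with timed("setup"):
    data = list(range(50_000))
with timed("compute"):
    total = sum(x * x for x in data)
with timed("output"):
    summary = f"total={total}"

for phase, t in phase_times.items():
    print(f"{phase:8s} {t:.6f} s")
```

Because the same timers run on every execution, the resulting phase breakdown gives exactly the comparable before/after measurements the iterative process requires.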

Iterative refinement

Performance engineering is an iterative loop. A typical cycle looks like:

Measure the current behavior and identify the dominant bottleneck.

Formulate a hypothesis about why that bottleneck occurs.

Implement a targeted change to address it.

Re-measure using the same protocol.

Evaluate whether the change helped, harmed, or had no effect.

If successful, the bottleneck may shift to another part of the code. If unsuccessful, the hypothesis may need refinement or a different line of attack.

This iterative process continues until further improvements become too small to justify the effort or the code reaches known hardware and algorithmic limits.

Never trust a single run as proof of improvement. Repeat measurements and consider variability, especially on shared systems.

Common Sources of Performance Bottlenecks

While every application is different, certain classes of bottlenecks appear frequently in HPC codes. Recognizing them helps focus analysis.

Insufficient parallelism and serial regions

Parallel performance depends on there being enough parallel work to keep all processing elements busy. Serial regions, where only one thread or process is active, limit speedup according to relationships like Amdahl’s Law.

Common causes include:

Initialization or I/O that is done by a single process.

Global control flow that forces all work to be serialized.

Legacy algorithms that are inherently sequential.

Analysis for such cases involves measuring the fraction of time spent in purely serial parts and understanding their impact on overall speedup.

Load imbalance

Parallel programs often divide work among tasks. Load imbalance occurs when some tasks finish early and become idle while others still work.

Symptoms include:

Some cores are much more heavily utilized than others.

In MPI programs, processes wait at collective operations for stragglers.

In OpenMP programs, some threads consistently have less work in loops.

Performance tools can show per-process or per-thread utilization and time spent idle or waiting. Fixing imbalance may involve changing data decomposition, dynamic scheduling, or rethinking how tasks are assigned.
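A common way to quantify what such tools show is an imbalance factor computed from per-task busy times. The per-rank numbers below are made up for illustration:

```python
def imbalance(times):
    """Load imbalance factor: max(t) / mean(t).
    1.0 means perfect balance; larger values mean stragglers dominate,
    since a synchronized parallel phase takes as long as its slowest task."""
    mean = sum(times) / len(times)
    return max(times) / mean

# Busy times (seconds) from a hypothetical 4-process run.
per_rank = [9.5, 10.0, 10.2, 14.3]
factor = imbalance(per_rank)

# The phase costs max(t); with perfect balance it would cost mean(t).
wasted = max(per_rank) - sum(per_rank) / len(per_rank)
print(f"imbalance factor {factor:.2f}, ~{wasted:.1f} s lost to waiting")
```

A factor near 1.0 suggests the decomposition is fine; values well above 1.0 point toward redistributing work or switching to dynamic scheduling.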

Communication overhead

Distributed memory applications using MPI and similar models must exchange data among processes. This communication introduces overhead and can limit scalability.

Bottlenecks arise from:

Excessive use of fine-grained messages.

Synchronous communications that cause unnecessary waiting.

Frequent global reductions that require global coordination.

Contention in the network when many processes communicate simultaneously.

Performance analysis in this area looks at time spent in communication routines, message sizes and frequencies, and patterns such as all-to-all or nearest-neighbor communication. Understanding whether communication overlaps with computation is also important.

Synchronization and contention

Shared memory parallel programs use synchronization operations such as locks, barriers, and atomic operations to coordinate threads. Excessive or poorly placed synchronization can degrade performance.

Typical issues include:

Too many global barriers in OpenMP regions.

Coarse-grained locks that serialize access to shared structures.

Hot spots where many threads contend for the same lock or atomic variable.

Profiling can reveal time spent in synchronization primitives and identify which code regions cause heavy contention.

Memory bandwidth and latency limitations

Many HPC applications are ultimately limited by how fast data can be moved between memory and processors. Bottlenecks here show up as:

Low CPU utilization even when there is enough work.

Tools reporting high memory bandwidth usage close to theoretical peak.

Stalled cycles and high cache miss rates.

For these cases, improvement often requires reorganization of data and access patterns to exploit caches better, reduce memory traffic, and increase data reuse.

I/O bottlenecks

Input and output operations can be surprisingly expensive, especially on large node counts or when many processes write small files.

Symptoms are:

Long periods at the start or end of runs where computation is not occurring.

Poor scaling as more processes perform I/O simultaneously.

Filesystem contention and variable performance from run to run.

Analysis involves measuring time spent in open, read, write, and close operations, and understanding how data is organized on disk. Parallel I/O libraries and careful data layout help mitigate such issues.

Optimization Strategies at Different Scales

Once bottlenecks are identified, a range of optimization strategies is available. It is helpful to categorize them by the scale at which they operate.

Algorithmic and mathematical optimization

These are changes that alter the mathematical or algorithmic structure of the code.

Examples include:

Choosing an algorithm with lower complexity for the same task.

Using more efficient solvers or preconditioners for linear systems.

Exploiting problem-specific structure such as symmetry or sparsity.

Such changes can yield orders-of-magnitude improvements in both runtime and resource usage, but they may require deeper domain knowledge and validation that numerical properties remain acceptable.

Data structure and locality optimization

Data structures strongly influence memory access patterns and cache behavior. Good locality means that data loaded into fast memory is reused before being evicted.

Potential optimizations include:

Reordering arrays or loop nests to access memory in contiguous patterns.

Choosing array-of-structures or structure-of-arrays layouts depending on access patterns.

Blocking computations to operate on subregions of data that fit in cache.

These changes try to increase arithmetic intensity, reduce cache misses, and move from a memory-bound to a more compute-bound regime.
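Blocking can be illustrated with a tiled matrix multiply. In pure Python the cache effect is masked by interpreter overhead, so this is a sketch of the restructuring itself; in C or Fortran the same loop structure measurably reduces cache misses:

```python
def matmul_blocked(A, B, block=32):
    """Tiled matrix multiply: work on block x block subregions so each
    tile of B and C is reused while it is still cache-resident."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = A[i][k]
                        row_b = B[k]
                        row_c = C[i]
                        # Innermost loop over j walks rows of B and C
                        # contiguously, the access pattern caches favor.
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_blocked(A, B, block=2))  # [[19.0, 22.0], [43.0, 50.0]]
```

Note the i-k-j ordering inside each tile: it turns the innermost loop into contiguous row traversals, which is the locality improvement the surrounding text describes. In practice one would call a tuned BLAS routine rather than hand-write this kernel.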

Parallelization and load balancing strategies

Improvements here change how work is divided and coordinated across processing elements.

Possible strategies:

Adjusting domain decompositions to better match problem geometry or workload distribution.

Using dynamic scheduling in OpenMP for irregular workloads.

Redistributing work in MPI to account for heterogeneous node speeds.

Reducing global synchronizations and making progress in local regions independently when possible.

The goal is to keep all processing elements busy and minimize idle time due to imbalance or unnecessary waiting.

Communication and synchronization optimization

Since communication and synchronization costs grow with scale, careful design is necessary for large parallel jobs.

Typical techniques are:

Aggregating messages to send fewer, larger communications.

Using nonblocking operations to overlap communication with computation.

Reducing the frequency of expensive collective operations by rethinking algorithms.

Replacing global barriers with more targeted synchronization mechanisms.

These optimizations can significantly improve scalability, especially for strong scaling where problem size is fixed and processor count increases.

Implementation-level and compiler-driven optimization

Compilers and libraries play a major role in performance. Implementation-level optimizations include:

Enabling appropriate optimization flags for the chosen compiler.

Ensuring that critical loops are vectorizable and inspecting compiler vectorization reports.

Using tuned numerical libraries instead of hand-written basic operations.

Avoiding unnecessary temporary objects or expensive language features in performance-critical sections.

These modifications often yield incremental speedups, but combined with higher-level changes, they can bring implementations closer to hardware limits.

Measuring and Reporting Performance

In HPC, performance measurements have to be communicated clearly, both within teams and in publications or reports. Good reporting makes it possible for others to understand results and reproduce them.

Choosing appropriate metrics for reporting

Different audiences care about different performance aspects. For scientific users, time to solution and scalability on realistic problem sizes may be most important. For computer scientists, FLOP/s or efficiency relative to hardware peak may be relevant.

Commonly reported metrics include:

Total wall-clock time and its breakdown.

Speedup and parallel efficiency as functions of processor count.

Throughput for ensembles of runs.

Resource utilization such as CPU and memory usage.

Energy consumption or power usage on energy-aware systems.

Choosing metrics that match the goals of the study avoids misleading impressions. For instance, reporting only FLOP/s can be meaningless for I/O dominated applications.

Ensuring fair and reproducible comparisons

When comparing different versions of a code or different implementations, it is critical to control the experimental environment. This means:

Using the same hardware and software environment unless differences are part of the comparison.

Fixing problem sizes and inputs.

Repeating runs and reporting averages and variability.

Disclosing compiler versions, flags, and significant runtime parameters.

On shared clusters, background load from other users can add noise to measurements. Multiple runs and simple statistical reporting help account for this.

Performance claims without clear experimental conditions and reproducibility are of limited value and can be misleading.

Interpreting variability

Observed runtimes often vary even when nothing in the code is changed. Reasons include filesystem contention, network congestion, and other jobs contending for shared resources.

When analyzing results:

Do not overreact to small differences that fall within typical variation.

Use multiple runs to estimate mean and spread.

Look for consistent trends across independent measurements rather than isolated best-case numbers.

When improvements are small, for example a few percent, more careful statistical treatment is needed to determine whether they are real.
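A first step toward such statistical treatment is to report the mean, spread, and relative variability of repeated runs instead of a single number. The run times below are illustrative:

```python
from statistics import mean, stdev

def summarize_runs(times):
    """Mean, sample standard deviation, and relative spread of repeated runs."""
    m = mean(times)
    s = stdev(times)
    return m, s, s / m

# Five repeated runs of the same configuration (seconds, illustrative).
runs = [101.2, 99.8, 100.5, 103.1, 100.0]
m, s, rel = summarize_runs(runs)
print(f"mean {m:.1f} s, stdev {s:.1f} s ({100 * rel:.1f}% relative spread)")
```

If a claimed improvement is smaller than the relative spread seen here, more runs or a proper significance test are needed before concluding that it is real.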

Performance vs Other Objectives

Optimization rarely happens in isolation. Other concerns such as correctness, maintainability, portability, and energy use also matter.

Balancing performance and code complexity

Aggressive optimization can lead to complicated code that is hard to understand and maintain. Striking a balance is essential.

General guidelines include:

Prioritize clean algorithmic improvements over low-level tweaks that obscure logic.

Limit complex optimizations to well isolated hotspots, and document them clearly.

Use comments and design notes to explain non-obvious decisions made for performance reasons.

Consider whether automatic tools or libraries can provide performance without manual micro-optimization.

In many projects, a moderate performance improvement that preserves clarity is preferable to a marginal extra speedup that creates technical debt.

Performance portability

Optimizations that are highly specific to one architecture or compiler may not transfer well to other systems. Performance portability seeks to write code that performs well across different platforms with minimal changes.

Strategies include:

Using portable parallel programming interfaces that support multiple backends.

Encapsulating architecture-specific kernels so they can be replaced for different targets.

Avoiding reliance on undefined or obscure behavior that may change across compilers.

When evaluating performance, it is often useful to measure on more than one system to assess portability.

Energy and resource aware performance

HPC facilities are constrained not only by hardware capacity but also by power and cooling budgets. Performance optimization that reduces time to solution usually also improves energy efficiency, but not always. For example, using more nodes to finish faster may increase total energy used.

Some systems allow users to choose energy related settings such as CPU frequency or power caps. Performance analysis in such environments may involve exploring trade-offs between runtime and energy.

The ideal is often to minimize energy to solution, which involves considering both power and time. Although detailed energy modeling is beyond the scope of this chapter, awareness of these trade-offs is increasingly important.

Integrating Performance Analysis into Your Workflow

For beginners, performance analysis and optimization can seem like an advanced, separate activity. In practice, it should be integrated into everyday development.

A practical approach includes:

Adding simple timers around major phases as soon as there is a working prototype.

Preserving small test cases that are fast to run and representative enough to show performance changes.

Periodically profiling as features are added to ensure that no new bottlenecks appear unexpectedly.

Recording configurations and results in a lightweight log or lab notebook.

Viewing performance not as a one-time tuning step, but as a continuous feedback loop during development makes later large-scale optimization much easier.

Ultimately, systematic performance analysis and measured optimization are what transform correct parallel programs into effective and efficient HPC applications that fully exploit modern systems.
