Big Picture: Why Performance Analysis Matters in HPC
In high-performance computing, correctness is only the first step. A program that produces the right answer but runs 100× slower than it could is often unusable on a shared cluster. Performance analysis and optimization are about:
- Understanding where time and resources are actually spent
- Matching your code to the hardware and software stack
- Making evidence-based changes instead of random tweaks
This chapter gives you a systematic way to think about performance, plus the main tools and concepts you’ll use. Later subchapters will go into specific techniques (benchmarking, profiling, cache optimization, etc.).
Performance as a Multi-Dimensional Problem
Performance is not a single number. Several dimensions matter:
- Execution time: How long a job takes to run (wall-clock time)
- Throughput: How many tasks per unit time (e.g., simulations/day)
- Scalability: How performance changes with more cores/nodes
- Efficiency: How well you use available resources (CPU, memory, I/O)
- Energy use: How much energy is consumed per useful result
Different applications prioritize different combinations of these. For example, a weather forecast must finish before the forecast is needed (time-to-solution); a parameter sweep might care more about throughput.
When you “optimize performance,” you should first clarify:
- What metric are you optimizing?
- Under what constraints (e.g., fixed node count, fixed memory, fixed energy)?
A Systematic Performance Workflow
Rather than guessing, follow a structured cycle:
1. Establish a baseline
   - Use a realistic input problem.
   - Measure basic metrics: wall time, CPU utilization, memory usage, I/O rates.
   - Record the software environment (compiler, flags, libraries, module versions) and the hardware (nodes, core counts, GPU presence).
2. Identify the main bottleneck
   - Is your code CPU-bound, memory-bound, I/O-bound, or communication-bound?
   - Use profiling and system tools (covered in later subchapters) to find:
     - Hotspots (functions or loops where most time is spent)
     - Resource usage patterns (e.g., low CPU usage but high I/O)
3. Formulate a hypothesis
   - Example: “The code is memory-bandwidth bound due to poor data locality.”
   - Example: “Most time is spent in a non-vectorized inner loop.”
4. Apply a targeted optimization
   - Change one thing at a time.
   - Use known strategies: algorithmic improvements, better libraries, more appropriate parallelization, etc.
5. Measure again
   - Compare to your baseline with the same environment and input.
   - Check that results are still correct (no change in numerical validity).
6. Iterate
   - Stop when further changes give diminishing returns or become too complex for the benefit gained.
This is an experimental process: measure → hypothesize → change → remeasure.
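To make the “establish a baseline” and “measure” steps concrete, here is a minimal sketch of a wall-clock measurement in C, assuming a POSIX system; `run_simulation` is a hypothetical placeholder for the code being measured.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder for the real workload being measured. */
void run_simulation(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i)
        x += 1.0e-9;
}

int main(void) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    run_simulation();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double wall = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) * 1.0e-9;

    /* Record this together with the input, hardware, and software used. */
    printf("wall-clock time: %.3f s\n", wall);
    return 0;
}
```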
Common Types of Bottlenecks
Most performance problems in HPC fall into a few categories:
Compute-Bound
Characteristics:
- High CPU usage (near 100% on all cores)
- Low memory bandwidth and I/O usage
- Performance scales with clock speed and number of cores (up to some limit)
Typical causes:
- Expensive arithmetic operations
- Complex algorithms with many floating-point operations
Solutions often involve:
- Algorithmic improvements (fewer operations overall)
- Better vectorization / SIMD usage
- Using optimized math libraries
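As a small illustration of the vectorization point, the loop below is written in a form compilers can map to SIMD instructions (unit stride, no loop-carried dependence). The `#pragma omp simd` hint assumes a compiler with OpenMP SIMD support, e.g. GCC or Clang built with `-O3 -fopenmp-simd`.

```c
#include <stddef.h>

/* Scaled add over contiguous arrays: unit stride and no loop-carried
 * dependence, so the compiler can generate SIMD instructions. */
void scaled_add(size_t n, float a, const float *restrict x, float *restrict y) {
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```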
Memory-Bound
Characteristics:
- CPU not fully utilized
- High memory bandwidth utilization
- Time spent waiting on data from memory rather than computing
Typical causes:
- Poor data locality (accessing memory in patterns unfriendly to caches)
- Large working sets that don’t fit in cache
- Indirect or random memory accesses
Solutions often involve:
- Reordering computations for better locality
- Changing data structures
- Tiling/blocking techniques
- Avoiding unnecessary data movement
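A minimal sketch of the tiling/blocking idea, applied to a dense matrix-matrix product: the block size is an assumed tuning parameter that would normally be chosen so the working tiles fit in cache.

```c
#define BLK 64  /* block size; tune so three BLK x BLK tiles fit in cache */

/* C = C + A * B for n x n row-major matrices, processed in tiles so that
 * each tile is reused from cache instead of re-read from main memory. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int jj = 0; jj < n; jj += BLK)
                for (int i = ii; i < ii + BLK && i < n; ++i)
                    for (int k = kk; k < kk + BLK && k < n; ++k) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BLK && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```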
Communication-Bound (Parallel Codes)
Characteristics:
- Time dominated by MPI or other communication calls
- Strong dependency on network performance
- Scaling gets worse as more processes are used
Typical causes:
- Excessive fine-grained communication
- Global synchronizations (e.g., frequent barriers)
- Poor data decomposition across processes
Solutions often involve:
- Aggregating messages
- Reducing synchronization points
- Better domain decompositions
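The message-aggregation point can be sketched as follows: rather than paying the network latency once per value, pack the values into a single message. The scenario (all values going to the same neighbour rank) is a simplifying assumption.

```c
#include <mpi.h>

/* Fine-grained: one MPI_Send per value, paying the latency cost each time. */
void send_fine_grained(const double *vals, int nvals, int dest) {
    for (int i = 0; i < nvals; ++i)
        MPI_Send(&vals[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* Aggregated: one message carrying all values, paying the latency once. */
void send_aggregated(const double *vals, int nvals, int dest) {
    MPI_Send(vals, nvals, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}
```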
I/O-Bound
Characteristics:
- CPU and network relatively idle while reading/writing data
- Very large files or many small file operations
- Slow startup or periodic output dominates runtime
Typical causes:
- Inefficient I/O patterns (small, frequent reads/writes)
- Using unsuitable file formats
- Serial I/O in a parallel application
Solutions often involve:
- Parallel I/O techniques
- Buffering and batching I/O operations
- Choosing appropriate file formats and libraries
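A sketch of buffering and batching output: instead of issuing one small write per value, results are kept in memory and written in large chunks. The chunk size and file handling here are illustrative only.

```c
#include <stdio.h>

#define CHUNK 100000  /* doubles written per I/O call; an arbitrary choice */

/* Batched output: write large blocks instead of one value at a time. */
void write_batched(const char *path, const double *data, long n) {
    FILE *f = fopen(path, "wb");
    if (!f) return;
    for (long i = 0; i < n; i += CHUNK) {
        long len = (n - i < CHUNK) ? (n - i) : CHUNK;
        fwrite(&data[i], sizeof(double), (size_t)len, f);
    }
    fclose(f);
}
```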
Levels of Optimization
Performance optimization can happen at several layers. It’s usually best to start at the top:
Algorithmic Level
Changes that reduce the total amount of work:
- Choosing a more efficient algorithm (e.g., $O(n \log n)$ instead of $O(n^2)$)
- Using better numerical methods (faster convergence, fewer iterations)
- Reducing problem size where possible (e.g., coarser grids, adaptive meshes)
Algorithmic improvements can easily give order-of-magnitude gains and should be considered before lower-level tweaks.
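As a tiny illustration of an algorithmic improvement, consider checking an array for duplicate values: comparing all pairs is $O(n^2)$, while sorting a copy and scanning adjacent elements is $O(n \log n)$. The sketch below uses the standard C library's `qsort`.

```c
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* O(n log n) duplicate check: sort a copy, then any duplicates are adjacent.
 * Returns 1 if duplicates exist, 0 if not, -1 on allocation failure. */
int has_duplicates(const int *v, size_t n) {
    int *tmp = malloc(n * sizeof(int));
    if (!tmp) return -1;
    memcpy(tmp, v, n * sizeof(int));
    qsort(tmp, n, sizeof(int), cmp_int);
    int found = 0;
    for (size_t i = 1; i < n && !found; ++i)
        found = (tmp[i] == tmp[i - 1]);
    free(tmp);
    return found;
}
```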
Implementation Level
Improvements in how you implement a chosen algorithm:
- Choosing appropriate data structures
- Minimizing overhead in critical loops
- Using vectorizable constructs and avoiding unnecessary branching
- Exploiting data locality
These optimizations can give significant improvements, though usually smaller ones than major algorithmic changes.
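One concrete example of a data-structure choice: storing particle coordinates as a structure of arrays rather than an array of structures keeps each coordinate contiguous, which tends to help locality and vectorization. The type and function names below are purely illustrative.

```c
/* Array of structures: x, y, z of one particle are adjacent, but a loop that
 * touches only x strides over unused y and z data. */
typedef struct { double x, y, z; } ParticleAoS;

/* Structure of arrays: each coordinate is contiguous, which is friendlier to
 * caches and to SIMD when a loop touches only one coordinate. */
typedef struct {
    double *x;
    double *y;
    double *z;
} ParticlesSoA;

/* Shifting only the x coordinate now touches contiguous memory. */
void shift_x(ParticlesSoA *p, long n, double dx) {
    for (long i = 0; i < n; ++i)
        p->x[i] += dx;
}
```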
Parallelization and Scaling Level
Improvements in how you exploit hardware parallelism:
- Choosing between shared memory (OpenMP), distributed memory (MPI), or hybrid approaches
- Tuning process and thread counts per node
- Overlapping communication and computation where possible
- Adjusting domain decompositions and load balancing
Effective parallelization can turn a usable single-node code into a capable large-scale application.
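A minimal sketch of overlapping communication and computation with non-blocking MPI: start a halo exchange, do the work that does not depend on the halo, and complete the exchange only when the boundary data is needed. `compute_interior` and `compute_boundary` are hypothetical placeholders.

```c
#include <mpi.h>

void compute_interior(void);   /* placeholder: work independent of the halo */
void compute_boundary(void);   /* placeholder: work that needs the halo     */

void exchange_and_compute(double *halo_send, double *halo_recv, int count,
                          int neighbour, MPI_Comm comm) {
    MPI_Request reqs[2];

    /* Start the halo exchange without waiting for it to finish. */
    MPI_Irecv(halo_recv, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
    MPI_Isend(halo_send, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);

    compute_interior();   /* overlaps with the communication in flight */

    /* Complete the exchange only when the boundary data is required. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary();
}
```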
System and Build Level
Adjusting how you compile and run the code:
- Compiler choices and optimization flags
- Linking against optimized numerical libraries
- Appropriate job sizing on the cluster (cores per task, memory per task)
- Using NUMA-aware placement (binding threads/processes to cores and memory)
When done correctly, these changes tend to require relatively little effort and can yield moderate gains.
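One low-effort sanity check related to NUMA-aware placement is to verify where threads actually run. The sketch below is Linux-specific (`sched_getcpu`) and assumes OpenMP; it simply reports each thread's current core so you can confirm that binding settings such as `OMP_PROC_BIND` and `OMP_PLACES` took effect.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu(), Linux-specific */
#include <omp.h>

int main(void) {
    /* Print which core each OpenMP thread is currently running on. */
    #pragma omp parallel
    {
        printf("thread %d on core %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```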
Trade-Offs in Optimization
Performance optimization almost always involves trade-offs:
- Performance vs. maintainability
  - Highly tuned code can be harder to understand and modify.
  - Use clear structure and comments around optimized regions.
- Performance vs. portability
  - Code tuned for one architecture (e.g., specific vector extensions) may not perform well, or even compile, on another.
  - Libraries and compiler flags can help manage this.
- Performance vs. development time
  - You rarely need to squeeze out every last percent.
  - Focus effort on the parts of the code that matter most (hotspots).
Before working on a major optimization effort, decide:
- What speedup is required to make the application practically useful?
- How much time and complexity are you willing to invest to get there?
Measuring What Matters: Basic Metrics
Later subchapters cover detailed techniques and tools. Here are core quantities you will often measure:
- Wall-clock time $T_{\text{wall}}$: elapsed real time between start and end.
- Speedup $S$ relative to a reference:
  $$ S = \frac{T_{\text{reference}}}{T_{\text{current}}} $$
- Parallel efficiency $E$ for $p$ processing elements:
  $$ E = \frac{S}{p} = \frac{T_1}{p \cdot T_p} $$
- Throughput: tasks per unit time (e.g., simulations/hour).
- Resource utilization:
- Average CPU usage per core
- Memory consumption
- I/O bandwidth
- Energy per task (if your system exposes energy counters)
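As a tiny worked example of the speedup and efficiency formulas above, the sketch below plugs in placeholder timings; replace them with your own measurements.

```c
#include <stdio.h>

int main(void) {
    /* Arbitrary placeholder measurements: substitute your own timings. */
    double t_ref = 120.0;   /* reference (e.g., 1-process) wall time in s */
    double t_p   = 18.0;    /* wall time on p processing elements in s    */
    int    p     = 8;

    double speedup    = t_ref / t_p;   /* S = T_reference / T_current */
    double efficiency = speedup / p;   /* E = S / p                   */

    printf("speedup:    %.2f\n", speedup);
    printf("efficiency: %.2f\n", efficiency);
    return 0;
}
```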
A basic performance report for any experiment should at least include:
- Problem size and input parameters
- Hardware description (nodes, cores, accelerators)
- Software stack (compiler, libraries, relevant environment modules)
- Measured wall time and resource usage
Principles for Effective Optimization
To use HPC resources responsibly and efficiently, adopt these habits:
- Optimize the right thing
  - Focus on code that actually runs on the cluster at scale.
  - Identify and work on hotspots instead of tweaking rarely used routines.
- Change one thing at a time
  - Make small, isolated changes and remeasure, so you know what helped or hurt.
- Use appropriate baselines
  - Compare against meaningful references (e.g., a previous version, a single-node run, a known-good library).
- Leverage existing libraries
  - Vendor- and community-tuned libraries often outperform hand-written code.
  - Only hand-optimize when needed and justified.
- Document your performance experiments
  - Keep simple logs: configuration, changes made, and results.
  - This aids reproducibility and helps others (and future you) understand what was done.
- Validate correctness at every step
  - Performance gains are useless if the scientific results are wrong.
  - Re-run tests or checksums whenever you make deeper changes to algorithms or numerics.
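To make the last point concrete, here is a minimal sketch of a validation check that compares an optimized run against a trusted reference within a tolerance; the tolerance and the near-zero guard are assumptions you would adapt to your own numerics.

```c
#include <math.h>
#include <stddef.h>

/* Return 1 if every element of 'result' matches 'reference' within a
 * relative tolerance (absolute near zero), 0 otherwise. */
int validate(const double *result, const double *reference, size_t n,
             double rel_tol) {
    for (size_t i = 0; i < n; ++i) {
        double denom = fabs(reference[i]) > 1e-300 ? fabs(reference[i]) : 1.0;
        if (fabs(result[i] - reference[i]) / denom > rel_tol)
            return 0;
    }
    return 1;
}
```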
How This Chapter Connects to the Subtopics
The rest of this part of the course will deepen specific aspects of performance work:
- Measuring performance: basic timing, collecting and interpreting metrics
- Benchmarking applications: designing fair and meaningful tests
- Profiling tools: identifying hotspots and bottlenecks
- Memory and cache optimization: making better use of the memory hierarchy
- Vectorization strategies: exploiting SIMD capabilities
- Improving parallel efficiency: understanding and improving scaling
Taken together, these topics will give you both the conceptual framework and the practical tools to analyze and improve HPC application performance in a disciplined way.