Overview of Profiling in HPC
Profiling tools help you understand where time and resources are spent in a program. In the parent chapter you saw why measurement matters and how to think about performance. Here the focus is on the concrete tools and the kinds of information they provide, specifically in an HPC context.
Profiling is different from simple timing. A single wall clock measurement tells you how long the program took. A profiler tells you why it took that long, by breaking execution down by function, line, loop, thread, process, and sometimes even by hardware events such as cache misses or floating point operations.
Profiling without understanding what is being measured can be misleading. Always link profiler output back to a clear performance question, for example: “Why does scaling stop beyond 16 nodes?” or “Which loop dominates runtime?”
Types of Profilers
Profiling tools can be grouped by how they collect data and what they focus on. For beginners, it is useful to distinguish three main categories.
A sampling profiler periodically interrupts the program and records where it is executing. The result is a statistical picture of hot spots with relatively low overhead. Sampling is often the first step, especially on large parallel jobs, because it scales well and has small performance impact.
An instrumentation profiler inserts extra code at function entry and exit, or at specific points such as MPI calls or OpenMP regions. Instrumentation gives very detailed information and precise timings but at the cost of higher overhead. It is well suited to smaller inputs or targeted experiments.
A hardware performance counter profiler uses CPU and sometimes GPU counters that count events such as cache misses, branch mispredictions, or vector instructions. These tools help connect time spent to microarchitectural behavior and are particularly relevant when optimizing for memory and cache performance.
Most HPC-oriented tools combine these techniques. For instance, a tool may use sampling for general hot spot detection, hardware counters to annotate those samples with microarchitectural detail, and optional instrumentation of key libraries such as MPI.
General Workflow When Using Profiling Tools
Although tools differ in detail, they generally follow a similar workflow on an HPC system.
First, you decide what to measure. Examples include function hot spots, MPI communication overhead, OpenMP thread imbalance, memory bandwidth, or GPU utilization. Selecting the right metric avoids drowning in irrelevant data.
Second, you prepare a suitable build of your program. For most profilers, you should compile with debug symbols, usually via -g, and with optimization enabled, for example -O2 or -O3. Debug symbols allow the profiler to map machine instructions back to functions and source lines. Optimization ensures that you profile realistic performance behavior.
A typical compilation command might look like:
mpicc -O3 -g mycode.c -o mycode
Third, you run the program under the profiler. On an HPC cluster this often means wrapping your srun or mpirun command in the profiler’s launcher utility or using environment variables provided by the module system. Many profiling tools integrate with job schedulers like SLURM and are designed to work inside batch scripts.
Finally, you analyze the collected data. Most tools provide both command line reports and graphical interfaces. Command line summaries are convenient on login nodes. Graphical views such as flame graphs, timeline views, and communication matrices can be explored on your local workstation using exported profiles.
Always start with a manageable problem size and a limited number of nodes when profiling. Large-scale production jobs often generate too much profile data and can overwhelm both storage and the analysis tools.
Command Line Profiling Tools on Linux
Several basic profilers are available on most Linux-based HPC systems and are valuable first tools before moving on to more complex HPC-specific packages.
The time command is the simplest: it reports wall time, user CPU time, and system time, and the standalone GNU version (/usr/bin/time) can also report memory statistics. While not a full profiler, it establishes a baseline.
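For example, a minimal baseline measurement might look like the following, assuming GNU time is installed at /usr/bin/time (the shell builtin time reports only elapsed, user, and system time):
/usr/bin/time -v ./mycode    # -v adds peak memory (maximum resident set size) and other statistics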
The gprof tool works with programs compiled with -pg. It instruments function calls and produces a call graph and per-function timing information. gprof is relatively old and not ideal for complex modern HPC codes, but it can still be instructive for small single-process programs.
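A minimal gprof session for a small single-process program could look like this sketch (using gcc here; the source and output names are placeholders):
gcc -O2 -g -pg mycode.c -o mycode      # build with gprof instrumentation
./mycode                               # run normally; writes gmon.out in the working directory
gprof ./mycode gmon.out > profile.txt  # produce the flat profile and call graph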
The perf tool on Linux is a flexible sampling and event measurement utility. It can report hot spots and hardware counters with commands such as:
perf record ./mycode
perf report
This combination records samples during execution and then presents a summary of where time was spent, along with data sourced from performance counters. perf is particularly useful when you suspect microarchitectural issues, such as poor cache behavior or limited instruction throughput.
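To read out aggregate hardware counters directly, perf stat can be used; a typical invocation might look like the following (the exact set of available events depends on the CPU):
perf stat -e cycles,instructions,cache-references,cache-misses ./mycode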
Profiling Parallel Programs with MPI
Distributed-memory programs add complexity, because what matters is not only where time is spent in computation, but also how much time is spent in communication and where processes wait for each other. Dedicated MPI profiling tools address this by capturing information about MPI calls and communication patterns.
Many MPI implementations provide a basic internal profiling interface, often accessed by linking against special libraries or setting environment variables. These tools can produce simple summaries of how much time was spent in each MPI routine, such as MPI_Send, MPI_Recv, MPI_Bcast, or MPI_Allreduce.
More advanced MPI profilers use the MPI profiling interface (PMPI) to intercept and measure all MPI calls. They can record message sizes, communication partners, and call stacks. The resulting data can be viewed as communication matrices that show which ranks communicate heavily with which others, or as timelines that reveal periods of waiting and imbalance.
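As a minimal sketch of how such tools work, the MPI profiling interface lets a measurement library provide its own version of an MPI routine and forward to the real implementation through the PMPI_ entry point. The wrapper below, which only times MPI_Send and reports at finalization, is an illustrative example rather than any particular tool's implementation:
/* Minimal PMPI wrapper sketch in C: intercept MPI_Send, measure it,
   and forward to the real implementation via PMPI_Send. */
#include <mpi.h>
#include <stdio.h>

static double send_time  = 0.0;  /* accumulated time inside MPI_Send */
static long   send_calls = 0;    /* number of intercepted calls      */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int err = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_calls++;
    return err;
}

int MPI_Finalize(void)
{
    fprintf(stderr, "MPI_Send: %ld calls, %.3f s total\n",
            send_calls, send_time);
    return PMPI_Finalize();
}
Compiled into a library and linked (or preloaded) ahead of the MPI library, wrappers like this intercept every matching call without changing the application source.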
MPI-aware profilers help you answer questions such as whether time is dominated by small messages, by collective operations, or by load imbalance across ranks. They also allow you to compare the time spent in computation functions to the time spent in the MPI library.
When profiling MPI applications, always preserve the launch pattern used in production runs. For example, if you usually run with srun -n 64, the profiler invocation should use the same layout, such as srun -n 64 profiler_command ./mycode, to obtain representative results.
Profiling Shared-Memory Parallelism
Shared-memory profiling focuses on threads and cores inside a node. For OpenMP or pthread-based codes, you often want to know which threads are busy and which are idle, and how much overhead is caused by synchronization.
Many general-purpose profilers can display per-thread CPU usage. More specialized thread profilers can show OpenMP constructs such as parallel regions, worksharing loops, and synchronization points, and can indicate the time each thread spends in these regions. This information reveals load imbalance within a node and helps identify loops that are not effectively parallelized.
Timeline views plot each thread as a row and show colored segments for computation, synchronization, and the overhead of starting and ending (forking and joining) parallel regions. Dense clusters of synchronization and frequent transitions between parallel and serial execution highlight opportunities to simplify or restructure parallel sections.
In addition, some profilers report OpenMP-specific metrics such as the number of times a barrier was executed, the amount of time threads waited at barriers, and the overhead of tasking constructs. These metrics are essential when optimizing for shared-memory parallel efficiency.
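As a simple illustration of the kind of pattern these metrics expose, consider the following hypothetical OpenMP loop in C, in which later iterations do more work than earlier ones. With static scheduling, a thread-aware profiler would show some threads idling at the implicit barrier while others are still computing:
/* Load-imbalanced OpenMP loop: iteration i costs O(i) work, so with
   schedule(static) the threads handling high i finish much later. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 10000;
    double total = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < i; j++)   /* triangular amount of work */
            s += (double)j * 1e-9;
        total += s;
    }

    printf("total = %f\n", total);
    return 0;
}
Switching to schedule(dynamic) or schedule(guided) is one common way to reduce this kind of imbalance, and a profiler makes the improvement directly visible in the barrier wait times.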
Combined MPI and OpenMP Profiling
Hybrid codes that combine MPI with OpenMP or other threading models require profilers that understand both the distributed- and shared-memory levels at once. In hybrid profiling, you want to see interactions between MPI ranks and threads, for example how intra-node imbalance affects inter-node communication.
Hybrid profilers often present hierarchical views where nodes contain MPI ranks, and ranks contain threads. Timeline or trace views then show MPI communication events on each rank, along with threaded computations inside those ranks.
These tools can highlight patterns such as one rank leaving an MPI call late because its threads finished work more slowly, which then delays collectives across many nodes. This kind of behavior is difficult to infer from simple overall timing numbers but becomes visible in detailed traces.
Because hybrid profiles can generate very large trace files, many tools support selective tracing. You might first run a sampling pass to find hot MPI phases, then rerun with detailed tracing enabled only for those phases or only for a subset of ranks.
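In practice, collecting a hybrid profile mostly means making the rank and thread layout explicit in the profiled launch. A sketch, using the generic launcher name myprofiler_launch from the batch example later in this section and assuming 16 ranks with 8 OpenMP threads each, might look like:
export OMP_NUM_THREADS=8
srun -n 16 -c 8 myprofiler_launch ./mycode input.dat   # 16 MPI ranks, 8 cores (threads) per rank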
GPU and Accelerator Profiling Tools
When GPUs or other accelerators are involved, CPU-only profiling is not enough. You need tools that understand device kernels, data transfers across PCIe or NVLink, and the GPU memory hierarchy.
GPU profilers typically provide:
- Kernel-level timing that identifies which kernels dominate GPU time and how often they are launched.
- Memory transfer tracing that reports time spent transferring data between host and device, and between devices.
- Hardware utilization metrics such as occupancy, achieved bandwidth, and instruction throughput.
Visual GPU profilers often produce timelines in which CPU and GPU activity are shown together, aligned in time. This helps you see whether GPUs are underutilized because of insufficient work per kernel, inefficient data transfer strategies, or CPU-side bottlenecks that delay kernel launches.
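As one concrete example, NVIDIA's Nsight Systems command-line tool can collect such a combined CPU/GPU timeline (assuming the nsys command is available on the system; other vendors provide analogous tools):
nsys profile -o mycode_report ./mycode   # writes a report file that can be opened in the Nsight Systems GUI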
Like CPU profilers, GPU profiling tools can use sampling, but much of their power comes from detailed instrumentation provided by the vendor’s runtime. Because GPU applications can be sensitive to profiling overhead, many tools let you restrict profiling to a subset of kernels or time intervals.
Do not interpret low GPU utilization metrics in isolation. Always correlate GPU kernel timings and utilization with CPU-side behavior and data movement to understand the full pipeline.
Tracing versus Statistical Profiling
Two conceptual styles often appear in performance tools: tracing and statistical profiling. Understanding the distinction helps you choose the right approach.
A trace records detailed information about a large number of events, often every MPI call, every OpenMP region, or every kernel launch, along with exact timestamps. Traces can be replayed and examined to reconstruct a precise timeline of program execution. This is extremely valuable for debugging performance issues that arise from complex interactions. The trade-off is that traces can be large and expensive to collect.
Statistical profiling samples the state of the program at intervals, for example every millisecond, and records the active function or code location, sometimes combined with performance counter values. Over many samples, you get a probability distribution of where the program spends time. Sampling introduces less overhead and produces smaller data sets, but loses fine-grained ordering information.
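With perf, for example, the sampling rate and call-stack collection can be adjusted to trade detail against overhead; a lower frequency gives a coarser but cheaper picture:
perf record -F 200 -g ./mycode    # sample roughly 200 times per second, recording call stacks
perf report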
In practice, a common strategy is to begin with statistical hot spot analysis and then focus tracing on the most interesting parts of the program or on reduced problem sizes where detailed traces are manageable.
Integrating Profiling Tools into Batch Workflows
On shared HPC systems you rarely run interactive, long profiling sessions. Instead, profiling commands are integrated into batch jobs. The basic pattern is to load the profiling module, modify your run line, and ensure output files are written to appropriate directories.
A typical SLURM script segment for profiling a parallel job might look like:
module load myprofiler
srun -n 64 myprofiler_launch --output=profile_%j ./mycode input.dat
A job identifier, typically taken from the SLURM_JOB_ID environment variable or the %j filename pattern, is often used to tag profile output with the job ID, which simplifies later analysis. Since many profilers can generate substantial data, it is good practice to avoid writing profiling output to shared home directories. Use scratch or work file systems designed for high I/O volume.
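Putting these pieces together, a more complete (but still schematic) batch script might look like the following; the scratch path and resource requests are assumptions to adapt to your site:
#!/bin/bash
#SBATCH --job-name=profile_run
#SBATCH --nodes=4
#SBATCH --ntasks=64
#SBATCH --time=00:30:00

module load myprofiler

# Write profile data to a scratch/work file system rather than $HOME.
PROFDIR=/scratch/$USER/profiles/$SLURM_JOB_ID   # assumed site-specific path
mkdir -p "$PROFDIR"

srun -n 64 myprofiler_launch --output="$PROFDIR/profile_${SLURM_JOB_ID}" ./mycode input.dat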
Automating profiling runs in this way helps you systematically explore performance as you change code, compiler flags, or problem sizes. You can also script post-processing, such as converting traces into summarized reports or extracting key metrics into simple text tables.
Interpreting Typical Profiler Reports
Profilers produce many forms of output. For beginners, it is helpful to recognize a few common report types and how to use them.
A flat profile lists functions ranked by total time or by percentage of total runtime. This is the most basic hot spot view. If a function accounts for a large fraction of time, it is usually the first candidate for more detailed inspection.
A call graph shows how functions call each other along with aggregated time per call path. It helps distinguish whether a heavy function is inherently expensive, or whether it is simply called many times by a higher-level routine.
A timeline plots execution over time for processes and threads. Gaps or long stretches of waiting reveal imbalance, synchronization issues, and communication delays. MPI timelines show collective operations and point-to-point messages as colored segments or arrows.
A communication matrix aggregates messages between ranks into a 2D matrix. Heavily communicating rank pairs appear as bright or large entries. This visualization helps identify bottlenecks from non-uniform communication patterns or overloaded ranks.
A hardware metric report lists derived quantities such as instructions per cycle (IPC), cache miss rates, branch misprediction rates, and memory bandwidth utilization. Interpreting these figures requires some understanding of the underlying architecture, but even simple comparisons, such as “before” and “after” an optimization, can be informative.
When examining profiler reports, avoid optimizing rare code paths. Focus first on regions that consume a significant fraction of total runtime, typically at least 5 to 10 percent.
Measurement Perturbation and Profiling Overhead
Profilers inevitably change the behavior they measure to some extent. This effect is called measurement perturbation. Sampling reduces this effect, whereas heavy instrumentation or very fine-grained tracing can significantly slow the program and alter timing relationships.
To manage this, most tools allow you to adjust sampling frequency, choose which functions to instrument, or limit tracing to specific phases. You can also run with smaller input sizes to contain overhead, although care is required to ensure that the performance characteristics remain representative.
A simple sanity check is to compare total runtime between a normal run and a profiled run. If profiling increases runtime only modestly, say by a few percent up to a few tens of percent, the data is usually still meaningful. If runtime increases dramatically, measurements must be interpreted with great caution, and sampling or selective tracing should be used instead.
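A quick way to perform this check, again assuming GNU time and the generic launcher name used earlier, is:
/usr/bin/time -p ./mycode input.dat                    # baseline runtime
/usr/bin/time -p myprofiler_launch ./mycode input.dat  # runtime under the profiler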
Choosing Appropriate Profiling Tools
No single tool is best for every situation. The right choice depends on your programming model, target architecture, and the question you are asking.
For serial or simple threaded programs, a general-purpose CPU profiler that provides function-level timing and line-level hotspots is usually sufficient. When you start using MPI, an MPI-aware profiler becomes essential, particularly one that can deal with many processes and large jobs.
For shared-memory performance tuning, a profiler that understands OpenMP constructs and can highlight thread imbalances is important. Once GPUs are involved, a GPU-aware tool that can relate host and device activities is necessary.
Beginners should start with the simplest tools and reports that answer immediate questions. As your understanding grows, you can explore more advanced features, such as hardware counter integration, custom metrics, or scripting interfaces that allow automating complex analyses.
Building a Profiling Habit
Profilers are most effective when used regularly, not just at the end of development. Incorporating profiling runs into your normal workflow helps you detect performance regressions early, compare different algorithmic choices, and understand scaling behavior as you move to larger problem sizes or more nodes.
A practical approach is to maintain a small suite of standard test problems and a set of profiling scripts that you can run after significant code changes. Recording profiler output along with the code version, compiler options, and runtime configuration forms a performance history that is invaluable for systematic optimization and for the broader performance analysis practices discussed in the parent chapter.
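A minimal sketch of such a record-keeping script is shown below; the directory layout, file names, and the assumption that the code lives in a git repository are all illustrative choices:
#!/bin/bash
# Store profiler output next to the code version and build settings for later comparison.
RUN_ID=$(date +%Y%m%d_%H%M%S)
LOGDIR=perf_history/$RUN_ID
mkdir -p "$LOGDIR"

git rev-parse HEAD    > "$LOGDIR/commit.txt"   # code version
echo "mpicc -O3 -g"   > "$LOGDIR/build.txt"    # compiler and flags used for this run
cp profile_* "$LOGDIR"/ 2>/dev/null            # profiler summaries produced by the job, if any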
Do not wait until the end of a project to profile. Make profiling a routine activity, alongside functional testing and debugging, so that performance issues are discovered while they are still easy to fix.