What Profiling Tools Do in HPC
Profiling tools measure where time is spent and how resources are used while your program runs. In the context of HPC, they help answer questions like:
- Which functions dominate run time?
- Is the code limited by CPU, memory, or communication?
- Are all cores/threads/GPUs used effectively?
- How does performance change as we scale to more nodes?
In this chapter, the focus is on types of profiling tools, typical workflows, and how to interpret and act on profiling data in an HPC setting. The concepts of what to measure and why are covered in the parent chapter; here we concentrate on the tools themselves and their use.
Kinds of Profiling Tools
Most HPC performance tools fall into a few broad categories:
Instrumentation-based profilers
Instrumentation means adding extra code (manually or automatically) to record events such as function entry/exit or MPI calls.
Characteristics:
- Can be done at:
- Source level (e.g., `#pragma` directives or macros)
- Compiler level (automatic function instrumentation)
- Binary level (post-processing the executable)
- Often generate very detailed traces (timeline of events)
- Overhead can be significant if too many events are recorded
Typical uses:
- Understanding call paths and which routines are expensive
- Detailed MPI/OpenMP behavior (who talks to whom and when)
- Studying load imbalance in parallel loops
Examples (you don’t need to know them all, but you should recognize the types):
- Score-P (instrumentation + trace collection)
- TAU (Tuning and Analysis Utilities)
- Intel VTune Profiler (has both sampling and instrumentation modes)
- HPCToolkit (supports low-overhead measurement and analysis)
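To make the idea concrete, here is a minimal sketch of manual source-level instrumentation in C. The `region_begin`/`region_end` helpers and the region name are purely illustrative (they are not part of any particular tool's API); tool-provided macros such as Score-P's user instrumentation follow the same pattern but handle nesting, threads, and output formats for you.

```c
#include <stdio.h>
#include <time.h>

/* Illustrative manual instrumentation: record the wall-clock time spent
 * between region_begin() and region_end() for one (non-nested) region. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static double region_start;

static void region_begin(void)
{
    region_start = now_seconds();                  /* event: region entry */
}

static void region_end(const char *name)
{
    double elapsed = now_seconds() - region_start; /* event: region exit */
    fprintf(stderr, "[instr] %s: %.6f s\n", name, elapsed);
}

int main(void)
{
    region_begin();
    double sum = 0.0;                  /* stand-in for the expensive kernel */
    for (long i = 0; i < 50 * 1000 * 1000; i++)
        sum += 1.0 / (double)(i + 1);
    region_end("harmonic_sum");

    printf("result = %f\n", sum);
    return 0;
}
```

Real instrumentation tools generalize exactly this pattern: they track nested regions per thread, attach metadata such as source locations, and either aggregate the measurements into a profile or log every event with a timestamp into a trace.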
Sampling-based profilers
Sampling profilers interrupt the program at regular intervals and record the current call stack and, optionally, hardware counter values.
Characteristics:
- Much lower overhead than heavy instrumentation
- Less precise per-function timing, but statistically accurate overall
- Work well for long-running, large-scale jobs
- Often rely on hardware performance counters (e.g., via PAPI or perf)
Typical uses:
- Locating hot spots (functions that use the most CPU time)
- Identifying whether the code is compute-bound or memory-bound
- Quick, first-pass performance analysis
Examples:
- `gprof` (historical; limited parallel support)
- `perf` (Linux performance tools)
- HPCToolkit (sampling-based call-path profiling)
- Intel VTune (sampling modes)
- NVIDIA Nsight Systems (sampling, tracing) and Nsight Compute (GPU kernel profiling)
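The sampling principle itself is simple enough to demonstrate with a toy program (a sketch only, not a usable profiler): a POSIX `SIGPROF` timer interrupts the process at fixed CPU-time intervals, and the signal handler attributes each sample to whatever phase the program is currently in. The phase labels and the 10 ms interval are made up for illustration.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/* Toy sampler: SIGPROF fires roughly every 10 ms of consumed CPU time and
 * the handler charges one sample to the currently active phase. */
static volatile sig_atomic_t current_phase = 0;
static volatile sig_atomic_t samples[2];

static void on_sample(int sig)
{
    (void)sig;
    samples[current_phase]++;
}

static double burn_cpu(long iters)        /* stand-in for real work */
{
    double x = 0.0;
    for (long i = 0; i < iters; i++)
        x += 1.0 / (double)(i + 1);
    return x;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sample;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval timer = { { 0, 10000 }, { 0, 10000 } };  /* 10 ms */
    setitimer(ITIMER_PROF, &timer, NULL);

    current_phase = 0;
    double a = burn_cpu(150L * 1000 * 1000);    /* "hot" phase   */
    current_phase = 1;
    double b = burn_cpu(50L * 1000 * 1000);     /* shorter phase */

    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    setitimer(ITIMER_PROF, &off, NULL);         /* stop sampling */

    long s0 = samples[0], s1 = samples[1], total = s0 + s1;
    printf("phase 0: %ld samples (%.0f%%)\n", s0, total ? 100.0 * s0 / total : 0.0);
    printf("phase 1: %ld samples (%.0f%%)\n", s1, total ? 100.0 * s1 / total : 0.0);
    printf("(checksum %.3f)\n", a + b);
    return 0;
}
```

With roughly three times as much work in phase 0, about three quarters of the samples land there. The per-sample cost is tiny, which is why sampling scales well to long, large runs.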
Tracing tools
Traces record a timeline of events (function calls, MPI messages, I/O, etc.) for each process or thread. This is often based on instrumentation, but the key feature is the time-ordered log.
Characteristics:
- Provide a visual timeline of what each rank/thread is doing
- Show communication patterns, waiting, and synchronization
- Can become huge for large runs (need to restrict scale and duration)
- Often visualized with specialized GUI tools
Typical uses:
- Diagnosing MPI wait times and communication overhead
- Identifying serialization or synchronization bottlenecks
- Understanding overlap of computation and communication
Examples:
- Vampir (visualization for Score-P traces)
- Paraver (with Extrae instrumentation)
- Intel Trace Analyzer and Collector
- NVIDIA Nsight Systems (timeline view including CPU and GPU)
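Stripped of all real-world complexity, a trace is just a time-stamped, per-process (or per-thread) event log. The record layout and function names below are invented for illustration; production tools write compact binary formats (e.g., OTF2 for Score-P/Vampir) and record far more context.

```c
#include <stdio.h>
#include <time.h>

/* Minimal event trace: append (timestamp, label, enter/leave) records to a
 * buffer during the run and dump them as a time-ordered log at the end. */
typedef struct {
    double      t;      /* seconds since the first event     */
    const char *name;   /* region or message label           */
    int         enter;  /* 1 = enter event, 0 = leave event  */
} trace_event;

static trace_event events[1024];
static int         n_events;
static double      t0 = -1.0;

static double elapsed(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    double t = ts.tv_sec + ts.tv_nsec * 1e-9;
    if (t0 < 0.0) t0 = t;
    return t - t0;
}

static void trace(const char *name, int enter)
{
    if (n_events < 1024)
        events[n_events++] = (trace_event){ elapsed(), name, enter };
}

int main(void)
{
    trace("compute", 1);
    double sum = 0.0;
    for (long i = 0; i < 20 * 1000 * 1000; i++)
        sum += 1.0 / (double)(i + 1);
    trace("compute", 0);

    trace("output", 1);
    fprintf(stderr, "sum = %f\n", sum);
    trace("output", 0);

    /* A timeline viewer (Vampir, Paraver, ...) draws logs like this one. */
    for (int i = 0; i < n_events; i++)
        printf("%10.6f  %-5s %s\n", events[i].t,
               events[i].enter ? "ENTER" : "LEAVE", events[i].name);
    return 0;
}
```

In a parallel run there is one such log per rank and thread, which is exactly why traces grow so quickly at scale.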
Hardware counter tools
These tools focus on low-level hardware events:
- Cache misses
- Branch mispredictions
- FLOPs
- Memory bandwidth
- Vectorization usage
They often use hardware performance monitoring units (PMUs).
Typical uses:
- Determining whether performance is limited by:
- Computation (FLOPs)
- Memory bandwidth
- Latency (cache misses)
- Branching
- Checking if compiler vectorization and CPU features are exploited
Examples:
- PAPI (Performance API) as a foundation for many tools
- `perf stat` on Linux
- Intel VTune (hardware counters)
- LIKWID (Linux tools for performance counters and affinity)
- CPU vendor tools (e.g., AMD uProf)
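To show what reading counters from inside an application looks like, here is a minimal PAPI sketch. It assumes PAPI is installed (compile with `-lpapi`); which preset events exist depends on the CPU and can be checked with the `papi_avail` utility, and error handling is reduced to the bare minimum.

```c
#include <stdio.h>
#include <papi.h>

/* Count total instructions and L2 cache misses around a loop using PAPI
 * preset events. Event availability varies by processor. */
int main(void)
{
    int event_set = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return 1;
    }
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_INS);   /* total instructions    */
    PAPI_add_event(event_set, PAPI_L2_TCM);    /* total L2 cache misses */

    double sum = 0.0;
    PAPI_start(event_set);
    for (long i = 0; i < 10 * 1000 * 1000; i++)
        sum += 1.0 / (double)(i + 1);
    PAPI_stop(event_set, counts);

    printf("sum          = %f\n", sum);
    printf("instructions = %lld\n", counts[0]);
    printf("L2 misses    = %lld\n", counts[1]);
    return 0;
}
```

Tools such as `perf stat` or LIKWID give you the same class of numbers without touching the source code; an in-code API like this is useful when you only want counters for one specific region.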
Specialized tools (MPI, OpenMP, GPU)
There are also tools focused on specific programming models:
- MPI-focused:
- The MPI profiling interface (PMPI), used under the hood by many tools (see the sketch at the end of this subsection)
- Intel Trace Analyzer, Paraver/Extrae, Vampir (MPI timelines and statistics)
- OpenMP-focused:
- Tools that show thread-level activity and work-sharing behavior
- GPU-focused:
- NVIDIA Nsight Systems (global timeline across CPU and GPU)
- NVIDIA Nsight Compute (single-kernel analysis: occupancy, memory, etc.)
- Vendor-specific tools for other accelerators
These tools report metrics that are particularly meaningful for that model (e.g., MPI wait time, OpenMP idle time, GPU occupancy).
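The PMPI interface mentioned above is worth a closer look because it is so simple: the MPI standard requires every `MPI_*` function to also be callable as `PMPI_*`, so a tool can supply its own `MPI_Send` that does some bookkeeping and forwards to the real implementation. A minimal sketch (such a wrapper would be linked ahead of the MPI library or preloaded as a shared object):

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal PMPI interposition: measure the time each rank spends in MPI_Send
 * and report it when the application calls MPI_Finalize. */
static double send_seconds = 0.0;
static long   send_calls   = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_seconds += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "[rank %d] MPI_Send: %ld calls, %.6f s\n",
            rank, send_calls, send_seconds);
    return PMPI_Finalize();
}
```

Full-featured MPI tools wrap nearly every MPI call this way, which is how they obtain per-rank communication statistics without modifying the application.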
Typical HPC Profiling Workflow
Profiling is not a one-shot activity. A practical workflow on HPC systems usually looks like this:
- Start with a small-to-medium test case
- Use a representative input that runs quickly enough to experiment.
- Start with modest core counts or a single node to simplify data.
- Run a sampling profiler to find hot spots
- Use a low-overhead tool (`perf`, VTune sampling, HPCToolkit).
- Identify:
- Top functions by percentage of total time
- Whether time is spent in your code or in libraries
- Decide: Is the issue algorithmic, computational, memory, or communication-related?
- Drill down with more detailed tools if needed
- If communication seems expensive:
- Use an MPI-aware tracing tool or MPI summary profiler.
- If threads or OpenMP are the issue:
- Use thread-aware profiling/tracing to look at imbalance or oversubscription.
- If GPU kernels are involved:
- Use GPU profilers (Nsight Compute/Systems) to look at kernel performance and memory traffic.
- Change one thing at a time
- Apply a targeted optimization (e.g., change loop order, modify MPI domain decomposition).
- Re-profile with the same tool and settings for comparison.
- Look at relative changes, not just absolute timings.
- Scale up gradually
- Once single-node performance looks reasonable, profile at a few larger scales.
- Use tools that can handle more processes (often sampling-based).
- Focus on metrics like:
- Parallel efficiency
- Communication/computation ratio
- Load balance
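For the scaling metrics in the last step, the definitions are easy to make concrete. A small sketch of strong-scaling speedup and parallel efficiency, using made-up timings rather than measured data:

```c
#include <stdio.h>

/* Strong-scaling parallel efficiency: E(p) = T(1) / (p * T(p)). */
static double parallel_efficiency(double t_serial, double t_parallel, int procs)
{
    return t_serial / (procs * t_parallel);
}

int main(void)
{
    double t1 = 1000.0;   /* illustrative run time on 1 process (s)   */
    int    p  = 64;
    double tp = 20.0;     /* illustrative run time on 64 processes (s) */

    printf("speedup    : %.1fx\n", t1 / tp);                                  /* 50.0x */
    printf("efficiency : %.0f%%\n", 100.0 * parallel_efficiency(t1, tp, p));  /* 78%   */
    return 0;
}
```

An efficiency well below 100% at modest process counts is usually the cue to look at the communication/computation ratio and load balance in the profiles.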
Using Profiling Tools on HPC Systems
Most HPC clusters provide a curated set of tools via environment modules. A typical usage pattern is:
- Load the tool module
Example (exact names vary by system):