What Profiling Tools Do in HPC
Profiling tools measure where time is spent and how resources are used while your program runs. In the context of HPC, they help answer questions like:
- Which functions dominate run time?
- Is the code limited by CPU, memory, or communication?
- Are all cores/threads/GPUs used effectively?
- How does performance change as we scale to more nodes?
In this chapter, the focus is on types of profiling tools, typical workflows, and how to interpret and act on profiling data in an HPC setting. The concepts of what to measure and why are covered in the parent chapter; here we concentrate on the tools themselves and their use.
Kinds of Profiling Tools
Most HPC performance tools fall into a few broad categories:
Instrumentation-based profilers
Instrumentation means adding extra code (manually or automatically) to record events such as function entry/exit or MPI calls.
Characteristics:
- Can be done at:
- Source level (e.g., `#pragma` directives or macros)
- Compiler level (automatic function instrumentation)
- Binary level (post-processing the executable)
- Often generate very detailed traces (timeline of events)
- Overhead can be significant if too many events are recorded
Typical uses:
- Understanding call paths and which routines are expensive
- Detailed MPI/OpenMP behavior (who talks to whom and when)
- Studying load imbalance in parallel loops
Examples (you don’t need to know them all, but you should recognize the types):
- Score-P (instrumentation + trace collection)
- TAU (Tuning and Analysis Utilities)
- Intel VTune Profiler (has both sampling and instrumentation modes)
- HPCToolkit (supports low-overhead measurement and analysis)
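To make the idea concrete, here is a minimal sketch of manual source-level instrumentation in C. The `region_begin`/`region_end` helpers and the region name are purely illustrative (they are not part of any particular tool's API); tool-provided macros such as Score-P's user instrumentation follow the same pattern but handle nesting, threads, and output formats for you.

```c
#include <stdio.h>
#include <time.h>

/* Illustrative manual instrumentation: record the wall-clock time spent
 * between region_begin() and region_end() for one (non-nested) region. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static double region_start;

static void region_begin(void)
{
    region_start = now_seconds();                  /* event: region entry */
}

static void region_end(const char *name)
{
    double elapsed = now_seconds() - region_start; /* event: region exit */
    fprintf(stderr, "[instr] %s: %.6f s\n", name, elapsed);
}

int main(void)
{
    region_begin();
    double sum = 0.0;                  /* stand-in for the expensive kernel */
    for (long i = 0; i < 50 * 1000 * 1000; i++)
        sum += 1.0 / (double)(i + 1);
    region_end("harmonic_sum");

    printf("result = %f\n", sum);
    return 0;
}
```

Real instrumentation tools generalize exactly this pattern: they track nested regions per thread, attach metadata such as source locations, and either aggregate the measurements into a profile or log every event with a timestamp into a trace.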
Sampling-based profilers
Sampling profilers interrupt the program at regular intervals and record the current call stack and, optionally, hardware counter values.
Characteristics:
- Much lower overhead than heavy instrumentation
- Less precise per-function timing, but statistically accurate overall
- Work well for long-running, large-scale jobs
- Often rely on hardware performance counters (e.g., via PAPI or perf)
Typical uses:
- Locating hot spots (functions that use the most CPU time)
- Identifying whether the code is compute-bound or memory-bound
- Quick, first-pass performance analysis
Examples:
- `gprof` (historical; limited parallel support)
- `perf` (Linux performance tools)
- HPCToolkit (sampling-based call-path profiling)
- Intel VTune (sampling modes)
- NVIDIA Nsight Systems (sampling, tracing) and Nsight Compute (GPU kernel profiling)
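The sampling principle itself is simple enough to demonstrate with a toy program (a sketch only, not a usable profiler): a POSIX `SIGPROF` timer interrupts the process at fixed CPU-time intervals, and the signal handler attributes each sample to whatever phase the program is currently in. The phase labels and the 10 ms interval are made up for illustration.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/* Toy sampler: SIGPROF fires roughly every 10 ms of consumed CPU time and
 * the handler charges one sample to the currently active phase. */
static volatile sig_atomic_t current_phase = 0;
static volatile sig_atomic_t samples[2];

static void on_sample(int sig)
{
    (void)sig;
    samples[current_phase]++;
}

static double burn_cpu(long iters)        /* stand-in for real work */
{
    double x = 0.0;
    for (long i = 0; i < iters; i++)
        x += 1.0 / (double)(i + 1);
    return x;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sample;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval timer = { { 0, 10000 }, { 0, 10000 } };  /* 10 ms */
    setitimer(ITIMER_PROF, &timer, NULL);

    current_phase = 0;
    double a = burn_cpu(150L * 1000 * 1000);    /* "hot" phase   */
    current_phase = 1;
    double b = burn_cpu(50L * 1000 * 1000);     /* shorter phase */

    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    setitimer(ITIMER_PROF, &off, NULL);         /* stop sampling */

    long s0 = samples[0], s1 = samples[1], total = s0 + s1;
    printf("phase 0: %ld samples (%.0f%%)\n", s0, total ? 100.0 * s0 / total : 0.0);
    printf("phase 1: %ld samples (%.0f%%)\n", s1, total ? 100.0 * s1 / total : 0.0);
    printf("(checksum %.3f)\n", a + b);
    return 0;
}
```

With roughly three times as much work in phase 0, about three quarters of the samples land there. The per-sample cost is tiny, which is why sampling scales well to long, large runs.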
Tracing tools
Traces record a timeline of events (function calls, MPI messages, I/O, etc.) for each process or thread. This is often based on instrumentation, but the key feature is the time-ordered log.
Characteristics:
- Provide a visual timeline of what each rank/thread is doing
- Show communication patterns, waiting, and synchronization
- Can become huge for large runs (need to restrict scale and duration)
- Often visualized with specialized GUI tools
Typical uses:
- Diagnosing MPI wait times and communication overhead
- Identifying serialization or synchronization bottlenecks
- Understanding overlap of computation and communication
Examples:
- Vampir (visualization for Score-P traces)
- Paraver (with Extrae instrumentation)
- Intel Trace Analyzer and Collector
- NVIDIA Nsight Systems (timeline view including CPU and GPU)
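Stripped of all real-world complexity, a trace is just a time-stamped, per-process (or per-thread) event log. The record layout and function names below are invented for illustration; production tools write compact binary formats (e.g., OTF2 for Score-P/Vampir) and record far more context.

```c
#include <stdio.h>
#include <time.h>

/* Minimal event trace: append (timestamp, label, enter/leave) records to a
 * buffer during the run and dump them as a time-ordered log at the end. */
typedef struct {
    double      t;      /* seconds since the first event     */
    const char *name;   /* region or message label           */
    int         enter;  /* 1 = enter event, 0 = leave event  */
} trace_event;

static trace_event events[1024];
static int         n_events;
static double      t0 = -1.0;

static double elapsed(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    double t = ts.tv_sec + ts.tv_nsec * 1e-9;
    if (t0 < 0.0) t0 = t;
    return t - t0;
}

static void trace(const char *name, int enter)
{
    if (n_events < 1024)
        events[n_events++] = (trace_event){ elapsed(), name, enter };
}

int main(void)
{
    trace("compute", 1);
    double sum = 0.0;
    for (long i = 0; i < 20 * 1000 * 1000; i++)
        sum += 1.0 / (double)(i + 1);
    trace("compute", 0);

    trace("output", 1);
    fprintf(stderr, "sum = %f\n", sum);
    trace("output", 0);

    /* A timeline viewer (Vampir, Paraver, ...) draws logs like this one. */
    for (int i = 0; i < n_events; i++)
        printf("%10.6f  %-5s %s\n", events[i].t,
               events[i].enter ? "ENTER" : "LEAVE", events[i].name);
    return 0;
}
```

In a parallel run there is one such log per rank and thread, which is exactly why traces grow so quickly at scale.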
Hardware counter tools
These tools focus on low-level hardware events:
- Cache misses
- Branch mispredictions
- FLOPs
- Memory bandwidth
- Vectorization usage
They often use hardware performance monitoring units (PMUs).
Typical uses:
- Determining whether performance is limited by:
- Computation (FLOPs)
- Memory bandwidth
- Latency (cache misses)
- Branching
- Checking if compiler vectorization and CPU features are exploited
Examples:
- PAPI (Performance API) as a foundation for many tools
- `perf stat` on Linux
- Intel VTune (hardware counters)
- LIKWID (Linux tools for performance counters and affinity)
- CPU vendor tools (e.g., AMD uProf)
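To show what reading counters from inside an application looks like, here is a minimal PAPI sketch. It assumes PAPI is installed (compile with `-lpapi`); which preset events exist depends on the CPU and can be checked with the `papi_avail` utility, and error handling is reduced to the bare minimum.

```c
#include <stdio.h>
#include <papi.h>

/* Count total instructions and L2 cache misses around a loop using PAPI
 * preset events. Event availability varies by processor. */
int main(void)
{
    int event_set = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return 1;
    }
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_INS);   /* total instructions    */
    PAPI_add_event(event_set, PAPI_L2_TCM);    /* total L2 cache misses */

    double sum = 0.0;
    PAPI_start(event_set);
    for (long i = 0; i < 10 * 1000 * 1000; i++)
        sum += 1.0 / (double)(i + 1);
    PAPI_stop(event_set, counts);

    printf("sum          = %f\n", sum);
    printf("instructions = %lld\n", counts[0]);
    printf("L2 misses    = %lld\n", counts[1]);
    return 0;
}
```

Tools such as `perf stat` or LIKWID give you the same class of numbers without touching the source code; an in-code API like this is useful when you only want counters for one specific region.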
Specialized tools (MPI, OpenMP, GPU)
There are also tools focused on specific programming models:
- MPI-focused:
- The MPI profiling interface (PMPI), used under the hood by many tools (see the sketch at the end of this subsection)
- Intel Trace Analyzer, Paraver/Extrae, Vampir (MPI timelines and statistics)
- OpenMP-focused:
- Tools that show thread-level activity and work-sharing behavior
- GPU-focused:
- NVIDIA Nsight Systems (global timeline across CPU and GPU)
- NVIDIA Nsight Compute (single-kernel analysis: occupancy, memory, etc.)
- Vendor-specific tools for other accelerators
These tools report metrics that are particularly meaningful for that model (e.g., MPI wait time, OpenMP idle time, GPU occupancy).
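The PMPI interface mentioned above is worth a closer look because it is so simple: the MPI standard requires every `MPI_*` function to also be callable as `PMPI_*`, so a tool can supply its own `MPI_Send` that does some bookkeeping and forwards to the real implementation. A minimal sketch (such a wrapper would be linked ahead of the MPI library or preloaded as a shared object):

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal PMPI interposition: measure the time each rank spends in MPI_Send
 * and report it when the application calls MPI_Finalize. */
static double send_seconds = 0.0;
static long   send_calls   = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_seconds += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "[rank %d] MPI_Send: %ld calls, %.6f s\n",
            rank, send_calls, send_seconds);
    return PMPI_Finalize();
}
```

Full-featured MPI tools wrap nearly every MPI call this way, which is how they obtain per-rank communication statistics without modifying the application.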
Typical HPC Profiling Workflow
Profiling is not a one-shot activity. A practical workflow on HPC systems usually looks like this:
- Start with a small-to-medium test case
- Use a representative input that runs quickly enough to experiment.
- Start with modest core counts or a single node to simplify data.
- Run a sampling profiler to find hot spots
- Use a low-overhead tool (`perf`, VTune sampling, HPCToolkit).
- Identify:
- Top functions by percentage of total time
- Whether time is spent in your code or in libraries
- Decide: Is the issue algorithmic, computational, memory, or communication-related?
- Drill down with more detailed tools if needed
- If communication seems expensive:
- Use an MPI-aware tracing tool or MPI summary profiler.
- If threads or OpenMP are the issue:
- Use thread-aware profiling/tracing to look at imbalance or oversubscription.
- If GPU kernels are involved:
- Use GPU profilers (Nsight Compute/Systems) to look at kernel performance and memory traffic.
- Change one thing at a time
- Apply a targeted optimization (e.g., change loop order, modify MPI domain decomposition).
- Re-profile with the same tool and settings for comparison.
- Look at relative changes, not just absolute timings.
- Scale up gradually
- Once single-node performance looks reasonable, profile at a few larger scales.
- Use tools that can handle more processes (often sampling-based).
- Focus on metrics like:
- Parallel efficiency
- Communication/computation ratio
- Load balance
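For the scaling metrics in the last step, the definitions are easy to make concrete. A small sketch of strong-scaling speedup and parallel efficiency, using made-up timings rather than measured data:

```c
#include <stdio.h>

/* Strong-scaling parallel efficiency: E(p) = T(1) / (p * T(p)). */
static double parallel_efficiency(double t_serial, double t_parallel, int procs)
{
    return t_serial / (procs * t_parallel);
}

int main(void)
{
    double t1 = 1000.0;   /* illustrative run time on 1 process (s)   */
    int    p  = 64;
    double tp = 20.0;     /* illustrative run time on 64 processes (s) */

    printf("speedup    : %.1fx\n", t1 / tp);                                  /* 50.0x */
    printf("efficiency : %.0f%%\n", 100.0 * parallel_efficiency(t1, tp, p));  /* 78%   */
    return 0;
}
```

An efficiency well below 100% at modest process counts is usually the cue to look at the communication/computation ratio and load balance in the profiles.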
Using Profiling Tools on HPC Systems
Most HPC clusters provide a curated set of tools via environment modules. A typical usage pattern is:
- Load the tool module
Example (exact names vary by system):