Goals of This Chapter
In this chapter you will:
- Get an overview of common debugging tools used in HPC environments.
- Understand how tools differ for serial vs. parallel (MPI/OpenMP/GPU) codes.
- Learn typical workflows for using these tools on clusters (interactive vs batch).
- See concrete examples of how tools help with common parallel bugs (crashes, deadlocks, race conditions, memory errors, performance anomalies).
The aim is not to master every tool, but to recognize what to use when, and how to get started with each.
Types of Debugging Tools in HPC
HPC debugging tools can be grouped into:
- Symbolic debuggers: Step through code, inspect variables, set breakpoints.
- Memory and correctness checkers: Detect invalid memory accesses, leaks, and race conditions.
- MPI correctness tools: Detect mismatched sends/receives, deadlocks, and other MPI usage errors.
- Thread and data race analyzers: Focus on OpenMP/pthreads races and synchronization issues.
- GPU debuggers and checkers: For CUDA/OpenACC/OpenMP offload kernels and memory issues.
- Record-and-replay tools: Reproduce hard-to-trigger bugs deterministically.
- Lightweight logging/trace tools: For bugs that only appear at scale.
Many full-featured HPC tools bundle several of these capabilities.
Using Debuggers on Clusters: Practical Considerations
On HPC systems, debugging has some constraints compared to a laptop:
- You typically debug on compute nodes, not on login nodes.
- You often need to allocate an interactive job through the scheduler (e.g. SLURM) before running the debugger.
- For large MPI jobs, attaching a debugger to all ranks is often impractical; you debug a small number of ranks or a reduced test case.
Typical workflow with SLURM (schematic):
salloc -N 1 -n 4 --time=01:00:00 # interactive allocation
srun --pty gdb ./my_mpi_program    # run under debugger
The exact options depend on your site; cluster documentation usually has a “debugging” section.
Symbolic Debuggers for HPC
GDB and Variants
GDB (GNU Debugger) is the default debugger on many Linux/HPC systems. Key capabilities:
- Run your program under control: gdb ./a.out
- Set breakpoints: break my_function
- Step through code: next, step, continue
- Inspect variables: print var; use backtrace for call stacks
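As a minimal illustration (a hypothetical crash.c, not a program from this chapter), the following code dereferences a NULL pointer inside a helper function. Compiled with -g and run under gdb, run reproduces the segmentation fault and backtrace shows main calling read_value at the offending line:

/* crash.c - hypothetical example: dereferencing a NULL pointer in a helper
   function, so gdb's backtrace points at the faulting line. */
#include <stdio.h>

double read_value(const double *data, int i) {
    return data[i];             /* crashes here when data == NULL */
}

int main(void) {
    double *data = NULL;        /* allocation "forgotten" on purpose */
    printf("value = %f\n", read_value(data, 10));
    return 0;
}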
For MPI programs, you typically:
- Start each rank in its own gdb (tedious beyond a few ranks), or
- Use MPI-launcher integration (e.g. mpirun -np 4 xterm -e gdb ./a.out on local clusters), or
- Use wrappers that launch multiple gdb instances or a parallel-aware interface.
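A common low-tech alternative is an "attach hook" built into the program: each rank prints its host and PID, and one rank spins in a loop until you attach gdb to that PID (gdb -p <pid>) and clear the flag (set var wait = 0). The sketch below is a generic illustration of that pattern, not a site-specific recipe:

/* Hypothetical attach hook for MPI + gdb: rank 0 waits until a debugger
   attaches and sets 'wait' to 0, e.g. with "set var wait = 0" in gdb. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[256];
    gethostname(host, sizeof(host));
    printf("rank %d: host %s, pid %d\n", rank, host, (int)getpid());
    fflush(stdout);

    volatile int wait = (rank == 0);   /* only rank 0 waits for the debugger */
    while (wait)
        sleep(1);

    /* ... rest of the MPI program ... */
    MPI_Finalize();
    return 0;
}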
For OpenMP programs, you normally:
- Start once under gdb.
- Use thread-related commands (info threads, thread <id>) to switch between threads.
Many sites also offer:
- cgdb, ddd, or IDE integrations as graphical/interactive front-ends to gdb.
- gdbserver for remote debugging, though it is less common in batch environments.
LLDB
LLDB (from the LLVM project) is another modern debugger, especially used with Clang/LLVM:
- Similar usage to gdb: lldb ./a.out, with breakpoints, stepping, and variable inspection.
- Better C++ support in some cases.
- Less MPI-specific tooling than some commercial HPC debuggers, but works similarly for single-process or modest MPI jobs.
Memory and Correctness Checkers
Valgrind
Valgrind is widely available on clusters and provides several tools; the most important for HPC beginners:
- memcheck: finds invalid reads/writes, use of uninitialized memory, and memory leaks.
- helgrind / drd: detect data races in threaded programs (can be very slow).
Example:
valgrind --tool=memcheck ./my_program input.dat
Considerations in HPC:
- Valgrind slows programs down by 10–100x or more. Use small problem sizes.
- It is most useful on single-process or very small MPI jobs (e.g., 2–4 ranks).
- For MPI, you typically run:
srun -n 2 valgrind --tool=memcheck ./my_mpi_program
but performance overhead and log volume grow quickly with rank count.
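As a sketch of the kinds of bugs memcheck flags (a hypothetical leaky.c, not taken from this chapter), the following program contains an off-by-one read past the end of a heap array and an allocation that is never freed:

/* leaky.c - hypothetical example of two classic memcheck findings:
   an invalid (out-of-bounds) heap read and memory that is never freed. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double *a = malloc(100 * sizeof(double));
    for (int i = 0; i < 100; i++)
        a[i] = (double)i;

    double sum = 0.0;
    for (int i = 0; i <= 100; i++)    /* off-by-one: reads a[100] */
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;                         /* 'a' is never freed: leak report */
}

memcheck reports the out-of-bounds access as an invalid read of size 8 and, with --leak-check=full, lists the allocation site of the leaked block.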
Address Sanitizer and Other Sanitizers
Compiler-based sanitizers (part of GCC and Clang) are extremely helpful:
- AddressSanitizer (ASan): catches buffer overflows, use-after-free.
- UndefinedBehaviorSanitizer (UBSan): catches undefined behavior (e.g. integer overflows, invalid casts).
- ThreadSanitizer (TSan): detects data races in threaded programs.
You enable them at compile time, for example:
# GCC/Clang example
mpicc -g -O1 -fsanitize=address -fno-omit-frame-pointer -o myprog myprog.c
Then run normally (on reduced input sizes). Pros:
- Typically lower overhead than Valgrind, especially for large codes.
- Work reasonably well with MPI as long as your environment supports them.
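As an example of what ASan catches, here is a minimal hypothetical use-after-free; ASan stops at the first bad access and prints stack traces for the allocation, the free, and the offending read:

/* uaf.c - hypothetical use-after-free that AddressSanitizer reports as
   "heap-use-after-free", with stack traces for malloc, free, and the access. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = malloc(64);
    strcpy(buf, "result");
    free(buf);

    printf("%s\n", buf);   /* buffer is read after it was freed */
    return 0;
}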
Check your site documentation; some clusters provide preconfigured sanitizer builds.
MPI-Aware Debugging Tools
For MPI-specific correctness (mismatched sends/receives, deadlocks, incorrect collectives), MPI-aware tools are very valuable.
MPI Checkers (MUST, Intel MPI correctness, etc.)
Examples (availability varies by site):
- MUST (MPI correctness checking tool): An open-source tool that intercepts MPI calls and looks for:
- Deadlocks and mismatched communication.
- Incorrect use of communicators, tags, datatypes.
- Resource leaks (e.g., communicators not freed).
- Vendor tools or extensions:
- Some MPI distributions provide built-in checkers (e.g. -check_mpi style options or environment variables).
- Intel MPI tooling (often part of Intel oneAPI) can diagnose some MPI issues.
Typical usage pattern (schematic):
mustrun -np 8 ./my_mpi_program
or use the wrapper command specified by the tool.
Features:
- Generate human-readable reports that point to the MPI call sites involved in potential deadlocks or mismatches.
- Often integrate with existing debuggers (e.g. show stack traces).
These tools are particularly useful for bugs that only appear in parallel and may not crash (e.g., silent hangs).
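As an illustration of the kind of bug these tools report, the hypothetical two-rank exchange below posts a blocking receive on both ranks before either sends; the program hangs silently, and an MPI-aware checker or debugger shows both ranks blocked in MPI_Recv:

/* deadlock.c - hypothetical two-rank exchange: both ranks call a blocking
   MPI_Recv before their MPI_Send, so neither can make progress. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, val = 0, recv_val;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int other = 1 - rank;             /* assumes exactly 2 ranks */

    MPI_Recv(&recv_val, 1, MPI_INT, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);      /* both ranks block here forever */
    MPI_Send(&val, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Swapping the send/receive order on one rank, or using MPI_Sendrecv or non-blocking calls, removes the deadlock.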
Parallel-Aware GUI Debuggers (DDT/Arm Forge, TotalView)
Many HPC centers provide commercial parallel debuggers, commonly:
- Arm DDT (part of Arm Forge, now Linaro Forge).
- Rogue Wave TotalView (now Perforce TotalView).
Key capabilities:
- Attach to or launch MPI jobs with hundreds or thousands of ranks (practically, you usually debug tens, not thousands).
- Inspect all ranks simultaneously; compare variables across ranks.
- Visualize which ranks are stuck in which MPI call (useful for deadlock analysis).
- Set breakpoints in all ranks at once, or in a subset.
- Control job execution collectively (start/stop/step all ranks).
Typical workflow (high level):
- Start an interactive job (salloc).
- Load the module, e.g. module load forge or module load totalview.
- Launch via the tool’s front-end or a wrapper, e.g.:
ddt srun -n 8 ./my_mpi_program
- Use the GUI (X11 forwarding or remote desktop) to insert breakpoints, inspect variables, etc.
These tools are often the most practical way to debug non-trivial MPI applications on clusters.
Tools for Threading and Race Conditions
Race conditions and synchronization bugs in OpenMP (and other threading models) are notoriously tricky. Dedicated tools can help.
OpenMP Debugging with Traditional Debuggers
Using gdb/lldb + environment variables:
- OpenMP runtimes often provide debug environment variables (e.g., OMP_DISPLAY_ENV, vendor-specific tracing).
- In a debugger, you can:
- List threads: info threads
- Switch threads: thread <id>
- Inspect shared vs private variables at the point where corruption is seen.
However, stepping through multi-threaded execution can be confusing; race detectors are often more practical.
Thread/Memory Race Detectors
Common tools:
- Valgrind Helgrind / DRD:
- Target POSIX threads but can sometimes help with OpenMP codes using pthreads underneath.
- Very high overhead, best for small reproducing test cases.
- ThreadSanitizer (TSan):
- Compiler-based; detect data races and some locking issues.
- Enabled with -fsanitize=thread (with Clang or compatible GCC setups).
- Works on real workloads better than Helgrind but still adds significant overhead.
For example, with Clang:
clang -g -O1 -fopenmp -fsanitize=thread -o myprog myprog.c
Run with a reduced problem size and inspect the sanitizer’s error output, which usually:
- Identifies conflicting reads/writes.
- Shows file/line numbers and thread IDs.
- Provides stack traces for both sides of the race.
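A minimal sketch of the kind of race TSan reports is shown below (a hypothetical race.c; note that clean reports for OpenMP codes may require a ThreadSanitizer-aware OpenMP runtime, so treat this as illustrative):

/* race.c - hypothetical OpenMP data race: all threads update 'sum' without
   a reduction or atomic, so updates from different threads collide. */
#include <stdio.h>

int main(void) {
    double sum = 0.0;

    #pragma omp parallel for          /* missing reduction(+:sum) */
    for (int i = 0; i < 1000000; i++)
        sum += 1.0;                   /* unsynchronized read-modify-write */

    printf("sum = %f\n", sum);        /* typically less than 1000000 */
    return 0;
}

TSan flags the conflicting writes to sum from different threads; adding reduction(+:sum) to the pragma removes the race.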
Vendor suites (e.g., Intel Inspector in Intel oneAPI) also provide GUI-based thread/race analysis.
GPU Debugging Tools
For GPU-accelerated applications (CUDA, OpenACC, OpenMP offload), specialized tools are required.
CUDA Debugging
If you use CUDA:
- cuda-gdb:
- Command-line debugger similar to gdb but GPU-aware.
- Lets you set breakpoints in kernels, inspect device memory, step through GPU code.
- Usage is similar to gdb:
cuda-gdb ./my_cuda_program
- NVIDIA Nsight tools:
- Nsight Compute / Nsight Systems focus more on performance, but Nsight also offers debugging functionality via IDE integrations.
Important considerations:
- GPU debugging adds significant overhead; use small test cases.
- You may need to compile with lower optimization levels and debug symbols:
nvcc -G -g -O0 mykernel.cu -o myprog
OpenACC / OpenMP Offload Debugging
For directive-based GPU programming (OpenACC, OpenMP target offload):
- Vendor/compiler suites often provide:
- Environment variables to increase verbosity (e.g., data mapping logs).
- Limited support for stepping through device code.
- Some GPU debuggers (e.g. cuda-gdb) can work with OpenACC-generated kernels depending on the toolchain.
- Check your compiler documentation for:
- How to generate debuggable device code.
- Any special flags to keep symbols and line mappings.
For correctness of GPU memory usage (out-of-bounds or illegal accesses), sanitizer-like CUDA tools are critical: cuda-memcheck, or its newer replacement compute-sanitizer in recent CUDA toolkits:
cuda-memcheck ./my_cuda_program
They report illegal memory accesses, race conditions, and API misuse in GPU code.
Record-and-Replay and Deterministic Debugging
Some bugs appear only rarely or at scale. Record-and-replay tools try to:
- Record an execution (or a part of it) including non-deterministic events.
- Replay it deterministically inside a debugger.
Examples (availability and practicality vary):
- General-purpose recorders like rr (more common on workstations; limited support for large-scale MPI).
- MPI-focused recorders integrated into some commercial tools or research software.
In practice, on many clusters you will instead:
- Reduce the problem size and node count.
- Add instrumentation or logging to increase reproducibility.
- Use deterministic execution modes if your algorithm/library offers them.
Logging, Tracing, and Lightweight Instrumentation
For bugs that depend on scale or particular timing:
- Traditional debuggers may perturb timing too much.
- Full-scale record-and-replay may not be available.
Instead, you can use:
- Logging:
- Insert printf, logging libraries, or rank-aware log messages (a minimal rank-aware sketch follows at the end of this section).
- Use rank IDs and thread IDs in logs to correlate events.
- Be careful: excessive logging can change timings and generate huge files.
- Event tracing tools (some also used for performance):
- E.g., tools that record MPI calls, thread events, or GPU kernels.
- Even if primarily performance-oriented, their timelines (who called what and when) can expose:
- Unexpected ordering of messages.
- Ranks stuck in specific calls.
- Misuse of collectives.
Examples (names may vary by site): Score-P, Vampir, Paraver, Intel VTune/Trace Analyzer. While mainly performance tools, they double as powerful “what actually happened?” debuggers.
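As a sketch of the rank-aware logging mentioned above (a hypothetical helper; in practice you might use a logging library instead of fprintf):

/* Hypothetical rank- and time-stamped logging helper: each message carries
   the MPI rank and a wall-clock time so events from different ranks can be
   ordered after the run (MPI_Wtime origins may differ between processes). */
#include <mpi.h>
#include <stdio.h>

static void log_msg(int rank, const char *msg) {
    fprintf(stderr, "[rank %d | t=%.6f] %s\n", rank, MPI_Wtime(), msg);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    log_msg(rank, "before exchange");
    /* ... communication that is suspected to hang or misbehave ... */
    log_msg(rank, "after exchange");

    MPI_Finalize();
    return 0;
}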
Integrating Debugging Tools with the Batch Scheduler
Common patterns when using tools with schedulers like SLURM:
- Interactive jobs for debugging:
- Use salloc or srun --pty to get a shell on compute nodes.
- Run debuggers directly from that shell.
- Batch scripts for tools that need multiple runs:
- Some tools (e.g. MPI checkers or recorders) can be run in batch, producing reports for offline inspection.
- Example:
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 64
#SBATCH -t 00:30:00
module load must
mustrun -np 64 ./my_mpi_program > must_output.log 2>&1
- X11/GUI access:
- For GUI debuggers (DDT, TotalView), you need X11 forwarding or a remote desktop solution.
- Example: ssh -X or ssh -Y to the login node, then launch the GUI after obtaining an allocation.
Always check site documentation for:
- Which tools are installed (module avail).
- The recommended way to run them on that specific system.
Choosing the Right Tool
A practical mapping from symptom to tool:
- Segmentation fault / crash in one rank or thread:
- Start with gdb or a GUI debugger (DDT/TotalView).
- Program hangs (likely MPI deadlock):
- Use MPI-aware debuggers (DDT/TotalView) or MPI correctness tools (MUST, MPI checkers).
- Wrong numerical results, no crash:
- Use memory checkers (Valgrind, ASan) and inspect with gdb; possibly add logging or assertions.
- Intermittent or non-reproducible behavior:
- Try sanitizers, logging with timestamps/ranks, or record-and-replay if available.
- Data races / threading bugs:
- Use ThreadSanitizer, Helgrind/DRD, or vendor thread-analysis tools.
- GPU memory errors or kernel crashes:
- Use cuda-memcheck / compute-sanitizer and cuda-gdb or Nsight debuggers.
Practical Tips and Best Practices
- Compile with debug info:
- Use -g and avoid aggressive optimization (e.g., -O0 or -O1 for debugging builds).
- Keep test cases small:
- Smaller inputs and fewer ranks/threads make debugging far easier and faster.
- Use assertions:
- Insert checks (assert, custom error checks) to catch invalid states early; see the sketch after this list.
- Combine tools:
- For example, run with AddressSanitizer first; if that passes, use MPI checkers; then step through suspicious regions with a debugger.
- Learn your site’s toolchain:
- Most HPC centers standardize on a small set of supported tools and provide training material or examples—use them.
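As a sketch of the assertion idea (a hypothetical exchange_halo routine, not from this chapter):

/* Hypothetical sanity check: catch an invalid halo width or NULL field
   before it causes a hard-to-trace out-of-bounds access in the solver. */
#include <assert.h>
#include <stddef.h>

void exchange_halo(double *field, int n, int halo) {
    assert(field != NULL);
    assert(halo > 0 && halo < n);   /* fail fast on an invalid state */
    /* ... pack, send, receive, unpack ... */
}

Assertions can be compiled out with -DNDEBUG for production builds, so they cost nothing once the code is trusted.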
By understanding the strengths and limitations of each debugging tool and how to run them on a cluster, you can systematically approach even complex parallel bugs rather than relying on trial-and-error.