Goals of this Chapter
In parallel programming, bugs are often subtle, data-dependent, and hard to reproduce. This chapter focuses on:
- Recognizing the kinds of problems that appear only (or mainly) in parallel programs.
- Learning systematic strategies to locate and fix these problems.
- Learning how to design tests that give you confidence your code is correct and stays correct as it scales.
Details about specific bug types and tools appear in the later subsections of this chapter; here we focus on the overall mindset and workflow for debugging and testing parallel programs.
Unique Challenges of Debugging Parallel Programs
Debugging parallel code is fundamentally harder than debugging serial code because of:
Non-Determinism
Many parallel bugs depend on:
- The exact timing of thread or process execution.
- The order of message passing.
- The layout of data in memory and caches.
You can run the same program twice, with the same input, and get:
- Different results.
- A failure in one run and no failure in another.
- Different stack traces or hangs.
This breaks the usual assumption that “if I can’t reproduce it, it’s not really a bug” and forces you to use strategies that can expose rare behaviors more reliably.
Scale-Dependent Behavior
A parallel program might:
- Work on a laptop with 4 cores but fail on a cluster with 512 cores.
- Work for small problem sizes but fail or slow down drastically for production-size inputs.
- Show no problems at low process counts, but exhibit deadlocks or wrong results under heavy load.
You need methods to:
- Debug small-scale runs effectively.
- Transfer what you learn to large-scale runs.
- Detect bugs that appear only at scale (e.g., certain communication patterns or collective operations).
Multiple Failure Modes
Parallel programs can fail by:
- Crashing (segmentation faults, illegal instructions).
- Hanging (deadlocks, livelocks, infinite loops).
- Producing incorrect results (data races, lost updates, precision/order issues).
- Violating performance expectations (severe slowdowns, scaling failures).
A robust debugging and testing approach must be able to tell these apart and deal with each type.
A Systematic Debugging Workflow for Parallel Programs
Debugging becomes more manageable if you follow a structured workflow instead of trying random changes. A typical approach:
1. Confirm the Bug and Its Scope
- Determine whether the bug is:
- Deterministic (always happens with given input and configuration).
- Intermittent (happens only sometimes under the same conditions).
- Check if it appears:
- In serial or single-thread/process mode.
- Only with multiple threads/processes.
- Only beyond a certain number of ranks/threads or problem size.
This tells you whether you’re dealing with a logic error that would exist in serial code, or a “true” parallel bug.
2. Minimize the Failing Case
Try to reduce complexity while still triggering the bug:
- Use the smallest input size that still fails.
- Use the smallest number of threads/processes that still fails.
- Turn off optional features (I/O, checkpoints, advanced algorithms) unless they’re suspected.
Benefits:
- Faster turnaround times.
- Less noise in logs and output.
- Easier mental model for what’s going on.
3. Localize the Problem
Use a combination of:
- Logging (with process/thread IDs and timestamps).
- Assertions that check invariants (e.g., array bounds, non-negative counts).
- Sanity checks at key points (e.g., after a collective communication, verify that all ranks agree on sizes).
Approach:
- Identify the last place where the program’s state is still correct.
- Identify the first place where state is incorrect, or behavior diverges between ranks/threads.
- Narrow down to a single function or block of code as much as possible.
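As a minimal sketch of the rank-tagged logging and post-collective sanity checks described above (the chapter's advice is language-agnostic; this example assumes a Python code using mpi4py, and the broadcast problem size is purely illustrative):

```python
# Sketch only: rank-tagged logging plus a sanity check after a collective,
# assuming mpi4py. The broadcast problem size is an illustrative placeholder.
import sys
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def log(msg):
    # Prefix every message with rank and wall-clock time so interleaved
    # output from many ranks can still be sorted and attributed.
    print(f"[rank {rank} t={time.time():.6f}] {msg}", file=sys.stderr, flush=True)

def check_all_agree(value, label):
    # After a collective, every rank should hold the same value; comparing
    # the global min and max exposes any rank that diverged.
    lo = comm.allreduce(value, op=MPI.MIN)
    hi = comm.allreduce(value, op=MPI.MAX)
    if lo != hi:
        log(f"MISMATCH in {label}: local={value}, min={lo}, max={hi}")
        comm.Abort(1)  # fail fast and loudly on all ranks

# Example use: after broadcasting a problem size, confirm all ranks agree.
n = comm.bcast(1_000_000 if rank == 0 else None, root=0)
check_all_agree(n, "problem size n")
log(f"problem size confirmed: n={n}")
```

Checks like check_all_agree are cheap enough to leave enabled in debug builds; they mark the last point at which the program's state was verifiably consistent.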
4. Form a Hypothesis, Then Test It
Once you have a suspect region:
- Form a concrete hypothesis (e.g., “rank 0 may skip a barrier under condition X; others don’t”).
- Add temporary diagnostics (extra prints, assertions, tracing).
- Run repeated tests to see if the added diagnostics align with the hypothesis.
Avoid making lots of unrelated changes at once; introduce one change at a time, and confirm its effect.
5. Use Tool Support Strategically
Parallel debugging tools can identify:
- Memory errors.
- Race conditions.
- Deadlocks.
- MPI/OpenMP misuse.
Integrate them into your workflow:
- Use simple tools (assertions, logging) first.
- Use heavy tools (dynamic analyzers, full-featured debuggers) when:
- The bug is hard to reproduce.
- The failure disappears when you add logging (a “Heisenbug”).
- You suspect complex timing or synchronization issues.
6. Verify the Fix Under Realistic Conditions
After a fix:
- Re-run tests:
- With multiple seeds or inputs.
- With a range of thread/process counts.
- At the scales where the bug originally occurred.
- Add a regression test that would have failed before the fix.
This reduces the chance that the bug will reappear in the future.
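One way to pin the fix in place is a regression test that re-runs the previously failing configuration at several rank counts and seeds. The sketch below uses pytest; the solver name, its options, the input file, and the expected checksum line are all placeholders:

```python
# Sketch only: a pytest regression test for a previously failing case.
# "./solver", its flags, the input file, and the expected checksum are
# placeholders for whatever actually triggered the original bug.
import subprocess
import pytest

@pytest.mark.parametrize("nranks", [2, 4, 8])   # include the counts where the bug appeared
@pytest.mark.parametrize("seed", [1, 7, 42])
def test_previously_failing_case(nranks, seed):
    result = subprocess.run(
        ["mpirun", "-n", str(nranks), "./solver",
         "--seed", str(seed), "--input", "cases/regression_case.in"],
        capture_output=True, text=True, timeout=300,
    )
    assert result.returncode == 0, result.stderr
    # Before the fix, this configuration produced a wrong checksum (or hung).
    assert "checksum=0x1a2b3c4d" in result.stdout
```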
Strategies for Dealing with Non-Deterministic Bugs
Because many parallel bugs are timing-dependent, traditional “run once and inspect” debugging is often insufficient. Useful strategies:
Multiple Repeats and Statistical Debugging
- Run the same test many times (e.g., in a loop or via a script).
- Track:
- Failure frequency.
- Conditions under which it fails (process count, input, random seed).
- Compare logs of successful vs failed runs to see what differs.
This helps expose rare failures and understand their context.
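A small harness along these lines makes the failure rate measurable and preserves the output of failing runs for later comparison (the command line and run count are placeholders):

```python
# Sketch only: repeat a flaky parallel test, count failures, and keep the
# logs of failing runs. The command line below is a placeholder.
import pathlib
import subprocess

CMD = ["mpirun", "-n", "4", "./flaky_test"]
RUNS = 100
fail_dir = pathlib.Path("failing_logs")
fail_dir.mkdir(exist_ok=True)

failures = 0
for i in range(RUNS):
    try:
        result = subprocess.run(CMD, capture_output=True, text=True, timeout=120)
        failed = result.returncode != 0
        output = result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        failed = True                    # a hang or deadlock counts as a failure
        output = "run timed out (possible hang or deadlock)\n"
    if failed:
        failures += 1
        # Keep the full output so a failing run can be diffed against a good one.
        (fail_dir / f"run_{i:03d}.log").write_text(output)

print(f"{failures}/{RUNS} runs failed ({100.0 * failures / RUNS:.1f}%)")
```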
Deterministic Modes and Controlled Execution
Sometimes you can force more predictable behavior:
- Disable dynamic scheduling features when possible.
- Use fixed seeds for pseudo-random number generators.
- Force a fixed process mapping (mpirun/srun mapping options).
- Introduce small artificial delays in suspected threads or ranks to change relative ordering and reveal races.
These techniques can make a rare bug more frequent or easier to reproduce.
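Two of these techniques sketched with mpi4py (the base seed, the environment variable name, and the delay length are arbitrary choices for illustration):

```python
# Sketch only: fixed per-rank seeds plus an optional injected delay on one
# suspected rank, assuming mpi4py. Names and values are illustrative.
import os
import random
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Deterministic pseudo-random numbers: derive each rank's seed from a fixed
# base so repeated runs see exactly the same sequences on every rank.
BASE_SEED = 12345
rng = random.Random(BASE_SEED + rank)

# Optional artificial delay on one rank, controlled by an environment
# variable, to perturb relative ordering and make a suspected race fire.
delay_rank = int(os.environ.get("DEBUG_DELAY_RANK", "-1"))
if rank == delay_rank:
    time.sleep(0.05)      # 50 ms is usually enough to change interleavings

comm.Barrier()
print(f"rank {rank}: first random draw = {rng.random():.6f}", flush=True)
```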
Record and Replay (When Available)
Some environments or debuggers offer record/replay:
- Record execution (or parts of it) for one failing run.
- Replay deterministically under a debugger, stepping through the recorded events.
This is especially useful when failures are infrequent and complex to reproduce live.
Testing Strategies Specific to Parallel Programs
Testing parallel code is not just running your serial test suite with more processes. It needs additional attention to concurrency, synchronization, and scaling.
Types of Tests to Include
Alongside typical unit and integration tests, consider:
- Concurrency tests:
- Run the same test with different thread/process counts (1, 2, 4, 8, …).
- Verify results are identical (or within expected numerical tolerances); see the sketch after this list.
- Stress tests:
- Push the code with higher concurrency or load for a sustained period.
- Aim to reveal rare race conditions or resource exhaustion.
- Scaling tests:
- Keep the problem size fixed and increase the core count (strong scaling).
- Grow the problem size in proportion to the core count (weak scaling).
- Check both correctness and performance trends.
- Fault and recovery tests (if your code supports it):
- Simulate node or process failures.
- Test checkpoint/restart behavior.
- Verify resilience and correct recovery.
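For the concurrency tests above, a driver might run the same small case at several rank counts and compare the results against a serial reference. In the sketch below, the solver executable, its "result=" output line, and the tolerance are placeholders; the tolerance must reflect the reproducibility your algorithm actually guarantees:

```python
# Sketch only: run one case at several rank counts and check agreement.
# "./solver", its output format, and the tolerance are placeholders.
import re
import subprocess

def run_case(nranks):
    result = subprocess.run(
        ["mpirun", "-n", str(nranks), "./solver", "--input", "cases/small.in"],
        capture_output=True, text=True, check=True, timeout=300,
    )
    # Expect a line such as "result= 1.234567890e+00" in the output.
    match = re.search(r"result=\s*([-+0-9.eE]+)", result.stdout)
    assert match is not None, f"no result line in {nranks}-rank output"
    return float(match.group(1))

reference = run_case(1)                       # serial run as the reference
for nranks in (2, 4, 8):
    value = run_case(nranks)
    rel_err = abs(value - reference) / max(abs(reference), 1e-30)
    assert rel_err < 1e-12, f"{nranks} ranks: {value} vs reference {reference}"
    print(f"{nranks} ranks OK (relative difference {rel_err:.2e})")
```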
Designing Tests for Parallel Correctness
When writing tests for parallel sections:
- Check global invariants:
- Sum of work counts across ranks equals total tasks.
- No overlapping ownership of data ranges.
- Conservation laws (mass, energy, etc.) remain satisfied.
- Use collective checks:
- Aggregate results from all processes and compare to a known solution.
- Have different ranks compute partial checksums, then compare the global checksum against a reference (see the sketch after this list).
- Validate consistency across configurations:
- The result should not depend on:
- The number of MPI ranks (up to expected floating-point rounding differences).
- The scheduling of threads (again, up to known tolerances).
- Design tests to explicitly compare outputs between different configurations.
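The global-invariant and collective checks above might look like the following mpi4py sketch, where the task counts, checksum, and reference value stand in for quantities your own code would provide:

```python
# Sketch only: collective correctness checks, assuming mpi4py. The task
# counts, checksum, and reference are placeholders for real quantities.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def check_global_invariants(local_task_count, local_checksum,
                            total_tasks, reference_checksum, tol=1e-12):
    # Global invariant: work distributed across ranks must add up to the
    # total number of tasks, with nothing lost or duplicated.
    global_tasks = comm.allreduce(local_task_count, op=MPI.SUM)
    assert global_tasks == total_tasks, (
        f"rank {rank}: task count mismatch ({global_tasks} != {total_tasks})")

    # Collective check: aggregate partial checksums and compare the global
    # value against a known reference on every rank.
    global_checksum = comm.allreduce(local_checksum, op=MPI.SUM)
    assert abs(global_checksum - reference_checksum) <= tol * abs(reference_checksum), (
        f"rank {rank}: checksum {global_checksum} differs from reference")

# Example: 1000 tasks split across ranks, each contributing a partial sum.
size = comm.Get_size()
my_tasks = 1000 // size + (1 if rank < 1000 % size else 0)
check_global_invariants(my_tasks, float(my_tasks), 1000, 1000.0)
```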
Automation and Continuous Testing
Parallel testing can be slow and resource-intensive, so you need to balance coverage against cost:
- Maintain several test “tiers”:
- Fast, small tests that run frequently (e.g., on every commit).
- More expensive, larger tests that run nightly or before releases.
- Use test runners that:
- Support launching MPI jobs or multi-threaded runs.
- Record resource usage and timing, not just pass/fail.
- Always include at least:
- A single-process/thread mode (for logic regressions).
- One or more truly parallel configurations (for concurrency regressions).
Practical Techniques and Habits
The effectiveness of debugging and testing in HPC often comes down to how you write and instrument your code from the start.
Build Configurations for Debugging vs Production
Maintain at least two build modes:
- Debug builds:
- No or minimal optimization.
- Full debug symbols.
- Extra runtime checks enabled (bounds checks, assertions).
- Linked with debugging or checking variants of libraries when available.
- Production builds:
- High optimization.
- Carefully selected debug aids that don’t impact performance too much (e.g., occasional checks in critical places).
Switching between these should be easy and well documented.
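In compiled HPC codes these modes typically correspond to different compiler flag sets and link options; the Python sketch below illustrates the same idea at runtime, gating an expensive check behind a switch (the DEBUG_CHECKS variable name is an arbitrary choice, not a standard):

```python
# Sketch only: a runtime switch for expensive consistency checks, so one
# source tree supports both a debug and a production mode. The variable
# name DEBUG_CHECKS is illustrative.
import os

DEBUG_CHECKS = os.environ.get("DEBUG_CHECKS", "0") == "1"

def expensive_consistency_check(grid):
    # Full scan of the data structure; far too slow for production runs.
    assert all(v >= 0.0 for v in grid), "negative value found in grid"

def timestep(grid):
    # ... numerical update would go here ...
    if DEBUG_CHECKS:
        expensive_consistency_check(grid)
    return grid

# Debug run:      DEBUG_CHECKS=1 python app.py
# Production run: python app.py   (the expensive check is skipped)
timestep([0.0, 1.5, 2.0])
```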
Use Assertions and Invariants Liberally
Assertions are powerful for catching logic errors early:
- Check preconditions (input shapes, ranges).
- Check postconditions (output properties, conservation laws).
- In parallel sections, include:
- Consistency checks across threads or ranks.
- Assumptions about ordering or ownership of data.
For parallel programs, assertions often need to:
- Include rank/thread identifiers in error messages.
- Optionally be collective (e.g., a failing condition triggers a clean abort across all ranks).
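A collective assertion might look like the following mpi4py sketch (parallel_assert is a hypothetical helper name, not a library function):

```python
# Sketch only: a collective assertion for mpi4py codes. If any rank's
# condition fails, every rank learns about it and the whole job aborts
# cleanly instead of hanging in a later collective.
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def parallel_assert(condition, message):
    # Count how many ranks violate the condition; every rank gets the count.
    n_failed = comm.allreduce(0 if condition else 1, op=MPI.SUM)
    if n_failed > 0:
        if not condition:
            print(f"ASSERTION FAILED on rank {rank}: {message}",
                  file=sys.stderr, flush=True)
        comm.Barrier()        # give failing ranks a chance to flush output
        comm.Abort(1)         # then bring the whole job down

# Example: every rank must own a non-empty slice of the data.
local_n = 10                  # placeholder for the locally owned element count
parallel_assert(local_n > 0, "rank owns an empty data range")
```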
Structured Logging for Parallel Programs
When many processes or threads print to the same console or file, the result is often unreadable. To make logging useful:
- Prefix all messages with:
- Rank ID and/or thread ID.
- Time or step index.
- Use a hierarchy of log levels (e.g., error, warning, info, debug, trace).
- Allow selective logging:
- Per-rank logging (e.g., only rank 0 or a small subset).
- Log to separate files per rank or per node.
For reproducing intermittent bugs:
- Capture minimal structured logs with enough context to reconstruct what happened.
- Avoid excessive logging that masks timing-related issues or drastically changes program behavior.
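A sketch of such logging using Python's standard logging module and mpi4py (the message format, file naming, and choice of verbose ranks are just one reasonable configuration):

```python
# Sketch only: rank-aware logging with per-rank files, assuming mpi4py.
# Format, file naming, and the verbose-rank set are illustrative choices.
import logging
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

def setup_logging(level=logging.INFO, per_rank_files=True, verbose_ranks=(0,)):
    fmt = logging.Formatter(
        f"%(asctime)s [rank {rank}] %(levelname)s %(name)s: %(message)s")
    logger = logging.getLogger("app")
    logger.setLevel(level)

    # One file per rank keeps interleaved output from many ranks readable.
    handler = (logging.FileHandler(f"app.rank{rank:04d}.log")
               if per_rank_files else logging.StreamHandler())
    handler.setFormatter(fmt)
    logger.addHandler(handler)

    # Selective logging: only a small subset of ranks logs below WARNING,
    # limiting volume and timing perturbation on large runs.
    if rank not in verbose_ranks:
        logger.setLevel(logging.WARNING)
    return logger

log = setup_logging()
log.info("step 12: exchanged halos, local residual = 3.2e-05")
```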
Reproducibility for Debugging
Reproducibility is not just a scientific concern; it is essential for debugging:
- Record:
- Code version (e.g., git commit hash).
- Build configuration and compiler flags.
- Number of processes/threads and placement settings.
- Input parameters and random seeds.
- Make it easy to:
- Re-run a previous configuration on the same or similar hardware.
- Share a failing setup with colleagues or support teams.
This transforms “it crashed once last week” into a concrete, reproducible test case.
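One lightweight way to capture this information is a small "run manifest" written next to the output. In the sketch below the field names and file layout are arbitrary; git is assumed to be available, and SLURM_NTASKS/OMP_NUM_THREADS are recorded only if the scheduler or runtime sets them:

```python
# Sketch only: write a JSON run manifest so a failing run can be reproduced
# later. Field names are illustrative; environment variables are optional.
import json
import os
import subprocess
import sys
import time

def write_run_manifest(path, input_params, random_seed):
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True).stdout.strip(),
        "command_line": sys.argv,
        "num_processes": os.environ.get("SLURM_NTASKS"),
        "num_threads": os.environ.get("OMP_NUM_THREADS"),
        "input_params": input_params,
        "random_seed": random_seed,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)

write_run_manifest("run_manifest.json",
                   {"grid": [512, 512, 512], "dt": 1e-3}, random_seed=12345)
```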
Working with HPC Systems While Debugging
Debugging on shared HPC systems introduces additional constraints:
Using Interactive vs Batch Modes
- Use interactive sessions (e.g., interactive jobs) for:
- Attaching debuggers.
- Rapid test cycles.
- Inspecting memory and variables during execution.
- Use batch jobs for:
- Collecting logs and traces from larger runs.
- Repeating long-running tests.
- Stress testing over many iterations.
Design your scripts and configurations so they can run under both modes with minimal changes.
Balancing Resource Use and Debugging Needs
Debugging often requires:
- Many repeated runs.
- Running at reduced but representative scales (not the full production scale).
Be considerate and efficient:
- Use the smallest allocation that still exhibits the bug.
- Prefer smaller, dedicated test queues when available.
- Clean up large log and trace files.
This aligns with good HPC citizenship and avoids wasting valuable cluster time.
Integrating Debugging and Testing into Your Development Cycle
Parallel codes evolve quickly and involve many moving parts. To keep them reliable:
- Incorporate parallel-specific tests into your regular regression suite.
- Run at least a subset of scaling and concurrency tests before:
- Major refactors.
- Changing communication or synchronization patterns.
- Track:
- Which bugs appeared only at scale.
- Which configurations are especially fragile.
Over time, this produces:
- A library of tricky tests that serve as early warnings.
- A better intuition for where parallel bugs tend to arise in your codebase.
By treating debugging and testing as ongoing practices rather than one-off activities, you greatly reduce the risk and cost of parallel bugs, especially as your applications grow in complexity and scale.