Typical bug categories in parallel codes
Parallel programs can fail in ways that do not appear in sequential code. Many bugs only show up under specific timing, loads, or hardware configurations, which makes them hard to reproduce. This chapter focuses on recognizing common bug patterns and what their symptoms look like, so you can quickly suspect “parallel bug” rather than “random mystery.”
We will group bugs into several broad families:
- Concurrency and synchronization bugs
- Data sharing and memory bugs
- Communication and ordering bugs
- Performance bugs that look like correctness bugs
- Tool- and environment-related pitfalls
Details on race conditions, deadlocks, and debugging tools are treated in their own chapters; here we map the landscape and emphasize how these problems tend to appear in practice.
Concurrency and synchronization bugs
Data races (read–write, write–write)
A data race happens when:
- Two or more threads/processes access the same memory location, and
- At least one access is a write, and
- There is no proper synchronization enforcing an order between them.
Typical symptoms:
- Results change between runs with the same input.
- Results depend on the number of threads or ranks.
- NaNs or huge outliers appear sporadically in arrays.
- Code passes tests when compiled without optimization but fails with -O2/-O3.
Common patterns:
- Parallel reductions implemented manually without atomics or proper reduction clauses (e.g., sum += x[i] from multiple threads).
- Shared loop counters or indices modified by multiple threads.
- Global or static variables used as temporary scratch space without protection.
- Lazy initialization of shared data (first thread that gets there “initializes” a structure) without synchronization.
Key hint: if adding a small sleep call or a few printfs makes the bug disappear or appear, suspect a race.
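As a concrete illustration, the following minimal OpenMP sketch shows the manual-reduction race from the first bullet above; the array x, its size, and its values are placeholders. The first loop races on the shared accumulator, the second uses OpenMP's reduction clause.

```c
/* Minimal sketch of a reduction race (assumes OpenMP; compile with -fopenmp).
   The array and its contents are placeholders. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++) x[i] = 1.0;

    double sum = 0.0;

    /* BUGGY: all threads perform read-modify-write on the shared "sum"
       without synchronization -- a classic data race. The result usually
       comes out too small and changes from run to run. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        sum += x[i];
    printf("racy sum    = %.1f\n", sum);

    /* FIX: let OpenMP combine per-thread partial sums. */
    sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];
    printf("reduced sum = %.1f (expected %d)\n", sum, N);

    return 0;
}
```

Running the racy version several times with multiple threads typically prints a different, too-small sum each time, which is exactly the "results change between runs" symptom listed above.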
Incorrect or missing synchronization
Bugs also arise when synchronization is used incorrectly:
- Missing barriers: some threads proceed assuming others have already updated shared data.
- Excessive or misplaced barriers: can cause performance collapse and sometimes hangs if not all threads reach the barrier.
- Improper use of locks:
- Forgetting to lock before updating shared data.
- Forgetting to unlock (leading to hangs).
- Locking the wrong variable (protecting A while reading B).
Examples of problematic patterns:
- Updating a shared array in one loop and consuming it in another loop without a barrier between them.
- Having conditionals so that some threads skip a barrier that others execute (a barrier inside an if that is not taken by all threads in the team); see the sketch below.
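A minimal sketch of that last pattern, assuming OpenMP; the thread-ID condition is purely illustrative. OpenMP requires that a barrier be encountered by all threads of a team or by none, so this program typically hangs.

```c
/* Sketch of a barrier that only some threads reach (assumes OpenMP). */
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        if (tid % 2 == 0) {
            /* BUGGY: only even-numbered threads reach this barrier,
               so they wait forever for the odd-numbered threads. */
            #pragma omp barrier
        }

        printf("thread %d past the conditional\n", tid);
    }
    return 0;
}
```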
Deadlocks and livelocks (at a high level)
Deadlocks and livelocks are so important they have dedicated coverage in another chapter, but it helps to recognize them:
- Deadlock: all threads/ranks are blocked forever, usually waiting on each other’s locks or messages. CPU usage typically drops to near zero; the program simply “hangs.”
- Livelock: program keeps “doing something” (often spinning) but never makes progress.
Common causes:
- Two threads acquiring locks in opposite orders.
- MPI point‑to‑point calls mismatched (e.g., each rank does a blocking send and none does a recv); see the sketch after this list.
- Not all processes reaching a collective (some skip MPI_Barrier or MPI_Bcast).
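The mismatched point-to-point case can be sketched as follows, assuming exactly two ranks; the message is made large so that the blocking sends cannot complete through eager buffering on typical MPI implementations (with small messages the same code may appear to work).

```c
/* Sketch of a send/send deadlock (assumes exactly 2 MPI ranks). */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 22)   /* ~32 MB of doubles: larger than typical eager limits */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = calloc(N, sizeof(double));
    int peer = (rank == 0) ? 1 : 0;

    /* BUGGY: both ranks block in MPI_Send and neither reaches MPI_Recv. */
    MPI_Send(buf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(buf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Possible fixes: MPI_Sendrecv, having one side receive first, or
       nonblocking MPI_Isend/MPI_Irecv followed by MPI_Waitall. */

    free(buf);
    MPI_Finalize();
    return 0;
}
```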
Data sharing and memory bugs
Shared vs. private data mistakes
In shared‑memory models (e.g., OpenMP), confusion between shared and private variables is a major source of bugs:
- A variable meant to be private is shared:
- Temporary variables in loops not declared private or firstprivate.
- Stack-allocated buffers whose addresses are captured and shared across threads.
- A variable meant to be shared is private:
- Accumulators or flags declared as private, so each thread has its own copy and the program never “sees” updates.
Symptoms:
- Using more threads gives obviously wrong or nonsensical results.
- Printing variable values from each thread shows inconsistent or unexpected values.
- Final result remains equal to the initial value because the “shared” variable is actually private.
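A minimal OpenMP sketch of the first mistake; the arrays and the arithmetic are placeholders. The temporary t is declared outside the parallel region and is therefore shared by default.

```c
/* Sketch of a shared/private mix-up (assumes OpenMP). */
#include <stdio.h>

#define N 1000

int main(void) {
    double a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = i;

    double t;   /* declared outside the parallel region -> shared by default */

    /* BUGGY: all threads write the single shared "t", so b[i] may be built
       from another thread's temporary value. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        t = 2.0 * a[i];
        b[i] = t + 1.0;
    }

    /* FIX: make the temporary private (declaring it inside the loop body
       has the same effect). */
    #pragma omp parallel for private(t)
    for (int i = 0; i < N; i++) {
        t = 2.0 * a[i];
        b[i] = t + 1.0;
    }

    printf("b[10] = %.1f\n", b[10]);
    return 0;
}
```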
False sharing
False sharing is a performance bug but can also produce confusing behavior when timing affects numerical results.
It occurs when multiple threads update different variables that happen to reside on the same cache line. The line then ping-pongs between the threads’ caches, causing:
- Large slowdowns when increasing thread count.
- Performance that is extremely sensitive to small changes in data layout or padding.
Indicators:
- Code seems “embarrassingly parallel” but runs slower with more cores.
- Adding padding (extra elements) between array sections per thread suddenly speeds up the code.
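A sketch of the two layouts, assuming OpenMP, at most 64 threads, and 64-byte cache lines; timing code is omitted, and a real benchmark would also have to keep the compiler from collapsing the increment loops.

```c
/* Sketch of false sharing and its padding fix (assumptions: OpenMP,
   <= 64 threads, 64-byte cache lines). */
#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 64
#define ITERS 100000000
#define CACHE_LINE 64

/* BUGGY layout: adjacent counters share cache lines, so every increment
   invalidates the line in other threads' caches. */
long counters[MAX_THREADS];

/* Padded layout: one counter per cache line, no false sharing. */
struct padded { long value; char pad[CACHE_LINE - sizeof(long)]; };
struct padded padded_counters[MAX_THREADS];

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            counters[tid]++;                 /* falsely shared: slow */
    }

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            padded_counters[tid].value++;    /* line privately owned: fast */
    }

    printf("%ld %ld\n", counters[0], padded_counters[0].value);
    return 0;
}
```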
Uninitialized and stale data
Parallel execution easily hides or exposes use of uninitialized or outdated data:
- Threads depend on a value that is initialized by another thread, but the initialization and the use are not properly ordered.
- Variables not initialized at all in some code paths; parallelism makes those paths more likely.
Typical symptoms:
- NaNs, infinities, or random numbers in outputs.
- Different compilers or optimization levels change whether the bug appears.
- Only some threads or MPI ranks exhibit the issue.
Memory allocation and ownership errors
In large HPC codes, memory bugs can interact with parallelism:
- Multiple frees of the same pointer by different threads/ranks.
- One rank frees memory that another rank still uses (due to misunderstanding of data ownership).
- Thread A allocates a buffer, thread B modifies it and frees it, then thread A still uses it.
Symptoms:
- Crashes (“segmentation fault”, “invalid free”) appearing randomly, often under high concurrency or large problem sizes.
- Bugs that disappear when run with a single thread or a single MPI rank.
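A minimal sketch of the third pattern using OpenMP sections; the “consumer” and “owner” roles are illustrative. Depending on timing and thread count, the consumer may read freed memory, which tools such as AddressSanitizer report reliably.

```c
/* Sketch of a use-after-free caused by unclear buffer ownership
   (assumes OpenMP runs the two sections concurrently). */
#include <stdio.h>
#include <stdlib.h>

#define N 1000

int main(void) {
    double *buf = calloc(N, sizeof(double));

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            /* "Consumer": assumes the buffer stays valid while it reads. */
            double s = 0.0;
            for (int i = 0; i < N; i++)
                s += buf[i];          /* may read memory already freed below */
            printf("sum = %.1f\n", s);
        }

        #pragma omp section
        {
            /* "Owner": frees the buffer without any agreement on ownership. */
            free(buf);
        }
    }
    return 0;
}
```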
Communication and ordering bugs
Mismatched sends and receives (MPI or similar)
Distributed-memory codes often break because message patterns are incorrect:
- Wrong source or destination rank.
- Wrong tag or communicator.
- Mismatch between the message size sent and the buffer size expected by the receiver.
- Blocking send/recv calls that are not paired in the same order on both sides.
Symptoms:
- Deadlock: program hangs in MPI calls.
- Data corruption: receiver reads garbage or misaligned data.
- Non-reproducible failures when network timing changes.
Examples:
- Rank 0 does MPI_Send to rank 1, but rank 1 is waiting for data from rank 2.
- Rank A sends 100 doubles while rank B attempts to receive 80 doubles into its buffer (sketched below).
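The second example looks roughly like this, assuming exactly two ranks. Most MPI implementations abort the receive with a “message truncated” error; related mismatches (wrong datatypes, raw byte counts) can instead corrupt data silently.

```c
/* Sketch of a send/receive count mismatch (assumes exactly 2 MPI ranks). */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[100] = {0};

    if (rank == 0) {
        MPI_Send(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* BUGGY: the posted receive is smaller than the incoming message. */
        MPI_Recv(data, 80, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```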
Missing or incorrect collectives usage
MPI collectives (e.g., MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Gather) require that:
- All ranks in the communicator call the collective, and
- They call it in the same order.
Common mistakes:
- Some ranks return early (e.g., due to an error or condition) and skip a collective that others execute.
- Different ranks use different root parameters or communicators.
- Ranks call collectives in different orders (e.g., one does Bcast then Reduce, another does Reduce then Bcast).
Symptoms:
- Hangs on a collective call.
- Wrong results from reductions or broadcasts.
- Crashes when one rank expects a different data type or count in a collective.
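A sketch of the “early return” mistake, assuming MPI; the error condition is simulated by having only rank 0 fail. The other ranks block forever in MPI_Bcast because the root never calls it.

```c
/* Sketch of a collective skipped by one rank (assumes MPI, >= 2 ranks). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, value = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local_error = (rank == 0);   /* pretend only rank 0 hits a problem */

    if (local_error) {
        fprintf(stderr, "rank %d: input error, giving up\n", rank);
        MPI_Finalize();              /* BUGGY: the other ranks hang in Bcast */
        return 1;
    }

    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* never completes */

    /* FIX: first agree on the error, e.g. MPI_Allreduce the flag with MPI_MAX,
       then have all ranks take the same exit path. */

    MPI_Finalize();
    return 0;
}
```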
Order-of-operations and visibility issues
Even with correct communication, ordering can be wrong:
- Thread A updates data, then signals completion; thread B wakes up but sees old data due to missing memory fences or misuse of atomics.
- MPI ranks reuse send/receive buffers before the operation has actually completed (forgetting to wait on nonblocking operations).
Symptoms:
- Intermittent errors or corrupted boundary/halo regions.
- Correctness when using blocking calls, but incorrect results after switching to nonblocking operations.
- Correctness when adding artificial delays (e.g., sleep), which mask the issue.
Typical mistake with nonblocking MPI:
- Posting MPI_Isend/MPI_Irecv and then immediately reusing or freeing the buffer without calling MPI_Wait or MPI_Test to ensure completion.
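A sketch of that mistake, assuming two ranks; the buffer size and values are placeholders. The sender overwrites the buffer while MPI_Isend may still be reading it, so the receiver can observe a mix of old and new values.

```c
/* Sketch of premature buffer reuse with nonblocking MPI (assumes 2 ranks). */
#include <mpi.h>

#define N 100000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double buf[N];
    MPI_Request req;

    if (rank == 0) {
        for (int i = 0; i < N; i++) buf[i] = 1.0;
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);

        /* BUGGY: the buffer is rewritten while the send may still be in
           flight. MPI_Wait must come before any reuse of the buffer. */
        for (int i = 0; i < N; i++) buf[i] = 2.0;

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* Only after MPI_Wait is buf[] guaranteed to hold the received data. */
    }

    MPI_Finalize();
    return 0;
}
```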
Numerical and algorithmic bugs exposed by parallelism
Some bugs are not “parallel” in essence, but parallelization brings them to the surface.
Non-associativity and reduction order sensitivity
Floating-point addition and multiplication are not strictly associative. Parallel reductions:
- Change the order of operations.
- Can lead to small differences in results between runs or between process counts.
- In poorly conditioned problems, small differences can grow into visible discrepancies.
Symptoms:
- Results differ slightly (e.g., in the last few digits) between single-core and multi-core runs.
- Regression tests that insist on bitwise identical results fail, even though the answers are scientifically equivalent.
This is usually not a bug in the algorithm but a property of floating-point math; however, lack of awareness can lead to misdiagnosis.
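A tiny sequential example makes the effect visible; the constants are chosen so that regrouping the additions, which is effectively what a parallel reduction does, changes the rounded result.

```c
/* Floating-point addition is not associative: regrouping changes the result. */
#include <stdio.h>

int main(void) {
    double big = 1.0e16, small = 1.0;

    double left_to_right = (big + small) + small;   /* each "small" is rounded away */
    double regrouped     = big + (small + small);   /* small values combined first */

    printf("left to right: %.1f\n", left_to_right); /* 10000000000000000.0 */
    printf("regrouped:     %.1f\n", regrouped);     /* 10000000000000002.0 */
    return 0;
}
```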
Non-determinism and heisenbugs
Heisenbugs are bugs that seem to disappear or change behavior when you try to observe them (e.g., by adding logging).
Parallel sources of non-determinism:
- Unspecified order of operations among threads or ranks.
- Asynchronous communication and I/O.
- Use of non-deterministic algorithms (e.g., unordered task queues, random seeds not carefully controlled).
Symptoms:
- Tests occasionally fail with no consistent pattern.
- Running with tracing/logging enabled makes the bug go away.
- Different machines or compiler versions show different behavior.
Performance-related “bugs”
Some problems are primarily performance issues but are often reported as “the code is broken” because the behavior is so unexpected.
Scalability collapse
Patterns:
- Adding more processes/threads slows down the program.
- Parallel version slower than serial version, even for large problems.
Typical root causes:
- Excessive synchronization (too many barriers, critical sections longer than expected).
- Too fine-grained parallelism (cost of creating/scheduling threads exceeds useful work).
- Load imbalance causing some ranks/threads to idle.
These are not logical errors, but they make systems appear “broken” to new users.
Contention and oversubscription
Oversubscription occurs when more runnable threads or MPI ranks exist than hardware contexts (cores or hardware threads).
Symptoms:
- Severe slowdown; CPU time per process appears much larger than wall time.
- Highly variable runtimes across job submissions.
- Node load average much larger than the core count.
Causes:
- Running MPI jobs with -np larger than the total number of hardware cores without binding.
- Combining MPI and OpenMP (or GPU kernels) without planning the number of threads/processes per node.
- System threads (I/O, OS, daemons) competing with application threads.
Environment and build-related pitfalls
Inconsistent builds across ranks or nodes
In cluster environments, different nodes may:
- Load different module versions.
- Use different libraries or compiler flags.
If not carefully controlled, you might have:
- MPI ranks built with different ABI or math libraries.
- Some ranks using debug builds and others using optimized builds.
Symptoms:
- Crashes only on certain nodes.
- Different numerical results depending on which node runs which ranks.
- Mysterious MPI errors about datatype or communicator inconsistencies.
Optimization-sensitive bugs
Aggressive compiler optimizations can:
- Reorder instructions based on assumptions that are invalid in the presence of data races or undefined behavior.
- Inline or eliminate code in ways that hide timing assumptions.
Typical signs:
- The program works at -O0 but fails at -O2/-O3.
- Adding volatile or making small code changes appears to “fix” the problem temporarily.
- Thread sanitizers or memory checkers report data races or undefined behavior.
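A classic sketch of this behavior, using pthreads and a plain int flag; the commented-out lines show the racy version. The underlying bug is the data race, so the reliable fix is a C11 atomic (or a lock), not volatile.

```c
/* Sketch of an optimization-sensitive spin-wait (assumes C11 atomics and
   pthreads; compile with -pthread). The racy version may loop forever at -O2
   because the compiler may hoist the load of a non-atomic flag out of the loop. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int ready = 0;               /* BUGGY flag: plain shared int, racy */
atomic_int ready_atomic = 0; /* FIX: atomic flag with defined visibility */
int payload = 0;

void *producer(void *arg) {
    (void)arg;
    payload = 42;
    /* ready = 1;  <- racy signal; may also be reordered before the payload store */
    atomic_store_explicit(&ready_atomic, 1, memory_order_release);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    /* while (!ready) { }  <- with the plain flag, may spin forever at -O2/-O3 */
    while (!atomic_load_explicit(&ready_atomic, memory_order_acquire)) { }

    printf("payload = %d\n", payload);  /* acquire/release makes 42 visible */
    pthread_join(t, NULL);
    return 0;
}
```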
Recognizing patterns and first-response strategies
When you encounter a problem in a parallel code, it helps to classify it quickly:
- Hangs / program never finishes → suspect deadlock, mismatched collectives, a missing unlock, or a barrier that not all threads reach.
- Random crashes or “sometimes it works” → suspect data races, use-after-free, mismatched message sizes, or uninitialized data.
- Results differ between runs or between core counts → suspect data races, ordering/visibility bugs, or mistaken shared/private variables; distinguish these from harmless floating-point reduction differences.
- Massive slowdown when adding more cores → suspect excessive synchronization, false sharing, load imbalance, or oversubscription.
- Bugs that disappear when printing debug information → suspect heisenbugs: typically races or timing-sensitive communication.
Later chapters focus on tools and methods to pinpoint these bugs; the main goal here is to recognize that many “strange” behaviors in HPC codes map onto a relatively small set of parallel bug patterns.