Overview
Parallel programs fail in ways that rarely appear in simple serial codes. The bugs are often intermittent, depend on timing, and may disappear when you try to print debug information. This chapter focuses on the characteristic failure modes of shared memory and distributed memory parallel programs, what they look like in practice, and how to recognize their signatures before choosing debugging tools or strategies.
Some of these categories, such as deadlocks and race conditions, have their own chapters. Here they are placed in the broader landscape of common bugs, so you can learn how they manifest and how they differ from other mistakes that may look similar.
Heisenbugs and nondeterminism
Parallel bugs are often nondeterministic. A run may succeed, the next run with the same input may crash or give a different answer, and a third run may hang. This behavior is often called a heisenbug, because observing or instrumenting the program can change its behavior.
Nondeterminism usually comes from interactions that depend on timing, such as two threads updating the same memory, or two processes exchanging messages with subtle ordering assumptions. In a serial program, an execution path is often repeatable. In a parallel program, the relative ordering of independent operations can change every run, especially on large systems.
A typical symptom is a test that passes most of the time, but occasionally fails or produces slightly different floating point results. For debugging, it is useful to record three things whenever you see such behavior: the job configuration (number of nodes, ranks, and threads), the input, and any random seeds. When you cannot reproduce a failure at will, focus on defects that depend on ordering, such as races, misuse of synchronization, or message ordering assumptions.
Many parallel bugs are nondeterministic. If a failure appears only rarely, suspect ordering dependent errors such as race conditions, missing synchronization, or incorrect message ordering.
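When an intermittent failure strikes, the habit of recording the run's configuration pays off later. The sketch below shows one way to capture the three things mentioned above in a small JSON file. It is a minimal illustration, not a standard format: the function name, field names, and the use of `os.cpu_count()` as a stand-in for the real thread count are all assumptions.

```python
import json
import os
import sys

def record_repro_info(path, seed, job_config=None):
    """Log what is needed to retry a failing run: the job configuration,
    how the program was invoked, and the random seed in use.
    (Illustrative sketch; the field names are not a standard format.)"""
    info = {
        "argv": sys.argv,              # how the program was invoked
        "seed": seed,                  # random seed in use
        "cpu_count": os.cpu_count(),   # coarse stand-in for thread count
    }
    if job_config:
        info.update(job_config)        # e.g. nodes, ranks, threads
    with open(path, "w") as f:
        json.dump(info, f, indent=2)

record_repro_info("repro.json", seed=12345,
                  job_config={"nodes": 1, "ranks": 4, "threads": 8})
```

A log like this turns "it failed once last Tuesday" into a configuration you can actually resubmit.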
Data races and unsynchronized access
A data race occurs when multiple threads or processes access the same memory location without proper coordination, at least one access is a write, and there is no synchronization that defines a predictable order. You will see the details of race conditions in a separate chapter. Here the focus is on their typical symptoms and how they differ from simple logic errors.
Races can cause silent data corruption, intermittent crashes, or subtle performance changes. On shared memory systems, two threads might both update a global counter without an atomic operation, leading to a total count that is sometimes correct and sometimes too small. On distributed memory systems, races can appear when multiple ranks update overlapping regions of a shared array through remote memory operations or file I/O without coordination.
An important symptom is that a bug disappears if you add extra prints, insert artificial sleeps, or run with fewer threads. Another clue is that reordering loops or changing compiler optimization flags suddenly changes behavior, even though the logical computation should be unchanged. In such cases, suspect missing synchronization rather than flawed mathematics.
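The lost-update pattern behind the global counter example can be made deterministic for illustration. The Python sketch below forces the bad interleaving with a barrier: both threads read the counter before either writes it back, so one increment vanishes every time. In real code the same read-modify-write race happens only on unlucky timings, which is exactly why it is intermittent.

```python
import threading

counter = 0
barrier = threading.Barrier(2)

def racy_increment():
    # The classic read-modify-write sequence. The barrier forces both
    # reads to happen before either write, making the normally
    # intermittent lost update deterministic for demonstration.
    global counter
    local = counter          # read the shared value
    barrier.wait()           # both threads have now read the old value
    counter = local + 1      # write back: the two writes collide

threads = [threading.Thread(target=racy_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 1, not 2: one increment was lost
```

The fix is to make the read-modify-write atomic, for example by holding a lock around the three steps or by using an atomic increment where the platform provides one.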
Incorrect or inconsistent synchronization
Synchronization primitives such as mutexes, barriers, and condition variables in shared memory, and matching sends and receives in distributed memory, are essential for correct parallel behavior. Incorrect use of synchronization can produce hangs, livelock, inconsistent data, or very poor performance.
A common error is to guard only part of the shared data structures with a lock, leaving some fields unprotected. Another frequent mistake is to access a resource sometimes under a lock and sometimes without, which leads to subtle inconsistencies. Mismatched barrier participation is also common: some threads or ranks call a barrier, others skip it due to conditional logic, and the program hangs.
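Mismatched barrier participation can be demonstrated directly. In the sketch below, one thread's conditional logic skips the barrier, so the other thread would wait forever; a timeout on the wait turns the silent hang into a diagnosable error. The two-thread setup and the 0.5 second timeout are illustrative choices.

```python
import threading

barrier = threading.Barrier(2)  # expects two participants
outcome = []

def worker(tid):
    # Thread 1 takes a branch that never reaches the barrier, so thread 0
    # can only time out. The timeout converts a hang into an error.
    if tid == 0:
        try:
            barrier.wait(timeout=0.5)
            outcome.append("both arrived")
        except threading.BrokenBarrierError:
            outcome.append("hang: a participant skipped the barrier")
    # tid == 1 does unrelated work and never calls barrier.wait()

threads = [threading.Thread(target=worker, args=(t,)) for t in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(outcome[0])
```

Timeouts on collective waits are a useful defensive habit during development: a barrier that breaks loudly is much easier to debug than one that hangs silently at scale.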
On distributed memory systems, collective operations and point to point communication must follow consistent patterns. If some ranks call a collective but others take a different branch and do not, a global hang or cryptic error message usually follows. If ranks call collectives in different orders, or combine collectives with nonblocking operations incorrectly, you can see both deadlocks and incorrect results.
The characteristic sign of incorrect synchronization is a program that stops making progress without consuming CPU in a useful way, often at the same stage of execution, but sometimes only at large scale. Differentiating this from performance bottlenecks is important. A stall that never recovers suggests faulty synchronization, while slow but steady progress suggests an inefficient algorithm or resource contention.
Deadlocks and collective mismatches in practice
Deadlocks have their own chapter, but they are so common in parallel programs that you need a practical sense for how they show up, especially in message passing code.
In shared memory, a simple deadlock occurs when thread A holds lock X and waits for lock Y, while thread B holds lock Y and waits for lock X. In more complex programs, long chains of locks or lock order that depends on runtime conditions can cause deadlocks that only appear with particular inputs.
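The standard cure for the A/B deadlock above is a global lock ordering: every thread acquires locks in the same order, regardless of the order its logic "prefers". The sketch below uses the locks' `id()` values as an arbitrary but consistent ordering key; any fixed total order works.

```python
import threading

lock_x = threading.Lock()
lock_y = threading.Lock()

def update_both(first, second):
    # Acquire in a fixed global order (here, by id()) so that no two
    # threads can each hold one lock while waiting for the other.
    a, b = (first, second) if id(first) <= id(second) else (second, first)
    with a:
        with b:
            pass  # critical section touching both resources

# Thread A "prefers" X then Y, thread B prefers Y then X; the ordering
# discipline makes both acquisition patterns safe.
t_a = threading.Thread(target=update_both, args=(lock_x, lock_y))
t_b = threading.Thread(target=update_both, args=(lock_y, lock_x))
t_a.start()
t_b.start()
t_a.join(timeout=2)
t_b.join(timeout=2)
print(t_a.is_alive() or t_b.is_alive())  # False: neither thread deadlocked
```

Without the reordering line, the two threads can each grab their first lock and wait forever for the second, which is exactly the hold-and-wait cycle described above.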
In MPI or other message passing systems, deadlocks often arise from incorrect ordering of blocking sends and receives. For example, all ranks perform a blocking send to their neighbor before posting a receive. If the system has limited buffering, no rank can progress because each is waiting for the other to receive. Deadlocks also occur when some ranks call a collective operation such as MPI_Bcast while others call MPI_Reduce at the same program point, or when some ranks skip a collective entirely due to conditional branches.
One informal rule that helps you recognize potential deadlocks is that every communication pattern must be both globally consistent and symmetric where required. If rank 0 believes it is talking to rank 1, rank 1 must also be talking to rank 0 with the same tags, types, and communicators. Any asymmetry in expectations is a red flag.
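The blocking-exchange deadlock described above can be imitated without MPI. The sketch below is a toy: a zero-buffer "rendezvous" channel built from Python queues stands in for an unbuffered blocking send, and two threads play the roles of two ranks. The safe pattern shown, one side sending first while the other receives first, is the standard way to break the symmetry; the `RendezvousChannel` class and its timeouts are illustrative inventions, not MPI API.

```python
import queue
import threading

class RendezvousChannel:
    """Toy zero-buffer channel: send() blocks until the matching recv()
    has taken the message, imitating an unbuffered blocking send."""
    def __init__(self):
        self._msg = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, msg, timeout=None):
        self._msg.put(msg)
        self._ack.get(timeout=timeout)  # block until the receiver took it

    def recv(self, timeout=None):
        msg = self._msg.get(timeout=timeout)
        self._ack.put(None)
        return msg

ch01 = RendezvousChannel()  # "rank 0" -> "rank 1"
ch10 = RendezvousChannel()  # "rank 1" -> "rank 0"
results = {}

# If both ranks sent first, each would block waiting for an ack that can
# only arrive after the other calls recv: a deadlock. Opposite ordering
# on the two sides makes the exchange safe, and the timeouts turn any
# remaining mistake into a fast failure instead of a silent hang.
def rank0():
    ch01.send("from 0", timeout=2)
    results[0] = ch10.recv(timeout=2)

def rank1():
    results[1] = ch01.recv(timeout=2)
    ch10.send("from 1", timeout=2)

threads = [threading.Thread(target=rank0), threading.Thread(target=rank1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.items()))  # [(0, 'from 1'), (1, 'from 0')]
```

In real MPI code the equivalent fixes are to stagger the ordering by rank parity, to use a combined send-receive operation, or to use nonblocking operations with explicit completion.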
Incorrect assumptions about memory models
Parallel hardware and compilers apply many optimizations that reorder instruction execution, as long as the result appears correct from the perspective of a single thread. In a parallel context, these reorderings can violate naive assumptions about when another thread will see an update to shared memory.
On some architectures and with aggressive compiler optimization, a write performed by one thread may not be visible to another without appropriate memory fences or synchronization primitives. If you assume that a simple write by one thread and a subsequent read by another thread always see the new value without explicit synchronization, you may encounter rare and inexplicable behavior.
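The disciplined alternative to a bare flag is to publish data through a synchronization primitive whose semantics guarantee visibility. The sketch below shows the pattern in Python; note that CPython's global interpreter lock happens to hide hardware reordering, so this example illustrates the correct structure rather than the bug itself. In C, C++, or Fortran the same structure would use an atomic flag with release/acquire ordering or a lock.

```python
import threading

data = None
ready = threading.Event()  # a real synchronization object, not a bare flag

def producer():
    global data
    data = 42
    ready.set()    # release: writes made before set() are visible to a
                   # thread that returns from wait()

def consumer(results):
    ready.wait()   # acquire: pairs with set() above
    results.append(data)

results = []
t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer, args=(results,))
t_cons.start()
t_prod.start()
t_prod.join()
t_cons.join()
print(results)  # [42]: the consumer is guaranteed to see the new value
```

The key point is structural: the write to `data` is ordered before `set()`, and the read is ordered after `wait()`, so no assumption about hardware or compiler reordering is needed.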
Such bugs are particularly hard to reproduce, and they often disappear if you disable optimization, add debugging prints, or restrict execution to a single core. They often look like "impossible" states, for example a flag that should never be false after a certain point is occasionally observed as false.
Recognition of these bugs rests on understanding that memory visibility between threads is not guaranteed without explicit synchronization. When you see intermittent inconsistencies that seem to violate the logical order of your code, especially on multicore or accelerator hardware, suspect incorrect assumptions about the memory model.
Indexing, partitioning, and off by one errors in parallel loops
Parallelization commonly involves splitting data among threads or processes. Mistakes in domain decomposition and indexing are a frequent source of incorrect results, crashes, or out of bounds accesses.
Typical errors include assigning the same indices to more than one worker, which causes double counting or conflicting updates, or leaving gaps where some indices are never processed. Off by one errors are common when translating a serial loop like for i in [0, N) into multiple ranges. A miscalculation such as using start = rank * (N / P) and end = (rank + 1) * (N / P) without handling the remainder can drop elements or cause some to be processed twice.
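The remainder bug is concrete enough to verify by hand. The sketch below contrasts the naive formula with a standard block partitioning that spreads the remainder over the first N mod P ranks; the function name and signature are illustrative.

```python
def block_range(rank, nprocs, n):
    """Split [0, n) into nprocs contiguous blocks, spreading the
    remainder over the first n % nprocs ranks so that no index is
    dropped or duplicated."""
    base, rem = divmod(n, nprocs)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return start, end

# With n = 10 and nprocs = 4, integer division gives base = 2, and the
# naive formula covers only [0, 8): indices 8 and 9 are silently dropped.
n, nprocs = 10, 4
naive = [(r * (n // nprocs), (r + 1) * (n // nprocs)) for r in range(nprocs)]
correct = [block_range(r, nprocs, n) for r in range(nprocs)]
print(naive)    # [(0, 2), (2, 4), (4, 6), (6, 8)]   -- 8 and 9 dropped
print(correct)  # [(0, 3), (3, 6), (6, 8), (8, 10)]  -- covers all of [0, 10)
```

A quick sanity check worth automating: the union of all ranks' ranges must cover [0, N) exactly once, for every (N, P) combination your tests exercise.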
On distributed memory systems, you must also map global indices to local buffers. A small mismatch between the assumed layout on the sender and the actual layout on the receiver can lead to accessing beyond array bounds or overwriting unrelated data. These bugs can be very sensitive to problem size, since some sizes may align so that the mistake only affects unused padding, while others trigger a crash.
Unlike pure timing bugs, indexing errors are often deterministic for a given configuration. They may, however, only appear at certain scales or with certain problem sizes. A sign that such a bug is present is a program that works perfectly for small inputs but fails as soon as you increase the grid size, the number of ranks, or both.
Floating point quirks and numerical surprises
Parallelization alters the order in which floating point operations are performed. Because floating point addition and multiplication are not strictly associative, changing the order of operations can change rounding behavior. This can lead to small differences between parallel and serial results.
Sometimes these small differences are expected and harmless. However, they can also push an algorithm into a different branch of conditional logic, affect convergence tests, or amplify through unstable computations into a large discrepancy. For example, a parallel reduction that sums many terms in a different order on each run can produce slightly different results between runs, which can confuse simple regression tests that expect bitwise identical answers.
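The non-associativity at the heart of this is easy to demonstrate. The sketch below shows that two groupings of the same three terms round differently, which is exactly what happens when a parallel reduction changes the summation tree.

```python
import math

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one reduction order
right = a + (b + c)   # another reduction order
print(left == right)  # False: the grouping changes the rounding

# The discrepancy is tiny, so regression tests should compare with a
# tolerance rather than expect bitwise identical answers.
print(abs(left - right) < 1e-12)  # True

# An exactly rounded sum is independent of grouping, which makes it a
# useful reference value in tests.
exact = math.fsum([a, b, c])
```

If a test suite insists on bitwise identical results across thread counts, it will flag this behavior as a failure even though the code is correct; tolerance-based comparisons avoid that trap.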
Another pattern occurs when shared accumulators are updated without proper atomic operations. Here the problem is not simply a different rounding, but lost updates. A sum that should increase on every iteration may silently miss some contributions, yielding visibly incorrect results.
It is important to distinguish between acceptable numerical variation and genuine bugs. If you see small differences in the last few bits of a result that vary with the number of threads or ranks, and your algorithm is known to be sensitive to summation order, this may be an expected consequence of parallel floating point arithmetic. Large or erratic changes usually indicate a deeper bug, such as uninitialized values, races on floating point data, or divergent control flow.
Uninitialized and partially initialized data
Uninitialized variables and buffers are a classic serial bug, but in parallel programs they can interact with concurrency in more complicated ways. A variable that is incidentally initialized by one thread in most executions may remain uninitialized in rare timing combinations. Different ranks may assume that a shared array is initialized by some other rank, which is not always true.
Partially initialized data structures are another risk. For example, only a subset of threads might call an initialization routine due to a conditional check, but all threads later assume the data is fully ready. In a message passing program, one rank might allocate a buffer of the wrong size or forget to set all fields before sending it to another rank, which then interprets the garbage as valid data.
Uninitialized data bugs can look like random corruption that changes from run to run, or like occasional use of extreme values such as NaN or Inf. Parallelism can also affect how frequently such bugs manifest, since the exact pattern of memory reuse and cache behavior can change with thread count.
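One way to surface partial initialization early is to poison fresh buffers with a sentinel such as NaN, so that any slot no worker ever wrote is caught at first inspection instead of being silently reused. The sketch below simulates a bug in which only half a buffer is initialized; the buffer size and the list representation are illustrative.

```python
import math

# Poison a newly allocated buffer with NaN so that uninitialized slots
# are detectable rather than containing plausible-looking garbage.
nslots = 8
buf = [float("nan")] * nslots

# Simulate the bug: only the first half of the buffer gets initialized.
for i in range(nslots // 2):
    buf[i] = float(i)

missing = [i for i, v in enumerate(buf) if math.isnan(v)]
print(missing)  # [4, 5, 6, 7]: exactly the slots nobody initialized
```

The same idea applies in compiled languages, where debug builds can fill allocations with a recognizable bit pattern; the goal is always to make "never written" look obviously different from "written with a small value".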
Resource leaks and improper finalization
Parallel programs often create many threads, processes, communicators, streams, and file descriptors. Failing to release these resources correctly can lead to gradual failure or performance degradation, especially in long running applications or workflows that run many jobs.
Common mistakes include creating new threads repeatedly without joining them, never freeing MPI communicators or groups, and repeatedly opening files without closing them. These errors may not show up in small test runs, but on large systems they can exhaust limits on file handles, memory, or system objects.
Improper finalization is another related issue. For instance, some ranks may call a parallel library's finalize function while others continue to use it. In MPI this can lead to undefined behavior if one rank exits or calls MPI_Finalize while others are still communicating. In shared memory libraries, destroying synchronization primitives while they are still in use can also lead to crashes or silent corruption.
A common symptom is an application that runs correctly once but fails after repeated use in a larger workflow, or a program that gradually slows down and then aborts with errors related to resource limits.
Input, output, and file I/O hazards in parallel
Parallel file access is necessary for many HPC applications, but introduces its own class of bugs. Processes or threads may overwrite one another's output, truncate files unexpectedly, or create inconsistent files that are hard to diagnose.
A frequent mistake is to have every process write to the same file without any coordination. Without proper atomic operations or a parallel I/O library, the resulting file may contain interleaved lines, partial records, or corrupted binary structures. Another common error is to assume that file operations are instantly visible to all processes, while in reality caching and buffering can delay visibility.
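A simple pattern that sidesteps interleaved writes entirely is one output file per rank, merged afterwards in a deterministic order. The sketch below simulates four ranks in a loop; the file naming scheme and the merge step are illustrative conventions, and a real application at scale would more likely use a parallel I/O library.

```python
import os
import tempfile

def rank_output_path(base_dir, rank):
    # One file per rank: no two writers ever touch the same file.
    return os.path.join(base_dir, f"out.rank{rank:04d}.txt")

base = tempfile.mkdtemp()
nranks = 4
for rank in range(nranks):  # in a real job, each rank writes its own file
    with open(rank_output_path(base, rank), "w") as f:
        f.write(f"results from rank {rank}\n")

# Merge deterministically, in rank order, once all writers have finished.
merged = ""
for rank in range(nranks):
    with open(rank_output_path(base, rank)) as f:
        merged += f.read()
print(merged.count("\n"))  # 4 lines, one per rank, never interleaved
```

The merge step must itself be synchronized with the writers: merging while some ranks are still writing recreates the inconsistency this pattern was meant to avoid.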
Parallel checkpointing or restart logic can also be a source of defects. If some ranks believe a checkpoint is complete while others are still writing, a restart attempt may read an inconsistent snapshot. If checkpoints are not synchronized with computational phases, some processes may restart from one step, and others from an earlier one, resulting in a logically impossible state.
These bugs often appear as inconsistent or unreadable output files, crashes in postprocessing tools, or failures during restart that do not occur when running from scratch.
Misuse of libraries and collective abstractions
Many parallel programs rely on libraries such as MPI, OpenMP, CUDA, and high level numerical frameworks. Misunderstanding the semantics of these libraries is a common source of subtle bugs.
Examples include assuming that nonblocking operations complete immediately without calling the corresponding wait or test routines, or assuming that a collective call like a reduction does not require all members of the communicator to participate. In shared memory environments, misuse of thread private and shared variables, or misunderstanding of default scoping in directives, can cause hidden data sharing bugs.
Another frequent category is confusion about ownership of buffers. Some libraries require that buffers passed to asynchronous calls remain valid until completion, but users sometimes reuse or free these buffers too early. At small scales, the program may work by chance because the buffer content is not yet overwritten, while at larger scales the bug appears as corrupted data or crashes.
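The premature-reuse hazard can be made deterministic with a small simulation. In the sketch below, a thread pool stands in for a library that reads the buffer asynchronously some time after the nonblocking call returns; an event delays the "transfer" until after the caller has reused the buffer, so the corruption happens every run. The function names and the event-based timing are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

transfer_may_start = threading.Event()

def async_send(buf):
    # Stand-in for a library that reads the buffer asynchronously,
    # some time after the nonblocking call has already returned.
    transfer_may_start.wait()
    return list(buf)

with ThreadPoolExecutor(max_workers=1) as pool:
    data = [1, 2, 3]
    fut = pool.submit(async_send, data)  # "nonblocking send" of data
    data[0] = 99                         # buffer reused before completion!
    transfer_may_start.set()
    sent = fut.result()

print(sent)  # [99, 2, 3]: the transfer saw the reuse, not the original
# Fix: pass a copy (pool.submit(async_send, list(data))), or do not
# touch the buffer until fut.result() confirms completion.
```

At small scale a real transfer often completes before the reuse, which is why such bugs pass simple tests and then surface as corrupted data under heavier load.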
The general pattern is that a library call that seemed to work in simple tests fails under heavier load or with more parallelism. Reading the precise specification of library routines and paying attention to lifetime, ownership, and synchronization guarantees is essential to avoid this class of bugs.
Scaling related bugs
Many parallel defects only appear when you increase the problem size, the number of threads, or the number of processes. A code may work reliably on a laptop or with a few cores, but fail on a large cluster.
Scaling related bugs include hidden assumptions about rank counts, such as using neighbor indices that wrap incorrectly, or using fixed size arrays that are too small for large process counts. Some collective algorithms or hand rolled communication patterns may only be correct for specific numbers of participants. Logic that relies on a particular ordering among processes or on low communication delay can also break at scale.
From a debugging perspective, these bugs are frustrating because your small scale tests pass, and the failures may appear only on the actual HPC system. Recognizing that a bug is scale dependent is the first step. If a problem appears only beyond a particular number of ranks or threads, examine any code that depends on rank IDs, thread IDs, array sizes, or topologies.
Distinguishing logic errors from concurrency errors
Not every bug in a parallel program is fundamentally parallel. Algorithmic mistakes, incorrect formulas, and logic errors are just as common. Distinguishing these from concurrency and synchronization bugs is an important skill.
Logic errors are often deterministic: for a given input and configuration, the wrong answer is the same every run. They appear regardless of whether you run with one thread or process, or many. In contrast, concurrency errors often disappear when you reduce parallelism to one or two workers, or when you add debugging code, and they may produce different failures across runs.
A useful practice is to create a serial or minimal parallel configuration of your program, for example one process and one thread, and check whether the bug persists. If it does, focus on algorithmic and indexing issues. If not, concentrate on synchronization, communication, and sharing of data.
If a bug appears only when using multiple threads or ranks and disappears with a single worker, suspect a concurrency or communication error. If it persists even with a single worker, it is likely a pure logic or algorithmic bug.
Summary
Parallel programs introduce several characteristic categories of bugs: data races, incorrect synchronization, deadlocks, indexing and partitioning errors, numerical surprises due to reordered floating point operations, uninitialized or partially initialized data, resource leaks, I/O hazards, misuse of parallel libraries, and scale dependent errors. Many of these defects are intermittent and depend on timing, rank counts, or hardware details.
Recognizing the common symptoms of each class is critical. Stable but wrong results hint at logic or indexing mistakes. Nondeterministic or disappearing failures suggest races or memory model issues. Hangs point to deadlocks, mismatched collectives, or incorrect synchronization. Scale dependent failures highlight assumptions that break on large systems.
The subsequent chapters on specific bug types and debugging tools will build on this taxonomy. Your goal is not only to learn how to fix individual defects, but to develop an intuition for which category a new problem most likely belongs to, so that you can choose an effective debugging strategy early and avoid chasing the wrong explanations.