
16.2 Deadlocks

Understanding Deadlocks in Parallel Programs

Deadlocks are a class of bugs that arise from how parallel entities (threads or processes) coordinate and wait for each other. In parallel and distributed programming, they are both common and notoriously hard to debug if you don’t know what to look for. This chapter focuses specifically on deadlocks in shared-memory (e.g., OpenMP, threads) and distributed-memory (e.g., MPI) HPC programs.

What Is a Deadlock (Precisely)?

In this context, a deadlock occurs when a set of threads or processes is blocked forever, each member waiting for an event that only another blocked member of the same set can cause.

Typical blocking events include:

  - acquiring a lock or mutex that another thread holds,
  - reaching a barrier that some participants never reach,
  - waiting on a condition variable that is never signaled,
  - waiting for a message (e.g., with MPI_Recv) that is never sent.

No one can proceed, so the program appears to “hang” indefinitely.

Conceptually, four conditions must all hold for deadlock to be possible (Coffman conditions):

  1. Mutual exclusion: Some resources (locks, messages, files) can be held by only one thread/process at a time.
  2. Hold and wait: A thread/process holds at least one resource and is waiting to acquire others.
  3. No preemption: Resources can’t be taken away; they must be released voluntarily.
  4. Circular wait: There exists a cycle in the wait-for relationships among threads/processes.

In HPC, deadlock usually shows up as a job that makes no further progress and never finishes until the scheduler kills it; depending on the wait mechanism, CPU usage may sit near 0% (threads sleeping on locks) or near 100% (MPI ranks busy-polling inside blocking calls).

Common Deadlock Patterns in Shared-Memory Code

Shared-memory deadlocks usually involve locks, barriers, or other synchronization constructs.

Lock Ordering Deadlocks

The classic pattern: Thread 1 acquires lock A and then tries to acquire lock B, while Thread 2 acquires lock B and then tries to acquire lock A. Each holds the lock the other one needs.

Both wait forever.

A minimal example in pseudo-C:

// Thread 1
lock(A);
// ... work ...
lock(B);
// critical section using A and B
unlock(B);
unlock(A);
// Thread 2
lock(B);
// ... work ...
lock(A);  // waits forever if Thread 1 holds A and wants B
// critical section
unlock(A);
unlock(B);

Key characteristic: The same set of locks can be acquired in different orders in different threads.
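
For reference, here is a self-contained pthreads version of this pattern (a sketch; the sleeps only make the bad interleaving likely, and the program must be built with -pthread). It will normally hang:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg) {
    pthread_mutex_lock(&A);
    usleep(100000);              // give thread 2 time to grab B
    pthread_mutex_lock(&B);      // blocks forever once thread 2 holds B
    pthread_mutex_unlock(&B);
    pthread_mutex_unlock(&A);
    return NULL;
}

static void *thread2(void *arg) {
    pthread_mutex_lock(&B);
    usleep(100000);              // give thread 1 time to grab A
    pthread_mutex_lock(&A);      // blocks forever once thread 1 holds A
    pthread_mutex_unlock(&A);
    pthread_mutex_unlock(&B);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);      // the program hangs here
    pthread_join(t2, NULL);
    printf("never reached\n");
    return 0;
}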

Nested Locks and Hidden Dependencies

Even if you don’t explicitly acquire locks in different orders, nested calls can do it:

void f() {
    lock(A);
    // ...
    lock(B);
    // ...
    unlock(B);
    unlock(A);
}
void g() {
    lock(B);
    // ...
    lock(A);  // may deadlock if someone else is in f()
    // ...
    unlock(A);
    unlock(B);
}

If different threads call f() and g(), you can get circular wait without realizing there is a lock order violation.

Barrier Deadlocks

In many thread models (including OpenMP), barriers require that all participating threads reach the barrier. If one or more threads never get there, the others wait forever.

Typical causes:

  - a thread returns, breaks, or jumps out of the parallel region before reaching the barrier,
  - the barrier sits inside a conditional that only some threads execute,
  - threads execute different numbers of iterations of a loop that contains a barrier.

Example in OpenMP-like pseudocode:

#pragma omp parallel
{
    int tid = omp_get_thread_num();
    if (tid == 0) {
        // This thread exits the parallel region early
        return;
    }
    // All other threads reach the barrier
    #pragma omp barrier   // deadlock: thread 0 never gets here
}
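
If one thread genuinely needs to do special work, a safer structure (sketched below) keeps that work inside a construct the other threads simply skip, so that every thread still reaches the barrier:

#pragma omp parallel
{
    #pragma omp master
    {
        // Only thread 0 executes this block; the other threads skip it
        // without waiting (master has no implied barrier).
    }
    // Every thread, including thread 0, reaches this barrier.
    #pragma omp barrier

    // ... work that may rely on what the master thread just did ...
}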

Deadlocks via Condition Variables

Condition-variable deadlocks often come from incorrect signaling logic. For example:

  - a thread waits for a condition that no other thread ever makes true or signals,
  - the signal is issued before the waiter has started waiting and is lost, so the waiter sleeps forever (a “lost wakeup”),
  - a single pthread_cond_signal wakes the “wrong” waiter when several threads wait on the same condition variable for different predicates, leaving the right one asleep.

Or the roles are reversed: the thread that is supposed to signal is itself blocked, waiting for a lock or result that the waiter holds.

Even though these may look more like “logic bugs”, in behavior they are deadlocks: mutual waiting with no progress.
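
A minimal pthreads sketch of the lost-wakeup variant (the ready flag and the thread functions are purely illustrative; build with -pthread):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready = 0;

static void *waiter(void *arg) {
    pthread_mutex_lock(&m);
    // BUG: this should be   while (!ready) pthread_cond_wait(&c, &m);
    // Because the flag is never checked, a signal sent before this point
    // is lost and the thread sleeps forever.
    pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *signaler(void *arg) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&c);     // wakes nobody if the waiter is not yet waiting
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t s, w;
    pthread_create(&s, NULL, signaler, NULL);
    usleep(100000);              // let the signal happen before the wait starts
    pthread_create(&w, NULL, waiter, NULL);
    pthread_join(w, NULL);       // hangs here: the wakeup was lost
    pthread_join(s, NULL);
    return 0;
}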

Common Deadlock Patterns in MPI and Distributed Programs

Deadlocks in MPI are especially common because communication is often explicit and blocking.

Mutual Blocking with Blocking Sends/Receives

The classic MPI deadlock:

// Rank 0
MPI_Send(sendbuf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
MPI_Recv(recvbuf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

// Rank 1
MPI_Send(sendbuf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Send is allowed to block until a matching receive has been posted (and implementations typically do block once the message exceeds their eager-send threshold). When that happens, both ranks are stuck in MPI_Send, each waiting for the other to reach its MPI_Recv.
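
One standard way to break this symmetry is to let MPI pair the send and the receive itself with MPI_Sendrecv; a sketch with illustrative buffer names (alternating the send/receive order between the two ranks, or using nonblocking calls as discussed later, also works):

#include <mpi.h>

// Exchange data between two partner ranks without either blocking the other:
// MPI_Sendrecv posts the send and the receive together, so no circular wait
// can form.
void exchange(double *sendbuf, double *recvbuf, int count, int rank)
{
    int partner = 1 - rank;   // rank 0 talks to rank 1 and vice versa
    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, partner, 0,
                 recvbuf, count, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}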

Mismatched Communication Patterns

Deadlocks can arise when only some ranks post matching receives or sends. For example:

if (rank == 0) {
    MPI_Send(data, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    MPI_Recv(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else if (rank == 1) {
    MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(data, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
} else {
    // Other ranks do something else, but never post the sends or receives
    // that rank 0 or rank 1 might expect.
}

This specific exchange between ranks 0 and 1 is fine, but deadlock arises if, for example:

  - rank 0 or rank 1 also expects a message from one of the other ranks, which never posts the matching send,
  - another rank does expect to exchange data with rank 0 or 1, but the code above never posts the matching call,
  - a rank takes an unexpected code path (error handling, a different loop count) and skips its part of the protocol.

Typical manifestations:

  - one or more ranks stuck in MPI_Recv waiting for a message that is never sent,
  - one or more ranks stuck in MPI_Send waiting for a receive that is never posted,
  - the remaining ranks eventually blocking at the next barrier or collective.

Deadlocks in Collectives

Collective operations must be called by all processes in the communicator using the same operation and in the same order.

Deadlocks occur when:

  - some ranks skip a collective that the others call (for example, inside a conditional or after an early exit),
  - ranks call collectives in different orders,
  - ranks call the collective on different communicators.

Example pattern (simplified):

for (int step = 0; step < N; step++) {
    if (rank == 0 && step > 0) {
        // mistakenly skip the barrier in some iterations
    } else {
        MPI_Barrier(MPI_COMM_WORLD);
    }
}

In every iteration after the first, rank 0 skips the barrier, so the ranks that do call MPI_Barrier in those iterations block forever.

Ordering and Tag Mistakes

Deadlocks can also stem from:

  - tags that do not match between sender and receiver,
  - receives posted in an order that cannot be matched by the order of the sends,
  - messages matched by the “wrong” receive when wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG) are used, leaving another receive unmatched.

Example:

// Rank 0
MPI_Recv(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // expects tag 0
MPI_Send(data, count, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);                   // then sends tag 1

// Rank 1
MPI_Recv(buf, count, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // expects tag 1
MPI_Send(data, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);                   // then sends tag 0

Both ranks are stuck in MPI_Recv: each waits for a message that its partner will only send after its own receive completes, so neither send is ever reached.

Recognizing Symptoms of Deadlocks

On an HPC system, deadlocks usually show up as:

  - a job that stops producing output but keeps running until it hits its walltime limit,
  - processes that sit idle (or spin at 100% CPU inside the MPI library) without making progress,
  - a job killed by the scheduler with a timeout instead of finishing normally.

In logs or terminal output, you might see:

  - output that stops at the same point on every run, often just before a synchronization or communication call,
  - some ranks or threads reporting progress further than others,
  - no error message at all; the program simply hangs.

Techniques to Debug Deadlocks

Deadlock debugging is mainly about figuring out where things are waiting and why the other side never acts.

1. Insert Strategic Logging / Tracing

Add print statements (or lightweight logging) before and after synchronization or communication calls:

printf("Rank %d: before MPI_Recv from %d, tag %d\n", rank, src, tag);
MPI_Recv(...);
printf("Rank %d: after  MPI_Recv from %d, tag %d\n", rank, src, tag);

Guidelines:

  - include the rank (and thread id, where relevant), the partner, and the tag in every message,
  - flush the output (e.g., fflush(stdout)) or write to stderr, so messages are not lost in buffers when the job is killed,
  - keep the logging lightweight so it does not perturb timing too much.

The last printed messages from different ranks often tell you:

  - which call each rank or thread is blocked in, and
  - which partner, tag, or event it is waiting for that never arrives.

2. Use Debuggers to Inspect Where Threads/Processes Are Blocked

For MPI:

  - attach gdb to a few of the hung ranks (gdb -p <pid> on the compute node) and inspect their backtraces,
  - or use a parallel debugger such as Linaro DDT or TotalView to see at a glance where every rank is blocked.

For shared memory:

  - attach gdb to the hung process and run "thread apply all bt" to get a backtrace of every thread,
  - look for threads sitting in lock, barrier, or condition-variable wait functions, then work out which thread should have released or signaled them and why it never did.

3. Timeouts and Watchdogs (for Diagnosis)

You can temporarily add timeouts around potentially blocking operations:

For example (simplified pseudo-code):

double t0 = now();
while (!condition_met) {
    if (now() - t0 > 30.0) {
        fprintf(stderr, "Timeout waiting for condition X in thread %d\n", tid);
        abort();
    }
    // small sleep or work
}

This doesn’t solve the deadlock but turns a silent hang into a controlled abort with more diagnostic information.
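
The same idea carries over to MPI: instead of blocking in MPI_Wait, poll with MPI_Test and give up loudly after a while. A sketch (the helper name and the time limit are illustrative):

#include <mpi.h>
#include <stdio.h>

// Wait for a request, but abort with a message if it stays pending too long.
void wait_with_timeout(MPI_Request *req, int rank, double limit_seconds)
{
    double t0 = MPI_Wtime();
    int done = 0;
    while (!done) {
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        if (!done && MPI_Wtime() - t0 > limit_seconds) {
            fprintf(stderr, "Rank %d: request still pending after %.0f s\n",
                    rank, limit_seconds);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }
}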

4. Use Tools Specialized for Concurrency Bugs

Some tools can automatically detect deadlocks or dangerous patterns, for example:

  - Helgrind (a Valgrind tool) and ThreadSanitizer report lock-order inversions and related misuse in threaded code,
  - MPI correctness checkers such as MUST can detect many MPI deadlocks and mismatched calls,
  - Intel Inspector can flag deadlocks and lock-hierarchy violations in shared-memory programs.

These are particularly useful for larger codes where manual reasoning about all interactions is difficult.

Strategies to Prevent Deadlocks

Prevention is usually easier than post-hoc debugging. The key idea is to design protocols and synchronization patterns that cannot form circular waits.

For Shared-Memory Programs

  1. Use a Global Lock Ordering

Establish a strict order for acquiring multiple locks and enforce it everywhere: for example, rank the locks by an index, an address, or a documented hierarchy level, and only ever acquire them in ascending order (a code sketch follows this list).

This breaks the circular wait condition.

  2. Minimize Lock Nesting: hold one lock at a time where possible, keep critical sections short, and avoid calling unknown code while holding a lock.
  3. Avoid Conditional Barriers: make sure every barrier is reached by all threads on every code path.
  4. Prefer Higher-Level Constructs: parallel loops, reductions, and tasking often remove the need for hand-written locking altogether.
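
A minimal sketch of item 1, assuming the locks can be ranked, here by their index in an array (an address or an explicit level number works just as well):

#include <pthread.h>

static pthread_mutex_t locks[2] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

// Acquire two locks in a globally agreed order, regardless of which
// order the caller names them in. This makes circular waits impossible.
static void lock_pair(int i, int j) {
    int lo = (i < j) ? i : j;
    int hi = (i < j) ? j : i;
    pthread_mutex_lock(&locks[lo]);
    pthread_mutex_lock(&locks[hi]);
}

static void unlock_pair(int i, int j) {
    int lo = (i < j) ? i : j;
    int hi = (i < j) ? j : i;
    pthread_mutex_unlock(&locks[hi]);   // release in reverse order
    pthread_mutex_unlock(&locks[lo]);
}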

For MPI and Distributed Programs

  1. Use a Consistent Communication Protocol

Define, document, and maintain strict rules for who sends to whom, in what order, with which tags, and on which communicator.

Avoid ad-hoc send/recv calls scattered without a clear pattern.

  2. Prefer Nonblocking Communication, Used Carefully

Nonblocking operations (MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Test) can help you avoid deadlocks by letting every rank post its sends and receives up front, before anyone blocks waiting for completion (see the sketch after this list).

However, you must still:

  - complete every request with MPI_Wait, MPI_Waitall, or a successful MPI_Test,
  - leave send and receive buffers untouched until the corresponding request has completed,
  - make sure the posted operations actually match across ranks; nonblocking calls do not repair a wrong protocol.

  3. Enforce Correct Collective Usage

Make sure every rank in the communicator calls the same collectives, in the same order, on every code path.

  4. Simplify Patterns

Where possible:

  - prefer collectives over hand-written point-to-point exchange patterns,
  - use MPI_Sendrecv for simple pairwise exchanges,
  - keep communication phases symmetric and easy to reason about.

  5. Use Distinct Communicators and Tags

Give logically separate message streams (for example, different libraries or solver phases) their own communicators or tag ranges so that unrelated messages cannot be matched by mistake.
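
As referenced under item 2, here is a sketch of a pairwise exchange in which both operations are posted before either rank waits (buffer names are illustrative):

#include <mpi.h>

void exchange_nonblocking(double *sendbuf, double *recvbuf, int count,
                          int partner)
{
    MPI_Request reqs[2];

    // Post both operations first; neither call blocks.
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    // Only now wait. Both requests can complete because both sides have
    // already posted their matching operations.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // sendbuf and recvbuf must stay untouched until MPI_Waitall returns.
}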

Practical Checklists

When You Suspect a Deadlock

  - Confirm the job is truly stuck: no new output and no progress over several minutes.
  - Find out where each rank or thread is blocked (last log lines, stack traces, debugger).
  - Identify what each blocked party is waiting for and which partner was supposed to provide it.
  - Look for the circular wait: A waits for B, B waits for C, and so on back to A.

Questions to Ask About Your Code

  - Can two threads ever acquire the same locks in different orders?
  - Is every barrier reached by all threads on every code path?
  - Does every MPI_Send have a matching MPI_Recv with the same source/destination, tag, and communicator?
  - Do all ranks call the same collectives in the same order?
  - Is every nonblocking request eventually completed?

By designing synchronization and communication with these principles in mind and by using systematic debugging techniques, you can dramatically reduce the frequency and impact of deadlocks in parallel HPC applications.
