Why Synchronization Is Needed in Shared-Memory Programs
In shared-memory parallel programs, multiple threads can read and write the same variables in memory. Without coordination, they may:
- Overwrite each other’s updates
- Read inconsistent or partially updated data
- Execute critical steps in the wrong order
Synchronization mechanisms provide controlled ways for threads to:
- Enforce mutual exclusion (only one thread at a time in some code)
- Enforce ordering (A must happen before B)
- Manage visibility of updates (changes by one thread become visible to others at the right time)
This chapter focuses on the main synchronization tools you’ll use in typical OpenMP-style shared-memory programming, without going into general OpenMP syntax that belongs elsewhere.
Kinds of Synchronization
At a high level, synchronization mechanisms fall into a few categories:
- Mutual exclusion primitives: protect critical regions of code or data
- Barriers: make all threads wait until everyone has reached a point
- Condition synchronization: wait until some condition/event holds
- Atomic operations: very small updates performed safely without full locking
Each has different performance characteristics and appropriate use-cases.
Mutual Exclusion: Locks and Critical Sections
Critical Sections
A critical section is a block of code that must be executed by at most one thread at a time. Conceptually:
// Pseudocode
critical {
    // Access or update shared variable(s)
}
Typical use-cases:
- Updating a shared counter
- Appending to a shared list
- Writing to a shared log/file (in simple examples)
Critical sections are easy to use but can become performance bottlenecks if:
- They are too large (contain lots of work)
- They are executed very frequently by many threads
The key design principle is to minimize the amount of work done inside critical sections.
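To make this concrete, here is a minimal C/OpenMP sketch of a critical construct guarding a shared counter; the variable name shared_count and the iteration count are placeholders, and the enclosing parallel-region syntax is covered elsewhere:
#include <stdio.h>

int main(void) {
    long shared_count = 0;                  // shared by all threads in the region

    #pragma omp parallel
    {
        for (int i = 0; i < 1000; i++) {
            // At most one thread at a time executes this block
            #pragma omp critical
            {
                shared_count++;             // safe update of the shared counter
            }
        }
    }

    printf("count = %ld\n", shared_count);  // number of threads * 1000
    return 0;
}
Note that this example is deliberately contention-heavy; it illustrates the construct, not a recommended way to build a parallel counter.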
Locks (Mutexes)
A lock (often called a mutex) is a lower-level primitive for mutual exclusion. The idea:
- lock_acquire(L) — wait until no one else owns L, then take ownership
- lock_release(L) — give up ownership so another thread can enter
Pseudocode example:
lock_init(L);
parallel {
    // Some thread-private work...
    lock_acquire(L);
    // Critical region: safely update shared data
    shared_sum += local_sum;
    lock_release(L);
}
lock_destroy(L);
Locks give you more flexibility than high-level critical constructs:
- You can protect different data structures with different locks
- You can pass locks around functions and data structures
But they also introduce more risk:
- Deadlocks: if locks are acquired in inconsistent order between threads
- Performance penalties: if many threads contend for the same lock
Design tips:
- Use fine-grained locks where it makes sense (one per data structure, or even per element), but not so many that overhead dominates.
- Establish a global lock acquisition order to avoid deadlocks (e.g., always acquire L1 before L2).
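In OpenMP, the pseudocode above maps onto the lock routines omp_init_lock, omp_set_lock, omp_unset_lock, and omp_destroy_lock. A minimal C sketch, where shared_sum and the per-thread contribution are placeholders:
#include <omp.h>

int main(void) {
    double shared_sum = 0.0;
    omp_lock_t L;

    omp_init_lock(&L);                   // lock_init(L)

    #pragma omp parallel
    {
        // Thread-private work (placeholder contribution)
        double local_sum = omp_get_thread_num() + 1.0;

        omp_set_lock(&L);                // lock_acquire(L)
        shared_sum += local_sum;         // critical region
        omp_unset_lock(&L);              // lock_release(L)
    }

    omp_destroy_lock(&L);                // lock_destroy(L)
    return 0;
}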
Barriers
A barrier forces all threads in a parallel region to wait until every thread has reached that barrier. After the barrier:
- All threads have completed everything before the barrier
- Shared data written before the barrier is visible to all threads (modulo the memory model of the language/runtime)
Conceptual pseudocode:
parallel {
    // Phase 1: compute partial results
    compute_partial_result();
    barrier(); // Wait for all threads
    // Phase 2: use results from all threads
    use_all_partial_results();
}
Common uses:
- Divide work into phases where each phase depends on the results of the previous one
- Ensure that a shared data structure built cooperatively is fully constructed before use
Overuse of barriers can degrade performance:
- All threads must wait for the slowest one
- Too many barriers can serialize your computation
Good practice:
- Use barriers only when there is a clear data dependency between phases
- Avoid placing them inside tight loops unless absolutely necessary
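A minimal C/OpenMP sketch of the two-phase pattern above; the partial[] array and the per-thread computation are placeholders, and the code assumes the team has at most MAX_THREADS threads:
#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 64

int main(void) {
    double partial[MAX_THREADS] = {0};   // one slot per thread
    double total = 0.0;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        // Phase 1: each thread computes its partial result
        partial[tid] = tid + 1.0;        // placeholder computation

        #pragma omp barrier              // wait until every thread has finished Phase 1

        // Phase 2: all partial results are now safely visible
        #pragma omp single
        {
            for (int i = 0; i < omp_get_num_threads(); i++)
                total += partial[i];
        }
    }

    printf("total = %f\n", total);
    return 0;
}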
Atomic Operations
Sometimes you only need to protect a very small, simple update to a shared variable, such as:
- counter = counter + 1
- sum += x
- min_val = min(min_val, x)
Using a full lock for these can be overkill; atomic operations perform such updates as an indivisible unit, often with hardware support.
Conceptual example:
parallel {
    // Each thread has some local contribution
    atomic_add(global_sum, local_sum); // implemented atomically
}
Characteristics:
- Much cheaper than a lock-based critical region for simple operations
- Avoids the overhead and complexity of explicit locks
- Best suited for short, simple read-modify-write operations on a single variable
Limitations:
- You cannot easily perform multi-variable atomic updates (e.g., updating two counters together) with a single atomic
- Not suited for protecting large or complex critical sections
Rule of thumb:
- Use atomic operations for single-variable counters, reductions, or min/max updates
- Use locks/critical regions for more complex shared data structures
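In C with OpenMP, such updates are expressed with the atomic construct. A minimal sketch of the global-sum example, where the per-thread contribution is a placeholder:
#include <omp.h>
#include <stdio.h>

int main(void) {
    double global_sum = 0.0;

    #pragma omp parallel
    {
        double local_sum = omp_get_thread_num() + 1.0;   // placeholder contribution

        // Single-variable read-modify-write performed as an indivisible unit
        #pragma omp atomic
        global_sum += local_sum;
    }

    printf("global_sum = %f\n", global_sum);
    return 0;
}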
Condition Synchronization and Signaling
Beyond mutual exclusion and barriers, you often need mechanisms where:
- A producer thread signals that new work or data is available
- A consumer thread waits until that condition is satisfied
This is known as condition synchronization or signaling.
Common patterns:
- Producer–consumer queue: Producers add tasks or data to a shared queue; consumers remove tasks for processing.
- Phase-specific activation: One thread prepares data, then signals another thread to proceed.
Conceptual pseudocode with condition variables:
// Shared:
queue Q;
lock L;
condition not_empty;

producer_thread() {
    while (more_work) {
        item = produce_item();
        lock_acquire(L);
        enqueue(Q, item);
        signal(not_empty); // Wake a waiting consumer
        lock_release(L);
    }
}

consumer_thread() {
    while (true) {
        lock_acquire(L);
        while (is_empty(Q)) {
            wait(not_empty, L); // Atomically release L and block
        }
        item = dequeue(Q);
        lock_release(L);
        consume_item(item);
    }
}
Key properties:
- wait releases the associated lock and suspends the thread until signaled, then reacquires the lock before returning.
- signal or broadcast wakes one or all waiting threads, respectively.
These mechanisms are more advanced but essential for complex shared-memory applications that need event-driven or asynchronous coordination, rather than simple phase-based barriers.
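OpenMP itself does not provide condition variables, so this pattern is usually written with a threading API such as POSIX threads. The following C sketch mirrors the pseudocode above with a single-slot buffer; it adds a second condition (not_full) so the producer never overwrites an unconsumed item, and the item values and loop bound are placeholders:
#include <pthread.h>
#include <stdio.h>

static int item;                       // the single-slot "queue"
static int has_item = 0;               // is_empty(Q) == !has_item
static int done = 0;                   // producer has finished

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    for (int i = 1; i <= 5; i++) {
        pthread_mutex_lock(&L);
        while (has_item)                       // wait for the slot to drain
            pthread_cond_wait(&not_full, &L);
        item = i;                              // enqueue(Q, item)
        has_item = 1;
        pthread_cond_signal(&not_empty);       // wake a waiting consumer
        pthread_mutex_unlock(&L);
    }
    pthread_mutex_lock(&L);
    done = 1;
    pthread_cond_signal(&not_empty);           // let the consumer notice completion
    pthread_mutex_unlock(&L);
    return NULL;
}

static void *consumer(void *arg) {
    for (;;) {
        pthread_mutex_lock(&L);
        while (!has_item && !done)
            pthread_cond_wait(&not_empty, &L); // atomically releases L and blocks
        if (!has_item && done) {               // nothing left to consume
            pthread_mutex_unlock(&L);
            break;
        }
        int v = item;                          // dequeue(Q)
        has_item = 0;
        pthread_cond_signal(&not_full);        // slot is free again
        pthread_mutex_unlock(&L);
        printf("consumed %d\n", v);            // consume_item(item)
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}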
Memory Visibility and Ordering
Synchronization does not only control “who runs what when”; it also affects how and when memory updates become visible to other threads.
Modern CPUs and compilers can:
- Reorder independent instructions
- Cache values in registers or local caches
- Delay propagating writes to main memory
Synchronization primitives have memory ordering semantics:
- Entering or exiting a critical section
- Acquiring or releasing a lock
- Reaching a barrier
- Performing an atomic operation
These operations typically act as memory fences:
- Changes made by a thread before the fence become visible to other threads after they cross the corresponding fence.
- This ensures that the ordering expressed by your synchronization (as written in the source code) is the ordering other threads actually observe, which is what correctness requires.
Practical implications:
- Do not rely on “naked” shared variable accesses without appropriate synchronization.
- Use the provided synchronization constructs rather than trying to control memory ordering manually.
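To make the fence behavior concrete, here is a C11 sketch of a "data plus ready flag" handoff using release/acquire atomics. In everyday code the lock, barrier, and atomic constructs above provide these guarantees for you; a plain (non-atomic) flag here would be a data race with no visibility guarantee:
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int payload;                    // ordinary shared data
static atomic_int ready = 0;           // synchronization flag

static void *writer(void *arg) {
    payload = 42;                                             // (1) write the data
    atomic_store_explicit(&ready, 1, memory_order_release);   // (2) publish: release fence
    return NULL;
}

static void *reader(void *arg) {
    // The acquire load pairs with the writer's release store:
    // once ready == 1 is observed, the write to payload is guaranteed visible.
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  // busy-wait; acceptable only in a tiny demo
    printf("payload = %d\n", payload);  // prints 42
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}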
Choosing the Right Mechanism
Selecting the right synchronization tool is crucial for both correctness and performance:
- Atomic operations
  - Best for: simple counters, accumulators, min/max
  - Pros: very low overhead
  - Cons: limited to simple operations on single variables
- Critical sections / locks
  - Best for: protecting complex data structures or sequences of operations
  - Pros: flexible, general
  - Cons: can serialize execution if heavily contended
- Barriers
  - Best for: clearly separated computational phases
  - Pros: simple way to enforce global phase ordering
  - Cons: can cause idle time and poor scaling if phases are unbalanced
- Condition variables / signaling
  - Best for: producer–consumer patterns, asynchronous workflows
  - Pros: allow threads to sleep instead of busy-waiting
  - Cons: more complex to use correctly
General performance guidelines:
- Keep synchronized regions (critical sections) as short as possible.
- Avoid unnecessary synchronization; each operation has overhead.
- Prefer local (thread-private) data where possible, and only synchronize to combine results.
- Be aware that contention (many threads contending for the same lock or atomic variable) can severely limit scalability.
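As an illustration of the last two guidelines, the following C/OpenMP sketch accumulates into a thread-private variable and synchronizes only once per thread to combine results, instead of paying for an atomic update on every iteration (the loop body is a placeholder):
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    double total = 0.0;

    #pragma omp parallel
    {
        double local = 0.0;              // thread-private accumulator

        #pragma omp for
        for (int i = 0; i < N; i++)
            local += i * 0.001;          // no synchronization in the hot loop

        #pragma omp atomic               // exactly one synchronized update per thread
        total += local;
    }

    printf("total = %f\n", total);
    return 0;
}
(OpenMP's reduction clause expresses the same idea more concisely; it belongs to the general syntax covered elsewhere.)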
Common Pitfalls Related to Synchronization
While details of race conditions are covered elsewhere, synchronization mechanisms are closely tied to several typical mistakes:
- Forgetting to synchronize shared data accesses, leading to race conditions.
- Overusing synchronization, which can serialize the program and erase parallel speedup.
- Using the wrong tool, such as a global lock when a more fine-grained lock or atomic would suffice.
- Deadlocks from acquiring multiple locks in inconsistent order.
- Busy-waiting (spin-waiting) on shared flags without proper atomic operations or fences, leading to high CPU usage and subtle bugs.
Good habits:
- Clearly identify which variables are shared and which are private.
- Document which synchronization primitive protects each shared data structure.
- Think in terms of ownership and lifecycle of shared data: who initializes it, who reads or updates it, and how you ensure safe transitions.
Summary
Synchronization mechanisms are the backbone of safe shared-memory parallel programming. They allow threads to:
- Coordinate access to shared data (mutual exclusion)
- Enforce logical ordering of phases and events
- Ensure that updates become visible to other threads correctly and in a timely manner
Effective use of synchronization is about balancing correctness with performance: enough synchronization to avoid bugs, but no more than necessary to maintain scalability.