Why Synchronization Is Needed in Shared-Memory Programs
In shared-memory parallel programs, multiple threads can read and write the same variables in memory. Without coordination, they may:
- Overwrite each other’s updates
- Read inconsistent or partially updated data
- Execute critical steps in the wrong order
Synchronization mechanisms provide controlled ways for threads to:
- Enforce mutual exclusion (only one thread at a time in some code)
- Enforce ordering (A must happen before B)
- Manage visibility of updates (changes by one thread become visible to others at the right time)
This chapter focuses on the main synchronization tools you’ll use in typical OpenMP-style shared-memory programming, without going into general OpenMP syntax that belongs elsewhere.
Kinds of Synchronization
At a high level, synchronization mechanisms fall into a few categories:
- Mutual exclusion primitives: protect critical regions of code or data
- Barriers: make all threads wait until everyone has reached a point
- Condition synchronization: wait until some condition/event holds
- Atomic operations: very small updates performed safely without full locking
Each has different performance characteristics and appropriate use-cases.
Mutual Exclusion: Locks and Critical Sections
Critical Sections
A critical section is a block of code that must be executed by at most one thread at a time. Conceptually:
// Pseudocode
critical {
    // Access or update shared variable(s)
}
Typical use-cases:
- Updating a shared counter
- Appending to a shared list
- Writing to a shared log/file (in simple examples)
Critical sections are easy to use but can become performance bottlenecks if:
- They are too large (contain lots of work)
- They are executed very frequently by many threads
The key design principle is to minimize the amount of work done inside critical sections.
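To make this concrete, here is a minimal C/OpenMP sketch of a critical construct guarding a shared counter; the variable name shared_count and the iteration count are placeholders, and the enclosing parallel-region syntax is covered elsewhere:
#include <stdio.h>

int main(void) {
    long shared_count = 0;                  // shared by all threads in the region

    #pragma omp parallel
    {
        for (int i = 0; i < 1000; i++) {
            // At most one thread at a time executes this block
            #pragma omp critical
            {
                shared_count++;             // safe update of the shared counter
            }
        }
    }

    printf("count = %ld\n", shared_count);  // number of threads * 1000
    return 0;
}
Note that this example is deliberately contention-heavy; it illustrates the construct, not a recommended way to build a parallel counter.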
Locks (Mutexes)
A lock (often called a mutex) is a lower-level primitive for mutual exclusion. The idea:
- lock_acquire(L) — wait until no one else owns L, then take ownership
- lock_release(L) — give up ownership so another thread can enter
Pseudocode example:
lock_init(L);
parallel {
    // Some thread-private work...
    lock_acquire(L);
    // Critical region: safely update shared data
    shared_sum += local_sum;
    lock_release(L);
}
lock_destroy(L);
Locks give you more flexibility than high-level critical constructs:
- You can protect different data structures with different locks
- You can pass locks around functions and data structures
But they also introduce more risk:
- Deadlocks: if locks are acquired in inconsistent order between threads
- Performance penalties: if many threads contend for the same lock
Design tips:
- Use fine-grained locks where it makes sense (one per data structure, or even per element), but not so many that overhead dominates.
- Establish a global lock acquisition order to avoid deadlocks (e.g., always acquire L1 before L2).
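In OpenMP, the pseudocode above maps onto the lock routines omp_init_lock, omp_set_lock, omp_unset_lock, and omp_destroy_lock. A minimal C sketch, where shared_sum and the per-thread contribution are placeholders:
#include <omp.h>

int main(void) {
    double shared_sum = 0.0;
    omp_lock_t L;

    omp_init_lock(&L);                   // lock_init(L)

    #pragma omp parallel
    {
        // Thread-private work (placeholder contribution)
        double local_sum = omp_get_thread_num() + 1.0;

        omp_set_lock(&L);                // lock_acquire(L)
        shared_sum += local_sum;         // critical region
        omp_unset_lock(&L);              // lock_release(L)
    }

    omp_destroy_lock(&L);                // lock_destroy(L)
    return 0;
}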
Barriers
A barrier forces all threads in a parallel region to wait until every thread has reached that barrier. After the barrier:
- All threads have completed everything before the barrier
- Shared data written before the barrier is visible to all threads (modulo the memory model of the language/runtime)
Conceptual pseudocode:
parallel {
    // Phase 1: compute partial results
    compute_partial_result();
    barrier(); // Wait for all threads
    // Phase 2: use results from all threads
    use_all_partial_results();
}
Common uses:
- Divide work into phases where each phase depends on the results of the previous one
- Ensure that a shared data structure built cooperatively is fully constructed before use
Overuse of barriers can degrade performance:
- All threads must wait for the slowest one
- Too many barriers can serialize your computation
Good practice:
- Use barriers only when there is a clear data dependency between phases
- Avoid placing them inside tight loops unless absolutely necessary
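A minimal C/OpenMP sketch of the two-phase pattern above; the partial[] array and the per-thread computation are placeholders, and the code assumes the team has at most MAX_THREADS threads:
#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 64

int main(void) {
    double partial[MAX_THREADS] = {0};   // one slot per thread
    double total = 0.0;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        // Phase 1: each thread computes its partial result
        partial[tid] = tid + 1.0;        // placeholder computation

        #pragma omp barrier              // wait until every thread has finished Phase 1

        // Phase 2: all partial results are now safely visible
        #pragma omp single
        {
            for (int i = 0; i < omp_get_num_threads(); i++)
                total += partial[i];
        }
    }

    printf("total = %f\n", total);
    return 0;
}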
Atomic Operations
Sometimes you only need to protect a very small, simple update to a shared variable, such as:
- counter = counter + 1
- sum += x
- min_val = min(min_val, x)
Using a full lock for these can be overkill; atomic operations perform such updates as an indivisible unit, often with hardware support.
Conceptual example:
parallel {
    // Each thread has some local contribution
    atomic_add(global_sum, local_sum); // implemented atomically
}
Characteristics:
- Much cheaper than a lock-based critical region for simple operations
- Avoids the overhead and complexity of explicit locks
- Best suited for short, simple read-modify-write operations on a single variable
Limitations:
- You cannot easily perform multi-variable atomic updates (e.g., updating two counters together) with a single atomic
- Not suited for protecting large or complex critical sections
Rule of thumb:
- Use atomic operations for single-variable counters, reductions, or min/max updates
- Use locks/critical regions for more complex shared data structures
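In C with OpenMP, such updates are expressed with the atomic construct. A minimal sketch of the global-sum example, where the per-thread contribution is a placeholder:
#include <omp.h>
#include <stdio.h>

int main(void) {
    double global_sum = 0.0;

    #pragma omp parallel
    {
        double local_sum = omp_get_thread_num() + 1.0;   // placeholder contribution

        // Single-variable read-modify-write performed as an indivisible unit
        #pragma omp atomic
        global_sum += local_sum;
    }

    printf("global_sum = %f\n", global_sum);
    return 0;
}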
Condition Synchronization and Signaling
Beyond mutual exclusion and barriers, you often need mechanisms where:
- A producer thread signals that new work or data is available
- A consumer thread waits until that condition is satisfied
This is known as condition synchronization or signaling.
Common patterns:
- Producer–consumer queue: Producers add tasks or data to a shared queue; consumers remove tasks for processing.
- Phase-specific activation: One thread prepares data, then signals another thread to proceed.
Conceptual pseudocode with condition variables:
// Shared:
queue Q;
lock L;
condition not_empty;

producer_thread() {
    while (more_work) {
        item = produce_item();
        lock_acquire(L);
        enqueue(Q, item);
        signal(not_empty); // Wake a waiting consumer
        lock_release(L);
    }
}

consumer_thread() {
    while (true) {
        lock_acquire(L);
        while (is_empty(Q)) {
            wait(not_empty, L); // Atomically release L and block
        }
        item = dequeue(Q);
        lock_release(L);
        consume_item(item);
    }
}
Key properties:
- wait releases the associated lock and suspends the thread until signaled, then reacquires the lock before returning.
- signal or broadcast wakes one or all waiting threads, respectively.
These mechanisms are more advanced but essential for complex shared-memory applications that need event-driven or asynchronous coordination, rather than simple phase-based barriers.
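OpenMP itself does not provide condition variables, so this pattern is usually written with a threading API such as POSIX threads. The following C sketch mirrors the pseudocode above with a single-slot buffer; it adds a second condition (not_full) so the producer never overwrites an unconsumed item, and the item values and loop bound are placeholders:
#include <pthread.h>
#include <stdio.h>

static int item;                       // the single-slot "queue"
static int has_item = 0;               // is_empty(Q) == !has_item
static int done = 0;                   // producer has finished

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    for (int i = 1; i <= 5; i++) {
        pthread_mutex_lock(&L);
        while (has_item)                       // wait for the slot to drain
            pthread_cond_wait(&not_full, &L);
        item = i;                              // enqueue(Q, item)
        has_item = 1;
        pthread_cond_signal(&not_empty);       // wake a waiting consumer
        pthread_mutex_unlock(&L);
    }
    pthread_mutex_lock(&L);
    done = 1;
    pthread_cond_signal(&not_empty);           // let the consumer notice completion
    pthread_mutex_unlock(&L);
    return NULL;
}

static void *consumer(void *arg) {
    for (;;) {
        pthread_mutex_lock(&L);
        while (!has_item && !done)
            pthread_cond_wait(&not_empty, &L); // atomically releases L and blocks
        if (!has_item && done) {               // nothing left to consume
            pthread_mutex_unlock(&L);
            break;
        }
        int v = item;                          // dequeue(Q)
        has_item = 0;
        pthread_cond_signal(&not_full);        // slot is free again
        pthread_mutex_unlock(&L);
        printf("consumed %d\n", v);            // consume_item(item)
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}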
Memory Visibility and Ordering
Synchronization does not only control “who runs what when”; it also affects how and when memory updates become visible to other threads.
Modern CPUs and compilers can:
- Reorder independent instructions
- Cache values in registers or local caches
- Delay propagating writes to main memory
Synchronization primitives have memory ordering semantics:
- Entering or exiting a critical section
- Acquiring or releasing a lock
- Reaching a barrier
- Performing an atomic operation
These operations typically act as memory fences:
- Changes made by a thread before the fence become visible to other threads after they cross the corresponding fence.
- This ensures that the ordering expressed by your synchronization (as written in the source code) is the ordering other threads actually observe, which is what correctness requires.
Practical implications:
- Do not rely on “naked” shared variable accesses without appropriate synchronization.
- Use the provided synchronization constructs rather than trying to control memory ordering manually.
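To make the fence behavior concrete, here is a C11 sketch of a "data plus ready flag" handoff using release/acquire atomics. In everyday code the lock, barrier, and atomic constructs above provide these guarantees for you; a plain (non-atomic) flag here would be a data race with no visibility guarantee:
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int payload;                    // ordinary shared data
static atomic_int ready = 0;           // synchronization flag

static void *writer(void *arg) {
    payload = 42;                                             // (1) write the data
    atomic_store_explicit(&ready, 1, memory_order_release);   // (2) publish: release fence
    return NULL;
}

static void *reader(void *arg) {
    // The acquire load pairs with the writer's release store:
    // once ready == 1 is observed, the write to payload is guaranteed visible.
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  // busy-wait; acceptable only in a tiny demo
    printf("payload = %d\n", payload);  // prints 42
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}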
Choosing the Right Mechanism
Selecting the right synchronization tool is crucial for both correctness and performance:
- Atomic operations
  - Best for: simple counters, accumulators, min/max
  - Pros: very low overhead
  - Cons: limited to simple operations on single variables
- Critical sections / locks
  - Best for: protecting complex data structures or sequences of operations
  - Pros: flexible, general
  - Cons: can serialize execution if heavily contended
- Barriers
  - Best for: clearly separated computational phases
  - Pros: simple way to enforce global phase ordering
  - Cons: can cause idle time and poor scaling if phases are unbalanced
- Condition variables / signaling
  - Best for: producer–consumer patterns, asynchronous workflows
  - Pros: allow threads to sleep instead of busy-waiting
  - Cons: more complex to use correctly
General performance guidelines:
- Keep synchronized regions (critical sections) as short as possible.
- Avoid unnecessary synchronization; each operation has overhead.
- Prefer local (thread-private) data where possible, and only synchronize to combine results.
- Be aware that contention (many threads contending for the same lock or atomic variable) can severely limit scalability.
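As an illustration of the last two guidelines, the following C/OpenMP sketch accumulates into a thread-private variable and synchronizes only once per thread to combine results, instead of paying for an atomic update on every iteration (the loop body is a placeholder):
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    double total = 0.0;

    #pragma omp parallel
    {
        double local = 0.0;              // thread-private accumulator

        #pragma omp for
        for (int i = 0; i < N; i++)
            local += i * 0.001;          // no synchronization in the hot loop

        #pragma omp atomic               // exactly one synchronized update per thread
        total += local;
    }

    printf("total = %f\n", total);
    return 0;
}
(OpenMP's reduction clause expresses the same idea more concisely; it belongs to the general syntax covered elsewhere.)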
Common Pitfalls Related to Synchronization
While details of race conditions are covered elsewhere, synchronization mechanisms are closely tied to several typical mistakes:
- Forgetting to synchronize shared data accesses, leading to race conditions.
- Overusing synchronization, which can serialize the program and erase parallel speedup.
- Using the wrong tool, such as a global lock when a more fine-grained lock or atomic would suffice.
- Deadlocks from acquiring multiple locks in inconsistent order.
- Busy-waiting (spin-waiting) on shared flags without proper atomic operations or fences, leading to high CPU usage and subtle bugs.
Good habits:
- Clearly identify which variables are shared and which are private.
- Document which synchronization primitive protects each shared data structure.
- Think in terms of ownership and lifecycle of shared data: who initializes it, who reads or updates it, and how you ensure safe transitions.
Summary
Synchronization mechanisms are the backbone of safe shared-memory parallel programming. They allow threads to:
- Coordinate access to shared data (mutual exclusion)
- Enforce logical ordering of phases and events
- Ensure that updates become visible to other threads correctly and in a timely manner
Effective use of synchronization is about balancing correctness with performance: enough synchronization to avoid bugs, but no more than necessary to maintain scalability.