Understanding Synchronization in Shared Memory
Synchronization in shared memory programming coordinates concurrent access to data that is visible to multiple threads. In OpenMP and similar models, all threads can usually read and write the same variables. Without explicit coordination, they may interleave their operations in arbitrary ways, which can lead to race conditions, inconsistent results, or crashes.
In this chapter, the focus is on the main synchronization mechanisms you will encounter in shared memory systems, especially OpenMP. You will see what each mechanism guarantees, when it is appropriate to use it, and what costs or pitfalls are associated with it. The goal is not to remove parallelism, but to control it just enough so that your program remains correct.
Synchronization is always a trade-off between correctness and performance.
Use the weakest mechanism that still guarantees correctness, and avoid unnecessary synchronization.
Critical Sections
A critical section protects a region of code that must be executed by only one thread at a time. In OpenMP, this is expressed with #pragma omp critical in C or !$omp critical in Fortran.
Conceptually, you can think of a critical section as putting a lock around the code region. When a thread reaches the critical section, it waits until no other thread is inside it, then enters, executes the code, and finally leaves so that another thread may enter.
You typically use a critical section when multiple threads must update shared state using a non-trivial operation that cannot be expressed as a simple atomic update. Examples include updating a complex data structure, appending to a shared log, or performing a read-modify-write sequence that spans several instructions.
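As an illustration, the following sketch appends values to a shared log from inside a parallel loop. The buffer, counter, and threshold are hypothetical, but the pattern of wrapping a multi-variable update in a critical section is the one described above:

```c
#include <omp.h>

#define MAX_EVENTS 1024

/* Illustrative shared log: a buffer plus a count, both shared by all threads. */
double event_log[MAX_EVENTS];
int    n_events = 0;

void record_events(const double *data, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double value = data[i] * data[i];   /* independent work, fully parallel */

        if (value > 1.0) {
            /* The append touches two shared variables (the buffer and the
               counter), so it cannot be a single atomic update; a critical
               section makes the whole read-modify-write exclusive. */
            #pragma omp critical
            {
                if (n_events < MAX_EVENTS) {
                    event_log[n_events] = value;
                    n_events++;
                }
            }
        }
    }
}
```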
Critical sections are easy to use and understand, but they can become a performance bottleneck if many threads frequently contend for the same section. If only one thread can execute that part at a time, the effective parallelism in that region is lost.
Rule: If many threads frequently enter the same critical region, it becomes a serialization point and can destroy scalability.
Use critical sections for correctness first, then consider replacing them with more fine-grained mechanisms such as reductions, atomics, or private data plus explicit combination when possible.
Atomic Operations
Atomic operations provide a very small, low-level synchronization primitive focused on a single memory location. In OpenMP, an atomic operation tells the compiler and runtime to ensure that a simple update to a variable occurs as an indivisible operation.
For example, a statement such as sum += x[i] updates sum through a read, an addition, and a write. Without synchronization, two threads may both read the same old value of sum and overwrite each other's contribution. An atomic update makes the combined read-modify-write of a particular variable appear to all threads as a single, uninterruptible step.
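A minimal sketch of this sum += x[i] case, written with an OpenMP atomic update. For this particular pattern, the reduction clause discussed later in the chapter is usually the better choice; the point here is only the construct itself:

```c
#include <omp.h>

double parallel_sum(const double *x, int n)
{
    double sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* The read-modify-write of sum becomes one indivisible update. */
        #pragma omp atomic
        sum += x[i];
    }
    return sum;
}
```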
Atomic operations are ideal when several threads update a shared counter, accumulate a sum, or track a maximum or minimum on a single variable. They are finer-grained than critical sections because they only protect one variable and one operation type, so they usually scale better.
However, excessive use of atomics can still lead to contention, particularly if many threads hammer on the same location. In addition, not all compound operations can be expressed as a single atomic construct, so there are situations where a critical section or a different design is required.
Use atomic for simple updates on a single variable.
If your code inside critical is just x = x + value;, prefer an atomic update.
Barriers
A barrier is a synchronization point where all threads in a team must arrive before any of them can proceed. Barriers are used to separate phases of computation and ensure that all threads have reached a consistent state.
An OpenMP barrier in C looks like #pragma omp barrier. When a thread reaches the barrier, it waits until every other thread in the same parallel region also reaches that point. Once the last thread arrives, all of them are released and can continue.
Barriers are useful when later code depends on the completion of earlier work from all threads. For example, after all threads update different parts of a shared array, a barrier can ensure that the entire array has been updated before a subsequent computation that assumes the array is complete.
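A sketch of this two-phase pattern follows. compute() is just a stand-in for real per-element work, and the second loop deliberately reads neighboring entries written by other threads, so the barrier is genuinely needed:

```c
#include <omp.h>

static double compute(int i) { return (double)i * 0.5; }   /* stand-in for real work */

void two_phase_update(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        int id       = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* Phase 1: each thread fills its own interleaved slice of a. */
        for (int i = id; i < n; i += nthreads)
            a[i] = compute(i);

        /* No thread may start phase 2 until every entry of a is written. */
        #pragma omp barrier

        /* Phase 2: b depends on entries of a written by other threads. */
        for (int i = id; i < n; i += nthreads)
            b[i] = 0.5 * (a[i] + a[(i + 1) % n]);
    }
}
```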
However, barriers can also harm performance if overused. Whenever the work is imbalanced across threads, the faster threads sit idle at the barrier waiting for the slowest one.
Rule: Use barriers only when you truly need a global synchronization point.
Unnecessary barriers reduce parallel efficiency.
Many OpenMP constructs, such as work-sharing loops, include an implicit barrier at their end by default. When you know that barrier is not required, you can often remove it with a clause such as nowait, which can improve performance.
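For example, assuming the two loops below touch entirely independent arrays, the implicit barrier after the first loop can be dropped with nowait:

```c
#include <omp.h>

void independent_loops(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        /* The two loops touch different arrays, so no thread needs to wait
           for the first loop to finish everywhere before starting the second.
           nowait removes the implicit barrier at the end of the first loop. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = b[i] + 1.0;
    }   /* the implicit barrier at the end of the parallel region remains */
}
```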
Ordered Execution
Ordered execution is a special kind of synchronization that enforces sequential order for specific parts of code inside a parallel loop. OpenMP provides an ordered construct which allows you to specify that certain operations must occur in the same order as the loop iterations, even though the iterations are otherwise executed in parallel.
This is useful when you want to exploit parallelism in most of the loop body, but you have a small section that must obey a specific order. A common example is producing output in the loop iteration order or updating a shared resource that must reflect a strictly increasing sequence.
Only the ordered region is serialized according to the loop index. Outside this region, iterations may proceed concurrently. This gives you more flexibility than simply executing the entire loop sequentially, while still preserving order where needed.
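A sketch of a loop whose computation runs in parallel while only the output is forced into iteration order; expensive_transform() is a stand-in for real work:

```c
#include <omp.h>
#include <stdio.h>

static double expensive_transform(double v) { return v * v; }   /* stand-in */

void process_in_order(const double *x, int n)
{
    /* The ordered clause on the loop enables ordered regions inside it. */
    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++) {
        double y = expensive_transform(x[i]);   /* runs concurrently */

        /* Only the output is serialized according to the loop index. */
        #pragma omp ordered
        printf("result[%d] = %f\n", i, y);
    }
}
```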
You should avoid using ordered regions for large amounts of work because they impose serialization and limit scalability in the same way that a long critical section would.
Use ordered for small, order-sensitive parts inside otherwise parallel loops, not for large computations.
Locks and Mutexes
Locks (also called mutexes) provide manual control over mutual exclusion. Instead of relying on implicit lock management via constructs like critical, you explicitly create, acquire, and release locks in your program.
In OpenMP, you typically declare an omp_lock_t variable, initialize it, and then use library functions to set or unset the lock. While a lock is held by a thread, other threads that attempt to acquire the same lock will block until it is released.
Locks allow more flexible patterns than a single critical region. For example, you can associate different locks with different shared resources or parts of a data structure. This can reduce contention because threads operating on independent resources do not interfere with each other.
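As a sketch of this fine-grained pattern, the following hypothetical histogram-style accumulation keeps one lock per bin, so threads updating different bins never block each other:

```c
#include <omp.h>

#define NBINS 64

double     bin_totals[NBINS];
omp_lock_t bin_locks[NBINS];      /* one lock per bin reduces contention */

void init_bins(void)
{
    for (int b = 0; b < NBINS; b++) {
        bin_totals[b] = 0.0;
        omp_init_lock(&bin_locks[b]);
    }
}

void accumulate(const double *x, const int *bin, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int b = bin[i];
        omp_set_lock(&bin_locks[b]);     /* block until this bin's lock is free */
        bin_totals[b] += x[i];
        omp_unset_lock(&bin_locks[b]);   /* always release on the same path */
    }
}

void destroy_bins(void)
{
    for (int b = 0; b < NBINS; b++)
        omp_destroy_lock(&bin_locks[b]);
}
```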
However, locks also introduce greater complexity and new classes of bugs. If a thread forgets to release a lock, or if locks are acquired in inconsistent orders, the program can hang or behave unpredictably. Locks also add overhead, so they should be used only when finer-grained control is necessary.
When using locks, always follow a consistent lock ordering and ensure that every path that acquires a lock also releases it, even on errors.
Read–Write Synchronization
In many algorithms, multiple threads frequently read shared data while writes are rare. For such patterns, an ordinary lock is overly conservative: it prevents readers from proceeding concurrently even when no thread is writing.
Read–write synchronization distinguishes between read-only access and write access. Multiple readers can hold a shared lock simultaneously, while writers require exclusive access. Many threading libraries provide reader–writer locks or similar constructs, and some languages and frameworks offer high-level primitives that encapsulate this behavior.
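OpenMP itself does not provide a reader–writer lock, so a sketch of this pattern has to fall back on another library. The example below uses the POSIX pthread_rwlock_t API around a hypothetical shared parameter:

```c
#include <pthread.h>

typedef struct {
    double           value;
    pthread_rwlock_t lock;
} shared_param_t;

void shared_param_init(shared_param_t *p, double v)
{
    p->value = v;
    pthread_rwlock_init(&p->lock, NULL);
}

double shared_param_read(shared_param_t *p)
{
    pthread_rwlock_rdlock(&p->lock);   /* many readers may hold this at once */
    double v = p->value;
    pthread_rwlock_unlock(&p->lock);
    return v;
}

void shared_param_write(shared_param_t *p, double v)
{
    pthread_rwlock_wrlock(&p->lock);   /* writers get exclusive access */
    p->value = v;
    pthread_rwlock_unlock(&p->lock);
}
```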
Although read–write locks can improve scalability for read-dominated workloads, they also incur more complexity than simple mutex locks. Managing upgrade and downgrade from read to write lock, or vice versa, can introduce subtle bugs if not carefully designed.
For simple HPC programs, a standard mutex or atomic operation is often easier and sufficient. Consider read–write synchronization only when profiling shows that plain mutual exclusion is a major bottleneck and the access pattern is heavily skewed toward reads.
Reader–writer locks are most beneficial when reads vastly outnumber writes.
If write frequency is high, they may perform worse than simple mutex locks.
Memory Consistency and Fences
Synchronization is not only about who can access data, but also about when updates become visible to other threads. Modern CPUs and compilers may reorder instructions and memory accesses as long as they preserve the illusion of sequential execution within each thread. In a parallel program, this can lead to subtle visibility problems between threads that communicate through shared variables.
A memory fence is a synchronization mechanism that restricts such reordering. It ensures that all memory operations before the fence are completed and visible to other threads before operations after the fence can proceed. In OpenMP and other models, certain constructs implicitly include memory fences, such as entering and exiting barriers, critical sections, or atomic updates.
In many high-level shared memory programs, you will not explicitly insert fences, because the constructs you already use for synchronization also establish the necessary ordering. Problems arise when threads communicate without proper synchronization, for example, one thread writes data and sets a flag, while another thread reads the flag and then the data without any barrier, lock, or atomic semantics.
When you rely on shared flags or other low-level signaling between threads, you must guarantee that the writer and reader use some form of synchronization that establishes a happens-before relationship, so that the reader sees the writer’s updates consistently.
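As a sketch, assuming an OpenMP version that supports the seq_cst clause on atomic (4.0 or later), a flag-based handoff can be made safe like this:

```c
#include <omp.h>

/* producer() and consumer() are meant to run on two different threads
   of the same team; names and values are illustrative. */

double shared_data = 0.0;
int    flag = 0;              /* 0 = not ready, 1 = data published */

void producer(void)
{
    shared_data = 42.0;       /* write the payload first */

    /* Sequentially consistent atomic write: implies the flush needed so the
       payload is visible before any thread can observe flag == 1. */
    #pragma omp atomic write seq_cst
    flag = 1;
}

void consumer(double *out)
{
    int ready = 0;

    /* Spin on the flag with an atomic read; together with the producer's
       atomic write this establishes the happens-before relationship, so
       shared_data is guaranteed to hold 42.0 once flag is seen as 1. */
    while (!ready) {
        #pragma omp atomic read seq_cst
        ready = flag;
    }
    *out = shared_data;
}
```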
Rule: Do not rely on unsynchronized reads and writes of shared flags or data.
Use synchronization constructs that both control access and establish memory visibility.
Thread Coordination with Single and Master
OpenMP provides constructs that let exactly one thread execute a code section inside a parallel region, without requiring mutual exclusion in the usual sense. The single construct ensures that one and only one thread in a team executes a block, while the other threads skip it. The master construct designates that only the master thread (thread 0) executes the block.
These constructs are used for coordination tasks rather than data protection. For example, you might use single to perform an initialization that only needs to happen once per team, or to execute a non-parallelizable I/O operation while other threads wait at an implicit barrier that follows the single block.
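A sketch of that initialization pattern follows. The grid and its setup are illustrative, but the structure shows how the implicit barrier after single protects the threads that skipped the block:

```c
#include <omp.h>
#include <stdio.h>

void solve(double *grid, int n)
{
    #pragma omp parallel
    {
        /* One thread (whichever arrives first) does the once-per-team setup;
           the implicit barrier after single keeps the others from racing ahead. */
        #pragma omp single
        {
            for (int i = 0; i < n; i++)
                grid[i] = 0.0;
            printf("grid initialized by thread %d\n", omp_get_thread_num());
        }

        /* All threads can now safely assume grid is initialized. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            grid[i] += 1.0;
    }
}
```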
Unlike critical, single and master do not provide general mutual exclusion for arbitrary regions entered by multiple threads. Instead, they are designed for patterns where one thread performs a role on behalf of the team. Note that a single region ends with an implicit barrier (unless removed with nowait), whereas master implies no barrier at all, so an explicit barrier may be needed to ensure that the other threads see a consistent state afterward.
You should choose between single and master based on whether it matters which thread executes the block. If you do not care which thread does the work, a single region allows any thread to take on the task. If you specifically want the master thread to handle it, for example for ease of reasoning or to interact with thread-unsafe libraries, then master is appropriate.
Use single or master for one-thread-only tasks inside a team,
but not as a general substitute for critical or locks.
Reductions as Structured Synchronization
Reduction operations are a specialized and structured form of synchronization used to combine multiple thread-local values into a single result. Common reductions include sums, products, minima, and maxima. In OpenMP, you can declare a reduction clause so that the runtime transparently manages the combination of partial results.
Although reductions are often discussed in the context of work-sharing constructs, they are also an important synchronization mechanism because they automatically handle partial result accumulation and final combination in a thread-safe way. Internally, the runtime will use atomics, critical sections, or tree-based combination strategies, but as a programmer you simply express the mathematical intent.
Reductions are more efficient and less error-prone than manually protecting a shared accumulator with a critical section or atomic operation in each loop iteration. They also help you avoid subtle bugs that arise from forgetting to initialize private accumulators or incorrectly combining them.
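For example, a dot product expressed with a reduction clause; the runtime privatizes sum for each thread and combines the partial results safely at the end of the loop:

```c
#include <omp.h>

double dot_product(const double *x, const double *y, int n)
{
    double sum = 0.0;

    /* Each thread accumulates into its own private copy of sum; the partial
       sums are combined thread-safely when the loop completes. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```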
Whenever your parallel pattern matches a mathematical reduction, it pays to state it as such, and let the model handle the necessary synchronization optimizations.
Whenever possible, express collective updates as reductions instead of manual critical sections or atomic operations on a shared accumulator.
Choosing the Right Mechanism
The central skill with synchronization is choosing the weakest mechanism that is sufficient to guarantee correctness, while minimizing contention and overhead.
If you only need to update a single numeric variable using a simple operation, an atomic update is often ideal. For more complex updates to shared state, a critical section or lock is appropriate. If all threads must wait at a phase boundary, a barrier is the right tool. When combining results from many threads, a reduction offers a concise and usually efficient abstraction.
Excessive or unnecessarily strong synchronization will cause threads to wait instead of performing useful work. Too little synchronization, or incorrect use, will cause race conditions and hard-to-debug errors. Correctness is always the first priority, but once correctness is ensured, profiling and iterating on your choice of mechanisms will help recover performance.
Guiding principle:
- First ensure correctness with clear, possibly conservative synchronization.
- Then profile and gradually replace heavy mechanisms with lighter ones where safe.
- Always avoid unsynchronized access to shared data that is written by multiple threads.
By understanding the strengths and limitations of these synchronization mechanisms, you can structure shared memory programs that are both correct and efficient, and you will be better prepared to identify and fix the synchronization-related bugs and performance issues discussed elsewhere in this course.