Shared-Memory Parallel Programming

Overview

Shared memory parallel programming refers to all approaches where multiple execution threads or tasks access a single, common address space. Each thread can read and write any variable in that space, subject to the rules of the programming model and the hardware. This model underlies typical multicore desktop and server systems, and most HPC clusters today combine shared memory inside a node with distributed memory across nodes.

In practice you will typically use a shared memory programming model on a single compute node or even a single CPU socket, and combine it with a distributed memory model when moving to multiple nodes. This chapter focuses on the conceptual aspects of shared memory, independent of any specific programming interface such as OpenMP or threads in a particular language, which are covered later.

Shared memory models are widely used because they are conceptually natural. Instead of explicitly sending messages between processes, you simply write code that operates on ordinary variables, arrays, and data structures. The parallelism is expressed by splitting work across multiple threads that operate on shared data. The difficulty lies in correctly coordinating access to this shared state.

In shared memory programming, all threads can access the same data, so uncoordinated writes lead to incorrect results and hard-to-reproduce bugs.

This chapter explores what makes shared memory parallelism different, when it is appropriate, and what conceptual issues arise when multiple threads interact via a common address space.

The Shared Address Space

In a shared memory system each thread sees the same logical memory. A pointer or array index refers to the same location, regardless of which thread dereferences it. While the hardware implementation can be quite complex, your mental model can be simple: all threads read and write a single pool of variables.

This is different from distributed memory programming where each process has its own local memory and cannot directly access another process’s data. Shared memory programming avoids explicit communication and is often easier to start with, especially for beginners who already know sequential programming.

At a programming level, shared memory can appear in different forms. In low level APIs you might create operating system threads that all share the address space of a single process. In higher level APIs, such as directive based models, you might mark regions of code that should run in parallel, and the runtime creates the threads for you. In both cases, the common feature is a single address space that all threads can read and write.

Because of this shared address space, it is easy to convert some sequential code patterns into shared memory parallel code. For example, you can split a loop over an array into chunks that different threads work on. You do not need to explicitly make copies of the array or send parts of it to different processes. The main task is to ensure that two threads do not write to the same element at the same time, unless this is deliberate and coordinated.
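
As a concrete illustration, the sketch below uses plain C++ threads purely as an example notation (the chapter itself is independent of any particular interface, and names such as scale_parallel are hypothetical). It splits a loop over a shared array into contiguous chunks, one per thread; because the index ranges are disjoint, no two threads ever write to the same element.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal sketch: scale every element of a shared vector in parallel.
// Each thread writes only to its own contiguous chunk, so no locking is needed.
void scale_parallel(std::vector<double>& a, double factor, unsigned num_threads) {
    const std::size_t n = a.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end   = std::min(n, begin + chunk);
        workers.emplace_back([&a, factor, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                a[i] *= factor;   // disjoint indices: no two threads touch the same element
            }
        });
    }
    for (auto& w : workers) w.join();   // wait for all chunks to finish
}
```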

Threads as Units of Execution

Shared memory parallel programs are usually structured around threads. A thread is a lightweight execution context that runs within a process and shares that process’s address space. All threads in a process have access to the same global and heap allocated variables, but they have their own stacks and their own instruction streams.

On a multicore CPU, each core executes one thread at a time, or a few hardware threads when simultaneous multithreading is enabled. When you write a shared memory program, you are choosing how to map your computational work onto these threads. You might choose a fixed number of threads equal to the number of physical cores, or a slightly different number depending on workload and hardware. This mapping and thread management strategy will be discussed in detail later, but the key idea here is that threads are the agents that perform parallel work on shared data.

Threads are not independent programs. They are tightly coupled. A bug in one thread can corrupt a shared data structure and cause a failure in another thread. For this reason, reasoning carefully about which thread reads or writes which locations at which time is central to shared memory programming.

Patterns of Shared-Memory Parallelism

Although shared memory programming models offer many constructs, most useful patterns fall into a small number of categories. Understanding these patterns helps you recognize opportunities to parallelize existing code and to select appropriate implementations.

A very common pattern is parallel loops over arrays or other indexable data structures. Suppose you have an array a of length N and a loop that applies the same operation to each element. In a shared memory program you can divide the iteration space 0, 1, ..., N-1 into contiguous or interleaved segments and assign each segment to a different thread. Since each thread writes to a disjoint subset of a, no coordination is required except for knowing where to start and stop. This pattern is sometimes called data parallelism, but here it is expressed within a shared address space, without explicit data distribution.
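
To make the contiguous-versus-interleaved distinction concrete, a hedged variant of the earlier sketch assigns thread t the indices t, t + T, t + 2T, and so on for T threads; the written elements are still disjoint.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of the interleaved (cyclic) variant: thread t handles indices
// t, t + T, t + 2T, ... so consecutive iterations go to different threads.
void scale_interleaved(std::vector<double>& a, double factor, unsigned num_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&a, factor, t, num_threads] {
            for (std::size_t i = t; i < a.size(); i += num_threads) {
                a[i] *= factor;   // stride equals the thread count, so writes never overlap
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Contiguous chunks are usually friendlier to caches, while an interleaved distribution can help when the cost per element varies strongly.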

A second pattern is task parallelism within a shared memory space. Instead of splitting a large homogeneous loop, you may have several conceptually different tasks that can run concurrently and share some common data. For instance, one thread may read input from disk and place it into a buffer, while several worker threads process chunks from that buffer. All threads see the same buffer variables, but you must ensure that they coordinate through appropriate mechanisms so that no thread consumes data that has not yet been produced or overwrites data that another thread is still using.

A third common pattern is producer consumer queues. Here threads communicate by placing items into and removing items from shared data structures such as queues or work lists. Since the data structure itself is shared, all inserts and removals must be done in a way that avoids corrupting its internal representation. This usually involves some synchronization mechanism and careful design to avoid excessive contention.
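
A minimal sketch of such a shared queue, assuming C++ threads and coordination through a mutex and a condition variable (the class and member names are illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Sketch of a shared work queue: every structural change happens while the
// mutex is held, so the queue's internal state is never observed half-updated.
template <typename T>
class SharedQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            items_.push(std::move(item));
        }
        not_empty_.notify_one();   // wake one waiting consumer
    }

    T pop() {   // blocks until an item is available
        std::unique_lock<std::mutex> lock(mtx_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::queue<T> items_;
    std::mutex mtx_;
    std::condition_variable not_empty_;
};
```

A producer thread from the previous pattern would call push, while worker threads call pop; the condition variable makes consumers wait until data has actually been produced.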

Finally, some algorithms require fine grained sharing of complex data structures such as graphs, trees, or hash tables. In these cases different threads may traverse and update overlapping portions of the structure. Designing such algorithms to scale while remaining correct is challenging. It often requires a deeper understanding of concurrency and synchronization patterns, which later sections will explore.

Memory Consistency and Visibility

On a single core processor, a simple sequential program can assume that operations happen in program order: a store to a variable followed by a load from that variable sees the stored value. In a shared memory parallel program, multiple threads execute on different cores and the hardware may reorder memory operations or keep local cached copies of data. As a result, the order in which operations become visible to other threads is not always the same as the order in which they appear in the program.

This gives rise to the concept of a memory consistency model. A memory consistency model defines which values a read operation is allowed to observe when multiple threads perform writes to the same location. While specific models differ between programming languages and platforms, the important consequence for you as a programmer is that modifications in one thread are not automatically and immediately visible to other threads unless certain ordering conditions are satisfied.

For example, if one thread writes to a shared variable data and then sets a shared variable flag, another thread that observes the updated flag may or may not see the updated data value, depending on the memory model and the presence of explicit ordering or synchronization constructs. This is counterintuitive when coming from purely sequential programming, but it is a fundamental aspect of shared memory parallelism.

Programming models provide synchronization constructs and atomic operations to establish well defined ordering. When used correctly, these constructs ensure that once a thread observes a particular synchronization event, it also sees all preceding writes done by the thread that performed that event. You rarely need to reason about hardware level reorderings in detail, but you must ensure that every read of shared data that depends on writes from other threads is properly ordered through some form of synchronization.
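
As an illustration in C++, the flag/data example above can be made safe with atomic operations: the release store publishes the earlier write to data, and an acquire load that observes the flag is then guaranteed to see it. The variable names follow the example in the text, and this is only a sketch of one way to establish the ordering.

```cpp
#include <atomic>

// Sketch of the flag/data pattern with explicit ordering.
int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                      // ordinary write to shared data
    ready.store(true, std::memory_order_release);   // publishes all earlier writes
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the flag is observed; acquire pairs with the release above
    }
    int observed = data;   // guaranteed to see 42, never a stale value
    (void)observed;
}
```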

Never rely on the compiler or hardware to keep operations in program order across threads.
Use the synchronization primitives provided by your programming model to establish the required memory visibility guarantees.

Ignoring memory visibility can lead to elusive bugs where a program appears to work most of the time but fails under heavy load or on particular hardware. Understanding that each thread has its own view of memory, and that synchronization aligns these views, is crucial in shared memory programming.

Shared Variables and Data Scoping

A central question in any shared memory program is which variables are actually shared and which are private to each thread. In many languages, variables on the stack of a function are private to the thread executing that function, while global variables and heap allocated objects are shared. However, parallel programming models often let you override these defaults and explicitly mark variables as shared or private, especially within parallel regions or parallel loops.

Shared variables are accessible by all threads. Any write by any thread can affect later reads by other threads. Private variables are logically replicated per thread, so each thread has its own independent copy. Modifying a private variable does not affect other threads. This distinction is essential to avoid unnecessary sharing and the associated synchronization overhead.

For example, consider a simple accumulator variable sum used inside a loop. If multiple threads execute the loop in parallel and all update the same sum variable, you either need explicit synchronization or you will encounter race conditions. In contrast, if each thread has its own private accumulator, there is no interference. At the end of the computation you can combine the private results in a controlled way. Many programming models provide special constructs to support this pattern, often called reductions.
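
A hedged sketch of this pattern with plain C++ threads: each thread accumulates into a private variable and writes a single partial result, and the combination happens sequentially at the end. A reduction construct in a higher-level model expresses the same idea more compactly; the function name here is illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of a parallel sum with private accumulators: threads never update a
// common `sum` variable, so no synchronization is needed inside the loop.
double parallel_sum(const std::vector<double>& a, unsigned num_threads) {
    std::vector<double> partial(num_threads, 0.0);   // one slot per thread
    std::vector<std::thread> workers;
    const std::size_t n = a.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&a, &partial, t, chunk, n] {
            const std::size_t begin = std::min(n, std::size_t(t) * chunk);
            const std::size_t end   = std::min(n, begin + chunk);
            double local = 0.0;                      // private accumulator on this thread's stack
            for (std::size_t i = begin; i < end; ++i) local += a[i];
            partial[t] = local;                      // single write to this thread's own slot
        });
    }
    for (auto& w : workers) w.join();

    double sum = 0.0;                                // controlled, sequential combine
    for (double p : partial) sum += p;
    return sum;
}
```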

Incorrectly scoping variables is a common source of bugs when first learning shared memory programming. A loop index variable that you thought was independent per thread might actually be shared and produce corrupted results. Conversely, a variable that should be shared for coordination might mistakenly be declared private, leading to missed communication between threads.

It is helpful to systematically identify in your design which data structures must be shared and why. If a variable does not need to be shared, making it private simplifies reasoning and often improves performance. Shared variables should be limited to those that represent truly global state or that participate in well defined synchronization patterns.

Synchronization and Coordination Concepts

Any time multiple threads access shared variables where at least one access is a write, and there is no guarantee about the relative ordering of those accesses, there is a potential for incorrect behavior. Synchronization is the process of constraining the interleaving of operations across threads so that the program behaves as intended.

Several conceptual forms of synchronization commonly appear in shared memory programming. The most basic is mutual exclusion. Mutual exclusion ensures that only one thread at a time executes a particular critical section of code or accesses a particular data structure. This prevents data races but can also limit concurrency if overused or too coarse grained.
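
For illustration, a critical section can be expressed in C++ with a mutex; only one thread at a time executes the guarded update (the shared data structure here is just an example):

```cpp
#include <mutex>
#include <vector>

std::mutex results_mutex;
std::vector<double> results;   // shared data structure

// Sketch of mutual exclusion: the lock_guard holds the mutex for the duration
// of the push_back, so concurrent calls cannot corrupt the vector.
void record_result(double value) {
    std::lock_guard<std::mutex> lock(results_mutex);   // critical section begins
    results.push_back(value);
}                                                       // lock released here
```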

Another important form is barrier synchronization. At a barrier, each thread waits until all threads in a group have reached the same point. Only then do they all proceed. Barriers provide a way to implement phases of computation where all threads must complete one phase before starting the next. Barriers also have implications for memory visibility since they typically enforce ordering of memory operations around the synchronization point.
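
A minimal sketch of barrier-synchronized phases, assuming the C++20 std::barrier and a fixed group of threads:

```cpp
#include <barrier>
#include <thread>
#include <vector>

// Sketch of barrier-synchronized phases: no thread starts phase 2 of a step
// until every thread has finished phase 1 of that step.
void run_phases(unsigned num_threads, int num_steps) {
    std::barrier sync_point(num_threads);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&sync_point, num_steps] {
            for (int step = 0; step < num_steps; ++step) {
                // ... phase 1 work on shared data ...
                sync_point.arrive_and_wait();   // wait for all threads to finish phase 1
                // ... phase 2 work that depends on everyone's phase 1 results ...
                sync_point.arrive_and_wait();   // wait again before the next step
            }
        });
    }
    for (auto& w : workers) w.join();
}
```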

More subtle forms include read write coordination and atomic operations. Sometimes multiple threads may read shared data concurrently, but writes must be exclusive. Other times you need to update a shared counter or accumulate to a shared variable in a way that appears indivisible. Atomic operations provide such guarantees without requiring a full critical section. They can be more efficient but must be used carefully to avoid creating hotspots of contention.
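
For example, a shared counter can be updated with an atomic read-modify-write instead of a full critical section; in this sketch the relaxed ordering is sufficient because the counter is only read after all threads have been joined:

```cpp
#include <atomic>

std::atomic<long> hits{0};   // shared counter

// Sketch of an atomic update: fetch_add appears indivisible to other threads,
// so concurrent increments are never lost.
void count_hits(long n) {
    for (long i = 0; i < n; ++i) {
        hits.fetch_add(1, std::memory_order_relaxed);
    }
}
```

If many threads update the same atomic variable in a tight loop, the cache line holding it still becomes a contention hotspot, which is exactly the cost mentioned above.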

Later sections of the course will introduce concrete synchronization mechanisms such as locks, semaphores, atomic primitives, and the specific constructs provided by shared memory programming frameworks. For now, it is enough to understand that correct shared memory programs are built by combining parallel execution with explicit points of coordination. If you attempt to rely on pure interleaving without structured synchronization, you are likely to produce non deterministic and unreliable results.

Race Conditions and Determinism

In shared memory parallel programs, a race condition occurs when the final result depends on the relative timing of operations across threads. Most race conditions are unintended and lead to incorrect or unpredictable behavior. For example, if two threads increment a shared counter without coordination, the final value may be less than the number of increments, because both threads may read the same old value and write back the same new value, losing one increment.
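
The lost-update scenario can be reproduced with a deliberately unsynchronized counter. The sketch below is incorrect on purpose, since it contains a data race, and it typically prints a value below 200000:

```cpp
#include <iostream>
#include <thread>

int counter = 0;   // shared, deliberately unsynchronized: this is the bug

void increment_many() {
    for (int i = 0; i < 100000; ++i) {
        ++counter;   // read-modify-write that can interleave with the other thread
    }
}

int main() {
    std::thread t1(increment_many);
    std::thread t2(increment_many);
    t1.join();
    t2.join();
    std::cout << counter << "\n";   // often less than 200000; the exact value varies run to run
    return 0;
}
```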

Race conditions are closely related to a lack of synchronization. If there is no ordering constraint between two writes to the same memory location, or between a write and a read of that location, then the order in which they happen is effectively random, governed by complex hardware scheduling. A program that seemed to work in a test run may fail when run again with a different number of threads or on a different machine.

Determinism is the property that a program always produces the same observable result for a given input. Sequential programs are usually deterministic unless they explicitly involve randomness or external interactions. Shared memory parallel programs can be deterministic if they carefully structure their synchronization and data access so that, despite internal concurrency, the effect is as if operations occurred in a fixed order.

Not all useful parallel programs must be fully deterministic. Some numerical algorithms tolerate small variations in the order of floating point operations, which can lead to minor differences in roundoff error without affecting the overall result. However, you should avoid unintended sources of nondeterminism that reflect bugs rather than algorithmic choices.

A practical approach is to identify which parts of your computation must be strictly deterministic and ensure that all shared accesses in those regions are fully synchronized. In other parts, such as global summations where small numeric differences are acceptable, you may accept benign nondeterminism as a trade off for performance. Understanding where your program lies on this spectrum is part of designing effective shared memory algorithms.

Performance Characteristics of Shared Memory

Shared memory parallel programming can provide strong performance on modern multicore CPUs, especially for workloads that fit well in the memory hierarchy of a single node. However, achieving good performance is not automatic. Besides correctness, you must also think about how your use of shared data interacts with hardware features such as caches and memory bandwidth.

Since all threads operate in a single address space, they often share cache lines. If two threads repeatedly update different variables that happen to reside on the same cache line, the hardware may spend significant time transferring that line back and forth between cores. This phenomenon, frequently called false sharing, does not affect correctness but can drastically reduce performance. Careful arrangement of data, for example by padding structures or grouping variables by their access patterns, can mitigate this effect.
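
A common mitigation, sketched below, is to pad or align per-thread data so that each frequently written variable occupies its own cache line; the 64-byte figure is a typical but not universal cache line size.

```cpp
#include <cstddef>
#include <vector>

// Sketch of padding per-thread counters to avoid false sharing.
// alignas(64) assumes a 64-byte cache line, which is typical but not universal.
struct alignas(64) PaddedCounter {
    long value = 0;
    // The alignment forces each PaddedCounter onto its own cache line,
    // so threads updating different counters do not ping-pong a shared line.
};

std::vector<PaddedCounter> per_thread_counters;   // resized to the thread count during setup (not shown)

void add_to_counter(std::size_t thread_id, long amount) {
    per_thread_counters[thread_id].value += amount;   // no cache line shared with other threads
}
```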

Shared memory programs also contend for memory bandwidth. If many threads simultaneously stream through large arrays, they may saturate the memory subsystem, after which adding more threads yields little benefit. In some cases, performance may even decline when the overhead of managing threads and contention outweighs the gains. This is why understanding scaling behavior on a single node is a vital step in performance analysis.

Load balancing is another key concern. Even within a single node, some threads may finish their assigned work early while others are still busy. A good shared memory design distributes work such that all threads remain productive for most of the execution. Depending on your workload, you might use static or dynamic scheduling strategies. Static strategies assign work in advance, while dynamic strategies let threads pick up new tasks as they complete old ones.
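
A hedged sketch of a simple dynamic strategy: threads repeatedly claim the next block of iterations from a shared atomic counter, so threads that finish early naturally take on more work (the block size and names are illustrative).

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of dynamic scheduling: a shared atomic index hands out fixed-size
// blocks of iterations; threads that finish early simply fetch another block.
void process_dynamic(std::vector<double>& a, unsigned num_threads,
                     std::size_t block_size = 1024) {
    std::atomic<std::size_t> next{0};   // next unclaimed index
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&a, &next, block_size] {
            while (true) {
                const std::size_t begin = next.fetch_add(block_size);   // claim a block
                if (begin >= a.size()) break;                           // nothing left to do
                const std::size_t end = std::min(a.size(), begin + block_size);
                for (std::size_t i = begin; i < end; ++i) {
                    a[i] = a[i] * a[i];   // stand-in for per-element work of varying cost
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Larger blocks reduce the number of atomic operations but make the balance coarser; smaller blocks improve balance at the cost of more contention on the shared index.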

Finally, synchronization itself has performance costs. Every lock acquisition, barrier wait, or atomic operation introduces overhead and potential contention. While necessary for correctness, excessive or poorly placed synchronization can serialize execution and limit speedup. An important part of shared memory performance tuning is reducing the frequency and granularity of synchronization without violating correctness.

Use the minimum synchronization needed for correctness.
Unnecessary locking or frequent barriers can eliminate the benefits of parallelism on shared memory systems.

Choosing Shared Memory in HPC Contexts

In HPC practice, shared memory parallelism is most effective within a node, where threads can exploit the fast communication provided by caches and on chip interconnects. It is particularly attractive for codes that are already written in a sequential style and rely heavily on large arrays or dense numerical kernels.

When deciding whether to use a shared memory approach, you should consider the following aspects. First, does the target workload fit into the memory capacity of a single node or a small number of nodes? If the data and computation naturally live within a node, shared memory programming can greatly simplify development. If the problem size requires many nodes, you will likely need a distributed memory model as well, and shared memory will become a component of a hybrid strategy.

Second, examine the algorithm’s natural decomposition. If most of the work can be expressed as parallel loops over shared data structures without complex cross thread dependencies, shared memory is appealing. Conversely, if your algorithm inherently involves many independent tasks with minimal shared state, a message passing approach may provide clearer boundaries.

Third, consider maintainability and portability. Shared memory parallel code written using widely adopted models can usually run on any multicore system. This is particularly useful during development and testing, where you might prototype on a laptop or workstation using the same shared memory concepts you will later apply on a compute node.

Within an HPC workflow, a common pattern is to first parallelize a code within a single node using shared memory, validate correctness, and then extend the parallelism across nodes. This layered approach allows you to debug many issues in a simpler environment before dealing with the additional complexity of distributed memory communication.

Summary

Shared memory parallel programming revolves around multiple threads executing concurrently within a single address space. It offers a natural way to parallelize computational kernels on multicore CPUs by sharing data structures instead of sending explicit messages. At the same time, it introduces new conceptual challenges concerning synchronization, memory visibility, race conditions, and performance effects such as contention and false sharing.

The key ideas are to clearly distinguish between shared and private data, to coordinate access to shared variables through well understood synchronization patterns, and to be aware of the underlying memory system. When applied thoughtfully, shared memory programming provides an essential tool for exploiting node level parallelism in modern HPC systems and forms a foundation for more advanced models and hybrid approaches covered later in the course.
