Goals of this Chapter
In this chapter you will learn:
- What “shared-memory” means in the context of parallel programming
- How shared-memory parallelism differs from distributed-memory approaches
- The basic mental model needed before using tools like OpenMP
- Typical use cases, strengths, and limitations of shared-memory programming
- How shared-memory programming fits into the overall HPC picture
Details of specific programming models (like OpenMP syntax, thread directives, etc.) are covered in later subsections.
What Is Shared-Memory Parallel Programming?
In shared-memory parallel programming, multiple threads (or lightweight execution units) run on the same physical node and all access a single, shared address space. In practice:
- There is one main memory (RAM) visible to all threads.
- Each thread can read and write any location in that memory (subject to permissions).
- Communication happens implicitly through loads and stores to this shared memory, not by explicit message passing.
Conceptually, you have:
- One process
- Many threads inside that process
- All sharing the same variables and data structures (unless designated as private to a thread)
This is in contrast to distributed-memory models, where each process has its own memory and data must be exchanged explicitly via messages.
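As a first, minimal sketch (using POSIX threads here purely for illustration; the APIs used in this course are introduced later), the program below creates two worker threads that each fill half of a shared array. The main thread then reads the complete result directly from memory; no messages are ever exchanged.

```c
#include <pthread.h>
#include <stdio.h>

#define N 8

/* One array in the process's memory, visible to every thread. */
static double data[N];

/* Each worker fills its half of the shared array. */
static void *fill_half(void *arg) {
    int half = *(int *)arg;              /* 0 = lower half, 1 = upper half */
    for (int i = half * N / 2; i < (half + 1) * N / 2; i++)
        data[i] = 2.0 * i;               /* plain stores into shared memory */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = {0, 1};

    for (int k = 0; k < 2; k++)
        pthread_create(&t[k], NULL, fill_half, &id[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL);        /* wait for both workers */

    /* The main thread sees the workers' writes directly. */
    for (int i = 0; i < N; i++)
        printf("%g ", data[i]);
    printf("\n");
    return 0;
}
```

Compile with threading support enabled, e.g. `cc -pthread shared_array.c`.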
Typical Hardware for Shared Memory
Shared-memory parallel programming typically targets:
- Multi-core CPUs: A single CPU package with many cores, each capable of running one or more hardware threads.
- Multi-socket nodes: Several CPU packages on the same motherboard, sharing system memory via an interconnect (e.g., NUMA architectures).
- Accelerators with shared memory among cores: Although GPUs have their own specialized programming models, many of the same shared-memory principles apply.
As core counts increase on modern processors, shared-memory parallelism is the natural way to exploit all cores within a single node.
Shared vs. Distributed Memory: Conceptual Contrast
You will see both shared-memory and distributed-memory models in HPC. At a high level:
- Shared-memory
  - Threads share variables.
  - Communication = memory loads/stores.
  - Simpler to write for moderate scales.
  - Limited by memory bandwidth and the maximum number of cores in a node.
- Distributed-memory
  - Processes have separate memory.
  - Communication = explicit messages (e.g., MPI).
  - Scales to thousands of nodes.
  - More complex data distribution and communication logic.
In practice, many real HPC applications use both models together (hybrid programming), but this chapter focuses only on the intra-node shared-memory aspect.
Basic Concepts and Mental Model
Before working with specific APIs, it is useful to adopt a mental model for shared-memory parallel programs.
Threads as Workers
Think of a shared-memory program as:
- A master thread that starts the program.
- A team of worker threads created to execute parallel regions of work.
- All threads running largely the same code but operating on different parts of the data.
Unlike separate processes:
- Threads are very lightweight.
- Thread creation and synchronization are usually cheaper than process creation and inter-process communication.
- Threads share file descriptors, global variables, and heap memory within the process.
Shared Address Space
In the shared-memory model:
- A variable in memory can be read or written by any thread.
- Threads can have:
- Shared data: visible and modifiable by all threads.
- Private data: each thread has its own copy (e.g., local variables on its stack, or variables explicitly marked as private in the programming model).
A core part of programming in this model is deciding which data is shared and which is private, to ensure correctness and performance.
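As a brief preview of how this looks in a directive-based model (OpenMP syntax is introduced properly in its own chapter), the hedged sketch below marks an array as shared and a scratch variable as private, so each thread gets its own copy of the scratch value while all threads write into the one shared array:

```c
#include <stdio.h>

#define N 1000

int main(void) {
    double a[N];
    double tmp;           /* made private below: one copy per thread */

    /* 'a' is shared by all threads; every thread has its own 'tmp'. */
    #pragma omp parallel for shared(a) private(tmp)
    for (int i = 0; i < N; i++) {
        tmp = i * 0.5;    /* safe: each thread writes only its own 'tmp' */
        a[i] = tmp * tmp; /* safe: threads write disjoint elements of 'a' */
    }

    printf("a[N-1] = %g\n", a[N - 1]);
    return 0;
}
```

Variables declared inside a parallel region, as well as the loop index itself, are private by default; the clauses simply make the intent explicit.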
Parallel Regions and Work Sharing
Most shared-memory models follow a pattern along these lines:
- Start as a single thread (serial code).
- Encounter a parallel region declaration.
- Create multiple threads to execute code within that region, possibly splitting the work among them.
- Join back into a single thread at the end of the region.
Pseudo-structure:
```
// serial code here
// begin parallel region
create a team of threads
each thread executes code in the region
join threads
// back to serial code
```
How exactly work is split (“work sharing”) is defined by constructs in the programming model (e.g., for/parallel for in OpenMP) and is covered later.
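To make this structure concrete, here is a hedged OpenMP sketch of the same fork-join shape (the directives and runtime calls are explained in later sections); each thread asks for its ID and team size and then processes its own contiguous block of the data:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];

    /* serial code here */

    #pragma omp parallel          /* begin parallel region: a team of threads is created */
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();

        /* crude manual work split: thread 'tid' handles one contiguous block */
        int chunk = (N + nthreads - 1) / nthreads;
        int lo = tid * chunk;
        int hi = lo + chunk > N ? N : lo + chunk;

        for (int i = lo; i < hi; i++)
            x[i] = i * 0.001;
    }                             /* implicit join: back to a single thread */

    /* back to serial code */
    printf("x[N-1] = %g\n", x[N - 1]);
    return 0;
}
```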
Synchronization and Ordering
Because threads can access shared data concurrently, you must:
- Control when certain operations happen relative to each other.
- Prevent multiple threads from updating data in inconsistent ways.
Conceptually, this is done using:
- Barriers: all threads reach a point before anyone proceeds.
- Locks/mutexes: only one thread can execute a critical section at a time.
- Atomic operations: small operations that appear indivisible.
Synchronization details, race conditions, and correctness aspects are covered in the later subsections, but recognizing their necessity is essential before writing any shared-memory code.
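The hedged sketch below previews two of these mechanisms in OpenMP form: each thread finds the maximum of its part of an array, a critical section ensures only one thread at a time updates the shared maximum, and a barrier makes every thread wait until the final value is known before it is used in a second phase.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    double global_max = 0.0;                 /* shared by all threads */

    for (int i = 0; i < N; i++)              /* serial setup */
        x[i] = (i * 37) % 1000;

    #pragma omp parallel
    {
        double local_max = 0.0;              /* private to each thread */

        #pragma omp for
        for (int i = 0; i < N; i++)
            if (x[i] > local_max) local_max = x[i];

        /* Critical section: only one thread updates the shared maximum at a time. */
        #pragma omp critical
        if (local_max > global_max) global_max = local_max;

        /* Barrier: no thread proceeds until every thread has contributed. */
        #pragma omp barrier

        /* Now it is safe for all threads to read the final value. */
        #pragma omp for
        for (int i = 0; i < N; i++)
            x[i] /= global_max;
    }

    printf("max = %g\n", global_max);
    return 0;
}
```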
Advantages of Shared-Memory Parallel Programming
Shared-memory models are popular in HPC for several reasons:
- Simpler communication pattern
  - No need to manually send and receive messages for data that is naturally shared.
  - Easier to parallelize loops and data-parallel operations on arrays residing in one memory space.
- Incremental parallelization
  - You can often parallelize existing serial code incrementally, focusing on hotspots such as loops or compute-intensive regions.
  - Many frameworks let you add compiler pragmas or directives to existing code (see the sketch after this list).
- Fine-grained parallelism
  - Threads can share complex data structures (graphs, trees, large objects) without serialization or marshalling.
  - Good for workloads where tasks frequently access common data.
- Lower overhead than inter-process communication
  - Thread creation and memory sharing are generally more efficient than creating processes and exchanging messages.
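For instance, a hedged sketch of such an incremental change: an existing serial AXPY-style loop gains node-level parallelism from a single added directive, and compilers without OpenMP support simply ignore the pragma and run the loop serially.

```c
/* y = a*x + y. Adding the one directive below is often the only change
   needed to spread the iterations across all cores of a node; without
   OpenMP support the pragma is ignored and the loop runs serially. */
void axpy(int n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```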
Within a single node, shared-memory programming is usually the most natural way to exploit all the cores.
Limitations and Challenges
Shared-memory parallelism also has important limitations that influence how and when you use it.
Scalability Limits
The total parallelism you can exploit is bounded by:
- The number of hardware cores or hardware threads per node.
- The memory bandwidth and memory hierarchy (caches, NUMA domains).
Beyond a certain point:
- Adding more threads may give diminishing returns or even slowdowns.
- Contention for shared resources (e.g., memory or specific data structures) becomes a bottleneck.
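As a rough, purely illustrative calculation (the numbers are assumptions, not measurements): if a node sustains about 200 GB/s of memory bandwidth and a streaming kernel moves roughly 24 bytes per updated element (two reads and one write of a double), the kernel cannot exceed about 200/24 ≈ 8 billion updates per second, no matter how many threads run. Once a handful of threads saturate that bandwidth, adding more threads cannot make the kernel faster.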
These issues motivate using distributed-memory or hybrid approaches for very large systems.
Contention and False Sharing
Shared-memory codes can suffer from:
- Contention: many threads trying to access the same data or lock.
- False sharing: threads update different variables that happen to reside in the same cache line, leading to unnecessary cache coherence traffic.
Such problems can significantly degrade performance even when algorithms are otherwise well designed.
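A hedged sketch of false sharing and one common remedy: each thread below increments only its own counter, yet the plain array packs the counters into the same cache line, so the line ping-pongs between cores; padding each counter to a full cache line (64 bytes is assumed here) restores independent updates.

```c
#include <omp.h>

#define NTHREADS 8

/* Problematic layout: all counters share a cache line or two, so every
   increment invalidates the line in the other cores' caches. */
long counters_bad[NTHREADS];

/* Padded layout: each counter occupies its own (assumed 64-byte) cache line. */
struct padded { long value; char pad[64 - sizeof(long)]; };
struct padded counters_good[NTHREADS];

void count_events(long n) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < n; i++) {
            counters_bad[tid]++;        /* correct, but suffers false sharing */
            counters_good[tid].value++; /* same logic, one cache line per thread */
        }
    }
}
```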
Correctness Issues
Because threads share data and run concurrently:
- Ordering of operations is not always obvious.
- Bugs like race conditions and deadlocks are common in poorly synchronized codes.
- These bugs are often intermittent and hard to reproduce.
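As a hedged sketch of the classic race condition: every thread below increments the same shared counter with no synchronization, the read-modify-write sequences interleave, and updates are lost, so the printed value is typically well below 1000000 and changes from run to run.

```c
#include <stdio.h>

int main(void) {
    long counter = 0;

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++)
        counter++;            /* unsynchronized read-modify-write: a data race */

    /* Expected 1000000, but the actual value is unpredictable. */
    printf("counter = %ld\n", counter);
    return 0;
}
```

Making the update atomic, or accumulating into per-thread variables as in the examples above, removes the race; the correctness subsections return to this in detail.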
Later subsections on synchronization, race conditions, and debugging will address this in more detail.
Common Use Cases in HPC
Shared-memory programming is widely used in:
- Numerical kernels on a single node
  - Parallelizing loops over array indices, grid points, particles, or matrix rows.
- Multithreaded libraries
  - Linear algebra libraries that internally use threads to exploit multiple cores.
- Task-based workloads
  - Parallelizing independent tasks that share a common data structure or memory pool.
- Pre- and post-processing
  - Data reading, parsing, and analysis steps that run on a single node but benefit from multiple cores.
A typical pattern is:
- Use shared-memory parallelism to fully exploit a node’s cores.
- Use distributed-memory parallelism to scale across multiple nodes.
Programming Models for Shared-Memory
There are several ways to program shared memory; in this course we focus on the models most relevant for HPC.
Common approaches include:
- Compiler-directive-based models
  - Languages extended with directives that tell the compiler where and how to parallelize (OpenMP is the main example and gets its own chapter).
  - Often the easiest starting point for newcomers.
- Thread libraries
  - Lower-level APIs (e.g., POSIX threads in C/C++) that provide direct control over thread creation and synchronization.
  - More flexible but more verbose and error-prone.
- Language-integrated concurrency
  - High-level constructs provided directly by the language (e.g., C++ threads and parallel algorithms; concurrency features in other languages).
  - Often built on top of lower-level threading primitives.
This chapter introduces the concepts common to all these models; later sections concentrate on OpenMP, which is the standard shared-memory model in HPC.
Basic Design Patterns
Several conceptual patterns recur in shared-memory parallel code:
Parallel Loops
Split iterations of a loop among threads:
- Each thread processes a subset of array indices or data elements.
- Data dependencies and shared-variable updates must be analyzed and handled correctly.
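A hedged example of such an analysis: summing an array looks like a trivial parallel loop, but every iteration updates the same accumulator, which is a dependency between iterations. A common remedy, sketched below, is to let each thread accumulate into a private variable and merge the partial results once at the end (OpenMP also offers a reduction clause for exactly this pattern, covered later).

```c
double array_sum(const double *x, long n) {
    double sum = 0.0;                   /* shared final result */

    #pragma omp parallel
    {
        double partial = 0.0;           /* private accumulator per thread */

        #pragma omp for
        for (long i = 0; i < n; i++)
            partial += x[i];            /* no shared updates inside the loop */

        #pragma omp critical            /* one combine step per thread */
        sum += partial;
    }
    return sum;
}
```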
Parallel Sections / Tasks
Different threads run different independent tasks:
- Useful when you have heterogeneous work (e.g., one thread handles I/O, others compute).
- More flexible than simple loop parallelism but requires more careful work scheduling.
Master-Worker Pattern
One thread acts as a coordinator:
- Distributes tasks or data chunks to worker threads.
- Aggregates results at the end.
This is a conceptual pattern; specific programming models provide different ways to implement it (e.g., task constructs, work queues).
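A hedged sketch of the idea with POSIX threads: the "work queue" is reduced to a shared index into a set of numbered tasks, each worker claims the next unclaimed task under a lock, and the main thread plays the coordinator by launching the team and aggregating the results (process_task is a hypothetical stand-in for real work).

```c
#include <pthread.h>
#include <stdio.h>

#define NTASKS   100
#define NWORKERS 4

static int next_task = 0;                 /* shared "queue": next unclaimed task */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static double results[NTASKS];            /* shared result array */

/* Hypothetical stand-in for real work on task t. */
static double process_task(int t) { return t * 1.5; }

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);  /* only one thread claims a task at a time */
        int t = next_task++;
        pthread_mutex_unlock(&queue_lock);
        if (t >= NTASKS) break;           /* queue exhausted */
        results[t] = process_task(t);     /* disjoint writes: no further locking needed */
    }
    return NULL;
}

int main(void) {
    pthread_t workers[NWORKERS];

    /* The main thread acts as the coordinator: it launches the team ... */
    for (int k = 0; k < NWORKERS; k++)
        pthread_create(&workers[k], NULL, worker, NULL);
    for (int k = 0; k < NWORKERS; k++)
        pthread_join(workers[k], NULL);

    /* ... and aggregates the results at the end. */
    double sum = 0.0;
    for (int t = 0; t < NTASKS; t++)
        sum += results[t];
    printf("sum = %g\n", sum);
    return 0;
}
```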
Performance Considerations at a High Level
Even before diving into the details, it is important to keep some performance principles in mind:
- Exploit data locality
  - Try to have threads work on data that is nearby in memory, to benefit from caches (see the sketch after this list).
  - Avoid patterns where threads repeatedly jump around in memory.
- Minimize synchronization
  - Synchronization operations introduce overhead and can force threads to wait.
  - Design algorithms to reduce the need for locks and barriers where possible.
- Balance the workload
  - Aim for each thread to do roughly the same amount of work.
  - Imbalanced work leads to idle threads and wasted cores.
- Be aware of the memory system
  - Shared memory does not mean “free communication”; all threads contend for the same memory bandwidth.
  - NUMA effects on multi-socket systems can hurt performance if data placement is not considered.
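As a hedged illustration of the locality point above, the two functions below split the same loop among nthreads threads in different ways. The block split keeps each thread inside one contiguous region of memory and its own cache lines; the strided split makes neighboring threads touch interleaved elements, which tends to generate extra cache and coherence traffic.

```c
/* Block distribution: thread 'tid' works on one contiguous range.
   Good locality: consecutive accesses stay within the same cache lines. */
void scale_block(double *x, long n, int tid, int nthreads, double a) {
    long chunk = (n + nthreads - 1) / nthreads;
    long lo = tid * chunk;
    long hi = lo + chunk < n ? lo + chunk : n;
    for (long i = lo; i < hi; i++)
        x[i] *= a;
}

/* Strided (round-robin) distribution: thread 'tid' takes every nthreads-th
   element. Neighboring threads touch interleaved elements, so they share
   cache lines and generate unnecessary coherence traffic. */
void scale_strided(double *x, long n, int tid, int nthreads, double a) {
    for (long i = tid; i < n; i += nthreads)
        x[i] *= a;
}
```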
These themes recur throughout the subsections on parallel regions, work-sharing, and performance in shared-memory codes.
How This Fits into the Overall Course
Within the larger HPC context:
- Shared-memory parallel programming is the foundation for node-level parallelism.
- Distributed-memory programming extends parallelism across nodes.
- Hybrid and accelerator programming combine these ideas for modern, heterogeneous systems.
In the following subsections, you will apply the conceptual model from this chapter to:
- Write your first shared-memory parallel programs (using OpenMP).
- Understand how threads are created and managed.
- Use work-sharing constructs.
- Deal with synchronization and race conditions.
- Reason about performance and scaling on a single node.