Concept of a Thread in Shared Memory
In shared memory parallel programming, a thread is a lightweight execution context that shares the same address space with other threads in the process. All threads in a process can see the same global variables, static variables, and dynamically allocated memory. Each thread has its own program counter, registers, and stack, but they cooperate on the same data structures in memory.
From the viewpoint of the operating system, a process usually represents an application instance, and threads are concurrent flows of control inside that process. In shared memory programming you typically create multiple threads within one process to exploit multiple cores on a single node.
A thread is an independent flow of control within a process that shares the same address space with other threads of that process.
This sharing of memory is what makes shared memory programming attractive and also what makes thread management and correctness nontrivial.
Creating and Destroying Threads
In a shared memory environment you rarely create threads directly with low level system calls. Instead, you normally use a threading library or a parallel programming model such as OpenMP. OpenMP has already been introduced in the parent chapter, so here we focus on how threads are managed conceptually and what that means in an OpenMP context.
When a program starts, it begins as a single thread, often called the master thread. In OpenMP, a parallel region instructs the runtime to create a team of threads. The master thread becomes part of this team, and additional worker threads are created as needed. When the parallel region ends, the worker threads are either destroyed or kept in a pool for future reuse, depending on the implementation.
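A minimal sketch of this life cycle in C, assuming a compiler with OpenMP support (for example gcc -fopenmp), looks as follows; the region boundaries are exactly where the runtime forms and disbands the team.

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("Before the parallel region: only the master thread runs.\n");

    #pragma omp parallel          /* the runtime forms a team of threads here */
    {
        /* every thread in the team executes this block */
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                             /* implicit barrier; workers go idle or back to the pool */

    printf("After the parallel region: the master thread continues alone.\n");
    return 0;
}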
Conceptually, thread creation has a cost. Each thread needs a stack, bookkeeping data structures, and initialization. On the destruction side, the runtime must clean up these resources. Creating and destroying threads repeatedly in short segments of code may significantly reduce performance. This is one reason why OpenMP implementations often use a pool of threads that are created once and reused.
In portable OpenMP codes you do not control thread creation directly, but you influence it through environment variables and runtime routines that set the size of teams. The runtime then decides when to allocate the necessary resources.
Thread Lifetimes and the Master Thread
Within an OpenMP program, the master thread is the one that begins execution before any parallel region is entered. This thread is in charge of orchestrating the work sharing and synchronization. In every parallel region, the master thread is part of the team and may perform both coordination and regular work.
The lifetime of a thread consists of three main phases. A thread is created by the runtime, it participates in one or more parallel regions, and then it eventually terminates when the runtime decides it is no longer needed or at program exit. The exact mapping of these phases to operating system threads is implementation specific, but the programming model guarantees that the logical team of threads behaves as specified.
Within a parallel region, each thread has an identity, typically an integer from 0 to $N-1$, where $N$ is the total number of threads in the team. The master thread has ID 0. These thread IDs are used for tasks such as assigning work, performing I/O from a single thread, or debugging.
Controlling the Number of Threads
One of the core aspects of thread management is deciding how many threads to use. Using too few threads underutilizes the hardware. Using too many can create oversubscription, where more threads attempt to run than there are hardware cores, which often hurts performance.
In OpenMP, the number of threads in a parallel region is influenced by several mechanisms. The environment variable OMP_NUM_THREADS sets the default team size. The runtime routine omp_set_num_threads(int n) can set the desired number programmatically before entering a parallel region. The num_threads clause on a parallel construct can override these defaults for a specific region.
You rarely want to guess the number of threads. Typically, you match the number of threads to the number of physical cores available on the node. You can query this using system tools at the command line or by calling omp_get_num_procs() at runtime. In more complex applications, you might adjust the thread count at different phases of the program, for example using fewer threads for memory bound phases and more for compute bound phases.
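The sketch below, assuming a C compiler with OpenMP support, shows the programmatic side of these mechanisms: querying the available processors, setting a default team size, and overriding it for one region with num_threads.

#include <omp.h>
#include <stdio.h>

int main(void) {
    int ncores = omp_get_num_procs();   /* processors visible to the OpenMP runtime */
    omp_set_num_threads(ncores);        /* default team size for subsequent regions */

    #pragma omp parallel
    {
        #pragma omp single
        printf("Default region uses %d threads\n", omp_get_num_threads());
    }

    /* the num_threads clause overrides the default for this one region */
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        printf("This region uses %d threads\n", omp_get_num_threads());
    }
    return 0;
}

The same default can of course be set from the shell with OMP_NUM_THREADS before the program starts, which is often preferable because it requires no recompilation.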
Rule of thumb: For simple OpenMP programs, set the number of threads equal to the number of physical cores you intend to use on the node to avoid oversubscription.
Modern OpenMP also supports nested parallelism and dynamic adjustment of threads, but these features should be used carefully because they can complicate performance behavior.
Thread Affinity and Core Binding
Thread affinity refers to controlling which cores a given thread runs on. Operating systems usually schedule threads dynamically across available cores, but in high performance computing this can introduce performance variability and cache inefficiencies. By binding each thread to a specific core, you can reduce cache misses and minimize contention for shared resources.
The core idea is that a thread that repeatedly works on the same data benefits from staying on the same core, because that core's cache keeps the data it needs. Migrating the thread to another core forces that data to be reloaded into the new core's cache, which costs time.
OpenMP exposes affinity control primarily through environment variables such as OMP_PROC_BIND and OMP_PLACES. The exact syntax depends on the OpenMP version and the implementation. For example, you can set OMP_PROC_BIND=close to keep threads close to each other in terms of hardware topology, or OMP_PROC_BIND=spread to spread threads across cores or sockets.
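As an illustration, a typical invocation with explicit affinity settings from the shell might look like the sketch below; the program name is a placeholder, and the best binding choice depends on the node topology and the OpenMP runtime in use.

export OMP_NUM_THREADS=8
export OMP_PLACES=cores        # one place per core
export OMP_PROC_BIND=close     # pack the threads onto neighboring places
./my_program                   # placeholder for your executable

# spreading the threads across sockets instead:
export OMP_PROC_BIND=spread
./my_program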
On clusters, the job scheduler and mpirun or srun can also control CPU binding at the process level. It is important to coordinate process placement with thread affinity. If several processes are pinned to overlapping cores and each process uses many threads, they may interfere with each other.
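On a Slurm based cluster, for example, a hybrid MPI plus OpenMP job is often launched so that each process owns a disjoint set of cores and the OpenMP team fits inside it. The lines below are only a sketch; the exact flags and their defaults vary between sites and scheduler versions.

# 4 MPI processes, 8 cores reserved per process, processes bound to their cores
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
srun --ntasks=4 --cpus-per-task=8 --cpu-bind=cores ./hybrid_program   # placeholder executable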
Careful thread affinity settings are particularly important on nodes with multiple sockets or complex NUMA layouts, as they interact with memory locality, which is treated in more detail in other chapters.
Stack Size and Thread Resources
Each thread needs a stack for local variables and for storing return addresses and function call frames. The default stack size per thread can vary widely between systems and compiler runtimes. In parallel codes with deep recursion or with large local arrays, the default stack size may be too small, which can lead to runtime errors or segmentation faults when a thread’s stack overflows.
OpenMP provides a way to control the stack size of worker threads through the OMP_STACKSIZE environment variable. Some systems also use system specific environment variables or shell commands such as ulimit to control stack limits. When using large local arrays inside parallel regions, it is generally safer to allocate them dynamically on the heap rather than on the stack, to avoid hitting these limits.
Avoid very large local arrays inside threaded regions. Prefer dynamic allocation or shared arrays to prevent per thread stack overflow.
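If large per thread storage is unavoidable, OMP_STACKSIZE can be raised, for example with export OMP_STACKSIZE=64M, but moving the storage to the heap is usually the more robust fix. The sketch below contrasts the two approaches; the array size is chosen only to exceed a typical default stack limit.

#include <omp.h>
#include <stdlib.h>

#define N (4 * 1024 * 1024)   /* about 32 MB of doubles, larger than a typical default stack */

void risky(void) {
    #pragma omp parallel
    {
        double work[N];        /* private array on each thread's stack: may overflow */
        work[0] = 0.0;
        /* ... computation on work ... */
    }
}

void safer(void) {
    #pragma omp parallel
    {
        double *work = malloc(N * sizeof *work);   /* per thread buffer on the heap */
        if (work != NULL) {
            work[0] = 0.0;
            /* ... computation on work ... */
            free(work);
        }
    }
}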
In addition to the stack, each thread consumes kernel resources such as thread control blocks and scheduling structures. Excessive numbers of threads can exhaust these resources or increase context switch overhead. This is another reason to keep thread counts reasonable and matched to hardware capabilities.
Thread Identification and Introspection
Managing threads inside a program often requires that each thread knows who it is and what other threads exist. In OpenMP, threads can query their identity within the team using omp_get_thread_num() and the team size using omp_get_num_threads().
This information is useful for patterns like manual work distribution or debug logging. For instance, a common pattern is to let only thread 0 print diagnostic messages, while other threads perform quiet computation. Another pattern is to use the thread ID as an index into per thread arrays that store partial results.
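The sketch below combines both routines in the per thread partial results pattern; the loop bound and the workload are illustrative only.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int nthreads = omp_get_max_threads();
    double *partial = calloc((size_t)nthreads, sizeof *partial);  /* one slot per thread */

    #pragma omp parallel num_threads(nthreads)
    {
        int id = omp_get_thread_num();
        /* each thread accumulates into its own slot, so no synchronization is needed */
        for (int i = id; i < 1000; i += omp_get_num_threads())
            partial[id] += (double)i;

        if (id == 0)
            printf("Team size inside the region: %d\n", omp_get_num_threads());
    }

    double total = 0.0;
    for (int i = 0; i < nthreads; i++)
        total += partial[i];
    printf("Sum = %.0f\n", total);   /* 0 + 1 + ... + 999 = 499500 */

    free(partial);
    return 0;
}

Note that neighboring slots of partial may share a cache line, so frequent updates can suffer from false sharing; the OpenMP reduction clause avoids both this issue and the manual bookkeeping.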
Although manual control of work distribution by thread ID is sometimes useful, in many cases it is better to use higher level OpenMP work sharing constructs and let the runtime manage the distribution. Tying too much logic to specific thread IDs can reduce portability and make it harder to tune performance when the thread count changes.
Thread Pools and Overheads
Creating and destroying threads is not free. The cost has two parts. There is a one time cost for creating the thread and assigning it a stack and other resources. There is also a recurring overhead when entering and leaving parallel regions and synchronizing threads at barriers.
To reduce overhead, many runtimes maintain a pool of worker threads that are created during the first parallel region and then kept alive for the lifetime of the program. Subsequent parallel regions reuse this pool, which avoids repeated thread creation. The master thread wakes dormant worker threads when they are needed and puts them back to sleep afterward.
Despite thread pooling, frequent short parallel regions can still cause overhead because of the need to initialize and synchronize threads and to dispatch small amounts of work. To manage this, you can try to structure your code so that each parallel region performs a substantial amount of work, or you can use constructs that fuse multiple operations inside a single region.
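A hedged sketch of this fusing idea in C: the two versions compute the same result, but the second forms the thread team only once for both loops.

#include <omp.h>

#define N 1000000

void costly(double *a, double *b, double *c) {
    /* two regions: the team is formed and synchronized twice */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] = 2.0 * b[i];

    #pragma omp parallel for
    for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
}

void fused(double *a, double *b, double *c) {
    /* one region: the same team runs both loops */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++) a[i] = 2.0 * b[i];

        #pragma omp for     /* the implicit barrier after the first loop is kept, since this loop reads a */
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }
}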
The cost of thread management becomes more visible when the amount of computation per thread is small. In such cases, the overhead may dominate, and a serial execution may even be faster. This trade off is at the heart of deciding how to structure threaded code for good performance.
Nested Threads and Hierarchical Parallelism
Thread management becomes more complex when nested parallelism is used. Nested parallelism means that a thread that is already inside a parallel region enters another parallel region, thereby creating an additional level of threads.
OpenMP can enable or disable nested parallelism via runtime routines or environment variables. When disabled, nested parallel regions are usually serialized, meaning they execute with a single thread. When enabled, each level of nesting can create its own team.
Nested threads can be useful in applications that have natural hierarchical structure, such as domain decomposition inside domain decomposition. However, they can also easily create far more threads than the hardware can support efficiently. For example, if a program spawns 8 threads, and each of them then creates 8 more, the program may attempt to run 64 threads on a system that has only a small number of cores.
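The sketch below makes the multiplication concrete; it assumes nesting is allowed, here by calling omp_set_max_active_levels(2) (setting the OMP_MAX_ACTIVE_LEVELS environment variable has the same effect).

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_max_active_levels(2);        /* allow two levels of nested teams */

    #pragma omp parallel num_threads(8)  /* outer team: 8 threads */
    {
        #pragma omp parallel num_threads(8)   /* each outer thread creates 8 more */
        {
            #pragma omp single
            printf("Inner team of %d threads at nesting level %d\n",
                   omp_get_num_threads(), omp_get_level());
        }
    }
    /* up to 64 threads may exist at once, regardless of the core count */
    return 0;
}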
For most introductory HPC codes, it is advisable to avoid nested parallelism until there is a clear performance reason and a good understanding of the underlying hardware and scheduling behavior.
Interaction with the Operating System Scheduler
Although the threading runtime presents a high level abstraction to you, operating system schedulers ultimately decide how threads are mapped to cores and when they run. The OS uses policies and priorities to distribute CPU time among threads of all processes in the system.
In a typical HPC environment, you run on a compute node that is reserved for your job. This reduces interference from other users, but system daemons and kernel threads still compete for some CPU time. Job schedulers can pin processes and threads to cores or CPUs and can isolate cores for particular tasks, but the final scheduling decisions still lie with the OS.
The OS may preempt threads and switch between them at a high frequency. Context switches have a cost, especially when caches must be reloaded. Good thread management in HPC aims to minimize unnecessary context switches by limiting oversubscription and by using affinity controls so that the scheduler keeps threads on consistent cores.
Basic Patterns for Managing Work Among Threads
While detailed work sharing constructs are treated elsewhere, it is useful here to see how thread management interacts with work distribution at a conceptual level. When multiple threads share a memory space, the central challenge is to divide work so that all threads have something useful to do without excessive coordination overhead.
In a static work distribution pattern, the total amount of work is divided evenly among threads at the start. Each thread then performs its assigned chunk independently. This method has low scheduling overhead but can suffer from load imbalance if some chunks take longer than others.
In a dynamic pattern, threads request work units from a central scheduler when they become idle. This balances load better at the cost of more coordination. The runtime must also maintain data structures that map work units to threads. OpenMP offers scheduling policies that implement these patterns at the loop level.
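At the loop level these two patterns correspond to the schedule clause, as in the sketch below; the chunk size of 64 and the dummy workload are illustrative only.

#include <omp.h>

#define N 100000

/* dummy workload whose cost varies with i, creating load imbalance */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i % 1000; k++)
        s += k * 1.0e-6;
    return s;
}

void run(double *result) {
    /* static: iterations are divided into equal chunks once, minimal scheduling overhead */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        result[i] = work(i);

    /* dynamic: an idle thread grabs the next chunk of 64 iterations on demand */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N; i++)
        result[i] = work(i);
}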
Management of threads therefore includes not only their creation and binding but also how they are fed with work. A good thread management strategy keeps all threads busy, minimizes idle time at synchronization points, and avoids unnecessary data movement.
Debugging and Observing Threads
Understanding what your threads are doing is crucial for correctness and performance. Although deeper debugging techniques are covered in another chapter, there are some thread management related practices that are useful early on.
First, you can insert diagnostic prints that include the thread ID, such as "Thread 0 entering region" or "Thread 3 processing chunk 5". These messages help you verify that the expected number of threads are active and that work is distributed as you intended. However, you should remember that interleaved output from multiple threads may be hard to read and may change from run to run because of nondeterministic scheduling.
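A minimal sketch of such a diagnostic print, serialized with a critical section so that individual messages stay intact (their order still varies between runs):

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp critical
        printf("Thread %d entering region, team size %d\n",
               id, omp_get_num_threads());
    }
    return 0;
}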
Second, many tools and profilers provide thread aware views that show which thread runs on which core and how much time each thread spends in computation, waiting, or synchronization. These tools rely on the correct use of thread management APIs and give insight into problems like imbalance, oversubscription, or poor affinity.
Finally, you should become comfortable with environment variables and runtime routines that control threading. Being able to quickly change the number of threads, stack size, and binding policy and then observe the effect is an essential part of practical HPC thread management.