What is a Thread in Shared-Memory Programming
A thread is a lightweight execution unit within a process. In the shared‑memory setting:
- All threads of a process share:
- The same address space (global variables, heap)
- Open file descriptors
- Program code
- Each thread has:
- Its own program counter (where it is in the code)
- Its own registers
- Its own stack (local variables, function calls)
For OpenMP in particular, you typically have:
- A master (or main) thread that starts the program
- Additional worker threads that are created in parallel regions
- All of them forming a team of threads
Thread-based shared memory parallelism is about controlling:
- How many threads you have
- Where and when they are created
- What work each thread performs
- How they interact and terminate
Thread Lifecycle
A typical thread in a shared-memory program goes through these stages:
- Creation
- The runtime (e.g., OpenMP library or pthreads) allocates resources for a new thread and starts its execution at a given function or code region.
- Running
- The thread executes code, may enter and exit parallel regions, may synchronize with other threads, and may be scheduled onto different cores by the OS.
- Blocking / Waiting
- The thread might wait at a barrier, lock, condition variable, or busy-wait loop, not doing useful work while it waits.
- Completion
- The thread finishes its assigned work and exits the parallel region or its start routine.
- Joining / Reaping
- Another thread (often the master) waits for the finishing thread and cleans up its resources.
In OpenMP, most of this lifecycle is hidden. You mark regions to run in parallel and the runtime handles creation, scheduling, and joining of threads.
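To make these stages concrete, here is a minimal sketch using POSIX threads (pthreads), where the lifecycle is managed explicitly; the thread count and the work routine are purely illustrative:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4              /* illustrative thread count */

/* Start routine: each thread runs this, then completes. */
static void *work(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;                   /* completion */
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, work, &ids[i]);   /* creation */
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);                      /* joining / reaping */
    }
    return 0;
}

OpenMP performs the equivalent of these steps behind the scenes each time a parallel region is entered and left.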
Threads in OpenMP: The Basic Model
In OpenMP, threads are created and managed via parallel regions (covered elsewhere). Here we only focus on how the threads themselves are structured and controlled.
Inside a parallel region:
- A team of threads exists
- Each thread has:
- A thread ID within the team: omp_get_thread_num()
- A shared team size: omp_get_num_threads()
Example (C) to see thread identities:
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nt = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, nt);
    }
    return 0;
}

Key ideas:
- The number of threads can be fixed or decided at runtime.
- OpenMP runtime may create and keep a pool of worker threads to avoid repeated creation/destruction costs.
Controlling the Number of Threads
Choosing and controlling thread counts is a central part of thread management.
Global Thread Count Settings
You can set a default number of threads in several ways:
- Environment variable (in shell):
  export OMP_NUM_THREADS=8
- Runtime library call (C/C++ or Fortran):
  omp_set_num_threads(8);
The runtime uses this as a hint. It may not always be honored exactly, depending on:
- Implementation
- System resource limits
- Nested parallelism settings
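As a minimal sketch (the value 8 is just an example), a program can set the default team size and then query what the runtime would actually use:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(8);   /* request a default team size of 8 */

    /* How many threads the runtime would use for the next parallel region */
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("actual team size: %d\n", omp_get_num_threads());
    }
    return 0;
}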
Per-Region Thread Control
You can also specify the team size for a particular parallel region using a clause:
#pragma omp parallel num_threads(4)
{
    // parallel work with exactly 4 threads (if possible)
}

Common patterns:
- Use environment variables for global defaults on a cluster.
- Use num_threads selectively where a region has specific needs.
Master and Worker Threads
In a typical OpenMP program:
- The program starts with a single initial thread.
- When encountering a parallel region, this initial thread becomes the master of the team.
- The master spawns (or wakes) worker threads to form a team.
Roles:
- Master thread
- Has omp_get_thread_num() == 0
- Can be used for serial tasks that should not be done by others (I/O, coordination)
- Worker threads
- Perform computational work
- May also participate in synchronization and communication
You can restrict work to the master thread within a parallel region:
#pragma omp parallel
{
    // code executed by all threads
    #pragma omp master
    {
        // executed only by the master thread (no implicit barrier)
    }
    // code executed by all threads again
}
Note: master does not imply an implicit barrier at the end, unlike single (details of constructs are covered elsewhere).
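To illustrate the difference, the following fragment (the printf calls stand in for real work) contrasts master, single, and single with nowait:

#pragma omp parallel
{
    #pragma omp master
    printf("master only; other threads do not wait here\n");

    #pragma omp single
    printf("exactly one thread (not necessarily the master); others wait at the implicit barrier\n");

    #pragma omp single nowait
    printf("one thread again, but nowait removes the barrier\n");
}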
Thread Affinity and Core Binding (Conceptual)
Thread affinity is about where threads run:
- Ideally, each thread runs on a core and stays there (affinity).
- Poor affinity can lead to:
- Cache thrashing
- NUMA penalties (accessing remote memory)
- Unpredictable performance
Common mechanisms (exact details are system and compiler dependent):
- Environment variables:
  - OMP_PROC_BIND – controls whether threads stick to a place (e.g., true, false, close, spread)
  - OMP_PLACES – controls which cores or hardware threads form the set of “places” threads can occupy
Basic idea:
- close: put threads near each other (good for shared data, may be bad if memory bandwidth is limiting).
- spread: spread threads across sockets / cores (good for memory bandwidth, may have more latency).
You usually:
- Start with default settings or simply OMP_PROC_BIND=true.
- Adjust affinity when you have performance measurement tools to guide you.
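For example, a common starting point might look like the following shell settings; the exact values are illustrative and should follow your site's recommendations:

export OMP_PLACES=cores        # one place per physical core
export OMP_PROC_BIND=spread    # spread threads across the places (e.g., across sockets)
# or, to keep threads near each other:
# export OMP_PROC_BIND=close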
Nested Parallelism and Thread Teams
Nested parallelism means having parallel regions inside parallel regions.
Key terms:
- Outer team: threads created by an outer parallel region.
- Inner team(s): threads created by parallel regions inside these threads.
For example:
#pragma omp parallel num_threads(2)
{
    int outer_tid = omp_get_thread_num();
    // Outer team: 2 threads
    #pragma omp parallel num_threads(3)
    {
        int inner_tid = omp_get_thread_num();
        // Inner team: potentially 3 threads *per outer thread*
    }
}

This can quickly multiply the total number of threads (up to 2×3 = 6 here, but often much more in real programs).
Control:
- Enable or disable nested parallelism:
  - Env vars: OMP_NESTED, OMP_MAX_ACTIVE_LEVELS
  - API: omp_set_nested(), omp_set_max_active_levels()
- In many HPC codes, nested parallelism is disabled by default to avoid oversubscription and complexity.
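A minimal sketch of limiting nesting via the API is shown below; note that omp_set_nested() and OMP_NESTED are deprecated in recent OpenMP versions in favor of the max-active-levels controls:

#include <omp.h>

int main(void) {
    omp_set_max_active_levels(1);   /* only one level of parallelism may be active */

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(3)
        {
            /* With max active levels = 1, this inner region is inactive:
               each outer thread executes it with a team of just 1 thread. */
        }
    }
    return 0;
}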
Typical HPC practice:
- Start with a flat threading model (one level of parallelism).
- Use nesting only when you have a clear need and understand the resource implications (e.g., hybrid MPI+OpenMP within nodes).
Dynamic vs Static Number of Threads
Some runtimes allow dynamically adjusting the number of threads during execution.
- Static number of threads:
- Fixed thread count for all parallel regions (or a region).
- Easier to reason about, more predictable performance.
- Dynamic number of threads:
- Runtime may increase or decrease threads depending on system load or internal heuristics.
OpenMP controls:
- Environment: OMP_DYNAMIC
- API: omp_set_dynamic(int flag);
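A minimal sketch of disabling dynamic adjustment from code:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_dynamic(0);   /* 0 = disable dynamic adjustment of team sizes */
    printf("dynamic adjustment: %s\n", omp_get_dynamic() ? "on" : "off");
    return 0;
}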
In HPC:
- Dynamic adjustment is often turned off to get consistent performance and avoid unpredictable interaction with other jobs on the node.
- Use dynamic behavior only if you have a specific reason (e.g., co-running workloads, resource sharing experiments).
Thread Management Overheads
Managing threads is not free. Important cost components:
- Creation and Destruction
- Allocating stacks, setting up OS structures.
- High cost if done repeatedly.
- Context Switching
- When the OS switches the CPU from one thread to another.
- Too many threads per core cause high context-switch overhead.
- Synchronization
- Locks, barriers, atomics all have overhead.
- Improper use can lead to more overhead than parallel speedup.
- Scheduling Within Regions
- OpenMP must distribute loop iterations and tasks among threads.
- Different scheduling strategies trade off overhead and load balance.
Practical implications for beginners:
- Prefer a thread pool model (as OpenMP does) over manual creation for each small task.
- Avoid creating very short parallel regions inside tight loops; use larger-grain parallelism (see the sketch after this list).
- Match the number of threads to hardware (e.g., number of cores) to avoid oversubscription.
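As an illustrative sketch (array sizes, loop bounds, and the kernel are made up), the second form below pays the parallel-region overhead once instead of on every outer iteration:

#include <omp.h>

#define N      1000000
#define NSTEPS 100

void update(double *a, const double *b) {
    /* Costly: a new parallel region is created on every outer iteration */
    for (int step = 0; step < NSTEPS; step++) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] += b[i];
    }

    /* Cheaper: one parallel region; only the work-sharing loop repeats inside it */
    #pragma omp parallel
    for (int step = 0; step < NSTEPS; step++) {
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] += b[i];
    }
}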
Oversubscription and Resource Limits
Oversubscription occurs when you run more runnable threads than hardware execution contexts (e.g., cores or hardware threads).
Consequences:
- Increased context switching
- Cache pollution
- Unstable and often worse performance
Common oversubscription causes:
- Too many software threads relative to cores
- Nested parallelism turned on accidentally
- Combining independent threaded libraries that each create their own thread pools (e.g., your OpenMP code calls a threaded math library)
Management strategies:
- Set a sensible OMP_NUM_THREADS consistent with the cores per node.
- Coordinate thread counts across all threaded libraries used.
- When combining MPI and OpenMP, plan MPI ranks × threads per rank not to exceed available hardware threads.
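As a hedged example, on a hypothetical 64-core node running 4 MPI ranks per node, the per-rank thread count would be chosen so the product stays within the hardware:

# Hypothetical 64-core node, 4 MPI ranks per node:
# 4 ranks × 16 threads = 64 threads, matching 64 cores (no oversubscription)
export OMP_NUM_THREADS=16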
Thread Safety and Library Use
Not all libraries are thread-safe. Thread safety issues include:
- Global state inside libraries that is not protected
- Hidden buffers reused by multiple threads
- Use of non-reentrant functions
Thread management implications:
- Prefer thread-safe libraries for use inside parallel regions.
- If a library is not thread-safe:
- Limit its use to one thread (e.g., with master or single constructs).
- Use explicit synchronization around calls (only if safe and documented).
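A minimal sketch of the first option, where legacy_log() is a hypothetical non-thread-safe library routine:

#pragma omp parallel
{
    /* ... thread-parallel work ... */

    #pragma omp single
    {
        /* legacy_log() is a hypothetical routine that is NOT thread-safe */
        legacy_log("checkpoint reached");
    }
    /* implicit barrier of single: all threads wait until the call has completed */

    /* ... more thread-parallel work ... */
}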
You should also be aware that:
- Some libraries manage their own internal threads (e.g., threaded BLAS, FFT libraries).
- Thread management may need to be coordinated at the application level (e.g., disabling internal threads when using OpenMP outside, or vice versa).
Practical Thread Management Tips for Beginners
- Start Simple
- Use a single outer parallel region to cover major work.
- Set OMP_NUM_THREADS equal to the number of physical cores per node (unless advised otherwise).
- Avoid Nesting Until Needed
- Keep nested parallelism off at first.
- Only enable it when you have a clear design and understand the resource implications.
- Watch Affinity on HPC Systems
- Use system or site documentation for recommended OMP_PROC_BIND and OMP_PLACES settings.
- Test performance with and without binding.
- Minimize Parallel Region Overheads
- Avoid repeatedly entering and exiting very small parallel regions in tight loops.
- Group work into fewer, more substantial parallel regions.
- Coordinate with MPI and Other Libraries
- Ensure the product of MPI ranks and threads per rank stays within hardware limits.
- Check documentation of numerical libraries for their threading behavior.
- Measure, Don’t Guess
- Use timing and profiling to see whether thread counts and management choices improve or degrade performance.
- Adjust thread numbers, affinity, and scheduling policies based on evidence from runs on your target HPC system.
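A simple starting point for such measurements is omp_get_wtime(); the kernel timed below is only a placeholder:

#include <omp.h>
#include <stdio.h>

#define N 10000000

static double a[N];

int main(void) {
    double t0 = omp_get_wtime();

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;                 /* placeholder kernel */

    double t1 = omp_get_wtime();
    printf("elapsed: %f s using up to %d threads\n", t1 - t0, omp_get_max_threads());
    return 0;
}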
Summary
Threads and thread management in shared-memory programming involve:
- Understanding what a thread is within a process and how it runs.
- Controlling how many threads exist, when they are created, and where they execute.
- Dealing with nested parallelism and avoiding oversubscription.
- Managing overheads from creation, scheduling, and synchronization.
- Coordinating thread usage with external libraries and the rest of the software stack.
These concepts provide the foundation needed to use OpenMP constructs effectively and to reason about performance and correctness in shared-memory parallel programs.