What is a Thread in Shared-Memory Programming
A thread is a lightweight execution unit within a process. In the shared‑memory setting:
- All threads of a process share:
- The same address space (global variables, heap)
- Open file descriptors
- Program code
- Each thread has:
- Its own program counter (where it is in the code)
- Its own registers
- Its own stack (local variables, function calls)
For OpenMP in particular, you typically have:
- A master (or main) thread that starts the program
- Additional worker threads that are created in parallel regions
- All of them forming a team of threads
Thread-based shared memory parallelism is about controlling:
- How many threads you have
- Where and when they are created
- What work each thread performs
- How they interact and terminate
Thread Lifecycle
A typical thread in a shared-memory program goes through these stages:
- Creation
- The runtime (e.g., OpenMP library or pthreads) allocates resources for a new thread and starts its execution at a given function or code region.
- Running
- The thread executes code, may enter and exit parallel regions, may synchronize with other threads, and may be scheduled onto different cores by the OS.
- Blocking / Waiting
- The thread might wait at a barrier, lock, condition variable, or busy-wait loop, not doing useful work while it waits.
- Completion
- The thread finishes its assigned work and exits the parallel region or its start routine.
- Joining / Reaping
- Another thread (often the master) waits for the finishing thread and cleans up its resources.
In OpenMP, most of this lifecycle is hidden. You mark regions to run in parallel and the runtime handles creation, scheduling, and joining of threads.
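To make these stages concrete, here is a minimal sketch using POSIX threads (pthreads), where the lifecycle is managed explicitly; the thread count and the work routine are purely illustrative:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4              /* illustrative thread count */

/* Start routine: each thread runs this, then completes. */
static void *work(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;                   /* completion */
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, work, &ids[i]);   /* creation */
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);                      /* joining / reaping */
    }
    return 0;
}

OpenMP performs the equivalent of these steps behind the scenes each time a parallel region is entered and left.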
Threads in OpenMP: The Basic Model
In OpenMP, threads are created and managed via parallel regions (covered elsewhere). Here we only focus on how the threads themselves are structured and controlled.
Inside a parallel region:
- A team of threads exists
- Each thread has:
- A thread ID within the team: omp_get_thread_num()
- A shared team size: omp_get_num_threads()
Example (C) to see thread identities:
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nt = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, nt);
    }
    return 0;
}

Key ideas:
- The number of threads can be fixed or decided at runtime.
- OpenMP runtime may create and keep a pool of worker threads to avoid repeated creation/destruction costs.
Controlling the Number of Threads
Choosing and controlling thread counts is a central part of thread management.
Global Thread Count Settings
You can set a default number of threads in several ways:
- Environment variable (in shell):
  export OMP_NUM_THREADS=8
- Runtime library call (C/C++ or Fortran):
  omp_set_num_threads(8);
The runtime uses this as a hint. It may not always be honored exactly, depending on:
- Implementation
- System resource limits
- Nested parallelism settings
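As a minimal sketch (the value 8 is just an example), a program can set the default team size and then query what the runtime would actually use:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(8);   /* request a default team size of 8 */

    /* How many threads the runtime would use for the next parallel region */
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("actual team size: %d\n", omp_get_num_threads());
    }
    return 0;
}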
Per-Region Thread Control
You can also specify the team size for a particular parallel region using a clause:
#pragma omp parallel num_threads(4)
{
    // parallel work with exactly 4 threads (if possible)
}

Common patterns:
- Use environment variables for global defaults on a cluster.
- Use num_threads selectively where a region has specific needs.
Master and Worker Threads
In a typical OpenMP program:
- The program starts with a single initial thread.
- When encountering a parallel region, this initial thread becomes the master of the team.
- The master spawns (or wakes) worker threads to form a team.
Roles:
- Master thread
- Has omp_get_thread_num() == 0
- Can be used for serial tasks that should not be done by others (I/O, coordination)
- Worker threads
- Perform computational work
- May also participate in synchronization and communication
You can restrict work to the master thread within a parallel region:
#pragma omp parallel
{
    // code executed by all threads
    #pragma omp master
    {
        // executed only by the master thread (no implicit barrier)
    }
    // code executed by all threads again
}
Note: master does not imply an implicit barrier at the end, unlike single (details of constructs are covered elsewhere).
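To illustrate the difference, the following fragment (the printf calls stand in for real work) contrasts master, single, and single with nowait:

#pragma omp parallel
{
    #pragma omp master
    printf("master only; other threads do not wait here\n");

    #pragma omp single
    printf("exactly one thread (not necessarily the master); others wait at the implicit barrier\n");

    #pragma omp single nowait
    printf("one thread again, but nowait removes the barrier\n");
}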
Thread Affinity and Core Binding (Conceptual)
Thread affinity is about where threads run:
- Ideally, each thread runs on a core and stays there (affinity).
- Poor affinity can lead to:
- Cache thrashing
- NUMA penalties (accessing remote memory)
- Unpredictable performance
Common mechanisms (exact details are system and compiler dependent):
- Environment variables:
  - OMP_PROC_BIND – controls whether threads stick to a place (e.g., true, false, close, spread)
  - OMP_PLACES – controls which cores or hardware threads form the set of “places” threads can occupy
Basic idea:
- close: put threads near each other (good for shared data, may be bad if memory bandwidth is limiting).
- spread: spread threads across sockets / cores (good for memory bandwidth, may have more latency).
You usually:
- Start with default settings or simply OMP_PROC_BIND=true.
- Adjust affinity when you have performance measurement tools to guide you.
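For example, a common starting point might look like the following shell settings; the exact values are illustrative and should follow your site's recommendations:

export OMP_PLACES=cores        # one place per physical core
export OMP_PROC_BIND=spread    # spread threads across the places (e.g., across sockets)
# or, to keep threads near each other:
# export OMP_PROC_BIND=close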
Nested Parallelism and Thread Teams
Nested parallelism means having parallel regions inside parallel regions.
Key terms:
- Outer team: threads created by an outer parallel region.
- Inner team(s): threads created by parallel regions inside these threads.
For example:
#pragma omp parallel num_threads(2)
{
    int outer_tid = omp_get_thread_num();
    // Outer team: 2 threads
    #pragma omp parallel num_threads(3)
    {
        int inner_tid = omp_get_thread_num();
        // Inner team: potentially 3 threads *per outer thread*
    }
}

This can quickly multiply the total number of threads (up to 2×3 = 6 here, but often much more in real programs).
Control:
- Enable or disable nested parallelism:
  - Env vars: OMP_NESTED, OMP_MAX_ACTIVE_LEVELS
  - API: omp_set_nested(), omp_set_max_active_levels()
- In many HPC codes, nested parallelism is disabled by default to avoid oversubscription and complexity.
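A minimal sketch of limiting nesting via the API is shown below; note that omp_set_nested() and OMP_NESTED are deprecated in recent OpenMP versions in favor of the max-active-levels controls:

#include <omp.h>

int main(void) {
    omp_set_max_active_levels(1);   /* only one level of parallelism may be active */

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(3)
        {
            /* With max active levels = 1, this inner region is inactive:
               each outer thread executes it with a team of just 1 thread. */
        }
    }
    return 0;
}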
Typical HPC practice:
- Start with a flat threading model (one level of parallelism).
- Use nesting only when you have a clear need and understand the resource implications (e.g., hybrid MPI+OpenMP within nodes).
Dynamic vs Static Number of Threads
Some runtimes allow dynamically adjusting the number of threads during execution.
- Static number of threads:
- Fixed thread count for all parallel regions (or a region).
- Easier to reason about, more predictable performance.
- Dynamic number of threads:
- Runtime may increase or decrease threads depending on system load or internal heuristics.
OpenMP controls:
- Environment: OMP_DYNAMIC
- API: omp_set_dynamic(int flag);
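A minimal sketch of disabling dynamic adjustment from code:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_dynamic(0);   /* 0 = disable dynamic adjustment of team sizes */
    printf("dynamic adjustment: %s\n", omp_get_dynamic() ? "on" : "off");
    return 0;
}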
In HPC:
- Dynamic adjustment is often turned off to get consistent performance and avoid unpredictable interaction with other jobs on the node.
- Use dynamic behavior only if you have a specific reason (e.g., co-running workloads, resource sharing experiments).
Thread Management Overheads
Managing threads is not free. Important cost components:
- Creation and Destruction
- Allocating stacks, setting up OS structures.
- High cost if done repeatedly.
- Context Switching
- When the OS switches the CPU from one thread to another.
- Too many threads per core cause high context-switch overhead.
- Synchronization
- Locks, barriers, atomics all have overhead.
- Improper use can lead to more overhead than parallel speedup.
- Scheduling Within Regions
- OpenMP must distribute loop iterations and tasks among threads.
- Different scheduling strategies trade off overhead and load balance.
Practical implications for beginners:
- Prefer a thread pool model (as OpenMP does) over manual creation for each small task.
- Avoid creating very short parallel regions inside tight loops; use larger-grain parallelism (see the sketch after this list).
- Match the number of threads to hardware (e.g., number of cores) to avoid oversubscription.
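As an illustrative sketch (array sizes, loop bounds, and the kernel are made up), the second form below pays the parallel-region overhead once instead of on every outer iteration:

#include <omp.h>

#define N      1000000
#define NSTEPS 100

void update(double *a, const double *b) {
    /* Costly: a new parallel region is created on every outer iteration */
    for (int step = 0; step < NSTEPS; step++) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] += b[i];
    }

    /* Cheaper: one parallel region; only the work-sharing loop repeats inside it */
    #pragma omp parallel
    for (int step = 0; step < NSTEPS; step++) {
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] += b[i];
    }
}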
Oversubscription and Resource Limits
Oversubscription occurs when you run more runnable threads than hardware execution contexts (e.g., cores or hardware threads).
Consequences:
- Increased context switching
- Cache pollution
- Unstable and often worse performance
Common oversubscription causes:
- Too many software threads relative to cores
- Nested parallelism turned on accidentally
- Combining independent threaded libraries that each create their own thread pools (e.g., your OpenMP code calls a threaded math library)
Management strategies:
- Set a sensible OMP_NUM_THREADS consistent with the cores per node.
- Coordinate thread counts across all threaded libraries used.
- When combining MPI and OpenMP, plan MPI ranks × threads per rank not to exceed available hardware threads.
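As a hedged example, on a hypothetical 64-core node running 4 MPI ranks per node, the per-rank thread count would be chosen so the product stays within the hardware:

# Hypothetical 64-core node, 4 MPI ranks per node:
# 4 ranks × 16 threads = 64 threads, matching 64 cores (no oversubscription)
export OMP_NUM_THREADS=16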
Thread Safety and Library Use
Not all libraries are thread-safe. Thread safety issues include:
- Global state inside libraries that is not protected
- Hidden buffers reused by multiple threads
- Use of non-reentrant functions
Thread management implications:
- Prefer thread-safe libraries for use inside parallel regions.
- If a library is not thread-safe:
- Limit its use to one thread (e.g., with master or single constructs).
- Use explicit synchronization around calls (only if safe and documented).
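A minimal sketch of the first option, where legacy_log() is a hypothetical non-thread-safe library routine:

#pragma omp parallel
{
    /* ... thread-parallel work ... */

    #pragma omp single
    {
        /* legacy_log() is a hypothetical routine that is NOT thread-safe */
        legacy_log("checkpoint reached");
    }
    /* implicit barrier of single: all threads wait until the call has completed */

    /* ... more thread-parallel work ... */
}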
You should also be aware that:
- Some libraries manage their own internal threads (e.g., threaded BLAS, FFT libraries).
- Thread management may need to be coordinated at the application level (e.g., disabling internal threads when using OpenMP outside, or vice versa).
Practical Thread Management Tips for Beginners
- Start Simple
- Use a single outer parallel region to cover major work.
- Set OMP_NUM_THREADS equal to the number of physical cores per node (unless advised otherwise).
- Avoid Nesting Until Needed
- Keep nested parallelism off at first.
- Only enable it when you have a clear design and understand the resource implications.
- Watch Affinity on HPC Systems
- Use system or site documentation for recommended OMP_PROC_BIND and OMP_PLACES settings.
- Test performance with and without binding.
- Minimize Parallel Region Overheads
- Avoid repeatedly entering and exiting very small parallel regions in tight loops.
- Group work into fewer, more substantial parallel regions.
- Coordinate with MPI and Other Libraries
- Ensure the product of MPI ranks and threads per rank stays within hardware limits.
- Check documentation of numerical libraries for their threading behavior.
- Measure, Don’t Guess
- Use timing and profiling to see whether thread counts and management choices improve or degrade performance.
- Adjust thread numbers, affinity, and scheduling policies based on evidence from runs on your target HPC system.
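A simple starting point for such measurements is omp_get_wtime(); the kernel timed below is only a placeholder:

#include <omp.h>
#include <stdio.h>

#define N 10000000

static double a[N];

int main(void) {
    double t0 = omp_get_wtime();

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;                 /* placeholder kernel */

    double t1 = omp_get_wtime();
    printf("elapsed: %f s using up to %d threads\n", t1 - t0, omp_get_max_threads());
    return 0;
}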
Summary
Threads and thread management in shared-memory programming involve:
- Understanding what a thread is within a process and how it runs.
- Controlling how many threads exist, when they are created, and where they execute.
- Dealing with nested parallelism and avoiding oversubscription.
- Managing overheads from creation, scheduling, and synchronization.
- Coordinating thread usage with external libraries and the rest of the software stack.
These concepts provide the foundation needed to use OpenMP constructs effectively and to reason about performance and correctness in shared-memory parallel programs.