What is a Parallel Region?
In shared-memory programming with OpenMP, a parallel region is a block of code that multiple threads execute concurrently.
Conceptually:
- Outside a parallel region: the program runs with a single thread (the master thread).
- Inside a parallel region: multiple threads exist, and they all enter the region and run (parts of) the code.
- When the parallel region ends: all threads synchronize, and control continues with a single thread.
In OpenMP, parallel regions are introduced using pragmas (directives) in C/C++ and comments/directives in Fortran. The simplest form:
- C/C++:
#pragma omp parallel
{
/* code to run in parallel */
}

- Fortran:
!$omp parallel
! code to run in parallel
!$omp end parallel

Basic Structure and Semantics
Single-thread vs. multi-thread sections
A typical OpenMP program alternates between:
- Serial (single-thread) sections: normal code, executed by one thread.
- Parallel regions: code executed by a team of threads.
Execution timeline:
- Program starts: one thread runs from `main` (C/C++) or the main program (Fortran).
- It encounters a `parallel` directive:
  - A team of threads is created.
  - All threads execute the block associated with the directive.
- End of parallel region:
  - Implicit barrier: all threads wait until every thread finishes the region.
  - Only a single thread continues after the region (the master, conceptually).
Fork-join model
Parallel regions realize the fork-join model:
- Fork: At the start of a parallel region, the master thread spawns worker threads.
- Parallel work: All threads run code in the region.
- Join: At the end of the region, worker threads synchronize and (conceptually) terminate, returning control to the master thread.
In practice, runtime implementations often reuse threads between regions to avoid creation/destruction overhead, but logically the model is fork-join.
Creating a Parallel Region in OpenMP
Minimal example
C/C++:
#include <stdio.h>
#include <omp.h>
int main(void) {
printf("Before parallel region\n");
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("Hello from thread %d\n", id);
}
printf("After parallel region\n");
return 0;
}

Fortran:
program parallel_region_example
use omp_lib
implicit none
print *, 'Before parallel region'
!$omp parallel
print *, 'Hello from thread', omp_get_thread_num()
!$omp end parallel
print *, 'After parallel region'
end program

Key points:
- `#pragma omp parallel` / `!$omp parallel` marks the beginning of the region.
- Each thread executes the code inside the region.
- The runtime library (e.g., `omp_get_thread_num()`) lets you query thread-specific information.
Controlling the Number of Threads
Within a parallel region, you can control how many threads are used.
Environment variable `OMP_NUM_THREADS`
The simplest way is via an environment variable before running the program:
- Bash:
export OMP_NUM_THREADS=4
./a.out

- This sets the default thread count for parallel regions.
`omp_set_num_threads` and `omp_get_num_threads`
In code (to set default for subsequent parallel regions):
- C/C++:
#include <omp.h>
int main(void) {
omp_set_num_threads(8); // request 8 threads for future parallel regions
#pragma omp parallel
{
// ...
}
}

- Fortran:
call omp_set_num_threads(8)
!$omp parallel
! ...
!$omp end parallel

Inside a parallel region:
- `omp_get_num_threads()` returns the number of threads in the current team.
- `omp_get_thread_num()` returns the ID of the calling thread (0 to `num_threads - 1`).
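To see how the setter and the query routines interact, here is a minimal C sketch (the request of 8 threads is arbitrary, and the runtime may grant fewer):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(8);   /* request 8 threads for later parallel regions */

    /* Outside any parallel region the team consists of a single thread. */
    printf("Serial part: %d thread(s)\n", omp_get_num_threads());

    #pragma omp parallel
    {
        /* Inside the region, each thread can query its own ID and the team size. */
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```

Compiled with OpenMP enabled (e.g., -fopenmp for GCC/Clang), the serial line reports 1 thread, while the lines printed inside the region report the team size actually granted.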
Clause `num_threads`
You can override the default for a specific parallel region:
- C/C++:
#pragma omp parallel num_threads(4)
{
// This region will have 4 threads (if possible)
}

- Fortran:
!$omp parallel num_threads(4)
! ...
!$omp end parallel

The effective number of threads can still be constrained by the runtime and system (e.g., resource limits, compiler flags, nesting rules).
Scope of Variables in Parallel Regions
Entering a parallel region affects how variables are shared between threads. The default rules are important to understand to avoid subtle bugs.
Shared vs private
- Shared: All threads see the same memory location.
- Private: Each thread has its own copy (uninitialized unless a clause such as firstprivate initializes it).
Baseline rules (simplified):
- Global / static variables (C/C++), module variables (Fortran): typically shared.
- Local variables declared before the region and used inside it: shared by default.
- Variables declared inside the parallel block, and loop indices of work-sharing loops: private to each thread.

Rather than relying on these defaults, state the intent with explicit clauses.
Common clauses on the parallel directive:
- `shared(list)`: variables in `list` are shared among all threads.
- `private(list)`: each thread gets its own uninitialized copy.
- `firstprivate(list)`: private, but initialized with the value from before the region.
- `default(shared)` / `default(none)`: control the default behavior.
Example (C/C++):
int x = 10;
#pragma omp parallel shared(x)
{
int y = 0; // each thread gets its own y
// x is shared, y is private by being declared inside the block
}

More explicit:
int x = 10;
int y = 20;
#pragma omp parallel shared(x) private(y)
{
// x: same for all threads
// y: each thread has its own uninitialized y
}

Fortran:
integer :: x, y
x = 10
y = 20
!$omp parallel shared(x) private(y)
! x is shared, y is private
!$omp end parallel
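The `firstprivate` clause listed above is easy to confuse with `private`; a minimal C sketch of the difference (the variables `a` and `b` are purely illustrative):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int a = 10;   /* private below: each thread's copy starts uninitialized */
    int b = 20;   /* firstprivate below: each thread's copy starts with 20  */

    #pragma omp parallel private(a) firstprivate(b)
    {
        a = omp_get_thread_num();   /* a must be assigned before it is read  */
        b = b + a;                  /* each thread updates its own copy of b */
        printf("Thread %d: a = %d, b = %d\n", omp_get_thread_num(), a, b);
    }
    return 0;
}
```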
Using default(none) is a good practice in real HPC codes to force explicit scoping:
#pragma omp parallel default(none) shared(x) private(y)
{
// All variables must be in shared/private/etc. clauses
}

Work Distribution Inside a Parallel Region
By default, every thread executes all the code in the parallel block. You typically combine parallel regions with work-sharing constructs (covered in a separate chapter) to divide work among threads, e.g., for / do, sections, single.
However, for understanding parallel regions themselves, keep in mind:
- Without a work-sharing construct, each thread runs the same statements.
- You often branch on the thread ID to specialize behavior:
C/C++:
#pragma omp parallel
{
int id = omp_get_thread_num();
if (id == 0) {
// master-like work
} else {
// worker-like work
}
}

Fortran:
integer :: id   ! declared in the specification part, before executable statements

!$omp parallel private(id)
id = omp_get_thread_num()
if (id == 0) then
! master-like work
else
! worker-like work
end if
!$omp end parallel

Master and Single Execution Within a Parallel Region
Sometimes you want only one thread in the parallel region to execute a piece of code, while still being inside the region and able to use shared data.
Two key constructs (details belong to other subchapters, but their relation to parallel regions is important):
- `master`: only the master thread (ID 0) executes the enclosed block; no implicit barrier at the end.
- `single`: exactly one arbitrary thread executes the block; implicit barrier at the end (unless `nowait` is used).
C/C++ examples:
#pragma omp parallel
{
// Only thread 0 executes this
#pragma omp master
{
printf("This is the master thread.\n");
}
// Exactly one thread (not necessarily 0) executes this
#pragma omp single
{
printf("This is executed once by thread %d\n", omp_get_thread_num());
}
}

These constructs are only meaningful inside a parallel region; they modify how threads participate without leaving the region.
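Because `single` ends with an implicit barrier while `master` does not, the `nowait` clause is what lets one thread perform a side task without holding up the others; a small sketch, assuming the printed message does not need to appear before the remaining work starts:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* One (arbitrary) thread prints a message; nowait removes the
           implicit barrier, so the other threads do not wait for it. */
        #pragma omp single nowait
        {
            printf("Progress message from thread %d\n", omp_get_thread_num());
        }

        /* All threads, including the one above, continue with the work here. */
        printf("Thread %d doing its share of work\n", omp_get_thread_num());
    }
    return 0;
}
```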
Nested Parallel Regions
You can create parallel regions inside other parallel regions. This is called nested parallelism.
Basic structure:
C/C++:
#pragma omp parallel num_threads(2)
{
printf("Outer thread %d\n", omp_get_thread_num());
#pragma omp parallel num_threads(3)
{
printf(" Inner thread %d (outer %d)\n",
omp_get_thread_num(), omp_get_ancestor_thread_num(1));
}
}

Conceptually:
- Outer parallel region: creates an outer team of threads.
- Inner `parallel`: each outer thread may (logically) create its own inner team.
In practice:
- Nested parallelism is often disabled by default in many OpenMP runtimes.
- The runtime can serialize inner parallel regions (treat them as if they had one thread) to avoid explosion in thread count.
- For most beginner and many production HPC codes, you usually avoid nested regions and focus on one level of parallelism.
Enabling nested parallelism (C/C++):
omp_set_nested(1); // or omp_set_max_active_levels(...)

Use nested regions carefully; they interact with performance and scheduling in complex ways.
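As a concrete sketch of how two levels might be enabled with current runtimes (`omp_set_nested` is deprecated since OpenMP 5.0 in favor of `omp_set_max_active_levels`; `OMP_MAX_ACTIVE_LEVELS` is the environment-variable equivalent):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Allow up to two nested levels of active parallelism
       (equivalently: export OMP_MAX_ACTIVE_LEVELS=2 before running). */
    omp_set_max_active_levels(2);

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(3)
        {
            /* omp_get_level() reports the current nesting depth (2 here). */
            printf("Level %d, thread %d\n",
                   omp_get_level(), omp_get_thread_num());
        }
    }
    return 0;
}
```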
Barriers and Synchronization at Region Boundaries
Parallel regions have built-in synchronization behavior:
- At the entry of a parallel region:
- The master thread (conceptually) waits until all worker threads are ready to start the region.
- At the exit of a parallel region:
- There is an implicit barrier: all threads must finish the region before any can proceed beyond it.
- After the barrier, only one thread (the master) continues in the serial code.
This means:
- You do not need an explicit barrier right at the end of a parallel region; it’s implicit.
- If you need to synchronize threads inside a parallel region (before the end), you use explicit constructs (like `barrier`), which are separate topics.
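A minimal sketch contrasting the implicit barrier at the region boundary with an explicit `barrier` inside it:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        printf("Phase 1, thread %d\n", omp_get_thread_num());

        /* Explicit barrier: no thread starts phase 2 until all finish phase 1. */
        #pragma omp barrier

        printf("Phase 2, thread %d\n", omp_get_thread_num());
    }   /* implicit barrier: every thread finishes before the serial code resumes */

    printf("Back to serial execution\n");
    return 0;
}
```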
Parallel Region Overheads and Granularity
Parallel regions are not “free”: starting and finishing them has a runtime cost (thread creation, synchronization, scheduling).
Implications:
- Too many small parallel regions can severely hurt performance.
- Instead of:
// bad pattern
for (int i = 0; i < N; ++i) {
#pragma omp parallel
{
// small amount of work
}
}

you generally want:
// better pattern
#pragma omp parallel
{
#pragma omp for
for (int i = 0; i < N; ++i) {
// work per iteration
}
}

- It is often beneficial to:
- Create fewer, larger parallel regions (“coarse-grained” parallelism).
- Keep threads alive and reuse them for multiple operations, instead of repeatedly entering/exiting regions.
Deciding how many parallel regions to use and how big they should be is part of performance tuning.
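One way to quantify this overhead is to time both patterns with `omp_get_wtime()`; a rough sketch, where `N`, `REPS`, and the trivial loop body are placeholders chosen for illustration:

```c
#include <stdio.h>
#include <omp.h>

#define N    1000
#define REPS 10000

int main(void) {
    double a[N];
    for (int i = 0; i < N; ++i) a[i] = (double)i;

    /* Pattern 1: a new parallel region per repetition (overhead paid REPS times). */
    double t1 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r) {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = 0.5 * a[i] + 1.0;
    }
    t1 = omp_get_wtime() - t1;

    /* Pattern 2: one region, repeated work-sharing loops inside (overhead paid once). */
    double t2 = omp_get_wtime();
    #pragma omp parallel
    {
        for (int r = 0; r < REPS; ++r) {
            #pragma omp for
            for (int i = 0; i < N; ++i)
                a[i] = 0.5 * a[i] + 1.0;
        }
    }
    t2 = omp_get_wtime() - t2;

    printf("Many small regions: %f s\n", t1);
    printf("One large region:   %f s\n", t2);
    return 0;
}
```

The absolute numbers depend heavily on the compiler, runtime, and machine, but the relative gap illustrates the per-region cost.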
Common Patterns of Parallel Region Usage
Single *main* parallel region
A common pattern in HPC codes:
- Initialize in serial.
- Enter one large parallel region.
- Inside it, use work-sharing constructs (`for`, `sections`, etc.) as needed.
- Exit once at the end.
C/C++:
int main(void) {
// serial initialization
#pragma omp parallel
{
// multiple parallel sections of work
#pragma omp for
for (int i = 0; i < N; ++i) {
// ...
}
#pragma omp single
{
// some one-time operation
}
#pragma omp for
for (int j = 0; j < M; ++j) {
// ...
}
}
// serial finalization
}

Advantages:
- Reduces parallel region overhead.
- Simplifies reasoning about threads (one team for main computation).
Multiple structured regions
Another idiom: several structured parallel regions separated by serial phases, e.g., I/O or setup that must not run concurrently:
// serial input
#pragma omp parallel
{
// parallel compute phase 1
}
#pragma omp parallel
{
// parallel compute phase 2
}
// serial output

Use this pattern when it matches the structure of your computation and when serial phases are logically necessary.
Practical Tips for Working with Parallel Regions
- Know when you’re in parallel: use `omp_in_parallel()` to check at runtime:
if (omp_in_parallel()) {
// we're inside a parallel region
}

- Avoid unintentional nested parallelism:
- Sometimes libraries you call may use OpenMP internally.
- Calling them from inside your own parallel region can lead to oversubscription (too many threads).
- Pinning and affinity:
- Thread placement on cores (affinity) can affect performance.
- Environment variables like `OMP_PROC_BIND` and `OMP_PLACES` control this in many implementations (see the short sketch after this list).
- Debugging:
- Start with a small number of threads.
- Print thread IDs and variable values to understand how code behaves in parallel regions (for small test cases).
- Reproducibility:
- Be aware that different thread counts or scheduling policies can change execution order inside a parallel region, affecting floating-point rounding and sometimes results.
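For the affinity point in the list above, a small sketch of how a program might check which binding policy is in effect (run it with, for example, OMP_PROC_BIND=spread set in the environment; the printed labels are just illustrative):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* omp_get_proc_bind() reports the thread-binding policy that will
       apply to the next parallel region (e.g., as set via OMP_PROC_BIND). */
    omp_proc_bind_t bind = omp_get_proc_bind();

    switch (bind) {
        case omp_proc_bind_false:  printf("binding: false\n");  break;
        case omp_proc_bind_true:   printf("binding: true\n");   break;
        case omp_proc_bind_close:  printf("binding: close\n");  break;
        case omp_proc_bind_spread: printf("binding: spread\n"); break;
        default:                   printf("binding: other\n");  break;
    }
    return 0;
}
```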
Small Exercise Ideas
To solidify understanding of parallel regions:
- Hello, ID:
- Write a program with a parallel region where each thread prints its ID and the total number of threads.
- Shared vs private:
- Create a variable outside the region and one inside.
- Print their addresses and values from each thread to observe shared vs. private behavior.
- Overhead test:
- Time a loop that repeatedly enters/exits a small parallel region.
- Compare to a version with one big region and a `for` work-sharing construct inside.
These small experiments highlight how parallel regions behave and why their structure affects both correctness and performance.