What is a Parallel Region?
In shared-memory programming with OpenMP, a parallel region is a block of code that multiple threads execute concurrently.
Conceptually:
- Outside a parallel region: the program runs with a single thread (the master thread).
- Inside a parallel region: multiple threads exist, and they all enter the region and run (parts of) the code.
- When the parallel region ends: all threads synchronize, and control continues with a single thread.
In OpenMP, parallel regions are introduced using pragmas (directives) in C/C++ and comments/directives in Fortran. The simplest form:
- C/C++:
#pragma omp parallel
{
/* code to run in parallel */
}

- Fortran:
!$omp parallel
! code to run in parallel
!$omp end parallel

Basic Structure and Semantics
Single-thread vs. multi-thread sections
A typical OpenMP program alternates between:
- Serial (single-thread) sections: normal code, executed by one thread.
- Parallel regions: code executed by a team of threads.
Execution timeline:
- Program starts: one thread runs from `main` (C/C++) or the main program (Fortran).
- It encounters a `parallel` directive:
  - A team of threads is created.
  - All threads execute the block associated with the directive.
- End of parallel region:
  - Implicit barrier: all threads wait until every thread finishes the region.
  - Only a single thread continues after the region (the master, conceptually).
Fork-join model
Parallel regions realize the fork-join model:
- Fork: At the start of a parallel region, the master thread spawns worker threads.
- Parallel work: All threads run code in the region.
- Join: At the end of the region, worker threads synchronize and (conceptually) terminate, returning control to the master thread.
In practice, runtime implementations often reuse threads between regions to avoid creation/destruction overhead, but logically the model is fork-join.
Creating a Parallel Region in OpenMP
Minimal example
C/C++:
#include <stdio.h>
#include <omp.h>
int main(void) {
printf("Before parallel region\n");
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("Hello from thread %d\n", id);
}
printf("After parallel region\n");
return 0;
}

Fortran:
program parallel_region_example
use omp_lib
implicit none
print *, 'Before parallel region'
!$omp parallel
print *, 'Hello from thread', omp_get_thread_num()
!$omp end parallel
print *, 'After parallel region'
end program

Key points:
- `#pragma omp parallel` / `!$omp parallel` marks the beginning of the region.
- Each thread executes the code inside the region.
- The runtime library (e.g., `omp_get_thread_num()`) lets you query thread-specific information.
Controlling the Number of Threads
Within a parallel region, you can control how many threads are used.
Environment variable `OMP_NUM_THREADS`
The simplest way is via an environment variable before running the program:
- Bash:
export OMP_NUM_THREADS=4
./a.out

- This sets the default thread count for parallel regions.
`omp_set_num_threads` and `omp_get_num_threads`
In code (to set default for subsequent parallel regions):
- C/C++:
#include <omp.h>
int main(void) {
omp_set_num_threads(8); // request 8 threads for future parallel regions
#pragma omp parallel
{
// ...
}
}

- Fortran:
call omp_set_num_threads(8)
!$omp parallel
! ...
!$omp end parallel

Inside a parallel region:
- `omp_get_num_threads()` returns the number of threads in the current team.
- `omp_get_thread_num()` returns the ID of the calling thread (0 to `num_threads - 1`).
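To see how the setter and the query routines interact, here is a minimal C sketch (the request of 8 threads is arbitrary, and the runtime may grant fewer):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(8);   /* request 8 threads for later parallel regions */

    /* Outside any parallel region the team consists of a single thread. */
    printf("Serial part: %d thread(s)\n", omp_get_num_threads());

    #pragma omp parallel
    {
        /* Inside the region, each thread can query its own ID and the team size. */
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```

Compiled with OpenMP enabled (e.g., -fopenmp for GCC/Clang), the serial line reports 1 thread, while the lines printed inside the region report the team size actually granted.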
Clause `num_threads`
You can override the default for a specific parallel region:
- C/C++:
#pragma omp parallel num_threads(4)
{
// This region will have 4 threads (if possible)
}

- Fortran:
!$omp parallel num_threads(4)
! ...
!$omp end parallel

The effective number of threads can still be constrained by the runtime and system (e.g., resource limits, compiler flags, nesting rules).
Scope of Variables in Parallel Regions
Entering a parallel region affects how variables are shared between threads. The default rules are important to understand to avoid subtle bugs.
Shared vs private
- Shared: All threads see the same memory location.
- Private: Each thread has its own copy (uninitialized unless a clause such as firstprivate initializes it).
Baseline rules (simplified):
- Global / static variables (C/C++), module variables (Fortran): typically shared.
- Local variables declared before the region and used inside it: shared by default.
- Variables declared inside the parallel block, and loop indices of work-sharing loops: private to each thread.

Rather than relying on these defaults, state the intent with explicit clauses.
Common clauses on the parallel directive:
- `shared(list)`: variables in `list` are shared among all threads.
- `private(list)`: each thread gets its own uninitialized copy.
- `firstprivate(list)`: private, but initialized with the value from before the region.
- `default(shared)` / `default(none)`: control the default behavior.
Example (C/C++):
int x = 10;
#pragma omp parallel shared(x)
{
int y = 0; // each thread gets its own y
// x is shared, y is private by being declared inside the block
}

More explicit:
int x = 10;
int y = 20;
#pragma omp parallel shared(x) private(y)
{
// x: same for all threads
// y: each thread has its own uninitialized y
}

Fortran:
integer :: x, y
x = 10
y = 20
!$omp parallel shared(x) private(y)
! x is shared, y is private
!$omp end parallel
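The `firstprivate` clause listed above is easy to confuse with `private`; a minimal C sketch of the difference (the variables `a` and `b` are purely illustrative):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int a = 10;   /* private below: each thread's copy starts uninitialized */
    int b = 20;   /* firstprivate below: each thread's copy starts with 20  */

    #pragma omp parallel private(a) firstprivate(b)
    {
        a = omp_get_thread_num();   /* a must be assigned before it is read  */
        b = b + a;                  /* each thread updates its own copy of b */
        printf("Thread %d: a = %d, b = %d\n", omp_get_thread_num(), a, b);
    }
    return 0;
}
```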
Using default(none) is a good practice in real HPC codes to force explicit scoping:
#pragma omp parallel default(none) shared(x) private(y)
{
// All variables must be in shared/private/etc. clauses
}

Work Distribution Inside a Parallel Region
By default, every thread executes all the code in the parallel block. You typically combine parallel regions with work-sharing constructs (covered in a separate chapter) to divide work among threads, e.g., for / do, sections, single.
However, for understanding parallel regions themselves, keep in mind:
- Without a work-sharing construct, each thread runs the same statements.
- You often branch on the thread ID to specialize behavior:
C/C++:
#pragma omp parallel
{
int id = omp_get_thread_num();
if (id == 0) {
// master-like work
} else {
// worker-like work
}
}

Fortran:
integer :: id   ! declared in the specification part, before executable statements

!$omp parallel private(id)
id = omp_get_thread_num()
if (id == 0) then
! master-like work
else
! worker-like work
end if
!$omp end parallel

Master and Single Execution Within a Parallel Region
Sometimes you want only one thread in the parallel region to execute a piece of code, while still being inside the region and able to use shared data.
Two key constructs (details belong to other subchapters, but their relation to parallel regions is important):
- `master`: only the master thread (ID 0) executes the enclosed block; no implicit barrier at the end.
- `single`: exactly one arbitrary thread executes the block; implicit barrier at the end (unless `nowait` is used).
C/C++ examples:
#pragma omp parallel
{
// Only thread 0 executes this
#pragma omp master
{
printf("This is the master thread.\n");
}
// Exactly one thread (not necessarily 0) executes this
#pragma omp single
{
printf("This is executed once by thread %d\n", omp_get_thread_num());
}
}

These constructs are only meaningful inside a parallel region; they modify how threads participate without leaving the region.
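Because `single` ends with an implicit barrier while `master` does not, the `nowait` clause is what lets one thread perform a side task without holding up the others; a small sketch, assuming the printed message does not need to appear before the remaining work starts:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* One (arbitrary) thread prints a message; nowait removes the
           implicit barrier, so the other threads do not wait for it. */
        #pragma omp single nowait
        {
            printf("Progress message from thread %d\n", omp_get_thread_num());
        }

        /* All threads, including the one above, continue with the work here. */
        printf("Thread %d doing its share of work\n", omp_get_thread_num());
    }
    return 0;
}
```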
Nested Parallel Regions
You can create parallel regions inside other parallel regions. This is called nested parallelism.
Basic structure:
C/C++:
#pragma omp parallel num_threads(2)
{
printf("Outer thread %d\n", omp_get_thread_num());
#pragma omp parallel num_threads(3)
{
printf(" Inner thread %d (outer %d)\n",
omp_get_thread_num(), omp_get_ancestor_thread_num(1));
}
}

Conceptually:
- Outer parallel region: creates an outer team of threads.
- Inner `parallel`: each outer thread may (logically) create its own inner team.
In practice:
- Nested parallelism is often disabled by default in many OpenMP runtimes.
- The runtime can serialize inner parallel regions (treat them as if they had one thread) to avoid explosion in thread count.
- For most beginner and many production HPC codes, you usually avoid nested regions and focus on one level of parallelism.
Enabling nested parallelism (C/C++):
omp_set_nested(1); // or omp_set_max_active_levels(...)

Use nested regions carefully; they interact with performance and scheduling in complex ways.
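As a concrete sketch of how two levels might be enabled with current runtimes (`omp_set_nested` is deprecated since OpenMP 5.0 in favor of `omp_set_max_active_levels`; `OMP_MAX_ACTIVE_LEVELS` is the environment-variable equivalent):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Allow up to two nested levels of active parallelism
       (equivalently: export OMP_MAX_ACTIVE_LEVELS=2 before running). */
    omp_set_max_active_levels(2);

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(3)
        {
            /* omp_get_level() reports the current nesting depth (2 here). */
            printf("Level %d, thread %d\n",
                   omp_get_level(), omp_get_thread_num());
        }
    }
    return 0;
}
```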
Barriers and Synchronization at Region Boundaries
Parallel regions have built-in synchronization behavior:
- At the entry of a parallel region:
- The master thread (conceptually) waits until all worker threads are ready to start the region.
- At the exit of a parallel region:
- There is an implicit barrier: all threads must finish the region before any can proceed beyond it.
- After the barrier, only one thread (the master) continues in the serial code.
This means:
- You do not need an explicit barrier right at the end of a parallel region; it’s implicit.
- If you need to synchronize threads inside a parallel region (before the end), you use explicit constructs (like `barrier`), which are separate topics.
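A minimal sketch contrasting the implicit barrier at the region boundary with an explicit `barrier` inside it:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        printf("Phase 1, thread %d\n", omp_get_thread_num());

        /* Explicit barrier: no thread starts phase 2 until all finish phase 1. */
        #pragma omp barrier

        printf("Phase 2, thread %d\n", omp_get_thread_num());
    }   /* implicit barrier: every thread finishes before the serial code resumes */

    printf("Back to serial execution\n");
    return 0;
}
```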
Parallel Region Overheads and Granularity
Parallel regions are not “free”: starting and finishing them has a runtime cost (thread creation, synchronization, scheduling).
Implications:
- Too many small parallel regions can severely hurt performance.
- Instead of:
// bad pattern
for (int i = 0; i < N; ++i) {
#pragma omp parallel
{
// small amount of work
}
}

you generally want:
// better pattern
#pragma omp parallel
{
#pragma omp for
for (int i = 0; i < N; ++i) {
// work per iteration
}
}

- It is often beneficial to:
- Create fewer, larger parallel regions (“coarse-grained” parallelism).
- Keep threads alive and reuse them for multiple operations, instead of repeatedly entering/exiting regions.
Deciding how many parallel regions to use and how big they should be is part of performance tuning.
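One way to quantify this overhead is to time both patterns with `omp_get_wtime()`; a rough sketch, where `N`, `REPS`, and the trivial loop body are placeholders chosen for illustration:

```c
#include <stdio.h>
#include <omp.h>

#define N    1000
#define REPS 10000

int main(void) {
    double a[N];
    for (int i = 0; i < N; ++i) a[i] = (double)i;

    /* Pattern 1: a new parallel region per repetition (overhead paid REPS times). */
    double t1 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r) {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = 0.5 * a[i] + 1.0;
    }
    t1 = omp_get_wtime() - t1;

    /* Pattern 2: one region, repeated work-sharing loops inside (overhead paid once). */
    double t2 = omp_get_wtime();
    #pragma omp parallel
    {
        for (int r = 0; r < REPS; ++r) {
            #pragma omp for
            for (int i = 0; i < N; ++i)
                a[i] = 0.5 * a[i] + 1.0;
        }
    }
    t2 = omp_get_wtime() - t2;

    printf("Many small regions: %f s\n", t1);
    printf("One large region:   %f s\n", t2);
    return 0;
}
```

The absolute numbers depend heavily on the compiler, runtime, and machine, but the relative gap illustrates the per-region cost.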
Common Patterns of Parallel Region Usage
Single *main* parallel region
A common pattern in HPC codes:
- Initialize in serial.
- Enter one large parallel region.
- Inside it, use work-sharing constructs (`for`, `sections`, etc.) as needed.
- Exit once at the end.
C/C++:
int main(void) {
// serial initialization
#pragma omp parallel
{
// multiple parallel sections of work
#pragma omp for
for (int i = 0; i < N; ++i) {
// ...
}
#pragma omp single
{
// some one-time operation
}
#pragma omp for
for (int j = 0; j < M; ++j) {
// ...
}
}
// serial finalization
}

Advantages:
- Reduces parallel region overhead.
- Simplifies reasoning about threads (one team for main computation).
Multiple structured regions
Another idiom: several structured parallel regions separated by serial phases, e.g., I/O or setup that must not run concurrently:
// serial input
#pragma omp parallel
{
// parallel compute phase 1
}
#pragma omp parallel
{
// parallel compute phase 2
}
// serial output

Use this pattern when it matches the structure of your computation and when serial phases are logically necessary.
Practical Tips for Working with Parallel Regions
- Know when you’re in parallel: use `omp_in_parallel()` to check at runtime:
if (omp_in_parallel()) {
// we're inside a parallel region
}

- Avoid unintentional nested parallelism:
- Sometimes libraries you call may use OpenMP internally.
- Calling them from inside your own parallel region can lead to oversubscription (too many threads).
- Pinning and affinity:
- Thread placement on cores (affinity) can affect performance.
- Environment variables like `OMP_PROC_BIND` and `OMP_PLACES` control this in many implementations (see the short sketch after this list).
- Debugging:
- Start with a small number of threads.
- Print thread IDs and variable values to understand how code behaves in parallel regions (for small test cases).
- Reproducibility:
- Be aware that different thread counts or scheduling policies can change execution order inside a parallel region, affecting floating-point rounding and sometimes results.
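For the affinity point in the list above, a small sketch of how a program might check which binding policy is in effect (run it with, for example, OMP_PROC_BIND=spread set in the environment; the printed labels are just illustrative):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* omp_get_proc_bind() reports the thread-binding policy that will
       apply to the next parallel region (e.g., as set via OMP_PROC_BIND). */
    omp_proc_bind_t bind = omp_get_proc_bind();

    switch (bind) {
        case omp_proc_bind_false:  printf("binding: false\n");  break;
        case omp_proc_bind_true:   printf("binding: true\n");   break;
        case omp_proc_bind_close:  printf("binding: close\n");  break;
        case omp_proc_bind_spread: printf("binding: spread\n"); break;
        default:                   printf("binding: other\n");  break;
    }
    return 0;
}
```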
Small Exercise Ideas
To solidify understanding of parallel regions:
- Hello, ID:
- Write a program with a parallel region where each thread prints its ID and the total number of threads.
- Shared vs private:
- Create a variable outside the region and one inside.
- Print their addresses and values from each thread to observe shared vs. private behavior.
- Overhead test:
- Time a loop that repeatedly enters/exits a small parallel region.
- Compare to a version with one big region and a `for` work-sharing construct inside.
These small experiments highlight how parallel regions behave and why their structure affects both correctness and performance.