
Parallel regions

What is a Parallel Region?

In shared-memory programming with OpenMP, a parallel region is a block of code that multiple threads execute concurrently.

Conceptually:

  • Outside the region, a single thread executes the program.
  • On entering the region, a team of threads is formed, and every thread in the team executes the enclosed block.
  • At the end of the region, the threads synchronize and execution continues with a single thread.

In OpenMP, parallel regions are introduced using pragmas (directives) in C/C++ and directive comments (lines beginning with !$omp) in Fortran. The simplest form:

C/C++:

  #pragma omp parallel
  {
      /* code to run in parallel */
  }

Fortran:

  !$omp parallel
    ! code to run in parallel
  !$omp end parallel

Basic Structure and Semantics

Single-thread vs. multi-thread sections

A typical OpenMP program alternates between:

  1. Serial (single-thread) sections: normal code, executed by one thread.
  2. Parallel regions: code executed by a team of threads.

Execution timeline:

  1. Program starts: one thread runs from main (C/C++) or the main program (Fortran).
  2. Encounters a parallel directive:
    • A team of threads is created.
    • All threads execute the block associated with the directive.
  3. End of parallel region:
    • Implicit barrier: all threads wait until every thread finishes the region.
    • Only a single thread continues after the region (the master, conceptually).

Fork-join model

Parallel regions realize the fork-join model:

  • Fork: when the initial thread encounters a parallel directive, it forks a team of threads that execute the region together.
  • Join: at the end of the region, the threads synchronize and join, leaving a single thread to continue.

In practice, runtime implementations often reuse threads between regions to avoid creation/destruction overhead, but logically the model is fork-join.

Creating a Parallel Region in OpenMP

Minimal example

C/C++:

#include <stdio.h>
#include <omp.h>
int main(void) {
    printf("Before parallel region\n");
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("Hello from thread %d\n", id);
    }
    printf("After parallel region\n");
    return 0;
}

Fortran:

program parallel_region_example
  use omp_lib
  implicit none
  print *, 'Before parallel region'
!$omp parallel
  print *, 'Hello from thread', omp_get_thread_num()
!$omp end parallel
  print *, 'After parallel region'
end program

Key points:

  • The code before and after the parallel region is executed by a single thread.
  • Every thread in the team executes the block; omp_get_thread_num() returns an ID between 0 and the team size minus 1.
  • The order of the "Hello from thread" lines is nondeterministic.
  • OpenMP support must be enabled at compile time (e.g., gcc -fopenmp, gfortran -fopenmp).

Controlling the Number of Threads

You can control how many threads execute a parallel region in several ways.

Environment variable `OMP_NUM_THREADS`

The simplest way is via an environment variable before running the program:

  export OMP_NUM_THREADS=4
  ./a.out
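
A quick way to check what the runtime picked up is omp_get_max_threads(), which reports how many threads the next parallel region would use by default (a minimal sketch):

#include <stdio.h>
#include <omp.h>

int main(void) {
    // Prints 4 after `export OMP_NUM_THREADS=4`, for example.
    printf("Default team size: %d\n", omp_get_max_threads());
    return 0;
}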

`omp_set_num_threads` and `omp_get_num_threads`

In code (to set default for subsequent parallel regions):

C/C++:

  #include <omp.h>
  int main(void) {
      omp_set_num_threads(8);  // request 8 threads for future parallel regions
      #pragma omp parallel
      {
          // ...
      }
  }

Fortran:

  call omp_set_num_threads(8)
  !$omp parallel
    ! ...
  !$omp end parallel

Inside a parallel region, omp_get_num_threads() returns the number of threads in the current team and omp_get_thread_num() returns the calling thread's ID. Outside of any parallel region, omp_get_num_threads() returns 1.
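
A minimal sketch illustrating the difference between the two queries inside and outside a region:

#include <stdio.h>
#include <omp.h>

int main(void) {
    // Outside any parallel region the "team" consists of one thread.
    printf("Outside: %d thread(s)\n", omp_get_num_threads());

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        printf("Inside: thread %d of %d\n", id, n);
    }
    return 0;
}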

Clause `num_threads`

You can override the default for a specific parallel region:

C/C++:

  #pragma omp parallel num_threads(4)
  {
      // This region will have 4 threads (if possible)
  }

Fortran:

  !$omp parallel num_threads(4)
    ! ...
  !$omp end parallel

The effective number of threads can still be constrained by the runtime and system (e.g., resource limits, compiler flags, nesting rules).

Scope of Variables in Parallel Regions

Entering a parallel region affects how variables are shared between threads. The default rules are important to understand to avoid subtle bugs.

Shared vs private

Baseline rules (simplified):

  • Variables declared outside the parallel region are shared by default: all threads access the same storage.
  • Variables declared inside the region are private: each thread has its own copy.

Common clauses on the parallel directive:

  • shared(list): the listed variables are shared by the whole team.
  • private(list): each thread gets its own uninitialized copy.
  • firstprivate(list): each thread gets its own copy, initialized from the value the variable had before the region.
  • default(shared) / default(none): set or disable the implicit scoping rule.

Example (C/C++):

int x = 10;
#pragma omp parallel shared(x)
{
    int y = 0;  // each thread gets its own y
    // x is shared, y is private by being declared inside the block
}

More explicit:

int x = 10;
int y = 20;
#pragma omp parallel shared(x) private(y)
{
    // x: same for all threads
    // y: each thread has its own uninitialized y
}

Fortran:

integer :: x, y
x = 10
y = 20
!$omp parallel shared(x) private(y)
  ! x is shared, y is private
!$omp end parallel
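
If each thread's copy should start from the value the variable had before the region, firstprivate can be used instead of private (a small sketch in C):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int y = 20;
    #pragma omp parallel firstprivate(y)
    {
        // Each thread starts with its own copy of y, initialized to 20.
        y += omp_get_thread_num();  // modifies only this thread's copy
        printf("Thread %d: y = %d\n", omp_get_thread_num(), y);
    }
    printf("After the region: y = %d\n", y);  // still 20
    return 0;
}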

Using default(none) is a good practice in real HPC codes to force explicit scoping:

#pragma omp parallel default(none) shared(x) private(y)
{
    // All variables must be in shared/private/etc. clauses
}

Work Distribution Inside a Parallel Region

By default, every thread executes all the code in the parallel block. You typically combine parallel regions with work-sharing constructs (covered in a separate chapter) to divide work among threads, e.g., for / do, sections, single.

However, for understanding parallel regions themselves, keep in mind:

  • Every statement in the region that is not enclosed in a work-sharing construct is executed by every thread.
  • Work can also be divided manually by branching on the thread ID, as in the examples below.

C/C++:

#pragma omp parallel
{
    int id = omp_get_thread_num();
    if (id == 0) {
        // master-like work
    } else {
        // worker-like work
    }
}

Fortran:

! id must be declared before the construct (Fortran declarations cannot
! appear inside a parallel construct) and made private to each thread
integer :: id
!$omp parallel private(id)
  id = omp_get_thread_num()
  if (id == 0) then
    ! master-like work
  else
    ! worker-like work
  end if
!$omp end parallel

Master and Single Execution Within a Parallel Region

Sometimes you want only one thread in the parallel region to execute a piece of code, while still being inside the region and able to use shared data.

Two key constructs (details belong to other subchapters, but their relation to parallel regions is important):

  • master: the block is executed only by the master thread (thread 0); there is no implied barrier.
  • single: the block is executed by exactly one thread, not necessarily thread 0; the other threads wait at an implicit barrier at the end of the block unless nowait is specified.

C/C++ examples:

#pragma omp parallel
{
    // Only thread 0 executes this
    #pragma omp master
    {
        printf("This is the master thread.\n");
    }
    // Exactly one thread (not necessarily 0) executes this
    #pragma omp single
    {
        printf("This is executed once by thread %d\n", omp_get_thread_num());
    }
}

These constructs are only meaningful inside a parallel region; they modify how threads participate without leaving the region.

Nested Parallel Regions

You can create parallel regions inside other parallel regions. This is called nested parallelism.

Basic structure:

C/C++:

#pragma omp parallel num_threads(2)
{
    printf("Outer thread %d\n", omp_get_thread_num());
    #pragma omp parallel num_threads(3)
    {
        printf("  Inner thread %d (outer %d)\n",
               omp_get_thread_num(), omp_get_ancestor_thread_num(1));
    }
}

Conceptually:

  • Each thread of the outer team that encounters the inner parallel directive becomes the master of its own inner team.
  • With num_threads(2) outside and num_threads(3) inside, up to 2 × 3 = 6 threads can be active at once.

In practice:

  • Nested parallelism is often disabled by default; inner regions are then executed by teams of one thread.
  • Enabling it can easily oversubscribe the available cores, so it should be used deliberately.

Enabling nested parallelism (C/C++):

omp_set_nested(1);             // classic API, deprecated since OpenMP 5.0
omp_set_max_active_levels(2);  // preferred: allow up to two active levels
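
The same can also be requested from the environment before the run, mirroring the earlier OMP_NUM_THREADS usage (values below are illustrative):

  export OMP_MAX_ACTIVE_LEVELS=2   # allow two nested levels of parallelism
  export OMP_NUM_THREADS=2,3       # outer teams of 2 threads, inner teams of 3
  ./a.out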

Use nested regions carefully; they interact with performance and scheduling in complex ways.

Barriers and Synchronization at Region Boundaries

Parallel regions have built-in synchronization behavior:

  • There is an implicit barrier at the end of every parallel region.
  • Apart from the fork itself, there is no additional barrier on entry.

This means:

  • When execution continues after the region, all work performed inside it has completed.
  • The continuing thread can safely read any shared data produced by the team.
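
A small sketch that relies on this guarantee: the array is written inside the region and read serially afterwards (the work-sharing for loop is covered in a later chapter).

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    int data[N];

    #pragma omp parallel
    {
        // The team fills the array; each thread handles some iterations.
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            data[i] = i * i;
        }
    }   // implicit barrier: all writes to data are complete here

    // Safe to read data serially after the region.
    long sum = 0;
    for (int i = 0; i < N; ++i) {
        sum += data[i];
    }
    printf("sum = %ld\n", sum);
    return 0;
}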

Parallel Region Overheads and Granularity

Parallel regions are not “free”: starting and finishing them has a runtime cost (thread creation, synchronization, scheduling).

Implications:

  • Entering and leaving a parallel region inside a hot loop multiplies the fork/join and barrier cost by the number of iterations.
  • Prefer fewer, larger regions that contain enough work to amortize that cost.

Instead of:

  // bad pattern
  for (int i = 0; i < N; ++i) {
      #pragma omp parallel
      {
          // small amount of work
      }
  }

you generally want:

  // better pattern
  #pragma omp parallel
  {
      #pragma omp for
      for (int i = 0; i < N; ++i) {
          // work per iteration
      }
  }

Deciding how many parallel regions to use and how big they should be is part of performance tuning.
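
A minimal timing sketch for exploring this trade-off; the array size, repetition count, and workload are placeholders, and the measured difference depends on the runtime (teams are often reused between regions):

#include <stdio.h>
#include <omp.h>

#define N 10000
#define REPS 1000

int main(void) {
    static double data[N];
    double t0, t1;

    // Many small regions: one fork/join per outer iteration.
    t0 = omp_get_wtime();
    for (int rep = 0; rep < REPS; ++rep) {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {
            data[i] += 1.0;
        }
    }
    t1 = omp_get_wtime();
    printf("many small regions: %f s\n", t1 - t0);

    // One large region: a single fork/join around all the work.
    t0 = omp_get_wtime();
    #pragma omp parallel
    {
        for (int rep = 0; rep < REPS; ++rep) {
            #pragma omp for
            for (int i = 0; i < N; ++i) {
                data[i] += 1.0;
            }
        }
    }
    t1 = omp_get_wtime();
    printf("one large region:   %f s\n", t1 - t0);
    return 0;
}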

Common Patterns of Parallel Region Usage

Single *main* parallel region

A common pattern in HPC codes:

C/C++:

int main(void) {
    // serial initialization
    #pragma omp parallel
    {
        // multiple parallel sections of work
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            // ...
        }
        #pragma omp single
        {
            // some one-time operation
        }
        #pragma omp for
        for (int j = 0; j < M; ++j) {
            // ...
        }
    }
    // serial finalization
}

Advantages:

  • The team of threads is forked only once, so the fork/join overhead is paid once.
  • Thread-private state set up at the beginning of the region remains available across all the work-sharing constructs inside it.

Multiple structured regions

Another idiom: several structured parallel regions separated by serial phases, e.g., I/O or setup that must not run concurrently:

// serial input
#pragma omp parallel
{
    // parallel compute phase 1
}
#pragma omp parallel
{
    // parallel compute phase 2
}
// serial output

Use this pattern when it matches the structure of your computation and when serial phases are logically necessary.

Practical Tips for Working with Parallel Regions

Use omp_in_parallel() to find out at run time whether the calling code is already executing inside an active parallel region:

  if (omp_in_parallel()) {
      // we're inside a parallel region
  }
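
One common use is in a library routine that should not open a second region when the caller is already parallel (a hypothetical helper, sketched under that assumption):

#include <omp.h>

/* Hypothetical helper: parallelize only when not already inside a region. */
void scale(double *x, int n, double factor) {
    if (omp_in_parallel()) {
        // Caller is already parallel: avoid nesting and run serially here.
        for (int i = 0; i < n; ++i)
            x[i] *= factor;
    } else {
        // Called from serial code: create our own parallel region.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            x[i] *= factor;
    }
}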

Small Exercise Ideas

To solidify understanding of parallel regions:

  1. Hello, ID:
    • Write a program with a parallel region where each thread prints its ID and the total number of threads.
  2. Shared vs private:
    • Create a variable outside the region and one inside.
    • Print their addresses and values from each thread to observe sharing vs privateness.
  3. Overhead test:
    • Time a loop that repeatedly enters/exits a small parallel region.
    • Compare to a version with one big region and a for work-sharing inside.

These small experiments highlight how parallel regions behave and why their structure affects both correctness and performance.
