Understanding Parallel Regions in Shared-Memory Programming
In shared-memory programming with OpenMP, the concept of a parallel region is central. A parallel region is a block of code that is executed by multiple threads at the same time, each thread typically running on its own core of a shared-memory system. In this chapter, we focus on what a parallel region is, how it behaves, and how to use it correctly and effectively in real programs.
The Basic Idea of a Parallel Region
In a typical serial program, there is exactly one flow of control, often called the main thread. When you introduce OpenMP, you can create a team of threads that execute parts of your program concurrently. The part of the program where multiple threads exist and run together is called a parallel region.
In C or C++, the simplest OpenMP parallel region uses a pragma:
#pragma omp parallel
{
    /* code here is executed by multiple threads */
}

In Fortran it looks similar in spirit:
!$omp parallel
! code here is executed by multiple threads
!$omp end parallel

When the program reaches the start of the parallel region, the main thread creates a team of threads, including itself, and all of them execute the code inside the region. When they reach the end of the region, the extra threads are synchronized and then typically destroyed, and execution continues with a single thread.
Inside a parallel region, there are multiple threads executing the same code at the same time. Any access to shared data must be considered carefully because race conditions and incorrect results can easily occur.
Fork–join execution model for parallel regions
When thinking about parallel regions, it is useful to adopt the fork–join model of execution. Even though parallel programming can become complex, the basic execution pattern of OpenMP parallel regions is relatively simple:
- The program starts with a single thread, often called the master thread.
- When a parallel region is encountered, the master thread forks a team of threads.
- All threads in the team execute the code within the parallel region.
- At the end of the region, there is an implicit barrier, where all threads wait until every thread has finished the region.
- The extra threads are then joined back, and execution continues with a single master thread.
This model provides a structured way to define which parts of a program should run in parallel and which should stay serial.
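To make the fork–join pattern concrete, here is a minimal sketch of a full C program (compiled with OpenMP support, for example -fopenmp with GCC) that queries the runtime before, inside, and after a parallel region; omp_get_num_threads() and omp_in_parallel() are standard OpenMP runtime calls:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Fork has not happened yet: one thread, not inside a parallel region. */
    printf("before: %d thread(s), in parallel = %d\n",
           omp_get_num_threads(), omp_in_parallel());

    #pragma omp parallel
    {
        /* Every thread of the team executes this block. */
        printf("inside: %d thread(s), in parallel = %d\n",
               omp_get_num_threads(), omp_in_parallel());
    }   /* implicit barrier, then join back to a single thread */

    /* After the join: one thread again. */
    printf("after: %d thread(s), in parallel = %d\n",
           omp_get_num_threads(), omp_in_parallel());
    return 0;
}

The "before" and "after" lines report a single thread outside the region, while the "inside" lines appear once per thread of the team.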
Creating and identifying threads inside a parallel region
Inside a parallel region, each thread has a unique identifier, often used to assign different work to different threads. In OpenMP, you can obtain this identifier with omp_get_thread_num() and the total number of threads with omp_get_num_threads().
Here is a simple example in C illustrating a parallel region that prints a message from each thread:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d out of %d\n", tid, nthreads);
    }
    return 0;
}

Each thread prints its own message, and the output lines can appear in any order, since the threads run concurrently.
The exact number of threads in a parallel region depends on runtime settings and environment variables. You can control this with calls such as omp_set_num_threads(n) or with environment variables such as OMP_NUM_THREADS. The scheduler chapter for OpenMP threads and performance considerations will cover tuning of thread counts in more detail.
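As a small illustration of these controls, the following sketch requests a team size programmatically; the value 4 is arbitrary, and the runtime may still provide fewer threads:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Ask for a team of 4 threads for subsequent parallel regions;
       the runtime may still deliver fewer. */
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}

The same request can be made without recompiling by leaving the call out and setting the environment variable instead, for example OMP_NUM_THREADS=4 ./a.out in a Unix-like shell.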
Default data sharing within a parallel region
In a shared memory program, some variables are shared between all threads, while others are private to each thread. A parallel region is the main place where these properties are specified and take effect.
By default in OpenMP, variables declared before a parallel region are shared among all threads, while (non-static) variables declared inside the region are private to each thread. These defaults can be modified with clauses on the parallel directive.
For example, you can explicitly declare variables as private or shared:
int x = 10;
int y = 0;
#pragma omp parallel private(y) shared(x)
{
    int tid = omp_get_thread_num();
    y = tid;
    /* x is shared, y is private */
}
In this example, the variable x has one copy that all threads see. Variable y is private, so each thread has its own independent copy of y, and changes made by one thread do not affect the others. Note that a private copy is uninitialized on entry to the region; that is harmless here because each thread assigns to y before reading it.
Always be explicit about data sharing in nontrivial parallel regions. Relying on default sharing rules can easily lead to subtle bugs and race conditions.
A parallel region can also use a default clause to control default data sharing behavior, for example default(shared) or default(none). The default(none) choice forces you to declare everything explicitly and is often recommended for safety.
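As a short sketch of that recommendation (the variable names are illustrative), default(none) makes the compiler reject any reference to an outer variable whose sharing is not stated explicitly:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 1000;   /* must be listed explicitly because of default(none) */

    #pragma omp parallel default(none) shared(n)
    {
        int tid = omp_get_thread_num();   /* block-scope, so private to each thread */
        printf("thread %d sees n = %d\n", tid, n);
        /* Referencing any other outer variable here without a clause
           would now be a compile-time error. */
    }
    return 0;
}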
Nested parallel regions
An OpenMP program can contain parallel regions inside other parallel regions, which is called nested parallelism. A nested parallel region is created when a parallel directive appears inside an already active parallel region.
For example:
#pragma omp parallel
{
    /* outer team of threads */
    #pragma omp parallel
    {
        /* possible inner team of threads */
    }
}
In practice, nested parallelism may be disabled by default or limited by the runtime or system configuration. When nested parallelism is disabled, an inner parallel region is executed by a team consisting of just one thread, the thread that encountered it, so it behaves as if it were serial. If nested parallelism is enabled, a new team may be created at the inner region, which can quickly multiply the total number of threads.
This behavior makes nested parallelism powerful but also potentially dangerous in terms of oversubscribing cores and wasting resources. The use of nested parallelism is a more advanced topic and is usually only appropriate in particular algorithms or when combining libraries that use internal parallel regions.
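The following sketch, which assumes an OpenMP 3.0 or later runtime, enables two levels of active parallelism explicitly and labels each thread with its nesting level; the team sizes of 2 are arbitrary:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Allow up to two nested levels of active parallelism. */
    omp_set_max_active_levels(2);

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)
        {
            /* omp_get_level() reports the current nesting depth. */
            printf("outer thread %d, inner thread %d, level %d\n",
                   outer, omp_get_thread_num(), omp_get_level());
        }
    }
    return 0;
}

If the runtime keeps nested parallelism disabled, each inner region simply runs with a single thread, matching the fallback behavior described above.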
Controlling the structure of parallel regions
Parallel regions are flexible, and you can control how they are entered and how many threads they use. Some useful clauses for the parallel directive include:
num_threads(n) specifies the number of threads for that region:
#pragma omp parallel num_threads(4)
{
    /* up to 4 threads will run this region */
}
if(condition) controls whether a region is truly parallel. If the condition is false, the region executes as if it were serial, which can be useful for small problems where overhead would dominate:
#pragma omp parallel if(n > 1000)
{
    /* parallel only if n is large enough */
}

These controls allow you to match the structure of your parallel regions to the characteristics of your problem and the machine you are using.
Structured vs unstructured parallel regions
Parallel regions in OpenMP are structured. This means that the parallel block begins at a specific point in the program and ends at a matching point, and all threads must enter and leave the region in a coordinated way. A structured parallel region has these properties:
- The region is a single lexical block of code that begins at the parallel directive and ends at the end of the associated structured block (in Fortran, at the matching end directive).
- There is an implicit barrier at the end of the region; unlike the barriers of work-sharing constructs, it cannot be removed.
- No thread can skip entering or leaving the parallel region through arbitrary control flow such as jumps or returns that cross the boundary.
This structured approach simplifies reasoning about synchronization and data sharing, at the cost of some flexibility. If more sophisticated or unstructured control flow is needed, it is typically built using other constructs or separate function calls inside the region, rather than by breaking the structure of the region itself.
Implicit and explicit barriers in parallel regions
All threads in a team must reach the end of a parallel region before any of them can continue execution after the region. This is called an implicit barrier. It guarantees that when execution continues after the parallel region, the work inside the region has been completed by all threads.
Inside a parallel region, some constructs also introduce implicit barriers. For performance reasons, it is sometimes useful to avoid these barriers when possible, but the end of the parallel region itself always has an implicit synchronization point.
The specifics of synchronization within a region, including explicit barriers, are covered more fully when discussing synchronization mechanisms, but you should keep in mind that every parallel region has at least that final barrier.
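As a brief sketch of an explicit synchronization point inside a region (the two "phases" are just placeholders), a barrier directive forces every thread to finish phase one before any thread starts phase two:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        printf("thread %d: phase one\n", tid);

        /* No thread continues past this point until all have reached it. */
        #pragma omp barrier

        printf("thread %d: phase two\n", tid);
    }   /* implicit barrier at the end of the region */
    return 0;
}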
Parallel regions and nested work-sharing constructs
Parallel regions provide the overall framework in which work is divided among threads. Inside a parallel region, you can use work-sharing constructs to split specific loops or sections of code among threads in the team.
For example, a common pattern uses a parallel region combined with a loop work-sharing construct:
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        /* loop iterations are distributed among threads */
    }
}
Here, the parallel region creates the team of threads, and the for construct inside distributes the iterations among them. This separation allows a clear division between creating and managing threads, which is the job of the parallel region, and distributing specific work, which is the job of work-sharing constructs.
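For this common pattern OpenMP also offers a combined form, #pragma omp parallel for, which creates the team and distributes the loop with a single directive. A minimal fragment, assuming an array a of length N declared elsewhere, mirrors the example above:

/* Combined form: creates the team and distributes the loop in one directive. */
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = 2.0 * a[i];   /* each iteration is executed by exactly one thread */
}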
The details of work-sharing constructs themselves, such as how loops are split and how scheduling is controlled, are treated in their own chapter.
Lifetime and reuse of parallel regions
Each time the program encounters a parallel directive, the runtime creates a team of threads, executes the region, and then synchronizes and tears down the team, at least conceptually. In practice, implementations may reuse threads internally to reduce overhead, but from the programmer’s perspective, each parallel region is independent.
This means that variables with local scope inside a parallel region are created anew each time the region is executed, and there is no persistent state between separate parallel regions unless you store it in shared variables or external data structures.
Because creating a parallel region involves some overhead, it is usually more efficient to use fewer, larger parallel regions than many tiny ones. Instead of starting and stopping a parallel region around each small chunk of work, it is often better to create a single region and then use constructs inside it to distribute multiple pieces of work.
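A sketch of that advice, using two independent loops over illustrative arrays a and b: instead of opening one parallel region per loop, a single region can host both work-sharing loops, so the team is created only once.

#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    /* One region, two work-sharing loops: the team is created only once. */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 1.0;
        /* implicit barrier here; the same team then continues */

        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * b[i];
    }
    return 0;
}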
Avoid placing many very small parallel regions in performance critical parts of your program. The overhead of repeatedly creating and joining threads can easily dominate the useful computation.
Summary
Parallel regions are the fundamental building blocks of shared memory parallel programming with OpenMP. They define where in the program execution switches from a single thread to a team of threads and back again. Inside a parallel region, threads can cooperate to perform computations on shared data, provided that data sharing and synchronization are handled correctly.
Understanding how to define parallel regions, control the number of threads, manage data sharing, and reason about the fork–join execution model is essential before you move on to work-sharing constructs, synchronization mechanisms, and performance tuning in shared memory programs.