Work-sharing constructs

Overview

Work sharing in shared memory parallel programming refers to the way multiple threads cooperate to execute different parts of a computation. In OpenMP, work-sharing constructs are the primary mechanisms that let you divide iterations of loops, independent tasks, and sections of code among threads in a controlled and portable way.

This chapter focuses on the main OpenMP work-sharing constructs, how they differ from each other, and what typical usage patterns look like for beginners. You will see how these constructs interact with parallel regions, but the concepts of threads and parallel regions themselves are assumed to be known from the parent chapter.

The Idea of Work Sharing vs Parallel Regions

A parallel region in OpenMP creates a team of threads. Within that region, all threads, by default, execute the same sequence of instructions. Without additional directives, every thread would run the same code on the same data, which is usually not what you want.

Work-sharing constructs appear inside a parallel region and specify how the code or iterations should be divided among the threads in the team. They do not create threads themselves. Instead, they coordinate how existing threads share the work.

In OpenMP, every work-sharing construct applies to the current team of threads. Threads not selected to run a specific piece of work will skip it and wait at an implied barrier, unless that barrier is explicitly removed.

Key rule: Work-sharing constructs do not create threads. They distribute work among the threads of an existing parallel region.
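
To make the distinction concrete, here is a small sketch (the loop bound N and the helper functions are placeholders, and the for construct used in the second half is introduced in the next section):

#pragma omp parallel
{
    /* No work-sharing construct: every thread in the team runs
       all N iterations, duplicating the work (and racing if they
       write to the same locations). */
    for (int i = 0; i < N; i++) {
        redundant_work(i);
    }

    /* Work-sharing construct: the same iteration range is divided
       among the threads, so each i is handled by exactly one thread. */
    #pragma omp for
    for (int i = 0; i < N; i++) {
        shared_work(i);
    }
}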

The `for` (or `do`) Work-sharing Construct

For many scientific and engineering applications, loops dominate the runtime. The most important OpenMP work-sharing construct is the loop construct, written as #pragma omp for in C and C++, and !$omp do in Fortran. It splits loop iterations among threads.

In C or C++ a basic pattern is:

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}

Here, a team of threads is created by #pragma omp parallel. The #pragma omp for that follows tells the runtime to divide the iterations from 0 to N - 1 among the threads. Each iteration is executed by exactly one thread.

OpenMP guarantees that, for a given loop, each iteration is executed once and only once, and that no two threads execute the same iteration index.

The for or do construct has important clauses that control how iterations are assigned, most notably the scheduling clauses, which are closely tied to performance and load balancing. Fine details of scheduling and performance trade-offs are discussed in performance-oriented chapters, so here we only outline the main kinds.

Scheduling of Loop Iterations

The schedule clause on omp for or omp do determines how loop iterations are mapped to threads. In C or C++ it looks like:

#pragma omp for schedule(kind, chunk)
for (int i = 0; i < N; i++) {
    /* loop body */
}

The two parts are:

  1. The schedule kind, which defines the policy.
  2. The chunk size, which is an optional number controlling granularity.

Common schedule kinds are static, dynamic, and guided.

Static Schedule

With schedule(static), iterations are divided into contiguous chunks and assigned to threads before execution begins. If you do not specify a chunk size, the iterations are split into roughly equal contiguous blocks, one per thread.

Example:

#pragma omp parallel
{
    #pragma omp for schedule(static)
    for (int i = 0; i < N; i++) {
        /* each thread gets a fixed subset of iterations */
    }
}

Static scheduling has very low overhead, because the assignment is computed once. It is usually the best choice when all iterations have similar computational cost.

You can also specify a chunk size, for example schedule(static, 4). The iterations are then divided into fixed-size blocks of 4 and assigned to threads in a round-robin fashion. This gives more interleaving among threads, which can help with some load imbalance patterns, at the cost of a little more overhead.
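
A sketch of the chunked variant; the comment assumes a team of 4 threads for illustration:

#pragma omp parallel
{
    /* Blocks of 4 iterations are dealt out round-robin: with 4
       threads, thread 0 gets iterations 0-3, 16-19, ..., thread 1
       gets 4-7, 20-23, and so on. */
    #pragma omp for schedule(static, 4)
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}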

Dynamic Schedule

With schedule(dynamic), thread assignments are made during execution. Each thread grabs a chunk of iterations, runs them, and when it is done, requests the next chunk.

Example:

#pragma omp parallel
{
    #pragma omp for schedule(dynamic, 4)
    for (int i = 0; i < N; i++) {
        /* loop body may have irregular cost per iteration */
    }
}

If some iterations are much more expensive than others, dynamic scheduling allows faster threads to take on more work when they finish early. This can improve load balance.

The trade-off is higher scheduling overhead and weaker control over memory access patterns, because adjacent iterations might not be executed by the same thread.

Guided Schedule

The guided schedule is a compromise between static and dynamic. The runtime starts by handing out large chunks of iterations, then gradually decreases the chunk size as fewer iterations remain. The aim is to reduce the scheduling overhead of many small chunks while still providing dynamic balancing for the tail of the loop.

Example:

#pragma omp parallel
{
    #pragma omp for schedule(guided)
    for (int i = 0; i < N; i++) {
        /* cost per iteration is irregular */
    }
}

You can also give a minimum chunk size, such as schedule(guided, 1).

Runtime and Auto Schedules

The schedule(runtime) option defers the scheduling policy to a runtime setting, usually controlled via the environment variable OMP_SCHEDULE. This lets you experiment with scheduling without recompiling.
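
A minimal sketch of this workflow; the OMP_SCHEDULE value shown is just one possible choice:

/* In the source, leave the schedule open: */
#pragma omp parallel
{
    #pragma omp for schedule(runtime)
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Before running, select the policy without recompiling, e.g.
     export OMP_SCHEDULE="dynamic,8"
   or
     export OMP_SCHEDULE="static"                               */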

The schedule(auto) option lets the compiler and OpenMP runtime decide the schedule. It leaves room for implementation-specific optimizations, but gives you less direct control.

Important statement: If loop iterations have uniform cost, prefer schedule(static) for lower overhead. If iteration cost varies, consider schedule(dynamic) or schedule(guided) to improve load balance.

Controlling Loop Work Sharing: `nowait` and Barriers

By default, at the end of a for or do work-sharing construct, there is an implied barrier. All threads wait until every thread has finished its assigned iterations before any thread can continue beyond the loop.

You can remove this barrier with the nowait clause:

#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < N; i++) {
        /* work */
    }
    /* some threads may reach this code earlier than others */
    do_more_work();
}

This can improve performance if a barrier is unnecessary, but it requires that later code does not depend on all loop iterations having completed, unless you introduce your own synchronization elsewhere.
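
A typical safe use is two consecutive loops that touch unrelated data, as in the sketch below (N, M, and the arrays are placeholders; the point is that the second loop reads nothing the first one writes):

#pragma omp parallel
{
    /* No barrier after this loop: the next loop does not depend
       on anything written here. */
    #pragma omp for nowait
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

    /* Threads can start taking iterations of this loop as soon as
       they finish their share of the previous one. */
    #pragma omp for
    for (int j = 0; j < M; j++) {
        x[j] = 2.0 * y[j];
    }
}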

Reductions with Work-sharing Loops

A very common usage pattern is to compute a sum, minimum, maximum, or similar reduction over a loop. A naive parallelization of a loop with a shared accumulator variable would create data races. To avoid this, OpenMP provides the reduction clause.

Example of a sum:

double sum = 0.0;
#pragma omp parallel
{
    #pragma omp for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i];
    }
}

Conceptually, each thread gets its own private copy of sum. It accumulates into that private value for the iterations it executes. At the end of the loop, OpenMP combines all private copies into the shared sum using the indicated operator +.
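
As a mental model only, not as recommended code, the reduction above behaves roughly like this hand-written version with a private partial sum that is combined at the end:

double sum = 0.0;
#pragma omp parallel
{
    double local_sum = 0.0;      /* per-thread partial sum */

    #pragma omp for
    for (int i = 0; i < N; i++) {
        local_sum += a[i];
    }

    #pragma omp critical
    {
        sum += local_sum;        /* combined once per thread */
    }
}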

Reductions work with several built-in operators, such as +, *, min, and max, and with several data types. They let you write parallel loops that look similar to serial code, while still being correct and race-free.
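
For example, a maximum can be computed in the same style (a sketch; the array a stands in for whatever data you are scanning, and max reductions in C and C++ require OpenMP 3.1 or later):

double max_val = a[0];
#pragma omp parallel
{
    #pragma omp for reduction(max:max_val)
    for (int i = 1; i < N; i++) {
        if (a[i] > max_val) {
            max_val = a[i];
        }
    }
}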

Reduction rule: Use the reduction clause when a loop with a shared accumulator variable is parallelized with a work-sharing construct. Do not rely on manual synchronization for simple reductions.

The `sections` Work-sharing Construct

Not all work is naturally expressed as a loop. Sometimes, you have a few distinct code blocks that can run in parallel, each performing different tasks. The sections construct is designed for this pattern.

The basic idea is:

  1. Create a parallel region.
  2. Within that region, use sections.
  3. Inside sections, specify each independent task as a section.

Example:

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            compute_pressure();
        }
        #pragma omp section
        {
            compute_temperature();
        }
        #pragma omp section
        {
            compute_velocity();
        }
    }
}

Each section is assigned to a thread in the team. If there are more sections than threads, some threads will execute multiple sections. If there are more threads than sections, some threads will remain idle for this part of the program.

Like for, sections has an implied barrier at the end unless you specify nowait. If you use nowait, threads that finish their sections early can proceed without waiting for others.

sections is useful when you wish to overlap distinct tasks that do not share a simple index space. For instance, one thread might perform I/O, another might prepare data, and another might compute.

The `single` and `master` Constructs

Work sharing also includes the ability to specify that a block of code should be executed by only one thread in a parallel region. OpenMP provides two constructs for this purpose: single and master.

Neither is fully symmetric to for, because they do not divide work across all threads; instead, they select a single thread. Strictly speaking, only single is classified as a work-sharing construct in the OpenMP specification, but master is usually discussed alongside it because both determine which threads execute which statements.

The `single` Construct

single designates a block of code that must be executed by exactly one thread in the team. Which thread executes it is unspecified: any thread in the team may be chosen. All other threads skip the block and, by default, wait at a barrier after the block completes.

Example:

#pragma omp parallel
{
    #pragma omp single
    {
        initialize_data();
    }
    compute_with_data();
}

Only one thread runs initialize_data. The others wait at the end of the single block until the initialization is done; then all threads call compute_with_data. This is useful for one-time work that still needs to happen inside a parallel region, such as I/O, allocation, or setup.

You can remove the implicit barrier with nowait:

#pragma omp parallel
{
    #pragma omp single nowait
    {
        prepare_input();
    }
    /* some threads may proceed even if prepare_input is not done */
    other_independent_work();
}

If you remove this barrier, you must be sure that any later code does not use data that depends on the single region, or that there is some other synchronization mechanism.

The `master` Construct

master specifies that a block of code is executed only by the master thread, that is, the thread that created the team. Other threads skip the block and do not wait implicitly at the end of it.

Example:

#pragma omp parallel
{
    #pragma omp master
    {
        print_status();
    }
    parallel_compute();
}

Here print_status runs only on the master thread. There is no implied barrier at the end of master, so other threads may already be executing parallel_compute while the master is still printing.

master is often used for interactions with external resources that should not be touched by arbitrary threads, for example writing to a log file or calling into external libraries that are not thread-safe.

Difference to remember: single selects any one thread and has an implied barrier at the end by default. master always uses the master thread and never has an implied barrier.

The `task` Construct and Work Sharing

Although tasking is usually considered a more advanced feature and may be covered in more detail elsewhere, it is mentioned briefly here because it also describes how work is divided among threads within a parallel region.

A task defines a logical unit of work that can be scheduled to run by any thread in the team, possibly at a different time from when it was created. Unlike for and sections, tasks are not tied to a loop index or a fixed set of prepared sections.

Conceptually, tasks offer a flexible work-sharing mechanism where the programmer describes what to do, and the runtime decides when and where to execute it.

A very basic pattern looks like this:

#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 0; i < N; i++) {
            #pragma omp task
            process_item(i);
        }
    }
}

Here, one thread, the one executing the single block, creates the tasks, one per item. The OpenMP runtime distributes these tasks across the threads of the team. This is especially useful when work is highly irregular, recursive, or does not conform well to a simple loop with contiguous indices.

Tasking introduces additional concepts such as dependencies and task synchronization, which are typically covered in a dedicated tasking or advanced OpenMP chapter.

Choosing Among Work-sharing Constructs

When you parallelize a piece of code with OpenMP, one of the first design decisions is which work-sharing construct makes the most sense for the structure of your computation.

If the work is a simple loop with independent iterations, use for or do. It is the most natural and efficient choice for typical numeric kernels, such as vector updates, matrix operations, or stencil sweeps.

If you have a small fixed number of independent tasks that are not easily expressed as loop iterations, use sections. Each task goes into its own section block.

If you need one thread to perform initialization or I/O while still being inside a parallel region, use single or master. Use single when any thread can do the work and a barrier is helpful. Use master when the master thread must do it and a barrier is not desired.

If your algorithm has irregular, recursive, or dynamically generated work, and a fixed mapping of iterations to threads is not ideal, use task. This allows the runtime to manage a work queue and balance the load dynamically across the available threads.

Interactions with Data Scoping

All work-sharing constructs inherit data scoping rules from the enclosing parallel region unless you override them. For instance, variables that are shared in the parallel region remain shared in the for or sections construct. Private variables remain private.

There are also clauses such as private, firstprivate, and lastprivate that let you control how variables are handled for each thread in a work-sharing construct. For loop constructs, lastprivate can be used to capture the value from the last logical iteration. The detailed semantics of these clauses and how they relate to data races and correctness belong to synchronization and data scoping topics, but you should be aware that work sharing and data scoping always go together.
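
As a brief illustration, the sketch below uses both clauses on a loop construct; the variables are purely illustrative:

int last_i = -1;
double scale = 2.0;

#pragma omp parallel
{
    /* firstprivate: each thread gets its own copy of scale,
       initialized from the value assigned above.
       lastprivate: after the construct, last_i holds the value
       written by the last logical iteration (i = N - 1). */
    #pragma omp for firstprivate(scale) lastprivate(last_i)
    for (int i = 0; i < N; i++) {
        a[i] = scale * b[i];
        last_i = i;
    }
}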

Common Pitfalls with Work-sharing Constructs

Work-sharing constructs look simple to write, but there are several common pitfalls.

One mistake is to parallelize a loop with dependencies between iterations. For example, if an iteration reads a value written by a previous iteration, then a for construct without additional synchronization is not safe. You must ensure that each iteration is independent or restructure the code.
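
For instance, a loop of the following shape cannot simply be annotated with omp for, because iteration i reads the value produced by iteration i - 1:

/* NOT safe to parallelize with #pragma omp for as written:
   each iteration depends on the result of the previous one. */
for (int i = 1; i < N; i++) {
    a[i] = a[i - 1] + b[i];
}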

Another frequent issue is forgetting that sections, single, and for end with an implicit barrier by default. Unexpected barriers can leave threads idle and reduce performance. Using nowait appropriately can help, but only if it does not break the ordering required for correctness.

Finally, incorrect or missing reductions are a classic source of data races. Whenever several iterations update the same variable, consider whether a reduction clause is needed. If the pattern is not a simple associative operation, you may need critical sections or other synchronization, which are covered in the synchronization mechanisms chapter.

Summary

Work-sharing constructs in OpenMP provide the basic toolkit for dividing work among threads in a shared memory program. They transform a general parallel region with identical execution across threads into a coordinated structure where each thread can perform different pieces of a larger computation.

The for construct splits loop iterations, and its scheduling clause controls how iterations map to threads. Reductions integrate naturally with for to handle accumulations safely. The sections construct handles a small set of unrelated tasks, while single and master ensure that certain code is executed only by one thread. Tasking extends the work-sharing paradigm to irregular and dynamic workloads.

By choosing the appropriate work-sharing construct and combining it with proper data scoping and synchronization, you can express parallel algorithms cleanly and portably in OpenMP and exploit the power of shared memory systems more effectively.
