Types of parallelism

Overview

In parallel computing, the central question is not only how many resources you use, but how you split the work. The way you decompose your problem into pieces for multiple cores, nodes, or GPUs is called a parallelism model or type of parallelism.

This chapter focuses on two fundamental types that appear everywhere in HPC: task parallelism and data parallelism. Almost every parallel application is built from some combination of these two. Understanding them helps you reason about algorithm design, performance, and which programming tools to use.

What “types of parallelism” mean

A parallel program coordinates multiple activities at once. These activities can be thought of as:

  1. Different kinds of work operating on the same or different data.
  2. Similar operations repeated across many pieces of data.

The first viewpoint leads to task parallelism. The second leads to data parallelism. Both answer the same fundamental question in different ways: how do we divide the original sequential work into smaller units that can be executed concurrently?

The division strategy has consequences for:

Programming style. How you write loops, functions, and communication patterns.

Data layout. How arrays and data structures are stored and accessed in memory or across nodes.

Synchronization. When and how parallel workers must coordinate or wait for each other.

Performance. What limits your speedup, for example, communication, load imbalance, or memory bandwidth.

Later chapters will introduce concrete APIs such as OpenMP, MPI, CUDA, and hybrid models. Here the goal is to recognize the patterns those tools implement.

Task parallelism

Task parallelism focuses on executing different tasks at the same time. A “task” is usually a distinct computation or operation that can run independently or with limited communication.

Conceptual view

Imagine an application that has to perform several logically different computations, for example:

Reading input data from disk.

Preprocessing or filtering the data.

Running a simulation or model.

Writing results back to disk.

In a purely sequential program, these stages happen one after another. With task parallelism, you look for tasks that:

Are logically separate.

Can start before others finish, or can run in parallel with them.

Have few or well-defined dependencies.

In code, a task might be a function call, a subroutine, or a block of work that a task-based runtime system executes. The key is that each task represents a unit of work rather than a specific chunk of data.
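As a minimal illustration, here is a sketch in plain C++ (using std::async rather than any dedicated HPC runtime) in which two logically separate tasks run concurrently and a third step waits for both of their results. The function names and workloads are invented for the example.

```cpp
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for two logically separate computations.
std::vector<double> load_input() {
    return std::vector<double>(1'000'000, 1.0);   // pretend this reads from disk
}

double compute_reference_value() {
    double s = 0.0;
    for (int i = 0; i < 1'000'000; ++i) s += 1.0 / (i + 1);  // independent work
    return s;
}

int main() {
    // Two independent tasks: they can run at the same time on different cores.
    auto input_future = std::async(std::launch::async, load_input);
    auto ref_future   = std::async(std::launch::async, compute_reference_value);

    // A third step depends on both results, so it can only start once they are ready.
    std::vector<double> data = input_future.get();
    double reference         = ref_future.get();
    double total = std::accumulate(data.begin(), data.end(), 0.0) + reference;

    std::cout << "total = " << total << "\n";
}
```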

Dependencies between tasks

Tasks are rarely completely independent. They often have data dependencies, for example, task B needs the output of task A. These dependencies form a directed acyclic graph, often called a task dependency graph.

A scheduler or runtime uses this graph to determine which tasks can safely run in parallel at each moment. If tasks are written so that their inputs and outputs are clear and side effects are minimized, more concurrency becomes available.

The maximum amount of concurrency is limited by the longest chain of dependent tasks. Even if you have many cores, tasks that form a strictly sequential chain cannot overlap.
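The longest such chain is often called the critical path. The sketch below, again plain C++ futures with placeholder tasks A, B, C, and D, expresses a tiny dependency graph: B and C both depend only on A and may run concurrently, while D must wait for both.

```cpp
#include <future>
#include <iostream>

// Placeholder tasks; in a real code these would be substantial computations.
int task_A()             { return 1; }
int task_B(int a)        { return a + 10; }
int task_C(int a)        { return a + 100; }
int task_D(int b, int c) { return b + c; }

int main() {
    // A has no dependencies and runs first.
    int a = task_A();

    // B and C both depend only on A, so they may execute concurrently.
    auto fb = std::async(std::launch::async, task_B, a);
    auto fc = std::async(std::launch::async, task_C, a);

    // D depends on B and C; it cannot start before both have finished.
    int d = task_D(fb.get(), fc.get());

    std::cout << "result = " << d << "\n";  // prints 112
}
```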

Granularity of tasks

Task granularity describes how large each task is in terms of work and runtime.

Fine grained task parallelism uses many small tasks. This can expose a lot of potential concurrency, but if tasks are too small, overheads from scheduling and synchronization dominate.

Coarse grained task parallelism uses fewer, larger tasks. This reduces overhead but might limit how well you can use many cores if there are not enough tasks.

Choosing task granularity is a design decision. It depends on the cost of each task, the scheduler overhead, and the typical core count. Modern task-based runtimes often encourage moderately fine grained tasks, but not so fine that each task runs for only microseconds.

Examples of task parallel patterns

Task parallelism appears in many HPC workflows in characteristic patterns.

One example is a pipeline pattern. Suppose you have three stages, A, B, and C, where A reads and preprocesses data, B runs a compute intensive kernel, and C compresses or writes results. While B processes the first chunk of data, A can already start reading the next chunk, and C can finalize the previous one. Different stages run concurrently on different chunks. The work flows through stages like an assembly line.

Another example is multiple independent simulations. If you need to run the same model with many different parameters or random seeds, each simulation can be a task. These tasks have no dependencies on each other, and you can distribute them over many nodes. This is a very simple and effective form of task parallelism, often seen in parameter sweeps and ensemble simulations.
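A sketch of such a parameter sweep, using plain C++ std::async and an invented run_simulation stand-in, might look like this.

```cpp
#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

// Hypothetical stand-in for one simulation run with a given parameter.
double run_simulation(double parameter) {
    double result = 0.0;
    for (int step = 0; step < 100000; ++step)
        result += parameter / (step + 1);      // pretend time-stepping loop
    return result;
}

int main() {
    std::vector<double> parameters = {0.1, 0.5, 1.0, 2.0, 5.0};

    // Each simulation is an independent task with no dependencies on the others.
    std::vector<std::future<double>> runs;
    for (double p : parameters)
        runs.push_back(std::async(std::launch::async, run_simulation, p));

    // Collect the results once all tasks have finished.
    for (std::size_t i = 0; i < runs.size(); ++i)
        std::cout << "parameter " << parameters[i]
                  << " -> result " << runs[i].get() << "\n";
}
```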

A third example is adaptive algorithms. Many numerical methods generate new work dynamically, for instance, adaptive mesh refinement, tree searches, or recursive divide and conquer. Each subdivision creates new tasks. The total number of tasks and their execution times may not be known in advance, which makes task-based runtimes, with their ability to schedule work dynamically, particularly attractive.

Load balancing in task parallelism

In task parallelism, one major goal is to keep all workers busy. If one task takes much longer than others, some cores will sit idle. Task parallel strategies often rely on dynamic load balancing methods such as work stealing, where idle workers take tasks from busier ones.

Task-based designs are therefore a natural fit for irregular problems in which different parts of the computation require different amounts of work and cannot be predicted in advance. The runtime redistributes tasks as the program runs, instead of relying on a static schedule.
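A very simple form of dynamic scheduling (self-scheduling via a shared counter, not full work stealing) can be sketched in plain C++ as follows; the artificial task costs are chosen so that a static split would be imbalanced.

```cpp
#include <atomic>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int num_tasks   = 64;
    const int num_workers = 4;

    std::vector<double> results(num_tasks, 0.0);

    // Shared counter: each idle worker grabs the next unclaimed task.
    std::atomic<int> next_task{0};

    auto worker = [&]() {
        while (true) {
            int t = next_task.fetch_add(1);   // claim a task index atomically
            if (t >= num_tasks) break;        // no tasks left
            // Task cost grows with t, so a static split would leave some workers idle.
            double x = 0.0;
            for (int i = 0; i < (t + 1) * 10000; ++i) x += std::sin(i * 1e-3);
            results[t] = x;
        }
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    std::cout << "last result = " << results.back() << "\n";
}
```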

Data parallelism

Data parallelism focuses on applying the same operation to many pieces of data at the same time. The classic example is a loop that updates every element of an array with the same formula.

Where task parallelism partitions work by kind of computation, data parallelism partitions work by subset of the data.

Conceptual view

Consider an operation

$$
y[i] = a \cdot x[i] + y[i]
$$

for all indices $i$ in some range. In a sequential implementation, you write a loop and update each element one after another. In a data parallel view, every iteration is the same operation applied to a different element, and many iterations can execute concurrently.

You can assign different subsets of indices to different threads, cores, or processes. On a shared memory system, threads might each handle a chunk of the array. On a distributed memory system, each MPI process might own a subset of the array corresponding to some part of the global index space.

The key requirement is that iterations are independent or have only limited, well understood dependencies.
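As a sketch, the plain C++ code below applies the update above in a data parallel way on a shared memory machine: each thread handles one contiguous chunk of the index range. The thread count and array sizes are arbitrary for the example.

```cpp
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    const double a = 2.0;
    std::vector<double> x(n, 1.0), y(n, 3.0);

    const unsigned num_threads = 4;            // arbitrary for the example
    std::vector<std::thread> threads;

    // Every thread applies the same update, y[i] = a * x[i] + y[i],
    // to its own contiguous chunk of the index range.
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = n * t / num_threads;
        std::size_t end   = n * (t + 1) / num_threads;
        threads.emplace_back([&x, &y, a, begin, end]() {
            for (std::size_t i = begin; i < end; ++i)
                y[i] = a * x[i] + y[i];
        });
    }
    for (auto& th : threads) th.join();

    std::cout << "y[0] = " << y[0] << "\n";    // 2.0 * 1.0 + 3.0 = 5
}
```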

Data decomposition

To use data parallelism, you must decompose your data into parts that can be processed in parallel. Common strategies are block and cyclic decompositions in one or more dimensions.

In a block decomposition, each worker gets a contiguous block of indices or elements. This is simple and often cache friendly.

In a cyclic decomposition, workers get elements in a round robin fashion, for instance, process 0 gets indices $0, p, 2p, \dots$, process 1 gets $1, p+1, 2p+1, \dots$, where $p$ is the number of workers. This can improve load balance when different elements require different amounts of compute.
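The two strategies can be summarized by which worker owns which global index. The small sketch below assumes, for simplicity, that the number of elements is a multiple of the number of workers; real libraries use slightly more general formulas.

```cpp
#include <iostream>

// Which worker owns global index i when n elements are split over p workers?
// For simplicity, n is assumed to be a multiple of p here.

// Block decomposition: worker w owns the contiguous range [w * n/p, (w + 1) * n/p).
int block_owner(int i, int n, int p) { return i / (n / p); }

// Cyclic decomposition: indices are dealt out round robin, so worker i % p owns index i.
int cyclic_owner(int i, int p) { return i % p; }

int main() {
    const int n = 12, p = 3;
    std::cout << "index : block owner : cyclic owner\n";
    for (int i = 0; i < n; ++i)
        std::cout << i << " : " << block_owner(i, n, p)
                  << " : " << cyclic_owner(i, p) << "\n";
}
```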

Higher dimensional data, such as two dimensional or three dimensional grids, can be decomposed into blocks or tiles in multiple dimensions. This is especially common in simulations that solve partial differential equations on structured grids.

The choice of decomposition affects not only load balance but also communication patterns between neighboring data subdomains and memory access efficiency.

Data dependencies and locality

For pure data parallel operations that only read and write each element independently, there are no cross element data dependencies. These are often called embarrassingly parallel data operations.

More interesting computations, such as stencils or convolutions, have local data dependencies. For example, updating an element may require reading its nearest neighbors. Data parallelism is still possible, but each worker sometimes needs boundary data that belongs to another worker. This leads to well defined communication patterns, such as halo exchanges in distributed memory settings.

Data parallel designs should aim to maximize data locality, which means that a worker mostly accesses data from its own assigned subset, and minimize communication or remote accesses. Good decompositions align with how the algorithm uses neighbor information.
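The sketch below illustrates this with a one dimensional 3-point stencil on a shared memory machine: each thread updates a contiguous chunk, and the elements at chunk boundaries read neighbor values owned by another thread. In a distributed memory version those boundary reads would become explicit halo exchanges; here a separate output buffer keeps them race free. The coefficients and initial condition are arbitrary.

```cpp
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1000;
    std::vector<double> old_u(n, 0.0), new_u(n, 0.0);
    old_u[n / 2] = 1.0;                     // simple initial condition

    const unsigned num_threads = 4;
    std::vector<std::thread> threads;

    // Each thread updates a contiguous chunk of interior points using a
    // 3-point stencil. Elements at chunk boundaries read values that belong
    // to the neighboring thread's chunk; reading only from the separate
    // old_u buffer keeps those accesses race free.
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = 1 + (n - 2) * t / num_threads;
        std::size_t end   = 1 + (n - 2) * (t + 1) / num_threads;
        threads.emplace_back([&old_u, &new_u, begin, end]() {
            for (std::size_t i = begin; i < end; ++i)
                new_u[i] = 0.25 * old_u[i - 1] + 0.5 * old_u[i] + 0.25 * old_u[i + 1];
        });
    }
    for (auto& th : threads) th.join();

    std::cout << "new_u[n/2] = " << new_u[n / 2] << "\n";  // 0.5
}
```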

Vectorization as fine grained data parallelism

Inside a single core, modern CPUs can perform data parallel operations on small groups of values using vector instructions. While this chapter does not cover low level SIMD details, it is helpful to recognize that vectorization is essentially a very fine grained form of data parallelism.

At different levels of the hardware, data parallelism appears as:

Multiple cores executing the same loop body on different data subsets.

Multiple nodes in a cluster each owning part of a large array.

Vector units in each core operating on several data elements per instruction.

Recognizing data parallel opportunities at the algorithm level usually makes it easier for compilers and runtimes to exploit vectorization and multi core parallelism automatically or with minimal annotations.
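For example, a loop like the following, with contiguous accesses and no dependencies between iterations, is the kind of code most compilers can auto-vectorize at usual optimization levels; the exact behavior depends on the compiler and flags.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Each iteration is independent and the accesses are contiguous, so compilers
// can typically process several elements per vector instruction (details depend
// on the compiler and optimization level, e.g. -O2 or -O3).
void scale_and_shift(std::vector<float>& v, float a, float b) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = a * v[i] + b;
}

int main() {
    std::vector<float> v(1024, 1.0f);
    scale_and_shift(v, 2.0f, 0.5f);
    std::cout << v[0] << "\n";   // 2.5
}
```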

Common data parallel patterns

Many numerical kernels and HPC codes are built around data parallel patterns such as:

Elementwise transforms, in which each element of an array is updated independently.

Reductions, which combine many data elements into a single value or small set of values, for instance, sums, norms, and maxima.

Prefix operations, such as prefix sums, where each output depends on a subset of the input data.

Neighborhood operations, like stencils, convolutions, and smoothing filters, where each element depends on its neighbors.

Although reductions and neighborhood operations introduce dependencies, they still follow a data parallel structure, since the same type of operation is applied everywhere. Efficient parallel implementations rely on organizing the computation so that as many parts as possible can proceed concurrently while managing the necessary synchronization or communication.
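As one common organization, the reduction sketch below lets each thread accumulate a private partial sum over its own chunk, so the main loop needs no synchronization, and only the few partial results are combined at the end. It is plain C++ for illustration; tools such as OpenMP and MPI provide built-in reduction constructs.

```cpp
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    std::vector<double> data(n, 1.0);

    const unsigned num_threads = 4;                   // arbitrary for the example
    std::vector<double> partial(num_threads, 0.0);    // one private partial sum per thread
    std::vector<std::thread> threads;

    // Each thread reduces its own contiguous chunk into its private slot,
    // so the main loop needs no synchronization at all.
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = n * t / num_threads;
        std::size_t end   = n * (t + 1) / num_threads;
        threads.emplace_back([&data, &partial, t, begin, end]() {
            double s = 0.0;
            for (std::size_t i = begin; i < end; ++i) s += data[i];
            partial[t] = s;
        });
    }
    for (auto& th : threads) th.join();

    // Only the handful of partial results is combined sequentially.
    double total = 0.0;
    for (double s : partial) total += s;
    std::cout << "total = " << total << "\n";         // sums to 1,000,000
}
```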

Key idea: Data parallelism means the same operation is performed on different pieces of data concurrently. Task parallelism means different operations or work units run concurrently, possibly on the same or different data.

Comparing task and data parallelism

Task and data parallelism are not mutually exclusive. They describe different ways to map work to resources. Comparing them clarifies which is more natural for a given problem and where their strengths and limitations lie.

In data parallelism, the primary organizing principle is the data structure itself. You ask: how can I divide this array, matrix, or grid among workers so each applies the same kernel locally? Performance is strongly tied to memory access patterns, data locality, and communication along the boundaries of data partitions.

In task parallelism, the primary organizing principle is the workflow or algorithm flow. You ask: what distinct tasks does my algorithm perform, what are their dependencies, and how can they overlap? Performance is strongly tied to task granularity, scheduling overhead, and balancing the workload across workers.

Data parallelism is often easier to reason about when the computation is uniform, for instance, large loops over arrays or regular grids. It also maps naturally to accelerators such as GPUs that are designed to handle many similar operations on different data at once.

Task parallelism is often more flexible for irregular or heterogeneous workloads, for example, graph algorithms, adaptive meshes, or complex workflows with many stages and varying costs. It helps when some parts of the algorithm are inherently more expensive than others.

From the programmer perspective, data parallel code often looks like parallel loops and collective operations over arrays and matrices. Task parallel code often looks like a set of functions or tasks with explicit or implicit dependencies, scheduled by a runtime or by explicit orchestration.

The performance characteristics differ as well. Data parallel code tends to stress memory bandwidth and interconnect bandwidth when data is distributed. Task parallel code tends to stress the scheduler and may suffer from idle time if tasks are poorly balanced.

Despite these differences, both types of parallelism can exploit the same underlying hardware resources. The choice is more about your problem structure and how conveniently you can express it.

Combining task and data parallelism

Most realistic HPC applications use both task and data parallelism at different levels. This combination is sometimes called hybrid parallelism, although that term is also used more specifically for mixes of programming models such as MPI and OpenMP.

A common structure is to first identify large scale tasks, and within each task, use data parallelism to exploit all the cores or accelerators available. Some examples illustrate this layered view.

One example is multiple data parallel simulations. You might need to run many simulations with different parameters. Each simulation is a task. Inside each simulation, the core numerical kernel is data parallel, for example, updating a grid or solving a linear system. Nodes or groups of nodes are assigned different simulations, and within each node, all cores collaborate on the local data parallel work.
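A compact sketch of this layered structure in plain C++: each independent simulation is an outer task, and inside each simulation the elementwise grid update is split over a few threads. All names, sizes, and the update formula are placeholders.

```cpp
#include <cstddef>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

// One "simulation": the outer task. Inside it, the elementwise grid update is
// data parallel and is split across a small number of threads.
double run_simulation(double parameter) {
    const std::size_t n = 400000;
    std::vector<double> grid(n, parameter);

    const unsigned inner_threads = 2;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < inner_threads; ++t) {
        std::size_t begin = n * t / inner_threads;
        std::size_t end   = n * (t + 1) / inner_threads;
        threads.emplace_back([&grid, begin, end]() {
            for (std::size_t i = begin; i < end; ++i)
                grid[i] = 2.0 * grid[i] + 1.0;        // same operation, different data
        });
    }
    for (auto& th : threads) th.join();

    return grid[0];   // stand-in for a real diagnostic of the simulation
}

int main() {
    // Outer level: independent simulations are tasks with no mutual dependencies.
    std::vector<double> parameters = {0.5, 1.0, 2.0};
    std::vector<std::future<double>> tasks;
    for (double p : parameters)
        tasks.push_back(std::async(std::launch::async, run_simulation, p));

    for (std::size_t i = 0; i < tasks.size(); ++i)
        std::cout << "simulation " << i << " -> " << tasks[i].get() << "\n";
}
```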

Another example is multi stage pipelines with data parallel kernels. A workflow might have several stages, each performing different computations on large datasets. Each stage can be viewed as a task, and inside each stage you exploit data parallelism over the dataset. Parallelism exists across stages through pipelining and within stages through data parallel kernels.

A third example is adaptive algorithms with local data parallel work. In adaptive mesh refinement, tasks create, refine, or coarsen mesh regions. Each region, when processed, contains a local data parallel computation over its cells. Tasks determine which regions are processed and in what order, while data parallelism is exploited within each region.

This combination gives you flexibility. Task parallelism helps manage irregularity and dynamic creation of work. Data parallelism ensures that, once a block of data is available, all local resources are used efficiently to process it.

Important design principle: Use task parallelism to organize what to compute and in what order. Use data parallelism to organize how to compute efficiently on large data structures. Most scalable HPC applications mix both.

Choosing a parallelism type for a given problem

When you face a new problem and want to parallelize it, a helpful approach is to look at it through both task and data parallel lenses and see which gives a simpler or more efficient design.

Start by asking whether the work is naturally described as many similar operations over a large dataset, or as a collection of distinct operations with dependencies. If the primary structure is a big loop over data with identical updates, a predominantly data parallel approach is likely appropriate. If the primary structure is a workflow or graph of different computations, a predominantly task parallel approach might be better.

Next, consider regularity and predictability. If each operation on your data takes roughly the same amount of time and the data structure is regular, static data parallel decomposition can give both simplicity and efficiency. If the cost varies significantly between data elements or tasks and you cannot predict it ahead of time, task parallelism with dynamic scheduling may handle load balancing more robustly.

Also consider where communications or shared data occur. If communication happens mostly along known boundaries between partitions, structured data parallel designs can optimize that. If communication patterns are complex, intermittent, or driven by data dependent control flow, task parallel approaches may be easier to manage.

Finally, take into account the programming models and hardware you plan to use. Some tools primarily target data parallelism, others target task parallelism, and many support both. The conceptual pattern you choose should match your tools so that the code remains readable and the runtime system can efficiently exploit the hardware.

Throughout this course, when you see specific technologies such as OpenMP, MPI, GPUs, and hybrid models, you can interpret their constructs partly in terms of task and data parallelism. This conceptual layer makes it easier to understand not only how to use each tool, but why different tools emphasize different styles of parallel computation.
