Overview
Parallel programs can exploit different kinds of “sameness” or “independence” in a problem. When we talk about types of parallelism, we are usually classifying what is being done in parallel:
- Different pieces of work ⇒ often called task (functional) parallelism
- The same work on different data ⇒ often called data parallelism
- Combinations and refinements of these
This chapter focuses on recognizing and structuring these types of parallelism, not on the specific programming models (which are covered later with OpenMP, MPI, GPUs, etc.).
Dimensions of Parallelism
Before focusing on the two main types (task and data), it helps to see that parallelism can be classified along several axes:
- What is parallel?
- Different operations or functions ⇒ task parallelism
- The same operation on many data items ⇒ data parallelism
- Where does parallelism come from?
- From the algorithm structure (independent steps)
- From the data structure (elements can be processed independently)
- Where is it executed?
- Within one shared-memory node (threads)
- Across many nodes (processes, distributed memory)
- On accelerators (GPUs, vector units)
The rest of this chapter focuses on the “what” dimension: task vs data parallelism, and how to recognize and use them.
Task Parallelism (Functional Parallelism)
Task parallelism appears when you can identify distinct units of work that can proceed (partly or fully) independently, possibly on different data.
Characteristics
Typical features:
- Different code paths or functions run in parallel.
- Each task may:
- Have its own control flow and logic.
- Operate on different or overlapping data.
- Start and finish at different times.
- Coordination is often needed to:
- Start tasks in the right order.
- Exchange intermediate results.
- Detect when all tasks are finished.
Conceptually, you are parallelizing over operations or activities, not primarily over individual data elements.
Examples of Task Parallelism
- Independent simulation scenarios
You run several different configurations of a model:
- Scenario A: parameter set 1
- Scenario B: parameter set 2
- Scenario C: parameter set 3
Each scenario is a separate task: same code, different inputs. They can run in parallel with almost no communication; a minimal sketch follows this list of examples.
- Workflow or pipeline stages
A typical scientific pipeline:
- Preprocess raw data.
- Run a simulation.
- Postprocess and visualize results.
When handling many datasets, you can pipeline them, so that at any given moment, for example:
- Dataset 0 is in stage 3 (postprocessing)
- Dataset 1 is in stage 2 (simulation)
- Dataset 2 is in stage 1 (preprocessing)
Different stages for different datasets run in parallel (a pipeline of tasks).
- Multi-physics or coupled models
In coupled simulations (e.g., fluid dynamics + structural mechanics), you may run:
- CFD solver (fluid)
- FEM solver (structure)
as separate tasks that periodically exchange boundary conditions and loads.
- Service-style workloads
Multiple independent jobs (e.g., alignment of many genomes, training multiple machine learning models with different hyperparameters) are often scheduled as separate tasks.
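As an illustration of the first example above (independent scenarios), here is a minimal sketch in Python using concurrent.futures. The run_scenario function and the parameter sets are hypothetical stand-ins for a real model; the point is only that each task has its own inputs and result and needs no communication with the others.

```python
# Task parallelism sketch: independent scenarios run as separate tasks.
# run_scenario and the parameter sets are hypothetical stand-ins.
from concurrent.futures import ProcessPoolExecutor

def run_scenario(params):
    # Stand-in for a real simulation: same code, different inputs,
    # no communication between scenarios while they run.
    growth, steps = params
    value = 1.0
    for _ in range(steps):
        value *= growth
    return value

scenarios = {
    "A": (1.01, 10_000),   # parameter set 1
    "B": (1.02, 20_000),   # parameter set 2
    "C": (1.005, 5_000),   # parameter set 3
}

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        futures = {name: pool.submit(run_scenario, p)
                   for name, p in scenarios.items()}
        for name, fut in futures.items():
            print(f"Scenario {name}: result = {fut.result():.3e}")
```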
When Task Parallelism Works Well
Task parallelism is effective when:
- Tasks are coarse-grained (each does a significant amount of work).
- Tasks are relatively independent, with limited communication.
- You care about throughput (many tasks completed per unit time) more than about minimizing the time of a single task.
Many “embarrassingly parallel” problems are naturally expressed as collections of independent tasks.
Challenges
- Load imbalance: some tasks take longer than others, leaving some resources idle.
- Dependencies: some tasks must wait for data or results from others.
- Scheduling: deciding which tasks to run where and when can affect performance.
In practice, you often need to design tasks to be similar in cost, or use dynamic scheduling mechanisms to distribute them efficiently.
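As a minimal illustration of dynamic scheduling, the sketch below submits tasks of unequal (made-up) costs to a small pool; workers pick up new tasks as soon as they finish earlier ones, which smooths out the imbalance without any manual assignment of tasks to workers.

```python
# Dynamic scheduling sketch: tasks of uneven cost drawn from a shared pool.
# The "costs" are made up and simulated with sleep.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def task(cost):
    time.sleep(cost)          # stand-in for a variable amount of work
    return cost

costs = [0.5, 0.1, 0.4, 0.2, 0.3, 0.1, 0.6, 0.2]  # uneven task sizes

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(task, c) for c in costs]
    for fut in as_completed(futures):                # results arrive as tasks
        print(f"finished task of cost {fut.result()}")  # finish, not in order
```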
Data Parallelism
Data parallelism appears when the same operation is applied to many data items that are largely independent.
Characteristics
Typical features:
- One conceptual operation (or small set of operations) is repeated; only the data items differ.
- You can partition your dataset into chunks and process each chunk independently or with limited communication.
- The degree of parallelism scales with the size of the data (more elements ⇒ more potential parallelism).
Conceptually, you are parallelizing over data elements, not over distinct functions.
Examples of Data Parallelism
- Element-wise array operations
Consider updating each cell of a large array:
- A[i] = B[i] + C[i] for all valid i.
- T[i] = f(T_old[i]) for each grid point in a simulation.
Each element update can be done independently (or nearly so) in parallel; a minimal sketch follows this list of examples.
- Matrix operations
Many linear algebra operations are data-parallel:
- Matrix–vector and matrix–matrix multiplication.
- Vector norms, dot products (with small reductions at the end).
- Element-wise nonlinear functions (e.g., activation functions in ML).
- Image and volume processing
Operations like filtering, thresholding, or point-wise transformations:
- Each pixel or voxel is processed using a local neighborhood or only its own value.
- Parallelism arises from the large number of pixels/voxels.
- Monte Carlo methods
Many independent random samples:
- Generate $N$ random paths.
- Evaluate a function on each path.
Each sample evaluation is a data element processed in parallel.
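As an illustration of the first example above (element-wise array operations), here is a minimal Python sketch that partitions two arrays into chunks and applies the same A[i] = B[i] + C[i] update to each chunk in parallel. The chunk size and data are illustrative only.

```python
# Data parallelism sketch: the same element-wise operation applied
# to chunks of a large array in parallel. Sizes are illustrative.
from concurrent.futures import ProcessPoolExecutor

def add_chunk(args):
    b_chunk, c_chunk = args
    # Same operation on every element; only the data differs per chunk.
    return [b + c for b, c in zip(b_chunk, c_chunk)]

if __name__ == "__main__":
    n, chunk = 400_000, 100_000
    B = [float(i) for i in range(n)]
    C = [2.0 * i for i in range(n)]

    pieces = [(B[i:i + chunk], C[i:i + chunk]) for i in range(0, n, chunk)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(add_chunk, pieces))
    A = [x for part in results for x in part]   # stitch the chunks back together
    print(A[:3], A[-1])
```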
Data Decomposition
To exploit data parallelism, you typically decompose the data:
- 1D decomposition: split a long array into contiguous segments.
- 2D/3D decomposition: split a grid or domain into blocks/tiles/subdomains.
Each processing element (thread, process, GPU block, etc.) handles one or more chunks.
The choice of decomposition affects:
- Load balance (are all chunks similar in size and cost?).
- Communication (do boundary elements need halo exchanges?).
- Memory locality (is each chunk stored contiguously for better cache behavior?).
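As a concrete illustration, the sketch below computes a balanced 1D block decomposition of n elements over p processing elements. Distributing the remainder over the first few chunks is one common choice, not the only one.

```python
# 1D block decomposition sketch: split n elements across p processing
# elements as evenly as possible.
def block_ranges(n, p):
    base, rem = divmod(n, p)
    ranges, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < rem else 0)  # first `rem` chunks get one extra element
        ranges.append((start, start + size))
        start += size
    return ranges

print(block_ranges(10, 4))   # [(0, 3), (3, 6), (6, 8), (8, 10)]
```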
Operations That Fit Data Parallelism
The following patterns naturally map to data parallelism:
- Map: apply a function independently to each element (e.g., y[i] = f(x[i])).
- Stencil: update each element based on its neighbors (e.g., finite difference schemes).
- Reduce: combine many values into one (sum, max, min, etc.).
- Scan / prefix sum: cumulative operations over arrays.
Parallel implementations of these patterns are the basis of many HPC kernels and also appear in GPU programming and vectorization.
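For reference, here are serial Python versions of the four patterns on a small illustrative array. Parallel implementations distribute the element loop (map, stencil) or the pairwise combination (reduce, scan), but the data-access structure stays the same.

```python
# Serial reference versions of the map, stencil, reduce, and scan patterns.
x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Map: y[i] depends only on x[i]
y = [v * v for v in x]

# Stencil: each interior element reads a small neighborhood
s = [x[i - 1] + 2 * x[i] + x[i + 1] for i in range(1, len(x) - 1)]

# Reduce: many values combined into one
total = sum(y)

# Scan (inclusive prefix sum): element i holds the sum of x[0..i]
scan, running = [], 0.0
for v in x:
    running += v
    scan.append(running)

print(y, s, total, scan)
```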
Comparing Task and Data Parallelism
Although task and data parallelism are often presented as distinct, real applications almost always contain both.
Conceptual Difference
- Task parallelism:
- Units: distinct activities or functions.
- Parallelism source: independence between activities.
- Typical goal: execute many different or independent jobs at once; pipeline workflow stages.
- Data parallelism:
- Units: data elements (array entries, particles, grid cells).
- Parallelism source: independence between data items or local neighborhoods.
- Typical goal: speed up a large, uniform computation on a dataset.
Common Misconceptions
- “Task parallel = MPI, data parallel = OpenMP”
Not necessarily. Both task and data parallelism can be implemented with many models. For example:
- MPI can implement data-parallel domain decomposition.
- Shared-memory threads can process distinct tasks.
- GPUs are strongly geared towards data parallelism, but can also handle some task-like patterns.
- “Task parallel is coarse, data parallel is fine-grained”
Often true in practice, but not a strict rule; you can have coarse-grained data-parallel chunks and fine-grained tasks.
Examples of Mixed Use
Most real HPC applications mix task and data parallelism:
- Multiple simulations, each with a parallel solver
- Task level: each simulation scenario is a separate task.
- Data level: inside each simulation, arrays and grids are updated with data-parallel kernels.
- Pipeline with parallel stages
- Task level: stages of the pipeline (preprocess, simulate, analyze).
- Data level: each stage internally uses data-parallel algorithms.
Being able to identify both levels helps you design scalable codes and workflows.
Matching Problem Structure to Parallelism Type
Choosing an appropriate type (or mix) of parallelism is mostly about recognizing structure in your problem.
When to Prefer Task Parallelism
Consider a task-centric approach if:
- You have many independent or loosely coupled jobs:
- Parameter sweeps or design-of-experiments.
- Ensemble forecasts or multiple random seeds.
- The control flow differs between parts of the work:
- Some branches run different algorithms.
- Some tasks may skip computations others require.
- You are building workflows or pipelines:
- Data moves through distinct processing stages.
- Communication between tasks is expensive or infrequent.
In such cases, you typically want to maximize throughput by executing as many independent tasks as resources allow.
When to Prefer Data Parallelism
Consider a data-centric approach if:
- Your computation is dominated by large arrays, grids, or matrices.
- You apply similar updates repeatedly over these data structures.
- You care about time-to-solution of a single large problem, not throughput over many small ones.
- You want to exploit hardware features heavily tuned for data parallelism:
- Vector units (SIMD).
- GPUs and other accelerators.
In such cases, you structure computations around data decomposition and parallel kernels.
Hybrid and Hierarchical Parallelism
Modern HPC systems are hierarchically parallel (nodes, cores, vector units, GPUs), and many applications are hierarchically structured as well. This naturally leads to combinations of parallelism types.
Nested Parallelism
You can layer types of parallelism:
- Outer level: task parallelism over independent simulations or pipeline stages.
- Inner level: data parallelism inside each simulation or stage.
This can be extended to multiple nested levels, each matching a hardware or algorithmic level.
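As a structural sketch (not a tuned implementation), the Python example below nests a data-parallel inner level inside a task-parallel outer level. In a real HPC code the inner level would typically be OpenMP threads, SIMD, or a GPU kernel rather than Python threads; the names and sizes here are illustrative.

```python
# Nested parallelism sketch: outer task level (independent scenarios as
# processes), inner data level (each scenario updates its grid in chunks).
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def update_chunk(chunk, factor):
    return [factor * v for v in chunk]          # inner level: same op, different data

def run_scenario(factor):
    grid = [float(i) for i in range(100_000)]
    chunks = [grid[i:i + 25_000] for i in range(0, len(grid), 25_000)]
    with ThreadPoolExecutor() as inner:         # data parallelism inside one task
        parts = inner.map(update_chunk, chunks, [factor] * len(chunks))
    return sum(sum(p) for p in parts)

if __name__ == "__main__":
    factors = [0.5, 1.0, 2.0]
    with ProcessPoolExecutor() as outer:        # task parallelism across scenarios
        for factor, result in zip(factors, outer.map(run_scenario, factors)):
            print(f"scenario factor={factor}: checksum={result:.3e}")
```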
Spatial vs Functional Decomposition
Two common ways of combining parallelism:
- Spatial decomposition (data-centric):
- Divide the domain (e.g., physical space) into subdomains.
- Often maps well to distributed memory and node-level parallelism.
- Functional decomposition (task-centric):
- Assign different physics or processing stages to different components.
- Often maps to coupled codes or workflow systems.
Many large codes use a mix of both: subdomains (data parallel) solved by different solvers or physics modules (tasks), which themselves may be data-parallel internally.
Practical Guidelines
When examining a problem to identify parallelism:
- List independent activities (tasks)
- Are there many parameter sets, time windows, or input files that can be handled separately?
- Are there distinct stages that can overlap in time?
- Inspect your data structures
- Do you operate on large arrays, grids, or collections where operations are similar across elements?
- Can you partition these data structures cleanly?
- Identify dependencies
- Where do results from one part of the work feed into another?
- Are those dependencies between tasks, or just along boundaries of data partitions?
- Match to hardware
- Use data parallelism where vector units or GPUs can help.
- Use task parallelism to exploit many cores on independent or loosely coupled work.
- Expect to mix types
- Rarely is a real application purely task- or data-parallel.
- Plan for a hierarchical design where both types co-exist.
Understanding these types of parallelism at the conceptual level will make it easier to grasp the concrete programming models and performance topics discussed in later chapters.