What makes data parallelism distinct
Data parallelism focuses on performing (roughly) the same operation on many pieces of data at once. The parallelism comes from splitting the data, not from splitting different tasks or algorithm phases.
Key characteristics:
- A large data set (array, grid, image, particle list, etc.)
- The same computation applied to many elements
- Each processing element (core, GPU thread, node, etc.) handles a subset of the data
- Limited or structured interactions between elements (e.g., only neighbors)
This contrasts with task parallelism, where different tasks or functions may run in parallel and might not operate on the same data structures.
Typical examples:
- Element-wise operations on arrays or matrices
- Image processing on pixels
- Simulations on many particles or grid cells
- Applying the same model to many input samples (e.g., mini-batches in ML)
Conceptual model of data parallel execution
Imagine a loop over an array:
for i in range(N):
    a[i] = b[i] + c[i]

In a data-parallel model:
- The loop iterations are conceptually independent.
- The data is partitioned into chunks:
  - Core 0 handles indices 0..N/p-1
  - Core 1 handles indices N/p..2N/p-1
  - …
- All cores perform the same operation on their respective slices.
As a mental model:
$$
\forall i \in \{0, \dots, N-1\}: \quad a_i = f(b_i, c_i)
$$
All $i$ can, in principle, be processed simultaneously, subject to available hardware resources.
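As a minimal sketch of this chunked execution model (using Python's concurrent.futures and NumPy, which are choices made here for illustration, not prescribed by the text), each worker applies the same addition to its own contiguous block of the data:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def add_chunk(args):
    # Every worker runs the same operation on its own slice of the data.
    b_chunk, c_chunk = args
    return b_chunk + c_chunk

def parallel_add(b, c, p=4):
    # Block-partition both inputs into p contiguous chunks (core k gets chunk k).
    b_chunks = np.array_split(b, p)
    c_chunks = np.array_split(c, p)
    with ProcessPoolExecutor(max_workers=p) as pool:
        results = pool.map(add_chunk, zip(b_chunks, c_chunks))
    return np.concatenate(list(results))

if __name__ == "__main__":
    N = 1_000_000
    b, c = np.random.rand(N), np.random.rand(N)
    a = parallel_add(b, c)
    assert np.allclose(a, b + c)
```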
Forms of data parallelism
Element-wise (embarrassingly parallel) operations
Each output element depends only on the corresponding input element(s), with no need for communication between parallel workers.
Examples:
- Scaling a vector: $y_i = \alpha x_i$
- Adding two arrays: $z_i = x_i + y_i$
- Applying a function to each pixel independently (e.g., brightness adjustment)
These are often called embarrassingly parallel because they are very easy to parallelize.
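In NumPy, for instance (used here purely as an illustration), each of these operations is a single vectorized call; because every output element depends only on the corresponding inputs, the runtime is free to evaluate them in any order or in parallel:

```python
import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
alpha = 2.5

scaled = alpha * x                       # y_i = alpha * x_i
summed = x + y                           # z_i = x_i + y_i
brightened = np.clip(x + 0.1, 0.0, 1.0)  # per-element "brightness" adjustment
```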
Neighborhood or stencil operations
Each element depends on a small neighborhood of data. This is still data parallel but requires structured communication or halo/ghost regions.
Examples:
- 1D smoothing filter:
$$
y_i = \frac{1}{3}(x_{i-1} + x_i + x_{i+1})
$$
- 2D 5-point stencil on a grid (e.g., for PDEs):
$$
u_{i,j}^{\text{new}} = \frac{1}{4}\big(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\big)
$$
Parallelization is still based on partitioning the data (rows, columns, or blocks), but boundaries require data exchange.
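A minimal NumPy sketch of the halo idea follows; a sequential loop stands in for the parallel workers, and the one-cell ghost region is an assumption matching the three-point stencil above:

```python
import numpy as np

def smooth(x):
    # Serial reference: three-point average for interior points.
    y = x.copy()
    y[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return y

def parallel_smooth(x, p=4):
    # Each "worker" smooths its block plus one ghost cell on each side,
    # then keeps only its own interior points.
    N = len(x)
    edges = np.linspace(0, N, p + 1, dtype=int)
    out = np.empty_like(x)
    for k in range(p):                       # sequential stand-in for p workers
        lo, hi = edges[k], edges[k + 1]
        halo_lo, halo_hi = max(lo - 1, 0), min(hi + 1, N)
        piece = smooth(x[halo_lo:halo_hi])
        out[lo:hi] = piece[lo - halo_lo : hi - halo_lo]
    return out

x = np.random.rand(1000)
assert np.allclose(parallel_smooth(x), smooth(x))
```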
Reductions and scans (collective data operations)
Sometimes you apply a per-element operation, then combine results:
- Reduction: combine all elements into one (or a few) values.
  - Sum, max, min, logical OR, etc.
- Prefix operations (scan):
  - Prefix sums: $y_i = \sum_{k=0}^i x_k$
These have a data-parallel structure but also require tree-like combination steps. Many parallel frameworks treat them as built-in data-parallel primitives.
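The tree-like structure can be made explicit in a few lines of Python (a sketch of the idea, not a production reduction): within each round, every pairwise combination is independent and could run in parallel, so an array of length $N$ is reduced in roughly $\log_2 N$ rounds.

```python
import numpy as np

def tree_sum(x):
    # Pairwise (tree-like) reduction: in each round, all pairwise additions
    # are independent of each other.
    vals = np.asarray(x, dtype=float)
    while len(vals) > 1:
        if len(vals) % 2:                      # odd length: carry the last element
            vals = np.append(vals[:-1:2] + vals[1::2], vals[-1])
        else:
            vals = vals[::2] + vals[1::2]
    return vals[0]

x = np.random.rand(1000)
assert np.isclose(tree_sum(x), np.sum(x))
print(np.cumsum(x)[:5])                        # prefix sums as a built-in scan
```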
Data decomposition strategies
The central design choice in data parallelism is how to partition the data across resources.
1D decompositions
For 1D arrays or when flattening higher dimensions:
- Block (contiguous) partitioning:
  - Each core gets a contiguous chunk.
  - Good locality, simple indexing.
- Cyclic partitioning:
  - Core 0 gets indices 0, p, 2p, …
  - Core 1 gets indices 1, p+1, 2p+1, …
  - Useful for balancing load when work per element varies.
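The index sets themselves are easy to write down; the small helpers below (names are illustrative, not a standard API) make the difference concrete:

```python
import numpy as np

def block_indices(N, p, rank):
    # Contiguous block partitioning; the first N % p ranks get one extra element.
    counts = [N // p + (1 if r < N % p else 0) for r in range(p)]
    start = sum(counts[:rank])
    return np.arange(start, start + counts[rank])

def cyclic_indices(N, p, rank):
    # Cyclic partitioning: rank r owns r, r + p, r + 2p, ...
    return np.arange(rank, N, p)

print(block_indices(10, 3, 0))   # [0 1 2 3]
print(cyclic_indices(10, 3, 0))  # [0 3 6 9]
```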
2D and higher-dimensional decompositions
For matrices or grids:
- Row-wise or column-wise:
  - Each core gets a set of rows or columns.
  - Simple, but may create communication hotspots for certain algorithms.
- Block (tile) decomposition:
  - Each core gets a sub-block (e.g., a rectangular tile).
  - Often better for locality and balanced communication.
- Block-cyclic:
  - Blocks are distributed cyclically for better load balance in irregular workloads.
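As a small sketch of the tile idea (the helper name and process-grid layout are illustrative assumptions), each process on a pr × pc grid can compute the bounds of the sub-block it owns:

```python
import numpy as np

def tile_bounds(n_rows, n_cols, pr, pc, r, c):
    # Block (tile) decomposition on a pr x pc process grid:
    # process (r, c) owns the sub-block A[r0:r1, c0:c1] of the global matrix.
    row_edges = np.linspace(0, n_rows, pr + 1, dtype=int)
    col_edges = np.linspace(0, n_cols, pc + 1, dtype=int)
    return row_edges[r], row_edges[r + 1], col_edges[c], col_edges[c + 1]

r0, r1, c0, c1 = tile_bounds(1000, 800, 2, 2, 1, 0)
print(r0, r1, c0, c1)   # process (1, 0) owns rows 500:1000, cols 0:400
```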
Choosing the decomposition affects:
- Load balance
- Amount and pattern of communication
- Cache and memory locality
Data parallelism and underlying hardware
Data parallelism maps naturally onto several hardware concepts:
SIMD and vector units
SIMD (Single Instruction, Multiple Data) executes the same instruction across multiple data elements in a single core.
Data parallel loops like:
for (int i = 0; i < N; i++) {
    a[i] = b[i] * c[i] + d[i];
}

are ideal candidates for automatic or manual vectorization. Here, data parallelism exists at the instruction level within one core.
GPUs and massive threading
GPUs are designed for large-scale data parallelism:
- Thousands of lightweight threads
- Often the same kernel function applied to many data elements
- Data partitioned into thread blocks and warps
Typical GPU kernels for scientific computing are data-parallel over grid points, particles, or matrix elements.
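As one hedged illustration from Python, a GPU array library such as CuPy (assumed here, together with a CUDA-capable GPU; the text does not prescribe any particular library) launches the same element-wise kernel over all array elements, each handled by a GPU thread:

```python
# Illustrative only: assumes the cupy package and a CUDA-capable GPU.
import cupy as cp

x = cp.random.rand(10_000_000)
y = cp.random.rand(10_000_000)

z = 2.0 * x + y           # executed as a data-parallel kernel across GPU threads
z_host = cp.asnumpy(z)    # copy the result back to host (CPU) memory
```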
Distributed memory (nodes in a cluster)
Data parallelism at the cluster level:
- Each node (or MPI process) owns a portion of the global data set.
- Operations proceed mostly independently on local data.
- Communication is typically structured:
  - Halo exchanges for stencils
  - Collective operations for reductions
This is conceptually the same idea: each node is applying the same logic to its local slice of a global data structure.
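A minimal mpi4py sketch of this structure (assuming the mpi4py package and an MPI launcher such as mpirun; the slice sizes are illustrative): each rank owns a slice of a conceptual global array and interacts with the others only through one collective reduction.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1_000_000
lo, hi = rank * N // size, (rank + 1) * N // size
x_local = np.arange(lo, hi, dtype=np.float64)        # this rank's slice

local_sum = float(np.sum(x_local ** 2))              # same logic on every rank
global_sum = comm.allreduce(local_sum, op=MPI.SUM)   # collective combine

if rank == 0:
    print("global sum of squares:", global_sum)
```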
Common patterns and idioms in data-parallel codes
Map
Apply a function independently to each element:
$$
y_i = f(x_i)
$$
Example:
for i in range(N):
    y[i] = f(x[i])
Found in many frameworks as map, transform, or vectorized array operations.
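In Python, for example (an illustration, not the only option), the same map can be written serially or handed to a pool of workers without changing its meaning:

```python
import math
from concurrent.futures import ThreadPoolExecutor

x = list(range(16))

y_serial = [math.sqrt(v) for v in x]              # plain map
with ThreadPoolExecutor(max_workers=4) as pool:
    y_parallel = list(pool.map(math.sqrt, x))     # same map, parallel workers

assert y_serial == y_parallel
```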
Map + Reduce
Compute a global property while processing data elements:
- Example: sum of squares
$$
s = \sum_{i=0}^{N-1} x_i^2
$$
Pattern:
- Map: compute local contributions ($x_i^2$)
- Reduce: sum results (possibly in parallel)
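A short sketch of this map + reduce pattern (chunking and worker count are illustrative choices): each worker maps and locally reduces its chunk, and a final reduction combines the partial results.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def partial_sum_of_squares(chunk):
    # Map (square each element) fused with a local reduction over one chunk.
    return float(np.sum(chunk ** 2))

def sum_of_squares(x, p=4):
    chunks = np.array_split(x, p)
    with ProcessPoolExecutor(max_workers=p) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)
    return sum(partials)                 # final reduce over the p partial sums

if __name__ == "__main__":
    x = np.random.rand(100_000)
    assert np.isclose(sum_of_squares(x), np.sum(x ** 2))
```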
Stencil
Each element updated based on nearby elements:
- Common in PDE solvers, image filters, fluid dynamics
- Parallel pattern:
  - Decompose data
  - Exchange halos/ghost cells
  - Apply stencil locally
Gather and scatter
- Gather: collect data elements from various locations into a contiguous structure.
- Scatter: distribute data from a contiguous region to scattered positions.
These are often necessary when the mapping from data indices to physical elements is not 1-to-1 or is irregular (e.g., unstructured meshes).
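In array terms (NumPy fancy indexing is used here as one concrete realization, not something the text mandates), gather and scatter look like this:

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])
idx = np.array([3, 0, 2])                 # irregular index mapping

gathered = values[idx]                    # gather: array([40., 10., 30.])

out = np.zeros(6)
out[idx] = np.array([1.0, 2.0, 3.0])      # scatter: writes to positions 3, 0, 2
```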
Benefits and limitations of data parallelism
Benefits
- Often simpler mental model than complex task graphs:
  - "Same operation on different data."
- Maps well to many levels of modern hardware:
  - SIMD within cores
  - Multicore CPUs
  - GPUs
  - Multi-node clusters
- Many scientific and ML workloads are naturally data-parallel.
Limitations
- Not all algorithms are naturally data-parallel (e.g., highly sequential logic, some graph algorithms).
- Dependencies between data elements (e.g., long-range or irregular interactions) can limit scalability.
- Uneven work per data element can cause load imbalance:
  - E.g., some grid cells or particles require more computation.
Practical considerations when using data parallelism
Identifying safe parallel regions
For a loop or operation to be safely data-parallel:
- Each iteration should not write to the same memory location as another iteration.
- If updates to shared data (e.g., a reduction variable) are needed, they must be done via a well-defined parallel pattern (atomic operations, reductions, etc.).
- Dependencies such as a[i] depending on a[i-1] can break simple data parallelism unless the computation is restructured.
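The contrast is easy to see in a small Python sketch (illustrative, not from the text): the first loop is safely data-parallel, the second has a loop-carried dependency.

```python
import numpy as np

x = np.random.rand(1000)

# Safe to parallelize: each iteration writes its own y[i] and reads only x[i].
y = np.empty_like(x)
for i in range(len(x)):
    y[i] = 2.0 * x[i]

# Not directly data-parallel: a[i] depends on a[i-1] (a loop-carried
# dependency), so the iterations cannot simply be split across workers
# unless the computation is restructured (e.g., as a prefix-sum/scan).
a = np.empty_like(x)
a[0] = x[0]
for i in range(1, len(x)):
    a[i] = a[i - 1] + x[i]
```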
Memory layout and locality
Data parallel performance is strongly affected by how data is laid out:
- A contiguous layout of the parallelized dimension improves:
  - Cache usage
  - Vectorization
  - Memory bandwidth utilization
- Strided access patterns or indirect addressing can hurt performance.
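NumPy's stride information makes the difference visible (a small illustration; the exact byte counts assume a C-ordered float64 array):

```python
import numpy as np

A = np.random.rand(1024, 1024)    # row-major (C order) by default

row = A[0, :]                     # contiguous: consecutive elements 8 bytes apart
col = A[:, 0]                     # strided: consecutive elements a full row apart
print(row.strides, col.strides)   # e.g. (8,) vs (8192,)

# Making the dimension you parallelize over contiguous (here by copying the
# transpose) often helps cache use and vectorization.
A_t = np.ascontiguousarray(A.T)
```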
Granularity
Choosing the right granularity for work units:
- Too fine-grained:
  - Overheads (thread creation, scheduling, communication) dominate.
- Too coarse-grained:
  - May underutilize resources or worsen load balance.
- Often, one maps chunks of data elements to each thread or process, not single elements (especially on CPUs and clusters).
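One concrete granularity knob, assuming Python's concurrent.futures (the function and chunk size below are illustrative), is the chunksize argument of a process pool's map: it batches many elements into each task so per-task overhead is amortized.

```python
import math
from concurrent.futures import ProcessPoolExecutor

def work(v):
    return math.sqrt(v) + math.sin(v)

if __name__ == "__main__":
    data = range(1_000_000)
    with ProcessPoolExecutor() as pool:
        # chunksize ships a batch of elements per task instead of one element
        # at a time, so scheduling and inter-process communication overhead
        # is amortized over each chunk.
        results = list(pool.map(work, data, chunksize=10_000))
```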
Example: simple data-parallel transformation
As a concrete, small-scale example, consider normalizing an array:
$$
y_i = \frac{x_i}{\max_j |x_j|}
$$
High-level data-parallel structure:
- Find the maximum absolute value (reduction).
- Apply the normalization to each element (map).
Pseudo-code:
# Step 1: reduction to find max_abs
max_abs = 0
for i in range(N):
    if abs(x[i]) > max_abs:
        max_abs = abs(x[i])

# Step 2: data-parallel map
for i in range(N):
    y[i] = x[i] / max_abs

In a data-parallel implementation:
- The reduction is executed in parallel using a tree-like scheme.
- The normalization loop is distributed across threads, vector units, GPU threads, or nodes, each processing a chunk of i values.
This illustrates how typical scientific computations naturally decompose into data-parallel steps combined with a few collective operations.
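A sketch of one possible parallel implementation (using Python's concurrent.futures and NumPy as before; chunking and worker count are illustrative) follows the same two steps:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_max_abs(chunk):
    return float(np.max(np.abs(chunk)))      # local reduction on one chunk

def scale_chunk(args):
    chunk, max_abs = args
    return chunk / max_abs                   # local map on one chunk

def normalize(x, p=4):
    chunks = np.array_split(x, p)
    with ProcessPoolExecutor(max_workers=p) as pool:
        # Step 1: parallel partial reductions, combined by a final max.
        max_abs = max(pool.map(chunk_max_abs, chunks))
        # Step 2: parallel map over the same chunks.
        parts = pool.map(scale_chunk, [(c, max_abs) for c in chunks])
    return np.concatenate(list(parts))

if __name__ == "__main__":
    x = np.random.randn(100_000)
    y = normalize(x)
    assert np.isclose(np.max(np.abs(y)), 1.0)
```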