What makes data parallelism distinct
Data parallelism focuses on performing (roughly) the same operation on many pieces of data at once. The parallelism comes from splitting the data, not from splitting different tasks or algorithm phases.
Key characteristics:
- A large data set (array, grid, image, particle list, etc.)
- The same computation applied to many elements
- Each processing element (core, GPU thread, node, etc.) handles a subset of the data
- Limited or structured interactions between elements (e.g., only neighbors)
This contrasts with task parallelism, where different tasks or functions may run in parallel and might not operate on the same data structures.
Typical examples:
- Element-wise operations on arrays or matrices
- Image processing on pixels
- Simulations on many particles or grid cells
- Applying the same model to many input samples (e.g., mini-batches in ML)
Conceptual model of data parallel execution
Imagine a loop over an array:
for i in range(N):
    a[i] = b[i] + c[i]

In a data-parallel model:
- The loop iterations are conceptually independent.
- The data is partitioned into chunks:
  - Core 0 handles indices 0..N/p-1
  - Core 1 handles indices N/p..2N/p-1
  - …
- All cores perform the same operation on their respective slices.
As a mental model:
$$
\forall i \in \{0, \dots, N-1\}: \quad a_i = f(b_i, c_i)
$$
All $i$ can, in principle, be processed simultaneously, subject to available hardware resources.
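As a minimal sketch of this chunked execution model (using Python's concurrent.futures and NumPy, which are choices made here for illustration, not prescribed by the text), each worker applies the same addition to its own contiguous block of the data:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def add_chunk(args):
    # Every worker runs the same operation on its own slice of the data.
    b_chunk, c_chunk = args
    return b_chunk + c_chunk

def parallel_add(b, c, p=4):
    # Block-partition both inputs into p contiguous chunks (core k gets chunk k).
    b_chunks = np.array_split(b, p)
    c_chunks = np.array_split(c, p)
    with ProcessPoolExecutor(max_workers=p) as pool:
        results = pool.map(add_chunk, zip(b_chunks, c_chunks))
    return np.concatenate(list(results))

if __name__ == "__main__":
    N = 1_000_000
    b, c = np.random.rand(N), np.random.rand(N)
    a = parallel_add(b, c)
    assert np.allclose(a, b + c)
```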
Forms of data parallelism
Element-wise (embarrassingly parallel) operations
Each output element depends only on the corresponding input element(s), with no need for communication between parallel workers.
Examples:
- Scaling a vector: $y_i = \alpha x_i$
- Adding two arrays: $z_i = x_i + y_i$
- Applying a function to each pixel independently (e.g., brightness adjustment)
These are often called embarrassingly parallel because they are very easy to parallelize.
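In NumPy, for instance (used here purely as an illustration), each of these operations is a single vectorized call; because every output element depends only on the corresponding inputs, the runtime is free to evaluate them in any order or in parallel:

```python
import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
alpha = 2.5

scaled = alpha * x                       # y_i = alpha * x_i
summed = x + y                           # z_i = x_i + y_i
brightened = np.clip(x + 0.1, 0.0, 1.0)  # per-element "brightness" adjustment
```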
Neighborhood or stencil operations
Each element depends on a small neighborhood of data. This is still data parallel but requires structured communication or halo/ghost regions.
Examples:
- 1D smoothing filter:
$$
y_i = \frac{1}{3}(x_{i-1} + x_i + x_{i+1})
$$
- 2D 5-point stencil on a grid (e.g., for PDEs):
$$
u_{i,j}^{\text{new}} = \frac{1}{4}\big(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\big)
$$
Parallelization is still based on partitioning the data (rows, columns, or blocks), but boundaries require data exchange.
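A minimal NumPy sketch of the halo idea follows; a sequential loop stands in for the parallel workers, and the one-cell ghost region is an assumption matching the three-point stencil above:

```python
import numpy as np

def smooth(x):
    # Serial reference: three-point average for interior points.
    y = x.copy()
    y[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return y

def parallel_smooth(x, p=4):
    # Each "worker" smooths its block plus one ghost cell on each side,
    # then keeps only its own interior points.
    N = len(x)
    edges = np.linspace(0, N, p + 1, dtype=int)
    out = np.empty_like(x)
    for k in range(p):                       # sequential stand-in for p workers
        lo, hi = edges[k], edges[k + 1]
        halo_lo, halo_hi = max(lo - 1, 0), min(hi + 1, N)
        piece = smooth(x[halo_lo:halo_hi])
        out[lo:hi] = piece[lo - halo_lo : hi - halo_lo]
    return out

x = np.random.rand(1000)
assert np.allclose(parallel_smooth(x), smooth(x))
```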
Reductions and scans (collective data operations)
Sometimes you apply a per-element operation, then combine results:
- Reduction: combine all elements into one (or a few) values.
  - Sum, max, min, logical OR, etc.
- Prefix operations (scan):
  - Prefix sums: $y_i = \sum_{k=0}^i x_k$
These have a data-parallel structure but also require tree-like combination steps. Many parallel frameworks treat them as built-in data-parallel primitives.
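The tree-like structure can be made explicit in a few lines of Python (a sketch of the idea, not a production reduction): within each round, every pairwise combination is independent and could run in parallel, so an array of length $N$ is reduced in roughly $\log_2 N$ rounds.

```python
import numpy as np

def tree_sum(x):
    # Pairwise (tree-like) reduction: in each round, all pairwise additions
    # are independent of each other.
    vals = np.asarray(x, dtype=float)
    while len(vals) > 1:
        if len(vals) % 2:                      # odd length: carry the last element
            vals = np.append(vals[:-1:2] + vals[1::2], vals[-1])
        else:
            vals = vals[::2] + vals[1::2]
    return vals[0]

x = np.random.rand(1000)
assert np.isclose(tree_sum(x), np.sum(x))
print(np.cumsum(x)[:5])                        # prefix sums as a built-in scan
```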
Data decomposition strategies
The central design choice in data parallelism is how to partition the data across resources.
1D decompositions
For 1D arrays or when flattening higher dimensions:
- Block (contiguous) partitioning:
  - Each core gets a contiguous chunk.
  - Good locality, simple indexing.
- Cyclic partitioning:
  - Core 0 gets indices 0, p, 2p, …
  - Core 1 gets indices 1, p+1, 2p+1, …
  - Useful for balancing load when work per element varies.
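The index sets themselves are easy to write down; the small helpers below (names are illustrative, not a standard API) make the difference concrete:

```python
import numpy as np

def block_indices(N, p, rank):
    # Contiguous block partitioning; the first N % p ranks get one extra element.
    counts = [N // p + (1 if r < N % p else 0) for r in range(p)]
    start = sum(counts[:rank])
    return np.arange(start, start + counts[rank])

def cyclic_indices(N, p, rank):
    # Cyclic partitioning: rank r owns r, r + p, r + 2p, ...
    return np.arange(rank, N, p)

print(block_indices(10, 3, 0))   # [0 1 2 3]
print(cyclic_indices(10, 3, 0))  # [0 3 6 9]
```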
2D and higher-dimensional decompositions
For matrices or grids:
- Row-wise or column-wise:
  - Each core gets a set of rows or columns.
  - Simple, but may create communication hotspots for certain algorithms.
- Block (tile) decomposition:
  - Each core gets a sub-block (e.g., a rectangular tile).
  - Often better for locality and balanced communication.
- Block-cyclic:
  - Blocks are distributed cyclically for better load balance in irregular workloads.
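As a small sketch of the tile idea (the helper name and process-grid layout are illustrative assumptions), each process on a pr × pc grid can compute the bounds of the sub-block it owns:

```python
import numpy as np

def tile_bounds(n_rows, n_cols, pr, pc, r, c):
    # Block (tile) decomposition on a pr x pc process grid:
    # process (r, c) owns the sub-block A[r0:r1, c0:c1] of the global matrix.
    row_edges = np.linspace(0, n_rows, pr + 1, dtype=int)
    col_edges = np.linspace(0, n_cols, pc + 1, dtype=int)
    return row_edges[r], row_edges[r + 1], col_edges[c], col_edges[c + 1]

r0, r1, c0, c1 = tile_bounds(1000, 800, 2, 2, 1, 0)
print(r0, r1, c0, c1)   # process (1, 0) owns rows 500:1000, cols 0:400
```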
Choosing the decomposition affects:
- Load balance
- Amount and pattern of communication
- Cache and memory locality
Data parallelism and underlying hardware
Data parallelism maps naturally onto several hardware concepts:
SIMD and vector units
SIMD (Single Instruction, Multiple Data) executes the same instruction across multiple data elements in a single core.
Data parallel loops like:
for (int i = 0; i < N; i++) {
    a[i] = b[i] * c[i] + d[i];
}

are ideal candidates for automatic or manual vectorization. Here, data parallelism exists at the instruction level within one core.
GPUs and massive threading
GPUs are designed for large-scale data parallelism:
- Thousands of lightweight threads
- Often the same kernel function applied to many data elements
- Data partitioned into thread blocks and warps
Typical GPU kernels for scientific computing are data-parallel over grid points, particles, or matrix elements.
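As one hedged illustration from Python, a GPU array library such as CuPy (assumed here, together with a CUDA-capable GPU; the text does not prescribe any particular library) launches the same element-wise kernel over all array elements, each handled by a GPU thread:

```python
# Illustrative only: assumes the cupy package and a CUDA-capable GPU.
import cupy as cp

x = cp.random.rand(10_000_000)
y = cp.random.rand(10_000_000)

z = 2.0 * x + y           # executed as a data-parallel kernel across GPU threads
z_host = cp.asnumpy(z)    # copy the result back to host (CPU) memory
```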
Distributed memory (nodes in a cluster)
Data parallelism at the cluster level:
- Each node (or MPI process) owns a portion of the global data set.
- Operations proceed mostly independently on local data.
- Communication is typically structured:
  - Halo exchanges for stencils
  - Collective operations for reductions
This is conceptually the same idea: each node is applying the same logic to its local slice of a global data structure.
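A minimal mpi4py sketch of this structure (assuming the mpi4py package and an MPI launcher such as mpirun; the slice sizes are illustrative): each rank owns a slice of a conceptual global array and interacts with the others only through one collective reduction.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1_000_000
lo, hi = rank * N // size, (rank + 1) * N // size
x_local = np.arange(lo, hi, dtype=np.float64)        # this rank's slice

local_sum = float(np.sum(x_local ** 2))              # same logic on every rank
global_sum = comm.allreduce(local_sum, op=MPI.SUM)   # collective combine

if rank == 0:
    print("global sum of squares:", global_sum)
```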
Common patterns and idioms in data-parallel codes
Map
Apply a function independently to each element:
$$
y_i = f(x_i)
$$
Example:
for i in range(N):
    y[i] = f(x[i])
Found in many frameworks as map, transform, or vectorized array operations.
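In Python, for example (an illustration, not the only option), the same map can be written serially or handed to a pool of workers without changing its meaning:

```python
import math
from concurrent.futures import ThreadPoolExecutor

x = list(range(16))

y_serial = [math.sqrt(v) for v in x]              # plain map
with ThreadPoolExecutor(max_workers=4) as pool:
    y_parallel = list(pool.map(math.sqrt, x))     # same map, parallel workers

assert y_serial == y_parallel
```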
Map + Reduce
Compute a global property while processing data elements:
- Example: sum of squares
$$
s = \sum_{i=0}^{N-1} x_i^2
$$
Pattern:
- Map: compute local contributions ($x_i^2$)
- Reduce: sum results (possibly in parallel)
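A short sketch of this map + reduce pattern (chunking and worker count are illustrative choices): each worker maps and locally reduces its chunk, and a final reduction combines the partial results.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def partial_sum_of_squares(chunk):
    # Map (square each element) fused with a local reduction over one chunk.
    return float(np.sum(chunk ** 2))

def sum_of_squares(x, p=4):
    chunks = np.array_split(x, p)
    with ProcessPoolExecutor(max_workers=p) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)
    return sum(partials)                 # final reduce over the p partial sums

if __name__ == "__main__":
    x = np.random.rand(100_000)
    assert np.isclose(sum_of_squares(x), np.sum(x ** 2))
```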
Stencil
Each element updated based on nearby elements:
- Common in PDE solvers, image filters, fluid dynamics
- Parallel pattern:
  - Decompose data
  - Exchange halos/ghost cells
  - Apply stencil locally
Gather and scatter
- Gather: collect data elements from various locations into a contiguous structure.
- Scatter: distribute data from a contiguous region to scattered positions.
These are often necessary when the mapping from data indices to physical elements is not 1-to-1 or is irregular (e.g., unstructured meshes).
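In array terms (NumPy fancy indexing is used here as one concrete realization, not something the text mandates), gather and scatter look like this:

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])
idx = np.array([3, 0, 2])                 # irregular index mapping

gathered = values[idx]                    # gather: array([40., 10., 30.])

out = np.zeros(6)
out[idx] = np.array([1.0, 2.0, 3.0])      # scatter: writes to positions 3, 0, 2
```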
Benefits and limitations of data parallelism
Benefits
- Often simpler mental model than complex task graphs:
  - "Same operation on different data."
- Maps well to many levels of modern hardware:
  - SIMD within cores
  - Multicore CPUs
  - GPUs
  - Multi-node clusters
- Many scientific and ML workloads are naturally data-parallel.
Limitations
- Not all algorithms are naturally data-parallel (e.g., highly sequential logic, some graph algorithms).
- Dependencies between data elements (e.g., long-range or irregular interactions) can limit scalability.
- Uneven work per data element can cause load imbalance:
  - E.g., some grid cells or particles require more computation.
Practical considerations when using data parallelism
Identifying safe parallel regions
For a loop or operation to be safely data-parallel:
- Each iteration should not write to the same memory location as another iteration.
- If updates to shared data (e.g., a reduction variable) are needed, they must be done via a well-defined parallel pattern (atomic operations, reductions, etc.).
- Dependencies such as a[i] depending on a[i-1] can break simple data parallelism unless the computation is restructured.
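The contrast is easy to see in a small Python sketch (illustrative, not from the text): the first loop is safely data-parallel, the second has a loop-carried dependency.

```python
import numpy as np

x = np.random.rand(1000)

# Safe to parallelize: each iteration writes its own y[i] and reads only x[i].
y = np.empty_like(x)
for i in range(len(x)):
    y[i] = 2.0 * x[i]

# Not directly data-parallel: a[i] depends on a[i-1] (a loop-carried
# dependency), so the iterations cannot simply be split across workers
# unless the computation is restructured (e.g., as a prefix-sum/scan).
a = np.empty_like(x)
a[0] = x[0]
for i in range(1, len(x)):
    a[i] = a[i - 1] + x[i]
```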
Memory layout and locality
Data parallel performance is strongly affected by how data is laid out:
- A contiguous layout of the parallelized dimension improves:
  - Cache usage
  - Vectorization
  - Memory bandwidth utilization
- Strided access patterns or indirect addressing can hurt performance.
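NumPy's stride information makes the difference visible (a small illustration; the exact byte counts assume a C-ordered float64 array):

```python
import numpy as np

A = np.random.rand(1024, 1024)    # row-major (C order) by default

row = A[0, :]                     # contiguous: consecutive elements 8 bytes apart
col = A[:, 0]                     # strided: consecutive elements a full row apart
print(row.strides, col.strides)   # e.g. (8,) vs (8192,)

# Making the dimension you parallelize over contiguous (here by copying the
# transpose) often helps cache use and vectorization.
A_t = np.ascontiguousarray(A.T)
```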
Granularity
Choosing the right granularity for work units:
- Too fine-grained:
  - Overheads (thread creation, scheduling, communication) dominate.
- Too coarse-grained:
  - May underutilize resources or worsen load balance.
- Often, one maps chunks of data elements to each thread or process, not single elements (especially on CPUs and clusters).
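One concrete granularity knob, assuming Python's concurrent.futures (the function and chunk size below are illustrative), is the chunksize argument of a process pool's map: it batches many elements into each task so per-task overhead is amortized.

```python
import math
from concurrent.futures import ProcessPoolExecutor

def work(v):
    return math.sqrt(v) + math.sin(v)

if __name__ == "__main__":
    data = range(1_000_000)
    with ProcessPoolExecutor() as pool:
        # chunksize ships a batch of elements per task instead of one element
        # at a time, so scheduling and inter-process communication overhead
        # is amortized over each chunk.
        results = list(pool.map(work, data, chunksize=10_000))
```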
Example: simple data-parallel transformation
As a concrete, small-scale example, consider normalizing an array:
$$
y_i = \frac{x_i}{\max_j |x_j|}
$$
High-level data-parallel structure:
- Find the maximum absolute value (reduction).
- Apply the normalization to each element (map).
Pseudo-code:
# Step 1: reduction to find max_abs
max_abs = 0
for i in range(N):
    if abs(x[i]) > max_abs:
        max_abs = abs(x[i])

# Step 2: data-parallel map
for i in range(N):
    y[i] = x[i] / max_abs

In a data-parallel implementation:
- The reduction is executed in parallel using a tree-like scheme.
- The normalization loop is distributed across threads, vector units, GPU threads, or nodes, each processing a chunk of i values.
This illustrates how typical scientific computations naturally decompose into data-parallel steps combined with a few collective operations.
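A sketch of one possible parallel implementation (using Python's concurrent.futures and NumPy as before; chunking and worker count are illustrative) follows the same two steps:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_max_abs(chunk):
    return float(np.max(np.abs(chunk)))      # local reduction on one chunk

def scale_chunk(args):
    chunk, max_abs = args
    return chunk / max_abs                   # local map on one chunk

def normalize(x, p=4):
    chunks = np.array_split(x, p)
    with ProcessPoolExecutor(max_workers=p) as pool:
        # Step 1: parallel partial reductions, combined by a final max.
        max_abs = max(pool.map(chunk_max_abs, chunks))
        # Step 2: parallel map over the same chunks.
        parts = pool.map(scale_chunk, [(c, max_abs) for c in chunks])
    return np.concatenate(list(parts))

if __name__ == "__main__":
    x = np.random.randn(100_000)
    y = normalize(x)
    assert np.isclose(np.max(np.abs(y)), 1.0)
```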