Data parallelism

What makes data parallelism distinct

Data parallelism focuses on performing (roughly) the same operation on many pieces of data at once. The parallelism comes from splitting the data, not from splitting different tasks or algorithm phases.

Its key characteristic is that a single operation (or kernel) is applied uniformly across many data elements, so the amount of available parallelism grows with the size of the data.

This contrasts with task parallelism, where different tasks or functions may run in parallel and might not operate on the same data structures.

Typical examples include element-wise array operations, stencil updates on grids, and reductions such as sums or maxima; all of these are discussed below.

Conceptual model of data parallel execution

Imagine a loop over an array:

for i in range(N):
    a[i] = b[i] + c[i]

In a data-parallel model, every iteration of this loop is independent: computing a[i] uses only b[i] and c[i], so the iterations can be distributed across workers and executed in any order.

As a mental model:

$$
\forall i \in \{0, \dots, N-1\}: \quad a_i = f(b_i, c_i)
$$

All $i$ can, in principle, be processed simultaneously, subject to available hardware resources.
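
A minimal sketch of this model (assuming NumPy as the array library; the names mirror the loop above): the same computation written as a single whole-array operation, leaving element-level parallelism to the library.

import numpy as np
N = 1_000_000
b = np.random.rand(N)
c = np.random.rand(N)
# One logical operation over all indices i; the library is free to
# process elements in parallel (vectorized and/or multi-threaded).
a = b + c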

Forms of data parallelism

Element-wise (embarrassingly parallel) operations

Each output element depends only on the corresponding input element(s), with no need for communication between parallel workers.

Examples include adding or multiplying two arrays element by element, scaling an array by a constant, and applying a pointwise function such as exp or sqrt to every entry.

These operations are often called embarrassingly parallel because they can be parallelized with essentially no coordination between workers.
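
A small illustration (again assuming NumPy): each output element depends only on the corresponding input element, so no coordination between workers is required.

import numpy as np
x = np.linspace(0.0, 1.0, 10)
# Pointwise operations: every output element depends only on x[i].
y = np.exp(-x**2)    # apply a function element-wise
z = 2.0 * x + 1.0    # scale and shift every element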

Neighborhood or stencil operations

Each element depends on a small neighborhood of data. This is still data parallel but requires structured communication or halo/ghost regions.

Examples include a three-point moving average in 1D,

$$
y_i = \frac{1}{3}(x_{i-1} + x_i + x_{i+1}),
$$

and a Jacobi-style update on a 2D grid,

$$
u_{i,j}^{\text{new}} = \frac{1}{4}\big(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\big).
$$

Parallelization is still based on partitioning the data (rows, columns, or blocks), but boundaries require data exchange.
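
A sketch of the 1D three-point average above, written in a data-parallel style with shifted array slices (NumPy assumed). All interior points are updated independently, which is why only partition boundaries need communication in a distributed setting.

import numpy as np
x = np.random.rand(1000)
y = np.empty_like(x)
# Interior points: y[i] = (x[i-1] + x[i] + x[i+1]) / 3 for all i at once,
# expressed through shifted views of the same array.
y[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
# Boundary points need a separate rule (here: copied unchanged).
y[0], y[-1] = x[0], x[-1]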

Reductions and scans (collective data operations)

Sometimes you apply a per-element operation and then combine the results: for example into a single value such as a sum, maximum, or dot product (a reduction), or into a running sequence such as prefix sums (a scan).

These have a data-parallel structure but also require tree-like combination steps. Many parallel frameworks treat them as built-in data-parallel primitives.
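
A brief sketch of these primitives (NumPy assumed): a per-element map combined with a reduction or a scan, both of which a parallel runtime can implement with tree-shaped combination steps.

import numpy as np
x = np.random.rand(1000)
# Map + reduce: square every element, then combine into one value.
sum_of_squares = np.sum(x**2)
largest = np.max(np.abs(x))
# Scan: prefix sums, where element i holds the sum of x[0..i].
prefix = np.cumsum(x)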

Data decomposition strategies

The central design choice in data parallelism is how to partition the data across resources.

1D decompositions

For 1D arrays, or when higher-dimensional data is flattened, the array is typically split into contiguous blocks (one chunk per worker), distributed cyclically (elements dealt out round-robin), or divided in a block-cyclic combination of the two.

2D and higher-dimensional decompositions

For matrices or grids, common choices are row-wise strips, column-wise strips, or 2D blocks (tiles), each assigned to one worker.

Choosing the decomposition affects load balance, the amount of boundary data that must be communicated, and how well each worker's memory accesses match the storage layout.
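
A small sketch of 1D block and cyclic decompositions (the worker count and the use of np.array_split are illustrative assumptions):

import numpy as np
N = 10
num_workers = 4
x = np.arange(N)
# Block decomposition: contiguous chunks of nearly equal size.
blocks = np.array_split(x, num_workers)
# -> [array([0, 1, 2]), array([3, 4, 5]), array([6, 7]), array([8, 9])]
# Cyclic decomposition: elements dealt out round-robin.
cyclic = [x[r::num_workers] for r in range(num_workers)]
# -> [array([0, 4, 8]), array([1, 5, 9]), array([2, 6]), array([3, 7])]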

Data parallelism and underlying hardware

Data parallelism maps naturally onto several hardware concepts:

SIMD and vector units

SIMD (Single Instruction, Multiple Data) executes the same instruction across multiple data elements in a single core.

Data parallel loops like:

for (int i = 0; i < N; i++) {
    a[i] = b[i] * c[i] + d[i];
}

are ideal candidates for automatic or manual vectorization. Here, data parallelism exists at the instruction level within one core.

GPUs and massive threading

GPUs are designed for large-scale data parallelism: thousands of lightweight threads execute the same kernel, with each thread typically responsible for one or a few data elements.

Typical GPU kernels for scientific computing are data-parallel over grid points, particles, or matrix elements.
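
As an illustrative sketch (assuming Numba's CUDA support and a CUDA-capable GPU; CUDA C or CuPy kernels follow the same pattern), each GPU thread handles one array element:

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, c):
    # One thread per element: thread i computes a single a[i].
    i = cuda.grid(1)
    if i < a.size:
        a[i] = b[i] + c[i]

N = 1_000_000
b = np.random.rand(N)
c = np.random.rand(N)
a = np.empty_like(b)
threads_per_block = 256
blocks = (N + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](a, b, c)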

Distributed memory (nodes in a cluster)

Data parallelism at the cluster level means partitioning a global data structure across the separate memories of many nodes (for example with MPI); each node runs the same program on its local portion and exchanges boundary or reduction data when needed.

This is conceptually the same idea: each node is applying the same logic to its local slice of a global data structure.
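
A minimal distributed sketch (assuming mpi4py and an MPI launcher such as mpirun; the array size and operation are illustrative): every rank applies the same logic to its own slice, and a collective combines the partial results.

import numpy as np
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
# Each rank owns one slice of a (conceptual) global array.
N_global = 1_000_000
local_n = N_global // size + (1 if rank < N_global % size else 0)
local_x = np.random.rand(local_n)
# Same per-element logic on every rank, then a collective reduction.
local_sum = np.sum(local_x**2)
global_sum = comm.allreduce(local_sum, op=MPI.SUM)
if rank == 0:
    print("global sum of squares:", global_sum)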

Common patterns and idioms in data-parallel codes

Map

Apply a function independently to each element:

$$
y_i = f(x_i)
$$

Example:

for i in range(N):
    y[i] = f(x[i])

Found in many frameworks as map, transform, or vectorized array operations.

Map + Reduce

Compute a global property while processing data elements:

$$
s = \sum_{i=0}^{N-1} x_i^2
$$

Pattern:

  1. Map: compute local contributions ($x_i^2$)
  2. Reduce: sum results (possibly in parallel)
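
A sketch of this pattern (NumPy assumed); the chunked form mirrors how a parallel runtime would combine partial results:

import numpy as np
x = np.random.rand(1000)
# Map + reduce in one expression.
s = np.sum(x**2)
# Equivalent chunked form: map over chunks, then a final reduction.
chunks = np.array_split(x, 4)
partial = [np.sum(chunk**2) for chunk in chunks]
s_chunked = sum(partial)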

Stencil

Each element is updated based on nearby elements, as in the moving-average and grid-update formulas shown earlier; parallel implementations must exchange halo/ghost data at partition boundaries.

Gather and scatter

A gather reads input elements from arbitrary (indexed) locations, while a scatter writes results to arbitrary locations. These operations are often necessary when the mapping from data indices to physical elements is not 1-to-1 or is irregular (e.g., unstructured meshes).
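
A short sketch of both operations using indexed (fancy) array access (NumPy assumed):

import numpy as np
x = np.array([10.0, 20.0, 30.0, 40.0])
idx = np.array([3, 0, 0, 2])
# Gather: read from arbitrary locations, y[i] = x[idx[i]].
y = x[idx]                 # -> [40., 10., 10., 30.]
# Scatter: accumulate into arbitrary locations, out[idx[i]] += vals[i].
out = np.zeros_like(x)
vals = np.ones_like(x)
np.add.at(out, idx, vals)  # handles repeated indices correctly
# out -> [2., 0., 1., 1.]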

Benefits and limitations of data parallelism

Benefits

Data parallelism scales naturally with problem size, is conceptually simple (one operation applied everywhere), and maps directly onto SIMD units, GPUs, and distributed-memory clusters.

Limitations

Not every algorithm exposes enough independent per-element work: loop-carried dependencies, irregular data structures, and frequent global communication (stencil boundaries, repeated reductions) can limit how much of a computation fits this model.

Practical considerations when using data parallelism

Identifying safe parallel regions

For a loop or operation to be safely data-parallel, each iteration must not read results produced by other iterations (no loop-carried dependencies), and no two iterations may write to the same memory location.
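
Two small loops illustrating the distinction (plain Python for clarity):

import numpy as np
x = np.random.rand(8)
# Safe to run data-parallel: each iteration touches only its own i.
y = np.empty_like(x)
for i in range(len(x)):
    y[i] = 2.0 * x[i]
# NOT safely data-parallel as written: iteration i reads the result of
# iteration i-1 (a loop-carried dependency), so iteration order matters.
s = np.empty_like(x)
s[0] = x[0]
for i in range(1, len(x)):
    s[i] = s[i - 1] + x[i]

The second loop computes a prefix sum; it can still be parallelized, but only by using a scan primitive rather than by treating the iterations as independent.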

Memory layout and locality

Data-parallel performance is strongly affected by how data is laid out: contiguous, cache-friendly access along the storage order of an array vectorizes and parallelizes far better than strided or random access, and on distributed systems the layout determines how much data crosses node boundaries.
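
A small illustration of layout effects (NumPy assumed; for a C-ordered 2D array, elements within a row are adjacent in memory):

import numpy as np
A = np.random.rand(4, 4)           # default C (row-major) order
row = A[0, :]                      # a row: contiguous in memory
col = A[:, 0]                      # a column: strided view (step = one row)
print(row.flags['C_CONTIGUOUS'])   # True
print(col.flags['C_CONTIGUOUS'])   # False

Partitioning work along the contiguous dimension keeps each worker's accesses cache-friendly.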

Granularity

Choosing the right granularity for work units means making chunks large enough to amortize scheduling and communication overhead, yet small enough to keep all workers busy and the load balanced.
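
One common way to control granularity (using Python's standard concurrent.futures as an assumed execution backend; the chunk size is illustrative):

from concurrent.futures import ProcessPoolExecutor

def f(v):
    return 2.0 * v + 1.0

if __name__ == "__main__":
    data = list(range(100_000))
    with ProcessPoolExecutor() as pool:
        # chunksize groups many small items into one task, trading
        # scheduling overhead against load balance across workers.
        results = list(pool.map(f, data, chunksize=1_000))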

Example: simple data-parallel transformation

As a concrete, small-scale example, consider normalizing an array:

$$
y_i = \frac{x_i}{\max_j |x_j|}
$$

High-level data-parallel structure:

  1. Find the maximum absolute value (reduction).
  2. Apply the normalization to each element (map).

Pseudo-code:

# Step 1: reduction to find max_abs
max_abs = 0
for i in range(N):
    if abs(x[i]) > max_abs:
        max_abs = abs(x[i])
# Step 2: data-parallel map
for i in range(N):
    y[i] = x[i] / max_abs

In a data-parallel implementation, step 1 becomes a parallel reduction (each worker finds the maximum absolute value of its portion and the partial results are combined), and step 2 becomes an independent per-element map.
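
The same two steps written as whole-array operations (NumPy assumed); the reduction and the map are each data-parallel:

import numpy as np
x = np.random.rand(1000)
max_abs = np.max(np.abs(x))   # step 1: reduction
y = x / max_abs               # step 2: element-wise map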

This illustrates how typical scientific computations naturally decompose into data-parallel steps combined with a few collective operations.
