
Data parallelism

Local vs global view of data

Data parallelism is about performing the same operation on many pieces of data at the same time. In contrast to task parallelism, where different tasks may do different work, data parallel programs apply one conceptual operation to a collection of elements.

It is useful to distinguish between the global view of the data and the local view seen by each processing element. Globally, you may think of an array, grid, or dataset as a single large object. Locally, each core or process sees and operates on only part of that object. Data parallelism is about designing algorithms and data layouts so that each worker can process its local elements with minimal coordination, while still contributing to a correct global result.

In shared memory, the global data structure often exists in one address space, but programmers still conceptually partition it into chunks assigned to threads. In distributed memory, the global structure is split physically and each process owns a subset. In both cases, data parallel thinking starts with the question: how can the data be partitioned so that each worker can apply the same computation independently to its portion?

The basic data parallel pattern

The core data parallel pattern is simple. You have a collection $D$ of elements and a function $f$ that you wish to apply to each element. Conceptually, you want to compute a new collection $E$ where
$$
E[i] = f(D[i]) \quad \text{for all } i.
$$

In a sequential implementation, a single loop runs over all $i$ and applies $f$. In a data parallel implementation, iterations of this loop are executed concurrently, each on a different processing element. The iterations are independent, which means the result of one iteration does not affect the result of another. This independence is what makes the pattern simple and powerful.
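As a minimal sketch of this pattern, assuming C with OpenMP and a placeholder function `f`, the same loop can be written sequentially or as a parallel loop whose iterations are distributed across threads:

```c
#include <stddef.h>

/* Placeholder elementwise operation; any pure function of one element works. */
static double f(double x) { return 2.0 * x + 1.0; }

/* Sequential version: one loop applies f to every element. */
void map_sequential(const double *D, double *E, size_t n) {
    for (size_t i = 0; i < n; i++)
        E[i] = f(D[i]);
}

/* Data parallel version: iterations are independent, so a single OpenMP
   directive distributes them across threads. Compile with -fopenmp. */
void map_parallel(const double *D, double *E, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        E[i] = f(D[i]);
}
```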

This idea generalizes to higher dimensional data, such as 2D grids for images or 3D meshes for physical simulations. The key property remains the same. The same computation is applied to each cell, pixel, or mesh element according to some rule.

Typical data parallel operations

Many common operations in scientific and numerical computing are naturally data parallel. Elementwise vector operations operate on arrays or vectors where each position can be updated independently with a formula like
$$
y_i = a \cdot x_i + b.
$$
This is data parallel because each output element depends only on the corresponding input element and a set of shared scalar parameters.

Map operations use a general function applied independently to each element of a collection. Logical operations on masks, filtering operations that compute new values based on each data point, and many image processing kernels are of this type.

Stencil updates are also frequently data parallel. In a stencil, each grid point is updated based on its own value and the values of a fixed pattern of neighbors. A simple 1D stencil example is
$$
u_{\text{new}}[i] = \alpha \, u[i-1] + \beta \, u[i] + \gamma \, u[i+1].
$$
Within the interior of the domain, all points can be updated concurrently, since they only read values from an old time step and write results into a new array. Neighbor dependencies introduce communication or synchronization requirements at boundaries, but most of the work remains data parallel.
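A sketch of one such sweep in C with OpenMP, with the coefficients passed as parameters and the two boundary points left untouched, might look like this:

```c
#include <stddef.h>

/* One stencil sweep: reads only from u, writes only to u_new, so all
   interior iterations are independent. Assumes n >= 2. */
void stencil_step(const double *u, double *u_new, size_t n,
                  double alpha, double beta, double gamma) {
    #pragma omp parallel for
    for (size_t i = 1; i < n - 1; i++)
        u_new[i] = alpha * u[i - 1] + beta * u[i] + gamma * u[i + 1];
}
```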

Dense linear algebra kernels, such as matrix vector multiplication and matrix matrix multiplication, also contain strong data parallel structure. Each output element is computed from combinations of input elements in a regular pattern, and subsets of output elements can be assigned to different workers.
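For example, a matrix vector product parallelized over rows might look like the following sketch, assuming a dense row-major matrix:

```c
#include <stddef.h>

/* y = A * x for an n-by-m row-major matrix A. Each y[i] depends only on
   row i of A and on x, so rows can be processed by different threads. */
void matvec(const double *A, const double *x, double *y,
            size_t n, size_t m) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < m; j++)
            sum += A[i * m + j] * x[j];
        y[i] = sum;
    }
}
```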

Data parallelism and SIMD / vectorization

Although data parallelism and SIMD are discussed in separate chapters, the conceptual connection is direct. SIMD hardware is designed to apply the same operation to multiple data elements in parallel within a single core. Data parallel code patterns are precisely those that allow a compiler or programmer to safely use SIMD instructions.

When you write a loop where each iteration is independent and follows the same control flow, a compiler can often vectorize it. That means it transforms the scalar loop into operations where a single instruction processes a vector of elements. For example, an operation on four consecutive elements of an array might be performed with one SIMD instruction instead of four scalar ones.

At a higher scale, data parallelism also appears across cores. The same loop can be partitioned so that different threads or processes receive disjoint segments of the iteration space. Within each segment, SIMD operations can act on multiple elements at once. In practice, efficient HPC codes often combine thread level and instruction level data parallelism by structuring loops to be both parallel across threads and vectorizable inside each thread.
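A sketch combining both levels for the elementwise formula shown earlier, assuming C with OpenMP, where the `simd` clause asks the compiler to vectorize each thread's chunk of iterations:

```c
#include <stddef.h>

/* y[i] = a * x[i] + b, parallel across threads and vectorized within
   each thread's chunk of the iteration space. */
void axpb(const double *x, double *y, double a, double b, size_t n) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + b;
}
```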

Data parallel loops must avoid cross iteration dependencies. If iteration $i$ writes data that iteration $j$ reads or writes during the same parallel step, the loop is not safely data parallel.
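For instance, a prefix sum written in the obvious way carries a dependency from each iteration to the previous one and must not be annotated as a parallel loop (a C sketch):

```c
#include <stddef.h>

/* NOT safely data parallel: iteration i reads a[i-1], which iteration
   i-1 has just written. Parallelizing this loop as written gives wrong
   results; parallel prefix sums require a different algorithm. */
void prefix_sum(double *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + a[i];
}
```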

Data decomposition strategies

Effective data parallelism depends critically on how you partition data among workers. This process is called data decomposition. The goal is to divide the data so that each worker receives a subset that contains enough work to justify scheduling, while keeping communication and synchronization costs small.

Block partitioning is one common strategy. For a 1D array of length $N$ and $P$ workers, each worker gets a contiguous block of approximately $N / P$ elements. For example, worker 0 might process indices $0$ to $N/P - 1$, worker 1 indices $N/P$ to $2N/P - 1$, and so on. Block decompositions tend to have good spatial locality, since each worker traverses a contiguous region of memory.

Cyclic partitioning assigns elements in a round robin fashion. Worker $p$ might receive indices $p, p + P, p + 2P$, and so forth. Cyclic decompositions can help balance load when the cost per element varies systematically with the index, but they usually reduce locality compared to block decompositions.
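Both assignment rules can be written down directly. The following sketch uses hypothetical `rank` and `P` parameters for the worker id and worker count:

```c
#include <stddef.h>

/* Block partitioning: worker `rank` of P gets a contiguous range
   [start, end). The remainder N % P is spread over the first workers,
   so block sizes differ by at most one element. */
void block_range(size_t N, int P, int rank, size_t *start, size_t *end) {
    size_t base = N / P, rem = N % P;
    size_t r = (size_t)rank;
    *start = r * base + (r < rem ? r : rem);
    *end   = *start + base + (r < rem ? 1 : 0);
}

/* Cyclic partitioning: worker `rank` touches indices rank, rank + P, ... */
void process_cyclic(double *x, size_t N, int P, int rank) {
    for (size_t i = (size_t)rank; i < N; i += (size_t)P)
        x[i] = 2.0 * x[i];   /* any independent elementwise update */
}
```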

For multidimensional data, such as a 2D grid, you can use 1D or 2D decompositions. A 1D row decomposition assigns whole rows to workers. A 2D block decomposition divides the grid into rectangular subblocks. Two dimensional decompositions can reduce the surface to volume ratio of subdomains. This reduces the amount of boundary communication relative to computation, which is important for performance in stencil and PDE codes.

Data parallelism in shared memory and distributed memory

In shared memory environments, such as a multicore node programmed with threads, data parallelism typically appears as parallel loops over shared arrays. The data structure itself exists once in memory. Each thread processes a subset of indices. The main design challenge is to avoid race conditions and to choose chunk sizes and scheduling policies that balance load without causing false sharing or contention.

In distributed memory environments, such as MPI programs running over a cluster, data parallelism requires explicit distribution of arrays across processes. Each process stores and updates only its portion. Data parallel algorithms must be formulated in terms of local computations on each process followed by communication steps to exchange boundary data or to combine partial results. For example, after performing a local stencil update, processes may need to send and receive boundary values with neighbors.
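A minimal sketch of such a boundary exchange with MPI, assuming a 1D decomposition with one ghost cell on each side of the local array and neighbor ranks `left` and `right` (set to MPI_PROC_NULL at the ends of the domain):

```c
#include <mpi.h>

/* u_local holds n_local interior points plus one ghost cell at each end:
   u_local[0] and u_local[n_local + 1]. Each process sends its outermost
   interior values to its neighbors and receives theirs into the ghosts. */
void exchange_halo(double *u_local, int n_local, int left, int right,
                   MPI_Comm comm) {
    /* Send rightmost interior point to the right neighbor, receive the
       left neighbor's rightmost point into the left ghost cell. */
    MPI_Sendrecv(&u_local[n_local], 1, MPI_DOUBLE, right, 0,
                 &u_local[0],       1, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* Send leftmost interior point to the left neighbor, receive the
       right neighbor's leftmost point into the right ghost cell. */
    MPI_Sendrecv(&u_local[1],           1, MPI_DOUBLE, left,  1,
                 &u_local[n_local + 1], 1, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}
```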

Despite these differences, the core data parallel idea is the same. Each worker independently performs the same operations on its local data subset, with synchronization or communication only where truly necessary.

Reductions as a special data parallel case

Reductions are a particularly important data parallel pattern. In a reduction, many input elements are combined to produce a single output using an associative operation such as sum, product, minimum, or maximum. Mathematically, a reduction over an array $x$ with an operation $\oplus$ and identity element $e$ can be written as
$$
r = x_0 \oplus x_1 \oplus \dots \oplus x_{N-1}.
$$

Data parallel implementations of reductions partition the input array among workers. Each worker computes a partial reduction over its local segment, then partial results are combined into the final result. Associativity, together with commutativity when it is available, permits combining partial results in different orders, which is what makes reductions parallelizable.
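A sketch of this two-stage structure for a sum, in C with OpenMP, where the reduction clause manages the per-thread partial sums and their final combination:

```c
#include <stddef.h>

/* Sum reduction: each thread accumulates a private partial sum over its
   share of the iterations, and OpenMP combines the partial sums at the end. */
double array_sum(const double *x, size_t n) {
    double r = 0.0;   /* 0.0 is the identity element for + */
    #pragma omp parallel for reduction(+:r)
    for (size_t i = 0; i < n; i++)
        r += x[i];
    return r;
}
```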

In practice, floating point arithmetic is not associative, so the parallel reduction result may differ slightly from a sequential computation. This is usually acceptable but becomes important when exact bit reproducibility is required.

Reductions often appear together with elementwise operations. For example, computing the Euclidean norm of a vector involves a data parallel elementwise square followed by a sum reduction:
$$
\|x\|_2 = \sqrt{\sum_{i=0}^{N-1} x_i^2}.
$$
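In a sketch, the elementwise squares and the sum reduction can be fused into a single parallel loop followed by one scalar square root:

```c
#include <math.h>
#include <stddef.h>

/* Euclidean norm: elementwise squaring fused with a sum reduction. */
double norm2(const double *x, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sqrt(sum);
}
```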

To be safely parallelizable, a reduction must use an operation that is associative and must start from a correct identity element. Incorrect identity values or non associative operations break parallel correctness.

Communication patterns in data parallel programs

Pure data parallel operations that only read and write independent elements can be performed with no communication or synchronization between workers. Many useful algorithms, however, also involve limited interaction through structured communication patterns.

Nearest neighbor communication is common in stencil computations on grids. After updating interior points, each worker must exchange boundary values with its immediate neighbors to prepare for the next iteration. Communication cost grows with the size of the boundary, which is one reason multidimensional decompositions and subdomain shapes matter.

Global reductions use collective communication to combine partial results from all workers. Parallel sum, minimum, maximum, and logical operations are typical. Libraries and frameworks usually provide optimized primitives for these operations, which hide the details of communication trees and algorithms from application code.
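With MPI, for example, the partial sums from all processes can be combined in one collective call; the sketch below assumes each process has already computed `local_sum`:

```c
#include <mpi.h>

/* Combine per-process partial sums into a global sum that every process
   receives; the library chooses the reduction algorithm internally. */
double global_sum(double local_sum, MPI_Comm comm) {
    double total = 0.0;
    MPI_Allreduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
    return total;
}
```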

Some operations require more complex interaction, such as all to all communication or irregular gathers and scatters. While these can still be considered data parallel in the sense that all workers run the same code on their local data, the cost of communication often dominates. Efficient data parallel designs try to use simple communication patterns and limit the frequency and volume of transferred data.

Data parallelism in real applications

Many large scale HPC applications rely heavily on data parallel structure. In climate and weather models, the atmosphere or ocean is represented as a discretized grid. Physical laws are approximated by finite difference or finite volume schemes. At each time step, similar update formulas are applied to each cell. The global domain is decomposed into subdomains, each subdomain assigned to a process or a group of threads, and the same update kernel runs everywhere.

In molecular dynamics simulations, forces between atoms or particles are computed using similar formulas that depend on positions and sometimes neighbor lists. Each particle often follows the same integration scheme for its motion. Data parallelism arises in updating positions, velocities, and forces across all particles in the system.

In image and signal processing, filters, convolutions, and transforms often exhibit data parallel behavior. Applying the same filter to each pixel or signal element fits naturally into a data parallel model. Hardware accelerators, such as GPUs, exploit this structure to process very large images or signal streams efficiently.

Machine learning workloads, especially during training, use data parallelism extensively. A model with fixed parameters is evaluated on different mini batches of input examples. Gradients for each batch are computed in parallel and then combined, usually through a reduction that averages or sums gradient contributions before updating the model.

Performance considerations specific to data parallelism

Good performance in data parallel programs depends on more than just parallel correctness. Data locality is critical. Because data parallel computations often stream through large arrays, organizing data contiguously in memory and aligning partitions to cache lines can have a big impact on speed. Access patterns should favor sequential traversal rather than random indexing.

Load balance is another central concern. If some workers have significantly more data or more expensive data elements to process, they will lag behind the others and limit overall speedup. Choosing an appropriate decomposition and, when necessary, using dynamic scheduling strategies can reduce imbalance. However, dynamic approaches may introduce overhead or harm locality, so they must be applied carefully.

Minimizing synchronization and communication is especially important. Every barrier or collective operation introduces a point where all workers must wait for the slowest one. Data parallel patterns that can proceed for many steps using only local data, with occasional synchronization, usually scale better than patterns that require frequent global coordination.

Finally, vectorization and use of accelerators often align naturally with data parallelism. By designing data structures and loops that are contiguous, aligned, and independent, you give compilers, libraries, and GPUs the opportunity to exploit wide parallelism efficiently. Many high performance libraries for linear algebra, transforms, and machine learning are internally structured around data parallel operations, which is one reason they are so effective on modern architectures.

When data parallelism is and is not a good fit

Data parallelism is well suited to problems where the same operation applies across many elements and interactions between elements are limited, structured, or infrequent. Regular grids, dense arrays, and large sets of similar objects usually work well. Algorithms that can be expressed in terms of maps, stencils, and reductions are typically good candidates.

In contrast, problems where each element follows very different control flow, or where each step depends on complex, irregular interactions between many elements, may not map cleanly onto a simple data parallel model. Graph algorithms with highly irregular connectivity, event driven simulations with unpredictable behavior, and workloads where a few elements dominate the work can all challenge pure data parallel approaches. In practice, such applications often combine data parallel components with other forms of parallelism to achieve good performance.

Understanding data parallelism and recognizing it in algorithms is a core skill in HPC. It guides how you structure your data, design loops, choose decompositions, and select appropriate programming models and libraries across shared memory, distributed memory, and accelerator based systems.
