Overview
Collective communication in MPI is communication that involves a whole group of processes at once, not just a sender and a receiver. Instead of matching individual MPI_Send and MPI_Recv calls, all processes in a communicator participate in a single collective operation that is defined in terms of the group as a whole.
In this chapter, the focus is on what collective communication does, how typical patterns look, and what practical issues to watch for in real HPC applications. Details about basic MPI setup, point-to-point communication, and communicators are covered elsewhere and are assumed here.
The role of collective operations
Collective operations are used when data movement or coordination has a natural group scope. Examples include gathering partial results from all processes to one process, distributing configuration data from one process to all others, computing global sums or minima, or synchronizing progress before entering a critical phase of a computation.
Collectives are defined on a communicator, typically MPI_COMM_WORLD or a subcommunicator. Every process in that communicator must call the collective routine with compatible arguments. Collectives are often implemented with optimized tree or pipeline algorithms provided by the MPI library, which are usually faster and more scalable than hand-written loops of point-to-point messages.
In MPI, every process in the communicator must participate in a collective call, and they must call compatible collectives in the same order, or the program will typically hang.
Broadcast: one-to-all communication
A broadcast operation sends the same data from one root process to all the other processes in the communicator. Only the root's buffer needs to contain valid data before the call. All participating processes, including the root, call the same function.
A typical broadcast call in MPI looks like:
MPI_Bcast(buffer, count, datatype, root, comm);
The root process has buffer filled with data before the call. After MPI_Bcast returns, every participating process has the same data in its local buffer. This is commonly used to distribute configuration parameters, problem sizes, or initial conditions that are read from a file by a single process and then shared.
Broadcast is conceptually simple, but the MPI implementation usually uses a tree or similar scheme internally rather than sending from the root directly to every process, which would not scale well.
An important practical rule is that every process must call MPI_Bcast with the same root, count, datatype, and comm, and with a buffer that is large enough to hold the data. Only the root needs the input data to be valid before the call.
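As a minimal sketch (assuming MPI has already been initialized and mpi.h is included; params is an illustrative name, not part of any API), rank 0 might fill a small parameter array and share it with all ranks:

double params[4];                       /* same-sized buffer on every rank */
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    /* placeholder for reading configuration from a file */
    params[0] = 1.0; params[1] = 0.5; params[2] = 100.0; params[3] = 1e-6;
}
MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);
/* After the call, params holds the same four values on every rank. */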
Gather and scatter: all-to-one and one-to-all with distinct data
Broadcast sends the same data everywhere. Sometimes each process owns distinct data that should be collected together, or one process owns a large array that should be divided among processes. The gather and scatter collectives express these patterns.
Gather
A gather operation collects equal sized pieces from every process and concatenates them on the root process. Each process provides a send buffer. Only the root process provides a receive buffer large enough to hold all the data.
In MPI:
MPI_Gather(sendbuf, sendcount, sendtype,
recvbuf, recvcount, recvtype,
root, comm);
Non root processes ignore the recvbuf, recvcount, and recvtype arguments. The root receives the data in rank order. If each process sends sendcount elements, then the root receives size * recvcount elements, where size is the number of processes in the communicator. Typically, sendcount and recvcount are equal.
A common use is collecting local results or diagnostic values at the end of a step for analysis or writing a serial output file.
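A minimal sketch, assuming MPI is initialized and mpi.h and stdlib.h are included (local_val and all_vals are illustrative names): each rank contributes one double and rank 0 collects them in rank order.

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
double local_val = 0.01 * rank;               /* stand-in for a locally computed diagnostic */
double *all_vals = NULL;
if (rank == 0) {
    all_vals = malloc(size * sizeof(double)); /* receive buffer is needed only on the root */
}
MPI_Gather(&local_val, 1, MPI_DOUBLE,
           all_vals, 1, MPI_DOUBLE,
           0, MPI_COMM_WORLD);
/* On rank 0, all_vals[i] now holds the value contributed by rank i. */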
Scatter
Scatter is the reverse of gather. The root process distributes distinct chunks of data to all processes, including itself. Only the root needs a full send buffer that contains all the initial data, and every process provides a local buffer to receive its piece.
In MPI:
MPI_Scatter(sendbuf, sendcount, sendtype,
recvbuf, recvcount, recvtype,
root, comm);
On the root, sendbuf must contain at least size * sendcount elements, which are conceptually divided into size blocks, each of length sendcount. Rank 0 receives the first block, rank 1 receives the second block, and so on.
Scatter can be used to distribute different pieces of an array to different processes at the start of a parallel computation, for example to assign different subdomains or blocks of a matrix.
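A sketch under the same assumptions (full, local, and CHUNK are illustrative names): rank 0 owns the full array and hands each rank one contiguous block.

enum { CHUNK = 100 };                    /* elements per rank, an illustrative choice */
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
double local[CHUNK];
double *full = NULL;                     /* full array is significant only on the root */
if (rank == 0) {
    full = malloc((size_t)size * CHUNK * sizeof(double));
    for (int i = 0; i < size * CHUNK; i++) full[i] = (double)i;   /* placeholder data */
}
MPI_Scatter(full, CHUNK, MPI_DOUBLE,
            local, CHUNK, MPI_DOUBLE,
            0, MPI_COMM_WORLD);
/* Rank r now holds elements r*CHUNK .. (r+1)*CHUNK - 1 of the original array. */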
Variants: Gatherv and Scatterv
In many applications, processes do not own equal sized data segments. MPI provides vector variants that allow different counts per process.
MPI_Gatherv allows each process to send a different number of elements. On the root, the receive buffer is accompanied by an array of counts and an array of displacements that describe where each process's data is placed.
Similarly, MPI_Scatterv allows the root to send varying sized chunks to each process, with counts and displacements that describe the layout in the root buffer.
These variants are crucial in codes with irregular decomposition, such as adaptive meshes or load balanced domain partitions.
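The sketch below illustrates MPI_Gatherv under the same assumptions (local_n, counts, displs, and global are illustrative names). The root first learns the per-rank counts with a plain gather, builds the displacement array, and then gathers the variable-sized pieces.

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int local_n = 10 + rank;                              /* illustrative: counts differ per rank */
double *local = malloc(local_n * sizeof(double));
for (int i = 0; i < local_n; i++) local[i] = rank;    /* placeholder local data */
int *counts = NULL, *displs = NULL;
double *global = NULL;
if (rank == 0) counts = malloc(size * sizeof(int));
MPI_Gather(&local_n, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank == 0) {
    displs = malloc(size * sizeof(int));
    displs[0] = 0;
    for (int i = 1; i < size; i++) displs[i] = displs[i-1] + counts[i-1];
    global = malloc((displs[size-1] + counts[size-1]) * sizeof(double));
}
MPI_Gatherv(local, local_n, MPI_DOUBLE,
            global, counts, displs, MPI_DOUBLE,
            0, MPI_COMM_WORLD);
/* On rank 0, global now holds each rank's block starting at offset displs of that rank. */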
Allgather and all-to-all distributions
Sometimes all processes need to end up with a copy of data from every other process, not just a single root. Allgather and all-to-all operations cover these patterns.
Allgather
Allgather can be seen as a gather followed by a broadcast of the gathered result, but performed as one optimized operation. Every process sends its contribution, and when the call completes, every process has the complete gathered array.
In MPI:
MPI_Allgather(sendbuf, sendcount, sendtype,
recvbuf, recvcount, recvtype,
comm);
The recvbuf on every process is filled with the contributions in rank order, in exactly the same layout as on the root of a plain gather. All processes supply a valid receive buffer.
An irregular variant, MPI_Allgatherv, is also available for the case where processes hold different numbers of elements.
Allgather is useful for distributing meta information or small data sets where each process must know about the state or contribution of every other process. It is commonly used in algorithms where local data structures depend on a global description of the decomposition.
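A minimal sketch, with my_offset and offsets as illustrative names: each rank contributes one value describing its part of the decomposition, and every rank ends up with the full table.

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
double my_offset = 10.0 * rank;                   /* stand-in for per-rank metadata */
double *offsets = malloc(size * sizeof(double));  /* every rank allocates the full table */
MPI_Allgather(&my_offset, 1, MPI_DOUBLE,
              offsets, 1, MPI_DOUBLE,
              MPI_COMM_WORLD);
/* On every rank, offsets[i] holds the value contributed by rank i. */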
Alltoall
Alltoall represents the most general regular pattern. Every process sends distinct data to every other process and receives distinct data from every other process. You can think of it as a full communication matrix where entry $(i, j)$ is the data sent from process $i$ to process $j$.
In MPI:
MPI_Alltoall(sendbuf, sendcount, sendtype,
recvbuf, recvcount, recvtype,
comm);
Every process provides a contiguous send buffer that is conceptually split into blocks, one block for each destination rank, and a contiguous receive buffer that is split into blocks, one for each source rank. The total number of elements sent by each process is size * sendcount.
Alltoall is widely used in codes that perform global redistributions, for example, changing data layouts between different stages of a computation, or performing transposes in parallel FFT algorithms.
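A minimal sketch, with sendbuf and recvbuf as illustrative names and one integer per destination: each rank sends a distinct value to every rank and receives one from every rank.

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int *sendbuf = malloc(size * sizeof(int));   /* one block (here one int) per destination */
int *recvbuf = malloc(size * sizeof(int));   /* one block per source                     */
for (int dest = 0; dest < size; dest++)
    sendbuf[dest] = 100 * rank + dest;       /* encode (source, destination) for illustration */
MPI_Alltoall(sendbuf, 1, MPI_INT,
             recvbuf, 1, MPI_INT,
             MPI_COMM_WORLD);
/* recvbuf[src] now equals 100*src + rank: the block that rank src sent to this rank. */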
For irregular or variable sized exchanges, MPI_Alltoallv allows different counts and displacements per destination. This is powerful but more complex to program and can be costly if used frequently with large, irregular patterns.
Reductions: global operations on distributed data
Global reduction operations compute a single result or a reduced array from partial contributions stored on each process. Instead of gathering all the data and then operating on it, the MPI reduction collects and combines data using an associative operation as it moves through the communicator.
A reduction can compute sums, products, minima, maxima, bitwise operations, or logical operators, among others. MPI provides predefined operations for built in datatypes and also supports user defined reduction operations if needed.
Basic reductions with MPI_Reduce
A simple reduction to a root process is performed with:
MPI_Reduce(sendbuf, recvbuf, count, datatype,
op, root, comm);
Each process provides a sendbuf containing its local data. The root process also provides a receive buffer. When the call completes, only the root has the result. For example, to compute the global sum of a scalar value x stored on each process, one uses MPI_SUM as the op.
If each process has an array of length count, then the reduction is performed elementwise. This is a common way to compute global norms or totals for monitoring convergence in iterative solvers.
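A minimal sketch of a global sum, assuming stdio.h is included and that local_sum (an illustrative name) has already been accumulated from locally owned data:

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
double local_sum = 0.0, global_sum = 0.0;
/* ... accumulate local_sum over locally owned data ... */
MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
           MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("global sum = %g\n", global_sum);   /* only the root holds the result */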
Allreduce for globally available results
Often, every process needs to know the global result, not just the root. In this case, MPI_Allreduce is used:
MPI_Allreduce(sendbuf, recvbuf, count, datatype,
op, comm);
After MPI_Allreduce returns, every process has the same result in its receive buffer. This is typically faster and simpler than calling a reduce to a root followed by a broadcast.
A typical pattern in parallel iterative algorithms is to compute a local contribution to a residual norm, then call MPI_Allreduce with MPI_SUM or MPI_MAX, and then use the resulting global norm to decide whether to stop iterating.
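A sketch of that pattern, assuming math.h is included and that r, NLOCAL, and tol are illustrative names for the local residual entries, their count, and the convergence tolerance:

enum { NLOCAL = 1000 };              /* illustrative local problem size        */
double r[NLOCAL] = {0.0};            /* locally owned residual entries         */
const double tol = 1e-8;             /* convergence tolerance (illustrative)   */
/* ... fill r with the actual local residual ... */
double local_sq = 0.0, global_sq = 0.0;
for (int i = 0; i < NLOCAL; i++)
    local_sq += r[i] * r[i];         /* local contribution to the squared norm */
MPI_Allreduce(&local_sq, &global_sq, 1, MPI_DOUBLE,
              MPI_SUM, MPI_COMM_WORLD);
double residual_norm = sqrt(global_sq);   /* identical value on every rank */
if (residual_norm < tol) {
    /* every rank takes the same branch, so the iteration stops consistently */
}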
Common reduction operations
The most common predefined operations include MPI_SUM, MPI_PROD, MPI_MIN, MPI_MAX, MPI_LAND, MPI_LOR, and bitwise operators. They are defined only for specific datatypes.
It is important to use operations and datatypes that are compatible. For example, integer reductions should use integer datatypes, and floating point reductions must use floating point datatypes. Mixing them is an error.
Reductions are usually mathematically defined as combining values using an associative operation $f$. Ideally, all $N$ values $x_0, x_1, \dots, x_{N-1}$ are combined as
$$
x_0 \, f \, x_1 \, f \, \dots \, f \, x_{N-1}.
$$
In practice, especially for floating point operations, the result may depend slightly on the order in which the operations are carried out, because floating point addition and multiplication are not exactly associative in the presence of rounding. This can lead to small differences in results across different process counts or MPI implementations.
Prefix operations and scans
Some algorithms require not just a global reduction, but partial reductions along the ranks. MPI supports this with scan operations.
A scan computes prefix reductions. For processes labeled by increasing rank, process $k$ receives the reduction over processes $0$ through $k$. In exclusive scans, process $k$ receives the reduction over processes $0$ through $k-1$.
MPI provides MPI_Scan for inclusive scans and MPI_Exscan for exclusive scans.
A typical call is:
MPI_Scan(sendbuf, recvbuf, count, datatype,
op, comm);
Scan operations can be used in distributed prefix sums, evaluation of polynomial expressions in parallel, or ordered accumulation of weights. They are less widely used than reductions or broadcasts, but they can replace more complex point-to-point patterns and can benefit from optimized collective implementations.
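A common idiom, sketched here with the illustrative names local_n and offset, uses MPI_Exscan to compute each rank's starting position in a globally ordered array:

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int local_n = 10 + rank;      /* illustrative: each rank owns a different number of items */
int offset = 0;
MPI_Exscan(&local_n, &offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
if (rank == 0) offset = 0;    /* MPI_Exscan leaves the result undefined on rank 0 */
/* offset is now the global index at which this rank's items begin. */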
Synchronization with barriers
MPI barriers provide a simple form of global synchronization. All processes in the communicator must call the barrier. The function returns on each process only when all processes have entered the barrier.
In MPI:
MPI_Barrier(comm);
A barrier does not exchange data, but it can incur communication overhead. It is mainly used to delimit logical phases in an application, to measure performance of specific code sections, or to coordinate operations that must not overlap.
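For example, a sketch of timing a code section with barriers, assuming rank has been obtained with MPI_Comm_rank, stdio.h is included, and compute_phase() is a placeholder for the code being measured:

MPI_Barrier(MPI_COMM_WORLD);              /* line up all ranks before starting the clock */
double t0 = MPI_Wtime();
compute_phase();                          /* placeholder for the section being timed     */
MPI_Barrier(MPI_COMM_WORLD);              /* wait for the slowest rank to finish         */
double t1 = MPI_Wtime();
if (rank == 0)
    printf("phase took %g seconds\n", t1 - t0);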
In well designed parallel programs, barriers should be used sparingly. Often, more specific synchronization that emerges from the communication pattern itself is sufficient, and unnecessary barriers can reduce performance.
Collective communication semantics and ordering
Collectives have specific semantics that differ from simple send and receive operations. There are several rules that are important for correct and performant use.
First, every process in the communicator must call the collective with matching arguments. Collectives do not pair off dynamically like sends and receives. If any process skips a collective or passes inconsistent arguments, such as a different root rank or count, the program will usually hang.
Second, collectives must be called in the same order across all processes. If one process calls a broadcast and another process, at the same logical point, calls a reduce, both on the same communicator, a deadlock is likely, because the MPI library matches operations according to the call sequence.
Third, the standard collective functions are blocking: they do not return until the local part of the operation is complete and the buffers passed to them can safely be reused. The underlying algorithms may overlap communication and computation internally, but from the user's point of view the call blocks until it is locally done.
MPI also defines nonblocking collectives such as MPI_Ibcast, MPI_Iallreduce, and others. These separate initiation and completion and allow explicit overlap of communication and computation. They require careful handling of MPI requests and completion checks, and are covered when discussing performance oriented MPI programming.
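A sketch of the nonblocking pattern, assuming local_sum has already been computed and do_independent_work() stands in for computation that does not depend on the global result:

double local_sum = 0.0, global_sum = 0.0;
/* ... accumulate local_sum ... */
MPI_Request req;
MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, MPI_COMM_WORLD, &req);
do_independent_work();                  /* overlap: work that does not need global_sum */
MPI_Wait(&req, MPI_STATUS_IGNORE);      /* global_sum is valid only after completion   */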
All processes in a communicator must call the same collective operations in the same order, with compatible arguments, or the program behavior is undefined and will often hang.
Performance considerations for collectives
Collective operations are critical for scalability. A single poorly chosen collective can limit performance when the number of processes grows. Several practical principles are useful when designing algorithms that rely on collectives.
First, prefer collective operations over manually coded loops of point-to-point messages whenever the pattern matches a standard collective. MPI libraries typically implement collectives using algorithms that reduce communication steps from $O(P)$ to $O(\log P)$ where $P$ is the number of processes, for example through tree based schemes or ring algorithms.
Second, minimize the size and frequency of global collectives such as MPI_Allreduce, especially inside tight iteration loops. Global reductions introduce dependencies across all processes, which makes them latency sensitive at large scales. Algorithms that reduce the number of global synchronization points, or that use local convergence checks where possible, often scale better.
Third, consider restricting collectives to subcommunicators. If only a subset of processes truly needs to participate in a global operation, creating a communicator for that subset and using collectives on it can reduce overhead; a short sketch of this appears after these principles.
Fourth, be aware that collectives may internally allocate buffers or use different algorithms depending on message size and communicator size. Some MPI implementations provide tuning parameters or environment variables that select collective algorithms. On large systems, performance analysis tools can reveal which collectives dominate runtime and may need tuning.
Finally, remember that a collective does not necessarily synchronize all processes. For example, MPI_Bcast may return on different processes at slightly different times, even though each process has the data when its local call returns. Rely only on the guarantees stated in the MPI standard, and use explicit barriers or other synchronization only when strictly necessary.
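As an illustration of the third principle above, the following sketch splits MPI_COMM_WORLD by a hypothetical row index and reduces only within each row (ranks_per_row, local_val, and row_sum are illustrative names):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
const int ranks_per_row = 4;            /* assumed decomposition parameter */
int row = rank / ranks_per_row;         /* color: ranks with the same row share a communicator */
MPI_Comm row_comm;
MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);
double local_val = 1.0, row_sum = 0.0;
MPI_Allreduce(&local_val, &row_sum, 1, MPI_DOUBLE, MPI_SUM, row_comm);
MPI_Comm_free(&row_comm);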
Choosing appropriate collectives in applications
In practice, a distributed memory application often has several phases where data must be redistributed, gathered, or combined. Mapping these phases to appropriate collectives is an important part of algorithm design.
For example, at the start of a simulation, a root process might read configuration and input data, broadcast the configuration, and scatter subarrays to different ranks. During the simulation, processes might periodically compute local diagnostic values and participate in an allreduce to compute global metrics. For checkpointing or output, an application might use gather operations to collect small data, or parallel I/O routines for large data.
A useful design approach is to express communication intent at the highest possible level. If the goal is “all processes need the sum of all values,” then express that directly as an allreduce instead of coding a custom pattern. This makes the code clearer and allows the MPI implementation to do the optimization work.
As applications scale to more nodes and processes, collective communication patterns can become the key determinant of parallel efficiency. Recognizing when a pattern is naturally collective, understanding which specific collective to use, and being aware of its cost are essential skills in distributed memory programming with MPI.