What Makes Collective Communication Different
In distributed-memory programming with MPI, collective communication refers to operations that involve all processes in a communicator (often MPI_COMM_WORLD) according to a well-defined pattern. Unlike point-to-point communication:
- Every process in the communicator must call the collective operation (with matching parameters).
- Data movement patterns are predefined (broadcast, gather, reduce, etc.).
- Implementations are usually highly optimized and topology-aware.
Collective operations are essential for:
- Distributing input data.
- Combining partial results (e.g., sums, maxima).
- Reorganizing data across processes.
- Implementing synchronization points.
This chapter assumes you already understand basic MPI, communicators, and point-to-point communication.
Categories of Collective Operations
Collective operations can be grouped conceptually by what they accomplish:
- Synchronization and barriers
- One-to-many and many-to-one (broadcast, gather, scatter)
- Many-to-many / all-to-all (allgather, all-to-all)
- Global reductions (reduce, allreduce, scan)
- Neighbor collectives (topology-aware collectives)
Most MPI libraries provide both blocking and non-blocking variants for many of these operations.
Barriers and Synchronization
A barrier ensures that no process passes the barrier until all processes in the communicator have reached it.
- Typical MPI call: `MPI_Barrier(comm)`
- Use cases:
  - Marking phases of a computation.
  - Getting clean timing around a code section.
- Misuse:
  - Overusing barriers can destroy performance.
  - They are not needed before most collectives—collectives themselves define the needed synchronization.
Example (pseudo-code):
```c
MPI_Barrier(MPI_COMM_WORLD);
/* Now all processes start this timed section "together" */
start = MPI_Wtime();
/* compute and communicate */
end = MPI_Wtime();
```

One-to-Many and Many-to-One Operations
Broadcast (`MPI_Bcast`)
Broadcast sends the same data from one root process to all other processes.
Key features:
- One root, many receivers.
- All processes call the function; only root provides the initial data.
- The buffer arguments must be consistent across processes (same datatype, same count).
Typical usage:
```c
int root = 0;
double params[10];
if (rank == root) {
    /* initialize params */
}
MPI_Bcast(params, 10, MPI_DOUBLE, root, MPI_COMM_WORLD);
/* Now every process has the same "params" */
```

Common use cases:
- Distribute global configuration or input.
- Distribute initial conditions.
Gather (`MPI_Gather`) and Allgather (`MPI_Allgather`)
Gather
Gather collects values from all processes to a single root.
- Each process sends `sendcount` elements.
- Root receives `size * sendcount` elements in a receive buffer.
Usage example:
```c
int local = rank;   /* each process has one integer */
int *all = NULL;
if (rank == 0) {
    all = malloc(size * sizeof(int));
}
MPI_Gather(&local, 1, MPI_INT,
           all, 1, MPI_INT,
           0, MPI_COMM_WORLD);
/* At root: all[i] == i from process i */
```

Common use cases:
- Collect diagnostics or small results.
- Aggregate per-process statistics.
Allgather
Allgather is like gather but distributes the full collected result back to every process.
- After `MPI_Allgather`, every process has the complete array.
```c
int local = rank;
int *all = malloc(size * sizeof(int));
MPI_Allgather(&local, 1, MPI_INT,
              all, 1, MPI_INT,
              MPI_COMM_WORLD);
/* Now every process has all[0..size-1] */
```

Use cases:
- After local initialization, every process needs information from all others (e.g., small metadata, parameters per rank).
Scatter (`MPI_Scatter`)
Scatter distributes distinct chunks of an array from root to all processes.
- Root has an array of length `size * sendcount`.
- Each process receives `sendcount` elements.
Example:
```c
double *global_data = NULL;
double local_chunk[CHUNK];   /* CHUNK elements per process */
if (rank == 0) {
    global_data = malloc(CHUNK * size * sizeof(double));
    /* fill global_data */
}
MPI_Scatter(global_data, CHUNK, MPI_DOUBLE,
            local_chunk, CHUNK, MPI_DOUBLE,
            0, MPI_COMM_WORLD);
/* Each process now has its own piece of global_data */
```

Use cases:
- Distributing input data across processes at the start of a parallel computation.
Reduction Operations
Basic Reductions: `MPI_Reduce` and `MPI_Allreduce`
Reductions combine values from all processes using an operation, such as:
- `MPI_SUM`, `MPI_PROD`
- `MPI_MAX`, `MPI_MIN`
- Logical operations: `MPI_LAND`, `MPI_LOR`, etc.
`MPI_Reduce`
Combines data from all processes and stores the result on a root.
```c
double local_sum = /* partial sum on each process */;
double global_sum;
MPI_Reduce(&local_sum, &global_sum, 1,
           MPI_DOUBLE, MPI_SUM,
           0, MPI_COMM_WORLD);
/* Only root=0 has global_sum */
```

`MPI_Allreduce`
Same as MPI_Reduce, but the result is sent back to all processes.
```c
double local_sum = /* partial sum */;
double global_sum;
MPI_Allreduce(&local_sum, &global_sum, 1,
              MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
/* All processes now know global_sum */
```

Typical uses:
- Global norms in iterative solvers.
- Computing global maxima/minima for convergence checks.
- Global counters (e.g., total number of iterations, total work).
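For example, a global 2-norm for a convergence check is usually computed by summing local squares with `MPI_Allreduce` and taking the square root afterwards. A minimal sketch, assuming `x` and `n_local` hold this rank's portion of the vector:

```c
double local_sq = 0.0;
for (int i = 0; i < n_local; ++i) {
    local_sq += x[i] * x[i];           /* sum of squares of locally owned entries */
}
double global_sq;
MPI_Allreduce(&local_sq, &global_sq, 1,
              MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
double norm = sqrt(global_sq);         /* sqrt from <math.h>; identical on every process */
```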
Other Reduction Variants
- `MPI_Reduce_scatter`: Performs a reduction and simultaneously scatters the result segments to processes. Useful when each process needs only a portion of the reduced data.
- `MPI_Scan` and `MPI_Exscan`: Prefix reductions. `MPI_Scan` computes partial reductions up to each rank (inclusive); `MPI_Exscan` is exclusive (does not include the current process's own input).
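As a rough sketch of `MPI_Reduce_scatter`, assume an element-wise sum of a length-`size` vector where rank `i` needs only element `i` of the result; the data values are chosen purely for illustration:

```c
int *contrib = malloc(size * sizeof(int));
int *recvcounts = malloc(size * sizeof(int));
int my_elem;                          /* this rank keeps one element of the result */
for (int i = 0; i < size; ++i) {
    contrib[i] = rank + i;            /* this rank's contribution to element i */
    recvcounts[i] = 1;                /* every rank receives exactly one element */
}
MPI_Reduce_scatter(contrib, &my_elem, recvcounts,
                   MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* my_elem == element 'rank' of the element-wise sum across all processes */
```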
Example (MPI_Scan):
```c
int local = 1;
int prefix_sum;
MPI_Scan(&local, &prefix_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* On rank r: prefix_sum == r+1, assuming ranks start from 0 */
```

All-to-All Communication
All-to-all patterns let each process exchange distinct data with every other process.
`MPI_Alltoall`
Each process sends a separate block to every other process and receives one from each.
- Send buffer layout: logically divided into `size` blocks.
- Receive buffer layout: one block from each process.
Example (fixed-size blocks):
```c
int sendbuf[size];   /* sendbuf[i] is intended for rank i */
int recvbuf[size];
for (int i = 0; i < size; ++i) {
    sendbuf[i] = rank * 100 + i;   /* identify source and destination */
}
MPI_Alltoall(sendbuf, 1, MPI_INT,
             recvbuf, 1, MPI_INT,
             MPI_COMM_WORLD);
/* recvbuf[j] came from process j */
```

Use cases:
- Transpose operations in dense linear algebra.
- Redistributing data between different decompositions (e.g., domain decompositions in multi-dimensional grids).
Variable-Sized All-to-All
When each process sends a different amount of data to each destination:
- `MPI_Alltoallv`: variable counts per sender and receiver.
- Useful for irregular communication patterns (e.g., mesh adaptivity, graph algorithms).
Conceptually:
- Arrays `sendcounts[]`, `sdispls[]` and `recvcounts[]`, `rdispls[]` describe how many items to send/receive and where in the buffers to place them.
- Great flexibility, but can be more expensive and complex to set up.
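To make the bookkeeping concrete, here is a minimal sketch of `MPI_Alltoallv` in which each rank sends `rank + 1` integers to every destination; the counts and data are illustrative assumptions, not taken from any particular application:

```c
int *sendcounts = malloc(size * sizeof(int));
int *recvcounts = malloc(size * sizeof(int));
int *sdispls = malloc(size * sizeof(int));
int *rdispls = malloc(size * sizeof(int));

int stotal = 0, rtotal = 0;
for (int i = 0; i < size; ++i) {
    sendcounts[i] = rank + 1;     /* what this rank sends to rank i */
    recvcounts[i] = i + 1;        /* what rank i sends to this rank */
    sdispls[i] = stotal;          /* running offset into the send buffer */
    rdispls[i] = rtotal;          /* running offset into the receive buffer */
    stotal += sendcounts[i];
    rtotal += recvcounts[i];
}

int *sendbuf = malloc(stotal * sizeof(int));
int *recvbuf = malloc(rtotal * sizeof(int));
for (int i = 0; i < stotal; ++i) {
    sendbuf[i] = rank;            /* tag outgoing data with the sender's rank */
}

MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
              recvbuf, recvcounts, rdispls, MPI_INT,
              MPI_COMM_WORLD);
/* recvbuf now contains (j + 1) copies of j for each source rank j */
```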
Neighbor and Topology-Aware Collectives
In codes using MPI process topologies (e.g., Cartesian grids), neighbor collectives perform collectives between processes that are neighbors in that topology, not all processes.
Examples:
- `MPI_Neighbor_allgather`
- `MPI_Neighbor_alltoall`
Use cases:
- Stencil computations on structured grids.
- Multi-dimensional decompositions where each process communicates only with its logical neighbors.
Benefits:
- Communication restricted to necessary neighbors.
- MPI library can exploit routing/topology to optimize traffic.
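A minimal sketch of how a neighbor collective might be used on a 2-D periodic Cartesian grid; the grid setup and the exchanged value are illustrative assumptions:

```c
int dims[2] = {0, 0}, periods[2] = {1, 1};
MPI_Comm cart;
MPI_Dims_create(size, 2, dims);                   /* factor size into a 2-D grid */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                0, &cart);                        /* no rank reordering here */

int my_value = rank;   /* e.g., a per-process cell count or load indicator */
int nbr_values[4];     /* 2 * ndims neighbors, ordered -dim0, +dim0, -dim1, +dim1 */
MPI_Neighbor_allgather(&my_value, 1, MPI_INT,
                       nbr_values, 1, MPI_INT,
                       cart);
/* nbr_values[] holds the value contributed by each of the four grid neighbors */
```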
Blocking vs Non-blocking Collectives
Many collective operations have non-blocking forms, e.g.:
- Blocking: `MPI_Bcast`, `MPI_Allreduce`, `MPI_Alltoall`, ...
- Non-blocking: `MPI_Ibcast`, `MPI_Iallreduce`, `MPI_Ialltoall`, ...
Characteristics:
- Non-blocking collectives initiate the operation and return immediately with a request handle (`MPI_Request`).
- Completion must be ensured with `MPI_Wait` or `MPI_Test`.
Typical pattern:
```c
MPI_Request req;
MPI_Ibcast(params, 10, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);
/* Overlap useful work that does not depend on params */
do_local_work();
/* Make sure the broadcast finished before using params */
MPI_Wait(&req, MPI_STATUS_IGNORE);
```

Benefits:
- Potentially overlap communication and computation.
- Reduce idle time while waiting for global operations.
Caveats:
- Correctness can become more subtle (ensure data is not modified prematurely).
- Not every algorithm benefits; some are inherently synchronization-heavy.
Correct Usage and Common Pitfalls
Matching Participation
All processes in the communicator must:
- Call the same collective function.
- Use compatible arguments (datatype, count, root, communicator).
Common errors:
- One rank calls `MPI_Bcast`, another calls `MPI_Gather` at the same point.
- Passing different counts or datatypes on different processes.
- Forgetting to call the collective on one or more processes.
These lead to deadlocks or mysterious hangs.
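For illustration only, a buggy sketch of mismatched collectives (`buf` here is just a placeholder buffer):

```c
/* BUG: ranks disagree about which collective is being executed. */
if (rank == 0) {
    MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* rank 0 broadcasts...        */
} else {
    MPI_Gather(buf, 1, MPI_INT, NULL, 1, MPI_INT,
               0, MPI_COMM_WORLD);                    /* ...the others try to gather */
}
/* The calls never match across the communicator, so the program typically hangs. */
```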
Root and Buffer Rules
For root-based collectives (e.g., MPI_Bcast, MPI_Gather, MPI_Scatter):
- The `root` rank must be in the communicator.
- The root must provide valid buffers sized for the whole operation.
- On non-root processes, the root-only buffer arguments (e.g., the receive buffer for `MPI_Gather`, the send buffer for `MPI_Scatter`) are ignored, but syntactically valid arguments must still be passed to the function.
Memory layout must be consistent with the call:
- For `MPI_Gather` and `MPI_Scatter`, the root's receive/send buffers must have enough space for all processes' data.
Collectives Are Implicitly Synchronized, But…
Collectives provide the necessary synchronization for their own function:
- You do not need an explicit `MPI_Barrier` before or after a collective for it to work.
However:
- Collectives do not guarantee any ordering relative to other operations, except for MPI’s usual ordering guarantees within a communicator for the same pair of processes.
- If you rely on timing behavior around collectives, use barriers only when actually needed.
Performance Considerations for Collective Operations
Collectives are often crucial to parallel performance because:
- They can involve all processes, so their cost can grow with the communicator size.
- They can become scalability bottlenecks (e.g., frequent `MPI_Allreduce` in tight loops).
Key performance aspects:
- Algorithm choice inside MPI: Libraries choose between tree-based, ring, recursive doubling, and other algorithms depending on message size and communicator size.
- Topology awareness: Good MPI implementations exploit network topology to reduce latency and congestion.
- Collective frequency: Reducing how often collectives are called often yields more benefit than micro-optimizing their parameters.
- Data volume: Minimize the size of messages used in global operations; avoid broadcasting or reducing unnecessarily large buffers.
Practical advice:
- Prefer `MPI_Reduce` over `MPI_Allreduce` if only one process truly needs the result.
- Avoid putting global reductions inside innermost loops when possible; combine iterations or relax convergence checks.
- Use neighbor collectives or more localized patterns instead of full all-to-all when the algorithm allows it.
- Consider non-blocking collectives to overlap communication with computation.
Designing Algorithms Around Collectives
When designing distributed-memory algorithms:
- Identify where you need:
  - One-time broadcasts (e.g., input data).
  - Periodic global reductions (e.g., convergence criteria).
  - Data redistributions (e.g., between decompositions).
- Choose the simplest appropriate collective:
  - `MPI_Bcast` for one-to-all,
  - `MPI_Scatter` / `MPI_Gather` for one-to-many / many-to-one,
  - `MPI_Allgather` or `MPI_Alltoall` for many-to-many,
  - `MPI_Reduce` / `MPI_Allreduce` for global results.
- Minimize global synchronization:
  - Combine multiple small collectives into fewer, larger ones when feasible.
  - Keep global operations outside of tight performance-critical loops where possible.
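As a small illustration of combining collectives, two separate scalar reductions can often be packed into a single `MPI_Allreduce` on a short array; `local_residual` and `local_work` below are placeholders for whatever per-rank quantities a code actually reduces:

```c
/* One collective instead of two separate MPI_Allreduce calls for two scalars. */
double locals[2]  = { local_residual, local_work };   /* placeholder per-rank values */
double globals[2];
MPI_Allreduce(locals, globals, 2,
              MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
/* globals[0] and globals[1] now hold both global sums from a single collective */
```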
Using collective communication effectively is central to scalable distributed-memory programming: it enables clear, concise code while leveraging highly optimized implementations inside MPI.