What Makes Collective Communication Different
In distributed-memory programming with MPI, collective communication refers to operations that involve all processes in a communicator (often MPI_COMM_WORLD) according to a well-defined pattern. Unlike point-to-point communication:
- Every process in the communicator must call the collective operation (with matching parameters).
- Data movement patterns are predefined (broadcast, gather, reduce, etc.).
- Implementations are usually highly optimized and topology-aware.
Collective operations are essential for:
- Distributing input data.
- Combining partial results (e.g., sums, maxima).
- Reorganizing data across processes.
- Implementing synchronization points.
This chapter assumes you already understand basic MPI, communicators, and point-to-point communication.
Categories of Collective Operations
Collective operations can be grouped conceptually by what they accomplish:
- Synchronization and barriers
- One-to-many and many-to-one (broadcast, gather, scatter)
- Many-to-many / all-to-all (allgather, all-to-all)
- Global reductions (reduce, allreduce, scan)
- Neighbor collectives (topology-aware collectives)
Most MPI libraries provide both blocking and non-blocking variants for many of these operations.
Barriers and Synchronization
A barrier ensures that no process passes the barrier until all processes in the communicator have reached it.
- Typical MPI call: `MPI_Barrier(comm)`
- Use cases:
  - Marking phases of a computation.
  - Getting clean timing around a code section.
- Misuse:
  - Overusing barriers can destroy performance.
  - They are not needed before most collectives—collectives themselves define the needed synchronization.
Example (pseudo-code):
```c
MPI_Barrier(MPI_COMM_WORLD);
/* Now all processes start this timed section "together" */
start = MPI_Wtime();
/* compute and communicate */
end = MPI_Wtime();
```

One-to-Many and Many-to-One Operations
Broadcast (`MPI_Bcast`)
Broadcast sends the same data from one root process to all other processes.
Key features:
- One root, many receivers.
- All processes call the function; only root provides the initial data.
- The buffer arguments must be consistent across processes (same datatype, same count).
Typical usage:
```c
int root = 0;
double params[10];
if (rank == root) {
    /* initialize params */
}
MPI_Bcast(params, 10, MPI_DOUBLE, root, MPI_COMM_WORLD);
/* Now every process has the same "params" */
```

Common use cases:
- Distribute global configuration or input.
- Distribute initial conditions.
Gather (`MPI_Gather`) and Allgather (`MPI_Allgather`)
Gather
Gather collects values from all processes to a single root.
- Each process sends `sendcount` elements.
- Root receives `size * sendcount` elements in a receive buffer.
Usage example:
```c
int local = rank;   /* each process has one integer */
int *all = NULL;
if (rank == 0) {
    all = malloc(size * sizeof(int));
}
MPI_Gather(&local, 1, MPI_INT,
           all, 1, MPI_INT,
           0, MPI_COMM_WORLD);
/* At root: all[i] == i from process i */
```

Common use cases:
- Collect diagnostics or small results.
- Aggregate per-process statistics.
Allgather
Allgather is like gather but distributes the full collected result back to every process.
- After `MPI_Allgather`, every process has the complete array.
```c
int local = rank;
int *all = malloc(size * sizeof(int));
MPI_Allgather(&local, 1, MPI_INT,
              all, 1, MPI_INT,
              MPI_COMM_WORLD);
/* Now every process has all[0..size-1] */
```

Use cases:
- After local initialization, every process needs information from all others (e.g., small metadata, parameters per rank).
Scatter (`MPI_Scatter`)
Scatter distributes distinct chunks of an array from root to all processes.
- Root has an array of length `size * sendcount`.
- Each process receives `sendcount` elements.
Example:
```c
double *global_data = NULL;
double local_chunk[CHUNK];   /* CHUNK elements per process */
if (rank == 0) {
    global_data = malloc(CHUNK * size * sizeof(double));
    /* fill global_data */
}
MPI_Scatter(global_data, CHUNK, MPI_DOUBLE,
            local_chunk, CHUNK, MPI_DOUBLE,
            0, MPI_COMM_WORLD);
/* Each process now has its own piece of global_data */
```

Use cases:
- Distributing input data across processes at the start of a parallel computation.
Reduction Operations
Basic Reductions: `MPI_Reduce` and `MPI_Allreduce`
Reductions combine values from all processes using an operation, such as:
- `MPI_SUM`, `MPI_PROD`
- `MPI_MAX`, `MPI_MIN`
- Logical operations: `MPI_LAND`, `MPI_LOR`, etc.
`MPI_Reduce`
Combines data from all processes and stores the result on a root.
```c
double local_sum = /* partial sum on each process */;
double global_sum;
MPI_Reduce(&local_sum, &global_sum, 1,
           MPI_DOUBLE, MPI_SUM,
           0, MPI_COMM_WORLD);
/* Only root=0 has global_sum */
```

`MPI_Allreduce`
Same as MPI_Reduce, but the result is sent back to all processes.
```c
double local_sum = /* partial sum */;
double global_sum;
MPI_Allreduce(&local_sum, &global_sum, 1,
              MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
/* All processes now know global_sum */
```

Typical uses:
- Global norms in iterative solvers.
- Computing global maxima/minima for convergence checks.
- Global counters (e.g., total number of iterations, total work).
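For example, a global 2-norm for a convergence check is usually computed by summing local squares with `MPI_Allreduce` and taking the square root afterwards. A minimal sketch, assuming `x` and `n_local` hold this rank's portion of the vector:

```c
double local_sq = 0.0;
for (int i = 0; i < n_local; ++i) {
    local_sq += x[i] * x[i];           /* sum of squares of locally owned entries */
}
double global_sq;
MPI_Allreduce(&local_sq, &global_sq, 1,
              MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
double norm = sqrt(global_sq);         /* sqrt from <math.h>; identical on every process */
```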
Other Reduction Variants
- `MPI_Reduce_scatter`: Performs a reduction and simultaneously scatters the result segments to processes. Useful when each process needs only a portion of the reduced data.
- `MPI_Scan` and `MPI_Exscan`: Prefix reductions. `MPI_Scan` computes partial reductions up to each rank (inclusive); `MPI_Exscan` is exclusive (does not include the current process's own input).
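As a rough sketch of `MPI_Reduce_scatter`, assume an element-wise sum of a length-`size` vector where rank `i` needs only element `i` of the result; the data values are chosen purely for illustration:

```c
int *contrib = malloc(size * sizeof(int));
int *recvcounts = malloc(size * sizeof(int));
int my_elem;                          /* this rank keeps one element of the result */
for (int i = 0; i < size; ++i) {
    contrib[i] = rank + i;            /* this rank's contribution to element i */
    recvcounts[i] = 1;                /* every rank receives exactly one element */
}
MPI_Reduce_scatter(contrib, &my_elem, recvcounts,
                   MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* my_elem == element 'rank' of the element-wise sum across all processes */
```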
Example (MPI_Scan):
```c
int local = 1;
int prefix_sum;
MPI_Scan(&local, &prefix_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* On rank r: prefix_sum == r+1, assuming ranks start from 0 */
```

All-to-All Communication
All-to-all patterns let each process exchange distinct data with every other process.
`MPI_Alltoall`
Each process sends a separate block to every other process and receives one from each.
- Send buffer layout: logically divided into `size` blocks.
- Receive buffer layout: one block from each process.
Example (fixed-size blocks):
```c
int sendbuf[size];   /* sendbuf[i] is intended for rank i */
int recvbuf[size];
for (int i = 0; i < size; ++i) {
    sendbuf[i] = rank * 100 + i;   /* identify source and destination */
}
MPI_Alltoall(sendbuf, 1, MPI_INT,
             recvbuf, 1, MPI_INT,
             MPI_COMM_WORLD);
/* recvbuf[j] came from process j */
```

Use cases:
- Transpose operations in dense linear algebra.
- Redistributing data between different decompositions (e.g., domain decompositions in multi-dimensional grids).
Variable-Sized All-to-All
When each process sends a different amount of data to each destination:
- `MPI_Alltoallv`: variable counts per sender and receiver.
- Useful for irregular communication patterns (e.g., mesh adaptivity, graph algorithms).
Conceptually:
- Arrays `sendcounts[]`, `sdispls[]` and `recvcounts[]`, `rdispls[]` describe how many items to send/receive and where in the buffers to place them.
- Great flexibility, but can be more expensive and complex to set up.
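To make the bookkeeping concrete, here is a minimal sketch of `MPI_Alltoallv` in which each rank sends `rank + 1` integers to every destination; the counts and data are illustrative assumptions, not taken from any particular application:

```c
int *sendcounts = malloc(size * sizeof(int));
int *recvcounts = malloc(size * sizeof(int));
int *sdispls = malloc(size * sizeof(int));
int *rdispls = malloc(size * sizeof(int));

int stotal = 0, rtotal = 0;
for (int i = 0; i < size; ++i) {
    sendcounts[i] = rank + 1;     /* what this rank sends to rank i */
    recvcounts[i] = i + 1;        /* what rank i sends to this rank */
    sdispls[i] = stotal;          /* running offset into the send buffer */
    rdispls[i] = rtotal;          /* running offset into the receive buffer */
    stotal += sendcounts[i];
    rtotal += recvcounts[i];
}

int *sendbuf = malloc(stotal * sizeof(int));
int *recvbuf = malloc(rtotal * sizeof(int));
for (int i = 0; i < stotal; ++i) {
    sendbuf[i] = rank;            /* tag outgoing data with the sender's rank */
}

MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
              recvbuf, recvcounts, rdispls, MPI_INT,
              MPI_COMM_WORLD);
/* recvbuf now contains (j + 1) copies of j for each source rank j */
```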
Neighbor and Topology-Aware Collectives
In codes using MPI process topologies (e.g., Cartesian grids), neighbor collectives perform collectives between processes that are neighbors in that topology, not all processes.
Examples:
- `MPI_Neighbor_allgather`
- `MPI_Neighbor_alltoall`
Use cases:
- Stencil computations on structured grids.
- Multi-dimensional decompositions where each process communicates only with its logical neighbors.
Benefits:
- Communication restricted to necessary neighbors.
- MPI library can exploit routing/topology to optimize traffic.
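A minimal sketch of how a neighbor collective might be used on a 2-D periodic Cartesian grid; the grid setup and the exchanged value are illustrative assumptions:

```c
int dims[2] = {0, 0}, periods[2] = {1, 1};
MPI_Comm cart;
MPI_Dims_create(size, 2, dims);                   /* factor size into a 2-D grid */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                0, &cart);                        /* no rank reordering here */

int my_value = rank;   /* e.g., a per-process cell count or load indicator */
int nbr_values[4];     /* 2 * ndims neighbors, ordered -dim0, +dim0, -dim1, +dim1 */
MPI_Neighbor_allgather(&my_value, 1, MPI_INT,
                       nbr_values, 1, MPI_INT,
                       cart);
/* nbr_values[] holds the value contributed by each of the four grid neighbors */
```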
Blocking vs Non-blocking Collectives
Many collective operations have non-blocking forms, e.g.:
- Blocking: `MPI_Bcast`, `MPI_Allreduce`, `MPI_Alltoall`, ...
- Non-blocking: `MPI_Ibcast`, `MPI_Iallreduce`, `MPI_Ialltoall`, ...
Characteristics:
- Non-blocking collectives initiate the operation and return immediately with a request handle (`MPI_Request`).
- Completion must be ensured with `MPI_Wait` or `MPI_Test`.
Typical pattern:
```c
MPI_Request req;
MPI_Ibcast(params, 10, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);
/* Overlap useful work that does not depend on params */
do_local_work();
/* Make sure the broadcast finished before using params */
MPI_Wait(&req, MPI_STATUS_IGNORE);
```

Benefits:
- Potentially overlap communication and computation.
- Reduce idle time while waiting for global operations.
Caveats:
- Correctness can become more subtle (ensure data is not modified prematurely).
- Not every algorithm benefits; some are inherently synchronization-heavy.
Correct Usage and Common Pitfalls
Matching Participation
All processes in the communicator must:
- Call the same collective function.
- Use compatible arguments (datatype, count, root, communicator).
Common errors:
- One rank calls `MPI_Bcast`, another calls `MPI_Gather` at the same point.
- Passing different counts or datatypes on different processes.
- Forgetting to call the collective on one or more processes.
These lead to deadlocks or mysterious hangs.
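For illustration only, a buggy sketch of mismatched collectives (`buf` here is just a placeholder buffer):

```c
/* BUG: ranks disagree about which collective is being executed. */
if (rank == 0) {
    MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* rank 0 broadcasts...        */
} else {
    MPI_Gather(buf, 1, MPI_INT, NULL, 1, MPI_INT,
               0, MPI_COMM_WORLD);                    /* ...the others try to gather */
}
/* The calls never match across the communicator, so the program typically hangs. */
```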
Root and Buffer Rules
For root-based collectives (e.g., MPI_Bcast, MPI_Gather, MPI_Scatter):
- The `root` rank must be in the communicator.
- The root must provide valid buffers sized for the whole operation.
- On non-root processes, the root-only buffer arguments (e.g., the receive buffer for `MPI_Gather`, the send buffer for `MPI_Scatter`) are ignored, but syntactically valid arguments must still be passed to the function.
Memory layout must be consistent with the call:
- For `MPI_Gather` and `MPI_Scatter`, the root's receive/send buffers must have enough space for all processes' data.
Collectives Are Implicitly Synchronized, But…
Collectives provide the necessary synchronization for their own function:
- You do not need an explicit `MPI_Barrier` before or after a collective for it to work.
However:
- Collectives do not guarantee any ordering relative to other operations, except for MPI’s usual ordering guarantees within a communicator for the same pair of processes.
- If you rely on timing behavior around collectives, use barriers only when actually needed.
Performance Considerations for Collective Operations
Collectives are often crucial to parallel performance because:
- They can involve all processes, so their cost can grow with the communicator size.
- They can become scalability bottlenecks (e.g., frequent `MPI_Allreduce` in tight loops).
Key performance aspects:
- Algorithm choice inside MPI: Libraries choose between tree-based, ring, recursive doubling, and other algorithms depending on message size and communicator size.
- Topology awareness: Good MPI implementations exploit network topology to reduce latency and congestion.
- Collective frequency: Reducing how often collectives are called often yields more benefit than micro-optimizing their parameters.
- Data volume: Minimize the size of messages used in global operations; avoid broadcasting or reducing unnecessarily large buffers.
Practical advice:
- Prefer `MPI_Reduce` over `MPI_Allreduce` if only one process truly needs the result.
- Avoid putting global reductions inside innermost loops when possible; combine iterations or relax convergence checks.
- Use neighbor collectives or more localized patterns instead of full all-to-all when the algorithm allows it.
- Consider non-blocking collectives to overlap communication with computation.
Designing Algorithms Around Collectives
When designing distributed-memory algorithms:
- Identify where you need:
  - One-time broadcasts (e.g., input data).
  - Periodic global reductions (e.g., convergence criteria).
  - Data redistributions (e.g., between decompositions).
- Choose the simplest appropriate collective:
  - `MPI_Bcast` for one-to-all,
  - `MPI_Scatter` / `MPI_Gather` for one-to-many / many-to-one,
  - `MPI_Allgather` or `MPI_Alltoall` for many-to-many,
  - `MPI_Reduce` / `MPI_Allreduce` for global results.
- Minimize global synchronization:
  - Combine multiple small collectives into fewer, larger ones when feasible.
  - Keep global operations outside of tight performance-critical loops where possible.
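As a small illustration of combining collectives, two separate scalar reductions can often be packed into a single `MPI_Allreduce` on a short array; `local_residual` and `local_work` below are placeholders for whatever per-rank quantities a code actually reduces:

```c
/* One collective instead of two separate MPI_Allreduce calls for two scalars. */
double locals[2]  = { local_residual, local_work };   /* placeholder per-rank values */
double globals[2];
MPI_Allreduce(locals, globals, 2,
              MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
/* globals[0] and globals[1] now hold both global sums from a single collective */
```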
Using collective communication effectively is central to scalable distributed-memory programming: it enables clear, concise code while leveraging highly optimized implementations inside MPI.