Typical Performance Pitfalls in MPI Programs
Distributed-memory programs often “work” but perform far below the potential of the hardware. This chapter focuses on common performance pitfalls specific to MPI-style, distributed-memory codes and how to recognize and avoid them.
1. Overhead from Too Many Small Messages
Sending many tiny messages is a classic source of poor performance in MPI programs:
- Each message has a fixed latency cost, independent of size.
- Many small messages multiply this latency.
- Network injection and protocol overhead become dominant.
Typical symptoms:
- Runtime dominated by communication even though total data volume is modest.
- Profilers showing frequent calls to MPI_Send, MPI_Recv, or small collectives.
Common causes:
- Sending one element at a time in loops.
- Per-cell or per-particle communication where each cell/particle is a separate message.
- Fine-grained synchronization using small messages.
Mitigations (conceptual):
- Aggregate data into larger buffers before sending (see the sketch after this list).
- Use derived datatypes to send non-contiguous data in a single message.
- Replace repeated point-to-point patterns with suitable collectives when appropriate.
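A minimal sketch of these ideas in C, assuming a contiguous array values of n doubles and a hypothetical destination rank dest; the element-wise variant is shown only as the anti-pattern to avoid:

```c
#include <mpi.h>

/* Anti-pattern: one MPI_Send per element pays the per-message latency n times. */
void send_elementwise(const double *values, int n, int dest)
{
    for (int i = 0; i < n; ++i)
        MPI_Send(&values[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* Better: the whole contiguous array travels in a single message. */
void send_aggregated(const double *values, int n, int dest)
{
    MPI_Send(values, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* Non-contiguous data (here: every k-th element) can also go in one message
 * by describing the layout with a derived datatype. */
void send_strided(const double *values, int n, int k, int dest)
{
    MPI_Datatype strided;
    MPI_Type_vector(n / k, 1, k, MPI_DOUBLE, &strided);  /* n/k blocks of 1 double, stride k */
    MPI_Type_commit(&strided);
    MPI_Send(values, 1, strided, dest, 0, MPI_COMM_WORLD);
    MPI_Type_free(&strided);
}
```

The matching receive can use an ordinary contiguous buffer; the derived datatype only describes the layout on the sending side in this sketch.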
2. Load Imbalance Across Ranks
Even if each process has the same number of tasks, some may be more expensive than others, leading to:
- Some ranks finishing much earlier and sitting idle.
- Overall runtime determined by the slowest rank.
Indicators:
- Different process runtimes on the same job.
- Profilers showing some ranks spending much time in MPI_Barrier or waiting in blocking receives (a minimal measurement sketch appears after the mitigations below).
Typical sources:
- Static domain decomposition with highly non-uniform physics (e.g., adaptive mesh, localized hotspots).
- Rank-specific I/O or pre/post-processing.
- Assigning special roles to certain ranks (e.g., one rank does global aggregation while others are idle).
Mitigations (conceptual):
- Better domain decomposition to equalize computational cost per rank.
- Dynamic or load-balanced work assignment where feasible.
- Avoiding “master does everything, workers wait” designs in the hot path.
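Before rebalancing, it is worth quantifying the imbalance. The sketch below (illustrative names, not from any particular code) reduces each rank's measured compute time and reports the max/avg ratio, which stays close to 1.0 for well-balanced runs:

```c
#include <mpi.h>
#include <stdio.h>

/* Report how unevenly compute time is distributed across ranks. */
void report_imbalance(double local_compute_seconds)
{
    double t_max, t_min, t_sum;
    int rank, size;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Reduce(&local_compute_seconds, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_compute_seconds, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_compute_seconds, &t_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double t_avg = t_sum / size;
        /* max/avg near 1.0 means balanced; large values mean many ranks sit idle. */
        printf("compute time: max %.3f s, min %.3f s, max/avg %.2f\n",
               t_max, t_min, t_max / t_avg);
    }
}
```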
3. Excessive Global Synchronization
Global synchronizations (explicit or implicit) can dominate runtime:
- MPI_Barrier is the most obvious source; if used frequently, it serializes progress.
- Many collectives (e.g., MPI_Allreduce) can act as implicit synchronization if overused.
Symptoms:
- Trace timelines show processes repeatedly converging at the same point and waiting.
- Removing or relaxing barriers gives immediate speed-ups.
Common patterns:
- Using MPI_Barrier as a debugging aid and leaving it in production code.
- Adding barriers to “make sure everyone is ready” instead of reasoning about the needed communication.
- Using global reductions at every iteration for diagnostics or convergence checks, even when not strictly necessary.
Mitigations:
- Remove unnecessary barriers; most MPI programs need very few.
- Batch global operations (e.g., checking convergence every N iterations instead of every iteration); see the sketch after this list.
- Use non-blocking collectives to overlap computation with communication where available.
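As a concrete example of batching, the sketch below checks global convergence only every CHECK_INTERVAL iterations; the interval, the residual threshold, and do_local_update() are hypothetical placeholders:

```c
#include <mpi.h>

#define CHECK_INTERVAL 10        /* hypothetical batching factor */

double do_local_update(void);    /* hypothetical: one local iteration, returns the local residual */

void solve(void)
{
    double local_residual = 1.0, global_residual = 1.0;
    int iter = 0;

    while (global_residual > 1e-8) {
        local_residual = do_local_update();

        /* One reduction amortized over CHECK_INTERVAL iterations,
         * instead of an MPI_Allreduce in every single iteration. */
        if (++iter % CHECK_INTERVAL == 0) {
            MPI_Allreduce(&local_residual, &global_residual, 1,
                          MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        }
    }
}
```

The trade-off is that up to CHECK_INTERVAL - 1 extra iterations may run after convergence has actually been reached.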
4. Poor Communication Patterns and Contention
How processes communicate can greatly affect performance:
4.1. Hotspotting and Imbalanced Communication
If many ranks communicate heavily with a small subset of ranks:
- Those ranks become communication bottlenecks.
- Network links local to those ranks become saturated (hotspots).
Examples:
- Centralized “root collector” that receives large data from all ranks.
- Global “manager” rank that all workers contact every iteration.
Mitigations:
- Tree-based reductions or scatter/gather patterns instead of flat star patterns.
- Hierarchical or multi-level communication (intra-node, then inter-node); a sketch follows this list.
- Using collective operations that are implemented with efficient communication trees.
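To make the multi-level idea concrete, here is a hedged sketch of a two-level sum that reduces within each node first and then across node leaders; the communicator names are illustrative. Good MPI libraries often do something similar inside their collectives already, so this pattern mainly pays off when it replaces a hand-rolled flat gather to a single root:

```c
#include <mpi.h>

double hierarchical_sum(double local_value)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;
    double node_sum = 0.0, global_sum = 0.0;

    /* Group the ranks that share a node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Step 1: reduce within each node (fast, shared-memory transport). */
    MPI_Reduce(&local_value, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Step 2: the node leaders (node_rank == 0) combine the node sums. */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 3: each leader broadcasts the result back within its node. */
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);
    MPI_Comm_free(&node_comm);
    return global_sum;
}
```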
4.2. All-to-All Communication Overuse
All-to-all patterns are inherently expensive on large systems:
- Data movement scales poorly (often $O(P^2)$ in number of messages).
- Network contention becomes severe as the process count grows.
Mitigations:
- Avoid a full MPI_Alltoall where sparse or structured communication is sufficient (a sparse-exchange sketch follows this list).
- Exploit application structure (e.g., nearest-neighbor, stencil) to limit communication partners.
- Reorder data or computation to reduce need for global reshuffles.
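A minimal sketch of the sparse alternative, assuming each rank already knows its small list of partner ranks (the partners array, buffers, and message sizes are illustrative):

```c
#include <mpi.h>

/* Exchange a fixed-size block with each of a few known partners instead of
 * calling MPI_Alltoall over all P ranks. */
void sparse_exchange(const int *partners, int num_partners,
                     double *send_bufs[], double *recv_bufs[], int count)
{
    MPI_Request reqs[2 * num_partners];   /* C99 variable-length array */

    for (int i = 0; i < num_partners; ++i) {
        MPI_Irecv(recv_bufs[i], count, MPI_DOUBLE, partners[i], 0,
                  MPI_COMM_WORLD, &reqs[i]);
        MPI_Isend(send_bufs[i], count, MPI_DOUBLE, partners[i], 0,
                  MPI_COMM_WORLD, &reqs[num_partners + i]);
    }

    /* Each rank waits only for its own few partners, not for all P ranks. */
    MPI_Waitall(2 * num_partners, reqs, MPI_STATUSES_IGNORE);
}
```

MPI 3.x also offers neighborhood collectives (e.g., MPI_Neighbor_alltoall) for exactly this kind of sparse, structured exchange.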
5. Blocking Communication and Lost Overlap
Blocking communication (MPI_Send, MPI_Recv, blocking collectives) can lead to:
- Processes stalled while waiting for data, even though they have local computation that could be done.
- Serialization of communication and computation into strictly alternating phases.
Typical pattern:
```c
MPI_Send(...);   // all ranks send
MPI_Recv(...);   // then all ranks receive
compute(...);    // only after communication
```

When used naively, this pattern prevents any overlap between communication and computation.
Mitigations (conceptual):
- Use non-blocking communication (MPI_Isend, MPI_Irecv, non-blocking collectives) to start communication early and compute while data is in flight (see the sketch after this list).
- Structure code to expose independent work that can be done while waiting.
- Avoid ping-pong patterns where two ranks repeatedly block on each other.
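A typical restructuring looks like the following sketch, where compute_interior() and compute_boundary() are hypothetical placeholders for work that does not, and does, depend on the incoming data:

```c
#include <mpi.h>

void compute_interior(void);                  /* hypothetical: work independent of incoming data */
void compute_boundary(double *halo, int n);   /* hypothetical: work that needs the received data */

void exchange_and_compute(double *send_buf, double *recv_buf, int count,
                          int left, int right)
{
    MPI_Request reqs[2];

    /* Post the communication first... */
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ...then compute on data that is already local while messages are in flight. */
    compute_interior();

    /* Wait only when the incoming data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(recv_buf, count);
}
```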
6. Poor Use of Collective Operations
Collective operations are powerful but can be misused:
6.1. Overusing Global Collectives
Repeated use of MPI_Allreduce, MPI_Allgather, or MPI_Bcast can dominate runtime:
- Each collective often touches all ranks.
- Performance impact grows with process count.
Examples:
- Logging or diagnostics using MPI_Allgather every iteration.
- Multiple separate MPI_Allreduce calls for related quantities that could be combined.
Mitigations:
- Reduce frequency of collectives where possible.
- Combine related reductions into a single collective using arrays or derived types (see the sketch after this list).
- Use more targeted collectives (e.g., reduce to root then broadcast only when needed).
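For example, three related sums can share one reduction by packing them into an array (the physical quantities here are purely illustrative); this works whenever all values use the same reduction operation:

```c
#include <mpi.h>

/* One MPI_Allreduce for three related quantities instead of three calls. */
void combined_reduction(double local_mass, double local_energy,
                        double local_momentum, double global[3])
{
    double local[3] = { local_mass, local_energy, local_momentum };

    /* The per-call latency is paid once; the extra payload is negligible. */
    MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```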
6.2. Not Using Collectives When They Help
Conversely, re-implementing collective patterns with ad hoc point-to-point communication:
- Usually performs worse than optimized library collectives.
- Adds more code complexity and greater risk of inefficiencies.
Mitigation:
- Prefer standard collectives for common global operations unless you have a specific, justified reason not to, as sketched below.
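The contrast, with illustrative buffer names, is a root rank looping over point-to-point receives versus a single library call:

```c
#include <mpi.h>
#include <string.h>

/* Hand-rolled gather: the root posts P-1 receives one by one. */
void gather_by_hand(const double *local, int count, double *all, int root)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == root) {
        memcpy(&all[root * count], local, count * sizeof(double));
        for (int src = 0; src < size; ++src)
            if (src != root)
                MPI_Recv(&all[src * count], count, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(local, count, MPI_DOUBLE, root, 0, MPI_COMM_WORLD);
    }
}

/* Library collective: one line, and typically an optimized (e.g., tree-based)
 * algorithm underneath. */
void gather_with_collective(const double *local, int count, double *all, int root)
{
    MPI_Gather(local, count, MPI_DOUBLE, all, count, MPI_DOUBLE, root, MPI_COMM_WORLD);
}
```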
7. Inefficient Process Topology and Mapping
Ignoring the relationship between MPI ranks and hardware topology hurts performance:
- Neighboring ranks in the algorithm may be placed far apart in the network.
- Communication that is logically local becomes physically expensive.
Symptoms:
- Large differences in communication times depending on rank numbers.
- Better performance on smaller jobs or when ranks are accidentally well placed.
Typical issues:
- Random or default rank placement on systems where topology-aware placement is required for good performance.
- 2D/3D domain decomposition mapped onto ranks in a way that doesn’t reflect network locality.
Mitigations:
- Use MPI process topologies to match the communication pattern (e.g., Cartesian communicators); a sketch follows this list.
- Use job scheduler options or MPI mapping options to keep communicating ranks close (e.g., same node, neighboring nodes).
- Group communicators according to hardware locality (nodes, sockets, NUMA domains).
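A short sketch of a 2D Cartesian setup; the dimensionality and variable names are illustrative. With reorder enabled, the library is allowed to renumber ranks so that neighbors in the grid are also close in the machine:

```c
#include <mpi.h>

/* Build a 2D Cartesian communicator and look up the four neighbor ranks. */
void setup_cartesian(MPI_Comm *cart, int *left, int *right, int *down, int *up)
{
    int size;
    int dims[2] = {0, 0};      /* 0 lets MPI_Dims_create choose a balanced factorization */
    int periods[2] = {0, 0};   /* non-periodic boundaries */

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);

    /* reorder = 1 permits rank renumbering for better placement. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, cart);

    /* Neighbor ranks along each dimension, e.g., for halo exchanges. */
    MPI_Cart_shift(*cart, 0, 1, left, right);
    MPI_Cart_shift(*cart, 1, 1, down, up);
}
```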
8. Communication/Computation Granularity Problems
The size of work units affects scalability:
- Too fine-grained: Very frequent communication relative to computation; overhead dominates.
- Too coarse-grained: Very large phases of computation followed by heavy, bursty communication; harder to overlap.
Typical pitfalls:
- Synchronous, iteration-by-iteration communication for very small updates.
- Very large time steps or iterations that require extensive global synchronization afterwards.
Mitigations:
- Adjust algorithm granularity so that each communication step amortizes its overhead over enough computation.
- Reorganize algorithms to reduce the frequency of global interactions while preserving correctness.
9. Inefficient I/O Patterns in Distributed Runs
While I/O is covered in detail elsewhere, certain patterns are particularly harmful in distributed-memory MPI programs:
- All ranks writing many small files to a shared filesystem (e.g., per-rank logs, checkpoints).
- Collective writes implemented as many independent serial writes.
- Root rank doing massive serial I/O while others wait.
Performance consequences:
- Filesystem metadata servers become bottlenecks.
- I/O phases dominate runtime as scale increases.
Mitigations (conceptual):
- Prefer collective or aggregated I/O via suitable libraries or MPI I/O (see the sketch after this list).
- Combine per-rank outputs where possible.
- Reduce output frequency and volume, especially in large runs.
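As a minimal illustration, the sketch below lets every rank write its contiguous block of a single shared file with a collective MPI I/O call; the file name, the offset computation, and the missing error handling are simplifying assumptions:

```c
#include <mpi.h>

/* Each rank writes local_count doubles to its own block of one shared file. */
void write_checkpoint(const double *local, int local_count)
{
    MPI_File fh;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Offset offset = (MPI_Offset)rank * local_count * sizeof(double);

    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective call: the library can aggregate writes from many ranks. */
    MPI_File_write_at_all(fh, offset, local, local_count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```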
10. Scalability Limits: Latency, Bandwidth, and Amdahl-Like Effects
As you increase the number of processes, certain hidden costs become visible:
- Latency dominates for small per-rank workloads.
- Bandwidth limitations show up when many ranks communicate concurrently.
- Sequential or poorly parallelized parts of the code (including initialization and finalization) limit speedup.
Common manifestations:
- Adding more nodes no longer speeds up the job (or even slows it down).
- Profiling shows a growing fraction of time in communication routines and non-parallel sections.
Mitigations:
- Increase per-rank work when possible (weak scaling) or redesign for better strong scaling.
- Reduce global operations that grow with process count.
- Identify and optimize (or parallelize) the most expensive sequential sections.
11. Neglecting Performance Portability and Tuning
Finally, a more subtle pitfall:
- Assuming that communication patterns and parameters tuned on one system are optimal everywhere.
- Hard-coding parameters (e.g., message sizes, decomposition layouts, buffer sizes) that worked on a small test but don’t scale or port well.
Consequences:
- Code appears fine on small local clusters but performs poorly on large production systems.
- Unexpected regressions when moving between machines or MPI implementations.
Mitigations:
- Expose tunable parameters (e.g., domain decomposition choices) via configuration rather than hard-coding.
- Benchmark and profile on target systems, not only on development machines.
- Be aware that different MPI libraries and interconnects may favor different granularities and patterns.
12. Recognizing and Addressing Pitfalls
To effectively avoid performance pitfalls:
- Measure and profile: intuition alone is often misleading at scale.
- Compare communication time with computation time as the process count grows (a minimal timing sketch follows this list).
- Examine patterns: frequency, size, and type of MPI calls; who talks to whom; where ranks wait.
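A minimal instrumentation sketch, with compute_local() and exchange_halos() standing in for the application's own phases (both hypothetical):

```c
#include <mpi.h>
#include <stdio.h>

void compute_local(void);     /* hypothetical: the local computation phase */
void exchange_halos(void);    /* hypothetical: the communication phase */

/* Accumulate separate timers for computation and communication. */
void timed_step(double *t_comp, double *t_comm)
{
    double t0 = MPI_Wtime();
    compute_local();
    double t1 = MPI_Wtime();
    exchange_halos();
    double t2 = MPI_Wtime();

    *t_comp += t1 - t0;
    *t_comm += t2 - t1;
}

/* Reduce the timers and print the worst-case split across ranks. */
void report(double t_comp, double t_comm)
{
    double local[2] = { t_comp, t_comm }, max[2];
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Reduce(local, max, 2, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("max compute %.3f s, max communication %.3f s\n", max[0], max[1]);
}
```

Tracking how this split changes as the process count grows makes communication-bound scaling limits visible early.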
Most performance problems in distributed-memory codes are not due to “slow hardware” but to a mismatch between the communication/computation structure of the code and the characteristics of the underlying system. Identifying the specific pitfall patterns described above is the first step toward systematic optimization.