

Typical Performance Pitfalls in MPI Programs

Distributed-memory programs often “work” but perform far below the potential of the hardware. This chapter focuses on common performance pitfalls specific to MPI-style, distributed-memory codes and how to recognize and avoid them.

1. Overhead from Too Many Small Messages

Sending many tiny messages is a classic source of poor performance in MPI programs:

Typical symptoms:

Common causes:

Mitigations (conceptual):
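
A common mitigation is to aggregate data and send it in fewer, larger messages. The sketch below (with illustrative buffer and count names) contrasts sending N values one at a time with sending them in a single message, so the per-message latency is paid once instead of N times.

#include <mpi.h>

#define N 1000   /* illustrative number of values to transfer */

/* Anti-pattern: N tiny messages, each paying the full per-message latency. */
void send_one_by_one(double *values, int rank) {
    if (rank == 0) {
        for (int i = 0; i < N; i++)
            MPI_Send(&values[i], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        for (int i = 0; i < N; i++)
            MPI_Recv(&values[i], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
}

/* Aggregated version: one message carrying all N values at once. */
void send_aggregated(double *values, int rank) {
    if (rank == 0) {
        MPI_Send(values, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(values, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}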

2. Load Imbalance Across Processes

Even if each process has the same number of tasks, some may be more expensive than others, leading to:

Indicators:

Typical sources:

Mitigations (conceptual):
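
A standard way to even out unpredictable task costs is dynamic scheduling, for example a simple master/worker scheme. The sketch below is one minimal illustration, not the only approach; num_tasks, process_task, and the tag values are placeholders for whatever the application actually does.

#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2

/* Placeholder for the application's real per-task computation. */
static double process_task(int task_id) { return (double)task_id; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int num_tasks = 1000;            /* illustrative task count */

    if (rank == 0) {
        /* Master: hand out tasks on demand instead of statically. */
        int next_task = 0, active_workers = size - 1;
        while (active_workers > 0) {
            double result;
            MPI_Status st;
            /* A message from any worker means it is ready for more work. */
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next_task < num_tasks) {
                MPI_Send(&next_task, 1, MPI_INT, st.MPI_SOURCE,
                         TAG_WORK, MPI_COMM_WORLD);
                next_task++;
            } else {
                int dummy = -1;
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE,
                         TAG_STOP, MPI_COMM_WORLD);
                active_workers--;
            }
        }
    } else {
        /* Worker: send the previous result (initially a dummy) to request
           the next task; stop when told to. */
        double result = 0.0;
        while (1) {
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            int task;
            MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = process_task(task);
        }
    }

    MPI_Finalize();
    return 0;
}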

3. Excessive Global Synchronization

Global synchronizations (explicit or implicit) can dominate runtime:

Symptoms:

Common patterns:

Mitigations:
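
One concrete option on MPI-3 systems is to replace a blocking reduction (often followed by a redundant MPI_Barrier) with a non-blocking collective, so that independent work can proceed while the reduction is in flight. In the sketch below, do_independent_local_work is a placeholder for computation that does not depend on the reduced value.

#include <mpi.h>

/* Placeholder for computation that does not depend on the global value. */
static void do_independent_local_work(void) { /* ... */ }

/* Returns the global sum of local_res while overlapping the reduction
   with independent local work (requires MPI-3 non-blocking collectives). */
double overlapped_residual(double local_res) {
    double global_res;
    MPI_Request req;

    /* Start the reduction without blocking. */
    MPI_Iallreduce(&local_res, &global_res, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    do_independent_local_work();          /* overlapped computation */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* complete the collective */
    return global_res;                    /* no extra MPI_Barrier needed */
}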

4. Poor Communication Patterns and Contention

How processes communicate can greatly affect performance:

4.1. Hotspotting and Imbalanced Communication

If many ranks communicate heavily with a small subset of ranks, those target ranks and their network links become hotspots that limit overall progress.

Examples:

Mitigations:
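
One possible mitigation is to introduce a hierarchy so that traffic is first concentrated at a few group leaders rather than at a single rank. The sketch below uses MPI_Comm_split for this; the group size of 32 is purely illustrative and would need tuning.

#include <mpi.h>

/* Split ranks into groups so that per-group leaders absorb traffic that
   would otherwise all target rank 0. The group size is illustrative. */
void build_groups(MPI_Comm *group_comm, MPI_Comm *leader_comm) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int group_size = 32;             /* tunable, purely illustrative */
    int color = rank / group_size;         /* which group this rank joins */

    /* Communicator connecting the ranks within one group. */
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, group_comm);

    /* Communicator containing only the group leaders (local rank 0). */
    int group_rank;
    MPI_Comm_rank(*group_comm, &group_rank);
    int leader_color = (group_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, leader_color, rank, leader_comm);

    /* Ranks now send to their leader via group_comm; leaders forward
       aggregated data to rank 0 via leader_comm. Non-leaders receive
       MPI_COMM_NULL for leader_comm. */
}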

4.2. All-to-All Communication Overuse

All-to-all patterns are inherently expensive on large systems:

Mitigations:
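
When each rank actually exchanges data with only a handful of partners, a full MPI_Alltoall can often be replaced by point-to-point transfers to just those partners. The sketch below assumes the partner list is known and small (here at most 64 per rank); the buffer and parameter names are illustrative.

#include <mpi.h>

#define MAX_PARTNERS 64   /* illustrative upper bound on partners per rank */

/* Exchange 'count' doubles with each partner listed in 'partners' using
   non-blocking point-to-point calls instead of a full MPI_Alltoall. */
void sparse_exchange(const int *partners, int npartners,
                     double *sendbuf, double *recvbuf, int count) {
    MPI_Request reqs[2 * MAX_PARTNERS];
    int nreq = 0;

    for (int i = 0; i < npartners; i++)      /* post all receives first */
        MPI_Irecv(&recvbuf[i * count], count, MPI_DOUBLE,
                  partners[i], 0, MPI_COMM_WORLD, &reqs[nreq++]);

    for (int i = 0; i < npartners; i++)      /* then the matching sends */
        MPI_Isend(&sendbuf[i * count], count, MPI_DOUBLE,
                  partners[i], 0, MPI_COMM_WORLD, &reqs[nreq++]);

    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}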

5. Blocking Communication and Lost Overlap

Blocking communication (MPI_Send, MPI_Recv, blocking collectives) can lead to:

Typical pattern:

MPI_Send(...);   // all ranks send
MPI_Recv(...);   // then all ranks receive
compute(...);    // only after communication

When used naively, this prevents overlap between communication and computation.

Mitigations (conceptual):
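
A typical restructuring is to post non-blocking communication first, do the work that does not depend on the incoming data, and only then wait. The halo-exchange sketch below is schematic: compute_interior and compute_boundary stand in for the application's real computation, and only one direction of the exchange is shown.

#include <mpi.h>

/* Placeholders for the application's real computation. */
static void compute_interior(void) { /* work independent of halo data */ }
static void compute_boundary(void) { /* work that needs the halo data */ }

/* One direction of a halo exchange, overlapped with interior computation. */
void exchange_and_compute(double *send_left, double *recv_right,
                          int count, int left, int right) {
    MPI_Request reqs[2];

    /* Post the communication first ... */
    MPI_Irecv(recv_right, count, MPI_DOUBLE, right, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_left, count, MPI_DOUBLE, left, 0,
              MPI_COMM_WORLD, &reqs[1]);

    /* ... then do the work that does not depend on the incoming data. */
    compute_interior();

    /* Complete the transfers before touching the received halo. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary();
}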

6. Poor Use of Collective Operations

Collective operations are powerful but can be misused:

6.1. Overusing Global Collectives

Repeated use of MPI_Allreduce, MPI_Allgather, or MPI_Bcast can dominate runtime:

Examples:

Mitigations:
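
One simple mitigation is to batch values that are reduced with the same operation into a single call, so the collective's latency is paid once per iteration rather than once per value. The variable names in the sketch below are illustrative.

#include <mpi.h>

/* Batch three per-step scalar reductions (all using MPI_SUM) into a single
   MPI_Allreduce on a small array instead of three separate calls. */
void reduce_diagnostics(double local_mass, double local_energy,
                        double local_error, double global[3]) {
    double local[3] = { local_mass, local_energy, local_error };

    /* One collective call instead of three; its latency is paid once. */
    MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}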

6.2. Not Using Collectives When They Help

Conversely, re-implementing collective patterns with ad hoc point-to-point communication typically performs worse than the library-provided collectives, which use algorithms tuned for the system.

Mitigation:
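
For example, gathering one value per rank at a root with a hand-rolled loop of sends and receives forces a linear pattern, whereas a single MPI_Gather lets the library choose an efficient (typically tree-based) algorithm:

#include <mpi.h>

/* Gather one double per rank at rank 0 with a single collective instead of
   a manual loop of MPI_Send/MPI_Recv pairs. 'all_values' must have room
   for one entry per rank and is only meaningful on rank 0. */
void gather_results(double local_value, double *all_values) {
    MPI_Gather(&local_value, 1, MPI_DOUBLE,
               all_values, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}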

7. Inefficient Process Topology and Mapping

Ignoring the relationship between MPI ranks and hardware topology hurts performance:

Symptoms:

Typical issues:

Mitigations:
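
One low-effort step is to describe the logical decomposition to MPI and let it reorder ranks to match the hardware. The sketch below builds a 2D Cartesian communicator with reordering enabled and queries neighbor ranks instead of computing them by hand; the choice of a 2D grid is illustrative.

#include <mpi.h>

/* Build a 2D Cartesian communicator, letting MPI pick the grid shape and
   (with reorder = 1) possibly remap ranks to match the hardware, then
   query the neighbor ranks instead of computing them by hand. */
void make_cartesian(MPI_Comm *cart, int *left, int *right, int *down, int *up) {
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[2] = {0, 0};             /* 0 = let MPI_Dims_create choose */
    MPI_Dims_create(nprocs, 2, dims);

    int periods[2] = {0, 0};          /* non-periodic boundaries */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, cart);

    /* Neighbors in each dimension; MPI_PROC_NULL at the domain edges. */
    MPI_Cart_shift(*cart, 0, 1, left, right);
    MPI_Cart_shift(*cart, 1, 1, down, up);
}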

8. Communication/Computation Granularity Problems

The size of work units affects scalability:

Typical pitfalls:

Mitigations:

9. Inefficient I/O Patterns in Distributed Runs

While I/O is covered in detail elsewhere, certain patterns are particularly harmful in distributed-memory MPI programs:

Performance consequences:

Mitigations (conceptual):
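
As one illustration, MPI-IO collective writes let all ranks write disjoint parts of a single shared file instead of funneling everything through rank 0 or creating one file per rank. The sketch below assumes equally sized contiguous blocks per rank; real codes usually derive offsets or file views from the actual decomposition.

#include <mpi.h>

/* Collective write of one shared file: each rank writes its own contiguous
   block at a rank-dependent offset. Assumes equal block sizes per rank. */
void write_shared_file(const char *filename,
                       const double *local_data, int local_count) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * local_count * sizeof(double);
    MPI_File_write_at_all(fh, offset, local_data, local_count,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}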

10. Scalability Limits: Latency, Bandwidth, and Amdahl-Like Effects

As you increase the number of processes, certain hidden costs become visible:

Common manifestations:

Mitigations:
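
A rough cost model already makes these effects visible before any measurement. The sketch below uses the common latency-plus-bandwidth message model and an Amdahl-style bound with a serial fraction; all numbers are illustrative assumptions, not properties of any particular machine.

#include <stdio.h>

/* Back-of-the-envelope model, not a measurement: message time is modeled as
   latency + size / bandwidth, and speedup with serial fraction s follows
   the Amdahl-style bound 1 / (s + (1 - s) / p). All numbers are assumed. */
int main(void) {
    const double latency_s = 1.0e-6;     /* ~1 microsecond per message */
    const double bandwidth = 10.0e9;     /* ~10 GB/s per link */

    double small_msg = latency_s + 8.0   / bandwidth;   /* 8-byte message */
    double large_msg = latency_s + 8.0e6 / bandwidth;   /* 8 MB message */
    printf("8 B message:  %.3g s (latency-dominated)\n", small_msg);
    printf("8 MB message: %.3g s (bandwidth-dominated)\n", large_msg);

    const double serial_fraction = 0.05; /* assumed 5% non-parallel work */
    for (int p = 1; p <= 1024; p *= 4) {
        double speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / p);
        printf("p = %4d  ->  modeled speedup %.1f\n", p, speedup);
    }
    return 0;
}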

11. Neglecting Performance Portability and Tuning

Finally, a more subtle pitfall is neglecting performance portability and tuning: assuming that a code tuned (or left untuned) on one system will behave the same on another.

Consequences:

Mitigations:

12. Recognizing and Addressing Pitfalls

To effectively avoid performance pitfalls:
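
A useful first step is simply to measure where the time goes. The sketch below uses MPI_Wtime to separate computation from communication and reports the maximum over ranks, since the slowest rank sets the pace; compute_phase and communicate_phase are placeholders for the application's real phases.

#include <mpi.h>
#include <stdio.h>

/* Placeholders for the application's real phases. */
static void compute_phase(void)     { /* ... */ }
static void communicate_phase(void) { /* ... */ }

/* Coarse timing of one step: separate compute from communication time and
   report the maximum over ranks, since the slowest rank sets the pace. */
void timed_step(void) {
    double t0 = MPI_Wtime();
    compute_phase();
    double t1 = MPI_Wtime();
    communicate_phase();
    double t2 = MPI_Wtime();

    double local[2] = { t1 - t0, t2 - t1 }, max_times[2];
    MPI_Reduce(local, max_times, 2, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("compute %.3f s, communication %.3f s (max over ranks)\n",
               max_times[0], max_times[1]);
}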

Most performance problems in distributed-memory codes are not due to “slow hardware” but to a mismatch between the communication/computation structure of the code and the characteristics of the underlying system. Identifying the specific pitfall patterns described above is the first step toward systematic optimization.
