Introduction to MPI

Context and Scope

In distributed memory programming, each process has its own private address space. Exchanging data requires explicit communication. MPI, the Message Passing Interface, is the dominant standard for this style of programming in HPC. This chapter introduces what MPI is, what problems it solves, how an MPI program looks at a high level, and how to compile and run simple MPI applications on typical HPC systems. Later chapters will cover MPI processes, point to point and collective communication, and performance pitfalls in more detail.

What MPI Is and What It Is Not

MPI is a standardized library interface for message passing. The standard defines functions, data types, and behaviors that allow separate processes to communicate by sending and receiving data. Multiple vendors and projects provide implementations of this standard, such as MPICH, Open MPI, and Intel MPI. They all expose the same basic programming interface, with some extensions and performance differences.

MPI is not a programming language. You do not write programs "in MPI". Instead, you write programs in C, C++, Fortran, or other supported languages and call MPI functions from that language. In a typical MPI program, most of the code is ordinary serial code. Only specific parts, where data needs to be exchanged or work needs to be coordinated among processes, call MPI routines.

MPI is not a shared memory model. When you use MPI, each process has its own memory. A pointer in one process has no meaning in another. Any data that must be visible across processes must be sent explicitly through MPI calls. This property makes reasoning about correctness simpler in many cases, but it shifts the responsibility for data movement to the programmer.

MPI is also not tied to a single machine architecture. The same MPI code can run on a laptop, on a cluster, or on a supercomputer. An MPI implementation hides the details of the interconnect, for example whether it uses Ethernet or InfiniBand, and provides a unified communication interface.

Typical Use Cases for MPI

MPI shines when you need to scale across many nodes in a cluster. Each node runs multiple MPI processes that cooperate to solve one large problem. This model fits many scientific and engineering computations that involve large arrays or grids, such as climate models, fluid dynamics, molecular dynamics, and large linear algebra problems.

An MPI program is usually organized as a single executable launched many times, one instance per process. All instances start together and form a group that knows how many processes participate and which rank each process has. The ranks and group concept are central in MPI and will be discussed further in the next chapter on MPI processes.

MPI is appropriate when:

You need to use more memory than one node provides. Each process can hold a subset of the global data in its local memory, and MPI transfers only the necessary pieces.

You want to exploit many cores across many nodes. MPI allows you to distribute computation across many processes that may be located on different nodes.

You need explicit control over data distribution and communication. MPI gives you fine grained control over how data is partitioned, when communication happens, and what pattern of communication is used.

On the other hand, MPI can be more complex than shared memory approaches when used inside a single node, especially for beginners. In practice, many real applications use MPI across nodes and shared memory programming such as OpenMP within a node.

Basic Structure of an MPI Program

Although MPI offers many advanced features, its core structure is simple. An MPI program must follow a few basic rules so that the MPI implementation can set up and shut down the communication environment.

At a high level, an MPI program in C or C++ looks like this:

  1. Include the MPI header.
  2. Call MPI_Init near the start of the program.
  3. Query information about the MPI world, such as the number of processes and the rank of the current process.
  4. Perform computations and communications.
  5. Call MPI_Finalize before the program exits.

A minimal C program that follows this pattern is:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Set up the MPI environment; must precede any other MPI call. */
    MPI_Init(&argc, &argv);

    int world_size;
    int world_rank;

    /* Number of processes in MPI_COMM_WORLD and this process's rank. */
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    printf("Hello from rank %d of %d\n", world_rank, world_size);

    /* Shut down the MPI environment; no MPI calls are allowed afterwards. */
    MPI_Finalize();
    return 0;
}

The call to MPI_Init sets up the MPI environment and must be called before any other MPI function. The last call, MPI_Finalize, cleans up that environment. Between these two calls, the program has a fully working MPI world.

Each process runs the same program image. The key difference between them is the value of the rank that each process obtains from MPI_Comm_rank. This rank is an integer identifier inside a communicator. The predefined communicator MPI_COMM_WORLD contains all processes that were started together for the program. Later chapters will discuss how to create additional communicators for subgroups of processes.

A critical rule is that, except for a few explicitly designated routines, every MPI process must make calls to MPI routines in a consistent way. For example, if one process calls a collective routine on a communicator, all processes in that communicator must participate in that call. Violating this rule typically results in deadlock or program abort.
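As a minimal sketch of this rule, the fragment below could be dropped into the earlier hello world program after the rank has been obtained (it reuses world_rank from that example). Every process calls MPI_Bcast; guarding the call so that only rank 0 executed it would leave the other ranks waiting and typically deadlock the job.

/* All ranks in MPI_COMM_WORLD must reach this broadcast. */
int value = 0;
if (world_rank == 0) {
    value = 42;   /* only the root prepares the data */
}
/* Called by every rank: root 0 sends, the others receive. */
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);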

Every MPI program must call MPI_Init exactly once per process before any other MPI call, and must call MPI_Finalize exactly once per process after all MPI operations are complete.

Compiling MPI Programs

MPI programs are compiled using special wrapper compilers provided by the MPI implementation. These wrappers are not compilers themselves. Instead, they call an underlying compiler, such as gcc or icc, with the correct header and library flags.

Common wrapper commands are:

mpicc for C code.

mpicxx or mpiCC for C++ code.

mpifort or mpif90 for Fortran code.

The wrapper knows where the MPI headers and libraries are installed. This avoids manual specification of include directories and link flags. On an HPC cluster, the MPI wrappers are often provided through environment modules, so you may need to load an MPI module before compiling.

For the minimal example shown earlier, compilation might look like:

mpicc -O2 -o hello_mpi hello_mpi.c

The -O2 flag enables standard optimization by the underlying compiler. Debug flags, optimization flags, and other options are passed through the wrapper and handled by the real compiler as usual.

If multiple MPI implementations are installed, separate wrappers distinguish them. For example, on some systems you might see both mpicc from Open MPI and mpicc from MPICH, controlled by which module is loaded.

Running MPI Programs

An MPI program is not run directly with the usual ./program command when you want parallel execution. Instead, you use an MPI launcher that starts many instances of the program and arranges them into a single MPI job. The most common launchers are mpirun and mpiexec, although launching may also be integrated with the cluster job scheduler.

A basic launch command for running on a workstation may look like:

mpirun -np 4 ./hello_mpi

Here -np 4 requests four processes. MPI will start four instances of ./hello_mpi. All four processes call MPI_Init, determine their rank and the size of the communicator, and print a message.
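With four processes, the output consists of one line per rank, for example as shown below. The order in which the lines appear is not deterministic, because the processes print independently.

Hello from rank 0 of 4
Hello from rank 1 of 4
Hello from rank 2 of 4
Hello from rank 3 of 4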

On a cluster, you will usually not run mpirun directly on the command line of a login node. Instead, you place the mpirun command inside a job script that is submitted to the scheduler. The scheduler allocates nodes and cores, then launches mpirun on a suitable set of compute nodes. The exact interaction between mpirun and the scheduler depends on the MPI implementation and cluster configuration. Later chapters on job scheduling and MPI processes will give details and best practices for that environment.

A typical pattern is that all processes start executing the same code but make decisions based on their rank. This is called single program, multiple data, often abbreviated SPMD. For example, a program may assign different sections of an array to each process based on the rank. Communication then moves the necessary boundary data between neighboring processes using point to point calls.
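A hedged sketch of this idea, reusing world_rank and world_size from the hello world example and assuming a global array size N that divides evenly among the processes, might look like this:

/* Split N elements among the ranks; each process computes its own range. */
const int N = 1000;                 /* example global problem size */
int chunk = N / world_size;         /* assumes N is divisible by world_size */
int start = world_rank * chunk;     /* first index owned by this rank */
int end   = start + chunk;          /* one past the last index owned */
/* ... work on elements [start, end); boundary data would later be
   exchanged with neighboring ranks via point to point calls ... */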

Basic MPI Concepts Introduced Informally

Although detailed treatment of MPI processes, message types, and communication routines appears in subsequent chapters, it is helpful to introduce the basic ideas informally here.

First, an MPI message is a unit of data sent from one process to another. A message has a source rank, a destination rank, a communicator, a tag, a data buffer, and a datatype. The buffer and datatype together describe which bytes in memory should be sent or received.

Second, MPI programs are organized around communicators. A communicator defines a group of processes that can communicate with each other. The most widely used communicator is MPI_COMM_WORLD, which contains all initially started processes. Communicators provide both scope and context for messages. This helps avoid interference between different parts of a program or between library and application communications.

Third, MPI identifies processes within a communicator using integer ranks from 0 to p - 1, where p is the number of processes in that communicator. The rank is local to the communicator. A process that has rank 0 in MPI_COMM_WORLD does not necessarily have rank 0 in a different communicator. This distinction becomes important for large and modular codes that use several groups of processes.

Finally, MPI provides a rich collection of communication patterns. Point to point communication moves data between specific pairs of processes. Collective communication involves many or all processes in a communicator and implements operations like broadcast, reduction, and gather. Choosing between these patterns and applying them appropriately is central to designing efficient MPI programs.
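To make the message anatomy above concrete, here is a small, hedged fragment in which rank 0 sends one integer to rank 1. It reuses the world_rank variable from the earlier example and assumes at least two processes; the arguments spell out the buffer, count, datatype, destination or source rank, tag, and communicator.

/* Rank 0 sends a single MPI_INT with tag 0 to rank 1 in MPI_COMM_WORLD. */
int data = 7;
if (world_rank == 0) {
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}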

Error Handling and Return Codes

Most MPI routines return an integer error code. By convention, a value of MPI_SUCCESS indicates that the operation completed successfully. For beginners, it is easy to ignore return codes, especially since many example codes in tutorials do so. In more robust applications, you should check these return values and handle errors gracefully where possible.

MPI also provides mechanisms for custom error handlers attached to communicators. These handlers decide what happens when an error occurs. For simple programs, the default handler often aborts the entire job on error. Although harsh, this behavior avoids inconsistent states when communication fails.
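As a hedged sketch of both ideas, the fragment below switches MPI_COMM_WORLD to the predefined MPI_ERRORS_RETURN handler, checks the return code of a broadcast, and aborts explicitly on failure. It assumes an int variable named value and that stdio.h has been included.

/* Ask MPI to return error codes on MPI_COMM_WORLD instead of aborting. */
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

int err = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (err != MPI_SUCCESS) {
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(err, msg, &len);       /* turn the code into readable text */
    fprintf(stderr, "MPI_Bcast failed: %s\n", msg);
    MPI_Abort(MPI_COMM_WORLD, err);         /* terminate all ranks explicitly */
}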

For introductory learning and experimentation, it is acceptable to rely on the default error behavior and ignore return values. However, it is useful to know that MPI routines can report errors and that more sophisticated handling is available for production codes.

Common First Examples

Introductory MPI code examples usually demonstrate a few standard patterns that illustrate the basic mechanics of message passing:

A "hello world" program, as shown above, which prints the rank and size of the MPI world.

A ring communication pattern, where each process sends a value to the next rank and receives from the previous one. This illustrates point to point communication and wrap around logic with modular arithmetic.

A parallel sum or average of a list of numbers using collective communication. Each process holds part of the data. MPI collectives combine the local results into a global sum or average.
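As an illustration of the third pattern, here is a hedged sketch of a complete parallel sum and average. For simplicity each rank contributes a single value, its own rank number, rather than part of a larger data set; MPI_Reduce combines the contributions on rank 0.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes one local value; here simply its own rank. */
    double local = (double)rank;
    double total = 0.0;

    /* Sum all local values; the result is defined only on the root, rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Global sum = %g, average = %g\n", total, total / size);
    }

    MPI_Finalize();
    return 0;
}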

These examples are kept small on purpose. They demonstrate the SPMD structure, show how ranks are used to make decisions, and provide a first impression of how explicit communication replaces implicit shared memory. In real codes, the data structures are much larger, and communication patterns are chosen to match the problem domain and the physical layout of the cluster.

Portability and Performance Considerations

A key advantage of MPI is its portability. Correct MPI code that adheres to the standard will usually compile and run with different MPI implementations and on different platforms. The MPI standard is designed so that simple programs that use only core functionality remain stable across MPI versions.

At the same time, MPI leaves room for implementation specific performance optimizations. Different MPI libraries may exploit hardware features in different ways, such as specialized network hardware or shared memory optimizations for processes on the same node. Many MPI implementations provide environment variables or configuration flags that tune performance, such as selecting communication protocols or buffer sizes.

For your first MPI programs, it is more important to write correct, clear code than to worry about these performance tuning options. Once a program is functionally correct and you have basic profiling information, you can begin to explore performance tuning, both at the algorithm level, such as rearranging communication patterns, and at the implementation level, such as selecting optimized MPI settings.

Summary

MPI provides a standardized library interface for distributed memory parallel programming. An MPI program consists of many processes that start together, call MPI_Init and MPI_Finalize, and communicate explicitly via messages. Processes are organized into communicators and identified inside each communicator by ranks. You compile MPI programs with wrapper compilers, such as mpicc, and run them using launchers such as mpirun or environment specific mechanisms integrated with job schedulers.

Later chapters will build on this introduction. They will explain MPI processes in more detail, introduce point to point and collective communication routines, describe communicators and process topologies, and explore how to avoid common performance pitfalls in real MPI applications.
