Point-to-point communication

Overview

Point-to-point communication is the most basic way for distributed-memory programs to exchange data. In a point-to-point operation, exactly one process sends a message and exactly one process receives that message. Everything else in MPI builds on this idea. Collective communications, process topologies, and many higher-level patterns are constructed from combinations of point-to-point operations.

In this chapter we focus on how point-to-point messaging works conceptually, how the main MPI routines behave, and what kinds of bugs and performance issues appear in practice. We assume you already understand the general idea of distributed memory and MPI from the parent chapter, so here we specialize to the mechanics and patterns of direct process-to-process communication.

Messages and their components

In a distributed memory program, a message is not just “some bytes sent over the network.” MPI associates several pieces of information with every point to point message. You need to understand these to write correct code and to reason about behavior.

A point-to-point message in MPI is characterized by:

  1. The communicator, often MPI_COMM_WORLD in simple examples.
  2. The source rank and destination rank within that communicator.
  3. The message tag, an integer that helps distinguish different types of messages.
  4. The data buffer that holds the payload.
  5. The datatype that describes the layout of that payload.
  6. The count, which tells MPI how many elements of the given datatype the buffer contains.

The communicator defines the group of processes that can talk to each other and the communication context. The rank numbers are only meaningful inside that communicator. The tag lets a process receive some messages and ignore others that are not relevant at that moment. Tags are very useful for structuring protocols when multiple different messages are in flight between the same pair of processes.

The buffer, datatype, and count together let MPI move the payload. For simple codes, the datatype is often one of the predefined types such as MPI_INT or MPI_DOUBLE. More complex layouts can be described with derived datatypes, which are covered elsewhere. For now, it is enough to think of the triple (buffer pointer, datatype, count) as describing a contiguous array of values to be sent or received.

MPI matches messages using the communicator, the source rank, and the tag; the destination rank is implicit, because the receive is posted by the destination process itself. Matching happens one posted receive at a time. If you call a receive that expects a message with a certain source and tag, the MPI implementation will scan its internal queue of messages that have already arrived. If it finds one that matches, it will deliver it. If not, it will wait until such a message arrives.

Important rule for MPI message matching
A receive operation matches a send operation only if both use the same communicator and the receive’s source and tag specifications match the sending rank and the message tag (wildcards, discussed later, match any value). If any of these disagree, the message is not a match, and communication will not complete as intended.

Basic MPI point-to-point routines

MPI provides several flavors of send and receive. The core idea is always the same, but their completion semantics differ and this strongly affects both correctness and performance. The most important routines are:

MPI_Send and MPI_Recv implement standard blocking communication.
MPI_Isend and MPI_Irecv implement nonblocking communication.
MPI_Ssend, MPI_Rsend, and MPI_Bsend implement special send modes with different guarantees.

In this chapter we focus on how these behave at a conceptual level, not on full API details.

Standard blocking send and receive

The simplest operations are the blocking send and receive, typically used as:

MPI_Send(buffer, count, datatype, dest, tag, comm);
MPI_Recv(buffer, count, datatype, source, tag, comm, &status);

Both calls block the calling process until the operation has completed, but “completed” means something slightly different for each. MPI_Recv returns only after the matching message has been fully received and stored in the user buffer. At that point you are free to use the data.
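To make this concrete, here is a minimal sketch of a complete program in which rank 0 sends a single double to rank 1. It assumes the job is launched with at least two processes, for example with mpiexec -n 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;
        /* Blocking send: returns once 'value' may safely be reused. */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: returns once the data has arrived in 'value'. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}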

The behavior of MPI_Send is more subtle. The MPI standard allows the implementation to choose different internal protocols based on message size and system details. For small messages, MPI might copy the data into an internal buffer and return immediately, even if the receiver has not yet posted a matching MPI_Recv. For large messages, MPI may wait until the receiver is ready, in order to transfer data directly from sender buffer to receiver buffer without extra copies.

This flexibility is important for performance, but it has a serious implication for correctness. You cannot safely assume that a blocking send always returns immediately, or that it always blocks until the matching receive is posted. In particular, if two processes both perform blocking sends to each other before any receive, the program can easily deadlock. This is a central practical concern when you design point-to-point communication patterns.

Nonblocking send and receive

Nonblocking communication separates initiation from completion. The routines MPI_Isend and MPI_Irecv start a communication but return immediately, even if the operation has not finished. Later, you complete the operation with a wait or test.

The basic pattern looks like:

MPI_Isend(sendbuf, count, datatype, dest, tag, comm, &request);
// do useful work while the data is in transit
MPI_Wait(&request, &status);

The request object represents the pending operation. You can pass it to MPI_Wait, MPI_Test, or their variants (for multiple requests) to ensure completion. The key point is that both the send and receive can progress in the background while the process is free to compute or initiate other communication.

Nonblocking operations give you much more control over ordering and allow overlap of communication and computation. They also help avoid deadlocks, because you can post receives early without blocking, then initiate sends, then wait later.

The matching rules for nonblocking operations are exactly the same as for blocking ones. The difference lies purely in when the function returns and when the data is safe to reuse. For a nonblocking send, you must not modify or free the send buffer until the corresponding wait has completed. For a nonblocking receive, you must not read from the receive buffer until completion is confirmed.

Crucial correctness rule for nonblocking communication
You must not access a buffer involved in a nonblocking send or receive until the associated operation has completed. For sends, do not modify the buffer. For receives, do not read from the buffer. Always use MPI_Wait or a related routine to guarantee completion before using the data.
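As one way to make the completion step concrete, the following sketch polls a pending send with MPI_Test and interleaves independent computation; do_independent_work is a hypothetical placeholder, and sendbuf, count, dest, tag, and comm are assumed to be set up as in the fragments above.

MPI_Request request;
MPI_Status status;
int done = 0;

/* Start the send; 'sendbuf' must not be modified until completion. */
MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, tag, comm, &request);

/* Poll for completion while doing work that does not touch 'sendbuf'. */
while (!done) {
    MPI_Test(&request, &done, &status);
    if (!done) {
        do_independent_work();  /* hypothetical computation */
    }
}
/* From here on, 'sendbuf' may be modified again. */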

Send and receive modes

MPI defines several send modes that affect when a send operation can complete. Standard mode uses MPI_Send. In addition to that, you may see:

Synchronous send, using MPI_Ssend. The send completes only when the matching receive has started.
Buffered send, using MPI_Bsend. The send can complete early if there is user-provided buffer space.
Ready send, using MPI_Rsend. The send assumes that a matching receive is already posted.

Although beginners usually use only standard send, understanding the others clarifies how MPI is allowed to behave and why deadlocks occur.

Standard mode

Standard mode, MPI_Send, allows the implementation to choose between internal buffering and rendezvous protocols. If a small message fits into an internal buffer, the send may copy the data and return. If the message is large, the send may participate in a handshake with the receiver so that data moves directly between user buffers.

As a result, blocking behavior is not uniform across systems or message sizes. A program that “happens to work” for small messages may deadlock when message sizes grow, because the implementation switches from a buffered protocol to a rendezvous protocol. You should not rely on any particular buffering behavior of MPI_Send.

Synchronous mode

Synchronous send makes completion behavior explicit. MPI_Ssend returns only after the matching receive has begun. In effect, the sender blocks until it knows the receiver has posted the corresponding receive. This mode is very useful for debugging, because it exposes deadlocks more reliably. It also avoids unexpected buffering on the sender side, which can be important when you want to limit memory use.

With synchronous sends, if two processes both call MPI_Ssend to each other without first posting receives, the program will deadlock. The situation is very visible, which helps identify ordering issues in your communication pattern.

Buffered and ready modes

Buffered sends, MPI_Bsend, rely on a user-provided buffer attached with MPI_Buffer_attach. Data is copied into this buffer, then the call returns. The real network transfer happens asynchronously when the receiver posts a matching receive. Buffered sends can be helpful in some special cases, but they complicate memory management and are rarely needed in modern practice.
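As an illustration, here is a sketch of the buffered send lifecycle; data, dest, tag, and comm are assumed to be declared elsewhere, and <stdlib.h> is needed for malloc and free. The attached buffer must include MPI_BSEND_OVERHEAD extra bytes of bookkeeping space.

/* Reserve space for one message of 1000 doubles plus MPI overhead. */
int bufsize = 1000 * sizeof(double) + MPI_BSEND_OVERHEAD;
void *attached = malloc(bufsize);

/* Hand the buffer to MPI; MPI_Bsend copies outgoing data into it. */
MPI_Buffer_attach(attached, bufsize);

MPI_Bsend(data, 1000, MPI_DOUBLE, dest, tag, comm);

/* Detach blocks until all buffered messages have been delivered. */
MPI_Buffer_detach(&attached, &bufsize);
free(attached);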

Ready sends, MPI_Rsend, are only valid if a matching receive is already posted. If that condition is not met, the behavior is undefined. This is extremely dangerous for beginners and offers only marginal performance benefits, so ready-mode sends are typically avoided in introductory and even many production codes.

Blocking semantics and deadlocks

Blocking point-to-point operations influence both the structure of your program and its liveness. The main risk is deadlock, where processes wait on each other indefinitely. Understanding typical deadlock patterns helps you design safer communication.

A simple deadlock scenario is symmetric blocking send:

Process 0 calls MPI_Send to process 1, then MPI_Recv from process 1.
Process 1 calls MPI_Send to process 0, then MPI_Recv from process 0.

If the MPI implementation uses a rendezvous protocol for these messages, both processes will block on their sends, because neither side has yet posted the matching receive. Since both are stuck, the program never progresses.

The same pattern can appear with MPI_Ssend, and even with MPI_Send when messages are large enough. Sometimes this pattern appears to “work” on a particular machine with small message sizes, which is misleading. The correctness of a program should never depend on the internal buffering behavior of the MPI library.

To avoid such deadlocks, you can:

Order sends and receives differently on the two sides, so that one receives first while the other sends.
Use nonblocking communication, posting all receives before sends and then waiting.
Use MPI_Sendrecv, which combines a send and a receive into one call with defined matching behavior (see the sketch below).

Practical anti-deadlock rule
Do not rely on symmetric blocking sends between two processes. Either post receives first, use nonblocking operations, or use MPI_Sendrecv. If your correctness depends on small messages being buffered, your code is fragile and may deadlock on other systems or larger problem sizes.
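For example, a pairwise exchange between ranks 0 and 1 can be written deadlock free with a single call. In this sketch, rank holds the caller's rank and local_value is assumed to contain the data to exchange.

int partner = (rank == 0) ? 1 : 0;
double outgoing = local_value;   /* assumed to be defined by the application */
double incoming;

/* Send and receive in one call; MPI orders the two halves safely. */
MPI_Sendrecv(&outgoing, 1, MPI_DOUBLE, partner, 0,
             &incoming, 1, MPI_DOUBLE, partner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);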

Tags, sources, and flexible matching

Tags and source specifications are powerful tools to control how messages are matched. MPI allows you to explicitly specify an expected source rank and tag, or to accept a wider range of possibilities.

When you call MPI_Recv, you can request a specific source rank and tag, or you can use the wildcard values MPI_ANY_SOURCE and MPI_ANY_TAG. This can simplify protocols where a process can receive messages from many senders, or where you are not sure in which order messages will arrive.

A typical flexible receive might look like:

MPI_Recv(buffer, count, datatype, MPI_ANY_SOURCE, tag, comm, &status);

Here you accept messages from any sender, but still require a specific tag. After the receive completes, you can inspect the status object to see the actual source. This pattern is useful for workpool implementations or irregular applications where tasks are dynamically assigned.
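A short sketch of this pattern, using the same placeholder arguments as the fragment above:

MPI_Status status;

MPI_Recv(buffer, count, datatype, MPI_ANY_SOURCE, tag, comm, &status);

/* The status object records which rank actually sent the message. */
int actual_source = status.MPI_SOURCE;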

You can also use MPI_ANY_TAG when any tag is acceptable. However, using wildcards too liberally makes reasoning about matching harder. If multiple messages could satisfy a given wildcard receive, MPI is free to choose any one of them, and that choice may depend on low level timing differences.

A consistent tagging scheme is important. For example, you might use one tag for boundary data exchanges and another for control messages. This lets you ensure that data intended for a particular purpose is not accidentally consumed by an unrelated receive.

Status objects and message information

Each receive in MPI can produce a status object. This object records information about the particular message that was matched:

The source rank that actually sent the message.
The tag of the received message.
An error code that applies to the communication.

For some applications, you also need to know how many elements were actually received, especially when you posted a receive with a count that is larger than the incoming message. You can obtain the actual count using:

MPI_Get_count(&status, datatype, &received_count);

This is vital if you use wildcard receives or variable length messages. The typical pattern is to post a receive with a buffer large enough to hold the maximum possible message, then after the receive completes call MPI_Get_count to determine the actual message size. You may then process only the meaningful part of the buffer.
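A sketch of that pattern, where MAX_ELEMS is an assumed upper bound on the message length:

#define MAX_ELEMS 1024   /* assumed maximum message size */

double buffer[MAX_ELEMS];
MPI_Status status;
int received_count;

/* Post a receive large enough for the biggest possible message. */
MPI_Recv(buffer, MAX_ELEMS, MPI_DOUBLE, source, tag, comm, &status);

/* Ask how many elements actually arrived. */
MPI_Get_count(&status, MPI_DOUBLE, &received_count);

/* Only buffer[0] .. buffer[received_count - 1] are meaningful. */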

In some designs, you first probe for the message to discover its size, then allocate an appropriately sized buffer and finally receive the message. Probing mechanisms belong to the broader topic of message management and are treated elsewhere, but the idea is that you can separate discovery of message properties from actual reception.

Common communication patterns

Point-to-point communication can be composed into higher-level patterns that occur very frequently in distributed programs. These patterns are fundamental building blocks for both scientific simulations and data processing applications.

Neighbor exchanges

Many algorithms use a domain decomposition where each process owns a subregion of the total domain, along with an overlap region to exchange with neighbors. In a one-dimensional decomposition, each process typically communicates with its left and right neighbors. In two or three dimensions, a process typically has four or six face neighbors, respectively.

A simple neighbor exchange can be organized as follows:

Each process sends its boundary data to its left neighbor and receives from the right, then sends to the right and receives from the left. If you use blocking operations, you must order them carefully to avoid deadlock. A safer approach is to post nonblocking receives for both neighbors, then start the nonblocking sends, and then wait for all of them to complete, as in the sketch below.
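Here is a sketch of that nonblocking variant for a one-dimensional decomposition; left, right, n, and the four buffers are assumed to be set up by the surrounding code.

MPI_Request requests[4];

/* Post both receives first so neither neighbor can block on us. */
MPI_Irecv(recv_from_left,  n, MPI_DOUBLE, left,  0, comm, &requests[0]);
MPI_Irecv(recv_from_right, n, MPI_DOUBLE, right, 1, comm, &requests[1]);

/* Then start both sends; the tags pair each direction with its receive. */
MPI_Isend(send_to_right, n, MPI_DOUBLE, right, 0, comm, &requests[2]);
MPI_Isend(send_to_left,  n, MPI_DOUBLE, left,  1, comm, &requests[3]);

/* Wait for all four operations before touching any of the buffers. */
MPI_Waitall(4, requests, MPI_STATUSES_IGNORE);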

MPI_Sendrecv simplifies neighbor exchanges for pairwise patterns, because it handles both directions at once with clear matching and less risk of mistakes. You specify a send destination and tag, a receive source and tag, and MPI internally ensures the correct ordering.

Ring communication

In a ring pattern, each process sends data to (rank + 1) modulo the communicator size and receives from (rank - 1) modulo the communicator size. This pattern is often used in simple benchmarks or for algorithms that require passing a token or data around all processes.

Ring communication is a straightforward example of how a single pattern of point-to-point sends and receives can connect the entire set of processes. Although collectives can express this more simply and perform better on real systems, understanding the point-to-point version is crucial for mastering MPI’s behavior.
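A minimal sketch of one step of such a ring, in which every process passes its own rank as a token to its successor:

int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int next = (rank + 1) % size;
int prev = (rank - 1 + size) % size;

int token_out = rank;   /* the payload is just the sender's rank here */
int token_in;

/* MPI_Sendrecv keeps the ring deadlock free even though every
   process sends at the same time. */
MPI_Sendrecv(&token_out, 1, MPI_INT, next, 0,
             &token_in,  1, MPI_INT, prev, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);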

Client-server style interactions

Another common pattern is client-server style interaction. A set of worker processes sends requests to a master process, which responds with data or tasks. The master uses receives with MPI_ANY_SOURCE to accept requests from any worker that becomes ready, while workers typically send their requests with a distinguished tag.

This pattern illustrates the importance of tags and status objects. By inspecting the source and tag in each received message, the master can distinguish between different message types and from which worker they originated, even when messages arrive in an unpredictable order.
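A sketch of the master side of such a protocol; the tag values and the integer task payload are invented for illustration.

#define TAG_REQUEST 1   /* hypothetical tag for worker requests */
#define TAG_WORK    2   /* hypothetical tag for task replies */

int dummy, task = 0, num_tasks = 100;
MPI_Status status;

while (task < num_tasks) {
    /* Accept a request from whichever worker becomes ready first. */
    MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
             MPI_COMM_WORLD, &status);

    /* Reply to the worker that actually asked. */
    MPI_Send(&task, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK,
             MPI_COMM_WORLD);
    task++;
}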

Ordering and message delivery guarantees

MPI provides particular guarantees about message ordering that are important for reasoning about correctness. These guarantees are local to pairs of processes and depend on matching conditions.

If one process performs two sends in order, both to the same destination, with the same tag and communicator, and the destination posts two receives that match both sends, then MPI guarantees that the messages are received in the order they were sent. This means you can rely on FIFO behavior for a given combination of source, destination, tag, and communicator.

However, the ordering guarantee is per sender. If the receiver posts receives for specific, different tags, the order in which it consumes the messages is determined by the order of its receives, not by the order of the sends. And if a receive uses MPI_ANY_SOURCE, messages from different senders may be matched in any order, depending on arrival timing. If your program logic depends on receiving a particular message first, you must express that requirement through tags and specific receives, not by relying on network order.

MPI does not guarantee any ordering between messages from different sources. When you design protocols, think in terms of explicit sequencing points: a receive that expects exactly one message, identified by a unique tag and source, acts as such a point.

Performance aspects of point-to-point communication

Performance of point-to-point operations depends on several factors, such as message size, communication pattern, and the degree of overlap with computation. While detailed performance analysis is covered elsewhere, a few key ideas are specific to point-to-point communication.

For small messages, latency dominates. Each message incurs a certain startup time, regardless of its payload size. This leads to a common performance model for the cost of a point-to-point message:

$$
T_{\text{message}} = \alpha + \beta \cdot n
$$

where $\alpha$ is the latency cost, $\beta$ is the cost per byte, and $n$ is the message size in bytes. This simple model illuminates why many tiny messages are often slower than fewer larger messages, even if the total data volume is the same.
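For illustration, suppose $\alpha = 2\,\mu\text{s}$ and $\beta = 10^{-4}\,\mu\text{s}$ per byte (roughly 10 GB/s); these numbers are invented for the example, and real values depend on the network. Sending one megabyte as a single message then costs about $2 + 10^{-4} \cdot 10^{6} = 102\,\mu\text{s}$, while sending the same data as 1000 messages of 1000 bytes costs $1000 \cdot (2 + 0.1) = 2100\,\mu\text{s}$, roughly twenty times more, purely because of the repeated latency term.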

For large messages, bandwidth becomes critical. The term $\beta \cdot n$ dominates the runtime. In this regime you should focus on moving data efficiently, perhaps by aligning buffers or organizing data so that it is contiguous and amenable to high throughput.

Point-to-point patterns also interact with network topology. If many processes send to the same destination at once, that destination process or its network links may become a bottleneck. Some algorithms try to spread communication over time or across the topology to reduce contention.

Nonblocking operations are a primary tool to hide communication cost. By initiating communications early and overlapping them with computation, you can reduce the visible time spent waiting on messages. You must analyze your algorithm to identify parts that do not depend on the results of outstanding communications and schedule that computation to occur while data is in transit.

Finally, the choice between point-to-point messages and collectives matters. Many operations that can be expressed as chains of point-to-point messages, such as reductions or broadcasts, are implemented more efficiently by collective routines. Those collectives often use algorithms that are finely tuned to the network and scale better than naive point-to-point implementations. Point-to-point communication remains essential for understanding MPI and for some specialized patterns, but for common global operations you should rely on collectives whenever possible.

Robust design of point-to-point protocols

Designing robust point-to-point communication protocols means thinking carefully about who sends what to whom, when, and under which tags. You should be able to answer the following questions for every message:

Which process sends it, and what triggers the send.
Which process is supposed to receive it, and how that receive is expressed.
What tag is used, and whether that tag is unique for this purpose.
Whether the sender or receiver needs confirmation that the operation took place.

A good protocol design often starts by sketching a timeline for each process and marking sends and receives. Check that every send has exactly one corresponding receive. Look for symmetric blocking patterns that could deadlock. Verify that tags are used consistently and unambiguously.

It is usually wise to keep message types few and clearly documented. For example, you might define that tag 0 is always used for initial data distribution, tag 1 for boundary exchanges, and tag 2 for final collection of results. Recording such conventions in comments makes it easier to maintain and debug the code.

When your application grows more complex, you might introduce separate communicators for different phases or roles. This isolates messages for one purpose from those for another and reduces the risk of mismatches. Even then, point-to-point communication remains the primitive that actually moves data between processes.

By mastering point-to-point communication, you gain the tools to express a wide variety of distributed algorithms. Higher-level constructs in MPI and other libraries often hide these details, but a solid understanding of how direct sends and receives behave is essential for debugging, performance tuning, and the design of new communication patterns.
