4.5.2 InfiniBand

Overview

InfiniBand is a high performance network technology that is widely used in HPC clusters to connect compute nodes. It is designed for very low latency and very high bandwidth communication between nodes, and it supports advanced features that are critical for scalable parallel applications. In this chapter the focus is specifically on InfiniBand as an interconnect technology, assuming that the general idea of interconnects and cluster components is already familiar from the parent chapters.

Key Characteristics of InfiniBand

InfiniBand is a high speed, switched fabric interconnect. Instead of connecting nodes directly in a simple bus or daisy chain, it uses switches that create a fabric of connections. Nodes are attached to this fabric through Host Channel Adapters, usually abbreviated as HCAs. An HCA is typically a PCIe card inside the node that provides InfiniBand ports and offloads many communication tasks from the CPU.

Two core performance metrics are latency and bandwidth. Latency is the time to send a small message from one node to another. Bandwidth is the volume of data that can be transferred per second. InfiniBand targets latencies on the order of microseconds and bandwidths up to hundreds of gigabits per second per link, depending on the generation and link width.
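
As a simple first-order model, the time to transfer a message is roughly the latency plus the message size divided by the bandwidth. The short C sketch below illustrates the idea with assumed, illustrative numbers (about one microsecond of latency and roughly 25 GB/s, corresponding to a 200 Gbit/s link); real values depend on the hardware generation and the software stack.

```c
#include <stdio.h>

/* First-order transfer-time model: t = latency + size / bandwidth.
 * The numbers below are illustrative assumptions, not measurements. */
int main(void) {
    const double latency_s = 1.0e-6;      /* assumed ~1 microsecond end-to-end latency */
    const double bandwidth_Bps = 25.0e9;  /* assumed ~25 GB/s, i.e. a 200 Gbit/s link */
    const double sizes[] = { 8.0, 64.0e3, 1.0e6, 1.0e9 };  /* message sizes in bytes */

    for (int i = 0; i < 4; i++) {
        double t = latency_s + sizes[i] / bandwidth_Bps;
        printf("%12.0f bytes -> %12.2f us\n", sizes[i], t * 1.0e6);
    }
    return 0;
}
```

For the 8-byte message the latency term dominates, while for the largest message the bandwidth term does, which is why the two metrics matter for different communication patterns.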

InfiniBand defines link speeds in generations, often named SDR, DDR, QDR, FDR, EDR, HDR, and NDR, each corresponding to a higher data rate. It also allows multiple lanes to be combined in one link, for example 1x, 4x, or 8x. A modern cluster might use, for example, HDR at 200 Gbit/s on a 4x link, or NDR at 400 Gbit/s. The exact numbers evolve with each generation, but the important idea is that InfiniBand offers significantly higher bandwidth and lower latency than the typical Ethernet used in commodity networks.

InfiniBand supports multiple transport services. Some of these guarantee reliable, in-order delivery, which is crucial for MPI-based applications, while others are optimized for very low latency or hardware offload.

InfiniBand is designed to provide very low latency and very high bandwidth through a switched fabric with HCAs on each node and specialized transport services that support reliable and efficient message passing.

InfiniBand in the HPC Cluster Node

On each compute node, InfiniBand connectivity is provided by an HCA. The HCA appears to the operating system somewhat like a network card, but it can do more than simply send and receive packets. It implements parts of the InfiniBand verbs interface in hardware, and it can offload message handling, connection management, and sometimes even memory operations.

The HCA connects to the CPU and memory through PCI Express. The peak throughput of the InfiniBand link can be limited by the PCIe bandwidth, so HPC systems are designed to match PCIe generation and width to the InfiniBand link capabilities. Placement of the HCA in the node is also important. On multi-socket systems, it matters which CPU socket the HCA is attached to. If traffic must cross between sockets inside the node to reach the HCA, there is an additional cost. Many MPI libraries and job schedulers are topology-aware and try to place tasks close to the HCA that will carry their network traffic.
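
To see why this matching matters, the rough comparison below contrasts approximate usable PCIe slot bandwidths with the data rate of a 200 Gbit/s HDR link. The per-lane figures are commonly quoted approximations and are only meant to illustrate the arithmetic.

```c
#include <stdio.h>

/* Rough comparison of PCIe slot bandwidth versus an InfiniBand link rate.
 * Approximate usable per-lane rates (after encoding overhead) are assumed:
 * PCIe Gen3 ~0.985 GB/s per lane, PCIe Gen4 ~1.969 GB/s per lane. */
int main(void) {
    const double pcie_gen3_x16 = 16 * 0.985;   /* ~15.8 GB/s */
    const double pcie_gen4_x16 = 16 * 1.969;   /* ~31.5 GB/s */
    const double hdr_link_GBps = 200.0 / 8.0;  /* 200 Gbit/s = 25 GB/s */

    printf("PCIe Gen3 x16: %.1f GB/s -> %s a 200 Gbit/s HDR link\n", pcie_gen3_x16,
           pcie_gen3_x16 >= hdr_link_GBps ? "can feed" : "cannot feed");
    printf("PCIe Gen4 x16: %.1f GB/s -> %s a 200 Gbit/s HDR link\n", pcie_gen4_x16,
           pcie_gen4_x16 >= hdr_link_GBps ? "can feed" : "cannot feed");
    return 0;
}
```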

InfiniBand also supports multiple ports per HCA. A node can be connected to multiple switches for redundancy or additional aggregate bandwidth. In practice, most beginner users just need to know that their node has an InfiniBand device that MPI and other libraries can use automatically, but it is useful to understand that there is physical hardware doing significant work under the hood.

InfiniBand Fabric Topologies

The InfiniBand switches and links form a fabric. How this fabric is wired affects the performance and scalability of the whole cluster. Common topologies in HPC include the fat tree and various forms of mesh or torus, with the fat tree being particularly typical for InfiniBand.

In a fat tree, nodes are connected to leaf switches, and those leaf switches are connected upward to spine switches. A properly built fat tree is often called non-blocking, which means that the upper layers of the network have enough capacity to support full-bandwidth communication between all node pairs. In practice, many clusters use slightly oversubscribed trees, which save hardware but can introduce contention if many nodes communicate at full rate simultaneously.
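
As a small illustration of oversubscription, consider a hypothetical leaf switch whose ports are split between node-facing downlinks and spine-facing uplinks; the numbers below are invented for the example. With equal per-port speeds, the ratio of downlinks to uplinks is the oversubscription ratio, and a non-blocking fat tree corresponds to a ratio of 1:1.

```c
#include <stdio.h>

/* Hypothetical leaf switch: 32 ports connected to nodes, 8 uplinks to spines. */
int main(void) {
    const int downlinks = 32;
    const int uplinks = 8;
    const int ratio = downlinks / uplinks;

    printf("oversubscription ratio: %d:1\n", ratio);
    printf("worst-case share of uplink capacity per node: 1/%d of a link\n", ratio);
    return 0;
}
```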

Some systems use dragonfly or custom topologies to reduce the number of hops between nodes or to reduce cabling complexity. Regardless of the exact layout, the network designer tries to minimize the hop count between communicating nodes, to balance the load across links, and to keep latency low and predictable.

For an application, the visible effect of topology usually appears as changes in communication performance when the node count or communication pattern changes. For example, an all-to-all communication on a cluster with an oversubscribed fabric may slow down more rapidly than on a non-blocking fabric.

InfiniBand Verbs and RDMA

At the software level, InfiniBand is exposed through the verbs API. Verbs are low level operations that allow software to create communication endpoints, post send and receive requests, and interact with completion queues to learn when data transfers finish. High level libraries such as MPI are built on top of verbs, so most users never call verbs directly, but the verbs model is important for understanding key InfiniBand capabilities.
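
The sketch below shows, in heavily abridged form, how a program using the verbs API (libibverbs) sets up its basic resources: open a device, allocate a protection domain, create a completion queue, and create a queue pair. Error handling and the connection setup steps (exchanging addresses and moving the queue pair through its states) are omitted, so this illustrates the object model rather than being a complete program.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    /* Discover InfiniBand devices and open the first one found. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no InfiniBand device found\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);

    /* A protection domain groups resources that are allowed to work together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* A completion queue reports when posted work requests have finished. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* A queue pair (send queue plus receive queue) is the communication endpoint.
     * IBV_QPT_RC selects the Reliable Connected transport. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("created queue pair number 0x%x\n", qp->qp_num);

    /* In a real program the queue pair would now be connected to a remote
     * queue pair and transitioned through its INIT/RTR/RTS states. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```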

A central feature of InfiniBand is Remote Direct Memory Access, or RDMA. RDMA allows one node to read from or write to the memory of another node without involving the remote CPU in the data movement path. The remote side must explicitly register the memory and grant permission, so this is not uncontrolled access, but once established, RDMA operations can be very efficient.

In a typical send and receive operation over a conventional network, data is copied from user memory into kernel buffers, then into network interface buffers, and a similar sequence occurs on the receive side. Each copy and context switch adds overhead. With RDMA, the HCA can move data directly between application buffers on the two nodes, often bypassing the kernel and avoiding copies.
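
The fragment below sketches the two ingredients of an RDMA write under the verbs API: the buffer is first registered with the HCA, which yields local and remote keys, and the initiator then posts a work request naming the remote address and remote key. It assumes a protection domain and a connected queue pair already exist, and that the remote address and rkey were exchanged out of band, for example over a socket; those steps are not shown.

```c
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Assumed to exist already: a protection domain, a connected RC queue pair,
 * and the remote buffer address and rkey obtained out of band. */
extern struct ibv_pd *pd;
extern struct ibv_qp *qp;
extern uint64_t remote_addr;
extern uint32_t remote_rkey;

int rdma_write_example(void) {
    size_t len = 4096;
    void *buf = malloc(len);

    /* Register the local buffer so the HCA may access it directly. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

    /* Describe the local data and post an RDMA write work request. */
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id = 42,
        .sg_list = &sge,
        .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE,      /* data lands directly in remote memory */
        .send_flags = IBV_SEND_SIGNALED,  /* ask for a completion entry */
    };
    wr.wr.rdma.remote_addr = remote_addr; /* where to write on the remote node */
    wr.wr.rdma.rkey = remote_rkey;        /* permission key for that region */

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr); /* the HCA moves the data from here on */
}
```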

This is particularly useful for MPI one-sided communication and for high performance storage solutions. Many HPC storage stacks and parallel file systems are built on top of RDMA to achieve high throughput and low CPU overhead.

Remote Direct Memory Access (RDMA) allows an InfiniBand HCA to read or write remote memory directly, with minimal CPU involvement, which reduces copies, context switches, and latency.

Queue Pairs, Completion Queues, and Work Requests

The verbs API is built around several core objects: queue pairs, completion queues, and work requests. A queue pair consists of a send queue and a receive queue. To communicate, a process creates a queue pair on its HCA, associates it with a remote queue pair, and then posts work requests to these queues.

A work request is a description of a communication or RDMA operation. For example, it can specify that a certain memory region should be sent to a remote endpoint, or that a remote memory region should be written using RDMA write. The HCA processes these work requests asynchronously.

Completion queues are used to report when work requests have completed. Software polls a completion queue or waits for a notification to know when a data transfer has finished. This asynchronous model allows the CPU to overlap computation and communication. While the HCA moves data in the background, the CPU can execute other work, only checking back for completion when necessary.
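
A minimal sketch of the completion side, assuming a completion queue on which a signaled work request (such as the RDMA write above) has already been posted: the CPU polls the queue and can interleave its own work while the HCA carries out the transfer.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Assumed to exist: a completion queue with an outstanding signaled work request,
 * and a placeholder function representing useful CPU work. */
extern struct ibv_cq *cq;
extern void do_some_computation(void);

int wait_for_completion(void) {
    struct ibv_wc wc;
    int n;

    /* Poll until one completion entry is available. Between polls the CPU is
     * free to compute, which is how communication and computation overlap. */
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0) {
        do_some_computation();
    }
    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "work request %llu failed: %s\n",
                (unsigned long long)wc.wr_id, ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0; /* the transfer identified by wc.wr_id has finished */
}
```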

MPI implementations on InfiniBand rely on these objects internally. MPI point-to-point and collective operations are mapped to sequences of work requests on queue pairs. The user sees a simple API such as MPI_Send and MPI_Recv, but under the surface the InfiniBand verbs and the HCA handle the low-level details.

Transport Types and Reliability

InfiniBand defines several transport types to support different kinds of communication. Reliable Connected transport establishes a connection between two queue pairs and provides reliable, in-order delivery. This is a common choice for MPI communication since it closely matches the semantics required by many MPI operations.

Reliable Datagram and Unreliable Datagram transports support different trade-offs between reliability, connection management, and scalability. Datagrams do not require a dedicated connection per pair of endpoints, which can be useful in very large systems, but they require more careful handling by software. The choice of transport is visible in the verbs API, as the sketch after this section shows.
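
In the verbs API the transport type is chosen when the queue pair is created. The snippet below contrasts the two settings most commonly seen in practice: IBV_QPT_RC for Reliable Connected and IBV_QPT_UD for Unreliable Datagram (a Reliable Datagram queue pair type is defined by the InfiniBand specification but is rarely exposed by common implementations).

```c
#include <infiniband/verbs.h>

/* The transport is selected at queue pair creation time. */
struct ibv_qp_init_attr make_qp_attr(struct ibv_cq *cq, int reliable_connected) {
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        /* Reliable Connected: one connection per peer, reliable in-order delivery.
         * Unreliable Datagram: connectionless and scalable to many peers, but
         * software must handle drops and messages are limited to one MTU. */
        .qp_type = reliable_connected ? IBV_QPT_RC : IBV_QPT_UD,
    };
    return attr;
}
```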

InfiniBand uses techniques such as end-to-end credits, acknowledgements, and retransmissions in hardware to enforce reliability in the reliable transports. This means that the HCA ensures that data either arrives correctly or the operation is flagged as failed, without the CPU needing to manage the details of lost or corrupted packets.

From a user viewpoint, the main benefit is that MPI and other high level libraries can depend on the fabric to provide strong reliability guarantees, which simplifies parallel programming. The trade off is that full reliability in hardware can sometimes limit maximum scalability or performance, so hardware vendors and MPI implementers tune these mechanisms carefully.

InfiniBand and MPI Performance

InfiniBand and MPI are closely associated in HPC. Most MPI libraries provide optimized InfiniBand support, taking advantage of RDMA and verbs to reduce latency and CPU usage. The performance gains are visible in both small message and large message scenarios, but they appear differently.

For small messages, the key metric is latency. InfiniBand can deliver very low end-to-end latencies, which benefits fine-grained communication patterns where processes exchange frequent small messages. For large messages, bandwidth dominates. InfiniBand links can sustain very high throughput, which allows large data transfers, such as halo exchanges or collective operations, to complete quickly.

The effective bandwidth and latency seen by an MPI program depend on several factors. These include link speed and width, the number of hops in the fabric between communicating ranks, how busy the fabric is with other traffic, and how well the MPI implementation uses RDMA and connection management. Oversubscription of the network, where the traffic offered by active flows exceeds the available link capacity, can cause contention and increase both latency and transfer time.
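
These effects can be measured directly with a small ping-pong benchmark. The sketch below times round trips between two MPI ranks for a few message sizes and reports the half round-trip time and effective bandwidth; it is a simplified version of what standard tools such as the OSU micro-benchmarks do.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal ping-pong between rank 0 and rank 1: small messages expose latency,
 * large messages expose bandwidth. Run with exactly two ranks, one per node,
 * so that the interconnect rather than shared memory is measured. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int sizes[] = { 8, 4096, 1 << 20 };  /* message sizes in bytes */
    char *buf = malloc(1 << 20);

    for (int s = 0; s < 3; s++) {
        int n = sizes[s];
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double half_rt = (MPI_Wtime() - t0) / (2.0 * iters);
        if (rank == 0)
            printf("%8d bytes: %10.2f us, %10.2f MB/s\n",
                   n, half_rt * 1.0e6, n / half_rt / 1.0e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```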

Users can sometimes influence MPI over InfiniBand behavior through environment variables, such as selecting specific communication protocols or eager vs rendezvous thresholds for message sizes. However, for beginners the main point is to understand that MPI applications are very sensitive to interconnect performance and that InfiniBand is designed to serve exactly this use case.

Physical Layer and Cabling

On the physical level, InfiniBand uses copper or optical cables with specific connectors, typically QSFP variants. Copper cables are common for short distances within a rack or between nearby racks. Optical cables are used for longer distances because they preserve signal quality over tens of meters or more.

Link speed and cable type affect signal integrity and maximum distance. Higher speed links are more sensitive to signal loss and noise. Cluster designers select cabling carefully to maintain error free operation at the required speeds. Errors at the physical layer can cause retransmissions and reduce effective throughput, even if the link appears to be up.

InfiniBand also defines link training and auto negotiation, which allow devices to negotiate common link parameters when a cable is plugged in. If a cable or port cannot support the full designed speed, the link may fall back to a lower speed. From a performance perspective, a single slow link in an important path can act as a bottleneck and degrade communication for many nodes.

Management, Partitions, and Quality of Service

InfiniBand fabrics are managed by a subnet manager. The subnet manager discovers the topology, assigns addresses known as Local Identifiers (LIDs), computes routing tables, and configures fabric parameters. This management function can run on a dedicated management node, on a switch, or on a regular host.

InfiniBand supports partitions, identified by partition keys (PKeys), which act somewhat like VLANs in Ethernet. Partitions allow the administrator to logically separate traffic from different users or projects, even though they share the same physical fabric. Only nodes in the same partition can communicate, unless specific exceptions are configured. This improves security and can help control interference between workloads.

Quality of Service features allow the network to differentiate traffic classes. Some MPI jobs or storage traffic can be given higher priority or reserved bandwidth, while background traffic is handled with lower priority. This can be important when running mixed workloads on a shared cluster, such as interactive analysis jobs and large batch simulations at the same time.

From a user perspective, partitions and QoS usually appear indirectly. For example, jobs in certain queues may automatically be placed into specific InfiniBand partitions. Interactive users might find that certain nodes have different connectivity or performance patterns depending on fabric configuration.

InfiniBand and Storage

InfiniBand is used not only for MPI communication between nodes, but also for high performance storage access. Protocols such as IP over InfiniBand (IPoIB) allow TCP/IP traffic to flow over the InfiniBand fabric. More importantly for HPC, RDMA-based storage protocols such as SRP or NVMe over Fabrics can use InfiniBand to provide high throughput, low latency access to parallel file systems and block devices.

In a typical HPC environment, compute nodes access a parallel file system through InfiniBand to achieve much higher aggregate bandwidth than would be possible through slower Ethernet links. The use of RDMA in the storage stack reduces CPU usage on both compute nodes and storage servers, allowing more cycles for computation and I/O processing.

Although the details of parallel file systems belong to other chapters, it is useful to understand that InfiniBand can carry both MPI traffic and storage traffic. Administrators must design the fabric so that storage and MPI workloads do not interfere excessively. Techniques such as QoS or separate fabrics for storage and MPI are used when necessary.

Troubleshooting and Practical Considerations

For day-to-day use, most beginners will not manage InfiniBand directly, but some basic awareness can help diagnose performance problems. Common symptoms of InfiniBand issues include unexpectedly high communication times, MPI jobs that stall during collective operations, or variable job runtimes even on identical node counts.

Common causes include misconfigured or failing cables, ports that negotiated down to a lower speed, fabric congestion due to oversubscribed topologies or heavy competing workloads, and incorrect binding of processes to cores and HCAs. Monitoring tools can report link errors, retransmissions, or port speeds, which guide administrators in identifying hardware or configuration problems.

From the application side, poorly designed communication patterns can exacerbate any network limitations. For example, many ranks sending data simultaneously to a single rank can create hotspots in the fabric. Performance chapters in this course will discuss algorithmic strategies to reduce such contention, but it is useful to remember that InfiniBand is fast yet still finite, and that communication patterns matter.

Effective use of InfiniBand depends on both hardware configuration and application communication patterns. Miswired or oversubscribed fabrics and poorly balanced communication can severely reduce the performance benefits of InfiniBand.

Summary

InfiniBand is a specialized interconnect technology that provides the low latency, high bandwidth, and advanced communication features required for parallel applications on HPC clusters. It uses HCAs in each node, a switched fabric with carefully chosen topologies, and a verbs based API that enables RDMA and efficient message passing. InfiniBand integrates closely with MPI and with high performance storage systems, and its configuration and health have a direct impact on the performance and scalability of HPC workloads. Understanding the basic principles of InfiniBand helps users reason about communication performance, interpret observed behavior of parallel applications, and appreciate the role of the interconnect in the overall design of HPC systems.
