What Makes InfiniBand Different in HPC
InfiniBand is a high‑performance network technology widely used to connect nodes in HPC clusters. Within the broader topic of interconnects, InfiniBand is specifically designed for:
- Very high bandwidth between nodes
- Very low and predictable latency
- Offloading communication work from the CPU to a dedicated network adapter
- Scalable communication across thousands of nodes
For HPC users, you’ll mostly encounter InfiniBand indirectly—through job scripts, MPI runs, and filesystem access—rather than configuring it yourself. Understanding its basic properties helps explain why some options and performance behaviors look the way they do.
Core Components and Terminology
Host Channel Adapter (HCA)
On each compute node, InfiniBand connectivity is provided by a Host Channel Adapter:
- Analogous to a network interface card (NIC) in Ethernet, but more capable.
- Connects the node’s PCIe bus to the InfiniBand fabric.
- Handles many communication operations in hardware (offload), reducing CPU overhead.
- Often branded under names such as NVIDIA (formerly Mellanox) ConnectX adapters.
As a user, you might see references to InfiniBand devices in commands like ibstat or in MPI settings that refer to “HCAs” or specific device names (e.g., mlx5_0).
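If you are curious what sits beneath tools like ibstat, the short sketch below uses the verbs library (libibverbs) to enumerate the HCAs a node exposes and print their names, the same mlx5_0-style names that MPI settings refer to. It is illustrative only; it assumes the libibverbs development headers are installed and that you compile with something like gcc list_hcas.c -o list_hcas -libverbs.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int n = 0;
    /* Ask the verbs library for all RDMA-capable devices on this node. */
    struct ibv_device **devices = ibv_get_device_list(&n);
    if (!devices || n == 0) {
        fprintf(stderr, "no RDMA devices found (is the IB stack loaded?)\n");
        return 1;
    }
    for (int i = 0; i < n; i++) {
        /* Names typically look like mlx5_0, mlx5_1, ... on ConnectX-class HCAs. */
        printf("device %d: %s\n", i, ibv_get_device_name(devices[i]));
    }
    ibv_free_device_list(devices);
    return 0;
}
```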
Switches and Fabric
InfiniBand switches connect nodes to form a “fabric”:
- Fabric: the entire interconnected network of HCAs and switches.
- Typical topologies in HPC: fat-tree, dragonfly, or variants that aim for high bisection bandwidth.
- Multiple switches can be arranged hierarchically for large clusters.
From a user’s viewpoint, this matters because:
- Network contention and path length impact communication performance.
- System documentation may mention the fabric topology when describing expected scaling limits.
Queue Pairs and Verbs (Conceptual)
Communication in InfiniBand is based on queue pairs (QPs) and a low‑level API often called “verbs”:
- Queue pair: send queue + receive queue associated with an endpoint.
- Verbs: functions to post send/receive operations and manage QPs.
You typically do not program with verbs directly as a beginner; MPI and other libraries use them under the hood. But this model underlies InfiniBand’s efficiency and the advanced communication modes described next.
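For the curious, here is a rough sketch of what “creating a queue pair with verbs” looks like in C. It opens the first device and creates a reliable-connection (RC) queue pair; a real application (or an MPI library) would also register memory, exchange QP numbers with the remote side out of band, and move the QP through its states before posting sends and receives, all of which is omitted here along with error handling.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int n = 0;
    struct ibv_device **devices = ibv_get_device_list(&n);
    if (!devices || n == 0) return 1;

    struct ibv_context *ctx = ibv_open_device(devices[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* completion queue  */

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,   /* reliable connection; IBV_QPT_UD would be datagram */
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);  /* send queue + receive queue */
    if (qp) {
        printf("created QP number %u on %s\n",
               qp->qp_num, ibv_get_device_name(devices[0]));
        ibv_destroy_qp(qp);
    }

    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devices);
    return 0;
}
```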
Communication Modes Relevant to HPC
Reliable Connection (RC) vs Unreliable Datagram (UD)
InfiniBand can support several communication modes. HPC MPI implementations most commonly use:
- Reliable Connection (RC)
- Logical connection between a pair of endpoints.
- Guarantees in‑order, reliable delivery.
- Well-suited for typical MPI point‑to‑point communication.
- Unreliable Datagram (UD)
- No guarantee of delivery or order.
- Scales to more endpoints with less state, but less common in basic MPI usage.
For you as a user this is mostly invisible, but it helps explain that InfiniBand is not just “faster Ethernet”: it provides richer communication semantics that MPI can exploit.
RDMA: Remote Direct Memory Access
Remote Direct Memory Access (RDMA) is one of InfiniBand’s key features:
- A process can read from or write to memory on a remote node directly over the network.
- Data moves between memories without copying through the remote CPU.
- Reduces CPU overhead and latency compared to traditional send/receive stacks.
In HPC:
- Many MPI implementations use RDMA internally when possible.
- “One‑sided” communication models (e.g., MPI RMA, PGAS languages, some I/O libraries) may explicitly leverage RDMA‑like semantics.
You’re unlikely to enable RDMA explicitly as a beginner; instead, you benefit when using MPI libraries configured to exploit InfiniBand RDMA.
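As a concrete taste of one-sided communication, the sketch below uses standard MPI RMA calls (MPI_Win_create, MPI_Put, MPI_Win_fence) so that rank 0 writes a value directly into a window exposed by rank 1. On an InfiniBand cluster, an MPI library built on verbs or UCX can often service such operations with RDMA, though whether it actually does depends on the MPI build and configuration.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = -1;
    MPI_Win win;
    /* Expose one int per rank as an RMA window. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1) {
        /* Rank 0 writes directly into rank 1's window; the target CPU
           does not have to call a matching receive. */
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1) printf("rank 1 received %d via MPI_Put\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```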
Connection vs Connectionless Use
At scale, maintaining a separate connection (RC) between every pair of processes can be expensive. MPI libraries might:
- Use RC for frequently communicating pairs (e.g., near-neighbor exchanges).
- Use UD or other scalable schemes for global or all‑to‑all patterns.
Users might see tuning options in MPI documentation about “dynamic connections,” “connectionless InfiniBand,” or “on‑demand connection setup,” which are strategies to manage these trade‑offs.
Performance Characteristics
Bandwidth and Latency
InfiniBand is engineered for high bandwidth and low latency:
- Bandwidth
- Expressed in “Gb/s” for each link and direction.
- Typical line rates for a standard 4x link (older to newer generations): QDR (40 Gb/s), FDR (56 Gb/s), EDR (100 Gb/s), HDR (200 Gb/s), NDR (400 Gb/s).
- A port aggregates multiple lanes (1x, 4x, 8x, 12x); 4x is by far the most common width, and the rates above refer to it.
- Latency
- End‑to‑end message latency is typically a few microseconds or less for small messages, much lower than typical Ethernet-based solutions.
Implications for HPC applications:
- Collectives and fine-grained communications can scale better.
- Algorithms that rely on frequent small messages benefit significantly from InfiniBand.
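To make the latency/bandwidth split concrete, the toy calculation below applies a simple “latency plus size over bandwidth” model. The numbers used (about 1.5 µs end-to-end latency and roughly 25 GB/s of usable bandwidth per direction, HDR-class) are assumptions for illustration, not measurements of any particular system.

```c
#include <stdio.h>

/* Back-of-envelope transfer-time model: T(m) = latency + m / bandwidth.
   The constants below are assumed for illustration only. */
int main(void) {
    const double latency_s     = 1.5e-6;  /* assumed small-message latency      */
    const double bandwidth_Bps = 25e9;    /* assumed usable bandwidth, one way  */
    const long sizes[] = {8, 1024, 65536, 1 << 20, 16 << 20};

    for (int i = 0; i < 5; i++) {
        double t = latency_s + sizes[i] / bandwidth_Bps;
        printf("%10ld B: ~%8.2f us, effective %8.2f MB/s\n",
               sizes[i], t * 1e6, sizes[i] / t / 1e6);
    }
    return 0;
}
```

For 8-byte messages almost all of the time is latency, while for multi-megabyte messages the transfer time is dominated by bandwidth; this is exactly the split described in the next subsection.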
Message Size and Performance Regimes
InfiniBand performance typically shows:
- For small messages:
- Latency dominates; protocol overhead is critical.
- MPI implementations may use special “eager” protocols to reduce round trips.
- For large messages:
- Bandwidth dominates; RDMA and zero‑copy techniques are used.
- Tuning buffer sizes and using large, contiguous messages helps reach peak throughput.
When you benchmark your code, you might observe a “knee” in performance where effective bandwidth rises sharply once messages become large enough, often around the point where the MPI library switches from its eager protocol to a rendezvous protocol.
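A simple way to observe these regimes yourself is a two-rank ping-pong like the sketch below: it times round trips over a range of message sizes and reports approximate one-way latency and effective bandwidth. It is deliberately minimal (no warm-up iterations, a fixed repetition count), so treat the output as indicative; established suites such as the OSU micro-benchmarks measure this more carefully.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int reps = 1000;
    /* Sweep message sizes from 8 B to 4 MiB; ranks 0 and 1 ping-pong,
       any other ranks just wait at the barrier. */
    for (int bytes = 8; bytes <= (1 << 22); bytes *= 4) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8d bytes: %8.2f us one-way, %8.2f MB/s\n",
                   bytes, dt / (2.0 * reps) * 1e6,
                   (2.0 * reps * bytes) / dt / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```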
InfiniBand and MPI/OpenMP/Hybrid Codes
MPI over InfiniBand
On InfiniBand clusters, MPI is typically built to use InfiniBand transport layers:
- Often via libraries such as:
- OFED verbs API
- UCX (Unified Communication X)
- libfabric/OFI
- MPI may automatically choose InfiniBand when available, or require environment variables or mpirun/srun options.
Practical pointers:
- Check the system documentation for:
- Which MPI modules to load (e.g., module load mpi/openmpi-ib).
- Any recommended environment variables (e.g., to select UCX or tune InfiniBand usage).
- Use --mca btl or similar options (Open MPI), or the equivalent in other MPI stacks, to confirm that the InfiniBand transport is in use, if needed.
Node‑Local vs Network Communication
Modern InfiniBand stacks and MPI implementations can:
- Use shared memory for communication between ranks on the same node.
- Use InfiniBand for ranks on different nodes.
From a user perspective:
- You don’t need to choose this manually; MPI does so automatically.
- However, thread/process placement (e.g., hybrid MPI+OpenMP) can influence how much traffic goes across InfiniBand vs stays local, affecting performance.
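A quick way to see how your ranks and threads are laid out, and therefore which communication stays node-local and which crosses InfiniBand, is a hybrid “hello” like the sketch below: each MPI rank reports its host name and OpenMP thread count. Compile it with your site's MPI compiler wrapper and OpenMP flag (for example, mpicc -fopenmp on GCC-based toolchains); the exact module and wrapper names are site-specific.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(node, &len);

    #pragma omp parallel
    {
        /* Print once per rank: which node it landed on and how many threads it runs. */
        #pragma omp single
        printf("rank %d on %s with %d OpenMP threads\n",
               rank, node, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```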
Practical User Interactions with InfiniBand
Recognizing InfiniBand on a Cluster
Common signs that a cluster uses InfiniBand:
- Documentation mentions “IB fabric,” “HDR,” “EDR,” etc.
- Network devices shown via commands such as ibstat, ibv_devinfo, or lspci | grep -i infiniband.
- Network interface names such as ib0 or ib1 (IPoIB interfaces), visible in ip addr.
You generally don’t configure these; system administrators handle it. But knowing that InfiniBand exists helps interpret performance expectations and tuning advice.
IP over InfiniBand (IPoIB)
Clusters may provide IP over InfiniBand:
- Presents the InfiniBand device as a regular IP network interface.
- Allows TCP/IP-based tools (e.g., SSH, NFS) to run over InfiniBand.
However, for HPC code:
- MPI and parallel file systems typically use InfiniBand more directly for superior performance.
- IPoIB is primarily for compatibility and management, not maximum performance.
Job Scripts and Resource Selection
Some schedulers and environments expose InfiniBand‑related options:
- Constraints or features ensuring nodes are connected to a particular InfiniBand fabric or partition.
- Environment modules that select MPI builds optimized for InfiniBand.
- Variables controlling network behavior, such as:
- UCX_NET_DEVICES
- MPI transport selection flags.
When scaling up jobs, it’s good practice to:
- Read site‑specific documentation on how to best use the InfiniBand fabric.
- Respect any recommendations about maximum core counts or ranks per node to avoid saturating the network.
Reliability, Congestion, and QoS
Reliability and Flow Control
InfiniBand implements hardware flow control and error detection:
- Prevents buffer overruns by pausing senders when receivers are full.
- Uses CRC checks at the link level, plus end-to-end acknowledgements and retransmission at the transport level (in RC mode), for reliability.
As a user, this contributes to:
- More predictable performance under load versus best‑effort Ethernet.
- Fewer silent data corruptions, which is critical for long scientific computations.
Congestion and Oversubscription
Despite high bandwidth, InfiniBand fabrics can still experience congestion:
- If the fabric is oversubscribed (more potential traffic than bisection bandwidth allows).
- Under heavy all‑to‑all or irregular communication patterns.
Site documentation might:
- Warn about certain job sizes or communication patterns.
- Provide best practices, e.g., avoiding unnecessarily large all‑to‑all collectives.
Partitions and Quality of Service (QoS)
InfiniBand supports:
- Partitions: logical isolation of traffic (similar to VLANs).
- QoS: assigning different service levels or priorities to traffic classes.
While typically configured by administrators, this can affect:
- Which nodes you can communicate with (e.g., separate partitions for test vs production).
- The performance of I/O vs compute traffic when both share the fabric.
InfiniBand Generations and Compatibility
Evolution of InfiniBand Speeds
Major InfiniBand generations relevant to HPC:
- SDR, DDR (older, now rare in new systems)
- QDR (~40 Gb/s)
- FDR (~56 Gb/s)
- EDR (~100 Gb/s)
- HDR (~200 Gb/s)
- NDR (~400 Gb/s)
For users:
- Different partitions or subclusters might use different InfiniBand generations.
- Performance expectations and scaling limits can vary between them.
Backward Compatibility
InfiniBand is designed with some degree of backward compatibility:
- Newer HCAs and switches may support multiple speeds or fall back to older link speeds.
- Mixed environments might negotiate to the highest common speed.
When comparing performance between systems, ensure you’re aware of:
- Which InfiniBand generation each cluster uses.
- Whether your job might be crossing lower‑speed links or older hardware.
InfiniBand and Parallel Filesystems
Parallel filesystems such as Lustre or GPFS may be deployed over InfiniBand:
- Storage servers (e.g., Lustre object storage servers or GPFS NSD servers) connect to the IB fabric.
- Dedicated IB networks may be used for storage vs MPI traffic, or they may share the same fabric.
For users, practical points include:
- High throughput file I/O may rely on InfiniBand performance.
- Busy I/O periods (e.g., many jobs checkpointing at once) can stress both the filesystem and the network.
- Following site guidance on checkpointing intervals and large data transfers helps maintain stable performance.
Summary: What You Should Remember as a Beginner
- InfiniBand is a specialized high‑performance interconnect widely used in HPC clusters.
- It provides high bandwidth, low latency, and RDMA capabilities that MPI and parallel filesystems exploit.
- You rarely configure InfiniBand directly; you interact through MPI, job scripts, and modules.
- Understanding that InfiniBand exists—and is different from Ethernet—helps explain:
- Why certain MPI builds or options are recommended.
- Why small changes in process placement or message sizes can have large performance effects.
- Why your cluster’s documentation spends time on “fabric topology,” “IB partitions,” and related terms.
As you move on to topics like MPI and parallel filesystems, you’ll see how these software layers make use of InfiniBand’s features to deliver scalable HPC performance.