Role of Interconnects in HPC Clusters
In a high performance computing cluster, many separate machines must cooperate as a single system. Interconnects are the communication fabric that links these machines together. They carry data between nodes, between nodes and storage, and sometimes between components inside a node. The qualities of this fabric, especially its latency and bandwidth, are central to the performance of parallel applications that run across multiple nodes.
An interconnect is more than a cable. It includes the physical medium, the network interface hardware on each node, the switches that route traffic, and the communication protocols that govern data transfer. In small systems you can think of the interconnect as “just the network,” but in HPC the design is specialised and tightly integrated with the rest of the cluster.
Because this chapter sits within the broader discussion of HPC clusters and infrastructure, you should already have a picture of nodes and their roles. The focus here is on how these nodes talk to each other and why different kinds of interconnects matter.
Latency, Bandwidth, and Bisection Bandwidth
Two properties dominate discussions of interconnect performance: latency and bandwidth. A third, bisection bandwidth, is often used to describe the capacity of the network as a whole.
Latency is the time taken to deliver the smallest possible message from one node to another; in practice it is often reported as half the round trip time of a ping pong exchange. It includes the time to prepare the message, traverse the network stack, transmit on the wire, route through switches, and process it on the receiving node. In communication intensive codes, especially those that exchange many small messages, latency can dominate overall cost.
Bandwidth describes how much data per second the interconnect can transfer once a stream is flowing. It is often measured in Gbit/s or GB/s. Applications that send large chunks of data, for example for domain decomposition or checkpointing, care strongly about available bandwidth.
A useful approximation is that the time to send a message of size $S$ bytes across a network link can be modelled as
$$
T(S) = T_{\text{latency}} + \frac{S}{B},
$$
where $T_{\text{latency}}$ is the latency and $B$ is the sustained bandwidth in bytes per second.
Low latency is crucial for many small messages. High bandwidth is crucial for large messages.
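As a rough illustration of this model, the following Python sketch evaluates the predicted transfer time and effective bandwidth for a few message sizes. The latency and bandwidth figures are illustrative assumptions, not measurements from any particular system.

```python
def transfer_time(size_bytes, latency_s, bandwidth_Bps):
    """Simple latency/bandwidth model: T(S) = T_latency + S / B."""
    return latency_s + size_bytes / bandwidth_Bps

latency = 1.5e-6     # assumed 1.5 microsecond latency
bandwidth = 12.5e9   # assumed 12.5 GB/s sustained (roughly a 100 Gbit/s link)

for size in (64, 8 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    t = transfer_time(size, latency, bandwidth)
    print(f"{size:>10} B: {t * 1e6:9.2f} us, effective {size / t / 1e9:6.2f} GB/s")
```

Under these assumed numbers the 64 byte message achieves well under 0.1 GB/s because latency dominates, while the 64 MiB message approaches the sustained rate.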
Bisection bandwidth describes the minimum sum of link bandwidths that must be cut to divide the network into two halves with equal numbers of nodes. It captures how well the network can support many simultaneous communications. A network with high per link bandwidth but low bisection bandwidth can suffer from congestion as parallel jobs scale up.
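To make the definition concrete, here is a toy calculation for a hypothetical cluster built from two leaf switches; all port counts and link speeds are invented for illustration.

```python
# Hypothetical cluster: two leaf switches, 16 nodes each, 100 Gbit/s links,
# and 4 links between the two switches.
link_gbit = 100
inter_switch_links = 4

# Splitting the cluster into its two 16-node halves cuts the inter-switch links.
bisection_gbit = inter_switch_links * link_gbit   # 400 Gbit/s
injection_gbit = 16 * link_gbit                   # 1600 Gbit/s one half could inject
print(f"bisection bandwidth: {bisection_gbit} Gbit/s "
      f"({bisection_gbit / injection_gbit:.0%} of what one half can inject)")
```

In this toy example only a quarter of the traffic one half can generate fits across the bisection, so an all to all pattern between the halves would be congested.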
Topologies and Their Impact
The pattern in which switches and links are connected is the network topology. It has strong effects on latency, bisection bandwidth, fault tolerance, and cost. You do not need to design these topologies, but understanding them helps you reason about performance and about job placement.
Simple topologies like a star or a tree are easy to build and expand, but they often have limited bisection bandwidth and can create hot spots at higher levels of the tree. Larger HPC systems often use more complex topologies such as fat trees, torus networks, or dragonfly designs.
A fat tree provides more aggregate link bandwidth at higher levels of the tree than at the leaves. This aims to ensure that there is enough capacity between any two groups of nodes, reducing oversubscription. A torus connects nodes in a grid that wraps around at the edges, which keeps path lengths short and uniform. Dragonfly networks group nodes into local clusters, then connect these clusters with a high bandwidth global network, which can reduce the number of hops between groups.
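As a small illustration of why wraparound links keep torus path lengths short and uniform, the following sketch computes the minimal hop count between two nodes of a torus. The coordinates and dimension sizes are hypothetical.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinates a and b on a torus whose
    dimensions have the sizes given in dims, counting wraparound links."""
    hops = 0
    for ai, bi, n in zip(a, b, dims):
        d = abs(ai - bi)
        hops += min(d, n - d)   # go directly or wrap around, whichever is shorter
    return hops

# On an 8 x 8 x 8 torus, "opposite corners" are only one wraparound hop
# away in each dimension.
print(torus_hops((0, 0, 0), (7, 7, 7), (8, 8, 8)))   # -> 3
```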
From a user perspective, the main consequences of topology are that communication costs can depend on which nodes your job receives, and that communication patterns that align with the topology can perform better. Many batch schedulers and MPI libraries attempt to map processes onto nodes in ways that respect the underlying topology.
Switching Fabric and Routing
Interconnects are built around switches that forward packets between links. The set of all switches and links is often called the switching fabric. Modern HPC switches implement hardware based routing. They decide which output port to use for a given packet and can perform this decision at very high speed.
Routing can be static or adaptive. With static routing, each path between a source and destination is fixed. With adaptive routing, the network can choose among multiple paths in real time. Adaptive routing can reduce congestion and improve performance for unpredictable communication patterns, which are common in some scientific applications.
The routing algorithm can affect both performance and correctness. Certain combinations of routing decisions and communication patterns can cause deadlocks in the network itself. HPC interconnects therefore often support deadlock avoidance mechanisms at the hardware level, such as virtual channels or traffic classes.
Protocols and Offload Capabilities
At the logical level, interconnects use protocols to define how data is packaged, addressed, and acknowledged. In commodity Ethernet environments, communication often uses TCP or UDP sockets and relies heavily on the host CPU to implement the protocol stack. This is flexible but can impose significant CPU overhead.
HPC interconnects, and increasingly advanced Ethernet hardware, provide offload capabilities. These include hardware support for message segmentation, reassembly, congestion control, and even one sided operations where a remote memory region can be read or written without explicit participation of the remote CPU at the time of transfer.
One important family of capabilities is remote direct memory access, or RDMA. RDMA allows an application to move data directly between the memory of two nodes with minimal CPU involvement. It reduces overhead, can bypass parts of the operating system network stack, and often yields both lower latency and higher throughput.
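The sketch below uses the standard MPI one sided (RMA) interface through mpi4py to show the programming style: rank 0 writes directly into a buffer exposed by rank 1, which does not post a matching receive. Whether the transfer actually uses RDMA underneath depends on the MPI library and the interconnect; the buffer size and ranks are illustrative, and the sketch assumes at least two ranks.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every rank exposes a small buffer that other ranks may access directly.
local = np.zeros(4, dtype='d')
win = MPI.Win.Create(local, comm=comm)

if rank == 0:
    data = np.arange(4, dtype='d')
    win.Lock(1)                    # passive-target access epoch on rank 1
    win.Put(data, target_rank=1)   # write into rank 1's exposed memory
    win.Unlock(1)                  # completes the transfer

comm.Barrier()                     # ensure the Put is visible before reading
if rank == 1:
    print("rank 1 window now holds:", local)

win.Free()
```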
Some MPI implementations and communication libraries are designed to exploit these offload and RDMA capabilities. From the user’s perspective, you keep the same MPI calls, but the library may choose fast paths over RDMA enabled interconnects if available.
Reliability, Congestion, and Quality of Service
HPC clusters typically run long jobs with many communicating processes. Interconnect reliability is therefore a central concern. Bit errors on the wire are mitigated through error detection codes, retransmissions, or forward error correction. Switches and network interface cards track error counters so that administrators can detect flaky cables or ports.
Congestion arises when many flows compete for the same links. This can increase latency and reduce throughput. Some interconnects implement congestion control, for example explicit congestion notification and rate limiting, to prevent excessive queuing. Adaptive routing can also mitigate hot spots by redistributing traffic.
Quality of service refers to the ability to prioritise some kinds of traffic over others. In mixed use clusters, storage traffic, system management traffic, and user application traffic may share the same interconnect. Quality of service features in the network can help ensure that control traffic remains responsive even when large data transfers are occurring.
From the point of view of application developers, these details are abstracted away by MPI and other communication libraries. However, awareness of congestion and fairness issues is useful when planning communication patterns and when interpreting performance anomalies.
Measuring and Understanding Interconnect Performance
Interconnect performance is usually characterised through microbenchmarks and application benchmarks. Microbenchmarks test simple patterns, such as point to point ping pong or bandwidth tests, between two or more nodes. They report latency and sustainable bandwidth for small and large messages. Application benchmarks run representative parallel codes and measure overall runtime or throughput.
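A minimal point to point ping pong microbenchmark in the style described above might look like the following mpi4py sketch, intended to be run with two ranks placed on different nodes. The message sizes and repetition count are arbitrary choices.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000

for size in (8, 1024, 1024 * 1024):           # message sizes in bytes
    buf = np.zeros(size, dtype='b')
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    t1 = MPI.Wtime()
    if rank == 0:
        one_way = (t1 - t0) / reps / 2        # half the round-trip time
        print(f"{size:>8} B  latency {one_way * 1e6:8.2f} us  "
              f"bandwidth {size / one_way / 1e9:6.2f} GB/s")
```

The small message result is dominated by latency, while the large message result approaches the sustainable bandwidth of the path between the two nodes.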
A common observation is that theoretical peak bandwidth is rarely achieved in practice. Factors such as protocol overhead, shared links, switch buffering, and host CPU limitations reduce the effective bandwidth. For small messages, effective bandwidth can be much lower than the nominal link rate because latency dominates and because each message carries fixed overhead.
The effective performance of an interconnect can also change with system load. On shared clusters, jobs from multiple users may contend for the same switch backplanes or uplinks. This can lead to variable performance, even if your own job uses the same number of nodes and the same program.
Understanding how to interpret network benchmarks, how to distinguish host side bottlenecks from network bottlenecks, and how to report findings to system administrators is a useful skill for HPC practitioners. It allows you to make informed decisions about scaling studies and performance tuning.
Interaction with Parallel Programming Models
Interconnects are deeply tied to distributed memory parallel programming. When processes on different nodes exchange messages, those messages travel over the interconnect. The latency and bandwidth characteristics of the interconnect influence algorithm design, especially in communication intensive regions.
In message passing programs, it is often useful to minimise the number of messages, aggregate small messages into larger ones, and overlap communication with computation. These strategies all aim to work with the strengths and limitations of the interconnect. Algorithms with many global synchronisations can suffer if the interconnect has nonuniform performance or if contention is high.
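As one example of overlapping communication with computation, the following mpi4py sketch starts a nonblocking exchange, performs work that does not depend on the transferred data while the message is in flight, and only then waits for completion. The array sizes, tag, and ranks are hypothetical.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

halo = np.zeros(1000, dtype='d')        # boundary data to exchange
work = np.random.rand(1_000_000)        # data that does not depend on the halo

if rank == 0:
    req = comm.Isend(halo, dest=1, tag=7)    # start the transfer, do not block
    partial = np.sin(work).sum()             # useful work during the transfer
    req.Wait()                               # only now wait for the communication
elif rank == 1:
    req = comm.Irecv(halo, source=0, tag=7)
    partial = np.sin(work).sum()
    req.Wait()
```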
Hybrid programming models that use threads within a node and message passing between nodes can also be affected by the interconnect. For example, if you use fewer MPI ranks per node and more threads per rank, you may reduce the number of endpoints that participate in global communications. This can reduce pressure on the interconnect, especially on broadcast or reduction operations.
GPU accelerated applications can use the interconnect in complex ways. Data may move between GPUs within a node, between GPUs across nodes, and between host memory and device memory. Some modern interconnects integrate with GPU aware protocols and libraries so that data can move directly between devices without intermediate copies.
Design Tradeoffs and Cost Considerations
Cluster designers must balance performance, cost, and flexibility when choosing an interconnect. Higher performance hardware tends to be more expensive not only in terms of switch and adapter cost, but also in terms of cabling, power, and cooling. Complex topologies with high bisection bandwidth require more switch ports and more links, which raises cost.
A common design strategy is to avoid overprovisioning the network while still supporting the intended workloads. For example, a cluster aimed primarily at high throughput computing, where jobs are mostly independent, may invest less in high end interconnect technologies and more in node count or storage. A cluster targeted at tightly coupled simulations is more likely to justify a high performance interconnect.
Oversubscription is another important tradeoff. An oversubscribed network has less aggregate bandwidth between groups of nodes than the sum of the bandwidth within each group. This can be acceptable if not all traffic patterns are worst case, but it can hurt performance for certain jobs. Many clusters are partially oversubscribed at higher levels of the topology.
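A small back of the envelope calculation shows how oversubscription limits worst case bandwidth between groups. The port counts and link speed below are hypothetical.

```python
# Hypothetical leaf switch: 32 node-facing ports and 8 uplinks, all 100 Gbit/s.
downlink_gbit = 32 * 100             # node-facing capacity
uplink_gbit = 8 * 100                # capacity toward the rest of the fabric
ratio = downlink_gbit / uplink_gbit  # 4.0, i.e. a 4:1 oversubscribed edge

# If every node on the switch sends off-switch at once, each node gets only
# a share of the uplink capacity.
per_node_gbit = uplink_gbit / 32     # 25 Gbit/s instead of the nominal 100 Gbit/s
print(f"oversubscription {ratio:.0f}:1, worst case {per_node_gbit:.0f} Gbit/s per node")
```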
As an HPC user, you might not make these design decisions, but they influence how you use the system. Documentation often describes the interconnect type, whether the network is oversubscribed, and any special partitions that have different network characteristics. Choosing the right partition for a communication heavy job can significantly affect runtime.
Practical Implications for Users
Although most details of interconnect design remain hidden, there are several practical points that users should keep in mind.
First, not all communication costs the same. Nearby nodes in the same rack or switch may have slightly lower latency or higher effective bandwidth than more distant nodes, even within a flat looking cluster. Schedulers may attempt to allocate contiguous blocks of nodes to reduce hop counts, but this is not always possible.
Second, the interconnect is a shared resource. When you run large jobs during busy times, your application may encounter more background traffic. Scaling studies that were performed on an empty or dedicated partition may not exactly predict performance on a busy shared system.
Third, the behaviour of collective operations, such as broadcasts and reductions, depends heavily on the interconnect. Libraries usually implement tree based or hierarchical algorithms that reflect the network topology. This is one reason why global operations may scale differently from pairwise communications.
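A rough cost model makes this concrete: a binomial tree broadcast takes about ceil(log2 P) sequential stages, each paying the point to point cost from the latency and bandwidth model introduced earlier. The sketch below evaluates that model with illustrative numbers; real libraries use more sophisticated, message size dependent algorithms.

```python
import math

def bcast_time(procs, msg_bytes, latency_s, bandwidth_Bps):
    """Approximate cost of a binomial-tree broadcast: ceil(log2(P)) stages,
    each modelled as T = latency + size / bandwidth."""
    stages = math.ceil(math.log2(procs))
    return stages * (latency_s + msg_bytes / bandwidth_Bps)

# Assumed 1.5 us latency, 12.5 GB/s bandwidth, 1 MiB message.
for procs in (16, 256, 4096):
    t = bcast_time(procs, 1 << 20, 1.5e-6, 12.5e9)
    print(f"{procs:>5} processes: ~{t * 1e6:7.1f} us")
```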
Finally, the interconnect can fail or degrade. Symptoms such as sporadic timeouts, large variations in communication times, or frequent MPI errors can indicate hardware issues. Collecting basic observations, such as which nodes were involved and what patterns of communication were running, helps administrators diagnose and fix interconnect problems.
Understanding interconnects at this level does not require you to be a network engineer. It provides enough context to reason about communication costs, to interpret scaling results, and to make better use of the cluster resources that support your computations.