What Interconnects Are in an HPC Cluster
In an HPC cluster, the interconnect is the network that links nodes together so they can exchange data. It is distinct from:
- Local memory access inside a node
- Access to storage systems
- External internet connectivity
Interconnects matter in HPC because many workloads require frequent communication between nodes. The performance of this communication can dominate overall runtime.
At a high level, an interconnect is characterized by:
- Topology – how nodes are wired together
- Bandwidth – how much data per second can move between endpoints
- Latency – how long a single message takes to travel
- Offload and features – what the network hardware can do without CPU involvement
- Reliability and fault-tolerance mechanisms
The choice and configuration of interconnects are central design decisions in HPC system architecture.
Key Performance Concepts: Latency and Bandwidth
Two core metrics define interconnect performance:
- Latency: time to send a tiny message from one node to another. Measured in microseconds ($\mu$s). Lower is better.
- Bandwidth: peak rate at which data can be transferred, often in Gbit/s or GB/s. Higher is better.
They matter in different ways:
- Codes with many small messages (e.g., tightly coupled simulations) are latency-sensitive.
- Codes that move large arrays or perform big file transfers are bandwidth-sensitive.
The time to send $N$ bytes over a link can be roughly modeled as:
$$
T(N) \approx T_{\text{latency}} + \frac{N}{\text{bandwidth}}
$$
This is why a network with slightly higher bandwidth but much higher latency can still be worse for many HPC applications.
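The model above can be turned into a tiny calculator. The numbers below are purely illustrative assumptions (not measurements of any real fabric), chosen to show how a link with higher bandwidth but much higher latency loses on small messages and wins only on very large ones:

```python
def transfer_time(n_bytes, latency_s, bandwidth_bps):
    """T(N) = latency + N / bandwidth, for one point-to-point transfer."""
    return latency_s + n_bytes / bandwidth_bps

# Hypothetical links (illustrative numbers only):
fabric = dict(latency_s=1e-6,  bandwidth_bps=25e9)   # ~1 us, 200 Gbit/s
other  = dict(latency_s=30e-6, bandwidth_bps=50e9)   # ~30 us, 400 Gbit/s

for n in (1_000, 1_000_000, 100_000_000):
    print(f"{n:>11} B   fabric {transfer_time(n, **fabric)*1e6:9.1f} us"
          f"   other {transfer_time(n, **other)*1e6:9.1f} us")
```

With these assumed numbers, the low-latency fabric is roughly 30x faster for a 1 kB message, while the higher-bandwidth link only pulls ahead once transfers reach tens of megabytes.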
Network Topologies in HPC Interconnects
How nodes are connected (the topology) impacts both performance and scalability. Some commonly used topologies:
- Fat tree / Clos networks
- Multi-level tree-like structure with more bandwidth near the root.
- Popular because it can provide near-uniform bandwidth between any two nodes.
- Easy to reason about in terms of bisection bandwidth (the bandwidth across a “cut” that splits the system into two halves).
- Dragonfly / Dragonfly+
- Groups of nodes with high-bandwidth local connections, and high-speed links between groups.
- Designed to reduce the number of hops between any two nodes.
- Offers good scalability and cost-efficiency for large systems.
- Torus / Mesh (e.g., 3D torus)
- Nodes arranged in grids, each connected to neighbors.
- Short, regular links; works well for applications with nearest-neighbor communication patterns.
- Hop count can grow with the size of the system.
- Star, ring, partial mesh
- Less common in modern large-scale HPC, but appear in smaller or legacy systems.
From a user perspective, topology can affect:
- How job schedulers place MPI ranks on nodes
- Whether “locality-aware” job options exist (e.g., requesting nodes that are close together)
- Performance variability depending on where your job is mapped in the network
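To make the hop-count point concrete, here is a toy calculation of the minimal hop distance between two nodes on a wrap-around (torus) grid. The grid size and coordinates are hypothetical; on a fat tree, by contrast, any two nodes are a small, roughly constant number of switch hops apart:

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinate tuples a and b on a torus.

    In each dimension the shorter of the direct path and the
    wrap-around path is taken, then the per-dimension hops are summed.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

dims = (8, 8, 8)                                  # hypothetical 8x8x8 3D torus
print(torus_hops((0, 0, 0), (1, 0, 0), dims))     # nearest neighbor: 1 hop
print(torus_hops((0, 0, 0), (4, 4, 4), dims))     # farthest node: 12 hops
print(torus_hops((0, 0, 0), (7, 0, 0), dims))     # wrap-around link: 1 hop
```

This is why a torus suits nearest-neighbor patterns (1 hop) while the worst-case distance grows with the machine size.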
Communication Models: Store-and-Forward vs Cut-Through
Interconnects differ in how network switches move packets:
- Store-and-forward
- Switch receives the entire packet, checks it, then forwards it.
- Simpler but adds more latency per hop.
- Cut-through / wormhole routing
- Switch begins forwarding bytes as soon as the header is received.
- Reduces per-hop latency and is common in high-end interconnects.
Routing strategies (e.g., deterministic vs adaptive routing) also affect:
- How traffic is spread across multiple possible paths
- How well the network tolerates congestion or failures
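The latency difference between the two switching modes can be sketched with a simple pipeline model. All constants (packet size, header size, per-hop overhead, link speed) are hypothetical:

```python
def store_and_forward(n_bytes, hops, link_bps, hop_overhead_s):
    # Each switch receives the whole packet before forwarding, so the
    # serialization time n/bandwidth is paid once per hop.
    return hops * (n_bytes / link_bps + hop_overhead_s)

def cut_through(n_bytes, header_bytes, hops, link_bps, hop_overhead_s):
    # Forwarding starts once the header arrives; the packet body is
    # pipelined across hops, so its serialization time is paid only once.
    return hops * (header_bytes / link_bps + hop_overhead_s) + n_bytes / link_bps

# Hypothetical 4 kB packet crossing 5 switches on a 100 Gbit/s link:
saf = store_and_forward(4096, 5, 12.5e9, 100e-9)
ct  = cut_through(4096, 64, 5, 12.5e9, 100e-9)
print(f"store-and-forward {saf*1e9:.0f} ns, cut-through {ct*1e9:.0f} ns")
```

Under these assumptions cut-through saves the full-packet serialization delay at every intermediate hop, which is why it dominates in low-latency fabrics.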
Offload and RDMA
Modern HPC interconnects often support Remote Direct Memory Access (RDMA):
- Allows a NIC (network interface card) to directly read/write memory on a remote node
- Minimizes CPU involvement in data movement
- Reduces latency and CPU overhead
This contrasts with traditional networking where:
- Data must pass through the kernel network stack
- User-space processes copy data into kernel buffers
- The CPU is heavily involved in every transfer
Interconnects may also offload collective operations (e.g., reductions, broadcasts) to the network hardware, further reducing CPU and memory traffic.
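The benefit of removing copies can be sketched with a cost model (this is not RDMA code, just arithmetic; the copy counts and bandwidths are assumed values):

```python
def kernel_stack_time(n_bytes, wire_s, link_bps, copy_bps, copies=2):
    # Traditional path: wire time plus extra memory copies
    # (user -> kernel buffer on send, kernel -> user on receive).
    return wire_s + n_bytes / link_bps + copies * (n_bytes / copy_bps)

def rdma_time(n_bytes, wire_s, link_bps):
    # RDMA path: the NIC moves data directly to and from application
    # memory, so no intermediate copies are charged.
    return wire_s + n_bytes / link_bps

n = 1_000_000  # 1 MB message, hypothetical 100 Gbit/s link, 10 GB/s memcpy
print(f"kernel stack {kernel_stack_time(n, 1e-6, 12.5e9, 10e9)*1e6:.0f} us, "
      f"RDMA {rdma_time(n, 1e-6, 12.5e9)*1e6:.0f} us")
```

The model also ignores the CPU cycles the copies consume, which RDMA frees up for computation; in practice that is often the larger win.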
Transport Semantics: Reliable vs Unreliable, Ordered vs Unordered
HPC interconnects typically provide:
- Reliable transport: packets are retransmitted if lost
- In-order delivery: data arrives in the order it was sent on a connection
- Flow control and congestion management: mechanisms to avoid overwhelming receivers or links
Some features to be aware of:
- Connection-oriented vs connectionless communication
- Multiple service levels or virtual lanes to separate traffic classes (e.g., user vs storage)
- Support for quality of service (QoS) so critical traffic (e.g., filesystem metadata) gets priority
From the application’s viewpoint (through MPI or other libraries), these semantics are usually abstracted away, but they influence the reliability and performance you actually see.
Interconnects vs Traditional Ethernet Networking
While Ethernet is covered separately, from an interconnect perspective the contrast is useful:
- Traditional datacenter Ethernet
- Designed for general-purpose traffic
- Higher latency, CPU-intensive networking stack
- Benefits from economies of scale (cheap, standardized)
- HPC-oriented interconnects
- Lower latency, higher effective bandwidth between compute nodes
- Often use RDMA and offload features
- Tuned for bulk data and collective operations
Some HPC systems use:
- Specialized interconnects for compute traffic
- Standard Ethernet for management, login, and external connectivity
Others deploy advanced Ethernet-based fabrics that behave much more like traditional HPC interconnects, blurring the distinction.
Interconnects and Parallel Application Performance
For parallel applications, the interconnect influences:
- Scalability: how performance changes as you increase node count
- Communication/computation overlap: whether the network can keep up while computation continues
- Sensitivity to job placement: how much performance varies based on where your job runs
Patterns particularly sensitive to interconnect performance include:
- Global reductions (e.g., in iterative solvers)
- Nearest-neighbor exchanges (e.g., stencil codes)
- All-to-all communication (e.g., FFT-based codes, some data redistribution steps)
In practice, you may see:
- Performance plateaus or degradation as node counts increase, limited by the network
- Different scaling curves on systems with different interconnects
- Tuning advice that involves message sizes, communication frequency, and process layout based on interconnect characteristics
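The scaling plateau can be sketched with a toy strong-scaling model: per-step compute time shrinks as $1/P$, while a tree-based allreduce grows roughly as $\log_2 P$. Every constant below is hypothetical:

```python
import math

def step_time(p, work_s=1.0, allreduce_lat_s=5e-6, msg_s=1e-6):
    """Per-iteration time for p processes: 1/p compute plus a log2(p) reduction."""
    return work_s / p + math.log2(p) * (allreduce_lat_s + msg_s)

for p in (1, 16, 256, 4096, 65536):
    speedup = step_time(1) / step_time(p)
    print(f"p={p:>6}  speedup {speedup:10.0f}  efficiency {speedup/p:6.2f}")
```

With these assumed constants, efficiency is near 1.0 at small scale but collapses once the logarithmic communication term dominates the shrinking compute term, reproducing the plateau shape seen in real scaling studies.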
Practical User-Facing Aspects of Interconnects
As a user, you typically do not manage the interconnect directly, but it affects how you work:
- Job submission options
- Requests for node locality (e.g., same rack, same switch)
- Constraints based on which network a job should use (compute vs specialized fabrics)
- MPI and library configuration
- Choice of transport “backends” (e.g., different network drivers or providers)
- Environment variables to control how the library uses the interconnect (e.g., eager vs rendezvous thresholds, RDMA usage)
- Performance debugging
- Identifying when your job is network-bound rather than CPU-bound
- Recognizing symptoms of network congestion (e.g., inconsistent runtimes at larger scales)
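The eager-vs-rendezvous threshold mentioned above can be illustrated with a simplified cost model (real MPI implementations differ; the copy bandwidth, latency, and protocol details here are assumptions):

```python
def eager_time(n_bytes, lat_s, link_bps, copy_bps):
    # Eager: send immediately; the receiver later copies the payload
    # out of an internal bounce buffer into the user buffer.
    return lat_s + n_bytes / link_bps + n_bytes / copy_bps

def rendezvous_time(n_bytes, lat_s, link_bps):
    # Rendezvous: a handshake round-trip first (modeled as two extra
    # latencies), then a zero-copy transfer into the user buffer.
    return 3 * lat_s + n_bytes / link_bps

# Hypothetical 100 Gbit/s link, 1 us latency, 10 GB/s memcpy:
for n in (1_000, 100_000):
    print(f"{n:>7} B  eager {eager_time(n, 1e-6, 12.5e9, 10e9)*1e6:6.2f} us"
          f"  rendezvous {rendezvous_time(n, 1e-6, 12.5e9)*1e6:6.2f} us")
```

Under these assumptions eager wins for small messages and rendezvous for large ones; the crossover point is why libraries expose a tunable threshold between the two protocols.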
You generally interact with the interconnect indirectly via:
- MPI, SHMEM, or other communication libraries
- Parallel filesystems that move data across the same or overlapping networks
- Job scheduler resource and placement policies
Understanding that a specialized interconnect exists—and that its characteristics matter—helps interpret performance results and informs how you design and scale your HPC workloads.