What Interconnects Are in an HPC Cluster
In an HPC cluster, the interconnect is the network that links nodes together so they can exchange data. It is distinct from:
- Local memory access inside a node
- Access to storage systems
- External internet connectivity
Interconnects matter in HPC because many workloads require frequent communication between nodes. The performance of this communication can dominate overall runtime.
At a high level, an interconnect is characterized by:
- Topology – how nodes are wired together
- Bandwidth – how much data per second can move between endpoints
- Latency – how long a single message takes to travel
- Offload and features – what the network hardware can do without CPU involvement
- Reliability and fault-tolerance mechanisms
The choice and configuration of interconnects are central design decisions in HPC system architecture.
Key Performance Concepts: Latency and Bandwidth
Two core metrics define interconnect performance:
- Latency: time to send a tiny message from one node to another. Measured in microseconds ($\mu$s). Lower is better.
- Bandwidth: peak rate at which data can be transferred, often in Gbit/s or GB/s. Higher is better.
They matter in different ways:
- Codes with many small messages (e.g., tightly coupled simulations) are latency-sensitive.
- Codes that move large arrays or perform big file transfers are bandwidth-sensitive.
The time to send $N$ bytes over a link can be roughly modeled as:
$$
T(N) \approx T_{\text{latency}} + \frac{N}{\text{bandwidth}}
$$
This is why a network with slightly higher bandwidth but much higher latency can still be worse for many HPC applications.
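The model above can be turned into a tiny calculator. The numbers below are purely illustrative assumptions (not measurements of any real fabric), chosen to show how a link with higher bandwidth but much higher latency loses on small messages and wins only on very large ones:

```python
def transfer_time(n_bytes, latency_s, bandwidth_bps):
    """T(N) = latency + N / bandwidth, for one point-to-point transfer."""
    return latency_s + n_bytes / bandwidth_bps

# Hypothetical links (illustrative numbers only):
fabric = dict(latency_s=1e-6,  bandwidth_bps=25e9)   # ~1 us, 200 Gbit/s
other  = dict(latency_s=30e-6, bandwidth_bps=50e9)   # ~30 us, 400 Gbit/s

for n in (1_000, 1_000_000, 100_000_000):
    print(f"{n:>11} B   fabric {transfer_time(n, **fabric)*1e6:9.1f} us"
          f"   other {transfer_time(n, **other)*1e6:9.1f} us")
```

With these assumed numbers, the low-latency fabric is roughly 30x faster for a 1 kB message, while the higher-bandwidth link only pulls ahead once transfers reach tens of megabytes.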
Network Topologies in HPC Interconnects
How nodes are connected (the topology) impacts both performance and scalability. Some commonly used topologies:
- Fat tree / Clos networks
- Multi-level tree-like structure with more bandwidth near the root.
- Popular because it can provide near-uniform bandwidth between any two nodes.
- Easy to reason about in terms of bisection bandwidth (the bandwidth across a “cut” that splits the system into two halves).
- Dragonfly / Dragonfly+
- Groups of nodes with high-bandwidth local connections, and high-speed links between groups.
- Designed to reduce the number of hops between any two nodes.
- Offers good scalability and cost-efficiency for large systems.
- Torus / Mesh (e.g., 3D torus)
- Nodes arranged in grids, each connected to neighbors.
- Short, regular links; works well for applications with nearest-neighbor communication patterns.
- Hop count can grow with the size of the system.
- Star, ring, partial mesh
- Less common in modern large-scale HPC, but appear in smaller or legacy systems.
From a user perspective, topology can affect:
- How job schedulers place MPI ranks on nodes
- Whether “locality-aware” job options exist (e.g., requesting nodes that are close together)
- Performance variability depending on where your job is mapped in the network
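To make the hop-count point concrete, here is a toy calculation of the minimal hop distance between two nodes on a wrap-around (torus) grid. The grid size and coordinates are hypothetical; on a fat tree, by contrast, any two nodes are a small, roughly constant number of switch hops apart:

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinate tuples a and b on a torus.

    In each dimension the shorter of the direct path and the
    wrap-around path is taken, then the per-dimension hops are summed.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

dims = (8, 8, 8)                                  # hypothetical 8x8x8 3D torus
print(torus_hops((0, 0, 0), (1, 0, 0), dims))     # nearest neighbor: 1 hop
print(torus_hops((0, 0, 0), (4, 4, 4), dims))     # farthest node: 12 hops
print(torus_hops((0, 0, 0), (7, 0, 0), dims))     # wrap-around link: 1 hop
```

This is why a torus suits nearest-neighbor patterns (1 hop) while the worst-case distance grows with the machine size.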
Communication Models: Store-and-Forward vs Cut-Through
Interconnects differ in how network switches move packets:
- Store-and-forward
- Switch receives the entire packet, checks it, then forwards it.
- Simpler but adds more latency per hop.
- Cut-through / wormhole routing
- Switch begins forwarding bytes as soon as the header is received.
- Reduces per-hop latency and is common in high-end interconnects.
Routing strategies (e.g., deterministic vs adaptive routing) also affect:
- How traffic is spread across multiple possible paths
- How well the network tolerates congestion or failures
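The latency difference between the two switching modes can be sketched with a simple pipeline model. All constants (packet size, header size, per-hop overhead, link speed) are hypothetical:

```python
def store_and_forward(n_bytes, hops, link_bps, hop_overhead_s):
    # Each switch receives the whole packet before forwarding, so the
    # serialization time n/bandwidth is paid once per hop.
    return hops * (n_bytes / link_bps + hop_overhead_s)

def cut_through(n_bytes, header_bytes, hops, link_bps, hop_overhead_s):
    # Forwarding starts once the header arrives; the packet body is
    # pipelined across hops, so its serialization time is paid only once.
    return hops * (header_bytes / link_bps + hop_overhead_s) + n_bytes / link_bps

# Hypothetical 4 kB packet crossing 5 switches on a 100 Gbit/s link:
saf = store_and_forward(4096, 5, 12.5e9, 100e-9)
ct  = cut_through(4096, 64, 5, 12.5e9, 100e-9)
print(f"store-and-forward {saf*1e9:.0f} ns, cut-through {ct*1e9:.0f} ns")
```

Under these assumptions cut-through saves the full-packet serialization delay at every intermediate hop, which is why it dominates in low-latency fabrics.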
Offload and RDMA
Modern HPC interconnects often support Remote Direct Memory Access (RDMA):
- Allows a NIC (network interface card) to directly read/write memory on a remote node
- Minimizes CPU involvement in data movement
- Reduces latency and CPU overhead
This contrasts with traditional networking where:
- Data must pass through the kernel network stack
- User-space processes copy data into kernel buffers
- The CPU is heavily involved in every transfer
Interconnects may also offload collective operations (e.g., reductions, broadcasts) to the network hardware, further reducing CPU and memory traffic.
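The benefit of removing copies can be sketched with a cost model (this is not RDMA code, just arithmetic; the copy counts and bandwidths are assumed values):

```python
def kernel_stack_time(n_bytes, wire_s, link_bps, copy_bps, copies=2):
    # Traditional path: wire time plus extra memory copies
    # (user -> kernel buffer on send, kernel -> user on receive).
    return wire_s + n_bytes / link_bps + copies * (n_bytes / copy_bps)

def rdma_time(n_bytes, wire_s, link_bps):
    # RDMA path: the NIC moves data directly to and from application
    # memory, so no intermediate copies are charged.
    return wire_s + n_bytes / link_bps

n = 1_000_000  # 1 MB message, hypothetical 100 Gbit/s link, 10 GB/s memcpy
print(f"kernel stack {kernel_stack_time(n, 1e-6, 12.5e9, 10e9)*1e6:.0f} us, "
      f"RDMA {rdma_time(n, 1e-6, 12.5e9)*1e6:.0f} us")
```

The model also ignores the CPU cycles the copies consume, which RDMA frees up for computation; in practice that is often the larger win.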
Transport Semantics: Reliable vs Unreliable, Ordered vs Unordered
HPC interconnects typically provide:
- Reliable transport: packets are retransmitted if lost
- In-order delivery: data arrives in the order it was sent on a connection
- Flow control and congestion management: mechanisms to avoid overwhelming receivers or links
Some features to be aware of:
- Connection-oriented vs connectionless communication
- Multiple service levels or virtual lanes to separate traffic classes (e.g., user vs storage)
- Support for quality of service (QoS) so critical traffic (e.g., filesystem metadata) gets priority
From the application’s viewpoint (through MPI or other libraries), these semantics are usually abstracted away, but they influence the reliability and performance you actually see.
Interconnects vs Traditional Ethernet Networking
While Ethernet is covered separately, from an interconnect perspective the contrast is useful:
- Traditional datacenter Ethernet
- Designed for general-purpose traffic
- Higher latency, CPU-intensive networking stack
- Benefits from economies of scale (cheap, standardized)
- HPC-oriented interconnects
- Lower latency, higher effective bandwidth between compute nodes
- Often use RDMA and offload features
- Tuned for bulk data and collective operations
Some HPC systems use:
- Specialized interconnects for compute traffic
- Standard Ethernet for management, login, and external connectivity
Others deploy advanced Ethernet-based fabrics that behave much more like traditional HPC interconnects, blurring the distinction.
Interconnects and Parallel Application Performance
For parallel applications, the interconnect influences:
- Scalability: how performance changes as you increase node count
- Communication/computation overlap: whether the network can keep up while computation continues
- Sensitivity to job placement: how much performance varies based on where your job runs
Patterns particularly sensitive to interconnect performance include:
- Global reductions (e.g., in iterative solvers)
- Nearest-neighbor exchanges (e.g., stencil codes)
- All-to-all communication (e.g., FFT-based codes, some data redistribution steps)
In practice, you may see:
- Performance plateaus or degradation as node counts increase, limited by the network
- Different scaling curves on systems with different interconnects
- Tuning advice that involves message sizes, communication frequency, and process layout based on interconnect characteristics
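The scaling plateau can be sketched with a toy strong-scaling model: per-step compute time shrinks as $1/P$, while a tree-based allreduce grows roughly as $\log_2 P$. Every constant below is hypothetical:

```python
import math

def step_time(p, work_s=1.0, allreduce_lat_s=5e-6, msg_s=1e-6):
    """Per-iteration time for p processes: 1/p compute plus a log2(p) reduction."""
    return work_s / p + math.log2(p) * (allreduce_lat_s + msg_s)

for p in (1, 16, 256, 4096, 65536):
    speedup = step_time(1) / step_time(p)
    print(f"p={p:>6}  speedup {speedup:10.0f}  efficiency {speedup/p:6.2f}")
```

With these assumed constants, efficiency is near 1.0 at small scale but collapses once the logarithmic communication term dominates the shrinking compute term, reproducing the plateau shape seen in real scaling studies.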
Practical User-Facing Aspects of Interconnects
As a user, you typically do not manage the interconnect directly, but it affects how you work:
- Job submission options
- Requests for node locality (e.g., same rack, same switch)
- Constraints based on which network a job should use (compute vs specialized fabrics)
- MPI and library configuration
- Choice of transport “backends” (e.g., different network drivers or providers)
- Environment variables to control how the library uses the interconnect (e.g., eager vs rendezvous thresholds, RDMA usage)
- Performance debugging
- Identifying when your job is network-bound rather than CPU-bound
- Recognizing symptoms of network congestion (e.g., inconsistent runtimes at larger scales)
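The eager-vs-rendezvous threshold mentioned above can be illustrated with a simplified cost model (real MPI implementations differ; the copy bandwidth, latency, and protocol details here are assumptions):

```python
def eager_time(n_bytes, lat_s, link_bps, copy_bps):
    # Eager: send immediately; the receiver later copies the payload
    # out of an internal bounce buffer into the user buffer.
    return lat_s + n_bytes / link_bps + n_bytes / copy_bps

def rendezvous_time(n_bytes, lat_s, link_bps):
    # Rendezvous: a handshake round-trip first (modeled as two extra
    # latencies), then a zero-copy transfer into the user buffer.
    return 3 * lat_s + n_bytes / link_bps

# Hypothetical 100 Gbit/s link, 1 us latency, 10 GB/s memcpy:
for n in (1_000, 100_000):
    print(f"{n:>7} B  eager {eager_time(n, 1e-6, 12.5e9, 10e9)*1e6:6.2f} us"
          f"  rendezvous {rendezvous_time(n, 1e-6, 12.5e9)*1e6:6.2f} us")
```

Under these assumptions eager wins for small messages and rendezvous for large ones; the crossover point is why libraries expose a tunable threshold between the two protocols.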
You generally interact with the interconnect indirectly via:
- MPI, SHMEM, or other communication libraries
- Parallel filesystems that move data across the same or overlapping networks
- Job scheduler resource and placement policies
Understanding that a specialized interconnect exists—and that its characteristics matter—helps interpret performance results and informs how you design and scale your HPC workloads.