Role of Ethernet in HPC Interconnects
Ethernet is the most widely used networking technology in general IT and is increasingly common in HPC clusters, especially in small-to-medium systems or institutional clusters with constrained budgets. In HPC, its role is to connect nodes for:
- Job control and management traffic
- User login and file access
- Sometimes for the main MPI/parallel data traffic (especially in cost-sensitive clusters)
Understanding how Ethernet fits into HPC helps you reason about performance limits, network-related slowdowns, and what you can (and cannot) expect from the cluster network.
Basic Ethernet Concepts Relevant to HPC
Link speed
Common Ethernet link speeds you might see on clusters:
- 1 GbE (1 Gbit/s) – older; usually management-only today
- 10 GbE (10 Gbit/s) – older “fast” option; still in use
- 25 GbE (25 Gbit/s) – common building block in newer systems
- 40 GbE (40 Gbit/s) – now largely replaced by 25/50/100
- 50 GbE (50 Gbit/s)
- 100 GbE (100 Gbit/s) – common in new HPC and data center networks
- 200/400 GbE (emerging / high-end deployments)
For HPC, the effective throughput you get in applications is often lower than the nominal link speed due to protocol overheads, TCP/IP behavior, switch architecture, and contention.
Latency characteristics
Compared to specialized HPC interconnects (like InfiniBand), commodity Ethernet has:
- Higher latencies (microseconds to tens of microseconds per hop)
- Higher jitter (more variation in latency)
- More software stack overhead (TCP/IP processing in the OS)
For tightly coupled MPI codes that do many small messages, latency can significantly affect performance. For throughput-dominated workloads (e.g., big file transfers, embarrassingly parallel jobs), Ethernet is often sufficient.
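This trade-off can be illustrated with a simple first-order cost model, where moving a message takes a fixed latency plus its size divided by bandwidth. The latency and bandwidth figures below are illustrative assumptions, not measurements from any specific system:

```python
# First-order model of message transfer time: T(s) = latency + size / bandwidth.
# Assumed figures: ~20 us per message for commodity Ethernet over TCP,
# ~1 us for a specialized low-latency interconnect; both links at 100 Gbit/s.

def transfer_time_us(size_bytes, latency_us, bandwidth_gbps):
    """Estimated time (microseconds) to move one message."""
    return latency_us + (size_bytes * 8) / (bandwidth_gbps * 1e3)  # Gbit/s -> bit/us

for size in (64, 4096, 1_000_000):
    eth = transfer_time_us(size, latency_us=20.0, bandwidth_gbps=100.0)
    fab = transfer_time_us(size, latency_us=1.0, bandwidth_gbps=100.0)
    print(f"{size:>9} B: Ethernet ~{eth:8.2f} us, low-latency fabric ~{fab:8.2f} us")
```

For 64-byte messages the transfer time is almost entirely latency, so the 20x latency gap dominates; for megabyte-scale transfers both fabrics converge toward the same bandwidth-bound time.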
Typical Ethernet Usage Patterns in HPC Clusters
Management and control network
Nearly all clusters use Ethernet as a “control plane”:
- SSH access to login and management nodes
- Monitoring and health checks
- Configuration management, OS provisioning, PXE booting
- Cluster management software communication
This is often a separate physical network from the high-performance interconnect to avoid interference with compute traffic.
Storage and file system access
Ethernet is also commonly used for:
- Access to NFS servers
- Access to parallel filesystems over Ethernet-based backends (e.g., Lustre over TCP, Ceph, BeeGFS)
- Object storage (S3-compatible systems, archival servers)
For I/O-intensive applications, the performance of this Ethernet-based storage network can be as important as the compute interconnect.
Data/compute network on Ethernet-only clusters
Some clusters use only Ethernet for:
- MPI communication
- Inter-node communication for distributed workloads (e.g., Spark, Dask, TensorFlow)
- Data shuffles in big data or AI workloads
In such environments, network design and tuning become critical to avoid the network becoming the main bottleneck.
Ethernet Network Topologies in HPC
Simple hierarchical (tree) topologies
Smaller clusters often use:
- A single top-of-rack (ToR) switch for all nodes in a rack
- One or a few aggregation/core switches connecting those ToR switches
This is simple and cheap, but:
- Links toward the top of the tree can be oversubscribed
- Traffic between racks may be bottlenecked at core switches
Fat-tree / Clos architectures over Ethernet
To improve bisection bandwidth (the available bandwidth between any two halves of the cluster), HPC systems may use:
- A “fat-tree-like” Ethernet topology:
  - Multiple layers of switches
  - More bandwidth at higher layers to reduce oversubscription
- ECMP (Equal-Cost Multi-Path) routing to load-balance flows across multiple links
While Clos/fat-tree designs are more typical in specialized HPC networks, the same principles can be applied using Ethernet switches.
Leaf–spine Ethernet fabric
Modern clusters often adopt a leaf–spine architecture:
- Leaf switches: ToR switches connecting directly to nodes
- Spine switches: Core layer switches connecting all leafs
Properties:
- Consistent, low path length (usually 2 hops: leaf → spine → leaf)
- Easier to scale horizontally
- Allows for near-non-blocking fabrics if designed with enough spine capacity
This architecture is widely used for both data center and Ethernet-based HPC networks.
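A back-of-the-envelope sizing check shows when a leaf switch is non-blocking: its aggregate uplink bandwidth must match its aggregate node-facing bandwidth. The port counts and link speeds below are illustrative assumptions:

```python
# Leaf-spine sizing sketch: a leaf switch is non-blocking (1.0:1) when
# aggregate uplink bandwidth >= aggregate downlink (node-facing) bandwidth.

def leaf_oversubscription(nodes, node_gbps, uplinks, uplink_gbps):
    """Ratio of node-facing bandwidth to uplink bandwidth (1.0 = non-blocking)."""
    return (nodes * node_gbps) / (uplinks * uplink_gbps)

# 32 nodes at 25 GbE per leaf, 8 x 100 GbE uplinks to the spines:
ratio = leaf_oversubscription(nodes=32, node_gbps=25, uplinks=8, uplink_gbps=100)
print(f"Oversubscription: {ratio:.1f}:1")  # 800 Gbit/s down vs. 800 Gbit/s up

# Halving the uplinks saves switch ports but doubles the ratio to 2.0:1.
```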
Performance Considerations with Ethernet in HPC
Bandwidth vs. latency trade-offs
Ethernet-based HPC networks tend to have:
- Good bandwidth, especially with 25/100 GbE or faster
- Relatively higher latency for small messages compared to specialized interconnects
This impacts:
- Latency-sensitive MPI codes (frequent small messages, fine-grained synchronization)
- Collective operations that rely on many small exchanges
On the other hand, bulk data transfers (e.g., checkpointing, big matrix exchanges in large blocks) can perform well.
Oversubscription and contention
Oversubscription ratio is:
$$
\text{Oversubscription} = \frac{\text{Total possible node traffic}}{\text{Uplink capacity from the switch/rack}}
$$
Common scenarios:
- To save costs, uplinks from ToR switches to spine/core have less aggregate bandwidth than the sum of all node ports.
- When many nodes communicate across racks or access a shared resource (like a file server), links near the core can become congested.
Effects:
- Increased latency and queueing delays
- Reduced effective throughput for each job
- Performance variability depending on what other users are doing
For HPC users, this explains why network performance can change from run to run.
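Plugging illustrative numbers into the oversubscription formula above (the rack configuration here is an assumption, not from any specific system) also shows how contention shrinks each node's effective share of the uplinks:

```python
# Oversubscription example: total possible node traffic vs. rack uplink capacity.
nodes_per_rack = 40
node_link_gbps = 25        # each node has a 25 GbE link
uplinks = 4
uplink_gbps = 100          # 4 x 100 GbE uplinks toward the core

total_node_gbps = nodes_per_rack * node_link_gbps   # 1000 Gbit/s possible
uplink_capacity_gbps = uplinks * uplink_gbps        # 400 Gbit/s available
oversubscription = total_node_gbps / uplink_capacity_gbps
print(f"Oversubscription: {oversubscription:.1f}:1")

# If every node sends cross-rack at once and shares the uplinks fairly,
# each gets far less than its 25 Gbit/s link speed:
per_node_gbps = uplink_capacity_gbps / nodes_per_rack
print(f"~{per_node_gbps:.0f} Gbit/s per node under full contention")
```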
TCP/IP overhead
HPC applications on Ethernet typically use:
- TCP/IP sockets (e.g., MPI over TCP)
- Sometimes RDMA over Converged Ethernet (RoCE) for reduced CPU overhead and lower latency (still Ethernet-based, but with different stack behavior)
TCP/IP characteristics relevant to HPC:
- Higher CPU overhead for packet processing
- Congestion control algorithms that adapt throughput based on packet loss and delay
- Slow start behavior for new connections
On noisy or congested networks, TCP performance may degrade, impacting application throughput.
MTU and jumbo frames
The MTU (Maximum Transmission Unit) is the largest packet that can be carried in a single frame. The standard Ethernet MTU is 1500 bytes, but on many HPC Ethernet networks:
- Jumbo frames (e.g., MTU 9000) are enabled to:
  - Reduce per-packet overhead
  - Improve throughput for large data transfers
  - Lower CPU utilization for network processing
Caveats:
- All devices in the path must consistently support the same MTU.
- Misconfigured MTU can cause obscure connectivity or performance issues.
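The per-packet overhead reduction can be estimated from the fixed costs each frame pays on the wire. This is a sketch assuming TCP over IPv4 with no header options:

```python
# Wire efficiency of TCP/IPv4 over Ethernet for a given MTU.
# Fixed per-frame cost on the wire: preamble+SFD (8 B) + Ethernet header (14 B)
# + FCS (4 B) + inter-frame gap (12 B) = 38 B. The TCP+IPv4 headers (40 B,
# assuming no options) come out of the MTU itself.

WIRE_OVERHEAD = 8 + 14 + 4 + 12   # bytes per frame outside the MTU
TCP_IP_HEADERS = 20 + 20          # IPv4 + TCP headers, no options

def tcp_efficiency(mtu):
    """Fraction of wire bandwidth delivered as TCP payload."""
    return (mtu - TCP_IP_HEADERS) / (mtu + WIRE_OVERHEAD)

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {tcp_efficiency(mtu):.1%} payload efficiency")
```

With a 1500-byte MTU roughly 5% of the link is spent on headers and framing; jumbo frames at MTU 9000 recover most of that, on top of reducing per-packet CPU work.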
Quality of Service (QoS) and traffic isolation
To mitigate interference between different types of traffic, Ethernet switches may be configured with:
- VLANs to separate management, storage, and compute traffic
- QoS policies to give priority to latency-sensitive traffic (e.g., MPI, storage metadata) over bulk transfers (e.g., large file copy)
For users, this can manifest as more stable performance even when others run heavy I/O jobs.
Ethernet-Based High-Performance Enhancements
RDMA over Converged Ethernet (RoCE)
RoCE allows Remote Direct Memory Access over Ethernet:
- Bypasses much of the traditional TCP/IP stack
- Reduces CPU overhead and latency
- Brings Ethernet closer to specialized interconnects in behavior
Key ideas (without deep detail):
- Kernel-bypass networking: user-space communication engines
- Zero-copy data transfers directly between application buffers
Not all Ethernet-based clusters provide RoCE; it depends on NICs, switches, and configuration.
Data center Ethernet features for HPC
Modern Ethernet switches often support:
- Priority Flow Control (PFC) to reduce packet loss for specific traffic classes
- Explicit Congestion Notification (ECN) to signal congestion without dropping packets
- DCB (Data Center Bridging) features for more deterministic behavior
These features, properly configured, can significantly improve performance consistency for HPC workloads compared to basic Ethernet.
Practical Implications for HPC Users
Recognizing Ethernet-based limitations
On a cluster where the primary interconnect is Ethernet:
- Expect:
  - Longer latencies for many small messages
  - Greater sensitivity to network contention
- You may see:
  - Performance drops when the cluster is heavily loaded
  - Variation across runs of the same job
Adapting applications to Ethernet
Common strategies (conceptual, not implementation details):
- Reduce the number of small messages; send fewer, larger messages when possible
- Minimize global synchronization points (e.g., barriers, collective calls) in tight loops
- Overlap communication and computation where possible so communication latency is hidden
Even if you are not modifying code, understanding these ideas helps you interpret performance and choose algorithms or libraries more suited to Ethernet networks.
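The payoff of the first strategy (fewer, larger messages) can be quantified with a simple cost model in which every message pays a fixed latency plus size-proportional transfer time. The latency and bandwidth figures are illustrative assumptions:

```python
# Cost model: sending s bytes takes latency + s / bandwidth.
# Aggregating n small messages into one large message pays the latency once.

LATENCY_US = 20.0             # assumed per-message latency over Ethernet/TCP
BANDWIDTH_BIT_PER_US = 25e3   # 25 Gbit/s expressed in bits per microsecond

def send_time_us(size_bytes):
    return LATENCY_US + size_bytes * 8 / BANDWIDTH_BIT_PER_US

n, small = 1000, 128          # 1000 messages of 128 B each
separate = n * send_time_us(small)
aggregated = send_time_us(n * small)
print(f"separate:   {separate:10.1f} us")
print(f"aggregated: {aggregated:10.1f} us")
```

Under these assumptions the separate sends cost about 20 ms (almost all latency), while one aggregated send costs about 61 us; the higher the per-message latency, the larger the benefit of batching.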
Interpreting cluster documentation
Cluster documentation might mention:
- “10/25/100 GbE network for MPI” → main compute communication runs over Ethernet
- “Non-blocking leaf–spine Ethernet fabric” → the network was designed to minimize oversubscription
- “Separate management and storage networks” → less interference with your compute traffic
- “RoCE-enabled Ethernet fabric” → low-latency features more similar to specialized interconnects
Knowing the meaning of these terms helps you set realistic performance expectations and discuss network-related issues with support staff.
Summary
Ethernet is a flexible, cost-effective interconnect technology that plays multiple roles in HPC clusters: management, storage, and sometimes the main compute network. While it generally offers higher latency and more variability than specialized interconnects, careful network design and modern features (leaf–spine topologies, jumbo frames, QoS, RoCE, DCB) allow Ethernet-based clusters to support many HPC workloads effectively. Understanding how Ethernet behaves in an HPC context helps you reason about performance, recognize network bottlenecks, and choose or tune applications appropriately for your cluster.