Role of Ethernet in HPC Interconnects
Ethernet is the most widely used networking technology in general IT and is increasingly common in HPC clusters, especially in small-to-medium systems or institutional clusters with constrained budgets. In HPC, its role is to connect nodes for:
- Job control and management traffic
- User login and file access
- Sometimes for the main MPI/parallel data traffic (especially in cost-sensitive clusters)
Understanding how Ethernet fits into HPC helps you reason about performance limits, network-related slowdowns, and what you can (and cannot) expect from the cluster network.
Basic Ethernet Concepts Relevant to HPC
Link speed
Common Ethernet link speeds you might see on clusters:
- 1 GbE (1 Gbit/s) – older; usually management-only today
- 10 GbE (10 Gbit/s) – older “fast” option; still in use
- 25 GbE (25 Gbit/s) – common building block in newer systems
- 40 GbE (40 Gbit/s) – now largely replaced by 25/50/100
- 50 GbE (50 Gbit/s)
- 100 GbE (100 Gbit/s) – common in new HPC and data center networks
- 200/400 GbE (emerging / high-end deployments)
For HPC, the effective throughput you get in applications is often lower than the nominal link speed due to protocol overheads, TCP/IP behavior, switch architecture, and contention.
Latency characteristics
Compared to specialized HPC interconnects (like InfiniBand), commodity Ethernet has:
- Higher latencies (microseconds to tens of microseconds per hop)
- Higher jitter (more variation in latency)
- More software stack overhead (TCP/IP processing in the OS)
For tightly coupled MPI codes that do many small messages, latency can significantly affect performance. For throughput-dominated workloads (e.g., big file transfers, embarrassingly parallel jobs), Ethernet is often sufficient.
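This trade-off can be illustrated with a simple first-order cost model, where moving a message takes a fixed latency plus its size divided by bandwidth. The latency and bandwidth figures below are illustrative assumptions, not measurements from any specific system:

```python
# First-order model of message transfer time: T(s) = latency + size / bandwidth.
# Assumed figures: ~20 us per message for commodity Ethernet over TCP,
# ~1 us for a specialized low-latency interconnect; both links at 100 Gbit/s.

def transfer_time_us(size_bytes, latency_us, bandwidth_gbps):
    """Estimated time (microseconds) to move one message."""
    return latency_us + (size_bytes * 8) / (bandwidth_gbps * 1e3)  # Gbit/s -> bit/us

for size in (64, 4096, 1_000_000):
    eth = transfer_time_us(size, latency_us=20.0, bandwidth_gbps=100.0)
    fab = transfer_time_us(size, latency_us=1.0, bandwidth_gbps=100.0)
    print(f"{size:>9} B: Ethernet ~{eth:8.2f} us, low-latency fabric ~{fab:8.2f} us")
```

For 64-byte messages the transfer time is almost entirely latency, so the 20x latency gap dominates; for megabyte-scale transfers both fabrics converge toward the same bandwidth-bound time.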
Typical Ethernet Usage Patterns in HPC Clusters
Management and control network
Nearly all clusters use Ethernet as a “control plane”:
- SSH access to login and management nodes
- Monitoring and health checks
- Configuration management, OS provisioning, PXE booting
- Cluster management software communication
This is often a separate physical network from the high-performance interconnect to avoid interference with compute traffic.
Storage and file system access
Ethernet is also commonly used for:
- Access to NFS servers
- Access to parallel filesystems over Ethernet-based backends (e.g., Lustre over TCP, Ceph, BeeGFS)
- Object storage (S3-compatible systems, archival servers)
For I/O-intensive applications, the performance of this Ethernet-based storage network can be as important as the compute interconnect.
Data/compute network on Ethernet-only clusters
Some clusters use only Ethernet for:
- MPI communication
- Inter-node communication for distributed workloads (e.g., Spark, Dask, TensorFlow)
- Data shuffles in big data or AI workloads
In such environments, network design and tuning become critical to avoid the network becoming the main bottleneck.
Ethernet Network Topologies in HPC
Simple hierarchical (tree) topologies
Smaller clusters often use:
- A single top-of-rack (ToR) switch for all nodes in a rack
- One or a few aggregation/core switches connecting those ToR switches
This is simple and cheap, but:
- Links toward the top of the tree can be oversubscribed
- Traffic between racks may be bottlenecked at core switches
Fat-tree / Clos architectures over Ethernet
To improve bisection bandwidth (the available bandwidth between any two halves of the cluster), HPC systems may use:
- A “fat-tree-like” Ethernet topology:
  - Multiple layers of switches
  - More bandwidth at higher layers to reduce oversubscription
- ECMP (Equal-Cost Multi-Path) routing to load-balance flows across multiple links
While Clos/fat-tree designs are more typical in specialized HPC networks, the same principles can be applied using Ethernet switches.
Leaf–spine Ethernet fabric
Modern clusters often adopt a leaf–spine architecture:
- Leaf switches: ToR switches connecting directly to nodes
- Spine switches: Core layer switches connecting all leafs
Properties:
- Consistent, low path length (usually 2 hops: leaf → spine → leaf)
- Easier to scale horizontally
- Allows for near-non-blocking fabrics if designed with enough spine capacity
This architecture is widely used for both data center and Ethernet-based HPC networks.
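A back-of-the-envelope sizing check shows when a leaf switch is non-blocking: its aggregate uplink bandwidth must match its aggregate node-facing bandwidth. The port counts and link speeds below are illustrative assumptions:

```python
# Leaf-spine sizing sketch: a leaf switch is non-blocking (1.0:1) when
# aggregate uplink bandwidth >= aggregate downlink (node-facing) bandwidth.

def leaf_oversubscription(nodes, node_gbps, uplinks, uplink_gbps):
    """Ratio of node-facing bandwidth to uplink bandwidth (1.0 = non-blocking)."""
    return (nodes * node_gbps) / (uplinks * uplink_gbps)

# 32 nodes at 25 GbE per leaf, 8 x 100 GbE uplinks to the spines:
ratio = leaf_oversubscription(nodes=32, node_gbps=25, uplinks=8, uplink_gbps=100)
print(f"Oversubscription: {ratio:.1f}:1")  # 800 Gbit/s down vs. 800 Gbit/s up

# Halving the uplinks saves switch ports but doubles the ratio to 2.0:1.
```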
Performance Considerations with Ethernet in HPC
Bandwidth vs. latency trade-offs
Ethernet-based HPC networks tend to have:
- Good bandwidth, especially with 25/100 GbE or faster
- Relatively higher latency for small messages compared to specialized interconnects
This impacts:
- Latency-sensitive MPI codes (frequent small messages, fine-grained synchronization)
- Collective operations that rely on many small exchanges
On the other hand, bulk data transfers (e.g., checkpointing, big matrix exchanges in large blocks) can perform well.
Oversubscription and contention
Oversubscription ratio is:
$$
\text{Oversubscription} = \frac{\text{Total possible node traffic}}{\text{Uplink capacity from the switch/rack}}
$$
Common scenarios:
- To save costs, uplinks from ToR switches to spine/core have less aggregate bandwidth than the sum of all node ports.
- When many nodes communicate across racks or access a shared resource (like a file server), links near the core can become congested.
Effects:
- Increased latency and queueing delays
- Reduced effective throughput for each job
- Performance variability depending on what other users are doing
For HPC users, this explains why network performance can change from run to run.
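Plugging illustrative numbers into the oversubscription formula above (the rack configuration here is an assumption, not from any specific system) also shows how contention shrinks each node's effective share of the uplinks:

```python
# Oversubscription example: total possible node traffic vs. rack uplink capacity.
nodes_per_rack = 40
node_link_gbps = 25        # each node has a 25 GbE link
uplinks = 4
uplink_gbps = 100          # 4 x 100 GbE uplinks toward the core

total_node_gbps = nodes_per_rack * node_link_gbps   # 1000 Gbit/s possible
uplink_capacity_gbps = uplinks * uplink_gbps        # 400 Gbit/s available
oversubscription = total_node_gbps / uplink_capacity_gbps
print(f"Oversubscription: {oversubscription:.1f}:1")

# If every node sends cross-rack at once and shares the uplinks fairly,
# each gets far less than its 25 Gbit/s link speed:
per_node_gbps = uplink_capacity_gbps / nodes_per_rack
print(f"~{per_node_gbps:.0f} Gbit/s per node under full contention")
```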
TCP/IP overhead
HPC applications on Ethernet typically use:
- TCP/IP sockets (e.g., MPI over TCP)
- Sometimes RDMA over Converged Ethernet (RoCE) for reduced CPU overhead and lower latency (still Ethernet-based, but with different stack behavior)
TCP/IP characteristics relevant to HPC:
- Higher CPU overhead for packet processing
- Congestion control algorithms that adapt throughput based on packet loss and delay
- Slow start behavior for new connections
On noisy or congested networks, TCP performance may degrade, impacting application throughput.
MTU and jumbo frames
The MTU (Maximum Transmission Unit) is the largest packet that can be carried in a single frame. The standard Ethernet MTU is 1500 bytes, but on many HPC Ethernet networks:
- Jumbo frames (e.g., MTU 9000) are enabled to:
  - Reduce per-packet overhead
  - Improve throughput for large data transfers
  - Lower CPU utilization for network processing
Caveats:
- All devices in the path must consistently support the same MTU.
- Misconfigured MTU can cause obscure connectivity or performance issues.
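The per-packet overhead reduction can be estimated from the fixed costs each frame pays on the wire. This is a sketch assuming TCP over IPv4 with no header options:

```python
# Wire efficiency of TCP/IPv4 over Ethernet for a given MTU.
# Fixed per-frame cost on the wire: preamble+SFD (8 B) + Ethernet header (14 B)
# + FCS (4 B) + inter-frame gap (12 B) = 38 B. The TCP+IPv4 headers (40 B,
# assuming no options) come out of the MTU itself.

WIRE_OVERHEAD = 8 + 14 + 4 + 12   # bytes per frame outside the MTU
TCP_IP_HEADERS = 20 + 20          # IPv4 + TCP headers, no options

def tcp_efficiency(mtu):
    """Fraction of wire bandwidth delivered as TCP payload."""
    return (mtu - TCP_IP_HEADERS) / (mtu + WIRE_OVERHEAD)

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {tcp_efficiency(mtu):.1%} payload efficiency")
```

With a 1500-byte MTU roughly 5% of the link is spent on headers and framing; jumbo frames at MTU 9000 recover most of that, on top of reducing per-packet CPU work.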
Quality of Service (QoS) and traffic isolation
To mitigate interference between different types of traffic, Ethernet switches may be configured with:
- VLANs to separate management, storage, and compute traffic
- QoS policies to give priority to latency-sensitive traffic (e.g., MPI, storage metadata) over bulk transfers (e.g., large file copy)
For users, this can manifest as more stable performance even when others run heavy I/O jobs.
Ethernet-Based High-Performance Enhancements
RDMA over Converged Ethernet (RoCE)
RoCE allows Remote Direct Memory Access over Ethernet:
- Bypasses much of the traditional TCP/IP stack
- Reduces CPU overhead and latency
- Brings Ethernet closer to specialized interconnects in behavior
Key ideas (without deep detail):
- Kernel-bypass networking: user-space communication engines
- Zero-copy data transfers directly between application buffers
Not all Ethernet-based clusters provide RoCE; it depends on NICs, switches, and configuration.
Data center Ethernet features for HPC
Modern Ethernet switches often support:
- Priority Flow Control (PFC) to reduce packet loss for specific traffic classes
- Explicit Congestion Notification (ECN) to signal congestion without dropping packets
- DCB (Data Center Bridging) features for more deterministic behavior
These features, properly configured, can significantly improve performance consistency for HPC workloads compared to basic Ethernet.
Practical Implications for HPC Users
Recognizing Ethernet-based limitations
On a cluster where the primary interconnect is Ethernet:
- Expect:
  - Longer latencies for many small messages
  - Greater sensitivity to network contention
- You may see:
  - Performance drops when the cluster is heavily loaded
  - Variation across runs of the same job
Adapting applications to Ethernet
Common strategies (conceptual, not implementation details):
- Reduce the number of small messages; send fewer, larger messages when possible
- Minimize global synchronization points (e.g., barriers, collective calls) in tight loops
- Overlap communication and computation where possible so communication latency is hidden
Even if you are not modifying code, understanding these ideas helps you interpret performance and choose algorithms or libraries more suited to Ethernet networks.
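The payoff of the first strategy (fewer, larger messages) can be quantified with a simple cost model in which every message pays a fixed latency plus size-proportional transfer time. The latency and bandwidth figures are illustrative assumptions:

```python
# Cost model: sending s bytes takes latency + s / bandwidth.
# Aggregating n small messages into one large message pays the latency once.

LATENCY_US = 20.0             # assumed per-message latency over Ethernet/TCP
BANDWIDTH_BIT_PER_US = 25e3   # 25 Gbit/s expressed in bits per microsecond

def send_time_us(size_bytes):
    return LATENCY_US + size_bytes * 8 / BANDWIDTH_BIT_PER_US

n, small = 1000, 128          # 1000 messages of 128 B each
separate = n * send_time_us(small)
aggregated = send_time_us(n * small)
print(f"separate:   {separate:10.1f} us")
print(f"aggregated: {aggregated:10.1f} us")
```

Under these assumptions the separate sends cost about 20 ms (almost all latency), while one aggregated send costs about 61 us; the higher the per-message latency, the larger the benefit of batching.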
Interpreting cluster documentation
Cluster documentation might mention:
- “10/25/100 GbE network for MPI” → main compute communication runs over Ethernet
- “Non-blocking leaf–spine Ethernet fabric” → the network was designed to minimize oversubscription
- “Separate management and storage networks” → less interference with your compute traffic
- “RoCE-enabled Ethernet fabric” → low-latency features more similar to specialized interconnects
Knowing the meaning of these terms helps you set realistic performance expectations and discuss network-related issues with support staff.
Summary
Ethernet is a flexible, cost-effective interconnect technology that plays multiple roles in HPC clusters: management, storage, and sometimes the main compute network. While it generally offers higher latency and more variability than specialized interconnects, careful network design and modern features (leaf–spine topologies, jumbo frames, QoS, RoCE, DCB) allow Ethernet-based clusters to support many HPC workloads effectively. Understanding how Ethernet behaves in an HPC context helps you reason about performance, recognize network bottlenecks, and choose or tune applications appropriately for your cluster.