Historical role of Ethernet in HPC
Ethernet began in office and campus networks as a general purpose technology for connecting workstations, printers, and servers. For a long time it was considered unsuitable for high performance computing because of relatively high latency, modest bandwidth, and unpredictable congestion behavior. Early HPC systems instead relied on specialized interconnects such as Myrinet, Quadrics, or proprietary vendor fabrics.
Over time, Ethernet speeds increased, switching hardware improved, and new features appeared in both hardware and software. Modern Ethernet can reach 10, 25, 40, 100 Gb/s and beyond, and high quality switches can offer low and reasonably consistent latency. As a result, Ethernet has become a practical choice for many small to medium HPC clusters and for parts of large systems, especially where cost and compatibility with existing infrastructure are important.
In today’s HPC environment you will commonly find Ethernet used for management networks, user login and file service networks, and in many clusters as the primary interconnect for running parallel workloads.
Ethernet in a cluster network architecture
In an HPC cluster, Ethernet usually appears in at least one of three roles. First, it forms the management network, which connects head and management nodes, login nodes, and out of band management interfaces such as BMC or IPMI. This network handles tasks like provisioning, monitoring, and remote power control, and does not normally carry heavy scientific data traffic.
Second, Ethernet often underpins the service network. This is the path between login nodes, user workstations, shared storage servers, and external institutional or campus networks. User logins via SSH, transfers with scp or rsync, and web based or API access to services usually take place over Ethernet.
Third, in many clusters Ethernet is also the compute interconnect for parallel jobs. In this case, each compute node has at least one high speed Ethernet interface connected to a fabric of switches. MPI messages, distributed file I/O, and other application traffic flow across this fabric. The design of this network, its topology and bandwidth, has a direct impact on application performance and scalability.
A single physical Ethernet can also be logically separated into multiple networks using VLANs. For example, one VLAN might carry management traffic and another might carry user data, even though both use the same physical switches and cables. From the user point of view this separation is usually invisible, but it matters for security and performance isolation.
Ethernet speed, bandwidth, and practical limits
Ethernet links are labeled by their nominal data rate: 1 Gb/s, 10 Gb/s, 25 Gb/s, 40 Gb/s, 100 Gb/s, and higher. In an HPC context, 1 Gb/s is usually too slow for a primary compute interconnect, but may be acceptable for management or control traffic. Today, 10 Gb/s and above are common for user facing and storage networks, and 25 Gb/s or higher are used for compute fabrics.
The raw line rate is not the same as the application level throughput. Protocol headers, Ethernet framing, and software overhead reduce the payload rate that an MPI job can use. In addition, bandwidth is shared among flows that traverse the same links. If many nodes send data simultaneously through the same switch uplink, each flow will receive only a fraction of the total link capacity.
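To make these overheads concrete, the short sketch below estimates the best case TCP payload rate on a 10 Gb/s link and the per flow share when several flows cross the same uplink. The header sizes are the standard Ethernet, IPv4, and TCP values; the link rate and flow count are illustrative assumptions, and the estimate ignores software overhead and acknowledgement traffic.

```python
# Back-of-the-envelope estimate of application-level throughput on a shared
# Ethernet link. Header sizes are standard; link rate and flow count are
# illustrative assumptions, and software overhead is ignored.

LINK_RATE_GBPS = 10.0      # nominal line rate of the link (assumed)
MTU = 1500                 # payload bytes per Ethernet frame (standard MTU)
IP_TCP_HEADERS = 20 + 20   # IPv4 + TCP headers without options
WIRE_OVERHEAD = 38         # preamble+SFD (8) + Ethernet header (14) + FCS (4) + gap (12)

payload_per_frame = MTU - IP_TCP_HEADERS       # TCP payload bytes per frame
wire_bytes_per_frame = MTU + WIRE_OVERHEAD     # bytes actually occupying the wire
efficiency = payload_per_frame / wire_bytes_per_frame

goodput_gbps = LINK_RATE_GBPS * efficiency
print(f"Best-case TCP goodput: {goodput_gbps:.2f} Gb/s ({efficiency:.1%} of line rate)")

# If several flows cross the same uplink at once, each receives only a share.
flows_sharing_uplink = 8                       # assumed number of simultaneous senders
print(f"Per-flow share with {flows_sharing_uplink} flows: "
      f"{goodput_gbps / flows_sharing_uplink:.2f} Gb/s")
```

Even in this idealized picture, roughly 5 percent of the line rate is consumed by framing and protocol headers before any sharing or congestion is taken into account.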
For parallel programs that transfer modest amounts of data infrequently, a 10 Gb/s Ethernet fabric may be sufficient to achieve good scalability on tens or even hundreds of nodes. For communication intensive codes that exchange large messages repeatedly, such as tightly coupled CFD or large scale linear algebra, the combination of available bandwidth and latency on Ethernet becomes a significant limiting factor at larger scale.
Rule of thumb: On Ethernet based clusters, applications that frequently move large volumes of data between many nodes will hit bandwidth and latency limits sooner than on specialized low latency fabrics, even when nominal link speeds look similar.
Latency characteristics of Ethernet
Latency is the time it takes for a message to travel from one node to another. For MPI style communication, both bandwidth and latency matter, but latency becomes especially important when messages are small and frequent.
In general purpose Ethernet networks, latency is influenced by multiple factors. Each switch that a packet traverses adds a forwarding delay. Contention on busy links introduces queueing delay. Features such as deep buffers can help absorb bursts of traffic but also increase worst case latency. The operating system network stack also contributes overhead compared to more specialized HPC fabrics.
Modern Ethernet equipment can achieve microsecond scale latencies, but these are still typically higher and more variable than those of purpose built HPC interconnects. Jitter, the variation in latency, can be particularly harmful to tightly synchronized parallel codes because one delayed message can cause many processes to sit idle while they wait.
In practice, the impact of Ethernet latency on your application depends on communication patterns. Workflows that exchange large messages only at a few synchronization points may tolerate higher latency. Algorithms that rely on many small, tightly synchronized messages, such as fine grained domain decompositions or certain iterative solvers, are more sensitive and will often show reduced parallel efficiency on Ethernet compared to lower latency fabrics.
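One way to see these effects on a given cluster is a simple ping-pong microbenchmark. The sketch below uses mpi4py (assuming it is installed alongside the system MPI library) to estimate one way latency and its spread between two ranks; the message size and iteration count are arbitrary illustrative choices.

```python
# Minimal MPI ping-pong sketch (mpi4py) to estimate point-to-point latency
# and its variation. Run with exactly 2 ranks via your site's MPI launcher.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_iters = 1000
msg = np.zeros(8, dtype=np.uint8)          # small 8-byte message: latency dominated
times = []

comm.Barrier()
for _ in range(n_iters):
    t0 = MPI.Wtime()
    if rank == 0:
        comm.Send(msg, dest=1, tag=0)
        comm.Recv(msg, source=1, tag=0)
    elif rank == 1:
        comm.Recv(msg, source=0, tag=0)
        comm.Send(msg, dest=0, tag=0)
    times.append((MPI.Wtime() - t0) / 2.0)  # round trip / 2 = one-way estimate

if rank == 0:
    us = np.array(times) * 1e6
    print(f"one-way latency: min {us.min():.1f} us, "
          f"median {np.median(us):.1f} us, max {us.max():.1f} us")
```

Running the same benchmark on nodes attached to the same edge switch and then on nodes in different parts of the fabric gives a rough feel for how topology and background load contribute to latency and jitter.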
Ethernet topology and oversubscription in clusters
Topology describes how nodes and switches are connected. In HPC clusters that use Ethernet, common topologies include star like arrangements with a core switch, multi level trees, and folded Clos or fat tree designs implemented with multiple tiers of switches.
Oversubscription is a key concept in Ethernet based designs. It refers to the ratio between total bandwidth from nodes into a switch and the bandwidth from that switch toward the rest of the network. For example, if a switch has 32 ports to compute nodes at 25 Gb/s and 4 uplinks to the core at 100 Gb/s, then the edge switch has a potential oversubscription of 2:1 if all nodes send data toward the core simultaneously.
Oversubscription is common because it reduces hardware cost, but heavy all to all communication patterns can reveal it as a performance bottleneck. Traffic patterns that keep most communication within small groups of nodes suffer less from oversubscription, whereas global collective operations, such as large MPI Allreduce operations across many ranks, will stress oversubscribed links.
As a user, you rarely control the physical topology, but you will see its impact. If your application’s communication pattern is locality friendly, meaning most communication takes place within subsets of ranks that can be mapped to nodes close to each other, then Ethernet topology and oversubscription may matter less. If communication is global and unstructured, network design details will strongly affect your achievable scalability.
Important statement: Oversubscribed Ethernet fabrics can deliver good performance for many jobs, but MPI applications with heavy all to all or collective communication are especially vulnerable to congestion and may scale poorly as node count increases.
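To observe how collectives behave on a particular fabric, a timing loop around a large Allreduce is often enough. The sketch below uses mpi4py; the array size and repetition count are arbitrary illustrative values, and reporting the slowest rank reflects the fact that a collective completes only when every rank has finished.

```python
# Sketch (mpi4py) that times a large MPI Allreduce, the kind of global
# collective that stresses oversubscribed uplinks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.random.rand(4_000_000)   # ~32 MB of doubles per rank (illustrative)
result = np.empty_like(local)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(10):
    comm.Allreduce(local, result, op=MPI.SUM)
elapsed = (MPI.Wtime() - t0) / 10

# The slowest rank determines when everyone can proceed.
worst = comm.allreduce(elapsed, op=MPI.MAX)
if rank == 0:
    print(f"average Allreduce time (slowest rank): {worst*1e3:.1f} ms")
```

Repeating the measurement at increasing node counts shows directly whether oversubscribed uplinks or congestion start to dominate the collective time on a given system.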
Ethernet features relevant to HPC performance
Several advanced features of Ethernet hardware and software stacks can partially close the gap between commodity networking and specialized HPC interconnects. For example, Receive Side Scaling (RSS) and multiple queues on network adapters let the operating system distribute incoming traffic across CPU cores more efficiently, which reduces software overhead when many flows are active.
Jumbo frames, which allow larger MTU sizes than the traditional 1500 bytes, can reduce per packet overhead for large transfers. When jumbo frames are enabled end to end, large MPI messages can move with fewer interrupts and context switches, improving throughput and reducing CPU load. However, misconfigured MTUs can cause subtle connectivity or performance problems, so administrators must enable them consistently.
Priority based flow control and Data Center Bridging (DCB) features aim to increase predictability by controlling congestion and prioritizing traffic classes. In an HPC cluster that shares Ethernet with storage or general campus traffic, these mechanisms can help keep job related traffic from being swamped by unrelated traffic.
Some vendors also implement Remote Direct Memory Access over Ethernet through protocols like RoCE. These aim to bypass parts of the kernel networking stack and provide lower latency and higher throughput, closer to specialized fabrics. From the user side, these capabilities often appear as support for specific MPI transports or fabric libraries that need to be selected at build or run time.
The cluster administrators usually decide which Ethernet features are enabled. As a user, you mainly need to be aware that such choices can affect performance, and you may see recommendations in documentation to select certain MPI settings or environment variables that match the configured Ethernet capabilities.
Reliability and congestion behavior
Ethernet was originally designed as a best effort delivery technology. Packets may be dropped under congestion, and higher level protocols such as TCP handle retransmission and reliability. For interactive applications, sporadic drops are acceptable, but for large parallel jobs, packet loss can trigger retransmissions that increase latency and reduce effective bandwidth.
In real clusters, loss can appear when links are oversubscribed or when incast events happen. An incast occurs when many senders transmit simultaneously to a single receiver or through the same bottleneck link, overloading buffers on switches and network interface cards. The effects include higher average latency and long tail latencies that disrupt synchronization across ranks.
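An incast like pattern is easy to reproduce deliberately. In the sketch below (mpi4py), every rank sends a sizeable buffer to rank 0 at the same moment; the message size is an arbitrary illustrative choice. On an Ethernet fabric the link and buffers in front of rank 0 form the bottleneck, and once buffers overflow the measured time can grow disproportionately with the number of senders.

```python
# Sketch (mpi4py) of an incast-prone pattern: many ranks send to rank 0 at once.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

payload = np.ones(1_000_000, dtype=np.float64)   # ~8 MB per sender (illustrative)
recvbuf = np.empty(size * payload.size, dtype=np.float64) if rank == 0 else None

comm.Barrier()                                   # synchronize so all sends start together
t0 = MPI.Wtime()
comm.Gather(payload, recvbuf, root=0)            # many-to-one traffic burst
elapsed = MPI.Wtime() - t0

if rank == 0:
    total_mb = size * payload.nbytes / 1e6
    print(f"Gathered {total_mb:.0f} MB from {size} ranks in {elapsed:.3f} s")
```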
Modern Ethernet switches offer larger buffers, improved scheduling algorithms, and quality of service controls that reduce the likelihood of severe congestion. At the same time, tuning of TCP parameters and choice of congestion control algorithms influence how aggressively flows ramp up sending rates and how they react to congestion signals.
For MPI codes on Ethernet, this means that performance can vary depending on network tuning, background traffic, and workload patterns. A cluster might show excellent benchmarks when the network is quiet, but see degraded performance during busy periods when many data intensive jobs run simultaneously.
Key point: On Ethernet based HPC systems, packet loss and congestion can cause sudden and uneven slowdowns in parallel jobs even if average bandwidth and latency look acceptable under light load.
Differences between Ethernet and specialized HPC fabrics
In an HPC environment, the main practical differences between Ethernet and fabrics such as InfiniBand relate to latency, CPU overhead, and consistency under heavy load. Ethernet typically uses TCP or UDP, which involve kernel processing and copies between user and kernel space. This consumes CPU cycles that could otherwise go to computation, especially when communication is frequent.
Specialized HPC interconnects provide hardware support that reduces CPU involvement and shortens the critical path of message delivery. They usually integrate more tightly with MPI and support features such as hardware collectives, advanced routing capabilities, and rich performance counters for profiling communication.
From a user perspective, running MPI jobs over Ethernet often means slightly different performance characteristics and sometimes different MPI configuration settings. For example, some MPI implementations offer different transports, sometimes called Byte Transfer Layers or Communication Subsystems, and you may need to choose between a TCP based transport and one optimized for a specialized fabric.
Despite these differences, the programming interfaces remain the same. The same MPI program can usually run on Ethernet or on a specialized interconnect without code changes. The practical impact is in scalability and time to solution, not in the ability to execute the code at all.
When Ethernet is a good fit for HPC workloads
Ethernet is attractive in HPC for several reasons. It is widely available, uses standard cabling and connectors, and benefits from a large ecosystem of tools and expertise. Many institutions already have Ethernet based infrastructure, so extending it to an HPC cluster can be straightforward and relatively inexpensive.
For workloads that are loosely coupled, such as parameter sweeps, ensemble runs, or workflows that process independent data sets on different nodes, Ethernet can be entirely sufficient. In such cases, the network carries primarily control messages and file I/O, and compute nodes rarely need to exchange large amounts of data directly with each other. Even modest Ethernet speeds can deliver good throughput in this scenario.
Ethernet is also well suited for connecting login nodes, file servers, and external networks. User activities such as interactive sessions, file transfers, and access to web based monitoring or data portals are typically carried over Ethernet. Separation between management and user facing networks can still be maintained using multiple physical networks or VLANs.
For beginners using shared institutional clusters, it is common that the first HPC systems they encounter are Ethernet based. This provides an accessible environment to learn job scheduling, MPI, and performance basics, even if absolute performance is not at the level of the largest supercomputers.
Practical considerations for users on Ethernet based clusters
As an HPC user on a cluster that uses Ethernet for at least part of its networking, there are several practical points to keep in mind. First, you should be aware that heavy network use can affect not only your job but also other users. Transferring very large data sets during peak hours can contribute to congestion on shared links, which may slow parallel jobs that rely on consistent latency.
Second, when you design parallel algorithms, you can mitigate some Ethernet limitations by reducing the number of messages, aggregating small messages into larger ones, and favoring communication patterns that keep data exchanges localized when possible. Choices in domain decomposition and data distribution can significantly influence how much your program suffers from latency and congestion.
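As an illustration of message aggregation, the sketch below (mpi4py, run with two ranks) sends the same volume of data either as many small messages or as one large message; the counts and sizes are arbitrary illustrative values. On latency bound Ethernet links the aggregated version typically completes much faster.

```python
# Sketch (mpi4py) contrasting many small messages with one aggregated message
# between a pair of ranks. Run with exactly 2 ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_small = 1000
small = np.zeros(16, dtype=np.float64)           # 128-byte messages
big = np.zeros(n_small * 16, dtype=np.float64)   # same data volume, one message

def timed(label, fn):
    comm.Barrier()
    t0 = MPI.Wtime()
    fn()
    if rank == 0:
        print(f"{label}: {(MPI.Wtime() - t0)*1e3:.2f} ms")

def many_small():
    for i in range(n_small):
        if rank == 0:
            comm.Send(small, dest=1, tag=i)
        elif rank == 1:
            comm.Recv(small, source=0, tag=i)

def one_big():
    if rank == 0:
        comm.Send(big, dest=1, tag=0)
    elif rank == 1:
        comm.Recv(big, source=0, tag=0)

timed("1000 x 128 B messages", many_small)
timed("1 x 128 kB message   ", one_big)
```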
Third, be mindful of the separation between compute and login environments. Large data transfers should generally be performed from or to dedicated data transfer nodes if the system provides them, rather than through login nodes that might be connected through more congested parts of the Ethernet fabric.
Finally, you may encounter environment variables or MPI options related to network transports or buffer sizes. On Ethernet systems, administrators often provide recommended settings to achieve better performance. Following site specific documentation usually gives better results than relying on default values intended for very different interconnects.
Summary of Ethernet’s role in HPC clusters
Ethernet has evolved from a pure office networking technology into a central component of many HPC clusters. It serves as the backbone for management and user access and, in many systems, as the primary compute fabric. While its latency and congestion behavior differ from specialized HPC interconnects, modern Ethernet can deliver adequate performance for a broad range of workloads, especially those that are not dominated by fine grained communication.
Understanding Ethernet in the HPC context equips you to reason about how network characteristics influence your applications. It also helps you interpret performance behavior that stems from topology, oversubscription, or congestion, and to design workflows that are aligned with the strengths and limitations of Ethernet based infrastructure.