Role of Compute Nodes in an HPC Cluster
Compute nodes are the workhorses of an HPC cluster. They are the machines that actually run users’ calculations, simulations, and data analysis codes. Unlike login or management nodes, compute nodes are not meant for interactive use or heavy compilation activity. Once a job is submitted through the scheduler, it is dispatched to one or more compute nodes, which then execute the program with the requested resources such as cores, memory, and accelerators.
In a typical workflow, users connect to a login node, prepare and submit a job script, and then the scheduler allocates one or more compute nodes. During the run, all meaningful computational work happens on those compute nodes. Understanding how they are structured helps you request resources correctly and write software that uses them efficiently.
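As a concrete sketch of this workflow, the minimal batch script below assumes a Slurm-style scheduler; the directive values, the program name, and the input file are placeholders that will differ between sites.

    #!/bin/bash
    #SBATCH --job-name=demo           # name shown in the queue
    #SBATCH --nodes=1                 # request one compute node
    #SBATCH --ntasks=4                # four tasks (for example, MPI ranks)
    #SBATCH --time=00:30:00           # wall clock limit
    #SBATCH --mem=8G                  # memory for the whole allocation on the node

    # Everything below this point runs on the allocated compute node, not the login node.
    srun ./my_program input.dat       # program and input file are placeholders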
Typical Hardware Layout of a Compute Node
A compute node is usually a balanced system that combines CPUs, memory, interconnect hardware, and sometimes accelerators. Although it resembles a powerful server, its design is optimized for batch execution, stability, and predictable performance in a cluster environment.
Most compute nodes include one or more multi-core CPUs. A common configuration is two sockets per node, with each CPU providing many cores. The total number of cores per node, and whether those cores support hardware threads, constrains how many tasks or threads you can run simultaneously on that node.
Memory is another central component. Each node has a certain amount of main memory, often in the range of tens to hundreds of gigabytes. This memory is shared by all cores on the node, which has implications for shared memory programming models that will appear later in the course.
The network interface connects the node to the high speed interconnect fabric, such as InfiniBand or high performance Ethernet. This connection is crucial for distributed memory parallel programs, which use it to exchange data between nodes.
Compute nodes might also host GPUs or other accelerators. In that case, the node will include one or more accelerator devices attached to the CPUs, often through PCIe or a similar high bandwidth link. These nodes are sometimes referred to as GPU nodes or accelerator nodes and are typically requested explicitly through the job scheduler.
Storage inside compute nodes is usually limited. Many nodes rely on network attached parallel filesystems instead of large local disks. Some nodes may offer small local SSDs for scratch space or performance sensitive temporary data.
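If you want to see this layout for a particular node, a few standard Linux commands, run inside a job on that node rather than on a login node, reveal most of it (assuming the usual tools are installed; the scratch path shown is only an example):

    lscpu          # sockets, cores per socket, threads per core
    free -h        # total and available main memory
    nvidia-smi     # lists GPUs, only present on NVIDIA accelerator nodes
    df -h /tmp     # size of local scratch space; the actual path varies by site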
Cores, Sockets, and NUMA Inside a Node
From a user perspective, a compute node is not just a flat pool of cores and memory. Its internal structure affects performance and how you request resources.
A physical CPU package sits in a socket on the motherboard, so in practice “socket” is used to mean one CPU package. A node might have one, two, or more sockets, and each socket contains multiple cores. When you request cores in a job script, you are indirectly deciding how many cores from these sockets you want to occupy.
Modern multi-socket nodes typically present a non-uniform memory access (NUMA) structure. Memory is attached to each socket, and access is fastest when a core uses memory that belongs to its local socket; accessing memory physically attached to another socket is slower. This leads to the idea of NUMA domains inside a node.
Operating systems try to keep processes and their memory local to a socket. However, if you oversubscribe memory on one socket, processes may end up using remote memory, which can reduce performance. Many schedulers and runtime systems provide options to control how tasks are bound to cores and which NUMA domains they use, so that applications remain close to their data.
Hyperthreading or similar simultaneous multithreading features can present each physical core as multiple logical CPUs. This can increase throughput for some workloads but can also confuse resource counting if you are not aware of it. Clusters may choose to expose or hide these logical threads in different ways. The scheduler’s documentation will usually state whether a “core” refers to a physical core or a hardware thread.
Always distinguish between physical cores, logical CPUs, and sockets when requesting resources. Misunderstanding this mapping can lead to unexpected oversubscription and poor performance.
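One way to check how a node maps logical CPUs onto sockets, cores, and NUMA domains is sketched below, assuming lscpu and numactl are available on the node; the sample numbers in the comments are illustrative, not from a real machine.

    lscpu | grep -E 'Socket|Core|Thread|NUMA'   # physical versus logical CPU counts
    numactl --hardware                          # NUMA domains, their CPUs and memory

    # Illustrative interpretation (numbers are made up):
    #   Thread(s) per core: 2  -> hardware threads (SMT) are enabled
    #   Core(s) per socket: 32 -> 64 physical cores on a two-socket node
    #   NUMA node(s):       2  -> one NUMA domain per socket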
Memory Capacity and Node-Level Constraints
Each compute node has a fixed total memory capacity, and that capacity is shared. When you submit a job, you must ensure that the sum of memory used by all tasks or threads on the node does not exceed the node’s available memory.
Schedulers often allow you to request memory per node, per CPU, or per task. Internally, the scheduler uses this information to decide where to place your job and to prevent multiple jobs from exhausting the node’s memory simultaneously.
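With a Slurm-style scheduler, assumed here, the two sketches below show the per-node and per-CPU ways of requesting the same total memory for a 4-CPU, single-node job; they are alternatives, not to be combined, and defaults and limits are site specific.

    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G             # 16 GB for the whole single-node allocation
    # ... or, as an alternative (do not combine the two) ...
    #SBATCH --mem-per-cpu=4G      # 4 GB per allocated CPU, 16 GB in total here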
If you underestimate your job’s memory needs, the node may start swapping, or the system’s out-of-memory killer may terminate your processes. In an HPC environment, swap usage is usually undesirable, since it severely degrades performance.
Memory bandwidth is another important node level property. It determines how fast data can be moved between memory and CPU cores. An application that is limited by memory bandwidth can show little speedup even if more cores are available, because the memory subsystem cannot feed data to the computation fast enough.
For large memory workloads, some clusters provide so called “fat nodes” with much higher RAM capacities. These nodes are useful for applications that require very large shared memory spaces but often represent a limited resource in the cluster.
Node Types and Specialization
Not all compute nodes in a cluster are identical. HPC centers often provide different node types tailored to various workloads. Understanding these classes of nodes helps you choose the right resources for your job.
Standard compute nodes form the bulk of many clusters. They have a balanced combination of CPUs and memory suitable for general purpose HPC applications. These are often the default target when you submit a job without special constraints.
High memory nodes, or fat nodes, provide significantly more RAM than standard nodes. They can handle large in memory datasets, such as big matrix factorizations, graph analytics, or certain machine learning workloads that do not distribute easily across many processes.
GPU or accelerator nodes attach one or more accelerators to the CPUs. Typical setups might offer multiple GPUs per node. These nodes are well suited to workloads that can exploit massive fine grained parallelism and high memory bandwidth on accelerators.
Some systems provide nodes optimized for I/O, often with local high-performance storage or larger caches. These nodes can be used for I/O-intensive parts of workflows, such as pre- and post-processing or data staging.
Clusters may also group nodes into partitions or queues according to their type. When you submit a job, you often specify a partition that selects a certain node class. The scheduler then restricts node allocation to that group.
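How a node class is selected depends on the site, but with a Slurm-style scheduler it usually amounts to naming a partition and, for accelerators, adding a GPU request; the partition name below is hypothetical, and high memory nodes are often reached the same way through a dedicated partition.

    #SBATCH --partition=gpu        # partition name is a hypothetical example
    #SBATCH --gres=gpu:2           # two GPUs on the allocated node
    #SBATCH --cpus-per-task=8      # CPU cores to drive the GPUs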
Local Storage on Compute Nodes
While large scale data in HPC usually resides on shared parallel filesystems, compute nodes often have some form of local storage. This storage can be used for temporary files, scratch data, or caching.
Local disks or SSDs can provide faster access than network storage for certain workloads, because data does not traverse the interconnect. However, local storage is not typically backed up and might be cleared automatically at the end of a job or when the node reboots.
Some job schedulers provide environment variables or options that point to recommended scratch directories on each compute node. Using these paths allows your job to make use of the node’s local performance without affecting global shared filesystems more than necessary.
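A common pattern, sketched here under the assumption that the scheduler exports a per-job scratch path in TMPDIR (some sites use a different variable, such as SLURM_TMPDIR), is to stage data onto local storage, compute there, and copy results back before the job ends; the file names are placeholders.

    SCRATCH=${TMPDIR:-/tmp}              # fall back to /tmp if no scratch variable is set
    cp "$HOME"/input.dat "$SCRATCH"/     # stage input onto node-local storage
    cd "$SCRATCH"
    srun ./my_program input.dat          # heavy temporary I/O now stays on the node
    cp results.out "$HOME"/project/      # copy results back before the job ends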
It is important not to rely on local node storage for long term persistence. Files stored only on a compute node will not be available once the job finishes or the node is reused.
Network Interfaces and Node Connectivity
Each compute node is attached to the cluster’s network fabric through one or more network interfaces. The characteristics of this connectivity have a direct impact on how fast nodes can exchange data.
On many HPC clusters, the primary network interface on a compute node provides high bandwidth and low latency communication suitable for distributed parallel applications. Some nodes may also have a secondary network interface for management traffic that is separate from user data traffic.
Applications that use distributed memory parallel programming models rely on these network interfaces for message passing. Performance characteristics such as bandwidth, latency, and topology matter for scalability. While this chapter does not cover interconnect technologies in depth, it is important to remember that compute nodes are designed to make the most of these high performance networks rather than regular data center networking alone.
Software Environment on Compute Nodes
Compute nodes typically run the same operating system as login nodes but often with a more minimal interactive environment. Direct login to compute nodes is usually restricted. Instead, you reach them indirectly through scheduled jobs, which start your program in a non interactive shell.
Environment modules, software stacks, and compilers are provided on compute nodes in the same way as on login nodes, so that the software you compile or select on a login node will run in a consistent environment on compute nodes. Some minor differences may exist, for example in available devices or local scratch paths, but system administrators strive to minimize these.
When you design job scripts, you generally load modules and set environment variables on the login node before submission or inside the job script itself. The scheduler then recreates that environment on the allocated compute nodes when your job starts.
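In practice this often looks like the following sketch, where the module names are illustrative placeholders and the exact module system and toolchains depend on the site.

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=8

    module purge                 # start from a clean, reproducible environment
    module load gcc openmpi      # module names are illustrative placeholders

    srun ./my_mpi_program        # runs with the same toolchain it was built with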
Resource Allocation Within a Node
When a job starts on a compute node, the scheduler and runtime system decide how to map your requested resources onto the node’s hardware. This mapping includes which cores your processes or threads use, how memory is distributed, and how many tasks run on each node.
Basic jobs may request one node with a certain number of cores. More sophisticated jobs might request multiple nodes, specify tasks per node, or define how tasks are bound to sockets and cores. This is especially important for hybrid parallel applications that combine distributed and shared memory models.
Binding, sometimes called pinning, refers to fixing a process or thread to a specific core or set of cores. Proper binding improves cache locality and reduces interference between jobs sharing a node. Many MPI and OpenMP runtimes offer options to control this binding at the compute node level.
Always align your job’s internal parallelism, such as number of MPI ranks and threads, with the physical layout of the compute node. Poor alignment can cause idle cores, memory contention, or degraded performance.
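As an example of such alignment, on a hypothetical two-socket node with 64 physical cores, a hybrid job might place one MPI rank per socket with 32 OpenMP threads each; the Slurm directives and OpenMP variables below sketch that mapping and are not a universal recipe.

    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=2         # one MPI rank per socket
    #SBATCH --cpus-per-task=32          # 32 cores for each rank's OpenMP threads

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export OMP_PROC_BIND=close          # keep threads close to their parent rank
    export OMP_PLACES=cores             # one thread per physical core

    srun --cpu-bind=cores ./hybrid_app  # hybrid_app is a placeholder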
Monitoring and Health of Compute Nodes
Cluster administrators monitor compute nodes for hardware errors, failures, and performance anomalies. Nodes that experience issues may be drained, meaning no new jobs are scheduled on them, or removed from service for maintenance.
From a user point of view, a job that fails quickly with unusual errors may have encountered a sick node. Schedulers often provide ways to identify the node on which a job ran, and documentation usually explains how to report suspected node problems.
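With Slurm, assumed here, you can look up where a job was placed before reporting a suspect node; the job ID below is a placeholder.

    sacct -j 1234567 --format=JobID,NodeList,State,ExitCode   # finished jobs
    scontrol show job 1234567 | grep -i nodelist              # running or recent jobs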
Compute nodes are also checked for thermal and power constraints. Heavy workloads can drive CPUs and accelerators to their limits, so nodes are equipped with sensors and controls that may throttle performance if temperatures rise too high.
Best Practices When Using Compute Nodes
Effective use of compute nodes involves both correct resource requests and considerate behavior toward the shared environment.
You should match your job’s process and thread counts to the node’s core layout, avoid oversubscribing cores, and request realistic memory amounts. Heavy production computations should not run on login nodes; submit them through the scheduler so that the work is placed properly on compute nodes.
Temporary or intermediate data should be written to recommended scratch locations, ideally local to the node when appropriate, and cleaned up when no longer needed. Large I/O operations that involve global filesystems should be planned so they do not overload shared resources unnecessarily.
Finally, when experimenting with new codes or debugging, it is good practice to request small allocations on compute nodes rather than relying on login nodes. This respects the division of roles in the cluster and ensures that performance and behavior are representative of real production runs.