Role of Compute Nodes in an HPC Cluster
Compute nodes are the workhorses of an HPC cluster. Unlike login or head nodes, they are not meant for interactive work, compiling large software stacks, or running services. Their primary role is to execute user jobs scheduled by the batch system, typically in a non-interactive, highly controlled way.
Key characteristics:
- Optimized for running batch jobs, not for direct user interaction
- Managed entirely by the scheduler (e.g., SLURM) and cluster management software
- Usually have no (or very limited) outbound internet access
- Often have identical or very similar hardware within a given partition/queue
- Configured to maximize performance and isolation rather than convenience
You will rarely log in to compute nodes directly; instead, you request them via the job scheduler.
Typical Hardware Layout of a Compute Node
Compute nodes are designed around a balance of:
- CPU cores and/or GPUs
- Memory capacity and bandwidth
- Network connectivity to the cluster interconnect
- Local storage (if any)
A common schematic for a CPU-only compute node:
- 2 CPU sockets
- Each CPU with many cores (e.g., 16–64 cores)
- Several memory channels per CPU
- One or more high-speed network interfaces
- Possibly a small local SSD or NVMe drive
Cores, Sockets, and NUMA Domains
Inside a compute node, you will encounter:
- Sockets: Physical CPUs on the motherboard.
- Cores: Independent execution units inside each CPU.
- Hardware threads (e.g., Hyper-Threading): Multiple logical threads per core.
Even without going deep into microarchitecture, you need to be aware of:
- Job schedulers often let you request:
- --nodes (whole nodes)
- --ntasks (MPI ranks/processes)
- --cpus-per-task (threads per rank)
- --ntasks-per-node or --ntasks-per-socket
- Performance can depend on how you map processes/threads to cores and sockets; a minimal request sketch follows below.
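A minimal sketch of how these options appear in a SLURM batch script, assuming a hybrid MPI+OpenMP code (the executable name and resource sizes are placeholders):
#!/bin/bash
#SBATCH --nodes=2                  # two whole nodes
#SBATCH --ntasks-per-node=8        # 8 MPI ranks per node
#SBATCH --cpus-per-task=4          # 4 threads per rank
#SBATCH --time=01:00:00            # walltime limit

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match thread count to the allocation
srun ./my_hybrid_app                           # hypothetical MPI+OpenMP executable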
NUMA (Non-Uniform Memory Access) is common on multi-socket nodes:
- Each socket has its own attached memory.
- Accessing memory “local” to a socket is faster than accessing memory attached to the other socket.
- This affects:
- MPI rank placement
- OpenMP thread affinity
- Choice of --mem vs --mem-per-cpu options in job scripts
Tools you might encounter for examining this layout:
- lscpu (basic CPU, core, and socket counts)
- numactl --hardware (NUMA domains and memory per domain)
- hwloc-ls or lstopo (graphical topology output, if available)
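For example, inside an allocated job (or an interactive session on a compute node), you might run commands like these, assuming the tools are installed:
lscpu | grep -E 'Socket|Core|Thread|NUMA'   # socket, core, thread, and NUMA counts
numactl --hardware                          # memory attached to each NUMA domain
lstopo-no-graphics --no-io                  # text-mode topology output from hwloc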
Accelerated Compute Nodes (GPU and Other Accelerators)
Many clusters have specialized compute nodes with GPUs or other accelerators.
Common features of GPU nodes:
- One or more high-end GPUs (e.g., NVIDIA A100, H100)
- CPUs mainly used for control and data preparation
- High-bandwidth connections between CPU and GPU (e.g., PCIe, NVLink)
- Often a larger memory capacity and power budget than CPU-only nodes
You typically request these nodes via scheduler options like:
- --gres=gpu:1 (1 GPU)
- --gpus-per-node=4 (4 GPUs per node)
- Partition/queue names like gpu, cuda, v100, a100, etc.
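A hedged sketch of a GPU job request in SLURM; the partition name, GPU count, and executable are placeholders that vary by cluster:
#!/bin/bash
#SBATCH --partition=gpu            # GPU partition name differs between clusters
#SBATCH --gres=gpu:1               # one GPU (some sites prefer --gpus-per-node)
#SBATCH --cpus-per-task=8          # CPU cores for control and data preparation
#SBATCH --time=02:00:00

nvidia-smi                         # confirm which GPU(s) the job can see
srun ./my_gpu_app                  # hypothetical GPU-accelerated executable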
Important differences from CPU-only nodes:
- Jobs may be limited to specific GPU partitions.
- Software stacks and drivers differ (CUDA, ROCm, etc.).
- Resource accounting often tracks GPU-hours separately from CPU-hours.
Memory Characteristics on Compute Nodes
Compute nodes are configured with enough memory for large simulations and data processing workloads, but memory is still a finite, shared resource.
Total Memory and Per-Node Capacity
- Nodes may have from tens of GB to multiple TB of RAM.
- All jobs running on the same node share the node’s total memory.
- The scheduler typically enforces memory limits to prevent one job from crashing others.
Typical scheduler options:
- --mem=64G (request 64 GB of memory per node)
- --mem-per-cpu=4G (request 4 GB per allocated core)
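The two styles are alternatives, as in this sketch (values are placeholders; do not combine --mem and --mem-per-cpu in the same job):
# Option A: memory per allocated core
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=4G           # 16 cores x 4 GB = 64 GB in total

# Option B: memory per node
#SBATCH --mem=64G                  # 64 GB on the node, independent of core count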
If you request too little memory:
- Your job may be killed by the system’s out-of-memory (OOM) killer.
- The scheduler may terminate your job once you exceed the allocation.
If you request too much:
- Your job may wait longer in the queue (fewer nodes can satisfy large-memory requests).
Large-Memory / “Fat” Nodes
Some clusters provide special large-memory compute nodes:
- Hundreds of GB to several TB of RAM
- Used for:
- Very large in-memory datasets
- Big graph analytics
- Certain bioinformatics workflows
- Often sit in a special partition (e.g., requested via --partition=bigmem)
These nodes might be scarce and heavily contended; use them only when justified.
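If you do need one, the request is usually just a partition choice plus a memory request, as in this sketch (the partition name and size are assumptions; check your cluster's documentation):
#SBATCH --partition=bigmem         # large-memory partition; name varies by cluster
#SBATCH --mem=1500G                # placeholder; must fit within the node's capacity
You can often check per-node memory in a partition with sinfo, e.g. sinfo -p bigmem -o "%n %m" (memory reported in MB).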
Local Storage on Compute Nodes
Compute nodes might have:
- No significant local storage (diskless nodes)
- One or more local SSDs/NVMe drives
- Small local scratch directories (e.g., /tmp, /scratch/local)
Typical uses for local storage:
- Temporary files during a simulation
- Intermediate results that don’t need to persist after the job
- I/O-heavy workloads that benefit from fast local disks
Things to watch:
- Local scratch is often not backed up.
- Files may be deleted automatically at the end of the job.
- Do not treat local node storage as long-term storage; copy important results back to shared filesystems.
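A common staging pattern is sketched below, assuming the scheduler provides a per-job temporary directory (the paths, variables, and executable name are placeholders; conventions vary by site):
#!/bin/bash
#SBATCH --time=04:00:00

LOCALDIR=${TMPDIR:-/tmp/$SLURM_JOB_ID}          # per-job local scratch; convention varies
mkdir -p "$LOCALDIR"

cp "$SLURM_SUBMIT_DIR/input.dat" "$LOCALDIR/"   # stage input onto fast local disk
cd "$LOCALDIR"
srun ./my_simulation input.dat                  # hypothetical I/O-heavy application

cp results.out "$SLURM_SUBMIT_DIR/"             # copy results back to shared storage
rm -rf "$LOCALDIR"                              # clean up before the job ends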
Network Connectivity of Compute Nodes
Compute nodes are connected to:
- The cluster interconnect (for MPI and data exchange between nodes)
- The parallel filesystem (shared storage accessible to all nodes)
Common interconnect types include Ethernet and InfiniBand. For you as a user on compute nodes:
- All I/O to shared filesystems (e.g., /home, /project, /scratch) travels over the network.
- Inter-node communication in distributed-memory jobs (MPI) also travels over this network.
- Heavy I/O or communication can saturate these links and affect performance.
Some clusters have:
- Separate networks for storage and MPI traffic
- “Fat-tree” or other topologies that influence job placement
Software Environment on Compute Nodes
Compute nodes typically share the same software environment as login nodes, but with some important differences:
- No graphical environment (no desktop, minimal X11)
- Restricted network access (e.g., no direct internet downloads)
- Focus on pre-installed compilers, libraries, and runtime environments
Common patterns:
- You load the same environment modules on the login node and compute nodes via your job script.
- Pre-compiled MPI, math libraries, and domain-specific tools are installed centrally.
- Some nodes may have node-type-specific modules (e.g., GPU toolchains on GPU nodes).
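For example, a job script typically reproduces the module environment used at compile time; a minimal sketch with placeholder module names:
#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --time=01:00:00

module purge                       # start from a clean environment
module load gcc openmpi            # placeholder names; load what you compiled with
srun ./my_mpi_app                  # hypothetical MPI executable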
Node Allocation and Usage Models
You rarely think about a single compute node in isolation; instead, you think about:
- Whole node jobs:
- You request one or more full nodes.
- You are the only user on each allocated node.
- Common for tightly coupled MPI jobs or when you want predictable performance.
- Shared node jobs:
- Multiple jobs share the same node’s cores and memory.
- You request a subset of cores and memory.
- Common for smaller workloads or embarrassingly parallel tasks.
How you request resources affects:
- Performance (less interference with whole-node jobs)
- Queue wait time (smaller jobs might start sooner)
- Efficiency (sharing nodes can improve overall cluster utilization)
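In SLURM terms, the two models might look like this (a sketch; option availability and defaults vary by site):
# Whole-node job: no other jobs share the allocated nodes
#SBATCH --nodes=2
#SBATCH --exclusive

# Shared-node job: request only the cores and memory you need
#SBATCH --ntasks=4
#SBATCH --mem=16G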
CPU, Memory, and GPU Binding
When you run code on compute nodes, especially hybrid or multi-threaded codes, placement matters:
- CPU affinity: Mapping threads/processes to specific cores/sockets.
- GPU binding: Ensuring each process uses the intended GPU on a multi-GPU node.
- NUMA placement: Keeping memory allocations close to the cores that use them.
The job scheduler and MPI libraries often provide options to control this, such as:
- --cpus-per-task, --ntasks-per-node, --hint=nomultithread (SLURM examples)
- MPI flags like --bind-to core, --map-by socket
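Two hedged examples for a hybrid MPI+OpenMP code (the executable is a placeholder; the mpirun flags shown follow Open MPI's syntax and differ for other MPI libraries):
# SLURM: two ranks per node, 16 threads each, ignore hardware threads
srun --ntasks-per-node=2 --cpus-per-task=16 --hint=nomultithread ./my_hybrid_app

# Open MPI: bind each rank to a core, distribute ranks round-robin across sockets
mpirun --bind-to core --map-by socket ./my_hybrid_app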
Efficient use of compute nodes requires aligning your job’s parallel structure with the node’s hardware layout.
Accessing and Inspecting Compute Nodes
While you normally do not log in directly to compute nodes, you might:
- Use scheduler tools to run commands on allocated nodes, e.g.:
srun hostname
srun lscpu
srun free -h
- Use these to:
- Confirm how many cores and NUMA domains are available
- Check memory size and current usage
- Validate GPU visibility inside your job
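For instance, inside an allocated GPU job you might run (assuming NVIDIA hardware):
srun nvidia-smi                    # list the GPUs visible to the job
echo "$CUDA_VISIBLE_DEVICES"       # GPU indices assigned by the scheduler, if set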
Typical constraints:
- You can only run commands on nodes that are already allocated to your job.
- Direct SSH to compute nodes may be disabled or limited to debugging partitions.
Best Practices When Using Compute Nodes
To use compute nodes effectively and fairly:
- Run heavy jobs only on compute nodes, never on login/head nodes.
- Request realistic resources:
- Enough cores, memory, and GPUs to run efficiently
- But not far in excess of what your job actually needs
- Use whole nodes when appropriate for tightly coupled parallel jobs.
- Respect local storage policies:
- Clean up large temporary files
- Don’t rely on local scratch for permanent storage
- Minimize unnecessary I/O to shared filesystems.
- Avoid launching many tiny jobs that underutilize nodes; prefer bundling small tasks into a single job if possible.
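One way to bundle, sketched below with a placeholder task script, is to launch many short job steps inside a single allocation instead of submitting each one as its own job:
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --time=02:00:00

# Run 64 short tasks inside one job, up to 8 at a time.
for i in $(seq 1 64); do
    # Some SLURM versions need --exact or --exclusive here so each step gets its own core.
    srun --ntasks=1 --cpus-per-task=1 ./small_task "$i" &   # hypothetical per-task executable
done
wait                               # wait for all background steps to finish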
Understanding what compute nodes provide—and how they differ from login and management nodes—helps you craft job scripts and workflows that run efficiently at scale.