Role of Compute Nodes in an HPC Cluster
Compute nodes are the workhorses of an HPC cluster. Unlike login or head nodes, they are not meant for interactive work, compiling large software stacks, or running services. Their primary role is to execute user jobs scheduled by the batch system, typically in a non-interactive, highly controlled way.
Key characteristics:
- Optimized for running batch jobs, not for direct user interaction
- Managed entirely by the scheduler (e.g., SLURM) and cluster management software
- Usually have no (or very limited) outbound internet access
- Often have identical or very similar hardware within a given partition/queue
- Configured to maximize performance and isolation rather than convenience
You will rarely log in to compute nodes directly; instead, you request them via the job scheduler.
Typical Hardware Layout of a Compute Node
Compute nodes are designed around a balance of:
- CPU cores and/or GPUs
- Memory capacity and bandwidth
- Network connectivity to the cluster interconnect
- Local storage (if any)
A common schematic for a CPU-only compute node:
- 2 CPU sockets
- Each CPU with many cores (e.g., 16–64 cores)
- Several memory channels per CPU
- One or more high-speed network interfaces
- Possibly a small local SSD or NVMe drive
Cores, Sockets, and NUMA Domains
Inside a compute node, you will encounter:
- Sockets: Physical CPUs on the motherboard.
- Cores: Independent execution units inside each CPU.
- Hardware threads (e.g., Hyper-Threading): Multiple logical threads per core.
Even without going deep into microarchitecture, you need to be aware of:
- Job schedulers often let you request:
- --nodes (whole nodes)
- --ntasks (MPI ranks/processes)
- --cpus-per-task (threads per rank)
- --ntasks-per-node or --ntasks-per-socket
- Performance can depend on how you map processes/threads to cores and sockets; a minimal request sketch follows below.
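A minimal sketch of how these options appear in a SLURM batch script, assuming a hybrid MPI+OpenMP code (the executable name and resource sizes are placeholders):
#!/bin/bash
#SBATCH --nodes=2                  # two whole nodes
#SBATCH --ntasks-per-node=8        # 8 MPI ranks per node
#SBATCH --cpus-per-task=4          # 4 threads per rank
#SBATCH --time=01:00:00            # walltime limit

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match thread count to the allocation
srun ./my_hybrid_app                           # hypothetical MPI+OpenMP executable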
NUMA (Non-Uniform Memory Access) is common on multi-socket nodes:
- Each socket has its own attached memory.
- Accessing memory “local” to a socket is faster than accessing memory attached to the other socket.
- This affects:
- MPI rank placement
- OpenMP thread affinity
- Choice of --mem vs --mem-per-cpu options in job scripts
Tools you might encounter for examining this layout:
- lscpu (basic CPU, core, and socket counts)
- numactl --hardware (NUMA domains and memory per domain)
- hwloc-ls or lstopo (graphical topology output, if available)
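For example, inside an allocated job (or an interactive session on a compute node), you might run commands like these, assuming the tools are installed:
lscpu | grep -E 'Socket|Core|Thread|NUMA'   # socket, core, thread, and NUMA counts
numactl --hardware                          # memory attached to each NUMA domain
lstopo-no-graphics --no-io                  # text-mode topology output from hwloc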
Accelerated Compute Nodes (GPU and Other Accelerators)
Many clusters have specialized compute nodes with GPUs or other accelerators.
Common features of GPU nodes:
- One or more high-end GPUs (e.g., NVIDIA A100, H100)
- CPUs mainly used for control and data preparation
- High-bandwidth connections between CPU and GPU (e.g., PCIe, NVLink)
- Often a larger memory capacity and power budget than CPU-only nodes
You typically request these nodes via scheduler options like:
- --gres=gpu:1 (1 GPU)
- --gpus-per-node=4 (4 GPUs per node)
- Partition/queue names like gpu, cuda, v100, a100, etc.
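A hedged sketch of a GPU job request in SLURM; the partition name, GPU count, and executable are placeholders that vary by cluster:
#!/bin/bash
#SBATCH --partition=gpu            # GPU partition name differs between clusters
#SBATCH --gres=gpu:1               # one GPU (some sites prefer --gpus-per-node)
#SBATCH --cpus-per-task=8          # CPU cores for control and data preparation
#SBATCH --time=02:00:00

nvidia-smi                         # confirm which GPU(s) the job can see
srun ./my_gpu_app                  # hypothetical GPU-accelerated executable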
Important differences from CPU-only nodes:
- Jobs may be limited to specific GPU partitions.
- Software stacks and drivers differ (CUDA, ROCm, etc.).
- Resource accounting often tracks GPU-hours separately from CPU-hours.
Memory Characteristics on Compute Nodes
Compute nodes are configured with enough memory for large simulations and data processing workloads, but memory is still a finite, shared resource.
Total Memory and Per-Node Capacity
- Nodes may have from tens of GB to multiple TB of RAM.
- All jobs running on the same node share the node’s total memory.
- The scheduler typically enforces memory limits to prevent one job from crashing others.
Typical scheduler options:
- --mem=64G (request 64 GB of memory per node)
- --mem-per-cpu=4G (request 4 GB per allocated core)
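The two styles are alternatives, as in this sketch (values are placeholders; do not combine --mem and --mem-per-cpu in the same job):
# Option A: memory per allocated core
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=4G           # 16 cores x 4 GB = 64 GB in total

# Option B: memory per node
#SBATCH --mem=64G                  # 64 GB on the node, independent of core count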
If you request too little memory:
- Your job may be killed by the system’s out-of-memory (OOM) killer.
- The scheduler may terminate your job once you exceed the allocation.
If you request too much:
- Your job may wait longer in the queue (fewer nodes can satisfy large-memory requests).
Large-Memory / “Fat” Nodes
Some clusters provide special large-memory compute nodes:
- Hundreds of GB to several TB of RAM
- Used for:
- Very large in-memory datasets
- Big graph analytics
- Certain bioinformatics workflows
- Often sit in a special partition (e.g., requested via --partition=bigmem)
These nodes might be scarce and heavily contended; use them only when justified.
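If you do need one, the request is usually just a partition choice plus a memory request, as in this sketch (the partition name and size are assumptions; check your cluster's documentation):
#SBATCH --partition=bigmem         # large-memory partition; name varies by cluster
#SBATCH --mem=1500G                # placeholder; must fit within the node's capacity
You can often check per-node memory in a partition with sinfo, e.g. sinfo -p bigmem -o "%n %m" (memory reported in MB).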
Local Storage on Compute Nodes
Compute nodes might have:
- No significant local storage (diskless nodes)
- One or more local SSDs/NVMe drives
- Small local scratch directories (e.g., /tmp, /scratch/local)
Typical uses for local storage:
- Temporary files during a simulation
- Intermediate results that don’t need to persist after the job
- I/O-heavy workloads that benefit from fast local disks
Things to watch:
- Local scratch is often not backed up.
- Files may be deleted automatically at the end of the job.
- Do not treat local node storage as long-term storage; copy important results back to shared filesystems.
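A common staging pattern is sketched below, assuming the scheduler provides a per-job temporary directory (the paths, variables, and executable name are placeholders; conventions vary by site):
#!/bin/bash
#SBATCH --time=04:00:00

LOCALDIR=${TMPDIR:-/tmp/$SLURM_JOB_ID}          # per-job local scratch; convention varies
mkdir -p "$LOCALDIR"

cp "$SLURM_SUBMIT_DIR/input.dat" "$LOCALDIR/"   # stage input onto fast local disk
cd "$LOCALDIR"
srun ./my_simulation input.dat                  # hypothetical I/O-heavy application

cp results.out "$SLURM_SUBMIT_DIR/"             # copy results back to shared storage
rm -rf "$LOCALDIR"                              # clean up before the job ends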
Network Connectivity of Compute Nodes
Compute nodes are connected to:
- The cluster interconnect (for MPI and data exchange between nodes)
- The parallel filesystem (shared storage accessible to all nodes)
Common interconnect types include Ethernet and InfiniBand. For you as a user on compute nodes:
- All I/O to shared filesystems (e.g., /home, /project, /scratch) travels over the network.
- Inter-node communication in distributed-memory jobs (MPI) also travels over this network.
- Heavy I/O or communication can saturate these links and affect performance.
Some clusters have:
- Separate networks for storage and MPI traffic
- “Fat-tree” or other topologies that influence job placement
Software Environment on Compute Nodes
Compute nodes typically share the same software environment as login nodes, but with some important differences:
- No graphical environment (no desktop, minimal X11)
- Restricted network access (e.g., no direct internet downloads)
- Focus on pre-installed compilers, libraries, and runtime environments
Common patterns:
- You load the same environment modules on the login node and compute nodes via your job script.
- Pre-compiled MPI, math libraries, and domain-specific tools are installed centrally.
- Some nodes may have node-type-specific modules (e.g., GPU toolchains on GPU nodes).
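For example, a job script typically reproduces the module environment used at compile time; a minimal sketch with placeholder module names:
#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --time=01:00:00

module purge                       # start from a clean environment
module load gcc openmpi            # placeholder names; load what you compiled with
srun ./my_mpi_app                  # hypothetical MPI executable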
Node Allocation and Usage Models
You rarely think about a single compute node in isolation; instead, you think about:
- Whole node jobs:
- You request one or more full nodes.
- You are the only user on each allocated node.
- Common for tightly coupled MPI jobs or when you want predictable performance.
- Shared node jobs:
- Multiple jobs share the same node’s cores and memory.
- You request a subset of cores and memory.
- Common for smaller workloads or embarrassingly parallel tasks.
How you request resources affects:
- Performance (less interference with whole-node jobs)
- Queue wait time (smaller jobs might start sooner)
- Efficiency (sharing nodes can improve overall cluster utilization)
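In SLURM terms, the two models might look like this (a sketch; option availability and defaults vary by site):
# Whole-node job: no other jobs share the allocated nodes
#SBATCH --nodes=2
#SBATCH --exclusive

# Shared-node job: request only the cores and memory you need
#SBATCH --ntasks=4
#SBATCH --mem=16G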
CPU, Memory, and GPU Binding
When you run code on compute nodes, especially hybrid or multi-threaded codes, placement matters:
- CPU affinity: Mapping threads/processes to specific cores/sockets.
- GPU binding: Ensuring each process uses the intended GPU on a multi-GPU node.
- NUMA placement: Keeping memory allocations close to the cores that use them.
The job scheduler and MPI libraries often provide options to control this, such as:
- --cpus-per-task, --ntasks-per-node, --hint=nomultithread (SLURM examples)
- MPI flags like --bind-to core, --map-by socket
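Two hedged examples for a hybrid MPI+OpenMP code (the executable is a placeholder; the mpirun flags shown follow Open MPI's syntax and differ for other MPI libraries):
# SLURM: two ranks per node, 16 threads each, ignore hardware threads
srun --ntasks-per-node=2 --cpus-per-task=16 --hint=nomultithread ./my_hybrid_app

# Open MPI: bind each rank to a core, distribute ranks round-robin across sockets
mpirun --bind-to core --map-by socket ./my_hybrid_app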
Efficient use of compute nodes requires aligning your job’s parallel structure with the node’s hardware layout.
Accessing and Inspecting Compute Nodes
While you normally do not log in directly to compute nodes, you might:
- Use scheduler tools to run commands on allocated nodes, e.g.:
srun hostname
srun lscpu
srun free -h
- Use these to:
- Confirm how many cores and NUMA domains are available
- Check memory size and current usage
- Validate GPU visibility inside your job
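For instance, inside an allocated GPU job you might run (assuming NVIDIA hardware):
srun nvidia-smi                    # list the GPUs visible to the job
echo "$CUDA_VISIBLE_DEVICES"       # GPU indices assigned by the scheduler, if set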
Typical constraints:
- You can only run commands on nodes that are already allocated to your job.
- Direct SSH to compute nodes may be disabled or limited to debugging partitions.
Best Practices When Using Compute Nodes
To use compute nodes effectively and fairly:
- Run heavy jobs only on compute nodes, never on login/head nodes.
- Request realistic resources:
- Enough cores, memory, and GPUs to run efficiently
- But not far in excess of what your job actually needs
- Use whole nodes when appropriate for tightly coupled parallel jobs.
- Respect local storage policies:
- Clean up large temporary files
- Don’t rely on local scratch for permanent storage
- Minimize unnecessary I/O to shared filesystems.
- Avoid launching many tiny jobs that underutilize nodes; prefer bundling small tasks into a single job if possible.
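One way to bundle, sketched below with a placeholder task script, is to launch many short job steps inside a single allocation instead of submitting each one as its own job:
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --time=02:00:00

# Run 64 short tasks inside one job, up to 8 at a time.
for i in $(seq 1 64); do
    # Some SLURM versions need --exact or --exclusive here so each step gets its own core.
    srun --ntasks=1 --cpus-per-task=1 ./small_task "$i" &   # hypothetical per-task executable
done
wait                               # wait for all background steps to finish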
Understanding what compute nodes provide—and how they differ from login and management nodes—helps you craft job scripts and workflows that run efficiently at scale.