Big Picture: What Is an HPC System?
High-Performance Computing (HPC) systems are built to run many demanding computations simultaneously, often on very large datasets. Instead of one powerful machine, HPC typically uses many machines working together.
An HPC cluster is a collection of interconnected computers (nodes) that:
- Are managed as a single system.
- Share a common software environment.
- Provide access to large compute power, memory, and storage.
- Are accessed remotely (usually via a login node and a scheduler).
This chapter focuses on the physical and logical structure of such clusters, how the parts fit together, and what that means for you as a user.
You do not need to design hardware to use HPC, but understanding the basic components will help you:
- Request appropriate resources.
- Interpret performance behavior.
- Communicate effectively with system administrators.
Typical Components of an HPC Cluster
While each site is unique, most clusters share a broadly similar structure:
- Login nodes: Where users connect, edit code, compile, submit jobs.
- Compute nodes: Where batch jobs actually run.
- Head / management nodes: Internal nodes that orchestrate the cluster.
- Interconnect network: High-speed network linking nodes.
- Storage: Parallel filesystems and sometimes simpler network filesystems.
- System software stack: OS, scheduler, resource manager, monitoring, etc.
As a user, you mainly interact with:
- Login nodes (directly),
- Compute nodes (indirectly, via the job scheduler),
- Shared storage (filesystems),
- Interconnect (indirectly, via communication libraries like MPI).
Node Roles and Responsibilities
Login Nodes: Your Entry Point
Login nodes (also called front-end or access nodes) are the public face of the cluster.
Key characteristics:
- Accessible from outside the cluster (e.g., via SSH).
- Shared by many users at once.
- Typically have:
  - A moderate number of cores.
  - Enough memory for editing, compiling, light pre/post-processing.
- Connected to the shared filesystems.
Typical tasks done on login nodes:
- Logging in and managing files: `ssh`, `scp`, text editors, version control (see the short example after this list).
- Compiling codes and building software.
- Submitting and monitoring batch jobs via the scheduler.
- Light analysis or visualization that doesn’t consume huge resources.
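For example, a short, lightweight login-node session might look like the following. The hostname, username, and file names are placeholders; the exact tools available depend on your site.

```bash
# From your workstation: connect to the cluster's login node (hostname is a placeholder).
ssh alice@login.cluster.example.org

# From your workstation: copy an input file to the cluster.
scp input.dat alice@login.cluster.example.org:~/project/

# On the login node: lightweight tasks only -- editing, version control, building.
cd ~/project
git status
nano run_settings.conf
make                     # or your site's recommended build procedure
```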
Typical restrictions on login nodes:
- No long-running or heavy CPU/GPU jobs.
- Strict limits on number of processes, memory usage, and run time.
- Sometimes tools like `top` or `htop` may be limited or discouraged.
Why this matters:
- Overloading login nodes impacts all users.
- Misuse can lead to your processes being killed automatically or to administrative action.
Good practices:
- Do all production runs through the scheduler on compute nodes.
- Keep interactive work on login nodes short and lightweight.
- Use interactive jobs (through the scheduler) for heavier but still interactive work, as sketched below.
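For heavier interactive work, ask the scheduler for a session on a compute node instead of working on the login node. A minimal sketch, assuming Slurm; the resource amounts are arbitrary examples, and some sites provide their own wrapper commands:

```bash
# Request 1 compute node, 4 cores, and 8 GB of memory for up to 1 hour,
# and open an interactive shell there (Slurm syntax; adjust for your site).
srun --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash

# ...work interactively on the compute node...

# When finished, 'exit' returns you to the login node and releases the resources.
exit
```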
Compute Nodes: Where Jobs Actually Run
Compute nodes are the workhorses of the cluster.
Typical characteristics:
- Many cores per node (e.g., 32–128+ CPU cores).
- A certain amount of RAM shared by all cores on that node.
- Sometimes one or more GPUs or other accelerators.
- Usually no direct external login (you cannot SSH to them from outside the cluster).
- Accessed only via the scheduler (batch or interactive jobs).
Important implications for users:
- Your job runs on compute nodes, not on the login node where you type `sbatch` or `qsub`.
- Resource requests (cores, memory, GPUs) in your job script correspond to resources on compute nodes.
- Hardware configuration (e.g., cores per socket, NUMA layout) can significantly influence performance.
Common types of compute nodes:
- CPU-only nodes: General-purpose computing; often the majority of nodes.
- GPU/accelerator nodes: Specialized for GPU/AI/accelerated workloads.
- High-memory nodes: Extra-large RAM for memory-intensive applications.
- Special-purpose nodes: For example, with very fast local SSDs, special network hardware, or unique architectures.
As a user, you must:
- Know what kinds of nodes your site offers (usually described in documentation).
- Choose node types that match your workload.
- Request appropriate quantities and durations through the scheduler (see the sketch below).
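For illustration, a batch script header that targets a particular node type might look like the following. This is a sketch assuming Slurm; the partition name and resource amounts are hypothetical and will differ at your site.

```bash
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=gpu        # hypothetical partition for GPU nodes; check your site's names
#SBATCH --nodes=1              # one compute node
#SBATCH --ntasks=1             # one process
#SBATCH --cpus-per-task=8      # eight cores for that process
#SBATCH --mem=64G              # memory on that node
#SBATCH --gres=gpu:1           # one GPU on that node
#SBATCH --time=04:00:00        # wall-time limit

./my_application input.dat     # placeholder executable and input file
```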
Head and Management Nodes: The Cluster Control Plane
Head or management nodes manage the cluster but are usually invisible to ordinary users.
Typical roles:
- Running the job scheduler and resource manager.
- Keeping track of which nodes are busy or free.
- Deploying OS images and software updates to compute nodes.
- Centralizing logging, monitoring, and health checks.
- Managing user accounts and authentication in collaboration with institutional systems.
You normally do not log into these nodes. However, understanding their role is useful:
- They are the “brains” of the cluster, deciding where and when your jobs run.
- Issues in management services (e.g., scheduler failures) can temporarily prevent job submission or execution.
Interconnects: How Nodes Talk to Each Other
Within an HPC cluster, the interconnect is the internal network connecting nodes. This is different from the institution’s regular office or campus network.
Key goals of HPC interconnects:
- High bandwidth (move lots of data quickly).
- Low latency (small delay per message).
- Scalability (supporting many nodes at once).
These properties are critical for:
- Parallel applications that use MPI for communication.
- Shared storage access from many nodes at once.
- Collective operations (e.g., global reductions, broadcasts).
The detailed technologies (e.g., Ethernet vs InfiniBand) are covered in their own subsections, but the important conceptual points here are:
- Interconnect performance can be a bottleneck for communication-heavy applications.
- Topology (how nodes are wired, e.g., fat-tree, dragonfly) can influence communication patterns and performance.
- The scheduler may try to place jobs so that processes in a parallel job are “near” each other in network terms.
Implications for you:
- Codes that send many small messages or do frequent synchronization are more sensitive to interconnect latency.
- Codes that move large arrays (e.g., halo exchanges) are more sensitive to bandwidth.
- Performance may change when you scale from a few nodes to many nodes because the interconnect becomes more stressed.
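One practical way to gauge these sensitivities is to run a small message-passing benchmark across two nodes. A sketch, assuming Slurm and that something like the OSU micro-benchmarks is installed; the module name and binary locations are assumptions to check against your site:

```bash
#!/bin/bash
#SBATCH --job-name=net-test
#SBATCH --nodes=2              # two different nodes...
#SBATCH --ntasks-per-node=1    # ...with one MPI rank each, so messages cross the interconnect
#SBATCH --time=00:10:00

module load osu-micro-benchmarks   # hypothetical module name; check 'module avail'

srun ./osu_latency   # small-message latency (microseconds)
srun ./osu_bw        # large-message bandwidth
```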
Memory Organization Across the Cluster
At the level of a single node, memory is usually shared among cores (shared-memory). At the level of a cluster, memory is distributed across nodes (distributed-memory).
For the entire cluster:
- Each node has its own RAM.
- Accessing memory on another node requires network communication (e.g., MPI messages).
- The aggregate memory of the cluster can be huge, but no single process can directly see all of it as “local” memory.
This leads to the common HPC programming model:
- Within a node: use shared-memory techniques (e.g., threads).
- Across nodes: use message passing (e.g., MPI) to access data owned by other nodes.
From an infrastructure perspective, this separation is fundamental:
- Hardware: memory chips are physically local to nodes.
- Network: carries all cross-node data communication.
- Software: must be aware of data distribution and communication.
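One practical consequence shows up in resource requests: memory is physically per node, so schedulers account for it per node as well. A minimal sketch, assuming Slurm (where `--mem` requests memory per node); the application name is a placeholder:

```bash
#!/bin/bash
#SBATCH --nodes=4              # four nodes in total
#SBATCH --ntasks-per-node=1    # one process per node
#SBATCH --mem=100G             # memory *per node*, not for the whole job
#SBATCH --time=01:00:00

# The job can use 4 x 100 GB in aggregate, but only as four separate
# address spaces that exchange data over the network (e.g., via MPI);
# no single process sees 400 GB of local memory.
srun ./my_mpi_application
```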
Storage in HPC Clusters
While detailed filesystems and parallel I/O are discussed elsewhere, at the infrastructure level:
- Storage is typically centralized or shared:
  - One or more large storage systems are attached to the cluster network.
  - All nodes see shared filesystems (e.g., `/home`, `/project`, `/scratch`).
- Some nodes may also have local storage (e.g., SSDs) used for temporary files or caching.
Common logical layers you will encounter:
- Home directories:
  - Smaller, backed up, for source code, configuration files, and small datasets.
  - Not ideal for massive I/O or scratch data.
- Project or group spaces:
  - Larger, shared within a research group or project.
  - Often used for shared datasets, reference data, and results.
- Scratch / work spaces:
  - High-capacity, high-performance, not backed up.
  - Intended for temporary or intermediate data created during runs.
Cluster infrastructure implications:
- Storage systems must handle many concurrent reads/writes from many nodes.
- Parallel filesystems are used to distribute data across multiple storage servers.
- The storage network is often integrated with the interconnect for performance.
From a user’s point of view:
- Use the right filesystem for each purpose (home vs project vs scratch).
- Do not rely on scratch spaces for long-term storage.
- Understand any I/O quotas or performance characteristics explained by your site.
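For example, a common pattern is to stage large inputs into scratch, do the heavy I/O there, and copy back only the results worth keeping. A sketch with hypothetical paths; check your site's actual filesystem names and purge policies:

```bash
# Stage input data to scratch before the run (paths are hypothetical).
mkdir -p /scratch/$USER/run001
cp /project/mygroup/inputs/big_input.dat /scratch/$USER/run001/

# Inside the job, read and write in scratch for fast, high-volume I/O.
cd /scratch/$USER/run001
./my_application big_input.dat > output.dat

# After the run, copy only the results worth keeping to project or home space.
cp output.dat /project/mygroup/results/run001_output.dat

# Clean up scratch; it is not backed up and may be purged automatically.
rm -rf /scratch/$USER/run001
```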
Shared vs Distributed Memory Systems at the Cluster Level
HPC infrastructure often blends two fundamental models:
- Shared-memory systems:
  - Multiple cores share a common address space on a single node (e.g., a multi-core CPU with shared RAM).
  - Suitable for threaded models like OpenMP.
- Distributed-memory systems:
  - Multiple nodes, each with private memory, connected via a network.
  - Suitable for message-passing models like MPI.
In practice:
- A typical HPC cluster is a distributed-memory system built from shared-memory nodes.
- Each node is a shared-memory system; the cluster as a whole is distributed-memory.
Infrastructure decisions:
- Node design (number of sockets, NUMA layout) influences intra-node performance.
- Interconnect design influences inter-node performance.
As a user, you will:
- Request resources in units of nodes, cores, memory, and sometimes GPUs.
- Choose programming models that match this infrastructure.
- Be aware of where your processes and threads run relative to memory (e.g., via affinity settings; see the sketch below).
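For example, a hybrid job might place one MPI rank per node (distributed memory across nodes) and spawn OpenMP threads within each node (shared memory), with affinity settings to keep threads close to their data. A minimal sketch, assuming Slurm and an MPI+OpenMP application; the program name is a placeholder:

```bash
#!/bin/bash
#SBATCH --nodes=2              # distributed memory: two nodes
#SBATCH --ntasks-per-node=1    # one MPI rank per node
#SBATCH --cpus-per-task=16     # shared memory: 16 threads within each node
#SBATCH --time=02:00:00

# Tell OpenMP how many threads to start and request a stable thread-to-core placement.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# srun launches one rank per node; ranks communicate via MPI across the interconnect.
srun --cpu-bind=cores ./my_hybrid_application
```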
How the Pieces Work Together During a Job
Putting it all together, a typical workflow looks like:
- You connect to the login node:
  - Via SSH from your workstation/laptop.
  - You see a Linux environment with shared filesystems mounted.
- You prepare your work:
  - Transfer input data to appropriate storage (e.g., scratch).
  - Edit and compile your code or prepare job scripts.
- You submit a job to the scheduler:
  - Request a number of nodes, cores per node, memory, GPUs, and walltime.
  - Specify the executable and input/output file paths.
- The management nodes schedule your job:
  - Decide where and when to run your job.
  - Reserve resources on one or more compute nodes.
- Your job runs on compute nodes:
  - Processes (and possibly threads) execute on the reserved nodes.
  - They access shared storage over the interconnect.
  - They communicate between nodes via the interconnect (e.g., using MPI).
- Results are written to storage:
  - Outputs are written to shared filesystems.
  - You later access them from the login node.
- Resources are released:
  - After completion or timeout, the scheduler frees resources for other jobs.
  - Temporary data on local node disks may be removed depending on site policy.
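As a concrete illustration of this workflow, here is a minimal hypothetical session, assuming Slurm; the hostname, username, paths, and program names are placeholders:

```bash
# 1. Connect to the login node from your workstation.
ssh alice@login.cluster.example.org

# 2. Prepare the work: stage inputs to scratch and write a small job script.
cp ~/inputs/config.dat /scratch/$USER/
cat > job.sh << 'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --time=06:00:00
srun ./my_simulation /scratch/$USER/config.dat
EOF

# 3. Submit the job; the scheduler prints a job ID.
sbatch job.sh

# 4.-6. The scheduler places the job, it runs on compute nodes and writes results.
#       You can watch its state from the login node.
squeue -u $USER

# 7. When it finishes, inspect the output from the login node.
less slurm-*.out
```

The details (scheduler commands, filesystem names, output file naming) vary by site, but the overall shape of the workflow is the same.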
Understanding this pipeline helps you:
- Anticipate where bottlenecks may occur (CPU, memory, network, storage).
- Know what each part of the cluster is responsible for.
- Use resources in a way that fits the infrastructure design.
Practical Considerations for New Users
When you start using a real HPC cluster:
- Read the site documentation:
  - Node types, core counts, memory per node, GPU availability.
  - Filesystem layout and policies (quotas, purge policies on scratch).
  - Login node usage rules and expected behavior.
- Map your needs to the infrastructure:
  - How much memory do you need per process?
  - Is your application CPU-bound, memory-bound, or I/O-bound?
  - Do you need GPUs or special nodes?
- Structure your workflow with the cluster in mind:
  - Keep heavy computation off the login nodes.
  - Stage large I/O to appropriate filesystems.
  - Use batch jobs for long runs; use interactive jobs for exploratory work.
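A few commands are often enough to get this overview on a specific system. A sketch assuming Slurm and the widely used environment-module system; output formats and quota tools vary by site:

```bash
# List partitions, node counts, cores per node, memory (MB), and generic resources (e.g., GPUs).
sinfo -o "%P %D %c %m %G"

# Show the details of one node, including sockets, cores, and configured memory.
scontrol show node <nodename>

# See which compilers, MPI libraries, and applications are installed.
module avail

# Check your storage usage against quotas (the exact command is site-specific).
quota -s
```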
Over time, a good mental model of the cluster’s infrastructure will improve your:
- Performance tuning decisions.
- Resource requests (leading to more successful and faster-running jobs).
- Ability to troubleshoot issues (e.g., slow I/O, poor scaling, job failures).