
HPC Clusters and Infrastructure

Big Picture: What Is an HPC System?

High-Performance Computing (HPC) systems are built to run many demanding computations simultaneously, often on very large datasets. Instead of one powerful machine, HPC typically uses many machines working together.

An HPC cluster is a collection of interconnected computers (nodes) that work together, typically sharing storage and a job scheduler, so that many users can run large computations on them.

This chapter focuses on the physical and logical structure of such clusters, how the parts fit together, and what that means for you as a user.

You do not need to design hardware to use HPC, but understanding the basic components will help you:

Typical Components of an HPC Cluster

While each site is unique, most clusters share a broadly similar structure:

As a user, you mainly interact with:

Node Roles and Responsibilities

Login Nodes: Your Entry Point

Login nodes (also called front-end or access nodes) are the public face of the cluster.

Key characteristics:

Typical tasks done on login nodes:

Typical restrictions on login nodes:

Why this matters:

Good practices:
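
One simple safeguard is to make long-running programs check that they are actually running inside a scheduled job before doing heavy work. The sketch below assumes a Slurm-based site, where the scheduler exports SLURM_JOB_ID inside jobs; other schedulers set different variables.

```python
# Illustrative guard: refuse to start heavy work unless we are inside a
# scheduled job. Assumes Slurm, which sets SLURM_JOB_ID for batch and
# interactive jobs; adapt the variable name to your site's scheduler.
import os
import sys

if "SLURM_JOB_ID" not in os.environ:
    sys.exit("This looks like a login node (no SLURM_JOB_ID set). "
             "Please run this program through the scheduler instead.")

# ... heavy computation would follow here ...
```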

Compute Nodes: Where Jobs Actually Run

Compute nodes are the workhorses of the cluster.

Typical characteristics:

Important implications for users:

Common types of compute nodes:

As a user, you must:
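
One practical consequence is that a compute node may have more cores than your job was actually granted. A minimal Python sketch, assuming a Linux node where the scheduler (for example Slurm) constrains jobs via CPU affinity, shows how to see the difference:

```python
import os

# Total cores physically present on the node (reported regardless of
# what your job requested).
physical_cores = os.cpu_count()

# Cores the operating system actually allows this process to run on.
# Schedulers such as Slurm typically restrict a job to its allocated
# cores, so this is usually the number you asked for.
allocated_cores = len(os.sched_getaffinity(0))

print(f"Node reports {physical_cores} cores; this job may use {allocated_cores}.")
```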

Head and Management Nodes: The Cluster Control Plane

Head or management nodes manage the cluster but are usually invisible to ordinary users.

Typical roles:

You normally do not log into these nodes. However, understanding their role is useful:

Interconnects: How Nodes Talk to Each Other

Within an HPC cluster, the interconnect is the internal network connecting nodes. This is different from the institution’s regular office or campus network.
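
To make the interconnect's role concrete, the sketch below times a simple message exchange between two MPI processes. It is purely illustrative and assumes the mpi4py package is available (many sites provide it as a module) and that the job is launched with at least two ranks, ideally placed on different nodes:

```python
# Minimal latency probe between two MPI ranks (illustrative only).
# Launch with an MPI starter, e.g. "mpirun -n 2 python pingpong.py".
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

payload = bytearray(1024)          # 1 KiB message
repetitions = 1000

comm.Barrier()                     # start both ranks together
start = time.perf_counter()
for _ in range(repetitions):
    if rank == 0:
        comm.send(payload, dest=1, tag=0)      # rank 0 -> rank 1
        payload = comm.recv(source=1, tag=1)   # and back again
    elif rank == 1:
        payload = comm.recv(source=0, tag=0)
        comm.send(payload, dest=0, tag=1)
elapsed = time.perf_counter() - start

if rank == 0:
    # One round trip is two messages; report the average one-way time.
    print(f"Average one-way latency: {elapsed / (2 * repetitions) * 1e6:.1f} µs")
```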

Key goals of HPC interconnects:

These properties are critical for:

The detailed technologies (e.g., Ethernet vs InfiniBand) are covered in their own subsections, but the important conceptual points here are:

Implications for you:

Memory Organization Across the Cluster

At the level of a single node, memory is usually shared among cores (shared-memory). At the level of a cluster, memory is distributed across nodes (distributed-memory).

For the entire cluster:

This leads to the common HPC programming model:

From an infrastructure perspective, this separation is fundamental:

Storage in HPC Clusters

While detailed filesystems and parallel I/O are discussed elsewhere, at the infrastructure level:

Common logical layers you will encounter:

Cluster infrastructure implications:

From a user’s point of view:
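
As an illustrative pattern, the sketch below stages input data into a fast scratch area before a run and copies results back afterwards. The SCRATCH environment variable and the directory layout are hypothetical; consult your site's documentation for the real paths and purge policies.

```python
import os
import shutil
from pathlib import Path

# Hypothetical layout: a backed-up home/project area and a fast,
# non-backed-up scratch filesystem exposed via a SCRATCH variable.
project_dir = Path.home() / "my_project"
scratch_dir = Path(os.environ.get("SCRATCH", "/tmp")) / "my_project_run"

# Stage input data to scratch before the job does heavy I/O there.
scratch_dir.mkdir(parents=True, exist_ok=True)
shutil.copy2(project_dir / "input.dat", scratch_dir / "input.dat")

# ... the actual computation would read and write inside scratch_dir ...

# Copy results back to a safer location; scratch is often purged periodically.
shutil.copy2(scratch_dir / "output.dat", project_dir / "output.dat")
```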

Shared vs Distributed Memory Systems at the Cluster Level

HPC infrastructure often blends two fundamental models:

  1. Shared-memory systems:
    • Multiple cores share a common address space on a single node (e.g., a multi-core CPU with shared RAM).
    • Suitable for threaded models like OpenMP.
  2. Distributed-memory systems:
    • Multiple nodes, each with private memory, connected via a network.
    • Suitable for message-passing models like MPI (see the sketch after this list).
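
As a minimal illustration of the distributed-memory model, the sketch below uses mpi4py (an assumption; your site may provide a different MPI interface). Each rank holds only its own slice of the work in private memory, and the partial results are combined with an explicit reduction:

```python
# Distributed-memory sketch: every MPI rank owns a private slice of the data
# and nothing is shared implicitly. Launch with an MPI starter, e.g.
# "mpirun -n 4 python partial_sums.py".
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1_000_000                       # total number of terms
chunk = n // size                   # each rank works on its own chunk
start = rank * chunk
stop = n if rank == size - 1 else start + chunk

# This partial sum lives only in this rank's private memory.
partial = sum(range(start, stop))

# Combining results requires an explicit message-passing operation.
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print(f"Sum of 0..{n-1} computed by {size} ranks: {total}")
```

Within a single node, the same computation could instead be split across threads that share one address space (the OpenMP style mentioned above); across nodes, explicit communication like the reduction shown here is unavoidable.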

In practice:

Infrastructure decisions:

As a user, you will:

How the Pieces Work Together During a Job

Putting it all together, a typical workflow looks like:

  1. You connect to the login node:
    • Via SSH from your workstation/laptop.
    • You see a Linux environment with shared filesystems mounted.
  2. You prepare your work:
    • Transfer input data to appropriate storage (e.g., scratch).
    • Edit and compile your code or prepare job scripts.
  3. You submit a job to the scheduler (see the sketch after this list):
    • Request a number of nodes, cores per node, memory, GPUs, and walltime.
    • Specify the executable and input/output file paths.
  4. The management nodes schedule your job:
    • Decide where and when to run your job.
    • Reserve resources on one or more compute nodes.
  5. Your job runs on compute nodes:
    • Processes (and possibly threads) execute on the reserved nodes.
    • They access shared storage over the interconnect.
    • They communicate between nodes via the interconnect (e.g., using MPI).
  6. Results are written to storage:
    • Outputs are written to shared filesystems.
    • You later access them from the login node.
  7. Resources are released:
    • After completion or timeout, the scheduler frees resources for other jobs.
    • Temporary data on local node disks may be removed depending on site policy.
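
To make step 3 concrete, here is a sketch that composes a batch script and hands it to the scheduler. It assumes the site runs Slurm; other schedulers (PBS, LSF, etc.) use different directives and commands but follow the same submit, queue, run pattern. The resource requests and the executable name are hypothetical.

```python
# Illustrative only: compose a Slurm batch script and submit it with sbatch.
import subprocess
from pathlib import Path

# The directives request 2 nodes with 32 MPI tasks each, 64 GB of memory per
# node, and one hour of walltime; %j in the output name expands to the job ID.
job_script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=64G
#SBATCH --time=01:00:00
#SBATCH --output=demo_%j.out

srun ./my_simulation input.dat
"""

script_path = Path("demo.sbatch")
script_path.write_text(job_script)

# Hand the script to the scheduler; it queues the job, reserves compute nodes,
# runs it, and releases the resources afterwards, as described in the steps above.
result = subprocess.run(["sbatch", str(script_path)],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())   # typically: "Submitted batch job <id>"
```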

Understanding this pipeline helps you:

Practical Considerations for New Users

When you start using a real HPC cluster:

Over time, a good mental model of the cluster’s infrastructure will improve your:
