Big picture: why architecture matters in HPC
High-performance computing is fundamentally about how fast we can turn data and algorithms into results. Computer architecture is the “shape” and organization of the hardware that does this work. As an HPC user, you don’t need to be a hardware engineer, but you do need enough architectural understanding to:
- Choose appropriate algorithms and implementations for a given machine.
- Interpret performance results and bottlenecks.
- Make sense of why the same code runs very differently on a laptop vs. a cluster.
- Communicate effectively with system administrators and performance engineers.
This chapter gives a conceptual map of modern processor-based systems as they relate to HPC. Later chapters will go deeper into topics such as memory hierarchies, GPUs, and vectorization.
Basic components of a compute node
An HPC compute node (or any modern server) is typically built from a small number of recurring components:
- One or more CPUs (sockets), each with multiple cores.
- A memory system (DRAM) shared by cores on a socket, plus smaller, faster caches closer to each core.
- One or more interconnects (inside the node and between nodes).
- Optional accelerators (GPUs, FPGAs, etc.).
- A connection to local or networked storage.
Logically, you can think of a node as a hierarchy of:
- Processing units
- Memory and storage levels
- Communication paths
HPC performance is about using all three effectively.
From instructions to execution
At the lowest level, your compiled program is a sequence of instructions that the CPU understands. A modern processor goes through several conceptual stages to turn your code into operations on data:
- Fetch: Get the next instruction from memory (usually from the instruction cache).
- Decode: Interpret the bits to determine the operation and its operands.
- Execute: Perform the computation (e.g., addition, multiplication, comparison).
- Memory access: Load data from or store results to memory.
- Write-back: Save the result into a register.
In reality, processors overlap these stages and execute many instructions concurrently, using multiple techniques:
- Pipelining: Different stages for different instructions operate in parallel.
- Superscalar execution: Multiple instructions can be issued in the same cycle.
- Out-of-order execution: Instructions are executed as soon as their inputs are ready, not strictly in program order.
- Speculative execution: The CPU guesses the direction of branches to keep pipelines full.
As an HPC developer, you don’t control these mechanisms directly, but they help explain:
- Why predictable, regular code patterns (loops over arrays) run very efficiently.
- Why branches and irregular memory access patterns can degrade performance.
- Why compilers can sometimes significantly reorder and optimize your code.
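To make this concrete, here is a small illustrative comparison (not drawn from any particular benchmark): both loops below stream through the same array, but the second adds a data-dependent branch that the branch predictor will frequently miss on random input, while the first has the kind of regular pattern that pipelines and vectorizes well.

```c
/* Illustrative sketch: regular loop vs. loop with a data-dependent branch.
 * Compile e.g. with: cc -O2 regular_vs_branchy.c */
#include <stdio.h>
#include <stdlib.h>

#define N 10000000

int main(void) {
    double *a = malloc(N * sizeof *a);
    for (long i = 0; i < N; ++i)
        a[i] = (double)rand() / RAND_MAX - 0.5;   /* random values in [-0.5, 0.5] */

    /* Regular pattern: contiguous accesses, no branch in the loop body;
     * the hardware can keep its pipelines and vector units busy. */
    double sum_all = 0.0;
    for (long i = 0; i < N; ++i)
        sum_all += a[i];

    /* Data-dependent branch: whether the add happens depends on the data,
     * so speculative execution often guesses wrong on random input. */
    double sum_pos = 0.0;
    for (long i = 0; i < N; ++i)
        if (a[i] > 0.0)
            sum_pos += a[i];

    printf("sum_all = %f, sum_pos = %f\n", sum_all, sum_pos);
    free(a);
    return 0;
}
```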
Latency, bandwidth, and throughput
Three fundamental performance notions recur across architecture levels:
- Latency: How long it takes for a single operation/data transfer to complete (time per operation).
- Bandwidth: How much data can move per unit time (e.g. GB/s).
- Throughput: How many operations can be completed per unit time.
These appear at many layers:
- Core: instruction latency vs. instruction throughput.
- Caches and memory: access latency vs. sustained bandwidth.
- Interconnects: message latency vs. network bandwidth.
HPC codes are often limited by:
- Memory latency: Waiting for data to arrive.
- Memory or network bandwidth: Not enough “pipe width” to keep cores busy.
- Compute throughput: Not enough floating-point capacity for the desired workload.
Architectural features like caches, wide memory buses, vector units, and fast networks are all attempts to reduce effective latency and/or increase bandwidth and throughput.
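A rough back-of-the-envelope sketch of this trade-off (the hardware numbers below are hypothetical, not taken from any real machine) compares a kernel’s arithmetic intensity, in flops per byte moved, against the machine’s balance point, peak flops divided by peak bandwidth:

```c
/* Back-of-the-envelope roofline-style estimate.
 * All hardware numbers below are hypothetical placeholders. */
#include <stdio.h>

int main(void) {
    /* Assumed machine characteristics (illustrative only). */
    double peak_gflops = 1000.0;  /* 1 TFLOP/s peak compute */
    double peak_gbs    = 100.0;   /* 100 GB/s memory bandwidth */
    double balance     = peak_gflops / peak_gbs;  /* flops per byte needed to be compute-bound */

    /* Example kernel: a daxpy-like update y[i] += a*x[i].
     * Per element: 2 flops, 3 doubles moved (load x, load y, store y) = 24 bytes. */
    double flops_per_elem = 2.0;
    double bytes_per_elem = 24.0;
    double intensity = flops_per_elem / bytes_per_elem;

    printf("machine balance  : %.2f flop/byte\n", balance);
    printf("kernel intensity : %.3f flop/byte\n", intensity);
    printf("kernel is likely %s-bound on this (hypothetical) machine\n",
           intensity < balance ? "bandwidth" : "compute");
    return 0;
}
```

If the intensity falls below the balance point, adding compute capacity will not help; the kernel is limited by how fast data can be delivered.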
Parallelism in modern architectures
Modern processors are deeply parallel at multiple levels. HPC performance comes from exploiting these levels together:
- Instruction-level parallelism (ILP): The CPU executes multiple independent instructions overlapping in time. This is largely automatic and handled by the hardware and compiler.
- Data-level parallelism (DLP): The same operation applied to many data elements (e.g. vector instructions, GPUs). This is the focus of SIMD/vectorization and GPU programming.
- Thread-level parallelism (TLP): Multiple threads running on different cores of the same CPU, sharing memory. Used by e.g. OpenMP.
- Process-level parallelism: Many processes, possibly on many nodes, communicating via messages. Used by MPI.
The architecture of an HPC system directly determines what types of parallelism are available and how efficient they are.
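As a minimal sketch of combining two of these levels within a node, assuming an OpenMP-capable compiler (e.g. `cc -O2 -fopenmp`), the loop below spreads iterations across threads (TLP) and asks the compiler to vectorize each thread’s chunk (DLP):

```c
/* Minimal sketch: thread-level (OpenMP) plus data-level (SIMD) parallelism.
 * Compile with an OpenMP-capable compiler, e.g.: cc -O2 -fopenmp axpy_omp.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    double a = 2.0;

    for (long i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* TLP: iterations are divided among threads (cores).
     * DLP: "simd" asks the compiler to use vector instructions in each thread. */
    #pragma omp parallel for simd
    for (long i = 0; i < N; ++i)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f, threads available = %d\n", y[0], omp_get_max_threads());
    free(x); free(y);
    return 0;
}
```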
Shared vs. distributed memory at the node scale
Architecturally, we often categorize systems by how processors access memory:
- Shared-memory systems: All cores can (logically) access a common address space. Within a node, this is typically the case: each core may have local caches, but they see a single coherent memory.
- Distributed-memory systems: Each processor (or node) has its own memory. Data must be explicitly sent over an interconnect.
Modern HPC clusters are usually:
- Shared memory within a node (multiple cores, possibly multiple sockets).
- Distributed memory across nodes (connected via a high-speed network).
This architectural split is what drives the need for different programming approaches (threading vs. message passing vs. hybrid).
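On the distributed-memory side nothing is shared, so data must be moved by explicit messages. A minimal MPI sketch, assuming an MPI installation and the `mpicc` wrapper, in which rank 0 sends a small array to rank 1:

```c
/* Minimal distributed-memory sketch: explicit message passing with MPI.
 * Compile and run e.g.: mpicc send_recv.c && mpirun -n 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 0.0, 0.0, 0.0};

    if (rank == 0) {
        /* Rank 0 owns the data and must send it explicitly. */
        for (int i = 0; i < 4; ++i) buf[i] = i + 1.0;
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 has its own separate memory; the data arrives over the interconnect. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received: %.1f %.1f %.1f %.1f\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```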
The memory hierarchy as an architectural principle
Although details are covered separately, one key architectural idea is that memory is organized hierarchically:
- Small, very fast storage close to the core (registers, caches).
- Larger, slower main memory (DRAM).
- Even larger, much slower storage (SSDs, disks, parallel filesystems).
As you move away from the core, latency increases and bandwidth typically decreases, while capacity grows. Architectures are designed so that:
- Frequently used data stays close to the cores.
- The hardware automatically moves data between levels as needed.
- Your code’s access patterns strongly influence how effective this is.
Much of HPC optimization is about aligning code with this hierarchy.
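A classic example of such alignment is loop order over a two-dimensional array. C stores arrays row-major, so the first traversal below walks memory contiguously and makes full use of each cache line, while the second strides through memory and typically runs far slower for large matrices (exact ratios depend on the machine):

```c
/* Illustration of cache-friendly vs. cache-unfriendly traversal.
 * C stores 2D arrays row-major, so the inner loop should vary the last index. */
#include <stdio.h>
#include <stdlib.h>

#define N 4096

int main(void) {
    double *m = malloc((size_t)N * N * sizeof *m);
    for (long i = 0; i < (long)N * N; ++i) m[i] = 1.0;

    /* Row-wise: consecutive iterations touch consecutive addresses,
     * so each cache line fetched from DRAM is fully used. */
    double sum_rows = 0.0;
    for (long i = 0; i < N; ++i)
        for (long j = 0; j < N; ++j)
            sum_rows += m[i * N + j];

    /* Column-wise: consecutive iterations are N*8 bytes apart,
     * so most of each fetched cache line goes unused. */
    double sum_cols = 0.0;
    for (long j = 0; j < N; ++j)
        for (long i = 0; i < N; ++i)
            sum_cols += m[i * N + j];

    printf("sums: %f %f\n", sum_rows, sum_cols);
    free(m);
    return 0;
}
```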
Interconnects and system-level architecture
Beyond a single CPU, architecture includes how components are wired together:
- On-chip interconnects connect cores, caches, and memory controllers on a processor die.
- On-board interconnects connect CPUs, memory, GPUs, and PCIe devices within a node.
- System interconnects connect multiple nodes (e.g. Ethernet, InfiniBand) into a cluster.
The topology (how things are connected) and properties (latency, bandwidth, contention) of these interconnects affect:
- How costly it is for different parts of your program to communicate.
- Which algorithms scale well to many nodes and which do not.
- Where it is better to keep data (e.g. in-node vs. across nodes).
Architecturally, clusters can be organized as:
- Simple fat-tree or similar topologies.
- More complex torus or dragonfly networks in large systems.
For an HPC user, the main takeaway is that communication cost is highly non-uniform: talking to your own core’s registers is vastly cheaper than talking to a remote node.
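A simple way to quantify this non-uniformity is the latency-bandwidth (alpha-beta) model, where transferring n bytes takes roughly latency + n/bandwidth. The sketch below plugs in purely hypothetical numbers for an in-node and an off-node path to show how the relative cost shifts with message size:

```c
/* Latency-bandwidth ("alpha-beta") cost model: T(n) = alpha + n / beta.
 * All numbers are hypothetical placeholders for illustration. */
#include <stdio.h>

static double transfer_time(double latency_s, double bandwidth_bytes_per_s, double nbytes) {
    return latency_s + nbytes / bandwidth_bytes_per_s;
}

int main(void) {
    /* Assumed (illustrative) characteristics of two communication paths. */
    double lat_shm = 1e-7, bw_shm = 50e9;   /* within a node: ~100 ns, ~50 GB/s */
    double lat_net = 2e-6, bw_net = 12e9;   /* between nodes: ~2 us, ~12 GB/s  */

    double sizes[] = {64.0, 64e3, 64e6};    /* 64 B, 64 KB, 64 MB messages */
    for (int i = 0; i < 3; ++i) {
        double n = sizes[i];
        printf("%10.0f bytes: in-node %.3e s, off-node %.3e s\n",
               n,
               transfer_time(lat_shm, bw_shm, n),
               transfer_time(lat_net, bw_net, n));
    }
    return 0;
}
```

With these made-up numbers, the fixed latency dominates small messages, so the off-node path is more than an order of magnitude slower regardless of bandwidth; for very large messages the bandwidth term takes over and the gap narrows.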
Architectural trends relevant to HPC
Several long-term trends shape the design of modern HPC architectures:
- Clock frequency scaling has largely stalled: Processors are not getting much faster in GHz; instead, we get more cores and wider vector units.
- Core counts per socket keep rising: Parallelism within a node is increasing; using only one core wastes most of the available compute.
- Memory bandwidth is a critical bottleneck: Architectures add features like multi-channel DRAM, HBM (high-bandwidth memory), and NUMA to balance compute with data delivery.
- Heterogeneity is increasing: CPUs, GPUs, and other accelerators coexist, each with different strengths and programming models.
- Energy efficiency constraints: Power and thermal limits push architects toward designs that maximize work per joule, influencing clock speeds, core designs, and accelerator integration.
Understanding these trends helps explain why:
- Legacy single-threaded codes struggle on modern systems.
- Vectorization and data locality optimization are vital.
- Hybrid CPU–GPU and multi-level parallel strategies are becoming the norm.
Architectural abstractions for the programmer
To work effectively with architecture, it is helpful to adopt a few mental models:
- The core as a fast calculator with limited local scratch space: Registers and caches are “scratchpads” that must be fed with data.
- The memory system as a set of increasingly distant warehouses: Getting data from a distant warehouse (remote node, disk) is expensive; you want to reuse data while it is nearby.
- The interconnect as a road network: Some routes are wide and direct (within a core or socket); others are narrow, shared, and prone to congestion (off-node communication).
- The node as the basic unit of deployment: Schedulers allocate nodes and cores; your code must map its work onto this structure.
These abstractions will underpin later discussions on parallel programming models, performance optimization, and the specific hardware components covered in subsequent chapters.