Scope of this Chapter
This chapter introduces the basic ideas of computer architecture that matter for high performance computing. You will see what a CPU actually is, how it executes instructions, how data moves between different layers of memory, and why these details strongly affect performance on HPC systems.
Later chapters will cover specific components in more depth, such as GPUs, SIMD, storage, and detailed memory hierarchy topics. Here the goal is to build a mental model of how a modern computer is organized and why that structure creates both opportunities and limits for speed.
From Programs to Hardware
At the highest level, a computer turns your program into electrical activity. You write code in a high level language. A compiler turns that code into machine instructions. The processor fetches those instructions from memory and executes them using its internal circuits.
Almost every HPC system is based on the same conceptual model, often called the von Neumann architecture. In this model there is a central processing unit (CPU) that executes instructions, a memory that stores both data and instructions, and pathways that connect them.
When you think about performance, you can roughly separate two types of work. There is computation, which happens when the CPU performs operations like addition, multiplication, or comparisons. Then there is data movement, which happens when the CPU fetches instructions and data from memory and writes results back. High performance computing is often limited not by how fast the CPU can compute, but by how fast data can be moved to and from where it is needed.
A central rule of performance is: moving data is often more expensive than computing on it.
Understanding how hardware is organized will help you reason about which parts of your code cause computation costs and which parts cause data movement costs.
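To make the distinction concrete, the following minimal C sketch contrasts the two kinds of cost (the function names and constants are invented for this illustration). Both loops touch every element of an array, but the first performs one multiplication per value moved, while the second performs roughly 128 floating point operations per value; on most systems the first is limited by data movement and the second by computation.

```c
#include <stddef.h>

/* Memory-bound: a copy-and-scale touches every element once and performs
   only one multiplication per element moved. */
void scale(double *y, const double *x, double a, size_t n) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i];
}

/* Compute-bound: many arithmetic operations per element loaded, so the
   core's arithmetic units, not memory, become the limit. */
void iterate_map(double *x, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        double v = x[i];
        for (int k = 0; k < 64; ++k)   /* about 128 flops per element */
            v = v * 0.999 + 0.001;
        x[i] = v;
    }
}
```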
Main Components in a Modern Node
An HPC cluster is built from many nodes. Each node is itself a fairly powerful computer. Although implementations differ, most nodes share a similar set of hardware components.
At the center of a node sits one or more CPUs. Each CPU package contains multiple cores. Each core can execute program instructions independently. Cores share access to the main memory of the node and usually share some cache memory as well.
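As a small practical aside, on Linux-based nodes you can ask the operating system how many logical processors it exposes. The sketch below uses the widely available sysconf extension; the exact call and its availability depend on the system.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of logical processors the OS currently makes available.
       On nodes with simultaneous multithreading this is usually larger
       than the number of physical cores. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Logical processors available: %ld\n", n);
    return 0;
}
```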
The main memory is physically separate from the CPU, connected through a memory bus or fabric. It holds the working data and code for all processes and threads executing on the node. The memory controller manages the flow of data between the CPU cores and the memory chips.
There are also storage devices. These include solid state drives and disks that provide persistent storage for files. Access to storage is much slower than access to main memory, but data remains stored even when power is off. HPC nodes frequently access remote shared storage over a network interconnect. Although storage and interconnects have their own dedicated chapters, you should remember that they are part of the larger architecture and influence performance whenever your program reads or writes files.
Finally, many nodes include accelerators such as GPUs. These are additional processors with their own memory and internal architecture. They are connected to the CPU through a high speed link. Later chapters will cover them in more detail, but for now it is enough to see them as specialized computation engines that extend the basic CPU based architecture.
The CPU as an Execution Engine
The CPU is the hardware unit that executes your program instructions. Each core inside the CPU follows a cycle: it fetches an instruction from memory, decodes it, reads any required data, performs the operation, and writes the result.
This sequence is known as the instruction cycle, and modern cores implement it as a deeply pipelined, highly optimized instruction pipeline. While one instruction is being decoded, another can be fetched, and a third can be executed. Once the pipeline is full, the core can retire multiple instructions per clock cycle.
Internally, the CPU contains arithmetic and logic units that perform integer and floating point operations. There are also registers, which are small but very fast storage locations directly inside the core. The core operates primarily on data stored in registers. Instructions generally move data from memory into registers, perform the operation, then write results back to memory.
From a performance point of view, the crucial idea is that not all instructions cost the same. An addition that operates on values already in registers is very fast. A load that brings data from main memory into registers can be much slower, especially when that data is not already present in cache. Many of the optimizations used in HPC are attempts to keep the pipeline busy with arithmetic and to schedule memory accesses so that data is already in place when the core needs it.
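The following sketch hints at why keeping values in registers matters; the exact code generated depends on the compiler and on hints such as the restrict qualifier, so treat it as an illustration rather than a guarantee.

```c
#include <stddef.h>

/* The accumulator s normally lives in a register for the whole loop, so
   each iteration costs one load of a[i] plus cheap register arithmetic. */
double sum_register(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Accumulating through a pointer that might alias other memory can force
   the compiler to store and reload *out on every iteration, adding memory
   traffic to what is otherwise the same computation. */
void sum_through_memory(const double *a, size_t n, double *out) {
    *out = 0.0;
    for (size_t i = 0; i < n; ++i)
        *out += a[i];
}
```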
Cores and Parallel Execution within a Node
A modern CPU package contains anywhere from a handful of cores to several dozen. Each core is an independent execution engine. When you run a parallel program on a node, multiple cores can work on different parts of the problem at the same time.
Cores share some resources. They share access to the memory controller and the main memory. They also often share some levels of cache. This shared structure is important. It allows cores to exchange data more quickly than if they had only separate memories. It also means cores can interfere with one another when they compete for the same resources.
Some CPU designs implement simultaneous multithreading, which allows a single physical core to run more than one hardware thread. The core then interleaves instructions from the different threads. This can hide certain types of latency, but it does not multiply the raw arithmetic resources. From an HPC perspective, hardware threads tend to be useful when there are many memory delays, but less helpful when the core is already fully busy performing computation.
Within a node, parallel programming models such as threads and shared memory take advantage of these cores. How you organize work across cores has a direct impact on performance, because it determines how efficiently this shared architecture is used.
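A minimal OpenMP sketch of this idea is shown below (the array size and the work done per element are arbitrary): each thread, typically running on its own core, processes a chunk of the loop while all threads share the same array in the node's main memory.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000
static double a[N];

int main(void) {
    double sum = 0.0;

    /* The iteration space is split across the available cores; the
       reduction clause combines the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; ++i) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("max threads: %d, sum = %g\n", omp_get_max_threads(), sum);
    return 0;
}
```

Compile with OpenMP support enabled, for example with the -fopenmp flag on GCC or Clang.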
Clock Speed and Latency
Each CPU core runs at a clock frequency, which is the number of clock cycles per second and is typically quoted in gigahertz. In very simplified terms, a core ideally performs a certain number of instructions per cycle, which gives a rough estimate of its peak computational throughput.
If a core could execute $I$ instructions per cycle at a frequency $f$ in cycles per second, then the ideal peak throughput is
$$\text{Peak instructions per second} = I \times f.$$
In real workloads, the achieved rate is much lower. The core frequently stalls because it is waiting for data from memory, or because later instructions depend on results of earlier ones.
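As an illustrative worked example with hypothetical numbers: a core that can retire $I = 4$ instructions per cycle at $f = 3\ \text{GHz}$ has an ideal peak of

$$4 \times 3 \times 10^{9} = 1.2 \times 10^{10} \ \text{instructions per second},$$

while stalls on memory accesses and dependency chains usually keep the achieved rate well below this bound.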
Latency is the time between requesting data and receiving it. Clock speed affects how many useful operations you could do per second, but latency tells you how many of those cycles you waste waiting. High performance computing cares about both. A fast clock is only helpful if you can keep the pipeline fed with data and instructions.
Modern CPUs often use dynamic frequency scaling. They can increase clock speed temporarily when a few cores are active or lower it when many cores are used or when power and temperature limits are reached. In practice, this means the effective speed of a core depends on how the node is used. Understanding architecture helps you interpret why a job does not always run at the theoretical maximum performance of the hardware.
Instruction Level Parallelism and Pipelining
CPUs extract parallelism within a single stream of instructions, not only between cores. This is called instruction level parallelism. If two instructions are independent, the core may execute them at the same time in different parts of the pipeline.
Modern cores are superscalar. They can start multiple instructions in the same cycle, provided that hardware resources and dependency rules permit it. The compiler plays an important role here. It reorders instructions where allowed, schedules loads early, and arranges operations to reduce stalls.
Pipelining is central to this behavior. Instead of waiting for one instruction to fully complete before starting the next, the core overlaps stages of many instructions, much like an assembly line. The key cost is that when a branch is mispredicted, or an unresolved dependency holds up later instructions, parts of the pipeline must be flushed or stalled, which wastes cycles.
For HPC codes, long pipelines and deep speculation are both an advantage and a risk. When code is regular and predictable, the CPU can exploit a lot of instruction level parallelism and maintain high throughput. When code has irregular memory accesses or unpredictable branches, the CPU spends more cycles on misses and stalls. Designing algorithms and data structures that align better with the pipeline behavior is a major performance theme in scientific computing.
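The effect of dependencies can be illustrated with a small C sketch (splitting the sum into four accumulators is one common way to expose independence, not the only one). In the first loop every addition waits for the previous one; in the second, four independent chains let the core keep several additions in flight at once.

```c
#include <stddef.h>

/* One long dependency chain: each addition depends on the previous sum,
   so the pipeline cannot overlap the floating point additions. */
double sum_chain(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Four independent accumulators: the additions in different chains can
   proceed in parallel inside the pipeline. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)   /* handle any remaining elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```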
Memory, Bandwidth, and the Cost of Access
The main memory that holds your data is much larger than the caches and registers near the core, but also much slower to access. Access cost is usually measured in both latency and bandwidth.
Latency is how long a single access takes to be served. Bandwidth is how much data per second the memory system can deliver when many accesses are issued. The architecture of the memory subsystem, including memory channels and the memory controller, determines these limits.
If the CPU demands data faster than the memory can supply it, the program becomes memory bound. In a memory bound region of code, speeding up the CPU or using more cores does little to help, because the main bottleneck is the memory system. In a compute bound region, the cores are the limit, and using more of their arithmetic features such as vector instructions and multiple cores will help.
This leads to an important performance measure. For a given part of your code, you can think about the operational intensity, which is the ratio of operations performed to bytes of data moved. If $F$ is the number of floating point operations and $B$ is the number of bytes loaded or stored, then
$$\text{Operational intensity} = \frac{F}{B} \ \text{[flops per byte]}.$$
If operational intensity is low, performance is likely limited by memory bandwidth. If operational intensity is high, performance is more likely limited by the peak compute throughput of the CPU or accelerator.
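As a hedged worked example, consider the update $y_i \leftarrow y_i + a\,x_i$ over arrays of double precision values. Each element involves 2 floating point operations and, if nothing is already cached, 24 bytes of memory traffic (load $x_i$, load $y_i$, store $y_i$), so

$$\text{Operational intensity} = \frac{2}{24} \approx 0.08 \ \text{flops per byte},$$

which on current systems places this kernel firmly in the memory bound regime.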
This idea is the basis of the roofline performance model, which is commonly used in HPC to evaluate how close a program is to hardware limits. Later chapters on performance analysis will use this concept in more detail, but the architectural side is already present here: main memory is much slower than core computation, and this gap shapes what is possible.
Hierarchy and Locality
A crucial feature of computer architecture is that storage is organized hierarchically. You will study the details of registers, cache, and RAM separately, but it is useful here to understand the general pattern.
At the top of the hierarchy, very close to the core, are registers and small caches. They are extremely fast and provide high bandwidth, but there are few of them. Farther away, main memory is much larger but slower. At the bottom, storage devices are huge but orders of magnitude slower again.
This hierarchy is a response to physical constraints. It is not possible to build a single block of memory that is simultaneously as fast, large, and cheap as would be ideal. The system uses several layers, each one trading speed for capacity.
Locality is the property that data accessed recently (temporal locality) or data at nearby addresses (spatial locality) is likely to be used again soon. Architectures are designed to exploit locality. When the CPU requests a piece of data from memory, the hardware often brings in a whole block that contains neighboring addresses. If your code accesses data in a regular pattern, the caches and memory system can serve it efficiently.
When memory accesses do not respect locality, the hierarchy is less effective. The CPU must frequently fetch new blocks from main memory, evicting useful data and causing more misses. HPC programming techniques such as blocking and tiling are directly motivated by the memory hierarchy, and they aim to keep working data sets in fast layers as long as possible.
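A minimal sketch of blocking for a transpose-like access pattern is shown below; the tile size is a tunable assumption that depends on the cache sizes of the target machine.

```c
#include <stddef.h>

#define TILE 64   /* chosen so a TILE x TILE block of doubles fits in cache;
                     tune per machine */

/* Blocked transpose: work on small tiles so both the rows of 'in' and the
   columns of 'out' stay in fast cache while a tile is being processed,
   instead of striding across the whole matrix for every element. */
void transpose_blocked(double *out, const double *in, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; ++i)
                for (size_t j = jj; j < jj + TILE && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```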
Buses, Interconnects, and Data Paths
Inside a node, data travels along electrical connections between components. These connections are often organized into buses or point to point links. They have their own bandwidth limits and latency characteristics.
The path from core to register is very short and very fast. The path from core to cache is slightly longer. From core to main memory, it passes through the memory controller over specific memory channels. The total number of channels and their speeds determine the maximum memory bandwidth for the node.
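As an illustrative calculation with typical (but not universal) numbers: a node with 8 channels of DDR4-3200 memory, where each channel transfers 8 bytes at 3200 million transfers per second, has a theoretical peak of

$$8 \times 3200 \times 10^{6} \ \tfrac{\text{transfers}}{\text{s}} \times 8 \ \tfrac{\text{bytes}}{\text{transfer}} \approx 205 \ \text{GB/s},$$

and the sustained bandwidth measured by benchmarks is usually somewhat lower.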
When data must travel between the CPU and another component such as a GPU or a network interface, it uses specialized high speed links. These are designed to provide enough bandwidth for typical workloads, but they are always slower than the connections inside the CPU complex itself.
For HPC, the important idea is that the speed of computation is only one side of performance. The speed of data paths at all levels inside and between nodes sets an upper bound on how fast data can flow through the system. This is why architecture documents list not only the number of cores and clock frequency, but also memory bandwidth, interconnect bandwidth, and storage throughput.
Architectural Limits and Performance Ceilings
Every system has theoretical peak performance numbers. For CPU floating point operations, the peak is determined by the number of cores, the operations each core can perform per cycle, and the clock frequency. For memory, peak is given by the maximum sustained bandwidth.
If a node has $N_{\text{cores}}$ cores, each core can perform $F_{\text{core}}$ floating point operations per cycle, and all cores run at frequency $f$, then an upper bound on node level floating point throughput is
$$\text{Peak flops} = N_{\text{cores}} \times F_{\text{core}} \times f.$$
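As an illustrative worked example with hypothetical numbers: a node with $N_{\text{cores}} = 64$, $F_{\text{core}} = 16$ double precision operations per cycle, and $f = 2.5\ \text{GHz}$ has

$$\text{Peak flops} = 64 \times 16 \times 2.5 \times 10^{9} = 2.56 \times 10^{12} \ \text{flops per second},$$

that is, about 2.56 Tflop/s.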
You cannot exceed this limit no matter how well you program. The same type of bound exists for memory bandwidth and other resources. Many HPC codes achieve only a fraction of these peaks because of algorithmic structure, communication costs, and overheads.
From an architectural viewpoint, performance ceilings remind you that optimization is always about moving your code closer to these bounds. Some changes reduce wasted cycles on stalls. Others improve locality and use of caches. Some reorganize computation to increase operational intensity. All of them are attempts to better match the way your program behaves to the strengths of the hardware.
Why Architecture Knowledge Matters in HPC
High performance computing differs from everyday computing because it pushes hardware to its limits. On a desktop application, you might not notice mediocre use of caches or memory bandwidth. On a large simulation, these details can change runtime by factors of 10 or more and can determine whether a problem is even feasible.
Understanding the fundamentals of computer architecture gives you a framework for interpreting performance results. When a job scales poorly across cores, you can ask whether it is limited by memory bandwidth, by inter core communication, or by arithmetic throughput. When a code runs much slower on one system than another, architectural differences in core design, memory hierarchy, or interconnect often explain why.
Later chapters will use this architectural foundation in more specific contexts. When you learn about shared memory parallelism, you will see how threads share caches and memory on a node. When you explore distributed memory and interconnects, you will connect node level architecture to network level behavior. When you optimize performance, you will constantly return to core concepts such as pipelines, memory hierarchy, bandwidth, and locality that were introduced in this chapter.