Role of CPUs in HPC
In HPC systems, the CPU (Central Processing Unit) is the primary general-purpose compute engine. While GPUs and other accelerators are increasingly important, nearly every HPC workflow still depends on CPUs for:
- Running the operating system and job scheduler daemons
- Coordinating and orchestrating work on accelerators
- Executing control logic, serial code regions, and I/O handling
- Running codes that are not GPU-enabled
For HPC programmers and users, understanding CPUs means understanding three tightly related ideas:
- The CPU socket (or package)
- The cores within each CPU
- The clock speed at which those cores execute instructions
These characteristics directly affect how many tasks your job can run in parallel and how fast each task can execute.
CPU Sockets vs Cores
CPU sockets
A socket (or CPU package) is the physical chip plugged into the motherboard. An HPC node may have:
- 1 socket (single-socket node)
- 2 sockets (dual-socket node; very common)
- More sockets in high-end servers
When you see descriptions such as “2 × 32-core CPUs,” that means the node has:
- 2 sockets
- Each socket with 32 cores
- Total of 64 physical cores per node
Schedulers and resource managers typically expose both nodes and cores as resources. The number of sockets matters for memory locality and interconnect topology (which are covered elsewhere), but from the perspective of this chapter, sockets are just containers for cores.
CPU cores
A core is an independent execution unit inside a CPU that can run its own instruction stream. Conceptually:
- One core ≈ one independent “worker” that can run one OS thread at a time (ignoring hardware threads for the moment).
- Multiple cores allow parallel execution of multiple tasks or threads on a single CPU socket.
Important properties of cores:
- They share some resources on the chip (e.g., last-level cache, memory controllers, interconnects).
- Each core has its own set of registers and usually private L1 and often L2 caches (cache details are handled in the memory hierarchy chapter).
- In HPC, “how many cores do I have?” is often equivalent to “how many CPU threads or MPI ranks can I reasonably run per node?” (not accounting yet for oversubscription or hyper-threading).
Modern HPC CPUs may have tens to hundreds of cores per socket. For example:
- 32–64 cores per socket are common for current x86-based HPC CPUs.
- Many-core architectures (e.g., some accelerators or certain CPU lines) may have more.
Hardware Threads and Simultaneous Multithreading (SMT)
Many CPUs support simultaneous multithreading (SMT), marketed by Intel as Hyper-Threading.
- A hardware thread is an execution context on a core.
- A core with SMT can present 2 or more hardware threads to the OS.
- The OS then “sees” 2 logical CPUs per physical core (for 2-way SMT), or more.
For a CPU with:
- 32 physical cores
- 2 hardware threads per core (2-way SMT)
The OS will report 64 logical CPUs.
Key points for HPC:
- Hardware threads share the core’s execution units and caches.
- SMT improves throughput when one thread stalls (e.g., waiting for memory), because another thread can use the otherwise idle functional units.
- For compute-bound HPC kernels, SMT often provides limited or no benefit, and sometimes hurts performance due to increased contention for core resources.
- For latency-bound or I/O-heavy workloads, SMT can help keep cores busy.
HPC schedulers may let you request:
- Physical cores (often preferred)
- Logical CPUs (hardware threads), sometimes with a specific SMT policy
You need to know whether “64 CPUs per node” in documentation means:
- 64 physical cores, or
- 32 cores × 2 hardware threads per core
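One way to check which of these applies on a Linux node is sketched below. This is a minimal Python sketch: it assumes Python 3 is available, and the physical-core query relies on the third-party psutil package (with a fallback hint to the standard lscpu command if psutil is missing):

```python
import os

# Logical CPUs as seen by the OS = physical cores x hardware threads per core.
logical = os.cpu_count()
print(f"Logical CPUs: {logical}")

try:
    import psutil  # third-party package; assumed installed
    physical = psutil.cpu_count(logical=False)
    print(f"Physical cores: {physical}")
    if physical and logical:
        print(f"Hardware threads per core: {logical // physical}")
except ImportError:
    # On Linux, 'lscpu' reports Socket(s), Core(s) per socket,
    # and Thread(s) per core directly.
    print("psutil not installed; run 'lscpu' instead.")
```

On a node with 32 physical cores and 2-way SMT, this would report 64 logical CPUs but 32 physical cores, resolving the ambiguity in the documentation.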
What Clock Speed Means
Clock cycles and frequency
The CPU core executes instructions in discrete time steps called clock cycles. The clock speed or frequency is how many cycles happen per second.
- 1 Hz = 1 cycle per second
- 1 GHz = $10^9$ cycles per second
A core with a 3.0 GHz clock executes:
$$
3.0 \times 10^9 \text{ cycles/second}
$$
The (idealized) time per cycle is:
$$
\text{cycle time} = \frac{1}{\text{frequency}}
$$
So at 3.0 GHz:
$$
\text{cycle time} = \frac{1}{3.0 \times 10^9} \approx 0.33 \text{ ns}
$$
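As a quick sanity check on this arithmetic, here is a tiny Python helper (an illustrative sketch, not part of any standard API) that converts a frequency in GHz to an idealized cycle time in nanoseconds:

```python
def cycle_time_ns(frequency_ghz: float) -> float:
    """Idealized cycle time: 1 / frequency, expressed in nanoseconds."""
    # 1 / (f * 1e9 cycles/s) seconds = (1 / f) nanoseconds
    return 1.0 / frequency_ghz

print(f"{cycle_time_ns(3.0):.2f} ns")  # 0.33 ns per cycle at 3.0 GHz
```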
Programs are made up of instructions; each instruction takes one or more cycles to complete when executed on the core. In practice:
- Not all cycles execute useful work (stalls, mispredicted branches, cache misses, etc.).
- Multiple instructions can often be in different pipeline stages concurrently (pipelining).
- Multiple operations can be issued per cycle (superscalar execution, vector units), but the details belong in other chapters.
Base, turbo, and all-core frequencies
Modern CPUs do dynamic frequency scaling:
- Base frequency: the guaranteed minimum frequency under sustained load and within power/thermal limits.
- Turbo (boost) frequency: a higher frequency that a core can reach temporarily when thermal and power budgets allow.
- All-core frequency: the typical frequency when many or all cores are active; this is often lower than the maximum turbo for a single active core.
In HPC:
- Small, lightly threaded workloads may run close to the single-core turbo frequency.
- Realistic production jobs that use many cores often run closer to all-core turbo or even base frequency, depending on cooling and power limits.
This means the “3.5 GHz” advertised on a spec sheet does not guarantee 3.5 GHz simultaneously on all cores for a long-running, full-node job.
Performance and frequency
Ignoring other factors, the time to complete a fixed number of instructions scales approximately inversely with frequency:
If:
- $T_1$ is time at frequency $f_1$
- $T_2$ is time at frequency $f_2$
and the same instruction stream is executed with the same efficiency, then:
$$
\frac{T_1}{T_2} \approx \frac{f_2}{f_1}
$$
For example, a move from 2.5 GHz to 3.0 GHz could ideally give:
$$
\text{speedup} \approx \frac{3.0}{2.5} = 1.2
$$
i.e., about 20% faster. In real workloads, memory and I/O bottlenecks often reduce the actual gain.
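The same ratio is easy to compute programmatically, for instance when estimating the best-case gain from moving a job to higher-clocked nodes. A minimal sketch (the function name is illustrative):

```python
def ideal_frequency_speedup(f_old_ghz: float, f_new_ghz: float) -> float:
    """Upper-bound speedup from frequency alone: T_old / T_new ~= f_new / f_old."""
    return f_new_ghz / f_old_ghz

print(ideal_frequency_speedup(2.5, 3.0))  # 1.2 -> at most ~20% faster
# Memory- or I/O-bound codes will typically see less than this.
```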
Cores, Frequency, and Throughput vs Latency
In HPC, it is useful to distinguish:
- Latency: how long it takes to complete one job or solve one problem instance.
- Throughput: how many jobs or problem instances you can process per unit time.
Relating this to cores and clock speed:
- More cores → higher potential throughput (more tasks in parallel).
- Higher clock speed per core → lower latency for a single-threaded or fixed-parallelization job.
Examples:
- If you run one single-threaded application on a node:
- Higher clock speed on one core will usually help more than additional idle cores.
- If you run many independent simulations (parameter sweeps, Monte Carlo, etc.):
- More cores allow more simulations to run concurrently, increasing throughput.
- For parallel codes that scale well with core count:
- Both more cores and good per-core performance (frequency + architecture efficiency) matter.
Cluster procurement trade-off (often not your decision, but useful to understand):
- High-core-count, lower-frequency CPUs:
- Good for workloads that parallelize well and are not tightly latency-sensitive per process.
- Lower-core-count, higher-frequency CPUs:
- Good for poorly scaling codes, and for memory-bound codes that saturate memory bandwidth before all cores are busy.
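To make the trade-off numeric, the sketch below compares two hypothetical node types for an embarrassingly parallel workload under a deliberately simple model (a fixed number of cycles per task, one task per core, no memory effects); all core counts, frequencies, and cycle counts are illustrative:

```python
# Simple latency/throughput model for a parameter sweep.
TASK_CYCLES = 1e12  # hypothetical work per task, in clock cycles

def task_latency_s(freq_ghz: float) -> float:
    """Time for one task on one core (ignores memory and I/O effects)."""
    return TASK_CYCLES / (freq_ghz * 1e9)

def node_throughput(cores: int, freq_ghz: float) -> float:
    """Tasks completed per second when all cores run independent tasks."""
    return cores / task_latency_s(freq_ghz)

for name, cores, ghz in [("128 cores @ 2.0 GHz", 128, 2.0),
                         ("32 cores @ 3.5 GHz", 32, 3.5)]:
    print(f"{name}: latency {task_latency_s(ghz):.0f} s/task, "
          f"throughput {node_throughput(cores, ghz):.2e} tasks/s")
```

Under this model the high-core-count node wins on throughput (more tasks per second), while the higher-frequency node wins on latency (each individual task finishes sooner), which is exactly the distinction drawn above.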
Measuring Core and CPU Performance: A Simple Model
A very basic model for per-core floating-point performance is:
$$
\text{FLOP/s per core} \approx f \times \text{FLOP per cycle}
$$
where:
- $f$ is the clock frequency (Hz)
- “FLOP per cycle” depends on:
- How many floating-point units are available
- Vector width and utilization
- Instruction mix and pipeline efficiency
Total theoretical performance of a node (ignoring overheads) is roughly:
$$
\text{FLOP/s per node} \approx (\text{FLOP/s per core}) \times (\text{number of cores})
$$
In practice:
- Achieved performance is a fraction of this peak due to memory limits, control flow, and communication.
- But this model shows how both cores and frequency contribute to the theoretical compute capacity.
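Plugging numbers into this model is straightforward. The sketch below assumes a hypothetical 64-core, 2.6 GHz node with 16 double-precision FLOP per cycle per core (plausible for a CPU with two 256-bit FMA units: 2 units × 4 lanes × 2 FLOP, but the exact figure is CPU-specific):

```python
def peak_flops(freq_ghz: float, flop_per_cycle: int, cores: int) -> float:
    """Theoretical peak: frequency x FLOP/cycle x cores (an upper bound only)."""
    return freq_ghz * 1e9 * flop_per_cycle * cores

# Hypothetical node: 64 cores at 2.6 GHz, 16 FLOP/cycle per core.
peak = peak_flops(2.6, 16, 64)
print(f"Peak: {peak / 1e12:.2f} TFLOP/s")  # ~2.66 TFLOP/s
```

Remember that achieved performance is usually a fraction of this figure, and that the frequency actually sustained under full load may be below the spec-sheet value, as discussed above.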
Power, Thermal Limits, and Energy Considerations
Higher clock speeds and more active cores increase:
- Power consumption
- Heat generation
Modern CPUs manage this via:
- Dynamic Voltage and Frequency Scaling (DVFS)
- Thermal throttling if limits are exceeded
- Package power caps
For HPC systems:
- Power and cooling are major cost drivers.
- Many clusters enforce node-level power limits; this can reduce achievable turbo frequencies when many cores are active.
From a job’s perspective:
- A heavily loaded node may run at lower effective frequencies than a lightly loaded one.
- Energy-to-solution (total energy used to complete a job) can be as important as time-to-solution, especially at scale.
Some systems expose frequency controls or power management features to users (within limits), which can be leveraged for energy-aware computing (discussed in sustainability-focused sections).
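A back-of-the-envelope energy-to-solution comparison can be done with the simple model energy = average power × time-to-solution. The sketch below uses entirely hypothetical power and runtime numbers to show that a slower, lower-power run can still win on energy:

```python
def energy_to_solution_kwh(avg_power_w: float, runtime_s: float) -> float:
    """Energy-to-solution = average node power x time-to-solution (J -> kWh)."""
    return avg_power_w * runtime_s / 3.6e6

# Hypothetical job on one node at two frequency settings.
fast = energy_to_solution_kwh(avg_power_w=500, runtime_s=3600)  # 1 h at 500 W
slow = energy_to_solution_kwh(avg_power_w=380, runtime_s=4200)  # ~17% slower
print(f"fast: {fast:.2f} kWh, slow: {slow:.2f} kWh")  # 0.50 vs 0.44 kWh
```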
Practical Implications for HPC Users
Reading node specifications
When you see something like:
- “2 × 32-core 2.6 GHz CPUs per node”
you can infer:
- Sockets: 2
- Cores per socket: 32
- Total cores per node: 64
- Nominal base or all-core frequency: around 2.6 GHz
This helps you decide:
- How many cores to request per job.
- How many MPI ranks or threads to run per node (in combination with memory and other considerations).
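These inferences are simple enough to encode in a small helper. The sketch below (the function name and returned fields are illustrative, not a scheduler API) derives the usual planning numbers from a spec like the one above:

```python
def node_summary(sockets: int, cores_per_socket: int, freq_ghz: float,
                 smt: int = 1) -> dict:
    """Derive planning numbers from a spec such as '2 x 32-core 2.6 GHz'."""
    cores = sockets * cores_per_socket
    return {
        "total_cores": cores,
        "logical_cpus": cores * smt,
        "nominal_ghz": freq_ghz,
    }

print(node_summary(sockets=2, cores_per_socket=32, freq_ghz=2.6, smt=2))
# {'total_cores': 64, 'logical_cpus': 128, 'nominal_ghz': 2.6}
```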
Choosing core counts for jobs
Common patterns:
- Pure MPI code: one MPI process per core (e.g., 64 ranks on a 64-core node).
- Hybrid MPI + threads: fewer MPI ranks, each using several threads, often matched to cores per socket or per NUMA domain.
Although scheduling and hybrid strategies are addressed in later chapters, you should already understand that:
- Requesting more cores lets you run more parallel work.
- But the benefit depends on how well your code scales with additional cores.
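To make the rank-and-thread arithmetic behind these patterns concrete, the following sketch enumerates the (ranks, threads per rank) combinations that exactly fill a node, which is often the starting point for hybrid MPI + threads experiments:

```python
def rank_thread_layouts(cores_per_node: int):
    """Enumerate (MPI ranks, threads per rank) pairs that exactly fill a node."""
    return [(ranks, cores_per_node // ranks)
            for ranks in range(1, cores_per_node + 1)
            if cores_per_node % ranks == 0]

# On a 64-core node: (64, 1) is pure MPI; (2, 32) is one rank per socket
# on a dual-socket node; intermediate pairs suit hybrid MPI + threads.
print(rank_thread_layouts(64))
# [(1, 64), (2, 32), (4, 16), (8, 8), (16, 4), (32, 2), (64, 1)]
```

Which layout performs best depends on the code's scaling behavior and on memory locality, which later chapters cover.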
Not relying blindly on GHz
Clock speed alone is not a reliable performance predictor across different CPU generations or architectures because:
- Newer CPUs may do more work per cycle (wider vectors, more execution units).
- Memory subsystem differences can dominate for memory-bound codes.
However, within the same CPU family, at similar core counts:
- Higher clock speed usually correlates with slightly better single-thread performance, at the cost of higher power.
For performance-critical runs, it is common to:
- Benchmark code on available CPU types and node configurations.
- Use that data, not just GHz, to choose where to run large jobs.
Summary
- A CPU socket contains multiple cores; each core is an independent execution engine.
- Many CPUs support multiple hardware threads per core (SMT), presenting more logical CPUs than physical cores.
- Clock speed (frequency) determines how many cycles per second a core can execute; actual frequency varies dynamically (base vs turbo).
- More cores mainly increase throughput (more work in parallel); higher frequency mainly reduces latency (faster per-core execution).
- Node specifications such as “2 × 32-core 2.6 GHz” let you estimate how many cores you can use and the approximate per-core performance.
- Power and thermal constraints limit how much frequency and how many cores can be used simultaneously at full speed, which matters significantly in HPC environments.