Role of Main Memory in the Hierarchy
Main memory (RAM) sits between the fast but tiny levels (registers and cache) and the large but slow storage (SSDs/HDDs). In the hierarchy, RAM is:
- The primary workspace holding the code and data of running programs.
- Much larger than cache, but with lower bandwidth and higher latency.
- Directly accessible by the CPU through the memory controller.
For HPC, RAM size and bandwidth often limit:
- How large a problem you can run (e.g., size of matrices or grids).
- How fast your code can progress when its working set is not cache-resident.
Basic Characteristics of RAM
Volatile storage
RAM is volatile: its contents are lost when power is removed. This is why files must be saved to non-volatile storage (e.g. a filesystem on SSD/HDD) for persistence.
Addressable memory
RAM is organized as a large array of bytes, each with a unique address:
- The CPU reads/writes data using these addresses.
- The operating system maps process virtual addresses to physical RAM.
Conceptually, you can think of it as:
$$
\text{RAM} \approx [\text{byte}_0, \text{byte}_1, \dots, \text{byte}_{N-1}]
$$
where $N$ is the total size in bytes.
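As a minimal illustration (plain C, not specific to any HPC system), consecutive array elements occupy consecutive addresses, exactly as in this byte-array picture; note that a program sees virtual addresses, which the OS maps to physical RAM as discussed later in this chapter.

```c
#include <stdio.h>

int main(void) {
    double a[4] = {0.0, 1.0, 2.0, 3.0};

    /* Consecutive elements occupy consecutive bytes: on typical systems
       the printed (virtual) addresses increase by sizeof(double) = 8. */
    for (int i = 0; i < 4; ++i)
        printf("a[%d] lives at address %p\n", i, (void *)&a[i]);

    return 0;
}
```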
Capacity vs. bandwidth vs. latency
Three key properties:
- Capacity: total size (e.g. 64 GB per node).
- Bandwidth: how much data per second can be moved between CPU and RAM (e.g. 200 GB/s).
- Latency: how long it takes to start receiving a value after a load request (tens to hundreds of CPU cycles).
In HPC, it’s common for performance to be limited by bandwidth or latency rather than pure CPU speed.
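As a rough back-of-the-envelope example (illustrative numbers, not a specific machine): streaming once through a 64 GB data set on a node with 200 GB/s of memory bandwidth takes at least

$$
t \approx \frac{64\ \text{GB}}{200\ \text{GB/s}} = 0.32\ \text{s}
$$

per pass, no matter how fast the cores are. Latency matters for dependent accesses: a single load that takes about 100 ns costs on the order of 300 cycles on a 3 GHz core, which is why pointer-chasing codes are far slower than streaming ones.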
Types of RAM Relevant to HPC
DRAM vs. SRAM (at a high level)
- SRAM (static RAM) is fast and used for caches.
- DRAM (dynamic RAM) is slower and used for main memory.
Main memory in modern systems is almost always DRAM.
DDR generations
Most HPC nodes use some generation of DDR (Double Data Rate) DRAM, such as:
- DDR4: widely deployed, typical of older clusters.
- DDR5: newer systems, higher bandwidth per module.
Each new DDR generation generally increases:
- Peak bandwidth.
- Potential capacity per module.
- Power and complexity of the memory subsystem.
You don’t usually program DDR directly, but the generation affects the node’s memory bandwidth and therefore performance.
High-Bandwidth Memory (HBM)
On some accelerators and newer CPUs, you may encounter HBM:
- Much higher bandwidth than standard DDR.
- Usually smaller capacity.
- Often appears as a separate NUMA-like region or as a “near” memory to GPUs.
While HBM is covered more deeply in GPU/accelerator discussions, for this chapter it’s important to recognize it as an additional kind of main memory with different characteristics.
Organization of Main Memory in Nodes
Channels, DIMMs, and sockets
Main memory is physically organized into:
- DIMMs (memory modules) plugged into the motherboard.
- Channels: each CPU socket has several memory channels, each connected to one or more DIMMs.
Bandwidth scales roughly with the number of active channels. For HPC:
- Fully populating memory channels often improves bandwidth.
- Unbalanced configurations can limit performance, even if total capacity seems large enough.
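As an illustrative estimate (assumed numbers for a DDR4-3200 system): each channel is 8 bytes wide and transfers 3200 MT/s, so a fully populated 8-channel socket has a theoretical peak of

$$
\text{peak bandwidth} \approx 8 \times 3200\ \text{MT/s} \times 8\ \text{B} = 8 \times 25.6\ \text{GB/s} \approx 205\ \text{GB/s}.
$$

Leaving channels unpopulated removes their share of this peak, regardless of total capacity.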
Memory controllers
Each CPU socket contains one or more memory controllers:
- They handle communication between cores and the DIMMs.
- They schedule reads/writes, manage refresh, and optimize access patterns internally.
From a programmer’s perspective, you don’t directly control the controller, but:
- Access patterns (sequential vs. random) can significantly affect achieved bandwidth.
- Strided or irregular access can lead to underutilized bandwidth.
NUMA: Non-Uniform Memory Access
On modern multi-socket servers (and even some single-socket CPUs with chiplets), main memory is typically NUMA-organized:
- Each CPU socket has “local” memory, directly attached to it.
- Cores on a socket can also access “remote” memory attached to other sockets, but with:
  - Higher latency.
  - Lower effective bandwidth.
This leads to:
- Local memory access: fast, preferred.
- Remote memory access: slower, can severely hurt performance if frequent.
Typical effects:
- An application that unintentionally uses mostly remote memory can be much slower than expected.
- Memory placement relative to threads/processes becomes an important tuning aspect in HPC (see the first-touch sketch after this list).
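On Linux, a common placement mechanism is “first touch”: a page is typically allocated on the NUMA node of the thread that first writes to it. Below is a hedged OpenMP sketch exploiting this (it assumes a Linux-style first-touch policy and threads pinned to cores; compile with OpenMP enabled, e.g. `gcc -fopenmp`):

```c
#include <stdlib.h>

#define N 100000000L   /* ~800 MB of doubles */

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* First touch in parallel: each thread initializes the chunk it will
       later compute on, so those pages land on its local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 0.0;

    /* Use the same static schedule in compute loops so each thread keeps
       working mostly on the pages it placed locally. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}
```

Tools such as `numactl` can also bind a whole process and its allocations to chosen NUMA nodes.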
Memory Pages and Virtual Memory (HPC-Relevant Aspects)
Virtual vs. physical memory
The OS uses virtual memory:
- Each process sees a contiguous address space.
- The OS maps virtual addresses to physical RAM (and possibly to disk-backed swap, though swap is generally undesirable in HPC).
For HPC users, notable consequences:
- You don’t deal with physical addresses directly.
- Oversubscribing RAM (forcing swap usage) can catastrophically degrade performance or cause jobs to be killed.
Page size and TLB effects
Memory is managed in units called pages (commonly 4 KB, but “huge pages” like 2 MB or 1 GB can also be used):
- Page tables record mappings from virtual to physical pages.
- The TLB (Translation Lookaside Buffer) caches these mappings.
Implications:
- Access patterns that touch many different pages can stress the TLB, causing additional overhead.
- Using huge pages can reduce the number of page table entries and TLB misses, sometimes improving performance in large-memory HPC codes.
Configuration of huge pages is typically a system-level concern, but many HPC applications and libraries provide options to use them.
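As a hedged, Linux-specific sketch: an application can hint that a large allocation should be backed by transparent huge pages (the hint is advisory; if THP is not available the code simply falls back to normal 4 KB pages):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t bytes = 1UL << 30;   /* 1 GiB working array */

    /* Anonymous mapping, page-aligned by construction. */
    double *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED) { perror("mmap"); return 1; }

    /* Advisory hint: back this region with transparent huge pages,
       so the same data needs far fewer TLB entries. */
    if (madvise(a, bytes, MADV_HUGEPAGE) != 0)
        perror("madvise");      /* non-fatal: normal pages are used instead */

    for (size_t i = 0; i < bytes / sizeof *a; ++i)
        a[i] = 0.0;

    munmap(a, bytes);
    return 0;
}
```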
Memory Bandwidth and Access Patterns
Sequential vs. random access
Performance of RAM depends strongly on how you access it:
- Sequential access:
  - Reads/writes nearby addresses in order.
  - Allows hardware prefetchers and DRAM row-buffer locality to be effective.
  - Achieves close to peak bandwidth.
- Random access:
  - Jumps around addresses.
  - Reduces effectiveness of prefetching and row-buffer hits.
  - Achieves much lower bandwidth and higher effective latency.
Many HPC codes are designed to use data structures and loops that favor sequential access in the innermost loops.
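A minimal sketch of measuring this difference on a single core (plain C; timings are indicative only, and the array is chosen much larger than cache; compile with optimization, e.g. `gcc -O2`):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 26)   /* 64 Mi doubles, ~512 MiB: far larger than cache */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* POSIX timer */
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double *a   = malloc(N * sizeof *a);
    long   *idx = malloc(N * sizeof *idx);
    if (!a || !idx) return 1;

    for (long i = 0; i < N; ++i) { a[i] = 1.0; idx[i] = i; }

    /* Crude shuffle to create a random visiting order (sketch only). */
    for (long i = N - 1; i > 0; --i) {
        long j = rand() % (i + 1);
        long t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    double sum = 0.0, t0 = seconds();
    for (long i = 0; i < N; ++i) sum += a[i];        /* sequential sweep */
    double t1 = seconds();
    for (long i = 0; i < N; ++i) sum += a[idx[i]];   /* random order */
    double t2 = seconds();

    /* Printing sum keeps the compiler from deleting the loops. */
    printf("sequential %.3f s, random %.3f s (sum=%g)\n",
           t1 - t0, t2 - t1, sum);
    free(a); free(idx);
    return 0;
}
```

On most nodes the random traversal is several times slower, even though it reads exactly the same bytes.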
Strided and scattered access
Common patterns with performance impact:
- Strided access (e.g., accessing every $k$-th element):
  - Can still be efficient if the stride is small.
  - Large strides can behave similarly to random access.
- Scattered/gather access:
  - Uses index arrays or indirect addressing.
  - Often necessary in sparse or unstructured problems.
  - Can be memory-bound and difficult to optimize.
In performance analysis, these patterns show up clearly in memory bandwidth metrics.
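For illustration, here are the two patterns as plain C loops (hypothetical helper functions, sketching the access pattern rather than any particular library):

```c
#include <stddef.h>

/* Strided read: touches every k-th element. Small strides still use most
   of each cache line; large strides behave more like random access. */
double sum_strided(const double *a, size_t n, size_t k) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += k)
        s += a[i];
    return s;
}

/* Gather via an index array, as in sparse matrix-vector products:
   the visiting order into x is dictated by col[], not by the loop index. */
double dot_gather(const double *x, const size_t *col,
                  const double *val, size_t nnz) {
    double s = 0.0;
    for (size_t i = 0; i < nnz; ++i)
        s += val[i] * x[col[i]];
    return s;
}
```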
Capacity Constraints in HPC
Memory per core and per node
HPC nodes are often described by:
- Total memory per node, e.g. 256 GB.
- Memory per core, e.g. 4 GB/core on a 64-core node.
Why this matters:
- If your problem requires more memory per process than available, it will fail or swap.
- Running many processes/threads per node reduces memory per process.
When designing jobs:
- Choose problem sizes and process/thread counts that fit into available RAM.
- Be aware that other users’ jobs or system services may use some memory; avoid filling nodes to 100% of RAM.
Memory footprint of applications
For HPC codes, typical major contributors to memory usage:
- Large arrays (e.g., grids, matrices, fields).
- Ghost cells/halo regions for domain decomposition.
- Temporary work arrays and buffers.
- Data structures for communication (e.g., MPI buffers).
Memory reduction strategies belong to other chapters, but at this level:
- Always estimate memory usage for your problem size (a worked example follows below).
- Respect memory limits requested from the scheduler; under-requesting can cause job failures.
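As a worked example with assumed numbers: one double-precision field on a $1024^3$ grid needs

$$
1024^3 \times 8\ \text{B} = 2^{33}\ \text{B} = 8\ \text{GiB},
$$

so a solver carrying five such fields, plus halo regions, temporaries, and MPI buffers, already approaches 40–50 GiB and must fit comfortably within the node's physical RAM (leaving room for the OS and system services).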
Main Memory and Parallelism on a Node
Shared main memory for threads
On a single node:
- All cores share the same main memory, though with NUMA nuances.
- Threads (e.g., OpenMP) share an address space and can directly access the same RAM-resident data structures.
Considerations:
- Concurrent access to shared data can cause false sharing and cache contention; these are closely tied to how data is laid out in RAM (see the sketch after this list).
- Distribution of work and data across cores affects which memory banks and NUMA nodes are heavily used.
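As an illustration of layout mattering, here is a hedged OpenMP sketch that avoids false sharing by padding per-thread partial sums to separate cache lines (it assumes 64-byte cache lines and at most 64 threads):

```c
#include <omp.h>

#define MAX_THREADS 64
#define CACHE_LINE  64

/* One partial sum per thread, padded to a full cache line so neighbouring
   entries never share a line (sharing would make the line bounce between
   cores: false sharing). */
struct padded_sum {
    double value;
    char   pad[CACHE_LINE - sizeof(double)];
};

double sum_array(const double *a, long n) {
    struct padded_sum partial[MAX_THREADS] = {{0}};

    #pragma omp parallel
    {
        int t = omp_get_thread_num();   /* assumes <= MAX_THREADS threads */
        #pragma omp for
        for (long i = 0; i < n; ++i)
            partial[t].value += a[i];
    }

    double total = 0.0;
    for (int t = 0; t < MAX_THREADS; ++t)
        total += partial[t].value;
    return total;
}
```

In practice an OpenMP `reduction(+:...)` clause achieves the same result more simply; the point here is only that adjacent per-thread counters in RAM would otherwise contend for the same cache lines.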
Memory bandwidth as a scaling limit
When you increase the number of threads or processes per node:
- CPU cores compete for the same memory bandwidth.
- Beyond some point, adding more cores yields little speedup because memory becomes the bottleneck (a rough estimate of that point is sketched below).
This is why node-level performance studies often measure:
- Achieved memory bandwidth per core.
- Scaling efficiency as core count increases on a fixed-size problem.
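As an illustrative calculation (assumed numbers): if a single core can stream roughly 20 GB/s but the socket's memory system delivers about 200 GB/s in total, a bandwidth-bound loop stops scaling around

$$
\frac{200\ \text{GB/s}}{20\ \text{GB/s per core}} \approx 10\ \text{cores},
$$

even if the socket has many more cores available.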
Practical Considerations for HPC Users
Checking node memory and usage
On HPC systems, you’ll frequently need to know:
- How much RAM a node has.
- How much your job is using.
Typical tools/approaches (names will vary by system):
- Node info via the scheduler (e.g., `scontrol show node` in SLURM-based systems).
- Monitoring tools on the node (e.g., `free`, `top`, `htop`) for rough usage during interactive or debug runs.
Requesting memory in job schedulers
Schedulers typically allow you to request memory:
- Per node (e.g., `--mem=128G`).
- Per CPU/core (e.g., `--mem-per-cpu=4G`).
Good practice:
- Request slightly more than your estimated requirement but not excessively more, to:
  - Avoid out-of-memory failures.
  - Use cluster resources fairly.
Details of the syntax and policies are handled in job scheduling chapters, but they are fundamentally about allocating access to main memory.
Avoiding swapping
When a node runs out of RAM:
- The OS may start using swap (disk) as “extra memory”.
- Accessing swapped-out pages is orders of magnitude slower than RAM.
- On many HPC systems, heavy swapping is prevented or leads to job termination.
For HPC runs:
- Treat swapping as unacceptable for performance.
- Ensure memory usage stays well within physical RAM.
Summary
Main memory (RAM) in HPC:
- Is the primary large, fast-enough storage for active data.
- Determines the maximum problem size and strongly influences performance via capacity, bandwidth, and latency.
- Is organized in channels and NUMA nodes, making data placement and access patterns crucial.
- Becomes a shared and sometimes contended resource as you scale up cores and threads on a node.
A basic understanding of these aspects is essential for reasoning about performance and for making sensible choices about job configuration and code structure on HPC systems.