Main memory (RAM)

Role of Main Memory in HPC Systems

Main memory, or RAM, is the working area for programs during execution. In an HPC context it is one of the central constraints that shape how applications are written and how they perform. While caches and registers sit closer to the CPU and storage sits farther away, RAM is the large, directly addressable space where most data structures of interest actually live.

In cluster nodes used for HPC, main memory capacity, speed, and organization often determine whether a problem can be run at all, how large the problem can be, and how efficiently the CPUs and accelerators can be kept busy. Memory is shared by all cores on a node and is continuously moving data to and from caches, so its characteristics matter for both single-node and multi-node performance.

Basic Characteristics of RAM

From an application perspective, RAM appears as a large array of bytes addressed by virtual addresses. Physical memory is organized in hardware into channels, ranks, banks, and rows, but the software view is much simpler: you allocate memory, you get an address range, and you read and write through it.

Two characteristics are particularly important in HPC:

Capacity determines how large your in-memory data can be. For example, if a node has 256 GB of RAM, the total size of all resident processes and their data must fit within that, with some overhead for the operating system. Large simulations in computational fluid dynamics, climate, or molecular dynamics often push right up against available memory.

Bandwidth determines how quickly data can be transferred between RAM and the CPU. Even if a node has enough capacity, a memory system with low bandwidth can starve fast processors, leading to poor utilization.

Latency, the time for a single data item to be fetched from RAM to the CPU, is also important, but for HPC codes that stream through large arrays, total sustained bandwidth usually matters more than individual access latency.

In many HPC applications, main memory bandwidth is a primary bottleneck, not CPU flops. Performance often scales with how fast data can be moved, not just with clock speed or number of cores.

DRAM Technology and its Consequences

Main memory in HPC systems is built from DRAM chips. DRAM stores each bit in a tiny capacitor that must be periodically refreshed, which gives it its characteristic performance profile and access patterns.

Accesses to DRAM are performed in relatively large chunks known as rows or pages. If successive accesses hit the same open row, access is faster, but if the access pattern keeps jumping between distant rows, more time is spent on row activations and precharges. Although application programmers do not manipulate DRAM rows directly, the choice of data structure and access pattern can align with or fight against how DRAM operates.

Modern systems use several generations of DRAM technology such as DDR4, DDR5, and high bandwidth variants like HBM. HPC nodes may combine traditional DDR memory with very high bandwidth on-package memory. For this chapter the crucial point is that different DRAM technologies trade off capacity, bandwidth, and cost. Standard DIMMs provide large capacity fairly cheaply, while HBM provides very high bandwidth at more limited capacity per device.

Memory Bandwidth and Parallel Access

When multiple CPU cores run in parallel, they share access to the same set of memory channels. Each channel can transfer a certain amount of data per second, so the total sustainable bandwidth of a node is roughly the sum over all channels.

If a node has $B$ bytes per second of memory bandwidth and $N$ cores, then the average bandwidth per core cannot exceed $B / N$ if all cores are fully active. In practice, some cores may get more and some less, but this simple ratio illustrates a frequent effect in HPC: adding more cores does not always increase performance if the memory system is already saturated.
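
As a hypothetical illustration (the numbers are assumptions, not a measurement of any particular machine), a node with 64 cores sharing roughly 200 GB/s of memory bandwidth leaves each fully active core with at most about

$$
\frac{B}{N} = \frac{200 \ \text{GB/s}}{64} \approx 3 \ \text{GB/s per core},
$$

far less than a single core could consume on its own when running a bandwidth-hungry loop.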

Programs that perform many arithmetic operations per byte of data moved are called compute bound. Those that perform relatively few operations per byte are memory bound. Memory bound codes are limited by main memory bandwidth, and their runtime improves little once the memory subsystem is fully loaded, regardless of extra cores.

If your application performs only a small number of flops per byte loaded from RAM, it will likely be memory bandwidth bound. Adding more cores may not speed it up unless the memory access pattern becomes more efficient.
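
As a concrete sketch of such a code, the following triad-style loop (the function name and sizes are illustrative) performs only two floating point operations per iteration while moving roughly 24 bytes to and from RAM, or about 0.08 flops per byte, which places it firmly in memory bound territory on current CPUs:

```c
#include <stddef.h>

/* Triad-style kernel: 2 flops per iteration, but roughly 24 bytes of RAM
 * traffic (read b[i], read c[i], write a[i]) in the simplest model.
 * At about 0.08 flops per byte, its runtime is set by memory bandwidth,
 * so adding cores beyond the point where the memory channels are
 * saturated gives little or no speedup. */
void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] + s * c[i];
    }
}
```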

This is one of the reasons why understanding main memory is crucial in HPC. It directly influences scaling behavior on a node.

NUMA: Non-Uniform Memory Access

On a single HPC node with many sockets or many-core processors, main memory is often organized in a NUMA layout. Each CPU socket has its own directly attached memory, and remote access to memory attached to another socket is slower or has lower bandwidth.

From the operating system point of view, the sum of these memories is a single global memory space, but with regions that have different access costs from each core. If a process and its working data end up on different NUMA “nodes,” every access can incur a higher latency and lower bandwidth.

Modern operating systems provide NUMA-aware allocation and scheduling, and HPC programmers can use libraries or directives to control and inspect placement. For now, it is enough to understand the qualitative effect: where in physical memory your data resides, relative to the core executing the code, can significantly affect performance.

On NUMA systems, data locality matters. Accessing memory local to the executing core can be significantly faster than accessing memory attached to another socket.
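
One common way to influence placement is the "first touch" idiom: on typical Linux HPC nodes, a page of a newly allocated array is physically placed on the NUMA node of the thread that first writes to it. The sketch below assumes a C compiler with OpenMP support; the function name is illustrative.

```c
#include <stddef.h>
#include <stdlib.h>

/* First-touch initialization: initialize the array with the same parallel
 * loop structure that will later process it, so each thread's portion of
 * the array ends up on memory local to the core running that thread. */
double *numa_friendly_alloc(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (!a) return NULL;

    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; ++i) {
        a[i] = 0.0;   /* the first write determines physical placement */
    }
    return a;
}
```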

Memory Allocation from an HPC Perspective

Most scientific codes use dynamic memory allocation extensively because problem sizes are not known at compile time. Typical languages and mechanisms include malloc/new in C and C++, allocate in Fortran, and array allocation in higher-level languages.

Several aspects of allocation influence performance:

First, alignment. Some memory operations and SIMD instructions achieve peak bandwidth when data is aligned to specific boundaries, for example 32 or 64 bytes. Many HPC libraries and compilers take care to align large arrays for this reason.

Second, contiguity. Arrays that are contiguous in memory allow the hardware to fetch data in large sequential bursts. Linked lists or irregular structures scatter data across RAM, which can increase the number of DRAM row activations and reduce effective bandwidth.

Third, lifetime and reuse. Allocating and freeing large amounts of memory repeatedly can introduce overhead and fragmentation. Long-lived, reused buffers are often preferred in performance critical codes to keep allocation costs under control and to maintain predictable memory usage.
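
The sketch below illustrates all three points using C11's aligned_alloc (the alignment value and helper names are illustrative): one contiguous, 64-byte-aligned work buffer is allocated once and reused across time steps instead of being re-allocated inside the loop.

```c
#include <stddef.h>
#include <stdlib.h>

#define ALIGNMENT 64   /* cache-line / SIMD-friendly alignment, in bytes */

/* Allocate one contiguous, aligned work buffer. aligned_alloc (C11)
 * requires the size to be a multiple of the alignment, so round up. */
double *alloc_work_buffer(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    return aligned_alloc(ALIGNMENT, bytes);
}

/* Usage sketch: allocate once, reuse every step, free at the end.
 *
 *   double *work = alloc_work_buffer(n);
 *   for (int step = 0; step < nsteps; ++step)
 *       do_timestep(field, work, n);   // hypothetical per-step routine
 *   free(work);
 */
```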

Operating systems expose virtual memory, where each process has its own large address space. However, HPC applications usually try to avoid relying on swapping, because moving memory pages to disk is many orders of magnitude slower than DRAM and can destroy performance. In practice, “out of core” algorithms that deliberately use disk as an extension of memory are carefully designed and are a subject of specialized techniques beyond this chapter.

Main Memory, Caches, and Access Patterns

Although caches are discussed elsewhere, it is helpful here to emphasize how RAM and caches interact from an HPC perspective. RAM feeds data into caches in fixed size chunks called cache lines. Reads and writes to RAM always involve cache lines, not individual scalar values.

If you iterate over a large array in contiguous order, each cache line fetched from RAM will be fully used. The transfer from RAM to cache is efficient, and the sustained bandwidth approaches the peak of the memory subsystem.

If you repeatedly touch widely separated memory locations, you may fetch many cache lines but use only a small part of each. From the point of view of RAM, much of the bandwidth is spent moving data that is never used, which is especially wasteful for codes that could otherwise stream through memory at close to peak bandwidth.
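
The difference can be made concrete with two ways of summing the same array, sketched below: the contiguous loop uses every byte of each cache line it pulls from RAM, while the strided loop touches only one double out of each 64-byte line it fetches when the stride is large.

```c
#include <stddef.h>

/* Contiguous traversal: consecutive iterations touch consecutive elements,
 * so every 64-byte cache line fetched from RAM is fully used. */
double sum_contiguous(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Strided traversal: with a large stride, each access pulls in a whole
 * cache line (8 doubles) but uses only one of them, so roughly 8x more
 * data is moved from RAM for the same amount of useful work. */
double sum_strided(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}
```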

In HPC, a central design goal for data structures and algorithms is to make access patterns as regular and contiguous as possible. Loop ordering, array layouts, and blocking techniques are often chosen primarily to make RAM traffic more sequential and predictable.

Measuring and Reasoning About Memory Traffic

To understand and optimize memory usage, it is helpful to think explicitly about memory traffic, that is, how many bytes are moved between RAM and the CPU per operation.

If a loop reads an array of $N$ double precision numbers and writes a result array of $N$ double precision numbers, then in the simplest model the memory traffic is approximately:

$$
\text{Bytes moved} \approx N \cdot 8 \, (\text{read}) + N \cdot 8 \, (\text{write}) = 16N \, \text{bytes}.
$$

Given a node memory bandwidth $B$ in bytes per second, a lower bound on the time $T$ for this loop, ignoring caches and other effects, is:

$$
T \ge \frac{16N}{B}.
$$

For memory bound codes, a simple bytes / bandwidth estimate can provide a hard lower bound on runtime, independent of CPU frequency or core count.

While real performance is more complex and caches can reduce the amount of data that must come from RAM, this sort of calculation helps to identify when main memory is the limiting factor.
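
The sketch below turns this estimate into a small experiment. The bandwidth value is an assumption, not a property of any particular node; in practice it would come from the hardware specification or a benchmark such as STREAM.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = 100000000;      /* 1e8 doubles per array */
    const double B = 200e9;          /* assumed node bandwidth in bytes/s */

    double *src = malloc(n * sizeof *src);
    double *dst = malloc(n * sizeof *dst);
    if (!src || !dst) return 1;
    for (size_t i = 0; i < n; ++i) src[i] = (double)i;

    /* Stream through both arrays once: roughly 16N bytes of RAM traffic. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; ++i) dst[i] = 2.0 * src[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("measured: %.4f s, 16N/B lower bound: %.4f s\n",
           elapsed, 16.0 * (double)n / B);

    free(src);
    free(dst);
    return 0;
}
```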

Memory Capacity Limits and Problem Sizing

On HPC clusters, users often need to choose how many nodes to request and what problem sizes to run. Main memory capacity is one of the primary constraints that informs this decision.

If an application requires $M$ bytes of data for a given problem size, and each node provides $C$ bytes of usable RAM for the job, then the minimum number of nodes purely from a capacity perspective is:

$$
N_{\text{nodes}} \ge \left\lceil \frac{M}{C} \right\rceil.
$$

However, this arithmetic is simplistic, because overheads from the operating system, libraries, and intermediate buffers also consume memory. Most HPC centers recommend leaving a safety margin. Many codes also allocate temporary work arrays whose size may not be obvious from user-level input parameters.

As problem sizes grow, the growth rate of memory usage relative to the number of degrees of freedom becomes critical. For example, a three dimensional grid with $n$ points in each direction has $n^3$ total points. If each point stores several variables, and additional arrays hold intermediate results, total memory can grow rapidly beyond intuition. Understanding how memory scales with input parameters is an important part of designing and running HPC workloads.
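
As a hypothetical worked example, a grid with $n = 1024$ points per direction and five double precision variables per point needs

$$
1024^3 \times 5 \times 8 \ \text{bytes} \approx 43 \ \text{GB},
$$

which comfortably fits on a 256 GB node. Doubling the resolution to $n = 2048$ multiplies this by eight, to roughly 344 GB, so the same job now needs at least $\lceil 344 / 256 \rceil = 2$ nodes from capacity considerations alone, before any temporary work arrays are counted.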

Main Memory in Multi-node HPC

Although each node has its own physical RAM, distributed memory models treat each process’s local memory as part of a global distributed data structure. When building distributed data structures with MPI or other models, the memory layout within each node still matters.

Load balance is not only about distributing work evenly across processes, but also about distributing memory usage. If one process holds far more data than others, that process may run out of local RAM even if the global capacity is sufficient.

Furthermore, communication patterns between nodes are intertwined with memory layout. Data that must frequently cross node boundaries is staged from local RAM to the network interface. Contiguous buffers reduce both main memory overhead and communication overhead, while scattered fragments may require extra packing and unpacking operations.
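
A minimal sketch of this effect, assuming MPI in C (function and variable names are illustrative): sending one column of a row-major grid requires packing the strided column into a contiguous buffer first, a copy that a contiguous halo layout, or an MPI derived datatype such as MPI_Type_vector, would handle differently.

```c
#include <stdlib.h>
#include <mpi.h>

/* Send column 'col' of a row-major n-by-n grid to rank 'dest'.
 * The column is strided in memory, so it is packed into a contiguous
 * buffer before the send; this packing is exactly the extra main memory
 * traffic that a contiguous buffer layout avoids. */
void send_column(const double *grid, int n, int col, int dest)
{
    double *buf = malloc((size_t)n * sizeof *buf);
    for (int i = 0; i < n; ++i)
        buf[i] = grid[(size_t)i * (size_t)n + (size_t)col];  /* pack */
    MPI_Send(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    free(buf);
}
```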

Reliability and Error Correction

HPC systems typically use ECC (Error Correcting Code) memory. ECC adds extra bits to each memory word that allow single bit errors to be corrected and some multi-bit errors to be detected. In large memory systems with many DIMMs and long runtimes, memory errors are a statistical certainty and ECC protects the correctness of scientific results.

ECC introduces some small cost in terms of additional memory bits and sometimes a minor performance penalty. In HPC this cost is considered essential. Silent corruption of data in memory can invalidate long simulations, so reliability is valued over marginal raw throughput gains.

From the programmer’s perspective, ECC is mostly invisible. However, logs from the system may report corrected or uncorrected memory errors, and system administrators may decommission failing DIMMs based on error statistics. For long running HPC jobs, the stability of main memory is a crucial factor in overall reliability.

Summary

Main memory in HPC is more than just “where arrays live.” Its capacity limits problem size, its bandwidth often sets performance ceilings, and its organization through NUMA and DRAM structure influences how algorithms should be implemented. Effective use of RAM requires attention to access patterns, data layout, allocation strategies, and memory-aware problem sizing. In later chapters, these themes will reappear when discussing parallel programming models, performance optimization, and practical workflow design on HPC clusters.
