Overview
Memory management in Linux controls how the system uses RAM and swap so that processes can run efficiently and the system remains responsive. At a high level, Linux abstracts physical memory into virtual address spaces, shares common data across processes, and uses sophisticated algorithms to decide what to keep in RAM and what to reclaim or move out.
This chapter focuses on how Linux manages memory internally. It connects concepts you may already know, such as swap or “out of memory” errors, to the mechanisms that actually implement them inside the kernel. You will see how virtual memory is organized, how the kernel tracks memory usage, how it decides what to reclaim, and what happens when memory runs out.
Virtual Memory and Address Spaces
Every process in Linux sees its own virtual address space. From the viewpoint of a process, it has a contiguous range of memory addresses available, even though the underlying physical memory is fragmented and shared with other processes.
Virtual memory is organized into pages. A page is a fixed size block of memory, typically $4\ \text{KiB}$ on x86 and many other architectures, although larger “huge” pages are also supported. The kernel maintains page tables that map virtual addresses to physical page frames. Each user space process has its own set of page tables that define its view of memory.
Virtual memory is divided into regions with different purposes. Typical regions include program code, read-only data, writable data, the heap, the stack, memory mapped files, and potentially shared libraries or shared memory segments. These regions are represented in the kernel by memory areas that describe what the region is for, what protections it has, and how it is backed, for example by an anonymous zero filled page or by a file.
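These regions are visible from user space. The sketch below, a minimal example assuming procfs is mounted (as on virtually all Linux systems), prints the calling process's own region list from /proc/self/maps; lines that name a file are file backed mappings, and the rest are anonymous:

```c
/* Sketch: print this process's memory regions via /proc/self/maps. */
#include <stdio.h>

int main(void)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];

    if (!maps) {
        perror("fopen");
        return 1;
    }
    /* Each line describes one region: address range, permissions,
     * file offset, device, inode, and the backing file path (blank
     * for anonymous regions such as the heap and stack). */
    while (fgets(line, sizeof(line), maps))
        fputs(line, stdout);
    fclose(maps);
    return 0;
}
```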
A key benefit of virtual memory is isolation. Processes cannot directly access each other’s memory because their virtual address spaces are separate. Kernel memory lives in a different part of the address space and is normally not directly reachable from user space. This isolation is enforced by the hardware memory management unit, which uses the kernel’s page tables to check permissions on each access.
Pages, Page Tables, and TLBs
Linux works at the granularity of pages. Physical RAM is divided into page frames. Virtual memory addresses are divided into a page number and an offset within the page. The page number is used to look up the corresponding physical page frame in the page tables.
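As a worked example, assuming the common 4 KiB page size, the low 12 bits of an address form the offset within the page and the remaining bits select the virtual page:

```c
/* Sketch: split a virtual address into page number and offset for
 * the common 4 KiB page size; the address value is hypothetical. */
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);          /* typically 4096 */
    uintptr_t addr = (uintptr_t)0x7f3a12345678;      /* hypothetical 64-bit address */
    uintptr_t offset = addr % (uintptr_t)page_size;  /* low 12 bits */
    uintptr_t vpn = addr / (uintptr_t)page_size;     /* virtual page number */

    printf("page size %ld: addr %#" PRIxPTR " -> page %#" PRIxPTR
           ", offset %#" PRIxPTR "\n", page_size, addr, vpn, offset);
    return 0;
}
```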
Each page table entry records whether the page is present in physical memory, its permissions (read, write, execute), and status bits such as dirty and accessed. Whether a page is backed by anonymous memory or a file is tracked in the kernel's own data structures for the memory region, not in the hardware entry itself. If a virtual address is accessed and the corresponding page table entry is marked as not present, a page fault occurs.
The processor uses a Translation Lookaside Buffer, or TLB, as a cache of recent virtual to physical translations. When a mapping is changed in the page tables, the kernel must perform a TLB shootdown for the affected CPUs so that old translations are invalidated. This is important when unmapping or changing permissions, for example when freeing memory or applying copy on write.
Linux also supports huge pages, typically $2\ \text{MiB}$ or $1\ \text{GiB}$ on x86-64. Huge pages reduce TLB pressure because one TLB entry covers more memory, but they are harder to allocate due to fragmentation and have different management rules.
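The following minimal sketch requests one anonymous 2 MiB huge page via mmap with MAP_HUGETLB; it assumes an x86-64 system where the administrator has reserved huge pages (for example through /proc/sys/vm/nr_hugepages) and fails otherwise:

```c
/* Sketch: map one anonymous 2 MiB huge page. Fails with ENOMEM if
 * no huge pages have been reserved by the administrator. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;  /* one 2 MiB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    ((char *)p)[0] = 1;  /* touch the page so it is actually allocated */
    munmap(p, len);
    return 0;
}
```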
Anonymous Memory and File Backed Memory
Linux distinguishes between anonymous memory and file backed memory. Anonymous memory is not directly backed by a filesystem object. This includes heap allocations from malloc, stacks, and many types of temporary buffers. File backed memory comes from memory mapped files, shared libraries, the executable image, and other data that lives on disk as a file.
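The difference is easy to see with mmap, which creates both kinds of mapping. In this sketch the file name data.bin is illustrative; the program assumes such a file exists and is non-empty:

```c
/* Sketch contrasting anonymous and file backed mappings. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Anonymous memory: zero filled, backed by swap if reclaimed. */
    char *anon = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* File backed memory: pages come from the page cache and, when
     * clean, can be dropped and re-read from disk on the next fault. */
    int fd = open("data.bin", O_RDONLY);
    char *filep = fd >= 0 ? mmap(NULL, 4096, PROT_READ,
                                 MAP_PRIVATE, fd, 0)
                          : MAP_FAILED;

    if (anon != MAP_FAILED)
        printf("anonymous page starts zeroed: %d\n", anon[0]);
    if (filep != MAP_FAILED)
        printf("first byte of file: %d\n", filep[0]);

    if (filep != MAP_FAILED) munmap(filep, 4096);
    if (fd >= 0) close(fd);
    if (anon != MAP_FAILED) munmap(anon, 4096);
    return 0;
}
```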
Anonymous pages start out zero filled, and when the kernel needs to reclaim them it must write them to swap, because there is no file to re-read them from. File backed pages represent some offset in a file, such as your program’s binary or a shared library. If a file backed page is clean and not currently used, it can simply be discarded from memory because it can be read again from disk when needed.
This distinction is important for memory reclamation. The kernel prefers to drop clean file backed pages instead of swapping out anonymous pages, since dropping them does not require extra writes to disk. Dirty file backed pages must be written back to their files before they can be reclaimed.
Memory Allocation in the Kernel
Inside the kernel, memory allocation is more complex than simple user space allocations. The kernel cannot afford to block indefinitely or to fail unpredictably, and it must deal with different sizes and lifetimes of allocations.
At the lowest level, the kernel manages physical pages with the buddy allocator. This allocator maintains free lists of blocks of memory whose sizes are powers of two. When an allocation request for one or more contiguous pages comes in, the allocator splits larger blocks into smaller ones as necessary. When blocks are freed, the allocator tries to merge buddies back into larger blocks to reduce fragmentation.
Above the page level, the kernel uses slab like allocators to handle frequent small allocations of kernel objects. These allocators maintain caches of already initialized objects of specific types. When a new object is needed, the allocator can often hand out an object from the cache instead of allocating fresh pages and initializing them from scratch. This improves performance and locality.
Kernel allocations often specify allocation flags that control behavior. For example, an allocation may request that it not sleep, or that it can wait for I/O to free memory, or that it must use memory that is directly addressable by hardware devices. These flags guide the allocator and influence memory reclaim decisions.
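A hedged sketch of these interfaces, written as a minimal kernel module; the object type and cache name are illustrative rather than taken from any real driver:

```c
/* Sketch of slab-cache use inside a kernel module; "my_obj" and the
 * cache name are illustrative. */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

struct my_obj {
    int id;
    char payload[120];
};

static struct kmem_cache *my_cache;

static int __init my_init(void)
{
    struct my_obj *obj;

    /* A cache of identically sized objects, served from slab pages. */
    my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                 0, 0, NULL);
    if (!my_cache)
        return -ENOMEM;

    /* GFP_KERNEL may sleep and wait for reclaim; GFP_ATOMIC (used in
     * interrupt context) may not. */
    obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
    if (obj)
        kmem_cache_free(my_cache, obj);
    return 0;
}

static void __exit my_exit(void)
{
    kmem_cache_destroy(my_cache);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```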
User Space Allocators and the Kernel Interface
In user space, applications do not call the kernel directly for every small allocation. Instead, they use memory allocation libraries such as glibc’s malloc. These allocators request larger chunks of memory from the kernel through system calls such as brk for the traditional heap and mmap for mapped regions.
The allocator then manages that memory internally, carving it into smaller blocks for the application. When the application frees memory, the allocator may keep it in its own caches for reuse instead of returning it immediately to the kernel. This behavior explains why a process’s resident set size may not shrink immediately when it frees memory.
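With glibc, this caching behavior can be partially overridden: the glibc-specific malloc_trim function asks the allocator to return unused heap pages to the kernel. A minimal sketch:

```c
/* Sketch: freed heap memory is cached by the allocator; malloc_trim
 * asks glibc to hand unused pages back to the kernel. */
#include <malloc.h>   /* malloc_trim is a glibc extension */
#include <stdlib.h>

int main(void)
{
    char *bufs[64];

    /* Allocate and free a batch of mid-sized blocks; the allocator
     * typically keeps the freed space cached for reuse. */
    for (int i = 0; i < 64; i++)
        bufs[i] = malloc(64 * 1024);
    for (int i = 0; i < 64; i++)
        free(bufs[i]);

    /* Return unused heap pages to the kernel; the process's resident
     * set may shrink only after this call. */
    malloc_trim(0);
    return 0;
}
```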
When large allocations are required, modern allocators often use mmap to obtain separate regions, which can then be returned to the kernel by unmapping them. This allows some memory regions to be fully reclaimed and helps reduce fragmentation within the user space heap.
Copy on Write and Fork
Copy on write is a key optimization used by Linux in several contexts, especially when a process calls fork. When fork is invoked, the kernel creates a new process that initially shares the same physical pages as the parent. Instead of copying all the pages immediately, the kernel marks shared pages as read only and records that they are shared.
If either the parent or the child later attempts to write to one of these pages, a page fault occurs. The kernel handles this fault by allocating a new page, copying the contents from the shared page into the new page, and updating the page table entry for the writing process to point to the private copy with write permission. The other process continues to use the original page.
This strategy preserves the semantics of separate address spaces while avoiding unnecessary copying for pages that are never modified in the child. It is particularly effective in combination with exec, where the child soon replaces its address space with a new program and most of the inherited pages are never written.
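The effect is easy to observe. In this small demonstration, the child's write triggers a copy on write fault and lands in a private copy, so the parent's value is untouched:

```c
/* Sketch: after fork, parent and child share pages copy on write. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int *value = malloc(sizeof(*value));
    *value = 42;

    pid_t pid = fork();
    if (pid == 0) {                 /* child */
        *value = 99;                /* page fault -> private copy */
        printf("child sees %d\n", *value);
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", *value);  /* prints 42 */
    free(value);
    return 0;
}
```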
Copy on write is also used for shared memory mappings and for some filesystem features in copy on write filesystems. In all cases, the core idea is to delay copying until a write actually happens.
Demand Paging and Page Faults
Linux uses demand paging for both anonymous and file backed memory. When a process starts, the kernel does not load all its code and data pages into RAM at once. Instead, it sets up page tables that indicate what each region represents, but marks most pages as not present.
When the process accesses a virtual address whose page is not currently mapped, the CPU triggers a page fault exception. The kernel’s page fault handler examines the fault to determine what should be mapped. If the address lies within a valid virtual memory region, the kernel allocates a page frame or loads the required data from disk and updates the page tables. The process then continues as if nothing unusual happened.
If the faulting address lies outside any valid region, or if the access violates permissions, for example writing to a read only page, the kernel signals the process, often with SIGSEGV, leading to a segmentation fault. This mechanism enforces the protection rules for each memory region.
Demand paging combined with page faults allows Linux to start processes quickly, avoid loading unused parts of programs, and share memory efficiently. It also allows transparent use of swap, since swap in and swap out are handled as special kinds of page faults and page writes.
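Page fault activity can be observed with getrusage, which reports minor faults (served from RAM) and major faults (requiring disk I/O). A minimal sketch that touches a freshly mapped anonymous region:

```c
/* Sketch: count the page faults caused by first-touching a mapping. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;
    size_t len = 64 * 1024 * 1024;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    getrusage(RUSAGE_SELF, &before);
    memset(p, 1, len);              /* first touch faults in each page */
    getrusage(RUSAGE_SELF, &after);

    printf("minor faults: %ld, major faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);
    munmap(p, len);
    return 0;
}
```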
The Page Cache and Buffers
Linux uses a unified page cache to store data that is read from or written to files and block devices. When an application reads from a file, the kernel loads the relevant disk blocks into page cache pages. Subsequent reads of the same data can be served directly from RAM, which is far faster than going to disk.
Similarly, when an application writes to a file, the kernel often writes into the page cache first. The page is marked dirty, and the data is eventually flushed to disk by background writeback processes. This delayed write strategy allows multiple writes to be combined and reordered for efficiency.
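A minimal sketch of this behavior: write places data into cache pages and returns immediately, while fsync forces the dirty pages of that file out to disk. The file name scratch.txt is illustrative:

```c
/* Sketch: buffered write into the page cache, then forced writeback. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("scratch.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    const char *msg = "buffered in the page cache\n";
    /* write() normally returns once the data is in cache pages,
     * long before it reaches the disk. */
    if (write(fd, msg, strlen(msg)) < 0)
        perror("write");

    /* Force writeback of this file's dirty pages. */
    if (fsync(fd) != 0)
        perror("fsync");
    close(fd);
    return 0;
}
```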
The page cache is central to performance, and it uses whatever free memory is not needed for other purposes. Linux treats free memory as wasted opportunity, so it fills unused RAM with cache. When applications need more memory, the kernel can quickly reclaim cache pages, especially those that are clean and easily reproducible from disk.
Historically, Linux also used separate buffer caches for block devices. Modern kernels integrate buffering and caching into the page cache, so the distinction is largely internal. From the perspective of memory management, the page cache unifies file I/O and memory usage.
Swap and Anonymous Page Reclamation
Swap extends the system’s apparent memory by providing secondary storage for anonymous pages. When physical RAM is scarce, the kernel may move some anonymous pages to swap space. These pages remain part of the process’s address space, but their contents reside on disk until they are needed again.
When a swapped out page is accessed, a page fault occurs. The kernel reads the page back from swap into a new page frame, updates the page tables, and resumes the process. To create room in RAM, the kernel may swap out or drop other pages.
Linux maintains data structures to track which anonymous pages are eligible for swapping and which swap slots are used. Anonymous pages that have not been accessed recently and that are not pinned for I/O or other reasons are more likely to be swapped out.
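Applications that must keep data resident, such as latency-sensitive services or programs holding secrets, can pin pages themselves with mlock, which excludes them from swapping. A minimal sketch (on Linux, mlock rounds the address down to a page boundary; the call may require privileges or a sufficient RLIMIT_MEMLOCK):

```c
/* Sketch: pin a buffer in RAM so it is never swapped out. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    void *buf = malloc(len);
    if (!buf)
        return 1;

    if (mlock(buf, len) != 0) {       /* pin: excluded from swap */
        perror("mlock");
        free(buf);
        return 1;
    }
    memset(buf, 0, len);              /* safe to use; stays resident */
    munlock(buf, len);                /* make it swappable again */
    free(buf);
    return 0;
}
```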
Swap is not the same as a pure extension of RAM, because swap space is much slower. Heavy swapping leads to performance degradation, sometimes called thrashing, when the system spends more time moving pages between disk and RAM than doing useful work. The kernel’s memory management strives to avoid this state by balancing reclaim decisions between cache and anonymous memory.
Memory Zones and NUMA
Physical memory in Linux is divided into zones based on hardware and architectural constraints. Typical zones include DMA capable memory for legacy devices, normal memory for most kernel and user space uses, and high memory on architectures where some RAM lies beyond the range that can be permanently mapped into the kernel’s address space.
Each zone has its own free lists and management policies. When the kernel allocates memory, it chooses an appropriate zone based on flags and availability. This ensures that special allocations, such as those that must be directly accessible by certain devices, come from suitable regions.
On Non Uniform Memory Access (NUMA) systems, physical memory is further grouped into nodes, each associated with one or more CPUs. Accessing memory on the local node is faster than accessing memory on a remote node. Linux is NUMA aware and tries to allocate memory on the node where the process is running, in order to improve locality and performance.
NUMA adds complexity to memory management because the kernel must consider not only how much free memory exists, but also where it is located. Load balancing across nodes and migrating processes or pages between nodes are part of advanced NUMA management.
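Applications can override the default local-allocation policy. A sketch using libnuma (link with -lnuma), assuming a NUMA-capable kernel; the node number here is an illustrative choice:

```c
/* Sketch: allocate memory on a specific NUMA node with libnuma. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("NUMA not available on this system\n");
        return 0;
    }

    int node = 0;                     /* illustrative node choice */
    size_t len = 1 << 20;

    /* Place the allocation on one node explicitly instead of letting
     * the default local-allocation policy decide. */
    void *p = numa_alloc_onnode(len, node);
    if (!p)
        return 1;
    printf("allocated %zu bytes on node %d of %d\n",
           len, node, numa_max_node() + 1);
    numa_free(p, len);
    return 0;
}
```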
Memory Overcommit and OOM
Linux supports memory overcommit, which means that it can allow processes to request more memory than the system has in physical RAM and swap combined. This is possible because many allocated pages are never actually used, and because copy on write and sharing reduce the real footprint. Overcommit can improve utilization but it carries risk.
The kernel can operate in different overcommit modes, controlled by the vm.overcommit_memory sysctl. In the default heuristic mode, it refuses only requests that are obviously excessive, trusting that most allocations will never be fully used. In always-overcommit mode, it accepts any request. In strict mode, it ensures that the total committed memory does not exceed a limit computed from physical memory, swap, and a configurable overcommit ratio. A simplified form of this commit limit is
$$
\text{commit\_limit} = \text{swap\_total} + \text{RAM\_total} \times \frac{\text{overcommit\_ratio}}{100}
$$
This value caps how much memory can be committed to processes in strict mode.
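For example, with hypothetical values of 16 GiB of RAM, 8 GiB of swap, and the default overcommit ratio of 50, the limit works out to

$$
\text{commit\_limit} = 8\ \text{GiB} + 16\ \text{GiB} \times \frac{50}{100} = 16\ \text{GiB}
$$

and in strict mode, allocation requests that would push total committed memory past this limit fail.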
When memory and swap are exhausted and the kernel cannot satisfy an allocation even after reclaim, it may invoke the Out Of Memory killer. The OOM killer selects one or more processes to terminate in order to free memory. It uses scoring heuristics that consider process size, importance, and other factors. Processes can also influence their own OOM score through the per-process oom_score_adj tuning parameter.
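For example, a process that prefers to be sacrificed first can raise its own score. A minimal sketch using the oom_score_adj interface (values range from -1000, exempt, to 1000, preferred victim; lowering the value requires privilege):

```c
/* Sketch: mark the current process as a preferred OOM victim. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    /* Positive values make the OOM killer favor this process. */
    fprintf(f, "500\n");
    fclose(f);
    return 0;
}
```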
Memory overcommit increases the risk of such OOM events if it is not configured carefully.
Page Reclaim and the LRU Lists
To keep enough free memory available, the kernel runs page reclaim algorithms. It maintains least recently used (LRU) style lists of pages: separate lists for active and inactive pages, each further split into file backed and anonymous pages. Pages that are accessed frequently tend to stay on the active lists, while those that are not migrate to the inactive lists.
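On modern kernels, the sizes of these four lists are reported in /proc/meminfo. A minimal sketch that prints them:

```c
/* Sketch: print the sizes of the four LRU lists from /proc/meminfo. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Active(anon)", 12) == 0 ||
            strncmp(line, "Inactive(anon)", 14) == 0 ||
            strncmp(line, "Active(file)", 12) == 0 ||
            strncmp(line, "Inactive(file)", 14) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```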
When memory pressure occurs, the kernel scans the inactive lists to find pages to reclaim. For file backed pages, if they are clean, they can be dropped immediately. If they are dirty, they must be written back before eviction. For anonymous pages, reclaim usually means swapping them out if swap space is available.
The reclaim process is incremental and balanced. The kernel tries to avoid over reclaiming from one type of memory, such as page cache, while leaving too many anonymous pages untouched, or vice versa. It also adapts its scanning aggressiveness depending on how severe the memory pressure is.
Direct reclaim occurs when a process’s own allocation triggers reclaim and the process helps with reclaim work. Background reclaim is carried out by dedicated kernel threads that try to maintain a target amount of free memory. This combination aims to smooth out memory pressure and prevent stalls.
Shared Memory and Memory Mapping
Linux supports shared memory and memory mapped files as first class citizens of the virtual memory system. When a file is memory mapped with mmap, the kernel maps file pages directly into the process’s address space. Reads and writes to that memory are translated into page cache accesses and file I/O. This allows efficient file access and sharing between processes that map the same file.
Anonymous shared memory, created through mechanisms such as shm_open or memfd, allows processes to share regions that are not directly backed by a persistent file. These regions exist only as long as they are referenced. The kernel manages them similarly to other virtual memory mappings, with proper reference counting and permission checks.
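A minimal sketch using the POSIX shm_open interface; the region name /demo_region is illustrative, and older glibc versions need linking with -lrt:

```c
/* Sketch: create and map a named shared memory region. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/demo_region", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return 1;
    if (ftruncate(fd, 4096) != 0)     /* give the region a size */
        return 1;

    /* MAP_SHARED: all mappers see each other's writes through the
     * same physical pages. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    strcpy(p, "visible to every process that maps /demo_region");

    munmap(p, 4096);
    close(fd);
    shm_unlink("/demo_region");       /* region dies with last user */
    return 0;
}
```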
Sharing works because multiple processes can have their page tables point to the same physical pages. The kernel reference counts these pages, and only when the last reference disappears can the page be freed. If any process writes to a shared writable mapping that should not change for others, the kernel may combine this with copy on write semantics.
Memory mapping integrates tightly with the page cache. A file page that is mapped into memory and also used for buffered I/O is usually represented by the same physical page frame in the cache. This unification avoids duplication and improves consistency, since changes through one path are visible through the other.
Kernel Memory vs User Memory
Kernel memory exists in a different domain from user space memory. It is always mapped in the kernel portion of the address space and is not directly accessible from user space. Some kernel memory is permanent during the system’s runtime, such as core data structures and code. Other kernel memory is dynamically allocated as needed and freed when no longer used.
Kernel memory must be managed carefully because leaks or fragmentation cannot easily be recovered without rebooting. The allocator subsystems, such as the buddy allocator and slab caches, are designed to provide predictable performance and to handle special constraints.
On 32 bit architectures, the kernel’s own address space is especially constrained, which leads to techniques like high memory and mapping windows. On 64 bit systems, the address space is much larger, so these constraints are less severe, but careful management is still important because physical memory is finite.
User space memory, in contrast, is managed within each process’s virtual address space. When a process exits, the kernel can reclaim all its virtual memory regions and all associated physical pages that are not shared with other processes. This clean separation simplifies resource cleanup and enforces isolation.
Memory Accounting and Control Groups
Linux provides detailed memory accounting through mechanisms such as control groups. Memory cgroups allow the kernel to track and limit memory usage for groups of processes. This is essential in containerized environments, where each container should have its own memory budget.
When a cgroup has a memory limit, the kernel applies its reclaim algorithms within that boundary. If processes in a cgroup allocate more memory than allowed, the cgroup may experience reclaim pressure or OOM events even if the system as a whole has free memory. This allows fine grained control over resource allocation and prevents one workload from starving others.
Memory accounting includes both anonymous and file backed memory, and can be further refined for different kinds of usage, such as kernel memory, page cache, and swap. These controls integrate with the global memory management mechanisms, so that per group limits and global limits are both respected.
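For example, with cgroup v2 a hard limit is set by writing to the group's memory.max file. The sketch below assumes cgroup v2 is mounted at /sys/fs/cgroup and that a group named demo (an illustrative name) has already been created by an administrator:

```c
/* Sketch: set a 256 MiB hard memory limit on a cgroup v2 group. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/fs/cgroup/demo/memory.max", "w");
    if (!f) {
        perror("fopen");   /* group missing or insufficient rights */
        return 1;
    }
    /* Above this limit, the group is reclaimed and, failing that,
     * OOM-killed internally. */
    fprintf(f, "268435456\n");
    fclose(f);
    return 0;
}
```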
Conclusion
Linux memory management combines virtual memory, paging, caching, swapping, and careful allocation strategies into a coherent system that must balance performance, isolation, and robustness. Pages and page tables form the foundation, while higher level mechanisms such as copy on write, demand paging, and the page cache make efficient use of physical memory.
Understanding these internal components helps explain system behavior under load: why processes may be killed when memory is low, why reported free memory often looks small because of caching, and how advanced features like NUMA and cgroups affect memory usage. This knowledge is essential for tuning, debugging, and designing systems that run reliably on Linux.