Big Picture: What Linux Memory Management Actually Does
Linux memory management is about deciding:
- Which data lives in RAM and which is temporarily stored on disk
- How each process sees its own memory
- How to share memory safely
- How to recover memory under pressure without crashing the system
You’ll see the same ideas repeated at different levels:
- Virtual vs physical memory
- Pages as the basic unit of memory
- Overcommit and reclamation of memory
- Caches (page cache, slab cache)
- Swapping and OOM (Out-Of-Memory) behavior
This chapter looks at how Linux implements those ideas internally and how you can observe and influence them.
Virtual Memory and Address Spaces
Each process in Linux sees its own virtual address space (on x86_64 with 4-level page tables, 128 TiB of user-space virtual addresses, far more than the physical RAM actually installed).
Key points:
- The kernel uses paging: virtual memory is broken into fixed-size pages (typically 4 KiB).
- Each process has its own page tables that map virtual pages to:
- Physical frames in RAM
- No frame yet (demand paging)
- Swap
- File-backed pages (from binaries, shared libraries, memory-mapped files)
User vs Kernel Address Space
On most 64-bit Linux systems:
- User space: lower part of the virtual address space
- Kernel space: upper part, same mapping for all processes
User processes cannot directly access kernel space; transitions happen via syscalls, interrupts, etc.
The kernel describes each process’s virtual address layout using the mm_struct and VMAs:
- `mm_struct`: the entire memory layout of a process.
- VMA (Virtual Memory Area): a contiguous region with the same permissions and attributes (e.g. code segment, heap, stack, mmap’ed file).
Each VMA has flags like:
- `VM_READ`, `VM_WRITE`, `VM_EXEC`: permissions
- `VM_GROWSDOWN`: stacks that can grow downward
- `VM_MAYSHARE`/`VM_SHARED`: possibly shared mappings
Paging, TLBs, and Page Faults
Pages and Page Frames
- Page: unit of virtual memory (typically 4 KiB).
- Page frame: physical memory block backing a page.
- Linux manages physical page frames via `struct page` structures.
There are also huge pages:
- Transparent Huge Pages (THP): kernel automatically uses larger pages (e.g. 2 MiB) when beneficial.
- Explicit huge pages: managed via `hugetlbfs` and specific APIs (`mmap` with `MAP_HUGETLB`).
TLB (Translation Lookaside Buffer)
Hardware caches address translations in a TLB:
- Without a TLB, each memory access would require walking page tables – too slow.
- TLB entries are per-core and are flushed/updated when mappings change (e.g. context switch, `mprotect`, `munmap`).
Page Faults
When a process accesses a virtual address that isn’t currently mapped, the CPU raises a page fault and hands control to the kernel:
- Minor (soft) page fault:
- The page is already in RAM but not mapped in this process’s page tables yet.
- Example: copy-on-write page shared with another process.
- Major (hard) page fault:
- The kernel must read the page from disk (file or swap).
- Much slower; excessive major faults indicate memory pressure or poor locality.
Userspace tools like top, ps, pidstat, perf can show minor/major page fault counts.
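To see these counters move, here is a minimal userspace sketch (plain POSIX `getrusage`, no Linux-specific interfaces assumed) that reports its own fault counts before and after touching freshly allocated memory:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Print this process's minor/major page fault counters. */
static void print_faults(const char *label)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("%s: minor=%ld major=%ld\n", label, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    print_faults("before");

    /* Touching freshly allocated anonymous memory causes minor faults:
       the pages are only mapped on first access (demand paging). */
    size_t len = 64 * 1024 * 1024;
    char *buf = malloc(len);
    if (!buf)
        return 1;
    memset(buf, 0xAA, len);

    print_faults("after touching 64 MiB");
    free(buf);
    return 0;
}
```

The jump shows up in the minor counter; major faults would only appear if pages had to be fetched from disk or swap.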
Physical Memory: Zones, Nodes, and `struct page`
To abstract the physical memory hardware, Linux introduces the following concepts:
NUMA Nodes
On NUMA (Non-Uniform Memory Access) systems:
- Memory is split into nodes, each close to a CPU socket.
- Accessing “local” node memory is faster than remote node memory.
- The kernel tries to allocate memory from the local node of the CPU running the process.
NUMA-aware allocators exist in both kernel and userspace (e.g. numactl, libnuma).
Memory Zones
Within each node, memory is divided into zones (logical groupings by physical address range and hardware constraints):
Typical zones:
- `ZONE_DMA`: legacy DMA devices (low addresses)
- `ZONE_NORMAL`: main, directly mapped memory
- `ZONE_HIGHMEM` (32-bit only): not permanently mapped in kernel space
Allocations use GFP flags (e.g. GFP_KERNEL, GFP_USER) that indicate which zones and constraints apply.
The `struct page` Abstraction
Each physical page frame is described by a struct page:
- Contains reference counts
- Flags (e.g. dirty, reserved, in LRU list)
- Links into LRU lists, slab caches, etc.
Kernel code never directly deals with “bare” physical addresses; it works via these struct page objects.
Allocators: From Bytes to Pages and Slabs
Linux uses different allocators layered on top of each other:
- Buddy allocator — manages pools of pages
- Slab subsystem — manages objects inside pages
- Per-CPU caches — reduce contention on global structures
Buddy Allocator
Manages memory in powers of two:
- If you ask for $2^k$ pages and the smallest free block is larger (say $2^{k+1}$ pages), the allocator splits it into two equal buddies, repeating until a block of the requested order is produced.
- When freeing, it coalesces buddies back into larger blocks when both buddies are free.
Key properties:
- Allocates contiguous physical pages (necessary for many kernel structures, DMA, huge pages).
- Works at page granularity (`order = 0` is one page, `order = 1` is 2 pages, etc.).
The GFP flags you see in kernel code determine which zones may be used, how aggressively the allocator may reclaim, and what it is allowed to do (e.g. whether it may block, whether it may dip into emergency reserves).
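For orientation, here is a kernel-module-style sketch (not a complete module) of requesting pages straight from the buddy allocator; the order and the `GFP_KERNEL` choice are illustrative:

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch: ask the buddy allocator for 2^2 = 4 physically contiguous pages.
   GFP_KERNEL means the call may sleep and trigger reclaim if needed. */
static struct page *grab_four_pages(void)
{
    struct page *pages = alloc_pages(GFP_KERNEL, 2);   /* order = 2 */
    if (!pages)
        return NULL;                                   /* allocation can fail */
    return pages;
}

static void release_four_pages(struct page *pages)
{
    __free_pages(pages, 2);                            /* must match the order */
}
```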
Slab / SLUB / SLOB Allocators
Above the buddy allocator, Linux uses slab-based allocators for small, frequently allocated objects:
- Slab caches (`kmem_cache`) manage objects of a given fixed size and type.
- Examples: dentry cache, inode cache, `task_struct`, etc.
Three implementations have existed (SLAB, SLUB, and SLOB), but SLUB is the default on modern kernels:
- Takes pages from buddy allocator and carves them into objects.
- Uses per-CPU partial/full slab lists to minimize locking.
- Tracks internal fragmentation and object coloring to reduce cache conflicts.
slabtop lets you inspect kernel slab usage from userspace.
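The usual kernel pattern for a dedicated slab cache looks roughly like the sketch below; `struct foo` and the cache name are hypothetical, chosen only to show the `kmem_cache_*` calls:

```c
#include <linux/errno.h>
#include <linux/slab.h>

/* Hypothetical object type, for illustration only. */
struct foo {
    int id;
    char name[32];
};

static struct kmem_cache *foo_cache;

static int foo_cache_init(void)
{
    /* Create a cache of fixed-size struct foo objects, carved out of
       pages obtained from the buddy allocator. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    return foo_cache ? 0 : -ENOMEM;
}

static void foo_cache_example(void)
{
    struct foo *obj = kmem_cache_alloc(foo_cache, GFP_KERNEL);
    if (obj) {
        obj->id = 1;
        kmem_cache_free(foo_cache, obj);
    }
}

static void foo_cache_exit(void)
{
    kmem_cache_destroy(foo_cache);
}
```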
Process Memory Layout: Heap, Stack, and Mappings
Inside a process’s virtual address space, typical regions include:
- Text (code): executable instructions
- Data/BSS: statically allocated, global data
- Heap: dynamically allocated via `malloc`, `new`; grows upward
- Stack: local variables, function call frames; grows downward
- Memory-mapped regions: shared libraries, `mmap`’ed files, anonymous mappings
The kernel doesn’t care about C-level notions like “heap” or “stack”; it just knows:
- VMAs with specific flags (e.g. expanding down, no-exec, etc.)
- Mappings backed by files vs anonymous memory
Userspace allocators (glibc malloc, jemalloc, tcmalloc, etc.) decide how to:
- Use `brk`/`sbrk` for heap extension
- Use `mmap` for large allocations or arena management
/proc/<pid>/maps shows VMAs; /proc/<pid>/smaps gives detailed per-VMA stats.
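A minimal sketch of the `mmap` path (plain POSIX calls, no allocator internals assumed) shows how a large request can be satisfied with an anonymous, private mapping, which then appears as its own VMA in `/proc/self/maps`:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 1024 * 1024;   /* 16 MiB: the kind of size an allocator
                                        typically hands off to mmap */

    /* Anonymous, private mapping: demand-paged, zero-filled, not file-backed. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(p, 0, len);               /* touching the pages faults them in */

    /* While mapped, this region shows up as an anonymous VMA in /proc/self/maps. */
    munmap(p, len);
    return 0;
}
```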
Copy-on-Write (CoW) and Fork
Linux optimizes memory usage during fork() using copy-on-write:
- When a process forks, parent and child initially share the same physical pages for code, data, heap, etc.
- Those shared pages are mapped read-only in both processes.
- On a write:
- A page fault occurs (protection fault).
- Kernel allocates a new page, copies data, updates the faulting process’s page tables.
- Only the writer gets the new private page; the other keeps the original.
This allows:
- Very cheap `fork` of large processes (crucial for the traditional “fork+exec” model).
- Efficient `vfork`/`clone` variations.
CoW is also used by:
- `fork()` without immediate `exec`
- `MAP_PRIVATE` file mappings
- Some file systems (Btrfs, ZFS) at the block level (a different layer, but the same concept).
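A small sketch (standard `fork`, nothing beyond POSIX assumed) makes the CoW semantics visible: parent and child start out sharing the same physical page, and the child's write transparently gets a private copy:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One heap page, shared copy-on-write across fork. */
    int *value = malloc(sizeof *value);
    if (!value)
        return 1;
    *value = 42;

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* Child: the write triggers a CoW fault; the kernel copies the page
           and remaps it privately before the store completes. */
        *value = 99;
        printf("child sees %d\n", *value);    /* 99 */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", *value); /* 42: its page was untouched */
    free(value);
    return 0;
}
```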
Page Cache and Disk I/O
Linux treats unused RAM as a cache for file data to speed up I/O:
Page Cache
- When reading from a file, data is stored in the page cache.
- Subsequent reads can come from RAM instead of disk.
- Write operations update the page cache and mark pages dirty; kernel writes them back later.
Properties:
- Global, system-wide cache for all file-backed pages.
- Uses LRU (approximate) lists to track frequently vs rarely used pages.
- Dropping the page cache (`echo 1 > /proc/sys/vm/drop_caches`) frees cached data, but hurts I/O performance until the cache warms up again.
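Page cache residency can be inspected from userspace with `mincore`; the sketch below (which assumes an existing readable file is passed as `argv[1]`) maps a file and counts how many of its pages are currently cached:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* Map the file; these pages are shared with the page cache. */
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + page - 1) / page;
    unsigned char vec[npages];   /* one residency byte per page */

    /* mincore reports, per page, whether it is resident in RAM. */
    if (mincore(p, st.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
        printf("%zu of %zu pages in the page cache\n", resident, npages);
    }

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```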
Writeback
The kernel’s writeback subsystem:
- Flushes dirty pages to disk in the background.
- Tries to avoid bursts that would stall processes.
- Under memory pressure or explicit sync operations (`fsync`, `sync`), writeback accelerates.
Tunable parameters in /proc/sys/vm/ (e.g. dirty_ratio, dirty_background_ratio) influence when writeback kicks in.
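When waiting for background writeback is not acceptable, applications force it themselves; here is a minimal sketch (hypothetical file path, ordinary POSIX calls) of writing data and flushing the resulting dirty pages immediately:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical output file, for illustration only. */
    int fd = open("/tmp/writeback-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "this write lands in the page cache first\n";

    /* write() returns once the data sits in dirty page-cache pages;
       the kernel's writeback threads would flush them later. */
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); close(fd); return 1; }

    /* fsync() forces the dirty pages (and metadata) to stable storage now,
       instead of waiting for the dirty_background_ratio / dirty_ratio thresholds. */
    if (fsync(fd) < 0) { perror("fsync"); close(fd); return 1; }

    close(fd);
    return 0;
}
```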
Reclaim, Swapping, and Memory Pressure
Reclaimable Memory
When the system is low on free pages, the kernel tries to reclaim memory:
- File-backed clean pages: easiest to reclaim (just drop; they can be re-read from disk).
- File-backed dirty pages: must be written back before being reclaimed.
- Anonymous pages (e.g. heap, stack): cannot be dropped; can be swapped out if swap is enabled.
The reclaim logic uses:
- Per-zone LRU lists (active/inactive anonymous and file lists).
- Heuristics (age, reference bits) to decide what to reclaim.
Swapping
If reclaiming caches isn’t enough, Linux may use swap:
- Anonymous pages are written to swap space on a block device.
- The page table entries that referenced them are replaced with swap entries recording where the data now lives.
- Accessing a swapped-out page causes a major page fault to bring it back.
Swap is controlled by:
- Presence/size of swap devices/files (`swapon -s`, `/proc/swaps`).
- `vm.swappiness`: kernel preference for swapping anonymous pages vs dropping page cache.
Swapping is essential for:
- Handling unpredictable memory spikes gracefully.
- Allowing some degree of overcommit with less risk of OOM.
But excessive swapping causes thrashing and poor performance.
Overcommit and the OOM Killer
Linux allows memory overcommit by default:
- Processes can be promised more virtual memory than the system can actually back with RAM plus swap.
- Many allocations are never fully used, so this usually works out.
The overcommit behavior is tunable via:
- `/proc/sys/vm/overcommit_memory`
  - `0`: heuristic overcommit (default)
  - `1`: always overcommit (dangerous)
  - `2`: don’t overcommit beyond a strict limit
- `/proc/sys/vm/overcommit_ratio` / `/proc/sys/vm/overcommit_kbytes`: define the limit for mode 2.
OOM (Out-Of-Memory) Killer
When the kernel cannot satisfy an allocation even after reclaim and swap, it may trigger the OOM killer:
- Selects one or more processes to kill to free memory.
- Selection based on badness score:
- Memory usage
- `oom_score_adj` (user-tunable per process)
- Process priority, root vs user, etc.
Files:
- `/proc/<pid>/oom_score`: current badness score.
- `/proc/<pid>/oom_score_adj`: adjust vulnerability (-1000 to +1000).
Services like databases often set:
- `oom_score_adj` low (more protected).
- Separate watchdogs to restart them if killed.
The kernel logs OOM events to dmesg / journalctl.
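Adjusting a process's exposure to the OOM killer is a plain procfs write; here is a minimal sketch that lowers its own `oom_score_adj` (the value -500 is purely illustrative, and lowering the score requires sufficient privileges):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Make this process less attractive to the OOM killer.
       Lowering the value below 0 typically requires root / CAP_SYS_RESOURCE. */
    const char *adj = "-500";   /* illustrative value; valid range is -1000..+1000 */

    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, adj, strlen(adj)) < 0)
        perror("write");        /* fails without sufficient privileges */

    close(fd);
    return 0;
}
```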
Kernel Memory vs User Memory
The kernel has its own memory needs independent of processes:
- Page tables
- Slab caches (inodes, dentries, network buffers, etc.)
- Kernel stacks
- Buffers for drivers, DMA, network
Unlike userspace:
- Kernel allocations are not swappable (swap, including zswap/zram, only ever holds user pages).
- The kernel must avoid deadlocks when allocating (e.g. it cannot use a blocking allocation in interrupt context; such paths use `GFP_ATOMIC`).
Allocation APIs:
- `kmalloc`, `kzalloc`, `vmalloc`, `alloc_pages` (and variants) with GFP flags.
- For persistent object types, slab caches via `kmem_cache_create`.
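A rough sketch of the context rules (kernel-module style, not a complete module): process context can use `GFP_KERNEL` and sleep, while atomic/interrupt context must use `GFP_ATOMIC` and handle failure:

```c
#include <linux/slab.h>

/* Process context: the allocation may sleep, reclaim, or wait for I/O. */
static void *alloc_in_process_context(size_t size)
{
    return kzalloc(size, GFP_KERNEL);   /* zeroed allocation, may block */
}

/* Interrupt / atomic context: blocking is forbidden, so use GFP_ATOMIC,
   which draws on reserves and fails fast instead of sleeping. */
static void *alloc_in_irq_context(size_t size)
{
    return kzalloc(size, GFP_ATOMIC);
}
```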
Kernel memory leaks or fragmentation can starve the system even if user processes look small.
NUMA-Aware Memory Management
On NUMA machines, memory locality matters:
- Each NUMA node has its own free lists, page cache, reclaim activity.
- Kernel tries to allocate memory from the local node of the current CPU.
Policies:
- Default: local-first, fallback to other nodes on pressure.
- Userspace can request specific policies via:
  - `mbind`, `set_mempolicy`
  - `numactl` (run a process with specific node bindings).
Key concepts:
- Interleaving: spread allocations round-robin across nodes.
- Binding: restrict allocations to specific nodes.
- Preferred: try one node first, then fallback.
Tools like numastat and numactl --hardware show NUMA-related memory statistics.
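As an illustration of binding, here is a sketch using libnuma (link with `-lnuma`; the example assumes node 0 exists and that NUMA support is available on the system):

```c
#include <numa.h>      /* libnuma; link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    /* Binding: allocate 8 MiB backed by memory on node 0 only. */
    size_t len = 8 * 1024 * 1024;
    void *p = numa_alloc_onnode(len, 0);
    if (!p)
        return 1;

    memset(p, 0, len);   /* touch the pages so they are actually allocated */

    printf("highest node id: %d\n", numa_max_node());

    numa_free(p, len);
    return 0;
}
```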
Huge Pages and THP
Memory management at 4 KiB page granularity has overhead:
- Large working sets ⇒ many page table entries ⇒ TLB pressure.
- Huge pages mitigate this.
Transparent Huge Pages (THP)
- Kernel automatically groups contiguous 4 KiB pages into larger (e.g. 2 MiB) huge pages when possible.
- Transparent: applications don’t need modification.
- Good for workloads with large, contiguous memory allocations (databases, JVMs).
Control:
- `/sys/kernel/mm/transparent_hugepage/enabled`
- `/sys/kernel/mm/transparent_hugepage/defrag`
Explicit Huge Pages
Applications that want fine-grained control can:
- Use `hugetlbfs` mounted at some path.
- Allocate huge pages via `mmap(..., MAP_HUGETLB, ...)`.
Benefit: more predictable behavior, strict reservation of huge page pools.
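A minimal sketch of the `MAP_HUGETLB` path (it assumes huge pages have been reserved, e.g. via `vm.nr_hugepages`, and a 2 MiB huge page size; otherwise the `mmap` fails):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* Length must be a multiple of the huge page size (2 MiB assumed here). */
    size_t len = 2 * 1024 * 1024;

    /* Anonymous mapping explicitly backed by the hugetlb pool. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap MAP_HUGETLB");   /* fails if no huge pages are reserved */
        return 1;
    }

    memset(p, 0, len);                /* one huge page backs the whole region */
    munmap(p, len);
    return 0;
}
```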
Zswap, Zram, and Compressed Memory
To extend effective RAM without hitting disk as often, Linux can compress pages:
Zswap
- A compressed cache for swap pages in RAM.
- When pages would be swapped out, they’re compressed and stored in RAM.
- If zswap becomes full, old entries are written to real swap devices.
Enabled via kernel parameters and /sys/module/zswap/parameters/*.
Zram
- Creates a compressed block device in RAM (e.g. `/dev/zram0`).
- Can be used as swap space or for temporary filesystems.
- Particularly useful on memory-constrained systems (embedded, VMs).
Observing Memory Internals from Userspace
Linux exposes a lot of memory information in /proc and /sys:
- `/proc/meminfo`: global memory statistics (cached, buffers, swap, etc.).
- `/proc/<pid>/status`: per-process RSS, virtual size, etc.
- `/proc/<pid>/smaps`: detailed per-VMA usage (PSS, RSS, shared/anonymous breakdown).
Common tools:
- `free`, `vmstat`, `top`, `htop`: high-level overview.
- `slabtop`: slab allocator statistics.
- `numastat`: per-NUMA node memory stats.
- `perf`, `bcc`/`bpftrace` tools: deeper analysis of page faults, TLB misses, reclaim activity.
Interpreting these outputs effectively is a core skill for performance tuning and debugging.
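As a tiny example of consuming these interfaces programmatically, here is a sketch that pulls `MemTotal` and `MemAvailable` out of `/proc/meminfo` (field names as documented in proc(5)):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        /* Lines look like "MemAvailable:   12345678 kB". */
        if (strncmp(line, "MemTotal:", 9) == 0 ||
            strncmp(line, "MemAvailable:", 13) == 0)
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}
```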
Memory Control Groups (Overview Only)
Cgroups v2 (and v1 memory controller) provide resource control over memory:
- Limit memory for groups of processes.
- Apply policies for reclaim, swap, OOM within the group.
Key ideas:
- Memory limit and soft limit.
- Per-cgroup OOM events.
- Reclaim pressure metrics.
The detailed mechanics of cgroups are covered elsewhere; the important point here is that many memory management decisions now operate both globally and per-cgroup.
Summary
Linux memory management revolves around:
- Virtual memory and per-process address spaces.
- Paging, TLB, and handling page faults.
- A hierarchy of allocators: buddy, slab/SLUB, per-CPU caches.
- Copy-on-write to optimize duplication and sharing.
- Page cache, reclaim, and swap to balance performance and capacity.
- Overcommit and OOM killer to handle worst-case scenarios.
- NUMA, huge pages, and compressed memory (zswap, zram) for scalability and efficiency.
- Extensive /proc and /sys interfaces to observe and tune the system.
Understanding these mechanisms is crucial for advanced performance analysis, debugging subtle memory issues, and building efficient, reliable Linux systems at scale.