Big Picture: What Linux Memory Management Actually Does
Linux memory management is about deciding:
- Which data lives in RAM and which is temporarily stored on disk
- How each process sees its own memory
- How to share memory safely
- How to recover memory under pressure without crashing the system
You’ll see the same ideas repeated at different levels:
- Virtual vs physical memory
- Pages as the basic unit of memory
- Overcommit and reclamation of memory
- Caches (page cache, slab cache)
- Swapping and OOM (Out-Of-Memory) behavior
This chapter looks at how Linux implements those ideas internally and how you can observe and influence them.
Virtual Memory and Address Spaces
Each process in Linux sees its own virtual address space (on x86_64 with 4-level page tables, 128 TiB of user-space virtual addresses, far more than the physical RAM actually installed).
Key points:
- The kernel uses paging: virtual memory is broken into fixed-size pages (typically 4 KiB).
- Each process has its own page tables that map virtual pages to:
- Physical frames in RAM
- No frame yet (demand paging)
- Swap
- File-backed pages (from binaries, shared libraries, memory-mapped files)
User vs Kernel Address Space
On most 64-bit Linux systems:
- User space: lower part of the virtual address space
- Kernel space: upper part, same mapping for all processes
User processes cannot directly access kernel space; transitions happen via syscalls, interrupts, etc.
The kernel describes each process’s virtual address layout using the mm_struct and VMAs:
- `mm_struct`: the entire memory layout of a process.
- VMA (Virtual Memory Area): a contiguous region with the same permissions and attributes (e.g. code segment, heap, stack, mmap’ed file).
Each VMA has flags like:
- `VM_READ`, `VM_WRITE`, `VM_EXEC`: permissions
- `VM_GROWSDOWN`: stacks that can grow downward
- `VM_MAYSHARE`/`VM_SHARED`: possibly shared mappings
Paging, TLBs, and Page Faults
Pages and Page Frames
- Page: unit of virtual memory (typically 4 KiB).
- Page frame: physical memory block backing a page.
- Linux manages physical page frames via `struct page` structures.
There are also huge pages:
- Transparent Huge Pages (THP): kernel automatically uses larger pages (e.g. 2 MiB) when beneficial.
- Explicit huge pages: managed via `hugetlbfs` and specific APIs (`mmap` with `MAP_HUGETLB`).
TLB (Translation Lookaside Buffer)
Hardware caches address translations in a TLB:
- Without a TLB, each memory access would require walking page tables – too slow.
- TLB entries are per-core and are flushed/updated when mappings change (e.g. context switch, `mprotect`, `munmap`).
Page Faults
When a process accesses a virtual address that isn’t currently mapped, the CPU raises a page fault and hands control to the kernel:
- Minor (soft) page fault:
- The page is already in RAM but not mapped in this process’s page tables yet.
- Example: copy-on-write page shared with another process.
- Major (hard) page fault:
- The kernel must read the page from disk (file or swap).
- Much slower; excessive major faults indicate memory pressure or poor locality.
Userspace tools like top, ps, pidstat, perf can show minor/major page fault counts.
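To see these counters move, here is a minimal userspace sketch (plain POSIX `getrusage`, no Linux-specific interfaces assumed) that reports its own fault counts before and after touching freshly allocated memory:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Print this process's minor/major page fault counters. */
static void print_faults(const char *label)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("%s: minor=%ld major=%ld\n", label, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    print_faults("before");

    /* Touching freshly allocated anonymous memory causes minor faults:
       the pages are only mapped on first access (demand paging). */
    size_t len = 64 * 1024 * 1024;
    char *buf = malloc(len);
    if (!buf)
        return 1;
    memset(buf, 0xAA, len);

    print_faults("after touching 64 MiB");
    free(buf);
    return 0;
}
```

The jump shows up in the minor counter; major faults would only appear if pages had to be fetched from disk or swap.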
Physical Memory: Zones, Nodes, and `struct page`
To abstract the physical memory hardware, Linux introduces the following concepts:
NUMA Nodes
On NUMA (Non-Uniform Memory Access) systems:
- Memory is split into nodes, each close to a CPU socket.
- Accessing “local” node memory is faster than remote node memory.
- The kernel tries to allocate memory from the local node of the CPU running the process.
NUMA-aware allocators exist in both kernel and userspace (e.g. numactl, libnuma).
Memory Zones
Within each node, memory is divided into zones (logical groupings by physical address range and hardware constraints):
Typical zones:
- `ZONE_DMA`: legacy DMA devices (low addresses)
- `ZONE_NORMAL`: main, directly mapped memory
- `ZONE_HIGHMEM` (32-bit only): not permanently mapped in kernel space
Allocations use GFP flags (e.g. GFP_KERNEL, GFP_USER) that indicate which zones and constraints apply.
The `struct page` Abstraction
Each physical page frame is described by a struct page:
- Contains reference counts
- Flags (e.g. dirty, reserved, in LRU list)
- Links into LRU lists, slab caches, etc.
Kernel code never directly deals with “bare” physical addresses; it works via these struct page objects.
Allocators: From Bytes to Pages and Slabs
Linux uses different allocators layered on top of each other:
- Buddy allocator — manages pools of pages
- Slab subsystem — manages objects inside pages
- Per-CPU caches — reduce contention on global structures
Buddy Allocator
Manages memory in powers of two:
- If you ask for $2^k$ pages and the smallest free block is larger (say $2^{k+1}$ pages), the allocator splits it into two equal buddies, repeating until a block of the requested order is produced.
- When freeing, it coalesces buddies back into larger blocks when both buddies are free.
Key properties:
- Allocates contiguous physical pages (necessary for many kernel structures, DMA, huge pages).
- Works at page granularity (`order = 0` is one page, `order = 1` is 2 pages, etc.).
The GFP flags you see in kernel code determine which zones may be used, how aggressively the allocator may reclaim, and what it is allowed to do (e.g. whether it may block, whether it may dip into emergency reserves).
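For orientation, here is a kernel-module-style sketch (not a complete module) of requesting pages straight from the buddy allocator; the order and the `GFP_KERNEL` choice are illustrative:

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch: ask the buddy allocator for 2^2 = 4 physically contiguous pages.
   GFP_KERNEL means the call may sleep and trigger reclaim if needed. */
static struct page *grab_four_pages(void)
{
    struct page *pages = alloc_pages(GFP_KERNEL, 2);   /* order = 2 */
    if (!pages)
        return NULL;                                   /* allocation can fail */
    return pages;
}

static void release_four_pages(struct page *pages)
{
    __free_pages(pages, 2);                            /* must match the order */
}
```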
Slab / SLUB / SLOB Allocators
Above the buddy allocator, Linux uses slab-based allocators for small, frequently allocated objects:
- Slab caches (`kmem_cache`) manage objects of a given fixed size and type.
- Examples: dentry cache, inode cache, `task_struct`, etc.
Three implementations have existed (SLAB, SLUB, and SLOB), but SLUB is the default on modern kernels:
- Takes pages from buddy allocator and carves them into objects.
- Uses per-CPU partial/full slab lists to minimize locking.
- Tracks internal fragmentation and object coloring to reduce cache conflicts.
slabtop lets you inspect kernel slab usage from userspace.
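The usual kernel pattern for a dedicated slab cache looks roughly like the sketch below; `struct foo` and the cache name are hypothetical, chosen only to show the `kmem_cache_*` calls:

```c
#include <linux/errno.h>
#include <linux/slab.h>

/* Hypothetical object type, for illustration only. */
struct foo {
    int id;
    char name[32];
};

static struct kmem_cache *foo_cache;

static int foo_cache_init(void)
{
    /* Create a cache of fixed-size struct foo objects, carved out of
       pages obtained from the buddy allocator. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    return foo_cache ? 0 : -ENOMEM;
}

static void foo_cache_example(void)
{
    struct foo *obj = kmem_cache_alloc(foo_cache, GFP_KERNEL);
    if (obj) {
        obj->id = 1;
        kmem_cache_free(foo_cache, obj);
    }
}

static void foo_cache_exit(void)
{
    kmem_cache_destroy(foo_cache);
}
```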
Process Memory Layout: Heap, Stack, and Mappings
Inside a process’s virtual address space, typical regions include:
- Text (code): executable instructions
- Data/BSS: statically allocated, global data
- Heap: dynamically allocated via `malloc`, `new`; grows upward
- Stack: local variables, function call frames; grows downward
- Memory-mapped regions: shared libraries, `mmap`’ed files, anonymous mappings
The kernel doesn’t care about C-level notions like “heap” or “stack”; it just knows:
- VMAs with specific flags (e.g. expanding down, no-exec, etc.)
- Mappings backed by files vs anonymous memory
Userspace allocators (glibc malloc, jemalloc, tcmalloc, etc.) decide how to:
- Use `brk`/`sbrk` for heap extension
- Use `mmap` for large allocations or arena management
/proc/<pid>/maps shows VMAs; /proc/<pid>/smaps gives detailed per-VMA stats.
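A minimal sketch of the `mmap` path (plain POSIX calls, no allocator internals assumed) shows how a large request can be satisfied with an anonymous, private mapping, which then appears as its own VMA in `/proc/self/maps`:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 1024 * 1024;   /* 16 MiB: the kind of size an allocator
                                        typically hands off to mmap */

    /* Anonymous, private mapping: demand-paged, zero-filled, not file-backed. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(p, 0, len);               /* touching the pages faults them in */

    /* While mapped, this region shows up as an anonymous VMA in /proc/self/maps. */
    munmap(p, len);
    return 0;
}
```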
Copy-on-Write (CoW) and Fork
Linux optimizes memory usage during fork() using copy-on-write:
- When a process forks, parent and child initially share the same physical pages for code, data, heap, etc.
- Those shared pages are mapped read-only in both processes.
- On a write:
- A page fault occurs (protection fault).
- Kernel allocates a new page, copies data, updates the faulting process’s page tables.
- Only the writer gets the new private page; the other keeps the original.
This allows:
- Very cheap `fork` of large processes (crucial for the traditional “fork+exec” model).
- Efficient `vfork`/`clone` variations.
CoW is also used by:
- `fork()` without immediate `exec`
- `MAP_PRIVATE` file mappings
- Some file systems (Btrfs, ZFS) at the block level (a different layer, but the same concept).
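A small sketch (standard `fork`, nothing beyond POSIX assumed) makes the CoW semantics visible: parent and child start out sharing the same physical page, and the child's write transparently gets a private copy:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One heap page, shared copy-on-write across fork. */
    int *value = malloc(sizeof *value);
    if (!value)
        return 1;
    *value = 42;

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* Child: the write triggers a CoW fault; the kernel copies the page
           and remaps it privately before the store completes. */
        *value = 99;
        printf("child sees %d\n", *value);    /* 99 */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", *value); /* 42: its page was untouched */
    free(value);
    return 0;
}
```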
Page Cache and Disk I/O
Linux treats unused RAM as a cache for file data to speed up I/O:
Page Cache
- When reading from a file, data is stored in the page cache.
- Subsequent reads can come from RAM instead of disk.
- Write operations update the page cache and mark pages dirty; kernel writes them back later.
Properties:
- Global, system-wide cache for all file-backed pages.
- Uses LRU (approximate) lists to track frequently vs rarely used pages.
- Dropping the page cache (`echo 1 > /proc/sys/vm/drop_caches`) frees cached data, but hurts I/O performance until the cache warms up again.
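Page cache residency can be inspected from userspace with `mincore`; the sketch below (which assumes an existing readable file is passed as `argv[1]`) maps a file and counts how many of its pages are currently cached:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* Map the file; these pages are shared with the page cache. */
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + page - 1) / page;
    unsigned char vec[npages];   /* one residency byte per page */

    /* mincore reports, per page, whether it is resident in RAM. */
    if (mincore(p, st.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
        printf("%zu of %zu pages in the page cache\n", resident, npages);
    }

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```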
Writeback
The kernel’s writeback subsystem:
- Flushes dirty pages to disk in the background.
- Tries to avoid bursts that would stall processes.
- Under memory pressure or explicit sync operations (`fsync`, `sync`), writeback accelerates.
Tunable parameters in /proc/sys/vm/ (e.g. dirty_ratio, dirty_background_ratio) influence when writeback kicks in.
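When waiting for background writeback is not acceptable, applications force it themselves; here is a minimal sketch (hypothetical file path, ordinary POSIX calls) of writing data and flushing the resulting dirty pages immediately:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical output file, for illustration only. */
    int fd = open("/tmp/writeback-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "this write lands in the page cache first\n";

    /* write() returns once the data sits in dirty page-cache pages;
       the kernel's writeback threads would flush them later. */
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); close(fd); return 1; }

    /* fsync() forces the dirty pages (and metadata) to stable storage now,
       instead of waiting for the dirty_background_ratio / dirty_ratio thresholds. */
    if (fsync(fd) < 0) { perror("fsync"); close(fd); return 1; }

    close(fd);
    return 0;
}
```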
Reclaim, Swapping, and Memory Pressure
Reclaimable Memory
When the system is low on free pages, the kernel tries to reclaim memory:
- File-backed clean pages: easiest to reclaim (just drop; they can be re-read from disk).
- File-backed dirty pages: must be written back before being reclaimed.
- Anonymous pages (e.g. heap, stack): cannot be dropped; can be swapped out if swap is enabled.
The reclaim logic uses:
- Per-zone LRU lists (active/inactive anonymous and file lists).
- Heuristics (age, reference bits) to decide what to reclaim.
Swapping
If reclaiming caches isn’t enough, Linux may use swap:
- Anonymous pages are written to swap space on a block device.
- The page table entries that referenced them are replaced with swap entries recording where the data now lives.
- Accessing a swapped-out page causes a major page fault to bring it back.
Swap is controlled by:
- Presence/size of swap devices/files (`swapon -s`, `/proc/swaps`).
- `vm.swappiness`: kernel preference for swapping anonymous pages vs dropping page cache.
Swapping is essential for:
- Handling unpredictable memory spikes gracefully.
- Allowing some degree of overcommit with less risk of OOM.
But excessive swapping causes thrashing and poor performance.
Overcommit and the OOM Killer
Linux allows memory overcommit by default:
- Processes can be promised more virtual memory than the system can actually back with RAM plus swap.
- Many allocations are never fully used, so this usually works out.
The overcommit behavior is tunable via:
- `/proc/sys/vm/overcommit_memory`
  - `0`: heuristic overcommit (default)
  - `1`: always overcommit (dangerous)
  - `2`: don’t overcommit beyond a strict limit
- `/proc/sys/vm/overcommit_ratio` / `/proc/sys/vm/overcommit_kbytes`: define the limit for mode 2.
OOM (Out-Of-Memory) Killer
When the kernel cannot satisfy an allocation even after reclaim and swap, it may trigger the OOM killer:
- Selects one or more processes to kill to free memory.
- Selection based on badness score:
- Memory usage
- `oom_score_adj` (user-tunable per process)
- Process priority, root vs user, etc.
Files:
- `/proc/<pid>/oom_score`: current badness score.
- `/proc/<pid>/oom_score_adj`: adjust vulnerability (-1000 to +1000).
Services like databases often set:
- `oom_score_adj` low (more protected).
- Separate watchdogs to restart them if killed.
The kernel logs OOM events to dmesg / journalctl.
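Adjusting a process's exposure to the OOM killer is a plain procfs write; here is a minimal sketch that lowers its own `oom_score_adj` (the value -500 is purely illustrative, and lowering the score requires sufficient privileges):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Make this process less attractive to the OOM killer.
       Lowering the value below 0 typically requires root / CAP_SYS_RESOURCE. */
    const char *adj = "-500";   /* illustrative value; valid range is -1000..+1000 */

    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, adj, strlen(adj)) < 0)
        perror("write");        /* fails without sufficient privileges */

    close(fd);
    return 0;
}
```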
Kernel Memory vs User Memory
The kernel has its own memory needs independent of processes:
- Page tables
- Slab caches (inodes, dentries, network buffers, etc.)
- Kernel stacks
- Buffers for drivers, DMA, network
Unlike userspace:
- Kernel allocations are not swappable (swap, including zswap/zram, only ever holds user pages).
- The kernel must avoid deadlocks when allocating (e.g. it cannot use a blocking allocation in interrupt context; such paths use `GFP_ATOMIC`).
Allocation APIs:
- `kmalloc`, `kzalloc`, `vmalloc`, `alloc_pages` (and variants) with GFP flags.
- For persistent object types, slab caches via `kmem_cache_create`.
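A rough sketch of the context rules (kernel-module style, not a complete module): process context can use `GFP_KERNEL` and sleep, while atomic/interrupt context must use `GFP_ATOMIC` and handle failure:

```c
#include <linux/slab.h>

/* Process context: the allocation may sleep, reclaim, or wait for I/O. */
static void *alloc_in_process_context(size_t size)
{
    return kzalloc(size, GFP_KERNEL);   /* zeroed allocation, may block */
}

/* Interrupt / atomic context: blocking is forbidden, so use GFP_ATOMIC,
   which draws on reserves and fails fast instead of sleeping. */
static void *alloc_in_irq_context(size_t size)
{
    return kzalloc(size, GFP_ATOMIC);
}
```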
Kernel memory leaks or fragmentation can starve the system even if user processes look small.
NUMA-Aware Memory Management
On NUMA machines, memory locality matters:
- Each NUMA node has its own free lists, page cache, reclaim activity.
- Kernel tries to allocate memory from the local node of the current CPU.
Policies:
- Default: local-first, fallback to other nodes on pressure.
- Userspace can request specific policies via:
  - `mbind`, `set_mempolicy`
  - `numactl` (run a process with specific node bindings).
Key concepts:
- Interleaving: spread allocations round-robin across nodes.
- Binding: restrict allocations to specific nodes.
- Preferred: try one node first, then fallback.
Tools like numastat and numactl --hardware show NUMA-related memory statistics.
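As an illustration of binding, here is a sketch using libnuma (link with `-lnuma`; the example assumes node 0 exists and that NUMA support is available on the system):

```c
#include <numa.h>      /* libnuma; link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    /* Binding: allocate 8 MiB backed by memory on node 0 only. */
    size_t len = 8 * 1024 * 1024;
    void *p = numa_alloc_onnode(len, 0);
    if (!p)
        return 1;

    memset(p, 0, len);   /* touch the pages so they are actually allocated */

    printf("highest node id: %d\n", numa_max_node());

    numa_free(p, len);
    return 0;
}
```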
Huge Pages and THP
Memory management at 4 KiB page granularity has overhead:
- Large working sets ⇒ many page table entries ⇒ TLB pressure.
- Huge pages mitigate this.
Transparent Huge Pages (THP)
- Kernel automatically groups contiguous 4 KiB pages into larger (e.g. 2 MiB) huge pages when possible.
- Transparent: applications don’t need modification.
- Good for workloads with large, contiguous memory allocations (databases, JVMs).
Control:
- `/sys/kernel/mm/transparent_hugepage/enabled`
- `/sys/kernel/mm/transparent_hugepage/defrag`
Explicit Huge Pages
Applications that want fine-grained control can:
- Use `hugetlbfs` mounted at some path.
- Allocate huge pages via `mmap(..., MAP_HUGETLB, ...)`.
Benefit: more predictable behavior, strict reservation of huge page pools.
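A minimal sketch of the `MAP_HUGETLB` path (it assumes huge pages have been reserved, e.g. via `vm.nr_hugepages`, and a 2 MiB huge page size; otherwise the `mmap` fails):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* Length must be a multiple of the huge page size (2 MiB assumed here). */
    size_t len = 2 * 1024 * 1024;

    /* Anonymous mapping explicitly backed by the hugetlb pool. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap MAP_HUGETLB");   /* fails if no huge pages are reserved */
        return 1;
    }

    memset(p, 0, len);                /* one huge page backs the whole region */
    munmap(p, len);
    return 0;
}
```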
Zswap, Zram, and Compressed Memory
To extend effective RAM without hitting disk as often, Linux can compress pages:
Zswap
- A compressed cache for swap pages in RAM.
- When pages would be swapped out, they’re compressed and stored in RAM.
- If zswap becomes full, old entries are written to real swap devices.
Enabled via kernel parameters and /sys/module/zswap/parameters/*.
Zram
- Creates a compressed block device in RAM (e.g. `/dev/zram0`).
- Can be used as swap space or for temporary filesystems.
- Particularly useful on memory-constrained systems (embedded, VMs).
Observing Memory Internals from Userspace
Linux exposes a lot of memory information in /proc and /sys:
- `/proc/meminfo`: global memory statistics (cached, buffers, swap, etc.).
- `/proc/<pid>/status`: per-process RSS, virtual size, etc.
- `/proc/<pid>/smaps`: detailed per-VMA usage (PSS, RSS, shared/anonymous breakdown).
Common tools:
- `free`, `vmstat`, `top`, `htop`: high-level overview.
- `slabtop`: slab allocator statistics.
- `numastat`: per-NUMA node memory stats.
- `perf`, `bcc`/`bpftrace` tools: deeper analysis of page faults, TLB misses, reclaim activity.
Interpreting these outputs effectively is a core skill for performance tuning and debugging.
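As a tiny example of consuming these interfaces programmatically, here is a sketch that pulls `MemTotal` and `MemAvailable` out of `/proc/meminfo` (field names as documented in proc(5)):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        /* Lines look like "MemAvailable:   12345678 kB". */
        if (strncmp(line, "MemTotal:", 9) == 0 ||
            strncmp(line, "MemAvailable:", 13) == 0)
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}
```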
Memory Control Groups (Overview Only)
Cgroups v2 (and v1 memory controller) provide resource control over memory:
- Limit memory for groups of processes.
- Apply policies for reclaim, swap, OOM within the group.
Key ideas:
- Memory limit and soft limit.
- Per-cgroup OOM events.
- Reclaim pressure metrics.
The detailed mechanics of cgroups are covered elsewhere; the important point here is that many memory management decisions now operate both globally and per-cgroup.
Summary
Linux memory management revolves around:
- Virtual memory and per-process address spaces.
- Paging, TLB, and handling page faults.
- A hierarchy of allocators: buddy, slab/SLUB, per-CPU caches.
- Copy-on-write to optimize duplication and sharing.
- Page cache, reclaim, and swap to balance performance and capacity.
- Overcommit and OOM killer to handle worst-case scenarios.
- NUMA, huge pages, and compressed memory (zswap, zram) for scalability and efficiency.
- Extensive /proc and /sys interfaces to observe and tune the system.
Understanding these mechanisms is crucial for advanced performance analysis, debugging subtle memory issues, and building efficient, reliable Linux systems at scale.