Understanding How Linux Uses Memory
Linux memory management is highly dynamic. Before tuning, you need a mental model of how Linux actually uses RAM:
- Page cache & buffers: Linux aggressively caches file data and metadata in RAM to speed up I/O. This memory is “used but reclaimable.”
- Anonymous memory: Memory not backed by files (heaps, stacks) — usually the most important for applications.
- Kernel memory: Slab caches, kernel stacks, and other internal structures.
- Swap: Disk-based extension of memory; much slower than RAM.
For tuning decisions, think in terms of:
- “How fast is my workload vs how much memory does it allocate?”
- “Is slowness due to CPU, disk, or memory pressure?”
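The distinction between reclaimable cache and truly used memory can be made concrete with the `MemTotal` and `MemAvailable` fields of `/proc/meminfo`. A minimal sketch, with sample values invented for illustration:

```shell
# Separate "really used" memory from reclaimable cache.
# Sample values are hard-coded so the sketch is self-contained; on a live
# system read them with:  awk '/^MemTotal:/ {print $2}' /proc/meminfo  etc.
mem_total_kb=16384000        # MemTotal: all physical RAM
mem_available_kb=12288000    # MemAvailable: free + easily reclaimable cache
really_used_kb=$((mem_total_kb - mem_available_kb))
echo "really used: ${really_used_kb} kB"   # anonymous + kernel + unreclaimable
```

A box whose `free` output looks alarming may therefore be perfectly healthy: most of the "used" memory is page cache the kernel will drop on demand.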
Key Tools for Memory Tuning
You should already know basic monitoring tools from earlier chapters. For memory tuning, focus on:
- `free -h` — quick overview (used, free, buff/cache, swap)
- `top` / `htop` — per-process memory; look at `RES` (resident), `SHR`, `%MEM`
- `vmstat 1` — high-level view: `si`/`so` (swap in/out), `us`, `sy`, `wa` (CPU time in user, system, I/O wait)
- `sar -r` / `sar -B` (from `sysstat`) — historical memory and paging data
- `ps aux --sort=-%mem | head` — top memory consumers
- `pmap -x <pid>` — detailed mapping for a process
- `smem` (if installed) — more accurate proportional set size (PSS)
For kernel-level analysis:
- `/proc/meminfo` — global memory state
- `/proc/vmstat` — detailed VM statistics (paging, reclaim, etc.)
- `slabtop` — kernel slab allocations
Focus on interpretability: tuning without understanding what the metrics actually reflect will usually worsen performance.
Swap Management and Swappiness
Swap can be critical for stability (avoiding OOM kills), but poorly tuned swap causes massive slowdown.
When swap is helpful vs harmful
Swap is useful to:
- Prevent crashes when memory briefly spikes.
- Move truly idle pages out of RAM to make room for active data.
Problematic when:
- You see continuous, high `si`/`so` in `vmstat` and high I/O wait.
- Applications feel sluggish, with disks working hard.
- Latency-sensitive workloads (databases, low-latency network services) are impacted.
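One way to tell a brief spike from sustained swapping is to count how many `vmstat` samples show high `si + so`. A sketch over canned `vmstat` output (both the numbers and the threshold of 100 are illustrative; on a live system pipe `vmstat 1 10` into the same awk program):

```shell
# Count samples where swap-in + swap-out exceeds a threshold.
# si is column 7, so is column 8; NR > 2 skips the two header lines.
swap_events=$(awk 'NR > 2 && ($7 + $8) > 100 { n++ } END { print n+0 }' <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 204800  81920  10240 409600  512  768   120   340  900 1500 20 10 40 30  0
 0  1 204800  80000  10240 409600    0    0    10    20  400  700  5  3 90  2  0
 2  0 210000  78000  10240 400000  900 1200   200   500 1100 1800 25 15 20 40  0
EOF
)
echo "samples with heavy swapping: $swap_events"
```

If most samples trip the threshold, swap is a bottleneck rather than a safety net.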
Tuning swappiness
`vm.swappiness` controls how aggressively Linux will swap out anonymous memory:
- Value range: `0`–`100`.
- Higher values ⇒ more eager swapping.
- Lower values ⇒ keep anonymous memory in RAM as long as possible.
Check the current value:

```shell
cat /proc/sys/vm/swappiness
```

Temporary change (until reboot):

```shell
sudo sysctl vm.swappiness=10
```

Permanent change (persists across reboots):

```shell
echo "vm.swappiness = 10" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```

Typical guidelines:
- Desktop with plenty of RAM: `10`–`20`
- Latency-sensitive server (DB, cache): `1`–`10`
- RAM-constrained systems where stability is more important than speed: `30`–`60`
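As an alternative to appending to `/etc/sysctl.conf`, most modern distributions also read drop-in files from `/etc/sysctl.d/`, which keeps local tuning separate from distribution defaults. The filename below is illustrative:

```
# /etc/sysctl.d/90-swappiness.conf  (example filename)
vm.swappiness = 10
```

Reload all sysctl configuration with `sudo sysctl --system`.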
Swap size and type
Decision factors:
- Do you prioritize stability (avoid OOM) or speed (avoid swap usage)?
- Are you using hibernation? (swap ≥ RAM is commonly recommended for hibernation.)
- Do you frequently run memory-heavy batch jobs?
Options:
- Swap partition — fixed, simple, good for servers.
- Swap file — flexible; easier to resize later; good for most installations.
- No swap — only for very specialized systems with lots of RAM and fully controlled workloads.
Example: creating an 8 GB swap file:

```shell
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
sudo swapon -a
```

Monitor with:

```shell
swapon --show
free -h
```

Kernel VM Parameters for Memory Tuning
Several `/proc/sys/vm/*` parameters impact how the kernel uses and frees memory. These should be changed carefully and usually only after observing concrete problems.
1. `vm.dirty_ratio` and `vm.dirty_background_ratio`
These control how much memory can be filled with “dirty” (modified but not yet written to disk) pages.
- `vm.dirty_background_ratio` (% of total memory): when dirty pages exceed this, background writeback begins.
- `vm.dirty_ratio` (% of total memory): when dirty pages exceed this, any process attempting to write may be forced to write data out itself.
On large-RAM systems, defaults can lead to:
- Huge bursts of writes (I/O spikes).
- Latency spikes when system suddenly flushes a large amount of data.
You can instead tune with absolute values (in bytes):

- `vm.dirty_bytes`
- `vm.dirty_background_bytes`

Example: limit dirty data to 1 GB and start background writeback at 256 MB:

```shell
sudo sysctl -w vm.dirty_bytes=$((1024*1024*1024))
sudo sysctl -w vm.dirty_background_bytes=$((256*1024*1024))
```
Persist by adding the values to `/etc/sysctl.conf`.
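One common approach is to derive these byte values from the machine's RAM size rather than picking round numbers. A sketch, assuming a 16 GB machine and rule-of-thumb fractions (roughly 1% for background writeback, 4% for the hard limit) that you should validate against your own workload:

```shell
# Derive absolute dirty limits from RAM size instead of percentages.
# mem_kb is hard-coded for illustration; on a live system use:
#   mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
mem_kb=16384000                      # 16 GB machine (example value)
mem_bytes=$((mem_kb * 1024))
dirty_bg=$((mem_bytes / 100))        # ~1% of RAM: start background writeback
dirty_max=$((mem_bytes / 25))        # ~4% of RAM: hard limit on dirty data
echo "vm.dirty_background_bytes = $dirty_bg"
echo "vm.dirty_bytes = $dirty_max"
# Apply as root:  sudo sysctl -w vm.dirty_background_bytes=$dirty_bg  etc.
```

Smaller absolute limits trade peak write throughput for smoother, more predictable flushing.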
2. `vm.min_free_kbytes`
Controls how much memory the kernel tries to keep free for emergencies and for allocating contiguous pages.
- Too low ⇒ more risk of allocation stalls under pressure.
- Too high ⇒ effectively reduces usable RAM.
On modern systems with lots of RAM, increasing it slightly from the default can improve stability under heavy load, but drastic changes can cause problems. Only tune this when you observe allocation stalls or reclaim thrashing and have read the relevant kernel documentation.
3. `vm.overcommit_memory` and `vm.overcommit_ratio`
These control memory overcommit — promising more memory to processes than physically exists.
- `vm.overcommit_memory`:
  - `0`: heuristic overcommit (default).
  - `1`: always allow overcommit (can promise more than RAM + swap).
  - `2`: strict overcommit (limit commitments to swap plus a percentage of RAM).
- `vm.overcommit_ratio`:
  - Used when `overcommit_memory=2`.
  - Specifies the percentage of physical RAM that can be committed in addition to swap.
For memory-critical servers (databases, scientific workloads), setting `overcommit_memory=2` with a reasonable `overcommit_ratio` can keep the kernel from promising more memory than you want it to. Example (strict but generous):

```shell
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=80
```

Use this when you must bound memory usage and avoid random overcommit-induced OOM kills.
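In strict mode, the kernel's commit limit (`CommitLimit` in `/proc/meminfo`) is, ignoring huge-page reservations, swap plus `overcommit_ratio` percent of RAM. A quick sketch with illustrative sizes:

```shell
# CommitLimit under strict overcommit (vm.overcommit_memory = 2):
#   CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
# Example numbers are illustrative, in kB.
mem_total_kb=33554432        # 32 GiB RAM
swap_total_kb=8388608        # 8 GiB swap
overcommit_ratio=80
commit_limit_kb=$((swap_total_kb + mem_total_kb * overcommit_ratio / 100))
echo "CommitLimit would be ${commit_limit_kb} kB"
# Compare with the live value:  grep -E 'Commit' /proc/meminfo
```

If `Committed_AS` approaches this limit, new allocations start failing with ENOMEM instead of triggering the OOM killer later.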
4. `vm.vfs_cache_pressure`
Controls how aggressively the kernel reclaims inode/dentry caches (directory and file metadata).
- Higher values ⇒ reclaim metadata cache more aggressively to free memory.
- Lower values ⇒ keep filesystem caches longer.
If you have filesystem-heavy workloads and sufficient RAM, a moderately low value (e.g. 50) may improve performance by keeping more metadata cached:
```shell
sudo sysctl -w vm.vfs_cache_pressure=50
```

Avoid extreme values unless you have specific measurements.
5. Zone reclaim and NUMA (NUMA systems only)
On multi-socket NUMA systems, `vm.zone_reclaim_mode` controls how aggressively memory is reclaimed from local NUMA nodes before allocating from remote ones.
- Over-aggressive reclaim can cause unnecessary paging and latency.
- For many server workloads, disabling zone reclaim (`0`) and letting the kernel use remote memory is better than reclaiming aggressively from local memory:

```shell
sudo sysctl -w vm.zone_reclaim_mode=0
```
This is only relevant on NUMA hardware; use `numactl --hardware` to inspect the layout.
Transparent Huge Pages (THP) and HugeTLB
Huge pages (e.g. 2 MB instead of 4 KB) reduce page table size and TLB misses, potentially improving performance for some memory-heavy workloads.
Transparent Huge Pages (THP)
Linux can automatically use huge pages via THP:
- Pros:
- Can significantly help workloads with large, contiguous anonymous memory (databases, in-memory caches).
- Cons:
- Kernel may spend CPU time coalescing/defragmenting memory.
- Can sometimes introduce latency spikes (especially in `always` mode).
Check the current mode:

```shell
cat /sys/kernel/mm/transparent_hugepage/enabled
```

It typically shows something like `[always] madvise never` (brackets indicate the current mode).
Common options:
- `always` — aggressively use THP when possible.
- `madvise` — only use THP when applications request it via `madvise()`.
- `never` — disable THP.
Tuning:

```shell
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```
For latency-sensitive systems where THP causes issues, use `never`. For big database hosts, check vendor recommendations; many vendors recommend disabling THP entirely and using explicit huge pages instead.
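Note that writes to `/sys/kernel/mm/transparent_hugepage/enabled` do not survive a reboot. One common way to persist the mode is a kernel boot parameter; the exact file and regeneration command (`update-grub`, `grub2-mkconfig`, or similar) depend on your distribution:

```
# /etc/default/grub (Debian/Ubuntu layout; adjust for your distro)
# Append the THP mode to the kernel command line, then regenerate the
# GRUB config (e.g. sudo update-grub) and reboot:
GRUB_CMDLINE_LINUX="quiet transparent_hugepage=madvise"
```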
Explicit huge pages (HugeTLB)
For peak performance and predictability, you can allocate explicit huge pages:
- Requires reserving huge pages in advance:

```shell
sudo sysctl -w vm.nr_hugepages=1024
```

- Applications must be configured to use them (often via DB configs or `libhugetlbfs`).
This is advanced tuning: follow your application’s documentation (databases like Oracle, PostgreSQL, MySQL may benefit).
Tuning for Different Workload Types
Memory tuning is always workload-specific. Some typical patterns:
1. Database / in-memory cache servers
Goals: low latency, predictable performance, avoid OOM during spikes.
Typical adjustments (after measuring behavior):
- Lower swappiness: `vm.swappiness = 1`
- Disable or relax THP depending on DB vendor guidance:
  - Often: `transparent_hugepage=never` (kernel boot parameter or runtime).
- Consider strict overcommit:

```
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
```

- Adjust dirty limits to avoid large I/O bursts on write-heavy workloads:

```
vm.dirty_background_bytes = 268435456   # 256 MiB
vm.dirty_bytes = 1073741824             # 1 GiB
```

- Pin DB memory limits (e.g. PostgreSQL `shared_buffers`, MySQL `innodb_buffer_pool_size`) so the DB does not consume more than a safe fraction of RAM.
Monitor:
- Page faults, I/O wait, swap usage, DB-specific metrics (buffer cache hit rates, etc.).
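These settings could be collected into a single sysctl drop-in, for example (the filename and values are starting points, not universal defaults; validate against your DB vendor's guidance):

```
# /etc/sysctl.d/91-db-memory.conf  (example filename)
vm.swappiness = 1
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824
```

Apply with `sudo sysctl --system`.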
2. Web/application servers
Goals: handle many concurrent connections, frequent forks, moderate memory usage.
Patterns:
- Multiple worker processes (e.g. web server + app server).
- Frequent short-lived processes (CGI, some microservice architectures).
Tuning ideas:
- Moderate swappiness (e.g. `10`) — avoid both OOM and constant swapping.
- Ensure adequate `vm.min_free_kbytes` to avoid stalls under sudden connection spikes.
- Monitor per-process RSS; adjust service worker counts so aggregate memory stays within comfortable headroom.
Use systemd features (covered in other chapters) like `MemoryMax=` and `MemoryHigh=` per service to prevent a single service from taking over the machine.
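A sketch of such a per-service drop-in (the unit name and limits are placeholders, not recommendations):

```
# /etc/systemd/system/myapp.service.d/memory.conf
# (unit name and limits are illustrative; systemd comments must start a line)
[Service]
# Throttle and reclaim aggressively once the service exceeds 2 GiB:
MemoryHigh=2G
# Hard cap: the service is OOM-killed before it can destabilize the host:
MemoryMax=3G
```

After creating the drop-in, run `sudo systemctl daemon-reload` and restart the service.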
3. Batch / HPC / scientific workloads
Goals: high throughput, large in-memory datasets, long-running jobs.
Patterns:
- Processes using tens of GB of RAM.
- Very expensive if OOM kills happen mid-run.
Tuning:
- Consider strict overcommit mode:

```
vm.overcommit_memory = 2
vm.overcommit_ratio = 90
```

- Keep adequate swap as a safety net, but avoid running entirely from swap.
- Use per-job cgroups (discussed elsewhere) to cap memory per user or per job.
- On NUMA systems, use `numactl` to control memory locality and check `zone_reclaim_mode`.
4. Desktop / interactive systems
Goals: snappy feel, avoid stalls when RAM fills, allow heavy occasional workloads.
Practical changes:
- Lower swappiness than the default, e.g. `vm.swappiness = 10`.
- On SSD-based systems, moderate swap with lower swappiness often provides the best user experience.
- Desktop environments can be memory-hungry; monitor and trim unused startup services.
Avoiding and Handling OOM (Out of Memory)
When memory is exhausted, Linux’s OOM killer chooses processes to terminate.
Detecting OOM events
Symptoms:
- Sudden disappearance of processes.
- Messages in the kernel log:

```shell
dmesg | grep -i "out of memory"
journalctl -k | grep -i "out of memory"
```

Look for lines like "Out of memory: Kill process <pid> (name) score X or sacrifice child".
Tuning OOM behavior with `oom_score_adj`
Each process has an `oom_score` and an adjustable `oom_score_adj` (`-1000` to `+1000`).
- Lower value ⇒ less likely to be killed.
- Higher value ⇒ more likely.
Example: protect a critical service:

```shell
echo -1000 | sudo tee /proc/$(pidof my_critical_service)/oom_score_adj
```
You can configure this permanently through systemd unit files (e.g. `OOMScoreAdjust=`) instead of scripting manual changes.
Be careful: if you protect too many processes, the OOM killer may have no good candidates and the system may become unresponsive instead of recovering.
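A sketch of the systemd approach (the unit name is a placeholder; `-1000` would fully exempt the service from the OOM killer, so a slightly less extreme value is often safer):

```
# /etc/systemd/system/my_critical_service.service.d/oom.conf
[Service]
# Strongly deprioritize, but do not fully exempt, this service:
OOMScoreAdjust=-900
```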
Proactive controls with cgroups (memory.max, memory.high)
Instead of waiting for OOM:
- Use cgroups (through systemd `MemoryMax=` / `MemoryHigh=`) to:
  - Limit memory usage per service.
  - Force reclaim behavior before system-wide OOM occurs.
- This keeps memory hogs contained and protects critical services.
Reducing Application Memory Footprint
Kernel tuning helps, but you often gain more by reducing what applications use.
Consider:
- Consolidating services:
- Fewer different languages, runtimes, and daemons ⇒ fewer duplicate libraries, less overhead.
- Configuring application memory limits:
  - Java: `-Xmx` / `-Xms`
  - Databases: buffer pool and cache sizes.
  - Web servers: worker counts and per-worker memory.
- Avoiding unnecessary background processes:
- GUI services, indexing services, notification daemons on servers.
Use:

- `smem -r` or `ps_mem.py` (if installed) to estimate true per-program memory use (PSS).
- `pmap -x <pid>` to inspect where memory goes (heap, stack, shared libs, anonymous segments).
NUMA-aware Memory Tuning (Advanced)
On multi-socket servers, memory is divided into NUMA nodes, each with local CPUs. Accessing local memory is faster than remote memory.
Key tools:
- `numactl --hardware` — show NUMA nodes and memory distribution.
- `numastat` — per-node memory stats.
- `numactl --membind` / `--cpunodebind` — bind processes to specific nodes.
Issues:
- One node can be full while others still have free memory.
- Zone reclaim may aggressively reclaim local memory instead of using remote memory, causing paging.
Mitigations:
- Often disable zone reclaim:

```shell
sudo sysctl -w vm.zone_reclaim_mode=0
```

- Use `numactl` or systemd CPU/memory affinity to keep critical processes local to a specific node.
- Monitor `numastat` to ensure that remote memory access is not excessive when performance matters.
Measuring, Testing, and Iterating Safely
Memory tuning should follow a deliberate cycle:
1. Measure baseline:
   - Collect metrics (RAM usage, swap, page faults, I/O wait, app latency).
   - Use tools like `sar`, `vmstat`, `free`, `top`, and application metrics.
2. Change one thing at a time:
   - Adjust a single parameter (e.g. `swappiness`).
   - Apply via `sysctl -w`, or edit `/etc/sysctl.conf` for persistence.
3. Load test or observe under real workload:
   - Reproduce typical or worst-case load.
   - Measure before/after differences.
4. Document and version-control:
   - Keep sysctl settings under configuration management.
   - Note why each non-default value was chosen.
5. Roll back if needed:
   - If latency, OOMs, or instability increase, revert to the last known good settings.
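The "document and roll back" steps can be as simple as snapshotting the tunables you touch and diffing the snapshots. A self-contained sketch (the values are simulated; on a live system you would read them with `sysctl -n <name>` and keep the snapshot files under version control):

```shell
# Snapshot a known list of VM tunables before and after a change, then diff.
snap() {   # snap <outfile> <swappiness> <dirty_ratio>
    printf 'vm.swappiness = %s\nvm.dirty_ratio = %s\n' "$2" "$3" > "$1"
}
dir=$(mktemp -d)
snap "$dir/baseline.conf" 60 20     # values before tuning
snap "$dir/tuned.conf"    10 20     # after lowering swappiness only
# Each changed parameter produces one "<" and one ">" line in the diff:
changes=$(diff "$dir/baseline.conf" "$dir/tuned.conf" | grep -c '^[<>]')
echo "lines changed: $changes"
rm -rf "$dir"
```

A diff that shows more than the one parameter you meant to change is an immediate red flag before any load testing starts.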
Never adopt “universal” sysctl recipes blindly. Effective memory tuning depends on:
- Hardware (RAM size, disk/SSD speed, NUMA topology).
- Workload (I/O patterns, latency requirements, memory allocation patterns).
- Stability vs performance trade-offs required for that system.