Understanding How Linux Uses Memory
Linux memory management is highly dynamic. Before tuning, you need a mental model of how Linux actually uses RAM:
- Page cache & buffers: Linux aggressively caches file data and metadata in RAM to speed up I/O. This memory is “used but reclaimable.”
- Anonymous memory: Memory not backed by files (heaps, stacks) — usually the most important for applications.
- Kernel memory: Slab caches, kernel stacks, and other internal structures.
- Swap: Disk-based extension of memory; much slower than RAM.
For tuning decisions, think in terms of:
- “How fast is my workload vs how much memory does it allocate?”
- “Is slowness due to CPU, disk, or memory pressure?”
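The distinction between reclaimable cache and truly used memory can be made concrete with the `MemTotal` and `MemAvailable` fields of `/proc/meminfo`. A minimal sketch, with sample values invented for illustration:

```shell
# Separate "really used" memory from reclaimable cache.
# Sample values are hard-coded so the sketch is self-contained; on a live
# system read them with:  awk '/^MemTotal:/ {print $2}' /proc/meminfo  etc.
mem_total_kb=16384000        # MemTotal: all physical RAM
mem_available_kb=12288000    # MemAvailable: free + easily reclaimable cache
really_used_kb=$((mem_total_kb - mem_available_kb))
echo "really used: ${really_used_kb} kB"   # anonymous + kernel + unreclaimable
```

A box whose `free` output looks alarming may therefore be perfectly healthy: most of the "used" memory is page cache the kernel will drop on demand.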
Key Tools for Memory Tuning
You should already know basic monitoring tools from earlier chapters. For memory tuning, focus on:
- `free -h` — quick overview (used, free, buff/cache, swap)
- `top` / `htop` — per-process memory; look at `RES` (resident), `SHR`, `%MEM`
- `vmstat 1` — high-level view: `si`/`so` (swap in/out), `us`, `sy`, `wa` (CPU time in user, system, I/O wait)
- `sar -r` / `sar -B` (from `sysstat`) — historical memory and paging data
- `ps aux --sort=-%mem | head` — top memory consumers
- `pmap -x <pid>` — detailed mapping for a process
- `smem` (if installed) — more accurate proportional set size (PSS)
For kernel-level analysis:
- `/proc/meminfo` — global memory state
- `/proc/vmstat` — detailed VM statistics (paging, reclaim, etc.)
- `slabtop` — kernel slab allocations
Focus on interpretability: tuning without understanding what the metrics actually reflect will usually worsen performance.
Swap Management and Swappiness
Swap can be critical for stability (avoiding OOM kills), but poorly tuned swap causes massive slowdown.
When swap is helpful vs harmful
Swap is useful to:
- Prevent crashes when memory briefly spikes.
- Move truly idle pages out of RAM to make room for active data.
Problematic when:
- You see continuous, high `si`/`so` in `vmstat` and high I/O wait.
- Applications feel sluggish, with disks working hard.
- Latency-sensitive workloads (databases, low-latency network services) are impacted.
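One way to tell a brief spike from sustained swapping is to count how many `vmstat` samples show high `si + so`. A sketch over canned `vmstat` output (both the numbers and the threshold of 100 are illustrative; on a live system pipe `vmstat 1 10` into the same awk program):

```shell
# Count samples where swap-in + swap-out exceeds a threshold.
# si is column 7, so is column 8; NR > 2 skips the two header lines.
swap_events=$(awk 'NR > 2 && ($7 + $8) > 100 { n++ } END { print n+0 }' <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 204800  81920  10240 409600  512  768   120   340  900 1500 20 10 40 30  0
 0  1 204800  80000  10240 409600    0    0    10    20  400  700  5  3 90  2  0
 2  0 210000  78000  10240 400000  900 1200   200   500 1100 1800 25 15 20 40  0
EOF
)
echo "samples with heavy swapping: $swap_events"
```

If most samples trip the threshold, swap is a bottleneck rather than a safety net.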
Tuning swappiness
`vm.swappiness` controls how aggressively Linux will swap out anonymous memory:
- Value range: `0`–`100`.
- Higher values ⇒ more eager swapping.
- Lower values ⇒ keep anonymous memory in RAM as long as possible.
Check the current value:

```shell
cat /proc/sys/vm/swappiness
```

Temporary change (until reboot):

```shell
sudo sysctl vm.swappiness=10
```

Permanent change (persists across reboots):

```shell
echo "vm.swappiness = 10" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```

Typical guidelines:
- Desktop with plenty of RAM: `10`–`20`
- Latency-sensitive server (DB, cache): `1`–`10`
- RAM-constrained systems where stability is more important than speed: `30`–`60`
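As an alternative to appending to `/etc/sysctl.conf`, most modern distributions also read drop-in files from `/etc/sysctl.d/`, which keeps local tuning separate from distribution defaults. The filename below is illustrative:

```
# /etc/sysctl.d/90-swappiness.conf  (example filename)
vm.swappiness = 10
```

Reload all sysctl configuration with `sudo sysctl --system`.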
Swap size and type
Decision factors:
- Do you prioritize stability (avoid OOM) or speed (avoid swap usage)?
- Are you using hibernation? (swap ≥ RAM is commonly recommended for hibernation.)
- Do you frequently run memory-heavy batch jobs?
Options:
- Swap partition — fixed, simple, good for servers.
- Swap file — flexible; easier to resize later; good for most installations.
- No swap — only for very specialized systems with lots of RAM and fully controlled workloads.
Example: creating an 8 GB swap file:

```shell
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
sudo swapon -a
```

Monitor with:

```shell
swapon --show
free -h
```

Kernel VM Parameters for Memory Tuning
Several `/proc/sys/vm/*` parameters impact how the kernel uses and frees memory. These should be changed carefully and usually only after observing concrete problems.
1. `vm.dirty_ratio` and `vm.dirty_background_ratio`
These control how much memory can be filled with “dirty” (modified but not yet written to disk) pages.
- `vm.dirty_background_ratio` (% of total memory): when dirty pages exceed this, background writeback begins.
- `vm.dirty_ratio` (% of total memory): when dirty pages exceed this, any process attempting to write may be forced to write data out itself.
On large-RAM systems, defaults can lead to:
- Huge bursts of writes (I/O spikes).
- Latency spikes when system suddenly flushes a large amount of data.
You can instead tune with absolute values (in bytes):

- `vm.dirty_bytes`
- `vm.dirty_background_bytes`

Example: limit dirty data to 1 GB and start background writeback at 256 MB:

```shell
sudo sysctl -w vm.dirty_bytes=$((1024*1024*1024))
sudo sysctl -w vm.dirty_background_bytes=$((256*1024*1024))
```
Persist by adding the values to `/etc/sysctl.conf`.
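One common approach is to derive these byte values from the machine's RAM size rather than picking round numbers. A sketch, assuming a 16 GB machine and rule-of-thumb fractions (roughly 1% for background writeback, 4% for the hard limit) that you should validate against your own workload:

```shell
# Derive absolute dirty limits from RAM size instead of percentages.
# mem_kb is hard-coded for illustration; on a live system use:
#   mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
mem_kb=16384000                      # 16 GB machine (example value)
mem_bytes=$((mem_kb * 1024))
dirty_bg=$((mem_bytes / 100))        # ~1% of RAM: start background writeback
dirty_max=$((mem_bytes / 25))        # ~4% of RAM: hard limit on dirty data
echo "vm.dirty_background_bytes = $dirty_bg"
echo "vm.dirty_bytes = $dirty_max"
# Apply as root:  sudo sysctl -w vm.dirty_background_bytes=$dirty_bg  etc.
```

Smaller absolute limits trade peak write throughput for smoother, more predictable flushing.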
2. `vm.min_free_kbytes`
Controls how much memory the kernel tries to keep free for emergencies and for allocating contiguous pages.
- Too low ⇒ more risk of allocation stalls under pressure.
- Too high ⇒ effectively reduces usable RAM.
On modern systems with lots of RAM, increasing it slightly from the default can improve stability under heavy load, but drastic changes can cause problems. Only tune this when you observe allocation stalls or reclaim thrashing and have read the relevant kernel documentation.
3. `vm.overcommit_memory` and `vm.overcommit_ratio`
These control memory overcommit — promising more memory to processes than physically exists.
- `vm.overcommit_memory`:
  - `0`: heuristic overcommit (default).
  - `1`: always allow overcommit (can promise more than RAM + swap).
  - `2`: strict overcommit (limit commitments to swap plus a percentage of RAM).
- `vm.overcommit_ratio`:
  - Used when `overcommit_memory=2`.
  - Specifies the percentage of physical RAM that can be committed in addition to swap.
For memory-critical servers (databases, scientific workloads), setting `overcommit_memory=2` with a reasonable `overcommit_ratio` can keep the kernel from promising more memory than you want it to. Example (strict but generous):

```shell
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=80
```

Use this when you must bound memory usage and avoid random overcommit-induced OOM kills.
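In strict mode, the kernel's commit limit (`CommitLimit` in `/proc/meminfo`) is, ignoring huge-page reservations, swap plus `overcommit_ratio` percent of RAM. A quick sketch with illustrative sizes:

```shell
# CommitLimit under strict overcommit (vm.overcommit_memory = 2):
#   CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
# Example numbers are illustrative, in kB.
mem_total_kb=33554432        # 32 GiB RAM
swap_total_kb=8388608        # 8 GiB swap
overcommit_ratio=80
commit_limit_kb=$((swap_total_kb + mem_total_kb * overcommit_ratio / 100))
echo "CommitLimit would be ${commit_limit_kb} kB"
# Compare with the live value:  grep -E 'Commit' /proc/meminfo
```

If `Committed_AS` approaches this limit, new allocations start failing with ENOMEM instead of triggering the OOM killer later.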
4. `vm.vfs_cache_pressure`
Controls how aggressively the kernel reclaims inode/dentry caches (directory and file metadata).
- Higher values ⇒ reclaim metadata cache more aggressively to free memory.
- Lower values ⇒ keep filesystem caches longer.
If you have filesystem-heavy workloads and sufficient RAM, a moderately low value (e.g. 50) may improve performance by keeping more metadata cached:
```shell
sudo sysctl -w vm.vfs_cache_pressure=50
```

Avoid extreme values unless you have specific measurements.
5. Zone reclaim and NUMA (NUMA systems only)
On multi-socket NUMA systems, `vm.zone_reclaim_mode` controls how aggressively memory is reclaimed from local NUMA nodes before allocating from remote ones.
- Over-aggressive reclaim can cause unnecessary paging and latency.
- For many server workloads, disabling zone reclaim (`0`) and letting the kernel use remote memory is better than reclaiming aggressively from local memory:

```shell
sudo sysctl -w vm.zone_reclaim_mode=0
```
This is only relevant on NUMA hardware; use `numactl --hardware` to inspect the layout.
Transparent Huge Pages (THP) and HugeTLB
Huge pages (e.g. 2 MB instead of 4 KB) reduce page table size and TLB misses, potentially improving performance for some memory-heavy workloads.
Transparent Huge Pages (THP)
Linux can automatically use huge pages via THP:
- Pros:
- Can significantly help workloads with large, contiguous anonymous memory (databases, in-memory caches).
- Cons:
- Kernel may spend CPU time coalescing/defragmenting memory.
- Can sometimes introduce latency spikes (especially in `always` mode).
Check the current mode:

```shell
cat /sys/kernel/mm/transparent_hugepage/enabled
```

It typically shows something like `[always] madvise never` (brackets indicate the current mode).
Common options:
- `always` — aggressively use THP when possible.
- `madvise` — only use THP when applications request it via `madvise()`.
- `never` — disable THP.
Tuning:

```shell
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```
For latency-sensitive systems where THP causes issues, use `never`. For big database hosts, check vendor recommendations; many vendors recommend disabling THP entirely and using explicit huge pages instead.
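Note that writes to `/sys/kernel/mm/transparent_hugepage/enabled` do not survive a reboot. One common way to persist the mode is a kernel boot parameter; the exact file and regeneration command (`update-grub`, `grub2-mkconfig`, or similar) depend on your distribution:

```
# /etc/default/grub (Debian/Ubuntu layout; adjust for your distro)
# Append the THP mode to the kernel command line, then regenerate the
# GRUB config (e.g. sudo update-grub) and reboot:
GRUB_CMDLINE_LINUX="quiet transparent_hugepage=madvise"
```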
Explicit huge pages (HugeTLB)
For peak performance and predictability, you can allocate explicit huge pages:
- Requires reserving huge pages in advance:

```shell
sudo sysctl -w vm.nr_hugepages=1024
```

- Applications must be configured to use them (often via DB configs or `libhugetlbfs`).
This is advanced tuning: follow your application’s documentation (databases like Oracle, PostgreSQL, MySQL may benefit).
Tuning for Different Workload Types
Memory tuning is always workload-specific. Some typical patterns:
1. Database / in-memory cache servers
Goals: low latency, predictable performance, avoid OOM during spikes.
Typical adjustments (after measuring behavior):
- Lower swappiness: `vm.swappiness = 1`
- Disable or relax THP depending on DB vendor guidance:
  - Often: `transparent_hugepage=never` (kernel boot parameter or runtime).
- Consider strict overcommit:

```
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
```

- Adjust dirty limits to avoid large I/O bursts on write-heavy workloads:

```
vm.dirty_background_bytes = 268435456   # 256 MiB
vm.dirty_bytes = 1073741824             # 1 GiB
```

- Pin DB memory limits (e.g. PostgreSQL `shared_buffers`, MySQL `innodb_buffer_pool_size`) so the DB does not consume more than a safe fraction of RAM.
Monitor:
- Page faults, I/O wait, swap usage, DB-specific metrics (buffer cache hit rates, etc.).
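These settings could be collected into a single sysctl drop-in, for example (the filename and values are starting points, not universal defaults; validate against your DB vendor's guidance):

```
# /etc/sysctl.d/91-db-memory.conf  (example filename)
vm.swappiness = 1
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824
```

Apply with `sudo sysctl --system`.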
2. Web/application servers
Goals: handle many concurrent connections, frequent forks, moderate memory usage.
Patterns:
- Multiple worker processes (e.g. web server + app server).
- Frequent short-lived processes (CGI, some microservice architectures).
Tuning ideas:
- Moderate swappiness (e.g. `10`) — avoid both OOM and constant swapping.
- Ensure adequate `vm.min_free_kbytes` to avoid stalls under sudden connection spikes.
- Monitor per-process RSS; adjust service worker counts so aggregate memory stays within comfortable headroom.
Use systemd features (covered in other chapters) like `MemoryMax=` and `MemoryHigh=` per service to prevent a single service from taking over the machine.
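A sketch of such a per-service drop-in (the unit name and limits are placeholders, not recommendations):

```
# /etc/systemd/system/myapp.service.d/memory.conf
# (unit name and limits are illustrative; systemd comments must start a line)
[Service]
# Throttle and reclaim aggressively once the service exceeds 2 GiB:
MemoryHigh=2G
# Hard cap: the service is OOM-killed before it can destabilize the host:
MemoryMax=3G
```

After creating the drop-in, run `sudo systemctl daemon-reload` and restart the service.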
3. Batch / HPC / scientific workloads
Goals: high throughput, large in-memory datasets, long-running jobs.
Patterns:
- Processes using tens of GB of RAM.
- Very expensive if OOM kills happen mid-run.
Tuning:
- Consider strict overcommit mode:

```
vm.overcommit_memory = 2
vm.overcommit_ratio = 90
```

- Keep adequate swap as a safety net, but avoid running entirely from swap.
- Use per-job cgroups (discussed elsewhere) to cap memory per user or per job.
- On NUMA systems, use `numactl` to control memory locality and check `zone_reclaim_mode`.
4. Desktop / interactive systems
Goals: snappy feel, avoid stalls when RAM fills, allow heavy occasional workloads.
Practical changes:
- Lower swappiness than the default, e.g. `vm.swappiness = 10`.
- On SSD-based systems, moderate swap with lower swappiness often provides the best user experience.
- Desktop environments can be memory-hungry; monitor and trim unused startup services.
Avoiding and Handling OOM (Out of Memory)
When memory is exhausted, Linux’s OOM killer chooses processes to terminate.
Detecting OOM events
Symptoms:
- Sudden disappearance of processes.
- Messages in the kernel log:

```shell
dmesg | grep -i "out of memory"
journalctl -k | grep -i "out of memory"
```

Look for lines like "Out of memory: Kill process <pid> (name) score X or sacrifice child".
Tuning OOM behavior with `oom_score_adj`
Each process has an `oom_score` and an adjustable `oom_score_adj` (`-1000` to `+1000`).
- Lower value ⇒ less likely to be killed.
- Higher value ⇒ more likely.
Example: protect a critical service:

```shell
echo -1000 | sudo tee /proc/$(pidof my_critical_service)/oom_score_adj
```
You can configure this permanently through systemd unit files (e.g. `OOMScoreAdjust=`) instead of scripting manual changes.
Be careful: if you protect too many processes, the OOM killer may have no good candidates and the system may become unresponsive instead of recovering.
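A sketch of the systemd approach (the unit name is a placeholder; `-1000` would fully exempt the service from the OOM killer, so a slightly less extreme value is often safer):

```
# /etc/systemd/system/my_critical_service.service.d/oom.conf
[Service]
# Strongly deprioritize, but do not fully exempt, this service:
OOMScoreAdjust=-900
```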
Proactive controls with cgroups (memory.max, memory.high)
Instead of waiting for OOM:
- Use cgroups (through systemd `MemoryMax=` / `MemoryHigh=`) to:
  - Limit memory usage per service.
  - Force reclaim behavior before system-wide OOM occurs.
- This keeps memory hogs contained and protects critical services.
Reducing Application Memory Footprint
Kernel tuning helps, but you often gain more by reducing what applications use.
Consider:
- Consolidating services:
- Fewer different languages, runtimes, and daemons ⇒ fewer duplicate libraries, less overhead.
- Configuring application memory limits:
  - Java: `-Xmx` / `-Xms`
  - Databases: buffer pool and cache sizes.
  - Web servers: worker counts and per-worker memory.
- Avoiding unnecessary background processes:
- GUI services, indexing services, notification daemons on servers.
Use:

- `smem -r` or `ps_mem.py` (if installed) to estimate true per-program memory use (PSS).
- `pmap -x <pid>` to inspect where memory goes (heap, stack, shared libs, anonymous segments).
NUMA-aware Memory Tuning (Advanced)
On multi-socket servers, memory is divided into NUMA nodes, each with local CPUs. Accessing local memory is faster than remote memory.
Key tools:
- `numactl --hardware` — show NUMA nodes and memory distribution.
- `numastat` — per-node memory stats.
- `numactl --membind` / `--cpunodebind` — bind processes to specific nodes.
Issues:
- One node can be full while others still have free memory.
- Zone reclaim may aggressively reclaim local memory instead of using remote memory, causing paging.
Mitigations:
- Often disable zone reclaim:

```shell
sudo sysctl -w vm.zone_reclaim_mode=0
```

- Use `numactl` or systemd CPU/memory affinity to keep critical processes local to a specific node.
- Monitor `numastat` to ensure that remote memory access is not excessive when performance matters.
Measuring, Testing, and Iterating Safely
Memory tuning should follow a deliberate cycle:
1. Measure baseline:
   - Collect metrics (RAM usage, swap, page faults, I/O wait, app latency).
   - Use tools like `sar`, `vmstat`, `free`, `top`, and application metrics.
2. Change one thing at a time:
   - Adjust a single parameter (e.g. `swappiness`).
   - Apply via `sysctl -w`, or edit `/etc/sysctl.conf` for persistence.
3. Load test or observe under real workload:
   - Reproduce typical or worst-case load.
   - Measure before/after differences.
4. Document and version-control:
   - Keep sysctl settings under configuration management.
   - Note why each non-default value was chosen.
5. Roll back if needed:
   - If latency, OOMs, or instability increase, revert to the last known good settings.
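The "document and roll back" steps can be as simple as snapshotting the tunables you touch and diffing the snapshots. A self-contained sketch (the values are simulated; on a live system you would read them with `sysctl -n <name>` and keep the snapshot files under version control):

```shell
# Snapshot a known list of VM tunables before and after a change, then diff.
snap() {   # snap <outfile> <swappiness> <dirty_ratio>
    printf 'vm.swappiness = %s\nvm.dirty_ratio = %s\n' "$2" "$3" > "$1"
}
dir=$(mktemp -d)
snap "$dir/baseline.conf" 60 20     # values before tuning
snap "$dir/tuned.conf"    10 20     # after lowering swappiness only
# Each changed parameter produces one "<" and one ">" line in the diff:
changes=$(diff "$dir/baseline.conf" "$dir/tuned.conf" | grep -c '^[<>]')
echo "lines changed: $changes"
rm -rf "$dir"
```

A diff that shows more than the one parameter you meant to change is an immediate red flag before any load testing starts.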
Never adopt “universal” sysctl recipes blindly. Effective memory tuning depends on:
- Hardware (RAM size, disk/SSD speed, NUMA topology).
- Workload (I/O patterns, latency requirements, memory allocation patterns).
- Stability vs performance trade-offs required for that system.