
7.3.2 Memory tuning

Understanding How Linux Uses Memory

Linux memory management is highly dynamic. Before tuning, you need a mental model of how Linux actually uses RAM:

• Spare RAM is not wasted: the kernel fills it with page cache (cached file data), so low “free” values are normal and usually healthy.
• Memory splits broadly into anonymous pages (heaps, stacks) and file-backed pages (cached file contents); only anonymous pages need swap to be reclaimed.
• Under pressure, the kernel reclaims by dropping clean cache pages, writing back dirty ones, and, if configured, swapping out anonymous pages.

For tuning decisions, think in terms of:

• truly available memory (MemAvailable), not just MemFree
• whether reclaim and swap activity is occasional or continuous
• whether your workload is more sensitive to latency or to throughput

Key Tools for Memory Tuning

You should already know basic monitoring tools from earlier chapters. For memory tuning, focus on:

• free -h — overall RAM, cache, and swap usage
• vmstat 1 — swap-in/out (si/so) and reclaim activity over time
• smem / ps — per-process memory footprints
• sar -r and sar -B — historical memory and paging statistics

For kernel-level analysis:

• /proc/meminfo and /proc/vmstat — the counters behind the summary tools
• /proc/<pid>/smaps — detailed per-mapping usage
• perf and eBPF tools — tracing page faults and reclaim

Focus on interpretability: tuning without understanding what the metrics actually reflect will usually worsen performance.
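As a quick interpretability check, the headline numbers can be read straight from /proc/meminfo; a minimal sketch (Linux only):

```shell
# Print the fields that matter most for tuning, converted from KiB to MiB.
# MemAvailable (an estimate of memory usable by new workloads) is the number
# to watch, not MemFree.
awk '/^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree):/ {
    printf "%-13s %9.1f MiB\n", $1, $2/1024
}' /proc/meminfo
```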

Swap Management and Swappiness

Swap can be critical for stability (avoiding OOM kills), but poorly tuned swap causes massive slowdown.

When swap is helpful vs harmful

Swap is useful to:

• absorb short memory spikes instead of triggering the OOM killer
• let the kernel evict rarely used anonymous pages, freeing RAM for more productive use
• support hibernation (suspend-to-disk)

Problematic when:

• the active working set exceeds RAM and pages cycle in and out continuously (thrashing)
• latency-sensitive services get hot pages swapped out, causing unpredictable stalls

Tuning swappiness

swappiness controls how aggressively Linux will swap out anonymous memory relative to reclaiming page cache (0–200 on recent kernels, 0–100 historically):

• low (1–10): swap only under real memory pressure; typical for servers
• default (60): balanced behavior for general use
• high (100+): swap eagerly to keep more file cache resident

Check current value:

cat /proc/sys/vm/swappiness

Temporary change (until reboot):

sysctl vm.swappiness=10

Permanent change (persist across reboot):

echo "vm.swappiness = 10" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
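On systemd-based distributions, a drop-in under /etc/sysctl.d/ is generally preferred to appending to /etc/sysctl.conf; the filename below is hypothetical:

```
# /etc/sysctl.d/99-memory-tuning.conf
vm.swappiness = 10
```

Load it with sudo sysctl --system, which reads /etc/sysctl.d/ as well as /etc/sysctl.conf.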

Typical guidelines:

• databases and latency-sensitive servers: 1–10
• general-purpose servers: 10–30
• desktops: the default 60 is usually fine; lower it (e.g. 10) if you see stalls
• avoid 0 on modern kernels: it can postpone swapping until the OOM killer fires

Swap size and type

Decision factors:

• total RAM and how close the working set comes to filling it
• whether hibernation is needed (swap must then hold a full RAM image)
• storage speed: swap on NVMe/SSD hurts far less than on spinning disks

Options:

• a dedicated swap partition
• a swap file (easier to resize and remove)
• compressed in-RAM swap (zram) or a compressed swap cache (zswap)

Creating a swap file example (note: on some filesystems, such as Btrfs, fallocate-created files may be unsuitable for swap; check the filesystem documentation):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
sudo swapon -a

Monitor with:

swapon --show
free -h

Kernel VM Parameters for Memory Tuning

Several /proc/sys/vm/* parameters impact how the kernel uses and frees memory. These should be changed carefully and usually only after observing concrete problems.

1. `vm.dirty_ratio` and `vm.dirty_background_ratio`

These control how much memory can be filled with “dirty” (modified but not yet written to disk) pages.

On large-RAM systems, defaults can lead to:

• gigabytes of dirty data accumulating before writeback starts
• sudden flush storms that saturate the disk and cause long latency spikes

You can instead tune with absolute values (in KB):

Example: limit dirty data to 1 GB and start background writeback at 256 MB:

sysctl -w vm.dirty_bytes=$((1024*1024*1024))
sysctl -w vm.dirty_background_bytes=$((256*1024*1024))

Persist by adding to /etc/sysctl.conf. Note that the _bytes and _ratio variants are mutually exclusive: setting one resets its counterpart to zero.
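The byte values are easy to get wrong; a quick shell sanity check of the arithmetic:

```shell
# 256 MiB and 1 GiB expressed in bytes, matching the sysctl values above.
echo $((256 * 1024 * 1024))     # 268435456  -> vm.dirty_background_bytes
echo $((1024 * 1024 * 1024))    # 1073741824 -> vm.dirty_bytes
```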

2. `vm.min_free_kbytes`

Controls how much memory the kernel tries to keep free for emergencies and for allocating contiguous pages.

On modern systems with lots of RAM, increasing it slightly from the default can improve stability under heavy load, but drastic changes can cause problems of their own (the reserved memory is unavailable to applications). Only tune this when you observe allocation stalls or reclaim thrashing and have read the relevant kernel documentation.

3. `vm.overcommit_memory` and `vm.overcommit_ratio`

These control memory overcommit — promising more memory to processes than physically exists. vm.overcommit_memory has three modes:

• 0 (default): heuristic overcommit; most allocations succeed and the OOM killer handles exhaustion
• 1: always overcommit; never refuse an allocation
• 2: strict accounting; allocations fail once the commit limit (swap + RAM × overcommit_ratio / 100) is reached

For memory-critical servers (databases, scientific workloads):

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80

Use this when you must bound memory usage: with strict accounting, allocations fail predictably (malloc returns NULL) instead of succeeding and later being terminated by seemingly random OOM kills.
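With strict accounting, the kernel enforces a commit limit of swap + RAM × overcommit_ratio / 100. A sketch of the arithmetic with assumed sizes (64 GiB RAM, 8 GiB swap, the ratio of 80 from above); compare the result against CommitLimit in /proc/meminfo:

```shell
ram_kib=$((64 * 1024 * 1024))    # 64 GiB of RAM, in KiB (assumed)
swap_kib=$((8 * 1024 * 1024))    # 8 GiB of swap, in KiB (assumed)
ratio=80                         # vm.overcommit_ratio

# commit limit = swap + RAM * ratio / 100
commit_limit=$((swap_kib + ram_kib * ratio / 100))
echo "Commit limit: ${commit_limit} KiB (~$((commit_limit / 1024 / 1024)) GiB)"
```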

4. `vm.vfs_cache_pressure`

Controls how aggressively the kernel reclaims inode/dentry caches (directory and file metadata).

If you have filesystem-heavy workloads and sufficient RAM, a moderately low value (e.g. 50) may improve performance by keeping more metadata cached:

sysctl -w vm.vfs_cache_pressure=50

Avoid extreme values unless you have specific measurements.

5. Zone reclaim and NUMA (NUMA systems only)

On multi-socket NUMA systems, vm.zone_reclaim_mode controls how aggressively memory is reclaimed from local NUMA nodes before allocating from remote ones.

sysctl -w vm.zone_reclaim_mode=0

Only relevant on NUMA hardware. Use numactl --hardware to inspect layout.

Transparent Huge Pages (THP) and HugeTLB

Huge pages (e.g. 2 MB instead of 4 KB) reduce page table size and TLB misses, potentially improving performance for some memory-heavy workloads.

Transparent Huge Pages (THP)

Linux can automatically back eligible memory mappings with huge pages via THP.

Check current mode:

cat /sys/kernel/mm/transparent_hugepage/enabled

It typically shows something like: [always] madvise never (brackets indicate current mode).

Common options:

• always — the kernel uses huge pages for any eligible mapping
• madvise — huge pages only where applications opt in via madvise(MADV_HUGEPAGE)
• never — THP disabled

Tuning:

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

For latency-sensitive systems where THP causes issues (e.g. compaction stalls), use never. For big DB hosts, check vendor recommendations; many recommend disabling THP entirely and configuring explicit huge pages instead.
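In scripts you often want just the active mode (the bracketed word). A small sketch, run here against a sample string rather than the live sysfs file:

```shell
# On a real system, replace the assignment with:
#   thp=$(cat /sys/kernel/mm/transparent_hugepage/enabled)
thp="[always] madvise never"

# The active mode is the token in square brackets.
mode=$(echo "$thp" | grep -o '\[[a-z]*\]' | tr -d '[]')
echo "THP mode: $mode"    # THP mode: always
```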

Explicit huge pages (HugeTLB)

For peak performance and predictability, you can allocate explicit huge pages:

  sysctl -w vm.nr_hugepages=1024

This is advanced tuning: follow your application’s documentation (databases like Oracle, PostgreSQL, MySQL may benefit).
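It is worth computing how much RAM such a reservation pins, since HugeTLB pages are unavailable for ordinary allocations. Assuming the common 2 MiB huge page size (check Hugepagesize: in /proc/meminfo):

```shell
nr_hugepages=1024     # value set via vm.nr_hugepages above
hugepage_kib=2048     # 2 MiB per page (assumed; architecture-dependent)

reserved_kib=$((nr_hugepages * hugepage_kib))
echo "Reserved: ${reserved_kib} KiB ($((reserved_kib / 1024 / 1024)) GiB)"    # 2097152 KiB (2 GiB)
```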

Tuning for Different Workload Types

Memory tuning is always workload-specific. Some typical patterns:

1. Database / in-memory cache servers

Goals: low latency, predictable performance, avoid OOM during spikes.

Typical adjustments (after measuring behavior):

  vm.swappiness = 1
  vm.overcommit_memory = 2
  vm.overcommit_ratio = 80
  vm.dirty_background_bytes = 268435456   # 256 MiB
  vm.dirty_bytes = 1073741824            # 1 GiB

Monitor:

• swap activity (vmstat si/so should stay near zero)
• OOM events in the kernel log
• application latency percentiles, not just averages

2. Web/application servers

Goals: handle many concurrent connections, frequent forks, moderate memory usage.

Patterns:

• many worker processes or threads, frequent forks, per-connection buffers
• memory usage scales with concurrency, so spikes under load are normal

Tuning ideas:

• moderate swappiness (10–30) so idle worker pages can be evicted
• per-service memory caps via cgroups rather than relying on the global OOM killer

Use systemd features (in other chapters) like MemoryMax= and MemoryHigh= per-service to prevent single services from taking over the machine.

3. Batch / HPC / scientific workloads

Goals: high throughput, large in-memory datasets, long-running jobs.

Patterns:

• a few large processes allocating most of RAM up front
• throughput matters more than momentary latency

Tuning:

  vm.overcommit_memory = 2
  vm.overcommit_ratio = 90

4. Desktop / interactive systems

Goals: snappy feel, avoid stalls when RAM fills, allow heavy occasional workloads.

Practical changes:

  vm.swappiness = 10

Avoiding and Handling OOM (Out of Memory)

When memory is exhausted, Linux’s OOM killer chooses processes to terminate.

Detecting OOM events

Symptoms:

• processes disappear without an application-level error; services restart unexpectedly
• “Killed” messages in application or shell output
• kernel log entries, which you can search for with:

  dmesg | grep -i "out of memory"
  journalctl -k | grep -i "out of memory"

Look for lines like “Out of memory: Killed process <pid> (name)” (older kernels print “Out of memory: Kill process <pid> (name) score X or sacrifice child”).

Tuning OOM behavior with `oom_score_adj`

Each process has an oom_score and an adjustable oom_score_adj (-1000 to +1000).

Example: protect a critical service:

echo -1000 | sudo tee /proc/$(pidof my_critical_service)/oom_score_adj

You can configure this permanently through systemd unit files (e.g. OOMScoreAdjust=) instead of scripting manual changes.
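For instance, a drop-in for a hypothetical my_critical_service.service (created with systemctl edit my_critical_service.service) might contain:

```
[Service]
# Strongly deprioritize this service for the OOM killer.
# -1000 exempts it entirely; -900 keeps it killable as a last resort.
OOMScoreAdjust=-900
```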

Be careful: if you protect too many processes, the OOM killer may have no good candidates and the system may become unresponsive instead of recovering.

Proactive controls with cgroups (memory.max, memory.high)

Instead of waiting for OOM:

• set a soft limit (memory.high) so a runaway service is throttled and reclaimed before it exhausts RAM
• set a hard limit (memory.max) so any OOM kill stays inside the offending cgroup instead of hitting the whole system
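With systemd, the cgroup memory.high and memory.max knobs map to the MemoryHigh= and MemoryMax= unit directives; a sketch for a hypothetical app.service (the sizes are illustrative):

```
[Service]
# Throttle and reclaim aggressively above 6 GiB...
MemoryHigh=6G
# ...and hard-cap at 8 GiB: an OOM kill then stays inside this service's cgroup.
MemoryMax=8G
```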

Reducing Application Memory Footprint

Kernel tuning helps, but you often gain more by reducing what applications use.

Consider:

• right-sizing in-application caches and pools (JVM heap, database buffers)
• reducing per-connection and per-thread overhead (stack sizes, buffers)
• fixing leaks and fragmentation before reaching for kernel tuning

Use:

• smem or ps to rank consumers (PSS accounts fairly for shared memory)
• /proc/<pid>/smaps or pmap for per-mapping detail
• heap profilers (e.g. valgrind’s massif) for leaks and fragmentation

NUMA-aware Memory Tuning (Advanced)

On multi-socket servers, memory is divided into NUMA nodes, each with local CPUs. Accessing local memory is faster than remote memory.

Key tools:

• numactl --hardware — show node layout and free memory per node
• numastat — per-node allocation hits and misses

Issues:

• remote-node accesses adding latency when memory lands on the wrong node
• one node under reclaim pressure while others still have free memory

Mitigations:

• bind processes and their memory with numactl --cpunodebind / --membind
• for most server workloads, disable zone reclaim:

  sysctl -w vm.zone_reclaim_mode=0

Measuring, Testing, and Iterating Safely

Memory tuning should follow a deliberate cycle:

  1. Measure baseline:
    • Collect metrics (RAM usage, swap, page faults, I/O wait, app latency).
    • Use tools like sar, vmstat, free, top, application metrics.
  2. Change one thing at a time:
    • Adjust a single parameter (e.g. swappiness).
    • Apply via sysctl -w or by editing /etc/sysctl.conf for persistence.
  3. Load test or observe under real workload:
    • Reproduce typical or worst-case load.
    • Measure before/after differences.
  4. Document and version-control:
    • Keep sysctl settings under configuration management.
    • Note why each non-default value was chosen.
  5. Roll back if needed:
    • If latency, OOMs, or instability increase, revert to last known good settings.
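Step 1 of this cycle can be scripted so baselines stay comparable between runs; a minimal sketch reading only /proc (the output path is arbitrary):

```shell
#!/bin/sh
# Snapshot the key memory and paging counters to a timestamped file.
out="/tmp/mem-baseline-$(date +%Y%m%d-%H%M%S).txt"
{
    date
    grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree|Dirty):' /proc/meminfo
    grep -E '^(pgfault|pgmajfault|pswpin|pswpout) ' /proc/vmstat
} > "$out"
echo "Baseline written to $out"
```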

Never adopt “universal” sysctl recipes blindly. Effective memory tuning depends on:

• the workload’s access patterns and working-set size
• the hardware (RAM size, storage speed, NUMA topology)
• kernel version and distribution defaults
• your actual latency and throughput requirements
