Understanding Linux Performance
Performance tuning is about making a system do more useful work with the same (or fewer) resources, while staying stable and predictable. It is not just “making benchmarks faster”.
At a high level, tuning always follows this loop:
- Define the goal
What do you care about?
- Maximum throughput (requests/second, jobs/hour, MB/s)
- Minimum latency (response time, jitter)
- Capacity (how many users / VMs / containers)
- Efficiency (work per watt, work per dollar)
- Measure the current state
Use monitoring and profiling tools (covered in other chapters) to see:
- Where time is spent
- Which resources are saturated
- How performance changes with load
- Form a hypothesis
Examples:
- “The CPU is saturated on core 0 due to interrupts; moving NIC IRQs to other cores will help.”
- “We’re I/O bound due to synchronous writes; enabling write‑back caching may help.”
- “Context switches are high; reducing the process count may help.”
- Apply a change
Change one thing at a time:
- A kernel parameter
- A scheduler setting
- A service configuration
- Hardware / topology layout
- Measure again
Compare before/after:
- If it improved the target metric, keep it.
- If it didn’t, revert and try a different hypothesis.
This loop is the core of performance tuning regardless of subsystem (CPU, memory, disk, network, etc.). Later chapters in this section look at CPU/memory/disk tuning specifics; here we stay at the strategy and system‑wide level.
Key Performance Concepts
Throughput vs Latency vs Utilization
These three concepts often pull in different directions:
- Throughput: how much work per unit time
Examples: HTTP requests/second, database transactions/sec, GB/s read/write.
- Latency: time to complete a single operation
Examples: p50, p95, p99 response times, job completion time.
- Utilization: how “busy” a resource is
- CPU utilization (% busy)
- Disk utilization
- Network link utilization
Typical relationships:
- Pushing for maximum throughput tends to increase latency once a resource becomes saturated.
- Operating near 100% utilization of any bottleneck resource usually causes unpredictable latency (queueing).
- For interactive systems, a good rule of thumb is to keep critical resources under about 70–80% utilization under normal peak.
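One quick, hedged way to see this in practice: compare the run queue to the number of cores, and watch per-device utilization and queue size (a sketch using standard procps/sysstat tools; exact column names vary between versions):

```bash
# Number of CPUs, for comparison with the run queue
nproc

# 'r' = runnable tasks; consistently above the core count means CPU queueing
vmstat 1 5

# Per-disk utilization (%util) and average queue size (aqu-sz in newer sysstat)
iostat -x 1 5
```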
Bottlenecks and Amdahl’s Law
A bottleneck is the resource or code path that limits overall performance; speeding up anything else yields little benefit.
Amdahl’s Law quantifies this:
- If a fraction $p$ of your workload is sped up by a factor $s$, the overall speedup is:
$$
\text{Speedup} = \frac{1}{(1 - p) + \frac{p}{s}}
$$
Implications for tuning:
- If only 20% of the time is in disk I/O, even infinite disk speed gives at most:
$$
\frac{1}{0.8 + 0} = 1.25\times
$$
- You get the biggest wins by:
- Finding the largest bottleneck (largest $p$).
- Improving that part of the system.
This is why profiling and system‑wide tracing are more valuable than tweaking random kernel tunables.
Little’s Law and Queues
Whenever requests flow through a shared resource (CPU, disk, network, database), Little’s Law applies:
$$
L = \lambda \times W
$$
Where:
- $L$ = average number of requests in the system (queue + being served)
- $\lambda$ = arrival rate (requests/sec)
- $W$ = average time in the system (latency)
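For example, with assumed numbers $\lambda = 200$ requests/sec and an average latency of $W = 50$ ms $= 0.05$ s:

$$
L = 200 \times 0.05 = 10
$$

so roughly 10 requests are in the system (queued or being served) at any moment.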
For performance tuning:
- If arrival rate $\lambda$ increases and capacity doesn’t, either:
- Latency $W$ goes up (queues grow), or
- Requests are dropped/throttled.
Your job includes:
- Avoiding operation so close to capacity that $W$ explodes with small increases in $\lambda$.
- Implementing back‑pressure and limits so the system degrades gracefully.
A Systematic Tuning Workflow
1. Define the Workload
Performance depends heavily on what you’re running. Clarify:
- Type of workload
- CPU‑bound (compilation, encoding, crypto)
- Memory‑intensive (in‑memory DB, caches)
- I/O‑bound (databases, file servers, backup jobs)
- Latency‑sensitive (web APIs, VoIP, trading)
- Mixed / multi‑tenant
- Load pattern
- Steady vs bursty
- Diurnal cycles (day/night patterns)
- Rare but extreme spikes (sales events, batch windows)
- SLA/SLOs
- Maximum acceptable latency (e.g., p95 < 200 ms)
- Expected throughput
- Reliability targets (error rates, uptime)
Without this, you can’t meaningfully judge whether a change is an improvement.
2. Establish a Baseline
Before tuning, capture:
- Current resource usage at representative load
- CPU usage and run queue lengths
- Memory usage and reclaim activity
- Disk I/O latency and throughput
- Network throughput and drops
- Application metrics
- Requests per second
- Latencies (p50, p95, p99)
- Error/timeout rates
- Configuration snapshot
- Kernel version
- Key sysctl values (`/proc/sys/*`)
- Filesystem types and mount options
- Scheduler, governor, and power settings
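A minimal sketch of capturing such a snapshot in one go (tool availability and paths vary by distribution; adjust to your environment):

```bash
#!/bin/bash
# Illustrative baseline snapshot: kernel, sysctls, mounts, governor, memory, I/O, network.
out="baseline-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== Kernel ==";       uname -a
  echo "== Sysctl ==";       sysctl -a 2>/dev/null
  echo "== Mounts ==";       findmnt
  echo "== CPU governor =="; cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null
  echo "== Memory ==";       free -h
  echo "== Disk I/O ==";     iostat -x 1 3
  echo "== Network ==";      ip -s link
} > "$out"
echo "Baseline written to $out"
```

Commit the output alongside your tuning notes so every later change can be diffed against it.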
Keep this baseline somewhere versioned (Git, docs) so you can:
- Compare after changes.
- Roll back if needed.
- Know what “normal” looks like.
3. Quick Health Checks
Before deep tuning, verify:
- No obvious hardware issues
- SMART errors on disks
- PCIe or NIC errors in logs
- High correctable ECC memory error rates
- No misconfigurations
- Swap completely disabled on memory‑constrained systems without justification
- Write‑back caches disabled or undersized where they could safely be enabled
- Frequent OOM kills
- No broken expectations
- Logs flooded with errors/warnings
- Background tasks (backup, cron jobs) competing with production work at peak times
Often, fixing low‑hanging fruit yields large gains without complex tuning.
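A few illustrative spot checks for the items above (device names are placeholders; `smartctl` comes from the smartmontools package):

```bash
# Disk health summary (replace /dev/sda with the device in question)
sudo smartctl -H /dev/sda

# Kernel log: hardware errors, NIC/PCIe problems, OOM kills
sudo dmesg -T | grep -iE 'error|fail|out of memory'

# Swap status and overall memory headroom
swapon --show
free -h
```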
4. Identify the Primary Bottleneck
For each major resource, answer:
- CPU
- Are CPUs mostly idle, or pegged near 100%?
- Are a few cores overloaded while others are idle (affinity / imbalance)?
- Is there high system time (kernel), user time, or wait time?
- Memory
- Is the system swapping or reclaiming heavily?
- Cache hit rates (application caches, DB caches)
- Page faults and NUMA locality issues
- Storage
- High average I/O latency?
- High queue depths?
- Small random vs large sequential I/O?
- Network
- Bandwidth near line rate?
- Packet drops, retransmits, or out‑of‑order packets?
- Single flow limited by TCP windowing or the bandwidth‑delay product?
Once you know which resource saturates first under load, it becomes your primary tuning target.
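A rough first-pass toolkit for answering these questions (procps, sysstat, and iproute2 tools; exact column names differ between versions):

```bash
# CPU: per-core usage, user vs system vs iowait, imbalance between cores
mpstat -P ALL 1 5

# Memory: swap-in/out (si/so) and run queue (r) at a glance
vmstat 1 5

# Storage: per-device latency (await), queue size, and utilization
iostat -x 1 5

# Network: per-interface throughput, errors, and drops; TCP retransmits
sar -n DEV 1 5
nstat -az TcpRetransSegs
```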
5. Plan and Prioritize Changes
Possible classes of changes:
- Architectural / application‑level
- Caching at the right layer
- Reducing chattiness (fewer round trips)
- Batching operations
- Using more efficient algorithms
- System‑level
- Adjusting kernel scheduling and priorities
- Tuning network buffers and TCP stack parameters
- Choosing suitable I/O schedulers and mount options
- Configuring IRQ affinity and NUMA settings
- Capacity / topology
- Scaling vertically (more CPU/RAM/IO)
- Scaling horizontally (more nodes)
- Re‑balancing services across hosts
System‑level tuning cannot fix fundamental design issues. If every request requires N synchronous disk writes, no amount of scheduler tuning will match a design that batches or avoids those writes.
6. Test, Validate, and Document
For each change:
- Test under as realistic a load as possible:
- Use representative datasets.
- Exercise the same request patterns.
- Measure:
- Did your target metric (e.g., p95 latency) improve?
- Did you introduce new problems (e.g., jitter, GC pauses, tail latencies)?
- Document:
- What was changed.
- Why it was changed (hypothesis).
- Before/after metrics and graphs.
- Any side effects or caveats.
Treat tuning like code: change‑controlled, reviewed, and reversible.
System‑Level Tuning Themes
Schedulers and Priorities
Linux uses multiple schedulers that all influence performance:
- CPU scheduler
- Decides which task runs on which core
- Affected by:
- Process priorities (`nice`)
- Scheduling policies (normal vs real‑time)
- CPU affinity (which cores a process can run on)
- I/O scheduler
- Orders disk operations to balance throughput vs latency
- Different schedulers favor different workloads (sequential vs random, HDD vs SSD).
- Network queueing
- TX/RX queue sizes
- Traffic shaping / QoS disciplines
Tuning often involves:
- Ensuring critical services are not starved by background jobs.
- Pinning latency‑sensitive tasks to cores with less interrupt load.
- Choosing I/O scheduling and queueing policies that match access patterns.
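A few representative knobs, shown as a hedged sketch (script names, core numbers, devices, and priorities below are placeholders, not recommendations):

```bash
# Run a background job at lower CPU and I/O priority
nice -n 10 ionice -c 2 -n 7 ./backup-job.sh   # ./backup-job.sh is a placeholder

# Pin a latency-sensitive process to cores 2-3, away from heavily interrupted cores
taskset -c 2,3 ./latency-critical-service     # placeholder binary

# Inspect and change the I/O scheduler for one block device
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
```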
Details of how to do this are covered in the CPU and Disk performance chapters, but here the main principle is:
Match scheduler behavior to workload needs: throughput vs latency, fairness vs priority.
Caching and Locality
Most performance wins on modern systems come from better use of caches:
- CPU caches (L1/L2/L3)
- Page cache (filesystem cache in RAM)
- Application‑level caches (e.g., Redis, memcached)
- Database buffer pools
Core ideas:
- Temporal locality: reused data should stay close to the CPU (hot working set).
- Spatial locality: accessing data that’s contiguous (e.g., sequential file access) is cheaper.
From a tuning perspective:
- Avoid unnecessary cache flushes (e.g., frequent `sync` or `drop_caches` in cron jobs).
- Ensure critical data can fit into available caches (tune memory distribution between OS and applications).
- Consider access patterns when laying out data (e.g., sequential vs random I/O).
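To see how much RAM the page cache is using and how much dirty data is waiting to reach disk (a sketch using standard tools; never clear caches on production systems just to make “free” memory look larger):

```bash
# 'buff/cache' column: memory currently used by the page cache
free -h

# Dirty pages and writeback in flight
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Watch cache size and block I/O (cache, bi, bo columns) over time
vmstat 1 5
```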
NUMA Considerations
On multi‑socket systems, memory is separated into NUMA nodes:
- Access to memory on the local node is faster.
- Access to remote node memory incurs additional latency.
Performance tuning on NUMA systems often includes:
- Ensuring processes are pinned (CPU affinity) so that they primarily use memory from their local node.
- Avoiding memory imbalance, where one node is full while another has free memory, causing unnecessary cross‑node access.
If you treat a NUMA machine like a uniform SMP, you may leave significant performance on the table for memory‑sensitive workloads.
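To check topology and per-node behavior, and to bind a process to one node (a sketch; `numactl` and `numastat` come from the numactl package, and the binary name is a placeholder):

```bash
# NUMA nodes, their CPUs, and free memory per node
numactl --hardware

# Allocation statistics; growing numa_miss/numa_foreign suggests remote accesses
numastat

# Run a process with both CPUs and memory restricted to node 0
numactl --cpunodebind=0 --membind=0 ./memory-intensive-app
```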
Power Management vs Performance
Modern systems use power‑saving features:
- CPU frequency scaling (governors like `powersave`, `ondemand`, `performance`)
- Core parking / deep sleep states (C‑states)
- Device power‑saving (e.g., disks, NICs)
Trade‑offs:
- Best latency often requires:
- Higher minimum frequencies
- Shallower sleep states (faster wakeup)
- Best energy efficiency often:
- Allows more aggressive throttling and deeper sleep
For performance‑critical systems:
- Choose power settings consistent with goals:
- Real‑time trading system: favor latency over power savings.
- Batch compute cluster: maybe accept slightly higher latency for lower power use.
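Checking and changing the frequency governor (a sketch; `cpupower` ships in the distribution’s linux-tools/kernel-tools package, and available governors depend on the cpufreq driver):

```bash
# Current governor on each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Driver, available governors, and frequency limits
cpupower frequency-info

# Favor latency over power savings on all cores
sudo cpupower frequency-set -g performance
```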
Measuring and Thinking About Performance
1. Time Scales and Granularity
Performance issues can appear at:
- Microseconds (interrupts, lock contention)
- Milliseconds (request latency)
- Seconds to minutes (GC pauses, batch jobs)
- Hours to days (memory leaks, fragmentation, slow log growth)
Tune your measurement tools to:
- Capture short‑spike behavior (perf, tracing, microbenchmarks).
- Observe long‑term trends (monitoring, dashboards, historical logs).
Over‑aggregated metrics (e.g., 1‑minute averages) can hide:
- Short but frequent latency spikes.
- “Noisy neighbor” effects on shared infrastructure.
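For example, one-second samples keep short spikes visible that a one-minute average flattens away (assuming sysstat’s `sar` is installed):

```bash
# Sixty 1-second CPU samples: brief saturation spikes remain visible
sar -u 1 60

# A single 60-second average of the same interval hides those spikes
sar -u 60 1
```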
2. Tail Latency
Real systems care about more than averages:
- p50: median — typical experience
- p95: worst experience for 5% of users
- p99 / p99.9: critical for high‑SLA services
Tuning for tail latency often involves:
- Reducing jitter sources:
- GC, compaction, periodic cron/backup jobs
- Log rotation stalls
- Co‑tenanted heavy jobs
- Adding headroom:
- Keeping resource utilization below saturation so queues don’t explode.
- Improving isolation:
- cgroups, namespaces, CPU pinning
- Dedicated hardware for critical services
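If your application logs one latency value per request, a rough way to extract these percentiles from a sample (an approximate sketch; `latencies.txt` is a hypothetical file with one latency in milliseconds per line):

```bash
# Approximate nearest-rank p50/p95/p99 from per-request latencies (ms)
sort -n latencies.txt | awk '
  { v[NR] = $1 }
  END {
    if (NR == 0) exit
    i50 = int(NR * 0.50); if (i50 < 1) i50 = 1
    i95 = int(NR * 0.95); if (i95 < 1) i95 = 1
    i99 = int(NR * 0.99); if (i99 < 1) i99 = 1
    printf "p50=%s ms  p95=%s ms  p99=%s ms\n", v[i50], v[i95], v[i99]
  }'
```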
3. Holistic vs Local Optimization
Danger of local optimization:
- Making a microbenchmark 3× faster but:
- Increasing memory use so much that the OS swaps
- Or moving the bottleneck to a shared critical resource
Always ask:
- Did this change improve the end‑to‑end metric we care about?
- Did it cause regressions elsewhere (e.g., another service on the same box)?
Holistic tuning sometimes means intentionally slowing down one component to:
- Prevent overload of a downstream service.
- Flatten load peaks to improve overall throughput.
Practical Tuning Guidelines
General Principles
- Change one thing at a time
Combining multiple tweaks makes it impossible to know which one helped or hurt.
- Prefer simple over clever
Simple, well‑understood configurations are easier to debug and operate.
- Don’t over‑tune for a synthetic benchmark
Benchmark‑friendly settings might not reflect real‑world usage.
- Automate and codify tuning
Keep sysctl and service configs under version control and deploy them consistently.
- Document intent
Every non‑default parameter should have a comment:
- What it does
- Why it was changed
- When it should be revisited
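A hedged sketch of what documented intent can look like in a version-controlled sysctl drop-in (parameter values here are purely illustrative, not recommendations):

```
# /etc/sysctl.d/99-tuning.conf -- kept in Git, applied with `sysctl --system`

# What: larger accept backlog for the frontend web tier.
# Why:  connection bursts at peak were overflowing the default backlog.
# Revisit: after the next load test, or if the web tier is re-architected.
net.core.somaxconn = 1024

# What: reduce the kernel's eagerness to swap anonymous memory.
# Why:  the database working set must stay resident; swapping caused p99 spikes.
# Revisit: if RAM is added or the database moves to its own host.
vm.swappiness = 10
```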
When to Tune vs When to Scale
Tuning helps when:
- You see clear inefficiencies (e.g., excessive context switches, cache misses).
- You’re far from hardware limits but hitting software limits (lock contention, poor scheduling).
Scaling (more or better hardware) might be better when:
- You’re already near the practical limits of well‑tuned software.
- The cost of deep tuning exceeds hardware cost.
- You need redundancy/high availability anyway.
Often, you do some tuning to use hardware efficiently, then scale out.
Safe vs Risky Changes
Relatively safe (when tested properly):
- Adjusting priorities of non‑critical background jobs
- Choosing appropriate I/O scheduler for SSD vs HDD
- Increasing application‑specific connection limits within reason
- Tuning logging verbosity to reduce I/O
Riskier (require careful testing and rollback plans):
- Aggressive TCP and network buffer tuning
- Changing VM overcommit/hugepages strategies
- Disabling safety mechanisms (e.g., some integrity checks)
- Editing low‑level kernel parameters without understanding side effects
Always maintain a rollback procedure and test in non‑production environments first.
Performance Tuning in the Bigger Picture
Performance tuning isn't a one‑time activity; it’s part of:
- Capacity planning
- Estimating future needs based on trends
- Justifying hardware and cloud budgets
- Release management
- Regression testing for performance on new versions
- Canary deployments and gradual rollouts
- Incident response
- Diagnosing sudden slowdowns or resource exhaustion
- Applying temporary mitigations (rate limits, shed load)
- DevOps culture
- Sharing findings with developers and operations
- Embedding performance metrics into CI/CD pipelines
By treating performance as a continuous, measured, and collaborative practice, you reduce firefighting and increase system reliability.
In the following chapters (CPU tuning, memory tuning, and disk/I/O optimization), these general principles will be applied to specific subsystems, with concrete tools and configuration examples.