Understanding Linux Performance
Performance tuning is about making a system do more useful work with the same (or fewer) resources, while staying stable and predictable. It is not just “making benchmarks faster”.
At a high level, tuning always follows this loop:
- Define the goal
What do you care about?
- Maximum throughput (requests/second, jobs/hour, MB/s)
- Minimum latency (response time, jitter)
- Capacity (how many users / VMs / containers)
- Efficiency (work per watt, work per dollar)
- Measure the current state
Use monitoring and profiling tools (covered in other chapters) to see:
- Where time is spent
- Which resources are saturated
- How performance changes with load
- Form a hypothesis
Examples:
- “The CPU is saturated on core 0 due to interrupts; moving NIC IRQs to other cores will help.”
- “We’re I/O bound due to synchronous writes; enabling write‑back caching may help.”
- “Context switches are high; reducing the process count may help.”
- Apply a change
Change one thing at a time:
- A kernel parameter
- A scheduler setting
- A service configuration
- Hardware / topology layout
- Measure again
Compare before/after:
- If it improved the target metric, keep it.
- If it didn’t, revert and try a different hypothesis.
This loop is the core of performance tuning regardless of subsystem (CPU, memory, disk, network, etc.). Later chapters in this section look at CPU/memory/disk tuning specifics; here we stay at the strategy and system‑wide level.
Key Performance Concepts
Throughput vs Latency vs Utilization
These three concepts often pull in different directions:
- Throughput: how much work per unit time
Examples: HTTP requests/second, database transactions/sec, GB/s read/write.
- Latency: time to complete a single operation
Examples: p50, p95, p99 response times, job completion time.
- Utilization: how “busy” a resource is
- CPU utilization (% busy)
- Disk utilization
- Network link utilization
Typical relationships:
- Pushing for maximum throughput tends to increase latency once a resource becomes saturated.
- Operating near 100% utilization of any bottleneck resource usually causes unpredictable latency (queueing).
- For interactive systems, a good rule of thumb is to keep critical resources under about 70–80% utilization under normal peak.
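One quick, hedged way to see this in practice: compare the run queue to the number of cores, and watch per-device utilization and queue size (a sketch using standard procps/sysstat tools; exact column names vary between versions):

```bash
# Number of CPUs, for comparison with the run queue
nproc

# 'r' = runnable tasks; consistently above the core count means CPU queueing
vmstat 1 5

# Per-disk utilization (%util) and average queue size (aqu-sz in newer sysstat)
iostat -x 1 5
```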
Bottlenecks and Amdahl’s Law
A bottleneck is the resource or code path that limits overall performance; speeding up anything else yields little benefit.
Amdahl’s Law quantifies this:
- If a fraction $p$ of your workload is sped up by a factor $s$, the overall speedup is:
$$
\text{Speedup} = \frac{1}{(1 - p) + \frac{p}{s}}
$$
Implications for tuning:
- If only 20% of the time is in disk I/O, even infinite disk speed gives at most:
$$
\frac{1}{0.8 + 0} = 1.25\times
$$
- You get the biggest wins by:
- Finding the largest bottleneck (largest $p$).
- Improving that part of the system.
This is why profiling and system‑wide tracing are more valuable than tweaking random kernel tunables.
Little’s Law and Queues
Whenever requests flow through a shared resource (CPU, disk, network, database), Little’s Law applies:
$$
L = \lambda \times W
$$
Where:
- $L$ = average number of requests in the system (queue + being served)
- $\lambda$ = arrival rate (requests/sec)
- $W$ = average time in the system (latency)
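For example, with assumed numbers $\lambda = 200$ requests/sec and an average latency of $W = 50$ ms $= 0.05$ s:

$$
L = 200 \times 0.05 = 10
$$

so roughly 10 requests are in the system (queued or being served) at any moment.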
For performance tuning:
- If arrival rate $\lambda$ increases and capacity doesn’t, either:
- Latency $W$ goes up (queues grow), or
- Requests are dropped/throttled.
Your job includes:
- Avoiding operation so close to capacity that $W$ explodes with small increases in $\lambda$.
- Implementing back‑pressure and limits so the system degrades gracefully.
A Systematic Tuning Workflow
1. Define the Workload
Performance depends heavily on what you’re running. Clarify:
- Type of workload
- CPU‑bound (compilation, encoding, crypto)
- Memory‑intensive (in‑memory DB, caches)
- I/O‑bound (databases, file servers, backup jobs)
- Latency‑sensitive (web APIs, VoIP, trading)
- Mixed / multi‑tenant
- Load pattern
- Steady vs bursty
- Diurnal cycles (day/night patterns)
- Rare but extreme spikes (sales events, batch windows)
- SLA/SLOs
- Maximum acceptable latency (e.g., p95 < 200 ms)
- Expected throughput
- Reliability targets (error rates, uptime)
Without this, you can’t meaningfully judge whether a change is an improvement.
2. Establish a Baseline
Before tuning, capture:
- Current resource usage at representative load
- CPU usage and run queue lengths
- Memory usage and reclaim activity
- Disk I/O latency and throughput
- Network throughput and drops
- Application metrics
- Requests per second
- Latencies (p50, p95, p99)
- Error/timeout rates
- Configuration snapshot
- Kernel version
- Key sysctl values (`/proc/sys/*`)
- Filesystem types and mount options
- Scheduler, governor, and power settings
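A minimal sketch of capturing such a snapshot in one go (tool availability and paths vary by distribution; adjust to your environment):

```bash
#!/bin/bash
# Illustrative baseline snapshot: kernel, sysctls, mounts, governor, memory, I/O, network.
out="baseline-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== Kernel ==";       uname -a
  echo "== Sysctl ==";       sysctl -a 2>/dev/null
  echo "== Mounts ==";       findmnt
  echo "== CPU governor =="; cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null
  echo "== Memory ==";       free -h
  echo "== Disk I/O ==";     iostat -x 1 3
  echo "== Network ==";      ip -s link
} > "$out"
echo "Baseline written to $out"
```

Commit the output alongside your tuning notes so every later change can be diffed against it.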
Keep this baseline somewhere versioned (Git, docs) so you can:
- Compare after changes.
- Roll back if needed.
- Know what “normal” looks like.
3. Quick Health Checks
Before deep tuning, verify:
- No obvious hardware issues
- SMART errors on disks
- PCIe or NIC errors in logs
- High correctable ECC memory error rates
- No misconfigurations
- Swap completely disabled on memory‑constrained systems without justification
- Write‑back caches disabled or undersized where they could safely be enabled
- Frequent OOM kills
- No broken expectations
- Logs flooded with errors/warnings
- Background tasks (backup, cron jobs) competing with production work at peak times
Often, fixing low‑hanging fruit yields large gains without complex tuning.
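A few illustrative spot checks for the items above (device names are placeholders; `smartctl` comes from the smartmontools package):

```bash
# Disk health summary (replace /dev/sda with the device in question)
sudo smartctl -H /dev/sda

# Kernel log: hardware errors, NIC/PCIe problems, OOM kills
sudo dmesg -T | grep -iE 'error|fail|out of memory'

# Swap status and overall memory headroom
swapon --show
free -h
```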
4. Identify the Primary Bottleneck
For each major resource, answer:
- CPU
- Are CPUs mostly idle, or pegged near 100%?
- Are a few cores overloaded while others are idle (affinity / imbalance)?
- Is there high system time (kernel), user time, or wait time?
- Memory
- Is the system swapping or reclaiming heavily?
- Cache hit rates (application caches, DB caches)
- Page faults and NUMA locality issues
- Storage
- High average I/O latency?
- High queue depths?
- Small random vs large sequential I/O?
- Network
- Bandwidth near line rate?
- Packet drops, retransmits, or out‑of‑order packets?
- Single flow limited by TCP windowing or the bandwidth‑delay product?
Once you know which resource saturates first under load, it becomes your primary tuning target.
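A rough first-pass toolkit for answering these questions (procps, sysstat, and iproute2 tools; exact column names differ between versions):

```bash
# CPU: per-core usage, user vs system vs iowait, imbalance between cores
mpstat -P ALL 1 5

# Memory: swap-in/out (si/so) and run queue (r) at a glance
vmstat 1 5

# Storage: per-device latency (await), queue size, and utilization
iostat -x 1 5

# Network: per-interface throughput, errors, and drops; TCP retransmits
sar -n DEV 1 5
nstat -az TcpRetransSegs
```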
5. Plan and Prioritize Changes
Possible classes of changes:
- Architectural / application‑level
- Caching at the right layer
- Reducing chattiness (fewer round trips)
- Batching operations
- Using more efficient algorithms
- System‑level
- Adjusting kernel scheduling and priorities
- Tuning network buffers and TCP stack parameters
- Choosing suitable I/O schedulers and mount options
- Configuring IRQ affinity and NUMA settings
- Capacity / topology
- Scaling vertically (more CPU/RAM/IO)
- Scaling horizontally (more nodes)
- Re‑balancing services across hosts
System‑level tuning cannot fix fundamental design issues. If every request requires N synchronous disk writes, no amount of scheduler tuning will match a design that batches or avoids those writes.
6. Test, Validate, and Document
For each change:
- Test under as realistic a load as possible:
- Use representative datasets.
- Exercise the same request patterns.
- Measure:
- Did your target metric (e.g., p95 latency) improve?
- Did you introduce new problems (e.g., jitter, GC pauses, tail latencies)?
- Document:
- What was changed.
- Why it was changed (hypothesis).
- Before/after metrics and graphs.
- Any side effects or caveats.
Treat tuning like code: change‑controlled, reviewed, and reversible.
System‑Level Tuning Themes
Schedulers and Priorities
Linux uses multiple schedulers that all influence performance:
- CPU scheduler
- Decides which task runs on which core
- Affected by:
- Process priorities (`nice`)
- Scheduling policies (normal vs real‑time)
- CPU affinity (which cores a process can run on)
- I/O scheduler
- Orders disk operations to balance throughput vs latency
- Different schedulers favor different workloads (sequential vs random, HDD vs SSD).
- Network queueing
- TX/RX queue sizes
- Traffic shaping / QoS disciplines
Tuning often involves:
- Ensuring critical services are not starved by background jobs.
- Pinning latency‑sensitive tasks to cores with less interrupt load.
- Choosing I/O scheduling and queueing policies that match access patterns.
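A few representative knobs, shown as a hedged sketch (script names, core numbers, devices, and priorities below are placeholders, not recommendations):

```bash
# Run a background job at lower CPU and I/O priority
nice -n 10 ionice -c 2 -n 7 ./backup-job.sh   # ./backup-job.sh is a placeholder

# Pin a latency-sensitive process to cores 2-3, away from heavily interrupted cores
taskset -c 2,3 ./latency-critical-service     # placeholder binary

# Inspect and change the I/O scheduler for one block device
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
```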
Details of how to do this are covered in the CPU and Disk performance chapters, but here the main principle is:
Match scheduler behavior to workload needs: throughput vs latency, fairness vs priority.
Caching and Locality
Most performance wins on modern systems come from better use of caches:
- CPU caches (L1/L2/L3)
- Page cache (filesystem cache in RAM)
- Application‑level caches (e.g., Redis, memcached)
- Database buffer pools
Core ideas:
- Temporal locality: reused data should stay close to the CPU (hot working set).
- Spatial locality: accessing data that’s contiguous (e.g., sequential file access) is cheaper.
From a tuning perspective:
- Avoid unnecessary cache flushes (e.g., frequent `sync` or `drop_caches` in cron jobs).
- Ensure critical data can fit into available caches (tune memory distribution between OS and applications).
- Consider access patterns when laying out data (e.g., sequential vs random I/O).
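To see how much RAM the page cache is using and how much dirty data is waiting to reach disk (a sketch using standard tools; never clear caches on production systems just to make “free” memory look larger):

```bash
# 'buff/cache' column: memory currently used by the page cache
free -h

# Dirty pages and writeback in flight
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Watch cache size and block I/O (cache, bi, bo columns) over time
vmstat 1 5
```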
NUMA Considerations
On multi‑socket systems, memory is separated into NUMA nodes:
- Access to memory on the local node is faster.
- Access to remote node memory incurs additional latency.
Performance tuning on NUMA systems often includes:
- Ensuring processes are pinned (CPU affinity) so that they primarily use memory from their local node.
- Avoiding memory imbalance, where one node is full while another has free memory, causing unnecessary cross‑node access.
If you treat a NUMA machine like a uniform SMP, you may leave significant performance on the table for memory‑sensitive workloads.
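To check topology and per-node behavior, and to bind a process to one node (a sketch; `numactl` and `numastat` come from the numactl package, and the binary name is a placeholder):

```bash
# NUMA nodes, their CPUs, and free memory per node
numactl --hardware

# Allocation statistics; growing numa_miss/numa_foreign suggests remote accesses
numastat

# Run a process with both CPUs and memory restricted to node 0
numactl --cpunodebind=0 --membind=0 ./memory-intensive-app
```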
Power Management vs Performance
Modern systems use power‑saving features:
- CPU frequency scaling (governors like `powersave`, `ondemand`, `performance`)
- Core parking / deep sleep states (C‑states)
- Device power‑saving (e.g., disks, NICs)
Trade‑offs:
- Best latency often requires:
- Higher minimum frequencies
- Shallower sleep states (faster wakeup)
- Best energy efficiency often:
- Allows more aggressive throttling and deeper sleep
For performance‑critical systems:
- Choose power settings consistent with goals:
- Real‑time trading system: favor latency over power savings.
- Batch compute cluster: maybe accept slightly higher latency for lower power use.
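Checking and changing the frequency governor (a sketch; `cpupower` ships in the distribution’s linux-tools/kernel-tools package, and available governors depend on the cpufreq driver):

```bash
# Current governor on each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Driver, available governors, and frequency limits
cpupower frequency-info

# Favor latency over power savings on all cores
sudo cpupower frequency-set -g performance
```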
Measuring and Thinking About Performance
1. Time Scales and Granularity
Performance issues can appear at:
- Microseconds (interrupts, lock contention)
- Milliseconds (request latency)
- Seconds to minutes (GC pauses, batch jobs)
- Hours to days (memory leaks, fragmentation, slow log growth)
Tune your measurement tools to:
- Capture short‑spike behavior (perf, tracing, microbenchmarks).
- Observe long‑term trends (monitoring, dashboards, historical logs).
Over‑aggregated metrics (e.g., 1‑minute averages) can hide:
- Short but frequent latency spikes.
- “Noisy neighbor” effects on shared infrastructure.
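For example, one-second samples keep short spikes visible that a one-minute average flattens away (assuming sysstat’s `sar` is installed):

```bash
# Sixty 1-second CPU samples: brief saturation spikes remain visible
sar -u 1 60

# A single 60-second average of the same interval hides those spikes
sar -u 60 1
```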
2. Tail Latency
Real systems care about more than averages:
- p50: median — typical experience
- p95: worst experience for 5% of users
- p99 / p99.9: critical for high‑SLA services
Tuning for tail latency often involves:
- Reducing jitter sources:
- GC, compaction, periodic cron/backup jobs
- Log rotation stalls
- Co‑tenanted heavy jobs
- Adding headroom:
- Keeping resource utilization below saturation so queues don’t explode.
- Improving isolation:
- cgroups, namespaces, CPU pinning
- Dedicated hardware for critical services
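If your application logs one latency value per request, a rough way to extract these percentiles from a sample (an approximate sketch; `latencies.txt` is a hypothetical file with one latency in milliseconds per line):

```bash
# Approximate nearest-rank p50/p95/p99 from per-request latencies (ms)
sort -n latencies.txt | awk '
  { v[NR] = $1 }
  END {
    if (NR == 0) exit
    i50 = int(NR * 0.50); if (i50 < 1) i50 = 1
    i95 = int(NR * 0.95); if (i95 < 1) i95 = 1
    i99 = int(NR * 0.99); if (i99 < 1) i99 = 1
    printf "p50=%s ms  p95=%s ms  p99=%s ms\n", v[i50], v[i95], v[i99]
  }'
```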
3. Holistic vs Local Optimization
Danger of local optimization:
- Making a microbenchmark 3× faster but:
- Increasing memory use so much that the OS swaps
- Or moving the bottleneck to a shared critical resource
Always ask:
- Did this change improve the end‑to‑end metric we care about?
- Did it cause regressions elsewhere (e.g., another service on the same box)?
Holistic tuning sometimes means intentionally slowing down one component to:
- Prevent overload of a downstream service.
- Flatten load peaks to improve overall throughput.
Practical Tuning Guidelines
General Principles
- Change one thing at a time
Combining multiple tweaks makes it impossible to know which one helped or hurt.
- Prefer simple over clever
Simple, well‑understood configurations are easier to debug and operate.
- Don’t over‑tune for a synthetic benchmark
Benchmark‑friendly settings might not reflect real‑world usage.
- Automate and codify tuning
Keep sysctl and service configs under version control and deploy them consistently.
- Document intent
Every non‑default parameter should have a comment:
- What it does
- Why it was changed
- When it should be revisited
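A hedged sketch of what documented intent can look like in a version-controlled sysctl drop-in (parameter values here are purely illustrative, not recommendations):

```
# /etc/sysctl.d/99-tuning.conf -- kept in Git, applied with `sysctl --system`

# What: larger accept backlog for the frontend web tier.
# Why:  connection bursts at peak were overflowing the default backlog.
# Revisit: after the next load test, or if the web tier is re-architected.
net.core.somaxconn = 1024

# What: reduce the kernel's eagerness to swap anonymous memory.
# Why:  the database working set must stay resident; swapping caused p99 spikes.
# Revisit: if RAM is added or the database moves to its own host.
vm.swappiness = 10
```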
When to Tune vs When to Scale
Tuning helps when:
- You see clear inefficiencies (e.g., excessive context switches, cache misses).
- You’re far from hardware limits but hitting software limits (lock contention, poor scheduling).
Scaling (more or better hardware) might be better when:
- You’re already near the practical limits of well‑tuned software.
- The cost of deep tuning exceeds hardware cost.
- You need redundancy/high availability anyway.
Often, you do some tuning to use hardware efficiently, then scale out.
Safe vs Risky Changes
Relatively safe (when tested properly):
- Adjusting priorities of non‑critical background jobs
- Choosing appropriate I/O scheduler for SSD vs HDD
- Increasing application‑specific connection limits within reason
- Tuning logging verbosity to reduce I/O
Riskier (require careful testing and rollback plans):
- Aggressive TCP and network buffer tuning
- Changing VM overcommit/hugepages strategies
- Disabling safety mechanisms (e.g., some integrity checks)
- Editing low‑level kernel parameters without understanding side effects
Always maintain a rollback procedure and test in non‑production environments first.
Performance Tuning in the Bigger Picture
Performance tuning isn't a one‑time activity; it’s part of:
- Capacity planning
- Estimating future needs based on trends
- Justifying hardware and cloud budgets
- Release management
- Regression testing for performance on new versions
- Canary deployments and gradual rollouts
- Incident response
- Diagnosing sudden slowdowns or resource exhaustion
- Applying temporary mitigations (rate limits, shed load)
- DevOps culture
- Sharing findings with developers and operations
- Embedding performance metrics into CI/CD pipelines
By treating performance as a continuous, measured, and collaborative practice, you reduce firefighting and increase system reliability.
In the following chapters (CPU tuning, memory tuning, and disk/I/O optimization), these general principles will be applied to specific subsystems, with concrete tools and configuration examples.