Understanding CPU Bottlenecks
Before tuning, you must confirm the CPU is actually the bottleneck and understand how it is being used.
Key CPU Utilization Metrics
Most tools ultimately report the same core metrics (from /proc/stat):
- User time: Time running user-space processes (non-kernel)
- System time: Time running kernel code
- Idle: CPU doing nothing (no runnable tasks)
- I/O wait: CPU idle while waiting for disk/network I/O
- Steal: Time taken by the hypervisor (common on VMs)
- Nice: Time running processes with adjusted priority
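A quick way to see these fields broken out per CPU is mpstat (from the sysstat package, which may need installing), or the raw counters themselves:
# Per-CPU %usr, %sys, %iowait, %steal, and %idle, refreshed every second
mpstat -P ALL 1
# The aggregate counters these tools read come from the first line of /proc/stat
head -n 1 /proc/stat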
High CPU usage alone is not always bad; what matters is why and whether it’s affecting latency/throughput.
Typical patterns:
- High user + high load average + slow app: CPU-bound workload
- High system: heavy kernel activity, syscalls, context switches
- High iowait: actually an I/O bottleneck, not pure CPU
- High steal: congested hypervisor, not much to tune inside the guest
- High idle but app slow: locking, serialization, or waiting on other resources
Identifying CPU Bottlenecks
Use monitoring tools (covered elsewhere) to answer:
- Are all cores busy, or just one/few?
- Is contention on kernel/system time or user-space?
- Is performance limited by single-threaded sections?
Look specifically for:
- One process pinning a single core at 100% (single-threaded hot path)
- Many processes at moderate CPU, but load average >> number of cores (run queue congestion)
- Frequent context switches and migrations across cores
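For example, vmstat and pidstat (also from sysstat) expose the run queue, context switches, and per-process CPU directly:
# "r" = runnable tasks (run queue length), "cs" = context switches per second
vmstat 1
# Per-process CPU usage plus voluntary/involuntary context switches
pidstat -u -w 1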
CPU Scheduler and Priorities
Tuning starts with controlling which tasks get CPU time and when.
Nice Levels (Static Priority for Normal Tasks)
nice controls relative CPU share among regular (non-RT) processes.
- Range: -20 (highest priority) to 19 (lowest)
- Default: 0
Run a command with lower priority:
nice -n 10 long_batch_job
Increase priority of an already running process (requires sudo for negative nice):
sudo renice -n -5 -p 12345
Use cases:
- Decrease nice for latency-critical app servers (nice -5 or so)
- Increase nice for background work (backups, indexing) to avoid starving interactive services
Avoid extreme negative nice on many processes; you can starve system daemons.
Real-Time Scheduling Classes
Real-time (RT) policies force the scheduler to favor certain tasks above all normal ones.
Main policies:
- SCHED_FIFO: strict first-in, first-out; the highest-priority RT task runs first
- SCHED_RR: round-robin among RT tasks
- SCHED_OTHER (CFS): normal scheduling (default)
- SCHED_BATCH, SCHED_IDLE: deprioritized policies for background tasks
Set real-time scheduling:
sudo chrt -f -p 90 12345 # SCHED_FIFO with priority 90
sudo chrt -r -p 80 12345 # SCHED_RR with priority 80
Or start a command with RT scheduling:
sudo chrt -f 80 ./audio_engine
Warnings:
- RT tasks can lock up the system if they never yield or block.
- Reserve RT for carefully written, latency-critical workloads (audio, trading systems, control systems).
- Keep some cores or lower RT priority free for kernel housekeeping and interrupts.
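Note that the kernel also applies RT throttling by default, reserving a small slice of each period for non-RT tasks; the current settings are visible via procfs:
# Period and RT budget in microseconds (typical defaults: 1000000 and 950000)
cat /proc/sys/kernel/sched_rt_period_us
cat /proc/sys/kernel/sched_rt_runtime_us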
CFS Tuning via cgroups (CPU Shares and Quotas)
Using cgroups (v1/v2), you can partition CPU resources among groups of processes.
Typical knobs:
- CPU shares (weight): relative weight when the CPU is contended; higher shares mean more CPU time relative to other groups.
- CPU quota / period: cap on the maximum CPU a group can consume; for example, a 200% quota allows up to two cores' worth of CPU.
Example with systemd (per-service CPU weight):
Edit a unit override:
sudo systemctl edit myservice.service
Add:
[Service]
CPUWeight=1000 # default 100; 1000 gives a higher share
Or, to cap CPU:
[Service]
CPUQuota=50%
This is useful for:
- Making sure background or noisy services don’t starve critical workloads
- Protecting the box from runaway tasks
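The same controls also work for one-off commands. A minimal sketch using systemd-run, assuming a cgroup v2 system (reindex_job is a placeholder for your own workload):
# Run a batch job in a transient scope capped at half a core, with a low relative weight
sudo systemd-run --scope -p CPUQuota=50% -p CPUWeight=50 ./reindex_job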
Core and Thread Affinity
Controlling where processes run can reduce cache misses, migration overhead, and NUMA penalties.
Basic CPU Affinity
taskset binds a process or PID to specific cores:
# Run a program pinned to core 0
taskset -c 0 ./myapp
# Pin existing PID 12345 to cores 2–3
sudo taskset -cp 2-3 12345
Use cases:
- Keep noisy batch jobs off cores used by latency-sensitive services
- Pin each worker process to a fixed core to reduce migrations and cache thrash
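You can verify the current affinity mask of a process at any time:
# Show the CPU list PID 12345 is allowed to run on
taskset -cp 12345
# The same information is exposed in /proc
grep Cpus_allowed_list /proc/12345/status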
Hyper-Threading Awareness
On CPUs with SMT/Hyper-Threading, each physical core presents 2+ logical CPUs.
- Logical siblings share execution units and some resources.
- Running two heavy tasks on sibling threads can reduce per-task throughput.
Identify sibling threads:
lscpu | grep "Thread(s) per core"
lscpu -e # lists CPUs with their core and socket IDs
Tuning approaches:
- For throughput-oriented workloads: fully use all threads.
- For latency-sensitive workloads:
  - Prefer using one sibling per physical core first.
  - Use taskset or service-level affinity to place critical tasks on separate physical cores.
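To see which logical CPUs are siblings of a given physical core (useful when building affinity masks), the topology is exposed in sysfs:
# Logical CPUs sharing the same physical core as CPU 0 (e.g. "0,8" or "0-1")
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list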
NUMA-Aware Placement
On multi-socket or NUMA systems, memory is closer to one CPU node than another.
- Accessing remote NUMA node memory increases latency.
- Migrating processes across NUMA nodes can hurt cache and memory locality.
Tools:
- numactl to control CPU and memory placement:
# Run on NUMA node 0 with memory allocated from node 0
numactl --cpunodebind=0 --membind=0 ./db_server
- Prefer keeping long-lived, memory-heavy processes on a single NUMA node.
- Avoid forcing one process to span multiple NUMA nodes unless it is designed to scale that way.
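To confirm that placement is working, numastat (shipped with numactl on most distributions) reports per-node allocation statistics:
# System-wide NUMA hit/miss counters per node
numastat
# Per-node memory usage of a specific process
numastat -p 12345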
CPU Frequency and Power Management
CPU frequency scaling and power states affect both performance and latency.
CPU Frequency Governors
Linux exposes various governors that decide CPU frequency:
Common governors:
- performance: run at max frequency
- powersave: prefer low frequency to save power
- ondemand: scale based on load (older systems)
- schedutil: integrated with the scheduler (common on newer systems)
View current governor:
cpupower frequency-info
Set to performance (example for all cores):
sudo cpupower frequency-set -g performance
Or via /sys (per CPU):
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Tuning guidance:
- For low-latency server workloads: prefer the performance governor.
- For laptops and power-sensitive environments: schedutil or powersave.
Turbo Boost and Thermal Limits
Modern CPUs can boost above base frequency (Intel Turbo Boost, AMD Precision Boost).
- Turbo increases single-thread performance but raises power/heat.
- Sustained heavy load can trigger thermal throttling (effective frequency drops).
Check if turbo is enabled (Intel example):
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# 0 = turbo enabled, 1 = disabled
Enable/disable (Intel example):
# Disable turbo
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Enable turbo
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
Use cases:
- Latency-sensitive burst workloads: keep turbo on.
- Predictable performance and thermal stability: consider disabling turbo in dense environments.
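On many Intel systems you can also check whether thermal throttling has actually occurred; exact paths vary by platform, but a common location is:
# Number of times CPU 0 has been thermally throttled (if the counter is exposed)
cat /sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_count
# Throttling events are usually logged by the kernel as well
dmesg | grep -i thermal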
C-States and Latency
C-states are CPU sleep states (C0 active, C1/C2… deeper sleep).
- Deeper C-states save more power but increase wake-up latency.
- For ultra-low-latency systems, you may want to limit deep C-states.
This often requires:
- Kernel parameters (e.g., processor.max_cstate, intel_idle.max_cstate)
- BIOS/firmware settings (e.g., disabling deep C-states, package C-states)
This is advanced: disabling power saving increases heat/power and may not be appropriate on general-purpose servers.
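Before restricting anything, inspect which idle states the CPU exposes and their wake-up latencies; a sketch using sysfs and cpupower:
# Names and exit latencies (in microseconds) of the idle states for CPU 0
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
# cpupower summarizes the same information
cpupower idle-info
# Example boot-time limits (set via the bootloader, reboot required):
#   intel_idle.max_cstate=1 processor.max_cstate=1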
Kernel Parameters for CPU Scheduling
The Linux scheduler has tunables that affect latency vs throughput trade-offs.
Preemption Model and CONFIG Options
Chosen at kernel build time (not runtime tunable):
- CONFIG_PREEMPT_NONE: lower overhead, higher throughput, higher latency
- CONFIG_PREEMPT_VOLUNTARY: middle ground
- CONFIG_PREEMPT: lower latency, slightly higher overhead
- CONFIG_PREEMPT_RT: real-time kernels; minimal latency
On distributions that ship RT or low-latency kernels (often for audio or trading), selecting them can dramatically reduce scheduling latency at some cost to raw throughput.
Scheduler Runtime Tunables
Runtime knobs live mostly under /proc/sys/kernel/ (on newer kernels, several have moved to /sys/kernel/debug/sched/). Examples (names vary by kernel version and distribution):
- sched_min_granularity_ns: smallest timeslice for tasks
- sched_latency_ns: target latency for scheduling decisions
- sched_migration_cost_ns: cost of migrating tasks between CPUs
You can experiment:
cat /proc/sys/kernel/sched_min_granularity_ns
echo 3000000 | sudo tee /proc/sys/kernel/sched_min_granularity_ns
But:
- These are global changes.
- Wrong values can degrade interactive performance or throughput.
- Prefer using upstream-recommended values or distro defaults unless you have clear benchmark data and understand the trade-offs.
In practice, for most admins:
- Choose an appropriate kernel flavor (generic vs low-latency vs RT).
- Use per-application tuning (nice, cgroups, affinity) before global scheduler tweaks.
Application-Level CPU Tuning
Many performance gains come from how applications use CPU:
Reducing Context Switching and Overheads
Avoid spawning excessive threads or processes:
- Too many active threads: the overhead of context switching starts to outweigh useful work.
- Aim for a thread pool roughly aligned with core count (taking I/O wait into account).
Tune application-level settings:
- Thread pool sizes
- Max worker processes
- Concurrency limits
Benchmark with different settings rather than assuming more threads = more performance.
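As a rough starting point, derive pool sizes from the hardware rather than hard-coding them; the --workers flag below is hypothetical, standing in for whatever knob your application exposes:
# One worker per logical CPU as a baseline, then benchmark up or down from there
WORKERS=$(nproc)
./myserver --workers "$WORKERS"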
Optimizing Workload Characteristics
Where you have control over the software:
- Use more efficient algorithms and data structures.
- Reduce unnecessary polling loops; use event-driven I/O.
- Batch small tasks to amortize overheads (e.g., bulk DB operations).
- Avoid excessive logging or debug output in hot paths.
These changes can drastically cut CPU time and are often more impactful than OS-level tweaks.
Benchmarking and Validation
Any CPU tuning must be validated to avoid “cargo cult” optimizations.
Establish a Baseline
Before changes:
- Capture CPU usage, latency, and throughput under a representative load.
- Use repeatable workloads (synthetic benchmarks or realistic load tests).
Record:
- Per-core utilization
- Load averages
- App response times / QPS / throughput
- Context switches and migrations (e.g., via vmstat, pidstat, perf)
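One simple way to capture such a baseline is to log the standard tools to files while the benchmark runs, so later runs can be compared directly (sysstat tools assumed installed):
# 60 one-second samples of per-CPU usage, run queue, and context switches
mpstat -P ALL 1 60 > baseline-mpstat.txt &
vmstat 1 60 > baseline-vmstat.txt &
pidstat -u -w 1 60 > baseline-pidstat.txt &
wait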
Apply One Change at a Time
To understand impact:
- Change a single variable (e.g., the governor from schedutil to performance).
- Re-run the same benchmark.
- Compare metrics quantitatively.
Avoid:
- Bundling many changes together.
- Relying on subjective impressions (“feels faster”).
Watch for Regressions
After tuning, monitor:
- Latency spikes or jitter
- Starvation of background tasks or system services
- Increased thermal throttling or power draw
- Noisy neighbors impact on multi-tenant hosts
Be ready to revert:
- Keep track of every knob you change.
- Store configuration in version-controlled files or documented playbooks.
Putting It Together: Common CPU Tuning Scenarios
Scenario 1: Latency-Sensitive Web API on Bare Metal
Typical steps:
- Set the CPU governor to performance.
- Reserve some cores for the web workers and pin them via systemd:
[Service]
CPUAffinity=0-7
- Increase the service's CPU share:
[Service]
CPUWeight=1000
- Lower nice for the web service (Nice=-5 in the unit).
- Raise nice (or set CPUQuota) for backup/indexing jobs.
- Optionally use a low-latency or RT kernel if extreme latency requirements exist, with careful testing.
Scenario 2: Multi-Tenant VM Host
Goals: fairness, avoid noisy neighbors.
- Use cgroups to set reasonable CPUQuota and CPUWeight per tenant.
- Avoid real-time scheduling for guest processes.
- Monitor steal time (as reported inside the guests) to detect oversubscription.
- Optionally disable turbo for predictability and thermal headroom.
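One way to express the per-tenant weights and quotas is a dedicated systemd slice per tenant (the unit name below is illustrative):
# /etc/systemd/system/tenant-a.slice
[Slice]
CPUWeight=100
CPUQuota=200%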
Scenario 3: High-Throughput Batch Processing
Goals: maximize throughput, tolerate higher latency.
- Use all logical CPUs (including hyper-threads).
- Use nice to deprioritize jobs relative to interactive services.
- Tune app thread pools to match or slightly exceed hardware concurrency.
- Consider schedutil or ondemand with appropriate scaling parameters, or performance if power is not a concern.
Safety and Operational Considerations
- Test tuning changes in a staging environment before production.
- Document every kernel parameter and sysfs change.
- Be conservative with RT scheduling and global scheduler knobs.
- Combine CPU tuning with instrumentation: always measure effects.
- Remember that CPU is often just one piece; bottlenecks can move to memory, I/O, or network as you tune.