Understanding CPU Bottlenecks
Before tuning, you must confirm the CPU is actually the bottleneck and understand how it is being used.
Key CPU Utilization Metrics
Most tools ultimately report the same core metrics (from /proc/stat):
- User time: Time running user-space processes (non-kernel)
- System time: Time running kernel code
- Idle: CPU doing nothing (no runnable tasks)
- I/O wait: CPU idle while waiting for disk/network I/O
- Steal: Time taken by the hypervisor (common on VMs)
- Nice: Time running processes with adjusted priority
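A quick way to see these fields broken out per CPU is mpstat (from the sysstat package, which may need installing), or the raw counters themselves:
# Per-CPU %usr, %sys, %iowait, %steal, and %idle, refreshed every second
mpstat -P ALL 1
# The aggregate counters these tools read come from the first line of /proc/stat
head -n 1 /proc/stat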
High CPU usage alone is not always bad; what matters is why and whether it’s affecting latency/throughput.
Typical patterns:
- High user + high load average + slow app: CPU-bound workload
- High system: heavy kernel activity, syscalls, context switches
- High iowait: actually an I/O bottleneck, not pure CPU
- High steal: congested hypervisor, not much to tune inside the guest
- High idle but app slow: locking, serialization, or waiting on other resources
Identifying CPU Bottlenecks
Use monitoring tools (covered elsewhere) to answer:
- Are all cores busy, or just one/few?
- Is contention on kernel/system time or user-space?
- Is performance limited by single-threaded sections?
Look specifically for:
- One process pinning a single core at 100% (single-threaded hot path)
- Many processes at moderate CPU, but load average >> number of cores (run queue congestion)
- Frequent context switches and migrations across cores
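For example, vmstat and pidstat (also from sysstat) expose the run queue, context switches, and per-process CPU directly:
# "r" = runnable tasks (run queue length), "cs" = context switches per second
vmstat 1
# Per-process CPU usage plus voluntary/involuntary context switches
pidstat -u -w 1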
CPU Scheduler and Priorities
Tuning starts with controlling which tasks get CPU time and when.
Nice Levels (Static Priority for Normal Tasks)
nice controls relative CPU share among regular (non-RT) processes.
- Range: -20 (highest priority) to 19 (lowest)
- Default: 0
Run a command with lower priority:
nice -n 10 long_batch_job
Increase priority of an already running process (requires sudo for negative nice):
sudo renice -n -5 -p 12345
Use cases:
- Decrease nice for latency-critical app servers (nice -5 or so)
- Increase nice for background work (backups, indexing) to avoid starving interactive services
Avoid extreme negative nice on many processes; you can starve system daemons.
Real-Time Scheduling Classes
Real-time (RT) policies force the scheduler to favor certain tasks above all normal ones.
Main policies:
- SCHED_FIFO: strict first-in, first-out; the highest-priority RT task runs first
- SCHED_RR: round-robin among RT tasks
- SCHED_OTHER (CFS): normal scheduling (default)
- SCHED_BATCH, SCHED_IDLE: deprioritized policies for background tasks
Set real-time scheduling:
sudo chrt -f -p 90 12345 # SCHED_FIFO with priority 90
sudo chrt -r -p 80 12345 # SCHED_RR with priority 80
Or start a command with RT scheduling:
sudo chrt -f 80 ./audio_engine
Warnings:
- RT tasks can lock up the system if they never yield or block.
- Reserve RT for carefully written, latency-critical workloads (audio, trading systems, control systems).
- Keep some cores or lower RT priority free for kernel housekeeping and interrupts.
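Note that the kernel also applies RT throttling by default, reserving a small slice of each period for non-RT tasks; the current settings are visible via procfs:
# Period and RT budget in microseconds (typical defaults: 1000000 and 950000)
cat /proc/sys/kernel/sched_rt_period_us
cat /proc/sys/kernel/sched_rt_runtime_us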
CFS Tuning via cgroups (CPU Shares and Quotas)
Using cgroups (v1/v2), you can partition CPU resources among groups of processes.
Typical knobs:
- CPU shares (weight): relative weight when the CPU is contended; higher shares mean more CPU time relative to other groups.
- CPU quota / period: cap on the maximum CPU a group can consume; for example, a 200% quota allows up to two cores' worth of CPU.
Example with systemd (per-service CPU weight):
Edit a unit override:
sudo systemctl edit myservice.service
Add:
[Service]
CPUWeight=1000 # default 100; 1000 gives a higher share
Or, to cap CPU:
[Service]
CPUQuota=50%
This is useful for:
- Making sure background or noisy services don’t starve critical workloads
- Protecting the box from runaway tasks
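The same controls also work for one-off commands. A minimal sketch using systemd-run, assuming a cgroup v2 system (reindex_job is a placeholder for your own workload):
# Run a batch job in a transient scope capped at half a core, with a low relative weight
sudo systemd-run --scope -p CPUQuota=50% -p CPUWeight=50 ./reindex_job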
Core and Thread Affinity
Controlling where processes run can reduce cache misses, migration overhead, and NUMA penalties.
Basic CPU Affinity
taskset binds a process or PID to specific cores:
# Run a program pinned to core 0
taskset -c 0 ./myapp
# Pin existing PID 12345 to cores 2–3
sudo taskset -cp 2-3 12345
Use cases:
- Keep noisy batch jobs off cores used by latency-sensitive services
- Pin each worker process to a fixed core to reduce migrations and cache thrash
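You can verify the current affinity mask of a process at any time:
# Show the CPU list PID 12345 is allowed to run on
taskset -cp 12345
# The same information is exposed in /proc
grep Cpus_allowed_list /proc/12345/status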
Hyper-Threading Awareness
On CPUs with SMT/Hyper-Threading, each physical core presents 2+ logical CPUs.
- Logical siblings share execution units and some resources.
- Running two heavy tasks on sibling threads can reduce per-task throughput.
Identify sibling threads:
lscpu | grep "Thread(s) per core"
lscpu -e # lists CPUs with their core and socket IDs
Tuning approaches:
- For throughput-oriented workloads: fully use all threads.
- For latency-sensitive workloads:
  - Prefer using one sibling per physical core first.
  - Use taskset or service-level affinity to place critical tasks on separate physical cores.
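To see which logical CPUs are siblings of a given physical core (useful when building affinity masks), the topology is exposed in sysfs:
# Logical CPUs sharing the same physical core as CPU 0 (e.g. "0,8" or "0-1")
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list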
NUMA-Aware Placement
On multi-socket or NUMA systems, memory is closer to one CPU node than another.
- Accessing remote NUMA node memory increases latency.
- Migrating processes across NUMA nodes can hurt cache and memory locality.
Tools:
- numactl to control CPU and memory placement:
# Run on NUMA node 0 with memory allocated from node 0
numactl --cpunodebind=0 --membind=0 ./db_server
- Prefer keeping long-lived, memory-heavy processes on a single NUMA node.
- Avoid forcing one process to span multiple NUMA nodes unless it is designed to scale that way.
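To confirm that placement is working, numastat (shipped with numactl on most distributions) reports per-node allocation statistics:
# System-wide NUMA hit/miss counters per node
numastat
# Per-node memory usage of a specific process
numastat -p 12345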
CPU Frequency and Power Management
CPU frequency scaling and power states affect both performance and latency.
CPU Frequency Governors
Linux exposes various governors that decide CPU frequency:
Common governors:
- performance: run at max frequency
- powersave: prefer low frequency to save power
- ondemand: scale based on load (older systems)
- schedutil: integrated with the scheduler (common on newer systems)
View current governor:
cpupower frequency-info
Set to performance (example for all cores):
sudo cpupower frequency-set -g performance
Or via /sys (per CPU):
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Tuning guidance:
- For low-latency server workloads: prefer the performance governor.
- For laptops and power-sensitive environments: schedutil or powersave.
Turbo Boost and Thermal Limits
Modern CPUs can boost above base frequency (Intel Turbo Boost, AMD Precision Boost).
- Turbo increases single-thread performance but raises power/heat.
- Sustained heavy load can trigger thermal throttling (effective frequency drops).
Check if turbo is enabled (Intel example):
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# 0 = turbo enabled, 1 = disabled
Enable/disable (Intel example):
# Disable turbo
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Enable turbo
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
Use cases:
- Latency-sensitive burst workloads: keep turbo on.
- Predictable performance and thermal stability: consider disabling turbo in dense environments.
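On many Intel systems you can also check whether thermal throttling has actually occurred; exact paths vary by platform, but a common location is:
# Number of times CPU 0 has been thermally throttled (if the counter is exposed)
cat /sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_count
# Throttling events are usually logged by the kernel as well
dmesg | grep -i thermal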
C-States and Latency
C-states are CPU sleep states (C0 active, C1/C2… deeper sleep).
- Deeper C-states save more power but increase wake-up latency.
- For ultra-low-latency systems, you may want to limit deep C-states.
This often requires:
- Kernel parameters (e.g., processor.max_cstate, intel_idle.max_cstate)
- BIOS/firmware settings (e.g., disabling deep C-states, package C-states)
This is advanced: disabling power saving increases heat/power and may not be appropriate on general-purpose servers.
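Before restricting anything, inspect which idle states the CPU exposes and their wake-up latencies; a sketch using sysfs and cpupower:
# Names and exit latencies (in microseconds) of the idle states for CPU 0
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
# cpupower summarizes the same information
cpupower idle-info
# Example boot-time limits (set via the bootloader, reboot required):
#   intel_idle.max_cstate=1 processor.max_cstate=1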
Kernel Parameters for CPU Scheduling
The Linux scheduler has tunables that affect latency vs throughput trade-offs.
Preemption Model and CONFIG Options
Chosen at kernel build time (not runtime tunable):
- CONFIG_PREEMPT_NONE: lower overhead, higher throughput, higher latency
- CONFIG_PREEMPT_VOLUNTARY: middle ground
- CONFIG_PREEMPT: lower latency, slightly higher overhead
- CONFIG_PREEMPT_RT: real-time kernels; minimal latency
On distributions that ship RT or low-latency kernels (often for audio or trading), selecting them can dramatically reduce scheduling latency at some cost to raw throughput.
Scheduler Runtime Tunables
Runtime knobs live mostly under /proc/sys/kernel/ (on newer kernels, several have moved to /sys/kernel/debug/sched/). Examples (names vary by kernel version and distribution):
- sched_min_granularity_ns: smallest timeslice for tasks
- sched_latency_ns: target latency for scheduling decisions
- sched_migration_cost_ns: cost of migrating tasks between CPUs
You can experiment:
cat /proc/sys/kernel/sched_min_granularity_ns
echo 3000000 | sudo tee /proc/sys/kernel/sched_min_granularity_ns
But:
- These are global changes.
- Wrong values can degrade interactive performance or throughput.
- Prefer using upstream-recommended values or distro defaults unless you have clear benchmark data and understand the trade-offs.
In practice, for most admins:
- Choose an appropriate kernel flavor (generic vs low-latency vs RT).
- Use per-application tuning (nice, cgroups, affinity) before global scheduler tweaks.
Application-Level CPU Tuning
Many performance gains come from how applications use CPU:
Reducing Context Switching and Overheads
Avoid spawning excessive threads or processes:
- Too many active threads: the overhead of context switching starts to outweigh useful work.
- Aim for a thread pool roughly aligned with core count (taking I/O wait into account).
Tune application-level settings:
- Thread pool sizes
- Max worker processes
- Concurrency limits
Benchmark with different settings rather than assuming more threads = more performance.
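As a rough starting point, derive pool sizes from the hardware rather than hard-coding them; the --workers flag below is hypothetical, standing in for whatever knob your application exposes:
# One worker per logical CPU as a baseline, then benchmark up or down from there
WORKERS=$(nproc)
./myserver --workers "$WORKERS"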
Optimizing Workload Characteristics
Where you have control over the software:
- Use more efficient algorithms and data structures.
- Reduce unnecessary polling loops; use event-driven I/O.
- Batch small tasks to amortize overheads (e.g., bulk DB operations).
- Avoid excessive logging or debug output in hot paths.
These changes can drastically cut CPU time and are often more impactful than OS-level tweaks.
Benchmarking and Validation
Any CPU tuning must be validated to avoid “cargo cult” optimizations.
Establish a Baseline
Before changes:
- Capture CPU usage, latency, and throughput under a representative load.
- Use repeatable workloads (synthetic benchmarks or realistic load tests).
Record:
- Per-core utilization
- Load averages
- App response times / QPS / throughput
- Context switches and migrations (e.g., via vmstat, pidstat, perf)
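One simple way to capture such a baseline is to log the standard tools to files while the benchmark runs, so later runs can be compared directly (sysstat tools assumed installed):
# 60 one-second samples of per-CPU usage, run queue, and context switches
mpstat -P ALL 1 60 > baseline-mpstat.txt &
vmstat 1 60 > baseline-vmstat.txt &
pidstat -u -w 1 60 > baseline-pidstat.txt &
wait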
Apply One Change at a Time
To understand impact:
- Change a single variable (e.g., the governor from schedutil to performance).
- Re-run the same benchmark.
- Compare metrics quantitatively.
Avoid:
- Bundling many changes together.
- Relying on subjective impressions (“feels faster”).
Watch for Regressions
After tuning, monitor:
- Latency spikes or jitter
- Starvation of background tasks or system services
- Increased thermal throttling or power draw
- Noisy neighbors impact on multi-tenant hosts
Be ready to revert:
- Keep track of every knob you change.
- Store configuration in version-controlled files or documented playbooks.
Putting It Together: Common CPU Tuning Scenarios
Scenario 1: Latency-Sensitive Web API on Bare Metal
Typical steps:
- Set the CPU governor to performance.
- Reserve some cores for the web workers and pin them via systemd:
[Service]
CPUAffinity=0-7
- Increase the service's CPU share:
[Service]
CPUWeight=1000
- Lower nice for the web service (Nice=-5 in the unit).
- Raise nice (or set CPUQuota) for backup/indexing jobs.
- Optionally use a low-latency or RT kernel if extreme latency requirements exist, with careful testing.
Scenario 2: Multi-Tenant VM Host
Goals: fairness, avoid noisy neighbors.
- Use cgroups to set reasonable CPUQuota and CPUWeight per tenant.
- Avoid real-time scheduling for guest processes.
- Monitor steal time (as reported inside the guests) to detect oversubscription.
- Optionally disable turbo for predictability and thermal headroom.
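One way to express the per-tenant weights and quotas is a dedicated systemd slice per tenant (the unit name below is illustrative):
# /etc/systemd/system/tenant-a.slice
[Slice]
CPUWeight=100
CPUQuota=200%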
Scenario 3: High-Throughput Batch Processing
Goals: maximize throughput, tolerate higher latency.
- Use all logical CPUs (including hyper-threads).
- Use nice to deprioritize jobs relative to interactive services.
- Tune app thread pools to match or slightly exceed hardware concurrency.
- Consider schedutil or ondemand with appropriate scaling parameters, or performance if power is not a concern.
Safety and Operational Considerations
- Test tuning changes in a staging environment before production.
- Document every kernel parameter and sysfs change.
- Be conservative with RT scheduling and global scheduler knobs.
- Combine CPU tuning with instrumentation: always measure effects.
- Remember that CPU is often just one piece; bottlenecks can move to memory, I/O, or network as you tune.