
7.3.1 CPU tuning

Understanding CPU Tuning Goals

CPU tuning is about matching how the kernel, scheduler, and applications use the processor to what your system actually does. You try to reduce wasted CPU cycles, shorten response time for important tasks, and keep throughput high without causing instability.

Before changing anything, you need a clear performance goal. Typical goals include lowering latency for interactive or real time workloads, increasing throughput for batch processing or serving more web requests per second, or reducing CPU power usage and heat while keeping performance acceptable. All tuning steps should be guided by measurement. You observe current behavior, change one parameter at a time, then measure again.

Always change one tuning parameter at a time and measure the effect. Never tune blindly in production.

Measuring CPU Utilization and Bottlenecks

CPU tuning starts with knowing whether the CPU is actually the limiting factor. If the system is I/O bound or memory bound, CPU tuning will not solve the real problem.

You typically begin with tools like top or htop to get a quick overview. These show total CPU usage percentage, split across user, system, and idle time, and they list processes sorted by CPU consumption. For deeper insight you can use tools such as mpstat for per core statistics, sar for longer term averages, or vmstat to see how much time the CPU spends waiting for I/O. When you see high iowait, the CPU is idle while storage or the network is slow, which is not primarily a CPU scheduling problem.

Another useful metric is run queue length. This is the number of runnable tasks waiting for CPU time. If each core has many tasks waiting most of the time and overall CPU usage is very high, then CPU is likely the bottleneck. If you have high load but low CPU utilization, then some other resource is slow and tasks are blocked.
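The run queue and the user/system/iowait split can be read directly from the proc filesystem, without installing any extra tools:

```shell
# Load averages over 1, 5, and 15 minutes; the fourth field is
# currently runnable tasks / total tasks (e.g. "2/341").
cat /proc/loadavg

# Aggregate CPU time counters in jiffies: user, nice, system, idle,
# iowait, irq, softirq, steal, ... High idle time combined with rising
# iowait points to an I/O bottleneck rather than a CPU one.
head -n 1 /proc/stat
```

For sampled, per core views of the same counters over time, mpstat -P ALL 1 (from the sysstat package) and vmstat 1 are the usual next steps; the r column in vmstat is the run queue length.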

You should also look at CPU usage patterns of individual processes. Some workloads spike for short periods, others keep one core at 100 percent constantly. The pattern tells you whether pinning processes to particular cores, isolating cores, or changing scheduling policy might help. Long running, CPU intensive batch jobs benefit more from affinity and core isolation, while short, bursty interactive jobs benefit from low latency scheduling.

Process Priority and Scheduling Policies

Linux assigns each process a scheduling policy and a priority inside that policy. You can influence how the scheduler treats a process by adjusting these values. This does not change how fast a single core is, but it changes which tasks get to run when there is contention.

The normal scheduling class for most user processes is the Completely Fair Scheduler, often referred to as CFS. These processes share CPU time more or less fairly and use a nice value to express relative importance. Lower nice values mean higher priority within CFS, and higher nice values mean lower priority. Nice values are integers in the range from -20 to +19. A practical rule is that a process with nice -10 has significantly more weight than a process at nice 0, and one at nice 10 runs less often.

You can set the nice value when starting a process with the nice command, or change the priority of an existing process with renice. Raising a process's nice value reduces its CPU share, which can keep background tasks from disturbing latency sensitive applications. Conversely, slightly negative nice values can give CPU bound services a larger share when they compete with many other processes.
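As a minimal sketch, the following starts a throwaway background job at nice 10 and then lowers its priority further with renice. Note that unprivileged users can only raise the nice value; lowering it (a negative nice) requires root:

```shell
# Start a placeholder background job at nice 10.
nice -n 10 sleep 60 &
bg_pid=$!

# Unprivileged users may only raise the nice value (lower the priority).
renice -n 15 -p "$bg_pid"

# Confirm the nice value the scheduler now uses (NI column).
ps -o pid,ni,comm -p "$bg_pid"

kill "$bg_pid"
```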

Besides CFS, Linux has real time scheduling policies such as SCHED_FIFO and SCHED_RR. These policies are strictly priority based. Real time tasks can preempt normal tasks and, if misused, can starve them of CPU. Because of this, you should only use real time scheduling for applications that are designed to handle it, such as audio processing or certain control systems, and only after careful testing.
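The chrt tool from util-linux inspects and sets these policies. Viewing is unprivileged; actually granting SCHED_FIFO or SCHED_RR to a process normally requires root or CAP_SYS_NICE:

```shell
# Show the valid priority range for each scheduling policy on this kernel.
chrt -m

# Show the policy and priority of the current shell (normally SCHED_OTHER).
chrt -p $$

# Granting a real time policy needs privileges; for example
# (./audio_engine is a placeholder name):
#   sudo chrt --fifo 50 ./audio_engine
```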

Misusing real time priorities can freeze a system. Never give arbitrary processes real time priority without a recovery plan.

CPU Affinity and Core Isolation

CPU affinity determines which cores a process is allowed to run on. By default the scheduler can move tasks across all cores to balance load. However, for some workloads you can gain performance or more predictable latency by restricting cores.

You can start a program with a given affinity using taskset, or change the affinity of an existing process. For example, you can confine a batch job to a subset of cores, leaving others free for interactive tasks. This can reduce cache pollution, improve predictability, and avoid competition between tasks that are sensitive to each other.
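A short sketch with taskset, using the current shell as the target process and a placeholder batch job name:

```shell
# Start a (placeholder) batch job confined to cores 0 and 1:
#   taskset -c 0,1 ./batch_job

# Inspect the affinity of an existing process, here the current shell.
taskset -cp $$

# Restrict the current shell to core 0, then verify.
taskset -cp 0 $$
taskset -cp $$
```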

Core isolation is a stronger form of affinity, applied at the kernel level through boot parameters. Isolated cores are largely kept free of regular kernel work such as interrupts and background tasks. Then you run critical workloads on those isolated cores to get very stable timing. This is a common pattern for low latency trading systems or real time audio processing, where jitter in scheduling is more important than raw throughput.
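On most distributions this is configured with kernel command line parameters such as isolcpus (often together with nohz_full and rcu_nocbs) in the boot loader configuration. A sketch, with cores 2 and 3 as placeholder choices:

```shell
# In /etc/default/grub (then run update-grub or grub2-mkconfig and reboot):
#   GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"

# After rebooting, the kernel reports which cores it isolated
# (the file is empty when no cores are isolated):
cat /sys/devices/system/cpu/isolated
```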

You must be careful that you do not isolate too many cores. The remaining non isolated cores must still handle all regular tasks and kernel work. If they are overloaded, the entire system becomes sluggish, even if isolated cores are idle.

Balancing Power Saving and Performance

Modern CPUs support dynamic frequency scaling and power saving features. The kernel chooses a frequency governor that decides how aggressively the CPU clock speed changes in response to load. The main governors are usually performance, which keeps the CPU at a high frequency, powersave, which prefers lower frequency, and adaptive governors such as ondemand or schedutil that scale the frequency up and down according to demand.

If you care about maximum throughput or minimum latency, you often select a more aggressive governor for relevant cores. This can reduce the time the CPU spends ramping up frequency, which shortens response time. For systems that prioritize battery life or heat reduction, you choose a power saving governor and accept some performance reduction.
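Governors are exposed per core under sysfs. The snippet below is guarded because virtual machines and containers often do not expose cpufreq at all:

```shell
# cpufreq lives under sysfs, one directory per core.
base=/sys/devices/system/cpu/cpu0/cpufreq
if [ -r "$base/scaling_governor" ]; then
    # Governors this kernel offers, and the one currently active.
    cat "$base/scaling_available_governors"
    cat "$base/scaling_governor"
    # Switching governors requires root, for example:
    #   echo performance | sudo tee "$base/scaling_governor"
else
    echo "cpufreq not exposed on this system"
fi
```

On many distributions the cpupower tool (from linux-tools) wraps the same interface, e.g. cpupower frequency-info.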

Hyper threading, or simultaneous multithreading (SMT), creates logical cores that share parts of a physical core. In some workloads, using both logical siblings improves throughput. In others, such as certain numeric or low latency tasks, it can hurt performance because the siblings compete for shared resources. Tuning sometimes involves disabling hyper threading in firmware or avoiding scheduling critical threads on sibling logical cores.
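Which logical CPUs are siblings of the same physical core can be read from sysfs, which is useful when you want to keep a critical thread away from its sibling:

```shell
# Logical CPUs sharing a physical core with CPU 0, e.g. "0,4" or "0-1";
# on systems without SMT this lists only "0".
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# lscpu summarizes the same topology information.
lscpu | grep -i 'thread(s) per core'
```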

You can also adjust CPU idle states. Deeper idle states save more power but take longer to wake up. High frequency interruption or low latency workloads may benefit from using shallower idle states, at the cost of higher power draw. Choosing appropriate idle state behavior depends on how steady or bursty your load is.
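The available idle states and their wakeup latencies are visible under sysfs when the kernel exposes them; virtualized systems frequently do not, so the loop below simply prints nothing in that case:

```shell
# List idle (C-)states of CPU 0 with their wakeup latencies.
for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
    [ -d "$d" ] || continue
    printf '%s: wakeup latency %s us\n' "$(cat "$d/name")" "$(cat "$d/latency")"
done
```

cpupower idle-info (from linux-tools) reports the same data, and limiting idle state depth is typically done via kernel parameters or per-device PM QoS rather than ad hoc at runtime.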

Concurrency, Parallelism, and Application Behavior

CPU tuning is not limited to kernel settings. The way applications use threads and processes has a large impact on CPU performance. If a program creates more runnable threads than you have cores, then contention increases and context switching overhead grows. In extreme cases, having many more threads than cores can reduce throughput and increase latency due to constant switching and cache eviction.

For CPU bound workloads, a reasonable starting point is to match the number of worker threads to the number of physical or logical cores that you intend to dedicate to that workload. Some workloads benefit from a few extra threads to hide latency, others perform best with a one to one mapping between cores and workers. You determine the correct setting experimentally.
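A common way to pick that starting point is to ask how many CPUs the process can actually use:

```shell
# nproc honors the process's CPU affinity mask, so it reports the cores
# actually available to this process, not just the machine total.
nproc

# Total online logical CPUs regardless of affinity.
getconf _NPROCESSORS_ONLN
```

Because nproc respects affinity, a worker pool sized from it automatically shrinks when you confine the process to a subset of cores with taskset.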

Contention for shared locks can also waste CPU. If many threads wait on a single lock, they may spin or wake and sleep frequently. This shows up as high CPU usage without corresponding progress. In such cases, tuning sometimes involves reconfiguring the application to reduce contention, or splitting workloads into more independent units.

From a tuning perspective, it can be useful to separate different classes of work into different processes so that you can adjust priority and affinity separately. For example, you might run background maintenance tasks with reduced priority and possibly on a subset of cores, while keeping frontend request handlers on the most responsive cores.
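Combining nice and taskset achieves exactly this separation; here a placeholder maintenance job is pinned to core 0 at low priority while the remaining cores stay free:

```shell
# Placeholder maintenance job: pinned to core 0 and started at nice 15.
taskset -c 0 nice -n 15 sleep 60 &
job_pid=$!

# Verify both settings on the running process.
taskset -cp "$job_pid"
ps -o pid,ni,comm -p "$job_pid"

kill "$job_pid"
```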

Scheduling for Latency vs Throughput

CPU scheduling always balances two competing goals. Throughput is the total amount of work done per unit time. Latency is how long an individual task waits to get CPU time. Tuning for one can hurt the other, so you must decide which matters more for a given system.

Batch processing systems, such as those running scientific computations or long data processing jobs, usually care about throughput. On these systems you might accept higher latency for individual tasks if it allows the scheduler to keep cores busy and maintain good cache locality. Adjustments like longer time slices, core pinning, and disabling certain interrupt distributions can help.

Interactive systems and low latency servers care more about how quickly they can respond to a request. Here you might give interactive processes a small nice boost, prefer aggressive frequency scaling, and avoid letting heavy background jobs run without limits. You may also confine background tasks to specific cores or schedule them outside peak times.

You should not assume that the default scheduler settings are bad. For many mixed workloads the default configuration is well balanced. CPU tuning beyond nice and affinity only makes sense when you have a clear, measured latency or throughput problem that cannot be solved more easily at the application or architectural level.

Methodical Tuning Workflow

Effective CPU tuning is an iterative process. First you identify a concrete symptom such as high request latency, poor batch throughput, or unexpectedly high CPU usage. Then you collect baseline metrics for CPU utilization, run queue length, per process CPU share, and possibly hardware performance counters if you have the tools for them.

Next, you form a hypothesis. For example, you might suspect that a background job competes with an interactive application, that too many threads are contending for the same cores, or that the CPU governor is too conservative. Based on this, you adjust a small number of parameters. Common first steps are setting appropriate nice values, changing CPU affinity for certain services, or selecting a different frequency governor.

After making a change, you repeat the same measurements under comparable load. Only if the results show clear improvement do you keep the new setting. If not, you revert and try a different hypothesis. All changes should be documented, ideally in configuration management, so you can reproduce a working configuration or roll back if needed.

Never apply aggressive CPU tuning directly on critical production systems. Test on a similar non production environment and always keep a path to revert changes.

By following a structured approach and focusing on measurable effects, CPU tuning becomes a controlled tool to squeeze more performance or predictability from your hardware, instead of a risky collection of random tweaks.
