3.5.1 CPU and memory monitoring

Understanding CPU and Memory Monitoring

CPU and memory monitoring is about observing how your system uses its processing power and RAM over time, so you can detect problems early and understand performance bottlenecks. In this chapter you will focus on what you measure, how to read it, and which common tools help you watch CPU and memory usage on a Linux system.

Key CPU Metrics

When you monitor the CPU, you want to know how busy it is, what is keeping it busy, and whether tasks are waiting. Most tools present CPU usage as percentages that always add up to 100 percent for each CPU or core.

User time is the percentage of CPU running user space code. This includes regular applications and most programs you start yourself. System time is the percentage of CPU running kernel code. This covers system calls, drivers, and other kernel work. High system time can indicate heavy disk or network IO, or drivers and kernel features doing lots of work.

Idle time is the time the CPU is doing no work and is available for more tasks. If idle time is high, the CPU is not the bottleneck. If idle time is very low for a long period, the CPU may be saturated. IO wait is the percentage of time the CPU is idle while at least one task is waiting for IO to complete, usually disk or network-backed storage. A high IO wait value means processes are stuck waiting for data to be read or written, not that the CPU is slow.

You may also see other categories such as nice time, which is CPU time used by user space processes running at a lowered priority (a positive nice value), and steal time, seen in virtual machines when the hypervisor gave CPU time to something else while the guest was ready to run. For performance work, user, system, idle, and IO wait are usually the most important.

Important rule: A constantly high user + system percentage together with very low idle time indicates CPU saturation. A high IO wait percentage with some idle time remaining usually means IO is the main bottleneck, not CPU speed.
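The percentages described above can be derived by hand from the cumulative counters in /proc/stat. The following is a minimal sketch, assuming a Linux kernel whose aggregate cpu line has at least eight fields in the order documented in proc(5). The function names are illustrative and not part of any tool.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat, as documented in proc(5).
# Guest time is already included in "user", so it is deliberately not listed here.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the cumulative tick counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            if len(values) < len(FIELDS):
                raise ValueError("fewer CPU fields than this sketch assumes")
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Convert two cumulative samples into percentages for the interval between them."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_cpu(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return cpu_percentages(before, after)
```

Calling sample_cpu() on a Linux host returns a dictionary whose values add up to 100. The counters are totals since boot, so only the difference between two samples describes current activity.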

Load Average

Load average is a classic Linux metric that shows the average number of active and waiting tasks over time. On most systems you will see three values, for example 0.35 0.72 1.10, which correspond to the average over 1, 5, and 15 minutes.

A task is counted in the load average if it is running on a CPU, waiting to run on a CPU, or waiting on uninterruptible IO, typically disk IO. Load average is not the same as CPU usage. It is a count of how many tasks are demanding CPU or blocked on IO.

To interpret load average, you must know the number of logical CPUs. If a system has $N$ logical CPUs, then a useful rule of thumb is:

$$
\text{Optimal load range} \approx 0.7 \times N \text{ to } 1.0 \times N
$$

If the load average is much lower than $N$, the CPUs are underutilized. If the 15 minute value is consistently higher than $N$, the system is likely overloaded or blocked on IO. For example, a CPU with 4 cores and 8 hardware threads has $N = 8$. A 1 minute load average of 12 with a 15 minute value of 11 means that, on average, more tasks are runnable or blocked than there are logical CPUs, which suggests too many tasks are competing for CPU or disk.

Key guideline: Compare load average to CPU count. Short spikes are acceptable, but a load average that stays much higher than the number of logical CPUs for a long time means the system is overloaded or blocked.
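The comparison against the CPU count can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above. The 0.7 and 1.0 boundaries come from that rule, while the function names and labels are hypothetical.

```python
import os


def classify_load(load, ncpu):
    """Classify one load average value against the number of logical CPUs."""
    if ncpu < 1:
        raise ValueError("ncpu must be at least 1")
    ratio = load / ncpu
    if ratio < 0.7:
        return "underutilized"
    if ratio <= 1.0:
        return "well utilized"
    return "overloaded or blocked on IO"


def current_load_report():
    """Classify the 1, 5, and 15 minute load averages of this machine (Unix only)."""
    ncpu = os.cpu_count() or 1
    one, five, fifteen = os.getloadavg()
    return {
        "1min": classify_load(one, ncpu),
        "5min": classify_load(five, ncpu),
        "15min": classify_load(fifteen, ncpu),
    }
```

With the example from the text, classify_load(12, 8) returns "overloaded or blocked on IO", while classify_load(0.35, 8) returns "underutilized". A real check would look at the 15 minute value before reacting, because the 1 minute value moves quickly.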

Memory Metrics and How to Read Them

Linux uses RAM aggressively for caching data to keep the system fast. This means a simple reading of "used" memory is usually misleading. You need to distinguish between memory used by applications and memory used as a cache that can be reclaimed.

Physical memory is the total amount of RAM in the system. Used memory is what is currently allocated to processes and kernel data structures. Some tools also count page cache as used, while recent versions of free report it separately under buff/cache. Free memory is RAM not currently used at all. Buffers and cache represent memory used for caching file system data, metadata, and other frequently accessed data.

In modern tools, you will often see "available" memory. This is an estimate of how much memory could be given to new applications without causing swapping. It considers free memory plus cache that can be dropped and reused.

The most important value is usually available memory. As long as available memory stays comfortably above zero, the system can still serve new allocations without immediately swapping. If available memory gets very low, the kernel will start to swap and may eventually trigger its out of memory killer.

Important rule: On Linux, "free" memory almost always looks low, and in some tools "used" memory looks high, because of caching. Focus on "available" memory and swap usage to decide if you are actually under memory pressure.
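The difference between free and available memory is easy to see in /proc/meminfo, the file that tools such as free read. The sketch below assumes a kernel that exposes MemAvailable (3.14 or later). The function names are illustrative.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value.

    Most values are in kiB; a few entries (for example HugePages_Total)
    are plain counts and have no unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Express free, available, and cache memory as a percentage of total RAM."""
    total = info["MemTotal"]
    return {
        "free_pct": 100.0 * info["MemFree"] / total,
        "available_pct": 100.0 * info["MemAvailable"] / total,
        "cache_pct": 100.0 * (info["Buffers"] + info["Cached"]) / total,
    }


def read_memory_summary(path="/proc/meminfo"):
    """Read and summarize the live values (Linux only)."""
    with open(path) as f:
        return memory_summary(parse_meminfo(f.read()))
```

On a machine that has been running for a while, free_pct is typically small while available_pct stays much larger, because most of the cache can be reclaimed on demand.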

Swap and Memory Pressure

Swap is disk space that the kernel can use to store memory pages moved out of RAM. It lets the system keep more data in memory than fits in physical RAM, but accessing swapped memory is much slower than RAM.

Swap total is the total configured swap space. Swap used is the portion currently in use. A small amount of swap usage is not necessarily a problem. Continuous growth in swap used, together with a low value for available memory and high IO wait, often shows that the system does not have enough RAM for its workload.

When memory pressure is high, the kernel will spend more time reclaiming memory. This can result in high system CPU time and frequent disk IO for swap. In the worst case, the kernel may kill processes to free memory.

A rough conceptual relationship is that total virtual memory is the sum of physical RAM and swap:

$$
\text{Total virtual memory} = \text{RAM} + \text{Swap}
$$

In practice, you want most active data to stay in RAM. Swap should act as a safety buffer, not as a place for heavily used pages.
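To tell occasional swap use from heavily used pages moving in and out, you can compare the cumulative pswpin and pswpout counters in /proc/vmstat at two points in time. This is a rough sketch under the assumption that both counters are present. The function names are illustrative.

```python
def parse_vmstat_counters(text):
    """Parse the 'name value' lines of /proc/vmstat into a dictionary."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, seconds):
    """Return pages swapped in and out per second between two snapshots."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        "pages_in_per_s": (after["pswpin"] - before["pswpin"]) / seconds,
        "pages_out_per_s": (after["pswpout"] - before["pswpout"]) / seconds,
    }


def read_vmstat_counters(path="/proc/vmstat"):
    """Take one snapshot of the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat_counters(f.read())
```

Rates that stay at zero mean the pages already in swap are not being touched. Rates that stay high on both sides suggest that actively used pages no longer fit in RAM.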

Using top for CPU and Memory

The top command provides a continuous, text based view of running processes and system load. It shows CPU usage, memory usage, load averages, and details for each process.

When you run top, the first lines summarize system state. You will see load averages, the number of tasks, and a breakdown of CPU usage. A typical CPU line might include us for user, sy for system, id for idle, and wa for IO wait. This aligns with the descriptions you saw previously.

The memory summary lines show total, used, free, and buff/cache values, plus swap total, used, and free. Modern top versions also show avail Mem. Interpret these values as described before, paying attention to available memory, not just used.

Below the summary, top lists processes with columns like PID, user, CPU percentage, memory percentage, and command name. You can quickly see which processes are consuming the most CPU and memory.

You can interact with top using single key commands. For example, you can change the sort order, filter by user, or toggle fields, but the exact keys and deeper features belong in process focused topics. For CPU and memory monitoring, use it mainly to observe overall usage and identify heavy processes.

Using htop for a Friendlier View

htop is an enhanced alternative to top. It displays CPU usage per core with colored bars, memory usage, and swap usage in a more graphical way. The process list is easier to navigate, and you can scroll through it with arrow keys.

Each CPU core has its own bar that shows user, system, and other usages using different colors. This makes it simple to see if one core is heavily loaded while others are idle. Memory and swap bars display total usage and can help you see trends at a glance.

Because htop is interactive, you can select a process and send signals such as terminate or kill, or change its priority. For pure monitoring, the main advantage is clarity and visibility per core and of memory patterns.

Using vmstat for Overview Metrics

The vmstat command provides a compact, periodic snapshot of CPU, memory, and IO activity. It is very useful to observe trends over several seconds or minutes.

When you run vmstat with an interval and a count, for example:

vmstat 2 5

you get output every 2 seconds, 5 times. The first row reports averages since boot, and each later row covers the preceding interval. The columns are grouped. Procs columns (r, b) show processes that are runnable and blocked. Memory columns (swpd, free, buff, cache) show how much swap is in use, how much memory is free, and how much is used for buffers and cache. Swap columns (si, so) show swap in and out activity. IO columns (bi, bo) show blocks read and written. System columns (in, cs) show interrupts and context switches. CPU columns (us, sy, id, wa) show user, system, idle, and wait percentages.

For CPU and memory monitoring, focus on the free memory, any significant swap in or swap out activity, and the CPU columns. A rising si (swap in) and so (swap out), together with low free memory and low idle time, usually indicates memory pressure. High wa in the CPU columns indicates that the CPU is waiting on IO.

vmstat is most informative when you capture several lines and look at how the values change, instead of just a single instant.
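If you want to compare rows programmatically, the vmstat output can be parsed by its column-name line. The sketch below assumes the default layout, without the wide or timestamp options, and relies on the header line starting with r. The function names are illustrative.

```python
import subprocess


def parse_vmstat_output(text):
    """Turn `vmstat <interval> <count>` output into a list of per-row dictionaries."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # Column-name line: r b swpd free buff cache si so bi bo in cs us sy id wa ...
            header = fields
        elif header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, (int(f) for f in fields))))
    return rows


def swapping_rows(rows):
    """Return the rows in which pages were swapped both in and out."""
    return [row for row in rows if row["si"] > 0 and row["so"] > 0]


def run_vmstat(interval=2, count=5):
    """Run vmstat and parse its output (requires vmstat to be installed)."""
    result = subprocess.run(
        ["vmstat", str(interval), str(count)],
        capture_output=True, text=True, check=True,
    )
    return parse_vmstat_output(result.stdout)
```

Because the first row reports averages since boot, it is usually better to drop rows[0] before looking for trends.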

Using free for Memory Snapshots

The free command gives you a concise snapshot of memory usage. It prints total, used, free, shared, buff/cache, and available memory, plus swap totals.

If you run free -h, the -h option shows sizes in a human friendly format, such as MiB or GiB. The available column is the best indicator of how much memory is really left for new applications. If available is large relative to total, you have plenty of room. If it is very low and swap used is high, the system is likely under memory pressure.

You can combine watch with free for continuous monitoring. For example, watch -n 2 free -h updates the output every 2 seconds, so you can see memory changes as you start or stop applications or run heavy workloads.
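When a script needs the same numbers, the byte output of free -b is easier to parse than the human friendly form. The following sketch assumes the column layout of recent procps versions of free, where a header line is followed by the Mem and Swap rows. The function name is illustrative.

```python
def parse_free_output(text):
    """Parse `free -b` output into {'mem': {...}, 'swap': {...}} with byte values."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()  # total used free shared buff/cache available
    result = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        values = [int(v) for v in rest.split()]
        # The Swap row has only three values, so zip() keeps total, used, and free.
        result[label.strip().lower()] = dict(zip(header, values))
    return result
```

The available column of the mem entry and the used column of the swap entry are the two values the text above recommends watching.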

Visualizing Metrics Over Time

Command line tools such as top, htop, vmstat, and free are ideal for quick checks and interactive troubleshooting. For longer term monitoring, you often connect these metrics to logging and graphing systems, which is addressed elsewhere.

For now, it is important to understand that CPU and memory metrics are more valuable when viewed as time series than as static snapshots. A spike in CPU usage for a few seconds is usually acceptable. A sustained high CPU load or a slow but steady increase in swap usage deserves more attention.

When you monitor, ask whether the metric is a short lived burst or a stable pattern. This distinction will guide your response, such as whether you need to tune an application, add more hardware resources, or adjust system level settings.
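One simple way to separate a burst from a stable pattern is to look for the longest run of consecutive samples above a threshold. This is a minimal sketch. The threshold and the run length that counts as sustained are assumptions you would tune for your own system.

```python
def longest_run_above(samples, threshold):
    """Return the length of the longest run of consecutive samples above threshold."""
    longest = current = 0
    for value in samples:
        if value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def classify_pattern(samples, threshold, sustained_len):
    """Label a series as 'normal', 'burst', or 'sustained'."""
    run = longest_run_above(samples, threshold)
    if run == 0:
        return "normal"
    if run >= sustained_len:
        return "sustained"
    return "burst"
```

For CPU busy percentages sampled every few seconds, classify_pattern([20, 95, 97, 30, 25, 20], 90, 5) returns "burst", while six samples that all exceed 90 return "sustained".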

Connecting CPU and Memory Symptoms

Many real issues involve both CPU and memory. High CPU can result from many short lived processes, heavy computation, or frequent memory allocations. Low available memory can increase CPU usage because the kernel has to spend more time managing memory and reclaiming pages.

You can often link symptoms using the tools from this chapter. For example, if top shows high system CPU percentage and vmstat shows frequent swap in and swap out operations, you probably have memory pressure that is affecting CPU. If top shows high user CPU on one process and free shows plenty of available memory and no swap used, the bottleneck is more likely pure computation in that process.
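The reasoning in these examples can be written down as a small set of rules. The sketch below is only an illustration. The numeric thresholds are hypothetical and are not taken from any tool, and the inputs are assumed to be percentages and swap rates you have already collected.

```python
def likely_bottleneck(us, sy, wa, swap_in, swap_out, available_pct):
    """Suggest what to investigate first, based on a few collected metrics.

    us, sy, wa: CPU percentages for user, system, and IO wait.
    swap_in, swap_out: swap activity, for example the si and so columns of vmstat.
    available_pct: available memory as a percentage of total RAM.
    All thresholds below are illustrative assumptions.
    """
    swapping = swap_in > 0 and swap_out > 0
    if swapping and (sy >= 20 or available_pct < 10):
        return "memory pressure"
    if wa >= 20:
        return "IO wait"
    if us >= 70 and not swapping and available_pct >= 20:
        return "CPU-bound computation"
    return "no clear bottleneck"
```

Memory pressure is checked first because swapping also raises IO wait, so testing wa first would hide the underlying cause.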

By combining these observations, you can build a mental model of how CPU and memory interact on your system and use that model to decide what to investigate next.
