
3.5 System Monitoring

Understanding System Monitoring

System monitoring is the ongoing observation and measurement of how a Linux system behaves while it runs real workloads. This chapter focuses on the overall idea and practice of monitoring, while the chapters that follow look closely at CPU and memory, disk and I/O, log files, boot performance, and running services.

Monitoring gives you visibility into what your system is doing, how resources are being used, and where problems appear before they become outages. Without monitoring you are effectively blind. With it you can respond quickly, plan capacity, and troubleshoot issues systematically.

Why System Monitoring Matters

The main goal of monitoring is to ensure that systems remain usable, responsive, and reliable. In practice this means you want to answer questions like whether the system is under heavy load, whether performance is degrading over time, and whether something has failed or is close to failing.

From the perspective of a system administrator, monitoring has three broad uses. First, it supports troubleshooting. When users report that an application is slow, monitoring data helps you see if the CPU is saturated, memory is exhausted, disks are overloaded, or the network is congested. Second, it supports capacity planning. Historical monitoring data reveals trends that guide hardware upgrades or architecture changes. Third, it enables alerting and automation. Monitoring tools can notify you or trigger automatic actions when metrics cross defined thresholds.

System monitoring is not optional for serious systems. A server that is not monitored will eventually fail in ways that are hard or impossible to diagnose.

Core Concepts and Terminology

System monitoring involves several common ideas that apply to different tools and subsystems. It is useful to understand them before you dive into the details for CPU, memory, disks, logs, and services.

A metric is a numeric measurement of some aspect of the system at a given time. Examples include CPU usage as a percentage, available memory in megabytes, disk read throughput in megabytes per second, and the number of running processes. Many monitoring tools work by collecting and storing time series of metrics, which means you have values indexed by timestamp and can plot them over time.
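Two of the example metrics above can be read directly from the kernel's counters. This is a minimal sketch assuming a Linux system with /proc mounted; the field names are standard, but the values will of course vary from machine to machine.

```shell
# Available memory in megabytes, from /proc/meminfo (the kernel reports kB).
mem_avail_mb=$(awk '/^MemAvailable:/ {print int($2 / 1024)}' /proc/meminfo)

# Number of currently runnable processes, from /proc/stat.
procs_running=$(awk '/^procs_running/ {print $2}' /proc/stat)

echo "MemAvailable: ${mem_avail_mb} MB, runnable processes: ${procs_running}"
```

Each of these reads produces a single number at a single moment; collecting them repeatedly with timestamps is what turns them into a time series.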

Sampling is the act of collecting metrics at regular intervals. For example, a tool might record CPU usage every 5 seconds. Shorter intervals give more detailed data but also increase storage and processing requirements. There is always a trade-off between resolution and overhead.

Sampling should be frequent enough to capture important events, but not so frequent that it harms system performance. A process that samples once every millisecond, for instance, could itself contribute to system load.
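The sampling idea can be sketched in a few lines of shell. This loop records the 1-minute load average once per second for three samples; it assumes a Linux system with /proc, and the file name load_series.csv is an arbitrary choice. A real collector would run continuously and use a longer interval.

```shell
# Record a tiny time series: timestamp,1-minute load average.
rm -f load_series.csv
for i in 1 2 3; do
    load=$(cut -d' ' -f1 /proc/loadavg)           # first field: 1-minute load
    printf '%s,%s\n' "$(date +%s)" "$load" >> load_series.csv
    sleep 1                                       # the sampling interval
done
cat load_series.csv
```

Changing the `sleep` value is exactly the resolution-versus-overhead trade-off described above: a shorter interval yields more rows for the same period.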

Another key concept is overhead, the extra work the system must do to monitor itself. Lightweight tools read kernel counters from special files like /proc and /sys with minimal impact. Heavyweight tools collect and store many metrics, perform complex analysis, or transmit data across the network. Good monitoring design tries to minimize overhead while still gathering sufficient detail.

Finally, there is a difference between real time monitoring and historical monitoring. Real time monitoring shows you what is happening right now, second by second. Historical monitoring shows the past and lets you analyze trends, such as a slow increase in memory use or a daily pattern of CPU spikes.

Types of Monitoring

When you monitor a Linux system you usually combine several different views of the system. These views correspond to the main resources and subsystems.

Resource monitoring focuses on low level resources like CPU, memory, disks, and network. It answers questions about how busy each resource is, how much capacity remains, and whether anything is being saturated.

Service and process monitoring focuses on the programs that deliver functionality to users. This looks at whether specific services are running, whether they restart unexpectedly, and how much resource they consume over time.

Log monitoring focuses on the messages that the system and applications write to log files. Logs contain detailed information about errors, warnings, configuration problems, and unusual events. Purely numeric metrics rarely tell the full story without log context.

From another angle you can divide monitoring into system level monitoring and application level monitoring. System level monitoring looks at the operating system itself: who is using the CPU, how much memory is available, and similar questions. Application level monitoring looks inside a particular service at request rates, error rates, and internal operation timings.

In practice, an effective monitoring setup uses all of these types together. For example, when you investigate slow web responses you might check CPU metrics, memory usage, disk I/O load, the web server process status, and the web server error logs.

Tools and Approaches

Linux provides a rich collection of built in tools that expose monitoring information. Some of these tools are interactive and show real time updates in a terminal, while others print a snapshot or record values over time. External monitoring systems can then collect and centralize this data for many machines at once.

Interactive tools like top and htop show you process lists, CPU usage, and memory utilization that update every few seconds. Disk related tools can show read and write activity. Other commands print one time summaries, such as overall memory allocation or basic disk usage.
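A few of the one-time summary commands mentioned above can be tried directly. This assumes a standard Linux distribution where procps (for `free`) and coreutils (for `df`) are installed, which is the common case.

```shell
# One-shot snapshots rather than interactive displays:
uptime        # current time, uptime, logged-in users, load averages
free -m       # memory and swap usage in megabytes
df -h /       # disk usage of the root filesystem, human-readable
```

Unlike top or htop, these print once and exit, which makes them convenient building blocks for scripts and scheduled checks.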

There are also specialized commands for particular aspects of the system. Later chapters will explore tools focused on CPU and RAM, on storage and I/O, and on services. Some tools look specifically at network behavior, while others focus on logs and boot timing.

For larger environments, administrators deploy dedicated monitoring systems that run agents on each server or query standard interfaces. These systems collect metrics, store them in a database, graph them, and provide dashboards and alerts. They can support many machines from a central place, which is critical when manual inspection is no longer feasible.

A typical approach on a single server is to start with the tools that are already available in your distribution, learn what they show, and then move to more advanced systems if needed. Because all these tools read from the same underlying sources in the kernel and in /proc or /sys, understanding the basics translates well from one monitoring setup to another.

Establishing Baselines

A baseline is a description of what normal looks like for a system. Without a baseline, a high CPU usage value or a specific disk I/O rate may be hard to interpret. For one application 70 percent CPU usage might be normal during business hours, while for another application the same value might indicate a problem.

To build a baseline, you monitor the system over a reasonable period of typical activity and record key metrics such as average CPU usage, typical memory consumption, normal disk read and write rates, and usual patterns in logs. You can then recognize changes that deviate from the baseline. These changes can be gradual, such as a steady increase in memory usage over weeks, or sudden, such as a spike in disk writes after a configuration change.
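The mechanics of averaging samples into a baseline figure can be sketched with awk. A real baseline would cover days or weeks of typical activity; the three one-second load samples here only illustrate the calculation, and assume a Linux system with /proc.

```shell
# Collect a few load-average samples, then average them.
samples=""
for i in 1 2 3; do
    samples="$samples $(cut -d' ' -f1 /proc/loadavg)"
    sleep 1
done
echo "$samples" | awk '{ for (i = 1; i <= NF; i++) sum += $i; print "baseline load:", sum / NF }'
```

The same averaging approach extends naturally to per-hour or per-weekday baselines once samples are stored with timestamps.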

Baselines are specific to each system and workload. A development server with small test data sets will have a very different baseline from a production database server with heavy transactional activity. Monitoring tools help you visualize these differences and notice when they shift.

In practice, you often define baselines informally at first by observing graphs and usage patterns. Over time you can refine these into more formal thresholds and rules, which can then be used for alerting.

Thresholds, Alerts, and Incidents

Monitoring is not only about looking at graphs. It is also about being informed automatically when something is wrong. This is where thresholds and alerts come in.

A threshold is a defined limit for a metric. For example, you might say that CPU usage over 90 percent for more than 5 minutes is a concern. When such a condition is met, the monitoring system can generate an alert by email, by a message in a chat system, or by integrating with an incident management platform.
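A threshold check like this is simple to express in shell. The sketch below warns when root filesystem usage crosses a limit; the 90 percent figure is an example value, not a recommendation, and `df -P` is used for stable, script-friendly output.

```shell
# Alert sketch: compare current disk usage against a threshold.
threshold=90
used=$(df -P / | awk 'NR == 2 {gsub("%", "", $5); print $5}')  # Use% column
if [ "$used" -gt "$threshold" ]; then
    echo "ALERT: root filesystem at ${used}% (threshold ${threshold}%)"
else
    echo "OK: root filesystem at ${used}%"
fi
```

In a real monitoring system the `echo` would be replaced by a notification, and the "for more than 5 minutes" part of the condition would be handled by requiring the threshold to hold across several consecutive samples.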

A good alert should indicate a real problem that requires human attention. Excessive or poorly chosen alerts create noise, which leads to alert fatigue and ignored warnings.

Alerts should be based on both current values and context. A single brief spike in CPU may not matter if it coincides with a scheduled batch job, while a long period of high usage outside expected times might indicate a runaway process or a misconfiguration.

You can also define complex alerts that involve multiple metrics. For example, memory usage may be high, but if swap usage is low and the system remains responsive, it might not be critical. If memory is high, swap is heavily used, and processes are being killed, then an urgent alert is justified.
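The memory-plus-swap example can be sketched by combining several readings from /proc/meminfo before deciding on severity. This assumes a Linux system; the kilobyte thresholds below are illustrative only and would need tuning for a real machine.

```shell
# Multi-metric alert sketch: memory pressure is only urgent when swap
# is also heavily used. All values from /proc/meminfo are in kB.
mem_avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
swap_used=$((swap_total - swap_free))

if [ "$mem_avail" -lt 102400 ] && [ "$swap_used" -gt 512000 ]; then
    echo "URGENT: low memory and heavy swap use"
elif [ "$mem_avail" -lt 102400 ]; then
    echo "WARNING: low memory, but swap is mostly idle"
else
    echo "OK: memory pressure looks normal"
fi
```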

Incidents are events where something in the system fails or significantly degrades. Monitoring data is crucial during incident response. It shows when the issue began, how it evolved, and which components were involved. It also helps in verifying that the problem is resolved and that performance has returned to normal after you apply a fix.

Monitoring Strategy and Scope

Designing a monitoring approach involves decisions about what to monitor, at what detail, and with what tools. It is not necessary or wise to collect every possible metric on every system. Instead, you should choose metrics that relate closely to the roles and risks of each machine.

On a simple personal Linux desktop, you may only need basic monitoring to troubleshoot occasional slowdowns. For example, watching CPU and memory usage when you run heavy applications, and checking logs when something crashes, might be enough.

On a server, you typically monitor more systematically. You would always include CPU, memory, disk, and network usage, and then add service status checks, application specific metrics, and log scanning for errors. For critical services, you might also monitor redundancy and failover mechanisms, resource usage over long periods, and external checks such as whether a web page responds correctly from outside the server.

Monitoring should also consider the lifecycle of the system. Before deploying a new application, you can monitor a testing environment to understand its resource needs. After deployment, you adjust thresholds and alerts, then watch for unexpected behavior. When decommissioning a system, you can use monitoring to ensure that workloads have migrated correctly and that the old system is no longer in active use.

Using Monitoring for Troubleshooting

When a system behaves poorly, monitoring gives you a structured approach to troubleshooting. Rather than guessing, you observe metrics and logs and work from the most fundamental resources upward.

Often you start by asking if the machine is resource constrained. You look at CPU usage to see if processes are consuming all available compute resources. You check memory to see if it is close to full or if swap is being used heavily. You examine disk I/O to see if reads or writes are saturating the device and causing delays. You also review network usage if the problem appears to affect communication with other systems.
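A first pass over these questions often fits in three commands. This assumes the procps version of `ps` found on standard Linux distributions, which supports `-eo` output selection and `--sort`.

```shell
# Quick triage: overall load, memory headroom, and the processes
# currently using the most CPU.
uptime
free -m
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6   # header plus top five
```

If one process dominates the CPU column, that immediately narrows the investigation to a single service and its logs; if nothing stands out, the problem is more likely in disk I/O, the network, or the application itself.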

Once you know whether a particular resource is under pressure, you can identify which process or service is responsible and look at its logs. If no single resource stands out, you might look at more detailed metrics, such as context switches, load averages, or application specific indicators.

Because logs capture discrete events and metrics capture continuous behavior over time, combining both yields the clearest picture. For example, you might notice in metrics that performance dropped at a certain time, then open the logs around that time to see error messages or configuration changes.
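The "find the spike, then read the logs around it" workflow can be sketched mechanically. The metric series and log file below are synthetic stand-ins created inline so the example is self-contained; real data would come from your collector and from files under /var/log.

```shell
# Synthetic inputs: a timestamp,load series and a timestamped log.
cat > metrics.csv <<'EOF'
1700000000,0.40
1700000060,0.35
1700000120,7.90
EOF
cat > app.log <<'EOF'
1700000050 INFO  request served
1700000115 ERROR backend timeout
1700000125 ERROR backend timeout
EOF

# Find the timestamp of the worst sample, then show log lines
# within 60 seconds of it.
spike=$(sort -t, -k2 -n metrics.csv | tail -n 1 | cut -d, -f1)
awk -v t="$spike" '($1 >= t - 60) && ($1 <= t + 60)' app.log
```

Here the highest load sample points directly at the window containing the error messages, which is exactly the metric-to-log pivot described above.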

Monitoring also helps verify that your changes work. After you adjust a configuration setting or restart a service, you can observe how metrics change. If CPU usage falls to a normal level and response times improve, you have evidence that the issue was resolved.

From Local Observation to Centralized Monitoring

On a single computer you can work interactively with local tools and log files, since you have direct access to the system. However, in any environment with multiple servers, manual monitoring quickly becomes impractical.

Centralized monitoring systems address this problem by gathering metrics and logs from multiple hosts into one place. They run small agents that send data at regular intervals, or they use network protocols to query services and devices. The collected data can then be visualized with dashboards, stored for long term analysis, and used to trigger alerts according to defined rules.

While this course focuses on Linux itself rather than specific monitoring products, the principles remain the same. The metrics you first learn to inspect with local commands are the same metrics that larger tools collect and analyze automatically.

As you gain experience, you can move from ad hoc checks toward more systematic monitoring. You begin to define which metrics should be collected by default for new systems, which dashboards are needed for common roles, and which alerts indicate real problems. System monitoring becomes a normal, continuous practice instead of an occasional reaction to trouble.

Connecting to the Next Chapters

In the following chapters you will look more closely at specific aspects of monitoring. CPU and memory monitoring explores how to see detailed usage and identify bottlenecks. Disk and I/O monitoring covers storage activity and performance. Log monitoring in /var/log shows how textual messages reveal system behavior. Boot performance examines startup timing and delays. Monitoring running services focuses on the state and health of key daemons.

Together, these topics build a complete and practical understanding of how to keep a Linux system observable and manageable over its entire lifetime.
