
3.5.2 Disk and I/O monitoring

Understanding Disk and I/O Monitoring

Disk and I/O monitoring focuses on how storage devices read and write data and how quickly they respond. While CPU and memory monitoring show how fast the system can process data, disk and I/O metrics reveal how efficiently data moves to and from storage. For many servers, especially databases and file servers, storage performance is the main bottleneck, so learning to observe and interpret disk behavior is a key administrative skill.

I/O in this context means input and output to block devices such as hard drives, SSDs, virtual disks, and storage arrays. It includes sequential and random reads and writes and the queue of pending operations that wait for the disk to service them. Monitoring tools do not change performance by themselves, but they tell you when and where storage is slow so that you can act.

Important: Disk and I/O monitoring is not only about high usage. Watch for long response times and large queues, since they are the main signs of disk contention and user-visible slowness.

Key Disk and I/O Concepts

Before looking at tools, it is important to understand the common concepts they report. Most disk monitoring output is just different views of the same underlying metrics.

The simplest quantity is the number of input and output operations per second, usually written as IOPS. If $R$ is the number of read operations during an interval and $W$ is the number of write operations, and the interval length is $T$ seconds, then:

$$
\text{IOPS} = \frac{R + W}{T}
$$

Tools also report throughput, which is the amount of data transferred per second. If $B$ is the total number of bytes read or written in time $T$, then:

$$
\text{Throughput} = \frac{B}{T}
$$

In practice, throughput is shown in KB/s, MB/s, or similar units. IOPS describes how many operations the disk is handling, while throughput describes how much data it is moving. A server with many small operations can have high IOPS and modest throughput, while a backup job that streams large files can have high throughput and fairly low IOPS.
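As a quick worked example of both formulas, with made-up counter values, plain shell arithmetic is enough:

```shell
# Hypothetical interval: 3000 reads and 1500 writes moving
# 1,048,576,000 bytes over T = 30 seconds.
R=3000; W=1500; B=1048576000; T=30

IOPS=$(( (R + W) / T ))        # (3000 + 1500) / 30 = 150 operations/s
MBPS=$(( B / T / 1048576 ))    # bytes/s converted to whole MB/s

echo "IOPS: $IOPS"
echo "Throughput: ${MBPS} MB/s"
```

This workload moves roughly 33 MB/s in 150 operations per second, i.e. fairly large requests; the same throughput made of 4 KB requests would need thousands of IOPS.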

Latency, often shown as await or as an average service time, measures how long an individual I/O request takes. When latency rises, applications that wait on disk spend more time blocked and the system feels slow. Finally, queue depth shows how many requests are waiting to be processed at any moment. A small but consistently high queue often means the disk is busy but coping, while a rapidly growing queue with rising latency usually indicates a severe bottleneck.
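These metrics are linked: by Little's law, the average number of requests in flight, which is the queue depth that tools report, is approximately the request rate multiplied by the average latency:

$$
\text{Queue depth} \approx \text{IOPS} \times \text{Latency}
$$

For example, a disk serving 200 IOPS at an average latency of 5 ms holds about $200 \times 0.005 = 1$ request in flight on average; if latency climbs to 50 ms at the same rate, the average queue grows to about 10.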

Using `iostat` for Basic Disk Statistics

One of the most widely available tools for disk I/O is iostat, which is part of the sysstat package on many distributions. It reads kernel statistics and prints summaries of CPU and device usage. As with many monitoring commands, it can display a snapshot since boot or repeated samples over time.

A simple use is:

iostat

Run without arguments, iostat prints a single report of CPU and per-device statistics accumulated since boot, which is not very useful for real-time analysis. To get periodic samples, you can pass an interval in seconds, and optionally the number of reports:

iostat 2
iostat 5 10

The first example repeats every 2 seconds until you stop it, and the second shows 10 reports at 5-second intervals. In both cases the first report still covers the time since boot, so interpretation should start from the second sample. To see more detailed statistics per device, you usually add the -x option:

iostat -x 2

Extended output typically shows per-device fields such as utilization percentage (%util), average queue size (aqu-sz, called avgqu-sz in older sysstat releases), and average wait time (await, split into r_await and w_await in recent versions). Each column has a specific meaning worth learning carefully, but the general interpretation is similar across systems. Look for devices with consistently high utilization and long average wait times compared with others.

One limitation of iostat is that it gives a periodic summary instead of a timeline for each individual I/O, so it is best used to detect general load patterns and to identify which disks are busy. It is also a good starting point on a new system, because its output is concise and it is usually present or easy to install.

Real Time I/O Monitoring with `iotop`

While iostat focuses mainly on devices, iotop adds a process-oriented view. It acts a bit like top but for I/O, showing which processes are generating the most disk traffic at any given moment.

On many distributions, iotop must be installed from the package manager. After installation, you can run:

sudo iotop

Administrator privileges are typically required, because the tool reads detailed task accounting data from the kernel. The display refreshes at intervals and shows columns such as current read and write rates per process, and often accumulated values.

By default, iotop lists all tasks, including those doing no I/O at the moment; the -o (or --only) option restricts the display to processes and threads that are actively reading or writing, and -a (--accumulated) shows totals since iotop started rather than current rates. The key benefit is rapid identification of which process is responsible when disks are busy. If you see a large backup process, a database instance, or a misbehaving application at the top, you have an immediate starting point for further investigation.

In interactive mode, you can usually change the sort order and pause the display using keyboard commands, similar to top. Since iotop reports rates rather than long term averages, it is especially useful while a performance issue is ongoing. When the issue is intermittent, you may leave iotop running until the problem appears, then note the top entries at that moment.

Block Level Activity with `vmstat` and `dstat`

Some tools provide disk I/O as part of a broader resource view. vmstat is one example that prints information about processes, memory, paging, and block device I/O. You can run:

vmstat 2

to see periodic updates. Among the columns, bi and bo show the number of blocks read from and written to block devices per second. While these are not detailed per device, they help you correlate disk activity with other indicators such as page faults or swap usage.

A more flexible tool on some systems is dstat. It combines multiple statistics that you might otherwise gather from separate commands. If installed, you can try:

dstat -d -D sda,sdb 2

to watch selected disks only, or use other options to include CPU, network, and memory at the same time. dstat can be useful when you want to see whether heavy disk traffic coincides with other resource pressure without running many different commands.

Both vmstat and dstat are more suited to quick diagnostics than to long term historical analysis, but they give a convenient overview when something feels slow and you want to find which subsystem may be involved.

Advanced Per-Device and Per-Process Detail with `pidstat` and `iostat` Options

If you need per-process disk statistics over time instead of a single real-time snapshot, pidstat can help. It belongs to the same sysstat family as iostat. To look at I/O-related statistics, you use the -d option. A simple example is:

pidstat -d 2

This prints read and write rates for each process at the chosen interval. It is less interactive than iotop, but it can be redirected to a file and reviewed later. On systems where you want to monitor specific services over a longer period, you can run pidstat with a longer interval and keep its output for analysis.

iostat also has options to focus on specific devices or to combine statistics in different ways. For instance, you can limit the output to one device by passing its name, as in iostat -x sda 2. This can be useful on servers with many disks where you suspect that only one is misbehaving. By carefully selecting which tools and options you use, you can zoom in from a global view to the particular process and device that cause trouble.

Everyday I/O Indicators with `iotop` and System Load

Even without specialized metrics, there are signals that suggest an I/O problem. A high system load value with moderate CPU usage often points to tasks blocked on disk. On Linux, the load average counts processes in uninterruptible sleep, which usually means they are waiting on I/O, as well as runnable ones, so a sudden spike can be disk related even if CPU graphs look normal.
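One quick way to test this hypothesis is to look for processes in uninterruptible sleep, shown as state D by ps, which on Linux usually means they are blocked on I/O:

```shell
# Print the header plus any tasks currently in uninterruptible sleep.
# An empty list during a load spike argues against a disk bottleneck.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```

Repeating this a few times during an episode shows whether the same processes stay stuck or different ones briefly pass through state D, which is normal.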

In such cases, a quick call to iotop during the high-load episode tells you whether any process is doing extraordinary I/O. Then iostat confirms whether a particular disk is saturated. If both show nothing unusual, the bottleneck may be elsewhere, for example in the network or in lock contention inside an application. Disk monitoring is most useful when interpreted alongside other system metrics rather than in isolation.

Because I/O patterns vary over time, you should avoid drawing conclusions from a single short measurement. It is often better to observe for a few minutes and see if the pattern is stable. Short spikes are normal for many workloads, while sustained high latency and queues are more concerning. When possible, try to compare with a baseline collected during normal operation to distinguish normal high activity from abnormal contention.

Collecting Historical Disk Metrics

Real time tools are very helpful while you are watching the system, but they do not provide history. For long term trends, capacity planning, or intermittent issues, you need some form of periodic collection and storage of disk metrics.

One basic approach is to run commands like iostat or pidstat from scripts at regular intervals and log their output to files. This method is simple but can be hard to visualize, and large logs become awkward to handle. More advanced setups use monitoring systems that read kernel statistics at intervals and send them to time-series databases. Those systems are outside the scope of this chapter, but their disk panels usually display the same fundamental metrics such as IOPS, throughput, queue depth, and latency.
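A minimal sketch of the script-based approach, assuming a hypothetical log file named disk-io.log, appends one timestamped copy of the kernel counters per run; cron or a systemd timer would supply the sampling interval:

```shell
# Append an epoch timestamp followed by the raw cumulative counters.
# Differences between consecutive snapshots give per-interval rates.
{
    date '+%s'
    cat /proc/diskstats
} >> disk-io.log
```

Because the counters are cumulative, the log stays meaningful even if some runs are missed: any two snapshots still yield an average rate over the time between them.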

When configuring such monitoring, it is useful to choose a sampling interval that balances detail and overhead. Very short intervals give more precise views of short spikes but generate more data. For many servers, intervals in the range of 10 to 60 seconds are a practical compromise. No matter which tools you choose, the goal remains the same: to recognize when disk usage is approaching the limits of the hardware and to act before that leads to user-visible slowdowns.

Using `/proc` and `/sys` for Raw Disk Statistics

All of the tools described depend on kernel counters that are also exposed through virtual files. For each block device there is a stat file under /sys/block/<device>/, and on many systems there is also a per-device summary in /proc/diskstats. You can inspect them directly with commands like:

cat /proc/diskstats

The raw numbers there are cumulative counts and often include fields for sectors read and written and time spent doing I/O. Normally, you let higher level tools interpret them for you, but direct access is helpful when you want to check whether a metric exists on a particular system or when developing custom scripts.
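As an illustration of such a custom script, the awk sketch below converts the cumulative sector counters in /proc/diskstats into megabytes; the device name is field 3, sectors read field 6, and sectors written field 10, and the kernel always counts these sectors as 512 bytes regardless of the device's native sector size:

```shell
# Print total data read and written per device since boot,
# skipping loop and ram pseudo-devices.
awk '$3 !~ /^(loop|ram)/ {
    printf "%-10s read %10.1f MB  written %10.1f MB\n",
           $3, $6 * 512 / 1048576, $10 * 512 / 1048576
}' /proc/diskstats
```

Running this twice and subtracting the values gives the amount transferred in between, which is exactly how tools like iostat derive their rates.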

Reading from /proc and /sys does not affect performance in any significant way, because these files are generated by the kernel on demand, not stored on disk. This means that disk and I/O monitoring tools do not themselves cause noticeable disk traffic, which makes them safe to run even while the system is under load.

Practical Interpretation of Disk Metrics

Monitoring tools provide numbers, but the real skill lies in interpretation. A disk that shows high IOPS and throughput is not necessarily a problem if latency remains low and the queue is stable. High utilization close to 100 percent with small wait times can simply mean that the hardware is working at capacity and the workload is well matched.

Signs of trouble include long average wait times, rapidly increasing queue lengths, and processes that regularly appear at the top of I/O lists even during light usage. Another warning sign is a sudden drop in throughput and IOPS without a decrease in demand, often visible as rising load and user complaints.

When you find that disks are a bottleneck, the solutions may include moving data to faster storage, improving application access patterns, tuning file systems, or changing scheduling priorities. Monitoring does not solve these issues directly, but it gives the evidence you need to decide which path to follow and to verify later that the changes have helped.

By practicing with the main disk and I/O monitoring tools on non-critical systems, you will become familiar with what normal patterns look like and you will be better prepared to recognize abnormal behavior when it matters.
