Why Disk and I/O Monitoring Matters
Disk and I/O (input/output) performance often becomes a bottleneck before CPU or RAM. Slow disks can cause:
- High load averages
- “Laggy” terminals and applications
- Slow database queries and web responses
- Increased latency for virtual machines and containers
Monitoring disk and I/O helps you answer:
- Is the disk saturated or idle?
- Are we limited by throughput, latency, or queue length?
- Which processes are causing heavy I/O?
- Is the problem on a specific disk, partition, filesystem, or mount?
This chapter focuses on the main tools and metrics used to monitor disk and I/O on Linux.
Key Disk and I/O Metrics
You’ll see these metrics across most tools:
- Throughput
- Read/write rate in KB/s, MB/s, or sectors/s.
- Answers: “How much data per second is moving?”
- IOPS (I/O operations per second)
- Number of read/write operations per second.
- Important for workloads with many small I/O requests (databases, mail servers).
- Latency / Response time
- How long a single I/O request takes to complete.
- Often seen as:
  - await, r_await, w_await (average time per request, in ms)
  - svctm (service time, in ms, on some tools)
- High latency is often what users “feel” as slowness.
- Utilization
- Percentage of time the device is busy (%util).
- Close to 100% suggests the disk is saturated.
- Queue depth
- Number of I/O requests waiting to be serviced (avgqu-sz or similar).
- High queue depth means the disk cannot keep up with incoming requests.
- Read vs Write balance
- How much of the load is reads vs writes; different devices handle them differently.
- Filesystem-level metrics
- Free/used space
- Inode usage (how many files you can still create)
- Per-file or per-directory I/O (access frequency, “hot” files)
Tools Overview
You’ll use a mix of:
- Instant/top-like tools (live view): iostat, dstat, iotop, atop, vmstat, pidstat
- Periodic/report tools: sar (from sysstat), logs from monitoring systems
- Space & filesystem view: df, du, lsblk, findmnt
- Low-level / advanced: /proc/diskstats, blktrace, perf, bpftrace (advanced usage)
We’ll focus on practical usage of the most common tools.
Device and Mount Layout Basics for Monitoring
To interpret output correctly, you need to recognize:
- Block devices: sda, sdb, nvme0n1, vda
- Partitions: sda1, sda2, nvme0n1p1
- Logical volumes: vg0-root, etc. (from LVM)
- Mount points: /, /home, /var, /data, etc.
Useful quick commands:
lsblk # Block devices, partitions, mount points
lsblk -f # + filesystem type, labels, UUIDs
findmnt # Show mounts in a tree form
These help map “which disk” to “which filesystem” when you see them in monitoring tools.
Monitoring with iostat (sysstat)
iostat is one of the core disk monitoring tools. It comes from the sysstat package (install it if missing).
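On most distributions, installing sysstat is a single package install, for example:
sudo apt install sysstat # Debian/Ubuntu
sudo dnf install sysstat # Fedora/RHEL-family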
Basic usage:
iostat
This prints CPU stats and basic device stats since boot. For real monitoring you usually:
- Set an interval
- Use “extended” stats
- Optionally restrict to disks only
Common patterns:
# Extended stats for all devices every 2 seconds
iostat -x 2
# Extended stats for one device
iostat -x 2 /dev/sda
# Human-readable bytes (on some distros)
iostat -h -x 2
Key columns in iostat -x:
- r/s, w/s – Read/write IOPS.
- rkB/s, wkB/s (or rMB/s, wMB/s on some systems) – Read/write throughput.
- await – Average time per I/O request (ms). Includes waiting in queue + service time.
  - < 5 ms: usually very good (especially for SSD).
  - 10–20 ms: may be OK for HDD under load but can feel slow.
  - > 50 ms: often indicates serious disk pressure.
- svctm (if present) – Average service time, without queueing. If await is much larger than svctm, queueing is the problem.
- %util – Percentage of time the disk was busy.
  - Consistently > 80–90% indicates saturation.
  - On rotational disks, which serve only one I/O at a time (per spindle), 100% genuinely means the disk is saturated.
- avgqu-sz – Average request queue size. A large value combined with high await means the disk is over-committed.
When troubleshooting:
- Run iostat -x 2 during the slowdown.
- Look for disks with:
  - High %util
  - High await
  - High r/s or w/s (IOPS) and/or high throughput
- Map the busy device to a filesystem with lsblk / findmnt, as in the example below.
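For example, to map a busy device back to its mount point (the device names here are just examples):
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT /dev/sda # show partitions and where they are mounted
findmnt -S /dev/sda2 # which filesystem is backed by this partition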
Monitoring with vmstat
vmstat focuses on memory and virtual memory but has an I/O section too. It’s often used as a lightweight first look.
Basic:
vmstat 2
Key I/O columns:
- bi – Blocks received from a block device (blocks in).
- bo – Blocks sent to a block device (blocks out).
These are coarse but useful for seeing whether disk is moving data at all, and how changes (e.g. starting a backup job) affect I/O rates over time.
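To see this in practice, you can watch vmstat in one terminal while generating write traffic in another (the test file path is just an example; delete it afterwards):
vmstat 2 # terminal 1: watch bi/bo every 2 seconds
dd if=/dev/zero of=/tmp/io-test.bin bs=1M count=512 oflag=direct # terminal 2: write ~512 MB straight to disk
rm /tmp/io-test.bin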
Using dstat and atop for Combined Views
dstat
dstat (if installed) gives a more customizable live view.
dstat -d -D sda,sdb 1 # Disk stats only, for sda and sdb
dstat -cdngy 1 # CPU, disk, net, paging, system
Helpful flags:
- -d – disk stats
- -D dev1,dev2 – specific devices
- -r – I/O request stats
- -g – paging stats (page in/out)
It shows per-second rates, which makes trends easier to see than cumulative counters.
atop
atop is an advanced monitoring tool that can show per-process disk usage and can also log to a file over time.
atop
Look for the DSK section for device-level load, and (in some versions) per-process I/O statistics. You can:
- Navigate with arrow keys
- Press d (depending on version) for disk-specific views
It’s especially useful on servers for long-term performance analysis (when used with its logging mode).
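A minimal example of that logging mode (the interval and file path are just examples):
atop -w /var/tmp/atop.raw 60 # record a sample every 60 seconds to a raw file
atop -r /var/tmp/atop.raw # browse the recorded samples later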
Finding I/O-Heavy Processes with iotop and pidstat
When you know disks are busy, the next question is “who is causing this?”
iotop
iotop shows I/O usage by process/thread. You usually need root (or sudo) to see full details.
Install from your distro repo, then:
sudo iotop
or, for a more typical mode:
sudo iotop -o
Key options:
- -o – show only processes/threads currently doing I/O
- -b – batch mode (for logging to a file)
- -d 2 – update interval in seconds
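For example, to log only I/O-active processes to a file for later review (the interval, sample count, and file name are just examples):
sudo iotop -obt -d 2 -n 30 > iotop-capture.log # 30 timestamped batch samples, 2 seconds apart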
Important columns (names may vary slightly):
- DISK READ, DISK WRITE – instantaneous KB/s per process
- SWAPIN – percentage of time the process is waiting on swap I/O
- IO – percentage of time the process is waiting on I/O (general)
Use it to identify:
- Backup tools, tar, rsync jobs
- Databases doing heavy writes
- Misbehaving applications endlessly reading/writing logs
pidstat (from sysstat)
pidstat can show per-process I/O over time. For example:
pidstat -d 2
Columns (may vary):
- kB_rd/s, kB_wr/s – Kilobytes read/written per second
- kB_ccwr/s – Kilobytes of “cancelled” write-back (rarely needed at beginner level)
You can also monitor a single process:
pidstat -d -p <PID> 1
This is handy when you already suspect one application and want to quantify its disk usage.
Checking Space Usage: df and du
Performance and capacity are linked: a nearly full disk can slow the system down and cause failures.
df: filesystem-level usage
df -h
Shows:
- Size, used, available space
- Use% for each filesystem
- Mount points
Points to watch:
- Keep critical filesystems (like /, /var, databases, and log partitions) well below 90–95% full.
- Some filesystems degrade in performance near full capacity.
You can filter specific filesystems:
df -h /var
df -h /home
du: directory-level usage
To find which directories use the most space:
du -sh * # in current directory
du -sh /var/* # biggest users in /var
Options:
- -s – summarize each argument
- -h – human-readable
- --max-depth=1 – one level of subdirectories
Example:
sudo du -h --max-depth=1 /var | sort -h
This helps track down:
- Huge log directories (e.g. /var/log)
- Cache directories
- Data folders growing unexpectedly
Monitoring Inodes
On some filesystems, you can run out of inodes (maximum number of files) even if there is free space.
Check inode usage:
df -i
Look at the IUse% column. A filesystem that is 100% full on inodes cannot create more files, even if Use% (space) is lower.
This is common when many small files are created (mail spools, caches, temporary files).
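To find where the inodes are going, one rough approach is to count files per directory on the affected filesystem (assuming /var is the suspect; adjust the path and depth as needed):
sudo find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n | tail # file counts per second-level directory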
Device-Level Stats from /proc and /sys
Many monitoring tools read from /proc and /sys. You can inspect them directly for custom scripts.
/proc/diskstats
cat /proc/diskstats
Each line corresponds to a device or partition, with fields like:
- Reads completed
- Sectors read
- Time spent doing I/Os (ms)
- Similar stats for writes
These are cumulative counters since boot. Tools like iostat simply sample this repeatedly and compute per-second differences.
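As a minimal sketch of how that works, the following samples the counters twice and computes per-second read/write operations for one device (sda is just an example name; fields 4 and 8 are reads and writes completed, per the kernel's iostats documentation):
DEV=sda
r1=$(awk -v d="$DEV" '$3==d {print $4}' /proc/diskstats) # reads completed, first sample
w1=$(awk -v d="$DEV" '$3==d {print $8}' /proc/diskstats) # writes completed, first sample
sleep 5
r2=$(awk -v d="$DEV" '$3==d {print $4}' /proc/diskstats)
w2=$(awk -v d="$DEV" '$3==d {print $8}' /proc/diskstats)
echo "reads/s: $(( (r2 - r1) / 5 ))"
echo "writes/s: $(( (w2 - w1) / 5 ))"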
/sys/block
ls /sys/block
Per-device directories (e.g. /sys/block/sda) contain:
- queue/ – queue-related settings (scheduler, depth)
- stat – basic device stats
This is more advanced, but it’s useful to know it exists for deeper investigations or scripting.
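For example (sda is just an example device name):
cat /sys/block/sda/queue/scheduler # active I/O scheduler shown in [brackets]
cat /sys/block/sda/queue/nr_requests # request queue size limit
cat /sys/block/sda/stat # cumulative counters, like /proc/diskstats for this one device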
Historical I/O Data with sar
sar (also from sysstat) can collect and display historical disk and I/O metrics, if the sysstat service/cron is enabled.
To view historical device activity:
sar -d 1 3 # live, like iostat
sar -d -f /var/log/sysstat/sa10 # historical, file name varies by distro/date
Key columns:
- tps – transfers per second
- rkB/s, wkB/s – read/write KB/s
- await, %util – similar to iostat
This is particularly useful when:
- The slowdown happened in the past and is now gone.
- You want to correlate disk load with other metrics (CPU, network).
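For example, to look at a specific past time window and correlate disk activity with CPU and load (the file name and times are just examples):
sar -d -s 09:00:00 -e 10:30:00 -f /var/log/sysstat/sa10 # disk activity between 09:00 and 10:30
sar -u -q -s 09:00:00 -e 10:30:00 -f /var/log/sysstat/sa10 # CPU and run-queue for the same window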
Simple Workflows for Common Scenarios
Scenario 1: System feels slow, high load average
- Check if load is I/O-related:
  - top or uptime for load
  - vmstat 2 for bi/bo
- If I/O is active, use:
  - iostat -x 2 to see which device is saturated
  - df -h to ensure the filesystem is not full
- Use iotop -o or pidstat -d 2 to see which processes are causing heavy I/O.
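If you want to capture all of this at once during an incident for later analysis, a minimal sketch (intervals, counts, and file names are just examples):
iostat -x 2 30 > iostat.txt & # 30 samples, 2 seconds apart
vmstat 2 30 > vmstat.txt &
df -h > df.txt
sudo iotop -obt -d 2 -n 30 > iotop.txt
wait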
Scenario 2: Database or application latency spikes
- Run iostat -x 2 and watch await and %util on the disk(s) where data is stored.
- If these are high, use iotop / pidstat to see whether:
  - The database itself is doing heavy I/O
  - Another process (backup, log rotation, find job) is competing for the same disk
- Consider whether the data is on HDD vs SSD, and if the workload pattern (random vs sequential) is stressing the device.
Scenario 3: Disk unexpectedly full, causing failures
- df -h to find full or near-full filesystems.
- du -sh /* and then drill down: du -sh /var/*
- Identify growth in:
  - Log directories (/var/log)
  - Cache directories
  - Application-specific data directories
- Remove/rotate/compress data as appropriate, or move it to another volume.
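A couple of commands that often help here (the path and size threshold are just examples; make sure a log is no longer being written to before compressing or removing it):
sudo find /var -xdev -type f -size +100M -exec ls -lh {} + # list files larger than 100 MB
sudo gzip /var/log/old-app.log # compress an old, inactive log (example file name)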
Intro to I/O Performance Characteristics (HDD vs SSD)
Understanding rough differences helps interpret numbers:
- HDD (rotating disks):
- Low IOPS (hundreds, maybe low thousands)
- High seek time; random I/O is expensive
- await of 10–20 ms under load may be normal
- High queue depth quickly increases latency
- SSD / NVMe:
- Very high IOPS
- Very low latency (typically < 1 ms)
- If await regularly exceeds a few ms and %util is high, the device may be saturated or throttled
This means:
- The same await value has different implications depending on device type.
- HDDs saturate with far fewer concurrent operations than SSDs.
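To check whether a device is rotational (HDD) or not (SSD/NVMe):
lsblk -d -o NAME,ROTA # ROTA=1 means rotational (HDD), 0 means SSD/NVMe
cat /sys/block/sda/queue/rotational # same information for a single device (sda is an example)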
Basic Monitoring Tips and Practices
- Use interval-based tools (iostat -x 2, vmstat 2, dstat 1) when the issue is happening; one-off snapshots hide spikes.
- Combine device-level (iostat) and process-level (iotop, pidstat) views.
- Record a baseline on a “normal” day (see the sketch after this list):
  - What are typical await, %util, and IOPS under normal load?
- Watch out for:
- Cron jobs (backups, indexing) running at peak times
- Log files or temp directories growing without limits
- Database or VM images stored on slow or overloaded disks
- Integrate disk/I/O metrics into your system-wide monitoring (using whatever monitoring stack your environment uses).
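As a simple way to record the baseline mentioned above, capture an hour of samples on a quiet day (the interval, count, and file names are just examples):
iostat -x 60 60 > baseline-iostat-$(date +%F).txt & # one hour of 60-second samples
vmstat 60 60 > baseline-vmstat-$(date +%F).txt &
wait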
By combining these tools and metrics, you can quickly determine whether disk and I/O are your bottleneck, and identify the processes and filesystems involved.