Why Disk and I/O Monitoring Matters
Disk and I/O (input/output) performance often becomes a bottleneck before CPU or RAM. Slow disks can cause:
- High load averages
- “Laggy” terminals and applications
- Slow database queries and web responses
- Increased latency for virtual machines and containers
Monitoring disk and I/O helps you answer:
- Is the disk saturated or idle?
- Are we limited by throughput, latency, or queue length?
- Which processes are causing heavy I/O?
- Is the problem on a specific disk, partition, filesystem, or mount?
This chapter focuses on the main tools and metrics used to monitor disk and I/O on Linux.
Key Disk and I/O Metrics
You’ll see these metrics across most tools:
- Throughput
- Read/write rate in KB/s, MB/s, or sectors/s.
- Answers: “How much data per second is moving?”
- IOPS (I/O operations per second)
- Number of read/write operations per second.
- Important for workloads with many small I/O requests (databases, mail servers).
- Latency / Response time
- How long a single I/O request takes to complete.
- Often seen as:
  - await, r_await, w_await (average time per request, in ms)
  - svctm (service time, in ms, on some tools)
- High latency is often what users “feel” as slowness.
- Utilization
- Percentage of time the device is busy (%util).
- Close to 100% suggests the disk is saturated.
- Queue depth
- Number of I/O requests waiting to be serviced (avgqu-sz or similar).
- High queue depth means the disk cannot keep up with incoming requests.
- Read vs Write balance
- How much of the load is reads vs writes; different devices handle them differently.
- Filesystem-level metrics
- Free/used space
- Inode usage (how many files you can still create)
- Per-file or per-directory I/O (access frequency, “hot” files)
Tools Overview
You’ll use a mix of:
- Instant/top-like tools (live view): iostat, dstat, iotop, atop, vmstat, pidstat
- Periodic/report tools: sar (from sysstat), logs from monitoring systems
- Space & filesystem view: df, du, lsblk, findmnt
- Low-level / advanced: /proc/diskstats, blktrace, perf, bpftrace (advanced usage)
We’ll focus on practical usage of the most common tools.
Device and Mount Layout Basics for Monitoring
To interpret output correctly, you need to recognize:
- Block devices: sda, sdb, nvme0n1, vda
- Partitions: sda1, sda2, nvme0n1p1
- Logical volumes: vg0-root, etc. (from LVM)
- Mount points: /, /home, /var, /data, etc.
Useful quick commands:
lsblk # Block devices, partitions, mount points
lsblk -f # + filesystem type, labels, UUIDs
findmnt # Show mounts in a tree form
These help map “which disk” to “which filesystem” when you see them in monitoring tools.
Monitoring with iostat (sysstat)
iostat is one of the core disk monitoring tools. It comes from the sysstat package (install it if missing).
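On most distributions, installing sysstat is a single package install, for example:
sudo apt install sysstat # Debian/Ubuntu
sudo dnf install sysstat # Fedora/RHEL-family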
Basic usage:
iostat
This prints CPU stats and basic device stats since boot. For real monitoring you usually:
- Set an interval
- Use “extended” stats
- Optionally restrict to disks only
Common patterns:
# Extended stats for all devices every 2 seconds
iostat -x 2
# Extended stats for one device
iostat -x 2 /dev/sda
# Human-readable bytes (on some distros)
iostat -h -x 2
Key columns in iostat -x:
- r/s, w/s – Read/write IOPS.
- rkB/s, wkB/s (or rMB/s, wMB/s on some systems) – Read/write throughput.
- await – Average time per I/O request (ms). Includes waiting in queue + service time.
  - < 5 ms: usually very good (especially for SSD).
  - 10–20 ms: may be OK for HDD under load but can feel slow.
  - > 50 ms: often indicates serious disk pressure.
- svctm (if present) – Average service time, without queueing. If await is much larger than svctm, queueing is the problem.
- %util – Percentage of time the disk was busy.
  - Consistently > 80–90% indicates saturation.
  - On rotational disks, which serve only one I/O at a time (per spindle), 100% genuinely means the disk is saturated.
- avgqu-sz – Average request queue size. A large value combined with high await means the disk is over-committed.
When troubleshooting:
- Run iostat -x 2 during the slowdown.
- Look for disks with:
  - High %util
  - High await
  - High r/s or w/s (IOPS) and/or high throughput
- Map the busy device to a filesystem with lsblk / findmnt, as in the example below.
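For example, to map a busy device back to its mount point (the device names here are just examples):
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT /dev/sda # show partitions and where they are mounted
findmnt -S /dev/sda2 # which filesystem is backed by this partition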
Monitoring with vmstat
vmstat focuses on memory and virtual memory but has an I/O section too. It’s often used as a lightweight first look.
Basic:
vmstat 2
Key I/O columns:
- bi – Blocks received from a block device (blocks in).
- bo – Blocks sent to a block device (blocks out).
These are coarse but useful for seeing whether disk is moving data at all, and how changes (e.g. starting a backup job) affect I/O rates over time.
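To see this in practice, you can watch vmstat in one terminal while generating write traffic in another (the test file path is just an example; delete it afterwards):
vmstat 2 # terminal 1: watch bi/bo every 2 seconds
dd if=/dev/zero of=/tmp/io-test.bin bs=1M count=512 oflag=direct # terminal 2: write ~512 MB straight to disk
rm /tmp/io-test.bin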
Using dstat and atop for Combined Views
dstat
dstat (if installed) gives a more customizable live view.
dstat -d -D sda,sdb 1 # Disk stats only, for sda and sdb
dstat -cdngy 1 # CPU, disk, net, paging, system
Helpful flags:
- -d – disk stats
- -D dev1,dev2 – specific devices
- -r – I/O request stats
- -g – paging stats (page in/out)
It shows per-second rates, which makes trends easier to see than cumulative counters.
atop
atop is an advanced monitoring tool that can show per-process disk usage and can also log to a file over time.
atop
Look for the DSK section for device-level load, and (in some versions) per-process I/O statistics. You can:
- Navigate with arrow keys
- Press d (depending on version) for disk-specific views
It’s especially useful on servers for long-term performance analysis (when used with its logging mode).
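A minimal example of that logging mode (the interval and file path are just examples):
atop -w /var/tmp/atop.raw 60 # record a sample every 60 seconds to a raw file
atop -r /var/tmp/atop.raw # browse the recorded samples later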
Finding I/O-Heavy Processes with iotop and pidstat
When you know disks are busy, the next question is “who is causing this?”
iotop
iotop shows I/O usage by process/thread. You usually need root (or sudo) to see full details.
Install from your distro repo, then:
sudo iotop
or, for a more typical mode:
sudo iotop -o
Key options:
- -o – show only processes/threads currently doing I/O
- -b – batch mode (for logging to a file)
- -d 2 – update interval in seconds
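For example, to log only I/O-active processes to a file for later review (the interval, sample count, and file name are just examples):
sudo iotop -obt -d 2 -n 30 > iotop-capture.log # 30 timestamped batch samples, 2 seconds apart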
Important columns (names may vary slightly):
- DISK READ, DISK WRITE – instantaneous KB/s per process
- SWAPIN – percentage of time the process is waiting on swap I/O
- IO – percentage of time the process is waiting on I/O (general)
Use it to identify:
- Backup tools, tar, rsync jobs
- Databases doing heavy writes
- Misbehaving applications endlessly reading/writing logs
pidstat (from sysstat)
pidstat can show per-process I/O over time. For example:
pidstat -d 2
Columns (may vary):
- kB_rd/s, kB_wr/s – Kilobytes read/written per second
- kB_ccwr/s – Kilobytes of “cancelled” write-back (rarely needed at beginner level)
You can also monitor a single process:
pidstat -d -p <PID> 1
This is handy when you already suspect one application and want to quantify its disk usage.
Checking Space Usage: df and du
Performance and capacity are linked: a nearly full disk can slow the system down and cause failures.
df: filesystem-level usage
df -h
Shows:
- Size, used, available space
- Use% for each filesystem
- Mount points
Points to watch:
- Keep critical filesystems (like /, /var, databases, and log partitions) well below 90–95% full.
- Some filesystems degrade in performance near full capacity.
You can filter specific filesystems:
df -h /var
df -h /home
du: directory-level usage
To find which directories use the most space:
du -sh * # in current directory
du -sh /var/* # biggest users in /var
Options:
- -s – summarize each argument
- -h – human-readable
- --max-depth=1 – one level of subdirectories
Example:
sudo du -h --max-depth=1 /var | sort -h
This helps track down:
- Huge log directories (e.g. /var/log)
- Cache directories
- Data folders growing unexpectedly
Monitoring Inodes
On some filesystems, you can run out of inodes (maximum number of files) even if there is free space.
Check inode usage:
df -i
Look at the IUse% column. A filesystem that is 100% full on inodes cannot create more files, even if Use% (space) is lower.
This is common when many small files are created (mail spools, caches, temporary files).
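To find where the inodes are going, one rough approach is to count files per directory on the affected filesystem (assuming /var is the suspect; adjust the path and depth as needed):
sudo find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n | tail # file counts per second-level directory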
Device-Level Stats from /proc and /sys
Many monitoring tools read from /proc and /sys. You can inspect them directly for custom scripts.
/proc/diskstats
cat /proc/diskstats
Each line corresponds to a device or partition, with fields like:
- Reads completed
- Sectors read
- Time spent doing I/Os (ms)
- Similar stats for writes
These are cumulative counters since boot. Tools like iostat simply sample this repeatedly and compute per-second differences.
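As a minimal sketch of how that works, the following samples the counters twice and computes per-second read/write operations for one device (sda is just an example name; fields 4 and 8 are reads and writes completed, per the kernel's iostats documentation):
DEV=sda
r1=$(awk -v d="$DEV" '$3==d {print $4}' /proc/diskstats) # reads completed, first sample
w1=$(awk -v d="$DEV" '$3==d {print $8}' /proc/diskstats) # writes completed, first sample
sleep 5
r2=$(awk -v d="$DEV" '$3==d {print $4}' /proc/diskstats)
w2=$(awk -v d="$DEV" '$3==d {print $8}' /proc/diskstats)
echo "reads/s: $(( (r2 - r1) / 5 ))"
echo "writes/s: $(( (w2 - w1) / 5 ))"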
/sys/block
ls /sys/block
Per-device directories (e.g. /sys/block/sda) contain:
- queue/ – queue-related settings (scheduler, depth)
- stat – basic device stats
This is more advanced, but it’s useful to know it exists for deeper investigations or scripting.
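For example (sda is just an example device name):
cat /sys/block/sda/queue/scheduler # active I/O scheduler shown in [brackets]
cat /sys/block/sda/queue/nr_requests # request queue size limit
cat /sys/block/sda/stat # cumulative counters, like /proc/diskstats for this one device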
Historical I/O Data with sar
sar (also from sysstat) can collect and display historical disk and I/O metrics, if the sysstat service/cron is enabled.
To view historical device activity:
sar -d 1 3 # live, like iostat
sar -d -f /var/log/sysstat/sa10 # historical, file name varies by distro/date
Key columns:
- tps – transfers per second
- rkB/s, wkB/s – read/write KB/s
- await, %util – similar to iostat
This is particularly useful when:
- The slowdown happened in the past and is now gone.
- You want to correlate disk load with other metrics (CPU, network).
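For example, to look at a specific past time window and correlate disk activity with CPU and load (the file name and times are just examples):
sar -d -s 09:00:00 -e 10:30:00 -f /var/log/sysstat/sa10 # disk activity between 09:00 and 10:30
sar -u -q -s 09:00:00 -e 10:30:00 -f /var/log/sysstat/sa10 # CPU and run-queue for the same window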
Simple Workflows for Common Scenarios
Scenario 1: System feels slow, high load average
- Check if load is I/O-related:
  - top or uptime for load
  - vmstat 2 for bi/bo
- If I/O is active, use:
  - iostat -x 2 to see which device is saturated
  - df -h to ensure the filesystem is not full
- Use iotop -o or pidstat -d 2 to see which processes are causing heavy I/O.
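If you want to capture all of this at once during an incident for later analysis, a minimal sketch (intervals, counts, and file names are just examples):
iostat -x 2 30 > iostat.txt & # 30 samples, 2 seconds apart
vmstat 2 30 > vmstat.txt &
df -h > df.txt
sudo iotop -obt -d 2 -n 30 > iotop.txt
wait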
Scenario 2: Database or application latency spikes
- Run iostat -x 2 and watch await and %util on the disk(s) where data is stored.
- If these are high, use iotop / pidstat to see whether:
  - The database itself is doing heavy I/O
  - Another process (backup, log rotation, find job) is competing for the same disk
- Consider whether the data is on HDD vs SSD, and if the workload pattern (random vs sequential) is stressing the device.
Scenario 3: Disk unexpectedly full, causing failures
- df -h to find full or near-full filesystems.
- du -sh /* and then drill down: du -sh /var/*
- Identify growth in:
  - Log directories (/var/log)
  - Cache directories
  - Application-specific data directories
- Remove/rotate/compress data as appropriate, or move it to another volume.
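A couple of commands that often help here (the path and size threshold are just examples; make sure a log is no longer being written to before compressing or removing it):
sudo find /var -xdev -type f -size +100M -exec ls -lh {} + # list files larger than 100 MB
sudo gzip /var/log/old-app.log # compress an old, inactive log (example file name)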
Intro to I/O Performance Characteristics (HDD vs SSD)
Understanding rough differences helps interpret numbers:
- HDD (rotating disks):
- Low IOPS (hundreds, maybe low thousands)
- High seek time; random I/O is expensive
- await of 10–20 ms under load may be normal
- High queue depth quickly increases latency
- SSD / NVMe:
- Very high IOPS
- Very low latency (typically < 1 ms)
- If await regularly exceeds a few ms and %util is high, the device may be saturated or throttled
This means:
- The same await value has different implications depending on device type.
- HDDs saturate with far fewer concurrent operations than SSDs.
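To check whether a device is rotational (HDD) or not (SSD/NVMe):
lsblk -d -o NAME,ROTA # ROTA=1 means rotational (HDD), 0 means SSD/NVMe
cat /sys/block/sda/queue/rotational # same information for a single device (sda is an example)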
Basic Monitoring Tips and Practices
- Use interval-based tools (iostat -x 2, vmstat 2, dstat 1) when the issue is happening; one-off snapshots hide spikes.
- Combine device-level (iostat) and process-level (iotop, pidstat) views.
- Record a baseline on a “normal” day (see the sketch after this list):
  - What are typical await, %util, and IOPS under normal load?
- Watch out for:
- Cron jobs (backups, indexing) running at peak times
- Log files or temp directories growing without limits
- Database or VM images stored on slow or overloaded disks
- Integrate disk/I/O metrics into your system-wide monitoring (using whatever monitoring stack your environment uses).
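As a simple way to record the baseline mentioned above, capture an hour of samples on a quiet day (the interval, count, and file names are just examples):
iostat -x 60 60 > baseline-iostat-$(date +%F).txt & # one hour of 60-second samples
vmstat 60 60 > baseline-vmstat-$(date +%F).txt &
wait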
By combining these tools and metrics, you can quickly determine whether disk and I/O are your bottleneck, and identify the processes and filesystems involved.