Understanding Disk and I/O Bottlenecks
Disk and I/O optimization starts with identifying where time is being spent. At a high level, you care about:
- How busy the disks are (utilization)
- How long operations take (latency)
- How much data is moved (throughput)
- How many operations are happening (IOPS)
- Whether the workload is sequential or random, read-heavy or write-heavy
These will inform whether you have a disk bottleneck and what type.
Key Metrics and Concepts
- Latency: Time an I/O takes to complete. Measured as:
  - Average completion time (e.g. `r_await`, `w_await` in ms)
  - Tail latency (e.g. 95th, 99th percentile in some tools)
- Throughput: Data per second, typically MB/s.
- IOPS: I/O operations per second, often more relevant than MB/s for small random I/O.
- Queue depth: How many I/O requests are in-flight or waiting. High queue depth + high latency = potential bottleneck.
- Utilization: How much of the time the device is busy (e.g. `%util` or `util`).
General goals:
- Keep latency acceptable for your workload.
- Avoid sustained 100% utilization.
- Match your storage layout and kernel settings to the access pattern.
Measuring Disk and I/O Performance
Using `iostat`
The iostat tool (from sysstat) gives a core view of device-level I/O:
iostat -xz 1
Pay attention to:
- `r/s`, `w/s` — read/write IOPS
- `rkB/s`, `wkB/s` — read/write throughput
- `r_await`, `w_await` — average latency (ms) per read/write
- `svctm` or `r_await`/`w_await` — service time (interpretation varies by sysstat version)
- `%util` — estimated percentage of time the device is busy
Typical patterns:
- High `%util` (near 100%) and high `r_await`/`w_await`: the disk is saturated.
- Moderate `%util` but very high `await`: latency may be dominated by something else (e.g. RAID, networked storage, contention in upper layers).
Using `pidstat` and `iotop` for Per-Process I/O
To know who is causing I/O:
pidstat -d 1
Columns of interest:
- `kB_rd/s`, `kB_wr/s`, `kB_ccwr/s` (cancelled writes, e.g. rewrites)
- `iodelay` — rough delay in ticks due to I/O for each process
iotop (requires root and a kernel with I/O accounting):
iotop -oPa
- `-o` — only show processes doing I/O
- `-P` — show per-process, not per-thread
- `-a` — accumulated I/O
Use these to identify top offenders (databases, loggers, backup jobs, etc.).
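The kernel also exposes these per-process counters directly under /proc, which can be a quick check when iotop is not installed (replace <pid> with the process of interest):
# Cumulative bytes read/written (and cancelled writes) for one process
cat /proc/<pid>/io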
Using `blktrace`/`btt` and `blocktop` (Deep Dive)
For low-level insight into block I/O behavior:
- `blktrace` provides per-request tracing at the block layer.
- `btt` (Block Trace Tools) summarizes patterns: queue times, merge rates, etc.
- `blocktop` (similar to `top` for block devices) shows active I/O.
These are more advanced, useful when other metrics already show a bottleneck but the cause is unclear.
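A typical capture-and-summarize workflow looks roughly like the following; a sketch assuming the blktrace package is installed and /dev/sda is the device under investigation:
# Trace block-layer events on /dev/sda for about 30 seconds
sudo blktrace -d /dev/sda -w 30 -o sda_trace
# Merge the per-CPU trace files into one binary dump, then summarize queue and service times
blkparse -i sda_trace -d sda_trace.bin > /dev/null
btt -i sda_trace.bin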
`perf` for I/O-Related Stalls
perf can show where threads spend time:
perf record -g -p <pid>
perf report
Look for:
- High time in `sys_read`, `sys_write`, `vfs_read`, etc.
- Time spent in filesystem/driver functions.
- High time in `schedule`, `io_schedule` (indicating waits).
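Beyond sampling call stacks, perf can also record block-layer tracepoints, which helps correlate stalls with actual request activity; a hedged sketch, assuming these tracepoints are available on your kernel:
# Record block request issue/completion events system-wide for 10 seconds
sudo perf record -e block:block_rq_issue -e block:block_rq_complete -a -- sleep 10
# Dump the raw events (device, sector, size) for inspection
sudo perf script | less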
Filesystem-Level Considerations
Assume basic filesystem concepts and types are already covered elsewhere; focus here is on performance-relevant choices.
Choosing a Filesystem for the Workload
Some general tendencies:
- EXT4: Default on many distros; good general-purpose, low overhead.
- XFS: Often better for large files, parallel I/O, and large systems.
- Btrfs: Copy-on-write with advanced features; good for snapshots/checksums, but workloads with heavy random writes can pay a performance price if misconfigured.
Use the filesystem that best matches:
- Many small files, metadata-heavy: tune directory indexing, inode density, journaling.
- Large sequential files (e.g. media, backups): align block sizes, consider XFS/EXT4 with big allocation units.
Mount Options that Affect Performance
Check current options:
mount | grep ' / '
Some common performance-related options (effect depends heavily on the filesystem):
- `noatime`, `nodiratime`: Disable updating access times on every read. On modern kernels, `relatime` is the default and often good enough, but on extremely read-heavy workloads `noatime` can still help.
- `barrier=0` or `nobarrier` (EXT4/XFS): Disables write barriers. Never do this without reliable power-loss protection (battery-backed RAID/UPS).
- `data=ordered`, `data=writeback`, `data=journal` (EXT3/EXT4): `ordered` is the default and a good safety/performance tradeoff; `writeback` gives higher performance with potentially less safety.
- `discard`: Online TRIM for SSDs. Convenient but can add latency; the alternative is periodic `fstrim`.
- `commit=<seconds>` (EXT4): Maximum interval between journal commits. Larger values reduce write amplification at the cost of more data lost on a crash.
Use /etc/fstab to set these consistently.
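For example, a hypothetical /etc/fstab entry for a dedicated data volume (the UUID, mount point, and commit interval are purely illustrative):
# <device>                                   <mount>  <type>  <options>                    <dump> <pass>
UUID=0a1b2c3d-1111-2222-3333-444455556666    /data    ext4    defaults,noatime,commit=30   0      2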
Journaling and Metadata Tuning
- Journaling adds write overhead but improves integrity and recovery.
- For write-heavy benchmarks or where data integrity is handled at another layer (e.g. database with own WAL and full backups), you may experiment with:
- Less frequent commits.
- Tuning journal size where supported (e.g. XFS log size).
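For instance, on XFS the journal (log) size is normally chosen when the filesystem is created; a hedged example, with the device and size purely illustrative:
# Create an XFS filesystem with a 512 MB internal log (not easily changed afterwards)
sudo mkfs.xfs -l size=512m /dev/sdb1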
For metadata-heavy workloads:
- Ensure directory indexing (e.g. `dir_index` on EXT4) is enabled.
- Consider a layout that avoids overly large directories when possible.
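To check whether `dir_index` is active on an existing EXT4 filesystem and enable it if not (device name is illustrative):
# List features and look for dir_index; enable it if missing
sudo tune2fs -l /dev/sdb1 | grep -i features
sudo tune2fs -O dir_index /dev/sdb1
# Existing directories are only re-indexed by an offline e2fsck -fD pass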
Block Layer and I/O Scheduler Tuning
Understanding the I/O Scheduler
Modern kernels often use:
- `mq-deadline`: Fairness and latency guarantees; a good general default.
- `none` (noop-like): Minimal scheduling. Often used for NVMe devices where the hardware/driver does its own scheduling.
- `bfq`: Budget Fair Queueing, good for interactive desktops or mixed workloads.
Check and change scheduler for a device (legacy path, concept applies):
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
On systems with blk-mq, you might see no per-device scheduler file; configuration can be via kernel parameters or udev rules.
Guidance:
- HDDs: `mq-deadline` or `bfq`.
- SSDs/NVMe: `none` or `mq-deadline`, depending on benchmarks.
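To make the scheduler choice persistent across reboots, one common approach is a udev rule; a sketch, where the rule file name and match patterns are assumptions to adapt to your devices:
# /etc/udev/rules.d/60-iosched.rules
# Rotational disks (HDDs): use bfq
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
# NVMe devices: let the device and driver handle ordering
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"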
Read-Ahead Tuning
Read-ahead prefetches data ahead of the current request, benefitting sequential workloads and wasting I/O on highly random workloads.
See and set read-ahead (in 512-byte sectors):
blockdev --getra /dev/sda
sudo blockdev --setra 4096 /dev/sda
Rules of thumb:
- Sequential-heavy workloads: increase read-ahead (e.g. 4–16 MB).
- Random I/O (e.g. small DB queries): keep it small; large read-ahead mostly adds overhead.
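The same setting is also exposed in kilobytes via sysfs, which can be easier to reason about (2048 KB here matches the 4096-sector example above):
cat /sys/block/sda/queue/read_ahead_kb
echo 2048 | sudo tee /sys/block/sda/queue/read_ahead_kb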
Queue Depth
For some devices, you can tune depth:
cat /sys/block/sda/queue/nr_requests
echo 1024 | sudo tee /sys/block/sda/queue/nr_requests
- Higher queue depth can improve throughput under heavy load.
- Too high can increase latency and worsen behavior under contention.
For NVMe, there are separate parameters (often configured via module options or at driver level).
Application-Level Strategies
Optimizing disk I/O is often more about changing what the application does than tweaking the kernel.
Aligning Access Patterns
- Batch small writes: combine many small writes into fewer larger ones.
- Use append-only patterns where possible (log-like) rather than random overwrites.
- Align I/O to filesystem/block sizes: misaligned writes can cause read-modify-write cycles, especially on RAID and SSDs.
Caching and Buffers
- Increase application-level caches (DB buffer pools, web server caches) to reduce I/O.
- Consider OS page cache behavior:
- Large sequential reads can evict useful cache; use `posix_fadvise` (from applications) or flags like `O_DIRECT` where appropriate and well understood.
Do not blindly set `O_DIRECT` at the application level; it bypasses the page cache and can reduce performance if the application is not designed for it.
Asynchronous and Parallel I/O
- Use async I/O when supported (e.g. `libaio`, `io_uring`, or a database's internal async I/O).
- Multiple I/O threads or worker processes can increase throughput, but:
- Too many can raise contention and queue depth excessively.
- Balance concurrency against latency requirements.
Sync Behavior: `fsync`, `fdatasync`
Many applications issue explicit sync calls:
- Excessive `fsync()` after every small write can destroy performance.
- Where allowed by correctness and durability requirements, batching transactions to reduce sync frequency can help dramatically.
For databases and critical services, always understand the consistency model before changing sync behavior.
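To see how often an application actually syncs, one option is to attach strace and count the calls; <pid> is a placeholder, and tracing adds overhead, so use this sparingly outside test environments:
# Count fsync/fdatasync calls issued by the process (and its threads); Ctrl-C detaches and prints a summary
sudo strace -f -e trace=fsync,fdatasync -c -p <pid>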
RAID, LVM, and Multi-Disk Layout
Assume the basics of RAID/LVM are covered elsewhere; here we focus on performance-related aspects.
Choosing RAID Levels for Performance
- RAID 0: High performance, no redundancy. Use only for scratch or where you accept loss.
- RAID 1: Mirroring. Good read performance (reads can be served from either disk); write performance is similar to a single disk.
- RAID 5/6: Adds parity; often poor for small random writes due to read-modify-write cycles.
- RAID 10: Good compromise for mixed workloads; high read and write performance and redundancy.
Workload-driven choices:
- Small random writes (DBs): prefer RAID 10 over RAID 5/6.
- Large sequential reads/writes (backups, media): RAID 5/6 may be acceptable.
Stripe Sizes, Alignment, and LVM
Ensure:
- Partition alignment to underlying physical/RAID stripe sizes.
- LVM extents aligned with RAID stripes where using both.
Misalignment leads to extra I/O operations per logical I/O, killing performance, especially with spinning disks and RAID.
Use tools like `parted -a optimal` and `lsblk -t` to verify alignment and topology.
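For example (device and partition number are illustrative):
# Verify the first partition is optimally aligned, then inspect the device topology
sudo parted /dev/sda align-check optimal 1
lsblk -t /dev/sda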
Splitting Workloads Across Devices
When possible:
- Put random I/O-heavy workloads (e.g. DB data files) on the fastest storage.
- Put sequential-heavy workloads (logs, backups, media) on separate devices or arrays.
- Separate read-heavy and write-heavy workloads when you can.
Even with a single physical pool, using separate logical volumes and filesystems can help you tune mount options and configuration per workload.
Optimizing SSD and NVMe Performance
Solid-state devices behave differently than HDDs; performance tuning must account for that.
TRIM and Garbage Collection
- TRIM informs SSDs which blocks are free, improving write performance and wear-leveling.
- Use periodic `fstrim`:
sudo fstrim -v /
Most distributions have a systemd timer for this (e.g. fstrim.timer).
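To confirm the timer is active, and enable it if your installation has not already done so:
systemctl status fstrim.timer
sudo systemctl enable --now fstrim.timer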
Online TRIM via the `discard` mount option may harm latency-sensitive workloads; benchmark and compare it with periodic TRIM instead.
Over-Provisioning
Leaving some capacity unpartitioned gives the controller more space for wear-leveling and garbage collection.
- Many enterprise SSDs already have internal over-provisioning.
- On cheaper consumer drives, not filling the disk to 100% (e.g. staying below 80–90%) helps maintain performance.
Avoiding Write Amplification
Write amplification is extra physical writing done by the SSD compared to logical writes.
To reduce it:
- Avoid unnecessary rewrites of the same data.
- Favor append or log-structured approaches.
- Use filesystem options that reduce metadata churn when safely possible.
- Maintain free space; a nearly full SSD suffers more from write amplification.
Caching, Swap, and the Page Cache
Disk I/O is tightly coupled with memory behavior. While deep memory tuning is covered elsewhere, this section focuses on its impact on I/O.
`vm.swappiness` and Swap Usage
`vm.swappiness` controls the kernel’s tendency to swap out anonymous memory in favor of page cache use. Range: 0–100.
- Lower values (e.g. `10`) make the kernel reluctant to swap.
- Higher values (e.g. `60`, often the default) allow more swapping.
Modify temporarily:
sudo sysctl vm.swappiness=10
Persistent change via /etc/sysctl.conf or a drop-in.
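A hypothetical drop-in (file name and value are illustrative):
# /etc/sysctl.d/99-swappiness.conf
vm.swappiness = 10
# Apply without a reboot:
# sudo sysctl --system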
Too aggressive swapping can cause swap thrashing, leading to heavy I/O and bad performance. Too little swapping can reduce page cache and hurt I/O performance for cacheable data.
Dirty Page Writeback Tuning
Dirty pages (modified cache) are eventually written to disk. The timing and amount affect latency and throughput.
Key parameters (values are percentages of total memory, unless otherwise documented):
- `vm.dirty_ratio`
- `vm.dirty_background_ratio`
- `vm.dirty_bytes`
- `vm.dirty_background_bytes`
You can choose either ratio-based or byte-based settings (bytes override ratios). For example:
sudo sysctl vm.dirty_ratio=20
sudo sysctl vm.dirty_background_ratio=5
- Larger dirty thresholds buffer more writes, improving throughput at the cost of larger write bursts and longer flush times.
- Smaller thresholds write back more smoothly but more frequently.
Tuning is workload-specific; test under realistic load.
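If you prefer absolute limits, the byte-based knobs work the same way (values are illustrative and, once set, override the ratio-based settings):
# Start background writeback at 256 MB of dirty data; block writers at 1 GB
sudo sysctl vm.dirty_background_bytes=268435456
sudo sysctl vm.dirty_bytes=1073741824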
Dropping Caches (Testing Only)
To test performance of I/O without cached effects, you can instruct the kernel to drop caches:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
Modes (write a number):
- `1` — page cache
- `2` — dentries and inodes
- `3` — both
Use only for testing/benchmarking; not as a “performance fix” in production.
Network and Remote Storage I/O
For NFS, iSCSI, and other networked storage, latency and throughput depend also on the network path.
Key focus areas:
- MTU size, offload settings, and NIC queues.
- NFS mount options like `rsize`, `wsize`, `async`, `noatime`, `nfsvers`.
- iSCSI tuning (queue depth, multipathing).
The goal is the same: match I/O patterns and protocol settings to the workload, minimize unnecessary round trips, and ensure the network is not the bottleneck.
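As an illustration, a hypothetical NFS mount with explicit performance-related options (server, export path, and sizes are placeholders to benchmark, not recommendations):
sudo mount -t nfs -o nfsvers=4.2,rsize=1048576,wsize=1048576,noatime server:/export /mnt/data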
Benchmarking and Validation
Always measure before and after changes.
Synthetic Benchmarks
Tools like fio let you model realistic workloads:
fio --name=randread --filename=/mnt/testfile \
    --rw=randread --bs=4k --size=1G --iodepth=32 --numjobs=4 --direct=1
Key parameters:
- `--rw` — type (`read`, `write`, `randread`, `randwrite`, `randrw`, `readwrite`, etc.)
- `--bs` — block size
- `--iodepth` — queue depth
- `--numjobs` — parallel jobs
- `--direct=1` — bypass the page cache (for device-focused tests)
Use variations to approximate your real workload.
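For example, a hedged sketch of a mixed random read/write job that loosely approximates an OLTP-style pattern (all values are illustrative):
fio --name=oltp-mix --filename=/mnt/testfile --rw=randrw --rwmixread=70 \
    --bs=8k --size=2G --iodepth=16 --numjobs=2 --direct=1 \
    --runtime=60 --time_based --group_reporting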
Application-Level Benchmarks
Synthetic benchmarks are not enough. Validate with:
- Database benchmarks (pgbench, sysbench).
- File server benchmarks (e.g. `bonnie++`, custom scripts).
- Realistic load tests (web traffic, batch jobs).
Record:
- Latency distributions (p95, p99).
- Throughput and IOPS.
- CPU usage and `%util` on disks.
- Any regressions in tail latency after “optimizations”.
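One simple way to capture comparable numbers across runs is to record device statistics for the duration of each test (interval and count are illustrative):
# 1-second samples for 10 minutes, with readable device names
sar -d -p 1 600 > baseline-disk.txt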
Practical Optimization Workflow
A disciplined approach minimizes risk:
- Observe:
- Use `iostat`, `pidstat`, `iotop`, `vmstat`, `sar`, and `perf` to confirm disk I/O is the bottleneck.
- Characterize:
- Random vs sequential, read vs write, sync vs async, hot files vs cold, cache behavior.
- Quick wins:
- Fix obvious issues: `noatime` where safe, `fstrim` for SSDs, move noisy logs, avoid cron jobs colliding with peak load.
- Filesystem and mount tuning:
- Adjust mount options, read-ahead, scheduler, dirty ratios, etc.
- Application and data layout changes:
- Split workloads across devices, adjust caching and sync behavior, redesign access patterns.
- Validate:
- Benchmark before/after; track latency, throughput, CPU, and error/recovery behavior.
- Iterate:
- Change one thing at a time; keep documented profiles of “good” configurations.
By combining careful measurement, targeted tuning at each layer, and workload-aware design, you can significantly reduce disk and I/O bottlenecks and improve overall system performance.