Understanding Disk and I/O Bottlenecks
Disk and I/O optimization starts with identifying where time is being spent. At a high level, you care about:
- How busy the disks are (utilization)
- How long operations take (latency)
- How much data is moved (throughput)
- How many operations are happening (IOPS)
- Whether the workload is sequential or random, read-heavy or write-heavy
These will inform whether you have a disk bottleneck and what type.
Key Metrics and Concepts
- Latency: Time an I/O takes to complete. Measured as:
  - Average completion time (e.g. `r_await`, `w_await` in ms)
  - Tail latency (e.g. 95th, 99th percentile in some tools)
- Throughput: Data per second, typically MB/s.
- IOPS: I/O operations per second, often more relevant than MB/s for small random I/O.
- Queue depth: How many I/O requests are in-flight or waiting. High queue depth + high latency = potential bottleneck.
- Utilization: How much of the time the device is busy (e.g. `%util` or `util`).
General goals:
- Keep latency acceptable for your workload.
- Avoid sustained 100% utilization.
- Match your storage layout and kernel settings to the access pattern.
Measuring Disk and I/O Performance
Using `iostat`
The iostat tool (from sysstat) gives a core view of device-level I/O:
iostat -xz 1
Pay attention to:
- `r/s`, `w/s` — read/write IOPS
- `rkB/s`, `wkB/s` — read/write throughput
- `r_await`, `w_await` — average latency (ms) per read/write
- `svctm` or `r_await`/`w_await` — service time (interpretation varies by sysstat version)
- `%util` — estimated percentage of time the device is busy
Typical patterns:
- High `%util` (near 100%) and high `r_await`/`w_await`: the disk is saturated.
- Moderate `%util` but very high `await`: latency may be dominated by something else (e.g. RAID, networked storage, contention in upper layers).
Using `pidstat` and `iotop` for Per-Process I/O
To know who is causing I/O:
pidstat -d 1
Columns of interest:
- `kB_rd/s`, `kB_wr/s`, `kB_ccwr/s` (cancelled writes, e.g. rewrites)
- `iodelay` — rough delay in ticks due to I/O for each process
iotop (requires root and a kernel with I/O accounting):
iotop -oPa
- `-o` — only show processes doing I/O
- `-P` — show per-process, not per-thread
- `-a` — accumulated I/O
Use these to identify top offenders (databases, loggers, backup jobs, etc.).
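The kernel also exposes these per-process counters directly under /proc, which can be a quick check when iotop is not installed (replace <pid> with the process of interest):
# Cumulative bytes read/written (and cancelled writes) for one process
cat /proc/<pid>/io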
Using `blktrace`/`btt` and `blocktop` (Deep Dive)
For low-level insight into block I/O behavior:
- `blktrace` provides per-request tracing at the block layer.
- `btt` (Block Trace Tools) summarizes patterns: queue times, merge rates, etc.
- `blocktop` (similar to `top` for block devices) shows active I/O.
These are more advanced, useful when other metrics already show a bottleneck but the cause is unclear.
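A typical capture-and-summarize workflow looks roughly like the following; a sketch assuming the blktrace package is installed and /dev/sda is the device under investigation:
# Trace block-layer events on /dev/sda for about 30 seconds
sudo blktrace -d /dev/sda -w 30 -o sda_trace
# Merge the per-CPU trace files into one binary dump, then summarize queue and service times
blkparse -i sda_trace -d sda_trace.bin > /dev/null
btt -i sda_trace.bin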
`perf` for I/O-Related Stalls
perf can show where threads spend time:
perf record -g -p <pid>
perf report
Look for:
- High time in `sys_read`, `sys_write`, `vfs_read`, etc.
- Time spent in filesystem/driver functions.
- High time in `schedule`, `io_schedule` (indicating waits).
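Beyond sampling call stacks, perf can also record block-layer tracepoints, which helps correlate stalls with actual request activity; a hedged sketch, assuming these tracepoints are available on your kernel:
# Record block request issue/completion events system-wide for 10 seconds
sudo perf record -e block:block_rq_issue -e block:block_rq_complete -a -- sleep 10
# Dump the raw events (device, sector, size) for inspection
sudo perf script | less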
Filesystem-Level Considerations
Assume basic filesystem concepts and types are already covered elsewhere; focus here is on performance-relevant choices.
Choosing a Filesystem for the Workload
Some general tendencies:
- EXT4: Default on many distros; good general-purpose, low overhead.
- XFS: Often better for large files, parallel I/O, and large systems.
- Btrfs: Copy-on-write with advanced features; good for snapshots/checksums, but workloads with heavy random writes can pay a performance price if misconfigured.
Use the filesystem that best matches:
- Many small files, metadata-heavy: tune directory indexing, inode density, journaling.
- Large sequential files (e.g. media, backups): align block sizes, consider XFS/EXT4 with big allocation units.
Mount Options that Affect Performance
Check current options:
mount | grep ' / '
Some common performance-related options (effect depends heavily on the filesystem):
- `noatime`, `nodiratime`: Disable updating access times on every read. On modern kernels, `relatime` is the default and often good enough, but on extremely read-heavy workloads `noatime` can still help.
- `barrier=0` or `nobarrier` (EXT4/XFS): Disables write barriers. Never do this without reliable power-loss protection (battery-backed RAID/UPS).
- `data=ordered`, `data=writeback`, `data=journal` (EXT3/EXT4): `ordered` is the default and a good safety/performance tradeoff; `writeback` gives higher performance with potentially less safety.
- `discard`: Online TRIM for SSDs. Convenient but can add latency; the alternative is periodic `fstrim`.
- `commit=<seconds>` (EXT4): Maximum interval between journal commits. Larger values reduce write amplification at the cost of more data lost on a crash.
Use /etc/fstab to set these consistently.
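For example, a hypothetical /etc/fstab entry for a dedicated data volume (the UUID, mount point, and commit interval are purely illustrative):
# <device>                                   <mount>  <type>  <options>                    <dump> <pass>
UUID=0a1b2c3d-1111-2222-3333-444455556666    /data    ext4    defaults,noatime,commit=30   0      2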
Journaling and Metadata Tuning
- Journaling adds write overhead but improves integrity and recovery.
- For write-heavy benchmarks or where data integrity is handled at another layer (e.g. database with own WAL and full backups), you may experiment with:
- Less frequent commits.
- Tuning journal size where supported (e.g. XFS log size).
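For instance, on XFS the journal (log) size is normally chosen when the filesystem is created; a hedged example, with the device and size purely illustrative:
# Create an XFS filesystem with a 512 MB internal log (not easily changed afterwards)
sudo mkfs.xfs -l size=512m /dev/sdb1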
For metadata-heavy workloads:
- Ensure directory indexing (e.g. `dir_index` on EXT4) is enabled.
- Consider a layout that avoids overly large directories when possible.
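To check whether `dir_index` is active on an existing EXT4 filesystem and enable it if not (device name is illustrative):
# List features and look for dir_index; enable it if missing
sudo tune2fs -l /dev/sdb1 | grep -i features
sudo tune2fs -O dir_index /dev/sdb1
# Existing directories are only re-indexed by an offline e2fsck -fD pass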
Block Layer and I/O Scheduler Tuning
Understanding the I/O Scheduler
Modern kernels often use:
- `mq-deadline`: Fairness and latency guarantees; a good general default.
- `none` (noop-like): Minimal scheduling. Often used for NVMe devices where the hardware/driver does its own scheduling.
- `bfq`: Budget Fair Queueing, good for interactive desktops or mixed workloads.
Check and change scheduler for a device (legacy path, concept applies):
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
On systems with blk-mq, you might see no per-device scheduler file; configuration can be via kernel parameters or udev rules.
Guidance:
- HDDs: `mq-deadline` or `bfq`.
- SSDs/NVMe: `none` or `mq-deadline`, depending on benchmarks.
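To make the scheduler choice persistent across reboots, one common approach is a udev rule; a sketch, where the rule file name and match patterns are assumptions to adapt to your devices:
# /etc/udev/rules.d/60-iosched.rules
# Rotational disks (HDDs): use bfq
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
# NVMe devices: let the device and driver handle ordering
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"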
Read-Ahead Tuning
Read-ahead prefetches data ahead of the current request, benefitting sequential workloads and wasting I/O on highly random workloads.
See and set read-ahead (in 512-byte sectors):
blockdev --getra /dev/sda
sudo blockdev --setra 4096 /dev/sda
Rules of thumb:
- Sequential-heavy workloads: increase read-ahead (e.g. 4–16 MB).
- Random I/O (e.g. small DB queries): keep it small; large read-ahead mostly adds overhead.
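The same setting is also exposed in kilobytes via sysfs, which can be easier to reason about (2048 KB here matches the 4096-sector example above):
cat /sys/block/sda/queue/read_ahead_kb
echo 2048 | sudo tee /sys/block/sda/queue/read_ahead_kb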
Queue Depth
For some devices, you can tune depth:
cat /sys/block/sda/queue/nr_requests
echo 1024 | sudo tee /sys/block/sda/queue/nr_requests
- Higher queue depth can improve throughput under heavy load.
- Too high can increase latency and worsen behavior under contention.
For NVMe, there are separate parameters (often configured via module options or at driver level).
Application-Level Strategies
Optimizing disk I/O is often more about changing what the application does than tweaking the kernel.
Aligning Access Patterns
- Batch small writes: combine many small writes into fewer larger ones.
- Use append-only patterns where possible (log-like) rather than random overwrites.
- Align I/O to filesystem/block sizes: misaligned writes can cause read-modify-write cycles, especially on RAID and SSDs.
Caching and Buffers
- Increase application-level caches (DB buffer pools, web server caches) to reduce I/O.
- Consider OS page cache behavior:
- Large sequential reads can evict useful cache; use `posix_fadvise` (from applications) or flags like `O_DIRECT` where appropriate and well understood.
Do not blindly set `O_DIRECT` at the application level; it bypasses the page cache and can reduce performance if the application is not designed for it.
Asynchronous and Parallel I/O
- Use async I/O when supported (e.g. `libaio`, `io_uring`, or a database's internal async I/O).
- Multiple I/O threads or worker processes can increase throughput, but:
- Too many can raise contention and queue depth excessively.
- Balance concurrency against latency requirements.
Sync Behavior: `fsync`, `fdatasync`
Many applications issue explicit sync calls:
- Excessive `fsync()` after every small write can destroy performance.
- Where allowed by correctness and durability requirements, batching transactions to reduce sync frequency can help dramatically.
For databases and critical services, always understand the consistency model before changing sync behavior.
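To see how often an application actually syncs, one option is to attach strace and count the calls; <pid> is a placeholder, and tracing adds overhead, so use this sparingly outside test environments:
# Count fsync/fdatasync calls issued by the process (and its threads); Ctrl-C detaches and prints a summary
sudo strace -f -e trace=fsync,fdatasync -c -p <pid>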
RAID, LVM, and Multi-Disk Layout
Assume the basics of RAID/LVM are covered elsewhere; here we focus on performance-related aspects.
Choosing RAID Levels for Performance
- RAID 0: High performance, no redundancy. Use only for scratch or where you accept loss.
- RAID 1: Mirroring. Good read performance (reads can be served from either disk); write performance is similar to a single disk.
- RAID 5/6: Adds parity; often poor for small random writes due to read-modify-write cycles.
- RAID 10: Good compromise for mixed workloads; high read and write performance and redundancy.
Workload-driven choices:
- Small random writes (DBs): prefer RAID 10 over RAID 5/6.
- Large sequential reads/writes (backups, media): RAID 5/6 may be acceptable.
Stripe Sizes, Alignment, and LVM
Ensure:
- Partition alignment to underlying physical/RAID stripe sizes.
- LVM extents aligned with RAID stripes where using both.
Misalignment leads to extra I/O operations per logical I/O, killing performance, especially with spinning disks and RAID.
Use tools like `parted -a optimal` and `lsblk -t` to verify alignment and topology.
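For example (device and partition number are illustrative):
# Verify the first partition is optimally aligned, then inspect the device topology
sudo parted /dev/sda align-check optimal 1
lsblk -t /dev/sda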
Splitting Workloads Across Devices
When possible:
- Put random I/O-heavy workloads (e.g. DB data files) on the fastest storage.
- Put sequential-heavy workloads (logs, backups, media) on separate devices or arrays.
- Separate read-heavy and write-heavy workloads when you can.
Even with a single physical pool, using separate logical volumes and filesystems can help you tune mount options and configuration per workload.
Optimizing SSD and NVMe Performance
Solid-state devices behave differently than HDDs; performance tuning must account for that.
TRIM and Garbage Collection
- TRIM informs SSDs which blocks are free, improving write performance and wear-leveling.
- Use periodic `fstrim`:
sudo fstrim -v /
Most distributions have a systemd timer for this (e.g. fstrim.timer).
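To confirm the timer is active, and enable it if your installation has not already done so:
systemctl status fstrim.timer
sudo systemctl enable --now fstrim.timer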
Online TRIM via the `discard` mount option may harm latency-sensitive workloads; benchmark and compare it with periodic TRIM instead.
Over-Provisioning
Leaving some capacity unpartitioned gives the controller more space for wear-leveling and garbage collection.
- Many enterprise SSDs already have internal over-provisioning.
- On cheaper consumer drives, not filling the disk to 100% (e.g. staying below 80–90%) helps maintain performance.
Avoiding Write Amplification
Write amplification is extra physical writing done by the SSD compared to logical writes.
To reduce it:
- Avoid unnecessary rewrites of the same data.
- Favor append or log-structured approaches.
- Use filesystem options that reduce metadata churn when safely possible.
- Maintain free space; a nearly full SSD suffers more from write amplification.
Caching, Swap, and the Page Cache
Disk I/O is tightly coupled with memory behavior. While deep memory tuning is covered elsewhere, this section focuses on its impact on I/O.
`vm.swappiness` and Swap Usage
`vm.swappiness` controls the kernel’s tendency to swap out anonymous memory in favor of page cache use. Range: 0–100.
- Lower values (e.g. `10`) make the kernel reluctant to swap.
- Higher values (e.g. `60`, often the default) allow more swapping.
Modify temporarily:
sudo sysctl vm.swappiness=10
Persistent change via /etc/sysctl.conf or a drop-in.
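A hypothetical drop-in (file name and value are illustrative):
# /etc/sysctl.d/99-swappiness.conf
vm.swappiness = 10
# Apply without a reboot:
# sudo sysctl --system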
Too aggressive swapping can cause swap thrashing, leading to heavy I/O and bad performance. Too little swapping can reduce page cache and hurt I/O performance for cacheable data.
Dirty Page Writeback Tuning
Dirty pages (modified cache) are eventually written to disk. The timing and amount affect latency and throughput.
Key parameters (values are percentages of total memory, unless otherwise documented):
- `vm.dirty_ratio`
- `vm.dirty_background_ratio`
- `vm.dirty_bytes`
- `vm.dirty_background_bytes`
You can choose either ratio-based or byte-based settings (bytes override ratios). For example:
sudo sysctl vm.dirty_ratio=20
sudo sysctl vm.dirty_background_ratio=5
- Larger dirty thresholds buffer more writes, improving throughput at the cost of larger write bursts and longer flush times.
- Smaller thresholds write back more smoothly but more frequently.
Tuning is workload-specific; test under realistic load.
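If you prefer absolute limits, the byte-based knobs work the same way (values are illustrative and, once set, override the ratio-based settings):
# Start background writeback at 256 MB of dirty data; block writers at 1 GB
sudo sysctl vm.dirty_background_bytes=268435456
sudo sysctl vm.dirty_bytes=1073741824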
Dropping Caches (Testing Only)
To test performance of I/O without cached effects, you can instruct the kernel to drop caches:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
Modes (write a number):
- `1` — page cache
- `2` — dentries and inodes
- `3` — both
Use only for testing/benchmarking; not as a “performance fix” in production.
Network and Remote Storage I/O
For NFS, iSCSI, and other networked storage, latency and throughput depend also on the network path.
Key focus areas:
- MTU size, offload settings, and NIC queues.
- NFS mount options like `rsize`, `wsize`, `async`, `noatime`, `nfsvers`.
- iSCSI tuning (queue depth, multipathing).
The goal is the same: match I/O patterns and protocol settings to the workload, minimize unnecessary round trips, and ensure the network is not the bottleneck.
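As an illustration, a hypothetical NFS mount with explicit performance-related options (server, export path, and sizes are placeholders to benchmark, not recommendations):
sudo mount -t nfs -o nfsvers=4.2,rsize=1048576,wsize=1048576,noatime server:/export /mnt/data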
Benchmarking and Validation
Always measure before and after changes.
Synthetic Benchmarks
Tools like fio let you model realistic workloads:
fio --name=randread --filename=/mnt/testfile \
    --rw=randread --bs=4k --size=1G --iodepth=32 --numjobs=4 --direct=1
Key parameters:
- `--rw` — type (`read`, `write`, `randread`, `randwrite`, `randrw`, `readwrite`, etc.)
- `--bs` — block size
- `--iodepth` — queue depth
- `--numjobs` — parallel jobs
- `--direct=1` — bypass the page cache (for device-focused tests)
Use variations to approximate your real workload.
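For example, a hedged sketch of a mixed random read/write job that loosely approximates an OLTP-style pattern (all values are illustrative):
fio --name=oltp-mix --filename=/mnt/testfile --rw=randrw --rwmixread=70 \
    --bs=8k --size=2G --iodepth=16 --numjobs=2 --direct=1 \
    --runtime=60 --time_based --group_reporting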
Application-Level Benchmarks
Synthetic benchmarks are not enough. Validate with:
- Database benchmarks (pgbench, sysbench).
- File server benchmarks (e.g. `bonnie++`, custom scripts).
- Realistic load tests (web traffic, batch jobs).
Record:
- Latency distributions (p95, p99).
- Throughput and IOPS.
- CPU usage and `%util` on disks.
- Any regressions in tail latency after “optimizations”.
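One simple way to capture comparable numbers across runs is to record device statistics for the duration of each test (interval and count are illustrative):
# 1-second samples for 10 minutes, with readable device names
sar -d -p 1 600 > baseline-disk.txt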
Practical Optimization Workflow
A disciplined approach minimizes risk:
- Observe:
- Use `iostat`, `pidstat`, `iotop`, `vmstat`, `sar`, and `perf` to confirm disk I/O is the bottleneck.
- Characterize:
- Random vs sequential, read vs write, sync vs async, hot files vs cold, cache behavior.
- Quick wins:
- Fix obvious issues: `noatime` where safe, `fstrim` for SSDs, move noisy logs, avoid cron jobs colliding with peak load.
- Filesystem and mount tuning:
- Adjust mount options, read-ahead, scheduler, dirty ratios, etc.
- Application and data layout changes:
- Split workloads across devices, adjust caching and sync behavior, redesign access patterns.
- Validate:
- Benchmark before/after; track latency, throughput, CPU, and error/recovery behavior.
- Iterate:
- Change one thing at a time; keep documented profiles of “good” configurations.
By combining careful measurement, targeted tuning at each layer, and workload-aware design, you can significantly reduce disk and I/O bottlenecks and improve overall system performance.