
Disk and I/O optimization

Understanding Disk and I/O Bottlenecks

Disk and I/O optimization starts with identifying where time is being spent. At a high level, you care about:

  • Throughput (MB/s): how much data moves per second.
  • IOPS: how many I/O operations complete per second.
  • Latency: how long individual requests take, including time spent queued.
  • Utilization and queue depth: how busy the device is and how many requests are waiting.
  • Access pattern: random vs. sequential, read vs. write, request sizes.

These will inform whether you have a disk bottleneck and what type.

Key Metrics and Concepts

General goals:

  • Keep latency low for the requests your application actually waits on.
  • Achieve enough throughput and IOPS without saturating the device.
  • Avoid unnecessary I/O entirely through caching, batching, and better access patterns.

Measuring Disk and I/O Performance

Using `iostat`

The iostat tool (from the sysstat package) gives a device-level view of I/O:

iostat -xz 1

Pay attention to:

  • r/s, w/s and rkB/s, wkB/s: read/write request rates and throughput.
  • r_await, w_await: average time (ms) a request spends from issue to completion, including queueing.
  • aqu-sz: average queue size.
  • %util: fraction of time the device was busy.

Typical patterns:

  • High %util with high await: the device is saturated.
  • High await with low throughput: latency-bound, often random I/O on a slow device.
  • Low %util but the application is still slow: the bottleneck is likely elsewhere (CPU, locks, network).

Using `pidstat` and `iotop` for Per-Process I/O

To see which processes are generating I/O:

pidstat -d 1

iotop (requires root and a kernel with I/O accounting):

iotop -oPa

Use these to identify top offenders (databases, loggers, backup jobs, etc.).

Using `blktrace`/`btt` and `biotop` (Deep Dive)

For low-level insight into block I/O behavior:

  • blktrace records individual block-layer events (queue, dispatch, completion) per device.
  • btt post-processes blktrace output into latency histograms and timing breakdowns.
  • biotop (from the BCC tools) shows block I/O per process in a top-like view.

These are more advanced, useful when other metrics already show a bottleneck but the cause is unclear.
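
A typical tracing session might look like this, assuming the device of interest is /dev/sda (substitute your own):

# Trace /dev/sda for 30 seconds; writes trace.blktrace.* files to the current directory
sudo blktrace -d /dev/sda -o trace -w 30
# Convert the per-CPU traces into a binary dump, then summarize timings with btt
blkparse -i trace -d trace.bin > /dev/null
btt -i trace.bin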

`perf` for I/O-Related Stalls

perf can show where threads spend time:

perf record -g -p <pid>
perf report

Look for:

  • Time spent in io_schedule and related block/filesystem kernel paths (threads sleeping on I/O).
  • Heavy time in filesystem code (journaling, allocation) rather than in your application.
  • Lock contention around I/O submission.
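
Block-layer tracepoints can also be recorded directly; a sketch, assuming your kernel exposes the standard block tracepoints:

# Record block request issue/completion events system-wide for 10 seconds
sudo perf record -e block:block_rq_issue -e block:block_rq_complete -a -- sleep 10
sudo perf script | head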

Filesystem-Level Considerations

Assume basic filesystem concepts and types are already covered elsewhere; focus here is on performance-relevant choices.

Choosing a Filesystem for the Workload

Some general tendencies:

  • ext4: solid all-round performance, mature, a good default for general workloads.
  • XFS: scales well with large files, high concurrency, and parallel I/O; common for servers and databases.
  • Btrfs/ZFS: rich features (snapshots, checksums, compression) that cost some performance, especially for random writes, due to copy-on-write.

Use the filesystem that best matches:

  • Typical file sizes and counts.
  • Concurrency (number of threads or processes doing I/O).
  • Metadata intensity (many small files vs. few large ones).
  • Feature needs (snapshots, compression, quotas) vs. raw speed.

Mount Options that Affect Performance

Check current options:

mount | grep ' / '

Some common performance-related options (effect depends heavily on FS):

  • noatime / relatime: avoid (or reduce) writing an access-time update on every read.
  • nodiratime: skip access-time updates for directories (implied by noatime).
  • discard: online TRIM for SSDs (see the TRIM section; periodic TRIM is often preferable).
  • data=ordered / data=writeback (ext4): journaling-mode trade-offs between safety and write performance.
  • barrier / nobarrier: write barriers protect against corruption on power loss; disabling them is only safe with battery-backed caches.

Use /etc/fstab to set these consistently.
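
For example, a hypothetical /etc/fstab entry for a data volume (the UUID and mount point are placeholders):

# <device>      <mount>  <fs>   <options>           <dump>  <pass>
UUID=xxxx-xxxx  /data    ext4   defaults,noatime    0       2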

Journaling and Metadata Tuning

For metadata-heavy workloads:

  • Consider the journaling mode (on ext4, data=ordered vs. data=writeback) and the journal size.
  • An external journal on a separate fast device can take journal writes out of the data path.
  • On XFS, log placement and log buffer size (the logbsize mount option) matter under heavy metadata churn.
  • Reducing atime updates (noatime) eliminates a large class of metadata writes.

Block Layer and I/O Scheduler Tuning

Understanding the I/O Scheduler

Modern kernels often use:

  • none: no reordering; best for fast NVMe devices that handle queueing internally.
  • mq-deadline: a simple latency-oriented scheduler, a good default for SATA/SAS SSDs.
  • bfq: fairness-oriented, good for desktops and mixed interactive workloads.
  • kyber: low-overhead and latency-target-based, aimed at fast devices.

Check and change scheduler for a device (legacy path, concept applies):

cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

The set of schedulers offered depends on the kernel build and loaded modules; to make a choice persist across reboots, use kernel boot parameters or udev rules.
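
A sketch of such a udev rule (the file name and device matches are examples; adjust to your hardware):

# /etc/udev/rules.d/60-iosched.rules
# Use mq-deadline for SATA/SAS disks, none for NVMe
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"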

Guidance:

  • Fast NVMe: none is often best; the device reorders internally.
  • SATA/SAS SSDs: mq-deadline is a sensible default.
  • HDDs and interactive desktops: bfq can improve perceived responsiveness.
  • Always benchmark with your own workload; scheduler effects are workload-dependent.

Read-Ahead Tuning

Read-ahead prefetches data ahead of the current request, benefitting sequential workloads and wasting I/O on highly random workloads.

See and set read-ahead (in 512-byte sectors):

blockdev --getra /dev/sda
sudo blockdev --setra 4096 /dev/sda

Rules of thumb:

  • Increase read-ahead (e.g., 4096–8192 sectors) for large sequential reads such as streaming, backups, and scans.
  • Decrease it (e.g., 128–256 sectors) for purely random workloads such as OLTP databases.
  • Re-measure after changing; the common default (256 sectors = 128 KiB) is a compromise.

Queue Depth

For some devices, you can tune depth:

cat /sys/block/sda/queue/nr_requests
echo 1024 | sudo tee /sys/block/sda/queue/nr_requests

For NVMe, there are separate parameters (often configured via module options or at driver level).
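
For example, the nvme driver's queue depth can be set at boot; a sketch, and worth verifying against your kernel's documentation:

# Kernel command line (e.g. via GRUB_CMDLINE_LINUX)
nvme.io_queue_depth=1024
# Current value, if the module exposes it:
cat /sys/module/nvme/parameters/io_queue_depth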

Application-Level Strategies

Optimizing disk I/O is often more about changing what the application does than tweaking the kernel.

Aligning Access Patterns

  • Prefer sequential access over random where possible (append-only logs, sorted batch processing).
  • Batch many small operations into fewer, larger ones.
  • Match application I/O sizes to the filesystem block size and device characteristics.
  • Keep hot data together to improve cache locality.

Caching and Buffers

  • Let the page cache do its job: buffered I/O is the right default for most applications.
  • Size application-level buffers so that writes are large and aligned.
  • Use posix_fadvise/madvise hints where supported to tell the kernel about access patterns.

Do not blindly set O_DIRECT at the application level; it bypasses the page cache and can reduce performance unless the application is designed for it (e.g., databases with their own buffer pools).
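
A quick way to feel the difference, assuming /mnt/testfile sits on the filesystem under test (the path is a placeholder):

# Buffered write: goes through the page cache, fsync at the end
dd if=/dev/zero of=/mnt/testfile bs=1M count=1024 conv=fsync
# Direct write: bypasses the page cache entirely
dd if=/dev/zero of=/mnt/testfile bs=1M count=1024 oflag=direct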

Asynchronous and Parallel I/O

  • Keep device queues busy: a single synchronous thread rarely saturates a modern SSD.
  • Use asynchronous interfaces (io_uring, libaio) or multiple I/O threads to overlap requests.
  • Tune concurrency to the device: too little leaves throughput unused, too much inflates latency.

Sync Behavior: `fsync`, `fdatasync`

Many applications issue explicit sync calls:

  • fsync(fd): flushes both data and metadata for a file to stable storage.
  • fdatasync(fd): flushes data and only the metadata needed to retrieve it; often cheaper.
  • sync()/syncfs(): flush everything (or one filesystem); coarse and expensive.

Frequent small fsyncs serialize on device flush latency; batching or group-committing writes can dramatically improve throughput.

For databases and critical services, always understand the consistency model before changing sync behavior.

RAID, LVM, and Multi-Disk Layout

Assume the basics of RAID/LVM are covered elsewhere; here we focus on performance-related aspects.

Choosing RAID Levels for Performance

Workload-driven choices:

  • RAID 0: best raw throughput and capacity, no redundancy.
  • RAID 1/10: good read performance with redundancy; RAID 10 is a strong default for databases.
  • RAID 5/6: capacity-efficient, but parity updates impose a read-modify-write penalty on small random writes.
  • Write-heavy random workloads generally favor RAID 10 over parity RAID.

Stripe Sizes, Alignment, and LVM

Ensure:

  • Partitions start on boundaries aligned with the RAID stripe and device erase blocks (modern tools default to 1 MiB alignment).
  • The filesystem is created with matching stripe parameters (e.g., mkfs.xfs su/sw, mkfs.ext4 -E stride,stripe-width).
  • LVM physical and logical volumes preserve that alignment.

Misalignment leads to extra I/O operations per logical I/O, killing performance, especially with spinning disks and RAID.

Use tools like parted -a optimal and lsblk -t to verify alignment and topology.
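
For example (the device name is a placeholder; partitioning is destructive, so this assumes an empty disk):

# Create an aligned GPT partition spanning the disk, then verify
sudo parted -a optimal /dev/sdb mklabel gpt mkpart primary 0% 100%
sudo parted /dev/sdb align-check optimal 1
lsblk -t /dev/sdb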

Splitting Workloads Across Devices

When possible:

  • Put sequential, write-heavy streams (logs, WAL, journals) on separate devices from random-access data.
  • Keep swap away from latency-critical data volumes.
  • Separate backup and scratch traffic from production I/O.

Even with a single physical pool, using separate logical volumes and filesystems can help you tune mount options and configuration per workload.
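
A sketch with LVM (volume group, names, and sizes are hypothetical):

# Separate logical volumes for data and write-ahead log
sudo lvcreate -n pgdata -L 200G vg0
sudo lvcreate -n pgwal  -L  20G vg0
sudo mkfs.xfs  /dev/vg0/pgdata
sudo mkfs.ext4 /dev/vg0/pgwal
# Each volume can now get its own mount options and tuning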

Optimizing SSD and NVMe Performance

Solid-state devices behave differently than HDDs; performance tuning must account for that.

TRIM and Garbage Collection

TRIM tells the SSD controller which logical blocks are no longer in use, so garbage collection can work efficiently. Trigger it manually on a mounted filesystem with:

sudo fstrim -v /

Most distributions have a systemd timer for this (e.g. fstrim.timer).
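
To check or enable it:

systemctl status fstrim.timer
sudo systemctl enable --now fstrim.timer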

Online TRIM via mount discard may harm latency-sensitive workloads; benchmark and compare with periodic TRIM instead.

Over-Provisioning

Leaving some capacity unpartitioned gives the controller more space for wear-leveling and garbage collection.

Avoiding Write Amplification

Write amplification is the extra physical writing an SSD performs compared to the logical writes issued by the host.

To reduce it:

  • Write in larger, aligned blocks rather than many small scattered writes.
  • Avoid unnecessary rewrites of the same data (e.g., overly aggressive in-place updates).
  • Keep free space available (over-provisioning) so garbage collection has room to work.
  • Ensure TRIM runs regularly so the controller knows which blocks are free.

Caching, Swap, and the Page Cache

Disk I/O is tightly coupled with memory behavior. While deep memory tuning is covered elsewhere, this section focuses on its impact on I/O.

`vm.swappiness` and Swap Usage

vm.swappiness controls the kernel’s tendency to swap out anonymous memory in favor of page cache use. Range: 0–100.

Modify temporarily:

sudo sysctl vm.swappiness=10

Persistent change via /etc/sysctl.conf or a drop-in.
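
For example, via a drop-in file (the file name is conventional, not mandatory):

echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl --system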

Too aggressive swapping can cause swap thrashing, leading to heavy I/O and bad performance. Too little swapping can reduce page cache and hurt I/O performance for cacheable data.

Dirty Page Writeback Tuning

Dirty pages (modified cache) are eventually written to disk. The timing and amount affect latency and throughput.

Key parameters (values are percentages of total memory, unless otherwise documented):

  • vm.dirty_background_ratio: dirty-page level at which background writeback starts.
  • vm.dirty_ratio: level at which processes generating writes are forced into synchronous writeback.
  • vm.dirty_background_bytes / vm.dirty_bytes: absolute-byte equivalents of the above.
  • vm.dirty_expire_centisecs: how old dirty data may get before it must be written back.
  • vm.dirty_writeback_centisecs: how often the writeback threads wake up.

You can choose either ratio-based or byte-based settings (setting bytes overrides the corresponding ratio). For example:

sudo sysctl vm.dirty_ratio=20
sudo sysctl vm.dirty_background_ratio=5
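
On large-memory machines, byte-based settings give finer control; a sketch with illustrative values:

sudo sysctl vm.dirty_bytes=1073741824            # 1 GiB hard limit
sudo sysctl vm.dirty_background_bytes=268435456  # 256 MiB background threshold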

Tuning is workload-specific; test under realistic load.

Dropping Caches (Testing Only)

To measure I/O performance without the influence of existing caches, you can instruct the kernel to drop them:

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

Modes (write a number):

  • 1: drop the page cache.
  • 2: drop dentries and inodes.
  • 3: drop both.

Use only for testing/benchmarking; not as a “performance fix” in production.

Network and Remote Storage I/O

For NFS, iSCSI, and other networked storage, latency and throughput depend also on the network path.

Key focus areas:

  • Network latency and bandwidth between client and storage; every synchronous I/O pays a round trip.
  • Protocol tuning: NFS rsize/wsize, mount options (hard/soft, async), protocol version; iSCSI queue depth and session settings.
  • MTU/jumbo frames and congestion on the storage path.
  • Client-side caching and attribute-cache behavior.

The goal is the same: match I/O patterns and protocol settings to the workload, minimize unnecessary round trips, and ensure the network is not the bottleneck.
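
For example, an NFS mount with explicit transfer sizes (server, export path, and values are placeholders to adapt):

sudo mount -t nfs -o rsize=1048576,wsize=1048576,hard,vers=4.2 \
    server:/export /mnt/nfs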

Benchmarking and Validation

Always measure before and after changes.

Synthetic Benchmarks

Tools like fio let you model realistic workloads:

fio --name=randread --filename=/mnt/testfile \
    --rw=randread --bs=4k --size=1G --ioengine=libaio \
    --iodepth=32 --numjobs=4 --direct=1

Key parameters:

  • --rw: access pattern (read, write, randread, randwrite, randrw).
  • --bs: block size per I/O.
  • --iodepth: outstanding requests per job (needs an async engine such as libaio or io_uring).
  • --numjobs: number of parallel workers.
  • --direct=1: bypass the page cache so you measure the device, not RAM.
  • --size / --runtime: how much data or how long to run.

Use variations to approximate your real workload.
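
For instance, a 70/30 read/write mix (parameters are illustrative):

fio --name=mixed --filename=/mnt/testfile \
    --rw=randrw --rwmixread=70 --bs=8k --size=1G \
    --ioengine=libaio --iodepth=16 --numjobs=2 \
    --direct=1 --runtime=60 --time_based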

Application-Level Benchmarks

Synthetic benchmarks are not enough. Validate with:

  • The real application under realistic load (production-like data sizes and concurrency).
  • Load-replay or integration test suites where available.

Record:

  • Latency percentiles (p50/p95/p99), not just averages.
  • Throughput and IOPS under load.
  • CPU and memory usage alongside I/O, to catch bottlenecks that merely shifted.

Practical Optimization Workflow

A disciplined approach minimizes risk:

  1. Observe:
    • Use iostat, pidstat, iotop, vmstat, sar, perf to confirm disk I/O is the bottleneck.
  2. Characterize:
    • Random vs sequential, read vs write, sync vs async, hot files vs cold, cache behavior.
  3. Quick wins:
    • Fix obvious issues: noatime where safe, fstrim for SSDs, move noisy logs, avoid cron jobs colliding with peak load.
  4. Filesystem and mount tuning:
    • Adjust mount options, read-ahead, scheduler, dirty ratios, etc.
  5. Application and data layout changes:
    • Split workloads across devices, adjust caching and sync behavior, redesign access patterns.
  6. Validate:
    • Benchmark before/after; track latency, throughput, CPU, and error/recovery behavior.
  7. Iterate:
    • Change one thing at a time; keep documented profiles of “good” configurations.

By combining careful measurement, targeted tuning at each layer, and workload-aware design, you can significantly reduce disk and I/O bottlenecks and improve overall system performance.
