Role of Storage in HPC Architectures
Storage systems in high performance computing connect the fast, volatile parts of a machine to long term, persistent data. In the memory hierarchy, registers, caches, and RAM hold data that the CPU is actively working on, while storage keeps data across reboots, user sessions, and job runs. For HPC, storage is not only about capacity. It must deliver sufficient throughput, low enough latency for the workload, and reliability at very large scales.
In this chapter, the focus is on storage devices and how they behave from a performance and reliability point of view. Higher level concepts such as parallel filesystems and cluster wide storage layers are treated in later chapters. Here, you will learn the basic building blocks that those systems are made from.
Types of Storage Media
At the hardware level, most HPC systems use a mix of two main storage technologies: spinning magnetic disks and solid state storage. Tape also plays a role, usually for long term archival storage.
Magnetic hard disk drives use spinning platters and mechanical read and write heads. They provide high capacity at relatively low cost, which makes them attractive for large storage pools. However, their seek time and rotational latency are high compared to memory and flash, and performance depends strongly on access patterns.
Solid state drives use flash memory with no moving parts. They have much lower access latency and much higher random I/O performance compared to disks. In HPC systems, SSDs are used both as fast local scratch space and as a component of high performance shared storage layers. They cost more per byte than disks, and have limited write endurance, so they are rarely used alone for very large capacity storage.
Magnetic tape is used in automated tape libraries as a very low cost, very high capacity storage medium. Tape has very high sequential throughput but very high access latency. It is not used directly by HPC applications, but cluster wide backup or archival systems may stage rarely accessed data out to tape.
From the point of view of HPC performance, the key distinction is that disk and tape exhibit strong penalties for random access and seek heavy workloads, whereas SSDs perform much better for random I/O. This difference influences how you should structure data access in your applications.
Latency, Bandwidth, and IOPS
Storage performance is described with several related metrics. Latency is the time between issuing an I/O request and receiving the first byte of data. Bandwidth, or throughput, is the sustained rate at which data can be transferred once the transfer is underway. Input/output operations per second (IOPS) counts how many I/O requests a device can handle per unit time.
For a single I/O request of size $S$ bytes, with device latency $L$ seconds and streaming bandwidth $B$ bytes per second, an idealized access time $T$ is
$$T = L + \frac{S}{B}.$$
For small I/O sizes, latency dominates. For large I/O sizes, bandwidth dominates. Effective storage use in HPC tries to make individual I/O operations large enough that the $\frac{S}{B}$ term is much larger than $L$.
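To make the tradeoff concrete, the following sketch evaluates $T$ for a small and a large request using illustrative, not measured, device parameters for a hard disk and an SSD:

```python
# Worked instance of T = L + S/B. The latency and bandwidth figures are
# illustrative assumptions, not measurements of any particular device.
def access_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    return latency_s + size_bytes / bandwidth_bytes_per_s

for size in (4 * 1024, 64 * 1024**2):              # 4 KiB vs 64 MiB request
    hdd = access_time(size, 5e-3, 200e6)           # ~5 ms seek, 200 MB/s
    ssd = access_time(size, 100e-6, 3e9)           # ~100 us, 3 GB/s
    print(f"{size:>10} B   HDD {hdd*1e3:8.2f} ms   SSD {ssd*1e3:8.2f} ms")
```

For the 4 KiB request, the disk's 5 ms latency accounts for nearly all of $T$, while for the 64 MiB request the $\frac{S}{B}$ term dominates on both devices.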
Hard disks have relatively high $L$ and good $B$ for large sequential transfers, but poor IOPS for many small random requests. SSDs have much lower $L$ and higher IOPS, and their $B$ is often limited by the interface, such as SATA or PCIe, rather than the flash chips. Tape has very high $B$ when streaming, but $L$ can be many seconds while a robot mounts a tape and the tape seeks to the correct position.
For HPC applications that read or write large arrays or fields, maximizing sustained bandwidth is often more important than minimizing the latency of individual requests. The storage system must also support many jobs at once, so aggregate bandwidth across many devices and links is critical.
Local vs Networked Storage
Each compute node in a cluster can have storage attached directly, for example a local SSD or disk. At the same time, most clusters also have networked storage that is accessible from all nodes. Local storage is usually faster and less contended but is not shared, while networked storage is shared and persistent across nodes and jobs.
Locally attached storage, such as NVMe SSDs on a PCIe bus or SATA disks inside the node chassis, offers high bandwidth and low latency since data does not travel across the cluster network. However, data stored locally is tied to that specific node. If your job runs on a different node, it will not see the previous job’s local files. This makes local storage ideal for temporary scratch data that can be regenerated, but not for long term shared datasets.
Networked storage uses protocols such as NFS, parallel filesystem clients, or block storage layers over InfiniBand or Ethernet to expose a remote storage server or storage appliance as if it were local. The performance of networked storage depends on the storage hardware and on the interconnect. It is usually slower and more variable than local SSDs on a single node, but it supports sharing and centralized management, and can aggregate many devices for very high total capacity and bandwidth.
HPC systems often combine both types. A typical pattern is to stage input data from networked storage to local storage at the beginning of a job, perform the main computation using local I/O, then write final results back to shared storage at the end.
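A minimal sketch of this staging pattern follows. The paths and the `run_simulation` function are placeholders; on real clusters the scheduler usually exposes a per-job scratch directory through an environment variable, so consult your center's documentation.

```python
# Stage-in / compute / stage-out sketch. All paths and the SCRATCH handling
# are assumptions for illustration only.
import os
import shutil

shared_in  = "/project/mygroup/input.dat"        # networked project storage
shared_out = "/project/mygroup/results.dat"
scratch    = os.environ.get("SCRATCH", "/tmp")   # node-local scratch area

local_in  = os.path.join(scratch, "input.dat")
local_out = os.path.join(scratch, "results.dat")

def run_simulation(inp, outp):
    shutil.copy(inp, outp)                       # stand-in for the real computation

shutil.copy(shared_in, local_in)     # stage in once over the network
run_simulation(local_in, local_out)  # main computation uses fast local I/O
shutil.copy(local_out, shared_out)   # stage final results back to shared storage
```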
Block Devices and Filesystems
Storage devices present themselves to the operating system as block devices. A block device provides access to fixed size blocks, such as 512 bytes or 4 kilobytes, which can be read and written independently. On top of block devices, the operating system builds filesystems that provide files, directories, permissions, and other abstractions.
From an HPC programmer’s perspective, the main implication of block based storage is that physical I/O actually happens in units of blocks, even if your code requests smaller reads or writes. Small logical I/O calls may cause the system to read or write full blocks anyway, which adds overhead. Therefore, many HPC I/O libraries and file formats are designed to work with large, aligned blocks of data that map efficiently to the underlying device.
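As an illustration, the sketch below accumulates many small logical records in memory and issues a single large, block-aligned write. The record and block sizes are assumptions for the example, and Python's own buffered I/O already performs some coalescing behind the scenes; the point is the pattern, not the specific numbers.

```python
# Coalesce many tiny logical records into one large, block-aligned write.
import io

BLOCK = 4096                               # assumed device block size
buf = io.BytesIO()

for i in range(100_000):                   # many small 8-byte records
    buf.write(i.to_bytes(8, "little"))

data = buf.getvalue()
pad = (-len(data)) % BLOCK                 # pad up to a block boundary
with open("records.bin", "wb") as f:       # hypothetical output file
    f.write(data + b"\x00" * pad)          # one large, aligned physical write
```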
More advanced storage systems may layer features such as software RAID, encryption, compression, or logical volume management between the raw device and the filesystem. These layers can improve performance, reliability, or flexibility, but they also introduce additional behavior such as caching and write buffering that can influence application level I/O patterns.
RAID and Reliability
At the scale of HPC storage, device failures are common events rather than rare exceptions. To cope with this, storage systems use redundancy techniques. The most common building blocks are RAID configurations, which group multiple devices together to provide fault tolerance and sometimes better performance.
Simple striping across devices, often labeled RAID 0, distributes successive blocks across multiple disks. This increases bandwidth but provides no redundancy. Mirroring, such as RAID 1, stores identical copies on two or more disks so that a single disk failure does not lose data, at the cost of extra capacity.
Parity based schemes, such as RAID 5 or RAID 6, store parity blocks that can be used to reconstruct data from a failed device. These configurations can tolerate one or more disk failures while using less extra capacity than mirroring. However, parity calculations and reconstruction add write overhead and rebuild time.
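The parity idea can be illustrated with XOR, the operation at the heart of RAID 5. The toy sketch below works on short byte strings rather than real device stripes:

```python
# Toy XOR parity: the parity block reconstructs any single lost data block.
def xor_blocks(*blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"    # data blocks on three disks
parity = xor_blocks(d0, d1, d2)           # parity block on a fourth disk

# If the disk holding d1 fails, XOR of the survivors recovers its contents.
assert xor_blocks(d0, d2, parity) == d1
```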
At larger scales, advanced erasure coding schemes generalize parity concepts and are used inside large storage appliances and parallel filesystems. Although end users rarely configure this directly, it affects performance behavior. Rebuilding lost data can consume bandwidth and reduce responsiveness for active jobs.
From your perspective as an HPC user, the key point is that storage systems are designed to survive component failures, but heavy redundancy and rebuild activity can slow down I/O. It is also important to remember that redundancy does not replace backups or good data management practices.
Caching, Buffers, and Writeback
Storage subsystems make extensive use of caching to reduce perceived latency and to smooth bursts of I/O. The operating system maintains a page cache in main memory that holds recently accessed file data and buffers pending writes. Storage controllers and devices also have their own internal caches.
When your application writes data, the write call often returns as soon as the data reaches the operating system buffer cache, not when it has been committed to physical media. The OS then flushes buffered data to disk or SSD in the background. This writeback behavior improves performance by allowing multiple small writes to be combined into larger, more efficient I/O operations.
Data in caches or buffers that has not been flushed to stable storage can be lost in a crash or power loss. For critical results, explicitly flush data and close files to force metadata and data to reach persistent storage.
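A minimal sketch, assuming a hypothetical checkpoint file: `flush()` empties the application-level buffer into the OS page cache, and `os.fsync()` asks the OS to commit the data to the device.

```python
import os

results = b"final simulation output"      # stands in for real result data

with open("checkpoint.dat", "wb") as f:   # hypothetical checkpoint file
    f.write(results)
    f.flush()                             # application buffer -> OS page cache
    os.fsync(f.fileno())                  # OS page cache -> physical device
```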
Similarly, reads can be served out of cache if the data is already in memory, either because your process accessed it recently or because another process did. This means that repeated tests that read the same file may see much higher apparent performance than first time reads from cold storage.
For performance evaluation and scaling studies, it is important to distinguish cached performance from true storage performance. Many benchmarking tools offer options to bypass caches or to drop page cache contents between runs, in order to measure the capabilities of the underlying device or filesystem.
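On Linux, one per-file approach is `posix_fadvise` with `POSIX_FADV_DONTNEED`, which asks the kernel to drop clean cached pages for a file between runs; a system-wide drop requires root privileges (writing to `/proc/sys/vm/drop_caches`). The file name below is a placeholder, and the call is a hint to the kernel, not a guarantee:

```python
# Reduce page-cache effects for one file between benchmark runs (Linux).
import os

def drop_file_cache(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        os.sync()                          # flush dirty pages system-wide first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

drop_file_cache("testfile.bin")            # hypothetical benchmark file
```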
Local Scratch, Home, and Project Storage
On most HPC clusters, storage is logically divided into different areas that are backed by different hardware and tuned for different tradeoffs among capacity, speed, and reliability. Even if the specifics vary between centers, there are common patterns.
User home directories are typically stored on networked storage with strong redundancy and backup policies. They have limited quotas and are intended for configuration files, code, and small input data. Performance is adequate but not optimized for large scale bulk I/O.
Scratch or temporary storage areas are designed for high throughput and large short term data. They may reside on fast parallel storage or on local SSDs attached to compute nodes. These areas often have automatic purging policies and are not backed up. Applications that generate large intermediate files should use scratch storage instead of home directories.
Shared project or work spaces provide larger quotas and may be tuned for either capacity or performance, depending on the institution. These spaces are intended for shared datasets used by multiple users or teams.
Although this logical layout is a policy decision rather than a hardware feature, it directly reflects the underlying storage systems. Understanding the differences helps you choose where to place your data for the best mix of performance and safety.
I/O Patterns and Their Impact on Storage
How an application reads and writes data is as important as the raw speed of the storage devices. Storage systems respond very differently to sequential and random access patterns, to large or small operations, and to concurrent access from many processes.
Sequential I/O reads or writes large contiguous regions of a file. This pattern enables underlying devices to issue long, efficient transfers and minimizes seeks, which is especially important for disks and tapes. Random I/O, in contrast, jumps around within a file or across many files, which leads to many small I/O operations and poor performance on seek sensitive devices.
Small I/O operations, for example a few kilobytes at a time, amplify the effects of latency and reduce effective bandwidth, as suggested by the equation for $T$. Large, contiguous I/O requests amortize the latency cost and can saturate the available bandwidth.
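The microbenchmark sketch below contrasts one pass of large sequential reads with the same data volume read as small random requests. The file name is a placeholder, and absolute results depend heavily on the device and on cache state, as discussed in the previous section:

```python
# Sequential vs random read sketch; assumes a large pre-existing test file.
import os
import random
import time

PATH = "testfile.bin"                      # hypothetical large input file
SIZE = os.path.getsize(PATH)
SMALL = 4 * 1024                           # 4 KiB random requests
N = SIZE // SMALL

t0 = time.perf_counter()
with open(PATH, "rb") as f:                # one pass of 8 MiB sequential reads
    while f.read(8 * 1024 * 1024):
        pass
seq = time.perf_counter() - t0

t0 = time.perf_counter()
with open(PATH, "rb") as f:                # same volume as small random reads
    for _ in range(N):
        f.seek(random.randrange(N) * SMALL)
        f.read(SMALL)
rnd = time.perf_counter() - t0

print(f"sequential: {SIZE/seq/1e6:.0f} MB/s   random 4 KiB: {SIZE/rnd/1e6:.0f} MB/s")
```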
In parallel programs, many ranks or threads may access storage concurrently. If each process performs tiny, unaligned reads and writes, the storage system must handle huge IOPS loads and may become a bottleneck. Collective I/O strategies, where processes cooperate to bundle data into fewer, larger requests, align well with how storage systems are built and usually perform better.
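A simplistic version of this idea gathers each rank's chunk to a single writer, which then issues one large contiguous write. The sketch below uses mpi4py and NumPy; real collective I/O layers such as MPI-IO or parallel HDF5 implement the same principle far more scalably, and the output file name is a placeholder:

```python
# Gather-then-write sketch; run with e.g. `mpirun -n 4 python gather_write.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1024, rank, dtype=np.float64)     # each rank's small chunk

# Instead of every rank issuing its own small write, gather the chunks
# to rank 0, which performs one large contiguous write.
chunks = comm.gather(local, root=0)
if rank == 0:
    np.concatenate(chunks).tofile("results.bin")  # hypothetical output file
```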
Designing data structures and output formats to favor large, sequential, and well organized I/O is one of the simplest and most effective ways to improve application performance without touching numerical algorithms.
Storage Hierarchies and Burst Buffers
Just as CPU memory is organized in a hierarchy from registers to caches to RAM, modern HPC systems also build hierarchies of storage. At the fastest level, some systems provide node local nonvolatile memory or SSD tiers that sit between RAM and the shared filesystem. At the slowest level, massive capacity disk based or tape based archives hold long term data.
Burst buffers are specialized high speed storage layers that absorb short periods of very high I/O demand from applications, then drain the data to slower, larger capacity backing storage. They are often implemented with arrays of SSDs connected through high bandwidth networks and managed by software that coordinates with the job scheduler.
This hierarchical organization allows the system to match different parts of application lifecycles to appropriate storage tiers. Simulation phases that produce checkpoints and diagnostic outputs can write quickly to the fast tier, while the storage system moves data down to slower tiers afterward, often without user intervention.
From a user point of view, burst buffers and similar layers may appear as special directories or filesystems, or may be integrated transparently into parallel filesystems. Later chapters that address parallel I/O and parallel filesystems will discuss how to use these layers effectively in multi node runs.
Summary
Storage systems in HPC connect persistent data to compute resources and are built from a mix of disks, SSDs, and sometimes tape, organized through block devices, filesystems, redundancy schemes, and hierarchies. Their performance is governed by latency, bandwidth, and IOPS, and is strongly influenced by application I/O patterns. Understanding the properties of local and networked storage, and the behaviors of caching and redundancy, is essential for choosing where to place data and how to structure access in order to make efficient use of large scale HPC systems.