14.1 Parallel I/O concepts

Understanding Parallel I/O in HPC

Parallel I/O is about reading and writing data efficiently when many processes or threads run at the same time on an HPC system. In large simulations and data analysis workloads, I/O can become the main bottleneck, even if the computation is highly optimized. This chapter focuses on the core ideas behind parallel I/O, how it differs from serial I/O, and the common models and patterns you will encounter on HPC systems.

Why Serial I/O Breaks Down

On a laptop or a small server, a typical program opens a file, reads or writes it through a single process, and closes it. In a parallel job with hundreds or thousands of processes, this simple approach no longer works well.

If every process tries to do its own independent serial I/O, several problems appear. All processes compete for the same underlying hardware, file systems become overloaded by many small and random accesses, and the volume of metadata operations, such as opening and closing files, grows quickly. At the same time, if only one process performs all I/O on behalf of the others, that process becomes a bottleneck and everyone else waits.

Parallel I/O techniques aim to avoid both extremes. They let many processes participate in reading and writing, but in a coordinated way that matches how parallel file systems and storage hardware are designed.

Parallel I/O is efficient only when many processes access data in a coordinated and structured way that matches the layout and capabilities of the underlying parallel file system.

Process-Level View of Parallel I/O

In a typical distributed-memory program, such as one based on MPI, you can think about I/O at the level of processes. Each process has some local memory and may be responsible for part of a larger data set. The conceptual question is always the same: how does a globally defined data structure, for example a large 3D array, map to many processes and to one or more files on disk?

Two basic models help to understand this mapping.

In the file-per-process model, each process writes to its own file. For example, rank 0 writes output_0.dat, rank 1 writes output_1.dat, and so on. This is simple to implement and avoids conflicts between processes, because every process has a private file. However, hundreds or thousands of small files are hard to manage, transfer, and post-process, and file systems are usually not optimized for huge numbers of tiny independent files.
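
To make this concrete, here is a minimal file-per-process sketch in C with MPI; the filename pattern, buffer size, and contents are illustrative, and error handling is omitted:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank holds some local data (placeholder values). */
    double local[1024];
    for (int i = 0; i < 1024; i++)
        local[i] = (double)rank;

    /* One private file per rank: simple, but produces one file per process. */
    char name[64];
    snprintf(name, sizeof(name), "output_%d.dat", rank);

    FILE *f = fopen(name, "wb");
    fwrite(local, sizeof(double), 1024, f);
    fclose(f);

    MPI_Finalize();
    return 0;
}
```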

In the shared file model, all processes cooperate to read or write a single logical file, or a small number of files. Each process accesses different regions of that file. This approach fits better with parallel file systems that distribute a single file across many storage servers. It also leads to datasets that are easier to manage. The challenge is that accesses must be correctly ordered and structured to avoid data corruption and performance collapse.
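
A corresponding shared-file sketch uses MPI I/O so that every rank writes a disjoint block of one file at an offset derived from its rank (again with illustrative sizes and names):

```c
#include <mpi.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local[COUNT];
    for (int i = 0; i < COUNT; i++)
        local[i] = (double)rank;

    /* All ranks open the same file; each writes a disjoint region. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
    MPI_File_write_at(fh, offset, local, COUNT, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```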

Parallel I/O libraries and interfaces provide mechanisms to implement one or both models, while hiding some of the complexity of the underlying file system.

Parallel File Systems and Data Striping

Parallel file systems store and serve files across multiple storage servers. A key concept for parallel I/O is striping. Instead of placing a file on a single disk, the file system splits it into stripes, contiguous chunks of bytes, and distributes those stripes across several storage targets.

If each storage target can sustain a bandwidth of $B$ and a file is striped across $k$ such targets, then in ideal conditions the maximum bandwidth for a large sequential access is approximately
$$
B_{\text{max}} \approx k \cdot B.
$$

This simplified model ignores overheads, but it captures why striping matters. When several processes read or write different parts of a large file, the requests are spread over many disks and servers, and aggregate bandwidth can scale.

In practice, the stripe count (the number of targets) and the stripe size (the number of bytes per stripe) have a strong impact on performance. Parallel I/O concepts and libraries help you shape access patterns so that stripes are used efficiently, for example by having processes request large, aligned chunks instead of many small scattered pieces.
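
As a sketch of how such settings might be requested from application code: the hints below, striping_factor and striping_unit, are understood by ROMIO-based MPI implementations on Lustre-like file systems, but they are advisory, system-dependent, apply only when the file is created, and the values are illustrative:

```c
#include <mpi.h>

/* Open a new file with a requested striping layout, assuming MPI is
 * already initialized. Hints are advisory; unsupported ones are
 * silently ignored by the library. */
MPI_File open_striped(const char *path)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");     /* stripe count  */
    MPI_Info_set(info, "striping_unit", "4194304"); /* 4 MiB stripes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```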

Good parallel I/O tries to align large, contiguous accesses with the stripe boundaries of the parallel file system to achieve near-ideal aggregate bandwidth.

I/O Models: Independent and Collective

Parallel I/O operations are often divided into independent and collective models. This classification does not refer to parallel computation itself, but to how processes coordinate their I/O calls.

In independent I/O, each process issues I/O requests on its own, with no explicit synchronization with other processes. For example, an MPI process can call a parallel I/O routine to write its local portion of an array at an offset computed from its rank, without informing the other ranks. This model is flexible and sometimes easier to reason about, because each process acts locally. However, from the perspective of the file system, the accesses may appear as many separate, small, and irregular requests, which reduces efficiency.

In collective I/O, processes cooperate to perform I/O. They call I/O routines collectively, meaning all processes in a communicator participate, and the library is free to combine, reorder, and optimize the underlying file system operations. Collective I/O can, for example, gather many small chunks of data from several processes into larger contiguous buffers, perform fewer and larger writes, and then distribute the results logically back to the processes.
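
Continuing the shared-file sketch above, switching from independent to collective I/O can be as small as changing one call; MPI_File_write_at_all must then be called by every rank that opened the file:

```c
/* Collective version of the earlier shared-file write: all ranks in the
 * communicator participate, which allows the library to merge many
 * small per-rank requests into fewer, larger file system operations. */
MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
MPI_File_write_at_all(fh, offset, local, COUNT, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);
```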

Collective I/O is particularly important for obtaining good performance from parallel file systems, because it allows the I/O library to shape the actual access pattern at the server side.

Use collective parallel I/O when possible. It allows libraries and file systems to transform many small accesses into a few large, efficient operations.

Logical Data Views and File Offsets

On disk, a file is just a sequence of bytes. To access it correctly from a parallel program, each process must know which portion of that sequence corresponds to its logical part of the data.

A key parallel I/O concept is the logical data view. Each process defines a mapping between its local memory representation and portions of the global file. For example, consider a two-dimensional array of size $N \times M$, logically stored row by row in a single file. Suppose there are $P$ processes, each responsible for $N / P$ consecutive rows (assuming $P$ divides $N$ evenly). If each row occupies $M \cdot s$ bytes, where $s$ is the size of one element, then process $p$ should start at a file offset of
$$
\text{offset}_p = p \cdot \frac{N}{P} \cdot M \cdot s.
$$

Basic parallel I/O can be implemented by computing such offsets manually and using low-level write operations with explicit offsets. More advanced parallel I/O interfaces allow you to describe complex, possibly noncontiguous patterns, such as writing every $k$-th column or a subarray in a higher-dimensional grid, without manual offset calculations.
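
As an illustration of such a described pattern, the following sketch uses an MPI subarray datatype as a file view to write the row-block decomposition from the offset formula above, without computing byte offsets by hand (the array sizes, filename, and helper function are illustrative):

```c
#include <mpi.h>

/* Sketch: each rank writes its block of rows of an N x M double array
 * into one shared file, using a file view instead of manual offsets.
 * Assumes MPI is initialized and P = nprocs divides N evenly. */
void write_row_block(const double *local, int N, int M,
                     int rank, int nprocs)
{
    int rows = N / nprocs;
    int gsizes[2] = { N, M };            /* global array shape      */
    int lsizes[2] = { rows, M };         /* this rank's local shape */
    int starts[2] = { rank * rows, 0 };  /* where the block begins  */

    MPI_Datatype view;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &view);
    MPI_Type_commit(&view);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* The view maps each rank's writes to its region of the file. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, view, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, rows * M, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&view);
}
```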

Logical views decouple how data is laid out in memory from how it is stored on disk. Parallel I/O libraries exploit this to perform optimizations, for example reading or writing in large contiguous chunks on disk while scattering or gathering data to or from noncontiguous regions in memory.

Two-Level I/O and Aggregation

A central idea in efficient parallel I/O is aggregation. Instead of letting all processes access the file system directly and independently, a subset of processes acts as aggregators. The other processes send their data to these aggregators, which then perform fewer and larger I/O operations.

You can view this as a two-level I/O scheme. At the first level, data moves between compute processes and aggregators over the communication network. At the second level, the aggregators perform the disk I/O. If both levels are tuned carefully, each can operate efficiently: network messages can be sized to match network characteristics, and I/O requests can be sized and aligned to match the stripes and caches of the parallel file system.

Aggregation is often used behind the scenes by collective parallel I/O routines. From the user's side, you might see only a single collective write call, but internally the library selects aggregators, redistributes the data among them, and issues optimized file operations.
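
In code, the aggregation layer is typically steered through hints rather than programmed directly. The names cb_nodes and cb_buffer_size below are collective-buffering hints recognized by ROMIO-based MPI implementations; other libraries may ignore them, and the values are illustrative:

```c
#include <mpi.h>

/* Sketch: request 16 aggregators with 16 MiB buffers for collective
 * buffering. Hints are advisory; unsupported ones are silently
 * ignored. Assumes MPI is already initialized. */
MPI_File open_with_aggregation(const char *path)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "16");             /* aggregator count */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* bytes per buffer */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```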

Aggregation reduces the number of small I/O operations by combining them into large, well aligned accesses, often through a subset of processes called aggregators.

Access Patterns and Their Impact

How processes access data in parallel is just as important as how much data they access. Several typical patterns have distinct performance characteristics.

In a regular block pattern, each process owns a contiguous block of the file, often corresponding to a block of the global domain. This pattern is friendly to parallel file systems, because each process can perform large contiguous reads or writes. Collective I/O can improve this further, but even independent I/O with careful offsets can perform reasonably well.

In a block cyclic or scattered pattern, each process owns many small, disjoint parts of the global data, interleaved with data owned by other processes. This can occur with certain domain decompositions or linear algebra distributions. A naive implementation where every small piece triggers an independent I/O call leads to severe inefficiency. Collective I/O and datatype descriptions can help, by gathering scattered pieces into larger I/O buffers.
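
For example, a cyclic row distribution can be described once with a derived datatype and written with a single collective call. In this sketch, N, P, rank, fh, and local_rows are assumed to be set up elsewhere in an MPI I/O program, and rank r owns rows r, r+P, r+2P, and so on of an N x M double array:

```c
/* A vector type used as the file view describes the scattered regions
 * once, so one collective call writes them all; a naive loop of small
 * independent writes would issue N/P separate requests per rank. */
MPI_Datatype filetype;
MPI_Type_vector(N / P,   /* number of blocks (rows owned by this rank) */
                M,       /* elements per block (one row)               */
                P * M,   /* stride between block starts, in elements   */
                MPI_DOUBLE, &filetype);
MPI_Type_commit(&filetype);

/* Rank r's first row starts M * r elements into the file. */
MPI_File_set_view(fh, (MPI_Offset)rank * M * sizeof(double),
                  MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
MPI_File_write_all(fh, local_rows, (N / P) * M, MPI_DOUBLE,
                   MPI_STATUS_IGNORE);
MPI_Type_free(&filetype);
```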

In a halo or ghost cell pattern, processes need to read or write additional layers of data around their own subdomain, for example for finite difference stencils. When this halo data is stored contiguously in the file, access is easy. When it is interleaved, careful mapping or a dedicated file format becomes helpful to avoid many small random accesses.

Another common pattern is checkpoint I/O, where the entire simulation state is periodically written. Here, your main concern is writing large volumes of data as fast as possible, often with a simple, regular layout. Collective and aggregated writes are particularly beneficial, and the choice between shared files and file-per-process output can have a strong impact.

Recognizing these patterns and choosing appropriate I/O strategies is a core part of parallel I/O design.

Synchronization and Consistency

When multiple processes write to the same file, questions arise about when data becomes visible and in what order. In parallel I/O, synchronization and consistency rules define the relationship between concurrent operations.

There is a distinction between the view that a process has of the file and the actual state of data on disk. Buffers, caches, and reordering inside the library, operating system, and file system can delay or reorder physical writes. For correctness, parallel I/O interfaces typically provide synchronization operations, such as flushes or barriers around I/O calls, to ensure that certain conditions hold before computation proceeds.

A simple mental model is to separate the logical definition of what should be in the file after a sequence of operations from the physical timing of writes. Collective and coordinated I/O make it easier for libraries to ensure a predictable logical result while still optimizing physical access. Independent, overlapping writes from many processes to the same region are risky and should be avoided unless the semantics are well defined.

Avoid overlapping writes to the same file regions from different processes. Use explicit synchronization or collective I/O to maintain a clear and correct file state.
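
A standard MPI I/O idiom for making one rank's write visible to another rank's read on the same open file is the sync-barrier-sync pattern from the MPI consistency semantics. This sketch assumes fh, rank, buf, and COUNT are defined as in the earlier examples:

```c
/* Make rank 0's write visible to a later read by rank 1. Without this
 * sequence, another rank may read stale data from its own caches. */
if (rank == 0)
    MPI_File_write_at(fh, 0, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

MPI_File_sync(fh);            /* push written data out of local buffers */
MPI_Barrier(MPI_COMM_WORLD);  /* order the write before the read        */
MPI_File_sync(fh);            /* refresh this rank's view of the file   */

if (rank == 1)
    MPI_File_read_at(fh, 0, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);
```

Note that MPI_File_sync is collective over the communicator used to open the file, so all ranks call it here, not just the writer and the reader.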

Caching, Buffering, and Direct I/O

Another important aspect of parallel I/O is the role of intermediate storage layers. Data may flow from application memory to system buffers, to file system caches, and finally to physical storage. These layers exist to hide latency and improve effective throughput, but they also introduce subtle behavior.

Write operations often complete as soon as data is copied into kernel or library buffers, not when it reaches disk. For many scientific applications, this is acceptable. For crash resilience, however, you sometimes need guarantees that data is durable on persistent storage. Parallel I/O interfaces usually provide flush operations or hints to control this, but using them too often can reduce performance.

Local caching on compute nodes can also interact with parallel I/O. Multiple nodes may have cached copies of parts of a file. Without proper cache coherence, writing and then reading the same region from another node can yield surprising results. Parallel file systems implement their own coherence models, and parallel I/O libraries are designed to work within those models, but application code should not assume that all reads always see the latest data unless the correct synchronization operations have been used.

To reduce the overhead of caches and buffers, some systems and libraries offer direct I/O modes, which bypass certain levels of caching. This can be beneficial for large, streaming I/O where the data is not reused, but it also removes some benefits of buffering.

Parallel I/O APIs and Their Roles

Several standardized and de facto standard interfaces exist specifically to support parallel I/O on HPC systems. While other chapters focus on specific libraries and formats, it is useful here to understand how they relate conceptually.

Low-level parallel I/O interfaces, such as MPI I/O, provide calls for reading and writing in parallel, describing file views, and performing independent or collective I/O. They map relatively closely to the underlying parallel file system operations and expose details such as file offsets, datatypes, and access modes.

Higher-level I/O libraries, such as parallel HDF5 and parallel NetCDF, build on these low-level interfaces. They let you describe data in terms of multidimensional arrays, groups, attributes, and variables. They internally manage file layouts, chunking, and metadata, and they implement parallel access concepts such as collective I/O, aggregation, and logical views.
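
As a brief illustration of the high-level route, the following sketch opens an HDF5 file with the MPI-IO driver and requests collective transfers. It assumes an MPI-enabled HDF5 build and an initialized MPI program; dataset creation and hyperslab selection are omitted for brevity, and the filename is illustrative:

```c
#include <hdf5.h>
#include <mpi.h>

/* Sketch: parallel file access and collective transfers in HDF5. */
void write_results_parallel(void)
{
    /* File access property list: route all file I/O through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("results.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Transfer property list: ask for collective dataset writes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* ... create dataspaces, select each rank's hyperslab, H5Dwrite ... */

    H5Pclose(dxpl);
    H5Pclose(fapl);
    H5Fclose(file);
}
```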

Scientific applications often use these higher-level formats to avoid manual offset management and to ensure that data is portable, self-describing, and suitable for analysis tools. Understanding basic parallel I/O concepts is still important, because performance tuning often involves selecting appropriate chunk sizes, choosing between collective and independent I/O, and designing access patterns that align with the underlying parallel file system.

Low-level parallel I/O APIs expose file views and offsets, while high-level libraries expose arrays and variables, but both rely on the same underlying concepts of coordinated parallel access.

Designing for Parallel I/O from the Start

Finally, parallel I/O concepts influence how you design data structures and output strategies in an HPC application. Trying to retrofit parallel I/O onto an application that was structured for serial I/O usually leads to compromises and poor performance.

When designing a parallel code, consider how your domain decomposition maps to file regions, which data must be written for checkpoints, diagnostics, or visualization, and how often you need to perform these operations. Favor regular decompositions and layouts that can be represented as contiguous or structured regions. Minimize the number of files when possible, especially when you expect very large process counts. Think about the life cycle of data, including post-processing and long-term storage, and choose formats that support efficient parallel reading as well as writing.

Parallel I/O concepts ultimately connect the computational parallelism of your code with the parallelism of the storage system. When these two kinds of parallelism are aligned, large scale simulations and analyses become feasible within realistic time windows.
