Why Parallel I/O Matters
As applications scale to thousands of processes and huge datasets, input and output can become the dominant cost in a run. Parallel I/O is about letting multiple processes or threads read and write data concurrently while preserving correctness and achieving higher aggregate bandwidth.
At a high level, parallel I/O aims to:
- Avoid bottlenecks caused by a single process doing all I/O.
- Match the parallelism of your computation with parallelism in storage.
- Organize data on disk to be read and written efficiently by many processes.
This chapter focuses on the concepts you need to understand parallel I/O patterns and the APIs/building blocks you’ll see on HPC systems.
Parallel I/O vs. Serial I/O
In a serial (or “master-only”) I/O pattern:
- One process (often rank 0) gathers data from all other processes.
- That process writes a single file using standard POSIX I/O (`read`, `write`, etc.).
Issues:
- Communication overhead: everyone sends data to rank 0.
- Single process bottleneck: only one process uses the filesystem.
- Poor scalability: works for tens of processes, struggles at thousands.
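As a baseline for comparison, here is a minimal sketch of the master-only pattern in C with MPI. The equal-block distribution, the buffer sizes, and the `out.dat` filename are illustrative assumptions, not part of any particular code.

```c
/* Master-only I/O: every rank sends its block to rank 0, which writes
 * the whole array with ordinary stdio calls.
 * Illustrative sketch; nlocal and "out.dat" are placeholder names. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void write_master_only(const double *local, int nlocal, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)nlocal * size * sizeof(double));

    /* All ranks funnel their data through rank 0. */
    MPI_Gather(local, nlocal, MPI_DOUBLE,
               global, nlocal, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        FILE *f = fopen("out.dat", "wb");
        fwrite(global, sizeof(double), (size_t)nlocal * size, f);
        fclose(f);
        free(global);
    }
}
```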
In parallel I/O:
- Multiple processes access the filesystem simultaneously.
- Data is partitioned across processes and often across storage devices.
- I/O libraries coordinate access to avoid corruption and conflicts.
Parallel I/O does not necessarily mean “many files”; you can have:
- Single shared file, many processes
- Many files, many processes (e.g., one file per rank)
Both are forms of parallel I/O; which is better depends on the filesystem, tools, and workflow.
Layers of Parallel I/O
Conceptually, parallel I/O is usually organized in layers:
- Application-level I/O
- You call high-level I/O routines (e.g., write an array, read a variable).
- Examples: HDF5, NetCDF, ADIOS.
- Parallel I/O library
- Implements collective and non-collective operations across processes.
- Often built on top of MPI-IO.
- MPI-IO
- Standard MPI extension for parallel file I/O.
- Provides APIs like `MPI_File_read_at_all`, `MPI_File_write_all`, etc.
- Knows about communicators, datatypes, and views.
- POSIX / filesystem interface
- Basic `open`, `read`, `write`, `lseek`, `close`.
- Interacts with the parallel filesystem (e.g., Lustre, GPFS).
- Parallel filesystem / storage hardware
- Distributes data across multiple devices and servers.
- Provides bandwidth and capacity.
When you design parallel I/O, you’re choosing how you use these layers: direct POSIX from multiple ranks, MPI-IO, or a higher-level library.
Models of Parallel File Access
Shared-File vs. File-Per-Process
Two common patterns are:
1. Shared-file model
- All processes access the same file concurrently.
- Layout in the file is partitioned logically among processes.
- Pros:
- Single file is easier to manage and post-process.
- Many tools expect a single dataset.
- Cons:
- More complex coordination.
- Requires careful design to avoid contention.
2. File-per-process model
- Each process writes its own file (e.g., `data_rank0000.dat`).
- Pros:
- Conceptually simple.
- Can perform well for moderate process counts.
- Cons:
- File explosion (tens of thousands of files).
- Filesystem metadata stress; harder data management.
- Post-processing often needs a merging step.
Many production codes use hybrid patterns:
- One file per node (or per I/O group).
- Or a limited number of shared files (e.g., 16 files for 1024 ranks).
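For reference, a plain file-per-process writer can be as short as the sketch below (C with MPI); the `data_rank%04d.dat` naming mirrors the example above, and everything else is an illustrative assumption.

```c
/* File-per-process: each rank writes its own file with plain stdio.
 * Sketch only; the naming scheme follows the example above. */
#include <mpi.h>
#include <stdio.h>

void write_file_per_process(const double *local, int nlocal, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    char fname[64];
    snprintf(fname, sizeof(fname), "data_rank%04d.dat", rank);

    FILE *f = fopen(fname, "wb");
    fwrite(local, sizeof(double), (size_t)nlocal, f);
    fclose(f);
    /* At 10,000+ ranks this creates 10,000+ files: simple, but it
     * stresses metadata servers and complicates post-processing. */
}
```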
Independent vs. Collective I/O
Parallel I/O operations can be:
Independent I/O
- Each process issues I/O calls on its own timeline.
- No coordination required between processes.
- More flexible but often suboptimal performance.
- Example: each rank calls `MPI_File_write_at` with its own offset.
Collective I/O
- A group of processes in a communicator call the same I/O routine together.
- The library can:
- Merge (“coalesce”) small requests into big ones.
- Reorder accesses for better alignment with filesystem stripes.
- Use a subset of processes as I/O aggregators.
- Often yields much better performance on parallel filesystems.
- Examples: `MPI_File_write_at_all`, collective HDF5 writes.
Choosing collective I/O is mainly about performance and sometimes robustness; the semantics (what ends up in the file) can be the same.
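The contrast can be seen in a small MPI-IO sketch. Assume the file handle was opened collectively on the communicator and each rank owns a contiguous block of `nlocal` doubles; the helper below is hypothetical.

```c
/* Independent vs. collective writes to one shared file.
 * Assumes fh was opened collectively and each rank owns nlocal doubles
 * destined for a contiguous block of the file. */
#include <mpi.h>

void write_block(MPI_File fh, const double *local, int nlocal, int rank,
                 int use_collective)
{
    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);

    if (use_collective) {
        /* Collective: every rank in the communicator must call this;
         * the library may aggregate and reorder requests. */
        MPI_File_write_at_all(fh, offset, local, nlocal, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
    } else {
        /* Independent: each rank writes on its own, no coordination. */
        MPI_File_write_at(fh, offset, local, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    }
}
```

Either call produces the same file contents here; the collective form simply gives the library room to aggregate and reorder the requests.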
Data Partitioning and File Views
To do parallel I/O correctly and efficiently, each process must know:
- What part of the dataset it owns in memory.
- Where that part lives in the file.
This mapping is at the core of parallel I/O design.
Decomposition in Memory
In parallel programs, data structures are typically decomposed across processes, for example:
- 1D array of size $N$ split into equal blocks:
- Rank $r$ owns indices $[r \cdot n_{\text{local}}, (r+1)n_{\text{local}})$.
- 2D or 3D domain decompositions:
- Each process owns a sub-block or sub-domain.
This partitioning is usually already decided in the computation phase. Parallel I/O needs a file layout that reflects or at least is compatible with that decomposition.
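For example, a 1D block decomposition that tolerates a remainder might be computed as in this sketch (an illustrative helper, not a library routine):

```c
/* 1D block decomposition of a global array of N elements over P ranks,
 * handling the case where N is not divisible by P (the first N % P
 * ranks get one extra element). */
void block_decompose(long N, int P, int rank,
                     long *n_local, long *first_index)
{
    long base = N / P;
    long rem  = N % P;

    *n_local     = base + (rank < rem ? 1 : 0);
    *first_index = rank * base + (rank < rem ? rank : rem);
    /* The byte offset in a file storing the global array contiguously
     * would then be (*first_index) * sizeof(element). */
}
```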
Logical File Layout
The file typically stores a global view of the data (e.g., a full 3D grid). Processes should be able to:
- Independently compute their file offsets and lengths, or
- Use the I/O library’s abstraction of a file view.
A file view (in MPI-IO terminology):
- Defines a process’s visible region of the file.
- Often specified using MPI datatypes (including non-contiguous patterns).
- Allows each rank to “see” only its portion without manually computing offsets.
Conceptually, a file view lets you say: “For rank r, this is the slice of the file representing my sub-domain,” even if that slice is strided or irregular.
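As a sketch of the idea, the following sets a file view for a 2D block decomposition using an MPI subarray datatype; `gsizes`, `lsizes`, and `starts` are assumed to come from the existing domain decomposition.

```c
/* File view for a 2D block decomposition: each rank "sees" only the
 * rectangle of the global array that it owns. */
#include <mpi.h>

void set_2d_view(MPI_File fh, int gsizes[2], int lsizes[2], int starts[2])
{
    MPI_Datatype filetype;

    /* Describe this rank's sub-block of the global 2D array. */
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* From now on, offsets in fh are interpreted within this rank's slice. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    MPI_Type_free(&filetype);
}
```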
Access Patterns and Their Impact
The pattern with which processes access the file strongly influences performance.
Contiguous vs. Non-Contiguous Access
- Contiguous access
- Each process reads/writes a single long chunk.
- Simplest and usually fastest (fewer, larger I/O operations).
- Non-contiguous (strided or scattered) access
- Data is interleaved or stored in patterns (e.g., all X-lines, planes).
- Naively, many small I/O calls, which is slow.
- Libraries can optimize this using derived datatypes and collective I/O.
When possible, design file layouts where each process can perform few, large, contiguous operations.
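For instance, a strided column-block pattern can be described once with a derived datatype and installed as the file view, so a single (ideally collective) write replaces many small ones. The sketch below is illustrative; the row/column layout and names are assumptions.

```c
/* Non-contiguous (strided) access described once with a derived datatype.
 * Sketch: each rank owns a column block of `blocklen` doubles in every
 * row of a table with `rowlen` doubles per row and `nrows` rows. */
#include <mpi.h>

void set_strided_view(MPI_File fh, int nrows, int rowlen,
                      int blocklen, int rank)
{
    MPI_Datatype filetype;

    /* nrows blocks of blocklen doubles, spaced rowlen doubles apart. */
    MPI_Type_vector(nrows, blocklen, rowlen, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* Skip to this rank's column block, then let the view describe the
     * stride; collective writes can then be coalesced by the library. */
    MPI_Offset disp = (MPI_Offset)rank * blocklen * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native",
                      MPI_INFO_NULL);

    MPI_Type_free(&filetype);
}
```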
Regular vs. Irregular Patterns
- Regular patterns
- Same layout rules for each process (e.g., block or block-cyclic).
- Easy for MPI-IO or HDF5 to optimize.
- Irregular patterns
- Each rank owns scattered or unpredictable portions of data.
- Harder to optimize; may require extra staging or reordering.
If your computation uses irregular decomposition, consider:
- Reordering data into a regular buffer before I/O, or
- Using libraries that help handle this via metadata and high-level APIs.
Shared-File Coordination and Consistency
With multiple processes writing to one file, you must avoid:
- Overwriting each other’s data.
- Inconsistent views of the file.
Concepts relevant to coordination:
File Offsets and Atomicity
Each process can:
- Maintain its own individual offset (like a file pointer).
- Use explicit offsets (e.g., the `*_at` routines in MPI-IO).
- Rely on collective I/O to manage offsets coherently.
Atomicity refers to the guarantee that:
- Combined writes from different processes do not become interleaved at the byte level.
- On some systems, ensuring strict POSIX-like atomicity can reduce performance.
- Many parallel I/O APIs let you trade strict atomicity for higher performance if your access patterns are non-overlapping and well-structured.
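If overlapping writes really are possible, MPI-IO lets you request atomic mode explicitly; a minimal, hedged sketch:

```c
/* Toggle atomic mode on a shared file. Atomic mode guarantees that
 * concurrent writes are not interleaved at the byte level, at a
 * potential performance cost; with non-overlapping, well-structured
 * accesses it can usually stay off (the default). */
#include <mpi.h>

void set_atomic_if_needed(MPI_File fh, int regions_may_overlap)
{
    MPI_File_set_atomicity(fh, regions_may_overlap ? 1 : 0);
}
```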
Consistency and Visibility
Because of caching and buffering:
- A write by one process might not be immediately visible to others.
- Many parallel I/O APIs provide:
- Synchronization calls (e.g., `sync`, `flush`, barrier-like operations).
- Modes that control when data is forced to storage.
The typical pattern:
- Perform all parallel writes.
- Call a collective sync/close to ensure data is safely stored and globally visible.
- Then allow reading by other processes or post-processing tools.
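In MPI-IO terms, one common way to realize this pattern (short of simply closing the file) is the sync/barrier/sync sequence sketched below; it is illustrative rather than the only option.

```c
/* Typical "write, then make it visible" sequence on a shared file:
 * flush local writes, synchronize the ranks, then sync again before
 * anyone reads data written by another process. A collective close
 * also flushes data to storage. */
#include <mpi.h>

void publish_writes(MPI_File fh, MPI_Comm comm)
{
    MPI_File_sync(fh);   /* push this rank's writes toward storage   */
    MPI_Barrier(comm);   /* make sure everyone has finished writing  */
    MPI_File_sync(fh);   /* refresh views before subsequent reads    */
}
```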
The Role of MPI-IO (Conceptually)
Without going into full API details, MPI-IO introduces several key ideas:
- Communicator-based I/O
- Operations are defined over a group of processes (like MPI message passing).
- File views
- Custom, potentially non-contiguous views per process.
- Collective calls
- Allow the library to reorganize requests for better performance.
- I/O aggregation
- A subset of processes can act as I/O “aggregators”, collecting data from others and issuing fewer large I/O requests.
From a conceptual standpoint, MPI-IO provides the primitives to:
- Describe which parts of the file each process owns.
- Issue I/O collectively across those processes.
- Let the library exploit filesystem characteristics.
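Putting these ideas together, a minimal end-to-end MPI-IO program might look like the following sketch; the filename, data, and sizes are placeholders.

```c
/* Minimal end-to-end MPI-IO sketch: open a shared file on a
 * communicator, write one contiguous block per rank collectively,
 * and close. "checkpoint.dat" and nlocal are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nlocal = 1024;                 /* elements per rank */
    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++)
        buf[i] = rank + i * 1e-6;            /* placeholder data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);                     /* collective; flushes data */
    free(buf);
    MPI_Finalize();
    return 0;
}
```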
High-Level Parallel File Formats
Many scientific codes do not use MPI-IO directly but rely on parallel-aware libraries and formats. At a conceptual level, these provide:
- Self-describing data
- Metadata embedded in the file describes variables, dimensions, types.
- Parallel operations on datasets
- Write/read a subset (hyperslab) of a dataset in parallel.
- Portability
- Files can be moved across systems and read by different languages/tools.
- Abstractions over file layout
- You think in terms of arrays and variables, not low-level offsets and stripes.
These libraries typically:
- Use MPI-IO under the hood.
- Implement collective I/O where beneficial.
- Allow you to define parallel decompositions that mirror your computation.
Understanding the concepts of decomposition, file views, and collective I/O helps you use such libraries more effectively.
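As an illustration of what this looks like in practice, here is a hedged sketch of a collective hyperslab write with the parallel HDF5 C API; it assumes an MPI-enabled HDF5 build, and the dataset name and sizes are invented for the example.

```c
/* Parallel HDF5 sketch: every rank writes its block of a 1D dataset
 * as a hyperslab, using collective transfer. Dataset name "x" and the
 * filename are illustrative. */
#include <mpi.h>
#include <hdf5.h>

void write_hdf5_block(const double *local, hsize_t nlocal, int rank,
                      hsize_t nglobal, MPI_Comm comm)
{
    /* File access property list: open the file with MPI-IO underneath. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global dataspace and dataset describing the full array. */
    hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's hyperslab in the file and a matching memory space. */
    hsize_t start = (hsize_t)rank * nlocal;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &nlocal, NULL);
    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);

    /* Collective transfer property, then the actual parallel write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
}
```

Note that the code describes "which part of the dataset" each rank writes, not byte offsets or stripes; the library maps the hyperslab selection onto MPI-IO and the filesystem.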
Typical Parallel I/O Patterns
A few common usage patterns in HPC codes:
- Checkpoint/Restart I/O
- All processes periodically write their state to disk.
- Usually a parallel write to one or a small number of files.
- Emphasis on robustness and performance.
- Time-stepped output
- At each time step, processes write out fields or snapshots.
- May use one file per time step, or a multi-time-step file.
- Parallel analysis / post-processing
- Separate parallel job reads simulation outputs in parallel.
- Can use a different decomposition than the original code.
In each case, your choices about:
- Shared vs. file-per-process.
- Independent vs. collective.
- Data layout and decomposition.
strongly affect performance and usability.
Performance Considerations (Conceptual)
While detailed tuning belongs elsewhere, conceptually:
- Few large operations > many small operations.
- Aligned access that matches filesystem striping is beneficial.
- Collective I/O can hide complexity and improve throughput.
- Over-creating files (e.g., one per core at large scale) can overwhelm metadata servers.
- Mixing random small I/O with large sequential I/O can hurt everyone’s performance on a shared system.
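Striping and aggregation preferences are often communicated through MPI info hints at open time, as in the sketch below. The keys shown are reserved MPI-IO hint names, but whether they are honored depends on the MPI implementation and the filesystem; the values are illustrative, not recommendations.

```c
/* Passing layout and aggregation hints at file-open time. */
#include <mpi.h>

void open_with_hints(const char *path, MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    MPI_Info_set(info, "striping_factor", "16");      /* number of stripes */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MiB stripe size */
    MPI_Info_set(info, "cb_nodes", "8");              /* I/O aggregators   */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  info, fh);
    MPI_Info_free(&info);
}
```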
Parallel I/O design is about balancing:
- The natural decomposition of your computation.
- The capabilities and preferences of the parallel filesystem.
- The needs of downstream tools and workflows.
Summary of Core Concepts
Key ideas to keep in mind for parallel I/O:
- Parallel I/O lets multiple processes access data concurrently to match the scale of modern HPC computations.
- You can parallelize I/O via shared files, file-per-process, or hybrids.
- Collective I/O and file views are central concepts in scalable parallel I/O.
- Data decomposition in memory must be mapped carefully to file layouts.
- High-level libraries build on MPI-IO concepts to provide convenient, portable, parallel file formats.
These concepts form the foundation for using specific parallel filesystems, libraries, and optimizations in practice.