Why Parallel I/O Matters
As applications scale to thousands of processes and huge datasets, input and output can become the dominant cost in a run. Parallel I/O is about letting multiple processes or threads read and write data concurrently while preserving correctness and achieving higher aggregate bandwidth.
At a high level, parallel I/O aims to:
- Avoid bottlenecks caused by a single process doing all I/O.
- Match the parallelism of your computation with parallelism in storage.
- Organize data on disk to be read and written efficiently by many processes.
This chapter focuses on the concepts you need to understand parallel I/O patterns and the APIs/building blocks you’ll see on HPC systems.
Parallel I/O vs. Serial I/O
In a serial (or “master-only”) I/O pattern:
- One process (often rank 0) gathers data from all other processes.
- That process writes a single file using standard POSIX I/O (`read`, `write`, etc.).
Issues:
- Communication overhead: everyone sends data to rank 0.
- Single process bottleneck: only one process uses the filesystem.
- Poor scalability: works for tens of processes, struggles at thousands.
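As a baseline for comparison, here is a minimal sketch of the master-only pattern in C with MPI. The equal-block distribution, the buffer sizes, and the `out.dat` filename are illustrative assumptions, not part of any particular code.

```c
/* Master-only I/O: every rank sends its block to rank 0, which writes
 * the whole array with ordinary stdio calls.
 * Illustrative sketch; nlocal and "out.dat" are placeholder names. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void write_master_only(const double *local, int nlocal, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)nlocal * size * sizeof(double));

    /* All ranks funnel their data through rank 0. */
    MPI_Gather(local, nlocal, MPI_DOUBLE,
               global, nlocal, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        FILE *f = fopen("out.dat", "wb");
        fwrite(global, sizeof(double), (size_t)nlocal * size, f);
        fclose(f);
        free(global);
    }
}
```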
In parallel I/O:
- Multiple processes access the filesystem simultaneously.
- Data is partitioned across processes and often across storage devices.
- I/O libraries coordinate access to avoid corruption and conflicts.
Parallel I/O does not necessarily mean “many files”; you can have:
- Single shared file, many processes
- Many files, many processes (e.g., one file per rank)
Both are forms of parallel I/O; which is better depends on the filesystem, tools, and workflow.
Layers of Parallel I/O
Conceptually, parallel I/O is usually organized in layers:
- Application-level I/O
- You call high-level I/O routines (e.g., write an array, read a variable).
- Examples: HDF5, NetCDF, ADIOS.
- Parallel I/O library
- Implements collective and non-collective operations across processes.
- Often built on top of MPI-IO.
- MPI-IO
- Standard MPI extension for parallel file I/O.
- Provides APIs like `MPI_File_read_at_all`, `MPI_File_write_all`, etc.
- Knows about communicators, datatypes, and views.
- POSIX / filesystem interface
- Basic `open`, `read`, `write`, `lseek`, `close`.
- Interacts with the parallel filesystem (e.g., Lustre, GPFS).
- Parallel filesystem / storage hardware
- Distributes data across multiple devices and servers.
- Provides bandwidth and capacity.
When you design parallel I/O, you’re choosing how you use these layers: direct POSIX from multiple ranks, MPI-IO, or a higher-level library.
Models of Parallel File Access
Shared-File vs. File-Per-Process
Two common patterns are:
1. Shared-file model
- All processes access the same file concurrently.
- Layout in the file is partitioned logically among processes.
- Pros:
- Single file is easier to manage and post-process.
- Many tools expect a single dataset.
- Cons:
- More complex coordination.
- Requires careful design to avoid contention.
2. File-per-process model
- Each process writes its own file (e.g., `data_rank0000.dat`).
- Pros:
- Conceptually simple.
- Can perform well for moderate process counts.
- Cons:
- File explosion (tens of thousands of files).
- Filesystem metadata stress; harder data management.
- Post-processing often needs a merging step.
Many production codes use hybrid patterns:
- One file per node (or per I/O group).
- Or a limited number of shared files (e.g., 16 files for 1024 ranks).
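For reference, a plain file-per-process writer can be as short as the sketch below (C with MPI); the `data_rank%04d.dat` naming mirrors the example above, and everything else is an illustrative assumption.

```c
/* File-per-process: each rank writes its own file with plain stdio.
 * Sketch only; the naming scheme follows the example above. */
#include <mpi.h>
#include <stdio.h>

void write_file_per_process(const double *local, int nlocal, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    char fname[64];
    snprintf(fname, sizeof(fname), "data_rank%04d.dat", rank);

    FILE *f = fopen(fname, "wb");
    fwrite(local, sizeof(double), (size_t)nlocal, f);
    fclose(f);
    /* At 10,000+ ranks this creates 10,000+ files: simple, but it
     * stresses metadata servers and complicates post-processing. */
}
```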
Independent vs. Collective I/O
Parallel I/O operations can be:
Independent I/O
- Each process issues I/O calls on its own timeline.
- No coordination required between processes.
- More flexible but often suboptimal performance.
- Example: each rank calls `MPI_File_write_at` with its own offset.
Collective I/O
- A group of processes in a communicator call the same I/O routine together.
- The library can:
- Merge (“coalesce”) small requests into big ones.
- Reorder accesses for better alignment with filesystem stripes.
- Use a subset of processes as I/O aggregators.
- Often yields much better performance on parallel filesystems.
- Examples: `MPI_File_write_at_all`, collective HDF5 writes.
Choosing collective I/O is mainly about performance and sometimes robustness; the semantics (what ends up in the file) can be the same.
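The contrast can be seen in a small MPI-IO sketch. Assume the file handle was opened collectively on the communicator and each rank owns a contiguous block of `nlocal` doubles; the helper below is hypothetical.

```c
/* Independent vs. collective writes to one shared file.
 * Assumes fh was opened collectively and each rank owns nlocal doubles
 * destined for a contiguous block of the file. */
#include <mpi.h>

void write_block(MPI_File fh, const double *local, int nlocal, int rank,
                 int use_collective)
{
    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);

    if (use_collective) {
        /* Collective: every rank in the communicator must call this;
         * the library may aggregate and reorder requests. */
        MPI_File_write_at_all(fh, offset, local, nlocal, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
    } else {
        /* Independent: each rank writes on its own, no coordination. */
        MPI_File_write_at(fh, offset, local, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    }
}
```

Either call produces the same file contents here; the collective form simply gives the library room to aggregate and reorder the requests.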
Data Partitioning and File Views
To do parallel I/O correctly and efficiently, each process must know:
- What part of the dataset it owns in memory.
- Where that part lives in the file.
This mapping is at the core of parallel I/O design.
Decomposition in Memory
In parallel programs, data structures are typically decomposed across processes, for example:
- 1D array of size $N$ split into equal blocks:
- Rank $r$ owns indices $[r \cdot n_{\text{local}}, (r+1)n_{\text{local}})$.
- 2D or 3D domain decompositions:
- Each process owns a sub-block or sub-domain.
This partitioning is usually already decided in the computation phase. Parallel I/O needs a file layout that reflects or at least is compatible with that decomposition.
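For example, a 1D block decomposition that tolerates a remainder might be computed as in this sketch (an illustrative helper, not a library routine):

```c
/* 1D block decomposition of a global array of N elements over P ranks,
 * handling the case where N is not divisible by P (the first N % P
 * ranks get one extra element). */
void block_decompose(long N, int P, int rank,
                     long *n_local, long *first_index)
{
    long base = N / P;
    long rem  = N % P;

    *n_local     = base + (rank < rem ? 1 : 0);
    *first_index = rank * base + (rank < rem ? rank : rem);
    /* The byte offset in a file storing the global array contiguously
     * would then be (*first_index) * sizeof(element). */
}
```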
Logical File Layout
The file typically stores a global view of the data (e.g., a full 3D grid). Processes should be able to:
- Independently compute their file offsets and lengths, or
- Use the I/O library’s abstraction of a file view.
A file view (in MPI-IO terminology):
- Defines a process’s visible region of the file.
- Often specified using MPI datatypes (including non-contiguous patterns).
- Allows each rank to “see” only its portion without manually computing offsets.
Conceptually, a file view lets you say: “For rank r, this is the slice of the file representing my sub-domain,” even if that slice is strided or irregular.
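As a sketch of the idea, the following sets a file view for a 2D block decomposition using an MPI subarray datatype; `gsizes`, `lsizes`, and `starts` are assumed to come from the existing domain decomposition.

```c
/* File view for a 2D block decomposition: each rank "sees" only the
 * rectangle of the global array that it owns. */
#include <mpi.h>

void set_2d_view(MPI_File fh, int gsizes[2], int lsizes[2], int starts[2])
{
    MPI_Datatype filetype;

    /* Describe this rank's sub-block of the global 2D array. */
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* From now on, offsets in fh are interpreted within this rank's slice. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    MPI_Type_free(&filetype);
}
```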
Access Patterns and Their Impact
The pattern with which processes access the file strongly influences performance.
Contiguous vs. Non-Contiguous Access
- Contiguous access
- Each process reads/writes a single long chunk.
- Simplest and usually fastest (fewer, larger I/O operations).
- Non-contiguous (strided or scattered) access
- Data is interleaved or stored in patterns (e.g., all X-lines, planes).
- Naively, many small I/O calls, which is slow.
- Libraries can optimize this using derived datatypes and collective I/O.
When possible, design file layouts where each process can perform few, large, contiguous operations.
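For instance, a strided column-block pattern can be described once with a derived datatype and installed as the file view, so a single (ideally collective) write replaces many small ones. The sketch below is illustrative; the row/column layout and names are assumptions.

```c
/* Non-contiguous (strided) access described once with a derived datatype.
 * Sketch: each rank owns a column block of `blocklen` doubles in every
 * row of a table with `rowlen` doubles per row and `nrows` rows. */
#include <mpi.h>

void set_strided_view(MPI_File fh, int nrows, int rowlen,
                      int blocklen, int rank)
{
    MPI_Datatype filetype;

    /* nrows blocks of blocklen doubles, spaced rowlen doubles apart. */
    MPI_Type_vector(nrows, blocklen, rowlen, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* Skip to this rank's column block, then let the view describe the
     * stride; collective writes can then be coalesced by the library. */
    MPI_Offset disp = (MPI_Offset)rank * blocklen * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native",
                      MPI_INFO_NULL);

    MPI_Type_free(&filetype);
}
```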
Regular vs. Irregular Patterns
- Regular patterns
- Same layout rules for each process (e.g., block or block-cyclic).
- Easy for MPI-IO or HDF5 to optimize.
- Irregular patterns
- Each rank owns scattered or unpredictable portions of data.
- Harder to optimize; may require extra staging or reordering.
If your computation uses irregular decomposition, consider:
- Reordering data into a regular buffer before I/O, or
- Using libraries that help handle this via metadata and high-level APIs.
Shared-File Coordination and Consistency
With multiple processes writing to one file, you must avoid:
- Overwriting each other’s data.
- Inconsistent views of the file.
Concepts relevant to coordination:
File Offsets and Atomicity
Each process can:
- Maintain its own individual offset (like a file pointer).
- Use explicit offsets (e.g., the `*_at` routines in MPI-IO).
- Rely on collective I/O to manage offsets coherently.
Atomicity refers to the guarantee that:
- Combined writes from different processes do not become interleaved at the byte level.
- On some systems, ensuring strict POSIX-like atomicity can reduce performance.
- Many parallel I/O APIs let you trade strict atomicity for higher performance if your access patterns are non-overlapping and well-structured.
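If overlapping writes really are possible, MPI-IO lets you request atomic mode explicitly; a minimal, hedged sketch:

```c
/* Toggle atomic mode on a shared file. Atomic mode guarantees that
 * concurrent writes are not interleaved at the byte level, at a
 * potential performance cost; with non-overlapping, well-structured
 * accesses it can usually stay off (the default). */
#include <mpi.h>

void set_atomic_if_needed(MPI_File fh, int regions_may_overlap)
{
    MPI_File_set_atomicity(fh, regions_may_overlap ? 1 : 0);
}
```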
Consistency and Visibility
Because of caching and buffering:
- A write by one process might not be immediately visible to others.
- Many parallel I/O APIs provide:
- Synchronization calls (e.g., `sync`, `flush`, barrier-like operations).
- Modes that control when data is forced to storage.
The typical pattern:
- Perform all parallel writes.
- Call a collective sync/close to ensure data is safely stored and globally visible.
- Then allow reading by other processes or post-processing tools.
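In MPI-IO terms, one common way to realize this pattern (short of simply closing the file) is the sync/barrier/sync sequence sketched below; it is illustrative rather than the only option.

```c
/* Typical "write, then make it visible" sequence on a shared file:
 * flush local writes, synchronize the ranks, then sync again before
 * anyone reads data written by another process. A collective close
 * also flushes data to storage. */
#include <mpi.h>

void publish_writes(MPI_File fh, MPI_Comm comm)
{
    MPI_File_sync(fh);   /* push this rank's writes toward storage   */
    MPI_Barrier(comm);   /* make sure everyone has finished writing  */
    MPI_File_sync(fh);   /* refresh views before subsequent reads    */
}
```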
The Role of MPI-IO (Conceptually)
Without going into full API details, MPI-IO introduces several key ideas:
- Communicator-based I/O
- Operations are defined over a group of processes (like MPI message passing).
- File views
- Custom, potentially non-contiguous views per process.
- Collective calls
- Allow the library to reorganize requests for better performance.
- I/O aggregation
- A subset of processes can act as I/O “aggregators”, collecting data from others and issuing fewer large I/O requests.
From a conceptual standpoint, MPI-IO provides the primitives to:
- Describe which parts of the file each process owns.
- Issue I/O collectively across those processes.
- Let the library exploit filesystem characteristics.
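Putting these ideas together, a minimal end-to-end MPI-IO program might look like the following sketch; the filename, data, and sizes are placeholders.

```c
/* Minimal end-to-end MPI-IO sketch: open a shared file on a
 * communicator, write one contiguous block per rank collectively,
 * and close. "checkpoint.dat" and nlocal are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nlocal = 1024;                 /* elements per rank */
    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++)
        buf[i] = rank + i * 1e-6;            /* placeholder data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);                     /* collective; flushes data */
    free(buf);
    MPI_Finalize();
    return 0;
}
```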
High-Level Parallel File Formats
Many scientific codes do not use MPI-IO directly but rely on parallel-aware libraries and formats. At a conceptual level, these provide:
- Self-describing data
- Metadata embedded in the file describes variables, dimensions, types.
- Parallel operations on datasets
- Write/read a subset (hyperslab) of a dataset in parallel.
- Portability
- Files can be moved across systems and read by different languages/tools.
- Abstractions over file layout
- You think in terms of arrays and variables, not low-level offsets and stripes.
These libraries typically:
- Use MPI-IO under the hood.
- Implement collective I/O where beneficial.
- Allow you to define parallel decompositions that mirror your computation.
Understanding the concepts of decomposition, file views, and collective I/O helps you use such libraries more effectively.
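As an illustration of what this looks like in practice, here is a hedged sketch of a collective hyperslab write with the parallel HDF5 C API; it assumes an MPI-enabled HDF5 build, and the dataset name and sizes are invented for the example.

```c
/* Parallel HDF5 sketch: every rank writes its block of a 1D dataset
 * as a hyperslab, using collective transfer. Dataset name "x" and the
 * filename are illustrative. */
#include <mpi.h>
#include <hdf5.h>

void write_hdf5_block(const double *local, hsize_t nlocal, int rank,
                      hsize_t nglobal, MPI_Comm comm)
{
    /* File access property list: open the file with MPI-IO underneath. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global dataspace and dataset describing the full array. */
    hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's hyperslab in the file and a matching memory space. */
    hsize_t start = (hsize_t)rank * nlocal;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &nlocal, NULL);
    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);

    /* Collective transfer property, then the actual parallel write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
}
```

Note that the code describes "which part of the dataset" each rank writes, not byte offsets or stripes; the library maps the hyperslab selection onto MPI-IO and the filesystem.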
Typical Parallel I/O Patterns
A few common usage patterns in HPC codes:
- Checkpoint/Restart I/O
- All processes periodically write their state to disk.
- Usually a parallel write to one or a small number of files.
- Emphasis on robustness and performance.
- Time-stepped output
- At each time step, processes write out fields or snapshots.
- May use one file per time step, or a multi-time-step file.
- Parallel analysis / post-processing
- Separate parallel job reads simulation outputs in parallel.
- Can use a different decomposition than the original code.
In each case, your choices about:
- Shared vs. file-per-process.
- Independent vs. collective.
- Data layout and decomposition.
strongly affect performance and usability.
Performance Considerations (Conceptual)
While detailed tuning belongs elsewhere, conceptually:
- Few large operations > many small operations.
- Aligned access that matches filesystem striping is beneficial.
- Collective I/O can hide complexity and improve throughput.
- Over-creating files (e.g., one per core at large scale) can overwhelm metadata servers.
- Mixing random small I/O with large sequential I/O can hurt everyone’s performance on a shared system.
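Striping and aggregation preferences are often communicated through MPI info hints at open time, as in the sketch below. The keys shown are reserved MPI-IO hint names, but whether they are honored depends on the MPI implementation and the filesystem; the values are illustrative, not recommendations.

```c
/* Passing layout and aggregation hints at file-open time. */
#include <mpi.h>

void open_with_hints(const char *path, MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    MPI_Info_set(info, "striping_factor", "16");      /* number of stripes */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MiB stripe size */
    MPI_Info_set(info, "cb_nodes", "8");              /* I/O aggregators   */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  info, fh);
    MPI_Info_free(&info);
}
```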
Parallel I/O design is about balancing:
- The natural decomposition of your computation.
- The capabilities and preferences of the parallel filesystem.
- The needs of downstream tools and workflows.
Summary of Core Concepts
Key ideas to keep in mind for parallel I/O:
- Parallel I/O lets multiple processes access data concurrently to match the scale of modern HPC computations.
- You can parallelize I/O via shared files, file-per-process, or hybrids.
- Collective I/O and file views are central concepts in scalable parallel I/O.
- Data decomposition in memory must be mapped carefully to file layouts.
- High-level libraries build on MPI-IO concepts to provide convenient, portable, parallel file formats.
These concepts form the foundation for using specific parallel filesystems, libraries, and optimizations in practice.