What Makes a Filesystem “Parallel” in HPC
A parallel filesystem is designed so that many clients (compute nodes) can read and write many files concurrently, with the data striped across multiple storage servers and disks. The main goals are:
- High aggregate bandwidth for many nodes at once
- Scalable metadata management (file names, directories, permissions)
- Consistent access semantics suitable for batch and parallel jobs
In contrast to a simple network filesystem on a single server, a parallel filesystem distributes both:
- Data: file contents are split into chunks and stored across multiple storage targets
- Metadata: information about files and directories is often managed by dedicated metadata services
From the user's point of view, you usually see one mount point (e.g. /home, /project, /scratch), but behind that path there may be:
- Dozens or hundreds of disks
- Multiple storage servers
- Separate networks for client–storage traffic
Key Concepts and Architecture
Data and Metadata Separation
Parallel filesystems typically distinguish between:
- Metadata servers (MDS) manage:
  - File and directory names
  - Ownership and permissions
  - Timestamps
  - Directory hierarchy
- Object or storage servers (OSS/OSD) store:
  - Actual file data in “objects” or “extents”
  - Often backed by RAID arrays or other redundancy
Why this matters in HPC:
- Metadata operations (creating many files, listing directories) can be bottlenecks if not scaled properly
- Data throughput (reading/writing large arrays, checkpoints) is handled separately and can be scaled by adding more storage servers
As a user, be aware that metadata-heavy operations such as ls -R over large trees, creating millions of tiny files, or compiling large codebases on shared storage can stress the metadata system.
Striping
Striping is the core data layout mechanism in parallel filesystems:
- A large file is divided into fixed-size chunks, e.g. 1 MB, 4 MB, 16 MB, etc.
- These chunks are spread across multiple storage targets (disks/servers)
- Multiple clients can access different stripes simultaneously
Conceptually, for a file:
$$
\text{file\_data} = \text{stripe}_0 + \text{stripe}_1 + \dots + \text{stripe}_n
$$
Each stripe may live on a different storage server. When many processes read or write different parts of the file, the combined throughput can approach the aggregate bandwidth of all those servers.
Important characteristics:
- Stripe count: how many storage targets a file uses
- Stripe size: how big each chunk is before moving to the next target
Choosing stripe parameters is an important tuning lever for large parallel I/O patterns (often configured per-directory or per-file by the user or admin).
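As a rough illustration, the sketch below assumes a plain round-robin layout with made-up parameters (4 MiB stripes across 4 targets) and computes which storage target a given byte offset lands on; real filesystems store the actual layout per file and the mapping details vary.

```c
#include <stdio.h>
#include <stdint.h>

/* Minimal sketch of a round-robin striping layout.
 * stripe_size and stripe_count are hypothetical tuning values;
 * real filesystems keep the actual layout per file. */
typedef struct {
    uint64_t stripe_size;   /* bytes per stripe before moving to the next target */
    uint32_t stripe_count;  /* number of storage targets used by the file */
} layout_t;

/* Map a byte offset in the file to (target index, offset within that target). */
static void locate(layout_t l, uint64_t offset,
                   uint32_t *target, uint64_t *target_offset)
{
    uint64_t stripe_index = offset / l.stripe_size;        /* which stripe */
    *target = (uint32_t)(stripe_index % l.stripe_count);   /* round-robin target */
    /* Each target holds every stripe_count-th stripe of the file. */
    *target_offset = (stripe_index / l.stripe_count) * l.stripe_size
                   + offset % l.stripe_size;
}

int main(void)
{
    layout_t l = { .stripe_size = 4u << 20, .stripe_count = 4 };  /* 4 MiB x 4 targets */
    uint64_t offsets[] = { 0, 5u << 20, 20u << 20 };
    for (int i = 0; i < 3; i++) {
        uint32_t t; uint64_t off;
        locate(l, offsets[i], &t, &off);
        printf("file offset %llu -> target %u, offset %llu\n",
               (unsigned long long)offsets[i], t, (unsigned long long)off);
    }
    return 0;
}
```

Because consecutive stripes land on different targets, large sequential transfers from many clients naturally spread across all of a file's targets, which is where the aggregate bandwidth comes from.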
Shared Namespace and Global View
A parallel filesystem usually presents a single, shared namespace across the entire cluster:
- Every node sees the same path names (e.g. /scratch/projectX/run42)
- Processes on different nodes can operate on the same files concurrently
- No need to copy results between nodes just to share them
This global view is crucial for:
- MPI or hybrid applications doing parallel I/O
- Post-processing and visualization on login or analysis nodes
- Workflow tools that run different stages on different nodes
Consistency and Concurrency Semantics
Because many processes may access the same file:
- The filesystem must define when writes by one process become visible to others
- Locking and caching strategies are needed to prevent corruption
In HPC parallel filesystems:
- POSIX semantics are often approximated or implemented with optimizations
- Some systems offer relaxed semantics or require specific I/O patterns for best performance (e.g. collective or coordinated I/O)
In practice, you’ll see recommendations like:
- Avoid many processes appending to a single file simultaneously
- Prefer collective writes or rank-local files combined later
- Use parallel I/O libraries (MPI-IO, HDF5, NetCDF) that understand the filesystem
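As a minimal sketch of the “non-overlapping regions plus collective I/O” recommendation, the program below uses standard MPI-IO calls; the file name and block size are placeholder assumptions, not a prescription for any particular system.

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank writes one contiguous, non-overlapping block of a shared file
 * using a collective MPI-IO call, so the MPI library can coordinate and
 * aggregate requests into filesystem-friendly sizes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const MPI_Offset block = 4 * 1024 * 1024;      /* 4 MiB per rank (placeholder) */
    char *buf = malloc(block);
    for (MPI_Offset i = 0; i < block; i++) buf[i] = (char)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",   /* placeholder path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank r owns bytes [r*block, (r+1)*block): no overlap, no lock conflicts. */
    MPI_File_write_at_all(fh, rank * block, buf, (int)block, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Because the regions do not overlap, the filesystem has no conflicting writes to serialize, and the collective call lets the MPI library combine the per-rank requests into fewer, larger operations.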
Why Parallel Filesystems Matter in HPC Workflows
Parallel filesystems address several typical HPC needs:
- Checkpoint/restart: Many ranks write large state files at once. A parallel filesystem can complete this quickly, so jobs spend less time in I/O.
- Large data sets: Simulations and data analysis may produce and consume terabytes or petabytes. Parallel filesystems scale well beyond what a single server can provide.
- Shared software and input data: Many jobs may read from the same input datasets or shared software stack concurrently.
- Multi-user, multi-job concurrency: The same filesystem must serve many independent jobs with acceptable performance.
Common Usage Patterns and Best Practices (Conceptual)
Implementation-specific commands and options vary (and are covered in the NFS, Lustre, and GPFS sections), but some high-level patterns are common to most parallel filesystems.
Directories for Different Purposes
Clusters often provide multiple parallel filesystems or multiple top-level directories with different characteristics:
- Home (e.g. /home):
  - Backed up
  - Size and I/O limits
  - Intended for scripts, configuration, small files
- Project or work (e.g. /project, /work):
  - Larger quotas
  - Good for shared code, medium data
- Scratch (e.g. /scratch, /tmp/project):
  - Very large capacity
  - Optimized for high throughput
  - Often not backed up, with purge policies
These may all be instances of a parallel filesystem but tuned for different workloads.
File Size and Count Considerations
Parallel filesystems are optimized for large, streaming I/O:
- Large files (MB–GB–TB) accessed sequentially leverage striping well
- Millions of tiny files stress metadata and reduce throughput
Typical good practices conceptually:
- Prefer fewer, larger files over many tiny ones
- Use parallel-aware file formats (HDF5, NetCDF, ADIOS, etc.)
- Avoid using shared parallel filesystems for temporary compiler outputs or per-step logs where possible
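For example, rather than emitting one small file per record or per step, an application can buffer records in memory and flush them as larger writes to a single file. The sketch below is only illustrative; the record and batch sizes are arbitrary assumptions.

```c
#include <stdio.h>
#include <string.h>

/* Sketch: buffer many small records in memory and flush them as one large
 * write, instead of creating one tiny file per record. Sizes are illustrative. */
#define RECORD_SIZE  256          /* bytes per record (hypothetical) */
#define BATCH        4096         /* records per flush -> 1 MiB writes */

static char buffer[RECORD_SIZE * BATCH];
static size_t n_buffered = 0;

static void add_record(FILE *out, const char *record)
{
    memcpy(buffer + n_buffered * RECORD_SIZE, record, RECORD_SIZE);
    if (++n_buffered == BATCH) {                 /* flush as one large write */
        fwrite(buffer, RECORD_SIZE, n_buffered, out);
        n_buffered = 0;
    }
}

int main(void)
{
    FILE *out = fopen("records.bin", "wb");      /* one file, not millions */
    if (!out) return 1;
    char record[RECORD_SIZE] = {0};
    for (int i = 0; i < 100000; i++) {
        snprintf(record, sizeof record, "record %d", i);
        add_record(out, record);
    }
    if (n_buffered) fwrite(buffer, RECORD_SIZE, n_buffered, out);  /* final flush */
    fclose(out);
    return 0;
}
```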
Access Patterns
How your application reads and writes data has a large effect on performance:
- Sequential, contiguous I/O: aligns well with striping
- Random, small I/O: leads to many small requests, poor bandwidth
- Many processes writing to unique, non-overlapping regions: works well when combined with collective or well-structured I/O
- Many processes writing to the same offset or appending: creates contention and locking overhead
Parallel I/O libraries and runtime systems can transform application-level patterns into more filesystem-friendly ones.
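One common mechanism is MPI-IO hints. The sketch below assumes the widely used ROMIO layer: the hint names romio_cb_write and cb_buffer_size are ROMIO-specific and may be ignored or spelled differently by other MPI implementations.

```c
#include <mpi.h>

/* Sketch: pass MPI-IO hints when opening a shared file so the library can
 * reorganize many small, scattered writes into fewer large ones (collective
 * buffering). Hint names are ROMIO-specific and may be ignored elsewhere. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* enable collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MiB aggregation buffers */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",       /* placeholder path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes (e.g. MPI_File_write_at_all) go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```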
Performance and Tuning Considerations (High Level)
Specific tools and commands will be introduced along with each concrete filesystem (NFS, Lustre, GPFS). Here we focus on concepts that apply across many parallel filesystems.
Aggregate Bandwidth vs Single-Stream Performance
A key design goal is aggregate bandwidth:
- One single-threaded cp from a login node may not look spectacular
- But hundreds or thousands of processes reading/writing in parallel can collectively reach tens or hundreds of GB/s
As a user, this means:
- Evaluating performance with a full parallel job is more meaningful than a quick single-node test
- Oversubscribing the I/O system (too many processes per file, random access patterns) can still saturate servers or fragment requests and hurt performance
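A rough mental model, ignoring metadata, caching, and network topology:

$$
B_{\text{aggregate}} \approx \min\bigl(N_{\text{clients}} \cdot b_{\text{client}},\; N_{\text{servers}} \cdot b_{\text{server}}\bigr)
$$

A single stream is bounded by one client's term; only a job that spreads I/O across many clients (and many storage targets) can approach the server-side term.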
Stripe Settings
Where supported, admins or advanced users may control stripe settings per directory or per file:
- Larger stripe count:
  - Engages more storage targets
  - Potentially higher throughput for large files
  - But more overhead for tiny files
- Larger stripe size:
  - Good for large, streaming access patterns
  - Less suitable for very small random I/O
On many systems, sensible defaults are configured. Users typically only adjust striping for:
- Very large, regularly accessed files
- Performance-sensitive checkpoint files
- Shared input datasets accessed by many jobs
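As a purely hypothetical back-of-the-envelope example (assuming each storage target sustains about 2 GB/s and a job writes a 64 GiB checkpoint file), the stripe count bounds the best-case write time:

$$
t_{\text{write}} \gtrsim \frac{\text{file size}}{\text{stripe count} \times b_{\text{target}}}
\quad\Rightarrow\quad
\frac{64\ \text{GiB}}{1 \times 2\ \text{GB/s}} \approx 34\ \text{s}
\qquad \text{vs.} \qquad
\frac{64\ \text{GiB}}{8 \times 2\ \text{GB/s}} \approx 4.3\ \text{s}
$$

The actual numbers depend on your system; the point is that a single-target layout caps large-file bandwidth no matter how capable the rest of the machine is.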
Contention and Fair Use
Parallel filesystems are shared resources, so:
- One user’s massive, uncoordinated I/O can slow others
- Job schedulers may have policies to limit I/O-heavy jobs or direct them to specific filesystems
Typical guidelines (policy-specific details come from your center):
- Avoid heavy I/O in tight loops unless necessary
- Buffer data in memory and write in larger chunks
- Use scratch storage for large temporary data, not home directories
Reliability, Redundancy, and Failure Handling
Parallel filesystems must balance performance with reliability:
- Redundant storage: RAID, erasure coding, and replication are common to survive disk or server failures
- Metadata replication: multiple metadata servers or failover mechanisms
- Monitoring and repair tools: background scrubbing, rebalancing, and integrity checks
From a user perspective, this usually shows up as:
- Occasional slowdowns during rebuilds or maintenance
- Recommendations to use checkpointing so jobs can survive transient issues
- Quotas and policies that help maintain health and manage capacity growth
Integration with Parallel I/O Libraries and Tools
Parallel filesystems are often used together with:
- MPI-IO: defines collective and independent parallel I/O on top of MPI communicators
- High-level I/O libraries:
  - HDF5, parallel NetCDF, ADIOS, etc.
  - Provide structured data formats, metadata, and parallel access patterns
- Domain-specific frameworks that hide filesystem details from end users
The filesystem itself does not interpret your application’s data format, but it must support:
- Many concurrent opens/closes
- Efficient large-block reads/writes
- Locking and consistency models compatible with these libraries
Using these higher-level tools often yields better performance and portability than manual fopen, read, and write loops scattered through application code.
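To make this concrete, here is a minimal sketch of a collective write through parallel HDF5, assuming an MPI-enabled HDF5 build; the file name, dataset name, and sizes are placeholders. Each rank writes its own hyperslab of one shared dataset in one shared file, and the library decides how to turn that into filesystem requests.

```c
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Open one shared HDF5 file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("result.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One 1-D dataset of N * size doubles; each rank owns a contiguous slab. */
    const hsize_t N = 1024 * 1024;
    hsize_t dims[1] = { N * (hsize_t)size };
    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start[1] = { N * (hsize_t)rank }, count[1] = { N };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    double *buf = malloc(N * sizeof *buf);
    for (hsize_t i = 0; i < N; i++) buf[i] = (double)rank;

    /* Collective write: ranks cooperate on one large, well-formed request. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    /* Release handles and finalize. */
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    free(buf);
    MPI_Finalize();
    return 0;
}
```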
How Parallel Filesystems Differ from NFS, Lustre, and GPFS Sections
Within this course:
- This chapter focuses on general principles of parallel filesystems common across many technologies.
- The NFS, Lustre, and GPFS subsections discuss:
- Their specific architectures and design choices
- Their concrete command-line tools and tuning knobs
- Their particular semantics and center-specific usage guidelines
Understanding the general concepts here—data/metadata separation, striping, aggregate bandwidth, and access patterns—will help you interpret and apply the more detailed, system-specific information in those subsequent chapters.