What Makes a Filesystem “Parallel” in HPC
A parallel filesystem is designed so that many clients (compute nodes) can read and write many files concurrently, with the data striped across multiple storage servers and disks. The main goals are:
- High aggregate bandwidth for many nodes at once
- Scalable metadata management (file names, directories, permissions)
- Consistent access semantics suitable for batch and parallel jobs
In contrast to a simple network filesystem on a single server, a parallel filesystem distributes both:
- Data: file contents are split into chunks and stored across multiple storage targets
- Metadata: information about files and directories is often managed by dedicated metadata services
From the user's point of view, you usually see one mount point (e.g. /home, /project, /scratch), but behind that path there may be:
- Dozens or hundreds of disks
- Multiple storage servers
- Separate networks for client–storage traffic
Key Concepts and Architecture
Data and Metadata Separation
Parallel filesystems typically distinguish between:
- Metadata servers (MDS) manage:
  - File and directory names
  - Ownership and permissions
  - Timestamps
  - Directory hierarchy
- Object or storage servers (OSS/OSD) store:
  - Actual file data in “objects” or “extents”
  - Often backed by RAID arrays or other redundancy
Why this matters in HPC:
- Metadata operations (creating many files, listing directories) can be bottlenecks if not scaled properly
- Data throughput (reading/writing large arrays, checkpoints) is handled separately and can be scaled by adding more storage servers
As a user, be aware that metadata-heavy operations such as ls -R over large trees, creating millions of tiny files, or compiling large codebases on shared storage can stress the metadata system.
Striping
Striping is the core data layout mechanism in parallel filesystems:
- A large file is divided into fixed-size chunks, e.g. 1 MB, 4 MB, 16 MB, etc.
- These chunks are spread across multiple storage targets (disks/servers)
- Multiple clients can access different stripes simultaneously
Conceptually, for a file:
$$
\text{file\_data} = \text{stripe}_0 + \text{stripe}_1 + \dots + \text{stripe}_n
$$
Each stripe may live on a different storage server. When many processes read or write different parts of the file, the combined throughput can approach the aggregate bandwidth of all those servers.
Important characteristics:
- Stripe count: how many storage targets a file uses
- Stripe size: how big each chunk is before moving to the next target
Choosing stripe parameters is an important tuning lever for large parallel I/O patterns (often configured per-directory or per-file by the user or admin).
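As a rough illustration, the sketch below assumes a plain round-robin layout with made-up parameters (4 MiB stripes across 4 targets) and computes which storage target a given byte offset lands on; real filesystems store the actual layout per file and the mapping details vary.

```c
#include <stdio.h>
#include <stdint.h>

/* Minimal sketch of a round-robin striping layout.
 * stripe_size and stripe_count are hypothetical tuning values;
 * real filesystems keep the actual layout per file. */
typedef struct {
    uint64_t stripe_size;   /* bytes per stripe before moving to the next target */
    uint32_t stripe_count;  /* number of storage targets used by the file */
} layout_t;

/* Map a byte offset in the file to (target index, offset within that target). */
static void locate(layout_t l, uint64_t offset,
                   uint32_t *target, uint64_t *target_offset)
{
    uint64_t stripe_index = offset / l.stripe_size;        /* which stripe */
    *target = (uint32_t)(stripe_index % l.stripe_count);   /* round-robin target */
    /* Each target holds every stripe_count-th stripe of the file. */
    *target_offset = (stripe_index / l.stripe_count) * l.stripe_size
                   + offset % l.stripe_size;
}

int main(void)
{
    layout_t l = { .stripe_size = 4u << 20, .stripe_count = 4 };  /* 4 MiB x 4 targets */
    uint64_t offsets[] = { 0, 5u << 20, 20u << 20 };
    for (int i = 0; i < 3; i++) {
        uint32_t t; uint64_t off;
        locate(l, offsets[i], &t, &off);
        printf("file offset %llu -> target %u, offset %llu\n",
               (unsigned long long)offsets[i], t, (unsigned long long)off);
    }
    return 0;
}
```

Because consecutive stripes land on different targets, large sequential transfers from many clients naturally spread across all of a file's targets, which is where the aggregate bandwidth comes from.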
Shared Namespace and Global View
A parallel filesystem usually presents a single, shared namespace across the entire cluster:
- Every node sees the same path names (e.g. /scratch/projectX/run42)
- Processes on different nodes can operate on the same files concurrently
- No need to copy results between nodes just to share them
This global view is crucial for:
- MPI or hybrid applications doing parallel I/O
- Post-processing and visualization on login or analysis nodes
- Workflow tools that run different stages on different nodes
Consistency and Concurrency Semantics
Because many processes may access the same file:
- The filesystem must define when writes by one process become visible to others
- Locking and caching strategies are needed to prevent corruption
In HPC parallel filesystems:
- POSIX semantics are often approximated or implemented with optimizations
- Some systems offer relaxed semantics or require specific I/O patterns for best performance (e.g. collective or coordinated I/O)
In practice, you’ll see recommendations like:
- Avoid many processes appending to a single file simultaneously
- Prefer collective writes or rank-local files combined later
- Use parallel I/O libraries (MPI-IO, HDF5, NetCDF) that understand the filesystem
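As a minimal sketch of the “non-overlapping regions plus collective I/O” recommendation, the program below uses standard MPI-IO calls; the file name and block size are placeholder assumptions, not a prescription for any particular system.

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank writes one contiguous, non-overlapping block of a shared file
 * using a collective MPI-IO call, so the MPI library can coordinate and
 * aggregate requests into filesystem-friendly sizes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const MPI_Offset block = 4 * 1024 * 1024;      /* 4 MiB per rank (placeholder) */
    char *buf = malloc(block);
    for (MPI_Offset i = 0; i < block; i++) buf[i] = (char)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",   /* placeholder path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank r owns bytes [r*block, (r+1)*block): no overlap, no lock conflicts. */
    MPI_File_write_at_all(fh, rank * block, buf, (int)block, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Because the regions do not overlap, the filesystem has no conflicting writes to serialize, and the collective call lets the MPI library combine the per-rank requests into fewer, larger operations.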
Why Parallel Filesystems Matter in HPC Workflows
Parallel filesystems address several typical HPC needs:
- Checkpoint/restart: Many ranks write large state files at once. A parallel filesystem can complete this quickly, so jobs spend less time in I/O.
- Large data sets: Simulations and data analysis may produce and consume terabytes or petabytes. Parallel filesystems scale well beyond what a single server can provide.
- Shared software and input data: Many jobs may read from the same input datasets or shared software stack concurrently.
- Multi-user, multi-job concurrency: The same filesystem must serve many independent jobs with acceptable performance.
Common Usage Patterns and Best Practices (Conceptual)
Implementation-specific commands and options vary (and are covered in the NFS, Lustre, and GPFS sections), but some high-level patterns are common to most parallel filesystems.
Directories for Different Purposes
Clusters often provide multiple parallel filesystems or multiple top-level directories with different characteristics:
- Home (e.g. /home):
  - Backed up
  - Size and I/O limits
  - Intended for scripts, configuration, small files
- Project or work (e.g. /project, /work):
  - Larger quotas
  - Good for shared code, medium data
- Scratch (e.g. /scratch, /tmp/project):
  - Very large capacity
  - Optimized for high throughput
  - Often not backed up, with purge policies
These may all be instances of a parallel filesystem but tuned for different workloads.
File Size and Count Considerations
Parallel filesystems are optimized for large, streaming I/O:
- Large files (MB–GB–TB) accessed sequentially leverage striping well
- Millions of tiny files stress metadata and reduce throughput
Typical good practices conceptually:
- Prefer fewer, larger files over many tiny ones
- Use parallel-aware file formats (HDF5, NetCDF, ADIOS, etc.)
- Avoid using shared parallel filesystems for temporary compiler outputs or per-step logs where possible
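For example, rather than emitting one small file per record or per step, an application can buffer records in memory and flush them as larger writes to a single file. The sketch below is only illustrative; the record and batch sizes are arbitrary assumptions.

```c
#include <stdio.h>
#include <string.h>

/* Sketch: buffer many small records in memory and flush them as one large
 * write, instead of creating one tiny file per record. Sizes are illustrative. */
#define RECORD_SIZE  256          /* bytes per record (hypothetical) */
#define BATCH        4096         /* records per flush -> 1 MiB writes */

static char buffer[RECORD_SIZE * BATCH];
static size_t n_buffered = 0;

static void add_record(FILE *out, const char *record)
{
    memcpy(buffer + n_buffered * RECORD_SIZE, record, RECORD_SIZE);
    if (++n_buffered == BATCH) {                 /* flush as one large write */
        fwrite(buffer, RECORD_SIZE, n_buffered, out);
        n_buffered = 0;
    }
}

int main(void)
{
    FILE *out = fopen("records.bin", "wb");      /* one file, not millions */
    if (!out) return 1;
    char record[RECORD_SIZE] = {0};
    for (int i = 0; i < 100000; i++) {
        snprintf(record, sizeof record, "record %d", i);
        add_record(out, record);
    }
    if (n_buffered) fwrite(buffer, RECORD_SIZE, n_buffered, out);  /* final flush */
    fclose(out);
    return 0;
}
```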
Access Patterns
How your application reads and writes data has a large effect on performance:
- Sequential, contiguous I/O: aligns well with striping
- Random, small I/O: leads to many small requests, poor bandwidth
- Many processes writing to unique, non-overlapping regions: works well when combined with collective or well-structured I/O
- Many processes writing to the same offset or appending: creates contention and locking overhead
Parallel I/O libraries and runtime systems can transform application-level patterns into more filesystem-friendly ones.
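One common mechanism is MPI-IO hints. The sketch below assumes the widely used ROMIO layer: the hint names romio_cb_write and cb_buffer_size are ROMIO-specific and may be ignored or spelled differently by other MPI implementations.

```c
#include <mpi.h>

/* Sketch: pass MPI-IO hints when opening a shared file so the library can
 * reorganize many small, scattered writes into fewer large ones (collective
 * buffering). Hint names are ROMIO-specific and may be ignored elsewhere. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* enable collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MiB aggregation buffers */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",       /* placeholder path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes (e.g. MPI_File_write_at_all) go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```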
Performance and Tuning Considerations (High Level)
Specific tools and commands will be introduced along with each concrete filesystem (NFS, Lustre, GPFS). Here we focus on concepts that apply across many parallel filesystems.
Aggregate Bandwidth vs Single-Stream Performance
A key design goal is aggregate bandwidth:
- One single-threaded cp from a login node may not look spectacular
- But hundreds or thousands of processes reading/writing in parallel can collectively reach tens or hundreds of GB/s
As a user, this means:
- Evaluating performance with a full parallel job is more meaningful than a quick single-node test
- Oversubscribing the I/O system (too many processes per file, random access patterns) can still saturate servers or fragment requests and hurt performance
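A rough mental model, ignoring metadata, caching, and network topology:

$$
B_{\text{aggregate}} \approx \min\bigl(N_{\text{clients}} \cdot b_{\text{client}},\; N_{\text{servers}} \cdot b_{\text{server}}\bigr)
$$

A single stream is bounded by one client's term; only a job that spreads I/O across many clients (and many storage targets) can approach the server-side term.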
Stripe Settings
Where supported, admins or advanced users may control stripe settings per directory or per file:
- Larger stripe count:
  - Engages more storage targets
  - Potentially higher throughput for large files
  - But more overhead for tiny files
- Larger stripe size:
  - Good for large, streaming access patterns
  - Less suitable for very small random I/O
On many systems, sensible defaults are configured. Users typically only adjust striping for:
- Very large, regularly accessed files
- Performance-sensitive checkpoint files
- Shared input datasets accessed by many jobs
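As a purely hypothetical back-of-the-envelope example (assuming each storage target sustains about 2 GB/s and a job writes a 64 GiB checkpoint file), the stripe count bounds the best-case write time:

$$
t_{\text{write}} \gtrsim \frac{\text{file size}}{\text{stripe count} \times b_{\text{target}}}
\quad\Rightarrow\quad
\frac{64\ \text{GiB}}{1 \times 2\ \text{GB/s}} \approx 34\ \text{s}
\qquad \text{vs.} \qquad
\frac{64\ \text{GiB}}{8 \times 2\ \text{GB/s}} \approx 4.3\ \text{s}
$$

The actual numbers depend on your system; the point is that a single-target layout caps large-file bandwidth no matter how capable the rest of the machine is.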
Contention and Fair Use
Parallel filesystems are shared resources, so:
- One user’s massive, uncoordinated I/O can slow others
- Job schedulers may have policies to limit I/O-heavy jobs or direct them to specific filesystems
Typical guidelines (policy-specific details come from your center):
- Avoid heavy I/O in tight loops unless necessary
- Buffer data in memory and write in larger chunks
- Use scratch storage for large temporary data, not home directories
Reliability, Redundancy, and Failure Handling
Parallel filesystems must balance performance with reliability:
- Redundant storage: RAID, erasure coding, and replication are common to survive disk or server failures
- Metadata replication: multiple metadata servers or failover mechanisms
- Monitoring and repair tools: background scrubbing, rebalancing, and integrity checks
From a user perspective, this usually shows up as:
- Occasional slowdowns during rebuilds or maintenance
- Recommendations to use checkpointing so jobs can survive transient issues
- Quotas and policies that help maintain health and manage capacity growth
Integration with Parallel I/O Libraries and Tools
Parallel filesystems are often used together with:
- MPI-IO: defines collective and independent parallel I/O on top of MPI communicators
- High-level I/O libraries:
  - HDF5, parallel NetCDF, ADIOS, etc.
  - Provide structured data formats, metadata, and parallel access patterns
- Domain-specific frameworks that hide filesystem details from end users
The filesystem itself does not interpret your application’s data format, but it must support:
- Many concurrent opens/closes
- Efficient large-block reads/writes
- Locking and consistency models compatible with these libraries
Using these higher-level tools often yields better performance and portability than manual fopen, read, and write loops scattered through application code.
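To make this concrete, here is a minimal sketch of a collective write through parallel HDF5, assuming an MPI-enabled HDF5 build; the file name, dataset name, and sizes are placeholders. Each rank writes its own hyperslab of one shared dataset in one shared file, and the library decides how to turn that into filesystem requests.

```c
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Open one shared HDF5 file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("result.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One 1-D dataset of N * size doubles; each rank owns a contiguous slab. */
    const hsize_t N = 1024 * 1024;
    hsize_t dims[1] = { N * (hsize_t)size };
    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start[1] = { N * (hsize_t)rank }, count[1] = { N };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    double *buf = malloc(N * sizeof *buf);
    for (hsize_t i = 0; i < N; i++) buf[i] = (double)rank;

    /* Collective write: ranks cooperate on one large, well-formed request. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    /* Release handles and finalize. */
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    free(buf);
    MPI_Finalize();
    return 0;
}
```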
How Parallel Filesystems Differ from NFS, Lustre, and GPFS Sections
Within this course:
- This chapter focuses on general principles of parallel filesystems common across many technologies.
- The NFS, Lustre, and GPFS subsections discuss:
- Their specific architectures and design choices
- Their concrete command-line tools and tuning knobs
- Their particular semantics and center-specific usage guidelines
Understanding the general concepts here—data/metadata separation, striping, aggregate bandwidth, and access patterns—will help you interpret and apply the more detailed, system-specific information in those subsequent chapters.