
Parallel filesystems

Overview

Parallel filesystems are designed to provide high throughput and scalable access to data for many users and many compute nodes at the same time. They are a core part of modern HPC clusters, because they allow hundreds or thousands of processes to read and write data concurrently without becoming a single bottleneck.

A traditional filesystem is usually tied to a single server, which stores both metadata and file contents on its local disks. Parallel filesystems split these responsibilities across multiple servers and multiple disks, and they coordinate how client nodes access them. The goal is to allow I/O operations to scale as you add more hardware, in the same way you want computation to scale when you add more nodes.

In this chapter the focus is on concepts that are specific to parallel filesystems as used in HPC, not on general storage or filesystem basics, which are discussed elsewhere.

Basic idea of parallel file access

The key idea behind a parallel filesystem is that many nodes can access the same files at the same time, and the system can serve these requests from many disks and servers in parallel. Instead of one central server handling all read and write operations, the filesystem distributes data across several storage targets, and client nodes talk to all of them.

In a parallel filesystem, a single large file is usually split into chunks, often called stripes. These stripes are placed on multiple storage devices, for example disks or SSDs. When a program reads a large file, different parts of the file can be read from different devices in parallel. When many processes read or write different parts of the same file, the system can serve all of these operations concurrently.

The client software on each node understands how files are split across storage. When an application makes a standard POSIX file call, such as open, read, or write, the client translates these calls into parallel I/O operations to the appropriate servers. From the application perspective, the filesystem still appears as a single directory tree mounted at some path like /scratch or /fsx, even though the underlying storage is spread out.
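To make this concrete, here is a minimal sketch in Python, assuming a hypothetical path under /scratch. The os-level calls map directly onto POSIX open, read, and write; on a parallel filesystem mount they behave just as they would on a local disk, while the client software routes the requests to the appropriate servers behind the scenes.

```python
import os

# Hypothetical path on a parallel filesystem mount; adjust to your site's layout.
path = "/scratch/example_user/results.dat"

# These calls map directly onto POSIX open/read/write. The parallel filesystem
# client translates them into requests to the metadata and data servers.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"checkpoint data\n")
os.close(fd)

fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 1 << 20)   # read back up to 1 MiB
os.close(fd)
print(len(data), "bytes read")
```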

A parallel filesystem increases performance by distributing data across multiple storage servers and allowing many concurrent clients to access different parts of the data at the same time.

Data striping and throughput

Striping is the core mechanism that lets parallel filesystems deliver high bandwidth. A file is divided into equal-sized chunks, and these chunks are placed in a round-robin pattern across a set of storage targets. A storage target might be a disk, an SSD, or a RAID group.

If the stripe size is $S$ bytes and the file is stored across $k$ storage targets, then the file layout alternates between these targets in units of $S$. Roughly, for a large sequential read that spans many stripes, up to $k$ targets can participate concurrently. In an idealized case, if each target can deliver $B$ GB/s, the theoretical peak throughput is

$$
B_{\text{total}} \approx k \cdot B.
$$

In practice, you never reach the full theoretical sum, but increasing the number of targets often increases achievable bandwidth.
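The round-robin layout and the idealized throughput estimate can be illustrated with a small sketch. The stripe size, stripe count, and per-target bandwidth below are made-up numbers, not values from any particular system.

```python
# Illustrative numbers only: S = 4 MiB stripes over k = 8 targets at B = 5 GB/s each.
stripe_size = 4 * 1024 * 1024
num_targets = 8
per_target_bw = 5.0  # GB/s

def target_for_offset(offset):
    """Index of the storage target holding the byte at this file offset."""
    return (offset // stripe_size) % num_targets

print(target_for_offset(0))                  # 0: first stripe on target 0
print(target_for_offset(5 * stripe_size))    # 5: sixth stripe on target 5
print(target_for_offset(9 * stripe_size))    # 1: wraps around after target 7

# Idealized peak bandwidth: B_total ≈ k * B
print(f"theoretical peak ≈ {num_targets * per_target_bw:.0f} GB/s")
```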

Stripe size and stripe count matter for performance. If many processes read or write large contiguous regions of a file, a larger stripe size and more storage targets can provide higher throughput. If many processes access small pieces of the file at random offsets, overly large stripes or too many stripes can lead to unnecessary contention.

As a user you may have limited control over striping. On some systems an administrator sets a default stripe pattern. On others, user tools allow you to set striping on a per-file or per-directory basis. It is common for performance tuning to involve experimenting with stripe counts and stripe sizes for large simulation output or checkpoint files.

When possible, align I/O sizes and access patterns with the stripe size, and avoid many small unaligned I/O operations, because these patterns degrade performance on parallel filesystems.
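As a rough illustration of what aligned, large I/O means in practice, the sketch below writes a file in 4 MiB chunks, matching a hypothetical 4 MiB stripe size; the path and sizes are placeholders, not recommendations for any specific system.

```python
# Write output in large, stripe-sized chunks instead of many tiny writes.
# The 4 MiB chunk size matches an assumed 4 MiB stripe size.
chunk_size = 4 * 1024 * 1024
total_size = 64 * 1024 * 1024
payload = bytes(chunk_size)          # placeholder data (all zeros)

with open("/scratch/example_user/output.bin", "wb") as f:
    written = 0
    while written < total_size:
        f.write(payload)             # one large, aligned write per iteration
        written += chunk_size
```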

Metadata and data separation

Parallel filesystems separate metadata operations from data operations. Metadata includes information such as directory hierarchies, filenames, file sizes, ownership, permissions, and timestamps. Data refers to the actual file contents.

In a typical design, a set of metadata servers holds the directory tree and the mapping from filename and offset to specific storage targets. A set of object or data servers stores the file contents. When a client opens a file, it contacts the metadata server to resolve the path and to obtain the layout of the file, including which storage targets hold which stripes. Once the layout is known, most read and write operations go directly between the client and the data servers, so that the metadata server does not become the main bottleneck.

Operations like creating many small files, listing very large directories, or changing permissions stress the metadata portion of the filesystem. Operations like reading huge datasets or writing checkpoint files stress the data layer. HPC workloads often do a mix of both.

From a practical standpoint, this separation explains why one kind of operation might be slow while another seems fast. For example, ls on a directory with millions of files might appear to hang, even though reading a single huge file in that directory is fast. Users often confuse metadata performance with data performance, and understanding this distinction helps you interpret filesystem behavior.
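One small, practical consequence: when you do need to scan a big directory, tools that avoid a separate stat call per entry put less load on the metadata servers. The sketch below uses Python's os.scandir, whose directory entries can often answer is_file() from the directory listing itself; the path is a placeholder.

```python
import os

# Count regular files in a (hypothetical) large directory with a single pass.
# DirEntry.is_file() can often be answered without an extra stat() per entry,
# which keeps the number of metadata requests down.
big_dir = "/scratch/example_user/many_files"

n_files = sum(1 for entry in os.scandir(big_dir) if entry.is_file())
print(n_files, "regular files in", big_dir)
```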

Metadata intensive operations, such as creating many small files or scanning huge directories, are often the performance weak spot of parallel filesystems, even when large sequential I/O is very fast.

Client view and POSIX semantics

From an application point of view, parallel filesystems usually provide a POSIX-like interface. Programs use the same system calls that they would use on a local filesystem. This compatibility is helpful in HPC, because many scientific codes assume POSIX semantics.

However, strict POSIX guarantees, such as immediately consistent file views across all nodes, are expensive to provide at large scale. Some parallel filesystems relax certain guarantees or offer tunable modes that trade strict consistency for higher performance. For example, client-side caching and write-back buffering can improve throughput but may delay visibility of changes to other nodes.

As a user you typically do not manipulate these low level details directly. What you do need to know is that different patterns of file sharing can behave differently. Many parallel applications avoid having multiple processes write to the same offset in a file at the same time, because even if the filesystem technically supports this, performance and reproducibility can suffer.

When you design your I/O, you should think about how your parallel program uses files. For example, you can choose between each process writing its own file, all processes writing to a single shared file, or some hybrid. Parallel I/O libraries and MPI I/O exist to help organize these patterns more safely and efficiently, but the underlying filesystem semantics still matter.
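As one possible pattern, the sketch below uses MPI I/O through mpi4py (assumed to be installed) to let every rank write a disjoint block of a single shared file, which avoids both overlapping writes and a flood of per-process files. The file name and block size are placeholders.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Fixed-size record per rank, written at a disjoint offset so no two ranks
# ever touch the same bytes.
record = f"data from rank {rank}\n".ljust(64).encode()
offset = rank * len(record)

amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
fh = MPI.File.Open(comm, "shared_output.dat", amode)
fh.Write_at_all(offset, record)    # collective write, one region per rank
fh.Close()
```

Launched with several ranks under your system's MPI launcher, this produces a single shared output file instead of one file per process.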

Common usage patterns in HPC

Parallel filesystems on HPC systems are often exposed through specific mount points that are intended for particular use cases. A cluster might have:

A high performance scratch filesystem, often named /scratch or similar, which lives on a parallel filesystem and is intended for temporary data, large outputs, and checkpoints.

A home filesystem, often exported from a different storage system, intended for long term storage of code, configuration, and small datasets, with stronger backup policies but lower performance.

A project or work filesystem, used for shared datasets and collaborative projects, usually also backed by scalable storage.

The parallel filesystem is most relevant for performance sensitive workloads that read or write large amounts of data or that need many nodes to access the same data concurrently. Typical uses include reading simulation input data, writing time step outputs, storing checkpoints, and reading or writing training data for machine learning jobs.

Users usually access the filesystem directly with standard tools such as cp, mv, ls, rm, and with standard file APIs in their programs. However, some centers provide specific tools to inspect stripe settings or to manage quotas, and they publish policies about what can be stored where, how long files can remain on scratch, and how much capacity each user or project can consume.

Typical performance characteristics

Parallel filesystems are designed to deliver extremely high aggregate bandwidth to large I/O operations. They are usually optimized for workloads with many clients, a relatively small number of very large files, and mostly sequential access patterns. Random access patterns involving many small files and tiny I/O operations are harder for them to serve efficiently.

As a rough rule of thumb, parallel filesystems like to handle I/O in large chunks. For example, reading or writing data in blocks of at least several megabytes usually performs better than a pattern that uses many small kilobyte sized operations. Buffered I/O in higher level languages can sometimes hide inefficient patterns, but eventually the filesystem sees the true access pattern.
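The difference is easy to express in code. The sketch below streams a file in 8 MiB requests; the path and chunk size are placeholders, and the same loop with a few-kilobyte chunk size would generate orders of magnitude more requests for the same amount of data.

```python
# Stream a large input file in 8 MiB chunks rather than many tiny reads.
chunk_size = 8 * 1024 * 1024
total = 0

with open("/scratch/example_user/big_input.dat", "rb") as f:
    while True:
        chunk = f.read(chunk_size)   # one large request per iteration
        if not chunk:
            break
        total += len(chunk)

print(total, "bytes read")
```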

Latency for small operations is often higher than on a local SSD, because requests must travel across the network to reach the storage servers and metadata servers. Therefore a single process doing small random reads from a parallel filesystem may not see impressive speed. The design focuses on providing high throughput when many processes access data concurrently.

Another common characteristic is variable performance under load. Since multiple users and projects share the same storage hardware, performance can change depending on what others are doing. If multiple jobs simultaneously perform large writes or heavy metadata operations, each job might see reduced throughput. This is a shared resource, and you must plan for some variability.

Parallel filesystems are optimized for large, sequential, and concurrent I/O, and they perform poorly with many tiny, random operations or huge numbers of small files.

Reliability, quotas, and policies

Because parallel filesystems store critical research data and serve many users, they are built with redundancy and monitored carefully. Storage targets are often configured with RAID or similar mechanisms to protect against disk failures. Metadata servers may be replicated, and failover strategies are in place to keep the filesystem available.

Despite this, parallel filesystems are not always intended for permanent archival storage. High performance scratch filesystems frequently have no backups, and they may have policies that automatically delete files after some period. Project or home filesystems may have backup policies, but they often use different storage technologies, and they may not provide the same performance.

Most HPC centers enforce quotas on parallel filesystems. A quota might limit total capacity in gigabytes or terabytes, and it might also limit the number of files, sometimes called inodes. Since metadata handling is expensive at scale, controlling file counts is important. Users who generate millions of tiny files can cause serious strain on the system.
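If you want a rough idea of how close you are to a file-count quota, a simple walk over your own directory tree is enough; note that such a scan is itself metadata heavy, so it should be run occasionally rather than in a loop. The path below is a placeholder.

```python
import os

# Count files under a directory tree to compare against an inode quota.
# Run sparingly: the walk itself generates many metadata requests.
root = "/scratch/example_user/project"

n_files = sum(len(filenames) for _, _, filenames in os.walk(root))
print(n_files, "files under", root)
```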

Understanding the policies for each filesystem is essential. Keeping large numbers of small intermediate files on a high performance parallel filesystem is often discouraged. Instead, centers may ask users to aggregate outputs into fewer large files, or to move data off the parallel filesystem when it is no longer needed for active computation.

User level best practices

As an HPC beginner, you cannot change how the parallel filesystem is configured, but you can choose how to use it. There are a few simple habits that make a significant difference:

Prefer fewer large files over many small files when possible. For example, if your program writes one file per time step per process, consider aggregating outputs or using libraries that support collective I/O, as sketched after this list.

Use the parallel filesystem only for active data. Move long term archives to dedicated archival or object storage if your center provides such services.

Avoid running unnecessary metadata intensive commands during heavy usage. For example, repeatedly calling ls -R on entire directory trees or using tools that scan millions of files in tight loops can stress metadata servers.

Be explicit about where your job stores temporary data. Direct large I/O to the high performance scratch filesystem instead of your home directory, unless the documentation of your system indicates otherwise.

When using many processes, coordinate I/O where possible so that you are not creating a massive number of small independent files. Parallel I/O libraries, MPI I/O, and higher level data formats such as HDF5 or NetCDF can help, but even without these, you can adopt simple strategies to reduce file counts.
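As a minimal sketch of the aggregation idea from the first habit above, the example below collects per-time-step arrays into one HDF5 file instead of writing one small file per step. It assumes h5py and numpy are available; the file name, dataset names, and array shapes are placeholders.

```python
import numpy as np
import h5py

# One HDF5 file holding all time steps, instead of one small file per step.
with h5py.File("simulation_output.h5", "w") as f:
    for step in range(100):
        data = np.random.rand(256, 256)            # stand-in for real results
        f.create_dataset(f"step_{step:04d}", data=data)
```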

For beginners, the most important rule is: use the parallel scratch filesystem for large, active data, and avoid creating huge numbers of small files or scanning large directory trees repeatedly.

Interaction with job scheduling and compute nodes

Parallel filesystems are shared among login nodes and compute nodes. The job scheduler allocates CPU and memory resources, but I/O to the parallel filesystem remains a shared service. This means that your running jobs might be affected by I/O from other users, and your I/O may affect them.

Compute nodes usually mount the parallel filesystem over the high performance interconnect. When your job runs, each process can access the same directories and files, and the filesystem takes care of coordinating access. This is what allows parallel codes to read shared input datasets and to produce shared output.

Since I/O is shared, some centers provide guidelines about when to perform very heavy I/O operations. For example, you might be asked to avoid doing extremely metadata intensive tasks on login nodes, or to schedule large data reorganizations as batch jobs that run off peak.

From a practical perspective, you should test I/O behavior on a smaller scale before launching extremely large jobs. This helps you detect issues such as slow directory scans, unexpected permission problems, or suboptimal file layouts, and it reduces the risk of overloading the filesystem with a poorly behaved pattern at full scale.

Relation to specific parallel filesystems

HPC centers deploy specific parallel filesystem technologies such as Lustre or GPFS (IBM Spectrum Scale), and in certain configurations they expose storage through NFS as well. Each of these has particular tools and commands for inspecting striping, quotas, and performance, and each has its own internal architecture.

From a beginner perspective, you do not need to learn all implementation details. You should focus on the general behaviors that all parallel filesystems share: separation of metadata and data, striping across multiple storage targets, sensitivity to access patterns, and the need to respect center policies on usage.

Later chapters will look at some concrete parallel filesystem technologies in more detail. There you will see how the general ideas presented here are realized in specific systems, and you will learn how to inspect and tune filesystem related settings for your jobs on real clusters.
