
4.8.3 GPFS

Overview of GPFS

The IBM General Parallel File System, now called IBM Spectrum Scale, is a parallel filesystem designed for high performance, scalability, and reliability on clusters and supercomputers. On many HPC systems you will find it as the shared filesystem that provides a single large namespace, for example /gpfs, /scratch, or /projects, which all nodes can access simultaneously.

From a user point of view, GPFS looks like a normal POSIX filesystem. You use the usual Linux commands, such as ls, cp, mv, and rm, and you open files in your code with standard I/O calls. The distinctive properties of GPFS come from how it distributes data and metadata across many disks and servers to provide parallel access, high throughput, and fault tolerance.

This chapter focuses on behavior and usage patterns that are specific to GPFS in an HPC context, not on general parallel filesystem concepts, which are covered elsewhere.

Basic Architecture Concepts

GPFS is a shared-disk parallel filesystem. Storage is organized into so-called Network Shared Disks, or NSDs. An NSD is a logical abstraction of one or more physical disks or LUNs that are made available to the GPFS cluster. NSD servers export these disks to the rest of the cluster. Client nodes, which include compute nodes and login nodes, run the GPFS client daemon and see the combined storage as a standard filesystem tree.

Internally, GPFS organizes disks into storage pools. A storage pool is a collection of NSDs with similar characteristics, for example a pool of fast SSDs used as a metadata pool or a pool of slower, high-capacity HDDs for bulk data. Files can be placed in specific pools according to policies, which allows the system to balance cost, capacity, and performance.

Data in GPFS is stored in blocks. A GPFS block is the basic unit of allocation on disks. Block sizes are configured by the administrators and can be much larger than the typical 4 KB filesystem block on a local disk. Typical GPFS block sizes in HPC range from hundreds of kilobytes to several megabytes. Larger blocks reduce metadata overhead and can significantly improve streaming I/O throughput for large files.

Metadata, such as directory contents, file attributes, and allocation maps, is distributed across metadata disks. On many clusters metadata is stored on faster devices than user data, which improves directory operations and file creation and deletion. From the user view, metadata distribution is invisible, but it influences performance, especially when applications create or remove many small files.

Striping and Parallel Data Access

GPFS achieves parallel I/O through striping across multiple disks. When a file is striped, its data blocks are spread across many NSDs. A read or write from multiple processes can then be served concurrently by many disks, which yields higher bandwidth than a single disk could provide.

Striping is governed by two main parameters. The first is block size, which is the size of each contiguous extent stored on a single NSD. The second is the number of data replicas or failure groups, which controls redundancy and placement for fault tolerance. The exact striping layout is often specified by the filesystem configuration and by policy rules, not by individual users.

Unlike some other parallel filesystems, GPFS does not usually expose an explicit per-file stripe count parameter to end users on typical HPC clusters. Instead, administrators configure defaults that are suitable for the workload. For advanced users, however, GPFS file placement and layout can be influenced by filesystem policies, storage pools, and, where permitted, explicit commands such as mmchattr, depending on site configuration. If your site allows user-level tuning, your documentation will specify which attributes you can change.

Striping is particularly beneficial when many MPI processes access different parts of a large file concurrently. GPFS can coordinate access through its distributed lock manager and can serve requests from different nodes in parallel. This is one of the reasons why GPFS is widely used on large supercomputers that run highly parallel applications.

For high throughput on GPFS, design your applications to perform large, contiguous I/O operations that align reasonably with the filesystem block layout, and avoid many tiny, scattered reads or writes.

Consistency, Locking, and Concurrency

When multiple processes access the same file concurrently, the filesystem must maintain a consistent view. GPFS uses a distributed, token-based locking mechanism. Tokens grant rights to read, write, or modify metadata for file regions or directories. These tokens are managed by a set of GPFS daemons running on cluster nodes.

From the user perspective, GPFS provides POSIX-compatible semantics for file I/O. This means that when a process closes a file after writing, subsequent opens and reads from other processes will see the data as committed. However, buffering in user space and in operating system caches can complicate the apparent behavior if applications rely on seeing unflushed data in real time. To maintain portability and consistency, HPC applications that perform concurrent writes often use MPI I/O collectives, filesystem barriers, or application-level synchronization rather than relying on implicit consistency.

Coarse-grained locks over large file regions can cause serialization when many processes write small pieces to the same file concurrently. This is not unique to GPFS, but the details of GPFS token management make certain access patterns more expensive. For instance, thousands of ranks appending small records to a shared log file can generate a lot of lock traffic and reduce throughput dramatically.

On the other hand, independent I/O operations to disjoint regions of a large file typically scale well. GPFS can issue locks at a block or range granularity and can grant tokens that allow many readers or independent writers to proceed in parallel, as long as they do not conflict on the same regions of the file.

Avoid fine-grained concurrent writes by many processes to overlapping file regions. Instead, use collective I/O or partition the file so that each process writes to a distinct, non-overlapping range.

Fault Tolerance and Reliability

A key feature of GPFS in production HPC environments is its support for fault tolerance. The filesystem is designed to survive disk failures and, with appropriate configuration, node failures without losing data or becoming unavailable.

GPFS groups NSDs into failure groups. A failure group represents a set of disks that might fail together, for example disks within the same RAID array or enclosure. The filesystem can replicate data and metadata across different failure groups. If an entire failure group becomes unavailable, GPFS can still serve data from replicas stored in other groups.

From a user point of view, replication is invisible. You see a single logical file, even if the filesystem keeps multiple copies of its blocks in the background. Replication is usually configured for metadata, which is critical for filesystem integrity, and optionally for data, depending on reliability and capacity tradeoffs.

In addition to replication, GPFS provides a filesystem checking and repair utility, mmfsck, and many maintenance operations can be performed online, without unmounting the filesystem. For HPC users, the main practical consequence is that administrators can maintain and grow the filesystem while it remains available to jobs, although performance may temporarily vary during maintenance operations.

Performance Characteristics in HPC

On HPC clusters, GPFS is often deployed on large pools of HDDs for capacity, sometimes combined with SSDs or NVMe devices as metadata or cache tiers. The performance you see as a user depends on the configuration, the network between nodes and storage servers, and your access pattern.

For large sequential reads and writes to a single file from a moderate number of processes, GPFS can deliver very high bandwidth when files are striped across many disks. This is ideal for many simulation codes that periodically write large checkpoint files and then read them back at restart.

For workloads with many small files, such as millions of per-process output files or temporary data fragments, metadata performance becomes dominant. GPFS scales metadata operations better than simple NFS, but can still become a bottleneck if applications create and remove small files excessively. This behavior is especially visible in project directories that hold deep directory trees and many tiny log files.

To give a simple mental model, suppose a large GPFS filesystem can sustain an aggregate bandwidth $B$ when serving sequential I/O to large stripes across $N$ disks. In ideal conditions, the total throughput can scale roughly as

$$ B \propto N $$

until you reach saturation of either the network or GPFS server CPU. However, for small random I/O, effective throughput falls far below this bound because each request carries significant overhead and often cannot be satisfied sequentially.

On GPFS, you will get the best performance if you aggregate I/O into fewer, larger files, reduce metadata-intensive operations such as stat calls on millions of files, and avoid per-time-step creation and deletion of small files.

Typical Usage Patterns on HPC Clusters

As a user of an HPC system with GPFS, you do not normally administer the filesystem. Instead, you interact with it by using the directories your site provides, such as home, scratch, and project spaces. Each of these may be separate GPFS filesystems or separate directories within one large GPFS instance.

Home directories on GPFS are typically configured for reliability and may have quotas for both capacity and number of files. Scratch spaces, also often on GPFS, are optimized for high throughput and may have higher quotas but weaker durability guarantees, such as purge policies for old or unused files. Project filesystems may provide a balance between the two.

Most sites document the expected use of each area. For instance, you might be asked to keep only source code and small configuration files in your home directory, store large simulation input and output in a project or scratch directory, and clean up temporary files regularly. This guidance is closely tied to how GPFS storage pools have been tuned under the hood.

GPFS supports advanced features such as snapshots and quota management. Snapshots allow administrators or, in some configurations, users to capture a read-only view of a filesystem or directory at a point in time. This can help with recovery from accidental deletion. Quotas enforce per-user or per-group limits on storage usage and sometimes on the number of inodes. If you encounter errors about quota limits on a GPFS filesystem, you will typically need to delete files or request quota changes, not attempt to bypass the error locally.

Application I/O Strategies on GPFS

Although general parallel I/O techniques are discussed in other chapters, it is helpful to connect a few of them specifically to GPFS behavior.

Many MPI applications use MPI I/O to write checkpoint and output data. MPI I/O can combine many small, non-contiguous requests into larger, aligned operations through collective I/O optimizations. On GPFS, this can significantly reduce lock contention and increase effective bandwidth. In simplified terms, instead of $P$ processes each writing a small chunk, the MPI library arranges for a smaller number of larger requests, which fit better with GPFS block and stripe alignment.
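A minimal collective-write sketch looks like the following; it must be compiled with an MPI compiler wrapper such as mpicc and launched under mpirun, so it is shown here as a pattern rather than a standalone program. Each rank contributes one contiguous chunk at a disjoint offset, and the _all variant lets the MPI library aggregate the chunks before they reach GPFS. The chunk size and file name are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>

#define CHUNK_DOUBLES (1 << 20)   /* 8 MB of doubles per rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *chunk = malloc(CHUNK_DOUBLES * sizeof(double));
    for (int i = 0; i < CHUNK_DOUBLES; i++)
        chunk[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Disjoint offset per rank; the collective call coordinates all
       ranks so the library can merge and align the writes. */
    MPI_Offset offset = (MPI_Offset)rank * CHUNK_DOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, chunk, CHUNK_DOUBLES,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(chunk);
    MPI_Finalize();
    return 0;
}
```

Compared with each rank calling an independent write, the collective variant gives the MPI library the global picture it needs to perform aggregation.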

If your application writes multiple output files, you can choose between one file per process and a smaller number of shared files. On GPFS, one file per process can cause metadata pressure and strain directory operations when the number of ranks is very large. Many sites therefore recommend patterns such as one file per node or one file per group of ranks. This reduces file count while still allowing parallelism.

Another useful pattern on GPFS is to separate frequent small I/O from large periodic I/O. For example, write fine grained diagnostic logs to local node storage, such as /tmp, and periodically aggregate and flush important data to GPFS in larger chunks. This avoids the high overhead of many small synchronous writes to the parallel filesystem.

On GPFS, prefer MPI I/O collective operations and file aggregation strategies over naive one file per process patterns when scaling to many thousands of ranks.

Site Specific Tools and Commands

While end users mostly interact with GPFS through standard POSIX interfaces, there are a few GPFS-specific commands that users at some sites may be allowed to use.

The mmlsfs command shows information about a GPFS filesystem, such as block size, maximum file size, and mount options. It is usually restricted to administrators, but some sites may permit read-only usage so that users can inspect configuration and tune their I/O behavior accordingly.

The mmlsquota command reports quota usage and limits for users and groups on GPFS filesystems. If you hit quota errors on GPFS, running mmlsquota or site specific wrappers around it can help you understand which filesystem and which type of quota is exhausted.

The mmapplypolicy command and its policy files control data placement, migration, and file lifecycle within GPFS. These are typically configured only by administrators. However, the policies they define may expose visible behavior, such as automatic movement of old files from fast to slow pools or automatic purging of scratch areas.

Because access to raw GPFS commands varies by site, your best source of truth is your system documentation. Nonetheless, recognizing the mm* command prefix helps you identify GPFS related utilities if you encounter them in job scripts or support messages.

Comparing GPFS to Other Parallel Filesystems

Although this chapter does not aim to fully compare filesystems, it is useful to highlight a few practical points that are distinctive for GPFS from a user point of view.

GPFS is POSIX-compatible and presents itself as a single namespace mounted on all participating nodes. You do not usually specify stripe counts or layouts in your user commands. Instead, layout is governed by global policies and storage pools. This can simplify usage because you cannot accidentally choose harmful stripe configurations, but it also means performance tuning is more policy-driven and handled by administrators.

Metadata operations in GPFS are strongly integrated with its token-based locking and distributed architecture. This can make directory operations scale better than on traditional NFS for many workloads. At the same time, extreme metadata workloads, such as creating or deleting billions of files, can still produce bottlenecks, and HPC centers often provide guidelines on how to structure and manage large datasets within GPFS.

In practice, what matters for you is that many best practices for other parallel filesystems also apply to GPFS: aggregate I/O, reduce file counts, and respect filesystem specific policies such as purge rules or preferred directories for large jobs. However, the details of layout control and administration are different, and in GPFS environments, you will often find a tighter integration with IBM or vendor specific management tools.

Practical Recommendations for Using GPFS

To conclude, it is helpful to summarize several concrete practices that align well with GPFS characteristics on HPC clusters.

Place large, performance-critical data such as checkpoints and bulk simulation output on the GPFS scratch or project directories that are advertised as high performance. Avoid putting such loads in GPFS-backed home directories if these are tuned more for reliability than for throughput.

Reduce the number of files your jobs create on GPFS. Where possible, aggregate data from many ranks into fewer shared files using MPI I/O or post processing tools. Try to avoid deeply nested directory hierarchies with huge numbers of small files.

Design your applications to perform fewer, larger I/O operations rather than many small ones. Buffer data in memory as appropriate for your workflow, then write it out in blocks that are at least hundreds of kilobytes in size, and preferably megabytes, especially for checkpoint data.

If you encounter performance issues that appear to be related to GPFS, such as slow directory listings or slow writes during checkpoints, collect basic information such as filesystem path, approximate file sizes, and timing, and share this with your support team. Because GPFS is policy driven, administrators can often adjust storage pools or policies if they understand your workload pattern.

Finally, become familiar with your site specific documentation about GPFS. Many HPC centers publish details about their GPFS configuration, purging policies, and recommended I/O practices. Aligning your workflows with these guidelines will help you obtain consistent performance while coexisting fairly with other users on the shared parallel filesystem.
