
4.8.2 Lustre

Overview of Lustre in HPC

Lustre is a parallel filesystem designed for large-scale HPC systems. It allows many clients to read and write to a single filesystem at very high aggregate bandwidth. In cluster environments, Lustre is used for shared project spaces, scratch areas, and large datasets that must be accessed in parallel by many compute nodes.

This chapter focuses on how Lustre is structured, how it behaves from a user’s perspective, and what practical habits help you get good performance and avoid common pitfalls. General ideas of parallel filesystems are covered elsewhere, so here we only refine what is specific to Lustre.

Basic Lustre Architecture

From a user’s point of view, Lustre appears as a normal mounted directory, such as /lustre, /scratch, or /work. Internally, it is split into separate logical components that work together.

A Lustre filesystem consists of:

  1. A Metadata Server, or MDS, that handles filesystem namespace operations such as creating files, removing files, and changing permissions.
  2. Metadata Targets, or MDTs, that store the actual metadata on disk, for example file names, directory structure, ownership, timestamps, and for small files sometimes the content.
  3. Object Storage Servers, or OSSes, that provide data access services to clients.
  4. Object Storage Targets, or OSTs, that store file data as objects on disk devices.

The MDS manages one or more MDTs, and each OSS manages one or more OSTs. Compute nodes and login nodes act as Lustre clients. When you open or create a file, the client talks to the MDS for metadata and to one or more OSSes for the data itself.

The separation of metadata and data is fundamental. It allows Lustre to scale metadata operations separately from data throughput, which is critical on large HPC systems with many clients.

File Striping in Lustre

A key feature of Lustre is file striping. A file can be broken into stripes that are distributed across multiple OSTs. This allows parallel access to different parts of the file and enables high bandwidth when many clients read or write.

Each Lustre file has a stripe count and a stripe size.

The stripe count is the number of OSTs used for the file. For example, a stripe count of 4 means that Lustre places the file's stripes across 4 OSTs in a round-robin pattern.

The stripe size is the amount of data on one OST before moving to the next OST. For example, a stripe size of 1 MiB means that the first 1 MiB of the file goes to the first OST, the next 1 MiB to the second OST, and so on.

Lustre rotates through the OSTs until all data is written. If the stripe count is $N$ and stripe size is $S$, then the layout repeats every $N \times S$ bytes.
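
As a small worked illustration, assuming a hypothetical file with stripe count 4 and stripe size 1 MiB, the stripe index that holds a given byte offset can be computed in the shell:

N=4                                 # stripe count
S=$((1024 * 1024))                  # stripe size in bytes (1 MiB)
offset=$((5 * 1024 * 1024 + 42))    # arbitrary byte offset within the file
echo $(( (offset / S) % N ))        # prints 1, the stripe index holding this offset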

In Lustre, file striping determines how your file is distributed across OSTs and can significantly affect performance. You can and often should choose stripe parameters for large parallel I/O.

Inspecting and Setting Striping

Users can inspect and control striping through Lustre-specific tools. The most common command is lfs, which operates on files and directories on a Lustre filesystem.

To see the stripe settings of a file or directory you can use:

lfs getstripe /lustre/path/to/file

The output lists the stripe size, stripe count, and the OST indices that hold the stripes. On some systems, the stripe size may be reported in bytes, on others in KiB or MiB, depending on configuration and lfs version.
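
If you only need a single attribute, lfs getstripe can usually report it directly; the exact flags and output format can vary between Lustre versions, so check lfs help getstripe on your system:

lfs getstripe -c /lustre/path/to/file    # stripe count only
lfs getstripe -S /lustre/path/to/file    # stripe size only, in bytes
lfs getstripe -d /lustre/path/to/dir     # default layout of a directory, without listing its contents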

To set default striping for new files in a directory you set the striping on the directory itself:

lfs setstripe -c 4 -S 1m /lustre/path/to/dir

This command requests a stripe count of 4 and stripe size of 1 MiB for any new file created in that directory. Many sites restrict extreme stripe counts, so values that are too large may fail or be overridden by policy.

It is possible to set striping directly on a file, but the common pattern is to set striping on a directory before the file is created. Changing stripe settings on an existing file usually requires creating a new file with the desired layout and copying data.
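
A minimal sketch of that pattern, with hypothetical paths: create an empty file with the desired layout first, then copy the data into it so the new layout is used. Newer Lustre versions also provide lfs migrate for this purpose, where available.

lfs setstripe -c 8 -S 4m /lustre/work/data.restriped   # empty file created with the new layout
cp /lustre/work/data /lustre/work/data.restriped       # data is written into the new layout
mv /lustre/work/data.restriped /lustre/work/data       # replace the original file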

Many clusters provide site specific recommendations for typical stripe settings for scratch, output, and checkpoint data. These policies are usually tuned to match the number of OSTs and expected workloads.

Metadata Behavior and Layout

Because Lustre separates metadata from data, metadata performance can limit overall performance in workloads with a large number of small files or frequent file creation and deletion.

The MDS and MDT handle:

File name lookups.

Creation and removal of files and directories.

Changing permissions and ownership.

Opening and closing files.

Each metadata operation is a separate request to the MDS. While individual operations are fast, many small operations from a large number of nodes can overwhelm the metadata server.

Some Lustre installations use multiple MDTs for load balancing of metadata. In such configurations files and directories may be spread across MDTs. From the user’s perspective, this is mostly transparent, but performance behavior can change depending on how directories are distributed.

In some newer Lustre versions small files can be stored directly on the MDT as part of metadata, which can improve performance for workloads that process many tiny files. However, this is controlled by the filesystem configuration and not typically by individual users.

Typical Mount Points and Usage Patterns

On most HPC systems using Lustre, the filesystem is mounted under a path that represents a project, work, or scratch area. Examples include:

/lustre/project/$USER

/scratch/$USER

/work/${PROJECT_ID}

The exact mount point and naming conventions are site specific. Documentation provided by the cluster usually states which mount points are backed by Lustre.

Lustre is generally used for temporary or project related data that benefits from high performance parallel access. It is often not used for home directories, user configuration files, or very small long lived files, because of the overheads and policies specific to the high performance filesystem.

Performance Characteristics of Lustre

Lustre is designed for high aggregate bandwidth and good scalability across many clients. It can provide many gigabytes per second of throughput when accessed properly. However, low latency for individual, small operations is not its main focus.

Several characteristics are especially important for performance.

First, sequential parallel I/O scales much better than many small random accesses. Large, contiguous reads and writes, aligned to the stripe size, are ideal.

Second, using many OSTs through an appropriate stripe count can increase aggregate bandwidth. If a file is striped across more OSTs, different processes can access different stripes in parallel, which can multiply throughput.

Third, metadata operations do not scale as well as pure data I/O. Workloads that create, delete, and stat millions of tiny files can be much slower than workloads that write a small number of large files with the same total amount of data.

On Lustre, large, sequential, parallel I/O to a small number of files usually performs much better than creating many small files. Design I/O patterns to minimize small random I/O and excessive metadata operations.
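
As a very rough sanity check, you can time a large sequential write into a striped directory; the paths below are hypothetical, and the numbers are heavily influenced by caching and by other jobs on the system, so treat them only as an indication:

mkdir -p /lustre/scratch/$USER/iotest                  # hypothetical scratch location
lfs setstripe -c 4 -S 4m /lustre/scratch/$USER/iotest  # stripe the test directory
dd if=/dev/zero of=/lustre/scratch/$USER/iotest/testfile bs=4M count=1024 conv=fsync
rm /lustre/scratch/$USER/iotest/testfile               # clean up the 4 GiB test file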

Typical Usage Scenarios in HPC

In practice, Lustre is used for several common types of HPC workloads.

For checkpoint and restart data, applications write their state to Lustre at intervals so that jobs can be restarted after failures. These files are often large and written in parallel. Good striping and alignment help reduce checkpoint time.

For simulation outputs, many scientific codes write large multidimensional arrays or field data. These files can be written either per process, per node, or as a single shared file. Lustre’s striping is especially helpful when writing one shared file from many processes using MPI I/O or a high level I/O library.
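
A hedged sketch of this pattern in a Slurm-style job script, with hypothetical paths, program name, and option: the output directory is striped widely before the shared file is created, so the file inherits that layout.

#!/bin/bash
#SBATCH --nodes=16
OUTDIR=/lustre/scratch/$USER/run_001               # hypothetical output location
mkdir -p "$OUTDIR"
lfs setstripe -c 16 -S 4m "$OUTDIR"                # wide striping for the shared output file
srun ./my_simulation --output "$OUTDIR/field.h5"   # hypothetical application and flag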

For shared input datasets, such as grids, meshes, or observational data, many jobs and users may read the same large files. Lustre’s ability to provide high read bandwidth from multiple OSSes is essential here.

For analysis and post processing, users may run tools that scan, aggregate, or visualize results stored on Lustre. Efficient access patterns and avoiding unnecessary data copying can save significant time.

Good Practices for Users on Lustre

Although administrators configure and tune Lustre, user behavior has a strong impact on performance and on how pleasant the filesystem is for everyone. Some practices are particularly relevant.

It is advisable to group output into fewer, larger files when possible. For example, instead of each process writing its own small file, use MPI I/O or high level libraries so that a smaller set of larger files is created.

One should avoid running metadata heavy commands over huge directory trees, for example unfiltered find operations over millions of files, during peak hours. These operations stress the MDS more than the OSTs.
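
When you do need to search, limiting the scope keeps the metadata load down. A small sketch, assuming a hypothetical run directory:

find /lustre/scratch/$USER/run_042 -maxdepth 1 -name '*.log'   # shallow, targeted search
# avoid: an unrestricted "find /lustre/scratch/$USER ..." that walks the entire tree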

Users should pay attention to site guidelines for striping and directory locations. Some directories may be intended for high bandwidth scratch with recommended stripe counts and limited lifetimes, while others are for longer term storage with different policies.

It is helpful to precreate directory hierarchies before starting large job campaigns. Creating directories on the fly from thousands of MPI ranks can overload metadata services.
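
For example, a short loop run once from the login node (paths and layout hypothetical) can create per-run directories ahead of time instead of having every rank call mkdir during the job:

for i in $(seq -w 0 99); do
    mkdir -p /lustre/scratch/$USER/campaign/run_$i   # one directory per run, created up front
done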

Parallel I/O libraries such as HDF5 or NetCDF are often configured to work efficiently on Lustre. These libraries expose tuning options for alignment, chunking, and collective access patterns that can be matched to the stripe layout. Using them appropriately can lead to good Lustre performance with little extra effort.

Monitoring and Diagnostics from the User Side

While detailed Lustre monitoring belongs to administrators, users can perform simple diagnostics.

Commands such as lfs df show free space and usage on the Lustre filesystem:

lfs df -h /lustre

This output tells you how much capacity is left on the OSTs. Low free space can degrade performance and may trigger site policies, such as quotas or cleanup.
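
Many sites also enable Lustre quotas. If so, you can check your own usage and limits with lfs quota; flags and output format vary by site and Lustre version, and the group variable below is hypothetical:

lfs quota -u $USER /lustre              # block and inode usage for your user
lfs quota -g $PROJECT_GROUP /lustre     # group usage, if group quotas are enabled ($PROJECT_GROUP is a placeholder)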

To see which OSTs are used by a file and how it is laid out, lfs getstripe is again useful. From the pattern of OST indices and stripe counts, you can infer whether a file is effectively striped.

If I/O performance suddenly drops for a specific job, checking whether the job has produced an extremely large number of small files, or is operating in a directory with inadequate striping, can be a first step before contacting support.

Common Pitfalls and Limitations

There are several pitfalls that new users of Lustre often encounter.

A common mistake is placing extremely large files on a single OST, for example by leaving the stripe count at 1, and then expecting multi-gigabyte-per-second performance from many nodes. In this case, the single OST becomes a bottleneck.

Another issue is creating millions of tiny files during a run, often one per rank or per time step. Not only does this overload metadata, but cleanup and later analysis become problematic. Some sites enforce quotas on file counts to discourage such patterns.

Running interactive tools that repeatedly scan whole project trees or scratch trees, such as frequent du -sh or deep recursive searches, can also overload the metadata service and affect other users.

It is important to remember that Lustre is not a backup system. Filesystem failures, accidental deletions, or automatic purge policies on scratch areas can lead to data loss. Critical data should be copied to backed up storage locations specified by the site.

Finally, site specific policies may automatically purge files that have not been accessed for a certain period, especially in scratch directories backed by Lustre. Ignoring these policies can lead to unexpected data loss.

Relation to Applications and I/O Libraries

Applications that are explicitly aware of Lustre can exploit its features more effectively. However, even unmodified applications can benefit if run-time I/O libraries are configured with Lustre in mind.

Many MPI I/O implementations include optimizations for Lustre, such as collective I/O and alignment with stripe boundaries. Setting environment variables that control MPI I/O buffering, aggregator counts, or alignment can improve performance without changing application code.
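
For ROMIO-based MPI-IO implementations (for example MPICH derivatives), one common mechanism is a hints file selected through the ROMIO_HINTS environment variable. Which hints are honored, and whether this applies to your MPI at all, depends on the installation, so treat the following as an assumption-laden sketch and consult your site documentation:

cat > romio_hints.txt << 'EOF'
striping_factor 16
striping_unit 4194304
romio_cb_write enable
EOF
export ROMIO_HINTS=$PWD/romio_hints.txt   # read by ROMIO when the application opens files with MPI-IO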

High level libraries like HDF5 and NetCDF often have tuning parameters related to chunk size, alignment, and collective I/O. On Lustre, these parameters can be adjusted so that chunk sizes are multiples of the stripe size, and collective writes line up with stripes.

Site documentation frequently provides recommended environment settings for these libraries on their specific Lustre deployment. Following these recommendations is a practical way to achieve good performance without deep filesystem expertise.

Summary

Lustre is a parallel filesystem tailored to the demands of HPC workflows that require high bandwidth and scalable access from many compute nodes. Its separation of metadata and data, use of MDS/MDT and OSS/OST components, and file striping mechanism are central to its design.

For users, the most important aspects of Lustre are understanding how striping influences performance, recognizing the cost of metadata operations, and adopting good I/O patterns with fewer large files and sequential parallel access.

By aligning application I/O with Lustre’s strengths, using tools such as lfs getstripe and lfs setstripe, and adhering to site policies, you can make effective and efficient use of Lustre based storage in HPC environments.
