Overview of Lustre in HPC
Lustre is a high-performance, parallel filesystem designed for large HPC clusters. It provides a single, shared namespace that can span thousands of nodes and petabytes of storage, with very high aggregate bandwidth. It is optimized for:
- Many clients accessing data in parallel
- Large files and streaming I/O
- Striping data across many storage devices
In most HPC systems, Lustre is mounted on all compute and login nodes at a common mount point (e.g. /lustre, /scratch, /fsx), giving a unified view of data regardless of which node you’re running on.
Lustre Architecture: Key Components
A Lustre filesystem is built from several specialized server types. Understanding these is essential to interpreting performance and behavior on clusters.
Management Server (MGS)
- Holds global configuration information for one or more Lustre filesystems.
- Knows which metadata and object storage targets belong to each filesystem.
- Usually combined with an MDS on the same physical node in production systems.
As a user, you rarely interact with the MGS directly; it works in the background to keep the filesystem configuration consistent.
Metadata Server (MDS) and Metadata Targets (MDTs)
- MDS: Server process that handles filesystem metadata operations:
- Creating, deleting, renaming files and directories
- Managing permissions
- Path resolution (/path/to/file → object identifiers)
- MDT: The underlying block storage where metadata is stored.
Characteristics important to users:
- Metadata operations (e.g. ls -R, find, creating millions of small files) are serviced by the MDS/MDT.
- Metadata can become a bottleneck if you perform many small operations, even if the data servers are underused.
- Some advanced deployments use multiple MDTs (DNE – Distributed Namespace Environment) to scale metadata.
Object Storage Servers (OSS) and Object Storage Targets (OSTs)
- OSS: Server processes that handle bulk data I/O for file contents.
- OST: The actual storage volumes managed by OSS (e.g. RAID groups, disks, SSDs).
Files on Lustre are stored as objects on one or more OSTs:
- The MDS keeps track of which objects (and which OSTs) belong to a file.
- When an application reads or writes file data, client nodes communicate directly with the relevant OSSes.
This is where Lustre gets its parallel bandwidth: many OSS/OSTs can serve different parts of a file simultaneously.
How Lustre Presents Storage to Users
From a user’s perspective, a Lustre filesystem behaves much like a POSIX filesystem:
- Standard commands (ls, cp, mv, rm, mkdir, chmod, etc.) work as usual.
- You can compile and run programs without special I/O calls.
- Most standard I/O interfaces and libraries (fopen, read, write, MPI-IO, HDF5, NetCDF) work normally.
However, the parallel structure underneath introduces features and behaviors you should be aware of.
Striping: Distributing File Data Across OSTs
Lustre can stripe a file across multiple OSTs:
- A file is split into fixed-size chunks (stripes).
- Each stripe is stored on a different OST (or set of OSTs).
- Multiple clients can read/write different stripes in parallel, multiplying effective bandwidth.
Key parameters:
- Stripe count (stripe_count): number of OSTs used for a file.
- Stripe size (stripe_size): size of each stripe chunk (e.g. 1 MB, 4 MB).
Conceptually, for a file with stripe count 4 and stripe size 1 MB:
- Bytes 0–1 MB → OST 0
- Bytes 1–2 MB → OST 1
- Bytes 2–3 MB → OST 2
- Bytes 3–4 MB → OST 3
- Bytes 4–5 MB → back to OST 0, and so on.
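More generally, for a simple striped layout (ignoring composite/PFL layouts), the stripe that holds a given byte offset can be computed as:
$$
\text{stripe index} = \left\lfloor \frac{\text{offset}}{\text{stripe\_size}} \right\rfloor \bmod \text{stripe\_count}
$$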
Choosing appropriate striping is one of the main performance tuning knobs available to users.
Querying and Modifying Lustre Striping
Viewing Striping Information
Use lfs (Lustre filesystem utility) to inspect a file’s striping:
lfs getstripe myfile
Typical output includes:
- stripe_count
- stripe_size
- The OST indices assigned
You can also inspect directory defaults:
lfs getstripe mydir
If a directory has a default stripe layout, new files created there inherit it.
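If your lfs version supports it, the -d flag restricts the query to the directory's own default layout rather than listing every file inside it:
lfs getstripe -d mydir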
Setting Striping for New Files
You normally set striping before creating the file. The most common approach is:
- Set a directory default:
lfs setstripe -c 4 -S 1M mydir
- -c 4: use 4 OSTs
- -S 1M: stripe size of 1 MB (syntax varies slightly by version)
- Create files in that directory:
cd mydir
my_application > output.dat
Each new file in mydir will have the specified striping, unless overridden.
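To confirm the layout was inherited, you can check a newly created file (names here are illustrative):
lfs getstripe output.dat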
You can also set striping for a single file before it exists:
lfs setstripe -c 8 -S 4M bigfile.dat
my_application > bigfile.dat
Attempting to change the striping of an existing, non-empty file is generally not supported directly; you’d typically:
- Create a new file with desired striping.
- Copy or re-generate data into it.
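On reasonably recent Lustre releases, lfs migrate can rewrite an existing file into a new layout in place; availability and exact options depend on your installation, so treat this as a sketch and check your site documentation first:
lfs migrate -c 8 -S 4M bigfile.dat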
Workload Patterns and Striping Strategies
Striping choices depend on your access pattern. General (but not universal) guidelines:
- Large, sequential I/O from many processes:
- Moderate to high stripe count (e.g. 4–16, or more depending on system rules).
- Stripe size in MBs (1–8 MB typical).
- Good for single large shared files written/read collectively (e.g. large checkpoint files).
- Many independent files, one per process:
- Often fine to use the filesystem’s default striping (stripe_count 1 or a small number).
- Parallelism comes from many files spread across OSTs via the allocator.
- Small files or metadata-heavy workloads:
- Avoid large stripe counts; stripe_count=1 is often better.
- Fewer OSTs per file reduces overhead and metadata operations.
- Uncoordinated random access:
- Harder to optimize; moderate stripe count and careful MPI-IO or library usage help.
On many HPC systems, site documentation provides recommended striping settings for typical workloads; following those is usually the best starting point.
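As an illustration only (directory names are hypothetical, and the counts should follow your site's recommendations):
lfs setstripe -c 8 -S 4M shared_checkpoint_dir    # large shared files, collective I/O
lfs setstripe -c 1 per_rank_output_dir            # many small, independent files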
Typical User Experience and Mount Points
On a cluster, you might see Lustre filesystems mounted like:
/lustre
/project
/scratch
/fs/lustre1
To check whether a path is on Lustre:
df -Th /scratch
Look for lustre in the filesystem type column.
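Alternatively, listing mounts filtered by filesystem type shows every Lustre mount on the node:
mount -t lustre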
Some clusters expose multiple Lustre filesystems (e.g. scratch vs project), often with different:
- Capacity
- Performance characteristics
- Purge policies (temporary vs long-term storage)
Knowing which filesystem you’re using matters for large runs and data retention.
Performance Considerations Specific to Lustre
Aggregating Bandwidth Across OSTs
Total achievable bandwidth is roughly:
$$
\text{effective bandwidth} \approx \min \left(
\sum_{\text{active OSTs}} \text{OST bandwidth},
\sum_{\text{client nodes}} \text{network bandwidth per node}
\right)
$$
Using more OSTs (higher stripe_count) can increase bandwidth, but only up to limits set by:
- The OSS hardware
- Network fabric
- Application I/O pattern
Over-striping can also introduce overhead; “more” is not always “better.”
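As a purely hypothetical example: if a file is striped over 8 OSTs that each sustain about 3 GB/s, and the job runs on 2 client nodes with 25 GB/s of network bandwidth each, the effective bandwidth is roughly min(8 × 3, 2 × 25) = 24 GB/s; raising the stripe count further would not help until more (or faster) client nodes participate in the I/O.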
Hotspots and OST Balancing
If many users or many files all hit the same OSTs, those OSTs become hotspots:
- Higher latency
- Reduced throughput for all files on those OSTs
Lustre attempts to distribute new files across OSTs, but manual striping that always targets a specific subset can negate this.
Admins often monitor OST usage and may recommend:
- Using defaults unless you have a reason to change striping.
- Avoiding custom striping that pins many large files to the same small OST subset.
Metadata vs Data Bottlenecks
Since MDS/MDTs handle metadata:
- Workloads creating/deleting millions of files, or heavy ls -R / find operations, may saturate the metadata servers while the OSTs are idle.
- Conversely, streaming reads/writes of large files are limited mainly by OSS/OST bandwidth.
Performance problems blamed on “Lustre being slow” may actually be:
- Metadata saturation
- Network contention
- Poor access pattern (e.g. many tiny writes scattered across large striped files)
Basic Lustre Tools Useful to Users
While system tuning is done by administrators, several tools help users understand behavior:
- lfs df – view free space and usage on Lustre:
lfs df -h /scratch
- lfs find – Lustre-aware find with extra options (e.g. by OST, size, layout).
- lfs getstripe / lfs setstripe – inspect and set striping, as shown earlier.
Not all options may be available on every system; site documentation usually describes supported commands and policies.
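For example, to list files larger than 1 GB under a project directory (the path is hypothetical; check lfs find --help for the options your system supports):
lfs find /scratch/myproject --size +1G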
Good Practices When Using Lustre
In most clusters, Lustre is a shared resource. Efficient use is both a performance and a fairness issue.
Some common guidelines:
- Prefer fewer, larger files over many tiny ones when possible.
- Avoid excessive per-process files (e.g. tens of thousands of tiny checkpoints).
- Use parallel I/O libraries (MPI-IO, HDF5, NetCDF-4) rather than naive per-rank file patterns for large runs.
- Use site-recommended striping defaults unless you have a clear reason to override them.
- Avoid ls -R and find over huge directory trees during peak usage if possible.
- Stage data in and out efficiently (e.g. tar/compress small files before moving them, as in the sketch below).
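A minimal staging sketch (file and directory names are hypothetical):
tar czf results.tar.gz results/    # bundle many small files into one archive before transfer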
Understanding that Lustre is optimized for large, parallel, streaming I/O will help you design workflows that take advantage of its strengths and avoid common pitfalls.