
Lustre

Overview of Lustre in HPC

Lustre is a high-performance, parallel filesystem designed for large HPC clusters. It provides a single, shared namespace that can span thousands of nodes and petabytes of storage, with very high aggregate bandwidth. It is optimized for:

  * large files and streaming, sequential I/O
  * high aggregate throughput from many clients at once
  * scaling capacity and bandwidth by adding more servers

In most HPC systems, Lustre is mounted on all compute and login nodes at a common mount point (e.g. /lustre, /scratch, /fsx), giving a unified view of data regardless of which node you’re running on.

Lustre Architecture: Key Components

A Lustre filesystem is built from several specialized server types. Understanding these is essential to interpreting performance and behavior on clusters.

Management Server (MGS)

The MGS stores the configuration for the site's Lustre filesystems and serves it to the other servers and clients. As a user, you rarely interact with the MGS directly; it works in the background to keep the filesystem configuration consistent.

Metadata Server (MDS) and Metadata Targets (MDTs)

The MDS handles namespace operations such as file and directory names, permissions, ownership, and file layouts (striping information); this metadata is stored on one or more MDTs.

Characteristics important to users:

  * Creating, opening, closing, and stat-ing files go through the MDS, not the OSTs.
  * Workloads with very many small files or frequent directory listings stress the MDS and can be slow even when plenty of data bandwidth is available.

Object Storage Servers (OSS) and Object Storage Targets (OSTs)

The OSSs perform the actual file I/O; each OSS serves one or more OSTs, the underlying storage volumes. Files on Lustre are stored as objects on one or more OSTs: a file's data is split into objects, and each object lives on a particular OST.

This is where Lustre gets its parallel bandwidth: many OSS/OSTs can serve different parts of a file simultaneously.

How Lustre Presents Storage to Users

From a user’s perspective, a Lustre filesystem behaves much like a POSIX filesystem: familiar commands (ls, cp, mv, rm) and standard system calls (open, read, write) work as usual, with no Lustre-specific code required.

However, the parallel structure underneath introduces features and behaviors you should be aware of.

Striping: Distributing File Data Across OSTs

Lustre can stripe a file across multiple OSTs: the file's data is divided into fixed-size chunks (stripes) that are written round-robin across a set of OSTs.

Key parameters:

  * stripe count: how many OSTs the file is spread across
  * stripe size: how much data is written to one OST before moving on to the next

Conceptually, for a file with stripe count 4 and stripe size 1 MB, the data is laid out as sketched below.
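The layout below is illustrative; the actual OST indices are chosen by Lustre at file creation:

   file offset:  [0,1M)  [1M,2M)  [2M,3M)  [3M,4M)  [4M,5M)  [5M,6M)  ...
   stored on:    OST a   OST b    OST c    OST d    OST a    OST b    ...

The first 1 MB goes to the first OST in the layout, the next 1 MB to the second, and so on, wrapping back to the first OST after the fourth stripe.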

Choosing appropriate striping is one of the main performance tuning knobs available to users.

Querying and Modifying Lustre Striping

Viewing Striping Information

Use lfs (Lustre filesystem utility) to inspect a file’s striping:

lfs getstripe myfile

Typical output includes the stripe count, stripe size, and the list of OSTs (by index) holding the file's objects.
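The exact fields vary by Lustre version; a representative example (values are illustrative):

   myfile
   lmm_stripe_count:  4
   lmm_stripe_size:   1048576
   lmm_stripe_offset: 2
         obdidx       objid        objid        group
              2       3145731      0x300003     0
              5       3145729      0x300001     0
              ...

Here lmm_stripe_count and lmm_stripe_size describe the layout, and the obdidx column lists the OST indices that hold the file's objects.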

You can also inspect directory defaults:

lfs getstripe mydir

If a directory has a default stripe, new files created there inherit that layout.

Setting Striping for New Files

You normally set striping before creating the file. The most common approach is:

  1. Set a directory default:
   lfs setstripe -c 4 -S 1M mydir
  2. Create files in that directory:
   cd mydir
   my_application > output.dat

Each new file in mydir will have the specified striping, unless overridden.
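You can confirm the inherited layout on a newly created file:

   lfs getstripe output.dat

The reported stripe count and size should match the directory default.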

You can also set striping for a single file before it exists:

lfs setstripe -c 8 -S 4M bigfile.dat
my_application > bigfile.dat

Attempting to change striping of an existing, non-empty file is generally not supported directly; you’d typically:

  1. Create a new file with desired striping.
  2. Copy or re-generate data into it.
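A sketch of both approaches (file names are illustrative; lfs migrate is only available on newer Lustre versions, so check your site documentation):

   # create an empty file with the new layout, then copy the data
   lfs setstripe -c 8 -S 4M bigfile.restriped
   cp bigfile.dat bigfile.restriped
   mv bigfile.restriped bigfile.dat

   # where supported, restripe in place instead
   lfs migrate -c 8 -S 4M bigfile.dat

Either way the data is rewritten, which itself generates significant I/O for large files.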

Workload Patterns and Striping Strategies

Striping choices depend on your access pattern. General (but not universal) guidelines, with a sketch below:

  * Many small files, each read or written by one process: stripe count 1, so each file lives on a single OST with minimal overhead.
  * One large file shared by many processes (e.g. MPI-IO): stripe across many OSTs so clients can hit different servers in parallel.
  * Very large sequential files: a larger stripe size (e.g. a few MB) reduces per-stripe overhead.
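As a sketch (directory names are illustrative):

   # many small, single-writer files: keep each file on one OST
   lfs setstripe -c 1 small_files_dir

   # large shared files written by many ranks: stripe widely
   lfs setstripe -c 16 -S 4M shared_output_dir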

On many HPC systems, site documentation provides recommended striping settings for typical workloads; following those is usually the best starting point.

Typical User Experience and Mount Points

On a cluster, you might see Lustre filesystems mounted at paths like /lustre, /scratch, or /fsx.

To check whether a path is on Lustre:

df -Th /scratch

Look for lustre in the filesystem type column.
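The output will look something like this (device name and sizes are illustrative):

   Filesystem             Type    Size  Used Avail Use% Mounted on
   10.0.0.1@tcp:/scratch  lustre  1.5P  800T  700T  54% /scratch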

Some clusters expose multiple Lustre filesystems (e.g. scratch vs project), often with different:

  * capacities and quotas
  * performance characteristics
  * purge and retention policies

Knowing which filesystem you’re using matters for large runs and data retention.

Performance Considerations Specific to Lustre

Aggregating Bandwidth Across OSTs

Total achievable bandwidth is roughly:

$$
\text{effective bandwidth} \approx \min \left(
\sum_{\text{active OSTs}} \text{OST bandwidth},
\sum_{\text{client nodes}} \text{network bandwidth per node}
\right)
$$
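For example, if 8 active OSTs each deliver 2 GB/s but a job runs on only 4 client nodes with 1.25 GB/s links each (numbers are illustrative):

$$
\min\left( 8 \times 2\ \text{GB/s},\; 4 \times 1.25\ \text{GB/s} \right) = \min(16, 5)\ \text{GB/s} = 5\ \text{GB/s}
$$

so the client-side network, not the OSTs, is the bottleneck.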

Using more OSTs (a higher stripe count) can increase bandwidth, but only up to limits set by:

  * the number of OSTs in the filesystem
  * each client node's network bandwidth
  * contention from other jobs and users

Over-striping can also introduce overhead; “more” is not always “better.”

Hotspots and OST Balancing

If many users or many files all hit the same OSTs, those OSTs become hotspots: I/O that lands on them slows down for everyone, even while other OSTs sit comparatively idle.

Lustre attempts to distribute new files across OSTs, but manual striping that always targets a specific subset can negate this.

Admins often monitor OST usage and may recommend:

  * leaving OST selection to Lustre rather than pinning files to specific OST indices
  * restriping or migrating very large files that crowd a single OST

Metadata vs Data Bottlenecks

Since MDS/MDTs handle metadata:

  * creating, opening, stat-ing, and deleting files are limited by metadata performance rather than by the OSTs
  * jobs that create or traverse millions of small files can saturate the MDS while the data path sits nearly idle
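As an illustration (paths are hypothetical), the first pattern below issues one create per file and hammers the MDS, while the second touches metadata far less:

   # metadata-heavy: 100,000 file creations, one MDS operation each
   for i in $(seq 1 100000); do touch many_files_dir/file_$i; done

   # friendlier: pack the small outputs into a single file
   tar cf results.tar many_files_dir/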

Performance problems blamed on “Lustre being slow” may actually be:

  * metadata contention from small-file workloads
  * a handful of overloaded OSTs rather than the whole filesystem
  * congestion on the network between clients and servers

Basic Lustre Tools Useful to Users

While system tuning is done by administrators, several tools help users understand behavior:

  * lfs getstripe and lfs setstripe to inspect and control file layouts
  * lfs df to see capacity and usage per MDT and OST, for example:

   lfs df -h /scratch

  * lfs quota to check your usage against filesystem quotas
  * lfs find, a Lustre-aware alternative to find for large directory trees
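For example (availability and exact options vary by site):

   lfs quota -u $USER /scratch     # your usage and limits on /scratch
   lfs find mydir -size +1G        # files larger than 1 GB under mydir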

Not all options may be available on every system; site documentation usually describes supported commands and policies.

Good Practices When Using Lustre

In most clusters, Lustre is a shared resource. Efficient use is both a performance and a fairness issue.

Some common guidelines:

  * Avoid creating huge numbers of tiny files; aggregate data into fewer, larger files where possible.
  * Avoid repeated ls -l or stat sweeps over very large directories.
  * Match striping to your access pattern, and fall back on site defaults when unsure.
  * Keep small-file, home-directory-style workloads (e.g. compilation) on filesystems better suited to them when the site provides one.

Understanding that Lustre is optimized for large, parallel, streaming I/O will help you design workflows that take advantage of its strengths and avoid common pitfalls.
