Overview of NFS in HPC
Network File System (NFS) is one of the oldest and most widely used networked filesystems in Unix/Linux environments. In HPC, NFS is typically used for shared but low‑to‑moderate performance storage, not for the highest‑throughput parallel I/O.
In a cluster, NFS usually serves:
- Home directories
- Small project spaces
- Configuration files and shared tools
- License servers/support files
More performance‑critical workloads (large parallel I/O, scratch) are usually placed on dedicated parallel filesystems like Lustre or GPFS, but NFS remains important infrastructure.
Basic NFS Architecture in a Cluster
NFS follows a classic client–server model:
- NFS server: a storage server that exports one or more directories over the network.
- NFS clients: other systems (login nodes, compute nodes, management nodes) that mount these exports, accessing them as if they were local directories.
- Network transport: traditionally TCP over Ethernet or InfiniBand (or IP over InfiniBand). Performance and reliability depend heavily on network quality.
In an HPC cluster, common patterns are:
- A single NFS server exporting /home to all nodes
- One or more NFS servers exporting shared software trees, e.g. /opt, /apps, or /cm/shared
- Management servers exporting configuration directories (e.g. /var/yp, /tftpboot, provisioning/management data)
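To make the client–server relationship concrete, here is a minimal sketch of such a setup; the server name nfs01, network range, and options are illustrative only and vary by site:

# On the NFS server (hypothetical host "nfs01"): an entry in /etc/exports
/home    10.1.0.0/16(rw,sync,no_subtree_check)

# On each client node: a matching entry in /etc/fstab
nfs01:/home    /home    nfs    rw,hard,vers=4.2    0 0

Administrators usually manage these entries centrally through the cluster's provisioning system, so users rarely see or edit them directly.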
NFS Versions and Their Impact in HPC
You’ll commonly encounter:
- NFSv3
- Stateless protocol
- Widely supported and simple
- Separate locking protocol (NLM)
- No built‑in strong security (relies mostly on host‑based trust, firewalls)
- Often still used for compatibility
- NFSv4 / NFSv4.1 / NFSv4.2
- Stateful, with integrated locking and better semantics
- Supports ACLs and more advanced security (Kerberos)
- NFSv4.1 introduces sessions and better parallelism (pNFS support)
- Standard in modern Linux distributions
In HPC, administrators balance:
- Compatibility with older OS images and tools
- Performance characteristics for many concurrent clients
- Security requirements (especially on multi‑user or multi‑tenant clusters)
As a user, you typically don’t choose the NFS version directly; the cluster is configured so mounts use the site‑preferred version.
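If you are curious which version is actually in use, you can inspect the negotiated mount options from any node; $HOME here is just a convenient example of an NFS-mounted path:

# List mounted NFS filesystems with their negotiated options (look for vers=)
nfsstat -m

# Or query the mount backing a specific directory
findmnt -T $HOME -o TARGET,SOURCE,FSTYPE,OPTIONS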
What NFS Is Typically Used For in HPC
NFS is well suited for:
- Home directories
- Your shell startup files (.bashrc, .bash_profile)
- Source code, small datasets, scripts, configuration
- Moderate I/O, usually from a single user at a time
- Shared software trees
- Compilers, libraries, and tools that don’t need extreme performance
- Small executables and libraries (e.g. admin-maintained tools in /opt)
- Configuration and management data
- Centralized configurations for cluster management
- Shared license files or small data for engineering tools
These usages emphasize simplicity and centralization over maximum throughput.
NFS is not usually used for:
- High‑bandwidth parallel I/O from thousands of processes
- Very large scratch spaces with heavy write loads
- Latency‑sensitive metadata operations at massive scale
Those are handled by parallel filesystems introduced in other chapters.
Basic NFS Usage from a User Perspective
On most clusters, NFS mounts are already configured by the administrators. As a user:
- Your home directory might be something like /home/username or /users/username.
- Shared software could live at paths like /opt, /apps, or /software.
You generally interact with files over NFS no differently than with local files:
- Standard commands: ls, cp, mv, rm, mkdir, etc.
- Editing files via vim, nano, emacs, or IDEs connected over SSH
However, because NFS is network‑based, performance and behavior differ from local disk:
- File operations may feel slower under load
- Many simultaneous jobs accessing the same NFS area can cause contention
Recognizing NFS Mounts
To see which directories are on NFS, you can run:
mount | grep nfs

or

df -hT | grep nfs

This helps you understand where your data lives and what performance to expect.
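To check a single directory rather than scanning all mounts, the following commands (using your home directory as an example) report the filesystem backing it:

# Filesystem type, size, and usage for the mount behind $HOME
df -hT $HOME

# Just the filesystem type name (e.g. "nfs")
stat -f -c %T $HOME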
Performance Characteristics and Limitations
NFS is not a parallel filesystem in the HPC sense; it has several characteristics that matter for cluster workloads.
Single Server Bottlenecks
- Traditional NFS setups have one server (or a small number) handling all requests.
- A large number of clients (hundreds or thousands of nodes) may overload a single NFS server if jobs hammer it with I/O.
- This is why admins often discourage heavy I/O in home directories and provide scratch or parallel storage for data‑intensive runs.
Metadata and Small-File Overheads
- Operations on many small files (e.g. millions of small output files) cause:
- High metadata load (lookups, create/delete, attribute checks)
- Many RPC calls across the network
- This can significantly slow jobs and affect other users.
For HPC workloads, this motivates:
- Using fewer, larger files where possible
- Avoiding massive small‑file creation on NFS
- Using parallel filesystems for large, structured outputs
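A quick way to spot an emerging small-file problem is simply to count what has accumulated in an NFS directory; the path below is a placeholder:

# Count files under a results directory; very large counts on NFS
# usually mean the output strategy should be rethought
find $HOME/myproject/results -type f | wc -l

# Total size of the same directory tree
du -sh $HOME/myproject/results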
Caching Effects
NFS clients cache file data and attributes to reduce network round trips. Consequences:
- Performance improves for repeated reads of the same data.
- There may be delays in seeing changes when multiple jobs/users access the same files concurrently.
- Some operations may not be immediately visible across nodes (e.g. very small timing windows in rapidly changing files).
For most basic workflows this is invisible, but it can matter for tightly coupled jobs that concurrently access rapidly changing files or busy shared directories.
NFS Mount Options Relevant to HPC
Administrators tune mount options to balance performance, robustness, and consistency. Some common options you might see (for information only):
- rw / ro – read-write vs read-only
- hard – NFS operations retry indefinitely on server failures (common in HPC to avoid silent corruption)
- intr / soft (less common in HPC) – allow operations to be interrupted or to time out, but can risk data consistency
- noatime – avoids updating access times, reducing metadata writes
- rsize / wsize – I/O block sizes for performance tuning
- vers=n – specify NFS protocol version
On many clusters, these options are set centrally in /etc/fstab or automounter maps and are not user-modifiable.
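As an illustration only (actual servers, paths, and values differ per site), a centrally managed entry combining several of these options might look like:

# Hypothetical /etc/fstab line maintained by administrators, not by users
nfsserver:/export/home   /home   nfs   rw,hard,noatime,vers=4.2,rsize=1048576,wsize=1048576   0 0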
Typical Best Practices for Users on NFS
While configuration is mostly out of your hands, you can use NFS effectively by following simple practices:
1. Avoid Heavy I/O in Home Directories
- Use NFS-mounted home for:
- Source code
- Submission scripts
- Small configuration files
- Small to moderate result sets for post-processing and archiving
- Use dedicated scratch / parallel filesystem for:
- Large simulation outputs
- Checkpoints of large runs
- High-rate logging or writing from many processes
A common pattern:
# On login node
cd $HOME
cp -r myproject $SCRATCH/myproject_run
# Edit and submit from SCRATCH location
cd $SCRATCH/myproject_run
sbatch run.slurm
# After job finishes, copy key results back to HOME (NFS) for archiving
cp important_results.tar.gz $HOME/results/

2. Limit Small-File Explosion
- Group related data into fewer, larger files (e.g. HDF5, NetCDF) when possible.
- Avoid creating one file per process per timestep in NFS directories.
- Periodically clean up obsolete files and directories.
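If a run does generate many small files on scratch, one simple habit is to bundle them into a single archive before anything is copied to NFS; the paths and file pattern below are placeholders:

# Bundle many small per-step files into one archive on scratch,
# then copy only the archive to the NFS-mounted home directory
cd $SCRATCH/myproject_run
tar czf results_bundle.tar.gz step_*.dat
cp results_bundle.tar.gz $HOME/results/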
3. Be Careful with Concurrent Writing
- Many processes writing to the same NFS file simultaneously can cause poor performance or undefined behavior, depending on how the application is written.
- Prefer:
- Each process writing its own output in a controlled pattern, or
- Using libraries designed for concurrent I/O, and placing data on storage that supports it well.
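As a sketch of the first option, in a Slurm job each task can write its own file using the per-task SLURM_PROCID variable set by srun; the solver name, its option, and the output location are placeholders:

# Launched via srun: each task writes a separate file on scratch
# instead of all tasks appending to one file over NFS
OUTFILE="$SCRATCH/myproject_run/output_rank_${SLURM_PROCID}.dat"
./my_solver --output "$OUTFILE"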
4. Consider Job Startup Storms
- Launching thousands of processes that all read the same executable or input from NFS at once can overload the server.
- Admins may mitigate this via caching or local copies, but as a user:
- Avoid huge simultaneous compilations on login/home if not necessary.
- Prefer staging large shared input datasets to scratch.
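On Slurm-based systems, one way to stage a shared input once per node instead of once per process is the standard sbcast tool; the application and file names here are placeholders:

# Inside a job script: copy the input to node-local storage on every
# allocated node, then have all tasks read the local copy
sbcast $SCRATCH/big_input.dat /tmp/big_input.dat
srun ./my_app --input /tmp/big_input.dat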
Common NFS-Related Issues Users May See
You may encounter:
- “Stale NFS file handle” errors
- Happen when files or directories are changed/removed while mounted
- Often transient; reloading directory views or re-running commands sometimes helps
- Persistent problems should be reported to support
- Freezes or slow responses in directories
- When the NFS server is heavily loaded or unresponsive
- Commands like ls may hang if they access a busy or unavailable NFS mount
- Permission mismatches
- NFS often relies on consistent user and group IDs across nodes (e.g. via LDAP)
- If mapping is misconfigured, you may see unexpected permission denied errors
On a managed HPC system, you typically report these behaviors to the support team rather than modify NFS settings yourself.
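When you do contact support, a few harmless checks can make the report more precise; problem_file below stands in for whichever file misbehaves:

# Re-resolve the current directory; this sometimes clears a stale handle
cd "$PWD"

# Confirm your numeric user and group IDs (should be consistent across nodes)
id

# Show ownership as raw UIDs/GIDs to spot ID-mapping problems
ls -ln problem_file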
Security Considerations (High-Level)
NFS security in HPC environments is typically based on:
- Network isolation (internal cluster networks, firewalls)
- Consistent user authentication (e.g. centralized directory services)
- NFSv4 features like Kerberos in more security-conscious or multi-tenant environments
For most users, the main implications are:
- Treat NFS areas as shared infrastructure with other users on the cluster.
- Do not store highly sensitive personal data there unless explicitly allowed and protected.
- Respect file permissions, quotas, and site policy.
Summary: NFS’s Role in an HPC Filesystem Mix
Within an HPC cluster’s overall storage architecture, NFS usually:
- Provides convenient, centralized, moderate-performance shared storage.
- Hosts home directories, small shared software trees, and configuration.
- Is not the right place for heavy, large-scale parallel I/O.
Effective HPC workflows typically:
- Use NFS for scripts, configuration, small data, and long-term personal storage.
- Use high-performance parallel or scratch filesystems for data-intensive runs and large datasets.
Understanding where NFS fits lets you design workflows that are both efficient and kind to shared infrastructure.