Role of Storage in HPC Systems
In HPC, storage systems are responsible for holding data beyond the lifetime of a single program run or system boot. Unlike registers, cache, and RAM, storage devices are:
- Non-volatile (data persists across reboots and power cycles).
- Larger in capacity but slower in access time.
- Shared between many users and jobs in cluster environments.
For HPC workloads, storage design must balance:
- Capacity (how much data can be stored).
- Performance (how fast data can be read/written).
- Concurrency (how many processes can access data simultaneously).
- Reliability (how resilient data is to failures).
- Cost (both hardware and operational).
This chapter focuses on local node-level storage and global/shared storage concepts, not the parallel filesystems themselves, which are covered later in the course.
Types of Storage Devices
Hard Disk Drives (HDDs)
HDDs use spinning platters and moving read/write heads.
Characteristics relevant to HPC:
- High capacity at relatively low cost.
- High sequential throughput for large, contiguous reads/writes.
- High access latency and poor random I/O performance compared to solid-state drives.
- Sensitive to concurrent small random I/O from many processes.
Typical use in HPC:
- Bulk storage for large datasets.
- Long-term project data, archives, and backup tiers.
- Underlying media for some parallel filesystems.
Key performance metrics:
- Sequential bandwidth (MB/s or GB/s).
- Average seek time (ms).
- IOPS (I/O operations per second), especially for random access.
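To see why these metrics matter, a rough back-of-the-envelope comparison of sequential versus small random access on an HDD is sketched below; all figures are illustrative assumptions, not measurements of any particular drive.

```python
# Rough estimate of HDD throughput for sequential vs. random access.
# All numbers are illustrative assumptions, not measurements.

seq_bandwidth_mb_s = 200      # assumed sequential bandwidth (MB/s)
avg_access_time_ms = 10       # assumed seek + rotational latency per random I/O (ms)
block_size_kb = 4             # small random read size (KiB)

# Sequential: transfer time dominates.
# Random: every 4 KiB read pays ~10 ms of head positioning time.
ios_per_second = 1000 / avg_access_time_ms           # ~100 IOPS
random_mb_s = ios_per_second * block_size_kb / 1024   # effective throughput

print(f"Sequential:         ~{seq_bandwidth_mb_s} MB/s")
print(f"Random 4 KiB reads: ~{ios_per_second:.0f} IOPS, ~{random_mb_s:.2f} MB/s")
```

Even with generous assumptions, small random reads deliver well under 1 MB/s on an HDD, hundreds of times slower than sequential access.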
Solid-State Drives (SSDs)
SSDs store data in flash memory, with no moving parts.
Characteristics:
- Much lower latency than HDDs.
- Higher IOPS and better random access performance.
- Sequential throughput comparable to or higher than HDDs.
- Limited write endurance (number of program/erase cycles), but usually adequate for HPC workloads when properly managed.
Common forms:
- SATA SSDs: use traditional disk interfaces; limited by SATA bandwidth.
- NVMe SSDs: connect directly via PCIe; significantly higher bandwidth and IOPS.
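The gap between the two interfaces can be estimated from their link rates alone; the sketch below uses nominal SATA III and PCIe 3.0 x4 figures, adjusted only for line encoding (real devices deliver somewhat less due to protocol and controller overhead).

```python
# Nominal interface bandwidth comparison; illustrative, not device specs.

sata3_gbit_s = 6            # SATA III raw link rate (Gbit/s)
pcie3_lane_gb_s = 0.985     # ~1 GB/s usable per PCIe 3.0 lane after encoding
nvme_lanes = 4              # a common NVMe SSD link width

sata3_gb_s = sata3_gbit_s / 10 * 8 / 8   # 8b/10b encoding leaves ~600 MB/s
nvme_gb_s = pcie3_lane_gb_s * nvme_lanes

print(f"SATA III:           ~{sata3_gb_s * 1000:.0f} MB/s")
print(f"NVMe (PCIe 3.0 x4): ~{nvme_gb_s:.1f} GB/s")
```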
Typical use in HPC:
- Local scratch space on compute nodes for temporary data.
- High-performance tiers in multi-layer storage systems.
- Metadata storage for complex filesystem structures.
Emerging Non-Volatile Memory (NVM)
Newer technologies (e.g., NVDIMM, persistent memory modules, next-gen NVM) blur the line between memory and storage.
Characteristics:
- Latency closer to RAM than to SSDs.
- Byte-addressable or block-addressable depending on mode.
- Can be used as very fast storage or as extended memory.
In HPC, such devices are used for:
- Burst buffers (high-speed intermediate storage).
- Checkpoint/restart data.
- I/O-intensive workflows that need very low latency.
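A minimal sketch of the checkpoint/restart pattern is shown below: write checkpoints to a fast local tier first, then copy the latest one to persistent storage. The directories, checkpoint interval, and pickle format are illustrative assumptions, not a site's actual configuration.

```python
import os, pickle, shutil, tempfile

# Stand-in directories; a real job would use a burst buffer or node-local
# path and a shared project directory exposed by the site.
fast_dir = tempfile.mkdtemp(prefix="ckpt_fast_")
persistent_dir = tempfile.mkdtemp(prefix="ckpt_persistent_")

state = {"step": 0, "values": [0.0] * 1000}

for step in range(1, 101):
    state["step"] = step
    state["values"] = [v + 1.0 for v in state["values"]]   # stand-in for real computation

    if step % 25 == 0:
        # Write the checkpoint to the fast tier first (low latency, high bandwidth)...
        ckpt = os.path.join(fast_dir, f"ckpt_{step:05d}.pkl")
        with open(ckpt, "wb") as f:
            pickle.dump(state, f)
        # ...then copy it to persistent storage so it survives node failure or job end.
        shutil.copy2(ckpt, persistent_dir)

print("latest checkpoint copied to", persistent_dir)
```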
Local vs Shared Storage
Local Storage on Compute Nodes
Each compute node typically has its own internal storage (HDDs, SSDs, or both).
Usage patterns:
- Temporary “scratch” data created during a job and deleted afterwards.
- Local caching of input data to reduce network and shared filesystem load.
- Staging intermediate results that do not need to be shared between nodes.
Characteristics:
- High bandwidth and low latency from the perspective of processes on that node.
- Not visible to other nodes by default.
- Not persistent across node reinstallation or hardware replacement; often not backed up.
- Available space is often much smaller than on global storage.
Implications for users:
- Use local storage for short-lived files that are large or heavily accessed.
- Do not treat local disks as long-term storage unless explicitly documented.
- Expect files on local scratch to be cleaned periodically by system policies.
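A minimal sketch of this workflow is shown below, assuming the batch system points TMPDIR (or a site-specific equivalent) at node-local scratch; RESULT_DIR is a hypothetical variable standing in for a persistent output location.

```python
import os, shutil, tempfile

# tempfile honors TMPDIR, which many batch systems point at node-local
# scratch; the exact variable and mount point are site-specific.
local_scratch = tempfile.mkdtemp(prefix="job_scratch_")

try:
    # Heavy intermediate I/O stays on the local tier...
    intermediate = os.path.join(local_scratch, "intermediate.dat")
    with open(intermediate, "wb") as f:
        f.write(os.urandom(1024 * 1024))      # stand-in for real intermediate data

    # ...and only the final result is copied back to persistent shared storage.
    result_dir = os.environ.get("RESULT_DIR", os.getcwd())   # hypothetical variable
    shutil.copy2(intermediate, os.path.join(result_dir, "result.dat"))
finally:
    # Local scratch is ephemeral and may be purged; clean up explicitly.
    shutil.rmtree(local_scratch, ignore_errors=True)
```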
Shared / Global Storage
Shared or global storage is accessible from many or all nodes in the cluster, usually over a high-speed network.
Characteristics:
- Central location for user home directories, project data, and shared software.
- Concurrent access by many jobs and users.
- Backups and quotas are typically managed by system administrators.
- Implemented using networked storage technologies and often parallel filesystems (discussed in a later chapter).
Important distinctions often found on clusters:
- Home directories:
  - Smaller capacity.
  - Intended for source code, scripts, configuration files, and small datasets.
  - Frequently backed up.
- Project or work directories:
  - Larger capacity.
  - Optimized for higher I/O throughput.
  - May have more relaxed backup policies.
- Scratch or temporary global filesystems:
  - Very large and high-performance.
  - Not backed up; files can be purged after an inactivity period.
Users must understand which area to use for which kind of data to avoid performance issues and data loss.
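Before writing large outputs, it also helps to check how much space a given area actually offers. The sketch below uses Python's standard library; the paths are placeholders for your site's actual home, project, and scratch mount points.

```python
import shutil

# Placeholder paths; substitute your site's actual home/project/scratch mounts.
areas = {
    "home": "/home",
    "project": "/project",
    "scratch": "/scratch",
}

for name, path in areas.items():
    try:
        usage = shutil.disk_usage(path)
        print(f"{name:8s} {path:12s} "
              f"total={usage.total / 1e12:.1f} TB  free={usage.free / 1e12:.1f} TB")
    except FileNotFoundError:
        print(f"{name:8s} {path:12s} not mounted on this system")
```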
Storage Performance Concepts
Latency vs Bandwidth
Two fundamental performance aspects:
- Latency: time to start an I/O operation.
  - Includes command processing, seek time (for HDDs), and protocol overhead.
  - Critical for small files and many random accesses.
- Bandwidth (throughput): amount of data that can be transferred per unit time.
  - Important for large, contiguous reads/writes.
HPC codes may be:
- Latency-sensitive: many small, random I/O operations.
- Bandwidth-sensitive: few large sequential reads/writes.
Understanding which category your workflow falls into guides the choice of storage tier and access patterns.
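A simple model makes the distinction concrete: the time for each operation is roughly its latency plus the data size divided by bandwidth. The sketch below compares moving the same 10 GB as one file versus many small files; the latency and bandwidth values are assumed for illustration only.

```python
# Simple transfer-time model: time_per_op = latency + size / bandwidth.
# All numbers are illustrative assumptions.

latency_s = 0.005       # assumed per-operation latency (5 ms)
bandwidth = 1e9         # assumed bandwidth (1 GB/s)
total_bytes = 10e9      # 10 GB of data in total

def transfer_time(n_files: int) -> float:
    """Total time to move total_bytes split evenly over n_files operations."""
    size = total_bytes / n_files
    return n_files * (latency_s + size / bandwidth)

for n in (1, 1_000, 1_000_000):
    print(f"{n:>9} file(s): {transfer_time(n):8.1f} s")
```

With one large file the transfer is bandwidth-bound (about 10 s); with a million tiny files the per-operation latency dominates and the same data takes orders of magnitude longer.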
IOPS and Access Patterns
IOPS measures how many I/O operations per second a device can handle.
- HDDs have low IOPS, especially for random access.
- SSDs and NVMe provide much higher IOPS.
Access patterns:
- Sequential I/O:
  - Data read/written in large contiguous blocks.
  - Maximizes effective bandwidth; ideal for both HDDs and SSDs.
- Random I/O:
  - Small blocks scattered across disk.
  - Incurs frequent seeks on HDDs; much slower.
  - Better suited to SSDs, but still less efficient than sequential I/O.
Best practice in HPC:
- Design applications (when possible) to perform fewer, larger I/O operations.
- Avoid frequent opening/closing of files and tiny, unbuffered writes.
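The sketch below illustrates the point by timing many tiny unbuffered writes against a buffered stream of the same total size; file names and record counts are arbitrary, and absolute timings depend entirely on the system it runs on.

```python
import os, time, tempfile

payload = b"x" * 128     # tiny record
n_records = 20_000

def tiny_unbuffered_writes(path: str) -> float:
    """One small write() syscall per record, with no user-space buffering."""
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    for _ in range(n_records):
        os.write(fd, payload)
    os.close(fd)
    return time.perf_counter() - start

def buffered_writes(path: str) -> float:
    """Python's buffered I/O coalesces records into far fewer, larger syscalls."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_records):
            f.write(payload)
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    t1 = tiny_unbuffered_writes(os.path.join(d, "tiny.dat"))
    t2 = buffered_writes(os.path.join(d, "buffered.dat"))
    print(f"unbuffered: {t1:.3f} s, buffered: {t2:.3f} s")
```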
Caching and Buffering
Operating systems use RAM to cache disk data:
- Read cache: recently read data can be served from memory instead of disk.
- Write cache (buffering): writes are collected in memory and flushed to disk later.
Implications:
- Short-term performance can be much higher than the physical disk alone.
- Data may not be immediately on stable storage after a write; closing the file flushes application buffers, and calling fsync() forces the data to disk (see the sketch below).
- Under memory pressure, caches shrink and I/O can slow down.
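A minimal sketch of forcing data onto stable storage after a write (the file name is arbitrary):

```python
import os

# Writing alone may only place the data in the OS page cache; flushing the
# user-space buffer and calling fsync asks the kernel to push it to the device.
with open("results.dat", "wb") as f:
    f.write(b"important results\n")
    f.flush()                 # flush Python's user-space buffer to the OS
    os.fsync(f.fileno())      # ask the OS to write the data to stable storage
# Closing the file flushes the user-space buffer, but does not by itself
# guarantee the data has reached the physical device.
```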
On shared systems:
- Caching benefits are more pronounced for repeated reads of the same data.
- They are limited for workloads that stream very large data sets once.
Reliability and Redundancy
Storage failures are inevitable over time, especially with large numbers of disks. HPC systems employ techniques to improve reliability and availability.
RAID Basics
RAID (Redundant Array of Independent Disks) combines multiple physical disks into logical units to improve performance and/or redundancy.
Some common RAID levels:
- RAID 0 (striping):
  - Splits data across disks to increase bandwidth.
  - No redundancy: failure of one disk loses all data.
- RAID 1 (mirroring):
  - Duplicates data on two or more disks.
  - Improved read performance; can survive disk failure.
- RAID 5/6 (striping with parity):
  - Distributes parity information across disks.
  - Can survive failure of one (RAID 5) or two (RAID 6) disks.
  - Provides a balance of performance, capacity, and redundancy.
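A quick way to compare the levels above is usable capacity versus the number of disk failures tolerated; the sketch below computes both for an assumed set of identical disks (RAID 1 is treated here as an n-way mirror).

```python
# Usable capacity and fault tolerance for common RAID levels,
# assuming n identical disks of a given size. Illustrative only.

n_disks = 8
disk_tb = 10

layouts = {
    "RAID 0": (n_disks * disk_tb, 0),        # striping: full capacity, no redundancy
    "RAID 1": (disk_tb, n_disks - 1),        # n-way mirror: one disk's worth usable
    "RAID 5": ((n_disks - 1) * disk_tb, 1),  # one disk's worth of parity
    "RAID 6": ((n_disks - 2) * disk_tb, 2),  # two disks' worth of parity
}

for level, (usable_tb, failures_tolerated) in layouts.items():
    print(f"{level}: {usable_tb:3d} TB usable, "
          f"survives {failures_tolerated} disk failure(s)")
```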
In HPC:
- RAID is often used at the building-block level for larger storage systems.
- Users typically see only the resulting logical volumes, not the individual disks.
Data Integrity and Backups
Even with RAID, data can be lost due to:
- Multiple simultaneous hardware failures.
- Human error (accidental deletion).
- Software bugs or file corruption.
To mitigate this, HPC centers typically deploy:
- Snapshots: point-in-time copies of filesystems for quick rollback.
- Backups to separate storage tiers, often slower and cheaper.
- Archival systems (e.g., tape libraries) for long-term retention.
User responsibilities:
- Understand what is backed up and what is not (e.g., scratch space usually is not).
- Use version control for code.
- Design workflows to regularly save important intermediate results to persistent storage.
Storage Hierarchies in HPC
HPC environments commonly combine several storage layers with different performance and cost characteristics, forming a storage hierarchy similar in spirit to the memory hierarchy.
A typical hierarchy might include:
- Node-local ephemeral storage
  - Example: NVMe SSDs or HDDs inside compute nodes.
  - Fastest storage accessible to that node.
  - Used for temporary files and job-specific scratch data.
- High-performance shared storage
  - Implemented using parallel filesystems.
  - Provides large capacity and high aggregate bandwidth across many nodes.
  - Suitable for large shared datasets and job outputs.
- Persistent project/home storage
  - More conservative performance characteristics.
  - Stronger focus on reliability, quotas, and backups.
  - Stores user environments, source code, and important datasets.
- Archive / cold storage
  - Tape or low-cost disk systems.
  - Very high capacity, low cost per TB.
  - High latency to access; best for infrequently accessed data.
Users must choose the appropriate level for each data type:
- Hot data (frequently accessed, performance-critical): fastest tiers.
- Warm data (regularly used but less performance-critical): project storage.
- Cold data (rarely accessed, long-term): archival systems.
User-Level Best Practices for Storage Use
While system-level details vary, some general guidelines apply to most HPC centers:
- Read the documentation for your site’s storage layout:
  - Know the purpose, performance expectations, and policies for each filesystem or directory (home, project, scratch, archive).
- Use local storage for:
  - Temporary job data.
  - Large intermediate files that do not need to be shared.
- Use shared high-performance storage for:
  - Input datasets used by multiple jobs or users.
  - Output that needs to be analyzed or shared after jobs finish.
- Avoid:
  - Storing large datasets in home directories if home is not designed for high-performance I/O.
  - Treating scratch spaces as permanent storage.
  - Creating huge numbers of very small files; where possible, group data into fewer, larger files.
- Plan for data movement:
  - Stage input data to the appropriate storage tier before heavy computation.
  - Compress and archive old results to free up space (a small sketch follows below).
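As a small sketch of the compress-and-archive step mentioned above, Python's standard tarfile module can pack a directory of old results into a single compressed file; the paths are placeholders for an existing results directory and an archive location.

```python
import tarfile
from pathlib import Path

# Placeholder paths; substitute your actual results directory and archive area.
results_dir = Path("old_results")                    # an existing directory of finished results
archive_path = Path("archive") / "old_results.tar.gz"
archive_path.parent.mkdir(parents=True, exist_ok=True)

# Pack many small result files into one compressed archive: fewer files,
# less space, and friendlier to archive/cold-storage tiers.
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(results_dir, arcname=results_dir.name)

print(f"archived {results_dir} -> {archive_path}")
```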
Understanding the storage systems available and their characteristics allows you to design workflows that are not only faster but also more robust and easier to manage at scale.