
Storage systems

Role of Storage in HPC Systems

In HPC, storage systems are responsible for holding data beyond the lifetime of a single program run or system boot. Unlike registers, cache, and RAM, storage devices are:

  • non-volatile: their contents survive power loss and reboots,
  • much larger in capacity, but
  • orders of magnitude slower to access.

For HPC workloads, storage design must balance:

  • capacity: how much data can be held,
  • bandwidth: how fast large volumes of data can be moved,
  • latency: how quickly individual operations complete,
  • cost per terabyte,
  • reliability and data durability.

This chapter focuses on local node-level storage and global/shared storage concepts, not the parallel filesystems themselves, which are covered later in the course.

Types of Storage Devices

Hard Disk Drives (HDDs)

HDDs use spinning platters and moving read/write heads.

Characteristics relevant to HPC:

  • mechanical seeks give access latencies of several milliseconds,
  • good sequential throughput (on the order of 100–250 MB/s per drive),
  • poor random-access performance (roughly 100–200 IOPS),
  • low cost per terabyte and very high capacities.

Typical use in HPC:

  • capacity tiers of shared filesystems,
  • backup and archive systems,
  • bulk storage where cost per terabyte dominates.

Key performance metrics:

  • average seek time,
  • rotational latency (determined by the drive's RPM),
  • sequential transfer rate,
  • IOPS for random workloads.
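These mechanical characteristics combine into a simple back-of-the-envelope model of HDD access time. The sketch below uses illustrative figures (7200 RPM, 8 ms average seek, 200 MB/s transfer), not numbers from any specific drive:

```python
# Back-of-the-envelope model of HDD random-access time.
# Seek time and half a rotation dominate; the data transfer itself
# is negligible for small requests.

def hdd_access_time_ms(rpm, avg_seek_ms, transfer_mb_s, request_kb):
    """Estimate time for one random request: seek + rotation + transfer."""
    rotational_ms = 0.5 * 60_000 / rpm              # wait half a revolution on average
    transfer_ms = request_kb / 1024 / transfer_mb_s * 1000
    return avg_seek_ms + rotational_ms + transfer_ms

t = hdd_access_time_ms(rpm=7200, avg_seek_ms=8.0, transfer_mb_s=200, request_kb=4)
print(f"{t:.2f} ms per random 4 KiB read")          # ~12 ms -> under 100 IOPS
```

The model makes the key point directly: a 7200 RPM drive spends about 12 ms per random 4 KiB request, almost all of it seeking and rotating, which is why HDDs deliver only on the order of 100 random IOPS.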

Solid-State Drives (SSDs)

SSDs store data in flash memory, with no moving parts.

Characteristics:

  • access latencies in the tens to hundreds of microseconds,
  • high random-access performance (tens of thousands of IOPS or more),
  • no seek penalty, so random and sequential access perform similarly,
  • higher cost per terabyte than HDDs,
  • limited write endurance (flash cells wear out after many write cycles).

Common forms:

  • SATA SSDs, limited by the SATA interface (around 550 MB/s),
  • NVMe SSDs attached via PCIe, offering several GB/s and much lower latency.

Typical use in HPC:

  • node-local scratch space for I/O-intensive jobs,
  • burst buffers that absorb checkpoint traffic,
  • metadata storage in parallel filesystems.

Emerging Non-Volatile Memory (NVM)

Newer technologies (e.g., NVDIMM, persistent memory modules, next-gen NVM) blur the line between memory and storage.

Characteristics:

  • latency much closer to DRAM than to flash,
  • byte-addressable access in some forms (the device sits on the memory bus),
  • persistence across power cycles.

In HPC, such devices are used for:

  • fast checkpoint/restart,
  • burst buffers between compute nodes and the shared filesystem,
  • extending effective memory capacity for data-intensive applications.
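The programming model that persistent memory encourages can be sketched with a memory-mapped file: update data in place through the address space, with no explicit read/write calls. Here an ordinary file stands in for a real persistent-memory device (which on Linux would typically be mounted with DAX):

```python
# Byte-addressable persistence, sketched with mmap on a regular file.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "pmem_demo.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)            # reserve one page

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 4096) as m:
        m[0:5] = b"state"              # store directly, as if writing to memory
        m.flush()                      # push the update to the backing medium

with open(path, "rb") as f:            # the update survives beyond the mapping
    print(f.read(5))                   # prints b'state'
```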

Local vs Shared Storage

Local Storage on Compute Nodes

Each compute node typically has its own internal storage (HDDs, SSDs, or both).

Usage patterns:

  • operating system and system software,
  • swap space,
  • temporary scratch space for jobs running on that node.

Characteristics:

  • fast and private to the node: no network hop, no contention from other nodes,
  • not visible from other nodes or from login nodes,
  • often purged automatically when the job ends.

Implications for users:

  • stage heavy temporary I/O to local storage where possible,
  • copy any results you need back to shared storage before the job finishes,
  • never rely on node-local storage for long-term data.
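A common staging pattern is: do heavy I/O in node-local scratch, then copy results back to shared storage before the job ends. The sketch below assumes the site exports the per-job local scratch path in `$TMPDIR` (a widespread but site-specific convention) and uses temporary directories as stand-ins for both areas:

```python
# Staging pattern for node-local scratch (sketch; paths are stand-ins).
import os
import shutil
import tempfile

shared = tempfile.mkdtemp()                      # stand-in for shared storage
scratch = os.environ.get("TMPDIR") or tempfile.mkdtemp()

work = os.path.join(scratch, "job_work")
os.makedirs(work, exist_ok=True)

out = os.path.join(work, "result.dat")
with open(out, "w") as f:                        # heavy I/O happens locally
    f.write("final result\n")

shutil.copy(out, shared)                         # stage out before the job ends
print("result staged to", os.path.join(shared, "result.dat"))
```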

Shared / Global Storage

Shared or global storage is accessible from many or all nodes in the cluster, usually over a high-speed network.

Characteristics:

  • visible from all (or most) nodes, including login nodes,
  • large capacity, but performance is shared among all users,
  • typically provided by dedicated storage servers.

Important distinctions often found on clusters:

  • home: small, usually backed up; for source code, scripts, and configuration,
  • project/work: larger quotas; for shared datasets and results,
  • scratch: fast and large, but subject to automatic purging and usually not backed up.

Users must understand which area to use for which kind of data to avoid performance issues and data loss.

Storage Performance Concepts

Latency vs Bandwidth

Two fundamental performance aspects:

  • latency: the time to complete a single I/O operation,
  • bandwidth: the sustained rate at which data can be transferred.

HPC codes may be:

  • latency-bound: many small or metadata-heavy operations dominate,
  • bandwidth-bound: large streaming reads and writes dominate.

Understanding which category your workflow falls into guides the choice of storage tier and access patterns.
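The difference is easy to see experimentally: write the same amount of data either as thousands of small flushed operations or as one large call. The absolute timings below depend entirely on the machine; the point is the operation count:

```python
# Latency- vs bandwidth-bound I/O: 4 MiB as 4096 small writes vs one large write.
import os
import tempfile
import time

data = b"x" * 1024
d = tempfile.mkdtemp()

t0 = time.perf_counter()
with open(os.path.join(d, "small.dat"), "wb") as f:
    for _ in range(4096):              # 4096 small operations: per-op latency adds up
        f.write(data)
        f.flush()
t_small = time.perf_counter() - t0

t0 = time.perf_counter()
with open(os.path.join(d, "big.dat"), "wb") as f:
    f.write(data * 4096)               # one large operation: bandwidth dominates
t_big = time.perf_counter() - t0

print(f"many small writes: {t_small*1e3:.1f} ms, one large write: {t_big*1e3:.1f} ms")
```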

IOPS and Access Patterns

IOPS measures how many I/O operations per second a device can handle.

Access patterns:

  • sequential: large contiguous reads or writes; friendliest to every device,
  • random: many small operations at scattered offsets; stresses IOPS,
  • metadata-heavy: creating, opening, or stat-ing many files; stresses the filesystem rather than the device.

Best practice in HPC:

  • prefer few large, sequential I/O operations over many small ones,
  • aggregate small records in memory before writing,
  • avoid creating huge numbers of tiny files.
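Aggregation can be sketched in a few lines: collect records in a memory buffer and issue one large write whenever it fills, turning thousands of tiny operations into a handful of big sequential ones. The 1 MiB chunk size is an assumed, tunable value:

```python
# Write aggregation: buffer small records, flush in large chunks.
import os
import tempfile

CHUNK = 1 << 20                        # flush in 1 MiB pieces (assumed size)

def write_records(path, records):
    buf = bytearray()
    with open(path, "wb") as f:
        for rec in records:
            buf += rec
            if len(buf) >= CHUNK:      # one large write instead of many small ones
                f.write(buf)
                buf.clear()
        if buf:
            f.write(buf)               # final partial chunk

path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(path, (b"%08d\n" % i for i in range(100_000)))
print(os.path.getsize(path), "bytes written")
```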

Caching and Buffering

Operating systems use RAM to cache disk data:

  • recently read blocks are kept in the page cache, so repeated reads are served from memory,
  • writes are buffered in RAM and flushed to the device later.

Implications:

  • the first read of a file is slow; repeated reads can run at memory speed,
  • I/O benchmarks can be misleading if they mostly measure the cache,
  • buffered data is not durable until it has actually been flushed to the device.

On shared systems:

  • each node caches independently; a file cached on one node is not cached on another,
  • parallel filesystems add their own client-side caching, with coherence costs when many nodes access the same files.
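Buffering actually happens at two levels: the process (Python/libc) buffer and the OS page cache. A plain write guarantees neither has reached the device; `flush()` empties the process buffer, and `os.fsync()` asks the OS to push the page-cache copy to storage, the step that makes a checkpoint durable across a node crash:

```python
# The two buffering levels between write() and durable storage.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "checkpoint.dat")
with open(path, "wb") as f:
    f.write(b"checkpoint state")       # may still sit in the process buffer
    f.flush()                          # now in the OS page cache
    os.fsync(f.fileno())               # now on the storage device
print("durable checkpoint at", path)
```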

Reliability and Redundancy

Storage failures are inevitable over time, especially with large numbers of disks. HPC systems employ techniques to improve reliability and availability.

RAID Basics

RAID (Redundant Array of Independent Disks) combines multiple physical disks into logical units to improve performance and/or redundancy.

Some common RAID levels:

  • RAID 0: striping across disks; more performance and capacity, no redundancy,
  • RAID 1: mirroring; full redundancy at half the raw capacity,
  • RAID 5: striping with single parity; survives one disk failure,
  • RAID 6: striping with double parity; survives two simultaneous disk failures,
  • RAID 10: striped mirrors; combines RAID 1 redundancy with RAID 0 performance.

In HPC:

  • RAID (often RAID 6 or similar) is used inside the storage servers that back shared filesystems,
  • users do not manage RAID themselves, but they benefit from the redundancy it provides.
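The capacity/redundancy trade-off of the common levels can be captured in one small function, assuming n identical disks:

```python
# Usable capacity for common RAID levels, given n identical disks.

def usable_tb(level, n_disks, disk_tb):
    if level == 0:                     # striping: no redundancy, full capacity
        return n_disks * disk_tb
    if level in (1, 10):               # mirroring / striped mirrors: half the raw capacity
        return n_disks * disk_tb / 2
    if level == 5:                     # single parity: lose one disk's worth
        return (n_disks - 1) * disk_tb
    if level == 6:                     # double parity: lose two disks' worth
        return (n_disks - 2) * disk_tb
    raise ValueError(f"unhandled RAID level {level}")

for lvl in (0, 1, 5, 6, 10):
    print(f"RAID {lvl:>2}: {usable_tb(lvl, 8, 10):.0f} TB usable from 8 x 10 TB disks")
```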

Data Integrity and Backups

Even with RAID, data can be lost due to:

  • accidental deletion or overwriting by users,
  • software bugs or filesystem corruption,
  • multiple simultaneous disk failures,
  • site-wide incidents such as fire or flooding.

To mitigate this, HPC centers typically deploy:

  • regular backups of home and project areas,
  • filesystem snapshots for short-term recovery,
  • tape or off-site archives for long-term retention.

User responsibilities:

  • know which areas are backed up and which are not (scratch usually is not),
  • keep copies of irreplaceable data such as source code, input files, and key results,
  • treat scratch space as expendable.
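A simple integrity habit worth adopting: checksum a file before and after any transfer or archival copy, and compare the digests. A minimal sketch using SHA-256:

```python
# Verify a copy by comparing checksums of source and destination.
import hashlib
import os
import shutil
import tempfile

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):  # stream in chunks, not all at once
            h.update(block)
    return h.hexdigest()

d = tempfile.mkdtemp()
src = os.path.join(d, "results.dat")
with open(src, "wb") as f:
    f.write(os.urandom(1 << 16))       # stand-in for real job output

dst = os.path.join(d, "results.archived.dat")
shutil.copy(src, dst)
print("copy verified:", sha256_of(src) == sha256_of(dst))
```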

Storage Hierarchies in HPC

HPC environments commonly combine several storage layers with different performance and cost characteristics, forming a storage hierarchy similar in spirit to the memory hierarchy.

A typical hierarchy might include:

  1. Node-local ephemeral storage
    • Example: NVMe SSDs or HDDs inside compute nodes.
    • Fastest storage accessible to that node.
    • Used for temporary files and job-specific scratch data.
  2. High-performance shared storage
    • Implemented using parallel filesystems.
    • Provides large capacity and high aggregate bandwidth across many nodes.
    • Suitable for large shared datasets and job outputs.
  3. Persistent project/home storage
    • More conservative performance characteristics.
    • Stronger focus on reliability, quotas, and backups.
    • Stores user environments, source code, and important datasets.
  4. Archive / cold storage
    • Tape or low-cost disk systems.
    • Very high capacity, low cost per TB.
    • High latency to access; best for infrequently accessed data.

Users must choose the appropriate level for each data type:

  • temporary per-job files → node-local storage,
  • active large datasets and job output → high-performance shared storage,
  • source code, scripts, and curated results → project/home storage,
  • completed, rarely accessed data → archive.

User-Level Best Practices for Storage Use

While system-level details vary, some general guidelines apply to most HPC centers:

  • do heavy I/O on scratch or node-local storage, not in your home directory,
  • avoid creating very large numbers of small files,
  • monitor your quotas and clean up data you no longer need,
  • move important results to backed-up or archival storage promptly.

Understanding the storage systems available and their characteristics allows you to design workflows that are not only faster but also more robust and easier to manage at scale.
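One cheap habit that follows from these guidelines is checking how full a filesystem is before launching an I/O-heavy job, so the job does not die mid-run on a full scratch area. Note that `shutil.disk_usage` reports whole-filesystem usage, not your personal quota; quota tools are site-specific. The path `/` below is only a stand-in for a real scratch directory:

```python
# Check free space on a filesystem before starting heavy I/O.
import shutil

usage = shutil.disk_usage("/")         # "/" as a stand-in for a scratch path
pct = 100 * usage.used / usage.total
print(f"{usage.free / 1e9:.1f} GB free ({pct:.0f}% used)")
if pct > 90:
    print("warning: filesystem nearly full, clean up before running")
```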
