
File formats used in HPC

Why File Formats Matter in HPC

In HPC, file formats are not just about how data is stored; they strongly influence:

Most HPC data formats are designed to support:

This chapter focuses on commonly used file formats and container formats in HPC, and how they relate to parallel I/O and large-scale workflows.

Broad Categories of File Formats

HPC data can be organized in several broad categories:

Text-Based Formats in HPC

Simple text and CSV

Plain text (.txt, .dat) and comma-separated values (.csv) files are occasionally used in HPC:

Characteristics:

Best kept for:
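As a concrete illustration, a small table of run metadata is the kind of data where CSV still makes sense; a minimal sketch using only the Python standard library (the column names are made up for the example):

```python
import csv
import io

# Hypothetical per-run summary data from a parameter sweep. Small,
# human-readable tables like this are a good fit for CSV; large
# numerical arrays are not.
rows = [
    {"run_id": 1, "nodes": 4, "walltime_s": 1812.5},
    {"run_id": 2, "nodes": 8, "walltime_s": 941.2},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["run_id", "nodes", "walltime_s"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

The result can be opened in any editor or spreadsheet, which is precisely the portability that keeps text formats in use for small data.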

JSON, YAML, XML

More structured text formats (JSON, YAML, XML) are sometimes used:

They are rarely used to store large numerical arrays in HPC due to size and parsing costs.
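The size overhead is easy to demonstrate: encoding the same floating-point values as JSON text versus packed binary shows why text formats do not scale to large arrays. A stdlib-only sketch:

```python
import json
import struct

# A small array of doubles, standing in for numerical simulation output.
# i/7 deliberately produces long decimal expansions, as real data does.
values = [i / 7 for i in range(1000)]

# Text encoding: JSON stores each number as decimal characters and must
# be re-parsed digit by digit on read.
json_bytes = json.dumps(values).encode("utf-8")

# Binary encoding: exactly 8 bytes per IEEE 754 double, no parsing.
packed = struct.pack(f"{len(values)}d", *values)

print(len(json_bytes), len(packed))
```

For a million-element array the ratio is similar, and the parsing cost on read grows with it, which is why JSON/YAML/XML in HPC are usually confined to configuration and metadata.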

Binary Formats and Raw Data

Raw binary dumps

A simple approach is to dump arrays directly in binary:

Characteristics:

Common uses:

When using raw binary, teams often maintain:
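The core weakness of raw binary is that the bytes alone say nothing about shape, type, or byte order, so that information has to live somewhere else. A stdlib-only sketch of one common convention, a JSON "sidecar" file next to the raw dump (file and key names are illustrative):

```python
import json
import struct
import tempfile
from pathlib import Path

# Flatten a 2D array into a raw little-endian float64 dump, plus a JSON
# sidecar recording the layout. Without such metadata (in a header, a
# sidecar, or documentation), the raw bytes are uninterpretable.
shape = (4, 3)
data = [float(i) for i in range(shape[0] * shape[1])]

tmp = Path(tempfile.mkdtemp())
(tmp / "field.raw").write_bytes(struct.pack(f"<{len(data)}d", *data))
(tmp / "field.json").write_text(json.dumps(
    {"dtype": "float64", "byte_order": "little", "shape": shape, "order": "C"}
))

# Reading back requires the sidecar to interpret the bytes correctly.
meta = json.loads((tmp / "field.json").read_text())
n = meta["shape"][0] * meta["shape"][1]
restored = list(struct.unpack(f"<{n}d", (tmp / "field.raw").read_bytes()))
print(restored == data)
```

Self-describing formats such as HDF5 and NetCDF exist largely to make this bookkeeping unnecessary.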

Self-Describing, Structured Formats

Self-describing formats store both data and metadata in a structured, portable way, usually via a library. These are core to HPC data workflows.

HDF5 (Hierarchical Data Format version 5)

HDF5 is one of the most widely used general-purpose data container formats in HPC.

Key concepts

The logical structure resembles:
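The hierarchy of groups, datasets, and attributes is commonly pictured as a small filesystem inside a single file; a schematic example (all names are illustrative):

```text
simulation.h5                     (one file = one container)
├── mesh/                         (group, like a directory)
│   ├── coordinates               (dataset, e.g. N x 3 float64)
│   └── connectivity              (dataset)
└── fields/                       (group)
    └── temperature               (dataset, possibly chunked/compressed)
        · attribute units = "K"   (metadata attached to the dataset)
```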

Why HDF5 is popular in HPC

Typical use cases

Practical considerations
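In practice, writing and reading HDF5 is done through a library rather than by hand. A minimal sketch using the widely used third-party h5py Python binding (file, group, and dataset names are made up for the example):

```python
import tempfile
from pathlib import Path

import h5py  # third-party Python binding for the HDF5 library

path = Path(tempfile.mkdtemp()) / "output.h5"

# Write: groups organize datasets like directories; attributes attach
# metadata directly to the objects they describe.
with h5py.File(path, "w") as f:
    grp = f.create_group("fields")
    dset = grp.create_dataset("temperature", data=[290.0, 291.5, 293.2])
    dset.attrs["units"] = "K"

# Read back: the file is self-describing, so no external layout
# information is needed to interpret it.
with h5py.File(path, "r") as f:
    temps = list(f["fields/temperature"][()])
    units = f["fields/temperature"].attrs["units"]
print(temps, units)
```

Parallel HDF5 via MPI-IO uses the same object model but requires an MPI-enabled build and collective file access, which this serial sketch does not show.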

NetCDF (Network Common Data Form)

NetCDF was originally designed for array-oriented scientific data, especially in Earth system sciences (climate, weather, oceanography). Modern NetCDF (NetCDF-4) is often built on top of HDF5.

Key characteristics

Versions and formats

Use in HPC

Parallel I/O support in NetCDF-4 is typically realized via HDF5 and MPI-IO.

ADIOS and ADIOS2

ADIOS (Adaptable IO System) and its successor ADIOS2 are designed specifically for HPC I/O scalability.

Goals and design

Features

Use cases

Domain-Specific and Community Formats

Many scientific domains have adopted standardized formats or conventions, often built on top of the general-purpose containers described above.

Climate and Earth system science formats

Visualization and mesh formats (VTK, XDMF, etc.)

Numerical simulations often output meshes and fields for visualization:

VTK and ParaView-related formats

Advantages:

Limitations:

XDMF (eXtensible Data Model and Format)

XDMF stores a lightweight XML description of grids, topologies, and fields, while the heavy numerical arrays live in separate binary files (commonly HDF5). This separation keeps:
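A minimal XDMF file sketching this light/heavy split (element names follow the XDMF data model; the HDF5 file path and dataset names are illustrative):

```xml
<?xml version="1.0"?>
<Xdmf Version="3.0">
  <Domain>
    <Grid Name="mesh" GridType="Uniform">
      <!-- The heavy arrays live in an HDF5 file; XDMF only points at them. -->
      <Topology TopologyType="Triangle" NumberOfElements="200">
        <DataItem Format="HDF" Dimensions="200 3">mesh.h5:/connectivity</DataItem>
      </Topology>
      <Geometry GeometryType="XYZ">
        <DataItem Format="HDF" Dimensions="121 3">mesh.h5:/coordinates</DataItem>
      </Geometry>
      <Attribute Name="temperature" Center="Node">
        <DataItem Format="HDF" Dimensions="121">mesh.h5:/temperature</DataItem>
      </Attribute>
    </Grid>
  </Domain>
</Xdmf>
```

Tools such as ParaView read the XML to discover the structure, then pull the bulk data straight from HDF5.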

Molecular dynamics (MD) formats

Common MD codes (e.g., GROMACS, LAMMPS, NAMD) use their own set of formats:

Characteristics:

From an HPC standpoint:

Other domain examples

Choosing File Formats in HPC Projects

Selecting a file format in an HPC project involves multiple trade-offs.

Key criteria

  1. Performance and parallel I/O
    • Does the format support MPI-IO or other parallel mechanisms?
    • Are there known scalability limits on your target filesystem?
  2. Interoperability
    • Do your analysis/visualization tools support the format?
    • Is it standard in your research community?
  3. Metadata richness
    • Can you store necessary metadata (units, coordinate systems, provenance)?
    • Are there community conventions (e.g., CF) you should follow?
  4. Portability and longevity
    • Is the format self-describing and robust across platforms?
    • Is it likely to be readable in 5–10 years?
  5. Complexity and development effort
    • Does using the format require a large library or complex APIs?
    • Do you have the expertise and time to integrate it correctly?

Common patterns

Practical Tips for Working with HPC File Formats

Summary

In HPC, file formats are closely tied to:

Self-describing, structured formats like HDF5, NetCDF-4, and ADIOS2 form the backbone of many HPC workflows, while domain-specific conventions ensure compatibility within scientific communities. Choosing and using the right formats is a core design decision for any serious HPC application.
