Why File Formats Matter in HPC
In HPC, file formats are not just about how data is stored; they strongly influence:
- I/O performance (throughput, metadata overhead, parallel access)
- Portability between systems and architectures
- Long-term usability and reproducibility
- Interoperability between tools, libraries, and workflows
Most HPC data formats are designed to support:
- Very large datasets (gigabytes to petabytes)
- Parallel access (many processes reading/writing simultaneously)
- Rich metadata (units, dimensions, provenance)
- Portability across CPU architectures and operating systems
This chapter focuses on commonly used file formats and container formats in HPC, and how they relate to parallel I/O and large-scale workflows.
Broad Categories of File Formats
HPC data can be organized in several broad categories:
- Text-based formats
- Human-readable, easy to inspect and debug
- Generally poor space efficiency and I/O performance
- Binary formats
- Compact and fast for reading/writing
- Not directly human-readable; need tools or libraries
- Self-describing, structured formats
- Store both data and rich metadata (dimensions, types, attributes)
- Support complex structures like multi-dimensional arrays, groups, or tables
- Domain-specific formats
- Tailored to particular scientific domains (e.g., climate, molecular dynamics)
- Often built on top of general-purpose containers like HDF5 or NetCDF
Text-Based Formats in HPC
Simple text and CSV
Plain text (.txt, .dat) and comma-separated values (.csv) files are occasionally used in HPC:
- For configuration files, small tables, or logs
- For intermediate diagnostics and debugging
- For small-scale or early-stage development
Characteristics:
- Pros:
- Easy to view with `less`, `cat`, `head`, `tail`, or a text editor
- Simple to generate from almost any language
- Easy to version-control and diff
- Cons:
- Large overhead (each number stored as characters)
- Parsing overhead is high (text must be converted to numeric types)
- No inherent metadata structure
- Parallel writing is difficult to coordinate efficiently
Best kept for:
- Small datasets
- Human-in-the-loop inspection
- Non-performance-critical paths of a workflow
JSON, YAML, XML
More structured text formats (JSON, YAML, XML) are sometimes used:
- For metadata, configuration, or workflow descriptions
- To describe simulation setups or parameter files
They are rarely used to store large numerical arrays in HPC due to size and parsing costs.
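As a small illustration, a simulation parameter file can be written as JSON directly from Python; the parameter names and file name below are purely illustrative:

```python
import json

# Hypothetical simulation parameters; names and values are illustrative only
params = {
    "grid": {"nx": 512, "ny": 512, "nz": 256},
    "time": {"dt": 1.0e-3, "n_steps": 10000},
    "output": {"format": "hdf5", "every_n_steps": 100},
}

# Human-readable, diff-friendly, and easy to version-control
with open("run_params.json", "w") as f:
    json.dump(params, f, indent=2)

# Reading it back is equally simple
with open("run_params.json") as f:
    params_in = json.load(f)
print(params_in["grid"]["nx"])
```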
Binary Formats and Raw Data
Raw binary dumps
A simple approach is to dump arrays directly in binary:
- Example: writing a `double` array `A` of length `N` with C or Fortran I/O
- Often no header, or a small custom header
Characteristics:
- Pros:
- Very space-efficient and fast
- Simple to implement
- Cons:
- Not self-describing (you must know dimension, type, endianness)
- Endianness and alignment can cause portability issues
- No standard way to store metadata alongside the data
- Hard to evolve the format without versioning headaches
Common uses:
- Intermediate checkpoint files within a tightly controlled codebase
- Short-term, single-code, single-platform workflows
When using raw binary, teams often maintain:
- A separate specification for layout (documented in code or docs)
- Small helper tools to convert binary dumps to more portable formats
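A minimal sketch of the raw-binary pattern using NumPy; the file name, shape, and dtype are assumptions that must be communicated out of band (in code or documentation), since nothing in the file records them:

```python
import numpy as np

# Write a raw binary dump: no header, no metadata, just the bytes of the array
A = np.random.rand(1000, 1000)          # double-precision (float64) array
A.tofile("field.bin")

# Reading it back requires knowing dtype, shape, and byte order in advance;
# the file itself carries no self-describing information
B = np.fromfile("field.bin", dtype=np.float64).reshape(1000, 1000)
assert np.array_equal(A, B)
```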
Self-Describing, Structured Formats
Self-describing formats store both data and metadata in a structured, portable way, usually via a library. These are core to HPC data workflows.
HDF5 (Hierarchical Data Format version 5)
HDF5 is one of the most widely used general-purpose data container formats in HPC.
Key concepts
- Data is stored in a hierarchical structure:
- Groups (like directories)
- Datasets (multi-dimensional arrays, like files)
- Attributes (metadata attached to groups or datasets)
- Supports many data types: integers, floating point, strings, compound types
- Supports chunking, compression, and partial I/O (read/write subsets)
The logical structure resembles:
- `/` (root group)
- `/simulation`
- `/simulation/parameters`
- `/simulation/fields/temperature`
- `/simulation/fields/velocity`
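As an illustration, a layout like the one above could be created with the Python `h5py` bindings; the group and dataset names simply mirror the sketch, and the array sizes and attribute values are made up:

```python
import h5py
import numpy as np

temperature = np.random.rand(128, 128, 64)     # illustrative field data
velocity = np.random.rand(128, 128, 64, 3)

with h5py.File("output.h5", "w") as f:
    sim = f.create_group("simulation")
    sim.attrs["code_version"] = "1.2.3"         # attributes hold metadata

    params = sim.create_group("parameters")
    params.attrs["dt"] = 1.0e-3

    fields = sim.create_group("fields")
    # Chunking and compression are configured per dataset
    fields.create_dataset("temperature", data=temperature,
                          chunks=(32, 32, 32), compression="gzip")
    fields.create_dataset("velocity", data=velocity)

# Partial I/O: read only a subset of a dataset
with h5py.File("output.h5", "r") as f:
    slab = f["/simulation/fields/temperature"][0:16, :, :]
```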
Why HDF5 is popular in HPC
- Parallel I/O support:
- Parallel HDF5 built on top of MPI-IO
- Enables many MPI processes to read and write a single shared file (with constraints)
- Portability:
- Transparent handling of endianness and platform differences
- Rich metadata:
- Attributes allow documenting units, axes, provenance, solver settings
- Ecosystem:
- Bindings in C, C++, Fortran, Python, and others
- Analysis and visualization tools (e.g., HDFView, h5py-based scripts, ParaView-compatible workflows)
Typical use cases
- Large simulation outputs (multi-dimensional fields, time series)
- Checkpoints that need to be portable and analyzable
- Data interchange between codes written in different languages
Practical considerations
- File layout (dataset chunking, alignment) can significantly affect parallel I/O performance
- Parallel HDF5 often requires collective I/O patterns and careful access planning
- Some HPC centers provide optimized HDF5 builds tuned for their parallel filesystems
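The sketch below illustrates the shared-file parallel pattern mentioned above, assuming `h5py` and HDF5 were built with MPI support and the script is launched under `mpirun`/`srun`; the dataset name and sizes are illustrative:

```python
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n_local = 1024                      # rows owned by each rank (illustrative)
local = np.full((n_local,), rank, dtype=np.float64)

# All ranks open the same file via the MPI-IO driver
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    # Dataset creation is a collective operation: every rank must call it
    dset = f.create_dataset("data", shape=(nprocs * n_local,), dtype="f8")
    # Each rank writes its own contiguous slice of the shared dataset
    # (collective writes can be requested with `with dset.collective:`)
    dset[rank * n_local:(rank + 1) * n_local] = local
```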
NetCDF (Network Common Data Form)
NetCDF was originally designed for array-oriented scientific data, especially in Earth system sciences (climate, weather, oceanography). Modern NetCDF (NetCDF-4) is often built on top of HDF5.
Key characteristics
- Focus on multi-dimensional variables with named dimensions and attributes
- Designed with self-describing data:
- Dimensions (e.g., time, latitude, longitude, depth)
- Variables (e.g., temperature, pressure)
- Attributes (units, missing values, metadata)
- Provides a simpler conceptual model than full HDF5 for many scientific use cases
Versions and formats
- NetCDF classic / 64-bit offset:
- Older, non-HDF5 formats with more limited features and file-size constraints
- NetCDF-4:
- Uses HDF5 as the underlying storage
- Inherits many HDF5 benefits (compression, chunking, parallel I/O)
Use in HPC
- Widely adopted in:
- Climate models
- Weather prediction systems
- Ocean and atmospheric sciences
- Supported by many analysis and visualization tools:
`nco`, `cdo`, Panoply, Python's `netCDF4` and `xarray`
Parallel I/O support in NetCDF-4 is typically realized via HDF5 and MPI-IO.
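A small sketch using the `netCDF4` Python library; the dimension and variable names follow common climate conventions but are otherwise illustrative:

```python
from netCDF4 import Dataset
import numpy as np

with Dataset("example.nc", "w", format="NETCDF4") as ds:
    # Named dimensions
    ds.createDimension("time", None)         # unlimited dimension
    ds.createDimension("lat", 180)
    ds.createDimension("lon", 360)

    # A variable defined over those dimensions, with compression enabled
    temp = ds.createVariable("temperature", "f4", ("time", "lat", "lon"),
                             zlib=True)
    temp.units = "K"                          # attributes carry metadata
    temp.long_name = "surface air temperature"

    temp[0, :, :] = 280.0 + 10.0 * np.random.rand(180, 360)
```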
ADIOS and ADIOS2
ADIOS (Adaptable IO System) and its successor ADIOS2 are designed specifically for HPC I/O scalability.
Goals and design
- High performance and scalability on large systems
- Emphasis on streaming, in-situ, and staging I/O
- Abstract the low-level details of parallel I/O from applications
Features
- Multiple engines (backends) for different I/O modes:
- File-based engines (e.g., BP format)
- In-memory streaming between applications or nodes
- Engines targeting different parallel filesystems or transports
- Complex data types:
- Multi-dimensional arrays
- Attributes
- APIs in C, C++, Fortran, Python
Use cases
- Exascale or near-exascale simulations where traditional file-based I/O becomes a bottleneck
- Workflows with in-situ analysis or visualization to reduce filesystem pressure
- Coupled multiphysics simulations exchanging data via ADIOS engines
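As a rough sketch of the write path described above: the exact Python API differs between ADIOS2 releases, and the calls below follow the classic bindings-style interface; the variable name, shape, and use of the default BP file engine are assumptions:

```python
import numpy as np
import adios2   # API details vary across ADIOS2 versions; this is a sketch

data = np.arange(100, dtype=np.float64)   # illustrative local array

adios = adios2.ADIOS()
io = adios.DeclareIO("writer")
# Global shape, local start, and local count describe the decomposition
var = io.DefineVariable("T", data, [100], [0], [100], adios2.ConstantDims)

engine = io.Open("output.bp", adios2.Mode.Write)   # BP file engine by default
engine.BeginStep()
engine.Put(var, data)
engine.EndStep()
engine.Close()
```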
Domain-Specific and Community Formats
Many scientific domains have adopted standardized formats or conventions, often built on top of the general-purpose containers described above.
Climate and Earth system science formats
- Typically use NetCDF or HDF5 as the underlying container
- Common conventions:
- CF (Climate and Forecast) conventions for metadata
- Benefits:
- Tools understand dimension names like `time`, `lat`, `lon`
- Strong interoperability across models and post-processing tools
- Files often have extensions like `.nc` or `.nc4`
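Because dimensions and coordinates are named, CF-style files can be explored conveniently from Python with `xarray`; the file name, variable name, and dimensions here are hypothetical:

```python
import xarray as xr

# Open a (hypothetical) CF-compliant NetCDF file lazily
ds = xr.open_dataset("ocean_model_output.nc")

print(ds.dims)                    # e.g. time, lat, lon, depth
print(ds["temperature"].attrs)    # units, long_name, etc.

# Label-based selection works because dimensions and coordinates are named
surface = ds["temperature"].isel(depth=0).mean(dim="time")
```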
Visualization and mesh formats (VTK, XDMF, etc.)
Numerical simulations often output meshes and fields for visualization:
VTK and ParaView-related formats
- Legacy VTK (`.vtk`):
- Both ASCII and binary variants
- Simpler but older, limited scalability
- VTK XML formats (e.g., `.vtu` for unstructured grids, `.vtp` for polydata):
- XML-based, often with separate binary data sections
- Used with tools like ParaView and VisIt
Advantages:
- Rich representation of meshes, topology, and fields
- Widely supported in visualization tools
Limitations:
- XML overhead
- Not primarily designed for massive parallel I/O, though parallel collections (e.g., `.pvtu`) exist
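As one possible route for producing such files (using the third-party `meshio` library, which is not mentioned above but is widely used), a tiny unstructured mesh can be written to `.vtu` from Python:

```python
import meshio
import numpy as np

# A tiny illustrative mesh: 4 points, 2 triangles
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])
cells = [("triangle", np.array([[0, 1, 2], [0, 2, 3]]))]

mesh = meshio.Mesh(points, cells,
                   point_data={"temperature": np.array([1.0, 2.0, 3.0, 4.0])})
meshio.write("mesh.vtu", mesh)   # readable in ParaView / VisIt
```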
XDMF (eXtensible Data Model and Format)
- Uses XML for metadata + HDF5 for heavy numeric data
- The XML file describes:
- Mesh topology and geometry
- Associations between HDF5 datasets and fields
- HDF5 stores the actual large arrays
This separation keeps:
- Human-readable structure and metadata
- High-performance, parallel-friendly storage in HDF5
Molecular dynamics (MD) formats
Common MD codes (e.g., GROMACS, LAMMPS, NAMD) use their own set of formats:
- Trajectory formats:
- E.g. `.dcd`, `.xtc`, `.trr`, `.lammpstrj`
- Topology and parameter files
Characteristics:
- Optimized for time-dependent particle data (positions, velocities, forces)
- Often have binary layouts tailored to specific MD packages
- Analysis tools and libraries for these formats are widespread in MD communities
From an HPC standpoint:
- Parallel I/O strategies and capabilities differ among MD codes
- Sometimes, trajectories are later converted to more generic containers (e.g., HDF5-based formats) for long-term storage or cross-tool use
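A hedged sketch of such a conversion, using the third-party MDAnalysis library (one of several options, not prescribed by the text); the topology and trajectory file names are placeholders:

```python
import MDAnalysis as mda
import h5py

# Load a (hypothetical) topology + trajectory pair, e.g. from GROMACS
u = mda.Universe("topol.tpr", "traj.xtc")

with h5py.File("trajectory.h5", "w") as f:
    n_frames = len(u.trajectory)
    n_atoms = len(u.atoms)
    pos = f.create_dataset("positions", shape=(n_frames, n_atoms, 3),
                           dtype="f4", compression="gzip")
    pos.attrs["units"] = "Angstrom"
    for i, ts in enumerate(u.trajectory):
        pos[i] = ts.positions          # copy each frame into the HDF5 dataset
```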
Other domain examples
- Astronomy / astrophysics:
- FITS (Flexible Image Transport System) for images and cubes
- HDF5-based formats for simulation data
- Computational fluid dynamics (CFD):
- Proprietary formats of commercial solvers
- Open-source formats (CGNS, often based on HDF5)
Choosing File Formats in HPC Projects
Selecting a file format in an HPC project involves multiple trade-offs.
Key criteria
- Performance and parallel I/O
- Does the format support MPI-IO or other parallel mechanisms?
- Are there known scalability limits on your target filesystem?
- Interoperability
- Do your analysis/visualization tools support the format?
- Is it standard in your research community?
- Metadata richness
- Can you store necessary metadata (units, coordinate systems, provenance)?
- Are there community conventions (e.g., CF) you should follow?
- Portability and longevity
- Is the format self-describing and robust across platforms?
- Is it likely to be readable in 5–10 years?
- Complexity and development effort
- Does using the format require a large library or complex APIs?
- Do you have the expertise and time to integrate it correctly?
Common patterns
- HDF5 / NetCDF-4:
- Default choice for many large-scale structured datasets
- Good blend of performance, portability, and tooling
- ADIOS2:
- For extreme-scale or in-situ workflows where traditional file I/O is too slow
- Domain-specific formats:
- When aligning with community standards is critical (e.g., climate, MD)
- Simple text or binary:
- For prototyping, small-scale outputs, or quick inspection needs
Practical Tips for Working with HPC File Formats
- Use existing libraries and bindings:
- For example, `h5py` for HDF5 in Python, `netCDF4` for NetCDF, and the `adios2` bindings
- Start with simple, well-documented layouts:
- Avoid overcomplicated hierarchies unless justified
- Include clear metadata:
- Always store units, coordinate descriptions, and code version information
- Provide conversion tools:
- Small utilities to convert from your primary format to commonly used ones (e.g., HDF5 → VTK) help collaboration
- Test I/O patterns at scale:
- Small test runs may not expose performance issues that appear when thousands of processes access a file simultaneously
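Tying a few of these tips together, a minimal conversion utility might wrap a raw binary dump into HDF5 and attach basic provenance metadata; all names, shapes, and attribute values below are assumptions for illustration:

```python
import h5py
import numpy as np

def convert_raw_to_hdf5(raw_path, h5_path, shape, dtype="f8",
                        units="K", code_version="unknown"):
    """Wrap a headerless binary dump into a self-describing HDF5 file."""
    data = np.fromfile(raw_path, dtype=dtype).reshape(shape)
    with h5py.File(h5_path, "w") as f:
        dset = f.create_dataset("field", data=data, compression="gzip")
        dset.attrs["units"] = units
        f.attrs["code_version"] = code_version
        f.attrs["source_file"] = raw_path

# Example usage with made-up names and sizes:
# convert_raw_to_hdf5("field.bin", "field.h5", shape=(1000, 1000))
```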
Summary
In HPC, file formats are closely tied to:
- Performance and scalability of I/O
- Portability and interoperability across tools and platforms
- Long-term reproducibility and data reuse
Self-describing, structured formats like HDF5, NetCDF-4, and ADIOS2 form the backbone of many HPC workflows, while domain-specific conventions ensure compatibility within scientific communities. Choosing and using the right formats is a core design decision for any serious HPC application.