Why File Formats Matter in HPC
In HPC, file formats are not just about how data is stored; they strongly influence:
- I/O performance (throughput, metadata overhead, parallel access)
- Portability between systems and architectures
- Long-term usability and reproducibility
- Interoperability between tools, libraries, and workflows
Most HPC data formats are designed to support:
- Very large datasets (gigabytes to petabytes)
- Parallel access (many processes reading/writing simultaneously)
- Rich metadata (units, dimensions, provenance)
- Portability across CPU architectures and operating systems
This chapter focuses on commonly used file formats and container formats in HPC, and how they relate to parallel I/O and large-scale workflows.
Broad Categories of File Formats
HPC data can be organized in several broad categories:
- Text-based formats
- Human-readable, easy to inspect and debug
- Generally poor space efficiency and I/O performance
- Binary formats
- Compact and fast for reading/writing
- Not directly human-readable; need tools or libraries
- Self-describing, structured formats
- Store both data and rich metadata (dimensions, types, attributes)
- Support complex structures like multi-dimensional arrays, groups, or tables
- Domain-specific formats
- Tailored to particular scientific domains (e.g., climate, molecular dynamics)
- Often built on top of general-purpose containers like HDF5 or NetCDF
Text-Based Formats in HPC
Simple text and CSV
Plain text (.txt, .dat) and comma-separated values (.csv) files are occasionally used in HPC:
- For configuration files, small tables, or logs
- For intermediate diagnostics and debugging
- For small-scale or early-stage development
Characteristics:
- Pros:
- Easy to view with `less`, `cat`, `head`, `tail`, or a text editor
- Simple to generate from almost any language
- Easy to version-control and diff
- Cons:
- Large overhead (each number stored as characters)
- Parsing overhead is high (text must be converted to numeric types)
- No inherent metadata structure
- Parallel writing is difficult to coordinate efficiently
Best kept for:
- Small datasets
- Human-in-the-loop inspection
- Non-performance-critical paths of a workflow
JSON, YAML, XML
More structured text formats (JSON, YAML, XML) are sometimes used:
- For metadata, configuration, or workflow descriptions
- To describe simulation setups or parameter files
They are rarely used to store large numerical arrays in HPC due to size and parsing costs.
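As a small illustration, a simulation parameter file can be written as JSON directly from Python; the parameter names and file name below are purely illustrative:

```python
import json

# Hypothetical simulation parameters; names and values are illustrative only
params = {
    "grid": {"nx": 512, "ny": 512, "nz": 256},
    "time": {"dt": 1.0e-3, "n_steps": 10000},
    "output": {"format": "hdf5", "every_n_steps": 100},
}

# Human-readable, diff-friendly, and easy to version-control
with open("run_params.json", "w") as f:
    json.dump(params, f, indent=2)

# Reading it back is equally simple
with open("run_params.json") as f:
    params_in = json.load(f)
print(params_in["grid"]["nx"])
```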
Binary Formats and Raw Data
Raw binary dumps
A simple approach is to dump arrays directly in binary:
- Example: writing a `double` array `A` of length `N` with C or Fortran I/O
- Often no header, or a small custom header
Characteristics:
- Pros:
- Very space-efficient and fast
- Simple to implement
- Cons:
- Not self-describing (you must know dimension, type, endianness)
- Endianness and alignment can cause portability issues
- No standard way to store metadata alongside the data
- Hard to evolve the format without versioning headaches
Common uses:
- Intermediate checkpoint files within a tightly controlled codebase
- Short-term, single-code, single-platform workflows
When using raw binary, teams often maintain:
- A separate specification for layout (documented in code or docs)
- Small helper tools to convert binary dumps to more portable formats
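A minimal sketch of the raw-binary pattern using NumPy; the file name, shape, and dtype are assumptions that must be communicated out of band (in code or documentation), since nothing in the file records them:

```python
import numpy as np

# Write a raw binary dump: no header, no metadata, just the bytes of the array
A = np.random.rand(1000, 1000)          # double-precision (float64) array
A.tofile("field.bin")

# Reading it back requires knowing dtype, shape, and byte order in advance;
# the file itself carries no self-describing information
B = np.fromfile("field.bin", dtype=np.float64).reshape(1000, 1000)
assert np.array_equal(A, B)
```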
Self-Describing, Structured Formats
Self-describing formats store both data and metadata in a structured, portable way, usually via a library. These are core to HPC data workflows.
HDF5 (Hierarchical Data Format version 5)
HDF5 is one of the most widely used general-purpose data container formats in HPC.
Key concepts
- Data is stored in a hierarchical structure:
- Groups (like directories)
- Datasets (multi-dimensional arrays, like files)
- Attributes (metadata attached to groups or datasets)
- Supports many data types: integers, floating point, strings, compound types
- Supports chunking, compression, and partial I/O (read/write subsets)
The logical structure resembles:
- `/` (root group)
- `/simulation`
- `/simulation/parameters`
- `/simulation/fields/temperature`
- `/simulation/fields/velocity`
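As an illustration, a layout like the one above could be created with the Python `h5py` bindings; the group and dataset names simply mirror the sketch, and the array sizes and attribute values are made up:

```python
import h5py
import numpy as np

temperature = np.random.rand(128, 128, 64)     # illustrative field data
velocity = np.random.rand(128, 128, 64, 3)

with h5py.File("output.h5", "w") as f:
    sim = f.create_group("simulation")
    sim.attrs["code_version"] = "1.2.3"         # attributes hold metadata

    params = sim.create_group("parameters")
    params.attrs["dt"] = 1.0e-3

    fields = sim.create_group("fields")
    # Chunking and compression are configured per dataset
    fields.create_dataset("temperature", data=temperature,
                          chunks=(32, 32, 32), compression="gzip")
    fields.create_dataset("velocity", data=velocity)

# Partial I/O: read only a subset of a dataset
with h5py.File("output.h5", "r") as f:
    slab = f["/simulation/fields/temperature"][0:16, :, :]
```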
Why HDF5 is popular in HPC
- Parallel I/O support:
- Parallel HDF5 built on top of MPI-IO
- Enables many MPI processes to read and write a single shared file (with constraints)
- Portability:
- Transparent handling of endianness and platform differences
- Rich metadata:
- Attributes allow documenting units, axes, provenance, solver settings
- Ecosystem:
- Bindings in C, C++, Fortran, Python, and others
- Analysis and visualization tools (e.g., HDFView, h5py-based scripts, ParaView-compatible workflows)
Typical use cases
- Large simulation outputs (multi-dimensional fields, time series)
- Checkpoints that need to be portable and analyzable
- Data interchange between codes written in different languages
Practical considerations
- File layout (dataset chunking, alignment) can significantly affect parallel I/O performance
- Parallel HDF5 often requires collective I/O patterns and careful access planning
- Some HPC centers provide optimized HDF5 builds tuned for their parallel filesystems
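The sketch below illustrates the shared-file parallel pattern mentioned above, assuming `h5py` and HDF5 were built with MPI support and the script is launched under `mpirun`/`srun`; the dataset name and sizes are illustrative:

```python
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n_local = 1024                      # rows owned by each rank (illustrative)
local = np.full((n_local,), rank, dtype=np.float64)

# All ranks open the same file via the MPI-IO driver
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    # Dataset creation is a collective operation: every rank must call it
    dset = f.create_dataset("data", shape=(nprocs * n_local,), dtype="f8")
    # Each rank writes its own contiguous slice of the shared dataset
    # (collective writes can be requested with `with dset.collective:`)
    dset[rank * n_local:(rank + 1) * n_local] = local
```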
NetCDF (Network Common Data Form)
NetCDF was originally designed for array-oriented scientific data, especially in Earth system sciences (climate, weather, oceanography). Modern NetCDF (NetCDF-4) is often built on top of HDF5.
Key characteristics
- Focus on multi-dimensional variables with named dimensions and attributes
- Designed with self-describing data:
- Dimensions (e.g., time, latitude, longitude, depth)
- Variables (e.g., temperature, pressure)
- Attributes (units, missing values, metadata)
- Provides a simpler conceptual model than full HDF5 for many scientific use cases
Versions and formats
- NetCDF classic / 64-bit offset:
- Older, non-HDF5 formats with more limited features and file-size constraints
- NetCDF-4:
- Uses HDF5 as the underlying storage
- Inherits many HDF5 benefits (compression, chunking, parallel I/O)
Use in HPC
- Widely adopted in:
- Climate models
- Weather prediction systems
- Ocean and atmospheric sciences
- Supported by many analysis and visualization tools:
`nco`, `cdo`, Panoply, Python's `netCDF4` and `xarray`
Parallel I/O support in NetCDF-4 is typically realized via HDF5 and MPI-IO.
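A small sketch using the `netCDF4` Python library; the dimension and variable names follow common climate conventions but are otherwise illustrative:

```python
from netCDF4 import Dataset
import numpy as np

with Dataset("example.nc", "w", format="NETCDF4") as ds:
    # Named dimensions
    ds.createDimension("time", None)         # unlimited dimension
    ds.createDimension("lat", 180)
    ds.createDimension("lon", 360)

    # A variable defined over those dimensions, with compression enabled
    temp = ds.createVariable("temperature", "f4", ("time", "lat", "lon"),
                             zlib=True)
    temp.units = "K"                          # attributes carry metadata
    temp.long_name = "surface air temperature"

    temp[0, :, :] = 280.0 + 10.0 * np.random.rand(180, 360)
```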
ADIOS and ADIOS2
ADIOS (Adaptable IO System) and its successor ADIOS2 are designed specifically for HPC I/O scalability.
Goals and design
- High performance and scalability on large systems
- Emphasis on streaming, in-situ, and staging I/O
- Abstract the low-level details of parallel I/O from applications
Features
- Multiple engines (backends) for different I/O modes:
- File-based engines (e.g., BP format)
- In-memory streaming between applications or nodes
- Engines targeting different parallel filesystems or transports
- Complex data types:
- Multi-dimensional arrays
- Attributes
- APIs in C, C++, Fortran, Python
Use cases
- Exascale or near-exascale simulations where traditional file-based I/O becomes a bottleneck
- Workflows with in-situ analysis or visualization to reduce filesystem pressure
- Coupled multiphysics simulations exchanging data via ADIOS engines
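As a rough sketch of the write path described above: the exact Python API differs between ADIOS2 releases, and the calls below follow the classic bindings-style interface; the variable name, shape, and use of the default BP file engine are assumptions:

```python
import numpy as np
import adios2   # API details vary across ADIOS2 versions; this is a sketch

data = np.arange(100, dtype=np.float64)   # illustrative local array

adios = adios2.ADIOS()
io = adios.DeclareIO("writer")
# Global shape, local start, and local count describe the decomposition
var = io.DefineVariable("T", data, [100], [0], [100], adios2.ConstantDims)

engine = io.Open("output.bp", adios2.Mode.Write)   # BP file engine by default
engine.BeginStep()
engine.Put(var, data)
engine.EndStep()
engine.Close()
```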
Domain-Specific and Community Formats
Many scientific domains have adopted standardized formats or conventions, often built on top of the general-purpose containers described above.
Climate and Earth system science formats
- Typically use NetCDF or HDF5 as the underlying container
- Common conventions:
- CF (Climate and Forecast) conventions for metadata
- Benefits:
- Tools understand dimension names like `time`, `lat`, `lon`
- Strong interoperability across models and post-processing tools
- Files often have extensions like `.nc` or `.nc4`
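Because dimensions and coordinates are named, CF-style files can be explored conveniently from Python with `xarray`; the file name, variable name, and dimensions here are hypothetical:

```python
import xarray as xr

# Open a (hypothetical) CF-compliant NetCDF file lazily
ds = xr.open_dataset("ocean_model_output.nc")

print(ds.dims)                    # e.g. time, lat, lon, depth
print(ds["temperature"].attrs)    # units, long_name, etc.

# Label-based selection works because dimensions and coordinates are named
surface = ds["temperature"].isel(depth=0).mean(dim="time")
```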
Visualization and mesh formats (VTK, XDMF, etc.)
Numerical simulations often output meshes and fields for visualization:
VTK and ParaView-related formats
- Legacy VTK (`.vtk`):
- Both ASCII and binary variants
- Simpler but older, limited scalability
- VTK XML formats (e.g., `.vtu` for unstructured grids, `.vtp` for polydata):
- XML-based, often with separate binary data sections
- Used with tools like ParaView and VisIt
Advantages:
- Rich representation of meshes, topology, and fields
- Widely supported in visualization tools
Limitations:
- XML overhead
- Not primarily designed for massive parallel I/O, though parallel collections (e.g., `.pvtu`) exist
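As one possible route for producing such files (using the third-party `meshio` library, which is not mentioned above but is widely used), a tiny unstructured mesh can be written to `.vtu` from Python:

```python
import meshio
import numpy as np

# A tiny illustrative mesh: 4 points, 2 triangles
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])
cells = [("triangle", np.array([[0, 1, 2], [0, 2, 3]]))]

mesh = meshio.Mesh(points, cells,
                   point_data={"temperature": np.array([1.0, 2.0, 3.0, 4.0])})
meshio.write("mesh.vtu", mesh)   # readable in ParaView / VisIt
```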
XDMF (eXtensible Data Model and Format)
- Uses XML for metadata + HDF5 for heavy numeric data
- The XML file describes:
- Mesh topology and geometry
- Associations between HDF5 datasets and fields
- HDF5 stores the actual large arrays
This separation keeps:
- Human-readable structure and metadata
- High-performance, parallel-friendly storage in HDF5
Molecular dynamics (MD) formats
Common MD codes (e.g., GROMACS, LAMMPS, NAMD) use their own set of formats:
- Trajectory formats:
- E.g. `.dcd`, `.xtc`, `.trr`, `.lammpstrj`
- Topology and parameter files
Characteristics:
- Optimized for time-dependent particle data (positions, velocities, forces)
- Often have binary layouts tailored to specific MD packages
- Analysis tools and libraries for these formats are widespread in MD communities
From an HPC standpoint:
- Parallel I/O strategies and capabilities differ among MD codes
- Sometimes, trajectories are later converted to more generic containers (e.g., HDF5-based formats) for long-term storage or cross-tool use
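A hedged sketch of such a conversion, using the third-party MDAnalysis library (one of several options, not prescribed by the text); the topology and trajectory file names are placeholders:

```python
import MDAnalysis as mda
import h5py

# Load a (hypothetical) topology + trajectory pair, e.g. from GROMACS
u = mda.Universe("topol.tpr", "traj.xtc")

with h5py.File("trajectory.h5", "w") as f:
    n_frames = len(u.trajectory)
    n_atoms = len(u.atoms)
    pos = f.create_dataset("positions", shape=(n_frames, n_atoms, 3),
                           dtype="f4", compression="gzip")
    pos.attrs["units"] = "Angstrom"
    for i, ts in enumerate(u.trajectory):
        pos[i] = ts.positions          # copy each frame into the HDF5 dataset
```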
Other domain examples
- Astronomy / astrophysics:
- FITS (Flexible Image Transport System) for images and cubes
- HDF5-based formats for simulation data
- Computational fluid dynamics (CFD):
- Proprietary formats of commercial solvers
- Open-source formats (CGNS, often based on HDF5)
Choosing File Formats in HPC Projects
Selecting a file format in an HPC project involves multiple trade-offs.
Key criteria
- Performance and parallel I/O
- Does the format support MPI-IO or other parallel mechanisms?
- Are there known scalability limits on your target filesystem?
- Interoperability
- Do your analysis/visualization tools support the format?
- Is it standard in your research community?
- Metadata richness
- Can you store necessary metadata (units, coordinate systems, provenance)?
- Are there community conventions (e.g., CF) you should follow?
- Portability and longevity
- Is the format self-describing and robust across platforms?
- Is it likely to be readable in 5–10 years?
- Complexity and development effort
- Does using the format require a large library or complex APIs?
- Do you have the expertise and time to integrate it correctly?
Common patterns
- HDF5 / NetCDF-4:
- Default choice for many large-scale structured datasets
- Good blend of performance, portability, and tooling
- ADIOS2:
- For extreme-scale or in-situ workflows where traditional file I/O is too slow
- Domain-specific formats:
- When aligning with community standards is critical (e.g., climate, MD)
- Simple text or binary:
- For prototyping, small-scale outputs, or quick inspection needs
Practical Tips for Working with HPC File Formats
- Use existing libraries and bindings:
- For example, `h5py` for HDF5 in Python, `netCDF4` for NetCDF, and the `adios2` bindings
- Start with simple, well-documented layouts:
- Avoid overcomplicated hierarchies unless justified
- Include clear metadata:
- Always store units, coordinate descriptions, and code version information
- Provide conversion tools:
- Small utilities to convert from your primary format to commonly used ones (e.g., HDF5 → VTK) help collaboration
- Test I/O patterns at scale:
- Small test runs may not expose performance issues that appear when thousands of processes access a file simultaneously
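Tying a few of these tips together, a minimal conversion utility might wrap a raw binary dump into HDF5 and attach basic provenance metadata; all names, shapes, and attribute values below are assumptions for illustration:

```python
import h5py
import numpy as np

def convert_raw_to_hdf5(raw_path, h5_path, shape, dtype="f8",
                        units="K", code_version="unknown"):
    """Wrap a headerless binary dump into a self-describing HDF5 file."""
    data = np.fromfile(raw_path, dtype=dtype).reshape(shape)
    with h5py.File(h5_path, "w") as f:
        dset = f.create_dataset("field", data=data, compression="gzip")
        dset.attrs["units"] = units
        f.attrs["code_version"] = code_version
        f.attrs["source_file"] = raw_path

# Example usage with made-up names and sizes:
# convert_raw_to_hdf5("field.bin", "field.h5", shape=(1000, 1000))
```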
Summary
In HPC, file formats are closely tied to:
- Performance and scalability of I/O
- Portability and interoperability across tools and platforms
- Long-term reproducibility and data reuse
Self-describing, structured formats like HDF5, NetCDF-4, and ADIOS2 form the backbone of many HPC workflows, while domain-specific conventions ensure compatibility within scientific communities. Choosing and using the right formats is a core design decision for any serious HPC application.