Overview of File Formats in HPC
High performance computing workflows produce and consume large, structured datasets. The choice of file format affects performance, portability, ease of analysis, and long term preservation. In HPC the focus is not only on what the data represents but also on how efficiently multiple processes can read, write, and share that data on parallel filesystems. This chapter introduces the most commonly used file formats in HPC, how they are structured, and what makes them suitable for large scale computing.
HPC file formats are typically self describing and portable across architectures, support large multidimensional arrays, and integrate well with parallel I/O libraries. Many formats are not specific to one scientific field, but some domains have converged on de facto standards that wrap or build on the same underlying technologies.
Text Versus Binary Formats in HPC
Human readable text formats such as CSV or plain text logs are convenient for small data and debugging. In HPC workloads their limitations quickly become evident. Text encodings are verbose and require expensive parsing for numbers, which increases CPU time and I/O volume. They also lack explicit metadata for units, coordinate systems, and mesh topology, unless additional conventions are imposed.
Binary formats store data either in the machine's in memory representation or in a portable layout defined by the format specification. This reduces file size, lowers read and write costs, and makes random access to subsets of the data more efficient. However, binary files are not directly readable by humans, so metadata, tools, and libraries are essential.
In large scale HPC workflows, binary, self describing, portable formats are preferred over ad hoc text formats, especially for multidimensional arrays and time series.
Many commonly used HPC file formats follow this pattern. They encode a description of the data types, dimensions, and attributes in a header or metadata structures, and store the actual numerical values in binary sections that can be accessed efficiently by parallel I/O libraries.
HDF5: Hierarchical Data Format
HDF5 is one of the most widely used general purpose file formats in HPC. It is designed for very large datasets and complex data models. HDF5 is not just a file structure; it also defines a rich C library and higher level bindings for languages such as Fortran, C++, and Python.
An HDF5 file stores data in a hierarchical structure similar to a filesystem. Groups act like directories and datasets act like files. Each dataset holds a multidimensional array with an associated datatype and can carry attributes as key value metadata. Groups and datasets themselves can also have arbitrary attributes, which allows users to attach units, provenance information, and semantic annotations.
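As a minimal sketch of this hierarchy, the following Python fragment uses the h5py bindings (the file, group, and dataset names are illustrative, and h5py plus NumPy are assumed to be installed) to create a group, a dataset, and attributes on both:

    import numpy as np
    import h5py

    # Create an HDF5 file containing one group, one dataset, and attributes.
    with h5py.File("simulation.h5", "w") as f:
        run = f.create_group("run_001")              # groups act like directories
        run.attrs["code_version"] = "1.2.3"          # attribute attached to the group
        temp = run.create_dataset("temperature",
                                  data=np.random.rand(128, 128))
        temp.attrs["units"] = "K"                    # attributes attached to the dataset
        temp.attrs["long_name"] = "gas temperature"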
One important feature of HDF5 in HPC is chunking. Instead of storing arrays as a single contiguous block, datasets can be divided into fixed size blocks called chunks. Each chunk can be compressed and accessed independently. This allows efficient reading and writing of subarrays, such as hyperslabs defined by ranges in each dimension, without touching the entire dataset.
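The sketch below, again with h5py and illustrative names and sizes, creates a chunked and compressed dataset and then reads a hyperslab without loading the full array into memory:

    import numpy as np
    import h5py

    with h5py.File("chunked.h5", "w") as f:
        # The array is split into 64 x 64 x 64 chunks; gzip is applied per chunk.
        dset = f.create_dataset("density",
                                shape=(1024, 1024, 1024),
                                dtype="f4",
                                chunks=(64, 64, 64),
                                compression="gzip")
        dset[0:64, 0:64, 0:64] = np.ones((64, 64, 64), dtype="f4")

    with h5py.File("chunked.h5", "r") as f:
        # Hyperslab read: only the chunks overlapping this selection are touched.
        sub = f["density"][0:64, 0:128, 0:64]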
HDF5 is designed to be portable. It defines its own datatype descriptions and uses a conversion layer so that the same file can be read on systems with different endianness or word sizes. This is important for long term archiving and data sharing across diverse architectures.
For parallel I/O, HDF5 provides an MPI I/O based interface often called Parallel HDF5. Multiple MPI processes can participate in collective read and write operations on the same file. Internally, HDF5 maps selections in datasets to MPI I/O requests on the underlying parallel filesystem. This integration makes HDF5 a key building block for scalable analysis and simulation codes.
When using HDF5 in parallel jobs, always build and link against the MPI enabled HDF5 library and use the corresponding parallel I/O routines, not the serial API.
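A minimal sketch of such a collective write, assuming an MPI enabled h5py build together with mpi4py (launched with mpirun or srun; names and sizes are illustrative):

    from mpi4py import MPI
    import numpy as np
    import h5py

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nprocs = comm.Get_size()
    n_local = 1000                      # elements owned by this rank

    # All ranks open the same file through the MPI-IO driver.
    with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
        dset = f.create_dataset("field", shape=(nprocs * n_local,), dtype="f8")
        start = rank * n_local
        with dset.collective:           # request collective MPI-IO for this write
            # Each rank writes its own hyperslab of the shared dataset.
            dset[start:start + n_local] = np.full(n_local, rank, dtype="f8")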
HDF5 has become the foundation for many domain specific conventions, for example in climate modeling, neutron scattering, and microscopy. In those communities, the rules for dataset naming, group layout, and required attributes are defined on top of the generic HDF5 container.
netCDF: Array Oriented Scientific Data
NetCDF, or Network Common Data Form, was created for earth system sciences and has become a standard for gridded scientific data such as climate and weather model outputs, reanalyses, and observational products. The conceptual model of netCDF focuses on named variables defined over one or more named dimensions, with associated attributes that describe units, coordinate reference systems, and other metadata.
Modern netCDF files, often referred to as netCDF 4, are built on top of HDF5. This means that they inherit the scalability, portability, and parallel I/O capabilities of HDF5, but impose a more constrained data model of variables, dimensions, and attributes. While HDF5 allows arbitrary hierarchies, netCDF usually presents a flatter structure intended for array oriented data.
A typical netCDF file will define dimensions such as time, lat, and lon, coordinate variables that give the values along those dimensions, and data variables that use those dimensions, such as temperature(time, lat, lon). Global and variable specific attributes record units, long names, conventions, and provenance.
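As an illustration, the following sketch uses the netCDF4 Python package (file name, sizes, and attribute values are illustrative) to define dimensions, coordinate variables, a data variable, and CF style attributes:

    import numpy as np
    from netCDF4 import Dataset

    with Dataset("climate.nc", "w", format="NETCDF4") as nc:
        nc.createDimension("time", None)             # unlimited dimension
        nc.createDimension("lat", 180)
        nc.createDimension("lon", 360)

        # Coordinate variables carry the same name as their dimension.
        lat = nc.createVariable("lat", "f4", ("lat",))
        lon = nc.createVariable("lon", "f4", ("lon",))
        lat.units = "degrees_north"
        lon.units = "degrees_east"
        lat[:] = np.linspace(-89.5, 89.5, 180)
        lon[:] = np.linspace(-179.5, 179.5, 360)

        # Data variable with CF style metadata.
        temp = nc.createVariable("temperature", "f4", ("time", "lat", "lon"))
        temp.units = "K"
        temp.standard_name = "air_temperature"
        nc.Conventions = "CF-1.8"                    # global attribute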
NetCDF offers a high level API that abstracts away the HDF5 implementation details. Many analysis tools in geosciences, including popular pre- and postprocessing libraries, work directly with netCDF. Parallel netCDF interfaces allow MPI programs to perform collective operations on netCDF variables. This can be implemented either via the netCDF 4 HDF5 based API or via a separate library called PnetCDF that targets the classic netCDF 3 file format.
For grid based scientific data, prefer netCDF with the CF (Climate and Forecast) metadata conventions to maximize interoperability and tool support.
Although strongly associated with atmospheric and ocean models, the netCDF data model is general and can be used in other domains that require labeled multidimensional arrays.
ADIOS and ADIOS2: Adaptable I/O System
ADIOS and its successor ADIOS2 are I/O frameworks and file formats designed specifically for high performance simulations. Their goal is to decouple the description of data from the underlying I/O transport. Instead of directly calling a particular file format API throughout a simulation code, users define variables and groups in ADIOS and then choose one or more transports, such as file based formats, MPI based transports, or streaming to other components.
The ADIOS2 file based engines write data to BP (Binary Packed) files, which can be considered a family of formats optimized for high throughput and scalability. These formats record metadata in a way that is efficient to update incrementally as a simulation progresses. This supports workflows where data is produced in time steps and needs to be accessed by analysis tools while the simulation is still running.
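As a rough sketch only, the fragment below uses the high level Python bindings shipped with some ADIOS2 releases; the exact API has changed between versions, so adios2.open, the write signature, and end_step should be treated as assumptions to verify against the installed release. It appends one variable per simulation step to a BP file:

    import numpy as np
    import adios2

    nsteps = 10
    local = np.zeros(100, dtype=np.float64)

    # shape, start, and count describe the global array and the slice
    # contributed by this writer (here a single, serial writer).
    with adios2.open("output.bp", "w") as fh:
        for step in range(nsteps):
            local[:] = step
            fh.write("temperature", local, [100], [0], [100])
            fh.end_step()                            # close the current time step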
A distinctive feature of ADIOS is support for in situ and in transit analysis. The same ADIOS description can be used to send data directly to an analysis application, a visualization application, or a storage engine, without changing the simulation code. BP files can later be read by tools that understand the ADIOS metadata and variable layout.
ADIOS2 also supports compression and data reduction methods such as wavelet based compression or lossy compressors that are tailored to floating point arrays. Data transformations are expressed at the variable level and applied in the I/O pipeline.
In practice, ADIOS based workflows require both producer side integration in the simulation code and consumer side tools such as Python bindings or C++ readers that know how to interpret BP files. For large simulations that produce many time steps and variables, the BP formats can significantly reduce both I/O overhead and downstream analysis time compared to more generic formats.
Domain Specific Formats in HPC
Beyond general purpose containers, many scientific communities have developed domain specific formats optimized for their data models and tools. These formats often aim to standardize how simulations and experiments are described, including geometry, meshes, boundary conditions, and derived quantities. They typically supply libraries and viewers that integrate with HPC workflows.
In computational fluid dynamics and other mesh based simulations, formats such as CGNS, XDMF, and Exodus are common. CGNS, the CFD General Notation System, defines a standardized representation of grids, flow solutions, and associated metadata. It builds on HDF5 for storage and offers APIs for C and Fortran codes. By adopting CGNS, codes and tools can share meshes and results without custom converters.
XDMF, the eXtensible Data Model and Format, separates metadata written in XML from heavy numerical data usually stored in HDF5. The XML describes the topology, geometry, and attributes of the mesh, while the binary chunks hold the arrays. This separation makes XDMF particularly suitable for visualization tools, which parse the XML quickly and then fetch data from the HDF5 backend as needed.
Exodus, or Exodus II, is used for finite element and finite volume meshes and results, particularly in structural mechanics and multi physics simulations. It is built on netCDF or HDF5, and defines structures for blocks of elements, node sets, side sets, and fields defined on these entities. Many mesh generation and finite element analysis tools support Exodus as an interchange format.
In particle based simulations such as molecular dynamics, formats like DCD, TRR, and XTC are widely used. These trajectory formats focus on time ordered coordinates and sometimes velocities, with varying levels of compression and precision. Although they may not be self describing in the same way as HDF5, the surrounding ecosystems of MD codes and analysis tools understand their layout and can read frames in parallel.
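As one example of reading such trajectories, the MDAnalysis Python package can iterate over frames in DCD, TRR, or XTC files; the file names below are placeholders, and a matching topology file is assumed:

    import MDAnalysis as mda

    # The topology file provides atoms and connectivity; the trajectory supplies
    # coordinates for each time step.
    u = mda.Universe("system.psf", "trajectory.dcd")
    for ts in u.trajectory:                          # time ordered frames
        positions = u.atoms.positions                # (n_atoms, 3) array for this frame
        print(ts.frame, positions.mean(axis=0))      # e.g. the centroid per frame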
In high energy physics and data intensive experiments, ROOT files are a common choice. The ROOT framework provides a custom container format optimized for event based data and columnar storage, where branches can be read independently. ROOT files support compression and random access to subsets of events, which is critical when analyzing petabyte scale datasets on large clusters.
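In Python, the uproot package can read ROOT files without requiring the full ROOT framework; the file, tree, and branch names below are placeholders:

    import uproot

    with uproot.open("events.root") as f:
        tree = f["Events"]                           # a TTree with one entry per event
        # Read only the selected branches (columns), and only the first 10000 entries.
        arrays = tree.arrays(["pt", "eta"], entry_stop=10_000)
        print(arrays["pt"][:5])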
Domain specific formats often incorporate application semantics that general containers do not capture. This brings advantages for specialized tools but can make cross domain interoperability more challenging. When choosing such a format, it is important to consider the available ecosystem of readers, writers, and analysis frameworks on the target HPC systems.
Parallel I/O and File Formats
In a parallel HPC environment, file formats must be compatible with parallel I/O patterns. Not all formats can be efficiently written by many processes at once. Some domain specific formats were originally designed for serial workflows and only later extended to support parallelism, sometimes with constraints.
File formats that are layered on top of MPI I/O, such as HDF5, netCDF 4, and some variants of ADIOS BP, can coordinate collective operations across MPI processes. They map high level selections in datasets or variables to low level noncontiguous file views used by MPI I/O. This enables each process to write its portion of an array directly to the correct location in the file without race conditions or heavy locking.
For formats that lack a native parallel interface, applications may adopt file per process or file per group patterns, where each MPI rank or node writes its own file. While simple, this can create problems on parallel filesystems, because very large numbers of small files stress metadata servers and complicate downstream analysis. Some postprocessing tools then merge these files into a more scalable container format.
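A minimal sketch of the file per process pattern with mpi4py and NumPy (paths are illustrative); each rank writes an independent file, which later has to be merged or indexed by postprocessing tools:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local_data = np.random.rand(1_000_000)           # this rank's portion of the results
    # One file per MPI rank: simple to write, but the file count grows with the job size.
    np.save(f"output_rank{rank:05d}.npy", local_data)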
The structure of a format also influences I/O patterns. Chunked storage can align naturally with domain decompositions, where each process writes one or several chunks that correspond to its subdomain. Contiguous storage is simpler but may require collective buffering strategies where a subset of processes gathers data and performs larger writes on behalf of the group.
When using compressed formats, it is important to consider that compression blocks are typically independent units. If compression blocks align poorly with how data is partitioned among processes, I/O can involve extra reading and decompression of irrelevant blocks. Many HPC oriented formats therefore provide controls for chunk and block sizes that can be tuned for the target decomposition.
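One common tuning, sketched below with h5py and illustrative sizes, is to set the chunk shape of a dataset equal to the subdomain owned by each process, so that every rank touches whole chunks rather than fragments of many chunks:

    import h5py

    # Global 4096 x 4096 field decomposed into 256 x 256 subdomains.
    subdomain = (256, 256)
    with h5py.File("decomposed.h5", "w") as f:
        f.create_dataset("field",
                         shape=(4096, 4096),
                         dtype="f8",
                         chunks=subdomain,           # one chunk per subdomain
                         compression="gzip")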
For scalable performance, choose file formats that support MPI aware parallel I/O and configure their chunking and layout to match the parallel decomposition of your data.
The interaction between file formats, parallel I/O libraries, and the parallel filesystem is a central concern in practical HPC applications. File format design has a direct impact on achievable bandwidth and on the scalability of analysis codes.
Portability, Metadata, and Interoperability
A key requirement for HPC file formats is portability across architectures and software environments. This includes both low level numerical representation and high level semantic interpretation. Portable formats define endianness, alignment, and type sizes explicitly, so that a file written on one architecture can be read correctly on another.
Self describing formats store metadata about types, dimensions, and structure within the file itself. When combined with higher level conventions, such as CF for netCDF or community specific schemas for HDF5 and CGNS, this metadata allows tools to interpret variables without hardcoded assumptions. Units, coordinate references, and transformation rules can be discovered programmatically.
Versioning is another aspect of portability. Many formats include markers for specification versions and sometimes allow for extensions through attributes or additional groups. This allows new features to be introduced without breaking older readers, provided that backward compatibility rules are respected.
Interoperability depends not only on the file format but also on the availability of libraries and bindings on HPC systems. Well established formats such as HDF5 and netCDF are almost always present, with modules or preinstalled libraries. Domain specific formats often come bundled with their own software stacks. When designing new workflows, it is important to align format choices with what is available and supported on the target clusters.
In multi stage workflows that combine simulation, analysis, visualization, and machine learning, it is common to encounter several formats. For example, a simulation might write HDF5 or BP files, which are then downsampled into netCDF or XDMF for visualization, or converted into simple binary matrices for training. Each conversion step carries cost and possible loss of information, so a careful selection of core formats can reduce complexity.
Choosing File Formats in HPC Workflows
The selection of a file format in HPC depends on the application domain, the size and structure of the data, parallelism requirements, and the ecosystem of tools. There is rarely a single universally optimal format, but some patterns recur.
For generic multidimensional numerical data with strong requirements on portability and parallel I/O, HDF5 and netCDF 4 are common choices. For geoscience and grid based models, netCDF with established metadata conventions aligns well with analysis and visualization ecosystems. For highly time dependent simulations with in situ analysis requirements, ADIOS2 and its BP formats offer flexibility and performance.
Domain specific formats such as CGNS, XDMF, Exodus, ROOT, or MD trajectory formats are appropriate when they match the dominant tools in a field and when their libraries are available on the target HPC systems. In practice, many workflows combine a simulation oriented format with one or more lighter formats for sharing results and figures, including extracted ASCII tables, images, and small JSON or YAML metadata files for documentation.
The interaction with parallel filesystems, discussed elsewhere, constrains these choices. File formats that integrate well with MPI I/O and provide control over data layout are better suited to large scale jobs and high concurrency levels. Taking these factors into account early in application design helps avoid costly redesigns when simulations reach larger scales.
Ultimately, file formats in HPC are part of a broader strategy for data management, parallel I/O, and reproducible workflows. Aligning format choice with community practices, tool support, and performance characteristics is essential for effective use of high performance computing resources.