Core challenges of large datasets in HPC
Large datasets in HPC are not just “big files.” They often have:
- High volume: TB–PB scale, often spread across many files and directories.
- High velocity: data produced or consumed quickly (e.g., simulations writing GB/s).
- High concurrency: many processes reading/writing the same dataset.
- Complex structure: multiple variables, timesteps, and domains.
Managing them means balancing:
- Storage space
- I/O performance
- Metadata and organization
- Data movement costs
- Long-term usability and reproducibility
The strategies below focus on practical techniques rather than low-level I/O details.
Organizing large datasets
A clear, consistent structure makes data usable and less error-prone.
Directory and naming conventions
Define conventions early and stick to them. Aim for:
- Hierarchical structure that matches the project logic, e.g.:

  /project_name/
      raw/
          experiment_001/
              run_0001/
              run_0002/
          experiment_002/
      processed/
          experiment_001/
              run_0001/
              run_0002/
      analysis/
          figures/
          reports/

- File and directory names that encode key information:
- Simulation code name
- Version or git commit hash
- Parameter set or case ID
- Date/time, run index, or restart index
- Resolution, domain, or ensemble member
Examples:
- `simA_v3_res512_case07_t000100.h5`
- `climate_ens05_1950-01-01_to_2000-12-31.nc`
Avoid:
- Ambiguous names: `data1`, `new`, `test`, `final`, `final2`.
- Overly long names that exceed filesystem limits.
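Encoding this information programmatically keeps names consistent across runs. A minimal Python sketch (the field names and zero-padding widths are illustrative, chosen to match the `simA_...` example above):

```python
# Hypothetical helper that builds file names following the convention above:
#   <code>_<version>_res<resolution>_case<case>_t<timestep>.h5
def output_filename(code: str, version: str, resolution: int,
                    case_id: int, timestep: int) -> str:
    return f"{code}_{version}_res{resolution}_case{case_id:02d}_t{timestep:06d}.h5"

print(output_filename("simA", "v3", 512, 7, 100))
# -> simA_v3_res512_case07_t000100.h5
```

Generating names from one helper, rather than typing them by hand, avoids most naming drift within a project.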
Separating raw, intermediate, and final data
Differentiate:
- Raw data: As generated by instruments or initial simulations.
- Intermediate data: Checkpoints, temporary formats, pre-processed outputs.
- Final data products: What you analyze, publish, or share.
Benefits:
- Easier cleanup of large intermediates.
- Clear trace from raw to final (important for reproducibility).
- Safer: you can delete intermediates while preserving raw/final.
Planning storage usage
Clusters often have multiple storage tiers (e.g., scratch, project, archive). Large datasets require conscious planning.
Storage tiers and appropriate use
Typical tiers:
- Scratch / temporary: Fast, limited quota, auto-purged. Use for:
- Intermediate outputs
- Temporary checkpoints
- Short-lived analysis files
- Project / work: Larger, not auto-purged, moderate performance. Use for:
- Long-running project data
- Reusable processed data
- Archive / tape: Very large, slow access. Use for:
- Completed projects
- Legacy datasets
- Regulatory retention
Design a lifecycle:
- Write heavy I/O to scratch.
- Consolidate and compress results.
- Move curated results to project space.
- Archive rarely used but valuable data.
Quotas and capacity planning
Before running large jobs:
- Check available quotas and limits (per-user, per-project, per-filesystem).
- Estimate storage needs:
- Per timestep / per snapshot size
- Number of snapshots
- Number of ensemble members
Example:
- One snapshot: 50 GB
- 200 timesteps per run → 10 TB
- 10 ensemble members → 100 TB total
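Estimates like this are worth scripting so they can be redone whenever the output frequency or ensemble size changes. A minimal sketch using the numbers above:

```python
# Back-of-the-envelope storage estimate (values from the example above)
snapshot_gb = 50          # size of one snapshot
timesteps_per_run = 200   # snapshots kept per run
ensemble_members = 10     # independent runs in the ensemble

per_run_tb = snapshot_gb * timesteps_per_run / 1000   # ~10 TB per run
total_tb = per_run_tb * ensemble_members               # ~100 TB in total
print(f"Per run: {per_run_tb:.0f} TB, total: {total_tb:.0f} TB")
```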
If that doesn’t fit your quota or filesystem capacity:
- Reduce output frequency.
- Save fewer variables.
- Use on-the-fly compression.
- Adopt in-situ or in-transit analysis (see below).
- Coordinate with HPC support for special allocations.
Reducing data volume
Saving less data—intelligently—is often the biggest win.
Selecting what to store
Ask for each variable:
- Is this needed for the final scientific conclusions?
- Can it be recomputed from other saved variables?
- Is this just for debugging / sanity checks?
Strategies:
- Save derived quantities instead of all raw fields if recomputation is expensive.
- Drop intermediate diagnostics once a simulation is validated.
- Use regions of interest: store full data at a few key times; store coarser data elsewhere.
Temporal and spatial thinning
You rarely need to save every timestep and every grid point.
- Temporal thinning:
- Save every $N$th timestep, or only near important events.
- Spatial thinning:
- Save coarser-resolution versions (e.g., sub-sampling or averaging).
- Extract slices, lines, or small 3D subsets for rapid inspection.
Good practice:
- Combine high-resolution short windows with low-resolution long windows:
- High-res for detailed events.
- Low-res for global trends.
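A minimal NumPy sketch of both ideas, assuming a cubic 3D field written once per timestep (array sizes, the stride, and the coarsening factor are illustrative):

```python
import numpy as np

def write_this_step(timestep: int, stride: int = 10) -> bool:
    """Temporal thinning: keep only every `stride`-th timestep."""
    return timestep % stride == 0

def coarsen(field: np.ndarray, factor: int = 2) -> np.ndarray:
    """Spatial thinning: block-average a 3D field by `factor` along each axis."""
    nz, ny, nx = field.shape  # each dimension must be divisible by `factor`
    return field.reshape(nz // factor, factor,
                         ny // factor, factor,
                         nx // factor, factor).mean(axis=(1, 3, 5))

field = np.random.rand(128, 128, 128)
for t in range(100):
    if write_this_step(t, stride=10):
        reduced = coarsen(field, factor=2)   # 8x fewer values per snapshot
        # ... write `reduced` (or both, for selected high-resolution windows) ...
```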
Compression strategies
Use format-level compression when supported (e.g., HDF5, NetCDF, Zarr, ADIOS).
Types:
- Lossless compression:
- Preserves exact values.
- Examples: gzip, zstd, LZ4.
- Often 1.5–5× reduction, depends on data.
- Lossy, error-bounded compression:
- Allows a controlled amount of error within user-specified relative or absolute error bounds.
- Examples (conceptually): SZ, ZFP.
- Can achieve 10×+ reduction for smooth fields.
Choose based on:
- Downstream analysis sensitivity (can you tolerate small errors?).
- Compute vs I/O trade-off: compression uses CPU/GPU time but reduces disk traffic and storage.
Implement at scale:
- Use parallel-aware compression filters or libraries.
- Integrate compression into the same I/O library you already use (not as a separate “post-processing” step if performance is critical).
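As a concrete illustration of format-level lossless compression, the sketch below writes a chunked, gzip-compressed dataset with h5py (the file name, dataset name, and chunk shape are illustrative; zstd/LZ4 and error-bounded compressors such as SZ or ZFP are usually provided as additional HDF5 filter plugins):

```python
import numpy as np
import h5py

field = np.random.rand(256, 256, 256)   # example 3D field

with h5py.File("snapshot_t000100.h5", "w") as f:
    # Chunking is required for compression; pick chunk shapes that roughly
    # match the typical read pattern (e.g., 2D slices or small 3D blocks).
    f.create_dataset(
        "temperature",
        data=field,
        chunks=(64, 64, 64),
        compression="gzip",      # lossless, built into HDF5
        compression_opts=4,      # moderate compression level
        shuffle=True,            # byte-shuffle filter often improves ratios
    )
```

Random data compresses poorly, so ratios on real, smooth fields will look very different; always measure on representative output.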
Data lifecycle and housekeeping
Without regular cleanup, large datasets quickly overwhelm storage.
Data lifecycle policies
Define a simple policy per project, such as:
- Raw data: keep for project duration + X years.
- Intermediates: keep until results are validated + backup made.
- Final products: keep indefinitely (or per institutional policy).
- Checkpoints: keep latest N per run, delete older ones.
Automate where possible:
- Use scripts to:
- Find old intermediate directories and remove them.
- Keep only checkpoints matching a pattern (e.g., every 10th).
- Summarize disk usage per project/run.
Example cleanup script snippet pattern:
# Delete temporary HDF5 files older than 14 days
find /project/myproj -type f -name "tmp_*.h5" -mtime +14 -delete
# Summarize disk usage per top-level directory
du -sh /project/myproj/* > usage_report.txt
Always test with `-print` instead of `-delete` first.
Versioning and immutable data
For large datasets:
- Avoid “editing in place” when changes are substantial.
- Prefer:
- New version directories (`v1`, `v2`), or
- Version tags in filenames (`_v001`, `_v002`).
If using a data format with internal metadata, track:
- Creation date
- Code version and commit
- Input parameter checksum or hash
- Simulation run ID
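A parameter checksum is straightforward to compute and record; a minimal sketch using Python's standard hashlib (the parameter file name is illustrative):

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_bytes: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_bytes):
            digest.update(chunk)
    return digest.hexdigest()

# Store the result in the file's metadata attributes, the README, or the catalog.
print(file_sha256(Path("input_parameters.nml")))
```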
Immutability (don’t modify past versions) helps:
- Reproducibility
- Cross-checking analyses
- Avoiding silent corruption
Dataset metadata and documentation
Large datasets without good metadata are nearly useless, even if well-compressed.
Descriptive metadata
At minimum, record:
- What: variable names, units, dimensions, coordinate systems.
- How: code name/version, compiler, major libraries, precision (single/double).
- Why: simulation purpose, key physical parameters, or experiment design.
- When/where: generation date, cluster, or project name.
Use:
- Metadata attributes in file formats (e.g., NetCDF/HDF attributes).
- Separate human-readable `README` files alongside data:

  /project/simA_v3_res512_case07/
      data/
      README.md
README.md example content:
- Brief description of the dataset
- Grid resolution and domain
- Variables stored
- Simulation configuration file name and location
- Instructions to reproduce or re-run
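Descriptive metadata can also be embedded directly in the data files; a minimal h5py sketch (attribute names and values are illustrative, not a fixed standard, and the file is assumed to already contain a `temperature` dataset):

```python
import h5py

with h5py.File("simA_v3_res512_case07_t000100.h5", "a") as f:
    # File-level provenance
    f.attrs["code"] = "simA"
    f.attrs["code_version"] = "v3, commit abc1234"   # illustrative values
    f.attrs["created"] = "2024-05-01T12:00:00Z"
    f.attrs["precision"] = "double"

    # Per-variable descriptive attributes
    dset = f["temperature"]                          # assumed to exist
    dset.attrs["units"] = "K"
    dset.attrs["long_name"] = "fluid temperature"
    dset.attrs["grid_resolution"] = (512, 512, 512)
```

For NetCDF files, established attribute conventions (e.g., the CF conventions) play the same role and are worth following where they apply.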
Machine-readable catalogs
For very large projects:
- Maintain a table (CSV, JSON, or small database) listing:
- Dataset ID
- Path
- Size
- Time coverage, parameter values
- Version
- Optionally integrate:
- Checksums for integrity checking (e.g., MD5, SHA256).
- Tags (e.g., “for paper X”, “training data”, “validation set”).
This makes it possible to write scripts that:
- Search for relevant datasets.
- Verify integrity.
- Generate usage reports.
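A minimal sketch of such a catalog builder (the directory layout, file pattern, and CSV columns are assumptions for illustration):

```python
import csv
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Chunked SHA-256 for integrity checking of large files."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(1 << 20):
            digest.update(chunk)
    return digest.hexdigest()

root = Path("/project/myproj/processed")               # assumed dataset root
with open("dataset_catalog.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["dataset_id", "path", "size_bytes", "sha256"])
    for path in sorted(root.rglob("*.h5")):
        writer.writerow([path.stem, str(path), path.stat().st_size, sha256(path)])
```

Checksumming every file takes time on large datasets, so it is often done once at creation and re-checked only when data is moved or archived.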
Data movement and locality
Moving large datasets can be expensive or infeasible. Plan to minimize data movement.
Minimizing data transfers
Common pitfalls:
- Copying the same multi-TB dataset to personal directories.
- Repeatedly moving data between project and scratch spaces.
- Downloading large files to local machines over slow networks.
Mitigation:
- Work where the data lives:
- Run analysis jobs on the filesystem that holds the data.
- Share centrally:
- Designate shared datasets with proper permissions.
- Use group/project spaces instead of per-user duplicates.
- Use symbolic links:
- Reference existing data rather than copying when allowed.
- Coordinate with collaborators:
- Agree on a single “authoritative” location and naming scheme.
In-situ and in-transit analysis (conceptual)
Instead of writing everything to disk then analyzing:
- In-situ analysis:
- Perform some analysis or reduction directly in the simulation process.
- Save only reduced data (e.g., statistics, extracted features).
- In-transit analysis:
- Stream data to dedicated analysis processes or nodes while simulation runs.
- Reduces I/O burden on the main filesystem.
Use cases:
- Very high-frequency outputs (e.g., physics of transients).
- Situations where writing full-resolution 3D fields at every timestep is impossible.
You might:
- Compute time-averages, PDFs, or spectra on-the-fly.
- Extract and track only certain features (e.g., vortices, particles).
- Save smaller, analysis-ready datasets instead of raw fields.
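A conceptual sketch of in-situ reduction: accumulate a running time-average inside the simulation loop and write only the reduced result (`advance` is a stand-in for the real solver step):

```python
import numpy as np

def advance(field: np.ndarray) -> np.ndarray:
    """Stand-in for one simulation timestep (illustrative only)."""
    return field + 0.01 * np.random.randn(*field.shape)

field = np.zeros((128, 128, 128))
time_average = np.zeros_like(field)
n_steps = 1000

for step in range(n_steps):
    field = advance(field)
    # In-situ reduction: update the running mean instead of writing `field`
    time_average += (field - time_average) / (step + 1)

# Only the reduced, analysis-ready result is written to disk
np.save("time_average.npy", time_average)
```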
Checkpointing and restart data management
Checkpointing and restart are often the largest contributors to data volume for long simulations.
Checkpoint retention policies
Avoid keeping every checkpoint forever:
- Store:
- Frequent checkpoints during early, unstable phases.
- Less frequent checkpoints later.
- Keep:
- The last few checkpoints (e.g., last 2–3) for restart safety.
- A few “milestone” checkpoints (e.g., pre-event, post-event).
Regularly prune old checkpoints:
- Keep only those that are truly needed for:
- Continuation
- Critical analysis
- Validation or debugging
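A retention policy like this can be scripted; the sketch below keeps the latest few checkpoints plus every Mth one as a milestone and deletes the rest, defaulting to a dry run (the `chk_*.h5` naming pattern is an assumption):

```python
from pathlib import Path

def prune_checkpoints(run_dir: Path, keep_latest: int = 3,
                      milestone_every: int = 10, dry_run: bool = True) -> None:
    """Delete checkpoints that are neither recent nor milestones."""
    checkpoints = sorted(run_dir.glob("chk_*.h5"))   # assumes sortable names
    for i, ckpt in enumerate(checkpoints):
        is_recent = i >= len(checkpoints) - keep_latest
        is_milestone = i % milestone_every == 0
        if is_recent or is_milestone:
            continue
        if dry_run:
            print(f"would delete {ckpt}")
        else:
            ckpt.unlink()

prune_checkpoints(Path("/scratch/myproj/run_0001"), dry_run=True)
```

As with the shell example earlier, run in dry-run mode and inspect the output before deleting anything.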
Checkpoint size reduction
Techniques:
- Save only state necessary for restart, not all diagnostic fields.
- Use compression and parallel I/O.
- Use reduced precision when safe (e.g., single precision for certain fields).
- For ensemble runs, share read-only common data; only checkpoint per-run state.
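A minimal h5py sketch of the first three points combined: write only the fields needed for restart, in single precision where acceptable, with compression (the variable names are illustrative, and whether single precision is safe depends on the restart algorithm):

```python
import numpy as np
import h5py

# Only the state actually required for restart (illustrative fields)
velocity = np.random.rand(128, 128, 128, 3)
pressure = np.random.rand(128, 128, 128)

with h5py.File("chk_000100.h5", "w") as f:
    for name, data in [("velocity", velocity), ("pressure", pressure)]:
        f.create_dataset(
            name,
            data=data.astype(np.float32),   # reduced precision, if safe
            chunks=True,                    # let HDF5 choose a chunk shape
            compression="gzip",
            compression_opts=4,
        )
    # Diagnostic fields (vorticity, statistics, ...) are deliberately not written
```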
Collaboration and access control
Large datasets typically involve teams; access needs to be controlled and coordinated.
Group ownership and permissions
Common patterns:
- Store data in group/project directories where all collaborators have access.
- Use Unix groups and permissions:
- Group-readable for shared datasets.
- Restrict write access when a dataset is considered “final.”
Avoid:
- Personal home-directory datasets that others cannot access.
- World-readable data with no tracking.
Shared conventions and documentation
Agree as a team on:
- Directory layout
- Naming conventions
- Metadata standards
- Retention and deletion policies
- Where to log major changes (e.g., project wiki, `CHANGELOG`, or dataset catalog)
Write these down in a central place (e.g., `PROJECT_GUIDE.md`).
Practical tips and sanity checks
- Periodically check disk usage:
- Per project, per directory, and per user.
- Before running massive new jobs:
- Reassess which old data can be archived or deleted.
- For any dataset you plan to “keep forever”:
- Document it robustly.
- Ensure it’s in a stable, open format.
- Confirm it’s stored on appropriate long-term storage.
- Test your analysis workflow on a small subset before generating PBs of data.
Managing large datasets is as much about policy and discipline as it is about tools. Good practices established early will prevent most storage crises and make your results maintainable and reproducible over the long term.