Core challenges of large datasets in HPC
Large datasets in HPC are not just “big files.” They often have:
- High volume: TB–PB scale, often spread across many files and directories.
- High velocity: data produced or consumed quickly (e.g., simulations writing GB/s).
- High concurrency: many processes reading/writing the same dataset.
- Complex structure: multiple variables, timesteps, and domains.
Managing them means balancing:
- Storage space
- I/O performance
- Metadata and organization
- Data movement costs
- Long-term usability and reproducibility
The strategies below focus on practical techniques rather than low-level I/O details.
Organizing large datasets
A clear, consistent structure makes data usable and less error-prone.
Directory and naming conventions
Define conventions early and stick to them. Aim for:
- Hierarchical structure that matches the project logic, e.g.:

  /project_name/
      raw/
          experiment_001/
              run_0001/
              run_0002/
          experiment_002/
      processed/
          experiment_001/
              run_0001/
              run_0002/
      analysis/
          figures/
          reports/

- File and directory names that encode key information:
- Simulation code name
- Version or git commit hash
- Parameter set or case ID
- Date/time, run index, or restart index
- Resolution, domain, or ensemble member
Examples:
- `simA_v3_res512_case07_t000100.h5`
- `climate_ens05_1950-01-01_to_2000-12-31.nc`
Avoid:
- Ambiguous names: `data1`, `new`, `test`, `final`, `final2`.
- Overly long names that exceed filesystem limits.
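Encoding this information programmatically keeps names consistent across runs. A minimal Python sketch (the field names and zero-padding widths are illustrative, chosen to match the `simA_...` example above):

```python
# Hypothetical helper that builds file names following the convention above:
#   <code>_<version>_res<resolution>_case<case>_t<timestep>.h5
def output_filename(code: str, version: str, resolution: int,
                    case_id: int, timestep: int) -> str:
    return f"{code}_{version}_res{resolution}_case{case_id:02d}_t{timestep:06d}.h5"

print(output_filename("simA", "v3", 512, 7, 100))
# -> simA_v3_res512_case07_t000100.h5
```

Generating names from one helper, rather than typing them by hand, avoids most naming drift within a project.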
Separating raw, intermediate, and final data
Differentiate:
- Raw data: As generated by instruments or initial simulations.
- Intermediate data: Checkpoints, temporary formats, pre-processed outputs.
- Final data products: What you analyze, publish, or share.
Benefits:
- Easier cleanup of large intermediates.
- Clear trace from raw to final (important for reproducibility).
- Safer: you can delete intermediates while preserving raw/final.
Planning storage usage
Clusters often have multiple storage tiers (e.g., scratch, project, archive). Large datasets require conscious planning.
Storage tiers and appropriate use
Typical tiers:
- Scratch / temporary: Fast, limited quota, auto-purged. Use for:
- Intermediate outputs
- Temporary checkpoints
- Short-lived analysis files
- Project / work: Larger, not auto-purged, moderate performance. Use for:
- Long-running project data
- Reusable processed data
- Archive / tape: Very large, slow access. Use for:
- Completed projects
- Legacy datasets
- Regulatory retention
Design a lifecycle:
- Write heavy I/O to scratch.
- Consolidate and compress results.
- Move curated results to project space.
- Archive rarely used but valuable data.
Quotas and capacity planning
Before running large jobs:
- Check available quotas and limits (per-user, per-project, per-filesystem).
- Estimate storage needs:
- Per timestep / per snapshot size
- Number of snapshots
- Number of ensemble members
Example:
- One snapshot: 50 GB
- 200 timesteps per run → 10 TB
- 10 ensemble members → 100 TB total
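Estimates like this are worth scripting so they can be redone whenever the output frequency or ensemble size changes. A minimal sketch using the numbers above:

```python
# Back-of-the-envelope storage estimate (values from the example above)
snapshot_gb = 50          # size of one snapshot
timesteps_per_run = 200   # snapshots kept per run
ensemble_members = 10     # independent runs in the ensemble

per_run_tb = snapshot_gb * timesteps_per_run / 1000   # ~10 TB per run
total_tb = per_run_tb * ensemble_members               # ~100 TB in total
print(f"Per run: {per_run_tb:.0f} TB, total: {total_tb:.0f} TB")
```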
If that doesn’t fit your quota or filesystem capacity:
- Reduce output frequency.
- Save fewer variables.
- Use on-the-fly compression.
- Adopt in-situ or in-transit analysis (see below).
- Coordinate with HPC support for special allocations.
Reducing data volume
Saving less data—intelligently—is often the biggest win.
Selecting what to store
Ask for each variable:
- Is this needed for the final scientific conclusions?
- Can it be recomputed from other saved variables?
- Is this just for debugging / sanity checks?
Strategies:
- Save derived quantities instead of all raw fields if recomputation is expensive.
- Drop intermediate diagnostics once a simulation is validated.
- Use regions of interest: store full data at a few key times; store coarser data elsewhere.
Temporal and spatial thinning
You rarely need to save every timestep and every grid point.
- Temporal thinning:
- Save every $N$th timestep, or only near important events.
- Spatial thinning:
- Save coarser-resolution versions (e.g., sub-sampling or averaging).
- Extract slices, lines, or small 3D subsets for rapid inspection.
Good practice:
- Combine high-resolution short windows with low-resolution long windows:
- High-res for detailed events.
- Low-res for global trends.
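A minimal NumPy sketch of both ideas, assuming a cubic 3D field written once per timestep (array sizes, the stride, and the coarsening factor are illustrative):

```python
import numpy as np

def write_this_step(timestep: int, stride: int = 10) -> bool:
    """Temporal thinning: keep only every `stride`-th timestep."""
    return timestep % stride == 0

def coarsen(field: np.ndarray, factor: int = 2) -> np.ndarray:
    """Spatial thinning: block-average a 3D field by `factor` along each axis."""
    nz, ny, nx = field.shape  # each dimension must be divisible by `factor`
    return field.reshape(nz // factor, factor,
                         ny // factor, factor,
                         nx // factor, factor).mean(axis=(1, 3, 5))

field = np.random.rand(128, 128, 128)
for t in range(100):
    if write_this_step(t, stride=10):
        reduced = coarsen(field, factor=2)   # 8x fewer values per snapshot
        # ... write `reduced` (or both, for selected high-resolution windows) ...
```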
Compression strategies
Use format-level compression when supported (e.g., HDF5, NetCDF, Zarr, ADIOS).
Types:
- Lossless compression:
- Preserves exact values.
- Examples: gzip, zstd, LZ4.
- Often 1.5–5× reduction, depends on data.
- Lossy, error-bounded compression:
- Allows a controlled amount of error within user-specified relative or absolute error bounds.
- Examples (conceptually): SZ, ZFP.
- Can achieve 10×+ reduction for smooth fields.
Choose based on:
- Downstream analysis sensitivity (can you tolerate small errors?).
- Compute vs I/O trade-off: compression uses CPU/GPU time but reduces disk traffic and storage.
Implement at scale:
- Use parallel-aware compression filters or libraries.
- Integrate compression into the same I/O library you already use (not as a separate “post-processing” step if performance is critical).
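As a concrete illustration of format-level lossless compression, the sketch below writes a chunked, gzip-compressed dataset with h5py (the file name, dataset name, and chunk shape are illustrative; zstd/LZ4 and error-bounded compressors such as SZ or ZFP are usually provided as additional HDF5 filter plugins):

```python
import numpy as np
import h5py

field = np.random.rand(256, 256, 256)   # example 3D field

with h5py.File("snapshot_t000100.h5", "w") as f:
    # Chunking is required for compression; pick chunk shapes that roughly
    # match the typical read pattern (e.g., 2D slices or small 3D blocks).
    f.create_dataset(
        "temperature",
        data=field,
        chunks=(64, 64, 64),
        compression="gzip",      # lossless, built into HDF5
        compression_opts=4,      # moderate compression level
        shuffle=True,            # byte-shuffle filter often improves ratios
    )
```

Random data compresses poorly, so ratios on real, smooth fields will look very different; always measure on representative output.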
Data lifecycle and housekeeping
Without regular cleanup, large datasets quickly overwhelm storage.
Data lifecycle policies
Define a simple policy per project, such as:
- Raw data: keep for project duration + X years.
- Intermediates: keep until results are validated + backup made.
- Final products: keep indefinitely (or per institutional policy).
- Checkpoints: keep latest N per run, delete older ones.
Automate where possible:
- Use scripts to:
- Find old intermediate directories and remove them.
- Keep only checkpoints matching a pattern (e.g., every 10th).
- Summarize disk usage per project/run.
Example cleanup script snippet pattern:
# Delete temporary HDF5 files older than 14 days
find /project/myproj -type f -name "tmp_*.h5" -mtime +14 -delete
# Summarize disk usage per top-level directory
du -sh /project/myproj/* > usage_report.txt
Always test with `-print` instead of `-delete` first.
Versioning and immutable data
For large datasets:
- Avoid “editing in place” when changes are substantial.
- Prefer:
- New version directories (`v1`, `v2`), or
- Version tags in filenames (`_v001`, `_v002`).
If using a data format with internal metadata, track:
- Creation date
- Code version and commit
- Input parameter checksum or hash
- Simulation run ID
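A parameter checksum is straightforward to compute and record; a minimal sketch using Python's standard hashlib (the parameter file name is illustrative):

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_bytes: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_bytes):
            digest.update(chunk)
    return digest.hexdigest()

# Store the result in the file's metadata attributes, the README, or the catalog.
print(file_sha256(Path("input_parameters.nml")))
```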
Immutability (don’t modify past versions) helps:
- Reproducibility
- Cross-checking analyses
- Avoiding silent corruption
Dataset metadata and documentation
Large datasets without good metadata are nearly useless, even if well-compressed.
Descriptive metadata
At minimum, record:
- What: variable names, units, dimensions, coordinate systems.
- How: code name/version, compiler, major libraries, precision (single/double).
- Why: simulation purpose, key physical parameters, or experiment design.
- When/where: generation date, cluster, or project name.
Use:
- Metadata attributes in file formats (e.g., NetCDF/HDF attributes).
- Separate human-readable `README` files alongside data:

  /project/simA_v3_res512_case07/
      data/
      README.md
README.md example content:
- Brief description of the dataset
- Grid resolution and domain
- Variables stored
- Simulation configuration file name and location
- Instructions to reproduce or re-run
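Descriptive metadata can also be embedded directly in the data files; a minimal h5py sketch (attribute names and values are illustrative, not a fixed standard, and the file is assumed to already contain a `temperature` dataset):

```python
import h5py

with h5py.File("simA_v3_res512_case07_t000100.h5", "a") as f:
    # File-level provenance
    f.attrs["code"] = "simA"
    f.attrs["code_version"] = "v3, commit abc1234"   # illustrative values
    f.attrs["created"] = "2024-05-01T12:00:00Z"
    f.attrs["precision"] = "double"

    # Per-variable descriptive attributes
    dset = f["temperature"]                          # assumed to exist
    dset.attrs["units"] = "K"
    dset.attrs["long_name"] = "fluid temperature"
    dset.attrs["grid_resolution"] = (512, 512, 512)
```

For NetCDF files, established attribute conventions (e.g., the CF conventions) play the same role and are worth following where they apply.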
Machine-readable catalogs
For very large projects:
- Maintain a table (CSV, JSON, or small database) listing:
- Dataset ID
- Path
- Size
- Time coverage, parameter values
- Version
- Optionally integrate:
- Checksums for integrity checking (e.g., MD5, SHA256).
- Tags (e.g., “for paper X”, “training data”, “validation set”).
This makes it possible to write scripts that:
- Search for relevant datasets.
- Verify integrity.
- Generate usage reports.
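A minimal sketch of such a catalog builder (the directory layout, file pattern, and CSV columns are assumptions for illustration):

```python
import csv
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Chunked SHA-256 for integrity checking of large files."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(1 << 20):
            digest.update(chunk)
    return digest.hexdigest()

root = Path("/project/myproj/processed")               # assumed dataset root
with open("dataset_catalog.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["dataset_id", "path", "size_bytes", "sha256"])
    for path in sorted(root.rglob("*.h5")):
        writer.writerow([path.stem, str(path), path.stat().st_size, sha256(path)])
```

Checksumming every file takes time on large datasets, so it is often done once at creation and re-checked only when data is moved or archived.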
Data movement and locality
Moving large datasets can be expensive or infeasible. Plan to minimize data movement.
Minimizing data transfers
Common pitfalls:
- Copying the same multi-TB dataset to personal directories.
- Repeatedly moving data between project and scratch spaces.
- Downloading large files to local machines over slow networks.
Mitigation:
- Work where the data lives:
- Run analysis jobs on the filesystem that holds the data.
- Share centrally:
- Designate shared datasets with proper permissions.
- Use group/project spaces instead of per-user duplicates.
- Use symbolic links:
- Reference existing data rather than copying when allowed.
- Coordinate with collaborators:
- Agree on a single “authoritative” location and naming scheme.
In-situ and in-transit analysis (conceptual)
Instead of writing everything to disk then analyzing:
- In-situ analysis:
- Perform some analysis or reduction directly in the simulation process.
- Save only reduced data (e.g., statistics, extracted features).
- In-transit analysis:
- Stream data to dedicated analysis processes or nodes while simulation runs.
- Reduces I/O burden on the main filesystem.
Use cases:
- Very high-frequency outputs (e.g., physics of transients).
- Situations where writing full-resolution 3D fields at every timestep is impossible.
You might:
- Compute time-averages, PDFs, or spectra on-the-fly.
- Extract and track only certain features (e.g., vortices, particles).
- Save smaller, analysis-ready datasets instead of raw fields.
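A conceptual sketch of in-situ reduction: accumulate a running time-average inside the simulation loop and write only the reduced result (`advance` is a stand-in for the real solver step):

```python
import numpy as np

def advance(field: np.ndarray) -> np.ndarray:
    """Stand-in for one simulation timestep (illustrative only)."""
    return field + 0.01 * np.random.randn(*field.shape)

field = np.zeros((128, 128, 128))
time_average = np.zeros_like(field)
n_steps = 1000

for step in range(n_steps):
    field = advance(field)
    # In-situ reduction: update the running mean instead of writing `field`
    time_average += (field - time_average) / (step + 1)

# Only the reduced, analysis-ready result is written to disk
np.save("time_average.npy", time_average)
```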
Checkpointing and restart data management
Checkpointing and restart are often the largest contributors to data volume for long simulations.
Checkpoint retention policies
Avoid keeping every checkpoint forever:
- Store:
- Frequent checkpoints during early, unstable phases.
- Less frequent checkpoints later.
- Keep:
- The last few checkpoints (e.g., last 2–3) for restart safety.
- A few “milestone” checkpoints (e.g., pre-event, post-event).
Regularly prune old checkpoints:
- Keep only those that are truly needed for:
- Continuation
- Critical analysis
- Validation or debugging
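A retention policy like this can be scripted; the sketch below keeps the latest few checkpoints plus every Mth one as a milestone and deletes the rest, defaulting to a dry run (the `chk_*.h5` naming pattern is an assumption):

```python
from pathlib import Path

def prune_checkpoints(run_dir: Path, keep_latest: int = 3,
                      milestone_every: int = 10, dry_run: bool = True) -> None:
    """Delete checkpoints that are neither recent nor milestones."""
    checkpoints = sorted(run_dir.glob("chk_*.h5"))   # assumes sortable names
    for i, ckpt in enumerate(checkpoints):
        is_recent = i >= len(checkpoints) - keep_latest
        is_milestone = i % milestone_every == 0
        if is_recent or is_milestone:
            continue
        if dry_run:
            print(f"would delete {ckpt}")
        else:
            ckpt.unlink()

prune_checkpoints(Path("/scratch/myproj/run_0001"), dry_run=True)
```

As with the shell example earlier, run in dry-run mode and inspect the output before deleting anything.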
Checkpoint size reduction
Techniques:
- Save only state necessary for restart, not all diagnostic fields.
- Use compression and parallel I/O.
- Use reduced precision when safe (e.g., single precision for certain fields).
- For ensemble runs, share read-only common data; only checkpoint per-run state.
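A minimal h5py sketch of the first three points combined: write only the fields needed for restart, in single precision where acceptable, with compression (the variable names are illustrative, and whether single precision is safe depends on the restart algorithm):

```python
import numpy as np
import h5py

# Only the state actually required for restart (illustrative fields)
velocity = np.random.rand(128, 128, 128, 3)
pressure = np.random.rand(128, 128, 128)

with h5py.File("chk_000100.h5", "w") as f:
    for name, data in [("velocity", velocity), ("pressure", pressure)]:
        f.create_dataset(
            name,
            data=data.astype(np.float32),   # reduced precision, if safe
            chunks=True,                    # let HDF5 choose a chunk shape
            compression="gzip",
            compression_opts=4,
        )
    # Diagnostic fields (vorticity, statistics, ...) are deliberately not written
```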
Collaboration and access control
Large datasets typically involve teams; access needs to be controlled and coordinated.
Group ownership and permissions
Common patterns:
- Store data in group/project directories where all collaborators have access.
- Use Unix groups and permissions:
- Group-readable for shared datasets.
- Restrict write access when a dataset is considered “final.”
Avoid:
- Personal home-directory datasets that others cannot access.
- World-readable data with no tracking.
Shared conventions and documentation
Agree as a team on:
- Directory layout
- Naming conventions
- Metadata standards
- Retention and deletion policies
- Where to log major changes (e.g., project wiki, `CHANGELOG`, or dataset catalog)
Write these down in a central place (e.g., `PROJECT_GUIDE.md`).
Practical tips and sanity checks
- Periodically check disk usage:
- Per project, per directory, and per user.
- Before running massive new jobs:
- Reassess which old data can be archived or deleted.
- For any dataset you plan to “keep forever”:
- Document it robustly.
- Ensure it’s in a stable, open format.
- Confirm it’s stored on appropriate long-term storage.
- Test your analysis workflow on a small subset before generating PBs of data.
Managing large datasets is as much about policy and discipline as it is about tools. Good practices established early will prevent most storage crises and make your results maintainable and reproducible over the long term.