Why Case Studies Matter
Scientific HPC case studies show how abstract ideas—parallelism, scaling, memory, I/O—translate into real decisions: which algorithms to use, how to organize data, how to run and manage jobs, and what “success” means (e.g., faster time‑to‑solution, better resolution, more simulations).
In this chapter, the focus is on patterns that appear across scientific domains, illustrated by concrete examples. For each domain, pay attention to:
- What is being computed (simulation, data analysis, optimization, …)
- How the computation is parallelized (MPI, OpenMP, GPUs, hybrid, …)
- Resource usage (cores, GPUs, memory, storage, I/O)
- Performance and scaling behavior
- Workflow structure (preprocessing, main compute, postprocessing, visualization)
The goal is to help you recognize typical HPC “shapes” you will see again and again.
Climate and Weather Modeling
Problem Setting
Climate and numerical weather prediction (NWP) codes solve partial differential equations (PDEs) for the atmosphere and oceans on a global or regional grid. Typical goals:
- Weather forecasts from hours to days ahead
- Seasonal and climate projections over decades or centuries
- High‑resolution regional simulations for impact studies
These models are often coupled (atmosphere–ocean–land–ice), highly parallel, and run routinely: operational forecasts on round‑the‑clock schedules, climate projections over very long simulated periods.
Computational Characteristics
- Structured grids on spheres or regional domains, often decomposed horizontally into subdomains.
- Time stepping: millions of time steps for century‑scale climate runs.
- Stencil operations: each grid point updated from neighboring points at each step.
- Communication pattern: nearest‑neighbor halo exchanges between subdomains.
Parallelization Strategies
- Domain decomposition + MPI:
- The global grid is split into horizontal tiles; each MPI process owns one tile.
- At every time step, processes exchange boundary “halo” cells with neighbors.
- Hybrid MPI + OpenMP:
- One MPI process per NUMA domain or socket; OpenMP threads iterate over vertical levels or subsets of the grid.
- GPU acceleration (in newer models):
- Stencil kernels (advection, diffusion, physics parameterizations) ported to CUDA, OpenACC, or OpenMP target directives.
- Asynchronous halo exchanges overlapping with GPU computation.
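To make the domain‑decomposition and halo‑exchange pattern above concrete, here is a minimal sketch, assuming mpi4py and NumPy are available (array sizes and the diffusion coefficient are illustrative, not taken from any real model). Each rank owns a 1D slab with one ghost cell per side and swaps halos with its neighbors before every stencil update; real weather and climate codes do the same thing in 2D/3D for many coupled fields.

```python
# halo_1d.py -- minimal 1D halo-exchange sketch; run e.g. with: mpirun -n 4 python halo_1d.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 100                         # interior cells owned by this rank
u = np.zeros(n_local + 2)             # one ghost (halo) cell at each end
u[1:-1] = rank                        # arbitrary initial data

left = rank - 1 if rank > 0 else MPI.PROC_NULL         # PROC_NULL makes domain ends no-ops
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(10):
    # Halo exchange: send my edge cells, receive the neighbors' edge cells into my ghosts.
    comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    # Explicit diffusion stencil: each interior cell is updated from its two neighbors.
    u[1:-1] += 0.1 * (u[:-2] - 2.0 * u[1:-1] + u[2:])

print(f"rank {rank}: mean interior value = {u[1:-1].mean():.3f}")
```

The same structure carries over to 2D/3D tiles: only the number of neighbors and the shape of the halo buffers change, which is why nearest‑neighbor communication shows up at every single time step.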
Example: Global Weather Forecast
- Objective: 7‑day global forecast at ~10 km resolution.
- Resources: tens of thousands of CPU cores or thousands of GPUs.
- Workflow:
- Data assimilation: Incorporate observations into a best estimate of current state (itself an HPC workload).
- Forecast integration: Run the model forward in time.
- Postprocessing & products: Interpolate to user grids, compute diagnostics, archive outputs.
HPC Challenges and Lessons
- Strong scaling limit: As you increase cores, halo communication and global reductions dominate; beyond some point, adding cores gives little benefit.
- I/O bottlenecks: Terabytes per day of model output; requires parallel I/O, compression, and careful selection of diagnostics.
- Resilience: Long climate runs must survive node failures—checkpointing and restart are essential.
- Parameter sweeps: Many ensemble members with slightly different initial conditions; schedulers and job arrays are heavily used.
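The ensemble pattern in the last bullet is usually driven by scheduler job arrays. A minimal sketch, assuming a Slurm‑like scheduler that sets SLURM_ARRAY_TASK_ID and a purely illustrative perturbation scheme (file names and the perturbation amplitude are hypothetical):

```python
# run_member.py -- prepare one ensemble member per job-array task.
# Submitted e.g. as: sbatch --array=0-49 wrapper.sh, where wrapper.sh invokes this script.
import os
import numpy as np

member = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))    # array index -> ensemble member id

if os.path.exists("base_state.npy"):
    base_state = np.load("base_state.npy")                # shared unperturbed initial state
else:
    base_state = np.zeros(1000)                           # placeholder so the sketch runs standalone

rng = np.random.default_rng(seed=member)                  # reproducible, member-specific seed
perturbed = base_state + 1e-3 * rng.standard_normal(base_state.shape)

outdir = f"member_{member:03d}"
os.makedirs(outdir, exist_ok=True)
np.save(os.path.join(outdir, "initial_state.npy"), perturbed)
# A real workflow would now launch the forecast model with this member's input directory.
print(f"prepared ensemble member {member} in {outdir}")
```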
Key takeaway: Climate and weather codes are canonical examples of regular, grid‑based, communication‑intensive MPI (often hybrid) applications with demanding I/O.
Astrophysics and Cosmology
Problem Setting
Astrophysics uses HPC for simulations of:
- Formation of galaxies and large‑scale structure in the universe
- Stellar evolution and supernova explosions
- Compact object mergers (black holes, neutron stars)
The physics involves gravity, hydrodynamics or magnetohydrodynamics (MHD), and sometimes general relativity and radiation transport.
Computational Characteristics
- Particles and/or grids:
- N‑body particles for dark matter dynamics.
- Adaptive mesh refinement (AMR) grids for gas or MHD.
- Multiscale: Large dynamic range in space and time; small regions require high resolution.
- Irregular structures: Clusters, filaments, shocks—nonuniform distribution of work.
Parallelization Strategies
- MPI over domain decomposition:
- Space split into subdomains; each MPI rank holds particles and grid cells in its region.
- Tree and AMR structures:
- Hierarchical data structures distributed across ranks.
- Tree traversals and AMR operations can cause load imbalance.
- Hybrid & GPU use:
- Gravity solvers (e.g., tree or particle‑mesh methods) and hydrodynamics kernels ported to GPUs.
- One MPI rank per GPU, with threads or CUDA streams exploiting fine‑grained parallelism.
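As a toy illustration of the spatial decomposition described above (pure NumPy, no MPI; particle counts, box size, and the artificial "clump" that mimics structure formation are all made up), the sketch below assigns particles to ranks by position and measures the resulting load imbalance:

```python
# Toy spatial decomposition: assign particles to ranks by position and check the balance.
import numpy as np

rng = np.random.default_rng(42)
n_particles, box, n_ranks = 100_000, 100.0, 8

# Half the particles uniform, half clustered in a "halo" to mimic structure formation.
uniform = rng.uniform(0.0, box, size=(n_particles // 2, 3))
clump = rng.normal(loc=0.2 * box, scale=0.02 * box, size=(n_particles // 2, 3)) % box
pos = np.vstack([uniform, clump])

# Slab decomposition along x: rank i owns x in [i*box/n_ranks, (i+1)*box/n_ranks).
owner = np.minimum((pos[:, 0] / (box / n_ranks)).astype(int), n_ranks - 1)

counts = np.bincount(owner, minlength=n_ranks)
print("particles per rank:", counts)
print(f"load imbalance (max/mean): {counts.max() / counts.mean():.2f}")
```

Once matter clusters, a static slab decomposition like this becomes badly imbalanced, which is exactly why production codes repartition their domains dynamically as the simulation evolves.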
Example: Large‑Scale Cosmological Simulation
- Objective: Evolve billions to trillions of particles in a box hundreds of Mpc on a side.
- Resources: Hundreds of thousands of cores, petabytes of storage.
- Workflow:
- Generate initial conditions (displacement fields).
- Main time integration loop (gravity + hydrodynamics).
- Periodic snapshot outputs and analysis.
HPC Challenges and Lessons
- Load balancing: Galaxy clusters form in some regions, leaving others sparse; domain decomposition must adapt dynamically.
- Communication patterns: Tree/mesh methods require nonlocal data exchanges; efficient communication and overlapping with computation are critical.
- Data volume: Snapshots can be petabytes; in situ or in transit analysis is used to reduce I/O.
- Numerical accuracy vs. performance: Gravity is sensitive to numerical errors; code must balance precision (e.g., double vs. mixed precision) against speed.
Key takeaway: Astrophysics simulations highlight dynamic load balancing, hierarchical algorithms, and extreme data challenges.
Computational Fluid Dynamics (CFD) and Engineering
Problem Setting
CFD is central to:
- Aerodynamics (aircraft, cars, wind turbines)
- Turbomachinery (jet engines, gas turbines)
- Process engineering (chemical reactors, pipelines)
- Environmental flows (urban wind, pollutant dispersion)
Codes solve the Navier–Stokes equations, often with turbulence models or direct numerical simulation (DNS) for research.
Computational Characteristics
- Structured or unstructured meshes:
- Structured grids for simple geometries (channels, pipes).
- Unstructured meshes for complex shapes (aircraft, cars).
- Stencil‑like kernels on structured grids, but more irregular memory access on unstructured meshes.
- High arithmetic intensity in turbulence models or advanced discretizations.
Parallelization Strategies
- MPI domain decomposition:
- Mesh partitioned into subdomains; interface faces require communication.
- Hybrid MPI/OpenMP:
- MPI across nodes; OpenMP threads over cells, elements, or faces within a subdomain.
- GPU acceleration:
- Finite volume/finite element loops on GPUs; careful data layout to ensure coalesced memory access.
Example: Aircraft Wing Simulation
- Objective: Simulate turbulent flow around a wing section at realistic Reynolds numbers.
- Resources: From a few hundred to tens of thousands of cores, depending on resolution.
- Workflow:
- Mesh generation (often on workstations or smaller clusters).
- Steady‑state RANS or unsteady simulation on the cluster.
- Postprocessing: lift/drag coefficients, flow visualizations.
HPC Challenges and Lessons
- Scalability on unstructured meshes:
- Graph partitioning (e.g., via METIS/ParMETIS) to balance elements per rank and minimize communication.
- Preconditioners and solvers:
- Linear solvers (e.g., Krylov methods) dominate runtime; preconditioners must be parallel and cache‑friendly.
- Cache and memory bandwidth:
- Core performance often limited by memory access patterns; data structure choices matter.
- Design space exploration:
- Many geometry or parameter variants; embarrassingly parallel ensembles can saturate schedulers.
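To illustrate why the solver stack matters, here is a small sketch of a preconditioned Krylov solve with SciPy (assumed to be available); the 1D Poisson‑like matrix and the simple Jacobi preconditioner stand in for the much larger pressure or implicit systems a CFD code would assemble.

```python
# Sketch: preconditioned conjugate-gradient solve for a sparse SPD system (illustrative).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2_000
# 1D Poisson-like matrix as a stand-in for a CFD pressure or implicit system.
A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Jacobi (diagonal) preconditioner: apply diag(A)^{-1} to each residual.
inv_diag = 1.0 / A.diagonal()
M = spla.LinearOperator((n, n), matvec=lambda x: inv_diag * x)

iters = 0
def count(xk):
    global iters
    iters += 1

x, info = spla.cg(A, b, M=M, callback=count)
print("converged" if info == 0 else f"cg returned info={info}", "after", iters, "iterations")
```

A diagonal preconditioner barely helps for this matrix; production CFD codes rely on stronger options such as ILU or multigrid, and making those preconditioners scale in parallel is often the real bottleneck.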
Key takeaway: CFD showcases domain decomposition, linear solver performance, and the interplay between mesh quality, partitioning, and scalability.
Molecular Dynamics and Computational Chemistry
Problem Setting
Molecular dynamics (MD) and related methods simulate motion and interactions of atoms and molecules to study:
- Protein folding and conformational changes
- Drug binding and free energies
- Material properties (polymers, alloys)
- Membranes and biomolecular complexes
Simulations typically cover nanoseconds to microseconds of physical time (milliseconds only with specialized hardware or enhanced‑sampling methods), with time steps of a few femtoseconds.
Computational Characteristics
- Short‑range interactions (Lennard–Jones, short‑range Coulomb) with cutoffs and neighbor lists.
- Long‑range electrostatics using particle‑mesh Ewald (PME) or related methods.
- Fine time stepping: millions to billions of steps.
- Regular kernels inside each step but complex multi‑component algorithms.
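The cutoff‑plus‑neighbor‑list idea above can be sketched with a simple cell list (pure Python/NumPy, non‑periodic, all sizes illustrative): atoms are binned into cells at least one cutoff wide, so each atom only needs to check the 27 surrounding cells instead of every other atom.

```python
# Toy cell-list neighbor search: count pairs within a cutoff without O(N^2) distance checks.
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n_atoms, box, cutoff = 5_000, 20.0, 1.2
pos = rng.uniform(0.0, box, size=(n_atoms, 3))

n_cells = int(box // cutoff)                       # cells at least one cutoff wide
cell_of = np.minimum((pos / (box / n_cells)).astype(int), n_cells - 1)

cells = {}                                         # cell index -> list of atom ids
for i, c in enumerate(map(tuple, cell_of)):
    cells.setdefault(c, []).append(i)

pairs = 0
for c, atoms in cells.items():
    for d in product((-1, 0, 1), repeat=3):        # this cell plus its 26 neighbors
        nb = tuple(c[k] + d[k] for k in range(3))  # no periodic wrap in this toy version
        for i in atoms:
            for j in cells.get(nb, []):
                if j > i and np.linalg.norm(pos[i] - pos[j]) < cutoff:
                    pairs += 1
print("pairs within cutoff:", pairs)
```

Production MD codes add periodic boundaries, rebuild the lists only every few steps, and vectorize the distance checks, but the same decomposition into cells underlies spatial domain decomposition across ranks.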
Parallelization Strategies
- Spatial domain decomposition + MPI:
- Space split into cells; atoms assigned to ranks based on position.
- Neighbor lists updated periodically; forces computed using local and halo atoms.
- Force decomposition & task‑based approaches in some codes.
- GPU acceleration:
- Short‑range force kernels run on GPUs; PME may use separate ranks or GPUs.
- One or more GPUs per node; CPUs handle orchestration and some auxiliary tasks.
Example: Protein–Ligand Binding Simulation
- Objective: Estimate binding free energy of a small molecule to a protein.
- Resources: From single‑GPU workstations to modest HPC clusters with many GPUs.
- Workflow:
- System preparation (solvation, ionization).
- Equilibration runs.
- Production MD (possibly many replicas with different starting conditions).
- Analysis (RMSD, free energy estimators).
HPC Challenges and Lessons
- Strong scaling limits:
- For a single system, adding more ranks eventually makes communication dominate; a given system size typically scales efficiently only up to a modest number of nodes.
- Ensemble simulations:
- To use large machines efficiently, run many independent replicas in parallel, each moderately parallelized.
- GPU utilization:
- MD kernels are compute‑heavy and vectorizable, making them well suited to GPUs; performance hinges on good GPU occupancy and minimizing host–device transfers.
- Load balancing:
- Inhomogeneous systems (e.g., membrane + solvent) can cause some domains to have many more atoms than others.
Key takeaway: MD is a classic example of moderately strong scaling per simulation plus massive ensemble parallelism across simulations, with heavy GPU usage.
Bioinformatics and Genomics
Problem Setting
Genomics and bioinformatics use HPC for:
- Genome assembly from short reads
- Alignment of sequencing reads to reference genomes
- Variant calling and functional annotation
- Metagenomics and transcriptomics analyses
These are typically data‑intensive rather than numerically intensive.
Computational Characteristics
- Huge input datasets (terabytes of reads).
- String processing, graph algorithms, and hashing dominate compute.
- Irregular memory access and branching; often memory bandwidth and latency limited.
- Embarrassingly parallel tasks at multiple stages (per‑sample, per‑contig, per‑chromosome).
Parallelization Strategies
- Coarse‑grained parallelism:
- Process many samples or read chunks in parallel.
- Use job arrays on the scheduler for thousands of similar jobs.
- Thread‑level parallelism:
- Many tools use OpenMP or Pthreads within a node.
- Distributed memory:
- For large assemblies, data and computation are distributed across many nodes (e.g., distributed de Bruijn graphs).
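A minimal sketch of the coarse‑grained, per‑sample parallelism described above (sample names and the commented‑out aligner command are placeholders, not a real pipeline): on a single node, independent samples can simply be fanned out over a process pool; across nodes, the same fan‑out is usually expressed as a job array or handed to a workflow engine.

```python
# Sketch: process independent samples in parallel on one node.
from concurrent.futures import ProcessPoolExecutor, as_completed

samples = [f"sample_{i:03d}" for i in range(16)]        # hypothetical sample IDs

def align(sample: str) -> str:
    # A real pipeline would launch an aligner here, e.g. something like
    #   subprocess.run(["bwa", "mem", "ref.fa", f"{sample}.fastq"], check=True)
    # For this sketch we only pretend and return the sample name.
    return sample

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:    # 8 samples in flight at a time
        futures = {pool.submit(align, s): s for s in samples}
        for fut in as_completed(futures):
            print("finished", fut.result())
```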
Example: Whole‑Genome Variant Calling Pipeline
- Objective: Identify variants in hundreds or thousands of human genomes.
- Resources: From small clusters to large centers; thousands of cores over days to weeks.
- Workflow:
- Read QC and trimming.
- Alignment to reference genome.
- Sort, mark duplicates, and recalibrate.
- Variant calling and joint genotyping.
- Annotation and reporting.
HPC Challenges and Lessons
- I/O and storage:
- Pipelines read and write many large intermediate files; parallel filesystems and smart caching are essential.
- Workflow management:
- Complex DAGs of tasks with dependencies; workflow engines (Snakemake, Nextflow, Cromwell) orchestrate jobs on clusters.
- Throughput vs. latency:
- Focus is often on processing as many samples per day as possible, not on minimizing time for a single sample.
- Reproducibility:
- Strict versioning and containerization are widespread due to clinical relevance.
Key takeaway: Genomics emphasizes I/O, workflow orchestration, and embarrassingly parallel throughput rather than extreme per‑job scalability.
High‑Energy Physics (HEP)
Problem Setting
Large experiments (e.g., at the LHC) produce enormous volumes of collision data. HPC is used for:
- Detector simulation (Monte Carlo)
- Event reconstruction
- Analysis of recorded events
- Theoretical simulations (lattice QCD, perturbative calculations)
Here we highlight two different patterns: Monte Carlo event simulation and lattice QCD.
Monte Carlo Event Simulation
Computational Characteristics
- Embarrassingly parallel: each simulated event is independent.
- Complex local computations per event (particle interactions, detector response).
- Moderate memory usage per task; huge overall data volume.
Parallelization Strategies
- Massive task parallelism:
- Millions to billions of events, each independent; they are typically grouped into many parallel jobs.
- Perfectly suited to distributed computing grids and cluster job arrays.
- Multi‑threading/GPU within an event:
- Newer frameworks vectorize or offload event steps to GPUs.
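A toy version of this task parallelism (pure Python/NumPy; the "detector response" is invented for illustration): every event depends only on its own random seed, so a pool of workers, a job array, or a grid site can all process disjoint chunks independently.

```python
# Toy Monte Carlo event loop: each event is independent, so events parallelize trivially.
import numpy as np
from multiprocessing import Pool

def simulate_event(seed: int) -> float:
    """Stand-in for a detector-simulation step: returns one 'measured' quantity."""
    rng = np.random.default_rng(seed)
    energy = rng.exponential(scale=10.0)                      # fake primary energy
    return energy * (1.0 + 0.05 * rng.standard_normal())      # fake detector smearing

if __name__ == "__main__":
    n_events = 100_000
    with Pool(processes=8) as pool:
        results = pool.map(simulate_event, range(n_events), chunksize=1_000)
    print(f"simulated {n_events} events, mean response {np.mean(results):.2f}")
```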
HPC Challenges and Lessons
- Resource federation: global computing grids coordinate thousands of sites.
- Data management: replicating and accessing petabytes of event data.
- Efficiency: aim for high throughput and high utilization; individual job performance is less critical than overall rate.
Lattice QCD
Computational Characteristics
- Discrete space‑time lattice; large sparse linear systems.
- Heavy use of iterative solvers and stencil‑like operations.
- Highly regular but communication‑intensive patterns (nearest‑neighbor on 4D lattices).
Parallelization Strategies
- Domain decomposition + MPI in 4D.
- GPU acceleration for linear algebra kernels; often one rank per GPU.
- Mixed precision techniques to speed up iterative solvers.
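The mixed‑precision idea in the last bullet can be sketched with dense iterative refinement in NumPy (the matrix is a made‑up well‑conditioned test case; real lattice QCD codes apply the same trick inside Krylov solvers on huge sparse operators): do the expensive solves in single precision and correct the residual in double precision.

```python
# Toy mixed-precision iterative refinement: cheap float32 solves, float64 residuals.
import numpy as np

rng = np.random.default_rng(1)
n = 500
A64 = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
b64 = rng.standard_normal(n)

A32 = A64.astype(np.float32)                         # low-precision copy used for the solves
x = np.zeros(n)

for it in range(10):
    r = b64 - A64 @ x                                # residual in double precision
    # A real solver would reuse a low-precision factorization or preconditioner here.
    dx = np.linalg.solve(A32, r.astype(np.float32))  # correction computed in single precision
    x = x + dx.astype(np.float64)
    rel = np.linalg.norm(r) / np.linalg.norm(b64)
    print(f"iter {it}: relative residual {rel:.2e}")
    if rel < 1e-12:
        break
```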
HPC Challenges and Lessons
- Strong scaling up to very large core counts, but limited by latency and global reductions.
- Machine‑specific tuning: performance highly sensitive to network, cache, and GPU characteristics.
Key takeaway: HEP showcases both embarrassingly parallel Monte Carlo workflows and tightly coupled, stencil‑based simulations.
Earth Sciences and Natural Hazards
Problem Setting
Earth sciences use HPC for:
- Seismic wave propagation and earthquake modeling
- Tsunami and storm surge simulations
- Volcanic eruption and landslide modeling
- Groundwater and reservoir simulations
These applications directly support hazard assessment and risk mitigation.
Computational Characteristics
- Wave and transport PDEs on 2D/3D meshes.
- Time‑critical in some cases (early warning, real‑time forecasting).
- Complex geometries (topography, subsurface structures).
- Multi‑physics coupling (e.g., earthquake rupture + wave propagation).
Parallelization Strategies
- MPI domain decomposition of the mesh.
- Hybrid MPI/OpenMP and GPU:
- Seismic wave equations implemented as high‑order stencil kernels on CPUs or GPUs.
- GPUs provide significant speedups for high‑order methods.
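A minimal sketch of the kind of stencil kernel mentioned above (1D acoustic wave equation with second‑order finite differences; all parameters are invented): it also shows the simplest form of output decimation, writing only every Nth time level, which becomes essential once the fields are 3D and time‑dependent.

```python
# Toy 1D acoustic wave propagation: second-order finite differences in space and time,
# keeping snapshots only every `output_every` steps (a simple form of output decimation).
import numpy as np

nx, nt = 2_000, 5_000
dx, c = 10.0, 3000.0                    # grid spacing [m], wave speed [m/s]
dt = 0.5 * dx / c                       # time step chosen to satisfy the CFL condition
output_every = 500

u_prev = np.zeros(nx)
u = np.zeros(nx)
u[nx // 2] = 1.0                        # point disturbance at the domain center

snapshots = []
for step in range(nt):
    lap = np.zeros(nx)
    lap[1:-1] = (u[:-2] - 2.0 * u[1:-1] + u[2:]) / dx**2
    u_next = 2.0 * u - u_prev + (c * dt) ** 2 * lap
    u_prev, u = u, u_next
    if step % output_every == 0:
        snapshots.append(u.copy())      # a real code would do a parallel write or in situ analysis

print(f"kept {len(snapshots)} of {nt} time levels")
```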
Example: Regional Earthquake Scenario Simulation
- Objective: Model ground motion for a hypothetical earthquake to produce hazard maps.
- Resources: Thousands to tens of thousands of CPU cores or hundreds of GPUs.
- Workflow:
- Build geological and source models.
- Run wave propagation simulation.
- Generate shaking intensity maps and risk metrics.
HPC Challenges and Lessons
- Real‑time constraints:
- For early warning, computation must keep up with or outpace real time.
- I/O and visualization:
- Large time‑dependent 3D fields; strategies include output decimation and in situ visualization.
- Mesh resolution vs. runtime:
- Trade‑offs between capturing high‑frequency waves and computational cost.
Key takeaway: Earth‑hazard applications illustrate the use of HPC for time‑critical simulations with strong societal impact.
Cross‑Cutting Patterns from Scientific Case Studies
Across these domains, a few recurring patterns emerge:
1. Workload Types
- Tightly coupled simulations (climate, CFD, lattice QCD): require fast interconnects, careful parallelization, and attention to scalability and communication patterns.
- Embarrassingly or loosely coupled tasks (Monte Carlo, genomics pipelines, MD ensembles): dominated by workflow management, job arrays, and efficient use of cluster queues.
2. Parallelization Models
- MPI is almost universal for distributed memory.
- Hybrid MPI + threads is common for node‑level performance on CPUs.
- GPUs and accelerators are increasingly central, especially for regular, compute‑intensive kernels.
- Ensembles: Many independent moderate‑size jobs used to fill large systems.
3. Performance Concerns
- Scaling:
- Strong scaling limits appear in tightly coupled simulations; beyond a point, communication dominates.
- I/O:
- Many scientific codes are I/O‑bound when output frequency or resolution is high.
- Load balance:
- Adaptive meshes, inhomogeneous physics, and data‑dependent workloads require dynamic balancing and smart partitioning.
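The strong‑scaling limit can be quantified with Amdahl's law, speedup(p) = 1 / (s + (1 − s)/p), where s is the fraction of the runtime that does not parallelize (serial work, global reductions, communication that grows with p). A quick back‑of‑the‑envelope, with an assumed s of 2%:

```python
# Strong-scaling estimate from Amdahl's law: speedup(p) = 1 / (s + (1 - s)/p).
serial_fraction = 0.02                     # assume 2% of the runtime cannot be parallelized

for p in (1, 16, 256, 4096, 65536):
    speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)
    efficiency = speedup / p
    print(f"{p:6d} cores: speedup {speedup:8.1f}, parallel efficiency {efficiency:6.1%}")
```

Even a 2% non‑parallel fraction caps the achievable speedup at 50×, which is why tightly coupled codes hit a scaling wall long before the machine runs out of cores.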
4. Workflow and Operations
- End‑to‑end pipelines:
- Preprocessing → main compute → postprocessing/analysis → archiving.
- Checkpoint/restart:
- Long‑running simulations rely heavily on fault tolerance.
- Reproducibility and provenance:
- Scientific results must be reproducible; environment control and consistent software stacks are crucial.
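A minimal checkpoint/restart pattern looks like the sketch below (file names, state size, and checkpoint interval are arbitrary; real simulations checkpoint large distributed state with parallel I/O libraries rather than a single NumPy file):

```python
# Minimal checkpoint/restart pattern: periodically save state, resume from the last checkpoint.
import os
import numpy as np

CHECKPOINT = "checkpoint.npz"

def save_checkpoint(step: int, state: np.ndarray) -> None:
    tmp = "checkpoint_tmp.npz"
    np.savez(tmp, step=step, state=state)
    os.replace(tmp, CHECKPOINT)            # atomic rename: never leaves a half-written checkpoint

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        data = np.load(CHECKPOINT)
        return int(data["step"]), data["state"]
    return 0, np.zeros(1_000)              # fresh start if no checkpoint exists

start, state = load_checkpoint()
for step in range(start, 10_000):
    state = state + 1e-3                   # stand-in for one expensive time step
    if step % 1_000 == 0:
        save_checkpoint(step + 1, state)   # checkpoint every 1000 steps
print("run finished at step 10000")
```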
Understanding these real‑world patterns will help you reason about how to design, run, and optimize your own HPC workloads in scientific contexts.