How AI and Machine Learning Intersect with HPC
AI and machine learning (ML) are no longer separate from HPC—they are now major drivers of hardware design, software ecosystems, and workflows. This chapter focuses on how AI/ML both use HPC and help improve HPC itself.
Why AI/ML Need HPC
Modern AI models, especially deep neural networks (DNNs), are computationally intensive:
- Training involves enormous numbers of floating‑point operations, far beyond what a single processor can deliver in reasonable time.
- Datasets may reach terabytes or petabytes.
- Hyperparameter search and model ensembles multiply the cost.
As a result:
- Training state‑of‑the‑art models often requires thousands of GPUs for days or weeks.
- Even “medium” models may need dozens of GPUs or a large CPU cluster.
HPC systems provide:
- High‑throughput compute (GPUs, many‑core CPUs, accelerators).
- High‑bandwidth, low‑latency interconnects (InfiniBand, high‑end Ethernet).
- Large, parallel filesystems and advanced schedulers.
This makes HPC the natural environment for large‑scale AI.
AI Workloads on HPC Systems
AI/ML on HPC typically appears in several patterns.
1. Large‑Scale Deep Learning Training
Key characteristics:
- Data parallelism: Replicate the model on many GPUs, split data across them.
- Model parallelism: Split the model itself across devices when it doesn’t fit on one.
- Pipeline parallelism: Split the model into stages and stream mini‑batches through.
Common techniques:
- Distributed training frameworks: PyTorch Distributed, TensorFlow tf.distribute, DeepSpeed, Horovod.
- Collective communication: Heavy use of allreduce for gradient aggregation (often via NCCL on GPUs, with MPI under the hood on clusters); see the sketch at the end of this subsection.
HPC‑specific issues:
- Efficient job placement (e.g., GPUs on the same node or rack).
- Tuning batch size, learning rate, and communication overlap to utilize the interconnect.
- Handling failures during long training runs using checkpointing.
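Below is a minimal sketch of the data‑parallel pattern in PyTorch with the NCCL backend. It assumes the launcher (e.g., torchrun or srun) has already set the usual rendezvous environment variables; the model, data, and checkpoint path are placeholders.

```python
# Minimal data-parallel training sketch (assumes PyTorch + NCCL and that the
# launcher, e.g. torchrun or srun, sets RANK, WORLD_SIZE, MASTER_ADDR/PORT).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])      # gradients allreduced automatically
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # placeholder data
        loss = model(x).pow(2).mean()
        loss.backward()                              # triggers gradient allreduce
        optimizer.step()
        optimizer.zero_grad()

        if step % 20 == 0 and dist.get_rank() == 0:
            # periodic checkpoint so long runs survive node failures
            torch.save(model.module.state_dict(), "ckpt.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```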
2. AI‑Assisted Simulation and Surrogates
In traditional HPC, simulations (CFD, climate, molecular dynamics, etc.) can be extremely expensive. AI adds new patterns:
- Surrogate models / emulators:
- Train ML models to approximate expensive parts of simulations (e.g., turbulence models, subgrid physics).
- Once trained, surrogates can be evaluated far faster than the full simulation.
- Emulation of parameter sweeps:
- Run a limited number of high‑fidelity simulations.
- Train a model to predict outcomes for new parameters instead of simulating each one.
- Hybrid physics‑AI models:
- Combine physically‑based PDE solvers with neural networks (e.g., neural operators) embedded in parts of the solver.
Impact on HPC:
- Reduced per‑job compute cost for repeated analyses.
- Shift of workload: large up‑front training on HPC, then many cheap inference runs (sometimes off‑cluster).
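As a sketch of the surrogate idea, the snippet below fits a small network to precomputed input/output pairs from a limited set of high‑fidelity runs. The file names (inputs.npy, outputs.npy) and network size are illustrative assumptions, not a specific application.

```python
# Sketch of training a surrogate on precomputed simulation results.
# inputs.npy / outputs.npy are assumed to hold (parameters -> quantity of
# interest) pairs from a limited number of high-fidelity runs.
import numpy as np
import torch

X = torch.tensor(np.load("inputs.npy"), dtype=torch.float32)   # shape (N, n_params)
Y = torch.tensor(np.load("outputs.npy"), dtype=torch.float32)  # shape (N, n_outputs)

surrogate = torch.nn.Sequential(
    torch.nn.Linear(X.shape[1], 128), torch.nn.Tanh(),
    torch.nn.Linear(128, 128), torch.nn.Tanh(),
    torch.nn.Linear(128, Y.shape[1]),
)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for epoch in range(2000):
    loss = torch.nn.functional.mse_loss(surrogate(X), Y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Once trained, evaluating new parameter sets is far cheaper than running
# the full simulation for each of them.
new_params = torch.rand(1000, X.shape[1])
with torch.no_grad():
    predictions = surrogate(new_params)
```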
3. AI for Data Analysis, Post‑Processing, and Inference
HPC simulations and experiments generate massive outputs. AI helps to:
- Classify, detect, and segment features in large datasets (e.g., vortices in CFD, galaxies in cosmology).
- Reduce dimensionality (PCA, autoencoders) to compress data while preserving essential structure.
- Accelerate I/O‑bound workflows by:
- Online analysis during simulation (co‑scheduled AI jobs).
- Learning to denoise or reconstruct from compressed data.
Inference patterns on HPC:
- Batched inference jobs on GPUs or many‑core CPUs.
- “Streaming” inference frameworks to process data as it is generated by running simulations or instruments.
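A batched inference job can be as simple as the sketch below, which streams stored snapshots through a trained detector on one GPU. The architecture, tensor shapes, and file names are placeholders, not a specific application.

```python
# Batched-inference sketch: run a trained detector over stored field snapshots.
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Sequential(                      # placeholder architecture
    torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(16, 2),
).to(device)
model.load_state_dict(torch.load("detector_weights.pt", map_location=device))
model.eval()

snapshots = torch.load("snapshots.pt")            # e.g. (N, 1, H, W) tensor of fields
loader = DataLoader(TensorDataset(snapshots), batch_size=256)

results = []
with torch.no_grad():
    for (batch,) in loader:
        results.append(model(batch.to(device)).cpu())
torch.save(torch.cat(results), "detections.pt")
```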
4. AI for Experimental Facilities
Large scientific facilities (synchrotrons, telescopes, particle accelerators) often have direct ties to HPC centers:
- Real‑time or near‑real‑time ML analysis of experimental data.
- Online feedback loops: ML guides experimental settings based on rapidly processed data.
- HPC clusters close to instruments, or connected via high‑speed networks, run the ML pipelines.
This is part of a broader trend where HPC, AI, and experimental science form a single workflow.
HPC Architectures Shaped by AI
AI workloads have influenced how HPC systems are designed.
GPU‑Heavy and Accelerator‑Rich Systems
Trends:
- Higher ratio of GPUs to CPUs per node.
- Specialized AI accelerators (e.g., Tensor Cores, matrix units, TPUs in some environments).
- Hardware designed for:
- Mixed‑precision arithmetic (FP16, BF16, FP8).
- High throughput matrix operations.
Consequences for HPC:
- Systems are often optimized for dense linear algebra (matrix multiplication), which benefits both AI and traditional workloads (e.g., dense solvers).
- Power, cooling, and system density are adjusted for large GPU farms.
Interconnects Tuned for AI
AI training is communication‑heavy (especially with large models):
- Frequent gradient exchanges using allreduce.
- Large message sizes and high message rates.
As a result:
- Networks are tuned for high bisection bandwidth and low latency (fat‑tree, Dragonfly topologies).
- Specialized libraries (NCCL, vendor‑tuned MPI, SHARP in‑network reduction) are widely deployed.
- Features like GPU‑direct RDMA minimize CPU involvement in data transfers.
Storage and I/O Considerations
AI places different demands on storage than some traditional HPC workloads:
- Many small files (images, text) or large collections of medium‑sized records.
- High streaming read rates for training data.
- Need for efficient checkpointing of large model states.
Responses in HPC design:
- Adoption of object stores, data lakes, or dedicated “AI storage” tiers in addition to parallel filesystems.
- Use of data caching layers (node‑local SSDs, burst buffers).
- Dataset formats tailored for streaming (e.g., TFRecord, WebDataset, LMDB) to reduce metadata overhead.
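One common pattern is to stream samples from a few large shard files rather than millions of small ones. The sketch below uses a plain PyTorch IterableDataset over hypothetical .npz shards; libraries such as WebDataset or TFRecord readers provide more complete implementations of the same idea.

```python
# Sketch of streaming training data from a small number of large shard files,
# keeping metadata load on the parallel filesystem low. The shard layout
# ("shards/*.npz" with "x" and "y" arrays) is an assumption for illustration.
import glob
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardStream(IterableDataset):
    def __init__(self, pattern):
        self.paths = sorted(glob.glob(pattern))

    def __iter__(self):
        info = get_worker_info()
        # Each DataLoader worker reads a disjoint subset of shards.
        paths = self.paths if info is None else self.paths[info.id::info.num_workers]
        for path in paths:
            shard = np.load(path)                  # one large, sequential read
            for x, y in zip(shard["x"], shard["y"]):
                yield torch.from_numpy(x), torch.tensor(y)

loader = DataLoader(ShardStream("shards/*.npz"), batch_size=64, num_workers=4)
```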
Software Ecosystem: From MPI to AI Frameworks
HPC and AI software stacks are converging.
AI Frameworks as First‑Class HPC Citizens
Common frameworks:
- PyTorch, TensorFlow, JAX, MXNet, and domain‑specific libraries.
On HPC systems:
- Installed as modules or container images.
- Built with optimized BLAS, cuDNN, vendor math libraries, and MPI/NCCL support.
- Integrated with batch schedulers (SLURM, PBS, etc.) via wrapper scripts or launchers (e.g., srun, mpirun) for distributed training.
Implications:
- Python becomes a central language in many HPC workflows.
- Interfacing with compiled code (C/C++, Fortran) is common through bindings, extensions, or library calls.
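In practice, much of the scheduler integration is a few lines of Python that map scheduler‑provided environment variables onto the rendezvous variables a framework expects. The sketch below assumes a SLURM job launched with one task per GPU via srun and a batch script that exports MASTER_ADDR and MASTER_PORT.

```python
# Sketch: map SLURM environment variables to the variables PyTorch's
# env:// rendezvous expects. Assumes srun launched one task per GPU and the
# batch script exported MASTER_ADDR / MASTER_PORT (e.g. the first node).
import os
import torch.distributed as dist

os.environ.setdefault("RANK", os.environ["SLURM_PROCID"])
os.environ.setdefault("WORLD_SIZE", os.environ["SLURM_NTASKS"])
os.environ.setdefault("LOCAL_RANK", os.environ["SLURM_LOCALID"])

dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} / {dist.get_world_size()} initialized")
```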
Hybrid MPI + AI Runtimes
Under the hood:
- Many AI frameworks use MPI or MPI‑like communication patterns:
- Horovod explicitly uses MPI for allreduce.
- Some PyTorch Distributed backends rely on MPI as a transport.
- Jobs may mix:
- MPI ranks (for simulation components).
- AI processes using framework‑level distributed APIs.
Patterns:
- Coupled workflows: Simulation and ML components run concurrently on the same allocation, sharing data.
- Orchestrated pipelines: Simulation writes checkpoints; ML jobs run later on the results.
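The coupled‑workflow pattern can be sketched with mpi4py: most ranks advance a (placeholder) simulation while one rank receives fields and applies an ML model. This is an illustrative skeleton under those assumptions, not a production coupling scheme.

```python
# Coupled-workflow sketch: simulation ranks send field data to an ML rank
# inside the same allocation. The "simulate" function is a placeholder.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
ml_rank = size - 1                                # last rank hosts the ML model

def simulate(step, n=1024):
    return np.random.rand(n).astype(np.float32)   # stand-in for a real solver step

for step in range(10):
    if rank != ml_rank:
        field = simulate(step)
        comm.Send(field, dest=ml_rank, tag=step)
    else:
        for src in range(size - 1):
            buf = np.empty(1024, dtype=np.float32)
            comm.Recv(buf, source=src, tag=step)
            # apply the trained model to `buf` here (omitted in this sketch)
```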
AI to Improve HPC Operations
AI is not only a workload on HPC—it is also used to manage and optimize HPC systems.
Intelligent Scheduling and Resource Management
Potential applications:
- Queue time prediction: ML models predict job start times and expected runtimes based on historical data.
- Adaptive job placement: Schedulers learn to place jobs to reduce network contention or I/O hotspots.
- Fair‑share tuning: Reinforcement learning or predictive models inform policy adjustments.
Benefits:
- Better utilization of expensive resources.
- Improved user experience via more predictable behavior.
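As an illustration of runtime prediction, the sketch below fits a gradient‑boosted regressor to historical accounting records (e.g., exported from the scheduler's accounting database). The CSV layout and feature names are assumptions, not any specific site's schema.

```python
# Sketch: predict actual job runtimes from historical accounting data.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

jobs = pd.read_csv("job_history.csv")             # assumed columns below
features = jobs[["n_nodes", "n_gpus", "requested_minutes", "queue_id"]]
target = jobs["actual_runtime_minutes"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("R^2 on held-out jobs:", model.score(X_test, y_test))
```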
Energy‑Aware and Thermal‑Aware Management
AI can help reduce energy costs and environmental impact:
- Predicting power consumption of jobs based on their characteristics.
- Dynamically adjusting CPU/GPU frequencies (DVFS) with minimal performance loss.
- Forecasting thermal patterns to control cooling systems more efficiently.
These techniques support sustainability goals discussed elsewhere in the course.
Failure Prediction and System Health
Large HPC systems experience hardware and software failures:
- AI models trained on logs, sensor data, and error reports can:
- Predict failing components (disks, memory, nodes).
- Trigger preemptive maintenance.
- Suggest job rescheduling to avoid unreliable nodes.
This contributes to higher effective availability and fewer job crashes.
New Algorithmic Directions at the AI–HPC Interface
Several emerging techniques blend AI methods with traditional numerical or simulation approaches.
Physics‑Informed and Operator‑Learning Methods
Examples:
- Physics‑Informed Neural Networks (PINNs):
- Neural networks trained to satisfy PDEs and boundary conditions by embedding physics in the loss function.
- Neural operators and Fourier Neural Operators:
- Learn solution operators mapping input fields to output fields, potentially replacing or accelerating some solvers.
Relevance to HPC:
- Can offer fast evaluations for parameter studies or real‑time predictions.
- Training these models often requires large‑scale HPC resources.
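The core PINN idea can be shown on a toy 1D Poisson problem: the PDE residual, computed with automatic differentiation, is added to the loss alongside the boundary conditions. The sketch below is a minimal illustration for u''(x) = -sin(x) with u(0) = u(pi) = 0, not a production PINN.

```python
# Toy PINN sketch: embed the PDE residual u'' + sin(x) and the boundary
# conditions in the training loss. Network size and iteration count are
# illustrative choices.
import math
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x_bc = torch.tensor([[0.0], [math.pi]])           # boundary points, u = 0 there

for it in range(5000):
    x = torch.rand(128, 1) * math.pi              # random collocation points
    x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
    pde_residual = d2u + torch.sin(x)             # should vanish for the solution
    loss = pde_residual.pow(2).mean() + net(x_bc).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```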
AI‑Driven Optimization and Control
In many HPC applications, the goal is to optimize something, not just simulate:
- Design optimization (aerodynamics, materials).
- Control systems (plasma confinement, smart grids).
AI contributes by:
- Surrogate‑based optimization: Use ML to approximate objective functions for faster search.
- Reinforcement learning: Agents interact with simulations to learn control policies.
These methods typically run the simulations on HPC with the learning loop tightly integrated alongside them.
Practical Considerations for AI on HPC
When using AI on HPC systems, users need to adapt to both worlds.
Workflow and Job Design
Key aspects:
- Splitting workloads between:
- Short interactive experiments (prototyping models).
- Long batch jobs (full‑scale training or hyperparameter sweeps).
- Managing software environments:
- Using modules or containers for consistent versions.
- Ensuring compatibility with system drivers and libraries (CUDA, MPI).
Hyperparameter searches:
- Use job arrays or workflow tools (e.g., Snakemake, CWL, custom scripts) to launch many training jobs efficiently.
- Plan resource usage to avoid overwhelming shared GPU pools.
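A job‑array sweep often boils down to indexing into a configuration grid with the task ID the scheduler provides, as in the sketch below (launched, for example, with sbatch --array=0-8). The hyperparameter grid and the train() call are placeholders.

```python
# Sketch: one SLURM array task trains one hyperparameter configuration.
import itertools
import os

learning_rates = [1e-2, 1e-3, 1e-4]
batch_sizes = [32, 128, 512]
grid = list(itertools.product(learning_rates, batch_sizes))

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])  # set by SLURM for array jobs
lr, batch_size = grid[task_id]
print(f"task {task_id}: lr={lr}, batch_size={batch_size}")
# train(lr=lr, batch_size=batch_size)             # call your training entry point
```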
Data Management for AI on HPC
Common challenges:
- Moving large datasets to the cluster efficiently.
- Avoiding overloading metadata servers with many small files.
- Ensuring reproducible access to training and validation splits.
Typical approaches:
- Packing datasets into fewer, larger files (e.g., tar shards, TFRecord, WebDataset formats).
- Using local SSD caches and staging data near compute nodes.
- Recording dataset versions and preprocessing steps for reproducibility.
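Packing can be done with nothing more than the standard library, as in the sketch below, which groups hypothetical per‑sample files into fixed‑size tar shards; the directory layout and shard size are illustrative assumptions.

```python
# Sketch: pack many small sample files into a few tar shards so training jobs
# issue large sequential reads instead of millions of metadata operations.
import glob
import os
import tarfile

files = sorted(glob.glob("dataset/*.png"))
shard_size = 10_000                               # samples per shard
os.makedirs("shards", exist_ok=True)

for shard_idx in range(0, len(files), shard_size):
    name = f"shards/shard-{shard_idx // shard_size:05d}.tar"
    with tarfile.open(name, "w") as tar:
        for path in files[shard_idx:shard_idx + shard_size]:
            tar.add(path, arcname=os.path.basename(path))
```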
Emerging Trends and Future Directions
AI and HPC are co‑evolving. Several future‑oriented trends include:
- Exascale AI training:
- Using full exascale systems for ultra‑large models (trillions of parameters).
- New programming models for fault tolerance and elasticity at scale.
- Heterogeneous and disaggregated architectures:
- Mixes of CPUs, GPUs, AI accelerators, FPGAs, and high‑bandwidth memory.
- Systems where compute, memory, and storage are disaggregated but connected by high‑speed fabrics.
- Tighter integration of simulation and AI:
- Workflows where simulations continuously feed AI models and vice versa.
- “Digital twins” of physical systems that run partly as numerical simulations, partly as AI models.
- Automated scientific discovery:
- AI assisting in hypothesis generation, experimental design, and interpretation of results.
- HPC providing the computational backbone for both simulations and AI reasoning.
- Standardization and policy:
- Evolving best practices for responsible use of AI on HPC (fair access, privacy, reproducibility).
- Policies for sharing trained models and datasets produced on public HPC facilities.
Understanding these interactions prepares you to design workflows and applications that fully exploit both AI methods and HPC infrastructures, and to adapt as the combined field continues to advance.