From Simulation Driven to Data Driven
High performance computing has traditionally focused on large numerical simulations, such as climate models, fluid dynamics, and structural mechanics. These applications typically solve partial differential equations or other mathematical models on large meshes or grids. In contrast, artificial intelligence and machine learning focus on learning patterns directly from data. In modern HPC practice, these two approaches increasingly interact and support each other.
On one side, AI workloads themselves need HPC capabilities, especially for large deep learning models, massive datasets, or large scale hyperparameter searches. On the other side, AI methods are increasingly integrated into classical simulation workflows, for tasks such as accelerating solvers, building surrogates that approximate simulations, or analyzing simulation output. This convergence changes how HPC resources are used, scheduled, and programmed.
AI Workloads as HPC Workloads
Traditional HPC jobs are dominated by floating point arithmetic over structured data and often use MPI and OpenMP to scale across nodes and cores. AI and machine learning jobs, especially deep learning, also perform large amounts of floating point arithmetic, but with some distinctive characteristics.
Deep learning training typically involves repeated linear algebra operations, such as matrix multiplications and convolutions, over large batches of data. These operations map very efficiently to GPUs and other accelerators that provide high throughput for dense linear algebra. As a result, AI workloads tend to be limited less by latency of individual operations and more by aggregate throughput and data movement.
Training large models often requires distributing the model, the data, or both, across many GPUs and nodes. This fits naturally in an HPC cluster environment, with fast interconnects and job schedulers, but introduces new scaling patterns compared to traditional simulation codes. Hyperparameter optimization, neural architecture search, and ensemble training lead to many similar jobs, which can be scheduled in large job arrays. Inference at scale, in contrast, often requires many small, short jobs or interactive workloads.
AI training on large models and datasets is a genuine high performance computing workload. It can saturate entire GPU clusters, and a single training run can demand sustained petaflop to exaflop scale performance over days or weeks.
Parallelism Patterns for Deep Learning
Although the underlying hardware and primitives overlap with general HPC, AI frameworks use a set of characteristic parallelism patterns. Understanding these patterns is important when running such workloads on shared HPC resources.
Data parallelism replicates the model across multiple devices or processes and splits each input batch across them. Each replica computes gradients on its subset of data, then gradients are aggregated, often via an allreduce collective, and model parameters are synchronized. This pattern is straightforward to scale, but it requires efficient communication for the global gradient aggregation. Performance is sensitive to batch size, communication topology, and interconnect quality.
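As an illustration, the following sketch shows data parallel training with PyTorch DistributedDataParallel. It assumes one process per GPU started by a launcher such as torchrun, which provides LOCAL_RANK and the rendezvous environment variables; the model and random data are placeholders for a real training setup.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal data parallel sketch: one process per GPU, launched by torchrun.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    x = torch.randn(256, 1024, device="cuda")       # this rank's share of the global batch
    y = torch.randint(0, 10, (256,), device="cuda")
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()                                  # gradients are averaged via allreduce
    optimizer.step()                                 # every replica applies the same update
```

In a real run, a distributed sampler would assign each rank a disjoint subset of the dataset, and the global batch size would be the per rank batch size times the number of replicas.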
Model parallelism splits a single model across devices, rather than replicating it fully. Layers or portions of layers reside on different devices. Forward and backward passes must move activations and gradients across device boundaries. This approach is useful when the model is too large to fit in a single device memory, which is common for very large language models. It increases communication complexity and often requires careful partitioning of the network architecture.
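A minimal PyTorch sketch of this idea, assuming two GPUs are visible to the process, places the first part of a network on one device and the rest on another and moves activations across the boundary during the forward pass.

```python
import torch

# Minimal model parallel sketch across two GPUs.
class TwoDeviceModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096), torch.nn.ReLU()).to("cuda:0")
        self.part2 = torch.nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # activations cross the device boundary

model = TwoDeviceModel()
out = model(torch.randn(32, 1024))          # output lives on cuda:1
```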
Pipeline parallelism organizes layers into stages that are assigned to different devices. Microbatches of data flow through the stages like an assembly line. While one stage processes microbatch $k$, another stage can process microbatch $k+1$. This pattern can improve device utilization, but introduces pipeline bubbles and requires balancing the computational load across stages.
Hybrid parallelism combines data, model, and pipeline parallelism in one training setup, often using hierarchical process groups. For instance, a large model may be split across the GPUs within a node, pipeline parallelism may distribute groups of consecutive layers across sets of nodes, and data parallelism may replicate the whole arrangement to process more input data in parallel.
Many current deep learning frameworks implement these schemes internally, sometimes using MPI or NCCL over the same high performance interconnects that traditional HPC applications use. For an HPC user, this means that classic cluster scaling concepts such as process placement, affinity, and interconnect topology still matter, but they appear under new abstractions in AI libraries.
HPC Infrastructure for AI and ML
As AI usage has grown, many HPC centers have adapted their infrastructure to support these workloads. This has implications for hardware, software, and scheduling policies.
On the hardware side, modern HPC systems often dedicate partitions or nodes to GPU accelerators or specialized AI hardware. Typical configurations include several high end GPUs per node, high bandwidth GPU to GPU links, such as NVLink, and fast connections from GPUs to system memory and storage. Storage systems must sustain high read throughput, because AI training repeatedly accesses large datasets. Caching layers and data staging to local SSDs are common.
On the software side, HPC environments provide optimized versions of deep learning frameworks such as TensorFlow and PyTorch, along with supporting libraries like cuDNN, cuBLAS, and communication libraries. Environment modules and container technologies are widely used to manage complex AI software stacks that evolve rapidly. AI workloads tend to depend on Python ecosystems and require frequent updates, which contrasts with the more conservative software lifecycles of some traditional HPC codes.
On the scheduling side, AI jobs often request exclusive access to GPUs, large memory, and fast storage paths. Jobs may be interactive or require rapid turnaround for experimentation, which can conflict with long batch simulation runs. Schedulers may implement GPU aware policies, quality of service classes for interactive AI work, and fair share mechanisms to avoid one user or project monopolizing accelerators.
Integrating AI into Simulation Workflows
Beyond running AI workloads in isolation, one of the most important trends is the integration of AI and ML into classical HPC simulation workflows. This integration can happen at several points in the workflow.
One approach is to use ML as a surrogate model. A surrogate is trained to approximate the input output behavior of a costly simulation. Once trained, the surrogate can evaluate new inputs much faster than running the full simulation. This is attractive in scenarios such as optimization, design space exploration, and real time control, where many evaluations are required.
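The following sketch trains a small neural network surrogate in PyTorch. The random inputs and the synthetic target function stand in for a real design of experiments and the corresponding outputs of an expensive simulation.

```python
import torch

# Train a surrogate on precomputed (input parameters, simulation output) pairs.
X = torch.rand(1000, 4)                      # sampled simulation input parameters
Y = torch.sin(X.sum(dim=1, keepdim=True))    # placeholder for expensive simulation output

surrogate = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for epoch in range(500):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(surrogate(X), Y)
    loss.backward()
    optimizer.step()

# Once trained, evaluating the surrogate is cheap compared to the full simulation.
y_new = surrogate(torch.rand(10, 4))
```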
Another approach is to replace or augment parts of a numerical model. For example, an ML model may emulate an expensive subgrid scale physics model, a closure term, or a constitutive law, while the rest of the simulation remains physics based. This can reduce computational cost or enable more complex behavior than is available in conventional models, at the price of introducing learned components whose validity must be carefully tested.
ML can also be used for acceleration within solvers, for instance by predicting good initial guesses for iterative solvers, learning effective preconditioners, or guiding adaptivity decisions in mesh refinement. In these cases, the ML model does not replace the underlying mathematics but supports it.
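As a sketch of the initial guess idea, the untrained linear predictor below stands in for a model trained on pairs of right hand sides and solutions from earlier solves; SciPy's conjugate gradient solver then refines the learned guess.

```python
import numpy as np
import torch
from scipy.sparse.linalg import cg

# Learned initial guess for an iterative solver (placeholder, untrained predictor).
n = 200
A = (np.diag(2.0 * np.ones(n))
     + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1))          # 1D Laplacian, symmetric positive definite
b = np.random.rand(n)

predictor = torch.nn.Linear(n, n)             # stands in for a trained model
with torch.no_grad():
    x0 = predictor(torch.from_numpy(b).float()).double().numpy()

x, info = cg(A, b, x0=x0)                     # CG refines the learned guess
assert info == 0                              # info == 0 indicates convergence
```

A well trained predictor reduces the number of solver iterations, while the iterative method still guarantees the final accuracy.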
Finally, AI methods can operate downstream of the simulation, analyzing large amounts of output. Examples include feature detection in fluid flows, anomaly detection in large ensembles of runs, and dimensionality reduction of high dimensional snapshots for visualization or storage reduction.
Whenever machine learning is integrated into an HPC simulation, rigorous validation is required. A surrogate or ML component that is fast but inaccurate can invalidate scientific conclusions or design decisions.
Data Handling and I/O for AI on HPC
AI training typically consumes and produces large datasets. This changes how I/O and data management are handled on HPC systems.
Datasets for training and validation can reach terabyte or petabyte scales. Storing these on parallel filesystems is feasible, but training performance can suffer if each worker reads small files independently and repeatedly. Techniques such as sharding datasets into large binary files, prefetching and caching data locally, and using efficient file formats such as TFRecord or HDF5 help to reduce metadata pressure and improve bandwidth utilization.
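A common pattern is sketched below: samples are read from a single HDF5 shard through a PyTorch Dataset, and a DataLoader with several workers prefetches batches in the background. The file name and the internal layout, arrays named inputs and targets, are assumptions for this example.

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5Dataset(Dataset):
    """Reads samples from one HDF5 shard; the file is opened lazily per worker."""
    def __init__(self, path):
        self.path = path
        with h5py.File(path, "r") as f:
            self.length = f["inputs"].shape[0]
        self.file = None

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:                      # each worker opens its own handle
            self.file = h5py.File(self.path, "r")
        x = torch.from_numpy(self.file["inputs"][idx])
        y = torch.from_numpy(self.file["targets"][idx])
        return x, y

loader = DataLoader(H5Dataset("train_shard_000.h5"), batch_size=64,
                    num_workers=4, pin_memory=True, prefetch_factor=2)
```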
In an HPC context, data is often produced by simulation codes. One emerging pattern is to couple simulation and training so that the simulation generates training samples on the fly, which are consumed by an ML component. This reduces I/O to disk, but requires careful integration between the codes and the scheduler.
AI workflows may also generate large model checkpoints. These files can be tens or hundreds of gigabytes each. Frequent checkpointing for fault tolerance can stress the filesystem. Techniques such as incremental checkpoints, reduced precision storage, or model sharding can mitigate this, but need to be aligned with the center’s I/O policies.
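One simple mitigation is sketched below: floating point parameters are stored in FP16, which roughly halves checkpoint size. Whether this is acceptable depends on how the checkpoint is used, for example quick restarts versus long term archival.

```python
import torch

# Reduced precision checkpointing: store floating point tensors as FP16.
def save_checkpoint_fp16(model, optimizer, step, path):
    state = {
        "step": step,
        "model": {k: (v.half() if v.is_floating_point() else v)
                  for k, v in model.state_dict().items()},
        "optimizer": optimizer.state_dict(),       # optimizer state kept as is
    }
    torch.save(state, path)

def load_checkpoint_fp16(model, path):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict({k: (v.float() if v.is_floating_point() else v)
                           for k, v in state["model"].items()})
    return state["step"]
```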
Data locality, both in memory and across the cluster, remains a key performance factor. Placing data close to the GPUs that use it, possibly through burst buffers or node local storage, can significantly improve throughput. This mirrors similar concerns in simulation I/O, but the access patterns and file organization can differ.
Performance and Scaling Considerations
Although AI workloads use highly optimized kernels, achieving good utilization on an HPC system is not automatic. Several performance and scaling issues are characteristic of AI in HPC.
The arithmetic intensity of deep learning operations is often high, which matches GPU capabilities, but scaling across nodes introduces communication overheads. In data parallel training, the time for gradient synchronization via allreduce often grows with the number of workers. If communication is not carefully overlapped with computation, or if the interconnect is underutilized, scaling efficiency deteriorates.
Batch size plays a key role. Larger batch sizes can increase hardware utilization and reduce communication per sample, but may affect convergence and model quality. There is often a practical upper bound on batch size beyond which accuracy degrades, or optimization becomes unstable. Tuning batch size, learning rate, and gradient accumulation is therefore part of the performance engineering process.
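Gradient accumulation, sketched below with a placeholder model and random data, increases the effective batch size without increasing per step memory: gradients from several micro batches are summed before the optimizer is applied.

```python
import torch

model = torch.nn.Linear(512, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4                            # effective batch = 4 x 64 samples

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(64, 512)
    y = torch.randint(0, 10, (64,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accumulation_steps).backward()        # average over the accumulation window
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # one update per effective batch
        optimizer.zero_grad()
```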
Mixed precision training, which uses lower precision formats such as FP16 or BF16 for most computations, is now common. This reduces memory and bandwidth demands and can greatly increase throughput on modern GPUs that have specialized tensor cores. However, some operations still require higher precision, and loss scaling techniques are used to maintain numerical stability.
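A minimal PyTorch sketch of mixed precision training with automatic loss scaling is shown below; it assumes a CUDA device and uses a toy model and random data in place of a real workload.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling

for step in range(100):
    x = torch.randn(256, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # run most operations in reduced precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()                 # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)                        # unscales gradients, skips the step on inf/nan
    scaler.update()
```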
Resource fragmentation is a concern in multi user environments. If jobs request a single GPU per node, large nodes with many GPUs may become partially occupied and unusable for larger jobs. Conversely, jobs that span many nodes may be blocked if the scheduler cannot find enough contiguous GPUs. Good job sizing and awareness of cluster topology help to maintain overall throughput.
Software Ecosystems and Programming Models
AI and machine learning in HPC introduce additional programming models on top of the traditional MPI, OpenMP, and CUDA or OpenACC paradigms. Deep learning frameworks encapsulate many of the low level details of GPU programming and distributed communication.
Most frameworks expose a high level Python API that is popular among AI practitioners. On HPC systems, this often runs atop compiled backends in C and C++ that call GPU libraries directly. This layered design makes integration with compiled simulation codes possible. For example, a simulation may call into an ML model through C bindings, or share memory buffers with an ML framework to avoid extra copies.
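The sketch below illustrates zero copy sharing of a buffer between NumPy and PyTorch; the array stands in for a field that a compiled simulation code exposes to Python through its own bindings.

```python
import numpy as np
import torch

field = np.zeros((256, 256), dtype=np.float32)   # placeholder for a simulation owned buffer

tensor = torch.from_numpy(field)                 # shares memory, no copy is made
tensor += 1.0                                    # in place update is visible on both sides
assert field[0, 0] == 1.0
```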
For distributed training, frameworks use a variety of communication backends, including MPI, NCCL, and Gloo. When running on an HPC cluster, these backends must be configured to use the high performance interconnect. This may require particular MPI implementations or environment settings.
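A sketch of such a configuration is shown below. It derives rank information from Slurm environment variables and selects NCCL when GPUs are available; the exact variables, rendezvous settings, and interconnect related options depend on the site and the launcher.

```python
import os
import torch
import torch.distributed as dist

# Initialize a distributed backend from Slurm provided environment variables.
rank = int(os.environ.get("SLURM_PROCID", 0))
world_size = int(os.environ.get("SLURM_NTASKS", 1))
local_rank = int(os.environ.get("SLURM_LOCALID", 0))

os.environ.setdefault("MASTER_ADDR", "localhost")   # in practice: the first node of the job
os.environ.setdefault("MASTER_PORT", "29500")

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)               # bind this rank to one local GPU
```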
Automatic differentiation is a central feature of many ML frameworks. It allows derivatives of complex composite functions to be computed automatically. In hybrid HPC applications, this can be exploited to compute sensitivities or gradients of simulation outputs with respect to parameters, without manually deriving and implementing adjoint codes.
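The sketch below shows the basic mechanism with PyTorch autograd; the function f stands in for a differentiable surrogate or learned component whose sensitivity with respect to its input parameters is needed.

```python
import torch

def f(p):
    # Placeholder for a differentiable model or composed computation.
    return p[0] ** 2 * torch.sin(p[1]) + p[2]

params = torch.tensor([1.5, 0.3, 2.0], requires_grad=True)
out = f(params)
out.backward()
print(params.grad)        # d(out)/d(params), computed by reverse mode autodiff
```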
Container technologies are especially important for AI workloads, which often depend on fast moving CUDA, driver, and Python ecosystems. Many HPC centers support containers that can encapsulate deep learning software stacks and run them efficiently with GPU and interconnect access. This reduces conflicts between the needs of AI users and the stability requirements of long lived simulation codes.
Scientific and Engineering Use Cases
AI and machine learning in HPC are not limited to commercial applications such as recommendation systems. They increasingly support scientific and engineering tasks that traditionally relied solely on simulations.
In climate and weather prediction, deep learning models are used to emulate unresolved physics, to correct biases in forecasts, and to downscale coarse predictions to finer spatial resolution. These models often train on large archives of simulation and observational data. Training and inference for global scale forecasts at high resolution can require large HPC resources.
In materials science and chemistry, ML potentials approximate quantum mechanical calculations, making it possible to run molecular dynamics simulations over longer time scales and larger systems than would be possible with ab initio methods. Training such potentials requires large datasets of reference calculations, which themselves come from HPC simulations.
In engineering design, such as aerodynamics or structural optimization, surrogates learned from samples of high fidelity simulations enable faster exploration of design spaces. Bayesian optimization and reinforcement learning approaches can guide the sampling process, balancing exploration and exploitation.
In experimental sciences, such as high energy physics and astronomy, AI methods process large data streams from detectors and telescopes. While some of this processing happens in real time near the instruments, HPC centers are used for large scale offline reconstruction, simulation of detector responses, and training of sophisticated pattern recognition models.
Reliability, Validation, and Trust
The adoption of AI in HPC science and engineering raises issues of reliability, verification, and trust. Traditional simulations are grounded in explicit mathematical models and conservation laws, which can be analyzed and tested in well established ways. Learned models introduce additional sources of uncertainty.
Generalization is a central concern. A machine learning model that performs well on data similar to its training set may fail in regimes not covered by that data. In a simulation context, this can correspond to physical conditions that lie outside the sampled parameter space. Extrapolation errors can be large and difficult to detect without careful testing.
Uncertainty quantification techniques are therefore important. Approaches include ensembles of models, Bayesian neural networks, and methods that estimate epistemic and aleatoric uncertainty. Incorporating these estimates into downstream decisions can make AI augmented workflows more robust.
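As a simple illustration, the sketch below uses the spread across an ensemble of models as an epistemic uncertainty proxy; the untrained members stand in for independently trained networks.

```python
import torch

models = [torch.nn.Linear(4, 1) for _ in range(5)]   # stand-ins for trained ensemble members

def ensemble_predict(models, x):
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])  # shape: (members, batch, outputs)
    return preds.mean(dim=0), preds.std(dim=0)       # prediction and uncertainty proxy

mean, std = ensemble_predict(models, torch.rand(16, 4))
```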
Interpretability is another issue. Many deep models are difficult to interpret directly, which can conflict with the scientific need to understand mechanisms rather than just predict outcomes. Techniques for feature attribution, sensitivity analysis, and reduced representations can help, but they do not replace physical reasoning.
Reproducibility is also affected. Training outcomes can depend on random initializations, non deterministic GPU operations, and subtle software and hardware details. In an HPC context, where runs are expensive, it is essential to control and log random seeds, software versions, and environment settings as part of a reproducible workflow.
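A minimal sketch of such bookkeeping is shown below: seeds are fixed across libraries, deterministic kernels are requested where available, and key software versions are recorded alongside the run.

```python
import json
import platform
import random

import numpy as np
import torch

def set_seed(seed: int = 1234):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels; may cost performance and is not always available.
    torch.use_deterministic_algorithms(True, warn_only=True)

def log_environment(path: str = "run_metadata.json"):
    meta = {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
```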
When AI and ML influence scientific conclusions or engineering designs, validation against trusted baselines and physical constraints is mandatory. Speed gains must never come at the expense of unrecognized systematic errors.
Looking Ahead: Coevolution of AI and HPC
AI and HPC are coevolving. Advances in AI drive demand for larger, more heterogeneous HPC systems, while advances in HPC enable ever more ambitious AI models and applications. Several trends are visible.
Model sizes and dataset scales continue to grow, pushing toward exascale and beyond in effective training compute. This motivates new hardware designs that combine general purpose CPUs, GPUs, and specialized accelerators. Interconnect architectures and memory systems are adapting to this mixture.
At the algorithm level, there is interest in combining data driven and physics informed approaches. Physics informed neural networks and other hybrid models attempt to encode known laws into learning architectures. These approaches can reduce data requirements and improve extrapolation but pose new challenges for optimization and implementation at scale.
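As a small illustration of the idea, the sketch below builds a physics informed loss for the toy equation du/dx = -u with u(0) = 1: a residual term is evaluated at collocation points with automatic differentiation and combined with a boundary condition term. The network and the collocation points are illustrative, not a tuned setup.

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x = torch.linspace(0.0, 2.0, 64).reshape(-1, 1).requires_grad_(True)

u = net(x)
du_dx, = torch.autograd.grad(u.sum(), x, create_graph=True)
residual = du_dx + u                         # enforce du/dx = -u at collocation points
bc = net(torch.zeros(1, 1)) - 1.0            # enforce u(0) = 1
loss = (residual ** 2).mean() + (bc ** 2).mean()
loss.backward()                              # gradients for an optimizer step
```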
Workflow automation using AI is emerging. For instance, AI agents can help select simulation parameters, manage ensembles of runs, or guide adaptive mesh refinement toward regions of interest. These meta level uses of AI can change how scientists interact with HPC systems.
Finally, AI techniques themselves are being applied to system level HPC tasks, such as optimizing job scheduling, predicting node failures, and tuning system parameters. In this sense, AI is not only a workload but also a tool to improve the efficiency and reliability of HPC infrastructure.
For beginners in HPC, it is increasingly important to have a basic understanding of AI workflows, software stacks, and their interaction with traditional simulation tools. The boundaries between simulation, data analysis, and machine learning are becoming more fluid, and future HPC practice will routinely involve elements of all three.