AI and machine learning in HPC

How AI and Machine Learning Intersect with HPC

AI and machine learning (ML) are no longer separate from HPC: they are now major drivers of hardware design, software ecosystems, and workflows. This chapter covers both directions of that relationship: how AI/ML workloads use HPC, and how AI/ML methods help improve HPC itself.

Why AI/ML Need HPC

Modern AI models, especially deep neural networks (DNNs), are computationally intensive:

As a result:

HPC systems provide:

This makes HPC the natural environment for large‑scale AI.

AI Workloads on HPC Systems

AI/ML on HPC typically appears in several patterns.

1. Large‑Scale Deep Learning Training

Key characteristics:

Common techniques:

HPC‑specific issues:
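
One widely used technique for distributed training is data parallelism: each worker holds a full copy of the model, computes gradients on its own shard of the batch, and the gradients are averaged across workers before the update. The sketch below simulates that idea in a single process, assuming a one-parameter linear model and equal-sized shards. The function names (`local_gradient`, `all_reduce_mean`, `data_parallel_step`) are illustrative, and `all_reduce_mean` is only a stand-in for a real collective from MPI or NCCL.

```python
# Data-parallel gradient averaging, simulated in one process (no MPI/NCCL).
# Model: y = w * x, loss = mean squared error. All names are illustrative.

def local_gradient(w, shard):
    """Gradient of mean((w*x - y)^2) over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for a collective all-reduce: every worker gets the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, shards, lr):
    grads = [local_gradient(w, s) for s in shards]  # computed independently
    synced = all_reduce_mean(grads)                 # communication step
    return w - lr * synced[0]                       # same update on every worker


data = [(float(x), 3.0 * x) for x in range(1, 9)]   # true weight is 3.0
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]  # equal-sized shards

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so the distributed run follows the same trajectory as a single-worker run. The cost is one collective communication per step, which is why interconnect performance matters at scale.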

2. AI‑Assisted Simulation and Surrogates

In traditional HPC, simulations (CFD, climate, molecular dynamics, etc.) can be extremely expensive. AI adds new patterns:

Impact on HPC:
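
The surrogate idea can be sketched in a few lines: run the expensive code a limited number of times, fit a cheap approximation to those samples, then query the approximation instead. This is a minimal sketch under stated assumptions. `expensive_simulation` is a hypothetical stand-in for a real solver, and a piecewise-linear interpolant is swapped in for the trained neural network that a production surrogate would typically use.

```python
import math
from bisect import bisect_right


def expensive_simulation(x):
    """Hypothetical stand-in for a costly solver run."""
    return math.sin(x) + 0.1 * x * x


def build_surrogate(xs, ys):
    """Return a piecewise-linear interpolant through the samples (xs sorted)."""
    def surrogate(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("surrogate is only valid inside the sampled range")
        i = min(bisect_right(xs, x), len(xs) - 1)  # index of right-hand sample
        x0, x1 = xs[i - 1], xs[i]
        t = (x - x0) / (x1 - x0)
        return (1.0 - t) * ys[i - 1] + t * ys[i]
    return surrogate


# Offline phase: a limited number of expensive runs.
train_x = [i * 0.25 for i in range(17)]  # 0.0 .. 4.0
train_y = [expensive_simulation(x) for x in train_x]
surrogate = build_surrogate(train_x, train_y)

# Online phase: cheap evaluations, checked against the reference.
test_x = [0.1 + 0.37 * k for k in range(10)]  # points not in the training set
max_err = max(abs(surrogate(x) - expensive_simulation(x)) for x in test_x)
```

The explicit range check reflects a general caveat: a surrogate is only trustworthy inside the region covered by its training data, so queries outside it should fall back to the full simulation.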

3. AI for Data Analysis, Post‑Processing, and Inference

HPC simulations and experiments generate massive outputs. AI helps to:

Inference patterns on HPC:

4. AI for Experimental Facilities

Large scientific facilities (synchrotrons, telescopes, particle accelerators) often have direct ties to HPC centers:

This is part of a broader trend where HPC, AI, and experimental science form a single workflow.

HPC Architectures Shaped by AI

AI workloads have influenced how HPC systems are designed.

GPU‑Heavy and Accelerator‑Rich Systems

Trends:

Consequences for HPC:

Interconnects Tuned for AI

AI training is communication‑heavy (especially with large models):

As a result:
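
To make the communication cost concrete, the sketch below simulates ring all-reduce, one widely used algorithm for summing gradients across workers. It runs in a single process with plain lists and is only an illustration: the function name `ring_all_reduce` is hypothetical, and real libraries pick among several algorithms depending on topology and message size.

```python
def ring_all_reduce(buffers):
    """Sum equal-length vectors across workers using the ring algorithm.

    buffers[r] is worker r's vector. Returns (reduced buffers,
    number of elements each worker sent).
    """
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("vector length must be divisible by the worker count")
    chunk = size // n
    bufs = [list(b) for b in buffers]
    sent = 0

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. Afterwards worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [((r - step) % n, None) for r in range(n)]
        msgs = [(c, [bufs[r][i] for i in span(c)])
                for r, (c, _) in enumerate(msgs)]
        for r in range(n):
            c, payload = msgs[(r - 1) % n]  # receive from left neighbour
            for i, v in zip(span(c), payload):
                bufs[r][i] += v
        sent += chunk

    # Phase 2: all-gather. Completed chunks travel once around the ring.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, None) for r in range(n)]
        msgs = [(c, [bufs[r][i] for i in span(c)])
                for r, (c, _) in enumerate(msgs)]
        for r in range(n):
            c, payload = msgs[(r - 1) % n]
            for i, v in zip(span(c), payload):
                bufs[r][i] = v
        sent += chunk

    return bufs, sent


workers = [[r + 1] * 8 for r in range(4)]  # worker r holds the value r + 1
reduced, sent_per_worker = ring_all_reduce(workers)
```

Each worker sends 2(n - 1) chunks of size S/n, so the per-worker traffic is 2(n - 1)/n × S elements and stays close to 2S regardless of worker count. The number of steps, however, grows with n, which is one reason both bandwidth and latency of the interconnect matter for large training jobs.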

Storage and I/O Considerations

AI places different demands on storage than some traditional HPC workloads:

Responses in HPC design:

Software Ecosystem: From MPI to AI Frameworks

HPC and AI software stacks are converging.

AI Frameworks as First‑Class HPC Citizens

Common frameworks:

On HPC systems:

Implications:

Hybrid MPI + AI Runtimes

Under the hood:

Patterns:

AI to Improve HPC Operations

AI is not only a workload on HPC; it is also used to manage and optimize the HPC systems themselves.

Intelligent Scheduling and Resource Management

Potential applications:

Benefits:

Energy‑Aware and Thermal‑Aware Management

AI can help reduce energy costs and environmental impact:

These techniques support sustainability goals discussed elsewhere in the course.

Failure Prediction and System Health

Large HPC systems experience hardware and software failures:

This contributes to higher effective availability and fewer job crashes.

New Algorithmic Directions at the AI–HPC Interface

Several emerging techniques blend AI methods with traditional numerical or simulation approaches.

Physics‑Informed and Operator‑Learning Methods

Examples:

Relevance to HPC:
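
As a sketch of what "physics-informed" usually means, consider a network $u_\theta$ trained to approximate the solution of a PDE written as $\mathcal{N}[u](x) = 0$. A common formulation, assumed here for illustration, combines a data-fitting term with a penalty on the PDE residual at collocation points. The symbols $N_d$, $N_r$, $\tilde{x}_j$, and the weight $\lambda$ are notation introduced for this example.

```latex
\mathcal{L}(\theta)
  = \frac{1}{N_d} \sum_{i=1}^{N_d} \bigl| u_\theta(x_i) - u_i \bigr|^2
  + \lambda \, \frac{1}{N_r} \sum_{j=1}^{N_r}
      \bigl| \mathcal{N}[u_\theta](\tilde{x}_j) \bigr|^2
```

The first sum measures the mismatch with $N_d$ observed or simulated values $u_i$; the second measures how strongly the network violates the governing equation at $N_r$ collocation points. Evaluating the residual requires derivatives of $u_\theta$ with respect to its inputs, which is typically done with automatic differentiation.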

AI‑Driven Optimization and Control

In many HPC applications, the goal is to optimize something, not just simulate:

AI contributes by:

These methods typically run the simulations on HPC resources and couple them tightly with the learning loop.

Practical Considerations for AI on HPC

When using AI on HPC systems, users need to adapt to both worlds.

Workflow and Job Design

Key aspects:

Hyperparameter searches:
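
Hyperparameter searches map naturally onto job arrays, because each trial is independent. The sketch below assumes a Slurm-style scheduler that exposes the array index through the `SLURM_ARRAY_TASK_ID` environment variable. The search space, its values, and the helper names (`all_configs`, `config_for_task`) are hypothetical.

```python
import itertools
import os

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [64, 256],
    "weight_decay": [0.0, 0.01],
}


def all_configs(space):
    """Enumerate the full grid in a fixed, reproducible order."""
    keys = sorted(space)
    return [dict(zip(keys, values))
            for values in itertools.product(*(space[k] for k in keys))]


def config_for_task(task_id, space=SEARCH_SPACE):
    """Map a job-array index to exactly one configuration."""
    configs = all_configs(space)
    if not 0 <= task_id < len(configs):
        raise IndexError(f"task_id {task_id} outside 0..{len(configs) - 1}")
    return configs[task_id]


if __name__ == "__main__":
    # Falls back to task 0 when run outside a job array.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(task_id, config_for_task(task_id))
```

With this grid of 12 configurations, an array covering indices 0 to 11 (for example `sbatch --array=0-11` on Slurm) runs one trial per array task. Sorting the keys keeps the index-to-configuration mapping stable across runs, so results can be matched to their settings afterwards.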

Data Management for AI on HPC

Common challenges:

Typical approaches:
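
One common approach, assumed here for illustration, is to store the dataset as many shard files and let each training process read a disjoint subset, so that processes do not all hit the same files on the shared filesystem. The sketch below shows a round-robin assignment; the file names and the helper `shard_for_rank` are hypothetical.

```python
def shard_for_rank(files, rank, world_size):
    """Return the files assigned to `rank` by round-robin partitioning."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Sorting first ensures every process derives the same assignment.
    return sorted(files)[rank::world_size]


files = [f"train-{i:05d}.bin" for i in range(10)]  # hypothetical shard names
assignment = {r: shard_for_rank(files, r, 4) for r in range(4)}
```

Because every process computes its own list from the same sorted input, no coordination or communication is needed to agree on the split. When the file count is not a multiple of the process count, some ranks receive one file more than others, which a real pipeline would need to account for.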

Emerging Trends and Future Directions

AI and HPC are co‑evolving. Several future‑oriented trends include:

Understanding these interactions prepares you to design workflows and applications that fully exploit both AI methods and HPC infrastructures, and to adapt as the combined field continues to advance.
