Key Concepts in Hybrid HPC–Cloud Workflows
Hybrid HPC–cloud workflows combine traditional on-premises HPC resources (clusters, supercomputers, specialized storage) with cloud and OpenShift-based environments. The goal is to use each environment for what it does best while presenting users with a coherent way to run, move, and scale workloads.
In this context, OpenShift often acts as:
- A cloud-native execution environment for parts of the workflow.
- A unifying control plane across multiple infrastructures.
- A bridge between batch-oriented HPC systems and elastic cloud resources.
This chapter focuses on how OpenShift participates in hybrid setups, not on basic HPC or OpenShift concepts.
When to Use Hybrid HPC–Cloud Approaches
Typical reasons to combine HPC and cloud with OpenShift include:
- Bursting: Temporarily offload extra work from an over-subscribed on-prem HPC cluster to cloud-based OpenShift clusters.
- Pre- and post-processing: Run data preparation, meshing, feature engineering, or visualization on OpenShift, while the main simulation/solver runs on a traditional HPC cluster.
- Heterogeneous accelerators: Use specialized GPUs or accelerators (e.g., in the cloud) for parts of workflows (AI/ML, visualization) that complement CPU-bound HPC computations.
- Data-centric workflows: Combine HPC simulations with cloud-native data lakes, streaming systems, or analytics stacks running on OpenShift.
- Geo-distributed teams: Researchers use OpenShift-hosted portals, notebooks, and UIs to orchestrate or submit jobs to remote HPC centers.
The common thread is that some steps are better suited to batch schedulers and tightly coupled MPI jobs, while others benefit from OpenShift’s elasticity and rich ecosystem.
Common Hybrid Workflow Patterns
1. Pre-Processing on OpenShift, Simulation on HPC
Many HPC simulations require significant pre-processing:
- Generating input parameter sweeps
- Meshing geometries
- Cleaning and transforming datasets
Pattern:
- User interface / orchestration on OpenShift
- Web apps, Jupyter notebooks, or REST APIs run on OpenShift.
- Users define experiments, parameter sets, or input conditions.
- Input generation on OpenShift
- Containerized tools generate input decks, mesh files, or job arrays.
- Outputs are written to storage accessible to the HPC system (e.g., a shared file system, object store, or a synchronized data location).
- Job submission to HPC
- An OpenShift-based service or Operator interacts with the HPC scheduler (e.g., Slurm, PBS, LSF) via SSH, REST, or scheduler APIs (a submission sketch follows this list).
- Job metadata and status are tracked in OpenShift (e.g., via custom resources).
- Simulation runs on the HPC cluster
- The HPC system runs parallel MPI or tightly coupled workloads on bare metal.
- Results returned
- Result files are written back to the shared or synchronized storage for later consumption in OpenShift (post-processing).
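To make the submission step concrete, here is a minimal Python sketch of a service that stages an input deck over SSH and submits it to Slurm. The host name, paths, and job script are illustrative assumptions; a production service would add error handling and credential management (see "Security and identity" later in this chapter).

```python
# Minimal sketch of a submission step: copy an input deck to the HPC
# system and submit it to Slurm over SSH. Host, paths, and the job
# script are illustrative assumptions, not a fixed interface.
import subprocess

HPC_HOST = "login.hpc.example.org"     # assumed HPC login node
REMOTE_DIR = "/scratch/project/run42"  # assumed shared scratch path

def submit_job(local_input: str, job_script: str) -> str:
    """Stage an input file and submit a Slurm job; return the job ID."""
    # Stage the generated input deck to storage the HPC side can see.
    subprocess.run(
        ["scp", local_input, f"{HPC_HOST}:{REMOTE_DIR}/"],
        check=True,
    )
    # 'sbatch --parsable' prints only the job ID on success.
    result = subprocess.run(
        ["ssh", HPC_HOST, f"cd {REMOTE_DIR} && sbatch --parsable {job_script}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

job_id = submit_job("inputs/case001.inp", "run_case.sh")
print(f"Submitted Slurm job {job_id}")
```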
Advantages:
- OpenShift handles user-facing components and scales them independently.
- HPC resources are reserved for the parts that truly need them.
2. Post-Processing and Analytics on OpenShift
Large simulations often produce massive outputs that require analysis or visualization.
Pattern:
- Simulation completes on HPC
- Results are stored in a file system or object store.
- Data exposure to OpenShift
- Storage is:
- Mounted directly into OpenShift via CSI drivers, or
- Replicated/synchronized to a cloud object store accessed from OpenShift.
- Post-processing on OpenShift
- Containerized analytics tools (Python, R, Spark, Dask, AI/ML frameworks) run on OpenShift using multiple pods.
- Work is parallelized using cloud-native patterns (e.g., job queues, distributed dataframes, parallel map-reduce); a fan-out sketch follows this list.
- Visualization and sharing
- Dashboards, notebooks, or custom portals on OpenShift render results.
- Access control and multi-tenant sharing are handled by OpenShift’s RBAC and namespaces.
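As one illustration of the fan-out step, the following sketch reduces a directory of result files to summary records in parallel. The paths and the per-file reduction are assumptions about the simulation's output format; the same pattern scales out across pods when the local pool is replaced by a work queue.

```python
# Minimal sketch of fan-out post-processing: each result file is reduced
# to summary statistics in parallel. Paths and the reduction are
# illustrative assumptions about the simulation output format.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import json

RESULTS_DIR = Path("/data/results")  # assumed mount of shared/replicated storage

def summarize(result_file: Path) -> dict:
    """Reduce one result file (assumed one numeric value per line) to a summary."""
    values = [float(line) for line in result_file.read_text().splitlines()
              if line.strip()]
    return {
        "case": result_file.stem,
        "n": len(values),
        "mean": sum(values) / len(values) if values else None,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }

if __name__ == "__main__":
    files = sorted(RESULTS_DIR.glob("case*.out"))
    # Each worker process handles a subset of files; on OpenShift the same
    # fan-out can be spread across pods via a work queue instead.
    with ProcessPoolExecutor() as pool:
        summaries = list(pool.map(summarize, files))
    print(json.dumps(summaries, indent=2))
```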
Advantages:
- Data analysis workloads benefit from elastic scaling and modern data platforms.
- Researchers use browser-based tools instead of logging into the HPC head node.
3. Cloud Bursting from HPC to OpenShift
Cloud bursting adds temporary capacity when the HPC queue is too long or demand spikes.
Pattern:
- Primary scheduling on on-prem HPC
- HPC scheduler remains the authoritative system.
- Bursting triggers
- Policies define when to burst (a trigger check is sketched after this pattern):
- Queue length thresholds
- Wait time thresholds
- Specific job tags or partitions
- Launching capacity on OpenShift
- OpenShift clusters (in the cloud or in additional data centers) are scaled up or created on demand.
- HPC job definitions are translated into containerized jobs (e.g., using job templates and container images built from the same codes).
- Execution on OpenShift
- Jobs that meet bursting criteria are submitted as:
- Job/CronJob resources in Kubernetes/OpenShift.
- Specialized HPC job abstractions if you use HPC-oriented Operators.
- Data staging
- Inputs are moved between HPC storage and OpenShift storage (or both share a common backend).
- Outputs are returned or archived.
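A minimal sketch of a bursting trigger, assuming a Slurm partition watched over SSH and an illustrative pending-job threshold:

```python
# Minimal sketch of a bursting trigger: count pending jobs in a Slurm
# partition and decide whether to burst. The threshold, partition name,
# and SSH hop are illustrative assumptions.
import subprocess

HPC_HOST = "login.hpc.example.org"  # assumed HPC login node
PARTITION = "standard"              # assumed partition to watch
MAX_PENDING = 50                    # assumed policy threshold

def pending_jobs() -> int:
    """Count jobs pending in the watched partition."""
    # 'squeue -h' suppresses the header; '-t PD' selects pending jobs.
    out = subprocess.run(
        ["ssh", HPC_HOST, f"squeue -h -t PD -p {PARTITION} -o %i"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(out.splitlines())

def should_burst() -> bool:
    return pending_jobs() > MAX_PENDING

if should_burst():
    print("Queue over threshold: route eligible jobs to OpenShift")
```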
Considerations:
- Ensuring binary compatibility (compilers, MPI, libraries) between the HPC environment and the container images.
- Cost management and quota policies for cloud bursting.
- Job placement rules to keep tightly coupled MPI work on HPC, and embarrassingly parallel tasks on OpenShift.
4. End-to-End Scientific Pipelines Across Environments
Hybrid workflows often resemble multi-step pipelines where different steps run where they fit best.
Example high-level pipeline:
- Ingest raw data on OpenShift
- Data arrives via APIs, message queues, or data transfers into an OpenShift-hosted service.
- Pre-processing (OpenShift)
- Clean and reshape data, prepare multiple scenarios, generate configuration sets.
- Stage data to HPC
- Transfer relevant inputs to the HPC center’s high-performance storage.
- Run core simulation on HPC
- Possibly using batch arrays, MPI, and accelerators.
- Stage results back to OpenShift
- Either selective result subsets or entire datasets, depending on data size and cost.
- Analytics and AI/ML (OpenShift)
- Build surrogate models, perform parameter studies or uncertainty quantification using cloud-native compute.
- Publish and archive (OpenShift)
- Store derived datasets and models in object stores, catalogs, or data portals; expose APIs for downstream consumers.
OpenShift-native tools (like CI/CD pipelines, Operators, and event-driven workflows) can orchestrate this pipeline, providing reproducibility and automation across both environments.
Data Movement and Storage Strategies
Data locality is a key challenge in hybrid HPC–cloud workflows. Strategies must balance performance, cost, and complexity.
Shared Storage vs Data Replication
- Directly shared storage
- The HPC parallel file system is exported to OpenShift, or both environments use a shared object storage platform.
- Simplifies paths and references: jobs on both sides see the same paths/buckets.
- Often limited by network bandwidth and latency, security requirements, and cross-site SLAs.
- Replicated / synchronized data
- Data is synchronized between HPC storage and cloud storage, either periodically or in response to events (e.g., via rsync, rclone, object storage replication, or data mover tools).
- More robust to network interruptions; can be optimized to move only what's needed.
- Requires careful versioning and metadata management to avoid confusion.
- On-demand staging
- Jobs running on OpenShift use init containers or sidecars to fetch only the required inputs from HPC storage (or vice versa) and to push results back after execution (a staging sketch follows this list).
- Can be orchestrated using Kubernetes Jobs, PersistentVolumeClaims, and container scripts.
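A minimal sketch of the on-demand staging step as it might run in an init container, assuming inputs live in an S3-compatible object store; the bucket, prefix, and endpoint variable are illustrative:

```python
# Minimal sketch of an init-container staging step: pull required inputs
# from an object store into the pod's working volume. Bucket, prefix, and
# endpoint are illustrative assumptions.
import os
import boto3

BUCKET = "simulation-data"                          # assumed bucket name
PREFIX = os.environ.get("CASE_PREFIX", "case001/")  # which case to stage
DEST = "/workspace/inputs"                          # assumed emptyDir/PVC mount

s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT"))

os.makedirs(DEST, exist_ok=True)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        target = os.path.join(DEST, os.path.basename(obj["Key"]))
        s3.download_file(BUCKET, obj["Key"], target)
        print(f"staged {obj['Key']} -> {target}")
```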
Using OpenShift Storage Constructs
While the underlying storage mechanisms may be external, OpenShift concepts help manage how workloads see that storage:
- Use PersistentVolume and PersistentVolumeClaim abstractions to expose external HPC-capable storage into OpenShift pods (a PVC sketch appears at the end of this subsection).
- Use StorageClasses and dynamic provisioning where appropriate for cloud-native parts of the workflow.
- Combine object storage (for raw and bulk data) with block/file storage (for scratch and low-latency working sets).
In hybrid workflows, it’s common to separate:
- Scratch / intermediate data (often kept close to the compute: HPC scratch or OpenShift local/fast storage).
- Long-term / shared results (object storage or replicated file systems across environments).
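For the OpenShift side, here is a minimal sketch of requesting such storage with the Python Kubernetes client; the StorageClass name, size, and namespace are illustrative assumptions:

```python
# Minimal sketch: create a PersistentVolumeClaim against a StorageClass
# that fronts external HPC-capable storage. The class name, size, and
# namespace are illustrative assumptions; in-cluster config is assumed.
from kubernetes import client, config

config.load_incluster_config()  # or load_kube_config() outside the cluster

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-results"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],      # shared across pods
        storage_class_name="hpc-shared-fs",  # assumed CSI-backed class
        resources=client.V1ResourceRequirements(
            requests={"storage": "500Gi"}
        ),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="hybrid-workflows", body=pvc
)
```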
Orchestration Patterns with OpenShift
OpenShift can act as the overarching orchestrator for hybrid flows, coordinating HPC and cloud tasks.
Using OpenShift Pipelines / Workflow Engines
- Pipelines (e.g., Tekton)
- Represent each step in a hybrid workflow as a Task:
- Pre-process task (OpenShift pod).
- Submit-to-HPC task (small service container).
- Poll-HPC-status task (sketched at the end of this subsection).
- Post-process task (OpenShift pod).
- Pass artifacts via storage or object stores.
- Apply version control to pipeline definitions for reproducibility.
- Workflow engines
- Engines like Argo Workflows, Nextflow, or CWL-based systems can run on OpenShift and call out to HPC systems.
- Provide DAG-based workflow definitions, retries, and provenance tracking.
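As an example of the Poll-HPC-status task, here is a minimal sketch of the logic its container might run, assuming Slurm accounting is reachable over SSH:

```python
# Minimal sketch of a Poll-HPC-status pipeline task: query Slurm for a
# job's state until it reaches a terminal state. The SSH hop and use of
# 'sacct' are illustrative assumptions.
import subprocess
import sys
import time

HPC_HOST = "login.hpc.example.org"  # assumed HPC login node
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}

def job_state(job_id: str) -> str:
    # 'sacct -n -X' prints one line per job allocation without a header.
    out = subprocess.run(
        ["ssh", HPC_HOST, f"sacct -n -X -j {job_id} -o State"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return out.split()[0] if out else "PENDING"

job_id = sys.argv[1]
while (state := job_state(job_id)) not in TERMINAL:
    print(f"job {job_id}: {state}")
    time.sleep(60)
print(f"job {job_id} finished with state {state}")
sys.exit(0 if state == "COMPLETED" else 1)
```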
Integrating with HPC Schedulers
Connecting OpenShift with traditional batch schedulers requires well-defined interfaces:
- Job submission services
- Containers on OpenShift run clients for Slurm, PBS, LSF, etc.
- REST APIs or Operators abstract scheduler-specific commands.
- Custom Resource Definitions (CRDs) can represent remote HPC jobs as Kubernetes objects.
- Status and lifecycle sync
- Periodic polling or event-driven updates from HPC to OpenShift:
- Map HPC states (PENDING, RUNNING, COMPLETED, FAILED) to Kubernetes-style conditions (a mapping sketch follows this list).
- Enables dashboards, automation, and alerts in OpenShift based on remote jobs.
- Security and identity
- The service that interacts with the HPC scheduler uses appropriate credentials (SSH keys, Kerberos tickets, or federated identity).
- OpenShift RBAC controls who is allowed to trigger HPC jobs via the integration components.
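A minimal sketch of the status-sync piece, assuming a hypothetical HPCJob CRD (group hpc.example.com) with the status subresource enabled:

```python
# Minimal sketch of mapping HPC scheduler states onto Kubernetes-style
# conditions and patching them into a custom resource's status. The CRD
# group, version, and plural are illustrative assumptions.
from kubernetes import client, config

# Assumed mapping from scheduler states to condition type/status/reason.
STATE_TO_CONDITION = {
    "PENDING":   ("Ready", "False", "QueuedOnHPC"),
    "RUNNING":   ("Ready", "True", "RunningOnHPC"),
    "COMPLETED": ("Succeeded", "True", "HPCJobCompleted"),
    "FAILED":    ("Succeeded", "False", "HPCJobFailed"),
}

def sync_status(name: str, namespace: str, hpc_state: str) -> None:
    cond_type, status, reason = STATE_TO_CONDITION[hpc_state]
    patch = {"status": {"conditions": [
        {"type": cond_type, "status": status, "reason": reason}
    ]}}
    client.CustomObjectsApi().patch_namespaced_custom_object_status(
        group="hpc.example.com",  # assumed CRD group
        version="v1alpha1",       # assumed version
        namespace=namespace,
        plural="hpcjobs",         # assumed plural for the HPC job CRD
        name=name,
        body=patch,
    )

config.load_incluster_config()
sync_status("run42", "hybrid-workflows", "RUNNING")
```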
Designing Hybrid-Friendly Containers
To run the same scientific applications across HPC and OpenShift, container design must consider hybrid requirements:
- Build once, run many places
- Container images should encapsulate solvers, libraries, and dependencies in a portable way.
- For MPI workloads, images must align with the MPI stack and network fabric when running on HPC (this can be nuanced and may require matching system libraries or using HPC-specific container runtimes).
- Separation of configuration and data
- Use environment variables, ConfigMaps, and Secrets to adapt behavior by environment (HPC vs. cloud) without rebuilding images.
- Keep input data and configuration external to the image to avoid large image sizes.
- Awareness of resource constraints
- Containerized codes may assume a certain number of cores or memory; in OpenShift, those are governed by resource requests/limits and can differ from HPC node layouts.
- Provide flexible runtime configuration (e.g., read the number of processes/threads from the environment); a sketch follows this list.
- Licensing considerations
- Licensed commercial solvers might rely on hardware or network license servers.
- Containers running on OpenShift must still comply with license policies; license servers might live on-prem, in the cloud, or both.
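A minimal sketch of environment-driven runtime configuration; SLURM_NTASKS and OMP_NUM_THREADS are standard on the HPC side, while APP_NPROCS is an assumed application-specific fallback:

```python
# Minimal sketch of environment-driven runtime configuration: derive
# process/thread counts from whatever the environment provides, with
# sane fallbacks. APP_NPROCS is an assumed, application-specific name.
import os

def runtime_layout() -> tuple[int, int]:
    """Return (processes, threads) for this execution environment."""
    # On HPC, the scheduler typically exports these (e.g., Slurm).
    nprocs = int(os.environ.get("SLURM_NTASKS", "0")) or \
             int(os.environ.get("APP_NPROCS", "1"))
    # Note: on OpenShift, os.cpu_count() reports node cores, not the pod's
    # cgroup limit, so an explicit OMP_NUM_THREADS is preferable there.
    nthreads = int(os.environ.get("OMP_NUM_THREADS", "0")) or \
               (os.cpu_count() or 1)
    return nprocs, nthreads

procs, threads = runtime_layout()
print(f"launching with {procs} processes x {threads} threads")
```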
Operational Considerations and Governance
Hybrid workflows introduce additional complexities around governance, cost, and reliability.
Policy and Placement
Define clear policies for:
- Which workloads must run on HPC (e.g., tightly coupled MPI, codes requiring specific interconnects).
- Which can run on OpenShift (e.g., embarrassingly parallel tasks, post-processing, AI/ML).
- When and how bursting is allowed, including budget and quota rules.
OpenShift’s constructs (namespaces, quotas, limit ranges) can help enforce limits on cloud-side workloads.
Cost and Accounting
- On-prem HPC often uses allocation-based accounting (project hours and quotas), while cloud use is typically pay-as-you-go.
- You may need:
- Tagging of OpenShift workloads for cost attribution.
- Synchronization of accounting data between HPC and cloud for unified reporting.
Reliability and Failover
Hybrid workflows should handle:
- Partial failures (e.g., HPC cluster unavailable, storage temporarily offline).
- Retry strategies for data transfers and job submissions (a retry wrapper is sketched below).
- Checkpointing where possible, especially for long-running simulations offloaded to OpenShift.
OpenShift’s built-in features (health checks, re-scheduling, and job retries) help on the cloud side, but must be complemented by HPC-savvy mechanisms (checkpoint/restart, scheduler policies) on the HPC side.
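A minimal retry-with-backoff wrapper that can front cross-site operations such as data transfers or remote job submissions; the attempt count and delays are illustrative:

```python
# Minimal sketch of a retry-with-backoff wrapper for flaky cross-site
# operations such as data transfers or remote job submissions.
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 2.0):
    """Call fn(); on failure, retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # narrow the exception types in real code
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

# Usage: wrap a transfer or submission step, e.g. the earlier sketch:
# with_retries(lambda: submit_job("inputs/case001.inp", "run_case.sh"))
```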
Example Hybrid Use Cases
Parameter Sweep Experiments
- Thousands of independent simulations with different parameters.
- On-prem HPC runs a subset (tightly coupled or high-priority cases).
- OpenShift runs a large number of embarrassingly parallel containers in the cloud for the rest.
- A pipeline on OpenShift:
- Generates parameter sets.
- Decides placement (HPC vs. OpenShift) based on policies (a placement sketch follows this list).
- Aggregates and analyzes results.
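A minimal sketch of the placement decision, assuming each sweep case carries a rank count and a priority flag (both illustrative attributes):

```python
# Minimal sketch of a placement decision for a parameter sweep: tightly
# coupled or high-priority cases go to HPC, the embarrassingly parallel
# remainder to OpenShift. The case attributes are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Case:
    name: str
    mpi_ranks: int      # >1 implies a tightly coupled run
    high_priority: bool

def place(case: Case) -> str:
    """Return 'hpc' or 'openshift' for a single sweep case."""
    if case.mpi_ranks > 1 or case.high_priority:
        return "hpc"
    return "openshift"

sweep = [
    Case("case001", mpi_ranks=64, high_priority=False),
    Case("case002", mpi_ranks=1, high_priority=False),
    Case("case003", mpi_ranks=1, high_priority=True),
]
for c in sweep:
    print(c.name, "->", place(c))
```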
Digital Twins and Continuous Simulation
- Near-real-time simulations driven by live data streams processed on OpenShift.
- HPC runs heavy baseline simulations to train high-fidelity models.
- OpenShift hosts surrogate models and lightweight, frequent simulations for “digital twin” updates.
- Data and models move between HPC and OpenShift as the system learns and improves.
Collaborative Research Portals
- OpenShift hosts a portal where researchers:
- Upload input cases.
- Choose where to run them (HPC center A, HPC center B, cloud).
- Visualize results in dashboards.
- Portal orchestrates jobs across multiple HPC centers and OpenShift clusters, hiding infrastructure details from end users.
Practical Guidelines for Getting Started
When designing hybrid HPC and cloud workflows with OpenShift:
- Start with a single pipeline
- Choose one real workflow (e.g., pre-process → HPC run → post-process) and implement it end-to-end.
- Use this as a reference for patterns, tools, and governance.
- Standardize on container images
- Create a small catalog of validated images for important codes and toolchains.
- Use the same images where possible across HPC (via compatible runtimes) and OpenShift.
- Invest in data paths early
- Define how data will move or be shared.
- Measure transfer times and bandwidths; identify bottlenecks.
- Automate orchestration
- Use OpenShift-native pipelines or workflow engines to avoid manual glue scripts.
- Encapsulate HPC submission and monitoring logic into services or Operators.
- Iterate on policy and governance
- Start with simple rules (e.g., only use cloud for post-processing) and refine as you gain experience with performance, cost, and reliability.
By leveraging OpenShift as a cloud-native orchestration and execution environment, organizations can extend traditional HPC capabilities, balance workloads across infrastructures, and build more flexible and data-centric scientific and engineering workflows.