Key Concepts in Hybrid HPC–Cloud Workflows
Hybrid HPC–cloud workflows combine traditional on-premises HPC resources (clusters, supercomputers, specialized storage) with cloud and OpenShift-based environments. The goal is to use each environment for what it does best while presenting users with a coherent way to run, move, and scale workloads.
In this context, OpenShift often acts as:
- A cloud-native execution environment for parts of the workflow.
- A unifying control plane across multiple infrastructures.
- A bridge between batch-oriented HPC systems and elastic cloud resources.
This chapter focuses on how OpenShift participates in hybrid setups, not on basic HPC or OpenShift concepts.
When to Use Hybrid HPC–Cloud Approaches
Typical reasons to combine HPC and cloud with OpenShift include:
- Bursting: Temporarily offload extra work from an over-subscribed on-prem HPC cluster to cloud-based OpenShift clusters.
- Pre- and post-processing: Run data preparation, meshing, feature engineering, or visualization on OpenShift, while the main simulation/solver runs on a traditional HPC cluster.
- Heterogeneous accelerators: Use specialized GPUs or accelerators (e.g., in the cloud) for parts of workflows (AI/ML, visualization) that complement CPU-bound HPC computations.
- Data-centric workflows: Combine HPC simulations with cloud-native data lakes, streaming systems, or analytics stacks running on OpenShift.
- Geo-distributed teams: Researchers use OpenShift-hosted portals, notebooks, and UIs to orchestrate or submit jobs to remote HPC centers.
The common thread is that some steps are better suited to batch schedulers and tightly coupled MPI jobs, while others benefit from OpenShift’s elasticity and rich ecosystem.
Common Hybrid Workflow Patterns
1. Pre-Processing on OpenShift, Simulation on HPC
Many HPC simulations require significant pre-processing:
- Generating input parameter sweeps
- Meshing geometries
- Cleaning and transforming datasets
Pattern:
- User interface / orchestration on OpenShift
- Web apps, Jupyter notebooks, or REST APIs run on OpenShift.
- Users define experiments, parameter sets, or input conditions.
- Input generation on OpenShift
- Containerized tools generate input decks, mesh files, or job arrays.
- Outputs are written to storage accessible to the HPC system (e.g., a shared file system, object store, or a synchronized data location).
- Job submission to HPC
- An OpenShift-based service or Operator interacts with the HPC scheduler (e.g., Slurm, PBS, LSF) via SSH, REST, or scheduler APIs (a submission sketch follows this list).
- Job metadata and status are tracked in OpenShift (e.g., via custom resources).
- Simulation runs on the HPC cluster
- The HPC system runs parallel MPI or tightly coupled workloads on bare metal.
- Results returned
- Result files are written back to the shared or synchronized storage for later consumption in OpenShift (post-processing).
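To make the submission step concrete, here is a minimal Python sketch of a service that stages an input deck over SSH and submits it to Slurm. The host name, paths, and job script are illustrative assumptions; a production service would add error handling and credential management (see "Security and identity" later in this chapter).

```python
# Minimal sketch of a submission step: copy an input deck to the HPC
# system and submit it to Slurm over SSH. Host, paths, and the job
# script are illustrative assumptions, not a fixed interface.
import subprocess

HPC_HOST = "login.hpc.example.org"     # assumed HPC login node
REMOTE_DIR = "/scratch/project/run42"  # assumed shared scratch path

def submit_job(local_input: str, job_script: str) -> str:
    """Stage an input file and submit a Slurm job; return the job ID."""
    # Stage the generated input deck to storage the HPC side can see.
    subprocess.run(
        ["scp", local_input, f"{HPC_HOST}:{REMOTE_DIR}/"],
        check=True,
    )
    # 'sbatch --parsable' prints only the job ID on success.
    result = subprocess.run(
        ["ssh", HPC_HOST, f"cd {REMOTE_DIR} && sbatch --parsable {job_script}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

job_id = submit_job("inputs/case001.inp", "run_case.sh")
print(f"Submitted Slurm job {job_id}")
```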
Advantages:
- OpenShift handles user-facing components and scales them independently.
- HPC resources are reserved for the parts that truly need them.
2. Post-Processing and Analytics on OpenShift
Large simulations often produce massive outputs that require analysis or visualization.
Pattern:
- Simulation completes on HPC
- Results are stored in a file system or object store.
- Data exposure to OpenShift
- Storage is:
- Mounted directly into OpenShift via CSI drivers, or
- Replicated/synchronized to a cloud object store accessed from OpenShift.
- Post-processing on OpenShift
- Containerized analytics tools (Python, R, Spark, Dask, AI/ML frameworks) run on OpenShift using multiple pods.
- Work is parallelized using cloud-native patterns (e.g., job queues, distributed dataframes, parallel map-reduce); a fan-out sketch follows this list.
- Visualization and sharing
- Dashboards, notebooks, or custom portals on OpenShift render results.
- Access control and multi-tenant sharing are handled by OpenShift’s RBAC and namespaces.
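As one illustration of the fan-out step, the following sketch reduces a directory of result files to summary records in parallel. The paths and the per-file reduction are assumptions about the simulation's output format; the same pattern scales out across pods when the local pool is replaced by a work queue.

```python
# Minimal sketch of fan-out post-processing: each result file is reduced
# to summary statistics in parallel. Paths and the reduction are
# illustrative assumptions about the simulation output format.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import json

RESULTS_DIR = Path("/data/results")  # assumed mount of shared/replicated storage

def summarize(result_file: Path) -> dict:
    """Reduce one result file (assumed one numeric value per line) to a summary."""
    values = [float(line) for line in result_file.read_text().splitlines()
              if line.strip()]
    return {
        "case": result_file.stem,
        "n": len(values),
        "mean": sum(values) / len(values) if values else None,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }

if __name__ == "__main__":
    files = sorted(RESULTS_DIR.glob("case*.out"))
    # Each worker process handles a subset of files; on OpenShift the same
    # fan-out can be spread across pods via a work queue instead.
    with ProcessPoolExecutor() as pool:
        summaries = list(pool.map(summarize, files))
    print(json.dumps(summaries, indent=2))
```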
Advantages:
- Data analysis workloads benefit from elastic scaling and modern data platforms.
- Researchers use browser-based tools instead of logging into the HPC head node.
3. Cloud Bursting from HPC to OpenShift
Cloud bursting adds temporary capacity when the HPC queue is too long or demand spikes.
Pattern:
- Primary scheduling on on-prem HPC
- HPC scheduler remains the authoritative system.
- Bursting triggers
- Policies define when to burst (a trigger check is sketched after this pattern):
- Queue length thresholds
- Wait time thresholds
- Specific job tags or partitions
- Launching capacity on OpenShift
- OpenShift clusters (in the cloud or in additional data centers) are scaled up or created on demand.
- HPC job definitions are translated into containerized jobs (e.g., using job templates and container images built from the same codes).
- Execution on OpenShift
- Jobs that meet bursting criteria are submitted as:
- Job/CronJob resources in Kubernetes/OpenShift.
- Specialized HPC job abstractions if you use HPC-oriented Operators.
- Data staging
- Inputs are moved between HPC storage and OpenShift storage (or both share a common backend).
- Outputs are returned or archived.
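A minimal sketch of a bursting trigger, assuming a Slurm partition watched over SSH and an illustrative pending-job threshold:

```python
# Minimal sketch of a bursting trigger: count pending jobs in a Slurm
# partition and decide whether to burst. The threshold, partition name,
# and SSH hop are illustrative assumptions.
import subprocess

HPC_HOST = "login.hpc.example.org"  # assumed HPC login node
PARTITION = "standard"              # assumed partition to watch
MAX_PENDING = 50                    # assumed policy threshold

def pending_jobs() -> int:
    """Count jobs pending in the watched partition."""
    # 'squeue -h' suppresses the header; '-t PD' selects pending jobs.
    out = subprocess.run(
        ["ssh", HPC_HOST, f"squeue -h -t PD -p {PARTITION} -o %i"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(out.splitlines())

def should_burst() -> bool:
    return pending_jobs() > MAX_PENDING

if should_burst():
    print("Queue over threshold: route eligible jobs to OpenShift")
```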
Considerations:
- Ensuring binary compatibility (compilers, MPI, libraries) between the HPC environment and the container images.
- Cost management and quota policies for cloud bursting.
- Job placement rules to keep tightly coupled MPI work on HPC, and embarrassingly parallel tasks on OpenShift.
4. End-to-End Scientific Pipelines Across Environments
Hybrid workflows often resemble multi-step pipelines where different steps run where they fit best.
Example high-level pipeline:
- Ingest raw data on OpenShift
- Data arrives via APIs, message queues, or data transfers into an OpenShift-hosted service.
- Pre-processing (OpenShift)
- Clean and reshape data, prepare multiple scenarios, generate configuration sets.
- Stage data to HPC
- Transfer relevant inputs to the HPC center’s high-performance storage.
- Run core simulation on HPC
- Possibly using batch arrays, MPI, and accelerators.
- Stage results back to OpenShift
- Either selective result subsets or entire datasets, depending on data size and cost.
- Analytics and AI/ML (OpenShift)
- Build surrogate models, perform parameter studies or uncertainty quantification using cloud-native compute.
- Publish and archive (OpenShift)
- Store derived datasets and models in object stores, catalogs, or data portals; expose APIs for downstream consumers.
OpenShift-native tools (like CI/CD pipelines, Operators, and event-driven workflows) can orchestrate this pipeline, providing reproducibility and automation across both environments.
Data Movement and Storage Strategies
Data locality is a key challenge in hybrid HPC–cloud workflows. Strategies must balance performance, cost, and complexity.
Shared Storage vs Data Replication
- Directly shared storage
- The HPC parallel file system is exported to OpenShift, or both environments use a shared object storage platform.
- Simplifies paths and references: jobs on both sides see the same paths/buckets.
- Often limited by network bandwidth and latency, security requirements, and cross-site SLAs.
- Replicated / synchronized data
- Data is synchronized between HPC storage and cloud storage, either periodically or in response to events (e.g., via rsync, rclone, object storage replication, or data mover tools).
- More robust to network interruptions; can be optimized to move only what's needed.
- Requires careful versioning and metadata management to avoid confusion.
- On-demand staging
- Jobs running on OpenShift use init containers or sidecars to fetch only the required inputs from HPC storage (or vice versa) and to push results back after execution (a staging sketch follows this list).
- Can be orchestrated using Kubernetes Jobs, PersistentVolumeClaims, and container scripts.
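A minimal sketch of the on-demand staging step as it might run in an init container, assuming inputs live in an S3-compatible object store; the bucket, prefix, and endpoint variable are illustrative:

```python
# Minimal sketch of an init-container staging step: pull required inputs
# from an object store into the pod's working volume. Bucket, prefix, and
# endpoint are illustrative assumptions.
import os
import boto3

BUCKET = "simulation-data"                          # assumed bucket name
PREFIX = os.environ.get("CASE_PREFIX", "case001/")  # which case to stage
DEST = "/workspace/inputs"                          # assumed emptyDir/PVC mount

s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT"))

os.makedirs(DEST, exist_ok=True)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        target = os.path.join(DEST, os.path.basename(obj["Key"]))
        s3.download_file(BUCKET, obj["Key"], target)
        print(f"staged {obj['Key']} -> {target}")
```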
Using OpenShift Storage Constructs
While the underlying storage mechanisms may be external, OpenShift concepts help manage how workloads see that storage:
- Use PersistentVolume and PersistentVolumeClaim abstractions to expose external HPC-capable storage into OpenShift pods (a PVC sketch appears at the end of this subsection).
- Use StorageClasses and dynamic provisioning where appropriate for cloud-native parts of the workflow.
- Combine object storage (for raw and bulk data) with block/file storage (for scratch and low-latency working sets).
In hybrid workflows, it’s common to separate:
- Scratch / intermediate data (often kept close to the compute: HPC scratch or OpenShift local/fast storage).
- Long-term / shared results (object storage or replicated file systems across environments).
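For the OpenShift side, here is a minimal sketch of requesting such storage with the Python Kubernetes client; the StorageClass name, size, and namespace are illustrative assumptions:

```python
# Minimal sketch: create a PersistentVolumeClaim against a StorageClass
# that fronts external HPC-capable storage. The class name, size, and
# namespace are illustrative assumptions; in-cluster config is assumed.
from kubernetes import client, config

config.load_incluster_config()  # or load_kube_config() outside the cluster

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-results"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],      # shared across pods
        storage_class_name="hpc-shared-fs",  # assumed CSI-backed class
        resources=client.V1ResourceRequirements(
            requests={"storage": "500Gi"}
        ),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="hybrid-workflows", body=pvc
)
```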
Orchestration Patterns with OpenShift
OpenShift can act as the overarching orchestrator for hybrid flows, coordinating HPC and cloud tasks.
Using OpenShift Pipelines / Workflow Engines
- Pipelines (e.g., Tekton)
- Represent each step in a hybrid workflow as a Task:
- Pre-process task (OpenShift pod).
- Submit-to-HPC task (small service container).
- Poll-HPC-status task (sketched at the end of this subsection).
- Post-process task (OpenShift pod).
- Pass artifacts via storage or object stores.
- Apply version control to pipeline definitions for reproducibility.
- Workflow engines
- Engines like Argo Workflows, Nextflow, or CWL-based systems can run on OpenShift and call out to HPC systems.
- Provide DAG-based workflow definitions, retries, and provenance tracking.
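As an example of the Poll-HPC-status task, here is a minimal sketch of the logic its container might run, assuming Slurm accounting is reachable over SSH:

```python
# Minimal sketch of a Poll-HPC-status pipeline task: query Slurm for a
# job's state until it reaches a terminal state. The SSH hop and use of
# 'sacct' are illustrative assumptions.
import subprocess
import sys
import time

HPC_HOST = "login.hpc.example.org"  # assumed HPC login node
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}

def job_state(job_id: str) -> str:
    # 'sacct -n -X' prints one line per job allocation without a header.
    out = subprocess.run(
        ["ssh", HPC_HOST, f"sacct -n -X -j {job_id} -o State"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return out.split()[0] if out else "PENDING"

job_id = sys.argv[1]
while (state := job_state(job_id)) not in TERMINAL:
    print(f"job {job_id}: {state}")
    time.sleep(60)
print(f"job {job_id} finished with state {state}")
sys.exit(0 if state == "COMPLETED" else 1)
```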
Integrating with HPC Schedulers
Connecting OpenShift with traditional batch schedulers requires well-defined interfaces:
- Job submission services
- Containers on OpenShift run clients for Slurm, PBS, LSF, etc.
- REST APIs or Operators abstract scheduler-specific commands.
- Custom Resource Definitions (CRDs) can represent remote HPC jobs as Kubernetes objects.
- Status and lifecycle sync
- Periodic polling or event-driven updates from HPC to OpenShift:
- Map HPC states (PENDING, RUNNING, COMPLETED, FAILED) to Kubernetes-style conditions (a mapping sketch follows this list).
- Enables dashboards, automation, and alerts in OpenShift based on remote jobs.
- Security and identity
- The service that interacts with the HPC scheduler uses appropriate credentials (SSH keys, Kerberos tickets, or federated identity).
- OpenShift RBAC controls who is allowed to trigger HPC jobs via the integration components.
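A minimal sketch of the status-sync piece, assuming a hypothetical HPCJob CRD (group hpc.example.com) with the status subresource enabled:

```python
# Minimal sketch of mapping HPC scheduler states onto Kubernetes-style
# conditions and patching them into a custom resource's status. The CRD
# group, version, and plural are illustrative assumptions.
from kubernetes import client, config

# Assumed mapping from scheduler states to condition type/status/reason.
STATE_TO_CONDITION = {
    "PENDING":   ("Ready", "False", "QueuedOnHPC"),
    "RUNNING":   ("Ready", "True", "RunningOnHPC"),
    "COMPLETED": ("Succeeded", "True", "HPCJobCompleted"),
    "FAILED":    ("Succeeded", "False", "HPCJobFailed"),
}

def sync_status(name: str, namespace: str, hpc_state: str) -> None:
    cond_type, status, reason = STATE_TO_CONDITION[hpc_state]
    patch = {"status": {"conditions": [
        {"type": cond_type, "status": status, "reason": reason}
    ]}}
    client.CustomObjectsApi().patch_namespaced_custom_object_status(
        group="hpc.example.com",  # assumed CRD group
        version="v1alpha1",       # assumed version
        namespace=namespace,
        plural="hpcjobs",         # assumed plural for the HPC job CRD
        name=name,
        body=patch,
    )

config.load_incluster_config()
sync_status("run42", "hybrid-workflows", "RUNNING")
```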
Designing Hybrid-Friendly Containers
To run the same scientific applications across HPC and OpenShift, container design must consider hybrid requirements:
- Build once, run many places
- Container images should encapsulate solvers, libraries, and dependencies in a portable way.
- For MPI workloads, images must align with the MPI stack and network fabric when running on HPC (this can be nuanced and may require matching system libraries or using HPC-specific container runtimes).
- Separation of configuration and data
- Use environment variables, ConfigMaps, and Secrets to adapt behavior by environment (HPC vs. cloud) without rebuilding images.
- Keep input data and configuration external to the image to avoid large image sizes.
- Awareness of resource constraints
- Containerized codes may assume a certain number of cores or memory; in OpenShift, those are governed by resource requests/limits and can differ from HPC node layouts.
- Provide flexible runtime configuration (e.g., read the number of processes/threads from the environment); a sketch follows this list.
- Licensing considerations
- Licensed commercial solvers might rely on hardware or network license servers.
- Containers running on OpenShift must still comply with license policies; license servers might live on-prem, in the cloud, or both.
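A minimal sketch of environment-driven runtime configuration; SLURM_NTASKS and OMP_NUM_THREADS are standard on the HPC side, while APP_NPROCS is an assumed application-specific fallback:

```python
# Minimal sketch of environment-driven runtime configuration: derive
# process/thread counts from whatever the environment provides, with
# sane fallbacks. APP_NPROCS is an assumed, application-specific name.
import os

def runtime_layout() -> tuple[int, int]:
    """Return (processes, threads) for this execution environment."""
    # On HPC, the scheduler typically exports these (e.g., Slurm).
    nprocs = int(os.environ.get("SLURM_NTASKS", "0")) or \
             int(os.environ.get("APP_NPROCS", "1"))
    # Note: on OpenShift, os.cpu_count() reports node cores, not the pod's
    # cgroup limit, so an explicit OMP_NUM_THREADS is preferable there.
    nthreads = int(os.environ.get("OMP_NUM_THREADS", "0")) or \
               (os.cpu_count() or 1)
    return nprocs, nthreads

procs, threads = runtime_layout()
print(f"launching with {procs} processes x {threads} threads")
```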
Operational Considerations and Governance
Hybrid workflows introduce additional complexities around governance, cost, and reliability.
Policy and Placement
Define clear policies for:
- Which workloads must run on HPC (e.g., tightly coupled MPI, codes requiring specific interconnects).
- Which can run on OpenShift (e.g., embarrassingly parallel tasks, post-processing, AI/ML).
- When and how bursting is allowed, including budget and quota rules.
OpenShift’s constructs (namespaces, quotas, limit ranges) can help enforce limits on cloud-side workloads.
Cost and Accounting
- On-prem HPC often uses allocation-based accounting (project hours and quotas), while cloud use is typically pay-as-you-go.
- You may need:
- Tagging of OpenShift workloads for cost attribution.
- Synchronization of accounting data between HPC and cloud for unified reporting.
Reliability and Failover
Hybrid workflows should handle:
- Partial failures (e.g., HPC cluster unavailable, storage temporarily offline).
- Retry strategies for data transfers and job submissions (a retry wrapper is sketched below).
- Checkpointing where possible, especially for long-running simulations offloaded to OpenShift.
OpenShift’s built-in features (health checks, re-scheduling, and job retries) help on the cloud side, but must be complemented by HPC-savvy mechanisms (checkpoint/restart, scheduler policies) on the HPC side.
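A minimal retry-with-backoff wrapper that can front cross-site operations such as data transfers or remote job submissions; the attempt count and delays are illustrative:

```python
# Minimal sketch of a retry-with-backoff wrapper for flaky cross-site
# operations such as data transfers or remote job submissions.
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 2.0):
    """Call fn(); on failure, retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # narrow the exception types in real code
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

# Usage: wrap a transfer or submission step, e.g. the earlier sketch:
# with_retries(lambda: submit_job("inputs/case001.inp", "run_case.sh"))
```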
Example Hybrid Use Cases
Parameter Sweep Experiments
- Thousands of independent simulations with different parameters.
- On-prem HPC runs a subset (tightly coupled or high-priority cases).
- OpenShift runs a large number of embarrassingly parallel containers in the cloud for the rest.
- A pipeline on OpenShift:
- Generates parameter sets.
- Decides placement (HPC vs. OpenShift) based on policies (a placement sketch follows this list).
- Aggregates and analyzes results.
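A minimal sketch of the placement decision, assuming each sweep case carries a rank count and a priority flag (both illustrative attributes):

```python
# Minimal sketch of a placement decision for a parameter sweep: tightly
# coupled or high-priority cases go to HPC, the embarrassingly parallel
# remainder to OpenShift. The case attributes are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Case:
    name: str
    mpi_ranks: int      # >1 implies a tightly coupled run
    high_priority: bool

def place(case: Case) -> str:
    """Return 'hpc' or 'openshift' for a single sweep case."""
    if case.mpi_ranks > 1 or case.high_priority:
        return "hpc"
    return "openshift"

sweep = [
    Case("case001", mpi_ranks=64, high_priority=False),
    Case("case002", mpi_ranks=1, high_priority=False),
    Case("case003", mpi_ranks=1, high_priority=True),
]
for c in sweep:
    print(c.name, "->", place(c))
```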
Digital Twins and Continuous Simulation
- Near-real-time simulations driven by live data streams processed on OpenShift.
- HPC runs heavy baseline simulations to train high-fidelity models.
- OpenShift hosts surrogate models and lightweight, frequent simulations for “digital twin” updates.
- Data and models move between HPC and OpenShift as the system learns and improves.
Collaborative Research Portals
- OpenShift hosts a portal where researchers:
- Upload input cases.
- Choose where to run them (HPC center A, HPC center B, cloud).
- Visualize results in dashboards.
- Portal orchestrates jobs across multiple HPC centers and OpenShift clusters, hiding infrastructure details from end users.
Practical Guidelines for Getting Started
When designing hybrid HPC and cloud workflows with OpenShift:
- Start with a single pipeline
- Choose one real workflow (e.g., pre-process → HPC run → post-process) and implement it end-to-end.
- Use this as a reference for patterns, tools, and governance.
- Standardize on container images
- Create a small catalog of validated images for important codes and toolchains.
- Use the same images where possible across HPC (via compatible runtimes) and OpenShift.
- Invest in data paths early
- Define how data will move or be shared.
- Measure transfer times and bandwidths; identify bottlenecks.
- Automate orchestration
- Use OpenShift-native pipelines or workflow engines to avoid manual glue scripts.
- Encapsulate HPC submission and monitoring logic into services or Operators.
- Iterate on policy and governance
- Start with simple rules (e.g., only use cloud for post-processing) and refine as you gain experience with performance, cost, and reliability.
By leveraging OpenShift as a cloud-native orchestration and execution environment, organizations can extend traditional HPC capabilities, balance workloads across infrastructures, and build more flexible and data-centric scientific and engineering workflows.