
Hybrid HPC and cloud workflows

Key Concepts in Hybrid HPC–Cloud Workflows

Hybrid HPC–cloud workflows combine traditional on-premises HPC resources (clusters, supercomputers, specialized storage) with cloud and OpenShift-based environments. The goal is to use each environment for what it does best while presenting users with a coherent way to run, move, and scale workloads.

In this context, OpenShift typically acts as the user-facing orchestration layer, the environment for pre- and post-processing, and an elastic pool of additional compute capacity alongside the HPC system.

This chapter focuses on how OpenShift participates in hybrid setups, not on basic HPC or OpenShift concepts.

When to Use Hybrid HPC–Cloud Approaches

Typical reasons to combine HPC and cloud with OpenShift include interactive pre- and post-processing around batch simulations, elastic analytics and AI/ML on simulation output, bursting beyond on-premises capacity during demand spikes, and user-facing portals and APIs in front of HPC workflows.

The common thread is that some steps are better suited to batch schedulers and tightly coupled MPI jobs, while others benefit from OpenShift’s elasticity and rich ecosystem.

Common Hybrid Workflow Patterns

1. Pre-Processing on OpenShift, Simulation on HPC

Many HPC simulations require significant pre-processing:

Pattern:

  1. User interface / orchestration on OpenShift
    • Web apps, Jupyter notebooks, or REST APIs run on OpenShift.
    • Users define experiments, parameter sets, or input conditions.
  2. Input generation on OpenShift
    • Containerized tools generate input decks, mesh files, or job arrays.
    • Outputs are written to storage accessible to the HPC system (e.g., a shared file system, object store, or a synchronized data location).
  3. Job submission to HPC
    • An OpenShift-based service or Operator interacts with the HPC scheduler (e.g., Slurm, PBS, LSF) via SSH, REST, or scheduler APIs.
    • Job metadata and status are tracked in OpenShift (e.g., via custom resources).
  4. Simulation runs on the HPC cluster
    • The HPC system runs parallel MPI or tightly coupled workloads on bare metal.
  5. Results returned
    • Result files are written back to the shared or synchronized storage for later consumption in OpenShift (post-processing).

Advantages:

  • Users interact through modern web and notebook interfaces instead of logging into the cluster directly.
  • Input preparation scales elastically on OpenShift without consuming HPC compute time.
  • Tightly coupled simulations still run on bare metal under the existing HPC scheduler.
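
As a minimal sketch of steps 2 and 3 above (assuming the paramiko library, plus a hypothetical login host, scratch path, service account, and batch script), the following Python snippet copies a generated input deck to HPC storage and submits a Slurm job over SSH. A production setup would add proper credential management, error handling, and status tracking in a custom resource.

```python
import paramiko

HPC_LOGIN = "hpc-login.example.org"      # hypothetical HPC login node
REMOTE_DIR = "/scratch/project/run_001"  # hypothetical shared scratch path

def submit_simulation(input_deck: str, batch_script: str) -> str:
    """Copy a locally generated input deck to HPC storage and submit it via sbatch."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(HPC_LOGIN, username="svc-openshift",
                key_filename="/secrets/ssh/id_ed25519")

    # Stage the pre-processed inputs generated on OpenShift.
    sftp = ssh.open_sftp()
    sftp.put(input_deck, f"{REMOTE_DIR}/input.dat")
    sftp.put(batch_script, f"{REMOTE_DIR}/job.sbatch")
    sftp.close()

    # Submit the job; --parsable makes sbatch print only the job ID.
    _, stdout, stderr = ssh.exec_command(
        f"cd {REMOTE_DIR} && sbatch --parsable job.sbatch")
    job_id = stdout.read().decode().strip()
    err = stderr.read().decode().strip()
    ssh.close()
    if not job_id:
        raise RuntimeError(f"sbatch failed: {err}")
    return job_id  # record this on the OpenShift side, e.g. in a custom resource status
```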

2. Post-Processing and Analytics on OpenShift

Large simulations often produce massive outputs that require analysis or visualization.

Pattern:

  1. Simulation completes on HPC
    • Results are stored in a file system or object store.
  2. Data exposure to OpenShift
    • Storage is:
      • Mounted directly into OpenShift via CSI drivers, or
      • Replicated/synchronized to a cloud object store accessed from OpenShift.
  3. Post-processing on OpenShift
    • Containerized analytics tools (Python, R, Spark, Dask, AI/ML frameworks) run on OpenShift using multiple pods.
    • Work is parallelized using cloud-native patterns (e.g., job queues, distributed dataframes, parallel map-reduce).
  4. Visualization and sharing
    • Dashboards, notebooks, or custom portals on OpenShift render results.
    • Access control and multi-tenant sharing are handled by OpenShift’s RBAC and namespaces.

Advantages:

  • Post-processing scales out across many pods without occupying HPC compute nodes.
  • The full ecosystem of containerized analytics, AI/ML, and visualization tools is available.
  • Results can be shared with collaborators under OpenShift's RBAC and namespace isolation.
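
The post-processing step can use ordinary cloud-native fan-out. Below is a minimal sketch, assuming results have been replicated to an S3-compatible bucket and using boto3 with a thread pool; the bucket name, prefix, and summary computation are hypothetical placeholders, and a multi-pod setup would shard the key list across Jobs.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

s3 = boto3.client("s3")        # credentials typically come from a mounted Secret
BUCKET = "simulation-results"  # hypothetical bucket holding replicated HPC output
PREFIX = "run_001/"

def analyze_object(key: str) -> dict:
    """Download one result file and compute a placeholder summary."""
    with tempfile.NamedTemporaryFile() as tmp:
        s3.download_file(BUCKET, key, tmp.name)
        # Replace with real post-processing, e.g. reading fields and computing statistics.
        size = len(Path(tmp.name).read_bytes())
    return {"key": key, "bytes": size}

def run_postprocessing() -> list[dict]:
    """List result objects and fan the per-file analysis out across a thread pool."""
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    keys = [obj["Key"] for obj in listing.get("Contents", [])]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(analyze_object, keys))

if __name__ == "__main__":
    for summary in run_postprocessing():
        print(summary)
```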

3. Cloud Bursting from HPC to OpenShift

Cloud bursting adds temporary capacity when the HPC queue is too long or demand spikes.

Pattern:

  1. Primary scheduling on on-prem HPC
    • HPC scheduler remains the authoritative system.
  2. Bursting triggers
    • Policies define when to burst:
      • Queue length thresholds
      • Wait time thresholds
      • Specific job tags or partitions
  3. Launching capacity on OpenShift
    • OpenShift clusters (on cloud or additional data centers) are scaled up or created.
    • HPC job definitions are translated to containerized jobs (e.g., using job templates, container images with the same codes).
  4. Execution on OpenShift
    • Jobs that meet bursting criteria are submitted as:
      • Job / CronJob resources in Kubernetes/OpenShift.
      • Specialized HPC job abstractions if you use HPC-oriented Operators.
  5. Data staging
    • Inputs are moved between HPC storage and OpenShift storage (or both share a common backend).
    • Outputs are returned or archived.

Considerations:

  • Data staging time and egress costs can offset the benefit of the extra capacity.
  • Containerized runs on cloud nodes may perform differently from bare-metal HPC runs, especially for tightly coupled MPI jobs.
  • Software licensing and data-governance rules may restrict which jobs are allowed to burst.
  • The HPC scheduler should remain the single source of truth for job state to avoid conflicting bookkeeping.
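
A simplified sketch of a bursting trigger is shown below, assuming the kubernetes Python client, passwordless SSH to a hypothetical login node, and a pre-built container image of the same code: it checks the Slurm queue with squeue and launches overflow work as a Kubernetes Job on OpenShift. A real implementation would run as a controller or Operator with proper policy evaluation.

```python
import subprocess

from kubernetes import client, config

PENDING_THRESHOLD = 50                     # hypothetical bursting policy
IMAGE = "registry.example.org/sim:latest"  # hypothetical image with the same code as on HPC

def pending_jobs() -> int:
    """Count pending jobs in the Slurm queue via SSH to the login node."""
    out = subprocess.run(
        ["ssh", "hpc-login.example.org", "squeue", "-h", "-t", "PENDING"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len([line for line in out.splitlines() if line.strip()])

def burst_job(name: str, command: list[str]) -> None:
    """Run one overflow case as a Kubernetes Job on the OpenShift side."""
    config.load_incluster_config()  # running inside an OpenShift pod
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            backoff_limit=2,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(name="sim", image=IMAGE, command=command)],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="burst", body=job)

if __name__ == "__main__":
    if pending_jobs() > PENDING_THRESHOLD:
        burst_job("burst-case-001", ["./run_case", "--case", "001"])
```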

4. End-to-End Scientific Pipelines Across Environments

Hybrid workflows often resemble multi-step pipelines where different steps run where they fit best.

Example high-level pipeline:

  1. Ingest raw data on OpenShift
    • Data arrives via APIs, message queues, or data transfers into an OpenShift-hosted service.
  2. Pre-processing (OpenShift)
    • Clean and reshape data, prepare multiple scenarios, generate configuration sets.
  3. Stage data to HPC
    • Transfer relevant inputs to the HPC center’s high-performance storage.
  4. Run core simulation on HPC
    • Possibly using batch arrays, MPI, and accelerators.
  5. Stage results back to OpenShift
    • Either selective result subsets or entire datasets, depending on data size and cost.
  6. Analytics and AI/ML (OpenShift)
    • Build surrogate models, perform parameter studies or uncertainty quantification using cloud-native compute.
  7. Publish and archive (OpenShift)
    • Store derived datasets and models in object stores, catalogs, or data portals; expose APIs for downstream consumers.

OpenShift-native tools (like CI/CD pipelines, Operators, and event-driven workflows) can orchestrate this pipeline, providing reproducibility and automation across both environments.
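
Regardless of whether a Tekton pipeline, an Operator, or a plain driver service ultimately runs it, the pipeline can be expressed as an ordered set of stages with a record of where each one executed. The sketch below uses placeholder actions only, to show the shape of such a driver.

```python
def run_pipeline(run_id: str) -> list[str]:
    """Walk the hybrid pipeline in order, recording where each stage ran."""
    # Placeholder actions; real stages would call the services sketched earlier
    # (pre-processing, data staging, HPC submission, analytics, publishing).
    stages = [
        ("ingest",        "openshift", lambda: None),
        ("preprocess",    "openshift", lambda: None),
        ("stage-inputs",  "transfer",  lambda: None),
        ("simulate",      "hpc",       lambda: None),
        ("stage-results", "transfer",  lambda: None),
        ("analyze",       "openshift", lambda: None),
        ("publish",       "openshift", lambda: None),
    ]
    history = []
    for name, location, action in stages:
        action()  # raise on failure so the run stops at the broken stage
        history.append(f"{run_id}: {name} completed on {location}")
    return history

if __name__ == "__main__":
    print("\n".join(run_pipeline("run-001")))
```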

Data Movement and Storage Strategies

Data locality is a key challenge in hybrid HPC–cloud workflows. Strategies must balance performance, cost, and complexity.

Shared Storage vs Data Replication

  1. Directly shared storage
    • The HPC center's parallel file system is exported to OpenShift, or both environments use a shared object storage platform.
    • Simplifies paths and references: jobs on both sides see the same paths/buckets.
    • Often limited by network bandwidth and latency, security requirements, and cross-site SLAs.
  2. Replicated / synchronized data
    • Data is synchronized between HPC storage and cloud storage either periodically or in response to events (e.g., via rsync, rclone, object storage replication, or data mover tools).
    • More robust to network interruptions; can be optimized to move only what’s needed.
    • Requires careful versioning and metadata management to avoid confusion.
  3. On-demand staging
    • Jobs running on OpenShift use init containers or sidecars to fetch only required inputs from HPC storage (or vice versa) and to push results back after execution.
    • Can be orchestrated using Kubernetes Jobs, PersistentVolumeClaims, and container scripts (see the sketch below).
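
On-demand staging can be as simple as wrapping a data-mover tool in the job's entry script or init container. The sketch below assumes rclone is installed in the image and that remotes named hpc: and results: have been configured; it pulls only the inputs a case needs and pushes its outputs afterwards.

```python
import subprocess

def rclone(*args: str) -> None:
    """Run one rclone command and fail loudly if the transfer fails."""
    subprocess.run(["rclone", *args], check=True)

def stage_in(case: str) -> None:
    # Pull only the inputs this job needs from the HPC-side remote.
    rclone("copy", f"hpc:scratch/{case}/inputs", f"/work/{case}/inputs")

def stage_out(case: str) -> None:
    # Push results to an object-store remote that OpenShift services can read.
    rclone("copy", f"/work/{case}/outputs", f"results:simulation-results/{case}")

if __name__ == "__main__":
    stage_in("case_001")
    # ... run the actual computation against /work/case_001 here ...
    stage_out("case_001")
```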

Using OpenShift Storage Constructs

While the underlying storage mechanisms may be external, OpenShift concepts control how workloads see that storage: PersistentVolumes and PersistentVolumeClaims abstract the backend, StorageClasses select an appropriate tier, and CSI drivers connect pods to external or parallel file systems.

In hybrid workflows, it is common to separate high-performance scratch storage that stays close to the HPC system from capacity-oriented object storage that OpenShift workloads use for staging, post-processing, and archiving.

Orchestration Patterns with OpenShift

OpenShift can act as the overarching orchestrator for hybrid flows, coordinating HPC and cloud tasks.

Using OpenShift Pipelines / Workflow Engines

OpenShift Pipelines (based on Tekton) or general-purpose workflow engines can express multi-step hybrid flows declaratively: each step runs as a container, steps that target the HPC side call a submission service and wait for completion, and the workflow definition is versioned alongside the application code.

Integrating with HPC Schedulers

Connecting OpenShift with traditional batch schedulers requires well-defined interfaces: a submission path (SSH to a login node, a scheduler REST API, or an Operator that translates custom resources into batch jobs), a way to poll or receive job status, and careful handling of credentials and user mapping between the two environments.
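
For the status side of that interface, a thin wrapper around the scheduler's accounting tools is often enough. The following sketch, assuming SSH access to a hypothetical login node and Slurm's sacct, maps raw scheduler states to the coarse states an OpenShift-side controller might store in a custom resource.

```python
import subprocess

# Coarse states an OpenShift-side controller might expose in a custom resource.
STATE_MAP = {
    "PENDING": "Queued",
    "RUNNING": "Running",
    "COMPLETED": "Succeeded",
    "FAILED": "Failed",
    "CANCELLED": "Failed",
    "TIMEOUT": "Failed",
}

def hpc_job_state(job_id: str) -> str:
    """Query Slurm accounting for one job and translate its state."""
    out = subprocess.run(
        ["ssh", "hpc-login.example.org",
         "sacct", "-j", job_id, "--format=State", "--noheader", "--parsable2", "-X"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Very new jobs may not appear in accounting yet; treat them as still queued.
    raw = out.splitlines()[0].strip() if out else "PENDING"
    return STATE_MAP.get(raw.split()[0], "Unknown")

if __name__ == "__main__":
    print(hpc_job_state("123456"))
```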

Designing Hybrid-Friendly Containers

To run the same scientific applications across HPC and OpenShift, container design must consider hybrid requirements: images should work under both OpenShift's container runtime and HPC runtimes such as Apptainer/Singularity, run as non-root with arbitrary user IDs, avoid hard-coded host paths, keep MPI and interconnect libraries compatible with the target hosts, and take configuration from environment variables or mounted files rather than baked-in settings.

Operational Considerations and Governance

Hybrid workflows introduce additional complexities around governance, cost, and reliability.

Policy and Placement

Define clear policies for which workloads may run in the cloud, which data may leave the HPC center, who is allowed to trigger bursting, and how much cloud capacity each project or namespace may consume.

OpenShift’s constructs (namespaces, quotas, limit ranges) can help enforce limits on cloud-side workloads.

Cost and Accounting

Cloud capacity is metered per use, while HPC capacity is usually granted through allocations. Track cost per workflow, label cloud-side resources by project, and weigh cloud spend against the value of reduced queue wait times so that bursting decisions remain justified.

Reliability and Failover

Hybrid workflows should handle network interruptions between sites, failed or partial data transfers, jobs that fail on either side, and pipelines that stop midway and must resume without redoing completed steps.

OpenShift’s built-in features (health checks, re-scheduling, and job retries) help on the cloud side, but must be complemented by HPC-savvy mechanisms (checkpoint/restart, scheduler policies) on the HPC side.

Example Hybrid Use Cases

Parameter Sweep Experiments

A portal or notebook on OpenShift generates many input variants, the HPC system runs them as batch arrays (with overflow cases bursting to OpenShift if needed), and aggregated results flow back for comparison and visualization.
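
A minimal sketch of the sweep-generation step is shown below, assuming a toy parameter grid, a placeholder input format, and a Slurm job array on the HPC side; real sweeps would plug in their own solver and design-of-experiments logic.

```python
import itertools
from pathlib import Path

# Hypothetical parameter grid; a real sweep might come from a design-of-experiments tool.
REYNOLDS = [1e4, 5e4, 1e5]
ANGLES = [0, 5, 10, 15]

def generate_sweep(root: str = "sweep") -> int:
    """Write one input deck per case plus a Slurm job-array script for the HPC side."""
    cases = list(itertools.product(REYNOLDS, ANGLES))
    for idx, (re_num, angle) in enumerate(cases):
        case_dir = Path(root) / f"case_{idx:04d}"
        case_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder input format; the solver keywords depend on the actual code.
        (case_dir / "input.dat").write_text(f"reynolds = {re_num}\nangle = {angle}\n")
    # One array index per case; the scheduler fans the cases out across nodes.
    Path(root, "sweep.sbatch").write_text(
        "#!/bin/bash\n"
        f"#SBATCH --array=0-{len(cases) - 1}\n"
        "#SBATCH --ntasks=1\n"
        'cd "$(printf "case_%04d" "$SLURM_ARRAY_TASK_ID")"\n'
        "srun ./solver input.dat\n"
    )
    return len(cases)

if __name__ == "__main__":
    print(f"generated {generate_sweep()} cases")
```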

Digital Twins and Continuous Simulation

Live data is ingested and processed continuously on OpenShift, while periodic high-fidelity simulations or recalibration runs execute on the HPC system and feed updated models back into the cloud-side services.

Collaborative Research Portals

A multi-tenant portal on OpenShift lets researchers define experiments, submit them to shared HPC resources, and browse or share results, with access control handled through OpenShift namespaces and RBAC.

Practical Guidelines for Getting Started

When designing hybrid HPC and cloud workflows with OpenShift:

  1. Start with a single pipeline
    • Choose one real workflow (e.g., pre-process → HPC run → post-process) and implement it end-to-end.
    • Use this as a reference for patterns, tools, and governance.
  2. Standardize on container images
    • Create a small catalog of validated images for important codes and toolchains.
    • Use the same images where possible across HPC (via compatible runtimes) and OpenShift.
  3. Invest in data paths early
    • Define how data will move or be shared.
    • Measure transfer times and bandwidths; identify bottlenecks (see the sketch after this list).
  4. Automate orchestration
    • Use OpenShift-native pipelines or workflow engines to avoid manual glue scripts.
    • Encapsulate HPC submission and monitoring logic into services or Operators.
  5. Iterate on policy and governance
    • Start with simple rules (e.g., only use cloud for post-processing) and refine as you gain experience with performance, cost, and reliability.
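
For point 3, even a crude measurement helps size the data paths before committing to a pattern. The sketch below assumes rclone and a configured remote named hpc:; it times a single test copy and reports the effective throughput.

```python
import subprocess
import time
from pathlib import Path

def measure_transfer(local_path: str, remote: str = "hpc:benchmarks/") -> float:
    """Copy one file to the remote with rclone and return the rate in MiB/s."""
    size_mib = Path(local_path).stat().st_size / (1024 * 1024)
    start = time.monotonic()
    subprocess.run(["rclone", "copy", local_path, remote], check=True)
    elapsed = time.monotonic() - start
    return size_mib / elapsed

if __name__ == "__main__":
    # Use a file large enough (several GiB) that startup overhead does not dominate.
    rate = measure_transfer("/data/testfile.bin")
    print(f"effective throughput: {rate:.1f} MiB/s")
```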

By leveraging OpenShift as a cloud-native orchestration and execution environment, organizations can extend traditional HPC capabilities, balance workloads across infrastructures, and build more flexible and data-centric scientific and engineering workflows.
