Positioning OpenShift in HPC Environments
OpenShift is usually associated with cloud-native, microservice-based, stateless applications. High-Performance Computing (HPC) traditionally focuses on tightly-coupled, performance-sensitive workloads running on bare metal or specialized schedulers (e.g., Slurm, PBS, LSF). This chapter focuses on how OpenShift can complement or extend HPC environments, not replace them outright.
Key differences from traditional HPC environments:
- Scheduling focus
- Traditional HPC: Job-centric (batch queues, walltime, MPI job launchers).
- OpenShift: Pod-centric, focused on long-running services and elastic scaling.
- Performance expectations
- HPC: Predictable, low-jitter, high-throughput interconnects, minimal abstraction layers.
- OpenShift: Abstractions for portability and multi-tenancy, which can add overhead but also improve manageability and elasticity.
- Resource abstraction
- HPC: Nodes are typically visible to users via job schedulers and environment modules.
- OpenShift: Users see Kubernetes resources (Pods, Deployments, Jobs), not physical nodes.
This chapter concentrates on how to align OpenShift with HPC goals: performance, throughput, specialized hardware usage, and integration with existing HPC stacks.
HPC-Oriented Workload Patterns on OpenShift
Types of HPC and Specialized Workloads
Common workload types that map to OpenShift in HPC contexts include:
- Embarrassingly parallel batch jobs
Independent tasks (parameter sweeps, Monte Carlo simulations, data partition processing) that can run as many similar Pods or Jobs.
- Loosely coupled pipelines
Multi-stage workflows (e.g., preprocess → simulate → post-process → analyze) that can be implemented with batch Jobs, Pipelines, and event triggers.
- Tightly coupled parallel jobs
Workloads using MPI or other parallel frameworks that require a low-latency interconnect and coordinated job startup.
- GPU-accelerated or specialized hardware jobs
Deep learning, molecular dynamics, computational fluid dynamics, or rendering tasks leveraging GPUs or other accelerators.
- Hybrid analytics / simulation workflows
Combining HPC-style simulation with downstream data analytics, visualization, or AI/ML stages running as services.
Understanding which pattern applies helps decide how to best map jobs to OpenShift-related resources and what trade-offs to accept.
When OpenShift Makes Sense in HPC
OpenShift tends to be most useful when you need:
- Multi-tenant, self-service access
Different research groups or teams can deploy, run, and manage their workloads with isolation and quota management.
- Elastic capacity and hybrid cloud bursting
Dynamically scaling to cloud resources when on-prem clusters are full, using the same application packaging model (containers).
- Reproducible, portable environments
Container images encode dependencies, libraries, and toolchains, reducing “works on one cluster but not another” problems.
- Integration with modern services
HPC workloads that must interact with data services, message queues, APIs, or web-based UIs fit naturally into OpenShift.
Where ultra-low latency and deterministic performance are absolutely critical, some tightly coupled jobs may still be best on a traditional bare-metal HPC scheduler, possibly orchestrated alongside OpenShift rather than inside it.
Mapping HPC Concepts to OpenShift Concepts
HPC users often think in terms of nodes, queues, and batch scripts. On OpenShift:
- Job script ➝ Kubernetes Job / CronJob
Instead of sbatch job.sh, a YAML Job resource defines the command, image, resources, and restart policy (a minimal example follows this list).
- Queue ➝ Namespace + ResourceQuota + PriorityClasses
Different queues with different limits can be approximated by Projects/Namespaces with enforced quotas and priorities.
- Project allocation ➝ ResourceQuota & LimitRange
Accounting for per-project CPU, memory, and storage usage.
- Modules / software stacks ➝ Container images
Application environments are baked into images instead of loaded at runtime via modules.
- Node types ➝ Node labels, taints, tolerations
GPU nodes, high-memory nodes, and high-IO nodes are exposed via labels and targeted via Pod placement rules.
This conceptual mapping is crucial for onboarding HPC users to OpenShift without overloading them with Kubernetes internals.
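To make the first mapping concrete, the sketch below shows roughly what a single sbatch-style batch step could look like as a Kubernetes Job. The namespace, image, command, and resource figures are placeholders, not recommendations.

```yaml
# Hypothetical equivalent of "sbatch job.sh": one task, fixed resources,
# no automatic retries. Names, image, and sizes are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: param-sweep-task-001
  namespace: chem-group          # plays roughly the role of a queue/allocation
spec:
  backoffLimit: 0                # do not retry failed tasks automatically
  activeDeadlineSeconds: 14400   # rough analogue of a 4-hour walltime
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: solver
        image: registry.example.com/chem/solver:1.4   # placeholder image
        command: ["./run_case.sh", "--case", "001"]
        resources:
          requests:
            cpu: "8"
            memory: 16Gi
          limits:
            cpu: "8"
            memory: 16Gi
```

Submitting the manifest with oc create -f and watching it with oc get jobs then plays roughly the role of sbatch and squeue.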
Performance Considerations for HPC on OpenShift
Overhead and Jitter
Containerization adds some overhead compared with bare metal, but with modern container runtimes and proper tuning the difference can be small. In HPC contexts the primary concerns are:
- Scheduling jitter: Pod startup latency may vary.
- Network jitter: Overlay networking and SDN layers can increase latency.
- Filesystem overhead: Persistent volumes, CSI plugins, and network file systems introduce variability.
Strategies to reduce this impact include:
- Dedicated performance-oriented node pools
Use labels and taints to isolate HPC nodes from noisy neighbors, and run only performance-sensitive Pods there.
- Minimal base images and optimized runtimes
Use slim container images and runtime options tuned for numeric workloads.
- Affinity, anti-affinity, and topology awareness
Use podAntiAffinity and topologySpreadConstraints to control how Pods share nodes and NUMA domains (a sketch follows this list).
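As a rough sketch of the last point, the Pod-template fragment below spreads replicas of a workload across zones and keeps two replicas off the same node; the label key/value and topology keys are illustrative assumptions, and the right constraints depend on the actual cluster topology.

```yaml
# Illustrative Pod-template fragment: spread replicas evenly across zones
# and avoid co-locating two replicas on one node. Labels are placeholders.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: mc-sweep
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: mc-sweep
```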
Networking and Interconnects
For many HPC applications, especially MPI-based workloads, network performance and topology matter:
- Low-latency fabrics
Access to RDMA, InfiniBand, or high-speed Ethernet may be required (an illustrative Pod spec appears at the end of this subsection). This often involves:
- Using specialized CNI plugins that support SR-IOV or RDMA.
- Exposing host networking capabilities into Pods via device plugins or host networking modes (where appropriate).
- Intra-node vs inter-node traffic
Performance is often much better when Pods that need heavy communication are co-located on the same node or NUMA domain. Use nodeAffinity and resource requests/limits to influence placement.
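Where a secondary high-performance network has been configured (for example via SR-IOV and Multus), Pods typically attach to it through an annotation and, depending on the setup, an extended resource request. The network name and resource name below are placeholders defined by the cluster's own configuration, so treat this purely as a sketch.

```yaml
# Illustrative only: attach a Pod to a pre-configured SR-IOV network.
# "sriov-hpc-net" and "openshift.io/hpc_vf" are placeholder names that
# would come from the cluster's SR-IOV/Multus configuration.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker-0
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-hpc-net
spec:
  containers:
  - name: rank
    image: registry.example.com/hpc/mpi-app:5.1   # placeholder image
    resources:
      requests:
        openshift.io/hpc_vf: "1"   # virtual function exposed by a device plugin (placeholder)
      limits:
        openshift.io/hpc_vf: "1"
```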
Storage and I/O Patterns
HPC workloads can be I/O bound:
- Parallel file systems (e.g., Lustre, GPFS, BeeGFS) may be exposed as persistent volumes.
- Local SSDs or NVMe may be used as ephemeral scratch space, potentially exposed as emptyDir volumes or local PVs (see the sketch below).
Choosing the right storage type for each phase (scratch vs long-term, local vs network) is essential for keeping jobs performant.
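As a sketch of combining the two, a Pod can mount node-local scratch for intermediate files alongside a persistent volume for results; the volume names, sizes, and claim name are placeholders.

```yaml
# Illustrative Pod fragment: fast node-local scratch via emptyDir plus a
# persistent volume (e.g., backed by a parallel file system) for results.
spec:
  containers:
  - name: sim
    image: registry.example.com/hpc/sim:2.0   # placeholder image
    volumeMounts:
    - name: scratch
      mountPath: /scratch
    - name: results
      mountPath: /results
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 200Gi                 # bounded node-local scratch space
  - name: results
    persistentVolumeClaim:
      claimName: project-results-pvc   # placeholder claim name
```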
Integrating OpenShift with Existing HPC Schedulers
Most organizations do not discard their existing HPC schedulers; instead, they:
- Use OpenShift as a complementary platform for:
- Pre/post-processing.
- Data ingest/egress.
- Interactive analysis and visualization.
- AI/ML tasks that consume or produce data for classical simulations.
- Treat OpenShift as another “execution partition”
Certain workloads are submitted to OpenShift instead of, or in addition to, the traditional scheduler.
Integration patterns include:
- Scheduler-to-OpenShift bridges
Custom scripts or integrations where a Slurm job launcher submits workloads to OpenShift via the API or CLI (oc) as part of a job step.
- Shared storage and identity
Use the same identity provider and shared filesystems so that data is accessible on both traditional nodes and OpenShift nodes.
- Workflow managers
Tools like CWL, Nextflow, Snakemake, or other workflow systems can target multiple backends (HPC scheduler and Kubernetes/OpenShift) simultaneously, orchestrating cross-platform workflows.
Specialized Hardware on OpenShift for HPC
GPU-Accelerated Workloads
OpenShift supports GPUs primarily through Kubernetes device plugins and specialized node configurations. In an HPC context this enables:
- GPU nodes as a distinct pool
GPU-equipped nodes are labeled (e.g., gpu=true), tainted to prevent accidental use, and targeted by GPU-requiring Pods.
- Containerized GPU applications
Deep learning frameworks (TensorFlow, PyTorch), GPU-accelerated libraries (cuBLAS, cuDNN), and domain-specific codes (e.g., GROMACS with GPU support) are packaged in container images.
Key considerations:
- Driver and runtime alignment
Host drivers must be compatible with the CUDA or other GPU libraries inside the container images. Operators are typically used to manage driver lifecycles.
- Resource requests for GPUs
Users request GPUs as resources (e.g., nvidia.com/gpu: 1) in their Pod specs, enabling fair sharing and accurate scheduling (see the example after this list).
- Multi-tenancy and sharing strategies
Some environments may use GPU sharing technologies (e.g., time-slicing, MIG) to increase utilization, exposed via the device plugin.
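Putting the labeling, tainting, and resource-request points together, a GPU job might look roughly like the sketch below. The node label, taint key, and image are assumptions specific to this example; nvidia.com/gpu is the resource name commonly exposed by the NVIDIA device plugin.

```yaml
# Illustrative GPU Job: schedules only onto labeled/tainted GPU nodes and
# requests a single device. Label, taint key, and image are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: md-gpu-run
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        gpu: "true"                 # matches the example label above
      tolerations:
      - key: hpc.example.com/gpu    # placeholder taint key
        operator: Exists
        effect: NoSchedule
      containers:
      - name: gromacs
        image: registry.example.com/hpc/gromacs-gpu:2024   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1       # one GPU per Pod
```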
Other Accelerators and Specialized Devices
Beyond GPUs, HPC workloads may depend on:
- FPGAs
- High-speed network adapters (RDMA, InfiniBand)
- Specialized cryptographic or compression accelerators
These are typically integrated through:
- Device plugins to expose them to Pods as schedulable resources.
- Node labels and taints to ensure that only jobs that need them land on those nodes.
- Custom Operators to manage firmware, device configuration, and lifecycle where appropriate.
Running Tightly Coupled Parallel Workloads
MPI and Process Launching
MPI jobs often expect:
- A way to discover all ranks (nodes/processes) in the job.
- Low-latency communication between ranks.
- A “job launcher” (e.g., mpirun or srun) that coordinates processes.
On OpenShift, typical patterns include:
- MPI job orchestrators
Use Kubernetes-native MPI job controllers (often provided by Operators) that (an illustrative resource follows this list):
- Create a head/launcher Pod plus multiple worker Pods.
- Configure hostnames and addresses so that MPI ranks can discover each other.
- Optionally integrate with specialized networking for RDMA.
- StatefulSets or custom controllers
For stable Pod identities and predictable network names, useful for some MPI setups.
- Job lifecycles aligned with Kubernetes Jobs
The MPI job is encapsulated as a single Kubernetes Job or custom resource, so its completion and logs integrate with the rest of the platform.
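To make the first pattern concrete, the sketch below shows roughly what such a job looks like with one widely used controller, the Kubeflow MPI Operator and its MPIJob resource. API versions and field names differ between operators and releases, and the image, rank counts, and resources are placeholders.

```yaml
# Illustrative MPIJob (Kubeflow MPI Operator style): one launcher Pod runs
# mpirun, four worker Pods host the ranks. Image and sizes are placeholders.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: cfd-strong-scaling
spec:
  slotsPerWorker: 8                 # ranks per worker Pod
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/hpc/cfd-mpi:5.1   # placeholder image
            command: ["mpirun", "-np", "32", "./solver", "case.cfg"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker
            image: registry.example.com/hpc/cfd-mpi:5.1
            resources:
              limits:
                cpu: "8"
                memory: 32Gi
```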
Limitations and Trade-Offs
Tightly coupled MPI jobs are sensitive to:
- Node heterogeneity: Mixed hardware can impact performance. Scheduling policies must ensure homogeneous rank placement where necessary.
- Network abstraction overhead: Overlay networks may not be suitable for high-end MPI; direct access to underlying fabrics is often needed.
- Startup and scaling behavior: Container-based job launch may be slower than tightly integrated bare-metal batch systems, though often acceptable for long-running simulations.
Organizations commonly start by moving loosely coupled and moderately coupled workloads first, then selectively move or co-locate highly coupled workloads where performance is acceptable.
Hybrid HPC and Cloud-Native Workflows on OpenShift
Multi-Stage Scientific Workflows
Many scientific and engineering workloads are naturally multi-stage:
- Data acquisition / ingestion (from instruments, sensors, or external datasets).
- Pre-processing and quality control.
- Simulation or heavy computation (possibly on traditional HPC or on OpenShift).
- Post-processing and reduction.
- Analysis, visualization, and reporting (often interactive).
- Archival and data publishing.
OpenShift excels at the stages that are:
- Event-driven or service-oriented.
- Data- and API-centric rather than pure compute.
- Short-lived or bursty jobs that can scale elastically.
- Interactive and user-facing (dashboards, notebooks, portals).
Orchestrating Hybrid Workloads
Some common patterns:
- Cloud-bursting HPC jobs
When the main on-prem HPC cluster is full, selected workloads are run on OpenShift clusters backed by cloud infrastructure. Container images ensure the environment is consistent.
- Simulation on HPC, analytics on OpenShift
Classical simulation runs on the bare-metal HPC cluster; results are pushed into storage accessible from OpenShift. Users then:
- Launch notebooks and data analytics tools on OpenShift.
- Expose web-based visualization services.
- Run ML training on GPUs managed by OpenShift.
- Workflow engines targeting multiple backends
A workflow manager coordinates tasks across:
- Slurm or another scheduler for heavy compute.
- OpenShift for data preparation, microservices, and post-processing.
Tasks share data via a common storage layer or data transfer tools.
Data Management and Movement
Because HPC-scale datasets can be huge, minimizing data movement is crucial:
- Prefer shared, high-speed storage mounted on both the HPC cluster and OpenShift when possible.
- Use data locality-aware scheduling on OpenShift (e.g., place compute Pods near their data).
- Handle data lifecycle (temporary vs long-term) with a combination of:
- Ephemeral scratch volumes.
- Persistent volumes for reproducible outputs.
- Object storage for large archival datasets.
User Experience and Enablement for HPC on OpenShift
Adapting the HPC User Mindset
HPC users are accustomed to:
- Submitting batch scripts.
- Using interactive shells on login nodes.
- Loading modules for different software stacks.
Transitioning them to OpenShift may involve:
- Templates and job generators
Provide ready-to-use YAML templates that map directly from common HPC batch script patterns (e.g., “N tasks, T hours, M GiB memory per task”).
- Portals and UIs
Layer web portals or science gateways on top of OpenShift to submit jobs, monitor progress, and visualize results, shielding users from low-level YAML.
- Pre-built images per domain
Offer curated, domain-specific container images that mirror the modules they’re accustomed to (e.g., “chemistry”, “CFD”, “bioinformatics” images).
Governance, Quotas, and Fair Use
OpenShift’s multi-tenant features are crucial when many research groups share a cluster:
- Use Projects/Namespaces to separate groups or projects.
- Apply ResourceQuota and LimitRange to manage CPU, memory, storage, and GPU usage (an example follows below).
- Implement priority classes and preemption policies consistent with institutional policies.
This preserves the fair-sharing and accounting properties that HPC centers require, while giving teams self-service capabilities.
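As an illustration of per-project limits, a quota covering CPU, memory, GPUs, and storage might look like the sketch below; the namespace and all values are placeholders to be set per institutional policy.

```yaml
# Illustrative per-project quota for one research group's namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hpc-group-quota
  namespace: bio-lab               # placeholder project/namespace
spec:
  hard:
    requests.cpu: "512"
    requests.memory: 2Ti
    requests.nvidia.com/gpu: "8"   # quota key for an extended GPU resource
    requests.storage: 50Ti         # total PVC storage requests
    pods: "500"
```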
Design Principles and Best Practices for HPC on OpenShift
- Start with the right workloads
Begin with embarrassingly parallel and data-intensive workflows that benefit from elasticity and containerization, before tackling the most tightly coupled MPI jobs.
- Keep scheduling hardware-aware
Make deliberate use of node labels, taints, tolerations, and affinities for GPUs, high-memory nodes, and specialized interconnect nodes.
- Optimize container images for performance
Use minimal base images, link against optimized math libraries, and ensure alignment between host drivers and container libraries (especially for GPUs).
- Use Operators for complex stacks
Leverage Operators to manage complex data services, GPU drivers, MPI job frameworks, and other recurring platform components.
- Integrate with existing HPC processes
Align identity, storage, accounting, and operational processes so users can move between the traditional HPC world and OpenShift with minimal friction.
- Measure and tune
Benchmark representative applications on OpenShift versus traditional HPC nodes. Use the results to:
- Decide where performance is acceptable.
- Identify tuning opportunities.
- Communicate realistic expectations to users.
By treating OpenShift as a complementary platform for HPC and specialized workloads—rather than a one-to-one replacement for traditional schedulers—you can combine the strengths of both approaches: the raw performance and specialized hardware of HPC with the flexibility, automation, and modern development workflows of cloud-native platforms.