GPU Workloads in OpenShift: Concepts and Architecture
GPU support in OpenShift extends the standard container and Kubernetes model to accelerate workloads such as AI/ML, visualization, and certain HPC applications. The key aspects are:
- GPUs are exposed to pods as extended resources (e.g. `nvidia.com/gpu`).
- Access is mediated by node-level drivers and device plugins.
- Scheduling uses resource requests/limits and node labels to place GPU workloads.
- Isolation and security are handled via Security Context Constraints (SCCs) and container runtimes.
In OpenShift, GPU integration is typically implemented using the Node Feature Discovery and GPU Operators, which manage kernel drivers, container runtime hooks, and device plugins in a cluster-native way.
GPU Use Cases on OpenShift in HPC Contexts
In an HPC-oriented OpenShift environment, GPUs are used for:
- Numerical and scientific computing
  - Accelerated linear algebra (e.g. cuBLAS, cuSolver)
  - FFTs, PDE solvers, Monte Carlo simulations
  - Molecular dynamics, computational chemistry, CFD
- AI/ML and data analytics
  - Training deep learning models (TensorFlow, PyTorch, JAX)
  - Inference services at scale
  - Distributed training (e.g. Horovod, PyTorch DDP) using multiple GPUs and nodes
- Visualization and preprocessing
  - GPU-based rendering or postprocessing of simulation results
  - Real-time dashboards backed by GPU-accelerated analytics
- Mixed CPU/GPU workflows
  - Pipelines where CPU-based batch workloads feed or consume GPU-accelerated stages
  - Hybrid MPI-plus-GPU workloads (covered in more depth in the MPI chapter)
The OpenShift cluster becomes a shared platform for both CPU-only and GPU-accelerated jobs, enabling multi-tenant HPC use without exposing low-level node details.
GPU Hardware and Node Roles
GPU-Enabled Node Types
In OpenShift, GPUs are typically confined to specific node roles:
- Compute (worker) nodes with GPUs
  - Labeled with something like `node-role.kubernetes.io/worker-gpu=true`.
  - Run GPU workloads alongside regular pods, or are dedicated to GPU jobs only.
- Specialized accelerator pools
  - Separate MachineSets (on clouds) or node groups (on bare metal) for different GPU types, e.g. `gpu-type=v100`, `gpu-type=a100`, `gpu-type=l4`.
  - Allow fine-grained scheduling and quota enforcement per group or project.
Control plane nodes usually do not host GPUs. All GPU configuration happens on worker nodes or dedicated accelerator nodes.
Node Features and Labelling
To make GPUs schedulable and discoverable:
- Nodes are labeled with:
  - GPU presence: `feature.node.kubernetes.io/pci-10de.present=true`
  - GPU vendor / family: `nvidia.com/gpu.present=true`
  - Custom labels: `gpu=true`, `gpu-memory=80g`, `gpu-generation=ampere`, etc.
- These labels are used for:
  - NodeSelectors / affinity in pod specs.
  - Placement constraints for HPC-style job schedulers integrated with OpenShift.
  - Multi-tenancy and isolation between different workload types.
Node labeling can be automated using Node Feature Discovery (NFD) or done manually where appropriate.
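As an illustration, a pod can target GPU nodes through these labels. The following fragment is a minimal sketch, assuming the NFD label `feature.node.kubernetes.io/pci-10de.present` shown above is present on the cluster's GPU nodes (the pod name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: label-targeted-gpu-pod   # hypothetical example name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # NFD-generated label for the NVIDIA PCI vendor ID 10de
          - key: feature.node.kubernetes.io/pci-10de.present
            operator: In
            values: ["true"]
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```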
OpenShift GPU Software Stack
NVIDIA GPU Operator (Typical Stack Example)
On many OpenShift clusters, GPU support is delivered through the NVIDIA GPU Operator, managed via the Operator Lifecycle Manager (OLM). Key components installed on GPU nodes include:
- Kernel drivers
  - Vendor kernel modules required by the hardware.
  - Automatically kept in sync with node kernel versions where possible.
- Container runtime hooks
  - Runtime libraries and hooks (`nvidia-container-runtime`) that modify containers at start to inject GPU libraries and device files.
- Device plugin
  - Kubernetes device plugin that advertises available GPU resources to the scheduler.
  - Implements resource allocation (e.g. whole GPUs, MIG slices on supported hardware).
- GPU monitoring and telemetry
  - Exporters for metrics (temperature, utilization, memory) into Prometheus and the cluster monitoring stack.
- Optional GPU toolkit images
  - Base images with CUDA, cuDNN, and other libraries to build your own GPU-enabled containers.
While concrete installation steps belong to other chapters, it’s important from an HPC perspective that this stack is automated and reproducible at cluster scale.
Extended Resources: `nvidia.com/gpu` and Friends
GPU resources are exposed as extended resources rather than regular compute resources:
- Resource names commonly used:
  - `nvidia.com/gpu` — full GPUs in a node.
  - Vendor-specific variants or experimental resources (e.g. for MIG slices).
- Pods request GPUs using `resources.requests` and optionally `resources.limits`.
  - The number is integer-valued (fractional GPUs are usually not supported unless a vendor-specific sharing/multiplexing solution is used).
The scheduler matches these requests against node allocations, just like CPU and memory, but with discrete GPU counts.
Scheduling GPU Workloads
Basic Pod Specification for GPU Access
A minimal pod that requests GPUs typically looks like:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    resources:
      requests:
        cpu: "2"
        memory: "8Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "4"
        memory: "16Gi"
        nvidia.com/gpu: "1"
  restartPolicy: Never
```

Key aspects specific to GPUs:
- `nvidia.com/gpu` is requested and limited, typically with the same value.
- No explicit device paths are specified; the device plugin and runtime hooks handle device injection.
- SCCs and runtime configuration must allow use of the vendor runtime and libraries (this is usually managed by the operator and cluster admins).
Placement and Affinity
From an HPC perspective, placement strategies are important to maximize performance:
- Node selectors
  - Constrain workloads to nodes with the right GPU properties:

    ```yaml
    nodeSelector:
      gpu: "true"
      gpu-type: "a100"
    ```

- Pod affinity/anti-affinity
  - Spread or pack workloads across GPU nodes:
    - Spread to avoid contention on shared resources.
    - Pack to leave other nodes free for different workloads.
- Topology-aware scheduling
  - Consider NUMA and PCIe topology to maximize bandwidth and minimize latency:
    - Ensuring that GPU-using pods have sufficient CPU and memory on the same NUMA domain.
    - Using topology-aware features of the GPU operator or node tuning operators (cluster-specific).
In HPC-style clusters, performance testing is often used to derive best-practice placement policies per application class.
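As a hedged illustration of the "spread" strategy, the fragment below uses pod anti-affinity to discourage co-locating pods that share a (hypothetical) `app` label on the same GPU node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-spread-example        # hypothetical name
  labels:
    app: gpu-batch                # hypothetical label used for anti-affinity
spec:
  affinity:
    podAntiAffinity:
      # Soft rule: prefer not to share a node with other pods of the same app
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: gpu-batch
          topologyKey: kubernetes.io/hostname
  containers:
  - name: worker
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```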
Multi-GPU and Distributed Jobs
Multiple GPUs in a single pod or across multiple pods introduce additional concerns:
- Single pod, multi-GPU
  - Request multiple GPUs:

    ```yaml
    resources:
      requests:
        nvidia.com/gpu: "4"
    ```

  - All requested GPUs are visible inside the container; the application or framework (e.g. PyTorch, TensorFlow) decides how to use them.
- Multi-pod, multi-node jobs
  - Use frameworks like MPI or multi-process data-parallel training that run across pods.
  - GPU-aware communication libraries (e.g. NCCL) may require:
    - Host networking or specific network setups (see network-related chapters).
    - Awareness of pod IPs and node topology via a launcher or job controller.
Distributed HPC or DL training is usually coordinated by higher-level controllers (e.g. Job, MPIJob, or custom operators) rather than individual pods.
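Whether that coordination comes from an MPIJob or a plain Kubernetes Job depends on the cluster. As one hedged sketch, an indexed `batch/v1` Job can start several single-GPU worker pods; the image name and training script below are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-workers                  # hypothetical job name
spec:
  completions: 4                     # four worker pods in total
  parallelism: 4                     # start them all at once
  completionMode: Indexed            # each pod gets a stable index 0..3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/team/train:1.0   # placeholder image
        # The training script is assumed to read the JOB_COMPLETION_INDEX
        # environment variable (set by Kubernetes for Indexed Jobs) and use
        # it as its distributed rank.
        command: ["python3", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: "1"
```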
Building GPU-Enabled Container Images
Base Images and Libraries
For GPU workloads on OpenShift:
- Use GPU-enabled base images that match the driver/toolkit stack deployed on the cluster:
- Public registries (e.g. NGC, vendor-provided images).
- Enterprise registries with pre-approved CUDA images.
- Ensure matching:
- CUDA version, cuDNN, NCCL and other libraries.
- Linux distribution and glibc compatibility.
A typical Dockerfile might be:
```dockerfile
FROM nvcr.io/nvidia/cuda:12.2.0-runtime-ubi8

RUN microdnf install -y python39 python39-pip && microdnf clean all
RUN python3.9 -m pip install --no-cache-dir torch torchvision

COPY train.py /workspace/train.py
WORKDIR /workspace

CMD ["python3.9", "train.py"]
```

The GPU operator ensures that the container sees the correct host driver components; the application container only needs the user-space libraries.
Reproducibility and Performance in HPC
For HPC environments:
- Pin library versions
  - Avoid automatic `latest` tags; use fixed image tags and package versions to ensure reproducible runs.
- Minimize container overhead
  - Remove unnecessary packages and daemons.
- Use multi-stage builds
  - Compile code (C/C++/Fortran/CUDA) in one stage, run with a lean base in another (see the sketch below).
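A minimal multi-stage sketch, assuming a hypothetical CUDA source file `solver.cu` compiled with `nvcc` in a devel image and copied into a slimmer runtime image:

```dockerfile
# Build stage: full CUDA toolchain (nvcc, headers)
FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubi8 AS build
WORKDIR /src
COPY solver.cu .
RUN nvcc -O3 -o solver solver.cu

# Runtime stage: only the user-space runtime libraries
FROM nvcr.io/nvidia/cuda:12.2.0-runtime-ubi8
COPY --from=build /src/solver /usr/local/bin/solver
CMD ["/usr/local/bin/solver"]
```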
Container build pipelines (CI/CD) and image promotion flows (dev → test → prod) should be adapted for GPU images, including handling of large image sizes and testing on real GPU nodes.
Security and Access Control with GPUs
Multi-Tenancy Concerns
On a shared HPC/OpenShift cluster:
- Access control
  - Only specific projects/namespaces should be allowed to schedule GPUs.
  - Quotas can limit the number of GPUs per namespace.
- Security Context Constraints
  - GPU workloads may require specific SCCs that allow usage of the GPU runtime without giving unnecessary extra privileges.
  - Operators usually create a dedicated SCC or adjust existing ones to enable GPU access safely.
- Device isolation
  - By default, each pod is granted exclusive access to whole GPUs (no sharing between pods).
  - This provides strong isolation but may lead to underutilization if workloads are small.
Quotas and Limits
Resource quotas can control GPU resource consumption:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: research-team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"
```

Note that quotas for extended resources such as GPUs are expressed on requests only; since extended resources cannot be overcommitted, a separate limits entry is not needed. In HPC environments, quota policies may mimic batch system allocations (e.g. number of GPUs per group, time-limited access windows), although enforcement of time-based policies often requires external tooling or schedulers integrated with OpenShift.
Performance, Tuning, and Monitoring
Performance Considerations
Key HPC-specific considerations for GPU performance on OpenShift:
- CPU/GPU balance
  - Ensure enough CPU cores are allocated for data preprocessing and GPU feeding.
  - Avoid CPU oversubscription that leads to GPU starvation.
- Memory bandwidth and NUMA
  - Align GPU pods with the right NUMA node to minimize memory latency.
  - Use node tuning and topology-aware scheduling where available (see the sketch after this list).
- I/O and storage
  - Ensure high-throughput access to datasets (parallel file systems, object storage).
  - Avoid per-pod small-volume mounts when large shared datasets are needed.
- Overhead of containers
  - Benchmark to compare bare-metal vs. containerized performance.
  - Overhead is usually small if tuned correctly, but HPC workflows should verify this.
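One common lever is requesting GPUs together with integer CPU counts and equal requests and limits, so the pod lands in the Guaranteed QoS class; on nodes configured with a static CPU Manager policy, such pods can receive exclusive CPU cores that the kubelet's Topology Manager can align with the GPU. A minimal sketch of such a resource stanza (the values are illustrative):

```yaml
# Container resources for a Guaranteed-QoS GPU pod:
# requests == limits, integer CPU count, one full GPU.
resources:
  requests:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"
```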
Monitoring GPU Utilization
OpenShift’s monitoring stack, enhanced by the GPU operator, can provide:
- Per-node and per-GPU metrics:
  - Utilization (%), memory usage, temperature, power consumption.
  - Integration with Prometheus and Grafana dashboards.
- Alerting on:
  - Overheating
  - Persistent underutilization (useful for capacity planning)
  - Error conditions (ECC errors, driver problems)
These metrics are essential for HPC capacity management and for validating that scheduling and placement policies actually lead to good GPU utilization.
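As a hedged example, a cluster that deploys the operator's DCGM exporter and allows user-defined alerting could express an underutilization alert roughly like this; the metric name `DCGM_FI_DEV_GPU_UTIL`, the thresholds, and the namespace are assumptions to adapt to the actual exporter and local policy:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts      # hypothetical rule name
  namespace: nvidia-gpu-operator    # assumed operator namespace
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUPersistentlyIdle
      # Fires when the average utilization of a GPU stays below 5% for 6 hours.
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[6h]) < 5
      labels:
        severity: info
      annotations:
        summary: "GPU has been nearly idle for 6 hours; review allocation."
```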
GPU Resource Management Strategies for HPC
Dedicated vs Shared GPU Nodes
Clusters must decide how to balance flexibility and predictability:
- Dedicated GPU nodes
  - Only GPU jobs run there.
  - Simpler performance modeling; predictable environment.
- Mixed-use nodes
  - GPUs may be idle, but nodes can still be used for CPU-only jobs.
  - Requires careful QoS and priority handling to avoid interference.
Priority classes, preemption policies, and node taints/tolerations can enforce these strategies:
- Taint GPU nodes:

```sh
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
```

- Allow GPU workloads to tolerate the taint:

```yaml
tolerations:
- key: "gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
```

Time-Sharing and Fractional Use
Standard Kubernetes/OpenShift exposes GPUs as integer resources. For finer granularity:
- MIG (Multi-Instance GPU)
  - Some GPUs (e.g. A100, H100) can be partitioned into multiple logical instances.
  - The device plugin exposes each MIG slice as a separate resource.
  - Allows multiple pods to share a physical GPU with isolation at the hardware level.
- Software multiplexing
  - Vendor or third-party solutions that allow GPU time-sharing across containers.
  - These typically introduce additional scheduling layers on top of Kubernetes.
These strategies are particularly relevant when running many small inference or interactive workloads on HPC clusters.
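For example, with the device plugin configured for a "mixed" MIG strategy, a pod might request a specific slice type instead of a whole GPU. The exact resource name depends on the chosen MIG profile and plugin configuration, so treat the one below as an assumption:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-example           # hypothetical name
spec:
  containers:
  - name: small-inference
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["sleep", "infinity"]
    resources:
      limits:
        # Assumed MIG resource name for a 1g.5gb slice on an A100
        nvidia.com/mig-1g.5gb: "1"
```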
Integration with HPC and Batch Workflows
While detailed MPI and batch scheduling integration is discussed elsewhere, GPU-specific angles include:
- Job launchers aware of GPU topology
  - MPI or distributed training launchers must map ranks to GPU IDs and hosts.
- Hybrid CPU/GPU batch queues
  - External schedulers (e.g. Slurm integrated with OpenShift) may offer queues that request GPUs from Kubernetes via CRDs.
- Pipeline orchestration
  - GPU-intensive stages (e.g. training) can be one step in a larger pipeline orchestrated via OpenShift Jobs, CronJobs, or CI/CD tools (a sketch follows this list).
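A hedged sketch of a recurring GPU pipeline stage as a `batch/v1` CronJob; the schedule, image, and script names are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-training            # hypothetical name
spec:
  schedule: "0 2 * * *"             # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: train
            image: registry.example.com/team/train:1.0   # placeholder image
            command: ["python3", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: "1"
```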
HPC operators often combine OpenShift’s flexibility with traditional batch semantics by layering job management on top of the platform, especially for large-scale GPU campaigns.
Operational Considerations and Troubleshooting
Common Operational Challenges
In GPU-enabled OpenShift clusters, typical issues include:
- Driver and kernel mismatch
  - Node OS or kernel updates can break GPU drivers if not coordinated with the GPU operator.
- Insufficient permissions or SCC issues
  - Pods failing to see GPUs due to missing runtime hooks or wrong SCC bindings.
- Resource allocation errors
  - Pods stuck `Pending` due to unsatisfiable GPU requests or node selectors.
Operational practices often include:
- Change windows for GPU node updates.
- Canary (bake-in) nodes to test driver and operator upgrades.
- Separate MachineSets for GPU nodes to control update strategies.
Practical Debugging Steps
For GPU workload issues:
- Check pod scheduling status
  - Why is a pod `Pending`? Inspect events for unschedulable reasons.
- Verify node resources
  - Confirm that nodes advertise `nvidia.com/gpu` and available counts.
- Inspect the container environment
  - Inside a running pod:
    - Run `nvidia-smi` (if available) to confirm device visibility.
    - Check CUDA versions and driver compatibility.
- Look at operator and daemonset logs
  - GPU operator pods, device plugin daemonsets, and node-level logs.
- Validate SCC and runtime
  - Ensure the pod is using the right service account and SCC.
In HPC settings, it’s recommended to maintain a simple, known-good “GPU diagnostics” pod spec that can be deployed quickly to validate node functionality.
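A minimal sketch of such a diagnostics pod, assuming the `nvidia-smi` utility is present in the chosen CUDA base image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-diagnostics             # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    # Prints driver version, GPU model, and memory usage; the pod then completes.
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```

If this pod schedules and its logs show the expected GPU, the node drivers, device plugin, and runtime hooks are all working; failures narrow the problem down to scheduling, SCC, or driver issues.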
This chapter focused on how GPUs are integrated into OpenShift from an HPC perspective: the software stack, resource model, scheduling, security, performance tuning, and operational practices. Other chapters address batch workloads, MPI, and higher-level workflow patterns that build on these GPU capabilities.