GPU Workloads in OpenShift: Concepts and Architecture
GPU support in OpenShift extends the standard container and Kubernetes model to accelerate workloads such as AI/ML, visualization, and certain HPC applications. The key aspects are:
- GPUs are exposed to pods as extended resources (e.g. `nvidia.com/gpu`).
- Access is mediated by node-level drivers and device plugins.
- Scheduling uses resource requests/limits and node labels to place GPU workloads.
- Isolation and security are handled via Security Context Constraints (SCCs) and container runtimes.
In OpenShift, GPU integration is typically implemented using the Node Feature Discovery and GPU Operators, which manage kernel drivers, container runtime hooks, and device plugins in a cluster-native way.
GPU Use Cases on OpenShift in HPC Contexts
In an HPC-oriented OpenShift environment, GPUs are used for:
- Numerical and scientific computing
  - Accelerated linear algebra (e.g. cuBLAS, cuSolver)
  - FFTs, PDE solvers, Monte Carlo simulations
  - Molecular dynamics, computational chemistry, CFD
- AI/ML and data analytics
  - Training deep learning models (TensorFlow, PyTorch, JAX)
  - Inference services at scale
  - Distributed training (e.g. Horovod, PyTorch DDP) using multiple GPUs and nodes
- Visualization and preprocessing
  - GPU-based rendering or postprocessing of simulation results
  - Real-time dashboards backed by GPU-accelerated analytics
- Mixed CPU/GPU workflows
  - Pipelines where CPU-based batch workloads feed or consume GPU-accelerated stages
  - Hybrid MPI-plus-GPU workloads (covered in more depth in the MPI chapter)
The OpenShift cluster becomes a shared platform for both CPU-only and GPU-accelerated jobs, enabling multi-tenant HPC use without exposing low-level node details.
GPU Hardware and Node Roles
GPU-Enabled Node Types
In OpenShift, GPUs are typically confined to specific node roles:
- Compute (worker) nodes with GPUs
  - Labeled with something like `node-role.kubernetes.io/worker-gpu=true`.
  - Run GPU workloads alongside regular pods, or are dedicated to GPU jobs only.
- Specialized accelerator pools
  - Separate MachineSets (on clouds) or node groups (on bare metal) for different GPU types, e.g. `gpu-type=v100`, `gpu-type=a100`, `gpu-type=l4`.
  - Allow fine-grained scheduling and quota enforcement per group or project.
Control plane nodes usually do not host GPUs. All GPU configuration happens on worker nodes or dedicated accelerator nodes.
Node Features and Labelling
To make GPUs schedulable and discoverable:
- Nodes are labeled with:
  - GPU presence: `feature.node.kubernetes.io/pci-10de.present=true`
  - GPU vendor / family: `nvidia.com/gpu.present=true`
  - Custom labels: `gpu=true`, `gpu-memory=80g`, `gpu-generation=ampere`, etc.
- These labels are used for:
  - NodeSelectors / affinity in pod specs.
  - Placement constraints for HPC-style job schedulers integrated with OpenShift.
  - Multi-tenancy and isolation between different workload types.
Node labeling can be automated using Node Feature Discovery (NFD) or done manually where appropriate.
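As an illustration, a pod can target GPU nodes through these labels. The following fragment is a minimal sketch, assuming the NFD label `feature.node.kubernetes.io/pci-10de.present` shown above is present on the cluster's GPU nodes (the pod name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: label-targeted-gpu-pod   # hypothetical example name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # NFD-generated label for the NVIDIA PCI vendor ID 10de
          - key: feature.node.kubernetes.io/pci-10de.present
            operator: In
            values: ["true"]
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```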
OpenShift GPU Software Stack
NVIDIA GPU Operator (Typical Stack Example)
On many OpenShift clusters, GPU support is delivered through the NVIDIA GPU Operator, managed via the Operator Lifecycle Manager (OLM). Key components installed on GPU nodes include:
- Kernel drivers
  - Vendor kernel modules required by the hardware.
  - Automatically kept in sync with node kernel versions where possible.
- Container runtime hooks
  - Runtime libraries and hooks (`nvidia-container-runtime`) that modify containers at start to inject GPU libraries and device files.
- Device plugin
  - Kubernetes device plugin that advertises available GPU resources to the scheduler.
  - Implements resource allocation (e.g. whole GPUs, MIG slices on supported hardware).
- GPU monitoring and telemetry
  - Exporters for metrics (temperature, utilization, memory) into Prometheus and the cluster monitoring stack.
- Optional GPU toolkit images
  - Base images with CUDA, cuDNN, and other libraries to build your own GPU-enabled containers.
While concrete installation steps belong to other chapters, it’s important from an HPC perspective that this stack is automated and reproducible at cluster scale.
Extended Resources: `nvidia.com/gpu` and Friends
GPU resources are exposed as extended resources rather than regular compute resources:
- Resource names commonly used:
  - `nvidia.com/gpu` — full GPUs in a node.
  - Vendor-specific variants or experimental resources (e.g. for MIG slices).
- Pods request GPUs using `resources.requests` and optionally `resources.limits`.
  - The number is integer-valued (fractional GPUs are usually not supported unless a vendor-specific sharing/multiplexing solution is used).
The scheduler matches these requests against node allocations, just like CPU and memory, but with discrete GPU counts.
Scheduling GPU Workloads
Basic Pod Specification for GPU Access
A minimal pod that requests GPUs typically looks like:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    resources:
      requests:
        cpu: "2"
        memory: "8Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "4"
        memory: "16Gi"
        nvidia.com/gpu: "1"
  restartPolicy: Never
```

Key aspects specific to GPUs:
- `nvidia.com/gpu` is requested and limited, typically with the same value.
- No explicit device paths are specified; the device plugin and runtime hooks handle device injection.
- SCCs and runtime configuration must allow use of the vendor runtime and libraries (this is usually managed by the operator and cluster admins).
Placement and Affinity
From an HPC perspective, placement strategies are important to maximize performance:
- Node selectors
  - Constrain workloads to nodes with the right GPU properties:

    ```yaml
    nodeSelector:
      gpu: "true"
      gpu-type: "a100"
    ```

- Pod affinity/anti-affinity
  - Spread or pack workloads across GPU nodes:
    - Spread to avoid contention on shared resources.
    - Pack to leave other nodes free for different workloads.
- Topology-aware scheduling
  - Consider NUMA and PCIe topology to maximize bandwidth and minimize latency:
    - Ensuring that GPU-using pods have sufficient CPU and memory on the same NUMA domain.
    - Using topology-aware features of the GPU operator or node tuning operators (cluster-specific).
In HPC-style clusters, performance testing is often used to derive best-practice placement policies per application class.
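As a hedged illustration of the "spread" strategy, the fragment below uses pod anti-affinity to discourage co-locating pods that share a (hypothetical) `app` label on the same GPU node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-spread-example        # hypothetical name
  labels:
    app: gpu-batch                # hypothetical label used for anti-affinity
spec:
  affinity:
    podAntiAffinity:
      # Soft rule: prefer not to share a node with other pods of the same app
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: gpu-batch
          topologyKey: kubernetes.io/hostname
  containers:
  - name: worker
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```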
Multi-GPU and Distributed Jobs
Multiple GPUs in a single pod or across multiple pods introduce additional concerns:
- Single pod, multi-GPU
  - Request multiple GPUs:

    ```yaml
    resources:
      requests:
        nvidia.com/gpu: "4"
    ```

  - All requested GPUs are visible inside the container; the application or framework (e.g. PyTorch, TensorFlow) decides how to use them.
- Multi-pod, multi-node jobs
  - Use frameworks like MPI or multi-process data-parallel training that run across pods.
  - GPU-aware communication libraries (e.g. NCCL) may require:
    - Host networking or specific network setups (see network-related chapters).
    - Awareness of pod IPs and node topology via a launcher or job controller.
Distributed HPC or DL training is usually coordinated by higher-level controllers (e.g. Job, MPIJob, or custom operators) rather than individual pods.
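Whether that coordination comes from an MPIJob or a plain Kubernetes Job depends on the cluster. As one hedged sketch, an indexed `batch/v1` Job can start several single-GPU worker pods; the image name and training script below are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-workers                  # hypothetical job name
spec:
  completions: 4                     # four worker pods in total
  parallelism: 4                     # start them all at once
  completionMode: Indexed            # each pod gets a stable index 0..3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/team/train:1.0   # placeholder image
        # The training script is assumed to read the JOB_COMPLETION_INDEX
        # environment variable (set by Kubernetes for Indexed Jobs) and use
        # it as its distributed rank.
        command: ["python3", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: "1"
```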
Building GPU-Enabled Container Images
Base Images and Libraries
For GPU workloads on OpenShift:
- Use GPU-enabled base images that match the driver/toolkit stack deployed on the cluster:
- Public registries (e.g. NGC, vendor-provided images).
- Enterprise registries with pre-approved CUDA images.
- Ensure matching:
- CUDA version, cuDNN, NCCL and other libraries.
- Linux distribution and glibc compatibility.
A typical Dockerfile might be:
```dockerfile
FROM nvcr.io/nvidia/cuda:12.2.0-runtime-ubi8

RUN microdnf install -y python39 python39-pip && microdnf clean all
RUN python3.9 -m pip install --no-cache-dir torch torchvision

COPY train.py /workspace/train.py
WORKDIR /workspace

CMD ["python3.9", "train.py"]
```

The GPU operator ensures that the container sees the correct host driver components; the application container only needs the user-space libraries.
Reproducibility and Performance in HPC
For HPC environments:
- Pin library versions
  - Avoid automatic `latest` tags; use fixed image tags and package versions to ensure reproducible runs.
- Minimize container overhead
  - Remove unnecessary packages and daemons.
- Use multi-stage builds
  - Compile code (C/C++/Fortran/CUDA) in one stage, run with a lean base in another (see the sketch below).
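A minimal multi-stage sketch, assuming a hypothetical CUDA source file `solver.cu` compiled with `nvcc` in a devel image and copied into a slimmer runtime image:

```dockerfile
# Build stage: full CUDA toolchain (nvcc, headers)
FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubi8 AS build
WORKDIR /src
COPY solver.cu .
RUN nvcc -O3 -o solver solver.cu

# Runtime stage: only the user-space runtime libraries
FROM nvcr.io/nvidia/cuda:12.2.0-runtime-ubi8
COPY --from=build /src/solver /usr/local/bin/solver
CMD ["/usr/local/bin/solver"]
```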
Container build pipelines (CI/CD) and image promotion flows (dev → test → prod) should be adapted for GPU images, including handling of large image sizes and testing on real GPU nodes.
Security and Access Control with GPUs
Multi-Tenancy Concerns
On a shared HPC/OpenShift cluster:
- Access control
  - Only specific projects/namespaces should be allowed to schedule GPUs.
  - Quotas can limit the number of GPUs per namespace.
- Security Context Constraints
  - GPU workloads may require specific SCCs that allow usage of the GPU runtime without giving unnecessary extra privileges.
  - Operators usually create a dedicated SCC or adjust existing ones to enable GPU access safely.
- Device isolation
  - By default, each pod is granted exclusive access to whole GPUs (no sharing between pods).
  - This provides strong isolation but may lead to underutilization if workloads are small.
Quotas and Limits
Resource quotas can control GPU resource consumption:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: research-team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"
```

Note that quotas for extended resources such as GPUs are expressed on requests only; since extended resources cannot be overcommitted, a separate limits entry is not needed. In HPC environments, quota policies may mimic batch system allocations (e.g. number of GPUs per group, time-limited access windows), although enforcement of time-based policies often requires external tooling or schedulers integrated with OpenShift.
Performance, Tuning, and Monitoring
Performance Considerations
Key HPC-specific considerations for GPU performance on OpenShift:
- CPU/GPU balance
  - Ensure enough CPU cores are allocated for data preprocessing and GPU feeding.
  - Avoid CPU oversubscription that leads to GPU starvation.
- Memory bandwidth and NUMA
  - Align GPU pods with the right NUMA node to minimize memory latency.
  - Use node tuning and topology-aware scheduling where available (see the sketch after this list).
- I/O and storage
  - Ensure high-throughput access to datasets (parallel file systems, object storage).
  - Avoid per-pod small-volume mounts when large shared datasets are needed.
- Overhead of containers
  - Benchmark to compare bare-metal vs. containerized performance.
  - Overhead is usually small if tuned correctly, but HPC workflows should verify this.
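One common lever is requesting GPUs together with integer CPU counts and equal requests and limits, so the pod lands in the Guaranteed QoS class; on nodes configured with a static CPU Manager policy, such pods can receive exclusive CPU cores that the kubelet's Topology Manager can align with the GPU. A minimal sketch of such a resource stanza (the values are illustrative):

```yaml
# Container resources for a Guaranteed-QoS GPU pod:
# requests == limits, integer CPU count, one full GPU.
resources:
  requests:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"
```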
Monitoring GPU Utilization
OpenShift’s monitoring stack, enhanced by the GPU operator, can provide:
- Per-node and per-GPU metrics:
  - Utilization (%), memory usage, temperature, power consumption.
  - Integration with Prometheus and Grafana dashboards.
- Alerting on:
  - Overheating
  - Persistent underutilization (useful for capacity planning)
  - Error conditions (ECC errors, driver problems)
These metrics are essential for HPC capacity management and for validating that scheduling and placement policies actually lead to good GPU utilization.
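As a hedged example, a cluster that deploys the operator's DCGM exporter and allows user-defined alerting could express an underutilization alert roughly like this; the metric name `DCGM_FI_DEV_GPU_UTIL`, the thresholds, and the namespace are assumptions to adapt to the actual exporter and local policy:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts      # hypothetical rule name
  namespace: nvidia-gpu-operator    # assumed operator namespace
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUPersistentlyIdle
      # Fires when the average utilization of a GPU stays below 5% for 6 hours.
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[6h]) < 5
      labels:
        severity: info
      annotations:
        summary: "GPU has been nearly idle for 6 hours; review allocation."
```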
GPU Resource Management Strategies for HPC
Dedicated vs Shared GPU Nodes
Clusters must decide how to balance flexibility and predictability:
- Dedicated GPU nodes
  - Only GPU jobs run there.
  - Simpler performance modeling; predictable environment.
- Mixed-use nodes
  - GPUs may be idle, but nodes can still be used for CPU-only jobs.
  - Requires careful QoS and priority handling to avoid interference.
Priority classes, preemption policies, and node taints/tolerations can enforce these strategies:
- Taint GPU nodes:

```sh
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
```

- Allow GPU workloads to tolerate the taint:

```yaml
tolerations:
- key: "gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
```

Time-Sharing and Fractional Use
Standard Kubernetes/OpenShift exposes GPUs as integer resources. For finer granularity:
- MIG (Multi-Instance GPU)
  - Some GPUs (e.g. A100, H100) can be partitioned into multiple logical instances.
  - The device plugin exposes each MIG slice as a separate resource.
  - Allows multiple pods to share a physical GPU with isolation at the hardware level.
- Software multiplexing
  - Vendor or third-party solutions that allow GPU time-sharing across containers.
  - These typically introduce additional scheduling layers on top of Kubernetes.
These strategies are particularly relevant when running many small inference or interactive workloads on HPC clusters.
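For example, with the device plugin configured for a "mixed" MIG strategy, a pod might request a specific slice type instead of a whole GPU. The exact resource name depends on the chosen MIG profile and plugin configuration, so treat the one below as an assumption:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-example           # hypothetical name
spec:
  containers:
  - name: small-inference
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["sleep", "infinity"]
    resources:
      limits:
        # Assumed MIG resource name for a 1g.5gb slice on an A100
        nvidia.com/mig-1g.5gb: "1"
```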
Integration with HPC and Batch Workflows
While detailed MPI and batch scheduling integration is discussed elsewhere, GPU-specific angles include:
- Job launchers aware of GPU topology
  - MPI or distributed training launchers must map ranks to GPU IDs and hosts.
- Hybrid CPU/GPU batch queues
  - External schedulers (e.g. Slurm integrated with OpenShift) may offer queues that request GPUs from Kubernetes via CRDs.
- Pipeline orchestration
  - GPU-intensive stages (e.g. training) can be one step in a larger pipeline orchestrated via OpenShift Jobs, CronJobs, or CI/CD tools (a sketch follows this list).
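A hedged sketch of a recurring GPU pipeline stage as a `batch/v1` CronJob; the schedule, image, and script names are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-training            # hypothetical name
spec:
  schedule: "0 2 * * *"             # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: train
            image: registry.example.com/team/train:1.0   # placeholder image
            command: ["python3", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: "1"
```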
HPC operators often combine OpenShift’s flexibility with traditional batch semantics by layering job management on top of the platform, especially for large-scale GPU campaigns.
Operational Considerations and Troubleshooting
Common Operational Challenges
In GPU-enabled OpenShift clusters, typical issues include:
- Driver and kernel mismatch
  - Node OS or kernel updates can break GPU drivers if not coordinated with the GPU operator.
- Insufficient permissions or SCC issues
  - Pods failing to see GPUs due to missing runtime hooks or wrong SCC bindings.
- Resource allocation errors
  - Pods stuck `Pending` due to unsatisfiable GPU requests or node selectors.
Operational practices often include:
- Change windows for GPU node updates.
- Canary (bake-in) nodes to test driver and operator upgrades.
- Separate MachineSets for GPU nodes to control update strategies.
Practical Debugging Steps
For GPU workload issues:
- Check pod scheduling status
  - Why is a pod `Pending`? Inspect events for unschedulable reasons.
- Verify node resources
  - Confirm that nodes advertise `nvidia.com/gpu` and available counts.
- Inspect the container environment
  - Inside a running pod:
    - Run `nvidia-smi` (if available) to confirm device visibility.
    - Check CUDA versions and driver compatibility.
- Look at operator and daemonset logs
  - GPU operator pods, device plugin daemonsets, and node-level logs.
- Validate SCC and runtime
  - Ensure the pod is using the right service account and SCC.
In HPC settings, it’s recommended to maintain a simple, known-good “GPU diagnostics” pod spec that can be deployed quickly to validate node functionality.
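A minimal sketch of such a diagnostics pod, assuming the `nvidia-smi` utility is present in the chosen CUDA base image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-diagnostics             # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    # Prints driver version, GPU model, and memory usage; the pod then completes.
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```

If this pod schedules and its logs show the expected GPU, the node drivers, device plugin, and runtime hooks are all working; failures narrow the problem down to scheduling, SCC, or driver issues.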
This chapter focused on how GPUs are integrated into OpenShift from an HPC perspective: the software stack, resource model, scheduling, security, performance tuning, and operational practices. Other chapters address batch workloads, MPI, and higher-level workflow patterns that build on these GPU capabilities.