Why MPI in Containers on OpenShift Is Different
Running MPI and other tightly coupled parallel workloads in containers on OpenShift introduces a few unique considerations compared to traditional bare‑metal HPC:
- The network path between ranks goes through Kubernetes/OpenShift abstractions (pods, CNI networking, services).
- Pods are scheduled dynamically; there is no fixed hostfile prepared by an admin.
- The lifecycle of an MPI job (start, run, fail, reschedule) needs to work with Kubernetes jobs/controllers.
- You must balance HPC‑style performance tuning with OpenShift’s security and multi‑tenant model.
This chapter focuses on these container‑ and OpenShift‑specific aspects, not on MPI fundamentals themselves.
Containerizing MPI Applications
MPI runtimes and base images
Typical MPI stacks to containerize:
- Open MPI
- MPICH (and derivatives like Intel MPI, MVAPICH, etc.)
Common approaches:
- Use a distribution’s MPI packages in the image (e.g., RHEL/Ubi):
- Pros: stable, tested with OS.
- Cons: may lag behind latest MPI.
- Build MPI from source in the image:
- Pros: full control over configuration and optimizations.
- Cons: more complex build and maintenance.
Key containerization rules:
- Avoid mixing host and container MPI: the MPI library inside the container (libmpi.so) must match the mpirun/mpiexec used to launch ranks.
- Use a single MPI implementation/version in a given image.
- Include all required libraries (MPI, math libs, domain‑specific libs) in the image; do not rely on host filesystem.
Image layout and multi‑stage builds
Because MPI applications can be large and require compilers, use multi‑stage builds:
- Build stage: compilers, build tools, headers, MPI devel packages.
- Runtime stage: only the compiled binaries, necessary MPI runtime libraries, and minimal OS packages.
Example Dockerfile pattern (simplified):
# Build stage: compilers, headers, and the MPI devel packages
FROM registry.access.redhat.com/ubi9/ubi as build
RUN yum install -y openmpi-devel make gcc && yum clean all
ENV PATH=/usr/lib64/openmpi/bin:$PATH \
    LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
WORKDIR /src
COPY . .
RUN mpicc -O3 -o my_mpi_app main.c
# Runtime stage: only the compiled binary and the MPI runtime libraries
FROM registry.access.redhat.com/ubi9/ubi-minimal
RUN microdnf install -y openmpi && microdnf clean all
ENV PATH=/usr/lib64/openmpi/bin:$PATH \
    LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
COPY --from=build /src/my_mpi_app /usr/local/bin/my_mpi_app
ENTRYPOINT ["my_mpi_app"]
Points specific to MPI:
- Ensure PATH and LD_LIBRARY_PATH in the final image expose the MPI runtime.
- Avoid embedding scheduler‑specific paths/assumptions (e.g., no hard‑coded /opt/slurm).
Process model inside containers
Each MPI rank is just a Linux process inside a pod’s container or containers. Common patterns:
- One rank per pod:
- Simplifies scaling and failure handling.
- Better aligns with Kubernetes scheduling and resource accounting.
- Multiple ranks per pod:
- Can reduce container and orchestration overhead.
- Useful for tightly packed node‑local processes (e.g., hybrid MPI+OpenMP).
Choosing between these depends on:
- How much node‑level process placement control you need.
- How you plan to integrate with OpenShift autoscaling and fault‑tolerance.
Networking Considerations for MPI on OpenShift
MPI performance and correctness are heavily influenced by how pods are networked.
Pod‑to‑pod connectivity
MPI requires:
- Full bidirectional TCP connectivity among all ranks.
- Stable addressing for the lifetime of the job.
In OpenShift, this means:
- MPI ranks typically communicate directly via pod IPs, not Services.
- Services load‑balance and NAT traffic, which adds latency and can break assumptions some MPI implementations make.
Common patterns:
- Use an MPI launcher that discovers all pod IPs at runtime (via environment variables, a ConfigMap, a head pod, or an MPI Operator).
- Avoid ClusterIP/Route/Ingress in the hot MPI data path; they are for control and external access, not rank‑to‑rank communication.
High‑performance fabrics (InfiniBand, RDMA, SR‑IOV)
For HPC‑grade performance, MPI often uses:
- InfiniBand or RoCE for low‑latency, high‑bandwidth transport.
- RDMA and GPU‑direct (for GPU workloads).
On OpenShift, this usually involves:
- Device plugins / specialized Operators to expose RDMA or SR‑IOV network interfaces into pods.
- Appropriate CNI plugins that understand these devices.
Container‑specific constraints:
- The container must include the correct MPI build and transport libraries (e.g., Open MPI built with UCX/OFI).
- Pods using RDMA/SR‑IOV may require specific node labels and tolerations to land on capable nodes.
- Security context constraints can limit device access; MPI pods might use a more permissive SCC provided by an Operator.
If RDMA is not available or not exposed into pods, MPI will fall back to TCP over the cluster network, which is usually fine for modest‑scale or “throughput‑oriented” HPC, but not ideal for latency‑sensitive tightly coupled codes.
Job Orchestration Patterns for MPI on OpenShift
The main challenge is mapping MPI’s notion of “ranks” and “hosts” to OpenShift’s pods and controllers.
Using Jobs and Pods directly
Minimal pattern:
- A Job creates a fixed number of identical pods.
- One pod acts as the launcher (“head” pod) and runs mpirun or mpiexec.
- The launcher discovers other pods’ IPs and starts ranks accordingly.
Typical sequence:
- Deploy N worker pods via a Job or a separate Deployment.
- When workers are ready, the launcher pod:
- Queries the API (or reads a ConfigMap/Downward API) to obtain all worker pod IPs.
- Builds a hostfile dynamically.
- Invokes mpirun -np N --hostfile hosts.txt my_mpi_app (a launcher sketch follows below).
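A minimal sketch of this launcher pattern, under several assumptions: the worker pods carry a hypothetical app=mpi-worker label, the launcher image includes the oc CLI, its ServiceAccount is allowed to list pods in the namespace, and a remote‑launch mechanism between launcher and workers (for example ssh) is already configured for mpirun:
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-launcher
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: mpi-launcher             # assumed ServiceAccount with RBAC to list pods
      restartPolicy: Never
      containers:
      - name: launcher
        image: registry.example.com/my-mpi-app:1.2   # placeholder image
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Collect worker pod IPs into a hostfile, then start one rank per worker.
          oc get pods -l app=mpi-worker \
            -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' > /tmp/hosts.txt
          mpirun -np "$(wc -l < /tmp/hosts.txt)" --hostfile /tmp/hosts.txt my_mpi_app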
Considerations:
- You must handle startup ordering so that workers are ready when the launcher starts.
- If any worker pod fails, the job may stall or fail unless explicitly handled.
MPI Operators and MPIJob CRDs
An MPI Operator introduces a Custom Resource Definition (CRD), often named MPIJob, which automates orchestration:
- You specify:
- The MPI image (and optionally a separate launcher image).
- Number of worker replicas.
- Resources per pod.
- MPI command to execute.
- The Operator:
- Creates launcher and worker pods.
- Manages pod placement and lifecycle.
- Sets up hostfile/environment for MPI.
- Collects logs and job status.
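For illustration, a trimmed MPIJob manifest in the style of the Kubeflow MPI Operator’s v2beta1 API; other MPI Operators use different field names, and the image is a placeholder:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: my-mpi-job
spec:
  slotsPerWorker: 1                    # one MPI slot (rank) per worker pod
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/my-mpi-app:1.2   # placeholder
            command: ["mpirun", "-np", "4", "my_mpi_app"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker
            image: registry.example.com/my-mpi-app:1.2   # placeholder
            resources:
              requests:
                cpu: "4"
                memory: 8Gi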
Benefits specific to OpenShift:
- You get a declarative YAML description of the MPI workload, aligned with the rest of your applications.
- Cluster admins can apply policy/quotas at the MPIJob level.
- Easier integration with CI/CD and GitOps workflows.
Limitations:
- Currently focused mostly on batch‑style MPI jobs, not long‑running services.
- Some Operators assume particular MPI distributions; you must match your image accordingly.
Handling failures and retries
In an HPC job scheduler, node or rank failure may abort the whole job or trigger specific recovery logic. In Kubernetes/OpenShift:
- Pods that fail will typically be restarted by Jobs/Deployments.
- MPI libraries often treat the loss of a rank as a fatal error.
Patterns to align MPI with OpenShift behavior:
- Configure Jobs to not restart pods that exit with certain codes, if you want failures to surface clearly.
- Use checkpoint/restart capabilities at the application or MPI level:
- Write checkpoints to persistent storage.
- On failure, resubmit a new MPIJob that resumes from the latest checkpoint.
- Capture logs and exit codes via standard OpenShift logging/monitoring to understand job behavior.
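As a sketch of the “surface failures clearly” option, a Job can be told not to retry and to fail outright on a given launcher exit code; this relies on the Kubernetes podFailurePolicy field (available in recent Kubernetes/OpenShift releases), and the exit code shown is only illustrative:
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-run
spec:
  backoffLimit: 0                      # do not retry the launcher pod automatically
  podFailurePolicy:
    rules:
    - action: FailJob                  # fail the whole Job immediately...
      onExitCodes:
        containerName: launcher
        operator: In
        values: [1]                    # ...when the launcher exits with this code (illustrative)
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: launcher
        image: registry.example.com/my-mpi-app:1.2   # placeholder
        command: ["mpirun", "-np", "4", "my_mpi_app"]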
Resource Management and Scheduling
Parallel HPC jobs are resource‑hungry and sensitive to placement.
CPU and memory requests/limits
For each MPI pod:
- Set resources.requests for CPU and memory to guarantee capacity.
- Use resources.limits deliberately:
- Limits that are too tight can cause CPU throttling or OOMKills, which MPI does not tolerate well.
- Some HPC users prefer no CPU limit but a clear memory limit.
Alignment with MPI:
- Match -np, processes‑per‑node, and OpenMP thread counts to the allocated cores:
- For one rank per pod, requests.cpu often equals the cores per rank.
- For multiple ranks per pod, ensure ranks_per_pod * cores_per_rank does not exceed the pod’s allocatable CPUs (see the fragment below).
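A container fragment along these lines, with placeholder numbers and the common “memory limit, no CPU limit” preference mentioned above:
containers:
- name: worker
  image: registry.example.com/my-mpi-app:1.2   # placeholder
  resources:
    requests:
      cpu: "8"                         # one rank per pod using 8 cores
      memory: 16Gi
    limits:
      memory: 16Gi                     # clear memory ceiling; no CPU limit to avoid throttling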
Node selection and topology
MPI performance depends on:
- Co‑locating ranks appropriately (e.g., within same node, rack, or zone).
- Minimizing cross‑rack or cross‑zone communication where needed.
In OpenShift you use:
- Node labels and nodeSelector/nodeAffinity to ensure placement on HPC nodes (a combined pod‑spec fragment follows after these lists).
- Pod anti‑affinity if you want to spread ranks across nodes (e.g., for resilience).
- Topology‑aware hinting from the CNI or topology manager (where available) for NUMA‑aware placement.
For tightly coupled jobs:
- You often want exclusive nodes:
- Taint HPC nodes and allow only MPI pods with matching tolerations.
- Request enough resources so that Kubernetes naturally doesn’t co‑schedule other noisy workloads.
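A pod‑spec fragment combining these placement controls; the node label, taint key, and worker label are assumptions to adapt to your cluster:
nodeSelector:
  node-role.example.com/hpc: "true"              # assumed label on HPC nodes
tolerations:
- key: "hpc-only"                                # assumed taint reserving HPC nodes
  operator: "Exists"
  effect: "NoSchedule"
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: mpi-worker                        # spread ranks across nodes
      topologyKey: kubernetes.io/hostname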
GPUs and accelerators
For MPI+GPU workloads:
- Requests must include GPU resources (e.g., nvidia.com/gpu: 1) per pod.
- Ensure:
- GPU device plugin or Operator is installed.
- MPI image contains CUDA and GPU‑aware MPI build if needed.
- Rank‑to‑GPU mapping is stable and appropriate, usually via environment variables (e.g., CUDA_VISIBLE_DEVICES); a container sketch follows below.
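A container fragment for the one‑GPU‑per‑pod case; the image is a placeholder and the explicit CUDA_VISIBLE_DEVICES setting is shown only to make the mapping visible (the device plugin normally exposes exactly the allocated GPU):
containers:
- name: worker
  image: registry.example.com/my-mpi-gpu-app:1.2   # placeholder CUDA-enabled MPI image
  resources:
    limits:
      nvidia.com/gpu: 1                # one GPU per pod; requires the GPU device plugin/Operator
  env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0"                         # single allocated GPU seen as device 0 inside the container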
Hybrid MPI+GPU patterns:
- One pod per GPU, one rank per pod:
- Simple model, easy scheduling.
- Multiple GPUs per pod, multiple ranks per pod:
- Requires careful intra‑node placement logic.
Storage and Data Locality for Parallel Jobs
Parallel workloads often need:
- Shared input datasets.
- Collective output (checkpoints, results).
- Temporary scratch space.
Container‑specific points:
- Use PersistentVolumeClaims for shared data accessible by all ranks.
- For high‑performance I/O, attach to a parallel or high‑throughput filesystem integrated through OpenShift storage.
- Use node‑local or emptyDir volumes for scratch if they don’t need to survive pod rescheduling.
Interaction with MPI:
- MPI‑IO or application‑level collective I/O should point at the same mount path in each pod.
- Path consistency across containers is essential; design the image and pod spec to mount to identical locations.
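A pod‑spec fragment illustrating this layout, assuming a hypothetical RWX‑capable PVC named mpi-dataset and identical mount paths in every rank pod:
volumes:
- name: shared-data
  persistentVolumeClaim:
    claimName: mpi-dataset                       # assumed shared (RWX) PVC for input/output
- name: scratch
  emptyDir: {}                                   # node-local scratch; not preserved on reschedule
containers:
- name: worker
  image: registry.example.com/my-mpi-app:1.2     # placeholder
  volumeMounts:
  - name: shared-data
    mountPath: /data                             # same path in every rank pod
  - name: scratch
    mountPath: /scratch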
Security and MPI in Containers
MPI workloads often expect low friction with system resources; OpenShift adds important security controls.
Areas that often require attention:
- Security Context Constraints (SCCs):
- MPI pods might need access to additional devices (RDMA, GPUs, hugepages).
- Avoid defaulting to fully privileged; use targeted SCCs (e.g., from GPU/RDMA Operators).
- User IDs:
- Many OpenShift clusters enforce non‑root UIDs.
- MPI images should be built and tested to run as non‑root (no reliance on root‑owned paths at runtime).
- Host networking and privileged capabilities:
- Host networking is rarely required for MPI; prefer normal pod networking for security and portability.
- If your MPI build tries to use low‑level features requiring extra capabilities, evaluate whether they are truly needed, and adjust securityContext accordingly (a minimal example follows below).
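A minimal container securityContext sketch along these lines; on OpenShift the restricted SCC typically assigns a namespace‑range UID, so no explicit runAsUser is set here:
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault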
Balancing security and performance:
- Start from the most restricted profile that works, then incrementally add only what is clearly required for the MPI job (devices, hugepages, etc.).
- Work with cluster admins to define standard “HPC SCCs” for MPI/GPU workloads.
Debugging and Performance Tuning
Debugging MPI inside OpenShift differs from SSHing into a node and running ad‑hoc commands.
Observability in containerized MPI jobs
Leverage:
- Pod logs: each rank produces logs; aggregate and label them clearly (rank ID, host).
- Metrics: expose application metrics or use existing monitoring to observe CPU, memory, network usage per pod.
- Tracing/profiling tools that work in containers (subject to security policies).
Common debugging tips:
- Print rank and hostname information at startup to verify mapping, e.g. “Rank 0 running on $(hostname)”.
- Use kubectl/oc logs, label selectors, and oc exec to inspect rank pods while the job is running.
- For segmentation faults or MPI errors, capture core dumps to a mounted persistent volume.
Tuning steps specific to containers
- Adjust pod CPU pinning and NUMA policies where available, e.g. via the Kubernetes Topology Manager and CPU Manager.
- Experiment with:
- Different MPI transports (TCP vs UCX vs OFI).
- Process‑per‑node layout (ranks vs threads vs GPUs).
- Collective algorithm tunables (e.g., environment variables for Open MPI/MVAPICH).
- Consider building slim, performance‑oriented images:
- Remove unnecessary daemons and background processes.
- Use appropriate compiler flags and math libraries.
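One low‑friction way to experiment with transports and tunables is through environment variables in the pod spec rather than rebuilding the image; the example below assumes Open MPI’s OMPI_MCA_* convention and a UCX‑enabled build, and the interface name is an assumption:
env:
- name: OMPI_MCA_pml
  value: "ucx"                         # prefer the UCX point-to-point layer (needs a UCX build)
- name: OMPI_MCA_btl
  value: "^openib"                     # disable the legacy openib BTL
- name: OMPI_MCA_btl_tcp_if_include
  value: "eth0"                        # restrict TCP fallback to the pod interface (assumed name)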
Design Patterns and Best Practices
A few practical patterns emerge for MPI and parallel workloads on OpenShift:
- Declarative MPI jobs:
- Represent each MPI run as a CR (MPIJob) or a Job manifest.
- Store these manifests in Git and manage via GitOps for reproducibility.
- Reproducible environments via images:
- Bake compilers, libraries, and MPI into images tagged by version (e.g., my-mpi-app:1.2-openmpi4.1).
- Avoid “module load” in the traditional sense; use image variants instead.
- Separation of concerns:
- One image for “base MPI runtime” (maintained by platform/HPC admins).
- Application‑specific images extending the base.
- Integration with batch/experiment management:
- Use pipelines or workflow tools to generate and submit many MPIJobs with different parameters.
- Collect results to a common persistent store for post‑processing.
By aligning MPI’s rank‑centric view with OpenShift’s pod‑ and controller‑centric model, you can keep the benefits of containerized, cloud‑native operations while still running demanding parallel workloads effectively.