Characteristics of Batch Workloads on OpenShift
Batch workloads on OpenShift are typically:
- Non-interactive: Jobs run without user interaction once started.
- Finite: They have a clear start and end (e.g., run simulation, write results, exit).
- Often parallel: Many independent tasks (task arrays) or tightly coupled parallel codes.
- Resource-hungry: Large CPU, memory, and sometimes I/O requirements.
- Throughput-oriented: Focused on finishing many jobs reliably rather than serving continuous traffic.
On OpenShift, these run as pods controlled by higher-level resources (Jobs, CronJobs, etc.) instead of long-lived Services or Deployments that you would use for typical web applications.
Core OpenShift Primitives for Batch Workloads
Jobs: One-shot batch tasks
A Job ensures that a pod (or set of pods) runs to completion.
Key aspects for batch/HPC-style tasks:
- Completions: Total number of successful pods required.
- Parallelism: How many pods can run concurrently.
- Restart policy: Typically OnFailure for resilience, but Never for strictly controlled behavior.
- Backoff: Controls retry behavior to avoid infinite loops on persistent errors.
Basic example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-calculation
spec:
  completions: 10
  parallelism: 5
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: pi
        image: registry.example.com/hpc/pi:latest
        command: ["python", "compute_pi.py", "--iterations", "100000000"]

This pattern is useful for parameter sweeps or embarrassingly parallel workloads where each pod computes an independent piece of work.
Controlling parallelism and completions
- Use spec.completions for the total number of tasks.
- Use spec.parallelism to limit concurrent pods for:
  - Fair sharing on a shared cluster.
  - Matching license, data, or I/O constraints.
- For dynamic control, you can adjust parallelism on a live Job using oc apply or automation.
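For example, you can lower the concurrency of a running Job with a single patch; a minimal sketch using the pi-calculation Job from above (oc patch mirrors kubectl patch, and spec.parallelism is mutable on a live Job):

# Reduce the number of concurrently running pods for an existing Job
oc patch job pi-calculation --type=merge -p '{"spec":{"parallelism":2}}'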
CronJobs: Scheduled batch workloads
CronJob is for periodic batch tasks (e.g., hourly data aggregation, daily model runs).
Example:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-simulation
spec:
  schedule: "0 2 * * *"   # Every day at 02:00
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: run
            image: registry.example.com/hpc/model:stable
            args: ["--config", "/data/config.yaml"]

Things that matter for HPC-like runs:
- concurrencyPolicy:
  - Forbid: Don’t start a new run if the previous one isn’t finished (useful for heavy simulations).
  - Replace: Cancel the current run and start a new one (risky for long jobs).
- History limits: Prevent too many old Jobs from accumulating.
Resource Management for Batch Workloads
Requests and limits
Batch/HPC jobs should define accurate resources.requests (and usually limits):
- requests influence scheduling and capacity planning.
- limits can prevent a runaway job from impacting others.
Example:
resources:
  requests:
    cpu: "8"
    memory: "32Gi"
  limits:
    cpu: "8"
    memory: "40Gi"

Guidelines:
- Match requests to actual needs to avoid blocking scheduling.
- For CPU-bound workloads, setting limits equal to requests avoids throttling surprises.
- For memory, a small headroom above requests can reduce OOM kills but must fit cluster capacity.
Node selection and placement
For HPC-style nodes (e.g., high-memory, high-core, or InfiniBand-capable nodes), you often:
- Label nodes with capabilities (e.g., hpc=true, cpu=highcore).
- Use nodeSelector, nodeAffinity, or topologySpreadConstraints on Jobs.
Example with nodeSelector:
spec:
  template:
    spec:
      nodeSelector:
        node-type: hpc

Or, more flexibly, with affinity:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: In
          values: ["hpc"]

This is essential when mixing HPC nodes (fast interconnect, large RAM) with general-purpose cluster nodes.
Queuing and fairness
OpenShift doesn’t provide a native HPC-like queueing system by default, but you can:
- Use resource quotas and limit ranges at the project/namespace level to prevent a single user from consuming the entire cluster.
- Combine Job parallelism with quotas to get implicit queuing: pods stay Pending until resources free up.
- Add external or in-cluster schedulers (e.g., through Operators or custom controllers) if you need:
  - Priority queues.
  - Fair-share scheduling.
  - Job preemption policies tuned for HPC.
These extensions are typically integrated at the platform services/Operators layer rather than per-Job configuration.
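As a minimal sketch of the namespace-quota approach mentioned above (names and values are illustrative and should match your cluster's capacity):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-quota
  namespace: hpc-team-a
spec:
  hard:
    requests.cpu: "256"
    requests.memory: 1Ti
    count/jobs.batch: "50"

Once the quota is exhausted, additional pods for a Job are simply not admitted until earlier ones finish, which yields the implicit queuing behavior described above.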
Data and I/O Considerations for Batch Jobs
Persistent volumes for input and output
Batch workloads commonly need:
- Shared input datasets.
- A place to write outputs (checkpoints, logs, result files).
Use PersistentVolumeClaims (PVCs) bound to appropriate storage (parallel FS, high-performance NFS, object gateways via CSI, etc.). For example:
volumes:
- name: input-data
  persistentVolumeClaim:
    claimName: sim-input-pvc
- name: output-data
  persistentVolumeClaim:
    claimName: sim-output-pvc
containers:
- name: sim
  volumeMounts:
  - name: input-data
    mountPath: /data/input
  - name: output-data
    mountPath: /data/output

Considerations:
- Prefer storage classes that map to high-throughput backends for I/O-heavy jobs.
- For write-heavy workloads, avoid slow, replicated storage unless necessary for durability.
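For example, the output claim referenced above might request a high-throughput class; a sketch where the class name fast-shared is hypothetical and depends on what your cluster offers:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sim-output-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: fast-shared   # hypothetical high-throughput storage class
  resources:
    requests:
      storage: 500Gi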
Working directories and ephemeral storage
Many HPC/batch codes benefit from fast local scratch:
- Use emptyDir volumes for ephemeral high-speed working space on the node.
- Save final results to persistent volumes near the end of the job.
Example:
volumes:
- name: scratch
  emptyDir:
    sizeLimit: "200Gi"
containers:
- name: solver
  volumeMounts:
  - name: scratch
    mountPath: /scratch

Be aware:
- emptyDir is erased when the pod finishes.
- Size must fit the node’s local storage.
Patterns for HPC-Style Batch Workloads
Job arrays and parameter sweeps
Traditional HPC schedulers support job arrays; on OpenShift, you can approximate them with Jobs parameterized by environment variables, labels, or config.
Pattern:
- Create many Jobs (or a single Job with many completions) where each pod:
- Picks a task index from environment/config.
- Reads parameters from a central list (ConfigMap, file on PV, or task index encoded in the Job name).
Using completions with an index:
- You can pass an index via environment variable generated by an init container or by splitting the parameter space across pre-generated Jobs.
- Alternatively, generate individual Jobs programmatically using automation (e.g., a script calling oc apply).
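A minimal sketch of that programmatic approach, reusing the pi image from earlier; the task count, Job names, and the --task-index flag are illustrative and should match how your code selects its piece of the parameter space:

# Generate one Job per task index and submit it with oc apply
for idx in $(seq 0 9); do
cat <<EOF | oc apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: sweep-task-${idx}
  labels:
    workflow: parameter-sweep
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: task
        image: registry.example.com/hpc/pi:latest
        command: ["python", "compute_pi.py", "--task-index", "${idx}"]
EOF
done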
This pattern is well-suited for:
- Monte Carlo simulations.
- Hyperparameter sweeps.
- Independent scenario evaluations.
Multi-step pipelines using batch primitives
Some workflows involve multiple sequential stages (preprocess → simulate → postprocess). Without going into CI/CD or workflow engines, you can:
- Chain Jobs with simple orchestration logic (e.g., a controller, script, or a workflow tool) that:
- Waits for Job A to succeed.
- Then creates Job B, etc.
- Use labels and annotations on Jobs and pods to track stages and runs (e.g., workflow=climate-run, stage=postprocess).
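A minimal scripted sketch of that chaining logic, assuming hypothetical manifests preprocess-job.yaml and simulate-job.yaml that create Jobs named preprocess and simulate (oc wait mirrors kubectl wait):

# Run the preprocessing stage and block until it reports the Complete condition
oc apply -f preprocess-job.yaml
oc wait --for=condition=complete job/preprocess --timeout=2h

# Only start the simulation stage once preprocessing succeeded
# (note: if preprocess fails, the wait above times out instead of succeeding)
oc apply -f simulate-job.yaml
oc wait --for=condition=complete job/simulate --timeout=24h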
For production-grade workflows, you would typically integrate with higher-level tools (pipelines, workflow engines), but the basic building blocks are plain Jobs and CronJobs.
Batch Workloads vs Long-Running Services in OpenShift
Understanding how batch jobs differ from normal application deployments guides how you design and operate them:
- Lifecycle:
  - Batch: Pending → Running → Succeeded/Failed, and then done.
  - Services: Long-lived; restart on failure; scale up/down based on load.
- Scaling model:
  - Batch: Use parallelism and completions to define concurrency and total work.
  - Services: Use replicas or autoscalers.
- Reliability and retries:
  - Batch: Use Job retry policies (backoffLimit, restartPolicy) carefully to avoid repeating expensive failures indefinitely.
  - Services: Self-healing with continuous restarts is usually acceptable.
- Observability:
  - Batch: You often care about run history, exit codes, and artifacts (logs, output files).
  - Services: Focus more on ongoing metrics and SLIs like latency and error rate.
Design your batch pods to:
- Exit with correct exit codes.
- Write essential progress or checkpoints to persistent storage.
- Log important info to stdout/stderr so it is captured by the logging system.
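A minimal entrypoint sketch that follows these three guidelines; the solver path, scratch directory, and output location are hypothetical:

#!/bin/bash
# Run the solver, stream its output to stdout/stderr for the cluster logging
# system, copy results to persistent storage, and propagate the exit code.
/opt/app/solver --config /data/input/config.yaml 2>&1 | tee /scratch/run.log
status=${PIPESTATUS[0]}

# Persist results and logs even on failure, to aid debugging and restarts
cp -r /scratch/results /scratch/run.log /data/output/ || true

exit "${status}"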
Operational Best Practices for Batch on OpenShift
Image design for batch jobs
Container images for batch workloads should:
- Contain all dependencies needed at runtime (libraries, tools).
- Avoid unnecessary daemons or services.
- Start directly with the target command (entrypoint or command).
- Be versioned immutably (e.g., model:1.3.0) for reproducible runs.
Keep images:
- Lean enough to pull quickly, but not at the cost of rebuilding frequently during long-running campaigns.
- Built reproducibly (pin package versions, keep build instructions under version control).
Fault tolerance and restarts
For long-running jobs, you should:
- Consider checkpointing:
  - Periodically save state to a PV so the job can resume after failure.
- Tune Job settings:
  - backoffLimit to limit retries.
  - activeDeadlineSeconds to avoid runaway jobs.
Example:
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 86400   # 24 hours max

Balance:
- Too many retries can waste cycles on persistent bugs.
- Too few retries can cause jobs to fail on transient issues such as temporary storage or network glitches.
Monitoring batch runs
To operate batch workloads effectively:
- Use labels (job-name, workflow, user, experiment-id) for:
  - Filtering in monitoring dashboards.
  - Grouping logs and metrics.
- Track:
  - Job success/failure counts.
  - Time-to-completion per job type.
  - Resource utilization patterns for tuning requests/limits.
You can surface these in:
- Built-in monitoring tools.
- Custom dashboards for HPC/batch usage (cluster admins often expose these to users).
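Labels also make ad-hoc inspection from the CLI straightforward; for example, using the labels above (values are illustrative) and the job-name label the Job controller adds to pods:

# List all Jobs belonging to one workflow, with their completion status
oc get jobs -l workflow=climate-run

# Inspect the pods and recent logs of a specific Job
oc get pods -l job-name=pi-calculation
oc logs -l job-name=pi-calculation --tail=100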
Multi-tenancy and fairness
In a shared OpenShift cluster used for HPC/batch:
- Separate teams into different projects/namespaces.
- Use resource quotas and limits per namespace to:
- Avoid one group monopolizing all CPUs or memory.
- Optionally use:
- Priority classes for differentiating critical vs background jobs.
- Admission policies to enforce sane Job specs (e.g., max parallelism, max runtime limits).
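As a sketch of the priority-class option above (the class name and value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-background
value: 1000
globalDefault: false
description: "Low-priority batch jobs that can wait behind critical work."

Job pod templates opt in by setting spec.template.spec.priorityClassName: batch-background.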
This provides a cluster-wide policy layer roughly analogous to traditional HPC center policies.
Integrating with Traditional Schedulers and Workflows
Many organizations have existing schedulers (Slurm, PBS, etc.) and want to use OpenShift as an additional execution backend.
Common integration patterns:
- Outer scheduler, inner OpenShift:
- Traditional scheduler decides when to run.
- A submission script uses oc (or the API) to create Jobs on OpenShift as tasks.
- Wrapper services:
- A microservice in OpenShift receives jobs from existing systems and translates them into OpenShift Jobs.
- Operator-based integration:
- Specialized Operators provide HPC-style queueing or scheduler semantics on top of OpenShift primitives.
When designing such integrations, pay attention to:
- Mapping between external job IDs and OpenShift Job names/labels.
- Synchronizing job state (Pending/Running/Completed) between systems.
- Where accounting and usage tracking will live.
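For the ID-mapping and state-sync points, one simple convention is to carry the external scheduler's job ID as a label; a sketch where the ID 12345 and the Job name are hypothetical:

# Record the external scheduler's ID on the OpenShift Job at submission time
oc label job sweep-task-0 external-job-id=12345

# Later, find the Job by external ID and read back how many pods succeeded
oc get jobs -l external-job-id=12345 -o jsonpath='{.items[0].status.succeeded}'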
Summary
Running batch workloads on OpenShift centers on using Job and CronJob resources with appropriate resource, storage, and scheduling settings tailored for HPC-style tasks. By carefully defining parallelism, resource requests, data paths, and fault-tolerance strategies, you can move traditional batch and HPC workflows onto OpenShift while preserving familiar behaviors such as job arrays, queues, and multi-step pipelines, and while integrating with existing operational and scheduling practices.