Types of OpenShift Upgrades
OpenShift supports several upgrade scopes, each with slightly different processes and constraints:
- Minor version upgrades (e.g. 4.12 → 4.13)
- Released a few times a year; introduce new features and may deprecate APIs.
- Require more testing and planning.
- May involve changes in Operators, APIs, and cluster behavior.
- Patch version upgrades (e.g. 4.13.5 → 4.13.7)
- Primarily bug fixes and security patches.
- Typically lower risk, often automated in managed environments.
- Y-stream / z-stream constraints
- You can only upgrade along supported “hops” defined by Red Hat (e.g. 4.11 → 4.12 is allowed, 4.11 → 4.13 may require going through 4.12 first).
- Skipping unsupported hops is not supported and can break the cluster.
- OpenShift distribution differences
- Self-managed (bare metal, vSphere, IPI, UPI, etc.): you control when/how upgrades run via the web console or the `oc` CLI.
- Managed OpenShift services (e.g. ROSA, ARO, OpenShift Dedicated): the provider automates much of the process, but you still must coordinate windows and application readiness.
Key Components Involved in Upgrades
While a full architectural recap is in other chapters, the upgrade process specifically touches:
- Cluster Version Operator (CVO)
- Central orchestrator of the OpenShift upgrade.
- Applies the new “release image” by reconciling all cluster version components to their specified versions.
- Manages sequencing and monitors upgrade progress.
- Machine Config Operator (MCO)
- Handles node-level changes such as OS updates, kernel, and kubelet configuration.
- Drives node reboots and rolling updates of control plane and worker nodes.
- Operator Lifecycle Manager (OLM) and Operators
- Platform and add-on Operators must be compatible with the new cluster version.
- OLM upgrades Operators in coordination with the cluster version.
Understanding how these components work together is essential to interpreting upgrade status and troubleshooting when something stalls.
Upgrade Channels and Release Images
OpenShift uses update channels and release images to define what versions you can upgrade to and how:
- Channels (examples: `stable-4.12`, `fast-4.13`, `eus-4.12`)
- Control which release images are presented as upgrade targets.
- `stable`: conservative, well-tested updates.
- `fast`: quicker access to new versions, with more frequent releases.
- `eus` (Extended Update Support): for long-lived clusters with extended support windows.
- Release images
- Each OpenShift version is a container image that bundles all core components at specific versions.
- The CVO applies the content of the release image cluster-wide.
- Choosing a channel
- Match your risk profile and required support lifecycle.
- For production, `stable` or `eus` is usually recommended; `fast` might be acceptable for non-critical or pre-production environments.
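Both the current channel and the updates it exposes can be read directly from the cluster; a minimal sketch, assuming a logged-in `oc` session:

```shell
# Show the channel the cluster is currently subscribed to.
oc get clusterversion version -o jsonpath='{.spec.channel}{"\n"}'

# List the updates recommended from that channel.
oc adm upgrade
```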
Pre‑Upgrade Planning
A safe OpenShift upgrade is more about preparation than clicking an “Upgrade” button. Key planning tasks:
Version and Compatibility Planning
- Check supported upgrade paths in Red Hat documentation for your current version.
- Verify version compatibility for:
- Cluster add-ons and Operators (logging, monitoring, service mesh, storage, etc.).
- External integrations (CNI plugins, load balancers, identity providers, CSI drivers).
- Identify API deprecations that might affect:
- Custom manifests, YAML files, Helm charts.
- CI/CD pipelines that generate or apply resources.
Use:
- `oc get clusterversion` to see the current version and available updates.
- `oc describe clusterversion` for the channel and upgrade history.
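For spotting deprecated API usage ahead of a minor upgrade, OpenShift tracks requests per API in `APIRequestCount` resources; a sketch of the commonly documented query (field paths assumed from recent 4.x releases):

```shell
# List APIs flagged for removal in an upcoming release that are still
# receiving requests on this cluster.
oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}'
```

Any API listed here should be traced back to the manifests, Helm charts, or pipelines that still call it.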
Health and Capacity Checks
Before starting:
- Confirm cluster health:
- `oc get co` (ClusterOperators) – all should be `Available=True`, `Progressing=False`, `Degraded=False`.
- `oc get nodes` – all nodes should be `Ready`.
- Ensure sufficient capacity:
- Enough compute to handle pods rescheduling during rolling node reboots.
- Sufficient disk space on nodes and etcd volumes.
- Check etcd health (often via built-in tools or monitoring stack).
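The ClusterOperator check above can be scripted as a pre-flight gate. A minimal sketch, assuming the tabular column order (`NAME VERSION AVAILABLE PROGRESSING DEGRADED ...`) of current `oc` releases:

```shell
# Print every ClusterOperator that is not Available=True, or that is
# Progressing or Degraded. Reads `oc get co` output on stdin.
unhealthy_cos() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Typical usage against a live cluster:
#   oc get co | unhealthy_cos
```

An empty result means all operators are healthy; anything printed should be investigated before starting the upgrade.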
Backup and Rollback Strategy
There is no supported “downgrade” of the OpenShift cluster itself, so:
- Take backups of:
- etcd (or full cluster backup if you use external tools).
- Critical application data and persistent volumes.
- Define a rollback plan:
- If an upgrade fails catastrophically, the most likely recovery path is restoring from backup to the previous version.
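On self-managed clusters, the documented etcd backup runs a script that ships on each control plane node; a sketch (the node name is a placeholder):

```shell
# Snapshot etcd and the static-pod resources from one control plane node.
# The backup is written under /home/core/assets/backup on that node.
oc debug node/<control-plane-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
```

Copy the resulting snapshot off the node; a backup that lives only on the cluster it protects is of limited value.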
Maintenance Window and Stakeholder Coordination
- Choose a maintenance window appropriate to the potential impact.
- Inform:
- Application owners and users.
- DevOps/CI teams whose pipelines may be affected.
- For managed services, coordinate with the provider’s defined windows and procedures.
Upgrade Execution Workflow
The actual process is designed to be largely automated and rolling. The steps below refer to a typical self-managed OpenShift 4 cluster; managed offerings simplify some of these.
1. Select Channel and Target Version
Using the web console or CLI:
- Set or confirm the update channel: `oc patch clusterversion version -p '{"spec":{"channel":"stable-4.13"}}' --type=merge`
- List available updates: `oc adm upgrade`
- Confirm the target version (e.g. `4.13.7`) that matches your plan.
2. Initiate the Upgrade
You can start via:
- Web console (Cluster Settings → Cluster Version → Update):
- Select the target version and confirm.
- CLI: `oc adm upgrade --to=4.13.7` or `oc adm upgrade --allow-explicit-upgrade --to-image=<release-image>`
The CVO will pull the release image and begin reconciling components.
3. Control Plane Upgrade
Typically, the upgrade proceeds in this order:
- Control plane components (API server, scheduler, controller manager, etc.) are upgraded.
- etcd is upgraded as appropriate for the target version.
- Each control plane node is rebooted and updated in a controlled fashion.
The CVO and MCO coordinate to:
- Drain and cordon each node.
- Apply new machine configs, reboot nodes.
- Monitor readiness before moving to the next node.
4. Worker Node Upgrade
Once control plane and core Operators are updated:
- The MCO updates machine configs for worker pools.
- Nodes in each pool are updated one-by-one or in small batches:
- Node is cordoned and drained.
- Config is applied, node reboots.
- Node returns to `Ready` and workloads reschedule.
Ensure PodDisruptionBudgets (PDBs) are correctly set so that workloads remain available while nodes drain.
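As a sketch, a PDB for a hypothetical Deployment labeled `app=web` (the name, namespace, and label are assumptions) that keeps at least two replicas available while nodes drain:

```shell
# Apply a PodDisruptionBudget so drains cannot evict the app below
# two available replicas at any point during the node rollout.
oc apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: my-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF
```

Note that a PDB stricter than the replica count (e.g. `minAvailable` equal to the number of replicas) can block drains entirely and stall the upgrade.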
5. Platform Operator and Add‑On Upgrade
During or after the main cluster upgrade:
- OLM updates Operators according to their configured update channels and approval strategies (automatic vs. manual).
- You may need to manually approve some Operator upgrades, especially in production.
- Verify critical platform services (logging, monitoring, service mesh, storage) after their Operators report healthy status.
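With a manual approval strategy, pending Operator updates surface as `InstallPlan` resources that must be approved before OLM proceeds; a sketch (the name and namespace are placeholders):

```shell
# List InstallPlans in all namespaces; APPROVED=false entries are waiting.
oc get installplan -A

# Approve a specific InstallPlan so OLM can run the Operator upgrade.
oc patch installplan <name> -n <namespace> --type merge -p '{"spec":{"approved":true}}'
```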
Monitoring Upgrade Progress
Continuous monitoring is crucial during the process:
- `oc get clusterversion` – shows status (`Progressing`, `Available`, `Degraded`) and the current and desired versions.
- `oc describe clusterversion` – detailed information on which component or step the upgrade is on, plus history and conditions.
- `oc get co` – verify that each `ClusterOperator` moves to the new version and reaches healthy status.
- `oc get nodes` – track which nodes are being updated and their readiness state.
Typical indicators of success:
- `clusterversion` shows `Desired: 4.x.y`, `Progressing=False`, `Available=True`.
- All `ClusterOperators` are available and not progressing or degraded.
- All nodes are `Ready`, with the expected OS and kubelet versions.
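During a long-running upgrade, the commands above are often combined into a single polling view; a minimal sketch:

```shell
# Refresh the cluster version, operator, and node status every 30 seconds.
watch -n 30 'oc get clusterversion; echo; oc get co; echo; oc get nodes'
```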
Handling Common Upgrade Issues
Even with planning, upgrades can encounter problems. Some patterns:
Stuck or Slow Upgrades
Symptoms:
- `clusterversion` remains in `Progressing` for a long time.
- One or more `ClusterOperators` show `Degraded=True`.
Actions:
- Identify the blocking operator: `oc get co` → look for the one that is `Progressing` or `Degraded`.
- Check details: `oc describe co <name>` for error messages.
- Common causes:
- Misconfigured Operators or missing permissions.
- Failing webhooks or admission controllers.
- Node-level issues (insufficient disk, failing reboots).
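Drilling into a degraded operator usually follows the same few commands; a sketch using the ingress operator as an example (each operator has its own namespace and deployment name, so these are illustrative):

```shell
# Conditions and error messages for the degraded ClusterOperator.
oc describe co ingress

# Recent events in the operator's own namespace.
oc get events -n openshift-ingress-operator --sort-by=.lastTimestamp

# Logs from the operator deployment itself.
oc logs -n openshift-ingress-operator deployment/ingress-operator
```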
Node Upgrade Problems
Symptoms:
- Node stuck in `NotReady`, or fails to return after reboot.
- MCO reports `Degraded=True`.
Actions:
- Inspect the node:
- Cloud provider console or out-of-band management.
- Check MCO logs and machine configs.
- Temporarily exclude problematic nodes from the upgrade (e.g. by removing from pool) to allow cluster progress, then fix them individually.
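One way to stop the rollout while debugging is to pause the affected MachineConfigPool; the MCO then leaves the remaining nodes in that pool untouched until you resume. A sketch for the `worker` pool:

```shell
# Pause the worker pool so no further nodes are drained and rebooted.
oc patch mcp worker --type merge -p '{"spec":{"paused":true}}'

# ...investigate and fix the stuck node...

# Resume the rollout once the node is healthy again.
oc patch mcp worker --type merge -p '{"spec":{"paused":false}}'
```

Avoid leaving a pool paused for long: a paused pool also stops receiving other machine config updates, including certificate rotations.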
Application Disruptions
Symptoms:
- Pods fail to schedule due to lack of resources during node drains.
- Stateful workloads experience longer failovers than expected.
Actions:
- Adjust PodDisruptionBudgets and replica counts.
- Ensure sufficient capacity and anti-affinity rules so replicas can be spread during the upgrade.
Post‑Upgrade Validation
After the cluster reports that the upgrade is complete:
Platform-Level Validation
- Re-check:
- `oc get clusterversion` and `oc get co` for healthy status.
- `oc get nodes -o wide` for consistent versions.
- Validate critical platform features:
- API server responsiveness and performance.
- Ingress/routes and service connectivity.
- Storage provisioning and PVC binding.
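One cheap consistency check is that every node reports the same kubelet version after the upgrade. A sketch, assuming the column layout of `oc get nodes -o wide` (version in the fifth column):

```shell
# Print the distinct kubelet versions across all nodes; a healthy,
# fully upgraded cluster should print exactly one line.
kubelet_versions() {
  awk 'NR > 1 { print $5 }' | sort -u
}

# Typical usage against a live cluster:
#   oc get nodes -o wide | kubelet_versions
```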
Application-Level Validation
- Run application smoke tests:
- Basic functional checks.
- Authentication/authorization flow.
- Data access and persistence.
- Validate external integrations (CI/CD, monitoring, logging shipping, identity providers).
Documentation and Follow‑Up
- Document:
- Version before and after upgrade.
- Timeline, observed issues, and resolutions.
- Any manual changes made during the upgrade.
- Update:
- Runbooks and standard operating procedures.
- Capacity or configuration adjustments discovered to be necessary.
Special Considerations for Different Deployment Models
Although the core process is similar, deployment models affect how you interact with upgrades.
Installer‑Provisioned vs. User‑Provisioned Infrastructure
- IPI clusters:
- Integrations with the infrastructure provider are standardized.
- Machine pools and MCO behavior are more predictable and support automated node replacement if required.
- UPI clusters:
- You might be responsible for more of the underlying node lifecycle.
- Extra attention to OS images and bootstrapping configuration is needed.
Managed OpenShift Services
In services like ROSA, ARO, or OpenShift Dedicated:
- The provider often:
- Schedules and executes the core cluster upgrades.
- Maintains control plane and some Operators.
- You remain responsible for:
- Application readiness, PDBs, and resiliency.
- User-installed Operators and workloads.
- Coordination is typically done through the provider’s console or ticketing system, with defined upgrade windows and SLAs.
Automating and Scheduling Upgrades
To make upgrades repeatable and less error-prone:
- Use automation where supported:
- Scheduled upgrades through infrastructure provider tools or APIs.
- CI pipelines that run health checks and basic tests before/after upgrades.
- Define standard patterns:
- Upgrade non-production clusters first, observe for a defined soak period.
- Then upgrade staging, then production.
- Integrate with existing change management processes to track approvals and results.
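As a sketch, a CI gate like the following could run before a scheduled upgrade, refusing to start unless the cluster is fully healthy (`TARGET_VERSION` and the logged-in `oc` session are assumptions of this example):

```shell
#!/bin/sh
# Abort the pipeline unless all ClusterOperators and nodes are healthy,
# then kick off the upgrade to the approved target version.
set -eu

bad_cos=$(oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"' | wc -l)
bad_nodes=$(oc get nodes --no-headers | awk '$2 != "Ready"' | wc -l)

if [ "$bad_cos" -ne 0 ] || [ "$bad_nodes" -ne 0 ]; then
  echo "Cluster not healthy; aborting upgrade" >&2
  exit 1
fi

oc adm upgrade --to="${TARGET_VERSION:?TARGET_VERSION must be set}"
```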
By treating the OpenShift upgrade process as a regular, well-documented operational activity, rather than an ad-hoc event, you can keep your clusters secure, supported, and aligned with the rest of your platform ecosystem.