Types of OpenShift Upgrades
OpenShift supports several upgrade scopes, each with slightly different processes and constraints:
- Minor version upgrades (e.g. 4.12 → 4.13)
- Released a few times a year; introduce new features and may deprecate APIs.
- Require more testing and planning.
- May involve changes in Operators, APIs, and cluster behavior.
- Patch version upgrades (e.g. 4.13.5 → 4.13.7)
- Primarily bug fixes and security patches.
- Typically lower risk, often automated in managed environments.
- Y-stream / z-stream constraints
- You can only upgrade along supported “hops” defined by Red Hat (e.g. 4.11 → 4.12 is allowed, 4.11 → 4.13 may require going through 4.12 first).
- Skipping unsupported hops is not supported and can break the cluster.
- OpenShift distribution differences
- Self-managed (bare metal, vSphere, IPI, UPI, etc.): you control when/how upgrades run via the web console or the `oc` CLI.
- Managed OpenShift services (e.g. ROSA, ARO, OpenShift Dedicated): the provider automates much of the process, but you still must coordinate windows and application readiness.
Key Components Involved in Upgrades
While a full architectural recap is in other chapters, the upgrade process specifically touches:
- Cluster Version Operator (CVO)
- Central orchestrator of the OpenShift upgrade.
- Applies the new “release image” by reconciling all cluster version components to their specified versions.
- Manages sequencing and monitors upgrade progress.
- Machine Config Operator (MCO)
- Handles node-level changes such as OS updates, kernel, and kubelet configuration.
- Drives node reboots and rolling updates of control plane and worker nodes.
- Operator Lifecycle Manager (OLM) and Operators
- Platform and add-on Operators must be compatible with the new cluster version.
- OLM upgrades Operators in coordination with the cluster version.
Understanding how these components work together is essential to interpreting upgrade status and troubleshooting when something stalls.
Upgrade Channels and Release Images
OpenShift uses update channels and release images to define what versions you can upgrade to and how:
- Channels (examples: `stable-4.12`, `fast-4.13`, `eus-4.12`)
- Control which release images are presented as upgrade targets.
- `stable`: conservative, well-tested updates.
- `fast`: quicker access to new versions, with more frequent releases.
- `eus` (Extended Update Support): for long-lived clusters with extended support windows.
- Release images
- Each OpenShift version is a container image that bundles all core components at specific versions.
- The CVO applies the content of the release image cluster-wide.
- Choosing a channel
- Match your risk profile and required support lifecycle.
- For production, `stable` or `eus` is usually recommended; `fast` might be acceptable for non-critical or pre-production environments.
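Both the current channel and the updates it exposes can be read directly from the cluster; a minimal sketch, assuming a logged-in `oc` session:

```shell
# Show the channel the cluster is currently subscribed to.
oc get clusterversion version -o jsonpath='{.spec.channel}{"\n"}'

# List the updates recommended from that channel.
oc adm upgrade
```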
Pre‑Upgrade Planning
A safe OpenShift upgrade is more about preparation than clicking an “Upgrade” button. Key planning tasks:
Version and Compatibility Planning
- Check supported upgrade paths in Red Hat documentation for your current version.
- Verify version compatibility for:
- Cluster add-ons and Operators (logging, monitoring, service mesh, storage, etc.).
- External integrations (CNI plugins, load balancers, identity providers, CSI drivers).
- Identify API deprecations that might affect:
- Custom manifests, YAML files, Helm charts.
- CI/CD pipelines that generate or apply resources.
Use:
- `oc get clusterversion` to see the current version and available updates.
- `oc describe clusterversion` for the channel and upgrade history.
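For spotting deprecated API usage ahead of a minor upgrade, OpenShift tracks requests per API in `APIRequestCount` resources; a sketch of the commonly documented query (field paths assumed from recent 4.x releases):

```shell
# List APIs flagged for removal in an upcoming release that are still
# receiving requests on this cluster.
oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}'
```

Any API listed here should be traced back to the manifests, Helm charts, or pipelines that still call it.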
Health and Capacity Checks
Before starting:
- Confirm cluster health:
- `oc get co` (ClusterOperators) – all should be `Available=True`, `Progressing=False`, `Degraded=False`.
- `oc get nodes` – all nodes should be `Ready`.
- Ensure sufficient capacity:
- Enough compute to handle pods rescheduling during rolling node reboots.
- Sufficient disk space on nodes and etcd volumes.
- Check etcd health (often via built-in tools or monitoring stack).
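The ClusterOperator check above can be scripted as a pre-flight gate. A minimal sketch, assuming the tabular column order (`NAME VERSION AVAILABLE PROGRESSING DEGRADED ...`) of current `oc` releases:

```shell
# Print every ClusterOperator that is not Available=True, or that is
# Progressing or Degraded. Reads `oc get co` output on stdin.
unhealthy_cos() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Typical usage against a live cluster:
#   oc get co | unhealthy_cos
```

An empty result means all operators are healthy; anything printed should be investigated before starting the upgrade.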
Backup and Rollback Strategy
There is no supported “downgrade” of the OpenShift cluster itself, so:
- Take backups of:
- etcd (or full cluster backup if you use external tools).
- Critical application data and persistent volumes.
- Define a rollback plan:
- If an upgrade fails catastrophically, the most likely recovery path is restoring from backup to the previous version.
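On self-managed clusters, the documented etcd backup runs a script that ships on each control plane node; a sketch (the node name is a placeholder):

```shell
# Snapshot etcd and the static-pod resources from one control plane node.
# The backup is written under /home/core/assets/backup on that node.
oc debug node/<control-plane-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
```

Copy the resulting snapshot off the node; a backup that lives only on the cluster it protects is of limited value.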
Maintenance Window and Stakeholder Coordination
- Choose a maintenance window appropriate to the potential impact.
- Inform:
- Application owners and users.
- DevOps/CI teams whose pipelines may be affected.
- For managed services, coordinate with the provider’s defined windows and procedures.
Upgrade Execution Workflow
The actual process is designed to be largely automated and rolling. The steps below refer to a typical self-managed OpenShift 4 cluster; managed offerings simplify some of these.
1. Select Channel and Target Version
Using the web console or CLI:
- Set or confirm the update channel: `oc patch clusterversion version -p '{"spec":{"channel":"stable-4.13"}}' --type=merge`
- List available updates: `oc adm upgrade`
- Confirm the target version (e.g. `4.13.7`) that matches your plan.
2. Initiate the Upgrade
You can start via:
- Web console (Cluster Settings → Cluster Version → Update):
- Select the target version and confirm.
- CLI: `oc adm upgrade --to=4.13.7` or `oc adm upgrade --allow-explicit-upgrade --to-image=<release-image>`
The CVO will pull the release image and begin reconciling components.
3. Control Plane Upgrade
Typically, the upgrade proceeds in this order:
- Control plane components (API server, scheduler, controller manager, etc.) are upgraded.
- etcd is upgraded as appropriate for the target version.
- Each control plane node is rebooted and updated in a controlled fashion.
The CVO and MCO coordinate to:
- Drain and cordon each node.
- Apply new machine configs, reboot nodes.
- Monitor readiness before moving to the next node.
4. Worker Node Upgrade
Once control plane and core Operators are updated:
- The MCO updates machine configs for worker pools.
- Nodes in each pool are updated one-by-one or in small batches:
- Node is cordoned and drained.
- Config is applied, node reboots.
- Node returns to `Ready` and workloads reschedule.
Ensure PodDisruptionBudgets (PDBs) are correctly set so that workloads remain available while nodes drain.
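As a sketch, a PDB for a hypothetical Deployment labeled `app=web` (the name, namespace, and label are assumptions) that keeps at least two replicas available while nodes drain:

```shell
# Apply a PodDisruptionBudget so drains cannot evict the app below
# two available replicas at any point during the node rollout.
oc apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: my-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF
```

Note that a PDB stricter than the replica count (e.g. `minAvailable` equal to the number of replicas) can block drains entirely and stall the upgrade.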
5. Platform Operator and Add‑On Upgrade
During or after the main cluster upgrade:
- OLM updates Operators according to their configured update channels and approval strategies (automatic vs. manual).
- You may need to manually approve some Operator upgrades, especially in production.
- Verify critical platform services (logging, monitoring, service mesh, storage) after their Operators report healthy status.
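With a manual approval strategy, pending Operator updates surface as `InstallPlan` resources that must be approved before OLM proceeds; a sketch (the name and namespace are placeholders):

```shell
# List InstallPlans in all namespaces; APPROVED=false entries are waiting.
oc get installplan -A

# Approve a specific InstallPlan so OLM can run the Operator upgrade.
oc patch installplan <name> -n <namespace> --type merge -p '{"spec":{"approved":true}}'
```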
Monitoring Upgrade Progress
Continuous monitoring is crucial during the process:
- `oc get clusterversion` – shows status (`Progressing`, `Available`, `Degraded`) and the current and desired versions.
- `oc describe clusterversion` – detailed information on which component or step the upgrade is on, plus history and conditions.
- `oc get co` – verify that each `ClusterOperator` moves to the new version and reaches healthy status.
- `oc get nodes` – track which nodes are being updated and their readiness state.
Typical indicators of success:
- `clusterversion` shows `Desired: 4.x.y`, `Progressing=False`, `Available=True`.
- All `ClusterOperators` are available and not progressing or degraded.
- All nodes are `Ready`, with the expected OS and kubelet versions.
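During a long-running upgrade, the commands above are often combined into a single polling view; a minimal sketch:

```shell
# Refresh the cluster version, operator, and node status every 30 seconds.
watch -n 30 'oc get clusterversion; echo; oc get co; echo; oc get nodes'
```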
Handling Common Upgrade Issues
Even with planning, upgrades can encounter problems. Some patterns:
Stuck or Slow Upgrades
Symptoms:
- `clusterversion` remains in `Progressing` for a long time.
- One or more `ClusterOperators` show `Degraded=True`.
Actions:
- Identify the blocking operator: `oc get co` → look for the one that is `Progressing` or `Degraded`.
- Check details: `oc describe co <name>` for error messages.
- Common causes:
- Misconfigured Operators or missing permissions.
- Failing webhooks or admission controllers.
- Node-level issues (insufficient disk, failing reboots).
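Drilling into a degraded operator usually follows the same few commands; a sketch using the ingress operator as an example (each operator has its own namespace and deployment name, so these are illustrative):

```shell
# Conditions and error messages for the degraded ClusterOperator.
oc describe co ingress

# Recent events in the operator's own namespace.
oc get events -n openshift-ingress-operator --sort-by=.lastTimestamp

# Logs from the operator deployment itself.
oc logs -n openshift-ingress-operator deployment/ingress-operator
```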
Node Upgrade Problems
Symptoms:
- Node stuck in `NotReady`, or fails to return after reboot.
- MCO reports `Degraded=True`.
Actions:
- Inspect the node:
- Cloud provider console or out-of-band management.
- Check MCO logs and machine configs.
- Temporarily exclude problematic nodes from the upgrade (e.g. by removing from pool) to allow cluster progress, then fix them individually.
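One way to stop the rollout while debugging is to pause the affected MachineConfigPool; the MCO then leaves the remaining nodes in that pool untouched until you resume. A sketch for the `worker` pool:

```shell
# Pause the worker pool so no further nodes are drained and rebooted.
oc patch mcp worker --type merge -p '{"spec":{"paused":true}}'

# ...investigate and fix the stuck node...

# Resume the rollout once the node is healthy again.
oc patch mcp worker --type merge -p '{"spec":{"paused":false}}'
```

Avoid leaving a pool paused for long: a paused pool also stops receiving other machine config updates, including certificate rotations.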
Application Disruptions
Symptoms:
- Pods fail to schedule due to lack of resources during node drains.
- Stateful workloads experience longer failovers than expected.
Actions:
- Adjust PodDisruptionBudgets and replica counts.
- Ensure sufficient capacity and anti-affinity rules so replicas can be spread during the upgrade.
Post‑Upgrade Validation
After the cluster reports that the upgrade is complete:
Platform-Level Validation
- Re-check:
- `oc get clusterversion` and `oc get co` for healthy status.
- `oc get nodes -o wide` for consistent versions.
- Validate critical platform features:
- API server responsiveness and performance.
- Ingress/routes and service connectivity.
- Storage provisioning and PVC binding.
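One cheap consistency check is that every node reports the same kubelet version after the upgrade. A sketch, assuming the column layout of `oc get nodes -o wide` (version in the fifth column):

```shell
# Print the distinct kubelet versions across all nodes; a healthy,
# fully upgraded cluster should print exactly one line.
kubelet_versions() {
  awk 'NR > 1 { print $5 }' | sort -u
}

# Typical usage against a live cluster:
#   oc get nodes -o wide | kubelet_versions
```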
Application-Level Validation
- Run application smoke tests:
- Basic functional checks.
- Authentication/authorization flow.
- Data access and persistence.
- Validate external integrations (CI/CD, monitoring, logging shipping, identity providers).
Documentation and Follow‑Up
- Document:
- Version before and after upgrade.
- Timeline, observed issues, and resolutions.
- Any manual changes made during the upgrade.
- Update:
- Runbooks and standard operating procedures.
- Capacity or configuration adjustments discovered to be necessary.
Special Considerations for Different Deployment Models
Although the core process is similar, deployment models affect how you interact with upgrades.
Installer‑Provisioned vs. User‑Provisioned Infrastructure
- IPI clusters:
- Integrations with the infrastructure provider are standardized.
- Machine pools and MCO behavior are more predictable and support automated node replacement if required.
- UPI clusters:
- You might be responsible for more of the underlying node lifecycle.
- Extra attention to OS images and bootstrapping configuration is needed.
Managed OpenShift Services
In services like ROSA, ARO, or OpenShift Dedicated:
- The provider often:
- Schedules and executes the core cluster upgrades.
- Maintains control plane and some Operators.
- You remain responsible for:
- Application readiness, PDBs, and resiliency.
- User-installed Operators and workloads.
- Coordination is typically done through the provider’s console or ticketing system, with defined upgrade windows and SLAs.
Automating and Scheduling Upgrades
To make upgrades repeatable and less error-prone:
- Use automation where supported:
- Scheduled upgrades through infrastructure provider tools or APIs.
- CI pipelines that run health checks and basic tests before/after upgrades.
- Define standard patterns:
- Upgrade non-production clusters first, observe for a defined soak period.
- Then upgrade staging, then production.
- Integrate with existing change management processes to track approvals and results.
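As a sketch, a CI gate like the following could run before a scheduled upgrade, refusing to start unless the cluster is fully healthy (`TARGET_VERSION` and the logged-in `oc` session are assumptions of this example):

```shell
#!/bin/sh
# Abort the pipeline unless all ClusterOperators and nodes are healthy,
# then kick off the upgrade to the approved target version.
set -eu

bad_cos=$(oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"' | wc -l)
bad_nodes=$(oc get nodes --no-headers | awk '$2 != "Ready"' | wc -l)

if [ "$bad_cos" -ne 0 ] || [ "$bad_nodes" -ne 0 ]; then
  echo "Cluster not healthy; aborting upgrade" >&2
  exit 1
fi

oc adm upgrade --to="${TARGET_VERSION:?TARGET_VERSION must be set}"
```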
By treating the OpenShift upgrade process as a regular, well-documented operational activity, rather than an ad-hoc event, you can keep your clusters secure, supported, and aligned with the rest of your platform ecosystem.