Goals of Node Maintenance in OpenShift
Node maintenance in OpenShift focuses on keeping worker and control plane nodes healthy and up to date while minimizing disruption to running workloads. In practice this means:
- Safely draining workloads from a node
- Controlling where pods can (and cannot) be scheduled
- Performing OS / kernel / firmware updates or hardware work
- Returning nodes to service in a controlled way
- Handling failures and unplanned maintenance scenarios
This chapter concentrates on the mechanics and patterns specific to OpenShift nodes, assuming general cluster lifecycle concepts are already covered elsewhere.
Node States and Scheduling Controls
OpenShift relies on Kubernetes primitives plus some OpenShift‑specific tooling to control when nodes accept workloads.
Unschedulable vs Ready
Two commonly used states:
- Ready: The node is healthy and can run pods (subject to taints/tolerations).
- SchedulingDisabled (unschedulable: true): The node is healthy, but the scheduler will not place new pods there.
You typically mark a node unschedulable before any disruptive maintenance to stop new workloads from landing while you evacuate existing ones.
Key tools:
- oc adm cordon: mark the node unschedulable
- oc adm uncordon: mark the node schedulable again
Cordoning does not move existing pods; it only prevents new scheduling to that node.
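As a minimal illustration (worker-1 is a placeholder node name):

```bash
# Stop new pods from being scheduled onto the node.
oc adm cordon worker-1

# The node now shows SchedulingDisabled in its status.
oc get node worker-1

# After maintenance, allow scheduling again.
oc adm uncordon worker-1
```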
Taints and Tolerations for Maintenance
Taints allow you to repel pods from a node unless they explicitly tolerate the taint.
A common maintenance pattern is to apply a taint such as:
- key=maintenance
- effect=NoSchedule or NoExecute
Examples of usage during maintenance:
- NoSchedule: prevent new pods from landing on the node (similar effect to cordon, but taint-based and more expressive).
- NoExecute: immediately evict pods that do not tolerate the taint, in addition to preventing new scheduling.
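For example, a maintenance taint can be applied and removed with oc adm taint; the key maintenance and value true are illustrative choices rather than a required convention:

```bash
# Block new pods that do not tolerate the taint.
oc adm taint nodes worker-1 maintenance=true:NoSchedule

# Stronger variant: also evict running pods that lack a matching toleration.
oc adm taint nodes worker-1 maintenance=true:NoExecute

# Remove all taints with the "maintenance" key once work is finished.
oc adm taint nodes worker-1 maintenance-
```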
Taints are especially useful when:
- You want a consistent automation pattern (e.g., GitOps-managed labels/taints for lifecycle).
- You need different behavior for some workloads (e.g., critical “must stay” pods with tolerations, all others evicted).
Cordoning and taints are complementary:
- Cordoning is simple and built-in for routine maintenance.
- Taints enable more nuanced control and integration with custom policies.
Draining Nodes
Draining is the core workflow for taking a node out of service in a controlled way.
What Draining Does
When you run oc adm drain on a node, OpenShift will:
- Mark the node unschedulable (cordon).
- Evict or delete pods on that node, subject to constraints and special handling for:
- Pod Disruption Budgets (PDBs)
- DaemonSets
- Static pods
- Local storage
- Let the scheduler move pods to other nodes (where allowed).
The goal is to maintain application availability while freeing the node for maintenance.
Basic Drain Command
Common pattern:
oc adm drain <node-name> \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60 \
  --timeout=600s
Key flags:
- --ignore-daemonsets: do not try to evict DaemonSet pods (which will be recreated as needed).
- --delete-emptydir-data: confirms you accept losing emptyDir data for evicted pods.
- --force: needed when evicting pods not managed by a controller (e.g., standalone pods).
- --pod-selector: optionally target only certain pods (less common for full node maintenance).
Interaction with Pod Disruption Budgets
PDBs limit simultaneous voluntary disruptions. During drain:
- If a PDB would be violated, drain will block/slow and may time out.
- This is expected behavior for highly available apps.
Typical operations pattern:
- Review PDBs for the node’s workloads.
- Ensure enough replicas and capacity in other nodes.
- Adjust PDBs temporarily if needed, with care.
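A quick way to review which budgets might block a drain (the PDB and namespace names are placeholders):

```bash
# List PodDisruptionBudgets cluster-wide; ALLOWED DISRUPTIONS of 0 means the
# next eviction covered by that budget will be refused.
oc get pdb --all-namespaces

# Inspect a specific budget in detail.
oc describe pdb my-app-pdb -n my-namespace
```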
DaemonSets and Static Pods
Draining does not evict:
- DaemonSet-managed pods (unless handled specially)
- Static pods defined via the kubelet (common on control plane nodes)
For these, you typically:
- Use platform-managed procedures for control plane components.
- Accept that node shutdown or kubelet stop will eventually terminate them.
On worker nodes, DaemonSet workloads (logging agents, monitoring agents, CNI components) are expected to stop when the node is taken offline.
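To see what is still running on a node before or after a drain, a field selector on spec.nodeName is handy (worker-1 is a placeholder); DaemonSet and static pods typically remain in the list:

```bash
# List every pod currently bound to the node.
oc get pods --all-namespaces --field-selector spec.nodeName=worker-1 -o wide
```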
Typical Node Maintenance Workflow
A common pattern for planned maintenance on a worker node:
- Identify the node
  - E.g., run oc get nodes and check labels/roles, or match hostnames to hardware.
- Cordon the node
  - oc adm cordon <node-name>
  - New pods will not schedule there.
- Drain the node
  - oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
  - Monitor progress; resolve any blocking PDBs or “unmanaged” pods.
- Perform OS / firmware / hardware work
  - Reboot, apply OS patches, replace components, etc.
  - From OpenShift’s perspective the node is now idle/empty.
- Verify node health after maintenance
  - When the node comes back, ensure:
    - oc get node <node-name> shows Ready
    - Node conditions (e.g., DiskPressure, MemoryPressure) are False
    - Machine config/state is as expected (for clusters using MachineConfig / MCO)
- Uncordon the node
  - oc adm uncordon <node-name>
  - New pods can now be scheduled onto the node.
- Observe rescheduling behavior
  - Verify key workloads are distributed as desired.
  - Watch for unexpected concentration of pods on the restored node.
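Strung together, the steps above amount to a short command sequence. This is a sketch: the node name and timeouts are placeholders, and the actual patching or reboot happens out of band between the drain and the uncordon:

```bash
NODE=worker-1   # placeholder node name

# 1. Cordon: stop new pods from scheduling onto the node.
oc adm cordon "$NODE"

# 2. Drain: evict existing workloads while respecting PDBs.
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=600s

# 3. Perform the OS / firmware / hardware work out of band, then wait for the
#    node to report Ready again.
oc wait node/"$NODE" --for=condition=Ready --timeout=15m

# 4. Return the node to service.
oc adm uncordon "$NODE"
```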
For control plane nodes, you typically follow vendor-documented, highly controlled procedures, often integrated with the cluster’s Machine API and MachineConfig Operator.
Using Machine API and MachineConfig for Node Lifecycle
In many OpenShift deployments (especially on cloud or virtual platforms), nodes are managed via:
- Machine API (e.g., Machine and MachineSet resources)
- MachineConfig Operator (MCO)
These abstractions let you treat nodes more like cattle than pets.
Replacing Nodes vs In-Place Maintenance
Instead of patching and rebooting a node repeatedly, a common pattern is:
- Scale up: Add a new node (via MachineSet scaling).
- Evacuate the old node: Drain and eventually delete the corresponding Machine.
- Scale down: Let the platform remove the old VM/instance.
Advantages:
- Cleaner, predictable state (fresh OS image).
- Easier to roll back by deleting newly added nodes if something goes wrong.
- Better fit for rolling upgrade practices.
This pattern is particularly effective for:
- Large worker node pools
- Environments where node images are built and validated ahead of time
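A sketch of the scale-up / scale-down flow with the Machine API; the MachineSet name and replica counts are placeholders, and the exact method for targeting a specific Machine during scale-down should follow your cluster's documentation:

```bash
# Inspect the pools and machines (openshift-machine-api is the standard
# namespace on installer-provisioned clusters).
oc get machineset -n openshift-machine-api
oc get machine -n openshift-machine-api

# Add capacity first so a fresh node joins before anything is removed.
oc scale machineset my-machineset -n openshift-machine-api --replicas=4

# Once the new node is Ready and the old node has been drained, scale back
# down, marking the old Machine for removal per the Machine API documentation
# so the controller does not pick a different one.
oc scale machineset my-machineset -n openshift-machine-api --replicas=3
```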
Coordinating with MachineConfig Updates
MachineConfig defines desired OS-level configuration (kernel version, systemd units, etc.). When a MachineConfig changes:
- MCO will roll out changes across nodes.
- Nodes are cordoned and drained automatically as part of the update.
- Reboots are handled by the MCO itself, not by administrators rebooting nodes manually.
For manual maintenance, you should:
- Check for in-progress MachineConfig updates before starting maintenance.
- Avoid conflicting updates (e.g., manually updating the OS while MCO is also changing configs).
- Use cluster-approved methods to roll configuration changes rather than ad-hoc tuning.
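For example, a quick pre-maintenance check that the MCO is not mid-rollout:

```bash
# UPDATING=True, or a gap between MACHINECOUNT and UPDATEDMACHINECOUNT, means a
# configuration rollout is still in progress; wait before intervening manually.
oc get machineconfigpool

# Inspect a specific pool for Degraded or Updating conditions.
oc describe machineconfigpool worker
```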
Handling Different Node Roles
Node maintenance strategies differ slightly by role.
Worker Nodes
Typical considerations:
- Application impact:
- Respect PDBs and HA design of applications.
- Ensure enough spare capacity in the pool to absorb drained workloads.
- Specialized workloads:
- For GPU or high-memory nodes, draining may displace large jobs that are hard to reschedule immediately.
- Coordinate with workload owners or schedulers (for HPC/batch, see the dedicated chapter on specialized workloads).
Infrastructure / Specialized Nodes
Some clusters designate specific nodes for:
- Ingress / routing
- Storage
- Logging / monitoring
- Registry or other infrastructure services
Maintenance patterns:
- Coordinate with the relevant platform Operators backing these services.
- Where possible, ensure multiple replicas of critical platform components so that draining one infra node doesn’t cause downtime.
- Validate that traffic or storage failover works as intended during and after node maintenance.
Control Plane Nodes
Control plane node maintenance is more sensitive:
- Usually done one node at a time.
- May have strict order and procedures defined by OpenShift documentation.
- Often automated via cluster upgrades rather than manual OS-level changes.
A general operational guideline:
- Prefer built-in cluster upgrade processes for control plane changes.
- Use manual node maintenance on control plane nodes only when recommended and with tested runbooks.
Managing Capacity and Disruption Risk
Node maintenance always interacts with cluster capacity and SLA expectations.
Ensuring Sufficient Capacity
Before draining:
- Confirm that remaining nodes can handle the displaced workloads under normal and peak load.
- Consider autoscaling:
- Cluster autoscaler may add nodes if configured, but this takes time.
- For planned maintenance, proactively scale up first, then drain.
This capacity-aware approach reduces:
- Risk of evictions due to insufficient resources.
- Performance degradation for user workloads.
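Two commands help gauge whether the remaining nodes can absorb the displaced workloads; the second assumes cluster metrics are available, and worker-2 is a placeholder for a node that will take on extra load:

```bash
# Current CPU/memory usage per node (requires metrics).
oc adm top nodes

# Requested vs. allocatable resources on a node that will absorb workloads.
oc describe node worker-2 | grep -A 10 "Allocated resources"
```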
Staggered vs Parallel Maintenance
For large fleets:
- Staggered maintenance:
- Drain a small set of nodes (often 1 per zone or rack) at a time.
- Observe system behavior, then continue.
- Parallel maintenance:
- Faster but riskier.
- Typically used only when capacity is abundant and workloads tolerate disruption.
Policies are often codified in:
- Runbooks / SOPs
- Automation tools (e.g., Ansible playbooks, GitOps pipelines)
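A minimal sketch of the staggered pattern in shell, one node at a time with a manual checkpoint between nodes; the node list, timeout, and interactive pause are illustrative and would normally be replaced by playbook or pipeline logic:

```bash
for node in worker-1 worker-2 worker-3; do
  oc adm cordon "$node"
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=600s

  # ...perform OS patching / reboot out of band...

  # Pause so a human (or a health check) can confirm the node is healthy again.
  read -r -p "Node $node maintained and Ready? Press Enter to uncordon and continue. "
  oc adm uncordon "$node"
done
```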
Common Operational Scenarios
Routine OS Patching
Pattern:
- Build and validate new OS image (or MachineConfig).
- Scale out new nodes using the new image.
- Drain and remove nodes on the old image.
- Repeat iteratively across the fleet.
Benefits:
- Risk is spread out.
- You can stop the rollout if unexpected issues appear.
Hardware Replacement or Rack Work
Steps:
- Identify all nodes in the affected rack/cluster segment.
- For HA, ensure workloads are spread across other racks/zones.
- Drain and shut down affected nodes.
- Replace hardware / perform work.
- Bring nodes back, verify they rejoin correctly, then uncordon.
For critical applications, coordinate maintenance windows with application teams.
Node in Degraded or Unknown State
Symptoms:
- Node flapping between Ready and NotReady
- High disk or memory pressure
- Repeated kubelet restarts
Typical actions:
- Cordon the node to avoid new workloads.
- Drain if feasible without violating critical PDBs.
- Investigate OS-level issues (disk, filesystem, etc.).
- If the node is unstable, consider fully replacing it (via Machine API) rather than attempting complex repair.
- Monitor workloads after rebalancing to ensure no residual impact.
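Useful starting points for the investigation (worker-1 is a placeholder):

```bash
# Node conditions, taints, capacity, and recent kubelet-reported state.
oc describe node worker-1

# Events referencing the node often point at kubelet, disk, or memory issues.
oc get events --all-namespaces --field-selector involvedObject.kind=Node,involvedObject.name=worker-1

# Open a debug pod on the node to inspect OS-level state (journal, disks, kubelet).
oc debug node/worker-1
```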
Automation and Best Practices
Automating Node Maintenance
Common automation techniques:
- Ansible playbooks:
- Encapsulate oc adm cordon / drain / uncordon.
- Integrate with OS patching and reboots.
- GitOps:
- Use labels/taints managed in Git to track maintenance state.
- Automation reacts to state changes (e.g., taint applied → drain node).
- Cloud / VM API integration:
- Workflows that coordinate with instance lifecycle (stop/start/replace).
Automation should include:
- Safety checks (cluster health, capacity).
- Timeouts and rollback behavior.
- Logging and audit trails.
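As a minimal sketch of such a wrapper, assuming a single node name is passed as an argument; the degraded-operator check and timeout are illustrative safety checks, and a real playbook would add capacity checks, logging, and notifications:

```bash
#!/usr/bin/env bash
set -euo pipefail
NODE="$1"

# Safety check: refuse to start if any cluster operator reports Degraded=True.
degraded=$(oc get clusteroperators -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}')
if grep -q True <<<"$degraded"; then
  echo "Cluster operators degraded; aborting maintenance of $NODE" >&2
  exit 1
fi

oc adm cordon "$NODE"

# Drain with a timeout; on failure, roll back the cordon and surface the error.
if ! oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=600s; then
  echo "Drain of $NODE failed or timed out; uncordoning and aborting" >&2
  oc adm uncordon "$NODE"
  exit 1
fi

echo "$NODE drained; proceed with patching, then verify health and uncordon."
```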
Recommended Practices
- Prefer node replacement over complex in-place repair when feasible.
- Maintain homogeneous node pools where possible to simplify scheduling and maintenance.
- Test maintenance procedures in non-production clusters before applying to critical environments.
- Document and regularly rehearse runbooks, including:
- Planned maintenance
- Emergency node removal
- Recovery from failed maintenance
- Coordinate with:
- Application owners (for disruption-tolerant vs sensitive workloads)
- Security/compliance teams (for maintenance windows and patching policies)
Verifying Post-Maintenance Health
After any node maintenance, validate the cluster and workloads:
- Node health
  - oc get nodes
  - oc describe node <node-name> for conditions and resource capacity.
- Key platform components
  - Status of core OpenShift Operators (e.g., oc get co).
  - Check for degraded or progressing conditions.
- Workload distribution
  - Ensure pods are spread across nodes and zones as expected.
  - Look for unschedulable pods (Pending state) or repeated restarts.
- Application checks
  - Run lightweight synthetic tests or smoke tests against critical applications.
  - Confirm SLIs (latency, error rates) remain within expected ranges.
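As a sketch, a quick verification pass can combine:

```bash
# Node and cluster operator health.
oc get nodes
oc get clusteroperators

# Workloads that did not reschedule successfully after maintenance.
oc get pods --all-namespaces --field-selector status.phase=Pending
```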
Closing the loop with verification ensures that maintenance not only completed, but did so without hidden side effects on cluster stability or application behavior.