Goals of Node Maintenance in OpenShift
Node maintenance in OpenShift focuses on keeping worker and control plane nodes healthy and up to date while minimizing disruption to running workloads. In practice this means:
- Safely draining workloads from a node
- Controlling where pods can (and cannot) be scheduled
- Performing OS / kernel / firmware updates or hardware work
- Returning nodes to service in a controlled way
- Handling failures and unplanned maintenance scenarios
This chapter concentrates on the mechanics and patterns specific to OpenShift nodes, assuming general cluster lifecycle concepts are already covered elsewhere.
Node States and Scheduling Controls
OpenShift relies on Kubernetes primitives plus some OpenShift‑specific tooling to control when nodes accept workloads.
Unschedulable vs Ready
Two commonly used states:
- Ready: The node is healthy and can run pods (subject to taints/tolerations).
- SchedulingDisabled (unschedulable: true): The node is healthy, but the scheduler will not place new pods there.
You typically mark a node unschedulable before any disruptive maintenance to stop new workloads from landing while you evacuate existing ones.
Key tools:
- oc adm cordon: mark the node unschedulable
- oc adm uncordon: mark the node schedulable again
Cordoning does not move existing pods; it only prevents new scheduling to that node.
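As a minimal illustration (worker-1 is a placeholder node name):

```bash
# Stop new pods from being scheduled onto the node.
oc adm cordon worker-1

# The node now shows SchedulingDisabled in its status.
oc get node worker-1

# After maintenance, allow scheduling again.
oc adm uncordon worker-1
```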
Taints and Tolerations for Maintenance
Taints allow you to repel pods from a node unless they explicitly tolerate the taint.
A common maintenance pattern is to apply a taint such as:
- key=maintenance
- effect=NoSchedule or NoExecute
Examples of usage during maintenance:
- NoSchedule: prevent new pods from landing on the node (similar effect to cordon, but taint-based and more expressive).
- NoExecute: immediately evict pods that do not tolerate the taint, in addition to preventing new scheduling.
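For example, a maintenance taint can be applied and removed with oc adm taint; the key maintenance and value true are illustrative choices rather than a required convention:

```bash
# Block new pods that do not tolerate the taint.
oc adm taint nodes worker-1 maintenance=true:NoSchedule

# Stronger variant: also evict running pods that lack a matching toleration.
oc adm taint nodes worker-1 maintenance=true:NoExecute

# Remove all taints with the "maintenance" key once work is finished.
oc adm taint nodes worker-1 maintenance-
```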
Taints are especially useful when:
- You want a consistent automation pattern (e.g., GitOps-managed labels/taints for lifecycle).
- You need different behavior for some workloads (e.g., critical “must stay” pods with tolerations, all others evicted).
Cordoning and taints are complementary:
- Cordoning is simple and built-in for routine maintenance.
- Taints enable more nuanced control and integration with custom policies.
Draining Nodes
Draining is the core workflow for taking a node out of service in a controlled way.
What Draining Does
When you run oc adm drain on a node, OpenShift will:
- Mark the node unschedulable (cordon).
- Evict or delete pods on that node, subject to constraints and special handling for:
- Pod Disruption Budgets (PDBs)
- DaemonSets
- Static pods
- Local storage
- Let the scheduler move pods to other nodes (where allowed).
The goal is to maintain application availability while freeing the node for maintenance.
Basic Drain Command
Common pattern:
oc adm drain <node-name> \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60 \
  --timeout=600s
Key flags:
- --ignore-daemonsets: do not try to evict DaemonSet pods (which will be recreated as needed).
- --delete-emptydir-data: confirms you accept losing emptyDir data for evicted pods.
- --force: needed when evicting pods not managed by a controller (e.g., standalone pods).
- --pod-selector: optionally target only certain pods (less common for full node maintenance).
Interaction with Pod Disruption Budgets
PDBs limit simultaneous voluntary disruptions. During drain:
- If a PDB would be violated, drain will block/slow and may time out.
- This is expected behavior for highly available apps.
Typical operations pattern:
- Review PDBs for the node’s workloads.
- Ensure enough replicas and capacity in other nodes.
- Adjust PDBs temporarily if needed, with care.
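A quick way to review which budgets might block a drain (the PDB and namespace names are placeholders):

```bash
# List PodDisruptionBudgets cluster-wide; ALLOWED DISRUPTIONS of 0 means the
# next eviction covered by that budget will be refused.
oc get pdb --all-namespaces

# Inspect a specific budget in detail.
oc describe pdb my-app-pdb -n my-namespace
```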
DaemonSets and Static Pods
Draining does not evict:
- DaemonSet-managed pods (unless handled specially)
- Static pods defined via the kubelet (common on control plane nodes)
For these, you typically:
- Use platform-managed procedures for control plane components.
- Accept that node shutdown or kubelet stop will eventually terminate them.
On worker nodes, DaemonSet workloads (logging agents, monitoring agents, CNI components) are expected to stop when the node is taken offline.
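To see what is still running on a node before or after a drain, a field selector on spec.nodeName is handy (worker-1 is a placeholder); DaemonSet and static pods typically remain in the list:

```bash
# List every pod currently bound to the node.
oc get pods --all-namespaces --field-selector spec.nodeName=worker-1 -o wide
```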
Typical Node Maintenance Workflow
A common pattern for planned maintenance on a worker node:
- Identify the node
  - E.g., run oc get nodes and check labels/roles, or match hostnames to hardware.
- Cordon the node
  - oc adm cordon <node-name>
  - New pods will not schedule there.
- Drain the node
  - oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
  - Monitor progress; resolve any blocking PDBs or “unmanaged” pods.
- Perform OS / firmware / hardware work
  - Reboot, apply OS patches, replace components, etc.
  - From OpenShift’s perspective the node is now idle/empty.
- Verify node health after maintenance
  - When the node comes back, ensure:
    - oc get node <node-name> shows Ready
    - Node conditions (e.g., DiskPressure, MemoryPressure) are False
    - Machine config/state is as expected (for clusters using MachineConfig / MCO)
- Uncordon the node
  - oc adm uncordon <node-name>
  - New pods can now be scheduled onto the node.
- Observe rescheduling behavior
  - Verify key workloads are distributed as desired.
  - Watch for unexpected concentration of pods on the restored node.
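Strung together, the steps above amount to a short command sequence. This is a sketch: the node name and timeouts are placeholders, and the actual patching or reboot happens out of band between the drain and the uncordon:

```bash
NODE=worker-1   # placeholder node name

# 1. Cordon: stop new pods from scheduling onto the node.
oc adm cordon "$NODE"

# 2. Drain: evict existing workloads while respecting PDBs.
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=600s

# 3. Perform the OS / firmware / hardware work out of band, then wait for the
#    node to report Ready again.
oc wait node/"$NODE" --for=condition=Ready --timeout=15m

# 4. Return the node to service.
oc adm uncordon "$NODE"
```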
For control plane nodes, you typically follow vendor-documented, highly controlled procedures, often integrated with the cluster’s Machine API and MachineConfig Operator.
Using Machine API and MachineConfig for Node Lifecycle
In many OpenShift deployments (especially on cloud or virtual platforms), nodes are managed via:
- Machine API (e.g., Machine and MachineSet resources)
- MachineConfig Operator (MCO)
These abstractions let you treat nodes more like cattle than pets.
Replacing Nodes vs In-Place Maintenance
Instead of patching and rebooting a node repeatedly, a common pattern is:
- Scale up: Add a new node (via MachineSet scaling).
- Evacuate the old node: Drain and eventually delete the corresponding Machine.
- Scale down: Let the platform remove the old VM/instance.
Advantages:
- Cleaner, predictable state (fresh OS image).
- Easier to roll back by deleting newly added nodes if something goes wrong.
- Better fit for rolling upgrade practices.
This pattern is particularly effective for:
- Large worker node pools
- Environments where node images are built and validated ahead of time
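A sketch of the scale-up / scale-down flow with the Machine API; the MachineSet name and replica counts are placeholders, and the exact method for targeting a specific Machine during scale-down should follow your cluster's documentation:

```bash
# Inspect the pools and machines (openshift-machine-api is the standard
# namespace on installer-provisioned clusters).
oc get machineset -n openshift-machine-api
oc get machine -n openshift-machine-api

# Add capacity first so a fresh node joins before anything is removed.
oc scale machineset my-machineset -n openshift-machine-api --replicas=4

# Once the new node is Ready and the old node has been drained, scale back
# down, marking the old Machine for removal per the Machine API documentation
# so the controller does not pick a different one.
oc scale machineset my-machineset -n openshift-machine-api --replicas=3
```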
Coordinating with MachineConfig Updates
MachineConfig defines desired OS-level configuration (kernel version, systemd units, etc.). When a MachineConfig changes:
- MCO will roll out changes across nodes.
- Nodes are cordoned and drained automatically as part of the update.
- Reboots are handled by the MCO itself, not by administrators rebooting nodes manually.
For manual maintenance, you should:
- Check for in-progress MachineConfig updates before starting maintenance.
- Avoid conflicting updates (e.g., manually updating the OS while MCO is also changing configs).
- Use cluster-approved methods to roll configuration changes rather than ad-hoc tuning.
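For example, a quick pre-maintenance check that the MCO is not mid-rollout:

```bash
# UPDATING=True, or a gap between MACHINECOUNT and UPDATEDMACHINECOUNT, means a
# configuration rollout is still in progress; wait before intervening manually.
oc get machineconfigpool

# Inspect a specific pool for Degraded or Updating conditions.
oc describe machineconfigpool worker
```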
Handling Different Node Roles
Node maintenance strategies differ slightly by role.
Worker Nodes
Typical considerations:
- Application impact:
- Respect PDBs and HA design of applications.
- Ensure enough spare capacity in the pool to absorb drained workloads.
- Specialized workloads:
- For GPU or high-memory nodes, draining may displace large jobs that are hard to reschedule immediately.
- Coordinate with workload owners or schedulers (for HPC/batch, see the dedicated chapter on specialized workloads).
Infrastructure / Specialized Nodes
Some clusters designate specific nodes for:
- Ingress / routing
- Storage
- Logging / monitoring
- Registry or other infrastructure services
Maintenance patterns:
- Coordinate with the relevant platform Operators backing these services.
- Where possible, ensure multiple replicas of critical platform components so that draining one infra node doesn’t cause downtime.
- Validate that traffic or storage failover works as intended during and after node maintenance.
Control Plane Nodes
Control plane node maintenance is more sensitive:
- Usually done one node at a time.
- May have strict order and procedures defined by OpenShift documentation.
- Often automated via cluster upgrades rather than manual OS-level changes.
A general operational guideline:
- Prefer built-in cluster upgrade processes for control plane changes.
- Use manual node maintenance on control plane nodes only when recommended and with tested runbooks.
Managing Capacity and Disruption Risk
Node maintenance always interacts with cluster capacity and SLA expectations.
Ensuring Sufficient Capacity
Before draining:
- Confirm that remaining nodes can handle the displaced workloads under normal and peak load.
- Consider autoscaling:
- Cluster autoscaler may add nodes if configured, but this takes time.
- For planned maintenance, proactively scale up first, then drain.
This capacity-aware approach reduces:
- Risk of evictions due to insufficient resources.
- Performance degradation for user workloads.
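Two commands help gauge whether the remaining nodes can absorb the displaced workloads; the second assumes cluster metrics are available, and worker-2 is a placeholder for a node that will take on extra load:

```bash
# Current CPU/memory usage per node (requires metrics).
oc adm top nodes

# Requested vs. allocatable resources on a node that will absorb workloads.
oc describe node worker-2 | grep -A 10 "Allocated resources"
```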
Staggered vs Parallel Maintenance
For large fleets:
- Staggered maintenance:
- Drain a small set of nodes (often 1 per zone or rack) at a time.
- Observe system behavior, then continue.
- Parallel maintenance:
- Faster but riskier.
- Typically used only when capacity is abundant and workloads tolerate disruption.
Policies are often codified in:
- Runbooks / SOPs
- Automation tools (e.g., Ansible playbooks, GitOps pipelines)
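A minimal sketch of the staggered pattern in shell, one node at a time with a manual checkpoint between nodes; the node list, timeout, and interactive pause are illustrative and would normally be replaced by playbook or pipeline logic:

```bash
for node in worker-1 worker-2 worker-3; do
  oc adm cordon "$node"
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=600s

  # ...perform OS patching / reboot out of band...

  # Pause so a human (or a health check) can confirm the node is healthy again.
  read -r -p "Node $node maintained and Ready? Press Enter to uncordon and continue. "
  oc adm uncordon "$node"
done
```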
Common Operational Scenarios
Routine OS Patching
Pattern:
- Build and validate new OS image (or MachineConfig).
- Scale out new nodes using the new image.
- Drain and remove nodes on the old image.
- Repeat iteratively across the fleet.
Benefits:
- Risk is spread out.
- You can stop the rollout if unexpected issues appear.
Hardware Replacement or Rack Work
Steps:
- Identify all nodes in the affected rack/cluster segment.
- For HA, ensure workloads are spread across other racks/zones.
- Drain and shut down affected nodes.
- Replace hardware / perform work.
- Bring nodes back, verify they rejoin correctly, then uncordon.
For critical applications, coordinate maintenance windows with application teams.
Node in Degraded or Unknown State
Symptoms:
- Node flapping between Ready and NotReady
- High disk or memory pressure
- Repeated kubelet restarts
Typical actions:
- Cordon the node to avoid new workloads.
- Drain if feasible without violating critical PDBs.
- Investigate OS-level issues (disk, filesystem, etc.).
- If the node is unstable, consider fully replacing it (via Machine API) rather than attempting complex repair.
- Monitor workloads after rebalancing to ensure no residual impact.
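Useful starting points for the investigation (worker-1 is a placeholder):

```bash
# Node conditions, taints, capacity, and recent kubelet-reported state.
oc describe node worker-1

# Events referencing the node often point at kubelet, disk, or memory issues.
oc get events --all-namespaces --field-selector involvedObject.kind=Node,involvedObject.name=worker-1

# Open a debug pod on the node to inspect OS-level state (journal, disks, kubelet).
oc debug node/worker-1
```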
Automation and Best Practices
Automating Node Maintenance
Common automation techniques:
- Ansible playbooks:
- Encapsulate oc adm cordon / drain / uncordon.
- Integrate with OS patching and reboots.
- GitOps:
- Use labels/taints managed in Git to track maintenance state.
- Automation reacts to state changes (e.g., taint applied → drain node).
- Cloud / VM API integration:
- Workflows that coordinate with instance lifecycle (stop/start/replace).
Automation should include:
- Safety checks (cluster health, capacity).
- Timeouts and rollback behavior.
- Logging and audit trails.
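As a minimal sketch of such a wrapper, assuming a single node name is passed as an argument; the degraded-operator check and timeout are illustrative safety checks, and a real playbook would add capacity checks, logging, and notifications:

```bash
#!/usr/bin/env bash
set -euo pipefail
NODE="$1"

# Safety check: refuse to start if any cluster operator reports Degraded=True.
degraded=$(oc get clusteroperators -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}')
if grep -q True <<<"$degraded"; then
  echo "Cluster operators degraded; aborting maintenance of $NODE" >&2
  exit 1
fi

oc adm cordon "$NODE"

# Drain with a timeout; on failure, roll back the cordon and surface the error.
if ! oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=600s; then
  echo "Drain of $NODE failed or timed out; uncordoning and aborting" >&2
  oc adm uncordon "$NODE"
  exit 1
fi

echo "$NODE drained; proceed with patching, then verify health and uncordon."
```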
Recommended Practices
- Prefer node replacement over complex in-place repair when feasible.
- Maintain homogeneous node pools where possible to simplify scheduling and maintenance.
- Test maintenance procedures in non-production clusters before applying to critical environments.
- Document and regularly rehearse runbooks, including:
- Planned maintenance
- Emergency node removal
- Recovery from failed maintenance
- Coordinate with:
- Application owners (for disruption-tolerant vs sensitive workloads)
- Security/compliance teams (for maintenance windows and patching policies)
Verifying Post-Maintenance Health
After any node maintenance, validate the cluster and workloads:
- Node health
  - oc get nodes
  - oc describe node <node-name> for conditions and resource capacity.
- Key platform components
  - Status of core OpenShift Operators (e.g., oc get co).
  - Check for degraded or progressing conditions.
- Workload distribution
  - Ensure pods are spread across nodes and zones as expected.
  - Look for unschedulable pods (Pending state) or repeated restarts.
- Application checks
  - Run lightweight synthetic tests or smoke tests against critical applications.
  - Confirm SLIs (latency, error rates) remain within expected ranges.
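As a sketch, a quick verification pass can combine:

```bash
# Node and cluster operator health.
oc get nodes
oc get clusteroperators

# Workloads that did not reschedule successfully after maintenance.
oc get pods --all-namespaces --field-selector status.phase=Pending
```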
Closing the loop with verification ensures that maintenance not only completed, but did so without hidden side effects on cluster stability or application behavior.