Understanding Self-Healing in OpenShift
Self-healing in OpenShift is the platform’s ability to automatically detect and correct certain classes of failures without human intervention. In practical terms, this often means:
- Detecting that a pod is unhealthy or has failed.
- Restarting or replacing that pod.
- Ensuring that the desired number of replicas is maintained.
This chapter focuses on how OpenShift uses Kubernetes health checks and controllers to achieve self-healing, and what you, as an application developer or operator, need to configure.
Health Checks: Liveness vs Readiness vs Startup
OpenShift relies on Kubernetes probes to understand the state of your containers. These probes are the foundation of self-healing behavior for pods.
Liveness Probes: When to Restart a Container
A liveness probe answers: “Is this container still alive, or should it be killed and restarted?”
If a liveness probe fails repeatedly, the kubelet kills the container. The pod remains scheduled on the node, but its containers are restarted according to the pod’s restart policy.
Typical use cases:
- The application is stuck in a deadlock.
- The application is running, but will never recover to a healthy state on its own.
- A critical internal subsystem has failed and cannot be reinitialized.
Common liveness probe types:
- HTTP GET: OpenShift sends an HTTP request to your application.
- TCP Socket: OpenShift checks that a TCP connection can be established.
- Exec: OpenShift runs a command inside the container and checks the exit code.
Example (simplified) liveness probe in a pod template:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
When misconfigured, liveness probes can cause restart loops. For instance, if the probe is too strict or starts too early, the container may be repeatedly killed before it is ready.
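Besides HTTP checks, the same probe can be expressed as a TCP or exec check. The port, script path, and thresholds below are illustrative assumptions to adapt to your application, not values prescribed by this chapter:

# TCP variant: the kubelet only verifies that the port accepts connections.
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

# Exec variant: the command runs inside the container; a non-zero exit code
# counts as a probe failure. The script path is a hypothetical example.
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - /opt/app/health-check.sh
  initialDelaySeconds: 30
  periodSeconds: 10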
Readiness Probes: When to Receive Traffic
A readiness probe answers: “Can this container serve requests right now?”
If a readiness probe fails:
- The pod is marked as not ready.
- It is removed from Service endpoints.
- Existing connections might fail (depending on the app), but new connections are not routed to that pod.
No restart happens solely due to a failing readiness probe. This is crucial for graceful handling of:
- Long startup or warm-up phases.
- Temporary overload or backpressure.
- Dependencies not yet available (e.g., database, message broker).
Example readiness probe:
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
Using readiness probes correctly helps self-healing at the traffic-routing level: unhealthy pods silently stop receiving requests while they recover.
Startup Probes: Avoiding Premature Restarts
Startup probes are used for slow-starting applications, especially legacy or JVM-based workloads that take time to bootstrap.
A startup probe answers: “Has this application finished starting yet?”
Behavior:
- Until the startup probe succeeds, liveness and readiness probes are suppressed.
- Once the startup probe succeeds, liveness and readiness probes become active.
- If the startup probe fails too many times, the container is killed and restarted.
This prevents liveness probes from killing containers during long initialization.
Example startup probe:
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 60
  periodSeconds: 5
This configuration allows up to 60 × 5 = 300 seconds (5 minutes) for your application to start before it is considered failed.
Pod Restart Policies in OpenShift
Each pod has a restartPolicy that defines how containers are restarted when they exit:
- Always (the default for most workloads):
  - Containers are restarted whenever they exit, regardless of exit code.
  - Used for typical long-running services and microservices.
- OnFailure:
  - Containers are restarted only if they exit with a non-zero status.
  - Common for certain batch jobs (when used within higher-level controllers).
- Never:
  - Containers are never restarted after exit.
  - Typically for one-off tasks or custom scenarios.
For controller-managed workloads in OpenShift:
- Deployments, DeploymentConfigs, StatefulSets, etc. are generally used with restartPolicy: Always.
- Jobs and CronJobs use restartPolicy: OnFailure or Never.
Restart policies act at the container level within a pod. Higher-level controllers (e.g., Deployment) act at the pod level, recreating pods if necessary.
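To illustrate a non-Always policy, here is a minimal Job sketch; the name, image, and command are placeholders, not values from this chapter:

apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration          # hypothetical name
spec:
  backoffLimit: 4               # number of retries before the Job is marked failed
  template:
    spec:
      restartPolicy: OnFailure  # restart the container only on a non-zero exit code
      containers:
      - name: migrate
        image: registry.example.com/tools/migrate:1.0    # placeholder image
        command: ["/bin/sh", "-c", "run-migration"]      # placeholder command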
Controller-Driven Self-Healing
Beyond restarting containers within a pod, OpenShift uses controllers to ensure that the desired state of your application is maintained.
ReplicaSet / Deployment: Keeping Replica Count
For Deployments (and similarly for DeploymentConfigs in OpenShift):
- You define a desired number of replicas, e.g., replicas: 3.
- The Deployment controller ensures that 3 pods matching the template are always running.
Self-healing scenarios:
- A pod crashes and cannot be restarted (e.g., image pull error, node failure).
- The node hosting a pod becomes unreachable.
- A human accidentally deletes a pod.
In such cases, the controller creates replacement pod(s) until the desired replica count is restored.
From an application perspective, this means:
- You should assume pods are ephemeral and stateless by default.
- The platform may terminate and recreate pods at any time.
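To make this concrete, here is a minimal Deployment sketch; the names and image are illustrative placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend            # hypothetical name
spec:
  replicas: 3                   # desired state: three pods at all times
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: web
        image: registry.example.com/shop/web-frontend:2.1   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080

If one of the three pods is deleted or its node fails, the controller observes the gap between desired and actual state and creates a replacement.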
DaemonSets: One Pod per Node
For DaemonSets (commonly used for logging agents, monitoring agents, etc.):
- The DaemonSet controller ensures that each eligible node has exactly one pod.
- If a node is added, the controller creates a new pod there.
- If a pod is deleted on a node, a new one is created.
This is a self-healing pattern focused on node-level services rather than scaling.
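A minimal DaemonSet sketch; the agent name and image are assumed placeholders:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent               # hypothetical name
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
      - name: agent
        image: registry.example.com/ops/log-agent:1.4   # placeholder image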
StatefulSets: Identity and Restart Behavior
StatefulSets add stable network identities and ordinal indices to pods. In terms of self-healing:
- If pod-0 of a StatefulSet fails, the controller recreates pod-0, preserving its identity (name, DNS).
- Combined with persistent storage, this supports recovery of stateful workloads.
You still benefit from self-healing, but need to carefully handle:
- Application-level recovery from persisted state.
- Ordered startup/shutdown if configured.
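A compact StatefulSet sketch; the headless Service name, image, and storage size are assumptions for illustration:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                      # pods will be named db-0, db-1, ...
spec:
  serviceName: db-headless      # assumed headless Service providing stable DNS
  replicas: 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: registry.example.com/data/db:5.7   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/db
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi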
Node-Level Failures and Pod Eviction
Self-healing behaviors also apply when nodes fail or become unhealthy.
Node NotReady and Pod Eviction
When a node becomes NotReady (e.g., network partition, hardware failure):
- The control plane marks pods on that node as unreachable.
- After a timeout (historically the pod-eviction-timeout setting; on current clusters, taint-based eviction tolerations), the affected pods are evicted, and for controller-managed pods the controllers create replacements that are scheduled onto healthy nodes (see the toleration sketch after this list).
- If the node later returns, the original pods may be terminated or considered orphaned, depending on conditions.
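On clusters using taint-based eviction, this delay can be tuned per pod with tolerations. A sketch, where the 120-second value is only an example:

# Pod spec fragment: evict this pod after 120 seconds on a not-ready or
# unreachable node instead of the default (typically 300 seconds).
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 120
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 120

Shorter values mean faster failover, but also more pod churn during brief network blips.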
From your application’s perspective:
- Pods may “disappear” from one node and reappear on another.
- Any data stored only on node-local ephemeral storage is lost.
- Services and load balancing automatically redirect traffic to surviving/recreated pods.
Taints, Drains, and Maintenance
During planned maintenance:
- Nodes are cordoned and drained.
- Pods are evicted and rescheduled onto other nodes.
- Controllers ensure that desired replica counts are maintained.
This is operational self-healing at the cluster level: the platform keeps your application running while nodes are taken in and out of service.
Common Patterns and Pitfalls in Self-Healing
Understanding how self-healing works helps you avoid misconfigurations that lead to instability.
Pattern: Health Endpoint Design
Good health endpoints are critical:
- Liveness endpoint:
- Should return failure only when the process truly cannot recover.
- Avoid checking external dependencies too aggressively; transient dependency failures should not always trigger restarts.
- Readiness endpoint:
- Can depend on external services (database, cache, etc.).
- Indicates whether it’s appropriate to receive traffic now.
- Startup endpoint:
- Focused purely on initial bootstrap (migrations, cache warm-up, etc.).
Bad practice examples:
- Using the same endpoint for liveness and readiness with strict dependency checks:
- Temporary DB outage → liveness fails → container restarts unnecessarily.
- Making liveness probe too sensitive:
- Small GC pauses or spikes in latency cause restarts.
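Putting these guidelines together, a container spec might give each probe its own endpoint and keep the liveness check deliberately lenient. The paths and timings below are illustrative assumptions:

containers:
- name: app
  image: registry.example.com/shop/app:1.0   # placeholder image
  startupProbe:
    httpGet:
      path: /startup     # succeeds once bootstrap (migrations, warm-up) is done
      port: 8080
    failureThreshold: 60
    periodSeconds: 5
  livenessProbe:
    httpGet:
      path: /healthz     # process-level check only; no external dependencies
      port: 8080
    periodSeconds: 10
    timeoutSeconds: 5    # tolerate short pauses instead of restarting
    failureThreshold: 5
  readinessProbe:
    httpGet:
      path: /ready       # may check the database, cache, and other dependencies
      port: 8080
    periodSeconds: 5
    failureThreshold: 3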
Pattern: Backoff and Crash Loops
When a container exits repeatedly:
- Kubernetes uses an exponential backoff delay before restarting it again.
- You might see CrashLoopBackOff in the pod status.
Common causes:
- Application misconfiguration (bad environment variables, missing secrets).
- Immediate crash on startup due to code or dependency issues.
- Liveness probe killing the container repeatedly.
Self-healing cannot fix these underlying issues; it only restarts the container. You need to:
- Inspect logs (oc logs).
- Describe the pod (oc describe pod).
- Fix configuration or code so the container can reach a stable state.
Pattern: Graceful Shutdown and Termination
When a pod is terminated (for restart, scaling, or eviction):
- A SIGTERM signal is sent to the container.
- terminationGracePeriodSeconds defines how long Kubernetes waits before sending SIGKILL.
- During this grace period, the pod is marked as terminating and removed from Service endpoints, so it stops receiving new traffic.
To participate correctly in self-healing and rescheduling:
- Implement signal handling and graceful shutdown logic.
- Flush in-flight requests, close connections, and release resources.
- Keep the grace period long enough for typical shutdown, but not excessively long.
Incorrect behavior (e.g., ignoring SIGTERM) can make restarts and replacements slow and disruptive.
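A pod-spec sketch of the relevant knobs; the grace period, image, and preStop sleep are example values rather than recommendations:

spec:
  terminationGracePeriodSeconds: 45   # upper bound on shutdown (preStop + SIGTERM handling) before SIGKILL
  containers:
  - name: app
    image: registry.example.com/shop/app:1.0   # placeholder image
    lifecycle:
      preStop:
        exec:
          # Optional pause before SIGTERM is delivered, giving load balancers
          # a moment to stop routing new requests to this pod.
          command: ["/bin/sh", "-c", "sleep 5"]

The grace period should comfortably cover draining in-flight requests plus the preStop delay.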
OpenShift-Specific Aspects
While the underlying mechanisms come from Kubernetes, OpenShift adds some platform-level integrations that affect self-healing:
- Router (Ingress/Routes) integration:
- Automatically respects pod readiness; unhealthy pods are removed from load-balancing.
- Operators:
- Many platform components (monitoring stack, logging, registries) use Operators that implement self-healing loops at a higher abstraction level (ensuring operands are healthy).
- Security Context Constraints (SCCs) and self-healing:
- Misconfigured security contexts may cause containers to crash at startup.
- Controllers will keep trying (self-healing), but pods will stay in a broken state until SCC or permissions are corrected.
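If pods crash because of SCC conflicts, one common remedy is to avoid hard-coding user IDs and privileges that the restricted SCC rejects. A hedged container-level securityContext sketch:

# Container-level securityContext that is typically compatible with the
# restricted SCC: no fixed UID, no privilege escalation, no extra capabilities.
securityContext:
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  capabilities:
    drop: ["ALL"]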
Designing Applications for Self-Healing
To benefit from self-healing and pod restarts, design your workloads with the following principles:
- Stateless by default:
- Store state in external services (databases, object storage) or via PersistentVolumes.
- Idempotent and retry-safe operations:
- Since pods and requests can be retried, operations should tolerate duplicates where possible.
- Fast fail, fast recovery:
- Prefer failing quickly with clear logs over hanging indefinitely.
- Rely on liveness/readiness to manage lifecycle.
- Observability:
- Expose metrics and structured logs to understand why restarts are occurring.
- Combine with alerts to distinguish healthy self-healing from pathological crash loops.
Self-healing and pod restarts are powerful tools, but they are only effective when the application is built to cooperate with the platform’s behavior.