
Self-healing and pod restarts

Understanding Self-Healing in OpenShift

Self-healing in OpenShift is the platform's ability to automatically detect and correct certain classes of failures without human intervention. In practical terms, this often means:

- Restarting containers that crash or fail their health checks.
- Recreating pods that are lost when a node fails or is drained.
- Removing unhealthy pods from service endpoints so they stop receiving traffic.
- Continuously reconciling the cluster's actual state with the desired state you declared.

This chapter focuses on how OpenShift uses Kubernetes health checks and controllers to achieve self-healing, and what you, as an application developer or operator, need to configure.

Health Checks: Liveness vs Readiness vs Startup

OpenShift relies on Kubernetes probes to understand the state of your containers. These probes are the foundation of self-healing behavior for pods.

Liveness Probes: When to Restart a Container

A liveness probe answers: “Is this container still alive, or should it be killed and restarted?”

If a liveness probe fails repeatedly, the kubelet kills the container. The pod remains scheduled on the node, but its containers are restarted according to the pod’s restart policy.

Typical use cases:

- Detecting deadlocks, where the process is still running but can no longer make progress.
- Recovering from hung event loops, exhausted thread pools, or corrupted internal state that only a restart can clear.

Common liveness probe types:

- httpGet: the kubelet sends an HTTP GET request and treats any 2xx or 3xx response as healthy.
- tcpSocket: the kubelet attempts to open a TCP connection to the given port.
- exec: the kubelet runs a command inside the container; exit code 0 means healthy.

Example (simplified) liveness probe in a pod template:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3

When misconfigured, liveness probes can cause restart loops. For instance, if the probe is too strict or starts too early, the container may be repeatedly killed before it’s ready.

Readiness Probes: When to Receive Traffic

A readiness probe answers: “Can this container serve requests right now?”

If a readiness probe fails:

- The pod is marked NotReady and removed from the endpoints of any matching Service (and, in OpenShift, from Route traffic).
- The container keeps running; the kubelet does not restart it.
- Once the probe succeeds again, the pod is added back to the endpoints.

No restart happens solely due to a failing readiness probe. This is crucial for graceful handling of:

- Application warm-up after a start or restart.
- Temporary overload, where a pod sheds traffic until it catches up.
- Outages of downstream dependencies the pod needs in order to serve requests.
- Rolling updates, where new pods receive traffic only once they are ready.

Example readiness probe:

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3

Using readiness probes correctly helps self-healing at the traffic routing level: unhealthy pods silently stop receiving requests while they recover.

Startup Probes: Avoiding Premature Restarts

Startup probes are used for slow-starting applications, especially legacy or JVM-based workloads that take time to bootstrap.

A startup probe answers: “Has this application finished starting yet?”

Behavior:

- While the startup probe is running, liveness and readiness probes are disabled.
- Once the startup probe succeeds, the liveness and readiness probes take over.
- If the startup probe never succeeds within failureThreshold × periodSeconds, the container is killed and handled according to the pod's restartPolicy.

This prevents liveness probes from killing containers during long initialization.

Example startup probe:

startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 60
  periodSeconds: 5

This configuration allows up to 60 × 5 = 300 seconds (5 minutes) for your app to start before being considered failed.
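
Since liveness checks are suspended until the startup probe succeeds, the two are typically declared together on the same container. A minimal sketch; the paths, port, and image are illustrative:

containers:
  - name: app
    image: quay.io/example/app:1.0   # illustrative image
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      failureThreshold: 60           # generous budget for slow bootstrap
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10              # only enforced after startup succeeds
      failureThreshold: 3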

Pod Restart Policies in OpenShift

Each pod has a restartPolicy that defines how containers are restarted when they exit:

- Always: containers are restarted regardless of exit code. This is the default.
- OnFailure: containers are restarted only when they exit with a non-zero status.
- Never: containers are never restarted; the pod is left in a terminal state.

For controller-managed workloads in OpenShift:

- Deployments, DeploymentConfigs, StatefulSets, and DaemonSets require restartPolicy: Always.
- Jobs and CronJobs use OnFailure or Never, because their pods are expected to run to completion.

Restart policies act at the container level within a pod. Higher-level controllers (e.g., Deployment) act at the pod level, recreating pods if necessary.
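
A minimal sketch of a bare pod that should be retried only on failure; the name and image are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: one-shot-task          # illustrative name
spec:
  restartPolicy: OnFailure     # restart only on non-zero exit
  containers:
    - name: task
      image: quay.io/example/task:1.0   # illustrative image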

Controller-Driven Self-Healing

Beyond restarting containers within a pod, OpenShift uses controllers to ensure that the desired state of your application is maintained.

ReplicaSet / Deployment: Keeping Replica Count

For Deployments (and similarly for DeploymentConfigs in OpenShift):

- You declare a desired number of replicas.
- The Deployment creates a ReplicaSet, and the ReplicaSet controller continuously compares the desired replica count with the pods that actually exist.

Self-healing scenarios:

- A pod is deleted, accidentally or deliberately.
- A pod's node fails and the pod is evicted.
- A pod is terminated during a rolling update or scaling event.

In such cases, the controller creates replacement pod(s) until the desired replica count is restored.

From an application perspective, this means:

- Individual pods are disposable; never rely on a specific pod surviving.
- Replacement pods get new names and IP addresses, so clients must discover them through Services rather than direct pod addresses.
- Any state that must outlive a pod belongs in external storage.
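
A minimal Deployment sketch declaring the desired state the controller maintains; the names and image are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3                  # desired state the controller keeps restoring
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: quay.io/example/myapp:1.0   # illustrative image
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080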

DaemonSets: One Pod per Node

For DaemonSets (commonly used for logging agents, monitoring agents, etc.):

- The DaemonSet controller ensures that exactly one pod runs on each node matching the DaemonSet's node selector.
- When a new node joins the cluster, a pod is automatically scheduled onto it.
- If a DaemonSet pod is deleted or crashes, it is recreated on the same node.

This is a self-healing pattern focused on node-level services rather than scaling.
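
A minimal DaemonSet sketch; the names and image are illustrative:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent              # illustrative name
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
        - name: agent
          image: quay.io/example/log-agent:1.0   # illustrative image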

StatefulSets: Identity and Restart Behavior

StatefulSets add stable network identities and ordinal indices to pods. In terms of self-healing:

- A failed pod is recreated with the same name (e.g., db-0) and reattached to the same persistent volume claim.
- With the default OrderedReady policy, pods are created and deleted in ordinal order, which can delay replacements if an earlier pod is unhealthy.

You still benefit from self-healing, but need to carefully handle:

- Reattachment of persistent storage when a pod moves to another node.
- Quorum and leader election in clustered systems, so a restarted member rejoins cleanly.
- Split-brain scenarios when a node is unreachable but its pod may still be running.
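
A minimal StatefulSet sketch, assuming a headless Service named db exists; the names, image, and storage size are illustrative:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db              # headless Service providing stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: quay.io/example/db:1.0   # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
  volumeClaimTemplates:        # each replica gets its own PVC (data-db-0, data-db-1, ...)
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi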

Node-Level Failures and Pod Eviction

Self-healing behaviors also apply when nodes fail or become unhealthy.

Node NotReady and Pod Eviction

When a node becomes NotReady (e.g., network partition, hardware failure):

- The node controller marks the node NotReady and taints it with node.kubernetes.io/not-ready or node.kubernetes.io/unreachable.
- Once the pods' toleration period for those taints expires (300 seconds by default), the pods are evicted.
- Controllers such as ReplicaSets create replacement pods on healthy nodes.

From your application's perspective:

- Workloads with multiple replicas spread across nodes keep serving traffic while replacements come up.
- Single-replica workloads experience downtime for at least the eviction timeout plus rescheduling and startup time.
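
The eviction delay can be tuned per pod with tolerations for these taints. A sketch that shortens it to 60 seconds:

spec:
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60    # evict after 60s instead of the 300s default
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60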

Taints, Drains, and Maintenance

During planned maintenance:

- A node is cordoned (marked unschedulable) so no new pods land on it.
- The node is drained: its pods are evicted gracefully and recreated elsewhere by their controllers.
- PodDisruptionBudgets limit how many replicas of an application may be disrupted at once.
- After maintenance, the node is uncordoned and rejoins the scheduling pool.

This is operational self-healing at the cluster level: the platform keeps your application running while nodes are taken in and out of service.
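
A PodDisruptionBudget sketch protecting a hypothetical app: myapp workload during drains; the threshold is illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2              # keep at least 2 replicas running during voluntary disruptions
  selector:
    matchLabels:
      app: myapp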

Common Patterns and Pitfalls in Self-Healing

Understanding how self-healing works helps you avoid misconfigurations that lead to instability.

Pattern: Health Endpoint Design

Good health endpoints are critical:

- Liveness endpoints should check only the process's own health (e.g., that it can still respond), not external dependencies.
- Readiness endpoints may check dependencies the pod needs in order to serve traffic, such as database connectivity.
- Health checks should be cheap, fast, and free of side effects.

Bad practice examples:

- A liveness endpoint that queries an external database: a database outage then causes every pod to be killed, amplifying the failure.
- Health checks that perform expensive work on every invocation, adding load precisely when the system is under stress.
- Reusing one endpoint for both liveness and readiness, so the container is restarted in situations where it only needed to shed traffic.

Pattern: Backoff and Crash Loops

When a container exits repeatedly:

- The kubelet restarts it with an exponentially increasing backoff delay (10s, 20s, 40s, and so on, capped at 5 minutes).
- The pod's status shows CrashLoopBackOff, signalling that the platform is waiting before the next restart attempt.

Common causes:

- Missing or invalid configuration, such as absent environment variables, ConfigMaps, or Secrets.
- Unreachable dependencies that the application treats as fatal at startup.
- Out-of-memory kills caused by memory limits set too low.
- Application bugs that crash the process immediately.

Self-healing cannot fix these underlying issues; it only restarts the container. You need to:

- Inspect the previous container's output with oc logs --previous.
- Check pod events with oc describe pod for OOMKilled, failed mounts, or image pull errors.
- Fix the root cause rather than relying on restarts.

Pattern: Graceful Shutdown and Termination

When a pod is terminated (for restart, scaling, or eviction):

- The pod is marked Terminating and removed from Service endpoints.
- Each container receives SIGTERM, after any configured preStop hook has run.
- Once terminationGracePeriodSeconds elapses (30 seconds by default), remaining processes are killed with SIGKILL.

To participate correctly in self-healing and rescheduling:

- Handle SIGTERM: stop accepting new requests, finish in-flight work, and exit promptly.
- Keep shutdown well within the grace period, or raise terminationGracePeriodSeconds if you legitimately need more time.
- Optionally add a short preStop delay so endpoint removal propagates before the process stops listening.

Incorrect behavior (e.g., ignoring SIGTERM) can make restarts and replacements slow and disruptive.
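
A sketch combining a longer grace period with a short preStop delay; the values and image are illustrative:

spec:
  terminationGracePeriodSeconds: 45   # raised from the 30s default
  containers:
    - name: app
      image: quay.io/example/app:1.0  # illustrative image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]   # give endpoint removal time to propagate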

OpenShift-Specific Aspects

While the underlying mechanisms come from Kubernetes, OpenShift adds some platform-level integrations that affect self-healing:

- Routes send traffic only to pods that pass their readiness probes, extending traffic-level self-healing to external clients.
- The oc set probe command adds or edits liveness and readiness probes on existing workloads without hand-editing YAML.
- MachineHealthCheck resources (in clusters using the Machine API) detect unhealthy nodes and replace the underlying machines automatically.
- The web console surfaces missing health checks and restart counts, making misbehaving workloads easy to spot.
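
A sketch based on the documented MachineHealthCheck schema; the name, selector labels, and thresholds are illustrative and depend on how your cluster's machines are labeled:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-health          # illustrative name
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 40%            # stop remediating if too many machines are unhealthy at once
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s            # node must be NotReady this long before remediation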

Designing Applications for Self-Healing

To benefit from self-healing and pod restarts, design your workloads with the following principles:

- Keep pods stateless where possible; store durable state in external databases or persistent volumes.
- Make startup idempotent and reasonably fast, so restarts are cheap.
- Provide meaningful, inexpensive liveness and readiness endpoints with distinct semantics.
- Handle SIGTERM and shut down gracefully within the grace period.
- Run multiple replicas of anything that must survive node failures.

Self-healing and pod restarts are powerful tools, but they are only effective when the application is built to cooperate with the platform’s behavior.
