Goals of backup and restore in OpenShift
In OpenShift, backup and restore is about preserving and recovering:
- Cluster configuration and state (API objects, RBAC, Operators, routes, etc.).
- Persistent application data (PVC-backed volumes).
- Workflows and automation (pipelines, GitOps, Operators).
For production clusters, you typically design backup around:
- RPO (Recovery Point Objective) – how much data you can afford to lose.
- RTO (Recovery Time Objective) – how long it can take to restore service.
- Scope of disaster – single app, namespace, cluster, region.
The goal is not only to copy data, but to be able to rebuild the platform and workloads to a known-good state.
What to back up in OpenShift
Think in terms of distinct layers. Each layer has different tools and strategies.
1. Cluster control plane state (etcd)
etcd is the source of truth for almost all Kubernetes/OpenShift API resources:
- Projects/namespaces, Deployments, DeploymentConfigs, DaemonSets, Jobs, CronJobs.
- Services, Routes, Ingress, NetworkPolicy, SCC, RBAC.
- Operators and custom resources (CRDs + CRs).
- ConfigMaps, Secrets (encrypted at rest), PVC objects (but not the underlying storage data).
Key points:
- etcd backup is the authoritative backup of cluster state.
- It is tightly coupled to:
- OpenShift version.
- etcd version.
- Cluster topology (e.g., single vs multi control-plane).
2. Workload data (persistent volumes)
Persistent Volumes (PVs) store application data outside the API server:
- Databases (PostgreSQL, MySQL, MongoDB, etc.).
- Message queues, caches with persistence.
- User-generated content, logs, artifacts.
Backups here depend on:
- Storage backend (NFS, Ceph/RBD, EBS, GCE PD, CSI, NAS, etc.).
- Stateful app characteristics (crash-consistent vs application-consistent backups).
3. Platform configuration and tooling
Beyond etcd and PVs:
- Cluster configuration:
- MachineConfigs, IngressControllers, API server configuration, OAuth configuration, identity providers (IDPs).
- Network configuration (CIDRs, egress policies).
- Operators and their CRs:
- Logging, monitoring, Service Mesh, etc.
- Automation:
- Pipelines (Tekton), GitOps (Argo CD), CI/CD integration definitions.
Many of these should be declarative and stored in Git (GitOps), so Git itself becomes a key part of your backup strategy.
Backup strategies and patterns
Cluster-wide vs application-focused backups
You usually combine:
- Cluster-wide backups:
- etcd snapshots taken on control-plane nodes.
- Cluster-scoped resources and CRDs.
- Application-level backups:
- App namespaces (manifests).
- PVC data for specific apps.
- Database-native dumps for critical state.
This allows you to:
- Recover the entire cluster after a catastrophic loss.
- Restore or clone specific apps independently (e.g., PROD → STAGE).
Logical vs physical backups
- Logical:
- Export resources via `oc get -o yaml/json`.
- Database dumps (e.g., `pg_dump`, `mysqldump`).
- Helm/Operator configuration (values/CRs).
- Physical:
- etcd snapshot.
- Storage-level snapshots, volume snapshots (via CSI), LVM/ZFS snapshots.
- File-level backups from mounted PVs.
Logical backups are more portable and version-tolerant; physical backups are faster, but more tightly coupled to versions and infrastructure.
RPO/RTO and scheduling
Common practices:
- Frequent etcd snapshots (e.g., every 30–60 minutes) with retention.
- Regular PV snapshots or backups, tuned per application importance.
- Daily or hourly application-level backups (DB dumps, pipelines, etc.).
Use:
- CronJobs inside the cluster.
- External backup orchestrators.
- Storage-provider snapshot scheduling.
Backing up etcd in OpenShift
OpenShift provides cluster-native mechanisms and best practices; exact commands differ slightly by version, but the core ideas are stable.
When to take etcd backups
Take an etcd snapshot:
- On a regular schedule (e.g., hourly).
- Before:
- Major upgrades.
- Large platform changes (network, ingress, identity provider, etc.).
- Major Operator upgrades.
- Before and after disaster recovery drills.
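As an illustration, the documented flow (exact script path and procedure vary by OpenShift version, so verify against the official docs) runs a backup script on one control-plane node; the node name below is a placeholder:

```bash
# Minimal sketch of an etcd backup via a debug shell on a control-plane
# node; "control-plane-0" is a placeholder node name.
oc debug node/control-plane-0 -- chroot /host \
  /usr/local/bin/cluster-backup.sh /home/core/assets/backup

# The script writes two files into the target directory, e.g.:
#   snapshot_<timestamp>.db                   (the etcd snapshot)
#   static_kuberesources_<timestamp>.tar.gz   (static pod resources)
# Copy both off the node, and off the cluster, as soon as they are taken.
```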
Characteristics and constraints
- etcd backups are cluster-version-specific:
- Restore onto a cluster with the same OpenShift version and patch level.
- Restore is typically done to a fresh cluster or set of control-plane nodes, following product documentation.
- Backups include all Kubernetes/OpenShift resources stored in etcd, but not PV data.
Storage and retention
Treat etcd backups like high-value secrets:
- Encrypt at rest.
- Store off-cluster (object storage, backup vault).
- Keep multiple generations and test restoration regularly.
Backing up workloads and namespaces
Whenever possible, back up application configuration at a higher level than etcd.
Resource manifests
Typical practice:
- Export and store:
- Namespace-scoped objects (Deployments, StatefulSets, Services, Routes, ConfigMaps, Secrets, PVCs).
- Cluster-scoped resources needed by the app (ClusterRole, ClusterRoleBinding, CRDs, SCC usage, etc.).
- Keep them in Git or another VCS:
- Acts as the “source of configuration truth.”
- Enables GitOps-based cluster reconstruction.
Using `oc`:
- Use label-based selection to target a given application.
- Avoid exporting noisy, runtime-generated fields (resource versions, statuses) in your long-term backups; focus on spec.
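A minimal sketch of such a label-based export, assuming the application is labeled `app=myapp` (the label, namespace, and resource list are illustrative):

```bash
# Export the namespaced objects for one application, selected by label.
# Strip runtime-generated fields (status, resourceVersion, uid) before
# committing the result to Git.
oc get deployment,statefulset,service,route,configmap,pvc \
  -n my-namespace -l app=myapp -o yaml > myapp-backup.yaml

# Secrets deserve separate, protected handling (see the security
# discussion later in this chapter).
oc get secret -n my-namespace -l app=myapp -o yaml > myapp-secrets.yaml
```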
GitOps-centric backup
With GitOps:
- Application, Operators, and cluster configuration live in repositories.
- Backup focus shifts to:
- etcd snapshots (for current runtime state).
- PV data.
- Git repositories and CI/CD tooling.
In a disaster:
- Bring up a base OpenShift cluster.
- Reinstall Operators.
- Point Argo CD/Flux at your repos to reapply configuration.
- Restore stateful data (PVs, DBs).
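For example, re-pointing Argo CD at an existing configuration repository on a rebuilt cluster might look like the following sketch (the repository URL, path, and all names are placeholders; `openshift-gitops` is the default namespace of the OpenShift GitOps Operator):

```bash
# Illustrative Argo CD Application that reapplies cluster configuration
# from Git; repoURL, path, and names are placeholders.
oc apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-config
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/cluster-config.git
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert out-of-band changes
EOF
```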
Persistent storage backup and restore
Storage backups are highly dependent on backends; OpenShift’s storage model (PVs, PVCs, StorageClasses) is the abstraction, but backup is implemented below or beside it.
Storage-level snapshots and backups
For CSI-based or cloud-native storage:
- Use VolumeSnapshots (when supported):
- Create VolumeSnapshotClass linked to your CSI driver.
- Take snapshots of PVCs on a schedule.
- Restore new PVCs from snapshots when needed (see the sketch after this list).
- For cloud block storage (e.g., EBS, PD):
- Provider-native snapshots can be integrated with CSI.
- Often combined with a backup product that moves snapshots to cheaper long-term storage.
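A minimal sketch of the CSI snapshot flow described above; the driver name, class names, and PVC name are placeholders for your environment:

```bash
# Illustrative VolumeSnapshotClass plus a VolumeSnapshot of one PVC.
oc apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass
driver: example.csi.vendor.com   # must match your installed CSI driver
deletionPolicy: Retain           # keep backend snapshots if the object is deleted
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: mydata-snap-20240101
  namespace: my-namespace
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: mydata   # the PVC to snapshot
EOF
```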
Key considerations:
- Crash-consistency vs application-consistency.
- Snapshot retention and cleanup.
- Whether snapshots are cross-zone/cross-region.
Application-consistent backups
For databases and other transactional systems:
- Coordinate backups with the application:
- Use pre/post hooks in Jobs or backup tools to quiesce I/O or run DB dumps (see the CronJob sketch after this list).
- For clustered databases, understand cluster-level backup mechanisms.
- Store dumps in object storage or on dedicated backup volumes, then back those up.
Backups must be:
- Version-aware (DB schema changes).
- Encrypted where necessary.
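As an illustration, a scheduled logical dump for PostgreSQL could be declared as a CronJob like the one below. The image, names, schedule, credentials Secret, and backup PVC are all placeholders; a production setup would typically also ship the dump to object storage:

```bash
# Illustrative nightly pg_dump into a dedicated backup PVC.
oc apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-dump
  namespace: my-namespace
spec:
  schedule: "0 2 * * *"               # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: registry.redhat.io/rhel9/postgresql-15:latest  # placeholder
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h postgres -U app appdb > /backups/appdb-$(date +%F).sql
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # placeholder Secret
                  key: password
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: postgres-backups      # placeholder backup PVC
EOF
```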
Stateless vs stateful apps
- Stateless workloads:
- Usually don’t require PV backups.
- Can be recreated from manifests or Git.
- Stateful workloads:
- Must back up/restore both:
- Workload definitions (StatefulSets, PVCs, ConfigMaps, Secrets).
- Underlying data (PVs, DB state).
Tools and approaches for backup and restore
This chapter does not prescribe a specific vendor, but there are common types of tools used with OpenShift.
Cluster-native and CLI-based workflows
Typical building blocks:
- `oc`:
- Resource export/import.
- Scripting backups with `oc get` and `oc apply`.
- `etcdctl`:
- Used by cluster components; OpenShift documentation gives supported procedures to back up etcd from control-plane nodes.
- `kubectl`-compatible tools:
- Some backup tools use the Kubernetes API the same way OpenShift does.
Operator-based backup solutions
Backup solutions often come as Operators:
- Deployed in a dedicated namespace.
- Define custom resources such as `Backup`, `Restore`, `Schedule`, and `BackupStorageLocation` (see the example after this list).
- Integrate with storage backends:
- Object storage for backup archives.
- CSI for snapshots.
Advantages:
- Declarative definition of backup policies.
- Multi-namespace or cluster-wide targeting.
- Application-aware hooks.
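For instance, with a Velero-compatible Operator such as OADP, a recurring namespace backup can be declared like this (the namespace, schedule, and retention are illustrative; `openshift-adp` is OADP's default namespace):

```bash
# Illustrative Velero-style Schedule resource for a daily namespace backup.
oc apply -f - <<'EOF'
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-myapp
  namespace: openshift-adp
spec:
  schedule: "0 3 * * *"        # daily at 03:00
  template:
    includedNamespaces:
    - my-namespace
    snapshotVolumes: true      # use CSI/provider snapshots for PVCs
    ttl: 720h0m0s              # retain each backup for 30 days
EOF
```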
Storage-integrated solutions
Many enterprise storage platforms:
- Provide proprietary snapshot & cloning mechanisms.
- Integrate with CSI to expose snapshots.
- Include “application packs” for databases and popular middleware.
In OpenShift, you typically:
- Deploy a storage Operator.
- Configure StorageClasses and SnapshotClasses.
- Define how application PVCs map to protection policies.
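The StorageClass side of that configuration is a small declarative object; in this sketch the provisioner and parameters are placeholders for whatever your storage Operator actually exposes:

```bash
# Illustrative StorageClass backed by an Operator-managed CSI driver.
oc apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-replicated
provisioner: example.csi.vendor.com   # placeholder CSI driver
parameters:
  replication: "3"                    # vendor-specific parameter (example)
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```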
Restore scenarios and workflows
Recovery is where design choices around backup become visible. It’s important to distinguish what is being restored and where.
1. Restoring an entire cluster from etcd
Used after catastrophic failure of control plane or severe misconfiguration.
High-level flow:
- Prepare a clean set of control-plane nodes, matching:
- OpenShift version.
- Infrastructure layout (IPs, hostnames, etc.).
- Follow the official recovery procedure to:
- Bootstrap a new control plane.
- Restore etcd from a previously taken snapshot.
- Verify:
- All API resources are present.
- Nodes rejoin and become Ready.
- Operators reconcile and reach healthy status.
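The verification step largely reduces to a few read-only checks, for example:

```bash
# Post-restore sanity checks: nodes, cluster Operators, version, and etcd.
oc get nodes                    # all nodes should reach Ready
oc get clusteroperators         # all Operators Available, none Degraded
oc get clusterversion           # desired version matches expectations
oc get pods -n openshift-etcd   # etcd pods healthy on control-plane nodes
```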
Limitations:
- Restore is usually supported onto similar hardware/topology; changing too much can break assumptions.
- Does not fix issues in underlying storage or external dependencies.
2. Restoring a namespace or application in-place
For localized incidents (accidental deletion, misconfiguration of a single app):
- Reapply manifests from Git or backups:
- Namespaces, roles, bindings, Deployments/StatefulSets, Services, Routes, ConfigMaps, Secrets, PVC definitions.
- Restore PV data:
- From snapshot → new PVC (see the sketch after this procedure).
- From backup archive → populate volume.
- From DB dump.
- Connect workload to restored PVC:
- Update the PVC name in the workload, or restore under the original PVC name.
- Restart Pods/StatefulSets so they attach to restored data.
- Validate:
- Application functionality.
- Data correctness and consistency.
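A minimal sketch of restoring a PVC from a VolumeSnapshot, reusing the snapshot from the earlier example (names, storage class, and size are placeholders; the new PVC can take the original name once the damaged claim is removed):

```bash
# Illustrative restore of a PVC from an existing VolumeSnapshot.
oc apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mydata                     # reuse the original name if it was deleted
  namespace: my-namespace
spec:
  storageClassName: fast-replicated
  dataSource:
    name: mydata-snap-20240101     # the snapshot to restore from
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                # must be >= the snapshot's source size
EOF

# Then restart the workload so pods attach to the restored volume:
oc rollout restart statefulset/mydb -n my-namespace
```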
3. Cross-cluster migration or DR failover
For DR or migration (e.g., region A → region B):
- Typically avoid raw etcd restore; instead:
- Use GitOps to reapply configuration on the new cluster.
- Restore stateful data from storage/system-level backups that are replicated cross-region.
- Steps:
- Provision a target OpenShift cluster.
- Reinstall Operators and platform services.
- Recreate applications via manifests or GitOps.
- Restore PVC data on the new storage backend.
- Redirect traffic (DNS, global load balancer).
Key concerns:
- Version compatibility between source and target clusters.
- Data replication lag (RPO) and application cutover procedure.
Operational considerations and best practices
Test restores regularly
Backups are only as good as your ability to restore:
- Run planned DR drills:
- Test etcd restore in an isolated environment.
- Practice namespace or application recovery.
- Document and refine playbooks based on lessons learned.
Align with upgrades and maintenance
Relate backup/restore to other operational procedures:
- Take known-good etcd snapshots:
- Before cluster upgrades.
- Before major Operator upgrades or infra changes.
- Ensure you can:
- Roll back from upgrade-related failures with a tested restore path.
- Verify backups after upgrades:
- Confirm that your backup tooling remains compatible with the new OpenShift version.
Security and compliance
- Treat backups like production data:
- Use encryption at rest and in transit.
- Protect access with RBAC and service identities.
- Handle Secrets carefully:
- etcd snapshots contain encrypted secrets; ensure decryption keys are backed up securely (e.g., KMS keys).
- If exporting secrets to YAML, protect those backups appropriately (see the sketch below).
- Comply with retention and privacy policies:
- Data retention windows (e.g., 30, 90, 365 days).
- Right-to-erasure and data residency requirements.
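One common low-tech safeguard when exporting Secrets is to encrypt the export before it is written to disk; `gpg` here is just one illustrative option, and the namespace is a placeholder:

```bash
# Export Secrets and encrypt them immediately; never store them in plain text.
oc get secret -n my-namespace -o yaml \
  | gpg --symmetric --cipher-algo AES256 -o my-namespace-secrets.yaml.gpg
```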
Separation of concerns
- Platform team:
- Own cluster-level backups (etcd, cluster config, core Operators).
- Provide and manage storage & snapshot capabilities.
- Application teams:
- Own application-level backup logic (DB dumps, app-specific procedures).
- Define RPO/RTO for their services.
Clear responsibilities and interfaces simplify incident response.
Documentation and automation
- Keep up-to-date, versioned runbooks:
- Where backups are stored.
- How to run restores for each class of incident.
- Automate as much as possible:
- Scheduled Jobs or Operators for backups.
- CI checks to validate backup configurations (e.g., that critical namespaces are covered).
- Monitor backup health:
- Alerts for failed backup jobs.
- Capacity alerts for backup storage.
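As a sketch, failed backup Jobs can be surfaced through the cluster monitoring stack with a PrometheusRule like the following; the namespace filter, threshold, and labels are placeholders, and the rule assumes kube-state-metrics Job metrics are available:

```bash
# Illustrative alert on failed backup Jobs via kube-state-metrics.
oc apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-health
  namespace: my-namespace
spec:
  groups:
  - name: backup.rules
    rules:
    - alert: BackupJobFailed
      expr: kube_job_status_failed{namespace="my-namespace"} > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "A backup Job in my-namespace has failed"
EOF
```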
Designing a backup and restore plan for OpenShift
When creating a plan, you typically define:
- Scope:
- Which clusters (prod, stage, dev) and which namespaces.
- Protection levels:
- Critical, important, non-critical workloads and their RPO/RTO.
- Mechanisms:
- etcd snapshots.
- Storage-level snapshots/backups.
- Application-level dumps.
- Git/GitOps for configuration.
- Runbooks:
- Entire cluster failure.
- Single namespace/data corruption.
- Cross-cluster migration or DR.
- Validation:
- Regular restore tests.
- Audits of coverage (ensure new apps are included).
- Integration with upgrades and operations:
- Pre-upgrade snapshot policies.
- Post-restore verification procedures.
A robust backup and restore strategy lets you perform maintenance and respond to failures confidently, and it is a central part of operating OpenShift as a reliable platform.