
Clustering and High Availability

Goals of Clustering and High Availability

Clustering and high availability (HA) aim to keep services running despite failures. This chapter focuses on the concepts, terminology, architectures, and design tradeoffs behind HA clustering on Linux.

Implementation details for specific tools (Pacemaker, Corosync, distributed filesystems, HAProxy, etc.) are covered in their own chapters.

Key Concepts and Terminology

Availability vs Reliability vs Scalability

Availability measures the fraction of time a service is usable; reliability measures how long it runs between failures; scalability measures how well it handles growing load. HA clustering focuses on availability, sometimes at the cost of complexity.

RTO and RPO

RTO (Recovery Time Objective) is the maximum acceptable downtime after a failure; RPO (Recovery Point Objective) is the maximum acceptable data loss, measured as a window of time. Clustering strategies are selected to meet explicit RTO/RPO targets.
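As a back-of-the-envelope illustration, the sketch below (Python, with entirely hypothetical numbers) checks a candidate design against explicit RTO/RPO targets:

```python
# Rough check of a candidate HA design against RTO/RPO targets.
# Every number here is hypothetical; use your own requirements and measurements.

RTO_TARGET_S = 5 * 60        # tolerate at most 5 minutes of downtime
RPO_TARGET_S = 60            # tolerate at most 1 minute of lost data

# Estimated behaviour of the design under evaluation:
failure_detection_s = 30     # monitor interval plus confirmation
fencing_s = 45               # time to power-fence the failed node
service_start_s = 90         # time to start the service on the standby
replication_interval_s = 10  # asynchronous replication ships changes every 10 s

estimated_rto = failure_detection_s + fencing_s + service_start_s
estimated_rpo = replication_interval_s  # worst case: lose one replication interval

print(f"estimated RTO: {estimated_rto} s (target {RTO_TARGET_S} s):",
      "OK" if estimated_rto <= RTO_TARGET_S else "MISS")
print(f"estimated RPO: {estimated_rpo} s (target {RPO_TARGET_S} s):",
      "OK" if estimated_rpo <= RPO_TARGET_S else "MISS")
```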

Redundancy Models

Common models include active–passive (a standby node takes over on failure), active–active (all nodes serve traffic simultaneously), and N+M designs with spare capacity; the architectures later in this chapter use the first two.

Failover and Failback

Failover moves a service from a failed node to a healthy one; failback returns it to its original node once that node recovers. Policies govern when and whether failback happens (immediate, manual, delayed).

Single Points of Failure (SPOFs)

An HA design tries to avoid any single component whose failure breaks the service: a lone server, switch, power feed, storage array, or upstream dependency.

Eliminating SPOFs usually increases cost and complexity; you must prioritize based on business needs.
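To see why redundancy helps but has diminishing returns, here is a minimal sketch that assumes independent failures, which real deployments only approximate (shared power, shared bugs, and shared operators correlate outages):

```python
# Availability of N redundant, independently failing copies of a component.
# Treat the result as an optimistic upper bound, not a prediction.

def parallel_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is up."""
    return 1.0 - (1.0 - single) ** replicas

single = 0.99  # one server that is up 99% of the time
for n in range(1, 4):
    print(f"{n} replica(s): {parallel_availability(single, n):.6f}")
# 1 replica(s): 0.990000
# 2 replica(s): 0.999900
# 3 replica(s): 0.999999
```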

Types of Clusters

High Availability (Failover) Clusters

One node runs the service while one or more standby nodes wait to take over; when the active node fails, the cluster restarts the service elsewhere. Typical use: databases, NFS servers, and other critical stateful services.

Load Balancing Clusters

Load balancing is covered in more detail in the Load Balancing chapter; here it is enough to recognize that it is one HA technique, often combined with backend failover.

Storage Clusters

Storage clusters (replicated block devices or distributed filesystems spanning several nodes) are often a foundation for HA application clusters.

Geo-Clusters (Site-to-Site Failover)

Geo-clusters fail services over between sites; their RTO/RPO are typically larger, and they are more complex to manage, than local failover.

Core Building Blocks in Linux HA

Cluster Membership and Quorum

The cluster must agree on which nodes are alive and which partition is allowed to run resources. Mechanisms include heartbeat/membership protocols (such as Corosync), vote-based quorum requiring a majority, and tiebreakers such as a quorum device or witness node.
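A minimal sketch of the majority rule behind vote-based quorum (vote counts here are illustrative):

```python
# A partition may run resources only if it holds a strict majority of all votes.

def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Strict majority: more than half of all configured votes."""
    return partition_votes > total_votes // 2

print(has_quorum(2, 3))  # True: 2 of 3 nodes may keep running
print(has_quorum(1, 3))  # False: the isolated node must stop its resources

# With two nodes, neither half of a split holds a majority, which is why
# 2-node clusters need a tiebreaker (e.g. a quorum device) or special handling.
print(has_quorum(1, 2))  # False for both halves of a 2-node split
```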

Fencing (STONITH)

Fencing (also called STONITH, "Shoot The Other Node In The Head") forcibly isolates a node, typically by powering it off or cutting its access to storage and the network. When a node appears dead but might still be running, you must fence it before starting its services elsewhere.

Good fencing is non-optional in serious HA designs.
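The rule the cluster enforces is simple: fence first, fail over second. The sketch below illustrates the ordering with hypothetical `fence_node` and `start_service` stand-ins rather than any real fencing API:

```python
# Conceptual "fence before failover" logic. fence_node() and start_service()
# are hypothetical placeholders for a real fence device (e.g. IPMI power
# control) and a real cluster manager.

def fence_node(node: str) -> bool:
    """Pretend to power-fence a node; return True only on confirmed success."""
    print(f"fencing {node} (power off / isolate)")
    return True  # in reality: the confirmed result reported by the fence device

def start_service(node: str) -> None:
    print(f"starting service on {node}")

def failover(failed_node: str, standby_node: str) -> None:
    # Never start the service elsewhere unless the old node is known to be dead.
    if not fence_node(failed_node):
        raise RuntimeError(f"refusing failover: could not fence {failed_node}")
    start_service(standby_node)

failover("node1", "node2")
```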

Virtual IPs and Service Identity

Services often expose a single IP or hostname regardless of which node currently runs them: a virtual IP (VIP) or floating DNS name that moves with the service.

The cluster manager controls where the VIP lives and associates it with resource health.
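For illustration only, the sketch below shows roughly what "moving a VIP" means at the OS level; in practice a resource agent (for example IPaddr2 in Pacemaker setups) performs these steps, and the address and interface name here are placeholders:

```python
# Illustrative only: what claiming/releasing a VIP looks like underneath.
# Requires root and the iproute2/iputils tools; a resource agent normally does this.
import subprocess

VIP = "192.0.2.100/24"   # placeholder address
IFACE = "eth0"           # placeholder interface

def take_vip() -> None:
    # Bring the virtual IP up on this node...
    subprocess.run(["ip", "addr", "add", VIP, "dev", IFACE], check=True)
    # ...and announce the new IP-to-MAC mapping with gratuitous ARP.
    subprocess.run(["arping", "-U", "-c", "3", "-I", IFACE, VIP.split("/")[0]],
                   check=False)

def release_vip() -> None:
    subprocess.run(["ip", "addr", "del", VIP, "dev", IFACE], check=False)
```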

Resource Agents and Service Abstractions

In an HA cluster, services are abstracted as resources with standardized actions such as start, stop, and monitor (plus promote/demote for multi-state resources).

These are implemented by resource agents (scripts/programs that know how to manage the real service). The cluster engine uses these to automate failover decisions.
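A simplified sketch of that contract (not any real agent API) might look like this:

```python
# Minimal sketch of the actions a resource agent exposes to the cluster engine.

class ResourceAgent:
    def start(self) -> bool:
        """Start the real service; return True on success."""
        raise NotImplementedError

    def stop(self) -> bool:
        """Stop the service cleanly; must be safe to call repeatedly."""
        raise NotImplementedError

    def monitor(self) -> bool:
        """Return True if the service is currently healthy."""
        raise NotImplementedError

class DummyWebServer(ResourceAgent):
    """Toy agent that only tracks state, standing in for a real service."""
    def __init__(self) -> None:
        self.running = False
    def start(self) -> bool:
        self.running = True
        return True
    def stop(self) -> bool:
        self.running = False
        return True
    def monitor(self) -> bool:
        return self.running

# The cluster engine only ever talks to the agent, never to the service
# directly; that indirection is what makes failover decisions generic.
agent = DummyWebServer()
agent.start()
print(agent.monitor())  # True
```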

Health Checks and Monitoring

Health is checked at multiple levels: node level (heartbeats and cluster membership), resource level (periodic monitor operations), and application level (end-to-end checks such as HTTP probes or test queries).

Failover decisions use these checks plus policies (how many failures, over what time window, etc.).
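One common shape for such a policy, sketched here with illustrative thresholds, is to trigger failover only after N failed checks inside a sliding time window, so a single blip does not cause flapping:

```python
# Sliding-window failure policy: fail over only after `max_failures` failed
# checks within `window_s` seconds. Thresholds are illustrative.
import time
from collections import deque

class FailurePolicy:
    def __init__(self, max_failures: int = 3, window_s: float = 60.0) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.failures = deque()  # timestamps of recent failed checks

    def record_check(self, healthy: bool, now: float | None = None) -> bool:
        """Record one check result; return True if failover should be triggered."""
        now = time.monotonic() if now is None else now
        if healthy:
            self.failures.clear()  # design choice: one success resets the count
            return False
        self.failures.append(now)
        # Forget failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        return len(self.failures) >= self.max_failures

policy = FailurePolicy(max_failures=3, window_s=60.0)
print(policy.record_check(False, now=0))   # False: 1 failure
print(policy.record_check(False, now=10))  # False: 2 failures
print(policy.record_check(False, now=20))  # True: 3 failures within 60 s
```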

Typical HA Architectures

2-Node Active–Passive Cluster

Basic pattern: one node actively runs the service, its VIP, and its data (shared or replicated storage); the second node stands by and takes over once the failure is detected and the active node is fenced.

Considerations: two nodes cannot form a majority on their own, so a quorum device or other tiebreaker is needed; fencing is mandatory to avoid split-brain; and data must be kept in sync via shared storage or replication.

N-Node Active–Active Web Tier

Key design points: keep the web tier stateless (externalize sessions and uploads), place redundant load balancers in front, and rely on health checks to remove failed backends automatically.

Database HA Patterns (Logical Overview)

Common archetypes: a primary with one or more replicas plus automated promotion on failure, shared-storage failover of a single instance, and multi-primary or consensus-based clusters.

Clusters must align with the DB’s own replication/failover mechanisms and consistency guarantees.

Failure Modes and Design Tradeoffs

Types of Failures

Failures range from clean node crashes to network partitions, storage outages, hung or slowly degrading ("gray") applications, and operator error. Each must be considered in testing and runbooks.

Split-Brain Scenarios

Split-brain occurs when cluster nodes lose communication with each other but each partition keeps running, believing the other side is down.

Consequences range from duplicate VIPs and confused clients to the worst case: divergent writes to the same data, leading to corruption that is hard to repair.

Prevention/mitigation relies on quorum (only a majority may run resources), reliable fencing, and conservative failover policies.

Consistency vs Availability (CAP Thinking)

While a full CAP theorem discussion is beyond this chapter, you should recognize that during a network partition a clustered system must choose between remaining consistent (refusing writes) and remaining available (accepting writes that may later conflict).

Your cluster configuration (failover timing, fencing, write policies) must reflect this choice.

Designing an HA/Clustered Service

Requirements Gathering

Before picking tools:

  1. Define business requirements:
    • Target availability (e.g., 99.9% vs 99.99%; see the downtime-budget sketch after this list)
    • RTO and RPO
    • Regulatory/compliance needs (data locality, encryption, audit).
  2. Understand workload:
    • Read-heavy vs write-heavy
    • Stateless vs stateful
    • Latency sensitivity
  3. Budget and complexity tolerance:
    • Operational expertise available
    • On-call capabilities
    • Hardware/licensing budget
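To make the availability targets concrete, here is a quick downtime-budget calculation (assuming a 365-day year):

```python
# Yearly downtime budget implied by an availability target.

def yearly_downtime_minutes(availability: float) -> float:
    return (1.0 - availability) * 365 * 24 * 60

for target in (0.999, 0.9999):
    print(f"{target:.2%} availability allows about "
          f"{yearly_downtime_minutes(target):.0f} minutes of downtime per year")
# 99.90% -> about 526 minutes (~8.8 hours); 99.99% -> about 53 minutes
```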

Choosing the Right Pattern

Match requirements to patterns: stateless, read-heavy workloads suit active–active load balancing; stateful services usually need active–passive failover with fencing; strict RPOs push toward synchronous replication; and site-level risks call for geo-clusters.

No single pattern fits all; many environments mix these patterns.

Dependencies and Cascading Failures

In clusters, a “service” often depends on multiple components: application nodes, databases, shared storage, load balancers, DNS, and the network between them.

Mapping dependencies matters because the service is only as available as the chain of components it requires, and a failure in one layer can cascade into the others.
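A minimal sketch of that effect, assuming illustrative and independent component availabilities:

```python
# Availability of a service that needs several components in series is the
# product of their availabilities. Numbers are illustrative, and real
# dependencies are rarely fully independent, so treat this as a rough model.

def serial_availability(*components: float) -> float:
    result = 1.0
    for a in components:
        result *= a
    return result

# app tier, database, shared storage, load balancer, DNS
combined = serial_availability(0.9999, 0.999, 0.9995, 0.9999, 0.9999)
print(f"combined: {combined:.4%}")  # about 99.82%, lower than any single part
```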

Operational Aspects of Clustering and HA

Configuration Management and Consistency

Clusters are sensitive to drift: nodes with different package versions, kernel settings, or cluster configuration can behave differently during failover, so keep all nodes under the same configuration management and apply changes in lockstep.

Testing Failover

You must test planned failover (maintenance switchover), unplanned failover (node crash, network partition, storage loss), and failback.

Measure how long failover actually takes (observed RTO), how much data is lost (observed RPO), and whether clients reconnect cleanly.
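One simple client-side approach, sketched below with a placeholder VIP and port, is to poll a TCP endpoint while you trigger the failover and report how long it stays unreachable:

```python
# Poll a TCP port once per second during a failover drill and time the outage.
# Host and port are placeholders for your service's VIP; stop with Ctrl-C.
import socket
import time

HOST, PORT = "192.0.2.100", 5432  # e.g. a database reached via its VIP

def is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

outage_start = None
while True:
    now = time.monotonic()
    if not is_up(HOST, PORT):
        if outage_start is None:
            outage_start = now
            print("service went down")
    elif outage_start is not None:
        print(f"service back after {now - outage_start:.1f} s of downtime")
        break
    time.sleep(1)
```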

Observability and Alerting

Effective HA operations require visibility into node and resource state, quorum and fencing events, replication lag, and failover history, with alerts on degraded redundancy rather than only on full outages.

Tie cluster logs and metrics into centralized monitoring and alerting systems.

Runbooks and Procedures

Document step-by-step how to perform a manual failover and failback, how to replace a failed node, and how to recover from fencing or split-brain events.

Runbooks reduce mistakes during emergencies and standardize responses.

Common Anti-Patterns and Pitfalls

Frequent mistakes include disabling fencing because it seems inconvenient, never rehearsing failover, running two-node clusters without a tiebreaker, and treating HA clustering as a substitute for backups.

High-Level Lifecycle of a Clustered Service

A typical workflow from concept to production:

  1. Design:
    • Requirements, RTO/RPO, failure modes, patterns.
  2. Lab/Prototype:
    • Single or small test environment mirroring prod architecture.
  3. Automate:
    • Configuration management, deployment scripts.
  4. Test:
    • Functional tests
    • Failure scenario simulations
  5. Production Rollout:
    • Gradual ramp-up, pilot users, canary deployments where possible.
  6. Operate and Improve:
    • Track incidents
    • Refine configs, thresholds, runbooks
    • Periodic failover drills (“game days”)

Understanding this lifecycle helps you integrate Pacemaker/Corosync clusters, distributed filesystems, and load balancers into a coherent HA strategy rather than treating them as isolated technologies.
