Concepts and Design Goals
A distributed filesystem (DFS) lets multiple machines access the same filesystem namespace over a network as if it were local storage. In high-availability and clustering contexts, the key goal is shared data with consistency and fault tolerance.
Common design goals:
- Single namespace
- All nodes see the same directory tree (e.g. `/data`, `/home`) regardless of which server they talk to.
- Scalability
- Scale capacity by adding more storage nodes.
- Scale throughput by distributing reads/writes.
- Ideally, nodes and clients can be added or removed without disruptive reconfiguration.
- Redundancy and availability
- Data is typically replicated or erasure-coded.
- Single-node failure should not make data unavailable.
- Consistency model
- Strong consistency: all clients see the latest writes (often via locking or serialization of updates).
- Eventual or relaxed consistency: Better performance but clients may temporarily see stale data.
- POSIX semantics (or close)
- Many DFSs aim to behave like a local Unix filesystem (permissions, ownership, locks, atomic operations).
- Some trade POSIX strictness for performance or simplicity.
- Failure handling
- Transparent client failover to other servers.
- Self-healing: automatic resync/repair of replicas.
- Data integrity via checksums, versioning, or journaling.
In clustering/high-availability, choosing a DFS is mostly about the trade-offs among consistency, performance, complexity, and integration with your cluster stack.
Architecture Building Blocks
Distributed filesystems are usually made of some or all of these components:
- Metadata servers (MDS / management nodes)
- Store directory structures, filenames, permissions, inodes.
- Coordinate locks and sometimes placement of file data.
- Often replicated for availability, but can be a bottleneck.
- Object / storage nodes
- Store actual file contents as blocks/objects/chunks.
- DFS logic maps file offsets to one or more storage nodes.
- Clients
- Kernel module or user-space daemon that:
- Mounts the DFS (e.g. via `mount -t ceph`).
- Translates POSIX syscalls to DFS protocol.
- Handles caching, read/write paths, and failover.
- Data distribution mechanisms
- Hash-based placement (e.g. consistent hashing, CRUSH in Ceph).
- Central allocation (MDS decides placement).
- Static layouts (admin-configured volumes/targets).
- Redundancy mechanisms
- Replication: multiple copies of file blocks on different nodes.
- Erasure coding: break data into $k$ data blocks plus $m$ parity blocks; any $k$ pieces can reconstruct the original.
- RAID-like schemes across nodes.
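The hash-based placement mentioned above can be sketched as a small consistent-hash ring. This is illustrative only: real systems such as Ceph's CRUSH add weights, failure domains, and multiple replicas, and all names here (`osd1`, `ConsistentHashRing`, etc.) are invented for the example.

```python
# Minimal consistent-hashing sketch: each node gets many points ("virtual
# nodes") on a hash ring; an object maps to the first point clockwise
# from its own hash, so removing a node only remaps that node's objects.
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # vnodes points per physical node smooth the key distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring point at or after the key's hash (wrapping around).
        idx = bisect_right(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["osd1", "osd2", "osd3"])
placement = {obj: ring.node_for(obj) for obj in ["a.img", "b.img", "c.img"]}
```

The key property is that placement is computed, not looked up: any client with the node list can locate data without asking a central allocator.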
Typical access path:
- Client contacts metadata service for file lookup/open.
- Metadata server returns layout: which storage nodes, offsets, striping info.
- Client talks directly to storage nodes for I/O.
- Locks/capabilities updated via metadata service and/or distributed lock managers.
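The access path above can be modeled in miniature: a metadata map returns a layout, and the client then performs chunk I/O directly against storage nodes. Every name here (`STORAGE_NODES`, `dfs_write`, etc.) is hypothetical; real protocols add locking, leases, and authentication.

```python
# Toy model of the DFS access path: metadata lookup, then direct I/O.

STORAGE_NODES = {"node1": {}, "node2": {}}   # node -> {chunk_id: bytes}
METADATA = {}                                # path -> [(node, chunk_id), ...]

def dfs_write(path, data, chunk_size=4):
    # Split into chunks, place them round-robin, then record the layout
    # with the "metadata service".
    layout = []
    for i in range(0, len(data), chunk_size):
        n = i // chunk_size
        node = f"node{n % len(STORAGE_NODES) + 1}"
        chunk_id = f"{path}#{n}"
        STORAGE_NODES[node][chunk_id] = data[i:i + chunk_size]  # direct I/O
        layout.append((node, chunk_id))
    METADATA[path] = layout

def dfs_read(path):
    # Steps 1-2: look up the layout; step 3: fetch chunks from storage nodes.
    return b"".join(STORAGE_NODES[n][c] for n, c in METADATA[path])

dfs_write("/data/a.bin", b"hello world!")
assert dfs_read("/data/a.bin") == b"hello world!"
```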
Use Cases in HA and Clustering
Distributed filesystems are widely used in:
- Shared storage for application clusters
- Active/active web servers sharing an `uploads/` directory.
- Shared content repositories across nodes.
- Virtualization platforms
- Storing VM images where multiple hypervisors can access the same disk images.
- Live migration without local disk dependencies.
- Container orchestration
- Persistent volumes shared between Kubernetes worker nodes.
- Home directories and user data in large environments
- Users can log in anywhere and see the same `$HOME`.
- High performance computing (HPC)
- Parallel filesystems enabling multiple compute nodes to read/write large datasets concurrently.
A critical decision is whether you need:
- Cluster filesystem over shared storage (e.g. GFS2/OCFS2, not covered in depth here), or
- Truly distributed storage (no shared SAN; storage is spread across cluster nodes).
Key Design Dimensions and Trade-offs
Centralized vs distributed metadata
- Centralized metadata
- Simpler to implement and reason about.
- Potential single bottleneck/failure; usually mitigated by HA pairs or replicas.
- Examples: MooseFS (single master server); HDFS's NameNode follows the same pattern.
- Distributed metadata
- Avoids single metadata bottleneck; better scalability.
- Requires complex protocols for consistency and failure handling.
- Example: CephFS with multiple MDS, Lustre.
Replication vs erasure coding
- Replication
- Multiple full copies (e.g. 3 replicas).
- Pros: Simple, fast writes, good performance, easier recovery.
- Cons: Expensive in storage overhead.
- Erasure coding
- Store $k$ data chunks plus $m$ parity chunks; any $k$ chunks can reconstruct the data, so up to $m$ failures are tolerated.
- Storage overhead $\approx \dfrac{k+m}{k}$ (usually lower than replication).
- Pros: Efficient space usage at large scale.
- Cons: More CPU and network heavy; higher write latency; often used for colder data.
In HA clusters running transactional workloads, replication is more common for “hot” data; erasure coding is more common for archival and object storage.
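The overhead comparison can be made concrete with a few lines of arithmetic; the $k=4$, $m=2$ profile used below is a common illustrative choice, not a recommendation.

```python
# Storage overhead: raw bytes stored per logical byte.

def replication_overhead(replicas):
    # n full copies -> overhead of n; tolerates n - 1 failures.
    return float(replicas)

def erasure_overhead(k, m):
    # k data + m parity chunks -> overhead (k + m) / k; tolerates m failures.
    return (k + m) / k

# 3-way replication and EC(k=4, m=2) both tolerate 2 failures,
# but with very different space costs:
assert replication_overhead(3) == 3.0  # 200% extra space
assert erasure_overhead(4, 2) == 1.5   # 50% extra space
```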
Consistency and locking
To support shared access in a cluster, DFSs use:
- Distributed locking
- Locks on files/directories/inodes (byte-range or whole-file).
- Ensures correctness for concurrent writes.
- Can hurt performance, particularly over WAN links.
- Client-side caching with leases/capabilities
- Cache data/metadata locally for performance.
- Lease/capability revocation when another client wants write access.
- Must handle network partitions and stale caches carefully.
Depending on the filesystem and workload, you might tune:
- Lock granularity (coarse vs fine-grained).
- Aggressive vs conservative caching behavior.
- Write-back vs write-through semantics.
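Byte-range locking is exposed to applications through the ordinary POSIX interface. The sketch below runs against a local file; on a DFS mount that supports POSIX locks (CephFS, for example), the same calls are forwarded to the cluster's lock machinery. The path is created just for the example.

```python
# POSIX byte-range locking via fcntl, as an application on a DFS client
# would issue it.
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shared.dat")
with open(path, "wb") as f:
    f.write(b"\0" * 1024)          # pre-size the shared file

with open(path, "r+b") as f:
    # Exclusively lock bytes 0..511; other clients can still lock
    # non-overlapping ranges of the same file.
    fcntl.lockf(f, fcntl.LOCK_EX, 512, 0, os.SEEK_SET)
    f.write(b"update")             # protected region
    fcntl.lockf(f, fcntl.LOCK_UN, 512, 0, os.SEEK_SET)
```

Finer-grained (byte-range) locks allow more concurrency than whole-file locks, at the cost of more lock traffic to the metadata/lock service.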
Data locality and striping
Distributed filesystems often:
- Stripe large files across multiple storage nodes.
- Higher parallel throughput.
- Good for sequential I/O, HPC workloads.
- Locate data near compute where possible.
- Reduce cross-rack or cross-datacenter traffic.
Striping typically has parameters like:
- Stripe unit size (e.g. 64 MiB).
- Stripe count (number of storage targets per file).
Mismatched stripe settings and workload patterns can greatly affect performance.
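The mapping from a file offset to a storage target under simple round-robin striping works out to a few lines of integer arithmetic; the parameter values below just reuse the examples from the text.

```python
# Round-robin striping: which target holds a given file offset, and at
# what offset within that target's portion of the file.
STRIPE_UNIT = 64 * 1024 * 1024   # stripe unit size (64 MiB, as above)
STRIPE_COUNT = 4                 # number of storage targets per file

def locate(offset):
    stripe_index = offset // STRIPE_UNIT
    target = stripe_index % STRIPE_COUNT
    target_offset = (stripe_index // STRIPE_COUNT) * STRIPE_UNIT \
                    + offset % STRIPE_UNIT
    return target, target_offset

# Consecutive stripe units rotate across targets, so large sequential
# reads hit all four targets in parallel:
assert locate(0) == (0, 0)
assert locate(STRIPE_UNIT) == (1, 0)
assert locate(4 * STRIPE_UNIT + 10) == (0, STRIPE_UNIT + 10)
```

This also shows why mismatched settings hurt: a workload of small random writes smaller than the stripe unit gains no parallelism but still pays the distribution overhead.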
Overview of Common Distributed Filesystems
This section focuses on conceptual behavior and typical deployment modes in HA clusters. Specific install commands will generally be covered in distribution- or service-specific chapters.
Ceph (CephFS, RBD, RGW)
Ceph is a unified distributed storage platform providing:
- RADOS: Object store with replication/erasure coding across OSDs (object storage daemons).
- CephFS: Distributed POSIX filesystem on top of RADOS.
- RBD: Block devices for VMs, etc.
- RGW: S3/Swift-compatible object gateway.
Core components:
- Monitors (MONs): Maintain cluster map and quorum.
- Managers (MGRs): Telemetry, services, some control plane tasks.
- OSDs: Daemons managing data on disks; handle replication, recovery.
- MDS: Metadata servers for CephFS specifically.
In HA/cluster context:
- CephFS offers:
- Multiple MDS for scalability and failover.
- Clients connect via kernel driver or FUSE.
- Strong consistency, POSIX-compliant, good for general-purpose shared storage.
Key operational concepts:
- CRUSH map for data placement (no central allocation).
- Pools with configurable replication or erasure coding.
- Failure domains (host, rack, DC) for data safety.
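These concepts come together at pool creation time. A hedged CLI fragment (pool, rule, and profile names are examples; PG counts depend on cluster size):

```shell
# Replicated pool for hot data, spread across hosts as the failure domain.
ceph osd crush rule create-replicated by-host default host
ceph osd pool create vmpool 128 128 replicated by-host
ceph osd pool set vmpool size 3            # 3 replicas

# Erasure-coded pool (k=4, m=2) for colder data.
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create coldpool 64 64 erasure ec42
```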
Typical use:
- Backing OpenStack, Kubernetes, or Proxmox clusters.
- Shared storage for virtual machines and container volumes.
- General-purpose shared filesystem for large environments.
GlusterFS
GlusterFS is a scale-out network filesystem built by aggregating storage from multiple “bricks” (directories on servers) into a trusted storage pool.
Concepts:
- Bricks: Basic storage units; a brick is usually a filesystem mounted on a server (e.g. `/srv/brick1`).
- Volumes: Logical filesystems created over sets of bricks.
- Distributed: Files spread across bricks (capacity).
- Replicated: Each file stored on multiple bricks (availability).
- Distributed-replicated: Combines distribution (capacity) with replication (availability).
- Dispersed (erasure-coded): Space-efficient redundancy.
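As a hedged illustration of the brick/volume model (hostnames, brick paths, and mount points are examples), a 3-way replicated volume is created from one brick per server and mounted via FUSE:

```shell
# Create and start a replicated volume across three servers.
gluster volume create gv0 replica 3 \
    server1:/srv/brick1 server2:/srv/brick1 server3:/srv/brick1
gluster volume start gv0

# Clients mount the volume with the FUSE client.
mount -t glusterfs server1:/gv0 /mnt/gv0
```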
Architecture:
- No central metadata server; metadata lives with data on bricks.
- Clients use a FUSE mount or NFS/SMB gateways.
- Self-healing daemon repairs inconsistencies after failures.
In HA clusters:
- Often used for:
- Shared web content across multiple nodes.
- General NAS-like shared directories.
- Backing storage for virtualization environments (e.g. via libgfapi).
Operational topics:
- Replica sets & quorum: Avoid split-brain by enforcing quorum rules.
- Self-heal and split-brain resolution: Monitor and resolve conflicting copies.
- Geo-replication: Asynchronous replication to remote clusters.
Ceph vs GlusterFS (at a high level)
Both are widely used in HA clusters, but with different philosophies:
- Ceph:
- More complex, feature-rich (block, object, FS).
- Intelligent data placement via CRUSH, fine-grained cluster control.
- Typically better integrated with cloud platforms and large-scale deployments.
- GlusterFS:
- Simpler mental model (volumes/bricks), easier initial setup.
- Strong fit for “NAS replacement” and small-to-medium clusters.
- Can be fronted with NFS/SMB servers easily.
Choice depends on environment size, performance demands, and operational expertise.
Other Important Distributed / Cluster Filesystems
Brief conceptual overview of others you may encounter:
- Lustre
- HPC-focused parallel filesystem.
- Separates metadata targets (MDT) and object storage targets (OST).
- Designed for very high throughput on large clusters rather than small general-purpose HA.
- Ceph vs Lustre
- Ceph is general-purpose storage with broader protocol support.
- Lustre is highly specialized for HPC workloads (massive sequential I/O).
- HDFS (Hadoop Distributed File System)
- Non-POSIX, optimized for write-once, read-many workloads.
- Strongly tied to the Hadoop ecosystem; not a general shared filesystem replacement for OS-level POSIX workloads.
- Cluster filesystems on shared storage (e.g. GFS2, OCFS2)
- Require shared block devices (SAN, iSCSI, DRBD, etc.).
- Use a distributed lock manager (DLM) to coordinate access.
- Best for environments with existing SANs and where block-level sharing is available.
Design and Planning Considerations
When integrating a distributed filesystem into an HA/cluster design, focus on:
Capacity and performance requirements
- Capacity planning
- Raw vs usable space (consider replication/erasure coding).
- Growth patterns; ease of adding capacity without downtime.
- Performance
- Expected IOPS and throughput per node and cluster-wide.
- Workload profile: small random I/O vs large sequential.
- Latency sensitivity (databases vs media hosting).
Plan hardware accordingly:
- Disks: SSD vs HDD; mixed tiers.
- Network: 10GbE or better recommended; consider bonded interfaces.
- CPU: enough for checksums, encryption, erasure coding.
Fault domains and redundancy
Design redundancy so that:
- A single disk, server, or rack failure doesn’t cause data loss or unavailability.
- Replication/erasure coding is configured across failure domains:
- Replicas placed on different hosts or racks.
- For multi-site setups, consider asynchronous replication to a DR site.
Important:
- Replication factors and placement rules should match business SLAs.
- Understand how many simultaneous failures you can tolerate.
Integration with cluster stack
For clustering and HA:
- Mounts should be managed by cluster resources (Pacemaker, Kubernetes controllers, etc.).
- Ensure:
- Order constraints: storage cluster up before application services.
- Location constraints: avoid overloading a single node with both storage and heavy compute unless designed for it.
- STONITH/Fencing: prevent split-brain by fencing failed nodes that still have disk access.
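With Pacemaker, these constraints can be expressed via `pcs`; a hedged sketch, with resource names and the Gluster volume invented for the example:

```shell
# Clone a filesystem mount across all nodes, then make the web service
# depend on it: order (mount first) and colocation (run where mounted).
pcs resource create shared-fs ocf:heartbeat:Filesystem \
    device="server1:/gv0" directory="/mnt/gv0" fstype="glusterfs" clone
pcs resource create webserver systemd:nginx clone
pcs constraint order start shared-fs-clone then webserver-clone
pcs constraint colocation add webserver-clone with shared-fs-clone
```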
Pay special attention to:
- How clients handle failover:
- Do they retry other servers automatically?
- Is there a VIP or load-balanced frontend?
- How read/write operations behave during partial outages.
Security and multi-tenancy
Security aspects:
- Authentication/authorization
- Integration with LDAP/Kerberos if needed.
- Per-client and per-user permissions.
- Encryption
- Data-in-transit (TLS, secure protocols).
- Data-at-rest (LUKS, filesystem-level encryption, or native DFS encryption).
- Multi-tenancy
- Separation between tenants/projects (different pools/volumes/exports).
- Quotas per user/group/project.
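Quota mechanisms differ per filesystem; two hedged examples (directory paths and sizes are invented):

```shell
# CephFS: per-directory quota via an extended attribute.
setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/project-a  # 100 GiB

# GlusterFS: per-directory usage limit on a volume.
gluster volume quota gv0 enable
gluster volume quota gv0 limit-usage /project-a 100GB
```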
Operational Topics
Monitoring and health
Distributed filesystems must be continuously monitored:
- Key metrics:
- Cluster health status (Ceph's `HEALTH_OK`, Gluster's volume status).
- Disk utilization and pool/volume usage.
- Latency (read/write).
- OSD/brick up/down status.
- Recovery/rebalance status and throughput.
- Common tools:
- Built-in CLI tools (`ceph status`, `gluster volume status`).
- Prometheus exporters and Grafana dashboards.
- Syslog and systemd journal integration.
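Most of these CLIs can emit JSON for scripted checks (e.g. `ceph status --format json`). A sketch of a health probe over such output; the JSON below is a trimmed, hypothetical sample of the real structure, and `check` is an invented helper:

```python
# Parse a (trimmed, sample) `ceph status --format json` document and
# report anything that should page an operator.
import json

sample = json.loads("""
{
  "health": {"status": "HEALTH_WARN"},
  "osdmap": {"num_osds": 6, "num_up_osds": 5, "num_in_osds": 6}
}
""")

def check(status):
    problems = []
    if status["health"]["status"] != "HEALTH_OK":
        problems.append(f"cluster health: {status['health']['status']}")
    osd = status["osdmap"]
    down = osd["num_osds"] - osd["num_up_osds"]
    if down:
        problems.append(f"{down} OSD(s) down")
    return problems

print(check(sample))
```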
Maintenance and rolling updates
Planned maintenance should:
- Evacuate or gracefully drain nodes:
- Mark OSDs/bricks as out.
- Wait for rebalance/scrub operations to complete (within defined limits).
- Perform OS and daemon upgrades in rolling fashion.
- Verify after each step:
- Cluster returns to healthy state.
- No unexpected data movement or performance regression.
In many DFSs, you can codify operational procedures:
- Limits on how much recovery/backfill traffic is allowed.
- OSD/brick weight adjustments to smooth transitions.
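For Ceph specifically, a rolling-maintenance pass on one node often looks like the hedged fragment below (OSD id and throttle value are examples):

```shell
ceph osd set noout                       # suppress rebalancing during brief downtime
systemctl stop ceph-osd@3                # stop the OSD being serviced
# ... patch and/or reboot the node ...
systemctl start ceph-osd@3
ceph osd unset noout
ceph config set osd osd_max_backfills 1  # throttle backfill while recovery catches up
```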
Data integrity and self-healing
Key concepts:
- Scrubbing:
- Background verification of data consistency and checksums.
- Detects bit rot and replication mismatches.
- Self-heal:
- Automatic repair when bricks/OSDs rejoin.
- Resync of missing or outdated replicas.
Operators should:
- Tune scheduling for scrubs (off-peak hours).
- Inspect and resolve persistent inconsistencies (e.g. Gluster split-brain).
Backup and disaster recovery
Distributed filesystems are not a substitute for backups:
- Design backup strategies:
- Snapshots (filesystem-level, pool-level).
- Asynchronous replication to remote clusters or object stores.
- Incremental backups via `rsync`, `tar`, or backup software.
- DR planning:
- Recovery point objective (RPO): how much data loss is acceptable.
- Recovery time objective (RTO): how quickly you must restore service.
Backups must be logically separate from the primary DFS to protect against user errors, ransomware, and catastrophic cluster failures.
Practical Selection Guidelines
When choosing a distributed filesystem for an HA cluster:
- Clarify workload:
- VM images, databases, media files, big data, etc.
- Determine scale:
- Number of nodes, expected growth, bandwidth.
- Decide consistency/performance trade-offs:
- Do you need strict POSIX semantics, or is relaxed consistency acceptable?
- Assess operational complexity:
- Team’s comfort with running and debugging storage clusters.
- Availability of distro packages, management tools, and vendor support.
- Check ecosystem integration:
- Hypervisors, container platforms, backup tools, monitoring systems.
For small-to-medium environments with straightforward HA needs, simpler systems (e.g. GlusterFS or cluster filesystems over shared storage) may be sufficient. For larger or more complex environments, full-featured platforms like Ceph offer greater flexibility at the cost of operational complexity.
Summary
Distributed filesystems are a cornerstone of modern clustering and high-availability designs, providing:
- A shared, redundant, and scalable storage layer.
- Strong (or at least well-defined) consistency models for concurrent access.
- Integration points for cluster management, virtualization, and container orchestration.
A deep understanding of their architecture, trade-offs, and operational behaviors is essential for building resilient Linux server infrastructures at scale.