
4.4.3 RAID levels

Introduction

Redundant Array of Independent Disks, or RAID, is a method to combine multiple physical disks into one logical unit. Different RAID levels arrange data and redundancy in distinct patterns. This chapter focuses on what makes each common RAID level unique, how data and parity are laid out, and what this means for performance, capacity, and fault tolerance, without going into the low level configuration commands or LVM integration, which are covered elsewhere.

Core Concepts for Understanding RAID Levels

RAID levels differ mainly in three aspects. The first is how data is split across disks, often called striping. The second is whether and how redundancy is stored, often via mirroring or parity. The third is what happens when a disk fails and how the array degrades or rebuilds.

Think of each disk as a sequence of equal size blocks, sometimes called stripes or chunks. A RAID level is essentially a pattern that says which disk holds which data block and, if applicable, which disk holds the blocks that allow reconstruction after a failure.

Capacity is usually expressed in terms of $N$, the number of disks, and $S$, the capacity of the smallest disk in the array. RAID performance depends on the number of disks involved in read and write operations. Fault tolerance depends on how many disks can fail without losing data.

A RAID array is not a backup. It protects against some hardware failures, not against accidental deletion, corruption, or malware.

RAID 0: Striping without Redundancy

RAID 0 uses pure striping. Data is split into blocks of fixed size and distributed across all disks in round robin fashion. For example, block 1 goes to disk 1, block 2 to disk 2, and so on, wrapping around when the last disk is used.
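The round-robin pattern can be sketched in a few lines of Python. This is an illustrative model only, with hypothetical names, using zero-based block and disk indices rather than the one-based numbering in the text:

```python
# Hypothetical sketch: which disk holds logical block b in a RAID 0
# array of n_disks, using simple round-robin striping.
def raid0_disk_for_block(b: int, n_disks: int) -> int:
    return b % n_disks

# Blocks 0..5 on a 3-disk array land on disks 0, 1, 2, 0, 1, 2.
layout = [raid0_disk_for_block(b, 3) for b in range(6)]
print(layout)  # [0, 1, 2, 0, 1, 2]
```

Real implementations stripe in chunks of many kilobytes rather than single blocks, but the wrap-around mapping is the same idea.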

The capacity of RAID 0 is the sum of all disk capacities. If you have $N$ disks of size $S$, the usable capacity is
$$C_{\text{RAID0}} = N \times S.$$

Because multiple disks can participate in a read or write, RAID 0 generally improves both read and write throughput, especially for large sequential I/O and for workloads that can leverage parallel access.

The critical tradeoff is that RAID 0 has no redundancy at all. If a single disk in a RAID 0 array fails, the data that was striped onto that disk is lost, and the logical array becomes unusable as a whole. The risk of failure grows with the number of disks, since any one of them can fail and take the entire array down.

RAID 0 is therefore suitable only for scenarios where performance matters more than reliability, and where the data can be regenerated or is not critical. It is not appropriate where data integrity is important.

RAID 1: Mirroring

RAID 1 uses mirroring. Each piece of data is stored identically on two or more disks. The minimum number of disks is two. All disks in the mirror hold the same content.

The capacity of RAID 1 is the capacity of a single disk, regardless of how many disks are in the mirror group. For $N$ disks of size $S$, the usable capacity is
$$C_{\text{RAID1}} = S.$$

This means the storage efficiency is $\frac{1}{N}$. With two disks, half of the raw capacity is usable, and the other half is used for redundancy. With more mirrors, efficiency decreases, but read performance can improve further.

On reads, a RAID 1 implementation can read from any disk in the mirror. In practice, it may balance reads across disks, which can increase read throughput and reduce latency, especially under concurrent access. On writes, all copies must be written, so write performance is closer to the speed of a single disk, though there is some parallelism.

Fault tolerance is strong. A RAID 1 array can usually survive the failure of all but one disk. As long as at least one disk with valid data remains, the array can still serve data. However, the exact tolerance can depend on the implementation and how many copies you create.

RAID 1 is useful for small, simple setups that require strong redundancy and straightforward recovery, such as critical system partitions where capacity is less important than reliability.

RAID 5: Striping with Single Parity

RAID 5 introduces block level striping with distributed single parity. Data and parity blocks are spread across all disks. Parity is computed in a way that allows the contents of any one missing disk to be reconstructed from the remaining disks.

A common parity operation used conceptually in RAID 5 is the bitwise XOR. For a given stripe, the parity block $P$ is computed from the data blocks $D_1, D_2, \ldots, D_{N-1}$ as
$$P = D_1 \oplus D_2 \oplus \cdots \oplus D_{N-1}.$$
If one disk fails and its data block is missing, it can be reconstructed by XORing the remaining data blocks with the parity.
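The XOR parity and reconstruction can be demonstrated with small integers standing in for data blocks. This is a conceptual sketch, not how an actual RAID implementation is written:

```python
# Illustrative sketch of RAID 5 single parity using XOR, with Python
# ints standing in for data blocks.
from functools import reduce
from operator import xor

def parity(blocks):
    """XOR all data blocks of a stripe together to form the parity block."""
    return reduce(xor, blocks)

data = [0b1010, 0b0110, 0b1111]   # D1, D2, D3
p = parity(data)                   # P = D1 ^ D2 ^ D3

# Simulate losing D2: XOR the surviving data blocks with the parity
# block to rebuild the missing block.
rebuilt_d2 = data[0] ^ data[2] ^ p
assert rebuilt_d2 == data[1]
```

The reconstruction works because XOR is its own inverse: XORing the parity with every surviving block cancels them out, leaving only the missing block.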

RAID 5 requires at least three disks. One disk's worth of capacity is effectively used for parity across the array. For $N$ disks of size $S$, the usable capacity is
$$C_{\text{RAID5}} = (N - 1) \times S.$$

Storage efficiency is therefore $\frac{N - 1}{N}$. As you add more disks, the fraction of capacity lost to parity decreases, which is one reason RAID 5 has been popular.

RAID 5 improves read performance through striping. Reads can be served from all disks, and normal reads do not need to touch parity blocks. Write performance is more complex. A small write often requires a read-modify-write cycle: read the old data block and the old parity block, compute the new parity, then write the new data and the new parity. That is four physical I/O operations for one logical write, which causes what is often called the RAID 5 write penalty: small random writes are noticeably slower than reads.
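The read-modify-write shortcut relies on another XOR identity: the new parity can be derived from the old parity and the old data, without re-reading the rest of the stripe. A minimal sketch, with hypothetical names:

```python
# Sketch of the RAID 5 small-write parity update. Instead of re-reading
# the whole stripe, the new parity is derived from the old data block
# and the old parity block:  P_new = P_old XOR D_old XOR D_new.
def updated_parity(old_parity, old_data, new_data):
    return old_parity ^ old_data ^ new_data

d1, d2, d3 = 0b1010, 0b0110, 0b1111
p_old = d1 ^ d2 ^ d3              # parity of the original stripe

new_d2 = 0b0001                   # overwrite D2 with new contents
p_new = updated_parity(p_old, d2, new_d2)

# Same result as recomputing the parity from scratch over the new stripe.
assert p_new == d1 ^ new_d2 ^ d3
```

XORing out the old data removes its contribution to the parity, and XORing in the new data adds the new contribution, so the other stripe members never need to be read.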

Fault tolerance in RAID 5 is limited to a single disk failure. If any one disk fails, the array enters a degraded mode. Data can still be accessed, but each read that touches the missing disk must be reconstructed using parity and the remaining disks. This reduces performance and puts heavy load on the surviving disks.

If a second disk fails before the array is rebuilt, all data in the RAID 5 array is lost. With modern large disks, rebuilds can take many hours or more, increasing the window of vulnerability. The risk of encountering an unrecoverable read error on another disk during rebuild also grows with disk size, which is a concern when using RAID 5 with very large disks.

RAID 5 is best suited to read heavy workloads that need good capacity efficiency and can tolerate a single disk failure, but it is considered less suitable for very large disks or critical data where double failures or errors during rebuild are a serious concern.

RAID 6: Striping with Dual Parity

RAID 6 extends RAID 5 by adding a second independent parity block per stripe. It is also block level striping, but with dual distributed parity. Instead of one disk's worth of space for parity, it uses two.

RAID 6 requires at least four disks. For $N$ disks of size $S$, the usable capacity is
$$C_{\text{RAID6}} = (N - 2) \times S.$$

Storage efficiency is $\frac{N - 2}{N}$. Compared to RAID 5, capacity efficiency is lower for the same number of disks, but redundancy is much stronger.

Conceptually, RAID 6 uses two different parity calculations. One is similar to the XOR based parity from RAID 5. The second uses a different mathematical scheme so that the system can solve for two unknown blocks instead of one. The important property from an administrator’s view is that any two disk failures can be tolerated.

RAID 6 can survive the loss of any two disks without losing data. When a single disk is failed, the array runs in degraded mode but still has parity protection against a second failure, though at a performance cost. Only after a third disk fails does the array lose data.

Read performance is similar to RAID 5. Normal reads come from data disks without involving parity blocks, and striping provides parallelism. Write performance is usually slower than RAID 5 because there are more parity calculations and more blocks to update for each write. The write penalty is greater for small random writes.

With large modern disks, RAID 6 is often considered safer than RAID 5 because it reduces the risk that a second disk failure or an unrecoverable read error during rebuild will destroy the array. It is often chosen for large capacity arrays where fault tolerance is a priority, and where the extra disks used for parity are an acceptable cost.

RAID 10: Mirrored Stripes

RAID 10, sometimes written as RAID 1+0, combines mirroring and striping. The data is first mirrored, then striped across mirror pairs. This is conceptually different from RAID 0+1, where data is striped then the stripes are mirrored, but in practice RAID 10 is the commonly implemented one and has better fault tolerance characteristics.

RAID 10 requires at least four disks, arranged as pairs. Each pair is a RAID 1 mirror, and the set of mirrors is then combined as a striped set. This means writes go to both disks in the mirror, and stripes distribute data across the mirror pairs.

For $N$ disks of size $S$, where $N$ is even, the usable capacity is
$$C_{\text{RAID10}} = \frac{N}{2} \times S.$$

Storage efficiency is $\frac{1}{2}$ regardless of the number of disks. Half of the raw capacity is used for redundancy. Compared to RAID 5 or 6, the capacity efficiency is lower for the same disk count. In exchange, RAID 10 provides high performance and strong fault tolerance.

On reads, RAID 10 can read from any disk in any mirror, and it can balance traffic across both disks in a mirror as well as across mirror pairs. Read throughput can be very high. On writes, each block must be written to both disks in the mirror, but stripes still distribute IO across multiple mirror pairs. Therefore, RAID 10 write performance is generally better than RAID 5 or 6, especially for random writes.

Fault tolerance in RAID 10 depends on which disks fail. Since disks are in pairs, each mirror can tolerate the loss of one disk. The array as a whole remains functional as long as no mirror loses all disks. This means multiple disks can fail, and the array can still survive, provided no mirror pair loses both members. For example, with four disks arranged as two mirrors, you can lose one disk from each mirror and remain operational. If both disks of one mirror fail, the array fails.
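This survival rule is easy to express in code. The following sketch assumes a hypothetical layout where disks are numbered consecutively and pair $i$ consists of disks $2i$ and $2i+1$:

```python
# Hypothetical check: a RAID 10 array survives as long as no mirror
# pair has lost both of its members. Disks are numbered 0..n_disks-1,
# and mirror pair i holds disks (2i, 2i+1).
def raid10_survives(n_disks, failed):
    pairs = [{2 * i, 2 * i + 1} for i in range(n_disks // 2)]
    return all(not pair <= set(failed) for pair in pairs)

# Four disks, two mirrors: losing one disk from each mirror is fine,
# but losing both disks of the same mirror destroys the array.
print(raid10_survives(4, {0, 2}))  # True
print(raid10_survives(4, {0, 1}))  # False
```

With four disks, two of the six possible two-disk failure combinations are fatal, which is why RAID 10's tolerance is described as depending on which disks fail rather than simply how many.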

RAID 10 is often chosen for workloads with many random writes and a need for both performance and redundancy, such as database servers. It trades capacity efficiency for predictable performance and reliable recovery behavior.

RAID 0+1 and Other Nested Levels

Nested RAID levels combine basic patterns like RAID 0 and RAID 1 in different orders. RAID 0+1 is a mirror of stripes: data is striped first, then the resulting stripes are mirrored. A typical RAID 0+1 layout requires an even number of disks, grouped into two RAID 0 stripes, which are then mirrored against each other.

RAID 0+1 has similar capacity efficiency to RAID 10, but the fault tolerance behavior differs. If a single disk in one stripe fails, that entire stripe is considered failed and the array operates in degraded mode using only the other stripe. A second disk failure in the surviving stripe then destroys the array. As a result, RAID 0+1 is typically less robust than RAID 10 given the same number of disks and other conditions.
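The contrast with RAID 10 can be made concrete with a similar hypothetical check. Here the first half of the disks forms one RAID 0 stripe and the second half the other; a single failure takes its whole stripe offline, so the array survives only while at least one stripe is fully intact:

```python
# Hypothetical RAID 0+1 check: the array survives only if at least one
# of the two RAID 0 stripes has no failed disks at all, since any
# single failure takes that entire stripe offline.
def raid01_survives(n_disks, failed):
    half = n_disks // 2
    stripe_a = set(range(half))           # first RAID 0 stripe
    stripe_b = set(range(half, n_disks))  # second RAID 0 stripe
    failed = set(failed)
    return not (stripe_a & failed) or not (stripe_b & failed)

# Four disks (stripe A = {0, 1}, stripe B = {2, 3}): one failure in
# each stripe already destroys the array, unlike the equivalent RAID 10.
print(raid01_survives(4, {0, 2}))  # False
print(raid01_survives(4, {0, 1}))  # True  (stripe B still intact)
```

Note that the failure set {0, 2} is survivable under the RAID 10 pairing but fatal under RAID 0+1, which is the concrete reason RAID 10 is typically preferred for the same disk count.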

Other nested RAID levels exist, such as RAID 50 and RAID 60, which combine striping with RAID 5 or RAID 6 groups. These are usually built to increase performance and capacity while maintaining parity based redundancy. For example, RAID 50 is a stripe of several RAID 5 sets. Each RAID 5 set can survive a single disk failure within that set. RAID 60 is similar but uses RAID 6 sets, allowing two failures per set.

Nested RAID levels are mostly used in larger arrays where administrators want a specific balance of redundancy, performance, and capacity, and where they can carefully design the number and size of component RAID groups.

Comparing RAID Levels: Capacity, Performance, and Fault Tolerance

Each RAID level presents a distinct tradeoff.

RAID 0 maximizes capacity and performance, with $C = N \times S$, but has no fault tolerance at all. It is suited to temporary or easily reconstructable data, or scenarios where raw performance matters more than durability.

RAID 1 sacrifices capacity, using $N$ disks to provide the capacity of a single disk, with $C = S$, but provides strong redundancy and simple recovery. It has good read performance and predictable behavior on failure.

RAID 5 and RAID 6 offer better capacity efficiency, with $C = (N - 1) \times S$ for RAID 5 and $C = (N - 2) \times S$ for RAID 6, but introduce parity calculations that affect write performance and rebuild complexity. RAID 5 tolerates one disk failure, RAID 6 tolerates two.

RAID 10 combines the benefits of striping and mirroring, with $C = \frac{N}{2} \times S$, and delivers strong performance for both reads and writes and flexible fault tolerance, at the cost of using half of the raw capacity.
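The capacity formulas above can be collected into one small helper for side-by-side comparison. This is a sketch of the formulas from this section, with $N$ disks of size $S$ (the smallest disk) each:

```python
# Usable capacity for N disks of size S, per the formulas in this
# section. The return value is in the same units as S.
def usable_capacity(level, n, s):
    if level == "raid0":
        return n * s            # C = N * S
    if level == "raid1":
        return s                # C = S
    if level == "raid5":
        return (n - 1) * s      # C = (N - 1) * S
    if level == "raid6":
        return (n - 2) * s      # C = (N - 2) * S
    if level == "raid10":
        return (n // 2) * s     # C = (N / 2) * S
    raise ValueError(f"unknown level: {level}")

# Six 4 TB disks under each level:
for level in ("raid0", "raid1", "raid5", "raid6", "raid10"):
    print(level, usable_capacity(level, 6, 4), "TB")
# raid0 24, raid1 4, raid5 20, raid6 16, raid10 12
```

Running the comparison for a fixed disk count makes the efficiency tradeoff visible at a glance: parity levels keep most of the raw capacity, while mirroring-based levels give up half or more.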

When selecting a RAID level in practice, the key questions are how many disk failures must be tolerated, how important write performance is compared to read performance, how large the disks are, and how expensive extra disks are compared to the risk of data loss or downtime.

RAID Levels and Rebuild Considerations

A critical aspect of RAID levels is what happens during a rebuild after a disk failure. Rebuild is the process of reconstructing lost data on a replacement disk. In RAID 1 and RAID 10, rebuild typically involves copying data from the surviving mirror disk to the new disk. The process is straightforward, though still IO intensive.

In RAID 5 and RAID 6, rebuild requires reading from all remaining disks and using parity to reconstruct the missing data. This stresses the surviving disks and can take a long time for large arrays, especially if the system is also handling normal workload IO. During rebuild, the array is also more vulnerable to additional failures. For RAID 5, a second disk failure during rebuild is catastrophic. For RAID 6, the array can survive one more, but the risk still exists if more disks fail or if unrecoverable read errors occur.

These rebuild dynamics are one reason why some administrators prefer RAID 10 or RAID 6 over RAID 5 for large and critical arrays, and why they sometimes avoid RAID 5 with very large individual disks, even though the raw capacity efficiency looks attractive.

Conclusion

RAID levels specify distinct patterns for distributing data and redundancy across multiple disks. RAID 0 focuses on performance and capacity with no safety. RAID 1 mirrors data for strong redundancy but low storage efficiency. RAID 5 and RAID 6 use parity to provide fault tolerance with better capacity usage but more complex writes and rebuilds. RAID 10 and other nested levels combine striping and mirroring or parity to tune performance and resilience for particular workloads.

Understanding these differences prepares you to match a RAID level to your requirements in terms of fault tolerance, performance characteristics, capacity efficiency, and rebuild risk. Configuration, monitoring, and integration with other storage technologies build on this foundation and are covered in other chapters.
