Erasure Coding: The Basics of Smarter Data Resilience
I had wanted to write this for quite a while. Over the summer I revisited some fundamentals of RAID, erasure codes, parity calculations, and so on, but never finished it until Christmas time. While Nutanix HCI has EC-X, this post is not directly tied to any Nutanix technology. I hope you still enjoy the read!
RAIDs, Mirrors, Parities, Erasure Codes and Math
In the world of storage, we often hear about efficiency, resiliency, and recoverability as if they are some mystical trifecta. Depending on the vendor and their approach, the strategies can vary wildly. For example, traditional storage arrays lean heavily on RAID configurations – some more sophisticated than others – while modern distributed systems favor Erasure Coding.
But why the shift? And what exactly is Erasure Coding?
Let us dive into the fundamentals of RAID, the underlying principles of Erasure Coding, and how these approaches tackle the challenges of data resiliency and efficiency.
RAID Levels: From 0 to 6 and everything in between
Almost everyone in IT has bumped into RAID (Redundant Array of Independent Disks) at some point. Back in school, I had to learn the RAID levels: 0, 1, 5, 6, 10, 50, and 60.
All visuals are taken from the Wikipedia article on RAID levels since I learned them there and love the visualization 😊 (https://en.wikipedia.org/wiki/Standard_RAID_levels).
The core idea of RAID is simple: combine multiple storage devices into a logical unit that provides data redundancy and, in most cases, performance improvements.
RAID 0:
RAID 0 provides no redundancy, focusing entirely on performance. Data is striped across multiple disks at the block level, allowing concurrent writes and leveraging the combined throughput of all devices. For example, a 4-block file is split evenly across two disks, doubling write speeds. However, RAID 0 has no fault tolerance – if one disk fails, all data is lost. While it is useful for temporary or non-critical data, RAID 0 is rarely recommended for production environments due to its lack of resiliency.
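If you think of it in code, striping is just dealing blocks out round-robin. Here is a minimal Python sketch (the `stripe` helper and the two-disk layout are purely illustrative, not any real controller's logic):

```python
def stripe(blocks, num_disks):
    """Distribute data blocks round-robin across disks, RAID 0 style."""
    disks = [[] for _ in range(num_disks)]
    for i, block in enumerate(blocks):
        disks[i % num_disks].append(block)
    return disks

# A 4-block file striped across two disks: each disk takes half the blocks,
# so both can be written concurrently -- but losing either disk loses the file.
print(stripe(["A1", "A2", "A3", "A4"], 2))  # [['A1', 'A3'], ['A2', 'A4']]
```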
RAID 1:
RAID 1 duplicates data across two or more disks, ensuring each disk contains an identical copy. Every write operation is mirrored, and the system can continue running seamlessly if one disk fails. This makes RAID 1 ideal for critical data where redundancy is more important than storage efficiency. However, the trade-off is that you only get 50% usable capacity, as one disk is a full mirror of the other. RAID 1 is simple and reliable, but it does not significantly improve write performance (reads can be served from either copy), and a two-disk mirror can only tolerate a single disk failure.
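Mirroring is even simpler to sketch than striping – every disk just gets a full copy (again, a toy illustration, not real controller logic):

```python
def mirror(blocks, num_copies=2):
    """Write every block to every disk, RAID 1 style: full identical copies."""
    return [list(blocks) for _ in range(num_copies)]

# Two disks, two identical copies: 50% usable capacity, one disk may fail.
print(mirror(["A1", "A2", "A3", "A4"]))
```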
RAID 5:
RAID 5 combines striping for performance with distributed parity for redundancy. Data and parity blocks are spread across all disks, ensuring the array can recover from a single disk failure. For example, in a 3-disk RAID 5, parity is calculated using XOR and rotated among the disks. This provides better storage efficiency than mirroring, as only one disk’s capacity is reserved for parity. However, rebuilding a failed disk requires parity calculations and reading all remaining disks, which slows performance and increases stress on the array during recovery.
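To make the rotated parity idea concrete, here is a toy Python sketch of a RAID 5 layout – small integers stand in for blocks, and the parity position shifts from one stripe to the next. Real controllers work at the block-device level, so treat this purely as a conceptual model:

```python
from functools import reduce

def raid5_layout(data_blocks, num_disks=3):
    """Lay out RAID 5 stripes: XOR parity, rotated across the disks."""
    per_stripe = num_disks - 1  # one block per stripe is parity
    stripes = []
    for s in range(0, len(data_blocks), per_stripe):
        chunk = data_blocks[s:s + per_stripe]
        parity = reduce(lambda a, b: a ^ b, chunk)
        stripe = list(chunk)
        # Rotate where the parity lands so no single disk holds all parity.
        stripe.insert((s // per_stripe) % num_disks, parity)
        stripes.append(stripe)
    return stripes

# Each inner list is one stripe written across the 3 disks,
# with the parity position shifting each time.
print(raid5_layout([0b1011, 0b1100, 0b0110, 0b0001]))
```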
RAID 6:
RAID 6 extends RAID 5 by adding a second parity block, enabling it to survive two simultaneous disk failures. Like RAID 5, it stripes data and distributes parity blocks across all disks, but the second parity uses a more complex calculation (typically Reed-Solomon-style arithmetic over a Galois field) for the added fault tolerance. This makes RAID 6 more resilient but also more resource-intensive during writes and recovery. The storage overhead is higher, with the capacity of two disks reserved for parity. RAID 6 is commonly used for large arrays or critical environments where recovery time and fault tolerance are top priorities.
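That "more complex calculation" deserves a sketch. The following Python snippet shows the general shape of the P and Q parity math (I am using the common generator 2 and the GF(2^8) reduction polynomial 0x11D; actual implementations differ in details, and a full two-disk rebuild needs more machinery than this):

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with the reduction polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def raid6_parity(data):
    """Compute RAID 6 style parity bytes: P (plain XOR) and Q (weighted GF sum)."""
    p = q = 0
    g = 1  # powers of the generator: g = 2**i in GF(2^8) for the i-th block
    for d in data:
        p ^= d
        q ^= gf_mul(g, d)
        g = gf_mul(g, 2)
    return p, q

# Two independent parity values let the array solve for any two lost blocks.
print(raid6_parity([0x0B, 0x0C, 0x06]))
```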
The others (10, 50, 60) are combinations of the above. RAID 10, for example, stripes across two or more RAID 1 volumes, combining the increased throughput of striping (RAID 0) with the redundancy of mirroring (RAID 1). RAID 50 stripes across RAID 5 sets, RAID 60 across RAID 6 sets – you get the idea.
Why Distributed Systems Differ
Why do distributed systems not just stick with RAID? After all, RAID 5 and 6 already balance performance, redundancy, efficiency, and recoverability – what’s the issue? The short answer: distributed systems are playing a much bigger game. RAID is great for local arrays but does not scale or handle failures across entire nodes, networks, or data centers. Let us unpack why.
Distributed systems do not use RAID because their design does not fit RAID’s centralized approach. RAID relies on a controller managing disks in one array, while distributed systems spread data across nodes. This setup needs redundancy across the entire system, not just local disks, which RAID cannot handle.
Plus, distributed systems scale by adding nodes and/or disks, while RAID levels like 1, 5, and 6 scale by adding disks, and the total number of disks is often very limited. Erasure coding, being software-based, is more flexible and handles larger-scale systems. Essentially, RAID is great for single arrays, but it cannot match the scalability and resilience distributed systems demand.
So, what is Erasure Coding?
Erasure coding (EC) is the solution that distributed systems turn to for redundancy and resilience. Think of it as RAID’s flexible, software-defined relative, designed to work across nodes instead of disks.
Here is the gist: EC splits your data into data blocks and adds extra parity blocks using mathematical techniques like XOR or Galois fields (the same math behind the second parity in RAID 6, by the way). For example, in a 3+1 setup, your file is split into 3 data blocks, and 1 parity block is calculated from them. These 4 blocks are then spread across multiple nodes.
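Here is a minimal sketch of that 3+1 split in Python – `ec_encode` is a made-up helper, and real systems use far more sophisticated codecs, but the shape of the idea is the same:

```python
from functools import reduce
from operator import xor

def ec_encode(data: bytes, k: int = 3):
    """Split data into k data blocks and add one XOR parity block (k+1 EC)."""
    size = -(-len(data) // k)             # ceiling division
    data = data.ljust(size * k, b"\x00")  # pad so it splits evenly
    blocks = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(xor, column) for column in zip(*blocks))
    return blocks + [parity]

# 3+1: four blocks, each destined for a different node.
for i, block in enumerate(ec_encode(b"hello erasure coding")):
    print(f"block {i}: {block!r}")
```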
If one node fails, EC uses the remaining data and parity blocks to reconstruct the missing pieces. Unlike RAID, which is tied to a physical array, EC scales beautifully with distributed systems, handling failures across disks, nodes, or even entire racks. It is all about resilience and efficiency at scale, making it the go-to redundancy method for modern infrastructure.
Let us walk through a theoretical example of XOR recovery. Imagine storing a (very small) 4 KB picture called sunset.jpg in a distributed storage system using 3+1 erasure coding. Our made-up file is split into three data blocks, D1–D3 (technically about 1.33 KB each, but let us call it 1 KB), and a 1 KB parity block (P1) is calculated for redundancy. The 4 blocks are then spread across different nodes in our distributed system.
The system now performs XOR operations across all three data blocks. To keep the math readable, imagine each block shrunk down to just four bits:
D1 = 1011
D2 = 1100
D3 = 0110
In Boolean algebra, ⊕ denotes the XOR operation. XOR results in 1 only when the inputs differ (e.g., 1 and 0 or 0 and 1) and 0 when they match. So, the calculation for our parity block is:
P1 = D1 ⊕ D2 ⊕ D3
P1 = 1011 ⊕ 1100 ⊕ 0110
P1 = 0111 ⊕ 0110
P1 = 0001
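If you want to check the math yourself, Python's `^` operator does exactly this XOR:

```python
# Parity is the bitwise XOR of all three data blocks.
D1, D2, D3 = 0b1011, 0b1100, 0b0110
P1 = D1 ^ D2 ^ D3
print(f"{P1:04b}")  # 0001
```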
These blocks are spread across the 4 nodes in our distributed system. After a while, one of our nodes fails, and with it D2 (1100) is lost. Our system still has D1 (1011), D3 (0110), and P1 (0001) on the remaining nodes to recover the failed piece of data. We replace the node, hook it up to the system, and it starts the recovery:
D2 = P1 ⊕ D1 ⊕ D3
D2 = 0001 ⊕ 1011 ⊕ 0110
D2 = 1010 ⊕ 0110
D2 = 1100
Voilà! We calculated the lost block from the remaining data and parity blocks. Obviously, this happens at enormous scale (and is calculated far faster than in my walkthrough) across many, many more nodes. However, I do hope this example makes the concept easier to grasp.
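And the recovery step, again in a few lines of Python with the same four-bit values:

```python
# Recover the lost block by XOR-ing the parity with the surviving data blocks.
D1, D3, P1 = 0b1011, 0b0110, 0b0001
D2 = P1 ^ D1 ^ D3
print(f"{D2:04b}")  # 1100 -- the lost block, reconstructed
```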
Math, Magic and Modern Storage
Erasure coding is the backbone of modern distributed storage. By splitting data, calculating parity, and spreading it across nodes, it delivers a strong mix of redundancy, efficiency, and scalability. Whether a single disk fails or an entire node goes offline, your data stays safe. Built for resilience and efficiency, erasure coding is a future-proof solution that continues to power reliable storage at scale.
More on efficiency in a follow-up post!