This blog entry already appeared a couple of months ago. Unfortunately, we were sternly warned from across the ocean not to talk about unreleased features, so the text had to be taken down. Now that Erasure Coding is available in NOS 4.1.3 in public beta status (experiment with it, but hold off on production use for now, we are still optimizing the code), it can finally be discussed publicly. If you have already read my earlier story about how NDFS, the Nutanix Distributed File System, is the foundation of everything Nutanix does, you have probably noticed that NDFS is rather generous in how it consumes disk space.
Let me remind you that we do not use RAID in its classical sense, where, for example, a disk keeps its mirror copy (RAID-1), or an additional redundancy code is calculated for a disk group (RAID-5 or 6). Instead, we store every block of data written to disk in two (or even three) places, on different disks and even different nodes. This scheme is called RAIN (Redundant Array of Independent Nodes, a jab at RAID, which is the same thing but with Disks). From the point of view of usable capacity, however, RF = 2, the option where one copy is stored for each block, consumes space exactly like RAID-1: 50% of raw capacity is available to us (minus some variable percentage for service structures and metadata, which we will leave aside here).
Yes, fault tolerance, reliability, fast (minutes) failover, all of that is true. But the cost is still considerable, especially for people who still think of drives in terms of raw or RAID-5 capacity. And you can repeat as much as you like that RAID-5 is bad and unreliable, that it is slow on writes, and that at current HDD prices the gigabytes spent on extra reliability and performance are cheap compared to what we get in return for them. It does not matter. "We have four terabytes of disks in the system. Why is there less than two terabytes left for our data?"
That is why Nutanix has an idea that is now being actively implemented. The person working on it at Nutanix, by the way, is "one of ours", a Russian-speaking programmer.
It is what is called an "erasure code" (our implementation is named EC-X, Erasure Code-X). As often happens with engineers, the name is not self-describing, and nobody really remembers why it stuck. In Russian, the most accurate rendering would be "redundancy code."
Here is how it works.
If we have data that we begrudge keeping in "RAID-1", that is, in Replication Factor (RF) = 2 mode, we can switch the storage mode for that data from RF = 2 (or = 3) to erasure coding. A special background process then starts working, similar to how we do deduplication in a cluster, and after some time, instead of block-plus-copy, the disks of the cluster nodes hold chains of block-block-block-...-plus-redundancy-information-for-them, which allows the contents of any block in the chain to be unambiguously restored if it is lost, for example, as a result of a disk failure on one of the nodes.
And when this background process finishes, instead of a block and its copy we store many blocks combined into a group, plus a separate block with the redundancy code. In a data container where we enabled erasure coding instead of RF, we end up with the same amount of stored information and more free space for new data.
Again, this is a bit like post-process deduplication.
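To make this more tangible, here is a toy sketch in Python that uses plain XOR parity, the simplest possible "redundancy code." It is emphatically not the EC-X algorithm itself (which, as noted below, is Nutanix's own design and not even Reed-Solomon); it only illustrates the principle of rebuilding a lost block from the surviving blocks plus the redundancy information.

```python
# Toy illustration only: XOR parity over a strip of equal-sized data blocks.
# EC-X uses Nutanix's own, different code; the principle of
# "surviving blocks + redundancy info => lost block" is the same.
from functools import reduce

def make_parity(blocks: list[bytes]) -> bytes:
    """Compute a parity block as the byte-wise XOR of all data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover a single missing block from the surviving blocks and the parity."""
    return make_parity(surviving + [parity])

blocks = [b"AAAA", b"BBBB", b"CCCC"]       # a strip of equal-sized blocks
parity = make_parity(blocks)               # stored instead of full copies

lost = blocks.pop(1)                       # simulate losing block b"BBBB"
assert rebuild(blocks, parity) == lost     # reconstructed from the rest
```

An XOR strip like this survives only one missing block at a time; real erasure codes generalize the idea to tolerate more simultaneous failures, which is what the RF = 3-equivalent mode mentioned further down relies on.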
Surely at this point you are ready to say: "Well, you have just reinvented the RAID-5 wheel!" Not quite, at the mathematical level, but the principle is indeed remotely similar, yes.
"Payback" here (nothing without payback does not happen, as we remember) is that for more disk space for data, we pay higher CPU utilization, in case of need for data recovery. It is clear that, instead of simply copying, here we will have to restore the contents of the block from the contents of other data blocks and the redundant code, and this is a significantly more resource-intensive procedure.
It is also important that Erasure Coding can provide enough redundancy to survive the failure of two disks, nodes, or other cluster components, that is, the fault-tolerance equivalent of RF = 3, for which the usable volume on disks is about 33% of raw.
And in the case of erasure coding?
It depends on the size of the cluster. The larger it is in terms of node count, the more beneficial it becomes and the bigger the difference.

For a 4-node cluster with 80TB raw, RF = 2 gives approximately 40TB usable. When the container is switched to erasure coding, usable space becomes about 53TB.
On 5 nodes: 100TB raw, 50TB usable with RF = 2, 75TB with erasure coding; on 6 nodes: 120, 60 and 96TB; on 7 nodes: 140, 70 and 116TB. As you can see, "storage efficiency" with erasure coding grows as the cluster gets bigger and can reach 80% of raw capacity.
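For those who want to play with the arithmetic, here is a minimal Python sketch. The strip geometry it assumes for an RF = 2-equivalent container, (N - 2) data blocks plus one parity block across a cluster of N nodes, is simply inferred from the figures quoted above and is not an official description of EC-X; the function names are mine.

```python
# A rough sketch only: estimating usable capacity under replication vs. EC-X.
# Assumption (inferred from the figures above, not an official formula):
# an RF=2-equivalent EC-X strip holds (nodes - 2) data blocks + 1 parity block.

def usable_rf(raw_tb: float, rf: int = 2) -> float:
    """Usable capacity with plain replication: only 1 of `rf` copies is payload."""
    return raw_tb / rf

def usable_ec(raw_tb: float, nodes: int, parity: int = 1) -> float:
    """Estimated usable capacity with erasure coding: a strip of
    (nodes - 1 - parity) data blocks plus `parity` redundancy blocks."""
    data_blocks = nodes - 1 - parity
    return raw_tb * data_blocks / (data_blocks + parity)

for nodes, raw in [(4, 80), (5, 100), (6, 120), (7, 140)]:
    print(f"{nodes} nodes, {raw}TB raw: "
          f"RF=2 -> {usable_rf(raw):.0f}TB, "
          f"EC-X -> {int(usable_ec(raw, nodes))}TB")
```

Running it reproduces the 40/53, 50/75, 60/96 and 70/116 terabyte pairs from the paragraph above.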
What kind of coding is used? No, it is not the Reed-Solomon code familiar to the industry and often used for such tasks. We had to invent our own algorithm, which provides faster processing and computation. And, of course, we use the distributed capabilities of the Nutanix cluster: the algorithm is distributed, MapReduce-style, and runs on all the nodes of the cluster, which ensures its reliability and performance. It is also important to note that using EC-X does not violate our Data Locality principle. If a virtual machine is located on a given virtualization host (cluster node), its data on the SSD (the performance tier of our storage) will also lie locally on that node, with both the RF and EC-X storage options, which provides low latency and high disk performance.
For what and where can this be applied?
First of all, it lowers storage cost ($/GB), which is especially important for cold storage and capacity nodes, and especially on large clusters, when the information you keep on Nutanix is valuable but not too "hot" or active, and you are ready to pay for more free space with higher CPU utilization and a longer recovery time.
At the same time, note that in normal operation, when working with data under erasure coding, CPU load does not increase significantly on data access; it only rises during recovery.
You are also free to choose how to protect your data with redundancy. You can keep different data containers on one cluster, some with RF = 2, others with RF = 3, and some with erasure coding. For data that is hot and critical, you can choose some form of RF, and for data that is not so "hot", and lies on nodes where the increased CPU load during recovery is not critical, Erasure Coding.
Again: the storage mode for your data is your choice, and it depends on your priorities.
Erasure Coding comes with the new release of Nutanix OS, which will arrive on your Nutanix systems with a regular update. Updating Nutanix, by the way, does not stop virtual machines or make data inaccessible; the system is updated over-the-air, "like an iPhone", but more on that in a future post.