
NDFS - Nutanix Distributed File System, the "foundation" of Nutanix

image

Nutanix is a “converged platform 2.0”, if by “converged platform 1.0” we mean the early experiments in mechanically bolting storage and servers together into a single product with some nominally common architecture, such as VCE (EMC + Cisco) Vblock, NetApp FlexPod or Dell VRTX. Unlike those, Nutanix is not simply a server, a network and a storage array wired together in a 19" rack; the integration goes much deeper and is far more interesting.

But we will start with the basics, with what lies at the foundation of Nutanix as an architecture and a solution: the Nutanix Distributed File System, NDFS.

As the name implies, NDFS is a distributed, clustered file system.

One of the most important problems that the developer of a clustered, distributed file system has to solve is working with so-called metadata.
What is “metadata”, and how does it differ from the “data” itself?
Any data we store on a storage system always has auxiliary data associated with it: the file name, the path to the file (that is, where it sits in the storage structure), its attributes, size, creation time, modification time, access time, bit flags such as “read-only”, and access rights, from the simplest bits up to full ACLs. This auxiliary data is not the stored data itself, but it is associated with it and makes access to the data possible. It is called “metadata”. As a rule, access to data begins with access to the metadata stored in the file system; only with its help can the OS start working with the data itself.
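To make the distinction concrete, here is a tiny Python sketch (it has nothing to do with NDFS itself) that reads a file's data and, separately, the metadata the OS keeps about it:

```python
import os
import stat
import time

path = "example.txt"  # any existing file will do

# The data itself: the bytes we actually stored.
with open(path, "rb") as f:
    data = f.read()

# The metadata: auxiliary information the file system keeps about those bytes.
st = os.stat(path)
print("size:", st.st_size)                            # size in bytes
print("modified:", time.ctime(st.st_mtime))           # modification time
print("accessed:", time.ctime(st.st_atime))           # access time
print("writable by owner:", bool(st.st_mode & stat.S_IWUSR))  # a simple bit attribute
```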
The problems begin when more than one party wants to work with the data. With one, everything is simple: look up the path to the file, read the needed piece, change it, write it back, and update the metadata if necessary. But what happens when, while one process on one cluster node is writing its changes to a file, another process on another node reads that same file and starts making changes of its own?
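To see why this is a problem at all, here is a deliberately crude Python sketch: two “writers” update the same piece of metadata (a recorded file size) with no coordination between them, and one update is silently lost. The names and the sleep are purely for illustration.

```python
import threading
import time

metadata = {"size": 0}   # the shared metadata record, e.g. a file's recorded size

def append_block(block_size):
    # Classic unsynchronized read-modify-write.
    old = metadata["size"]
    time.sleep(0.01)                 # widen the window so the race is easy to reproduce
    metadata["size"] = old + block_size

writers = [threading.Thread(target=append_block, args=(4096,)) for _ in range(2)]
for t in writers: t.start()
for t in writers: t.join()

# Both writers appended a 4096-byte block, so we expect 8192, but both of them
# read size=0 before either wrote it back, so one of the updates was lost.
print(metadata["size"])              # prints 4096
```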

One of the classic mistakes of novice system administrators is to mount the same iSCSI LUN from a storage system on two or more servers (physically this is perfectly possible), format it with, say, NTFS, and then try to read and write to it from different servers. Those who have been through this know what happens next; for those who have not, here is what happens. Everything goes fine as long as only one server writes to such a “disk”. As soon as the second one starts writing, the file system immediately (more or less fatally) falls apart. Why? Because NTFS is not a clustered file system and knows nothing about the situation where two different servers can write to the same file system at the same time.


The file system has to track and “arbitrate” access to data from different nodes of the cluster. The simplest way to do this is to designate a single node as the “elder” of the file system, so that all metadata changes go through it and it alone operates on them, answering metadata-change requests, blocking some and allowing others. This is how, for example, Lustre or pNFS works. Unfortunately, this means that metadata operations sooner or later become a bottleneck for scaling. When all metadata operations pass through exactly one server, one day that server becomes the bottleneck. Not immediately, not always, but eventually it does.
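As a very rough sketch of that “single elder node” scheme (nothing Lustre- or pNFS-specific here, and all names are invented for the example): every node has to ask one dedicated metadata server before touching a file, so every metadata request in the cluster funnels through that one place.

```python
import threading

class MetadataServer:
    """The single node through which every metadata change must pass."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locked_paths = set()

    def acquire(self, path):
        # Every node in the cluster calls this before touching 'path';
        # this one server decides whether to block or allow the request.
        with self._guard:
            if path in self._locked_paths:
                return False          # someone else holds it: request blocked
            self._locked_paths.add(path)
            return True               # request allowed

    def release(self, path):
        with self._guard:
            self._locked_paths.discard(path)

# One instance for the whole cluster: simple and consistent, but every metadata
# operation from every node ends up here, which is exactly the future bottleneck.
mds = MetadataServer()
```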

For example, those who have tried to put many (hundreds of) virtual machines on a single LUN with a VMFS partition (and VMFS is a classic clustered file system with locking based on SCSI locks) know that with a large number of metadata operations a LUN with VMFS sometimes “slows down” quite unpleasantly. A “metadata operation” is, for example, rebooting a VM, resizing a VMDK, creating, deleting or growing a VMFS datastore on a LUN, creating templates or deploying a VM from a template; all of these cause a LUN lock and a brief I/O pause for every VM on it. This problem is partly addressed by Atomic Test and Set (ATS) locking, offloaded to the storage system if it supports VAAI, but even that often solves the problem only partially.
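The difference between a whole-LUN lock and Atomic Test and Set can be sketched roughly like this (purely conceptual Python, not the real SCSI reservation or VAAI primitives):

```python
import threading

lun_lock = threading.Lock()      # old scheme: one lock for the entire LUN
record_locks = {}                # ATS-style: a tiny lock record per on-disk object
_table_guard = threading.Lock()

def metadata_op_with_lun_lock(do_work):
    # The whole LUN is reserved, so I/O briefly stops for every VM on it.
    with lun_lock:
        do_work()

def metadata_op_with_ats(path, do_work):
    # "Test and set" only the one lock record that covers this object;
    # other VMs on the same LUN keep doing their I/O undisturbed.
    with _table_guard:
        if record_locks.get(path):
            raise RuntimeError("lock busy, retry later")
        record_locks[path] = True
    try:
        do_work()
    finally:
        record_locks[path] = False
```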

Therefore, the question of building a good, truly broadly scalable cluster system often comes down to the question of building a highly scalable metadata management scheme.
When creating NDFS, the NoSQL database Apache Cassandra was chosen to store the file system metadata in the cluster. It was originally developed inside Facebook and later open-sourced. This is not the first time it has been used in this role, because it scales well, but along with scaling almost always comes the problem of database consistency at cluster scale. So the Nutanix developers faced the task of ensuring not just distribution (Cassandra, for example, runs distributed at Facebook on clusters of thousands of nodes) but also consistency. For Facebook it is not a big deal if the photos in your album do not update instantly on all nodes of the cluster, but for Nutanix the consistency of the storage metadata is critical. Therefore, “vanilla” Cassandra was substantially reworked to provide the consistency of storage in the cluster that the original implementation lacked.
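Just to show what “tunable consistency” looks like in plain, vanilla Cassandra (this is the stock DataStax Python driver, not Nutanix's reworked version, and the keyspace, table and addresses are made up for the example):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical metadata table: block id -> the nodes holding its replicas.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("ndfs_demo")      # made-up keyspace

write = SimpleStatement(
    "INSERT INTO block_map (block_id, replicas) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,   # a majority of replicas must acknowledge
)
session.execute(write, ("extent-0042", ["node-a", "node-b"]))

read = SimpleStatement(
    "SELECT replicas FROM block_map WHERE block_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,   # QUORUM read + QUORUM write means the
)                                                # read always sees the latest acknowledged write
row = session.execute(read, ("extent-0042",)).one()
```

Eventual consistency (say, ConsistencyLevel.ONE everywhere) is fine for a photo album, but for storage metadata you need the stricter guarantees, and that is exactly the part Nutanix had to rework.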

At the same time, thanks to Cassandra's distributed nature, the system has no “master node” whose failure could affect the operation of the cluster as a whole. The metadata database is distributed across the cluster; there is no “master node” or “file system head” in Nutanix that solely controls access to metadata.

How do work with and access to the disks happen in a Nutanix cluster?
At a high, “large-block” level, it looks like this:

image

What surprises many is that Nutanix does not use RAID to provide high availability of data on its disks. We are so used to RAID on disks in enterprise systems that the statement “there is no RAID” often causes confusion: “But what about the required reliability of disk storage?” That's right, there is no RAID. Nevertheless, data redundancy is provided, and it is provided in a way we call RAIN: Redundant Array of Independent Nodes. The redundancy of stored data is ensured by the fact that, at the logical level, each written data block is recorded not only on the disks of the cluster node it is destined for, but simultaneously and synchronously on the disks of another node (or two other nodes, as an option).
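A minimal sketch of the RAIN idea (this is not Nutanix's actual placement logic; the node objects and their write() method are assumed for the example): a written block is acknowledged only after it has landed on the local node and on one or two other nodes.

```python
import random

REPLICATION_FACTOR = 2   # one local copy + one remote copy (3 would mean two remote copies)

def write_block(block_id, data, local_node, all_nodes):
    # 1. Write the block to the node the VM lives on.
    local_node.write(block_id, data)

    # 2. Synchronously write the same block to RF-1 *other* nodes, so that
    #    losing any single node (and all of its disks) loses no data.
    remote_candidates = [n for n in all_nodes if n is not local_node]
    remote_replicas = random.sample(remote_candidates, REPLICATION_FACTOR - 1)
    for node in remote_replicas:
        node.write(block_id, data)

    # Only now is the write acknowledged back to the guest VM.
    return [local_node] + remote_replicas
```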

What does dropping RAID give us? First of all, faster and more flexible recovery when a disk is lost, and consequently higher reliability, because during a rebuild a classic RAID has reduced reliability, and on top of that its disk performance drops as well, since the RAID group is busy with its own internal reads and writes of blocks for the rebuild.
Secondly, much more flexible data placement. The file system knows where and which data blocks are written, and can recover exactly what is needed at the moment, or write data in whatever way is optimal, since it fully controls the entire process of writing data to the physical disks, unlike a RAID controller, which is logically separated from the level of OS operations.
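The practical consequence can be sketched like this (again only an illustration, not the real Curator logic): when a node dies, only the blocks that actually had a replica on it need to be copied again, and the copies can flow from many surviving nodes to many surviving nodes in parallel.

```python
def re_replicate_after_failure(failed_node, block_map, alive_nodes, rf=2):
    """block_map: block_id -> list of node objects currently holding that block."""
    for block_id, holders in block_map.items():
        holders = [n for n in holders if n is not failed_node]
        if not holders:
            raise RuntimeError("all replicas were on the failed node")
        # Healthy blocks cost nothing; only under-replicated ones are touched.
        while len(holders) < rf:
            source = holders[0]                                  # any surviving replica
            target = next(n for n in alive_nodes if n not in holders)
            target.write(block_id, source.read(block_id))
            holders.append(target)
        block_map[block_id] = holders
```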

How is access to the disks organized?
First, remember that Nutanix is a platform for a hypervisor. Everything inside a Nutanix server runs on top of a bare-metal hypervisor, for example ESXi, MS Hyper-V or Linux KVM.

Among the virtual machines there is one special one, called the CVM, or Controller VM. This is a so-called “virtual appliance”, inside which the whole kitchen of building and maintaining the Nutanix file system runs. Physically, the CVM is a virtual machine running CentOS Linux, with a multitude of our services inside, for example Apache Cassandra, the NoSQL store for file system metadata I already mentioned above, plus a diverse zoo of processes that provide everything Nutanix can do. In essence, if you sharpen the point to the extreme, Nutanix is this very CVM: it is its heart, its brain, and the main intellectual property.
But from the hypervisor's point of view it is just one big virtual machine with a lot of special processes inside.

As you can see in the diagram, this virtual machine passes the I/O traffic between the virtual machines and their virtual disks through itself. For the guest OSes, the physical disks sit not just “behind the hypervisor” but also behind the CVM. The CVMs of all the nodes joined into the common cluster build a shared space out of the physical disks, the individual SSDs and HDDs directly attached to them. They build it and present it to the hypervisor, which then sees shared storage. The CVM presents this storage in the form most convenient for the given hypervisor: for VMware ESXi it is a “virtual” NFS stack, for 2012R2 Hyper-V it is the SMB3 protocol stack, and for KVM it is iSCSI.
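In pseudo-config form, the mapping looks something like this (the mapping itself is as described above; the function around it is invented for the example):

```python
# Which protocol the CVM uses to present the shared storage to each hypervisor.
PRESENTATION_PROTOCOL = {
    "esxi":   "nfs",     # VMware ESXi mounts the storage as an NFS datastore
    "hyperv": "smb3",    # Hyper-V 2012R2 consumes it over SMB3
    "kvm":    "iscsi",   # KVM consumes it over iSCSI
}

def presentation_protocol(hypervisor: str) -> str:
    try:
        return PRESENTATION_PROTOCOL[hypervisor]
    except KeyError:
        raise ValueError(f"unsupported hypervisor: {hypervisor}")
```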
For each hypervisor (the three listed above are currently supported) there is a corresponding CVM, which is installed onto the hypervisor during the initial installation of the cluster.

The process that handles I/O over the chosen protocol, between the physical disks on one side and the VMs on the hypervisor on the other, is called Stargate. The process that distributes tasks across the cluster nodes, including all the MapReduce tasks such as balancing load between nodes and scrubbing (online integrity checking) of the disks, is called Curator. Prism is the management interface, including the GUI, and Zookeeper (also an Apache project) stores and maintains the cluster configuration.

image

Since, as I mentioned above, Nutanix does not use RAID and lays data blocks out on the disks itself, it gets a lot of flexibility. For example, you can see SSD drives in the diagram. They hold the so-called hot tier, that is, the data blocks that are being actively accessed, read or modified. Blocks being written to disk also land on the hot tier. They stay there until, due to inactivity, they are evicted to the cold tier, to the SATA HDDs. And since the “laying out” of blocks on the disks is handled by the Nutanix kitchen inside the CVM itself, it has complete control over where, how, what and for how long blocks are stored.
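A very rough sketch of the down-migration idea (not the actual ILM algorithm; the one-hour threshold is made up): blocks land on the SSD tier when written, and blocks that have not been touched for a while get pushed down to the SATA tier.

```python
import time

HOT_TIER_TTL = 3600      # evict blocks untouched for an hour (made-up threshold)

ssd_tier = {}            # block_id -> (data, last_access_timestamp)
hdd_tier = {}            # block_id -> data

def write_block(block_id, data):
    # New and rewritten blocks always land on the hot (SSD) tier first.
    ssd_tier[block_id] = (data, time.time())

def read_block(block_id):
    if block_id in ssd_tier:
        data, _ = ssd_tier[block_id]
        ssd_tier[block_id] = (data, time.time())   # touching a block keeps it hot
        return data
    return hdd_tier[block_id]                      # cold read straight from SATA

def downmigrate():
    # Run periodically: push blocks that have gone cold down to the HDD tier.
    now = time.time()
    for block_id, (data, last_access) in list(ssd_tier.items()):
        if now - last_access > HOT_TIER_TTL:
            hdd_tier[block_id] = data
            del ssd_tier[block_id]
```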

By joining Nutanix nodes together into a single cluster, we get the following structure:

image

Directly on top of the physical disks sits the data store. Data is stored in the form of extents, that is, sequentially laid out, addressable chains of blocks, and groups of these extents. As the addressable store it was decided to use the mature, well-proven ext4 file system, of which only the extent storage and addressing function is used. All the logic of working with metadata, as I mentioned above, is lifted up to the level of Nutanix itself.
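To pin the terms down, here is a minimal sketch of the two structures (the field names are illustrative, not the real NDFS layout): an extent is a contiguous, addressable run of blocks belonging to some virtual disk, and an extent group is a bundle of extents that is physically stored as a file on the ext4 extent store, while ownership and replica placement are tracked in the metadata database.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Extent:
    vdisk_id: str    # which virtual disk this data belongs to
    offset: int      # starting offset inside that vDisk, in bytes
    length: int      # length of the contiguous run of blocks covered by this extent

@dataclass
class ExtentGroup:
    egroup_id: str
    extents: List[Extent] = field(default_factory=list)

    def backing_file(self) -> str:
        # The group is what actually exists as a file on the ext4 extent store;
        # which nodes hold its replicas is tracked in the metadata database
        # (Cassandra), not in ext4 itself.
        return f"/extent_store/{self.egroup_id}"
```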
In the diagram below, yellow is the physical SSDs and SATA HDDs, green is NDFS, consisting of ext4 as the extent store for data and the cluster metadata store in Cassandra, and on top of that sit the data blocks of the guest OS file systems: NTFS, ext3, XFS, or whatever else.

image

In the next posts I plan to continue the “technical” part of the story about what happens in Nutanix “under the hood”: in more detail about how RAIN and fault tolerance work, how deduplication and compression work, how the distributed cluster is implemented and how data replication for DR works, how to do backups, and much more that is interesting.

If you have just joined our hub, you will most likely also be interested in the other Nutanix publications on Habrahabr:
http://habrahabr.ru/company/nutanix/blog/240859/

In addition, for those who want to dig deeper into the topic, I would recommend a blog where you can also find many technical articles about how Nutanix works under the hood.
http://blog.in-a-nutshell.ru

Source: https://habr.com/ru/post/243019/

