
On data locality in hyperconverged systems



There are concepts commonly used in professional communication and in marketing as if they were strictly defined and understood by everyone in the same way. In practice, it often turns out that a number of terms hotly debated in the IT community have never had precise definitions, and nobody ever got around to agreeing on their common meaning.

Take, for example, the concept of "enterprise readiness". Ever since enterprise automation became a serious topic, the term has been used confidently both in technical documentation and in sales pitches. Yet it has no strict definition! There is a general understanding of "enterprise-ready" or "enterprise-grade" systems as solutions fit for use by large organizations from arbitrary branches of human activity. Very expensive solutions, costing more than a million dollars, say, are often called enterprise-ready simply because of their price. It may seem like a curiosity, but this curiosity sets the level of the discussion.

Reliability, availability, maintainability


When people speak of "enterprise-level" IT solutions, "enterprise" means any large organization or management company, but not an IT company. This matters because such organizations rely on established standards and off-the-shelf IT products; it is important for them to be able to operate the solutions they acquire on their own, and the fault tolerance and reliability of their IT systems are critical. We therefore prefer to call "enterprise-level" those hardware and software solutions that are designed to work in large organizations and are characterized by reliability, availability, and serviceability.

And this triad is not our invention. Back in the 1960s, IBM was already using the abbreviation RAS in its mainframe advertising: Reliability, Availability, Serviceability. Other characteristic properties can be named, but one way or another they are easily reduced to RAS. This is where the main expectations large organizations have of IT are concentrated.

However, reliability, availability, and maintainability are themselves understood in different ways. In particular, there is an opinion that data locality is a mandatory requirement for high availability. But data locality has no strict definition either! The idea of data locality is that data must physically reside "somewhere close" to the place where it is processed, and each vendor implements this in its own way. Curiously, it is the implementation of data locality that raises the most questions, even though in relation to enterprise readiness as a whole this property is not of the second but of the third order of importance. Since it is precisely this property of enterprise-ready systems that generates such fierce disputes, let us look at which approaches to data locality exist in the context of readiness for enterprise-scale loads and operation.

Data locality


First, let us define the concept of data locality and look at examples of how it is implemented in various classes of infrastructure and platform systems.

What level of data locality are we talking about? Locality of data relative to processor sockets? Or the organization of interaction between geographically distributed data centers? Let us agree that we are talking about hyperconverged systems, i.e. x86-based virtualization complexes without external storage systems; today they are becoming the de facto standard for building virtualized infrastructure in large and medium-sized organizations.

A hyperconverged system is, in essence, a set of nodes that run virtual machines. Virtual machines store data on the internal drives of the virtualization hosts. The data locality property implies that each virtual machine writes data to drives located on the same physical node as the virtual machine itself, so as not to overload the network. This data is then copied to other nodes to provide redundancy, but the virtual machine reads from "its own" physical node.


Fig. 1. The hyperconverged system combines the resources of several nodes into a compute pool and aggregates the local storage of the nodes into a single storage pool. The virtual machine VM 1, located on the first node, writes the first replica of its data blocks to local drives.
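As a rough illustration of this write path (a sketch of the general technique, not the implementation of any particular product; all names are hypothetical), inter-node locality amounts to a replica-placement rule: the first replica goes to the node hosting the VM, the remaining replicas go to other nodes for redundancy.

```python
import random

def place_replicas(vm_node: str, cluster_nodes: list[str], replication_factor: int = 2) -> list[str]:
    """Choose nodes for the replicas of a data block written by a VM.

    With inter-node locality, the first replica is placed on the node that
    hosts the VM; the remaining replicas are spread over other nodes.
    """
    if vm_node not in cluster_nodes:
        raise ValueError("VM host is not part of the cluster")
    remote_nodes = [n for n in cluster_nodes if n != vm_node]
    # First replica is local (locality), the rest provide redundancy elsewhere.
    return [vm_node] + random.sample(remote_nodes, replication_factor - 1)

# Example: VM 1 runs on node-1 of a three-node cluster.
print(place_replicas("node-1", ["node-1", "node-2", "node-3"]))
# e.g. ['node-1', 'node-3'] -- local first replica, one remote copy
```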

But what if a node fails? The virtual machine will, of course, be moved to another physical node. But should all of its data "move" with it? The main advantage of doing so is that the virtual machine can again read its data locally rather than over the network. The main disadvantage is that this data transfer loads the network, and, as experience shows, quite significantly. Therefore, in our implementation, which we use in the Skala-R hyperconverged computing platform, we deliberately do not start a storage rebuild with automatic transfer of virtual machine data. In doing so we proceeded not from speculative ideas about what data locality is, but from the actual availability indicators that Skala-R meets.


Fig. 2. When a node fails, the virtual machines running on it, including VM 1, migrate to another node; the first replica of the VM 1 volume is now written to the local drives of the second node, which is its new locality. But is it necessary to automatically rebuild the entire storage pool so that the maximum number of VM 1 data blocks end up on the drives of the second node?

Why do we act this way? Because we believe that IT infrastructure should not be redundant, that its complexity and final cost should be justified, and that its behavior should be predictable. As examples of good practice, consider monitoring and management systems such as HP OpenView, IBM Tivoli, and BMC Patrol: they could perform proactive and corrective actions in certain situations, but by default these features were disabled and the system merely alerted the administrator.

We consider such a policy very reasonable, and the analogy with the behavior policies of hyperconverged systems is direct. Migrating virtual machines from a failed host to other hosts is a natural, predictable action required to ensure high availability. Transferring local data, which inevitably increases the load on the network, should, in our view, be left to the operator's discretion.
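A minimal sketch of this split between automatic and manual actions, assuming a hypothetical cluster-event handler (none of the names below correspond to real Skala-R APIs): failover restarts the VMs elsewhere and notifies the administrator, while the locality rebuild is a separate, operator-triggered action.

```python
from dataclasses import dataclass, field

@dataclass
class ClusterEvent:
    kind: str                                       # e.g. "node_failed"
    node: str                                       # affected node
    vms: list[str] = field(default_factory=list)    # VMs that were running there

def handle_node_failure(event: ClusterEvent, migrate_vm, notify):
    """Automatic part: restart the affected VMs on healthy nodes and tell the admin."""
    for vm in event.vms:
        migrate_vm(vm)  # required for high availability, so done automatically
    notify(f"Node {event.node} failed; {len(event.vms)} VM(s) restarted on other nodes. "
           "A locality rebuild is available but was not started automatically.")

def rebuild_locality(vm: str, start_rebuild):
    """Manual part: the operator explicitly asks to move the VM's data closer to it."""
    start_rebuild(vm)

# Usage with stub callables:
handle_node_failure(ClusterEvent("node_failed", "node-2", ["VM 1"]),
                    migrate_vm=lambda vm: print(f"migrating {vm}"),
                    notify=print)
```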

Inter-node locality in hyperconverged systems


So, is data locality really required for a system operating at enterprise scale, and if so, in which variant of implementation? In the early 2010s, the first hyperconverged systems were designed around independence from network infrastructure performance. The most common network solution in organizations' data centers was Gigabit Ethernet, and for the first systems, such as the pioneers of this market, SimpliVity and Nutanix, inter-node locality was considered a most important feature. These solutions implemented preferred writing to local devices, preferred reading from a local device, and an automatic rebuild of the entire storage during live migration of a virtual machine to another node.

In software-defined storage (SDS), the best effect was achieved when it was used together with virtualization platforms, so that the blocks of a virtual machine's volumes were located on the same nodes where the machine was running, with reads preferring the local device. One of the historically first SDS products to implement inter-node locality was Parallels Storage (now Virtuozzo Storage). It also formed the basis of the software-defined storage of the Skala-R hyperconverged complex (the R-Storage component).

But with the transition to 10-gigabit networks, many manufacturers of hyperconverged systems, such as Maxta, Atlantis, systems based on VMware vSAN, and others, declined to implement inter-node locality. Most existing SDS products, including Microsoft S2D, Dell EMC ScaleIO, Red Hat CephFS, and Red Hat GlusterFS, do not implement inter-node locality, and VMware implements locality in vSAN as a local cache of hot data, denying the need for inter-node locality as such. This is motivated by the low latency of a modern 10-gigabit network and the potential damage to the balance of the storage system when the rules of inter-node locality are observed.
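To illustrate the contrast (a toy sketch of the "hot-data cache" idea, not how vSAN actually works internally), a node can simply keep recently read blocks locally instead of enforcing where the primary replica lives:

```python
from collections import OrderedDict

class LocalReadCache:
    """Toy LRU cache of hot blocks on the node where a VM runs."""

    def __init__(self, capacity_blocks: int, fetch_remote):
        self.capacity = capacity_blocks
        self.fetch_remote = fetch_remote   # callable: block_id -> bytes (over the network)
        self.cache = OrderedDict()

    def read(self, block_id: int) -> bytes:
        if block_id in self.cache:                 # local hit: no network traffic
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        data = self.fetch_remote(block_id)         # miss: read over the network
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)         # evict the least recently used block
        return data
```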

Even Nutanix, which placed special emphasis on inter-node locality in its early implementations, has since 2015 implemented it much more subtly: if the latency of a remote read is lower than that of a local one, the read is served from the remote replica, and a full rebuild of the volume is not performed when a virtual machine migrates ("cold" blocks remain in place; a block that resides on a remote node is relocalized on its first read).
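In rough pseudocode terms (a hedged sketch of the behavior described above, not vendor code), the read path then picks the replica with the lower observed latency and copies a block back to the local node only when it is actually read:

```python
class Replica:
    """Minimal replica handle: a dict of blocks plus presence check and read."""
    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})
    def has(self, block_id):
        return block_id in self.blocks
    def read(self, block_id):
        return self.blocks[block_id]

def read_block(block_id, local, remote, local_latency_ms, remote_latency_ms):
    """Serve a read from whichever replica is currently faster; relocalize lazily."""
    if local.has(block_id) and local_latency_ms <= remote_latency_ms:
        return local.read(block_id)        # local read wins only if it is actually faster
    data = remote.read(block_id)           # otherwise read from the remote replica
    local.blocks[block_id] = data          # lazy relocalization: copy on first read
    return data

# Example: the block is only on the remote replica until it is first read.
local, remote = Replica(), Replica({42: b"payload"})
print(read_block(42, local, remote, local_latency_ms=0.5, remote_latency_ms=0.3))
print(local.has(42))  # True: the block has been relocalized
```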

At the same time, most hyperconverged systems are currently delivered without a network solution at all! For our part, we made a Mellanox networking solution for RoCE, with a bandwidth of 56 Gbit/s and CPU offload functions, an integral part of the Skala-R complex. Duplicated switches provide reliability, and their characteristics give a solid reserve of bandwidth even in scenarios with mass migration of virtual machines; the failure of even an entire switch does not reduce availability.
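To get a feel for the bandwidth headroom (back-of-the-envelope arithmetic with purely illustrative figures, not measurements from Skala-R), compare how long a mass migration would occupy a 10 Gbit/s link versus a 56 Gbit/s one:

```python
def transfer_time_seconds(data_gib: float, link_gbit_s: float, efficiency: float = 0.8) -> float:
    """Rough time to push `data_gib` GiB over a link of `link_gbit_s` Gbit/s.

    `efficiency` accounts for protocol overhead; all figures are illustrative.
    """
    bits = data_gib * 8 * 1024**3
    return bits / (link_gbit_s * 1e9 * efficiency)

# Hypothetical example: live-migrating ten VMs with 32 GiB of RAM each (320 GiB total).
for link in (10, 56):
    print(f"{link} Gbit/s: ~{transfer_time_seconds(320, link):.0f} s")
# ~344 s at 10 Gbit/s vs ~61 s at 56 Gbit/s: the wider link leaves room for storage traffic
```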

As for inter-node locality, it was, as noted, inherited by Skala-R from the Parallels Storage implementation: virtual machine data blocks are preferably written to local devices, and reads are served locally. But the significance of this property for Skala-R is small: the networking solution we use virtually eliminates the network as a performance factor.

Skala-R also implements a locality-based storage rebuild function, but it does not start automatically during live migration of machines. Such "automation" would be easy to implement, but analysis of operating experience did not confirm the usefulness of this approach. For example, in the case of a planned or emergency reboot of one of the nodes (which happens to Skala-R much less frequently than with Nutanix or SimpliVity), which takes 1-2 minutes, an automatic storage rebuild brings no benefit while causing noticeable performance degradation. If the virtual machine stays on the new node after migration, its new data will in any case be written to local devices. Either way, the system administrator always has complete information for deciding whether to migrate machines back, rebuild the storage, or take intermediate measures.
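A hedged sketch of the kind of reasoning an operator (or a tool assisting one) might apply, with made-up thresholds rather than anything taken from Skala-R: start a rebuild only when the expected disruption is far outlasted by the benefit.

```python
def rebuild_is_worthwhile(expected_outage_min: float,
                          data_to_move_gib: float,
                          link_gbit_s: float,
                          min_benefit_ratio: float = 10.0) -> bool:
    """Heuristic: rebuild only if the old layout will stay suboptimal much longer
    than the rebuild itself costs. A 1-2 minute reboot never justifies shuffling
    replicas around; an outage of many hours might. Thresholds are illustrative.
    """
    rebuild_minutes = (data_to_move_gib * 8 * 1024**3) / (link_gbit_s * 1e9) / 60
    return expected_outage_min > min_benefit_ratio * rebuild_minutes

print(rebuild_is_worthwhile(expected_outage_min=2, data_to_move_gib=500, link_gbit_s=56))    # False
print(rebuild_is_worthwhile(expected_outage_min=480, data_to_move_gib=500, link_gbit_s=56))  # True
```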

Conclusion


So how useful is data locality for hyperconverged systems? In the general case, inter-node locality is useful in software-defined storage, because it helps reduce cross-node traffic, lower network load, and improve overall system performance. But an automatic storage rebuild triggered by virtual machine migration is not only unnecessary; with relatively large virtual machines it is outright harmful.

Overall, inter-node locality is not what determines enterprise readiness and enterprise-grade operation (in the RAS sense). It is merely an additional feature, and the higher the performance of the network, the lower its value.

Source: https://habr.com/ru/post/346738/

