How hypervisor-level SSD caching works in the VMware cloud

With the release of VMware vSphere 5.1, VMware announced several new initiatives in virtual machine data storage, including a distributed cache built on SSDs installed locally in ESXi servers. The technology had the working name vFlash and was in the Tech Preview stage; it later became a full feature of the VMware vSphere 5.5 platform under the name vSphere Flash Read Cache (vFRC). Today it is a fully working tool that can be used for tasks of various levels.

vFRC is designed to make client applications work more efficiently with the disk subsystem by caching the data they read, which significantly increases the performance of read-intensive applications.

Flash Read Cache works at the hypervisor level and significantly improves the performance of virtual machines that make heavy use of the I/O system for read operations. PCIe flash cards and SAS/SATA SSDs installed locally in the host can be used for caching. The devices are combined into a flash pool, from which space for data caching is allocated to the VMDK disks of virtual machines.

Recall that VMDK (Virtual Machine Disk) is a file format developed by VMware for use as a disk image in its virtual machines.

In terms of performance, the SSD cache sits between RAM and regular disks.

vFRC Architecture Overview

The vFRC architecture works as follows:

When a read request arrives at a VMDK disk with vFRC enabled, ESXi first checks whether the requested data is present in vFlash.

If it is, the virtual machine receives the data from the cache. This event is called a “hit” (vFRC hit).
If it is not, ESXi reads the data from the VMDK disk and returns it to the virtual machine, writing it to the cache at the same time. This event is called a “miss” (vFRC miss).

When a write request arrives, the data is written to the VMDK disk and, asynchronously, to the cache.
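
The hit and miss logic described above can be illustrated with a toy model. This is only a sketch, not VMware's implementation: the `ReadCache` class, the LRU eviction policy and the synchronous cache update on write are assumptions made to keep the example short.

```python
from collections import OrderedDict


class ReadCache:
    """Toy read cache in front of a slower disk (LRU eviction is an assumption)."""

    def __init__(self, backing, capacity_blocks):
        self.backing = backing              # dict: block number -> data, stands in for the VMDK
        self.capacity = capacity_blocks
        self.cache = OrderedDict()          # stands in for the flash cache
        self.hits = 0
        self.misses = 0

    def _store(self, block, data):
        self.cache[block] = data
        self.cache.move_to_end(block)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used block

    def read(self, block):
        if block in self.cache:             # hit: serve the data from the cache
            self.hits += 1
            self.cache.move_to_end(block)
            return self.cache[block]
        self.misses += 1                    # miss: read the disk, then populate the cache
        data = self.backing[block]
        self._store(block, data)
        return data

    def write(self, block, data):
        self.backing[block] = data          # the disk is always written
        self._store(block, data)            # cache update (done synchronously here for simplicity)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


disk = {n: f"data-{n}" for n in range(8)}
vfrc = ReadCache(disk, capacity_blocks=4)
for block in [0, 1, 0, 2, 0, 1]:
    vfrc.read(block)
vfrc.write(3, "new-3")
print(vfrc.read(3), vfrc.hits, vfrc.misses)  # new-3 4 3
```

The first access to each block is a miss, repeated accesses are hits, and a block that has just been written is served from the cache on the next read.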

What is needed for vFRC?

In order to activate the vFRC functionality, the following conditions must be met:

Have a configured host with at least one SSD or PCIe SSD.
Use vSphere 5.5 (vCenter 5.5 and ESXi 5.5).

How is vFRC enabled?

After the device is physically connected to the ESXi server, it must be added to the vSphere Flash Infrastructure layer. You can do this on the Virtual Flash Resource Management tab in the host settings.

[Screenshot: vFRC, Virtual Flash Resource Management tab]

To enable vFRC for a virtual machine, use the Virtual Flash Read Cache item in the hard disk parameters, where you can specify the amount of space allocated for caching and the block size. Choose the vFRC block size to match the size of the blocks in which the application performs its disk I/O. Block-size statistics for each disk can be collected with the vscsiStats utility on ESXi.
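
One way to turn block statistics into a block size is to take the most frequent I/O size and pick the nearest size the cache supports. The sketch below assumes the histogram has already been copied by hand into a dictionary; the sample numbers, the `pick_block_size_kb` helper and the list of allowed sizes (powers of two from 4 KB to 1024 KB) are illustrative assumptions, not output of vscsiStats.

```python
def pick_block_size_kb(histogram, allowed_kb=(4, 8, 16, 32, 64, 128, 256, 512, 1024)):
    """Return the allowed cache block size (KB) closest to the most frequent I/O size.

    histogram maps an I/O size in bytes to the number of I/Os observed at that size.
    """
    dominant_bytes = max(histogram, key=histogram.get)
    dominant_kb = dominant_bytes / 1024
    return min(allowed_kb, key=lambda kb: abs(kb - dominant_kb))


# Hypothetical histogram: most I/Os are 4 KB, as in a 4 KB random-read workload.
sample = {512: 120, 4096: 9800, 8192: 2100, 65536: 300}
print(pick_block_size_kb(sample))  # 4
```

A workload with several equally common I/O sizes would need a more careful choice than a single dominant value.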

[Screenshot: vFRC settings in the virtual disk parameters]

Configuration Features

When configuring vSphere Flash Read Cache, keep the following requirements and limits in mind:

The host server must run VMware ESXi 5.5 with an Enterprise Plus license.
The vFRC is configured and managed only through the vSphere Web Client, so VMware vCenter Server is required.
The maximum cache size for one virtual disk is 400 GB.
The maximum cache size per host is 2 TB.
The maximum size of a virtual disk is 16 TB.
The maximum number of SSD drives used for cache is 8.
The virtual machine's Hardware Version must be upgraded to version 10.
The cache must be configured manually for each virtual disk; the minimum size is 1 GB.
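
The limits listed above are easy to check before touching the Web Client. The following is a minimal sketch: the `check_vfrc_plan` helper and its input format are hypothetical, 1 TB is taken as 1024 GB, and the SSD counts in the two calls are made-up examples.

```python
# Limits as listed in this article for vSphere 5.5 (sizes in GB; 1 TB taken as 1024 GB).
MIN_CACHE_GB = 1
MAX_CACHE_PER_VMDK_GB = 400
MAX_CACHE_PER_HOST_GB = 2 * 1024
MAX_VMDK_SIZE_GB = 16 * 1024
MAX_SSD_DEVICES = 8
MIN_HW_VERSION = 10


def check_vfrc_plan(ssd_count, hw_version, disks):
    """Return a list of problems found in a planned vFRC setup for one host.

    disks is a list of (vmdk_size_gb, cache_gb) tuples, one per cached virtual disk.
    """
    problems = []
    if not 1 <= ssd_count <= MAX_SSD_DEVICES:
        problems.append(f"SSD count {ssd_count} is outside 1..{MAX_SSD_DEVICES}")
    if hw_version < MIN_HW_VERSION:
        problems.append(f"hardware version {hw_version} is below {MIN_HW_VERSION}")
    for index, (vmdk_gb, cache_gb) in enumerate(disks):
        if vmdk_gb > MAX_VMDK_SIZE_GB:
            problems.append(f"disk {index}: VMDK of {vmdk_gb} GB exceeds {MAX_VMDK_SIZE_GB} GB")
        if not MIN_CACHE_GB <= cache_gb <= MAX_CACHE_PER_VMDK_GB:
            problems.append(
                f"disk {index}: cache of {cache_gb} GB is outside "
                f"{MIN_CACHE_GB}..{MAX_CACHE_PER_VMDK_GB} GB"
            )
    total_cache = sum(cache_gb for _, cache_gb in disks)
    if total_cache > MAX_CACHE_PER_HOST_GB:
        problems.append(f"total cache of {total_cache} GB exceeds {MAX_CACHE_PER_HOST_GB} GB per host")
    return problems


# One 100 GB data disk with a 100 GB cache passes every check.
print(check_vfrc_plan(ssd_count=2, hw_version=10, disks=[(100, 100)]))  # []
# Too many SSDs, old hardware version and an oversized cache: three problems.
print(len(check_vfrc_plan(ssd_count=9, hw_version=9, disks=[(100, 500)])))  # 3
```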

vFRC in practice at IT-GRAD

Now a little from our own experience. How did we test the SSD cache in the VMware cloud, and what pitfalls did we run into?

Before offering any solution to customers, we need to be sure that it is fully functional and performs well. That means testing it and confirming that the result matches the stated capabilities and meets the customer's requirements. A new product that no one has tested or deployed yet deserves particular attention: it is no secret that anything new may hide flaws, bugs and other unpleasant surprises.

As soon as VMware announced the option of using a distributed cache on SSDs installed locally in ESXi servers, we decided to test it. Because the technology was still a Tech Preview before the release of vSphere 5.5, we wanted to test the finalized version. Our task was to test how vFRC behaves on a purpose-built test bench.

For testing, the SSDs were connected to a Dell PERC H710P RAID controller. We created one RAID-0 group per SSD, so each group contains a single disk.

[Screenshot: SSD RAID groups on the Dell PERC H710P]

Since the Dell PERC H710P RAID controller cannot report the physical type of the media connected to it, we had to mark the disks presented to ESXi as SSDs manually. To do this, run the esxcli command:

[Screenshot: esxcli command]

After the command was run, the “Is SSD” flag in the device parameters changed to “True”:

[Screenshot: “Is SSD” flag in the device parameters]
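
The exact command from the original screenshot has not survived. The sketch below shows the commonly documented way to tag a local device as SSD on ESXi 5.x: add a SATP claim rule with the `enable_ssd` option, reclaim the device, then list it to check the flag. The device identifier is a placeholder, and the choice of `VMW_SATP_LOCAL` assumes that this plugin claims the device. The script only builds and prints the command lines; it does not run them.

```python
import shlex


def ssd_tagging_commands(device_id, satp="VMW_SATP_LOCAL"):
    """Build (but do not run) esxcli commands that tag a device as SSD and verify it."""
    return [
        # Add a claim rule that forces the SSD flag for this device.
        ["esxcli", "storage", "nmp", "satp", "rule", "add",
         "--satp", satp, "--device", device_id, "--option", "enable_ssd"],
        # Reclaim the device so that the new rule takes effect.
        ["esxcli", "storage", "core", "claiming", "reclaim", "--device", device_id],
        # Show the device parameters, including the "Is SSD" field.
        ["esxcli", "storage", "core", "device", "list", "--device", device_id],
    ]


# "naa.EXAMPLE0000000001" is a hypothetical identifier; use the real one from the host.
commands = ssd_tagging_commands("naa.EXAMPLE0000000001")
for cmd in commands:
    print(shlex.join(cmd))
```

The printed lines can be pasted into an ESXi shell session once the device identifier and SATP name have been checked against the host.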

We then added the devices to the vSphere Flash Infrastructure layer. This was done in the host settings through the Virtual Flash Resource Management option:

[Screenshot: vFRC, Virtual Flash Resource Management]

To test the SSD cache, we had prepared in advance a test bench of virtual machines running Windows Server 2008 R2 x64, each with two 100 GB VMDK disks:

VMDK1 for the OS,
VMDK2 for the data.

Next, in the VMDK2 hard disk parameters of the virtual machines, we enabled vFRC, allocated 100 GB for the cache and set the block size to 4 KB.

[Screenshot: vFRC settings for the VMDK2 disk]

The basic configuration steps were complete. The next task was to check that the newly enabled feature actually worked. However, when the virtual machines were started, one of them simply refused to boot: instead of the welcome screen, a “blue screen” appeared with the following contents:

[Screenshot: “blue screen” of the virtual machine]

The other virtual machines showed no obvious problems.
We then decided to use the tools designed for monitoring the SSD cache and compare their results. First, in one of the virtual machines, we launched the FIO utility to generate the required amount of data on the VMDK2 disk, which, as mentioned earlier, was the one set aside for data. FIO can work in various modes; we were interested in random reads, so we ran it in rand-read mode.

Note: More information about the FIO utility can be found at http://freecode.com/projects/fio.

The FIO utility uses a job file (or, more simply, a configuration file) that holds the test parameters. The utility performs read operations on randomly generated data on the VMDK2 disk. The job file fixes the read block size (4 KB in our case). We then started the random read operation. The test ran for 6 hours and 46 minutes.
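
The job file actually used in the test is not reproduced here, so the following is only a sketch of what a comparable one might look like. Only the random-read mode and the 4 KB block size come from the description above; the job name, I/O engine, queue depth, file size and file name are assumptions. The small `fio_job` helper is hypothetical and simply renders the options in fio's ini-style format.

```python
def fio_job(name, **options):
    """Render a single fio job section as ini-style text."""
    lines = [f"[{name}]"]
    for key, value in options.items():
        # Flags without a value are written as a bare key.
        lines.append(key if value is True else f"{key}={value}")
    return "\n".join(lines) + "\n"


job_text = fio_job(
    "vfrc-randread",          # hypothetical job name
    rw="randread",            # random reads, as in the test described above
    bs="4k",                  # 4 KB read block size, as in the test described above
    direct=1,                 # assumption: bypass the guest OS file cache
    ioengine="windowsaio",    # assumption: native async I/O engine for a Windows guest
    iodepth=16,               # assumption
    size="10g",               # assumption: size of the test file
    filename="fio-test.dat",  # assumption: a file placed on the data disk
)
print(job_text)
```

Saved to a file, the text can be passed to fio as its job file argument.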

[Screenshot: FIO]

The question we wanted answered was: did the data being read end up in the cache and, if so, what was the hit percentage?
To find out, we looked at the virtual disk performance chart of the machine in the vSphere Web Client.

We were interested in the following counters: the average number of I/O operations per second, the read latency and the counter that reports cache usage. The last one was somewhat disappointing, showing that only a small share of the data came from the cache. With an average of 18,689.328 I/O operations per second, the value for cached data was 4,439.389, a hit rate of just under 24%. Judging by these statistics, the cache could be considered practically non-functional.

Since the standard tool did not show the expected results, we turned to another one: the esxcli command. It also reports statistics for a specific VMDK disk with the vFRC option enabled. We ran the command with the following parameters:
~ # esxcli storage vflash cache stats get

[Screenshot: esxcli cache statistics output]

This figure shows the cache hit rate as a percentage, the so-called vFRC hit: the share of data that the virtual machine reads from the cache. We had to run the command several times, because each run produced completely different results. Some runs suggested that the cache was not working at all, as in the first case, while others showed it working, with a hit rate of 96%.

We did not stop there and used another utility, esxtop (with the interactive command “u”, the disk device view), to display cache usage statistics. According to the information on the screen, reads were served directly from the cache. Given that the average number of I/O operations per second was 18,689.328 and the number of read operations served from the SSD cache was 18,184.03, the hit rate was approximately 97%.
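
The two hit rates can be recomputed from the counters quoted in this article. The sketch below assumes that the performance-chart figures use the same decimal placement as the esxtop figures; `hit_percent` is a helper introduced only for this illustration.

```python
def hit_percent(cache_iops, total_iops):
    """Share of read operations served from the cache, in percent."""
    return 100.0 * cache_iops / total_iops


# Web Client performance chart: cached operations vs. all operations per second.
chart = hit_percent(4439.389, 18689.328)
# esxtop: read operations served from the SSD cache vs. all operations per second.
esxtop = hit_percent(18184.03, 18689.328)

print(f"{chart:.1f}% {esxtop:.1f}%")  # 23.8% 97.3%
```

The same workload therefore yields about 24% according to one tool and about 97% according to another, which is the discrepancy discussed in the rest of this section.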

[Screenshot: esxtop]

The test results did not fully meet our expectations, so as a major service provider and VMware partner we turned to our colleagues on the vendor side for help.

VMware has extensive experience of working with its customers and partners. When bugs, bottlenecks or other functional issues are found in a product, the developers make every effort to fix them.

As a result, VMware ESXi 5.5 Update 2 was released in the autumn of 2014. It fixes the blue screen problem described above for virtual machines running Windows Server 2008 R2 x64.

Naturally, the update interested us. We decided to test it by installing it on the same test bench with vFRC enabled. The result? Every virtual machine started without exception. We marked this test as passed and moved on to the counter readings. Just as at the very beginning of testing, we launched FIO in rand-read mode with the same job file and started the random read operation. For the most part the counters showed plausible statistics, and only occasionally reported incorrect values. In other words, VMware ESXi 5.5 Update 2 did not fix the problem with the display of vFRC statistics. Despite this bug, further practical use has shown that vSphere Flash Read Cache significantly improves virtual machine performance by reducing latency.

After further tests, we began rolling out SSD caching on hosts in the production environment. Today, several projects that use vSphere Flash Read Cache are running successfully at our sites for customers who are particularly demanding about performance. They, in turn, are pleased with how much faster their systems and applications have become.
You can read about other caching mechanisms in the article “SSD caching in the VMware cloud” in the first corporate IaaS blog.

Source: https://habr.com/ru/post/250499/
