
FlashCache: how to use Flash in a storage system, but NOT as an SSD?

[photo: installing a FlashCache module]

Flash memory in modern storage systems has become almost commonplace: the SSD, the Solid-State Disk, is now a familiar part of enterprise storage and server practice. For many, "Flash" and "SSD" have become near-synonyms. However, NetApp would not be NetApp if it had not found its own way to put Flash to use.
How can Flash be used for storage, but NOT as an SSD?


Let me begin with a definition of what an SSD is.
An SSD, or Solid-State Disk (literally a "solid-state disk", one with no moving mechanical parts), is a way of presenting non-volatile Flash memory to the OS as a virtual, emulated hard disk.
The most traditional form of non-volatile storage for an OS and its applications is the hard disk drive, the HDD. Every OS can work with hard disks, so it was quite natural, to simplify things and shorten the "path to the customer's table", not to build Flash-specific support into every OS, but instead to have a driver emulate a pseudo "hard disk" on top of the Flash memory and work with it as an ordinary disk from then on.

The advantages of this approach are simple: it lets you reuse the existing disk drivers in any OS instead of building dedicated Flash support for each one.
The downside: emulation is still emulation. Several peculiarities of Flash are hard to use, or even to account for, when you treat it as a "disk". One example is the rather involved procedure for overwriting data in Flash.

NetApp chose a different way to use Flash: not as an emulated "disk" (storage), but directly as memory.

[image]

It is no secret that a significant share of the data stored on disks is data that is rarely accessed. In the English-language literature it is called "cold data", and by some estimates cold data can make up as much as 80% of the total stored volume. In other words, roughly 80% of an application's stored data is accessed comparatively rarely, which means roughly 80% of the money spent on still very expensive SSDs sits idle and yields no actual return.

Cold data sitting on an SSD could just as well sit on cheap SATA; either way, it is not moving at the moment. Because it occupies space on super-expensive SSDs, the system as a whole gets no faster: the advantage of an SSD shows only at the moment the data is accessed, and data can "just sit there" on any disk. Meanwhile we lose that space for the active data which, placed on an SSD, would actually give a performance boost. "Cold data" is the classic dog in the manger: it does not use the SSD's capabilities itself, and it does not yield them to others in need.

So what is the problem? Identify the "cold" data, separate it out, and keep it in cheaper storage.
The trouble is that at each point in time different data falls into that 80%. The postings in a database for the last quarter will sit untouched and "cold" for the whole of the next quarter, but during preparation of the annual report they will become very "hot" again.

It would be very nice if such movement between "storage tiers" happened automatically. But such a scheme greatly complicates the entire storage architecture: the system must not only receive and deliver data quickly, but also evaluate how active each fragment is, move it somehow, and remember where each fragment lives today so that client requests can be "re-routed" accordingly.

A lot of difficulty, and all because we chose the wrong tool. I have already offered this joking analogy about how hard and unrewarding it is to drive bolts into nuts with a hammer. You can keep taking an ever heavier hammer for the task, or you can simply swap the hammer for a wrench and use the bolts as intended. :)

By using Flash not as storage but as memory, NetApp solved the cold-data problem very elegantly.

I am sure you all know the principle of the RAM cache in modern storage systems.
A cache is an intermediate staging area for data in fast RAM.
When a client requests a block of data from the storage, the block is read from disk and, along the way, lands in the cache.
[image: a block read from disk lands in the cache]

If that block is accessed again, it is served to the client not from the slow disks but from the fast cache. If the block is not accessed for a long time, it is automatically evicted from the cache and its place is taken by a more frequently requested block.
[image: a repeated access is served from the cache]

Thus, we see that the data in the cache is always, “by definition”, 100% hot data!
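The eviction behavior described above is essentially an LRU (least-recently-used) cache. A minimal sketch, purely illustrative and not NetApp's actual implementation:

```python
from collections import OrderedDict

class ReadCache:
    """Minimal LRU read cache: hot blocks stay, cold blocks get evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> data, oldest entry first

    def read(self, block_id, read_from_disk):
        if block_id in self.blocks:            # cache hit: serve from fast RAM
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        data = read_from_disk(block_id)        # cache miss: go to the slow disks
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:   # evict the least recently used block
            self.blocks.popitem(last=False)
        return data

# Usage: block 1 stays "hot", block 2 goes "cold" and is evicted.
cache = ReadCache(capacity=2)
disk = {1: "a", 2: "b", 3: "c"}
cache.read(1, disk.get)
cache.read(2, disk.get)
cache.read(1, disk.get)  # hit: block 1 becomes most recently used
cache.read(3, disk.get)  # evicts block 2, the least recently used
```

By construction, whatever survives in `cache.blocks` is exactly the recently-accessed, "hot" data.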

Now imagine that we use Flash not as a disk but as memory: a sort of "second-level cache", additional cache space. A storage system normally has relatively slow disk space and a much faster RAM cache; alas, we cannot grow the RAM cache indefinitely, if only because it is very expensive. But if we have a capacious Flash memory, slower than DDR DRAM yet far faster than disk, we can attach it to the existing RAM cache as a "second-level cache".

The cache mechanism now works as follows: data lands in the RAM cache (its size, depending on the controller's type and power, is typically 1 to 96 GB) and stays there until displaced by more relevant data. Blocks evicted from the RAM cache are not discarded but demoted into the Flash cache (from 256 GB up to 4 TB, depending on the controller's power, its type, and the number of modules it accepts). There the cached blocks stay until they in turn become "stale" and are displaced by fresher data; only then does a read have to go to the disks. And while the data sits in FlashCache, it is read roughly 10 times faster than from disk!
[image]

[image]

* Note that the images are taken from the FlashCache manual, where it is still called PAM, the Performance Acceleration Module; that is its "old", "internal" name (and "PAM" is not "RAM").
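The two-level scheme is what cache designers call a victim cache: blocks evicted from the small RAM tier drop into the larger Flash tier instead of being discarded. A conceptual sketch with made-up sizes, not NetApp's code:

```python
from collections import OrderedDict

class TwoLevelCache:
    """Blocks evicted from the RAM cache fall into the Flash cache (a victim cache)."""
    def __init__(self, ram_blocks, flash_blocks):
        self.ram = OrderedDict()    # small, fastest tier
        self.flash = OrderedDict()  # bigger; slower than RAM, much faster than disk
        self.ram_blocks = ram_blocks
        self.flash_blocks = flash_blocks
        self.disk_reads = 0

    def read(self, block_id, read_from_disk):
        if block_id in self.ram:        # hit in the RAM cache
            self.ram.move_to_end(block_id)
            return self.ram[block_id]
        if block_id in self.flash:      # hit in the Flash cache: promote back to RAM
            data = self.flash.pop(block_id)
        else:                           # miss everywhere: read from the disks
            self.disk_reads += 1
            data = read_from_disk(block_id)
        self._put_ram(block_id, data)
        return data

    def _put_ram(self, block_id, data):
        self.ram[block_id] = data
        if len(self.ram) > self.ram_blocks:        # evicted from RAM -> demoted to Flash
            old_id, old_data = self.ram.popitem(last=False)
            self.flash[old_id] = old_data
            if len(self.flash) > self.flash_blocks:
                self.flash.popitem(last=False)     # only now is the block truly gone

# A block pushed out of RAM is still served from Flash, not from disk.
cache = TwoLevelCache(ram_blocks=1, flash_blocks=4)
disk = {n: f"data-{n}" for n in range(10)}
cache.read(1, disk.get)  # disk read; block 1 now in RAM
cache.read(2, disk.get)  # disk read; block 1 demoted to Flash
cache.read(1, disk.get)  # served from Flash: no extra disk read
```

The tier sizes here (1 RAM block, 4 Flash blocks) are toy values; the point is that a demoted block gets a second life before a read must fall through to the disks.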

So what are the results of using FlashCache? How well does the claimed effectiveness show in practice?
To compare performance, NetApp participates in the standard and widely accepted industry benchmark SPECsfs. It tests "network file systems", i.e. NAS over the NFS and CIFS protocols, and its results are easy to extrapolate to other workloads.

As the reference system, take the midrange FAS3140 with 224 FC disks (16 shelves of 14 disks each: a full 48U cabinet of FC disks!).
It is no secret that in a high-performance I/O system many disks are often installed out of sheer necessity (we usually speak of "disk spindles"). Very often it is IOPS performance, not storage capacity, that dictates the number of disks: many disks in such a system are a forced measure to ensure high I/O performance.
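A back-of-the-envelope calculation shows why IOPS, not capacity, so often dictates the spindle count. The per-disk figures below are common rules of thumb for 15K RPM FC drives, not numbers from the benchmark:

```python
# Rough sizing: how many disks does a workload force us to buy?
disk_iops = 200           # rule-of-thumb IOPS for one 15K RPM FC disk (assumption)
disk_capacity_gb = 300    # assumed per-disk capacity

required_iops = 40_000
required_capacity_gb = 5_000

# Ceiling division without importing math:
spindles_for_iops = -(-required_iops // disk_iops)                    # 200 disks
spindles_for_capacity = -(-required_capacity_gb // disk_capacity_gb)  # 17 disks

# IOPS, not capacity, dictates the disk count here.
print(max(spindles_for_iops, spindles_for_capacity))  # 200
```

With these assumptions the workload needs only 17 disks' worth of capacity but 200 disks' worth of spindles, which is precisely the gap a large read cache can close.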

You can see that such a system scored 55,476 benchmark "parrots" of I/O performance (ops/sec) on the SPECsfs2008 CIFS test.

Now take the same system, this time with a FlashCache board inside, but with only 56 of the same FC disks (just 4 shelves, four times fewer of the disk "spindles" that usually determine IOPS performance).
We see that a system that is 54% cheaper (counting disks, shelves, plus FlashCache) achieves the same level of "parrot" performance with noticeably lower latency, and, thanks to the smaller number of disks, a 67% improvement in power consumption and rack space.

[image]

But that is not all. The third system uses 96 SATA disks, usually noted for their low IOPS, instead of FC disks, again with FlashCache.
Even with relatively "slow" 1 TB SATA disks, this system shows the very same performance figures while providing 50% more usable capacity than the FC system without FlashCache and 66% lower power consumption, at roughly the same latency.

The bottom line is so convincing that NetApp sold petabytes of Flash in FlashCache to its customers within just six months of the product's announcement, and as far as I know sales keep growing (NetApp even claims to be the second-largest seller of Flash memory in the world, after Apple with its iPod/iPad/iPhone).

But FlashCache would not work so effectively were it not for (once again!) the capabilities of the underlying WAFL structure. Remember, when talking about WAFL, and later about various "tricks" of NetApp storage systems, I have repeated several times that WAFL is the very foundation of everything NetApp systems can do.
What "played" in this case?

Those seriously and deeply involved with SSDs in production applications already know one extremely unpleasant property of Flash as a technology: very poor random-write performance.
Almost always, when SSD performance comes up, manufacturers show tremendous random-read results but, one way or another, sidestep the question of small-block random-write performance; in a typical workload, that is about 30% of all operations.

Here are the IOPS figures for the popular high-performance enterprise-class FusionIO SSD:
[image: FusionIO IOPS figures]

And here is the IOPS table for the popular Intel X25-M, "hidden" in the partners-only documentation. Pay attention to the highlighted and underlined figures.
[image: Intel X25-M IOPS table]

Such "surprises" turn up in almost any SSD from any manufacturer.
The reason is the complicated write procedure of Flash cells, which are organized in blocks: to (re)write even a small amount of data, the entire Flash block containing it must be fully erased and written anew. That operation is slow in itself, even before accounting for the limited number of such erase cycles.
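The erase-before-write behavior is easy to model: even a tiny random overwrite pays for rewriting the whole erase block. A toy simulation with illustrative (not vendor-specific) page and block sizes:

```python
# Toy model of NAND Flash: pages can only be rewritten after their whole
# erase block is erased, so a small random overwrite costs an erase plus
# a rewrite of every page in that block.
PAGE_SIZE = 4 * 1024      # 4 KiB page (illustrative assumption)
PAGES_PER_BLOCK = 64      # 256 KiB erase block (illustrative assumption)

def cost_of_overwrite(pages_changed):
    """Bytes physically rewritten to change `pages_changed` pages in one block."""
    # Naive in-place update: erase the block, then write back all of its pages.
    return PAGES_PER_BLOCK * PAGE_SIZE

def write_amplification(pages_changed):
    """Ratio of bytes physically written to bytes logically changed."""
    logical = pages_changed * PAGE_SIZE
    return cost_of_overwrite(pages_changed) / logical

print(write_amplification(1))   # 64.0: one 4 KiB random write rewrites 256 KiB
print(write_amplification(64))  # 1.0: full-block sequential writes are efficient
```

This is why sequential writes (which fill whole blocks) suit Flash well while small random writes do not, and it also burns through the limited erase-cycle budget faster.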

How does WAFL "play" here?
The point is that in NetApp systems WAFL is "write-optimized": as I have said before, writes go out in long sequential "stripes". Because this scheme makes it unnecessary to hold writes in the cache, sending them straight to the disks at the maximum speed those disks allow, the cache in NetApp systems is practically not used for writes. Accordingly, FlashCache is not used for writes either; the algorithmic part is greatly simplified, and the problem of Flash's poor write performance simply disappears. Flash here is not used for writing at all: writes pass through the RAM buffer straight to the disks, so there is no need to keep them in the cache. We use Flash only in the way that is most efficient for it, random reads (occasionally refilling it with new data in place of evicted data via sequential writes, which Flash also handles well).
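The trick of turning scattered writes into long sequential stripes can be sketched as a log-structured layout. This is a conceptual model of the idea, not WAFL itself:

```python
class LogStructuredStore:
    """Random logical writes become sequential appends; a map tracks locations."""
    def __init__(self):
        self.log = []        # physical medium: append-only, always sequential
        self.block_map = {}  # logical block id -> position in the log

    def write(self, block_id, data):
        # Never overwrite in place: always append at the tail of the log.
        self.block_map[block_id] = len(self.log)
        self.log.append(data)

    def read(self, block_id):
        return self.log[self.block_map[block_id]]

store = LogStructuredStore()
store.write(7, "old")
store.write(3, "x")
store.write(7, "new")  # a "random" overwrite of block 7...
print(store.read(7))   # ...is visible to readers as the new data,
print(store.log)       # yet the medium only ever saw sequential appends
```

The logical block map gives clients ordinary overwrite semantics while the medium sees only the access pattern it handles best.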

Thus it is easy to see that by using Flash "as memory", in the form of a cache, we automatically keep only active, hot data in it, the data whose faster access actually brings a benefit, and we sidestep the problems of Flash's poor write performance and limited write endurance.
Yes, FlashCache is not cheap either. But you are guaranteed to put all of the money spent on it to use, not just 20% of it. And that is exactly what NetApp calls "storage efficiency".

You can read more about FlashCache, in Russian, in the technical library of Netwell, which publishes Russian translations of the official NetApp technical manuals.

The idea of using Flash memory in a more efficient and "direct" way than simply emulating "hard disks" on it is, by my observation, spreading through the industry: Adaptec with its MaxIQ, CacheZilla in ZFS, and Microsoft Research, which published an interesting paper, "Speeding Up Cloud/Server Applications Using Flash Memory" (brief presentation).

The photo at the beginning of the article shows the installation of FlashCache (for some time it was called PAM-II, Performance Acceleration Module, Generation 2; there was also a PAM-I, built on DRAM and used to cache NAS metadata).

Source: https://habr.com/ru/post/115345/

