
EMC XtremIO Flash Array: A Brief Summary

Writing about flash arrays is a thankless business: by now only the lazy haven't done it. Still, we decided to take the chance and write about our XtremIO array, because it genuinely stands out. And we will tell you not boring marketing stories about flash arrays, but interesting technical details.



Introduction


Flash arrays (the term usually refers to storage systems that hold nothing but flash drives) are a popular topic right now. A lot is written about them, but the discussion usually stops at performance (response time and IOPS). That is only one side of the coin: in practice, very few people really need hundreds of thousands of IOPS. We developed XtremIO from scratch not for the sake of laboratory IOPS (well, not only for their sake); with raw IOPS everything is fairly simple. The interesting part is elsewhere. We tried to build a flash array that makes the most of the technological properties of flash memory, enabling functionality and a usage model that were previously out of reach.

Historical context


XtremIO is a new device; the idea of creating it appeared in 2009. What did the flash array market look like back then? On the one hand, it had finally become clear to everyone that the future lay in flash memory and that it was worth investing in. The "first wave" companies were already on the market, relying on flash modules and controllers of their own design. At the same time, an understanding began to form that the market demanded, and technology allowed, something more interesting than just fast but primitive storage. More versatile systems were needed: deduplication, advanced functionality, and so on. And you won't get far there with a simple "box of hardware" with RAID. This is how the idea of the flash array that led to the creation of XtremIO was born.
What features are inherent in it?

Let's look at them in more detail.

Hardware


The base unit consists of two controllers (in fact, these are servers of standard architecture), a shelf with SSDs, and UPS batteries.

At first glance, a strange construction. But beyond standardization, there is a clear logic behind this design. A server can accommodate a large amount of memory (256 or 512 GB) and two multi-core processors with a rather large thermal envelope, which is not always possible in a custom controller enclosure.

A UPS is used because NVRAM loses to RAM in access time, and, as we will see later, that is critical for us. The result is a maximally unified design with a huge amount of fast memory and plenty of cores. And for the next generation it is enough to take newer servers with even more memory and cores: that is the beauty of a standard platform.

Such cubes of two controllers plus an SSD shelf are called X-Bricks; they can connect to each other via InfiniBand and form a cluster (from 1 to 6 "bricks" with a linear increase in performance). We will talk about this now.

Scale out


XtremIO was designed from the start for a clustered (Scale-Out) architecture. What does that mean? There can be many controllers in the array, all of them are equal, and all provide access to the same data pool. The load is automatically distributed evenly between them.

What does this give compared with the usual two-"head" approach? A great deal: no performance degradation when a controller fails, easy growth by adding "cubes", and the ability to manage an array of any scale as a single entity.

This is done as follows. The software functionality is divided into modules, each responsible for its own part of I/O processing. Having done its part of the work, a module passes the request further down the chain.

Put very simply, each module owns its own "area of responsibility" in the I/O path.

Each module also runs many threads internally.

Such modules "live" on all cluster controllers and communicate with each other over a very fast InfiniBand network with RDMA. That is, modules from different controllers can take part in processing a single request, which allows the load to be balanced within the cluster, as the sketch below illustrates.
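As a rough illustration of the idea (not EMC's actual code or module names), here is a minimal Python sketch in which a content fingerprint decides which controller's module handles a given block, so work spreads evenly across the cluster no matter which controller accepted the request:

```python
import hashlib

# Hypothetical illustration only: a content fingerprint decides which
# controller's module owns a block, so requests spread evenly across the
# cluster. Controller names and the SHA-1 fingerprint are assumptions.

CONTROLLERS = ["X1-SC1", "X1-SC2", "X2-SC1", "X2-SC2"]  # two X-Bricks = four controllers

def fingerprint(block: bytes) -> str:
    """Content fingerprint of a data block (SHA-1 used purely as an example)."""
    return hashlib.sha1(block).hexdigest()

def owner_controller(fp: str) -> str:
    """Pick the controller whose module will handle this fingerprint."""
    return CONTROLLERS[int(fp, 16) % len(CONTROLLERS)]

# A front-end module on any controller can accept the write, compute the
# fingerprint, and hand the block over the RDMA interconnect to the owning
# module on another controller.
block = b"\x00" * 4096
fp = fingerprint(block)
print(fp[:8], "->", owner_controller(fp))
```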

And this is another important difference from many software stacks of traditional storage systems, which in principle were never designed for Scale-Out. This capability has to be built into the architecture from the very beginning; "bolting it on" later will not work.

Metadata


Mysterious metadata, hashes, and a large amount of memory have already been mentioned several times. Here is what this is about. To make an easy-to-manage, balanced, clustered array, the physical location of data on the SSDs must be decoupled from their logical addresses. Moreover, for an SSD the order of data placement does not matter: it is ideal for random load.

To do this, XtremIO uses the following scheme. There are two metadata tables. The first stores the correspondence between a logical block (for example, the block with LBA 12345 on LUN 1) and its hash. The second stores the correspondence between the hash and the address of its physical location (roughly speaking, block 6789 on SSD number 5 in the second "brick").

That is, the physical location of a block is determined by its content (its hash), not by its logical address.

Another pleasant consequence is that thin provisioning is always in effect in the array. As long as nothing has been written to a logical block, no entries are created for it in the metadata tables and no physical block is allocated.
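To make the scheme concrete, here is a toy Python model of these two tables and of the thin-provisioning behavior; the fingerprint function, block locations, and allocator are purely illustrative, not the real on-array structures:

```python
import hashlib

# A toy model of the two metadata tables described above (not EMC code):
#   logical_map:  (LUN, LBA)  -> content fingerprint
#   physical_map: fingerprint -> physical location on some SSD
logical_map = {}
physical_map = {}
flash = {}          # stand-in for the SSDs themselves: location -> data
_next_slot = 0      # naive allocator, just for the sketch

def write_block(lun, lba, data):
    global _next_slot
    fp = hashlib.sha1(data).hexdigest()
    if fp not in physical_map:                # new content: allocate a physical block
        loc = ("brick-1", "ssd-5", _next_slot)
        _next_slot += 1
        physical_map[fp] = loc
        flash[loc] = data
    logical_map[(lun, lba)] = fp              # the logical address only references the fingerprint

def read_block(lun, lba):
    fp = logical_map.get((lun, lba))
    if fp is None:                            # thin provisioning: never written, nothing allocated
        return b"\x00" * 4096
    return flash[physical_map[fp]]

write_block(lun=1, lba=12345, data=b"A" * 4096)
assert read_block(1, 12345) == b"A" * 4096
assert read_block(1, 99999) == b"\x00" * 4096   # unwritten block costs no metadata and no flash
```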

All these tables are stored and processed entirely in RAM, since delays must be kept to a minimum. Naturally, a copy is kept on SSD in case of failure.

With such an architecture laid down, you get many advantages that are impossible in principle with traditional storage systems, and the most important of them, as many have already guessed, is inline deduplication.

Deduplication


Deduplication (that is, finding identical data blocks and storing only one copy of them) works well in many cases. Imagine a virtual farm where the same OS occupies the lion's share of space inside every virtual machine. Many arrays have deduplication, but usually as post-processing: after the data has been written, a special process walks over it and removes duplicates. This works well for archival tasks but is completely unsuitable where the highest performance is required.

In the case of XtremIO, we already have a mechanism for computing hashes and the ability to point several logical addresses at one physical block. And all of this happens before the data is written to the disks.

That is, we use inline deduplication, on the fly. It does not reduce performance but increases it: the amount of data written to flash goes down, so performance is no longer limited by expensive SSDs but by cheap and fast processors.
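Continuing the toy model (again, an illustration rather than the actual implementation), inline deduplication falls out of the fingerprint tables almost for free once reference counts are added:

```python
import hashlib
from collections import defaultdict

# Toy model: reference counts track how many logical blocks point at each
# stored copy; only genuinely new content ever reaches the flash.
logical_map, physical_map, refcount = {}, {}, defaultdict(int)
writes_to_flash = 0

def write_block(lun, lba, data):
    global writes_to_flash
    fp = hashlib.sha1(data).hexdigest()
    old_fp = logical_map.get((lun, lba))
    if old_fp is not None:
        refcount[old_fp] -= 1              # this logical block now points elsewhere
    if fp not in physical_map:             # duplicate content: skip the flash write entirely
        physical_map[fp] = ("somewhere", len(physical_map))
        writes_to_flash += 1
    refcount[fp] += 1
    logical_map[(lun, lba)] = fp

# Ten identical OS blocks on ten LUNs: one physical write instead of ten.
for lun in range(10):
    write_block(lun, lba=0, data=b"same OS block" + b"\x00" * 4083)
print(writes_to_flash)          # 1
print(refcount)                 # the single fingerprint carries 10 references
```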

Snapshots


A similar story with snapshots. In our case a snapshot is essentially the same thing as the LUN itself: the same set of logical pointers to physical blocks from a single pool. Snapshots can be writable, they form a hierarchy yet are not tied to one another, and there is no need to reserve special space for storing changes. Using snapshots creates no additional load on the system.
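In terms of the same toy model, taking a snapshot amounts to duplicating logical pointers and bumping reference counts; no data moves. A hypothetical sketch:

```python
from collections import defaultdict

# Toy model: a snapshot is just another set of logical pointers into the same
# pool of fingerprinted blocks, so creating one touches metadata only.

def create_snapshot(logical_map, refcount, source_lun, snap_lun):
    """Clone the logical pointers of source_lun under a new LUN id."""
    for (lun, lba), fp in list(logical_map.items()):
        if lun == source_lun:
            logical_map[(snap_lun, lba)] = fp   # new pointer, same physical block
            refcount[fp] += 1                   # one more logical reference

logical_map = {(1, 0): "fp-aa", (1, 1): "fp-bb"}        # LUN 1 with two written blocks
refcount = defaultdict(int, {"fp-aa": 1, "fp-bb": 1})
create_snapshot(logical_map, refcount, source_lun=1, snap_lun=100)
print(logical_map[(100, 0)])    # 'fp-aa' -- shares the physical block with LUN 1

# The snapshot is immediately writable: a later write to (100, lba) simply
# repoints that one logical address, leaving LUN 1 untouched; no dedicated
# snapshot space is reserved anywhere.
```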

XDP


And the last point. Another piece of know-how, impossible without flash: our data protection scheme. How is data usually protected in conventional storage systems (remember, we are talking about tasks that demand high performance, including non-sequential access with small blocks)? RAID 10: data is mirrored and striped across disks. Fast, but too expensive with SSDs, because of the double storage redundancy. It would be great to use a wide RAID 6, but it is poorly suited to "random" loads because of the high overhead of partial stripe updates. For example, every 4K write requires three reads and three writes to update the stripe correctly, which kills performance.

So you need to write only full stripes. But we also have two trump cards: the "innate" ability of an SSD to handle a completely "random" load perfectly, and an architecture that decouples the physical placement of data from their logical addresses. Hence the following idea: let's form a set of wide double-parity stripes in the array (in reality 23 + 2, that is, k = 23), accumulate data on writes (even data belonging to different LUNs), and then write it out in a "bundle" into the least-filled stripe. The result is space savings (a usable capacity ratio of around 80%), high performance (minimal extra load from partial updates), and good performance even when the storage is nearly full.
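A back-of-the-envelope comparison, using the numbers from the text (4K blocks, a 23+2 double-parity stripe), of what full-stripe writes buy compared to partial RAID 6 updates; the I/O counts are the rough ones quoted above, not measured figures:

```python
# Rough arithmetic sketch, not vendor data.

K, P = 23, 2                      # data and parity blocks per stripe

# Partial (read-modify-write) update of a single 4K block in RAID 6:
# read old data + both parities, then write new data + both parities.
partial_ios = 3 + 3               # 6 flash I/Os to land one 4K of user data

# XDP-style full-stripe write: accumulate 23 unrelated 4K blocks and write
# them together with 2 parity blocks into the emptiest stripe.
full_stripe_ios = K + P           # 25 writes that land 23 blocks of user data
per_block_ios = full_stripe_ios / K
print(partial_ios, round(per_block_ios, 2))        # 6 vs ~1.09 I/Os per user block

print(f"raw parity overhead: {P / (K + P):.0%}")   # 8%; the ~80% usable figure in the
                                                   # text also accounts for other reserves
```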

There is much more: the fastest cloning in VMware via VAAI, since only metadata is updated; on-the-fly data compression, which saves capacity even for data that deduplicates poorly, such as databases; and a lot else, but you cannot tell everything in one post.

What happened and where it is needed


This set of know-how allowed us to make a very interesting product with inline deduplication, scale-out, simplicity and automatic load balancing, "free" snapshots, and many other advantages. And all of this brings absolutely real benefits.

Moreover, it is not a niche product for small highly loaded databases or anything of that sort. It is universal storage.

Here are some examples:
• VDI infrastructure. On the one hand, the performance requirements there are high; on the other hand, thousands of identical workstations need to be stored efficiently. Usually capacity is saved with linked clones at the software level, which brings many complications. With XtremIO you can use full clones and let the array itself take care of deduplication.
• Virtual infrastructure. Deduplication works excellently here too, so thousands of virtual machines fit into a few rack units. Deploying new clones is almost instantaneous, and the level of performance compared to traditional arrays is striking.
• Or an array for a DBMS, where performance is always sufficient and snapshots can be taken without affecting performance. This can completely change how you build test landscapes: instead of a complex construction with re-creating clones and replicating to a separate test/dev array, you can fit everything into one small box.

Competitors


Nobody else has such a set of know-how. It is exactly this architecture and functionality that distinguishes XtremIO from the other players on the market: on-the-fly deduplication, compression, scale-out, snapshots, and all of it in a single array without additional gateways or complex constructions.

Some vendors use their familiar arrays as AFAs, simply replacing the disks with SSDs. But they cannot achieve the same simplicity, scale-out, inline deduplication, and so on down the list.

Others offer their own flash modules and RAID controller chips, but in such solutions additional functionality is either absent or achieved through external gateways, which greatly complicates the system and reduces performance.

And we sincerely believe that we have chosen the right path for the development of storage systems, and the right architecture from the start.

PS


As you have probably already understood, we at EMC have started our blog. Stay tuned!

Source: https://habr.com/ru/post/240709/

