A couple of days ago my colleagues called me with a question: the old disk shelf is dying completely (it is still an old IBM), what do we do? No disks, no support, no money.
Call Oleg.
What to buy, where to run, how to go on living?
On Habr, apart from the missing "insert table" button and the mixing (after the merger with GT) of philosophy, politics, school-level cosmonautics and autumn flare-ups into one heap, there is also an almost complete absence of fresh articles about how modern storage systems work, what to choose this season, and so on.
New authors are scarce too; about, say, 3PAR there has been almost nothing since 2015 (although, surprisingly, there was material on Huawei OceanStor, so there will be plenty of links about it).
So I will have to write it up myself, as best I can, so that there is something to tell my colleagues about storage systems and something to point them to.
TL;DR: what is going on in storage right now? Briefly and with errors, in the author's usual "throw everything into one heap" style.
So, what modern storage systems can do, from the category of "expensive, of course, but there is not much choice".
1. Junior storage systems
1.1. Do-it-yourself systems.
1.2. The same, but from the factory and with updates from the manufacturer.
1.3. Junior real storage systems. Already with mother-of-pearl buttons.
2. Hyper-converged storage systems.
3. Systems based on Ceph, Gluster and so on.
4. Modern storage of the middle price segment.
5. Modern storage systems of the upper price segment will not be considered, although it is also interesting there.
6. Where it all goes.
7. Some conclusions.
8. Links and what else to search for and read, including in Russian.
1. Junior storage systems
1.1. Do-it-yourself systems.
There is nothing complicated here: take any SAS controller for the disks, or even go without one, take any case "a bit bigger", stuff it with disks, build RAID 1/10/5/6 out of them and publish it over 10G iSCSI. Windows has been able to build a software RAID 5 since Server 2003, and to publish it over iSCSI since roughly the same era. Linux can also do software RAID and act as an iSCSI target; the configuration process is well described, I have done it myself.
The problem with such a self-built box is everything: from the speed of the array itself, to monitoring (there is no SNMP, the controller knows nothing about it, so a Zabbix agent has to be installed), to rebuild speed and assorted hardware-software tricks in operation.
Comment from colleagues:
Well, actually it is not that bad - FreeNAS gives you everything out of the box. Underneath is ZFS, and every "evil Buratino unto himself" picks his own entertainment.
And the speed is quite decent: a dozen 900 GB 10k disks handle a mediocre business load just fine (an Exchange for ~100 users with fat mailboxes, file shares, terminal servers and other tinsel). RAID 10 of course, Storage Spaces, a JBOD shelf for the Storage Spaces and heads on DL120 G8.
In general, the most painful spot of modern simple arrays, in my opinion, is the speed of rebuilding RAID 5/6 on capacious and cheap 7200 rpm SATA / NL-SAS drives.
The reason for this pain is well known: to reconstruct a lost block you have to read the corresponding blocks from all the remaining disks, recompute the lost block and write it out, all while accounting for the RAID penalty and the fact that the remaining disks keep serving production load. Rebuilding such an array can take 3 days, or maybe a week, and the whole time all the disks are loaded in a way they normally never are.
During such rebuilds I have had a second and even a THIRD drive fail, and sometimes I had to pull the data back from backup.
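To get a feel for why "3 days, maybe a week" is realistic, here is a back-of-the-envelope sketch; the disk size and rebuild rates are illustrative assumptions, not measurements from any particular controller:

```python
def rebuild_hours(disk_tb: float, effective_mb_s: float) -> float:
    """Very rough time to re-create one failed disk at a sustained rebuild rate."""
    return disk_tb * 1_000_000 / effective_mb_s / 3600

# An 8 TB NL-SAS disk, rebuilt while production I/O keeps hitting the array:
# the controller often squeezes out only a few tens of MB/s of useful rebuild work.
for rate in (100, 30, 10):
    h = rebuild_hours(8, rate)
    print(f"{rate:>3} MB/s -> {h:6.0f} h (~{h / 24:.1f} days)")
```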
Solutions of the form "let's build not RAID 6 but RAID 60" lead to a mild form of amphibian asphyxia (being strangled by the toad of greed). For example, we have an array of 24 disks, without a hot-spare disk (it sits in the cupboard). We split it into 3 groups of 6+2 disks, build R60 and get usable capacity of "minus 2 disks from each group, minus 6 in total". That is certainly not the minus 12 of RAID 10, but it is rapidly approaching it.
And not all controllers support RAID 60 anyway, and on 6-8 TB disks it will still be sad.
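The "minus 6 versus minus 12" arithmetic from the example above, as a tiny sketch:

```python
def raid60_usable_disks(total_disks: int, data_per_group: int, parity_per_group: int) -> int:
    """Disks' worth of usable capacity left after splitting a shelf into RAID 6 groups."""
    group = data_per_group + parity_per_group
    return (total_disks // group) * data_per_group

total = 24
print(total - raid60_usable_disks(total, 6, 2))  # 6 disks of capacity lost to parity (3 groups of 6+2)
print(total - total // 2)                        # 12 disks lost with RAID 10 on the same shelf
```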
1.1a. Separately worth noting are solutions of the "you cannot tell what it is" kind - for example,
this one.
Not a word about what the nodes are, what the controllers are, what happens with snapshots and backups, or what exactly "well, we do not have x86 under the hood" means - what then, a Raspberry? some kind of ARM? - plus hints of total data loss in the comments.
1.2. The same, but from the factory and with updates from the manufacturer.
These are the junior QNAP, Synology and the like. Not quite the very youngest "for home - SOHO" boxes, with ZFS and "a neon lamp inside", but something already resembling a storage system.
In these storage systems there are already two power supplies, two SAS controllers in the more expensive models, two 10G network cards, or maybe even FC 8/16, and everything above that is software-defined: you want iSCSI, you want SMB (and, if you are brave, even SMB 1.0), you want NFS, you can have FTP. If you want, you can even make thin-provisioned disks. There is even a battery-backed cache on the controller! Of course, it is not a battery big enough to keep the spindles turning for an hour, but it is there.
The systems are decent and they work, but the problems are the same as with the self-built box: speed and rebuild. And yes, during an update there will be a service interruption of 30 minutes (if it is just the OS) or a couple of hours (if the controller firmware as well), and often you cannot roll back.
On top of that come bugs, both "from the manufacturer" and from whatever controller and OS happen to be under the hood of such a storage system.
1.3. Junior real storage systems. Already with mother-of-pearl buttons.
What the picture books call "entry-level".
Of the well-known ones, this is the HPE MSA line - 2052, 2050, 2042, 2040, 1050, 1040, and the ancient P2000 G3, although that one has not been sold for a long time.
Dell has its own, of course (Dell Storage SCv2000), as do Lenovo and the older Huawei models.
For example, right now the old (4th) generation of the HPE MSA line is finishing its life cycle; as the HPE website modestly puts it - buy:
1040 -> 1050, 2040 -> 2050, 2042 -> 2052.
At this level you get such features as controller firmware upgrades without interrupting service, the ability to replace a controller with the next-generation one (the controller only), and, of course, "hot swap of everything" - disks, controllers, power supplies. Snapshots at the storage level appear, as do tiering and remote replication. You can read about it (not in Russian) here -
HPE MSA 2050 SAN Storage - Overview , or here
HPE MSA 1050/2050/2052 Best Practices , or
here in translation.
The problems with RAID rebuilds are still the same, and new ones are added - with that same tiering.
What tiering is, I hope everyone knows. If not, then:
Tiering: a combination of 2-3 types of different disks and RAID levels in one storage group - for example, we put SSDs in RAID 10 and SATA 7200 in RAID 6/60 into one group. After that, the storage system watches (for a couple of days) which data is accessed more often (hot data) and keeps it on the SSDs, while the "colder" data is sent a level down. As a result, the hot data sits on SSD and the completely cold data on cheap SATA 7200. The catch is that this relocation happens once a day, and sometimes you need it "right now". The array does not operate on individual virtual machine files, so speeding up "that one machine over there right now" will not work, unless you have a guaranteed-fast separate LUN left over and you move the machine onto it (from within the virtualization environment) or onto the local storage of the virtualization host.
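A minimal sketch of the tiering idea - not any vendor's actual algorithm, just the "count accesses per extent, promote the hottest ones once a day" logic, with made-up extent names and numbers:

```python
from collections import Counter

# I/O counters per extent, gathered over the monitoring window (say, a day);
# the extent names and counts are purely illustrative.
io_per_extent = Counter({"ext-001": 90_000, "ext-002": 70, "ext-003": 12_000, "ext-004": 3})

SSD_SLOTS = 2  # how many extents fit on the fast tier

def plan_migrations(stats: Counter, ssd_slots: int):
    """Once a day: the hottest extents go up to SSD, everything else stays on / goes to NL-SAS."""
    ranked = [ext for ext, _ in stats.most_common()]
    return ranked[:ssd_slots], ranked[ssd_slots:]

hot, cold = plan_migrations(io_per_extent, SSD_SLOTS)
print("promote to SSD:  ", hot)
print("demote to NL-SAS:", cold)
```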
Somewhere around here the new, fresh problems and pains begin, and they are not problems of the storage system. A simple example: a storage system is bought "expensively", and then pseudo-specialists from the same Gilev crowd are let loose on it, and they start running CrystalDiskMark inside the virtualization environment, ending up with read speeds at the level of the slowest storage tier. Why they do this is not clear at all.
However, those who let such hapless specialists into their infrastructure pay for all this work with their own money; that is their right. As they say, the things people will do instead of just buying a new 1U server with fast processors and plenty of memory "specifically for 1C" - for example, with
Intel Xeon Gold 5122 (3.6GHz / 4-core / 16.5MB / 105W) Processor
or Intel Xeon Platinum 8156 (3.6GHz / 4-core / 16.5MB / 105W) Processor
By the way, I should ask for such a marvel for 1C myself - there is no money, but maybe I will at least send a price request to Dell / Lenovo / Huawei / HPE.
2. Hyper-converged storage systems.
This is not the junior segment at all, especially Nutanix; however, given the universal love for "trophy" import substitution and the existence of Nutanix CE (some people even run it in production), it will have to be placed here.
Simply put, hyperconvergence is servers stuffed with disks and SSDs, where the working data sits locally on the same servers that run the virtual machines (with a copy somewhere else).
Of the well-known systems there are Nutanix (as a whole hardware-software complex, though there is CE "to have a look"), MS Storage Spaces Direct (S2D), and VMware vSAN. With the latter everything is great fun: the system gets called "not a day without patches", "we are suffering again", or, as with the latest 6.7u1 update, "we broke your backups, ho-ho-ho".
Not that the others do not have this - everyone has problems - I just run into VMware more often.
A number of manufacturers (besides Nutanix itself) already sell such systems - for example, the Dell XC Series.
In general these systems look fine to me; everyone is doing it. MS sells actively (even in the Azure Stack format - already two deployments in the Russian Federation), Nutanix is growing both in sales and in capitalization, and the performance gain in some scenarios on Nutanix is decent, literally "several times over", up to one photo-processing service speeding up from 5 hours to an hour (if not faster). Even deduplication with compression is available.
Judging by the financial indicators, VMWare as a whole is still not bad either.
3. Systems based on Ceph, Gluster and so on.
Something similar can be assembled from open-source components, especially if you are brave and/or have a staff of developers, like at Krok. Or if the data is not too precious. Of course, what happened at CloudMouse will surely never happen to you. Or maybe you are just that lucky.
example one,
example two,
example three.
The problems with implementing such solutions are:
- technical: how is all of this backed up? How long will the rebuild take if one disk or one node fails? How is it configured and who will support it? And how long will a restore take?
- administrative: if your administrator falls under a tram or simply quits, who takes over the support? How much will that cost? And what if you suddenly lose your magnificent development team in full strength? Of course, it is not like losing a bunch of keys, but it does happen that people leave in whole teams at once.
Comment from colleagues:
And what happens during a blackout? There is a great story about a fallen Symmetra dropping the data center in one FSUE. Fallen in the literal sense - the raised floor gave way under it.
Or like the cheerful welders in another FSUE, who showed the capitalist inverter the harsh Russian-Tajik "Aha!".
So how all of that comes back up and repairs itself after a crash, with 100500k write IOPS in flight, is an open question.
And what if you are not in Default City, but somewhere in the Khibiny, among bears and mosquitoes? You can teach a bear to play the balalaika, but to fix Gluster...
In mid-range storage systems, by the way, there are batteries for the cache (or supercapacitors), or dedicated battery modules; the cache itself lives partly in RAM and partly on SSD, and if the whole data center loses power, the array will reliably dump everything it did not have time to write into its transaction logs and protected cache.
BUT DO BACKUPS ANYWAY. REGULARLY.
4. Modern storage of the middle price segment.
As I already wrote, a search on Habr quickly turned up three articles about Huawei OceanStor:
Import Substitution Part 2. Huawei OceanStor Family, three lines
here, and
testing of the All-flash OceanStor Dorado V3.
There are a couple of 3PAR reviews from 2012-2015, for
example. I somehow did not look for reviews of Dell or IBM Storwize - maybe in vain.
There are a few more materials in Russian on the wider Internet; for example, the same OceanStor is reviewed
here (by the way, a generally useful blog, full of Huawei references).
So, what does a modern array of the HPE 3PAR / Huawei / IBM-Lenovo level offer us?
4.1 First of all, a new approach to partitioning disks. At 3PAR this is its chunklet-based RAID, at Huawei it is RAID 2.0, at Lenovo-IBM Storwize it is Distributed RAID (DRAID). The roots of this approach were already present in the HP Enterprise Virtual Array (EVA); in 3PAR it reached a logical and more or less understandable state.
How it works: there are videos, albeit in English, but everything is clear.
HPE 3PAR StoreServ Architecture Overview ChalkTalk
Huawei RAID 2.0 Technology
Or (a bit longer, in more detail, and in Russian) -
Webinar E=DC2 №2: RAID technology and its application - about RAID 2.0 from 1:18.
How it works, described in words rather than pictures.
First, a disk domain is allocated - simply a group of disks, possibly of different types, for example SSD and SATA.
Each individual disk is divided into logical blocks of 64 MB / 256 MB / 1 GB (everyone does it differently), and everyone names them differently too: HPE has chunklets, Huawei has chunks, IBM has extents (but that last one is not certain).
Then, from these chunklets/chunks (not from all of them at once, and from no more than N - you will not assemble an R5 array of 50+1 - and spread across different disks!), we assemble the array "we need". The chunklets/chunks in a group must be of the same type (for example, SSD); the capacity of the disks and their count is also regulated, along the lines of "disks must be identical, add them in pairs or fours". For example, on SSD we need speed, so we build R10; on SATA the volume matters, so we build RAID 5. And we do not so much "build" as choose "what can be built here" - there are no 100500 manual operations. The resulting chunklet/chunk groups are gathered into larger groups, those groups are cut into small extents (for tiering), and so on down the chain, and in the end we get a thin or thick logical disk (by now a LUN, or a file system for SMB) of the desired size, plus space reserved for rebuilds. Details can be found,
for example, here.
What are the advantages of this approach to partitioning?
The pros are pretty simple.
First of all, it is a much faster rebuild when a single disk fails.
ATTENTION! Backups still need to be done! Raid is not a backup, and will not protect against incorrect data changes in any way.
When a disk failed in a classic RAID 5/6, we had to read all the data, do the math and write everything to a single disk - so that one target disk became the bottleneck.
Here, the writes go to all disks at once. Strictly speaking, one should also mention full-stripe writes at this point, but nobody is surprised by those any more.
As a result, the rebuilding is much faster.
How it works in terms of mathematics and RAID.
Let me reassure everyone right away: there is no magic here. RAID has not gone anywhere, it has just become a little more virtual.
Suppose we have 2 arrays of 20 disks each, plus a spare in each (one or two - it does not matter).
The disks are different, but the total raw volume of the array is the same.
The first array is assembled from 2TB disks as Raid 5. 19 + 1.
The second array is assembled from 1 TB disks as RAID 50: 4 RAID 5 groups of 5 disks each, combined into one RAID group. In total, (4+1)*4.
Now let 2 TB of raw capacity fail in each case - for the second array, that means two 1 TB disks in different RAID groups.
In the first case, we have to read data from the 19 remaining disks, compute "what was lost" and write it to 1 disk. Let the useful write speed to one disk be 10 IOPS (there are no RAID penalties here, but there is other load).
In the second case, we only need to read data from 8 disks, and we write to two disks at 10 IOPS each, 20 in total.
Astrologers proclaim the week of rebuilds: rebuild speed is doubled.
Now imagine that these "micro-arrays of 4+1" are a few hundred megabytes (at most a gigabyte) each, and there are not 4 of them but a whole disk's worth. On top of that, we know which micro-arrays actually hold data and which do not, so empty space can be excluded from the rebuild. And the spare space is spread across all the remaining disks, so the writes go to all disks at once, at a speed of 10*N, where N is the number of disks.
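The same arithmetic in a few lines of code, using the text's arbitrary "10 IOPS of useful rebuild writes per disk":

```python
per_disk_write = 10   # useful rebuild writes one disk absorbs, in the text's arbitrary units
surviving_disks = 19

rates = {
    "classic R5 19+1 (one target disk)": 1 * per_disk_write,
    "R50 (4+1)x4, two groups hit":       2 * per_disk_write,
    "chunk-based, spare spread on all":  surviving_disks * per_disk_write,
}
base = rates["classic R5 19+1 (one target disk)"]
for name, rate in rates.items():
    print(f"{name:36s} {rate:4d} -> {rate / base:.0f}x")
```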
Yes, after the failed disk is replaced, rebalancing is performed: the new disk is divided into segments and some data is moved onto it. But this is a background operation, no parity recalculation is needed, and the production load will not sag while such a rebalance runs.
You have to pay for everything, and this kind of partitioning is no exception. If in a classic 20-disk RAID 6 array you lost 2 disks' worth of capacity - 2/20, 10% - then here space is consumed both by the rebuild reservation and by the same RAID 6: with 6+2 groups you lose 2/8, 25%.
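The overhead comparison in two lines (parity only; the distributed spare reservation comes on top of this):

```python
def parity_share(data_disks: int, parity_disks: int) -> float:
    """Fraction of raw capacity spent on parity within one RAID group."""
    return parity_disks / (data_disks + parity_disks)

print(f"classic RAID 6, 18+2 on 20 disks: {parity_share(18, 2):.0%}")  # 10%
print(f"chunk groups of 6+2:              {parity_share(6, 2):.0%}")   # 25%
```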
On the other hand, there are upsides too. One of the reasons for using RAID 6 is the already noticeable risk of collisions (unrecoverable read errors) on R5 arrays built from "roughly 1 TB and up" disks. In absolute numbers the risk may be small, but nobody wants to lose data. Another reason is "a long rebuild with the risk of losing another disk, which is fatal for R5". Here the disk groups are small, the risk of such a collision is much lower, the rebuild is quick - so in some cases you can get away with R5.
Now, for those asking "how does it survive the very moment of failure, before the rebuild starts" - I will answer with a quote:
The system creates additional logical disks that preserve data for system administration.
These logical disks are multi-level logical disks with three-way mirrors for enhanced redundancy
and performance. The following logical disk types are created by the system:
• Logging logical disks are RAID 10 logical disks that temporarily hold data during
disk failures and disk replacement procedures. Logging logical disks are created by the system
during the initial installation and setup of the system. Each controller node in the system has
a 60 GB logging LD.
How spare chunklets work:
• When connectivity to a physical disk is lost, the system writes the data destined for that disk
to logging logical disks until the disk comes back online. Logging disk space is allocated when the
system is set up. This does not apply to RAID 0 chunklets, which have no fault-tolerance.
• If the logging logical disk becomes full, the relocation of data to spare chunklets starts automatically.
Chunklets and spare chunklets:
HP 3PAR StoreServ 7200 2-node Administrator's Manual: Viewing Spare Chunklets
3PAR InForm OS 2.2.4 Concepts Guide
4.2 Besides fast rebuilds, such partitioning adds the ability to do all sorts of convenient things in "thin disk" format. In principle both Microsoft and VMware can do thin provisioning too, but each with its own interesting quirks in terms of speed. Here, thanks to the caches and the pre-allocation of space "in reserve", there are somewhat fewer problems.
4.3 Deduplication and compression.
Deduplication has existed on NTFS for a long time and recently appeared on ReFS, but it does not always behave in an understandable way, and it runs offline and on a schedule.
On a storage system, deduplication and compression can be done on the fly (again, with restrictions). If your data is "similar, or even identical" - say, a LUN is dedicated strictly to OS disks - deduplication will free up 2/3 of the space (if not more). Compression will add a bit more. Or a lot more. Or, if you store already-compressed video, nothing at all.
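A sketch of what such ratios mean in capacity terms; the 3:1 and 1.3:1 figures are assumptions for illustration, not guarantees:

```python
def logical_capacity(raw_tb: float, dedup_ratio: float, compression_ratio: float) -> float:
    """How much logical data fits into raw_tb after dedup and compression."""
    return raw_tb * dedup_ratio * compression_ratio

# A LUN holding only OS disks of near-identical VMs: dedup ~3:1 (the "frees 2/3" case),
# plus a bit of compression; already-compressed video dedupes and compresses at ~1:1.
print(logical_capacity(10, 3.0, 1.3))  # ~39 TB of VM data on 10 TB of raw space
print(logical_capacity(10, 1.0, 1.0))  # 10 TB of video is still 10 TB
```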
4.4 The aforementioned TIERING - when, once a day and at your choice, either cold data moves down to cheap slow drives, or suddenly-hot data moves up to SSD.
The feature is useful if used wisely: configure statistics collection correctly and keep an eye on the counters.
4.5 Snapshots at the storage level.
A useful thing that I previously underestimated.
How a virtual machine is "usually" backed up in full: a snapshot is taken, the snapshot is presented to the backup system (SRK), and after the snapshot has been written to the SRK, a long and tedious consolidation follows.
It is worth remembering that it is a good idea to have a backup system agent (service) inside the virtual machine being backed up, which tells the other services to flush everything to disk, hand over their wallets and generally freeze (in Windows this is the Volume Shadow Copy Service, VSS).
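The "usual" flow as a runnable Python sketch; every function name here is a hypothetical stand-in, not a real hypervisor or backup-product API:

```python
# Made-up stand-ins for the hypervisor / backup-server calls described above.
def quiesce(vm):            print(f"{vm}: VSS freeze, writers flush to disk")
def create_snapshot(vm):    print(f"{vm}: hypervisor snapshot taken"); return f"{vm}-snap"
def unquiesce(vm):          print(f"{vm}: thawed, keeps running on a delta file")
def export_to_srk(snap):    print(f"{snap}: read by the backup system (SRK)")
def consolidate(vm, snap):  print(f"{vm}: long and tedious consolidation of {snap}")

def backup_vm(vm):
    quiesce(vm)
    snap = create_snapshot(vm)
    unquiesce(vm)
    try:
        export_to_srk(snap)
    finally:
        consolidate(vm, snap)

backup_vm("vm-exchange-01")
```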
Why VSS matters: without an application-consistent snapshot you risk getting back, say, an Exchange database after a "dirty" shutdown, with long consistency checks to follow; application-level replication (Exchange DAG, SQL Always On) is a separate topic and not a replacement for backups.
Backup products (Veeam, for example) can work not only with hypervisor snapshots but also with snapshots at the storage level.
The agent inside the VM tells the services "Halt! Ausweis", the hypervisor snapshot is taken and is immediately offloaded to a snapshot of the whole LUN on the array, the services get their "Weggetreten!", and the backup server then reads the data from the array snapshot - so the VM spends almost no time running on a delta file.
Veeam + 3PAR has worked this way for a while, Veeam + Huawei appeared in 2018, and there is support for IBM Storwize and for Nimble as well.
4.6 Migration.
Data can be moved around without stopping the service - live (and, if unlucky, dead) migration: a LUN can be moved between disk groups, between tiers, or even to another array, while the hosts keep working with it.
4.7 SSD cache.
Besides tiering, SSDs can also serve as a cache in front of the slower disks. Which brings us back to the 1U server "for 1C": an Intel Xeon Scalable - Intel Xeon Platinum 8156 (3.6/3.7 GHz, 105 W TDP) or Intel Xeon Gold 5122 - plus 2 SSDs or 4 m.2 drives in RAID 1/5/6 (anything except 0), in a single 1U box with plenty of memory, closes quite a few "why is it slow" questions on its own.
4.8 Controllers.
Under the hood the controllers are essentially x86 servers, so a lot of the functionality is a matter of software and licensing.
4.9.
4.10 Replication and snapshots instead of classic backup.
With scheduled snapshots and replication of LUNs to a second array, it is tempting to say "we already have backups".
The reverse side of this feature: if you have such an array but no offline backup (the classic 3-2-1: 3 copies, on two types of media, 1 copy outside the main site), and evil hackers (or an administrator whose head has suddenly gone wrong) get access to this function, there is a risk of losing everything at once.
4.11. Delegation, integration with LDAP and so on.
Of course, there are functions for "giving someone the right to look" with fairly fine granularity, plus beautiful graphs and non-boring pictures. You can safely give the Big Boss "view" rights - let him look at the pictures if he needs to. He will not break anything.
6. Where it is all going.
Storage systems are now developing in two directions: hyperconverged hardware-software complexes (PAK), and all-flash arrays. SSD drives have dropped in price several times over the past 5 years, if not tens of times (per gigabyte), and have grown in speed, endurance and connectivity options. As a result, 15k disks are almost gone (you can still buy them, of course), 10k disks are heading the same way, and 7200 disks will live on.
Chips (ASICs) and the processing done in them are not standing still either.
The takeaway: for playing around there is always the "trophy" Nutanix CE and ever-cheaper SSDs.
For those who still believe that "an SSD costs like a 15k disk times many", here are some list prices.
Compare not only the "price per gigabyte" but also the "price per IOPS".
872392-B21 - HPE 1.92 TB, SAS, read intensive, digitally signed FW, solid state - $2700.
875492-B21 - HPE 960 GB, SATA, mixed use, digitally signed FW, solid state, M.2 2280 - $1200.
870759-B21 - HPE 900 GB SAS, enterprise, 15K rpm, small form factor hard disk drive - $900.
702505-001 - HP 900 GB SAS SFF 10K hard disk drive - about $400.
7. Some conclusions.
The need to understand storage arises the moment the business starts asking "why are we so slow?" and you have nothing to answer. Immediately there is a need to understand the theory: what IOPS is, what latency is (and where it comes from), the RAID penalty (a small sketch of it is at the end of this section), the load profile (day, night, during backup).
Comment from colleagues:
Or the moment it starts asking "how do we get five-nines uptime, but cheaply?" RTO and RPO are things to agree with the business in advance: how much data may be lost, how long a recovery may take, and what that costs.
An RPO/RTO example: a terminal ("RDP") server with 1C on local SSD RAID 1 and HDD RAID 1, plus backups. If an SSD dies, you keep working on the mirror and replace it; if everything dies, a restore from backup is a couple of hours plus another 30 minutes or so of fiddling.
Whether that is acceptable, or whether "downtime costs us 100 thousand an hour" and a completely different solution and budget are needed, is for the business to decide in advance.
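Since IOPS and the RAID penalty came up: a minimal sketch of how a front-end load translates into back-end disk load, using the classic textbook penalties and illustrative numbers:

```python
def backend_iops(frontend_iops: float, read_share: float, write_penalty: int) -> float:
    """Back-end disk IOPS needed for a given front-end load and RAID write penalty."""
    reads = frontend_iops * read_share
    writes = frontend_iops * (1 - read_share)
    return reads + writes * write_penalty

# 2000 front-end IOPS at a 70/30 read/write mix; classic penalties: RAID 10 = 2, RAID 5 = 4, RAID 6 = 6.
for name, penalty in (("RAID 10", 2), ("RAID 5", 4), ("RAID 6", 6)):
    print(f"{name}: {backend_iops(2000, 0.7, penalty):.0f} back-end IOPS")
```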
8. Links and what else to search for and read, including in Russian.
Nutanix - http://nutanix.ru/ - the vendor's blog. In Russian.
Report from Highload 2016: What industrial storage systems are already capable of.
Brocade: FC 101-WBT Introduction to Fibre Channel Concepts
One, two, three, four, five.
Storage Systems Huawei OceanStor V3
Storage Systems Huawei OceanStor V3 (Part 2)
Storage Systems Huawei OceanStor V3 (Part 3)
Storage Systems Huawei OceanStor V3 (Part 4)
Huawei Storage Simulator
A surprisingly handy thing. It is a small web server (a few hundred megabytes) that lets you spin up a virtual storage system. You cannot connect anything to it, of course, but you can press all the buttons, create a disk domain, LUNs and whatever else, study the logs, and separately study the command line - all quite possible.
Links: look there for files named "Demo_for_".
Huawei Hands-on LAB - for example,
Huawei Dorado 5000 V3 Storage Active-Active Solution (Carrier / Enterprise)
H6LH3AAE Managing HPE 3PAR StoreServ Hands On Lab
This is a course from HPE, for which they want 23,400 rubles excluding VAT (which causes certain problems when it is paid for, and especially when money has to be returned).
HPE 3PAR StoreServ Simulator
h20392.www2.hpe.com/portal/swdepot/displayProductInfo.do?productNumber=HP3PARSIM
What else to search for:
HK902S Managing HPE 3PAR StoreServ I: Management and Local Replication (there is video on the Internet)
HK904S Managing HPE 3PAR StoreServ II: Optimization and Remote Replication (likewise, there is video)
EMC Information Storage and Management Student Guide (this one is not in Russian)
Vdisks, Mdisks, Extents and Grains - Scale Out for Sure