How to choose a storage system without shooting yourself in the foot

Introduction

The time has come to buy a storage system. Whom should you listen to? Vendor A badmouths vendor B, and then there is integrator C, who says the opposite and recommends vendor D. In a situation like this even an experienced storage architect's head will spin, especially with all the new vendors and the now-fashionable SDS and hyperconvergence.

So how do you make sense of all this and not end up the fool? We (AntonVirtual Anton Zhbankov and korp Yevgeny Elizarov) will try to explain it in plain language.
The article has much in common with, and is in fact an extension of, “Design of a virtualized data center” as far as choosing storage systems and reviewing storage technologies is concerned. We cover the general theory briefly here, but we recommend reading that article as well.

Why

You can often see a newcomer arrive at a forum or a specialized chat, such as Storage Discussions, and ask: “I have been offered two storage options - ABC SuperStorage S600 and XYZ HyperOcean 666v4, what do you advise?”
And then a measuring contest begins over who implements which scary and incomprehensible features in what way, all of which is complete gibberish to an unprepared person.

So the key and very first question you need to ask yourself, long before comparing the specifications in commercial proposals, is WHY? Why is this storage system needed?

The answer will be unexpected, and very much in the style of Tony Robbins: to store data. Thank you, Captain Obvious! Nevertheless, sometimes we get so deep into comparing details that we forget why we are doing all this in the first place.

So, the task of a data storage system is to store DATA and provide access to it with a given performance. We will start with the data.

Data

Data type

What kind of data do we plan to store? This is a very important question that can take many storage systems out of consideration altogether. Suppose, for example, that we plan to store videos and photos. We can immediately rule out systems designed for small-block random access, and systems whose signature features are compression / deduplication. They may be excellent systems, and we do not want to say anything bad about them. But in this case their strengths either turn into weaknesses (video and photos do not compress) or simply add significantly to the cost of the system.

Conversely, if the target use is a heavily loaded transactional DBMS, then excellent multimedia streaming systems capable of delivering gigabytes per second would be a bad choice.

Data volume

How much data do we plan to store? Quantity always turns into quality; this should never be forgotten, especially in our time of exponential data growth. Petabyte-class systems are no longer uncommon, but the more petabytes of capacity there are, the more specialized the system becomes and the less of the usual functionality of small and medium random-access systems remains available. The reason is trivial: the per-block access statistics tables alone become larger than the RAM available on the controllers. Not to mention compression / tiering. Suppose we want to switch the compression algorithm to a more powerful one and recompress 20 petabytes of data. How long will that take: half a year, a year?

On the other hand, why make a fuss if you need to store and process 500 GB of data? Just 500. Consumer SSDs (with a low DWPD) of that capacity cost next to nothing. Why build a Fibre Channel fabric and buy a high-end external storage system that costs a fortune?

What share of the total is hot data? How uneven is the load across the data? This is where tiered storage or Flash Cache can help a great deal, if the amount of hot data is small compared to the total. Conversely, with a load that is uniform across the entire volume, as is often the case in streaming systems (video surveillance, some analytics systems), such technologies give nothing and only increase the cost and complexity of the system.
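The "how uneven is the load" question can be estimated from per-block access counters, if your monitoring exports them. A minimal sketch, assuming a plain list of access counts per block; the function name and the 10% threshold are illustrative, not taken from any vendor tool:

```python
def hot_share(access_counts, top_fraction=0.1):
    """Fraction of all I/O that lands on the hottest `top_fraction` of blocks."""
    ordered = sorted(access_counts, reverse=True)
    top_n = max(1, int(len(ordered) * top_fraction))
    total = sum(ordered)
    return sum(ordered[:top_n]) / total if total else 0.0

# Skewed load: one block out of ten takes almost all the I/O.
skewed = [900] + [10] * 9
# Uniform load: every block is accessed equally often.
uniform = [100] * 10

print(f"skewed:  {hot_share(skewed):.2f}")
print(f"uniform: {hot_share(uniform):.2f}")
```

A result close to 1.0 suggests that a cache or a fast tier could absorb most of the load; a result close to `top_fraction` itself means the load is uniform and such technologies will add little.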

IS

The other side of the data is the information system (IS) that uses it. An IS has a set of requirements, and the data inherits them. For more on information systems, see “Design of a virtualized data center”.

Fault Tolerance / Availability Requirements

Requirements for fault tolerance / availability of data are inherited from the IS that uses the data and are expressed in three numbers: RPO, RTO, and availability.

Availability is the share of a given period of time during which the data is available for use. It is usually expressed as a number of nines. For example, two nines per year means 99% availability, i.e. about 87.6 hours of unavailability per year are allowed. Three nines means about 8.8 hours per year.
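The arithmetic behind the nines is easy to check. A minimal sketch, assuming a 365-day year (8760 hours); the function name is ours:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, leap years ignored

def allowed_downtime_hours(nines: int) -> float:
    """Hours of unavailability per year permitted by an availability of N nines."""
    unavailability = 10 ** (-nines)  # 2 nines -> 0.01, 3 nines -> 0.001
    return HOURS_PER_YEAR * unavailability

for n in (2, 3, 4):
    print(f"{n} nines: {allowed_downtime_hours(n):.2f} hours per year")
```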

Unlike availability, RPO and RTO are not cumulative figures; they apply to each individual incident (failure).

RPO is the amount of data lost in the event of a failure, measured in hours. For example, if a backup is taken once a day, then RPO = 24 hours. That is, in the event of a failure and a complete loss of the storage system, up to 24 hours of data (everything since the last backup) may be lost. The RPO set for the IS determines, for example, the backup schedule. The RPO also tells you whether synchronous or asynchronous data replication is needed.

RTO is the time needed to restore the service (access to data) after a failure. From the target RTO we can tell whether a metrocluster is needed or whether one-way replication is enough. The same goes for whether you need a multi-controller hi-end class storage system.

Performance requirements

Although this is a very obvious question, it is the one that causes the most difficulty. How you collect the necessary statistics depends on whether you already have some infrastructure or not.

You already have a storage system and are looking for a replacement or want to buy another one for expansion. Here everything is simple. You know which services you already run and which you plan to introduce in the near future. The current services let you collect performance statistics: determine the current number of IOPS and the current latencies, see what these figures are and whether they are sufficient for your tasks. This can be done both on the storage system itself and on the hosts connected to it.

You need to look not just at the current load but at the load over some period (preferably a month). See what the maximum daytime peaks are, what load the backup creates, and so on. If your storage system or its software does not give you the full set of this data, you can use the free RRDtool, which can work with most of the popular storage systems and switches and can provide detailed performance statistics. It is also worth looking at the load on the hosts that work with this storage system, down to specific virtual machines or whatever exactly runs on each host.
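Once a month of samples has been collected, the average alone says little; the peaks and a high percentile are what the new system has to survive. A minimal sketch, assuming the statistics have already been exported as a plain list of IOPS samples (the export step depends on your tooling and is not shown); the helper names are illustrative:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers (pct between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(iops_samples):
    """Average, 95th percentile and peak of the collected IOPS samples."""
    return {
        "avg": sum(iops_samples) / len(iops_samples),
        "p95": percentile(iops_samples, 95),
        "peak": max(iops_samples),
    }

# Hypothetical samples: a quiet day with a short backup-window spike.
samples = [2000] * 90 + [15000] * 10
print(summarize(samples))
```

In this made-up example the average is far below the level the backup window reaches, which is exactly why sizing by the average is risky.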

One point deserves a separate mention: if the latency on a volume and the latency on the datastore located on that volume differ significantly, look at your SAN. It is highly likely that the problem is there, and you should deal with it before acquiring a new system, because there is a very good chance of improving the performance of the current one.

You are building the infrastructure from scratch, or acquiring a system for a new service whose load you know nothing about. There are several options: talk to colleagues on specialized resources to try to find out and predict the load, or contact an integrator who has experience with similar services and can calculate the load for you. The third option (usually the most difficult, especially with in-house or rare applications) is to try to get the performance requirements from the system's developers.

And, note well, the most correct option from a practical point of view is a pilot on your current equipment or on equipment provided for testing by the vendor / integrator.

Special requirements

Special requirements are everything that does not fall under the requirements for performance, fault tolerance, and the functionality for directly processing and serving data.

One of the simplest special requirements for a storage system is “removable storage media”. It immediately becomes clear that the storage system must include a tape library or just a tape drive onto which the backup copy is written. After that a specially trained person labels the tape and proudly carries it to a special safe.
Another example of a special requirement is a ruggedized, shock-resistant design.

Where

The second main component in choosing a storage system is information about WHERE it will be located, from geography and climate all the way to staff.

Customer

Who is this storage system for? The question matters for the following reasons:

Government / commercial customer.
A commercial customer has no restrictions and is not even obliged to hold tenders, except as required by its own internal regulations.

A government customer is a different matter: 44-FZ and other delights, with tenders and technical specifications that can be challenged.

Customer under sanctions.
Here the question is very simple: the choice is limited to the offers actually available to the customer.

Internal regulations / approved vendors / models.
This question is also extremely simple, but it must not be forgotten.

Where physically

This part covers all the questions of geography, communication channels, and the microclimate of the room where the system is housed.

Staff

Who will work with this storage system? This matters no less than what the storage system itself can do.
No matter how promising, cool, and great the storage system from vendor A is, there is probably little point in installing it if the staff can only work with vendor B and no further purchases from or long-term cooperation with A are planned.

The other side of the question, of course, is how many trained people are available in the given location, both inside the company and potentially on the labor market. For regional sites, choosing storage with simple interfaces or remote centralized management can make a lot of sense. Otherwise, at some point it can become excruciatingly painful. The Internet is full of stories about a newly hired employee, a student only yesterday, who configured things in such a way that the whole office went down.

Environment

And of course there is the important question of the environment in which the storage system will work.

What about power / cooling?
What connectivity is available?
Where will it be mounted?
And so on.

These questions are often taken for granted and not given special attention, but sometimes they are exactly what turns everything upside down.

What

Vendor

Today (mid-2019), the Russian storage market can be roughly divided into 5 categories:

Top division - established companies with a wide range from the simplest disk shelves to hi-end (HPE, DellEMC, Hitachi, NetApp, IBM / Lenovo)
Second division - companies with a limited product line, niche players, serious SDS vendors, or rising newcomers (Fujitsu, Datacore, Infinidat, Huawei, Pure, etc.)
Third division - niche low-end solutions, cheap SDS, home-brew builds on Ceph and other open projects (Infortrend, Starwind, etc.)
SOHO segment - small and ultra-small home / small office storage systems (Synology, QNAP, etc.)
Import-substituted storage systems - this includes both first-division hardware with relabeled badges and rare representatives of the second division (RAIDIX, whom we place in the second division on credit), but mostly it is the third division (Aerodisk, Baum, Depo, etc.)

The division is rather rough and does not mean at all that the third division or the SOHO segment is bad and cannot be used. In specific projects with a well-defined data set and load profile they can work very well, far surpassing the top division in price / quality ratio. What matters is to first determine the objectives, growth prospects, and required functionality, and then Synology will serve you faithfully and your hair will become soft and silky.

One important factor when choosing a vendor is the current environment: how much storage you already have, of what kind, and which storage systems your engineers can work with. Do you need one more vendor and one more point of contact, and will you gradually migrate the entire load from vendor A to vendor B?

Do not multiply entities beyond necessity.

iSCSI / FC / File

There is no consensus among engineers on access protocols, and the disputes resemble theological debates more than engineering ones. But in general the following points can be made:

FCoE is more dead than alive.

FC vs iSCSI. One of the key advantages of FC over IP storage, a dedicated fabric for data access, is cancelled out in 2019 by a dedicated IP network. FC has no global advantages over IP networks, and storage systems for any load level can be built on IP, up to systems for heavy DBMSs behind the core banking system (ABS) of a large bank. On the other hand, the death of FC has been predicted for years now, yet something always gets in its way. Today, for example, some players in the storage market are actively developing the NVMe-oF standard. Whether it will share the fate of FCoE, time will tell.

File access also deserves attention. NFS / CIFS perform well in production environments and, when properly designed, cause no more complaints than block protocols.

Hybrid / All Flash Array

Classic storage systems come in 2 types:

AFA (All Flash Array) - systems optimized for SSD.
Hybrid - systems that let you use HDDs, SSDs, or a combination of both.

Their main difference lies in the supported storage efficiency technologies and the maximum level of performance (high IOPS and low latency). Both kinds of systems (in most models, not counting the low-end segment) can work as both block and file devices. The supported functionality also depends on the tier of the system, and in entry-level models it is most often cut down to a minimum. Pay attention to this when you study the characteristics of a particular model rather than just the capabilities of the product line as a whole. The technical characteristics, such as the processor, memory size, cache, and the number and types of ports, naturally depend on the tier of the system as well. From a management point of view, AFA differs from hybrid (disk) systems only in the implementation of the mechanisms for working with SSDs, and even if you use SSDs in a hybrid system, this does not mean that you will get AFA-level performance. Also, in most cases inline storage efficiency mechanisms are disabled on hybrid systems, and enabling them costs performance.

Special Storage

Besides general-purpose storage systems, which are aimed primarily at operational data processing, there are special-purpose storage systems whose key principles differ fundamentally from the usual ones (low latency, lots of IOPS):

Media

These systems are designed for storing and processing media files, which are notable for their large size. Accordingly, latency becomes almost unimportant, and the ability to send and receive data at high bandwidth in many parallel streams comes to the fore.

Deduplicating storage systems for backups.

Since under normal conditions backups differ very little from one another (the average backup differs from yesterday's by 1-2%), this class of systems packs the data written to it extremely efficiently onto a fairly small number of physical media. In some cases, for example, data reduction ratios can reach 200 to 1.

Object storage.

These storage systems have neither the usual block-access volumes nor file shares, and more than anything they resemble a huge database. An object stored in such a system is accessed by a unique identifier or by metadata (for example, all JPEG objects with a creation date between XX-XX-XXXX and YY-YY-YYYY).

Compliance systems.

They are not that common in Russia today, but they are worth mentioning. The purpose of such storage systems is guaranteed data retention in order to comply with security policies or regulatory requirements. Some systems (for example, EMC Centera) implement a delete-prohibition function: once the key is turned and the system switches to this mode, neither the administrator nor anyone else can physically delete data that has already been written.

Proprietary Technologies

Flash cache

Flash Cache is a common name for all the proprietary technologies that use flash memory as a second-level cache. With a flash cache, the storage system is usually sized so that the magnetic disks handle the steady-state load while the cache serves the peaks.

At the same time, you need to understand the load profile and how localized the accesses to the blocks of the storage volumes are. Flash cache is a technology for loads with highly localized requests, and it is practically useless for uniformly loaded volumes (such as those of analytics systems).

Two flash cache implementations are available on the market:

Read Only. In this case only reads are cached, and writes go straight to the disks. Some vendors, such as NetApp, believe that writes on their storage systems are already optimal and a cache would not help.
Read / Write. Not only reads but also writes are cached, which buffers the stream and reduces the impact of the RAID penalty, and as a result increases overall performance for storage systems with a less-than-optimal write mechanism.
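The RAID penalty mentioned above can be illustrated with the common rule of thumb: every front-end write turns into several back-end disk operations. A rough sketch, assuming the textbook penalty values (2 for RAID 10, 4 for RAID 5, 6 for RAID 6) and no caching at all; real arrays with write coalescing or log-structured layouts behave differently:

```python
# Textbook write penalties: back-end disk operations per front-end write.
RAID_WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

def backend_iops(frontend_iops, write_fraction, raid_level):
    """Back-end disk IOPS needed to serve a front-end load without any cache."""
    writes = frontend_iops * write_fraction
    reads = frontend_iops - writes
    return reads + writes * RAID_WRITE_PENALTY[raid_level]

# 10,000 front-end IOPS with 20% writes on different RAID levels.
for level in ("raid10", "raid5", "raid6"):
    print(level, backend_iops(10_000, 0.2, level))
```

The more writes in the profile and the higher the penalty, the more a write-caching layer has to gain, which is the point of the Read / Write variant.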

Tiering

Multi-level storage (tiering) is the technology of combining tiers with different performance, such as SSD and HDD, into a single disk pool. When accesses to data blocks are markedly uneven, the system can automatically rebalance the blocks, moving the busy ones to the high-performance tier and the cold ones, conversely, to the slower tier.

Hybrid systems of the lower and middle classes use tiering with data moved between tiers on a schedule, and even in the best models the size of the unit moved between tiers is 256 MB. These features do not allow tiering to be regarded as a performance-boosting technology, as many mistakenly believe. Tiering in lower- and middle-class systems is a technology for optimizing the cost of storage for systems with a markedly uneven load.

Snapshot

No matter how much we talk about storage reliability, there are many ways to lose data that have nothing to do with hardware problems: viruses, hackers, or any other unintentional deletion / corruption of data. For this reason, backing up production data is an integral part of an engineer's work.

A snapshot is an image of a volume at some point in time. When working with most systems, such as virtualization platforms, databases, and so on, we need to take such a snapshot and copy the data from it into a backup, while our IS can safely continue working with the volume. But it is worth remembering that not all snapshots are equally useful. Different vendors take different approaches to creating snapshots, depending on their architecture.

CoW (Copy-On-Write). When an attempt is made to write a data block, its original content is first copied to a special area, after which the write proceeds normally. This prevents data corruption inside the snapshot. Naturally, all these “parasitic” data manipulations put additional load on the storage system, and for this reason vendors with such an implementation recommend using no more than a dozen snapshots, and none at all on heavily loaded volumes.

RoW (Redirect-on-Write). In this case the original volume is effectively frozen, and when a data block is written, the storage system writes the data to a special area in free space and changes the location of that block in the metadata table. This reduces the number of rewrite operations, which ultimately eliminates the drop in performance and removes the restrictions on snapshots and their number.
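The difference between the two approaches is easiest to see by counting disk operations for the first overwrite of a block after a snapshot. The following is a toy model, not any vendor's implementation: a snapshot is assumed to exist from the moment the object is created, blocks are plain Python values, and metadata updates are not counted as I/O:

```python
class CowVolume:
    """Copy-on-write: preserve the old block elsewhere, then overwrite in place."""
    def __init__(self, blocks):
        self.blocks = list(blocks)   # live data, overwritten in place
        self.preserved = {}          # block index -> original content
        self.io_ops = 0

    def write(self, idx, data):
        if idx not in self.preserved:
            # First overwrite since the snapshot: read old block + write the copy.
            self.preserved[idx] = self.blocks[idx]
            self.io_ops += 2
        self.blocks[idx] = data
        self.io_ops += 1

    def snapshot_view(self):
        return [self.preserved.get(i, b) for i, b in enumerate(self.blocks)]


class RowVolume:
    """Redirect-on-write: write new data to free space and repoint the metadata."""
    def __init__(self, blocks):
        self.store = list(blocks)                  # physical blocks, append-only
        self.snap_map = list(range(len(blocks)))   # frozen at snapshot time
        self.live_map = list(self.snap_map)
        self.io_ops = 0

    def write(self, idx, data):
        self.store.append(data)                    # one write into free space
        self.io_ops += 1
        self.live_map[idx] = len(self.store) - 1   # metadata-only change

    def snapshot_view(self):
        return [self.store[p] for p in self.snap_map]

    def live_view(self):
        return [self.store[p] for p in self.live_map]


cow, row = CowVolume("abc"), RowVolume("abc")
cow.write(1, "B")
row.write(1, "B")
print("CoW disk ops:", cow.io_ops, "RoW disk ops:", row.io_ops)
```

In this model both variants keep the snapshot intact, but CoW pays three disk operations for the first overwrite of a block where RoW pays one.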

Snapshots are also of two types with respect to applications:

Application consistent. At the moment the snapshot is created, the storage system signals an agent in the consumer's operating system, which forcibly flushes disk caches from memory to disk and makes the application do the same. In this case the data will be consistent when restored from the snapshot.

Crash consistent. In this case nothing of the kind happens and the snapshot is created as is. Restoring from such a snapshot gives the same picture as a sudden power loss: some data that was stuck in caches and never reached the disk may be lost. Such snapshots are easier to implement and do not cause performance drops in applications, but they are less reliable.

What are storage-level snapshots for?

Agentless backup directly from the storage
Creating test environments based on real data
With file storage, they can be used to build VDI environments using storage snapshots instead of hypervisor snapshots
Ensuring low RPOs by creating scheduled snapshots with a frequency much higher than the backup frequency

Cloning

Volume cloning works on the same principle as snapshots, but the clone serves not only for reading the data but for full-fledged work with it. We can get an exact copy of our volume, with all the data on it, without making a physical copy, which saves space. Volume cloning is usually used either in Test & Dev or when you want to check how updates behave on your IS. Cloning lets you do this as quickly and as economically as possible in terms of disk resources, since only modified data blocks are written.

Replication / journaling

Replication is the mechanism for creating a copy of data on another physical storage system. Usually there is a proprietary technology for each vendor that works only within its own line. But there are also third-party solutions, including those running at the hypervisor level, such as VMware vSphere Replication.

The functionality and usability of proprietary technologies are usually far superior to those of universal ones, but they are inapplicable when, for example, you need to replicate from NetApp to HP MSA.

Replication comes in two types:

Synchronous. With synchronous replication, the write operation is forwarded to the second storage system immediately, and it is not acknowledged until the remote storage system confirms it. This increases access latency, but we get an exact mirror copy of the data. That is, RPO = 0 if the primary storage system is lost.

Asynchronous. Write operations are performed only on the primary storage system and are acknowledged immediately, while accumulating in a buffer for batch transfer to the remote storage system. This type of replication is relevant for less valuable data, or for links with low bandwidth or high latency (typical for distances over 100 km). Accordingly, RPO = the batch transfer interval.
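Why distance matters for synchronous replication can be estimated with simple arithmetic: every write has to wait for a round trip to the remote site. A back-of-the-envelope sketch, assuming about 5 microseconds of one-way delay per kilometer of fiber (our assumption, a commonly used approximation) and ignoring switching and remote processing time:

```python
FIBER_DELAY_MS_PER_KM = 0.005  # assumption: ~5 microseconds per km, one way

def sync_write_latency_ms(local_write_ms, distance_km):
    """Write latency seen by the host when every write waits for the remote ack."""
    round_trip_ms = 2 * distance_km * FIBER_DELAY_MS_PER_KM
    return local_write_ms + round_trip_ms

# A hypothetical 0.5 ms local write at various distances between sites.
for km in (10, 100, 1000):
    print(f"{km:>5} km: {sync_write_latency_ms(0.5, km):.2f} ms")
```

Under these assumptions 100 km adds about 1 ms to every write, which a 0.5 ms all-flash system will feel immediately; at 1000 km the link delay dominates completely, which is why asynchronous replication is used at such distances.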

Replication often comes with a mechanism for journaling disk operations. A special area is allocated for the journal, and write operations are kept there up to a certain depth in time or up to the journal's capacity. Some proprietary technologies, such as EMC RecoverPoint, integrate with system software, which makes it possible to attach bookmarks to specific journal entries. Thanks to this, you can roll back the state of a volume (or create a clone) not just to April 23, 11 hours 59 seconds 13 milliseconds, but to the moment just before “DROP ALL TABLES; COMMIT”.

Metro cluster

A metro cluster is a technology that creates bidirectional synchronous replication between two storage systems in such a way that from the outside the pair looks like a single storage system. It is used to build clusters with geographically separated sites at metro distances (less than 100 km).

Taking a virtualization environment as an example, a metrocluster lets you create a datastore with virtual machines that is writable from two data centers at once. A cluster is then created at the hypervisor level, consisting of hosts in different physical data centers that are connected to this datastore. This makes the following possible:

Full automation of recovery after the death of one of the data centers. Without any additional tools, all VMs that were running in the dead data center are automatically restarted in the surviving one. RTO = high availability cluster timeout (15 seconds for VMware) + operating system boot time and services start.
Disaster avoidance. If maintenance on the power supply is planned in data center 1, then in advance, before the work starts, we can migrate all the important load to data center 2 without stopping it.

Virtualization

Technically, storage virtualization is the use of volumes from another storage system as disks. A storage virtualizer can simply pass another system's volume through to the consumer as its own, mirror it to yet another storage system at the same time, or even build a RAID out of external volumes.
The classic representatives of the storage virtualization class are EMC VPLEX and IBM SVC. And, of course, there are storage systems with a virtualization function: NetApp, Hitachi, IBM / Lenovo Storwize.

Why might you need it?

Redundancy at the storage level. A mirror is created between volumes, with one half on HP 3Par and the other on NetApp, behind a virtualizer from EMC.
Moving data between storage systems of different vendors, for example from a 3Par to a Dell through VPLEX.

Compression / Deduplication

Compression and deduplication are technologies that let you save disk space on your storage system. It should be said right away that far from all data can be compressed and / or deduplicated at all; some types of data compress and deduplicate well, and others do not.

Compression and deduplication are of 2 types:

Inline - data blocks are compressed and deduplicated before they are written to disk. The system only computes the hash of the block and compares it against the table of existing ones. First, this is faster than simply writing to disk, and second, no extra disk space is wasted.

Post-process - these operations are performed on data that has already been written to the disks. The data is first written to disk, and only afterwards is the hash computed, the redundant blocks removed, and the disk space released.
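The inline variant described above comes down to a hash lookup before every write. A toy sketch of block-level deduplication, assuming fixed-size blocks and SHA-256 as the fingerprint (both are our assumptions; real systems differ in block size, hash choice, and collision handling):

```python
import hashlib

class DedupStore:
    """Toy inline deduplication: each unique block is stored only once."""
    def __init__(self):
        self.blocks = {}   # fingerprint -> block data (physical storage)
        self.volume = []   # logical volume: ordered list of fingerprints

    def write(self, block: bytes):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:      # only unseen blocks reach the "disk"
            self.blocks[digest] = block
        self.volume.append(digest)

    def read(self, index: int) -> bytes:
        return self.blocks[self.volume[index]]

    def ratio(self) -> float:
        """Logical blocks written per physical block stored."""
        return len(self.volume) / len(self.blocks)


store = DedupStore()
for block in (b"A" * 4096, b"A" * 4096, b"B" * 4096, b"A" * 4096):
    store.write(block)
print(f"logical blocks: {len(store.volume)}, physical: {len(store.blocks)}, "
      f"ratio: {store.ratio():.1f}:1")
```

A post-process implementation would use the same table but fill it in later, scanning blocks that are already on disk.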

It should be said that most vendors use both types, which lets them optimize these processes and thereby increase their efficiency. Most storage vendors have utilities for analyzing your data sets. These utilities follow the same logic that is implemented in the storage system, so the estimated efficiency will match the real one. Also, do not forget that many vendors have efficiency guarantee programs that promise a ratio no lower than stated for certain (or all) data types. Do not neglect these programs: by sizing the system for your tasks with the efficiency ratio of the particular system taken into account, you can save on capacity. It is also worth bearing in mind that these programs are designed for AFA systems; but because you buy a smaller volume of SSDs than you would of HDDs in a classic system, the cost goes down and comes very close to the cost of a disk system, if it does not match it.
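The saving from a guaranteed efficiency ratio is straightforward to estimate. A minimal sketch; the 3:1 ratio below is a made-up example, not a figure from any vendor's program:

```python
def physical_tb_needed(logical_tb, efficiency_ratio):
    """Physical capacity to buy for a given logical volume and reduction ratio."""
    if efficiency_ratio <= 0:
        raise ValueError("efficiency ratio must be positive")
    return logical_tb / efficiency_ratio

# 150 TB of logical data at an assumed guaranteed ratio of 3:1.
print(physical_tb_needed(150, 3.0), "TB of physical capacity")
```

For incompressible data (video, photos) the ratio is effectively 1:1 and the saving disappears, so the ratio has to be taken for your actual data types.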

Model

And here we come to the right question.

“I have been offered two storage options - ABC SuperStorage S600 and XYZ HyperOcean 666v4, what do you recommend?”

turns into “I have been offered two storage options - ABC SuperStorage S600 and XYZ HyperOcean 666v4, what do you advise?

The target load is VMware VMs from the production / test / development environments. Test = production: 150 TB each, with a peak performance of 80,000 IOPS, 8 KB blocks, 50% random access, 80/20 read / write. 300 TB for development; 50,000 IOPS is enough there, 80% random, 80% writes.

Production is supposed to go into a metrocluster with RPO = 15 minutes and RTO = 1 hour, development into asynchronous replication with RPO = 3 hours, and test stays on one site.

There will also be 50 TB of DBMS data; journaling would be nice for them.

We have Dell servers everywhere, the old Hitachi storage systems can barely cope, and we plan a 50% increase in load in terms of both capacity and performance.”

As they say, a properly formulated question contains 80% of the answer.
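The numbers in such a question can be turned into first-order sizing figures right away. A rough sketch, assuming that 8 KB means 8 KiB and that the planned 50% growth applies directly to peak IOPS; the helper names are illustrative:

```python
def throughput_mib_s(iops, block_kib):
    """Bandwidth implied by an IOPS figure at a given block size."""
    return iops * block_kib / 1024

def with_growth(value, growth_fraction):
    """The same figure after the planned growth (0.5 means +50%)."""
    return value * (1 + growth_fraction)

peak_iops = 80_000   # production peak from the question
block_kib = 8        # 8 KB blocks, read here as 8 KiB

print(f"today:   {throughput_mib_s(peak_iops, block_kib):.1f} MiB/s")
future_iops = with_growth(peak_iops, 0.5)
print(f"planned: {future_iops:.0f} IOPS, "
      f"{throughput_mib_s(future_iops, block_kib):.1f} MiB/s")
```

Figures like these are what a vendor or integrator can actually size against, unlike the two model names on their own.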

Additional Information

What to read next, in the authors' opinion

Books

Olifer and Olifer, “Computer networks”. The book will help you systematize and perhaps better understand how the data transmission medium for IP / Ethernet storage systems works.
“EMC Information Storage and Management”. An excellent book on the basics of storage: what for, how, and why.

Forums and Chats

General recommendations

Prices

Now, as for prices: in general, if storage prices can be found at all, they are usually list prices, from which each customer receives an individual discount. The size of the discount depends on a large number of parameters, so it is impossible to predict the final price for your company without asking the distributor. At the same time, low-end models have recently started to appear in ordinary computer stores, such as nix.ru or xcom-shop.ru. There you can buy the system you are interested in right away at a fixed price, like any computer component.

But we want to note right away that a direct TB / $ comparison is not valid. From that point of view the cheapest solution will be the simplest JBOD + server, which gives neither the flexibility nor the reliability of a full-featured dual-controller storage system. This does not mean at all that JBOD is a filthy abomination; once again, you just need to understand very clearly how and for what purpose you will use the solution. You can often hear that there is nothing to break in a JBOD, it is just a backplane. However, backplanes can fail too. Everything breaks sooner or later.

Summary

Systems should be compared with each other not by price alone or by performance alone, but by the totality of all indicators.

Buy HDDs only if you are sure that you need HDDs: for low loads and incompressible data types. Otherwise, look at the SSD storage efficiency guarantee programs that most vendors now have (and they really do work, even in Russia), although everything depends on the applications and data that will live on the storage system.

Do not chase a low price. Sometimes it hides a lot of unpleasant surprises, one of which Yevgeny Elizarov described in his articles about Infortrend, and in the end that low price can backfire on you. Do not forget: “the miser pays twice.”

Source: https://habr.com/ru/post/457956/
