How to ensure reliable storage of large amounts of data within a moderate budget

Good day, Habrahabr! Today we'll talk about how data storage requirements are changing and why the traditional systems we trust can no longer keep up with capacity growth while still guaranteeing reliable storage. This is my first post after a long break, so let me introduce myself just in case: I am Oleg Mikhalsky, Director of Products at Acronis.

If you follow industry trends, you have probably already come across the term "software defined anything". The concept is to move the key functions of the IT infrastructure (scalability, manageability, reliability and interaction with its other parts) to the software level. Gartner names Software Defined Anything among the 10 key trends of 2014, and IDC has already published a dedicated review of the Software Defined Storage segment, predicting that by 2015 commercial solutions of this type alone will be bought to the tune of $1.8 billion. It is storage of this new type that will be discussed below.


First, let's turn to data-growth statistics and draw some conclusions. A few years ago the amount of data created worldwide passed 1 zettabyte (roughly a billion full 1 TB hard drives), and it already exceeds all the storage space available today. According to the forecast of EMC, the world leader in the storage market, data volumes will grow another 50 times over the current decade, creating a storage-space shortfall of more than 60%.


Fig.: The shortage of space for storing generated information keeps growing
Source: IDC, The Digital Universe Decade - Are You Ready? (2010)

How much and why?


What are the reasons behind this avalanche-like growth of data volumes:


The shortage of storage space, in turn, is explained by the fact that hardware storage systems evolved for a long time under the motto "faster, higher, stronger": from tape to larger disks, faster disks, flash drives, and multi-shelf systems combining drives of different types and speeds. Storage optimization was tailored to the needs of companies with large budgets: a fast tier for virtualization, a super-fast tier for real-time data processing, a smart tier optimized for specific business applications. Meanwhile, backups, archives and logs, which do not directly create business value and simply take up space, seem to have been forgotten by customers, and storage vendors never got around to them (try to name a vendor's hardware storage system that is marketed specifically as "the cheapest and most reliable storage for your backup data").

You're doing it WRONG


From practice I know of cases where hundreds of terabytes of backups and logs are kept on branded vendors' shelves designed for online storage of business application data, or, at the other extreme, on a home-grown JBOD of several petabytes, half of which is a full second copy "for reliability". The result is a paradox: the cost of storing the data (around 10-15 cents per gigabyte per month) is several times higher than storing it in the Amazon cloud, the hardware's data-processing capabilities go unused, and the reliability required for backups and long-term storage is, on the contrary, not provided (more on reliability below). In the case of the JBOD, the cost of supporting and expanding it keeps growing as well. But, as noted above, for a long time nobody got around to this problem.

Development in the right direction


Not surprisingly, the first to notice the problem were developers and engineers working directly with large data sets, for example at Google and Facebook, or in scientific experiments such as the famous Large Hadron Collider. They began to solve it with the software tools at hand and then to share their results in publications and at conferences. This is perhaps part of the reason why the Storage segment of Software Defined Anything quickly filled up with a large number of open-source projects, as well as startups offering highly specialized solutions for one particular type of problem, yet once again bypassing backups and long-term archives.

Storage reliability is in the title of this article, so let us now look at why storing large amounts of data on conventional storage systems becomes not only difficult as data grows, but also dangerous. This is especially important for backups and logs (which, by the way, include video surveillance archives): they are needed rarely, but on very important occasions, for example to conduct an investigation. The point is that in traditional storage systems, the more data there is, the higher both the storage costs and the risk of losing data to a hardware failure.

Calculations and entertaining statistics


It has been established that, on average, hard drives fail with a probability of 5-8% per year (Google data). For a storage system with a capacity of one petabyte this means several disk failures per month, and at 10 petabytes disks may fail every day.
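To make the scale concrete, here is a quick back-of-the-envelope estimate of my own (not from the Google report), assuming 1 TB drives and a 6.5% annual failure rate, the middle of the range quoted above:

# Rough estimate of disk failures in a large fleet; the 1 TB drive size and
# 6.5% annual failure rate are assumptions for illustration only.
def expected_failures_per_month(capacity_tb, disk_size_tb=1.0, annual_failure_rate=0.065):
    n_disks = capacity_tb / disk_size_tb
    return n_disks * annual_failure_rate / 12

for capacity_tb in (1_000, 10_000):  # 1 PB and 10 PB, expressed in terabytes
    per_month = expected_failures_per_month(capacity_tb)
    print(f"{capacity_tb // 1000} PB: ~{per_month:.1f} disk failures per month")
# 1 PB:  ~5.4 failures per month (several per month)
# 10 PB: ~54.2 failures per month (more than one per day)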



Fig.: How hard drives fail (Google data)

Example: with RAID 5 and an unrecoverable read error probability of 10^-15 per bit, real data may be lost on roughly every 26th rebuild, i.e. every few months. And if the system contains 10 thousand disks with a mean time between failures of 600 thousand hours per disk, rebuilds will be needed every few days (based on data from an Oracle article).
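These numbers can be reproduced with a short sketch (mine, not from the Oracle article); the fleet size and per-disk MTBF are taken from the example above, while the six-disk RAID 5 group of 1 TB drives is a hypothetical assumption:

import math

MTBF_HOURS = 600_000   # mean time between failures for a single disk (from the example)
N_DISKS = 10_000       # fleet size (from the example)
URE_PER_BIT = 1e-15    # unrecoverable read error probability per bit read

# With N independent disks, the fleet as a whole hits a failure roughly N times as often.
hours_between_rebuilds = MTBF_HOURS / N_DISKS
print(f"A rebuild is needed roughly every {hours_between_rebuilds / 24:.1f} days")  # ~2.5 days

# Rebuilding one failed disk in RAID 5 requires reading all surviving disks.
# Hypothetical group: five surviving 1 TB disks, so about 5 TB (4e13 bits) must be read.
bits_read = 5 * 1e12 * 8
p_ure = -math.expm1(bits_read * math.log1p(-URE_PER_BIT))  # 1 - (1 - p)^bits, computed stably
print(f"P(read error during rebuild) ≈ {p_ure:.3f}, i.e. about one rebuild in {1 / p_ure:.0f}")
# ~0.039, i.e. roughly every 26th rebuild hits an unrecoverable error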

It should also be noted that RAID-based systems rebuild failed disks with limitations, and the rebuild time depends on disk size: the larger the disk, the longer the rebuild and the higher the chance of a second failure leading to data loss. Thus, as disk sizes and total storage capacity grow, reliability goes down. In addition, there are errors that are not detected at the RAID level at all. For those who want more detail, an excellent overview of RAID's problems was published on Habr here.

Add to this that, according to a NetApp study, on average one in 90 disks has hidden damage (checksum mismatches, misdirected writes or incorrect parity bits) that traditional storage systems do not detect. Another study shows that traditional file systems cannot detect such errors either. The probability of even the most common of these error types is small, but as the data array grows, so does the probability of loss. Storage is no longer safe.
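A simple sketch (my own arithmetic, using only the 1-in-90 figure above) shows how quickly the chance that at least one disk in an array carries such hidden damage grows with the disk count:

P_SILENT = 1 / 90   # probability that a single disk has hidden damage (NetApp figure above)

for n_disks in (10, 100, 1000):
    # Probability that at least one disk in the array is silently damaged
    p_any = 1 - (1 - P_SILENT) ** n_disks
    print(f"{n_disks:4d} disks: ≈ {p_any:.0%}")
#   10 disks: ≈ 11%
#  100 disks: ≈ 67%
# 1000 disks: ≈ 100%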

Hardware reliability, sufficient for limited amounts of data, is not enough to store hundreds of terabytes and petabytes safely.

Software defined storage


It was from these premises, and from the accumulated experience of working with ever-growing data volumes, that the concept of Software Defined Storage began to develop. The first developments in this area did not put any single problem, such as reliability, at the forefront. Guided by the needs of their own projects, Google's developers, for example, tried to solve several problems at once: scalability, availability, performance and, among other things, reliability when storing large amounts of data on inexpensive commodity components, such as desktop-class hard drives and no-name chassis, which fail more often than expensive branded hardware.

It is for this reason that the Google File System (GFS) can be considered, in a way, the progenitor of the class of solutions discussed below. Other teams, such as the open-source projects Gluster (later acquired by Red Hat) and Ceph (now backed by Inktank), focused primarily on high-performance data access. The list would be incomplete without HDFS (the Hadoop file system), which grew out of Google's work and is aimed at high-performance data processing. The list could be continued, but a detailed review of existing technologies is beyond the scope of this article. I will only note that optimizing long-term storage as such was not a priority; it was addressed, as it were, in passing, while optimizing the cost of the solution as a whole.



Clearly, building a commercial solution on top of open source is a difficult and risky undertaking; only a large company or a system integrator with enough expertise and resources to install, integrate and support open-source code that is hard to work with, and with sufficient motivation to do so, can afford it. But, as mentioned above, commercial vendors' main motivation lies in high-budget areas such as high-speed storage for virtualization or parallel data processing.

Turnkey solutions


The closest to solving the problem of inexpensive and reliable storage were startups focused on cloud backup, but many of them have already fallen by the wayside, while others were absorbed by large companies and stopped investing in the technology. The furthest advanced were vendors such as Backblaze and Carbonite, which bet on deploying cloud storage in their own data centers on top of commodity components and managed to gain a foothold in the market with their cloud services. However, given the extremely tough competition in their core market, they do not actively promote their storage technology as a standalone Software Defined Storage solution: first, so as not to create competitors for themselves, and second, so as not to spread their resources across entirely different lines of business.

As a result, storage administrators responsible for keeping backups, logs, video surveillance archives, TV footage and voice call recordings face a dilemma. On the one hand, there are convenient but expensive solutions that, budget permitting, easily cover current needs of 100-150 TB. And it will be safe, as the industry saying goes: nobody has ever been fired for buying hardware from a big-name vendor. But as soon as capacity crosses the 150-200 TB threshold, further expansion becomes a problem: to merge all the hardware into a single file system, freely redistribute space or upgrade to larger drives, you need additional migration effort, expensive components and specialized "storage virtualization" software. Over time the total cost of ownership of such a system becomes far from optimal for "cold data".

The other alternative is to build the storage yourself on top of Linux and a JBOD. This may suit a specialized company, such as a hosting or telecom provider, with experienced, qualified specialists who take responsibility for the performance and reliability of their own solution. An ordinary small or medium-sized company whose core business is not data storage most likely has the budget for neither expensive hardware nor such specialists.

For such companies an interesting alternative could be Acronis's own development, Acronis Storage: a software solution that lets you quickly deploy a highly reliable and easily expandable storage system on inexpensive commodity chassis and disks. The disks can be combined arbitrarily, replaced one by one in a "hot" system, and capacity can be grown in arbitrary increments from a few terabytes to tens or hundreds of terabytes, requiring essentially nothing more than PC assembly skills and an intuitive web interface, accessible to a non-specialist, for configuring and monitoring the whole storage system and its individual nodes and disks. This product grew out of an internal Acronis startup that built cloud storage for backups, which has since grown to several petabytes across three data centers.

Summing up


An overview of approaches to storing large amounts of data would not be complete without mentioning solutions that are based on software but come to market as hardware-software appliances. In some cases this makes it possible to deploy a solution quickly and may suit a not-very-large company with limited resources. But a predefined hardware configuration limits fine-tuning, and naturally sets a higher price point than a pure software solution, since the hardware is already included. And, of course, this approach inherits many of the limitations of dedicated hardware storage systems when it comes to upgrading an individual server (scale-up by replacing disks with larger and faster ones, or the network with a faster one).

In conclusion, let us turn once more to storage industry analysts and draw a few conclusions. According to the Forrester Forrsights Hardware Survey conducted at the end of 2012, 20% of companies were already backing up volumes of 100 TB per year, and the difficulty of expanding storage for backups had become a problem for 42% of respondents. Companies differ, of course, but these figures give specialists a reason to think about long-term planning of the storage capacity their organization may need over the next few years. Assuming companies are roughly similar in how they store backups, almost half of them will face the problem of optimizing backup storage, and possibly other cold data, in the coming years. The data on traditional RAID-based storage suggests that, to improve reliability and optimize the cost of storing "cold data", new Software Defined Storage solutions should be included in the selection process: they scale better and give administrators more flexibility and freedom of choice in maintaining and expanding the storage.

Source: https://habr.com/ru/post/215007/

