
Ceph as pluggable storage: 5 practical conclusions from a large project

With data volumes growing the way they are, software-defined and distributed storage systems come up more and more often, and the open platform Ceph traditionally gets much of the attention. Today we want to share the conclusions we reached while implementing a data storage project for a large Russian government agency.







When it comes to storing data of different types, a distributed storage system naturally comes to mind. In theory, such solutions have many advantages: you can use any disks, the system runs on any servers (even very old ones), and there are practically no limits to scaling. That is why, several years ago, one of the major Russian government agencies, with branches not only in every region of the Russian Federation but in every more or less large city, set out to deploy such a system.



After analyzing the available solutions, the choice fell on Ceph, for a number of reasons:

• Ceph is a fairly mature product, and there are already Ceph installations holding petabytes of data.

• A large community (which includes us) drives its development, which means new features and improvements for the storage will keep appearing.

• Ceph already has a good API with support for various programming languages. This was important because the product obviously had to be extended to meet the customer's requirements and expectations.

• The licenses cost nothing. Of course, the system still needs customization, but specific customer tasks would have required additional development anyway, so why not do it on top of a free product?

• Finally, the sanctions. State-owned companies must be insured against someone deciding, at any moment, to impose restrictions on them, so relying on foreign and especially American products is risky. Open Source is another matter.


Practical conclusions

The rollout of Ceph proceeded gradually over several months. First the storage was launched in the central region, and then we replicated the solution, connecting regional data centers. With each new network node, storage capacity grew, even as data flows within the system increased to carry information from region to region.

A characteristic of any large organization is the need to keep heterogeneous information, much of it binary files. As practice shows, employees simply have no time to figure out what kind of files they are, categorize them, and process them in a timely manner: information accumulates faster than that. So, to avoid losing data that may be important for day-to-day operations, it has to be stored properly, for example in a distributed storage system.

In the course of implementing this project, we reached several conclusions about using Ceph:



Conclusion 1: Ceph completely replaces all backup solutions.

As practice has shown, most unstructured information is not backed up at all, because doing so is extremely difficult. With Ceph, backup comes almost "as a bonus": during setup we simply set the replication parameters, that is, the number of copies and their placement. If the customer has several data centers, the result is a disaster-tolerant configuration that needs no additional backups: with 3-4 copies of the data on different drives and servers, such a system is guaranteed to work better than any hardware solution, at least when it comes to large data volumes and geographically distributed systems.
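As a sketch, replication of this kind is configured per pool. The pool name `docstore` and the exact numbers below are our illustrative assumptions; the right values depend on how many data centers and servers are available:

```shell
# Create a pool for documents (128 placement groups is an arbitrary example).
ceph osd pool create docstore 128

# Keep 3 copies of every object; keep serving I/O as long as 2 copies are alive.
ceph osd pool set docstore size 3
ceph osd pool set docstore min_size 2
```

Where the copies physically land is governed by the CRUSH map, so with several data centers the replicas can be forced onto different sites.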



Conclusion 2: On large installations, Ceph performance equals network performance 99% of the time.

When we transferred data from a PostgreSQL database (more on that below) to Ceph-based storage, the "fill" speed in most cases matched the bandwidth of the data network. Where it did not, reconfiguring Ceph made it possible to reach that speed. Of course, we are not talking about 100 Gbps links here, but on the data channels typical of geographically distributed infrastructures it is quite possible to bring Ceph performance up to 10 Mbps, 100 Mbps, or 1 Gbps. It is enough to distribute the disks correctly and set up data clustering.
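A quick way to check whether the network or the disks are the bottleneck is the benchmark tool that ships with Ceph; `docstore` is again a hypothetical pool name:

```shell
# Write test: push 4 MB objects into the pool for 10 seconds and report throughput.
rados bench -p docstore 10 write --no-cleanup

# Sequential read test over the objects just written, then remove them.
rados bench -p docstore 10 seq
rados -p docstore cleanup
```

If the reported bandwidth is close to the link speed, the cluster configuration is not the limiting factor.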



Conclusion 3: The main thing is to configure Ceph correctly, taking into account the specifics of the organization's activity

Speaking of settings: the largest share of Ceph expertise is needed at the system configuration stage. Besides replication parameters, the solution lets you specify access levels, data retention rules, and so on. For example, if we have mini computing centers across Russia, we can organize fast access to documents and files created in a given region, along with access to all corporate documents from anywhere. The latter works with somewhat higher latency and lower speed, but such "concentration" of information at its place of ownership creates optimal conditions for the organization's work.
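Placement "by region" is expressed through CRUSH rules. A hedged sketch, assuming the CRUSH map already defines a `datacenter` level and a Luminous-or-later release; the rule and pool names are ours:

```shell
# A rule that spreads replicas across different data centers.
ceph osd crush rule create-replicated replicated_dc default datacenter

# Point the pool at the new rule.
ceph osd pool set docstore crush_rule replicated_dc
```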



Conclusion 4: Once configured, Ceph can be managed by any Linux administrator.

Perhaps one of Ceph's most pleasant features is that, once configured, the system runs without unnecessary human involvement. It turned out that in the remote mini data centers it is enough to keep a Linux administrator on staff, since supporting the local Ceph segment requires no additional knowledge.
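In practice, day-to-day supervision boils down to a few standard commands that any Linux administrator can run:

```shell
ceph -s              # overall cluster state: health, monitor/OSD counts, capacity
ceph health detail   # what exactly is wrong, if anything
ceph osd tree        # which OSDs are up or down, and on which hosts
ceph df              # per-pool space usage
```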



Conclusion 5: Supplementing Ceph with an external indexing system makes the storage convenient for contextual search.

As you know, Ceph has no built-in index that could be used for search by context. So, when adding an object to the storage, you can save metadata alongside it to serve as an index. The metadata is fairly small, so an ordinary relational DBMS copes with it easily. Yes, this is an extra system, but the approach makes it possible to quickly find information by context among huge amounts of unstructured data.
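A minimal sketch of this approach. All names here are ours, SQLite stands in for the relational DBMS, and the write of the object itself into Ceph is out of scope; only the object's name in the pool is recorded:

```python
import sqlite3

# The index table: one row of metadata per object stored in Ceph.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE obj_index (
    ceph_key    TEXT PRIMARY KEY,  -- object name in the Ceph pool
    filename    TEXT,
    mime_type   TEXT,
    description TEXT               -- free text used for contextual search
)""")

def index_object(ceph_key, filename, mime_type, description):
    """Record metadata after the object itself has been written to Ceph."""
    db.execute("INSERT INTO obj_index VALUES (?, ?, ?, ?)",
               (ceph_key, filename, mime_type, description))

def search(term):
    """Return Ceph keys of objects whose metadata mentions the term."""
    rows = db.execute(
        "SELECT ceph_key FROM obj_index WHERE description LIKE ? OR filename LIKE ?",
        (f"%{term}%", f"%{term}%"))
    return [r[0] for r in rows]

index_object("doc-001", "contract_2017.pdf", "application/pdf", "supply contract, region 77")
index_object("doc-002", "scan_042.tif", "image/tiff", "scanned invoice")
print(search("contract"))  # -> ['doc-001']
```

A found `ceph_key` is then used to fetch the object itself from the storage.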







A few words about data transfer

A large project involves many stages, but for us the most interesting was probably the process of moving huge amounts of data from PostgreSQL to the new storage. After launching Ceph, we faced the task of migrating data from numerous databases without stopping services and business processes, while ensuring data integrity.

To do this, we had to contribute to the Ceph open source ecosystem and create the migration module pg_rbytea, whose source code is available at https://github.com/val5244/pg_rbytea. The essence of the solution is to transfer data from the specified database to the Ceph storage in parallel with its normal operation. The module makes it possible to migrate data without stopping the database, using the RADOS object storage abstraction, which Ceph supports natively. Incidentally, we gave a talk about this at PG Conf in early 2018 ( https://pgconf.ru/2018/107082 ).
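The real pg_rbytea module works at a different level, but the general idea of such a migration can be sketched client-side. All names here are hypothetical, and an in-memory writer stands in for the RADOS pool so the sketch is self-contained:

```python
def migrate_rows(rows, write_object, prefix="pg"):
    """Stream (id, bytea payload) rows and store each payload as one object.

    rows         -- any iterable, e.g. a psycopg2 server-side cursor, so the
                    table is never loaded into memory and the DB keeps running
    write_object -- callable (name, data); with real Ceph this would wrap
                    ioctx.write_full() from the python-rados bindings
    """
    count = 0
    for row_id, payload in rows:
        write_object(f"{prefix}_{row_id}", payload)
        count += 1
    return count

# In-memory stand-in for a RADOS pool.
pool = {}
migrated = migrate_rows([(1, b"\x89PNG..."), (2, b"%PDF-...")],
                        lambda name, data: pool.__setitem__(name, data))
print(migrated, sorted(pool))  # -> 2 ['pg_1', 'pg_2']
```

Because the rows are streamed rather than dumped, the source database stays online for the duration of the transfer.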

At the first stage, the various binary data needed for the day-to-day work of the agency's departments was moved to the storage: essentially all those files and objects that are hard to store elsewhere because of their huge total volume and fuzzy structure. Next, the plan is to move to Ceph various media content, the originals of documents created before recognition, and attachments from corporate mail.

To make all of this work on top of the storage, RESTful services were developed, which made it possible to integrate Ceph into the customer's systems. Here again the convenient API played its part, allowing us to build a pluggable service for various information systems. And so Ceph became the main storage, claiming ever new volumes and types of information within the organization.
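Such a RESTful layer essentially reduces to a handful of HTTP verbs mapped onto object-store operations. A toy sketch (the names are ours, not the project's), with a dict standing in for the Ceph pool:

```python
def handle(store, method, key, body=None):
    """Map REST verbs onto object-store operations; returns (status, payload)."""
    if method == "PUT":                 # create or overwrite an object
        store[key] = body
        return 201, b""
    if method == "GET":                 # fetch an object if it exists
        return (200, store[key]) if key in store else (404, b"")
    if method == "DELETE":              # remove an object if it exists
        return (204, b"") if store.pop(key, None) is not None else (404, b"")
    return 405, b""                     # anything else is not allowed

store = {}
print(handle(store, "PUT", "doc-001", b"hello"))  # -> (201, b'')
print(handle(store, "GET", "doc-001"))            # -> (200, b'hello')
print(handle(store, "DELETE", "doc-001"))         # -> (204, b'')
print(handle(store, "GET", "doc-001"))            # -> (404, b'')
```

In a real service the dict operations would be replaced with calls into the Ceph API, while the client-facing contract stays the same.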



Conclusion

There are various distributed storage systems on the market, including commercial solutions and other open source products. Some use special optimizations, others rely on compression or erasure coding. In practice, however, we became convinced that Ceph is ideally suited to truly distributed environments and huge storage installations: in that case system performance is limited only by the speed of the communication links, and you save a great deal on licenses priced per server or per volume of data (depending on which product you compare against). A well-tuned Ceph system delivers optimal performance with minimal supervision from local administrators, and that is a serious advantage in a geographically distributed deployment.

Source: https://habr.com/ru/post/417613/
