
Ceph: Cloud Storage without compromise

Hello, dear readers!

Modern clouds used for hosting set a high bar for the storage infrastructure. For a client to receive a quality service, the storage system has to combine a whole set of properties at once.

Neither RAID arrays nor proprietary hardware storage appliances can solve all of these tasks at once. That is why software-defined storage (SDS) is becoming increasingly common in the hosting industry. One of the most prominent representatives of SDS is the distributed storage system Ceph.

We decided to write about this remarkable product, which is used at CERN, 2GIS, Mail.ru and in our own cloud hosting.

What is Ceph?


Ceph is a fault-tolerant distributed data store that operates over TCP. One of its basic properties is scalability to petabyte sizes. Ceph offers a choice of three abstractions for working with the storage: object storage (RADOS Gateway), a block device (RADOS Block Device) and a POSIX-compatible file system (CephFS).
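
To make the idea of "storing objects" concrete, here is a minimal sketch of talking to RADOS directly through the python-rados bindings; the pool name 'data' and the path to ceph.conf are assumptions for the example, not part of any particular setup.

    import rados

    # connect to the cluster using a client configuration file (path assumed)
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')                # the pool must already exist
        try:
            ioctx.write_full('greeting', b'hello ceph')   # store an object
            print(ioctx.read('greeting'))                 # read it back
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()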
Object Storage Abstraction

The object storage abstraction (RADOS Gateway, or RGW), together with a FastCGI server, allows Ceph to be used for storing user objects and provides an API compatible with S3 / Swift. Habr already has an article on how to quickly configure Ceph and RGW. In object storage mode, Ceph has long been successfully used in production by a number of companies.
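
As a hedged illustration of the S3-compatible side of RGW, the sketch below points a regular boto3 S3 client at an RGW endpoint; the endpoint URL, credentials and bucket name are placeholders invented for the example.

    import boto3

    # point a standard S3 client at the RGW endpoint (URL and keys are placeholders)
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='backups')
    s3.put_object(Bucket='backups', Key='dump.sql.gz', Body=b'...')
    objects = s3.list_objects_v2(Bucket='backups').get('Contents', [])
    print([o['Key'] for o in objects])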

Block device abstraction

The block device abstraction (originally RADOS Block Device, or RBD) allows the user to create and use virtual block devices of arbitrary size. The RBD programming interface lets you work with these devices in read/write mode and perform service operations: resizing, cloning, creating a snapshot and rolling back to it, and so on.
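
A minimal sketch of those service operations through the python-rbd bindings; the pool name 'rbd', the image name and the sizes are invented for the example.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    rbd.RBD().create(ioctx, 'disk-01', 10 * 1024 ** 3)   # create a 10 GiB image
    image = rbd.Image(ioctx, 'disk-01')
    image.write(b'\x00' * 4096, 0)                       # write 4 KiB at offset 0
    image.create_snap('before-upgrade')                  # take a snapshot
    image.resize(20 * 1024 ** 3)                         # grow the image to 20 GiB
    image.close()

    ioctx.close()
    cluster.shutdown()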

The QEMU hypervisor contains a driver for working with Ceph and gives virtual machines access to RBD block devices. As a result, Ceph is now supported in all popular cloud orchestration solutions: OpenStack, CloudStack, Proxmox. RBD is also ready for production use.

Abstraction of a POSIX-compatible file system

CephFS is a POSIX-compatible file system that uses Ceph as its storage. Even though CephFS is not yet production-ready and has no significant industrial deployments, Habr already has a guide for setting it up.

Terminology


Below are the main entities of Ceph.

Metadata server (MDS) is an auxiliary daemon that keeps the state of files consistent at CephFS mount points. It works in an active copy + standbys scheme, with only one active copy per cluster.

Mon (Monitor) is the element of the Ceph infrastructure that provides addressing of data within the cluster and stores information about the topology, state and distribution of data within the store. A client that wants to access an RBD block device or a file on a mounted CephFS first obtains from a monitor the name and position of the RBD header, a special object that describes the position of the other objects belonging to the requested abstraction (block device or file), and then communicates directly with all the OSDs involved in storing that data.

Object is a block of data of a fixed size (4 MB by default). All information stored in Ceph is quantized into such blocks. To avoid confusion, we emphasize that these are not the user objects from Object Storage, but the objects used for the internal representation of data in Ceph.
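
A back-of-the-envelope illustration of this quantization (the 40 GiB volume size is just an example):

    # how many 4 MB internal objects a volume of a given size is split into
    OBJECT_SIZE = 4 * 1024 ** 2                  # default object size

    def object_count(volume_bytes, object_size=OBJECT_SIZE):
        return -(-volume_bytes // object_size)   # ceiling division

    print(object_count(40 * 1024 ** 3))          # a 40 GiB RBD image -> 10240 objects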

OSD (object storage daemon) is the entity responsible for storing data, the main building block of a Ceph cluster. A single physical server can run several OSDs, each with its own separate physical data store.

An OSD map (osdmap) is a map that assigns each placement group a set consisting of one Primary OSD and one or more Replica OSDs. The distribution of placement groups (PGs) across the OSD storage nodes is described by the osdmap, which records the positions of all PGs and their replicas. Every change in the placement of PGs within the cluster is accompanied by the release of a new osdmap, which is distributed to all participants.

A Placement Group (PG) is a logical group that unites a set of objects; it exists to simplify addressing and synchronization of objects. Each object belongs to exactly one placement group. The number of objects in a placement group is not fixed and may change over time.
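
A deliberately simplified sketch of this addressing scheme: an object name is hashed into one of pg_num placement groups, and the PG, not the object itself, is then looked up in the OSD map. Real Ceph uses its own hash function and CRUSH rather than Python's crc32 and a toy dictionary; all names and numbers below are invented.

    import zlib

    PG_NUM = 128                              # number of placement groups (example value)

    def pg_for_object(name, pg_num=PG_NUM):
        # map an object name onto a placement group
        return zlib.crc32(name.encode()) % pg_num

    # a toy "OSD map": each PG id points at [Primary OSD, Replica OSD, Replica OSD]
    osd_map = {pg: [(pg + i) % 12 for i in range(3)] for pg in range(PG_NUM)}

    pg = pg_for_object('rbd_data.1234.000000000005')
    print(pg, osd_map[pg])                    # the PG and the OSDs holding its copies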

Primary OSD is the OSD selected as Primary for a given placement group. Client I/O is always served by the OSD that is Primary for the placement group containing the data block (object) the client is interested in. The Primary OSD asynchronously replicates all data to the Replica OSDs.

RADOS Gateway (RGW) is an auxiliary daemon that acts as a proxy for the supported object storage APIs. It supports geographically distributed installations (across different pools, or, in Swift terms, regions) and active-backup operation within a single pool.

Replica OSD (Secondary) is an OSD that is not Primary for a given placement group and is used for replication. Clients never communicate with Replica OSDs directly.

Replication factor (RF) - data storage redundancy. The replication factor is an integer and shows how many copies of the same object a cluster stores.

Ceph architecture


[Diagram: Ceph architecture]

There are two main types of daemons in Ceph: monitors (MON) and storage nodes (OSD). The RGW and MDS daemons do not participate in data distribution; they are auxiliary services. Monitors form a quorum and communicate via a PAXOS-like protocol. The cluster remains operational as long as the majority of the configured quorum members are present; in a split-brain situation where the cluster is divided down the middle with an even number of participants, only one half will survive, because the previous PAXOS elections automatically reduce the number of active participants to an odd number. If the majority of the quorum is lost, the cluster is "frozen" for any operations, preventing possible inconsistency of written or read data until the minimal quorum is restored.
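
The quorum rule itself boils down to a strict majority check, as in this tiny sketch (the monitor counts are just examples):

    def quorum_alive(alive_mons, total_mons):
        # the cluster stays operational only with a strict majority of monitors
        return alive_mons > total_mons // 2

    print(quorum_alive(2, 3))   # True:  one of three monitors lost, still a majority
    print(quorum_alive(1, 3))   # False: I/O freezes until quorum is restored
    print(quorum_alive(2, 4))   # False: an even split cannot form a majority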

Data Recovery and Rebalance


The loss of one of the copies of an object puts the object, and the placement group containing it, into a degraded state and triggers the release of a new OSD map (osdmap). The new map records the new location of the lost copy, and if the lost copy does not reappear within a configured time, the missing copy is recreated elsewhere so as to maintain the number of copies defined by the replication factor. Operations in flight at the moment of such a failure automatically switch to one of the available copies; in the worst case their delay is measured in single seconds.

An important feature of Ceph is that all cluster rebalancing operations happen in the background, in parallel with user I/O. If a client accesses an object that is in the recovering state, Ceph restores that object and its copies out of turn and only then serves the client's request. This approach keeps I/O latency minimal even when cluster recovery is in full swing.

Cluster data distribution


One of the most important features of Ceph is the ability to fine-tune replica placement with CRUSH rules, a powerful and flexible mechanism that pseudo-randomly distributes PGs across a group of OSDs according to rules (weight, node state, a ban on placing copies within the same group of nodes). By default, an OSD is given a weight based on the amount of free space at its mount point at the moment it is added to the cluster, and it obeys a data distribution rule that forbids keeping two copies of one PG on the same node. The CRUSH map, the description of how data is distributed, can be modified with rules that forbid keeping the second copy within the same rack, thereby providing fault tolerance at the level of losing an entire rack.
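
To make the idea tangible, here is a toy, randomized sketch of what such a rule enforces: pick RF distinct racks, weighting the OSD choice by capacity, so that no two copies of a PG share a rack. Real CRUSH is a deterministic, hierarchical algorithm; the topology and weights below are invented.

    import random

    # invented topology: rack -> {osd name: weight (roughly, capacity in TB)}
    racks = {
        'rack-a': {'osd.0': 4.0, 'osd.1': 4.0},
        'rack-b': {'osd.2': 2.0, 'osd.3': 4.0},
        'rack-c': {'osd.4': 4.0, 'osd.5': 4.0},
    }

    def place_pg(rf=3):
        chosen_racks = random.sample(list(racks), rf)        # distinct failure domains
        placement = []
        for rack in chosen_racks:
            osds, weights = zip(*racks[rack].items())
            placement.append(random.choices(osds, weights=weights)[0])
        return placement                                     # [Primary, Replica, Replica]

    print(place_pg())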

In theory this approach allows real-time geo-replication, but in practice it is usable only in object storage mode: in CephFS and RBD modes the operation latency would be too high.

Alternatives and Benefits of Ceph


The closest free clustered file system in quality and spirit is GlusterFS. It is backed by Red Hat and has some advantages (for example, it places the primary copy of the data next to the client). However, our tests showed that GlusterFS lags somewhat in performance and responds poorly during rebuilds. Other serious drawbacks are the lack of CoW (with none planned for the foreseeable future) and low community activity.

The advantage of Ceph over other clustered storage systems is the absence of single points of failure and the near-zero maintenance cost of recovery operations. Redundancy and resistance to failures are built in at the design level and come essentially for free.

Possible alternatives fall into two types: cluster file systems for supercomputers (GPFS / Lustre / etc.) and cheap centralized solutions such as iSCSI or NFS. The first type is rather difficult to maintain and is not designed to operate on failing hardware: frozen I/O, which is especially painful when a mount point is exported to a compute node, makes such file systems unusable in the public segment. The drawbacks of the "classic" solutions are well understood: a lack of scalability and the need to design the topology for failover at the hardware level, which drives up the cost.

With Ceph, restoring and rebuilding the cluster really does go unnoticed, with virtually no impact on client I/O. In other words, a degraded cluster is not an emergency for Ceph, just one of its working states. As far as we know, no other open source software storage system has this property to a degree sufficient for use in a public cloud, where a planned suspension of service is impossible.

Performance


As mentioned at the beginning of the article, data in Ceph is quantized into fairly small portions and pseudo-randomly distributed across the OSDs. As a result, the real I/O of each Ceph client is spread fairly evenly over all the disks in the cluster. Consequently:
  1. Contention between clients for disk resources decreases.
  2. The upper bound on the theoretically possible intensity of work with a block device rises.
  3. As a consequence of points 1 and 2, each client receives significantly higher per-client limits on IOPS and bandwidth than the classical approach can offer for the same money (a rough illustration with invented numbers follows below).
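
A rough, back-of-the-envelope illustration; the disk count, per-disk IOPS and replication factor below are invented assumptions, not measurements from our cluster:

    DISKS = 120                      # spindles in the cluster (assumption)
    IOPS_PER_DISK = 150              # a typical 7200 rpm SATA drive (assumption)
    RF = 3                           # every write lands on RF disks

    # because each client's volume is striped over all disks, the ceiling for
    # a single client is a share of the cluster-wide aggregate, not one disk
    read_ceiling = DISKS * IOPS_PER_DISK
    write_ceiling = DISKS * IOPS_PER_DISK // RF

    print(read_ceiling, write_ceiling)   # 18000 and 6000 IOPS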

There are other reasons for Ceph's performance as well. Every write operation first lands in the OSD journal and is then, without making the client wait, asynchronously flushed to the persistent file store. That is why placing the journal on an SSD, as recommended in the Ceph documentation, speeds up write operations many times over.

Goals and Results


Two years ago Ceph won us over with its impressive capabilities. Although many of them did not yet work perfectly at the time, we decided to build our cloud on it. In the months that followed we ran into a number of problems that gave us plenty of unpleasant moments.


For example, right after the public launch a year ago we discovered that rebuilding the cluster affected its responsiveness more than we would like. Or that a certain type of operation led to a significant increase in the latency of subsequent operations. Or that under certain (fortunately, rare) conditions a client virtual machine could hang on I/O. Either way, half a year of bug fixing did its job, and today we are completely confident in our storage system. Along the way, while ironing out these difficulties, we built up a set of debugging and monitoring tools. One of them is monitoring of the duration of all operations on block devices (at the moment the cluster serves several thousand read/write operations every second). This is what the latency report in our admin panel looks like today:

[Graph: latency of block device operations in the admin panel]
The minimum duration of operations is shown in green on the graph, the maximum in red and the median in turquoise. Impressive, isn't it?
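
For the curious, the aggregation behind such a report can be as simple as the sketch below: group operation latencies into time buckets and keep the minimum, median and maximum per bucket. The data layout here is an assumption for illustration, not our actual pipeline.

    import statistics
    from collections import defaultdict

    def aggregate(samples):
        """samples: iterable of (timestamp_seconds, latency_ms) pairs."""
        buckets = defaultdict(list)
        for ts, latency in samples:
            buckets[int(ts) // 60].append(latency)        # one-minute buckets
        return {minute: (min(vals), statistics.median(vals), max(vals))
                for minute, vals in buckets.items()}

    print(aggregate([(0, 1.2), (10, 3.4), (65, 0.9), (70, 12.5)]))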

Although the storage system has long been absolutely stable, these tools still help us with everyday tasks and at the same time back up the excellent quality of our service with numbers.

Ultimately, Ceph allowed us to achieve our goals.

Flops.ru remains the only hoster in Russia, and one of the few in the world, that uses Ceph in production. Thanks to Ceph we managed to realize what visionary posts about the future of the cloud often describe: combining compute and storage nodes on ordinary hardware and achieving figures close to those of enterprise storage "shelves" without driving up the cost. And since money saved is as good as money earned, we were able to lower the prices of our services almost to the level of Western discounters. This is easy to see by looking at our rates.

If you use dedicated hosting in your work, we invite you to try our services in action and evaluate the benefits of Ceph for yourself. You do not need to pay anything: a two-week trial period and a test balance of 500 rubles can be activated immediately after registration.

In the following posts we will look at the practical side of using Ceph, talk about the features we have gained over the past year (and there are a lot of them), and dwell in more detail on the benefits of using SSDs.

Thank you for your attention!

Source: https://habr.com/ru/post/218065/

