But you say Ceph ... is it really good?

I love Ceph. I have been working with him for 4 years already (0.80.x - ~~12.2.6~~ 12.2.5). Sometimes I am so fascinated by him that I spend evenings and nights in his company, and not with my girlfriend. I encountered various problems in this product, and I still live with some to this day. Sometimes I was happy with easy decisions, and sometimes I dreamed of meeting with the developers to express my indignation. But Ceph is still used in our project and it is possible that it will be used in new tasks, at least by me. In this story, I will share our experience in operating Ceph, in some way I will speak about what I do not like about this decision and maybe help those who are just eyeing it. By writing this article, I pushed the events that began about a year ago, when Dell EMC ScaleIO, now known as Dell EMC VxFlex OS, was delivered to our project.

This is by no means an advertisement for Dell EMC or their product! Personally, I'm not very good at big corporations, and black boxes like the VxFlex OS. But as you know, everything in the world is relative and using the example of VxFlex OS it is very convenient to show what Ceph is in terms of operation, and I will try to do it.

Options. It's about 4-digit numbers!

Ceph services such as MON, OSD, etc. have different parameters to configure all subsystems. The parameters are set in the configuration file, the demons read them at the time of launch. Some values can be conveniently changed on the fly with the help of the "injection" mechanism, about which just below. Everything is almost super, if you omit the moment that there are hundreds of parameters:

Hammer:

> ceph daemon mon.a config show | wc -l 863

Luminous:

 > ceph daemon mon.a config show | wc -l 1401

It turns out ~ 500 new parameters in two years. In general, parametrization is cool, it’s not cool that there are difficulties in understanding 80% of this list. The documentation described in my estimation ~ 20% and sometimes ambiguous. Understanding the meaning of most of the parameters has to be found in the github project or in the mailing lists, but this does not always help.

Here is an example of several parameters that were interesting to me just recently, I found them in the blog of one Ceph-gadfly:

 throttler_perf_counter = false // enable/disable throttler perf counter osd_enable_op_tracker = false // enable/disable OSD op tracking

Comments in the code in the spirit of best practices. As if, I understand the words and even about what they are, but what it will give me is not.

Or here: osd_op_threads in Luminous was gone and only the source helped to find a new name: osd_peering_wq threads

I also like the fact that there are especially holivar parameters. Here the dude shows that an increase in rgw_num _rados_handles is good :

and the other dude thinks that> 1 is impossible and even dangerous .

And my most favorite is when beginners give examples of the config in their blog posts, where all parameters are thoughtlessly (I think so) copied from another same blog, and so a bunch of parameters that no one knows except the author of the code roams from config to config

I’m also burning wildly with what they did in Luminous. There is a super cool feature - changing parameters on the fly, without restarting processes. You can, for example, change the parameter of a specific OSD:

 > ceph tell osd.12 injectargs '--filestore_fd_cache_size=512'

or put '*' instead of 12 and the value will be changed on all OSD. This is very cool, really. But, like much in Ceph, this is done with the left foot. Bai design not all parameter values can be changed on the fly. More precisely, they can be meshed up and they will appear in the output modified, but in fact, only some of them are re-read and re-applied. For example, you cannot change the size of the thread pool without restarting the process. So that the performer of the team understood that it is useless to change the parameter in this way - they decided to print a message. Is healthy

For example:

 > ceph tell mon.* injectargs '--mon_allow_pool_delete=true' mon.c: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart) mon.a: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart) mon.b: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart)

Ambiguous. In fact, the removal of pools becomes possible after injection. That is, this warning is not relevant for this parameter. Ok, but there are still hundreds of parameters, including very useful ones, which also have a warning and there is no possibility to check their actual applicability. At the moment, I can not even understand the code, which parameters are applied after the injection, and which are not. For reliability, you have to restart services and this, you know, infuriates. Enrages because I know that there is an injection mechanism.

How is this with VxFlex OS? Similar processes like MON (in VxFlex are MDM), OSD (SDS in VxFlex) also have configuration files in which there are about a dozen of parameters. True, their names also do not speak about anything, but it is good that we have never resorted to them to burn like with Ceph.

Technical duty

When you begin your acquaintance with Ceph with the most current version for today, everything seems fine, and I want to write a positive article. But when you live with him in the prode from version 0.80, then everything does not look so rosy.

Before Jewel, Ceph processes ran as root. In Jewel, it was decided that they should work from the user 'ceph' and this required a change of ownership in all the directories that are used by Ceph services. It would seem that this? Imagine an OSD that serves a 2 TB magnetic SATA disk with 70% capacity. So, chown such a disk in parallel (to different subdirectories) with full disk utilization takes 3-4 hours. Imagine that you have, for example, 3 hundreds of such disks. Even if you update the nodes (chown'it 8-12 disks at once), you get a rather long update, in which the cluster will have OSD of different versions and one replica of data less at the time of the server output for the update. In general, we thought it was absurd, rebuilt Ceph packages and left OSD to work as root. We decided that as OSD is introduced or replaced we will transfer them to a new user. Now we are changing 2-3 disks per month and adding 1-2, I think we can handle it by 2022).

CRUSH Tunables

CRUSH is the heart of Ceph, everything revolves around it. This is an algorithm by which, in a pseudo-random manner, the location of the data is selected and thanks to which clients working with the RADOS cluster will know on which OSD the data (objects) they need is stored. The key CRUSH feature is that there is no need for any metadata servers, such as Luster or IBM GPFS (now Spectrum Scale). CRUSH allows clients and OSD to interact with each other directly. Although, of course, it is difficult to compare the primitive object storage of RADOS and file systems, which I cited as an example, but I think the idea is clear.

In turn, CRUSH tunables is a set of parameters / flags that affect the operation of CRUSH, making it more efficient, at least in theory.

So, when upgrading from Hammer to Jewel (test naturally), a warning appeared, that the profile tunables has parameters that are not optimal for the current version (Jewel) and it is recommended to switch the profile to optimal. In general, everything is clear. The dock says that this is very important and this is the right way, but it also says that after switching the data there will be a rebelance of 10% of the data. 10% - it doesn't sound scary, but we decided to test it. For a cluster, about 10 times less than what was sold, with the same number of PGs per OSD filled with test data, we got a rebelance of 60%! Imagine, for example, with 100TB of data, 60TB begin to move between OSDs and this with a constantly going client load demanding latency! If I haven’t said yet, we provide s3 and even at night the load on rgw, of which there are 8 and 4 more for static websites, is not particularly reduced. In general, we decided that this was not our way, all the more so to make such a rebelance on the new version, which we hadn’t worked with before, at least too optimistic. In addition, we had large bucket indices, which are very badly experiencing rebelance, and this was also the reason for the delay in switching the profile. About indexes will be separately slightly lower. In the end, we simply removed the warning and decided to return to this later.

And when switching the profile in testing, we fell off cephfs-clients that are in CentOS 7.2 cores, since they could not work with the newer hashing algorithm that came with the new profile. We do not use cephfs in the sale, but if used, it would be another reason not to switch the profile.

By the way, the dock says that if what does not happen during rebelance does not suit you, you can roll back the profile. In fact, after a clean installation of the Hammer version and an update to Jewel, the profile looks like this:

 > ceph osd crush show-tunables { ... "straw_calc_version": 1, "allowed_bucket_algs": 22, "profile": "unknown", "optimal_tunables": 0, ... }

Here it is important that it is "unknown" and if you try to stop the rebelance by switching it to "legacy" (as stated in the dock) or even to "hammer", the rebelance will not stop, it will simply continue in accordance with other tunables, and not " optimal. " In general, everything needs to be thoroughly checked and rechecked, ceph no confidence.

CRUSH trade-of

As you know, everything in this world is balanced and there are disadvantages attached to all the advantages. The disadvantage of CRUSH is that PGs are spread over different OSD unevenly, even with the same weight of the latter. Plus, nothing prevents different PGs from growing at different speeds, it’s like a hash function. Specifically, we have an OSD utilization range of 48-84%, despite the fact that they have the same size and, accordingly, weight. Even we try to make servers equal in weight, but this is so, just our perfectionism, no more. And the fig with the fact that IO is distributed unevenly across disks, the worst thing is that when full (95%) reaches at least one OSD in a cluster, the entire recording stops and the cluster goes into readonly. The whole cluster! And it does not matter that the cluster is still full of space. Everything, final, we leave! This is an architectural feature of CRUSH. Imagine, you are on vacation, some kind of OSD has broken the 85% mark (the first warning is the default), and you have 10% in stock to prevent the recording from stopping. And 10% with an active recording is not so much / long. Ideally, with this design, Ceph needs a person on duty who is able to execute the prepared instruction in such cases.

So, we decided to unbalance the data in the cluster, because several OSD were close to the nearfull-mark (85%).

There are several ways:

Add discs

The easiest, slightly wasteful and not very effective, because the data itself may not move from the overflowing OSD or the movement will be negligible.

Change permanent weight OSD (WEIGHT)

This leads to a change in the weight of all higher-level buckets (CRUSH terminology) in the hierarchy, OSD server, data center, etc. and, as a result, to the movement of data, including not from those OSD from which it is necessary.
We tried, reduced the weight of one OSD, after the rebelance of the data was filled with another, lowered it, then the third, and we realized that we would be playing this way for a long time.

Change non-permanent weight OSD (REWEIGHT)

This is what is done when calling 'ceph osd reweight-by-utilization'. This leads to a change in the so-called adjustment weight OSD and at the same time the weight of the higher-level buckets does not change. As a result, the data balances between different OSD of the same server, as it were, without going beyond the CRUSH. We really liked this approach, we saw in the dry-run what the changes would be and performed on the sale. Everything was fine until the rebelance process had a stake around the middle. Again, google, reading mailings, experimenting with different options and in the end it turns out that the stop is caused by the lack of some tunables in the profile mentioned above. Again we caught up with technical debt. As a result, we took the path of adding disks and the most inefficient rebelance. Fortunately, we still had to do it because switching of the CRUSH profile was planned to be done with a sufficient margin in capacity.

Yes, we know about balancer (Luminous and above) included in mgr, which is designed to solve the problem of uneven distribution of data by moving the PG between OSD, for example, at night. But I have not heard yet positive reviews about his work, even in the current Mimic.

You will probably say that technical debt is a purely our problem and I, perhaps, will agree. But for four years with Ceph, we had only one downtime s3 recorded during the sale, which lasted a whole 1 hour. And then, the problem was not in RADOS, but in the RGW, which typed their default 100 threads hung tight and most users did not execute queries. It was still on the Hammer. In my opinion, this is a good indicator and it is achieved due to the fact that we do not make any sudden movements and are rather skeptical about everything in Ceph.

Wild gc

As you know, deleting data directly from the disk is a fairly resource-intensive task and in advanced systems deletion is done pending or not done at all. Ceph is also an advanced system and in the case of the RGW, when deleting an s3 object, the corresponding RADOS objects are not deleted from the disk immediately. RGW marks s3-objects as deleted, and a separate gc-stream is engaged in deleting objects directly from RADOS pools and, respectively, from disks is postponed. After the update on Luminous, the behavior of gc changed noticeably, it began to work more aggressively, although the parameters of gc remained the same. By the word, noticeably, I mean that we began to see the work of gc on external monitoring of the service on jumping latency. This was accompanied by a high IO in the rgw.gc. pool. But the problem we faced was much more epic than just an IO. When gc is working, a lot of logs are generated:

 0 <cls> /builddir/build/BUILD/ceph-12.2.5/src/cls/rgw/cls_rgw.cc:3284: gc_iterate_entries end_key=1_01530264199.726582828

Where 0 at the beginning is the logging level at which this message is printed. However, logging is nowhere below zero. As a result, ~ 1 GB of logs was generated by us for one OSD in a couple of hours, and everything would be fine if ceph nodes were not diskless ... We load the OS via PXE directly into memory and do not use a local disk or NFS, NBD for the system partition (/). Stateless-servers turn out. After a reboot, the entire state is rolled by automation. How it works, I will somehow describe it in a separate article, now it’s important that 6 GB of memory is allocated for "/", of which ~ 4 is usually free. We send all logs to Graylog and use a rather aggressive log rotation policy and usually do not experience any problems with disk / RAM overflow. But we were not ready for this, with 12 OSD on the server "/" filled very quickly, the duty officers did not react in time to the trigger in Zabbix and the OSD just started to stop due to the inability to write the log. As a result, we lowered the intensity of gc, the ticket did not start because such was already there, and they added a script to cron, in which we do a forced truncate of OSD logs when a certain volume is exceeded without waiting for a logrotate. By the way, the level of logging increased .

Placement Groups and vaunted scalability

In my opinion, PG is the most difficult abstraction to understand. PG is needed to make CRUSH more efficient. The main purpose of PG is to group objects to reduce resource consumption, improve performance and scalability. Addressing objects directly, individually, without combining them into a PG would be very expensive.

The main problem of PG is the determination of their number for the new pool. From the Ceph blog:

"This is a bit of a nightmare."

It is always very specific to a particular installation and requires lengthy reflection and calculation.

Key recommendations:

Too many PGs on the OSD are bad, there will be overspending on their maintenance and brakes during rebalance / recovery.
Few PGs on the OSD are bad, performance will suffer, and the OSD will be filled unevenly.
The PG number must be a multiple of degree 2. This will help get the "power of CRUSH".

And here I burned. PGs are not limited in scope or number of objects. How many resources (in real numbers) do you need to maintain one PG? Does it depend on its size? Does it depend on the number of replicas of this PG? Should I bathe if I have enough memory, fast CPUs and a good network?
And still need to think about the future growth of the cluster. The number of PG can not be reduced - only increase. At the same time, it is not recommended to do this, since this will, in fact, entail a split of the PG part into new and wild rebelance.

"It is one of the most impactful cases for production clusters if you can."

Therefore, we need to think about the future immediately, if possible.

Real example.

A cluster of 3 servers with 14x2 TB OSD in each, total 42 OSD. Replica 3, useful space ~ 28 TB. Will be used under S3, you need to calculate the number of PG for the data pool and index pool. RGW uses more pools, but these two are basic.

We go into the PG calculator (there is such a calculator), we consider with the recommended 100 PG on the OSD, we get only 1312 PG. But not everything is so simple: we have an introductory one - the cluster will definitely grow three times over the course of a year, but the iron will be purchased a little later. Increase the "Target PGs per OSD" three times, to 300 and we get 4480 PG.

Set the number of PGs for the corresponding pools - we get warning: too many PG Per OSD ... arrived. Received ~ 300 PG on OSD with a limit of 200 (Luminous). It used to be 300, by the way. And the most interesting is that all unnecessary PGs are not allowed to be peering, that is, it is not just a warning. As a result, we believe that we are doing everything correctly, raising limits, turning off the warning and moving on.

Another real-life example is more interesting.

S3, 152 TB of usable volume, 252 OSD at 1.81 TB, ~ 105 PG per OSD. The cluster grew gradually, everything was fine until with the new laws in our country there was no need for growth up to 1 PB, that is + ~ 850 TB, and at the same time it is necessary to preserve the performance, which is now pretty good for S3. Suppose we take disks of 6 (5.7 really) TB and taking into account replica 3 we get + 447 OSD. Taking into account the current, 699 OSDs of 37 PG each will be obtained, and if we take into account different weights, then it turns out that the old OSD will have only ten PGs. And you tell me, how far will it work? The performance of a cluster with different numbers of PG is quite difficult to measure synthetically, but the tests that I conducted show that for optimal performance you need from 50 PG to 2 TB OSD. And further growth? Without increasing the number of PGs, you can reach the PG mapping for OSD 1: 1. Maybe I do not understand something?

Yes, you can create a new pool for the RGW with the desired number of PGs and apply a separate S3 region to it. Or even build a new cluster nearby. But agree that all this crutches. And it turns out that Ceph seems to be well-scalable because of its concept. PG scales with reservations. You will either have to live with disabled vornings in preparation for growth, or at some point rebel all data in a cluster, or to score on performance and live with what you get. Or go through it all.

I am glad that the developers of Ceph understand that PG is a complicated and superfluous abstraction for the user and he shouldn’t know about it.

In the luminous environment, it’s not necessary to make it out. think about ".

In vxFlex there is no concept of PG or any analogs. You just add the disks to the pool and that's it. And so to 16 PB. Imagine, nothing needs to be counted, there are no heaps of statuses of these PGs, disks are utilized evenly throughout the growth. Since disks are given to vxFlex as a whole (there is no file system on top of them) there is no way to estimate the fullness and there is no such problem at all. I don't even know how to convey to you how nice it is.

"Need to wait for SP1"

Another "success" story. As you know, RADOS is the most primitive repository of key-value. S3, implemented on top of RADOS, is also primitive, but still a bit more functional. , S3 . , , RGW . — RADOS-, OSD. . , . OSD down. , , . , scrub' . , - 503, .

Bucket Index resharding — , (RADOS-) , , OSD, .

, , Jewel ! Hammer, .. -. ?

Hammer 20+ , , OSD Graylog , . , .. IO . Luminous, .. . Luminous, , . , . IO index-, , . , IO , . , … ; , :

, . , .. , .

, Hammer->Jewel - . OSD - . , OSD .

— , , . Hammer s3, . , . , , etag, body, . . , . Suspend . "" . , .

, 2 — , Cloudmouse. , Ceph, , .

vxFlex OS 2 . , . , . , . , , , Dell EMC.

Performance

. , ? Good question. , . , Ceph, vxFlex . - . , .

9 ceph-devel : , CPU ( Xeon' !) IOPS All-NVMe Ceph 12.2.7 bluestore.

, , "" Ceph . ( Hammer) Ceph , s3 . , ScaleIO Ceph RBD . Ceph, — CPU. RDMA InfiniBand, jemalloc . , 10-20 , iops, io, Ceph . vxFlex . — Ceph system time, scaleio — io wait. , bluestore, , , -, , Ceph. ScaleIO . , , Ceph Dell EMC.

, , PG. (), IO. - PG IO, , . , nearfull. , .

vxFlex - , . ( ceph-volume), , .