With Rook's growing popularity, I want to talk about its pitfalls and the problems that await you along the way.
About me: I have been administering Ceph since the Hammer release and founded the t.me/ceph_ru community on Telegram.
To keep this grounded, I will refer to posts about Ceph problems that were well received on Habr (judging by their ratings). I have run into most of the problems described in those posts myself. Links to the material used are at the end of the post.
It is no accident that a post about Rook keeps mentioning Ceph: Rook is essentially Ceph wrapped in Kubernetes, which means it inherits all of Ceph's problems. We'll start with those.
Simplified cluster management
One of Rook's advertised advantages is the convenience of managing Ceph through Kubernetes. However, Ceph has more than 1000 configuration parameters, and through Rook we can adjust only a small fraction of them.
Example on Luminous:
> ceph daemon mon.a config show | wc -l
1401
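When one of the remaining parameters does need to be set through Rook, the usual route - assuming a reasonably recent Rook release - is the rook-config-override ConfigMap, whose contents end up in the daemons' ceph.conf. A minimal sketch, with the options and values purely illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [global]
    osd_pool_default_size = 3
    [osd]
    osd_max_backfills = 1

Anything set this way only takes effect after the affected daemons are restarted, and that restart again goes through Kubernetes rather than through Ceph itself.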
Rook is positioned as a convenient way to install and update Ceph
Installing Ceph without Rook is not a problem - an Ansible playbook can be written in 30 minutes - but updating is where the real problems start.
An example from the Krok post: incorrect crush tunables after upgrading from Hammer to Jewel
> ceph osd crush show-tunables
{
    ...
    "straw_calc_version": 1,
    "allowed_bucket_algs": 22,
    "profile": "unknown",
    "optimal_tunables": 0,
    ...
}
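On bare metal this particular situation has a direct escape hatch - one command, with the caveat that it kicks off a large data migration, so it is normally run in a maintenance window:

> ceph osd crush tunables optimal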
But there are problems even within minor versions.
Example: the 12.2.6 update puts the cluster into HEALTH_ERR with notionally broken PGs
ceph.com/releases/v12-2-8-released
Don't update, wait and test instead? But we supposedly use Rook for the convenience of updates in the first place.
The complexity of disaster recovery in a Rook cluster
Example: an OSD keeps crashing, spraying errors into its log. You suspect the problem is in one of the config parameters and want to change the config for that specific daemon, but you can't, because you have Kubernetes and a DaemonSet.
There is no alternative: ceph tell osd.Num injectargs does not work either - the OSD is down.
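For comparison, on bare metal a per-daemon override is trivial: add a section for that one OSD to ceph.conf on its host and restart it. The OSD id, option and value below are hypothetical; only the shape of the fix matters:

[osd.42]
bluestore_cache_size = 1073741824

> systemctl restart ceph-osd@42

With a DaemonSet, every replica shares the same spec, so there is no equally direct way to feed a setting to a single misbehaving daemon.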
Debug difficulty
Some settings and performance tests require connecting directly to the OSD daemon's admin socket. With Rook you first have to find the pod you need, exec into it, discover that the debugging tools you need are not in the container, and get very upset.
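For illustration, here is the same admin-socket query in both worlds; the Rook namespace, label and pod name are assumptions based on a typical Rook deployment, so check your own cluster:

# Bare metal, on the OSD's host:
> ceph daemon osd.0 perf dump
# or directly via the socket:
> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

# Rook: first find the pod, then exec into it and hope the tooling is inside:
> kubectl -n rook-ceph get pods -l app=rook-ceph-osd
> kubectl -n rook-ceph exec -it rook-ceph-osd-0-xxxxx -- ceph daemon osd.0 perf dump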
The difficulty of bringing OSDs up one at a time
Example: an OSD dies from OOM, a rebalance starts, and then the next OSDs fall over.
Solution: bring the OSDs up one at a time, wait until each has fully rejoined the cluster, and only then start the next one. (More details in the talk "Ceph. Anatomy of a catastrophe".)
On bare metal this is easily done by hand; with Rook and one OSD per node there are no particular problems either, but sequential startup becomes an issue when there is more than one OSD per node.
Of course this is solvable, but we bring in Rook to simplify things and end up with a complication.
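A rough sketch of that procedure on bare metal, assuming systemd-managed OSDs; the OSD ids are placeholders and the readiness check is deliberately simplistic:

# keep CRUSH from marking the restarted OSDs out while we work
ceph osd set noout
for id in 3 7 11; do
    systemctl start ceph-osd@$id                 # start exactly one OSD
    # crude wait: poll until this OSD reports as up
    until ceph osd dump | grep "^osd.$id " | grep -wq up; do sleep 5; done
    ceph -s                                      # eyeball recovery before touching the next one
done
ceph osd unset noout

With Rook and several OSDs per node, the equivalent choreography has to be done against Kubernetes objects instead, which is exactly the complication mentioned above.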
The difficulty of choosing limits for Ceph daemons
For bare-metal Ceph installations it is fairly easy to calculate the resources a cluster needs - there are formulas and there are published studies. With weak CPUs you will still have to run a series of performance tests and find out what NUMA is, but it is still simpler than with Rook.
With Rook, in addition to the memory limits, which you can calculate, the question of setting a CPU limit arises.
And here you will have to sweat over performance tests: set the limits too low and you get a slow cluster; leave them unlimited and you get heavy CPU usage during rebalancing, which will hurt your applications in Kubernetes.
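For reference, this is roughly where those numbers end up, assuming a Rook v1-style CephCluster resource; every value below is a placeholder that has to be validated by your own tests:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  resources:
    osd:
      requests:
        cpu: "1"          # placeholder
        memory: "4Gi"     # placeholder
      limits:
        cpu: "2"          # too low -> slow cluster; absent -> rebalance starves neighbours
        memory: "6Gi"     # too low -> OOM kills during recovery
    mon:
      limits:
        cpu: "1"
        memory: "2Gi"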
Network problems v1
For Ceph it is recommended to use a 2x10 Gb network: one for client traffic, the other for Ceph's internal needs (replication and rebalancing). If you live with Ceph on bare metal, this separation is easy to configure; if you live with Rook, separating the networks will cause you problems, because far from every cluster configuration can hand two different networks to a pod.
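On bare metal the split is literally two lines in ceph.conf; the subnets are placeholders for your own networks:

[global]
public_network  = 10.0.1.0/24    # client-facing traffic
cluster_network = 10.0.2.0/24    # replication / rebalance traffic

With Rook, every OSD pod would have to see both networks, which is exactly the part that, as noted above, not every cluster configuration allows.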
Network problems v2
If you refuse to separate the networks, then during a rebalance Ceph traffic will clog the entire channel, and your applications in Kubernetes will slow down or crash. You can lower the Ceph rebalance rate, but then, because of the long rebalance, you get an increased risk of a second node dropping out of the cluster due to disks or OOM, and that already means guaranteed read-only for the cluster.
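Throttling the rebalance at runtime is itself a single command; the options are standard Ceph settings, but the values are illustrative rather than recommendations, and, as just said, slowing recovery has its own price:

> ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'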
A long rebalance means long application stalls
A quote from the post "Ceph. Anatomy of a catastrophe". Test cluster performance:
A 4 KB write operation takes 1 ms, which is 1000 operations per second in a single thread.
An operation the size of one object (4 MB) takes 22 ms, which is 45 operations per second.
Consequently, when one of the three failure domains fails, the cluster spends some time in a degraded state, and half of the hot objects diverge into different versions; as a result, half of the write operations will begin with a forced recovery.
The forced recovery time is estimated roughly, as a write operation to a degraded object:
first we read 4 MB in 22 ms, then write it back in 22 ms, and only then spend 1 ms writing the 4 KB of actual data. In total, about 45 ms per write operation to a degraded object on an SSD, where the nominal performance was 1 ms - a 45x performance drop.
The larger the percentage of degraded objects, the worse it gets.
It turns out that rebalance speed is critical for the cluster to operate properly.
Specific server tuning for Ceph
Ceph needs specific host tuning.
Example: sysctl settings, or the same jumbo frames; some of these settings can negatively affect your payload on the same host.
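A couple of typical examples of such tuning, with deliberately generic values; whether they help or hurt co-located workloads is exactly what you have to test:

# jumbo frames on the storage interface (interface name and MTU are placeholders)
> ip link set dev eth1 mtu 9000

# larger socket buffers, often suggested for heavy replication traffic (values illustrative)
> sysctl -w net.core.rmem_max=268435456
> sysctl -w net.core.wmem_max=268435456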
The real need for Rook is questionable
If you are in the cloud, you already have storage from your cloud provider, which is much more convenient.
If you are on your own servers, managing Ceph will be more convenient without Kubernetes.
Renting servers from some low-cost hoster? Then you are in for a lot of fun with the network, its latency and bandwidth, which clearly hurts Ceph.
Bottom line: deploying Kubernetes and deploying storage are different tasks with different requirements and different solutions; mixing them means making a potentially dangerous trade-off in favor of one or the other. Combining these solutions is hard even at the design stage, and then there is still the whole period of operation.
References:
Post #1: But you say Ceph... is it good?
Post #2: Ceph. Anatomy of a catastrophe