
How we broke Ceph



A detective story in one part.
This story happened in October.

There is a Ceph cluster, version 0.94 (Hammer): 6 storage nodes, 8 monitors, 6-8 OSDs on each node, SSD disks ranging from 1 TB to 4 TB. Pool replica size is 2, min_size is 1.
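For reference, these per-pool replication settings can be inspected (or set) roughly like this; the pool name rbd1 used here is the VM disk pool that shows up later in the story:

 # a minimal sketch, assuming the pool is called rbd1
 ceph osd pool get rbd1 size       # number of replicated copies, here: 2
 ceph osd pool get rbd1 min_size   # minimum copies required to serve I/O, here: 1
 # the same values can be changed with "ceph osd pool set rbd1 size 2" etc.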
Then one evening it happened: some kind of storm swept through the whole cluster and many OSDs spontaneously restarted. The OSD logs showed the standard Ceph errors of the form `no reply from osd.X`, some slow requests and spontaneous restarts — not a critical event for Ceph and a fairly frequent one. In general, these things happen. But unfortunately, not all of the restarted OSDs managed to come back up. Namely, our hero osd.45 on storage6 went down and stayed down.

We decided it was no big deal — after all, it is only 1 OSD out of 50 — and tried to bring it up in various ways. You know the drill: umount/mount the directory, flush and re-create the journal, and so on. In the end, as any admin does in such a situation, we rebooted the whole node, because experience shows that systemd sometimes goes a little crazy and refuses to start the required processes properly, or an OSD dies and will not start without a server reboot.

And after the reboot we already had 4 dead OSDs, i.e. all of them on this storage node. The errors were the same as on the original OSD:

 2017-10-12 12:10:01.756562 7f02459af880  0 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af), process ceph-osd, pid 2343
 2017-10-12 12:10:01.767719 7f02459af880  0 filestore(/var/lib/ceph/osd/ceph-45) backend xfs (magic 0x58465342)
 2017-10-12 12:10:01.769302 7f02459af880  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features: FIEMAP ioctl is supported and appears to work
 2017-10-12 12:10:01.769310 7f02459af880  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
 2017-10-12 12:10:01.785177 7f02459af880  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
 2017-10-12 12:10:01.785229 7f02459af880  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_feature: extsize is disabled by conf
 2017-10-12 12:10:01.788289 7f02459af880  0 filestore(/var/lib/ceph/osd/ceph-45) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
 2017-10-12 12:10:01.790182 7f02459af880 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
 2017-10-12 12:10:01.790184 7f02459af880  1 journal _open /var/lib/ceph/osd/ceph-45/journal fd 19: 10485760000 bytes, block size 4096 bytes, directio = 1, aio = 0
 2017-10-12 12:10:01.790428 7f02459af880  1 journal _open /var/lib/ceph/osd/ceph-45/journal fd 19: 10485760000 bytes, block size 4096 bytes, directio = 1, aio = 0
 2017-10-12 12:10:01.790923 7f02459af880  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
 2017-10-12 12:10:01.796251 7f02459af880  0 osd.45 19681 crush map has features 1107558400, adjusting msgr requires for clients
 2017-10-12 12:10:01.796260 7f02459af880  0 osd.45 19681 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
 2017-10-12 12:10:01.796263 7f02459af880  0 osd.45 19681 crush map has features 1107558400, adjusting msgr requires for osds
 2017-10-12 12:10:01.796273 7f02459af880  0 osd.45 19681 load_pgs
 2017-10-12 12:10:01.850570 7f02459af880 -1 *** Caught signal (Aborted) **
  in thread 7f02459af880
 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
  1: /usr/bin/ceph-osd() [0xb35f7d]
  2: (()+0x11390) [0x7f0244897390]
  3: (gsignal()+0x38) [0x7f0242c8f428]
  4: (abort()+0x16a) [0x7f0242c9102a]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f02435d284d]
  6: (()+0x8d6b6) [0x7f02435d06b6]
  7: (()+0x8d701) [0x7f02435d0701]
  8: (()+0x8d919) [0x7f02435d0919]
  9: (object_info_t::decode(ceph::buffer::list::iterator&)+0xb7a) [0x7b47ba]
  10: (object_info_t::object_info_t(ceph::buffer::list&)+0x1b7) [0x78cd07]
  11: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0xf52) [0x784c82]
  12: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x24e) [0x81791e]
  13: (OSD::load_pgs()+0xbb6) [0x6b9eb6]
  14: (OSD::init()+0x137f) [0x6bdddf]
  15: (main()+0x2c20) [0x646310]
  16: (__libc_start_main()+0xf0) [0x7f0242c7a830]
  17: (_start()+0x29) [0x65c779]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

If you start the OSD directly via the binary, rather than via upstart or systemctl, you can see the thrown exception and a bit more information (by default the binary writes STDERR and STDOUT to the console):

 root@storage6:~# /usr/bin/ceph-osd -f --cluster ceph --id 45
 starting osd.45 at :/0 osd_data /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
 2017-10-12 15:13:45.288995 7fc5b1b06880 -1 osd.45 22551 log_to_monitors {default=true}
 terminate called after throwing an instance of 'ceph::buffer::malformed_input'
   what():  buffer::malformed_input: __PRETTY_FUNCTION__ decode past end of struct encoding
 *** Caught signal (Aborted) **
  in thread 7fc58e699700
 terminate called recursively
 Aborted (core dumped)

Here we see that the daemon complains about some malformed input (malformed_input), but which data exactly is not yet clear. The situation is complicated by the fact that unfound objects have appeared in the cluster:

 recovery 3/3102030 unfound (0.000%) 
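To see which placement groups those unfound objects belong to, `ceph health detail` can be consulted; a rough sketch (the output line is approximate, the PG id is the one from our case):

 ceph health detail | grep unfound
 # roughly:
 # pg 2.67 is active+degraded, 3 unfound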

These 4 non-starting OSDs were marked out by the monitors after a timeout. Data recovery and rebalancing began, and about a hundred slow requests hung in the cluster. Since Ceph is used as a storage backend for OpenStack (a dev environment, not prod), the clients — the development teams — felt it very strongly: all work slowed down noticeably, because the VMs became painfully sluggish.

We have two pools in Ceph: one for the VM disks themselves — rbd1 — and a second for the images from which those disks are created. Unfound objects are critical for both pools, since they can lead to data inconsistency. But if all the unfound objects are in rbd1, we get at most 3 mangled VMs; if they are all in images, then because of how OpenStack works we potentially get a whole heap of mangled VMs. Since we were, in principle, prepared to lose three machines — this OpenStack is not production — we needed to check which pool the unfound objects belonged to:

 root@volume1:~# ceph pg 2.67 list_missing
 { "offset": { "oid": "",
       "key": "",
       "snapid": 0,
       "hash": 0,
       "max": 0,
       "pool": -1,
       "namespace": ""},
   "num_missing": 3,
   "num_unfound": 3,
   "objects": [
         { "oid": { "oid": "rbd_data.8e53d86b8b4567.0000000000000220",
               "key": "",
               "snapid": -2,
               "hash": 2391933031,
               "max": 0,
               "pool": 2,
               "namespace": ""},
           "need": "19232'24922826",
           "have": "19172'24922749",
           "locations": []},
         { "oid": { "oid": "rbd_data.42375f55c840f3.000000000000c6d3",
               "key": "",
               "snapid": -2,
               "hash": 1598576743,
               "max": 0,
               "pool": 2,
               "namespace": ""},
           "need": "19229'24922824",
           "have": "19110'24922274",
           "locations": []},
         { "oid": { "oid": "rbd_data.b4b7816b8b4567.0000000000000298",
               "key": "",
               "snapid": -2,
               "hash": 2351466599,
               "max": 0,
               "pool": 2,
               "namespace": ""},
           "need": "19229'24922823",
           "have": "19110'24921517",
           "locations": []}],
   "more": 0}

All of them were on the same OSD, in one PG, and in pool number 2. And... that is the rbd1 pool, so we breathed a sigh of relief: worst case we lose only three VMs, which is not that bad.
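The mapping from the pool number in that output to the pool name can be checked roughly like this (the output shown is illustrative; only the fact that pool 2 is rbd1 comes from our case):

 ceph osd lspools
 # e.g.: 1 images,2 rbd1,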

We remove one of the problematic OSDs from the cluster with `ceph osd rm osd.45`. The unfound count jumps to 28. Now that is a real problem. We have to try to restore the broken OSDs somehow, otherwise we are looking at a complete redeployment of every VM in OpenStack, which is, first, laborious and, second, terribly embarrassing. So we switch into fierce debugging mode.
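For context, the usual sequence for permanently removing a dead OSD from a Hammer-era cluster looks roughly like this (a sketch, not a recommendation — as we found out, every removal here only increased the number of unfound objects):

 ceph osd out osd.45            # stop placing data on it, let recovery drain it
 stop ceph-osd id=45            # stop the daemon (upstart syntax on this setup)
 ceph osd crush remove osd.45   # remove it from the CRUSH map
 ceph auth del osd.45           # delete its cephx key
 ceph osd rm osd.45             # finally remove the OSD id from the cluster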

So, what do we have at the moment?

- an anomaly swept through the cluster, causing some of the OSDs to restart;
- one of them could not start;
- after its node was rebooted, the remaining OSDs on that node could not start either;
- there are blocked requests in the cluster, all on one OSD (osd.8);
- the cluster has unfound objects (28 of them), all on that same OSD (osd.8);
- apparently, removing the remaining broken OSDs from the cluster would again increase the number of unfound objects, so we should not do that;
- clients are suffering; recovery periodically stalls and has to be kicked by restarting any of the working OSDs (a Ceph cluster can sometimes get stuck and stop recovering — then you need to re-trigger recovery with an osd up -> osd down -> osd up event, see the sketch below).
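A minimal way to produce that up -> down -> up kick, assuming osd.12 is any currently healthy OSD (the id is just an example), is either to mark it down — the running daemon immediately re-asserts itself as up — or simply to restart its daemon:

 # marking a healthy OSD down makes it re-register as up, which re-triggers peering/recovery
 ceph osd down osd.12
 # or, more bluntly (upstart syntax on this setup):
 restart ceph-osd id=12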

We strace the launch of osd.45:

 pidof `start ceph-osd id=45` | xargs strace -p 2> 45-strace.log 

We see that the OSD stumbles while reading one of the objects in its local data directory, after which the already familiar ceph::buffer::malformed_input exception is thrown:

 stat("/var/lib/ceph/osd/ceph-45/current/2.c_head", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
 stat("/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
 stat("/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
 stat("/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0/DIR_0", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
 stat("/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0/DIR_0/DIR_D", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
 stat("/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0/DIR_0/DIR_D/DIR_B", 0x7ffcccfdf330) = -1 ENOENT (No such file or directory)
 stat("/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0/DIR_0/DIR_D/rbd\\udata.e1bde639901b63.0000000000000000__head_852BD00C__2", {st_mode=S_IFREG|0644, st_size=8388608, ...}) = 0
 stat("/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0/DIR_0/DIR_D/rbd\\udata.e1bde639901b63.0000000000000000__head_852BD00C__2", {st_mode=S_IFREG|0644, st_size=8388608, ...}) = 0
 open("/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0/DIR_0/DIR_D/rbd\\udata.e1bde639901b63.0000000000000000__head_852BD00C__2", O_RDWR) = 22
 fgetxattr(22, "user.ceph._", 0x7ffcccfdf7c0, 100) = -1 ERANGE (Numerical result out of range)
 fgetxattr(22, "user.ceph._", NULL, 0) = 250
 fgetxattr(22, "user.ceph._", "\17\10\r\1\0\0\4\3I\0\0\0\0\0\0\0(\0\0\0rbd_data.e1bde639901b63.0000000000000000\376\377\377\377\377\377\377\377\f\320+\205\0\0\0\0\0\2\0\0\0\0\0\0\0\6\3\34\0\0\0\2\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0M\311\274\2\0\0\0\0\34K\0\0L\311\274\2\0\0\0\0\34K\0\0\2\2\25\0\0\0\10h\361\374\0\0\0\0\0\341\244\v\0\0\0\0\0\0\0\0\0\0\0\200\0\0\0\0\0\265?\336Y\256\270N\5\2\2\25\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0M\311\274\2\0\0\0\0\0\0\0", 250) = 250
 futex(0x7f963ba19680, FUTEX_WAKE_PRIVATE, 2147483647) = 0
 write(2, "terminate called after throwing "..., 48) = 48
 write(2, "ceph::buffer::malformed_input", 29) = 29
 write(2, "'\n", 2) = 2
 write(2, " what(): ", 11) = 11
 write(2, "buffer::malformed_input: __PRETT"..., 79) = 79
 write(2, "\n", 1) = 1
 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
 tgkill(23708, 23708, SIGABRT) = 0
 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=23708, si_uid=0} ---
 write(2, "*** Caught signal (Aborted) **\n "..., 55) = 55
 futex(0x7f963b802110, FUTEX_WAKE_PRIVATE, 2147483647) = 0
 write(2, " ceph version 0.94.10 (b1e053241"..., 1418) = 1418
 futex(0x48d40b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x48d40b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x48d4030, FUTEX_WAKE_PRIVATE, 1) = 1
 futex(0x48d4058, FUTEX_WAIT_PRIVATE, 2, NULL) = 0

Perhaps a broken, inconsistent object e1bde639901b63.0000000000000000__head_852BD00C__2? The strace clearly shows that the exception is thrown right after an attempt to process this object. Well then, let's look for its replica on the other storage nodes:

 find /mnt/osd? -name *e1bde639901b63.0000000000000000__head_852BD00C* 

We find it on storage5, in osd.37. We compare the md5 sums — they match. Well, maybe the xattrs differ? We carefully copy the object with rsync from storage5 to storage6, preserving all the metadata, back up the original file and replace it with the copy. We try to start the OSD — no luck, everything is exactly the same as before. But the OSD over there works! How can that be?
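The comparison and the copy looked roughly like this (a sketch: the paths with "..." are abbreviated, and the hashed DIR_* layout may differ between OSDs; the -X flag is what keeps the xattrs intact):

 # on storage5: checksum the healthy replica
 md5sum '/var/lib/ceph/osd/ceph-37/current/2.c_head/.../rbd\udata.e1bde639901b63.0000000000000000__head_852BD00C__2'
 # on storage6: pull the object over, preserving permissions, times and extended attributes (-X)
 rsync -av -X \
     'storage5:/var/lib/ceph/osd/ceph-37/current/2.c_head/.../rbd\udata.e1bde639901b63.0000000000000000__head_852BD00C__2' \
     '/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0/DIR_0/DIR_D/'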

We dive into the daemons' source code, looking for what could make them crash. At this stage Azat joins in to help — a C++ developer who knows about Ceph but has never worked with it.
It is clear that the problem lies in the data read during the OSD boot process. We start other, healthy OSDs and strace them to understand what startup is supposed to look like, run the sick OSD under gdb and try to debug it, and compare the two startup processes.
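Such a debug session looks roughly like this (a sketch, not the exact session; breaking on the throw is the quickest way to land in the object_info_t::decode path seen in the backtrace above):

 gdb --args /usr/bin/ceph-osd -f --cluster ceph --id 45
 (gdb) catch throw      # break whenever a C++ exception is thrown
 (gdb) run
 ...                    # wait for the ceph::buffer::malformed_input throw
 (gdb) bt               # backtrace: PGLog::read_log -> object_info_t::decode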

A few hours later the developer says he now understands the difference in how the OSDs start and which part of the code they crash in, but the root cause of the behavior is still unclear. He says that in that case osd.37 (the one we copied the "valid" object from) should not work either, and suggests restarting it to prove the hypothesis. We push back: everything is bad enough already, and that is yet another node, however badly this may end. The developer quite logically replies that it does not really matter — that OSD is barely holding together anyway and can die at any moment without coming back, but at least this way we can confirm the theory and move on. We agree. We restart it — it does not come back up. Joy for the developer, mourning for us: there are now more unfound objects, and the number of slow requests has grown to 800. But what can you do. And while clients could still somehow work with 100 slow requests, at 800 everything finally grinds to a halt.

So now we potentially have two whole storage nodes whose OSDs are hanging by a thread, and the remaining 3 running OSDs on storage5 will not survive a restart.

We start comparing the filesystems and file structures on the healthy OSDs, since we can see a difference at startup. We find that the problem seems to lie in the xattrs and the way they are handled. To double-check, we strace a healthy OSD:

 pidof `start ceph-osd id=8` | awk {'print $1'} | xargs strace -fe open,fsetxattr -p 2>8-osd-xattr.strace 

It doesn't tell us much, of course, but we do see that the healthy and the broken OSDs make different xattr requests at startup: some with prefixes, some without. The developer went through all the places where these xattrs are generated and/or read, and it became clear that such an xattr was supposed to be generated, yet for some reason reading it failed. Why exactly was still unclear at that point. Then we look at the upstream git history and see that this code was added relatively recently.
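Those xattrs can also be inspected directly on the filestore; a rough sketch for comparing a healthy and a broken object (user.ceph._ is the attribute whose read is visible in the strace of the crashing OSD above):

 # dump all user.* xattrs of the object in hex and compare between osd.37 and osd.45
 getfattr -d -e hex \
     '/var/lib/ceph/osd/ceph-45/current/2.c_head/DIR_C/DIR_0/DIR_0/DIR_D/rbd\udata.e1bde639901b63.0000000000000000__head_852BD00C__2'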

There is a patch that should let the startup process read the on-disk structures correctly. The developer suggests upgrading Ceph and trying to start the OSDs on the new version. Meanwhile, half-jokingly, he asks in passing: "Listen, guys, you haven't by any chance installed the 10th Ceph here at some point?" And he laughs. One of the admins replies: "Yes, it did get installed by accident, then puppet reinstalled the correct version." A second of silence. And everyone bursts out laughing.

And here is what had happened: storage[56] were the most recently provisioned nodes, and at the time of their setup, from a separate environment, the apt pins were specified incorrectly, which is why the default Ceph version from the Ubuntu repository got installed. Puppet rolled it out, deployed the OSDs and started them, and they began to work. A day later an admin noticed that the Ceph version was wrong, made the changes, installed the correct one, restarted (!!!) the OSDs (without redeploying them), and went off to mind his own business. Two weeks later came the incomprehensible cataclysm described at the very beginning, after which one OSD would not come back up.
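The sanity check that would have caught this — and the place such a pin lives — is roughly the following; the repository origin and priority here are purely illustrative:

 # check which version apt will actually install before configuration management rolls it out
 apt-cache policy ceph ceph-osd
 # the pin itself belongs in /etc/apt/preferences.d/, e.g. (illustrative):
 #   Package: *
 #   Pin: origin "download.ceph.com"
 #   Pin-Priority: 1001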

Halfway through, the developer had already been puzzling over the on-disk file structures, not understanding why they looked so strange, and had written it off as possible data corruption. It turned out that in version 10 (Jewel) the Ceph developers changed how xattrs are stored and represented. Previously all xattr values were stored inline, which was suboptimal and imposed a length limit; in version 10 the storage layout was changed so that long values are split into several chunks. The OSDs had been initialized on version 10 and worked for a day, creating objects "the new way"; then they were downgraded and worked for another 2 weeks. Why they even started after the downgrade is unclear — probably they simply never touched that data. In our case, however, the OSD ran into those xattrs, could not digest them and threw the malformed_input exception.

Naturally, we immediately installed the Jewel version of Ceph on storage[56], and all the OSDs came up nicely. The unfound objects went away, the blocked requests disappeared as well, and Ceph began to recover. We had been almost certain we would lose the entire OpenStack, but in the end we managed to pull it through. Of course, many thanks to the developer — he really helped.

So that is the story. A couple of hours after the problem appeared, one of the admins said he had worked out the exception (malformed_input) thrown by the daemon and had gone to look at the code — there were only (!) 23 places where it could be thrown. I laughed and said that was a lot and the wrong approach anyway, since we were not going to rebuild Ceph. At that moment the idea seemed wild to me, and the scale of the problem still allowed us to just dig around in Ceph, trying various options. But when, by the end of that day, I realized there was no better way, it looked like a chance well worth taking.

And in the end, the approach of "let's look into the application's source code to understand what is happening and what is going wrong" turned out to be very effective — and I had been wrong.

Source: https://habr.com/ru/post/353854/

