📜 ⬆️ ⬇️

Linux still not cake

This story began about a month ago, when Cyril Thay added support for nested namespaces in CRIU, after which our CI system ordered to live long. At that moment nothing foreshadowed those fascinating adventures in which we were involved.

image

On closer examination, it turned out that the core is only flowing:

$ slabtop
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
441920 441450 99% 1.00K 13810 32 441920K kmalloc-1024
224070 222908 99% 0.19K 5335 42 42680K kmalloc-192
38304 21198 55% 0.19K 912 42 7296K dentry
25602 25133 98% 0.12K 753 34 3012K kernfs_node_cache
19380 19380 100% 0.04K 190 102 760K Acpi-Namespace

$ uname -a
Linux zdtm.openvz.org 4.10.17-200.fc25.x86_64 #1 SMP Mon May 22 18:12:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


“It flows and flows with whom it does not happen”, we thought, and were updated to 4.11. In that part of CI, we have Fedora, and then it was its newest core. We downloaded, and after a couple of minutes, CI sent us hello again through a broken netfilter — when trying to remove an added rule, a vague error was issued:
')
[root@zdtm ~]# iptables -w -t filter --protocol tcp -A INPUT --dport 12345 -j DROP
[root@zdtm ~]# iptables -w -t filter --protocol tcp -D INPUT --dport 12345 -j DROP
iptables: Bad rule (does a matching rule exist in that chain?).


Using iptables, CRIU blocks the network to fix the state of TCP sockets. It is clear that our CI could not work with such a bug either. Without thinking twice, we collected and loaded the kernel directly from the Linus tree, but it did not last long — the memory flowed again:

[root@zdtm criu]# cat /proc/slabinfo | grep mnt
mnt_cache 36456 36456 384 42 4 : tunables 0 0 0 : slabdata 868 868 0

[root@zdtm criu]# python test/zdtm.py run -t zdtm/static/env00 --iter 10 -f ns
========================= Run zdtm/static/env00 in ns ==========================
Start test
./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
Run criu dump
Run criu restore
Run criu dump
....
Run criu restore
Send the 15 signal to 339
Wait for zdtm/static/env00(339) to die for 0.100000
Removing dump/zdtm/static/env00/31
========================= Test zdtm/static/env00 PASS ==========================

[root@zdtm criu]# cat /proc/slabinfo | grep mnt
mnt_cache 36834 36834 384 42 4 : tunables 0 0 0 : slabdata 877 877 0


About the problem reported in lkml and went on to go about their business. Soon our compatriot Alexander Viro replied that Andrei Vagin was not a good person and did not properly explain the causes of the problem. And Alexander, for a moment, is the most important person in the subsystem responsible for everything related to files, file systems. It is not very responsive to the changes sent, but if you sent a certain r ... but, then the answer will overtake you immediately and inevitably. And if you are stupid, then he will even explain everything in his own words in simple words.

Sometimes it's easier to fix a bug than to explain what happened, what Andrei did. After some time, the patch was ready , it was sent to the mailing list and attached to the kernel, which, in turn, was uploaded to the CI. With a sense of accomplishment, Andrew went to do his own thing. Viro was silent, which meant no obvious problems in the patch. Another sign of getting rid of the old problem was the message about the new problem.

[ 699.207570] BUG: Bad page state in process ip6tables-save pfn:1499f4

We no longer had time and desire to understand, limited to writing in lkml , and in CI we downloaded the kernel 4.10, which had a significant advantage - it worked and did not fall. Yes, it flowed a little, and we decided to reboot the car once a day. Someone remembered the good old days, when everyone overloaded the Windows, if it began to slow down.

Fresh bug world efforts pretty quickly repaired. CI installed the Newest and Best Core, but it was not there. There were errors in the tests, about every second.

In this direction Kirill Gorkunov kept the defense (partly due to the fact that Andrew went to sleep). By morning, a big discussion was unfolding in lkml. It turned out that our nuclear colleagues repaired the Terrible Vulnerability CVE-2017-1000364 and broke the backward compatibility of the user API. This breakdown was almost intentional: without this, the code became much more complicated, and the community, reluctantly, decided on extreme measures. And due to the fact that it was a terrible vulnerability (we need a picture of the windows security model with a gate without a fence), the changes were not discussed publicly and were quickly merged into the core. Immediately after that, it turned out that the changes brought another bug to the core, which also turned out to be a Terrible Vulnerability, and a few more days were spent on the already open debate on a new issue.

The confusion has affected distributions. When the RedHat and Ubuntu engineers transferred these changes to their kernels, something went wrong, and both distributions were broken in two different ways. For us, this was also critical, as part of our CI revolves in Travis, and there only Ubuntu is offered to choose from. Another part of the CI is spinning on Fedora, you can replace it with Ubuntu for the sake of homogeneity, but certainly not in a hurry. So Fedora just downloaded the non-native kernel, it should have been fixed already! After installation, out of habit, immediately saw whether it flows.

unreferenced object 0xffff88006342fa38 (size 1024):
comm "ip", pid 15477, jiffies 4295982857 (age 957.836s)
hex dump (first 32 bytes):
b8 b0 4d a0 ff ff ff ff c0 34 c3 59 00 88 ff ff ..M......4.Y....
04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[ffffffff8190510a] kmemleak_alloc+0x4a/0xa0
[ffffffff81284130] __kmalloc_track_caller+0x150/0x300
[ffffffff812302d0] kmemdup+0x20/0x50
[ffffffffa04d598a] dccp_init_net+0x8a/0x160 [nf_conntrack]
[ffffffffa04cf9f5] nf_ct_l4proto_pernet_register_one+0x25/0x90


It flows. The necessary changes were found quickly, for some reason the maintainer DCCP did not send them to Linus, and they were lost in his tree. We take the patch (there is no longer a mood report), reboot into a new kernel.

In one of his works, Mark Twain describes a man named Oliver who went to the silver deposits. The journey took place in very difficult conditions, but Oliver silently endured all the burdens that fell to his lot. One day, during a long stay (perhaps it was already the end of their trip) a mule fell into his house, smashing the roof. The next day, the situation repeated itself, and Oliver moved his house to the side where the mules did not walk, but this did not save him - the house again broke the roof, the cow fell. At that moment, Oliver for the first time expressed his displeasure with the phrase: “It is already becoming monotonous,” after which he resigned and left.

“This is already becoming monotonous,” thought Andrei, looking at the loaded core.

unreferenced object 0xffff9f79442cd980 (size 112):
comm "kworker/1:4", pid 15416, jiffies 4307432421 (age 28687.562s)
hex dump (first 32 bytes):
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
ff ff ff ff ff ff ff ff b8 39 1b 97 ff ff ff ff .........9......
backtrace:
[ffffffff9591d28a] kmemleak_alloc+0x4a/0xa0
[ffffffff95276198] kmem_cache_alloc_node+0x168/0x2a0
[ffffffff95279f28] __kmem_cache_create+0x2b8/0x5c0
[ffffffff9522ff57] create_cache+0xb7/0x1e0
[ffffffff952305f8] memcg_create_kmem_cache+0x118/0x160
[ffffffff9528eaf0] memcg_kmem_cache_create_func+0x20/0x110
[ffffffff950cd6c5] process_one_work+0x205/0x5d0
[ffffffff950cdade] worker_thread+0x4e/0x3a0
[ffffffff950d5169] kthread+0x109/0x140
[ffffffff9592b8fa] ret_from_fork+0x2a/0x40
[ffffffffffffffff] 0xffffffffffffffff
unreferenced object 0xffff9f798a79f540 (size 32)


To the credit of Andrei, he did not resign, but reported on the problem in lkml, set up a preventive reboot, launched CI and went about his business. After half a day, a letter came in about new problems.

> [22458.504137] BUG: Dentry ffff9f795a08fe60{i=af565f,n=lo} still in
> use (1) [unmount of proc proc]
> [22458.505117] ------------[ cut here ]------------
> [22458.505299] WARNING: CPU: 0 PID: 15036 at fs/dcache.c:1445

> [22458.515141] ---[ end trace b37db95b00f941ab ]---
> [22458.519368] VFS: Busy inodes after unmount of proc. Self-destruct
> in 5 seconds. Have a nice day...
> [22458.813846] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000018



The kernel did not just flow, it hung up tests, it hung itself in an incomprehensible state, but it gave signs of life. The problem seemed familiar. It turned out that Andrei had already written about this problem in lkml several months ago, but then no one had dealt with it before. This time it was decided to inform many more people, and the reaction finally arose. Eric Biederman almost immediately found a problem patch, but the essence of the problem a week later was still covered with darkness.

Now, on the CI machines, we have a patched-up core loaded, it is not falling yet, but it is still leaking. In the main branch of Linus, all these problems bloom and smell a little, but this did not prevent the release of 4.12 kernel.

People, keep money in savings banks work only on stable cores, unless of course you have them. A typical Linux kernel lacks testing.

In conclusion, I would like to quote Alexander Viro, who said at a nuclear summit about ten years ago: “We’re READ this damned thing. "

Authors: Pavel Emelyanov, Kirill Gorkunov and Andrey Vagin.

Source: https://habr.com/ru/post/332740/


All Articles