
Pod memory limits and OOM Killer intervention

Hello again! We prepared this translation of the following article specifically for students of the "Infrastructure platform based on Kubernetes" course, which launches this month. Let's begin.



In recent days, some of my pods kept crashing, leaving an entry in the OS system log saying that the OOM Killer had killed the container process. I decided to figure out why this happens.

Memory limit and cgroup memory parameters


Let's run a test on the K3s distribution. We create a pod with a memory limit of 123 MiB.

 kubectl run --restart=Never --rm -it --image=ubuntu --limits='memory=123Mi' -- sh
 If you don't see a command prompt, try pressing enter.
 root@sh:/#

In another console, find out the pod's uid.

 kubectl get pods sh -o yaml | grep uid
   uid: bc001ffa-68fc-11e9-92d7-5ef9efd9374c
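The same uid can also be extracted with jsonpath, if you prefer a one-liner (not part of the original output above):

 kubectl get pod sh -o jsonpath='{.metadata.uid}'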

On the node where the pod is running, find the cgroup parameters, using the pod's uid in the path.

 cd /sys/fs/cgroup/memory/kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c
 cat memory.limit_in_bytes
 128974848

128974848 is exactly 123 MiB (123 * 1024 * 1024). The picture becomes clearer: in Kubernetes, the memory limit is enforced through the cgroup. As soon as the container's memory usage grows past the limit, the cgroup initiates the killing of the container process.
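As a quick cross-check, the same value is usually visible from inside the container as well (assuming cgroup v1 and a runtime that mounts the container's own cgroup subtree at /sys/fs/cgroup, as here):

 root@sh:/# cat /sys/fs/cgroup/memory/memory.limit_in_bytes
 128974848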

Stress test


Let's install the stress testing tool in the open console session.

 root@sh:/# apt update; apt install -y stress 

In parallel, we will watch the system log entries with dmesg -Tw.

First, run the stress utility, telling it to allocate 100 MB of memory. The process starts successfully.

 root@sh:/# stress --vm 1 --vm-bytes 100M &
 [1] 271
 root@sh:/# stress: info: [271] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

Now let's run a second stress test.

 root@sh:/# stress --vm 1 --vm-bytes 50M
 stress: info: [273] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
 stress: FAIL: [271] (415) <-- worker 272 got signal 9
 stress: WARN: [271] (417) now reaping child worker processes
 stress: FAIL: [271] (451) failed run completed in 7s

This launch immediately killed the first stress test (PID 271): its worker process received signal 9.
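The pod's memory cgroup keeps its own counters for these events. On the node, in the same pod cgroup directory as before, you could check something like this (memory.failcnt and memory.oom_control are standard cgroup v1 files; the oom_kill counter inside oom_control requires kernel 4.13+):

 cat memory.failcnt      # allocation attempts that hit the limit (the log below shows failcnt 3632)
 cat memory.oom_control  # oom_kill_disable, under_oom and the cumulative oom_kill counter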

Meanwhile, the following entries appeared in the system log:

[Sat Apr 27 22:56:09 2019] stress invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=939
[Sat Apr 27 22:56:09 2019] stress cpuset=a2ed67c63e828da3849bf9f506ae2b36b4dac5b402a57f2981c9bdc07b23e672 mems_allowed=0
[Sat Apr 27 22:56:09 2019] CPU: 0 PID: 32332 Comm: stress Not tainted 4.15.0-46-generic #49-Ubuntu
[Sat Apr 27 22:56:09 2019] Hardware name: BHYVE, BIOS 1.00 03/14/2014
[Sat Apr 27 22:56:09 2019] Call Trace:
[Sat Apr 27 22:56:09 2019] dump_stack+0x63/0x8b
[Sat Apr 27 22:56:09 2019] dump_header+0x71/0x285
[Sat Apr 27 22:56:09 2019] oom_kill_process+0x220/0x440
[Sat Apr 27 22:56:09 2019] out_of_memory+0x2d1/0x4f0
[Sat Apr 27 22:56:09 2019] mem_cgroup_out_of_memory+0x4b/0x80
[Sat Apr 27 22:56:09 2019] mem_cgroup_oom_synchronize+0x2e8/0x320
[Sat Apr 27 22:56:09 2019] ? mem_cgroup_css_online+0x40/0x40
[Sat Apr 27 22:56:09 2019] pagefault_out_of_memory+0x36/0x7b
[Sat Apr 27 22:56:09 2019] mm_fault_error+0x90/0x180
[Sat Apr 27 22:56:09 2019] __do_page_fault+0x4a5/0x4d0
[Sat Apr 27 22:56:09 2019] do_page_fault+0x2e/0xe0
[Sat Apr 27 22:56:09 2019] ? page_fault+0x2f/0x50
[Sat Apr 27 22:56:09 2019] page_fault+0x45/0x50
[Sat Apr 27 22:56:09 2019] RIP: 0033:0x558182259cf0
[Sat Apr 27 22:56:09 2019] RSP: 002b:00007fff01a47940 EFLAGS: 00010206
[Sat Apr 27 22:56:09 2019] RAX: 00007fdc18cdf010 RBX: 00007fdc1763a010 RCX: 00007fdc1763a010
[Sat Apr 27 22:56:09 2019] RDX: 00000000016a5000 RSI: 0000000003201000 RDI: 0000000000000000
[Sat Apr 27 22:56:09 2019] RBP: 0000000003200000 R08: 00000000ffffffff R09: 0000000000000000
[Sat Apr 27 22:56:09 2019] R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
[Sat Apr 27 22:56:09 2019] R13: 0000000000000002 R14: fffffffffffff000 R15: 0000000000001000
[Sat Apr 27 22:56:09 2019] Task in /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c/a2ed67c63e828da3849bf9f506ae2b36b4dac5b402a57f2981c9bdc07b23e672 killed as a result of limit of /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c
[Sat Apr 27 22:56:09 2019] memory: usage 125952kB, limit 125952kB, failcnt 3632
[Sat Apr 27 22:56:09 2019] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[Sat Apr 27 22:56:09 2019] kmem: usage 2352kB, limit 9007199254740988kB, failcnt 0
[Sat Apr 27 22:56:09 2019] Memory cgroup stats for /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Sat Apr 27 22:56:09 2019] Memory cgroup stats for /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c/79fae7c2724ea1b19caa343fed8da3ea84bbe5eb370e5af8a6a94a090d9e4ac2: cache:0KB rss:48KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:48KB inactive_file:0KB active_file:0KB unevictable:0KB
[Sat Apr 27 22:56:09 2019] Memory cgroup stats for /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c/a2ed67c63e828da3849bf9f506ae2b36b4dac5b402a57f2981c9bdc07b23e672: cache:0KB rss:123552KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:123548KB inactive_file:0KB active_file:0KB unevictable:0KB
[Sat Apr 27 22:56:09 2019] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Sat Apr 27 22:56:09 2019] [25160] 0 25160 256 1 28672 0 -998 pause
[Sat Apr 27 22:56:09 2019] [25218] 0 25218 4627 872 77824 0 939 bash
[Sat Apr 27 22:56:09 2019] [32307] 0 32307 2060 275 57344 0 939 stress
[Sat Apr 27 22:56:09 2019] [32308] 0 32308 27661 24953 253952 0 939 stress
[Sat Apr 27 22:56:09 2019] [32331] 0 32331 2060 304 53248 0 939 stress
[Sat Apr 27 22:56:09 2019] [32332] 0 32332 14861 5829 102400 0 939 stress
[Sat Apr 27 22:56:09 2019] Memory cgroup out of memory: Kill process 32308 (stress) score 1718 or sacrifice child
[Sat Apr 27 22:56:09 2019] Killed process 32308 (stress) total-vm:110644kB, anon-rss:99620kB, file-rss:192kB, shmem-rss:0kB
[Sat Apr 27 22:56:09 2019] oom_reaper: reaped process 32308 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


So the process with host PID 32308 was killed due to out of memory (OOM). But the most interesting part is hidden at the end of the log entries: the table of the pod's processes.



There you can see the processes of the pod that the OOM Killer considered as kill candidates. The pause process, which holds the pod's network namespace, received an oom_score_adj of -998, which guarantees it will not be killed. All remaining processes in the container received an oom_score_adj of 939. You can verify this value using the formula from the Kubernetes documentation below:

 min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) 

Find out the amount of memory available to the node:

 kubectl describe nodes k3s | grep Allocatable -A 5
 Allocatable:
  cpu:                1
  ephemeral-storage:  49255941901
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2041888Ki

If the amount of requested memory is not specified, it defaults to the limit. Substituting the values, we get oom_score_adj = 1000 - (1000 * 123 * 1024) / 2041888 = 938.32, which is very close to the value 939 in the system log. (I don't know how the OOM Killer arrives at the exact value of 939.)
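For what it's worth, if the same formula is evaluated in integer arithmetic (as Go's int64 math would do it), it lands exactly on 939. A minimal shell sketch with the numbers above; the variable names are mine, not from the Kubernetes source:

 # compute oom_score_adj for this pod
 request_bytes=$((123 * 1024 * 1024))   # the request defaults to the limit here
 capacity_bytes=$((2041888 * 1024))     # node allocatable memory
 adj=$((1000 - 1000 * request_bytes / capacity_bytes))
 # clamp to [2, 999], per min(max(2, ...), 999)
 [ "$adj" -lt 2 ] && adj=2
 [ "$adj" -gt 999 ] && adj=999
 echo "$adj"   # -> 939 with integer division (floating point gives 938.32)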

So, all processes in the container have the same oom_score_adj. The OOM Killer calculates an OOM score (oom_score) from each process's memory usage and adjusts it with oom_score_adj. Ultimately, it kills the first stress test process, which had consumed the most memory, 100 MB, corresponding to oom_score = 1718.
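Both values can be observed live on the node for any container process through the standard /proc interface (the PID below is from this run and will differ for you):

 cat /proc/32307/oom_score_adj   # the adjustment set by the kubelet: 939
 cat /proc/32307/oom_score       # the current score the OOM Killer ranks processes by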

Conclusion


Kubernetes enforces the pod memory limit through cgroup parameters and the OOM Killer. Take care to coordinate the operating system's OOM conditions with the OOM conditions of your pods.
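A practical note: when the OOM Killer's victim is the container's main process (PID 1), Kubernetes records this in the pod status; in our experiment only a stress worker was killed, so the container itself kept running. A sketch of how you could check:

 kubectl get pod sh -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
 # prints OOMKilled if the container's main process was the victim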

How do you like the material? Everyone who wants to learn more about the course is invited to the free webinar on June 17, where we will explore what Kubernetes offers for organizing continuous delivery (CI/CD), with approaches both for a small team with a few applications and for a large organization with many systems.

Source: https://habr.com/ru/post/456002/

