
A little admin story: how to catch an OOM

An admin mystery: there were three OOM kills on a server, but monitoring reported only two. Why?

Configuration

We have a ganglia-shinken-logstash-elasticsearch-kibana stack configured to monitor everything. The full description is quite extensive, so I will confine myself to the part that is relevant to the problem.

Logs from all servers are sent to logstash, which puts them into elasticsearch. The logstash config contains reactions to all the strange messages that indicate problems: if such a message appears, an event is sent to the monitoring system (shinken), which then starts pestering the admins by various means.
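
A minimal sketch of what such a pipeline could look like in the logstash config (the port, type and tag names are illustrative assumptions; the article does not show the author's actual config):

 input {
   # syslog and netconsole traffic arrives as plain UDP datagrams
   udp { port => 6666  type => "netconsole" }   # hypothetical port
 }
 filter {
   # "strange" messages that indicate trouble (kernel oopses, OOM kills, ...)
   # are tagged here; see the OOM pattern further below
 }
 output {
   elasticsearch { }   # everything is stored for kibana
   # tagged events are additionally forwarded to shinken; the exact transport
   # (nsca, exec, http) is not described in the article
 }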
In addition to syslog, which carries messages from most applications, we also configured netconsole sending from all kernels. The technology itself is dead simple: besides dmesg, the kernel sends its messages as UDP datagrams to a specified IP and MAC address. The MAC address is needed because netconsole is very low-level and has no intention of solving the problem of "how to turn an IP into a MAC" (that is, ARP). Thanks to this low level, the messages get out even in situations of complete cataclysm. For example, if the software switch dies (and the network becomes unavailable), the messages will still be sent. Moreover, they will be sent even if iptables says "-j DROP absolutely everything". And, most important and valuable, these messages will be sent successfully even when the disk subsystem is not working at all. In other words, for post-mortem investigation of "what exactly happened to that hung server", it is exactly the right tool.
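
For reference, enabling netconsole looks roughly like this (the ports, addresses, interface and MAC below are made-up examples; the parameter format is src-port@src-ip/device,dst-port@dst-ip/dst-mac):

 # send a copy of all kernel messages as UDP datagrams to the log collector;
 # the destination MAC must be given explicitly, netconsole does not do ARP
 modprobe netconsole netconsole=6665@192.168.0.10/eth0,6666@192.168.0.2/00:11:22:33:44:55
 # the same string can also be passed as a kernel boot parameter (netconsole=...)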

An obvious candidate for a "bad" message is output from the oom-killer.

 [517935.914380] ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
 [517935.914730] Call Trace:
 [517935.914807] [<ffffffff816e14ce>] dump_header+0x83/0xbb
 [517935.914877] [<ffffffff816e155b>] oom_kill_process.part.6+0x55/0x2cf
 ...
 with the final, triumphant:
 [517935.951044] Out of memory: Kill process 4550 (apache2) score 247 or sacrifice child
 [517935.951203] Killed process 4550 (apache2) total-vm:2610268kB, anon-rss:2012696kB, file-rss:3928kB


So, back to the riddle. There is a startup in pre-production when suddenly Apache (more precisely, the wsgi application) gorges itself on data to an obscene degree and gets killed with the words "go be fat somewhere else". A message reaches the admin. It would seem that everything is fine (well, "fine" in the admin sense of the word). But…

There were three OOMs, but messages arrived for only two. Monitoring is OK, netconsole is OK. A riddle? A problem? Symptoms of some mysterious unknown crud? Time to call in the court shaman with a tambourine?


Data collection

First question: was there really a "third" OOM? We recheck locally on the servers: there was. I note it down by hand: three processes were definitely killed.

Second question: the mail? We check: according to the shinken logs there were two alerts, and everything shinken generated was delivered.

Third question: is netconsole broken?

Check: # echo "performing tests" > /dev/kmsg

The message appeared in kibana, so both sending and receiving work.
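
A way to double-check the path at the network level (the port is the same assumed one as in the sketch above): inject a test message on the monitored server and watch for the datagram on the logstash host:

 # on the monitored server: push a test line into the kernel log ring buffer
 echo "performing tests" > /dev/kmsg
 # on the logstash host: confirm that the UDP datagram actually arrives
 tcpdump -n -A udp port 6666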

What do the OOM messages look like in kibana? A naive query for "invoked" (not the most common word in logs) returns three messages.

How come three? Why only two alerts then? We look at the filters: they trigger not on the "invoked oom-killer" line but on the "Out of memory: Kill process .+ (.+)" line.
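
A sketch of what such a filter might look like in logstash (the pattern and field names are my guess at the author's regexp, not a copy of it):

 filter {
   grok {
     # extract the pid and process name from the final OOM verdict line
     match   => { "message" => "Out of memory: Kill process %{NUMBER:pid} \(%{DATA:process}\) score %{NUMBER:score}" }
     add_tag => [ "oom_kill" ]   # this tag is what the alerting reacts to
   }
 }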

We look. Oops: there are only two "Out of memory: Kill process" lines in kibana, not three. Three "invoked", two "Kill".

We tune the query in kibana so that we are left alone with the netconsole output of this particular OOM. [Spoiler: how an admin survives with kibana is a topic for the next installment.] We look. It seems everything is there. We start comparing it line by line with what dmesg shows on the host. The trace is there, there is a rather ugly-looking dump of free memory (it is sent in separate pieces, so in logstash it turns into a set of incoherent messages like "5040", "1*256kB (U)", and so on), there is the process list… Aha: the process list is not complete, and in general the whole lower half of the output is missing.

For the two previous OOM events everything is in place; it is only this last one that is missing messages.

So, concrete fact №1: an incomplete OOM log was recorded in elasticsearch. Since shinken is alerted the moment a message is received, we can assume that the log did not fully reach the filters either, which means elasticsearch and its interaction with logstash can be left out of suspicion for now.

We narrow the focus of the investigation to the problem of netconsole delivery. Why? On the one hand, it is UDP, no delivery guarantees and all that; on the other hand, this is our local network, there were no monitoring reports of increased network load, no complaints from the switches about anything, and no blinking of network links either. The network load graphs are smooth as well: no connectivity problems are visible.

Moment of intuition: how big is one oom-kill record, actually? I dump it to a file and run wc -c. It turns out to be about 12.5 KB. Hypotheses of the "it does not fit into a UDP datagram" kind do not apply, because each individual message is short, a few hundred bytes. But here is the question: who is faster, the kernel generating UDP packets at a low level in a kernel thread, or a Java application that runs filters and regexps over the received messages, writes them to a database and sends events to other servers? Obviously, the dirty C code in which the word goto occurs 119287 times (3.16-rc7) does it faster.
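
Roughly how that measurement can be done (the sed boundaries are an assumption; the point is simply to carve one OOM record out of dmesg and count its bytes):

 # carve the OOM record out of the kernel log and measure its size
 dmesg | sed -n '/invoked oom-killer/,/Killed process/p' > /tmp/oom.txt
 wc -c /tmp/oom.txt    # about 12.5 KB in this case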

Question: what happens to packets that arrive faster than enterprise-grade software can process them? A kind wizard helps them fall asleep, right at the exit from the network interface, by way of headshots.

We focus on this question. And suddenly: sensation, drama, scandal. The network interface on one of the logstash servers is screaming about universal injustice, and nobody hears it:

 bond0     Link encap:Ethernet  HWaddr blah blah blah
           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
           RX packets:455816267 errors:0 dropped:130361 overruns:0 frame:0
           TX packets:73379669 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0

One hundred and thirty thousand packets that, instead of living, rejoicing and bringing happiness, were put to sleep right on the network interface, and nobody can wake them up.

We record fact №2: monitoring does not track the number of drops on network interfaces. Regardless of how the investigation proceeds and whether "this can be fixed", we have a valuable observation that immediately turns into a ticket in the task tracker: "monitor changes in the drop counters on the network interfaces". The next run to Google tells us that, alas, the counters on an interface cannot be reset by humane means, so we have to track the changes ourselves.
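
A minimal sketch of such a check, in the spirit of a nagios/shinken plugin: it reads the standard sysfs counter and compares it with the value from the previous run (the interface name and state-file location are arbitrary):

 #!/bin/sh
 # warn if the rx_dropped counter has grown since the previous check;
 # the counter itself cannot be reset, so we keep the last value in a file
 IFACE=bond0
 STATE=/var/tmp/rx_dropped.$IFACE
 CUR=$(cat /sys/class/net/$IFACE/statistics/rx_dropped)
 PREV=$(cat "$STATE" 2>/dev/null || echo "$CUR")
 echo "$CUR" > "$STATE"
 if [ "$CUR" -gt "$PREV" ]; then
     echo "WARNING: $IFACE dropped $((CUR - PREV)) packets since last check"
     exit 1
 fi
 echo "OK: no new drops on $IFACE"
 exit 0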

Forensic system administration ends at this stage. The reason for the monitoring failure is clear down to the very bottom: an OOM causes a large number of messages to be sent in an ultra-short period of time, Java cannot keep up, and the excess packets are dropped. Among them, the packet carrying the message that the application had been killed was lost, so no event was sent to shinken.

Fixing it


Part two: what do we do about it?
A short run to Google showed that the size of the receive buffer of a socket reading UDP datagrams is controlled by the application (man setsockopt, SO_RCVBUF). We continue the obstacle sprint through Google: there is an option in the logstash configuration called buffer_size. After some collective deliberation we raised it to 1 MB. The existence of net.core.rmem_max also came to light: by default it is 200-odd KB. It most likely does not apply to our case (a mere 12 KB), but raising it to 2 MB will not hurt.
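
As a sketch, the resulting changes could look like this (the udp port is again the assumed one; the numbers are those mentioned above):

 # logstash: ask for a bigger receive buffer on the netconsole socket
 input {
   udp {
     port        => 6666       # assumed port
     buffer_size => 1048576    # 1 MB, handed to setsockopt(SO_RCVBUF)
   }
 }

and, on the kernel side, raise the ceiling that SO_RCVBUF requests are capped at:

 sysctl -w net.core.rmem_max=2097152
 echo 'net.core.rmem_max = 2097152' >> /etc/sysctl.conf   # persist across reboots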

Result


0. The reason for the monitoring failure has been established precisely.
1. Shinken now tracks drops on all network interfaces.
2. logstash is ready to accept a larger burst of logs without packet loss.

PS For those who would suggest "TCP for syslog": netconsole is very low-level and does not know how to do any TCP; it even does UDP in the "fill in the fields and send an ethernet frame" manner. And the kernel would probably look rather funny if it started to stall because it was thoughtfully writing its logs over TCP…

Source: https://habr.com/ru/post/231623/

