
Kernel message handling

Foreword


A terrible tale:
EDAC MC0: 1 CE read ECC error on CPU#0 Channel#1_DIMM#0 (channel:1 slot:0)
EXT4-fs error: ext4_wait_block_bitmap:445: Cannot read block bitmap
Out of memory: Kill process 95 (sshd) score 31 or sacrifice child
CMCI storm detected: switching to poll mode
page allocation failure: order:1, mode:0x4020
invalid opcode: 0000 [#1] SMP

It looks bad, right? The list could go on and on. In this article I will tell you how to live with this and what we did about it.

Some of the messages in the examples above will plunge you into the abysses of modern processor architecture ("CMCI storm"; good luck finding your way back out of those wilds of the Internet)... Strange things in the kernel can violate your expectations about how computers work, which makes subsequent debugging very hard. Not knowing what actually happened may even leave you with the sad verdict of "some unknown garbage; we rebooted and it seems to have gone away."

Strange things can be divided into four groups:

Sometimes errors happen on the sad path (a triad of concepts: happy path, sad path, bad path): sad but predictable. There is a meaningful code branch that reacts to the problem. A program divided by zero, the hard disk started spewing "Buffer I/O error on device...", and so on. Most often such errors either mean rebooting the server (ideally automatically), or are handled in software (a single broken hard drive is not even a problem, just a RAID rebuild after replacement), or in hardware (a recoverable ECC error in memory). Sometimes errors even go unnoticed (for example, an IO error on a CD drive, or a lone segmentation fault in a program that "just restarted and kept working").

Sad example


But sometimes errors are not handled in the "expected" way and instead bring the system into a state of "this has never happened before, and here it is again" (bad path).

A living example from recent practice: the RAID controller in a server went offline and immediately came back (a PCI bus reset by the watchdog in the driver). All the arrays were re-discovered as "new" disks and got new device letters, while the UUIDs stayed the same. Everything listed in the Chef databag was mounted at the first opportunity along the same paths.

The result was a construction that was, to put it mildly, not provided for by the Administrative Code:
/dev/sda -> /mountpoint (broken)
/dev/sdd -> /mountpoint (working)
Files that had been opened on the "bad" version of /mountpoint returned an IO error on any operation, while new files were created on the working /mountpoint. Newly opened readable files were also taken from there. Meanwhile, swap lived on the "bad" /dev/sda, and fantastic messages could be observed:
 Read-error on swap-device (8:0:330471000)
 Write-error on swap-device (8:0:362589920)

(Up to that point I did not know that Linux could hit a read error on swap and stay alive.)

The sda itself was a hardware RAID, but how can disk redundancy help if the whole controller falls off? The resulting construction passed every check. Is the disk there? It is, and it is mounted. RAID status? OK! Free space? There is. IO queue size? All fine. Error counter on the mounted block device? Zero (it is a new device). Do new write requests work? They do. What more could you want?

But at the same time a lot of software caught a write error and went into an unexpected state. Some processes exited, some kept writing to their files in spite of the errors (the same kernel with its swap). Even the dedicated check that all the disks Chef expects to see are mounted also said "ok."

After the debriefing, we changed the mount check from "the drive is mounted" to "the drive is mounted no more than once". Well, that, and analysis of the logs. The logs are actually what this article is about.
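
As an illustration, such a check can simply count how many times the mountpoint appears in /proc/mounts. This is a minimal sketch in the nagios plugin convention, not our actual Chef/monitoring code; the paths and messages are made up:

 #!/bin/bash
 # Sketch: a mountpoint must be mounted exactly once, no more, no less.
 MOUNTPOINT="${1:?usage: $0 /mountpoint}"

 # Count entries in /proc/mounts whose mount target is exactly $MOUNTPOINT
 COUNT=$(awk -v mp="$MOUNTPOINT" '$2 == mp' /proc/mounts | wc -l)

 if [ "$COUNT" -eq 0 ]; then
     echo "CRITICAL: $MOUNTPOINT is not mounted"
     exit 2
 elif [ "$COUNT" -gt 1 ]; then
     echo "CRITICAL: $MOUNTPOINT is mounted $COUNT times"
     exit 2
 else
     echo "OK: $MOUNTPOINT is mounted exactly once"
     exit 0
 fi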

Central equipment failures


It is hard to pick up knocked-out teeth with broken arms. The same applies to the system's reaction to a failure of its central components. If a server has problems with power delivery to the processor, with the processor itself (yes, those break too), with RAM, the PCI controller, or other components critical for operation, it either hangs (and then the watchdog comes and reboots it) or panics. However, there is a whole class of non-critical errors (for example, a memory error is very bad, but often does not lead to a reboot). Sometimes the processor makes a mistake, a program crashes inside the ACPI virtual machine, memory controllers or PCI buses freak out. Sometimes such errors "only" change a few bytes (bits?) here and there. The consequences of this are impossible to predict. Not "difficult", but "impossible."

Kernel bugs


These exist too. BUG at bla-bla-bla. Unpleasant, but not necessarily fatal: "oh well, that didn't work out." In some cases the error is fatal and leads to a panic; in others, from the kernel's point of view, it is not. However, the consequences of a kernel bug are, again, very hard to describe from a userspace standpoint. Everything used to be fine, nothing has changed, and now everything is bad.

Story


Something went wrong in the kernel. Either a mutex broke or memory got corrupted, but the output was a strange trace and an unkillable process eating 100% CPU. I discovered it when I noticed that nova-compute (written in Python) was not responding to heartbeats. I went to the host. In top: 100% CPU. It would not restart. It would not die (kill -9). strace showed nothing (not a single syscall).

Wait, says the experienced admin. But this process was not in 'D', it had not "gone into IO and never returned." It was not performing any syscall at all. It just ate 100% CPU (in kernel context?). It was in 'R', and it ignored signal 9 (and it was not init). That story was never fully excavated, but the assumption was that, either because of a bug or because of unnoticed memory corruption, the kernel failed to handle a lock during a context switch, so the CPU time was accounted to the process while the processor was actually running in kernel context. That is, "the thing hangs there, eats 100%, and cannot be killed." Surprisingly, the process terminated on its own (or finally gave in to kill -9) after about 10 minutes, after which shutdown/migration proceeded in the regular way (a happy end of sorts).

Lack of resources


When your cozy application is told "nope, no file socket for you", that is the sad path. Everything is under control and fine. But when the network driver cannot get a contiguous 8 kB chunk of memory for itself (for a jumbo frame), prints "page allocation failure: order:2", and silently drops the packet, that is bad. That is, from the outside you can see that things are bad, but on the inside everything is fine, apart from the lines in dmesg. The main cause of this error is memory fragmentation (the chunks of memory must be physically contiguous), plus a peculiarity of the kernel version used at the time. By the way, as a temporary measure (before the server was decommissioned), we slightly reduced the jumbo frame size to the point where the module needed 2 contiguous pages instead of 4. It felt a little better.
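
Whether such higher-order allocations are about to start failing can be roughly judged from /proc/buddyinfo, which lists, per memory zone, how many free contiguous blocks of each order are left. A small illustration (generic commands, not from our setup):

 # Columns after the zone name are free blocks of order 0, 1, 2, ... (order n = 2^n pages)
 cat /proc/buddyinfo

 # Print only the order>=2 counters per zone; zeros here mean that
 # "page allocation failure: order:2" is not far away.
 awk '{ printf "%s %s:", $2, $4; for (i = 7; i <= NF; i++) printf " %s", $i; print "" }' /proc/buddyinfo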

And sometimes the magic OOM killer comes and takes the New Year's presents away from some process. This is especially sad when an auxiliary process dies (for example, epam for beam.smp in the Erlang world) and nobody finds out about it. It gets even more interesting when the memory shortage happens on a very busy system: not only does the OOM killer kill somebody, but the search for the "victim" can take tens of seconds, and during that time the server does nothing. Nothing at all: neither the network nor the disk works. Only the kernel rustles through its structures in memory. And then (minus the unlucky loser) it continues working as if nothing had happened. And guess, after that, why the failover fired with a 20 s timeout.

Getting to the point


All this sadness has one thing in common: the kernel writes about its causes to dmesg. Even when the kernel decides to panic, it first writes to dmesg and then panics. Post-mortem analysis of the causes is perhaps the most important part of a system administrator's work, and it resembles the work of a forensic expert. The smallest traces of what happened let you put the pieces of the mosaic together. The mosaic is especially interesting when the incident spans several servers.

The second important thing is to learn about all these nasty things when they happen, because between the "sad trace" in dmesg and the service degradation that becomes visible from the outside, a considerable amount of time can pass (up to several tens of hours). And that leaves time to "fix it as if nothing had happened."

Thus, I want to "collect logs" and "respond to them."

Those who have already sighed to themselves "oh no, yet another story about netconsole": no. I will say a few words about netconsole and then talk about how to live afterwards, once the data is already arriving via UDP.

netconsole


A detailed howto has already been written: habrahabr.ru/post/131220 , so here is a brief summary: netconsole is a kernel module for sending kernel logs as UDP packets. Note that it sits so low in the stack that even ARP does not work for it, so you have to specify the recipient's MAC address yourself.

But netconsole keeps working even if you smash the entire network stack above it. For example, do rmmod ip, or remove the IP address from the interface, or, if bridging is enabled, rmmod openvswitch. It is great fun to watch the kernel report over the network that the entire network is broken.

When the module is loaded, it is given the local interface, the IP address to send from, and the recipient's IP and MAC addresses. If the traffic has to leave the network segment, the MAC address of the gateway (.1) is used as the destination MAC.
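
In module-parameter form it looks roughly like this (the parameter syntax is from the kernel's netconsole documentation; the addresses are made up, and port 6666 matches the logstash input shown below):

 # netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
 modprobe netconsole netconsole=6666@192.0.2.10/eth0,6666@192.0.2.1/00:11:22:33:44:55

 # To make it permanent, the same string can go into /etc/modprobe.d/netconsole.conf
 # as "options netconsole netconsole=..." plus a netconsole entry in /etc/modules.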

Reception


This is where the interesting part begins. You can collect netconsole output with syslog (or its variations, syslog-ng and rsyslog). In our case logstash is used; we have already written about this setup before.

The logs are, first, received; second, tagged on input as 'netconsole' (an arbitrarily chosen label); and, most importantly, the server name is written in place of its address. This is a very important step, more on it below.

Config example:
 input {
   udp {
     type => "netconsole"
     port => "6666"
     buffer_size => "1048576"
   }
 }
 filter {
   if [type] == "netconsole" {
     dns {
       action => "replace"
       reverse => ["host"]
     }
     mutate {
       add_field => ["syslog_host", "%{host}"]
     }
   }
 }


The only non-trivial thing here is setting the host field. And it is very important, because without it there will be no dmesg. It is so important that we are now debating whether we should write all the addresses into /etc/hosts on the monitoring hosts and use that, so that a DNS outage at the very moment a trace is being recorded does not ruin the error log.

Next, we need a way to look at these logs. In Kibana, for convenience, a 'TERMS' column was added with the choice: syslog or netconsole. The resulting filter looks like this:
 host must 'hostname' and source must 'netconsole'.

The result is already readable, but very inconvenient. For convenience I use another small hand-made script,
which pulls dmesg out of elasticsearch as plain text:
 #!/usr/bin/python
 from pyelasticsearch import ElasticSearch
 import json
 import sys

 es = ElasticSearch('http://elastic.local:9200/')
 q = json.load(sys.stdin)
 res = es.search(q)['hits']['hits']
 for i in res:
     print(i['_source']['message']),


The script is fed the query of interest from Kibana as JSON on stdin (the 'i' button in the relevant widget), and on stdout it produces a neat pile of lines without the surrounding garbage. However, I will write about life and the struggle with Kibana separately.
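
Usage looks roughly like this (the script and file names are mine; the article does not name them):

 # Save the query JSON shown by kibana's 'i' button into a file, then:
 ./dump_dmesg.py < kibana_query.json | less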

Monitoring


Above I showed how to view the data post mortem. There is also the task of reacting to "bad" lines and notifying shinken (nagios) about them. And nagios knows very well how to get on the admins' nerves.

This is done using filters.

 filter {
   if [type] == "netconsole" {
     if ( [message] =~ /ioc0: attempting task abort!/
          or [message] =~ /mptbase: ioc0:.+Code=\{Abort\}/
          or [message] =~ /Device offlined/
          or ...
          or [message] =~ /Read-error on swap-device/
          or "1" == "2" ) {
       noop { add_tag => ["notify"] }
     }
   }
 }

We set the notify tag if we do not like the line. Note the "1" == "2": this always-false condition was added so that every "meaningful" line can boldly end with an or without fear of breaking the syntax.

Flag processing is very simple:
 output {
   if "notify" in [tags] {
     exec {
       command => "/usr/lib/nagios/plugins/log_event_sender.sh %{syslog_host} log_event '%{message}'"
     }
   }
 }
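
The article does not show log_event_sender.sh itself. A minimal sketch of such a script, assuming it submits a passive check result through the monitoring host's external command file (the path and the hard-coded CRITICAL state are my assumptions):

 #!/bin/bash
 # Hypothetical log_event_sender.sh: push the offending dmesg line into
 # shinken/nagios as a passive check result. Arguments come from the
 # logstash exec{} call above: host, service name, message.
 HOST="$1"
 SERVICE="$2"
 MESSAGE="$3"
 CMD_FILE=/var/lib/shinken/nagios.cmd   # assumed path, adjust to your setup

 NOW=$(date +%s)
 # 2 = CRITICAL in the nagios return-code convention
 printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;2;%s\n' \
     "$NOW" "$HOST" "$SERVICE" "$MESSAGE" > "$CMD_FILE"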


Testing


I would say "that's all", if not for one "but". Some kinds of scary lines appear extremely rarely, and it is very hard to know whether the monitoring will actually fire on them. Up to a certain point we relied on "well, we just won't break it", but the fear of a silly typo in the filters, silently disabling the reaction to errors, was too great.

And so a micro-framework for testing was born.

The idea is simple: every regexp in the filters should catch only one line, and for every regexp there should be an original line (preferably copied verbatim from a real, live event) that we want to catch.

The filter file was shown above; an example of the file with sample lines can be seen in the foreword. One nasty thing per line.
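
For illustration only (this is not the team's actual file), netconsole_examples.txt could contain raw messages like the ones quoted earlier, each of which some regexp in the filter is expected to catch:

 Read-error on swap-device (8:0:330471000)
 Write-error on swap-device (8:0:362589920)
 Out of memory: Kill process 95 (sshd) score 31 or sacrifice child
 EDAC MC0: 1 CE read ECC error on CPU#0 Channel#1_DIMM#0 (channel:1 slot:0)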

Next comes the simplest possible test: run logstash in stdin/stdout mode and check that the number of input lines equals the number of output lines (the ones tagged notify). If the numbers diverge, someone broke something. If they match, every bad line is caught by something.

Below is run_test.sh, in which nothing interesting happens: we assemble a micro-config with echo, insert into the middle the filter file we are testing, and check that the counts match. If they do not, the admin figures out for himself what went wrong.

 #!/bin/bash
 TMPC=logstash-test-$RANDOM.conf
 TMPO=logstash-test-output-$RANDOM.txt
 LOGSTASH=logstash/logstash-1.4.2/bin/logstash
 TIMEOUT=60
 FILTERS=....../filters.conf.erb

 if test ! -d logstash; then
     echo "No local logstash found at $LOGSTASH (Please download/unpack)"
     exit -1
 fi

 echo 'input { stdin { type => "netconsole" } }' > $TMPC
 cat $FILTERS >> $TMPC
 echo 'output { if "notify" in [tags] { stdout { } } }' >> $TMPC
 #echo $LOGSTASH --quiet -f $TMPC

 cat netconsole_examples.txt | timeout -s SIGKILL $TIMEOUT $LOGSTASH --quiet -f $TMPC > $TMPO

 if test `wc -l netconsole_examples.txt | awk '{print $1}'` -ne `wc -l $TMPO | awk '{print $1}'`; then
     echo Netconsole test FAILED
     echo Input: `wc -l netconsole_examples.txt`
     echo Output: `wc -l $TMPO`
     echo not removing files $TMPC and $TMPO
     exit 1
 else
     echo netconsole test PASSED
     rm $TMPC $TMPO
     exit 0
 fi


After every edit you must (well, may) run run_test.sh and check that everything still works. At the same time, netconsole_examples.txt serves as an excellent archive of all the most terrible things you can meet in this life. (The lines in it are commented to taste, with links to the wiki articles where the post-mortem analyses are stored, and so on.)

At the monitoring output, we have the following messages:
 ** PROBLEM alert - database1984.utah42.nsa.gov/log_event is CRITICAL **
 Shinken notification

 Notification Type: PROBLEM

 Service: log_event
 Host: database1984.utah42.nsa.gov
 Address: 238.211.14.5
 State: CRITICAL

 Date/Time: 01-23-2015

 Additional Info: Jan 23 03:59:21 database1984 kernel: [15941700.780017] mce: [Hardware Error]: Machine check events logged

Source: https://habr.com/ru/post/92969/

