📜 ⬆️ ⬇️

Processing logs based on previous messages in logstash / elasticsearch

I recently wrote about capturing nuclear MCE (machine check error) and other bad things with the help of netconsole. Extremely useful thing. One problem: throttling on the CPU due to local overheating (continuous load) is fixed as MCE. A backup happens - and the admin comes a terrible message about the MCE, which in practice means “slightly overheated” and definitely does not require attention to itself at 3 am.

The ridiculousness of the problem is that Linux fixes MCE after the throttling is over. That is the 'normal' mode, but instead it turns into MCE. It looks like this:
 CPU0: Core temperature above threshold clock throttled (total events = 40997)
 CPU4: Core temperature above threshold clock throttled (total events = 40997)
 CPU4: Core temperature / speed normal
 CPU0: Core temperature / speed normal
 mce: [Hardware Error]: Machine check events logged

In this case, we definitely want to respond to normal MCE. What to do?

Within logstash, message handling is supposed to be stateless. You see the message - you react. Introduce, for the sake of one type of message, a more complex system — overkill.
')
It would seem that there is a filter (not to be confused with output) elasticsearch, which allows you to make requests. Unfortunately, he does not know how to do ifs, that is, remove_tag and add_tag ​​will work regardless of whether the search was successful or not.

It's sad.


The second problem: how should the query look like in elasticsearch? We need a time interval relative to the current one, and we need to filter on the current host (that is, the host to which the MCE arrived).

Let's start to decide in turn.

First: the request. query cannot contain a normal search query, but must contain a query string, which is inside the lucene query . Let's start with a fixed query:

"Type: netconsole AND host: compute109 AND message: temperature". (type is set in input for receiving messages from the kernel via netconsole).

Hooray, there are results, but for a large interval, which is fraught with false positives. What about @ timestamp? Help for query language is quite modest, and there is no math or special variables. On the other hand, ranges are supported using the "[from TO to]" entry.

Here I had an intuition attack and I wrote @ timestamp:[now-2h TO now] in the request, although there is not a word about 'now' in the documentation. And they understood me. It is better to check such requests in kibana in the field “query”. A quick check revealed that now-1m also rolls in quality time.

So the query has become more pleasant: type:netconsole AND host:compute109 AND message:temperature and @timestamp:[now-1m to now] ...

It would seem that everything will be fine. But - did not work. Although it should have been. And here I send WTF rays to the creators of lucene, logstash and elasticsearch. Because and and AND are six different letters.

Correct (and only!) type:netconsole AND host:compute109 AND message:temperature AND @timestamp:[now-1m to now] . Otherwise, the request begins to search for messages (message) containing any words temperature, and, @ timestamp ).

The question remains: where does the host come from? Here logstash helps - it can substitute variables into text strings from the fields of the message being processed.

The query query in the logstash config starts to look like this:

query => "(type:netconsole) AND (host:%{[host]}) AND (@timestamp:[now-1m TO now]) AND (message:temperature)"

And there would be happiness if it were not for the absence of if'a inside the elasticsearch filter. It turns out a perverted construction: we want to use elasticsearch for if, but we cannot, and where we can use if, we cannot make inquiries "into the past."

After several dozen attempts to “climb up the tree”, we got the following construction: We, after finding the MCE, make a request to elastic, and fill in the field (in the message about MCE) “mce_temperature” with the message content of the found message about 'temperature' on the host from the message about MCE, in the last minute. That is, we add to the message about MCE the previous one, about temperature. If we find, of course.

The rest is simple: we have MCE and we have the contents of the mce_temperature field with the “previous” message. Outside of the elasticsearch block, we write:

         if [temperature_mce] = ~ / Core temperature.speed normal / {
                 noop {
                         remove_tag => ["notify"]
                         add_tag ​​=> ["supressed"]
                 }
           }


(Using the notify tag, we send a message to shinken, and supress is just for a person, to see why the message was not sent).

Catching cockroaches


Everything was good, except that at the moment when I wrote this article (assuming that the problem is solved), I get happy with such nagios and speak in human voice: ** PROBLEM: ALERT ... kernel: [8871838.807783] mce: [Hardware Error ]: Machine check events logged

We look:
 [Tue Feb 24 15:51:40 2015] CPU0: Core temperature / speed normal
 [Tue Feb 24 15:52:54 2015] mce: [Hardware Error]: Machine check events logged

Oh, thanks! And I have an interval [now-1m TO now], that is 60 seconds. And the events had an interval of 74 seconds. So in the final version, the viewing interval had to be increased to 3 minutes.

The second “fun” problem, which came a bit later, is problems with reverse DNS. It turned out that the DNS does not keep up with monitoring at some points and does not resolve all IPs. We usually resolve IP to the name and write it in the host field (dns filter). And it turns out that the 'temperature' recorded with IP in the host field (could not be resolved), and the MCE came with the host name. Or vice versa. In any case, nasty bzz-bzzzz from the phone at night without sufficient reason.

The solution was also simple: we will resolve the name, and ip save to another field. And we do a search by IP.

Conclusion


Final config (snippet):

   if [message] = ~ / Machine check events / {
      elasticsearch {
         hosts => ["localhost"]
         query => "(type: netconsole) AND (source_ip:% {[source_ip]}) AND (@timestamp: [now-3m TO now]) AND (message: temperature)"
         fields => ["message", "temperature_mce"]
      }
         if [temperature_mce] = ~ / Core temperature.speed normal / {
                 noop {
                         remove_tag => ["notify"]
                         add_tag ​​=> ["supressed"]
                 }
           }
    }

If you want to disable throttling MCE completely, then replace / Core temperature.speed normal / with / Core temperature /

I'm not sure that the proposed solution is best practice, but it also works minimally intrusively into the configuration. This approach allows you to solve a whole class of problems associated with multi-line messages and allows you to make retrospective situational decisions without diluting the bigdata for nothing.

Source: https://habr.com/ru/post/251349/


All Articles