
Collect and analyze logs with Lumberjack + Logstash + Elasticsearch + RabbitMQ

Good day.

Logs often get undeservedly little attention from developers. And when programmers do need to dig through log files, sometimes from several dozen servers at once, the task falls on the system administrators and costs them a lot of time and effort.

So when a problem occurs and an error has to be found in the logs of some service, everything depends on how good the admin is and on how well the programmers know what to look for, and also on whether logrotate has already run and rotated the old data away...
Logstash helps a lot in such situations. It has been actively developed over the past year, has already made quite a bit of noise, and there are enough articles (on Habr, for example, here and here) on how to install it and get it running on a single server. In this article I will talk about Logstash as a processing server, because as an agent it did not suit us in some respects.

Why? Maxifier is a SaaS product with customers in the USA, Brazil, several European countries and Japan, so we have about a hundred servers scattered around the world. For day-to-day operations we need convenient access to our applications' logs and fast error search in them whenever there are problems with third-party services/APIs, incorrect data shows up, and so on. Incidentally, The Guardian (one of our clients) uses a similar logging system.

After several rounds of pulling logs from multiple servers with rsync, we started thinking about an alternative that would be less slow and laborious. We recently built our own logging system for our various applications; below I share my experience and a description of how it works.


The applications we collect logs from:
1. Java daemons (Jetty / native)
2. Node.js daemons
3. Dynamically launched Java applications (for example, cluster nodes)

What matters most to programmers are errors and exceptions, together with the multi-line stack traces that accompany them.
Therefore, the main requirements were:

1. Good support for multi-line messages (an example of such a record is shown right after this list).
2. Ability to handle different message formats.
3. Easy configuration of data sources.
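
Here is an illustrative, made-up record of the kind that has to stay together as a single message: a timestamped first line followed by stack-trace lines that must not become separate events:

    2014-03-14 12:34:56 [ERROR] - Failed to fetch report
    java.lang.IllegalStateException: connection reset
        at com.example.report.ReportFetcher.fetch(ReportFetcher.java:42)
        at com.example.report.ReportJob.run(ReportJob.java:17)

Everything after the first line has no leading timestamp, and that is exactly the property the multiline filter shown later relies on.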

Without going deep into the configuration details, I will describe the components and explain why each of them is present in the scheme.
Overall, our log collection pipeline looks like this:

[diagram: Lumberjack agents → Logstash listener → RabbitMQ → Logstash indexer (filter) → Elasticsearch → Kibana]

As the diagram shows, we do not run Logstash itself on the application servers to watch the application files there. We considered it wasteful to run a daemon that consumes 128 MB of memory and a significant share of CPU while processing logs, and forcibly limiting its resources would have looked like a crutch. So we started looking for other, less demanding ways to transport logs.

Agents


Rsyslog is a ready-made daemon that seemingly can do everything, but as usual the devil is in the details.

Benefits:
1. Fast.
2. Can track files.
3. Can send data encrypted or unencrypted (in general, it "can dig and can not dig").
4. Already present on the system.

Disadvantages:
1. Cannot send multi-line messages (all attempts to receive a multi-line message on the server side were in vain).
2. Cannot track files by mask.
3. Handles file truncation badly.

As already mentioned, multi-line message support was important for us, so we kept searching, but temporarily hooked up log collection via rsyslog so that we could at least search those logs later.
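
For reference, that interim rsyslog shipping looked roughly like the sketch below. This is only an illustration using the classic imfile directives, not our exact configuration; the file path and the central host name are made up:

    # load the file-input module and watch one application log
    $ModLoad imfile
    $InputFileName /opt/apps/current/logs/main/app.log
    $InputFileTag app-main:
    $InputFileStateFile stat-app-main
    $InputFileSeverity info
    $InputRunFileMonitor
    # forward everything to the central log host over TCP (@@ means TCP)
    *.* @@logs.example.com:514

Each watched file needs its own imfile block, which is another face of the "no masks" limitation mentioned above.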

Lumberjack (now known as logstash-forwarder) is an experimental daemon for sending logs.

Benefits:
1. Fast.
2. Minimalist (Unix-way).
3. Can track files by mask.
4. Consumes a minimum of resources (written in Go).

Disadvantages:
1. Experimental.
2. Does not handle multi-line messages on its own.

Logstash (agent only)

Benefits:
1. The same software for collecting logs everywhere.
2. Reasonably performant.
3. Can do everything we need.

Disadvantages:
1. Java on all servers.
2. High memory requirements can potentially affect the operation of other services.
3. Difficult to configure.

Agent selection


We decided to test Lumberjack and, if the result was unsatisfactory, fall back to the option with Logstash on every server.
Since Logstash and Lumberjack are developed by the same people, we also got some nice extras: for example, custom fields, which let us recognize the application type of a log without parsing file names.
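
To illustrate those custom fields, a logstash-forwarder (Lumberjack) config might look like the sketch below. The server address and the Node.js path are made up; the /opt/apps mask mirrors the layout that appears in the filter config later in the post, and the type values match the ones the filter branches on:

    {
      "network": {
        "servers": [ "logs.example.com:9999" ],
        "ssl ca": "/etc/ssl/changeme.crt",
        "timeout": 15
      },
      "files": [
        {
          "paths": [ "/opt/apps/*/logs/*/*.log" ],
          "fields": { "type": "custom" }
        },
        {
          "paths": [ "/var/log/nodejs/*.log" ],
          "fields": { "type": "nodejs" }
        }
      ]
    }

The "type" field travels with every event, so the processing side can tell application types apart without parsing file names.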

We tested this daemon, and the only bug we found is that when disk space runs out it cannot write its state file, so it resends messages from the previous position or may fail to notice that a file has been truncated. We normally do not let that happen, and everything returns to normal once free space appears.

Overall, we liked Lumberjack as an agent. Its configuration is extremely simple, it can match files by mask and it tracks a large number of files perfectly. Moreover, there is a good Puppet module for it, with which we set up our servers without any problems. In parallel with choosing the agent, we were testing Logstash itself, which back then was still at version 1.1.12.

Logstash

First of all, we tried to use it to process multi-line messages, first from rsyslog and then from Lumberjack.
At first we tried the huge regular expressions that Google turns up on GitHub Gist, but in the end we settled on a simpler rule: everything that does not match our timestamp should be treated as part of the previous message. Conveniently, Lumberjack delivered the messages unmodified.

As a result, we ended up with this simple pattern:

    filter {
      multiline {
        pattern => "(^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
        negate => true
        what => "previous"
      }
    }

We set up a single Logstash instance to process the incoming logs, gave it several worker threads and started connecting servers. Everything seemed fine, but we began to notice that the server would periodically drop connections, run out of memory and occasionally crash. Even 16 GB of -Xmx did not help.

We decided to split the load across two types of instances, and on the whole it paid off.

Instances of the first type (1) only listen for TCP connections, save the raw messages to files (which is convenient for debugging filters), glue multi-line messages together and send them to the bus. They cope well with this task, loading two cores at 100% at peak times (our peak is 2,500-3,000 messages per second); the full listener config is given in the Update at the end of the post.

Instances of the second type (2) take messages from the bus, split them into fields, put them into Elasticsearch and, just in case, also into files. As it turned out, these second-type instances were the most problematic: either memory would run out or Elasticsearch would stall. But thanks to the bus no data was lost, and the first-type instances (1) kept collecting logs.

Keeping the first-type instances this lightweight means the listening server as a whole consumes minimal resources while still withstanding a fairly large number of messages per second at peak. The filtering instances (of the second type) can be limited in resources, and their performance is not something to worry about much.

This setup worked for us for about a month, until we ran into some very strange behaviour. The Lumberjack agents on the servers started feverishly reconnecting to Logstash and immediately dropping off with messages that they could not reach the listening Logstash, and even i/o timeouts, while at a cursory glance the listening Logstash seemed alive and was even sending messages to the bus.
Zabbix reported that the process's PermGen was filling up; we increased the memory and restarted it, and within just 10 minutes the situation repeated.

We started figuring out which server was generating the messages that sent Logstash into a stupor. After a brief search (by the way, it is easy to find who sent the most messages, since Lumberjack reports this to syslog), a log file of about 20 GB was found on one of the servers, and it contained several lines 2-3 GB in size. It is not hard to guess what was happening: Logstash slowly assembled such a line in memory until it hit the limit and then tried to do something with it. For such cases we added a cut-off that filters such messages out and asked the developers not to do that again.
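
A minimal sketch of one possible safeguard of this kind (not necessarily the exact filter we used), assuming a Logstash build that ships the ruby filter; the 1 MB threshold is arbitrary:

    filter {
      ruby {
        # drop events whose raw message is implausibly large for a log line
        code => "event.cancel if event['message'] && event['message'].bytesize > 1_000_000"
      }
    }

Placed on the listening instance, something like this keeps a single pathological line from tying up memory before it ever reaches the bus.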

There were also problems with overly detailed logs, for example from the Node.js logger bunyan: its output contained up to 30-40 (!) fields, which naturally looked awful in the interface. Here we kept it simple (the relevant fragment of the filter is shown after this list):
1) removed the unnecessary fields;
2) converted the message priority into a human-readable form.
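
An abbreviated excerpt of the nodejs branch of our filter (the full version is in the Update at the end of the post) shows both steps:

    if [type] == "nodejs" {
      json { source => "message" }
      # keep the message under a friendlier name, then drop the noisy bunyan fields
      mutate { rename => ["msg", "mxf_message"] }
      mutate { remove_field => ["0", "1", "2", "args", "http", "hostname", "msg", "v", "data", "err", "t", "pid", "io"] }
      # bunyan's numeric levels become human-readable; the remaining levels follow the same pattern
      if [level] == 50 { mutate { replace => ["level", "ERROR"] } }
      if [level] == 30 { mutate { replace => ["level", "INFO"] } }
    }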

Elasticsearch

It has not caused any major problems during the time we have been using it. An Elasticsearch instance with 3 GB of memory comfortably handles the indexing and the 5-6 GB of archived data per day. The only thing that occasionally happens is an OutOfMemory error when someone tries to stuff in a huge message or run an overly complex query. But JMX monitoring has not gone anywhere, and Zabbix duly tells us when GC starts taking suspiciously long and it would be a good idea to look at the application.
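
Besides JMX, a couple of plain HTTP checks also go a long way; for example (assuming the default HTTP port 9200 on the same local Elasticsearch the configs below point at, and an illustrative daily index name):

    # overall cluster/instance health
    curl 'http://127.0.0.1:9200/_cluster/health?pretty=true'
    # document count and on-disk size of one daily logstash index
    curl 'http://127.0.0.1:9200/logstash-2014.03.14/_stats?pretty=true'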

Interface


The Kibana web interface in its original form turned out to be, to put it mildly, difficult even for programmers, let alone people far from IT (our analysts also look into the logs now and then). We took the original dashboard as a basis, adjusted it to our needs and made it the default dashboard in Kibana. Now it looks like this:

[screenshot: our customized default Kibana dashboard]

In particularly tricky cases you have to additionally tweak filters and multi-search, but this happens rarely and does not cause major problems.

Short summary


1. Lumberjack is great in the role of a log-shipping agent.
2. Logstash parses stack traces and completely different kinds of logs just fine, and flexibility is not a problem, but you do need to adapt it to your requirements.
3. Kibana, again, needs to be adapted to its consumers. Sometimes even a simple renaming of fields makes the work easier.

Of course, there are cases this scheme does not cover, for example when you just need to look at a specific run of an application and see what happened during it; the web interface is probably not much help there, although that is, of course, a matter of taste.

Result


The log collection system described here took part of the load off the admins and noticeably reduced the amount of time and nerves they spend on it. It is also much more convenient for the programmers: now, when an error occurs, we simply give the developers a link to the logs in the web interface instead of an unwieldy pile of log files. The system currently processes more than a million messages per day from more than 300 application instances without creating any significant load. This is not the limit of our needs; most likely the figure will at least double, and we are confident the system will cope with that load.

Update


Logstash listener
    input {
      lumberjack {
        charset => "CP1252"
        port => 9999
        ssl_certificate => "/etc/ssl/changeme.crt"
        ssl_key => "/etc/ssl/changeme.key"
      }
    }
    filter {
      multiline {
        pattern => "(^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})|(^\d+\))|(^{\")"
        negate => true
        what => "previous"
      }
    }
    output {
      file {
        path => "/logs/raw/%{host}/%{+YYYY-MM-dd}.log"
      }
      rabbitmq {
        durable => true
        exchange => "logstash-exchange"
        exchange_type => "direct"
        host => "127.0.0.1"
        key => "logstash-routing-key"
        password => "changeme"
        persistent => true
        user => "logstash"
        vhost => "logstash"
      }
    }


Logstash filter
    input {
      rabbitmq {
        auto_delete => false
        charset => "CP1252"
        durable => true
        exchange => "logstash-exchange"
        exclusive => false
        host => "127.0.0.1"
        key => "logstash-routing-key"
        password => "changeme"
        prefetch_count => 4
        queue => "logstash-indexer"
        type => "logstash-indexer-input"
        user => "logstash"
        vhost => "logstash"
      }
    }
    filter {
      #ugly hack
      date { match => ["@timestamp", "ISO8601"] }
      mutate { gsub => ['host', '.example.com', ''] }
      #just in case 2 types of applications
      if [type] == "custom" {
        grok { match => ["message", "(?m)(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?) \[%{LOGLEVEL:level}\s?\] - (?<mxf_raw_message>.*)"] }
        grok { match => ["message", "(?m)(?<mxf_datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?) "] }
        mutate { replace => ["mxf_datetime", "%{date}T%{time}Z" ] }
        date { match => ["mxf_datetime", "ISO8601"] target => "mxf_timestamp" }
        mutate { remove_field => ["mxf_datetime"] }
        grep { match => [ "message", "\|" ] add_tag => "have_id" drop => false }
        if "have_id" in [tags] {
          grok { match => ["message", "\[@%{HOSTNAME:mxf_host} %{GREEDYDATA:mxf_thread} (?<mxf_class>[az]{1}[.]{1}[az]{1}[.]{1}[a-zA-Z]{1}.*):%{NUMBER:mxf_classline}\|(?<mxf_id>.*):%{NUMBER:mxf_linenum}\]$"] }
          #ugly hack for correct mapping
          mutate { convert => [ "mxf_linenum", "integer" ] }
        } else {
          grok { match => ["message", "\[@%{HOSTNAME:mxf_host} %{GREEDYDATA:mxf_thread} (?<mxf_class>[az]{1}[.]{1}[az]{1}[.]{1}[a-zA-Z]{1}.*):%{NUMBER:mxf_classline}\]$"] }
        }
        grok { match => ["message", "(?m).*\] - (?<mxf_message>.*)\[@.*"] }
        grok { match => ["file", "\/opt\/apps\/%{GREEDYDATA:mxf_version}\/logs\/%{GREEDYDATA:mxf_account}\/%{WORD:mxf_logfilename}.log$"] }
        mutate { gsub => ["mxf_aiid","\|",""] }
      }
      if [type] == "nodejs" {
        json { source => "message" }
        mutate { rename => ["msg","mxf_message"] }
        mutate { rename => ["args.0.code","mxf_response_code"] }
        mutate { rename => ["data.service","mxf_service"] }
        mutate { rename => ["data.method","mxf_method"] }
        mutate { rename => ["t","mxf_request_time"] }
        mutate { rename => ["pid","mxf_pid"] }
        mutate { rename => ["err.name","mxf_errname"] }
        mutate { rename => ["err.stack","mxf_stacktrace"] }
        mutate { rename => ["err.message", "mxf_errmsg"] }
        mutate { remove_field => ["0", "1", "2", "args", "http", "hostname", "msg", "v", "data", "err", "t", "pid", "io"] }
        if [level] == 60 { mutate { replace => ["level","FATAL"] } }
        if [level] == 50 { mutate { replace => ["level","ERROR"] } }
        if [level] == 40 { mutate { replace => ["level","WARN"] } }
        if [level] == 30 { mutate { replace => ["level","INFO"] } }
        if [level] == 20 { mutate { replace => ["level","DEBUG"] } }
        if [level] == 10 { mutate { replace => ["level","TRACE"] } }
      }
    }
    output {
      elasticsearch_http { host => "127.0.0.1" }
    }
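
Both configs are run as ordinary Logstash agents; a typical invocation for Logstash of that era looks like this, with -w controlling the number of filter worker threads (the version number, config path and worker count here are illustrative):

    java -jar logstash-1.2.2-flatjar.jar agent -f /etc/logstash/indexer.conf -w 4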


Kibana config

Source: https://habr.com/ru/post/216201/

