
How we chose Graylog2





Sooner or later, every relatively large production environment faces the question of centralized log collection and viewing. There is currently a huge selection of open-source, paid, hosted, and on-premises solutions. I will try to outline how the selection process went in our particular case.



This is a review article: it covers the main features of Graylog2, why we chose it, and how we operate it.


A little about our infrastructure (you can read a bit more about it here): we run 7 data centers around the world, about 500 production servers, and a couple hundred more applications of various kinds from which we would like to collect logs. All of this runs on both Linux and Windows, with heterogeneous services on top. Every service has its own log format, and on top of that there is Java with its multi-line stack traces.



Our requirements and what we wanted to get in the end



From the programmers and other interested parties, the requirements for the logs were simple:





Nothing complicated so far; all quite standard. But in our case the new service also had to satisfy the following operational requirements:





This set of requirements was critical: the new service had to fit fully into the existing infrastructure, taking into account the peculiarities of our automation and access control, without becoming an odd one out that we would have to spend too much time on. At the same time, it had to be convenient for end users and meet their needs.



At this stage, we realized we were left with only a handful of commercial solutions and, among open-source options, Graylog2.



How to estimate the number of logs and the load



In short: you can't, not precisely. So here I will point out the main approaches and nuances that helped us with this.



We started by measuring the number of logs on a focus group of servers and tracking how it changed over 2 weeks. We then multiplied the resulting figure by the total number of servers. The average volume came out to about 1 TB of logs per day, and we needed to store these logs for anywhere from 2 weeks to 3 months.
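The estimate above is simple arithmetic: message rate times average message size times the retention window. A minimal sketch of this sizing math follows; the rate and message-size numbers in it are illustrative assumptions, not measurements from the article (which only states the ~1 TB/day total).

```python
# Back-of-envelope sizing for centralized log storage.
# The concrete rate/size numbers below are illustrative assumptions.

SECONDS_PER_DAY = 86_400

def daily_volume_tb(msgs_per_sec, avg_msg_bytes):
    """Estimated raw log volume per day, in terabytes (10^12 bytes)."""
    return msgs_per_sec * avg_msg_bytes * SECONDS_PER_DAY / 1e12

def retention_volume_tb(daily_tb, retention_days, burst_factor=1.0):
    """Storage needed for a retention window, with optional burst headroom."""
    return daily_tb * retention_days * burst_factor

# Example: ~12k msg/s at ~1 KB per message gives roughly the 1 TB/day
# figure observed on the focus group.
daily = daily_volume_tb(12_000, 1_000)
print(f"per day:  {daily:.2f} TB")
print(f"14 days:  {retention_volume_tb(daily, 14):.0f} TB")
print(f"90 days:  {retention_volume_tb(daily, 90):.0f} TB")
# A 10x burst (like the incident described below) for even a single day
# adds roughly 9 extra daily volumes on top of the plan:
print(f"90 days + one burst day: {retention_volume_tb(daily, 90) + daily * 9:.0f} TB")
```

The burst term is the important part: sizing only for the steady state, as we initially did, leaves no headroom for exactly the situations in which logs matter most.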



After weighing the cost of commercial solutions against running our own infrastructure, we decided to go with Graylog2. Reasoning that the best way to measure the real load was to route part of the traffic to a test server, we deployed a single Graylog2 node and sent traffic from a certain focus group to it.



For about a week we saw a load of 10–20k messages per second and were generally prepared to use these numbers when sizing the production cluster. But at some point something broke on the servers, the number of logs grew almost tenfold, and on one server we saw a spike to 100k messages per second. At the same time, the stack-trace portion of the Java application logs no longer fit into the allowed message size. That was when we realized that logs are needed precisely for convenient work in such critical situations, and that all our previous calculations had been made under normal conditions only.
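The message-size problem above is worth a closer look. Graylog2's native ingestion format is GELF, which over UDP is a compressed JSON document with a few required fields, including an optional `full_message` field intended for long payloads such as multi-line Java stack traces. Below is a sketch of building and sending such a message; the server name in the usage comment is a placeholder, and the helper names are my own, not part of any Graylog library.

```python
import json
import socket
import time
import zlib

def gelf_payload(short_message, host, full_message=None, level=6, **extra):
    """Build a zlib-compressed GELF 1.1 message.

    Custom fields must be prefixed with an underscore per the GELF spec;
    this helper adds the prefix automatically.
    """
    msg = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity: 6 = informational, 3 = error
    }
    if full_message is not None:
        # Long payloads (e.g. a multi-line Java stack trace) go here,
        # so they don't get truncated out of short_message.
        msg["full_message"] = full_message
    for key, value in extra.items():
        msg["_" + key] = value
    return zlib.compress(json.dumps(msg).encode("utf-8"))

def send_gelf_udp(payload, server, port=12201):
    """Send one GELF datagram (payloads over ~8 KB need GELF chunking)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (server, port))

# Hypothetical usage -- "graylog.example.com" is a placeholder:
# send_gelf_udp(
#     gelf_payload("NPE in order processing", host="app-01", level=3,
#                  full_message="java.lang.NullPointerException\n  at ..."),
#     "graylog.example.com")
```

A single UDP datagram has a practical size limit, which is why GELF defines a chunked mode for oversized messages; during the incident described above, it was exactly these oversized stack traces that exposed our size assumptions.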



Main conclusions:





A brief overview of Graylog2's functionality



The main reasons we chose it:





This functionality covered almost all of our needs and greatly simplified our lives. In just a couple of months it went from a test service with a dubious future to a rather important business-critical component, and it has been doing a great job for six months now.



Of course, implementing such a solution is not the most transparent process. We ran into quite a few nuances and pitfalls, both during the initial setup and during subsequent operation. In the next article, I will describe exactly how we configured and tuned Graylog2 and its components, MongoDB and Elasticsearch.

Source: https://habr.com/ru/post/340168/
