Sooner or later, every reasonably large production environment faces the question of centralized log collection and viewing. There is currently a huge selection of open-source, paid, online and on-premises solutions. I will try to outline the selection process in our particular case.
This is a review article: it covers the main features of Graylog2, why we chose it and how we operate it.
A little about our infrastructure (you can read a bit more
here): we use 7 data centers around the world, about 500 production servers and another couple of hundred assorted applications from which we would like to collect logs. All of this runs under both Linux and Windows, with heterogeneous services on top. Each service has its own log format, and there is also Java with its multi-line stack traces.
Our requirements and what we wanted to get in the end
From the programmers and everyone else interested, the requirements for logs were simple:
- the log-shipping agent should not put a heavy load on the system;
- the ability to add custom fields at an arbitrary point in time, on arbitrary servers;
- search, sort and so on;
- the ability to send logs via POST requests or something similar (for sending logs, for example, from mobile devices).
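The last requirement — sending logs via plain POST requests with custom fields — is exactly what Graylog's GELF HTTP input covers. Below is a minimal sketch of building and sending a GELF 1.1 message; the input URL and port 12201 are assumptions (12201 is the common GELF default, but the input must be configured in Graylog first):

```python
import json
import urllib.request


def build_gelf_message(host, short_message, **custom_fields):
    """Build a GELF 1.1 payload dict.

    GELF requires additional (custom) fields to be prefixed with an
    underscore, which is how arbitrary fields get added to a message.
    """
    msg = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "level": 6,  # syslog severity: informational
    }
    for key, value in custom_fields.items():
        msg["_" + key] = value
    return msg


def send_gelf_http(msg, url="http://graylog.example.com:12201/gelf"):
    """POST the payload to a Graylog GELF HTTP input (hypothetical URL)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(msg).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

A mobile client, for instance, could call `build_gelf_message("android-app", "login failed", user_id="42")` and POST the result — the `_user_id` field then becomes searchable in Graylog like any other field.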
Nothing complicated here; all quite standard. But in our case, the following service requirements also had to be satisfied:
- OpenLDAP authentication (in our case, FreeIPA);
- convenient access-rights management;
- convenient configuration of clients (preferably from one place);
- possibility of automated installation of agents on all used systems;
- convenient monitoring of both service health and the necessary metrics;
- availability of community and documentation, or commercial support;
- simple scaling.
This set of requirements was critical: the new service had to fit fully into the existing infrastructure, taking into account the peculiarities of our automation and rights management, and not turn into an odd one out that would eat up too much of our time. The service had to be convenient for end users and meet their requirements.
At this stage, we realized that we were left with only a few commercial solutions and, from open source, Graylog2.
How to count the number of logs and load
In short, there is no reliable way. So here I will just point out the main approaches and nuances that helped us.
To start, we looked at the number of logs on a focus group of servers and measured the dynamics of change over 2 weeks. We then multiplied the resulting number by the total number of servers. The average volume came out to about 1 TB of logs per day, and we needed to store these logs for 2 weeks to 3 months.
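From those two figures (1 TB/day, 14–90 days of retention) a rough storage estimate follows; the replica count and the 20% indexing overhead below are illustrative assumptions, not measured values:

```python
def storage_needed_tb(daily_tb, retention_days, replicas=1, overhead=1.2):
    """Back-of-envelope Elasticsearch storage estimate.

    raw daily volume x retention x copies (primary + replicas)
    x indexing overhead (assumed ~20%).
    """
    return daily_tb * retention_days * (1 + replicas) * overhead


# 1 TB/day kept for 14..90 days, with one replica shard per index
low = storage_needed_tb(1, 14)   # lower bound, 2 weeks of retention
high = storage_needed_tb(1, 90)  # upper bound, 3 months of retention
```

Even the short-retention case lands in the tens of terabytes, which is why the choice between commercial per-GB pricing and our own infrastructure mattered so much.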
At this stage, after pricing out the commercial solutions against our own infrastructure, we decided on Graylog2. Reasoning that the best way to measure the real load was to route part of the traffic to a test server, we deployed a single Graylog2 node and sent traffic from a certain focus group to it.
For about a week we saw a load of 10–20k messages per second, and we were generally prepared to use these numbers when deploying the production cluster. But at some point something broke on the servers, the number of logs grew almost tenfold, and from one server we saw a spike to 100k messages per second. On top of that, the stack-trace part of the Java application messages did not fit into the allowed log size. At this point we understood that logs are needed precisely for convenient work in such critical situations, and that all our previous calculations had been made only under normal conditions.
Main conclusions:
- counting logs under normal conditions gives no picture of what happens during incidents; log collection is needed precisely for the prompt resolution of those situations;
- different services and languages write messages in their own way, and this needs to be taken into account in advance;
- cluster capacity must allow processing many times more messages than the normal load.
A short description of Graylog2's functionality
The main reasons why we chose it:
- A well-proven product, with good documentation and coverage of the main problem areas.
- Ability to configure agents via the web interface through Collectors.
- Simple and functional integration with OpenLDAP, including group synchronization.
- Convenient horizontal scaling.
- A huge selection of input types for receiving logs.
- The presence of plug-ins and extensions.
- A fairly simple and convenient query and filtering language.
- Availability of dashboards and the possibility of notifications.
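The same query language is available outside the web UI via Graylog's REST API. Here is a sketch of building a relative-time search request; the endpoint path follows the Graylog 2.x API, while the base URL and the example query are assumptions for illustration:

```python
from urllib.parse import urlencode


def build_search_url(base, query, range_seconds=3600, limit=50):
    """Build a Graylog REST API relative-search URL.

    base: Graylog server base URL (hypothetical here)
    query: a string in the Graylog query language
    range_seconds: how far back to search from now
    """
    params = urlencode({
        "query": query,
        "range": range_seconds,
        "limit": limit,
    })
    return f"{base}/api/search/universal/relative?{params}"


# Example query: error-and-worse messages from one source over the last hour
url = build_search_url(
    "http://graylog.example.com:9000",
    "source:app01 AND level:<=3",
)
```

In practice such requests also need authentication (a session token or an access token passed via HTTP Basic auth), which is omitted here for brevity.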
This functionality covered almost all of our needs and greatly simplified our lives. In just a couple of months, it went from a test service with a dubious future to a fairly important business-critical unit, and it has been doing a great job for six months already.
Of course, implementing such solutions is not the most transparent process. We ran into quite a few nuances and pitfalls, both during the initial setup and in further operation. In the next article, I will describe exactly how we configured and tuned Graylog2 and its components, such as MongoDB and Elasticsearch.