
Prometheus: practical use

One of the most important tasks in developing applications with a microservice architecture is monitoring. Tracking the state of services and servers lets you not only respond to failures in time, but also analyze how they behave. The value of this information is hard to overstate: it gives you extra opportunities to improve the performance and quality of your software.


Fortunately, there are many solutions to the monitoring problem, both paid and free. I want to share my experience of using the open source monitoring system Prometheus in practice.

Meet Prometheus


I first seriously encountered microservice architecture fairly recently, when I started working with an existing infrastructure of services. Zabbix was used to monitor server status and collect metrics. Zabbix configuration was handled by administrators, and all changes and additions to graphs had to go through them. This approach cleanly divides areas of responsibility, especially when different development teams share the same monitoring system. But it also leads to delays and inaccuracies when new graphs are added: the administrator may miss details that are obvious to the developer who filed the task, or may simply be busy with other work. The problem becomes especially acute when further development requires fast access to historical metric data. To solve it, I decided to look for a simple, easy-to-configure monitoring system that would fit well onto the existing infrastructure and simplify the analysis of metrics.

The minimal configuration of Prometheus consists of the Prometheus server and the application being monitored; you only need to tell the server at which address to request metrics. Collection follows an HTTP pull model, but there is also a push gateway component for tracking short-lived jobs. With the pull model, your applications are unaware of the monitoring system, which means you can run several Prometheus servers in parallel, insuring yourself against possible data loss. To prepare applications for monitoring, the project offers client libraries for various languages that implement the tools needed to create and expose metrics. Using them is recommended, but you can also roll your own implementation as long as it conforms to the exposition format specification.
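
For illustration, a minimal scrape configuration might look like the sketch below (the job name, host, and port are placeholders, not taken from the article):

  # prometheus.yml - a minimal sketch; the target address is an assumption
  scrape_configs:
    - job_name: "my-service"
      scrape_interval: 15s                  # how often to pull metrics
      static_configs:
        - targets: ["service-host:8080"]    # metrics are served at /metrics by default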
This approach fully met my requirements, since metrics for Zabbix were already collected in a similar way. To stay compatible with Zabbix and with the metrics library already in use, we wrote an adapter that converts the existing format into one suitable for Prometheus. This let us experiment with Prometheus painlessly, without touching the Zabbix monitoring and without risking the stability of the existing services.
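
The target of such an adapter is the Prometheus text exposition format. As a rough illustration (the metric and its labels are invented), the adapter's output should look like this:

  # HELP http_requests_total Total number of HTTP requests received.
  # TYPE http_requests_total counter
  http_requests_total{method="get",code="200"} 1027
  http_requests_total{method="post",code="200"} 3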

Usage


The advantages of Prometheus became apparent right away. The ability to see how a metric changes over any period, compare it with others, transform it, and view it as text or as a graph, all without leaving the main page of the web interface, is hard to overestimate.

Metrics are filtered and transformed with a very convenient query language. Unfortunately, there is no way to save a finished query directly through the web interface. To do that, you need to create a console: an HTML page that can use Go templates, in which you build your graphs, tables, summaries, and so on.
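
For example, a typical query computes the per-second request rate over the last five minutes (the metric name is the same illustrative one used below):

  rate(http_requests_total{job="my-service"}[5m])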

Creating a generic console is a good approach. Say you need to monitor several HTTP servers, each of which exposes a metric such as "http_requests_total" with the number of HTTP requests received. We can create a console that displays a list of such services as links to a console with more detailed information:

  {{ range query "sum(http_requests_total) by (instance)" | sortByLabel "instance" }}
    <a href="/consoles/job-overview.html?instance={{ .Labels.instance }}">{{ .Labels.job }}</a>
  {{ end }}

As a result, we get a list of links to the "job-overview.html" console with the "instance" parameter passed to it. This parameter can now be used as a filter when plotting graphs:

  new PromConsole.Graph({
    node: document.querySelector("#successGraph"),
    expr: ["sum(http_requests_success{instance='{{ .Params.instance }}'})"],
    name: "http_requests_success",
  })

The standard set of consoles is a good source of additional examples.

Performance


According to the developers, a single Prometheus server can easily serve millions of time series, which is enough to collect data from thousands of servers at a 10-second interval. When this is not enough, scaling options are provided.

In practice, the declared performance is confirmed. When monitoring 800 services with about 80 metrics each, Prometheus uses roughly 6% of one core and 3 GB of RAM, and 15 days of collected metrics take up 17 GB of disk space. You do not need a dedicated server for Prometheus; with such low resource consumption it can be installed next to other services without getting in their way.

Prometheus keeps collected metrics in RAM and saves them to disk when a size limit is reached or after a certain time interval. If you collect a large amount of data (more than 100k time series, for example), the disk load may grow. The Prometheus documentation provides useful optimization tips for such cases.
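
For instance, in the 1.x series the local storage was tuned with command-line flags along these lines (flag names as I recall them for that series; check your version's --help before relying on them):

  # Sketch: Prometheus 1.x local-storage tuning; values are arbitrary examples
  prometheus \
    -storage.local.memory-chunks=2097152 \
    -storage.local.retention=360h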

There are cases when you need a graph that requires heavy computation, spans long time ranges, or is viewed frequently. To take the load off Prometheus and avoid recomputing such data on every request, there is a precomputation mechanism (recording rules). It is especially useful when building dashboards.
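
As a sketch, a recording rule in the 1.x rule-file syntax could precompute the request rate per job (the names are illustrative):

  # Evaluated periodically; the result is stored as a new time series
  job:http_requests_total:rate5m = sum(rate(http_requests_total[5m])) by (job)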

Customization


Frankly, the web interface seemed a little raw to me. While working with graphs, metrics, and the targets section (/targets), I kept needing extra functionality. That was not a problem: I soon added quick search over metric labels and changed the layout of the console section. When the /targets section grew to 800 services, we had to hide the endpoints on the page and combine them into groups (otherwise they took up too much space). When one of the endpoints stops working normally, an error icon is added to its group. Expanding the list lets you locate the problem endpoint and see detailed information about the error:


The simplicity of the Prometheus web interface leaves wide room for creating themes, modifying the standard pages, or adding new sections. As confirmation of this, take a look at the graphical configuration generator.

Additional software


The Prometheus development team builds a project that is as open to integration as possible and can be used together with other technologies, and the community does its best to help. On the Prometheus website you can find a list of supported exporters: packages that collect metrics from other services and technologies. We use only a few: node_exporter, postgres_exporter, and grok_exporter. For them we built generic consoles whose graphs are drawn relative to the service being viewed. All newly added or discovered (service discovery) services automatically become available in the previously created consoles. If you run a whole "zoo" of technologies, you will have to install many exporters to monitor them all, but there is a logical rationale behind this approach.
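
For example, a generic CPU graph in such a console can be driven by a query over node_exporter metrics roughly like this (written against the old node_cpu metric name; newer exporter versions rename it):

  # Per-instance CPU usage in percent, derived from node_exporter counters
  100 - avg(rate(node_cpu{mode="idle"}[5m])) by (instance) * 100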

The security situation may seem strange at first. Out of the box, the Prometheus server and the installed exporters are open to the world: anyone who knows the right address and port can read your metrics. The developers' position here is that Prometheus is a monitoring system, not a security system. Users can easily put third-party products in front of it that have long solved this problem, which lets the Prometheus developers focus on monitoring. In my case, Nginx serves as a reverse proxy with HTTP basic auth, and configuring it took very little time.
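
A sketch of such a configuration (the server name, port, and file paths are placeholders):

  # Nginx as a reverse proxy with basic auth in front of Prometheus
  server {
      listen 80;
      server_name prometheus.example.com;

      location / {
          auth_basic           "Prometheus";
          auth_basic_user_file /etc/nginx/.htpasswd;
          proxy_pass           http://127.0.0.1:9090;
      }
  }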

For those who want to deprive themselves of sleep and rest, the developers offer AlertManager. It is extremely easy to configure and works with the query language already described. AlertManager integrates with HipChat, Slack, PagerDuty, and if you need a service that is not yet supported, I recommend the article on integrating audio alerts.

As a practical example, here is a rule from my current alerting configuration:

  ALERT nodeexporter_available_disk_space
    IF (100 - node_filesystem_free{job=~".*nodeexporter"} / node_filesystem_size{job=~".*nodeexporter"} * 100) > 70
    ANNOTATIONS {
      description = "Used disk space: {{ $value }}%. Device: {{ $labels.device }}.",
      summary = "Running out of disk space at instance {{ $labels.instance }}"
    }

This rule fires when disk usage exceeds 70% on any of the servers where node_exporter is running, after which the corresponding notification is sent by email.
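
On the AlertManager side, email delivery is configured with a receiver roughly like this (all addresses and the SMTP host are placeholders):

  # AlertManager config sketch: route everything to one email receiver
  route:
    receiver: team-email
  receivers:
    - name: team-email
      email_configs:
        - to: "ops@example.com"
          from: "alertmanager@example.com"
          smarthost: "smtp.example.com:587"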

Conclusion


I recommend getting to know Prometheus better. For me it has become a light, simple, and understandable monitoring system that is simply pleasant to work with, extend, and adapt to your needs. If you have questions about its practical use, I will be glad to answer them in the comments.

Useful links


Source: https://habr.com/ru/post/308610/

