
Monitoring Docker Hosts, Containers, and Container Services

I was looking for a self-hosted, open source monitoring solution that could provide metrics storage, visualization and alerting for physical servers, virtual machines, containers, and the services running inside containers. After testing Elastic Beats, Graphite and Prometheus, I settled on Prometheus. What attracted me first was its support for multidimensional metrics and its easy-to-learn query language. Being able to use the same language for graphing and alerting greatly simplifies monitoring. Prometheus supports both black-box and white-box monitoring, which means that you can probe your infrastructure from the outside as well as monitor the internal state of your applications.



Why I chose Prometheus



The Prometheus ecosystem is huge. You can find metrics exporters for a wide range of systems, from databases, message queues and HTTP servers to hardware-related systems such as IoT devices or IPMI. White-box monitoring is also well covered: there are Prometheus client libraries for Go, Java, Python, Ruby, .NET, PHP, and other programming languages.


Getting started with Prometheus and Docker


If you want to try out the Prometheus stack, take a look at the dockprom repository on GitHub. You can use dockprom as a starting point for your own monitoring solution. It lets you run the whole stack (Prometheus, Grafana, cAdvisor, NodeExporter and AlertManager) with a single command.



Installation


Clone the dockprom repository on your Docker host, change into the dockprom directory and run compose up:


$ git clone https://github.com/stefanprodan/dockprom
$ cd dockprom
$ docker-compose up -d

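Containers:

Prometheus (metrics database): http://<host-ip>:9090
AlertManager (alert management): http://<host-ip>:9093
Grafana (metrics visualization): http://<host-ip>:3000
NodeExporter (host metrics collector)
cAdvisor (container metrics collector)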



While Grafana supports authentication, the Prometheus and AlertManager services do not. If you want basic authentication for Prometheus and AlertManager, you can remove their port mappings from the docker-compose file and put NGINX in front of them as a reverse proxy that handles the authentication.


Set up Grafana


Go to http://<host-ip>:3000 and log in with the username admin and the password changeme. You can change the password in the Grafana UI or by editing the user.config file.


From the Grafana menu, select “Data Sources” and click “Add Data Source”. To add the Prometheus container as a data source, use the following values:



Now you can import the dashboard templates from the grafana directory. From the Grafana menu, select “Dashboards” and click “Import”.


Docker Host Dashboard


The Docker Host Dashboard shows key metrics for monitoring your server's resource usage.



Docker Container Dashboard


The Docker Container Dashboard shows key metrics for monitoring running containers.



This dashboard does not show the containers that are part of the monitoring stack.


Monitoring Services Dashboard


The Monitoring Services Dashboard shows key metrics for monitoring the containers that make up the monitoring stack.



You can control Prometheus memory usage by adjusting the number of local storage memory chunks. The maximum chunks value can be changed in docker-compose.yml. I set storage.local.memory-chunks to 100000; if you monitor 10 containers, Prometheus will use about 2 GB of RAM.


Defining alerts


I set up three alert configuration files, covering the monitoring services, the Docker host, and the Docker containers.



You can change the alert rules and reload them with an HTTP POST request:


 curl -X POST http://<host-ip>:9090/-/reload 
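If you would rather trigger the reload from a script, the same request can be sketched in Python with only the standard library. This is a minimal sketch that assumes Prometheus listens on port 9090 as above; the function names are my own and not part of dockprom.

```python
import urllib.request


def build_reload_request(host: str, port: int = 9090) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    url = f"http://{host}:{port}/-/reload"
    # An empty body with method="POST" mirrors `curl -X POST`.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_rules(host: str, port: int = 9090, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling reload_rules("<host-ip>") performs the actual request; build_reload_request only prepares it, which makes it easy to inspect the URL first.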

Monitoring services alerts


Trigger an alert if one of the targets (node-exporter or cAdvisor) is down for more than 30 seconds:


 ALERT monitor_service_down
   IF up == 0
   FOR 30s
   LABELS { severity = "critical" }
   ANNOTATIONS {
     summary = "Monitor service non-operational",
     description = "{{ $labels.instance }} service is down.",
   }
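To make the role of the FOR clause concrete, here is a simplified model in Python of how an alert moves from pending to firing. It is only a sketch of the behaviour, not Prometheus code, and the 15-second evaluation interval in the example is an assumption.

```python
def alert_states(samples, for_seconds):
    """Return "inactive", "pending" or "firing" for each evaluation.

    samples is a time-ordered list of (timestamp_seconds, condition_is_true).
    """
    states = []
    active_since = None  # time at which the condition first became true
    for timestamp, condition in samples:
        if not condition:
            # Any evaluation where the condition is false resets the alert.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# `up == 0` evaluated every 15 s; the target is down from t=15 to t=45.
samples = [(0, False), (15, True), (30, True), (45, True), (60, False)]
print(alert_states(samples, for_seconds=30))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

The point of FOR is visible in the output: a target that is down for a single scrape only reaches the pending state and never produces a notification.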

Docker host alerts


Trigger an alert if the Docker host CPU is under high load for more than 30 seconds:


 ALERT high_cpu_load
   IF node_load1 > 1.5
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Server under high load",
     description = "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
   }

Change the load threshold according to the number of CPU cores.
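One way to derive the threshold is to scale it with the core count. The sketch below assumes a per-core factor of 0.75, which I picked only because it gives 1.5 on a two-core host; it is a hypothetical starting point, so tune it for your workload.

```python
import os


def load_threshold(cores: int, per_core: float = 0.75) -> float:
    """Scale the 1-minute load threshold with the number of CPU cores.

    per_core is an assumed factor, not a value taken from dockprom.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * per_core


# Threshold for the machine this script runs on.
print(load_threshold(os.cpu_count() or 1))
```

The resulting number replaces 1.5 in the node_load1 expression.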


Trigger an alert if the Docker host memory is almost full:


 ALERT high_memory_load
   IF (sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers + node_memory_Cached) ) / sum(node_memory_MemTotal) * 100 > 85
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Server memory is almost full",
     description = "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
   }
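The expression treats buffers and page cache as available memory. Here is the same arithmetic as a small Python function for a single host, with made-up sample values in megabytes for illustration.

```python
def memory_usage_percent(total, free, buffers, cached):
    """Used memory as a percentage, counting buffers and cache as free."""
    available = free + buffers + cached
    return (total - available) / total * 100


# Hypothetical host with 8000 MB of RAM.
print(memory_usage_percent(total=8000, free=1000, buffers=500, cached=500))  # 75.0
print(memory_usage_percent(total=8000, free=500, buffers=250, cached=250))   # 87.5
```

Only the second case crosses the 85% threshold used in the rule.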

Trigger an alert if the Docker host storage is almost full:


 ALERT hight_storage_load
   IF (node_filesystem_size{fstype="aufs"} - node_filesystem_free{fstype="aufs"}) / node_filesystem_size{fstype="aufs"} * 100 > 85
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Server storage is almost full",
     description = "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
   }

Docker container alerts


Trigger an alert if a container is down for more than 30 seconds:


 ALERT jenkins_down
   IF absent(container_memory_usage_bytes{name="jenkins"})
   FOR 30s
   LABELS { severity = "critical" }
   ANNOTATIONS {
     summary= "Jenkins down",
     description= "Jenkins container is down for more than 30 seconds."
   }

Trigger an alert if a container uses more than 10% of the total CPU cores for more than 30 seconds:


 ALERT jenkins_high_cpu
   IF sum(rate(container_cpu_usage_seconds_total{name="jenkins"}[1m])) / count(node_cpu{mode="system"}) * 100 > 10
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary= "Jenkins high CPU usage",
     description= "Jenkins CPU usage is {{ humanize $value}}%."
   }
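The CPU expression is easier to follow with numbers. The sketch below approximates rate() as a plain difference over the window, ignoring the extrapolation and counter-reset handling that Prometheus performs, and uses invented counter samples.

```python
def counter_rate(first, last, window_seconds):
    """Per-second increase of a counter between two samples."""
    return (last - first) / window_seconds


def container_cpu_percent(per_cpu_samples, window_seconds, host_cores):
    """Share of all host cores used by a container, in percent.

    per_cpu_samples holds one (first, last) pair of
    container_cpu_usage_seconds_total values per series.
    """
    total_rate = sum(
        counter_rate(first, last, window_seconds)
        for first, last in per_cpu_samples
    )
    return total_rate / host_cores * 100


# Two series that each gained 15 CPU-seconds in a 60 s window on a 4-core host.
samples = [(100.0, 115.0), (200.0, 215.0)]
print(container_cpu_percent(samples, window_seconds=60, host_cores=4))  # 12.5
```

At 12.5% this container would cross the 10% threshold, even though it keeps only half of one core busy on average.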

Trigger an alert if a container uses more than 1.2 GB of RAM for more than 30 seconds:


 ALERT jenkins_high_memory
   IF sum(container_memory_usage_bytes{name="jenkins"}) > 1200000000
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Jenkins high memory usage",
     description = "Jenkins memory consumption is at {{ humanize $value}}.",
   }

Configuring alerting


The AlertManager service handles the alerts sent by the Prometheus server. AlertManager can send notifications via email, Pushover, Slack, HipChat, or any other system that exposes a webhook interface.


You can view and silence alerts at http://<host-ip>:9093.


The notification receivers are configured in the alertmanager/config.yml file.


To receive alerts via Slack, you need to create a custom integration by choosing "Incoming WebHooks" on your Slack team's apps page.


Copy the Slack Webhook URL into the api_url field and define the Slack channel.


 route:
   receiver: 'slack'

 receivers:
   - name: 'slack'
     slack_configs:
       - send_resolved: true
         text: "{{ .CommonAnnotations.description }}"
         username: 'Prometheus'
         channel: '#<channel>'
         api_url: 'https://hooks.slack.com/services/<webhook-id>'
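Before wiring the URL into AlertManager, it can help to check that the webhook itself works. The following sketch posts a plain text message to a Slack incoming webhook using the standard library; the function names are my own, and the URL is the same placeholder as in the config above.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Prepare a JSON POST carrying a simple text message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url: str, text: str = "Test alert") -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=5.0) as response:
        return response.status
```

If send_test_message returns 200 and the message shows up in the channel, the same URL should work as api_url.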

Extending the monitoring system


The dockprom Grafana dashboards can easily be extended to cover more than one Docker host. To monitor more hosts, deploy a node-exporter and a cAdvisor container on each host and point the Prometheus server at them so that it scrapes their metrics.
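As an illustration of the scrape targets this produces, the sketch below builds target groups in the JSON shape that Prometheus file-based service discovery reads. It assumes the default ports (9100 for node-exporter, 8080 for cAdvisor); the host names and the "exporter" label are hypothetical, and dockprom itself may list its targets differently.

```python
import json

NODE_EXPORTER_PORT = 9100  # node-exporter default, assumed unchanged
CADVISOR_PORT = 8080       # cAdvisor default, assumed unchanged


def build_target_groups(hosts):
    """Return one target group per exporter type for the given hosts."""
    return [
        {
            "targets": [f"{host}:{NODE_EXPORTER_PORT}" for host in hosts],
            "labels": {"exporter": "node"},
        },
        {
            "targets": [f"{host}:{CADVISOR_PORT}" for host in hosts],
            "labels": {"exporter": "cadvisor"},
        },
    ]


# Hypothetical host names.
print(json.dumps(build_target_groups(["docker-01", "docker-02"]), indent=2))
```

Adding a host then means adding one name to the list and regenerating the file, instead of editing every scrape job by hand.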


You should run one Prometheus stack per data center or zone and use the federation feature to aggregate all metrics in a dedicated Prometheus instance that serves as an overview of your whole infrastructure. This way, if a zone goes down, or the Prometheus instance that aggregates the zones goes down, your monitoring system will still be available in the remaining zones.
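Federation works by having the aggregating instance scrape the /federate endpoint of each zone's Prometheus, with match[] parameters selecting the series to pull. The sketch below only builds such a URL; the host name and the job selector are made-up examples.

```python
from urllib.parse import urlencode


def federate_url(zone_prometheus: str, matchers) -> str:
    """Build a /federate URL that selects series by the given matchers.

    zone_prometheus is a host:port string; each matcher is a series
    selector such as '{job="cadvisor"}'.
    """
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"http://{zone_prometheus}/federate?{query}"


# Hypothetical zone instance and selector.
print(federate_url("prom-zone-a:9090", ['{job="cadvisor"}']))
# http://prom-zone-a:9090/federate?match%5B%5D=%7Bjob%3D%22cadvisor%22%7D
```

Opening such a URL in a browser is a quick way to check which series a zone would hand over before you add it as a scrape job.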


You can also make Prometheus more resilient by running two identical servers in each zone. Having several servers push alerts to the same Alertmanager will not result in duplicate notifications, since Alertmanager deduplicates them.
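The idea behind deduplication can be sketched as grouping alerts by their label set. This is a conceptual toy and not how Alertmanager is implemented, but it shows why the two servers have to be configured identically: if their alerts carry different labels, they count as different alerts.

```python
def deduplicate(alerts):
    """Keep the first alert seen for each distinct label set."""
    unique = {}
    for alert in alerts:
        # The full label set identifies an alert; the sender does not matter.
        key = frozenset(alert["labels"].items())
        unique.setdefault(key, alert)
    return list(unique.values())


labels = {"alertname": "high_cpu_load", "instance": "docker-01:9100"}
from_a = {"labels": dict(labels), "sender": "prometheus-a"}
from_b = {"labels": dict(labels), "sender": "prometheus-b"}

print(len(deduplicate([from_a, from_b])))  # 1
```

With identical labels the pair collapses into one alert, so only one notification goes out.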



Source: https://habr.com/ru/post/314212/

