Monitoring Docker Hosts, Containers, and Container Services

I was looking for a self-hosted, open source monitoring solution that could provide metrics storage, visualization and alerting for physical servers, virtual machines, containers, and the services running inside containers. After trying out Elastic Beats, Graphite and Prometheus, I settled on Prometheus. What attracted me first was its support for multidimensional metrics and its easy-to-learn query language. Being able to use the same language for graphing and alerting greatly simplifies monitoring. Prometheus supports both black-box and white-box monitoring, which means you can probe your infrastructure from the outside as well as monitor the internal state of your applications.

Why I chose Prometheus

The entire stack can be deployed using containers.
It is designed for distributed systems and infrastructure.
Scalable data collection with no dependency on distributed storage.
Flexible service discovery (built-in support for Kubernetes, Consul, EC2, Azure).
Dedicated exporters for services such as HAProxy, MySQL, PostgreSQL, Memcached, Redis, etc.

The Prometheus ecosystem is huge. You can find metric exporters for a wide variety of systems, from databases, message queues and HTTP servers to hardware-related systems such as IoT devices or IPMI. White-box monitoring is also well covered: there are Prometheus client libraries for Go, Java, Python, Ruby, .NET, PHP, and other programming languages.

Getting started with Prometheus and Docker

If you want to try out the Prometheus stack, take a look at the dockprom repository on GitHub. You can use dockprom as a starting point for your own monitoring solution. It lets you bring up the whole stack with one command: Prometheus, Grafana, cAdvisor, NodeExporter and AlertManager.

Installation

Clone the dockprom repository on your Docker host, change into the dockprom directory and run compose up:

$ git clone https://github.com/stefanprodan/dockprom
$ cd dockprom
$ docker-compose up -d

Containers:

Prometheus (metric database) http://<host-ip>:9090
AlertManager (notification management) http://<host-ip>:9093
Grafana (visualization of metrics) http://<host-ip>:3000
NodeExporter (host metrics collector)
cAdvisor (container metrics collector)

While Grafana supports authentication, the Prometheus and AlertManager services do not. To put basic authentication in front of Prometheus and AlertManager, you can remove their port mappings from the docker-compose file and use NGINX as a reverse proxy.

Setting up Grafana

Go to http://<host-ip>:3000 and log in with the username admin and the password changeme. You can change the password from the Grafana UI or by editing the user.config file.

From the Grafana menu, select “Data Sources” and click “Add Data Source”. To add the Prometheus container as a data source, use the following values:

Name: Prometheus
Type: Prometheus
Url: http://prometheus:9090
Access: proxy
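
The same data source can also be registered through the Grafana HTTP API (POST /api/datasources) instead of the UI. The following is a minimal sketch under stated assumptions: the helper name build_datasource_request is hypothetical, Grafana is assumed to be reachable on localhost:3000 with the default credentials mentioned above, and the request is only built, not sent.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password):
    """Build (but do not send) a POST request that registers Prometheus in Grafana."""
    # Same values as the UI form above.
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus:9090",
        "access": "proxy",
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Assumed address and default credentials; adjust to your setup.
    req = build_datasource_request("http://localhost:3000", "admin", "changeme")
    print(req.get_method(), req.full_url)
    # To actually create the data source, send it:
    # urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the payload before touching a running Grafana instance.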

Now you can import the dashboard templates from the grafana directory. From the Grafana menu, select “Dashboards” and click “Import”.

Docker Host Dashboard

The Docker Host Dashboard displays key metrics for monitoring the resource usage of your server.

Server uptime, percentage of idle CPU, number of CPU cores, available memory, swap, and storage.
System load average graph, graph of running and IO-blocked processes, interrupts graph.
Graph of CPU usage by mode (guest, idle, iowait, irq, nice, softirq, steal, system, user).
Graph of memory usage by distribution (used, free, buffers, cached).
Graph of IO usage (read Bps, write Bps and IO time).
Graph of network usage by device (inbound Bps, outbound Bps).
Swap usage and activity graphs.

Docker Containers Dashboard

The Docker Containers Dashboard displays key metrics for monitoring running containers.

Total container CPU load, memory usage and storage usage.
Running containers graph, system load graph, IO usage graph.
Container CPU usage graph.
Container memory usage graph.
Container cached memory usage graph.
Container inbound network usage graph.
Container outbound network usage graph.

This dashboard does not show the containers that are part of the monitoring stack.

Monitor Services Dashboard

The Monitor Services Dashboard displays key metrics for monitoring the containers that make up the monitoring stack.

Prometheus container uptime, total memory usage of the monitoring stack, memory chunks and series of the Prometheus local storage.
Container CPU usage graph.
Container memory usage graph.
Graphs of Prometheus chunks to persist and persistence urgency.
Graphs of Prometheus chunk operations and checkpoint duration.
Graphs of the Prometheus sample ingestion rate, target scrapes and scrape duration.
Prometheus HTTP requests graph.
Prometheus alerts graph.

You can control Prometheus memory usage by adjusting the number of local storage memory chunks. The maximum chunk count can be changed in docker-compose.yml. I set storage.local.memory-chunks to 100000; if you monitor 10 containers, Prometheus will use around 2 GB of RAM.

Defining alerts

I set up three alert configuration files:

Monitoring services alerts: targets.rules
Docker host alerts: hosts.rules
Docker container alerts: containers.rules

You can modify the alert rules and reload them with an HTTP POST request to Prometheus:

 curl -X POST http://<host-ip>:9090/-/reload

Monitoring services alerts

Trigger an alert if any of the monitoring targets (node-exporter and cAdvisor) is down for more than 30 seconds:

 ALERT monitor_service_down
   IF up == 0
   FOR 30s
   LABELS { severity = "critical" }
   ANNOTATIONS {
     summary = "Monitor service non-operational",
     description = "{{ $labels.instance }} service is down.",
   }
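
Before relying on the alert, the same up == 0 expression can be checked by hand through the Prometheus HTTP API (/api/v1/query). The sketch below is only an illustration: the helper names build_query_url and down_instances are hypothetical, the sample response is hand-written in the shape of an instant-vector result rather than captured from a real server, and no request is sent.

```python
import urllib.parse


def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def down_instances(response):
    """Extract (job, instance) pairs from an instant-vector query response."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    return [
        (item["metric"].get("job", ""), item["metric"].get("instance", ""))
        for item in response["data"]["result"]
    ]


# Hand-written sample in the shape returned by /api/v1/query (not real data).
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "job": "cadvisor", "instance": "cadvisor:8080"},
                "value": [1480000000.0, "0"],
            }
        ],
    },
}

if __name__ == "__main__":
    print(build_query_url("http://localhost:9090", "up == 0"))
    print(down_instances(SAMPLE))
```

An empty result list means every target answered its last scrape.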

Docker host alerts

Trigger an alert if the Docker host CPU is under high load for more than 30 seconds:

 ALERT high_cpu_load
   IF node_load1 > 1.5
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Server under high load",
     description = "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
   }

Change the load threshold according to the number of CPU cores.
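
One simple way to pick the threshold is to scale a per-core factor by the number of cores. The sketch below assumes exactly that: the function names are hypothetical, and the default factor of 1.5 is borrowed from the rule above as a starting point, not a recommendation.

```python
import os


def load_threshold(cores, per_core=1.5):
    """Scale a per-core load factor to a host with `cores` CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * per_core


def render_expr(cores, per_core=1.5):
    """Render the IF expression for the high_cpu_load rule."""
    return f"node_load1 > {load_threshold(cores, per_core):g}"


if __name__ == "__main__":
    # os.cpu_count() reports the cores of the machine running this script,
    # so run it on the Docker host itself.
    cores = os.cpu_count() or 1
    print(render_expr(cores))
```

On a four-core host this renders node_load1 > 6, which can be pasted into hosts.rules in place of the single-core value.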

Trigger an alert if the Docker host memory is almost full:

 ALERT high_memory_load
   IF (sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / sum(node_memory_MemTotal) * 100 > 85
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Server memory is almost full",
     description = "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
   }
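
The arithmetic behind this expression is easier to follow with concrete numbers. The sketch below mirrors the formula in plain Python; the function name is hypothetical and the sample values are made up. Buffers and page cache are counted as available because the kernel can reclaim them under pressure.

```python
def memory_used_percent(mem_total, mem_free, buffers, cached):
    """Mirror the alert expression: (total - (free + buffers + cached)) / total * 100."""
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    available = mem_free + buffers + cached
    return (mem_total - available) / mem_total * 100


if __name__ == "__main__":
    # Made-up values in bytes for an 8 GiB host.
    gib = 1024 ** 3
    usage = memory_used_percent(8 * gib, 0.5 * gib, 0.25 * gib, 0.25 * gib)
    print(f"{usage:.1f}% used, alert fires: {usage > 85}")
```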

Trigger an alert if the Docker host storage is almost full:

 ALERT high_storage_load
   IF (node_filesystem_size{fstype="aufs"} - node_filesystem_free{fstype="aufs"}) / node_filesystem_size{fstype="aufs"} * 100 > 85
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Server storage is almost full",
     description = "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
   }

Docker container alerts

Trigger an alert if a container is down for more than 30 seconds:

 ALERT jenkins_down
   IF absent(container_memory_usage_bytes{name="jenkins"})
   FOR 30s
   LABELS { severity = "critical" }
   ANNOTATIONS {
     summary = "Jenkins down",
     description = "Jenkins container is down for more than 30 seconds."
   }

Trigger an alert if the container uses more than 10% of the total CPU capacity (all cores) for more than 30 seconds:

 ALERT jenkins_high_cpu
   IF sum(rate(container_cpu_usage_seconds_total{name="jenkins"}[1m])) / count(node_cpu{mode="system"}) * 100 > 10
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Jenkins high CPU usage",
     description = "Jenkins CPU usage is {{ humanize $value}}%."
   }
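
To see what this expression measures: container_cpu_usage_seconds_total is a counter of CPU seconds consumed, rate() turns it into CPU seconds per second, and count(node_cpu{mode="system"}) yields the number of cores. The sketch below reproduces that arithmetic from two counter samples. It is a simplification with a hypothetical function name; the real rate() also handles counter resets and extrapolation.

```python
def cpu_percent(seconds_start, seconds_end, window_seconds, cores):
    """Approximate rate(counter[window]) / cores * 100 from two counter samples."""
    if window_seconds <= 0 or cores < 1:
        raise ValueError("window_seconds must be positive and cores at least 1")
    rate = (seconds_end - seconds_start) / window_seconds  # CPU seconds per second
    return rate / cores * 100


if __name__ == "__main__":
    # Made-up samples: the container burned 30 CPU seconds in a 60 s window
    # on a 4-core host.
    usage = cpu_percent(100.0, 130.0, 60.0, 4)
    print(f"{usage}% of total CPU, alert fires: {usage > 10}")
```

Dividing by the core count means 100% corresponds to the container saturating every core on the host.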

Trigger an alert if the container uses more than 1.2 GB of RAM for more than 30 seconds:

 ALERT jenkins_high_memory
   IF sum(container_memory_usage_bytes{name="jenkins"}) > 1200000000
   FOR 30s
   LABELS { severity = "warning" }
   ANNOTATIONS {
     summary = "Jenkins high memory usage",
     description = "Jenkins memory consumption is at {{ humanize $value}}.",
   }

Setting up alerting

The AlertManager service is responsible for handling the alerts sent by the Prometheus server. AlertManager can send notifications via email, Pushover, Slack, HipChat, or any other system that exposes a webhook interface.

You can view and silence alerts in the AlertManager UI at http://<host-ip>:9093.

Notification receivers are configured in the alertmanager/config.yml file.

To receive alerts via Slack, you need to create a custom integration by choosing "Incoming WebHooks" on your Slack team's app page.

Copy the Slack Webhook URL into the api_url field and define the Slack channel.

 route:
   receiver: 'slack'

 receivers:
   - name: 'slack'
     slack_configs:
       - send_resolved: true
         text: "{{ .CommonAnnotations.description }}"
         username: 'Prometheus'
         channel: '#<channel>'
         api_url: 'https://hooks.slack.com/services/<webhook-id>'

Extending the monitoring system

The dockprom Grafana dashboards can easily be extended to cover more than one Docker host. To monitor more hosts, deploy a node-exporter and a cAdvisor container on each host and point the Prometheus server at them so that it scrapes their metrics.

You should run one Prometheus stack per data center or zone and use the federation feature to aggregate all the metrics in a dedicated Prometheus instance that serves as an overview of the whole infrastructure. That way, if a zone goes down, or the Prometheus instance that aggregates the zones goes down, monitoring is still available at the zone level.
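
Federation works by having the aggregating server scrape the /federate endpoint of each zone-level Prometheus, with one match[] parameter per series selector. The sketch below only builds such a URL so you can inspect what a zone server would expose. The function name, the host name prometheus-zone-a and the job names are hypothetical placeholders, not values taken from dockprom.

```python
import urllib.parse


def federate_url(zone_base_url, selectors):
    """Build the /federate URL exposing the series matched by `selectors`."""
    # doseq=True emits one match[] parameter per selector.
    query = urllib.parse.urlencode({"match[]": selectors}, doseq=True)
    return f"{zone_base_url}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical zone-level server and job names; adjust to your setup.
    url = federate_url(
        "http://prometheus-zone-a:9090",
        ['{job="nodeexporter"}', '{job="cadvisor"}'],
    )
    print(url)
```

Opening the printed URL in a browser or with curl shows the current value of every matched series in the text exposition format, which is a quick way to verify the selectors before adding a federation scrape job.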

You can also make Prometheus more resilient by running two identical servers in each zone. Having several servers send the same alerts to Alertmanager does not result in duplicate notifications, because Alertmanager deduplicates them.
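
Deduplication relies on the fact that Alertmanager identifies an alert by its label set, so two servers firing the same rule for the same target produce the same alert. The toy model below illustrates that idea only; it is not how Alertmanager is implemented, and the function name and the "source" field are hypothetical. It also assumes both servers attach identical labels to their alerts.

```python
def deduplicate(alerts):
    """Keep one alert per unique label set, preserving first-seen order."""
    seen = set()
    unique = []
    for alert in alerts:
        # A frozenset of label pairs is hashable and ignores label order.
        key = frozenset(alert["labels"].items())
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


if __name__ == "__main__":
    # Two identical servers report the same alert for host1.
    incoming = [
        {"labels": {"alertname": "high_cpu_load", "instance": "host1"}, "source": "prometheus-a"},
        {"labels": {"alertname": "high_cpu_load", "instance": "host1"}, "source": "prometheus-b"},
        {"labels": {"alertname": "high_cpu_load", "instance": "host2"}, "source": "prometheus-a"},
    ]
    for alert in deduplicate(incoming):
        print(alert["labels"])
```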

Source: https://habr.com/ru/post/314212/
