
Monitoring of distributed and cloud infrastructure

In the previous article I reviewed the various types of monitoring for simple web projects and websites, where the site does not require 99.99% reliability and the response time to a problem may be hours or days. In short, where everything is simple. In this article I will cover the mechanisms for monitoring cloud infrastructure, where a simple "available / not available" signal is not enough to understand what the problem is and how to solve it promptly, or where solving a problem may require a large number of actions that can be only partially automated.

Typically, the reliability built into a project's infrastructure allows the reaction time to problems to stay the same: hours or even days. But there are a number of places where decisions must be made in (semi-)automatic mode in order to exclude the human factor and minimize system downtime. The triggers for such decisions are discussed below. I want to note right away that almost all of the monitoring technologies described here are used in the new cloud-based social intranet service, Bitrix24.


Not enough


When building a fault-tolerant distributed infrastructure, in addition to providing several levels of reliability, several layers of system monitoring are usually included. Among them are built-in, internal, intermediate, and external monitoring.

Built-in monitoring


Munin provides a large amount of information about the state of a given server. The most frequently checked points include:

Some graphs from real system monitoring (load and database)




For example, the following script checks the free space in the MySQL InnoDB tablespace:
 ## Tunable parameters with defaults
 ## (Munin passes overrides through the environment variables
 ## mysql, mysqlopts, warning and critical)
 MYSQL="${mysql:-/usr/bin/mysql}"
 MYSQLOPTS="${mysqlopts:---user=munin --password=munin --host=localhost}"

 WARNING=${warning:-2147483648}   # 2GB
 CRITICAL=${critical:-1073741824} # 1GB

 ## No user serviceable parts below
 # When called with "config", describe the graph instead of reporting values
 if [ "$1" = "config" ]; then
     echo 'graph_title MySQL InnoDB free tablespace'
     echo 'graph_args --base 1024'
     echo 'graph_vlabel Bytes'
     echo 'graph_category mysql'
     echo 'graph_info Amount of free bytes in the InnoDB tablespace'
     echo 'free.label Bytes free'
     echo 'free.type GAUGE'
     echo 'free.min 0'
     # "N:" sets a lower bound: alert when the value drops below N
     echo 'free.warning' "$WARNING:"
     echo 'free.critical' "$CRITICAL:"
     exit 0
 fi

 # Get freespace from mysql
 # ($MYSQLOPTS is deliberately unquoted so that it splits into separate options)
 freespace=$($MYSQL $MYSQLOPTS --batch --skip-column-names --execute \
     "SELECT table_comment FROM tables WHERE TABLE_SCHEMA = 'munin_innodb'" \
     information_schema)
 retval=$?

 # Sanity checks
 if [ "$retval" -gt 0 ]; then
     echo "Error: mysql command returned status $retval" 1>&2
     exit 1
 fi
 if [ -z "$freespace" ]; then
     echo "Error: mysql command returned no output" 1>&2
     exit 1
 fi

 # Return freespace (the table comment reports it in kB, so convert to bytes)
 echo "$freespace" | awk '/InnoDB free:/ {print "free.value", $3 * 1024}'

A large list of popular plugins can be found in the Git repository.

Internal monitoring


Nagios is definitely a good monitoring solution. But be prepared to supplement it with your own scripts and/or Pinba (or a similar solution for your programming language). Pinba receives UDP packets and collects information about script execution time, memory usage, and error codes. In principle, this is enough to build a complete picture of what is happening and to ensure the required level of service reliability automatically.

At the level of internal monitoring it is already possible to make decisions about allocating additional capacity or turning it off. If this can be done automatically, it is enough to monitor the average processor utilization on the application servers or the database; if it is done manually, you can send emails or Jabber messages. Also, when an abnormal number of errors occurs, you can send emergency SMS notifications or make a phone call. This usually happens because of a hardware failure or a bug in a new version of the web service, and additional checks can always establish which.
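
The escalation from ordinary messages to emergency notifications can be sketched as a small shell function. This is only an illustration: the function name `alert_level`, the 1% and 5% thresholds, and the channel names are assumptions made for the example, not values taken from Bitrix24.

```shell
#!/bin/sh
# Hypothetical escalation sketch: map an error rate to a notification channel.
# Thresholds and channel names are illustrative assumptions.

# alert_level ERRORS TOTAL -> prints "none", "email" or "sms"
alert_level() {
    errors=$1
    total=$2
    # No traffic at all: nothing to compare against, so stay quiet.
    if [ "$total" -eq 0 ]; then
        echo "none"
        return 0
    fi
    # Error rate in hundredths of a percent, to stay within integer arithmetic.
    rate=$(( errors * 10000 / total ))
    if [ "$rate" -ge 500 ]; then      # 5% or more: abnormal, send an SMS
        echo "sms"
    elif [ "$rate" -ge 100 ]; then    # 1% or more: an email or Jabber message
        echo "email"
    else
        echo "none"
    fi
}

alert_level 3 1000    # 0.3% of requests failed
alert_level 70 1000   # 7% of requests failed
```

In a real setup the two counters would come from the monitoring system (for example, the request and error counts collected over the last few minutes).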

It is also very convenient to configure automatic addition (or removal) of checks from predefined templates as the number of monitored points (for example, servers or user sites) grows: for example, a check of the main page, the distribution of PHP execution time, the distribution of PHP memory usage, and the number of nginx and PHP errors.
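
One possible way to generate such checks is to print a Nagios service definition for every host and every template entry. The sketch below assumes that a `generic-service` template exists in the Nagios configuration and that the check commands are defined there as well; `check_php_time`, `check_php_memory`, and `check_error_count` are hypothetical names invented for the example.

```shell
#!/bin/sh
# Hypothetical sketch: emit one Nagios service definition per host and check.
# "generic-service" and the check command names are assumed to be defined
# elsewhere in the Nagios configuration.

# gen_services HOST... -> prints service definitions on stdout
gen_services() {
    for host in "$@"; do
        # Each template line is "description|check_command".
        while IFS='|' read -r description command; do
            printf 'define service {\n'
            printf '    use                 generic-service\n'
            printf '    host_name           %s\n' "$host"
            printf '    service_description %s\n' "$description"
            printf '    check_command       %s\n' "$command"
            printf '}\n'
        done <<EOF
Main page|check_http
PHP execution time|check_php_time
PHP memory usage|check_php_memory
nginx and PHP errors|check_error_count
EOF
    done
}

gen_services app1.example.com app2.example.com
```

The output can be written to a file that Nagios includes, so adding a server means adding one name to the host list and regenerating.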

Intermediate monitoring


Few providers offer monitoring at the cloud infrastructure level, and it is mostly informational: real decisions are made on the basis of either internal data or the external state of the system. At the intermediate level, you can only collect statistics or confirm the internal state of the infrastructure.

For Amazon (CloudWatch), the following check options are available:

Based on the results of intermediate monitoring (at the load balancer level), it is already possible to make an informed decision about starting or terminating machines (instances) in a cluster. This is exactly what is done in Bitrix24: as soon as the load on the application servers becomes too high (more than 60%), new instances are started. Conversely, when the load drops below 40%, instances are terminated.
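
The 60% / 40% rule can be written down as a small decision function. This is a minimal sketch, not the Bitrix24 implementation: the function name `scale_decision` and the output words are assumptions, and the load is taken as an integer percentage.

```shell
#!/bin/sh
# Sketch of threshold-based scaling with a dead band between 40% and 60%.

# scale_decision LOAD_PERCENT -> prints "scale_out", "scale_in" or "hold"
scale_decision() {
    load=$1
    if [ "$load" -gt 60 ]; then
        echo "scale_out"   # load too high: start new instances
    elif [ "$load" -lt 40 ]; then
        echo "scale_in"    # load low: terminate surplus instances
    else
        echo "hold"        # inside the 40-60% band: do nothing
    fi
}

for load in 25 50 75; do
    printf '%s%% -> %s\n' "$load" "$(scale_decision "$load")"
done
```

The gap between the two thresholds matters: with a single threshold, a load hovering around it would start and stop instances on every measurement.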

External monitoring


Here the choice of solutions is very large. If you want to track the status of servers around the world, the best solution is Pingdom. For Russian realities there are PingAdmin, Monitorus, or WEBO Pulsar (the latter two have a network of servers in Russia). It is especially convenient to set up checks from several points (for example, Moscow and St. Petersburg) and to call a remote notification script if the service has been unavailable for 1-2 minutes. If there are also problems inside at the same time, you can immediately switch to plan “B” (turn off non-working servers, send out notifications, and so on).
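
The "several points, 1-2 minutes" condition can be sketched as follows. The function name `outage_confirmed` and the input format are assumptions made for the example; real services such as Pingdom apply this kind of confirmation on their side and simply call your script.

```shell
#!/bin/sh
# Hypothetical sketch: confirm an outage only when every checkpoint reports
# "down" for N consecutive rounds (e.g. 2 rounds of 1-minute checks).

# outage_confirmed N ROUND... -> exit status 0 if confirmed, 1 otherwise
# Each ROUND is a comma-separated list of per-checkpoint results,
# oldest first, for example "up,down" or "down,down".
outage_confirmed() {
    needed=$1
    shift
    streak=0
    for round in "$@"; do
        case ",$round," in
            *,up,*) streak=0 ;;                  # at least one point sees the site
            *)      streak=$(( streak + 1 )) ;;  # all points report down
        esac
    done
    [ "$streak" -ge "$needed" ]
}

if outage_confirmed 2 "up,up" "down,up" "down,down" "down,down"; then
    echo "outage confirmed: call the notification script"
else
    echo "no confirmed outage"
fi
```

Requiring agreement between checkpoints filters out network problems local to a single monitoring location.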

Additional advantages of external monitoring include checking the real response time of the server (or real network latency). You can configure notifications on this parameter as well. As an additional option when using a CDN, you can track the total load time of the service pages and enable or disable the CDN for different regions.

P.S. This article is an overview, more about the monitoring architecture of large projects. I will cover concrete applied techniques in the following articles.
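
As a rough illustration of the CDN option, the sketch below compares page load times measured with and without the CDN for each region. The function name `cdn_plan`, the input format, and the sample numbers are all assumptions made for the example; the measurements themselves would come from the external monitoring service.

```shell
#!/bin/sh
# Hypothetical sketch: decide per region whether the CDN should stay enabled,
# given measured page load times in milliseconds.

# cdn_plan reads lines "REGION CDN_MS DIRECT_MS" on stdin and prints
# "REGION enable" when the CDN is faster, "REGION disable" otherwise.
cdn_plan() {
    awk 'NF == 3 { print $1, ($2 < $3 ? "enable" : "disable") }'
}

# Sample input with made-up numbers.
cdn_plan <<EOF
moscow 420 610
novosibirsk 980 750
EOF
```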

Source: https://habr.com/ru/post/140091/

