Foreword
Our main line of work is optimizing advertising in social networks and mobile applications. Each banner impression is the result of the interaction of a rather large number of services spread across different servers, sometimes in different data centers. Naturally, this raises the task of monitoring the connectivity between servers and services. What exactly that task looks like, and which solutions fit it and which do not, is what the rest of this article is about.
Network structure: frontends, backends, services
Like many other web applications, ours consists of frontends (FE), backends (BE), and services. The frontend accepts the connection from the client and hands it to the backend, which calls on services for help and responds to the client. In our case the FE is almost always nginx and the BE is Tomcat. This architecture is fairly standard; the non-standard parts begin when you have to decide how to route requests between frontends, backends, and services.
In our case the routing is controlled by the CMDB configuration database. It reflects both the physical layout of the network (data centers, physical servers, ...) and the application-level layout (which services and interfaces live in which containers on which physical servers, how FE and BE are split into clusters, which servers may serve traffic and which may not, and so on).
The CMDB information is used in two ways. The first, conditionally "offline", is that a script run from cron every few minutes pulls the necessary information from the CMDB and builds up-to-date configs, for example for nginx, including the required upstream blocks. How the second way of using the CMDB works is covered in the next section.
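Before moving on, here is a minimal sketch of the "offline" path: a cron job that pulls upstream lists and renders an nginx config fragment. The CMDB endpoint and its JSON shape are hypothetical; only the idea is real.

```python
# A minimal sketch of the cron-driven config builder, assuming a hypothetical
# CMDB REST endpoint that returns {"upstream_name": ["ip:port", ...], ...}.
import json
import subprocess
import urllib.request

CMDB_URL = "http://cmdb.example.com/api/upstreams"   # hypothetical endpoint

def render_upstreams() -> str:
    """Render nginx upstream blocks from the CMDB data."""
    with urllib.request.urlopen(CMDB_URL) as resp:
        upstreams = json.load(resp)   # e.g. {"be_cluster1": ["10.1.2.3:8080", ...]}
    blocks = []
    for name, servers in sorted(upstreams.items()):
        lines = "\n".join(f"    server {addr};" for addr in servers)
        blocks.append(f"upstream {name} {{\n{lines}\n}}\n")
    return "\n".join(blocks)

if __name__ == "__main__":
    with open("/etc/nginx/conf.d/upstreams.conf", "w") as conf:
        conf.write(render_upstreams())
    # reload nginx only if the freshly written config passes the syntax check
    if subprocess.run(["nginx", "-t"]).returncode == 0:
        subprocess.run(["nginx", "-s", "reload"])
```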
CMDB, short names, finding services by name
The second way to use the CMDB can be called online, because it reflects changes in the CMDB not periodically but immediately (modulo short-lived caches on top of the CMDB). This method is mainly used by backends to find the services they need and to connect services to each other, and it is based on DNS. The easiest way to understand how it works is to walk through the steps (a small client-side sketch follows the list):

- The backend be5 located on host58 needs to connect to the database; the name of the database service is db1 (a short DNS name)
- From the name db1 and the data in resolv.conf, the name db1.be5.host58.example.com is built. The resolver sends a request to PowerDNS
- The pDNS plugin concludes from db1.be5.host58 that the be5 container located on host58 is looking for the db1 service
- CMDB data is used to return a list of IPs of working containers that provide the db1 service closest to host58
- The service on be5 selects an arbitrary address from the received list and calls the db1 service.
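Here is a minimal client-side sketch of that lookup, assuming the resolver's search list (taken from resolv.conf) contains be5.host58.example.com, so the short name db1 expands automatically:

```python
# A sketch of resolving a short service name via the system resolver
# (with PowerDNS and its CMDB plugin behind it) and picking one address.
import random
import socket

def resolve_service(short_name: str) -> list[str]:
    """Return the list of IPs that pDNS answered for the short name."""
    # getaddrinfo applies the search domains from /etc/resolv.conf, so
    # "db1" is expanded to "db1.be5.host58.example.com" before the query.
    infos = socket.getaddrinfo(short_name, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

def pick_endpoint(short_name: str) -> str:
    """Pick an arbitrary address from the returned list, as the text describes."""
    return random.choice(resolve_service(short_name))

if __name__ == "__main__":
    print(pick_endpoint("db1"))
```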
Summarizing the intermediate result: frontends, backends, and services find each other using a single source of information, the CMDB. There are two methods for obtaining addresses: by building configs and by querying DNS.
The physical structure of the network, communication problems

We rent servers from hosting providers. The structure in every data center is the same: a set of physical hosts running containers with frontends, backends, or services. A request from any component can go to any other host and even to another data center if, for some reason, the necessary service is only available there. In total we are spread across six data centers, each with one to two hundred servers, and each server runs up to four or five containers. Altogether that gives about a thousand hosts and about five thousand containers.
Since the network itself is not in our hands, we cannot monitor links, switches, channel congestion, and so on. For the same reason we use IPsec for communication between servers. IPsec sometimes produces surprises in the form of lost connectivity between two hosts (or only between their containers); the provider infrastructure produces similar surprises. In any case, all we can detect is the fact that one host (or its containers) is unreachable from some set of hosts or containers. Moreover, the unavailability does not always show up at the application level: services are duplicated, so the request can be served from a reachable server. Nor can we catch such problems with pings from a central server: it simply will not notice them.
While the number of our servers was limited to a few dozen, we did not bother much: we checked connectivity with pings sent from each host to every other host and tracked the losses. As the number of hosts approaches a thousand, this simple method stops working:
- the time it takes to send all the pings becomes too long;
- the network consumption for monitoring becomes noticeable (a standard ping is roughly a hundred bytes on the wire, so pinging a thousand servers within a second is about 1000 × 100 B × 8 ≈ 0.8 Mbit of bandwidth; if all thousand servers do this, it is already too much);
- CPU consumption on the host for processing and analyzing "who sees whom and who does not" becomes large;
- connectivity between containers is not monitored: we do not see IPsec problems or host-level routing problems that lead to problems at the container level.
To solve all these problems at one stroke, we need to monitor not "everything", but only the connections between components that actually talk to each other. So the first task is to find out which components a given component accesses. Only the developers know this exactly, but they cannot always provide up-to-date information, so asking them about it, or asking them to maintain some registry, makes no sense: in reality it does not work.
Here our knowledge of how services find each other comes to the rescue: components that do not use resolving but compute their partners by querying the CMDB directly are asked to save the result of that computation not only in their own configs, but also in an agreed public place in an agreed format. And for services that use the CMDB through resolving, we parse the PowerDNS logs and find out what PowerDNS answered to each request. In this way we can collect a complete list of which IPs each specific component may access.
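A minimal sketch of harvesting that list from the pDNS side. The actual log format depends on the pDNS setup and its plugin; the line format assumed here is purely hypothetical.

```python
# Assumed (hypothetical) log line format:
#   2019-03-01T12:00:00 db1.be5.host58.example.com -> 10.1.2.3,10.2.3.4
import re
from collections import defaultdict

LINE_RE = re.compile(r"^\S+\s+(?P<qname>\S+?)\.example\.com\s+->\s+(?P<ips>[\d.,]+)$")

def collect_links(log_path: str) -> dict[str, set[str]]:
    """Return {container: set of IPs it resolved} parsed from a pDNS log."""
    links: dict[str, set[str]] = defaultdict(set)
    with open(log_path) as log:
        for line in log:
            match = LINE_RE.match(line.strip())
            if not match:
                continue
            # qname looks like "db1.be5.host58": service, container, host
            _service, container, host = match.group("qname").split(".", 2)
            links[f"{container}.{host}"].update(match.group("ips").split(","))
    return links
```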
Monitoring, tests
We collect the complete list of current connections between containers in one place; for the task of monitoring connectivity, this one place plays the role of the "server". All the information is kept on the server as a dictionary:
| container | list of links |
|---|---|
| be5.host58 | 192.168.5.1 192.168.5.2 10.1.2.3 10.2.3.4 |
| be8.host1000 | 10.0.0.1 10.0.100.5 |
The server makes this information available through a REST interface.
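A minimal sketch of that server side, assuming Flask and an in-memory dictionary; the endpoint names and payload shapes are illustrative, not the real API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

links = {          # container -> list of peer IPs, built from CMDB and pDNS logs
    "be5.host58": ["192.168.5.1", "192.168.5.2", "10.1.2.3", "10.2.3.4"],
    "be8.host1000": ["10.0.0.1", "10.0.100.5"],
}
reports = {}       # container -> last report posted by its agent

@app.route("/links/<container>")
def get_links(container):
    """Agent asks: which IPs should I ping?"""
    return jsonify(links.get(container, []))

@app.route("/report/<container>", methods=["POST"])
def post_report(container):
    """Agent reports back: unreachable IPs plus a few local metrics."""
    reports[container] = request.get_json()
    return "", 204

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```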
The second half of the monitoring lives in each container; this part is called the "agent". The agent knows nothing about the container it runs in except its name and stores no current state. Every two minutes the agent asks the server for its list of connections, and the server responds with the corresponding row from the table above. The list of IP addresses usually contains no more than ten entries. The agent pings these addresses and reports the result, containing the list of problems (if any), back to the server via POST. Since all the operations the server performs for the agents are very simple, it can easily serve a large number of them: it just has to accept GET and POST requests and look up or add an entry in the dictionary a few dozen times per second.
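A minimal sketch of the agent loop, assuming the server endpoints from the previous sketch; the server address, ping parameters, and report field names are illustrative.

```python
import json
import socket
import subprocess
import urllib.request

SERVER = "http://monitor.example.com:8080"   # hypothetical server address
CONTAINER = socket.gethostname()             # e.g. "be5.host58"

def ping(ip: str) -> bool:
    """Send a few ICMP probes; True if the peer answered."""
    return subprocess.run(["ping", "-c", "3", "-W", "1", ip],
                          stdout=subprocess.DEVNULL).returncode == 0

def run_once() -> None:
    # GET the list of peers this container is supposed to reach
    with urllib.request.urlopen(f"{SERVER}/links/{CONTAINER}") as resp:
        peers = json.load(resp)
    failed = [ip for ip in peers if not ping(ip)]
    # POST the problems (if any) back to the server
    report = {"disconnects": len(failed), "connectivity": failed}
    req = urllib.request.Request(f"{SERVER}/report/{CONTAINER}",
                                 data=json.dumps(report).encode(),
                                 headers={"Content-Type": "application/json"},
                                 method="POST")
    urllib.request.urlopen(req)

if __name__ == "__main__":
    run_once()   # in practice launched from cron every two minutes
```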
In addition, since the agent runs every couple of minutes anyway, it makes sense to collect a few more current performance indicators of the component and send them to the server. We monitor a lot of things through Munin, but Munin presents data with a fairly large delay, and here we can get something almost in real time. Therefore, the agent also collects indicators such as the load average in the container, the rate of requests to nginx (if there is one), and the number of nginx errors broken down by status code.
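A sketch of how those extra metrics could be gathered, assuming the default "combined" nginx access log format and a hypothetical log path; in practice one would keep the previous counters and report deltas (rates) between runs rather than absolute counts.

```python
import os
import re
from collections import Counter

ACCESS_LOG = "/var/log/nginx/access.log"   # hypothetical path
STATUS_RE = re.compile(r'" (\d{3}) ')      # status code right after the quoted request line

def nginx_status_counts(path: str = ACCESS_LOG) -> Counter:
    """Count requests per HTTP status code in the access log."""
    counts: Counter = Counter()
    with open(path) as log:
        for line in log:
            match = STATUS_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

def load_average() -> float:
    """1-minute load average of the container/host."""
    return os.getloadavg()[0]
```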

Monitoring, presentation and analysis of results
As mentioned above, agents send the results of the connectivity tests to the server (along with some other data, whose exact list may differ between agents). Here is a sample of the information we end up with:
| name | disconnects | DC | role | load | nginxErrRate | connectivity | nginxAccessRate |
|---|---|---|---|---|---|---|---|
| be1.host123 | 0 | Tx | be | 0.21 | 15.55 | | 31.38 |
| be1.host122 | 0 | Tx | be | 0.11 | 18.28 | | 34.61 |
| fo1.host161 | 2 | VA | fo | 0.02 | 0.11 | 10.1.1.4,10.1.2.4 | 14.1 |
| fo1.host160 | 0 | VA | fo | 0.0 | 0.00 | | 0.0 |
| fo1.host162 | 2 | VA | fo | 0.01 | 0.18 | 10.2.1.4,10.2.4.3 | 17.56 |
Some of the information in this table is sent by the agents (disconnects, load, nginx...), and some (DC, role) is filled in from the CMDB. Since the data in this table is very short-lived and the number of rows is on the order of a thousand, it makes sense to keep it in an in-memory SQLite database (":memory:").
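A minimal sketch of keeping the agents' reports in such an in-memory table; the column names follow the SQL queries quoted below.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE data (
        name TEXT PRIMARY KEY,
        disconnects INTEGER,
        dataCenter TEXT,
        role TEXT,
        load REAL,
        nginxErrorRate REAL,
        connectivity TEXT,
        nginxAccessRate REAL
    )
""")

def upsert_report(row: dict) -> None:
    """Replace the container's previous row with the freshly reported one."""
    db.execute(
        "INSERT OR REPLACE INTO data VALUES "
        "(:name, :disconnects, :dataCenter, :role, :load, "
        ":nginxErrorRate, :connectivity, :nginxAccessRate)",
        row,
    )
```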
The main reason we built all of this was to monitor communication problems, so the columns we care about first are disconnects and connectivity: the number of broken connections and the corresponding list of containers. But in general we can extract various interesting facts from such a table with simple SQL queries. "SELECT name FROM data ORDER BY disconnects DESC" gives us the list of servers with communication problems that we must deal with first. The results of this query, if not empty, are sent to Nagios as an alert with the list of problem servers attached. The on-duty engineer, after receiving the alert, takes further measures: removes traffic from the problem servers, creates a ticket with the hosting provider, and so on.
Sometimes the overall error level in the system begins to grow. This can happen for various reasons, related both to our own operational problems and to external causes. An abstract example: routing between data centers has changed, so cross-data-center requests no longer fit into the time limits, which leads to a rise in 504 errors in nginx. The query "SELECT SUM(nginxAccessRate), SUM(nginxErrorRate), dataCenter FROM data GROUP BY dataCenter ORDER BY SUM(nginxAccessRate)" gives us the overall picture of errors broken down by data center. This picture can then be refined with queries that look only at 504 errors, and so on.
Another example: suppose we want to find out whether there is a correlation between the load on a server and the number of errors it produces. The query "SELECT load, nginxErrorRate FROM data WHERE role = 'be1' ORDER BY nginxErrorRate DESC" lets us plot or analytically detect such a dependency.
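For completeness, here is how these ad-hoc queries might be run against the in-memory database from the sketch above (the `db` connection); the WHERE filter on disconnects is an illustrative addition so that only problem hosts come back.

```python
# Hosts with broken connectivity, worst first (fed to the Nagios alert)
problem_hosts = db.execute(
    "SELECT name FROM data WHERE disconnects > 0 ORDER BY disconnects DESC"
).fetchall()

# Overall error picture broken down by data center
errors_by_dc = db.execute(
    "SELECT SUM(nginxAccessRate), SUM(nginxErrorRate), dataCenter "
    "FROM data GROUP BY dataCenter ORDER BY SUM(nginxAccessRate)"
).fetchall()

# Load vs. error rate for one role, for plotting or correlation analysis
load_vs_errors = db.execute(
    "SELECT load, nginxErrorRate FROM data WHERE role = 'be1' "
    "ORDER BY nginxErrorRate DESC"
).fetchall()
```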
The list of such examples could go on. What matters is that all of this information is quite fresh: unlike the Munin graphs, the delay in data presentation is about two minutes, and there are far more ways to analyze the data, and they are easier to run.
This is not a replacement for stacks such as statsd + graphite, if only because there is no history; the ability to quickly get different slices of system behaviour is just a bonus on top of analyzing connectivity between services.
Conclusion
The system described currently monitors connectivity for about a thousand containers, and no performance issues have been found so far. On the other hand, its sensitivity to short-term interruptions, channel overload, or increased ping times turned out to be too high; we are fighting this by tuning the sensitivity of the Nagios alerts.