Like the others: Monitoring & Tracing Tools in Odnoklassniki

Monitoring large high-load systems is reminiscent of the work of an air traffic controller: you have to watch a variety of indicators continuously and head off problems in real time. Fortunately, unlike in aviation, mistakes here are not as fatal, which is probably why the monitoring team has fewer gray hairs.

Sergey Sharapov, a system analyst at Mail.ru, helped us look at analytics and monitoring "from the other side". He has extensive experience at Odnoklassniki, ranging from setting up server and network equipment to building business processes for HR.

Sergey has seen with his own eyes both the successes and the failures in the life of the Odnoklassniki backend, so we decided to ask him about the structure of the Odnoklassniki monitoring service, the team's work pattern, its performance evaluation methods, and the most memorable events from practice.

- Sergey, tell us about the size of the monitoring team, its structure and points of interaction with the developers. Who is responsible for what in this scheme?

Sergey Sharapov: The monitoring team consists of eight people. Five work the day shifts, of which there are three (7:00 to 16:00, 10:00 to 19:00, and 14:00 to 23:00), and three work the night shift. During the day, when user activity on the portal is higher and many experiments are being launched, two people are on each shift. At night and on weekends, one person is on shift. The day team handles deeper analysis and investigation of the anomalies that have occurred. A system administrator is assigned to support the monitoring team. Admins are on duty for 24 hours at a time and get involved only when something is really needed from them, so even while on duty they can work on their own tasks. The day monitoring team interacts directly with the developers and with the engineers in the data center. The night team generally contacts only the on-duty system administrator, and it wakes the admin only if there is downtime or a risk of it. All night incidents are usually analyzed in the morning.

- What are the KPIs of the monitoring team, and how is its effectiveness evaluated? Who reviews each shift?

Sergey Sharapov: After each shift, the system administrator is expected to assess the quality of the monitoring team's work on three points: the speed of detecting and reporting the problem, the completeness of the investigation, and the escalation of the problem (whether everyone concerned was brought in). Incidents are also cross-checked for the quality of their write-up and investigation: every incident is reviewed by colleagues after it is closed. We are now building a system that will aggregate and analyze all this information so that we can see in time where the team has problems.

- How does the alert system work from a technical and human point of view?

Sergey Sharapov: All online monitoring lives in a single system, SmartMonitoring (all the necessary information is on one screen), which shows problems with business metrics and problems in the operation of applications. It also shows notifications about new auto-incidents, which work through a Jira + Zabbix integration: Zabbix detects a problem and automatically creates an incident in Jira. All communication between the monitoring team and the admins, developers, and engineers takes place in our TamTam messenger. For each more or less serious incident, a separate chat is created in which it gets resolved. When incidents are created, automatic notifications go to the main chat, which includes all employees and where people post about every experiment and piece of work that could affect something. All of this is also duplicated in a separate monitoring chat with only technical specialists. Auto-incidents do not go to these chats, because they do not affect users. If something serious happens, a general incident is created and the auto-incidents are linked to it. The most important thing is that the chat stays readable: there is no spam in it, and every message carries meaning.
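
The Zabbix-to-Jira handoff described above can be sketched roughly as follows. This is a minimal illustration, not Odnoklassniki's actual integration: the alert fields, the project key `MON`, the issue type name `Incident`, and the label `autoincident` are all assumptions made up for the example. The payload shape follows the create-issue request of the Jira REST API (`POST /rest/api/2/issue`), and nothing is sent over the network here.

```python
import json


def build_incident_payload(alert, project_key="MON"):
    """Turn a (hypothetical) monitoring alert dict into a Jira create-issue body."""
    summary = "[{severity}] {host}: {trigger}".format(**alert)
    description = "Auto-created from a monitoring alert.\nEvent ID: {}".format(
        alert["event_id"]
    )
    return {
        "fields": {
            "project": {"key": project_key},    # assumed project key
            "issuetype": {"name": "Incident"},  # assumed issue type name
            "summary": summary,
            "description": description,
            "labels": ["autoincident"],         # assumed label for auto-incidents
        }
    }


# Hypothetical alert; a real one would come from the monitoring system.
alert = {
    "host": "web-01",
    "trigger": "High response time",
    "severity": "High",
    "event_id": "12345",
}
payload = build_incident_payload(alert)
print(json.dumps(payload, indent=2))
```

Keeping the payload construction in a pure function makes it easy to check what would be filed before wiring it to a real Jira instance.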

- Tell us about the most interesting cases where the human factor played a role. Did anyone delete a database?

Sergey Sharapov: Of course, such cases happen... We are constantly coming up with ways to minimize the human factor. The most serious incident occurred on April 4, 2013. A lot has been said about it, and there is a separate article on Habr. Among ourselves we call this incident 404. Eight years ago I distinguished myself by "killing" equipment worth several tens of thousands of dollars... I was just starting at Odnoklassniki, and I was being taught to update the firmware on our Promise arrays. My mentor sent me the new firmware over Skype and showed me on my computer how to do the update. But waiting for an array to come back from a reboot takes a long time, and what could possibly go wrong?! So I repeated the procedure on 15 devices. It turned out that the firmware was for different equipment, and there was no way to roll back. Still, the story ended well for us: since the equipment was only being prepared for production, the vendor met us halfway and replaced all the "dead" devices for free.

Another time, one of the developers deleted 4 TB of our statistics data. The cause was an error in the delete command: the '$' at the beginning of the directory name was not escaped, which led to the parent directory being removed. But this story also ended well, because there was a backup.
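
The failure mode in this story can be illustrated with a small sketch. The exact command is not given in the interview, so everything concrete here is an assumption: the path `/data/stats`, the directory name `$old`, and the helper functions `shell_like_expand` and `is_safe_to_delete` are hypothetical. The first helper only approximates how a shell treats an unquoted `$NAME` (an unset variable becomes an empty string); the second is one possible guard, and `shlex.quote` shows how quoting keeps the `$` literal.

```python
import posixpath
import re
import shlex


def shell_like_expand(arg, env):
    """Roughly mimic unquoted $NAME expansion: unset variables become empty."""
    return re.sub(
        r"\$([A-Za-z_][A-Za-z0-9_]*)",
        lambda m: env.get(m.group(1), ""),
        arg,
    )


def is_safe_to_delete(target, root):
    """Allow deletion only of paths strictly below root, never root itself."""
    target = posixpath.normpath(target)
    root = posixpath.normpath(root)
    return target != root and target.startswith(root + "/")


root = "/data/stats"       # hypothetical statistics directory
intended = root + "/$old"  # a directory literally named "$old"

# Without escaping, the name is treated as a variable and vanishes,
# leaving the parent directory as the deletion target.
expanded = shell_like_expand(intended, env={})
print(expanded)
print(is_safe_to_delete(expanded, root))  # the guard rejects the parent
print(is_safe_to_delete(intended, root))  # the literal path is accepted
print(shlex.quote(intended))              # quoting keeps the '$' literal
```

A guard like this does not replace backups, but it turns a silent deletion of the parent directory into an explicit refusal.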

- Judging by what is available online, you have a lot of in-house solutions. We suspect this is because Odnoklassniki appeared at a time when not much was available, rather than because none of the modern solutions suit you. Do you analyze the market? Which popular tools could replace your own developments?

Sergey Sharapov: We constantly keep an eye on new solutions and attend many conferences. Even if there is a good solution for our amount of equipment, either it is very expensive or, more likely, it would take a very long time to adapt. We know well how the systems we built work. We can easily administer and develop them, flexibly changing them to fit our needs. We are a big company, and we want to create not only the main product but also related products. But it would be wrong to say that we do not use open-source solutions at all. What immediately comes to mind is our database for storing statistics, which is built on Druid; there will be a talk about it at HighLoad in November. But we put a lot of effort into making it work the way we need.

If you want to learn more technical details from Odnoklassniki's practice, come hear Sergey's talk "SmartMonitoring - Monitoring Business Logic in Odnoklassniki" at the DevOops 2017 Piter conference in October. Of course, he will not be the only speaker. You will surely be interested in other talks, including:

Ensemble of salty puppeteer chefs: compare Ansible, SaltStack, Chef and Puppet (Andrey Filatov, Epam Systems)
Expanding k8s (Nikolai Ryzhikov, Health Samurai)
DevOps at scale: Greek tragedy in three acts (Baruch Sadogursky, JFrog and Leonid Igolnik, CA Technologies)
Troubleshooting & debugging production applications in Kubernetes (aka The Failing Demo Talk) (Ray Tsang, Google and Baruch Sadogursky, JFrog)

Source: https://habr.com/ru/post/339786/
