Hello! We continue the Backend United meetup series. The fourth meetup is called Okroshka, and it is dedicated to incidents. Together with colleagues from Tutu.Ru, Ozon and Avito, we’ll talk about handling incidents, tools for improving incident response, and the value of technical debt.
The meetup will be held on August 10, starting at 12:00. Register and invite your colleagues. Below the cut: abstracts, the registration link, and the live stream of the meetup.
We all want our users to be happy and our services to work and to be repaired quickly after failures. The more developers and teams there are, the more services there are, and the more tools are needed to keep it all running. That also means more possible actions to take during diagnosis and recovery.
I will tell you how simple technical solutions made our lives easier during incidents: how, using chat features and almost no magic, we gave teams a customizable system that brings the diagnostics they care about closer, makes alerts from different systems more useful, and simplifies alert routing.
As a bonus, I’ll tell you how you can measure the treasured “nines” of your service’s availability, and how that turned out for us.
Has it ever happened in your practice that a failure which until recently seemed insignificant brought down the whole of production? Or that you fixed a problem that turned out not to be particularly serious?
How do you assess the actual impact and recognize a time bomb? How do you manage the flow of bugs and crashes and single out the significant ones? In this talk I’ll cover how this practice is organized at Avito and what research and automation we use in our work.
Sometimes everything breaks down at once: all the graphs are red and everything is on fire. It seems that a detailed analysis should make everything clear... but no. The root cause is not easy to catch, especially when you do not have a complete picture of what is happening in the monolith, the services, the microservices, the databases, the developers’ heads, and so on.
I will tell you how we gathered all the secret knowledge and failure scenarios of various systems and services and turned them into code for automated detection and initial analysis of significant incidents.
A high pace of development accelerates the accumulation of technical debt. More and more often we have to compromise on the stability and quality of our solutions in favor of new functionality and new product features. Without proper control over the volume of technical debt, the stability of the system and, as a result, the technical stability of the business may deteriorate. I’ll talk about what we do to keep track of everything that breaks and gets a quick fix, how we help teams not to forget about those promises, and how we give the business complete and understandable information about what happened, how it was repaired, and what we’ll do so that it does not happen again.
12:30 - 13:15 - Simple tools to improve incident response: the Tutu experience. Andrey Borzov (Tutu.ru)
13:20 - 14:00 - Working with production explosions: detection, loss estimation, incident management. Dmitry Khimion (Avito)
14:00 - 14:45 - Lunch
14:45 - 15:30 - AutoLSR: automated data collection for significant incidents. Vladimir Kolobaev (Avito)
15:40 - 16:20 - We broke it now, but we will fix it later. Tech debt and its value. Boris Kaiser (Ozon)
16:30 - Afterparty at ONE MORE PUB
The meetup begins on August 10 at 12:00. Participation is free, but registration is required. Please enter your first and last name as they appear in your passport, and remember to bring it (or a driver’s license) with you; otherwise you will not be allowed into the office.
Address: Avito office, Lesnaya 7.
Watch the live broadcast of the meetup on the AvitoTech YouTube channel.
See you!
Source: https://habr.com/ru/post/461739/