Citymobil - a guide for startups to increase stability against the background of growth. Part 2. What are the types of accidents?

This is the second article from the series about how we in Citymobil increased the stability of the service (you can read the first one here ). In this article I will delve into the specifics of the analysis of accidents. But before that, I will highlight one point that I should have thought about in advance and covered in the first article, but did not think about it. And about which I learned from the feedback of readers. The second article gives me a chance to eliminate this annoying flaw.

0. Prologue

One of the readers was asked a very fair question: “What is difficult in the taxi service backend?” The question is good. I asked him myself last summer before I started working at Citymobil. I then thought, "think a taxi, an application with three buttons." What is difficult in it? But it turned out that this is a very high-tech service and the most complex product. In order to at least roughly understand what it is about and what kind of big technology is in fact, I’ll tell you about several areas of Citymobil's food business:

Pricing. The pricing team deals with pricing at each point and at each point in time. The price is determined by the prediction of the balance of supply and demand based on statistics and other data. All this makes a large, complex and constantly evolving service based on machine learning.
Pricing. The implementation of various payment methods, the logic of surcharges after the end of the trip, the retention of funds on bank cards, billing, interaction with partners and drivers.
Distribution of orders. On which machine to distribute the order of the client? For example, the distribution option to the nearest one is not the best in terms of increasing the number of trips. A more correct option is to match customers and cars in such a way as to maximize the number of trips, given the likelihood of cancellation by this particular client under these conditions (because it takes a long time to wait) and cancellation or sabotage of the order by this driver (because it takes too long to drive or too low check).
Geo. Everything related to the search and sadzhest addresses, landing points, adjusting the time of filing (our partners supplying maps and traffic jams do not always provide accurate information on ETA, taking into account traffic jams), improving the accuracy of forward and backward geocoding, improving the accuracy of the machine feed. There is a lot of work with data, a lot of analytics, a lot of machine-based services.
Antifraud. The difference in the price of a trip for a passenger and for a driver (for example, on short trips) creates an economic incentive for froders who are trying to steal our money. The fight against fraud is somewhat similar to the fight against spam in the mail service - completeness and accuracy are important. It is necessary to block the maximum number of froders (completeness), but good users should not be taken for froders (accuracy).
Motivation drivers. The driver motivation team develops everything related to enhancing the usability of our platform by drivers and driver loyalty through various types of motivations. For example, make X trips and get extra Y rubles for it. Or buy a shift for Z rubles and ride without a fee.
Backend driver application. A list of orders, a demand card (a hint to where the driver should go to maximize his earnings), a roll-out of status changes, a communication system with drivers and much more.
The backend of the client application (this is probably the most obvious part, and that is usually understood as a taxi backend): placing orders, sending statuses about changing the order status, ensuring the movement of machines on the map on the order and on the feed, backend tips and etc.

This is all the tip of the iceberg. Functionality is much more. Behind a simple user-friendly interface, there is a huge underwater part of the iceberg.
')
And now we return to accidents. For six months of conducting the accident history, we have compiled the following categorization:

bad release, 500th errors;
bad release, non-optimal code, load on the base;
unsuccessful manual intervention in the system;
Easter egg;
external causes;
bad release, broken functionality.

Below I will write down what conclusions we did on the most common types of accidents.

1. Bad release, 500th errors

Almost all of our backend is written in PHP, an interpreted language with weak typing. It happens, you roll out the code, but it falls due to an error in the name of the class or function. And this is just one of the examples when the 500th error appears. It may also appear in the case of a logical error in the code; unrelated the wrong branch; accidentally deleted the folder with the code; left in the code temporary artifacts needed for testing; did not change the structure of the tables according to the new code; did not restart or stop the necessary cron-scripts.

We struggled with this problem consistently in several stages. Lost trips due to a bad release are obviously proportional to the time it was in operation. That is, it is necessary to do everything possible so that a bad release is in operation as little time as possible. Any change in the development process that reduces the average time a bad release is in operation by at least 1 second is positive for the business and should be implemented.

A bad release, or even any accident in production, passes through two states, which we call the “passive stage” and “active stage”. The passive stage is when we are not yet aware of the accident. The active stage is when we already know. The accident begins in the passive stage, and over time, when we learn about it, the accident goes into the active stage - we begin to deal with it: first we diagnose, and then we fix it.

To reduce the duration of any accident in production, it is necessary to reduce the average duration of both the passive and active stages. The same applies to a bad release, because it is in itself a kind of accident.

We began to analyze our current crash repair process. Bad releases that we encountered at the time of the analysis started resulted in an idle time (full or partial) for an average of 20-25 minutes. The passive stage usually took 15 minutes, the active - 10 minutes. During the passive stage, complaints from users who were processed by the contact center began, and after some threshold the contact center complained to the general chat rooms in Slack. Sometimes one of the employees complained when he could not order a taxi. An employee complaint was a signal to us about a serious problem. After the transition of a bad release to the active stage, we began to diagnose the problem, analyzed the latest releases, various graphs and logs in order to establish the cause of the accident. After finding out the reason, we rolled back the code if the bad release was pumped last, or did a new rollback with the revert of the bad release commit.

That is the process of dealing with bad releases, we had to improve.

1.1. Reduction of the passive stage

First of all, we noticed that if a bad release is accompanied by 500 errors, we can understand without complaint that a problem has occurred. Fortunately, all the 500th errors were recorded in New Relic (this is one of the monitoring systems we use), and it remained only to fasten SMS and IVR notifications about exceeding a certain frequency of five hundred meters (the threshold was constantly reduced over time).

This led to the fact that the active stage of the accident “Bad release, 500th mistakes” began almost immediately after the release. The process in the event of an accident began to look like this:

The programmer deploys the code.
The release leads to an accident (massive 500-ki).
SMS comes.
Programmers and admins begin to understand (sometimes not immediately, but after 2-3 minutes: SMS can be delayed, the sound on the phone can be turned off, and the immediate action culture after SMS cannot appear in one day).
The active stage of the accident begins, which lasts the same 10 minutes as before.

Thus, the passive stage was reduced from 15 minutes to 3.

1.2. Further reduction of the passive stage

Despite the reduction of the passive stage to 3 minutes, even such a short passive stage strained us more than the active one, because during the active stage we already do something to solve the problem, and during the passive service does not work in whole or in part, but “ men do not know. "

To further reduce the passive stage, we decided to donate three minutes of developer time after each release. The idea was very simple: you roll out the code and look at New Relic, Sentry, Kibana for three minutes, are there any 500th errors. As soon as you see a problem there, then a priori you assume that it is related to your code and you begin to understand.

We chose three minutes on the basis of statistics: sometimes problems appeared on charts with a delay of 1-2 minutes, but there were not more than three minutes.

This rule was entered into the do's & dont's. At first, it was not always executed, but gradually the developers got used to the rule as to basic hygiene: brushing your teeth in the morning is also a waste of time, but it is necessary to do so.

As a result, the passive stage was reduced to 1 minute (the charts still sometimes were late). As a pleasant surprise, it also reduced the active stage. After all, the developer meets the problem in good shape and is ready to immediately roll back your code. Although this does not always help, because the problem could arise due to someone else’s parallel-rolling code. But, nevertheless, the active stage on average was reduced to 5 minutes.

1.3. Further reduction of the active stage

More or less satisfied with one minute of the passive stage, we began to think about further reducing the active stage. First of all, we paid attention to the history of problems (it is a cornerstone in the building of our stability!) And found that in many cases we don’t roll back right away because we don’t understand which version to roll back to, because there are many parallel releases. To solve this problem, we introduced the following rule (and recorded it in do's & dont's): before release you write to the chat in Slack, what are you trying to do, and in case of an accident you write to the chat “accident, don’t roll!”. In addition, we started automatically reporting via SMS about the facts of the release, in order to notify those who do not enter the chat.

This simple rule dramatically reduced the number of releases already in the course of accidents and reduced the active stage - from 5 minutes to 3.

1.4. An even shorter active stage

Despite the fact that we warned about all releases and accidents in the chat, sometimes race conditions arose - one wrote about the release, and the other at that moment was already rolling out; or the accident started, they wrote about it in the chat, and someone just rolled out a new code. These circumstances lengthen the diagnosis. To solve this problem, we have implemented an automatic ban on parallel releases. The idea is very simple: after each release, the CI / CD system prohibits everyone to roll out over the next 5 minutes, except the author of the last release (so that he can roll back or roll a hotfix if necessary) and several very experienced developers (for emergency). In addition, the CI / CD system prohibits rolling out during an accident (that is, from the moment of receiving the notification of the beginning of the accident until the moment of receipt of the notification of its completion).

Thus, the process has become such: the developer rolls out, he keeps track of the graphics for three minutes, and after that two more minutes no one can roll out anything. If there is a problem, the developer rolls back the release. This rule dramatically simplified diagnosis, and the total duration of the active and passive stages was reduced from 3 + 1 = 4 minutes to 1 + 1 = 2 minutes.

But two minutes of an accident is a lot. Therefore, we continued to optimize the process.

1.5. Automatic crash detection and rollback

We have long thought how to reduce the duration of the accident due to bad releases. They even tried to force themselves to look at tail -f error_log | grep 500 tail -f error_log | grep 500 . But in the end, they still settled on a cardinal automatic solution.

In short, this auto-roll. We started a separate web server, to which the load balancer was loaded 10 times less than other web servers. Each release was automatically deployed by the CI / CD system to this separate server (we called it preprod, although, despite the name, there was a real load from real users). And then the automation performed tail -f error_log | grep 500 tail -f error_log | grep 500 . If within one minute there was not a single 500th error, then CI / CD deployed a new code in production. If errors appeared, the system immediately rolled everything away. At the same time, at the balancer level, all requests completed with 500 errors on preprod were duplicated to one of the production servers.

This measure reduced the impact of “five hundred” releases to zero. In this case, in case of bugs in automatics, we did not cancel the rule of three minutes to follow the schedules. That's all about bad releases and 500th mistakes. Moving on to the next type of accidents.

2. Bad release, non-optimal code, load on the base

I will begin immediately with a specific example of an accident of this type. Rolled out the optimization: added USE INDEX to the SQL query, while testing it accelerated short queries, as in production, but long queries slowed down. Slowing down long requests was noticed only in production. As a result, the flow of long queries put the entire master database for one hour. We thoroughly figured out how USE INDEX works, described it in the do's & dont's file and warned developers against misuse. We also analyzed the query and realized that it returns mainly historical data, which means it can be run on a separate replica for historical queries. Even if this replica will fall under load, the business will not stop.

After this incident, we still ran into similar problems, and at some point decided to approach the issue systematically. Prosherstili all code frequent comb and brought to the replica all requests that can be put there without compromising the quality of service. At the same time, we divided the replicas themselves according to the levels of criticality so that the fall of any of them would not stop the service. As a result, we have come to an architecture in which there are the following bases:

master database (for write operations and for queries that are supercritical to data freshness);
production replica (for short queries that are slightly less critical to the data freshness);
a replica for calculating price ratios, so-called surge pricing. This remark can lag by 30-60 seconds - this is not critical, the coefficients change less often, and if this remark falls, the service will not stop, just the prices will not quite match the balance of supply and demand;
a replica for the admin user and contact center (if it falls, the main business will not rise, but the support will not work and we will not be able to temporarily view and change the settings);
many replicas for analytics;
MPP-base for heavy analytics with full cuts based on historical data.

This architecture gave us more room to grow and reduced the number of crashes by an order of magnitude due to non-optimal SQL queries. But she is still far from perfect. There are plans to do sharding so that updates and deletes can be scaled, as well as short queries that are supercritical to the freshness of the data. MySQL's margin of safety is not infinite. Soon we will need heavy artillery in the form of Tarantool. About this will be necessarily in the following articles!

In the course of proceedings with non-optimal code and requests, we understood the following: it is better to eliminate any non-optimality before the release, not after. This reduces the risk of an accident and reduces the time spent by developers on optimization. Because if the code is already rolled out and there are new releases on top of it, then optimizing is much more difficult. As a result, we have introduced a mandatory code check for optimality. It is conducted by the most experienced developers, in fact, our special forces.

In addition, we began to collect at Do's & Dont's the best ways to optimize the code that work in our realities, they are listed below. Please do not take these practices as absolute truth and do not try to blindly repeat them in yourself. Each method makes sense only for a specific situation and a particular business. They are given here just for example, so that the specifics are clear:

If the SQL query does not depend on the current user (for example, a demand card for drivers with the indication of minimum travel rates and polygon coefficients), then this query should be done by cron with a certain frequency (in our case, once per minute is sufficient). Write the result to the cache (Memcached or Redis), which is already used in production-code.
If the SQL query operates on data whose lag is not critical for a business, then its result should be put in the cache with some TTL (for example, 30 seconds). And then in subsequent requests read from the cache.
If in the context of processing a request on the web (in our case, in the context of the work of implementing a specific server method in PHP) you want to make a SQL query, then you need to make sure that this data has not “arrived” with any other SQL query (and whether they will come further along the code). The same applies to cache accesses: it can also be overwhelmed with requests if desired, therefore, if the data has already “arrived” from the cache, then it is not necessary to go to the cache as if to your home and take it out of it, which is already taken away.
If, in the context of processing a request on the web, you want to call any function, then you need to make sure that not a single extra SQL query or cache access is made to its gut. If the call of such a function is inevitable, then you need to make sure that it cannot be modified or its logic is broken so as not to make unnecessary requests to the databases / caches.
If you still need to go to SQL, you need to make sure that the queries that already exist in the code cannot add the required fields above or below the code.

3. Unsuccessful manual intervention in the system

Examples of such accidents: unsuccessful ALTER (which overloaded the database or provoked the replica lag) or failed DROP (ran into a bug in MySQL, blocked the database during the drop of the fresh table); heavy request for a master made by mistake; did work on the server under load, although they thought that it was taken out of work.

To minimize falls for these reasons, it is necessary, unfortunately, to understand the nature of the accident every time. General rules we have not yet felt. Again, let's try on examples. Say, at some point the surge factors stopped working (they multiply the price of a trip at a place and time of increased demand). The reason was that on the replica of the base from which the data for calculating the coefficients were taken, there was a Python script that ate all the memory, and the replica went down. The script was launched a long time ago, it worked on a replica just for convenience. The problem was solved by restarting the script. The conclusions were as follows: do not run third-party scripts on the machine (recorded in do's & dont's, otherwise it is a blank shot!), Monitor the end of memory on a machine with a replica and alert via SMS if the memory ends soon.

It is very important to always draw conclusions and not to slip into a comfortable situation "they saw the problem, fixed it and forgot it." High-quality service can only be built if conclusions are drawn. In addition, SMS-alerts are very important - they set the quality of service at a higher level than it was, do not allow it to fall, and allow you to further increase reliability. As a climber from each stable state pulls itself up and fixed in another stable state, but at a higher height.

Monitoring and alertings with invisible but hard iron hooks crash into the rock of obscurity and never let us fall below the level of stability we set, which we constantly raise only upwards.

4. Easter egg

What we call the “Easter egg” is a time bomb that exists a long time ago, but which we didn’t find. Outside of this article, this term means an undocumented feature, made specifically. In our case, this is not a feature at all, but rather a bug, but which works as a time bomb and which is a side effect of good intentions.

For example: overflow 32 bit auto_increment ; nonoptimality in the code / configuration, which “shot out” due to the load; backward replica (usually either due to a suboptimal request for a replica that was triggered by a new usage pattern, or a higher load, or due to a suboptimal UPDATE on the master that was caused by a new loading pattern and loaded replica).

Another popular type of Easter eggs is a non-optimal code, and more specifically, a non-optimal SQL query. Previously, the table was smaller and the load was less - the query worked well. And with the increase in the table, linear in time and increase in load, linear in time, the consumption of resources of the DBMS grew quadratically. Usually this leads to a sharp negative effect: the type was all “ok”, and then - bang.

More rare scenarios - a combination of bug and easter eggs. The release with a bug led to an increase in the size of the table or an increase in the number of records in a table of a certain type, and the Easter egg that already has led to an excessive load on the database due to slower queries to this expanded table.

Although we also had Easter eggs that were not related to the load. For example, 32-bit auto increment : after two and a little billions of records into the table, inserts cease to be performed. So the auto increment field in the modern world should be made 64-bit. We learned this lesson well.

How to deal with "Easter eggs"? The answer sounds simple: a) look for old “eggs”, and b) not allow new ones to appear. We try to fulfill both points. The search for old “eggs” in our country is associated with constant code optimization. We have identified two of the most experienced developers for almost-fulltime optimization. They find in slow.log requests that consume the most database resources, optimize these requests and the code around them. We reduce the likelihood of the emergence of new eggs through checking for the optimality of the code of each commit by the aforementioned Sensei developers. Their task is to point out errors affecting performance; suggest how to do better, and transfer knowledge to other developers.

At some point after the next Easter egg we found, we realized that the search for slow queries is good, but it would be worthwhile to additionally look for queries that look like slow but work quickly. These are just the next candidates to put everything in the event of the explosive growth of the next table.

5. External causes

These are reasons that we believe are poorly controlled by us. For example:

Trotling by Google Maps. It can be circumvented by monitoring the use of this service, observing a certain level of load on it, planning the growth of the load in advance and purchasing the expansion of the service.
The fall of the network in the data center. You can get around by placing a copy of the service in the backup data center.
Accident payment service. You can bypass the reservation of payment services.
Wrong traffic blocking by the DDoS protection service. You can get around by turning off the default DDoS protection service and turning it on only in case of a DDoS attack.

Since the removal of an external cause is a long and expensive exercise (by definition), we just started collecting statistics on accidents due to external causes and waiting for the accumulation of critical mass. Recipe, how to determine the critical mass, no. It works just intuition. For example, if we were 5 times in full downtime because of problems, for example, the service to combat DDoS, then with each subsequent fall, there will be a sharper and sharper question about alternatives.

On the other hand, if it is possible to somehow make everything work with an inaccessible external service, then we will definitely do it. And this helps us to post-mortem-analysis of each fall. There should always be a conclusion. So, you always want-not-want, but you come up with a workaround.

6. Bad release, broken functionality.

This is the most unpleasant type of accident. The only type of accident that is not visible for any symptoms other than user / business complaints. Therefore, such an accident, especially if it is not large, can exist for a long time in production unnoticed.

All other types of accidents are to some extent similar to the “bad release, 500th mistakes”. Just the trigger will not be a release, but a load, a manual operation or a problem on the external service side.

To describe the method of dealing with this type of accidents, it suffices to recall the bearded anecdote:

Mathematics and physics was offered the same problem: boil the kettle. Hand tools: stove, kettle, water faucet with water, matches. Both alternately pour water into the kettle, turn on the gas, light it and put the kettle on the fire. Then the task was simplified: a teapot filled with water and a stove with burning gas were proposed. The goal is the same - boil water. The physicist puts the kettle on the fire. The mathematician pours water out of the kettle, turns off the gas and says: “The task has been reduced to the previous one”. anekdotov.net

This type of accident should be reduced to “bad release, 500th error” by all means. Ideally, if the bugs in the code were saved to the log as an error. Well, or at least left traces in the database. By these traces you can understand that a bug has occurred, and immediately alert. How to contribute to this? We began to analyze every major bug and propose solutions for what kind of monitoring / SMS alerting can be done so that this bug immediately manifests itself in the same way as the 500th error.

6.1. Example

There were massive complaints: orders that were paid through Apple Pay are not closing. Began to understand the problem repeated. Found the reason: they made a revision in the expire date format for bank cards when interacting with acquiring, as a result of which they began to transfer it specifically for payments through Apple Pay, not in the format in which it was expected by the payment processing service (in fact, we treat one another cripple), so all payments through Apple Pay began to deviate. Quickly fixed, rolled out, the problem disappeared. But "lived" with a problem of 45 minutes.

In the wake of this problem, we monitored the number of unsuccessful payments through Apple Pay, and also made an SMS / IVR alert with some non-zero threshold (because unsuccessful payments are the norm in terms of service, for example, the customer has no money on the card or the card is blocked) . From now on, when the threshold is exceeded, we will instantly know about the problem. If the new release introduces ANY problem into the processing of Apple Pay, which will cause the service to become inoperable, even partial, we will instantly find out about it from monitoring and roll back the release within three minutes (the above describes how the manual rollback process works). It was 45 minutes of partial downtime, it was 3 minutes. Profit

6.2. Other examples

Rolled out the optimization of the list of orders offered to drivers. A bug has crept into the code. As a result, drivers in some cases did not see the list of orders (it was empty). We found out about the bug by accident - one of the employees looked into the driver's application. Quickly rolled back. As a conclusion from the accident, they made a graph of the average number of orders in the list of drivers, according to the data from the database, looked back at the chart for a month, saw a failure there and made an SMS alert on the SQL query that forms this chart while reducing the average number of orders in list below the threshold selected based on the historical minimum for the month.

Changed the logic of distribution to cashback users for trips. Including distributed to the wrong group of users. They fixed the problem, built a schedule of distributed cashbacks, saw a sharp increase there, also saw that there was never such growth, they made an SMS alert.

With the release, the order closing functionality was broken (the order was closed forever, the payment on the cards did not work, the drivers demanded cash payment from the customers). The problem was 1.5 hours (total passive and active stages). About the problem learned from the contact center for complaints. Made a correction, made monitoring and alert for the time of closing orders with thresholds found on the study of historical charts.

As you can see, the approach to this type of accidents is always the same:

Roll out the release.
We learn about the problem.
We repair it.
We determine by what tracks (in the database, logs, Kibane) you can find out the signs of a problem.
Build a graph of these signs.
We wind it into the past and look at the surges / falls.
We select the correct threshold for the alert.
When the problem arises again, we immediately find out about it through an alert.

What is pleasant in this way: a single class and an alert immediately closes a huge class of problems (examples of classes of problems: non-closing orders, extra bonuses, non-payment through Apple Pay, etc.).

Over time, we made building alerts and monitoring for every major bug as part of the development culture. So that this culture is not lost, we have formalized it a little bit. For each accident, they began to demand that they themselves create a report. The report is a completed form with answers to the following questions: root cause, method of elimination, impact on business, conclusions. All items are required. Therefore you do not want it, but you will write the conclusions. This process change, of course, recorded this do's & dont's.

7. Kotan

, , -, . - ( , ) «». «». :-)

«» :

. — , . , ( ), ( ) , . ( , ).

. , . , : — , — . , « 500- 1 %» — . « 500- 1 %, - , - , - » — . , . ( ). , : , «», , , , . — . ( , ). .

. . , ( , , ), , : , , , .

. , , ( , ).

8. ?

— . . : , . , , . , , , .. — , — ! , . , , ? , , .. , , .

. . ( , ), , : , , , . , , . . . -, , . , , , — : .

9.

, .

?	?
.	.
( ) post-mortem.	.
do's & dont's.	, , .
, 5 .	.
, .	.
.	.
	.
.	.
.	.
SMS/IVR- .	.
( ) .	.
.	- .
( — slow.log).	- « ».
.	.
.	.
.	, , .
«» — .	, .
.	.

, ! , , , , !

Source: https://habr.com/ru/post/445704/

All Articles