
Hi, Habr!
A week ago, I published an article in which I began a conversation about how to prepare an e-commerce project for explosive traffic growth and the other charms of large-scale campaigns.
We have dealt with the key technical details; now let's turn to administrative issues and streamlining support processes during peak loads:
- what makes a site unstable, and why the cloud is not a panacea;
- which business metrics to monitor in order to detect a problem before it causes significant losses;
- how to route an incident from detection to resolution without chaos, and how to localize a failure.
And much more besides: I invite everyone below the cut!
In my experience, the biggest headache in preparing for large-scale promotions is heavy administrative pressure. The business, normally so calm, suddenly wants everyone on high alert, blowing specks of dust off the site and so on: "God forbid anything happens, or there will be fines." Let's try to satisfy this generally sound desire. We will use Black Friday as our example, since it is the most vivid case of a sharp spike in site load.
And we begin with the fundamental question: what exactly is the cause of the unstable operation of our site?
What makes the site unstable

It is time to do what you have long and regularly postponed: to understand which factors make the site less stable, pull up and analyze your incident history. And don't tell me you don't have one.
Your top causes will look more or less like this:
- Accidents caused by releases.
- Admin fatigue: they fixed one thing and broke another. Unfortunately, such slip-ups are often concealed and never make it into the history.
- The business screwed up: a promotion was launched incorrectly, something got deleted, etc.
- Partner services broke down.
- "Sad" software. Most often this happens because of paragraphs. 1 and 2.
- Hardware failures.
- Other problems.
Of course, every situation is different, and your "rating" may turn out slightly differently. But in any case, the leaders will be problems related to changes on the site and the human factor, as well as the fruit of their joint love: releases and attempts to optimize something.
Eradicating these problems, so that necessary changes succeed on the first attempt without breaking what already works, is a task over which many copies have been broken. And we have very little time, only about four months. Fortunately, it can be handled locally. To do so, follow a couple of simple rules:
1. If it works, don't touch it.
Complete all planned work as early as possible: a couple of weeks, a month in advance. How early to stop tinkering with improvements, your incident history will tell; it shows how long the main tail of post-change problems lasts. After that, do not touch the site or the product infrastructure until the peak load has passed.
2. If you still have to touch production for an urgent fix, test.
Test regularly and tirelessly, even the smallest and most minor changes: first in a test environment, including under load, and only then push to production. And then test again and recheck the site's key parameters. It is better to work at night, when the load is minimal, so that you have time to rescue the situation if something goes wrong. Good testing is a whole science, but even merely reasonable testing is still better than none. The main thing is not to rely on luck.
Freezing changes during periods of high load is the only truly reliable measure.
What to do about partner services we already discussed in the previous article. In short: at the first sign of problems, disconnect them ruthlessly. Most often, problems hit many users of a service at once, so contacting technical support is an ineffective measure: your letters will not help them fix things faster, and at such hours the service's IT department has its hands full without them.
However, if you do not report the problem and do not obtain an incident number with the time it was opened, you most likely will not be able to claim compensation for the SLA violation.
A little about reliability

During preparation, you need to replace all failing hardware and cluster your services. Read more about this in one of my previous articles.
I would like to draw your attention to a popular fallacy: many believe that moving a site from their own servers to the cloud instantly grants +100 to reliability. Unfortunately, it is more like +20.
To increase the resiliency of a virtual server, a commercial cloud simply automates and speeds up the "replacement" of failed hardware, down to a matter of seconds, by automatically bringing the virtual machine up on one of the surviving servers. The key words are "speeds up" and "failed hardware": the virtual machine still gets restarted. VMware Fault Tolerance and its analogs, which let you avoid the reboot, are as a rule not used in commercial virtualization because of their resource appetite and the reduced performance of the protected virtual machines. Hence the conclusion: a commercial cloud is not a panacea for fault tolerance; its main advantages are flexibility and scalability.
Check your history for how much downtime you had due to replacing or repairing physical equipment. After moving to the cloud that number will fall, and yes, life will get a little easier: no more running to the warehouse or the store for a new server. But virtualization quirks will now be added on top of hardware failures.
It can happen that a virtual machine becomes unavailable while its physical host keeps responding; the cloud will not see this problem. Or the exact opposite: the host stops responding while the virtual machines are fine, in which case virtualization will bring them up elsewhere. That startup takes time, and you again get downtime out of the blue, which under load can be fatal. So even in the cloud you need to remember about redundancy. By the way, telling your virtualization provider which machines back each other up is a great idea; otherwise all your machines may end up on the same physical server and die simultaneously.
- When running load tests, it also makes sense to schedule failover testing under load: you "drop" a cluster node right in the middle of a load test and watch what happens. With properly configured clusters and correctly allocated resources, this should neither skew the test results nor produce a heap of errors.
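Below is a minimal sketch of such a drill in Python, assuming passwordless SSH access to the node and a systemd-managed service; the hostname, service name, and health-check URL are hypothetical placeholders, not anything from a real setup.

```python
"""Failover drill: kill a cluster node mid-load-test and watch recovery.

A minimal sketch; run it while the load test is already hammering the site.
"""
import subprocess
import time
import urllib.request

NODE = "db-replica-2.internal"                    # hypothetical cluster node
SERVICE = "postgresql"                            # hypothetical service to "drop"
HEALTH_URL = "https://shop.example.com/health"    # hypothetical endpoint

def node_service(action):
    # Stop or start the service over SSH to simulate a node failure.
    subprocess.run(["ssh", NODE, "sudo", "systemctl", action, SERVICE], check=True)

def probe():
    # One health check; True if the site still answers with HTTP 200.
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

if __name__ == "__main__":
    node_service("stop")
    failed_at = time.time()
    while not probe():                 # measure how long the site degrades
        time.sleep(1)
    print(f"recovered in {time.time() - failed_at:.0f}s after node failure")
    node_service("start")              # bring the node back for the next drill
```

With a properly configured cluster, the "recovered in" figure should be close to zero and the load-test error graphs should barely twitch.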
That seems to cover all the typical minor pitfalls. Before reading further, I recommend refreshing the technical details described in the previous article: if the site is technically unable to withstand the load, no reaction speed will save you.
Now let's think about how to prepare for the unusual and the sudden. By definition we cannot prevent such things, so it remains to roll up our sleeves and learn to repair them as quickly as possible.
Stages of incident elimination

Let's consider what makes up the time to resolve an accident:
- Failure detection speed: monitoring delays, waiting for user emails, etc.
- Reaction time to the detected incident: someone has to notice the report and take it on.
- Time to confirm the incident is real: was there a problem at all?
- Time to analyze the incident and find a way to eliminate it.
- Time to eliminate the incident and the underlying problem. Not everything gets fixed on the first try, so this stage may take several iterations.
Typically, detection and elimination of failures falls to the support team. If the team is large, each of these steps may be handled by different people. And time, as you know, is money; in our case, literally. Black Friday has a fixed duration, and competitors never sleep: customers can take all their spending to them. So it is critically important that every employee knows his area of responsibility and that incidents move down the "pipeline."
Let's take each stage separately, identify its problem points, and look at ways to optimize them quickly.
All the tips and recommendations that follow are not a recipe for the "good life" but specific things you will manage to implement in the 3-4 months left before Black Friday.
Detecting the failure
In the least successful scenario, it is the client who informs you about failures; that is, the problem is so serious that he spent his own time reporting it. Only a very loyal customer will write or call in such a case; an ordinary user will shrug and leave.
Moreover, the client often has no direct access to the IT department, so he either writes to info@business.ru or calls the call-center girls. By the time the information crawls its way to IT, a great deal of time will have passed.
Suppose we have many loyal customers, and each considers it his duty to tell support about problems. Hours will pass while the incident is classified as widespread, escalated, and decided on. Meanwhile, isolated reports can simply get lost, and the info@business.ru mailbox sometimes goes unraked for weeks.
Therefore, it is very useful to set up your own tracking of key business metrics: at a minimum, the number of users on the site, the number of completed purchases, and the ratio between them. These data let you react quickly when something goes wrong and significantly shorten the time to identify (and solve) a specific problem with the site.
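As an illustration, here is a minimal watchdog sketch in Python. The baseline conversion value and the get_active_users / get_recent_orders hooks are assumptions: substitute queries against your own analytics and order database.

```python
"""Business-metric watchdog: users on site, purchases, and their ratio.

A minimal sketch; the hooks into analytics and the order DB are hypothetical.
"""

BASELINE_CONVERSION = 0.012   # assumed historical orders-per-visitor ratio
ALERT_FACTOR = 0.5            # alert if conversion falls below half the baseline

def check_business_metrics(get_active_users, get_recent_orders, alert):
    users = get_active_users()        # visitors over the last 5 minutes
    orders = get_recent_orders()      # paid orders over the same window
    if users == 0:
        alert("CRITICAL: no users on the site -- where did the traffic go?")
        return
    conversion = orders / users
    if conversion < BASELINE_CONVERSION * ALERT_FACTOR:
        alert(f"WARNING: conversion {conversion:.4f} far below baseline "
              f"({users} users, {orders} orders) -- checkout may be broken")
```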
No users? We need to see where they could have gone. Users on the site but no sales? That is a signal of a problem, and a rather late one. Automated scenario testing will help you discover that something, somewhere, has happened. Autotests are usually run against builds and releases, but they are just as great for monitoring: through the user's eyes they reveal a broken or slowed-down business process.
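For instance, a scenario check of a few key pages can be as simple as the following sketch (standard library only; the URLs and the expected page markers are hypothetical placeholders):

```python
"""Synthetic check of one key business flow, run every few minutes."""
import time
import urllib.request

STEPS = [
    ("home page", "https://shop.example.com/",         "Add to cart"),
    ("category",  "https://shop.example.com/catalog",  "product-card"),
    ("cart",      "https://shop.example.com/cart",     "Checkout"),
]

def run_scenario(alert):
    for name, url, marker in STEPS:
        started = time.time()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            alert(f"CRITICAL: step '{name}' failed outright: {exc}")
            return
        elapsed = time.time() - started
        if marker not in body:          # the page answered, but wrongly
            alert(f"CRITICAL: step '{name}' lost its expected content")
        elif elapsed > 3.0:             # slow pages eat conversions too
            alert(f"WARNING: step '{name}' took {elapsed:.1f}s")
```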
Of course, if you have no scenario testing at all, you will not cover the whole of production with tests in the few months left before Black Friday, and the tests themselves can generate serious load. But tests for a dozen core processes are quite achievable in time.
It is also very useful to track the servers' average response time: if it grows, expect sales problems. Such data should be tracked automatically by the monitoring system.
As you can see, proper monitoring can cut the time to detect a problem from hours or days down to a few minutes, and sometimes it lets you see a problem before it reaches full height.
Incident Response Time

We did a great job and, thanks to monitoring, spotted the failure immediately. Now we need to open an incident, assign a priority, route it, and appoint someone responsible for further handling.
Two things are important here:
- Get notified of the problem as soon as possible;
- Be prepared to promptly process the notification.
Many IT specialists are not in the habit of reacting quickly to email, even with a mail client on their smartphone, so important notifications should not go by email.
Use SMS for accident alerts. Better still, implement a calling bot for the most critical cases; I have not personally seen practical implementations of such bots, but if resources allow, why not? As a last resort, use WhatsApp / Viber / Jabber. Alas, for well-understood reasons, Telegram cannot serve as a reliable emergency notification channel within Russia.
Automatic escalation of an incident in the absence of confirmation can also be useful: monitoring notifies the next person in the queue if the primary recipient does not respond. This insures you in case something (or someone) fails.
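A minimal sketch of such an escalation chain might look like this; the notify and acknowledged hooks are hypothetical stand-ins for your SMS gateway or calling bot and your incident tracker:

```python
"""Escalation chain: page the next person if the previous one doesn't ack."""
import time

ON_CALL_CHAIN = ["primary-oncall", "secondary-oncall", "team-lead"]
ACK_TIMEOUT = 5 * 60   # seconds to wait for a human before escalating

def escalate(incident_id, notify, acknowledged):
    for person in ON_CALL_CHAIN:
        notify(person, f"Incident {incident_id}: please acknowledge")
        deadline = time.time() + ACK_TIMEOUT
        while time.time() < deadline:
            if acknowledged(incident_id):
                return person           # someone took the incident
            time.sleep(15)
    # Nobody answered: last resort, wake the whole team.
    notify("everyone", f"Incident {incident_id} UNACKNOWLEDGED -- all hands")
    return None
```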
Now let's talk about ensuring a prompt reaction to failure messages. First of all, someone specific must be responsible for handling alerts. Alerts sent to the whole team are useful, but only for keeping people informed: collective responsibility is unreliable when speed is required.
If you do not set up on-call duty on a clear schedule for the duration of the promotion, you may discover during a force majeure that one person is asleep, another has no access from home, and a third is on the road, and there is simply no one to tackle the problem within the next hour. Of course, you could introduce round-the-clock operational duty, but there is a nuance: you will not force good specialists to work in shifts permanently, which means that when they are needed you will still have to find them and wake them. And those who do work in shifts fall out of the team's day-to-day context, which has a most fatal effect on their effectiveness on planned tasks.
What saves us is that on most projects, prompt reaction, triage, and urgent repair are needed for only about 18 hours a day: the period from 6-8 am to 1-2 am the following night usually accounts for up to 90% of traffic and sales.
To avoid gaps, it is enough to shift the duty schedule to something like:
- 06:00-15:00 and 17:00-02:00: on duty "from home";
- 15:00-17:00: covered by those in the office;
- 02:00-06:00: little traffic; still, do not assign a very heavy sleeper as the responsible person.
Do not forget the weekends; that question is resolved the same way.
If your users' daily activity is distributed differently, build a similar schedule so that the site is never left unattended during prime time.
Being on duty means being responsible for handling monitoring events and calls from the previous lines (customer support), and for watching the system as a whole. While everything is quiet, the duty officer gets on with his regular work.
Be sure to start the rotation a few days before the load hits. First, it is one more check that everyone has all the access they need. Second, a change in working mode is stressful, and many will need time to tune in; better that the adjustment period not coincide with the main heat.
Great: alerts arrive, and exactly the right people respond to them. But the duty officer's reaction time suffers badly from superfluous and unprocessed alerts, and from notifications that imply no action at all.
It is very important not to leave alerts unprocessed. If similar events occur regularly, investigate the cause and fix it; there should be no permanently active alarms in the monitoring system at all.
In my experience, if something cannot be fixed quickly, or does not need fixing but keeps blinking anyway, it is better to suppress the alert and create a task to investigate. A constantly blinking alarm sooner or later becomes familiar and stops attracting attention, and the trouble is that during a real problem people can mistake one light for another and ignore a genuinely important event.
It is also extremely important to configure and prioritize events in the monitoring system correctly. The system should notify you about exactly what needs fixing: specific failures or the risk of them. You are not going to "fix" 100% CPU usage; you are going to fix high latencies on the web server, because CPU usage is debug information, not a problem in itself. If the processor is 100% loaded on Black Friday at the target load, with acceptable response times and the planned reserves accounted for, it means you calculated everything correctly.
System resource utilization must be monitored, of course, but that is a slightly different task, important for capacity planning and for identifying an incident's area of impact. Now that the events are set up, it is important to prioritize correctly what we fix first. To do that, let's look at the difference between the Critical and Warning alert levels. I will give slightly exaggerated but clear examples.
Critical is when you are riding the subway to your grandmother's, get an alert, and get off at the nearest station, pull out your laptop, sit down on a bench, and get to work: sales have stopped, or heavy losses are mounting. In other words, Critical is anything with a direct and significant impact on users.
Warning is when you do not leave work until you fix it. There is no need to drop everything and rush to the rescue over a Warning: you can finish your smoke break or your current task and then decide. Examples: a clear risk of critical problems, such as a failed server in an HA pair, or errors pouring into the logs. If you do not blow such events off but conscientiously repair them (and get to the bottom of their causes and work to prevent them), there will be very few of them.
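To make the symptom-first idea concrete, here is a small illustrative sketch; the metric names and thresholds are my own assumptions, not recommendations:

```python
"""Symptom-first alerting: page on user impact, not on resource graphs."""

def classify(metrics):
    # Critical: direct, significant user impact -- sales stopped, site erroring.
    if metrics["orders_per_min"] == 0 or metrics["http_5xx_ratio"] > 0.05:
        return "CRITICAL"
    # Warning: clear risk of a critical problem -- degraded HA, rising errors.
    if metrics["ha_nodes_down"] > 0 or metrics["http_5xx_ratio"] > 0.01:
        return "WARNING"
    # 100% CPU on its own is debug information, not an alert.
    return None
```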
Another thing that is often forgotten: do not staff the duty rotation with admins alone. Be sure to involve developers, forming a working pair for each shift; this will pay off in the stages that follow. If the project is functionally complex, it also makes sense to put consultants, analysts, testers, and anyone else who might be useful on duty, at least reachable by phone. Such a specialist can confirm the problem (or rule it out) and help with functional localization; when you do have to wake someone for a repair, that saves time. I will discuss this in more detail in the next section.
And the last important point: every duty officer must know by heart the contacts and areas of responsibility of all his colleagues involved in an emergency. If he cannot solve the problem himself and starts a panicked search for available rescuers, chaos ensues and a lot of time is lost.
Following these simple rules will help you avoid problems caused by missed alerts and ensures that when the crunch comes (read: Black Friday, or any other emergency), people will be able to solve problems quickly.
We confirm the presence of the incident
The next step after receiving a notification is to understand what exactly went wrong, and whether there is a problem at all: it is not always easy to determine who is right, the user or the system. The same alert can be interpreted quite differently depending on the point of view.
For example, a typical admin who receives a report of search-engine bugs ("products have disappeared") will go check the search server and read its logs. He will spend a lot of time, satisfy himself that search is working, and then dig even deeper trying to find what is broken. In the end it turns out the "missing" products had been hidden deliberately, there was no problem, and the user was simply not in the loop.
Or the admin will fall into a stupor and then close the ticket for lack of evidence: look, all the other products display just fine! But in fact someone accidentally deleted the landing page's products from the database, and the entire advertising campaign turned into a demotivator.
In the first case, the admin wasted time localizing a nonexistent problem because of incomplete information. In the second, the point of view is to blame: the admin will hunt for a technical problem, whereas an analyst would quickly spot the logical one and restore the goods.
There is only one solution: when you receive an automatic notification, you should know exactly what it means and how to check it, preferably in the form of written instructions. As for messages from users, they should first be handled not so much by a technical specialist as by a functional one with a technical background. He is the one who takes on yet another thankless chore: the muddled messages you know so well, a la "everything froze on me", "your website is not working", and "I click, but it refuses".
Before digging deeper, you need to understand what exactly happened to the person and make sure the problem is real. That is why the technical support desk where users report problems must be staffed with polite and experienced specialists. Their task is to extract as much information as possible and understand what, in the visitor's opinion, is not working as it should. From that information you can determine whether it is a technical problem with the site or, say, an insufficiently intuitive interface.
Localizing the failure
Great: we received the alert and confirmed the problem. Next we need to grasp its technical essence and outline its zone of impact: see what exactly is not working, why, and how to fix it. At this stage our main enemy is the same as before: lack of information.
Good monitoring and logging help fill that gap. First, the key system parameters we discussed above (sales, visitors, page generation speed, technical errors in server responses) should be displayed as graphs on a large screen (the bigger, the better) in the support room.
All the important data should always be in front of your support staff. During an emergency, or any promotion, that lets them react quickly to shifts in the indicators and head off problems. To localize a failing component, you will also need a site map with data on the components' interactions and relationships. To find trouble spots quickly, track the metrics of each interaction flow over time.
For example, suppose an application accesses a database. Then for each database server, from both the server side and the client side, we should see the following (a sketch of collecting these metrics follows the list):
- the number of requests per second;
- the number of responses;
- the response time;
- the volume of transmitted responses;
- technical errors of this interaction (authorization, connections, etc.).
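As promised, here is a minimal sketch of collecting these metrics on the client side; in a real setup you would export the counters to your monitoring system (Zabbix, Prometheus, or the like) rather than keep them in memory:

```python
"""Per-interaction metrics for one app-to-database flow."""
import time
from collections import defaultdict

class FlowStats:
    def __init__(self):
        self.requests = 0                 # requests per reporting interval
        self.responses = 0                # responses that actually came back
        self.errors = defaultdict(int)    # auth, connection, timeout, ...
        self.total_latency = 0.0          # to derive mean response time
        self.bytes_received = 0           # volume of transmitted responses

    def timed_call(self, func, *args, **kwargs):
        self.requests += 1
        started = time.time()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            self.errors[type(exc).__name__] += 1
            raise
        self.responses += 1
        self.total_latency += time.time() - started
        self.bytes_received += len(repr(result))   # crude size estimate
        return result

# Usage: wrap each DB call, e.g. stats.timed_call(cursor.execute, "SELECT ..."),
# then ship the counters to monitoring once a minute and reset them.
```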
Once the problem component is localized, you can go to its logs and see what is wrong with the poor thing. A centralized log collector, built for example on ELK, will greatly speed up the process.
Also, as I wrote in the previous article, major time savings come from convenient search across cluster logs and the ability to trace a request's processing through the entire chain.
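For illustration, tracing one request through an Elasticsearch-backed log store might look roughly like this; the host, index pattern, and field names (request_id, service) are assumptions about your logging setup:

```python
"""Trace one request through the whole chain via a centralized log store.

A minimal sketch against the Elasticsearch search API, assuming every
service logs a shared request_id field.
"""
import json
import urllib.request

ES_URL = "http://elk.internal:9200/logstash-*/_search"   # hypothetical host

def trace_request(request_id):
    query = {
        "size": 100,
        "sort": [{"@timestamp": {"order": "asc"}}],
        "query": {"match": {"request_id": request_id}},
    }
    req = urllib.request.Request(
        ES_URL,
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        hits = json.load(resp)["hits"]["hits"]
    # Print the request's journey: which service said what, and when.
    for hit in hits:
        src = hit["_source"]
        print(src.get("@timestamp"), src.get("service"), src.get("message"))
```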
We eliminate the failure

At this stage we finally repair what has broken, and work out how to speed the process up.
Obviously, our best helper is a troubleshooting instruction. Unfortunately, we will only have one if we have met this situation before and did not forget to write down the working solution. If there is no instruction, we proceed by trial and error.
When you have to repair something new, weigh the safety of the work against the need for swift intervention. Verifying fixes in a test environment reduces the risks on the one hand, and delays the solution on the other.
I try to follow this rule: if I am absolutely sure things cannot get any worse, or the problem cannot be reproduced in the test environment, it is acceptable to try repairing it in place. But this approach is justified only when three factors coincide:
- everything is already down;
- the fix will not touch valuable data;
- there are backups.
In all other cases, it is worth reproducing the problem in the test environment and double-checking everything before pushing to production. Quality work at the previous stages (understanding the problem and localizing it) helps avoid repeat-fix iterations: as a rule, a fix fails the first time when we are repairing something that is not actually broken, or something has not been taken into account.
And here load testing comes to our aid once again. We emulate the product's operation and deliberately start breaking it, to understand how it works and which problems affect it. It is also a great way to learn how to repair the application and, along the way, to write repair instructions.
After that, you can hold tactical exercises in localizing and eliminating problems on the test site: one of the lead specialists slyly breaks something, perhaps in more than one place, and sends a colleague to figure it out and fix it unaided, against the clock. A very useful practice: people learn to work under stress, study the system, hone their skills, and a sea of new instructions gets written.
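A possible helper for such exercises, sketched under the assumption of a dedicated staging environment; the hosts and fault commands here are invented, and this must never be pointed at production:

```python
"""Chaos drill helper: the "breaker" secretly injects a couple of faults
into staging, then the trainee must localize and repair them."""
import random
import subprocess

FAULTS = [
    ("stop search service", ["ssh", "search-1.stage", "sudo", "systemctl", "stop", "elasticsearch"]),
    ("fill disk on web node", ["ssh", "web-1.stage", "fallocate", "-l", "5G", "/tmp/chaos.bin"]),
    ("drop DB replica", ["ssh", "db-2.stage", "sudo", "systemctl", "stop", "postgresql"]),
]

def inject_faults(count=2, seed=None):
    rng = random.Random(seed)
    chosen = rng.sample(FAULTS, count)
    for name, cmd in chosen:
        subprocess.run(cmd, check=True)
    # Only the breaker keeps this list; the trainee must find the faults.
    return [name for name, _ in chosen]
```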
To conclude our little methodological tutorial, I want to stress the importance of up-to-date instructions, formal schedules, and the other paperwork so many people dislike. Yes, it eats the lion's share of time and energy. But the time spent will repay you a hundredfold when the thunder claps and you "fix it all" without unnecessary nerves.
Operations means an SLA, and an SLA means keeping to timings, both overall and at each individual stage. To control SLA compliance and those very timings, you need to know the time limits for each stage; otherwise, until you blow past the overall deadline, you will not notice that you are already running late somewhere. And without written procedures and specific actions for each stage, you can neither estimate nor guarantee the stages' durations.
Creative improvisation is very interesting but completely unpredictable. Indulge in it for the soul, and test and roll out the most successful findings, just not while preparing for Black Friday or another promotion. The business will thank you.
That is all I wanted to cover on this topic for now. I will be glad if my advice, adapted to the realities of your business, helps you live through a high load calmly and comfortably.
If you want advice on how to act in your specific situation, I invite you to my seminar "Black Friday: Secrets of Survival." In question-and-answer format we will talk about preparing a site for traffic growth and discuss both the technical and the organizational subtleties of the process.
The seminar will be held on August 16 in Moscow. Since the event will be quite intimate (25 people maximum), preliminary registration is required. Everyone else I await in the comments. :)