Hi, Habr!
During Black Friday 2017 the load grew almost one and a half times, and our servers were running at their limit. Over the following year the number of customers grew significantly, and it became clear that without thorough preparation the platform might simply not withstand the load of 2018.
We set the most ambitious goal possible: to be fully prepared for any spike in activity, even the most powerful, and we began bringing new capacity online in advance throughout the year.
Our CTO Andrei Chizh (chizh_andrey) describes how we prepared for Black Friday 2018, what measures we took to avoid outages, and, of course, the results of such thorough preparation.

Today I want to talk about our preparations for Black Friday 2018. Why now, when most of the big sales are over? We began preparing about a year before the large-scale promotions, and through trial and error we found the optimal solution. We recommend taking care of the hot seasons in advance to prevent the failures that tend to surface at the most inopportune moment.
This material will be useful to anyone who wants to squeeze the maximum profit out of such promotions, because here the technical side matters no less than the marketing.
Features of traffic at big sales
Contrary to popular belief, Black Friday is not a single day of the year but almost a whole week: the first discount offers appear 7-8 days before the sale. Website traffic grows steadily throughout the week, peaks on Friday, and drops quite sharply on Saturday to the store's usual numbers.

This is important to consider: online stores are especially sensitive to any “slowdowns” in the system. In addition, our email marketing product also saw a significant increase in the number of messages sent.
It is strategically important for us to get through Black Friday without an outage, because the most important functionality of the stores' sites and mailings depends on the platform, namely:
- Tracking and issuing product recommendations,
- Serving related assets (for example, design images for recommendation blocks, such as arrows, logos, icons, and other visual elements),
- Delivering product images of the right size (for this purpose we have “ImageResizer”, a subsystem that downloads the image from the store's server, compresses it to the required size and, through the caching servers, serves an image of the right size for each product in each recommendation block).
In fact, during Black Friday 2018 the load on the service grew by 40%: the number of events that the Retail Rocket system tracks and processes on online store sites rose from 5 to 8 thousand requests per second. Because we had prepared for heavier loads, we handled this surge easily.

General preparation
Black Friday is a hot season for all of retail and for ecommerce in particular. The number of users and their activity grow severalfold at this time, so, as always, we prepared thoroughly for this tense period. Add to this the fact that many online stores are connected to us, not only in Russia but also in Europe, where the hype is much greater, and you get passions running higher than in a Brazilian soap opera. What needs to be done to be fully prepared for increased loads?
Work with servers
To begin with, we had to find out exactly how much server capacity we were lacking. As early as August we started ordering new servers specifically for Black Friday and added 10 machines. By November they were fully in production.
At the same time, some of the build machines were reinstalled for use as Application servers. We prepared them right away for different functions, both issuing recommendations and ImageResizer, so that each could take on either role depending on the type of load. In normal mode, the Application and ImageResizer servers have clearly separated functions: the former issue recommendations, the latter deliver images for emails and for recommendation blocks on online store sites. In preparation for Black Friday, we decided to make all of these servers dual-purpose so that traffic could be balanced between them depending on the type of load.
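As an illustration of the dual-purpose idea, here is a minimal sketch of splitting one pool of servers between the two roles in proportion to the current load of each type. The function name `split_pool`, the role names, and the proportional rule are assumptions made for this example, not a description of how the balancing was actually implemented.

```python
def split_pool(servers: list[str], rec_load: float, image_load: float) -> dict[str, list[str]]:
    """Split a pool of dual-purpose servers between two roles.

    The split is proportional to the observed load of each type, and
    each role always keeps at least one server (hypothetical rule).
    """
    if len(servers) < 2:
        raise ValueError("need at least two servers to cover both roles")
    total = rec_load + image_load
    if total <= 0:
        raise ValueError("total load must be positive")

    # Proportional share for recommendations, clamped so that
    # neither role is ever left without a server.
    rec_count = round(len(servers) * rec_load / total)
    rec_count = min(max(rec_count, 1), len(servers) - 1)

    return {
        "recommendations": servers[:rec_count],
        "image_resizer": servers[rec_count:],
    }


if __name__ == "__main__":
    pool = [f"app-{i}" for i in range(10)]
    # Hypothetical load figures, in requests per second.
    roles = split_pool(pool, rec_load=6000, image_load=4000)
    print(len(roles["recommendations"]), len(roles["image_resizer"]))
```

Because every machine can play either role, rebalancing comes down to recomputing the split when the ratio of the two loads changes.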
Then we added two large servers for Kafka (Apache Kafka) and got a cluster of 5 powerful machines. Unfortunately, not everything went as smoothly as we would have liked: during data synchronization the two new machines took up the entire bandwidth of the network channel, and we had to figure out urgently how to add them quickly and safely for the whole infrastructure. To solve this, our administrators had to valiantly sacrifice their weekend.
Work with data
In addition to the servers, we decided to optimize how files are served to ease the load, and a big step for us was migrating static files. All static files that were previously hosted on our servers were moved to S3 + CloudFront. We had long wanted to do this because the server load was close to its limits, and now we had an excellent reason.
A week before Black Friday, we increased the caching time for images to 3 days so that, if ImageResizer went down, previously cached images would still be served from the CDN. This also reduced the load on our servers, because the longer an image is cached, the less often we need to spend resources on resizing it.
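For reference, a three-day caching time is usually expressed in seconds in the `Cache-Control` response header. The sketch below assumes the CDN takes its caching time from the origin's `Cache-Control: max-age` value. The helper name `cache_control_header` is hypothetical, and the source does not say how the setting was actually applied.

```python
from datetime import timedelta


def cache_control_header(ttl: timedelta) -> str:
    """Build a Cache-Control value that allows shared caches to keep a response for `ttl`."""
    seconds = int(ttl.total_seconds())
    if seconds < 0:
        raise ValueError("ttl must not be negative")
    return f"public, max-age={seconds}"


# Three days, as in the text above: 3 * 24 * 60 * 60 seconds.
IMAGE_TTL = timedelta(days=3)

if __name__ == "__main__":
    print(cache_control_header(IMAGE_TTL))  # public, max-age=259200
```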
Last but not least, 5 days before Black Friday we announced a moratorium on deploying any new functionality and on any infrastructure work, so that all attention was focused on coping with the increased load.
Response plans for critical situations
No matter how good the preparation is, failures are always possible. So we developed 3 plans for responding to possible critical situations:
- load reduction,
- disabling some services,
- full service shutdown.
Plan A: load reduction. It was to be enabled if, due to a surge in load, our servers exceeded the allowed response times. For this case we prepared mechanisms for gradually reducing the load by switching part of the traffic to Amazon servers, which would simply answer every request with “200 OK” and an empty response. We understood that this degrades the quality of service, but the choice between the service not working at all and not showing recommendations for approximately 10% of the traffic is obvious.
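One possible way to send a fixed share of traffic to such a stub is to hash a stable request key into 100 buckets and divert the lowest ones. The sketch below is only an illustration: the names `shed_to_stub` and `stub_response`, the use of a visitor identifier as the key, and routing at the application level are all assumptions, and the source does not describe the actual switching mechanism.

```python
import hashlib


def shed_to_stub(request_key: str, shed_percent: int) -> bool:
    """Return True when this request should be answered by the empty stub.

    The key is hashed into one of 100 buckets, so the same key always
    gets the same decision, and raising `shed_percent` only adds keys
    to the diverted set (gradual load reduction).
    """
    if not 0 <= shed_percent <= 100:
        raise ValueError("shed_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < shed_percent


def stub_response() -> tuple[int, dict[str, str], bytes]:
    """The degraded answer: status 200 with an empty body."""
    return 200, {"Content-Length": "0"}, b""


if __name__ == "__main__":
    # Hypothetical visitor identifiers, diverting roughly 10% of them.
    keys = [f"visitor-{i}" for i in range(10_000)]
    diverted = sum(shed_to_stub(key, 10) for key in keys)
    print(f"diverted {diverted} of {len(keys)} requests")
```

Hashing instead of random sampling keeps the decision stable per visitor, so a given visitor either sees recommendations consistently or does not see them at all while the plan is active.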
Plan B: disabling some services. This implied partial degradation of the service, for example slowing down the calculation of personal recommendations in order to take load off some databases and communication channels. In normal mode, recommendations are calculated in real time, creating for each visitor their own version of the online store, but under increased load the slowdown lets the other core services keep working.
Plan C: in case of Armageddon. For a complete system failure, we prepared a plan that would let us disconnect from customers safely. Store visitors would simply stop seeing recommendations, and the performance of the online store would not suffer in any way. To do this, we would zero out our integration file so that new users stop interacting with the service. That is, we would disable our main tracking code, the service would stop collecting data and calculating recommendations, and the user would simply see the page without recommendation blocks. For everyone who had already received the integration file, we provided the option of switching the DNS record to Amazon and the “200 OK” stub.
Results
We coped with the entire load, without even needing to use the additional build machines. And thanks to the advance preparation, we did not need any of the response plans we had developed. Still, all the work done is invaluable experience that will help us cope with the most unexpected and enormous influxes of traffic.
Compared with 2017, the load on the service increased by 40%, and the number of users in online stores on Black Friday increased by 60%. All the difficulties and mistakes occurred during the preparatory period, which saved us and our clients from unforeseen situations.
How do you get through Black Friday? How do you prepare for critical loads?