We have already mentioned that, as load grew, we gradually moved the back end of our critical production services from Python to Go. Today I, Denis Girko, team lead of the Madmin development team, want to share the details of how and why this happened, using one of the services most important to our business as an example: price calculation with coupon discounts.

Anyone who has ever shopped online can probably picture how coupons work. On a dedicated page or right in the cart you enter a coupon code, and the prices are recalculated according to the promised discount. How the discount is calculated depends on the coupon: a percentage, a fixed amount, or some other math (for example, loyalty program points, store promotions, or product categories may also be taken into account). Naturally, the order is placed with the new prices.
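To make the discount kinds concrete, here is a minimal Go sketch; the types and numbers are illustrative and not our actual data model. It covers only the two simplest cases mentioned above: a percentage coupon and a fixed-amount coupon.

```go
package main

import "fmt"

// Coupon is an illustrative model (not our real one) of the two simplest
// discount kinds: a percentage of the cart total and a fixed amount off.
// Prices are in kopecks to avoid floating-point rounding.
type Coupon struct {
	Percent     int   // e.g. 15 for a 15% discount
	FixedAmount int64 // e.g. 50000 for 500 rubles off
}

// Apply returns the cart total after the coupon discount, never below zero.
func (c Coupon) Apply(total int64) int64 {
	discounted := total
	if c.Percent > 0 {
		discounted -= total * int64(c.Percent) / 100
	}
	if c.FixedAmount > 0 {
		discounted -= c.FixedAmount
	}
	if discounted < 0 {
		discounted = 0
	}
	return discounted
}

func main() {
	fmt.Println(Coupon{Percent: 15}.Apply(100000))        // 85000
	fmt.Println(Coupon{FixedAmount: 50000}.Apply(100000)) // 50000
}
```

In reality the math is complicated by loyalty points, promotions and product-type restrictions, which is exactly why the calculation lives in a dedicated back-end service.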
The business is delighted with all these pricing mechanisms, but we want to look at the service from a slightly different angle.
How it works
Today we have a separate back-end service that calculates prices with all these complications taken into account. However, it was not always standalone. The service appeared a year or two after the online store launched, and by 2016 it was part of a large Python monolith that bundled a variety of marketing components (Madmin). It was split out into an independent service later, as we moved toward a microservice architecture.
As is usually the case with monoliths, Madmin was modified and partially rewritten by a large number of developers. Third-party libraries were integrated that simplified development but often did performance no favors. At the time, however, we did not worry much about resilience to heavy loads during sales, because the service handled its job perfectly well. But 2016 changed everything.

In the USA, Black Friday has been around since the 1960s. In Russia, it started taking off in the 2010s, and the event essentially had to be built from scratch, since the market was not quite ready for it. Still, the organizers' efforts were not in vain, and every year user traffic to our website during the sale grew. So a collision with a load beyond the capabilities of that version of the pricing service was only a matter of time.
"Black Friday" 2016. And we overslept her
Ever since the sale reached its full potential, Black Friday has differed from any other day of the year in that around midnight roughly a week's worth of the site's audience arrives at the store. This is a hard period for every service; even those that run smoothly all year long sometimes show problems.
Nowadays we prepare for each new Black Friday by simulating the expected load, but in 2016 we did things differently. When testing Madmin before the big day, we stress-tested it with user behavior scenarios taken from ordinary days. As it turned out, such a test does not really reflect reality, because on Black Friday a great many people arrive with the same coupon. Unable to cope with triple the usual load, the pricing service ended up blocking our ability to serve customers for two hours at the hottest peak of the sale.
Service "lay down" an hour before midnight. It all started with a disconnection of the connection to the database (at that time, MySQL), after which not all running copies of the pricing service were able to connect back. And those that are still connected, could not stand the incoming load and stopped responding to requests, stuck on the base locks.
As luck would have it, a junior developer was on duty that night, and he was on his way home from the office when the service fell over. He could only start working on the problem once he got back on site and called in the "heavy artillery", the backup on-call engineer. Together they brought the situation back to normal, but only after two hours.
As the post-mortem progressed, it revealed just how far from optimal the service was. For example, it turned out that calculating a single coupon took 28 database queries (no wonder everything was running at 100% CPU). The users mentioned above, all arriving with the same Black Friday coupon, did not make things easier, especially since back then every coupon had a usage counter, so each redemption added load by hitting that counter.
2016 gave us a lot of food for thought, mainly about how to rework our coupon handling and our testing so that this situation would never repeat. In numbers, that Friday is best described by this picture:
Results of Black Friday 2016
Black Friday 2017. We prepared seriously, but...
Having learned our lesson, we prepared for the next Black Friday in advance by seriously reworking and optimizing the service. For example, we finally split coupons into two types, limited and unlimited: to avoid locks on concurrent database access, we removed the database write from the flow for redeeming a popular coupon. In parallel, one or two months before Black Friday, we migrated the service from MySQL to PostgreSQL, which, together with code optimization, reduced the number of database calls from 28 to 4-5. These improvements brought the service within the SLA requirements: respond within 3 seconds at the 95th percentile at 600 RPS.
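A minimal sketch of that limited/unlimited split, written in Go for illustration (the original code was Python, and the names and schema here are hypothetical): the popular Black Friday coupon is unlimited, so redeeming it no longer writes to the database at all, while limited coupons still decrement their counter atomically.

```go
package coupons

import (
	"context"
	"database/sql"
)

// CouponUsage is an illustrative type, not the production model.
type CouponUsage struct {
	Code      string
	Unlimited bool
}

// applyCouponUsage reports whether the coupon can be redeemed.
func applyCouponUsage(ctx context.Context, db *sql.DB, c CouponUsage) (bool, error) {
	if c.Unlimited {
		// No per-use counter means no hot row and no lock contention
		// when thousands of customers share the same coupon code.
		return true, nil
	}
	// Limited coupons still decrement a remaining-uses counter atomically.
	res, err := db.ExecContext(ctx,
		`UPDATE coupons SET uses_left = uses_left - 1
		  WHERE code = $1 AND uses_left > 0`, c.Code)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n > 0, err
}
```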
Not knowing how much our improvements would actually speed up the old production version, we prepared two versions of the Python code for that Black Friday: the heavily optimized existing one and a completely new one written from scratch. We rolled out the second one to production, having tested it day and night beforehand. As it turned out in battle, though, it was still somewhat undertested.
On the day of "emergency" with the advent of the main stream of customers, the load on the service began to grow exponentially. Some requests were processed up to two minutes. Due to the long processing of some requests, the load on other workers increased.
Our main task was to keep handling this valuable business traffic. But it became obvious that throwing hardware at the problem would not solve it, and that within minutes 100% of the workers would be busy. Not knowing exactly what we were dealing with, we decided to enable harakiri in uWSGI and simply kill long requests (anything taking more than 6 seconds) to free up resources for normal ones. And it really helped us hold on: workers started freeing up literally a couple of minutes before they would have been completely exhausted.
A little later we figured out what was going on. It turned out the slow requests came from very large carts, 40 to 100 items, combined with a particular coupon restricted to a subset of the catalog. The new code handled this case poorly: we found incorrect array handling that turned into infinite recursion. Curiously, we had tested the large-cart case, just not in combination with such a tricky coupon. The fix was simply to switch to the other version of the code, although that happened only about three hours before the end of Black Friday. From that point on, all carts were processed correctly. We met the sales plan, and despite a load five times higher than on a typical day we managed to avoid global problems.
Black Friday 2018
By 2018, we had gradually begun introducing Go for the high-load services behind the site. Given the history of previous Black Fridays, the discount calculation service was one of the first candidates for a rewrite.

Of course, we could have kept the already battle-tested Python version and, before the new Black Friday, removed the heavy libraries and thrown out the suboptimal code. But by that time Go had already taken root with us and looked more promising.
We switched to the new service this summer, so before the next sale we had time to test it thoroughly, including against a growing load profile.
Testing showed that under high load the weak point was still the database. Overly long transactions ate up the entire connection pool, and requests queued up. So we had to tweak the application logic a bit, cutting database usage to a minimum (touching it only when there was no way around it) and caching reference data from the database along with the coupons that are popular on Black Friday.
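Roughly what that looks like in Go, as a sketch rather than our production code (the pool limits, schema and TTL are illustrative): cap the connection pool so long transactions cannot quietly exhaust it, and serve popular coupons from an in-memory cache instead of querying the database on every request.

```go
package pricing

import (
	"context"
	"database/sql"
	"sync"
	"time"
)

type Coupon struct {
	Code    string
	Percent int
}

type cached struct {
	coupon    Coupon
	expiresAt time.Time
}

// CouponCache keeps recently used coupons in memory with a TTL.
type CouponCache struct {
	mu    sync.RWMutex
	ttl   time.Duration
	items map[string]cached
	db    *sql.DB
}

func NewCouponCache(db *sql.DB, ttl time.Duration) *CouponCache {
	// Keep the pool small and predictable so long transactions cannot
	// silently take every connection and stall the rest of the service.
	db.SetMaxOpenConns(20)
	db.SetConnMaxLifetime(5 * time.Minute)
	return &CouponCache{db: db, ttl: ttl, items: make(map[string]cached)}
}

// Get returns the coupon from memory when possible and only falls back to
// the database on a cache miss or after the TTL has expired.
func (c *CouponCache) Get(ctx context.Context, code string) (Coupon, error) {
	c.mu.RLock()
	it, ok := c.items[code]
	c.mu.RUnlock()
	if ok && time.Now().Before(it.expiresAt) {
		return it.coupon, nil
	}

	var cp Coupon
	err := c.db.QueryRowContext(ctx,
		`SELECT code, percent FROM coupons WHERE code = $1`, code).
		Scan(&cp.Code, &cp.Percent)
	if err != nil {
		return Coupon{}, err
	}

	c.mu.Lock()
	c.items[code] = cached{coupon: cp, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return cp, nil
}
```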
However, this year we overshot our load forecast considerably: we prepared for 6-8x peak growth and tuned the services specifically for that volume of requests (we added caches, switched off experimental features, simplified some things, deployed extra Kubernetes nodes and even database servers for replicas, which in the end were not needed). In reality, the surge of user interest was smaller, so everything went smoothly. The service response time stayed under 50 ms at the 95th percentile.
One of the most important characteristics for us is how the application scales when a single instance runs out of resources. Go uses hardware more efficiently, so the same load needs fewer running instances (ultimately serving more requests on the same hardware). This year, at the very peak of the sale, 16 instances of the application were running, handling an average of 300 requests per second with peaks of up to 400, roughly twice the normal load. For comparison, last year the Python service needed 102 instances.
It might seem that the Go service covered all our needs on the first try. But Go is not a universal cure-all, and it has its quirks. For example, we had to limit the number of threads the service can start on a multi-core Kubernetes node so that, when scaling, it would not interfere with the "neighboring" applications in production (by default, Go puts no limit on how many processors it will use). To do this, we set GOMAXPROCS in all our Go applications. We would be glad to hear in the comments how useful this actually is; in our team it was just one hypothesis about how to deal with the degradation of the "neighbors".
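A minimal sketch of that cap (the limit value here is illustrative, not the one we use): restrict how many CPUs the Go scheduler will occupy so the service does not spread across every core of a shared node. The same effect can be achieved by setting the GOMAXPROCS environment variable in the pod spec, which the Go runtime reads at startup.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// By default Go uses all cores visible on the node (runtime.NumCPU()).
	limit := 4 // roughly match the CPU limit requested for the container
	if runtime.NumCPU() > limit {
		runtime.GOMAXPROCS(limit)
	}
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 only queries the current value
}
```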
Another "knob" is the number of connections kept alive (Keep-Alive). Go's default HTTP and database clients each keep only two idle connections, so if you have many concurrent requests and want to save on the overhead of setting up TCP connections, it makes sense to raise this limit via MaxIdleConnsPerHost and SetMaxIdleConns, respectively.
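For illustration, here is what raising those limits can look like; the values, the DSN and the choice of the lib/pq driver are assumptions, not our production settings.

```go
package clients

import (
	"database/sql"
	"net/http"
	"time"

	_ "github.com/lib/pq" // any database/sql driver works; lib/pq is just an example
)

// New builds an HTTP client and a DB pool with larger keep-alive limits.
func New(dsn string) (*http.Client, *sql.DB, error) {
	httpClient := &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 100, // default is 2: too few for many concurrent requests to one host
			IdleConnTimeout:     90 * time.Second,
		},
		Timeout: 5 * time.Second,
	}

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, nil, err
	}
	db.SetMaxIdleConns(50) // default is 2: extra idle connections get closed and re-dialed

	return httpClient, db, nil
}
```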
Even with these manual tweaks, though, Go has given us a large performance margin for future sales.