
How we carried "Miss Russia" over in our arms

On April 15, the Miss Russia 2017 contest was held. After the site was completely reworked, pages began to load within one second, even at peak load. Our partners from Byndyusoft, represented by Alexander Byndyu (@alexanderbyndyu), the architect of the entire system, explain how they did it, share the details of moving the platform to the cloud, and explain why they had to change the project's entire internal infrastructure.



About the company: Byndyusoft implements projects on the .NET platform for various subject areas around the world.

About the contest


The national contest "Miss Russia" is the country's largest project in the field of beauty. Currently, the competition is supported by the Ministry of Culture of the Russian Federation. This year the contest celebrated its 25th anniversary.
The qualifying rounds of "Miss Russia" are held in 85 regions of the Russian Federation, and each year more than 75,000 girls take part in the castings.

The "Miss Russia" contest holds exclusive rights to represent our country at the largest international beauty contests, "Miss World" and "Miss Universe".



The peculiarity of the contest site is that 97% of the traffic and 100% of the voting take place within two weeks in April. Compared with April, the rest of the year passes with almost no load.



What was the problem



Over the past few years, the Miss Russia site was constantly reworked and refreshed for each new contest. By the 25th anniversary, “Miss Russia” had arrived with a design from Studio Lebedev that lived on the “Plesk” CMS (correction, 10.05.2017: Cubique CMS, not Plesk).

During the contest, users constantly complained that pages opened slowly, returned 500 errors, and that voting did not work. There were attempts to “overclock” the site, but they failed.

The site was hosted on a dedicated server. Adding hardware to the server had stopped helping. When the customer ran load tests, the site went down at 150 requests per second.


Load tests of the first version

First try

At first, the customer tried to move the site from the dedicated server to the cloud: they copied the virtual machine and launched it on Azure. Despite the increased capacity and the hopes placed on the "cloud", performance dropped: the site went down at 90 requests per second.


The first version on Azure

The contest site could not be scaled horizontally. Add five times more hardware and performance grows by 30%; double it again and performance grows by only another 2%. This is a typical problem of monolithic systems.
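To put those figures in perspective, here is a rough calculation of scaling efficiency, defined here as speedup divided by the hardware multiplier. The helper name is hypothetical, and reading the second step as "10x hardware in total, with the extra 2% compounding on top of the 30%" is an assumption about what the numbers mean.

```python
def scaling_efficiency(hardware_factor: float, speedup: float) -> float:
    """Share of the added hardware that turned into throughput (1.0 = linear scaling)."""
    return speedup / hardware_factor


# Step 1 (from the text): 5x hardware gave +30% throughput.
first_step = scaling_efficiency(5, 1.30)

# Step 2 (assumed reading): doubling again to 10x gave another +2% on top.
second_step = scaling_efficiency(10, 1.30 * 1.02)

print(f"5x hardware:  {first_step:.0%} efficiency")
print(f"10x hardware: {second_step:.0%} efficiency")
```

Under these assumptions, only about a quarter of the extra hardware pays off at the first step and roughly an eighth at the second, which is why adding machines to the monolith stopped making sense.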


Old Version Architecture

Of course, the old Miss Russia could have been patched up: for example, by adding a CDN or by running several virtual machines behind a load balancer. But each such move would have run into the problems of the old code and the CMS, and still would not have brought the desired flexibility in scaling.
It became clear that the site had to be rebuilt completely for a new architecture and cloud infrastructure.

Second try

The customer came to us with a task: move “Miss Russia” to the cloud and achieve sufficient speed. We looked under the hood and realized that the business goals could not be achieved with the current architecture. We decided to redo everything from scratch.

Layout

First we redid the site's markup. It became lighter and opened faster. Previously, you could come across a tiny avatar on the site that was actually loaded in high resolution and weighed 3 MB. For the new site we achieved the following results:

  1. 2 to 13 times fewer requests per page.
  2. 5 to 16 times less traffic.
  3. 8 times less time for a full page load.

We analyzed the Metrica data and found that 60% of visitors come from mobile devices. The site was redesigned to be adaptive and responsive.

Before



After



Architecture


Instead of a monolithic backend, we implemented a distributed microservice architecture so that, to increase capacity, we would not have to throw servers at the whole site; it would be enough to add capacity to the right service at the right time.

For the new architecture, we started from the ideas that would lead to the business goals:

  1. Split the application into (micro)responsibilities.
  2. Each part fulfills its role perfectly.
  3. Each part takes care of its own scaling.
  4. Total automation.

As a result, we arrived at this architecture:



New architecture



Previously, all requests from the site went to a single monolithic block, which was responsible for both processing votes and generating content. When one module was overloaded, the others began to slow down too. Now each part works and scales independently.

The new load-testing results were encouraging:

  1. The load was generated over a network with a bandwidth of 1 Gbit/s.
  2. The first problems with server responses appeared after ~5450 RPS.
  3. Response time did not exceed 1000 ms.




Technology


Azure provides a choice in technology and solutions. For example, which CDN to take? Akamai or Verizon? We set up experiments and selected the most appropriate tools, finding a couple of critical problems along the way.

.NET Core and Kestrel

The new contest site is written in .NET Core. We had been running it in production on other projects for half a year and had seen no problems.

The only unpleasant problem arose with Kestrel, which began to respond with code 502.3 under load. The application then crashed and did not recover until it was restarted.

The problem was in Kestrel version 1.1.0; it is described in Issue323 and Issue311. We were lucky: two weeks before the contest, Microsoft.AspNetCore.Server.Kestrel version 1.1.1 came out and the problem went away.

CDN

We chose between Akamai and Verizon. We chose Akamai because it has a cache server in Russia, which is important for the contest audience.

For the CDN we mostly used the standard approach:

  1. Pictures are cached for 7 days; HTML is refreshed once per hour.
  2. New versions of JavaScript and CSS reach the CDN automatically, and each version is cached separately.
  3. Compression is enabled.
  4. The cache can be reset manually.
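The caching rules above can be sketched as a small policy function. This is an illustration only, not the project's actual configuration: the list of image extensions, the choice to treat extensionless paths as HTML, and the `/static/<version>/` URL layout are all assumptions made for the example. The text gives no lifetime for JavaScript and CSS, so the sketch returns no header for them.

```python
from pathlib import PurePosixPath
from typing import Optional

IMAGE_TTL = 7 * 24 * 3600  # pictures: 7 days (from the text)
HTML_TTL = 3600            # HTML: refreshed once per hour (from the text)

# Assumed set of image extensions; the article does not list them.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def cache_control(path: str) -> Optional[str]:
    """Return a Cache-Control value for a URL path, or None if no rule is known."""
    ext = PurePosixPath(path).suffix.lower()
    if ext in IMAGE_EXTS:
        return f"public, max-age={IMAGE_TTL}"
    if ext in {".html", ""}:  # assumption: extensionless paths are HTML pages
        return f"public, max-age={HTML_TTL}"
    return None  # JS/CSS lifetime is not stated in the article


def versioned_url(name: str, version: str) -> str:
    """Hypothetical layout: the version is part of the path, so each release
    gets its own cache entry and old entries never need to be purged."""
    return f"/static/{version}/{name}"


print(cache_control("/photos/finalist.JPG"))
print(cache_control("/index.html"))
print(versioned_url("app.js", "1.4.2"))
```

Putting the version into the URL is one common way to get "each version is cached separately"; the article does not say which mechanism was actually used.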

If you want to cache HTML, note that Akamai CDN only supports third-level domains. To enable caching, we had to redirect from missrussia.ru to www.missrussia.ru.
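The redirect described above might look roughly like this. It is a minimal sketch of the URL rewriting only; the function name is hypothetical, and how the redirect was really configured (web server rule, application code, DNS provider) is not stated in the article.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

APEX_HOST = "missrussia.ru"
CANONICAL_HOST = "www.missrussia.ru"


def canonical_redirect(url: str) -> Optional[str]:
    """Return the www URL to redirect to, or None if no redirect is needed.

    Path, query string and fragment are preserved; an explicit port, if any,
    is dropped in this simplified version.
    """
    parts = urlsplit(url)
    if parts.hostname == APEX_HOST:  # hostname is already lower-cased by urlsplit
        return urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
        )
    return None


print(canonical_redirect("http://missrussia.ru/vote?id=5"))
print(canonical_redirect("http://www.missrussia.ru/"))
```

A web framework would typically send the returned URL in a `Location` header with a permanent redirect status.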

WebApp

We deployed the main site and the API in separate WebApps. When the load changed, we could choose between scaling methods:


During the contest, both WebApps ran on the S3 pricing tier, and after the contest on the S1 tier, so as not to burn money when there is no load.

Service Bus

For the queue, we chose between Service Bus and Storage Queues. Here is what we needed:

  1. Small messages with a short processing time.
  2. No need for transactions or for guaranteed message ordering.
  3. A client available for .NET Core.

We chose Service Bus with the .NET Standard client library for Azure Service Bus.

If the queue is slow and fails when sending a message, check that you:


WebJob and vote checking

When solving the voting problem, we had to make sure that an influx of voters would not affect the response time of the main site. In addition, it was important to strengthen the algorithms for filtering out bots, because vote rigging had been a problem the previous year. In other words, the voting system had to work faster and at the same time perform a more complex analysis.

We separated in time the assessment of a vote's validity from the increment of the vote counter. When a person votes, they always get the same answer: “Thank you, your vote has been counted!”. At that moment a message about the vote is created and sent to the queue, where it waits its turn to be analyzed by the vote processing service. Processed votes go into the database and then, a few hours later, appear on the site. The vote count grows by hundreds or thousands at a time.

This approach removed the feedback for those who tried to guess the API call parameters in order to rig votes. It was no longer clear to them how the system had reacted to a manually crafted POST request.
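The flow described above can be sketched as follows. This is a simplified illustration, not the project's code: an in-process `queue.Queue` stands in for the Service Bus queue, a plain dict stands in for the database, and the one-vote-per-IP check is a made-up placeholder, since the article does not describe the real bot-detection algorithm.

```python
import queue
from dataclasses import dataclass

THANKS = "Thank you, your vote has been counted!"


@dataclass
class Vote:
    contestant_id: int
    voter_ip: str


vote_queue = queue.Queue()  # stand-in for the Service Bus queue


def accept_vote(contestant_id: int, voter_ip: str) -> str:
    """API side: enqueue the vote and reply at once, without validating it.

    Every caller gets the same answer, so the reply reveals nothing about
    whether the vote will later be accepted or rejected.
    """
    vote_queue.put(Vote(contestant_id, voter_ip))
    return THANKS


def looks_like_bot(vote: Vote, seen_ips: set) -> bool:
    """Hypothetical placeholder rule: a second vote from the same IP is suspect."""
    return vote.voter_ip in seen_ips


def process_pending(totals: dict) -> None:
    """Worker side: drain the queue, drop suspected bots, apply the rest as a batch."""
    seen_ips = set()
    batch = {}
    while True:
        try:
            vote = vote_queue.get_nowait()
        except queue.Empty:
            break
        if looks_like_bot(vote, seen_ips):
            continue
        seen_ips.add(vote.voter_ip)
        batch[vote.contestant_id] = batch.get(vote.contestant_id, 0) + 1
    # Counters move in one step, which is why visitors see totals jump in bulk.
    for contestant_id, count in batch.items():
        totals[contestant_id] = totals.get(contestant_id, 0) + count


replies = [
    accept_vote(1, "10.0.0.1"),
    accept_vote(1, "10.0.0.1"),  # duplicate, will be dropped later
    accept_vote(2, "10.0.0.2"),
]
totals = {}
process_pending(totals)
print(replies[0], totals)
```

The key design choice is that validation happens after the response has already been sent, so the speed of the API does not depend on how expensive the analysis is.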

In terms of horizontal scaling, the solution is also excellent. The Service Bus queue scales horizontally, and to speed up the heavy analysis of votes it is enough to start a few dozen instances of the vote processing service. In Azure, you can start several WebJobs either manually with a couple of mouse clicks or automatically.

There is a technical reason why we chose WebJob for the vote processing service rather than Service Fabric.

To work with Service Fabric on .NET Core, you need to install the SDK from a special Ubuntu repository. This creates problems both in deployment and during development. WebJob, by contrast, works with .NET Core without any extra steps.
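The idea of several processing services reading from one queue (often called competing consumers) can be illustrated with threads. This is a sketch under obvious simplifications: threads in one process stand in for separate WebJob instances, `queue.Queue` stands in for Service Bus, and `run_workers` is a hypothetical helper.

```python
import queue
import threading


def run_workers(messages, worker_count, handle):
    """Process every message exactly once using worker_count competing consumers."""
    pending = queue.Queue()
    for message in messages:
        pending.put(message)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                message = pending.get_nowait()  # each message goes to one worker only
            except queue.Empty:
                return  # queue drained, this worker stops
            outcome = handle(message)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Adding capacity means raising worker_count; the producer side does not change.
processed = run_workers(range(100), worker_count=4, handle=lambda n: n * 2)
print(len(processed))
```

The order of results is not deterministic, which matches the requirement stated earlier that message ordering was not needed.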

Pure PaaS

Two of our developers built the whole project in four weeks. It went so quickly partly because the entire infrastructure was set up with mouse clicks in the Microsoft Azure web interface: it is pure PaaS. We did not create or configure a single virtual machine.

Vertical and horizontal scaling was also done with the mouse.

Microservices

Although the project is small and has only three (micro)responsibilities, we adhered to the basic ideas of microservice architecture. We identified three microservices in the project: a vote processor (service), a vote receiver (API), and a content generator (Web).

The microservices are completely independent. If any of them shuts down, the rest continue to work. If one of them comes under load and begins to slow down, the others will not notice.

If, after processing a vote, we wanted to send the contestant an SMS telling her that someone had voted for her, another microservice connected to the Service Bus would appear on the diagram. This microservice would consume the events created for it at the moment vote processing is completed. In this way, the architecture can be extended almost without limit.

The only thing that distinguishes the new architecture of the Miss Russia contest site from a canonical microservice architecture is a common database for all microservices. We made this simplification deliberately to save time and money. The database is small, there is not much data, and the data is partitioned so that it hardly overlaps. If the logic in the project ever becomes complicated, which is unlikely, we will give each microservice its own storage.
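The extension described above follows the publish/subscribe pattern. The sketch below shows the idea with an in-memory event bus; `EventBus`, the event name `vote_processed` and both handlers are hypothetical names for illustration, and in the real system the role of the bus would be played by Service Bus.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-memory publish/subscribe hub (stand-in for a message broker)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher does not know who is listening or how many there are.
        for handler in self._subscribers[event_type]:
            handler(payload)


bus = EventBus()
vote_totals = {}
sms_outbox = []


def update_totals(event):
    cid = event["contestant_id"]
    vote_totals[cid] = vote_totals.get(cid, 0) + 1


def queue_sms(event):
    sms_outbox.append(f"Someone voted for contestant {event['contestant_id']}")


bus.subscribe("vote_processed", update_totals)
# The new SMS service is added without touching the vote processor:
bus.subscribe("vote_processed", queue_sms)

bus.publish("vote_processed", {"contestant_id": 7})
print(vote_totals, sms_outbox)
```

Because the vote processor only publishes an event, adding the SMS service is a matter of registering one more subscriber.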

Result


Site speed

The site runs fast and smooth, even on mobile. All content is cached in the CDN, which coped with 5500 requests per second. Caching in the browser, in the CDN and in the web application allowed 99.7% of users to open the page of the Miss Russia contest in less than one second.



Load flexibility

Thanks to the flexibility of allocating capacity in Azure, the cost of the new infrastructure during voting (two weeks a year) is equal to the cost of the dedicated server for the previous version of the site. After the vote, we removed the unneeded capacity in a couple of clicks and the cost fell to a third of that.

On large projects, we usually build an automated system that adds service instances when the load grows and reduces their number when there is no load. Peaks are tied not only to well-known events (Black Friday, March 8); they can also be daily (quiet at night, busy during the day), weekly (nobody at the weekend, peaks on weekdays), or random (someone mentioned the site on a popular forum), so automation is necessary.
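A back-of-the-envelope estimate of what this means over a whole year is shown below. It assumes a 52-week year, a constant weekly price for the old dedicated server, and that the off-peak price holds for all the remaining weeks; none of these assumptions come from the article, and the function name is hypothetical.

```python
def annual_cost_ratio(peak_weeks: int = 2,
                      weeks_per_year: int = 52,
                      off_peak_factor: float = 1 / 3) -> float:
    """New yearly cost divided by old yearly cost.

    One unit = the weekly cost of the old dedicated server, which the text
    says equals the weekly cost of the new setup during voting.
    """
    new_cost = peak_weeks * 1.0 + (weeks_per_year - peak_weeks) * off_peak_factor
    old_cost = weeks_per_year * 1.0
    return new_cost / old_cost


ratio = annual_cost_ratio()
print(f"New infrastructure costs about {ratio:.0%} of the old yearly bill")
```

Under these assumptions the yearly bill comes to a little over a third of the old one, because the expensive configuration is only paid for during the two weeks of voting.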
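A rule for such automation can be as simple as the sketch below. It only shows the decision logic; the per-instance capacity and the instance limits are invented numbers, `desired_instances` is a hypothetical helper, and a real setup would feed it measured load and pass the result to the cloud provider's scaling mechanism.

```python
import math


def desired_instances(current_rps: float,
                      rps_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 10) -> int:
    """Number of service instances needed for the current load, within limits."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    needed = math.ceil(current_rps / rps_per_instance)
    # Never drop below the minimum (availability) or exceed the maximum (budget).
    return max(min_instances, min(max_instances, needed))


# Example with made-up capacity of 500 requests per second per instance.
for load in (0, 1200, 100_000):
    print(load, "->", desired_instances(load, rps_per_instance=500))
```

The upper bound matters as much as the lower one: without it, a random traffic spike or an attack would translate directly into an unbounded bill.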

Voting

The voting system served 100% of requests; no user received a 500 error. Of the 750K votes, the algorithm rejected approximately 500K as bots, and the remaining votes were credited to the contestants.

The contest organizers received transparent reporting on the voting process: who received how many votes and who tried to rig the results.

Conclusions

A well-designed architecture made it possible to:

- Give more resources to the loaded parts and fewer to the idle ones.
- Remove the influence of parts of the system on each other.
- Allocate resources to each part only when they are required.

Additional materials on architecture:



About the author


Alexander Byndyu (@alexanderbyndyu): owner of Byndyusoft, expert in Agile and Lean, IT architect.

The project was completed with expert support from the CSP provider InfoboxCloud, which promptly answered all our questions about Azure.

Source: https://habr.com/ru/post/327546/

