Hi, Habr!
Today I want to talk about some of the solutions to the problems that arise when running high-load systems. Everything discussed in this article has been tested on my own experience: I am a Social Games Server Team Lead at Plarium, a company that develops social, mobile, and browser games.
First, some statistics. Plarium has been developing games since 2009. At the moment our projects run on all of the most popular social networks (Vkontakte, My World, Odnoklassniki, Facebook), and several games are integrated into large game portals: games.mail.ru and Kabam. There are also separate browser and mobile (iOS) versions of the "Rules of War" strategy game. Our databases hold more than 80 million users (5 games, localization into 7 languages, 3 million unique players per day); as a result, all of our servers together receive on average about 6,500 requests per second, or 561 million requests per day.
As a hardware platform, the production servers mostly use two server CPUs with 4 cores each (x2 HT), 32-64 GB RAM, and 1-2 TB HDD. The servers run Windows Server 2008 R2. Content is delivered through a CDN with bandwidth of up to 5 Gbps.
Development is done on the .NET Framework 4.5 in the C# programming language.
Given the specifics of our work, we have to ensure not only the stable functioning of the systems, but also that they can withstand large load spikes. Unfortunately, many popular approaches and technologies do not survive the test of high load and quickly become a bottleneck in the system. So, after analyzing many problems, we found the ways of solving them that work best for us (in my opinion). I will tell you which technologies we chose, which ones we rejected, and why.
NoSQL vs. Relational
In this battle, pure NoSQL turned out to be a weak fighter: the solutions that existed at the time did not offer sane data consistency and were not resilient enough to crashes, which made itself felt in practice. In the end the choice fell on a relational DBMS, which lets us use transactions where they are really needed; nevertheless, NoSQL is used as the main approach. In particular, tables often have a very simple key-value structure, where the data is represented as JSON stored in packed form in a BLOB column. As a result, the schema stays simple and stable, while the structure of the data field can easily grow and change. Oddly enough, this gives a very good result: in our solution we have combined the advantages of both worlds.
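To make the idea more concrete, here is a minimal sketch of what such a key-value table with a packed JSON BLOB might look like. The table and column names, the compression scheme, and the upsert query are illustrative assumptions, not our production code:

```csharp
// Hypothetical sketch of the key-value + packed JSON approach described above.
//
//   CREATE TABLE UserState (
//       UserId BIGINT NOT NULL PRIMARY KEY,
//       Data   VARBINARY(MAX) NOT NULL   -- compressed JSON blob
//   );
using System.Data.SqlClient;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class UserStateStore
{
    // The object graph is serialized to JSON elsewhere, then packed and stored as a BLOB.
    public static void Save(SqlConnection connection, long userId, string json)
    {
        byte[] packed = Pack(json);
        using (var cmd = new SqlCommand(
            "UPDATE UserState SET Data = @data WHERE UserId = @id;" +
            "IF @@ROWCOUNT = 0 INSERT INTO UserState (UserId, Data) VALUES (@id, @data);",
            connection))
        {
            cmd.Parameters.AddWithValue("@id", userId);
            cmd.Parameters.AddWithValue("@data", packed);
            cmd.ExecuteNonQuery();
        }
    }

    private static byte[] Pack(string json)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(json);
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }
}
```

The point is that the schema never changes: new fields simply appear inside the JSON, and the database sees only a key and a blob.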
ORM vs. ADO.NET
Given that pure ADO.NET has minimal overhead, and that all queries are written by hand, are familiar, and warm the soul, it sends any ORM into a deep knockout. The reason is that object-relational mapping has, in our case, a number of drawbacks, such as poor performance and weak control over queries (or none at all). With many ORM solutions you end up fighting the library long and often, and losing the main thing: speed. And when it comes to some tricky flag needed to make the client library handle timeouts correctly, or something similar, trying to set such a flag through the ORM is completely hopeless.
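For illustration, a minimal plain ADO.NET query looks like this; the query text, parameters, and timeout are explicit and fully under the caller's control (names here are illustrative, not from our codebase):

```csharp
// Plain ADO.NET: no mapping layer between the code and the query.
using System;
using System.Data.SqlClient;

public static class UserStateReader
{
    public static byte[] LoadUserData(string connectionString, long userId)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Data FROM UserState WHERE UserId = @id", connection))
        {
            cmd.CommandTimeout = 5;                       // explicit timeout in seconds, no ORM in the way
            cmd.Parameters.AddWithValue("@id", userId);
            connection.Open();

            object result = cmd.ExecuteScalar();
            return result == null || result is DBNull ? null : (byte[])result;
        }
    }
}
```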
Distributed transactions vs. our own Eventual Consistency
The main task of transactions is to ensure data consistency once an operation completes: changes are either saved successfully or rolled back completely if something went wrong. While with a single database we were happy to use this undoubtedly important mechanism, distributed transactions under high load showed themselves from their worst side. Using them means increased latency and more complicated logic (here we should also mention the need to refresh in-memory caches on application instances, the possibility of deadlocks, and poor resilience to hardware failures).
As a result, we developed our own mechanism for providing eventual consistency, built on message queues. With it we obtained scalability, fault tolerance, an acceptable time to reach consistency, and no deadlocks.
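The article does not describe the implementation itself, so the following is only a rough, hypothetical sketch of the general pattern: commit the local part of the change, publish a message instead of opening a distributed transaction, and let a consumer apply the remote part, retrying until it succeeds. The queue, the domain type, and the retry policy are all assumptions:

```csharp
// Hypothetical eventual-consistency sketch over a message queue (not Plarium's actual code).
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class ClanGoldTransfer
{
    public long FromUserId;
    public long ToClanId;
    public int Amount;
}

public static class EventualConsistencyExample
{
    // Stand-in for a durable queue (MSMQ, RabbitMQ, a database table, etc.).
    private static readonly BlockingCollection<ClanGoldTransfer> Queue =
        new BlockingCollection<ClanGoldTransfer>();

    public static void DonateGold(ClanGoldTransfer transfer)
    {
        // 1. Apply and commit the local part (debit the player) in the player's own database.
        // 2. Enqueue a message describing the remote part instead of a distributed transaction.
        Queue.Add(transfer);
    }

    public static void ConsumerLoop()
    {
        foreach (var transfer in Queue.GetConsumingEnumerable())
        {
            try
            {
                // The handler must be idempotent so that redelivery after a failure
                // does not double-apply the change.
                ApplyToClan(transfer);
            }
            catch (Exception)
            {
                Thread.Sleep(1000);   // back off and retry; consistency arrives eventually
                Queue.Add(transfer);
            }
        }
    }

    private static void ApplyToClan(ClanGoldTransfer transfer) { /* update the clan's database */ }
}
```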
SOAP, WCF, etc. vs. JSON over HTTP
Ready-made solutions in the SOAP style (standard .NET web services, WCF, Web API, etc.) gave us too little flexibility, were difficult to configure and to support from various client technologies, and added an extra infrastructure intermediary. To exchange data, we chose sending JSON over HTTP, not only because it is as simple as possible, but also because with such a protocol it is very easy to diagnose and fix problems. This simple combination also covers the widest range of client technologies.
MVC.NET, Spring.NET vs. bare ASP.NET
Based on our experience, I can say that MVC.NET, Spring.NET, and similar frameworks create unnecessary intermediate layers that keep you from squeezing out maximum performance. Our solution is built on the most basic features ASP.NET provides. In fact, the entry point is a handful of ordinary handlers. We do not use a single standard module, and there is no active ASP.NET session in the application. Everything is clear and simple.
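A minimal sketch of such a bare entry point is shown below: a plain IHttpHandler that reads a JSON request and writes a JSON response, with no MVC pipeline, modules, or session involved. The handler and its response body are illustrative:

```csharp
// Bare ASP.NET: one handler, JSON in, JSON out.
using System.IO;
using System.Web;

public class CommandHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        string requestJson;
        using (var reader = new StreamReader(context.Request.InputStream))
            requestJson = reader.ReadToEnd();

        // Dispatch the command and build the reply; serialization details are up to you.
        string responseJson = "{\"ok\":true}";

        context.Response.ContentType = "application/json";
        context.Response.Write(responseJson);
    }
}
```

Such a handler is wired up either as an .ashx file or through the handlers section of Web.config, and that is all the "framework" that is needed.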
A little about reinventing the wheel
If none of the existing ways of solving a problem fits, you have to become a bit of an inventor and look for the answers yourself. And even if you sometimes reinvent the wheel, when that wheel is critical for the project, it is worth it.
JSON serialization
Slightly more than a third of the CPU time we use is spent on serializing and deserializing large amounts of data in JSON format, so the efficiency of this task is very important for the performance of the system as a whole.
Initially we used Newtonsoft JSON.NET, but at some point we concluded that its speed was not enough and that we could implement the functionality we needed in a faster way, without having to support too many deserialization options and "great" features such as JSON schema validation, deserialization into JObject, and so on.
So we wrote our own serializer, taking the specifics of our data into account. In our tests the resulting solution turned out to be 10 times faster than JSON.NET and 3 times faster than fastJSON.
Compatibility with existing data serialized with Newtonsoft was crucial for us. To verify it, before putting our serializer into production we tested it on several large databases: we read the data in JSON format, deserialized it with our library, serialized it again, and checked the original and resulting JSON for equality.
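A sketch of that round-trip check might look like the following. The custom serializer is passed in as delegates here because its real API is not shown in the article; the exact-string comparison assumes the serializer preserves field order and formatting:

```csharp
// Round-trip compatibility check: deserialize with the new library, serialize back, compare.
using System;
using System.Collections.Generic;

public static class SerializerCompatibilityCheck
{
    public static int Run<T>(IEnumerable<string> storedJsonDocuments,
                             Func<string, T> deserialize,
                             Func<T, string> serialize)
    {
        int mismatches = 0;
        foreach (string original in storedJsonDocuments)
        {
            T value = deserialize(original);
            string roundTripped = serialize(value);

            if (!string.Equals(original, roundTripped, StringComparison.Ordinal))
            {
                mismatches++;
                Console.WriteLine("Mismatch for document: {0}", original);
            }
        }
        return mismatches;
    }
}
```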
Memory
Because of our approach to organizing data, we got a negative side effect in the form of a large object heap. For comparison, its average size was about 8 gigabytes versus 400-500 megabytes of second-generation objects. In the end this problem was solved by splitting large blocks of data into smaller ones, using a pool of pre-allocated blocks. Thanks to this scheme the large object heap shrank significantly, and garbage collection became less frequent and cheaper. Users are happy, and that is what matters.
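As a simplified sketch of this idea: instead of allocating one huge byte[] (which lands on the large object heap at roughly 85,000 bytes and above), the data is split into small fixed-size segments taken from a pool of pre-allocated blocks and returned after use. The block size and pooling policy below are assumptions:

```csharp
// A simple pool of small blocks that keeps large payloads off the large object heap.
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class BlockPool
{
    // 64 KB blocks stay below the ~85 KB large object heap threshold.
    public const int BlockSize = 64 * 1024;

    private readonly ConcurrentBag<byte[]> _free = new ConcurrentBag<byte[]>();

    public byte[] Rent()
    {
        byte[] block;
        return _free.TryTake(out block) ? block : new byte[BlockSize];
    }

    public void Return(byte[] block)
    {
        _free.Add(block);
    }

    // A large payload is kept as a list of small pooled segments instead of one huge array.
    public List<byte[]> RentSegments(int totalBytes)
    {
        var segments = new List<byte[]>();
        for (int remaining = totalBytes; remaining > 0; remaining -= BlockSize)
            segments.Add(Rent());
        return segments;
    }
}
```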
When working with memory, we use several caches of different sizes with different expiration and update policies, and some of the caches are designed to be extremely simple, without any frills. As a result, the hit rate of all the caches is at least 90-95%.
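In the spirit of those "no frills" caches, a deliberately simple in-memory cache with time-based expiration might look like this (the eviction policy and lifetime are illustrative):

```csharp
// A minimal in-memory cache with time-based expiration.
using System;
using System.Collections.Concurrent;

public sealed class SimpleCache<TKey, TValue>
{
    private sealed class Entry
    {
        public TValue Value;
        public DateTime ExpiresAtUtc;
    }

    private readonly ConcurrentDictionary<TKey, Entry> _items = new ConcurrentDictionary<TKey, Entry>();
    private readonly TimeSpan _timeToLive;

    public SimpleCache(TimeSpan timeToLive)
    {
        _timeToLive = timeToLive;
    }

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> factory)
    {
        Entry entry;
        if (_items.TryGetValue(key, out entry) && entry.ExpiresAtUtc > DateTime.UtcNow)
            return entry.Value;   // cache hit

        // Cache miss or stale entry: rebuild and store.
        var fresh = new Entry { Value = factory(key), ExpiresAtUtc = DateTime.UtcNow + _timeToLive };
        _items[key] = fresh;
        return fresh.Value;
    }
}
```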
After trying Memcached, we decided to put it aside for the future, because so far there is no need for it. The results were not bad overall, but the performance gain did not outweigh the cost of the extra serialization/deserialization when putting data into the cache.
Additional tools
• Profiler
Since the usual profilers significantly slow an application down once attached to it, which in practice makes it impossible to profile heavily loaded applications, we use our own system of performance counters:

In this test example you can see that we wrap the main operations in named counters. The statistics are accumulated in memory and collected from the servers along with other useful information. Counter hierarchies are supported so that call chains can be analyzed. As a result, you can get a report like this:

Among the advantages:
- the counters are always on;
- minimal overhead (less than 0.5% of the CPU resources used);
- a simple and flexible way of marking the sections to be profiled;
- automatic generation of counters for entry points (network requests, methods);
- the ability to view and aggregate on the parent-child principle;
- the ability not only to look at real-time data, but also to save counter measurements over time for later viewing and analysis.
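As a rough illustration of the named-counter idea (not our actual implementation), a minimal sketch might look like this: an operation is wrapped in a disposable scope that measures elapsed time and accumulates totals in memory under the counter's name.

```csharp
// Minimal named-counter sketch: wrap an operation, accumulate call count and total time.
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

public static class Counters
{
    private class Stat { public long Calls; public long TotalMilliseconds; }

    private static readonly ConcurrentDictionary<string, Stat> Stats =
        new ConcurrentDictionary<string, Stat>();

    public static IDisposable Measure(string name)
    {
        return new Scope(name);
    }

    private sealed class Scope : IDisposable
    {
        private readonly string _name;
        private readonly Stopwatch _watch = Stopwatch.StartNew();

        public Scope(string name) { _name = name; }

        public void Dispose()
        {
            _watch.Stop();
            var stat = Stats.GetOrAdd(_name, _ => new Stat());
            lock (stat)
            {
                stat.Calls++;
                stat.TotalMilliseconds += _watch.ElapsedMilliseconds;
            }
        }
    }
}

// Usage:
//   using (Counters.Measure("SaveUserState"))
//   {
//       // the operation being profiled
//   }
```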
• Logging
This is often the only way to diagnose errors. We use two formats, human-readable and JSON, and we write everything that can be written as long as there is enough disk space. We collect the logs from the servers and use them for analysis. Everything is built on top of log4net, so nothing extra is involved and the solution is as simple as possible.
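A minimal log4net usage sketch is shown below; the appenders, layouts, and the JSON format itself are configured separately, and the exact production setup is not shown in the article:

```csharp
// Basic log4net usage: configuration is read from the .config file, logging is one call away.
using log4net;
using log4net.Config;

public static class LoggingExample
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(LoggingExample));

    public static void Main()
    {
        XmlConfigurator.Configure();   // reads appenders and layouts from the application .config

        Log.Info("Server started");
        try
        {
            // ... request handling ...
        }
        catch (System.Exception ex)
        {
            Log.Error("Unhandled error while processing request", ex);
        }
    }
}
```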
• Administration
In addition to the rich graphical web interface of our admin panel, we developed a web console whose commands can be added directly on the game server, without changing the admin panel. Using the console, you can also very easily and quickly add a new command for diagnostics, for getting technical data online, or for tuning the system without a restart.
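The article does not describe the mechanism itself, so the following is only a hypothetical sketch of how such a command registry on the game server might look; the point is that registering a new named command does not require touching the admin panel:

```csharp
// Hypothetical web-console command registry on the game server.
using System;
using System.Collections.Generic;

public static class ConsoleCommands
{
    private static readonly Dictionary<string, Func<string[], string>> Commands =
        new Dictionary<string, Func<string[], string>>(StringComparer.OrdinalIgnoreCase);

    public static void Register(string name, Func<string[], string> handler)
    {
        Commands[name] = handler;
    }

    // Called by the web console handler with the command entered by an administrator.
    public static string Execute(string name, string[] args)
    {
        Func<string[], string> handler;
        return Commands.TryGetValue(name, out handler)
            ? handler(args)
            : "Unknown command: " + name;
    }
}

// Adding a new diagnostic command is then a one-liner somewhere in the server code:
//   ConsoleCommands.Register("gc-stats", args => GC.GetTotalMemory(false).ToString());
```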
• Deployment
As the number of servers grew, it became impossible to deploy anything by hand. So, in just one week of one programmer's work, we developed a simple system for automated server updates. The scripts are written in C#, which makes the deployment logic quite flexible to modify and maintain. As a result, we got a very reliable and simple tool that, in critical situations, lets us update all production servers (about 50 of them) within a few minutes.
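A heavily simplified, hypothetical sketch of such a C# deployment "script" is shown below; the server list, paths, and update steps are all assumptions meant only to convey the shape of the tool:

```csharp
// Hypothetical deployment sketch: stop, copy the new build, start, server by server.
using System;
using System.IO;

public static class Deploy
{
    public static void Main()
    {
        string[] servers = File.ReadAllLines("servers.txt");        // one host name per line
        string buildDirectory = @"\\buildserver\releases\latest";

        foreach (string server in servers)
        {
            Console.WriteLine("Updating {0}...", server);
            StopGameService(server);
            CopyBuild(buildDirectory, @"\\" + server + @"\d$\GameServer");
            StartGameService(server);
        }
    }

    private static void StopGameService(string server) { /* e.g. stop the Windows service remotely */ }
    private static void StartGameService(string server) { /* e.g. start the Windows service remotely */ }

    private static void CopyBuild(string source, string destination)
    {
        foreach (string file in Directory.GetFiles(source, "*", SearchOption.AllDirectories))
        {
            string target = Path.Combine(destination, file.Substring(source.Length + 1));
            Directory.CreateDirectory(Path.GetDirectoryName(target));
            File.Copy(file, target, true);
        }
    }
}
```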
Findings
To achieve speed under high server load, you need a simpler and thinner technology stack, and all the tools must be familiar and predictable. The designs should be simple, sufficient for solving the current problems, and still have a margin of safety. It is best to scale horizontally and to keep the performance of the caches under control. Logging and monitoring of the system's status are a must-have for keeping any serious project alive, and a deployment system greatly simplifies life and saves time and nerves.