The analytical part of the Yandex.Money backend makes active use of Microsoft IIS, and we have accumulated some knowledge about running it in a high-load environment that I want to share.
Our analytics runs on the Microsoft stack (SQL Server plus SSIS, SSAS, and SSRS), one of the best BI solutions on the market. Since our BI is built on services from a single vendor, it is only logical to use Microsoft's IIS for hosting web applications as well.
In this article I will describe the peculiarities of running applications on IIS that are typical of a high-load environment.
To begin, let's align our understanding of "highload". As an illustration of a highly loaded system, the article mostly uses our fraud-protection system (antifraud). It processes every incoming Yandex.Money operation, and that is far more than a hundred operations per second.
It is important to exclude from the processing of a user request anything that is not needed to produce the response: logging, calls to other services, and so on. Secondary actions can be performed in parallel, outside the main request thread. Following this rule is especially critical when the service has a response-time SLA.
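As a minimal sketch of this idea (the class and the console "sink" below are illustrative, not taken from our code), secondary work can be pushed into an in-memory queue and drained by a background consumer, so the request thread only does what the response requires:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative sketch: secondary actions (logging, notifications) go into a
// queue and are handled by a background consumer, keeping the request thread
// free to build the response.
public sealed class BackgroundLogger : IDisposable
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();
    private readonly Task _consumer;

    public BackgroundLogger()
    {
        // A single consumer drains the queue outside the request path.
        _consumer = Task.Run(() =>
        {
            foreach (var message in _queue.GetConsumingEnumerable())
                Console.WriteLine(message); // replace with a real log sink
        });
    }

    // Called from the request handler: a cheap, non-blocking enqueue.
    public void Log(string message) => _queue.Add(message);

    public void Dispose()
    {
        _queue.CompleteAdding();
        _consumer.Wait();
        _queue.Dispose();
    }
}
```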
Anything that takes a long time to initialize should not be doing so at the moment the first user request arrives. The antifraud, one of our most critical services, once had to load the entire set of rules at the start of every new request handler.
Imagine a simple network firewall that, when a new packet arrives, runs its contents through a long list of rules. In the antifraud's case, however, the number of rules and the conditions inside them grew to the point where loading them all at the start of every processing thread was impossible without a significant drop in the service's responsiveness. At peak load, CPU usage would even hit 100% for several minutes at a time.
In addition, long initialization threatens cascading performance degradation. When the first request arrives, the antifraud creates a separate handler for it and loads the full list of security rules into it, while the user is left admiring a "please wait" message.
The old process of responding to user requests when there are not enough ready handlers.
Now imagine that, before the rules have finished loading for the previous handler, transfer requests arrive from other users. The same operation is launched for each of them, and there can be quite a few of these duplicate initializations. All of this led to periodic interface delays of 30 seconds or more.
So we made the rules object a singleton and set it to load together with the main service. Now handlers start in a couple of milliseconds and do not affect the overall performance of the server (bursts stay within 10%). Initially we considered launching handlers proactively, but given how quickly each one now starts, this would not noticeably affect performance.
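In code, the idea looks roughly like the sketch below (RuleCache and LoadAllRules are hypothetical names; the real rule object and its loading are far more involved): the rules are loaded exactly once per process, and every handler reuses the same instance.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the "load the rules once" approach: the expensive
// load runs a single time per process, and every request handler reuses the
// same read-only instance instead of loading the rules itself.
public static class RuleCache
{
    private static readonly Lazy<IReadOnlyList<string>> Rules =
        new Lazy<IReadOnlyList<string>>(LoadAllRules, isThreadSafe: true);

    public static IReadOnlyList<string> Current => Rules.Value;

    private static IReadOnlyList<string> LoadAllRules()
    {
        // In reality the rules come from a database or files; triggering this
        // from application start-up (rather than from the first user request)
        // keeps the first real request fast.
        return new List<string> { "rule-1", "rule-2" }.AsReadOnly();
    }
}
```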
Of course, you should not stop there: optimization needs to be introduced at every level of the system. Incidentally, we later reworked the rule-processing mechanism itself and split the rules into online and offline logic (and servers), but that deserves a separate story.
When it comes to working with a server farm, you need to either teach the application to bind a user session to a specific node, or correctly handle logically related requests that come to different nodes.
Since pinning a session to a single node is bad both for fault tolerance and for user happiness, it is better to build a synchronization mechanism right away, so that any cluster node knows about the results of requests handled by its "colleagues". This can be done in different ways.
For example, when making a payment we want to know when the source account was opened. To do that, you either need to find the account-opening operation itself or save some of its data separately for later use. We use the first option.
The application must always be running, and it must be started in advance, before real requests appear. Moreover, the service should already be "warmed up" with test requests so that it can handle incoming calls with minimal delays right away.
The default IIS settings get in the way here: idle worker processes are periodically shut down. Even though they are started automatically on the next relevant request from outside, the application still needs time for initialization, for processing related requests, and so on.
At one point the web application was being restarted every 30 minutes on one of the nodes, so we wrote a PowerShell script that took the load off the service at the balancer, restarted the application, and ran several dozen test requests. The script then waited for initialization and, once the application started responding quickly enough, returned live traffic to the server. The algorithm is still relevant even though planned restarts are now much rarer.
So before starting the process or restarting the web application, you need to shift the load to another server, wait for the newly started one to warm up, and only then include it in handling incoming traffic.
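Our implementation is the PowerShell script mentioned above; as a sketch of the same warm-up loop (the URL, the 200 ms threshold, and the attempt limit are assumptions for illustration), the idea looks like this:

```csharp
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch of the warm-up idea (the real implementation is a PowerShell script):
// after a restart, hit the application with test requests and only report it
// ready once it answers quickly enough.
public static class WarmUp
{
    public static async Task<bool> WaitUntilWarmAsync(
        string url, int thresholdMs = 200, int maxAttempts = 50)
    {
        using (var client = new HttpClient())
        {
            for (var attempt = 0; attempt < maxAttempts; attempt++)
            {
                var stopwatch = Stopwatch.StartNew();
                await client.GetAsync(url);
                stopwatch.Stop();

                if (stopwatch.ElapsedMilliseconds <= thresholdMs)
                    return true; // fast enough: the node can take live traffic
            }
        }
        return false; // still slow: keep the node out of rotation
    }
}
```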
IIS has a limited set of service threads for working with disks, the network, and so on. These threads are used to process most user requests, and at some point there may simply not be enough of them.
IIS has two key thread-related settings that can limit performance:
Queue Length - adjusts the queue length for incoming requests.
In addition, there is database connection pooling, which is configured in the DBMS client on the web application's side. Here you need to pick the values carefully, otherwise you will end up with lots of unused connections to the database, each of which consumes memory.
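For ADO.NET, pooling is controlled via the connection string; a hedged example (the server name and pool sizes are placeholders, pick real values from load testing):

```csharp
using System.Data.SqlClient;

// Connection pooling is configured on the application side, in the connection
// string. The server name and pool sizes below are placeholders; choose real
// values from load testing, since every idle connection costs memory.
public static class Db
{
    private const string ConnectionString =
        "Server=sql-host;Database=antifraud;Integrated Security=true;" +
        "Pooling=true;Min Pool Size=10;Max Pool Size=100";

    public static SqlConnection Open()
    {
        var connection = new SqlConnection(ConnectionString);
        connection.Open(); // reuses a pooled connection when one is available
        return connection;
    }
}
```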
Besides the service threads there are worker threads (WorkerThreads): the ones the server application uses to start new handlers, for example in the antifraud. It is best to allocate exactly as many of these threads as the work requires, and the parameters can be roughly estimated by calculation.
If the calculation does not work out, it is better to run an experiment in a test environment, because simply setting the maximum values is not the best idea: you would be wasting server resources and increasing IIS initialization time.
Of course, such a calculation is only a starting point for evaluating your application; it does not account for load peaks in your environment, for instance. A complete assessment would involve many other parameters that are clearly beyond the scope of this article.
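For an ASP.NET application the worker-thread limits live in machine.config (the processModel section) and can also be adjusted from start-up code; a sketch with placeholder numbers:

```csharp
using System;
using System.Threading;

// Sketch: inspect and raise the CLR thread pool limits from start-up code.
// The numbers are placeholders; derive them from measurements, since oversized
// pools waste resources and slow down initialization.
public static class ThreadPoolSetup
{
    public static void Configure()
    {
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        Console.WriteLine($"Current minimum: {minWorker} worker / {minIo} IO threads");

        // Raise the minimum so the pool does not ramp up slowly during a burst.
        ThreadPool.SetMinThreads(Math.Max(minWorker, 50), Math.Max(minIo, 50));
    }
}
```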
Speaking of resources: the stable operation of a 64-bit application can be undermined by auxiliary libraries compiled for 32 bits, up to and including crashes of the main application. When debugging 64-bit applications, keep in mind that the built-in IIS emulator in Microsoft Visual Studio 2012 (and earlier) will not help, since it is 32-bit; to debug 64-bit problems you need to use a real IIS instance.
From an operations standpoint, the Yandex.Money antifraud needs a 1 + N cluster in which a change has to be made only once, regardless of the number of nodes. Experiments with choosing suitable rollout logic eventually settled into a working procedure.
Speaking of monitoring: the approach differs for every application, of course, but one of the tools we use is IIS logging. Once a day, a script scans the logs with Microsoft LogParser and builds a report with average response time and the count and ratio of response statuses (500, 200, and so on). I am not posting the code itself, because too much of it is tied to the realities of our system, but the idea from the article "LogParser, PowerShell and IIS files" was used as a basis.
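As a minimal sketch of the idea only (the executable path, log path, and query below are assumptions, and the production script does much more), LogParser can be driven from a small program:

```csharp
using System.Diagnostics;

// Sketch only: run LogParser over the IIS W3C logs and aggregate the count and
// average response time per HTTP status. Paths and the query are assumptions;
// the production report script is far more involved.
public static class IisLogReport
{
    public static void Run()
    {
        const string query =
            "SELECT sc-status, COUNT(*) AS Hits, AVG(time-taken) AS AvgMs " +
            "FROM C:\\inetpub\\logs\\LogFiles\\W3SVC1\\u_ex*.log " +
            "GROUP BY sc-status";

        var startInfo = new ProcessStartInfo(
            @"C:\Program Files (x86)\Log Parser 2.2\LogParser.exe",
            "\"" + query + "\" -i:W3C -o:CSV")
        {
            UseShellExecute = false
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
        }
    }
}
```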
It is important to unify the layout of applications on disk in advance, that is, to place the site in the same folders on every server: this simplifies further automation. To run an IIS farm you will need Shared Config, which stores a single IIS configuration for all nodes; it is better to keep it on the local drives of each server and synchronize it with scripts.
If Shared Config is placed on a network share, "special effects" are possible: for example, IIS may fail to read the configuration and shut down all the applications.
Shared data storage (in our case, the antifraud rules) may also be required. DFS replication is convenient for this: the application works with a local folder on its node, and the contents of that folder are synchronized to the other machines by DFS, which also resolves conflicts itself. For greater fault tolerance and easier upgrades you can use the Many-to-Many synchronization scheme.
The key element of any distributed, high-load system is the balancer of incoming requests, for example requests from users to frontend systems. Such a balancer can be of two types.
Using Microsoft's web server is convenient for those who already use the same vendor's application stack and development tools. If your application does not need large data structures, memory costs on the hosts will be minimal. Moreover, most of the necessary options and features come out of the box, which is clearly simpler than assembling equivalents from open-source components.
And if the application is written in the same .NET, why fight to run it on a "foreign" platform?
Source: https://habr.com/ru/post/339976/