"And if you do not shoot, then I will spoil"
Not so long ago, the prevailing attitude was that a service simply had to work. We wrote the code, polished it, added the scripts - everything looks fine, time to roll it out to production.
But competitors don't sleep, so a race begins not only for new features but also for speed. Any application lag or slow server response (not to mention HTTP 500 errors popping up) spoils the impression of the service and pushes the user to go elsewhere. Surely everyone has faced a situation where, instead of buying a plane, train or concert ticket, "Internal server error" appeared on the screen and you wanted to smash the monitor in a rage.
I am Viktor Bodrov, I work in the performance research team at Yandex.Money, and I want to talk about why it is useful to study performance directly in production.
Every minute of downtime is disastrous for a company, especially one involved in financial operations such as transfers, payments and purchases in stores. Every stuck payment means not only monetary losses but also reputational ones. In such conditions high uptime is essential, which means keeping a finger on the pulse of the payment system and constantly monitoring all of its indicators, with particular attention to performance.
First, knowing the performance of your own service helps you prepare for the launch of new features, promotions and sales. Reliable figures let you meet the influx of new users at the right moment not with a narrow gate and a turnstile, but with the front doors wide open and open arms.
Second, a good service should know its performance limit at any moment, so measurements should be regular. If you do this often enough and keep the data up to date, you will not miss degradation in your service and will be able to quickly restore the required performance.
Third, with actual performance data in hand, it is easier for the business to plan the development of the service and choose directions for growth.
For those facing this problem for the first time, the question arises: where and how should performance be measured? A test environment is often used for such experiments, and some companies have a dedicated performance test bench. If you have one, great! If it matches production one-to-one, even better. But most often it is very expensive to maintain a bench that fully corresponds to production. So what to do? It turns out the only place that fully corresponds to production is production itself.
For many, the proposal to measure and test something directly in production sounds like heresy. Don't be alarmed: if everything is done carefully, nothing bad will happen. For example, we assess the possible risks in advance and determine how the experiment might harm our system. At the same time, we plan how to reduce the danger and constantly monitor the state of the system.
Research in production does not rule out using a test bench. It is useful for checking releases and for special experiments that study the performance of microservices, individually or in various combinations.
The main and indisputable advantage of results obtained in production is that they are the most honest of all the options and the closest to the actual processing of user traffic. No matter how close a test bench gets to production in its characteristics, like Achilles chasing the tortoise, it never quite catches up. When you run research in production, you use the same databases as real users, the same network, the same environment. There is no need to build anything: it is all already set up and working.
The data from such experiments will interest all engineers, regardless of their role: developers, testers, admins. Guaranteed performance figures will also interest the business - for potential customers they are convincing advertising for the service.
For the safe and proper organization of such experiments, you must perform several mandatory steps.
The first and most important step is selecting the scenarios to study. These can be either single requests (for example, a balance check) or scenarios with complex logic, where each next request depends on the results of the previous one (payment for goods in a store, transfer from wallet to wallet). We regularly go through the list of all business processes that exist in the system - we have more than 400 of them - and agree on scenario priorities based on business goals.
Which scenarios should be included in the priority group?
This way you can form a pool of priority scenarios for regular checks in production. In our company, we fire at them at least once a quarter.
The set of techniques and tools is chosen depending on the logic of the scenario. In our case, the priority scenarios have branching logic. For example, when paying by card, various checks are carried out, and depending on their results the scenario follows one branch or another, so we use JMeter to implement them. It is convenient for such complex scenarios, where each next request depends on the result of the previous one. If you want to fire single requests, I recommend the high-performance Phantom.
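We express this branching logic in JMeter, but purely as an illustration of the idea - a scenario where each request depends on the result of the previous one - here is a minimal Python sketch. The endpoints, field names and base URL are hypothetical and are not real Yandex.Money APIs.

```python
# Illustrative sketch of a chained scenario: each request depends on the
# result of the previous one. Endpoints and field names are hypothetical;
# in practice we express this logic in JMeter.
import requests

BASE = "https://example.org/api"  # hypothetical service URL


def pay_in_store(session: requests.Session, wallet_id: str, amount: int) -> None:
    # Step 1: create a payment; the response carries an id needed later.
    r = session.post(f"{BASE}/payments", json={"wallet": wallet_id, "amount": amount})
    r.raise_for_status()
    payment_id = r.json()["payment_id"]

    # Step 2: the next call is only valid for the payment created above.
    r = session.post(f"{BASE}/payments/{payment_id}/confirm")
    r.raise_for_status()

    # Step 3: branch on the state returned by the service, as a real scenario would.
    state = session.get(f"{BASE}/payments/{payment_id}").json()["state"]
    if state == "3ds_required":
        session.post(f"{BASE}/payments/{payment_id}/3ds", json={"code": "000000"})


if __name__ == "__main__":
    with requests.Session() as s:
        pay_in_store(s, wallet_id="test-wallet-1", amount=100)
```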
Studying user scenarios may require special test users on whose behalf the requests will be made. If you use a single user or only a handful of them, you can run into data caching that distorts the results. The more distinct users, the more accurate the research data.
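A minimal sketch of one way to spread requests across a pool of dedicated test accounts; the file name and its columns are assumptions made for illustration.

```python
# Sketch: draw each virtual user from a pool of dedicated test accounts so
# that requests are spread across many users and per-user caches do not
# flatten the results. The file format and account fields are assumptions.
import csv
import random


def load_test_users(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))  # e.g. columns: login, token, wallet_id


def pick_user(users: list[dict]) -> dict:
    # A uniform choice is the simplest option; weighting is possible if some
    # accounts must be used more often than others.
    return random.choice(users)


if __name__ == "__main__":
    users = load_test_users("test_users.csv")
    print(pick_user(users)["login"])
```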
In the second step, we choose how the load intensity will be fed to the input.
For example, before a sale we determine which payment methods users will mainly use. To tune the system and track down bottlenecks, we fire at specific types of payments. As a rule, user activity is skewed towards one scenario or another, and by testing it you get a clear picture of the service's behavior under load.
But the business may also be interested in the overall performance picture. For such a case, you can combine the most popular business scenarios in proportion to their use by real users and apply them as a combined load. Keep in mind that a quantitative assessment becomes harder here: instead of one specific performance number, you get a range of numbers, which in turn may vary depending on the share of a particular scenario in the overall flow.
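A sketch of how such a proportional mix might be drawn per request; the scenario names and weights below are invented for illustration, not our real traffic shares.

```python
# Sketch: mix scenarios in roughly the same proportions as real traffic.
# The scenario names and weights are made up for illustration.
import random
from collections import Counter

SCENARIO_WEIGHTS = {
    "card_payment": 0.55,      # the most popular business scenario
    "wallet_transfer": 0.30,
    "balance_check": 0.15,
}


def next_scenario() -> str:
    names = list(SCENARIO_WEIGHTS)
    weights = [SCENARIO_WEIGHTS[n] for n in names]
    return random.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Sanity check: over many draws the mix should approach the target shares.
    print(Counter(next_scenario() for _ in range(10_000)))
```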
The way the load is supplied can also differ. I will focus on the two most common profiles: a test with linearly (or stepwise) growing intensity, and a stability test, which verifies the long-term operation of the service under a consistently high load. The second option requires a long run, which is not always possible in a combat environment, and besides, the intensity level to supply must already be known for it.
(Chart: X axis is time, Y axis is load intensity in requests per second.)
It is good when there is a defined SLA on the basis of which you can run checks, track performance and response time, and watch the behavior of components. More often, though, the performance level is unknown and has to be determined. For that, use the first option - a test with increasing intensity: we start the input stream, increase it linearly or in steps, and watch how the service behaves. A linearly applied load lets you pinpoint the saturation point and the degradation point more precisely; this is described in more detail in our earlier article. A stepwise profile, in turn, effectively includes small stability tests, especially if the steps are long. It is not recommended to apply a large load to the input right away; it is better to "warm up" the service, gradually increasing the input flow.
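As an illustration of the two profiles, here is a small sketch that computes the target intensity at a given moment of the test; the specific numbers are placeholders, not recommendations.

```python
# Sketch of the two intensity profiles discussed above: target requests per
# second as a function of elapsed time. The concrete numbers are placeholders.
def linear_profile(t: float, start_rps: float = 10, end_rps: float = 500,
                   duration: float = 1800) -> float:
    """Linearly growing load from start_rps to end_rps over `duration` seconds."""
    if t >= duration:
        return end_rps
    return start_rps + (end_rps - start_rps) * t / duration


def step_profile(t: float, start_rps: float = 50, step_rps: float = 50,
                 step_len: float = 300) -> float:
    """Stepwise growing load: +step_rps every step_len seconds, so each step
    is effectively a short stability test."""
    return start_rps + step_rps * int(t // step_len)


if __name__ == "__main__":
    for t in (0, 600, 1200, 1800):
        print(t, linear_profile(t), step_profile(t))
```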
You can also run a series of two experiments. First, measure the saturation point with a linear load and stop there. You should not keep pushing the load all the way to the degradation point - this is still production, not a test bench. The second experiment is to look at the behavior of the service under a step load, choosing a few steps near the saturation point. Then, as time permits, continue with a stability test, choosing a load 15-20% below the saturation point (or below the degradation point, if it happened to come before saturation). Climbing higher is dangerous.
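A sketch of how the saturation point could be estimated from a series of measurements (offered load, achieved throughput, p95 response time); the 5% gain threshold is an assumption, not a fixed rule.

```python
# Sketch: locate the saturation point from a series of measurements, i.e. the
# step after which throughput effectively stops growing while response time
# keeps rising. The thresholds are illustrative, not production values.
def find_saturation(samples: list[tuple[float, float, float]],
                    min_gain: float = 0.05) -> float | None:
    """samples: (offered_rps, achieved_rps, p95_latency_ms), sorted by offered load.
    Returns the offered RPS after which throughput gain drops below min_gain."""
    for prev, cur in zip(samples, samples[1:]):
        gain = (cur[1] - prev[1]) / prev[1]
        latency_grew = cur[2] > prev[2]
        if gain < min_gain and latency_grew:
            return prev[0]
    return None  # no saturation observed in this range


if __name__ == "__main__":
    data = [(100, 98, 40), (200, 195, 45), (300, 280, 60),
            (400, 290, 140), (500, 292, 380)]
    print(find_saturation(data))  # -> 300
```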
Next, you should choose the time of the experiment. One of the most important conditions for measuring performance in production is the safety of all real users. Situations where you can take the service down for maintenance and calmly fire at it are extremely rare. As a rule, online services are built to work 24/7, so you have to fit into the service's usage pattern.
Logically, the higher the real user activity, the greater the risk that the firing will lead to downtime and financial losses. On the other hand, the lower the user traffic, the smaller the measurement error. So, to minimize the influence of the experiments, it is recommended to run them during periods of reduced user activity.
As practice shows, our users' activity is lowest between two and seven in the morning. Of course, every service has its own specifics and its own audience, so we choose the firing window by watching user behavior. It is not always possible to run experiments in the chosen optimal period. Firing in production, especially when you are just introducing it, requires increased control, and that causes difficulties: your colleagues are people too and cannot always work at night. This calls for a compromise - a time that suits everyone and still satisfies the condition of low user activity.
If the service relies not only on internal computations but also interacts with third-party services (counterparties), you need to decide how you will fire: together with the counterparty, or using a service stub instead. Naturally, if you plan to fire at the counterparty's servers as well, everything must be agreed in advance. This greatly complicates preparation, but it makes the resulting measurements more trustworthy.
And vice versa: if you replace the counterparty's service with a stub, preparation becomes much simpler, but the fidelity of the results decreases. Note that the stub should imitate the counterparty's behavior as closely as possible, not just return 200 OK.
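A minimal sketch of such a stub in Python: it adds a plausible response delay and an occasional error instead of instantly answering 200 OK. The delay range, error rate, response body and port are assumptions for illustration.

```python
# Sketch of a counterparty stub that imitates realistic behaviour instead of
# instantly returning 200 OK: it adds a response delay and a small share of
# errors. Delays, error rate and the endpoint are illustrative assumptions.
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class CounterpartyStub(BaseHTTPRequestHandler):
    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))  # drain the body
        time.sleep(random.uniform(0.08, 0.30))   # typical counterparty latency
        if random.random() < 0.01:               # occasional declines/errors
            self.send_response(503)
        else:
            self.send_response(200)
        body = b'{"status":"accepted"}'
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8081), CounterpartyStub).serve_forever()
```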
Counterparties themselves differ. Some readily agree to joint tests, others push every step through many levels of approval. Choosing the time of the experiments can also cause friction: some government organizations, for example, agree to work only from 9:00 to 18:00.
In this part, we will discuss access and coordination with all responsible persons - the security service, financial departments and system administrators.
You need to check that you have the necessary access. Make sure nothing will get in the way of the research and, if necessary, request access from the network administrators - both your own and the counterparties', if you work with them. Network admins will also help with balancing. We once had a load balancer switched from round-robin to ip-hash, and as a result all our requests landed on the same frontend, chosen by the new balancing algorithm.
Once access is granted, debug and verify the scenario at a minimal, single-unit load.
The next step is approvals. First of all, contact the security service so that your experiment is not cut off at takeoff because of "suspicious activity". To assess all possible risks, the security team will need a detailed firing plan: who and what participates, and in what quantities.
Next, agree on the firing plan with the financial and commercial departments. If the service is involved in financial activity, coordination with the finance department and accounting is required: any additional financial activity may affect the financial statements or even break the generation of various transaction reports. This must be avoided, so warn your colleagues and work out an experiment configuration that suits everyone.
If you have a statistics department that accumulates information about the operation of the service, coordinate the firing with them as well: the load flow will cause an extra wave of statistical data. Agree on whether they will include the tests in their reports or not, and if not, decide how real user data will be separated from test data.
When planning, also agree on the date and time of the tests with the commercial department: do they have any advertising or promotions scheduled at or near that time? Don't forget to inform the team leads about all planned and unplanned activities in production. Naturally, you also need to warn and coordinate with the admins, since their participation may be required during the firing. Besides, it is the admins who know about everything happening in production: a data-center switchover, server replacement or other work may be scheduled exactly at the time you have chosen.
Finally, discuss the firing with the monitoring team. Determine what to watch during the experiments, under what conditions to stop, and which alerts to react to. This must be settled "on the shore", before the firing begins.
There are several reasons to stop; a minimal sketch of such checks follows the list.
1) On a signal from monitoring. In this case it doesn't matter whether the functionality under test breaks or an abnormal situation occurs at the other end of the service: stop the tests and investigate the reasons, because the uninterrupted operation of the service is one of the main priorities.
2) On growth of network or HTTP errors. This is an emergency requiring intervention.
3) When saturation is reached: throughput no longer grows while response time keeps increasing. There is no need to wait for a breakdown and take production down - the result for analysis is already there, so you can safely stop the experiment.
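A minimal sketch of such autostop checks over a monitoring window; the error-rate and latency thresholds are examples and should be agreed with monitoring in advance, not taken as universal values.

```python
# Sketch of an "autostop" check matching the stop conditions above: abort the
# firing when the error rate or response time goes beyond agreed thresholds.
# The threshold values are examples, not universal recommendations.
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    http_errors: int        # 5xx and network-level failures
    p95_latency_ms: float


def should_stop(stats: WindowStats,
                max_error_rate: float = 0.02,
                max_p95_ms: float = 1000.0) -> str | None:
    """Return a human-readable stop reason, or None to keep firing."""
    if stats.requests == 0:
        return "no responses at all - generator or service is down"
    error_rate = stats.http_errors / stats.requests
    if error_rate > max_error_rate:
        return f"error rate {error_rate:.1%} exceeds {max_error_rate:.0%}"
    if stats.p95_latency_ms > max_p95_ms:
        return f"p95 latency {stats.p95_latency_ms:.0f} ms exceeds {max_p95_ms:.0f} ms"
    return None


if __name__ == "__main__":
    print(should_stop(WindowStats(requests=5000, http_errors=180, p95_latency_ms=420)))
```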
After the experiment you may realize that the logs and results are not enough and the experiment has to be repeated with debug logging enabled. This makes the logs heavier and increases disk writes, but by now you know the required load level, so instead of a long test you can get by with shorter ones.
Finally, it remains to analyze the results and hand the data over to the interested parties. You can start doing this while the experiments are still running. We use Zabbix and Elastic together with Grafana and Kibana for analysis. We track the timing of all external and internal calls involved in the experiment, watch connection pools and queues, and monitor the database. For online tracking of metrics from the traffic generator we use the Yandex Lunapark service (there is an open analogue, overload.yandex.net).
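For offline analysis, here is a small sketch of the percentile calculation that would then end up on the dashboards; the input format (a plain list of response times in milliseconds) is an assumption.

```python
# Sketch: basic offline analysis of response-time samples collected during the
# firing - the kind of percentiles that then go onto Grafana/Kibana dashboards.
# The input format (a list of latencies in milliseconds) is an assumption.
import statistics


def percentiles(latencies_ms: list[float]) -> dict[str, float]:
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": qs[49],
        "p95": qs[94],
        "p99": qs[98],
        "max": max(latencies_ms),
    }


if __name__ == "__main__":
    sample = [35, 40, 42, 45, 50, 55, 60, 80, 120, 900]
    print(percentiles(sample))
```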
How you present the results depends on who needs them. Developers, admins and testers need a detailed report with precise metrics, graphs, logs and response-time spectra. For the business, the bottom line and growth forecasts matter, so concrete, well-highlighted figures work better and are more visual. For this you can use the traffic-light principle: the red zone means things are bad and urgent optimization is needed; yellow means we have noticed degradation of the indicators and should pay attention; green means everything is fine, moving on. Presenting the research results in an understandable, accessible form helps remove any questions about the importance and usefulness of performance measurement.
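A tiny sketch of the traffic-light idea applied to a single headroom figure (measured capacity versus the level the business requires); the thresholds are illustrative, not our actual criteria.

```python
# Sketch of the "traffic light" presentation: map measured headroom to
# red/yellow/green for the business summary. The thresholds are examples.
def traffic_light(measured_rps: float, required_rps: float) -> str:
    headroom = measured_rps / required_rps
    if headroom < 1.0:
        return "red: below the required level, urgent optimization needed"
    if headroom < 1.3:
        return "yellow: degradation risk, keep an eye on this service"
    return "green: comfortable headroom, moving on"


if __name__ == "__main__":
    print(traffic_light(measured_rps=900, required_rps=800))
```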
Good luck with your research, and remember the safety of your users!
Source: https://habr.com/ru/post/445402/