
[Updated] How load testing our processing cost us €157,000, and why nobody was fired


Our team once decided to try "firing" at a production payment system with real money. This would show how much load we and our partners can withstand together. I want to tell you about this ambiguous experience.


Two factors prompted the study: we had our own card processing, and one of the most popular online retailers in Russia had a big sale coming up.


The idea looked quite cheap: about 125,000 rubles (1 ruble per operation). But who knew how it would all turn out. What makes this story special is that all information about the experiment was kept closed for a long time and is being published openly for the first time.


As it turned out, we were not alone: six months ago none of our partner banks knew their real "ceiling", and they solved problems as they arose. In practice they simply added new capacity right in the middle of a sale, and not always successfully.


Future victims


A card payment system (Mastercard in our case) connects many external organizations, each with its own zoo of systems. Almost all of them come under load from card payments, so let us look at all the participants right away.


Below is a simplified diagram of the interaction of several organizations to ensure that you can pay for some goods with a card through the Yandex.Money service:



All the key participants of the experiment in one diagram. Yandex.Tank is Yandex's tool for load testing and convenient performance analysis.


Yandex.Money accepts the user's bank card details (in the case of a test, a pre-filled form template is submitted), and then the following happens:


  1. The completed form with the card data is sent to the frontend servers, which process it and pass the data to the backend and then to the core responsible for working with accounts.


  2. The payment service blocks the transaction amount on the user's account and, in parallel, sends the transaction information to the acquiring bank, which passes it on to Mastercard and NSPK.


  3. A day later, NSPK sends the file of locks to the acquiring bank and to Yandex.Money, after which our microservices parse the list of locks and the money is written off from the user's account.

During large-scale online sales there can be hundreds of such payments per second, so every link in the payment chain takes the hit. So that nobody would have complaints, we agreed in advance with a couple of friendly banks that "live firing" would take place on certain dates and that their staff would be at their posts and fully prepared.


Looking ahead: after the first experiment of 17,000 operations it turned out that the banks scale their payment gateways horizontally, adding new capacity as needed. In normal life they have time to react to a gradual increase in load, but in stress mode that was not enough.

There is one subtle point. All the banks and Mastercard earn money on transfer fees, and making that commission zero for the tests was almost impossible. According to our estimates, the total should have come to 0.7% for a specific MCC (merchant category code).


A note on the commission

This commission was the lowest figure that all the parties involved could agree on at the time of the experiment.


Remember this figure: at the end of the article you will have something to compare it with.


Shall we fire at our processing?


Testing the performance of payment services inevitably runs into a number of nuances and limitations, and these forced us to take the experiment to production:


  1. The performance of key microservices (for example, those that encrypt card data) is limited by the license.


  2. The location of the microservices is also limited by the license. That is, a server with a purchased license cannot even be moved to the test environment.


  3. None of our partner banks had ever done anything like this or investigated its own capacity, so there was nobody to ask for wise advice. What helped was close cooperation with Mastercard and grains of knowledge from individual people's experience.

As preparation, we also turned off the additional 3-D Secure protection of payments (SMS codes), removed the limits on the number and amount of transactions at every stage of the payment, and agreed with the participants' security services that they would not raise the alarm.

The project was all the more important for Yandex.Money because, besides the practical benefit, it was a chance to do something useful for the entire Russian banking industry: the results and methods of the study can be useful to others, not only to us. Besides, I personally had always been curious how our processing would behave if it were "fired at".


Damage radius


As you can guess, nobody would have allowed the team even a potential stop of the service. Not even for a few seconds.


Therefore, the firing had to be careful and predictable:



An MCC (Merchant Category Code) is a 4-digit number assigned by Visa, Mastercard and other payment systems to classify the type of business of a point of sale in bank card payments. For the owner of such a point of sale it is important to get the most favorable MCC and pay a lower commission to the payment system.

All these measures ensured a good level of resiliency of the payment service for users, even at peak loads.



The general scheme of the experiment: multithreaded top-ups of wallets from the bank cards of other wallets.


For the experiment we agreed on a load intensity of 20 RpS (requests per second). Higher performance can be achieved by adding new processing modules, but the experimental setup had only one.
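To get a feel for what a fixed intensity means in practice, here is a minimal sketch of pacing requests at a constant rate. It is not how Yandex.Tank is implemented; the function names `send_offsets` and `run_duration_seconds` are made up for illustration, and the duration estimate assumes the whole run is held at exactly 20 RpS.

```python
from itertools import islice


def send_offsets(rps):
    """Yield the planned send time (seconds from start) of each request."""
    n = 0
    while True:
        yield n / rps
        n += 1


def run_duration_seconds(total_requests, rps):
    """Length of a run if the generator holds exactly `rps` the whole time."""
    return total_requests / rps


RPS = 20  # the intensity agreed for the experiment

# The first few planned send times are 1/20 s apart.
first = list(islice(send_offsets(RPS), 4))
print(first)  # [0.0, 0.05, 0.1, 0.15]

# Hypothetical: 17,000 operations held at a constant 20 RpS.
print(run_duration_seconds(17_000, RPS) / 60, "minutes")
```

A real load generator also has to deal with slow responses and retries, which this sketch ignores on purpose.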


Yandex.Money collaborates with several partner banks, but only one of them agreed to a large-scale experiment with real money.



The graph shows the processing module's performance growing as the input intensity increases, with a dip in the middle exactly at the moment when the performance of our processing dropped.


In addition, the tests caused some problems in our backend and in the system core by creating a lot of locks on card accounts. A lock is normal for any card transaction: before the actual write-off, the money is simply blocked on the user's account, and a record of this goes into the general Mastercard file.


Each acquiring bank receives such a file from Mastercard on a fixed, agreed schedule and then parses it. After the experiments, the file, which is usually about 20 MB, grew roughly fivefold to 104 MB. Working through such a list took more resources, so we had to rewrite individual modules of the microservice that processes the lock file.


We also slightly optimized the database queries, which reduced the CPU load, and issued more cards to reduce the number of locks per card.


We continue shelling


The second wave of the experiment went more smoothly and calmly, even though the number of processed payments more than doubled.



The graph is more uniform, which shows that the measures we took worked. RpS stays at 20, since that is the maximum agreed for the experiment.


After the flow of 24,778 transactions ended, the number of locks on each card kept growing, which delayed payments: before each write-off, the processing had to re-read the entire list of locks for that card. The solution was to increase the number of cards from 50 to 10,050, which cut the list of locks per card from 200 to 1 for a similar number of operations.
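Why more cards help can be estimated with a back-of-the-envelope model. The sketch below rests on loudly stated assumptions: operations are spread evenly across cards, each write-off re-reads the locks still pending on its card, and the round figure of 10,000 operations is picked only because it matches the quoted 200 locks per card on 50 cards. The function names are illustrative.

```python
import math


def locks_per_card(operations, cards):
    """Largest lock list on a card if operations are spread evenly."""
    return math.ceil(operations / cards)


def lock_reads_per_card(locks):
    """Lock records read while writing off `locks` locks one by one,
    assuming each write-off re-reads every lock still pending."""
    return locks * (locks + 1) // 2


OPERATIONS = 10_000  # assumed round figure, not from the experiment logs

before = locks_per_card(OPERATIONS, 50)
after = locks_per_card(OPERATIONS, 10_050)
print(before, lock_reads_per_card(before))  # 200 20100
print(after, lock_reads_per_card(after))    # 1 1
```

Under these assumptions the work per card grows roughly with the square of the lock list length, which is why shortening the list pays off so quickly.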


The next wave of tests brought 50,000 operations. Loading the write-offs into the processing database took 40 minutes, and they still had to be processed after that. The lock file kept growing ominously with each experiment, while the key processing database runs on Oracle with a limit of 4 GB per file. The limit was still far away, but it started to feel uncomfortable.


In a separate experiment we evaluated the performance of write-off processing. For a day we ran tests at an intensity of 15 locks per second, followed by the stream of write-offs. The file with the write-offs reached us at 18:00 the next day, 1.5 hours late, and our processing worked through all 1,135,000 records in 2 hours and 10 minutes. For contrast, parsing an average list of locks usually takes ten minutes.
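These figures translate into a processing rate with simple arithmetic. The sketch below only restates the numbers from the paragraph above; the function name is illustrative, and comparing the result with the 15 locks per second of the incoming stream is our own framing rather than a measured metric.

```python
def records_per_second(records, hours, minutes):
    """Average processing rate over the given wall-clock time."""
    seconds = hours * 3600 + minutes * 60
    return records / seconds


rate = records_per_second(1_135_000, 2, 10)
print(round(rate, 1))  # 145.5 records per second

# How many times faster write-offs were processed than locks were created.
LOCKS_PER_SECOND = 15
print(round(rate / LOCKS_PER_SECOND, 1))  # 9.7
```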


Problems also came up with the performance of the antifraud engine and the frontend request balancer. The balancer did not check whether the service on a node was logically available; it only checked that the node was present on the network.
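The difference between "present on the network" and "logically available" can be sketched as two separate checks. This is an illustration, not our balancer's configuration: the idea of a dedicated health URL and the function names `is_reachable` and `is_serving` are assumptions.

```python
import http.client
import socket
import urllib.request


def is_reachable(host, port, timeout=1.0):
    """Network-level check: does anything accept a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_serving(url, timeout=1.0):
    """Application-level check: does the health URL answer with HTTP 200?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # urllib's URLError/HTTPError (e.g. a 503) and timeouts are OSErrors.
        return False
```

A node whose service answers 503 passes the first check and fails the second, and that is exactly the case a balancer needs to catch before sending it traffic.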


In addition, the massive shelling made logs grow everywhere, which put an extra durability test on our log collection system on EHK (Elasticsearch / Heka / Kibana), which we wrote about recently.


The culmination was a two-day experiment with a total of 1,400,000 operations, on the second day of which two things were happening at the same time:



Together these two activities loaded the processing to the full 20 RpS allowed by the existing license restrictions.


The live firing ended after two days at 3,157,800 operations. However, we were not given time to celebrate the success and admire the numbers.


Hello from Mastercard


We were billed €157,890 in commission for the operations performed, which did not fit at all with the agreed 0.7% of 125,000 rubles.



When ordering the terminal for the test operations we specified the wrong MCC, which is where the incredible commission came from.


And here is the reason for the outrage: we chose the wrong MCC for the acquiring terminal through which all the test firing went. As a result, for each 1-ruble operation we paid 4 rubles in commission. There was no way to learn about the problem during the experiment, because Mastercard billed us a week later.
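The gap between the plan and the bill is easy to put into numbers. The sketch below uses only figures quoted in the article (the 0.7% agreed rate, the 125,000-ruble budget, and 4 rubles charged per 1-ruble operation); the variable names are illustrative, and no exchange rate is assumed, so the euro total is left out on purpose.

```python
from fractions import Fraction

agreed_rate = Fraction(7, 1000)  # the agreed 0.7% of the amount
operation_amount = 1             # rubles per test operation
charged_per_operation = 4        # rubles actually charged per operation

# What the plan implied.
expected_per_operation = agreed_rate * operation_amount
expected_for_budget = agreed_rate * 125_000

# What the wrong MCC produced.
effective_rate = Fraction(charged_per_operation, operation_amount)

print(float(expected_per_operation))        # 0.007 rubles per operation
print(float(expected_for_budget))           # 875.0 rubles for the whole budget
print(float(effective_rate) * 100)          # 400.0 percent of the amount
print(float(effective_rate / agreed_rate))  # about 571 times the agreed rate
```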


The misunderstanding cost us 2 months of hard manual work to resolve the issue with Mastercard. Luckily, we had logged the entire experiment, and those logs were used to find and correct the operations on Mastercard's side. One more confirmation that you get nowhere without detailed logs.


Although the experiment took place some time ago and brought minor financial losses, the experience was definitely positive. Moreover, such live firing has become a regular practice, and the data obtained will help us significantly increase the performance of card operations.


Of course, all the characters and numbers in the article were creatively reworked for security reasons, but we did not let our imagination run too wild and kept the ratios of the experimental data.



Source: https://habr.com/ru/post/329926/

