Andrei Morevsky, system architect at Dodo Pizza, a Yandex.Kassa partner, told us the story of the company's notorious technical error. I'll hand the microphone straight to the author.
I was on the Sapsan train, on my way to open the first Dodo pizzeria in St. Petersburg, when I suddenly got an alert about mass cancellations of paid orders. And not just a handful: in one hour our system had managed to roll back supposedly paid orders worth 8 million rubles!
Now this story only raises a smile, but that morning it was not funny at all. So I want to share some technical details of the incident and the conclusions we drew, and along the way tell you a little about the Dodo Pizza order processing system.
That morning, we immediately contacted the Yandex.Kassa team and learned that such situations usually involve two or three transactions and are resolved one by one. Our case looked difficult, because for each such operation the payment system team has to contact the specific bank and exchange requests with it. It was especially galling to realize that we had refunded money we had never received: these were test orders.
Our server canceled payments worth 7.84 million rubles. For a chain with an annual turnover of almost 3 billion, this is serious money. It is also more than 10% of the investment raised in the last round. You will agree that this is too high a price for one mistake.
The same day, the chain's founder, Fyodor Ovchinnikov, wrote about the incident on social media, and the story quickly spread across news sites.
While we were doing everything possible to get the money back, readers wondered how this could have happened at all and eagerly picked out reprisals for the "programmers." I personally received about a dozen worried messages from acquaintances: they asked whether I was all right and offered me vacancies "just in case."
Over the weekend, everyone got involved in the problem, right up to the top management of Yandex.Kassa. The payment system team managed to agree with partner banks on operations to cancel the erroneous refunds. It was a really big job: more than ten thousand transactions had to be sorted through manually.
The money was eventually returned to us, although we lost 150 thousand rubles in bank transfer fees, and another 40 thousand went on SMS notifications to customers.
Today this system, Dodo IS, serves 183 pizzerias in nine countries around the clock. In five years of development, we have gone from a primitive order-taking module to a full-fledged cloud ERP system that manages orders, kitchen work, shift scheduling, stock, and finances: almost every aspect of our business.
We test Dodo IS in several environments: there are “sandboxes” for demos to product managers, and there are integration environments. Before deployment to production, the final version is tested by analysts and QA in the stable environment. For verification, we use real data that we regularly copy from the production database. Of course, all data is depersonalized when it crosses the boundary between production and a test environment.
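To make the depersonalization step concrete, here is a minimal sketch of how a record copied from production might be masked. The field names, the salt, and the `depersonalize` function are all hypothetical; the article does not describe the real Dodo IS schema or tooling.

```python
import hashlib

# Hypothetical list of personal fields; the real schema is not described in the article.
SENSITIVE_FIELDS = ("name", "phone", "email")


def depersonalize(record: dict, salt: str = "stable-env") -> dict:
    """Return a copy of the record with personal fields replaced by stable pseudonyms."""
    masked = dict(record)
    for field in SENSITIVE_FIELDS:
        if masked.get(field) is not None:
            raw = (salt + str(masked[field])).encode("utf-8")
            # Hashing keeps equal values equal, so joins between tables still work.
            masked[field] = f"{field}_{hashlib.sha256(raw).hexdigest()[:12]}"
    return masked


order = {"order_id": 42, "name": "Ivan Petrov", "phone": "+7 900 000-00-00", "amount": 745}
safe = depersonalize(order)
```

Non-personal fields such as the order identifier and the amount pass through unchanged, so the test data keeps its realistic shape.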
In the stable environment, we try to reproduce production conditions in full, including the cancellation of payments that are not attached to orders. In real life, such payments can appear because of errors in the payment process or because a user did not complete the order cancellation procedure correctly. To check how cancellation works in the test environment, a special scheduled task runs and cleans up these leftovers.
The day before the incident, two unlucky coincidences happened at once, in the best traditions of Murphy's law:
because of a configuration error, the background task was pointed not at the imitation of the payment service but at the real connection to Yandex.Kassa;
So the cleanup task dutifully started, went through all the transactions, and found the ones that needed to be canceled. And ten thousand cancellation requests arrived at Yandex.Money.
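One way to picture this failure mode, and a possible guard against it, is a cleanup task that refuses to run when a non-production environment is wired to a real payment client. This is only an illustrative sketch, not what Dodo IS actually does: the class names, the `is_real` flag, and the environment names are assumptions.

```python
class ConfigurationError(Exception):
    """Raised when a test environment is wired to a live payment service."""


class FakePaymentClient:
    """Imitation of the payment service: records cancellations, sends nothing."""
    is_real = False

    def __init__(self):
        self.cancelled = []

    def cancel(self, payment_id):
        self.cancelled.append(payment_id)


class RealPaymentClient:
    """Hypothetical stand-in for a client bound to the acquirer's live endpoint."""
    is_real = True

    def cancel(self, payment_id):
        raise NotImplementedError("would call the live payment service")


def run_cleanup(environment, client, unattached_payment_ids):
    """Cancel payments not attached to orders; return how many were cancelled."""
    # Fail fast, before the first request leaves the test environment.
    if environment != "production" and client.is_real:
        raise ConfigurationError(
            f"{environment} environment must not use a real payment client"
        )
    for payment_id in unattached_payment_ids:
        client.cancel(payment_id)
    return len(unattached_payment_ids)
```

The check sits in the task itself rather than in the configuration, so a wrong setting stops the task instead of silently changing its target.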
At not a single meeting with the company's management, then or later, did anyone raise the question of punishing those responsible.
“Of course, we will draw the most serious conclusions from this critical error. We will not punish people; we will simply do everything to make sure this does not happen again.”
From a post on Fyodor Ovchinnikov's Facebook and VKontakte pages immediately after the incident.
Fear of punishment sooner or later paralyzes the work of any company. I am sure many of you have come across companies where people write piles of documents and letters just to stay as far as possible from the “blast radius”, and where no manager is ready to make even a trifling decision, let alone a bold one, without 20 approvals. I believe such companies are incapable of creating and developing anything; they can only spend years “finishing off” the resources created by their bolder predecessors and pioneers.
Our right to make a mistake does not mean the right to work carelessly or to cut corners. Above all, it means trust: the confidence that no employee makes a mistake out of malicious intent.
“If there is a possibility that some kind of trouble can happen, then it will definitely happen.”
Murphy's Law.
Trust does magical things to people: there was not a single employee who did not take this incident to heart, did not feel it as their own pain, or did not offer to help.
To sustain the rapid growth of Dodo IS, we often put development speed ahead of everything else. Sometimes this came at the expense of the system's logic, architecture, and infrastructure.
As a result, the system turned out monolithic and tightly coupled. The code for processing payments and interacting with acquirers sat right in the client site, next to the UI and the controllers, so any change to the controllers could directly or indirectly affect the payment logic. Moreover, keeping the payment logic in the shared repository is what led to the incident you already know about. The loss of money was only a side effect of work on other parts of the system that had nothing to do with payment processing.
For the last six months, we have been re-engineering the system and moving our monolith onto SOA (Service-Oriented Architecture). Today everyone in the company, from programmers to managers, understands that technical debt has to be repaid.
As part of the move to SOA, we are extracting a separate payment processing service, the payment gateway. This service encapsulates all payment logic, including interaction with acquirers. In effect, we are building our own payment aggregator for our own needs. The payment gateway will become the single point for online payments for the client site (dodopizza.ru) and our other online sales channels.
We decided to certify the payment gateway under PCI DSS Self-Assessment. The idea may seem debatable (we do not store card PANs), but the PCI DSS standard is not a bureaucratic formality: it is a checklist of sound practices and advice for working with sensitive data, written in “blood”.
Every payment gateway deserves an architecture described in a UML diagram. This is the component model of our gateway:
And here is what is inside IBackService, IPlugin, and the other interfaces:
But no matter how many diagrams you draw, you still have to explain things in words :) What does the gateway consist of, and what role does each component play?
There is the site dodopizza.ru, where most orders are placed. Right now, the site itself redirects the user to the payment page for the chosen method (for example, Yandex.Money) and processes the responses from payment systems. If necessary, the site's backend calls the acquirer's backend. In the new architecture, the site will know nothing about the payment page or about acquiring. It will simply redirect the user to the payment gateway, which will decide for itself where to send the user next and how to interact with the acquirer.
The gateway is a RESTful service that accepts requests for order payments and refunds, for which it provides two APIs:
The Back API is intended only for calls from Dodo IS and is available only in the DMZ.
The source code of the payment gateway lives in a separate repository that is closed to all developers. To change the source, a developer has to file a separate request. The service itself is deployed in isolated environments with heightened security requirements.
The payment gateway contains the core payment processing logic, while the logic specific to each acquirer lives in plugins. As a result, connecting a new acquirer or changing the list of available ones is a targeted change, with minimal risk of touching anything else.
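The split between gateway and plugins can be sketched as a small registry. This is an illustration under stated assumptions, not the real IPlugin contract: only the StartPayment method name and the result kinds come from the article, while the class names, the `register` method, and the example URL are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class PluginResult:
    kind: str                  # "error", "payment made", "waiting" or "redirect"
    url: Optional[str] = None  # set only for "redirect"


class Plugin(ABC):
    """Acquirer-specific logic; the gateway core knows only this interface."""

    @abstractmethod
    def start_payment(self, payment_id: str) -> PluginResult:
        ...


class ExampleAcquirerPlugin(Plugin):
    """Hypothetical plugin that sends the customer to the acquirer's payment page."""

    def start_payment(self, payment_id: str) -> PluginResult:
        # Placeholder URL, not a real acquirer endpoint.
        return PluginResult("redirect", f"https://acquirer.example/pay?id={payment_id}")


class Gateway:
    def __init__(self):
        self._plugins = {}

    def register(self, payment_method: str, plugin: Plugin) -> None:
        self._plugins[payment_method] = plugin

    def start(self, payment_method: str, payment_id: str) -> PluginResult:
        plugin = self._plugins.get(payment_method)
        if plugin is None:
            # No plugin for this method: treat it as an unsuccessful payment.
            return PluginResult("error")
        return plugin.start_payment(payment_id)
```

Connecting another acquirer then means adding one more `Plugin` subclass and one `register` call, without touching `Gateway.start`.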
The payment gateway stores payment information in its own database, closed to the outside world and to the other parts of Dodo IS. Even the gateway itself has no direct access to the data: the database has its own API for managing entities, and this API is open only inside the payment perimeter.
To see the role of each component more clearly and to picture where and how the data flows, I suggest looking at the data flow diagram:
If the diagram still leaves you unsure how a payment proceeds in the new architecture, look at the detailed example under the spoiler.
Payment start scenario
N | Step | Example (Yandex.Kassa) |
1 | The customer is on the client site and proceeds to pay for the order. | - |
2 | The client site asks the payment gateway which cashless payment methods are available for the given pizzeria by calling the GetPaymentTypes method. | - |
3 | The client site shows the payment methods to the customer. The customer chooses one. | The customer chooses to pay through Yandex.Kassa. |
4 | The client site sends a payment creation request to the payment gateway by calling the CreatePayment method. It passes the selected payment method, the pizzeria identifier, the order identifier, the amount to be paid, and the URLs for payment status notification, successful return, and unsuccessful return. | - |
5 | The payment gateway creates a payment in the Draft status. | - |
6 | The payment gateway validates the payment and assigns it the Accepted or Rejected status. | - |
7 | The payment gateway returns the payment to the client site. | - |
8 | If the payment is Rejected, the client site shows the errors to the customer and the scenario ends. | - |
9 | The client site determines how the payment gateway is embedded; the embedding type is specified for each payment method. If the embedding type is “via redirect”, the client site redirects the customer to the gateway's payment page, PaymentPage, passing the payment identifier. If the embedding type is “in a frame”, the client site shows a frame that displays PaymentPage to the customer, passing the payment identifier. | The embedding type for Yandex.Kassa is “via redirect”, so the client site redirects the customer to the gateway's PaymentPage, passing the payment identifier. |
10 | The payment gateway assigns the Started status to the payment if it is in the Accepted status. Otherwise, it proceeds to the unsuccessful payment completion scenario. | - |
11 | The payment gateway displays the payment page with a waiting animation. | - |
12 | The payment gateway validates the payment by its identifier. If validation fails, it proceeds to the unsuccessful payment completion scenario. | - |
13 | The payment gateway uses the payment method to determine the plugin that will carry out the payment through the acquirer. If no plugin is found, it proceeds to the unsuccessful payment completion scenario. | A plugin for Yandex.Money is selected. |
14 | The payment gateway calls the plugin's StartPayment method, passing the payment. | - |
15 | The plugin performs its specific actions, calls the acquirer's system, and returns the result to the gateway. | The plugin returns to the payment gateway the result “redirect” and the URL of the payment page in Yandex.Kassa. |
16 | The payment gateway processes the plugin's result. If the result is “error”, the gateway proceeds to the unsuccessful payment completion scenario. If the result is “payment made”, the gateway returns the response received from the plugin and proceeds to the successful payment completion scenario. If the result is “waiting”, the gateway returns the response received from the plugin. If the result is “redirect”, the gateway redirects to the URL received from the plugin and proceeds to the payment waiting scenario. | The payment gateway redirects the customer to the URL of the payment page in the Yandex.Money service and proceeds to the payment waiting scenario. |
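The status changes in steps 5, 6, and 10 can be sketched as a small state machine. The status names Draft, Accepted, Rejected, and Started come from the scenario; the `Payment` class, the transition table, and the "amount must be positive" validation rule are assumptions made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    ACCEPTED = "Accepted"
    REJECTED = "Rejected"
    STARTED = "Started"


# Transitions named in the scenario; anything else is refused.
ALLOWED = {
    Status.DRAFT: {Status.ACCEPTED, Status.REJECTED},
    Status.ACCEPTED: {Status.STARTED},
    Status.REJECTED: set(),
    Status.STARTED: set(),
}


@dataclass
class Payment:
    order_id: str
    amount: int
    status: Status = Status.DRAFT

    def move_to(self, new_status: Status) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.status = new_status


def create_payment(order_id: str, amount: int) -> Payment:
    """Steps 5 and 6: create a Draft payment, then accept or reject it."""
    payment = Payment(order_id, amount)
    # Assumed validation rule; the real checks are not described in the article.
    payment.move_to(Status.ACCEPTED if amount > 0 else Status.REJECTED)
    return payment
```

Keeping the allowed transitions in one table makes step 10 a single check: a payment can become Started only from Accepted.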
Payment Waiting Scenario
N | Step | Example (Yandex.Kassa) |
1 | The payment gateway listens for acquirer requests on a universal endpoint, acquiring. The same endpoint handles customer redirects initiated by the acquirer. | Yandex.Kassa sends an HTTPS POST to pay.dodopizza.com/acquiring/yamoney/checkOrder; or Yandex.Kassa sends an HTTPS POST to pay.dodopizza.com/acquiring/yamoney/checkAviso; or Yandex.Kassa redirects the customer to pay.dodopizza.com/acquiring/yamoney/success |
2 | On receiving a request, the payment gateway extracts the plugin name from the request parameters and creates the corresponding plugin. | By the name yamoney, the payment gateway finds the plugin for Yandex.Kassa. |
3 | The payment gateway authorizes the request by calling the plugin's AuthorizeAcquiringRequest method. | The plugin verifies the authenticity of the request. |
4 | The payment gateway passes the request to the plugin by calling the ProcessAcquiringRequest method. The plugin performs its specific actions and returns the result to the gateway. | The plugin selects a handler based on the request parameters. CheckOrder: the plugin returns to the gateway the result “waiting” and the response to send to Yandex.Kassa. CheckAviso: the plugin returns to the gateway the result “payment made” and the response to send to Yandex.Kassa. success: the plugin returns to the gateway the result “redirect” and the successful-return URL. |
5 | The payment gateway processes the plugin's result. If the result is “error”, the gateway proceeds to the unsuccessful payment completion scenario. If the result is “payment made”, the gateway returns the response received from the plugin and proceeds to the successful payment completion scenario. If the result is “waiting”, the gateway returns the response received from the plugin. If the result is “redirect”, the gateway redirects to the URL received from the plugin and proceeds to the payment waiting scenario. | - |
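The routing in this scenario can be sketched as two lookups: first from the request path to a plugin and a handler, then from the plugin's result to the gateway's next step. The path layout and the result kinds follow the table above; the function names and the step descriptions are hypothetical, and request authorization (step 3) is left out for brevity.

```python
def parse_acquiring_path(path: str):
    """Split '/acquiring/<plugin>/<action>' into (plugin, action)."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "acquiring":
        raise ValueError(f"unexpected path: {path}")
    return parts[1], parts[2]


# Results of the yamoney plugin's handlers, as listed in step 4 of the table.
YAMONEY_RESULTS = {
    "checkOrder": "waiting",
    "checkAviso": "payment made",
    "success": "redirect",
}

# What the gateway does with each result (step 5); the wording is illustrative.
NEXT_STEP = {
    "error": "unsuccessful completion",
    "payment made": "successful completion",
    "waiting": "return plugin response",
    "redirect": "redirect and keep waiting",
}


def handle(path: str) -> str:
    plugin, action = parse_acquiring_path(path)
    if plugin != "yamoney":
        # Unknown plugin name: nothing can process the request.
        return NEXT_STEP["error"]
    result = YAMONEY_RESULTS.get(action, "error")
    return NEXT_STEP[result]
```

For example, `handle("/acquiring/yamoney/checkAviso")` maps the “payment made” result to the successful completion branch.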
At this point, I was asked to stop copying internal documentation to Habr, so I will wrap up. I will be glad if the article prompts some thoughts about your own architecture or suggests new solutions. And it will be really great if you spot something we missed and tell us what we are doing wrong. But do not be too harsh: many of the nuances may not have been forgotten but simply left outside the scope of the article, in which case I will explain them in the comments.
Source: https://habr.com/ru/post/325762/