
Changing a payment service with an A/B test through HAProxy

At Avito, we closely follow how other classifieds sites around the world are developing, and of course we are interested in best practices for working with a system as complex as billing. Today I am publishing a translation of a post by a colleague in the Naspers group (of which Avito is a member): M. Rafay Aleem, a software engineer at Dubizzle. Dubizzle is the leading classifieds site in the UAE and part of OLX Group, a network of the largest online marketplaces in 45 countries, with more than 1.9 billion visits, 37 billion page views and 54 million ads monthly. The topic will interest anyone involved in building and developing their own payment service.




Imagine that you need to rewrite an existing web service in order to move to a new payment service provider for various practical reasons. Your first thought will probably be to replace the old gateway with the new one outright and launch it. But this is somewhat naive, especially if you work with payment gateways that come with service level agreements, agreements with acquiring banks, scripts for detecting risk and fraudulent activity, and so on. These factors make the transition riskier in terms of operations, revenue, customer retention and, ultimately, business success. In this post, we look at the approach we used to reduce the risks of changing the payment gateway, and why it matters.



Dubizzle checkout page


It all starts with the old payment service ...


Our old payment system is written in Python 2 and is tightly coupled to the old payment gateway. When we first tackled the problem, we thought it would be easy to integrate the new payment gateway into the existing flows, URLs and the Python Bottle application. Once we started working on the first prototype, we realized we were producing spaghetti code, because the API flows of the two payment gateways were completely different. The old integration was also built differently: Redis and Gevent were used extensively to optimize the processing of users and payments, which was entirely unnecessary with the API of the new gateway.



Payment service in the old payment gateway


A/B tests are good as long as they are simple.


A/B tests are very important for decision making in product development. At Dubizzle, such tests usually focus on what the user sees: transitions between pages, the components on a page and their placement, features. However, they get complicated when you want to test core systems that depend heavily on each other.


While working on the prototype, we realized that the A/B test we wanted should not be run with a tool like Optimizely, even if we could somehow integrate the new gateway into the same user-facing views and flows. Here is why.



    if cookie == 'OLD':
        use old payment gateway API
    else:
        use new payment gateway API


If the user is in control group A (the web service that talks to the old payment gateway), the web service must ensure that this user's payment continues on the same payment gateway on which it started. In other words, the web service must not initiate a transaction on the old payment gateway and then try to complete it on the new one. This problem can be solved with sticky sessions, which both Optimizely and HAProxy support.
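The constraint can be sketched in Python. This is only an illustration under assumptions, not the Dubizzle code: the `Transaction` class, the "OLD"/"NEW" labels and the `complete` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """Hypothetical record of a payment that is in progress."""
    order_id: int
    gateway: str  # gateway that initiated the payment: "OLD" or "NEW"


def complete(transaction: Transaction, session_gateway: str) -> str:
    """Complete a payment only on the gateway that initiated it."""
    if transaction.gateway != session_gateway:
        # The user's group changed mid-payment: the other gateway knows
        # nothing about this transaction, so refuse instead of guessing.
        raise ValueError(
            f"order {transaction.order_id} was started on "
            f"{transaction.gateway}, not {session_gateway}"
        )
    return f"order {transaction.order_id} completed on {session_gateway}"
```

A guard like this only detects the mismatch. Sticky sessions are what prevent it from happening in the first place.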


The story of two payment services


Once we understood the problem we were facing, we decided to build a new payment service integrated with the new gateway and to compare the performance of the two services with a 50/50 A/B test. The A/B test had to meet at least the following requirements:




Payment service integrated with new payment gateway


A/B test with HAProxy


Having ruled out Optimizely, we decided to run the test with HAProxy. It is a powerful load balancer that works at layers 4 and 7 of the OSI network model and offers a wide range of features. One of them is the ability to bind requests to a backend server using cookies at layer 7.
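We set up HAProxy with three backends:

main: splits traffic between the old and the new service and binds each user to one of them with a cookie
old_service: a static backend that always forwards to the old service
new_service: a static backend that always forwards to the new service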



Below is an example of the HAProxy backend configuration:


    backend main
        balance roundrobin
        cookie SERVERID insert indirect nocache maxlife 2h
        option forwardfor
        option http-server-close
        option http-pretend-keepalive
        timeout queue 5000
        timeout connect 5000
        timeout server 50000
        server old-service-old-pg old-service.eu-west-1.elasticbeanstalk.com:80 weight 128 cookie old_v1 check
        server new-service-new-pg new-service.eu-west-1.elasticbeanstalk.com:80 weight 128 cookie new_v1 check

    backend old_service
        option forwardfor
        option http-server-close
        option http-pretend-keepalive
        timeout queue 5000
        timeout connect 5000
        timeout server 50000
        server old-service-old-pg-static old-service.eu-west-1.elasticbeanstalk.com:80 check

    backend new_service
        option forwardfor
        option http-server-close
        option http-pretend-keepalive
        timeout queue 5000
        timeout connect 5000
        timeout server 50000
        server new-service-new-pg-static new-service.eu-west-1.elasticbeanstalk.com:80 check

You may be wondering why we need the static backends. This will become clear when we get to the HAProxy frontend, so let's look at the main backend first.


We chose Weighted Round Robin (WRR) to balance the load between old-service-old-pg and new-service-new-pg. This suits an A/B test, where you simply need to split traffic between groups A and B, provided that a user who lands in group A does not end up in group B during the same session. We achieved this with HAProxy's cookie directive. Assuming that any user who initiates a transaction completes it within 2 hours, we configured HAProxy so that the cookie expires after 2 hours and a new one is issued based on the WRR result.
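The behaviour described here can be modelled in a few lines of Python. This is a simplified sketch, not HAProxy's implementation: it uses a weighted random choice where HAProxy uses a deterministic round robin, and it represents the cookie as a hypothetical `(server, issued_at)` tuple instead of HAProxy's real cookie format. The weights and the 2-hour lifetime mirror the backend configuration above.

```python
import random

MAXLIFE_SECONDS = 2 * 60 * 60  # mirrors "maxlife 2h"
SERVERS = {"old_v1": 128, "new_v1": 128}  # cookie value -> weight


def pick_server(rng):
    """Weighted random choice; approximates the split that WRR produces."""
    names = list(SERVERS)
    weights = [SERVERS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def route(cookie, now, rng=random):
    """Return (server, cookie) for a request arriving at time `now` (seconds).

    `cookie` is None for a new visitor, or a (server, issued_at) tuple.
    """
    if cookie is not None:
        server, issued_at = cookie
        if server in SERVERS and now - issued_at <= MAXLIFE_SECONDS:
            return server, cookie  # cookie still valid: stay on the same server
    # No cookie, unknown server or expired cookie: assign a group again.
    server = pick_server(rng)
    return server, (server, now)
```

With equal weights, each new or expired visitor has an even chance of landing in either group, and keeps that group for as long as the cookie is valid.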


This gave us two very important results:



There is a small flaw in our scheme that you may not have noticed yet. Consider the following flow:



Payment process through HAProxy


The user starts a transaction in the old payment service 45 minutes after receiving the cookie. The user then reaches the bank's 3-D Secure page at 1 hour 59 minutes and is redirected back to our successful-payment URL after the 2-hour mark. Since the cookie's maxlife is set to 2 hours, HAProxy discards the expired session cookie and inserts a new one when the redirect to the successful-payment URL arrives. If we are unlucky, WRR binds the new session to the new payment service, which does not know how to handle a success redirect for a payment initiated by the old service.


In our case, we decided to accept this risk, because we knew from experience that users who initiate a transaction usually complete it within two hours. But imagine the problems we would face if maxlife were set to, say, 1 minute.
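The timing argument can be checked with a small calculation. The sketch below is a simplified model: the function name and the concrete durations (a redirect returning at 2 hours 1 minute, a 5-minute payment) are assumptions made for illustration.

```python
from datetime import timedelta


def redirect_at_risk(cookie_age_at_start, payment_duration, maxlife):
    """True if the cookie exceeds maxlife before the payment redirect returns."""
    return cookie_age_at_start + payment_duration > maxlife


# The example above: the payment starts 45 minutes into the cookie's life and
# the redirect comes back just after the 2-hour mark (2h01m is assumed).
age_at_start = timedelta(minutes=45)
duration = timedelta(hours=2, minutes=1) - age_at_start
print(redirect_at_risk(age_at_start, duration, timedelta(hours=2)))  # True

# A fresh cookie and a 5-minute payment (assumed values).
print(redirect_at_risk(timedelta(0), timedelta(minutes=5), timedelta(hours=2)))    # False
print(redirect_at_risk(timedelta(0), timedelta(minutes=5), timedelta(minutes=1)))  # True
```

With a 1-minute maxlife, even a payment that starts on a brand-new cookie is at risk, and with a 50/50 split roughly every second such redirect would reach the wrong service.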


Let's return to the question of why we used the old_service and new_service backends in addition to main. In HAProxy, you typically configure a frontend proxy that handles all the access control lists (ACLs). In our case, it looked like this:


    frontend all
        bind *:80
        timeout client 50000
        default_backend main
        acl is_old_webhook path_reg ^\/webhook-old.*
        acl is_refund path_reg ^\/refund-endpoint.*
        acl is_new_webhook path_reg ^\/webhook-new.*
        use_backend old_service if is_old_webhook
        use_backend new_service if is_refund
        use_backend new_service if is_new_webhook

The webhooks and endpoints that appear in this configuration are the specific rules needed to handle requests that originate from the external payment gateways. Since neither service can work with the other's payment gateway, balancing these requests with WRR would send many of them to the wrong service, which would answer with a 404 or 400 error. In addition, because these requests come from the payment gateways, they need no persistence: they do not involve flows that span several requests (each request is finished as soon as it receives a 200 response).


The best place to solve this problem is the balancer itself, so we configured ACLs that direct these requests to the appropriate application servers through the static old_service and new_service backends. In other words, a kind of conversation takes place between the payment gateway and HAProxy, which has to route the request to the right application server.


Payment gateway: Hi, I'm an incoming request from the old payment gateway, and I'm here to report a successful payment.
HAProxy: Hi, I know you. Go through the old_service door to reach the target application.
Target app: Hi, I know how to handle you. Let's activate this user's order. Here is a 200 for you.
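The routing decision that the frontend makes can be mirrored in Python to make the rule order explicit. This is a sketch of the logic only, not of how HAProxy evaluates ACLs internally. The patterns and backend names are taken from the frontend configuration above, while `choose_backend` is a hypothetical helper.

```python
import re

# (pattern, backend) pairs in the same order as the use_backend rules.
RULES = [
    (re.compile(r"^/webhook-old.*"), "old_service"),
    (re.compile(r"^/refund-endpoint.*"), "new_service"),
    (re.compile(r"^/webhook-new.*"), "new_service"),
]
DEFAULT_BACKEND = "main"  # cookie-based A/B split


def choose_backend(path: str) -> str:
    """Return the backend for a request path; the first matching rule wins."""
    for pattern, backend in RULES:
        if pattern.search(path):
            return backend
    return DEFAULT_BACKEND


print(choose_backend("/webhook-old"))    # old_service
print(choose_backend("/step1/1725770"))  # main
```

Only requests that match none of the rules reach main, so gateway callbacks never take part in the weighted split.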

Monitoring the request flow


Now that all these problems are solved, the next question is how to verify the implementation of such a complex A/B test.


HAProxy writes very detailed logs, which help when debugging complex systems. Consider a few examples from the A/B test.


    Feb 22 09:16:39 ip-1-xxx haproxy[9046]: 1.xac:2459 [22/Feb/2017:09:16:25.623] all main/new-service-new-pg 13783/0/0/101/13884 200 8241 - - --NI 22/22/0/1/0 0/0 "GET /step1/1725770 HTTP/1.1"
    Feb 22 09:18:23 ip-1-xxx haproxy[9046]: 1.xac:2629 [22/Feb/2017:09:18:12.291] all main/new-service-new-pg 7223/0/1/3881/11105 200 327 - - --VU 19/19/0/1/0 0/0 "POST /step2/1725770 HTTP/1.1"
    Feb 22 09:19:30 ip-1-xxy haproxy[9402]: 1.xbz:59238 [22/Feb/2017:09:19:26.761] all main/new-service-new-pg 3714/0/1/45/3760 200 4711 - - --VN 34/34/0/1/0 0/0 "GET /success-redirect/1725770 HTTP/1.1"
    Feb 22 09:16:51 ip-1-xxx haproxy[9046]: 1.xac:2459 [22/Feb/2017:09:16:49.146] all new_service/new-service-new-pg-static 2047/0/0/5/2052 200 226 - - ---- 29/29/0/1/0 0/0 "POST /webhook-new HTTP/1.1"

Note the NI, VU and VN flags on the requests for a single order identifier. These flags show how the client, the server and HAProxy handled the cookie. This is the most important information to watch when testing and debugging.
Here is how the HAProxy documentation describes them:


--: Persistence cookie is not enabled. This is the case for requests that are not part of the A/B test.
NI: No cookie was provided by the client, so one was inserted in the response. This typically happens on the first request from each user and makes it possible to count the number of real users. WRR determines the user's group.
VU: A cookie was provided by the client, with a last-visit date that is not completely up to date, so an updated cookie was provided in the response. This can also happen if there was no date at all, or if there was a date but the maxidle parameter was not set, so that the cookie can be switched to unlimited time.
VN: A cookie was provided by the client and none was inserted in the response. This happens for most responses when the client already has a cookie. HAProxy thus finds the user's current group and sends the request to the corresponding backend.
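When there are many log lines, these flags can be extracted with a short script. The sketch below assumes the HTTP log format of the lines shown above. The regular expression and the `cookie_flags` helper are written for this example and are not part of HAProxy.

```python
import re
from collections import Counter

# In an HTTP log line the termination state is the four-character field
# (for example "--NI") that precedes the five connection counters.
# Its last two characters describe how the persistence cookie was handled.
STATE_RE = re.compile(r" (?P<state>[-A-Za-z]{4}) \d+/\d+/\d+/\d+/\d+ ")


def cookie_flags(log_line):
    """Return the two cookie flags of a log line, or None if not found."""
    match = STATE_RE.search(log_line)
    return match.group("state")[2:] if match else None


# Two of the log lines from the example above.
LINES = [
    'Feb 22 09:16:39 ip-1-xxx haproxy[9046]: 1.xac:2459 [22/Feb/2017:09:16:25.623] all main/new-service-new-pg 13783/0/0/101/13884 200 8241 - - --NI 22/22/0/1/0 0/0 "GET /step1/1725770 HTTP/1.1"',
    'Feb 22 09:16:51 ip-1-xxx haproxy[9046]: 1.xac:2459 [22/Feb/2017:09:16:49.146] all new_service/new-service-new-pg-static 2047/0/0/5/2052 200 226 - - ---- 29/29/0/1/0 0/0 "POST /webhook-new HTTP/1.1"',
]

# Count how often each flag pair occurs, e.g. NI for newly assigned users.
counts = Counter(cookie_flags(line) for line in LINES)
print(counts)
```

Counting NI entries per server in this way gives a rough number of users assigned to each group.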

Note that once HAProxy selects the new-service-new-pg server and sets the cookie, all subsequent requests from this user are sent to new-service-new-pg through the main backend.


If a request matches one of the ACLs, HAProxy forwards it without setting a cookie. This also ensures that the A/B test results are not skewed by HTTP calls that originate from bots.


A/B test architecture



A/B test architecture


The DNS and the edge node are shared by all our microservices. All the magic of the A/B test begins after traffic passes through the upstream load balancer (also HAProxy).


Conclusion


Although this was the best solution for us, there are many other ways to do more sophisticated balancing, and not only with cookie binding. Shopify described a solution based on Nginx and OpenResty very well on its blog.


From the translator


Thanks to the author, who kindly agreed to this translation and its publication here. You can contact him directly on Twitter: mrafayaleem. And if you want to build similar services, check out our vacancies on My Circle: at Avito we need developers of high-load systems, including billing developers.


Have you encountered such problems? Let's discuss in the comments.



Source: https://habr.com/ru/post/336508/

