
As the number of components in a software system grows, the number of people involved in its development usually grows too. To keep up the pace of development and keep the system easy to maintain, the way the API is organized deserves special attention.
If you want to learn more about how the Wargaming Platform team copes with the complexity of a system of more than a hundred web services interacting with each other, read on.
Hello! My name is Valentin, and I am an engineer on the Platform team at Wargaming. For those who do not know what the platform is and what it does, I will point to the recent publication by one of my colleagues, max_posedon.
I have been working at the company for more than five years and caught part of the period of active growth of World of Tanks. To cover the issues raised in this article, I need to start with a brief look at the history of the Wargaming Platform.
A bit of history
The popularity of "tanks" grew like an avalanche and, as usually happens in such cases, the infrastructure around the game began to grow rapidly. The game very quickly became surrounded by various web services, and by the time I joined the team they already numbered in the dozens (today, by the way, more than 100 platform components are running and benefiting the company).
As time went on, new games came out, and it was no longer easy to understand the intricacies of integration between web services. The situation only got worse when teams from other Wargaming offices joined the development of the platform. Development became distributed, with all the consequences: distance, time zones, and the language barrier. And the number of services kept growing. It was not easy to find someone who understood well how the platform works as a whole. Information often had to be pieced together from different sources.
The interfaces of the various web services could differ greatly in style, which made integration with the platform even harder. Direct inter-component dependencies reduced development flexibility by complicating the decomposition of functionality within the platform. Worse, the games (the platform's clients) knew our topology well, since they had to integrate directly with each platform service. This let them use horizontal links to lobby for improvements directly in the component they were integrated with. The result was duplicated functionality across platform components, and existing functionality could not be extended to other games. It became obvious that continuing to build the platform around each specific game was a dead end. We needed technical and organizational changes that would let us control the growing complexity of a fast-growing system and make all of the platform's functionality usable by any game.
At this point I want to finish the historical digression and finally talk about one of our technical solutions, which helps keep under control the complexity caused by the ever-growing number of services. In addition, it reduces the cost of developing new functionality and greatly simplifies integration with the platform.
Meet the Contract API
Inside the platform, we call it the Contract API. At its core, it is an integration framework made up of documentation and client libraries for each technology in our stack (Erlang / Elixir, Java / Scala, Python). It was developed, first of all, to simplify the integration of platform components with each other, and secondly, to help us solve the following problems:
- stylistic differences of software interfaces
- the presence of direct inter-component dependencies
- keeping documentation up to date
- introspection and debugging of end-to-end functionality
So, first things first.
Stylistic differences in software interfaces
In my opinion, this problem arose as a result of a combination of several factors:
- No strict standard for how the API should look. A set of recommendations often does not have the desired effect, and the APIs still turn out different, especially when development is spread across teams from different offices of the company. Each team has its own habits and practices. Taken together, such APIs often do not look like parts of a whole.
- The lack of a single catalog with the names and formats of business-specific entities. As a rule, it is impossible to take an entity from the result of one API and pass it to the API of another service without transforming it first.
- The lack of a mandatory centralized review system for the API. Deadlines are always tight, and there is no time to collect approvals, let alone make changes to an API that often turns out to be half tested already.
The first thing we did when designing the Contract API was to declare that from now on the API belongs to the platform, not to an individual component. As a result, the development of new functionality begins with a pull request to the centralized API repository. At the moment we use a Git repository for this. For convenience, we divided the entire API into separate business functions, formalized the structure of such a function, and called it a Contract.
Since then, each new business function in our Contract API must be described in a special format and go through a pull request with a mandatory review. There is no other way to publish a new API in the Contract API. In the same repository, we set up a catalog of business-specific entities and suggested that contract developers reuse them instead of describing these entities on their own.
This gave us a conceptually coherent platform API that looks like a single product, even though it is actually implemented by many platform components built on different technology stacks.
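The article does not show the real contract format, so the following is only a minimal sketch of the idea, assuming that a contract lists its fields by name and that the definitions live in one shared catalog. The `ENTITIES` catalog, the contract layout, and `resolve_fields` are hypothetical names chosen for illustration.

```python
# Hypothetical shared catalog: each business entity is described exactly once.
ENTITIES = {
    "wgid": {"type": "integer", "description": "Account identifier"},
    "title": {"type": "string", "description": "Game title, e.g. ru.wot"},
    "reason": {"type": "string", "description": "Human-readable reason"},
}

# Hypothetical contract description that only references catalog entries.
CREATE_BAN_V1 = {
    "name": "ban-management.create-ban.v1",
    "request": ["wgid", "title", "reason"],
}


def resolve_fields(contract, entities=ENTITIES):
    """Expand entity names into full definitions, rejecting unknown ones."""
    unknown = [name for name in contract["request"] if name not in entities]
    if unknown:
        raise KeyError(f"entities missing from the catalog: {unknown}")
    return {name: entities[name] for name in contract["request"]}


schema = resolve_fields(CREATE_BAN_V1)
print(schema["wgid"]["type"])  # prints integer
```

Because a contract can only refer to entities that exist in the catalog, a reviewer sees immediately when a pull request tries to introduce its own variant of an existing entity.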
The presence of direct inter-component dependencies
This problem showed up in the fact that each platform component had to know exactly which component provides the functionality it needs.
The difficulty was not so much in keeping this registry up to date as in the fact that direct dependencies significantly complicated the migration of business functionality from one platform component to another. The problem became especially acute when we began decomposing our monoliths into smaller components. It turned out that convincing a client to replace a working integration with one that is identical from a business point of view but different from a technical point of view is a far from trivial managerial task. The client simply does not see the point, since everything already works for them. As a result, bad-smelling layers of backward compatibility were written, which only complicated platform support and hurt the quality of service. And since we were going to standardize the platform API anyway, we had to solve this problem along the way.
We faced a choice of several options. Of these, we carefully considered:
- Implementation of service discovery protocols on each of the components.
- The use of a mediator, which would redirect client requests to the correct component of the platform.
- Using a message broker as a messaging bus.
After some thought and experimentation, we chose the message broker, even though it was seen as a potential single point of failure and increased the overhead of operating the platform. An important factor in the choice was that the platform team already had expertise in working with RabbitMQ. The broker itself scaled well and had built-in mechanisms for ensuring fault tolerance. As a bonus, we were able to implement an event-driven architecture (EDA) "under the hood", which later opened up more options for interservice interaction than point-to-point interaction offers.
So, topologically, the platform began to turn from a graph with random connectivity into a star. And the platform components inverted their dependencies and got the opportunity to interact with each other exclusively through contracts registered in the centralized repository, without having to know who specifically implements this or that contract. In other words, all components within the platform were able to interact with each other using a single point of integration, which significantly simplified the life of the developers.
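The inversion of dependencies described here can be sketched with an in-memory stand-in for the broker. This is not the platform's implementation: the `Broker` class and the component names `monolith` and `accounts-service` are assumptions made for illustration. The point is only that the caller names a contract and never the component behind it.

```python
class Broker:
    """In-memory stand-in for a message broker that routes by contract name."""

    def __init__(self):
        self._handlers = {}

    def serve(self, contract, handler):
        # Whoever registers last serves the contract; callers never see who.
        self._handlers[contract] = handler

    def call(self, contract, params):
        if contract not in self._handlers:
            raise LookupError(f"nobody serves {contract}")
        return self._handlers[contract](params)


broker = Broker()
broker.serve("account-management.rename-account.v1",
             lambda p: {"served_by": "monolith", "name": p["name"]})
before = broker.call("account-management.rename-account.v1", {"name": "new"})

# The function moves to a new component; the calling code stays the same.
broker.serve("account-management.rename-account.v1",
             lambda p: {"served_by": "accounts-service", "name": p["name"]})
after = broker.call("account-management.rename-account.v1", {"name": "new"})
```

The two `broker.call` lines are identical, which is exactly what makes moving functionality out of a monolith invisible to its clients.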
Keeping documentation up to date
Missing or outdated documentation is a problem that almost every project runs into, and the higher the pace of development, the more often it appears. After the fact, it is difficult to gather the API specifications of more than a hundred services, developed by a distributed and multinational team, in a single place and format.
Developing the Contract API, we set a goal to solve this problem as well. And we did it. A strictly defined contract description format allowed for the construction of a process in accordance with which, immediately after the appearance of a new contract, the automatic assembly of documentation is launched. This gives us confidence that our API documentation is always up to date. This process is fully automated and does not require any effort on the part of development or management.
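As an illustration of why a strict format makes this possible, here is a minimal sketch of a documentation generator. The contract layout (a `name`, a `summary`, and a mapping of request fields to descriptions) and the `render_docs` function are assumptions; the real format and tooling are not shown in the article.

```python
def render_docs(contract):
    """Render a plain-text reference page from a contract description."""
    title = contract["name"]
    lines = [title, "-" * len(title), contract["summary"], "", "Request fields:"]
    for field, description in contract["request"].items():
        lines.append(f"  {field}: {description}")
    return "\n".join(lines)


# Hypothetical contract description used only for this example.
contract = {
    "name": "ban-management.create-ban.v1",
    "summary": "Creates a ban for an account.",
    "request": {"wgid": "account identifier", "reason": "why the ban is issued"},
}
page = render_docs(contract)
print(page)
```

A job that runs such a function over every contract after each merge would keep the published reference in step with the repository without anyone editing documentation by hand.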
Introspection and debugging end-to-end functionality
As we broke our monoliths up into smaller components, debugging end-to-end functionality naturally became harder. If a business function was served by several platform components, then to localize and debug a problem we had to find a representative for each of those components. At times this was hard to do, given the 11-hour time difference with some of our colleagues.
With the advent of the Contract API, and in particular thanks to the underlying message broker, we were able to receive copies of the messages involved in the execution of the business function, without side effects on the interaction participants. To do this, it is not even necessary to know which of the platform components is responsible for processing a particular contract. And after the problem is localized, we can get the identifier of the broken component from the metadata of the problem message.
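The idea of observing traffic without disturbing it can be sketched with a small in-memory bus. The `Bus` class and the message layout with an `app_id` field are assumptions made for this example. With RabbitMQ, one way to get a similar effect is to bind an additional queue to the same exchange; whether the platform does exactly that is not stated in the article.

```python
from collections import defaultdict


class Bus:
    """In-memory stand-in for a broker that fans messages out to consumers."""

    def __init__(self):
        self._consumers = defaultdict(list)

    def bind(self, contract, consumer):
        self._consumers[contract].append(consumer)

    def publish(self, contract, message):
        for consumer in self._consumers[contract]:
            consumer(dict(message))  # every consumer gets its own shallow copy


handled, tapped = [], []
bus = Bus()
bus.bind("ban-management.create-ban.v1", handled.append)  # the real handler
bus.bind("ban-management.create-ban.v1", tapped.append)   # the debugging tap

bus.publish("ban-management.create-ban.v1",
            {"app_id": "client", "body": {"wgid": 1234567890}})

# The tap sees the same message, including metadata naming the sender.
print(tapped[0]["app_id"])  # prints client
```

The handler and the publisher are unaware of the tap, so attaching or removing it does not change their behavior.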
What else have we developed on top of the Contract API
In addition to its main purpose and solving the above problems, the Contract API allowed us to implement a number of useful services.
Gateway to access platform functionality
Standardization of API in the form of contracts allowed us to develop a single point of access to platform functionality via HTTP. Moreover, when new functionality (contracts) appears, we do not need to modify this access point in any way. It is compatible in advance with all future contracts. This allows you to work with the platform as a single product using the usual HTTP interface.
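To show why such a gateway needs no changes when new contracts appear, here is a minimal sketch. The URL scheme `/<service>/<function>/<version>` and the functions `contract_from_path` and `handle_http` are assumptions; the article does not describe how the real gateway maps HTTP requests to contracts.

```python
def contract_from_path(path):
    """Map '/ban-management/create-ban/v1' to 'ban-management.create-ban.v1'."""
    parts = [part for part in path.split("/") if part]
    if len(parts) != 3:
        raise ValueError(f"expected /<service>/<function>/<version>, got {path!r}")
    return ".".join(parts)


def handle_http(path, json_body, call):
    """Forward any request to the contract named by the URL.

    `call` is whatever function performs a contract call; there is no
    per-contract code here, so future contracts work unchanged.
    """
    return call(contract_from_path(path), json_body)


def fake_call(contract, params):
    # Stand-in for a client library call; it just reports what it received.
    return {"contract": contract, "params": params}


response = handle_http("/notification-center/create-sms-notification/v1",
                       {"wgid": 1234567890}, fake_call)
```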
Bulk operations service
Any contract can be launched as part of a mass operation, with the ability to track its status and then receive a report on the results of this operation. This service, just like the previous one, is compatible with all future contracts in advance.
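A bulk operation over an arbitrary contract could look roughly like the sketch below. The `run_bulk` function, the report layout, and `fake_call` are hypothetical; the real service presumably runs asynchronously and persists its state, which this example does not attempt.

```python
def run_bulk(call, contract, items):
    """Call one contract for every item and return a per-item report."""
    report = {"total": len(items), "succeeded": 0, "failed": 0, "results": []}
    for params in items:
        try:
            result = call(contract, params)
        except Exception as error:  # a sketch: real code would narrow this
            report["failed"] += 1
            report["results"].append(
                {"params": params, "status": "error", "error": str(error)})
        else:
            report["succeeded"] += 1
            report["results"].append(
                {"params": params, "status": "ok", "result": result})
    return report


def fake_call(contract, params):
    # Stand-in for a client library call; rejects non-positive ids.
    if params["wgid"] <= 0:
        raise ValueError("invalid wgid")
    return {"ban_id": params["wgid"] * 10}


report = run_bulk(fake_call, "ban-management.create-ban.v1",
                  [{"wgid": 1}, {"wgid": -5}, {"wgid": 2}])
print(report["succeeded"], report["failed"])  # prints 2 1
```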
Unified handling of platform errors
The Contract API protocol also standardizes errors. This allowed us to implement an error interceptor that analyzes their severity and notifies the monitoring system of potential problems in platform components. In the future, it will also be able to decide whether to open a bug against a platform component. The error interceptor picks errors up directly from the message broker and knows nothing about the purpose of a contract or an error, acting only on the basis of meta-information. This allows it, like all the services described in this section, to be compatible with all future contracts in advance.
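A metadata-only interceptor can be sketched as follows. The error codes, the severity levels, and the `meta` layout are assumptions made for illustration; the article does not describe the actual error protocol.

```python
# Hypothetical error codes and the severity assigned to each of them.
SEVERITY_BY_CODE = {
    "VALIDATION_ERROR": "info",
    "TIMEOUT": "warning",
    "INTERNAL_ERROR": "critical",
}


def intercept(message, notify):
    """Classify a message using metadata only; the body is never inspected."""
    meta = message["meta"]
    if meta.get("type") != "error":
        return None
    severity = SEVERITY_BY_CODE.get(meta.get("error_code"), "warning")
    if severity != "info":
        notify(meta["app_id"], meta["contract"], severity)
    return severity


alerts = []
severity = intercept(
    {"meta": {"type": "error", "error_code": "INTERNAL_ERROR",
              "app_id": "ban-service",
              "contract": "ban-management.create-ban.v1"},
     "body": "opaque payload"},
    lambda *args: alerts.append(args),
)
```

Since `intercept` never reads the body, it works the same way for a contract that did not exist when it was written.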
Automatic user interface generation
Strictly formalized contracts make it possible to build user interface components automatically. We developed a service that generates an administrative interface from a collection of contracts, and this interface can then be embedded in any of our platform tools. Thus, the admin panels that we used to write by hand can now be generated automatically (albeit only partially).
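The following sketch shows the general idea of deriving a form from a contract. The contract layout, the `WIDGETS` mapping, and `render_form` are assumptions; the real generator and the markup it produces are not described in the article.

```python
from html import escape

# Hypothetical mapping from contract field types to HTML input types.
WIDGETS = {"integer": "number", "string": "text"}


def render_form(contract):
    """Build a simple HTML form with one input per request field."""
    rows = [f'<form data-contract="{escape(contract["name"])}">']
    for field, field_type in contract["request"].items():
        input_type = WIDGETS.get(field_type, "text")
        name = escape(field)
        rows.append(
            f'  <label>{name} <input name="{name}" type="{input_type}"></label>')
    rows.append("</form>")
    return "\n".join(rows)


form = render_form({
    "name": "ban-management.create-ban.v1",
    "request": {"wgid": "integer", "reason": "string"},
})
print(form)
```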
Logging of platform interactions
This component is not implemented yet and is still at the design stage. In the future, it will make it possible to turn logging of any business function in the platform on and off "on the fly", extracting this information directly from the message broker, without side effects for the interacting components.
The main purpose of the Contract API
But still, the main purpose of the Contract API is to reduce the costs of integrating the platform components.
Developers are abstracted from the transport level by the libraries that we developed for each of our technology stacks. This gives us some room for maneuver in case we have to change the message broker or even switch to point-to-point interaction. The external interface of the library will remain unchanged.
The library under the hood generates a message according to certain rules and sends it to the broker, after which, waiting for a response message, returns the result to the developer. From the outside, it looks like a normal synchronous (or asynchronous, implementation-dependent) request. I will give a few examples as a demonstration.
An example of calling a contract using Python
from platform_client import Client

# CONTRACTS_PATH and AMQP_URL are expected to be defined by the application.
client = Client(contracts_path=CONTRACTS_PATH, url=AMQP_URL, app_id='client')

client.call("ban-management.create-ban.v1", {
    "wgid": 1234567890,
    "reason": "Fraudulent activity",
    "title": "ru.wot",
    "component": "game",
    "bantype": "access_denied",
    "author_id": "v_nikonovich",
    "expires_at": "2038-01-19 03:14:07Z"
})

# Returned value:
{
    u'ban_id': 31415926,
    u'wgid': 1234567890,
    u'title': u'ru.wot',
    u'component': u'game',
    u'reason': u'Fraudulent activity',
    u'bantype': u'access_denied',
    u'status': u"active",
    u'started_at': u"2019-02-15T15:15:15Z",
    u'expires_at': u"2038-01-19 03:14:07Z"
}
The same contract call, but using Elixir
:platform_client.call("ban-management.create-ban.v1", %{
  "wgid" => 1234567890,
  "reason" => "Fraudulent activity",
  "title" => "ru.wot",
  "component" => "game",
  "bantype" => "access_denied",
  "author_id" => "v_nikonovich",
  "expires_at" => "2038-01-19 03:14:07Z"
})

# Returned value:
{:ok, %{
  "ban_id" => 31415926,
  "wgid" => 1234567890,
  "title" => "ru.wot",
  "component" => "game",
  "reason" => "Fraudulent activity",
  "bantype" => "access_denied",
  "status" => "active",
  "started_at" => "2019-02-15T15:15:15Z",
  "expires_at" => "2038-01-19 03:14:07Z"
}}
In place of the contract "ban-management.create-ban.v1" there can be any other platform functionality, for example "account-management.rename-account.v1" or "notification-center.create-sms-notification.v1". All of it is available through this single point of integration with the platform.
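What a client library of this kind might do under the hood can be sketched as follows. The envelope layout and the functions `build_request`, `call`, and `loopback` are assumptions and not the actual `platform_client` code. The property names `app_id`, `reply_to`, and `correlation_id` are standard AMQP 0-9-1 message properties, but the article does not say which of them the platform relies on.

```python
import json
import uuid


def build_request(contract, params, app_id, reply_to):
    """Build a hypothetical message envelope: metadata plus a JSON body."""
    return {
        "routing_key": contract,
        "properties": {
            "app_id": app_id,
            "reply_to": reply_to,
            "correlation_id": uuid.uuid4().hex,
        },
        "body": json.dumps(params),
    }


def call(transport, contract, params, app_id="client"):
    """Send a request and return the reply carrying the same correlation id."""
    request = build_request(contract, params, app_id, reply_to="client.replies")
    reply = transport(request)  # a real transport would publish, then wait
    expected = request["properties"]["correlation_id"]
    if reply["properties"]["correlation_id"] != expected:
        raise RuntimeError("reply does not belong to this request")
    return json.loads(reply["body"])


def loopback(request):
    # Stand-in for the broker plus a remote handler: echoes the params back.
    correlation_id = request["properties"]["correlation_id"]
    return {"properties": {"correlation_id": correlation_id},
            "body": request["body"]}


print(call(loopback, "ban-management.create-ban.v1", {"wgid": 1234567890}))
# prints {'wgid': 1234567890}
```

Because the caller only sees `call`, the transport behind it can be replaced without touching application code.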
The overview would be incomplete without showing the Contract API from the point of view of a server developer. Consider a situation in which a developer needs to implement a handler for the same ban-management.create-ban.v1 contract.
from platform_server import BlockingServer, handler


class CustomServer(BlockingServer):

    @handler('ban-management.create-ban.v1')
    def handle_create_ban(self, params, context):
        # do_some_usefull_job stands for the component's own business logic.
        response = do_some_usefull_job(params)
        return response


# CONTRACTS_PATH and AMQP_URL are expected to be defined by the application.
d = CustomServer(app_id="server", amqp_url=AMQP_URL, contracts_path=CONTRACTS_PATH)
d.serve()
This code will be enough to start servicing the specified contract. The server library will unpack and check the parameters of the request for correctness, and then call the contract handler with the request parameters already ready for processing. Thus, the server developer is protected by the library, which, in case of receiving incorrect request parameters, will send a validation error to the client and register the fact of the problem.
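The protective behavior described above, validation before the handler runs, can be sketched with a small dispatcher. This is not the `platform_server` implementation: the `REQUIRED` table, the standalone `handler` decorator, `dispatch`, and the error layout are all assumptions made for illustration.

```python
HANDLERS = {}

# Hypothetical description of required request fields and their types.
REQUIRED = {"ban-management.create-ban.v1": {"wgid": int, "reason": str}}


def handler(contract):
    """Register the decorated function as the handler for a contract."""
    def register(func):
        HANDLERS[contract] = func
        return func
    return register


def dispatch(contract, params):
    """Validate params first; the handler runs only for well-formed requests."""
    for field, expected in REQUIRED[contract].items():
        if not isinstance(params.get(field), expected):
            return {"error": "VALIDATION_ERROR", "field": field}
    return {"result": HANDLERS[contract](params)}


@handler("ban-management.create-ban.v1")
def create_ban(params):
    return {"ban_id": 31415926, "wgid": params["wgid"]}


ok = dispatch("ban-management.create-ban.v1",
              {"wgid": 1234567890, "reason": "Fraudulent activity"})
bad = dispatch("ban-management.create-ban.v1",
               {"wgid": "not a number", "reason": "Fraudulent activity"})
```

In the second call `create_ban` is never invoked, so the handler body can assume its input already matches the contract.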
Because the Contract API is built on events, we can go beyond the request/response scenario and implement a wider range of interservice interactions.
For example:
- make a request and forget (without waiting for an answer)
- make requests to several contracts simultaneously (even without using the event loop)
- make a request and get answers from several handlers at once (if the integration scenario provides for it)
- register a response handler (it is invoked once the contract handler reports completion and receives the contract handler's result, that is, its response)
And this is not a complete list of scenarios that can be expressed through an event-based interaction model. This is a list of those that we are currently using.
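The first three scenarios from the list can be sketched with a tiny in-memory bus. The `LoopbackBus` class and its `send` and `collect` methods are hypothetical, and in this stand-in handlers run immediately inside `send`, whereas a real broker would deliver messages asynchronously.

```python
import uuid


class LoopbackBus:
    """In-memory stand-in for a broker; handlers run as soon as send is called."""

    def __init__(self):
        self._handlers = {}  # contract name -> list of handlers
        self._replies = {}   # correlation id -> list of replies

    def serve(self, contract, handler):
        self._handlers.setdefault(contract, []).append(handler)

    def send(self, contract, params):
        """Publish a request and return its correlation id right away."""
        correlation_id = uuid.uuid4().hex
        handlers = self._handlers.get(contract, [])
        self._replies[correlation_id] = [h(params) for h in handlers]
        return correlation_id

    def collect(self, correlation_id):
        """Fetch every reply produced for an earlier request."""
        return self._replies.pop(correlation_id, [])


bus = LoopbackBus()
bus.serve("notification-center.create-sms-notification.v1", lambda p: "sms queued")
bus.serve("account-management.rename-account.v1", lambda p: "renamed")
bus.serve("account-management.rename-account.v1", lambda p: "audit logged")

# 1. Fire and forget: the correlation id is simply dropped.
bus.send("notification-center.create-sms-notification.v1", {"wgid": 1})

# 2. Several requests are sent first and collected later, with no event loop.
first = bus.send("notification-center.create-sms-notification.v1", {"wgid": 2})
second = bus.send("account-management.rename-account.v1", {"name": "new"})
sms_replies = bus.collect(first)

# 3. One request, answers from two handlers.
rename_replies = bus.collect(second)
```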
Instead of conclusion
We have been using the Contract API for several years, so it is not possible to cover all the scenarios for its use in a single overview article. For the same reason, I did not overload the article with technical details. It has turned out quite long as it is. Ask questions, and I will try to answer them right in the comments. If a topic proves particularly interesting, it can be covered in more detail in a separate article.