Next-generation billing architecture: the transformation behind the move to Tarantool
Why does a corporation like MegaFon need Tarantool in its billing? From the outside it may seem that a vendor simply shows up, brings in a big box, plugs it into the socket, and that's billing! It once was like that, but now it is archaic, and such dinosaurs have gone extinct or are dying out. Originally, billing was just that: a billing system, a counter or calculator. In modern telecom it is a system that automates the entire life cycle of interaction with a subscriber, from signing the contract to termination, including real-time charging, payment acceptance, and much more. Billing in a telecom company is like a battle robot: big, powerful, and heavily armed.
So where does Tarantool fit in? Oleg Ivlev and Andrey Knyazev will explain. Oleg is MegaFon's chief architect with extensive experience in foreign companies; Andrey is the company's director of business systems. From the transcript of their talk at Tarantool Conference 2018 you will learn why corporations need R&D, what Tarantool is, how the dead end of vertical scaling and globalization became the prerequisites for this database appearing in the company, about the technological challenges and the architecture transformation, and in what ways MegaFon's technology stack resembles those of Netflix, Google, and Amazon.
Unified Billing project
The project in question is called "Unified Billing", and it was here that Tarantool showed its best qualities.
The performance growth of Hi-End equipment was not keeping pace with the growth of the subscriber base and the number of services; further growth in subscribers and services was expected from M2M, IoT, and partner features; and all of this was degrading time-to-market. The company decided to build a unified business system with a unique, world-class modular architecture in place of the 8 different billing systems it was running.
MegaFon is eight companies in one. In 2009 a reorganization was completed: branches across Russia merged into the single company OJSC MegaFon (now PJSC). As a result, the company ended up with 8 billing systems, each with its own "custom" solutions, branch-specific features, and a different organizational structure, IT, and marketing.
Everything was fine until we had to launch a common federal product. A host of difficulties surfaced: one system rounded charges up, another down, a third used the arithmetic mean. There were thousands of details like that.
Even though it was one version of the billing system from one supplier, the configurations had diverged so far that reconciling them would take a long time. We tried to reduce their number and ran into a second problem, one familiar to many corporations.
Vertical scaling. Even the coolest hardware of the time could not meet our needs. We used Hewlett-Packard equipment from the Superdome Hi-End line, but it could not handle even the load of two branches. We wanted horizontal scaling without large operating costs and capital investments.
Expected growth in the number of subscribers and services. Consultants had long been telling the telecom world stories about IoT and M2M: times would come when there is a SIM card in every phone and every appliance, and two in the fridge. Today we have one number of subscribers; in the near future there will be many more.
Technological challenges
These four reasons pushed us toward major changes. The choice was between upgrading the existing system and designing from scratch. We deliberated at length, made serious decisions, ran tenders. In the end we decided to design from scratch and took on some interesting technological challenges.
Scalability
If earlier there were, say, 8 billing systems for 15 million subscribers each, now the result had to handle 100 million subscribers and more, a much higher load.

We have become comparable in scale to major online players like Mail.ru or Netflix.

But moving further, toward higher load and a bigger subscriber base, posed serious challenges.
The geography of our vast country
Between Kaliningrad and Vladivostok lie 7,500 km and 10 time zones. The speed of light is finite, and at such distances delays become significant: 150 ms even over the best modern optical channels is too much for real-time billing, at least as it is practiced in Russian telecom today. In addition, updates must be rolled out within one working day, and with that many time zones this is a problem.
We do not simply provide services for a monthly fee: we have complex tariffs, packages, and various modifiers. We need not just to allow or forbid a subscriber to talk, but to grant a certain quota and deduct calls and actions from it in real time, so smoothly that the subscriber does not notice.
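The real-time quota logic described above can be sketched as follows. This is an illustrative toy model, not MegaFon's actual code: the class name, quota sizes, and units are made up.

```python
import threading

class QuotaAccount:
    """Toy model of real-time charging: a subscriber holds a quota
    (e.g. seconds of voice or MB of data) that is decremented
    atomically as usage events arrive, so it is never overdrawn."""

    def __init__(self, quota_units):
        self._quota = quota_units
        self._lock = threading.Lock()  # a real system uses DB transactions instead

    def reserve(self, units):
        """Try to charge `units` against the quota in one atomic step.
        Returns True to allow the service, False to cut it off."""
        with self._lock:
            if self._quota >= units:
                self._quota -= units
                return True
            return False

    @property
    def remaining(self):
        return self._quota

acct = QuotaAccount(quota_units=300)  # e.g. 300 seconds in a voice package
print(acct.reserve(60))   # a 60-second call: allowed
print(acct.reserve(300))  # would overdraw the quota: rejected
print(acct.remaining)     # 240 units left
```

The check-and-decrement must be a single atomic step: done as two separate operations, concurrent calls could both pass the check and overdraw the quota.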
Fault tolerance
This is the flip side of centralization. If we gather all subscribers into one system, then any emergency or disaster is bad for the whole business. The system is therefore designed to contain the impact of an accident on the overall subscriber base.
This is a consequence of abandoning vertical scaling. Moving to horizontal scaling took us from hundreds of servers to thousands. They need to be managed: we have to build in interchangeability, back up the IT infrastructure automatically, and restore the distributed system.
Those were the interesting challenges we faced. As we designed the system, we tried to find global best practices and to check how well we matched the trends and followed advanced technology.
World experience
Surprisingly, we did not find a single reference case in world telecom.
Europe was ruled out by subscriber numbers and scale, the USA by the flatness of its tariffs. We looked at some things in China, found some in India, and brought in specialists from Vodafone India.
To review the architecture we assembled a Dream Team led by IBM: architects from different areas who could adequately assess what we were doing and bring their own knowledge into our architecture.
Scale
A few numbers for illustration.
We are designing a system for 80 million subscribers with headroom up to a billion, removing future ceilings in advance. This is not because we intend to conquer China, but because of the pressure of IoT and M2M.

300 million documents are processed in real time. Although we have 80 million subscribers, we also work with potential customers, and with those who have left us when receivables need collecting. So the real volumes are much larger.

2 billion transactions change balances daily: payments, charges, calls, and other events. 200 TB of data changes actively, another 8 PB changes somewhat more slowly, and this is not an archive but live data in the unified billing. The data-center scale: 5 thousand servers across 14 sites.
Technological stack
When we planned the architecture and set about assembling the system, we brought in the most interesting and advanced technologies. The result is a technology stack familiar to any Internet player and to corporations building high-load systems.

The stack is similar to those of other major players: Netflix, Twitter, Viber. It consists of 6 components, but we want to reduce and unify it.

Flexibility is good, but a large corporation cannot manage without unification.

We are not about to swap Oracle itself for Tarantool. In the reality of large companies that is a utopia, or a 5-10 year crusade with an unclear outcome. But Cassandra and Couchbase can well be replaced by Tarantool, and that is what we are striving for.
Why Tarantool?
There are 4 simple criteria behind our choice of this database.
Speed. We ran load tests on MegaFon's production systems, and Tarantool won: it showed the best performance.

That is not to say the other systems fail to meet MegaFon's needs. Modern in-memory solutions are so fast that their headroom is more than enough for the company. But we prefer to deal with the leader rather than with a laggard, including in load tests.

Tarantool covers the company's needs even in the long term.
TCO. Couchbase support at MegaFon's volumes costs astronomical money; with Tarantool the situation is much more pleasant, while functionally the two are close.

Another nice trait that slightly influenced our choice: Tarantool handles memory better than the other databases, showing maximum efficiency.
Reliability. MegaFon invests in reliability probably like no one else. So when we looked at Tarantool, we realized we had to bring it up to our requirements.

We invested our time and money, and together with Mail.ru we created the enterprise version, which is now used in several other companies as well.

Tarantool Enterprise fully satisfied us in terms of security, reliability, and logging.
Partnership
The most important thing for me is direct contact with the developers, and that is exactly how the Tarantool team won us over.

If you come to a vendor, especially one already serving an anchor client, and say you need the database to do this, this, and that, the usual answer is:

"Well, put your requirements at the bottom of that pile; someday we will probably get to them."

Many vendors have a roadmap for the next 2-3 years that is almost impossible to squeeze into. The Tarantool developers, by contrast, win you over with their openness, and not only toward MegaFon: they adapt their system to the customer. That is cool, and we really like it.
Where we applied Tarantool
We use Tarantool in several components. The first was a pilot we built for the address catalog system. At one point we wanted it to be something like Yandex.Maps or Google Maps, but it turned out a little differently.
Take the address directory in the sales interface: on Oracle, finding the right address takes 12-13 seconds, uncomfortable numbers. When we switch to Tarantool, replacing Oracle with the other database behind the same console, and run the same search, we get a 200x speedup! The city pops up after the third letter; now we are adapting the interface so it happens after the first. Either way, the response time is on a completely different scale: milliseconds instead of seconds.
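A prefix lookup of this kind (city suggestions appearing as the first letters are typed) can be sketched with a sorted in-memory index. The `PrefixIndex` class and the city list below are illustrative, not the actual MegaFon implementation:

```python
import bisect

class PrefixIndex:
    """Toy in-memory prefix index: entries are kept sorted, so a
    prefix query becomes two binary searches plus a slice. This is
    one reason an in-memory store can answer in milliseconds where a
    disk-backed LIKE-style scan takes seconds."""

    def __init__(self, entries):
        self._entries = sorted(entries)

    def suggest(self, prefix, limit=5):
        lo = bisect.bisect_left(self._entries, prefix)
        # '\uffff' sorts after any ordinary character, closing the range
        hi = bisect.bisect_right(self._entries, prefix + '\uffff')
        return self._entries[lo:hi][:limit]

cities = ["Moscow", "Murmansk", "Kazan", "Kaliningrad", "Krasnodar", "Vladivostok"]
index = PrefixIndex(cities)
print(index.suggest("Ka"))  # ['Kaliningrad', 'Kazan']
print(index.suggest("M"))   # ['Moscow', 'Murmansk']
```

In a real address catalog the index would live in the database rather than in application memory, but the shape of the query is the same: a range scan over an ordered key.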
The second application is the trendy topic called two-speed IT, which consultants push from every direction, saying corporations must go there.
There is an infrastructure layer; above it sit domains, for example the telecom billing system, corporate systems, corporate reporting. This is the core that should not be touched. It can be touched, of course, but only with paranoid attention to quality, because it is what brings the corporation its money.

Next comes the microservice layer, the part that differentiates one operator or player from another. Microservices can be created quickly on top of caches that pull data up from the various domains. This is the field for experiments: if something did not work out, close one microservice and open another. This genuinely improves time-to-market and increases the company's reliability and speed.
Microservices are perhaps Tarantool's main role at MegaFon.
Where we plan to apply Tarantool
If we compare our successful billing project with the transformation programs at Deutsche Telekom, Svyazkom, or Vodafone India, it is surprisingly dynamic and creative. In the course of the project not only were MegaFon and its structure transformed, but Tarantool Enterprise appeared at Mail.ru, and our vendor Nexign (formerly Peter-Service) produced the BSS Box, a boxed billing solution.
In a sense this is a historic project for the Russian market. It can be compared with what Frederick Brooks describes in "The Mythical Man-Month". Back in the 1960s IBM put 5,000 people on developing the OS/360 operating system for its mainframes. We have fewer, 1,800, but ours are crack troops, and given our use of open source and new approaches, we work more productively.
Below are the billing domains or, more broadly, the business systems. Enterprise people know CRM well. The other systems should be familiar to everyone by now: Open API, API Gateway.
Open API
Let's look at the numbers again and at how the Open API works now. Its load is 10,000 transactions per second. Since we plan to actively develop the microservice layer and build a public MegaFon API, we expect further growth in this part: 100,000 transactions per second for sure.

I don't know whether our SSO can be compared with Mail.ru's; the guys there handle something like 1,000,000 transactions per second. We are extremely interested in their solution and plan to learn from their experience, for example by building a functional SSO reserve on Tarantool. The Mail.ru developers are working on this with us now.
CRM
CRM covers those same 80 million subscribers that we want to take to a billion, and the 300 million documents with their three-year history. We are genuinely looking forward to new services, and the growth point here is connected services. This is a snowball that will keep growing as services multiply; accordingly, we will need that history and do not want to stumble over it.
Billing proper, that is, charging and work with customer receivables, was carved out into a separate domain. To stretch performance, we applied the domain architecture pattern.

The system is divided into domains, the load is distributed, and fault tolerance is ensured. We additionally worked on the distributed architecture.
Everything else is enterprise-level solutions. Call storage holds 2 billion calls a day, 60 billion a month. Sometimes they have to be recalculated over a whole month, preferably quickly. Financial monitoring covers those same 300 million documents, a number that keeps growing and growing: subscribers often migrate between operators, inflating this part.

The most telecom-specific component of mobile communications is online charging: the systems that decide in real time whether you may make a call. Here the load is 30,000 transactions per second, and with the growth of data traffic we are planning for 250,000 transactions, which is why Tarantool interests us so much.
The previous picture showed the domains where we are going to use Tarantool. CRM itself, of course, is broader, and we are going to apply Tarantool in its core as well.
Our target figure of 100 million subscribers troubles me as an architect: what if there are 101 million? Redo everything again? To prevent that, we use caches, which at the same time increase availability.
In general, there are two approaches to using Tarantool. The first is to build all the caches at the microservice level. As far as I understand, VimpelCom follows this path, building a client cache.

We are less dependent on vendors; we are changing the core of the BSS, so we get a single customer card file out of the box. But we want to remove its bottlenecks. So we take a slightly different approach: we build caches inside the systems.

This means less desynchronization: one system is responsible both for the cache and for the master source.
The approach fits well with Tarantool's transactional core, where only the parts touched by updates, that is, the data changes, are refreshed. Everything else can be stored elsewhere. No huge data lake, no unmanaged global cache. Caches are designed per system: for products, for customers, or to make service easier. When a subscriber is bothered by quality issues, we want to serve them well.
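The "cache inside the system" idea, where one system owns both the master record and its cache, can be sketched as a write-through cache. The `CustomerStore` class and its field names below are hypothetical:

```python
class CustomerStore:
    """Toy write-through cache: the same system owns the master store
    and the cache, so every update goes to both in one code path and
    the two cannot drift apart (the 'less desynchronization' point)."""

    def __init__(self):
        self._master = {}  # stand-in for the system of record (e.g. an RDBMS)
        self._cache = {}   # stand-in for the in-memory cache (e.g. Tarantool)

    def update_customer(self, customer_id, profile):
        # Write-through: master first, then cache, in the same operation.
        self._master[customer_id] = profile
        self._cache[customer_id] = profile

    def get_customer(self, customer_id):
        # Reads are served from memory; fall back to the master on a miss.
        if customer_id in self._cache:
            return self._cache[customer_id]
        profile = self._master.get(customer_id)
        if profile is not None:
            self._cache[customer_id] = profile
        return profile

store = CustomerStore()
store.update_customer(42, {"name": "Ivan", "tariff": "Unlimited"})
print(store.get_customer(42)["tariff"])  # Unlimited
```

The contrast with the microservice-level approach is who writes the cache: here the owning system does, so there is no separate synchronization pipeline to go stale.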
RTO and RPO
There are two terms in IT: RTO and RPO.

Recovery time objective is the time to restore service after a failure. RTO = 0 means that even if something fails, the service keeps working.

Recovery point objective describes the point in time to which data can be restored, that is, how much data we may lose. RPO = 0 means we lose no data.
Tarantool task
Let's try to solve this problem with Tarantool.

Given: everyone understands a cart of orders, as on Amazon or anywhere else. The cart must work 24 hours a day, 7 days a week, or 99.99% of the time. Incoming orders must preserve their sequence, because we cannot enable and disable a subscriber's connections at random; everything must be strictly sequential. A previous order affects the next one, so the data matters: nothing may be lost.
The solution. You could try to solve it head-on and ask the database developers, but the problem has no mathematical solution there. We could invoke theorems, conservation laws, quantum physics, but why: it simply cannot be solved at the database level.

What works here is the good old architectural approach: know the subject area well and use it to untangle the puzzle.
Our solution is a distributed order register on Tarantool, a geo-distributed cluster. In the diagram these are three different data centers, two west of the Urals and one beyond them, and we spread all requests across these centers.
Netflix, now considered one of the leaders in IT, had only one data center until 2012. On Christmas Eve, December 24, that data center went down. Users in Canada and the United States were left without their favorite movies, got very upset, and wrote about it on social media. Netflix now has three data centers on the US west and east coasts and one in western Europe.
We built a geo-distributed solution from the start: fault tolerance matters to us.
So we have a cluster, but what about RPO = 0 and RTO = 0? There is a simple solution, and it depends on the subject area.
What matters in orders? Two phases: filling the cart BEFORE the purchase decision, and AFTER it. The BEFORE part is usually called order capturing or order negotiation in telecom. There it can be much more complicated than in an online store, because you have to serve the customer and offer 5 options, and all of this takes a while as the cart fills up. A failure at this point is possible, but it is not scary, because everything happens interactively, under a person's supervision.
If the Moscow data center suddenly fails, we switch automatically to another data center and keep working. In theory one product might drop out of the cart, but the user can see that, complete the cart again, and continue. In this case RTO = 0.

Then there is the second mode: once we click "submit", we want no data to be lost. From this point automation takes over, and this is already RPO = 0. These two different patterns can be served, in one case, by a geo-distributed cluster with a switchable master, and in the other by some form of quorum write. The patterns may differ, but the problem gets solved.
Further, once we have a distributed order register, we can scale the whole thing: run many dispatchers and executors that work against this registry.
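The order register described above can be sketched as a toy model that keeps strict per-subscriber ordering while letting different subscribers be processed in parallel; the class and method names are made up for illustration:

```python
from collections import defaultdict, deque

class OrderRegister:
    """Toy order register: orders are strictly ordered per subscriber
    (a previous order can affect the next one), while orders of
    different subscribers can be taken by many executors in parallel."""

    def __init__(self):
        self._queues = defaultdict(deque)  # subscriber_id -> FIFO of orders
        self._seq = defaultdict(int)       # subscriber_id -> last sequence number

    def submit(self, subscriber_id, order):
        # Assign a monotonically growing sequence number at submit time;
        # after this point the order must not be lost (the RPO = 0 phase).
        self._seq[subscriber_id] += 1
        self._queues[subscriber_id].append((self._seq[subscriber_id], order))
        return self._seq[subscriber_id]

    def take_next(self, subscriber_id):
        """An executor takes the oldest pending order for this subscriber."""
        queue = self._queues[subscriber_id]
        return queue.popleft() if queue else None

reg = OrderRegister()
reg.submit("sub-1", "enable roaming")
reg.submit("sub-1", "disable roaming")
print(reg.take_next("sub-1"))  # (1, 'enable roaming'), strictly first
print(reg.take_next("sub-1"))  # (2, 'disable roaming')
```

In the real geo-distributed register the queues are replicated across data centers and the sequence numbers are assigned by the current master (or a quorum), but the ordering invariant is the same.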
Cassandra and Tarantool together
There is one more case, the balance showcase, and it is an interesting example of using Cassandra and Tarantool together.

We use Cassandra because 2 billion calls a day is not the limit; there will be more. Marketers love to color traffic by source, and there is ever more detail, on social networks for example. All of this grows the history.
Cassandra allows you to scale horizontally to any volume.
We feel comfortable with Cassandra, but it has one problem: it is not great at reading. Writes are fine, 30,000 per second is no problem; reads are the problem.
That is how the cache topic appeared, and along the way we solved another problem. There is an old, traditional flow in which records from the switches and from online charging arrive as files that we load into Cassandra. We worked on loading those files reliably, even drawing on advice about IBM managed file transfer; there are solutions that move files efficiently over UDP rather than TCP, for example. That is all good, but it still takes minutes, and until everything is loaded, a call-center operator cannot tell the client what happened to their balance; everyone has to wait.

To avoid this we run a parallel functional reserve: we send each event through Kafka to Tarantool and recalculate the units in real time, obtaining a balance cache that can serve balances at any speed, say 100 thousand transactions per second, within a couple of seconds of the event.
The goal is that 2 seconds after you make a call, your account shows not only the changed balance but also the reason it changed.
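This event-driven balance cache can be simulated in plain Python; here the Kafka topic is replaced by an in-memory list of events, and the event field names are invented for illustration:

```python
class BalanceCache:
    """Toy stand-in for the balance showcase: usage events (as they
    would arrive from a Kafka topic) are applied to an in-memory
    cache, so the current balance and the reason it changed are
    available seconds after the call, without waiting for the
    file-based load path into Cassandra."""

    def __init__(self, opening_balance):
        self.balance = opening_balance
        self.history = []  # why the balance changed, for the call center

    def apply_event(self, event):
        # `event` is a dict like {"type": "call", "cost": 3.5}; the
        # field names are illustrative, not a real CDR format.
        self.balance -= event["cost"]
        self.history.append((event["type"], event["cost"], self.balance))

cache = BalanceCache(opening_balance=100.0)
for ev in [{"type": "call", "cost": 3.5}, {"type": "sms", "cost": 1.0}]:
    cache.apply_event(ev)

print(cache.balance)      # 95.5
print(cache.history[-1])  # ('sms', 1.0, 95.5)
```

The key property is that the cache is fed by the same event stream as the system of record, so it is a functional reserve: if the file-load path lags, the operator can still answer from the cache.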
Conclusion
Those were examples of using Tarantool. We really liked Mail.ru's openness and their willingness to consider different cases.

It is hard for consultants from BCG, McKinsey, Accenture, or IBM to surprise us with anything new: much of what they propose we are already doing, have already done, or plan to do. I think Tarantool will take a worthy place in our technology stack and replace many of the existing technologies. We are in the active phase of this project's development.
Oleg and Andrey's talk was one of the best at last year's Tarantool Conference, and already on June 17 Oleg Ivlev will speak at T+ Conference 2019 with the talk "Why Tarantool in Enterprise". Alexander Deulin of MegaFon will also present "Tarantool Caches and Oracle Replication". We will learn what has changed and which plans came true. Join us: the conference is free, you only need to register. All talks have been accepted and the program is set: new cases, new experience with Tarantool, architecture, enterprise, tutorials, and microservices.