This is one of the coolest graphs in the project's history. It shows the total CPU time spent processing all user requests. At the right edge you can see the switch from PHP 5.6 to PHP 7.0; it happened in the afternoon of November 24, 2016.

From a computational point of view, Tutu.ru is first of all the ability to buy a ticket from point A to point B. To do that, we crunch through a huge number of schedules, build caches from a variety of airline systems and periodically run incredibly long join queries against the database. We are written in PHP and until recently ran entirely on it (if the language is properly prepared, you can even build near-real-time systems on it). Recently we have started rewriting performance-critical areas in Go.
We always have technical debt, and it accumulates faster than we would like. The good news: you don't have to pay all of it off. The bad news: as the supported functionality grows, the technical debt grows proportionally.
In essence, technical debt is the price of a wrong decision. You predicted something incorrectly as an architect, that is, you made a forecasting error or took a decision with insufficient information. At some point you realize that something in the code (often at the architecture level) needs to change. You can change it right away, or you can wait. If you wait, interest accrues on the technical debt. So it is good practice to restructure it from time to time. Or declare bankruptcy and rewrite the whole block from scratch.
How it all began: monolith and common functions
The Tutu.ru project started in 2003 as a typical Runet website of that era: a bunch of files instead of a database, PHP on the backend, HTML + JS on the front. There were a couple of brilliant hacks by my colleague Yuri, but he can tell that story better himself. I joined the project in 2006, first as an external consultant who could help with advice and code, and then, in 2009, I became technical director. First of all, order had to be restored in the air-ticket product: it was the most loaded and had the most complex architecture.
In 2006, as I recall, there was a train schedule and you could buy a train ticket. We decided to build the air-ticket section as a separate project, so everything was tied together only at the front end. All three projects (train schedules, rail tickets and air) ended up written each in its own way. At the time the code seemed normal to us, though somewhat unpolished. Not perfectionist. Then it aged, grew crutches, and by 2010 the rail part had turned into a pumpkin.
On the rail side we did not manage to pay the technical debt back in time. Refactoring was unrealistic: the problems were in the architecture. We decided to tear everything down and rebuild from scratch, but that is also hard on a live project. In the end only the old URLs were kept at the front, and the rest was rewritten block by block. As a basis we took the approaches used a year earlier when building the air-ticket product.
We settled on PHP. Even then it was clear that it was not the only option, but for us there were no reasonable alternatives. We chose it because we already had experience and groundwork with it, and it was clear that in the hands of senior developers it is a good language. Among the alternatives there were the insanely fast C and C++, but any rebuild or rollout of changes with them resembled a nightmare. Okay, not resembled. They were a nightmare.
MS and the whole .NET stack were not even considered for a high-load project; back then there were no real options other than Linux-based ones. Java is a decent option, but it is memory-hungry, it never forgives junior mistakes, and at the time it did not let you ship releases quickly - or at least we did not know how. Python we consider only for data-manipulation tasks, not as a backend. JS is purely for the front end. Ruby on Rails developers could not be found then (or now). Go did not exist yet. There was also Perl, but the experts we asked rated it as having no future in web development, so we passed on it too. PHP remained.
The next holy-war topic is PostgreSQL vs. MySQL. One is better at some things, the other at others. Back then, good practice was to pick what worked better for us, so we chose MySQL and its forks.
The development approach was monolithic - there simply were no other approaches yet - but with an orthogonal library structure. These were the beginnings of the modern API-centric approach: each library has a facade facing outward, which other parts of the project can call directly from code. Libraries were written in layers, where each layer accepts a certain format at its input and passes a certain format further down the code, with unit tests running between the layers. Something like test-driven development, only pixelated and scary.
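To make the idea of a facade over layered internals more concrete, here is a minimal sketch, with hypothetical class and method names (nothing here is taken from the actual Tutu.ru codebase): other parts of the project call only the facade, while the layers behind it each accept one known format and hand a known format to the next layer, so every boundary can be unit-tested.

```php
<?php
// A hypothetical "orders" library: one outward-facing facade, layered internals.

class OrderValidator            // layer 1: raw request array -> validated array
{
    public function validate(array $request): array
    {
        if (empty($request['from']) || empty($request['to'])) {
            throw new InvalidArgumentException('from/to are required');
        }
        return ['from' => (string)$request['from'], 'to' => (string)$request['to']];
    }
}

class OrderBuilder              // layer 2: validated array -> order array
{
    public function build(array $validated): array
    {
        return ['id' => uniqid('ord_'), 'route' => $validated];
    }
}

class OrderFacade               // the only class other parts of the project may call
{
    public function createOrder(array $request): array
    {
        $validated = (new OrderValidator())->validate($request);
        return (new OrderBuilder())->build($validated);
    }
}

// Usage from another part of the monolith:
$order = (new OrderFacade())->createOrder(['from' => 'MOW', 'to' => 'LED']);
```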
All this was spread across several servers, which let us scale under load. But the code bases of the different products intersected heavily at the system level. In practice this meant that changes in the rail project could affect the air product, and they often did. For example, the rail side would need to extend the payment handling, which meant modifying a shared library; air uses the same library, so joint testing was required. We fenced the dependencies off with tests, and it was more or less fine; even for 2009 the approach was fairly advanced. Still, load from one product could add load to another. The databases intersected as well, which led to unpleasant effects: local problems in one product slowed down the whole site. The rail product killed the air product on disk several times with heavy database queries.
We scaled by adding instances and balancing between them. Monolith as is.
The message bus era
Then we took a rather unconventional path. On the one hand, we started splitting out services (today this approach is called microservices, but we did not know the word "micro" yet); on the other, for interaction between them we used a message bus for data transfer rather than REST or gRPC as people do now. We chose AMQP as the protocol and RabbitMQ as the message broker. By that time we had gotten quite good at running daemons in PHP (yes, it has a perfectly working fork() implementation and everything else needed to work with processes), since the monolith had long used Gearman to parallelize requests to reservation systems.
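As an illustration of what such a PHP daemon on the bus might look like, here is a minimal sketch using the php-amqplib client; the queue name, credentials and handler logic are invented for the example and do not describe the production code.

```php
<?php
// Minimal PHP worker daemon consuming tasks from RabbitMQ over AMQP.
// Assumes the php-amqplib library (composer require php-amqplib/php-amqplib).
require __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('rabbit.local', 5672, 'guest', 'guest');
$channel    = $connection->channel();

// Durable queue: messages survive a broker restart.
$channel->queue_declare('search.requests', false, true, false, false);

// Hand out one message at a time so slow searches do not pile up on one worker.
$channel->basic_qos(null, 1, null);

$channel->basic_consume('search.requests', '', false, false, false, false,
    function (AMQPMessage $msg) {
        $task = json_decode($msg->getBody(), true);
        // ... query the reservation system, publish the reply to a response queue ...
        $msg->delivery_info['channel']->basic_ack($msg->delivery_info['delivery_tag']);
    }
);

// The daemon loop: block until the next message arrives.
while (count($channel->callbacks)) {
    $channel->wait();
}
```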
We built a broker layer on top of the rabbit, and it turned out that none of this really holds up under load out of the box. Any network loss, retransmits, delays. For example, a cluster of several brokers behaves somewhat differently from what the developers state ("this has never happened before, and here it is again"). In short, we learned a lot, but in the end we got services with the required SLA. For example, the most loaded service handles about 400 rps, with a 99th-percentile round trip from client to client, including the bus and the service's own processing, of about 35 ms. In total we now see about 18 krps on the bus.
Then came the bus-ticket product (road coaches this time, not the message bus). We wrote it from scratch, without a monolith, on the service architecture. Since everything was written from a clean slate, it came out very well, quickly and conveniently, although we constantly had to polish the tooling for the new approach. All of it runs in virtual machines, inside which PHP daemons talk to each other over the bus. The daemons were launched inside Docker containers, but there were no orchestration solutions like OpenShift or Kubernetes yet; in 2014 people were only starting to talk about them, and we did not consider that approach.
Compared with the number of plane or train tickets sold, bus tickets are a drop in the ocean. And for trains and planes moving to the new architecture was hard, because there was working functionality, real load, and the constant choice between building something new and spending money on paying down technical debt.
Moving to services is a good thing, but it is a long one, and load and reliability have to be dealt with now. So in parallel we took targeted measures to improve the life of the monolith. We split the backends by product type, which let us manage request routing more flexibly depending on the product: air separately from rail, and so on. We could predict load and scale each part independently. When we knew that the peak of New Year sales was coming in rail, for example, we added several virtual machine instances. Back then sales opened exactly 45 days before the last working day of the year, so around November 14-15 we had double the load. Since then FPK and other carriers have opened sales for many tickets 60, 90 and even 120 days ahead, and that peak has spread out. But on the last working day of April there will always be a rush for train tickets before the May holidays, and other peaks remain. My colleagues from the rail side can tell you more about ticket seasonality and the migration routes of demobilized soldiers; I will continue about the architecture.
Somewhere around 2014 we started pulling the one large database apart into many small ones. This was important because it had grown to a dangerous size, and its failure would have been critical. We began to carve out small dedicated databases (5-10 tables each) for specific functionality, so that failures would affect other services less and the whole thing would be easier to scale. For load distribution and scaling we used read replicas. Rebuilding the replicas of the large database after a replication failure could take hours, and all that time we had to "fly on a wing and a prayer". Memories of those periods still send an unpleasant chill somewhere between the ears. Now we have about 200 database instances, and administering that many installations by hand is laborious and unreliable, so we use GitHub's Orchestrator to automate work with replicas and ProxySQL for load distribution and protection against failures of a particular database.
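Before a proxy takes over, read/write splitting like this usually lives at the application level. Here is a rough sketch of the idea, with made-up host names and credentials and no claim to match the real Tutu.ru code: writes go to the primary, reads are spread across replicas.

```php
<?php
// Naive read/write split over a primary and several read replicas (PDO).
// Hostnames and credentials are placeholders for illustration only.

class Db
{
    private array $replicaDsns = [
        'mysql:host=replica1.local;dbname=tickets',
        'mysql:host=replica2.local;dbname=tickets',
    ];
    private string $primaryDsn = 'mysql:host=primary.local;dbname=tickets';

    public function write(string $sql, array $params = []): void
    {
        $pdo = new PDO($this->primaryDsn, 'app', 'secret');
        $pdo->prepare($sql)->execute($params);
    }

    public function read(string $sql, array $params = []): array
    {
        // Pick a random replica; a failed one could be skipped and retried.
        $dsn  = $this->replicaDsns[array_rand($this->replicaDsns)];
        $pdo  = new PDO($dsn, 'app', 'secret');
        $stmt = $pdo->prepare($sql);
        $stmt->execute($params);
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}

$db = new Db();
$rows = $db->read('SELECT * FROM orders WHERE user_id = ?', [42]);
```

ProxySQL moves this decision out of the application: it routes queries to the primary or to a healthy replica itself, so the code can talk to a single endpoint.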
How it is now
In general, we gradually started splitting out asynchronous tasks and running them in separate handlers, so that one does not interfere with another.
When PHP 7 came out, our tests showed a big jump in performance and a drop in resource consumption. The migration was mildly painful: from the first tests to moving all of production over took a little more than six months, but after that resource consumption fell almost by half. The CPU time graph is at the top of the post.
The monolith survives to this day and, by my estimate, makes up roughly 40% of the code base. It is worth saying that we have no explicit goal of replacing the entire monolith with services. We move pragmatically: everything new is built as microservices; if old functionality in the monolith needs to be modified, we try to move it to the service architecture, unless the change is very small. At the same time the monolith is covered with tests well enough that we can deploy twice a week with an acceptable level of quality. Different things are covered differently: unit tests are fairly complete, UI and acceptance tests cover almost all of the portal's functionality (we have about 15,000 test cases), API tests are more or less complete. We do almost no load testing. More precisely, our staging mirrors production in structure but not in capacity, and is wired up with the same monitoring. If the timings of a run on the new release differ from the old one, we generate load and see how critical it is. If the new release and the old one are about the same, we release to production. In any case, every feature goes behind a switch, so it can be turned off within a second if something goes wrong.
Heavy features are always tested on 1% of users first. Then we go to 2%, 5%, 10%, and so on up to all users. That way we can spot an atypical load before it turns into a killer spike and switch the feature off in advance.
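A percentage rollout like this can be as simple as a deterministic hash of the user ID against a per-feature threshold. The sketch below is an assumption about how such a switch might look, not the actual implementation; the feature name and the in-code storage are invented.

```php
<?php
// Deterministic percentage rollout: the same user always gets the same answer,
// and raising the threshold from 1 to 2, 5, 10 ... only adds users, never flips them.

class FeatureSwitch
{
    /** @var array<string,int> feature => rollout percentage (0 disables it instantly) */
    private array $rollout = [
        'new_search_engine' => 5,
    ];

    public function isEnabled(string $feature, int $userId): bool
    {
        $percent = $this->rollout[$feature] ?? 0;
        if ($percent <= 0) {
            return false;
        }
        // crc32 over feature + user keeps buckets independent between features.
        $bucket = crc32($feature . ':' . $userId) % 100;
        return $bucket < $percent;
    }
}

$switch = new FeatureSwitch();
if ($switch->isEnabled('new_search_engine', 123456)) {
    // serve the new code path
} else {
    // serve the old, known-good path
}
```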
Where needed, we have taken (and will take) 4-5 months for a reengineering project, when a team focuses on a single task. This is a good way to cut the Gordian knot when local refactoring no longer helps. We did this a few years ago with air: we redesigned the architecture, and as soon as it was done, development sped up immediately and we were able to launch many new features. Within a couple of months after the reengineering, customer orders grew by an order of magnitude. We started managing prices more precisely and connecting partners; everything got faster. Joy. I have to say it is time to do the same thing again, but such is fate: the ways applications are built change, new solutions, approaches and tools appear. To stay in business, you have to grow.
The main goal of reengineering for us is to speed up further development. If nothing new is needed, reengineering is not needed either: there is no point investing in modernizing something that will not change. But as long as we keep the stack and architecture modern, new people get up to speed faster, new features are wired in faster, the system behaves more predictably, and developers are more interested in working on the project. The current task is to rework the monolith, without throwing it out completely, so that each product can ship updates without depending on the others - that is, to get proper CI/CD in the monolith.
Today we use not only the rabbit to exchange data between services, but also REST and gRPC. Some microservices are written in Go: its computation speed and memory handling are excellent. There was a push to add Node.js support as well, but in the end we kept Node only for server-side rendering and left the business logic to PHP and Go. In principle the chosen approach lets us write services in practically any language, but we decided to limit the zoo so as not to increase the complexity of the system.
We are now moving to microservices that run in Docker containers orchestrated by OpenShift. The goal for the next year and a half is to have 90% of everything running inside the platform. Why? Deployments are faster, version rollouts are faster, and production differs less from the development environment. The developer can think more about the feature being implemented rather than how to deploy and configure the environment and where to run it - in other words, do more useful work. There are also operational reasons: there are many microservices, and managing them needs to be automated. Doing it by hand is very expensive and error-prone, while the platform gives us proper scaling.
Every year the load grows by 30-40%: more and more people get comfortable doing these things online instead of going to a physical ticket office, and we keep adding new products and features to existing ones. About 1 million users now visit the portal per day. Of course, not all users generate the same load. Some things need almost no computational resources, while searches, for example, are quite resource-intensive. In air, a single "plus or minus three days" checkbox multiplies the load by 49: for a round-trip search you get a 7 by 7 matrix of dates (see the sketch right after this paragraph). Everything else is fairly simple compared with searching for tickets inside the rail and airline systems. The cheapest in terms of resources are adventures and tour search (the cache there is not the simplest architecturally, but there are still far fewer tours than ticket combinations), then the train schedule (it caches easily with standard tools), and then everything else.
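Just to make the 49x figure concrete, here is a toy sketch (purely illustrative, not production code) that enumerates the date pairs a "plus or minus three days" round-trip search has to cover: 7 departure dates times 7 return dates gives 49 searches instead of one.

```php
<?php
// "Plus or minus three days" for a round trip: 7 outbound dates x 7 return dates = 49 searches.

$outbound = new DateTimeImmutable('2019-07-10');
$return   = new DateTimeImmutable('2019-07-17');

$pairs = [];
foreach (range(-3, 3) as $i) {
    foreach (range(-3, 3) as $j) {
        $pairs[] = [
            $outbound->modify(sprintf('%+d days', $i))->format('Y-m-d'),
            $return->modify(sprintf('%+d days', $j))->format('Y-m-d'),
        ];
    }
}

echo count($pairs), " searches instead of 1\n"; // prints 49
```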
Of course, technical debt keeps accumulating, from all sides. The main thing is to understand in time where you can get away with refactoring and everything will be fine, where you should not touch anything (that happens: we live with legacy if no changes are planned), and where you just have to roll up your sleeves and reengineer, because otherwise nothing will work. Of course we make mistakes, but overall Tutu.ru has existed for 16 years, and I like the project's dynamics.