"There are pluses for both admins and developers": Oleg Anastasyev about the Odnoklassniki cloud

Public clouds have long been no surprise, and about them is hardly required to explain something. Private is another story: everyone knows about their existence, but not everyone has ever dealt with them, so the experience of others is more interesting.

Therefore, the transition of Odnoklassniki from the approach “each server is engaged in its task” to the cloud approach “a single pool of resources is distributed as needed” has caused us questions. Migrating such a large project with an 11-year history to the new system is not easy - what exactly prompted you to go for these labor costs? How is the use of the cloud in Odnoklassniki different from the use of public services? Why are standard solutions like Mesos and Kubernetes not suitable for the project, and their own one-cloud was made?
')
These and other questions were answered by the leading Odnoklassniki developer Oleg Anastasyev , already familiar to Habr's readers.

- We remember that you have a “Big Rule” rule: any technical action must answer the question “Why?”. So, why did you need a cloud?

- One of the reasons is as follows. Operating directly with the glands is for the most part the prerogative of the system administrators, or system reliability engineers, call them what you want. We want to move developers closer to production. To do this, you need to give them a way to communicate with production in such a way as to save people from unnecessary bureaucracy, while not allowing anything to fall. And this is one of the ways to solve the problem.

Another reason is the ability to provide a more dense server load by increasing the utilization of data centers.

“Usually, the clouds and server load are said in this context:“ if a startup has a thousand users today, and tomorrow is a million, you can quickly scale with the cloud and use the required amount of resources without overpaying for unused ”.

But Odnoklassniki is no longer a startup, your audience cannot suddenly double, its growth is slower and more predictable. Why, then, without a cloud, do not provide a dense server load, allocating them for each task as much as is required?

- There are several considerations. First, yes, the audience is more or less stable, but this only applies to the number of users: the load is growing much faster. Now, due to the fact that video consumption (in particular, live broadcasts) is increasing very quickly, in terms of workload, we are growing twice a year - despite the fact that we have a fairly large load.

Secondly, although Odnoklassniki as a whole has existed for a long time, quite a lot of different new products are launched in them: the same broadcasts appeared relatively recently, many new products for administrators of groups, private announcements, music; Finally, the messenger TamTam. It is difficult to estimate in advance about such products how successful they will be, how quickly they will need to be scaled. And if each server is engaged in its task, it is not obvious how many servers are required to allocate. This is the same startup problem.

And thirdly, the load on a particular service can be extremely uneven, peak moments can be very different from normal. For example, when high-profile football broadcasts occur, the corresponding load increases tenfold, and resources need to be allocated for this. And some of these peak moments are not predicted in advance. This is a big task and a big problem.

- With the pluses it became clearer. And did I have to settle for the sake of them? Say, the reason for the separation of tasks between servers you previously called isolation failures. Has resiliency suffered from the fact that now there is no such separation?

- Naturally, magic does not exist, and you have to make some compromises. And if before you had a whole dedicated server for your tasks, and now on this server is a sinful sin, then there is a risk that the tasks cannot be completely isolated from each other. But, as far as I can judge now, the overall resiliency of the system in our case has only become higher. Precisely this will be able to show real, not simulated, large-scale accidents, so far they have not been with the cloud yet.

- Why is the need for a cloud approach felt right now? Has something changed at all in the industry, or specifically in Odnoklassniki?

- I think three factors came together here. New more or less open source infrastructure management systems began to emerge in the world, and from such well-known components we use Docker in our new system. When planning the expansion of the infrastructure, we realized that the place in the existing data centers would soon end, and we need to either build a new one, or we could consolidate into the existing ones. And we have freed up resources to finally solve this problem.

- In the case of the Odnoklassniki-scale project, the difficulty is not only to make a new system, but also to switch to it. How much laborious labor migration is for you? For example, compared to the situation when you are changing something on a website for users?

- Changes on the site is technically much easier. You pull the switch in the centralized configuration system, and users go to the new system. And migrating to a new cloud is a separate large process in which efforts must go and go from everyone, throughout the company.

It is clear that it is economically impossible to just put another one as big with a cloud next to our data center, and to transfer everything there, you must use the already working equipment. Therefore, introducing more and more infrastructure into the cloud is a complex task that is being performed gradually, it cannot be “just taken and done”.

On the other hand, it also requires support from the developers, they have to change something in the code, make containers for their applications (we take into account that we have almost 700 different microservices) ... In general, this is a big process. The first version of the cloud was written and launched far from one month ago, but the migration of systems to the cloud is still going on.

- And how easy is it to achieve this “support from the developers” - do you have to deal with “we need to cut the features here, do not want to be distracted by infrastructure issues, we’ll rather go to the cloud later”?

- I have a clear conviction that if people themselves do not want to use the new system, if it does not give them any advantages in comparison with the old one, then you will not drive them there with a stick. And the new cloud has advantages for both admins and developers.

Developers have an alternative. Or they are launching something in the cloud right now, because there is the required computational power there, or if they don’t want to migrate to the cloud for some reason, they need iron servers. That is, someone has to go to the data center, stick it into the rack, configure and so on, and so it should be done not in one data center, but in at least three (according to the fault tolerance requirements, the service should be distributed to three data centers). center), and it takes time.

And, accordingly, there is a choice: either spend a little time transferring your product to the cloud and launch it now, or wait for the physical servers to be configured, set up, and so on. The choice is obvious. And those teams that have the greatest need for resources, migrate with all possible speed.

- Many large companies around the world make their own clouds. And why are they not ready to simply use the conditional Amazon AWS - because theirs is cheaper, because they are worried about security, or because they want a large level of control and access?

“There are those who go to Amazon — so do Netflix, the pioneers of cloud computing.” But they are launching their own system of automatic control of infrastructure on top of AWS, they have already reached the point that it is more convenient for them to have their add-on over AWS.

But in any case, it depends on what the company had originally and what the current circumstances were. Does the company have its own iron servers? Are there any specialists who can develop infrastructure? All companies choose their path, a lot depends on the availability of specialists, on what they started, and even on the geographical location. For a large Russian company focused on the Russian market, hosting on Amazon will be just silly, because it's there, and we are here.

- And for those Odnoklassniki employees who will use your cloud, how will this differ from using the same AWS?

- Amazon is a slightly different system. It is older: although they are trying to keep up with the times, to do things like AWS Lambda, to use modern software, there are a lot of things to do. And in our country it looks different, because there is no such thing.

In addition, AWS is a universal product that is suitable for hosting a variety of things. Like any universal product, it is not very convenient in the context of a specific task, there are many different things that we do not need and only confuse.

And since we make a specific product for us, we work with it easier, the management is clearer, and we can naturally put it on how we usually do things. And so not only with hosting, but also with other components of the cloud, where they also moved away from universal solutions to their own solutions - for example, one-cloud.

- Specifically, in the fall you are going to talk about one-cloud in detail at our Joker and DevOops conferences, but in the meantime then tell us briefly: what is it all about?

- This is the operating system of the data center level - that is, a large distributed system that allows you to automatically manage the infrastructure using task descriptions (“task” is a broad concept, it can include thousands of replicas distributed to servers in several data centers). In addition, it can automatically maintain the performance of the task during various failures. For example, in the simplest case, if the server crashes, the system automatically restores the required number of replicas on other machines.

- And didn’t they use popular open source solutions like Kubernetes, either, because of their universality?

- Yes. We considered both Kubernetes and Mesos. It was clear that the first and second in its usual form do not fall on what we want. We could hold out the one and the other to what we need. But we appreciated the work that would have to be done, and the work to write our own, and it turned out that the amount of work was about the same. As a result, we decided to write our own, which will turn out to be more convenient and understandable, more reliable, more sharpened for us, for our requirements and infrastructure.

- Since you will talk about one-cloud at both Java-conferences and DevOps-conferences, I would like to know: will these reports differ?

- Yes. On DevOops, a report will be more about processes: how to build DevOps processes when you have clouds or your own data center management system, and how not to lose fault tolerance. And about ensuring the isolation of tasks on one machine. Since fault tolerance is very important for such a large distributed system, most of the report will be about how it is provided.

On Joker, it is planned to do less about the processes of ensuring fault tolerance, and more about how the cloud is arranged internally, what fault tolerance on this side provides. In addition, there is an idea to tell more about the operation of Java-applications in containers. Because our cloud is probably the only one in which one hundred percent of the tasks launched are written in the Java language, and there is an experience that is useful for those who write in Java.

- In the case of Joker, your preliminary report plan consists of the points “sodom”, “gomorr” and “gut”. There expect hardcore?

- Honestly, I find it difficult to judge, because, like one of the authors of one-cloud, I have a distorted perception. In general, it seems to me that everything is very simple, but other people do not always agree with this. I myself am wondering what happens in the end.

Both conferences, DevOops and Joker , will be held in autumn in St. Petersburg: DevOops on October 20, and Joker on November 3-4. On the sites of both you can see what else will be there, in addition to the report of Oleg - and on the sites of both you can already buy tickets.

Source: https://habr.com/ru/post/335234/

All Articles

"There are pluses for both admins and developers": Oleg Anastasyev about the Odnoklassniki cloud

More articles: