Resource allocation in large high-performance clusters. A lecture at Yandex
Most complex data tasks require a considerable amount of resources. As a result, almost every data center in the world serves not one, but many customers, even if they all operate under a common brand. Companies need computing power for a variety of services and goals, and achieving any one of them involves a whole set of subtasks. How does a data center cope with the flow of people who want to analyze or compute something? Incoming computation requests have to be executed in some order, while trying not to starve anyone of resources. This lecture covers the basic methods of distributing real tasks across a large cluster. The approach described by Ignat Kolesnichenko is used to serve almost all Yandex services.
Ignat heads one of the groups in our distributed computing technology service. He graduated from Moscow State University and the School of Data Analysis, and has been at Yandex since 2009.
Under the cut is a detailed transcript of the lecture along with the slides.

Good evening, everyone. My name is Ignat Kolesnichenko, I work at Yandex, in the distributed computing technology service. Today I will try to tell you about one problem that I encountered in very real life. To begin with, let's talk a little about what my work consists of and what we do. Then we'll go into a bit more detail, and finally move on to the problem itself.
What kind of product do we make? We make a product for Yandex developers and analysts. The challenge is that we build large clusters and software on top of them. These large clusters and the software on top of them let you store huge amounts of data, and not only store it, but also process it. By huge volumes I mean petabytes, dozens of petabytes. A petabyte is a thousand times more than a terabyte, so dozens of petabytes are tens of thousands of times more than a terabyte. The task is to build a system that ordinary Yandex developers and analysts can come to, run their fairly simple code on, and have it execute quickly, distributed across the entire cluster, and return a result. Then they can build some chart, see whether Yandex's share is growing or falling, draw conclusions, and improve their software. What is a cluster in our case? Very roughly, it is many, many servers called compute nodes. A server is essentially the same as your laptop, only much more powerful, and it sits not just anywhere, but on a rack in a data center. A typical server is not like a regular laptop with 4-8 cores: it has around 30 cores, 128 GB of memory, and in general plenty of resources that can be used to run tasks.
In addition, to manage these compute nodes, to run anything on them, and to store data, we need some system. Two parts of that system matter here: the meta-information server, which knows where the data lives in the cluster and what is happening on it, and the scheduler, which decides which tasks to run where. Today we will mostly talk about the scheduler. Let's take a closer look at what it can be. The scheduler is in fact a server, or perhaps several servers. From the interface point of view, in terms of how users work with it and how it communicates with the outside world, it has two kinds of communication. The first is starting computations: a developer comes and tells the server, in some form, to run the program they have written. The scheduler replies: "Yes, I have accepted your program and launched it. Here you can watch what happens to it as it executes, or fails to execute." The second is communication with our compute nodes, with the servers. What does a compute node tell the scheduler? Something like: "Look, I have this many free resources: half of my CPUs are idle, I have a lot of free memory, none of this is being used. Give me some task," to which the scheduler replies: "Here, take these new tasks." And every node communicates with this one and only scheduler. In more detail, it looks like this. The scheduler must have some strategy for deciding which tasks to run, when, and on which nodes. A user comes and starts a computation. The scheduler notes: "I have this computation" and stores it in its internal data structures. When a compute node comes to the scheduler and reports the tasks it is running and the resources it has free, the strategy must answer what exactly the node should do with its current tasks and what new tasks it should launch. For example, the strategy might say: "Please start two tasks of my first computation." Or it might say: "Run one task of the second computation, and also kill one task of the third computation, because the third computation is doing too much. Stop it."
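To make this exchange concrete, here is a minimal sketch of the scheduler's side of such a heartbeat protocol. All names here (Heartbeat, Scheduler, on_heartbeat) are my illustration, not the actual Yandex API; the point is only the shape of the interaction: the node reports its free resources and finished tasks, and the strategy answers with tasks to launch.

```python
from dataclasses import dataclass, field

@dataclass
class Heartbeat:
    """What a compute node reports: its free resources and the
    status of tasks that were already running on it."""
    node_id: str
    free_cpu: int
    free_mem_gb: int
    finished_tasks: list = field(default_factory=list)

class Scheduler:
    def __init__(self, strategy):
        self.strategy = strategy    # decides what to launch or preempt
        self.operations = []        # computations submitted by users

    def submit(self, operation):
        """A user starts a computation; the scheduler remembers it."""
        self.operations.append(operation)
        self.strategy.submit(operation)

    def on_heartbeat(self, hb: Heartbeat):
        """A node reports in: finished tasks have already freed their
        resources, so they show up in free_cpu/free_mem_gb. The answer
        is the list of tasks the node should now launch."""
        return self.strategy.schedule(hb.free_cpu, hb.free_mem_gb)
```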
Now a little more detail about computations and tasks, about what all of this actually is. The answer depends on the type of computation, but in the simple case we can say that a user comes with some written code, be it a Python script or a C++ binary, states what resources they want on the cluster, and describes the job in some way. The strategy takes note of this, and the data to be processed, which lies on different nodes, is split into pieces. On these pieces the strategy starts the computation. We will assume that every computation, also called an operation, consists simply of some set of tasks. A little further on we will see what is meant by this. I want to tell you about what we called a strategy on the previous slide: what it is, what kinds exist, what properties they should have, and how they should work.
What should the strategy do? Its key function is to decide which task to run. It has many different computations that users have submitted, and it needs to decide which of these computations, which of these operations, to run next. It is also important to understand what users want from the system. Obviously, users run their computations and want some kind of guarantee: they want the computation to complete, and as soon as possible. We will discuss later what "as soon as possible" means. And the global question is: what properties should a strategy have?
Before we talk about strategies, we still need to remember about resources. They are an important constraint on deciding what to run where. The two most basic and understandable resources are the RAM available on the nodes and the number of processors. How do we decide whether a given task can run on a given node? When a user starts a computation, they should tell us how much RAM it consumes and how much CPU it is going to use. If they did not report this, we assume some default value. If the user's real task suddenly consumes more than was ordered, we simply kill it and tell the user: "You started an invalid computation. Don't do that. Go change your limits." But we will assume the user knows how much CPU and RAM their program consumes, and we will act on that basis.
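Formally, the rule just described is two checks. Here is a minimal sketch, assuming a simple Task record with declared per-task demands; the names and the default values are my illustration, not the lecture's:

```python
from dataclasses import dataclass

@dataclass
class Task:
    id: str
    cpu: int = 1      # assumed default if the user declared nothing
    mem_gb: int = 4   # assumed default if the user declared nothing

def fits(task: Task, free_cpu: int, free_mem_gb: int) -> bool:
    """Admission: a task may go to a node only if its *declared*
    demand fits into the node's currently free resources."""
    return task.cpu <= free_cpu and task.mem_gb <= free_mem_gb

def enforce(task: Task, used_cpu: float, used_mem_gb: float) -> None:
    """Enforcement: if a running task eats more than it ordered,
    kill it and tell the user to fix the declared limits."""
    if used_cpu > task.cpu or used_mem_gb > task.mem_gb:
        raise RuntimeError(f"task {task.id} exceeded its declared "
                           "limits; change your limits and restart")
```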
As for the CPU, any user program uses at least one CPU: if the application is single-threaded, and most people write single-threaded applications, that is one CPU. Memory is a trickier question: you have to understand how many different data structures the user's program will allocate and how much memory it will consume. Figuring out how much their program consumes is the user's job. There are also less popular resources, such as network utilization. It may happen that a user program goes out over the network and downloads something. And there are user programs that actively hammer the disks. If tasks constantly read from random locations on a machine's hard disk, the disk quickly becomes the bottleneck and stops responding in any reasonable time. So this is also important to take into account. If you run many tasks on one machine that all want the disk, they will all work very slowly, and the user obviously doesn't want that.
So the user tells us how much of each resource every computation wants. And about the nodes, we also know how many resources they have: you can ask the operating system how much memory there is, how many cores are free, and how the resources of our compute server are being used. Now we can start talking about strategies. The first, absolutely simple strategy I will tell you about is the FIFO strategy. Let me draw how it is arranged.
Our scheduler lines up the operations in a queue. FIFO stands for "first in, first out", which simply denotes the concept of a queue. Say we had users, they started operations in some order, and our scheduler now has a queue of operations. After that, all our strategy has to decide, when a node comes with some free resources, is which of these operations' tasks to launch. Let's now plug in some down-to-earth numbers about our nodes and our operations, and walk through a concrete example of how the FIFO strategy works. Then it will be clear how it operates.
Let us have 32 CPUs and 64 GB of memory on each node. Let the first operation consist of 1000 subtasks, each of which consumes 1 CPU and 4 GB of memory. That is the first operation.
The second operation is completely different: it consists of 500 subtasks, each of which requires, say, 10 CPUs and 1 GB of memory. And so on.
Users came and started such operations, and our strategy needs to decide which of them to give to a node. Say a free node has come to the strategy, and it must decide what to do with it.
The FIFO strategy acts in the following simple way. The operations are drawn one after another for a reason: they are ordered by time. Whoever came earlier and launched their operation is first in the queue. The FIFO strategy will first ask the first operation: "First operation, do you have another task I can run on this node?" If there is one, it tells the node to start a task of the first operation. What does all this lead to?
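Here is a minimal sketch of this strategy in code, compatible with the Scheduler sketch above; the names are again illustrative, and a real scheduler tracks far more state:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Operation:
    name: str
    pending: int   # tasks not yet launched
    cpu: int       # per-task CPU demand
    mem_gb: int    # per-task memory demand

class FifoStrategy:
    """First in, first out: walk the operations in submission order
    and give the node tasks of the earliest operation that still has
    pending tasks fitting into the node's free resources."""

    def __init__(self):
        self.queue = deque()   # operations, earliest first

    def submit(self, op: Operation):
        self.queue.append(op)

    def schedule(self, free_cpu: int, free_mem_gb: int):
        launched = []
        for op in self.queue:
            while (op.pending > 0
                   and op.cpu <= free_cpu and op.mem_gb <= free_mem_gb):
                op.pending -= 1
                launched.append(op.name)
                free_cpu -= op.cpu
                free_mem_gb -= op.mem_gb
        return launched
```

For example, with the numbers above, this strategy offered a fresh node (32 CPUs, 64 GB) launches 16 tasks of the first operation, which exhaust the node's memory, and then nothing of the second.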
Let us also suppose, moreover, that we have 100 such nodes. If we have 100 such nodes, how many resources does the cluster have in total?
- 3200 CPUs and 6400 GB, that is, about six and a half terabytes.
- Yes. And the strategy will first launch everything from the first operation. It is easy to see that at some point it will have launched all of the first operation's tasks while the resources are not yet exhausted. That is, at some point all thousand tasks are already running, but there is still something free on the nodes. At that point the strategy reasons: "I still have free resources. I need to start something else." It will move on to the next operation and start launching its tasks. You can even work out how many tasks of the second operation it will be able to run at best.
Let's estimate. Having launched the first operation in full, we spend 1000 CPUs and 4000 GB of memory. That leaves 2200 CPUs and 2400 GB of memory. The strategy then launches the second operation into these remaining resources. Here the scarce resource is the CPU: each task wants little memory but a lot of CPU. Therefore, at best we will be able to run 220 tasks of the second operation. At that point the launching of tasks stops, until some tasks of our operations begin to finish. As soon as tasks of the first or second operation start finishing, the scheduler takes this into account. When a node comes to it, it reports not only what free resources it currently has, but also the status of the tasks already running on it; it will report that some of them have finished. The scheduler then says: "Aha, they are done! I can plan something else there." It will go and look at the second operation, and then the third, to plan something new.
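The arithmetic is easy to check in a few lines:

```python
total_cpu, total_mem = 100 * 32, 100 * 64   # 3200 CPUs, 6400 GB

# Operation 1: 1000 tasks of (1 CPU, 4 GB) are all launched first.
free_cpu = total_cpu - 1000 * 1             # 2200 CPUs left
free_mem = total_mem - 1000 * 4             # 2400 GB left

# Operation 2: each task wants (10 CPU, 1 GB); CPU is the scarce resource.
best_case = min(free_cpu // 10, free_mem // 1)
print(best_case)                            # 220 tasks, at best
```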
About those 220 tasks of the second operation, there is a catch. Does everyone see what the catch is? Why will we not always be able to run 220 tasks of the second operation?
- In the sense that it could turn out to be fewer?
- In some cases it can be fewer. More, I hope, we can never get in principle, because that would contradict our limits; but for some reason it may turn out to be fewer.
- Because the memory goes somewhere else.
- Our limits are honest: we really never run more than that or spend more than that. But the problem is that a task of the second operation wants 10 CPUs, and it may happen that on some node the first operation occupies 25 CPUs, leaving 7 free, and 7 free CPUs are obviously not enough. The scheduler has no right to launch even one task of the second operation there, because there are not enough resources for it. That is, there are free resources, but those free resources are unusable. This is a granularity problem, which we probably will not discuss today, but you should understand what it is. Generally speaking, a good scheduler should take it into account: if it sees that granularity prevents it from starting something somewhere, it should try to preempt something, for example a task of the first operation. That is convenient because the first operation's small tasks are easy to restart on other nodes, and preempting one frees enough room to run at least one task of the second operation.
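A tiny sketch of this granularity effect; the 25/7 split per node is the hypothetical situation from the answer above:

```python
# 100 nodes of 32 CPUs each; suppose the first operation happens to
# occupy 25 CPUs on every node, leaving 7 free per node.
free_per_node = [32 - 25] * 100

print(sum(free_per_node))                    # 700 CPUs free cluster-wide...
print(sum(f >= 10 for f in free_per_node))   # ...but 0 nodes fit a 10-CPU task
```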
Let's go further. I want to understand what requirements we have for a strategy. Generally speaking, one could write a long list; today I will consider the three most basic and important ones. Let's spell out what they mean. Fairness is not a simple requirement. What is meant by it? The FIFO strategy has a problem here. Imagine you have many users, they all came and started operations, and someone was lucky enough to launch first. And someone else was very unlucky: the first operation turned out to be very long, tedious and, in fact, perhaps not needed by anyone; the person launched it by mistake, say. Then everyone else stands and waits until this first operation completes, or the user cancels it, or something else happens to it. Clearly, a cluster user does not enjoy it when a neighbor who got in line earlier runs something for a long time, and there is nothing you can do about it, even though you urgently need a report for your job. You want to be guaranteed something.
What will this mean? Suppose we have three users: A, B and C. These users have somehow agreed on what share of the cluster belongs to whom. For example, based on their importance or some other considerations, they agreed that user A is entitled to 20% of the cluster, user B to 30%, and user C to 50%. We want to be able to communicate this information to our scheduler, so that its strategy can take it into account and distribute resources so that 20% of the cluster goes to user A, 30% to user B and 50% to user C.
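One standard way to act on such shares, sketched below, is to always serve the user who is currently furthest below their entitlement, i.e. the one with the smallest usage-to-share ratio. This is the general fair-share idea, not necessarily the exact formula used in the lecture:

```python
def pick_user(usage, shares):
    """Pick the user with the smallest usage/share ratio: the one
    currently furthest below their agreed share of the cluster."""
    return min(shares, key=lambda u: usage.get(u, 0.0) / shares[u])

shares = {"A": 0.2, "B": 0.3, "C": 0.5}     # agreed cluster shares
usage  = {"A": 0.25, "B": 0.10, "C": 0.40}  # fraction of cluster in use now

print(pick_user(usage, shares))  # "B": at 0.10/0.30 of entitlement, the lowest
```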
When we talk about this, a naive question arises: if they have divided the resources like that, why not give them three separate clusters instead of trying to live on one? There is a reason: we want it to be more profitable for them to unite than to live separately.
Why can that be? Imagine that user A's tasks on our cluster consume 1 CPU and 4 GB of memory each, while user B's tasks consume 10 CPUs and 1 GB of memory each. I claim that it is then more profitable for them to unite than to live separately.
Why is that? Imagine that user A has their own 20 machines and user B has their own 30 machines, and each runs their tasks only on their own machines. I draw two columns: the first column is CPU, the second is memory. I want to see, in each of these columns, how full the total resources of the cluster are. As a reminder, each machine has 32 CPUs and 64 GB of memory. And let's say these operations have many tasks to run, so they can consume all the resources available to them.
User A, for example, will consume all the memory, but only half of the CPU. A machine has 64 GB of memory and 32 CPUs. How many of user A's tasks fit into 64 GB of memory? 16 tasks. And 16 tasks, of course, occupy only 16 CPUs, so half of the machine's CPU sits idle.
Okay, how is the second user doing? The opposite way. Each of their tasks wants a lot of CPU and little memory. User B, apparently, will consume almost all the CPU and only a tiny bit of the memory: 3 tasks per machine occupy 30 of the 32 CPUs and just 3 GB of memory.
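The benefit of pooling is easy to see numerically. In the sketch below, the mixed packing of 12 of A's tasks plus 2 of B's per machine is just one feasible combination, picked by hand to fill the CPUs:

```python
CPU, MEM = 32, 64

# Separately: each user packs only their own tasks onto a machine.
a_only = min(CPU // 1, MEM // 4)    # 16 tasks: all memory used, half the CPU idle
b_only = min(CPU // 10, MEM // 1)   # 3 tasks: 30 CPUs used, 61 GB of memory idle

# Together: mix the two kinds of tasks on the same machine.
a, b = 12, 2
assert a * 1 + b * 10 <= CPU and a * 4 + b * 1 <= MEM
print(a * 1 + b * 10, "of", CPU, "CPUs used")   # 32 of 32: no CPU idle
print(a * 4 + b * 1, "of", MEM, "GB used")      # 50 of 64 GB
```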