Once a week, Yandex shuts down one of its data centers. We call these drills. What are they? How did they come about? Why do we do this? Is it sabotage? How dangerous is it? I regularly have to answer these questions both inside and outside the company, so today I decided to address them all at once.

Today we have several data centers of our own, housing tens of thousands of servers plus network equipment. A drill is a simulation of a real situation in which we lose either an entire data center or part of one.
To begin, let's turn to history and see how we arrived at this decision. Everyone is used to our services always being available, with no lunch breaks and no maintenance windows. Serious failures happen so rarely that each one becomes a notable event.
When the servers were big and the data centers were small ...
In 2000, Yandex rented four racks at MTU-Intel, hosting about 40 of its own servers. Those servers became the basis of Yandex's first independent data center, located in the company's first office in Moscow, on Vavilova Street, at the Computing Center of the Russian Academy of Sciences (CC RAS). In those years, all the servers and all the network infrastructure sat in that single data center. For several years we were lucky and no emergency happened. Everything changed at 7 p.m. on November 12, 2004, when, a couple of minutes before the start of the second round of the Yandex Cup, the power in the building suddenly went out.

The cause was a malfunction in the electricity supplier's equipment. By then we already had two data centers, but all the network equipment providing connectivity between our servers and the outside world was in the one that lost power. That evening, Yandex services were down for several hours.
That was the first major outage; others followed:
- poplar fluff clogged the air conditioners, which started heating the data center instead of cooling it, and servers had to be shut down;
- data centers lost power for various, sometimes quite incredible reasons, from a landlord forgetting to pay the electricity bill on time to a cat climbing into a transformer box and causing a short circuit;
- data centers flooded;
- and of course there was our favorite character, the excavator operator, who skillfully and unerringly dug exactly where our fiber optic cable lay.
In such conditions, we quickly realized that we could rely only on ourselves and had to be able to keep running on N-1 data centers. We started working on this in 2004, and back then many of the solutions that seem obvious and simple now were new to us.
Infrastructure development
We started with the infrastructure. After the events of November 12, our first data center got a diesel generator set, and the second got connectivity not only to the first data center but also to the outside world. Thus, in theory, it became possible to somehow serve users while one of the two data centers was down. It was clear that before this could work in practice, we also had to invest in redundancy so as not to collapse under load, fix a lot of things in the architecture of our projects, and keep developing the infrastructure.
An interesting fact: since then, all our data centers have been equipped with diesel generator sets, which have rescued us more than once. As you know, our main operating systems are Linux and FreeBSD, and when a data center is running on its diesel generators, we joke that we are running on Solaris (in Russian, "solyarka" is slang for diesel fuel).

It was also decided that the backbone network connecting the data centers should have no single points of failure. In particular, no two communication channels between data centers should follow the same path. The first redundant network topology was therefore a ring: at that point the channels interconnected our three data centers plus two traffic exchange points with connections to MSK-IX, at IKI and at M9. Over time the backbone grew into a double ring, and now all our data centers in the Moscow region are essentially connected in a full mesh, so a single cable cut does not affect the availability of services.
It took a lot of effort to rework our application code for the new environment. For example, we taught programs to write user data to two servers and thought through what to do if one of them became unavailable: ask the user to upload their data later, or write to the working server anyway and synchronize afterwards. We patched the IPVS kernel module for load balancing. We thought a lot about what to do with databases. MySQL, for example, always has a single master server; if it goes down, there are two options: put the service into read-only mode, or write scripts that quickly promote a server in another data center to master. It is hard now to remember everything that was done back then, but it was a lot of work.
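To make the dual-write idea above concrete, here is a minimal sketch, not our actual code: the hosts, the `write_replica` stub and the resync queue are purely illustrative. It writes user data to two data centers and falls back to one of them when the other is unreachable:

```python
# A minimal sketch of writing user data to replicas in two data centers.
# Everything here (hosts, write_replica, the resync queue) is illustrative.
import queue

REPLICAS = ["dc1.example.internal", "dc2.example.internal"]  # hypothetical hosts
resync_queue = queue.Queue()  # records waiting to be copied to a recovered DC

def write_replica(host: str, key: str, value: str) -> None:
    """Stub for a real network write; raises OSError if the host is unreachable."""
    print(f"wrote {key} to {host}")

def store_user_data(key: str, value: str) -> bool:
    """Try to write to both data centers; fall back to one and schedule a resync."""
    ok = []
    for host in REPLICAS:
        try:
            write_replica(host, key, value)
            ok.append(host)
        except OSError:
            pass  # this data center is unreachable right now

    if not ok:
        return False  # nowhere to write: ask the user to try again later
    for host in set(REPLICAS) - set(ok):
        resync_queue.put((host, key, value))  # copy the data once the DC is back
    return True
```

A background job would then drain the resync queue and copy the missing records once the second data center comes back.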
The preparatory stage took about three years. During that time we invested in duplicating components where it was necessary and justified; for some services we realized that rewriting everything from scratch would be far too expensive and instead taught them to degrade gracefully, shedding less important functionality for the duration of an outage.
By the fall of 2007 it was clear that it was time to test the results of this work. How? By actually turning off a data center, at a clearly defined time, with everyone who might be interested in the results at their desks. Today this is fairly common practice: most large players with comparable data centers, including many hosting providers, run such drills. Back then it was a far from obvious and rather risky step.
The first shutdown took place on October 25, 2007; we lived for 40 minutes without one of our data centers. At first not everything went smoothly, and the project teams watched their services closely: they worked out how to evolve their architecture further, wrote new monitoring checks, and fixed bugs. Since then, to the questions "How dangerous is it? Isn't this sabotage?" I answer: "Of course it's dangerous; no, it's not sabotage."
How do the drills go?
They usually take place once a week, in the evening. Why the evening? We regularly argue about it ourselves. There are many factors, but two stand out. First, we might break something during a drill: during the day that would affect more users and be noticeable from the outside, which is bad, while at night not everyone is at their desk, so we might miss something and not notice it, which is also bad. Second, evening traffic is below the peak but well above the minimum, so we can watch how services behave under load. The start time and duration sometimes vary. We occasionally combine drills with network maintenance carried out by the NOC. Since the network must always be up, maintenance on a data center's network can only be done while it is not serving user traffic; in that case the data center may be disconnected for several hours. This is fairly rare, because much of the maintenance, such as updating the software on ToR (Top of Rack) switches, is almost completely automated, and one network engineer can update several hundred devices by running a couple of scripts. In particular, for this we use RackTables, an open source product that we ourselves helped create.
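For illustration, a bulk update across hundreds of switches can be scripted along the following lines. This is only a sketch: the device list, credentials, platform type and upgrade command are placeholders, and in reality the inventory would come from a system such as RackTables.

```python
# A sketch of sending a software-update command to many ToR switches in parallel.
# Hosts, credentials, platform and command are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor
from netmiko import ConnectHandler  # pip install netmiko

SWITCHES = ["tor-01.example.internal", "tor-02.example.internal"]  # hypothetical
UPGRADE_COMMAND = "request system software add /tmp/new-image.tgz"  # illustrative

def upgrade(host: str) -> str:
    conn = ConnectHandler(
        device_type="juniper_junos",  # assumed platform, purely for the example
        host=host,
        username="netops",
        password="secret",
    )
    try:
        return conn.send_command(UPGRADE_COMMAND)
    finally:
        conn.disconnect()

# A handful of worker threads is enough to walk through hundreds of devices.
with ThreadPoolExecutor(max_workers=20) as pool:
    for host, output in zip(SWITCHES, pool.map(upgrade, SWITCHES)):
        print(f"{host}: {output[:60]}")
```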

So we have settled on a time, duration, and date: what next? An important part of the preparation is coordinating the shutdown. Since October 5, 2007, the current drill schedule for the near future has always been available on the internal wiki and, more recently, on the internal calendar; we try to plan the schedule several months ahead. One day before the hour X, the organizers send a final warning to several mailing lists and publish an announcement on the internal blog. The message contains a link to a list of what will be unavailable or degraded, and what should be watched especially carefully.
How does the shutdown itself work? During a drill we do not cut power to the servers in the data center under test; instead we model its loss at the network level. On the network equipment at the data center's perimeter, we shut down the virtual network interfaces, breaking its network connectivity. Doing it this way is simpler: powering off all the servers is harder and therefore more expensive. If you cut the power, the facility needs staff on site who can bring everything back up after the drill ends, and in practice some of the equipment does not come back up and has to be replaced. According to our statistics, up to 5% of disks may no longer be recognized by the operating system after a power-off. And, of course, the most important factor is that if something goes wrong, you need to be able to bring the data center back quickly. In our case we can do that fairly fast, so even when a drill does not go according to plan and some service or part of its functionality breaks, most users never notice.
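Conceptually, the network-level "shutdown" and the quick rollback can be sketched like this. It is a simplified illustration: the router names, interface lists and IOS-style commands are assumptions, not our real configuration.

```python
# Sketch: "disconnect" a data center by shutting down uplink interfaces on its
# perimeter routers, and bring them back the same way if the drill is aborted.
from netmiko import ConnectHandler  # pip install netmiko

PERIMETER = {  # hypothetical perimeter devices and the interfaces facing outside
    "border-1.dc3.example.internal": ["Vlan100", "Vlan200"],
    "border-2.dc3.example.internal": ["Vlan100", "Vlan200"],
}

def set_interfaces(host: str, interfaces: list[str], up: bool) -> None:
    conn = ConnectHandler(device_type="cisco_ios", host=host,
                          username="netops", password="secret")
    try:
        config = []
        for iface in interfaces:
            config += [f"interface {iface}", "no shutdown" if up else "shutdown"]
        conn.send_config_set(config)
    finally:
        conn.disconnect()

def start_drill() -> None:
    for host, ifaces in PERIMETER.items():
        set_interfaces(host, ifaces, up=False)  # break connectivity

def abort_drill() -> None:
    for host, ifaces in PERIMETER.items():
        set_interfaces(host, ifaces, up=True)   # quickly restore the data center
```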
Debriefing
At the end of each drill we collect information about what happened and hold a debriefing. If some services behaved in an unplanned way, their teams work out together what caused it and file tasks for themselves.
Each time, fewer and fewer errors show up. Recently, for example, within a short interval we disconnected first our oldest and then our largest data center, and the result pages for those outages do not contain a single bug ticket. Nice! Of course, things do not always go so smoothly: our services are constantly evolving and, unfortunately, it is impossible to keep track of everything. But that is exactly why we run these test shutdowns of our data centers.

We still have work to do, and there are several important tasks ahead. One of them is to ensure continuous operation not only of our user-facing services but also of our development process. Right now test and development environments are mostly not duplicated, and during drills developers are left drinking tea and coffee. We are experimenting with migration on OpenStack, which is not easy: there are a lot of virtual machines and data, and our inter-data-center links are wide but not infinite. The alternative is to rapidly deploy a virtual machine in our IaaS with the necessary environment and data using scripts.
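As a sketch of that second approach, recreating a development VM in another location with openstacksdk might look roughly like this; the cloud name, image, flavor and network are placeholders, not real resources.

```python
# Sketch: instead of migrating a VM out of a data center being drilled,
# recreate it elsewhere from a base image and re-provision it with scripts.
import openstack  # pip install openstacksdk

def redeploy_dev_vm(name: str) -> None:
    conn = openstack.connect(cloud="dc2")           # credentials from clouds.yaml
    image = conn.compute.find_image("dev-base")     # hypothetical base image
    flavor = conn.compute.find_flavor("m1.medium")  # hypothetical flavor
    network = conn.network.find_network("dev-net")  # hypothetical network

    server = conn.compute.create_server(
        name=name,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    # Wait until the instance is ACTIVE; provisioning scripts (e.g. cloud-init)
    # would then pull in the code and test data the developer needs.
    conn.compute.wait_for_server(server)
    print(f"{name} is up in the backup location")
```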
Why do we run the drills regularly? The load is constantly growing, we now make about 1,000 code changes per day, the architecture changes, the number of servers behind each service changes, and conditions change over the years: old data centers close and new ones open. In such a constantly changing environment, we have to keep solving the same problem: running on N-1 data centers.