How to survive scaling and keep two data centers in sync
No site can guarantee uninterrupted operation over, say, a whole year: that is a given, for a number of reasons. So you need a "plan B" - fault tolerance at the data-center level and a backup site that can promptly pick up the traffic. Everyone synchronizes servers: Yandex, Google, and the heroes of this post. When we started building a new site constructor, there were only 10 people on the team, the workplace was an apartment rather than a proper office (not for long, but memorable), and today's 700,000 users, ~50 million inodes and ~6 TB of data were still far beyond the horizon. That gave us more than a year to experiment with server synchronization and, along the way, to grill colleagues from other projects about how they handle it.
After the nth experiment and the nth event, our programmer Maxim came in and said: "It seems we are not the only crazy ones." In short, we ended up writing our own synchronization system, and now we will tell you why.
Prologue: uKit through the eyes of a sysadmin
One of the main entities in our system is the "site": a document in the database that holds references to its constituent "widgets", which the user arranges and fills with content. Having placed and filled the widgets to their taste, the user decides it is time to click the "Publish" button, and the site has to make it into production, so to speak.
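Purely as an illustration of that data model, a pair of such documents might look roughly like the sketch below; every field name here is hypothetical and not taken from the real uKit schema.

```js
// Hypothetical shapes, for illustration only (not the real uKit schema):
// a "site" document referencing its "widget" documents by id.
const site = {
  _id: "58a3f0c2e4b0a1b2c3d4e5f6",
  domain: "vasya.ru",
  ownerId: "user-42",
  widgetIds: ["w-101", "w-102"],  // ordered references to widget documents
  publishedAt: null               // set once the user clicks "Publish"
};

const widget = {
  _id: "w-101",
  type: "text",
  position: { page: "index", order: 1 },
  content: "Hello, world!"
};
```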
Since we chose MongoDB as the project's main database, we did not worry much about fault tolerance on that side: it has asynchronous replication out of the box, and that suited us fine.
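As a reminder of what that gives you, here is a minimal sketch of a replica set spanning two data centers, written for the mongo shell; the host names, priorities and member count are illustrative assumptions, not our actual topology.

```js
// Minimal replica-set sketch (illustrative hosts, not the real uKit setup):
// a primary in DC 1, an asynchronously replicating secondary in DC 2,
// plus an arbiter to keep a voting majority.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "dc1-mongo-1:27017", priority: 2 },
    { _id: 1, host: "dc2-mongo-1:27017", priority: 1 },
    { _id: 2, host: "dc1-arbiter:27017", arbiterOnly: true }
  ]
});
```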
We also needed to be able to switch quickly between two data centers (what if a meteorite hits one of them?), which means every site must always be up to date on both servers. We wanted it to be as easy as it is with Mongo. But simplicity is always fraught with complexity, and vice versa. Especially vice versa!
The first idea was to copy the current state of each site into a separate collection. The task was handed to a programmer in exactly that form, but after tinkering with it he suggested simply dumping everything into files and putting them on disk. Maxim generally tends to go easy on system resources.
Thinking it over, we gradually warmed to the idea: from a "highload" point of view you can hardly come up with anything better or more reliable than handing a file on disk to Nginx, with its all-conquering sendfile and thread pools. A database, however good, sooner or later starts demanding sharding, new indexes and other routine maintenance, and since we are building a high-load service, we have to think about such things. In the end, it is simply easier.
What problems may arise with bare Rsync
So, we have static files, mostly HTML and images in various formats. How do you synchronize a not-too-large number of files between two servers? Rsync.
While we were small enough, we simply put the synchronization of the entire sites directory on cron, once a minute, with a lock file. The script ran in about a second; we switched between servers several times and knew no grief.
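For the record, a minimal sketch of that early scheme, written as a Node script to keep all examples in this post in one language (in practice a shell one-liner in crontab does the same job); the paths, host and rsync flags are illustrative.

```js
// sync.js: run from cron every minute; a lock file prevents overlapping runs.
// Paths, host and rsync flags are illustrative, not the real uKit setup.
const fs = require('fs');
const { execFileSync } = require('child_process');

const LOCK = '/var/run/sites-sync.lock';
const SRC = '/var/www/sites/';
const DST = 'backup-dc:/var/www/sites/';

let fd;
try {
  fd = fs.openSync(LOCK, 'wx');              // fails with EEXIST if a run is in progress
} catch (e) {
  if (e.code === 'EEXIST') process.exit(0);  // previous run still going, just skip
  throw e;
}

try {
  execFileSync('rsync', ['-a', '--delete', SRC, DST], { stdio: 'inherit' });
} finally {
  fs.closeSync(fd);
  fs.unlinkSync(LOCK);                       // release the lock
}
```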
The marketers' job is to increase the number of sites in the system. And it must be said, they cope with the task.
We began to notice that at peak hours the lock file would hang around for 10 minutes or more: the directory hierarchy had grown, and rsync no longer managed to walk it quickly. Rsync first walks all the directories on the source and builds a tree, requests a similar tree from the receiver, compares the two... and only then starts sending anything, which also takes time. As a result, the static files on the second server lag behind, and the discrepancy does not correspond to any single point in time, which adds to the problems.
Add asynchrony? Ah no, we won't
So we arrived at the idea of "distributed file systems", of which the world has plenty: LeoFS, Lustre, GlusterFS, XtreemFS, WhateverFS. But at the time (this may have changed since), they were all either synchronous, or their asynchronous mode did not really work, or worked too poorly.
And Lsyncd handles long breaks in connectivity between the servers badly.
At a crossroads: Amazon or your own Lunapark?
The search for the "grail" led us to a curious fork with two options:
Option one: use S3-like storage, whether Amazon S3 or one of its many analogues, or deploy such a storage in-house. On the one hand, this would let us stop thinking about where and how our files are stored and get on with our lives; on the other, it promises plenty of new problems.
First, we would have to rewrite all of the code that works with files, and also refactor the places that rely on a file being written to disk almost instantly and becoming immediately accessible. Second, we would lose the ability to walk through the files with Midnight Commander, which sometimes comes in handy when debugging.
Third, under such a scheme the files live with somebody else, and that somebody can pull any dirty trick on us: suddenly double the price, say, or drop our service after mixing up a console window. Yes, the files could stay on our side in some Swift / Elliptics / Riak, but that is reasonable when you have tens of times more data than we do, if you are Yandex, for example. There the data immediately lands on at least 5 servers: three with data and two managers, and that is the minimum.
In general, more complicated than our task required.
Option two: develop our own, highly specialized solution. Our own Lunapark, so to speak, or a bicycle, if you prefer. Yes, code still has to be rewritten, but you can find a single place that most of it already passes through. And the files in that case remain what they were: ordinary files on our own servers.
There is only one drawback: you have to build it yourself and finish before the disk space runs out. Hour X was to be the eve of the new year, 2017.
A turning point, or how the older comrades do it
Choosing the home-grown solution took time. We were guided not only by our own wants and don't-wants, but also by the experience of colleagues in the trade: we caught them at events and questioned them thoroughly.
The meeting with the guys from VK was perhaps the turning point: on the sidelines of a conference we asked them what they use to synchronize files. They said they synchronize them with their own code, because they had not found anything good either.
Later we learned that Badoo synchronizes photos in a similar way.
So Maxim sat down to write the system, and in the meantime I squeezed extra disk space out of the LVM storage and cleaned up the heavier logs.
"Vershansky's comb": how to solve the ordering problem while staying asynchronous
We took RabbitMQ as the basis of the system (it was already used in the project) and Node's fs module, since most file operations already passed through it. The idea was this: by wrapping the fs module, we make it record every performed action in the queue and only then consider the action complete (fire the callback). Next to the queue runs a daemon that takes a task from the queue and sends it over HTTP to another daemon running on the receiving server. If the task is a file creation, it POSTs the file; if it is a deletion, it sends a delete command, and so on.
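A minimal sketch of the sending side under those assumptions: the amqplib client, a queue named fs-ops and the JSON payload format are our own illustration rather than the actual uKit code, and only the three-argument form of fs.writeFile is handled here.

```js
// Sending-side sketch: wrap fs.writeFile so an operation is acknowledged only
// after it has been recorded in a RabbitMQ queue. Queue name, payload shape
// and connection URL are illustrative assumptions.
const fs = require('fs');
const amqp = require('amqplib');

const QUEUE = 'fs-ops';
let channel;

async function init() {
  const conn = await amqp.connect('amqp://localhost');
  channel = await conn.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });
}

const originalWriteFile = fs.writeFile;
fs.writeFile = function (path, data, callback) {  // real code must also handle the options variant
  originalWriteFile(path, data, (err) => {
    if (err) return callback(err);
    // Record the completed action in the queue, and only then fire the callback.
    channel.sendToQueue(
      QUEUE,
      Buffer.from(JSON.stringify({ op: 'create', path })),
      { persistent: true }
    );
    callback(null);
  });
};

// The daemon next to the queue takes each task and forwards it over HTTP to
// its twin on the receiving server (a POST for a created file, a delete
// command for a removal, and so on).
async function consume(forwardToReceiver /* async HTTP helper, e.g. a POST */) {
  await channel.consume(QUEUE, async (msg) => {
    const task = JSON.parse(msg.content.toString());
    await forwardToReceiver(task);
    channel.ack(msg);
  });
}

module.exports = { init, consume };
```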
But if we simply take tasks as they come and handle them asynchronously, as Node.js does by default, our actions can get out of order: for example, we might try to write a file into a folder that does not exist yet. And if we execute the tasks strictly one by one, it becomes slow.
Just as this dilemma was being discussed, Victor happened to be at the whiteboard; he always seems to turn up wherever an interesting discussion is going on. Vitya started drawing vertical lines representing CRUD actions: the lines ran parallel to each other and were labeled a, b, c, d...
Pasha, who works on this kind of interesting thing in the company, also helped flesh out the "comb" idea.
Bingo! Our files are stored in the good old a/b/c scheme: the site vasya.ru lives at the path /v/a/s/vasya.ru. We decided that the receiving side would execute tasks in order within a single "top-level" letter, but across letters asynchronously. It turned out both fast and reliable. And since Vitya's diagram on the board looked like a comb, the principle was named after the programmer who turned up in the right place at the right time.
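A minimal sketch of the "comb" on the receiving side: tasks whose paths share the same top-level letter are chained one after another, while different letters proceed in parallel. The function names and task shape are illustrative.

```js
// "Comb" scheduling sketch for the receiving daemon: strict order within one
// top-level letter, full parallelism across letters. Names are illustrative.
const chains = new Map();   // top-level letter -> tail of that letter's promise chain

function topLetter(filePath) {
  // '/v/a/s/vasya.ru/index.html' -> 'v'
  return filePath.split('/').filter(Boolean)[0];
}

function enqueue(task, applyTask /* async fn that performs the CRUD action */) {
  const letter = topLetter(task.path);
  const tail = chains.get(letter) || Promise.resolve();
  // Append to this letter's chain; log errors so the chain keeps moving.
  const next = tail
    .then(() => applyTask(task))
    .catch((err) => console.error('task failed', task, err));
  chains.set(letter, next);
  return next;
}

// Tasks under /v/... run strictly in order; tasks under /a/... run
// concurrently with them:
// enqueue({ op: 'create', path: '/v/a/s/vasya.ru/index.html' }, applyTask);
// enqueue({ op: 'delete', path: '/a/b/c/abc.ru/old.png' }, applyTask);
```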
Don't look down on the nanoseconds: what to keep in mind when testing a home-grown system
When there was something to test, we decided to check that the directories in the two locations were identical in the same way we had done it in the distant past: with rsync in dry-run mode.
With this option nothing actually happens, but the screen shows the actions that would be performed: you can see every file that would need to be copied, that is, every point of "non-synchronicity".
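A sketch of such a check driven from Node, assuming rsync is installed on the machine: the -n (dry-run) and -i (itemize changes) flags make rsync print what it would transfer without touching anything. Paths and host are illustrative.

```js
// Verification sketch: run rsync in dry-run mode and treat every itemized
// line as a point of "non-synchronicity". Paths and host are illustrative.
const { execFile } = require('child_process');

const SRC = '/var/www/sites/';
const DST = 'backup-dc:/var/www/sites/';

execFile('rsync', ['-a', '-n', '-i', '--delete', SRC, DST],
  { maxBuffer: 64 * 1024 * 1024 },
  (err, stdout) => {
    if (err) throw err;
    const diffs = stdout.split('\n').filter(Boolean);
    console.log(`${diffs.length} path(s) differ between the servers`);
    diffs.slice(0, 20).forEach((line) => console.log(line));  // show a sample
  });
```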
Another problem was waiting for us there: our Node.js exposes a file's timestamps only to the nearest millisecond (and pads the rest with zeros), while rsync, when comparing files for sameness, also takes the nanoseconds into account, and so considers files different when they are actually the same.
We could have computed and compared md5 hashes of the files, but that is very slow. A search for a ready-made module able to set file timestamps with "UltraHD" precision led nowhere, and we ended up writing our own C module for the purpose. So the return to the roots of synchronization also brought us back to the roots of programming.
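The C module itself is beyond the scope of this post, but the mismatch is easy to observe from JavaScript: the { bigint: true } stat option (which appeared in Node 10.5, long after the events described here) exposes nanosecond timestamps next to the millisecond ones. Shown for mtime, the timestamp rsync's quick check compares; the paths below are hypothetical.

```js
// Illustration of the precision gap (not the C module mentioned above):
// two copies of a file can match to the millisecond yet still differ in the
// nanosecond part. Paths are hypothetical.
const fs = require('fs');

const original = fs.statSync('/sites/v/a/s/vasya.ru/index.html', { bigint: true });
const replica  = fs.statSync('/backup/v/a/s/vasya.ru/index.html', { bigint: true });

console.log('same to the millisecond:', original.mtimeMs === replica.mtimeMs);
console.log('same to the nanosecond :', original.mtimeNs === replica.mtimeNs);
```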
Happy end: the heroic rescue of two participating servers
Everything was going well, and we began preparing to replace the disks on the second server with more capacious ones and to move over to it.
In preparation for the move we wrote a "cold sync" script, affectionately called coolSync, to transfer files that have been sitting on disk for a long time with no operations performed on them (though we do not know in advance exactly which files those are). The usual recursive rsync did not fit here, because that recursion runs forever and the file list it builds loses relevance before it finishes. Our script generated the a/b/c paths itself, walked them strictly depth-first, and ran rsync for each individual folder on every iteration. In about a week we managed to transfer almost all the files.
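A sketch of the coolSync idea: generate the /a/b/c prefixes ourselves, descend only into those that actually exist, and run a separate rsync for each leaf folder so that every invocation works with a small, fresh file list. The alphabet, paths, host and flags are illustrative, not the exact uKit script.

```js
// coldSync sketch: walk generated a/b/c prefixes depth-first and rsync each
// existing leaf directory on its own. Alphabet, paths, host and flags are
// illustrative. --relative (-R) recreates the v/a/s prefix on the receiver.
const fs = require('fs');
const path = require('path');
const { execFileSync } = require('child_process');

const ROOT = '/var/www/sites';
const REMOTE = 'backup-dc:/var/www/sites/';
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz0123456789'.split('');

for (const a of ALPHABET) {
  for (const b of ALPHABET) {
    for (const c of ALPHABET) {
      const rel = path.join(a, b, c);                       // e.g. 'v/a/s'
      if (!fs.existsSync(path.join(ROOT, rel))) continue;   // generated path may not exist
      // One small rsync per leaf folder keeps each file list fresh.
      execFileSync('rsync', ['-aR', `${rel}/`, REMOTE],
        { cwd: ROOT, stdio: 'inherit' });
    }
  }
}
```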
A couple of weeks later we found out why "almost". When we began checking the synchronization of individual "letters" between our two servers, we discovered missing files: it turned out, for example, that somewhere a file gets moved to another directory immediately after being created, and the sending daemon can no longer find it in the expected place.
In the end we managed to beat these problems. The new year 2017 was already around the corner but, as in a Hollywood blockbuster, we made it by hour X, and the system has been working in the form described above ever since.
Instead of an epilogue
Unfortunately, there are no ready-made solutions for projects like this. So build your own Lunaparks, and share your experience.