I want to share with the community our experience of an unplanned relocation of a large project. It will be of most interest to those who keep more than one dedicated server for a project in a single data center. There are no detailed technical solutions here, but there is some common sense.
Initial data
- The project uses 10 servers, all in the same data center, running nginx, PHP, and MySQL, with no full-fledged mail server or other services worth mentioning;
- The site has more than 3.5 million page views per day;
- More than 12 million requests per day to nginx, excluding requests for images; a quarter of them reach PHP;
- More than 250 GB of traffic per day from nginx, excluding images;
- The data center's preliminary estimate for restoring the servers was more than a few days.
At the time of failure, we had:
- One backup server;
- A configured copy of the site, which we occasionally checked on the backup server;
- Database copies using MySQL replication;
- File copies; according to monitoring, the synchronization lag did not exceed 20 minutes;
- A configured SOA record for the DNS zone, which lets a DNS change reach almost all users within about an hour.
Problems and solution
After the failure, we decided to move in two directions:
- Restore the site on the existing backup server in read-only mode, caching everything and “forever”;
- Order new servers in another data center to restore full operation of the site.
That way, while we waited for the new servers, the site would not lose its positions in search results, and visitors could still view the content, although without the ability to leave comments, log in, and so on.
We had planned for the failure of the main server, or of several servers at once, but we considered being left without all 10 servers for a long time unlikely. Here is what we did not take into account:
- The working configs on the backup server were not updated as regularly as on the main servers, although they are all stored in the repository. As a result, some URLs returned errors and some pages were not cached;
- Cache. Letting visitors onto 1 server without a warm cache instead of 10 is “deadly” for it, especially on top of the previous point;
- During the switchover it may turn out that the project itself is not quite ready for “read only” mode. Pages already in the cache had to be fixed, and after the developers' fixes the cache had to be cleared. Do not agree to delete the entire cache built with such difficulty: with every minute more users pick up the new IP as DNS propagates, and the number of visitors on the site grows. Clearing the cache an hour after the DNS change brought the server down;
- We did not find a data center that could provide more than one powerful dedicated server within a day. Even with all the backups, you cannot quickly start restoring the site in full unless you have arranged for this in advance.
All of this could have been prepared in advance, but we had to act after the fact.
For the first point, we analyzed the logs and edited the configuration files for “read only” mode. Namely:
- Showed users a polite message explaining that logging in was not possible;
- Blocked comments and other write actions at the nginx level, without letting those requests reach PHP;
- Used return 503 in the required places, together with error_page 503 = /trouble.html (see the sketch after this list);
- Fixed the remaining bugs so that all other pages could be cached.
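To give an idea, here is a minimal sketch of such a read-only setup in nginx; the URL patterns and the path to the stub page are illustrative assumptions, not our exact configuration:

```nginx
# Serve a friendly stub page for everything we answer with 503
error_page 503 = /trouble.html;

location = /trouble.html {
    root /var/www/static;   # assumed path to the static stub page
}

# Cut off write actions (comments, login, etc.) at the nginx level,
# before they ever reach PHP
location ~ ^/(login|logout|comment|ajax/) {   # illustrative URL patterns
    return 503;
}

# Anything that is not a plain read gets the same treatment
if ($request_method !~ ^(GET|HEAD)$) {
    return 503;
}
```

The fragment is meant to live inside the server block; with the “=” in error_page the stub is returned with the status of the static file (200) rather than 503.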
For the second and third points, the following solution emerged: make sure that requests from our IP get priority, and that each of them, instead of ending in a 504 error, is guaranteed to create a page in the cache and, if necessary, update it. Without worrying about nginx best practices, since we had to revive the site at least a little, we used the “evil” if to generate the cache from our IP, separately from other visitors.
In our case, it looked like this:
```nginx
if ( $remote_addr = 8.8.8.8 ) {
    fastcgi_pass unix:/var/run/php-fpm/fpm-private.sock;
}
fastcgi_pass unix:/var/run/php-fpm/fpm-all.sock;
```

Instead of 8.8.8.8, substitute your own IP. In addition, we had two PHP-FPM pools: fpm-private for us and fpm-all for everyone else. To update the cache, we used the fastcgi_cache_bypass directive.
To build the cache, we ran wget -r -l0 -np example.com. To update it, we uncommented the lines with $nocache. Thus, we had a working site for ourselves, with the ability to generate the cache and, when needed, refresh it from our IP. After several hours of work there were already a few dozen gigabytes of cache.
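For reference, here is a sketch of how the caching and that commented-out $nocache line can be wired together; the cache zone name, paths, and timings are assumptions for illustration, not our exact config:

```nginx
# http-level context
fastcgi_cache_path /var/cache/nginx levels=1:2 keys_zone=site_cache:64m inactive=30d;
fastcgi_cache_key  "$scheme$request_method$host$request_uri";

# $nocache is 1 only for requests coming from our IP
map $remote_addr $nocache {
    default   0;
    8.8.8.8   1;   # substitute your own IP
}

server {
    listen      80;
    server_name example.com;

    location ~ \.php$ {
        fastcgi_cache         site_cache;
        fastcgi_cache_valid   200 301 302 30d;   # cache "forever" for the duration of the outage
        #fastcgi_cache_bypass $nocache;          # the line we uncomment to refresh pages from our IP
        include               fastcgi_params;
        fastcgi_pass          unix:/var/run/php-fpm/fpm-all.sock;
    }
}
```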
I will barely touch on optimizing server settings here; such topics are well covered in separate articles. In this case, database queries were read-only, so MySQL optimization was simple: the database was not the bottleneck. The nginx configuration work came down to adjusting timeouts and buffers to handle all connections. The network load was considerable, since there was no longer a separate server for serving images. The weak point was PHP: even after generating a large cache, some pages took 20 seconds to generate because of the queue in PHP-FPM.
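Purely as an illustration of the kind of timeout and buffer tuning meant here (the numbers are placeholders, not the values we actually used):

```nginx
worker_processes  auto;

events {
    worker_connections  8192;        # enough slots for all connections on one box
}

http {
    keepalive_timeout        15;
    send_timeout             30;
    fastcgi_connect_timeout  10;
    fastcgi_read_timeout     60;     # let a queued PHP-FPM worker answer instead of returning 504
    fastcgi_buffers          16 16k; # buffer responses so slow clients do not tie up PHP-FPM
    fastcgi_buffer_size      32k;
}
```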
As for the last point, the search for new servers: we failed to resolve it promptly. Within 1-2 days, data centers could offer servers with at most a Core i3-i7 and 8 GB of RAM; more powerful servers meant an even longer wait, plus the time to sign contracts and pay for them. We did find a data center that agreed to provide 4 powerful servers as a hot standby within 3 days. These servers are now enough for the site to run, minus a couple of heavy modules that are not critical for us, and we can switch to the other data center whenever necessary.
One solution to the problem of backup servers is a preliminary agreement with a backup data center on providing additional servers in case of problems with the main one. You can go further and scale the site across more than one data center, which is what we have since done.
Results
- You can’t live without backups, and you should check them regularly, by hand and by any other means; there is no such thing as too much care here;
- Checks of the backup server must be thorough. In our case, load testing was missing, again because we had not foreseen such a situation;
- If you have only one backup server, configure it up front so that it can serve real load, even if only in a limited mode;
- Take the time, together with the project team or architect and the administrators, to walk through the most unlikely scenario: an explosion, a fire, or anything else at the data center that makes you lose access to all the servers at once. Analyzing and documenting such a case, and preparing and testing the configuration files for it, will cover most of the likely failures. Then, if a UFO carries off your servers, you will not have to admit that you never expected it, and you will be able to recover the project :);
- In our case, the site was completely unavailable for just over an hour, followed by a few more hours of intermittent operation. Had we prepared for such a situation, we could have spent the hour while DNS records propagated generating the cache, so that we could meet the first visitors and not collapse under the load later.
In the end, the data center restored service in a little more than a day. Since we had been working in read-only mode, we did not have to transfer any data back. After checking the servers, we pointed the domain back.
I hope my experience will be useful to you at a difficult moment, when you lose access to your servers indefinitely and you have backups but no standby cluster in another data center. I would be glad if you shared in the comments how you would act in such a situation.