
It happens: sites go down because of a failure at the hoster, a failure of communication channels, and so on. I have been working in hosting for 7 years, and I see such problems often.
A couple of years ago, I realized that a backup-site service (one that requires no modification of the customer's website or service) is very important to customers. In theory, everything is simple:
1. Have a copy of all the data in another data center.
2. In case of failure, switch the work to backup DC.
In practice, the system went through 2 complete technical overhauls (the main ideas were preserved while a significant part of the toolkit changed), 3 moves to new equipment, and 1 move between service providers (from a German data center to two Russian ones). Studying the behavior of the different systems in real conditions under client load took 2 years.
Note that even if a hoster uses a cluster solution to host client VDS, the vast majority of existing cluster solutions are designed to work within a single data center, or within one complex system whose failure stops the entire cluster.
Main decisions
1. Replication from local disks via DRBD, WITHOUT drbd-proxy
2. Switching between data centers by announcing their own network of IP addresses, without changing them for domains
3. Separate backup (regular copy, not a replica) + monitoring in the third data center
4. Hypervisor-based virtualization (in this case, KVM)
5. Internal L3 network with dynamic routing: IPsec, mGRE, OSPF
6. The system is built from simple components, each of which can be temporarily disabled or replaced.
7. The principle that there is no "most reliable system"
8. Fatal point of failure
The reasons for these decisions are described below.
System requirements
1. Run the client's service without modifications (i.e. run projects that were not designed for clustering)
2. The ability to continue work after a single equipment failure, up to the shutdown of any one data center.
3. Recovery of client services within 15 minutes after a fatal failure; data loss of up to 1 minute before the failure is acceptable.
4. The ability to recover data if any two data centers are lost.
5. Maintenance / replacement of own equipment and service providers without downtime of client services.
Data storage
The best choice turned out to be DRBD without the proxy, with traffic compressed and encrypted at the same time using IPsec. With synchronization protocol B and local caching of data inside the virtual machines, the 10-12 ms delay (the ping between the Moscow and St. Petersburg data centers) does not affect the speed of work; moreover, this delay applies only to writes, while reads come from local disks and are fast.
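For illustration only, here is a minimal sketch of what one such replicated volume could look like as a DRBD 8.x resource; the hostnames, internal addresses, and volume names are hypothetical, not taken from the real setup:

```bash
# Hypothetical DRBD resource for a single client VM image, replicated between
# the two data centers with protocol B (a write is acknowledged once the peer
# has received it into memory).
cat > /etc/drbd.d/vm101.res <<'EOF'
resource vm101 {
  net       { protocol B; }
  device    /dev/drbd101;
  disk      /dev/vg0/vm101;        # local LVM volume backing the VM image
  meta-disk internal;
  on node-msk { address 10.10.1.1:7801; }   # primary data center
  on node-spb { address 10.10.2.1:7801; }   # standby data center
}
EOF
drbdadm create-md vm101   # initialise metadata (run on both nodes)
drbdadm up vm101          # attach the device and start replication
```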
Five options were explored:
1, 2. Ceph in rbd (block devices) and CephFS (distributed file system) variants
3. XtreemFS
4. DRBD
5. DRBD with DRBD-Proxy
Common to Ceph
Advantages:
Convenient bandwidth scaling. Simple addition of capacity, automatic distribution / redistribution of data across servers.
Disadvantages:
It works within a single data center; with two distant data centers you need two clusters and replication between them. It does not work well on a small number of disks (data integrity checks suspend current disk operations). Only synchronous replication.
Diagnosing a collapsed cluster is a difficult task: you either have to spend a very long time breaking and repairing the system yourself, or have a contract for fast support from the developers.
When updating the cluster there is a critical moment: restarting the monitors that form the quorum. If something goes wrong at that moment, the cluster stops and then has to be reassembled manually.
CephFS
It was supposed to be used as a single file system for storing all the data of LXC/OpenVZ containers.
Advantages:
The ability to create snapshots of the file system. The size of the file system and individual files may be larger than the size of the local disk. All data is simultaneously available on all physical servers.
Disadvantages:
For each file-open operation the server has to contact the metadata server to check whether the file exists, where to find it, and so on. Metadata caching is not provided. Sites become slower, and the slowdown is visible to the naked eye. Local caching is not implemented.
Developers warn that the system is not ready yet and it is better not to store important data in it.
Ceph rbd
Intended use: one block device per container.
Advantages:
Convenient snapshots and image cloning (with format 2 images). The image size may be larger than the local disk. Local operations are cached by the host / container operating system. There are no delays for frequently repeated reads of small files.
XtreemFS
Advantages:
Declared support for replication over long distances, including working offline; it can maintain partial replicas.
Disadvantages:
In tests it proved to be very slow and was not investigated further. It seems well suited for distributed storage of an archive of data/documents/disks so that, for example, each office has its own copy; it is not intended for actively changing files such as databases or virtual server images.
DRBD
Advantages:
Replicated block devices. Reads come from local disks. Can operate standalone, without a cluster. Natural data storage: in case of problems with DRBD, you can connect to the backing device directly and work with it. Several synchronization modes.
Disadvantages:
Each image can be synchronized only between two servers (replication across 3-4 servers is possible, but when the master server is switched, difficulties with distributing metadata between servers are expected, and the required throughput is multiplied accordingly).
The size of the device cannot exceed the size of the local disk.
DRBD with DRBD-Proxy
A paid add-on to DRBD for long-distance replication.
Advantages:
1. Compresses traffic well, 2-10 times relative to work without compression.
2. A large local buffer accepts write operations and sends them gradually to the remote server without slowing down operations on the primary (in asynchronous replication mode).
3. Sane support with fairly quick answers at certain times (apparently, if you catch them during working hours).
Disadvantages:
Immediately upon launch I ran into a bug that had already been fixed, but the fix had not yet been published; a separately compiled binary with the fix was sent to me.
In tests it proved to be extremely unstable: the simplest high-speed random-write test hung the proxy service so badly that the entire server had to be restarted, not just the proxy.
From time to time it goes into a spinlock and simply eats an entire processor core.
Switching traffic between data centers
Two options were considered:
1. Switching by changing records on DNS servers
2. By announcing your own network of IP addresses from two data centers
The announcement of our own network via BGP was chosen as optimal.
Changing records on DNS servers
Advantages:
Ease of implementation
By pinging, you can see which data center traffic is arriving at.
Disadvantages:
Some clients are not ready to delegate their domains to foreign DNS servers.
Long switching time: DNS caching is often more aggressive than the specified TTL, and even with a TTL of 5-15 minutes, an hour later someone will still be hitting the old server, and individual scanners even after a few days.
It is impossible to preserve the IP addresses of client servers when moving between data centers.
In the case of a partial loss of connectivity with a data center, the DNS servers may start giving out different IP addresses, and the switchover will happen only partially.
Announcing our own network of addresses
Advantages:
Fast, guaranteed switching of traffic between data centers. In tests within Moscow, a change in the BGP announcement propagates within a few seconds. Worldwide it may take longer, but it is still faster and more reliable than DNS.
It is possible to cut off traffic from a half-working data center that the system has lost contact with but that is still visible to part of the Internet.
Disadvantages:
Complicated configuration of internal routing. Switching individual resources of the system is only partially possible: traffic will arrive at the data center that is closer to the client, and leave from the data center where the virtual server is running.
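The article does not say which routing daemon or addresses are used; purely as a sketch, announcing such a block from one site's border router could look roughly like this in BIRD, with the prefix and AS numbers below being documentation/private values chosen for illustration:

```bash
# Hypothetical BIRD 1.x fragment that announces our own address block to the uplink.
cat >> /etc/bird/bird.conf <<'EOF'
protocol static own_block {
  route 203.0.113.0/24 reject;      # originate the block locally
}

filter announce_block {
  if net = 203.0.113.0/24 then {
    bgp_path.prepend(64512);        # prepend on the standby site so that
    accept;                         # the primary data center is preferred
  }
  reject;
}

protocol bgp uplink {
  local as 64512;
  neighbor 198.51.100.1 as 64511;   # the data center's border router
  import none;
  export filter announce_block;
}
EOF
birdc configure   # reload; a changed announcement propagates within seconds nearby
```

During a failover the standby site would drop the prepend (or the primary would withdraw the prefix), and traffic follows the new announcement.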
Backup in a third data center
Losing data in two data centers at once is quite realistic: for example, a software bug that deletes the data on the primary and the standby server simultaneously, or a hardware failure on the primary server while a backup is in progress or data is being resynchronized.
For such cases a server has been installed in a third data center that is excluded from the overall cluster system. All it does is monitoring and storing backup copies.
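As a sketch of how such an off-cluster copy could be taken (the article does not describe the actual backup mechanism, so the volume and host names below are assumptions):

```bash
# Hypothetical copy of one VM volume to the backup server in the third data center.
LV_NAME=vm101
SNAP_NAME=vm101-bak

lvcreate --snapshot --size 10G --name "$SNAP_NAME" /dev/vg0/"$LV_NAME"   # point-in-time view
dd if=/dev/vg0/"$SNAP_NAME" bs=4M status=progress \
  | gzip \
  | ssh backup-dc3 "cat > /backup/$LV_NAME/$(date +%F).img.gz"           # ship it off-cluster
lvremove -f /dev/vg0/"$SNAP_NAME"                                        # drop the snapshot
```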
Virtualization method
The following options were considered:
LXC, OpenVZ, KVM, Hyper-V
The choice was made in favor of KVM, because it provides the greatest freedom of action.
LXC
Advantages:
Easy to install, works on a standard Linux kernel without modifications. Provides basic container isolation.
No performance loss on virtualization.
Disadvantages:
Low level of isolation.
No live migration between servers
Only Linux systems can be run inside the container, and even for Linux guests there are limitations on kernel module functionality.
OpenVZ
Advantages:
No performance loss on virtualization
There is a live migration
Disadvantages:
It runs on a modified kernel; you need to build additional modules manually and may run into compatibility problems because the environment is non-standard for them.
Only Linux works inside the container, and even for Linux guests there are limitations on kernel module functionality.
KVM
Advantages:
Works without modifying the system kernel
There is a live migration
You can connect equipment directly to the virtual machine (disks, usb devices, etc.)
Disadvantages:
Performance loss on hardware virtualization
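As a hedged illustration of the live migration mentioned above (the guest and node names are hypothetical), moving a running KVM guest between hosts with libvirt looks roughly like this, assuming its disk is already accessible on the target node (for example via the DRBD replica):

```bash
# Hypothetical live migration of the running guest "vm101" from the Moscow node
# to the St. Petersburg node over the internal network.
virsh migrate --live --persistent vm101 qemu+ssh://node-spb/system
virsh -c qemu+ssh://node-spb/system list   # confirm vm101 is now running there
```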
Hyper-V
Advantages:
Good integration with Windows
There is a live migration
Disadvantages:
The features needed from the Linux side are not supported: caching on SSD, replication of local disks, remote connection to the VDS console for the client.
Choosing the internal network
The task: provide an internal network whose addressing is independent of the external network and of a server's physical location; the ability to quickly redirect traffic to a server when its physical location changes (when a VDS is moved); the possibility of arbitrary routing to each specific server (i.e. a server moves without changing its IP address). A fully meshed network between the data centers is desirable. Traffic protection is desirable.
Initially tinc with a fully meshed L2 network was used. It is the easiest option to set up and relatively flexible, but not always predictable. After a series of routing experiments I came to the conclusion that routing at the L3 level is exactly what is needed: predictable, manageable, fast. Inside the internal network, dynamic routing runs over OSPF, with a route published for each private IP address, i.e. every router knows through which of the routers each specific server is reachable.
In the case of an L2 network the tables would be roughly the same, but less transparent, since they would be hidden inside the software rather than in the kernel's standard routing tables.
If necessary (in case of problems with OSPF), as the number of routes grows, this system can easily be replaced with completely static route registration through our own services.
Considered options:
OpenVPN, L2 tinc, GRE, mGRE
The mGRE option was selected. L2 traffic and multicast are not needed at the moment; if necessary, multicast can be added via software on the nodes, and there is no need for L2 traffic at all. The lack of encryption was compensated by setting up IPsec for the traffic between nodes; IPsec, incidentally, also compresses it. (A small setup sketch is given after the option comparison below.)
When setting this up in real conditions, an interesting feature emerged: despite filtering being completely disabled in the data center, its equipment looks inside the GRE protocol and analyzes what is there. It drops packets carrying OSPF traffic, and an additional 2 ms of delay appears. So IPsec turned out to be needed not only for abstract encryption, but for the operation of the system as such.
The data center's specialists have asked the equipment vendor why such filtering occurs, but have not received an answer yet (it has been 1-2 months now).
OpenVPN
Advantages:
Already familiar. It works well. Able to work in L2 / L3 modes.
Disadvantages:
Works on a point-to-point or star connection. To organize a full mesh network, you will need to support a large number of tunnels.
Tinc
Advantages:
Out of the box it can organize a fully meshed L2 network between an arbitrary number of points. Easy to set up. I used it for about 1-2 years in the previous version of the system, before moving the servers to Russia.
Disadvantages:
Routing becomes unpredictable if two machines with the same MAC address appear in the network (for example, a split-brain in the cluster). The delay in detecting that a server has changed location when it moves is about 10-30 seconds.
It carries L2 traffic, while in practice only L3 is needed.
GRE
Advantages:
Works in the kernel. Simple to configure. Can carry L2 traffic.
Disadvantages:
No encryption
Need to support a large number of interfaces
mGRE
Advantages:
Works in the kernel. Simple to configure. A mesh network is created with just one tunnel interface, and neighbors are easy to add.
Disadvantages:
No encryption. Cannot carry L2 traffic; no multicast out of the box.
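Here is the setup sketch promised above. It is illustrative only: the addresses, the interface name, and the use of static NBMA mappings instead of NHRP are assumptions, not the actual configuration. A multipoint GRE interface is brought up on each node, the routing daemon then runs OSPF over it, and IPsec in transport mode between the public addresses provides encryption (and keeps the data-center equipment from looking inside GRE):

```bash
# Hypothetical multipoint GRE (mGRE) interface on one node of the internal network.
ip tunnel add mgre0 mode gre local 198.51.100.1 key 10 ttl 64   # no "remote" => multipoint
ip addr add 10.254.0.1/24 dev mgre0                             # internal overlay address
ip link set mgre0 up

# Static NBMA mappings: internal neighbor address -> that neighbor's public address.
ip neigh add 10.254.0.2 lladdr 198.51.100.2 dev mgre0 nud permanent
ip neigh add 10.254.0.3 lladdr 192.0.2.3    dev mgre0 nud permanent
```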
Using ready-made cluster solutions, and the principle that there is no "most reliable supplier / hardware / program"
While using the "reliable" CephFS / Ceph rbd storage, I managed to break it badly enough that the developers had to get involved. Over several days I received the necessary consultations through the IRC channel, and in the course of the diagnostics it became clear that diagnosing and fixing such a problem on your own is practically impossible: it requires deep knowledge of the system and a lot of experience with this kind of diagnostics. Moreover, when such a system breaks, the cluster stops working altogether, which is unacceptable even with a support contract. On top of that, contracts for round-the-clock fast support for any such product are very expensive, which would immediately kill a mass-market service, since you cannot sell cheaply what you bought expensively.
The same applies to any "supplier of reliability" whose internals are closed or not studied down to a natural understanding of every detail. The cluster was therefore built so that every component of the system can, in an emergency, be switched off if it stops working as it should, and replaced with something else at least temporarily, with the loss of some functionality but while keeping client services as a whole running.
DRBD can be switched off: you can remove it and connect to the LVM volumes directly. Synchronization between the nodes will stop and live migration will become impossible, but as an emergency measure while the DRBD problem is being fixed (most likely metadata, configs, or a rollback of the software version) this is acceptable. If the problem drags on, replication via rsync, LVM snapshots, lvmsync, etc. is possible.
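A minimal sketch of that fallback, using the same hypothetical resource names as above:

```bash
# Hypothetical emergency fallback: stop DRBD on this node and use the backing
# LVM volume directly (with internal metadata the DRBD data sits at the end of
# the LV, so the filesystem at the start of the device stays readable).
drbdadm down vm101
virsh edit vm101    # manual step: point the disk at /dev/vg0/vm101 instead of /dev/drbd101
virsh start vm101
```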
KVM host systems are never updated all at once. If a single server fails, all client services continue to run on the standby while repair work is underway; all client services are moved off a node before work on it begins.
The third data center handles monitoring and backups. If the backups are lost, they can temporarily be replaced by LVM snapshots on the main nodes of the system. If monitoring is lost, the hosting system and all client services keep working; what breaks is the automatic shutdown of failed resources. At the moment this is an acceptable compromise; if necessary, this subsystem can also be duplicated.
The internal VPN network with dynamic routing: if it breaks, all resources can be moved into one data center and run without the VPN network.
The public block of IP addresses: at the moment this is a compromise and a single point of failure. If for some reason the address block stops working (we forget to pay, the organization that issued the block is shut down, the block is withdrawn during address-space optimization), access to customer resources will be lost. Here we rely on the assumption that such things usually do not happen unexpectedly and there will be time to prepare for losing the block. If the block is lost unexpectedly, there is a fallback option: take a block of addresses from the data centers. Client resources can be restored within 1-2 days; this is essentially the time needed to agree on and configure the address block with the data center, since the cluster itself does not depend on it, and only the DNS records of the domains need to be updated.
In the future, this point of failure can also be eliminated by obtaining a second block of addresses through another company.
Data center
In practice, data centers sometimes lose power and connectivity despite all their backup systems; sometimes the most reliable piece of hardware breaks and the Internet drops simultaneously in several data centers that worked through it; sometimes state authorities halt a data center's operation during an investigation, and so on. All of this has been experienced first-hand in Russian and foreign data centers.
In the final version of the hosting, these problems are also taken into account: the main hosting system is located in two independent data centers located in different regions of the Russian Federation: Moscow and St. Petersburg.
The data centers use independent communication channels, are not legally affiliated with each other, and have no shared components such as common routers.
Fatal point of failure
Such a system protects the end customer from any single failure in any subsystem and from some multiple failures, with one exception: the continued existence of the service provider itself. The client inevitably has to accept this risk when using someone else's infrastructure; the alternative is to build and maintain a similar system of their own.
What happened
1. A client comes and says: I do not want to go down.
2. In most cases no adaptation of the project is needed, it just has to be hosted on our servers. As a rule, clients give us access to their current hosting and our support does everything itself.
3. We give the client a test address for accessing the server. The client checks that everything works as they are used to, and minor issues are fixed. Hardware virtualization allows any project to be copied exactly, together with its existing operating system, programs and settings, without having to repeat the configuration on the new server and risk forgetting something on the old one.
4. The switchover happens without interruption; the longest part is getting the new IP addresses working. When transferring sites, the move can be done with a break of only a few minutes.
5. A hosting contract is concluded and an invoice is issued. For a small typical project it costs 4500 per month (this includes hosting, support, and duplication).
6. After that, usually nothing goes down.