Simple failover for a website (monitoring + dynamic DNS)

In this article I want to show how easily, and at no cost, you can build a failover scheme for a website (or any other Internet service) by combining the okerr monitoring service with a dynamic DNS service. That is, if anything goes wrong with the main site (from a “PHP Error” on the page, to a lack of disk space, to a suspiciously small number of orders in an online store), new visitors are directed to a second (third, and so on) server that is known to work, or to a “Sorry” page that politely explains that “there is a problem, we already know about it and are already fixing it, it will be resolved soon” (and in this case you really will know about it and be able to fix it).

Life with failover or without?

Until something goes wrong, there is no particular difference. But when it does, without failover the following often happens: you try to quickly figure out what the problem is and fail (backups do not restore, software for some reason does not behave as documented, and so on), time is running out, servers and sites are down, customers are calling, everyone is on edge, you try to patch things up quick and dirty with duct tape, and eventually the thing starts up on crutches and limps along. You tell yourself that later, at your leisure, you will sort it all out and redo it properly, but there is nothing more permanent than the temporary.

Now, here is how it goes in the nice version, with failover:

An error occurs
The error is detected automatically
An alert is sent
Traffic is switched to one of the backup servers
Calmly and without panic, the problem is investigated and fixed, and the server is put back into operation

This scheme can of course hit snags of its own, but it is linear, each stage is simple and, most importantly, each can be debugged separately, so the chance of it failing is much lower, and all the actions can be automated and executed quickly (unlike the task of finding and fixing some unknown epic mess). Your plane lands in a faraway country, you turn on your phone and see a Telegram notification that the server went down, but everything is fine: the backup server has been activated, you can continue your trip, and you do not need to fly back or make repairs over SSH from the nearest cafe with WiFi. You have to admit, that is much more convenient.

The future is here!

Previously, the main thing that often made failover an unacceptable solution was its total cost. You either had to buy expensive hardware (and hire even more expensive specialists), or cobble together something complicated from how-to guides (I even came across a setup where two servers were additionally connected with a null modem cable carrying a heartbeat, so that at the right moment the backup server would notice and take over). Now there are easier, free ways. If you have a site with cats, you have no excuse for not having implemented failover for it yet!

Besides, a failover scheme also needs another server (and maybe more than one). That used to be a big expense; now you can rent a VDS for pennies.

The most reliable site with cats

For a practical illustration of the okerr + dynamic DNS solution, we launched our own website with cats, cat.okerr.com. We hate cats, so there are almost none there. There are three sites in total; each looks roughly the same (all built on one template) but shows different kittens so they are easy to tell apart, and each prints technical information so you can see how failover works. The page refreshes once a minute, but you can always click reload in the browser.

The technical information includes the line “status = OK”. From time to time the servers simulate problems and write status = ERR. The main server “goes down” at minute 20 of every hour (0:20, 1:20, 2:20, ...), the backup server at minute 40. The last server (the “sorry” server) always works. At minute 0 of every hour the primary and backup servers are “restored”.

If you open the site and leave it in a tab, you will see that it never goes down (although each individual server periodically simulates a problem); when a server has a problem, the site simply moves between the live servers. The picture, the server's name and address, and its role will change. Sometimes you can catch the moment when status = ERR (the problem is already there, but the failover scheme has not yet kicked in), but the next refresh will show you the page from a working server.
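The schedule above can be written down as a small function. This is only a sketch of the timetable as described, not the code running on the demo servers; the role labels "primary", "backup" and "sorry" are names chosen for illustration.

```python
def simulated_status(role: str, minute: int) -> str:
    """Return 'OK' or 'ERR' for a server role at a given minute of the hour (0-59)."""
    # Minute of the hour at which each role starts "failing".
    # None means the role never simulates a failure.
    # The role names are illustrative labels, not okerr identifiers.
    fail_from = {"primary": 20, "backup": 40, "sorry": None}
    start = fail_from[role]
    # Everything is "restored" at minute 0, so a role is broken
    # only from its start minute until the end of the hour.
    if start is not None and minute >= start:
        return "ERR"
    return "OK"


if __name__ == "__main__":
    for minute in (0, 19, 20, 39, 40, 59):
        row = {role: simulated_status(role, minute) for role in ("primary", "backup", "sorry")}
        print(minute, row)
```

Between minute 40 and the end of the hour both the primary and the backup report ERR, which is exactly the window in which the “sorry” server takes the traffic.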

Failover on okerr + dynamic DNS

Let's see how it works under the hood. The task of failover is to ensure that the name cat.okerr.com always points to the IP address of a working server.
For each of the servers hosting our cat site, okerr has an indicator that checks its status once a minute.

On this screen we see how the cat.okerr.com site is checked from the server alpha.okerr.com. The page must contain status = OK, and as we see at the top, the status of the indicator is currently OK. When the server “breaks down”, it will be ERR. (This is just one example of an indicator. okerr is a monitoring system, so you can plug in any type of indicator, for example a check of free disk space or of the number of new orders in the database, and even logical indicators, for example one error criterion at night and another during the day.)
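The kind of check this indicator performs can be sketched in a few lines. This is not okerr's own sensor code, just a minimal standalone equivalent using the Python standard library; the function names and the 10-second timeout are assumptions made for the example.

```python
import urllib.request


def page_status(body: str, marker: str = "status = OK") -> str:
    """Return 'OK' if the page text contains the marker, otherwise 'ERR'."""
    return "OK" if marker in body else "ERR"


def check_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and evaluate it; an unreachable server counts as 'ERR'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return "ERR"
    return page_status(body)


if __name__ == "__main__":
    print(check_url("https://cat.okerr.com/"))
```

Treating network errors the same way as a missing marker matches the rule described for the scheme: a server is bad either when the page lacks “status = OK” or when it is simply unavailable.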

In the project settings, we created a failover scheme with these indicators:

The scheme contains three indicators (three servers) with different priorities. The main server for the site is charlie; if it does not work (the page lacks “status = OK” or the server is simply unavailable), bravo is used, and as a last resort alpha. The right side of the page shows the status of the DNS records on different servers.

For those who noticed that the name cat.he.okerr.com is used: our scheme is slightly more complicated. Instead of changing the cat.okerr.com DNS record directly, we change cat.he.okerr.com (hosted at the dynamic DNS provider Hurricane Electric), while cat.okerr.com is a CNAME (alias) that never changes and always points to cat.he.okerr.com. We simply like Hurricane Electric more as a dynamic DNS provider, and it offers keys that manage a single record (rather than the entire zone), which we think is safer. This way the password or key you give okerr need not control the entire domain, only a subdomain or a single record.
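The priority rule can be illustrated with a short function: walk the servers in priority order and take the first one whose indicator is OK. This is a sketch of the idea, not okerr's implementation; the IP addresses are placeholders from the documentation range 192.0.2.0/24, not the real addresses of the demo servers.

```python
from typing import Optional


def pick_target(servers: list[tuple[str, str, str]]) -> Optional[str]:
    """Return the address of the highest-priority server whose status is 'OK'.

    `servers` holds (name, address, status) tuples, highest priority first.
    Returns None if no server is healthy.
    """
    for _name, address, status in servers:
        if status == "OK":
            return address
    return None


if __name__ == "__main__":
    # Placeholder addresses, listed in the priority order charlie > bravo > alpha.
    scheme = [
        ("charlie", "192.0.2.10", "ERR"),
        ("bravo", "192.0.2.20", "OK"),
        ("alpha", "192.0.2.30", "OK"),
    ]
    print(pick_target(scheme))  # charlie is down, so bravo's address is chosen
```

Because the result depends only on the first healthy entry, a failure of bravo while charlie is healthy does not change the chosen address, so no DNS update is needed in that case.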
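For illustration, here is roughly what a per-record dynamic DNS update might look like when done by hand. This is a sketch under assumptions: to my understanding Hurricane Electric exposes a dyndns-style endpoint at https://dyn.dns.he.net/nic/update that accepts hostname, password (the per-record key) and myip, but check the provider's documentation before relying on it. The key and the IP address below are placeholders, and okerr sends the update for you, so you would not normally write this yourself.

```python
import urllib.parse
import urllib.request

# Assumed endpoint of the provider's dyndns-style update interface.
HE_UPDATE_URL = "https://dyn.dns.he.net/nic/update"


def build_update_request(hostname: str, record_key: str, ip: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that sets `hostname` to `ip`."""
    form = urllib.parse.urlencode(
        {"hostname": hostname, "password": record_key, "myip": ip}
    ).encode("ascii")
    # Supplying `data` makes urllib issue a POST.
    return urllib.request.Request(HE_UPDATE_URL, data=form)


if __name__ == "__main__":
    # Placeholder key and documentation-range IP address.
    req = build_update_request("cat.he.okerr.com", "record-key", "192.0.2.20")
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Note that the key travels with a single hostname, which is the safety property mentioned above: leaking it exposes one record, not the whole zone.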

From failure to recovery

Step by step, this is how the scheme works:

1. A (simulated) problem appears on the server.
2. Once a minute the okerr sensor checks the status of each server and reports it to the project's main okerr server.
3. The indicator for the affected server changes state from OK to ERR.
4. When an indicator changes state, the failover is recalculated: okerr works out which address should be set, if any change is needed at all (for example, if the main server is working and the spare server dies in the meantime, nothing changes).
5. This address is reported to the dynamic DNS service. At the end of this step you will see the status “synced” on the right.
6. Very soon (within seconds) the record reaches the DNS servers of your domain (for the cat site these are ns1-ns5.he.net).
7. From this point on, some users already land on the new live server. But not all DNS servers in the world have updated the record yet, and the old record can still be cached somewhere. You can watch the data on the public DNS servers “dance”, showing now the new value, now the old one. If you reload the failover settings page, okerr automatically requests fresh data from the DNS servers.
8. Once the data has stabilized and the old cached record has expired everywhere, 100% of the requests go to the new server.

To speed up step 7 (often the longest), set the TTL of the dynamic DNS record as low as possible. Services usually allow values of 90-120 seconds, which is quite a reasonable compromise.
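A rough estimate of the worst-case switchover time follows from the numbers above: a failure may go unnoticed for up to one check period, the update takes a few seconds to reach the authoritative servers, and a resolver that cached the old record just before the change keeps it for up to one TTL. The sketch below just adds these up; the 5-second sync delay is an assumed figure standing in for “seconds”, and the estimate assumes resolvers honour the TTL.

```python
def worst_case_switch_seconds(check_period: int, ttl: int, sync_delay: int = 5) -> int:
    """Upper bound on the time from failure until all clients see the new address.

    check_period: seconds between indicator checks (60 in the example scheme)
    ttl:          TTL of the dynamic DNS record, in seconds
    sync_delay:   assumed time for the update to reach the authoritative servers
    """
    return check_period + sync_delay + ttl


if __name__ == "__main__":
    for ttl in (90, 120, 3600):
        total = worst_case_switch_seconds(check_period=60, ttl=ttl)
        print(f"TTL {ttl:>4} s -> up to {total} s ({total / 60:.1f} min)")
```

With a one-minute check and a TTL of 90-120 seconds the bound is about three minutes, while a typical one-hour TTL would stretch it to over an hour, which is why the TTL matters more than any other setting here.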

Additionally

All this can be set up in a single evening (if you already have a backup server). Both okerr and the dynamic DNS services are free. To get more checks in okerr and a shorter check period, you need to complete the training (from the profile page). Once you pass, your level rises immediately (20 indicators with a one-hour period plus 1 fast one with a 10-minute period). If that is not enough, write to support@okerr.com; most likely the limits can be raised (so far this has always been possible: I have never refused, and on the contrary have offered it myself). I do not want to promise everything to everyone up front, because I am not sure there will be enough capacity to keep my word. But so far there are few users, so raising the limits is not a problem.

For what okerr can do in general, see the presentation on the site. Broadly, it is monitoring (Zabbix from the cloud), and failover is a nice extra feature. You can also try the demo from the site without registering.

When the status of an indicator changes, a notification is sent by email or Telegram. (We watched what was happening and concluded that Telegram seems to be the most reliable messenger. Thanks to the RKN for the stress test!) If okerr is configured properly, every notification is either a “drop everything, this needs fixing!” signal or an “all clear!”. okerr should not send any unnecessary alerts (if it does, the configuration needs changing). For example, on our cat site the alpha server is the last resort and never simulates an error. If it goes down, we need to know. The other servers, however, simulate errors constantly, so to avoid receiving alerts several times an hour their indicators are marked “silent”.

It also makes sense to set up a sorry server (on the cheapest hosting you can find), which will either serve your apology page (in case all the main and backup servers are down) or redirect to a status page on okerr (for example, ours at cp.okerr.com/status/okerr) or on statuspage.io.

Source: https://habr.com/ru/post/359372/