
Building a budget fault-tolerant 24x7 online service: our experience

The problem



So, we have a commercial online service whose customers are companies that rely on it around the clock. Our task is to keep customers happy while keeping our internal hardware and software failures invisible to them. The client should never have to know that our RAID controller has burned out, or that the system administrator lives in Thailand and is not in the habit of getting up early.

What we wanted to achieve


First, by “online service” we mean something broader than an informational web site. For example, the solution described in this article was implemented for an SMS messaging project, where a client may be a bank sending SMS notifications with payment passwords through our service's XML API.
The service includes online billing; the data in the database is constantly updated and must stay intact under any circumstances. 24x7 availability is critical.

The tasks facing us are:

  1. protect against the risk of hardware failure
  2. protect against the risks associated with software failure (for example, a database crash or an “inexplicable hang” - anything you like)
  3. protect against the risks of the Internet or power disappearing at the data center
  4. after an emergency, the service must remain available to customers on the same domain name, 2kengu.ru
  5. we want to sleep peacefully - after an accident everything recovers by itself, with total downtime for recovery within 15 minutes
  6. we want the solution to be inexpensive - say, no more than 8,000 rubles per month, without purchasing additional equipment.


Standard simple solutions


In this article I will not describe in detail all the pros and cons of solutions such as a “good” RAID with an array of “quality” HDDs, a cold-spare server, a cluster, and so on - that would take too long.

I’ll dwell on just a couple of points:
  1. My example with a burnt RAID controller is not made up. In my practice there really was a case when a controller installed in a server from a solid two-letter brand burned out and wrote errors to all 5 HDDs connected to it. Any “piece of hardware” designed to provide reliability can itself “break” - be it a router, a Cisco box, a third-party server controller or anything else: a reliability link removes some risks and adds others.
  2. It also happens that all your effort building a fault-tolerant system is nullified by problems on the data center side. The power simply goes out, and the smart router that is supposed to decide when to bring the backup server online and switch IP addresses is cut off from the world.


Your humble servant does not claim to have “discovered America” here. The solution we implemented for our SMS-campaign project is based on several well-known ideas.

However, drawing on many years of experience, including at large IT companies that lead the Russian mobile services market, I can say that an elegant and effective solution like this one, harmoniously combining several ideas, is a great rarity. Often everything just runs on luck (“the server we bought was expensive - it won't break, and in a year we'll buy a new one”). In the best case, a “warm” standby server is set up, with manual switchover.

The recipe. Everything ingenious is simple


  1. We take two servers. These can be VPS/VDS or physical servers - it doesn't matter. At any given time the service runs on only one of them, i.e. this is not a cluster.
  2. We place the servers at different sites. This is a very important point: as practice shows, problems in a data center (failure of a trunk router, power outages, even fire) can occur almost more often than problems with your own hardware and software.
  3. We install the business software and DBMS on each server and synchronize the databases. We configure database replication from one server to the other and set up daily synchronization of the software and scripts (a sketch follows the list).
  4. We synchronize server time.
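
As a rough illustration, points 3 and 4 can boil down to a daily cron job like the one below. The peer hostname, paths and the choice of rsync/ntpdate are placeholders for the sketch, not a prescription:

    #!/bin/bash
    # daily_sync.sh - run once a day from cron on the slave.
    # The peer hostname, paths and the rsync/ntpdate choice are placeholders.
    set -euo pipefail

    PEER="peer.example.net"                    # the other server (placeholder)

    # Mirror the application code and the failover scripts from the peer.
    rsync -az --delete "deploy@${PEER}:/opt/app/"      /opt/app/
    rsync -az          "deploy@${PEER}:/opt/failover/" /opt/failover/

    # Keep the clocks in step - the heartbeat comparison depends on this.
    ntpdate -u pool.ntp.org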


So each server can operate in one of two modes: master and slave.
Only the master is ever publicly accessible; the slave, meanwhile, does nothing but synchronize the database.

The most interesting part


The entire solution is built on bash scripts, and the scripts on both servers are exactly the same.
Besides our two servers we need a third component: DNS hosting from any trustworthy provider.
In our project we used Yandex's DNS hosting.

DNS hosting is used for two things:
  1. storing the current IP address of the master server, to which clients accessing the service by its domain name should be directed
  2. storing a timestamp. The timestamp is simply a TXT record whose name is a conventional variable name known to your scripts and whose value is the actual timestamp.


Yandex provides its users with a wonderful thing - an API for working with DNS records. Using this API, our servers can read and write DNS records.
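
A minimal sketch of such helpers, built on the Yandex PDD API (the token, domain and record IDs below are placeholders you would take from your own account, and the exact response format should be checked against the provider's documentation):

    # dns_api.sh - helpers for reading and writing DNS records.
    # The PddToken header and the /api2/admin/dns/list and /dns/edit
    # endpoints follow the Yandex PDD API; all values are placeholders.

    PDD_TOKEN="your-pdd-token"   # issued in your Yandex account (placeholder)
    DOMAIN="2kengu.ru"
    A_RECORD_ID="111111"         # ID of the A record holding the master IP (placeholder)
    TXT_RECORD_ID="222222"       # ID of the TXT record used as the heartbeat (placeholder)

    # Print the content of the record with the given ID (requires curl and jq).
    dns_get() {  # $1 = record_id
        curl -s -H "PddToken: ${PDD_TOKEN}" \
            "https://pddimp.yandex.ru/api2/admin/dns/list?domain=${DOMAIN}" \
          | jq -r --arg id "$1" \
              '.records[] | select(.record_id == ($id | tonumber)) | .content'
    }

    # Overwrite the content of the record with the given ID.
    dns_set() {  # $1 = record_id, $2 = new content
        curl -s -X POST -H "PddToken: ${PDD_TOKEN}" \
            -d "domain=${DOMAIN}&record_id=$1&content=$2" \
            "https://pddimp.yandex.ru/api2/admin/dns/edit" > /dev/null
    }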

The logic of the server in master mode


On a cron schedule, every 5 minutes the following actions are performed:

  1. The DNS A record is checked: does the current working IP address match this machine's? If it does, the server is still the master and continues to operate by the master algorithm. If the IP address has changed, this server is no longer the master and must perform the switch to slave mode.
  2. The next timestamp - the “heartbeat” - is written: the current server time is recorded in the DNS TXT record (a sketch of both steps follows).
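
In bash this check might look roughly as follows; MY_IP, the paths and the mode flag are illustrative, and the script builds on the dns_get/dns_set helpers sketched above:

    #!/bin/bash
    # master_check.sh - run by cron every 5 minutes while this server is the master.
    set -euo pipefail
    source /opt/failover/dns_api.sh            # the helpers sketched earlier (illustrative path)

    MY_IP="203.0.113.10"                       # this server's public IP (placeholder)
    [ "$(cat /opt/failover/mode)" = "master" ] || exit 0

    # 1. Is the public A record still pointing at us?
    if [ "$(dns_get "${A_RECORD_ID}")" != "${MY_IP}" ]; then
        # The slave took over during an outage - step down.
        exec /opt/failover/master_to_slave.sh
    fi

    # 2. Still the master: publish the next heartbeat.
    dns_set "${TXT_RECORD_ID}" "heartbeat=$(date +%s)"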


The procedure for switching from MASTER mode to SLAVE mode


This procedure is performed on the master server when it “understands” that it is no longer the master. This can happen after an accident: the slave sees that the master is unavailable and itself becomes the master. Once the accident is over (say, power is restored in the data center), the “old” master comes back up. Our task is to prevent the two servers from ever working in combat master mode simultaneously, since that would lead to out-of-sync data. For this, as soon as an accident occurs, the slave writes its own IP address to DNS, which, on the one hand, redirects clients to the new IP and, on the other, serves as a flag to the “old” master that it is no longer the master.

When switching to slave mode, the former master performs the following actions (a sketch follows the list):


  1. stops the web server that performs the main business functions
  2. launches another web server that works as a proxy, forwarding all client requests to the new master server. Such requests can arrive if clients have not yet refreshed their DNS caches and are still hitting the “old” IP address
  3. copies a full database dump from the new master to itself. This is necessary, on the one hand, because many updates have accumulated while the old master was unavailable, and on the other, because the accident itself could have corrupted the database
  4. after copying the dump, starts database replication in slave mode
  5. starts the server logic in slave mode.
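
A rough sketch of this procedure. It assumes apache2 serves the business logic, nginx acts as the proxy, and the DBMS is MySQL - the article prescribes none of these products, so treat every name, path and credential as a placeholder:

    #!/bin/bash
    # master_to_slave.sh - demote this server to slave mode.
    # Assumptions: apache2 = business logic, nginx = proxy, MySQL = DBMS;
    # every name, path and credential is a placeholder.
    set -euo pipefail
    source /opt/failover/dns_api.sh

    NEW_MASTER_IP=$(dns_get "${A_RECORD_ID}")

    # 1. Stop the web server with the main business functions.
    service apache2 stop

    # 2. Launch the proxy for clients whose DNS caches still point here.
    cat > /etc/nginx/conf.d/failover-proxy.conf <<EOF
    server {
        listen 80;
        location / {
            proxy_pass http://${NEW_MASTER_IP};
        }
    }
    EOF
    service nginx start

    # 3. Pull a full dump from the new master; --master-data embeds the
    #    replication coordinates in the dump.
    ssh backup@"${NEW_MASTER_IP}" \
        "mysqldump --master-data --single-transaction appdb" | mysql appdb

    # 4. Start replicating from the new master.
    mysql -e "CHANGE MASTER TO MASTER_HOST='${NEW_MASTER_IP}',
              MASTER_USER='repl', MASTER_PASSWORD='repl_password';
              START SLAVE;"

    # 5. Hand the cron logic over to slave mode.
    echo slave > /opt/failover/mode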


The logic of the server in slave mode


On a cron schedule, every 5 minutes the following actions are performed (a sketch follows the list):

  1. The heartbeat timestamp is read from the shared DNS TXT record.
  2. If the master keeps updating it, all is well: the slave goes on synchronizing the database.
  3. If the timestamp has not been refreshed for too long, the master is considered dead, and the slave performs the switch to master mode.
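A sketch of this check. The 15-minute staleness threshold is an assumption (the article only says the slave notices a missing timestamp); comparing against the local clock is safe because server time is synchronized in step 4 of the recipe:

    #!/bin/bash
    # slave_check.sh - run by cron every 5 minutes while this server is the slave.
    set -euo pipefail
    source /opt/failover/dns_api.sh

    MAX_AGE=$((15 * 60))                       # staleness threshold: an assumption
    [ "$(cat /opt/failover/mode)" = "slave" ] || exit 0

    LAST_BEAT=$(dns_get "${TXT_RECORD_ID}" | sed 's/^heartbeat=//')
    NOW=$(date +%s)

    # Server clocks are synchronized (step 4 of the recipe), so local time can
    # be compared with the master's heartbeat directly.
    if [ $((NOW - LAST_BEAT)) -gt "${MAX_AGE}" ]; then
        exec /opt/failover/slave_to_master.sh
    fi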

The procedure for switching from SLAVE mode to MASTER mode


As already mentioned, this procedure is launched on the slave server when it “understands” that it has not received the next timestamp from the master via the shared DNS storage.

In this case you need to:
  1. update the DNS A record, writing your own IP address there to redirect users to the new server (only the slave that is becoming the master does this)
  2. switch the DBMS to master mode and prepare a dump so that it can be downloaded by the new slave when it comes back up (see the switch procedure for the slave)
  3. stop the web server that performs the proxy function of forwarding all client requests to the master
  4. start the web server with the main business logic (a sketch follows the list).
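
And a sketch of the promotion, under the same illustrative assumptions as before (apache2, nginx, MySQL, placeholder paths and addresses):

    #!/bin/bash
    # slave_to_master.sh - promote this server to master mode.
    # Same illustrative assumptions as above (apache2, nginx, MySQL).
    set -euo pipefail
    source /opt/failover/dns_api.sh

    MY_IP="203.0.113.20"                       # this server's public IP (placeholder)

    # 1. Point the domain at this server. This is also the "you are no longer
    #    the master" flag for the old master, should it come back up.
    dns_set "${A_RECORD_ID}" "${MY_IP}"

    # 2. Stop replicating and accept writes; the new slave will later pull its
    #    dump from us (see the master-to-slave procedure).
    mysql -e "STOP SLAVE;"

    # 3. Stop the proxy web server...
    service nginx stop
    rm -f /etc/nginx/conf.d/failover-proxy.conf

    # 4. ...and start the web server with the main business logic.
    service apache2 start

    # Publish the first heartbeat and hand the cron logic over to master mode.
    dns_set "${TXT_RECORD_ID}" "heartbeat=$(date +%s)"
    echo master > /opt/failover/mode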


Weak spots


And what if Yandex's DNS hosting breaks?


Well, first, we believe that Yandex cares a great deal about the stability of its services and, unlike us, spares neither money nor effort on it. And as practice shows, unavailability of Yandex's DNS servers really is a rarity.
Second, if DNS is unavailable for a while, you simply don't need to do anything - let the master keep working in master mode. The probability that a Yandex outage will coincide with your own is extremely low.

How to deal with DNS caches that send users to the “dead” server?


First, on Yandex you can configure the record caching time (TTL) - set it to the minimum value (a sketch follows below).
Second, you can tell your tech-savvy customers your backup IP address so that they can reach the service directly by IP if it is unavailable by domain name.
When both servers are “alive”, such users will simply be proxied to the master automatically.
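
For example, lowering the TTL of the A record might go through the same API; the "ttl" parameter and the minimum allowed value here are assumptions - check the provider's documentation for the actual limits:

    # Lower the caching time (TTL) of the A record via the same API; the "ttl"
    # parameter and its minimum value are assumptions.
    curl -s -X POST -H "PddToken: ${PDD_TOKEN}" \
        -d "domain=${DOMAIN}&record_id=${A_RECORD_ID}&ttl=300" \
        "https://pddimp.yandex.ru/api2/admin/dns/edit"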

In conclusion


Naturally, besides the weak points mentioned, one can find others.
Still, this implementation has already proven itself from the best side, more than once saving the technical staff of two banks from emergency work on urgently bringing the service back up.

The bottom line:


Costs: two servers (they can be rented) - to your taste - plus a one-time job of setting up the scripts.
Service downtime in case of an accident: 10 minutes for diagnostics plus 10 minutes for switching, 20 minutes in total - at any time of day or night - and you calmly wake up in the morning, drink your coffee and learn about everything while going through the mail.

Source: https://habr.com/ru/post/170687/

