For about three years I have been involved in integrating Infotex products. During this time I became closely acquainted with most of them, and on the whole I believe they deservedly became as widespread in Russia as they are. Their main advantages are FSB and FSTEC certificates, a wide product range that includes both software and hardware-software solutions, easy and convenient scaling and network administration, good technical support, convenient licensing, simple installation and configuration, and, of course, the price compared to competing products. There are disadvantages too, of course, but who has none? In my opinion, however, the most unsuccessful product in the entire line is the fault-tolerant ViPNet Failover cluster, and below I will explain why.
According to the documentation, the hot backup cluster mode is designed so that, if one server running the ViPNet software fails, another server takes over its functions on the fly. The hot backup cluster consists of two interconnected computers, one of which (the active one) works as a ViPNet server (coordinator), while the other (the passive one) is in standby mode. In case of failures critical to the operation of the ViPNet software on the active server (primarily failures of the network or the network equipment), the passive server switches to active mode, taking over the load and acting as coordinator in place of the server that detected the failure. When operating in hot backup cluster mode, the failure-protection system also performs the functions of single-server mode, that is, it makes sure that the basic services included in the ViPNet Coordinator Linux software are always running.
I should note that ViPNet Coordinator Linux itself is not bad; it is the hot backup scheme that spoils everything.
From the point of view of the other computers on the network, the whole cluster has a single IP address on each of its network interfaces. This address belongs to whichever server is currently in active mode. A server in passive mode has a different IP address, which other computers do not use to communicate with the cluster. Unlike the active-mode addresses, in passive mode each server has its own address on each interface, and these addresses do not match between the two servers.
The failover.ini config contains 3 types of parameter sections. The [channel] section describes the IP addresses of the reserved interfaces:
```
[channel]
device= eth1
activeip= 192.168.10.50
passiveip= 192.168.10.51
testip= 192.168.10.1
ident= if-1
checkonlyidle= yes
```
The [network] section describes the backup options:
```
[network]
checktime= 10
timeout= 2
activeretries= 3
channelretries= 3
synctime= 5
fastdown= yes
```
The [sendconfig] section specifies the network interface through which cluster operation is synchronized with the second server:
```
[sendconfig]
device= eth0
```
Once again, here is the failover algorithm as described in the documentation:
The algorithm on the active server is as follows. Every checktime seconds, the health of each interface listed in the configuration is checked. If the checkonlyidle parameter is set to yes, the incoming and outgoing network traffic passing through the interface is analyzed: if the difference in packet counts between the beginning and the end of the interval is positive, the interface is considered to be working normally and its failure counter is reset. If not a single packet has been sent or received during the interval, an additional check kicks in, which consists of sending echo requests to the nearest routers. If checkonlyidle is set to no, the additional check is used instead of the main one, that is, packets are sent to the testip addresses every checktime seconds. Replies are then expected within timeout seconds. If no reply comes from the testip address of an interface, that interface's failure counter is incremented by one. If the failure counter is non-zero on at least one interface, new packets are immediately sent to all testip addresses and replies are again awaited for timeout seconds. If, during these new probes, a reply arrives on an interface whose failure counter is non-zero, that counter is reset. If, after any round of probes, the failure counters on all interfaces become zero, the algorithm returns to the main loop and waits checktime seconds again, and so on. If, after some number of probes, the failure counter of at least one interface reaches channelretries, a complete interface failure is declared and the system restarts. Thus, the maximum time between an interface failing and the failure-protection system concluding that it has failed is checktime + (timeout * channelretries).
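To make the description above easier to follow, here is a minimal Python sketch of how I understand the active-server check cycle. It is not the actual implementation: traffic_seen, probe and reboot are hypothetical callbacks standing in for the real traffic counters, the echo requests to testip, and the node restart.

```python
import time

def active_loop(channels, traffic_seen, probe, reboot,
                checktime=10, timeout=2, channelretries=3,
                checkonlyidle=True):
    """Sketch of the active-server check cycle (hypothetical helpers).

    channels:     list of (device, testip) pairs from the [channel] sections
    traffic_seen: callable(device) -> True if packets passed the interface
                  since the last check
    probe:        callable(testip, timeout) -> True if an echo reply arrived
    reboot:       callable() that restarts the node
    """
    fail = {dev: 0 for dev, _ in channels}
    while True:
        time.sleep(checktime)                 # main loop: wait checktime seconds
        for dev, testip in channels:
            if checkonlyidle and traffic_seen(dev):
                fail[dev] = 0                 # live traffic, interface is fine
            elif probe(testip, timeout):
                fail[dev] = 0                 # echo reply from the test address
            else:
                fail[dev] += 1                # no traffic and no reply
        # While any counter is non-zero, keep probing without waiting checktime
        while any(fail.values()):
            for dev, testip in channels:
                if fail[dev] == 0:
                    continue
                if probe(testip, timeout):
                    fail[dev] = 0
                else:
                    fail[dev] += 1
                    if fail[dev] >= channelretries:
                        reboot()              # interface declared dead
                        return
        # worst case before the reboot: checktime + timeout * channelretries
```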
On the passive server the algorithm is slightly different. Once every checktime seconds, the entries for all activeip addresses are deleted from the system ARP table. Then UDP requests are sent from all interfaces to the activeip addresses; as a result, the system first sends an ARP request and only sends the UDP request if it receives an ARP reply. After the timeout interval expires, the ARP table is checked for the presence of an entry for each activeip, and from this a conclusion is drawn about whether the corresponding interface on the active server is working. If no response was received on any interface, the failure counter (there is a single counter for all interfaces) is incremented. If a response was received on at least one interface, the failure counter is reset. If the failure counter reaches activeretries, the server switches to active mode. The maximum time that passes from the moment the active server starts rebooting until the passive server discovers this is checktime + (timeout * activeretries).
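And a similar sketch for the passive side. The ARP handling below uses the ip neigh command, and the throwaway UDP datagram exists only to force ARP resolution; the port number and the become_active callback are my own placeholders, not anything taken from the product.

```python
import socket
import subprocess
import time

def flush_arp(ip, dev):
    """Remove the ARP entry for ip on interface dev (ignore errors if absent)."""
    subprocess.run(['ip', 'neigh', 'del', ip, 'dev', dev],
                   stderr=subprocess.DEVNULL)

def arp_resolved(ip):
    """True if the system ARP table now holds a resolved entry for ip."""
    out = subprocess.run(['ip', 'neigh', 'show', ip],
                         capture_output=True, text=True).stdout
    return 'lladdr' in out

def poke(ip, port=9):
    """Send a throwaway UDP datagram; the kernel issues an ARP request first."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(b'ping', (ip, port))

def passive_loop(channels, become_active,
                 checktime=10, timeout=2, activeretries=3):
    """channels: list of (activeip, device) pairs; become_active(): take over."""
    failures = 0                              # one counter for all interfaces
    while True:
        time.sleep(checktime)
        for ip, dev in channels:
            flush_arp(ip, dev)
            poke(ip)
        time.sleep(timeout)                   # give the active server time to answer
        if any(arp_resolved(ip) for ip, _ in channels):
            failures = 0                      # at least one interface answered
        else:
            failures += 1
        if failures >= activeretries:
            become_active()                   # the active server is gone: take over
            return
```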
The total system downtime during a failure may be slightly longer than checktime * 2 + timeout * (channelretries + activeretries). This is because when the failed server starts rebooting, the system does not take its interfaces down immediately, but only some time after the other subsystems have stopped. So, for example, if two interfaces are being checked and only one of them has failed, the address of the second interface will remain reachable for a while, and during that time the passive server will keep getting answers from it. Typically the time from the start of the reboot until the interfaces go down does not exceed 30 seconds, but it depends heavily on the speed of the computer and the number of services running on it.
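With the sample values from failover.ini above, this works out as follows (simple arithmetic, not measured figures):

```python
# Values from the sample config above
checktime, timeout = 10, 2
channelretries, activeretries = 3, 3

active_detects_failure = checktime + timeout * channelretries   # 16 s, then the active node reboots
passive_detects_reboot = checktime + timeout * activeretries    # 16 s, then the passive node takes over
total_downtime = checktime * 2 + timeout * (channelretries + activeretries)
print(total_downtime)  # 32 s, plus up to ~30 s while the failed node shuts its interfaces down
```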
At first glance everything looks right: as soon as the active server has a problem, it reboots and the passive one takes its place. But what do we get in practice?
- You cannot simply connect a protected resource (a 1C server, for example) to the Failover through an unmanaged switch. Or rather, you can, of course, but then you will have to specify the address of the protected resource as a test IP. As a result, when the 1C server is restarted, say for scheduled maintenance or a software update, the Failover will go down with it. This is not a problem if protecting access to that one resource is the Failover's only job, but in most cases it is not.
- The Failover's fault tolerance depends on the fault tolerance of every one of the test addresses, and it gets worse with each additional network interface. In my practice there was a case when a data-center employee accidentally unplugged the switch that served as a test node; as a result, the cluster went into an endless reboot loop, local machines with ViPNet clients could not connect to the domain, and their work was paralyzed until we were allowed back into the data center.
It turns out that the Failover's fault tolerance is achievable only in a spherical vacuum, where the entire network infrastructure works without failures. Someone may object that this is a necessary trade-off: if something happens to one of the servers, for example a network interface dies, that server will reboot, its operation will be restored and the backup algorithm will prove its usefulness. In practice, however, there was a case when one of the Failover's interfaces really did start failing, the node began rebooting according to the scheme described above, and the problem did not go away after the reboot. To get rid of the glitch we had to cut the power manually, so in that case the fault tolerance existed only on paper.
All this could be fixed if the algorithm included one more condition: the active server should not restart if the passive server is down. It seems a simple condition that would provide real fault tolerance, yet the developers never added it. I can only assume this is because it would require changes to the kernel and re-certification.

With a healthy network environment and healthy server hardware, the uptime of the servers was practically continuous. One of the first Failovers I installed has been running for 3 years, and in all that time the hot backup scheme was exercised only during software upgrades or general maintenance work in the data center. In other words, the real benefit of such a scheme only shows up when the hardware of one of the cluster nodes fails, which in my practice has not happened yet.
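For clarity, the extra condition I described above is tiny. A hypothetical guard around the reboot decision could look like this, with probe_passive standing in for any reachability check over the [sendconfig] link:

```python
def handle_interface_failure(dev, probe_passive, reboot, log):
    """Hypothetical guard: only reboot to hand over to the passive node
    if that node is actually reachable over the sync interface."""
    if probe_passive():
        log(f'{dev} failed, passive node is alive: rebooting to hand over')
        reboot()
    else:
        log(f'{dev} failed, but the passive node is unreachable: staying up')
```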
There was hope for ViPNet Cluster Windows, but Infotex unfortunately never brought it to completion, which is a pity, because its backup scheme looked very promising.
In general, my advice is this: if you need a fault-tolerant crypto gateway for real and not just to tick a box, it is better not to bother with the Failover and to use the ordinary ViPNet Coordinator Linux instead. On its own it is reliable enough, especially if you don't touch it ;)