On the implementation of persistent processes in real-time control systems (part 3)

This is the final part of the article.

4. System services and operating environments
Having implemented a fault-tolerant clustered virtualization environment, we move up a level and turn to the operating environment in which our applications run inside a virtual machine.

There are no fundamental problems here: the major hypervisors can run virtual machines with almost any modern operating system. Since Linux is the most common platform for server tasks, the easiest course is to focus on operating systems of this family.

It may seem natural to install inside the virtual machine the same version of Linux that runs on the host system (for example, SLES or RHEL). The advantages are that only one product's features and update policy have to be tracked, and that a common license can cover the physical server and its virtual machines. However, this approach has a significant disadvantage: SLES and RHEL are distributions oriented much more towards the administrator who operates standard applications than towards the developer, and running programs built with recent versions of development tools in the environment they provide may require significant additional configuration management of system and external packages.

Therefore, in our view, there is little point in pursuing a uniform operating environment across the host and the virtual machine; it is much more convenient to use the Linux distribution you are used to as the VM OS.

Note for the public sector

Good results can be obtained by using the Russian distribution Astra Linux as the VM operating system. It is freely distributed in the “civilian” Common Edition and inexpensive in the “military” Special Edition **, is promptly updated by its developers, satisfies many of the special requirements of state bodies and fully complies with the import substitution policy. Using Astra Linux in a virtual machine therefore offers certain competitive advantages on the Russian market, although, for a number of reasons, we cannot recommend running this system directly on mid-range and high-end physical servers.

** It is probably more accurate now to call the Special Edition simply “protected”, since, as far as we can tell, anyone can now purchase it for their own needs.

Of course, a VM OS is no less prone to faults and failures than the OS of a physical machine. The task of computing platform redundancy, which is solved at the physical level by clustering, is solved at the virtual level by implementing the system functions on several interconnected virtual machines that monitor each other's operation. The task of health monitoring, solved at the physical level by hardware watchdog timers, could be solved at the virtual level in the same way, with a virtual watchdog timer device, but it is much simpler and more flexible to have a monitoring virtual machine issue commands to the cluster to restart the monitored virtual machine (the monitoring should, of course, be mutual). Virtual machine images are easy to save, both as rollback points and for disaster recovery.
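The mutual monitoring described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the article does not name the probe or the cluster command, so `check_peer`, `probe_cmd`, `restart_cmd` and `PROBE_INTERVAL` are hypothetical names, and the actual restart command depends on the cluster manager in use.

```shell
#!/bin/sh
# check_peer PROBE_CMD RESTART_CMD [MAX_FAILURES]
# Hypothetical helper: runs PROBE_CMD up to MAX_FAILURES times (default 3).
# If any probe succeeds, the peer is considered alive and nothing happens.
# If every probe fails, RESTART_CMD is run once to ask the cluster to
# restart the peer VM. Returns 0 if the peer is alive, 1 if a restart
# was requested.
check_peer() {
    probe_cmd=$1
    restart_cmd=$2
    max_failures=${3:-3}
    failures=0
    while [ "$failures" -lt "$max_failures" ]; do
        if sh -c "$probe_cmd" >/dev/null 2>&1; then
            return 0                      # peer answered
        fi
        failures=$((failures + 1))
        sleep "${PROBE_INTERVAL:-1}"      # pause between probes (seconds)
    done
    sh -c "$restart_cmd"                  # all probes failed: request a restart
    return 1
}

# Example use on the monitoring VM (both commands are placeholders):
#   while true; do
#       check_peer 'ping -c 1 -W 2 peer-vm' 'restart-peer-vm'
#       sleep 10
#   done
```

For the monitoring to be mutual, each of the two virtual machines would run such a loop with the other as its peer. Requiring several consecutive failed probes before acting reduces the chance of restarting a VM because of a single lost packet.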

5. Computational processes

Finally, we have reached what all of this was for: the persistent processes themselves in real-time control systems.

So, by implementing the measures described earlier in the article, we have ensured persistence at the levels of external resources, the communication environment that connects to them, hardware and firmware, the host operating system, and system services and operating environments. Only a small matter remains: to make sure that our processes themselves run stably in the stable computing environment provided for them.

The question of an adequate implementation of the control loop's application logic, which is of paramount importance for the stability of control, is beyond the scope of this article. Here we confine ourselves to two aspects of process persistence at the system level: resilience to restarts and resilience to crashes.

The numerous resiliency tools described above restore the availability of the computing environment, but in some cases they can cause individual computational processes to be restarted. Under these conditions it is of paramount importance that each process tolerates its own restart and the restart of the neighbors it interacts with. Such tolerance can be achieved by keeping the computational process free of macro state. As mentioned in Section 2, it is highly undesirable to establish long-lived connections between processes, since they can be broken at any moment when one of the ends handles an abnormal situation. The exchange of control signals between processes should be reduced to short transactions, each of which may fail and must then be either repeated or otherwise handled. In the simplest case, such a transaction amounts to sending a single packet. In addition, each process must periodically save to non-volatile storage (in our case, a virtual disk) enough information to resume its work from the most recent usable checkpoint after its own restart.
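One way to save such a checkpoint is to write it to a temporary file and then rename it over the previous one, so that a restart in the middle of a write never leaves a truncated checkpoint. The sketch below is an illustration, not the author's implementation: the function names `save_checkpoint` and `load_checkpoint` and the file layout are assumptions.

```shell
#!/bin/sh
# save_checkpoint FILE DATA
# Hypothetical helper: writes DATA to a temporary file next to FILE, then
# renames it into place. A rename within one filesystem replaces the old
# file atomically, so a reader sees either the old or the new checkpoint.
save_checkpoint() {
    file=$1
    data=$2
    tmp="$file.tmp.$$"
    printf '%s\n' "$data" > "$tmp" || return 1
    sync                      # ask the OS to flush the data before the rename
    mv -f "$tmp" "$file"
}

# load_checkpoint FILE DEFAULT
# Prints the last saved checkpoint, or DEFAULT if none exists yet.
load_checkpoint() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        printf '%s\n' "$2"
    fi
}

# Example use at process start-up and inside the main loop (paths are
# placeholders):
#   state=$(load_checkpoint /var/lib/myproc/state "step=0")
#   ...
#   save_checkpoint /var/lib/myproc/state "step=17"
```

The temporary file must live on the same filesystem as the checkpoint itself, otherwise the rename degrades to a copy and loses its atomicity.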

Particular attention should be paid to how processes interact with the DBMS. If a project uses a DBMS, the data itself must be organized into meaningful transactions, and the clients' network connections to the DBMS server must be transactional in nature as well. The link between client and server must be able to recover after an abnormal restart of either side. This is most easily achieved by keeping transactions short and wrapping each one in its own network connection, which is opened, used and closed within a short period of time.
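The "one short transaction per connection, repeated on failure" pattern can be sketched as a small retry wrapper. This is a minimal illustration under assumptions: `retry_transaction` is a hypothetical name, and the command it runs stands for any client invocation that opens a connection, executes one transaction and exits with a non-zero status on failure.

```shell
#!/bin/sh
# retry_transaction MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, sleeping DELAY_SECONDS between attempts. Each attempt is a fresh
# invocation, so it opens its own connection and keeps no state from the
# previous one. Returns 0 on success, 1 if every attempt failed.
retry_transaction() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example use (my_db_client and update.sql are placeholders):
#   retry_transaction 5 2 my_db_client --file update.sql
```

Repeating a transaction is only safe if it is atomic on the server side, so that a failed attempt leaves no partial result behind, and if running it twice is harmless when the failure occurred after the commit but before the client received the confirmation.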

Of course, we cannot fully insure ourselves against errors in our own application processes. Hangs and deadlocks are handled by the same VM health monitoring and restart mechanism discussed in the previous section. For crashes, a simple construct can save developers a lot of grief:

while true
do
    my_executable_module
done

which wraps the call to the executable module that implements the logic of the control program.
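In practice it can help to record why the module exited and to pause briefly before restarting it, so that a module that crashes immediately does not spin in a tight loop. The variant below is a sketch along those lines, not part of the original article: the function name `supervise`, its parameters and the log format are assumptions.

```shell
#!/bin/sh
# supervise MAX_RUNS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND again each time it exits, logging the
# exit status to stderr and sleeping DELAY_SECONDS before the next start.
# MAX_RUNS=0 means "restart forever"; a positive value bounds the number
# of runs, which is mainly useful for testing.
supervise() {
    max_runs=$1
    delay=$2
    shift 2
    runs=0
    while [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; do
        status=0
        "$@" || status=$?     # capture the exit status without aborting
        runs=$((runs + 1))
        echo "$(date '+%Y-%m-%d %H:%M:%S') $1 exited with status $status (run $runs)" >&2
        sleep "$delay"
    done
}

# Example use, equivalent to the endless loop with a 1-second pause:
#   supervise 0 1 my_executable_module
```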

In conclusion, I would like to note that even the most careful and error-free implementation of each of the levels considered does not protect the developer from trouble caused by unaccounted-for interactions between them. Bringing a fault-tolerant system up to the required reliability figures can therefore take considerable time, and it requires thorough testing of how every level of the system handles failures at each of them.

Source: https://habr.com/ru/post/303974/
