
On the implementation of persistent processes in real-time control systems (part 1)

Persistence has recently become another fashionable term in information technology. Articles on persistent data appear regularly, dzavalishin is developing an entire persistent operating system, and we too will add to the variety of material with a recent report on persistent processes.

Persistence, in simple terms, means independence from the state of the environment. In our opinion, it is therefore quite legitimate to speak of the persistence of processes: their ability to carry out their functions regardless of the state of the environment that spawned them, including failures at the lower levels. Ensuring this is, generally speaking, one of the most important tasks in the development of real-time automatic control systems.

This article classifies the main levels at which the functions of a fault-tolerant control system are implemented, examines the failures typical of each level, and reviews the specific technical solutions used at each level to ensure persistence.

Depending on how the control system is implemented, its hierarchical model can be organized in various ways.
For example, as follows:
Computational processes
Specialized redundant equipment
Resource Communication Environment
External resources

so:
Computational processes
Clustered system services and operating environments
Host operating system
Hardware and firmware
Resource Communication Environment
External resources

or, theoretically, even so:
Computational processes
Clustered application server
System services and operating system
Hardware and firmware
Resource Communication Environment
External resources

If you confidently feel yourself the father of computational architectures, have an abundance (relative to the functional complexity of the task) of skilled programmers and electronics engineers, and moreover, God forbid, bear considerable legal responsibility for the results of applying your system, then the first of the paths shown is for you: building a redundant hardware-software complex with a specialized architecture. This path has its roots in embedded systems and is a wonderful field for the rising careers of hardware engineers and low-level interface programmers. The author will try to shed more light on this area in one of the following articles (back when we developed transputer systems), but here we confine ourselves to the remark that, unfortunately, this path involves considerable manual effort in implementing high-level functions, and is therefore poorly suited to systems of significant functional complexity.

Let us immediately state our view on the use of an application server in high-availability control systems, illustrated by the third scheme. Despite its outward attractiveness to minds raised on information-system development and inexperienced in automatic control, this approach harbors a number of intractable shortcomings. The goal of the mainstream modern application server is load balancing and increased processing throughput, which traditionally conflicts with minimizing the response time (latency) required of real-time systems. Such application servers are also highly complex and are themselves a vulnerable link in terms of fault tolerance. Finally, the interfaces they provide to their applications are often insufficient for automatic control tasks, which frequently require interaction with hardware, the use of non-standard network protocols, and so on. As a result, although the author knows a number of successful examples of the application-server architecture in information systems, he knows of not a single industrial implementation in the field of automatic control.

Thus, in this article we will focus on the architecture of a cluster of virtual machines, illustrated above by scheme number 2, and consider in more detail its basic levels, moving upwards.

1. External resources

Sometimes novice developers lose sight of the fact that, often, the most vulnerable part of the control loop may be the managed resources themselves or other external objects. This situation is beautifully illustrated by the old joke:

- "I am the smartest!" said Wikipedia.
- "I'll find anything!" said Google.
- "I am everything!" said the Internet.
- "Well, well," said Electricity, and ... blinked.

Taking the anecdote literally: if you have not arranged power input to the facility from two independent power lines, or, say, delivery of diesel fuel to the backup diesel generator with a lead time no worse than its autonomous run time, then all your achievements in server hardware redundancy are, in terms of resiliency, purely cosmetic.

Interpreting this less literally, you should always check whether your magnificent duplicated control loop terminates in a single actuator or single source of a resource, and if so, decide what to do about it.

The most advanced automatic control systems can, if some mechanism of the controlled plant fails, attempt to perform part of its functions using the remaining mechanisms, whose regular duties are different. For example, the terminal control system of a space rocket can compensate for a premature third-stage engine shutdown with additional burn time of the upper stage.
Note
This should not be understood to mean that the rocket's terminal control system contains a special code branch, "upper-stage operation in case of a faulty third stage". Rather, the control loop is simply designed so that the capabilities of the different controlled subsystems overlap, and each of them tries to do the utmost toward the ultimate goal from the situation in which it actually finds itself.


2. Resource communication environment

In addition to the resources themselves, the communication environment between them is of fundamental importance. For us, the most important environments are, first of all, the facility's power supply system and the data network.

When designing the on-site power supply of a high-availability complex, it is necessary to provide at least two physically separate runs of power wiring, connecting critical equipment to each line either by duplicating the equipment or by fitting it with duplicated power supplies able to draw from different circuits. These points seem obvious; nevertheless, the author has seen in real life an automation facility solving important tasks that was fed from two independent electrical substations in such a way that the measuring equipment was powered entirely from one of them, and the computing complex controlling it from the other.

Hot standby of data networks raises a number of problems that attract varying degrees of public attention.

The use of alternative packet transmission routes through backup connections is well supported by conventional intelligent network equipment, except when using non-standard lower-level protocols.

Moving up the stack of protocols, it is necessary to address the issue of using data transfer protocols that are resistant to full or partial failure. Part of this issue is the widely known TCP vs UDP flame.

The advantages of using the TCP protocol in control systems include:
- automatic integrity monitoring;
- arbitrary size of transmitted data.

The advantages of using the UDP protocol in control systems include:
- statelessness;
- the possibility of half-duplex operation;
- quick return from calls *;
- quick diagnosis of problems at the stack level, with an error code returned.
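The last point can be illustrated with a minimal sketch. A *connected* UDP socket lets the IP stack surface an ICMP "port unreachable" error as an error code on the very next call, instead of silently dropping it; the port number below is an arbitrary value assumed to have no listener, and the behavior shown is that of a typical Linux stack.

```python
import socket

def probe_dead_port(port, timeout=1.0):
    """Report how the stack signals a datagram sent to a closed port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    # connect() on UDP sends nothing; it only fixes the peer address
    # and enables asynchronous error reporting on this socket.
    sock.connect(("127.0.0.1", port))
    try:
        sock.send(b"ping")   # datagram goes out; nobody is listening
        sock.recv(1024)      # pending ICMP error surfaces here
        return "data"        # unexpectedly received a real reply
    except ConnectionRefusedError:
        return "refused"     # stack-level error code, near-instant
    except socket.timeout:
        return "timeout"     # ICMP filtered: no quick diagnosis
    finally:
        sock.close()
```

On an unconnected socket the same ICMP message would be discarded, which is why connected UDP sockets are preferable when this quick diagnosis matters.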

Using TCP in real-time systems requires the developer to become familiar with the stack settings, first of all the tcp_keepalive family of parameters. Using UDP requires a clear understanding of the ARP protocol implementation (this relates to the caveat marked * above). Using either protocol calls for a creative command of the receive-buffer size settings.
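As a sketch of what such tuning looks like on a Linux stack, the per-socket keepalive and receive-buffer options can be set as follows; the numeric values are purely illustrative, not recommendations.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable keepalive probes on an otherwise idle connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Start probing after 5 s of silence, probe every 1 s, give up after
# 3 unanswered probes -- a dead peer is then detected in roughly 8 s
# instead of the Linux default of about two hours.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)

# Receive buffer sizing matters for both TCP and UDP: too small and
# bursts of telemetry are dropped, too large and stale data queues up
# in front of the control loop. (Linux caps the value at rmem_max and
# reports back roughly double the requested size.)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 64 * 1024)
```

These options act per socket; the corresponding system-wide defaults live under `net.ipv4.tcp_keepalive_*` and `net.core.rmem_*` in sysctl.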

The statelessness of UDP becomes important when one side of the connection restarts, including a restart on physically different equipment (a backup server).

Separately, we must address the rarely covered half-duplex question. Some common network environments are implemented in such a way that a physical or logical breach of link integrity can leave data flowing from A to B but not from B to A. TCP cannot function under such conditions. UDP can maintain one-way communication over a one-way break (provided the underlying network equipment behaves correctly and ARP is not needed to establish the exchange).

Overall, in the author's opinion, UDP with delivery control or unconditional retransmission organized at the application level is better suited for transferring short control messages over an IP network in a fault-tolerant system. For transferring large volumes of data, TCP coordinated by the control level is appropriate, with connections organized for the short term.
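Application-level delivery control over UDP can be sketched roughly as follows: each message carries a sequence number and is retransmitted until the peer acknowledges it. The address, port, timings, and message format here are all illustrative assumptions, not a production protocol.

```python
import socket
import threading

ADDR = ("127.0.0.1", 50123)  # arbitrary loopback endpoint for the sketch

def echo_ack_server(ready, stop):
    """Acknowledge every datagram by echoing back its sequence number."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind(ADDR)
    srv.settimeout(0.2)          # poll the stop flag periodically
    ready.set()
    while not stop.is_set():
        try:
            data, peer = srv.recvfrom(1024)
        except socket.timeout:
            continue
        seq = data.split(b":", 1)[0]
        srv.sendto(b"ACK:" + seq, peer)
    srv.close()

def send_reliable(payload, seq, retries=5, timeout=0.2):
    """Send until the matching ACK arrives; return the attempts used."""
    cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    cli.settimeout(timeout)
    msg = str(seq).encode() + b":" + payload
    for attempt in range(1, retries + 1):
        cli.sendto(msg, ADDR)
        try:
            reply, _ = cli.recvfrom(1024)
            if reply == b"ACK:" + str(seq).encode():
                cli.close()
                return attempt
        except socket.timeout:
            continue  # lost datagram or lost ACK: retransmit
    cli.close()
    raise TimeoutError("no acknowledgement for seq %d" % seq)
```

A typical exchange starts the server, waits for it to bind, and then sends a control message: `send_reliable(b"setpoint=42", seq=1)`. Because the receiver is idempotent with respect to sequence numbers, a duplicated retransmission costs nothing, which is exactly what makes "unconditional retransmission" a viable simplification for short control messages.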

Continued: part 2

Source: https://habr.com/ru/post/303162/

