Building fault-tolerant systems

In banking software development, this aspect of a system receives the most attention. Fault-tolerant systems are usually described with terms such as fault tolerance, resilience, reliability, stability, and DR (disaster recovery). In essence, this characteristic is the ability of a system to keep working correctly when one or more of the subsystems it depends on fail. I will briefly describe which approaches can be applied in this area and give a couple of examples.

A caveat up front: some things here are specific to Java, but by and large everything described below applies to any platform, which is why the post is filed under this topic.

Where to begin?

First of all, you need to know the architecture of your application well and to cover the entire stack of your system, from software to hardware.

You can also look into fault tree analysis. You draw your components as a dependency tree and then, working from the bottom up, assign each node the cumulative probability that it fails. The resulting diagram clearly shows the most vulnerable spots, the ones that could potentially cause the greatest damage.

The most difficult case is when your system depends on subsystems that are outside your control and are not fault tolerant. First of all, of course, try to convince the owners of these subsystems to provide you with a fault-tolerant solution, thus making it their problem rather than yours. Unfortunately, this is not always possible. In that case you first need to understand in detail how these subsystems work and then build the fault tolerance yourself, for example by connecting to two identical, duplicated instances of a subsystem at the same time (more on that below).
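
The bottom-up probability calculation used in fault tree analysis, mentioned above, can be sketched roughly as follows. This is only an illustration under the assumption that component failures are independent; the names FaultNode, Leaf, AnyOf and AllOf are made up for this example and do not come from any library.

```java
import java.util.List;

// Hypothetical fault-tree model; assumes component failures are independent.
interface FaultNode {
    double failureProbability();
}

// A leaf: a basic component with a known probability of failing.
final class Leaf implements FaultNode {
    private final double p;

    Leaf(double p) { this.p = p; }

    public double failureProbability() { return p; }
}

// The parent fails if ANY child fails (no redundancy between children).
final class AnyOf implements FaultNode {
    private final List<FaultNode> children;

    AnyOf(FaultNode... children) { this.children = List.of(children); }

    public double failureProbability() {
        double allSurvive = 1.0;
        for (FaultNode c : children) {
            allSurvive *= 1.0 - c.failureProbability();
        }
        return 1.0 - allSurvive;
    }
}

// The parent fails only if ALL children fail (children are redundant copies).
final class AllOf implements FaultNode {
    private final List<FaultNode> children;

    AllOf(FaultNode... children) { this.children = List.of(children); }

    public double failureProbability() {
        double allFail = 1.0;
        for (FaultNode c : children) {
            allFail *= c.failureProbability();
        }
        return allFail;
    }
}
```

For instance, new AnyOf(new AllOf(new Leaf(0.1), new Leaf(0.1)), new Leaf(0.05)) models a duplicated database plus a single non-redundant gateway. Comparing the numbers at each node shows which branch contributes most to the overall risk.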

Implementation approaches

There are two fundamentally different approaches, although they can be applied independently or combined to some degree in one system. The first is DR (disaster recovery): a complete copy of your entire system can be brought up in another data center. This method helps in almost any situation, but it can involve a very long period of downtime. I know of one system where switching over takes about an hour. You also need to be very careful here: not only the system configuration but also the hardware in DR must match production.

For example, when we tested DR for one of the systems I worked with, it turned out under heavy load that one of the components was tailored to a single-threaded environment, so its throughput was proportional to the processor clock frequency. The DR machine was a very powerful piece of hardware with a huge number of cores but a low clock frequency, so this component became a bottleneck. Overall the test was very unsuccessful, and the “cooler” hardware clearly did not suit the current configuration of our system.

The second way is to implement fault tolerance in software, for each component and for the interactions between them. There are many techniques for making a system more stable. Let's look at some of them.

Low-level fault-tolerant services

The principle is this: your system should consist of more or less independent subsystems, each of which is fault tolerant. Ideally, do not reinvent the wheel; use ready-made solutions. The best option, in fact, is for your entire system to rely on so-called low-level fault-tolerant services.

Single point of failure

Avoid architectures in which the entire system collapses when one component fails. You can achieve this either through redundancy (see below) or by making the components as independent as possible, so that when one of them fails only part of the functionality stops working while the rest of the system carries on. Of course, the latter is not suitable for the core functionality of your system, but if a problem in some auxiliary function brings down the whole system, that is probably not very good.

Redundancy

Redundancy means that your system has more instances of a component than it strictly needs, so that when one of these redundant instances fails, everything continues to work. Two strategies can be distinguished within this approach: active-active and active-passive.

Active-Active

This means that you work with two identical components at the same time. For example, in a system where a client receives price quotes, the client can subscribe to two different components at once and receive prices from both simultaneously. If one of these components fails, the client will not notice any problem at all, which is the undoubted advantage of this approach. The downsides are that the amount of traffic and the time spent processing it double, and that the additional server infrastructure is constantly running and consuming resources. The approach may also be impossible to implement because of business constraints. For example, a trading system cannot send the same order to two independent exchange connectors at once: if both succeed, two orders appear on the exchange, and that is absolutely unacceptable.
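
On the client side, receiving the same prices from two places means duplicates have to be discarded. Here is a minimal sketch of one way to do that, assuming every quote carries a per-symbol sequence number that both feeds share. That assumption and the QuoteDeduplicator class are mine, purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side merger for two identical price feeds.
// Assumes each quote has a per-symbol sequence number shared by both feeds.
final class QuoteDeduplicator {
    // Highest sequence number already processed, per symbol.
    private final Map<String, Long> lastSeq = new HashMap<>();

    // Returns true if the quote is new and should be processed,
    // false if the other feed has already delivered it (or a newer one).
    synchronized boolean accept(String symbol, long seq) {
        Long last = lastSeq.get(symbol);
        if (last != null && seq <= last) {
            return false;
        }
        lastSeq.put(symbol, seq);
        return true;
    }
}
```

Both feed listeners call accept() before handing a quote to the rest of the application. If one feed dies, the other simply keeps advancing the sequence and the client notices nothing.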

Active-Passive

In this strategy only one component is working at any time; if it fails, a second one starts automatically, restores the state, and takes over all the work. TIBCO EMS, for example, offers such functionality. If you enable this option, the working EMS instance is constantly monitored, and if it fails a second EMS instance is started, which delivers all unread messages and begins accepting new ones. It is very important, though, to understand what you will pay for this. In one of the projects where we enabled this option, the maximum throughput of this component dropped roughly by half.

It is also important to understand that if you do not use a ready-made solution, implementing active-passive yourself will most likely take more effort than active-active: you need a reliable way to verify that the active component is functioning, and you must be able either to restore the state as of the moment of failure or to synchronize it constantly. This solution will also always introduce some delay in operation when the active component fails. On the other hand, the passive component will most likely consume far fewer resources, so you can get a fault-tolerant solution on weaker hardware.
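
To make the two requirements above concrete (detecting that the active component is gone, and keeping its state in sync), here is a deliberately simplified sketch of a passive node. The StandbyNode class is hypothetical, the state is reduced to a single string, and timestamps are passed in explicitly so that the logic is easy to follow. A real implementation would also have to deal with split-brain situations, which this sketch ignores.

```java
// Hypothetical passive node in an active-passive pair.
final class StandbyNode {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;
    private String replicatedState = "";
    private boolean active = false;

    StandbyNode(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    // Called whenever the active node reports that it is alive
    // and ships a copy of its latest state.
    synchronized void onHeartbeat(long nowMillis, String state) {
        if (active) {
            return; // already promoted; a late heartbeat is ignored in this sketch
        }
        lastHeartbeatMillis = nowMillis;
        replicatedState = state;
    }

    // Called periodically; promotes this node once the active one
    // has been silent for longer than the timeout.
    synchronized boolean checkAndPromote(long nowMillis) {
        if (!active && nowMillis - lastHeartbeatMillis > timeoutMillis) {
            active = true;
        }
        return active;
    }

    // The last state received from the active node.
    synchronized String state() {
        return replicatedState;
    }
}
```

The timeout is exactly the delay mentioned above: the shorter it is, the faster the switch, but the higher the risk of promoting the passive node because of a short network hiccup.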

Load Balancing

This term usually comes up when building high-load systems, but the principle also applies to fault-tolerant systems. The idea is that you have several identical components among which the entire load is distributed more or less evenly. Unlike the active-active strategy described above, each task is performed by only one component. This mechanism is ideal for stateless components; otherwise, for fault tolerance you will constantly have to synchronize state, for example through session replication in the case of web servers. With this solution it is very important to have at least N + 1 redundancy: if you need N components running at full capacity to handle peak load, your system should have N + 1 of them. Otherwise, when one component fails, the load on all the others increases and your system falls like a row of dominoes.

The most common example of load balancing is probably web server load balancing, for which there are both software solutions (Nginx) and dedicated hardware. In back-end systems, load balancing is often implemented with a JMS queue that has several identical listeners attached to it. The queue ensures that each message reaches only one of the components, and a component cannot take the next message until it has processed the current one. Incidentally, if you use a topic instead of a queue in the same system, you get an active-active setup.
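
The queue-with-several-listeners idea can be illustrated without a JMS broker. In the sketch below an in-memory BlockingQueue stands in for the JMS queue and plain threads stand in for the listeners. The CompetingConsumers class is my own illustration, not a JMS API.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// In-memory stand-in for a JMS queue with several identical listeners.
final class CompetingConsumers {
    // Every message is taken by exactly one listener; a listener takes
    // the next message only after it has finished the current one.
    static Set<String> run(List<String> messages, int listeners) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(messages);
        Set<String> handled = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(listeners);
        for (int i = 0; i < listeners; i++) {
            final int id = i;
            pool.execute(() -> {
                String msg;
                while ((msg = queue.poll()) != null) {
                    handled.add(msg + " handled by listener-" + id);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled;
    }
}
```

Which listener gets which message is not deterministic, but the number of handled entries always equals the number of messages. If one listener thread dies, the others keep draining the queue, which is the fault-tolerance benefit described above.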

Defensive programming (defensive coding)

To make your application as resilient as possible, you need to think about it not only during design but also while coding. Always ask yourself "what if something goes wrong?" and write code that handles such unforeseen situations. Here are a few examples.

Infinite Loop. Make sure that when an error occurs while processing a message, you do not end up in an infinite loop that keeps retrying the same operation, especially if the message comes from outside and you cannot guarantee that it is correct. I have seen more than once how an entire system hangs because of this error. What should you do in this case? For example, you can send an error message to the monitoring system (see below) and move on to the next task.
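
A rough sketch of this rule: retry a bounded number of times, then report the message and move on. The SafeProcessor class is hypothetical, and the list of reported messages is just a stand-in for a real monitoring system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical message processor that never retries forever.
final class SafeProcessor {
    private final int maxAttempts;
    // Stand-in for a monitoring system that receives failed messages.
    private final List<String> reported = new ArrayList<>();

    SafeProcessor(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Tries the handler at most maxAttempts times. If every attempt fails,
    // the message is reported and the caller can proceed to the next one.
    boolean process(String message, Consumer<String> handler) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    reported.add(message + ": " + e.getMessage());
                }
            }
        }
        return false;
    }

    List<String> reported() {
        return reported;
    }
}
```

The important part is that a malformed message costs a fixed number of attempts and then leaves the processing loop, instead of blocking everything behind it.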

Reconnection. Be sure to write reconnect logic, especially in systems that sit behind a firewall.
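
Reconnect logic is often written with an increasing pause between attempts so that a struggling server is not hammered. Below is a minimal sketch of that idea. The Reconnector class and its parameters are assumptions for the example, and the connect action is whatever your own code uses to open a connection.

```java
import java.util.function.Supplier;

// Hypothetical reconnect helper with exponential backoff.
final class Reconnector {
    // Calls connect until it succeeds, doubling the pause after each failure
    // (capped at maxDelayMillis). Rethrows the last error after maxAttempts failures.
    static <T> T connectWithBackoff(Supplier<T> connect, int maxAttempts,
                                    long initialDelayMillis, long maxDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.get();
            } catch (RuntimeException e) {
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while reconnecting", ie);
                }
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        throw last;
    }
}
```

Whether to give up after a fixed number of attempts or keep trying indefinitely while alerting monitoring depends on the component. The sketch gives up so that the failure becomes visible to the caller.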

Degrade gracefully. Throttling. If your application receives messages from outside, be sure to think about what you will do if far more messages arrive than you can handle. For example, you can ask the sender to re-request, or emulate a slow connection by delaying your responses.
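
One simple way to avoid drowning in incoming messages is a bounded inbox that refuses new messages when it is full, so that the sender can be told to try again later. The ThrottledInbox class below is a hypothetical sketch of that idea and only one of several possible throttling strategies.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical bounded inbox: sheds load instead of growing without limit.
final class ThrottledInbox {
    private final BlockingQueue<String> pending;

    ThrottledInbox(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false immediately when the inbox is full,
    // so the caller can ask the sender to retry later.
    boolean offer(String message) {
        return pending.offer(message);
    }

    // Returns the next pending message, or null if there is none.
    String poll() {
        return pending.poll();
    }
}
```

Because the queue has a fixed capacity, memory use stays predictable under a flood of messages, and the application keeps serving what it has already accepted.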

State Management. Try to have your system save its state (a checkpoint) from time to time, so that in case of serious problems you can always roll back to the last consistent state. For example, the FIX protocol, which is common in financial applications, has the concept of a sequence number. Each message has its own sequence number; the server remembers the number of the last message it sent, and the client remembers the number of the last message it received. On restart they exchange these numbers, and if there is a gap between them, the server resends all the missed messages.
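
The sequence-number idea can be shown in a few lines. This is a simplified illustration of the principle only, not the real FIX session protocol and not the API of any FIX library. The SequencedServer class is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of sequence-based recovery (not real FIX).
final class SequencedServer {
    // The message with sequence number n is stored at index n - 1.
    private final List<String> sent = new ArrayList<>();

    // Sends a message and returns its sequence number (starting at 1).
    int send(String payload) {
        sent.add(payload);
        return sent.size();
    }

    // On restart the client reports the last sequence number it received;
    // the server returns every message after that number.
    List<String> resendAfter(int lastReceivedByClient) {
        if (lastReceivedByClient < 0 || lastReceivedByClient > sent.size()) {
            throw new IllegalArgumentException("unknown sequence number: " + lastReceivedByClient);
        }
        return new ArrayList<>(sent.subList(lastReceivedByClient, sent.size()));
    }
}
```

If the client last saw message 1 and the server has sent three, resendAfter(1) returns the second and third messages, closing the gap.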

InfrastructureError. If your application hits an error it cannot recover from, such as OutOfMemoryError in Java, you can try stopping the application and starting it again. This can be automated with tools such as the Tanuki Java Service Wrapper, which I have written about in detail elsewhere.

NullPointerException. That said, do not go overboard checking every input parameter of a method or constructor for null. It is better to adopt a rule: never pass null between components of the system.

Infrastructure isolation

Sharing your system's hardware with any other applications is very dangerous. This is sometimes called the student kitchen principle: if someone is doing something there, it cannot help but affect everyone else who uses the same things. The most annoying part is that it is very hard to determine the cause of such incidents if the analysis is carried out not immediately but some time later. There are many examples of this problem. A third-party application starts using a lot of RAM, your processes begin to swap, and the performance of your system degrades badly. Or a third-party system starts using the hard disk heavily while you write logs synchronously, and again your application slows down a lot. Third-party systems filling up the disk or clogging the network link are also possible problems, although these can arise even when the hardware is used only by your system, if it simply consists of a large number of processes. A very common way in which Java applications affect each other is the parallel garbage collector. When it runs in one of the components, by default it tries to use the available CPU resources to the fullest, so if an application that needs fast response times runs on the same machine, this problem is not contrived. That said, the number of threads available for garbage collection can always be limited with special flags when starting the JVM (in HotSpot, for example, -XX:ParallelGCThreads).

Monitoring

Monitoring can also increase the resiliency of your system. Very often, if you learn about an impending problem in advance, you can take action to prevent it. For example, when you see that free disk space is running out, you can quickly log in to the machine and clean up the logs.

There are many ready-made monitoring solutions (Triton, Nagios), both paid and open source. Standard features include monitoring of disk, RAM, CPU, and network traffic. There are also various plug-ins that can watch log files and send a message to the monitoring system when errors occur.

Despite the availability of ready-made solutions, banks for some reason develop their own. These, however, are focused more on receiving messages sent programmatically from internal applications. For example, if some back-end system could not process an order, it can send a message with all the order details to the monitoring system so that support staff can enter the order manually.

Another type of monitoring is the health monitor: your application sends special heartbeats, and if nothing has been heard from it for a certain period of time, a fault alert pops up in the monitoring system.
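
The core of such a health monitor is small. Here is a sketch, assuming applications identify themselves by name and that the current time is passed in explicitly. The HeartbeatMonitor class is hypothetical and leaves out how heartbeats are transported and how alerts are displayed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical health monitor that tracks heartbeats from several applications.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    // Time of the last heartbeat per application; sorted for stable output.
    private final Map<String, Long> lastSeen = new TreeMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called each time an application sends a heartbeat.
    synchronized void heartbeat(String app, long nowMillis) {
        lastSeen.put(app, nowMillis);
    }

    // Returns the applications that have been silent for longer than the timeout.
    synchronized List<String> silentApps(long nowMillis) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                silent.add(e.getKey());
            }
        }
        return silent;
    }
}
```

A scheduled task can call silentApps() every few seconds and raise an alert for each name it returns.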

Problem Management

Of course, you should start thinking about the resiliency of your system as early as possible, ideally at the design stage. But it is never too late to bring the resilience of your system closer to the ideal. It is very important to investigate every problem that occurs in your system, find the root cause, and make sure it cannot cause problems again.

In conclusion

I have talked about several approaches to building fault-tolerant systems. Which one should you choose? I would recommend the one that is the most transparent and the easiest to implement, provided that it still meets the minimum resiliency requirements of your system.

Source: https://habr.com/ru/post/118496/
