
Clarification on the CAP theorem

The article "Misunderstanding of the CAP-theorem" and the comments on it show that there really is a misunderstanding. It stems not only from misreading the term "partitioning" but also from reasoning errors at other levels. I will try to clarify.

Conventions

To begin with, let's agree on the level of abstraction.
Within the framework of the theorem, it is meaningless to object that in the "real world" a connection between the nodes of a distributed system cannot be guaranteed. One could just as well say that "availability" is unattainable in principle, since any system stops being available once enough of its nodes fail (for example, all of them).

So let's agree that we are talking about a model, and that the model makes some assumptions.
We will assume that the probability of a critical number of nodes failing is negligible; this lets us say that availability can still be guaranteed.
We will not consider how the nodes are built or the nature of the links between them: for the theorem it does not matter.

Technically, we can build a system in which the probability of losing communication between nodes is even lower than the probability of a critical number of nodes failing (which we already treat as negligible). On that basis we can say that a distributed system with guaranteed communication between nodes (that is, one in which partitioning cannot happen) is possible.
How exactly to achieve this does not matter. It may take million-fold redundancy of the communication links, or the invention of an ultra-reliable graviton transmitter that withstands interference far better than our crude silicon contraptions. The point is that it can be done if someone really wants it. After all, people build computers in which four processors compute the same thing solely to cross-check one another. The same care can be taken to guarantee communication.

Death is kinder

It is important to understand that the death of a node and the loss of communication with it are not equivalent.
The difference is simple: a dead node cannot perform actions that, "out of ignorance", would be destructive to the rest of the system. You can always arrange for it either to stay dead for good or, on recovery, to consult the rest of the system first so that it does not accidentally break anything.

A broken connection is a much sadder story. A node cannot afford to die, because it may be the only one left keeping the system running (without communication with the other nodes there is no way to tell whether anyone else is alive). It cannot synchronize its work with the others either, because there is no connection. All that remains is to act blindly, at its own risk.

Let's look at what happens in a concrete example.

Suppose our system has exactly two identical nodes, A and B. Each stores a copy of the other's data and can process outside requests independently. After processing a request, each node notifies the other of the changes so that the data stays consistent.

Option 1: node A dies.
The system keeps working as if nothing had happened: B continues to process requests. When A is brought back to life, it first synchronizes with B, and then the two carry on together. Neither availability nor consistency suffers.

Option 2: A and B are alive, but the connection between them is broken.
Each of them keeps receiving outside requests but cannot notify the other of the changes. To each node it looks as if the other has died and it is working alone. This situation is often called "split-brain": the brain is divided into two hemispheres, each of which considers itself the sole master of the situation. The system has come down with schizophrenia.

If at that moment A processes a request to delete a record R while B processes a request to modify the same record, the data becomes inconsistent. When the connection between A and B is restored, synchronization surfaces a conflict: delete R, or keep the modified version? Various conflict resolution strategies can get us out of this, but consistency has already been lost.
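The delete-versus-modify scenario can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Node` class, the `dirty` set of locally changed keys, and the `TOMBSTONE` marker are hypothetical names invented for the example, not part of any real system.

```python
from dataclasses import dataclass, field

TOMBSTONE = object()  # hypothetical marker for a deleted record


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    # Keys changed locally since the last successful synchronization.
    dirty: set = field(default_factory=set)

    def modify(self, key, value):
        self.data[key] = value
        self.dirty.add(key)

    def delete(self, key):
        self.data[key] = TOMBSTONE
        self.dirty.add(key)


def find_conflicts(a, b):
    """Return keys that both nodes changed on their own, with different results."""
    return sorted(k for k in a.dirty & b.dirty if a.data.get(k) != b.data.get(k))


a = Node("A", {"R": "v1"})
b = Node("B", {"R": "v1"})

# The link is down, so neither change reaches the other node.
a.delete("R")
b.modify("R", "v2")

conflicts = find_conflicts(a, b)
print(conflicts)  # ['R']
```

Detecting the conflict is the easy part. Whatever rule then picks a winner, there was a period during which A and B gave different answers about R.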

The alternative is for A and B, seeing that they have lost contact with each other, to stop processing requests. In that case consistency is preserved, but availability is lost.
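A minimal sketch of this consistency-first choice, again in Python. The `CautiousNode` class, the `Unavailable` exception, and the `link_up` flag (standing in for a real health check of the link) are all assumptions made for the illustration.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


class CautiousNode:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peer = None
        # Hypothetical flag standing in for a real link health check.
        self.link_up = True

    def _require_link(self):
        if not self.link_up:
            raise Unavailable(f"{self.name}: peer unreachable, request refused")

    def read(self, key):
        self._require_link()
        return self.data.get(key)

    def write(self, key, value):
        self._require_link()
        self.data[key] = value
        self.peer.data[key] = value  # notify the other node of the change


a, b = CautiousNode("A"), CautiousNode("B")
a.peer, b.peer = b, a

a.write("R", "v1")             # link is up: both copies change together
a.link_up = b.link_up = False  # the connection breaks

try:
    b.write("R", "v2")
    refused = False
except Unavailable:
    refused = True

print(refused, a.data == b.data)
```

Both copies still agree after the break, but every client now gets an error: the C-for-A trade, in code.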

Closer to reality

Of course, systems of exactly two nodes are rare. But the same holds if A and B are not two separate nodes but two sets of nodes. We assume that any two nodes within A, or within B, can still communicate, but nodes in A cannot reach nodes in B. Meanwhile, both sets keep receiving and processing outside requests.

In fact, this describes the typical situation in which the link between two data centers goes down. Split-brain reliably gives the system schizophrenia even in such a simple case. If there are several splits, or if the different groups hold incomplete data sets, things can get even worse.

Returning to the theorem

The theorem says that CA is achieved only by giving up tolerance to communication failures between nodes. In practice, this means that CA systems are appropriate where we consider the probability of losing communication negligible.

How we achieve that is of no concern to the theorem. There is no need to invoke the "real world" to show that its authors missed something. The theorem lives within a model. It is true now, and it will be just as true the day after tomorrow, when the graviton transmitter is finally invented.

But even without a graviton transmitter things are not so bad. Loss of communication between data centers is infrequent; within a single data center it is rarer still. Yes, if a split occurs, conflicts will have to be resolved, perhaps even by hand, although in a great many applications most conflicts can be resolved automatically. Still, the advantages of a CA system may well attract us more than the prospect of manual repair after an unlikely split scares us off. In that case we can, with a clear conscience, treat the probability of trouble as negligible without relying on the supertechnology of the future.

And if in your project you consider a split fairly likely, you can restate the theorem like this: when a split occurs, all that remains is to choose between A and C.

Lyrical digression

There are techniques that soften the CAP theorem in various special cases. People do not necessarily need a system that works perfectly. Often it is enough that it works well enough, at least most of the time.

These techniques include attempts to restore integrity automatically after the worst has happened, and attempts to prevent the loss of integrity without sliding into a complete loss of availability.

The simplest example: each node automatically stops processing outside requests if it can see fewer than half of the other nodes. This ensures that if the system splits into two unequal parts, one of them keeps working while the other automatically commits hara-kiri so as not to do anything foolish.
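The rule above fits in one comparison. The sketch below is a hypothetical illustration (the function names `may_serve` and `surviving_groups` are made up here); it also shows that for integer node counts the rule is the same as requiring a strict majority of the whole cluster, counting the node itself.

```python
def may_serve(visible_peers: int, cluster_size: int) -> bool:
    """Return True if this node may keep processing outside requests."""
    other_nodes = cluster_size - 1
    # Stop when fewer than half of the other nodes are visible.
    return 2 * visible_peers >= other_nodes


def surviving_groups(group_sizes):
    """For a split into groups, report which groups keep serving."""
    cluster_size = sum(group_sizes)
    # A node in a group of size g sees g - 1 peers.
    return [may_serve(g - 1, cluster_size) for g in group_sizes]


print(surviving_groups([3, 2]))  # [True, False]: unequal split of 5 nodes
print(surviving_groups([2, 2]))  # [False, False]: equal split of 4 nodes
```

Note the second case: with an even split, neither half sees enough of the others, so the whole system stops. That is one reason clusters are often sized with an odd number of nodes.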

The catch is that, left with half its capacity, the system is likely to collapse entirely under load. But there is another trick: the nodes left in isolation can keep serving reads, which creates no conflicts. For read-heavy systems this is a perfectly reasonable move. True, the data served will not always be up to date, but that is often better than nothing. However, if the system splits into several parts, it may turn out that none of them can process write requests. Although one can come up with something smarter...
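As a rough sketch of the read-only trick, assuming the same visibility rule as in the previous example from the text: an isolated node keeps answering reads from its local, possibly stale copy and rejects writes. The `Replica` class and `Mode` enum are hypothetical names for illustration only.

```python
from enum import Enum


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


class Replica:
    def __init__(self, cluster_size, data=None):
        self.cluster_size = cluster_size
        self.visible_peers = cluster_size - 1  # everyone reachable at start
        self.data = dict(data or {})

    @property
    def mode(self):
        # Fewer than half of the other nodes visible: minority side.
        others = self.cluster_size - 1
        if 2 * self.visible_peers >= others:
            return Mode.READ_WRITE
        return Mode.READ_ONLY

    def read(self, key):
        return self.data.get(key)  # may be stale while isolated

    def write(self, key, value):
        if self.mode is Mode.READ_ONLY:
            raise PermissionError("isolated from the majority: writes are disabled")
        self.data[key] = value


# Five nodes split into groups of 2, 2 and 1: nobody sees half of the others.
replicas = [Replica(5, {"R": "v1"}) for _ in range(5)]
for replica, peers in zip(replicas, [1, 1, 1, 1, 0]):
    replica.visible_peers = peers

modes = [r.mode for r in replicas]
print(modes)
print(replicas[0].read("R"))
```

In this three-way split every replica still answers reads, yet no group is large enough to accept writes, which is exactly the limitation described above.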

However you twist and turn, the CAP theorem cannot be circumvented in the general case. But most projects leave plenty of room to find a way out by reducing the likelihood of problematic scenarios to a minimum.

P.S. javaspecialist, thanks for the occasion to write this article.

Source: https://habr.com/ru/post/136398/

