Reachability of the lower limit of the execution time of a commit distributed fault-tolerant transactions

Foreword

Recently I read another article from the series: "we are better than a two-phase commit". Here I will not analyze the content of this article (although I am thinking about giving a detailed analysis). The task of my opus is to propose the most efficient version of a distributed commit in terms of time delays. Of course, such a commit is a high price. However, the goal is to give an assessment and show that a two-phase commit is not a brake, as many believe.

It is also worth noting that there will not be full-scale experiments and fake comparisons. Algorithms and theoretical analysis will be simply given. If desired, you can independently implement and test in practice. Of course, it would be much better if this was described in the current article, but everything depends on free time and motivation. In my opinion, to describe the algorithms is more important than to bring graphics, because graphics on algorithms can draw almost everyone, the opposite is not true.

After such a preface will proceed.

Introduction

Definition RTT - message time back and forth.
Definition Hop - the time of one shipment.

Theorem . Time 1 RTT is equal to the time of two hops.
Proof . It is obvious.

Definition Distributed commit is the process of making atomic changes between at least two distributed system members.

Definition A two-phase commit is a two-phase commit. The first phase is an atomic operation to check the possibility of starting a transaction and blocking the participants of the commit. The second phase is the collection of responses from the participants and the application of the transaction with the release of locks.

Theorem . A two-phase distributed commit cannot be made faster than 1 RTT.
Proof . To conduct a two-phase commit, it is necessary, at a minimum, to send a request from the client to all participants and receive a response on completion. For this you need 2 hop or 1 RTT.

Definition Failsafe commit is a commit that continues to be executed even if one or more commit members fail.

Theorem . A two-phase, fault-tolerant distributed commit for 1 RTT is possible.

To prove this theorem, it suffices to give the method and conditions, when possible. It is clear that this is not always possible, because in the case of competitive access of distributed transactions to the same resource, such transactions should be queued up to this resource. So, they will be executed sequentially. In this case, talking about 1 RTT will be somewhat fun. However, even conventional algorithms, under favorable conditions, give times markedly longer than 1 RTT.

The proof of this theorem will be devoted to the further part of the article.

Biphasic commit

Consider the classic two-phase commit scheme with the coordinator.

The sequence is as follows:

1st hop . The client sends a request to the coordinator.
2nd hop . The transaction coordinator sends a request to the participants for blocking - 1st phase.
3rd hop . Participants successfully take locks and send a response that they are ready for transactions.
4th hop . The coordinator sends a message on the application of operations to all participants - the 2nd phase.
5th hop . Participants report on the success of the application of the coordinator.
6th hop . The coordinator responds to the client.

Total 3 RTT.

Now add fault tolerance. We will assume that the coordinator and the participants are included in the relevant consensus groups. We will also assume favorable conditions, i.e. the group leader does not change and the consensus is completed safely. Let us prove the lemma:

Lemma A leader-based distributed consensus cannot be accomplished faster than 1 RTT.
Proof . To achieve consensus, the request should be sent to the leader. Wherein:

1st hop . The leader sends the request to other participants in the consensus (follower).
2nd hop . Participants send confirmation to the leader.
Without these phases, consensus is impossible.

Lemma Consensus is possible for 1 RTT.
Proof : take the Raft algorithm. In the case of liveliness of the leader and the majority of participants in a consensus, the adoption of an agreed decision on the leader occurs after receiving answers from the participants, i.e. after 1 RTT.

Definition Making an agreed decision is a group consensus agreement .

It should be noted that after this the system guarantees that this agreement will remain in the system, even though the agreement has not reached other participants at this moment. In the event of a leader falling, a failover occurs, during which the new leader must complete these changes. However, this is not the subject of consideration of the lemma, since we consider the potential, i.e. some ideal conditions that can lead to the desired result - the achievement of consensus. Why we do not consider all possible conditions? Yes, because there is a theorem that consensus in an asynchronous system is impossible . Therefore, it is important to understand what is the minimum possible time in the most favorable situations without disturbing the correctness of the algorithm, which is obliged to retain its invariants and in case of violation of these conditions at any stage. These two lemmas give an exhaustive answer, which suggests that the shortest possible agreement time is achievable.

This theorem can be generalized by proving that it is impossible to reach a consensus faster than 1 RTT, throwing out the condition of having a leader. However, this is beyond the scope of this article. The idea of the proof is to consider the dissemination of knowledge about other participants in the system and whether they have the appropriate message: for 1 hop, you can only send data, but don’t find out if they reached and what state the recipient was in.

So, for fault tolerance, take a consensus with 1 RTT and add to our two-phase commit:

1st hop . The client sends a request to the leader of the coordinator.
2nd and 3rd hop . The coordinator leader agrees to start the transaction.
4th hop . The transaction coordinator sends the request to the leaders of the participants for blocking - 1st phase.
5th and 6th hop . Participants successfully take locks with preservation of knowledge in their consensus groups.
7th hop . Leaders of participants send the answer that they are ready to conduct transactions.
8th and 9th hop . The leader of the coordinator will coordinate information about all participants in the system.
10th hop . The leader of the coordinator sends out a message on the application of operations to all the leaders of the participants - phase 2.
11th and 12th hop . Leaders agree on commit and apply changes.
13th hop . Participants report on the success of the application of the leader coordinator.
14th hop . The coordinator responds to the client.

Total 7 RTT. Not bad. Fault tolerance is “only” 4 RTTs. They arise from the fact that the coordinator and the participants 2 times consistently come to their own consensus, for which this time is spent.

In the above scheme, some non-optimalities can be noticed. Let's fix them.

Commit optimization

The first obvious optimization is sending a response to the client immediately after collecting the responses of successful locks from the participants. Since Since these answers are fault-tolerant, the participants will never forget about them, which means that the transaction will be executed sooner or later even in the event of loss of the notes, re-election of the leader, etc. Here, however, there is one slippery moment.

It lies in the fact that in fact the coordinator makes the final decision on whether to commit the final transaction or not. Those. even if all the participants said OK, but some participant blunted due to, for example, a change of leader, the coordinator has every right to roll back the transaction. And if so, then you can remove only 10-13 hops, but not the 8th and 9th. But this is not bad, since we have a decrease of 2 RTT, i.e. 5 RTT instead of 7.

At the same time, 10-13 hops do not disappear anywhere, just the client does not need to wait for them. The coordinator and the participants will finish their affairs in parallel with the client. And the client will receive his confirmation a little earlier. This knowledge will be in the system, just a little later. Here we use the magic of asynchrony, consensus and the impossibility of proving to an outside participant that we cheated and cut the corner a little. Those. if the client suddenly wants to immediately read the data that we have just committed and go immediately to some participant, he will come across a lock (if it was not removed by the 2nd phase), and this request will hang until it is removed . However, within the framework of our theoretical research, this fact is absolutely not important, since we are preparing the perfect conditions. And in the case of imperfect ones, as already mentioned above, we will wait for several eternities (since consensus will require eternity, but we need to hold several of them, and consistently).

The next knight's move is somewhat more complicated and elegant.

Consider the very beginning of the transaction. There, the client sends a request to the coordinator and then he initiates a two-phase commit and sends these requests to the other participants. The idea immediately arises to execute such requests simultaneously, i.e. send a request to the coordinator and participants in parallel. On this way, an insidious trap awaits us.

The fact is that the client is not a fault-tolerant entity, i.e. he may fall. Imagine that he sent a request to the participants, they took the lock and wait, but the request to the coordinator for some reason did not reach and the client fell. Thus, there is no one to start a two-phase commit and there is no one to roll it back in case of conflicts / problems and so on. Participants will permanently block records and no one will help them. Therefore, this optimization is incorrect. Participants have the right to commit only after the decision of the coordinator, who is responsible for the transaction and rolls it back if necessary.

To go further, you need a completely different look at the problem. And for this we begin, oddly enough, with a consensus.

Consensus optimization

It would seem that there can be optimized? After all, we with Raft reach the minimum possible execution time - 1 RTT. In fact, it can be faster - for 0 RTT.

To do this, we recall that in addition to the consensus itself, 1 more RTT is required to send a request from the client to the leader and receive a response. Those. for the removed consensus group, 2 RTT is required in this case, which we see in a two-phase commit in 2 examples: sending and committing on the coordinator, sending and committing on the participants. Total 4 RTTs at once, and another 1 RTT - per second phase commit at the coordinator.

It is clear that a leader-based consensus for a remote client can never be faster than 2 RTTs. In fact, we first need to deliver the message to the leader, and then the leader is obliged to send the group members and receive a response from them. No options.

The only option is to get rid of the weak link - the leader. Indeed, moreover, that all records must pass through it, even if it falls, the group becomes inaccessible for a sufficiently long time. The leader of the consensus is the weakest link, and the restoration of the leader is the most fragile and non-trivial part of the consensus. Therefore, you just need to get rid of it.

Definition Broadcast messages are sending the same message to all group members.

To do this, we take the consensus widely known in narrow circles without a master . The basic idea is to ensure that the broadcasters achieve the same status on the participants. To do this, it is enough to make 2 broadcasts, i.e. just 1 RTT. The first Broadcast to the participants of the system can be made by the client. Response broadcasters from the participants can reach the client. If the client sees the same state (and he sees it in the case of, for example, the absence of concurrent requests), he will be able to understand the analysis of the content of the response broadcasts, that his request will be committed sooner or later. In fact, using such an algorithm, all participants in the consensus, including the client, simultaneously realize that the request was committed. And this will happen after 2 broadcasts, i.e. 1 RTT. Since the client still has to spend 1 RTT to send a message to the group and get an answer, then we have a paradoxical conclusion that the consensus effectively occurred in the 0 RTT.

Analogy

To go further, let us use the most powerful analysis tool - the analogy. Let's go back to the Raft algorithm. What is happening in it? It consists of two phases:

1st phase : the leader sends a request to the participants and waits for a response.
Phase 2 : after the answer, the leader makes an agreed decision individually and sends it to the system participants.

Nothing like? That's right, this is a two-phase commit, with only some reservations:

The Raft algorithm does not need to wait for a response from all participants. In a two-phase commit for a successful transaction, you must wait for a successful response from all participants.
In the Raft algorithm, the participant cannot say neOK. More precisely, theoretically, he can do so (for example, the place is over), but this neOK will be similar to the absence of an answer. In a two-phase commit, everything is stricter: if at least one of the participants said a neOK, then the whole transaction should be quarantined and rolled back. This is the very essence of two-phase: first we ask the consent of everyone, and only after universal unanimous approval we roll changes. The consensus in this sense is more democratic, since requires majority approval.

At the same time, they have in common that there is a dedicated decision-making driver (leader or coordinator), and there are 2 phases - preliminary and final.

Accordingly, all we need is to abandon the coordinator in a two-phase commit, i.e. do exactly the same thing that we did for consensus by giving up the leader.

Let's forget about fault tolerance for a while and see how commit will look in this case.

Self-coordination

Definition A two-phase commit without a coordinator consists of 2 phases:

All participants send to all other participants their decision: OK or neOK.
Each participant after receiving an OK from everyone commits the changes or rolls them back, if at least one responded to a non-QC.

After that, for reliability, each participant can send information to the others that a commit has occurred and you can release locks, but this is not necessary.

Why did we suddenly do not need a coordinator? The fact is that the coordinator followed the transaction process, including whether the nodes were alive. Those. in case of problems with the participants, the coordinator rolled back the transaction. The problem was only in the coordinator itself, since he could not look after himself. Therefore, often a two-phase commit is called blocking .

Definition Self-coordinating transactions - transactions that do not require a dedicated coordinator.

However, adding fault tolerance to the role of the coordinator becomes redundant, since each participant who is a consensus group can stand for itself. Thus, we come to self-coordinating transactions without the need for a dedicated coordinator. An important difference from the usual two-phase commit with the coordinator is that the coordinator can decide at any time to roll back the transaction, even if all participants gave a positive answer. In self-coordinating transactions such non-deterministic behavior is unacceptable, since Each participant makes a decision based on the responses of other participants and this decision should be the same.

Theorem . Self-coordinating transactions give strict consistency (strict consistency = linearizability + serializability).
Proof . Actually, the proof is based on the simple fact that a two-phase commit also provides such a guarantee. Indeed, in a scheme without a coordinator, each participant is himself a coordinator, i.e. there passes a two-phase commit as if it is the most important. So, it preserves all invariants of a two-phase commit. It is easy to be convinced of this, if we recall that each participant broadcasts by the broadcasted answers to all the others. Those. each receives OK responses from everyone else, acting as a coordinator to commit a transaction commit.

We describe the minimum number of hopes under favorable circumstances:
1st hop . The client sends a message to all participants in the transaction.
2nd hop . All participants send a response to the client and each other.

After the 2nd hop, the client has all the necessary information to make a decision about the commit. Thus, only 1 RTT is required for a commit.

Fault tolerance and availability

The attentive reader may ask: what to do in the event of a client falling? After all, if the participants of the system can be made fault tolerant, then we cannot make such requirements to the client, i.e. he may fall at any time. It is clear that after the client sends requests to all the participants in the system, the distributed commit is able to complete without the participation of the client. And what to do if the client managed to send only some of them and safely fell?

In this case, we oblige the client to do the following: the client must send each participant information about all the other participants in our transaction. Thus, each participant knows all the other participants and sends them their result. However, any participant, if he did not receive a request from the client, can choose one of the following behaviors:

Immediately answer that he does not accept the transaction, i.e. sends neoc. In this case, the lock is rolled back. The participant, as always, sends his answer to the other participants.
If the request from another participant contains all the necessary information to complete the transaction commit for this participant, then you can decide on successful blocking of the corresponding records (1st phase) and send OK. To do this, the client must send each participant of the transaction information about all other participants and all the necessary data to make a distributed commit.

In any case, we get that all participants either get OK, or in the absence of the necessary information, someone reports the neOK and the transaction is rolled back. Those. in the event of a client crashing, each participant is able to either finish the job or correctly roll back the client’s actions.

It remains to make the system participants fault tolerant. To do this, put them in the consensus of the group without a dedicated leader. Those. Each participant will not represent a separate node, but a set of nodes in the consensus group.

The commit algorithm will look like this:

The client sends to each node belonging to the group of participants in the transaction, its request.
Upon receiving a request, each node sends to all the other nodes and the client a response regarding the speculative execution of the first phase of the commit as if it were executed at the current step of consensus. In reality, we do not know whether this will actually happen or not, because in the case of competitive requests from other customers, the consensus may reorder the current unapplied actions.
The client receives all requests from all nodes of all participants. If all the nodes during the speculative execution answered OK and the consensus step was the same for each node from the consensus groups, then this means that the speculative implementation of the first phase will actually occur and you can make a decision about the commit.

In fact, the condition of receiving a response from all the nodes of each group is redundant. However, a more detailed consideration of the easing of this requirement is beyond the scope of this article.

findings

Total we get 2 hop or 1 RTT. Taking into account the fact that the message between the client and the server cannot be removed, the effective processing time of the commit on the server side is zero, i.e. as if the server instantly processed a distributed, highly available failover transaction and sent a response to the client.

Thus, we have an important theoretical and applied fact: the lower limit of the execution time of a distributed fault-tolerant commit is achievable.