📜 ⬆️ ⬇️

Failsafe resource allocator over DHT

We have a range of numbers from 0 to N, we need to write two functions int alloc () and free (int). The first one chooses one of the free identifiers from the range [0, N), and the second accordingly “returns” it for reuse (we assume that the number N is small enough that the identifiers could end if they are not returned, but more than the number allocated in each specific time identifier). In this case, at the "lower level" we have only DHT . There are no locks, and, moreover, fault tolerance is required of the algorithms - if any of the cluster nodes “develops” during the execution of the algorithm, the behavior of the system should be predictable. If the task is interesting, and also it is interesting to find out why a fault-tolerant service with such a signature cannot be correctly used, and how it is necessary to fix the signature so that it becomes possible - welcome under the cat.



API


')
In fact, let's start by clarifying the API of our “allocator”. As noted above, the current API is incorrect - it cannot be used in a distributed system, because the calling code will not always be able to understand exactly which state of one or another element of the pool.

A simple example.
1) Suppose I call the allocate () function, it succeeds, but the calling code stops working when the response from allocate () is passed. The problem is that the identifier stood out, but I don’t know anything about it. I physically do not have the opportunity to write some kind of recovery procedure and delete it. On the other hand, the code responsible for allocating the resource has already worked, from his point of view the identifier has been allocated. And he himself will not be able to remove it, because he must wait for the call free. Which will not.
2) The idempotency of the free () function is required. Obviously, the calling code must be written in such a way that in any case, call free (). Obviously, having received a response from allocate, we will transfer it to persistent memory (in RDBMS, key-value storage, or simply print it on the printer and load the paper into the scanner). And then, at a certain point in time, when a node decides to delete a resource, it will call free (). However, if at the moment of free operation the node turns off, our “caller” will obviously try to call it again (exactly how it will be implemented is not important. It is important that it is realizable). Here we have an unpleasant surprise. The fact is that free could still release a resource. And even more, some third-party alloc () could allocate it again. Then we will free the resource once more, but this time it will be another's resource.

Generally speaking, there are several ways to solve these API problems.

The most simple, and quite universal, is that we assign a tag to each resource allocation, which is chosen unique for each pair of alloc / free operations.

That is, the method signature is changed to int alloc (T t) and free (int id, T t). No matter what type of T, it is important that there be some equivalence relation on it, and that we can solve the two problems described above. First, the free method can be made idempotent. Secondly, we can introduce a relatively simple procedure for deleting “phantom” identifiers that were allocated, but for which we did not receive information (see Problem 1). We will not dwell on this further, so as not to inflate the article.

Note that it is almost always possible to practically use a service modified in this way, since in most cases, we allocate “identifiers” for something, and we can always use this “something” as a tag.

Now, a little more, it remains to understand how alloc and free behave in the case of "partial" execution, which, in turn, to understand how we write the calling code, and whether it is possible to write it.

Obviously, in general, the calling code cannot know whether the allocator was executed or not. So the first rule for the calling code is to say

• The calling code must first write somewhere an identifier by which it calls alloc, and only then call it. Due to this, the calling code will be able to implement the recovery procedure, where it will go in the reverse order - it will iterate over all selected pairs (id, t), and call free for all pairs about which there is no mention in its repository.
Similarly, we can’t find out about the free method - it has volunteered or not.
Two more rules come from here.
• The identifier passed by free must be considered as “free” in the system (that is, not to be used anymore) BEFORE calling the free method
• And, nevertheless, the free method must be called (permissible - several times) until at least one of its calls is completed successfully.

Well, let's move on to the implementation.

Implementation



I remind you that at the entrance we have DHT. For people who write in Java, we will assume that we have ConcurrentMap (its interface, by the way, is completely trivial, will be understandable to developers using other languages). At the same time, ConcurrentMap of different hosts "looks" at the same content.

DHT operations themselves are atomic (this is quite supported by a whole range of DHT implementations).

However, there is no atomicity between the challenges in the DHT, and we need to take this part into account and support in our implementation.

So, the implementation

It is obvious that in DHT we will have some kind of map id-> t (well, or vice versa). The problem is how to get to the first free id without going through all the busy nodes?

Such a solution is proposed. Let's build a binary interval tree, such that
• The root of the tree is the interval [0, N)
• Under the non-leafed node X [a, b) there are exactly two nodes [a, c) and [c, b), where c is chosen so that the lengths of these intervals are equal
• And the intervals of the form [id, id + 1) are leaves of the tree.
It is obvious that such a tree has N sheets, each of which can be accessed from the root in ln (N) transitions.

Let's now in each non-leaf node of the tree put the number of reserved identifiers on the interval that this node represents. And in the leafy nodes of the tree we will put the tag, if the corresponding identifiers are busy, and null - if it is free.
It now remains to save this tree in DHT. In addition, it is important to update the information in its nodes in the correct order, because, I remind, our algorithm should work correctly, even if it is interrupted at an arbitrary step.

Suppose we have an Interval structure with two fields - from and to, which defines a half-open interval (includes from, but does not include to). Obviously, such a structure encodes a tree node. It is also worth noting a relatively trivial moment - there are no repeating nodes in the tree, and, moreover, to determine the position of the interval in such a tree, nothing but this interval is needed. In other words, we do not need to link the elements of the tree with "pointers". We can always build an algorithm that, by node X, can determine which nodes are “under” it, and which node is “above” it.
This is a rather important property that simplifies our life.

Put in this structure there are necessary constructors, obvious equals and hashcode, and there are three more useful methods: left (), right () for determining the child nodes and up () for determining the parent node. With designers and hash codes, I think everyone will cope. With left () and right () too. With the up () function, it will be somewhat more difficult, but it is possible to write it as well (we will return to it at the end of the article, so as not to blur the text with unnecessary details).

Now, having a “magical” structure, Interval, which can encode our interval tree, the matter remains small.
In allocate, we need to select some non-empty sheet and write a tag into it, and then go “up” from the selected sheet and update their values. It’s important to keep two things in mind.
• The tag is added to the list by the atomic putIfAbsent operation.
• Values ​​in non-leaf nodes are updated through a loop with read-modify-atomicUpdate. Namely, we read the value from the updated node, read the value in the child nodes, write to the update node the sum of the children through the atomic replace. If atomic replace did not work, repeat the operation.
• Even if we read from DHT that there are loose leaves in the sub-tree of the node, this does not mean that they really are there.
Therefore, our allocate will be written in the form of a "before success" cycle from the failfast procedure of searching and updating a blank sheet. Note that the first modification of the tree we are doing is updating the sheet, and only then updating the parent nodes. This is necessary because allocate may complete at any time.

Free works in approximately the same way - it “releases” the identifier (that is, removes the node from the DHT), if the tag in it matches the requested one, and updates the nodes up the tree. At the same time, the update of nodes coincides with how it goes to allocate. The only thing that is important to note is that free updates the tree even if the leaf node does not contain values ​​(that is, if free has already been called). This is necessary because the previous free could be partially executed, freeing the list, but not updating the higher nodes. Actually, it is for this purpose that we need the API requirement that free can be called several times, and that it needs to be called again if there is any doubt that the last time its call was completed successfully.

Function up



Actually, the promised up () function. It is worth noting that the spelling of left (), right () and up () is greatly simplified if we agree that the top level interval is 2 ^ n long. Then any other interval in the tree will have a length of 2 ^ n, besides, the intervals on the same “level” of the tree will have the same length. Generally speaking, all three sabzhevye functions can be written for a top-level interval of an arbitrary type (but then, in addition to the “current” interval, you will always need to know the top-level interval). But this is inconvenient, and a generalization to 2 ^ n does not harm anything.

Thus, with the length of the intervals, 2 ^ n, left () and right () are written trivially, and for up () you only need to determine whether the current interval is “right” or “left” for its parent interval. This is determined by a simple rule - for the "left" intervals, the initial value is divided without remainder by twice the length of the interval. Accordingly, knowing about the interval, whether it was obtained from the left () or right () function, it is easy to apply the inverse action and get the upper level interval.

Source: https://habr.com/ru/post/177121/


All Articles