
Good day to all!
Today I will tell you about one interesting data structure, about which only a few have heard, and about which very little is written rightly in runet, and in the English-language information, in general, it is also sparse. It was decided to correct the situation and to share with the public in an accessible form this rather exotic data structure.
The van Emde Boas tree is an associative array that allows you to store integers in the range [0; U), where U = 2
k , in other words, numbers consisting of no more than k bits. It would seem, why do we need some other tree, and even allowing you to store only whole numbers, when there are many different balanced binary search trees that allow you to perform insert, delete and other operations for O (log n), where n is the number of elements in the tree ?
')
The main feature of this structure is the execution of all operations during O (log (log (U))) regardless of the number of elements stored in it.
Actually, here is a small list of supported operations:
- Insert (x) - Insert a number in the tree
- Remove (x) - Remove a number
- GetMin (), GetMax () - Finding the minimum and maximum in the tree
- Find (x) - Search for a number in the tree
- FindNext (x) - Find the next number after x contained in the tree
- FindPrevious (x) - Search for the preceding x number
In this case, of course, you can associate any information with each of the numbers, for example, we will store the numbers of some goods in the tree and by the number we will be able to recognize the name of the goods, that is, Find (x) will return not just True / False - whether the product exists or not, and the name of the product with the number x. We will consider an implementation in which just numbers are stored, without additional information. To fasten this feature is not difficult.
So, let's start building our structure, and we will build it recursively.
Let the k-tree store numbers in the interval [0; 2
k ), that is, consisting of k bits. In this case, the number k itself will be a power of two, later it will be clear why we need such a condition. Then it is obvious how to build a 1-tree, it will only store information about whether the numbers 0 or 1 exist in it or not.
Now we build the k-tree. It will store:
- 2 k / 2 pieces (k / 2) -trees
- Auxiliary (k / 2) -tree
- The minimum and maximum number contained in this tree (if it is not empty)
It becomes unclear why we need all this?
Imagine now that we are trying to insert a number x consisting of k bits into a k-tree containing the subtrees children [0..2
k / 2 - 1]. For example, let x = 93. Let's look at its binary representation:

Divide our number into two equal parts high and low, each in (k / 2) bits. We recall that our tree contains 2
k / 2 trees that store numbers from (k / 2) bits. And now, from all these trees, we take the high ones in a row (that is, children [high]) and recursively insert the number low into it.
Thus, we obtain a simple algorithm for inserting and searching in the tree T of the number x, pseudocode:
insert(T, x): if Tk == 1: T.exist[x] = True
Let us estimate asymptotically the operation time of these functions. Let T (k) be the number of operations required to insert / search for a number in a k-tree. To insert / search in a 1-tree, we need O (1) operations, we obtain that T (1) = O (1). If we consider the k-tree (k> 1), we get that cutting the number using bit operations in it occurs in O (1), after which we perform an insert / search operation for the (k / 2) tree, we get such that T (k) = O (1) + T (k / 2). The obvious solution of this equation is T (k) = O (log (k)) = O (log (log (U)). It turns out that we have achieved the required asymptotics for insertion and search.
Unfortunately, such a structure with only insert, search and delete operations, to put it mildly, is meaningless in most cases, given that in this case it was easier to create a regular array A, in whose x-th element it is stored, does the number x exist in our set or not.
It's time to fix the situation! To do this, we implement, say, the operation FindNext (x). Recall that this operation looks for the minimum number q contained in our tree, such that q> = x.
To do this, a little complement our structure. Let me remind you that at the very beginning I said that we would keep in a k-tree not only 2
k / 2 pieces (k / 2) -trees, but at least in it, a maximum and one more auxiliary (k / 2) -tree, let's call it aux (from English
auxiliary - auxiliary). What will it help us so much?
Take a k-tree with its subtrees children [0..2
k / 2 - 1] and the aux auxiliary tree. In aux, we will store all such numbers p such that the children [p] tree is not empty, that is, it contains at least one number. Accordingly, if the p-th tree of children [p] is empty, then the number p is not contained in aux.
Also, we will make a minor modification: for any tree T we will not store the minimum number in its subtrees T.children, we will simply write it in the variable T.min. When we have to insert into the insert query the number x less than T.min, then since T.min is not contained in our subtrees, we insert it into the subtrees, and then assign T.min = x.
Note that now we will not consider separately the case of k = 1, since we have the variables T.min and T.max. And if, say, a 1-tree contained the numbers 0 and 1, then it will simply have T.min = 0, and T.max = 1. If, for example, it contained only the number 1, then it will have T.min = T.max = 1.
Now consider the algorithm for inserting into the tree, taking into account all of the above:
insert(T, x): if T.is_empty(): T.min = T.max = x else: if x < T.min: swap(x, T.min)
The asymptotic behavior of its operation will still be O (log (log (U))), it is evaluated similarly to the previous version of the insert function. It only needs to be noted that if we insert into aux, then the insert following it will work beyond O (1), since this subtree will be empty.
The search function will change a bit:
find(T, x): if T.is_empty(): return False else if T.min == x or T.max == x:
Now consider the function FindNext (x), here is its pseudocode with comments:
find_next(T, x): if T.is_empty() or x > T.max: return None
The asymptotic behavior of her work is obviously O (log (log (U)).
I think the writing of such functions as Remove, FindPrevious and others should not cause much difficulty, since they are all written in the same way, therefore I will omit their pseudocode.
Application
In addition to the obvious use of the tree van Emde Boas, you can come up with many unexpected, for example:
Sorting a sequence of N numbers for O (n * log (log (U)))
Indeed, we insert all our numbers into the tree. Then we take the minimum number in the tree and sequentially perform the operation x = FindNext (x + 1), as a result of which all our numbers will be in the sorted order in the variable x. Note that you can write different implementations of this sort, including those that also remove duplicate elements. The main advantage of this sorting is that, asymptotically, this algorithm overtakes even digital sorting.
Finding the longest increasing subsequence for O (n * log (log (U)))
Some have probably heard of such a task as finding the CWP and its solution in O (n * log (n)). For those who have not heard, can read about it here . The main idea of optimization is that a binary search in an array can be replaced by the FindPrevious operation in the van Emde Boas tree.
Dijkstra's algorithm for O (E * log (log (U)))
For those who are not familiar with the Dijkstra algorithm for finding the shortest path in a weighted graph, I suggest that you read this and this article. The implementation of the Dijkstra algorithm using a heap is known to work for O (E * log (V)), where V is the number of vertices and E is the number of edges. But if instead of a heap, we apply the already known data structure to all of us, then we obtain the asymptotics O (E * log (log (U))), which is good news.
- And many, many tasks, the number of which is limited only by your imagination :)
The minus of all these algorithms is that for too large U the van Emde Boas tree will occupy a larger amount of memory (a rough estimate is O (U)), which, however, can be partially avoided by creating the necessary subtrees "lazily", that is only when we need them.
Instead of conclusion
Here is my implementation of the van Emde Boas tree in C ++. She does not claim perfection, but should do her job. Additions and comments are welcome.
Thank you all for your attention!