
Hi, Habr!
Most developers know what Nested Sets are, their strengths and weaknesses. Today I want to present to the public the implementation of a modification of this technique, which partially solves the shortcomings of the original algorithm, although it has its negative sides.
What are the pros and cons of the original Nested Sets:
+ quick sample of parents and children with a simple request;
+ fast selection of neighboring nodes by a simple request;
+ fast selection of empty nodes with a simple request;
+ instant receiving the number of all children in the node by the formula;
+ possibility of non-recursive tree building in one cycle;
- very slow insertion and update.
What are the reasons for the slow insert / update? Since, the values ​​of the boundaries of nodes go continuously, then any insertion, especially at the beginning of the tree, requires recalculation of the boundaries of all subsequent nodes. If the table is large enough, this process can take tens of seconds or even longer. On animation:
Nested Intervals
The first thought that came to mind when considering a picture with a visual representation of the Nested Sets: “you can not move the border if you imagine the root node as a segment in the field of real numbers”. That is, if you imagine that the borders of the root node have values ​​of 0.0 and 1.0, then you can add nodes to any place without moving others, you just need to take a real number between the target nodes. But, since we know very well that real numbers in the computer have their limits, I quickly refused this idea, in favor of whole numbers.
')
With integers, the essence is the same, only the borders of the root node are made from zero to INT_MAX. Then, when inserting a new node, we simply take any two unused values ​​between the target nodes, and do not touch anything else, everything would be fast and cool if not for one “but” - what to do if there are no more free values? Obviously, you need to move all the values ​​of the boundaries to the nearest free value, "clearing" your space for insertion. Only now the search for such a “hole” is a costly task, so the result in this case can go even slower than in the classic nested sets.

A little googling, I realized that I had not invented a bicycle, and this idea has long been called Nested Itervals. But I haven’t found any implementations, since on most resources this method was considered in a negative way, saying “we treat one, cripple the other”. And this is not unreasonable, but it became very interesting to me, I wanted to test how this algorithm is better, and what is worse, and how much.
Implementation
I started writing Behavior for Yii2, implementing the idea of ​​Nested Intervals. The first difficulty arose when writing insert methods - how to get the boundary of the neighboring node or parent with
insertBefore()
and
insertAfter()
and the boundary of the first / last node with
prependTo()
and
appendTo()
? If in Nested Sets it is obvious - + 1 / -1 to the border of the current node, here without an additional query to the database in any way. But the good thing is that our samples are fast.
It took a lot of time to implement the search for an unallocated space, in case there is not enough space to insert a new node. But as a result, I was able to move the borders not half of the base, but only a small piece to the almost nearest “hole” from the current segment.
The next big question arose when moving existing branches to a new location. If this operation takes place within the framework of a single tree, then everything else can be done quickly, simply by swapping two blocks.
But moving from another tree turned out to be a difficult task, since there can be a lot of nodes being moved, and it would take too long to insert them one by one with the usual algorithm. At first, I generally wanted to abandon this possibility, writing off the implementation complexity. But a little later, I realized the need for the
optimize()
method, which would evenly distribute the boundaries of existing nodes across the range. After that, the idea arose to move the nodes of another tree by optimizing the source tree with the preparation of a “window” for the moved nodes. All this works very slowly (at the moment there is an optimization of this method only for MySQL), but it is worth noting that the operation to move from other trees is rarely required, where more often you need to move within one tree.
But it was quite easy to implement the samples - the main methods for obtaining parents and children are identical to the classic nested sets. Slightly more complicated with samples of the previous / next neighbor - two requests are required: obtaining a parent, and direct search in a range with a limit. But it became really bad with getting empty nodes
getLeaves()
,
getLeaves()
I had to do it with the help of the left join, which will certainly affect the performance.
Insert optimization
Having completed the first series of performance tests, the results were not impressive to say the least. The thing turned out to be that at first I implemented the selection of the boundaries of the inserted node simply by breaking the segment into three parts. Because of this, the gap for new nodes is rapidly reduced. Since the range of 32-bit PHP is [0 - 2147483647] (operations with large numbers are converted to float), then after inserting a total of 19 nodes into the root via
appendTo()
there is no room left for the 20th node (2147483647/3 ^ 19 = ~ 1.9), and you have to perform slow operations to find unoccupied places and move. So the thing, of course, will not work. We need some optimization for the distribution of space. For this, Behavior has as many as 5 options:
- First and main
amountOptimize
- this parameter determines for how many nodes to schedule the partitioning of the segment available for insertion. That is, with a value of 20, we will be able to insert 20 child nodes, and they will uniformly occupy the range of the parent, and then - as it will.
As a value, you can use an array, then the value will be different for each level.
noPrepend
, noAppend
, noInsert
- these parameters help to optimize the reservation of space between borders and the position of the first node. If you only appendTo()
operation (for example, comments), then you can set noPrepend = true
and noInsert = true
, and thus significantly reduce the likelihood of a severe case of clearing the space for the new node. More clearly the effect of these parameters is shown in the pictures.
amountOptimize = 3 | |
---|
amountOptimize = 3 noPrepend = true | |
---|
amountOptimize = 3 noAppend = true | |
---|
amountOptimize = 3 noInsert = true | |
---|
amountOptimize = 3 noInsert = true noPrepend = true | |
---|
amountOptimize = 3 noInsert = true noAppend = true | |
---|
It is worth noting that if you noPrepend = true
, this does not mean that you cannot perform the prependTo()
insertion, it is very likely that many such operations will be performed slowly.
- And the last parameter,
reserveFactor
, determines the size of the gaps between the borders. With one, the size of the gap is equal to the size of the inserted node. If the Behavior usage scenario implies a large number of insertBefore()
and insertAfter()
operations, then this parameter is better to increase.
After all these manipulations, the tests have become much more "tasty."
64 bits
If you use the BIGINT columns to store the
left
and
right
attributes and use 64-bit PHP, you can use the free optimization in the form of a wider range. Just set the range parameter = [0, 9223372036854775807]. This will make more rare cases of lack of space for new nodes.
Performance tests
As a standard for comparison, the most popular Behavior was implemented, which implements Nested Sets from the respected Alexander Kochetov (Creocoder)
creocoder / yii2-nested-sets . It was also interesting to compare the results with the Adjacency List while maintaining the possibility of sorting. I did not find a suitable library for this, so I took it and
wrote it myself (and even with the support of JOINs, but this is not about that now).
The first two tests are rather synthetic, since in reality, hardly anyone will fill the tree, but this procedure was necessary for me primarily to generate a more or less large base for further tests. They simply fill the levels in succession with a fixed number of children.
Test results 1 and 2 DB query time Execution Memory
Test 1. Filling 3 levels of 12 children.
Nested Sets 7696 6,961 ms 26,305 ms 135.8 MB
Itervals default (amount = 10) 6377 2,850 ms 11,920 ms 87.3 MB
Itervals x64 default (amount = 10) 5813 1,992 ms 10,963 ms 78.7 MB
Itervals amount = 24 5813 1,765 ms 10,442 ms 78.7 MB
Itervals amount = 12 noPrepend noInsert 5813 1,750 ms 10,223 ms 78.7 MB
Adjacency List 5811 1,567 ms 9,591 ms 71.3 MB
Test 2. Filling level 6 for 3 children.
Nested Sets 4735 5,701 ms 19,784 ms 82.2 MB
Itervals default (amount = 10) 3644 1,275 ms 5,976 ms 48.9 MB
Itervals amount = 3 noPrepend noInsert 3644 1,271 ms 5,993 ms 48.9 MB
Adjacency List 3642 982 ms 5,812 ms 44.5 MB
Already in the first test, you can see how bad the inappropriate value of the amountOptimize parameter is. In the 32-bit test appeared extra requests? related to the fact that for some nodes the script “cleared the place”. Nevertheless, even in this case, everything was completed much faster. By the way, 64-bits in this test were rescued, and there was not a single “bad” situation.
Tests 3–5 simulate the insertion of 20 nodes at a different place in a large table:
Test Results 3-5 DB query time Execution Memory
Test 3. Insertion at the beginning <4% (20 to 19657 knots)
Nested Sets 100 15,597 ms 16,636 ms 5.0 MB
Itervals 82 21 ms 150 ms 4.7 MB
Adjacency List 100 170 ms 439 ms 4.6 MB
Test 4. Insertion in the middle> 46% <50% (20 to 19657 knots)
Nested Sets 100 8,200 ms 8,985 ms 5.0 MB
Itervals 82 269 ms 593 ms 4.7 MB
Adjacency List 100 163 ms 454 ms 4.7 MB
Test 5. Insertion at the end of> 96% (20 to 19657 knots)
Nested Sets 100 549 ms 911 ms 5.0 MB
Intervals 83 46 ms 187 ms 4.7 MB
Adjacency List 106 159 ms 435 ms 4.7 MB
The test for inserting into the beginning of the base very clearly demonstrates the weak point of the classic Nested Sets - to insert into the beginning of the base, the boundaries of the entire base are required. Hence the devastating result. Only the results of insertion at the end of the database more or less compete with Nested Itervals. And by the way, note that the Adjacency List was slower, this is due to the fact that to ensure sorting, it has to perform
SELECT MAX(sort)
.
The following test removes 20 nodes from the beginning of the tree:
Test results 6 DB query time Execution Memory
Test 6. Removal of <4% from the beginning (20 of 19657 nodes)
Nested Sets 100 16.554 ms 17.678 ms 4.8 MB
Intervals 60 164 ms 250 ms 4.2 MB
Adjacency List parentJoin = 0 childrenJoin = 0 60 169 ms 257 ms 3.8 MB
Adjacency List parentJoin = 3 childrenJoin = 3 60 87 ms 162 ms 3.8 MB
Here, the situation is similar to test 3. There is only a difference in that when in the Itervals, “bad situations” can no longer occur.
Test 7 is very illustrative, as it imitates well the use of Behavior for comments. In a cycle of 1000 nodes is added to a randomly selected node, with a level limitation. Test 8 is similar, but with even more stringent conditions - not only adding, but also any other operation is allowed.
Test results 7 and 8 DB query time Execution Memory
Test 7. appendTo () in a random node (5 levels, 1000 nodes)
Nested Sets 5002 5,989 ms 17,406 ms 80.7 MB
Itervals default (amount = 10) 8497 23.301 ms 41.060 ms 120.7 MB
Itervals x64 default (amount = 10) 7092 11,330 ms 23,618 ms 97.5 MB
Itervals amount = 200.25 noPrepend noInsert 4009 1,431 ms 6,490 ms 50.2 MB
Itervals x64 amount = 250.30 noPrepend noIns 4003 1,421 ms 6,615 ms 50.0 MB
Adjacency List 4001 1,062 ms 5,976 ms 46.1 MB
Test 8. arbitrary operation in a random node (5 levels, 1000 nodes)
Nested Sets 5002 9,383 ms 23,502 ms 80.7 MB
Itervals default (amount = 10) 7733 8,123 ms 24,031 ms 107.2 MB
Itervals x64 default (amount = 10) 5663 3,761 ms 14,084 ms 75.6 MB
Itervals amount = 200.25 reserve = 2 4175 1,548 ms 7,223 ms 52.8 MB
Itervals x64 amount = 250,30 reserve = 2,403 1,541 ms 6,753 ms 50.0 MB
Adjacency List 4395 4,394 ms 12,377 ms 53.4 MB
Here it is striking how important it is when setting up Intervals to correctly configure the insert optimization parameters, since in this example, the default settings gave a very sad result. But with optimizations everything works very quickly, comparable to the Adjacency List. By the way, in test 8, he slowed down a lot, because in order to ensure sorting, he also needed to “clear the place”.
The following two tests for moving nodes in the tree:
Test results 9 and 10 DB query time Execution Memory
Test 9. Moving nodes at the beginning of <4% (20 of 19657 nodes)
Nested Sets 200 24.312 ms 25.479 ms 6.3 MB
Itervals 160 180 ms 573 ms 6.0 MB
Adjacency List 111 107 ms 318 ms 4.6 MB
Test 10. Moving nodes from end to start <4%> 96% (20 out of 19657 nodes)
Nested Sets 200 16,999 ms 17,973 ms 6.3 MB
Itervals 160 16,972 ms 17,854 ms 6.0 MB
Adjacency List 108 86 ms 325 ms 4.6 MB
In principle, in Nested Sets, in Nested Intervals, this operation must be performed in the same time, but in the code for Creocoder this is not optimal, because instead of just swapping a couple of blocks, in its code, the whole moves first base until the end, then move the desired block, and then again the entire base moves back. But in Creocoder you can use unsigned fields for the depth attribute, and in my Behavior it temporarily becomes negative when moving. Results from end to beginning are comparable, but the Adjacency List has a significant advantage.
After writing Behavior, I wanted to find out what benefits Trait will have in place of behaviors. Therefore, I ported it to Trait with static attributes. Further sampling tests were also carried out in the variants with Trait. But the results of the use of impurities, to put it mildly, were not impressed, especially considering how much more ugly the code has become because of them.
A simple selection of all the elements of
Model::find()->all()
:
Test results 11 DB query time Execution Memory
Test 11. Selection of all nodes (19657 pcs.)
Nested Sets 1 40 ms 1.108 ms 215.2 MB
Itervals 1 42 ms 1,247 ms 225.3 MB
Itervals Trait 1 41 ms 1,174 ms 207.4 MB
Adjacency List 1 33 ms 890 ms 179.1 MB
This test was written primarily to compare the memory consumption Behavior vs. Trait. And as you can see there is a difference, but insignificant.
Selection of descendants:
Test results 12 DB query time Execution Memory
Test 12. The choice of children and descendants (for 819 nodes in the middle of the tree from 19657 nodes)
Nested Sets 1641 6.397 ms 7.498 ms 24.9 MB
Itervals Behavior 1641 579 ms 1,657 ms 25.0 MB
Itervals Trait 1641 615 ms 1,590 ms 24.3 MB
Adjacency List parentJoin = 0 childrenJoin = 0 2562 720 ms 1,969 ms 36.9 MB
Adjacency List parentJoin = 3 childrenJoin = 3 2461 704 ms 1,966 ms 35.3 MB
Requests for a selection of children should be the same in the Nested Sets and Nested Intervals, but here another disadvantage of the Creocoder library has revealed - descendants are not optimally selected. Descendants could be selected instantly using an index by one of the attributes left or right
`lft` > :leftValue && `lft` < :rightValue
, instead, the index is used only half
`lft` > :leftValue && `rgt` < :rightValue
. If we analyze EXPLAIN, it becomes clear that the first option is much more preferable. The results of the Adjacency List are inferior, which is not surprising.
Selection of ancestors:
Test results 13 DB query time Execution Memory
Test 13. Picks selection (for 819 nodes in the middle of the tree from 19657 nodes)
Nested Sets 821 3,344 ms 4,069 ms 20.6 MB
Intervals Behavior 821 3,292 ms 4,147 ms 22.0 MB
Itervals Trait 821 3,310 ms 4,080 ms 21.1 MB
Adjacency List parentJoin = 0 childrenJoin = 0 3180 948 ms 2.304 ms 51.2 MB
Adjacency List parentJoin = 3 childrenJoin = 3 1641 486 ms 1,495 ms 30.8 MB
Here, Nested Sets and Nested Intervals behave in the same way, and rather slowly. The point is the absence of normal indices and unfortunate coincidence - both indices from the middle of the table have many elements. As it is not surprising, the Adjacency List worked faster, although at the expense of memory (although this is still a matter of a simple 3-level tree).
Selection of neighbors and empty nodes:
Test results 14 and 15 DB query time Execution Memory
Test 14. Selection of neighboring nodes (for 819 nodes in the middle of the tree from 19657 nodes)
Nested Sets 1641 520 ms 1,424 ms 24.3 MB
Intervals Behavior 1641 19,681 ms 21,326 ms 27.5 MB
Itervals Trait 1641 19,666 ms 21,251 ms 26.5 MB
Adjacency List parentJoin = 0 childrenJoin = 0 1641 535 ms 1,442 ms 23.7 MB
Adjacency List parentJoin = 3 childrenJoin = 3 1641 508 ms 1,421 ms 23.6 MB
Test 15. Selection of empty nodes (for 819 nodes in the middle of the tree from 19657 nodes)
Nested Sets 821 3,215 ms 3,814 ms 18.8 MB
Intervals Behavior 821 10,450 ms 11,166 ms 18.8 MB
Itervals Trait 821 10.425 ms 11.040 ms 18.7 MB
Adjacency List parentJoin = 0 childrenJoin = 0 1833 568 ms 1,743 ms 32.6 MB
Adjacency List parentJoin = 3 childrenJoin = 3 1732 556 ms 1,891 ms 31.3 MB
Here, as expected, we see the weaknesses of the Nested Intervals. As they say, no comment.
findings
Nested Intervals - has the right to life. It has both advantages and disadvantages:
+ quick insertion (assuming a good selection of optimizing parameters);
+ quick node deletion with descendants;
+ the same speed as the Nested Sets in a sample of ancestors and descendants;
+ Also, the possibility of not recursively building a tree in one cycle remains;
- slow receipt of neighboring nodes;
- slow receipt of empty nodes;
- there is no possibility to instantly count the number of children in a node.
Implementation for Yii2
I want to note that the methods are slightly different from those proposed in the behavior of Creocoder. First, I changed the naming of methods to
getParents() getDescendants()
getters, this allows access to the associated nodes, similar to Relations, which allows not to make secondary queries to the database:
$node = Node::findOne(['name' => 'test']); $children = $node->children;
In addition, I did not drag Yii-shnye parameters to save ($ runValidation, $ attributeNames), instead I implemented the methods as an indication of the action (just like
->asArray()->
):
$node = new Node(); $node->makeRoot()->save(false); $node->insertAfter($node2)->save();
Implement Nested Intervals on GitHub .
Implementing the Adjacency List on GitHub .