📜 ⬆️ ⬇️

Solving the problem of sorting trees using a binary Materialized Path

Faced the problem of implementing tree comments on one project, in this post I post my decision.

Task Description



In principle, the task is classical, and at first I solved it with the help of the combination of the Adjacency List and the Materialized Path . In other words, columns are added to the table.

id INT(11) parentid INT(11) mpath VARCHAR(255) 

The mpath stored the full path of the branch in the tree, for example / 1/3/4/5/8/9, where the comment ID is listed through the "/" sign.
')
With this approach, it is very easy to add new comments of almost any nesting level. The sample was made by the query:

 SELECT * FROM messages WHERE postid=$postid ORDER BY mpath ASC 

The problem arose with an increase in the number of comments. For example, there is a tree with the following paths:

 1 1.2 1.2.10 1.2.3 1.2.7 4 4.8 4.8.9 5 5.6 

Here the numbers indicate the ID, and the order is what it would have turned out when using ORDER BY mpath ASC .
Comment 1.2.10 goes to 1.2.3 and others, although it was added later (judging by the ID).

The solution of the problem


The solution of the problem was partially described here: http://habrahabr.ru/post/125729/ . The idea is to use not decimal IDs on the way, but to encode to another number system in order to have less restrictions on the length of the way. At the same time, separators in the form of dots or other characters are not needed, since the field will be used only for sorting.

I used base 95, this is just the number of printable characters in ASCII.

  !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ 


If for each value in the path we use a fixed length, we get the following upper ID threshold:
1 - 95 ^ 1 - 1 = 94
2 - 95 ^ 2 - 1 = 9,024
3 - 95 ^ 3 - 1 = 857 374
4 - 95 ^ 4 - 1 = 81 450 624
5 - 95 ^ 5 - 1 = 7 737 809 374

5 characters is enough to not worry about the maximum number of comments.
The maximum level of nesting, at the same time, for VARCHAR 255/5 = 51

Theoretically, you can take not BASE95, but BASE256 (along with unprintable characters), but keep it all binary in the BLOB, although here I am not sure about the sorting performance (need to check). Then by 3 characters we could encode 16,777,215 variants, and 4 - 4,294,967,295.

How it looks in code


I will give my example of the implementation of this whole theory.

 // $mpath -   materialized path   "/" //    : $db->Execute(" UPDATE messages SET mpath=?, bpath=?, depth=? WHERE id=? ", array( $mpath, bpath($mpath), $depth, $messid )); // . . . //      define('BASE95', ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'); // . . . //  bpath function bpath($mpath, $sep = '/', $max_len = 5) { $bpath = ''; $chunks = explode($sep, trim($mpath, $sep)); $zero = substr(BASE95, 0, 1); foreach($chunks as $chunk) $bpath .= str_pad(dec2base($chunk, 95, BASE95), $max_len, $zero, STR_PAD_LEFT); return $bpath; } 

And further sample:

 SELECT * FROM messages WHERE postid=$postid ORDER BY bpath ASC 

The dec2base function is taken from here .

Such a solution, in my opinion, is very simple. If you also have beautiful solutions in your arsenal, share them in the comments.

Source: https://habr.com/ru/post/226741/


All Articles