Storage of hierarchical structures. Symbiosis "Closure Table" and "Adjacency List"

When we face the task of storing and managing hierarchical data structures, we always have to choose from a rather limited set of patterns. In order to find the most suitable template, it is necessary to analyze the features of each method of storing and processing data and evaluate them taking into account the task and the specifics of the used DBMS.

Suppose there is a task, to provide an opportunity for site users to leave comments on publications. Comments should have a tree structure, users should be able to leave one or more comments to the post, as well as respond to any comments from other users. That is, we need a comment system similar to the one we can see on Habrahabr. For some reason, ready-made solutions are not suitable for us, for example, due to the fact that an additional very complex business logic is supposed to be integrated into the comment system.

Our goal is to develop our own implementation, taking into account the requirements of our application.

What is important to us?

Minimize the number of database queries. In particular, a single request should be enough to retrieve a comment thread.
Get high performance, so queries to the database, especially for reading, should be simple and run quickly, even with a large amount of data.
Store comment text and hierarchical tree structure in different tables.
Have the ability to sort comments retrieved from the database in order to display them in their natural form as a tree-like hierarchical structure or, even better, immediately retrieve a sorted tree from the database.
Control the depth of comments.
Ensure data integrity.
Take into account the specifics of MySQL. As is known, this DBMS does not support recursive queries. Recursion in this DBMS is possible only in stored procedures with nesting restrictions - no more than 255 levels.
The requirements are well-founded and relevant for many projects.

So, consider the known methods of storing and working with trees. There are not so many of them:
')

Details of the implementation of these patterns are perfectly addressed in the presentation of Bill Karwin (Bill Karwin) .

The peculiarity of the “Adjacency List” method is that without the support of recursive DBMS queries, it is impossible to retrieve an arbitrary part of the hierarchy with one simple query. Therefore, in the context of using MySQL, this option does not suit us at all.

The “Path Enumeration” method (or as it is also called the “Materialized Path”) obviously also does not suit us, due to the poor performance of SQL SELECT queries, since the use of the LIKE operator and search by patterns are expected: '1/2/3 /four%'. Storing a certain set as a delimited string is hardly appropriate in the world of relational databases.

Perhaps the most interesting pattern for working with tree structures is the Nested Sets. It may well be used for our task, but its implementation is quite complex and it does not ensure data integrity. Failure to insert a new item in the hierarchy or when moving a subtree from one place to another can create big problems. The need to recalculate and change the values of the part of the left and right indices of the elements of a subtree when adding a new element, no doubt, is a significant drawback of Nested Sets.

The last method “Closure Table”, based on our requirements, could be the best choice, if not one “but” - the lack of a simple way to construct a sorted tree from a flat list of links received by a query.

Let's look at this template for working with tree data structures in more detail.

The connection diagram of the elements of the “Closure Table” tree is as follows:

Suppose we have a comment hierarchy that corresponds to the above link diagram:

Table comments:

Table commentsTree:

You can get a list of all the elements of the required part of the tree by the following query (we will extract the tree starting from `commentsTree`.`idAncestor` = 1):

SELECT `tableData`.`idEntry`, `tableData`.`content`, `tableTree`.`idAncestor`, `tableTree`.`idDescendant`, `tableTree`.`level`, `tableTree`.`idSubject` FROM `comments` AS `tableData` JOIN `commentsTree` AS `tableTree` ON `tableData`.`idEntry` = `tableTree`.`idDescendant` WHERE `tableTree`.`idAncestor` = 1

As a result, we get:

It's simple! But the extracted strings need to be converted to a sorted tree hierarchy. That we need it. Alas, there is not enough data to convert this set into the form we need.

How do we solve this problem?

Option 1. Make the DBMS sort the tree

There is a rather original way, with the help of which you can get a immediately sorted tree hierarchy from the database, with just one query. Well-known developer Bill Carvin (Bill Karwin) proposed a very elegant solution to the problem of extracting a sorted tree:

 SELECT `tableData`.`idEntry`, `tableData`.`content`, `tableTree`.`level`, `tableTree`.`idAncestor`, `tableTree`.`idDescendant`, GROUP_CONCAT(`tableStructure`.`idAncestor`) AS `structure` FROM `comments` AS `tableData` JOIN `commentsTree` AS `tableTree` ON `tableData`.`idEntry` = `tableTree`.`idDescendant` JOIN `commentsTree` AS `tableStructure` ON `tableStructure`.`idDescendant` = `tableTree`.`idDescendant` WHERE `tableTree`.`idAncestor` = 1 GROUP BY `tableData`.`idEntry` ORDER BY `structure`

As a result, we get what we need:

But, as it is not difficult to notice, this method will work correctly only in the case when the length of idAncestor identifiers for all strings that meet the query condition is the same. Pay attention to the following SQL query fragment: “GROUP_CONCAT (tableStructure.idAncestor ORDER BY tableStructure.level DESC) AS structure”. Sorting the rows containing the sets “12,14,28” and “13,119,12” will be incorrect from the point of view of our task. That is, if all identifiers in the requested part of the tree are of the same length, then the sorting is correct, and if this is not the case, then the tree structure will be broken. You can reduce the number of incorrect sorts by specifying the starting point for identifiers from a large integer number, for example, 1000000, specifying AUTO_INCREMENT = 1000000 in the SQL query when creating the table.

In order to completely get rid of this problem, you can specify the ZEROFILL parameter for the idAncestor identifier column, in which case the DBMS will automatically add the UNSIGNED attribute and the identifiers will look something like this: 0000000004, 0000000005, 0000000194.

For some tasks, no doubt, you can use this method. But let's try to avoid using two JOINs when working with two tables, the GROUP_CONCAT () function, and also the ZEROFILL parameter.

Option 2. The symbiosis of the two methods “Closure Table” and “Adjacency List”

Let's look again at the “Closure Table” method. In the flat list of relations of the hierarchical structure, which we receive with just one request, we obviously do not have enough information about the connection of each element with its parent, so that we can build a sorted tree. Therefore, let's add the idNearestAncestor field to the commentsTree table and save the link to the element that is the closest ancestor there.

Thus, we obtained a symbiosis of the two methods “Closure Table” and “Adjacency List”. Now the formation of a properly sorted hierarchical structure is a trivial task. And most importantly, we found a solution that fully meets the requirements.

By the following query we will get the data necessary for building the tree:

 SELECT `tableData`.`idEntry`, `tableData`.`content`, `tableTree`.`idAncestor`, `tableTree`.`idDescendant`, `tableTree`.`idNearestAncestor`, `tableTree`.`level`, `tableTree`.`idSubject` FROM `comments` AS `tableData` JOIN `commentsTree` AS `tableTree` ON `tableData`.`idEntry` = `tableTree`.`idDescendant` WHERE `tableTree`.`idAncestor` = 1

Criticism of "Closure Table"

The “Closure Table” template is often criticized for keeping in the database the links of each element of the tree with all its ancestors, as well as the link of each element to itself. The deeper in the hierarchy is an element, the more records in the table need to be done. Obviously, adding new elements to the end of a deep tree hierarchy will be less effective than inserting elements near the root of a tree. In addition, it is worth noting that for storing trees, the Closure Table method requires more space in the database than any other method.

In order to objectively assess the significance of these comments, they should be considered in the context of the specifics of the real project. For example, a decrease in performance when a new element is added to a two-tailed or thousandth level may be insignificant or uncritical for the operation of the application, or, even more likely, such a situation will never happen at all. As a rule, in real life there is no need for hierarchical structures of large nesting. In addition, in most cases, the principal parameter is the speed of data retrieval, rather than recording. The amount of disk space that the tree structure may require is hardly worth considering, since other users of this resource are much more significant, for example, logging site visitors.

An example of my implementation of a tree of comments based on the “Closure Table” and “Adjacency List” methods on github.

Materials used

Presentation by Bill Karvin (Bill Karwin). http://www.slideshare.net/billkarwin/models-for-hierarchical-data
Discussing how to sort trees on stackoverflow. http://stackoverflow.com/questions/8252323/mysql-closure-table-hierarchical-database-how-to-pull-information-out-in-the-c , http://stackoverflow.com/questions/192220/ what-is-the-most-efficient-elegant-way-to-parse-a-flat-table-into-a-tree
MySQL Documentation

Source: https://habr.com/ru/post/263629/

All Articles

Storage of hierarchical structures. Symbiosis "Closure Table" and "Adjacency List"

What is important to us?

Option 1. Make the DBMS sort the tree

Option 2. The symbiosis of the two methods “Closure Table” and “Adjacency List”

Criticism of "Closure Table"

More articles: