
The Fastest Indian: a Key/Value Trie-Based Container


“It may seem that I do nothing. But actually, at the cellular level, I'm very busy.”
Author unknown

In the 21st century, building programs increasingly resembles playing with Lego. The approach implies that many of the “bricks” were invented before us. Their apparent simplicity deceptively suggests that the resource for improvement has been almost exhausted over the years, and all that remains is to use what we already have. But, strangely enough, by analogy with biology, the elementary “cells” sometimes hide the most complex and carefully thought-out algorithms, and it is precisely there that the most interesting battles unfold. In this sense, programmers, in the breadth of their profession, somewhat resemble doctors. There are general practitioners, veterinarians, and surgeons, and then there are those who can spend several months of work on a few lines of code.

“At Google right now, let's say, across our server fleet, 1% of all CPUs are doing computation inside hash tables. And, let's say, more than 8% of all server RAM is occupied by hash tables. And that is just the C++ side; I don't know the situation with Java.”
Matt Kulukundis, CppCon 2017

At some point an interesting idea occurred to me: how to place two sparse pages in one section of memory, encoding the key values themselves inside the container through jumps. The idea seemed fresh and interesting, and checking it took only a few dozen lines of code, so that very evening I satisfied my curiosity about how much memory pages could be compressed this way. I must say that this was only the beginning; over time everything turned into the testing of a huge number of hypotheses, measurements, and versions. In the current version it is rather difficult to discern the outlines of that initial “ideal” model. In a task like this, as in typical engineering problems, it is very important to find a balance. It is difficult to find that simple, universal mechanism which will work equally well in mud, in water, and in heat, just like the bolt of an automatic rifle.

“Making something simple is sometimes much harder than making something difficult.”
Mikhail Kalashnikov

The idea of building a Trie that runs faster than hash tables is not new. Back in 2001, Douglas Baskins described the working principle of the Judy Array. The idea seemed so interesting that Hewlett-Packard assigned a whole group of engineers to the project. A well-optimized Trie resembles a small operating system: it has its own memory manager, a whole list of algorithms for compressing and balancing nodes, its own small ecosystem for managing the microworld inside the container. Given the complexity of implementation, very few such projects are publicly available. In open repositories, the vast majority of Trie implementations are only a few hundred lines of code (Judy Array from HP is about 20K LOC, HArray about 8K LOC). The implementations that do exist are mostly academic in nature, or were created for specific text-processing problems. You will hardly be asked about Trie at an interview, and such Key/Value containers are not included in the standard libraries of popular programming languages. And, I must say, completely undeservedly. Already on first examination it becomes clear that a well-optimized Trie can work faster than hash tables while offering even richer functionality than binary trees.
Information retrieval is one of the fundamental tasks of computing. Practically as soon as computers were taught to count and store information, the need for efficient search arose. In all that time, only three basic concepts for organizing fast search have been proposed. Binary trees: a way of organizing a container in which the keys must be kept strictly sorted. Hash tables: the address of a value is obtained by processing the key with a hash function. And Trie: the key itself encodes the path to the value. The literature covers how these algorithms work in detail, including graphic descriptions of how the approaches differ from each other.
But little is said about what makes Trie especially interesting precisely for building a universal Key/Value container.

Hashtable. This structure was designed to find a key very quickly, sacrificing, in effect, everything else. Everything rests on the hash function, on whose quality almost everything depends. You can choose an excellent hash function that gives an almost uniform distribution on test data and still not control the situation 100%: in real work, on new data, your hash function may degenerate into something close to the worst case, and the super-fast search turns into something like a full scan, O(N) instead of O(1). Another problem: the more data, the more collisions. On large data volumes collisions grow like a snowball, and at some point the entire hash table has to be rebuilt. Such an event is called a stop-the-world event and means that you cannot guarantee near-constant latency to the calling code. A large amount of data also entails non-linear memory consumption: many cells inside the hash table will be empty, and the keys themselves are stored in the buckets uncompressed. There is no efficient way to search by range or by key pattern in such a container. Even if two keys differ only in the last bit, the probability that they will land close together, in the same bucket, tends to zero. For the processor, when working with natural data sets where keys resemble one another (URLs, file paths, words, and so on), this often means a cache miss, which also does not add speed. And even if you have an ideal hash function, only a few keys in the container, and no collisions at all, inserting or searching still scans the key itself at least twice: the first time while passing it through the hash function, and the second time, after arriving at the address, to re-check that the found key really is the one required. Trie is free of almost all of these shortcomings, but more on that below.
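To make the “scanned at least twice” point concrete, here is a minimal open-addressing hash map sketch (a generic illustration, not HArray's code and not a production design; growth and rehashing, the stop-the-world case, are deliberately omitted). The hash function reads every byte of the key, and the final equality check reads them all again:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Minimal open-addressing hash map with linear probing (sketch only:
// fixed capacity, no rehashing, no deletion).
struct NaiveHashMap {
    struct Slot { std::string key; uint64_t value = 0; bool used = false; };
    std::vector<Slot> slots;

    NaiveHashMap() : slots(1024) {}

    // Pass 1 over the key: FNV-1a reads every byte to compute the address.
    static uint64_t fnv1a(const std::string& key) {
        uint64_t h = 14695981039346656037ull;
        for (unsigned char c : key) { h ^= c; h *= 1099511628211ull; }
        return h;
    }

    void insert(const std::string& key, uint64_t value) {
        size_t i = fnv1a(key) % slots.size();
        while (slots[i].used && slots[i].key != key)  // probe on collision
            i = (i + 1) % slots.size();
        slots[i] = Slot{key, value, true};
    }

    bool find(const std::string& key, uint64_t& value) const {
        size_t i = fnv1a(key) % slots.size();
        while (slots[i].used) {
            if (slots[i].key == key) {   // pass 2: the key is compared byte by byte
                value = slots[i].value;
                return true;
            }
            i = (i + 1) % slots.size();  // collisions stretch the probe chain
        }
        return false;
    }
};

int main() {
    NaiveHashMap map;
    map.insert("http://example.com/a", 1);
    uint64_t v = 0;
    if (map.find("http://example.com/a", v))
        std::printf("found: %llu\n", (unsigned long long)v);
}
```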

Binary Tree. A certain gold standard, especially in databases, is the various flavors of binary search. Two modifications are most common: the Red-Black Tree, a binary search tree with efficient balancing, and the B+ modification when additional work with the disk is needed. Binary search is free of many of the hash table's shortcomings: there is no hash function to choose, memory consumption is almost linear, and inserts and lookups run in predictable time, usually O(log n). It can search by key range and lets you define the sort order of the data. But the search speed itself is rather low: in a container of about 1 million keys, a key is found in roughly 20 seeks. And where, for the ideal collision-free hash table, we spoke of scanning the key twice, here the key may be scanned dozens of times, at every step where keys must be compared for less, equal, or greater. In general, the binary tree works really well, but, unfortunately, not on a flat memory model. Every new key has to be inserted somewhere in the middle to preserve the sort order. Hence the rather complex balancing algorithms, and hence the gaps left in extents to avoid moving old data on every insertion.
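A small illustration of the numbers above, assuming a balanced tree (std::map is typically implemented as a red-black tree): a lookup among N keys visits about log2(N) nodes, roughly 20 for a million keys, and each visited node compares against the full search key:

```cpp
#include <cmath>
#include <cstdio>
#include <map>
#include <string>

int main() {
    // log2(1,000,000) is about 20: the "20 seeks" from the text.
    std::printf("node visits for 1M keys: ~%.0f\n", std::log2(1'000'000.0));

    // std::map is usually a red-black tree: find() walks ~log2(size) nodes
    // and calls the comparator (a full key comparison) at every node.
    std::map<std::string, int> tree;
    for (int i = 0; i < 1000; ++i)
        tree["key" + std::to_string(i)] = i;

    auto it = tree.find("key500");  // ~10 string comparisons for 1000 keys
    if (it != tree.end()) std::printf("found: %d\n", it->second);
}
```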

Trie. Here we return to our “dark horse”, and the first thing to say is that Trie always scans the key exactly once. In essence, this means that it is Trie that, at least theoretically, can work faster than hash tables, let alone binary trees. From this comes another interesting property: the longer the keys, the greater the speed difference between hash tables and Trie, in favor of the latter. In addition, this structure is very friendly to processor caches: first, similar keys lie side by side in memory; second, such keys most often use shared memory segments, so on the next insert or search part of the key is likely already loaded into the L1/L2 caches. If the keys are of a similar nature, such as URLs, the structure uses memory more economically thanks to prefix compression. Such keys are also scanned more efficiently by range: a binary tree reads each key from start to finish, while Trie scans only the “tails” of the keys. Obviously, Trie also behaves better on a flat memory model: it does not require constant balancing, unlike binary trees, and does not require completely rebuilding the tree on large data volumes, unlike hash tables. There are no stop-the-world events.
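For illustration, here is a minimal uncompressed trie sketch (not HArray's code; a 256-way pointer array per node is exactly the kind of memory waste that node compression such as Patricia fights). Both insert and find walk the key bytes exactly once, and keys with a common prefix share the same nodes:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <memory>
#include <string>

// One node per key byte; 256 child slots (very wasteful, kept for clarity).
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> next{};
    uint64_t value = 0;
    bool hasValue = false;
};

struct Trie {
    TrieNode root;

    void insert(const std::string& key, uint64_t value) {
        TrieNode* node = &root;
        for (unsigned char c : key) {  // single pass over the key
            if (!node->next[c]) node->next[c] = std::make_unique<TrieNode>();
            node = node->next[c].get();
        }
        node->value = value;
        node->hasValue = true;
    }

    bool find(const std::string& key, uint64_t& value) const {
        const TrieNode* node = &root;
        for (unsigned char c : key) {  // single pass, the key is never rescanned
            if (!node->next[c]) return false;
            node = node->next[c].get();
        }
        if (!node->hasValue) return false;
        value = node->value;
        return true;
    }
};

int main() {
    Trie trie;
    trie.insert("http://a", 1);  // the "http://" chain is created once...
    trie.insert("http://b", 2);  // ...and reused: only one new node here
    uint64_t v = 0;
    if (trie.find("http://b", v))
        std::printf("found: %llu\n", (unsigned long long)v);
}
```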

What are the disadvantages? First, despite various node-compression methods such as Patricia, this structure is very fond of long jumps. If the data lives on an HDD, where a seek is a very expensive operation, the hash table will work much faster: even though it scans the key twice or more, it needs only one seek, to position on the desired bucket, while Trie needs several, although on average fewer than a binary (classic) tree. Scanning keys of a random nature over a range will also be much more efficient in a binary tree, because, again, scanning a subtree in a Trie involves many jumps. Another disadvantage is the complexity of implementing such a structure, as well as the inability to define a custom key sort order, although the latter is not true of all implementations: in HArray, for example, you can even define a custom key sort order.

JSON. In the previous examples we compared containers by parameters such as speed, memory, and the ability to scan a range of keys. For most applications, replacing one container implementation with another amounts to “bit-level optimization”. In high-load projects, of course, the gains are significantly larger. And by high load I mean not only the servers of large corporations handling masses of client requests. Data archiving, graphic rendering, content indexing: these are all tasks where an ordinary key/value container may be working inside an “engine” under a load of millions of requests per second. In such scenarios speed is never superfluous, and an optimization of, say, a factor of two means that 16 GB of content is indexed not in 4 hours but in 2. Even so, everywhere here we still have a choice: we can take an existing key/value container implementation and not think much about the details of its operation.

However, there is a whole class of tasks where using anything other than Trie is simply inappropriate. These are a number of text-processing tasks, such as those built on the Suffix Tree principle. As another example, the Radix Tree found a good application inside the Linux kernel and could hardly be replaced by anything else. All these examples are well described in the literature, so I will not dwell on them in more detail. Instead, here is another interesting example. In application architecture, uniformity matters a great deal: it often happens that a correctly chosen abstraction fits, like a template, everything else as well. JSON is a natural, intuitive format, and it can be stored inside a Trie in an equally natural, intuitive way.

How? Simply enough: you just need to “cut” the JSON into keys, where Key is the path to an attribute together with its value, and Value is the document number. Then inserting, updating, or deleting an attribute in the middle of a JSON document does not mean rewriting the document completely, but merely inserting, updating, or deleting keys inside the container. Searching for any attribute, or over a range of attribute values, means simply finding keys or scanning a subtree inside the Trie, without deserializing the whole document. All of these operations are very fast: extracting a key from a Key/Value container usually costs less than a hundred nanoseconds. In addition, Trie naturally compresses JSON like an inverted index. If such documents were stored as separate files, their attributes would be duplicated; once added to a Trie, each attribute is represented only once, regardless of how many documents the container holds. This approach somewhat resembles the one used in column data stores, but applied, this time, to document-oriented databases. In general, this topic deserves a separate article.
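A sketch of the “cutting” step, assuming the single-header nlohmann/json library and a hypothetical path=value key layout (the real key format is an implementation choice): each leaf attribute becomes a key built from its path and value, with the document number as the container value. Arrays are left out for brevity:

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <nlohmann/json.hpp>  // assumed dependency: https://github.com/nlohmann/json

using json = nlohmann::json;

// Recursively walk the document and emit one (key, docId) pair per leaf.
// For document #7, {"user": {"name": "alice", "age": 33}} yields:
//   user.name="alice" -> 7
//   user.age=33       -> 7
void flatten(const json& node, const std::string& path, uint64_t docId,
             const std::function<void(const std::string&, uint64_t)>& emit) {
    if (node.is_object()) {
        for (auto& [key, value] : node.items())
            flatten(value, path.empty() ? key : path + "." + key, docId, emit);
    } else {
        emit(path + "=" + node.dump(), docId);  // leaf: path + value form the key
    }
}

int main() {
    json doc = json::parse(R"({"user": {"name": "alice", "age": 33}})");
    flatten(doc, "", 7, [](const std::string& key, uint64_t docId) {
        // In the real scheme these pairs would be inserted into the trie;
        // shared prefixes like "user." are then stored only once.
        std::printf("%s -> %llu\n", key.c_str(), (unsigned long long)docId);
    });
}
```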

“Theory without practice is dead and fruitless, and practice without theory is useless and destructive.”
P.L. Chebyshev

Recently the HArray project, which implements the Trie algorithm with many optimizations, was published in open access. The implementation turned out to be rather complicated but, in my opinion, quite effective, as the benchmarks show. This version is not final; there are still many ideas for improvement. Besides, the old version worked one and a half times faster, but after new functionality was added it slowed down noticeably, and that, too, is worth investigating.

In addition to the basic functionality, fair deletion is implemented. “Fair” here means that keys are not merely zeroed out but are honestly dismantled step by step, gradually releasing memory. In scenarios of massive addition and removal of keys, the structure operates in “waste-free production” mode: parts of old “dead” keys are used to build new ones. If deletions start to outnumber additions of new keys, then at some point the structure can give “extra” memory pages back to the operating system, defragmenting its own used pages along the way. Also implemented are various tree traversals, search by key pattern, saving to disk, the ability to override the sort order of keys in the container, and other functionality.
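Below is a sketch of the “waste-free production” idea only, not HArray's actual internals (the node layout and page handling here are invented for illustration): deleted nodes go onto a free list inside a node pool, and new insertions consume that list before any fresh memory is requested:

```cpp
#include <cstdint>
#include <vector>

// Pool of fixed-size trie nodes with an intrusive free list. Slot 0 is a
// sentinel so that index 0 can mean "no node / empty free list".
struct NodePool {
    struct Node { uint32_t child[4]; uint32_t nextFree; };

    std::vector<Node> pool;
    uint32_t freeHead = 0;

    NodePool() : pool(1) {}

    uint32_t alloc() {
        if (freeHead) {                    // reuse a "dead" node first
            uint32_t idx = freeHead;
            freeHead = pool[idx].nextFree;
            pool[idx] = Node{};
            return idx;
        }
        pool.push_back(Node{});            // only then take fresh memory
        return static_cast<uint32_t>(pool.size() - 1);
    }

    // "Fair" deletion of one node: it is dismantled and queued for reuse.
    // A fuller implementation could hand whole empty pages back to the OS
    // once deletions outnumber insertions, as described above.
    void release(uint32_t idx) {
        pool[idx].nextFree = freeHead;
        freeHead = idx;
    }
};

int main() {
    NodePool nodes;
    uint32_t a = nodes.alloc();  // fresh node
    nodes.release(a);            // goes onto the free list
    uint32_t b = nodes.alloc();  // reuses the same slot: b == a
    return a == b ? 0 : 1;
}
```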

Source: https://habr.com/ru/post/348938/

