
Python Prefix Trees

I recently finished datrie, a Python library that implements a prefix tree (trie; see Wikipedia or Habr), and I hasten to share it.

In short: datrie.Trie can serve as a replacement for the standard dict which, under certain conditions (keys are strings), takes up less memory, has comparable speed for retrieving an individual element, and supports additional operations (finding all prefixes of a given string, finding all keys that begin with a given string, and so on) that run about as fast as the ordinary dictionary operations.

It works under Python 2.6-3.3, supports Unicode, and is licensed under the LGPL.


datrie is a Cython wrapper over libdatrie. libdatrie implements a variant of the prefix tree in which the state machine for transitions between nodes is stored in two special arrays (a double-array trie), plus suffix compression: branches that do not branch are kept in a separate array of "tails". This is not exactly the state-of-the-art trie variant (HAT-trie and friends should be faster), but it is quite fast and memory-efficient and comes with a good, ready implementation (and in practice a good implementation can beat any algorithm).
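To give a feel for the double-array idea, here is a hand-built toy example (my own illustration, not libdatrie's actual layout): from state s, character code c leads to candidate state t = BASE[s] + c, and the transition is valid only if CHECK[t] == s.

```python
# Toy double-array trie for the keys "a" and "ab" (hand-built, illustrative only).
# Character codes: 'a' -> 1, 'b' -> 2. State 1 is the root; index 0 is unused.
CODE = {'a': 1, 'b': 2}
BASE = [0, 1, 2, 0, 0]   # BASE[s] + code(c) gives the candidate next state
CHECK = [0, 0, 1, 0, 2]  # CHECK[t] == s confirms the transition s --c--> t

def walk(key, root=1):
    """Follow key through the double array; return the final state or None."""
    s = root
    for ch in key:
        t = BASE[s] + CODE[ch]
        if t >= len(CHECK) or CHECK[t] != s:
            return None
        s = t
    return s

print(walk("ab"))  # reaches state 4
print(walk("b"))   # no such path -> None
```

The appeal of this layout is that a transition is two array reads and a comparison, with no pointer chasing.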

The existing trie-like structures for Python did not suit me. Pure-Python implementations inevitably consume a lot of memory, so they were ruled out immediately. As for the other trie-like structures for Python:



Maybe there is something else out there, but in any case I did not find a "just install it and everyone is happy" option, so I decided this was a good excuse to indulge my desire to brush up on C and get a better grasp of Cython and the Python C API by writing a trie implementation for Python.

Installation (as is usually the case when installing extensions, a compiler is required):

pip install datrie

Creating a trie:

 import string
 import datrie

 trie = datrie.Trie(string.ascii_lowercase)


When creating a trie you must say up front which keys can be used with it (explicitly specify the alphabet or character ranges). This is a libdatrie restriction that makes it possible to store the state machine efficiently while still supporting Unicode: first the allowed ranges of Unicode characters are declared, then the Unicode keys are translated into a more compact internal representation.

It seems to me that in practice one could have gotten by with single-byte encodings like cp1251 and had roughly the same functionality and efficiency, but the character-range approach also works well, it is done well in libdatrie, and that is fine. That is why I say "not supporting Unicode would have been OK": in the case of a trie, a single-byte encoding can be a perfectly convenient option.
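A rough sketch of the translation idea (my own illustration, not libdatrie's actual encoding): given the allowed ranges, each permitted character maps to a small dense code, so the transition tables only need as many distinct symbols as the alphabet actually contains.

```python
# Sketch: map characters from explicitly allowed unicode ranges to a dense
# internal alphabet 0..n-1. Hypothetical illustration of the idea only.
def make_codec(ranges):
    """ranges: list of (first_char, last_char) inclusive pairs."""
    to_code = {}
    for lo, hi in ranges:
        for cp in range(ord(lo), ord(hi) + 1):
            to_code.setdefault(chr(cp), len(to_code))
    return to_code

codec = make_codec([('a', 'z'), ('а', 'я')])  # latin + russian lowercase
encoded = [codec[ch] for ch in 'foo']
print(encoded)  # compact codes instead of raw unicode code points
```

With 26 + 32 allowed characters, every key character fits in a single small integer regardless of where it lives in the Unicode code space.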

Then you can work with trie as with a dictionary:

 >>> trie[u'foo'] = 5
 >>> trie[u'foobar'] = 10
 >>> trie[u'bar'] = 'bar value'
 >>> trie.setdefault(u'foobar', 15)
 10
 >>> u'foo' in trie
 True
 >>> trie[u'foo']
 5

The keys must be Unicode :) In Python 3.x this is quite natural; in 2.x you have to put the u prefix in the examples, forgive me. Values can be objects of any type, which is not typical for C implementations of tries (those usually allow only integers as values). In truth, "inside", the values really are integers, but datrie.Trie uses them as indices into an array of the "real" values. For those who do not need this feature (say, they do not care about the values at all), there is datrie.BaseTrie, which stores only integers and is a bit faster.
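That indirection can be sketched like this (a hypothetical stand-in, not datrie's real internals): the fast int-only structure maps each key to an index, and the index points into a Python list of arbitrary objects.

```python
# Sketch of storing arbitrary Python values behind an int-only trie:
# the int kept per key is an index into a separate values list.
# A plain dict stands in here for the BaseTrie-like int structure.
class ObjectTrie:
    def __init__(self):
        self._int_trie = {}   # key -> index (stand-in for the int-only trie)
        self._values = []     # index -> arbitrary Python object

    def __setitem__(self, key, value):
        if key in self._int_trie:
            self._values[self._int_trie[key]] = value
        else:
            self._int_trie[key] = len(self._values)
            self._values.append(value)

    def __getitem__(self, key):
        return self._values[self._int_trie[key]]

t = ObjectTrie()
t['bar'] = ['any', 'object']
print(t['bar'])  # ['any', 'object']
```

Skipping the values list (and its extra lookup) is exactly what makes an int-only variant slightly faster.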

A word about speed. All measurements were carried out on a trie.Trie holding hundreds of thousands of unique Russian and English words (50/50) with the int value 1; other measurements (on a million URLs) can be found here, or (better still) run them yourself on your own data. I measured speed only to get a general feel for the orders of magnitude and to track regressions in the library, so take the numbers accordingly; all the source code and data are in the repository. I also will not say anything about the asymptotic complexity of the various operations, since I did not investigate it. In theory it should match the trie from Wikipedia (for example, getting an element or searching by prefix is O(m), where m is the key length), but implementation details (of both libdatrie and my wrapper) can change everything; if someone plots the corresponding graphs, I will of course be grateful.

Getting an element, membership checks, and updating an element run on average 2-3 times slower than the standard dict (which is still fast: on my laptop it is 1 to 3 million operations per second for the trie described above). The exception is inserting a new value into the trie, which is much slower (about 50 thousand operations per second for the same trie of Russian and English words). On the other hand, on such data the trie takes up much less RAM: 3-5 MB (depending on the interpreter) versus 20+ MB for a regular dictionary (the memory measurements were crude, so I cannot vouch for the exact numbers).

In other words, datrie.Trie can be used as a replacement for dict when you have a large number of not-very-long strings (for example, words or URLs), the data is used mostly in read-only (or "update-only") mode, and you want to save RAM at the cost of access that is 2-3 times slower.

So that this pseudo-read-only nature is not too much of a nuisance, a trie can be saved to a file and loaded back from it:

 >>> trie.save('my.trie')
 >>> trie2 = datrie.Trie.load('my.trie')


Another trie feature: some operations that in a dict (or any other hash table) would require a full scan, or a lot of memory for auxiliary indexes, work in a prefix tree almost as fast as (and some even faster than) plain value lookups.

You can check whether the trie contains elements whose keys start with a given prefix (this is even faster than a plain in check):

 >>> trie.has_keys_with_prefix(u'fo')
 True


You can find all prefixes of a given string that are present in the trie (this is slower; according to the benchmarks, 500-600 thousand operations/sec):

 >>> trie.prefixes(u'foobarbaz')
 [u'foo', u'foobar']
 >>> trie.prefix_items(u'foobarbaz')
 [(u'foo', 5), (u'foobar', 10)]


You can find all elements whose keys start with a given string:

 >>> trie.keys(u'fo')
 [u'foo', u'foobar']
 >>> trie.items(u'fo')
 [(u'foo', 5), (u'foobar', 10)]
 >>> trie.values(u'foob')
 [10]


In this last example most of the time is currently spent building the result rather than searching, and this can be optimized; right now the speed is somewhere around 150-300 thousand values/sec (for example, with a prefix of length 7 and 3 values per prefix on average it is 70 thousand operations/sec).
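For readers who want to see why these prefix operations are cheap in any trie, here is a minimal pure-Python nested-dict sketch (my own illustration; far less memory-efficient than datrie) with the same three operations:

```python
# Minimal nested-dict trie (illustration only; datrie is far more compact).
_END = object()  # sentinel key marking "a key ends here"

class TinyTrie:
    def __init__(self):
        self.root = {}

    def __setitem__(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[_END] = value

    def prefixes(self, s):
        """All stored keys that are prefixes of s: one O(len(s)) walk."""
        out, node = [], self.root
        for i, ch in enumerate(s):
            if ch not in node:
                break
            node = node[ch]
            if _END in node:
                out.append(s[:i + 1])
        return out

    def has_keys_with_prefix(self, prefix):
        """Walk down the prefix; no need to enumerate anything."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def keys(self, prefix=''):
        """Walk to the prefix node, then collect the whole subtree."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        out, stack = [], [(prefix, node)]
        while stack:
            acc, n = stack.pop()
            for k, v in n.items():
                if k is _END:
                    out.append(acc)
                else:
                    stack.append((acc + k, v))
        return sorted(out)

t = TinyTrie()
t['foo'] = 5
t['foobar'] = 10
t['bar'] = 'bar value'
print(t.prefixes('foobarbaz'))       # ['foo', 'foobar']
print(t.has_keys_with_prefix('fo'))  # True
print(t.keys('fo'))                  # ['foo', 'foobar']
```

The key point is that each operation starts with a single walk down the query string; a hash table has no such shared structure, so it would have to scan every key.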

datrie.Trie has various other methods; help and additional benchmark results can be found in the README in the repository.

Under PyPy everything ran on my Debian box (10 times slower than under CPython); on a Mac under PyPy it did not start at all (with regular Python it should work on Mac, Linux, and Windows). C-API extensions will always be slow under PyPy, since cpyext works through a kludge, and Cython generates C-API extensions. I could have written a ctypes wrapper, but it would be slow under regular Python (and not necessarily fast under PyPy), plus ctypes extensions are inconvenient to distribute. The PyPy folks are now working on cffi, and they promise it will be fast (both under CPython and under PyPy). It is also possible that Cython will someday learn to generate cffi extensions instead of C-API ones. Then, perhaps, happiness will come :) In the meantime I do not know what to do about PyPy. Nothing, I guess; it more or less works under Linux, and that will do.

While working on the implementation I ran into a bug in Python's utf_32_le codec. There is a bug report with a patch (bugs.python.org/issue15027), but the patch has not been committed yet. Initially all operations in datrie were 10 times slower, but then I managed to avoid the standard utf_32_le encoding of a string in one place, and everything got much better. This codec is used in a couple of "hot" spots, so I think that if it is sped up, some datrie operations could become up to 2 times faster.
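For context, utf_32_le is the standard Python codec that turns every character into a fixed 4-byte little-endian code unit, which is exactly the kind of uniform representation that is convenient when translating keys:

```python
# Each character becomes exactly 4 bytes (little-endian UTF-32, no BOM),
# so offsets into the encoded key are trivial to compute.
s = u'foo'
b = s.encode('utf_32_le')
print(len(b))  # 4 bytes per character -> 12
```

The fixed width is the whole point: unlike UTF-8, character i always lives at byte offset 4*i.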

Iteration over the tree is also not the most efficient right now; this is due to peculiarities of the libdatrie interface. But the author of libdatrie is a great guy and is going to fix everything, so the prospects are good.

As usual, patches, bug reports, ideas, benchmarks, pull requests, etc. are welcome!

GitHub / Bitbucket, whichever is more convenient for you.

Source: https://habr.com/ru/post/147963/

