
Which map is faster, and is there an alternative to Judy?


(Image: a shot from Top Gear: USA, Series 2)


In our most heavily loaded services at Badoo we use C, and occasionally C++. These services often keep hundreds of gigabytes of data in memory and handle hundreds of thousands of requests per second, so it matters that we choose not only suitable algorithms and data structures, but also efficient implementations of them.


Almost from the very beginning we have used Judy as our associative array implementation. It has a C interface and a number of advantages. We even use a PHP wrapper around it, because in PHP versions before 7.0 Judy consumes significantly less memory than the built-in arrays.


Time goes by, though, and many years have passed since the last Judy release, so it is time to look at alternatives.


My name is Marco, and I am a systems programmer at Badoo on the Platform team. My colleagues and I did a small study of alternatives to Judy, drew some conclusions, and decided to share them with you.


Associative array


If you already know what an associative array, a tree, and a hash table are, feel free to skip this section. For everyone else, here is a short introduction to the topic.


An associative array is an abstract data type that lets you store and retrieve a value by key. It is abstract because it says nothing about how this data type should be implemented; with a little imagination you can come up with dozens of implementations of varying quality.


For example, we could take a plain linked list. On put we would walk it from beginning to end to make sure the key is not there yet, and add the element to the end of the list if it is not; on get we would again walk the list from beginning to end looking for the desired key. The complexity of both lookup and insertion in such an associative array is O(n), which is not great.
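
A minimal sketch of that linked-list approach might look like this (illustrative code, not from the article's benchmarks; fixed-width integer keys and values are assumed):

```cpp
#include <cstdint>

// A deliberately naive O(n) associative array backed by a singly linked list.
struct ListMap {
    struct Node {
        uint64_t key;
        uint64_t value;
        Node*    next;
    };
    Node* head = nullptr;

    // put: scan the whole list; overwrite on a match, otherwise prepend
    // (the article appends at the end; either way the cost is the scan).
    void put(uint64_t key, uint64_t value) {
        for (Node* n = head; n != nullptr; n = n->next) {
            if (n->key == key) { n->value = value; return; }
        }
        head = new Node{key, value, head};
    }

    // get: scan from the beginning until the key is found.
    bool get(uint64_t key, uint64_t* out) const {
        for (Node* n = head; n != nullptr; n = n->next) {
            if (n->key == key) { *out = n->value; return true; }
        }
        return false;
    }
};
```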


A simple but slightly more sensible implementation is a plain array whose elements we always keep sorted by key. We can search it with binary search in O(log n), but inserting an element may require copying large chunks of memory, because we have to make room for it at a specific position.
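
And a sketch of the sorted-array variant, using std::lower_bound for the binary search (again illustrative, with the same assumed key and value types):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sorted vector of key/value pairs: O(log n) lookup, O(n) insertion,
// because elements after the insertion point have to be shifted.
struct SortedArrayMap {
    std::vector<std::pair<uint64_t, uint64_t>> items;  // kept sorted by key

    void put(uint64_t key, uint64_t value) {
        auto it = std::lower_bound(items.begin(), items.end(), key,
            [](const auto& p, uint64_t k) { return p.first < k; });
        if (it != items.end() && it->first == key) {
            it->second = value;              // key already present
        } else {
            items.insert(it, {key, value});  // may move a large tail of the array
        }
    }

    bool get(uint64_t key, uint64_t* out) const {
        auto it = std::lower_bound(items.begin(), items.end(), key,
            [](const auto& p, uint64_t k) { return p.first < k; });
        if (it == items.end() || it->first != key) return false;
        *out = it->second;
        return true;
    }
};
```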


If we look at how associative arrays are implemented in practice, we will most likely see either a tree or a hash table.


Hash table


A hash table is a data structure implementing an associative array whose operation is a two-step process.


In the first step, the key is run through a hash function: a function that takes a sequence of bytes (our key) as input and usually produces a number as output. In the second step we use that number to locate the value.


How? There are many options. The first thing that comes to mind is to use this number as an index into an ordinary array that holds our values. But that will not work, for at least two reasons:


  1. The range of the hash function is most likely larger than the size of the array we are willing to keep allocated in memory.


  2. Hash functions have collisions: two different keys can produce the same number.

Both problems are usually solved by allocating an array of limited size, each element of which is a pointer to another array (a bucket). The range of the hash function is then mapped onto the size of the first array.


How? Again, there are many options: for example, take the remainder of division by the array size, or simply keep the required number of low bits if the array size is a power of two (division is a much slower CPU operation than a bit mask, which is why array sizes are usually chosen to be powers of two).
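
For illustration, the two mapping strategies just mentioned could look like this (a 64-bit hash value is assumed; the function names are mine):

```cpp
#include <cstddef>
#include <cstdint>

// Map a hash value onto a table of arbitrary size: requires a division.
size_t bucket_by_modulo(uint64_t hash, size_t table_size) {
    return hash % table_size;
}

// Map a hash value onto a table whose size is a power of two:
// a bit mask keeps only the low bits, which is much cheaper than division.
size_t bucket_by_mask(uint64_t hash, size_t table_size_pow2) {
    return hash & (table_size_pow2 - 1);
}
```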



This is just one of the most common ways to implement a hash table; there are many others, and performance depends heavily on the chosen collision-handling strategy. As for algorithmic complexity, hash tables give O(1) on average and can degrade to O(n) in the worst case.


Trees


Trees are fairly straightforward. Associative arrays are typically implemented with various binary search trees, such as the red-black tree. If the tree is balanced, each operation costs O(log n).



Some tree implementations, such as Google's btree, take the modern memory hierarchy into account and store keys in batches to make better use of CPU caches.


Trees let you traverse elements in ascending or descending key order, which is often an important requirement when choosing an implementation. Hash tables cannot do this.
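
As an illustration, the standard std::map, which is typically backed by a red-black tree, gives exactly this kind of ordered traversal:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>

int main() {
    // std::map is usually a red-black tree: O(log n) operations,
    // and iteration visits keys in ascending order.
    std::map<uint64_t, uint64_t> m;
    m[42]  = 1;
    m[7]   = 2;
    m[100] = 3;

    for (const auto& kv : m) {
        // Prints the keys in order: 7, 42, 100.
        std::printf("%llu -> %llu\n",
                    (unsigned long long)kv.first,
                    (unsigned long long)kv.second);
    }
    return 0;
}
```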


What are we looking at?


The world of associative arrays, of course, is not black and white: each implementation has both advantages and disadvantages.


Implementations differ not only in performance but also in the features they provide. For example, some allow traversing items in ascending or descending key order while others do not, and this limitation comes from the internal structure, not the author's whim.


Even raw performance can vary greatly depending on how the associative array is used: mostly written to or mostly read, keys accessed in random order or sequentially.


Gone are the days when Knuth-style O-notation was enough to compare two algorithms. The real world is much more complicated: the modern hardware our programs run on has a multi-level memory hierarchy, with very different access costs at each level, and taking this into account can make an algorithm many times faster or slower.


Another important criterion is memory consumption: what overhead does the chosen implementation add on top of the theoretical minimum ((key size + value size) multiplied by the number of elements), and can our program afford it? For ten million elements with a 4-byte key and an 8-byte value, for instance, that minimum is roughly 120 MB.


Pauses, latency, and responsiveness also matter. Some implementations occasionally require an expensive one-off rebuild of their internal structures when an unlucky insertion happens, while others spread this cost over many operations.


Study participants


Judy


Judy (or Judy array) is a data structure invented by Douglas Baskins that implements an ordered associative array. The implementation is written in C and is extremely complex; the author tried to account for modern architectures and to make maximal use of CPU caches.


Judy is a tree, not a hash table, and not just any tree but a so-called prefix (radix) tree. For us as users this means that with Judy we get key compression and the ability to iterate over elements in ascending or descending key order.


Judy provides several key/value variations:


Variation   Key                       Value
Judy1       uint64_t                  bit
JudyL       uint64_t                  uint64_t
JudySL      null-terminated string    uint64_t
JudyHS      array of bytes            uint64_t

Judy1 is useful for large sparse bitmaps. JudyL is the basic type that we use most often in Badoo's C services. JudySL, as you can see, is convenient for string keys, and JudyHS is for arbitrary byte arrays.
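
A minimal JudyL example, assuming the macro interface documented in Judy.h (JLI, JLG, JLFA); error handling is omitted for brevity:

```cpp
#include <Judy.h>
#include <cstdio>

int main() {
    Pvoid_t judy  = (Pvoid_t) NULL;  // an empty JudyL array
    Word_t  index = 12345;
    PWord_t pvalue;

    // Insert: JLI sets pvalue to point at the value slot for the key.
    JLI(pvalue, judy, index);
    *pvalue = 67890;

    // Lookup: JLG sets pvalue to NULL if the key is absent.
    JLG(pvalue, judy, index);
    if (pvalue != NULL)
        std::printf("%lu -> %lu\n", (unsigned long)index, (unsigned long)*pvalue);

    // Free the entire array.
    Word_t bytes_freed;
    JLFA(bytes_freed, judy);
    return 0;
}
```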


std::unordered_map


unordered_map is implemented in C++ as a template, so almost anything can be used as a key or value. For non-standard key types you have to provide a hash function yourself.
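
For example, a composite key type that the standard library cannot hash by itself needs its own hash functor; a sketch (the key type here is made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// A hypothetical composite key that the standard library does not know how to hash.
struct UserKey {
    uint32_t    country_id;
    std::string name;

    bool operator==(const UserKey& o) const {
        return country_id == o.country_id && name == o.name;
    }
};

// Hash functor that combines the hashes of the two fields.
struct UserKeyHash {
    size_t operator()(const UserKey& k) const {
        size_t h1 = std::hash<uint32_t>{}(k.country_id);
        size_t h2 = std::hash<std::string>{}(k.name);
        return h1 ^ (h2 << 1);  // simple combination, good enough for an example
    }
};

// The hash functor is passed as the third template parameter.
std::unordered_map<UserKey, uint64_t, UserKeyHash> table;
```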


It is a hash table, so traversal of keys in ascending or descending order is not available.


google::sparse_hash_map


As the name implies, sparse hash was developed by Google and claims a minimal overhead of just two bits per value. It is also implemented in C++ as a template (which, in my opinion, is a definite advantage of the C++ implementations).


Among other things, sparse hash can save its state to disk and load it back.
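
Typical usage looks roughly like this; note that, as far as I understand the API, a "deleted key" has to be reserved before erase() can be used:

```cpp
#include <cstdint>
#include <sparsehash/sparse_hash_map>

int main() {
    google::sparse_hash_map<uint32_t, void*> m;

    // sparse_hash_map reserves one key value as a marker for deleted slots;
    // it must be set before erase() is called and must never be used as a real key.
    m.set_deleted_key(0xFFFFFFFFu);

    uint64_t payload = 42;
    m[123] = &payload;

    auto it = m.find(123);
    if (it != m.end()) {
        // ... use it->second ...
    }
    m.erase(123);
    return 0;
}
```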


google::dense_hash_map


In addition to sparse hash, Google also offers a variant called dense hash. It is noticeably faster (especially on get), but this is paid for with higher memory consumption. Also, because dense_hash_map uses flat arrays internally, it periodically needs an expensive rebuild. Konstantin Osipov covered these problems and other nuances of hash tables in his HighLoad++ talk, which I recommend watching.


If the amount of data stored in the associative array is small and its size does not change much, dense hash is likely your best choice.


Like sparse hash, dense hash is a hash table and a template: that is, again, no traversal of keys in ascending order, but any types can be used as keys and values.


Like its sibling sparse_hash_map, dense_hash_map can also save its state to disk.
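
A short dense_hash_map sketch; unlike sparse hash it also requires an "empty key" to be reserved before the first insertion:

```cpp
#include <cstdint>
#include <sparsehash/dense_hash_map>

int main() {
    google::dense_hash_map<uint32_t, void*> m;

    // dense_hash_map stores elements in a flat open-addressed array and needs
    // one reserved key value to mark empty slots; set it before the first insert.
    m.set_empty_key(0xFFFFFFFFu);
    // Needed only if erase() will be used; must differ from the empty key.
    m.set_deleted_key(0xFFFFFFFEu);

    uint64_t payload = 42;
    m[123] = &payload;
    m.erase(123);
    return 0;
}
```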


spp::sparse_hash_map


And finally, the last contender comes from one of the Google sparse hash developers. The author took sparse hash as a starting point and rethought it, achieving a minimal overhead of one bit per entry.


Everything else is the same: a hash table and a template, implemented in C++.
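
Usage is essentially the same as with std::unordered_map; a sketch assuming the sparsepp header layout:

```cpp
#include <cstdint>
#include <sparsepp/spp.h>

int main() {
    // Intended as a drop-in replacement for std::unordered_map;
    // no reserved empty/deleted keys are required.
    spp::sparse_hash_map<uint32_t, void*> m;

    uint64_t payload = 42;
    m[123] = &payload;

    auto it = m.find(123);
    if (it != m.end()) {
        // ... use it->second ...
    }
    return 0;
}
```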


The contenders have been introduced; it is time to see what they are capable of.


Two more factors


But first, note that two more things affect performance and memory consumption.


Hash function


All the participants in our study that are hash tables use hash functions.


Hash functions have quite a few properties, each of which matters more or less depending on the field. In cryptography, for example, it is important that a minimal change in the input leads to the largest possible change in the output.


What interests us most is speed. The performance of hash functions varies widely and often depends on the amount of data they are fed. When a long string of bytes goes into the hash function, it may make sense to use SIMD instructions to exploit modern processors more fully.


In addition to the standard C++ hash functions, I will look at how xxhash and t1ha perform.
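
To plug an alternative hash function into a template-based hash table, you wrap it in a functor; a sketch for xxhash using its XXH64 entry point:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <xxhash.h>

// Functor that lets any of the template-based hash tables use xxhash
// instead of std::hash for 32-bit keys.
struct XXHash32Key {
    size_t operator()(uint32_t key) const {
        return (size_t)XXH64(&key, sizeof(key), /*seed=*/0);
    }
};

// The functor goes in as the hash template parameter.
std::unordered_map<uint32_t, void*, XXHash32Key> table;
```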


Memory allocation


The second factor that will certainly affect our results is the memory allocator: both speed and memory consumption depend on it directly. The standard libc allocator is a jack of all trades: good at everything, perfect at nothing.


As an alternative I will consider jemalloc. Even though the test will not exercise one of its main strengths, multithreaded workloads, I expect it to be faster.


Results


Below are the results of a random-lookup test over ten million items. The two parameters that interest me most are the test run time and the maximum memory consumed (peak process RSS).


I used a 32-bit integer as the key and a 64-bit pointer as the value, since these are the key and value sizes we use most often.


I have not shown the results of the other tests available in the repository, since I was mostly interested in random get and memory consumption.


The fastest associative array was Google's dense hash, followed by spp and then std::unordered_map.



However, looking at memory, you can see that dense_hash_map consumes more than all the other implementations, which is exactly what the documentation says. Moreover, if the test had mixed reads and writes, we would most likely have run into a stop-the-world rebuild. So in applications where responsiveness matters and writes are frequent, dense hash is not an option.



I also looked at alternative hash function implementations, but it turned out that the standard std::hash is faster.


A colleague happened to notice that for 32-bit and 64-bit keys, std::hash is essentially the identity function: the input and the output are the same number, i.e. the function does nothing. xxhash and t1ha, by contrast, honestly compute a hash.
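
This is easy to check; on libstdc++, for instance, the following assertions hold (a quick sanity check, not part of the benchmark):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

int main() {
    // On common standard library implementations (e.g. libstdc++),
    // std::hash of an integral type simply returns the value itself.
    assert(std::hash<uint32_t>{}(12345u) == 12345u);
    assert(std::hash<uint64_t>{}(67890ull) == 67890ull);
    return 0;
}
```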



Performance-wise, jemalloc helps: almost all implementations get faster.



In terms of memory consumption, however, jemalloc does not do as well.



Source code and hardware


I have published the benchmark source code on GitHub; if you want, you can repeat my tests or extend them.


I did the testing on my home desktop:



Brief conclusions and closing words


Our tests show that spp has good potential as a replacement for Judy: spp was faster on random gets, and its memory overhead is not too large.


We have written a simple C wrapper over the C++ code and will soon try spp in one of our high-load services.
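
The wrapper itself is not shown in the article, but the idea is a thin extern "C" layer over the template; a hypothetical sketch (all names are mine):

```cpp
// spp_wrapper.cpp - a hypothetical thin C interface over spp::sparse_hash_map,
// so that C services can use it through an opaque handle.
#include <cstdint>
#include <sparsepp/spp.h>

using Map = spp::sparse_hash_map<uint32_t, void*>;

extern "C" {

void* spp_map_new(void)     { return new Map(); }
void  spp_map_free(void* h) { delete static_cast<Map*>(h); }

void spp_map_put(void* h, uint32_t key, void* value) {
    (*static_cast<Map*>(h))[key] = value;
}

// Returns NULL when the key is absent (so NULL values are not distinguishable).
void* spp_map_get(void* h, uint32_t key) {
    Map* m = static_cast<Map*>(h);
    auto it = m->find(key);
    return it == m->end() ? nullptr : it->second;
}

}  // extern "C"
```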


As a more philosophical conclusion, I would say that each of the implementations we have looked at has its pros and cons, and the choice should be driven by the problem at hand. Understanding their internals helps you match the task to the solution, and that understanding is often what separates an ordinary engineer from an advanced one.



Source: https://habr.com/ru/post/328472/

