
Delphi. What does TDictionary hide?


Good day.
Did you know that not all hash tables are equally useful? In this article I will tell the story of how one bad hash table ate up all the performance without blinking, and how fixing that hash table sped up the code almost 10 times.
As the title suggests, the article focuses on Delphi, but even if you are not a Delphi developer, it is still worth reading on, and afterwards taking a look at the source code of the hash tables you use. As for Delphi developers, I advise them to completely abandon the standard TDictionary.


1. How a hash table is organized.

If you already know perfectly well how a hash table is organized and what linear probing is, you can safely skip to the second part.

Internally, any hash table is essentially an array. The elements of this array are called buckets. One of the central parts of a hash table is the hash function. In general, a hash function maps one set onto another set of fixed size, and the task of any good hash function is to produce a uniformly distributed result over that fixed set. You can read more about hash functions on Wikipedia; I will not dwell on them here.
So, we have prepared an array of, say, 10 buckets and are ready to add values to it. We take the hash of our key; suppose the hash is a 32-bit integer M. To determine which bucket the value goes into, we take the remainder of the division: Index := M mod 10, and then store the pair: Bucket[Index] := NewKeyValuePair.
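A minimal sketch of this mapping in Delphi (the function name is mine, not part of any library):

// Minimal sketch: map the 32-bit hash of a key to a bucket index.
function BucketIndex(KeyHash: Cardinal; BucketCount: Cardinal): Cardinal;
begin
  Result := KeyHash mod BucketCount;  // the remainder of the division picks the bucket
end;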
Sooner or later we will run into the situation where Bucket[Index] is already occupied by some other value, and we will have to do something about it. Handling this case is called collision resolution. There are several techniques for resolving collisions; here are the main ones:
Separate chaining or closed addressing

In its simplest form the method works like this: each bucket is a pointer to the head of a linked list. When a collision occurs, we simply add another element to that list. To keep the table from degenerating into a handful of long linked lists, a load factor is introduced: when the average number of elements per bucket exceeds this value, a new (larger) array of buckets is created and all elements from the old array are reinserted into the new one. This procedure is called a rehash.
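A rough sketch of this classic layout (illustrative declarations with Integer keys and values, not the code of any particular library):

type
  // classic separate chaining: one heap-allocated node per stored pair
  PNode = ^TNode;
  TNode = record
    Key: Integer;
    Value: Integer;
    Next: PNode;             // next entry that landed in the same bucket
  end;
  TBuckets = array of PNode; // each bucket is just the head of a linked list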
The disadvantage of this approach is that for each <key, value> pair we have to allocate a node to add to the linked list.
This approach can be improved. For example, store in each bucket the <key, value> pair itself plus a pointer to the head of an overflow linked list. By setting the load factor to 1 or even 0.75, you can almost entirely avoid allocating linked-list nodes, and the buckets, being a plain array, are very cache-friendly. P.S. At the moment I consider this the best way to resolve collisions.
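A sketch of this improved bucket (again, made-up field names, purely for illustration): the first pair is stored inline, and only overflow entries go into a list.

type
  POverflow = ^TOverflow;
  TOverflow = record
    Key: Integer;
    Value: Integer;
    Next: POverflow;
  end;
  TInlineBucket = record
    HasValue: Boolean;   // is the inline slot occupied?
    Key: Integer;
    Value: Integer;      // the first pair lives right in the bucket...
    Overflow: POverflow; // ...and only collisions spill into a linked list
  end;
  TInlineBuckets = array of TInlineBucket;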

Open addressing

With this method the buckets are an array of <key, value> pairs, and absolutely all elements of the hash table are stored directly in the buckets, one item per bucket. An attempt to place an element into a particular bucket is called a probe. The best-known probing schemes are: Linear probing, Quadratic probing, and Double hashing.
Let's see how Linear probing works.
Here we hit a situation where the target bucket is already occupied by another element:

Then we simply add one to the index and put the element into the next bucket:

Hit an occupied bucket again? Now we have to step over 3 elements:

And here things turned out really badly:

The last example clearly shows that for this method to be effective there must be plenty of free buckets, and it is also very important that the inserted elements do not cluster in particular areas, which places high demands on the hash function.
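Here is a minimal sketch of insertion with linear probing under these assumptions (all names are illustrative; it also assumes the table still has at least one free bucket, otherwise the loop would never terminate):

type
  TBucket = record
    Occupied: Boolean;
    Key: Integer;
    Value: Integer;
  end;

// Insertion with linear probing; returns how many occupied buckets
// were stepped over, i.e. the number of collisions.
function LinearProbeInsert(var Buckets: array of TBucket;
  KeyHash: Cardinal; const Key, Value: Integer): Integer;
var
  Index: Cardinal;
begin
  Result := 0;
  Index := KeyHash mod Cardinal(Length(Buckets));
  while Buckets[Index].Occupied do
  begin
    Inc(Result);                                        // bucket taken: collision
    Index := (Index + 1) mod Cardinal(Length(Buckets)); // try the next bucket
  end;
  Buckets[Index].Occupied := True;
  Buckets[Index].Key := Key;
  Buckets[Index].Value := Value;
end;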
Quadratic probing and Double hashing are attempts to work around a weak hash function. The idea of Quadratic probing is that successive probes jump over 1, 2, 4, 8, and so on elements. The idea of Double hashing is that the step size for the next probe is computed by a hash function: either a different one, or the same one applied to the index of the bucket we just failed to insert into.
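As a sketch, the probe-step rules of the two schemes could look like this (illustrative only; real implementations differ in details):

// Quadratic probing (as described above): retry number k jumps 2^k buckets further.
function QuadraticProbe(Index, ProbeNumber, BucketCount: Cardinal): Cardinal;
begin
  Result := (Index + (Cardinal(1) shl ProbeNumber)) mod BucketCount;
end;

// Double hashing: the step comes from a second hash value; "+ 1" keeps it non-zero.
function DoubleHashProbe(Index, SecondHash, BucketCount: Cardinal): Cardinal;
begin
  Result := (Index + (SecondHash mod (BucketCount - 1)) + 1) mod BucketCount;
end;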
But the most important thing about Open addressing is this: even if we insert 10,000 elements without a single collision, adding element 10,001 can force us to walk over all 10,000 elements already there. Worse still, once that 10,001st element is finally placed, every lookup of it will again have to walk past the 10,000 preceding ones.
Another unpleasant consequence follows from the above: for Open addressing the insertion order matters. No matter how great the hash function is, an unlucky insertion order can spoil everything. Consider that last, unlucky case again: all elements were inserted without a single collision, but the very last one collided and produced a long run of occupied buckets:

Now, what if we had inserted that last element first:

We get a collision, but we only step over 1 bucket.
Then we would insert its neighbor:

again stepping over just 1 bucket.
And the next neighbor:

In total, insertion would cost only one collision per element. Although the total number of buckets examined during insertion stays the same, the work becomes much more evenly distributed, and now accessing any of these elements requires at most 2 bucket checks.

2. So what is wrong with Linear probing in Delphi's TDictionary?


You might object that such an “unlucky” order is hard to hit with a good hash function? You would be very wrong.
To enumerate all the elements of a TDictionary, we walk the bucket array and collect the elements from the occupied cells. The main catch is that this order is preserved! Simply create another TDictionary, transfer the data into it... and you get a pile of collisions on the elements added last, right before the next grow.
The simplest code to reproduce the situation:
Hash1 := TDictionary<Integer, Integer>.Create();
Hash2 := TDictionary<Integer, Integer>.Create();

for i := 0 to 1280000 do   // fill the first TDictionary
  Hash1.Add(i, i);

for i in Hash1.Keys do     // and simply copy it into the second one - hello, collisions!
  Hash2.Add(i, i);

In TDictionary, a rehash occurs only when the free cells in the bucket array drop below 25% (the grow threshold is 75%), and the capacity always doubles. Here are the occupied buckets of the larger dictionary:

An attempt to add these elements to the smaller table can be pictured as follows: split the large table into two halves. The first half maps onto the smaller table identically.

Now we need to add the elements from the second half (shown in green).

See how the number of collisions grows while the second half is being added? And if the next rehash is still far away and the table is large enough, this turns into a huge overhead.
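Why this happens is easy to see from how the bucket index is computed. As a sketch (the real TDictionary code differs in details, but it does keep the bucket count a power of two): the index is the hash masked by the table size, so in a table half the size the old upper half lands exactly on top of the old lower half.

// Sketch only: with a power-of-two bucket count the index is the hash
// masked by (Count - 1).
function SmallTableIndex(Hash, SmallCount: Cardinal): Cardinal;
begin
  Result := Hash and (SmallCount - 1);  // SmallCount is assumed to be a power of two
end;
// For SmallCount = 8, an element that sat in bucket 3 of the 16-bucket table
// and an element that sat in bucket 11 (3 + 8) both map to bucket 3 here,
// so the whole "green" half piles onto cells that are already occupied.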
Let's count the number of collisions per insertion. To do this, I copied the TDictionary source code and added a couple of callbacks that report collisions. The results are plotted below:

I filled the hash table with values and measured the number of collisions as I went; every N inserted values produce one new pixel along the X axis. In other words, the horizontal axis shows how full the table is, and the vertical axis shows the number of collisions. To the right of each graph are its summary values:

The black line is the maximum number of collisions within a pixel; the red line is the arithmetic mean of collisions within a pixel.
On the first graph, at 48% fill everything is fine: the maximum number of collisions is 169. As soon as we cross 50%, the horror begins, and around 61% it gets really brutal: the number of collisions for a single element can reach 130 thousand! And it stays like this up to 75%. At 75% a grow occurs and the fill percentage drops.
Each sawtooth packed with collisions is exactly the situation I showed in the picture above; the “saw” ends with a rehash and a sharp drop in collisions.
By sheer bad luck you can land right at the top of such a sawtooth, and then any work with the most recently added elements comes with severe slowdowns.
How can this be fixed? The most obvious option is to keep the grow threshold at 50%, i.e. to make sure that at least 50% of the cells in the hash table stay free. Changing the threshold gives the following graphs:

for the same amounts of data. At the cost of extra memory we win back our CPU time. The problem is that the GrowThreshold field is not accessible from outside. Alternatively, you can try to avoid unlucky insertion sequences: either manually shuffle or sort the data before putting it into the table, which is rather expensive, or give different tables different hash functions. Many hash functions (such as Murmur2 or Bob Jenkins' hash) let you specify a Seed / InitialValue. Picking that value at random is not recommended either; in most cases a prime number works well as a seed, but it is still better to read the documentation of the specific hash function.
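A sketch of the "different hash function per table" workaround for the example above, using BobJenkinsHash as exported by System.Generics.Defaults; the procedure name and the seed 7919 (an arbitrary prime) are my own choices:

uses
  System.Generics.Collections, System.Generics.Defaults;

procedure CopyWithDifferentSeed;
var
  Hash1, Hash2: TDictionary<Integer, Integer>;
  Seeded: IEqualityComparer<Integer>;
  i: Integer;
begin
  // a comparer whose hash uses its own seed,
  // so Hash2 does not inherit the unlucky bucket order of Hash1
  Seeded := TEqualityComparer<Integer>.Construct(
    function(const L, R: Integer): Boolean
    begin
      Result := L = R;
    end,
    function(const Value: Integer): Integer
    begin
      Result := BobJenkinsHash(Value, SizeOf(Value), 7919); // 7919: arbitrary prime seed
    end);

  Hash1 := TDictionary<Integer, Integer>.Create;
  Hash2 := TDictionary<Integer, Integer>.Create(Seeded);
  try
    for i := 0 to 1280000 do
      Hash1.Add(i, i);
    for i in Hash1.Keys do   // the same order is now hashed with a different seed
      Hash2.Add(i, i);
  finally
    Hash2.Free;
    Hash1.Free;
  end;
end;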
And finally, my advice: do not use Open addressing as a universal approach for a general-purpose hash table. In my opinion it is better to go with Separate chaining, with buckets holding a <key, value> pair plus a pointer to the head of a linked list, and a load factor of about 0.75.

P.S. I spent several days hunting down this pitfall. The situation was complicated by the fact that it only showed up on large volumes of data, and the dependence on the fill percentage of the TDictionary meant the slowdowns mysteriously disappeared from time to time. Thank you for your attention, and I hope you will not trip over this yourselves. ;)

upd 11/24/2016 Rust stepped on the same rake: accidentallyquadratic.tumblr.com/post/153545455987/rust-hash-iteration-reinsertion

upd2 01/31/2017 And also in Swift: lemire.me/blog/2017/01/30/maps-and-sets-can-have-quadratic-time-performance

Source: https://habr.com/ru/post/282902/

