
How we made a query 100 times faster, or not all hash functions are equally bad

We are developing a database. One day we were approached by a company that faced the following task:

There is a set of objects and a set of tags. Each object can have multiple tags. Some tags are very rare, others are common. The same tag can be associated with one object more than once.
New objects, tags, and links between them are continuously added.
The task is to answer questions like "How many objects have tag A or tag B, but not tag C?" very quickly. We would like to answer such queries in tenths of a second, without pausing data loading.

We received their data up to the current day, deployed a test cluster of four machines, and began to think about how to distribute the data properly and how to express the task as an SQL query to get maximum performance. In the end, we decided the query could look like this:
    SELECT COUNT(*) FROM
    (
        SELECT
            object_id,
            (MAX(tag == A) OR MAX(tag == B)) AND MIN(tag != C) AS good
        FROM tags
        WHERE tag IN (A, B, C)
        GROUP BY object_id
    )
    WHERE good == 1;


To make this query run quickly, we split the data between the cluster servers by object_id and, within each server, sorted it by tag. The server receiving the query can thus forward it unchanged to all servers holding data and then simply add up their results. On each data server it is enough to find the rows for tags A, B, and C (a fast operation, since the data is sorted by tag) and then execute the query in a single pass over those rows. The most common tag covers several tens of millions of objects, and several tens of millions of rows can be processed in tenths of a second.
It is worth noting that the subquery contains a GROUP BY object_id. A GROUP BY can be performed in several ways; for example, if the data within each tag is sorted by object_id, you can do something similar to a merge sort. In our case, however, the data was not sorted by object_id, and the optimizer reasonably decided that it would have to build a hash table to perform the GROUP BY.
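
A minimal sketch of such a hash-table GROUP BY (the row layout, the tag parameters, and the flag bookkeeping are illustrative assumptions, not our engine's actual code):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Illustrative row: one (object_id, tag) link.
    struct Row { uint64_t object_id; uint64_t tag; };

    // One pass over the rows pre-filtered to tags A, B, and C:
    // accumulate per-object flags in a hash table, then count
    // the objects with (A or B) and not C.
    uint64_t countGoodObjects(const std::vector<Row>& rows,
                              uint64_t A, uint64_t B, uint64_t C)
    {
        struct Flags { bool a = false, b = false, c = false; };
        std::unordered_map<uint64_t, Flags> groups;  // GROUP BY object_id

        for (const Row& row : rows)
        {
            Flags& f = groups[row.object_id];
            f.a |= (row.tag == A);
            f.b |= (row.tag == B);
            f.c |= (row.tag == C);
        }

        uint64_t count = 0;
        for (const auto& [id, f] : groups)
            count += (f.a || f.b) && !f.c;
        return count;
    }

The quality of the hash function behind such a table is exactly what the rest of this story is about.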

We loaded all the data into the cluster and ran the query. The query took 25 seconds.

Something went wrong.
First we removed the GROUP BY object_id to localize the problem. The query finished in the expected 0.3 seconds, so reading the data is fast enough and nothing superfluous is sent over the network. The problem is somewhere in the hash table.
We deployed a debug build of the server and started writing various statistics to the log. After some time it became clear that some hash values appeared in the table much more often than others. Even though the number of buckets was greater than the number of records in the table, the longest chain was about 32 entries long. So something was wrong with the hash function. Opening the implementation, we saw something like the following:

    // Hash the first value.
    uint64_t hash(uint64_t value)
    {
        return value * MULTIPLIER1;
    }

    // Mix the next value into an existing hash.
    uint64_t accumulateHash(uint64_t hash, uint64_t value)
    {
        hash ^= hash >> SHIFT1;
        hash += value;
        hash ^= hash >> SHIFT2;
        hash *= MULTIPLIER2;
        return hash;
    }


The bit twiddling looked very suspicious. Whoever wrote this code clearly thought it was a good idea to make the function trickier, but it seemed they had overdone it. We commented out both lines with the XOR and shift and restarted the cluster. The query finished in 0.5 seconds. Win.

The next morning I decided to find out who had originally written those two lines. I ran git blame and found the offending commit. Interestingly, of all the changes in that commit, only these two lines touched the server code; before it, accumulateHash consisted of just the addition and the multiplication. However, besides those two lines, the author of the commit had added a so-called avalanche test: a test verifying that any change to the input numbers, even the most insignificant one, leads to completely random changes in the hash value. The test was based on some PhD's publication and seemed reasonable. Indeed, the test failed without the XOR and shift, but passed with them. So the author of the commit had not simply written an intricate function; he understood what he was doing.
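
An avalanche test of this kind might look roughly like the sketch below; the sample count, the tolerance, and the stand-in function under test are my assumptions, not the test from that commit:

    #include <cmath>
    #include <cstdint>
    #include <random>

    // Stand-in for the function under test. A bare multiplicative
    // hash like this one fails the test: flipping a high input bit
    // can never affect the low output bits.
    uint64_t hashUnderTest(uint64_t value)
    {
        return value * 0xff51afd7ed558ccdULL;
    }

    // For every input bit, flipping it should flip every output bit
    // with probability close to 1/2 over many random inputs.
    bool avalancheTest(int samples = 10000, double tolerance = 0.02)
    {
        std::mt19937_64 rng(42);
        for (int inBit = 0; inBit < 64; ++inBit)
        {
            int flips[64] = {};
            for (int s = 0; s < samples; ++s)
            {
                uint64_t x = rng();
                uint64_t diff = hashUnderTest(x) ^ hashUnderTest(x ^ (1ULL << inBit));
                for (int outBit = 0; outBit < 64; ++outBit)
                    flips[outBit] += (diff >> outBit) & 1;
            }
            for (int outBit = 0; outBit < 64; ++outBit)
                if (std::fabs((double)flips[outBit] / samples - 0.5) > tolerance)
                    return false;  // this input bit does not propagate well
        }
        return true;
    }

With 10000 samples, random fluctuation of the flip frequency is around 0.005, so a tolerance of 0.02 separates genuine mixing defects from noise.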

But why did the function behave so badly in practice? To get to the bottom of it, I downloaded the data for one of the tags to my local machine and started experimenting. Indeed, our current hash function produced collisions on this data: all hash values had the same high five bits. Yet the same function with different SHIFT1 and SHIFT2 values gave good results. Then I generated 10 million random numbers and built the hash table again using our bad hash function. This time there were no anomalous collisions either. So on random numbers the hash function behaves well; the problem lies somewhere at the intersection of our hash function and the user data. I started looking for patterns in the user data, and found some: the values are all divisible by 64, and they all have the same high five bits. I generated a set of random multiples of 64 with the same high five bits. No, still no collisions.
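
The synthetic experiment looked roughly like this (the bucket count and the stand-in mixer are my reconstructions; the server's real MULTIPLIER and SHIFT constants are not reproduced here):

    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Stand-in mixer (the splitmix64 finalizer), used here in place
    // of our real hash function.
    uint64_t hashObjectId(uint64_t id)
    {
        id ^= id >> 30; id *= 0xbf58476d1ce4e5b9ULL;
        id ^= id >> 27; id *= 0x94d049bb133111ebULL;
        id ^= id >> 31;
        return id;
    }

    int main()
    {
        std::mt19937_64 rng(42);
        const size_t numKeys = 10'000'000;
        const size_t numBuckets = 1ULL << 24;
        std::vector<uint32_t> chain(numBuckets, 0);

        uint32_t longest = 0;
        for (size_t i = 0; i < numKeys; ++i)
        {
            // Random multiples of 64 sharing the same high five bits,
            // mimicking the patterns found in the user data.
            uint64_t id = (0b10110ULL << 59) | (rng() & ((1ULL << 59) - 64));
            uint32_t len = ++chain[hashObjectId(id) & (numBuckets - 1)];
            if (len > longest) longest = len;
        }
        // Prints a single-digit chain length: no anomaly on this data.
        printf("longest chain: %u\n", longest);
    }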

What else could be the problem? Could the way the object IDs are generated somehow defeat our hash function? Having decided to ask the client the next day exactly how they generate these IDs, I was about to head home when one of my colleagues asked me: this column you are doing GROUP BY on, isn't it the same column you are sharding across the servers by?
Yes, it really is the same column. It turned out that, because of a bug, the code that splits data into partitions and the code that performs the GROUP BY used the same hash function. Each of the four physical servers held 8 partitions, 32 partitions in total, and the upper bits of the hash value are used to select the partition. As a result, for all rows within one partition, the upper five bits of the hash of object_id are identical. When we commented out the XOR and shift in the hash function and restarted the cluster, we did not reload the data, which made it look as if the problem was fixed: the data was now partitioned by a hash function different from the one used for the GROUP BY. But had we continued loading fresh data, the problem would eventually have shown up again.
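
The effect is easy to reproduce, assuming the GROUP BY table also derives its bucket from the top bits of the hash (a plausible mechanism; the actual table may have differed in detail):

    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    // One hash shared, by mistake, between the sharding code and the
    // GROUP BY hash table. The mixer itself is fine; the constant is
    // a placeholder.
    uint64_t sharedHash(uint64_t id)
    {
        id ^= id >> 33;
        id *= 0xff51afd7ed558ccdULL;
        id ^= id >> 33;
        return id;
    }

    int main()
    {
        std::mt19937_64 rng(7);
        const int partitionBits = 5;      // 32 partitions = 4 servers x 8 each
        const uint64_t myPartition = 13;  // observe any single partition
        const int bucketBits = 20;        // GROUP BY table with 2^20 buckets
        std::vector<uint32_t> load(1ULL << bucketBits, 0);

        for (size_t kept = 0; kept < 1'000'000; )
        {
            uint64_t h = sharedHash(rng());
            if (h >> (64 - partitionBits) != myPartition)
                continue;                 // this row lives on another partition
            // Bucket chosen from the top bits of the SAME hash: the five
            // bits that selected the partition are frozen here.
            ++load[h >> (64 - bucketBits)];
            ++kept;
        }

        size_t nonEmpty = 0;
        for (uint32_t c : load) nonEmpty += (c != 0);
        printf("buckets used: %zu of %zu\n", nonEmpty, load.size());  // ~1/32
    }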

We returned the hash function to the form in which it passes the avalanche test and changed the hash function used for splitting into partitions. Afterwards, one of our colleagues shared a funny story: in a programming competition, he built a hash table in Java keyed by two-dimensional coordinates. A coordinate consisted of two 32-bit numbers, and to make the code efficient he packed them into a single 64-bit number and put that into the hash table. Unfortunately, the solution did not fit into the time limit, because Java computes the hash of a 64-bit number as the XOR of its upper and lower 32 bits, thereby always producing the same value for coordinates where X equals Y.
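
The same trap, sketched in C++ for consistency with the code above:

    #include <cstdint>
    #include <cstdio>

    // Pack a 2D coordinate into one 64-bit key, as in the story.
    uint64_t pack(uint32_t x, uint32_t y)
    {
        return ((uint64_t)x << 32) | y;
    }

    // What Java's Long.hashCode() computes: upper half XOR lower half.
    uint32_t javaLongHash(uint64_t v)
    {
        return (uint32_t)(v ^ (v >> 32));
    }

    int main()
    {
        // For every point on the diagonal x == y the halves cancel out,
        // so all such keys land in the same hash-table bucket.
        for (uint32_t i = 1; i <= 3; ++i)
            printf("hash(%u, %u) = %u\n", i, i, javaLongHash(pack(i, i)));
        // Prints 0 three times.
    }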

Source: https://habr.com/ru/post/222565/

