I came across an interesting question in the Q&A section. My answer to it grew into this article, a fuller answer to the question "how do you organize a search over a large number of parameters, like on Yandex.Market, for example?"
I know that on Habr, and on the web in general, there are many supporters of noSQL solutions (I am not without sin myself), but I still prefer to think first and only then choose a solution.
So, here is what we are "given":
- 120 checkboxes, each a 1/0 option
- 30 radio buttons with the choices "yes / no / doesn't matter"
- 2-3 sliders for specifying a price or size range of some sort
- And most importantly: 12 million records in the database.
- A query like `SELECT * FROM tovar WHERE (wifi = true) AND (led = false) AND (type = 3) AND ... the rest of the parameters ...;` with a runtime that drove the client close to hysterics.
The hysterics came from realizing that we had to handle more than 100 requests per second, and to do that we would have to sell a three-room apartment with a view of the Kremlin and buy more hardware.
So we start thinking about how to save a pile of money and pocket part of it as a bonus.
Note right away: we are not trying to get the final list of matching rows from the database here. We need prefiltering, to speed up the subsequent search and filtering.
Checkboxes
First, we take all our checkboxes, each of which has only two possible values, convert them into bits, and pack them into a single "number". In our case this number is 120 bits wide (round it up to 128 for convenience). This becomes our first "index". With it we can always filter down to the products that satisfy the checkbox conditions. In practice, on average, no more than about 10 thousand products in the entire base pass this filter. The remaining parameters then need to be checked only against those 10 thousand products, instead of shoveling through the whole base.
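The packing and filtering described above can be sketched in a few lines. This is a minimal illustration, not the author's actual code; the function names and bit positions are assumptions.

```python
def pack_checkboxes(values):
    """Pack a list of 0/1 checkbox values into a single integer, one bit per checkbox.
    Python ints are arbitrary-precision, so 120 bits fit in one value."""
    key = 0
    for i, v in enumerate(values):
        if v:
            key |= 1 << i
    return key

def checkbox_match(product_key, wanted_key, wanted_mask):
    """True if the product's bits equal wanted_key on every position selected by wanted_mask.
    Bits outside the mask (checkboxes the user did not touch) are ignored."""
    return (product_key & wanted_mask) == wanted_key
```

For example, if bit 0 is `wifi` and bit 1 is `led`, the query "wifi = true, led = false" becomes `wanted_mask = 0b11`, `wanted_key = 0b01`, and `checkbox_match` does the prefiltering with one AND and one comparison per product.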
Ranges
Any range (height, weight, waist, bust) can be split into intervals and numbered. Split it into 256 parts; that figure is taken out of thin air. Either way, for the range $100 - $1200 we get 2 numbers. If each interval is $100 wide ($1 - $100, and so on, up to a maximum of $25,500), we can write our range as 2|12, which is effectively a 16-bit number. We get one such number per range parameter. If there are only a few of them, we can stop here; if there are 100-10,000, things get harder: we cannot search a 200-20,000-bit key quickly enough. And we won't; we simply compute its md5 and, to speed things up further, fold everything into a 128-bit number (a measly 16 bytes). Combined with the pile of checkboxes, this gives an even more precise preselection.
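A sketch of the two steps above: quantizing a value into one of 256 interval numbers, encoding a range as two such numbers, and collapsing an over-long composite key into a 128-bit digest. The $100 step and the helper names are my assumptions for illustration.

```python
import hashlib

BUCKETS = 256
STEP = 100  # assumed interval width: $100 per bucket, max $25,500

def bucket(price):
    """Map a price to one of 256 interval numbers (fits in one byte)."""
    return min(price // STEP, BUCKETS - 1)

def range_key(low, high):
    """Encode a price range as two bucket numbers packed into a 16-bit value."""
    return (bucket(low) << 8) | bucket(high)

def compress_key(long_key_bytes):
    """Collapse an arbitrarily long composite key (checkboxes + many range keys)
    into a fixed 128-bit MD5 digest that is cheap to store and compare."""
    return hashlib.md5(long_key_bytes).digest()  # 16 bytes
```

Note that after hashing, the index can only answer exact-match questions ("same combination of filters"), which is exactly what a prefilter needs; the precise range check still happens on the small candidate set afterwards.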
Yes / No / Doesn't Matter
The algorithm for values with the additional "doesn't matter" option is almost the same as the checkbox variant, except that in index 1 we store only yes/no (and 0 when "doesn't matter" is selected). In index 2 we store 0 for the "doesn't matter" option and 1 for anything else.
In total we get 2 indexes:
index 1: 01010000001010101010001
index 2: 01111101110111101110111
The homework is to build, out of Xor | And | Or, a query that zeroes out all the insignificant bits using index 2 and compares the result with index 1.
So, out of 12 million records we have found somewhere between 0 and a few hundred that fit the conditions. These records are now worth checking in full (the 120-bit checkbox field, the ranges, and the "yes / no / doesn't matter" values). You will agree that a full check over a couple of hundred rows is much faster than a full scan.
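To spoil the homework a little, here is one way the tri-state comparison can work. This is my sketch of the idea, under the assumption that each product row also stores its own yes/no bits:

```python
def tri_state_match(product_bits, want_bits, care_mask):
    """
    want_bits (index 1): the desired yes/no values; positions marked
                         "doesn't matter" hold 0.
    care_mask (index 2): 1 where the user chose yes or no,
                         0 where they chose "doesn't matter".
    XOR flags every position where the product disagrees with the wish;
    AND-ing with care_mask discards disagreements nobody cares about.
    The product passes when no flagged disagreement survives.
    """
    return ((product_bits ^ want_bits) & care_mask) == 0
```

With `want = 0b001` and `care = 0b011` (bit 0 must be 1, bit 1 must be 0, bit 2 is ignored), a product `0b101` passes while `0b100` does not, regardless of what either has in bit 2.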
...
Profit !!!
This trick of converting a long key into a short one, say by hashing, is convenient enough for prefiltering data. And if we are working with SQL, the prefiltered sample can be kept in an in-memory table (MemTable) for speed, and the full filtering can then be run over that data.
We use a similar algorithm in our shingle-based text filter, where we derive a 32-bit number from the checksum, which is faster to compare than 128-bit values.
A similar technique also works in one specialized document storage system, which can find, within microseconds, the links to documents containing a search word or phrase in a 2.2 GB text base. But more on that, in a little more detail, in another article.