📜 ⬆️ ⬇️

How to find a girl in 250 microseconds

In contrast to Europe and America, cautious attitude prevails in dating sites. However, the hope of clicking on the magic button and finding love does not go out in the hearts of many. And we have to justify this hope. Of course, we do not promise to immediately find the ideally suitable “half”, but it is simply obliged to offer dozens, hundreds or, in some cases, thousands of options that meet your exact requirements. What we are doing, and very quickly.

The average database search of 11 million questionnaires with 4 to 30 parameters each takes us an average of 3.5 milliseconds. And at the same time, besides the search for the demon servcher of the “Mamba”, it performs the following, including not quite traditional tasks:

In spite of the fact that our search was developed from the very beginning on its own, from time to time there were thoughts to use something already known, run-in and guaranteed to be effective. Well, and if we think about searching, Sphinx comes to mind first. At some point we decided to check whether he could give us significant advantages, and were somewhat surprised at the results obtained. On a test base of 2 million Sphinx questionnaires showed an average result of 100-120 ms in, depending on the request, while we (only because it was a test) included only a part of the fields in the index. Our search on the same base and on the same equipment spends from 0.2 to 1.1 ms per request.

Sphinx: source - mysql, int sql_attr_uint. .

As further practice showed, half the success of the search lies in creating the right index. In our understanding, the correct index is the one that best fits the search tasks and provides the highest processing speed. After a general analysis of the base, it was decided to divide the index into several parts:

But the main tricks are hidden in the structure of each part of the general index, which are constructed as follows:

What does this mean for search? For example, we have an index for the fields “gender” and “age”, where gender can take the values ​​“M” and “F”, and age - from 18 to 21. Imagine that a user wants to find a girl of 20 years old (i.e. conditions satisfy only one value of each field).
')
Suppose we have such questionnaires (1 block is shown, since the minimum allowable age is 18, then 18 years old is coded as 0000, 19 as 0010, 20 as 0100, and so on)



After processing, the following index is obtained (the first bit is selected in bold, before and after):

image


For some fields, the value is kept non-compact. For example, the present age field takes 64 values. You can store 1 bit for a possible value, that is, the field will take 64 bits, but the search will take only 1 operation, and a search by range - 2 operations. A more traditional option is to store it as a number, then it will take log2 (64) = 6 bits, but a search for a specific age will take 6 operations, and a search over a range of more than 12 (the exact value depends on the length of the comparison condition recorded as DNF).

The search process itself is as follows:

The most interesting is the field verification code. In the results segment, each questionnaire corresponds to 1 bit, which indicates whether the questionnaire is suitable for the query or not.

Processing the request is as follows:



After that, it remains only to go through the bitmap with the result and see which forms have 1.

In our example, the result will be 00010, that is, only the questionnaire number 4 satisfies the query.

The “Mamba” search daemon is invoked from a significant portion of all pages on the site. I think it is worth noting that, along with several other “main” demons, it uses the JSON-RPC protocol and generally creates a “single demonic space”.

The real search statistics is as follows:

Requests to search by profile id
Fig. 1 Requests for search by profile id (schedule from one server)

Requests for search by parameters
Fig. 2 Requests for search by parameters (schedule from one server)

In total, the search runs on two machines, the peak performance of one is 20k of searching for profiles using parameters with a full calculation of the number of profiles found per second (this is the longest operation, the others work much faster). The workload is about 800 rps of search + 1000 rps to get a place + 1500 rps to get personal data.

Finding the place of the questionnaire in the search results
Fig. 3 Displaying the place of the questionnaire in the search results (schedule from one server)

image
Distribution of the total response time (i.e., taking into account network interaction) from all (two) search servers. The average is 2.5 ms, the median is 1.5 ms, 5% of requests take more than 10 ms, and 1% of requests take more than 15 ms.

Source: https://habr.com/ru/post/144084/


All Articles