
Using Binary Search to Optimize Data Retrieval Queries

Introduction


Optimizing work with various DBMSs is a popular topic these days. Numerous forums host discussions about "the best DBMS in the world", but these often devolve into unfounded cries of "I have learned the meaning of life and realized that the best data store is X".

Yes, NoSQL solutions are actively developing now and allow us to do a lot, but this article is not about them. It so happened that I changed jobs and, along with the new position, inherited a very interesting project built on PHP + MySQL. It contains many good decisions, but it was written without any expectation of a large audience, and over several years the number of active users approached seven digits. Since the project is a kind of social network with game elements, the users table was far from the "heaviest" one: I inherited tables with tens of millions of rows of user items, personal messages, billing records, and so on. The project was refactored and split across several servers, with significant results. Everything is stable now.

But recently a new task landed in my inbox: collecting statistics. After analyzing the requirements, I realized that a single query joining three impressively sized tables with INNER JOINs would suffice. Each table contained on average 40 million records, so in the worst case the intermediate result could reach (4 * 10^7)^3 = 64 * 10^21 rows. This is a colossal figure, and loading the DBMS with such a query just to collect statistics is an unaffordable luxury.
Below I want to present a solution to this problem, which came to mind when I recalled my first-year computer science classes.



(The project uses MySQL, but the algorithm is not tied to any specific DBMS.)

What is binary search



I think many of you have written labs on finding an element in an array with binary search. I will try to summarize its essence.

Suppose we have a sorted array of n elements, indexed from 1 to n:

first = 1 (lower boundary of the search range)
last = n (upper boundary of the search range)

We need to find the index of the element with the value f.

At each step we calculate the middle of the range:

mid = round((first + last) / 2)

Then we read the value of the element at mid and compare it with the desired f; the search range is cut in half:

if value(mid) > f then
    last = mid
else
    first = mid

These steps are repeated until one of the conditions is met:

- the value at mid differs from the desired f by no more than the allowed epsilon;
- the range can no longer be halved (first and last have become adjacent).

I think the idea is clear. We significantly speed up the search by halving the range at every step, sacrificing some accuracy (for statistics this is not critical if a couple of elements out of millions are missed; otherwise, set epsilon to zero and finish the search only after reaching the last level of the implicit tree).
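
To make the description concrete, here is a minimal sketch of the algorithm in PHP, the language the project runs on; the function and variable names are mine, not taken from any real code:

<?php
// Binary search over a sorted array.
// Returns the index of an element whose value is within $epsilon of $f,
// or the index of the nearest boundary if no such element is found.
function binarySearch(array $a, int $f, int $epsilon = 0): int
{
    $first = 0;              // lower boundary of the search range
    $last  = count($a) - 1;  // upper boundary of the search range

    while ($last - $first > 1) {
        $mid = intdiv($first + $last, 2);

        if (abs($a[$mid] - $f) <= $epsilon) {
            return $mid;     // close enough: stop early
        }
        if ($a[$mid] > $f) {
            $last = $mid;    // the desired value is in the left half
        } else {
            $first = $mid;   // the desired value is in the right half
        }
    }

    // The range can no longer be halved; return the nearest edge.
    return abs($a[$last] - $f) < abs($a[$first] - $f) ? $last : $first;
}

With epsilon = 0 this degenerates into the exact search from the labs; a non-zero epsilon gives the "good enough for statistics" behaviour described above.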

On to practice



So, suppose we need to INNER JOIN 3 tables and then apply the condition "column x is between 10 and 20". Column x has no index, so the query will run for a very long time. This is where the simple trick comes to the rescue.

We take the table containing this column and run a binary search on it to find the range of primary keys whose rows satisfy the condition 10 <= x <= 20. Since each probe is a lookup by the indexed primary key, we obtain the desired pair of values very quickly.

Strictly speaking, binary search finds a single element, not a range, but nothing prevents us from finding the first row with a value of 10 and the last row with a value of 20; their primary keys become the boundaries of the range.
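
Here is how this might look on the project's PHP + MySQL stack. This is a sketch only: the table and column names (items, id, x) and the PDO connection are invented for illustration, and the probe query deliberately reads the first existing row at or after the probed id so that gaps in the primary key do not break the search:

<?php
// Binary search over a table by primary key. Assumes the values of
// column x grow along with the primary key (see the conclusions below).
// All names here are hypothetical.
function findBoundaryId(PDO $pdo, int $target, bool $lower): int
{
    // MIN/MAX on the primary key are resolved from the index alone.
    [$first, $last] = array_map(
        'intval',
        $pdo->query('SELECT MIN(id), MAX(id) FROM items')->fetch(PDO::FETCH_NUM)
    );

    // Probe by indexed primary key; tolerant of gaps in the id sequence.
    $probe = $pdo->prepare(
        'SELECT x FROM items WHERE id >= ? ORDER BY id LIMIT 1'
    );

    while ($last - $first > 1) {
        $mid = intdiv($first + $last, 2);
        $probe->execute([$mid]);
        $x = (int) $probe->fetchColumn();

        // Lower boundary: the first id with x >= target.
        // Upper boundary: the last id with x <= target.
        if ($lower ? ($x >= $target) : ($x > $target)) {
            $last = $mid;
        } else {
            $first = $mid;
        }
    }

    return $lower ? $last : $first;
}

$minId = findBoundaryId($pdo, 10, true);   // first id with x >= 10
$maxId = findBoundaryId($pdo, 20, false);  // last id with x <= 20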

With this range, we go back to our query, but now instead of the condition WHERE x >= 10 AND x <= 20 we write WHERE id_x BETWEEN min_id_x AND max_id_x, where min_id_x and max_id_x are the lower and upper bounds of the range that satisfies the condition.
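
With the boundaries in hand, the heavy statistics query changes only in its WHERE clause. Continuing the same sketch (the joined tables orders and billing are, again, purely hypothetical):

<?php
// Before: ... WHERE i.x >= 10 AND i.x <= 20  -- full scan, x is unindexed.
// After: a range scan over the primary key found by the binary search.
$stmt = $pdo->prepare(
    'SELECT COUNT(*)
     FROM items AS i
     INNER JOIN orders  AS o ON o.item_id  = i.id
     INNER JOIN billing AS b ON b.order_id = o.id
     WHERE i.id BETWEEN :min_id AND :max_id'
);
$stmt->execute(['min_id' => $minId, 'max_id' => $maxId]);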

What we get: the selection now runs against the primary key rather than the unindexed column x, saving the time that would otherwise be spent scanning the whole table. A similar procedure can be applied to the other conditions in the query.

I see no point in giving full production code here: the canonical binary search can be found on Wikipedia, and the query itself is nothing supernatural; the sketches above only illustrate how the pieces fit together.

Conclusions



This technique lets you transfer conditions from unindexed fields onto primary keys, which significantly speeds up the query. But the method cannot be considered a panacea.

First, it is hard to build a universal solution for all queries: each case requires accounting for the specifics of a particular table, so time must be spent on optimization every time.

Second, the method does not suit every situation, since the rows in the table must be sorted in a specific order: the values of the searched column must grow along with the primary key.

Source: https://habr.com/ru/post/134417/

