I
’ll add another one to Habr's articles on
Binary Search . It will be a question of custom implementation, it can be useful to everyone who often uses in the work of CDF to compare large lists or to search for data in large arrays.
Prehistory
It all started with the fact that I discovered the so-called. Double-TRUE VLOOKUP trick (double-use trick of VPR and TRUE in the 4th parameter). A detailed description of the algorithm can be found in the article by Charles Williams "Why 2 VLOOKUPS are better than 1 VLOOKUP" (at the end of the article).
Understanding the principle of operation and discovering that this approach can be
thousands of times faster than normal linear search (CDF with the 4th parameter FALSE), I began to think over options to reveal its capabilities. In the course of implementation, we got several suitable tools for contextual advertising, one of which I still continue to improve, and has already devoted a couple of articles to the project on Habré. Pulp fiction is recommended to SEO specialists and specialists in contextual advertising (I’ll just make a reservation, the links in the articles are already outdated versions, the latest version is conditionally 6.0, links to download all versions, including the latest, will be at the end of this article):
"
Analysis of large semantic nuclei, or" Robot Recognizer ""
Lemmatization in Excel, or" Robot Recognizer 3.0')
So, despite the incredible speed of these files (incredible for Excel), their creation required the use of the same incredibly long mega-formulas as one of the components of the macro (the last of the above articles gives an example - a formula of 3215 characters). And the culprit is the complex syntax of the function.
If you get a hand with its use, it ceases to seem complicated, but inexperienced users, for whom this approach is intended, are unlikely to want to understand it.
The syntax is:
If (CDF (required; array; 1; TRUE) <required; ""; CDF (required; array; n; TRUE))
where n is the ordinal number of the column from which we want to return the value opposite the desired key.
Instead of “TRUE” in the 4th parameter, you can use “1” for the nominal reduction of the length of the formulas, this does not change their essence.
If you announce the progress of the formula, the following will be:
“If a binary search for a key in the first column of an array returns a value
smaller than the key itself, we return an empty string. Otherwise, we return the result of a binary search for a key with an offset n ".
The approach is used to return
no values if the required key is not in the array, because Often, if the desired is not found - we
do not need a lower value. So to say, all or nothing. In short, this is the essence of the "trick".
Let me remind you that at stake is a speed increase of three to four-digit numbers. If we approach purely mathematically, on a 2 ^ 20 string array, a regular binary search will do ~ 10 calculations, the formula above will be about 20, while a linear search will be ~ 500.000, i.e. the formula increment is higher by 25,000 times. If the bare numbers are not impressive, a more eloquent equivalent comparison is 1 second versus ~ 7 hours.
In practice, the increase is not so significant (at the end of the article there is a link to an article where different methods were compared). This is largely due to the CPU time spent on additional procedures that the program performs (for example, writing values to cells). BUT the increase is still critical (~ 4000 times).
But at the same time we have a complex, completely non-usable syntax. Not all mortals were given CDF, what to speak of combinations of 2 CDF with IF.
I solved the question with complex syntax with the help of VBA - I wrote UDF (user-defined function, user function), which hides our conditional constructions under the hood, leaving us with the usual syntax of well-known CDF.
UDF Code:
Public Function (a, b As Range, c As Integer) As String If Application.VLookup(a, b.Columns(1), 1, True) = a Then = Application.VLookup(a, b, c, True) Else = "" End If End Function
To use the function in your Excel file, you need to add a module to the current workbook, in which you add the above code, or download an example file in the
link , in which it is already done for you.
The function accepts 3 parameters as input, the syntax is similar to the usual CDF, except for the 4th parameter, since it is not needed:
(required; array; column number) .
So, at the output, we got a function with the usual syntax and the usual behavior, but with a speed that is tens to hundreds to thousands of times faster than the usual CDF, depending on the length of the array. With the only limitation - the function works correctly only on an array sorted from the smallest to the largest. Often, the last moment is not an insurmountable obstacle.
Use, comment. I would be happy to improve the algorithm and similar implementation ideas.
I am working on optimizing the search in Python, currently I have not found a standard dictionary search faster, I will be happy for comments on this part.
Links
»
Article about the double CDF trick" Why 2 CDF is better than 1 "»
Comparison of different search methods, including the“ double trick ”trick»
The latest version of the" Robot Recognizer "and all previous , and some other tools for contextual advertising, including the subject of this article, on one link.