Search in line. Implementation in CPython

Quite a while ago, at one of the presentations of graduates of one of the so-called IT academies, the speaker was asked about the details of the implementation of the search for a substring in the string of toli in Java, and in the net of .Net. The graduate of course could not answer anything intelligibly, and I put the question in the already long todo list.

The time has passed, Python has become more relevant to me enterprise platforms, so that your attention analysis of the search string algorithm in Python.

The implementation of the string object can be found here - Objects / stringobject.c . Definition of functions that are responsible for different types of search ( find , rfind , __contains__ ) - Objects / stringlib / find.h. All these functions (along with split , replace , etc.) use the search implemented in Objects / stringlib / fastsearch.h . We will deal with them.

The algorithm itself is named ( The stringlib Library ) by ( Fredrik Lundh ) by the mix of Boyer-Moore and Horspul algorithms with several improvements and simplifications. The algorithm also uses a bloom filter .
')
And so, let's go. Function Header:

fastsearch(const STRINGLIB_CHAR* s, Py_ssize_t n, const STRINGLIB_CHAR* p, Py_ssize_t m, Py_ssize_t maxcount, int mode) /* s - ,   , n -  , p - ,   (  ), m -  , maxcount -    ,       , mode -  /  /    */

Everything seems clear. Let's look at define operators:

 /*      mode */ #define FAST_COUNT 0 #define FAST_SEARCH 1 #define FAST_RSEARCH 2 /* STRINGLIB_BLOOM_WIDTH    \(   )    32, 64, 128 */ #define STRINGLIB_BLOOM_ADD(mask, ch) ((mask |= (1UL << ((ch) & (STRINGLIB_BLOOM_WIDTH -1))))) #define STRINGLIB_BLOOM(mask, ch) ((mask & (1UL << ((ch) & (STRINGLIB_BLOOM_WIDTH -1)))))

How this implementation of bloom filters works. Each character is assigned a number on the basis of its last log (STRINGLIB_BLOOM_WIDTH) bits. For example, for a this is 1 :

 >>> ord('a') & 31 1

For q - 17 :

 >>> ord('q') & 31 17

Using the logical "OR" STRINGLIB_BLOOM_ADD accumulates information about the characters seen in the mask filter. Later you can use STRINGLIB_BLOOM to check if a character has been added to the filter. Of course, this check may have false positives ( STRINGLIB_BLOOM returns not 0, but the character is not really in the string \ filter):

 >>> ord(']') & 31 29 >>> ord('}') & 31 29

But it is quite cheap and helps reduce the number of comparisons.

Now you can go to the algorithm itself. Obvious logic for cases m> n and m <= 1 :

  w = n - m; if (w < 0 || (mode == FAST_COUNT && maxcount == 0)) return -1;

and (part of the code is omitted so as not to tear the article):

  if (m <= 1) { if (m <= 0) return -1; if (mode == FAST_COUNT) { /* ... */ } else if (mode == FAST_SEARCH) { for (i = 0; i < n; i++) if (s[i] == p[0]) return i; } else { /* FAST_RSEARCH */ /* ... */ } return -1; }

After we have a few new variables and the mask is filled in for the required substring:

  mlast = m - 1; skip = mlast - 1; mask = 0; for (i = 0; i < mlast; i++) { STRINGLIB_BLOOM_ADD(mask, p[i]); if (p[i] == p[mlast]) skip = mlast - i - 1; } STRINGLIB_BLOOM_ADD(mask, p[mlast]);

skip is a number, defaults to the index of the last character of the search string minus 1. If there are characters in the string that are equal to the last, then skip is equal to the difference between the indices of the last character and the last character of the line.

Description of the idea of the main algorithm from Wikipedia:

Scan from left to right, right to left comparison. The beginning of the text (string) and the template is combined, the check starts with the last character of the template. If the characters match, the last-to-last character of the pattern is compared, etc. If all the characters of the pattern match the superimposed characters of the string, then the substring is found and the search is over.

If a character in the pattern does not match the corresponding character in the string, the pattern shifts a few characters to the right, and the test starts again from the last character.

Actually the implementation of this description (above w = n - m ):

  for (i = 0; i <= w; i++) { if (s[i+m-1] == p[m-1]) { /* ,      */ for (j = 0; j < mlast; j++) if (s[i+j] != p[j]) break; if (j == mlast) { /*  ! */ if (mode != FAST_COUNT) /*       - ,     */ return i; /*     */ count++; if (count == maxcount) return maxcount; i = i + mlast; continue; } /*         ,      */ /*      ,        */ if (!STRINGLIB_BLOOM(mask, s[i+m])) /*  ,           */ i = i + m; else /*  ,        'skip' */ i = i + skip; } else { /*     .      ,        */ if (!STRINGLIB_BLOOM(mask, s[i+m])) /*  ,           */ i = i + m; /*   , i++ */ } }

A bit of visualization. Let us work with the string “drum” and look for the substring “aba”:

     skip = 1 i = 0: ______________ |||||||| | -  , i++ _______ |||| i = 1: ______________ |||||||| | -  ,   _______ |||| ______________ |||||||| | -  .    ""  ""   ,   skip,   for   i += 2 _______ |||| i = 3: ______________ |||||||| | -  ,   _______ |||| ______________ |||||||| | -  ,   _______ |||| ______________ |||||||| | -  ,   _______ ||||

Well that's all.

If you like it, there are a lot of places in the Python source that are interesting to dig into. For example, the list.sort implementation in Objects / listobject.c or dict.get in Objects / dictobject.c .

Good luck! :)

Source: https://habr.com/ru/post/132156/

All Articles

Search in line. Implementation in CPython

More articles: