String Search Algorithms

Setting the search task in the string

Often one has to deal with a specific search, the so-called string search (search in a string). Let there be some text T and a word (or image) W. It is necessary to find the first occurrence of this word in the specified text. This action is typical of any word processing system. (The elements of the arrays T and W are symbols of some finite alphabet — for example, {0, 1}, or {a, ..., z}, or {a, ..., I}.)

The most typical application of such a task is a documentary search: a collection of documents consisting of a sequence of bibliographic references is given, each link is accompanied by a “descriptor” indicating the subject of the corresponding link. It is necessary to find some keywords that occur among the descriptors. Could take place, for example, the query "Programming" and "Java". Such a request can be interpreted as follows: are there articles with “Programming” and “Java” descriptors.

String search is formally defined as follows. Let an array of T of N elements and an array of W of M elements be given, with 0 <M≤N. The search for the string detects the first occurrence of W in T, the result is the index i, indicating the first since the beginning of the string (from the beginning of the array T) coincidence with the image (word).
Example. It is required to find all occurrences of the sample W = abaa in the text T = abcabaabcabca.

The sample enters the text only once, with a shift of S = 3, the index i = 4.

Direct search algorithm

')
The idea of the algorithm:
1. I = 1,
2. compare the I-th character of the array T with the first character of the array W,
3. match → compare second characters and so on,
4. mismatch → I: = I + 1 and go to step 2,

Algorithm end condition:
1. successive M comparisons are successful,
2. I + M> N, that is, the word is not found.

Algorithm complexity:
The worst case. Let the array T → {AAA ... .AAAB}, length │T│ = N, sample W → {A ... .AB}, length │W│ = M. Obviously, to find a match at the end of the line, you need to make about N * M comparisons, that is, O (N * M).

The disadvantages of the algorithm:
1. high complexity - O (N * M), in the worst case - Θ ((N-M + 1) * M);
2. after a mismatch, the scan always starts with the first character of the pattern and therefore may include T characters that were previously viewed (if the string is read from the secondary memory, then such returns take a lot of time);
3. The information about the text T, obtained by checking this shift S, is not used in any way when checking subsequent shifts.

Algorithm D. Knuth, D. Maurice and V. Pratt (CMP-search)

The algorithm of the CMP-search actually requires only about N comparisons, even in the worst case.
Example.
(Symbols subjected to comparison are underlined.)

After the initial part of the image W partially coincides with the corresponding characters of the string T, we actually know the passed part of the line and can “compute” some information (based on the image W itself), with which we can quickly move forward through the text.

The idea of the CMP-search - with each discrepancy between two characters of text and an image, the image shifts over the entire distance traveled, since smaller shifts cannot lead to complete coincidence.

Features of the CMP-search:
1. order (N + M) character comparisons are required to get the result;
2. The scheme of the CMP-search gives a genuine gain only when a certain number of matches preceded the failure. Only in this case is the image shifted by more than one. Unfortunately coincidences are much less common than mismatches. Therefore, the gain from the CMP-search in most cases of texts is very insignificant.

Algorithm of R. Bower and D. Moore (BM-search)

In practice, the BM search algorithm is most effective if the pattern W is long and the power of the alphabet is sufficiently large.

The idea of the BM-search - comparison of characters begins from the end of the sample, and not from the beginning, that is, the comparison of individual characters occurs from right to left. Then, using a certain heuristic procedure, the value of the right shift s is calculated. Again, characters are compared, starting at the end of the sample.

This method not only improves the handling of the worst case, but also gives a gain in intermediate situations.
Almost always, except for specially constructed examples, the BM search requires significantly fewer N comparisons. In the most favorable circumstances, when the last character of a sample always falls on a mismatched text character, the number of comparisons is (N / M), in the worst case O ((NM + 1) * M + p), where p is the power of the alphabet .

Rabin-Karp algorithm (RK-search)

Let the alphabet D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, that is, each character in the alphabet is a d – ary digit, where d = │ D.

Example. Let the sample has the form W = 3 1 4 1 5
We calculate the values of numbers from the window of length | W | = 5 modulo q, q is a prime number.

23590 (mod 13) = 8, 35902 (mod 13) = 9, 59023 (mod 13) = 9, ...
k1 = 314157 (mod 13) - the occurrence of the sample,
k2 = 673997 (mod 13) - idle actuation.

The equality ki = kj (mod q) does not imply that ki = kj (for example, 31415 = 67399 (mod 13), but this does not mean that 31415 = 67399). If ki = kj (mod q), then you still need to check whether the strings W [1 ... m] and T [s + 1 ... s + m] are the same.
If the prime number q is large enough, then the additional cost of analyzing idle positives will be small.
In the worst case, the operating time of the RK algorithm is Θ ((N-M + 1) * M), on average, it works fairly quickly - in O (N + M) time.

Example: How many idle triggers k will make the algorithm RK, if
q = 11, 13, 17. Let W = {2 6}

26 mod 11 = 4 → k = 3 idle trips,
26 mod 13 = 0 → k = 1 idling,
26 mod 17 = 9 → k = 0 idle responses.

Obviously, the number of idle triggers k is a function of the value of the prime number q (if the sample processing function is mod q) and, in general, the type of function for processing the sample W and the text T.

Source: https://habr.com/ru/post/111449/

All Articles