The Knut-Morrs-Pratt algorithm is used to search for a substring (pattern) in a string. It seems that it can be simpler: we move along the line and successively compare the characters with the sample. It did not coincide, we move the beginning of the comparison by one step and compare again. And so on until we find a pattern or reach the end of the line.
The function is approximately as follows:
Simple search case
'needle' - sample
'stack of stack of
needle needle stack of stack
needle ' - search string
Difficult search case
aabbaab
aabaabbaaabaabaabaabaabbaabb
The function works and is quite fast if the samples and the string are “good”. Good ones are when the inner loop is interrupted quickly (at step 1-3, say, as in the simple case). But if the sample and the line contain frequently repeated nested pieces (as a complicated case above), then the internal loop terminates towards the end of the sample and the search time is estimated as O (<sample length> * <string length>). If the line length is 100 thousand, and the sample length is 100, then we get O (10 million). Of course, it’s really rare to find a sample with a length of 100, but in “Olympiad” tasks, “search” is a common thing, so a simple search there is often not suitable.
')
And if a string is a long text, and a sample is a fragment of a word, and all the occurrences of this fragment need to be found, and on the fly, as the word is typed (did you notice how quickly browsers do this)? The KMP algorithm finds all occurrences of a sample in a line and its speed O (<sample length> + <string length>), therefore on large texts / samples or on weak processors (as in low-budget cellular ones) it is out of competition.
Now look at the title? Why "small"? Because the highlight of the CMP is a prefix function, but it is really small. And why "miracle"? Because, he seems to be solving a completely different task, and this solution, after some wonderful trick, turns into a solution to the problem of finding all instances of the sample in a row.
In order to understand what and how the prefix function does, let's look at the complex string
a | a | b | a | a | b | a | a | a | a | b | a | a | b | but | a | a | b |
one | 2 | 3 | four | five | 6 | 7 | eight | 9 | ten | eleven | 12 | 13 | 14 | 15 | sixteen | 17 | 18 |
0 | one | 0 | one | 2 | 3 | four | five | 2 | 2 | 3 | four | five | 6 | 7 | eight | 9 | 3 |
The line below it is the number (position) of the character in the line (for convenience of the description of the algorithm, we consider the number from 1), and the lowest line is the array of M prefix lengths, the key to understanding the prefix function.
Take the character number 7 (this is a) and for K from 1 to 6 we consider prefixed lines (a substring starting with the first index of the line) and suffixes (the substring whose last character is in the line at position 7 (this is “our” character a)) length K.
K | prefix | suffix |
one | but | a | everything is simple: the prefix is ​​the first character of the string, and the suffix is ​​our character a |
2 | aa | ba | |
3 | aab | aba | |
four | aaba | aaba | the longest match, and here and below the suffix and prefix began to overlap (as fragments of the search string) |
five | aabaa | baaba | |
6 | aabaab | abaaba | |
Note: for length 4, the suffix (as a sequence of characters) matches the prefix and this is the maximum K value at which the suffix matches the prefix. It is this that is entered in the corresponding position (7) of the array of prefix lengths.
You can see that for K = 7, the prefix will also match the suffix, since this is the same string. But this trivial case does not suit us, since for the operation of the algorithm, it is necessary to use substrings
Denote the S-source string, S (n) -prefix (prefix) of string S of length n, S [n] -character at position n of string S. M [n] value in the array, S (M [n]) - the same a string that is a prefix and suffix of the maximum length for position n (for brevity, we denote it by P (n)). The string P (n) is, as it were, “virtual”; it is not formed and is not written anywhere. This is just the initial fragment1 of the source string S of length M [n]. And this initial fragment1 coincides (as a sequence of characters) with fragment2 of length M [n], the last character of which is in position n. If M [n] = 0, then there is no coincidence.
We have: position 7 of the array is filled with the value M [7] = 4, the longest line P (7) = '
aaba ' of length 4 (of course), go to position 8 and fill in M ​​[8]. You can stupidly count all the prefixes and suffixes in length from 1 to 7, compare them and write the maximum length to position 8. But we will go the other way (after the ILC). Let the maximum length string P (8) of length k be found, which is the prefix and suffix for position 8. The string p7 of the first k-1 characters is the prefix and suffix for position k-1. Not the fact that for the 7th position it is the longest. However, if it turned out that p7 = P7, then P8 is an extension of P7 by one character. To check whether P7 can be extended by one position, it is necessary to check whether the character added to the suffix matches (this is the symbol S [8] =
a ) with the next prefix symbol. The next character of the prefix
a is in the position M [7] + 1 = 5 (think why this is so). If it coincided (and in our case it coincided), then the task is completed - M [8] = M [7] +1, and P (8) = P (7) + character in 8 position S [8] =
a . We
get P (8) = '
aabaa '. With successful expansion, you need only one comparison to calculate the next value of the array. By the way, it follows that when moving along an array, the values ​​may increase by a maximum of 1.
For convenience, I will repeat the complex line so that you do not need to move up and down.a | a | b | a | a | b | a | a | a | a | b | a | a | b | but | a | a | b |
one | 2 | 3 | four | five | 6 | 7 | eight | 9 | ten | eleven | 12 | 13 | 14 | 15 | sixteen | 17 | 18 |
0 | one | 0 | one | 2 | 3 | four | five | 2 | 2 | 3 | four | five | 6 | 7 | eight | 9 | 3 |
Now another case - P8 could not be expanded; the character S [9] =
a did not match the character of the string S in the position M [8] + 1 = 6
b . The suffix expands easily (since the new character is simply appended to the end), and with the prefix of the problem, since the character added to the suffix may not match the next prefix character. If the prefix P (k) does not fit, you need to find another such prefix, shorter, which has the same suffix and try to expand it. But the prefix is ​​shorter, and with the same suffix it is S [M [M [k]]), since When the array M is filled, each element contains the length of the maximum prefix with the same suffix. Therefore, if we failed to expand S (M [k]), we also try to expand S (M [M [k]]), etc., until it matches or until we get a zero length (then the next character S should be simply compared with 1m character string S). The loop of searching for suitable prefixes ends very quickly, because all the necessary information for this is already sitting in the array M.
For our line, P (8) is just an extension of P (7) by one character, it took 1 comparison. However, P (8) could not be expanded to P (9), since S [9] =
a , and S [M [8] + 1 = 6] =
b . If the prefix P8 of length M [8] = 5 did not fit, try the prefix of length M [5] = 2. It also does not fit: S [2 + 1] =
b . We try the prefix of length M [2] = 1 and it can be expanded because S [1 + 1] =
a . Therefore, M [9] = 2, one more than the length of the expanding prefix. For filling M [10] you need 2 comparisons (you can check). But to fill items from 11 to 17, you need one comparison. As a result, the calculation of the array values ​​takes time of the order O (the length of the string).
So, the filling algorithm is more or less clear.
Go to the trick.
To find the sample
aabaa in the string
aabaaaaaaaaaaaaaaab we glue the sample with the string like this
aaaaa @aaaaaaaaaaaaaaaaab and call for it a prefix function to fill the array.
a | a | b | a | a | @ | a | a | b | a | a | b | a | a | a | a | b | a | a | b | but | a | a | b |
one | 2 | 3 | four | five | 6 | 7 | eight | 9 | ten | eleven | 12 | 13 | 14 | 15 | sixteen | 17 | 18 | nineteen | 20 | 21 | 22 | 23 | 24 |
0 | one | 0 | one | 2 | 0 | one | 2 | 3 | four | five | 3 | four | five | 2 | 2 | 3 | four | five | 3 | four | five | 2 | 3 |
The symbol
'@ ' plays the role of a separator, it is certainly not in the sample, or in the search bar (you need to choose such a symbol). Let's look at positions 11, 14, 19, 22 of the array. The values ​​in the array are equal to 5, which means that the suffix of length 5 (fragment of the search string) coincides with 5 characters of the prefix. And the 5 characters of the prefix - this is the pattern for the search! The search algorithm is obtained as follows: we glue together the sample and the search string using a separator, pass the gluing of the prefix function and then look for elements in the array that are equal to the sample length.
You can see that there will be no values ​​greater than the sample length due to the separator symbol, equal to the sample length can appear only in the positions corresponding to the initial search string. The glued string has a length of <sample length> + <string length>, so the calculation time is estimated as O (<sample length> + <string length>), as mentioned at the beginning of the article. The volume of the required prefix buffer function is <sample length> + <string length>, but you can modify the prefix function so that there is enough buffer <sample length> (example of modification in the appendix)
Prefix function
And now examples of prefix functions. The difference from the description above is that, as is customary in C-languages, indices are counted from 0.
C ++ Example
The function returns a vector of lengths of prefixes of strings, and the string is passed as string (no need to calculate the length)
vector<size_t> prefix_function (string s) { size_t n = s.length(); vector<size_t> pi(n);
C example
The function returns nothing. The first parameter is a pointer to a string, the second is a pointer to a prefix length array to be filled in, the third p is the length of a string and array.
// size_t, int void prefix_function (char *s, int *pi, size_t n) { pi[0]=0; // i- ( i-1) i. // p[0]=0 , p[1]=1, for (size_t i=1; i<n; ++i) { int j = pi[i-1]; while ((j > 0) && (s[i] != s[j])) // j = pi[j-1]; // ( ) if (s[i] == s[j]) // ++j; pi[i] = j; } }
This code was added based on the discussion.
My story about a small miracle - the algorithm of the KMP came to an end. Of course, this article on the CMP is far from the first and certainly not the last. Here are two articles here on habrakhabr:
Search for substrings. Knuth – Morris-Pratt Algorithm and Substring Search and Related Issues .
But I hope someone will find this article useful for themselves, and someone (after all, it may happen) will apply the ILC. Even if he did not fully understand the internal mechanism of his work (it's never too late to figure it out).
The KMP algorithm is not the only fast search algorithm, but it
is fast
enough (for tasks like olympiad ones) and simple at the same time. The
Boyer-Moore algorithm is close to the ILC in complexity and for a certain range of tasks (where the sample does not contain repeating fragments) it is faster. But on the other hand, for another set of tasks (where the pattern and the search string contain repeated sequences, as in the examples for the article), it is noticeably inferior to the CMP. Finally, there are tasks where the choice of one or the other is a matter of taste (which is not disputed). In some cases (or even areas), the
Aho-Corasic algorithm can be more efficient for both the CMP and Boyer-Moore. But those who really need this algorithm do not need a popular presentation,
IMHO .
Addition to the discussion
You can modify the prefix function so that it uses the buffer not O (line length + sample length) but O (sample length). In this case, the memory should be allocated only to store the sample prefix lengths, and the search string prefix lengths are not remembered in the array. The length for the current position is stored in a variable and as soon as it is compared with the length of the sample (i.e., the sample is found), the prefix function calls the processing function of the found fragment. Modified function
prefix_find added to the article. Actually, I never needed to save on memory allocation for an array, but I do not rule out that someone might need it.
Mayorovp drew my attention to the opportunity to save memory, and he also gave a link in his
message to the implementation of a variant of the prefix function, which displays the positions of the found fragments in cout.
As noted, staticlab changed the type of variables to size_t (this would actually be more correct).
For those who have read the article to the end, I will provide illustrations showing how prefix-suffix blocks are organized in fairly complex cases:
IllustrationsThe lines of zeros and ones are the same, only in the first two illustrations are indices with 0, and in the following - with 1. In fact, this is not fundamental, since the
lengths of substrings are written to the array, and the numbering method of characters in the string does not matter.
The last illustration is a bird's-eye view of the prefix-string suffixes.
habrhabhabrhabrhabrhabhabrhabhabrhabhabrhabrhabrhabhabrhabra
without arrays of indices and lengths.
