📜 ⬆️ ⬇️

Search for a substring. Knuth – Morris-Pratt Algorithm

In information retrieval tasks, one of the most important tasks is to find an exactly specified substring in a string. The primitive search algorithm for substrings in a string is based on the enumeration of all substrings whose length is equal to the length of the search pattern, and character-by-character comparison of such substrings with a search pattern. By tradition, the search pattern or sample is usually designated as a needle (English “needle”), and the string in which the search is conducted - as a haystack (English “haystack”). In Python, the primitive algorithm looks like this:

index = -1 for i in xrange(len(haystack)-len(needle)+1): success = True for j in xrange(len(needle)): if needle[j]<>haystack[i+j]: success = False break if success: index = i break print index 


Let n = | haystack |, m = | needle |. The simplest search algorithm, even at best, makes n – m + 1 comparisons; if there are many partial matches, the speed drops to O (n * m).
')
The algorithm considered further, although it has low speed on “good” data, but this is offset by the absence of regression on “bad”. The Knut-Morris-Pratt algorithm is one of the first linear estimation algorithms in the worst case. Before proceeding to the description of the algorithm, it is necessary to consider the concept of prefix-function.

The prefix function of the string π (S, i) is the length of the longest prefix of the string S [1..i], which does not coincide with this string and at the same time is its suffix. Simply put, this is the length of the longest beginning of the line, which is also its end. For string S, it is convenient to represent the prefix function as a vector of length | S | -1. We can consider the prefix-function of length | S |, putting π (S, 1) = 0. An example of the function prefix for the string "abcdabcabcdabcdab":

S [i]abcdabcabcdabcdab
π (S, i)0000one23one23fourfive67fourfive6


Suppose that π (S, i) = k. Note the following properties of the prefix function.
  1. If S [i + 1] = S [k + 1], then π (S, i + 1) = k + 1.
  2. S [1..π (S, k)] is the suffix of the string S [1..i]. Indeed, if the line S [1..i] ends with the line S [1 ... π (S, i)] = S [1..k], and the line S [1..k] ends with the line S [1..π (S, k)], then the string S [1..i] ends with the string S [1..π (S, k)].
  3. ∀ j∈ (k, i), S [1..j] is not a suffix of the string S [1..i]. Otherwise, the assumption π (S, i) = k would be incorrect, since j> k.


These properties allow us to obtain the algorithm for calculating the prefix function.
Let π (S, i) = k. It is necessary to calculate π (S, i + 1).
  1. If S [i + 1] = S [k + 1], then π (S, i + 1) = k + 1.
  2. Otherwise, if k = 0, then π (S, i + 1) = 0.
  3. Otherwise, put k: = π (S, k) and go to step 1.


The key point for understanding the essence of the algorithm is the fact that if the suffix found in the previous step cannot be extended to the next position, then we try to consider smaller suffixes as long as possible.

Algorithm for calculating the prefix function in Python:

 def prefix(s): v = [0]*len(s) for i in xrange(1,len(s)): k = v[i-1] while k > 0 and s[k] <> s[i]: k = v[k-1] if s[k] == s[i]: k = k + 1 v[i] = k return v 


We show that the running time of the algorithm is O (n), where n = | S |. Note that the asymptotics of the algorithm determines the final number of iterations of the while loop. This is so because, without taking into account the while loop, each iteration of the for loop is executed in a time not exceeding a constant. At each iteration of the cycle, the for k is increased by no more than one, which means the maximum possible value of k = n – 1. Since within the while loop the value of k is only decreasing, it turns out that k cannot decrease in total more than n – 1 times. This means that the while loop will be executed no more than n times, which gives the final estimate of the time of the O (n) algorithm.

Consider the Knut-Morris-Pratt algorithm based on the use of the prefix function. As in the primitive substring search algorithm, the pattern “moves” along the line from left to right in order to find a match. However, the key difference is that with the help of the prefix function we can avoid the obviously useless shifts.

Let S [0..m – 1] be the pattern, T [0..n – 1] the search string. Consider the comparison of strings at position i, that is, the pattern S [0..m – 1] is mapped to the part of the string T [i..i + m – 1]. Suppose the first mismatch occurred between the symbols S [j] and T [i + j], where i <j <m. Let P = S [0..j – 1] = T [i..i + j – 1]. When shifting, one can expect that the prefix S will converge with any suffix of the string P. Since the length of the longest prefix, which is also a suffix, is a prefix function of string S for index j, we arrive at the following algorithm:
  1. Build the prefix function of the sample S, we denote it by F.
  2. Put k = 0, i = 0.
  3. Compare the characters S [k] and T [i]. If the characters are equal, increase k by 1. If, in this case, k becomes equal to the sample length, then the occurrence of the pattern S in the string T is found, the index of the entry is equal to i - | S | + 1. The algorithm ends. If the characters are not equal, use the prefix function to optimize shifts. While k> 0, assign k = F [k – 1] and proceed to the beginning of step 3.
  4. While i <| T |, we increase i by 1 and go to step 3.


A possible implementation of the Knut-Morris-Pratt algorithm in Python looks like this:

 def kmp(s,t): index = -1 f = prefix(s) k = 0 for i in xrange(len(t)): while k > 0 and s[k] <> t[i]: k = f[k-1] if s[k] == t[i]: k = k + 1 if k == len(s): index = i - len(s) + 1 break return index 

Source: https://habr.com/ru/post/191454/


All Articles