Description of the operation of the Shift-OR algorithm to search for substrings in the string

1. Instead of entry.

Recently I had to understand the operation of the Shift-Or algorithm, which allows you to find a substring in a string. Based on the results of this analysis, I decided to write this post in the hope that it will help someone understand how this algorithm works, faster than me.

Actually, the main difference between the algorithm and, for example, “naive comparison” is that it is based on logical operations, namely, logical multiplication (it’s AND, it’s the conjunction).

2. What you need to work the algorithm.

First you need to enter some notation. Let us have P - this will be the sample that we will look for in the text, and T - this is actually the text itself, which will be searched. P is of length n, and T is of length m.
')
In the algorithm, a certain matrix M is built (as yet incomprehensible as educated). This matrix is essentially just a binary array (consists of only 0 and 1). M has dimensions nx (m + 1). In this matrix, our i-th index ranges from 1 to n, and j-th has values from 0 to m. Formally, by definition, “an element M (i, j) = 1 if and only if the first i characters of P exactly coincide with i characters of T, ending at position j”. Accordingly, since the array is binary, in all other cases the element is 0. In order to better understand what the matrix M is, consider an example:

Let the text T = "CALIFORNIA" and we want to find in this text the pattern P = "FOR". (Naturally, such a short “text” is taken for simplicity of example). We see that in the text “CALIFORNIA” we have 10 characters, it means m = 10. In the sample P we have 3 characters, it means n = 3. Accordingly, the matrix M will have dimensions 3x11. So, we get the following table:

You can also see that the main for us is only the last row of the table. Indeed, it contains units only if the pattern P is fully included in the text T.

3. So how is this matrix M constructed?

In order for such a matrix to be built by a program, it is necessary to introduce two more operations.

3.1. Binary vector U (x).

This vector is simply a binary string of length n (like the pattern P), in which there are units in the same positions where in the pattern P there is the letter x.
For example, suppose we have P = “ABACDEAB”. For him, the vector U (A) = 10100010:

3.2. Bit-Shift operation (j)

This operation is performed on the binary column j. It is very simple and consists only in shifting a column by 1 position downwards and attributing a unit to it. The length of the column does not change, respectively, we lose the last value of the column.
For example, we have a column of length 10: 0011010101. We apply the Bit-Shift operation three times to it:

In the figure above, the elements that remain from the original column are highlighted in dark turquoise, and the ones that are added as a result of applying the Bit-Shift operation are highlighted in yellow.

Actually, after we entered these two operations, the simplest and most important part remained: the actual construction of the matrix.
It is calculated very simply. And I propose to consider this construction by example.
Let us have a sample P = "ABAAC" (n = 5) and the text T = "XABXABAAXA" (m = 10).
Initially, fill the zero column with zeros:

All subsequent columns are calculated based on the previous one. Then we apply the bit-shift operation to the zero column. get the column:

And we calculate the column U (x). Here “x” is the first character of the text T being checked. We get a completely zero column, since the character “x” does not occur in P. Then we apply the logical AND operation to these columns:

So far, our matrix M looks like this:

Further filling of the matrix is absolutely the same. For the second column, U (A) = 10110, and Bit-Shift (1) = 10,000. So the second column is 10,000, and so on. Each subsequent column is filled based on the previous one. The general formula for filling the jth column of the matrix M is as follows:

As a result, after doing this operation on all the columns, we will get a ready-made matrix M:

As you can see, we do not have full occurrence of P in T, since the last row of the matrix does not contain ones. But then there are partial entries (if we suddenly need them).

4. Instead of a conclusion.

The complexity of this algorithm is O (mn), which in principle is not so small. But the algorithm has some nice sides. Firstly, it is very simple to implement and to understand how it works. Secondly, since each column of the matrix M is calculated on the basis of the previous one, we do not need to keep in memory the entire matrix, and it suffices to keep only two columns. In practice, the algorithm does a good job when searching for small samples of P, such as, for example, the English word. A slightly modified version of this algorithm underlies the work of the unix agrep utility, which looks for a sample in the text with a given number of errors.

PS

Yes, I saw this topic, which describes the same algorithm, but in my post I tried to get as close as possible to explaining the topic with examples, because I think that in some cases it is better to show how it works than to tell.

References

1. Dan Gasfield. Strings, trees and sequences in algorithms. 2003
2. Bitap
3. Google

Source: https://habr.com/ru/post/132128/

All Articles