Polynomial hashes and their applications

Hello, habr. Today I will write how to use polynomial hashes (hereinafter referred to as simply hashes) when solving various algorithmic problems.

Introduction

Let's start with the definition. Suppose we have the string s _0..n-1 . The polynomial hash of this string is the number h = hash (s _0..n-1 ) = s ₀ + ps ₁ + p ² s ₂ + ... + p ^n-1 s _n-1 , where p is a natural number (later it is said which one), and s _i is the code of the i-th character of the string s (in almost all modern languages it is written s[i] ).

Hashes have the property that the same hash lines are equal. Therefore, the main operation that hashes allow you to perform is a quick comparison of two substrings for equality. Of course, to compare 2 strings, we can write a similar function (we will assume that the lengths of the strings s and t are the same and equal to n ):
')

 boolean equals(char[] s, char[] t) { for (int i = 0; i < n; i++) if (s[i] != t[i]) { return false; } } return true; }

However, in the worst case, this function is obliged to check all the characters, which gives the asymptotics of O (n) .

String comparison using hashes

Now let's see how the hashes do it. Since the hash is just a number, we need O (1) time to compare them. However, in order to find the hash of one substring in a naive way, it takes O (n) time. Therefore, you will need to tinker a bit with mathematical formulas and learn how to find all the substrings at once for O (n) hashes. Let's compare the substrings s _L..R and t _X..Y (of the same length). Using the definition of a hash, we can write:

s _L + ps _{L + 1} + ... + p ^RL-1 s _R-1 + p ^RL s _R = t _X + pt _{X + 1} + ... + p ^YX-1 t _Y-1 + p ^YX t _Y

Let's make small transformations on the left side (on the right side, everything will be similar). We write the hash of the substring s _0..R , we need it:

hash (s _0..R ) = s ₀ + ps ₁ + ... + p ^L-1 s _L-1 + p ^L s _L + p ^{L + 1} s _{L + 1} + ... + p ^R-1 s _R-1 + p ^R s _R

We divide this expression into two parts ...

hash (s _0..R ) = (s ₀ + ps ₁ + ... + p ^L-1 s _L-1 ) + (p ^L s _L + p ^{L + 1} s _{L + 1} + ... + p ^R-1 s _R-1 + p ^R s _R )

... and we take out the factor p ^L from the second bracket:

hash (s _0..R ) = (s ₀ + ps ₁ + ... + p ^L-1 s _L-1 ) + p ^L (s _L + ps _{L + 1} + ... + p ^RL-1 s _R-1 + p ^RL s _R )

The expression in the first parenthesis is nothing but the hash of the substring s _0..L-1 , and in the second, the hash of the substring we need s _L..R . So, we got that:

hash (s _0..R ) = hash (s _0..L-1 ) + p ^L hash (s _L..R )

Hence the following formula for hash (s _L..R ) :

hash (s _L..R ) = (1 / p ^L ) (hash (s _0..R ) - hash (s _0..L-1 ))

Similarly, for our second substring, the equality hash (t _X..Y ) = (1 / p ^X ) (hash (t _0..Y ) - hash (t _0..X-1 )) will be _true .

Looking carefully at these formulas, you can see that to calculate the hash of any substring, we only need to know the hash of the prefixes of this string s _0..0 , s _0..1 , ..., s _0..n-2 , s _{0. .n-1} . And, since the hash of each of the next prefix is expressed through the hash of the previous one, they are easy to count as a linear pass through the line. All at once for O (n) time. The powers of p must also be pre-calculated and stored in an array.

Code example:

 //      p,     pow[0] = 1; for (int i = 1; i <= MAX; i++) { pow[i] = pow[i - 1] * p; } //     s hs[0] = s[0]; for (int i = 1; i < n; i++) { hs[i] = hs[i - 1] + pow[i] * s[i]; } //     t ht[0] = t[0]; for (int i = 1; i < m; i++) { ht[i] = ht[i - 1] + pow[i] * t[i]; }

It would seem that we are now fully armed and able to compare 2 any substrings for O (1) . But not everything is so simple: mathematical formulas need some refinement. For example, a similar code:

 if ((hs[R] - hs[L - 1]) / pow[L] == (ht[Y] - ht[X - 1]) / pow[X]) { ... }

will not work.

Remark the first: L (or X ) can appear equal to zero, and at calculation of hs[L - 1] there will be an exit for array borders. However, if L is zero, then the hash of the substring s _{L..R of interest} is stored exactly in hs[R] . Therefore, instead of hs[L - 1] , it is more correct to write like this:
```
 L == 0 ? 0 : hs[L - 1] 
```
.
Second note: even the long type contains only 64 bits (I use Java), and our strings can be several thousand long, and an overflow will inevitably occur when calculating hashes. The easiest way to solve this problem is to close your eyes to it. Well, or almost close: we will have hashes, in fact, modulo 2 ⁶⁴ (and therefore it will not be necessary to perform the operations of taking the remainder of the division - beauty!). The number p for counting them must be, firstly, greater than the code of the largest character in the lines, and secondly, mutually simple with the module (in our case, it is 2 ⁶⁴ , that is, it must be odd). Why so, I will not tell here - you can read about it in books on algebra. Of course, the probability of a collision inevitably appears, but it is extremely small, therefore, when solving Olympiad problems, it can simply be neglected.
Third note: since we now perform all operations on a module, the division is not available to us (more precisely, it is available , but writing it is rather inefficient). Therefore, it is necessary to get rid of it. This is done quite simply, in a way that is called “proportion” in school: the expression is reduced to a common denominator, and multiplication is used instead of division:
```
 if ((hs[R] - (L == 0 ? 0 : hs[L - 1])) * pow[X] == (ht[Y] - (X == 0 ? 0 : ht[X - 1])) * pow[L]) { ... } 
```

Now we take a closer look at the tasks where you can use hashes.

Tasks solved using hashes

1. Comparing substrings

The first and most important application, as already mentioned, is a quick comparison of two substrings - all other algorithms with hashes are based on it. The code in the last section is rather cumbersome, so I will write a more convenient code that will be used in the future.
The following function calculates the hash of the substring s _L..R multiplied by p ^L :

 long getHash(long[] h, int L, int R) { long result = h[R]; if (L > 0) result -= h[L - 1]; return result; }

Now we are comparing the two substrings with the following line:

 if (getHash(hs, L, R) * pow[X] == getHash(ht, X, Y) * pow[L]) { ... }

Multiplication by degrees of the number p can be called “reduction to one degree”. The first hash was multiplied by p ^L , and the second one - by p ^X - means that the comparison would take place correctly, they must be multiplied by the missing factor.
Note: it makes sense to first check whether the lengths of the substrings match. If not, then the lines in principle can not be equal, and then you can not check the above condition.

2. Search for a substring in the string for O (n + m)

Hashes allow you to search for a substring in a string for asymptotically minimal time. This makes the so-called Rabin-Karp algorithm.
Let there be a string s of length n in which we want to find all occurrences of string t of length m . We will find the hash of the string t (the entire string as a whole) and the hashes of all the prefixes of the string s , and then we will move along the string s with a window of length m , comparing substrings s _{i..i + m-1} .
Code:

 //    t long ht = t[0]; for (int i = 1; i < m; i++) { ht += pow[i] * t[i]; } //    for (int i = 0; i + m <= n; i++) { if (getHash(h, i, i + m - 1) == ht * pow[i]) { //     i } }

3. Finding the z-function for O (n log n)

A z-function of string s is an array z , whose i- th element is equal to the longest prefix of a substring starting at position i in string s , which is also the prefix of the entire string s . The value of the z-function in the zero position will be considered equal to the length of the string s , although some sources take it for zero (but this is not critical).

Of course, there is an algorithm for finding the z-function for O (n) . But when you don’t know it or don’t remember it (and the algorithm is rather cumbersome), hashes come to the rescue.

The idea is as follows: for each i = 0, 1, ..., n-1, we search for z _{i by a} binary search, i.e. at each iteration, reducing the interval of possible values by half. This is correct, because the equality s _0..k-1 = s _{i..i + k-1 is} necessarily satisfied for all k , less than z _i , and is not necessarily fulfilled for large k .

Code:

 int[] z = new int[n]; for (int i = 0; i < n; i++) { int left = 1, right = n - i; //    while (left <= right) { //  middle -   int middle = (left + right) / 2; //  ,    s[0..middle-1]  s[i..i+middle-1] if (getHash(h, 0, middle - 1) * pow[i] == getHash(h, i, i + middle - 1)) { //  ,      middle,      z[i] = middle; left = middle + 1; } else { //   ,     right = middle - 1; } } }

4. Search for a lexicographically minimum cyclic row shift for O (n log n) .

There is Duval's algorithm , which allows solving this problem for O (n) , however, I know some pretty strong olympiad programmers who never understood it. As long as they understand it, we will again apply the hashes.

The algorithm is as follows. First we take the string s itself for the best (lexicographically minimal) answer. Then for each cyclic shift using a binary search we find the length of the maximum total prefix of this shift and the current best answer. After that, it suffices to compare the characters following the prefix and, if necessary, update the answer.
We also note that for convenience, it is recommended to assign the string s to itself — you do not have to do modulo operations when referring to the characters of the string s . We assume that this has already been done.

Code:

 int bestPos = 0; for (int i = 1; i < n; i++) { int left = 1, right = n, length = 0; //  length -     while (left <= right) { int middle = (left + right) / 2; if (getHash(h, bestPos, bestPos + middle - 1) * pow[i] == getHash(h, i, i + middle - 1) * pow[bestPos]) { length = middle; left = middle + 1; } else { right = middle - 1; } } //          //      n, //       , //      if (length < n && s[i + length] < s[bestPos + length]) { bestPos = i; } }

Note: in fact, a comparator is written inside the for loop that compares two cyclic shifts lexicographically. Using it, it is possible to sort all cyclic shifts for O (n log ² n) .

5. Search all palindromes in the line for O (n log n) .

Again, there is a solution to this problem for O (n) . And again we will solve it with the help of hashes.

The substring s _L..R is called a palindrome if s _L = s _R , s _{L + 1} = s _R-1 , etc. If expressed in Russian, this means that it is read equally from left to right and from right to left.

Perhaps you already know or have guessed what the hashes are. In addition to the array h[] containing the hashes for the substrings s _0..0 , s _0..1 , ..., s _0..n-2 , s _0..n-1 , we calculate the second array rh[] (for "Inverted" line), which we will bypass from right to left. It will contain, respectively, hashes s _0..n-1 , s _1..n-1 , ..., s _n-2..n-1 , s _n-1..n-1 :

 rh[n - 1] = s[n - 1]; for (int i = n - 2, j = 1; i >= 0; i--, j++) { rh[i] = rh[i + 1] + pow[j] * s[i]; }

It should be clear how to determine if O (1) is a palindrome. I will write a getRevHash () function, similar to getHash (), and then give the necessary condition for comparison. You can verify for yourself the correctness of this expression by doing mathematical calculations similar to those given at the beginning of the article.

 long getRevHash(long[] rh, int L, int R) { long result = rh[L]; if (R < n - 1) result -= rh[R + 1]; return result; } boolean isPalindrome(long[] h, long[] rh, long[] pow, int L, int R) { return getHash(h, L, R) * pow[n - R - 1] == getRevHash(rh, L, R) * pow[L]; }

Now consider the position of i in the string. Let there be a palindrome of odd length d centered at position i (in the case of an even length, centered between positions i-1 and i ). If one symbol is cut from its edges, it will remain a palindrome. And so you can continue until its length becomes zero.
Thus, it is enough for us to store 2 values for each position: how many odd length palindromes exist with center at position i , and how many even length palindromes exist with center between positions i-1 and i . Please note that these 2 values are completely independent of each other, and they must be processed separately.

Let's apply, as before, binary search:

 int[] oddCount = new int[n]; for (int i = 0; i < n; i++) { int left = 1, right = min(i + 1, n - i); while (left <= right) { int middle = (left + right) / 2; if (isPalindrome(h, rh, pow, i - middle + 1, i + middle - 1)) { oddCount[i] = middle; left = middle + 1; } else { right = middle - 1; } } } int[] evenCount = new int[n]; for (int i = 0; i < n; i++) { int left = 1, right = min(i, n - i); while (left <= right) { int middle = (left + right) / 2; if (isPalindrome(h, rh, pow, i - middle, i + middle - 1)) { evenCount[i] = middle; left = middle + 1; } else { right = middle - 1; } } }

Now you can, for example, find the total number of all palindromes in a string, or the length of the maximum palindrome. The length of the maximum odd palindrome with center in position i is counted as 2 * oddCount[i] - 1 , and the maximum even palindrome - 2 * evenCount[i] .
Once again I remind you that you need to be more careful with palindromes of even and odd length - as a rule, they must be processed independently of each other.

Hashes in matrices

Finally, consider the more sophisticated use of hashes. Now our space will be two-dimensional, and we will compare submatrices. Fortunately, hashes are very well generalized to the two-dimensional case (I have not met three-dimensional or more).

Now, instead of the number p and the pow array, we will have two different numbers p , q and two pow1 , pow2 : one number and one array in each direction: vertically and horizontally.

The hash of the matrix a _{0..n-1, 0..m-1 is} called the sum over all i = 0, ..., n-1 , j = 0, ..., m-1 of the quantities p ⁱ q ^j a _ij .

Now let's learn how to count hashes of submatrices containing the upper left element a ₀₀ . Obviously, hash (a _{0..0, 0..0} ) = a ₀₀ . It is almost as obvious that for all j = 1, ..., m-1 hash (a _{0..0, 0..j} ) = hash (a _{0..0, 0..j-1} ) + q ^j a _0j , for all i = 1, ..., n-1 hash (a _{0..i, 0..0} ) = hash (a _{0..i-1, 0..0} ) + p ⁱ a _i0 . This directly follows from the one-dimensional case.

How to calculate the hash of the submatrix a _{0..i, 0..j} ? You can guess that hash (a _{0..i, 0..j} ) = hash (a _{0..i-1, 0..j} ) + hash (a _{0..i, 0..j-1} ) - hash (a _{0..i-1, 0..j-1} ) + p ⁱ q ^j a _ij . This formula can be obtained from the following considerations: we add up all the terms (a hash, remember, this is the sum of several terms) that make up the hash of submatrices a _{0..i-1, 0..j} and a _{0..i, 0..j-1} . In this case, we twice took into account the terms that make up the submatrix a _{0..i-1, 0..j-1} , so we subtract them so that they are counted once. Now, only the element a _ij multiplied by the corresponding powers of p and q is missing.

Approximately from the same considerations as in the first part of the article (have you already noticed the involvement of the inclusion-exception formula?), A function is constructed to calculate the hash of an arbitrary submatrix a _{x1..x2, y1..y2} :

 long getMatrixHash(long[][] h, int x1, int x2, int y1, int y2) { long result = h[x2][y2]; if (x1 > 0) result -= h[x1 - 1][y2]; if (y1 > 0) result -= h[x2][y1 - 1]; if (x1 > 0 && y1 > 0) result += h[x1 - 1][y1 - 1]; return result; }

This function returns the hash of the submatrix a _{x1..x2, y1..y2} multiplied by the value of p ^x1 q ^y1 .

And the comparison of two submatrices a _{ax1..ax2, ay1..ay2} and a _{bx1..bx2, by1..by2} is performed using the following expression:

 if (getMatrixHash(h, ax1, ax2, ay1, ay2) * pow1[bx1] * pow2[by1] == getMatrixHash(h, bx1, bx2, by1, by2) * pow1[ax1] * pow2[ay1]) { ... }

Heshes can also solve problems associated with finding the largest symmetric submatrix and similar to them. And I do not know comparable with the hashes in speed and simplicity of the algorithms that perform this work. Here, the same principles are used as in the search for palindromes in the one-dimensional case (i.e., consider “reversed” hashes from right to left and bottom-up, conduct bin search separately for submatrices of even and odd length). I suggest trying to solve this problem on your own - this article will help you!

Conclusion

So, we have a pretty good machine at our disposal, which allows us to do many things either with the best possible asymptotics, or only slightly (logarithm times) more slowly than specialized algorithms. Not bad, right?

Source: https://habr.com/ru/post/142589/

All Articles