Hello, habr. Today I will write how to use polynomial hashes (hereinafter referred to as simply hashes) when solving various algorithmic problems.
Introduction
Let's start with the definition. Suppose we have the string
s 0..n-1 . The polynomial hash of this string is the number
h = hash (s 0..n-1 ) = s 0 + ps 1 + p 2 s 2 + ... + p n-1 s n-1 , where
p is a natural number (later it is said which one), and
s i is the code of the
i-th character of the string
s (in almost all modern languages ​​it is written
s[i]
).
Hashes have the property that the same hash lines are equal. Therefore, the main operation that hashes allow you to perform is a quick comparison of two substrings for equality.
Of course, to compare 2 strings, we can write a similar function (we will assume that the lengths of the strings
s and
t are the same and equal to
n ):
')
boolean equals(char[] s, char[] t) { for (int i = 0; i < n; i++) if (s[i] != t[i]) { return false; } } return true; }
However, in the worst case, this function is obliged to check all the characters, which gives the asymptotics of
O (n) .
String comparison using hashes
Now let's see how the hashes do it. Since the hash is just a number, we need
O (1) time to compare them. However, in order to find the hash of one substring in a naive way, it takes
O (n) time. Therefore, you will need to tinker a bit with mathematical formulas and learn how to find
all the substrings at
once for
O (n) hashes. Let's compare the substrings
s L..R and
t X..Y (of the same length). Using the definition of a hash, we can write:
s L + ps L + 1 + ... + p RL-1 s R-1 + p RL s R = t X + pt X + 1 + ... + p YX-1 t Y-1 + p YX t YLet's make small transformations on the left side (on the right side, everything will be similar). We write the hash of the substring
s 0..R , we need it:
hash (s 0..R ) = s 0 + ps 1 + ... + p L-1 s L-1 + p L s L + p L + 1 s L + 1 + ... + p R-1 s R-1 + p R s RWe divide this expression into two parts ...
hash (s 0..R ) = (s 0 + ps 1 + ... + p L-1 s L-1 ) + (p L s L + p L + 1 s L + 1 + ... + p R-1 s R-1 + p R s R )... and we take out the factor p
L from the second bracket:
hash (s 0..R ) = (s 0 + ps 1 + ... + p L-1 s L-1 ) + p L (s L + ps L + 1 + ... + p RL-1 s R-1 + p RL s R )The expression in the first parenthesis is nothing but the hash of the substring
s 0..L-1 , and in the second, the hash of the substring we need
s L..R . So, we got that:
hash (s 0..R ) = hash (s 0..L-1 ) + p L hash (s L..R )Hence the following formula for
hash (s L..R ) :
hash (s L..R ) = (1 / p L ) (hash (s 0..R ) - hash (s 0..L-1 ))Similarly, for our second substring, the equality
hash (t X..Y ) = (1 / p X ) (hash (t 0..Y ) - hash (t 0..X-1 )) will be true .
Looking carefully at these formulas, you can see that to calculate the hash of any substring, we only need to know the hash of the prefixes of this string
s 0..0 ,
s 0..1 , ...,
s 0..n-2 ,
s 0. .n-1 . And, since the hash of each of the next prefix is ​​expressed through the hash of the previous one, they are easy to count as a linear pass through the line. All at once for
O (n) time. The powers of
p must also be pre-calculated and stored in an array.
Code example:
It would seem that we are now fully armed and able to compare 2 any substrings for
O (1) . But not everything is so simple: mathematical formulas need some refinement. For example, a similar code:
if ((hs[R] - hs[L - 1]) / pow[L] == (ht[Y] - ht[X - 1]) / pow[X]) { ... }
will not work.
Now we take a closer look at the tasks where you can use hashes.
Tasks solved using hashes
1. Comparing substrings
The first and most important application, as already mentioned, is a quick comparison of two substrings - all other algorithms with hashes are based on it. The code in the last section is rather cumbersome, so I will write a more convenient code that will be used in the future.
The following function calculates the hash of the substring
s L..R multiplied by
p L :
long getHash(long[] h, int L, int R) { long result = h[R]; if (L > 0) result -= h[L - 1]; return result; }
Now we are comparing the two substrings with the following line:
if (getHash(hs, L, R) * pow[X] == getHash(ht, X, Y) * pow[L]) { ... }
Multiplication by degrees of the number
p can be called “reduction to one degree”. The first hash was multiplied by
p L , and the second one - by
p X - means that the comparison would take place correctly, they must be multiplied by the missing factor.
Note: it makes sense to first check whether the lengths of the substrings match. If not, then the lines in principle can not be equal, and then you can not check the above condition.
2. Search for a substring in the string for O (n + m)
Hashes allow you to search for a substring in a string for asymptotically minimal time. This makes the so-called Rabin-Karp algorithm.
Let there be a string
s of length
n in which we want to find all occurrences of string
t of length
m . We will find the hash of the string
t (the entire string as a whole) and the hashes of all the prefixes of the string
s , and then we will move along the string
s with a window of length
m , comparing substrings
s i..i + m-1 .
Code:
3. Finding the z-function for O (n log n)
A z-function of string
s is an array
z , whose
i- th element is equal to the longest prefix of a substring starting at position
i in string
s , which is also the prefix of the entire string
s . The value of the z-function in the zero position will be considered equal to the length of the string
s , although some sources take it for zero (but this is not critical).
Of course, there is an
algorithm for finding the z-function for
O (n) . But when you don’t know it or don’t remember it (and the algorithm is rather cumbersome), hashes come to the rescue.
The idea is as follows: for each
i = 0, 1, ..., n-1, we search for
z i by a binary search, i.e. at each iteration, reducing the interval of possible values ​​by half. This is correct, because the equality
s 0..k-1 = s i..i + k-1 is necessarily satisfied for all
k , less than
z i , and is not necessarily fulfilled for large
k .
Code:
int[] z = new int[n]; for (int i = 0; i < n; i++) { int left = 1, right = n - i;
4. Search for a lexicographically minimum cyclic row shift for O (n log n) .
There is
Duval's algorithm , which allows solving this problem for
O (n) , however, I know some pretty strong olympiad programmers who never understood it. As long as they understand it, we will again apply the hashes.
The algorithm is as follows. First we take the string
s itself for the best (lexicographically minimal) answer. Then for each cyclic shift using a binary search we find the length of the maximum total prefix of this shift and the current best answer. After that, it suffices to compare the characters following the prefix and, if necessary, update the answer.
We also note that for convenience, it is recommended to assign the string
s to itself — you do not have to do modulo operations when referring to the characters of the string
s . We assume that this has already been done.
Code:
int bestPos = 0; for (int i = 1; i < n; i++) { int left = 1, right = n, length = 0;
Note: in fact, a comparator is written inside the
for
loop that compares two cyclic shifts lexicographically. Using it, it is possible to sort all cyclic shifts for
O (n log 2 n) .
5. Search all palindromes in the line for O (n log n) .
Again, there is a
solution to this problem for
O (n) . And again we will solve it with the help of hashes.
The substring
s L..R is called a palindrome if
s L = s R ,
s L + 1 = s R-1 , etc. If expressed in Russian, this means that it is read equally from left to right and from right to left.
Perhaps you already know or have guessed what the hashes are. In addition to the array
h[]
containing the hashes for the substrings
s 0..0 ,
s 0..1 , ...,
s 0..n-2 ,
s 0..n-1 , we calculate the second array
rh[]
(for "Inverted" line), which we will bypass from right to left. It will contain, respectively, hashes
s 0..n-1 ,
s 1..n-1 , ...,
s n-2..n-1 ,
s n-1..n-1 :
rh[n - 1] = s[n - 1]; for (int i = n - 2, j = 1; i >= 0; i--, j++) { rh[i] = rh[i + 1] + pow[j] * s[i]; }
It should be clear how to determine if
O (1) is a palindrome. I will write a getRevHash () function, similar to getHash (), and then give the necessary condition for comparison. You can verify for yourself the correctness of this expression by doing mathematical calculations similar to those given at the beginning of the article.
long getRevHash(long[] rh, int L, int R) { long result = rh[L]; if (R < n - 1) result -= rh[R + 1]; return result; } boolean isPalindrome(long[] h, long[] rh, long[] pow, int L, int R) { return getHash(h, L, R) * pow[n - R - 1] == getRevHash(rh, L, R) * pow[L]; }
Now consider the position of
i in the string. Let there be a palindrome of odd length
d centered at position
i (in the case of an even length, centered between positions
i-1 and
i ). If one symbol is cut from its edges, it will remain a palindrome. And so you can continue until its length becomes zero.
Thus, it is enough for us to store 2 values ​​for each position: how many odd length palindromes exist with center at position
i , and how many even length palindromes exist with center between positions
i-1 and
i . Please note that these 2 values ​​are completely independent of each other, and they must be processed separately.
Let's apply, as before, binary search:
int[] oddCount = new int[n]; for (int i = 0; i < n; i++) { int left = 1, right = min(i + 1, n - i); while (left <= right) { int middle = (left + right) / 2; if (isPalindrome(h, rh, pow, i - middle + 1, i + middle - 1)) { oddCount[i] = middle; left = middle + 1; } else { right = middle - 1; } } } int[] evenCount = new int[n]; for (int i = 0; i < n; i++) { int left = 1, right = min(i, n - i); while (left <= right) { int middle = (left + right) / 2; if (isPalindrome(h, rh, pow, i - middle, i + middle - 1)) { evenCount[i] = middle; left = middle + 1; } else { right = middle - 1; } } }
Now you can, for example, find the total number of all palindromes in a string, or the length of the maximum palindrome. The length of the maximum odd palindrome with center in position
i is counted as
2 * oddCount[i] - 1
, and the maximum even palindrome -
2 * evenCount[i]
.
Once again I remind you that you need to be more careful with palindromes of even and odd length - as a rule, they must be processed independently of each other.
Hashes in matrices
Finally, consider the more sophisticated use of hashes. Now our space will be two-dimensional, and we will compare submatrices. Fortunately, hashes are very well generalized to the two-dimensional case (I have not met three-dimensional or more).
Now, instead of the number
p and the
pow
array, we will have two different numbers
p ,
q and two
pow1
,
pow2
: one number and one array in each direction: vertically and horizontally.
The hash of the matrix
a 0..n-1, 0..m-1 is called the sum over all
i = 0, ..., n-1 ,
j = 0, ..., m-1 of the quantities
p i q j a ij .
Now let's learn how to count hashes of submatrices containing the upper left element
a 00 . Obviously,
hash (a 0..0, 0..0 ) = a 00 . It is almost as obvious that for all
j = 1, ..., m-1 hash (a 0..0, 0..j ) = hash (a 0..0, 0..j-1 ) + q j a 0j , for all
i = 1, ..., n-1 hash (a 0..i, 0..0 ) = hash (a 0..i-1, 0..0 ) + p i a i0 . This directly follows from the one-dimensional case.
How to calculate the hash of the submatrix
a 0..i, 0..j ? You can guess that
hash (a 0..i, 0..j ) = hash (a 0..i-1, 0..j ) + hash (a 0..i, 0..j-1 ) - hash (a 0..i-1, 0..j-1 ) + p i q j a ij . This formula can be obtained from the following considerations: we add up all the terms (a hash, remember, this is the sum of several terms) that make up the hash of submatrices
a 0..i-1, 0..j and
a 0..i, 0..j-1 . In this case, we twice took into account the terms that make up the submatrix
a 0..i-1, 0..j-1 , so we subtract them so that they are counted once. Now, only the element
a ij multiplied by the corresponding powers of
p and
q is missing.
Approximately from the same considerations as in the first part of the article (have you already noticed the involvement of the inclusion-exception formula?), A function is constructed to calculate the hash of an arbitrary submatrix
a x1..x2, y1..y2 :
long getMatrixHash(long[][] h, int x1, int x2, int y1, int y2) { long result = h[x2][y2]; if (x1 > 0) result -= h[x1 - 1][y2]; if (y1 > 0) result -= h[x2][y1 - 1]; if (x1 > 0 && y1 > 0) result += h[x1 - 1][y1 - 1]; return result; }
This function returns the hash of the submatrix
a x1..x2, y1..y2 multiplied by the value of
p x1 q y1 .
And the comparison of two submatrices
a ax1..ax2, ay1..ay2 and
a bx1..bx2, by1..by2 is performed using the following expression:
if (getMatrixHash(h, ax1, ax2, ay1, ay2) * pow1[bx1] * pow2[by1] == getMatrixHash(h, bx1, bx2, by1, by2) * pow1[ax1] * pow2[ay1]) { ... }
Heshes can also solve problems associated with finding the largest symmetric submatrix and similar to them. And I do not know comparable with the hashes in speed and simplicity of the algorithms that perform this work. Here, the same principles are used as in the search for palindromes in the one-dimensional case (i.e., consider “reversed” hashes from right to left and bottom-up, conduct bin search separately for submatrices of even and odd length). I suggest trying to solve this problem on your own - this article will help you!
Conclusion
So, we have a pretty good machine at our disposal, which allows us to do many things either with the best possible asymptotics, or only slightly (logarithm times) more slowly than specialized algorithms. Not bad, right?