An alternative to brute force. Text search with hash function

Earlier, I wrote about the basics of text search, now I want to continue and write about how algorithms are developing towards efficiency.
So, how did Michael Rabin and Richard Karp disperse the algorithm?

Why is brute force so slow? Perhaps because we are doing too many unnecessary actions. Then the idea appears to optimize the inner loop. But as? It would be possible to compare strings for some numbers characterizing them.
There are such numbers, we will get them using a hash function .

First approach

Perhaps something comes to mind like “calculate the hash code for the template and for each of the substrings, if they are the same - this is where the coincidence was found.”
Let's try to solve the problem by writing our own hash function, and also add code that will match the substring with the pattern character-by-character only if the hash codes were equal. The hash function of the string is simply the sum of the character codes that make up this string:
')

private int GetHashOfString( string s)
{
int result = 0;
for ( int i = 0; i < s.Length; i++)
{
result += s[i];
}
return result;
}

The search function itself substring will look like this:

public int Match( string input, string pattern)
{
int inputLength = input.Length;
int patternLength = pattern.Length;
int patternHash = GetHashOfString(pattern);
int substringHash;

for ( int i = 0; i <= inputLength - patternLength; i++)
{
substringHash = GetHashOfString(input.Substring(i, patternLength));
if (patternHash == substringHash)
{
bool success = true ;
for ( int j = 0; j < patternLength; j++)
{
if (input[i + j] != pattern[j])
{
success = false ;
break ;
}
}
if (success)
return i;
}
}
return -1;
}

We overclock the algorithm

But again: to find the hash code at each position, we perform exactly the same actions as with brute force. It is necessary to optimize. Our version of the construction of the hash allows us to significantly speed up its calculation at each subsequent step: take away the ASCII code of the character that was at the zero position and add the code of the new character.

The code will change as follows:

...
int patternHash = GetHashOfString(pattern);
int substringHash = GetHashOfString(input.Substring(0, patternLength));

for ( int i = 0; i <= inputLength - patternLength; i++)
{
if (i > 0)
substringHash =
substringHash - input[i - 1] + input[i + patternLength - 1];
if (patternHash == substringHash)
...

Continue overclocking?

It is possible to overclock the algorithm even more by using another hash function. One of these hash functions interprets each substring as a number in a certain number system, the base of which is a large prime number.

How to get the value of the hash from the previous value in a new step? Since the length of the pattern is constant, we can once calculate and memorize the basis in the degree of the length of the pattern minus one: maxBase = 61 ^ (length – 1). Instead of subtracting the value of the code being thrown, we subtract its value multiplied by maxBase, that is, 'a' * 61 ^ 3.
After that, you need to add a new code and multiply the resulting value on the basis of our system (61).
This can be written as pseudocode:

substringHash = substringHash - input[i - 1];
substringHash = substringHash + input[i + patternLength - 1];
substringHash = substringHash * base; // base –

Another question: what will happen to the hash with a larger string length (more precisely, with a sufficiently large template length)? 61 to the sixth power (for a length of 7 characters) no longer fits into a four-byte integer.

“Modular arithmetic” comes to the rescue. We will not store huge numbers that can not fit in thirty-two-bit int, we will take their remainder from dividing by some prime number q.
In any case, I will say that “modulo arithmetic” is based on such identities as:

(a + b + c) mod x = (a mod x + b mod x + c mod x) mod x
(a * b * c) mod x = (a mod x * b mod x * c mod x) mod x

We implement the algorithm

So, the hash function will not consist of the sum of character codes multiplied by the base base chosen to the appropriate degree, but of this sum taken modulo q. The values of q and base will choose more than the length of the alphabet, that is, for ASCII it will be> 256, and for Unicode> 65536.

Our function for the string s with a length of 3 characters will be:
((ascii(s[0]) * base^2) mod q + (ascii(s[1]) * base^1) mod q + (ascii(s[2]) * base^0) mod q) mod q

Since the length of the pattern during a single search remains unchanged, base ^ (length – 1) mod q remains unchanged. We calculate this value in a separate method:

private int GetMaxBase( int length, int q, int b)
{
int result = 1;
for ( int i = 0; i < length - 1; i++)
result = (result * b) % q;
return result;
}

As before, let's create a method for the hash function:

private int GetHashOfString( string s, int q, int b)
{
int result = 0;
int length = s.Length;

for ( int i = 0; i < length; i++)
result = (b * result + s[i]) % q;
return result;
}

Now the search function itself:

public int Match( string input, string pattern, int b, int q)
{
int inputLength = input.Length;
int patternLength = pattern.Length;

find the value of base ^ (patternLength - 1) modulo q
int maxBase = GetMaxBase(patternLength, q, b);
first find the hash values for the template and the first substring

int patternHash = GetHashOfString(pattern, q, b);
int substringHash =
GetHashOfString(input.Substring(0, patternLength), q, b);

for ( int i = 0; i <= inputLength - patternLength; i++)
{

if the hash values match - we compare the strings completely

if (patternHash == substringHash)
{
bool success = true ;
for ( int j = 0; j < patternLength; j++)
{
if (input[i + j] != pattern[j])
{
success = false ;
break ;
}
}
if (success)
return i;
}

if we are not at the last step, we find the new value of the hash function

if (i != inputLength - patternLength)
substringHash =
(b * (substringHash - input[i] * maxBase) +
input[i + patternLength]) % q;

if it’s a negative number, make it positive :)

if (substringHash < 0) substringHash += q;
}
return -1;
}

Finally

Here we are done. It is worth mentioning that the time costs of this algorithm, even in the worst case, are not inferior to the brute force algorithm, and on average it is quite pleasant to look at the indicators of the algorithm.

I hope that in the near future I will be able to write about the direction in which other people were moving, trying to find effective algorithms for searching for a substring. Thanks to those who read to the end.

The project with tests can be downloaded here.

Source: https://habr.com/ru/post/48317/

All Articles