After reading the article about
substring search algorithms
in the string , I found that it does not tell about the Boyer-Moore algorithm. A few words about him are still there, namely, it says that Boyer-Moore’s algorithm has earned itself the title of “default algorithm”, because on average it gives the best search time (with which I completely agree). Under the cat talked about a simplified version of this algorithm. In principle, most probably studied this algorithm on the 1st or 2nd year of the university (like me), so they can skip this article, there is nothing new here.
Algorithm
This algorithm is also known as Boyer-Moore-Horspul algorithm. The algorithm procedure is very simple. First, a table of offsets is constructed for each character. Then the source line and pattern are combined at the beginning, the comparison is made by the last character. If the last characters are the same, then the comparison follows the last but one character and so on. If the characters do not match, then the pattern is shifted to the right, the number of positions taken from the offset table for the symbol from the source string, and then the last characters of the source string and the pattern are compared again. And so on, until the pattern completely matches the substring of the source string, or the end of the string is reached.
Offset table
The offset table is based on the principle of “skipping as many characters as possible, but no more than that”. For example, if at some step of the algorithm the last characters did not match, and the character in the source string is not present in the pattern at all, then it is clear that you can move right to the full length of the pattern, without any fear. In general, each symbol is assigned a value equal to the difference in the length of the pattern and the ordinal number of the symbol (if the symbol is repeated, then the rightmost occurrence is taken). It is clear that this value will be exactly equal to the ordinal number of the symbol, if you count from the end of the line, which allows you to shift to the right by the maximum possible number of positions.
Illustration
To make it quite clear, let's consider the operation of the algorithm using an example:
Let the source string be “somestring” and the pattern be “string”. Now we build the offset table, it will be equal to the length of the pattern for all characters that are not found in the pattern, and the sequence number from the end for the rest (except for the last one, the pattern length is also taken for it):
')
s->5
t->4
r->3
i->2
n->1
g->6
Now we combine our lines:
somestring
string
Compare the last characters: see that “t” is not equal to “g”. Take the offset value for the symbol “t”, it is equal to 4. Shift the line to the right by 4 positions, and voila:
somestring
string
Further, the algorithm will compare the characters from the last to the first in the pattern with the corresponding characters in the original string. As soon as he compares the latter, it will be found out that this is the first entry.
Code
I wrote some Java code to enhance this theory. Any comments are welcome.
public int getFirstEntry(String source, String template) { int sourceLen = source.length(); int templateLen = template.length(); if (templateLen > sourceLen) { return -1; } HashMap<Character, Integer> offsetTable = new HashMap<Character, Integer>(); for (int i = 0; i <= 255; i++) { offsetTable.put((char) i, templateLen); } for (int i = 0; i < templateLen - 1; i++) { offsetTable.put(template.charAt(i), templateLen - i - 1); } int i = templateLen - 1; int j = i; int k = i; while (j >= 0 && i <= sourceLen - 1) { j = templateLen - 1; k = i; while (j >= 0 && source.charAt(k) == template.charAt(j)) { k--; j--; } i += offsetTable.get(source.charAt(i)); } if (k >= sourceLen - templateLen) { return -1; } else { return k + 1; } }
Conclusion
It is said that such a modification of the Boyer-Moore algorithm copes with random texts better than the original algorithm itself. However, the assessment of complexity in the worst case is still worse. But unlike the original algorithm, no complicated calculations are required here - only an offset table is built, which is not so difficult to implement.