Introduction
In the post, I tried to avoid complex definitions and rigorous mathematical proofs, and some things are intuitively clear. The algorithm is conveniently divided into interconnected parts, therefore, to grasp the principle of its operation should not be difficult.
Initial description
The Aho-Korasik algorithm implements an efficient search for all occurrences of all sample lines in a given string. It was developed in 1975 by
Alfred Aho and Margaret Corasic.
We formally describe the condition of the problem. Several lines of pattern [i] and string s come to the input. Our task is to find all possible occurrences of pattern [i] lines in s.
The essence of the algorithm lies in the use of a
boron data structure and the construction of a
finite deterministic automaton on it. It is important to remember that the task of searching for substrings in strings is trivially implemented in quadratic time, so for effective work it is important that all parts of Aho-Korasika do not asymptotically exceed the line in relation to the length of the lines. We will return to the assessment of complexity at the end, but for now let's take a closer look at the components of the algorithm.
Build on a set of sample strings
Boron structure
What is boron? Strictly speaking, boron is a tree in which each vertex denotes a string (the root denotes the zero string - ε). On the edges between the vertices, 1 letter is written (this is its fundamental difference with
suffix trees , etc.), so getting along the edges from the root to some vertex and contangling the letters from the edges in traversal order, we get the line corresponding to this vertex . From the definition of boron as a tree, the uniqueness of the path between the root and any vertex also follows; therefore, each vertex corresponds to exactly one row (we will identify the vertex and the string it denotes).
We will build boron by adding source lines sequentially. Initially, we have 1 vertex, the root (root) is the empty string. Adding a line is as follows: starting at the root, we move through our tree, each time choosing the edge corresponding to the next letter of the line. If there is no such edge, then we create it along with the vertex. Here is an example of a built-up bur for the lines: 1) acab, 2) accc, 3) acac, 4) baca, 5) abb, 6) z, 7) ac.
')
(Fig. 1)Note the addition of row 7. It does not create new vertices and edges, and the process of adding it stops at the inner vertex. This shows that for each line it is necessary to additionally store a sign that it is a line from the condition or not (red circles).
(Fig. 2)Note also that two strings in boron have common edges, provided they have a common prefix. Extreme case - all the line samples in pairs do not have the same initial part. So the upper estimate for the number of vertices in boron is the sum of the lengths of all rows + 1 (root).
Implementation
We will store bor as an array of vertices, where each vertex has its own unique number, and the root has a zero value (root = 0). Possible description of the vertex structure:
next_vrtx [i] is the number of the vertex to which we will arrive at the symbol with the number i in the alphabet, flag is the bit indicating whether our vertex is the original line, pat_num is the number of the sample line denoted by this vertex.
Preferring the lengths of all added rows - extra memory overhead. We will use the data structure from STL - vector. In it, memory is allocated dynamically, therefore the additional costs will be zero. Explicitly, the procedure for adding a string follows (we use the 26-letter lowercase Latin alphabet => k = 26).
Functions to create a new vertex and initialize boron: vector<bohr_vrtx> bohr; bohr_vrtx make_bohr_vrtx(){ bohr_vrtx v;
The procedure for adding a sample string to boron: void add_string_to_bohr(const string& s){ int num=0;
Check the presence of a string in boron: bool is_string_in_bohr(const string& s){ int num=0; for (int i=0; i<s.length(); i++){ char ch=s[i]-'a'; if (bohr[num].next_vrtx[ch]==-1){ return false; } num=bohr[num].next_vrtx[ch]; } return true; }
Construction of the machine on the forest
Description of the principle of operation
Our task is to build a finite deterministic automaton. What a thing this can be seen
here . In short, the state of the automaton is some kind of vertex of boron. The transition from the state is carried out by 2 parameters - the current vertex v and the symbol ch. on which we need to move from this top. More specifically, it is necessary to find the vertex u, which denotes the longest string, consisting of the suffix string v (possibly zero) + character ch. If there is no such thing in boron, then we go to the root.
Why do we need this? Suppose we can calculate such a vertex quickly, in constant time. Suppose that we are standing at a certain vertex of a boron corresponding to the substring [i..j] of the string s, the entry into which we are looking for. Now we find all the lines of boron, the suffixes s [i..j]. It is argued that they can be searched quickly (described below). After that, we simply move from the state of the automaton v to the state u by the symbol s [j + 1] and continue the search.
(Fig. 3)To implement the automaton, we need the notion of a suffix link from a vertex.
Suffix Links
Let us call a suffix reference of a vertex v a pointer to a vertex u, such that the string u is the largest proper suffix of the string v, or, if there is no such vertex in boron, then a pointer to the root. In particular, the link from the root leads to it. We will need suffix links for each vertex in boron, so we slightly change the vertex structure and vertex creation procedure by entering the additional variable suff_link.
Here is an example of the placement of suf. links for boron in fig. one:
(Fig. 4)Implementation of the machine
Let us return to the problem of fast transition between states of the automaton. Obviously, all possible transitions exist bohr.size () * k, since for each possible vertex and each possible symbol in the alphabet you need to know the transition. Pre-counting significantly reduces the average operation time of the algorithm, so we will use the ideas of lazy dynamics - we will consider them as necessary and memorize the values ​​already counted.
We introduce a computed function for the transition (v, ch). The idea here is this: if from the current vertex there is an edge with the symbol ch, then we go through it, otherwise we will follow the suffix link and start recursively from the new vertex. Why it works is not difficult to guess.
(Fig. 5)The only question is the correct receipt of suf. links from the top. In this task, you can also use lazy dynamics. The heuristic is as follows: to get suf. links of the vertex v (strings s [i..j]) go down to its ancestor par, go through suf. link par and run the transition from the current vertex t on the symbol symb, which is written on the edge from par to v. Obviously, we first get to the largest suffix s [i..j-1] such that it has an edge with the symbol symb, then we will go along this edge. By definition, the resulting vertex is a suffix link from the vertices of v.
(Fig. 6)So, it is clear that the functions of obtaining a suffix link and transition from the state of the automaton are interrelated. Their convenient implementation represents 2 functions, each of which recursively calls the other. The base of both recursions is suf. a link from the root or from the son of the root leads to the root.
To begin with, we will change the vertex structure and the procedure for creating a new vertex: Full implementation of the automaton requires the pre-declaration of one of the functions: int get_auto_move(int v, char ch); int get_suff_link(int v){ if (bohr[v].suff_link==-1)
Identifying "good" suffix links
With the automaton, it is easy to determine the algorithm itself: we read the string, move from state to state by the characters of the line, in each state we move in suf. references, that is, by the suffixes of the string in the position of the machine, while checking their presence in boron.
Everything would be fine, but it turns out that this variant of Aho-Korasika has quadratic asymptotics with respect to N — the length of the readable string s. Indeed, for a string from state v, you can find v.length () suffixes, and the transition from states can simply increase by 1 the length of this string. Tzimes is to move along suf. links fall only in the known among the sample lines. We introduce the concept of "good" suf. links suff_flink. So, bohr [v] .suff_flink is the closest suffix in a boron for which flag = true. The number of "jumps" in the use of such links will decrease and will be proportional to the number of the required entries ending in this position.
(Fig. 7)Again, the boy will change the structure of the vertex and the procedure for adding: Calculate them is quite simple, all the same lazy dynamics. We introduce the function of counting “good” suf. links. If for the top by suf. flag = true, then this is the desired vertex, otherwise recursively run from the same vertex.
The calculation is good suf. references: int get_suff_flink(int v){ if (bohr[v].suff_flink==-1){ int u=get_suff_link(v); if (u==0)
The implementation of the search on the machine
The search is trivial. We will need a procedure for walking on the "good" suf. links check (v, i) from the current position of the automaton v given that this position ends in the i-th letter in the word s.
check (v, i) void check(int v,int i){ for(int u=v;u!=0;u=get_suff_flink(u)){ if (bohr[u].flag) cout<<i-pattern[bohr[u].pat_num].length()+1<<" "<<pattern[bohr[u].pat_num]<<endl; } }
Here is the search itself: void find_all_pos(const string& s){ int u=0; for(int i=0;i<s.length();i++){ u=get_auto_move(u,s[i]-'a'); check(u,i+1); } }
Here is an example of work: bohr_ini(); add_str_to_bohr("abc"); add_str_to_bohr("bcdc"); add_str_to_bohr("cccb"); add_str_to_bohr("bcdd"); add_str_to_bohr("bbbc"); find_all_pos("abcdcbcddbbbcccbbbcccbb");
Run:
1 abc
2 bcdc
6 bcdd
10 bbbc
13 cccb
16 bbbc
19 cccb
Estimation of complexity and storage methods
The existing variant of the algorithm is looped along the length of s (N = s.length ()), from where it can already be evaluated as O (N * O (check)), but since check jumps only over obviously marked vertices for which flag = true , then the total asymptotic behavior can be estimated as O (N + t), where t is the number of all possible occurrences of all sample strings in s. To be precise and take into account the calculations of the automaton and suf. links, the algorithm works O (M * k + N + t), where M = bohr.size (). Memory - constant arrays of size k for each vertex of boron, from where the estimate O (M * k) results.
It turns out that another way of storing, and specifically, referring to the alphabet, is capable of changing this assessment. We will use the map map <char, int> instead of an array. We read
here and
here , we see that the map data structure from STL is implemented by a red-and-black tree, and the time for accessing its elements is proportional to the logarithm of the number of elements. In our case, the binary logarithm of the size of the alphabet k (which is practically a constant). The total time is O ((M + N) * log k + t). In practice, this is much faster than an array. Map does not store unnecessary memory cells for elements at all, so the memory is proportional to the number of edges in a boron (and therefore the number of vertices in a boron, because in a tree with M vertices there are M-1 edges). The number of transition calculations for an automaton is obviously proportional to the length of the string. The resulting estimate is O (M + N).
| Option with arrays next_vrtx [k], auto_move [k] | Option with red-black tree map <char, int> |
Time complexity | O (M * k + N + t) | O ((M + N) * log k + t) |
Space complexity | O (M * k) | O (M + N) |
PS : Here is a link to a fully implemented working algorithm: Aho-Corasick alg .