By popular demand, here is a description and a proof of Ukkonen's algorithm.
Problem statement
We need to build a suffix tree for a given string in reasonable time. A suffix tree is a trie containing all the suffixes of the given string. In short, a trie is a rooted tree with characters written on its edges, a data structure for storing strings: a string is read off by walking from the root along the edges to a terminal vertex, writing down the corresponding characters along the way.
A trie for an arbitrary set of strings is built in O(sum of the lengths of these strings). Obviously, the sum of the lengths of all suffixes of a string is proportional to the square of the length of the string itself. Thus, the trivial algorithm builds a suffix tree in O(N^2). A reasonable question arises: can we build a suffix tree faster?
It turns out we can.
So, let's start from the beginning. Let S = s_1 ... s_N be the string of length N for which we build the suffix tree. Assume the tree grows downward: the root is the topmost vertex, and edges go from an ancestor down to its descendants. In what follows I will sometimes call vertices suffixes and vice versa, meaning the suffixes (or substrings) that correspond to them.
First, we abandon the idea of adding suffixes one by one, and instead build the tree for all suffixes simultaneously, adding characters to the string one at a time. Second, we compress the edges. What does this mean: if there is a chain of edges whose intermediate vertices each have exactly one child, we merge the chain into a single edge and label it with the sequence of characters written on the merged edges. There is no need to store this sequence explicitly; for each edge it is enough to store the indices of the beginning and end of the corresponding substring of S. To the removed vertices we assign positions inside the long edge, and call these positions "imaginary vertices". Thus, to identify such a vertex we need: the vertex the edge leaves, the first character of that edge, and the offset along the edge. Only the explicit vertices are considered ancestors.
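The representation above can be sketched in a few lines. This is only an illustrative sketch (the names `Node`, `start`, `end`, `children` are my own, not from the article): an edge stores two indices into S instead of the substring, and an imaginary vertex is a pair (node, offset).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # the edge from the parent into this vertex is labeled text[start:end];
    # storing two indices instead of the substring makes every edge O(1) memory
    start: int
    end: int
    children: dict = field(default_factory=dict)  # first character of an edge -> child Node

text = "abcab"
root = Node(0, 0)
root.children["a"] = Node(0, 5)           # one compressed edge labeled "abcab"
# an "imaginary vertex" is a position inside an edge: (node, offset along the edge)
imaginary = (root.children["a"], 2)       # the point reached after reading "ab"
```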
First we show that the total number of explicit vertices in a compressed tree is O(N).
Lemma: the number of vertices in a compressed suffix tree is O(N).
Let's see what kinds of vertices there are. First, leaves. Second, "forks": vertices with more than one child. The root is also a vertex, and there is only one root.
Each leaf can only correspond to a suffix. Indeed, suppose a leaf corresponds to T = s_l ... s_r, a substring of S that is not a suffix. Then S contains the string T s_{r+1}, which means an edge must leave our leaf, a contradiction. Thus there are at most N leaves.
At first glance it is not clear how many forks there can be. To count them, consider the moments when they are created while the compressed tree is being built. When adding one suffix, we first walk down the tree, and then one of three cases occurs:
1) we finished adding the suffix without ever trying to leave the already built tree. No fork is created;
2) we tried to leave the tree at an explicit vertex. Then an edge to the end of the suffix is created from this vertex; again, no new fork is created;
3) we tried to leave the tree at an imaginary vertex. This is when we have to split the edge in two and create a fork at this point: one edge corresponds to the remainder of the edge on which we stopped, and the other to the remaining part of the suffix. In this case exactly one fork is created.
We get that there are at most N forks.
As a result, the number of vertices in a compressed tree is O(N).
The algorithm we consider works online, i.e. after the k-th step we have a suffix tree for the prefix of length k. This way we can later extend the string as needed in little time. Since we have N suffixes and each requires adding O(N) characters, we get a running time of O(N^2). To speed this up, we add suffix links to the tree and make three optimizations.
A suffix link is a pointer to (or the number of) a vertex, possibly imaginary, that corresponds to the longest proper suffix (one not equal to the string itself) of the string corresponding to the current vertex. Since we store all suffixes, and hence all substrings, of a single string, the suffix link from the string T = s_l ... s_r points to the string T' = s_{l+1} ... s_r, i.e. it shortens the string by one character. We store suffix links only for forks and the root. To avoid case analysis, we add an auxiliary "empty" vertex blank, from which edges labeled with every symbol of the alphabet lead to the root, and the suffix link of the root leads to blank. The fact that the suffix link of a fork always leads to an explicit vertex I leave without proof, since I consider it obvious. For leaves the analogous statement is false: a suffix link from a leaf can also point to an imaginary vertex, so we do not store links for leaves. So, suffix links lead to explicit vertices, and there are O(N) of them.
Now, the optimizations.
Optimization number 1: once a leaf, always a leaf
Note that when we add a character to a leaf, we again get a leaf. Therefore, let a leaf be created not just for the part of the string considered so far, but for the whole string to its end: we set the right boundary of the substring on the edge into a leaf to infinity. Then the leaves no longer need to be extended at every step. Consequently, once we have created a leaf for some suffix, we never consider that suffix again. Clearly this does not affect the asymptotics, but it saves us unnecessary work.
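One common way to implement the "infinite" right boundary (a sketch under my own naming, not the article's code) is to make all leaf edges share one mutable end value, so extending every leaf costs a single assignment per phase:

```python
end = [0]                                 # shared "global end" for every leaf edge
leaf1 = {"start": 0, "end": end}
leaf2 = {"start": 2, "end": end}
end[0] = 5                                # one store "extends" every leaf at once

def edge_label(text, leaf):
    # the edge into a leaf is labeled text[start:current global end]
    return text[leaf["start"]:leaf["end"][0]]
```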
Optimization number 2: if we just walked along an edge, then for the smaller suffixes we will also walk along an edge
What does this mean? Consider the first (longest) suffix from whose vertex we managed to walk along an edge by the current character. Then it is claimed that for all smaller suffixes we will also walk along the corresponding edge from their vertices, creating neither leaves nor forks. Thus we can cut down the number of vertices processed at each step. For obvious reasons this also does not affect the asymptotics, but it will be needed for the last optimization.
So, the last, main and most difficult optimization lies ahead, the one that brings the asymptotics down to the required value. But first we get rid of the explicit list of end vertices of the suffixes: to visit them all, we will walk from the longest (first) suffix along the suffix links. Since suffix links exist not for every vertex but only for forks (and the root, which from now on we also count as a fork), following a suffix link from a leaf or an imaginary vertex happens like this:
1) walk up the edge to the nearest ancestor fork, remembering the substring along which we climbed;
2) follow the suffix link of that fork;
3) walk down until we have passed through the remembered substring. We may pass through several forks along the way; then we simply follow the appropriate edges;
4) the vertex, possibly imaginary, at which we stop is the "suffix link" of the original one.
The fact that we will not try to leave the tree during this descent I again leave without proof as obvious.
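Step 3 is usually implemented with the skip/count trick: since the descent is guaranteed to stay inside the tree, we never compare characters inside an edge, only the first character, and jump by whole edge lengths. A minimal sketch, assuming a dict-based node with a precomputed `edge_len` field (my own representation, for illustration):

```python
def walk_down(node, text, sub_start, sub_len):
    """Descend from `node` along the substring text[sub_start:sub_start+sub_len],
    jumping by whole edges; returns (vertex, offset), offset > 0 for an imaginary one."""
    while sub_len > 0:
        child = node["children"][text[sub_start]]   # first character picks the edge
        step = min(sub_len, child["edge_len"])
        if step < child["edge_len"]:
            return (child, step)        # stopped inside the edge: an imaginary vertex
        node = child                    # passed a whole edge, possibly through a fork
        sub_start += step
        sub_len -= step
    return (node, 0)                    # stopped exactly at an explicit vertex

# tiny hand-built tree for "ab": root --"ab"--> leaf
leaf = {"children": {}, "edge_len": 2}
root = {"children": {"a": leaf}, "edge_len": 0}
```

Each jump takes O(1), which is what the amortized analysis at the end of the article relies on.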
So, when adding a character we have three possible actions on a vertex:
1) extending a leaf (performed for a leaf);
2) creating a fork (performed for an imaginary vertex, or for a fork that has no edge with this character);
3) simply walking along an edge (performed for an imaginary vertex or a fork that does have such an edge).
It is claimed that if we consider the suffixes in order of decreasing length, these actions occur in the order 1 1 ... 1 2 2 ... 2 3 3 ... 3. (This follows from the fact that the vertex reached by the suffix link of a given vertex has at least all the edges the given vertex has, and possibly more.) Moreover, the positions where action 1 changes to 2 and where 2 changes to 3 only move to the left (keeping in mind that after each phase the length of this sequence grows by 1, because a new suffix is added).
Let's call the first vertex that is not a leaf the start point (SP). This vertex corresponds to the first occurrence of 2 in the sequence. True, there may be no actions of type 2 at all; then SP is the first occurrence of 3. The vertex corresponding to the first occurrence of 3 we call the active point (AP). What have the first two optimizations given us? That in each phase we may start processing the sequence at SP and finish as soon as we reach AP. Now a reasonable question arises: how do we quickly find the SP for the next phase? That is what the third optimization is for.
Optimization number 3: the SP of the next phase is a descendant of the AP of the current phase
To begin with, SP is a descendant of one of the suffix vertices that existed before this phase: during the phase some action was performed on each of them, followed by a descent along the edge with the current character. Therefore I will reason not about the descendants, but about the vertices of the previous phase.
An obvious fact: the next SP cannot be a suffix on which an action of the first type was performed at the current step. Further, SP cannot be a suffix on which an action of the second type was performed, since there we created a fork, and walking along the edge leading to the leaf we land on a leaf. Why does the suffix corresponding to the previous AP qualify? For an AP we only require that at least one edge leave it. Consider the suffix T corresponding to the AP. Since it has at least the edge with the current character, T occurs in S as a substring that is not a suffix. Hence S contains the substring T s_k (recall that in this phase we added the character s_k). Then the new string, with the character s_k appended, contains a substring T s_k c, where c is some character. So the vertex T s_k (a descendant of the AP) satisfies the condition for being an AP, and it is the earliest possible candidate, which means it will be the next AP.
As a result, it is enough for us to store only the AP and move it. Processing the vertices while moving the AP (creating forks or adding an edge to an existing vertex) is enough to build the tree.
Congratulations! If you have read this far and understood almost everything, you can already write a working implementation of Ukkonen's algorithm. For better understanding, here is the pseudocode of the algorithm:
    create root and blank; suffix_link(root) = blank;
    SP = root; k = 1;
    while (there is a next character c = s_k) {
        while (there is no way to go from SP by c) {
            create a leaf [k, oo) hanging from SP;   // splitting the edge if SP is imaginary
            SP = suffix_link(SP);
        }
        SP = the vertex one step down from SP by c;
        k++;
    }
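The pseudocode above can be fleshed out into a working implementation. Below is a sketch in the standard active-point formulation (the names `Node`, `active_node`, `remainder` and the triple representing the AP are conventional implementation choices, not taken from the article); all three optimizations are present: open-ended leaf edges, stopping the phase at the first rule-3 step, and skip/count walk-downs. `root.link = root` plays the role of the blank vertex.

```python
class Node:
    """A vertex; the edge from its parent is labeled text[start:end].
    end is None for leaves: the edge grows with the text (optimization 1)."""
    def __init__(self, start, end=None):
        self.start = start
        self.end = end
        self.link = None          # suffix link (forks and root only)
        self.children = {}        # first character of an edge -> child Node

def build_suffix_tree(text):
    """Ukkonen's algorithm, amortized O(len(text)). Append a unique
    terminator (e.g. '$') so that every suffix ends at a leaf."""
    root = Node(-1, -1)
    root.link = root              # stands in for the 'blank' vertex
    active_node, active_edge, active_len = root, 0, 0   # the AP, possibly imaginary
    remainder = 0                 # suffixes not yet explicitly inserted
    for i, c in enumerate(text):
        remainder += 1
        last_fork = None          # fork awaiting its suffix link this phase
        while remainder > 0:
            if active_len == 0:
                active_edge = i
            first = text[active_edge]
            if first not in active_node.children:
                # action 2 at an explicit vertex: hang a new leaf
                active_node.children[first] = Node(i)
                if last_fork is not None:
                    last_fork.link = active_node
                    last_fork = None
            else:
                nxt = active_node.children[first]
                edge_len = (nxt.end if nxt.end is not None else i + 1) - nxt.start
                if active_len >= edge_len:
                    # skip/count: jump down a whole edge at once
                    active_node = nxt
                    active_edge += edge_len
                    active_len -= edge_len
                    continue
                if text[nxt.start + active_len] == c:
                    # action 3: character already on the edge; the phase ends
                    active_len += 1
                    if last_fork is not None:
                        last_fork.link = active_node
                    break
                # action 2 at an imaginary vertex: split the edge, create a fork
                fork = Node(nxt.start, nxt.start + active_len)
                fork.link = root
                active_node.children[first] = fork
                fork.children[c] = Node(i)        # leaf for the new suffix
                nxt.start += active_len
                fork.children[text[nxt.start]] = nxt
                if last_fork is not None:
                    last_fork.link = fork
                last_fork = fork
            remainder -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = i - remainder + 1
            else:
                active_node = active_node.link   # follow the suffix link
    return root

def collect_suffixes(root, text):
    """Reads every root-to-leaf path; with a terminator these are the suffixes."""
    n, out = len(text), []
    def dfs(node, prefix):
        if not node.children:
            out.append(prefix)
            return
        for child in node.children.values():
            end = child.end if child.end is not None else n
            dfs(child, prefix + text[child.start:end])
    dfs(root, "")
    return sorted(out)
```

For example, `collect_suffixes(build_suffix_tree("abcab$"), "abcab$")` yields all six suffixes of "abcab$" in sorted order.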
But one last question remains. Why is the asymptotics of this algorithm O(N)? Although the optimizations ensure that we touch the suffixes at most 2N times in total (we either process a suffix and move on, or stay on it and look at the next character, and each of these happens at most N times), we still have the transitions along suffix links, and it is not at all obvious that the total work for them is O(N).
Proof of the asymptotics
Since following the link itself takes O(1) and there are at most N such transitions, it suffices to bound the descents along the edges. To do this, consider the length L of the offset along the edge by which we climbed up to the ancestor fork. With each edge passed on the way down, this length decreases by at least 1. It increases only in the basic procedure, when we walk along an edge by the character c, by at most 1 at a time, and there are at most N such steps. The offset value always lies in the range [0, N], there are at most N increases, so the total number of decreases is at most 2N. This means the total number of downward steps while following suffix links is O(N). As a result, each type of action is performed O(N) times, and hence the total number of actions is O(N). The asymptotics is proved.
Applications of suffix trees
The first, and perhaps most common, practical task involving suffix trees is finding occurrences of a string in a text. Other algorithms exist for this task, but suppose we modify it slightly: for example, we need to answer many queries, and we cannot preprocess all of them in advance as in the Aho-Corasick algorithm. Then we simply walk down the tree by the characters of the query: if we try to leave the tree, the answer is negative; if the query is exhausted inside the tree, the answer is positive. If, in addition to the tree, we store for each vertex how many leaves are reachable from it, and append a unique character to the text so that every suffix corresponds to a leaf, we can also report the number of occurrences in the text.
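The query logic is easy to demonstrate even on an uncompressed suffix trie (built naively in O(N^2), purely for illustration; the compressed tree from Ukkonen's algorithm answers the same queries with the same walk, only skipping along edges):

```python
def build_suffix_trie(text):
    """Naive O(len(text)^2) suffix trie: nested dicts, character -> subtree."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Walk down by the pattern's characters; failing to move means no occurrence."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False        # tried to leave the tree: negative answer
        node = node[ch]
    return True                 # pattern exhausted inside the tree: positive answer
```

Every substring of the text is a prefix of some suffix, which is exactly why this walk works.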
The longest common substring of two or more strings.
The problem statement is simple: given a set of strings (we consider the case of two), find their longest common substring in linear time. For this we build a generalized suffix tree of these strings. Each leaf of this tree is a suffix of some string of the set. Then, if for each fork we record which strings of the set have suffixes beginning with the string of this vertex, it is enough, to find the longest common substring, to walk over all vertices and, whenever a vertex is marked with the full set of strings, compare its depth with the length of the best answer found so far.
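The marking idea can be sketched on an uncompressed generalized suffix trie (again quadratic, for illustration only; the linear-time bound claimed in the article requires the compressed tree built by Ukkonen's algorithm):

```python
def longest_common_substring(a, b):
    """Mark every trie node with the set of source strings whose substring it is,
    then take the deepest node marked with both."""
    make_node = lambda: {"children": {}, "tags": set()}
    root = make_node()
    for tag, s in (("a", a), ("b", b)):
        for i in range(len(s)):
            node = root
            node["tags"].add(tag)
            for ch in s[i:]:
                node = node["children"].setdefault(ch, make_node())
                node["tags"].add(tag)
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) > len(best):
            best = prefix
        for ch, child in node["children"].items():
            # a child missing a tag cannot have a fully-marked descendant
            if child["tags"] == {"a", "b"}:
                stack.append((child, prefix + ch))
    return best
```

Note that tags only shrink on the way down, which justifies pruning single-tag subtrees.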
Computing the number of leaves in the first task and the string sets in the second is a matter of dynamic programming, and I will not expand on that topic here. Briefly: walk the tree from the bottom up and do the counting, combining the already computed results of the children.
P.S. If there is interest, I can write an article on the Aho-Corasick algorithm.
P.P.S. If some points remain unclear, write to me and I will explain in more detail.