In this article, I would like to review the popular algorithms for solving the problem of finding the longest common subsequence, or LCS. Since the emphasis is on learning rather than real-world use, Python was chosen as the implementation language, which keeps the amount of code small and the focus on the main ideas.
Definitions
A sequence is an ordered set of elements. A string is a special case of a sequence; for simplicity, the examples below work with strings, but they apply without changes to arbitrary text or anything else sequential.
Let there be a sequence x consisting of elements x_1 x_2 ... x_m and a sequence y consisting of elements y_1 y_2 ... y_n. z is a subsequence of x if there exists a strictly increasing set of indices of elements of x from which z is obtained.
A common subsequence of x and y is a sequence z that is both a subsequence of x and a subsequence of y. The longest common subsequence is a common subsequence of maximal length. Further in the text we will use the abbreviation LCS.
As an example, let x = HABRAHABR and y = HARBOUR; in this case LCS(x, y) = HARBR. We could go straight to the LCS computation algorithms, but first it is worth understanding why we might need one.
Practical application
The most common application is in file comparison programs such as GNU diff. Once the LCS of two texts is found, it is trivial to produce the list of elementary changes that turns x into y or back. As a bonus, the length of the common subsequence gives a metric for how similar two sequences are. That's it, now we can get down to business.
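For a taste of this in practice, Python's standard library ships difflib, whose SequenceMatcher is built around finding matching blocks between two sequences and deriving a similarity ratio from them. This is only an illustration of the use case, not one of the algorithms below: SequenceMatcher uses its own matching heuristics rather than an exact LCS.

```python
import difflib

x = "HABRAHABR"
y = "HARBOUR"

matcher = difflib.SequenceMatcher(None, x, y)
# the pieces shared by both strings (the last block is a size-0 sentinel)
blocks = [x[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]
# ratio() is 2*M/T, where M is the number of matched elements
# and T is the total length of both sequences
similarity = matcher.ratio()
```

The ratio is exactly the similarity metric mentioned above: 1.0 for identical sequences, 0.0 for sequences with nothing in common.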
The first approach, or folk art
First, a couple of observations:
- If for sequences x and y we have already computed LCS(x, y) = z, then the LCS of the sequences obtained from x and y by appending the same element consists of z plus that appended element
- If we append a different element to each of x and y, then for the resulting xa and yb the LCS is the longer of the two: LCS(x, yb) or LCS(xa, y)
These observations are enough to implement recursion:
def LCS_RECURSIVE(x, y):
    if len(x) == 0 or len(y) == 0:
        return []
    if x[-1] == y[-1]:
        return LCS_RECURSIVE(x[:-1], y[:-1]) + [x[-1]]
    else:
        left = LCS_RECURSIVE(x[:-1], y)
        right = LCS_RECURSIVE(x, y[:-1])
        return left if len(left) > len(right) else right
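On the small example from the beginning of the article the recursion works fine (the function is repeated here so the snippet runs on its own):

```python
def LCS_RECURSIVE(x, y):
    if len(x) == 0 or len(y) == 0:
        return []
    if x[-1] == y[-1]:
        # matching last elements belong to the LCS
        return LCS_RECURSIVE(x[:-1], y[:-1]) + [x[-1]]
    else:
        # otherwise drop the last element of one sequence or the other
        left = LCS_RECURSIVE(x[:-1], y)
        right = LCS_RECURSIVE(x, y[:-1])
        return left if len(left) > len(right) else right

print(''.join(LCS_RECURSIVE("HABRAHABR", "HARBOUR")))  # HARBR
```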
Now you might think the job is done, but it is not. In the worst case, when x and y have no elements in common, LCS_RECURSIVE will be called on the order of 2^(len(x) + len(y)) times, with a recursion depth of len(x) + len(y). Even if you close your eyes to performance, the code:
from uuid import uuid4
x = [uuid4().hex for _ in range(500)]
y = [uuid4().hex for _ in range(500)]
print(LCS_RECURSIVE(x, y))
will not even run without an additional call to sys.setrecursionlimit. Not the most successful implementation.
Dynamic programming, or 64 KB is enough for everyone
The considered algorithm is also known as the Needleman – Wunsch algorithm.
The whole approach comes down to filling a matrix step by step, where the rows correspond to the elements of x and the columns to the elements of y. Filling follows two rules that stem from the observations already made:
1. If element x_i is equal to y_j, then cell (i, j) receives the value of cell (i-1, j-1) plus one
2. If element x_i is not equal to y_j, then cell (i, j) receives the maximum of the values in cells (i-1, j) and (i, j-1)
Filling proceeds in a double loop over i and j in increasing order, so at each iteration the cell values needed for the current step have already been computed:
def fill_dyn_matrix(x, y):
    # the extra zero row and column act as sentinels via negative indexing
    L = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for x_i, x_elem in enumerate(x):
        for y_i, y_elem in enumerate(y):
            if x_elem == y_elem:
                L[x_i][y_i] = L[x_i - 1][y_i - 1] + 1
            else:
                L[x_i][y_i] = max(L[x_i][y_i - 1], L[x_i - 1][y_i])
    return L
Illustration of what is happening:

The cells in which the value directly increased are highlighted. After filling the matrix, connecting these cells yields the desired LCS. We trace from the maximum indices down to the minimum ones: if the current cell is highlighted, we add the corresponding element to the LCS; if not, we move up or to the left, depending on where the larger value in the matrix is:
def LCS_DYN(x, y):
    L = fill_dyn_matrix(x, y)
    LCS = []
    x_i, y_i = len(x) - 1, len(y) - 1
    while x_i >= 0 and y_i >= 0:
        if x[x_i] == y[y_i]:
            LCS.append(x[x_i])
            x_i, y_i = x_i - 1, y_i - 1
        elif L[x_i - 1][y_i] > L[x_i][y_i - 1]:
            x_i -= 1
        else:
            y_i -= 1
    LCS.reverse()
    return LCS
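A quick check of LCS_DYN on the running example (both functions are repeated so the snippet runs on its own):

```python
def fill_dyn_matrix(x, y):
    # the extra zero row and column act as sentinels via negative indexing
    L = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for x_i, x_elem in enumerate(x):
        for y_i, y_elem in enumerate(y):
            if x_elem == y_elem:
                L[x_i][y_i] = L[x_i - 1][y_i - 1] + 1
            else:
                L[x_i][y_i] = max(L[x_i][y_i - 1], L[x_i - 1][y_i])
    return L

def LCS_DYN(x, y):
    L = fill_dyn_matrix(x, y)
    LCS = []
    # trace back from the bottom-right corner of the matrix
    x_i, y_i = len(x) - 1, len(y) - 1
    while x_i >= 0 and y_i >= 0:
        if x[x_i] == y[y_i]:
            LCS.append(x[x_i])
            x_i, y_i = x_i - 1, y_i - 1
        elif L[x_i - 1][y_i] > L[x_i][y_i - 1]:
            x_i -= 1
        else:
            y_i -= 1
    LCS.reverse()
    return LCS

print(''.join(LCS_DYN("HABRAHABR", "HARBOUR")))  # HARBR
```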
The complexity of the algorithm is O(len(x) * len(y)), and the memory estimate is the same. Thus, to compare two files of 100,000 lines each line by line, I would need to keep a matrix of 10^10 elements in memory, i.e. real use threatens to end in a MemoryError almost out of the blue. Before moving on to the next algorithm, note that when filling the matrix L, each iteration over the elements of x needs only the row produced on the previous pass. So if the task were limited to finding just the length of the LCS, without recovering the LCS itself, memory usage could be cut to O(len(y)) by keeping only two rows of the matrix L.
The fight for space, or Hirschberg's algorithm
The idea behind this algorithm is simple: if we split the input sequence x = x_1 x_2 ... x_m at an arbitrary boundary index i into xb = x_1 x_2 ... x_i and xe = x_(i+1) x_(i+2) ... x_m, then there is a way to split y similarly (an index j dividing y into yb = y_1 y_2 ... y_j and ye = y_(j+1) y_(j+2) ... y_n) such that LCS(x, y) = LCS(xb, yb) + LCS(xe, ye).
To find this partition of y, it is proposed to:
- Fill the dynamic matrix L with fill_dyn_matrix for xb and y; let L1 be the last row of the computed matrix
- Fill the matrix L for *xe and *y (the reversed sequences, in Python terms list(reversed(xe)) and list(reversed(y))); let *L2 be the last row of that matrix, and L2 the reverse of *L2
- Take j equal to the index at which the sum L1[j] + L2[j] is maximal
This can be represented as filling the matrix L from two opposite ends:

Note that since only the last row of the matrix L is needed, the whole matrix does not have to be stored during the computation. By slightly changing the implementation of fill_dyn_matrix we get:
def lcs_length(x, y):
    curr = [0] * (1 + len(y))
    for x_elem in x:
        prev = curr[:]
        for y_i, y_elem in enumerate(y):
            if x_elem == y_elem:
                curr[y_i + 1] = prev[y_i] + 1
            else:
                curr[y_i + 1] = max(curr[y_i], prev[y_i + 1])
    return curr
Now for the algorithm itself. It is classic divide and conquer: while there are elements in x, we split x in half, find the matching partition of y, and return the concatenation of the recursive calls on the pairs (xb, yb) and (xe, ye). Note that in the trivial case, if x consists of a single element that occurs in y, we simply return a sequence of that single element.
def LCS_HIRSHBERG(x, y):
    x_len = len(x)
    if x_len == 0:
        return []
    elif x_len == 1:
        return [x[0]] if x[0] in y else []
    else:
        i = x_len // 2
        xb, xe = x[:i], x[i:]
        L1 = lcs_length(xb, y)
        L2 = reversed(lcs_length(xe[::-1], y[::-1]))
        SUM = (l1 + l2 for l1, l2 in zip(L1, L2))
        _, j = max((sum_val, sum_i) for sum_i, sum_val in enumerate(SUM))
        yb, ye = y[:j], y[j:]
        return LCS_HIRSHBERG(xb, yb) + LCS_HIRSHBERG(xe, ye)
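Putting it together on the running example (lcs_length is repeated so the snippet runs on its own):

```python
def lcs_length(x, y):
    # LCS lengths of x against every prefix of y, using only two rows
    curr = [0] * (1 + len(y))
    for x_elem in x:
        prev = curr[:]
        for y_i, y_elem in enumerate(y):
            if x_elem == y_elem:
                curr[y_i + 1] = prev[y_i] + 1
            else:
                curr[y_i + 1] = max(curr[y_i], prev[y_i + 1])
    return curr

def LCS_HIRSHBERG(x, y):
    x_len = len(x)
    if x_len == 0:
        return []
    elif x_len == 1:
        # a single element either occurs in y or it does not
        return [x[0]] if x[0] in y else []
    else:
        i = x_len // 2
        xb, xe = x[:i], x[i:]
        # last DP rows for the left half and the reversed right half
        L1 = lcs_length(xb, y)
        L2 = reversed(lcs_length(xe[::-1], y[::-1]))
        SUM = (l1 + l2 for l1, l2 in zip(L1, L2))
        _, j = max((sum_val, sum_i) for sum_i, sum_val in enumerate(SUM))
        yb, ye = y[:j], y[j:]
        return LCS_HIRSHBERG(xb, yb) + LCS_HIRSHBERG(xe, ye)

print(''.join(LCS_HIRSHBERG("HABRAHABR", "HARBOUR")))  # HARBR
```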
Now the memory requirements are O(len(y)); the computation time roughly doubles but asymptotically remains O(len(x) * len(y)).
The Hunt–Szymanski algorithm
The first thing we need is to build a matchlist hash table containing, for each element, the list of its indices in the second sequence. The time required for this is O(len(y)). The following Python code does this:
y_matchlist = {}
for index, elem in enumerate(y):
    y_matchlist.setdefault(elem, []).append(index)
For the "HARBOUR" sequence, the table will look like this: {'A': [1], 'B': [3], 'H': [0], 'O': [4], 'R': [2, 6], 'U': [5]}.
Next, iterating over the elements of the sequence x, we fill the array THRESH with the corresponding indices from the prepared matchlist, so that the k-th element of THRESH holds an index y_index satisfying THRESH[k-1] < y_index and y_index < THRESH[k]. Thus the array THRESH is sorted at any moment in time, and a binary search can be used to find the appropriate k. When updating a THRESH element, we also record the sequence element corresponding to y_index for our future LCS. The following code should clarify this:
import bisect

x_length, y_length = len(x), len(y)
min_length = min(x_length, y_length)
THRESH = [y_length] * (min_length + 1)
LINK_x = [None] * (min_length + 1)
THRESH[0], t = -1, 0
for x_index, x_elem in enumerate(x):
    # decreasing order of y indices keeps one element of x
    # from being used more than once per pass
    for y_index in reversed(y_matchlist.get(x_elem, [])):
        # binary search for k with THRESH[k-1] < y_index < THRESH[k]
        k = bisect.bisect_left(THRESH, y_index, 1, t + 1)
        if THRESH[k - 1] < y_index < THRESH[k]:
            THRESH[k] = y_index
            LINK_x[k] = (x_index, LINK_x[k - 1])
            if k > t:
                t = k
After that, we just have to assemble the sequence from LINK_x.
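Gathered into one function, including the final walk over LINK_x, the whole thing might look like this (a sketch assembled from the pieces above; the function name hunt_szymanski is my own):

```python
import bisect

def hunt_szymanski(x, y):
    # matchlist: element of y -> list of its indices in y
    y_matchlist = {}
    for index, elem in enumerate(y):
        y_matchlist.setdefault(elem, []).append(index)

    y_length = len(y)
    min_length = min(len(x), y_length)
    THRESH = [y_length] * (min_length + 1)
    LINK_x = [None] * (min_length + 1)
    THRESH[0], t = -1, 0
    for x_index, x_elem in enumerate(x):
        # decreasing order keeps one x element from being used twice
        for y_index in reversed(y_matchlist.get(x_elem, [])):
            # binary search for k with THRESH[k-1] < y_index < THRESH[k]
            k = bisect.bisect_left(THRESH, y_index, 1, t + 1)
            if THRESH[k - 1] < y_index < THRESH[k]:
                THRESH[k] = y_index
                LINK_x[k] = (x_index, LINK_x[k - 1])
                if k > t:
                    t = k

    # walk the chain of (x_index, previous) pairs and restore order
    LCS, link = [], LINK_x[t]
    while link is not None:
        x_index, link = link
        LCS.append(x[x_index])
    LCS.reverse()
    return LCS

print(''.join(hunt_szymanski("HABRAHABR", "HARBOUR")))  # HARBR
```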
The running time of this algorithm is O((r + n) log n), where n is the length of the longer sequence and r is the number of matching pairs; in the worst case, with r = n^2, the running time is worse than in the previous solution. Memory requirements remain on the order of O(r + n).
Summary
There are quite a few algorithms solving this problem; asymptotically, the most efficient one (Masek and Paterson) has a time estimate of O(n^2 / log n). Given the generally low efficiency of LCS computation, in practice the simplest preprocessing is done before the algorithm runs, such as discarding identical elements at the beginning and end of the sequences and detecting trivial differences between them. There are also optimizations based on bit operations that do not affect the asymptotic running time.
download all code with examples