A simple algorithm for finding all matching sub-texts in two texts

On duty, I often need to find all the intersections between texts (for example, all the quotes from one text to another). For a long time I was looking for a standard solution that would allow it to be done, but I never managed to find it - usually some completely or slightly different problem is solved. For example, the difflib SequenceMatcher class in the standard Python library finds the longest common subsequence in two sequences of hashable elements, and then recursively repeats the search to the left and to the right of it. If one of the texts has a shorter subsequence, which is contained inside an already found one (for example, if a piece of a long quotation was repeated somewhere else), it will skip it. In addition, when I put War and Peace and Anna Karenina in him in the form of word lists and asked me to start by finding the longest subsequence, he thought for seven minutes; when I asked for all matching blocks, he left and did not return (in the documentation they promise average linear time, but something in Leo Tolstoy’s prose seems to bring about the worst-case quadratic).

In the end, I came up with my own algorithm, thereby probably inventing a bicycle, which I hope to see in the comments. The algorithm does exactly what I need: it finds all matching sequences of words in two texts (except for those in both texts that are part of larger matching sequences) and compares War and Peace with Anna Karenina in a minute.

')
The principle is the following: first, all digrams are extracted from each text and a hash table is created, where its serial number is indicated for each digram. Then a shorter text is taken, its digrams are moved in any order, and if one of them is in the dictionary for the second text, then all pairs of the form are added to the separately created array (the digram number from the first dictionary, the digram number from the second dictionary). For example, if in the text 1 digram “A B” is found at positions 1, 2 and 3, and in the second text it is also found at positions 17, 24 and 56, the pairs will fall into the array (1, 17), (1, 24) , (1, 56), (2, 17) ... This is the weakest point of the algorithm: if both texts consist of the same words, then n by m pairs of digits will fall into the array. Such texts, however, are unlikely to come across (the algorithm is oriented on texts in natural language), and to reduce the number of meaningless matches, you can replace the digrams with any n-grams (depending on what minimal matches are needed) or throw out frequency words.

Each pair of digits in the array is a coincidence at the bigram level. Now we need to get matching sequences from there, and if we have a match of the type “A B B C”, then the fact that exactly the same text fragments coincide in “A B”, “B B” and “B C” t. d. must be ignored. All this is very easy to do in linear time using a moderately non-trivial data structure — an ordered set. It should be able to put into itself and throw out elements from itself for a constant time and remember in which order the elements were added to it. The implementation of such a set for Python is here: code.activestate.com/recipes/576694-orderedset For our needs, it should be able to spit out elements from itself not only from the end, but also from the beginning. Add the quick-pop popirst method:

def popFirst(self): if not self: raise KeyError('set is empty') for item in self: i = item break self.discard(i) return i

The rest is quite simple: remove the first element from the set, put it into a temporary array and, while we can, add elements from the set to it that each next and first and second indices are one more than the previous one. When there are no more such, send the found parallel sequence to the final array and repeat again. All added elements are also discarded from the set. Thus, we simultaneously obtain maximally parallel sequences, ignore the coincidence of subsequences in long sequences and do not lose the connection between the sequences found and other places in the text (such coincidences will differ by the first or second index). Ultimately, at the output we have an array of arrays, where each subarray is a matching sequence of words. You can sort these subarrays by length or by index of the beginning (to find out all the parallel places to a particular fragment) or filter by the minimum length.

Code on Python without OrderedSet and with bigrams. compareTwoTexts returns the result immediately. In order to understand exactly which fragments of text match the numbers of bigrams, you need to separately make a dictionary of digrams and a reverse vocabulary from it (extractNGrams, getReverseDic) or just take a word list (extractWords): each digram starts with a word that is in the same position as she herself.

 from OrderedSet import OrderedSet russianAlphabet = {'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''} def compareTwoTexts(txt1, txt2, alphabet = russianAlphabet): # txt1 should be the shorter one ngramd1 = extractNGrams(txt1, alphabet) ngramd2 = extractNGrams(txt2, alphabet) return extractCommonPassages(getCommonNGrams(ngramd1, ngramd2)) def extractNGrams(txt, alphabet): words = extractWords(txt, alphabet) ngrams = [] for i in range(len(words)-1): ngrams.append( (words[i] + ' ' + words[i+1], i) ) ngramDic = {} for ngram in ngrams: try: ngramDic[ngram[0]].append(ngram[1]) except KeyError: ngramDic[ngram[0]] = [ngram[1]] return ngramDic def extractWords(s, alphabet): arr = [] temp = [] for char in s.lower(): if char in alphabet or char == '-' and len(temp) > 0: temp.append(char) else: if len(temp) > 0: arr.append(''.join(temp)) temp = [] if len(temp) > 0: arr.append(''.join(temp)) return arr def getReverseDic(ngramDic): reverseDic = {} for key in ngramDic: for index in ngramDic[key]: reverseDic[index] = key return reverseDic def getCommonNGrams(ngramDic1, ngramDic2): # ngramDic1 should be for the shorter text allCommonNGrams = [] for nGram in ngramDic1: if nGram in ngramDic2: for i in ngramDic1[nGram]: for j in ngramDic2[nGram]: allCommonNGrams.append((nGram, i, j)) allCommonNGrams.sort(key = lambda x: x[1]) commonNGramSet = OrderedSet((item[1], item[2]) for item in allCommonNGrams) return commonNGramSet def extractCommonPassages(commonNGrams): if not commonNGrams: raise ValueError('no common ngrams') commonPassages = [] temp = [] while commonNGrams: if not temp: temp = [commonNGrams.popFirst()] if not commonNGrams: break if (temp[-1][0]+1, temp[-1][1]+1) in commonNGrams: temp.append((temp[-1][0]+1, temp[-1][1]+1)) commonNGrams.discard((temp[-1][0], temp[-1][1])) else: commonPassages.append(temp) temp = [] if temp: commonPassages.append(temp) return commonPassages

Source: https://habr.com/ru/post/261111/

All Articles

A simple algorithm for finding all matching sub-texts in two texts

More articles: