Earlier, I showed an elementary implementation of the shingle algorithm, which determines whether two documents are near-duplicates. This time I will explain the implementation of the algorithm described by Zelenkov Yu. G. and Segalovich I.V. in the publication
Comparative Analysis of Fuzzy Duplicate Detection Methods for Web Documents.
With this, I begin a series of three theoretical articles in which I will try to describe, in accessible language, how the shingle, supershingle, and megashingle algorithms compare web documents.
In the publication on near-duplicate search algorithms, a version of the shingle algorithm is proposed that uses a random sample of 84 shingle checksums.
Why exactly 84? Using 84 randomly selected checksum values makes it easy to adapt the algorithm into the supershingle and megashingle algorithms, which are far less resource-intensive. I will describe them soon.
I recommend arming yourself with a pen and a sheet of paper and picturing each of the stages described below.

So, the shingle algorithm for web documents.
Let us examine the stages the text goes through during comparison:
- text canonization;
- splitting into shingles;
- calculating shingle hashes using 84 static functions;
- random sampling of 84 checksum values;
- comparison and determination of the result.
1. Canonization of the text
I described what text canonization is in my last article on the shingle algorithm, but I will briefly repeat it: canonization reduces the original text to a single normal form.
The text is cleared of prepositions, conjunctions, punctuation marks, HTML tags, and other unnecessary "garbage" that should not take part in the comparison. In most cases it is also proposed to remove adjectives, since they carry no semantic load.
Also at the canonization stage, nouns can be reduced to the nominative case and singular number, or stripped down to their roots.
You can experiment with canonization endlessly; the scope for action here is wide.
At the output, we have a text cleared of "garbage" and ready for comparison.
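To make this stage concrete, here is a minimal Python sketch of canonization. The stop-word list and the regular expressions are my own illustrative assumptions, not the publication's prescription; a real implementation would use a full stop-word dictionary and, optionally, stemming.

```python
import re

# Illustrative stop-word list (an assumption for this sketch); a real
# implementation would use a full dictionary of prepositions,
# conjunctions, and other "garbage" words.
STOP_WORDS = {"a", "an", "and", "but", "in", "of", "on", "or", "the", "to"}

def canonize(text: str) -> list[str]:
    """Reduce the source text to a single normal form."""
    text = re.sub(r"<[^>]+>", " ", text)   # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation
    words = text.lower().split()
    return [w for w in words if w not in STOP_WORDS]
```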

2. Splitting into shingles
Shingles (English: "scales") are word subsequences selected from the text.
The text must be split into subsequences of words that follow one another, ten words each (ten is the shingle length). The subsequences are taken overlapping, not back-to-back.
Thus, by splitting the text into subsequences, we get a set of shingles whose count equals the number of words minus the shingle length plus one (number of words - shingle length + 1).
I remind you that the actions in each step are performed for each of the texts being compared.
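Here is a sketch of the splitting step in Python, following the formula above (the function name is mine):

```python
def split_into_shingles(words: list[str], shingle_len: int = 10) -> list[str]:
    """Take overlapping subsequences of shingle_len words each;
    yields len(words) - shingle_len + 1 shingles."""
    return [" ".join(words[i:i + shingle_len])
            for i in range(len(words) - shingle_len + 1)]
```

For a canonized text, this is simply `split_into_shingles(canonize(text))`.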

3. Calculating shingle hashes using 84 static functions
This is the most interesting stage. The principle of the shingle algorithm is to compare a random sample of checksums of the shingles (subsequences) of the two texts with each other.
The problem with the algorithm is the number of comparisons, which directly affects performance: checking every shingle of one text against every shingle of the other makes the number of operations grow quadratically with text length, which critically hurts performance.
It is proposed to represent the text as a set of checksums calculated using 84 unique static hash functions.
Let me explain: for each shingle, 84 checksum values are calculated through different functions (for example, SHA1, MD5, CRC32, and so on, 84 functions in total). So each text is represented, one might say, as a two-dimensional array of 84 rows, where each row holds the checksums of all shingles under the corresponding one of the 84 functions.
From the resulting sets, 84 values are randomly selected for each text and compared to each other according to the hash function through which each of them was calculated. Thus, the comparison takes only 84 operations in total.
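A sketch of this stage. The publication calls for 84 distinct static hash functions; purely for illustration I simulate them by salting a single MD5 hash with the function index, which is my stand-in, not the paper's method:

```python
import hashlib

NUM_FUNCS = 84

def shingle_hashes(shingles: list[str]) -> list[list[int]]:
    """Build the 84 x N table: row i holds the checksum of every
    shingle under hash function i. The 84 'static functions' are
    simulated here by salting MD5 with the row index (an illustrative
    stand-in, not the publication's actual functions)."""
    table = []
    for i in range(NUM_FUNCS):
        salt = str(i).encode()
        table.append([int(hashlib.md5(salt + s.encode()).hexdigest(), 16)
                      for s in shingles])
    return table
```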

4. Random sampling of 84 checksum values
As I described above, comparing the elements of the two 84-row arrays with each other is resource-intensive. To increase performance, we take a random sample of checksums from each of the 84 rows of the two-dimensional array, for both texts. For example, we select the lowest value in each row.
So, at the output we have a set of minimum shingle checksum values, one for each of the hash functions.
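In code, the "take the lowest value in each row" rule collapses the table into an 84-element signature:

```python
def signature(table: list[list[int]]) -> list[int]:
    """Pick the minimum checksum in each of the 84 rows,
    one value per hash function."""
    return [min(row) for row in table]
```

Choosing the minimum rather than a truly random element keeps the choice consistent across the two texts: if both texts contain the shingle that produces the smallest checksum for a given function, both signatures get the same value in that row.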

5. Comparison and determination of the result
And the last stage is the comparison. We compare the 84 elements of the first array with the corresponding 84 elements of the second array, compute the ratio of equal values, and from this we get the result.
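A sketch of the final comparison, with the whole pipeline assembled from the sketches above (the similarity threshold for declaring two texts near-duplicates is up to you):

```python
def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of positions where the two 84-element signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# End-to-end usage for two documents text1 and text2:
# sig1 = signature(shingle_hashes(split_into_shingles(canonize(text1))))
# sig2 = signature(shingle_hashes(split_into_shingles(canonize(text2))))
# print(similarity(sig1, sig2))  # e.g. treat > 0.8 as a near-duplicate
```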

Conclusion
I hope I was able to explain the theory of near-duplicate detection for web documents described in the publication by Zelenkov Yu. G. and Segalovich I.V. without tying it to a full implementation in any particular programming language.
In the second article of the series, I am going to describe the supershingle algorithm for web documents.
I hope it was interesting. Good luck.
The original article on our blog:
Part 1. The shingle algorithm for web documents.