
Near-duplicate search and geometry

Recently I needed to find near-duplicates among a large number of short texts. My search for a ready-made solution was unsuccessful, and the solution I ended up with turned out to be interesting enough that I could not deny myself the pleasure of sharing it.

Problem statement


There is a large database of texts (hundreds of thousands of them). The texts are all about the same length, roughly 250 characters, and the language is English. Some of the texts have been edited (typos corrected, commas inserted, etc.), so both the original text and its corrected copy appear in the database. There are not many such pairs, say no more than 1%. The task: find all such pairs.

For a formal definition of what an "edited text" is, the edit distance works well (I use the difflib module in Python). Since there are usually few revisions, it is enough to take pairs of texts that differ by no more than 2-3 edits.
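
As an illustration, here is one way to count edits with difflib; note that difflib itself only reports matching blocks, so the counting scheme below is my assumption rather than necessarily the author's exact metric:

    import difflib

    def edit_distance(a, b):
        # Character-level edit count built on difflib's opcodes: each
        # replace/insert/delete block contributes the number of
        # characters it touches on its longer side.
        dist = 0
        matcher = difflib.SequenceMatcher(a=a, b=b)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                dist += max(i2 - i1, j2 - j1)
        return dist

    assert edit_distance("color", "colour") == 1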

The problem is that there are a lot of texts, and comparing them pairwise is infeasible because the edit distance is slow to compute. Moreover, there is no simple natural way to split the texts into small groups (for example, by subject) that would reduce the number of comparisons.

Main idea


I would like to split the entire set of texts into small subsets so that the edit distance only has to be computed within each subset. This can be done as follows: associate each text with a point in some multidimensional space, so that texts that are close in the edit-distance sense map to close points, while distant texts map to, let us say, not-so-close points. Then it is enough to compute the edit distance only for texts that are close in the geometric sense.

The second requirement on the mapping is quite soft: it may make mistakes and map distant texts to close points. But each such error costs one extra distance computation.

To find nearby points quickly, we can use a spatial index.

Implementation


As the “text → point” mapping, a letter-counting map will do. The space has dimension equal to the number of letters in the language's alphabet, and the coordinate corresponding to a letter X equals the number of occurrences of X in the text. For example, f("aac") = [2, 0, 1, 0, ..., 0]. Only letters are used as coordinates, not punctuation, because inspection of the texts showed that punctuation varies much more (and not necessarily because of human editing; for example, visually similar punctuation characters may be encoded by different Unicode sequences).
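
A minimal sketch of this mapping, assuming a lowercase Latin alphabet (the function name is mine):

    import string

    ALPHABET = string.ascii_lowercase

    def text_to_point(text):
        # One coordinate per letter; the value is the number of
        # occurrences of that letter. Punctuation, digits and
        # whitespace are ignored, as described above.
        point = [0] * len(ALPHABET)
        for ch in text.lower():
            pos = ALPHABET.find(ch)
            if pos >= 0:
                point[pos] += 1
        return point

    assert text_to_point("aac")[:3] == [2, 0, 1]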

It follows directly from the definition that close texts map to close points: a single character edit (an insertion, deletion, or substitution) changes at most two coordinates, each by one, so k edits move the point by at most 2k.
The converse also turns out to hold fairly accurately, at least for small distances, which is our case.

After each text has been assigned a point, we build a spatial index on these points (I use the rtree module for Python). The index supports fast queries of the form “find all points that fall into a given multidimensional box”.
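
A sketch of how such an index can be built with rtree, assuming the point lists from the sketch above (the helper name is mine):

    from rtree import index

    def build_index(points):
        # rtree stores boxes in interleaved form
        # (min_1, ..., min_d, max_1, ..., max_d); a point is stored
        # as a degenerate box with min == max.
        prop = index.Property()
        prop.dimension = len(points[0])
        idx = index.Index(properties=prop)
        for i, point in enumerate(points):
            idx.insert(i, tuple(point) * 2)
        return idx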

Around each point we take a cube with side 4 (~ 2 × the number of allowed edits) and look at all the points that fall into it. The search for edited texts can then be restricted to the texts within this cube.
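
Putting it together, a sketch of the candidate search (parameter values follow the article; edit_distance and build_index refer to the sketches above):

    def find_near_duplicate_pairs(texts, points, idx, side=4, max_edits=2):
        # Query the cube of the given side around each point and verify
        # the few candidates inside it with the slow edit distance.
        half = side / 2.0
        pairs = set()
        for i, point in enumerate(points):
            box = tuple(c - half for c in point) + tuple(c + half for c in point)
            for j in idx.intersection(box):
                if i < j and edit_distance(texts[i], texts[j]) <= max_edits:
                    pairs.add((i, j))
        return pairs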

Small optimization


The number of candidate pairs tested with the edit distance can be reduced further by increasing the dimension of the space. A natural way to add dimensions is to introduce n-grams as coordinates. On the other hand, increasing the dimension degrades the performance of spatial indexes, so the task of identifying “useful” n-grams naturally arises.

However, even a simple addition of a small number of coordinates of the form
hash(bigram) % p
(the remainder of the bigram's hash divided by a natural number p) gives a strong decrease in the number of points per cube. (The choice of parameter depends on the properties of the text database and the implementation of the hash function, but on my data set p = 3 halves the number of pairs tested.)
Not all bigrams are equally useful. It turns out that increasing the dimension (the modulus p) does not necessarily reduce the number of pairs tested: when the remainder is taken, some bigrams that are useful on their own merge into one coordinate.
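
A sketch of the extended mapping (p = 3 is the value reported to work on the author's data set; note that Python randomizes string hashes between runs, so fix PYTHONHASHSEED or substitute a deterministic hash if the points must be reproducible):

    def text_to_point_ext(text, p=3):
        # Letter counts plus p extra coordinates: each bigram of
        # consecutive letters is hashed into one of p buckets.
        point = text_to_point(text)
        letters = [ch for ch in text.lower() if ch in ALPHABET]
        extra = [0] * p
        for a, b in zip(letters, letters[1:]):
            extra[hash(a + b) % p] += 1
        return point + extra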

A curious observation


It turns out that dozens of sentences with quite different meanings can sometimes be written with almost identical sets of letters (that is, a large number of texts can fall into one cube with a small side). However, this happens rarely (at least on my data set).

It is hard to believe, but the following two texts are very close geometrically: "JSE MARKET REPORT JSE" and "365 Data Centers Offers Cloud Storage 17 US Markets 25 September 2014".

Source: https://habr.com/ru/post/262063/

