I want to share a simple but effective algorithm for detecting near-duplicate documents. There are many articles on using the shingle algorithm for this purpose, and it is rumored that the big search engines use something very similar themselves. However, everyone acknowledges that shingles are poorly suited to short documents (3-5 sentences), and in my task I had to work with exactly such documents. The usual suggestion is to loop the text to make it longer, but that does not seem like a sound fix to me: the accuracy of duplicate detection would still be low.
So, the description of the algorithm that I used:
First you need to index the collection of texts.
- Transform each text in the collection into an array of its words, after cutting out HTML tags, numbers, and other noise.
- Select the 15 longest words; words shorter than 4 characters are never taken, even if that leaves fewer than 15 words.
- For each selected word, compute a checksum (crc32, for example) and write the sums to the database as rows of the form "checksum id", "document id", "checksum".
- In a second table, store two characteristics of the text itself: its md5 and the number of words that made it into the checksum table. The second field matters for texts where 15 words of sufficient length could not be found; effectively, it is the document's length in words.
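The fingerprinting steps above can be sketched in Python (the project itself is in PHP, so the function name and regular expressions here are only illustrative):

```python
import re
import zlib

def fingerprint(text, max_words=15, min_len=4):
    """Return crc32 checksums of the (up to) 15 longest words in a text.

    HTML tags and digits are stripped first; only distinct words of at
    least `min_len` characters are considered, longest first.
    """
    text = re.sub(r"<[^>]+>", " ", text)            # cut out HTML tags
    words = re.findall(r"[^\W\d_]+", text)          # letters only, no numbers
    distinct = {w.lower() for w in words if len(w) >= min_len}
    longest = sorted(distinct, key=len, reverse=True)[:max_words]
    return {zlib.crc32(w.encode("utf-8")) for w in longest}
```

Returning a set rather than a list drops repeated words, so a word that occurs twice contributes only one checksum.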
I use tables with the following structure:
CREATE TABLE `items_hashes` (
`id` int(11) NOT NULL auto_increment,
`doc_id` int(11) NOT NULL,
`word_hash` int(11) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `items_hashes_summary` (
`doc_id` int(11) NOT NULL,
`full_hash` char(32) NOT NULL,
`length` smallint(11) NOT NULL,
PRIMARY KEY (`doc_id`)
);
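As an illustration of how the fingerprint lands in these two tables, here is a minimal sketch using SQLite in place of MySQL (the DDL is adapted to SQLite's `AUTOINCREMENT` syntax; `index_document` is a made-up helper name):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items_hashes (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    doc_id    INTEGER NOT NULL,
    word_hash INTEGER NOT NULL
);
CREATE TABLE items_hashes_summary (
    doc_id    INTEGER PRIMARY KEY,
    full_hash TEXT NOT NULL,
    length    INTEGER NOT NULL
);
""")

def index_document(conn, doc_id, text, hashes):
    """Store per-word checksums plus the whole-document summary row."""
    conn.executemany(
        "INSERT INTO items_hashes (doc_id, word_hash) VALUES (?, ?)",
        [(doc_id, h) for h in hashes],
    )
    conn.execute(
        "INSERT INTO items_hashes_summary (doc_id, full_hash, length)"
        " VALUES (?, ?, ?)",
        (doc_id, hashlib.md5(text.encode("utf-8")).hexdigest(), len(hashes)),
    )
```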
Checking a text for uniqueness
After all documents are indexed, we can check any text for uniqueness within our collection. For this you need:
- Check for an exact match with any document in the collection using the items_hashes_summary table. If such a match is found, the remaining steps are unnecessary; otherwise we continue.
- For the text being checked, perform the first three steps from the previous list, but do not write the checksums to the database yet.
- Find the documents in the database that share words with the document being analyzed, together with the number of such matches. This can be done with a query like:
SELECT doc_id, COUNT(id) as intersects FROM items_hashes
WHERE (word_hash = %cur_doc_hash1% OR word_hash = %cur_doc_hash2% OR ... )
GROUP BY doc_id
HAVING intersects > 1
(on large collections you should write, for example, HAVING intersects > 5 or even more, because otherwise the subsequent processing can be slow)
- Now compute the "degree of similarity", as a percentage, with each of the found documents, using the formula:
$similarity = ($intersects / $length) * 100;
where $length is the length of the shorter of the two compared documents and $intersects is the number of matching words.
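The candidate search and the similarity formula can be sketched together in Python, with an in-memory dict standing in for the items_hashes table (the function names are illustrative, not from the original project):

```python
def find_candidates(index, doc_hashes, min_intersects=1):
    """index: dict of doc_id -> set of word checksums (stand-in for
    the items_hashes table). Returns {doc_id: match_count} for every
    document sharing more than `min_intersects` checksums -- the same
    filtering the GROUP BY / HAVING query performs in SQL."""
    hits = {}
    for doc_id, hashes in index.items():
        n = len(hashes & doc_hashes)
        if n > min_intersects:
            hits[doc_id] = n
    return hits

def similarity(intersects, len_a, len_b):
    """Degree of similarity in percent, normalised by the shorter document."""
    return intersects / min(len_a, len_b) * 100
```

Normalising by the shorter document means a short text wholly contained in a longer one still scores close to 100%.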
I consider a document non-unique if its "degree of similarity" with any document in the collection exceeds 80%; you may need to tune this threshold for your own tasks.
If your task is to find pairs of similar texts inside an already indexed collection, that works too: just modify the query in the penultimate step so that a document is not matched against itself.
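For that pairwise case, a minimal in-memory sketch (again with a dict standing in for the checksum table, and a hypothetical helper name) could look like this:

```python
from itertools import combinations

def similar_pairs(index, threshold=80.0):
    """index: dict of doc_id -> set of word checksums.
    Yields (doc_a, doc_b, similarity_percent) for every pair whose
    similarity exceeds the threshold; combinations() guarantees a
    document is never compared with itself."""
    for a, b in combinations(sorted(index), 2):
        inter = len(index[a] & index[b])
        shorter = min(len(index[a]), len(index[b]))
        if shorter and inter / shorter * 100 > threshold:
            yield a, b, inter / shorter * 100
```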
It all works quite fast. In any case, there were no problems with a collection of 12 thousand documents on a very weak VPS, and a test with 70 thousand documents in a virtual machine was also very quick.
You can see that it really works here. Take any ad, change something, and try to add it as new: the site should immediately report that a very similar one already exists in the database. If you change the ad so much that it does get added, please don't forget to delete it afterwards ;)
Users still manage to publish ads with the same content, but to do so they have to rewrite almost everything in different words. I have not noticed any false positives so far.
PHP implementation
To make things clearer, I extracted two classes from the project that implement this algorithm and put them on pastebin.com: a class that computes a document's hash (universal and ready to use in any project) and a model class (not quite ready for drop-in use, but I think it will not be hard to adapt).
P.S. This is my first post on Habr, so please don't judge too harshly if something is off.
UPD: Thank you all for the karma; I can now move this post to the "Development" blog.