
Implementing the Shingle Algorithm in Node.js: Finding Fuzzy Duplicates in English Texts

When you work with information, you often run into problems while parsing web pages. One of them is detecting similar pages. A good example of an algorithm for this is the “shingle algorithm for web documents”.

Part of the parsing project was implemented in Node.js, so the algorithm had to be implemented there as well. I did not find any JavaScript implementations or npm packages, so I had to write my own.

All the code here is based on the article above, so the steps of the algorithm are taken from it, with some amendments.

To determine the similarity of two documents, you must:
  1. canonize the text;
  2. split it into shingles;
  3. compute a hash of each shingle with 84 static functions;
  4. take a random sample of 84 checksum values;
  5. compare them and determine the result.

Steps 3 and 4 were quite problematic for me. First, you need to find 84 static hash functions; second, you need a random sample of 84 checksum values. Solutions to the first problem can be found, but the second was not clear to me. If we hash an array of text shingles with 84 functions, the output is a two-dimensional array of size 84×N (where N is the number of shingles in the document). Now we need to walk this 84-element array for each text and compare shingle hashes. We could compare random elements, but that may simply fail to produce matches. If we take the hashes that are minimal by length, then for md5 all hashes have the same length, and comparing by character codes is an extra load. I therefore decided to replace steps 3 and 4 with plain crc32 hashing of the shingles and a sequential comparison.
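For reference, here is a minimal sketch of what the 84-function (min-hash) variant could look like. This is my assumption, not the article's code: the FUNC_COUNT constant, the salting of crc32 with a function index, and the makeMinSignature/compareSignatures names are all mine. Each simulated function keeps the minimum hash over all shingles, and the two 84-value signatures are then compared position by position.

 var crc = require('crc');

 var FUNC_COUNT = 84; // number of simulated hash functions

 // build a min-hash signature: one minimum per simulated function
 var makeMinSignature = function(shingles) {
     var signature = [];
     for (var i = 0; i < FUNC_COUNT; i++) {
         var min = Infinity;
         for (var j = 0; j < shingles.length; j++) {
             // salting with the index i makes crc32 behave like 84 different functions
             var h = crc.crc32(i + ':' + shingles[j]);
             if (h < min) { min = h; }
         }
         signature.push(min);
     }
     return signature;
 };

 // the share of matching positions estimates the similarity of the two texts
 var compareSignatures = function(sig1, sig2) {
     var matches = 0;
     for (var i = 0; i < FUNC_COUNT; i++) {
         if (sig1[i] === sig2[i]) { matches++; }
     }
     return matches / FUNC_COUNT * 100;
 };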
The final algorithm:
  1. canonize the text;
  2. split it into shingles;
  3. compute shingle hashes with crc32;
  4. compare them sequentially and determine the result.

1. Canonization of the text

In my case, canonization consists of:
  1. stripping HTML tags and entities;
  2. trimming excess whitespace at the ends (trim);
  3. removing special characters such as '”', '“', '\n', '\r', ',', '.', ':', '$', '#', '"', '(', ')';
  4. removing unnecessary parts of speech from sentences.

First, you need to prepare helper methods for processing the text.
 var strWordRemove = function(entry) {
     var regex = new RegExp('(^|\\s)' + entry + '(?=\\s|$)', 'g');
     text = text.replace(regex, '');
 };

 var strCharacterRemove = function(entry) {
     var escapeRegExp = function(str) {
         return str.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&");
     };
     var regex = new RegExp(escapeRegExp(entry), 'g');
     text = text.replace(regex, '');
 };

The first removes words from the text, and the second removes special characters. Next comes the processing itself:

 var withoutTagsRegex = /(<([^>]+)>)/ig;
 text = text.replace(withoutTagsRegex, "");
 text = text.trim();
 ['”', '“', "\n", '\r'].forEach(strCharacterRemove);


For Node.js there is an npm package, pos, which can identify parts of speech in text. It works pretty well.
Handling parts of speech using pos:
 var pos = require('pos'); // npm package "pos"

 var words = new pos.Lexer().lex(text);
 var taggedWords = new pos.Tagger().tag(words);
 var removeWords = [];
 var nounWords = [];
 for (var i in taggedWords) {
     var taggedWord = taggedWords[i];
     var word = taggedWord[0];
     var tag = taggedWord[1];
     /*
     JJ   Adjective               big
     JJR  Adjective, comparative  bigger
     JJS  Adjective, superlative  biggest
     CC   Coord. conjunction      and, but, or
     IN   Preposition             of, in, by
     TO   "to"                    to
     UH   Interjection            oh, oops
     DT   Determiner              the, some
     */
     //console.log(word + " /" + tag);
     if (tag === 'NNS') {
         // plural nouns: remember them for singularization later
         nounWords.push(word);
     }
     if (['JJ', 'JJR', 'JJS', 'CC', 'IN', 'TO', 'UH', 'DT'].indexOf(tag) !== -1) {
         removeWords.push(word);
     }
 }
 removeWords.forEach(strWordRemove);
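As a quick illustration of the shapes involved (the tags shown are my expectation for the pos lexicon, so treat them as an assumption rather than guaranteed output):

 var pos = require('pos');
 var tagged = new pos.Tagger().tag(new pos.Lexer().lex('the big dogs'));
 // expected: [ ['the', 'DT'], ['big', 'JJ'], ['dogs', 'NNS'] ]
 // with these tags, 'the' and 'big' go into removeWords and 'dogs' into nounWords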


I decided to remove all the remaining special characters after the parts-of-speech processing.
 [',', '.', ':', '$', '#', '"', '(', ')'].forEach(strCharacterRemove);

Then it remains to reduce all nouns to a single (singular) form, and the canonization unit can be considered ready. It is worth noting that pos tags possessives such as Command's as plural nouns; I decided to skip them.
Reducing nouns to a single form:
 // replace all plural nouns with singular ones
 nounWords.forEach(function(entry) {
     // parent's || Apple's || Smurf's
     if (entry.length > 2 && entry.slice(-2) === "'s") {
         // skip possessives for now; in the future we can try removing them
         return;
     }
     var newOne = '';
     if (entry.length > 3 && entry.slice(-3) === "ies") {
         newOne = entry.slice(0, -3) + 'y';
     } else if (entry.length > 2 && entry.slice(-1) === "s") {
         newOne = entry.slice(0, -1);
     } else {
         return;
     }
     var rexp = new RegExp('(^|\\s)' + entry + '(?=\\s|$)', 'g');
     text = text.replace(rexp, "$1" + newOne);
 });
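With this pass, for example, 'cities' becomes 'city' and 'commands' becomes 'command', while possessives like "Apple's" are left untouched.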


Finally, remove all repeated spaces and pass the text on to the next stage.
 text = text.replace(/ +(?= )/g, '');
 callback(text);

2. Splitting into shingles

This step is simple. We split the text on spaces and build the array of shingles.
 var makeShingles = function(text, callback) {
     // shingleLength — the number of words per shingle — is defined elsewhere
     var words = text.split(' ');
     var shingles = [];
     var wordsLength = words.length;
     while (shingles.length !== (wordsLength - shingleLength + 1)) {
         shingles.push(words.slice(0, shingleLength).join(' '));
         words = words.slice(1);
     }
     callback(shingles);
 };
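A quick usage sketch (the shingleLength value here is mine, for illustration only):

 var shingleLength = 3;

 makeShingles('the quick brown fox jumps over', function(shingles) {
     console.log(shingles);
     // [ 'the quick brown',
     //   'quick brown fox',
     //   'brown fox jumps',
     //   'fox jumps over' ]
 });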

3. Calculating shingle hashes with crc32

Here we walk the array of shingles and hash each string. The outer loop from 0 to 1 is left over from the attempt at hashing with 84 functions; I decided not to clean it up (in case I suddenly come back to that idea).
 var crc = require('crc'); // npm package "crc"

 var hashingShingles = function(shingles, callback) {
     var hashes = [];
     // the outer loop is a leftover from the 84-function idea (n would be 84)
     for (var i = 0, n = 1; i < n; i++) {
         var hashedArr = [];
         for (var j = 0, k = shingles.length; j < k; j++) {
             hashedArr.push(crc.crc32(shingles[j]));
         }
         hashes.push(hashedArr);
     }
     callback(hashes);
 };
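Called on the shingles from the previous step, it returns an array containing a single inner array of numeric crc32 checksums, one per shingle:

 hashingShingles(['the quick brown', 'quick brown fox'], function(hashes) {
     console.log(hashes.length);    // 1 — only one "hash function" is used
     console.log(hashes[0].length); // 2 — one checksum per shingle
 });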

4. Sequential comparison, determination of the result

As an example, I took two news stories that Google News showed as similar. I saved them to JSON files and, for speed, processed them in parallel using the async utility library. Then I found the number of matching shingles and computed the result.

Determining the result for two texts:
 var async = require('async');

 // textCanonization, makeShingles and hashingShingles are the functions built above
 var fileJSON = require('./article1.json');
 var content1 = fileJSON.content;
 var fileJSON2 = require('./article2.json');
 var content2 = fileJSON2.content;

 async.parallel([
     function(callback) {
         textCanonization(content1, function(text) {
             makeShingles(text, function(shingles) {
                 hashingShingles(shingles, function(hashes) {
                     callback(null, hashes);
                 });
             });
         });
     },
     function(callback) {
         textCanonization(content2, function(text) {
             makeShingles(text, function(shingles) {
                 hashingShingles(shingles, function(hashes) {
                     callback(null, hashes);
                 });
             });
         });
     }
 ], function(err, results) {
     var firstHashes = results[0];
     var secondHashes = results[1];

     // count the shingle hashes that are present in both texts
     var compareShingles = function(arr1, arr2) {
         var count = 0;
         arr1[0].forEach(function(item) {
             if (arr2[0].indexOf(item) !== -1) {
                 count++;
             }
         });
         return count * 2 / (arr1[0].length + arr2[0].length) * 100;
     };

     var c = compareShingles(firstHashes, secondHashes);
     console.log(c);
 });


The formula count * 2 / (arr1[0].length + arr2[0].length) * 100 yields the similarity of the two texts as a percentage (it is the Sørensen–Dice coefficient scaled to 100).
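A worked example with made-up numbers: if the first text yields 48 shingles, the second yields 44, and 1 hash occurs in both, the similarity is 1 * 2 / (48 + 44) * 100 ≈ 2.17%.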

Texts for comparison: “FTC says Apple will pay at least $32.5 million over in-app purchases” and “Apple will pay $32.5m to settle the app”. With ten words per shingle, the texts came out 2.16% similar, which is very good.
Open questions: it is still not clear to me in what way the variant with 84 functions is better. I would also like to know some algorithm for choosing the optimal number of words in a shingle (currently it is 10).
The entire source code of the algorithm and a working example can be viewed on github.com.

Source: https://habr.com/ru/post/209826/

