
How to find similar texts and sort them by similarity

There is a simple way to sort a set of texts by similarity to a given text: use the Euclidean distance between the word frequencies of the texts being compared. That description should be enough to make the algorithm clear; a simple implementation can be found here.
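To make the idea concrete, here is a minimal sketch in Python. It is not the implementation linked above. It assumes that words are lowercase runs of Latin letters and apostrophes, and that "frequency" means a word's share of all words in the text; the function names `word_frequencies` and `euclidean_distance` are made up for this illustration.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Map each lowercase word to its share of all words in the text."""
    # Assumption: a "word" is a run of Latin letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two frequency maps.

    A word missing from one of the texts counts as frequency 0 there.
    """
    vocabulary = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in vocabulary)
    )
```

The smaller the distance, the more similar the two texts are under this measure; two texts with identical word proportions are at distance 0. Using relative rather than absolute counts keeps a long book from looking different from a short one merely because of its length.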

Surprisingly, this simple method gives good results. For example, if you are looking for your next book to read, you can supply the text of one or several books you have already read as the search sample. For this repository of 10 books, the book FAIRY TALES By The Brothers Grimm gives the following results:

0.0320757 Repo\THE ADVENTURES OF TOM SAWYER.txt
0.0363329 Repo\A TALE OF TWO CITIES - A STORY OF THE FRENCH REVOLUTION.txt
0.0388528 Repo\ALICES ADVENTURES IN WONDERLAND.txt
0.0440605 Repo\MOBY-DICK or, THE WHALE.txt
0.046679 Repo\THE ADVENTURES OF SHERLOCK HOLMES.txt
0.0472574 Repo\The Iliad of Homer.txt
0.0511793 Repo\The Romance of Lust.txt
0.053746 Repo\PRIDE AND PREJUDICE.txt
0.0543531 Repo\BEOWULF - AN ANGLO-SAXON EPIC POEM.txt
0.0557194 Repo\Frankenstein; or, the Modern Prometheus.txt

As the results show, the fairy-tale-like books were found to be the most similar, and the horror book the least similar.
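A ranking like the one above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`frequencies`, `distance`, `rank_by_similarity`) and the three tiny "documents" are hypothetical stand-ins, and a real run would read the full texts of the books from files instead.

```python
import math
import re
from collections import Counter


def frequencies(text):
    """Relative word frequencies (assumed tokenization: letters and apostrophes)."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero for an empty text
    return {w: c / total for w, c in Counter(words).items()}


def distance(a, b):
    """Euclidean distance between two frequency maps; absent words count as 0."""
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in set(a) | set(b)))


def rank_by_similarity(query_text, documents):
    """Return (distance, name) pairs sorted from most to least similar."""
    query = frequencies(query_text)
    scored = [(distance(query, frequencies(text)), name) for name, text in documents.items()]
    return sorted(scored)


# Hypothetical toy corpus standing in for the book repository.
documents = {
    "wolf.txt": "the wolf ate the goat in the forest",
    "ship.txt": "the ship sailed the sea hunting the whale",
    "ball.txt": "she danced at the ball with the gentleman",
}

for dist, name in rank_by_similarity("the wolf met the girl in the forest", documents):
    print(f"{dist:.7f} {name}")
```

To use several books as the sample, one simple option is to concatenate their texts before computing the frequencies.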

For commercial purposes, such a program can be used to find the most suitable advertisement for a given web page by comparing the text of the page the user is reading with the text of the pages that the existing advertisements lead to.
Another use is to search a resume database using, as the sample, the resume of a candidate who is a good fit for the position but does not want to join, or of an employee who is leaving the company. Finding a replacement for an employee is not such a rare business case. You can also sort a resume database by similarity to a job description.

P.S. By the way, Habr's list of similar articles shows things that are not very similar. Maybe Habr uses this method too?

Source: https://habr.com/ru/post/422407/

