
Searching for texts that do not match the site's topic, and finding similar articles

I run a site with articles on related subjects. The site had two problems: spam messages and duplicate articles, where the duplicates were often not exact copies.

This post is about how I solved these problems.

Given:

The task: get rid of spam and duplicates, and prevent them from appearing again.



Part 1. Classifying articles as spam / not spam


All articles on my site relate to the same subject, and spam/advertising messages, as a rule, differ from them in content. I decided to sort a number of articles by hand and build a classifier on top of them. For classification we compute the cosine similarity between the vector of the article being checked and the vectors of the two groups (spam / not spam); the article is assigned to whichever group it is closer to.

First, articles have to be classified manually. I threw together a form with checkboxes for this and in a few hours collected 650 spam articles and 2000 non-spam ones.

All posts are cleaned of noise words that carry no meaning (particles, interjections, etc.). For this I compiled my own noise-word dictionary from various dictionaries I found on the Internet.

To minimize the number of different forms of the same word, you can use the Porter stemmer. I took a ready-made Java implementation from here. Thank you, edunov.

From the words of the classified articles, cleaned of noise words and run through the stemmer, we build a dictionary.

Words that occur in only one article are removed from the dictionary.
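For illustration, here is a minimal Java sketch of this preprocessing step. The STOP_WORDS set and the stem() placeholder stand in for the noise-word dictionary and the Porter stemmer implementation mentioned above; both are assumptions, not the actual code used.

import java.util.*;

public class Preprocessing {

    // Placeholder noise-word list; the real one was compiled from several dictionaries.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the", "and", "of"));

    // Placeholder for a Porter stemmer; swap in a real implementation.
    static String stem(String word) {
        return word;
    }

    // Turn raw article text into a cleaned, stemmed list of tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}]+")) {
            if (w.isEmpty() || STOP_WORDS.contains(w)) continue;
            tokens.add(stem(w));
        }
        return tokens;
    }

    // Build the dictionary: every stemmed word that occurs in at least two articles.
    static Set<String> buildDictionary(List<List<String>> articles) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> article : articles) {
            for (String w : new HashSet<>(article)) {
                docFreq.merge(w, 1, Integer::sum);
            }
        }
        Set<String> dictionary = new HashSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > 1) dictionary.add(e.getKey());
        }
        return dictionary;
    }
}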

For each manually classified article, we count the number of occurrences of every dictionary word, which gives one count vector per article.
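A small sketch of how such a count vector could be built, assuming the dictionary has been flattened into a word-to-index map (a helper introduced here for illustration, not something from the original):

// Turn one tokenized article into a count vector over the dictionary.
// wordIndex maps every dictionary word to a fixed position in the vector.
static double[] countVector(List<String> tokens, Map<String, Integer> wordIndex) {
    double[] counts = new double[wordIndex.size()];
    for (String w : tokens) {
        Integer i = wordIndex.get(w);
        if (i != null) counts[i]++;   // ignore words that are not in the dictionary
    }
    return counts;
}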


We then increase the weight of informative words in each vector by computing TF-IDF for them. TF-IDF is computed separately for each group (spam / not spam).

First, we compute TF (term frequency) for each article: the ratio of the number of occurrences of a specific word to the total number of words in the article:

$$\mathrm{tf}(t, d) = \frac{n_t}{\sum_k n_k}$$

where $n_t$ is the number of occurrences of the word $t$ in the document, and the denominator is the total number of words in this document.


Next, we compute IDF (inverse document frequency): the inverse of the frequency with which a word occurs in the documents of a group (spam / not spam). IDF reduces the weight of words that are used often throughout a group. It is computed separately for each group:

$$\mathrm{idf}(t, D) = \log\frac{|D|}{|\{d_i \in D : t \in d_i\}|}$$

where $|D|$ is the number of documents in the group and the denominator is the number of documents in the group that contain the word $t$.
We then compute TF-IDF as the product of the two:

$$\operatorname{tf\text{-}idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
Words with a high frequency within a specific document and a low frequency of use across the other documents get the highest TF-IDF weight.
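Below is a rough Java sketch of this weighting, applied to the count vectors of one group (spam or not spam); the method name and array layout are assumptions made for illustration:

// TF-IDF weighting for one group. docs holds the count vectors of all
// training articles in that group, one row per article.
static double[][] tfIdf(double[][] docs) {
    int nDocs = docs.length;
    int nWords = docs[0].length;

    // Document frequency: in how many documents of the group each word occurs.
    double[] df = new double[nWords];
    for (double[] doc : docs)
        for (int w = 0; w < nWords; w++)
            if (doc[w] > 0) df[w]++;

    double[][] weighted = new double[nDocs][nWords];
    for (int d = 0; d < nDocs; d++) {
        double totalWords = 0;
        for (double c : docs[d]) totalWords += c;
        for (int w = 0; w < nWords; w++) {
            double tf = totalWords == 0 ? 0 : docs[d][w] / totalWords;
            double idf = df[w] == 0 ? 0 : Math.log(nDocs / df[w]);
            weighted[d][w] = tf * idf;
        }
    }
    return weighted;
}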


For each group we form a single vector by taking the arithmetic mean of each component over the group.
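A possible implementation of this averaging step, again only a sketch:

// Average the TF-IDF vectors of one group into a single representative vector.
static double[] groupCentroid(double[][] tfIdfVectors) {
    double[] centroid = new double[tfIdfVectors[0].length];
    for (double[] v : tfIdfVectors)
        for (int i = 0; i < v.length; i++)
            centroid[i] += v[i];
    for (int i = 0; i < centroid.length; i++)
        centroid[i] /= tfIdfVectors.length;
    return centroid;
}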


To check an article for spam, we count the occurrences of dictionary words in it and compute TF-IDF for those words twice: once with the spam weights and once with the non-spam weights. This produces two vectors (one in the non-spam TF-IDF space, one in the spam TF-IDF space). For the IDF we take the values computed on the manually classified sample.


For each of them we compute the cosine similarity with the average spam and non-spam vectors, respectively. Cosine similarity is a measure of how similar two vectors are. The dot product of two vectors and the cosine of the angle θ between them are related as follows:

$$A \cdot B = \|A\|\,\|B\|\cos\theta$$

so, given two vectors A and B, the cosine similarity is

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$


Java code
public static double cosine(double[] a, double[] b) {
    double dotProd = 0;
    double sqA = 0;
    double sqB = 0;
    for (int i = 0; i < a.length; i++) {
        dotProd += a[i] * b[i];
        sqA += a[i] * a[i];
        sqB += b[i] * b[i];
    }
    // cos(theta) = (A . B) / (|A| * |B|)
    return dotProd / (Math.sqrt(sqA) * Math.sqrt(sqB));
}


The result ranges from -1 to 1: -1 means the vectors point in exactly opposite directions, 1 means they are identical.

The article is assigned to whichever of the two values (spam or not spam) is closer to 1.
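Putting it together, a hedged sketch of the whole check might look like this. It reuses the cosine() method above and assumes the IDF arrays and group centroids come from the training step; all names here are illustrative, not from the original code.

// Weight the checked article's counts with each group's IDF, then assign it to
// whichever centroid gives the higher cosine similarity.
static boolean isSpam(double[] counts, double[] spamIdf, double[] hamIdf,
                      double[] spamCentroid, double[] hamCentroid) {
    double total = 0;
    for (double c : counts) total += c;

    double[] spamVec = new double[counts.length];
    double[] hamVec = new double[counts.length];
    for (int i = 0; i < counts.length; i++) {
        double tf = total == 0 ? 0 : counts[i] / total;
        spamVec[i] = tf * spamIdf[i];
        hamVec[i] = tf * hamIdf[i];
    }
    return cosine(spamVec, spamCentroid) > cosine(hamVec, hamCentroid);
}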

For one checked article, the similarity to the non-spam vector came out at 0.87 and to the spam vector at 0.62, so we conclude that the article is not spam.

For testing, I manually labelled another 700 entries. The accuracy was 98%, and the misclassified articles were ones that even I found hard to assign to either category.


Part 2. Search for fuzzy duplicates


After clearing spam, 118,000 articles remain.

To search for duplicates, I decided to use an inverted index. The index is a data structure that records, for each word, which documents contain it. Given a set of words, it is then easy to query for the documents containing those words.

An example from Wikipedia. Suppose we have the following documents:

T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

For these documents, we construct an index, indicating which document contains the word:
 "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} 

All three words "what", "is" and "it" occur together only in documents 0 and 1:
 {0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1} 
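For illustration, here is a minimal in-memory version of such an index in Java. The real index in this post lives in MySQL, so this is only a sketch of the idea:

import java.util.*;

// For each word, the set of document ids that contain it, plus an AND-query.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    void add(int docId, Collection<String> words) {
        for (String w : words)
            index.computeIfAbsent(w, k -> new HashSet<>()).add(docId);
    }

    // Document ids containing one word.
    Set<Integer> postings(String w) {
        return index.getOrDefault(w, Collections.emptySet());
    }

    // Documents that contain *all* of the query words (set intersection).
    Set<Integer> queryAll(Collection<String> words) {
        Set<Integer> result = null;
        for (String w : words) {
            if (result == null) result = new HashSet<>(postings(w));
            else result.retainAll(postings(w));
        }
        return result == null ? Collections.emptySet() : result;
    }
}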

I saved the index in the MySQL database. Indexing took about 2 days.

To find similar articles it is not necessary for all words to match, so when querying the index we take every article in which at least 75% of the words coincide. The result set contains near-duplicates, but also other articles that simply happen to include 75% of the original's words, for example very long articles.

To weed out what is not a duplicate, we compute the cosine similarity between the original and each candidate. Anything that matches above 75% is considered a duplicate. The 75% threshold was derived empirically; it is enough to catch fairly heavily modified versions of the same article.
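A sketch of that candidate search on top of the toy index above: count, for every document, how many of the source article's words it contains and keep the documents that reach the threshold. The postings() accessor is the one defined in the sketch above; the cosine check then confirms which candidates really are duplicates.

// Documents that contain at least `threshold` (e.g. 0.75) of the source article's words.
static Set<Integer> candidates(InvertedIndex index, Set<String> sourceWords, double threshold) {
    Map<Integer, Integer> hits = new HashMap<>();
    for (String w : sourceWords)
        for (int docId : index.postings(w))
            hits.merge(docId, 1, Integer::sum);

    int required = (int) Math.ceil(sourceWords.size() * threshold);
    Set<Integer> result = new HashSet<>();
    for (Map.Entry<Integer, Integer> e : hits.entrySet())
        if (e.getValue() >= required) result.add(e.getKey());
    return result;
}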

An example of duplicates found:
Option 1:
Cheesecake with condensed milk (without baking)

Ingredients:
* 450 g sour cream (20% fat or more)
* 300 g cookies

* 100 grams of melted butter
* 300 grams of good condensed milk
* 10 grams of instant gelatin
* 3/4 cup cold boiled water

Cooking:
Line a springform pan with parchment paper.
Crush the cookies into fine crumbs and mix with the melted butter. The mixture should be neither oily nor dry. Press the mixture into the pan and pack it down well. Put in the fridge for 30 minutes.
Mix the sour cream with the condensed milk.
Pour 3/4 cup of water over the gelatin and leave it for 10 minutes to swell. Then melt it in a water bath so that it dissolves.
Gently pour the gelatin into the condensed milk and sour cream mixture, stirring well. Pour it into the pan and put it in the fridge until it is completely set, about 2-3 hours.
Once set, you can top it with fresh cranberries or sprinkle with grated chocolate or nuts.

Enjoy your meal!

Option 2:
"Cheesecake" without baking with condensed milk

Ingredients:

- Sour cream (20% fat or more), 450 g

- 300 g cookies (I used oatmeal)
- 100 g melted butter
- 300 g of high-quality condensed milk
- 10 g instant gelatin
- 3/4 cup cold boiled water

Cooking:

1) Line the pan (preferably springform) with parchment paper.
2) Crush the cookies into fine crumbs and mix with the melted butter. The mass should be neither oily nor too dry. Put the mass into the pan and tamp it down well. Put in the fridge for half an hour.
3) Mix the sour cream with the condensed milk (you can add more or less condensed milk, it is a matter of taste).
4) Pour 3/4 cup of water over the gelatin and leave it for 10 minutes to swell. Then melt it in a water bath so that it dissolves completely.
5) Slowly add the gelatin to the sour cream and condensed milk mass, stirring well. Pour it into the pan and put it in the fridge until fully set. Mine set quickly, in about 2 hours.
I added fresh cranberries once it had set, which gave it a pleasant sourness. You can also sprinkle it with grated chocolate or nuts.

Full cleaning took about 5 hours.

To keep the duplicate index up to date there is no need to rebuild it over the existing data; it is updated incrementally.
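As an illustration of what an incremental update might look like with a MySQL-backed index, here is a hedged JDBC sketch. The table name and columns (inverted_index(word, article_id)) are assumptions, not the actual schema used.

// Add one new article to the MySQL-backed inverted index incrementally.
static void indexArticle(java.sql.Connection conn, long articleId, Set<String> words)
        throws java.sql.SQLException {
    String sql = "INSERT IGNORE INTO inverted_index (word, article_id) VALUES (?, ?)";
    try (java.sql.PreparedStatement ps = conn.prepareStatement(sql)) {
        for (String w : words) {
            ps.setString(1, w);
            ps.setLong(2, articleId);
            ps.addBatch();
        }
        ps.executeBatch();
    }
}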

Because of the limited RAM on the server, I cannot keep the index in memory, so I keep it in MySQL alongside the articles themselves. Checking one article without loading everything into memory takes the program about 9 seconds. That is slow, but since only a few dozen new articles appear per day, I decided not to worry about speed until it becomes necessary.

Source: https://habr.com/ru/post/252265/

