Non-visual methods of protecting a site from spam. Part 1. Statistics

Part 1. What statistics say

Non-visual methods of protecting a site from spam rely on automatic analysis of the data received from the visitor. The more data is analyzed, the more completely and accurately the visitor can be characterized and the more reliable the decision on whether or not they are a spammer.

Systems that analyze such data usually accumulate statistics on visitor data and on the decisions made. Here is a brief overview of the statistics we have accumulated (we are CleanTalk, a service that protects sites from spam).

I deliberately leave out the analysis of IP addresses against blacklists. Even without it, plenty of data can be obtained by analyzing only the contents of the form fields and the HTTP headers.

I will review the data on the message text, the nickname and the email address, as well as the HTTP headers and the results of the JavaScript test.

Analysis by these indicators is algorithmically very simple and undemanding of resources, so it can be run before other, more resource-intensive checks.

The data reflects the actual picture at the time of writing and is based on an analysis of our current traffic (more than 2,000,000 requests per day). You are free to use it when analyzing visitors to your own sites. Note that a decision based on any single criterion alone is unreliable; the best result is achieved by analyzing the criteria together.

1. Message text

The message text is, of course, the main thing in spam. Because spammers build their messages for a specific purpose, those messages differ clearly from ordinary ones by several criteria.

The table shows the statistics that are, in my view, the most informative.

Message text parameters (average values)	Not spam	Spam
Number of links, pcs	1.47	4.27
Number of contacts (phone, e-mail), pcs	1.72	6.38
Form filling time, s	177	8
Ratio of message length to filling time, characters/s	23.81	308.54

The number of links speaks for itself. The amount of contact information can also point to spam. The biggest difference is in the time taken to fill in the form and, consequently, in the rate at which the message is typed.
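A minimal sketch of how the four message indicators could be computed. The regular expressions, the function name `message_features` and the `fill_seconds` parameter are assumptions made for illustration; they are not taken from the CleanTalk implementation, and real traffic would need broader patterns.

```python
import re

# Hypothetical patterns for illustration only.
LINK_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def message_features(text: str, fill_seconds: float) -> dict:
    """Compute the message indicators listed in the table."""
    links = len(LINK_RE.findall(text))
    contacts = len(EMAIL_RE.findall(text)) + len(PHONE_RE.findall(text))
    # Guard against a zero or missing timer value.
    if fill_seconds > 0:
        chars_per_second = len(text) / fill_seconds
    else:
        chars_per_second = float("inf")
    return {
        "links": links,
        "contacts": contacts,
        "fill_seconds": fill_seconds,
        "chars_per_second": chars_per_second,
    }
```

The resulting values can then be compared with the averages in the table, for example a typing rate far above the non-spam average of 23.81 characters/s.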

2. Visitor Nickname

The nickname can also say a lot. The likely reason is the quality of the nickname-generation algorithms that spammers use.

Nickname parameters (average values)	Not spam	Spam
Length, characters	7.40	16.52
Number of separator characters, pcs	1.89	3.80
Number of digits, pcs	3.29	7.59
The length of a continuous sequence of consonant letters (for Latin), characters	3.61	5.90

One of the spammer's tasks is to avoid the error saying that a user with this nickname already exists on the site. Judging by the statistics, uniqueness is currently achieved by brute force: through length and the insertion of separators and digits. As a result, there are many nicknames with long runs of adjacent vowels and consonants, with consonants predominating.
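The nickname indicators from the table can be sketched as follows. What counts as a separator (here, any non-alphanumeric character) and the treatment of "y" as a vowel are assumptions of this sketch; the article does not define either.

```python
# Latin consonants; "y" is treated as a vowel here (an assumption).
CONSONANTS = set("bcdfghjklmnpqrstvwxz")


def longest_consonant_run(text: str) -> int:
    """Length of the longest unbroken run of Latin consonants."""
    longest = current = 0
    for ch in text.lower():
        if ch in CONSONANTS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def nickname_features(nick: str) -> dict:
    """Compute the nickname indicators listed in the table."""
    return {
        "length": len(nick),
        # Assumption: every non-alphanumeric character is a separator.
        "separators": sum(1 for ch in nick if not ch.isalnum()),
        "digits": sum(1 for ch in nick if ch.isdigit()),
        "max_consonant_run": longest_consonant_run(nick),
    }
```

For example, `nickname_features("xkrtb_93.qwz")` reports two separators, two digits and a consonant run of five.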

3. Name in email

Everything said about nicknames also holds for the name part of the email address.

Parameters of the name in the e-mail (average values)	Not spam	Spam
Length, characters	10.09	19.16
Number of separator characters, pcs	1.62	4.12
Number of digits, pcs	4.30	9.57

Note that dots are often used as separators: a string of characters is generated, dots are then inserted into it at random positions, and many mailbox names are obtained from a single string.
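The same counts can be taken for the name part of an address. In this sketch the function name `email_name_features` and the choice to report dots separately are assumptions; the name is taken as everything before the last "@".

```python
def email_name_features(address: str) -> dict:
    """Compute the indicators for the name part of an email address."""
    # Everything before the last "@"; the whole string if "@" is absent.
    name = address.rsplit("@", 1)[0]
    return {
        "length": len(name),
        # Assumption: every non-alphanumeric character is a separator.
        "separators": sum(1 for ch in name if not ch.isalnum()),
        "dots": name.count("."),
        "digits": sum(1 for ch in name if ch.isdigit()),
    }
```

Reporting dots on their own makes it easy to see the randomly inserted dots described above.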

4. HTTP headers

Spam bots fake their headers so that they do not differ much from those of real browsers.
However, as the statistics show, this often holds only at the time the bot is written. Afterwards the bot keeps working and sending obviously outdated headers, as the table below shows.

Share of requests with the given User-Agent header	Not spam	Spam
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)	0.01%	11.42%
Opera/9.80 (Windows NT 6.2; Win64; x64) Presto/2.12.388 Version/12.17	0.01%	10.84%

Ready-made spam tools can also leave their own headers, in particular when an HTTP proxy is used. This, too, is reflected in our statistics.

Share of requests with the given Via header	Not spam	Spam
Mikrotik HttpProxy	0.86%	33.07%
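A header check along these lines could look like the sketch below. The User-Agent values (in their standard spelling) and the Via token come from the tables; the function name `header_flags`, the exact-match rule for User-Agent and the substring rule for Via are assumptions.

```python
# Values taken from the tables; matching rules are assumptions.
STALE_USER_AGENTS = {
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "Opera/9.80 (Windows NT 6.2; Win64; x64) Presto/2.12.388 Version/12.17",
}
SUSPICIOUS_VIA_TOKENS = ("mikrotik httpproxy",)


def header_flags(headers: dict) -> dict:
    """Flag requests whose headers match the spam-heavy values."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").strip()
    via = normalized.get("via", "").lower()
    return {
        "stale_user_agent": user_agent in STALE_USER_AGENTS,
        "suspicious_via": any(tok in via for tok in SUSPICIOUS_VIA_TOKENS),
    }
```

Since a small share of legitimate requests carries the same headers, these flags are better treated as signals than as verdicts.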

5. JavaScript test

A JavaScript test can serve as an additional simple but very effective check. For example, JS code on the page changes the value of a specific cookie; there are many possible variants.

The most advanced (and expensive) bots pass JS tests. However, as the statistics show, a large share of spam comes from very simple programs that cannot.

Share of requests failing the JS test	Not spam	Spam
Cookie changed via JS	0.41%	68.53%
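One possible variant of such a test, sketched on the server side. The article does not describe how the cookie is generated, so everything here is an assumption: the cookie name `js_check`, the secret key, and the use of an HMAC of a session identifier as the expected value.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret
COOKIE_NAME = "js_check"   # hypothetical cookie name


def expected_token(session_id: str) -> str:
    """Value the cookie must hold for this session."""
    digest = hmac.new(SECRET_KEY, session_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def js_snippet(session_id: str) -> str:
    """JavaScript to embed in the page; only a client that runs it sets the cookie."""
    token = expected_token(session_id)
    return f'document.cookie = "{COOKIE_NAME}={token}; path=/";'


def passed_js_test(session_id: str, cookies: dict) -> bool:
    """True if the submitted cookies contain the expected value."""
    received = cookies.get(COOKIE_NAME, "").encode("utf-8")
    expected = expected_token(session_id).encode("utf-8")
    # Constant-time comparison of the two byte strings.
    return hmac.compare_digest(received, expected)
```

A bot that parses the page could still extract the value without running JavaScript, so in practice the snippet would usually compute the value in a less direct way.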

6. Conclusion

I have shown the statistics our system has accumulated so far. To repeat: for the most accurate spam / not spam decision, these indicators should be analyzed together, and also in combination with other spam-checking methods.
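As an illustration of combining indicators, the sketch below adds up log-likelihood ratios for three binary flags, using the percentages from the tables. This is a naive Bayes style combination chosen for the example, not the method used by our system: it assumes the flags are independent, and the prior `prior_spam` and the flag names are hypothetical.

```python
import math

# (share among spam, share among non-spam), from the tables in this article.
RATES = {
    "user_agent_msie6": (0.1142, 0.0001),
    "via_mikrotik": (0.3307, 0.0086),
    "js_test_failed": (0.6853, 0.0041),
}


def spam_log_odds(flags: dict, prior_spam: float = 0.5) -> float:
    """Sum the log-likelihood ratios, assuming independent flags."""
    score = math.log(prior_spam / (1 - prior_spam))
    for name, (p_spam, p_ham) in RATES.items():
        if flags.get(name, False):
            score += math.log(p_spam / p_ham)
        else:
            # An absent flag is weak evidence in favor of "not spam".
            score += math.log((1 - p_spam) / (1 - p_ham))
    return score


def spam_probability(flags: dict, prior_spam: float = 0.5) -> float:
    """Convert the log-odds into a probability."""
    return 1 / (1 + math.exp(-spam_log_odds(flags, prior_spam)))
```

With these rates, a failed JS test alone pushes the estimate well above one half, while a request with no flags set stays below it.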

Source: https://habr.com/ru/post/282586/
