Below is a summary of the main, widely used methods of countering comment spam (and other forms of automatically submitted unwanted messages), including some of the techniques I use when developing my own Web applications.
1.
"Martians" - POST request without first reading the page.Comments are usually sent by a POST request. It is assumed that the visitor first visited the page itself, and then leaves a comment - sends the request.
Some spam bots are not smart enough to bother loading the page first and keeping the whole pile of associated data, such as cookies.
No Referer header. Weak protection, because headers are easy to fake. Still, if the header is missing, you can reject the comment attempt with a clear conscience.
The same applies to the possible analysis of the User-Agent field.
No "candy". "Candy" - cookies - is used to store, for example, a sign that the visitor is authenticated on the site, or at least the mere fact of having visited it. If a POST arrives but there is no correct cookie, we reject the attempt to comment.
Invalid session ID. A simple solution: embed a special value in each page and compare it, by a known scheme, with the value stored on the server side. If the correct value is missing, it is either an attempt to re-send a comment or an attempt to send one with a bot; we reject it.
Client IP changed between the page request and the form submission. There are still providers whose clients' assigned IP addresses change from time to time - rarely, but it happens. In other cases this more likely indicates the use of a proxy (see also below) and can be treated as suspicious. A sketch combining these checks is given after this list.
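A minimal sketch of the "Martian" checks above, framework-agnostic: headers, cookies and the client IP are passed in as plain values, issued tokens live in an in-memory dict, and the cookie name "visited" is an assumption - a real application would use its own session storage and naming.

```python
import secrets
import time

SESSIONS = {}  # token -> (issued_at, client_ip); illustrative in-memory storage

def issue_form_token(client_ip: str) -> str:
    """Called when the page with the comment form is rendered."""
    token = secrets.token_urlsafe(16)
    SESSIONS[token] = (time.time(), client_ip)
    return token  # embed it in a hidden form field and/or a cookie

def looks_like_martian(headers: dict, cookies: dict, client_ip: str,
                       form_token: str) -> bool:
    """Return True if the POST looks like it skipped the page-load step."""
    if not headers.get("Referer"):           # no Referer at all
        return True
    if "visited" not in cookies:             # the "candy" set on page view
        return True
    issued = SESSIONS.pop(form_token, None)  # unknown or reused token
    if issued is None:
        return True
    _, ip_at_page_load = issued
    if ip_at_page_load != client_ip:         # IP changed between GET and POST
        return True
    return False
```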
2.
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), essentially a Turing test
The most popular scheme today; many consider it the best tool. However, one should not forget that no Turing test, however sophisticated, will hold up against a human spammer. Of course, spammers do not sit down to solve Captchas themselves: they hire people who want to earn quick money, at five cents per Captcha. It has long been a business.
If you have Captcha everywhere and it does not yield to a simple automated solution, rest assured: human spammers will come to you.
In addition, the more sophisticated the Captcha, the more likely it is to scare away normal, human visitors.
Comment verification as a variant of Captcha. The well-known "challenge-response" method, used among other things against email spam: ask for an email address and send a message with an activation link that must be followed before the comment is published.
Downsides: the visitor will not like it much and may never return to your site. Also, if the link requires nothing more than being opened, it is not hard for a spammer to set up a mailbox that "opens" such links automatically. You can put decoy links in the message (a spam bot will follow them and add itself to the blacklist), but legitimate visitors will stumble onto them too - humans make mistakes.
3.
Names of form fields
Most popular site engines "sin" by using standard field names. This makes life easier for bot authors: submit the form with known field names and the trick is done. There are several solutions (a combined sketch follows the list).
Rename the fields. This works in the vast majority of cases: bot authors rarely bother keeping their creations up to date, and the overwhelming majority of blog owners do not try to protect themselves this way. Writing modules for this is not always feasible, and very few ready-made ones exist.
Add decoy fields. A better solution: add several fields of similar types with similarly structured names. A machine will either fail to fill them all or fill in the wrong ones (especially if there are many). Hiding fields with CSS is easy, and bots (for now) are not smart enough to analyze the styles and work out which fields a human visitor can actually see.
Change the field names on every page generation and display the fields in a different order each time. Best used together with the previous measure: automata will be confused about what to fill in and will most likely give themselves away.
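A sketch of decoy ("honeypot") fields combined with per-page field names. Names are derived from a secret key and a per-page nonce with HMAC, so the server can recognise its own names without storing them; the decoy field is rendered but hidden with CSS and must come back empty. The secret, the "f_" prefix and the field roles are illustrative.

```python
import hashlib
import hmac
import secrets

SECRET = b"replace-with-a-real-secret"

def field_name(nonce: str, role: str) -> str:
    """Deterministic, per-page name for a logical field ('author', 'body', ...)."""
    digest = hmac.new(SECRET, f"{nonce}:{role}".encode(), hashlib.sha256)
    return "f_" + digest.hexdigest()[:12]

def render_form_names():
    """Generate a nonce and the field names to use when rendering the form."""
    nonce = secrets.token_hex(8)
    return nonce, {
        "author": field_name(nonce, "author"),
        "body": field_name(nonce, "body"),
        "trap": field_name(nonce, "trap"),   # render hidden via CSS, leave empty
    }

def extract_comment(nonce: str, post_data: dict):
    """Return (author, body), or None if the decoy was filled or names are wrong."""
    if post_data.get(field_name(nonce, "trap")):   # a bot filled the hidden field
        return None
    author = post_data.get(field_name(nonce, "author"))
    body = post_data.get(field_name(nonce, "body"))
    if author is None or body is None:             # names did not match this page
        return None
    return author, body
```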
4.
Timing
A person cannot submit a filled-in form a couple of milliseconds after receiving the page. Bots can, and they do not always bother waiting after the page loads. This makes it possible to check whether there is a human on the other side (a sketch of these checks closes this section).
Wait a few seconds after the page loads. If the filled-in form arrives faster than in N seconds, reject it: give the person time to think, read and compose - let N be no less than, say, a minute when commenting on the main text and at least 15 seconds when replying to another comment.
Penalty for incorrect data. If a visitor submits a comment with incorrect parameters, impose a small penalty - a time delay before the next attempt.
Limit the number of comments. Do not allow more than M comments per unit of time from a given IP. A fairly common and reasonable measure against automated comment senders.
Limit the lifetime of the session. This must be approached wisely: it may partly protect against Captcha being outsourced to hired solvers, but it should not cause serious inconvenience to legitimate visitors.
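A sketch of the timing checks: reject forms that come back faster than a minimum "reading time" and cap the number of comments per IP per hour. The thresholds and the in-memory storage are illustrative, not prescriptive.

```python
import time
from collections import defaultdict, deque

MIN_SECONDS_MAIN = 60    # replying to the article itself
MIN_SECONDS_REPLY = 15   # replying to another comment
MAX_PER_HOUR = 5         # the "M comments per unit of time" limit

_recent_by_ip = defaultdict(deque)   # ip -> timestamps of accepted comments

def too_fast(page_rendered_at: float, is_reply: bool) -> bool:
    """True if the form came back sooner than a human could plausibly write it."""
    minimum = MIN_SECONDS_REPLY if is_reply else MIN_SECONDS_MAIN
    return time.time() - page_rendered_at < minimum

def over_limit(ip: str) -> bool:
    """True if this IP has already posted MAX_PER_HOUR comments in the last hour."""
    now = time.time()
    window = _recent_by_ip[ip]
    while window and now - window[0] > 3600:   # drop entries older than an hour
        window.popleft()
    if len(window) >= MAX_PER_HOUR:
        return True
    window.append(now)
    return False
```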
5.
Suspicious addresses
Public proxies. Although rumors about the omnipotence of proxies as a masking tool are greatly exaggerated, they can still be used once certain addresses get banned. If the proxy is anonymizing and the real address cannot be determined, the only hope is to analyze public proxy lists (see FreeProxy for an example of such a source). The Tor proxy network also belongs here, though it is easier to handle: the full list of exit servers is always available, and access from the Tor network can be denied if desired.
Naturally, a ban on the use of a proxy will also cut off completely legitimate users who, for one reason or another, use a proxy service.
Stop lists. You can start with SpamHaus, which offers, in addition to the well-known DROP list, a simple way to check whether an IP address appears among addresses involved in spam or related activity. With the DROP list everything is clear - block unconditionally - but with the other lists it is not so simple, given false positives and the possibility of ending up on a stop list for nothing.
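A sketch of a stop-list lookup against the Spamhaus ZEN DNSBL: the IPv4 address's octets are reversed and queried as a hostname under zen.spamhaus.org, and a successful resolution means the address is listed. As noted above, treat a hit as a signal rather than an automatic ban, and keep in mind that results can be unreliable through some large shared resolvers.

```python
import socket

def listed_in_spamhaus(ip: str) -> bool:
    """Check an IPv4 address against the Spamhaus ZEN blocklist via DNS."""
    reversed_ip = ".".join(reversed(ip.split(".")))
    query = f"{reversed_ip}.zen.spamhaus.org"
    try:
        socket.gethostbyname(query)   # resolves only for listed addresses
        return True
    except socket.gaierror:
        return False                  # NXDOMAIN: not listed (or lookup failed)
```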
Addresses belonging to hosting providers. If requests come from an address where, for example, many sites are hosted, there is reason to question whether comment submissions from that address are legitimate. If an API is being used, that is one thing; if submission of the comment form is being simulated, that is another - and publication should probably be refused.
True, there are no absolutely reliable ways to make sure that an address must be considered suspicious from this point of view.
6.
Spam filtering services
Worth mentioning here are Akismet and Mollom. Both have limitations when used for free, and both offer paid subscriptions (decide whether you can afford one). Like any system that learns, they are not free from errors, and many reports from different sites may be needed before the same kind of spam is reliably caught everywhere else.
7.
JavaScript
Bots usually do not use JavaScript: interpreting and executing it would take too much time and resources, and a bot has to work efficiently and quickly. The downside is that a legitimate user who partially or fully blocks JS on your pages (a popular measure these days) will find it difficult to comment.
Options in this regard:
Dynamic change of session parameters. Using the techniques collectively known as AJAX, change the essential session parameters immediately after the form loads: field names, key values that require server-side validation.
The disadvantages are obvious: you need a browser compatible with your AJAX implementation and a good connection to the server so that everything works smoothly and does not annoy the visitor. Overall, though, it looks like a promising way to weed out bots.
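A minimal server-side sketch of this idea, with made-up names and in-memory sets instead of real session storage: the page is rendered with a provisional token, a small script (not shown) is expected to exchange it right after load, and only exchanged tokens are accepted with the comment form. A bot that never executes JS never obtains a valid token.

```python
import secrets

_provisional = set()   # tokens embedded in freshly rendered pages
_activated = set()     # tokens handed out via the AJAX exchange

def token_for_page():
    """Generate the token placed in the page at render time."""
    token = secrets.token_urlsafe(16)
    _provisional.add(token)
    return token

def exchange_token(old_token):
    """AJAX endpoint handler: swap the provisional token for a usable one."""
    if old_token not in _provisional:
        return None
    _provisional.discard(old_token)
    new_token = secrets.token_urlsafe(16)
    _activated.add(new_token)
    return new_token

def token_valid_for_post(token):
    """Accept the comment only if the token went through the AJAX exchange."""
    if token in _activated:
        _activated.discard(token)   # single use
        return True
    return False
```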
Encrypting the page or part of it. If the page is obfuscated so that it cannot be brought into readable form by simple means, it becomes nearly impossible to guess the field names (provided they change from request to request, see above) without a full JS implementation.
Broken URLs. This is not so much spam protection as a way to make spam useless. With JavaScript, links stored in a deliberately "broken" form are turned back into genuine links on the client. The source HTML contains no working URLs for search engines to recognize, so the spam achieves nothing.
If URLs in "good" comments can be converted back into regular links (note that this costs you or your assistants time spent reviewing comments), the situation becomes more pleasant for legitimate visitors.
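A tiny illustration of the transform, with made-up markers: links are emitted in a non-working form, and the inverse transform is what a small client-side script (not shown here) would perform for human readers.

```python
def break_url(url: str) -> str:
    """Emit a comment link in a deliberately non-working form."""
    return url.replace("://", "[:]//").replace(".", "[.]")

def restore_url(broken: str) -> str:
    """Inverse transform; on a real site client-side JS does this for readers."""
    return broken.replace("[:]//", "://").replace("[.]", ".")

# break_url("http://example.com/page") -> "http[:]//example[.]com/page"
```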
8.
Brute force
rel="noindex, nofollow" on all links. Not always effective: not every search engine honors these attributes. Moreover, if such attributes are forced onto all comments without exception, many legitimate visitors will lose interest in the site (a sketch of rewriting comment links this way follows the list).
Remove the URL field from the comment form, as is done by default in Drupal, for example. This too will "scare away" some legitimate visitors (many leave their links not to boost a citation index but to let you get to know them, so to speak). In my experience, though, it does not scare spammers off much: the links simply end up in the body of the comment. Even if search engines will not always see such a link, it certainly will not make the site look any better.
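A sketch of forcing rel="nofollow noindex" onto every link in a comment, using the standard-library HTMLParser to re-emit the markup. A production setup would more likely sanitise comment HTML with a dedicated library; treat this as an illustration only.

```python
from html import escape
from html.parser import HTMLParser

class NofollowRewriter(HTMLParser):
    """Re-emit comment HTML, forcing rel="nofollow noindex" on every <a> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, selfclosing=False):
        if tag == "a":
            attrs = [(k, v) for k, v in attrs if k != "rel"]
            attrs.append(("rel", "nofollow noindex"))
        rendered = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs)
        closer = " /" if selfclosing else ""
        self.out.append(f"<{tag}{rendered}{closer}>")

    def handle_starttag(self, tag, attrs):
        self._render(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._render(tag, attrs, selfclosing=True)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def add_nofollow(comment_html: str) -> str:
    rewriter = NofollowRewriter()
    rewriter.feed(comment_html)
    rewriter.close()
    return "".join(rewriter.out)
```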
If the spam is delivered by a person
There are two main situations:
- the spammer uses a service that "solves" obstacles such as Captcha, where people hired for little money do the solving. Simple and effective.
- the spammer hires people and gives them instructions on which sites to visit and which comments to leave. Not necessarily entirely by hand - most often semi-automatically, with software to speed up the process.
It would seem there is only one way out: analyze the links left behind (after all, the comments are written for their sake). There are some promising semantic projects for analyzing such spam entries and the sites they mention, but I cannot say exactly what is already available to the general public. If you know of any such projects, please let me know.
Conclusion
The article lists the main ideas, which have no doubt occurred to many people. There are no revolutionary new ways to counteract automatic and semi-automatic spam; this is the traditional "armor versus projectile" situation.
When creating the next versions of my Web products that need protection against automatically submitted messages (including comments), I plan to use the following methods:
a) stop lists. We block already known spammers before we start analyzing their data; this includes services like Akismet, Mollom, etc.
b) check for "Martians" - all session data must be in place
c) time frame: we do not allow sending a comment too quickly and / or too often
d) (if JS is enabled) use AJAX techniques to change session parameters and make life harder for bots
e) create decoy fields alongside the real ones and give all of them random names; if JS is available, obfuscate the part of the page containing the input fields to complicate things further for bots
In addition:
When an "alarm" goes off - when spam is suspected - we make life a little harder for whoever is trying to resubmit: we add a Captcha and increase the time interval allowed between successive form submissions. For confirmed spam, we add the IP to the blacklist and block message submission from it for some interval, leaving clear instructions for live users on how to have the ban lifted.
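A sketch of this escalation: a per-IP "suspicion" counter that switches on Captcha, stretches the allowed resubmission interval, and finally blacklists the address for a while. The thresholds are illustrative.

```python
import time

class SpamPenalty:
    CAPTCHA_AFTER = 1              # suspicious attempts before Captcha is required
    BLACKLIST_AFTER = 5            # attempts before a temporary IP ban
    BLACKLIST_SECONDS = 24 * 3600  # length of the temporary ban

    def __init__(self):
        self.strikes = {}       # ip -> number of suspicious attempts
        self.banned_until = {}  # ip -> timestamp until which the IP is blocked

    def record_suspicion(self, ip: str) -> None:
        """Register one suspicious attempt and ban the IP if it keeps going."""
        self.strikes[ip] = self.strikes.get(ip, 0) + 1
        if self.strikes[ip] >= self.BLACKLIST_AFTER:
            self.banned_until[ip] = time.time() + self.BLACKLIST_SECONDS

    def is_banned(self, ip: str) -> bool:
        return time.time() < self.banned_until.get(ip, 0)

    def needs_captcha(self, ip: str) -> bool:
        return self.strikes.get(ip, 0) >= self.CAPTCHA_AFTER

    def resubmit_delay(self, ip: str) -> int:
        """Seconds to wait before the next attempt, growing with each strike."""
        return 15 * (2 ** min(self.strikes.get(ip, 0), 6))
```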
If spam is left by a live person, we would use a (so far non-existent) way to efficiently analyze the link and the content of the site it leads to.
And, of course, maintain your own database of spammer IPs and of the external traits of the bots they use, so they can be identified at an early stage - all of the above procedures put a significant load on the server.