In Search of the Perfect CAPTCHA

CAPTCHA, or Completely Automated Public Turing test to tell Computers and Humans Apart, was created to ensure that entered data was not generated by a computer. These tests are commonly used on the Internet to protect registration and comment forms from spam. To be honest, I have mixed feelings about CAPTCHA. It annoys me in most cases, but in spite of that I have used CAPTCHA as protection on some sites.

In this article I want to search for the perfect solution to the problem of the growing amount of spam generated by people. We will look at how and why CAPTCHA is used and how it affects usability, in search of answers to two key questions: what is the ideal CAPTCHA, and why is it preferred as a defense?

The pursuit of humanity

To understand the need for CAPTCHA, we need to understand what motivates spammers to create and use automated input systems. For the purposes of this article, we will treat as spam any undesirable action or input on a site, whether it is harmful, generates income for the spammer, or simply does not fit the purpose and theme of the site. Spammers' goals include:

Mass advertising;
Manipulating online voting systems;
Gaining an unfair advantage;
Vandalism, or destroying the integrity of a site;
Publishing inappropriate links to improve search engine rankings;
Gaining access to personal information;
Distributing malicious code.

All this leads to the creation of profitable situations for spammers. Automating the process obviously leads to superhuman speed and efficiency.

Anyone who runs a site knows that this is a fairly common business and a serious problem. Akismet, a popular spam-catching system (most commonly found as a WordPress plugin), records more than 18 million spam comments per day and has caught 20 billion over its entire history. Mollom, a system built for the same purpose, catches more than a million spam comments per day and has calculated that more than 90% of comments are spam. No request to stop has any effect on spammers, but their greed can play into our hands: using automated systems for profit has its weaknesses.

Discover CAPTCHA

On one side of the coin is a spammer, on the other is the usual site owner, who has encountered some problems:

Blogs and forums drowning under the weight of spam,
Accounts created under false pretenses for illegal purposes,
Bots that destroy the dynamics of the site,
The need to constantly monitor the quality of content and user experience.

Automated spam is a constant worry for site owners, so CAPTCHA looks like an attractive solution ... for the time being. Constantly monitoring user-generated content takes far more time than adding a CAPTCHA to a site, and that is what pushes developers toward CAPTCHA.

It is no secret that CAPTCHA is used almost everywhere. According to reCAPTCHA project statistics, more than 200 million reCAPTCHAs are solved daily, and users spend 10 seconds on each on average. The Drupal CAPTCHA project reports about 100 thousand uses per week, and that covers only part of the sites using this protection (only those that agreed to send reports).

CAPTCHA tackles the problem head-on: its sole purpose is to stop spammers. Real users pass the test in most cases. Ideally, then, this protection does not affect users.

Unfortunately, this is not the case. The problem of CAPTCHA readability is not new. In 2005 the W3C published a report on the inaccessibility of CAPTCHA, which stated that the human success rate in some systems reaches only 90%. A little later, in 2009, Casey Henry drew attention to the impact of CAPTCHA on conversions and suggested that the potential loss of customers is approximately 3%:

Considering the fact that many people consider conversions as a source of income, the loss of 3.2% of customers may affect sales. As for me, it is better to manually sort out spam than to lose part of the profits.

- Casey Henry, CAPTCHAs' Effect on Conversion Rates

In 2010, a team at Stanford University published a study entitled “How Good Are Humans at Solving CAPTCHAs? A Large Scale Evaluation” (PDF), which evaluated the CAPTCHAs used on the largest Internet sites. It was no surprise that the results were very unsatisfactory, but the most striking finding was that people spent 28.4 seconds solving an audio CAPTCHA. The study also looked at the problems faced by people who are not native English speakers.

Web developer Tim Kadlec calls for the death of CAPTCHA, offering a fairly serious argument against using this protection:

Spam is not a problem for users, it is a problem for people who administer the site. This is very arrogant on the part of administrators - to dump such a problem on the shoulders of users of the site.

- Tim Kadlec, Death To CAPTCHAs

Entering a CAPTCHA may seem like a completely trivial task, but research (such as the W3C report mentioned above) shows that this view has little to do with reality. And, as Kadlec asks in his article, what about users with impaired vision, dyslexia, and other conditions that affect sensory functions? For them this is an insurmountable obstacle, and that is simply not fair. After all, it is users who invest in sites and define their purpose.

The question is, is CAPTCHA really so unacceptable to users that it must be abandoned? Perhaps the more important question: is there an easily readable CAPTCHA that cannot be cracked? If the answer is no, then what is the appropriate solution to combat online spam?

CAPTCHA World

The human brain is a terrific tool. Its ability to conceptualize, to find order in chaos, and to adapt makes it incredibly useful. In some tasks it easily leaves computers behind. In others, such as arithmetic, it loses in every respect.
From this we can deduce the basic properties of a successful CAPTCHA. A CAPTCHA should be:

A task that users can solve in any conditions, but which a computer cannot,
A task that users solve in an instant, but which is difficult for a computer,
A task that requires a minimum of data entry,
A task that is easy for all users, including those with disabilities (a CAPTCHA should be no more difficult than regular web surfing).

One of the most noticeable advantages humans have over computers is the ability to recognize visual images and patterns. The most popular CAPTCHAs build on this fact.
Web developers have tried many options: simple image recognition tests, interactive tasks, tic-tac-toe games, and math problems that even mathematicians would struggle with. We will look at the more reasonable ideas in use on the Internet today.

Text recognition

The most popular type of CAPTCHA at the moment is recognizing text, that is, a set of characters (the reCAPTCHA project is a vivid example).

The reCAPTCHA project aims to stop spam and help digitize books.

reCAPTCHA was created at Carnegie Mellon University, the home of the pioneers of CAPTCHA and the originators of the term (in 2000). Now, under the management of Google, the project uses scanned text that OCR systems cannot understand. This, in theory, provides an unbreakable CAPTCHA, which also has another "feature" - help in digitizing books by users.

An example of text that is problematic for OCR; reCAPTCHA uses exactly these “problem” texts.

Those who care most about usability have always been complimentary about reCAPTCHA. Unfortunately, completely incomprehensible or unreadable CAPTCHAs are what you most often find on the web, and asking users to solve an impossible task cannot be good for usability.

The reCAPTCHA project team is making great efforts to provide audio alternatives for people with visual impairments, but many other CAPTCHAs of this type have no such aids. As the Stanford University study noted, an audio CAPTCHA takes a long time to complete. The same study drew attention to the problems with CAPTCHAs built on English words.

Another attempt to improve an ordinary textual CAPTCHA was presented at the end of 2010 by Solve Media, whose solution was to replace the plain text with an advertisement and a related question.

Solve Media claims that their CAPTCHA can be completed much faster than any other. Although most people are skeptical of such marketing talk, this project definitely has potential, especially considering that many global brands are recognizable regardless of the local language.

While the text CAPTCHA has some drawbacks (for example, spammers can use text recognition software to read the image and get past the anti-spam protection), it is still undoubtedly solvable by people. That is a point against those who reject such protection.

Logical tasks

Some people assume that answering a simple logical question may be much more effective and more convenient than performing visual tasks. The idea is that the complexity of natural-language text may well be enough to knock computers out.

TextCAPTCHA has more than 180 million questions, for example:

What is the sixth letter in the word "habrahabr"?
How is the number fifty eight thousand five hundred seventy four written in digits?
Which of the numbers 3, twenty nine, 70, 46, 65 is the smallest?

A person with the intelligence of a seven-year-old child can answer these questions. They are much more accessible than recognizing text or images, but so far that is the only advantage of this method. First, finding the answer may take time, since the questions are unusual and unfamiliar to ordinary users. Second, a computer can still beat this CAPTCHA. Joel Vanhorn reminded everyone of the Wolfram Alpha service, whose artificial intelligence is quite sufficient for solving such problems.
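To illustrate how little it can take for a program to answer a templated question, here is a minimal Python sketch that handles only the "Nth letter in the word" pattern from the examples above. The function name, the regular expression, and the list of ordinals are assumptions made for illustration; this is not how TextCAPTCHA or Wolfram Alpha work, just a hint at why fixed question templates are fragile.

```python
import re
from typing import Optional

# Hypothetical lookup table: only the ordinals this sketch understands.
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
}

# Matches questions shaped like: What is the sixth letter in the word "habrahabr"?
LETTER_QUESTION = re.compile(
    r'what is the (\w+) letter in the word "(\w+)"', re.IGNORECASE
)


def answer_letter_question(question: str) -> Optional[str]:
    """Return the requested letter, or None if the question is not understood."""
    match = LETTER_QUESTION.search(question)
    if match is None:
        return None
    position = ORDINALS.get(match.group(1).lower())
    word = match.group(2)
    if position is None or position > len(word):
        return None
    return word[position - 1]
```

A real attacker would need one such rule per question template, which is tedious but far from impossible when the templates are few.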

IBM Watson recently showed the world a frighteningly human-like ability to process text, and such technology can become ubiquitous faster than we think. But instead of worrying that logical questions might be feasible for computers, we need to use this technology to analyze user data and separate human content from computer-generated content, which in most cases is spam. Services like SBlam! actively develop this idea.

Questions specific to a particular site, such as “What is the name of this site?” and “What color dominates the image above?”, can work better than general questions. On the other hand, the number of such questions will always look insignificant next to the 180 million questions from TextCAPTCHA.

The most noticeable problem with logical questions is that they are not multilingual; usually English is used. Creating a database with hundreds of millions of questions in every language of the world is an impossible task for anyone. When the prospects are so far from ideal, the question arises: is CAPTCHA the right solution at all?

Image recognition

Many people have experimented with images instead of text. The benefit? No problems with legibility. Services like identiPIC ask the user to identify an object in an image. Microsoft also investigated this method in its Asirra project.

Microsoft Asirra

The fact that such CAPTCHAs have not become widespread suggests that this method does not improve usability. In fact, it compromises accessibility. People with vision problems have no chance of passing this CAPTCHA, and adding any text or description would dramatically reduce the effectiveness of the test.

In 2009, Google published a study (by Rich Gossweiler, Maryam Kamvar and Shumeet Baluja) that looked at an alternative form of this type of CAPTCHA. It asked users to correct the orientation of images by rotating them.

An innovative idea, I am sure you will agree. The study showed that this technique is easier for people to pass than the others. Unfortunately, this method fails completely in terms of accessibility (think of people with impaired vision).

Identifying friends

Another really interesting CAPTCHA was presented in January 2011 by the Internet giant Facebook. The company is experimenting with social authentication to verify account ownership. Here is how it describes the experiment:

We will show you some photos of your friends and ask you to name the person who is depicted on them. Hackers on the other side of the planet may know your password, but they may not know your friends.

- Alex Rice, Facebook, A Continued Commitment to Security

Facebook's friend identification test

What makes Facebook's innovation completely different from other solutions is that this CAPTCHA filters out not only machines but also human attackers.
Facebook definitely has a chance of rolling this CAPTCHA out across the web. With a base of 600 million users and millions of sites that integrate Facebook modules, the Internet giant could use friend identification for authentication anywhere. We should also remember that this method is much easier than recognizing text.

There is only one problem. Do you really know who your friends are? It is no secret that users often trade friend requests just to inflate their cherished friend count. When your list is full of people completely unknown to you, you are unlikely to pass this test. No matter how good Facebook's idea is, it is ultimately doomed to failure, because we are people and we break the rules.

Interactive CAPTCHA

One method attracted serious attention from users on the premise that only a person can perform the proposed task. They Make Apps introduced a CAPTCHA in the form of a small slider that must be moved all the way to the right to confirm sending the data. The CAPTCHA tells the user: "Show your humanity, move the slider to the end of the line to create an account."

They Make Apps uses a slider CAPTCHA.

Obviously, this option is not suitable for people with disabilities. Moreover, writing a script that automatically moves the slider to activate the “Send” button should not be difficult. A more advanced version of the slider is used in the comments on the Adafruit blog. Four different sliders must be set to the correct positions before a comment can be posted.

CAPTCHA on the Adafruit blog

CAPTCHA Alternatives

CAPTCHA, at its best, may realize its potential in some other area. As protection for sites, unfortunately, it is far from perfect. It creates difficulties for users and sometimes still fails at its job. Human spammers are at the peak of their success, and we need more sophisticated, invisible methods of protection.

Automatic and manual spam detection

We mentioned several spam detection services at the very beginning of the article. Akismet, Mollom and SBlam! analyze the data received from users and flag spam automatically. Mollom sometimes presents a CAPTCHA, but only when it is not sure. But why not develop your own system, tailored to the needs, requirements and specifics of a particular site?

Taking on the responsibility yourself and lifting the burden from users will improve their opinion of the site and increase their activity. Manual content control is often a sacrifice worth making.

The Honeypot Method

In 2007, Phil Haack described a very clever method for identifying bots: the honeypot. The idea is simple: the form on the site contains an additional field that is hidden from users. Spam bots work directly with the HTML source code, so they cannot tell that the field is hidden. If data has been entered into this field, the site administrator can be practically certain that it did not come from a real user.
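A minimal server-side sketch of the honeypot check described above, in Python. The field name website_url and the function name are hypothetical, the form is assumed to arrive as a plain mapping of field names to submitted values, and the field itself is assumed to be hidden from people with CSS.

```python
from typing import Mapping

# Hypothetical name of the extra field; it is hidden from people with CSS,
# so a real user submits it empty or not at all.
HONEYPOT_FIELD = "website_url"


def looks_like_bot(form: Mapping[str, str]) -> bool:
    """Return True when the hidden honeypot field contains any text."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```

For example, looks_like_bot({"comment": "Nice post", "website_url": ""}) is False, while a submission that fills website_url is rejected. Browser autofill can occasionally fill hidden fields, so a flagged submission is probably better queued for moderation than silently discarded.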

The honeypot method can be more effective if you obfuscate the hiding of the field using JavaScript or data hashing. These methods are not impenetrable, but we can hope for the best.
JavaScript can be used to populate hidden fields dynamically, and a server-side script can then verify them. Scratchmedia uses a similar solution, falling back to CAPTCHA if JavaScript is disabled.
You can also use additional timestamp and session data to detect automated submissions. A recent discussion on Stack Overflow produced a large number of examples and ideas on this topic, including Hashcash, which is available as a WordPress plugin. A tutorial on building such protection with jQuery describes a similar method and includes an interesting idea:

Thieves know that if the house has external lighting, a dog in the yard or other similar means of protection, then it is better not to go into this house. Thieves seek greater revenue with minimal cost and risk.

- Jack Born, Safer Contact Forms Without CAPTCHAs
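The timestamp and session idea mentioned above could be sketched as follows. This is an illustrative Python example, not Hashcash and not the method from the jQuery tutorial: the server puts a signed timestamp into a hidden field when it renders the form and checks it on submission. The secret key, the age limits, and the function names are all assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret
MIN_AGE_SECONDS = 3      # assumed: faster than this is suspicious
MAX_AGE_SECONDS = 3600   # assumed: older tokens are treated as stale


def _sign(session_id: str, issued_at: int) -> str:
    """HMAC-SHA256 over the session id and the issue time."""
    message = f"{session_id}:{issued_at}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_token(session_id: str, now: Optional[float] = None) -> str:
    """Create the value for a hidden form field when the form is rendered."""
    issued_at = int(time.time() if now is None else now)
    return f"{issued_at}:{_sign(session_id, issued_at)}"


def verify_token(token: str, session_id: str, now: Optional[float] = None) -> bool:
    """Accept only untampered tokens of a plausible age for this session."""
    current = time.time() if now is None else now
    try:
        raw_timestamp, signature = token.split(":", 1)
        issued_at = int(raw_timestamp)
    except ValueError:
        return False
    expected = _sign(session_id, issued_at)
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    return MIN_AGE_SECONDS <= current - issued_at <= MAX_AGE_SECONDS
```

Because the token is bound to the session and signed, a bot cannot forge an older timestamp or reuse a token issued to a different session without knowing the secret.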

Centralization of user base

With the gradual "socialization" of the Internet, many sites have begun to let users register and interact with each other. Content is usually published either after registering a full account or anonymously. Both of these methods are an open gate for spam. In 2008, Facebook announced Facebook Connect, a service that gives websites and their users an integrated platform based on the social network. Twitter picked up the baton in 2009 with a similar service, "Sign in with Twitter." Both services can be built into a site quite easily, and with their help you can get rid of registration and comment forms, which are the targets of bots, altogether.

These services became so popular that Janrain appeared.
Janrain provides its own solution, based on the aforementioned Facebook Connect, Sign in with Twitter and their ilk, to make the site accessible from any social network.

Mahalo provides the ability to log in using any social network with Janrain.

Other services, such as Disqus, let users interact through a built-in anti-spam system and built-in authentication.
The almost complete lack of anonymity makes users think twice before posting anything. It is also very effective at preventing spam: ban a single Facebook account, and every site using Facebook Connect is protected from that spammer.

Such services, of course, provoke heated debate about privacy, protection of personal data ... but this is a topic for another article. As an alternative to CAPTCHA, these services have great potential with their availability and usability.

Measuring the time a user spends

Another fairly simple method that does not annoy users at all is to separate bots from users by measuring the time spent filling out the form. By calculating the average time people spend on the form, you can derive certain rules. For example, if the form was filled out in less than five seconds, which is almost impossible for a person, the user is asked to try again. Let me remind you: a spammer prefers easier targets and will leave a site where the attempt to use an automated system failed.
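One way to turn observed fill times into such a rule is sketched below in Python. The five-second floor comes from the example in the text; the 20% fraction and the function names are assumptions chosen only for illustration, and the times are assumed to be measured on the server.

```python
from statistics import mean
from typing import Sequence


def min_plausible_seconds(
    observed: Sequence[float], fraction: float = 0.2, floor: float = 5.0
) -> float:
    """Derive a lower bound from the average time real users needed.

    With no data yet, fall back to the fixed floor of five seconds.
    """
    if not observed:
        return floor
    return max(floor, fraction * mean(observed))


def should_retry(elapsed: float, observed: Sequence[float]) -> bool:
    """True when the form came back faster than a person plausibly could."""
    return elapsed < min_plausible_seconds(observed)
```

Asking for a retry rather than rejecting outright keeps the cost low for the rare person who really is that fast, for example someone relying on browser autofill.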

The ideal CAPTCHA

Judging by years of experience and research, it is safe to say that CAPTCHA is far from an ideal solution. Remove spammers from the equation and the need for CAPTCHA disappears entirely; that is what should guide us. The ideal CAPTCHA is no CAPTCHA at all.

Rise of the people

By its nature CAPTCHA performs only one function: separating people from bots, thereby protecting the site from spam. But it cannot do its job if the spammer is not a bot. The best solution is to remove any incentive to spam. If we can change the trend, turning spam from an extremely profitable business into a purely unprofitable one, it will fade away in all its forms.

One of the many dark arts of SEO is the artificial generation of links to a site under the pretext of optimization. Search engines treat incoming links as a significant indicator of value. Obviously, this is abused by posting such links on a variety of sites (forums and comment forms are ideal for this). The rewards of SEO are so high that automated spam is not the only approach: hiring cheap human labor is quite common. And CAPTCHA was never intended to stop that.

We must accept the need for moderation and for detecting bots in the background with invisible methods. CAPTCHA is the best temporary solution at the moment and the worst one overall. Either fight spam by hand or forget about your users' interests: the choice is yours.

Conclusions

If the site owners work together to eliminate spam, then it will disappear over time, and one fine day the need for a CAPTCHA will disappear by itself. Is this too idealistic? Maybe. In reality, we are more likely to see close cooperation between technology and the law for the destruction of spammers as a species.

Understanding the alternatives (those where spam checking goes unnoticed by the user) and deploying them on sites is a good start. It is a positive step towards better usability and more visitors. If users publish content on your site, reward them with good anti-spam protection:

Moderation wherever possible
Do not allow certain content to be posted on the site at all, or allow it only after account verification. It is best to use services like Facebook Connect or Disqus; that will be easier for both you and your users.
CAPTCHA Alternatives
Try the honeypot or any other method, as long as it is invisible to users.

Considering all the pros and cons of CAPTCHA, it is absolutely clear that the future belongs to technologies the user never notices. For now, CAPTCHA should be a last resort.

Original article: In Search Of The Perfect CAPTCHA, David Bushell, 03/04/2011.
The translation is rather free, but the essence of the author's ideas is preserved. A few minor fragments were left out, each for a specific reason. For example, the fragment about changes to Google's algorithms was removed because it refers to a page on the Google blog that is no longer available (404).
Please do not judge too harshly: this is my first translation of this size.

This text is distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.
You can copy, edit and use this text for non-commercial purposes with the obligatory indication of authorship and preservation of the original license.

Source: https://habr.com/ru/post/120851/
