
How often do you fail to solve a captcha on the first try? Now imagine the difficulties faced by an ordinary person who is not on familiar terms with computers. For such a person, reCAPTCHA is a bigger barrier than it is for a robot.
However, the difficulty people have reading a captcha is not the most serious problem, whatever it may seem at first glance. We could of course turn a blind eye to it if captchas really protected us from automated recognition systems. But this is far from the case!
I want to talk about a tool that solves both problems.
To build adequate captcha protection, we first need to classify the methods used to recognize captchas.
Automated Captcha Recognition
Currently, there are three main approaches to automated captcha recognition:
1. Exploiting errors in the protection algorithm
This approach looks for logical errors (vulnerabilities) that allow the form to be submitted successfully without recognizing the captcha at all.
This is the easiest way to bypass protection, but it mostly works only against unsophisticated homemade solutions.
The most common mistake is to pass the captcha answer through form fields or cookies in plain text, base64-encoded, or as an md5 hash without a salt.
An attacker can easily recover it, even if that means generating a rainbow table over the captcha alphabet (only 5-letter Russian strings, only 6-digit numbers, etc.) and looking the hash up.
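To illustrate why an unsalted hash gives no protection, here is a minimal Python sketch. It assumes a hypothetical form that exposes md5(answer) for a numeric captcha. It builds a plain precomputed lookup table, which is simpler than a true rainbow table but sufficient for such a small answer space. The function name and the sample answer are made up for the example, and 4 digits are used only to keep it fast.

```python
import hashlib
from itertools import product

def build_lookup(alphabet, length):
    """Precompute md5(answer) -> answer for every possible captcha answer."""
    table = {}
    for chars in product(alphabet, repeat=length):
        answer = "".join(chars)
        digest = hashlib.md5(answer.encode("utf-8")).hexdigest()
        table[digest] = answer
    return table

# Hypothetical leaked value: the server placed md5("4071") in a hidden field.
leaked = hashlib.md5(b"4071").hexdigest()

# 10 ** 4 = 10,000 entries here; a 6-digit captcha needs only 10 ** 6.
table = build_lookup("0123456789", 4)
recovered = table[leaked]
```

A per-captcha random salt that never leaves the server makes such precomputation useless, although keeping the answer entirely server-side is simpler still.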
Allowing the same captcha to be solved more than once is also a mistake. This mainly applies when a generated captcha identifier stays valid for 5-10 minutes with no limit on the number of checks. An attacker who already knows the answer can then reuse the identifier of the solved captcha. The attacker can also brute-force answers against the same identifier, which eventually yields the correct one.
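One way to close this hole is to make every identifier single-use. The sketch below is only an illustration under that assumption: the class name CaptchaStore, the in-memory dictionary, and the 300-second lifetime are hypothetical and do not describe any particular product.

```python
import secrets
import time

class CaptchaStore:
    """In-memory store in which each captcha id can be checked exactly once."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._pending = {}  # captcha_id -> (answer, expires_at)

    def issue(self, answer, now=None):
        """Register a new captcha and return its random identifier."""
        now = time.time() if now is None else now
        captcha_id = secrets.token_hex(16)
        self._pending[captcha_id] = (answer, now + self.ttl)
        return captcha_id

    def check(self, captcha_id, guess, now=None):
        """Verify a guess; the id is consumed whether the guess is right or wrong."""
        now = time.time() if now is None else now
        entry = self._pending.pop(captcha_id, None)  # first check removes the id
        if entry is None:
            return False  # unknown id, or already checked once
        answer, expires_at = entry
        if now > expires_at:
            return False  # expired
        return secrets.compare_digest(guess.encode("utf-8"), answer.encode("utf-8"))
```

Because check() removes the identifier on the first call, neither replaying a solved captcha nor brute-forcing answers against one identifier is possible; a wrong guess simply forces a new captcha.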
2. Automatic recognition
There are three main ways to recognize a captcha automatically:
I. Ready-made optical character recognition (OCR) programs
This is the simplest approach and requires no special programming skills. Such programs are mostly distributed free of charge, and there are quite a few of them:
ocropus,
cuneiform,
tesseract,
gocr,
ocrad and others.
The attacker only needs to feed the captcha image to such a program and receives the recognized text as output. These programs typically have many fine-tuning options that make recognition more effective.
To hinder this kind of recognition, captcha authors apply various distortions, warping, added noise, and so on.
In that case the recognition rate may be quite low (only about 10%), but given enough attempts the attacker will still succeed.
II. Custom scripts using GD, ImageMagick and other libraries
Such scripts can clean noise from the image, remove the background, straighten the text, crop the image down to the text, reduce multiple colors to one, average colors, etc. In practice, full recognition with such scripts is very difficult.
It is far more effective to use such a script only for pre-cleaning the image and to leave the actual recognition to other methods.
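As a rough illustration of such pre-cleaning, here is a small Python sketch. It deliberately does not use GD or ImageMagick; a grayscale image is assumed to be already loaded as a list of rows with values from 0 to 255. The three helper names and the threshold of 128 are assumptions chosen for the example.

```python
def binarize(gray, threshold=128):
    """Map a grayscale grid to 1 for dark 'ink' pixels and 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def remove_specks(bits):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    h, w = len(bits), len(bits[0])
    out = [row[:] for row in bits]
    for y in range(h):
        for x in range(w):
            if not bits[y][x]:
                continue
            neighbours = sum(
                bits[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out

def crop_to_ink(bits):
    """Crop the grid to the bounding box of the ink; return [] if there is none."""
    rows = [y for y, row in enumerate(bits) if any(row)]
    if not rows:
        return []
    cols = [x for x in range(len(bits[0])) if any(row[x] for row in bits)]
    return [row[cols[0]:cols[-1] + 1] for row in bits[rows[0]:rows[-1] + 1]]

gray = [
    [255, 255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255, 255],
    [255,  15, 255, 255, 255,   0],  # the lone dark pixel on the right is noise
    [255, 255, 255, 255, 255, 255],
]
cleaned = crop_to_ink(remove_specks(binarize(gray)))
```

The cleaned grid would then be handed to an OCR program or a neural network rather than parsed by the script itself.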
III. Neural networks
Neural networks now attract the greatest interest. To many people they still look like some kind of magic.
Attackers have begun using neural networks to recognize any captcha automatically.
They train a neural network that can recognize even the most complex reCAPTCHA with high probability.
There are many free neural network libraries for different programming languages. One such library, the Fast Artificial Neural Network, was previously covered in Hacker magazine.
3. Semi-automatic recognition using cheap human labor
There are many sites, such as antigate.com, rucaptcha.com, captchabot.com, etc., that offer their clients a cost-effective service. They accept a captcha image from the client automatically and, after a few (10-30) seconds, return what they believe is the correct answer. The success rate in this case is very high, about 90-95%.
Note that this rate is much lower for an unpracticed visitor.
The service costs approximately $1-3 per 1000 correctly recognized captchas.
Imagine registering 1000 Gmail accounts for only 30 rubles!
You will surely ask who would do manual recognition for such ridiculous money. These are residents of the poorest countries of the world, such as India, China, Vietnam, Pakistan, Cambodia, etc.
Of course, the first thing that comes to mind for captcha creators is to switch to the Russian alphabet and drop English entirely.
Some even drop digits. But as you can guess, these half-measures cannot provide adequate protection, and before long the service owners will simply route such captchas only to workers who can read Russian.
Others try to complicate the captchas themselves by applying various filters, distortions, noise, etc., thinking that it is a robot that recognizes them.
In doing so, they make reading harder not only for the people employed by these services and for all sorts of scripts, but also for ordinary users, who have several times less practice. It all begins to resemble modern medicine, which fights not the causes of diseases but only their symptoms, so that only the patient suffers.
The main customers of such services are large SEO companies and various information aggregators, which automatically collect keyword search statistics, positions in search results, the search results themselves, etc. The services are also used by all sorts of spammers who send messages on social networks and automatically register accounts on mail services, forums, etc.
In my opinion, captcha recognition for spam, although harmful, is not the main business of such services; it is only the tip of the iceberg. The iceberg itself is nothing more than providing access to consolidated information.
Here is an example. In our country, tracking information for postal items within the Russian Federation is available only on the official website of the Russian Post. A year ago they added a captcha to the tracking form, which made it impossible to obtain information on items of interest automatically.
Now think about it: where does the information on shipments on these sites come from, this time without a captcha?
Resisting such services is not easy. Several factors contribute to this:
Without exception, the service owners have shifted responsibility for downloading the images to their clients. As a result, you cannot determine the ip-addresses of the services themselves, or of the people doing the recognition, in order to block them.
A client of such a service who has the necessary knowledge can easily plug in a list of proxy servers to get around possible blocking.
It is also worth noting that no service, reCAPTCHA included, has been able to counter this effectively.
Easy recognition for the user
The best solution for the user is plain text in an image, provided that automatic recognition is made significantly harder.
The transparency supported by the gif and png image formats comes to the rescue. We need to arrange things so that the user sees undistorted text when several parts of the image are overlaid on each other. Every browser supports this, even IE6.
Now let's make it a little more complicated. To begin with, we create several transparent images the same size as the original and scatter the original image pixel by pixel across these layers. Looking at any single layer, it is impossible to say exactly what the original image shows. However, putting such a scheme back together automatically is still easy.
Let's complicate things further: make the layers random in size and take them from random places in the original image.
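The splitting step can be sketched in Python as follows. This is only one possible reading of the idea, not the service's actual code: the image is assumed to be a grid of 0 (transparent) and 1 (ink), every ink pixel is assigned to a random layer, and each layer is then cropped to its own bounding box, so layer sizes and offsets come out different. The function names are hypothetical.

```python
import random

def split_into_layers(image, n_layers, rng):
    """Scatter ink pixels over n_layers; return a list of (left, top, grid)."""
    buckets = [[] for _ in range(n_layers)]
    for y, row in enumerate(image):
        for x, px in enumerate(row):
            if px:
                buckets[rng.randrange(n_layers)].append((x, y))
    layers = []
    for pixels in buckets:
        if not pixels:
            continue  # this layer received no pixels
        left = min(x for x, _ in pixels)
        top = min(y for _, y in pixels)
        right = max(x for x, _ in pixels)
        bottom = max(y for _, y in pixels)
        grid = [[0] * (right - left + 1) for _ in range(bottom - top + 1)]
        for x, y in pixels:
            grid[y - top][x - left] = 1
        layers.append((left, top, grid))
    return layers

def compose(layers, width, height):
    """Overlay the layers at their offsets, as the browser does for the user."""
    canvas = [[0] * width for _ in range(height)]
    for left, top, grid in layers:
        for dy, row in enumerate(grid):
            for dx, px in enumerate(row):
                if px:
                    canvas[top + dy][left + dx] = 1
    return canvas

image = [
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1],
]
layers = split_into_layers(image, 4, random.Random(42))
restored = compose(layers, 6, 3)
```

Overlaying all layers at their offsets gives back the original image, while each single layer holds only a fraction of the pixels.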
Now we put all the information about the layers into a separate css file that describes the position of each layer relative to the upper left corner of the original image. We also need some way to identify the layers and tell them apart. To do this, we assign a random identifier to each image and describe the layers by those identifiers.
Example of the generated html page:
<html>
<head>
...
<link rel="stylesheet" href="/captcha/954f836a78de1d510d28ce70fa7b6a4a.css">
...
</head>
<body>
...
<div>
<img id="ppaas-org-666ebb41ddda5d4ed6ca4a305ef26aa3" src="/captcha/5cd345e1be7b576c628f0fea59c771a7.gif" alt="">
<img id="ppaas-org-032a6f45b6215a130227c13d93d9243b" src="/captcha/3bae7faafef0fce7dd606e6076fcb491.gif" alt="">
<img id="ppaas-org-1ab330864b702c47f0cb87f436624f04" src="/captcha/639def2a37662dc524977eb23521470d.gif" alt="">
<img id="ppaas-org-d494ac99950d983bef6a5a396100d69a" src="/captcha/9077a2f8a464dd2b54c929133df5f916.gif" alt="">
<img id="ppaas-org-6316b3bc6d6f366eed48f32f6624b396" src="/captcha/607bcc4f9573d7591bddba72820f4460.gif" alt="">
<img id="ppaas-org-b22da7a9fc15987c5ae825e736591d03" src="/captcha/2e37508352cc31227adfd6ac0dfc5eb0.gif" alt="">
<img id="ppaas-org-048a808a9f2f6a88736c212f83c7a23a" src="/captcha/fbe29561657ab6e6f45969a4f208356d.gif" alt="">
<img id="ppaas-org-9416599dcb5540a858d9ed3eb8aaa6bd" src="/captcha/347c4ce6ff64ba6a6af0374ccea286c8.gif" alt="">
<img id="ppaas-org-d7eb49d155684558196821fdb03c608d" src="/captcha/88d31395d0024972f14125996d335529.gif" alt="">
<img id="ppaas-org-10c40dc3fbf7e1dc6a675cec03261105" src="/captcha/fab44113c2a37510d829114796d0fabb.gif" alt="">
<img id="ppaas-org-69f1bac3c78d00bf529d8aa518c4a7c3" src="/captcha/6cc4c1417c1844892dfdf73491cd99d6.gif" alt="">
<img id="ppaas-org-8041ac42a7f1d9fb21d959dd78fd0512" src="/captcha/3afcef8223bcf0771f5c11c93737534a.gif" alt="">
<img id="ppaas-org-d812b3fd1537b3852e8645979c8ce531" src="/captcha/d47a2fc0fac782964d4f57bae5c8e13f.gif" alt="">
<img id="ppaas-org-7830d62c3f648536431ef1ef8522ff4e" src="/captcha/14bd31e6112391aed8f9b45d3fbadf34.gif" alt="">
<img id="ppaas-org-0bb897e2fde54b338eec83c27f913170" src="/captcha/575834849cb528079840be97d77a31d3.gif" alt="">
<img id="ppaas-org-2d2a15cb75aa8fb806fc4c79c2fb559d" src="/captcha/a2f623a5fdfe46efdb3e5410a7c90b98.gif" alt="">
<img id="ppaas-org-1612c676e0333d9742913572ec60aee7" src="/captcha/aade2c5b4f5cbae1d2df9df3fc7c3dec.gif" alt="">
<img id="ppaas-org-34fa4c5d386ddb7b4cf48ce59b9cdc8d" src="/captcha/ddf335c0c060c87c362fd70f06a705aa.gif" alt="">
<img id="ppaas-org-e9747f4f8219bd8cb22d4592fbdfe677" src="/captcha/7605f696aa21366a9f870dcf26fb3788.gif" alt="">
</div>
...
</body>
</html>
Example css-file /captcha/954f836a78de1d510d28ce70fa7b6a4a.css:
#ppaas-org-666ebb41ddda5d4ed6ca4a305ef26aa3 {position: absolute; z-index: 371; margin: 0px 0 0 2px;}
#ppaas-org-032a6f45b6215a130227c13d93d9243b {position: absolute; z-index: 138; margin: 1px 0 0 24px;}
#ppaas-org-1ab330864b702c47f0cb87f436624f04 {position: absolute; z-index: 321; margin: 0px 0 0 80px;}
#ppaas-org-d494ac99950d983bef6a5a396100d69a {position: absolute; z-index: 320; margin: 4px 0 0 3px;}
#ppaas-org-6316b3bc6d6f366eed48f32f6624b396 {position: absolute; z-index: 196; margin: 1px 0 0 74px;}
#ppaas-org-b22da7a9fc15987c5ae825e736591d03 {position: absolute; z-index: 92; margin: 0px 0 0 49px;}
#ppaas-org-048a808a9f2f6a88736c212f83c7a23a {position: absolute; z-index: 501; margin: 6px 0 0 11px;}
#ppaas-org-9416599dcb5540a858d9ed3eb8aaa6bd {position: absolute; z-index: 733; margin: 0px 0 0 7px;}
#ppaas-org-d7eb49d155684558196821fdb03c608d {position: absolute; z-index: 54; margin: 0px 0 0 0px;}
#ppaas-org-10c40dc3fbf7e1dc6a675cec03261105 {position: absolute; z-index: 634; margin: 3px 0 0 13px;}
#ppaas-org-69f1bac3c78d00bf529d8aa518c4a7c3 {position: absolute; z-index: 543; margin: 1px 0 0 38px;}
#ppaas-org-8041ac42a7f1d9fb21d959dd78fd0512 {position: absolute; z-index: 506; margin: 1px 0 0 44px;}
#ppaas-org-d812b3fd1537b3852e8645979c8ce531 {position: absolute; z-index: 67; margin: 0px 0 0 0px;}
#ppaas-org-7830d62c3f648536431ef1ef8522ff4e {position: absolute; z-index: 247; margin: 0px 0 0 20px;}
#ppaas-org-0bb897e2fde54b338eec83c27f913170 {position: absolute; z-index: 350; margin: 3px 0 0 2px;}
#ppaas-org-2d2a15cb75aa8fb806fc4c79c2fb559d {position: absolute; z-index: 149; margin: 3px 0 0 45px;}
#ppaas-org-1612c676e0333d9742913572ec60aee7 {position: absolute; z-index: 429; margin: 1px 0 0 33px;}
#ppaas-org-34fa4c5d386ddb7b4cf48ce59b9cdc8d {position: absolute; z-index: 404; margin: 1px 0 0 2px;}
#ppaas-org-e9747f4f8219bd8cb22d4592fbdfe677 {position: absolute; z-index: 153; margin: 2px 0 0 9px;}
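A stylesheet of this shape could be generated roughly as in the Python sketch below. The function name layer_css and the z-index range of 1 to 999 are assumptions; only the "ppaas-org-" prefix, the 32-character hexadecimal identifiers, and the rule format are taken from the example above.

```python
import random
import secrets

def layer_css(offsets, prefix="ppaas-org-", rng=None):
    """Return (ids, css) with one random id and one positioning rule per layer.

    offsets is a list of (left, top) pixel offsets of each layer relative
    to the upper left corner of the original image.
    """
    rng = rng or random.Random()
    ids, rules = [], []
    for left, top in offsets:
        layer_id = prefix + secrets.token_hex(16)  # 32 hex characters
        z_index = rng.randint(1, 999)              # arbitrary stacking order
        rules.append(
            f"#{layer_id} {{position: absolute; z-index: {z_index}; "
            f"margin: {top}px 0 0 {left}px;}}"
        )
        ids.append(layer_id)
    return ids, "\n".join(rules)

ids, css = layer_css([(2, 0), (24, 1)], rng=random.Random(0))
```

The same identifiers would then be written into the id attributes of the img tags, so the page and the stylesheet only make sense together.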
This process can be complicated indefinitely: draw extra pixels on some layers, paint over them on subsequent layers, etc.
Isn't it true that everything ingenious is simple?!
Protection against automated recognition
A big mistake, in my opinion, is blocking ip-addresses that repeatedly enter incorrect captcha values.
Anyone may enter incorrect values as many times as they like, and it affects nothing. Filtering those out is exactly what a captcha is for.
You need to block only those who have already entered X correct values, and unblock them once they have entered no further correct values for N minutes.
In other words, a site visitor whose ip-address has been seen entering X captchas correctly within the last N minutes should be refused automatically.
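The rule with X and N can be sketched in Python as a sliding-window counter of correct answers per ip-address. The class and method names are hypothetical, the storage is a plain in-memory dictionary, and a real centralized service would need shared storage instead.

```python
import time
from collections import defaultdict, deque

class CorrectAnswerLimiter:
    """Refuse an IP that gave X correct captcha answers in the last N minutes."""

    def __init__(self, max_correct, window_minutes):
        self.max_correct = max_correct         # X
        self.window = window_minutes * 60.0    # N, in seconds
        self._hits = defaultdict(deque)        # ip -> times of correct answers

    def _prune(self, ip, now):
        """Drop correct answers that are older than the window."""
        hits = self._hits[ip]
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        return hits

    def is_blocked(self, ip, now=None):
        now = time.time() if now is None else now
        return len(self._prune(ip, now)) >= self.max_correct

    def record_correct(self, ip, now=None):
        """Call only after a correct answer; wrong answers are never counted."""
        now = time.time() if now is None else now
        self._prune(ip, now).append(now)

limiter = CorrectAnswerLimiter(max_correct=1, window_minutes=30)
limiter.record_correct("203.0.113.7", now=0)
```

With X = 1 and N = 30, the address above is refused until 30 minutes have passed since its correct answer, while wrong answers from any address change nothing.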
It is also worth noting that this approach only becomes effective with a single, centralized service.
The ideal solution is one where such restrictions are not mandatory and the parameters can be adjusted to your requirements.
Some will want a limit of 1 captcha every 30 minutes, some will want 5 captchas per 5 hours, and some will prefer to disable the check.
With such a restriction in place, it no longer matters how the automated captcha recognition is done.
None of the methods remains an effective solution.
You will probably ask: what prevents the use of a huge number of proxy servers?
Everyone takes proxy servers from public lists, from the same sites around the world. The last time I had to use them, about 20,000 servers actually worked, and 3,000 of those worked consistently.
Finding or creating your own proxy servers is difficult for most people. That category includes servers that have been brute-forced, infected with viruses, etc. It is a specific niche, really accessible to only a few.
With a limit of 1 captcha per 30 minutes per address, a single attacker with exclusive ownership of such a base of 20,000 ip-addresses could solve up to 40,000 captchas per hour.
That is 960,000 captchas per day, a great result!
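These figures can be checked with a few lines of Python. The calculation assumes a limit of 1 correct captcha per ip-address every 30 minutes, one of the limits mentioned earlier; the function name is made up for the example.

```python
def max_solved(ip_count, per_ip_limit, window_minutes, period_hours):
    """Upper bound on captchas solved by one exclusive owner of ip_count addresses."""
    windows = (period_hours * 60) // window_minutes  # full windows in the period
    return ip_count * per_ip_limit * windows

per_hour = max_solved(20000, 1, 30, 1)   # 20,000 addresses * 2 windows
per_day = max_solved(20000, 1, 30, 24)   # 20,000 addresses * 48 windows
```

The bound holds only while nobody else uses the same addresses.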
Now imagine that all attackers own this base, fully or partially. Its uncoordinated use will lead to permanent denial of service for all of them.
Consider an example. You have just used an ip-address to solve a captcha successfully, by whatever method, and now you wait about half an hour before retrying so as not to get blocked. But while you were waiting, someone else used the same address to enter a captcha on another site.
His attempt is refused, since you used this ip-address a few minutes earlier. When your waiting time expires, you solve a captcha a second time, but you get rejected too.
And so it goes around in circles, indefinitely.
Protection Privacy as a Service
Up to this point this was only theory. You can see how it works in practice at http://ppaas.org.
The service allows you to protect any textual information, such as email addresses, phone numbers, etc.
UPD: Please read the post carefully. The graphical presentation is only there for ease of reading; the main protection against automated recognition lies elsewhere.
Thank you for your attention. Together we will make this world better.