While wandering around the Internet, I landed on a very popular, long-established RuNet site. To download a file from it, you have to solve a captcha like this one:

Seeing yet another picture with digits, I made up my mind.
I had long been toying with the idea of breaking some captcha :) So I set myself a task: write a script that reads the captcha shown and spits out the precious digits.
I am deliberately not naming the site; you can guess :)
So let's go!
Analyzing the picture
First you need to look through as many of these captchas as possible to spot similarities, differences and patterns. For this purpose I downloaded about 50 captchas. From them you can pick out the main ones, which differ from each other the most:




In general, I like staring at numbers, because I once spent a lot of time studying mathematics :) Looking them over, we can see:
- the picture is a black-and-white GIF
- the size of the picture may vary, but the digits are always in the center (although vertically they are not centered very precisely)
- a gradient is used, and it can run in either of two directions
- besides the gradient there is an "angular gradient" (my own name for it, don't kick me :) ): it comes out of a corner at an angle of 45 degrees (again, don't kick me :) ); as I understand it, it is just a diagonal line
- in total I identified 6 different fonts (more precisely 3; the other 3 are their oblique versions)
- the pixels of every digit are no darker than #606060, but they are not all the same color
- a captcha contains 3 to 5 digits, each no taller than 14px
Looking for a solution
For half an hour I turned the options over in my head. One thing is clear:
it is desirable to crop the picture, and since the same fonts are always used and they never change, you can use "prints". By this term I mean that we keep the digits somewhere in a database and compare them against the picture.
I settled on this approach:
- get an array of prints
- crop the picture on all sides
- remove the extra colors, that is, the gradient and the angular gradient
- go through all the pixels from left to right and top to bottom, and if a pixel's color matches the digit color (>= #606060), check it against the prints, all of them in order
Implementation
- Preparing the prints
There are 6 * 10 = 60 of them in total, stored in an array. I extracted the prints from captchas for each font. Each print is simply an array of strings in which the letter "x" marks a pixel that belongs to the digit.
For example, here is the digit 2 of the first font:
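As a rough sketch, a print in this format could be written in PHP as below. The glyph shape, the "." used for background pixels, the nesting by font and digit, and the helper printOffsets() are all assumptions made for illustration; they are not taken from the original script.

```php
<?php
// Hypothetical print: this glyph is invented for illustration,
// it is NOT one of the 60 prints from the original script.
$prints = [
    // font index => digit => rows of the print ("x" = digit pixel)
    0 => [
        2 => [
            '.xxxx.',
            'x....x',
            '.....x',
            '....x.',
            '...x..',
            '..x...',
            '.x....',
            'xxxxxx',
        ],
    ],
];

// Convert a print into a list of [dx, dy] offsets of its "x" pixels.
function printOffsets(array $rows): array
{
    $offsets = [];
    foreach ($rows as $dy => $row) {
        for ($dx = 0; $dx < strlen($row); $dx++) {
            if ($row[$dx] === 'x') {
                $offsets[] = [$dx, $dy];
            }
        }
    }
    return $offsets;
}

$offsets = printOffsets($prints[0][2]);
echo count($offsets), " digit pixels in this print\n";
```

Keeping prints as plain strings makes them easy to type in and to check by eye; the offset list is just a convenient form for comparing against the picture.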

- Open the picture
This is done simply with imagecreatefromgif($filename);
- Determine the direction of the gradient
We need to find out which way the gradient runs; this will be required in the following steps.
That is easy: it is enough to look at the color of the first pixel (0, 0)
$color = imagecolorat($image, 0, 0) < 0x20 ? 'black' : 'white' ;
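One detail worth keeping in mind: imagecolorat() returns a palette index for palette-based images (which is what a GIF normally loads as) and an RGB integer for truecolor ones. If the palette is not ordered by brightness, the raw value may not be the brightness. A minimal sketch of a safer check follows; grayAt() and gradientStart() are hypothetical helper names, and the picture is assumed to be grayscale so that the red channel equals the brightness.

```php
<?php
// Brightness (0-255) of a pixel, for both palette and truecolor images.
// Assumes a grayscale picture, so the red channel equals the brightness.
function grayAt($image, int $x, int $y): int
{
    $value = imagecolorat($image, $x, $y);
    if (imageistruecolor($image)) {
        return ($value >> 16) & 0xFF;          // red channel of 0xRRGGBB
    }
    $rgb = imagecolorsforindex($image, $value); // palette index -> RGB
    return $rgb['red'];
}

// Hypothetical helper: which end of the gradient sits in the top-left corner?
function gradientStart($image): string
{
    return grayAt($image, 0, 0) < 0x20 ? 'black' : 'white';
}

// Demo on a synthetic 4x4 palette image instead of a real captcha.
$image = imagecreate(4, 4);
$black = imagecolorallocate($image, 0, 0, 0);       // first color = background
$white = imagecolorallocate($image, 255, 255, 255);
$before = gradientStart($image);
imagesetpixel($image, 0, 0, $white);
$after = gradientStart($image);
echo $before, ' -> ', $after, "\n";
```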
- Clean the angular gradients
Here we need to remove the diagonal lines, and it is better to do this before cropping the captcha.
We need to know the direction of the gradient so that we clean from the correct side.
Analysis shows that the color difference from pixel (1, 1) to (2, 2), and so on along the diagonal, never exceeds #202020.
To clean means to paint black, because all our digits are no darker than #606060.
We get this picture:

You can view the PHP code in the attachment (see the link below)
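To give an idea of this step, here is one possible reading of it, not the code from the attachment. It assumes the picture has already been read into a 2D array $grid[$y][$x] of brightness values (0-255), and that the line is followed from a starting pixel for as long as neighbouring diagonal pixels differ by no more than 0x20. The function name cleanDiagonal() and its parameters are made up for this sketch.

```php
<?php
// Paint a 45-degree line black, starting at ($x, $y) and stepping by
// ($stepX, $stepY), for as long as consecutive pixels on the diagonal
// differ by no more than $maxDrop. The caller must pass a valid start pixel.
function cleanDiagonal(array $grid, int $x, int $y, int $stepX, int $stepY, int $maxDrop = 0x20): array
{
    $height = count($grid);
    $width  = count($grid[0]);
    $prev   = $grid[$y][$x];
    while ($x >= 0 && $x < $width && $y >= 0 && $y < $height) {
        $current = $grid[$y][$x];
        if (abs($current - $prev) > $maxDrop) {
            break;             // jump too large: no longer the gradient line
        }
        $grid[$y][$x] = 0;     // "clean" = paint black
        $prev = $current;
        $x += $stepX;
        $y += $stepY;
    }
    return $grid;
}

// Demo: a 6x6 black grid with a fading diagonal that breaks at (4, 4).
$grid = array_fill(0, 6, array_fill(0, 6, 0));
foreach ([255, 200, 190, 180, 100, 95] as $i => $value) {
    $grid[$i][$i] = $value;
}
$clean = cleanDiagonal($grid, 1, 1, 1, 1);
echo implode(' ', array_map(fn ($i) => $clean[$i][$i], range(0, 5))), "\n";
```

The step arguments (1, 1), (-1, 1) and so on would be chosen from the gradient direction, so that the walk starts from the correct corner.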
- Crop the captcha
At this stage we cut 12px off the left and the right.
Since a digit is no taller than 14px, we also cut the excess off the top and bottom, depending on the height of the whole captcha.
We get:
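A cropping step of this kind might look as follows on a 2D brightness array $grid[$y][$x]. The 12px side margin comes from the text; the 20px band centered vertically is purely an assumption for the demo (the exact top and bottom amounts are not given here), and cropGrid() is a hypothetical name.

```php
<?php
// Crop a brightness grid: drop $side columns on the left and on the right,
// and keep a horizontal band of $bandHeight rows starting at row $top.
function cropGrid(array $grid, int $side, int $top, int $bandHeight): array
{
    $width = count($grid[0]);
    $rows  = array_slice($grid, $top, $bandHeight);
    foreach ($rows as $i => $row) {
        $rows[$i] = array_slice($row, $side, $width - 2 * $side);
    }
    return $rows;
}

// Demo on a synthetic 80x30 grid.
$grid = [];
for ($y = 0; $y < 30; $y++) {
    for ($x = 0; $x < 80; $x++) {
        $grid[$y][$x] = ($x + $y) % 256;
    }
}
$bandHeight = 20;                                      // assumed: 14px digit + margin
$top        = intdiv(count($grid) - $bandHeight, 2);   // assumed: centered band
$cropped    = cropGrid($grid, 12, $top, $bandHeight);
echo count($cropped[0]), 'x', count($cropped), "\n";
```

When working on the GD image itself rather than an array, imagecrop() with an array of 'x', 'y', 'width' and 'height' does the same job.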

- Clean the gradient
Extra gradient stripes still remain on all sides. They must be cleaned as well.
We pass first from top to bottom, then from left to right, and take the color of each strip: if it is continuous (longer than 10px) and of a single color, we treat it as a gradient strip and clean it.
In total we get:
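The strip rule can be sketched like this, again on a 2D brightness array $grid[$y][$x]. The function names are hypothetical, and treating "clean" as "paint black" is carried over from the previous step as an assumption.

```php
<?php
// Blank every horizontal run of more than $minRun equal, non-black pixels.
function cleanRowRuns(array $grid, int $minRun = 10): array
{
    foreach ($grid as $y => $row) {
        $width = count($row);
        $start = 0;
        for ($x = 1; $x <= $width; $x++) {
            // A run ends at the end of the row or where the color changes.
            if ($x === $width || $row[$x] !== $row[$start]) {
                if ($x - $start > $minRun && $row[$start] !== 0) {
                    for ($i = $start; $i < $x; $i++) {
                        $grid[$y][$i] = 0;
                    }
                }
                $start = $x;
            }
        }
    }
    return $grid;
}

// Swap rows and columns so the same routine can handle vertical strips.
function transpose(array $grid): array
{
    $out = [];
    foreach ($grid as $y => $row) {
        foreach ($row as $x => $value) {
            $out[$x][$y] = $value;
        }
    }
    return $out;
}

// Horizontal strips first, then vertical ones.
function cleanStrips(array $grid, int $minRun = 10): array
{
    $grid = cleanRowRuns($grid, $minRun);
    return transpose(cleanRowRuns(transpose($grid), $minRun));
}

// Demo: a full-width strip, a short digit-like run, and a 12px strip.
$grid = [
    array_fill(0, 14, 0x40),
    array_merge([0, 0, 0x80, 0x80, 0x80], array_fill(0, 9, 0)),
    array_merge(array_fill(0, 12, 0x30), [0x90, 0x90]),
];
$clean = cleanStrips($grid);
echo implode(' ', $clean[2]), "\n";
```

Short runs such as digit strokes survive because they do not exceed the 10px limit.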

But in some cases (about 5%) noise like this may still remain:


However, it will not get in our way :) because its color no longer matches the color of the digits.
- Check against the prints
We go through all the pixels, top to bottom and left to right; whenever a pixel's color fits the digit color, we compare that spot against all the prints in order.
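A simplified sketch of such a comparison is given below. It is not the CapCrack code: it tries every print at every position instead of starting only from digit-colored pixels, it requires an exact match inside the print's bounding box, and the two 3x3 prints are invented for the demo. All names here (printMatches(), recognise(), stamp(), DIGIT_MIN) are hypothetical.

```php
<?php
const DIGIT_MIN = 0x60; // digit pixels are no darker than #606060

// True if $rows (strings of 'x' / '.') fits the grid with its top-left corner
// at ($left, $top): every 'x' must be digit-colored, every '.' must not be.
function printMatches(array $grid, array $rows, int $left, int $top): bool
{
    foreach ($rows as $dy => $row) {
        for ($dx = 0; $dx < strlen($row); $dx++) {
            $value = $grid[$top + $dy][$left + $dx] ?? 0; // outside = dark
            if (($row[$dx] === 'x') !== ($value >= DIGIT_MIN)) {
                return false;
            }
        }
    }
    return true;
}

// Scan columns left to right; in each column try every print at every row.
function recognise(array $grid, array $prints): string
{
    $result = '';
    $height = count($grid);
    $width  = count($grid[0]);
    for ($x = 0; $x < $width; $x++) {
        for ($y = 0; $y < $height; $y++) {
            foreach ($prints as [$digit, $rows]) {
                if (printMatches($grid, $rows, $x, $y)) {
                    $result .= $digit;
                    $x += strlen($rows[0]) - 1; // skip past the matched glyph
                    continue 3;                 // move on to the next column
                }
            }
        }
    }
    return $result;
}

// Demo helper: draw a print onto the grid with a digit-colored brightness.
function stamp(array $grid, array $rows, int $left, int $top): array
{
    foreach ($rows as $dy => $row) {
        for ($dx = 0; $dx < strlen($row); $dx++) {
            if ($row[$dx] === 'x') {
                $grid[$top + $dy][$left + $dx] = 0xC0;
            }
        }
    }
    return $grid;
}

// Invented 3x3 prints, far smaller than real ones.
$prints = [
    ['1', ['.x.', 'xx.', '.x.']],
    ['7', ['xxx', '..x', '..x']],
];
$grid = array_fill(0, 5, array_fill(0, 9, 0x10)); // dark background
$grid = stamp($grid, $prints[1][1], 1, 1);        // a "7" at column 1
$grid = stamp($grid, $prints[0][1], 5, 1);        // a "1" at column 5
$text = recognise($grid, $prints);
echo $text, "\n";
```

Because leftover noise is darker than #606060, it counts as background in this comparison and does not break the match.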
Result:

Testing
For testing, I downloaded 200 such captchas; on my home PC the script processed them in about 19 seconds.
That is about 10 captchas per second.
Of these 200, not a single error was detected; the script worked fine :)
Results
I wrote the CapCrack class, which parses the captcha.
If you want to understand the algorithm in more detail, or test it on your own PC, you can take a look at the code:
cap_crack.zip
I did not stop at this success and decided to try writing a script that downloads files from the site automatically, but that is a completely different story :) worthy of a separate article...
PS
This is my first post on Habr, so please do not judge too harshly :)