
Breaking Captcha

Wandering the expanses of the Internet, I ended up on a heavily visited, long-established RuNet site. To download a file from it, you have to solve a captcha like this one:
image
Seeing yet another picture with digits, I made up my mind: I had been meaning to break some captcha for a long time :)

I set myself a task: write a script that recognizes the captcha shown and spits out the precious digits.

I am deliberately not naming the site; you can guess :)
So let's go!



Analyzing the picture


First you need to look through as many of these captchas as possible to spot similarities, differences, and patterns. For this I downloaded about 50 captchas. From them you can pick out the main ones, which differ from each other the most:

imageimageimageimageimage

I generally enjoy staring at digits, having spent a lot of time studying mathematics back in the day :)

Looking them over, this is what we see:

Looking for a solution


For half an hour I turned the options over in my head. One thing was clear: the picture should be cropped, and since the same fonts are always used and never change, you can use "prints". By this term I mean that we keep the digits somewhere in a database and compare them against the picture.
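To make the "prints" idea concrete, here is a minimal sketch. It is not the code from cap_crack.zip. It assumes the picture has already been reduced to a grid of strings where "x" is a digit-colored pixel and "." is anything else. The tiny 3x5 shapes and the names printMatchesAt and recognize are made up for illustration; the real prints are larger and there are 60 of them.

```php
<?php
// Hypothetical prints: one string per pixel row, 'x' marks a digit pixel.
// These 3x5 shapes are invented for the example, not taken from the captcha.
$prints = [
    '1' => ['.x.', 'xx.', '.x.', '.x.', 'xxx'],
    '7' => ['xxx', '..x', '.x.', '.x.', '.x.'],
];

// True if every row of $print equals the grid content at offset ($left, $top).
function printMatchesAt(array $grid, array $print, int $left, int $top): bool
{
    foreach ($print as $dy => $row) {
        if (!isset($grid[$top + $dy])) {
            return false; // the print would stick out below the picture
        }
        if (substr($grid[$top + $dy], $left, strlen($row)) !== $row) {
            return false;
        }
    }
    return true;
}

// Walk the columns left to right; at each column try every print
// at every vertical offset and collect the digits that match.
function recognize(array $grid, array $prints): string
{
    $result = '';
    $width  = strlen($grid[0]);
    $height = count($grid);
    for ($x = 0; $x < $width; $x++) {
        foreach ($prints as $digit => $print) {
            for ($y = 0; $y <= $height - count($print); $y++) {
                if (printMatchesAt($grid, $print, $x, $y)) {
                    $result .= $digit;
                    $x += strlen($print[0]) - 1; // jump past the matched digit
                    continue 3;                  // resume the column loop
                }
            }
        }
    }
    return $result;
}

// A made-up, already cleaned picture containing "1" and "7".
$grid = [
    '.........',
    '..x..xxx.',
    '.xx....x.',
    '..x...x..',
    '..x...x..',
    '.xxx..x..',
    '.........',
];

echo recognize($grid, $prints), "\n"; // prints 17
```

The sketch demands an exact match of the whole print rectangle, which is only reasonable because the fonts never change; a captcha with distorted digits would need a tolerance threshold instead.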

This is the solution I settled on:

Implementation


  1. Preparing the prints
    There are 6 * 10 = 60 of them in total, and they all go into one array. I made the prints from captchas, one set for each font. Each print is just an array of strings in which the letter "x" marks a pixel of the digit.

    For example, here is the digit 2 in the first font:
    image
  2. Opening the picture
    This is done simply with imagecreatefromgif($filename);

  3. Determining the direction of the gradient
    We need to find out which way the gradient runs; this is needed in the following steps.
    This is easy: it is enough to check the color of the first pixel (0, 0)
    $color = imagecolorat($image, 0, 0) < 0x20 ? 'black' : 'white';

  4. Cleaning the corner gradients
    Here we need to remove the diagonal lines and gradients in the corners, and it is better to do this before cropping the captcha.
    We need to know the direction of the gradient in order to clean it from the correct side.
    Analysis shows that the color difference between pixel (1, 1) and (2, 2), and so on, never exceeds #202020.
    Cleaning means painting over in black, because all our digits have a color below #606060.

    We get this picture:
    image
    You can view the PHP code in the attachment (see the link below)

  5. Cropping the captcha
    At this stage we cut off 12px on the left and on the right.
    Since a digit is never taller than 14px, we also cut off the excess at the top and bottom, depending on the height of the whole captcha.

    We get:
    image
  6. Cleaning the gradient
    Extra gradient stripes still remain on all sides. They need to be cleaned too.
    We pass first from top to bottom, then from left to right, and take the color of each stripe; if it is solid (longer than 10px) and a single color, we treat it as a gradient stripe and clean it.

    In the end we get:
    image
    But in some cases (~5%) noise like this may still remain:
    imageimage
    It will not get in our way, though :) because its color no longer matches the color of the digits.

  7. Comparing with the prints
    We go through all the pixels, top to bottom and left to right, and each pixel whose color matches the color of the digits is compared against all the prints in turn.
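Steps 5 and 6 can be sketched on a plain two-dimensional array of color values, without touching GD. This is only my reading of the description, not the code from the attachment: the names cropCaptcha, clearLongRuns and cleanStripes are hypothetical, the vertical crop is assumed to be centered (the text only says it depends on the captcha height), and "solid" is taken to mean a run of more than 10 identical non-black pixels.

```php
<?php
const BLACK = 0x000000;

// Step 5: drop $sideMargin pixels on the left and right and keep a band
// of $digitHeight rows. Assumption: the digits sit vertically centered.
function cropCaptcha(array $pixels, int $sideMargin = 12, int $digitHeight = 14): array
{
    $top  = max(intdiv(count($pixels) - $digitHeight, 2), 0);
    $rows = array_slice($pixels, $top, $digitHeight);
    return array_map(
        fn(array $row) => array_slice($row, $sideMargin, max(count($row) - 2 * $sideMargin, 0)),
        $rows
    );
}

// Step 6, one line of pixels: paint black every run of more than
// $maxRun identical non-black values.
function clearLongRuns(array $line, int $maxRun = 10): array
{
    $n = count($line);
    $start = 0;
    while ($start < $n) {
        $end = $start;
        while ($end + 1 < $n && $line[$end + 1] === $line[$start]) {
            $end++;
        }
        if ($line[$start] !== BLACK && $end - $start + 1 > $maxRun) {
            for ($i = $start; $i <= $end; $i++) {
                $line[$i] = BLACK;
            }
        }
        $start = $end + 1;
    }
    return $line;
}

// Step 6, whole picture: clean every row first, then every column.
// Expects rows indexed 0..height-1, as cropCaptcha returns them.
function cleanStripes(array $pixels, int $maxRun = 10): array
{
    foreach ($pixels as $y => $row) {
        $pixels[$y] = clearLongRuns($row, $maxRun);
    }
    $width = count($pixels[0]);
    for ($x = 0; $x < $width; $x++) {
        $column = clearLongRuns(array_column($pixels, $x), $maxRun);
        foreach ($column as $y => $value) {
            $pixels[$y][$x] = $value;
        }
    }
    return $pixels;
}
```

With these defaults a vertical digit stroke longer than 10px in a single color would be erased as well, so a real implementation has to tell stripes from strokes in some way the article does not spell out; the run length is left as a parameter for that reason.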

Result


image

Testing


For testing I downloaded 200 such captchas; on my home PC the script processed them in ~19 seconds.
That is about 10 captchas per second.

Out of these 200 there was not a single error; the script worked fine :)

Results


I wrote a class, CapCrack, that parses the captcha.

If you want to understand the algorithm in more detail, or test it on your PC, you can take a look at the code: cap_crack.zip

I did not stop at this success and decided to try writing a script that downloads files from the site automatically, but that is a completely different story :) worthy of a separate article...

P.S. This is my first post on Habr, so please don't judge too harshly :)

Source: https://habr.com/ru/post/63854/

