Analysis of the simplest captcha (C#)

Some time ago I had to write a program that automatically downloaded files from a fairly well-known site. The problem, at first glance, was that the site used a captcha. However, one look at it was enough to see that it could be solved, and very quickly :) A few years later I came across that project again and decided to post it on Habr. A caveat up front: I will not name the site, because the captcha is still in use there, and let it stay that way.

Stage One: Information Collection

First, I wrote a small program that requested the captcha image URL and saved the images it received to a separate folder. When I looked at the 50 images the program had downloaded, I realized the task was even easier than I had thought.

Here are some examples of the captcha:

There are few colors; the background color and the color of the digits change randomly, but the outlines of the digits are clearly visible and there is no noise. So I chose the simplest solution: the simplest kind of mask analysis.

Stage Two: Initial Image Analysis, Mask Creation

To create a mask, you need to reduce the image to two colors. To do this, I added another routine to the program that downloaded the images from the site. It took an image, determined which colors were present in it and, for each color, produced a copy in which that color was painted black and everything else white. Parsing an image this way yielded several candidates, one of which clearly showed the characters used in the captcha.

Here is an example of those images:

Here I have shown only the masks for the most frequent colors. To weed out the rest, I discarded all colors that occupy fewer than 25 pixels in the image. In principle, this means a character could be missed if a captcha contains 1-2 characters that take up fewer than 25 pixels, but I never saw such images in this captcha and did not worry about it.

In the last image you can see that we effectively have a mask, and an absolutely clean one. In an editor it looks like this:

Here is the code I used to extract the candidate masks by color:

    // Requires: using System; using System.Collections.Generic; using System.Drawing;

    // Returns a two-color copy of the input: pixels equal to clr become black,
    // all other pixels become white.
    public Bitmap ClearBitmap(Bitmap input, Color clr)
    {
        var result = new Bitmap(input.Width, input.Height);
        for (var x = 0; x < input.Width; x++)
        {
            for (var y = 0; y < input.Height; y++)
            {
                var color = input.GetPixel(x, y);
                result.SetPixel(x, y, clr == color ? Color.Black : Color.White);
            }
        }
        return result;
    }

    public void Main()
    {
        var bitmap = new Bitmap("D:\\check_image1227.png");

        // Count how many pixels of each color the image contains.
        var palette = new Dictionary<Color, int>();
        for (var x = 0; x < bitmap.Width; x++)
        {
            for (var y = 0; y < bitmap.Height; y++)
            {
                var clr = bitmap.GetPixel(x, y);
                if (!palette.ContainsKey(clr))
                {
                    palette.Add(clr, 1);
                }
                else
                {
                    palette[clr] = palette[clr] + 1;
                }
            }
        }

        // Save a mask for every color that covers more than 30 pixels
        // (the text above mentions 25; this listing uses 30).
        var i = 0;
        foreach (var c in palette)
        {
            if (c.Value > 30)
            {
                var temp = this.ClearBitmap(bitmap, c.Key);
                temp.Save(String.Format("D:\\mask-{0}.bmp", i));
                i++;
            }
        }
    }

Stage Three: Into the battle!

Once I saw that the code above worked, all that remained was to collect all the digits used in the captcha and get started. For this, I ran the program that saves captcha images until it had collected 200 of them. From those I picked the images that together showed all the main characters and, using the code above, obtained their masks. The result looked like this:

For some reason, the digit 9 was never used in the captcha, but that does not matter. From here on it is easy. Each digit is enclosed in a box, with the rightmost column as the least significant bit and the leftmost as the most significant. Anyone who has written 8086 assembly and built character masks will understand; for everyone else, here is an example:

The image above shows the bit numbers, with the resulting number for each row on the right. To make it clearer still, replace the black dots with 1 and the white ones with 0. For example, the top row of this digit is 0001111111000 in binary, i.e. 1016 in decimal. For each digit, an array describing its matrix was built this way.
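The row encoding described here can be sketched in a few lines. This is a minimal illustration, not the code from the attached project: the `RowEncoder` class and the `EncodeRow` method are hypothetical names, and the row is passed as a string of `0` and `1` characters for brevity rather than being read from a bitmap.

```csharp
using System;

public static class RowEncoder
{
    // Packs one row of a mask into an integer: the leftmost pixel becomes the
    // most significant bit, the rightmost pixel the least significant bit.
    // '1' stands for a black pixel, '0' for a white one.
    public static int EncodeRow(string row)
    {
        if (row.Length > 31)
            throw new ArgumentException("Row is too wide for a 32-bit int.");

        var value = 0;
        foreach (var pixel in row)
        {
            value = (value << 1) | (pixel == '1' ? 1 : 0);
        }
        return value;
    }

    public static void Main()
    {
        // The top row from the example in the text.
        Console.WriteLine(EncodeRow("0001111111000")); // prints 1016
    }
}
```

Calling `EncodeRow` once per row of a digit gives the kind of per-digit array the text talks about.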

The rest of the algorithm looked like this. One function trimmed the white pixels from the top, bottom and sides of the resulting image, and another returned an array of numbers representing a given area. After that, everything was simple. Since everything ran automatically, after cleaning the image and trimming the excess I added a check that the height of the image matched the height of a digit (only then did it make sense to look for digits). Then, in a loop from left to right, the current area was compared with each digit. If a digit matched, the window moved right by the width of that character and the next area was checked, and so on.

In the end it turned out not very fast (I am attaching the project so you can see for yourself), but the solution recognizes 100% of the captchas correctly.
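The trimming and left-to-right matching steps could look roughly like the sketch below. It is written from the description only and is not the code from the attached project. The `Glyph` and `MaskRecognizer` names, the tiny 3-row glyphs and the `Parse` helper are all hypothetical. Moving one column to the right when nothing matches is also an assumption: the text only says what happens after a match.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class Glyph
{
    public readonly char Symbol;
    public readonly int Width;
    public readonly int[] Rows; // one packed integer per pixel row

    public Glyph(char symbol, int width, int[] rows)
    {
        Symbol = symbol; Width = width; Rows = rows;
    }
}

public static class MaskRecognizer
{
    // Crops the mask to the bounding box of its black (true) pixels.
    public static bool[,] Trim(bool[,] mask)
    {
        int rows = mask.GetLength(0), cols = mask.GetLength(1);
        int top = rows, bottom = -1, left = cols, right = -1;
        for (var y = 0; y < rows; y++)
            for (var x = 0; x < cols; x++)
                if (mask[y, x])
                {
                    top = Math.Min(top, y); bottom = Math.Max(bottom, y);
                    left = Math.Min(left, x); right = Math.Max(right, x);
                }
        if (bottom < 0) return new bool[0, 0]; // no black pixels at all

        var result = new bool[bottom - top + 1, right - left + 1];
        for (var y = top; y <= bottom; y++)
            for (var x = left; x <= right; x++)
                result[y - top, x - left] = mask[y, x];
        return result;
    }

    // Packs columns [left, left + width) of every row into one integer,
    // leftmost column as the most significant bit.
    public static int[] Encode(bool[,] mask, int left, int width)
    {
        var packed = new int[mask.GetLength(0)];
        for (var y = 0; y < packed.Length; y++)
            for (var x = left; x < left + width; x++)
                packed[y] = (packed[y] << 1) | (mask[y, x] ? 1 : 0);
        return packed;
    }

    public static string Recognize(bool[,] image, IList<Glyph> glyphs)
    {
        var mask = Trim(image);
        var text = new StringBuilder();
        var x = 0;
        while (x < mask.GetLength(1))
        {
            Glyph match = null;
            foreach (var glyph in glyphs)
            {
                // Compare only when the heights agree and the glyph still fits.
                if (glyph.Rows.Length != mask.GetLength(0)) continue;
                if (x + glyph.Width > mask.GetLength(1)) continue;
                if (SameRows(Encode(mask, x, glyph.Width), glyph.Rows))
                {
                    match = glyph;
                    break;
                }
            }
            if (match != null) { text.Append(match.Symbol); x += match.Width; }
            else x++; // assumption: skip one column, e.g. a gap between digits
        }
        return text.ToString();
    }

    private static bool SameRows(int[] a, int[] b)
    {
        for (var i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    // Hypothetical helper: builds a mask from strings of '0' and '1'.
    public static bool[,] Parse(params string[] rows)
    {
        var mask = new bool[rows.Length, rows[0].Length];
        for (var y = 0; y < rows.Length; y++)
            for (var x = 0; x < rows[y].Length; x++)
                mask[y, x] = rows[y][x] == '1';
        return mask;
    }

    public static void Main()
    {
        var glyphs = new List<Glyph>
        {
            new Glyph('7', 3, new[] { 7, 1, 1 }),
            new Glyph('1', 1, new[] { 1, 1, 1 }),
        };
        var image = Parse("0000000", "0101110", "0100010", "0100010", "0000000");
        Console.WriteLine(Recognize(image, glyphs)); // prints 17
    }
}
```

Exact comparison of packed rows works here only because the captcha has no noise and the digits always render identically. A noisier captcha would need a tolerance instead of strict equality.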

PS Don't judge the project code too harshly: it was written long ago and in a hurry. A lot can be optimized, but I leave that to you, if you are interested, of course :)

PPS Still, I don't think anything will change if I name the site: it was cracks.ms

Project for VS2010

Source: https://habr.com/ru/post/115739/
