Introduction
This article briefly describes the process of breaking the captcha used by ifolder.ru, using Python and a few third-party libraries. The Hough transform algorithm, as implemented in Intel's Open Computer Vision (OpenCV) library, will let us get rid of image noise, and the easy-to-use, fast FANN (Fast Artificial Neural Network) library will let us apply an artificial neural network to the recognition problem.
My motivation was, above all, to try the Python language. As everyone knows, the best way to learn a language is to solve an applied problem with it. So, in parallel with describing the image processing pipeline, I will mention which libraries I used and what for.
Problem overview
We have the following captcha:

ifolder.ru is a file sharing service that, on every download and upload, wants to make sure you are not a robot. I picked this resource because I had long wanted to apply the Hough transform, described below, to this kind of task.
What makes this captcha hard to recognize? There are several difficulties; let's describe them in order of their impact on the complexity of the problem:
1. The presence of intersections of characters. A good example of such cases:

The percentage of such cases is relatively small, so we write them off as rejects with the note "not recognizable."
2. The presence of lines. Each image contains 4 lines of varying length (sometimes comparable to the linear strokes of the characters themselves), thickness and slope. We treat them as the main source of noise that we have to get rid of.
3. Large variation in character placement. Characters sit at different heights and at different distances from each other.
4. Character rotation. Characters are tilted around one axis, but by no more than ~30 degrees (a value obtained empirically).
5. Floating character size and stroke thickness.
What looks like a simple enough captcha turns out, on closer inspection, to be not so simple. :) But it's not all that bad. Let's start.
Stage 1. Creating a training set and preprocessing
To begin with, we download a few hundred captcha samples from the site, say 500. That is enough to tune the algorithms and build a primary training set for our neural network.
With the help of the urllib library and a plain script, we download the required number of samples from the site. After that we convert them from GIF to 8-bit bitmap; this is the format we will work with from here on. An important step is inverting the image, i.e. making the objects white on a black background. It will become clear later why this matters.
The script that performs all of the above:
from urllib2 import urlopen
from urllib import urlretrieve
from PIL import Image, ImageOps, ImageEnhance
import os
import sys
import re
import time

def main(url, n):
    # find the session-bound captcha image URL on the page
    data = urlopen(url).read()
    match = re.search(r"/random/images/\?session=[a-z0-9]+", data)
    if match:
        imgurl = "http://ifolder.ru" + match.group()
    else:
        return -1
    # download n samples
    for i in range(n):
        urlretrieve(imgurl, '/test/' + str(i) + '.gif')
        time.sleep(1)
        print str(i) + ' of ' + str(n) + ' downloaded'
    # convert them: grayscale, invert, boost contrast, save as bmp
    for i in range(n):
        img = Image.open('/test/' + str(i) + '.gif').convert('L')
        img = ImageOps.invert(img)
        img = ImageEnhance.Contrast(img).enhance(1.9)
        img.save('/test/' + str(i) + '.bmp')
        # os.unlink('/test/' + str(i) + '.gif')

if __name__ == "__main__":
    url = sys.argv[-1]
    if not url.lower().startswith("http"):
        print "usage: python dumpimages.py http://ifolder.com/?num"
        sys.exit(-1)
    main(url, 500)
Stage 2. Noise removal, localization and separation of objects.
The most interesting and time-consuming stage. The captcha examples I showed in the overview will be our working samples for the rest of the article. So, after the first stage, we have the following:

To work with images I used the
PIL library. Simple as a hoe, yet quite functional and very convenient.
Back to the task at hand. In this case, by noise I mean the lines.
As a solution to the problem, I see several options:
1. Genetic algorithms.
2. The Hough transform, which can be viewed as a kind of automatic vectorization.
GAs have been covered on Habr several times, including in the course of solving the similar task of breaking the
Yandex captcha. Writing a modification of a genetic algorithm to detect straight lines would not be difficult.
Nevertheless, I chose the second option. Compared to a GA, the Hough transform is a mathematically more rigorous and deterministic algorithm, with no random factor involved. It is also less resource-intensive, while remaining simple enough to understand and apply.
Briefly, the idea of the algorithm is that any straight line on a plane can be defined by two variables: the angle of inclination and the distance from the origin, (theta, r). These variables can be treated as features forming their own two-dimensional space. Since a straight line is a collection of points, and each point votes for every (theta, r) pair of a line it could lie on, the points of a real line accumulate their votes in the same small neighborhood of the feature space, producing a peak there. It is all simpler than it sounds. :)
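To make the idea concrete, here is a minimal sketch of a Hough accumulator in pure numpy. This is only an illustration of the voting scheme, not the implementation used in this article (that comes from OpenCV below); peaks in the returned array correspond to lines in the image:

import numpy as np

def hough_accumulator(binary_img, theta_steps=180):
    # binary_img: 2-D array, nonzero pixels are the white object pixels
    h, w = binary_img.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(-np.pi / 2, np.pi / 2, theta_steps)
    acc = np.zeros((2 * diag + 1, theta_steps), dtype=np.int32)
    ys, xs = np.nonzero(binary_img)
    for x, y in zip(xs, ys):
        # each point votes for every line (theta, r) it could lie on:
        # r = x*cos(theta) + y*sin(theta)
        rs = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rs + diag, np.arange(theta_steps)] += 1
    return acc, thetas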
More details can be found on
Wikipedia, and a visualization of the algorithm can be seen
here; after watching it, everything becomes clear immediately.
Naturally, I was too lazy to write the implementation myself. Besides, it already exists in the
OpenCV library, which I often work with from C/C++. Python bindings exist and are easy to build and install.
On the whole, OpenCV is a fairly low-level library, and working with it from Python is not very convenient, so the authors provide adapters for converting to PIL objects. This is done very simply:
src = cvLoadImage('image.bmp', 1)    # OpenCV object
pil_image = adapters.Ipl2PIL(src)    # PIL object
The procedure for deleting lines is as follows:
def RemoveLines(img):
    dst = cvCreateImage(cvGetSize(img), IPL_DEPTH_8U, 1)
    cvCopy(img, dst)
    storage = cvCreateMemStorage(0)
    lines = cvHoughLines2(img, storage, CV_HOUGH_PROBABILISTIC, 1, CV_PI / 180, 35, 35, 3)
    for line in lines:
        # paint each detected segment over with the background color
        cvLine(dst, line[0], line[1], bgcolor, 2, 0)
    return dst
The input images must be monochrome, with the meaningful pixels white. That is why we inverted the images in the first stage and will invert them again during recognition.
The key point is the call to
cvHoughLines2. Note the
CV_HOUGH_PROBABILISTIC parameter, which selects a "smarter" modification of the algorithm. The last three parameters are also very important; they set the vote threshold for a cell in the feature space, the minimum line length, and the maximum gap, i.e. the number of missing pixels allowed along a line. More information is in the library
documentation.
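For clarity, here is the same call as in RemoveLines above, with the parameters annotated:

lines = cvHoughLines2(img, storage, CV_HOUGH_PROBABILISTIC,
                      1,            # distance resolution, pixels
                      CV_PI / 180,  # angle resolution, radians
                      35,           # vote threshold in the feature space
                      35,           # minimum line length, pixels
                      3)            # maximum gap along a line, pixels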
It is very important to choose these parameters correctly, otherwise we will either remove straight strokes that are parts of the characters, or, on the contrary, leave a lot of noise. I believe the parameters I chose are reasonable, though far from ideal. For example, let's double the maximum gap. It leads to this effect:

Together with the lines we have removed a lot of useful information. Properly selected parameters, on the other hand, achieve an acceptable result:


As you have already noticed, since the lines often cross the characters, we inevitably remove some useful information too. This is not fatal: part of it can be restored later, and in the end everything will depend on how well our neural network is trained.
The next task is localizing and separating the characters. Here we run into the problems described in points 1 and 3 of the overview. The floating position and rotation of the characters mean we cannot rely on fixed coordinates, and the characters often "touch", which rules out the usual contour-detection algorithms.
Clearly we have to split the image along vertical cuts. Without thinking twice, we count the white pixels in each column of the image and plot them:




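Computing this column projection takes only a few lines. A minimal sketch, assuming a grayscale PIL image with white characters on a black background (the brightness threshold of 128 is an arbitrary choice):

from PIL import Image

def column_histogram(img, threshold=128):
    # count the bright pixels in every column of a grayscale PIL image
    w, h = img.size
    pix = img.load()
    return [sum(1 for y in range(h) if pix[x, y] > threshold)
            for x in range(w)]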
To draw the plots I used the
matplotlib library. It is striking in its flexibility and built-in functionality; I have not seen anything like it in other languages.
PyQt4 served as the GUI front end.
If you compare the plots with the images, you can see three local minima in each. We will "cut" the image at them. It is hard to come up with an optimal algorithm for finding these minima, if one exists at all, so I implemented a simple local-minimum search whose parameters were obtained empirically; it is far from optimal. This is an important point: a more thoughtful algorithm here could significantly improve recognition quality.
The procedure for dividing the image into symbols can be found in the source code (FindDividingCols and DivideDigits).
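To give the idea, a simple search of this kind might look like the sketch below. This is not the exact code from the sources; the window size and the near-empty-column threshold are picked arbitrarily:

def find_dividing_cols(hist, window=5, max_height=3):
    # hist: white-pixel count per column (see column_histogram above)
    # a column is a cut candidate if it is nearly empty and is the
    # lowest point within a +/-window neighborhood
    cols = []
    for x in range(window, len(hist) - window):
        neighborhood = hist[x - window:x + window + 1]
        if hist[x] <= max_height and hist[x] == min(neighborhood):
            if not cols or x - cols[-1] > window:  # drop near-duplicates
                cols.append(x)
    return cols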
Next we trim the characters, since a lot of background area remains around them. After that, we can try to restore some of the lost useful information. I can recommend
morphological algorithms such as Erosion and Dilation, or Closing. They are available in the OpenCV library; a Python usage example ships in the library repository, OpenCV\samples\python\morphology.py. Finally, all the resulting images of individual characters are scaled to a single size of 18x24.
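A minimal sketch of such a repair step, using the same old-style OpenCV bindings as above. The 3x3 rectangular kernel and a single Closing pass are assumptions, not tuned values:

# Closing = Dilation followed by Erosion: fills small gaps left by
# the removed lines without thickening the strokes too much
element = cvCreateStructuringElementEx(3, 3, 1, 1, CV_SHAPE_RECT)
temp = cvCloneImage(digit)  # scratch buffer for cvMorphologyEx
cvMorphologyEx(digit, digit, temp, element, CV_MOP_CLOSE, 1)
# then scale the PIL version of the character to the common size
digit_pil = adapters.Ipl2PIL(digit).resize((18, 24))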
The result of the division into characters:

Stage 3. Recognition
The next stage is creating and training the neural network. Out of 500 images (4 characters each) I got a bit under 1000 character samples of acceptable quality and content to use for training. If we train the network to recognize a single character with probability 0.5, the overall success rate is 0.5^4 = 0.0625, i.e. about 6% per captcha. That goal is more than achievable, and the resulting sample was enough for the network. If you are willing to spend a few days on tedious manual labeling, you have every chance of achieving much better results; the main thing is patience, which I lack. :)
For creating and using neural networks, the
FANN library is convenient. Its Python wrapper refused to build without some filing by hand; I had to edit the SWIG-generated code. I decided to post the compiled library, an installer for Python 2.6 and a few usage examples. Download
here. I also wrote brief installation instructions, see INSTALL.
The input is an array of 18 * 24 = 432 pixels (we feed 1 if a pixel belongs to a character and 0 if it is background); the output is an array of 10 numbers, each reflecting the probability that the input belongs to the corresponding class (digit). Thus the input layer of our neural network has 432 neurons and the output layer has 10; in between there is one hidden layer with 432 / 3 = 144 neurons.
The code for creating and training the network:
from pyfann import libfann

num_input = 432
num_output = 10
num_layers = 3
num_neurons_hidden = 144
desired_error = 0.00006
max_epochs = 50000
epochs_between_reports = 1000

ann = libfann.neural_net()
ann.create_standard(num_layers, num_input, num_neurons_hidden, num_output)
ann.set_activation_function_hidden(libfann.SIGMOID_SYMMETRIC_STEPWISE)
ann.set_activation_function_output(libfann.SIGMOID_SYMMETRIC_STEPWISE)
ann.train_on_file('samples.txt', max_epochs, epochs_between_reports, desired_error)
ann.save('fann.data')
ann.destroy()
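For reference, train_on_file expects FANN's plain-text training format: a header line with the number of training pairs, the number of inputs and the number of outputs, followed by alternating input and output lines. A schematic fragment of samples.txt (values shortened for illustration):

1000 432 10
0 0 1 1 0 ... (432 zeros and ones: the pixels of one character)
0 0 0 1 0 0 0 0 0 0    (10 outputs: here the character is the digit 3)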
Using:
def MagicRegognition(img, ann):
    ann = libfann.neural_net()
    ann.create_from_file('fann.data')
    sample = []
    for i in range(img.size[1]):
        for j in range(img.size[0]):
            # colordist() and bgcolor are defined elsewhere in the sources
            if colordist(img.getpixel((j, i)), bgcolor) < 10:
                sample.append(0)  # background pixel
            else:
                sample.append(1)  # character pixel
    res = ann.run(sample)
    return res.index(max(res))  # index of the most probable digit
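Putting the stages together, an end-to-end run could look roughly like the sketch below. This is only an outline under assumptions: DivideDigits is taken on faith from the sources, and I assume it returns a list of 18x24 PIL character images:

ann = libfann.neural_net()
ann.create_from_file('fann.data')
src = cvLoadImage('captcha.bmp', 0)             # grayscale, already inverted
clean = RemoveLines(src)                        # stage 2: strip the lines
digits = DivideDigits(adapters.Ipl2PIL(clean))  # stage 2: cut into characters
answer = ''.join(str(MagicRegognition(d, ann)) for d in digits)
print answer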
Conclusion
Python is a great language, very concise and beautiful, with a host of third-party libraries covering all my everyday needs, and a low entry threshold, largely thanks to a solid community and plenty of documentation. Pythonistas, I'm one of you now ;)
Additional libraries used: Numpy, Scipy.
Sources (mirror 1, mirror 2)
For syntax highlighting, the highlight.hohli.com resource was used.
UPD: 1. Re-uploaded the sources to 3 mirrors. 2. By request, removed the last three characters from the regular expression for the captcha link on the ifolder page in dumpimages.py. Otherwise the "kids" have too much fun :)