Parsing pictures into text: a simple algorithm

The roots of this story go back to the years when one of the clans of the ancient text game "Fight Club" ordered a captcha for the game from me, a young Perl programmer. A couple of sleepless nights, and four flat digits were ready, along with input checking.

A few days later another, no less respected clan came and ordered a parser for that same captcha. Analyzing it took much more time; there was no Ocrad back then, but a very simple, working method was found.
A week later, the third and most distinguished clan in the game came and ordered a new captcha. After a couple of months, almost all the top clans had enriched themselves with new artifact pictures, their programmers had gotten lots of colorful paper, the project had gotten a bunch of nonsense generators, and I personally had gained invaluable experience.

Most recently, this experience came in handy for turning thousands of phone numbers on one site from images back into text. The algorithm is the same one, and I want to share it. Here are a screwdriver and a hammer; what you build with them - a synchrophasotron or a gravushka gun - is your own business.

I wrote all of this first in Perl and then in PHP, but there's no need to tire anyone with listings, right?

Step 1. Image into a matrix.
We parse the image into a two-dimensional matrix of the form a_xy, or a[x][y] if you prefer.
Each element of the matrix holds one value - the color of the pixel.
We count how many pixels there are of each color and store these counts in an ordinary array.
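
For those who do want a listing, here is a minimal Python sketch of Step 1 (not my original Perl or PHP). It assumes the pixels have already been decoded into a flat, row-major sequence of color values; the names to_matrix and count_colors and the tiny sample image are made up for illustration.

```python
from collections import Counter

def to_matrix(flat_pixels, width, height):
    """Turn a flat, row-major pixel sequence into a matrix indexed as a[x][y]."""
    if len(flat_pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    return [[flat_pixels[y * width + x] for y in range(height)]
            for x in range(width)]

def count_colors(a):
    """Count how many pixels there are of each color value."""
    return Counter(color for column in a for color in column)

# Hypothetical 4x2 image: 0 = background, 7 = ink, 3 = noise.
flat = [0, 7, 7, 0,
        0, 7, 3, 0]
a = to_matrix(flat, width=4, height=2)
counts = count_colors(a)
```

A Counter stands in for the "ordinary array" here: it maps each color value to its number of pixels.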

Step 2. Get rid of the excess.
Although the image comes from a GIF, which stores no more than 256 colors, it still carries more information than we need. We reduce the number of colors: we discard every color whose pixel count is below a threshold of at least 50% of the count of the most frequent color in our array. From a seemingly monochrome image, usually about four colors remain. This is the list of primary colors.
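
The filtering in Step 2 can be sketched in Python roughly as follows. The function name primary_colors, the ratio parameter, and the sample counts are assumptions for illustration; only the 50% threshold comes from the description above.

```python
from collections import Counter

def primary_colors(counts, ratio=0.5):
    """Keep the colors whose pixel count is at least `ratio` of the top count."""
    if not counts:
        return set()
    top = max(counts.values())
    return {color for color, n in counts.items() if n >= ratio * top}

# Hypothetical histogram: color value -> number of pixels.
counts = Counter({0: 400, 7: 250, 9: 210, 3: 15})
primary = primary_colors(counts)
```

Raising ratio above 0.5 makes the filter stricter, which is one of the knobs you end up adjusting for each new captcha.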

Step 3. Now the fun part: we do a total Sharpen and Grayscale. Watch closely:

Create a new two-dimensional matrix b[x][y]. We will write the results into it.
Take four adjacent pixels - a square.
If the color of at least one of these pixels is in the list of primary colors, we write b[0][0] = X to the new matrix. If none is, we write b[0][0] = 0.
Take the next 4 pixels. Repeat until the end of the matrix; for large images the operation can even be run twice. Just do not get carried away - the coarser the image becomes, the harder it will be to compare later.
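
Here is a rough Python sketch of that pass over 2x2 squares. It assumes non-overlapping squares and silently drops a trailing odd row or column; the name reduce_blocks is hypothetical. In the sample, the background color is deliberately left out of the primary set, which is my assumption here: otherwise every square would come out as X.

```python
def reduce_blocks(a, primary, mark="X", empty="0"):
    """Collapse each 2x2 square of a[x][y] into a single cell of b[x][y]."""
    width = len(a)
    height = len(a[0]) if a else 0
    b = []
    for x in range(0, width - 1, 2):
        column = []
        for y in range(0, height - 1, 2):
            square = (a[x][y], a[x + 1][y], a[x][y + 1], a[x + 1][y + 1])
            # One primary-colored pixel is enough to mark the whole square.
            column.append(mark if any(p in primary for p in square) else empty)
        b.append(column)
    return b

# Hypothetical 4x4 image indexed a[x][y]: 7 = ink, 3 = noise, 0 = background.
a = [
    [0, 0, 0, 0],
    [0, 7, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
]
b = reduce_blocks(a, primary={7})
# A second pass, for large images, treats the marks themselves as primary.
b2 = reduce_blocks(b, primary={"X"})
```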

The result is such a beauty:

There is something of childhood in this, from when we were taught at school to write postal codes, right?

The simplest thing remains: to explain to the computer that a picture made of crosses and zeros may well be a decimal digit. To do this, we split the matrix into submatrices, one per symbol, and compare each with a reference template. Unfortunately, the templates differ for every captcha, and each time you have to adjust them, albeit slightly.
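
How to cut the matrix into symbols depends on the captcha. The Python sketch below assumes the simplest case, where symbols are separated by at least one completely empty column; that assumption and the names split_symbols and to_rows are mine, for illustration only.

```python
def split_symbols(b, empty="0"):
    """Split b[x][y] into per-symbol submatrices at fully empty columns."""
    symbols, current = [], []
    for column in b:
        if all(cell == empty for cell in column):
            if current:  # an empty column closes the current symbol
                symbols.append(current)
                current = []
        else:
            current.append(column)
    if current:
        symbols.append(current)
    return symbols

def to_rows(symbol):
    """Transpose a symbol's columns into row strings, top to bottom."""
    return ["".join(row) for row in zip(*symbol)]

# Hypothetical matrix, 6 columns wide and 3 cells tall.
b = [list("000"), list("XXX"), list("000"),
     list("X0X"), list("XXX"), list("000")]
symbols = split_symbols(b)
rows = [to_rows(s) for s in symbols]
```

Turning each symbol into row strings is what makes a row-by-row string comparison against a template possible.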

At the very end, Oliver's algorithm for comparing similar strings comes to the rescue; it is what the built-in PHP function int similar_text(str, str) uses. Of course, the shorter the strings, the faster they are compared, so I compared the first row of the "recognized" symbol with the first row of the template, the second with the second, and so on to the end.
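
A row-by-row comparison might look like this in Python. Note the swap: Python has no similar_text, so this sketch uses difflib.SequenceMatcher from the standard library, a related but not identical similarity measure. The names row_score and recognize and the two tiny templates are hypothetical.

```python
from difflib import SequenceMatcher

def row_score(rows, template):
    """Average per-row similarity (0..1) between a symbol and a template."""
    longest = max(len(rows), len(template))
    if longest == 0:
        return 0.0
    total = sum(SequenceMatcher(None, r, t).ratio()
                for r, t in zip(rows, template))
    # Dividing by the longer length penalizes a mismatch in row count.
    return total / longest

def recognize(rows, templates):
    """Return the key of the template that scores highest for this symbol."""
    return max(templates, key=lambda name: row_score(rows, templates[name]))

# Hypothetical 3x3 templates; real ones are tuned for each captcha.
templates = {
    "1": ["0X0", "0X0", "0X0"],
    "7": ["XXX", "00X", "00X"],
}
noisy_one = ["0X0", "0X0", "XX0"]
digit = recognize(noisy_one, templates)
```

Scoring by similarity rather than exact equality is what lets a slightly noisy symbol still land on the right template.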

Forty thousand phone numbers were recognized with an error of 1. Now if we just made the algorithm more universal, we'd have a million in our pockets, right?

Source: https://habr.com/ru/post/158431/
