
Text recognition methods

Just yesterday, the 61st student scientific conference was held at Southern Federal University in Taganrog, where I presented a report on methods for recognizing text in graphic images. I would like to share it with a wider audience. If you are curious about a novice student's reinvented wheels in this area, welcome under the cut.
Pictures and code snippets included.

Some theory

Text recognition belongs to the field of pattern recognition, so let's start with a brief overview of image recognition in general.
Pattern recognition theory is a branch of computer science and related disciplines that develops the foundations and methods for classifying and identifying objects, phenomena, processes, signals, situations, etc., that are characterized by a finite set of properties and attributes. That is the definition Wikipedia gives us.

Wikipedia also distinguishes two main areas of research: studying the recognition abilities of living beings, and developing the theory and methods for building systems that perform recognition.

My coursework falls under the second area.

So, my topic is recognizing text in graphic images, and its importance hardly needs defending. It has long been known that millions of old books are stored in high-security vaults that only specialized personnel can access. These books cannot be handled because of their dilapidation: they might crumble right in the reader's hands. Yet the knowledge they hold is undoubtedly a great treasure for humanity, which is why digitizing them is so important. This is exactly what data processing specialists are working on.
Let's get closer to the topic of text recognition itself. Note that text recognition is usually understood to comprise three main methods.
It should also be said that text recognition almost always goes hand in hand with detecting text in an image; but since I did not set that goal, the detection phase was omitted and replaced with light preprocessing.

Now about the work itself. I wrote an application that can recognize text in images of high or medium quality with little or no noise. The application recognizes upper- and lower-case letters of the English alphabet. The image is submitted for recognition directly from the application itself.

Filtering and processing

Since the detection stage was omitted and a preprocessing stage inserted in its place, the input image mostly looks like this.

This image is processed by two filters: a median filter and a monochrome (binarization) filter. The application uses a modified version of the median filter with the value of the red component adjusted.
Median filter
public static void Median(ref Bitmap image)
{
    var arrR = new int[9];
    var arrG = new int[9];
    var arrB = new int[9];
    var outImage = new Bitmap(image);
    for (int i = 1; i < image.Width - 1; i++)
        for (int j = 1; j < image.Height - 1; j++)
        {
            // collect the 3x3 neighbourhood of the current pixel, per channel
            for (int i1 = 0; i1 < 3; i1++)
                for (int j1 = 0; j1 < 3; j1++)
                {
                    var p = image.GetPixel(i + i1 - 1, j + j1 - 1);
                    arrR[i1 * 3 + j1] = p.R;
                    arrG[i1 * 3 + j1] = p.G;
                    arrB[i1 * 3 + j1] = p.B;
                }
            Array.Sort(arrR);
            Array.Sort(arrG);
            Array.Sort(arrB);
            // the median of 9 samples is index 4; the red and blue channels
            // use shifted order statistics -- the author's modification
            outImage.SetPixel(i, j, Color.FromArgb(arrR[3], arrG[4], arrB[5]));
        }
    image = outImage;
}

This filter minimizes noise and blurs the sharp edges of the letters (serifs, etc.). After that, the image is made monochrome: a hard binarization is applied, so the boundaries of the letters become clearly fixed.
Monochrome
public static void Monochrome(ref Bitmap image, int level)
{
    for (int j = 0; j < image.Height; j++)
    {
        for (int i = 0; i < image.Width; i++)
        {
            var color = image.GetPixel(i, j);
            int sr = (color.R + color.G + color.B) / 3;
            image.SetPixel(i, j, (sr < level ? Color.Black : Color.White));
        }
    }
}

Segmentation

After preprocessing, the image is segmented during recognition. Again, since the detection stage is omitted, the following heuristic is adopted for segmentation: the lines of text are assumed to run horizontally and not intersect one another. The segmentation task then becomes easy.
First, the average distance between two letters in a word is set. The image is then divided into lines by searching for fully white horizontal stripes. Next, these bands are divided into words by searching for white stripes of a certain width. Finally, the extracted words are passed to the last stage, where they are divided into letters. Thus, at the output of the segmentation module, the whole text is represented as images of its individual letters.
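The line-splitting step described above can be sketched as follows. This is a minimal illustration on a binary image (`true` = ink), not the author's actual code; word and letter splitting work the same way on vertical stripes.

```csharp
using System;
using System.Collections.Generic;

static class Segmenter
{
    // Splits a binarized page (true = ink) into horizontal line bands
    // by scanning for fully white rows.
    public static List<(int Top, int Bottom)> FindLines(bool[,] img)
    {
        int h = img.GetLength(0), w = img.GetLength(1);
        var bands = new List<(int Top, int Bottom)>();
        int start = -1;
        for (int y = 0; y < h; y++)
        {
            bool hasInk = false;
            for (int x = 0; x < w && !hasInk; x++) hasInk = img[y, x];
            if (hasInk && start < 0) start = y;               // a band begins
            if (!hasInk && start >= 0) { bands.Add((start, y - 1)); start = -1; }
        }
        if (start >= 0) bands.Add((start, h - 1));            // band touching bottom edge
        return bands;
    }
}
```

The same scan, run within each band over columns and compared against the average inter-letter gap, yields words and then letters.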

Immediately before recognition, each letter image is normalized and scaled to the size of the templates prepared in advance.
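That normalization step could look like this; a minimal sketch assuming binary glyphs and simple nearest-neighbour scaling (the post does not show the author's exact method):

```csharp
using System;

static class Normalizer
{
    // Nearest-neighbour rescale of a binary glyph to the square template size.
    public static bool[,] ToTemplateSize(bool[,] src, int size)
    {
        int h = src.GetLength(0), w = src.GetLength(1);
        var dst = new bool[size, size];
        for (int y = 0; y < size; y++)
            for (int x = 0; x < size; x++)
                dst[y, x] = src[y * h / size, x * w / size];
        return dst;
    }
}
```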

Next comes the recognition process itself. The user has two options: using metrics or using a neural network.

Recognition

Consider the first case - recognition using metrics.

A metric is a function value that characterizes the position of an object in some space. If two objects are located close to each other, that is, they are similar (for example, two letters "A" written in different fonts), their metrics will coincide or be very close. The Hamming metric was chosen for recognition in this mode.

The Hamming metric is a metric that shows how strongly two objects differ from each other.

This metric is often used in coding and data transmission. For example, suppose the bit sequence 1001001 arrives after a transmission session, while we know that the sequence 1000101 should have arrived. We compute the metric by comparing each position of one sequence with the corresponding position of the other. The Hamming metric in our case equals 2, since the sequences differ in two positions. This is the degree of dissimilarity: the larger it is, the worse the match.
Therefore, to determine which letter is depicted, we compute its metric against all the prepared templates; the template whose metric is closest to 0 is the answer.
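The nearest-template scheme can be sketched like this (a minimal illustration; the template bit strings below are made up):

```csharp
using System;
using System.Collections.Generic;

static class HammingRecognizer
{
    // Number of positions at which two equal-length bit strings differ.
    public static int Distance(string a, string b)
    {
        if (a.Length != b.Length) throw new ArgumentException("lengths differ");
        int d = 0;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) d++;
        return d;
    }

    // The template whose metric is closest to 0 wins.
    public static char Classify(string input, IDictionary<char, string> templates)
    {
        char best = '?';
        int bestDist = int.MaxValue;
        foreach (var kv in templates)
        {
            int d = Distance(input, kv.Value);
            if (d < bestDist) { bestDist = d; best = kv.Key; }
        }
        return best;
    }
}
```

For the example from the text, `Distance("1001001", "1000101")` returns 2.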

But as practice showed, a single metric alone does not give good results, since many letters resemble each other, for example "j" and "i", which leads to recognition errors.

It was then decided to introduce new metrics that separate certain sets of letters into classes of their own. In particular, metrics for horizontal and vertical reflection symmetry and for the distribution of pixel weight along the horizontal and vertical axes were implemented.

The experiment revealed that letters such as "H", "I", "i", "O", "o", "X", "x", and "l" possess "supersymmetry" (they coincide completely with their own reflections, and their significant pixels are distributed evenly across the image), so they were moved into a separate class, which reduces the enumeration of all metrics roughly 6-fold. Similar steps were taken for other letters; on average, the search space shrinks about 3-fold.
There is also a unique letter, "J", which is alone in its class and is therefore identified immediately. Within each class, the Hamming metric is then computed, which at this stage performs better than when applied directly to the full alphabet.
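The reflection metric can be sketched as follows; a glyph whose pixels match its own mirror image scores near 1 (a minimal illustration, not the author's implementation):

```csharp
using System;

static class SymmetryMetrics
{
    // Fraction of pixels that coincide with their horizontal reflection.
    // "Supersymmetric" letters such as H, I, O, X score close to 1.0.
    public static double Horizontal(bool[,] img)
    {
        int h = img.GetLength(0), w = img.GetLength(1);
        int match = 0;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (img[y, x] == img[y, w - 1 - x]) match++;
        return (double)match / (h * w);
    }
}
```

A vertical counterpart compares `img[y, x]` with `img[h - 1 - y, x]`; thresholding these scores places a glyph into its symmetry class before the Hamming comparison.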
The templates were created with the Consolas font, so if the recognized text is set in this font, recognition accuracy reaches about 99 percent. With other fonts, accuracy drops to about 70 percent.

The second recognition method uses a neural network.

I will not explain what a neural network is, either in the biological or the mathematical sense, since the Internet is full of such material and I do not want to repeat it. Suffice it to say that, mathematically, a neural network is only a model of its biological namesake.

There are many varieties of these models; in my work I used a single-layer Kohonen network.
The network operates as follows: after training, a new image is fed to the input layer of neurons, and the network responds with an impulse from a particular neuron. Since every neuron is named after a letter, the neuron that reacts carries the recognition answer. In network terminology, a neuron has one output and many inputs; the inputs receive the pixel values of the image. So for a 16x16 image, the network needs 256 inputs.

Each input is weighted by a coefficient, so during recognition every neuron accumulates a certain charge; the neuron that charges the most fires and emits the pulse.
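That charge-and-fire step amounts to a weighted sum per neuron with a winner-take-all choice; a minimal sketch (the class and method names here are illustrative, not from the author's code):

```csharp
using System;

static class KohonenRecognizer
{
    // Charge of one neuron: the weighted sum of its pixel inputs.
    public static double Charge(double[] weights, double[] inputs)
    {
        double sum = 0;
        for (int i = 0; i < weights.Length; i++)
            sum += weights[i] * inputs[i];
        return sum;
    }

    // The neuron that accumulates the largest charge emits the pulse;
    // its index names the recognized letter.
    public static int Winner(double[][] weights, double[] inputs)
    {
        int best = 0;
        double bestCharge = Charge(weights[0], inputs);
        for (int n = 1; n < weights.Length; n++)
        {
            double c = Charge(weights[n], inputs);
            if (c > bestCharge) { bestCharge = c; best = n; }
        }
        return best;
    }
}
```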

But for the input coefficients to be configured correctly, the network must first be trained. A separate training module does this: it takes the next image from the training set and feeds it to the network. The network analyzes the positions of the black pixels and adjusts the coefficients to minimize the matching error using the gradient method, after which the image becomes mapped to a specific neuron.
Training
public void Teach(Bitmap img, Neuron correctNeuron)
{
    var vector = GetVector(img);
    for (int i = 0; i < vector.Length; i++)
    {
        vector[i] *= 10;
        // pull the neuron's weight towards the input with learning rate 0.5
        correctNeuron.Weigths[i] = correctNeuron.Weigths[i] + 0.5 * (vector[i] - correctNeuron.Weigths[i]);
    }
}

By the end of training, each neuron resembles an artist's canvas: where black pixels occur most often, the paint is darkest (the weight is largest), and where they occur rarely, the tone is very light.

All coefficients are now tuned and the network is ready to perceive images.
Recognition accuracy with this method reaches 80 percent. Note that accuracy depends on the training sample, both its size and its quality.

The following were very helpful in writing the application:
Kohonen nets for dummies
A course of lectures from Yandex

P.S. I would welcome constructive criticism of the writing style, manner of presentation, and completeness of coverage.

Source: https://habr.com/ru/post/220077/
