Good afternoon, dear readers. In this article, I will talk about how to parse a number written in Russian.
Smart this parser makes the ability to extract numbers from text with errors made as a result of incorrect input or as a result of optical text recognition from the image (OCR).
For the lazy:
Link to github project: link .
In this section, the algorithms used will be described. Careful, many letters!
At work, I need to recognize the text from a printed document, photographed by the camera of a smartphone / tablet. Because of the non-disclosure agreement, I cannot give an example of a photograph, but the point is that there is a table in the document in which certain indicators are written in numbers and words, and these data must be read. Text parsing in words is required as an additional validation tool to ensure that the number is recognized correctly. But, as you know, OCR does not guarantee accurate text recognition. For example, the number twenty, written in words, can be recognized as “twenty” or even as “two-fold”. It is necessary to take this into account and extract the maximum amount of information, estimating the magnitude of the possible error.
Note. For text recognition, I use tesseract 4. For .NET there is no ready-made NuGet package of the fourth version, so I created one from a branch of the main project, who could be useful: Genesis.Tesseract4 .
Let's start with a simple one, namely with the text recognition algorithm written in words, so far without errors. If you are interested in smart parsing, you can skip this section.
I am not particularly strong in Google, so I did not immediately find a ready-made algorithm for solving this problem. However, this is even better, because the algorithm, invented by himself, gives more room for coding. And the task itself turned out to be interesting.
So, take a small number, for example, "one hundred twenty three." It consists of three words ( tokens ), each of which corresponds to a number, all these numbers are summed up:
" " = + + = 100 + 20 + 3 = 123
So far, everything is simple, but let's dig deeper, for example, consider the number "two hundred and twelve thousand one hundred and five."
" " = ( + ) Ă— + ( + ) = 212 * 1.000 + 105 = 212.105.
As you can see, when there are thousands in the number (as well as millions and other degrees of thousands), the number is divided into parts consisting of a local small number, in the example above - 212, and a multiplier (1000). There may be several such fragments, but they all go in descending order of a multiplier, for example, a thousand cannot be followed by a million or another thousand. This is also true for parts of a small number, so hundreds cannot follow hundreds, and tens after tens, therefore the one hundred five hundred record is incorrect. The characteristic relating two tokens to one type is called a level , for example, one hundred and three hundred tokens have one level, and it is more than the fifty level token.
From these arguments, the idea of ​​the algorithm is born. Let's write out all the possible tokens ( samples ), each of which will be assigned a number, as well as two parameters - the level and the sign of the multiplier.
Token | Number | Level | Factor? |
---|---|---|---|
zero | 0 | one | not |
one / one | one | one | not |
two / two | 2 | one | not |
... | ... | one | not |
nineteen | nineteen | one | not |
twenty | 20 | 2 | not |
... | ... | 2 | not |
ninety | 90 | 2 | not |
hundred | 100 | 3 | not |
... | ... | 3 | not |
nine hundreds | 900 | 3 | not |
thousand / thousand / thousand | 1,000 | four | Yes |
million / million / million | 1,000,000 | five | Yes |
... | ... | ... | Yes |
quadrillion / quadrillion / quadrillion | 1,000,000,000,000,000 | eight | Yes |
In fact, you can add any other tokens to this table, including for foreign languages, just do not forget that in some countries a long rather than a short system for naming numbers is used .
Now let's move on to parsing. Let's get four values:
The algorithm is as follows:
Example of work for the number “two million two hundred twelve thousand one hundred eighty five”.
Token | globalLevel | globalValue | localLevel | localValue |
---|---|---|---|---|
- | - | - | - | |
two | - | - | one | 2 |
million | five | 2,000,000 | - | - |
two hundred | five | 2,000,000 | 3 | 200 |
twelve | five | 2,000,000 | one | 212 |
thousand | four | 2.212.000 | - | - |
hundred | four | 2.212.000 | 3 | 100 |
eighty | four | 2.212.000 | 2 | 180 |
five | four | 2.212.000 | one | 185 |
The result will be 2.212.185.
This algorithm can be used to implement other comparisons, and not just for parsing numbers, for this reason I will try to describe it in as much detail as possible.
With parsing correctly recorded numbers sorted out. Now let's think about what errors can occur if the number of the resulting OCR is incorrect. I do not consider other options, but you can modify the algorithm for a specific task.
I identified three types of errors encountered in the process:
At the same time, the parsing algorithm itself described above almost does not change, the main difference is in splitting a string into tokens.
But let's start with collecting small statistics on the use of letters in tokens. Of the 33 letters of the Russian language, when writing non-negative integers, only 20 are used, let's call them good letters :
The remaining 13, respectively, will be called bad letters . The maximum size of the token is 12 characters (13 when counting up to quadrillion). Substrings longer than this value must be split.
To match strings and tokens, I decided to use the Wagner-Fisher algorithm , although I called it the name of Levenshteyn in code. I do not need the pre-prescription, so I implemented the economical version of the algorithm. I must admit that the implementation of this algorithm turned out to be more difficult than the parser itself.
A small educational program: Levenshtein distance is a special case of the Wagner-Fisher algorithm, when the cost of inserting, deleting and replacing characters is static. In our problem it is not. Obviously, if in the substring we meet a bad letter, then it must be replaced with a good one, but it is extremely undesirable to replace the good with the bad. Generally speaking, it is impossible, but the situation depends on the specific task.
To describe the cost of inserting, deleting and replacing characters, I created the following table: a link to a table with weights . While it is filled with the method of three P (floor, finger, ceiling), but if you fill it with data based on OCR statistics, you can significantly improve the quality of number recognition. The library code contains the resource file NumeralLevenshteinData.txt, into which you can insert data from a similar table using Ctrl + A, Ctrl + C and Ctrl + V.
If the text contains a non-tabular symbol, for example, zero, then the cost of inserting it equals the maximum value from the table, and the cost of deletion and replacement - to the minimum, so the algorithm is more likely to replace zero with the letter “o”, and if you use the third version of tesseract it makes sense to add a zero to the table and set the minimum price for replacing it with the letter “o”.
So, we have prepared the data for the Wagner-Fisher algorithm; let's make changes to the algorithm for splitting a string into tokens. To do this, we will subject each token to additional analysis, but before that we expand the information on the token with the following characteristics:
Algorithm for analyzing tokens:
We are trying to match the token with the sample using the Wagner-Fisher algorithm. This variant consists of a single token (associated sample) and its error is equal to the best distance divided by the sample length.
If the substring is large enough (I have 6 characters), try to divide it into two parts with at least 3 characters each. For a string of 6 characters will be the only option of division: 3 + 3 characters. For a string of 7 characters, there are already two options, 3 + 4 and 4 + 3, and so on. For each of the options, we recursively call the same function of the analysis of tokens, put the obtained variants into the list.
In order not to die in recursion, we determine the maximum level of falling through. In addition, the options obtained as a result of the division artificially worsen by a certain amount (optional, default 0.1), so that the direct comparison option is more valuable. This operation had to be added, because substrings of type “two-fold” were successfully divided into tokens “two” and “five”, and not resulted in “twenty”. Alas, these are the features of the Russian language.
Example: the “two-way” token has a direct comparison with the sample “twenty”, an error is 0.25. In addition, the best division is “two” + “five” costing 0.25 (replacing “a” with “I”), artificially deteriorated to 0.35, with the result that the token “twenty” is preferred.
In addition, a check of tokens is entered into the main number generation algorithm so that their error does not exceed a certain value (optional, default 0.67). With this, we screen out potential garbage, although not very successfully.
We divide the incoming string representing the number in words into substrings using the regular \ s +, then we try to match each of the substrings with sample tokens or break them into smaller substrings, choosing the best results. As a result, we obtain a set of tokens by which we generate a number, and the error value is taken as the arithmetic average of the errors among the tokens used during the generation.
In my task, the numbers are non-negative and relatively small, so I will exclude unnecessary tokens from the "million" and above. For the test, dear readers, I, on the contrary, added additional tokens-slang, which allowed parsing lines like “five pieces”, “mower two hundred” and even “three stolnik and two gold pieces”. It's funny, but it didn't even require changes in the algorithm.
The existing algorithm has some flaws:
The project has an example of a console application running through the samples.txt file with examples for the parser. Here is a screenshot of the results:
I charge you with evaluating the result, but as for me, it is not bad. The magnitude of the error for real recognition examples does not exceed 0.25, although I have not yet chased away the entire set of available documents, probably not everything will be so smooth there.
As for the last section, I was always wondering how much this is “dofiga”. Also, the program gave itself quite an adequate answer, how many should I take for the road (I do not use, but still) and even accurately determined the meaning of the old Russian word “darkness”. And yes, one more measure was not included in the conclusion, which education did not allow to add, but the program considers that it is equal to a thousand =)
Initially, my plans did not include the creation of a library, I decided to issue it exclusively for Habr. I tried to put the code in order, but if you use it, fork or copy, because most likely you will not need jargon and other tokens included in the examples.
The library itself is written under .NET Standart 2.0 and C # 7.x, and the algorithms are easily translated into other languages.
In case of a possible library extension, I will add the composition of the important components of the number parser in words (namespace Genesis.CV.NumberUtils):
Using:
I really hope that the article did not seem boring to you, despite the abundance of text. Recently, I have a lot of topics related to computer vision, which have something to say, so I would like to know the opinion about this format of articles. What should I add or, on the contrary, delete? What is more interesting for you, the readers, the algorithms themselves or code fragments?
Like this article? See others:
Source: https://habr.com/ru/post/453642/
All Articles