An Accounting System Based on OCR

Prologue

At work I was given the task of designing and implementing a system for tracking advertising placements. Tracking here means checking that the right poster is on the right billboard. Both the billboards and the posters are numbered.
A photo was proposed as the source of information for the system. After ~~haggling~~ coordinating with the designers, it was agreed that both numbers would sit inside a single frame. The only catch is that the frame can be anywhere on the billboard.
That is the whole problem statement; the rest is the story of the implementation.
The problem is solved in three steps:

1. Finding the desired rectangle in the image.
2. Recognizing the text.
3. Verifying the recognition result.

Act One - The Search

The easiest way to find the desired rectangle in the picture is to find everything that can be called a rectangle and then filter the candidates by certain parameters. To search for rectangles, a slightly modified version of the standard OpenCV example squares.cpp was used, from which the rectangle search function was taken.
The shape search procedure is quite primitive: given a complex image with many color edges and transitions, it returns a pile of rectangles, and the unnecessary ones have to be thrown out even before recognition.

The garbage is filtered out by several criteria:
1. The ratio of width to height.
The program uses a cutoff criterion: rectangles with r.width < 5 * r.height are discarded. It could be made more precise with a condition that allows a delta around the expected ratio.
The main thing is that the photographer does not get creative and shoot the object with the camera turned 90° (the "get my legs in the shot" kind of picture).
2. Removing rectangles of approximately the same size and position.
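
The delta-based condition mentioned in the first criterion could look roughly like the sketch below. The helper name `hasExpectedAspect`, the expected ratio and the tolerance are all assumptions made for illustration; the real values would have to come from the frame layout agreed with the designers.

```cpp
#include <cmath>

// Hypothetical helper (not part of the original program): accept a rectangle
// only if its width/height ratio lies within a relative tolerance `delta`
// of the expected ratio of the printed frame.
bool hasExpectedAspect(int width, int height, double expectedRatio, double delta)
{
    if (width <= 0 || height <= 0)
        return false;  // degenerate rectangle
    const double ratio = static_cast<double>(width) / height;
    return std::fabs(ratio - expectedRatio) <= delta * expectedRatio;
}
```

With an assumed expected ratio of 6 and a tolerance of 0.2, a 600x100 rectangle passes, while a 300x100 one or a frame photographed with the camera rotated (100x600) is rejected.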

One more point: before filtering, the rectangles are straightened, since the photographer's hand may shake and the sides of the rectangle may not be strictly horizontal and vertical in the photo.
Next, all the collected rectangles are cropped out and written to files.
Experiments showed that the recognition utility works better on black-and-white pictures, so cvAdaptiveThreshold is called before writing to the file. The block size for the conversion was also chosen experimentally.
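
To illustrate what the block size and offset of an adaptive threshold control, here is a minimal sketch in plain C++ without OpenCV. It is a simplification, not the library's implementation: it uses a plain local mean instead of a Gaussian-weighted window and simply ignores pixels outside the image. The function name `adaptiveMeanThreshold` is an assumption made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified adaptive threshold on a row-major 8-bit grayscale image.
// A pixel becomes white (255) if it is brighter than the mean of its
// blockSize x blockSize neighbourhood minus `offset`, otherwise black (0).
// Pixels that fall outside the image are left out of the mean.
std::vector<std::uint8_t> adaptiveMeanThreshold(const std::vector<std::uint8_t>& gray,
                                                int width, int height,
                                                int blockSize, double offset)
{
    std::vector<std::uint8_t> bw(gray.size(), 0);
    const int half = blockSize / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            long sum = 0;
            int count = 0;
            for (int dy = -half; dy <= half; ++dy) {
                for (int dx = -half; dx <= half; ++dx) {
                    const int yy = y + dy;
                    const int xx = x + dx;
                    if (yy < 0 || yy >= height || xx < 0 || xx >= width)
                        continue;
                    sum += gray[static_cast<std::size_t>(yy) * width + xx];
                    ++count;
                }
            }
            const double mean = static_cast<double>(sum) / count;
            const std::size_t idx = static_cast<std::size_t>(y) * width + x;
            bw[idx] = (gray[idx] > mean - offset) ? 255 : 0;
        }
    }
    return bw;
}
```

A larger block averages over a wider area, which makes the result less sensitive to local lighting changes such as shadows across the billboard; the offset decides how much darker than its surroundings a pixel must be to turn black.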

<source lang="cpp">
#include "cv.h"
#include "highgui.h"
#include <iostream>
#include <vector>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

using namespace cv;
using namespace std;

typedef vector<Point> polygon;
typedef vector<polygon> polygonList;

...

// Check whether two rectangles are approximately the same
bool compareRect(const CvRect &r1, const CvRect &r2)
{
    if (!r1.width || !r1.height) return false;
    if ((float)abs(r1.width - r2.width) / (float)r1.width > 0.05) return false;
    if ((float)abs(r1.height - r2.height) / (float)r1.height > 0.05) return false;
    if ((float)abs(r1.x - r2.x) / (float)r1.width > 0.02) return false;
    if ((float)abs(r1.y - r2.y) / (float)r1.height > 0.02) return false;
    return true;
}

// Axis-aligned bounding rectangle of a polygon
CvRect getRect(const polygon& poly)
{
    CvPoint p1 = cvPoint(10000, 10000);
    CvPoint p2 = cvPoint(-10000, -10000);
    for (size_t i = 0; i < poly.size(); i++) {
        const Point p = poly[i];
        if (p1.x > p.x) p1.x = p.x;
        if (p1.y > p.y) p1.y = p.y;
        if (p2.x < p.x) p2.x = p.x;
        if (p2.y < p.y) p2.y = p.y;
    }
    return cvRect(p1.x, p1.y, p2.x - p1.x, p2.y - p1.y);
}

int main(int argc, char** argv)
{
    if (argc <= 3) {
        cout << "Wrong Param Count: " << argc << endl;
        cout << "Usage: findrect infile extension outfolder" << endl;
        return 1;
    }
    char *fileIn = argv[1];
    char *fileExt = argv[2];
    char *dirOut = argv[3];
    char fileOut[128];

    polygonList squares;
    IplImage *Img = cvLoadImage(fileIn, 1);
    Mat image(Img);
    if (image.empty()) {
        cout << "Couldn't load " << fileIn << endl;
        return 1;
    }
    findSquares(image, squares);

    vector<CvRect> rectList;
    int p = 0;
    int adaptive_method = CV_ADAPTIVE_THRESH_GAUSSIAN_C;
    int threshold_type = CV_THRESH_BINARY;
    int block_size = 65;
    double offset = 10;

    for (size_t j = 0; j < squares.size(); j++) {
        // Bounding rectangle; skip it if the aspect ratio does not fit
        CvRect r = getRect(squares[j]);
        if (r.width < 5 * r.height) continue;

        // Skip rectangles that duplicate one already collected
        bool doContinue = false;
        for (size_t k = 0; k < rectList.size(); k++)
            if (compareRect(r, rectList[k])) {
                doContinue = true;
                break;
            }
        if (doContinue) continue;
        rectList.push_back(r);

        // Crop the rectangle out of the source image
        cvSetImageROI(Img, r);
        IplImage *dst = cvCreateImage(cvSize(r.width, r.height), Img->depth, Img->nChannels);
        IplImage *gray = cvCreateImage(cvSize(r.width, r.height), 8, 1);
        IplImage *bw = cvCreateImage(cvSize(r.width, r.height), 8, 1);
        cvCopy(Img, dst, NULL);
        cvResetImageROI(Img);

        // Print the output file name so that the php script can pick it up
        snprintf(fileOut, sizeof(fileOut), "%s/%d.%s", dirOut, p, fileExt);
        cout << fileOut << endl;
        p++;

        // Convert to black-and-white and save
        cvCvtColor(dst, gray, CV_RGB2GRAY);
        cvAdaptiveThreshold(gray, bw, 255, adaptive_method, threshold_type, block_size, offset);
        cvSaveImage(fileOut, bw);

        cvReleaseImage(&dst);
        cvReleaseImage(&gray);
        cvReleaseImage(&bw);
    }
    return 0;
}
</source>

Act Two - Recognition

The recognition utility receives both valid clippings and garbage as input.

For recognition we use tesseract, the utility from Google.
Other recognition tools could have been used; CuneiForm was also tested.
But tesseract was chosen because there is a lot of information about it, including clear instructions for training it on a custom character set.

Training on a custom alphabet pursued several goals:

1. The dictionary for recognizing numbers must consist of just 10 characters; no letters or other symbols are needed. A smaller character set means a lower probability of error.
In principle, one could stop at this first goal: tesseract has a digits-only recognition mode, so it could be used without bothering to create a custom dictionary.
2. The test results suggested another idea. In regular fonts (those in the standard set), the characters to be recognized resemble each other: under certain conditions “7” resembles “1”, “3” resembles “8”, and so on.
Therefore, it was decided to use a font whose digits do not resemble each other. The font's own name served as the hint for finding it: «OCR A Std». This is the font used on the clippings above.
Thus, we get one more factor that reduces the likelihood of error.

As a result, a dictionary of the 10 digits of this font was created for tesseract; the characters can be seen on the clippings above.
I will not give instructions for training the utility: the process is mechanical rather than creative, and there are plenty of instructions online.

Act Three - Putting It All Together

The system was tested on Ubuntu. The cropping and recognition utilities are launched from PHP.
The PHP script also performs the final verification of the recognized data using a checksum.
The CRC-8 algorithm is used.

<source lang="php">
$imagesout = '/home/toor/www/out';
$findrect  = '/home/toor/OCR/OpenCV-2.2.0/samples/cpp/findrect';
$uploaddir = '/home/toor/www/uploads/';
$rectdir   = '/home/toor/www/out/';
$tesseract = '/home/toor/OCR/tesseract-3.00/api/tesseract';

...

if (isset($_FILES['userfile']['tmp_name'])) {
    // basename() keeps a crafted file name from escaping the upload directory
    $uploadfile = $uploaddir . basename($_FILES['userfile']['name']);
    if (!move_uploaded_file($_FILES['userfile']['tmp_name'], $uploadfile)) {
        echo " !";
        exit(1);
    }
    echo " {$_FILES['userfile']['name']}  !";

    // Crop the candidate rectangles; findrect prints one output file name per line
    $cmd = $findrect . ' ' . escapeshellarg($uploadfile) . ' tif ' . escapeshellarg($imagesout);
    exec($cmd, $output);
    echo count($output)." ";

    // Recognize every clipping and count how often each digit string occurs
    $datas = array();
    foreach ($output as $k => $f) {
        $recognized = "$rectdir$k.txt";
        $cmd = $tesseract . ' ' . escapeshellarg($f) . ' ' . escapeshellarg($rectdir . $k) . ' -l nums.ocr';
        exec($cmd);
        if (!file_exists($recognized))
            continue;
        echo "file: $recognized";
        $data = file_get_contents($recognized);
        $data = preg_replace('/\D/', '', $data);  // keep digits only
        $data = trim($data);
        if (!strlen($data))
            continue;
        if (!array_key_exists($data, $datas))
            $datas[$data] = 1;
        else
            $datas[$data]++;
    }

    // Verify each recognized string against both number formats by checksum
    foreach ($datas as $d => $v) {
        if ($r = crc_check($d, NUMBER_LEN_1, NUMBER_LEN_CRC_1)) {
            echo ' : '.$r;
        }
        if ($r = crc_check($d, NUMBER_LEN_2, NUMBER_LEN_CRC_2)) {
            echo ' : '.$r;
        }
    }
}
</source>
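
The checksum verification itself is not shown in the article, so the following is only a sketch of how such a check might work, written in C++ for consistency with the cropping utility. Everything specific here is an assumption: the CRC-8 variant (polynomial 0x07, initial value 0, no reflection), the idea that the checksum is printed as zero-padded decimal digits after the number, and the helper name `checkNumber`.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// CRC-8 over the bytes of `data`.
// Assumed variant: polynomial 0x07, initial value 0, no reflection.
std::uint8_t crc8(const std::string& data)
{
    std::uint8_t crc = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        crc ^= static_cast<std::uint8_t>(data[i]);
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 0x80)
                crc = static_cast<std::uint8_t>((crc << 1) ^ 0x07);
            else
                crc = static_cast<std::uint8_t>(crc << 1);
        }
    }
    return crc;
}

// Hypothetical check: `digits` must consist of payloadLen digits followed by
// crcLen digits holding the zero-padded decimal CRC-8 of the payload.
// Returns the payload on success and an empty string on any mismatch.
std::string checkNumber(const std::string& digits, std::size_t payloadLen, std::size_t crcLen)
{
    if (digits.size() != payloadLen + crcLen)
        return "";
    for (std::size_t i = 0; i < digits.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(digits[i])))
            return "";
    const std::string payload = digits.substr(0, payloadLen);
    char expected[16];
    std::snprintf(expected, sizeof expected, "%0*u",
                  static_cast<int>(crcLen), static_cast<unsigned>(crc8(payload)));
    return digits.substr(payloadLen) == expected ? payload : "";
}
```

Under these assumptions a misread digit almost always changes the checksum, so garbage clippings and wrong recognitions are rejected, while a correctly read number is returned without its check digits.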

Overall, the system performed quite well in test mode.
It handled pictures ranging from those taken with the simplest phones to multi-megabyte photos from digital cameras.

Links

Tesseract
OpenCV
OCR A Std Font

Source: https://habr.com/ru/post/123395/
