It so happened that we needed to embed text recognition in our application, so we began searching for a suitable library. In the end, we narrowed the choice down to two open source projects,
CuneiForm Linux and
Tesseract-ocr. A closer look at CuneiForm showed that it is simply a port of the Cognitive Technologies product, whose source code the company opened in 2008 and then, having received its share of attention, apparently abandoned (at least, that was our impression). The whole project amounted to porting, with no talk of new features at all. All this, coupled with the
sad news on the project page, made us drop CuneiForm in favor of Tesseract, which is currently owned by Google, and that gives some confidence in the project's future. What follows is our experience of building Tesseract-ocr on Windows with MinGW and then writing the simplest possible C++ application.
Preparation
I will try to describe everything needed to build Tesseract with minimal headache, without belaboring the obvious.
Install and configure MinGW
Download and run the latest installer from the
official site of the project, and do not forget to tick C++ Compiler and MSYS Basic System. Then open the MinGW Shell and install the additional packages we will need later with the following command:
mingw-get install mingw32-automake mingw32-autoconf mingw32-autotools mingw32-libz
Note right away that the directory where MinGW is installed is mounted at
/mingw; this will come in handy when building the libraries.
Installing the Leptonica library
Tesseract-ocr uses the
Leptonica library for working with images. I will describe how to build and install it from the source code, which can be downloaded from the
official site, but first we need to install the libraries
libJpeg,
libPng and
libTiff, which Leptonica uses in turn (these are also built from source).
Build libJpeg
Download the source archive from the
official site and unpack it into a separate directory (for simplicity, we assume it is
D:\lib\jpeg). Back in the MinGW Shell, we build the library and install it into the directories where gcc looks by default. The flags are overridden to build without debug information.
cd /D/lib/jpeg
./configure CFLAGS='-O2' CXXFLAGS='-O2' --prefix=/mingw
make
make install
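The cd command above uses /D/lib/jpeg for the directory that Windows calls D:\lib\jpeg. As a minimal sketch of that mapping, assuming MSYS exposes each drive under a directory named after its letter (which is what the command implies), the conversion can be written as below. The function name toMsysPath is hypothetical and is not part of MinGW or MSYS.

```cpp
#include <string>

// Hypothetical helper, for illustration only: turns a drive-qualified
// Windows path such as "D:\lib\jpeg" into the form used in the MinGW Shell
// commands of this article, "/D/lib/jpeg".
std::string toMsysPath(const std::string& windowsPath) {
    // Expect at least a drive letter followed by a colon, e.g. "D:".
    if (windowsPath.size() < 2 || windowsPath[1] != ':') {
        return windowsPath;  // not drive-qualified, returned unchanged
    }
    std::string result = "/";
    result += windowsPath[0];  // drive letter, kept exactly as written
    for (std::string::size_type i = 2; i < windowsPath.size(); ++i) {
        const char c = windowsPath[i];
        result += (c == '\\') ? '/' : c;  // backslashes become forward slashes
    }
    return result;
}
```

The same mapping applies to the other directories used later in the article, such as D:\lib\png and D:\lib\tesseract.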
Build libPng
Likewise, download the source archive from the
project page and unpack it into the
D:\lib\png directory (you can, of course, choose another one). Go back to the MinGW Shell and repeat the same steps as for libJpeg.
Build libTiff
Take the source archive from the
recommended ftp and unpack it into
D:\lib\tiff. Then build it the same way as the previous two.
Build Leptonica
We already have the source archive; all that remains is to unpack it into D:\lib\leptonica. Now it is time for a little hand-filing: the build with zlib support fails because of a
small bug, which is nevertheless easy to fix yourself. To do this, open the file
src/pngio.c in the directory where we unpacked the Leptonica sources. Find the line
#include "png.h" and insert the directives below after it, so that the result looks something like this:
#include "png.h"
#ifdef HAVE_LIBZ
#include "zlib.h"
#endif
/* ----------------Set defaults for read/write options ----------------- */
After that, we build Leptonica the same way as the previous libraries.
Building and installing Tesseract-ocr
Now we have all the necessary dependencies. This time we will check out the sources from the developers' svn trunk:
svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
Then we reach for the file again, but first I exported the source code to
D:\lib\tesseract.
All file paths below are relative to the directory containing the Tesseract sources (in my case, as a reminder, it is
D:\lib\tesseract).
- Edit the file ccutil/platform.h. We need to comment out the redeclaration of the BLOB type, which already exists in winsock2.h. The result should look something like this:
/*typedef struct _BLOB {
unsigned int cbSize;
char *pBlobData;
} BLOB, *LPBLOB;*/
- From vs2008/port, copy the strtok_r.h and strtok_r.cpp files into the ccutil directory and add strtok_r.cpp to the libtesseract_ccutil_la_SOURCES variable in ccutil/Makefile.am.
- Comment out the PBLOB class declaration in api/baseapi.h.
- In api/Makefile.am, append the value -I$(top_srcdir)/vs2008/port to the AM_CPPFLAGS variable, or simply copy the file vs2008/port/version.h into the api directory.
- In viewer/Makefile.am, append the value -I$(top_srcdir)/ccutil to the AM_CPPFLAGS variable.
After these changes, return to the MinGW Shell and build the library itself:
cd /D/lib/tesseract
./runautoconf
./configure CFLAGS='-D__MSW32__ -O2' CXXFLAGS='-D__MSW32__ -O2' LIBS='-lws2_32' LIBLEPT_HEADERSDIR='/mingw/include' --prefix=/mingw
make
make install
While it was building, I had time for a cup of tea, and afterwards I found a pile of header files in
/mingw/include/tesseract. The
Leptonica header files ended up in
/mingw/include/leptonica, and all the libraries, naturally, in
/mingw/lib.
Simple application
I will give the whole code, as it is very small:
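#include <stdio.h>
#include <string.h>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main(int argc, char* argv[]) {
    tesseract::TessBaseAPI tessApi;
    // "data" is the path used to find the *.traineddata files,
    // "rus" is the recognition language
    if (tessApi.Init("data", "rus") != 0) {
        fprintf(stderr, "Could not initialize tesseract\n");
        return 1;
    }
    if (argc > 1) {
        // read the image with Leptonica
        PIX *pix = pixRead(argv[1]);
        if (pix == NULL) {
            fprintf(stderr, "Could not read %s\n", argv[1]);
            return 1;
        }
        tessApi.SetImage(pix);               // pass the image to tesseract
        char *text = tessApi.GetUTF8Text();  // get the recognized text as UTF-8
        //--- output file name: the input name with its extension replaced by .txt
        char *fileName = NULL;
        long prefixLength;
        const char* lastDotPosition = strrchr(argv[1], '.');
        if (lastDotPosition != NULL) {
            prefixLength = lastDotPosition - argv[1];
            fileName = new char[prefixLength + 5];  // prefix + ".txt" + '\0'
            strncpy(fileName, argv[1], prefixLength);
            strcpy(fileName + prefixLength, ".txt");
        } else {
            return 1;  // input name has no extension
        }
        //--- write the text to the output file
        FILE *outF = fopen(fileName, "w");
        if (outF != NULL) {
            fprintf(outF, "%s", text);
            fclose(outF);
        }
        //--- release resources
        pixDestroy(&pix);
        delete [] fileName;
        delete [] text;
    }
    return 0;
}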
You can build our application with the command:
g++ -O2 test.cpp -o test.exe -ltesseract_api -ltesseract_main -ltesseract_textord -ltesseract_wordrec -ltesseract_ccstruct -ltesseract_ccutil -ltesseract_classify -ltesseract_dict -ltesseract_image -ltesseract_viewer -ltesseract_cutil -ltesseract_cube -ltesseract_neural -llept -lws2_32
Linking is static because the current version of tesseract does not support building a DLL.
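The middle part of the listing derives the output file name by cutting the input name at its last dot and appending .txt. As a minimal sketch of the same idea, the manual buffer arithmetic can be replaced with std::string. The helper name makeOutputName is hypothetical and is not part of Tesseract or Leptonica; returning an empty string for a name without a dot is my assumption, standing in for the early exit in the listing.

```cpp
#include <string>

// Hypothetical helper, for illustration only: returns the input path with
// everything from its last dot onward replaced by ".txt". When the path
// contains no dot, an empty string is returned so the caller can bail out.
std::string makeOutputName(const std::string& inputPath) {
    const std::string::size_type lastDot = inputPath.rfind('.');
    if (lastDot == std::string::npos) {
        return std::string();  // no extension to replace
    }
    return inputPath.substr(0, lastDot) + ".txt";
}
```

With such a helper, the caller would pass makeOutputName(argv[1]).c_str() to fopen and would no longer need new[] and delete[] for the file name. Like the strrchr version, it looks at the last dot in the whole path, so a dot in a directory name is treated the same way as in the original code.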
Conclusion
I know from my own experience that getting started is the hardest part, so I hope my story will be useful to someone, especially since Tesseract's online documentation is criminally sparse and most of it can perhaps only be extracted from the sources themselves with doxygen.
P.S. Some ideas for the fixes were taken from
this post, for which many thanks to its author.