KBookOCR for Linux. FineReader Killer for Linux in its initial stages

Introduction

Perhaps each of us experienced a period in his life which was accompanied by the active digitization of analogs of the material. I mean the need to work with text from non-digitized sources. This refers not only to the problem of scanning, but also a lot of material that unfortunately reaches the end user is not quite in a usable form. And I think each of us very often ran over in the head flattering thoughts about the book distributor in djvu or pdf format in which all the content was presented graphically without the possibility of using materials for its activities.

For Windows users, there is the option of using FineReader, which easily implemented the recognition process with all the consequences.

Linux - problem solving

But what about people who are able to use more advanced operating systems while keeping their finances at an acceptable level? Of course, there are projects of console utilities for text recognition. On the basis of one of the most advanced open technologies OCR, we created a distribution kit for server deployment for OCR with a web interface for communicating with this server itself. But I do not think that such monstrous decisions are interesting to the end user. And the technology itself is implemented in many distributions in the form of a console application that can operate not with popular formats, of which the text (djvu, pdf) most often needs to be “ripped out”, but graphic files, which complicates the process of use.
')
Of course, this state of affairs and the love of Linuxukids for optimizing everything and everything led to the emergence of the BookOCR project, whose founders and programmer are the remarkable man mr-protos , who is not yet in Habré. Further his article about the creation of BookOCR:

BookOCR

mr-protos created a moderately simple bash-script bookocr.sh:
bookocr.tar.xz (posted on dropbox)

Algorithm of his work:

1. checking the file extension (.djvu or .pdf. In the case of a different extension, the script will issue a warning);
2. paginated file conversion to .png for further recognition. (the result is stored in a temporary folder ~ / .tmp_pdf or ~ / .tmp_djvu);
3. recognition of converted pages using OCR;
4. merging page-recognized text files into one;
5. delete the temporary folder.

Script usage:

bookocr.sh <path_to_pdf_or_djvu>
Note: the finished file is created in the same directory as the source one.
To run the script, the following packages must be installed on the system:

cuneiform
ghostscript
djvulibre-bin
libtiff-tools
libnotify-bin

The quality of the recognized text depends primarily on the quality of the original file and on the work of the cuneiform package.

KBookOCR

Of course, this project was the impetus for another ambitious idea, which, together with the author of BookOCR, brought to life your humble servant b0noI . The idea was to implement a system suitable for use by visual aesthetes in all those who prefer a visually beautiful design (this is at least), and at the most create a project based on Linux that allows you to perform FineReader functionality in an equally convenient and aesthetically beautiful way.

The Qt library was chosen for development. On the one hand, this project is an add-on to the BookOCR project, but everything is not so simple. Since the integration had to make significant changes to the original script. There were particular problems with the implementation of the preview djvu files, since if the project poppler exists for pdf, then in the above cases the preview had to be implemented by a third-party bash utility. That is why not only BookOCR is installed into the system during the installation of KbookOCR, but also through the KbookOCR itself, which is also used to obtain the image used in the preview.

Current project status

Already, the project has reached the stage of the finished first version and is actively undergoing public testing (download for Ubuntu deb x86 ). What can the first public and open-sore killer FineReader ?:

preview the document you need to recognize (scroll through the pages);
specify the recognition language. There is currently no language recognition in the document, but this is planned. It is also not possible to specify a double document recognition language (except for rus / eng);
resize previews. Two options are available - original size or reduced;
you can recognize by the specified range or the entire document;
save recognized document. There are two options available - either save the result to a plain text file, or open the result in OpenOffice Writer.

Roadmap

In the next version, the release date of which, unfortunately, is not known, it is planned to be implemented and added:

work with the scanner;
autodetection of language in the document;
more flexible preview. with drawing of thumbnails of pages, as well as with a more flexible indication of the scale of display;
more flexible indication of range recognition.

In a very remote perspective, the options for specifying recognition zones, zone types, as well as not only text recognition but also document formatting in accordance with the original are considered.

Afterword

And although KbookOCR is the most recent creation of our duet, the program is not our first and only creation. In the next series, we will tell you about our first joint Linux project - KbashPod for podkastofilov.

UPD:

Upgrade to version 1.2:

Scanner support (via scanimage);
Outputting the result in html, rtf (via cuneiform) format;
Text formatting processing (via cuneiform);
Dynamic zoom scale preview.

KBookOCR for Linux. FineReader Killer for Linux in its initial stages

Introduction

Linux - problem solving

BookOCR

Algorithm of his work:

Script usage:

KBookOCR

Current project status

Roadmap

Afterword

UPD:

Links

BookOCR

KBookOCR 1.2

The authors

More articles: