Using the open source Tesseract OCR library on Android: a simple example application

Today I will show how to add a text recognition (OCR) option to your Android app.

Our test project is a single Activity, into which I also put the recognition code. In total, it is only about 200 lines of code.
An important feature: the OCR works offline. It increases the size of your .apk by approximately 17 MB.

Tesseract is perhaps the most popular and highest-quality free OCR library, and it is still being updated. It was originally created at Hewlett-Packard, and its development was later sponsored by Google.

Tesseract is written in C and C++, but there is a project called tess-two, a ready-made tool for using Tesseract on the Android platform. It provides a Java API for accessing the natively compiled Tesseract code. All you need to do is add tess-two to your build.gradle:

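    dependencies {
        // 'implementation' replaces the deprecated 'compile' configuration in current Gradle versions
        implementation 'com.rmtheis:tess-two:5.4.1'
    }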

In addition, Tesseract requires a .traineddata file. This file contains the data needed for effective recognition, word dictionaries, and more. There is a separate file for each language. You can download the .traineddata file for any language at the link. Note that it is also possible to create your own .traineddata file. This can be useful if you are recognizing a specific kind of text or have your own vocabulary of possible words. In theory, such customization should improve recognition quality.

Before moving on to the Java code, make sure you have added the English data file, eng.traineddata, to the project, for example in src\main\assets\tessdata.

You need to configure Tesseract before running recognition. To do this, pass two parameters to the initialization method (init): the path to the directory that contains the tessdata folder on your Android device, and the language ("eng"). Be careful: the path must point to the directory that holds tessdata, not to the .traineddata file itself, and if the folder is named anything other than tessdata, the code will not work. So you need to create this folder on external storage and place eng.traineddata in it.
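The folder setup described above can be sketched as a small helper. This is a minimal sketch, not part of the original project: the class name TessDataInstaller and the method install are hypothetical, and it assumes the caller supplies an open stream with the trained data. On Android that stream would typically come from the app's assets.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper that prepares dataPath/tessdata/LANG.traineddata before init() is called. */
final class TessDataInstaller {

    private TessDataInstaller() {}

    /**
     * Copies the trained data into dataPath/tessdata/ unless the file is already there.
     * Assumption: on Android the stream would be obtained with something like
     * context.getAssets().open("tessdata/" + lang + ".traineddata").
     */
    static File install(InputStream trainedData, File dataPath, String lang) throws IOException {
        // The subfolder must be named exactly "tessdata".
        File tessdataDir = new File(dataPath, "tessdata");
        if (!tessdataDir.isDirectory() && !tessdataDir.mkdirs()) {
            throw new IOException("Cannot create " + tessdataDir);
        }
        File target = new File(tessdataDir, lang + ".traineddata");
        if (target.exists()) {
            return target; // already installed, nothing to copy
        }
        try (OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = trainedData.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        return target;
    }
}
```

The directory passed as dataPath here is the same value that would later be handed to init, since it is the parent of the tessdata folder.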

Here is the method that extracts the recognized text from a Bitmap:

    import com.googlecode.tesseract.android.TessBaseAPI;
    //...
    private String extractText(Bitmap bitmap) throws Exception {
        TessBaseAPI tessBaseApi = new TessBaseAPI();
        // DATA_PATH is the directory that contains the tessdata folder
        tessBaseApi.init(DATA_PATH, "eng");
        tessBaseApi.setImage(bitmap);
        String extractedText = tessBaseApi.getUTF8Text();
        // Release the native resources held by Tesseract
        tessBaseApi.end();
        return extractedText;
    }

Yes, it is that simple.

Result

Recommendations

1. It is better to run OCR on the server side. If you have a Java project, use tess4j, a JNA wrapper for Tesseract. Recognition quality is 20-30% higher, the battery is not drained, and the .apk does not grow.

2. Preprocess the image before recognition. The easiest way is to have the user select the block containing the text, which reduces the recognition area. Preprocessing may also include distortion correction, noise removal, and color correction.
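As a rough illustration of the second recommendation, here is a minimal sketch of two preprocessing steps on a raw ARGB pixel array: cropping to a user-selected rectangle, and binarization with a fixed brightness threshold. Binarization is not named in the list above; it is used here as one simple, assumed form of color correction. The class OcrPreprocess and its methods are hypothetical names. On Android the pixel array could be read with Bitmap.getPixels and turned back into a Bitmap with Bitmap.createBitmap.

```java
/** Hypothetical preprocessing helpers that work on row-major ARGB pixel arrays. */
final class OcrPreprocess {

    private OcrPreprocess() {}

    /** Returns the pixels of the rectangle (left, top, cropW, cropH) cut from an image of the given width. */
    static int[] crop(int[] pixels, int width, int left, int top, int cropW, int cropH) {
        int height = pixels.length / width;
        if (left < 0 || top < 0 || cropW <= 0 || cropH <= 0
                || left + cropW > width || top + cropH > height) {
            throw new IllegalArgumentException("Crop rectangle is outside the image");
        }
        int[] out = new int[cropW * cropH];
        for (int row = 0; row < cropH; row++) {
            // Copy one row of the selected rectangle at a time.
            System.arraycopy(pixels, (top + row) * width + left, out, row * cropW, cropW);
        }
        return out;
    }

    /** Maps every pixel to opaque white or opaque black using a fixed luminance threshold (0..255). */
    static int[] binarize(int[] pixels, int threshold) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            int p = pixels[i];
            int r = (p >> 16) & 0xFF;
            int g = (p >> 8) & 0xFF;
            int b = p & 0xFF;
            // Integer approximation of perceived brightness.
            int luma = (299 * r + 587 * g + 114 * b) / 1000;
            out[i] = luma >= threshold ? 0xFFFFFFFF : 0xFF000000;
        }
        return out;
    }
}
```

A fixed threshold is the simplest choice and tends to work only on evenly lit images; for photos with shadows an adaptive threshold would likely do better, but that is outside the scope of this sketch.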

The source code is available here.

That's all.

Source: https://habr.com/ru/post/282582/
