
Tesseract is a free text recognition computer program developed by Google. The description of the project says: "OCR engine is available." And let's try, whether the PVS-Studio static analyzer will be able to recognize any errors in this project.
Tesseract
Tesseract is a free text-recognition computer program developed by Hewlett-Packard from the mid-1980s to the mid-1990s, and then “lain on the shelf” for 10 years. In August 2006, Google bought it and opened the source code under the Apache 2.0 license to continue development. At the moment, the program is already working with UTF-8, language support (including Russian) is carried out using additional modules. [
description taken from Wikipedia]
The source code for the project is available on Google Code:
https://code.google.com/p/tesseract-ocr/The source code is about 16 megabytes.
')
Test results
I will cite the code snippets I drew attention to when looking at the PVS-Studio report. Maybe I missed something. Therefore, the creators of Tesseract should be tested independently. The trial version is active for 7 days, which is more than enough for such a small project. Well, and then they decide whether they want to use the tool regularly and find typos or not.
As always, remember.
The essence of static analysis is not in one-time checks, but in regular use.
Bad division
void LanguageModel::FillConsistencyInfo(....) { .... float gap_ratio = expected_gap / actual_gap; if (gap_ratio < 1/2 || gap_ratio > 2) { consistency_info->num_inconsistent_spaces++; .... }
PVS-Studio warnings: V636 The '1/2' expression was implicitly casted from 'int' type to 'float' type. Consider using a fractional part. An example: double A = (double) (X) / Y ;. language_model.cpp 1163
Variable 'gap_ratio' want to compare with the value of 0.5. Unfortunately, the chosen bad way to write 0.5. The division 1/2 is integer and results in 0.
The correct code should be:
if (gap_ratio < 1.0f/2 || gap_ratio > 2) {
Or so:
if (gap_ratio < 0.5f || gap_ratio > 2) {
There are other places where suspicious integer division is performed. Perhaps among them there are really unpleasant mistakes.
Code snippets worth checking out:
- baselinedetect.cpp 110
- bmp_8.cpp 983
- cjkpitch.cpp 553
- cjkpitch.cpp 564
- mfoutline.cpp 392
- mfoutline.cpp 393
- normalis.cpp 454
Typo in comparison
uintmax_t streamtoumax(FILE* s, int base) { int d, c = 0; .... c = fgetc(s); if (c == 'x' && c == 'X') c = fgetc(s); .... }
PVS-Studio warning: V547 Expression 'c ==' x '&& c ==' X '' is always false. Probably the '||' operator should be used here. scanutils.cpp 135
The correct verification option is:
if (c == 'x' || c == 'X') c = fgetc(s);
Undefined behavior
Found one interesting design. I have not seen this yet:
void TabVector::Evaluate(....) { .... int num_deleted_boxes = 0; .... ++num_deleted_boxes = true; .... }
PVS-Studio warning: V567 Undefined behavior. The 'num_deleted_boxes' variable is modified while being used between sequence points. tabvector.cpp 735
It is not clear what the author wanted to say with this code. Most likely this code is the result of a typo.
The result of the expression is impossible to predict. The variable 'num_deleted_boxes' can be increased both before and after assignment. The reason - the variable changes twice at the same
point .
The remaining errors leading to undefined behavior are associated with the
use of shifts . Consider an example:
void Dawg::init(....) { .... letter_mask_ = ~(~0 << flag_start_bit_); .... }
Warning V610 Undefined behavior. Check the shift operator '<<. The left operand '~ 0' is negative. dawg.cpp 187
The expression '~ 0' has the type 'int' and is equal to the value '-1'. Shifts in negative values ​​result in undefined behavior. That the program can work correctly is luck and not more than that. You can fix the flaw by making '0' unsigned:
letter_mask_ = ~(~0u << flag_start_bit_);
However, this is not all. The analyzer generates another warning on this line:
V629 Consider inspecting the '~ 0 << flag_start_bit_' expression. Bit shifting of the 32-bit type. dawg.cpp 187
The point is that the variable 'letter_mask_' is of the type 'uinT64'. As I understand it, it may be necessary to write units to the upper 32 bits. In this case, the generated expression is incorrect. It only works with low bits.
It is necessary to make so that "0" was 64-bit type. Corrected version:
letter_mask_ = ~(~0ull << flag_start_bit_);
I will list the list of other code fragments where negative numbers are shifted:
- dawg.cpp 188
- intmatcher.cpp 172
- intmatcher.cpp 174
- intmatcher.cpp 176
- intmatcher.cpp 178
- intmatcher.cpp 180
- intmatcher.cpp 182
- intmatcher.cpp 184
- intmatcher.cpp 186
- intmatcher.cpp 188
- intmatcher.cpp 190
- intmatcher.cpp 192
- intmatcher.cpp 194
- intmatcher.cpp 196
- intmatcher.cpp 198
- intmatcher.cpp 200
- intmatcher.cpp 202
- intmatcher.cpp 323
- intmatcher.cpp 347
- intmatcher.cpp 366
Suspicious double assignment
TESSLINE* ApproximateOutline(....) { EDGEPT *edgept; .... edgept = edgesteps_to_edgepts(c_outline, edgepts); fix2(edgepts, area); edgept = poly2 (edgepts, area);
PVS-Studio warning: V519 The 'edgept' variable is assigned values ​​twice successively. Perhaps this is a mistake. Check lines: 76, 78. polyaprx.cpp 78
Another similar case:
inT32 row_words2(....) { .... this_valid = blob_box.width () >= min_width; this_valid = TRUE; .... }
PVS-Studio warning: V519 The 'this_valid' variable is assigned values ​​twice successively. Perhaps this is a mistake. Check lines: 396, 397. wordseg.cpp 397
Incorrect initialization sequence for class members
In the beginning, consider the class 'MasterTrainer'. Note that the member of the 'samples_' class is located before the 'fontinfo_table_' member:
class MasterTrainer { .... TrainingSampleSet samples_; .... FontInfoTable fontinfo_table_; .... };
According to the standard, the order of initialization of class members in the constructor occurs in the order they are declared in the class. This means that 'samples_' will be initialized BEFORE initializing 'fontinfo_table_'.
Now consider the constructor:
MasterTrainer::MasterTrainer(NormalizationMode norm_mode, bool shape_analysis, bool replicate_samples, int debug_level) : norm_mode_(norm_mode), samples_(fontinfo_table_), junk_samples_(fontinfo_table_), verify_samples_(fontinfo_table_), charsetsize_(0), enable_shape_anaylsis_(shape_analysis), enable_replication_(replicate_samples), fragments_(NULL), prev_unichar_id_(-1), debug_level_(debug_level) { }
The trouble is that for initialization of 'samples_' the still uninitialized variable 'fontinfo_table_' is used.
The situation is similar in this class with the initialization of the 'junk_samples_' and 'verify_samples_' fields.
I do not venture to say how best to deal with this class. It may be enough to move the declaration 'fontinfo_table_' to the very beginning of the class.
A typo in the condition
A typo is not easy to notice, but the analyzer does not know fatigue.
class ScriptDetector { .... int korean_id_; int japanese_id_; int katakana_id_; int hiragana_id_; int han_id_; int hangul_id_; int latin_id_; int fraktur_id_; .... }; void ScriptDetector::detect_blob(BLOB_CHOICE_LIST* scores) { .... if (prev_id == katakana_id_) osr_->scripts_na[i][japanese_id_] += 1.0; if (prev_id == hiragana_id_) osr_->scripts_na[i][japanese_id_] += 1.0; if (prev_id == hangul_id_) osr_->scripts_na[i][korean_id_] += 1.0; if (prev_id == han_id_) osr_->scripts_na[i][korean_id_] += kHanRatioInKorean; if (prev_id == han_id_) <<<<==== osr_->scripts_na[i][japanese_id_] += kHanRatioInJapanese; .... }
PVS-Studio warning: V581 The conditional expressions of the 'if' are agreed alongside each other are identical. Check lines: 551, 553. osdetect.cpp 553
Most likely, the most recent comparison should be:
if (prev_id == japanese_id_)
Unnecessary checks
It is not necessary to check what the 'new' operator returns. If you fail to allocate memory, an exception will be raised. Of course, you can make a special operator 'new', which returns null pointers, but this is a separate case (
details ).
As a result, this function can be simplified:
void SetLabel(char_32 label) { if (label32_ != NULL) { delete []label32_; } label32_ = new char_32[2]; if (label32_ != NULL) { label32_[0] = label; label32_[1] = 0; } }
PVS-Studio warning: V668 against 32 32 32 32 32 32 32 32 32 32 32 32 32. The exception will be generated in the case of memory allocation error. char_samp.h 73
There is another
101 place where the pointer is checked, which returned the 'new' operator. I see no point in listing them in the article. It's easier to start PVS-Studio for this.
Conclusion
Use static analysis regularly, and you will save a lot of time on solving more useful tasks than catching silly mistakes and typos.
And don't forget to follow me on Twitter:
@Code_Analysis . I regularly publish links to interesting articles on C ++ topics.