
One interesting bug in Lucene.Net


Some programmers, on hearing about static analysis, say they do not need it: their code is fully covered by unit tests, and that is enough to catch every error. I came across a bug that unit tests could find in theory, but unless you already know it is there, writing such a test is almost impossible.

Introduction


Lucene.Net is a popular full-text search library ported from Java to C#. The source code is open and available on the project website (https://lucenenet.apache.org/).

Because this project develops slowly, contains relatively little code, and is used for full-text search in many other projects, our analyzer found only five suspicious places in it [1]. I did not expect more. But one of these warnings struck me as particularly interesting, and I decided to tell the readers of our blog about it.

About the error found


We have the V3035 diagnostic, which detects cases where =+ is written by mistake instead of +=, so that the + becomes a unary plus. When I implemented it by analogy with the V588 diagnostic for C++, I wondered how anyone could make this mistake in C#. In C++ it is understandable: some people use plain text editors instead of an IDE, where it is easy to make a typo and not notice it. But when you type in Visual Studio, which automatically reformats the code once you enter a semicolon, how can you miss it? It turns out that you can. I found exactly this error in Lucene.Net. It is all the more interesting because it is rather hard to find by any means other than static analysis. Consider the code:
    protected virtual void Substitute( StringBuilder buffer )
    {
      substCount = 0;
      for ( int c = 0; c < buffer.Length; c++ )
      {
        ....
        // Take care that at least one character
        // is left left side from the current one
        if ( c < buffer.Length - 1 )
        {
          // Masking several common character combinations
          // with an token
          if ( ( c < buffer.Length - 2 ) && buffer[c] == 's' &&
               buffer[c + 1] == 'c' && buffer[c + 2] == 'h' )
          {
            buffer[c] = '$';
            buffer.Remove(c + 1, 2);
            substCount =+ 2;
          }
          ....
          else if ( buffer[c] == 's' && buffer[c + 1] == 't' )
          {
            buffer[c] = '!';
            buffer.Remove(c + 1, 1);
            substCount++;
          }
          ....
        }
      }
    }

There is a GermanStemmer class that strips suffixes from German words to extract the common root. It works as follows. First, the Substitute method replaces certain letter combinations with single characters so that they are not mistaken for suffixes: 'sch' becomes '$', 'st' becomes '!', and so on (as the code sample shows). The number of characters by which these substitutions shorten the word is accumulated in the variable substCount. Next, the Strip method cuts off the suffixes, and finally the Resubstitute method performs the reverse replacement: '$' back to 'sch', '!' back to 'st'. For example, for the word kapitalistischen (capitalist), the stemmer works as follows: kapitalistischen => kapitali!i$en (Substitute) => kapitali!i$ (Strip) => kapitalistisch (Resubstitute).
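The round trip for kapitalistischen can be reproduced with a small sketch. This is not the Lucene.Net implementation: the class StemmerSketch is hypothetical, only the 'sch' and 'st' substitutions are modeled, and Strip is replaced by simply dropping the trailing "en".

```csharp
using System;
using System.Text;

public static class StemmerSketch
{
    // Simplified stand-in for Substitute: masks only 'sch' and 'st'.
    public static string Substitute(string word)
    {
        var buffer = new StringBuilder(word);
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's'
                && buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);
            }
            else if (c < buffer.Length - 1 && buffer[c] == 's'
                     && buffer[c + 1] == 't')
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);
            }
        }
        return buffer.ToString();
    }

    // Simplified stand-in for Resubstitute: restores the masked combinations.
    public static string Resubstitute(string word)
    {
        return word.Replace("$", "sch").Replace("!", "st");
    }

    public static void Main()
    {
        string masked = Substitute("kapitalistischen");
        Console.WriteLine(masked);                        // kapitali!i$en

        // Assumption: the real Strip method is replaced by cutting off "en".
        string stripped = masked.Substring(0, masked.Length - 2);
        Console.WriteLine(Resubstitute(stripped));        // kapitalistisch
    }
}
```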

Because of this typo, when 'sch' is replaced with '$', substCount is assigned the value 2 instead of being increased by 2. Such an error is rather hard to find by means other than static analysis. Some developers ask: why do I need a static analyzer if I have unit tests? To catch this error with tests, you would have to test Lucene.Net on German text using GermanStemmer. The tests would have to index a word that contains 'sch' together with another letter combination that gets substituted, and that combination must come before 'sch' in the word, so that substCount is already non-zero when the expression substCount =+ 2 is executed. That is a rather non-trivial combination for a test, especially when you do not know the error is there.
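To make the condition concrete, here is a minimal sketch that counts the removed characters with and without the typo. The class SubstCountSketch and the useTypo flag are hypothetical and exist only for illustration; as before, only 'sch' and 'st' are handled. A word with 'sch' alone gives the same count either way, while a word with 'st' before 'sch' exposes the difference.

```csharp
using System;
using System.Text;

public static class SubstCountSketch
{
    // Counts the characters removed while masking 'sch' and 'st'.
    // useTypo = true reproduces the "=+" assignment described in the article.
    public static int CountRemoved(string word, bool useTypo)
    {
        var buffer = new StringBuilder(word);
        int substCount = 0;
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's'
                && buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);
                if (useTypo)
                    substCount =+ 2;   // parsed as substCount = (+2): overwrites
                else
                    substCount += 2;   // intended: accumulates
            }
            else if (c < buffer.Length - 1 && buffer[c] == 's'
                     && buffer[c + 1] == 't')
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);
                substCount++;
            }
        }
        return substCount;
    }

    public static void Main()
    {
        // Only 'sch': both variants agree, so such a test passes anyway.
        Console.WriteLine(CountRemoved("schule", false));            // 2
        Console.WriteLine(CountRemoved("schule", true));             // 2

        // 'st' before 'sch': the typo discards the earlier increment.
        Console.WriteLine(CountRemoved("kapitalistischen", false));  // 3
        Console.WriteLine(CountRemoved("kapitalistischen", true));   // 2
    }
}
```

A test written around a word like schule would pass with the buggy code, which is why the defect is easy to miss without knowing where to look.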

Conclusion


Unit tests and static analysis are not mutually exclusive but complementary software development techniques [2]. I suggest downloading the PVS-Studio static analyzer, checking your projects, and finding the errors that unit tests did not catch.

Additional links


  1. Andrey Karpov. Why small programs have a low density of errors.
  2. Andrey Karpov. How static analysis complements TDD.


If you want to share this article with an English-speaking audience, please use the link to the translation: Ilya Ivanov. An unusual bug in Lucene.Net.


Source: https://habr.com/ru/post/279221/

