The difference in the approaches of code analysis by the compiler and the selected tool

The compiler and third-party static code analysis tools have a common task - identifying dangerous code fragments. However, there is a significant difference in what type of analysis they perform. I will use the example of the Intel C ++ compiler and the PVS-Studio analyzer to demonstrate the differences in approaches and explain what caused them.

This time the test project will be the project Notepad ++ version 5.8.2.

Notepad ++

First about the selected project. Notepad ++ is a free and free source code editor that supports a large number of languages, as well as a replacement for the standard Notepad. It runs on Microsoft Windows and is licensed under the GPL. I liked the project because it is written in C ++ and has a small size - 73000 lines of code. And most importantly, this is a fairly neat project, as indicated by the / W4 key in the project settings and the / WX key, which makes treat each warning as an error.
')

Static analysis in the compiler

Now consider the analysis of the project from the point of view of the compiler and a separate specialized tool. The compiler always gives warnings that can be issued by processing only a very small local code fragment. This preference is due to the fact that the compiler has very strict performance requirements. It is not by chance that there are tools for distributed project assembly. Compile time for medium and large projects is a significant factor influencing the choice of development methods. Therefore, if it will be possible to squeeze 5% of the performance gain of the compiler, it will be implemented.

Such optimization makes the compiler more monolithic, and in fact there are hardly any clearly distinguished steps such as preprocessing, building AST , code generation. For example, by indirect evidence, I can argue that the algorithm of a preprocessor that uses Visual C ++ when compiling and when generating preprocessing "* .i" files is different. Also, the compiler does not need and even harmful to keep the AST tree entirely. Once the code for certain nodes has been generated and these nodes are no longer needed, they will be immediately destroyed. In the process of compiling AST may never exist entirely. It is not necessary. Disassembled a piece of code, generated the code, went further. This saves the amount of used memory and cache, and, consequently, increases the speed.

The result of this approach is the “locality” of warnings. The compiler deliberately saves on various structures that could help detect higher level errors. Let's look in practice, what local warnings will be issued by Intel C ++ for the Notepad ++ project. Let me remind you that the Notepad ++ project is built by the Visual C ++ compiler without warnings with the / W4 key. The Intel C ++ compiler will naturally have a different set of warnings, and I also set the / W5 specific key [Intel C ++]. Plus, I'll take a look at what the Intel C ++ compiler calls “remark”.

Let's see what kind of messages we received from Intel C ++. Here he found four errors of the same type when working with the CharUpper function (SEE NOTE IN THE END OF THE ARTICLE). Pay attention to the "locality" of diagnosing this error - just found extremely dangerous type conversion. Consider the corresponding code snippet:

  wchar_t * destStr = new wchar_t [len + 1];
 ...
 for (int j = 0; j <nbChar; j ++)
 {
   if (Case == UPPERCASE)
     destStr [j] =
       (wchar_t) :: CharUpperW ((LPWSTR) destStr [j]);
   else
     destStr [j] =
       (wchar_t) :: CharLowerW ((LPWSTR) destStr [j]);
 }

There are strange type conversions. The Intel C ++ compiler reports: "# 810: conversion from" LPWSTR = {WCHAR = {__ wchar_t} *} "to" __wchar_t "may lose significant bits". Let's look at the prototype of the CharUpper function.

  LPTSTR WINAPI CharUpper (
   __inout LPTSTR lpsz
 );

The function works with a string, and not with individual characters at all. Here, the symbol is converted to a pointer, and a certain area of memory is modified by this pointer. Horror.

But, truth, the horrors discovered by Intel C ++ are probably ending there. Everything else is much duller and more likely to be a messy code than a code that will cause an error. But still consider some other warnings.

Issued a large number of messages # 1125:

"# 1125: function" Window :: init (HINSTANCE, HWND) "is hidden by" TabBarPlus :: init "- virtual function override intended?"

These are not errors, but unsuccessful naming of functions. We are interested in this message to others. Although it seems to involve several classes for verification, special data is not remembered here. The compiler still needs to store various information about the base classes. That is why this diagnosis is implemented.

The following example. The message "# 186: pointless comparison of unsigned integer with zero" is issued for meaningless comparisons:

  static LRESULT CALLBACK hookProcMouse (
   UINT nCode, WPARAM wParam, LPARAM lParam)
 {
   if (nCode <0)
   {
     ...
     return 0;
   }
 ...
 }

The condition "nCode <0" is always false. A good example of a good local diagnosis. So it is quite possible to detect an error.

Consider the latest Intel C ++ warning and that's enough. I think the idea of "locality" is already clear.

  void ScintillaKeyMap :: showCurrentSettings () {
   int i = :: SendDlgItemMessage (...);
   ...
   for (size_t i = 0; i <nrKeys; i ++)
   {
     ...
   }
 }

There is no error again, just not very successful variable naming. Initially, the variable "i" has the type "int". Then in the “for ()” operator a new variable “i” of the type “size_t” is declared and used for other purposes. The compiler at the time of the declaration “size_t i” knows that there is already a variable with that name and issues a warning. Again, this did not require the compiler to store any additional data. He should still remember that until the end of the function body, the variable “int i” is available.

External static code analyzers

Let's pass to specialized tools of static code analysis. They no longer have such severe restrictions on speed, since the frequency of their launch is an order of magnitude lower than that of the compiler. The speed of their work may well be ten times slower than compiling code. This is not critical. For example, a programmer works with a compiler during the day. At night, a static code analyzer is launched and in the morning the programmer receives a report on suspicious places. It is a reasonable approach.

Paying by delays, static code analysis tools can afford to store the entire code tree, run it several times, and store a lot of additional information. This allows them to find “blurred” and high-level errors.

Let's see what the interesting PVS-Studio static analyzer finds in Notepad ++. I note that I use the experimental version, which is not yet available for download. We will present a free set of general-purpose rules in 1-2 months in PVS-Studio version 4.00.

Naturally, like Intel C ++, the PVS-Studio analyzer finds errors that can be classified as “local”. First example:

  bool _isPointXValid;
 bool _isPointYValid;
 bool isPointValid () {
   return _isPointXValid && _isPointXValid;
 };

The PVS-Studio analyzer reports: "V501: there are operator: _isPointXValid && _isPointXValid."

I think the essence of the error is obvious, and we will not dwell on it in more detail. Diagnostics is “local”, since testing can be done in the process of analyzing a single expression.

Another local error leading to incomplete cleaning of the _iContMap array:

  #define CONT_MAP_MAX 50
 int _iContMap [CONT_MAP_MAX];
 ...
 DockingManager :: DockingManager ()
 {
   ...
   memset (_iContMap, -1, CONT_MAP_MAX);
   ...
 }

A warning is issued "V512: A call of the memset function will lead to a buffer overflow or underflow." The correct code is:

  memset (_iContMap, -1, CONT_MAP_MAX * sizeof (int));

And now let's move on to more interesting things. The code where you need to simultaneously analyze two branches at once to be able to suspect that something was wrong:

  void TabBarPlus :: drawItem (
   DRAWITEMSTRUCT * pDrawItemStruct)
 {
 ...

   if (! _isVertical)
     Flags | = DT_BOTTOM;
   else
     Flags | = DT_BOTTOM;
 ...
 }

PVS-Studio reports here: "V523: The 'then' statement is equivalent to the 'else' statement". If you look at the code next door, you can conclude that the author actually wanted to write this:

  if (! _isVertical)
   Flags | = DT_VCENTER;
 else
   Flags | = DT_BOTTOM;

And now have courage. You will test in the form of the following code snippet:

  void KeyWordsStyleDialog :: updateDlg () 
 {
   ...
   Style & w1Style =
     _pUserLang -> _ styleArray.getStyler (STYLE_WORD1_INDEX);
   styleUpdate (w1Style, _pFgColour [0], _pBgColour [0],
     IDC_KEYWORD1_FONT_COMBO, IDC_KEYWORD1_FONTSIZE_COMBO,
     IDC_KEYWORD1_BOLD_CHECK, IDC_KEYWORD1_ITALIC_CHECK,
     IDC_KEYWORD1_UNDERLINE_CHECK);

   Style & w2Style =
     _pUserLang -> _ styleArray.getStyler (STYLE_WORD2_INDEX);
   styleUpdate (w2Style, _pFgColour [1], _pBgColour [1],
     IDC_KEYWORD2_FONT_COMBO, IDC_KEYWORD2_FONTSIZE_COMBO,
     IDC_KEYWORD2_BOLD_CHECK, IDC_KEYWORD2_ITALIC_CHECK,
     IDC_KEYWORD2_UNDERLINE_CHECK);

   Style & w3Style =
     _pUserLang -> _ styleArray.getStyler (STYLE_WORD3_INDEX);
   styleUpdate (w3Style, _pFgColour [2], _pBgColour [2],
     IDC_KEYWORD3_FONT_COMBO, IDC_KEYWORD3_FONTSIZE_COMBO,
     IDC_KEYWORD3_BOLD_CHECK, IDC_KEYWORD3_BOLD_CHECK,
     IDC_KEYWORD3_UNDERLINE_CHECK);

   Style & w4Style =
     _pUserLang -> _ styleArray.getStyler (STYLE_WORD4_INDEX);
   styleUpdate (w4Style, _pFgColour [3], _pBgColour [3],
     IDC_KEYWORD4_FONT_COMBO, IDC_KEYWORD4_FONTSIZE_COMBO,
     IDC_KEYWORD4_BOLD_CHECK, IDC_KEYWORD4_ITALIC_CHECK,
     IDC_KEYWORD4_UNDERLINE_CHECK);
   ...
 }

One can say that I am proud of our PVS-Studio analyzer, which could find an error here. I think you hardly noticed it and just skip the entire code fragment, waiting for an explanation. This code is practically beyond the scope of review (code review). But the static analyzer is patient and pedantic: “V525: The code containing the collection of similar blocks. Check items '7', '7', '6', '7' in lines 576, 580, 584, 588 ".

I will shorten the text to highlight the interesting:

  styleUpdate (...
   IDC_KEYWORD1_BOLD_CHECK, IDC_KEYWORD1_ITALIC_CHECK,
   ...);
 styleUpdate (...
   IDC_KEYWORD2_BOLD_CHECK, IDC_KEYWORD2_ITALIC_CHECK,
   ...);
 styleUpdate (...
   IDC_KEYWORD3_BOLD_CHECK, !!  IDC_KEYWORD3_BOLD_CHECK !!,
   ...);
 styleUpdate (...
   IDC_KEYWORD4_BOLD_CHECK, IDC_KEYWORD4_ITALIC_CHECK,

The code was most likely written by the Copy-Paste technique. The result was the use of IDC_KEYWORD3_BOLD_CHECK instead of IDC_KEYWORD3_ITALIC_CHECK. The warning looks somewhat strange when it comes to the numbers '7', '7', '6', '7'. Unfortunately, it is difficult to give a clearer message. These numbers are taken from the macros of the form:

  #define IDC_KEYWORD1_ITALIC_CHECK (IDC_KEYWORD1 + 7)
 #define IDC_KEYWORD3_BOLD_CHECK (IDC_KEYWORD3 + 6)

The last example given is indicative of the fact that the PVS-Studio analyzer processed a whole large portion of code at the same time, revealed repetitive structures in it and was able to suspect that something was wrong based on heuristics. This is a very significant difference in the level of information processing.

Some numbers

We describe one more consequence of the “local” analysis of compilers and more global in specialized tools. With “local analysis” it is difficult to clarify whether this situation is really dangerous or not. The result is an order of magnitude greater number of false positives. Let me explain by example.

When analyzing the Notepad ++ project, the PVS-Studio tool issued only 10 warnings. Of these, 4 messages indicated real errors. The result is modest, but the static general-purpose analysis in PVS-Studio has just begun its development. In time, he will become one of the best.

When analyzing the Notepad ++ project, the Intel C ++ compiler issued 439 warnings and 3139 remarks. I do not know how many of them really point to real mistakes. From what I had the strength to look at, I saw only 4 true errors related to CharUpper (see description above).

3578 messages are too many to study carefully each of them. It turns out that the compiler offers to pay attention to every twentieth line in the program (73000/3578 = 20). This is not serious. In the case of a general purpose analyzer, if possible, cut off all unnecessary.

Those who have tried the Viva64 rule set (included in PVS-Studio) may notice that it produces the same huge percentage of false positives. But there is a different situation. There you need to be able to identify all suspicious type conversions. It is more important not to miss the error than not to give a false message. Especially false messages are well filtered using settings.

- UPDATE:

It turns out I wrote a lie here. In the example with CharUpperW there is no error. And unfortunately nobody corrected me. I noticed it myself when I decided to implement a similar rule in PVS-Studio.

The fact is that CharUpperW can work both with a string and with individual characters. If the upper part of the pointer is zero, then it is considered that this is not a pointer, but a symbol. The WIN API in this place of course is saddened by its curvature, but the code in Notepad ++ is written correctly.

By the way, now it turns out that Intel C ++ did not find any errors at all.

Source: https://habr.com/ru/post/107389/

All Articles

The difference in the approaches of code analysis by the compiler and the selected tool

Notepad ++

Static analysis in the compiler

External static code analyzers

Some numbers

More articles: