Static analysis and regular expressions

I am developing a PVS-Studio static code analyzer for analyzing C / C ++ programs. After the general-purpose analysis appeared in PVS-Studio 4.00, we received a lot of feedback, both positive and negative. By the way, I suggest downloading a new version of PVS-Studio, in which, thanks to the feedback from people, a large number of errors and shortcomings were corrected.

During the discussion of PVS-Studio 4.00, the question again arose whether it is possible to implement most of the checks using regular expressions, and whether we do not over-complicate by saying that it is necessary to build and work with the parse tree. Here is an example of a comment on this topic. This is not the first time this question has arisen, and I decided to write an article to explain why trying to use regular expressions to analyze C / C ++ code is a very bad idea.

Those who are familiar with the theory of compilation, of course, understand that the C ++ language can be parsed only on the basis of grammars, and not regular expressions. But most programmers with compilation theory are not familiar and continue to repeat about regular expressions for finding errors in the program code.
')

I’ll say right away that you can search for something with regular expressions. And even there are a number of static analyzers working in this way. But their capabilities are very, very limited and come down mainly to messages that use the “strcpy” function and should be replaced with a more secure one.

After thinking about how to talk about the inferiority of the regular expression method, I decided to do it very simply. I will take the ten first general-purpose diagnostic checks implemented in PVS-Studio and show on each one what the limitation will be in the search for the regular expression method.

Diagnostics 0

As soon as I began to describe V501, I immediately remembered that any analysis would do little until #define was disclosed. The error may well hide inside the macro, but from this it will not cease to be an error. Creating a preprocessed file is relatively easy. Imagine that we already have i-files. And now the first difficulty awaits us, since it is necessary to distinguish the code sections related to system files and to user code. If we analyze the library functions of the system, it will significantly reduce the speed of work and give a lot of completely uninteresting diagnostic messages. Thus, it is necessary on the basis of regular expressions ~~and some mother to~~ parse the lines of the form:

  #line 27 "C: \\ Program Files (x86) \\ Microsoft Visual Studio 8 \\ VC \\ atlmfc \\ include \\ afx.h"
 #line 1008 ". \\ mytestfile.cpp"

And to understand what applies to our program, and that to Visual Studio. But that is not all. We must learn to do the relative counting of lines inside the i-files. After all, we need to give not an absolute line number with an error in the preprocessor i-file. We need the line number in our native c / cpp-file, which we analyze.

So, we haven’t got down to the essence of the matter yet, but have already received a lot of difficulties.

Diagnostics 1

V501. There are identical sub-expressions of the foo 'operator.

In order not to clutter the text of the article, I suggest the reader to follow the link and get acquainted with the description of the error and examples. The essence of the diagnosis is to identify the structure of the form:

  if (X> 0 && X> 0)

At first glance, it is quite possible to find constructions with a regular expression, when the same expressions are located to the left and right of the operators &&, ||, ==, and so on. For example, so. We are looking for the operator &&. If to the left and to the right of && there is something identical limited by brackets, then trouble. But it will not work, because you can write this:

  if (A == A && B)

There is an error, but there are different expressions to the left and right of '=='. So it is necessary to introduce the concept of priority operations. And look, if it is '==', then cut off the boundaries for lower priority operations, such as '&&'. And if this is &&, then vice versa you need to capture the operations '==' to detect an error for this case, reaching the bounding brackets:

  if (A == 0 && A == 0)

And so it is necessary to provide logic for all variants of operations with different priorities. Yes, by the way, you can’t rely on brackets either. There may be such cases:

  if ('(' == A && '(' == B)
 b = X> 0 && X> 0;

~~Enchanting hemorrhoids.~~ It is very difficult to consider all options with various regular spells. They will be a lot with a lot of exceptions. And still it will not be reliable, because there will be no feeling that we remembered about all possible designs.

Now compare with the elegance with which I can detect this error, having a syntax tree. If I found the operator &&, ==, || and so on, it only remains for me to compare the left and right branches of the tree. I do it like this:

  if (Equal (left, right))
 {
   // Trouble!
 }

And that's all. No need to think about the priorities of operations. You shouldn't expect a dirty trick to see a bracket in the text: b = '(' == x && x == ')' ;. You can simply compare the left and right branch of the tree.

Diagnostics 2

V502. Perhaps the '?:' Operator was in a different way. The '?:' Operator has a lower limit than the 'foo' operator.

The rule looks for confusion related to the priority of operations (see the error description for details). We need to identify something like:

  int a;
 bool b;
 int c = a + b?  0: 1;

Let us leave aside for the moment the issue of priority operation. With this question when working with regular expressions everything is bad. But even worse is that for this and many other rules one must know the TYPE of the VARIABLE.

It is necessary to display (open) the type of each variable. Need to be able to wade through the jungle typedef. You need to be able to look into the classes in order to understand what vector <int> :: size_type is. You must be able to take into account scopes and different using namespace std ;. And even be able to deduce the type of the variable X from the expression: “auto X = 1 + 2;” in the case of C ++ 0x.

Now the question is how to do it using regular expressions? The answer is no way. Regular expressions are perpendicular to this task. It is necessary either to write a complex type inference mechanism, that is, to actually write a code parser. Or stay with regular expressions, but have no idea about the types of variables and expressions.

The bottom line: working on a regular expression with a C / C ++ program, we do not know the type of variables and expressions. Remember this is a significant limitation.

Diagnostics 3

V503. This is a nonsensical comparison: pointer <0.

Very simple rule. It is suspicious to compare a pointer with <,> to zero. Example:

  CMeshBase * pMeshBase = getCutMesh (Idx);
 if (pMeshBase <0)
   return NULL;

How such a code could be obtained can be found in the error description.

For diagnostics, you just need to know the type of the pMeshBase variable. And why this was impossible was explained a little higher.

This diagnostics is not realizable based on regular expressions.

Diagnostics 4

V504. It is highly probable that the semicolon ';' is missing after 'return' keyword.

  void Foo2 (int * ptr)
 {
   if (ptr == NULL)
     return
   Foo ();
   ...
 }

It is quite possible to diagnose constructions of this type with regular expressions. But there will be too many false positives. We are only interested in cases where the function returns void. In principle, this can be found using only regular expressions. Only it will not be quite easy to understand where the function begins and ends. Try yourself to come up with a regular expression to find the beginning of the function. I assure you, this will be a very interesting task. Especially if you remember that no one bothers to write something like this:

  int foo ()
 {
    ...
   char c [] = 
   "void MyFoo (int x) {"
   ;
   ...
 }

If we have a full-fledged syntax tree with various information, then everything is much simpler. The type of the returned function can be recognized as follows (taken directly from PVS-Studio):

  SimpleType funcReturnType;
 EFunctionReturnType fType;
 if (! env-> LookupFunctionReturnType (fType, funcReturnType))
   return;
 if (funcReturnType! = ST_VOID)
   return;

Diagnostics 5

V505. The 'alloca' function is used inside the loop. This can quickly overflow stack.

Yes, you can try to implement this rule on the basis of regular expressions.

Although I would not dare to try to find out where the cycle begins and where it ends. So many funny situations you can come up with, such as braces in comments and in lines.

  {
   for (int i = 0; i <10; i ++)
   {
     // Cool comment.  Here to you {- suffer now.  :)
     char * x = "Here, too, you need to keep your ear sharply: - {";
   }
   p = _alloca (10);  // Are we inside the loop or not?
 }

Diagnostics 6

V506. Pointer to local variable "X" is stored outside the scope of this variable. Such a pointer will become invalid.

To search for these errors, it is necessary to work with the scope of variables. You also need to know the type of variables.

This diagnostics is not realizable based on regular expressions.

Diagnose 7

V507. Pointer to local array 'X' is stored outside the scope of this array. Such a pointer will become invalid.

This diagnostics is not realizable based on regular expressions.

Diagnostics 8

V508. The use of 'new type (n)' pattern was detected. Probably meant: 'new type [n]'.

It is useful to find typographical errors:

  float * p = new float (10);

It seems everything is simple and could be implemented if you know the type of object being created. Will not work. Just change the text a bit and regular expressions are useless:

  typedef float MyReal;
 ...
 MyReal * p = new MyReal (10);

This diagnostics is not realizable based on regular expressions.

Diagnostics 9

V509. The 'throw' operator inside the destructor should be placed within the try..catch block. Raising exception inside the destructor is illegal.

Yes, you can try this on regular expressions. Destructors are usually small functions and most likely there will be no dirty tricks with different curly braces.

That's just a lot of sweat will have to be in regular expressions to find the destructor function, its beginning and end, is there a throw and is it not caught in catch. Presented the entire scope of work? Weak?

Not for me. Here is how I gracefully wrote in PVS-Studio (the whole rule):

  void ApplyRuleG_509 (VivaWalker & walker, Environment * env,
   const Ptree * srcPtree)
 {
   SimpleType returnType;
   EFunctionReturnType fType;
   bool res = env-> LookupFunctionReturnType (fType, returnType);
   if (res == false || returnType! = ST_UNKNOWN)
     return;
   if (fType! = DESTRUCTOR)
     return;

   ptrdiff_t tryLevel = OmpUtil :: GetLevel_TRY (env);
   if (tryLevel! = -1)
     return;
   string error = VivaErrors :: V509 ();
   walker.AddError (error, srcPtree, 509, DATE_1_SEP_2010 (), Level_1);
 }

Diagnostics 10

V510. The 'Foo' function is not expected to receive class-type variable as 'N' actual argument.

This is a rule on the topic of passing to a function like printf as an argument of classes of type std :: string, and so on. Looking for types. So again, this diagnosis is not realizable based on regular expressions.

Conclusion

I hope I clarified the situation with regular expressions, syntax trees and static code analysis. Thank you all for your attention. Once again, I invite you to download and try PVS-Studio . I am also ready to answer your questions, but I don’t intend to go into disputes about what regular expressions allow and what not. It is not interesting. They allow much, but do not allow even more. C ++ can normally be parsed only with the use of mathematical grammars.

Source: https://habr.com/ru/post/109480/

All Articles