This article briefly describes the technologies used in the PVS-Studio tool that allow it to effectively detect a large number of error patterns and potential vulnerabilities. It describes the implementation of the analyzer for C and C++ code, but the information is also valid for the modules responsible for analyzing C# and Java code.
Introduction
There is a misconception that static code analyzers are fairly simple programs that search for code patterns using regular expressions. This is far from the truth. Moreover, the vast majority of errors simply cannot be detected with regular expressions.
This misconception arose from programmers' experience with some tools that existed 10-20 years ago. Those tools often really did little more than search for dangerous code patterns and functions such as strcpy, strcat, and so on. RATS is a representative of this class of tools.
Such tools, although they could be useful, were generally crude and ineffective. Since that time, many programmers have retained the impression that static analyzers are useless tools that hinder work more than they help.
Time passed, and static analyzers became complex solutions that perform in-depth code analysis and find errors that remain in the code even after a careful code review. Unfortunately, because of that past negative experience, many programmers still consider static analysis useless and are in no hurry to introduce it into the development process.
In this article I will try to improve the situation a little. I ask readers to spend 15 minutes getting acquainted with the technologies that the PVS-Studio static code analyzer uses to detect errors. Perhaps afterwards you will take a fresh look at static analysis tools and want to apply them in your work.
Data Flow Analysis
Data flow analysis makes it possible to find a variety of errors, among them array index out of bounds, memory leaks, always true/false conditions, null pointer dereferences, and so on.
Data flow analysis can also be used to find places where unchecked data that came into the program from outside is used. An attacker can craft input data that makes the program behave the way the attacker needs. In other words, the attacker can exploit insufficient input validation as a vulnerability. To detect the use of unchecked data, PVS-Studio implements the specialized diagnostic V1010, which continues to be improved.
Data flow analysis is the calculation of the possible values of variables at various points in a computer program. For example, if a pointer is dereferenced and it is known that it may be null at that point, then this is an error, and the static analyzer will report it.
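As a rough illustration of "possible values at a program point", the sketch below tracks only one property of a pointer: whether it may be null. The Nullness enum and both functions are hypothetical names invented for this example; they are an assumption about how such a fact could be modeled and say nothing about how PVS-Studio is implemented internally.

```cpp
// Hypothetical three-state abstraction of what is known about a pointer.
enum class Nullness { Null, NotNull, MaybeNull };

// Merge the facts arriving from two control-flow branches:
// if the branches disagree, the pointer may or may not be null.
Nullness join(Nullness a, Nullness b) {
    return a == b ? a : Nullness::MaybeNull;
}

// A dereference is worth reporting unless the pointer is known to be non-null.
bool dereferenceIsSuspicious(Nullness state) {
    return state != Nullness::NotNull;
}
```

With this model, a pointer that is assigned nullptr in one branch and a valid address in the other becomes MaybeNull after the branches merge, so a later dereference would be flagged.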
Let's look at a practical example of using data flow analysis to find errors. Before us is a function from the Protocol Buffers (protobuf) project, designed to validate the date.
    static const int kDaysInMonth[13] = {
      0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
    };
    bool ValidateDateTime(const DateTime& time) {
      if (time.year < 1 || time.year > 9999 ||
          time.month < 1 || time.month > 12 ||
          time.day < 1 || time.day > 31 ||
          time.hour < 0 || time.hour > 23 ||
          time.minute < 0 || time.minute > 59 ||
          time.second < 0 || time.second > 59) {
        return false;
      }
      if (time.month == 2 && IsLeapYear(time.year)) {
        return time.month <= kDaysInMonth[time.month] + 1;
      } else {
        return time.month <= kDaysInMonth[time.month];
      }
    }
The PVS-Studio analyzer detects two logical errors in this function at once and issues the following messages:
- V547 / CWE-571 Expression 'time.month <= kDaysInMonth[time.month] + 1' is always true. time.cc 83
- V547 / CWE-571 Expression 'time.month <= kDaysInMonth[time.month]' is always true. time.cc 85
Pay attention to the subexpression "time.month < 1 || time.month > 12". If the month value is outside the range [1..12], the function returns immediately. The analyzer takes this into account and knows that if execution reaches the second if statement, the month value definitely lies in the range [1..12]. Similarly, it knows the ranges of the other variables (year, day, etc.), but they are not of interest to us now.
Now let's take a look at the two identical array element accesses: kDaysInMonth[time.month].
The array is set statically, and the analyzer knows the values of all its elements:
static const int kDaysInMonth[13] = { 0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
Since the index is known to lie in the range [1..12], the analyzer does not consider the 0 at the beginning of the array. It follows that only a value in the range [28..31] can be read from the array.
Depending on whether the year is a leap year, 1 is added to the number of days. But this is not of interest to us now either. What matters are the comparisons themselves:
    time.month <= kDaysInMonth[time.month] + 1;
    time.month <= kDaysInMonth[time.month];
The range [1..12] (month number) is compared to the number of days in the month.
Bearing in mind that in the first case the month is always February (time.month == 2), we get that the following ranges are compared:
- 2 <= 29
- [1..12] <= [28..31]
As you can see, the result of the comparison is always true, which is what the PVS-Studio analyzer warns about. Indeed, the code contains two identical typos. The left side of each expression should use the day class member, not month.
The correct code should be:
    if (time.month == 2 && IsLeapYear(time.year)) {
      return time.day <= kDaysInMonth[time.month] + 1;
    } else {
      return time.day <= kDaysInMonth[time.month];
    }
The error considered here was also previously described in the article "February 31".
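The range reasoning used for ValidateDateTime can be sketched in a few lines. This is only an illustration under stated assumptions: the Range struct and the functions elementsRange and alwaysLessOrEqual are hypothetical names introduced here, not part of PVS-Studio.

```cpp
#include <algorithm>

// Hypothetical closed integer range [lo..hi] of values a variable may hold.
struct Range {
    int lo;
    int hi;
};

// Range of values that can be read from table[index.lo .. index.hi].
Range elementsRange(const int* table, Range index) {
    auto minmax = std::minmax_element(table + index.lo, table + index.hi + 1);
    return {*minmax.first, *minmax.second};
}

// "a <= b" holds for every possible pair of values
// when the largest a does not exceed the smallest b.
bool alwaysLessOrEqual(Range a, Range b) {
    return a.hi <= b.lo;
}
```

For the month range [1..12], elementsRange over kDaysInMonth yields [28..31], and alwaysLessOrEqual reports that the comparison can never be false. For the day range [1..31] it does not, which is why the corrected code raises no such warning.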
Symbolic Execution
The previous section described a method in which the analyzer calculates the possible values of variables. However, finding some errors does not require knowing the values of variables.
Symbolic execution means solving equations in symbolic form.
I did not find a suitable demo in our database of errors, so we will consider a synthetic code sample.
    int Foo(int A, int B)
    {
      if (A == B)
        return 10 / (A - B);
      return 1;
    }
The PVS-Studio analyzer issues the warning: V609 / CWE-369 Divide by zero. Denominator 'A - B' == 0. test.cpp 12
The values of the variables A and B are unknown to the analyzer. But the analyzer knows that at the moment the expression 10 / (A - B) is evaluated, A and B are equal. Therefore, a division by 0 will occur.
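The idea of reasoning about relations between symbols rather than about their values can be sketched as follows. EqualityFacts is a hypothetical class written for this article; it only records which symbols are known to be equal and is a deliberate simplification, not a description of the analyzer's internals.

```cpp
#include <map>
#include <string>

// Hypothetical store of facts of the form "these symbols hold the same value".
class EqualityFacts {
public:
    // Record the fact learned from a condition such as `if (A == B)`.
    void assumeEqual(const std::string& a, const std::string& b) {
        const std::string rootA = find(a);
        const std::string rootB = find(b);
        parent_[rootA] = rootB;
    }

    // True when the difference `a - b` is provably zero,
    // even though neither value is known.
    bool differenceIsZero(const std::string& a, const std::string& b) const {
        return find(a) == find(b);
    }

private:
    // Follow the links to the representative of the symbol's equality class.
    std::string find(const std::string& s) const {
        auto it = parent_.find(s);
        if (it == parent_.end() || it->second == s) return s;
        return find(it->second);
    }

    std::map<std::string, std::string> parent_;
};
```

Inside the branch guarded by A == B, a checker built on such facts would call assumeEqual("A", "B") and then see that the denominator A - B is zero without ever learning what A or B is.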
I said that the values of A and B are unknown. In the general case this is true. However, if the analyzer sees a function call with specific values of the actual arguments, it will take them into account. Consider an example:
    int Div(int X)
    {
      return 10 / X;
    }
    void Foo()
    {
      for (int i = 0; i < 5; ++i)
        Div(i);
    }
The PVS-Studio analyzer warns about the division by zero: V609 CWE-628 Divide by zero. Denominator 'X' == 0. The 'Div' function processes value '[0..4]'. Inspect the first argument. Check lines: 106, 110. consoleapplication2017.cpp 106
A mixture of technologies is at work here: data flow analysis, symbolic execution, and automatic method annotation (we will look at this technology in the next section). The analyzer sees that the variable X is used in the Div function as a divisor. Based on this, a special annotation is automatically generated for the Div function. It is then taken into account that the value range [0..4] is passed to the function as the argument X. The analyzer concludes that a division by 0 will occur.
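How an automatically derived annotation and a value range might be combined can be sketched like this. FunctionSummary, ArgRange, and callMayDivideByZero are hypothetical names; the real annotation format is not described in this article, so the code only mirrors the reasoning of the paragraph above.

```cpp
// Hypothetical closed range of values passed as an actual argument.
struct ArgRange {
    int lo;
    int hi;
};

// Hypothetical automatically derived annotation for a function.
struct FunctionSummary {
    int divisorParam;  // index of the parameter used as a divisor, -1 if none
};

// A call is suspicious when the range passed for the divisor parameter contains 0.
bool callMayDivideByZero(const FunctionSummary& summary, int paramIndex, ArgRange arg) {
    return summary.divisorParam == paramIndex && arg.lo <= 0 && 0 <= arg.hi;
}
```

For Div, the summary would record parameter 0 as a divisor, and the loop passes the range [0..4], so the check fires. Starting the loop from 1 would narrow the range to [1..4] and the check would stay silent.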
Method Annotations
Our team annotated thousands of functions and classes provided in:
- WinAPI
- C standard library
- Standard Template Library (STL)
- glibc (GNU C Library)
- Qt
- MFC
- zlib
- libpng
- OpenSSL
- and so on
All these functions are annotated manually, which makes it possible to specify many characteristics that matter for finding errors. For example, the annotation states that the size of the buffer passed to the fread function must be no less than the number of bytes to be read from the file. It also describes the relationship between the 2nd and 3rd arguments and the value that the function can return.
Thanks to this annotation, the analyzer immediately finds two errors in the following code, which uses the fread function.
    void Foo(FILE *f)
    {
      char buf[100];
      size_t i = fread(buf, sizeof(char), 1000, f);
      buf[i] = 1;
      ....
    }
PVS-Studio warnings:
- V512 CWE-119 A call of the 'fread' function will lead to overflow of the buffer 'buf'. test.cpp 116
- V557 CWE-787 Array overrun is possible. The value of 'i' index could reach 1000. test.cpp 117
First, the analyzer multiplies the 2nd and 3rd actual arguments and calculates that the function can read up to 1000 bytes of data. The buffer is only 100 bytes, so it may overflow.
Second, since the function can read up to 1000 bytes, the range of possible values of the variable i is [0..1000]. Accordingly, the array may be accessed at an invalid index.
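The two checks described above can be expressed as a small sketch. The names FreadCheck, checkFread, and indexMayOverrun are hypothetical and exist only for this illustration; the sketch relies on the documented behavior of fread, which returns the number of elements read and therefore never more than its 3rd argument.

```cpp
#include <cstddef>

// Hypothetical result of checking one fread call against its annotation.
struct FreadCheck {
    bool bufferMayOverflow;  // size * count exceeds the destination buffer
    std::size_t maxReturn;   // largest value fread can return (the element count)
};

FreadCheck checkFread(std::size_t bufferBytes, std::size_t size, std::size_t count) {
    return {size * count > bufferBytes, count};
}

// Using fread's result as an index is safe only if
// the largest possible result is still inside the array.
bool indexMayOverrun(const FreadCheck& check, std::size_t arrayLength) {
    return check.maxReturn >= arrayLength;
}
```

Note that reducing the 3rd argument to 100 would remove the overflow but not the second problem: fread could then return 100, and buf[100] is still out of bounds. Reading at most 99 elements satisfies both checks.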
Let's look at another simple example of an error that can be detected thanks to the annotation of the memset function. Here is a code fragment from the CryEngine V project.
    void EnableFloatExceptions(....)
    {
      ....
      CONTEXT ctx;
      memset(&ctx, sizeof(ctx), 0);
      ....
    }
The PVS-Studio analyzer found a typo: V575 The 'memset' function processes '0' elements. Inspect the third argument. crythreadutil_win32.h 294
The 2nd and 3rd arguments of the function are swapped. As a result, the function processes 0 bytes and does nothing. The analyzer notices this anomaly and warns the programmers about it. We have previously described this error in the article "The long-awaited check of CryEngine V".
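One way to make this typo harder to repeat is to keep the memset call in a single helper so that the argument order is written only once. The zeroObject template below is a hypothetical helper suggested here as a sketch; it is not taken from CryEngine or from PVS-Studio.

```cpp
#include <cstring>
#include <type_traits>

// Zero-fills a trivially copyable object.
// The memset argument order (pointer, value, size) is fixed in one place.
template <typename T>
void zeroObject(T& object) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memset is only safe for trivially copyable types");
    std::memset(&object, 0, sizeof(object));
}
```

In the fragment above, the call would become zeroObject(ctx), which is equivalent to the intended memset(&ctx, 0, sizeof(ctx)).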
The PVS-Studio analyzer is not limited to the annotations we write manually. It also tries to create annotations on its own by studying function bodies. This makes it possible to find errors caused by incorrect use of functions. For example, the analyzer remembers that a function can return nullptr. If the pointer returned by this function is used without a prior check, the analyzer will warn about it. Example:
    int GlobalInt;
    int *Get()
    {
      return (rand() % 2) ? nullptr : &GlobalInt;
    }
    void Use()
    {
      *Get() = 1;
    }
Warning: V522 CWE-690 There might be dereferencing of a potential null pointer 'Get()'. test.cpp 129
Note. The search for the error just considered can be approached from the opposite direction: remember nothing, and every time a call of the Get function is encountered, analyze its body knowing the actual arguments. In theory, this algorithm finds more errors, but it has exponential complexity. The analysis time grows hundreds or thousands of times, and we consider this approach a dead end from a practical point of view. In PVS-Studio, we are developing the automatic annotation of functions instead.
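For completeness, here is how the Use function from the example could be written so that the returned pointer is checked before it is dereferenced. This is a sketch of the usual fix; returning bool from Use is an assumption made here so that the caller can tell whether the value was stored.

```cpp
#include <cstdlib>

int GlobalInt;

int* Get() {
    return (std::rand() % 2) ? nullptr : &GlobalInt;
}

// Returns true when the value was actually stored.
bool Use() {
    if (int* p = Get()) {  // dereference only on the non-null path
        *p = 1;
        return true;
    }
    return false;
}
```

On the path inside the if statement the pointer is known to be non-null, so a data flow analysis of the kind described earlier would have nothing to report.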
Pattern Matching (pattern-based analysis)
At first glance, pattern matching may look like a search using regular expressions. In fact, it is not, and everything is much more complicated.
First, as I said in the introduction, regular expressions are of no use here at all. Second, analyzers work not with lines of text but with syntax trees, which makes it possible to recognize more complex and higher-level error patterns.
Consider two examples, one simpler and one more complicated. I found the first error by checking the source code of Android.
    void TagMonitor::parseTagsToMonitor(String8 tagNames) {
      std::lock_guard<std::mutex> lock(mMonitorMutex);
      if (ssize_t idx = tagNames.find("3a") != -1) {
        ssize_t end = tagNames.find(",", idx);
        char* start = tagNames.lockBuffer(tagNames.size());
        start[idx] = '\0';
        ....
      }
      ....
    }
The PVS-Studio analyzer recognizes a classic error pattern caused by the programmer's misconception about operator precedence in C++: V593 / CWE-783 Consider reviewing the expression of the 'A = B != C' kind. The expression is calculated as following: 'A = (B != C)'. TagMonitor.cpp 50
Look carefully at this line:
    if (ssize_t idx = tagNames.find("3a") != -1) {
The programmer assumes that the assignment is performed first and only then the comparison with -1. In fact, the comparison is performed first. A classic. This error is discussed in more detail in the article on checking Android (see the "Other errors" chapter).
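The effect of the precedence rule is easy to reproduce in isolation. In the sketch below, findPos is a hypothetical stand-in for String8::find that returns the match position or -1; buggyIndex and fixedIndex are likewise names made up for this example.

```cpp
#include <string>

// Stand-in for String8::find: position of the first match, or -1 if none.
long findPos(const std::string& text, const std::string& needle) {
    const std::string::size_type pos = text.find(needle);
    return pos == std::string::npos ? -1 : static_cast<long>(pos);
}

// Mirrors the original condition: parsed as idx = (findPos(...) != -1),
// so idx receives the result of the comparison, 0 or 1.
long buggyIndex(const std::string& text, const std::string& needle) {
    long idx = findPos(text, needle) != -1;
    return idx;
}

// Parentheses force the assignment to happen before the comparison.
long fixedIndex(const std::string& text, const std::string& needle) {
    long idx;
    if ((idx = findPos(text, needle)) != -1) return idx;
    return -1;
}
```

For the text "ab3a" and the needle "3a", buggyIndex yields 1 while fixedIndex yields the real position 2. In the Android fragment this means that start[idx] writes the terminating zero at index 1 regardless of where the match was found.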
Now consider a higher-level version of pattern matching.
    static inline void sha1ProcessChunk(....)
    {
      ....
      quint8 chunkBuffer[64];
      ....
    #ifdef SHA1_WIPE_VARIABLES
      ....
      memset(chunkBuffer, 0, 64);
    #endif
    }
PVS-Studio warning: V597 CWE-14 The compiler could delete the 'memset' function call, which is used to flush 'chunkBuffer' buffer. The RtlSecureZeroMemory() function should be used to erase the private data. sha1.cpp 189
The essence of the problem is that after the buffer has been filled with zeros using the memset function, it is not used anywhere else. When building the code with optimization flags, the compiler will decide that this function call is redundant and remove it. It has the right to do so, since from the point of view of the C++ language the call has no observable effect on the program's behavior. Immediately after the chunkBuffer buffer is filled, the sha1ProcessChunk function returns. Since the buffer is created on the stack, it becomes unavailable after the function exits. Therefore, from the compiler's point of view, filling it with zeros makes no sense.
As a result, private data will remain somewhere on the stack, which can cause trouble. This topic is discussed in more detail in the article "Safe cleaning of private data".
This is an example of high-level pattern matching. First, the analyzer must be aware that this security defect exists; it is classified in the Common Weakness Enumeration as CWE-14: Compiler Removal of Code to Clear Buffers.
Second, it must find all the places in the code where a buffer is created on the stack, overwritten using the memset function, and then not used anywhere else.
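The warning suggests RtlSecureZeroMemory, which is Windows-specific. A commonly used portable alternative is to write the zeros through a pointer to volatile, because compilers are expected to preserve such writes. The secureZero function below is a hypothetical helper sketched under that assumption; where the platform offers a dedicated function for this purpose, that function is the safer choice.

```cpp
#include <cstddef>

// Hypothetical helper: zero a buffer through a volatile pointer so that
// the stores are not treated as dead and removed by the optimizer.
void secureZero(void* data, std::size_t size) {
    volatile unsigned char* p = static_cast<volatile unsigned char*>(data);
    while (size--) {
        *p++ = 0;
    }
}
```

In sha1ProcessChunk, the call memset(chunkBuffer, 0, 64) would be replaced with secureZero(chunkBuffer, sizeof(chunkBuffer)).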
Conclusion
As you can see, static analysis is a very interesting and useful methodology. It allows you to eliminate a large number of errors and potential vulnerabilities at the earliest stages (see SAST). If you are still not fully convinced of the value of static analysis, I invite you to read our blog, where we regularly analyze errors found by PVS-Studio in various projects. You will not be able to stay indifferent.
We will be happy to see your company among our customers and to help make your applications better, more reliable, and more secure.

If you want to share this article with an English-speaking audience, please use the link to the translation: Andrey Karpov. Technologies used in the PVS-Studio code analyzer for finding bugs and potential vulnerabilities.