Finding errors by calculating virtual values

During the operation of a static analyzer, the exact values or ranges of values of some variables and expressions can be calculated at the analysis stage. This is useful information that you can use when searching for errors. We call such values virtual values, and this article will be about them.

If the static analyzer is able to calculate what the expression is equal to, this allows for a deeper analysis of the code and to identify more errors. Of course, we are talking not only about the exact values of expressions, such as 1 + 2, but also about calculating the range of values that a variable can take at a certain place in a program. In the PVS-Studio analyzer, we call the algorithms responsible for calculating ranges the mechanism of virtual values. Such a mechanism exists both in the core of the C / C ++ code analyzer and in the core of the C # analyzer. In this article we will look at the mechanism of virtual values using the example of a C # analyzer.

In our PVS-Studio analyzer, we use Roslyn to find errors in C # projects to get all the necessary information about the source code. Roslyn provides us with a syntax tree, type information, dependency searching, and so on. During the analysis, PVS-Studio performs a syntax tree traversal and applies diagnostics to its nodes. In addition, during the crawling process, information is collected that can be used by the analyzer later. Examples of such additional information are virtual values.
')

Creating virtual values

Virtual values are stored for fields, properties, variables, and parameters when first mentioned in the code. If the first mention is a declaration of a variable or an assignment, then we will try to calculate the virtual value by analyzing the expression to the right of the equal sign. Otherwise, we usually know nothing about the property / field / parameter and believe that it can take any valid value. Consider an example:

public class MyClass { private bool hasElements = false; public void Func(byte x, List<int> list) { int y = x; hasElements = (list.Count >= 0); if (hasElements && y >= 0) //V3022 { } } }

When in the process of traversing the tree, the analyzer reaches the body of the Func function, it will begin to calculate the virtual values of the variables. The first line declares the variable y , which is initialized to x . Since we know about x only that it is of type byte , it means that it can take any value from 0 to 255. This range of values will be assigned as the virtual value of the variable y . The same will be done for the hasElements field: the analyzer knows that the Count property on the list cannot accept negative values, therefore, the true value is assigned to the hasElements variable. Now, when analyzing the condition hasElements && y > = 0, we know that the left and right sides are true and the whole condition is also always true - this is where the V3022 diagnostics work .

Let's take a closer look at how the virtual value of the expression is calculated.

Calculating the virtual value of an expression

For variables of different types, the virtual value is calculated differently. In the case of an integer type variable, the virtual value is stored as a set of value ranges that the variable can accept. For example, consider the following code:

 public void Func(int x) { if (x >= -10 && x <= 10 && x != 0) { int y = x + 5; } }

At the beginning of the function, nothing is known about the variable x and its range is all valid values of type int : [int.MinValue, int.MaxValue]. When entering the if block, we can refine the virtual value based on the condition. Thus, inside the if block, the variable x will have a range [-10, -1], [1, 10]. If now x will be used when calculating an expression, the analyzer will take into account its virtual value. In our example, y will get the virtual value [-5, 4], [6, 15].

For expressions of type bool, the virtual value is calculated differently. Here we have only three options: false, true, or unknown meaning. Therefore, we simply enumerate a sufficient number of options for all the variables of the expression, and check in all cases whether the expression will take the same value. For example:

 public void Func(uint x) { bool b = (x >= 0); //V3022 }

Whatever values for the parameter x we take, the expression x > = 0 is always true. Therefore, substituting several values instead of x , we will make sure of this and assign true as the virtual value for b .

Another example from the umbraco project:

 private object Aggregate(object[] args, string name) { .... if (name != "Min" || name != "Max") //V3022 { throw new ArgumentException("Can only use min or max..."); } .... }

To make sure that the condition in the example is always true, the analyzer substitutes the values “Min”, “Max”, “”, null instead of name. In each of these cases, either the left or the right side of the expression will be true, so the expression in the condition is always true.

Virtual values calculated for all variables are stored separately for each block. When entering a nested block, the analyzer creates its own set of virtual values based on the parent block. For a simple nested block, this is just a copy of all virtual values. For conditions, loops, and other blocks, virtual values are not just copied, they may be subject to additional restrictions.

Refinement of virtual values in the if / else block

Consider for example how virtual values behave when processing an if / else block.

 public void Func(int x) { if (x >= 0) { //x:[0, int.MaxValue] if (x <= 10) { //x:[0, 10] .... } else { //x:[11, int.MaxValue] .... } } }

After analyzing the condition x > = 0, PVS-Studio will limit the range of the variable x for the first if block to the values [0, int.MaxValue]. After processing the second condition x <= 10, the analyzer will create two more copies of the virtual values of the variable x - one for the if block and the other for the else block. Moreover, these copies will be subject to restrictions taking into account the virtual value of the same variable from the parent block and the virtual value of the expression in the condition. That is, for a nested if block, the virtual value of the variable x will be [0, 10], and for the else block, all other values will be [11, int.MaxValue].

After traversing the if / else block, we need to combine the virtual values from the if and else blocks for each variable. It should also be noted here that if at the end of the if or else there was a transition operator, for example, return , then it is not necessary to combine the values from this block. Examples:

 public void Func1(int x) { bool b1 = false; bool b2 = false; if (x >= 0) { .... b1 = true; b2 = true; } else { .... b1 = true; } //  ,  b1  true //  b2     } public void Func2(int x) { if (x < 0) return; //  ,  x   }

Cycle processing

The peculiarity of calculating virtual values for cycles is that the body of the cycle must be bypassed twice. Consider an example.

 public void Func() { int i = 0; while (i < 10) { if (i == 5) { .... } i++; } }

If we simply copy the virtual values from the parent block to the while block, then when analyzing the condition i == 5 we would get a false positive V3022, since we know that before the cycle, the variable i was zero. Therefore, before analyzing the body of the loop, you need to calculate what values the variables can take at the end of the iteration, as well as in all the blocks containing the continue statement, and combine all these values together with the values of the variables before entering the loop. In addition, if we analyze the for loop, we need to take into account the initialization and change blocks of the counter. After the values of all possible points of entry into the loop are combined, it is necessary to apply the loop condition in the same way as for the if block. So we will get the correct virtual values for the variables and we can perform a second round of the cycle, on which diagnostics will be applied.

After traversing the cycle, we need to combine the virtual values of the variables from all points from which we can get to the next operator after the cycle. These are the values before the beginning of the loop (if no iteration is performed), the values of the variables at the end of the loop body, the values of the variables in the blocks containing the break or continue statements. All of these values we have already calculated and remembered at the time of the first round of the cycle. Now all of them must also be combined and apply the condition opposite to the cycle condition.

It was a difficult explanation, so let's look at an example:

 public void Func(bool condition1, bool condition2, bool condition3) { int x = 0; while (condition1) { //x:[0, 1], [3] if (condition2) { x = 2; break; } if (condition3) { x = 3; continue; } x = 1; } //x:[0, 3] }

In this example, the variable x before entering the loop is zero. Having executed the first pass through the loop, the analyzer will calculate that the variable x can also take the values 1, 2 in the block with break and 3 in the block with continue . Since we have three points of transition to the next iteration of the cycle, at the beginning of the cycle, the variable x can take the values 0, 1 or 3. And we can get into the next operator after the cycle from four points. Therefore, here x can be 0, 1, 2, or 3.

The analyzer also calculates which values the variables can take within the case blocks of the switch statement , within try / catch / finally, and for other language constructs.

Division by zero

Dividing by zero is one of the errors that can be found with virtual values. The peculiarity of this diagnosis is that not every division, in which theoretically there can be a zero in the denominator, should lead to its operation. Consider an example:

 public int GetBlockCount(int dataLength, int blockSize) { return dataLength / blockSize; }

In this function, blockSize can theoretically take any value of type int , and zero also falls within this range. But if you issue warnings to such a code, the diagnosis will lose its meaning, since useful warnings will be lost in hundreds of false positives. Therefore, we needed to teach the analyzer to identify among all the divisions really suspicious, for example, the following:

 public string GetDownloadAvgRateString() { if (SecondsDownloading >= 0) { return GetSpeed(Downloaded / SecondsDownloading); } else { return ""; } }

or such:

 public void Func(int x, int y) { for (int i = -10; i <= 10; i++) { y = x / i; } }

As a solution, we divide the ranges of virtual values into accurate and inaccurate. By default, a range is considered inaccurate until it is refined by explicitly assigning a constant or variable with an exact range, or by limiting the condition of an if statement or a loop. If the zero gets inside or on the border of the exact range, then in this case the division by zero diagnosis works.

Examples

Now let's look at some examples from real projects found by PVS-Studio using virtual values.

Example N1 (RavenDB).

 internal static void UrlPathEncodeChar (char c, Stream result) { if (c < 33 || c > 126) { byte [] bIn = Encoding.UTF8.GetBytes (c.ToString ()); for (int i = 0; i < bIn.Length; i++) { .... } } else if (c == ' ') { //V3022 result.WriteByte ((byte) '%'); result.WriteByte ((byte) '2'); result.WriteByte ((byte) '0'); } else result.WriteByte ((byte) c); }

The first condition of the function UrlPathEncodeChar handles special characters, the second condition is a special optimization for the space. But since the ASCII code space is 32, the space will be processed by the first block. PVS-Studio finds this error as follows: inside the if block, the virtual value of the variable c will be [0, 32], [127, char.MaxValue], and inside the first else block all other values: [33, 126]. Since the space code does not fall into this range, the analyzer reports error V3022 - the expression c == '' is always false.

Example N2 (ServiceStack).

 protected override sealed void Initialize() { if (RootDirInfo == null) RootDirInfo = new DirectoryInfo(WebHostPhysicalPath); if (RootDirInfo == null || !RootDirInfo.Exists) //V3063 throw new ApplicationException("..."); .... }

At the beginning of the function Initialize about the property RootDirInfo, we do not know anything. After analyzing the condition RootDirInfo == null , 2 more copies of virtual values are created: one for the if block in which RootDirInfo is null , and the second for the else block in which RootDirInfo is not null . Although there is no else block in our example, virtual values are still created for it. Further inside the if block, a new value is obtained in the RootDirInfo property, which is obtained as a result of calling the constructor. Since the constructor never returns null , the RootDirInfo value in the if block is now not null . Since the RootDirInfo for the else block is also not null , when combining these two branches, we get that RootDirInfo after processing the first condition will never be null . As a result, when analyzing the second condition, PVS-Studio reports an error V3063 - part of the condition is always false.

Example N3 (ServiceStack).

 public static TextNode ParseTypeIntoNodes(this string typeDef) { .... var lastBlockPos = typeDef.IndexOf('<'); if (lastBlockPos >= 0) { .... //V3022 while (lastBlockPos != -1 || blockStartingPos.Count == 0) { var nextPos = typeDef.IndexOfAny(blockChars, lastBlockPos + 1); if (nextPos == -1) break; .... lastBlockPos = nextPos; } } }

Consider what happens in this example with the lastBlockPos variable. First, the result of calling the IndexOf function is assigned to it. The analyzer knows that the IndexOf , IndexOfAny , LastIndexOf , LastIndexOfAny functions return a non-negative value or -1. Therefore, the range of the lastBlockPos variable will be [-1, int.MaxValue]. After entering the if block, the range will be limited only by non-negative values [0, int.MaxValue]. Next, the analyzer will run through the body of the while loop . The variable nextPos, when declared, gets the range [-1, int.MaxValue]. After analyzing the condition if ( nextPos == -1), two copies of the virtual values of the variable nextPost are created : [-1] for the if branch and [0, int.MaxValue] for the else branch. Since the if branch contains the break statement, the rest of the loop body for the nextPost variable uses only the virtual values of the else branch: [0, int.MaxValue], which are assigned at the end of the lastBlockPos variable.

Thus, we have two transition points to the loop body: one at the entrance to the loop, in which lastBlockPos has the value [0, int.MaxValue] and the second when it goes to the next iteration, at which lastBlockPos also has the value [0, int.MaxValue ]. Therefore, lastBlockPos never accepts negative values in the loop condition, which means the loop condition is always true, as reported by the V3022 diagnostic.

It is worth noting that finding this error manually is quite difficult, since the body of the cycle contains about forty lines and to trace the passage through all branches is problematic.

Example N4 (TransmissionRemoteDotnet).

 // Converts an IPv4 address into a long, for reading from geo database long AddressToLong(IPAddress ip) { long num = 0; byte[] bytes = ip.GetAddressBytes(); for (int i = 0; i < 4; ++i) { long y = bytes[i]; if (y < 0) //V3022 y += 256; num += y << ((3 - i) * 8); } return num; }

Here, the variable y of type long is assigned a value of type byte . Since the byte type is unsigned, checking y <0 is meaningless.

Example N5 (MSBuild).

 private bool ValidateTaskNode() { bool foundInvalidNode = false; foreach (XmlNode childNode in _taskNode.ChildNodes) { switch (childNode.NodeType) { case XmlNodeType.Comment: case XmlNodeType.Whitespace: case XmlNodeType.Text: // These are legal, and ignored continue; case XmlNodeType.Element: if (childNode.Name.Equals("Code") || childNode.Name.Equals("Reference") || childNode.Name.Equals("Using")) { continue; } else { foundInvalidNode = true; } break; default: foundInvalidNode = true; break; } if (foundInvalidNode) //V3022 { .... return false; } } return true; }

In this example, the loop body contains two statements — switch and if . Consider the switch block. The first section with three cases contains only the continue operator, so there can be no transition to checking the condition foundInvalidNode . The second case section either goes to the next iteration of the loop, or sets foundInvalidNode to true and exits the switch . And finally, the default section also sets foundInvalidNode to true and exits the switch . Thus, after exiting the switch, the variable foundInvalidNode will always be true, which means the following if is superfluous. The analyzer took into account that in this switch block there is a branch of default , which means that control cannot proceed immediately to the condition check - one of the switch sections will be necessarily executed.

It should be noted that inside this statement, switch continue is related to the loop, and break exits the switch , not the loop!

Conclusion

Calculation of variable values at the stage of static analysis is a powerful tool for finding errors. The code may contain complex branches, nested conditions and loops, blocks of hundreds of lines in size. Tracing manually how a variable changes and finding an error can be very difficult and the PVS-Studio static analyzer is a good assistant in the search for many errors.

If you want to share this article with an English-speaking audience, then please use the link to the translation: Ilya Ivanov. Search for errors .

Read the article and have a question?

Often our articles are asked the same questions. We collected answers to them here: Answers to questions from readers of articles about PVS-Studio, version 2015 . Please review the list.

Source: https://habr.com/ru/post/282998/

All Articles