This article is devoted to undefined behavior and compiler optimizations, especially in the context of signed integer overflow.

Note from the translator: Russian has no exact equivalent for the term "wrap" / "wrapping". The closest candidates are the mathematical term for "carry" (related to the carry flag that processors set on integer overflow) and the phrase "rotation around zero", which better conveys the overflow transition from the positive to the negative range but read awkwardly to test readers. For simplicity, a single word close to "carry" is used throughout the Russian text wherever the original says "wrap".
C (and C++) compilers are increasingly guided by the notion of undefined behavior: the idea that the behavior of a program performing certain operations is simply not regulated by the standard, and that when generating object code the compiler may assume the program never performs those operations. Many programmers object to this approach, since the generated code may then behave differently than the author intended. The problem becomes more acute as compilers adopt ever more sophisticated optimization techniques, which are all the more likely to rely on the notion of undefined behavior.
Signed integer overflow is an illustrative example here. Most C developers write code for machines that use two's complement representation for integers, in which addition and subtraction are implemented exactly as in unsigned arithmetic. If the sum of two positive signed integers overflows, that is, exceeds what the type can hold, the processor produces a value that, interpreted as a two's-complement signed number, is negative. This phenomenon is called "wrapping": the result, having reached the upper limit of the value range, wraps around and continues from the lower limit.
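To make this concrete, here is a minimal sketch, assuming a 32-bit int on a two's-complement machine. Only the unsigned case has defined semantics in C; the comment on the signed case merely describes what the hardware instruction typically produces:

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    unsigned int u = UINT_MAX;
    printf("%u\n", u + 1u);  /* prints 0: unsigned arithmetic wraps by definition */

    /* On typical hardware, INT_MAX (0x7FFFFFFF) plus 1 yields the bit pattern
       0x80000000, which read as a two's-complement value is INT_MIN. In C,
       however, performing that signed addition is undefined behavior. */
    return 0;
}
```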
For this reason, you can sometimes see the following C code:
int b = a + 1000;
if (b < a) {
The purpose of the if statement is to detect the overflow condition (here, overflow after adding 1000 to the value of a) and report an error. The problem is that in C, signed integer overflow is one of the cases of undefined behavior. For some time now compilers have been treating such conditions as always false: if you add 1000 (or any other positive number) to a value, the result cannot be smaller than the original value. If overflow does occur, undefined behavior arises, and preventing that is (apparently) the programmer's own concern. The compiler may therefore decide that the conditional statement can be removed entirely as an optimization: the condition is always false, it affects nothing, so the code can do without it.
The problem is that with this optimization the compiler removes the very check the programmer added specifically to detect the overflow and handle it. Here you can see how this happens in practice. (Note: godbolt.org, where the example is hosted, is very cool! You can edit the code and immediately see how different compilers process it, and a great many of them are available. Experiment!) Note that the compiler does not remove the overflow check if you change the type to unsigned, since unsigned overflow has defined behavior in C (more precisely, unsigned arithmetic wraps, so the overflow never actually occurs).
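As a rough sketch of the pattern being discussed (the function name and error handling are illustrative, not taken from the linked example), this is the kind of code in which an optimizing compiler may silently delete the check, because it assumes the signed addition never overflows:

```c
#include <limits.h>

/* Returns -1 if the (intended) overflow check fires, otherwise a + 1000. */
int add_1000_checked(int a) {
    int b = a + 1000;
    if (b < a) {        /* overflow check that the optimizer may remove */
        return -1;
    }
    return b;
}
```

Changing the types involved to unsigned keeps the check intact, since unsigned overflow is defined to wrap.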
So is something wrong here? Some say yes, although many compiler developers clearly consider the decision legitimate. If I understand correctly, the main arguments of the proponents of (edit: implementation-defined) wrapping on overflow are as follows:
- Wrapping on overflow is useful behavior.
- Wrapping is the behavior that programmers expect.
- Treating overflow as undefined behavior gives no noticeable advantage.
- The C standard's wording on undefined behavior allows an implementation to "ignore the situation completely with unpredictable results", but this does not give the compiler the right to optimize code on the assumption that situations involving undefined behavior never occur at all.
Let us examine each item in turn:
Is wrapping on overflow useful behavior?

Wrapping is mainly useful when you need to detect an overflow that has already occurred. (If there are other problems that wrapping solves and unsigned integer variables cannot, I cannot recall any offhand, and I suspect there are not many.) And while wrapping does make it easier to detect overflow after the fact, it is certainly not a panacea (consider multiplication, or addition of two values of unknown sign).
In the trivial cases where wrapping merely lets you detect the overflow afterwards, it is also not difficult to determine in advance whether it will occur at all. Our example can be rewritten as follows:
if (a > INT_MAX - 1000) {
That is, instead of computing the sum and then checking the result for mathematical consistency to figure out whether an overflow occurred, you can check in advance whether the sum would exceed the maximum value the type can hold. (If the sign of both operands is unknown, the check becomes considerably more involved, but the same applies to the after-the-fact check that relies on wrapping.)
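As an illustration, here is one way such a pre-check might look when both operands can have either sign. This is only a sketch, and the helper name is made up for this example:

```c
#include <limits.h>
#include <stdbool.h>

/* True if a + b would overflow an int; no overflowing operation is performed. */
bool add_would_overflow(int a, int b) {
    if (b > 0) return a > INT_MAX - b;  /* a + b would exceed INT_MAX */
    if (b < 0) return a < INT_MIN - b;  /* a + b would fall below INT_MIN */
    return false;                       /* b == 0 never overflows */
}
```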
Given all this, I find the argument that wrapping is useful to be unconvincing in the main.
Is wrapping the behavior that programmers expect?

This argument is harder to dispute, since the code of at least some C programmers clearly assumes wrapping semantics for signed integer overflow. That fact alone, however, is not enough to make such semantics preferable (note that some compilers let you enable it when needed).
The obvious solution to the problem of programmers expecting this behavior is for the compiler to issue a warning whenever it optimizes code on the assumption that undefined behavior does not occur. Unfortunately, as the godbolt.org example linked above shows, compilers do not always do so (GCC 7.3 does, but 8.1 does not, so this is a step backwards).
Does treating overflow as undefined behavior give no noticeable advantage?

If this were true in all cases, it would be a strong argument for compilers to adopt wrapping semantics by default, since it would be better to let overflow checks keep working, even if they are technically incorrect, if only because such checks may be present in potentially broken code.
I admit that this particular optimization (removing checks on mathematically impossible conditions) can often be neglected in ordinary C programs, since their authors strive for the best performance and still optimize by hand: if it is obvious that a given if condition can never be true, the programmer will most likely remove it themselves. Indeed, I found that several studies have questioned the effectiveness of this optimization, tested it, and found it practically insignificant on benchmarks. However, even though the optimization rarely pays off in C, code generators and compiler optimizations are largely language-agnostic and are reused for other languages, for which that conclusion may not hold. Take C++, with its tradition, so to speak, of relying on the optimizer to remove redundant constructs produced by templates rather than removing them by hand. There are also languages that are transpiled to C, whose redundant code likewise ends up being optimized by C compilers.
Moreover, even leaving the overflow checks aside, it is not at all certain that the direct cost of wrapping integer variables is negligible, even on two's-complement machines. The MIPS architecture, for example, can perform arithmetic only in fixed-size (32-bit) registers. The type short int is typically 16 bits and char is 8 bits; when a variable of one of these types is held in a register it is widened, and wrapping it correctly at its own width requires at least one extra operation and possibly an extra register (to hold the corresponding bitmask). I admit I have not worked with MIPS code for a long time, so I am not sure of the exact cost of these operations, but I am sure it is non-zero, and other RISC architectures may face the same problem.
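As a sketch of the extra work involved (the function below is illustrative, not taken from real MIPS output), wrapping a 16-bit addition on a machine whose registers are 32 bits wide requires an explicit truncation step on top of the addition itself:

```c
/* Add two 16-bit values, wrapping at 16 bits, on a 32-bit register machine. */
short add_short_wrapping(short a, short b) {
    int wide = (int)a + (int)b;           /* the addition happens in a 32-bit register */
    return (short)(unsigned short)wide;   /* extra step: truncate back to 16 bits */
}
```

(Strictly speaking, converting an out-of-range value back to short is itself implementation-defined in C, so even expressing this portably takes some care.)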
Does the standard forbid the compiler from deviating from the architecture's wrapping behavior?

On closer inspection, this argument is especially weak. Its essence is that the standard supposedly allows the implementation (the compiler) to interpret "undefined behavior" only within narrow limits. The passage of the standard to which proponents of wrapping appeal reads as follows (it is part of the definition of the term "undefined behavior"):
NOTE: Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, ...

The idea is that the words "ignoring the situation completely" do not mean assuming that the event leading to undefined behavior, for example an overflowing addition, cannot occur; rather, that if it does occur, the compiler should carry on as if nothing had happened, while also taking into account the result that would be obtained if the processor were asked to perform the operation (in other words, as if the source code were translated to machine code in a straightforward, naive way).
First of all, it should be noted that this text appears as a "NOTE" and is therefore non-normative (that is, it cannot prescribe anything), according to the ISO directives mentioned in the standard's foreword:
In accordance with Part 3 of the ISO/IEC Directives, this foreword, the introduction, notes, footnotes and examples are also for information only.

Since this passage about undefined behavior is a note, it prescribes nothing. Note that the actual definition of "undefined behavior" is:
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.

I have highlighted the key point: there are no requirements on undefined behavior. The list of "possible undefined behaviors" in the note contains only examples and cannot be a binding prescription. The phrase "imposes no requirements" cannot be interpreted in any other way.
Some, developing this argument, claim that regardless of the wording, the language committee, when it formulated these words, meant that the behavior should, as far as possible, match that of the hardware the program runs on, implying a naive translation to machine code. This may be true, although I have seen no evidence (historical documents, for example) in support of it. And even if it were so, it does not follow that the claim applies to the current text of the standard.
Final thoughts

The arguments in favor of wrapping are largely untenable. Perhaps the strongest argument emerges if you combine them: less experienced programmers (who do not know the subtleties of C and its undefined behavior) sometimes rely on wrapping, and it does not hurt performance; but the latter is not true in all cases, and the former is unconvincing taken on its own.
Personally, I would prefer that overflow trap rather than wrap: that the program crash instead of continuing, whether with undefined behavior or with a potentially incorrect result, because either way there is a vulnerability. This would, of course, slightly reduce performance on most (?) architectures, x86 in particular, but on the other hand overflow bugs would be detected immediately and could not be exploited, nor could incorrect results propagate further through the program. Moreover, with this approach compilers could in theory safely remove redundant overflow checks, since the overflow is guaranteed not to happen, although as far as I can tell neither Clang nor GCC takes advantage of this.
Fortunately, both trapping and wrapping are implemented in the compiler I use most often, GCC. They are selected with the command-line options -ftrapv and -fwrapv, respectively.
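For example, here is a small program whose observable behavior depends on which of those options is used; the file name and values are purely illustrative:

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    int a = INT_MAX;
    /* gcc -fwrapv overflow.c : b wraps to INT_MIN, and that value is printed.
       gcc -ftrapv overflow.c : the overflowing addition aborts the program.
       gcc overflow.c         : this signed overflow is undefined behavior. */
    int b = a + 1;
    printf("%d\n", b);
    return 0;
}
```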
There are, of course, many operations besides integer overflow that lead to undefined behavior. I do not at all think it is useful to treat every one of these cases as undefined, and I am sure there are many specific situations whose semantics should be defined by the language or at least left to the discretion of implementations. I am also wary of overly loose interpretations of the concept by compiler vendors: when the compiler's behavior does not match the intuition of developers, especially those who have actually read the text of the standard, real bugs can result; and if the performance gain is negligible, such interpretations are better abandoned. I will probably look at some of these issues in a future post.
Supplement (dated August 24, 2018)
I realized that much of the above could be written better. Below, I briefly summarize and clarify my words and add a few small remarks:
- I did not argue that undefined behavior on overflow is preferable to wrapping; rather, that in practice wrapping is not much better than undefined behavior. In particular, security problems can arise in either case, and I am prepared to argue that many of the vulnerabilities caused by overflows that were not caught in time (apart from those where the compiler removed a faulty check) actually stemmed from the wrapped result, not from the undefined behavior associated with the overflow.
- The only real benefit of wrapping is that overflow checks are not removed. While this can protect code against some attack scenarios, the possibility remains that some overflows will not be checked at all (that is, the programmer will forget to add the check) and will go unnoticed.
- If security is not the main concern and raw speed comes first, undefined behavior allows better optimization and a greater performance gain, at least in some cases. If, on the other hand, security is paramount, wrapping is itself fraught with vulnerabilities.
- This means that when choosing among trapping, wrapping, and undefined behavior, there are very few problems for which wrapping is the useful choice.
- As for the overflow checks themselves, I think preserving them is harmful, because it creates the false impression that they work and will always work. Trapping on overflow avoids this problem; adequate warnings mitigate it.
- Any developer writing security-critical code should ideally be well versed in the semantics of the language they write in and aware of its pitfalls. For C, that means knowing the semantics of overflow and the subtleties of undefined behavior. Sadly, some programmers are not at that level.
- I have seen the claim that "most C programmers expect wrapping as the default behavior", but I know of no evidence for it. (I wrote "some programmers" in this article because I know a few real-life examples, and I doubt anyone would dispute that much.)
- There are two separate questions: what the C standard requires and what compilers should implement. I am (broadly) satisfied with the standard treating overflow as undefined behavior. In this post I am talking about what compilers should do.
- With trapping, there is no need to check every single operation for overflow. Ideally, under this approach the program either behaves consistently with ordinary mathematical rules or stops running. That makes it possible to allow a "temporary overflow" that never leads to an incorrect final result: then both the expression a + b - b and the expression (a * b) / b can be optimized to a (with wrapping, the first optimization is still possible, but the second no longer is), as the sketch after this list shows.
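A brief sketch of what that last remark means in code (the function names are illustrative):

```c
/* Mathematically both functions return a (assuming b != 0 in g).
   Under trapping semantics that permit "temporary overflow", a compiler could
   fold both bodies to `return a;`. Under wrapping semantics only f() may be
   folded: in g(), if a * b wraps, (a * b) / b is generally not equal to a. */
int f(int a, int b) { return a + b - b; }
int g(int a, int b) { return (a * b) / b; }
```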
Note: this translation is published with the author's permission. Original text: Davin McCall, "Wrap on integer overflow is not a good idea".
Additional links from the PVS-Studio team:
- Andrey Karpov. Undefined behavior is closer than you think.
- Will Dietz, Peng Li, John Regehr, and Vikram Adve. Understanding Integer Overflow in C/C++.
- V1026. The variable is incremented in the loop. Undefined behavior will occur in case of signed integer overflow.
- Stack Overflow. Is signed integer overflow still undefined behavior in C++?