A rather difficult topic for C ++ programmers is undefined behavior. Even experienced developers often can not clearly articulate its causes. The article is intended to bring a little more clarity to this question.
The article is a translation of several articles and excerpts from the Standard on this topic.
What is the "point of following"?
The standard says:
Sequence points - such points in the process of program execution, in which all side effects of the already executed code have finished their action, and side effects of the code to be executed have not yet begun to act. (§1.9 / 7)
')
Side effects? And what are the "side effects"?
A side effect (side effect) (according to the Standard) is the result of accessing a volatile object, changing an object, calling a function from the I / O library, or calling a function that includes some of these actions. A side effect is a change in the runtime state.
Calculating some expression yields some result. If, in addition to the result, the evaluation of an expression causes changes in the runtime environment, then it is said that this expression has side effects.
For example:
int x = y++;
In addition to the initialization of the variable “x”, the value of the variable “y” has changed due to the side effect of the ++ operator.
Well, that's understandable. Further to points of following. An alternative definition of the notion of a “point of following” is given by Steve Summit (author of the “C Language in Questions and Answers” ​​blog, comp.lang.c):
The sequence point is the moment in time when the dust settled down, and all the side effects encountered were guaranteed to be completed and left behind.
What are the sequence points described in C ++ Standard?
The standard describes the following following points:
- at the end of the evaluation of the full expression (§1.9 / 16). By “full expression” (full-expression) is meant an expression that is not a subexpression - part of another expression (note: the evaluation of a full expression may include the evaluation of a subexpression that is not a part of it. For example, the subexpressions involved in the calculation of the argument by default, they are considered part of the expression that called the function, and not the expression that defines this argument).
For example:
int a = 5;
- in the calculation of the following expressions, namely after calculating the first operand:
1. a && b (§5.14)
2. a || b (§5.15)
3. a? b: c (§5.16)
4. a, b (§5.18)
The calculation of these expressions goes from left to right. After the calculation of the left subexpression, all side effects of this calculation cease. If after calculating the left subexpression the value of the full expression is known, then the right side is not calculated. In the latter case, the operator is a comma. In the function func (a, a ++), a comma is not an operator, but simply a separator between the arguments.
- when calling a function (whether the function is inline or not) after calculating all its arguments (if any) and before executing any instructions in its body.
What is “indefinite behavior”?
The standard defines the phrase “undefined behavior” in §1.3.12:
Undefined behavior is a behavior that may arise as a result of using erroneous software constructs or incorrect data that the International Standard does not impose any requirements on. Indefinite behavior can also occur in situations not explicitly described in the Standard.
In other words, indefinite behavior means anything that can happen, starting from a kozyavka that has fallen out of the nose, ending with the pregnancy of your girl.
What is the relationship between undefined behavior and sequence points?
Before you know the answer to this question, you must understand the differences between undefined behavior, unspecified behavior, and implementation-defined behavior.
Unspecified behavior (according to the Standard) is a behavior for which the Standard offers two or more possible options and does not impose clear requirements on which of them should be chosen in a certain situation.
Unspecified behavior arises from the calculation of subexpressions such as:
- arguments to the function call
- Operands of operators (eg +, -, =, *, /), except for:
- binary logic operators (&& and ||)
- operator conditions (? :)
- operator comma.
(note: except for those operators that contain the following point)
Implementation -defined behavior (according to the Standard) is the behavior of a well-formed software construct with the correct data, which depends on the implementation (must be documented for each implementation).
An example of this behavior is pointer size. In accordance with the Standard, the size of the pointer depends on the specific implementation of the compiler. Within one particular implementation, the size of different types of pointers may also be different.
I also want to note that the order of calculation of the operands of a particular operator, the subexpressions of a specific expression, and the order of occurrence of side effects are not specified.
For example:
int x = 5, y = 6; int z = x++ + y++;
One more example:
int Hello() { return printf("Hello"); } int World() { return printf("World !"); } int main() { int a = Hello() + World(); return 0; }
In §5 / 4, the Standard says:
Between two points of the sequence, the scalar object must change the stored value when calculating the expression no more than once.
What does it mean?
Speaking a little easier, the variable between two points of the sequence cannot be changed more than once. In the expression, the next point of the sequence is usually located on the concluding semicolon, and the previous one - at the end of the preceding statement. The expression may also contain intermediate points of sequence.
Based on the foregoing, the following expressions create undefined behavior:
i++ * ++i; // i = ++i; // ++i = 2; // i 1 i = ++i +1 // i = (i,++i,++i); // `++i` `i` (`i` 1 2 )
But at the same time:
i = (i, ++i, 1) + 1
In addition (by Standard) - the old value of the expression (before the calculation) should be available only for determining the stored value.
This means that incorrect expressions are those in which access to a value may precede its modification.
For example:
std::printf("%d %d", i,++i);
One more example:
a[i] = i++ ;
I heard that in C ++ 0x there are no Follow Points, is that true?
Yes it's true.
The notion of a “point of succession” was replaced by an ISO C ++ committee with a refined and augmented concept of the Relationship of Following [BEFORE AFTER].
What is the Attitude Relation [TO]?
Following is a relation that is:
- asymmetrically
- transitively
- occurs between pairs of computations and the partially ordered set that forms them
Formally, this means that with two given expressions A and B, if A [follows to] B, then execution of A must precede execution of B. If A does not [follow TO] B, then execution of A and B is unsequenced (execution of unordered calculations may overlap).
Calculation of A and B are indeterminantly sequenced, when either A [follows TO] B or B [follows TO] A, but what exactly is not specified. Indefinitely ordered calculations cannot intersect, but any of them can be executed first.
What does “computation” mean in the context of C ++ 0x?
In C ++ 0x, the evaluation (evaluation) of an expression (or subexpression) generally includes:
- counting (computation) of a value (including determining the position of an object in memory for calculating the value of a gvalue expression and getting the value by reference for calculating a prvalue expression)
- initiation of side effects
The standard tells us (§1.9 / 14):
Each value count and side effect associated with the full expression [are preceded by a TO] a value count and side effect count associated with the following full expression to be calculated.
A trivial example:
int x; x = 10; ++x;
In this example, the calculation of the value and the side effect associated with the expression (++ x), [should be AFTER] the calculation of the value and side effect (x = 10).
After all, there must be some connection between the things described above and the indefinite behavior, right?
Of course, there is a connection.
§1.9 / 15 mentions that:
The evaluation of the operands of a particular operator or subexpressions of a specific expression is irregular, except for the cases that were described earlier.
Note: unordered and indefinitely ordered subexpressions of a full expression that is evaluated more than once during program execution are not necessarily calculated each time in the same order.
For example:
int main() { int num = 19 ; num = (num << 3) + (num >> 3) ; }
1) The calculation of the operands of the “+” operator is out of order.
2) The calculation of the operands of the operators "<<" and ">>" is out of order.
§1.9 / 15 counting the value of the operands of a particular operator [should be TO] counting the value of the result of the operation of the operator.
This means that in the x + y expression, the counting of the values ​​of "x" and "y" [should be preceded by] the counting of x + y.
Now to the more important:
§1.9 / 15 If the occurrence of a side effect of a scalar object is disordered with respect to one of the following events:
- the occurrence of another side effect of the same object
- counting values ​​using the value of this object
then the program behavior will be UNDEFINED.
Example:
f(i = -1, i = -1)
Explain this expression. At first glance, the disorder in the calculation of the function's arguments does not entail ambiguity. However, there is no exact probability that the compiler, by optimizing such an expression, will not create a set of instructions similar in action (in his opinion), which will fail during an operation on the same memory.
Suppose that the compiler decided that the best way to assign "-1" would be to zero the variable and decrement it.
Instructions can be formed like this (conditional commands):
clear i decr i clear i decr i
And they can:
clear i clear i decr i decr i
then the value -2 will be stored in i.
When the function is called, each calculation of the value and side effect associated with the expression of the argument of this function, or with the expression calling the function, [should be done before] the execution of any expression or operator in the body of the called function.
The counting of the values ​​and the side effects associated with the different arguments are unordered.
Program flow
Using the terms decoded earlier, the program execution flow can be represented graphically. In the following diagrams, we denote the calculation of the expression (or subexpressions) as E (x), the sequence point is%, the side effect “k” for the object “e” is denoted by S (k, e). If for the calculation it is necessary to read the value from the named object (let “x” be the name), the calculation will be denoted by V (x), in other cases, as previously agreed, E (x). Side effects we write to the right and left of the expressions. The border between two expressions means that the upper expression evaluates to the lower expression (often because the lower expression depends on the prvalue or lvalue of the upper expression).
For two expressions i ++; i ++; the diagram will be:
E(i++) -> { S(increment, i) } |
As you can see, in this case, we have two sequence points, one of which separates the two changes “i”.
Function calls are also of interest, despite the fact that we omit the diagram for them:
int c = 0; int d = 0; void f(int a, int b) { assert((a == c - 1) && (b == d - 1)); } int main() { f(c++, d++); }
This code is correct, because by the time the body of the function f begins to be executed, all side effects generated by the calculation of the arguments are guaranteed to end: “c” and “d” will be increased by 1.
Now consider the expression i ++ * j ++;
{ S(increment, i) } <- E(i++) E(j++) -> { S(increment, j) } \ / +--+--+ | E(i++ * j++) |
Where did the two branches come from? Recall that the sequence points complete the calculations carried out before their occurrence. All the multiplication subexpressions are calculated before the multiplication itself, there is no more repetition point in this expression, therefore, we need to take into account the theoretical “parallelism” of calculating the operands to suggest where a competitive change of the same object can occur. More formally, the two branches are unordered. Sequence points are a relation that orders some calculations and does not order others. So sequence points, as mentioned above, are partial ordering (partial order).
Conflicting side effects.
In order to provide the compiler with freedom in generating and optimizing machine code, in cases similar to the multiplication considered above, the procedure for calculating subexpressions is not established and the side effects generated by them are not separated (except for the cases described earlier).
This can lead to conflicts, so the Standard calls the program's behavior indefinite if it tries to modify the same object without participation of the following points. This applies to scalar objects, because other objects are either immutable (array) or simply do not fall under this rule (class objects). Undefined behavior also occurs if the expression contains a reference to the previous value of the object and its modification, such as in i * i ++
// ! // , 'i' «» : V(i) E(i++) -> { S(increment, i) }) \ / +---+---+ | E(i * i++) |
As an exception, it is allowed to read the value of an object if it is necessary to calculate a new value. Context example: i = i + 1
V(i) E(1) \ / +---+---+ | E(i) E(i + 1) \ / +-------+-------+ | E(i = i + 1) -> { S(assign, i) } |
Here we see the appeal to “i” on the right side; after the calculation of both parts, assignment is made. So the side effect and the reference to “i” occur without crossing the point of the sequence, but we turned to “i” only to determine the stored value, therefore there will be no disagreement.
Sometimes, the value is read after modification. For the occasion
a = (b = 0);
it is fair that the writing in “b” takes place, and then reading from “b” without crossing the sequence point. However, this is normal, because the new value of “b” is already being read, and the old is not addressed. In this case, the side effects of the assignment “b” will end not only to the next point of the sequence, but also before reading the value “b” required for the assignment “a”. The standard explicitly states: “the result of an assignment operation is the value stored in the left operand after the assignment is completed (the result is an lvalue)”. Why is the notion of a point following? Because this concept contains an unnecessary requirement in this situation that all side effects of the left and right operand be completed, instead of considering only the side effects of the assignment returning the lvalue with which the read is performed.
Sources: