
What every C programmer should know about Undefined Behavior. Part 1/3


People sometimes ask why LLVM-compiled code occasionally raises SIGTRAP signals when optimization is enabled. Digging in, they discover that Clang generated a "ud2" instruction (assuming x86 code) - the same instruction produced by __builtin_trap(). This article covers several questions about undefined behavior in C code and how LLVM handles it.


In this article (the first of three) we will try to explain some of these questions, so that you can better understand the trade-offs and difficulties involved, and perhaps learn a bit more about the dark sides of C. We will find out that C is not the "high-level assembler" that many experienced C programmers (especially those focused on the low level) like to think of it as, and that C++ and Objective-C have directly inherited many of these problems.

Introduction to Undefined Behavior


Both LLVM IR and the C programming language have the concept of "undefined behavior." Undefined behavior is a broad topic with many nuances. The best introduction to it that I have found is a post on John Regehr's blog. The short version of that excellent article is that many things that look meaningful in C actually have undefined behavior, and this is a common source of bugs in programs. Moreover, any construct with undefined behavior in C gives the implementation license to produce code that can format your hard drive or do something even worse, completely unexpectedly. Once again, I strongly recommend reading John's article.
Undefined behavior exists in C-like languages because the designers of C wanted it to be an extremely efficient low-level programming language. In contrast, languages like Java (and many other "safe" languages) avoid undefined behavior because they want safe, reproducible, implementation-independent behavior, and are willing to sacrifice performance to get it. Whether or not either of these is your goal, if you program in C you really need to understand what undefined behavior is.

Before diving into the details, it is worth recalling what allows a compiler to achieve good performance across a broad range of C applications, because there is no magic bullet here. At the very highest level, the compiler produces fast code because it: a) implements the basic compilation algorithms well, such as register allocation, scheduling, etc.; b) knows a lot of tricks (peephole optimizations, loop transformations, etc.) and applies them when profitable; c) is good at eliminating redundant abstractions (arising from macros, etc.), inlining functions, eliminating temporary objects in C++, etc.; d) and doesn't screw anything up. While any of these optimizations may look trivial, saving even a single cycle in a critical loop can make, say, a codec run 10% faster or consume 10% less power.

Advantages of undefined behavior in C, with examples


Before diving into the dark side of undefined behavior and LLVM's policy when it is used as a C compiler, I think it is useful to look at a few specific cases of undefined behavior and discuss how each one enables better performance than "safe" languages like Java. You can view this either as optimizations enabled by a class of undefined behavior, or as the avoidance of overhead that would be required if those cases were defined. While a compiler can sometimes eliminate this overhead in specific cases, doing so in general (for every case) would require solving the halting problem and a number of other interesting problems.

It should also be noted that both Clang and GCC nail down a few behaviors that the C standard leaves undefined. The cases I describe below are both undefined according to the standard and treated as undefined by both compilers in their default modes.

Using an uninitialized variable : a widely known source of problems in C programs, and there are many tools for catching such errors, from compiler warnings to static and dynamic analyzers. Leaving this undefined improves performance, because it does not require zero-initializing every variable that comes into scope (as Java does). For most scalar variables the overhead would be tiny, but zero-initializing stack arrays and heap-allocated memory can be quite expensive, especially when that memory is completely overwritten afterwards.
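To make this concrete, here is a minimal sketch (the function and buffer names are made up for illustration) of the kind of code where mandatory zero-initialization would be pure overhead:

 #include <stdio.h>

 /* Illustrative only: 'buf' is never read before being written, so C
    lets the compiler skip any implicit zero-initialization. A language
    that defines reads of uninitialized memory would have to zero all
    4096 bytes on every call, even though that work is immediately
    overwritten by the loop below. */
 void fill_and_print(void) {
     char buf[4096];                     /* uninitialized: no hidden memset */
     for (int i = 0; i < 4096; ++i)
         buf[i] = (char)(i & 0x7f);      /* every byte written before any read */
     printf("%d\n", buf[0]);
 }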

Signed integer overflow : if arithmetic on an 'int' type (for example) overflows, the result is undefined. For example, "INT_MAX + 1" is not guaranteed to equal INT_MIN. This behavior enables a whole class of optimizations that matter in some cases. For example, knowing that INT_MAX + 1 is undefined allows "X + 1 > X" to be folded to "true". Knowing that the multiplication "cannot" overflow (because that would be undefined) allows "X * 2 / 2" to be simplified to "X". While these examples may look trivial, such code is frequently exposed by inlining and macro expansion. A more important optimization applies to the "<=" in this loop:

for (i = 0; i <= N; ++i) { ... } 

In this loop the compiler can assume that it will iterate exactly N + 1 times, because "i" overflowing would be undefined, which enables a broad range of loop optimizations. On the other hand, if the variable were defined to wrap around on overflow, the compiler would have to assume that the loop may be infinite (which happens if N equals INT_MAX), which disables many loop optimizations. This matters especially on 64-bit platforms, where "int" is still commonly used as the loop induction variable.

For unsigned variables, guaranteeing wrapping (modulo 2^N) overflow costs nothing, so you can always rely on it. Making signed overflow defined would cost these optimizations (a common symptom of the problem is tons of sign extensions inside loops on 64-bit targets). Both Clang and GCC accept the "-fwrapv" flag, which makes the compiler treat signed overflow as defined (other than dividing INT_MIN by -1).
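As a rough illustration of the optimizations described above (this is a sketch of the source patterns, not of any compiler's internals, and the function names are invented):

 /* With signed overflow undefined, a compiler is free to fold this to
    'return 1;' -- it may assume x + 1 never wraps. With -fwrapv
    (wrapping defined), it must keep the compare, because x == INT_MAX
    would make the result 0. */
 int always_true(int x) {
     return x + 1 > x;
 }

 /* Likewise, this loop can be assumed to run exactly n + 1 times, since
    'i' overflowing would be undefined; with wrapping semantics the
    compiler would have to allow for an infinite loop when n == INT_MAX. */
 long sum_inclusive(int n) {
     long s = 0;
     for (int i = 0; i <= n; ++i)
         s += i;
     return s;
 }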

Shift by an amount greater than the width of the type : shifting a uint32_t by 32 or more bits is undefined. My guess is that this originated because the shift operation behaves differently on different CPUs: for example, X86 truncates a 32-bit shift amount to 5 bits (so a shift by 32 bits is the same as a shift by 0), while PowerPC truncates a 32-bit shift amount to 6 bits (so a shift by 32 bits produces zero). Because of these hardware differences, the behavior is simply undefined in C (that is, shifting by 32 bits on PowerPC could format your hard drive; it is not guaranteed to produce zero). The cost of eliminating this undefined behavior is that the compiler would have to emit an extra operation (such as an 'and') for variable shifts, which would make them twice as expensive on common CPUs.
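Here is a small sketch of what "defining" this behavior would cost: the masked version below corresponds to the extra 'and' operation mentioned above (the exact modulo-32 semantics is an illustrative choice, not what any particular compiler emits):

 #include <stdint.h>

 /* Undefined if amount >= 32: the hardware result would differ between
    x86 (masks the amount to 5 bits) and PowerPC (masks it to 6 bits). */
 uint32_t shift_raw(uint32_t value, uint32_t amount) {
     return value << amount;
 }

 /* Making the behavior defined (here: shift amount taken modulo 32)
    costs an extra masking operation on every variable-count shift. */
 uint32_t shift_masked(uint32_t value, uint32_t amount) {
     return value << (amount & 31);
 }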

Dereferencing wild pointers and out-of-bounds array accesses : dereferencing an arbitrary pointer (NULL, a pointer to unallocated memory, etc.) and accessing an array out of bounds are common bugs in C applications and hardly need explanation. To eliminate this source of undefined behavior, a range check would have to be performed on every array access, and the ABI would have to change so that range information accompanies every pointer that could be used in address arithmetic. This would be extremely expensive for many numerical and other applications, and would break binary compatibility with every existing C library.
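As a purely hypothetical sketch of what eliminating this undefined behavior would require, consider a "fat pointer" that carries bounds information and a check on every access (the struct and function names are invented for illustration; no real C ABI works this way):

 #include <stdio.h>
 #include <stdlib.h>

 /* Hypothetical "fat pointer": the base and length travel with the
    pointer so every access can be range-checked. Real C pointers carry
    none of this, which is why out-of-bounds accesses are undefined
    rather than checked. */
 struct checked_ptr {
     int    *base;
     size_t  length;
 };

 int checked_read(struct checked_ptr p, size_t index) {
     if (index >= p.length) {            /* the per-access check C avoids */
         fprintf(stderr, "out-of-bounds access\n");
         abort();
     }
     return p.base[index];
 }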

NULL pointer dereferencing : contrary to popular belief, dereferencing a null pointer in C is undefined. It is not defined to trap, and if you mmap a page at address 0, it is not defined to access that page. This follows from the rules that forbid dereferencing wild pointers and from the use of NULL as a sentinel value. Making NULL dereference undefined enables a wide range of optimizations: by contrast, Java forbids the compiler from moving operations with side effects across any object dereference that the optimizer cannot prove is non-null. This significantly hurts scheduling and other optimizations. In C-like languages, the undefinedness of NULL dereference enables a large number of simple scalar optimizations that fall out of macro expansion and inlining.

If you are using an LLVM-based compiler, you can dereference a "volatile" null pointer to get a crash if that is what you want, since volatile reads and writes are generally not optimized away. There is currently no flag that makes arbitrary loads from a NULL pointer behave as valid loads, or that allows arbitrary loads from a pointer that is known to be null.
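A minimal sketch of that "volatile" trick (the function name is invented): because volatile accesses are not optimized away, the store below is actually emitted and, on typical platforms where page zero is unmapped, crashes at exactly that point:

 /* Volatile accesses are generally not removed or reordered by the
    optimizer, so this store to address 0 is actually performed; on
    typical platforms with page 0 unmapped, it faults immediately. */
 void stop_here(void) {
     *(volatile int *)0 = 0;
 }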

Violating type rules : it is undefined behavior to cast an int* to a float* and dereference it (accessing an "int" as if it were a "float"). C requires that this kind of type conversion happen through memcpy: using a pointer cast is not correct, and the result is undefined. The rules here are quite nuanced, and I don't want to go into the details (there are exceptions for char*, vectors have special properties, unions work differently, etc.). This behavior enables Type-Based Alias Analysis (TBAA), which is used by a broad range of memory access optimizations in the compiler and can significantly improve the performance of the generated code. For example, this rule allows Clang to optimize this function:

 float *P;

 void zero_array() {
   int i;
   for (i = 0; i < 10000; ++i)
     P[i] = 0.0f;
 }

in "memset (p, 0, 40000)". This optimization also allows you to take out a lot of read operations per cycle, optimize general subexpressions, etc. This class of undefined behavior can be disabled with the -fno-strict-aliasing flag, which disables analysis. When the flag is set, Clang will compile this cycle into 10,000 4-byte write operations (which is many times slower), because it must assume that each of these write operations changes the value of P, as in the following example:

 int main() {
   P = (float*)&P;   // cast causes a TBAA violation in zero_array.
   zero_array();
 }

Type violations like this are rather uncommon, which is why the standards committee decided that the significant performance gains were worth the unexpected results for "reasonable" type casts. It is worth noting that Java gets the benefits of type-based optimizations without these drawbacks, because it has no unsafe pointer casts at all.
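As mentioned above, the defined way to reinterpret the bits of one type as another is memcpy. A small sketch of the contrast (the function names are invented; modern compilers typically lower the small memcpy to a single load/store, so it costs nothing extra):

 #include <string.h>

 /* Undefined: accesses an 'int' object through a 'float' lvalue,
    violating the type-based aliasing rules. */
 float pun_with_cast(int i) {
     return *(float *)&i;
 }

 /* Defined: copies the bytes; compilers typically turn this into the
    same single move instruction, without giving up TBAA. */
 float pun_with_memcpy(int i) {
     float f;
     memcpy(&f, &i, sizeof f);
     return f;
 }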

In any case, I hope this has given you a sense of how whole classes of optimizations become possible thanks to undefined behavior in C. There are of course many other kinds, including sequence point violations such as "foo(i, ++i)", race conditions in multithreaded programs, memory access violations, division by zero, etc.

In the next post we will discuss why undefined behavior in C is a rather frightening thing when performance is not your only goal. In the final post of the series, we will talk about how LLVM and Clang handle undefined behavior.

Source: https://habr.com/ru/post/341048/

