From the translator:
I bring to your attention the translation of two posts from the blog John Regehr. I decided to combine them in one publication because, firstly, they have a small amount, and, secondly, the second post is a continuation of the first, and is a response to a comment on the first post on Hacker News.
Link to the first post
Link to the second post
Part 1. Undefined behavior! = Insecure programming
Undefined behavior (UB) in C and C ++ is a danger to developers, especially if the code works with untrusted data. Less well known is that undefined behavior exists in the intermediate representation (IR) of most AOT optimizing compilers. For example, LLVM IR has the meaning
undef and
“poisoned” values in addition to the explosive UB of the C language. When people start worrying about it, the typical response is: “What? LLVM IR is just as bad as C! ”This article explains why it’s so wrong.
Undefined behavior is the result of a design decision: the rejection of the systematic detection of software errors at a certain level of the system. The responsibility for ensuring that there are no such errors lies with a higher level of abstraction. For example, it is obvious that a secure language can be compiled into machine code, but it is also obvious that the insecurity of machine code places all uncompromising high-level guarantees on the implementation of a language. Swift and Rust compile to LLVM IR; some of their security guarantees are carried out by dynamic checks in the generated code, others are made using type checking and are not presented at the LLVM IR level. In other words, UB is not detected at the LLVM level and is a code problem from a secure subset of Swift and Rust languages. Even C can be used safely if any tool in the development environment makes sure that there is no UB.
L4.verified project does exactly that.
The essence of indefinite behavior is that it avoids the connection between error checking operations and unsafe operations. Checks, being unrelated to unsafe operations, can be deleted by the optimizer, or, alternatively, removed from the loop. The remaining unsafe operations can be, with well-designed IR, mapped to processor operations with minimal overhead, or without them at all. For example, consider the following Swift code:
')
func add(a : Int, b : Int)->Int { return (a & 0xffff) + (b & 0xffff); }
Although the Swift implementation should catch an exception if the integer overflows, the compiler sees that the overflow is impossible and generates the following LLVM IR:
define i64 @add(i64 %a, i64 %b) { %0 = and i64 %a, 65535 %1 = and i64 %b, 65535 %2 = add nuw nsw i64 %0, %1 ret i64 %2 }
Not only the addition operation with overflow checking was replaced with the addition operation without checking, but also the addition instruction was marked with the attributes nsw and nuw, which indicate that both the sign and unsigned overflow are undefined. By themselves, these attributes do not bring benefits, but may cause additional optimizations if the function is inline. When the Swift test suite compiles to LLVM, only one of the eight addition instructions has an attribute indicating that the integer overflow is not defined.
In this particular example, the attributes nsw and nuw are redundant, since the optimization pass can detect the fact that addition cannot cause overflow. However, in general, these attributes, and similar ones, have real value, allowing you to avoid costly static analysis in order to rediscover what is already known. Also, some facts may not be discovered later, since the information will be lost at some compilation steps.
In general, indefinite behavior is an abstraction visible to a programmer who represents an aggressive and dangerous compromise: sacrifice the program's correctness in exchange for performance and the compiler simplicity. On the other hand, UB at the lower level of the system, such as the machine code and intermediate representation of the compiler, is an internal design decision that has no influence on how the programmer perceives the system. This type of UB simply requires us to accept the fact that security checks can be removed from their associated unsafe operations in order to achieve efficient program execution.
Part 2. Does an expressive programming language always have undefined behavior?
On Hacker News, someone
commented on my previous post like this:
EMNIP, the Gödel incompleteness theorem implies that any language has at least somewhat undefined behavior.Let's look at this statement, remembering that incompleteness and insolvability are quite complex things. Many years ago I read with pleasure Gödel’s theorem: “
incompleteness leads to abuse ” (I certainly need to re-read it).
Firstly, there are programming languages ​​without UB, for example, such a language that any program prints "7" on it. Whatever means UB (we have not yet given a formal definition), it is clear that the language that always prints "7" does not have it.
There are also useful languages ​​that do not have UB, such as the expression language, which calculates the elementary functions over real numbers in the IEEE format. It is easy to talk about such languages ​​because they are not Turing-complete: all calculations are completed and we must make sure that they complete with a certain result.
But in reality, the commentator with HN was referring to
Rice's theorem : “any non-trivial property of a language recognized by the Turig machine is insoluble.” As a result, if f () is some arbitrary calculation, we cannot generally decide whether the program causes undefined behavior:
main() { f(); return 1 / 0;
But this is nonsense. Rice's theorem applies only to non-trivial properties. To get around this limitation, we need to define a language in which the absence of UB is a trivial property. This is achieved by the fact that each operation of the program performs a complete function: it is defined under any conditions. Thus, programs in this language can either terminate in a certain state, or not terminate at all. No extremely complete specifications are usually made for real programming languages, because it will take too much work, especially if the language is open for interaction with other levels of the system, for example, contains an inline assembler. But this problem is purely practical, theoretically there is no problem. It is not very difficult to write a specification without UB for a (Turing-complete) toy language or a subset of a real language.
Now back to the original comment on HN: it was about incompleteness theorems, not about the stopping problem. I don’t even know what to say, I don’t see how Godel’s theorem has to do with the indefinite behavior of programming languages.