
Pointers Are Complicated, or: What's in a Byte?

Hello, Habr! I present to you a translation of the article "Pointers Are Complicated, or: What's in a Byte?" by Ralf Jung.


This summer I am once again working on Rust full-time, and among other things I will again be working on a "memory model" for Rust/MIR. However, before I can talk about my ideas, I finally have to dispel the myth that "pointers are simple: they are just numbers". Both parts of this statement are wrong, at least in languages with unsafe features such as Rust or C: pointers are neither simple nor (ordinary) numbers.


I would also like to discuss a part of the memory model that has to be settled before we can talk about the more complicated parts: in what form is data stored in memory? Memory consists of bytes, the minimal addressable units and the smallest elements that can be accessed (at least on most platforms), but what are the possible values of a byte? Again, it turns out that "it's just an 8-bit number" does not work as an answer.


I hope that after reading this post, you will agree with me regarding both statements.


Pointers are complicated


What is the problem with "pointers are just numbers"? Let's look at the following example. (I use C++ here because writing unsafe code is easier in C++ than in Rust, and unsafe code is exactly where these problems show up. Unsafe Rust and C have all the same problems as C++.)


    int test() {
        auto x = new int[8];
        auto y = new int[8];
        y[0] = 42;
        int i = /* some side-effect-free computation */;
        auto x_ptr = &x[i];
        *x_ptr = 23;
        return y[0];
    }

It would be very beneficial to optimize the final read of y[0] to just return 42. The justification for this optimization is that writing through x_ptr, which points into x, cannot change y.


However, in low-level languages such as C++ we can break this assumption by assigning i the value y - x. Since &x[i] is the same as x + i, this writes 23 into &y[0].


Of course, this does not stop C++ compilers from performing such optimizations. To make them legal, the standard declares that our code has undefined behavior (UB).


First of all, it is not allowed to perform pointer arithmetic (as in &x[i]) that takes the pointer out of the bounds of the array it points into. Our program violates this rule: &x[i] is out of bounds of x, so it is UB. In other words, even computing x_ptr is UB, so we don't even get to the point where we want to use it.


(It turns out that i = y - x is also UB, because only pointers pointing into the same allocation may be subtracted. However, we could write i = ((size_t)y - (size_t)x) / sizeof(int) to get around this restriction.)


But we are not done yet: this rule has one exception that we can exploit. If the arithmetic computes a pointer to the address just past the end of the array, everything is fine. (This exception is necessary to compute vec.end() in the typical C++98 loop.)
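To make this exception concrete, here is a minimal Rust analogue of the vec.end() pattern (an illustration of mine, not from the original article): the one-past-the-end pointer may be computed and compared against, but it is never dereferenced.

    fn sum(xs: &[i32]) -> i32 {
        let mut total = 0;
        let mut p = xs.as_ptr();
        // Computing the one-past-the-end pointer is allowed...
        let end = unsafe { p.add(xs.len()) };
        while p != end {
            // ...but it is only ever compared against, never dereferenced.
            total += unsafe { *p };
            p = unsafe { p.add(1) };
        }
        total
    }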


Let's change the example a bit:


    int test() {
        auto x = new int[8];
        auto y = new int[8];
        y[0] = 42;
        auto x_ptr = x + 8; // one past the end
        if (x_ptr == &y[0])
            *x_ptr = 23;
        return y[0];
    }

Now imagine that x and y were allocated right next to each other, with y at the higher address. Then x_ptr points exactly to the beginning of y! The condition is then true and the assignment happens. At the same time, there is no UB from a pointer going out of bounds.


It seems this rules out the optimization. However, the C++ standard has another ace up its sleeve to help compiler writers: it does not actually allow us to use x_ptr. According to what the standard says about adding an integer to a pointer, x_ptr points to the address just past the last element of the array. It does not point to an element of another object, even one that happens to live at the same address. (At least this is the common interpretation of the standard, on which LLVM bases this optimization.)


And even though x_ptr and &y[0] point to the same address, that does not make them the same pointer, i.e., they cannot be used interchangeably: &y[0] points to the first element of y; x_ptr points to the address past the end of x. If we replace *x_ptr = 23 with *&y[0] = 23, we change the meaning of the program, even though the two pointers compared equal.


This is worth repeating:


Just because two pointers point to the same address does not mean they are equal in the sense that they can be used interchangeably.

Yes, this distinction is subtle. In fact, it still leads to miscompilation bugs in both LLVM and GCC.


Also note that this one-past-the-end rule is not the only case in C/C++ where we can observe this effect. Another example is the restrict keyword in C, which can be used to express that pointers do not alias:


    int foo(int *restrict x, int *restrict y) {
        *x = 42;
        if (x == y) {
            *y = 23;
        }
        return *x;
    }

    int test() {
        int x;
        return foo(&x, &x);
    }

Calling test() triggers UB, because the two memory accesses in foo must not target the same address. Replacing *y with *x in foo would change the meaning of the program: it would no longer have UB. Once again: even though x and y have the same address, they cannot be used interchangeably.


Pointers are definitely not just numbers.


A simple pointer model


So what is a pointer? I do not know the full answer. In fact, this is an open area for research.


One important point: we are looking for an abstract model of pointers here. Of course, on a real computer pointers are just numbers. But a real computer does not perform the optimizations that modern C++ compilers do. If we wrote the programs above in assembly, there would be no UB and no optimizations. C++ and Rust take a more "high-level" view of memory and pointers, restricting the programmer for the benefit of optimizations. When we have to formally describe what the programmer may and may not do in these languages, the model of pointers as numbers falls apart, so we have to find something else. This is another example of using a "virtual machine" that differs from the real computer for specification purposes, an idea I have written about before.


Here is a simple proposal (this is, in fact, the pointer model used in CompCert and in my RustBelt work, and it is also how the miri interpreter implements pointers): a pointer is a pair of some ID uniquely identifying a memory allocation, and an offset into that allocation. In Rust, this would look roughly like:


    struct Pointer {
        alloc_id: usize,
        offset: isize,
    }

Adding or subtracting an integer to or from a pointer only affects the offset, so a pointer can never leave its allocation. Subtracting one pointer from another is only possible if both point into the same allocation (matching C++).
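As a minimal sketch of what arithmetic looks like in this model (illustrative only, not miri's actual code; the method names add and sub are made up), addition only moves the offset, and cross-allocation subtraction is rejected:

    #[derive(Clone, Copy, PartialEq, Eq)]
    struct Pointer {
        alloc_id: usize,
        offset: isize,
    }

    impl Pointer {
        // Adding an integer only moves the offset; the allocation never changes.
        fn add(self, n: isize) -> Pointer {
            Pointer { offset: self.offset + n, ..self }
        }

        // Subtraction is only defined for pointers into the same allocation;
        // anything else corresponds to UB in C++.
        fn sub(self, other: Pointer) -> Option<isize> {
            if self.alloc_id == other.alloc_id {
                Some(self.offset - other.offset)
            } else {
                None
            }
        }
    }

In particular, a one-past-the-end pointer of x (offset 8 into x's allocation) and a pointer to the start of y (offset 0 into y's allocation) are simply different values in this model, because their alloc_ids differ.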


(As you can see, the C++ standard states these rules in terms of arrays, not allocations. LLVM, however, applies them at the level of allocations.)


It turns out (and miri demonstrates this) that this model can serve us quite well. We always remember which allocation a pointer belongs to, so we can distinguish the one-past-the-end pointer of one allocation from the pointer to the beginning of the next allocation. That is how miri can detect that our second example (with &x[8]) has UB.


Our model is falling apart


In our model, pointers, while not numbers, are at least simple. However, this model starts to fall apart once you take pointer-to-integer casts into account. In miri, casting a pointer to an integer does not really do anything: we just obtain a variable of integer type (i.e., its type says it is a number) whose value is a pointer (i.e., a pair of allocation and offset). However, multiplying that number by 2 raises an error, because it is completely unclear what it would mean to "multiply such an abstract pointer by 2".
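A hypothetical sketch of this approach (this is not miri's actual code; the names Value and mul are made up for illustration): the "integer" obtained from a cast keeps its pointer payload, and arithmetic on anything other than raw bits is rejected.

    struct Pointer { alloc_id: usize, offset: isize }

    // A value of integer type is either real bits or a pointer in disguise.
    enum Value {
        Bits(u128),
        Ptr(Pointer),
    }

    fn mul(a: Value, b: Value) -> Result<Value, &'static str> {
        match (a, b) {
            // Plain numbers multiply as usual.
            (Value::Bits(x), Value::Bits(y)) => Ok(Value::Bits(x.wrapping_mul(y))),
            // There is no meaningful way to multiply an abstract pointer by 2,
            // so the interpreter reports an error instead.
            _ => Err("arithmetic on an abstract pointer value"),
        }
    }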


To be clear: this is not a great solution when it comes to defining the semantics of a language. It works well for an interpreter, though. It is the simplest approach, and we chose it because it is not obvious how to do better (other than not supporting such casts at all; but supporting them lets miri run more programs): in our abstract machine there is no single "address space" in which all allocations live and which would map every pointer to a distinct number. Each allocation is identified only by an (opaque) ID. We could start adding extra data to the model, such as a base address for each allocation, and somehow use it to turn an integer back into a pointer... and at that point things get really complicated, and in any case describing such a model is not the goal of this post. The goal is to argue that we need such a model in the first place. If you are interested, I recommend reading this document, which explores the idea of adding a base address in more detail.


In short, casts between pointers and integers are confusing and hard to define formally, given the optimizations discussed above. There is a conflict between the high-level view needed to justify optimizations and the low-level view needed to explain casting pointers to integers and back. For the most part, we simply ignore this problem in miri and try to get as far as possible with the simple model we have. A full definition of a language like C++ or Rust, of course, cannot take this shortcut; it has to explain what is really going on. To my knowledge, no satisfying solution exists yet, but academic research is getting closer.


So pointers are not simple, either.


From pointers to bytes


I hope I have made a reasonably convincing argument that integers are not the only data we need to consider if we want to formally describe low-level languages like C++ or the unsafe parts of Rust. However, this means that a simple operation like reading a byte from memory cannot just return a u8. Imagine we implement memcpy by reading each byte of the source into some local variable v and then storing it to the target location. What if that byte is part of a pointer? If a pointer is a pair of allocation ID and offset, what is its first byte? We have to say what the value of v is, so we have to answer this question somehow. (And this is a completely separate problem from the multiplication issue in the previous section. Here we simply assume that there is some abstract type Pointer.)


We cannot represent a byte of a pointer as a value in the range 0..256 (note: here and below, 0 is included and 256 is excluded). Essentially, if we used a naive model of memory, the extra "hidden" part of the pointer (the part that makes it more than just a number) would be lost when the pointer is written to memory and read back. We have to fix this, so we have to extend our notion of a "byte" to accommodate that extra state. A byte is now either a value in the range 0..256 ("raw bits"), or the n-th byte of some abstract pointer. If we were to implement our memory model in Rust, it could look like this:


    enum ByteV1 {
        Bits(u8),
        PtrFragment(Pointer, u8),
    }

For example, PtrFragment(ptr, 0) represents the first byte of the pointer ptr. This way, memcpy can "break up" a pointer into the individual bytes that represent it in memory and copy them one by one. On a 32-bit architecture, the full representation of ptr consists of 4 bytes:


 [PtrFragment(ptr, 0), PtrFragment(ptr, 1), PtrFragment(ptr, 2), PtrFragment(ptr, 3)] 

This representation supports all byte-level operations that move pointer data around, which is enough for memcpy. Arithmetic and bit-level operations are not fully supported; as noted above, that would require a more sophisticated representation of pointers.
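As a sketch of how little memcpy actually needs (illustrative only; the names abstract_memcpy and read_ptr are made up), copying abstract bytes just moves the fragments along, and a pointer can only be reassembled when all of its fragments are present and in order:

    #[derive(Clone, Copy, PartialEq, Eq)]
    struct Pointer { alloc_id: usize, offset: isize }

    #[derive(Clone, Copy)]
    enum ByteV1 {
        Bits(u8),
        PtrFragment(Pointer, u8),
    }

    // Byte-wise copy: a PtrFragment is copied like any other byte, so the
    // hidden Pointer survives the copy as long as all its fragments do.
    fn abstract_memcpy(dst: &mut [ByteV1], src: &[ByteV1]) {
        for (d, s) in dst.iter_mut().zip(src) {
            *d = *s;
        }
    }

    // Reassembling a 32-bit pointer from memory: this only succeeds if all
    // four bytes are fragments of the same pointer, in the right order.
    fn read_ptr(bytes: &[ByteV1; 4]) -> Option<Pointer> {
        match bytes {
            [ByteV1::PtrFragment(p0, 0),
             ByteV1::PtrFragment(p1, 1),
             ByteV1::PtrFragment(p2, 2),
             ByteV1::PtrFragment(p3, 3)]
                if p0 == p1 && p1 == p2 && p2 == p3 => Some(*p0),
            _ => None,
        }
    }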


Uninitialized memory


However, we are not done with our definition of a "byte" yet. To fully describe program behavior, we need to account for one more option: a byte in memory can be uninitialized. The final definition of a byte looks like this (assuming we have a type Pointer for pointers):


    enum Byte {
        Bits(u8),
        PtrFragment(Pointer, u8),
        Uninit,
    }

We use the value Uninit for all bytes of allocated memory that nothing has been written to yet. Reading uninitialized memory is fine, but doing almost anything else with those bytes (for example, integer arithmetic) is UB.
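A hypothetical sketch of how an interpreter could enforce this rule (again, not miri's actual code; copy_byte and add_bytes are made-up names): copying any byte always works, while arithmetic demands concrete bits.

    #[derive(Clone, Copy)]
    struct Pointer { alloc_id: usize, offset: isize }

    #[derive(Clone, Copy)]
    enum Byte {
        Bits(u8),
        PtrFragment(Pointer, u8),
        Uninit,
    }

    // Reading (copying) a byte never fails, whatever it contains.
    fn copy_byte(b: Byte) -> Byte {
        b
    }

    // Arithmetic only works on raw bits; everything else is reported as UB.
    fn add_bytes(a: Byte, b: Byte) -> Result<Byte, &'static str> {
        match (a, b) {
            (Byte::Bits(x), Byte::Bits(y)) => Ok(Byte::Bits(x.wrapping_add(y))),
            (Byte::Uninit, _) | (_, Byte::Uninit) => {
                Err("UB: arithmetic on uninitialized memory")
            }
            _ => Err("UB: byte-wise arithmetic on a pointer fragment"),
        }
    }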


This is very similar to LLVM's rules for the special value poison. Note that LLVM also has an undef value, which is used for uninitialized memory and behaves somewhat differently. However, compiling our Uninit to undef is correct (undef is, in a sense, "weaker"), and there are proposals to remove undef from LLVM and use poison instead.


You may wonder why we need a special Uninit value at all. Why not pick an arbitrary b: u8 for every new byte and use Bits(b) as its initial value? That is a real option. However, first of all, essentially all compilers have converged on using a special value for uninitialized memory. Departing from that approach would not only cause problems when compiling through LLVM, it would also mean revisiting all optimizations and making sure they still work with the changed model. The key property is: you can always safely replace Uninit with any other value, because any operation that actually inspects this value is UB anyway.


For example, this C code is easier to optimize with Uninit:


    int test() {
        int x;
        if (condA()) x = 1;
        // Lots of hard-to-analyze code that does not initialize x when
        // condA() returned false, but the compiler cannot easily see that.
        use(x); // we want to optimize this to use(1)
    }

With Uninit, we can simply say that x is either Uninit or 1, and since replacing Uninit with 1 is allowed, the optimization is easy to justify. Without Uninit, x is either "some arbitrary bit pattern" or 1, and the same optimization becomes much harder to argue for.


(One could argue that we can reorder when the non-deterministic choice is made, but then we would have to prove that the hard-to-analyze code does not observe x in any way. Uninit avoids this unnecessary proof obligation.)


Finally, Uninit is the better choice for interpreters like miri. Such interpreters have a hard time with operations of the kind "just pick any of these values" (i.e., non-deterministic operations), because if they want to explore all possible program executions, they have to try every possible value. Using Uninit instead of an arbitrary bit pattern means miri can, in a single run of the program, reliably tell you whether it uses uninitialized values incorrectly.


Conclusion


We have seen that in languages like C++ and Rust (unlike on real hardware), pointers can be different even when they point to the same address, and that a byte is more than just a number in the range 0..256. Therefore, even if C could be described as "portable assembly" in 1978, that description is wildly wrong today.



Source: https://habr.com/ru/post/460953/

