What is Strict Aliasing and why should we care?

(Or: type punning, undefined behavior and alignment, oh my!)

Hello everyone. In a few weeks we are launching a new cohort of the "C++ Developer" course, and today's article is devoted to that event.

What is strict aliasing? First we will describe what aliasing is, and then we can see what the strictness refers to.

In C and C++, aliasing concerns which expression types we are allowed to use to access stored values. In both C and C++, the standard specifies which expression types are allowed to alias which types. The compiler and optimizer are allowed to assume that we strictly follow the aliasing rules, hence the term strict aliasing rule. If we try to access a value using a type that is not allowed, it is classified as undefined behavior (UB). Once we have undefined behavior, all bets are off: the results of our program are no longer reliable.

Unfortunately, with strict aliasing violations we often get the results we expect, which leaves open the possibility that a future compiler version with a new optimization will break code we considered valid. This is undesirable, so it is worth understanding the strict aliasing rules and avoiding violating them.

To better understand why we should care, we will discuss the problems that arise when the strict aliasing rules are violated, type punning (since common type-punning techniques often violate the strict aliasing rules), and how to type pun correctly, along with some possible help from C++20 to make type punning simpler and less error-prone. We will wrap up the discussion by looking at some methods for detecting strict aliasing violations.

Preliminary examples

Let's take a look at some examples, then discuss exactly what the standard(s) say, look at some further examples, and then see how to avoid strict aliasing violations and catch the violations we missed. Here is an example that should not be surprising:

    int x = 10;
    int *ip = &x;

    std::cout << *ip << "\n";
    *ip = 12;
    std::cout << x << "\n";

We have an int* pointing to memory occupied by an int, and this is valid aliasing. The optimizer must assume that assignments through ip could update the value occupied by x.

The following example shows aliasing, which results in undefined behavior:

    int foo( float *f, int *i ) {
        *i = 1;
        *f = 0.f;

        return *i;
    }

    int main() {
        int x = 0;

        std::cout << x << "\n";   // Expect 0
        x = foo(reinterpret_cast<float*>(&x), &x);
        std::cout << x << "\n";   // Expect 0?
    }

The function foo takes an int* and a float*. In this example we call foo and set both parameters to point to the same memory location, which here contains an int. Note that reinterpret_cast tells the compiler to treat the expression as if it had the type specified by its template argument. In this case we tell it to treat the expression &x as if it had type float*. We might naively expect the second cout to print 0, but with optimization enabled using -O2, both gcc and clang produce the following result:

    0
    1

This may be unexpected, but it is perfectly valid, since we have invoked undefined behavior. A float cannot validly alias an int object. Therefore the optimizer can assume that the constant 1, stored when dereferencing i, will be the return value, because a store through f cannot validly affect an int object. Examining the generated code in Compiler Explorer shows that this is exactly what happens:

    foo(float*, int*):                 # @foo(float*, int*)
        mov     dword ptr [rsi], 1
        mov     dword ptr [rdi], 0
        mov     eax, 1
        ret

An optimizer using Type-Based Alias Analysis (TBAA) assumes that 1 will be returned and moves the constant directly into the eax register, which holds the return value. TBAA uses the language rules about which types are allowed to alias in order to optimize loads and stores. In this case, TBAA knows that a float cannot alias an int, so it optimizes away the load of i.
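For contrast, here is a minimal sketch of the same function with both parameters of type int*. The names foo_same_type and call_with_same_object are hypothetical and do not come from the original article. Because one int* may legally alias another, the compiler cannot assume the second store leaves *a untouched, so it has to reload *a before returning.

```cpp
// Hypothetical counterpart to foo: both pointers have the same type,
// so they are allowed to alias and the optimizer must account for that.
int foo_same_type(int *a, int *b) {
    *a = 1;
    *b = 0;       // may overwrite *a when a == b
    return *a;    // has to be reloaded from memory
}

// Hypothetical helper that passes the same object twice.
int call_with_same_object() {
    int x = 5;
    return foo_same_type(&x, &x);   // well defined: the second store wins
}
```

Calling call_with_same_object() yields 0, while calling foo_same_type with two distinct objects yields 1. Both results are guaranteed, unlike in the float*/int* version.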

Now, to the rule book

What exactly does the standard say about what we are and are not allowed to do? The standard's language is not straightforward, so for each item I will try to provide code examples that demonstrate the meaning.

What does the C11 standard say?

The C11 standard says the following in section 6.5 Expressions, paragraph 7:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

- a type compatible with the effective type of the object,

    int x = 1;
    int *p = &x;
    printf("%d\n", *p);   // *p is an lvalue expression of type int, which is compatible with int

- a qualified version of a type compatible with the effective type of the object,

    int x = 1;
    const int *p = &x;
    printf("%d\n", *p);   // *p is an lvalue expression of type const int, which is compatible with int

- a type that is the signed or unsigned type corresponding to the effective type of the object,

    int x = 1;
    unsigned int *p = (unsigned int*)&x;
    printf("%u\n", *p );  // *p is an lvalue expression of type unsigned int, which corresponds to the effective type of the object

See Footnote 12 for a gcc/clang extension that allows assigning unsigned int* to int* even though they are not compatible types.

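- a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,

    int x = 1;
    const unsigned int *p = (const unsigned int*)&x;
    printf("%u\n", *p );  // *p is an lvalue expression of type const unsigned int, an unsigned type corresponding to a qualified version of the effective type of the object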

- an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or

    struct foo {
        int x;
    };

    // struct foo is an aggregate that includes int among its members, so it can alias *ip
    void foobar( struct foo *fp, int *ip );

    struct foo f;
    foobar( &f, &f.x );

- a character type.

    int x = 65;
    char *p = (char *)&x;
    printf("%c\n", *p );  // *p is an lvalue expression of type char, which is a character type.
                          // The result is not portable because of endianness.

What does the C++17 draft standard say?

The C++17 draft standard says in [basic.lval], paragraph 11: if a program attempts to access the stored value of an object through a glvalue of other than one of the following types, the behavior is undefined:

(11.1) - the dynamic type of the object,

    void *p = malloc( sizeof(int) );  // storage has been allocated, but no object's lifetime has started yet
    int *ip = new (p) int{0};         // placement new creates an int object in that storage
    std::cout << *ip << "\n";         // *ip is a glvalue expression of type int, which matches the dynamic type of the object

(11.2) - a cv-qualified version of the dynamic type of the object (cv stands for const and volatile),

    int x = 1;
    const int *cip = &x;
    std::cout << *cip << "\n";  // *cip is a glvalue expression of type const int, a cv-qualified version of the dynamic type of x

(11.3) - a type similar (as defined in 7.5) to the dynamic type of the object,
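The original gives no example for this item, so the following is only a hedged sketch based on my reading of the rule. Roughly, two pointer types are similar when they differ only in cv-qualification at some level, for example int* and const int* const. The function names are hypothetical.

```cpp
// Hypothetical helper: reads an int through a pointer to a "similar" pointer type.
int read_through_similar(const int *const *cpp) {
    // *cpp is a glvalue of type `const int *const`. The pointer object it
    // names has dynamic type `int *`. The two types differ only in
    // cv-qualification, so they are similar and the access is allowed.
    return **cpp;
}

int similar_demo() {
    int x = 42;
    int *p = &x;
    // int** converts implicitly to const int* const* (a qualification conversion)
    return read_through_similar(&p);
}
```

Here similar_demo() returns the value stored in x. The point is that reading p through the more-qualified type is a permitted access, not an aliasing violation.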


(11.4) - a type that is the signed or unsigned type corresponding to the dynamic type of the object,
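
    // si and ui are the signed and unsigned types corresponding to each other's dynamic types.
    // The godbolt output (https://godbolt.org/g/KowGXB) shows that the optimizer assumes they may alias.
    signed int foo( signed int &si, unsigned int &ui ) {
        si = 1;
        ui = 2;

        return si;
    }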
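To see the rule in action without reading assembly, here is a small sketch that calls such a function with both references bound to the same int. The names store_both and signed_unsigned_demo are hypothetical; store_both simply repeats the body of foo so that the block stands on its own.

```cpp
// Same shape as foo, renamed so the block is self-contained.
signed int store_both(signed int &si, unsigned int &ui) {
    si = 1;
    ui = 2;      // may modify the object si refers to
    return si;   // so si has to be reloaded
}

int signed_unsigned_demo() {
    int x = 0;
    // unsigned int is the unsigned type corresponding to int,
    // so accessing x through this reference is permitted by (11.4).
    return store_both(x, reinterpret_cast<unsigned int &>(x));
}
```

Because the access is allowed, the store through ui is visible through si, and signed_unsigned_demo() returns 2 rather than 1.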

(11.5) - a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,

    signed int foo( const signed int &si1, int &si2 );

(11.6) - an aggregate or union type that includes one of the aforementioned types among its elements or non-static data members (including, recursively, an element or non-static data member of a subaggregate or contained union),

    struct foo {
        int x;
    };

    // See the Compiler Explorer example (https://godbolt.org/g/z2wJTC)
    int foobar( foo &fp, int &ip ) {
        fp.x = 1;
        ip = 2;

        return fp.x;
    }

    foo f;
    foobar( f, f.x );

(11.7) - a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,

    struct foo { int x; };

    struct bar : public foo {};

    int foobar( foo &f, bar &b ) {
        f.x = 1;
        b.x = 2;

        return f.x;
    }

(11.8) - a char, unsigned char, or std::byte type.

    int foo( std::byte &b, uint32_t &ui ) {
        b = static_cast<std::byte>('a');
        ui = 0xFFFFFFFF;

        return std::to_integer<int>( b );  // b is a glvalue expression of type std::byte, which can alias an object of type uint32_t
    }

It is worth noting that signed char is not included in the list above. This is a notable difference from C, which says "a character type".
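The character-type exception in (11.8) is what makes it legal to inspect the bytes of any object. As a minimal sketch (the helper name sum_bytes is hypothetical, and 8-bit bytes are assumed), we can walk the object representation of a uint32_t through unsigned char:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums the bytes of a 32-bit value by reading its
// object representation through unsigned char, which (11.8) permits.
unsigned sum_bytes(const std::uint32_t &value) {
    const unsigned char *bytes = reinterpret_cast<const unsigned char *>(&value);
    unsigned sum = 0;
    for (std::size_t i = 0; i < sizeof value; ++i) {
        sum += bytes[i];   // the order of the bytes depends on endianness
    }
    return sum;
}
```

Summing avoids depending on byte order: for 0x01020304 the bytes are 1, 2, 3 and 4 in some order, so the sum is 10 on both little-endian and big-endian machines.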

Subtle differences

So although we can see that C and C++ say similar things about aliasing, there are some differences we should be aware of. C++ does not have C's concept of effective type or compatible type, and C does not have C++'s concept of dynamic type or similar type. Although both have lvalue and rvalue expressions, C++ also has glvalue, prvalue, and xvalue expressions. These differences are largely out of scope for this article, but one interesting example is how to create an object out of memory allocated by malloc. In C, we can set the effective type, for example, by writing to the memory through an lvalue or with memcpy.

    // The following is valid in C but not in C++
    void *p = malloc(sizeof(float));
    float f = 1.0f;
    memcpy( p, &f, sizeof(float));  // the effective type of *p is float in C

    // or
    float *fp = p;
    *fp = 1.0f;                     // the effective type of *p is float in C

Neither of these methods is sufficient in C++, which requires placement new:

    float *fp = new (p) float{1.0f};  // the dynamic type of *p is now float
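Put together as a complete function, the C++ approach might look like the sketch below. The function name is hypothetical, and the error handling is kept deliberately minimal.

```cpp
#include <cstdlib>
#include <new>

// Hypothetical helper: creates a float in malloc'ed storage via placement new,
// reads it back, and releases the storage.
float make_float_in_malloced_storage() {
    void *p = std::malloc(sizeof(float));
    if (p == nullptr) {
        return 0.0f;                      // allocation failed, nothing to demonstrate
    }
    float *fp = new (p) float{1.0f};      // starts the lifetime of a float in the storage
    float result = *fp;                   // access through the object's dynamic type
    std::free(p);                         // float is trivially destructible, no destructor call needed
    return result;
}
```

Memory returned by malloc is suitably aligned for any fundamental type, so no extra alignment handling is needed for a float here.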

Are int8_t and uint8_t char types?

In theory, neither int8_t nor uint8_t has to be a char type, but in practice they are implemented that way. This matters because if they really are character types, then they also alias like char types. If you are unaware of this, it can lead to surprising performance costs. We can see that glibc typedefs int8_t and uint8_t to signed char and unsigned char respectively.
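One way to check what a particular implementation does is a compile-time type comparison. This is a minimal sketch, and the helper is_char_type is a hypothetical name. The values of the last two constants are implementation-specific and not guaranteed by the standard.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: true if T is exactly one of the three char types.
template <typename T>
constexpr bool is_char_type() {
    return std::is_same<T, char>::value ||
           std::is_same<T, signed char>::value ||
           std::is_same<T, unsigned char>::value;
}

// True on implementations that typedef these to char types (as glibc does),
// but this is an implementation choice rather than a requirement.
constexpr bool int8_is_char  = is_char_type<std::int8_t>();
constexpr bool uint8_is_char = is_char_type<std::uint8_t>();
```

A static_assert on either constant would document the assumption in code that depends on how these types alias.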

This would be hard to change, since for C++ it would be an ABI break. It would change name mangling and break any API that uses either of these types in its interface.

This is the end of the first part. We will cover type punning and alignment in a few days.

Leave your comments, and do not miss the open webinar on March 6, which will be hosted by Dmitry Shebordaev, head of technology development at Rambler & Co.

Source: https://habr.com/ru/post/442554/
