
The story of one bug: data alignment on x86

Once I had to calculate the sum of a vector of integers.

Sounds unusual, doesn't it? Who needs to do this in real life? Normally such calculations only occur in elementary-school exercises or compiler benchmarks. But this time it happened for real.

In reality, the task was to verify the checksum of IPv4 headers, which is the one's complement sum of two-byte machine words. Simply put, it means adding up all the words together with all the carry bits produced in the process. This procedure has several nice properties; for instance, the result does not depend on the byte order of the machine.
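To make the arithmetic concrete, here is a minimal sketch of that verification over 16-bit words (my illustration, not the article's function, which accumulates 32-bit words; the helper name is made up, and alignment is ignored for now):

 #include <stdint.h>
 #include <stddef.h>
 #include <stdio.h>

 /* Verify a 16-bit one's complement checksum: add the words, fold the carry
    bits back in, and expect the all-ones pattern. Alignment and the byte
    order question are ignored here; the article deals with both below. */
 static int checksum16_ok (const uint16_t * words, size_t count)
 {
     uint32_t sum = 0;
     for (size_t i = 0; i < count; i++)
         sum += words [i];                      /* plain sum of the words */
     while (sum >> 16)
         sum = (sum & 0xFFFF) + (sum >> 16);    /* add the carries back in */
     return sum == 0xFFFF;                      /* a valid header sums to all ones */
 }

 int main (void)
 {
     uint16_t demo [2] = { 0x0001, 0xFFFE };    /* 0x0001 + 0xFFFE == 0xFFFF */
     printf ("%s\n", checksum16_ok (demo, 2) ? "valid" : "invalid");
     return 0;
 }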

There was one important requirement: the source data is not aligned (the IP frames are exactly as they arrived from the hardware or were read from a file).

I didn't have to worry about portability, because the code only had to run on a single platform: Intel x64 (Linux and GCC 4.8.3). Intel imposes no alignment restrictions on integer operands (unaligned access used to be slow, but isn't any more), and since byte order does not matter here, little-endian would do just fine. So I quickly wrote:

 _Bool check_ip_header_sum (const char * p, size_t size)
 {
     const uint32_t * q = (const uint32_t *) p;
     uint64_t sum = 0;
     sum += q[0];
     sum += q[1];
     sum += q[2];
     sum += q[3];
     sum += q[4];
     for (size_t i = 5; i < size / 4; i++) {
         sum += q[i];
     }
     do {
         sum = (sum & 0xFFFF) + (sum >> 16);
     } while (sum & ~0xFFFFL);
     return sum == 0xFFFF;
 }

The source code and the assembler output can be found in the repository.

The most common IP header size is 20 bytes (5 double words, which I will simply call words), which is why the code looks the way it does. Besides, the size can never be smaller: it is checked before this function is called. Since an IP header can never be longer than 15 words, the number of loop iterations is between 0 and 10.

This code really isn't portable: it is well known that accessing arbitrary memory through pointers to 32-bit values doesn't work on some processors, most RISC processors for example. But, as I said, it was never expected to be a problem on x86.

And of course (otherwise there would be nothing to talk about here), reality proved otherwise: this code crashed with SIGSEGV.

Simplify


The crash only occurred when the loop was executed, that is, when the header was longer than 20 bytes. In real life this happens very rarely, but I was lucky: my test data set contained such headers. Let's simplify the code down to just this loop, write it in plain C and split it across two files to prevent inlining. Here is sum.c:

 #include <stdlib.h>
 #include <stdint.h>

 uint64_t sum (const uint32_t * p, size_t nwords)
 {
     uint64_t res = 0;
     size_t i;
     for (i = 0; i < nwords; i++) res += p [i];
     return res;
 }

And here is main.c:

 #include <stdint.h>
 #include <stdio.h>

 extern uint64_t sum (const uint32_t * p, size_t nwords);

 char x [100];

 int main (void)
 {
     size_t i;
     for (i = 0; i < sizeof (x); i++) x [i] = (char) i;
     for (i = 0; i < 16; i++) {
         printf ("Trying %d sum\n", (int) i);
         printf ("Done: %d\n", (int) sum ((const uint32_t*) (x + i), 16));
     }
     return 0;
 }

Now the SIGSEGV occurred in the sum function when i was 1.

Investigation


The code of sum is surprisingly long, so I'll only show the main loop:

 .L13:
         movdqa    (%r8), %xmm2
         addq      $1, %rdx
         addq      $16, %r8
         cmpq      %rdx, %r9
         pmovzxdq  %xmm2, %xmm1
         psrldq    $8, %xmm2
         paddq     %xmm0, %xmm1
         pmovzxdq  %xmm2, %xmm0
         paddq     %xmm1, %xmm0
         ja        .L13

The compiler is clever. Too clever, if you ask me. It used the SSE instruction set (it was allowed to, because I use it everywhere and had specified -msse4.2 on the command line). This code reads four values at once ( movdqa ), converts them to 64-bit format in two registers (two instructions, pmovzxdq and psrldq ) and adds them to the running sum ( %xmm0 ). After the loop it adds the accumulated values together.

This looks like a reasonable optimisation for the case of many words, but that's not our case. The compiler had no way to establish the typical number of loop iterations, so it optimised the code to the maximum, reasoning correctly that for a small number of words the losses from over-optimisation would also be small. We will measure later what those losses are and how big they get.

What in this code could possibly cause an error? We quickly realised it is the movdqa instruction. Like most SSE instructions that access memory, it requires its source address to be aligned to 16 bytes. But we cannot expect such alignment from a uint32_t pointer, so how can this instruction be used at all?

The compiler does take care of alignment. Before entering the loop, it computes how many words can be processed first:

         testq   %rsi, %rsi      ; %rsi is n
         je      .L14
         movq    %rdi, %rax      ; %rdi is p
         movq    %rsi, %rdx
         andl    $15, %eax
         shrq    $2, %rax
         negq    %rax
         andl    $3, %eax

Or, in a more familiar form:

 if (nwords == 0) return 0;
 unsigned start_nwords = (- (((unsigned) p & 0x0F) >> 2)) & 3;

This returns 0 if p ends with 0, 1, 2 or 3 in hex, 3 if it ends with 4-7, 2 for the range 8-B, and 1 for C-F. After processing that many initial words, the loop can start (provided at least 4 words remain; the leftovers are handled after the loop).

In short, this code aligns the pointer to 16 bytes, but only on the condition that it is already aligned to 4.
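If the arithmetic looks suspicious, the mapping is easy to check exhaustively with a throwaway snippet (mine, not from the article):

 #include <stdio.h>

 int main (void)
 {
     /* Try every possible low hex digit of the pointer and print how many
        words the preamble will consume before the 16-byte-aligned loop. */
     for (unsigned low = 0; low < 16; low++) {
         unsigned start_nwords = (0u - (low >> 2)) & 3;
         printf ("p ends with 0x%X -> %u preamble words\n", low, start_nwords);
     }
     return 0;
 }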

Suddenly our x86 behaves like a RISC: it crashes if a pointer to uint32_t is not aligned to 4 bytes.

Simple solutions do not work


No simple manipulation of this function solves the problem. We could, for example, declare the parameter p as char* in a naive attempt to "explain to the compiler the arbitrary nature of the pointer":

 uint64_t sum0 (const char * p, size_t nwords)
 {
     const uint32_t * q = (const uint32_t *) p;
     uint64_t res = 0;
     size_t i;
     for (i = 0; i < nwords; i++) res += q [i];
     return res;
 }

Or we can replace indexing with pointer arithmetic.

 uint64_t sum01 (const uint32_t * p, size_t n)
 {
     uint64_t res = 0;
     size_t i;
     for (i = 0; i < n; i++) res += *p++;
     return res;
 }

Or apply both methods.

 uint64_t sum02 (const char * p, size_t n)
 {
     uint64_t res = 0;
     size_t i;
     for (i = 0; i < n; i++, p += sizeof (uint32_t))
         res += * (const uint32_t *) p;
     return res;
 }

None of these modifications helps. The compiler is clever enough to see through all the syntactic sugar and reduce every variant to the same code. All of them crash with SIGSEGV.

What the standards say


This looks like a rather dirty compiler trick. The transformation it performs contradicts the usual expectations of an x86 programmer. Is the compiler allowed to do this? To answer that, we have to look at the standard.

I'm not going to dig deep into the various C and C++ standards. Let's look at just one of them, C99, or, more precisely, the latest publicly available version of C99 (2007).

It introduces the concept of alignment:

3.2
alignment
requirement that objects of a particular type be located on storage boundaries with addresses that are particular multiples of a byte address

This concept is used in the definition of pointer conversion:

6.3.2.3
A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.

And it is also used in the definition of pointer dereferencing:

6.5.3.2 Address and indirection operators
If an invalid value has been assigned to the pointer, the behavior of the unary * operator is undefined. 87)

87) Among the invalid values for dereferencing a pointer by the unary * operator are a null pointer, an address inappropriately aligned for the type of object pointed to, and the address of an object after the end of its lifetime.

If I understand these clauses correctly, converting pointers (other than converting something to char*) is generally a risky business. The program may crash right at the point of conversion. Alternatively, the conversion may succeed but produce an invalid value that makes the program crash (or produce garbage) when the pointer is dereferenced. In our case either may happen if the alignment requirement this compiler imposes on uint32_t differs from one (the alignment of char*). Since the most natural alignment for uint32_t is 4, the compiler was completely within its rights.

The sum0 version, although it doesn't solve the problem, is still better than the original sum: the original required the pointer to already have type uint32_t*, which meant the conversion had to happen in the calling code, where it could crash immediately or produce an invalid pointer value. Let's make alignment the responsibility of the sum function itself and replace sum with sum0 from now on.

These clauses of the standard explain why our attempts to fix the problem by playing with pointer types and address arithmetic failed. Whatever we do with the pointer, in the end it is converted to uint32_t*, which immediately tells the compiler that it is aligned on a four-byte boundary.

There are only two suitable solutions.

Disable SSE


The first one is not so much a solution as a trick. Alignment problems on x86 appear only when SSE is used, so let's just turn SSE off. We can do it for the whole file where sum is defined, or, if that is inconvenient, for this particular function only:

 __attribute__ ((target ("no-sse")))
 uint64_t sum1 (const char * p, size_t nwords)
 {
     const uint32_t * q = (const uint32_t *) p;
     uint64_t res = 0;
     size_t i;
     for (i = 0; i < nwords; i++) res += q [i];
     return res;
 }

This code is even less portable than the original, since it uses attributes specific to GCC and to Intel processors. It can be cleaned up with appropriate conditional compilation:

 #if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
 __attribute__ ((target ("no-sse")))
 #endif

However, this method is of limited use: it only lets the program compile on other compilers and other architectures, it doesn't guarantee it will work there. The program may still crash on a RISC processor, or if a compiler uses a different syntax for disabling SSE.

And even if we stay within the GCC/Intel world, who can guarantee that in ten years there won't be some other extension besides SSE? After all, my original code could have been written 20 years ago, when there was no SSE at all (the first MMX appeared in 1997).

On the other hand, this version compiles into very neat code:

 sum0:
         testq   %rsi, %rsi
         je      .L34
         leaq    (%rdi,%rsi,4), %rcx
         xorl    %eax, %eax
 .L33:
         movl    (%rdi), %edx
         addq    $4, %rdi
         addq    %rdx, %rax
         cmpq    %rcx, %rdi
         jne     .L33
         ret
 .L34:
         xorl    %eax, %eax
         ret

This is exactly the code I had in mind when I wrote the function. I expect it to be faster than the SSE-based code for small vector sizes, which is our case with IP headers. We will measure this later.

Using memcpy


The other option is to use memcpy. This function can copy the bytes representing a number into a variable of the appropriate type, regardless of alignment, and it does so in full compliance with the standard. It may seem inefficient, and 20 years ago it was. These days, however, it need not be implemented as a procedure call at all: the compiler can treat it as an intrinsic and replace it with a memory-to-register move. GCC definitely does. It compiles the following code

 uint64_t sum2 (const char * p, size_t nwords)
 {
     uint64_t res = 0;
     size_t i;
     uint32_t temp;
     for (i = 0; i < nwords; i++) {
         memcpy (&temp, p + i * sizeof (uint32_t), sizeof (temp));
         res += temp;
     }
     return res;
 }

into code similar to the original SSE version, except that it uses movdqu instead of movdqa. This instruction allows unaligned data, but its performance varies: on some processors it is much slower than movdqa even when the data is in fact aligned; on others it runs at practically the same speed.

Another difference in the generated code is that it doesn't even try to align the pointer: it uses movdqu on the original pointer even where it could have aligned it and then used movdqa. This means that this code, being more general, may end up slower than the original on some inputs.

This solution is completely portable and can be used anywhere, even on RISC architectures.

Combined solution


The first solution looks faster on our data (although we haven't measured it yet), while the second one is more portable. We can combine the two:

 #if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
 __attribute__ ((target ("no-sse")))
 #endif
 uint64_t sum3 (const char * p, size_t nwords)
 {
     uint64_t res = 0;
     size_t i;
     uint32_t temp;
     for (i = 0; i < nwords; i++) {
         memcpy (&temp, p + i * sizeof (uint32_t), sizeof (temp));
         res += temp;
     }
     return res;
 }

This code compiles into a nice non-SSE loop on GCC/Intel, and still produces working (and reasonably good) code on other architectures. It is the version I'm going to use in my project.

The x86 code produced is identical to that obtained from sum1 .

Speed measurement


We have seen that the compiler has every right to generate code using movdqa. How good is that decision performance-wise? Let's measure all of our solutions. First we do it on fully aligned data (the pointer aligned on a 16-byte boundary). The values in the table are nanoseconds per word added.

 Size, words   sum0 (movdqa)   sum1 (loop)   sum2 (movdqu)   sum3 (loop, memcpy)
 1             2.91            1.95          2.90            1.94
 5             0.84            0.79          0.77            0.79
 16            0.46            0.45          0.41            0.46
 1024          0.24            0.46          0.26            0.48
 65536         0.24            0.45          0.24            0.45
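For reference, per-word figures like those in the table can be obtained with a harness along the following lines (a sketch of mine, not the author's benchmark; it assumes it is linked against a file defining sum3, and the buffer size and repetition count are arbitrary):

 #include <stdint.h>
 #include <stddef.h>
 #include <stdio.h>
 #include <time.h>

 extern uint64_t sum3 (const char * p, size_t nwords);   /* version under test */

 static char buf [65536 * 4 + 16];    /* room for the largest test plus offsets */

 int main (void)
 {
     const size_t nwords = 16;         /* one of the sizes from the table */
     const size_t reps   = 10000000;   /* enough repetitions to be stable */
     struct timespec t0, t1;
     volatile uint64_t sink = 0;       /* keeps the calls from being optimised away */

     clock_gettime (CLOCK_MONOTONIC, &t0);
     for (size_t r = 0; r < reps; r++)
         sink += sum3 (buf, nwords);
     clock_gettime (CLOCK_MONOTONIC, &t1);

     double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
     printf ("%.2f ns/word (checksum %llu)\n",
             ns / ((double) reps * nwords), (unsigned long long) sink);
     return 0;
 }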

The table confirms that on a very small number of words (1) plain loops are faster than the SSE-based versions, although the difference isn't huge (about 1 nanosecond per word, and there is only one word here).

SSE is much faster on large word counts (1024 and more), where the overall gain can be quite significant.

On medium-sized inputs (such as 16 words) the speed is nearly the same, with a slight edge to SSE ( movdqu ).

Let's run the test on every size between 1 and 16 and see where the break-even point is. The sum1 (non-SSE loop) and sum3 versions produce very similar results (which is expected, since the code is identical; the difference shows the measurement error, around 0.02 ns). That is why only the final version ( sum3 ) is shown on the graph.



We can see that plain loops beat the SSE versions for word counts up to 3, after which the SSE versions take over (with the movdqu version usually faster than the original movdqa ).

I think that, in the absence of any additional information, the compiler is right to assume that an arbitrary loop will run more than three times, so the decision to use SSE was perfectly sound. But why didn't it go straight for the movdqu option? Is there any reason to use movdqa at all?

We have already seen that on aligned data the movdqu version runs at the same speed as movdqa for large word counts, and faster for small ones. The latter is explained by the smaller number of instructions before the loop (no alignment check is needed). What happens if we run the test on unaligned data? Here are the results for a few offsets:

               Offset 0                    Offset 1            Offset 4
 Size, words   movdqa  movdqu  loop        movdqu  loop        movdqa  movdqu  loop
 1             2.91    2.90    1.94        2.93    1.94        2.90    2.90    1.94
 5             0.84    0.77    0.79        0.77    0.79        0.84    0.79    0.78
 16            0.46    0.41    0.46        0.42    0.46        0.52    0.40    0.46
 1024          0.24    0.26    0.48        0.26    0.51        0.25    0.25    0.47
 65536         0.24    0.24    0.45        0.25    0.50        0.24    0.24    0.46

As you can see, alignment causes only minor changes in speed, with one exception: the movdqa version slows down a little (0.52 ns instead of 0.46 ns) at offset 4 with 16 words. The plain loop is still the best solution for small word counts, and movdqu is the best for large ones. The compiler was therefore wrong to use movdqa. A possible explanation is that it is tuned for an older Intel processor model: the movdqu instruction used to run a bit slower than movdqa on older Xeons, even on fully aligned data. That no longer seems to be the case, so the compiler could be simplified (and its alignment requirements relaxed).

The original function


The original IP header checking function should now be rewritten like this:

 #if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
 __attribute__ ((target ("no-sse")))
 #endif
 _Bool check_ip_header_sum (const char * p, size_t size)
 {
     const uint32_t * q = (const uint32_t *) p;
     uint32_t temp;
     uint64_t sum = 0;
     memcpy (&temp, &q [0], 4); sum += temp;
     memcpy (&temp, &q [1], 4); sum += temp;
     memcpy (&temp, &q [2], 4); sum += temp;
     memcpy (&temp, &q [3], 4); sum += temp;
     memcpy (&temp, &q [4], 4); sum += temp;
     for (size_t i = 5; i < size / 4; i++) {
         memcpy (&temp, &q [i], 4);
         sum += temp;
     }
     do {
         sum = (sum & 0xFFFF) + (sum >> 16);
     } while (sum & ~0xFFFFL);
     return sum == 0xFFFF;
 }

If we are wary of converting an unaligned pointer to uint32_t* at all (the standard calls it undefined behaviour), the code looks like this:

 #if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
 __attribute__ ((target ("no-sse")))
 #endif
 _Bool check_ip_header_sum (const char * p, size_t size)
 {
     uint32_t temp;
     uint64_t sum = 0;
     memcpy (&temp, p, 4);      sum += temp;
     memcpy (&temp, p + 4, 4);  sum += temp;
     memcpy (&temp, p + 8, 4);  sum += temp;
     memcpy (&temp, p + 12, 4); sum += temp;
     memcpy (&temp, p + 16, 4); sum += temp;
     for (size_t i = 20; i < size; i += 4) {
         memcpy (&temp, p + i, 4);
         sum += temp;
     }
     do {
         sum = (sum & 0xFFFF) + (sum >> 16);
     } while (sum & ~0xFFFFL);
     return sum == 0xFFFF;
 }

Both versions look rather ugly, especially the second one; both remind me of writing in pure assembly. Nevertheless, this is exactly the right way to write portable C.

Interestingly, although in our tests the loop ran at the same speed as movdqu on 5 words, rewriting this function as a single loop from 0 to size made it slower: 0.83 ns per word instead of the usual 0.48 ns.

C++ version


C++ allows the same function to be written in a much more readable form with the help of a few templates. We introduce the template class const_unaligned_pointer :

 template<typename T> class const_unaligned_pointer
 {
     const char * p;
 public:
     const_unaligned_pointer () : p (0) {}
     const_unaligned_pointer (const void * p) : p ((const char*) p) {}

     T operator* () const
     {
         T tmp;
         memcpy (&tmp, p, sizeof (T));
         return tmp;
     }

     const_unaligned_pointer operator+ (ptrdiff_t d) const
     {
         return const_unaligned_pointer (p + d * sizeof (T));
     }

     T operator[] (ptrdiff_t d) const { return * (*this + d); }
 };

This is just the skeleton. A complete definition should also include an equality test, a minus operator for two pointers, a plus operator with the operands in the other order, some conversions, and probably a few other things, as sketched below.
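For completeness, those extra pieces could look roughly like this (my own sketch, not the article's code; the skeleton above is repeated so the snippet stands alone, and the extras are written as member functions so they can reach the private pointer):

 #include <string.h>
 #include <stddef.h>
 #include <stdint.h>
 #include <iostream>

 template<typename T> class const_unaligned_pointer
 {
     const char * p;
 public:
     const_unaligned_pointer () : p (0) {}
     const_unaligned_pointer (const void * q) : p ((const char*) q) {}
     T operator* () const { T tmp; memcpy (&tmp, p, sizeof (T)); return tmp; }
     const_unaligned_pointer operator+ (ptrdiff_t d) const
         { return const_unaligned_pointer (p + d * sizeof (T)); }
     T operator[] (ptrdiff_t d) const { return * (*this + d); }

     /* the extra pieces mentioned in the text */
     bool operator== (const const_unaligned_pointer & other) const
         { return p == other.p; }
     bool operator!= (const const_unaligned_pointer & other) const
         { return p != other.p; }
     ptrdiff_t operator- (const const_unaligned_pointer & other) const
         { return (p - other.p) / (ptrdiff_t) sizeof (T); }   /* distance in elements */
     const_unaligned_pointer & operator+= (ptrdiff_t d)
         { p += d * sizeof (T); return *this; }
 };

 int main (void)
 {
     char buf [9] = {0};
     const_unaligned_pointer<uint32_t> a (buf + 1), b (buf + 5);
     std::cout << (b - a) << " " << (a == b) << "\n";   // prints "1 0"
     return 0;
 }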

With this template our function looks very close to where we started:

 bool check_ip_header_sum (const char * p, size_t size)
 {
     const_unaligned_pointer<uint32_t> q (p);
     uint64_t sum = 0;
     sum += q[0];
     sum += q[1];
     sum += q[2];
     sum += q[3];
     sum += q[4];
     for (size_t i = 5; i < size / 4; i++) {
         sum += q[i];
     }
     do {
         sum = (sum & 0xFFFF) + (sum >> 16);
     } while (sum & ~0xFFFFL);
     return sum == 0xFFFF;
 }

This produces exactly the same assembly code as the memcpy version and, obviously, runs at the same speed.

A few more templates


Our code only reads unaligned data, so the const_unaligned_pointer class is sufficient. What if we wanted to write it as well? That can be done too, but this time we need two classes: one for the pointer and one for the l-value obtained when that pointer is dereferenced:

 template<typename T> class unaligned_ref
 {
     void * p;
 public:
     unaligned_ref (void * p) : p (p) {}

     T operator= (const T& rvalue)
     {
         memcpy (p, &rvalue, sizeof (T));
         return rvalue;
     }

     operator T () const
     {
         T tmp;
         memcpy (&tmp, p, sizeof (T));
         return tmp;
     }
 };

 template<typename T> class unaligned_pointer
 {
     char * p;
 public:
     unaligned_pointer () : p (0) {}
     unaligned_pointer (void * p) : p ((char*) p) {}

     unaligned_ref<T> operator* () const { return unaligned_ref<T> (p); }

     unaligned_pointer operator+ (ptrdiff_t d) const
     {
         return unaligned_pointer (p + d * sizeof (T));
     }

     unaligned_ref<T> operator[] (ptrdiff_t d) const { return * (*this + d); }
 };

Again, this code only demonstrates the idea; a lot more is needed to make it production-ready. Let's run a simple test:

 char mem [5];

 void dump ()
 {
     std::cout << (int) mem [0] << " " << (int) mem [1] << " " << (int) mem [2]
               << " " << (int) mem [3] << " " << (int) mem [4] << "\n";
 }

 int main (void)
 {
     dump ();
     unaligned_pointer<int> p (mem + 1);
     int r = *p;
     r++;
     *p = r;
     dump ();
     return 0;
 }

Here is its output:

 0 0 0 0 0
 0 1 0 0 0

We could have written

  ++ *p; 

but this requires defining operator++ in unaligned_ref .
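One way it could look (again a sketch of mine, with a minimal unaligned_ref repeated so the example compiles on its own): read the value through memcpy, increment it, and write it back.

 #include <string.h>
 #include <iostream>

 template<typename T> class unaligned_ref
 {
     void * p;
 public:
     unaligned_ref (void * q) : p (q) {}

     T operator= (const T& rvalue) { memcpy (p, &rvalue, sizeof (T)); return rvalue; }
     operator T () const { T tmp; memcpy (&tmp, p, sizeof (T)); return tmp; }

     /* prefix ++: read, modify, write back, all through memcpy */
     T operator++ ()
     {
         T tmp;
         memcpy (&tmp, p, sizeof (T));
         ++ tmp;
         memcpy (p, &tmp, sizeof (T));
         return tmp;
     }
 };

 int main (void)
 {
     char mem [5] = {0};
     unaligned_ref<int> r (mem + 1);
     ++ r;                                 /* what ++ *p would do in the text */
     std::cout << (int) mem [1] << "\n";   /* prints 1 on a little-endian machine */
     return 0;
 }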

Conclusions



Update


In the /r/cpp discussion, user OldWolf2 noticed that the checksum verification code contains a bug in its last line:

  } while (sum & ~0xFFFFL); 

He is right: 0xFFFFL has type unsigned long, which isn't necessarily the same as uint64_t. long may be 32 bits wide, in which case the bit inversion (one's complement) happens before widening to 64 bits, and the actual constant used in the test becomes 0x00000000FFFF0000. It is easy to construct input on which such a test misbehaves, for example an array of two words: 0xFFFFFFFF and 0x00000001.

We can either take the complement after converting to 64 bits:

  } while (sum & ~(uint64_t) 0xFFFF); 

Or simply replace it with a comparison:

  } while (sum > 0xFFFF); 

Interestingly, GCC produces more concise code in the second case. Here is the mask-test version:

 .L15:
         movzwl  %ax, %edx
         shrq    $16, %rax
         addq    %rdx, %rax
         movq    %rax, %rdx
         xorw    %dx, %dx
         testq   %rdx, %rdx
         jne     .L15

And here is the comparison version:

 .L44:
         movzwl  %ax, %edx
         shrq    $16, %rax
         addq    %rdx, %rax
         cmpq    $65535, %rax
         ja      .L44

Comments are welcome below or on reddit .

Source: https://habr.com/ru/post/318456/

