πŸ“œ ⬆️ ⬇️

Not so easy to reset arrays in VC ++ 2015

What is the difference between these two definitions of initialized local C / C ++ variables?

char buffer[32] = { 0 }; char buffer[32] = {}; 

One difference is that the first is valid in C and C ++, and the second is only in C ++.

Well, then let's focus on C ++. What do these two definitions mean ?
')
The first one says: the compiler must set the value of the first element of the array to zero and then (roughly speaking) initialize the remaining elements of the array to zero. The second means that the compiler must initialize the entire array with zeros.

These definitions are somewhat different, but in fact the result is one β€” the entire array must be initialized with zeros. Therefore, according to the β€œas-if” rule in C ++, they are the same. That is, any sufficiently modern optimizer should generate identical code for each of these fragments. Right?

But sometimes the differences in these definitions matter. If (hypothetically) the compiler takes these definitions to the highest degree literally, then for the first case such code will be generated:

  1: buffer[0] = 0; memset(buffer + 1, 0, 31); 

while for the second case the code will be as follows:

  2: memset(buffer, 0, 32); 

And if the optimizer does not notice that these two statements can be combined, the compiler can generate a less efficient code for the first definition than for the second.

If the compiler literally implemented algorithm 1, then the zero value will be assigned to the first data byte, then (if the processor is 64-bit) three write operations of 8 bytes will be performed. To fill the remaining seven bytes, 3 more write operations may be required: first write 4 bytes, then 2 and then 1 more byte.

Well, this is hypothetical . This is exactly how VC ++ works. For 64-bit builds, the typical code generated for "= {0}" looks like this:

 xor eax, eax mov BYTE PTR buffer$[rsp+0], 0 mov QWORD PTR buffer$[rsp+1], rax mov QWORD PTR buffer$[rsp+9], rax mov QWORD PTR buffer$[rsp+17], rax mov DWORD PTR buffer$[rsp+25], eax mov WORD PTR buffer$[rsp+29], ax mov BYTE PTR buffer$[rsp+31], al 

Graphically, it looks like this (almost all write operations are not aligned):

image

But if you omit zero, VC ++ will generate the following code:

 xor eax, eax mov QWORD PTR buffer$[rsp], rax mov QWORD PTR buffer$[rsp+8], rax mov QWORD PTR buffer$[rsp+16], rax mov QWORD PTR buffer$[rsp+24], rax 

What looks like this:

image

The second command sequence is shorter and faster. The speed difference is usually difficult to measure, but in any case you should prefer a more compact and fast code. Code size affects performance at all levels (network, disk, cache), so extra code bytes are undesirable.

This is generally not important, it probably will not even have any noticeable effect on the size of real programs. But personally, I find the code generated for β€œ= {0};” rather amusing. Equivalent to the constant use of "uh" in public speaking.

I first noticed this behavior and reported it six years ago, and recently discovered that this problem is still present in VC ++ 2015 Update 3. I was curious and I wrote a small Python script to compile the code below using various array sizes and various optimization options for x86 and x64 platforms:

 void ZeroArray1() { char buffer[BUF_SIZE] = { 0 }; printf(β€œ   .%s\n”, buffer); } void ZeroArray2() { char buffer[BUF_SIZE] = {}; printf(β€œ   .%s\n”, buffer); } 

The graph below shows the size of these two functions in one specific platform configuration β€” size optimization for a 64-bit build β€” versus BUF_SIZE values ​​ranging from 1 to 32 (when BUF_SIZE value exceeds 32, the size of the code variants are the same):

image

In cases where the BUF_SIZE value is 4, 8, and 32, the memory savings are impressive β€” the code size decreases by 23.8%, 17.6%, and 20.5%, respectively. The average amount of saved memory is 5.4%, and this is quite significant, considering that all these functions have a common epilogue code, a prologue and a call to printf .

Here I would like to recommend that all C ++ programmers, when initializing structures and arrays, give preference to β€œ= {};” instead of = β€œ= {0};”. In my opinion, it is better from an aesthetic point of view, and it seems that it almost always generates a shorter code.

But it is "almost." The results above demonstrate that there are several cases in which β€œ= {0};” generates more optimal code. For single and double-byte formats, β€œ= {0};” immediately writes a zero to the array (according to the command), while β€œ= {};” resets the register and only then creates such a record. For the 16-byte format, β€œ= {0};” uses the SSE register to zero all bytes at the same time β€” I don’t know why this method is not used more often in compilers.

So, before recommending anything, I considered it my duty to test various optimization settings for 32-bit and 64-bit systems. Main results:

32-bit with / O1 / Oy-: The average memory saving from 1 to 32 is 3.125 bytes, 5.42%.
32-bit c / O2 / Ou: The average memory saving from 1 to 40 is 2.075 bytes, 3.29%.
32-bit c / o2: The average memory saving from 1 to 40 is 1,150 bytes, 1.79%.
64-bit c / o1: The average memory saving from 1 to 32 is 3,844 bytes, 5.45%
64-bit w / O2: The average memory saving from 1 to 32 is 3.688 bytes, 5.21%.

The problem is that the result for 32-bit / O2 / Ou, where "= {};" is an average of 2.075 bytes more than with "= {0};". This is true for values ​​from 32 to 40, where the code β€œ= {};” is usually 22 bytes more! The reason is that the code β€œ= {};” uses the β€œmovaps” commands instead of the β€œmovups” to reset the array. This means that he has to use a lot of commands only to ensure that the stack is aligned to 16 bytes. Here is bad luck.

image

findings


I still recommend that C ++ programmers give preference to β€œ= {};”, although some somewhat contradictory results show that the advantage provided by this option is insignificant.
It would be nice if the VC ++ optimizer generated identical code for these two components, and it would be great if this code was always perfect. Maybe someday it will be?

I would like to know why the VC ++ optimizer is so inconsistent when deciding when to use 16-byte SSE registers to reset the memory. On 64-bit systems, this register is used only for 16-byte buffers initialized with β€œ= {0};”, although a more compact code is usually generated with SSE.

I think these difficulties with code generation are also characteristic of a more serious problem related to the fact that adjacent initializers of aggregates are not merged. However, I have already devoted a lot of time to this issue, so I will leave it at the level of theory.

Here I reported this bug, and the Python script can be found here .

Please note that the code below, which should also be equivalent, in any case generates even worse code than ZeroArray1 and ZeroArray2.

 char buffer[32] = β€œβ€; 

Although I did not test it myself, I heard that the gcc and clang compilers did not fall for the bait β€œ= {0};”.

In earlier versions of VC ++ 2010, the problem was more serious. In some cases, a memset call was used, and β€œ= {0};” guaranteed incorrect address alignment under any circumstances. In earlier versions of VC ++ 2010 CRT, with incorrect alignment, the last 128 bytes of data were recorded four times slower (using the command stosb instead of stosd). This was quickly fixed.

Translated by ABBYY LS

Source: https://habr.com/ru/post/307920/


All Articles