
New optimizations for x86 in the upcoming GCC 5.0

Active development of new optimizations for GCC 5.0 can now be considered complete: GCC 5.0 has entered stage 3, the phase in which the already implemented optimizations are refined. In this and the following articles I will discuss the optimizations implemented in GCC 5.0 for x86 and their impact on program performance for the Intel Atom and Intel Core processor lines. Today we will talk about vectorization of grouped memory accesses. In the following articles I will cover speedups in 32-bit PIC mode and further vectorization gains.

As you have probably guessed, vectorization of grouped memory accesses has been significantly improved in GCC 5.0. A group of memory accesses is a sequence of accesses that is iterated in a loop. For example:

x = a[i], y = a[i + 1], z = a[i + 2] 

iterated over i is a group of loads of length 3. The length of a group of memory accesses is the distance between the lowest and the highest address in the group. For the example above it is (i + 2) - (i) + 1 = 3.
The number of memory accesses in a group does not exceed its length. For example:

 x = a[i], z = a[i + 2] 

iterated over i is a group of length 3, even though there are only 2 memory references.
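
To make this concrete, here is a minimal sketch of a loop containing such a group of loads (the array, the step of i, and the accumulation are illustrative assumptions, not code from the article):

 /* A minimal sketch: each iteration reads three consecutive elements,
    i.e. a load group of length 3; dropping the middle load would leave
    only 2 references, but the group length would still be 3. */
 int sum_groups(const int *a, int n)
 {
   int s = 0;
   for (int i = 0; i + 2 < n; i += 3)
     {
       int x = a[i];        /* lowest address in the group */
       int y = a[i + 1];
       int z = a[i + 2];    /* highest address: length = (i + 2) - i + 1 = 3 */
       s += x + y + z;
     }
   return s;
 }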
GCC 4.9 vectorizes groups whose length is a power of 2 (2, 4, 8, ...).

GCC 5.0 vectorizes groups whose length is 3 or a power of 2 (2, 4, 8, ...). Other lengths are not vectorized, since they are very rare in real applications.

Most often, vectorization of grouped memory accesses is useful when working with arrays of structures:

1. Converting images, say, those stored as RGB structures. An example.
2. Working with N-dimensional coordinates, say, normalizing three-dimensional points (a sketch follows this list). An example.
3. Multiplying vectors by a constant matrix:

 a[i][0] = 7 * b[i][0] - 3 * b[i][1];
 a[i][1] = 2 * b[i][0] + b[i][1];
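
As an illustration of case 2 above, a hedged sketch might look like this (the struct layout and the function are assumptions, not code from the article); each iteration loads p[i].x, p[i].y, p[i].z and stores them back, forming load and store groups of length 3:

 #include <math.h>

 struct point { float x, y, z; };

 /* Normalize an array of three-dimensional points (points are assumed to be
    non-zero).  The three field reads per iteration form a load group of
    length 3, and the three field writes form a store group of length 3. */
 void normalize(struct point *p, int n)
 {
   for (int i = 0; i < n; i++)
     {
       float len = sqrtf(p[i].x * p[i].x + p[i].y * p[i].y + p[i].z * p[i].z);
       p[i].x /= len;
       p[i].y /= len;
       p[i].z /= len;
     }
 }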

Overall, compared with 4.9, GCC 5.0 adds vectorization of grouped memory accesses (both loads and stores) of length 3.


The tables below estimate the performance gain from GCC 5.0 on byte structures (the largest number of elements per vector) compared with GCC 4.9. The following loop is used for the evaluation:

 int i, j, k;
 byte *in = a, *out = b;
 for (i = 0; i < 1024; i++)
   {
     for (k = 0; k < STGSIZE; k++)
       {
         byte s = 0;
         for (j = 0; j < LDGSIZE; j++)
           s += in[j] * c[j][k];
         out[k] = s;
       }
     in += LDGSIZE;
     out += STGSIZE;
   }

Where:

 const byte c[8][8] = {{ 1, -1,  1, -1,  1, -1,  1, -1},
                       { 1,  1, -1, -1,  1,  1, -1, -1},
                       { 1,  1,  1,  1, -1, -1, -1, -1},
                       {-1,  1, -1,  1, -1,  1, -1,  1},
                       {-1, -1,  1,  1, -1, -1,  1,  1},
                       {-1, -1, -1, -1,  1,  1,  1,  1},
                       {-1, -1, -1,  1,  1,  1, -1,  1},
                       { 1, -1,  1,  1,  1, -1, -1, -1}};

This matrix is chosen so that the computation inside the loop reduces to relatively fast additions and subtractions.


Compile options: "-Ofast" plus "-march=slm" for Silvermont and "-march=core-avx2" for Haswell, with all combinations of -DLDGSIZE={1,2,3,4,8} and -DSTGSIZE={1,2,3,4,8}.
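
For reference, here is a self-contained sketch of the benchmark assembled from the snippets above (the byte typedef, the array sizes, the file name in the build comment, and the main() driver are assumptions; the kernel and the matrix c are taken from the text):

 #include <stdio.h>

 /* Build, e.g.: gcc -Ofast -march=slm -DLDGSIZE=3 -DSTGSIZE=3 bench.c */

 typedef unsigned char byte;             /* assumed element type */

 #ifndef LDGSIZE
 #define LDGSIZE 3
 #endif
 #ifndef STGSIZE
 #define STGSIZE 3
 #endif

 static byte a[1024 * LDGSIZE];          /* input:  1024 load groups  */
 static byte b[1024 * STGSIZE];          /* output: 1024 store groups */

 static const byte c[8][8] = {{ 1, -1,  1, -1,  1, -1,  1, -1},
                              { 1,  1, -1, -1,  1,  1, -1, -1},
                              { 1,  1,  1,  1, -1, -1, -1, -1},
                              {-1,  1, -1,  1, -1,  1, -1,  1},
                              {-1, -1,  1,  1, -1, -1,  1,  1},
                              {-1, -1, -1, -1,  1,  1,  1,  1},
                              {-1, -1, -1,  1,  1,  1, -1,  1},
                              { 1, -1,  1,  1,  1, -1, -1, -1}};

 /* The measured kernel: an LDGSIZE-wide load group and an STGSIZE-wide
    store group on every iteration of the outer loop. */
 static void kernel(void)
 {
   int i, j, k;
   byte *in = a, *out = b;
   for (i = 0; i < 1024; i++)
     {
       for (k = 0; k < STGSIZE; k++)
         {
           byte s = 0;
           for (j = 0; j < LDGSIZE; j++)
             s += in[j] * c[j][k];
           out[k] = s;
         }
       in += LDGSIZE;
       out += STGSIZE;
     }
 }

 int main(void)
 {
   int i;
   for (i = 0; i < 1024 * LDGSIZE; i++)
     a[i] = (byte) i;                    /* arbitrary input data */
   kernel();
   printf("b[0] = %d\n", b[0]);          /* keep the result observable */
   return 0;
 }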

Performance gain of GCC 5.0 over 4.9 (the speedup factor: higher is better).

Silvermont: Intel Atom(TM) CPU C2750 @ 2.41GHz

A gain of up to 6.5 times!

[Table: speedup of GCC 5.0 over GCC 4.9 on Silvermont for all LDGSIZE/STGSIZE combinations]

As the table shows, the results for groups of length 3 are not very good. The reason is that this transformation requires 8 "pshufb" instructions, each taking about 5 cycles on Silvermont. Even so, vectorizing the remaining operations in the loop can still yield a larger gain, as can be seen for load groups of length 2, 3, 4 and 8.

Example of GCC assembly for a load group of length 2:

GCC 4.9:

 movdqa .LC0(%rip), %xmm7
 movdqa .LC1(%rip), %xmm6
 movdqa .LC2(%rip), %xmm5
 movdqa .LC3(%rip), %xmm4
 movdqu a(%rax,%rax), %xmm1
 movdqu a+16(%rax,%rax), %xmm0
 movdqa %xmm1, %xmm3
 pshufb %xmm7, %xmm3
 movdqa %xmm0, %xmm2
 pshufb %xmm5, %xmm1
 pshufb %xmm6, %xmm2
 pshufb %xmm4, %xmm0
 por %xmm2, %xmm3
 por %xmm0, %xmm2

GCC 5.0:

 movdqa .LC0(%rip), %xmm3
 movdqu a(%rax,%rax), %xmm0
 movdqu a+16(%rax,%rax), %xmm1
 movdqa %xmm3, %xmm4
 pand %xmm0, %xmm3
 psrlw $8, %xmm0
 pand %xmm1, %xmm4
 psrlw $8, %xmm1
 packuswb %xmm4, %xmm3
 packuswb %xmm1, %xmm0
Here, as in the Haswell example below, it is worth noting that most of the ".LC" constants will be loop invariants, but only if there are enough free registers. Otherwise they have to be referenced directly as memory operands of pshufb: "pshufb .LC0, %xmm3". Such a pshufb is larger by the size of the address and may take longer to execute.

Haswell: Intel Core(TM) i7-4770K CPU @ 3.50GHz

A gain of up to 3 times!

[Table: speedup of GCC 5.0 over GCC 4.9 on Haswell for all LDGSIZE/STGSIZE combinations]

Here the results for groups of length 3 are much better, since on Haswell the "pshufb" instruction executes in 1 cycle. Moreover, the largest gain comes precisely from the newly implemented vectorization of memory access groups of length 3.

Example of GCC assembly for a load group of length 2:

GCC 4.9:

 vmovdqa .LC0(%rip), %ymm7
 vmovdqa .LC1(%rip), %ymm6
 vmovdqa .LC2(%rip), %ymm5
 vmovdqa .LC3(%rip), %ymm4
 vmovdqu a(%rax,%rax), %ymm0
 vmovdqu a+32(%rax,%rax), %ymm2
 vpshufb %ymm7, %ymm0, %ymm1
 vpshufb %ymm5, %ymm0, %ymm0
 vpshufb %ymm6, %ymm2, %ymm3
 vpshufb %ymm4, %ymm2, %ymm2
 vpor %ymm3, %ymm1, %ymm1
 vpor %ymm2, %ymm0, %ymm0
 vpermq $216, %ymm1, %ymm1
 vpermq $216, %ymm0, %ymm0

GCC 5.0:

 vmovdqa .LC0(%rip), %ymm3
 vmovdqu a(%rax,%rax), %ymm0
 vmovdqu a+32(%rax,%rax), %ymm2
 vpand %ymm0, %ymm3, %ymm1
 vpsrlw $8, %ymm0, %ymm0
 vpand %ymm2, %ymm3, %ymm4
 vpsrlw $8, %ymm2, %ymm2
 vpackuswb %ymm4, %ymm1, %ymm1
 vpermq $216, %ymm1, %ymm1
 vpackuswb %ymm2, %ymm0, %ymm0
 vpermq $216, %ymm0, %ymm0
The conclusion from all of the above: with GCC 5.0 you can significantly speed up applications similar to those described here. You can start trying it out now! Most of these changes are planned to be ported to the Android NDK.

Compilers used in measurements:

The example used for the measurements can be downloaded from the original English version of this article.

Source: https://habr.com/ru/post/244137/
