Go 1.11: AVX-512 with Go

In Go 1.11, the assembler for the x86 platform has been significantly updated.

Programmers can now use AVX-512, the latest instruction set extension available in Intel processors.

In this article:

- The most significant updates in cmd/asm (go tool asm)
- How the new instruction set was implemented in the Go assembler
- Using the new instructions and the special features of the EVEX prefix
- The level of integration into the toolchain (with recipes for working around current restrictions)

What's new?

Changes visible to the programmer:

AVX-512

- New vector registers: X16-X31 , Y16-Y31 and Z0-Z31
- Added mask registers: K0-K7
- Special features of the EVEX prefix (see below: rounding, zeroing, ...).
- Hundreds of new instructions (379 new opcodes + AVX {1,2} instructions with an EVEX prefix).

Added 110 missing legacy instructions (CL97235).
Up to 25% faster assembly (CL108895); overall, assembly is about 1.5% faster.

Preliminary work was also done to improve error messages (CL108515), but it will not make it into the go1.11 release.

Beyond the new extensions themselves, it is important that in the new assembler all VEX and EVEX tables are generated automatically.

Go now has an x86 assembler that does not need new instructions to be added by hand.

Encoder in the Go assembler

The part of the assembler responsible for generating machine code lives in the standard cmd/internal/obj/x86 package.

Most of the code in it is the Plan 9 x86 assembler source code translated from C.

Assembler tables conceptually consist of 3 dimensions: X, Y, and Z.
A specific instruction is generated as encode(X, Y, Z).
An alternative mental model is table[X][Y][Z], but it is further from the implementation details.

From the opcode space (dimension X), the optab object corresponding to the instruction being assembled is selected. Then, from the list of available operand combinations (dimension Y), the ytab object matching the instruction's arguments is selected. The final step is choosing the code generation scheme: the Z-case.
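The three-step selection can be pictured with a small Go sketch. All type and function names here (opEntry, yrow, lookup, the Z-case strings) are hypothetical and only mirror the X/Y/Z idea; the real tables in cmd/internal/obj/x86 are organized differently.

```go
package main

import "fmt"

// zcase names a code generation scheme (dimension Z). Hypothetical.
type zcase string

// yrow is one allowed operand combination (dimension Y) and the
// Z-case it selects. Hypothetical.
type yrow struct {
	args  []string // operand classes, e.g. "mem", "zmm"
	zcase zcase
}

// opEntry plays the role of an optab entry: an opcode byte plus the
// list of operand combinations the instruction accepts. Hypothetical.
type opEntry struct {
	opcode byte
	ytab   []yrow
}

// table is dimension X: it is indexed by instruction name.
var table = map[string]opEntry{
	"VADDPD": {opcode: 0x58, ytab: []yrow{
		{args: []string{"zmm", "zmm", "zmm"}, zcase: "evex-rvm"},
		{args: []string{"mem", "zmm", "zmm"}, zcase: "evex-rvm-mem"},
	}},
}

// sameArgs reports whether two operand class lists are identical.
func sameArgs(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// lookup walks X (instruction name), then Y (operand classes), and
// returns the opcode together with the selected Z-case.
func lookup(name string, args ...string) (byte, zcase, error) {
	op, ok := table[name]
	if !ok {
		return 0, "", fmt.Errorf("unknown instruction %s", name)
	}
	for _, row := range op.ytab {
		if sameArgs(row.args, args) {
			return op.opcode, row.zcase, nil
		}
	}
	return 0, "", fmt.Errorf("%s: invalid operand combination %v", name, args)
}

func main() {
	opcode, z, err := lookup("VADDPD", "mem", "zmm", "zmm")
	fmt.Printf("opcode=%#x zcase=%s err=%v\n", opcode, z, err)
}
```

An unknown instruction fails at the X step, and a known instruction with unsupported operands fails at the Y step, which is roughly where an "invalid instruction" diagnostic would come from.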

It is easy to find constants in the code that have Y and Z prefixes, but there is nothing with the X prefix.

Funny note

There is a hypothesis that these were originally A, B, and C prefixes; B and C were later renamed to Y and Z, while opcodes kept the A prefix.

It is also funny that the type of the A constants is obj.As, which may be short for asm (assembler opcode) or may simply be the plural of A.

Previously, instructions were added to the Go x86 assembler by hand, following these steps:

- Adding a new constant to aenum.go.
- Adding an optab entry to the global x86 assembler table.
- Selecting or adding the required ytab list.
- Adding end2end tests for the new instructions.

If we already have all the necessary A, Y, and Z constants, all that remains is to generate the encoder tables and the tests.

This process is well automated if we have a source from which to read information about instructions: their encoding, the types of allowed operands, and so on.
Fortunately, we have such a source.

x86avxgen and Intel XED

To generate all the instructions that use VEX and EVEX prefixes, the x86avxgen utility was written. This program generates those same optab and ytab objects for the assembler.

The program's input is the XED datafiles, which you can work with from Go using the xeddata package.

The advantage of code generation is that implementing new instructions from the AVX family only requires re-running x86avxgen and adding tests.
Test generation is also automated, using the Intel XED encoder (XED is primarily a library).
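As a rough illustration of the table-generation idea, the sketch below groups instruction descriptions by name and prints one table line per instruction. The insn type, its fields, and the output format are invented for this example; the real x86avxgen reads XED datafiles through the xeddata package and emits real optab and ytab definitions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// insn is a hypothetical, heavily simplified instruction description.
// Real XED datafiles carry far more detail than this.
type insn struct {
	name     string
	operands string // operand classes joined by commas
	opcode   byte
}

// generate groups descriptions by instruction name and renders one
// table line per instruction, sorted so the output is reproducible.
func generate(insns []insn) string {
	forms := make(map[string][]string)
	opcodes := make(map[string]byte)
	for _, in := range insns {
		forms[in.name] = append(forms[in.name], in.operands)
		opcodes[in.name] = in.opcode
	}

	names := make([]string, 0, len(forms))
	for name := range forms {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		// The line format is made up for this sketch.
		fmt.Fprintf(&b, "{as: A%s, op: 0x%02x, forms: %q},\n",
			name, opcodes[name], forms[name])
	}
	return b.String()
}

func main() {
	fmt.Print(generate([]insn{
		{name: "VPORD", operands: "mem,zmm,zmm", opcode: 0xeb},
		{name: "VADDPD", operands: "zmm,zmm,zmm", opcode: 0x58},
		{name: "VADDPD", operands: "mem,zmm,zmm", opcode: 0x58},
	}))
}
```

Because the output depends only on the input descriptions, supporting a new instruction in this model means adding data and re-running the generator rather than editing tables by hand.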

EVEX has a lot of free opcode space and room for extensions, so new instructions will definitely appear.
You can peek at what is coming in the near future in the ISA-extensions document.

Syntax

In addition to the code generator tables themselves, the parser has been updated.
Now for x86 you can use register lists and opcode suffixes.

 VP4DPWSSD zmm1{k1}{z}, zmm2+3, m128

In this case, +3 means that the second zmm operand describes a range of 4 registers (the manual calls such a range a "register block").

The range for Z0+3 in Go assembly will look like this:

 VP4DPWSSD Z25, [Z0-Z3], (AX)

Using ranges such as [Z0-Z1], [Z3-Z0], or [AX-DX] is an error at the assembly stage.
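The rejected examples suggest a simple validity rule, sketched below in Go. The rule itself is an assumption inferred from the examples in the text (both ends are vector registers of the same class and the range spans exactly 4 registers), and parseRegBlock and splitReg are hypothetical helpers, not the assembler's own functions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitReg splits a vector register name such as "Z25" into its class
// letter ('X', 'Y' or 'Z') and its index (0-31).
func splitReg(r string) (class byte, idx int, err error) {
	if len(r) < 2 || !strings.ContainsRune("XYZ", rune(r[0])) {
		return 0, 0, fmt.Errorf("%q: not a vector register", r)
	}
	idx, err = strconv.Atoi(r[1:])
	if err != nil || idx < 0 || idx > 31 {
		return 0, 0, fmt.Errorf("%q: bad register index", r)
	}
	return r[0], idx, nil
}

// parseRegBlock parses a register list such as "[Z0-Z3]" and returns
// the index of its first register. Assumed rule: same register class
// on both ends and exactly 4 registers in ascending order.
func parseRegBlock(s string) (int, error) {
	if !strings.HasPrefix(s, "[") || !strings.HasSuffix(s, "]") {
		return 0, fmt.Errorf("%q: not a register list", s)
	}
	parts := strings.SplitN(s[1:len(s)-1], "-", 2)
	if len(parts) != 2 {
		return 0, fmt.Errorf("%q: expected [lo-hi]", s)
	}
	loClass, lo, err := splitReg(parts[0])
	if err != nil {
		return 0, err
	}
	hiClass, hi, err := splitReg(parts[1])
	if err != nil {
		return 0, err
	}
	if loClass != hiClass {
		return 0, fmt.Errorf("%q: mixed register classes", s)
	}
	if hi-lo != 3 {
		return 0, fmt.Errorf("%q: range must span exactly 4 registers", s)
	}
	return lo, nil
}

func main() {
	for _, s := range []string{"[Z0-Z3]", "[Z0-Z1]", "[Z3-Z0]", "[AX-DX]"} {
		first, err := parseRegBlock(s)
		fmt.Println(s, first, err)
	}
}
```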

Suffixes are used to activate special features of AVX-512.
For example, take one of the new forms of the VADDPD instruction:

 VADDPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}

Now let us look at what all this magic of {k1}, {z}, m64bcst and {er} means.

Please note: the order of the operands is the exact reverse of Intel syntax,
just like in the GNU assembler (AT&T syntax).

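 // No suffixes: a "plain" VADDPD.
 VADDPD (AX), Z30, Z10

 // {k1} - merging with a mask taken from a K register.
 VADDPD (AX), Z30, K5, Z10

 // When a K mask argument is given,
 // the masking scheme is either merging or zeroing.
 // {z} - zeroing-mask (the default is merging-mask).
 VADDPD.Z (AX), Z30, Z10

 // m64bcst - enables embedded broadcasting.
 // The "bcst" name is the one used by the Microsoft assembler (MASM).
 VADDPD.BCST (AX), Z30, Z10

 // {er} - embedded rounding. Not available with a memory operand.
 // Rounding always implies SAE (see below).
 VADDPD.RU_SAE Z0, Z30, Z10 // round toward +Inf
 VADDPD.RD_SAE Z0, Z30, Z10 // round toward -Inf
 VADDPD.RZ_SAE Z0, Z30, Z10 // round toward zero
 VADDPD.RN_SAE Z0, Z30, Z10 // round to nearest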

More interestingly, the Z suffix, if the instruction supports it, can be used in conjunction with other suffixes:

 // SAE - "suppress all exceptions".
 // Written as {sae} in the manual.
 VMAXPD.SAE.Z Z3, Z2, Z1
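To make the suffix notation concrete, here is a small Go sketch that splits a mnemonic such as VMAXPD.SAE.Z into its base opcode and a set of flags. The opSuffixes type and the splitOpcode function are hypothetical and do not reflect how the real parser stores suffixes; the sketch also does not check whether a given instruction actually supports a suffix.

```go
package main

import (
	"fmt"
	"strings"
)

// opSuffixes is a hypothetical decoded form of an opcode suffix list.
type opSuffixes struct {
	rounding  string // "RN", "RD", "RU", "RZ", or "" when absent
	sae       bool   // suppress all exceptions
	broadcast bool   // embedded broadcasting
	zeroing   bool   // zeroing-mask instead of merging-mask
}

// splitOpcode separates the base mnemonic from its dot-separated
// suffixes and records which features were requested.
func splitOpcode(s string) (string, opSuffixes, error) {
	parts := strings.Split(s, ".")
	var suf opSuffixes
	for _, p := range parts[1:] {
		switch p {
		case "Z":
			suf.zeroing = true
		case "BCST":
			suf.broadcast = true
		case "SAE":
			suf.sae = true
		case "RN_SAE", "RD_SAE", "RU_SAE", "RZ_SAE":
			// Embedded rounding implies SAE.
			suf.rounding = strings.TrimSuffix(p, "_SAE")
			suf.sae = true
		default:
			return "", opSuffixes{}, fmt.Errorf("%s: unknown suffix %q", parts[0], p)
		}
	}
	return parts[0], suf, nil
}

func main() {
	for _, s := range []string{"VADDPD", "VADDPD.RU_SAE", "VMAXPD.SAE.Z"} {
		name, suf, err := splitOpcode(s)
		fmt.Printf("%s -> %s %+v %v\n", s, name, suf, err)
	}
}
```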

Questions like "Why this way?" are answered in go#22779: AVX512 design.
It is also recommended to follow the link to golang-dev given there.

Comparison with GNU assembler

The order of the operands is identical to that of the GNU assembler.

There is news for those who have run into the "strange" operand order of the CMP instructions:
for AVX instructions, these special rules do not apply (decide for yourself whether this is good or bad).

Feature	GNU assembler	Go assembler
Masking	`VPORD %ZMM0, %ZMM1, %ZMM2{%K2}` `{k}` always at dst operand	`VPORD Z0, Z1, K2, Z2` `{k}` always before the dst operand
Broadcasting	`VPORD (%RDX){1to16}, %ZMM1, %ZMM2` `1toN` at memory argument	`VPORD.BCST (DX), Z1, Z2` `BCST` suffix
Zeroing	`VPORD %ZMM0, %ZMM1, %ZMM2{z}` `{z}` argument at dst operand	`VPORD.Z Z0, Z1, Z2` `Z` suffix
Rounding	`VSQRTPD {ru-sae}, %ZMM0, %ZMM1` Special first argument	`VSQRTPD.RU_SAE Z0, Z1` Suffix
SAE	`VUCOMISD {sae}, %XMM0, %XMM1` Similar to rounding	`VUCOMISD.SAE X0, X1` Similar to rounding
Multi-source	`V4FMADDPS (%RCX), %ZMM4, %ZMM1` Specifies first register	`V4FMADDPS (CX), [Z4-Z7], Z1` Explicit indication of the range

Both assemblers use VEX when an instruction can be encoded with either VEX or EVEX. In other words, VADDPD X1, X2, X3 will get a VEX prefix.
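The preference can be summarized in a few lines of Go. This is a simplified sketch under stated assumptions: the operands type and chooseEncoding function are hypothetical, the instruction is assumed to have both a VEX and an EVEX form, and the listed conditions cover only the EVEX-only features mentioned in this article.

```go
package main

import "fmt"

// operands is a hypothetical summary of what an instruction uses.
type operands struct {
	maxVecIndex int  // highest X/Y/Z register index used
	usesZ       bool // any 512-bit Z register
	usesMask    bool // a K register given as a mask
	hasSuffix   bool // any of .Z, .BCST, .SAE or a rounding suffix
}

// chooseEncoding prefers VEX and falls back to EVEX only when some
// operand or feature cannot be expressed with VEX.
func chooseEncoding(o operands) string {
	if o.usesZ || o.usesMask || o.hasSuffix || o.maxVecIndex > 15 {
		return "EVEX"
	}
	return "VEX"
}

func main() {
	// VADDPD X1, X2, X3: nothing here requires EVEX.
	fmt.Println(chooseEncoding(operands{maxVecIndex: 3}))
	// VADDPD X1, X2, X16: registers 16-31 exist only under EVEX.
	fmt.Println(chooseEncoding(operands{maxVecIndex: 16}))
}
```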

When the operand size is ambiguous, opcodes in the Go assembler get additional size suffixes:

 VCVTSS2USIL (AX), DX // VCVTSS2USI (%RAX), %EDX
 VCVTSS2USIQ (AX), DX // VCVTSS2USI (%RAX), %RDX

Where Intel syntax lets you specify the width of the memory operand, the GNU and Go assemblers use X and Y size suffixes:

 VCVTTPD2DQX (AX), X0 // VCVTTPD2DQ XMM0, XMMWORD PTR [RAX]
 VCVTTPD2DQY (AX), X0 // VCVTTPD2DQ XMM0, YMMWORD PTR [RAX]

A complete list of instructions with size suffixes can be found in the documentation .

Disassembling AVX-512

CL113315 adds AVX-512 support to go tool asm, mainly affecting the parser and the obj/x86 code generator. But what happens if you assemble a .s file and try to inspect it with go tool objdump?

 // File avx.s
 TEXT avxCheck(SB), 0, $0
 	VPOR X0, X1, X2 // AVX1
 	VPOR Y0, Y1, Y2 // AVX2
 	VPORD.BCST (DX), Z1, K2, Z2 // AVX-512
 	RET

You will not see what you expect:

 $ go tool asm avx.s
 $ go tool objdump avx.o
 TEXT avxCheck(SB) gofile..$GOROOT/avx.s
   avx.s:2 0xb7 c5f1ebd0 JMP 0x8b
   avx.s:3 0xbb c5f5ebd0 JMP 0x8f
   avx.s:4 0xbf 62 ?
   avx.s:4 0xc0 f1 ICEBP
   avx.s:4 0xc1 755a JNE 0x11d
   avx.s:4 0xc3 eb12 JMP 0xd7
   avx.s:5 0xc5 c3 RET

The system objdump does not work on Go object files:

 $ objdump -D avx.o
 objdump: avx.o: File format not recognized

But it can be used on executable files.
If the assembler code is included in the main package, the system objdump will cope with the task.

A simpler way to get machine code is to pass the -S argument:

 $ go tool asm -S avx.s
 avxCheck STEXT nosplit size=15 args=0xffffffff80000000 locals=0x0
 	0x0000 00000 (avx.s:1) TEXT avxCheck(SB), NOSPLIT, $0
 	0x0000 00000 (avx.s:2) VPOR X0, X1, X2
 	0x0004 00004 (avx.s:3) VPOR Y0, Y1, Y2
 	0x0008 00008 (avx.s:4) VPORD.BCST (DX), Z1, K2, Z2
 	0x000e 00014 (avx.s:5) RET
 	0x0000 c5 f1 eb d0 c5 f5 eb d0 62 f1 75 5a eb 12 c3 ........b.uZ...
 go.info.avxCheck SDWARFINFO size=34
 	0x0000 02 61 76 78 43 68 65 63 6b 00 00 00 00 00 00 00 .avxCheck.......
 	0x0010 00 00 00 00 00 00 00 00 00 00 01 9c 00 00 00 00 ................
 	0x0020 01 00

The bytes we are interested in: c5 f1 eb d0 c5 f5 eb d0 62 f1 75 5a eb 12 c3.
Copy them and run them back through the system objdump:

 # 1. Convert the hex text to binary with xxd.
 # 2. Run objdump in binary mode.
 # Note: for Intel syntax, replace "i386" with "i386:intel".
 $ echo 'c5 f1 eb d0 c5 f5 eb d0 62 f1 75 5a eb 12 c3' | xxd -r -p > shellcode.bin && objdump -b binary -m i386 -D shellcode.bin

 Disassembly of section .data:

 00000000 <.data>:
    0: c5 f1 eb d0 vpor %xmm0,%xmm1,%xmm2
    4: c5 f5 eb d0 vpor %ymm0,%ymm1,%ymm2
    8: 62 f1 75 5a eb 12 vpord (%edx){1to16},%zmm1,%zmm2{%k2}
    e: c3 ret
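If xxd is not at hand, the hex-to-binary step can also be done in Go with the standard encoding/hex package. The decodeHexDump helper below is a hypothetical name introduced for this sketch; it mirrors what xxd -r -p does for a plain hex dump.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// decodeHexDump turns whitespace-separated hex octets into raw bytes.
func decodeHexDump(dump string) ([]byte, error) {
	// strings.Fields drops all whitespace, so the octets can be
	// joined into one continuous hex string for hex.DecodeString.
	return hex.DecodeString(strings.Join(strings.Fields(dump), ""))
}

func main() {
	code, err := decodeHexDump("c5 f1 eb d0 c5 f5 eb d0 62 f1 75 5a eb 12 c3")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%d bytes: % x\n", len(code), code)
}
```

The resulting byte slice can be written to a file and passed to objdump -b binary in the same way as shellcode.bin.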

Disassembling with XED

XED also provides several useful utilities, one of which lets you use the encoder / decoder from the command line.

 $ echo 'c5 f1 eb d0 c5 f5 eb d0 62 f1 75 5a eb 12 c3' > data.txt && xed -64 -A -ih data.txt && rm data.txt
 00 LOGICAL AVX C5F1EBD0 vpor %xmm0, %xmm1, %xmm2
 04 LOGICAL AVX2 C5F5EBD0 vpor %ymm0, %ymm1, %ymm2
 08 LOGICAL AVX512EVEX 62F1755AEB12 vpordl (%rdx){1to16}, %zmm1, %zmm2{%k2}
 0e RET BASE C3 retq

The -A flag selects AT&T syntax, -64 selects 64-bit mode.
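To see where the mask register and the broadcast flag of the VPORD.BCST example live in the bytes 62 f1 75 5a eb 12, the following Go sketch pulls a few fields out of the EVEX prefix. The field positions follow the EVEX layout in Intel's manual; the evexFields type and decodeEVEX function are hypothetical names, only a subset of the fields is decoded, and 64-bit code is assumed (so that a leading 0x62 byte means EVEX).

```go
package main

import (
	"errors"
	"fmt"
)

// evexFields holds a subset of the bit fields stored in an EVEX prefix.
type evexFields struct {
	mask      int  // aaa: mask register number (K0-K7)
	broadcast bool // b: embedded broadcast for memory operands
	vecLen    int  // L'L: 0 = 128-bit, 1 = 256-bit, 2 = 512-bit
	zeroing   bool // z: zeroing-mask instead of merging-mask
	vvvv      int  // low 4 bits of the extra source register number
}

// decodeEVEX reads the two payload bytes that follow 0x62 and the
// first payload byte: code[2] is "W vvvv 1 pp" and code[3] is
// "z L'L b V' aaa". The vvvv field is stored inverted.
func decodeEVEX(code []byte) (evexFields, error) {
	if len(code) < 4 || code[0] != 0x62 {
		return evexFields{}, errors.New("not an EVEX-encoded instruction")
	}
	p1, p2 := code[2], code[3]
	return evexFields{
		mask:      int(p2 & 0x07),
		broadcast: p2&0x10 != 0,
		vecLen:    int((p2 >> 5) & 0x03),
		zeroing:   p2&0x80 != 0,
		vvvv:      int((^p1 >> 3) & 0x0f),
	}, nil
}

func main() {
	// VPORD.BCST (DX), Z1, K2, Z2
	f, err := decodeEVEX([]byte{0x62, 0xf1, 0x75, 0x5a, 0xeb, 0x12})
	fmt.Printf("%+v %v\n", f, err)
}
```

For this instruction the decoded fields line up with the source: mask register 2 (K2), broadcast enabled (.BCST), 512-bit vector length (Z registers), no zeroing, and source register 1 (Z1).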

The xed-ex4 utility shows detailed information about an instruction:

 $ xed-ex4 -64 C5 F1 EB D0
 PARSING BYTES: c5 f1 eb d0
 VPOR VPOR_XMMdq_XMMdq_XMMdq
 EASZ:3, EOSZ:2, HAS_MODRM:1, LZCNT, MAP:1, MAX_BYTES:4, MOD:3, MODE:2, MODRM_BYTE:208, NOMINAL_OPCODE:235, OUTREG:XMM0, P4, POS_MODRM:3, POS_NOMINAL_OPCODE:2, REG:2, REG0:XMM2, REG1:XMM1, REG2:XMM0, SMODE:2, TZCNT, VEXDEST210:6, VEXDEST3, VEXVALID:1, VEX_PREFIX:1
 0 REG0/W/DQ/EXPLICIT/NT_LOOKUP_FN/XMM_R
 1 REG1/R/DQ/EXPLICIT/NT_LOOKUP_FN/XMM_N
 2 REG2/R/DQ/EXPLICIT/NT_LOOKUP_FN/XMM_B
 YDIS: vpor xmm2, xmm1, xmm0
 ATT syntax: vpor %xmm0, %xmm1, %xmm2
 INTEL syntax: vpor xmm2, xmm1, xmm0

go tool objdump is based on x86.csv, which lacks many new instructions and contains inaccuracies.

The csv file itself is created by the x86spec utility, which converts the Intel manual (PDF).
The next step is to create x86.csv from the XED tables, which will make it possible to re-generate the tables for the decoder.

AVX-512 Application

One of the major AVX-512 users in the Go world is minio .
Before 1.11, they had to use the asm2plan9s utility.

Here, for example, their results for sha256 :

Processor	SIMD	Speed (MB/s)
3.0 GHz Intel Xeon Platinum 8124M	AVX512	3498
1.2 GHz ARM Cortex-A53	ARM64	638
3.0 GHz Intel Xeon Platinum 8124M	AVX2	449
3.1 GHz Intel Core i7	AVX	362
3.1 GHz Intel Core i7	SSE	299

To get acquainted with the new extension, you can start with instructions already familiar to you from AVX1 and AVX2 (without Z registers). This way you can experiment with new features, such as merging / zeroing masks, without the risk of running into a completely new set of "features".

Most importantly, measure before you draw final conclusions. Check both the performance of the function itself and that of the application as a whole.

I also recommend that you familiarize yourself with golang.org/wiki/AVX-512-support-in-Go-assembler .

The effective use of AVX-512 will be discussed in more detail in a separate article.

Source: https://habr.com/ru/post/359132/
