The slowest way to speed up a program on Go

There is something great about programming in assembler. It can be very slow and full of errors compared to programming in a language such as Go, but sometimes it’s a good idea or at least a very fun experience.

Why spend time programming in assembler when there are excellent high level programming languages? Even with today's compilers, there are still a few cases where you want to write code in assembly language. These are cryptography , performance optimization, or access to things that are usually not available in the language . The most interesting, of course, is performance optimization.

When the performance of some part of your code really matters to the user, and you have already tried all the more simple ways to make it faster, writing code in assembly language can be a good place to optimize. Although the compiler can be perfectly optimized for creating assembler code, you can know more about a particular case than the compiler suggests.

We write the assembly code in Go

The best way to get started is to write the simplest function. For example, the add function adds two int64s.

 package main import "fmt" func add(x, y int64) int64 { return x + y } func main() { fmt.Println(add(2, 3)) }

Run: go build -o add-go && ./add-go

To implement this function in assembler, create a separate file add_amd64.s , which will contain the assembler code. The examples use assembler for AMD64 architecture.

add.go:

 package main import "fmt" func add(x, y int64) int64 func main() { fmt.Println(add(2, 3)) }

add_amd64.s:

 #include "textflag.h" TEXT ·add(SB),NOSPLIT,$0 MOVQ x+0(FP), BX MOVQ y+8(FP), BP ADDQ BP, BX MOVQ BX, ret+16(FP) RET

To run the example, put these two files in one directory and run the command go build -o add && ./add

The assembler syntax is at best ... unclear. There is an official Go manual and a rather ancient manual for the Plan 9 assembler , which gives some tips on how the assembly language in Go works. The best sources to learn are the existing assembler code Go and the compiled versions of Go functions that can be obtained by executing the command: go tool compile -S <go file> .

The most important things to know are the function declaration and the stack layout.

The magic spell to start the function is TEXT ·add(SB), NOSPLIT, $0 . The Unicode character symbol · separates the package name from the function name. In this case, the package name is main , so the package name is empty here, and the function name is add . The NOSPLIT directive means that you do not need to write the size of the arguments as the next parameter. The constant $0 at the end is where you need to put the size of the arguments, but since we have a NOSPLIT , we can just leave it as $0 .

Each function argument is pushed onto the stack, starting at address 0(FP) , meaning an offset of zero bytes from the FP pointer, and so on for each argument and the return value. For func add (x, y int64) int64 , it looks like this:

Let's sort the code of the already familiar add function:

 TEXT ·add(SB),NOSPLIT,$0 MOVQ x+0(FP), BX MOVQ y+8(FP), BP ADDQ BP, BX MOVQ BX, ret+16(FP) RET

The assembler version of the add function loads the variable x at the memory address +0(FP) into the BX register. It then loads y from address +8(FP) into the BP register, adds BP and BX , saves the result to BX , and finally copies BX to +16(FP) and returns from the function. The calling function, which pushes all the arguments onto the stack, will read the return value, from where we left it.

Function optimization with assembler

It is not necessary to write a function in the assembler that adds two numbers, but why do you really need to use it?

Suppose you have a bunch of vectors, and you want to multiply them by the transformation matrix. Perhaps the vectors are points, and you want to move them in space ( translation on Habré - approx. Lane.) . We will use vectors with a 4x4 transformation matrix.

 type V4 [4]float32 type M4 [16]float32 func M4MultiplyV4(m M4, v V4) V4 { return V4{ v[0]*m[0] + v[1]*m[4] + v[2]*m[8] + v[3]*m[12], v[0]*m[1] + v[1]*m[5] + v[2]*m[9] + v[3]*m[13], v[0]*m[2] + v[1]*m[6] + v[2]*m[10] + v[3]*m[14], v[0]*m[3] + v[1]*m[7] + v[2]*m[11] + v[3]*m[15], } } func multiply(data []V4, m M4) { for i, v := range data { data[i] = M4MultiplyV4(m, v) } }

Execution takes 140 ms for 128 MB of data. Which implementation can be faster? The standard is copying memory, which takes about 14 ms.

The following is a version of the function, written in assembler using SIMD instructions for performing multiplications, allowing you to multiply four 32-bit floating-point numbers in parallel:

 #include "textflag.h" // func multiply(data []V4, m M4) // //     FP // +0  data, ptr // +8  data, len // +16  data, cap // +24 m[0] | m[1] // +32 m[2] | m[3] // +40 m[4] | m[5] // +48 m[6] | m[7] // +56 m[8] | m[9] // +64 m[10] | m[11] // +72 m[12] | m[13] // +80 m[14] | m[15] TEXT ·multiply(SB),NOSPLIT,$0 // data ptr MOVQ data+0(FP), CX // data len MOVQ data+8(FP), SI //   data MOVQ $0, AX //  ,    CMPQ AX, SI JE END //    128- xmm- (https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Registers) //  [m[0], m[1], m[2], m[3]]  xmm0 MOVUPS m+24(FP), X0 //  [m[4], m[5], m[6], m[7]]  xmm1 MOVUPS m+40(FP), X1 //  [m[8], m[9], m[10], m[11]]  xmm2 MOVUPS m+56(FP), X2 //  [m[12], m[13], m[14], m[15]]  xmm3 MOVUPS m+72(FP), X3 LOOP: //       xmm //  data[i][0] (x)  xmm4 MOVSS 0(CX), X4 //  data[i][1] (y)  xmm5 MOVSS 4(CX), X5 //  data[i][2] (z)  xmm6 MOVSS 8(CX), X6 //  data[i][3] (w)  xmm7 MOVSS 12(CX), X7 //       // [0, 0, 0, x] => [x, x, x, x] SHUFPS $0, X4, X4 // [0, 0, 0, y] => [y, y, y, y] SHUFPS $0, X5, X5 // [0, 0, 0, z] => [z, z, z, z] SHUFPS $0, X6, X6 // [0, 0, 0, w] => [w, w, w, w] SHUFPS $0, X7, X7 // xmm4 = [m[0], m[1], m[2], m[3]] .* [x, x, x, x] MULPS X0, X4 // xmm5 = [m[4], m[5], m[6], m[7]] .* [y, y, y, y] MULPS X1, X5 // xmm6 = [m[8], m[9], m[10], m[11]] .* [z, z, z, z] MULPS X2, X6 // xmm7 = [m[12], m[13], m[14], m[15]] .* [w, w, w, w] MULPS X3, X7 // xmm4 = xmm4 + xmm5 ADDPS X5, X4 // xmm4 = xmm4 + xmm6 ADDPS X6, X4 // xmm4 = xmm4 + xmm7 ADDPS X7, X4 // data[i] = xmm4 MOVNTPS X4, 0(CX) // data++ ADDQ $16, CX // i++ INCQ AX // if i >= len(data) break CMPQ AX, SI JLT LOOP END: //     (Non-Temporal) SSE- (MOVNTPS) // ,         (  SFENCE) SFENCE RET

This implementation takes 18 ms, so it is pretty close to the copy speed of the memory. The best optimization might be to run such things on the GPU, not on the CPU, because the GPU is really good at it.

Operating time for various programs, including the inline version of Go and the assembler implementation without SIMD (with links to the source code):

Program	Time, ms	Acceleration
Original , zip	140	1x
Inline version , zip	69	2x
Assembler , zip	43	3x
Assembly with SIMD , zip	17	8x
Copy memory , zip	15	9x

The cost of optimization will be the complexity of the code, which simplifies the operation of the machine. Writing a program in assembler is a complicated optimization method, but sometimes it is the best available method.

Implementation notes

The author developed assembly parts mainly in C and 64-bit assembler using Xcode, and then ported them to Go format. Xcode has a good debugger that allows you to check the values of the processor registers while the program is running. If you include an assembly .s file in an Xcode project, the IDE will compile it and link it to the desired executable file.

The author used the Intel x64 instruction set guide and the Intel Intrinsics manual to figure out which instructions to use. Converting to assembly language Go is not always easy, but many 64-bit assembly instructions are listed in x86 / anames.go , and if not, they can be encoded directly with a binary representation.

Translator's Note

The original article does not include the header #include "textflag.h" in the assembler files, which is why the compile-time error is illegal or missing addressing mode for symbol NOSPLIT .
Therefore I uploaded the corrected versions on GitHub Gist. To start, unpack the archive, execute the following commands: chmod +x run.sh && ./run.sh .

It is possible to run code with an assembler only by assembling a binary with the help of go build , otherwise the compiler will crash onto the empty function body.

Unfortunately, there is really little information on the Internet on assembler in Go. I advise you to read an article on Habré about the assembly language Go .

Another way to use assembler inserts in Go

As you know, Go supports the use of code written in C. Therefore, nothing prevents to do this:

sqrt.h:

 double sqrt(double x) { __asm__ ("fsqrt" : "+t" (x)); return x; }

sqrt.go:

 package main /* #include "sqrt.h" */ import "C" import "fmt" func main() { fmt.Printf("%f\n", C.sqrt(C.double(16))) }

And run:

 $ go run sqrt.go 4.000000

Assembler is fun!

Source: https://habr.com/ru/post/349384/

All Articles