
C vs Go: loops and simple math

When I grew tired of C programming, like many others I became interested in the Go language. It is strongly typed and compiled, and therefore reasonably fast. And I wanted to find out how much trouble the Go creators had taken to optimize work with loops and numbers.

First, let's look at how things stand with C.

We write a simple piece of code:

#include <stdint.h>
#include <stdio.h>

int main() {
    uint64_t i;
    uint64_t j = 0;
    for (i = 10000000; i > 0; i--) {
        j ^= i;
    }
    printf("%lu\n", j);
    return 0;
}

Compile with -O2 and disassemble:
 564: 31 d2                 xor    %edx,%edx
 566: b8 80 96 98 00        mov    $0x989680,%eax
 56b: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
 570: 48 31 c2              xor    %rax,%rdx
 573: 48 83 e8 01           sub    $0x1,%rax
 577: 75 f7                 jne    570 <main+0x10>

We get the execution time:

real 0m0,023s
user 0m0,019s
sys 0m0,004s

It would seem there is nowhere left to accelerate, but we have a modern processor with fast SSE registers for exactly this kind of operation. We try gcc with -mfpmath=sse -msse4.2 and get the same result, which is not surprising: -mfpmath=sse only affects floating-point math, and this loop is only auto-vectorized at -O3.
Add -O3 and hooray:

 57a: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
 580: 83 c0 01              add    $0x1,%eax
 583: 66 0f ef c8           pxor   %xmm0,%xmm1
 587: 66 0f d4 c2           paddq  %xmm2,%xmm0
 58b: 3d 40 4b 4c 00        cmp    $0x4c4b40,%eax
 590: 75 ee                 jne    580 <main+0x20>

It can be seen that SSE2 instructions and SSE (xmm) registers are used, and we get roughly a threefold performance boost:

real 0m0,006s
user 0m0,006s
sys 0m0,000s
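
To make the disassembly easier to read, here is a rough sketch of what the auto-vectorized loop is effectively doing, written by hand with SSE2 intrinsics. This is my own illustration, not the compiler's output or the original code: the accumulator and the counter each live in a 128-bit xmm register holding two 64-bit values, so every iteration performs two XORs and two decrements at once.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m128i acc  = _mm_setzero_si128();               /* two partial XOR accumulators */
    __m128i cur  = _mm_set_epi64x(9999999, 10000000); /* lane 0 = 10000000, lane 1 = 9999999 */
    __m128i step = _mm_set1_epi64x(-2);               /* each lane counts down by 2 */
    for (uint64_t n = 0; n < 10000000 / 2; n++) {
        acc = _mm_xor_si128(acc, cur);   /* pxor : two 64-bit XORs per iteration */
        cur = _mm_add_epi64(cur, step);  /* paddq: advance both counters */
    }
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    printf("%lu\n", lanes[0] ^ lanes[1]); /* fold the two lanes: same result as the scalar loop */
    return 0;
}

Since XOR is commutative, folding the two lanes at the end gives exactly the same answer as the scalar loop, which is why the compiler is allowed to do this transformation.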

Now the same thing in Go:

package main

import "fmt"

func main() {
    i := 0
    j := 0
    for i = 10000000; i > 0; i-- {
        j ^= i
    }
    fmt.Println(j)
}

0x000000000048211a <+42>: lea    -0x1(%rax),%rdx
0x000000000048211e <+46>: xor    %rax,%rcx
0x0000000000482121 <+49>: mov    %rdx,%rax
0x0000000000482124 <+52>: test   %rax,%rax
0x0000000000482127 <+55>: ja     0x48211a <main.main+42>


Go timings:
regular go:
real 0m0,021s
user 0m0,018s
sys 0m0,004s

gccgo:
real 0m0,058s
user 0m0,036s
sys 0m0,014s

Performance is the same as C at -O2. I have also included the gccgo result: it compiles the loop to the same code, yet its binary runs longer than the one built with the regular Go (1.10.4) compiler. Apparently the regular compiler is very good at optimizing the startup of its threads (in my case, 5 additional threads were created for 4 cores), which is why that application finishes faster.

Conclusion



I did still manage to get the standard Go compiler to use SSE instructions for the loop by slipping it a type that is native to SSE: float64.

// +build amd64

package main

import "fmt"

func main() {
    var i float64 = 0
    var j float64 = 0
    for i = 10000000; i > 0; i-- {
        j += i
    }
    fmt.Println(j)
}


0x0000000000484bbe <+46>: movsd 0x4252a(%rip),%xmm3 # 0x4c70f0 <$f64.3ff0000000000000>
0x0000000000484bc6 <+54>: movups %xmm0,%xmm4
0x0000000000484bc9 <+57>: subsd %xmm3,%xmm0
0x0000000000484bcd <+61>: addsd %xmm4,%xmm1
0x0000000000484bd1 <+65>: xorps %xmm2,%xmm2
0x0000000000484bd4 <+68>: ucomisd %xmm2,%xmm0
0x0000000000484bd8 <+72>: ja 0x484bbe <main.main+46>
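
Note that movsd, subsd and addsd are scalar SSE instructions handling one double per iteration, so the loop now uses the xmm registers but is not packed two-wide like gcc's -O3 version. A quick sanity check on the arithmetic (my addition, not from the original article): the loop now computes a sum instead of an XOR, and the result

\sum_{i=1}^{10^{7}} i = \frac{10^{7}\,(10^{7}+1)}{2} = 50\,000\,005\,000\,000 < 2^{53},

so float64 represents it exactly and the type swap does not change the printed answer's precision.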

Source: https://habr.com/ru/post/432986/

