
Golang: specific performance issues

The Go language is gaining popularity — so much so that more and more conferences are dedicated to it, such as GolangConf, and the language is among the ten highest-paid technologies. So it already makes sense to talk about its specific problems, performance among them. Besides the problems common to all compiled languages, Go has its own: they stem from the optimizer, the stack, the type system and the multitasking model. The ways to solve or work around them are sometimes quite specific.

Daniil Podolsky, although a Go evangelist, still runs into a lot of strange things in it. Everything strange and, most importantly, interesting, he collects and tests, and then talks about at HighLoad++. This transcript of his report includes numbers, graphs, code samples, profiler output, comparisons of the performance of the same algorithms in different languages — and everything else for which we so hate the word "optimization". There will be no revelations in it — where would they come from in such a simple language — nor anything you can already read about in the papers.



About the speakers. Daniil Podolsky: 26 years of experience, 20 of them in operations, including as a team lead, and 5 years of programming in Go. Kirill Danshin: creator of Gramework, maintainer of fasthttp, black Go-mage.
The report was prepared jointly by Daniil Podolsky and Kirill Danshin, but Daniil gave the talk while Kirill helped in spirit.

Language constructions


Our performance baseline is a direct call: a function that increments a variable and does nothing else.

 // baseline: a direct call
 var testInt64 int64

 func BenchmarkDirect(b *testing.B) {
     for i := 0; i < b.N; i++ {
         incDirect()
     }
 }

 func incDirect() {
     testInt64++
 }

This function takes 1.46 ns per operation. That is the floor: we are unlikely to get faster than about 1.5 ns per operation.

Defer, and how we love it


Many know and love to use the defer language construct. Quite often we use it like this.

 func BenchmarkDefer(b *testing.B) {
     for i := 0; i < b.N; i++ {
         incDefer()
     }
 }

 func incDefer() {
     defer incDirect()
 }

But you should not use it like that: each defer costs about 40 ns per operation.

 // direct
 BenchmarkDirect-4    2000000000    1.46 ns/op
 // defer
 BenchmarkDefer-4     30000000      40.70 ns/op

I wondered: maybe this is about inlining? Maybe the direct call is only that fast because it is inlined?

incDirect is inlined, while a deferred function cannot be inlined. So I compiled a separate test with inlining disabled.

 func BenchmarkDirectNoInline(b *testing.B) {
     for i := 0; i < b.N; i++ {
         incDirectNoInline()
     }
 }

 //go:noinline
 func incDirectNoInline() {
     testInt64++
 }

Nothing changed: defer still cost the same ~40 ns. Defer is expensive, but not catastrophically so.

Where a function takes less than 100 ns, it is worth doing without defer.

But where the function takes more than a microsecond, it makes no difference — feel free to use defer.
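As an illustration — a minimal sketch of mine, not code from the report — here is a typical case where the rule of thumb applies: in a tiny hot getter the explicit Unlock saves the ~40 ns that defer would add, while in a slow function defer costs nothing noticeable.

 type counter struct {
     mu sync.Mutex // requires "sync", "fmt" and "io"
     n  int64
 }

 // Hot path, a few nanoseconds of work: the explicit Unlock avoids the defer overhead.
 func (c *counter) Value() int64 {
     c.mu.Lock()
     v := c.n
     c.mu.Unlock()
     return v
 }

 // Slow path, microseconds of I/O: the defer overhead is lost in the noise.
 func (c *counter) Flush(w io.Writer) error {
     c.mu.Lock()
     defer c.mu.Unlock()
     _, err := fmt.Fprintln(w, c.n)
     return err
 }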

Passing a parameter by reference


Now consider a popular myth: that passing a parameter by pointer is expensive.

 func BenchmarkDirectByPointer(b *testing.B) {
     for i := 0; i < b.N; i++ {
         incDirectByPointer(&testInt64)
     }
 }

 func incDirectByPointer(n *int64) {
     *n++
 }

Nothing changed — passing by pointer costs nothing.

 // by pointer
 BenchmarkDirectByPointer-4    2000000000    1.47 ns/op
 BenchmarkDeferByPointer-4     30000000      43.90 ns/op

Except for about 3 ns on the defer variant, which I put down to noise.

Anonymous Functions


Sometimes newbies ask, “Is an anonymous function expensive?”

 func BenchmarkDirectAnonymous(b *testing.B) {
     for i := 0; i < b.N; i++ {
         func() {
             testInt64++
         }()
     }
 }

An anonymous function is not expensive, it takes 40.4 ns.

Interfaces


There is an interface and a structure that implements it.

 type testTypeInterface interface {
     Inc()
 }

 type testTypeStruct struct {
     n int64
 }

 func (s *testTypeStruct) Inc() {
     s.n++
 }

There are three ways to call the increment method. Directly on the struct: var testStruct = testTypeStruct{}.

Through the matching concrete interface: var testInterface testTypeInterface = &testStruct.

And with a runtime interface conversion: var testInterfaceEmpty interface{} = &testStruct.
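Put together, the three variants look roughly like this; BenchmarkStruct is my reconstruction of the direct-call case whose result appears in the table below.

 var testStruct = testTypeStruct{}
 var testInterface testTypeInterface = &testStruct
 var testInterfaceEmpty interface{} = &testStruct

 func BenchmarkStruct(b *testing.B) {
     for i := 0; i < b.N; i++ {
         testStruct.Inc()
     }
 }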

Below are the benchmarks for the call through the interface and for the runtime interface conversion.

 func BenchmarkInterface(b *testing.B) {
     for i := 0; i < b.N; i++ {
         testInterface.Inc()
     }
 }

 func BenchmarkInterfaceRuntime(b *testing.B) {
     for i := 0; i < b.N; i++ {
         testInterfaceEmpty.(testTypeInterface).Inc()
     }
 }

The interface, as such, costs nothing.

 // interfaces
 BenchmarkStruct-4              2000000000    1.44 ns/op
 BenchmarkInterface-4           2000000000    1.88 ns/op
 BenchmarkInterfaceRuntime-4    200000000     9.23 ns/op


A runtime interface conversion does cost something, but not much — there is no need to avoid it on principle. Still, try to do without it where you can.



Switch, map and slice


Every newcomer to Go asks what happens if you replace switch with map. Will it be faster?

Switches come in different sizes. I tested three: small with 10 cases, medium with 100, and large with 1000 cases. Switches with 1000 cases do occur in real production code; of course, nobody writes them by hand — it is generated code, usually a type switch. I tested on two key types, int and string, since that seemed the clearest comparison.
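For context, a generated "switch on type" of the kind mentioned above usually looks something like this (a toy sketch of mine, far smaller than the real thousand-case versions):

 func kindOf(v interface{}) string {
     switch v.(type) {
     case int64:
         return "int64"
     case string:
         return "string"
     case []byte:
         return "[]byte"
     default:
         return "unknown"
     }
 }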

Small switch. The fastest option is the switch itself, followed closely by a slice in which the corresponding integer index holds a reference to the function. Map is not the leader on either int or string.
BenchmarkSwitchIntSmall-4       500000000    3.26 ns/op
BenchmarkMapIntSmall-4          100000000    11.70 ns/op
BenchmarkSliceIntSmall-4        500000000    3.85 ns/op
BenchmarkSwitchStringSmall-4    100000000    12.70 ns/op
BenchmarkMapStringSmall-4       100000000    15.60 ns/op

Switching on strings is significantly slower than on int. If you can switch on an int rather than on a string, do so.
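To make the comparison concrete, here is a minimal sketch of mine of the three dispatch techniques being measured — a value switch, a map of functions, and a slice of functions indexed by the integer condition (the real tests used 10, 100 and 1000 cases):

 var sink int64

 func dispatchSwitch(n int) {
     switch n {
     case 0:
         sink++
     case 1:
         sink += 2
     case 2:
         sink += 3
     }
 }

 var dispatchMap = map[int]func(){
     0: func() { sink++ },
     1: func() { sink += 2 },
     2: func() { sink += 3 },
 }

 var dispatchSlice = []func(){
     func() { sink++ },
     func() { sink += 2 },
     func() { sink += 3 },
 }

 // In the benchmark loops: dispatchSwitch(i % 3), dispatchMap[i%3]() and dispatchSlice[i%3]().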

Medium switch. On int the switch itself still rules, although slice has overtaken it slightly. Map is still behind. But on a string key map is already faster than switch — as expected.
BenchmarkSwitchIntMedium-4       300000000    4.55 ns/op
BenchmarkMapIntMedium-4          100000000    17.10 ns/op
BenchmarkSliceIntMedium-4        300000000    3.76 ns/op
BenchmarkSwitchStringMedium-4    50000000     28.50 ns/op
BenchmarkMapStringMedium-4       100000000    20.30 ns/op

Large switch. A thousand cases show an unconditional victory for map in the "switch on string" category. On int, slice technically won, but in practice I would advise sticking with a plain switch here. Map is still slow, even though a map with integer keys uses a special hash function that essentially does nothing: the int acts as its own hash.
BenchmarkSwitchIntLarge-4       100000000    13.6 ns/op
BenchmarkMapIntLarge-4          50000000     34.3 ns/op
BenchmarkSliceIntLarge-4        100000000    12.8 ns/op
BenchmarkSwitchStringLarge-4    20000000     100.0 ns/op
BenchmarkMapStringLarge-4       30000000     37.4 ns/op

Findings. Map is better only with a large number of cases and only when the condition is not an integer; I am fairly sure that on any key type other than int it will behave the same way as on string. Slice always wins when the conditions are integers — use it if you want to "speed up" your program by a couple of nanoseconds.

Inter-goroutine interaction


The topic is complex; I ran many tests and will present the most revealing ones. The means of inter-goroutine synchronization we have are well known: Mutex, atomic operations and channels.


Of course, I tested with a much wider range of goroutine counts competing for a single resource, but I settled on three representative values: few — 100, medium — 1000, and many — 10000.

The load profiles differ. Sometimes all goroutines want to write to one variable, but that is rare; usually some write and some read. In the "mostly read" profile 90% of goroutines read, and in the "mostly write" profile 90% write.

This is the code used so that the goroutine serving the channels can provide both reading from and writing to the variable.

 go func() {
     for {
         select {
         case n, ok := <-cw:
             if !ok {
                 wgc.Done()
                 return
             }
             testInt64 += n
         case cr <- testInt64:
         }
     }
 }()

If a message arrives on the channel we write through, we apply it; if that channel has been closed, we finish the goroutine. And at any moment we are ready to send to the channel that the other goroutines read from.
BenchmarkMutex-4     100000000    16.30 ns/op
BenchmarkAtomic-4    200000000    6.72 ns/op
BenchmarkChan-4      5000000      239.00 ns/op

These are the numbers for a single goroutine. The channel test runs on two goroutines — one serves the Channel, the other writes to it — while the Mutex and Atomic variants were tested on one.
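For reference, the single-goroutine Mutex and Atomic variants presumably look something like this (my sketch, reusing testInt64 from the direct benchmark), and the Channel benchmark simply drives the serving goroutine shown above through cw:

 var mu sync.Mutex // requires "sync" and "sync/atomic"

 func BenchmarkMutex(b *testing.B) {
     for i := 0; i < b.N; i++ {
         mu.Lock()
         testInt64++
         mu.Unlock()
     }
 }

 func BenchmarkAtomic(b *testing.B) {
     for i := 0; i < b.N; i++ {
         atomic.AddInt64(&testInt64, 1)
     }
 }

 func BenchmarkChan(b *testing.B) {
     for i := 0; i < b.N; i++ {
         cw <- 1 // the goroutine from the snippet above applies the increment
     }
 }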


With a small number of goroutines, Atomic is an effective and fast way to synchronize, which is not surprising. Direct is absent here because we need synchronization, which it does not provide. But Atomic has its limitations, of course.
BenchmarkMutexFew-4                          30000      55894 ns/op
BenchmarkAtomicFew-4                         100000     14585 ns/op
BenchmarkChanFew-4                           5000       323859 ns/op
BenchmarkChanBufferedFew-4                   5000       341321 ns/op
BenchmarkChanBufferedFullFew-4               20000      70052 ns/op
BenchmarkMutexMostlyReadFew-4                30000      56402 ns/op
BenchmarkAtomicMostlyReadFew-4               1000000    2094 ns/op
BenchmarkChanMostlyReadFew-4                 3000       442689 ns/op
BenchmarkChanBufferedMostlyReadFew-4         3000       449666 ns/op
BenchmarkChanBufferedFullMostlyReadFew-4     5000       442708 ns/op
BenchmarkMutexMostlyWriteFew-4               20000      79708 ns/op
BenchmarkAtomicMostlyWriteFew-4              100000     13358 ns/op
BenchmarkChanMostlyWriteFew-4                3000       449556 ns/op
BenchmarkChanBufferedMostlyWriteFew-4        3000       445423 ns/op
BenchmarkChanBufferedFullMostlyWriteFew-4    3000       414626 ns/op

Next up is Mutex. I expected Channel to be about as fast as Mutex, but no.

Channel is an order of magnitude more expensive than Mutex.

Moreover, an unbuffered Channel and a buffered Channel whose buffer overflows come out at about the same price. But a Channel whose buffer never overflows is an order of magnitude cheaper than the one whose buffer does: as long as the buffer is not full, a Channel costs about the same order of magnitude as a Mutex. That is what I expected from the test.

This overall picture of relative costs repeats on every load profile, both MostlyRead and MostlyWrite. Except that under MostlyRead the Channel whose buffer never fills costs the same as the one whose buffer overflows, and under MostlyWrite the buffered Channel whose buffer is not full also costs the same as the rest. I cannot say why this is so — I have not studied this question yet.
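The difference between "buffered" and "buffered, never full" comes down to how the channel capacity relates to the number of pending sends; roughly (my sketch, not the report's code):

 chSmall := make(chan int64, 64)      // fills up under load: senders block, cost comparable to unbuffered
 chHuge := make(chan int64, 1000000)  // never fills during the test: sends stay on the buffered fast path

A send into free buffer space is essentially a copy plus some bookkeeping; once the buffer is full, the sending goroutine has to park and later be woken, and that is where the extra order of magnitude goes.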

Passing parameters


How to pass parameters faster - by reference or by value? Let's check.

I checked it as follows: I made nested types, from nesting level 1 to 10.

 type TP001 struct {
     I001 int64
 }

 // TV001 (the value variant at nesting level 1) looks the same as TP001.
 type TV001 struct {
     I001 int64
 }

 type TV002 struct {
     I001 int64
     S001 TV001
     I002 int64
     S002 TV001
 }

The type at nesting level 10 has 10 int64 fields plus 10 fields of the level-9 type, and so on down the levels.

Then I wrote constructor functions for each nesting level.

 func NewTP001() *TP001 {
     return &TP001{
         I001: rand.Int63(),
     }
 }

 func NewTV002() TV002 {
     return TV002{
         I001: rand.Int63(),
         S001: NewTV001(),
         I002: rand.Int63(),
         S002: NewTV001(),
     }
 }

For the tests I used three variants of the type: small with nesting 2, medium with nesting 3, and large with nesting 5. I also ran a very large test with nesting 10 overnight, and the picture there is exactly the same as with 5.
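The benchmarks themselves just call the constructors in a loop and keep the result, roughly like this (my sketch; TP002/NewTP002 is the assumed pointer counterpart of TV002 for the "small", nesting-2 case):

 var (
     sinkV TV002
     sinkP *TP002
 )

 func BenchmarkCreateSmallByValue(b *testing.B) {
     for i := 0; i < b.N; i++ {
         sinkV = NewTV002()
     }
 }

 func BenchmarkCreateSmallByPointer(b *testing.B) {
     for i := 0; i < b.N; i++ {
         sinkP = NewTP002()
     }
 }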

In these constructors, returning by value is at least twice as fast as returning by pointer. That is because returning by value does not burden escape analysis: the variables we allocate stay on the stack, which is substantially cheaper for the runtime and for the garbage collector — although the GC probably did not even get a chance to kick in, since these tests run for only a few seconds.
BenchmarkCreateSmallByValue-4      200000    8942 ns/op
BenchmarkCreateSmallByPointer-4    100000    15985 ns/op
BenchmarkCreateMediumByValue-4     2000      862317 ns/op
BenchmarkCreateMediumByPointer-4   2000      1228130 ns/op
BenchmarkCreateLargeByValue-4      30        47398456 ns/op
BenchmarkCreateLargeByPointer-4    20        61928751 ns/op

Black magic


Do you know what this program will output?

 package main

 import (
     "fmt"
     "unsafe"
 )

 type A struct {
     a, b int32
 }

 func main() {
     a := new(A)
     a.a = 0
     a.b = 1
     z := *(*int64)(unsafe.Pointer(a))
     fmt.Println(z)
 }

The result depends on the architecture it runs on. On little endian, for example AMD64, the program prints 2^32 (4294967296); on big endian it prints 1. The results differ because on little endian the 1 stored in b ends up in the high-order bytes of the resulting int64, while on big endian it ends up in the low-order bytes.

There are still processors in the world with switchable endianness, PowerPC for example. So before drawing conclusions about what unsafe tricks of this kind will do — say, if you write Go code that will run on some IBM multiprocessor server — you have to find out at startup which endianness your machine is configured for.
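Checking at runtime is a one-liner using the same kind of unsafe trick (a sketch of mine, not code from the report):

 // littleEndian reports whether this machine stores the least significant byte first.
 func littleEndian() bool {
     var x int16 = 1
     return *(*byte)(unsafe.Pointer(&x)) == 1
 }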

I brought up this code to explain why I consider everything unsafe to be black magic: you should not use it. Kirill, however, believes that you should, and here is why.

There is a function that does the same job as gob, the Go binary marshaller: an encoder, but built on unsafe.

 // header mirrors the runtime slice header layout.
 type header struct {
     data unsafe.Pointer
     len  int
     cap  int
 }

 func encodeMut(data []uint64) (res []byte) {
     sz := len(data) * 8
     dh := (*header)(unsafe.Pointer(&data))
     rh := &header{
         data: dh.data,
         len:  sz,
         cap:  sz,
     }
     res = *(*[]byte)(unsafe.Pointer(rh))
     return
 }

In effect, it takes the memory backing the slice and reinterprets it as an array of bytes, without copying anything.
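Usage is trivial, but keep in mind that the result aliases the original memory — the byte slice is valid only while data is alive and unmodified (my example):

 data := []uint64{1, 2, 3}
 raw := encodeMut(data) // 24 bytes, no copy: len(raw) == cap(raw) == 24
 data[0] = 42           // raw[0:8] changes too — same backing memory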

The difference from gob is not even one order of magnitude — it is two. That is why Kirill Danshin, when writing high-performance code, does not hesitate to reach into the guts of his program and make it unsafe.

BenchmarkGOB-4          200000      8466 ns/op    120.94 MB/s
BenchmarkUnsafeMut-4    50000000    37 ns/op      27691.06 MB/s
We will discuss more specific features of Go on October 7 at GolangConf — a conference for those who use Go in professional development and those who consider the language as an alternative. Daniil Podolsky is on the Program Committee; if you want to argue with this article or explore related topics, submit a talk proposal.

For everything else concerning high performance there is, of course, HighLoad++. We accept proposals there too. Subscribe to the newsletter and stay up to date with the news of all our conferences for web developers.

Source: https://habr.com/ru/post/461291/

