
Simple Go Program Optimization Techniques

I've always cared about performance. I don't know exactly why; I just get annoyed by slow services and programs. It seems I'm not alone.

In A/B tests, we tried delaying the page in increments of 100 milliseconds and found that even very small delays lead to a substantial drop in revenue. - Greg Linden, Amazon.com

In my experience, poor performance manifests itself in one of two ways:


For most of my career, I have either done data science in Python or built services in Go. I have far more optimization experience with the latter. Go is usually not the bottleneck in the services I write: programs that talk to databases are typically I/O-bound. However, in the batch machine learning pipelines I have developed, the program is often CPU-bound. When a Go program uses too much CPU, there are various strategies.
This article explains some techniques that can significantly improve performance without much effort. I deliberately ignore methods that require significant effort or large changes to the program's structure.

Before you start


Before making any changes to the program, take the time to create a proper baseline for comparison. If you do not, you will wander in the dark, wondering whether your changes helped at all. First write benchmarks and capture profiles for use with pprof. Ideally, write the benchmarks with Go's own testing framework: it makes CPU and memory profiling easy. Also use benchcmp: a handy tool for comparing the performance difference between benchmark runs.

If the code does not lend itself well to benchmarks, just start with something that can be measured. You can profile the code manually using runtime/pprof.
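As a sketch, manual CPU profiling with runtime/pprof looks roughly like this (the workload function and the cpu.prof file name are placeholders for your own code):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// workload stands in for whatever code you actually want to profile.
func workload() int {
	sum := 0
	for i := 0; i < 1000000; i++ {
		sum += i
	}
	return sum
}

func main() {
	// Write CPU samples to a file that `go tool pprof` understands.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	workload()
}
```

You can then inspect the resulting profile with go tool pprof cpu.prof.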

So, let's begin!

Use sync.Pool to reuse previously allocated objects.


sync.Pool implements a free list. It lets you reuse previously allocated structures, amortising the cost of an allocation across many uses and reducing the garbage collector's work. The API is very simple: implement a function that allocates a new instance of the object. It should return a pointer type.

var bufpool = sync.Pool{
	New: func() interface{} {
		buf := make([]byte, 512)
		return &buf
	},
}

After that, you can Get() objects from the pool and Put() them back when you're done.

// sync.Pool returns an interface{}: you must cast it to the underlying
// type before you use it.
b := *bufpool.Get().(*[]byte)
defer bufpool.Put(&b)

// Now, go do interesting things with your byte buffer.
buf := bytes.NewBuffer(b)

There are nuances. Before Go 1.13, the pool was cleared on every garbage collection. This can hurt the performance of programs that allocate a lot. Starting with 1.13, more objects survive across GCs.

!!! Before returning an object to the pool, you must reset its fields.

If you do not, you can get a dirty object from the pool that contains data from a previous use. This is a serious security risk!

type AuthenticationResponse struct {
	Token  string
	UserID string
}

rsp := authPool.Get().(*AuthenticationResponse)
defer authPool.Put(rsp)

// If we don't hit this if statement, we might return data from other users!
if blah {
	rsp.UserID = "user-1"
	rsp.Token = "super-secret"
}

return rsp

The safe way to always guarantee zeroed memory is to reset it explicitly:

// reset resets all fields of the AuthenticationResponse before pooling it.
func (a *AuthenticationResponse) reset() {
	a.Token = ""
	a.UserID = ""
}

rsp := authPool.Get().(*AuthenticationResponse)
defer func() {
	rsp.reset()
	authPool.Put(rsp)
}()

The only time this is not a problem is when you use exactly the memory you wrote to. For example:

var (
	r io.Reader
	w io.Writer
)

// Obtain a buffer from the pool.
buf := *bufPool.Get().(*[]byte)
defer bufPool.Put(&buf)

// We only write to w exactly what we read from r, and no more.
nr, err := r.Read(buf)
if nr > 0 {
	if _, werr := w.Write(buf[:nr]); werr != nil {
		// handle write error
	}
}
if err != nil && err != io.EOF {
	// handle read error
}

Avoid using structures containing pointers as keys for a large map.


Whew, that was a mouthful. Apologies. Much has been said (including by my former colleague Phil Pearl) about Go's performance with large heaps. During garbage collection, the runtime scans objects containing pointers and traces through them. If you have a very large map[string]int, the GC has to check every string. This happens on every garbage collection, since strings contain pointers.

In this example, we write 10 million elements into a map[string]int and measure the duration of garbage collection. We allocate the map at package scope to guarantee heap allocation.

package main

import (
	"fmt"
	"runtime"
	"strconv"
	"time"
)

const (
	numElements = 10000000
)

var foo = map[string]int{}

func timeGC() {
	t := time.Now()
	runtime.GC()
	fmt.Printf("gc took: %s\n", time.Since(t))
}

func main() {
	for i := 0; i < numElements; i++ {
		foo[strconv.Itoa(i)] = i
	}

	for {
		timeGC()
		time.Sleep(1 * time.Second)
	}
}

Running the program, we will see the following:

  inthash → go install && inthash
 gc took: 98.726321ms
 gc took: 105.524633ms
 gc took: 102.829451ms
 gc took: 102.71908ms
 gc took: 103.084104ms
 gc took: 104.821989ms 

That's quite a long time in computer-land!

What can we do to optimize? Removing pointers wherever possible seems like a good idea, so the garbage collector has less to trace. Strings contain pointers, so let's implement this as map[int]int.

package main

import (
	"fmt"
	"runtime"
	"time"
)

const (
	numElements = 10000000
)

var foo = map[int]int{}

func timeGC() {
	t := time.Now()
	runtime.GC()
	fmt.Printf("gc took: %s\n", time.Since(t))
}

func main() {
	for i := 0; i < numElements; i++ {
		foo[i] = i
	}

	for {
		timeGC()
		time.Sleep(1 * time.Second)
	}
}

Running the program again, we will see:

  inthash → go install && inthash
 gc took: 3.608993ms
 gc took: 3.926913ms
 gc took: 3.955706ms
 gc took: 4.063795ms
 gc took: 3.91519ms
 gc took: 3.75226ms 

Much better. We sped up garbage collection roughly 35-fold. In production, you would need to hash the strings into integers before inserting them into the map.
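One minimal way to do that hashing (a sketch, not from the original article) is to key the map by FNV-1a from the standard library's hash/fnv package. Note that you lose the original keys and accept a tiny risk of collisions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashString maps a string key to a uint64 using FNV-1a.
func hashString(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func main() {
	// A pointer-free map: the GC no longer has to scan every key.
	counts := map[uint64]int{}
	counts[hashString("user-1")]++
	counts[hashString("user-1")]++

	fmt.Println(counts[hashString("user-1")]) // 2
}
```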

By the way, there are other ways to avoid GC scans. If you allocate giant slices of pointer-free structs, ints, or bytes, the GC will not scan them: that saves GC time. Such techniques usually require substantial rework of the program, so we won't go into them today.

As with any optimization, your mileage may vary. See the tweet thread from Damian Gryski for an interesting example where removing strings from a large map in favour of a smarter data structure actually increased memory consumption. In general, read everything he publishes.

Generate marshaling code to avoid reflection in runtime.


Marshaling and unmarshaling your structures to and from serialization formats such as JSON is a typical operation, especially when building microservices. For many microservices, it may be the only job they do. Functions like json.Marshal and json.Unmarshal rely on runtime reflection to serialize struct fields to bytes and vice versa. This can be slow: reflection is not as efficient as explicit code.

However, there are ways to optimize. The JSON marshaling mechanics look roughly like this:

package json

// Marshal takes an object and returns its representation in JSON.
func Marshal(obj interface{}) ([]byte, error) {
	// Check if this object knows how to marshal itself to JSON
	// by satisfying the Marshaler interface.
	if m, is := obj.(Marshaler); is {
		return m.MarshalJSON()
	}

	// It doesn't know how to marshal itself. Do default
	// reflection-based marshalling.
	return marshal(obj)
}

If we control the marshalling code, we have a hook to avoid runtime reflection. But we don't want to hand-write all the marshalling code, so what do we do? Have the computer generate it! Code generators like easyjson inspect a struct and generate highly optimized code that is fully compatible with the existing marshaling interfaces, such as json.Marshaler.

Install the package and run the following against the $file.go containing the structures you want to generate code for.

  easyjson -all $file.go

The file $file_easyjson.go will be generated. Since easyjson has implemented the json.Marshaler interface for you, these functions will be called instead of the reflection-based default. Congratulations: you have just sped up your JSON code about three times. There are many more tricks to squeeze out further performance.

I recommend this package because I have used it successfully myself. But beware: please do not take this as an invitation to start an aggressive debate with me about the fastest JSON packages.

Make sure the marshaling code is re-generated when the structure changes. If you forget, new fields you add will not be serialized, which will lead to confusion! You can use go generate for this. To keep the generated code in sync with the structs, I prefer to put a generate.go at the root of the package that runs go generate for all package files: this helps when you have many files that need generated code. A general tip: to ensure the structs are up to date, run go generate in CI and check that there is no diff against the checked-in code.
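Such a generate.go can be as small as this (the package name and structs.go file name here are hypothetical):

```go
// generate.go holds the package's go:generate directives, so that
// running `go generate ./...` re-creates the easyjson marshalling
// code whenever the structs in structs.go change.

//go:generate easyjson -all structs.go

package api
```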

Use strings.Builder to build strings.


In Go, strings are immutable: think of them as read-only slices of bytes. This means that each time you build a string, you allocate memory and potentially create more work for the garbage collector.

Go 1.10 introduced strings.Builder as an efficient way to build strings. Internally, it writes to a byte buffer. A string is only actually created when String() is called on the builder. It relies on some unsafe tricks to return the underlying bytes as a string with zero allocation: see this blog for a deeper look at how that works.

Let's compare the performance of two approaches:

// main.go
package main

import "strings"

var strs = []string{
	"here's",
	"a",
	"some",
	"long",
	"list",
	"of",
	"strings",
	"for",
	"you",
}

func buildStrNaive() string {
	var s string
	for _, v := range strs {
		s += v
	}
	return s
}

func buildStrBuilder() string {
	b := strings.Builder{}
	// Grow the buffer to a decent length, so we don't have to continually
	// re-allocate.
	b.Grow(60)
	for _, v := range strs {
		b.WriteString(v)
	}
	return b.String()
}

// main_test.go
package main

import (
	"testing"
)

var str string

func BenchmarkStringBuildNaive(b *testing.B) {
	for i := 0; i < b.N; i++ {
		str = buildStrNaive()
	}
}

func BenchmarkStringBuildBuilder(b *testing.B) {
	for i := 0; i < b.N; i++ {
		str = buildStrBuilder()
	}
}

Here are the results on my Macbook Pro:

  strbuild → go test -bench=. -benchmem
 goos: darwin
 goarch: amd64
 pkg: github.com/sjwhitworth/perfblog/strbuild
 BenchmarkStringBuildNaive-8      5000000   255 ns/op   216 B/op   8 allocs/op
 BenchmarkStringBuildBuilder-8   20000000  54.9 ns/op    64 B/op   1 allocs/op

As you can see, strings.Builder is 4.7 times faster, makes eight times fewer allocations, and uses about a third of the memory per operation.

When performance is important, use strings.Builder . In general, I recommend using it everywhere except in the most trivial cases of string construction.

Use strconv instead of fmt


fmt is one of Go's best-known packages. You probably used it in your first program to print "hello, world". But when it comes to converting integers and floats to strings, it is not as efficient as its lower-level sibling, strconv. That package gives decent performance for very few API changes.

fmt mostly accepts interface{} as its function arguments. There are two drawbacks: you lose static type safety, and passing a concrete value as an interface{} usually causes a heap allocation.

Source: https://habr.com/ru/post/457004/
