Today, we will speed up the concatenation of short strings in Go by 30%. And we will not need to modify Go itself to do this: all of it can be implemented as a third-party library.
In this article: a comparison of the + operator, strings.Builder, and our own concatenation function.

This article can also be considered a pretext to discuss CL123256: runtime, cmd/compile: specialize concatstring2. Ideas for improving that change list are welcome.
The comparison was made against the go tip (master) version of the compiler. You can get similar results on versions starting from around Go 1.5. The last significant change to the concatstrings function was CL3120: cmd/gc: allocate buffers for non-escaped strings on stack.
BenchmarkConcat2Operator-8    20000000    83.8 ns/op
BenchmarkConcat2Builder-8     20000000    70.9 ns/op
BenchmarkConcat2-8            20000000    62.1 ns/op
BenchmarkConcat3Operator-8    20000000    104 ns/op
BenchmarkConcat3Builder-8     20000000    89.9 ns/op
BenchmarkConcat3-8            20000000    82.1 ns/op
ConcatOperator uses +.
ConcatBuilder uses strings.Builder with proper pre-allocation.
Concat uses the function that we will implement in the course of this story.
name       old time/op  new time/op  delta
Concat2-8  84.2ns ± 1%  62.7ns ± 2%  -25.49%  (p=0.000 n=9+10)
Concat3-8  103ns ± 3%   83ns ± 4%    -19.83%  (p=0.000 n=10+9)
The assembly implementation for GOARCH=AMD64 is slightly faster and has an additional optimization that is also present in the built-in + operator, but more on that below:
name       old time/op  new time/op  delta
Concat2-8  84.2ns ± 1%  57.1ns ± 3%  -32.20%  (p=0.000 n=9+9)
The assembly function is taken as 100% performance, and the other implementations under consideration are measured relative to it.
Results for longer strings can be seen in the README.md. The longer the strings, the less pronounced the difference between the implementations.
The simplest solution is to use the + operator.

The semantics of this operator: take two strings and return a result string containing the concatenation of both. There is no guarantee that a new string will be returned. For example, when an empty string is concatenated with any other, the runtime can return the non-empty argument, avoiding the need to allocate new memory and copy data into it.
But, as can be seen from the results at the beginning of the article, this is the slowest way.
func concat2operator(x, y string) string {
	return x + y
}
Performance rating: 67.8% .
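Numbers like the ones above can be reproduced with a small standalone harness. The sketch below uses testing.Benchmark so it runs without `go test`; the helper name and inputs are my own, not from the article's benchmark suite:

```go
package main

import (
	"fmt"
	"testing"
)

// concat2operator is the + version being measured.
func concat2operator(x, y string) string {
	return x + y
}

func main() {
	// testing.Benchmark runs a benchmark function outside of `go test`
	// and returns the measured result.
	res := testing.Benchmark(func(b *testing.B) {
		x, y := "hello, ", "world"
		for i := 0; i < b.N; i++ {
			_ = concat2operator(x, y)
		}
	})
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

The exact ns/op will vary by machine; only the relative ordering of implementations matters.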
Not so long ago, Go got a new type: strings.Builder. It is an analogue of bytes.Buffer, but calling its String() method does not re-allocate memory or copy the data.
Unlike bytes.Buffer, the builder has no small-buffer optimization and, consequently, no pre-allocated memory for the string. If you do not use the Grow method, performance will be worse than with bytes.Buffer. Several regressions in Go 1.11 were caused by precisely this feature (see CL113235).

In our code, for the purity of the experiment, we will avoid this mistake.
func concat2builder(x, y string) string {
	var builder strings.Builder
	builder.Grow(len(x) + len(y)) // pre-allocate exactly the required memory
	builder.WriteString(x)
	builder.WriteString(y)
	return builder.String()
}
Performance rating: 80.5% (+12.7).
If we look at what code the compiler generates for the + operator, we will see calls to the functions concatstring2, concatstring3, and so on (up to concatstring5 inclusive).
func concat2codegen(x, y string) string { return x + y }
// => CALL runtime.concatstring2(SB)

func concat3codegen(x, y, z string) string { return x + y + z }
// => CALL runtime.concatstring3(SB)
Let's look at runtime/string.go itself:
func concatstring2(buf *tmpBuf, a [2]string) string {
	return concatstrings(buf, a[:])
}

func concatstring3(buf *tmpBuf, a [3]string) string {
	return concatstrings(buf, a[:])
}
So, it remains to study the concatstrings function itself.
The full listing is available below, but here is a high-level description:

The buf parameter can be nil. This buffer is allocated by the compiler if the result string does not "escape" the scope of its definition. If the string lives longer than its frame, this buffer will always be nil (which is what happens most often). However, when this buffer is available, an allocation can be avoided if the result fits into it (its size is 32 bytes).

// concatstrings implements a Go string concatenation x+y+z+...
// The operands are passed in the slice a.
// If buf != nil, the compiler has determined that the result does not
// escape the calling function, so the string data can be stored in buf
// if small enough.
func concatstrings(buf *tmpBuf, a []string) string {
	idx := 0
	l := 0
	count := 0
	for i, x := range a {
		n := len(x)
		if n == 0 {
			continue
		}
		if l+n < l {
			throw("string concatenation too long")
		}
		l += n
		count++
		idx = i
	}
	if count == 0 {
		return ""
	}

	// If there is just one string and either it is not on the stack
	// or our result does not escape the calling frame (buf != nil),
	// then we can return that string directly.
	if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) {
		return a[idx]
	}
	s, b := rawstringtmp(buf, l)
	for _, x := range a {
		copy(b, x)
		b = b[len(x):]
	}
	return s
}
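The logic of concatstrings can be mirrored in plain Go without the runtime-only pieces (buf, stringDataOnStack, throw). A simplified sketch, with names of my own choosing:

```go
package main

import "fmt"

// concatStrings mirrors runtime.concatstrings without the
// compiler-provided buffer: it computes the total length,
// handles the zero- and one-operand fast paths, then copies.
func concatStrings(a []string) string {
	total, count, idx := 0, 0, 0
	for i, x := range a {
		if len(x) == 0 {
			continue
		}
		total += len(x)
		count++
		idx = i
	}
	if count == 0 {
		return ""
	}
	if count == 1 {
		// The real runtime additionally checks that the string
		// data is not on the stack before returning it directly.
		return a[idx]
	}
	b := make([]byte, 0, total)
	for _, x := range a {
		b = append(b, x...)
	}
	return string(b)
}

func main() {
	fmt.Println(concatStrings([]string{"a", "", "b", "c"})) // abc
}
```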
Here we see several places at once that can be optimized for a particular case:

buf is most often nil. When the compiler could not prove that it is safe to place the string on the stack, passing the extra parameter and checking it for nil inside the function is pure overhead.

When len(a) == 2, we do not need the loop, and the calculations can be simplified. And this is the most common kind of concatenation.

When running ./make.bash (the build of the compiler and the Go stdlib), we see 445 concatenations with two operands:
In total, 89% of concatenations take two arguments and thus fall under this optimization.
For the go utility, we have:
To implement the specialization, we need to know how strings are represented in Go. What matters to us is binary compatibility, so unsafe.Pointer can be replaced with *byte without any casualties.
type stringStruct struct {
	str *byte
	len int
}
The second important conclusion we can draw from the runtime: strings begin their lives mutable. Memory referenced through a []byte is allocated and the contents of the new string are written into it; only after that is the []byte discarded, and the memory it points to is stored in a stringStruct.
Those who want more details are invited to examine the rawstringtmp and rawstring functions.
// rawstring allocates storage for a new string. The returned
// string and byte slice both refer to the same storage.
// The storage is not zeroed. Callers should use
// b to set the string contents and then drop b.
func rawstring(size int) (s string, b []byte) {
	p := mallocgc(uintptr(size), nil, false)

	stringStructOf(&s).str = p
	stringStructOf(&s).len = size

	*(*slice)(unsafe.Pointer(&b)) = slice{p, size, size}

	return
}
We can do roughly the same thing using the dark side of the unsafe package:
func concat2(x, y string) string {
	length := len(x) + len(y)
	if length == 0 {
		return ""
	}
	b := make([]byte, length)
	copy(b, x)
	copy(b[len(x):], y)
	return goString(&b[0], length)
}
We allocate a []byte, which we use to build the contents of the new string. Then all that remains is to finalize the string by casting it to the representation the runtime expects. The goString function is responsible for this:
func goString(ptr *byte, length int) string {
	s := stringStruct{str: ptr, len: length}
	return *(*string)(unsafe.Pointer(&s))
}
Performance rating: 91.9% (+10.9).
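For convenience, here are the pieces from this section assembled into one runnable file; the code itself is unchanged from the listings above:

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringStruct mirrors the runtime's string header.
type stringStruct struct {
	str *byte
	len int
}

// goString finalizes a byte buffer into a string without copying.
func goString(ptr *byte, length int) string {
	s := stringStruct{str: ptr, len: length}
	return *(*string)(unsafe.Pointer(&s))
}

// concat2 builds the result in a []byte and casts it to string.
func concat2(x, y string) string {
	length := len(x) + len(y)
	if length == 0 {
		return ""
	}
	b := make([]byte, length)
	copy(b, x)
	copy(b[len(x):], y)
	return goString(&b[0], length)
}

func main() {
	fmt.Println(concat2("foo", "bar")) // foobar
}
```

This is safe only because the []byte never escapes concat2, so no one can mutate the string's backing memory afterwards.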
Unfortunately, the previous version of the function is not optimized for concatenation with an empty string, and we also perform some unnecessary work: since we cannot allocate memory directly, we have to work through a byte slice.
One of the interesting features of Go assembly is that it lets you call, for example, non-exported runtime functions. We can call runtime·mallocgc from assembly code even though our code is not part of the runtime package. We will make use of this property.
We can also check whether a string's data resides in stack memory, which makes it safe to optimize returning one of the arguments as the result.
Suppose the function is called as concat2("", "123"). x is an empty string, and if y is not allocated on the stack, we can return y itself as the result of the concatenation.
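The stack-ownership test is nothing more than a half-open range comparison against g.stack.lo and g.stack.hi. A Go sketch of the same predicate with synthetic bounds (the function name is mine):

```go
package main

import "fmt"

// onStack reports whether ptr lies within the half-open
// stack range [lo, hi). In the assembly version, lo and hi
// are loaded from the goroutine's g.stack field via TLS.
func onStack(ptr, lo, hi uintptr) bool {
	return ptr >= lo && ptr < hi
}

func main() {
	lo, hi := uintptr(0x1000), uintptr(0x2000)
	fmt.Println(onStack(0x1800, lo, hi)) // true
	fmt.Println(onStack(0x2800, lo, hi)) // false
}
```

If the pointer is outside the range, the string data is on the heap (or in read-only data) and the argument can be returned directly.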
//; On entry, x and y are passed as stringStructs.
//; CX - y.str
//; SI - y.len
maybe_return_y:
	//; Check whether y's data lies on the goroutine stack.
	MOVQ (TLS), AX   //; AX = *g
	CMPQ CX, (AX)
	JL   return_y    //; y_str < g.stack.lo
	CMPQ CX, 8(AX)
	JGE  return_y    //; y_str >= g.stack.hi
	JMP  concatenate //; y is on the stack, cannot return it directly
return_y:
	MOVQ CX, ret+32(FP) //; stringStruct.str
	MOVQ SI, ret+40(FP) //; stringStruct.len
	RET
MOVQ (TLS), AX moves *g into the AX register. A read at zero offset then yields the g.stack.lo field, and at offset 8 g.stack.hi begins (on a 64-bit platform).
type g struct {
	stack struct {
		lo uintptr // 0(AX)
		hi uintptr // 8(AX)
	}
	stackguard0 uintptr // 16(AX)
	stackguard1 uintptr // 24(AX)
	// ...
}
The concatenate body allocates memory, fills it with both strings, and returns the new string.
#include "textflag.h"
#include "funcdata.h"

TEXT ·Strings(SB), 0, $48-48
	NO_LOCAL_POINTERS
	// Load the arguments.
	MOVQ x+0(FP), DX
	MOVQ x+8(FP), DI
	MOVQ y+16(FP), CX
	MOVQ y+24(FP), SI
	TESTQ DI, DI
	JZ maybe_return_y // x is empty, maybe return y
	TESTQ SI, SI
	JZ maybe_return_x // y is empty, maybe return x
concatenate:
	LEAQ (DI)(SI*1), R8 // len(x) + len(y)
	// Allocate memory for the new string.
	MOVQ R8, 0(SP)
	MOVQ $0, 8(SP)
	MOVB $0, 16(SP)
	CALL runtime·mallocgc(SB)
	MOVQ 24(SP), AX // pointer to the allocated memory
	MOVQ AX, newstr-8(SP)
	// Copy x.
	MOVQ x+0(FP), DX
	MOVQ x+8(FP), DI
	MOVQ AX, 0(SP)
	MOVQ DX, 8(SP)
	MOVQ DI, 16(SP)
	CALL runtime·memmove(SB)
	// Copy y at offset len(x).
	MOVQ x+8(FP), DI
	MOVQ y+16(FP), CX
	MOVQ y+24(FP), SI
	MOVQ newstr-8(SP), AX
	LEAQ (AX)(DI*1), BX
	MOVQ BX, 0(SP)
	MOVQ CX, 8(SP)
	MOVQ SI, 16(SP)
	CALL runtime·memmove(SB)
	// Return the new string.
	MOVQ newstr-8(SP), AX
	MOVQ x+8(FP), R8
	ADDQ y+24(FP), R8
	MOVQ AX, ret+32(FP)
	MOVQ R8, ret+40(FP)
	RET
maybe_return_y:
	// Check whether y is on the stack.
	MOVQ (TLS), AX // *g
	CMPQ CX, (AX)
	JL return_y // y_ptr < stk.lo
	CMPQ CX, 8(AX)
	JGE return_y // y_ptr >= stk.hi
	JMP concatenate // y is on the stack, cannot return it
return_y:
	MOVQ CX, ret+32(FP)
	MOVQ SI, ret+40(FP)
	RET
maybe_return_x:
	// Check whether x is on the stack.
	MOVQ (TLS), AX // *g
	CMPQ DX, (AX)
	JL return_x // x_ptr < stk.lo
	CMPQ DX, 8(AX)
	JGE return_x // x_ptr >= stk.hi
	JMP concatenate // x is on the stack, cannot return it
return_x:
	MOVQ DX, ret+32(FP)
	MOVQ DI, ret+40(FP)
	RET
If you are curious about the purpose of NO_LOCAL_POINTERS in this code, you can read Calling a Go function from asm ("fatal error: missing stackmap").
Performance rating: 100% (+8.6).
All code is provided as a concat package.
Is the world ready for such a quick concatenation? Who knows.
At the beginning of the article, CL123256 was mentioned. It has several possible paths forward:
The current proposed version accelerates only the most frequent and simplest case of concatenation of a pair of strings (arity = 2).
If Go does not accept this change, then comparable acceleration can be achieved by implementing the string operations as a third-party library. It is less convenient and elegant, but it works.
Source: https://habr.com/ru/post/417479/