
Accelerating do-it-yourself string concatenation


Today we will speed up the concatenation of short strings in Go by 30%. To do this, we will not need to modify Go itself; everything will be implemented as a third-party library.


In this article we will walk through benchmark results, the naive approaches, the code the compiler generates for concatenation, and two faster implementations: a portable one and an AMD64 assembly one.



This article can also be considered a pretext to discuss CL123256: runtime, cmd/compile: specialize concatstring2 . Ideas for improving that change list are welcome.


The results up front


The comparison was made against the go tip (master) version of the compiler. Similar results can be reproduced on versions going back to around Go 1.5. The last significant change to the concatstrings function was CL3120: cmd/gc: allocate buffers for non-escaped strings on stack .


 BenchmarkConcat2Operator-8  20000000  83.8 ns/op
 BenchmarkConcat2Builder-8   20000000  70.9 ns/op
 BenchmarkConcat2-8          20000000  62.1 ns/op
 BenchmarkConcat3Operator-8  20000000  104 ns/op
 BenchmarkConcat3Builder-8   20000000  89.9 ns/op
 BenchmarkConcat3-8          20000000  82.1 ns/op

ConcatOperator uses + .
ConcatBuilder uses strings.Builder with proper pre-allocation.
Concat uses the function we implement in the course of this article.


Benchstat comparison:


 name       old time/op  new time/op  delta
 Concat2-8  84.2ns ± 1%  62.7ns ± 2%  -25.49%  (p=0.000 n=9+10)
 Concat3-8  103ns ± 3%   83ns ± 4%    -19.83%  (p=0.000 n=10+9)

The assembly implementation for GOARCH=AMD64 is slightly faster still and carries an additional optimization that the built-in + operator also has, but more on that below:


 name       old time/op  new time/op  delta
 Concat2-8  84.2ns ± 1%  57.1ns ± 3%  -32.20%  (p=0.000 n=9+9)

The assembly function is taken as the 100% performance baseline (relative to the other implementations under consideration).


Results for longer strings can be found in the README.md . The longer the strings, the less pronounced the difference between implementations.

Naive concatenation


The simplest solution is to use the + operator.


The semantics of this operator: take two strings and return a result string containing the concatenation of both. There is no guarantee that a new string will be returned. For example, when an empty string is concatenated with any other string, the runtime can return the non-empty argument, avoiding the need to allocate new memory and copy data into it.


But, as can be seen from the results at the beginning of the article, this is the slowest way.


 func concat2operator(x, y string) string {
 	return x + y
 }

Performance rating: 67.8% .

strings.Builder


Not so long ago, Go gained a new type: strings.Builder . It is an analogue of bytes.Buffer , but calling its String() method does not reallocate memory and copy the data.


Unlike bytes.Buffer , the builder has no small-buffer optimization and hence no pre-allocated memory for the string. If you do not use the Grow method, performance will be worse than with bytes.Buffer . Several regressions in Go 1.11 are caused by exactly this behavior (see CL113235 ).


For the purity of the experiment, our code avoids this mistake.


 func concat2builder(x, y string) string {
 	var builder strings.Builder
 	builder.Grow(len(x) + len(y)) // guarantees a single allocation
 	builder.WriteString(x)
 	builder.WriteString(y)
 	return builder.String()
 }

Performance rating: 80.5% (+12.7).
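The cost of forgetting Grow is easy to measure with testing.AllocsPerRun ; the sketch below (sizes picked arbitrarily) counts heap allocations with and without pre-allocation:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// builderAllocs counts heap allocations for one Builder-based
// concatenation, optionally pre-allocating with Grow.
func builderAllocs(x, y string, preGrow bool) float64 {
	return testing.AllocsPerRun(1000, func() {
		var b strings.Builder
		if preGrow {
			b.Grow(len(x) + len(y)) // single up-front allocation
		}
		b.WriteString(x)
		b.WriteString(y)
		_ = b.String()
	})
}

func main() {
	x := strings.Repeat("a", 8)
	y := strings.Repeat("b", 8)
	fmt.Println("without Grow:", builderAllocs(x, y, false))
	fmt.Println("with Grow:   ", builderAllocs(x, y, true))
}
```

Without Grow, the first WriteString allocates and the second triggers a slice growth; with Grow there is exactly one allocation.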

Code Generation for Concatenation


If we look at the code the compiler generates for the + operator, we will see calls to the functions concatstring2 , concatstring3 and so on (up to concatstring5 inclusive).


 func concat2codegen(x, y string) string { return x + y }
 // => CALL runtime.concatstring2(SB)

 func concat3codegen(x, y, z string) string { return x + y + z }
 // => CALL runtime.concatstring3(SB)

Now let's look at runtime/string.go itself:


 func concatstring2(buf *tmpBuf, a [2]string) string {
 	return concatstrings(buf, a[:])
 }

 func concatstring3(buf *tmpBuf, a [3]string) string {
 	return concatstrings(buf, a[:])
 }

So, it remains to study the concatstrings function.
The full listing is available under the spoiler below; here is a high-level description:


  1. The buf parameter can be nil . The compiler allocates this buffer when the resulting string does not escape the scope of its definition. If the result lives longer than its frame, the buffer will always be nil (which is the most common case). However, when the buffer is available and the result fits into it (its size is 32 bytes), allocation can be avoided.
  2. If all strings except one are empty, the function returns that string. However, strings allocated on the stack whose data would escape the frame are excluded from this optimization, so that the caller does not receive already-freed memory.
  3. Otherwise, all strings are copied into newly allocated memory.

Full listing of the concatstrings function
 // concatstrings implements a Go string concatenation x+y+z+...
 // The operands are passed in the slice a.
 // If buf != nil, the compiler has determined that the result does not
 // escape the calling function, so the string data can be stored in buf
 // if small enough.
 func concatstrings(buf *tmpBuf, a []string) string {
 	idx := 0
 	l := 0
 	count := 0
 	for i, x := range a {
 		n := len(x)
 		if n == 0 {
 			continue
 		}
 		if l+n < l {
 			throw("string concatenation too long")
 		}
 		l += n
 		count++
 		idx = i
 	}
 	if count == 0 {
 		return ""
 	}

 	// If there is just one string and either it is not on the stack
 	// or our result does not escape the calling frame (buf != nil),
 	// then we can return that string directly.
 	if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) {
 		return a[idx]
 	}
 	s, b := rawstringtmp(buf, l)
 	for _, x := range a {
 		copy(b, x)
 		b = b[len(x):]
 	}
 	return s
 }

Here we can see several places at once that can be optimized for the particular case of two operands.



Statistics on the use of concatenation

When running ./make.bash (which builds the compiler and the Go stdlib), we see 445 concatenations with two operands:


  • 398 results escape. For these, our specialization makes sense.
  • 47 results do not leave their frame.

In total, 89% of two-argument concatenations are amenable to the proposed optimization.


For the go utility, we have:


  • 501 calls to concatstring2
  • 194 calls to concatstring3
  • 55 calls to concatstring4

Version for all architectures


To implement the specialization, we need to know how strings are represented in Go. What matters to us is binary compatibility, and unsafe.Pointer can be replaced with *byte without any casualties.


 type stringStruct struct {
 	str *byte
 	len int
 }

The second important conclusion we can draw from the runtime: strings begin their lives mutable. Memory referenced by a []byte is allocated and the contents of the new string are written into it; only after that is the []byte discarded, and the memory it points to is stored in a stringStruct .
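We can convince ourselves that this layout matches the built-in string header by reinterpreting a string through the struct above (a sketch relying on the runtime's representation, which unsafe does not formally guarantee):

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringStruct mirrors the runtime's internal string representation.
type stringStruct struct {
	str *byte
	len int
}

// headerOf reinterprets a string header as a stringStruct. This relies
// on the runtime's layout, which unsafe does not formally guarantee.
func headerOf(s *string) *stringStruct {
	return (*stringStruct)(unsafe.Pointer(s))
}

func main() {
	s := "hello"
	ss := headerOf(&s)
	fmt.Println(ss.len, *ss.str) // the length and the first data byte
}
```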


Those who want more details are invited to examine rawstringtmp and rawstring .


runtime.rawstring
 // rawstring allocates storage for a new string. The returned
 // string and byte slice both refer to the same storage.
 // The storage is not zeroed. Callers should use
 // b to set the string contents and then drop b.
 func rawstring(size int) (s string, b []byte) {
 	p := mallocgc(uintptr(size), nil, false)
 	stringStructOf(&s).str = p
 	stringStructOf(&s).len = size
 	*(*slice)(unsafe.Pointer(&b)) = slice{p, size, size}
 	return
 }

We can do roughly the same thing using the dark side of the unsafe package:


 func concat2(x, y string) string {
 	length := len(x) + len(y)
 	if length == 0 {
 		return ""
 	}
 	b := make([]byte, length)
 	copy(b, x)
 	copy(b[len(x):], y)
 	return goString(&b[0], length)
 }

We allocate a []byte , which we use to build the contents of the new string. Then all that remains is to finalize the string by casting it to the representation the runtime expects. The goString function is responsible for that:


 func goString(ptr *byte, length int) string {
 	s := stringStruct{str: ptr, len: length}
 	return *(*string)(unsafe.Pointer(&s))
 }

Performance rating: 91.9% (+10.9).
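Assembled into a self-contained program, the portable version from this section behaves exactly like the + operator:

```go
package main

import (
	"fmt"
	"unsafe"
)

type stringStruct struct {
	str *byte
	len int
}

// goString casts a pointer+length pair to the runtime's string layout.
func goString(ptr *byte, length int) string {
	s := stringStruct{str: ptr, len: length}
	return *(*string)(unsafe.Pointer(&s))
}

// concat2 is the portable concatenation from this section.
func concat2(x, y string) string {
	length := len(x) + len(y)
	if length == 0 {
		return ""
	}
	b := make([]byte, length)
	copy(b, x)
	copy(b[len(x):], y)
	return goString(&b[0], length)
}

func main() {
	fmt.Println(concat2("foo", "bar"))
	fmt.Println(concat2("", "") == "")
}
```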

AMD64 version


Unfortunately, the previous version of the function does not optimize concatenation with an empty string, and we also perform a number of unnecessary computations: lacking the ability to allocate memory directly, we have to go through a byte slice.


One of the interesting properties of Go assembly is that it lets you call, for example, non-exported runtime functions. We can call runtime·mallocgc from assembly code even though it is not part of our package. We will make use of this.


We can also check whether a string's data belongs to stack memory, which makes it safe to optimize returning one of the arguments as the result.


Suppose the function is called as concat2("", "123") . x is an empty string, and if y is not allocated on the stack, we can return it as the result of the concatenation.
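In Go-like pseudocode ( getg , stringStructOf , and the g fields are runtime internals, not callable from user code), the check the assembly performs before return_y is roughly:

```
// pseudocode: not compilable outside the runtime
func maybeReturnY(y string) (string, bool) {
	gp := getg() // current goroutine, runtime-internal
	ptr := uintptr(unsafe.Pointer(stringStructOf(&y).str))
	// y may be returned directly only if its data is NOT on our stack.
	if ptr < gp.stack.lo || ptr >= gp.stack.hi {
		return y, true
	}
	return "", false // fall through to concatenate
}
```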


 //; At this point, x and y have been unpacked into stringStruct parts.
 //; CX - y.str.
 //; SI - y.len.
 maybe_return_y:
 	//; Check whether y's data lies on the stack.
 	MOVQ (TLS), AX  //; *g
 	CMPQ CX, (AX)
 	JL return_y     //; if y_str < g.stack.lo
 	CMPQ CX, 8(AX)
 	JGE return_y    //; if y_str >= g.stack.hi
 	JMP concatenate //; y is on the stack, cannot return it
 return_y:
 	MOVQ CX, ret+32(FP) //; stringStruct.str
 	MOVQ SI, ret+40(FP) //; stringStruct.len
 	RET

MOVQ (TLS), AX moves *g into the AX register. A read at zero offset yields the g.stack.lo field, and g.stack.hi begins at byte 8 (on a 64-bit platform).


 type g struct {
 	stack struct {
 		lo uintptr // 0(AX)
 		hi uintptr // 8(AX)
 	}
 	stackguard0 uintptr // 16(AX)
 	stackguard1 uintptr // 24(AX)
 	// ... other fields
 }

The concatenate body allocates memory, fills it with both strings, and returns a new string.


Full listing with comments
 #include "textflag.h"
 #include "funcdata.h"

 TEXT ·Strings(SB), 0, $48-48
 	NO_LOCAL_POINTERS
 	// Unpack the arguments.
 	MOVQ x+0(FP), DX
 	MOVQ x+8(FP), DI
 	MOVQ y+16(FP), CX
 	MOVQ y+24(FP), SI
 	TESTQ DI, DI
 	JZ maybe_return_y // x is empty, try to return y
 	TESTQ SI, SI
 	JZ maybe_return_x // y is empty, try to return x
 concatenate:
 	LEAQ (DI)(SI*1), R8 // len(x) + len(y)

 	// Allocate memory for the result.
 	MOVQ R8, 0(SP)
 	MOVQ $0, 8(SP)
 	MOVB $0, 16(SP)
 	CALL runtime·mallocgc(SB)
 	MOVQ 24(SP), AX // pointer to the allocated memory
 	MOVQ AX, newstr-8(SP)

 	// Copy x.
 	MOVQ x+0(FP), DX
 	MOVQ x+8(FP), DI
 	MOVQ AX, 0(SP)
 	MOVQ DX, 8(SP)
 	MOVQ DI, 16(SP)
 	CALL runtime·memmove(SB)

 	// Copy y at offset len(x).
 	MOVQ x+8(FP), DI
 	MOVQ y+16(FP), CX
 	MOVQ y+24(FP), SI
 	MOVQ newstr-8(SP), AX
 	LEAQ (AX)(DI*1), BX
 	MOVQ BX, 0(SP)
 	MOVQ CX, 8(SP)
 	MOVQ SI, 16(SP)
 	CALL runtime·memmove(SB)

 	// Return the new string.
 	MOVQ newstr-8(SP), AX
 	MOVQ x+8(FP), R8
 	ADDQ y+24(FP), R8
 	MOVQ AX, ret+32(FP)
 	MOVQ R8, ret+40(FP)
 	RET
 maybe_return_y:
 	// Check whether y's data lies on the stack.
 	MOVQ (TLS), AX // *g
 	CMPQ CX, (AX)
 	JL return_y // if y_ptr < stk.lo
 	CMPQ CX, 8(AX)
 	JGE return_y // if y_ptr >= stk.hi
 	JMP concatenate // y is on the stack, cannot return it
 return_y:
 	MOVQ CX, ret+32(FP)
 	MOVQ SI, ret+40(FP)
 	RET
 maybe_return_x:
 	// Check whether x's data lies on the stack.
 	MOVQ (TLS), AX // *g
 	CMPQ DX, (AX)
 	JL return_x // if x_ptr < stk.lo
 	CMPQ DX, 8(AX)
 	JGE return_x // if x_ptr >= stk.hi
 	JMP concatenate // x is on the stack, cannot return it
 return_x:
 	MOVQ DX, ret+32(FP)
 	MOVQ DI, ret+40(FP)
 	RET

If you are interested in the nature of NO_LOCAL_POINTERS in this code, you can read Calling a Go function from asm ("fatal error: missing stackmap") .


Performance rating: 100% (+8.6).

In conclusion


All the code is available as the concat package.


Is the world ready for such a quick concatenation? Who knows.


CL123256 was mentioned at the beginning of the article. It has several possible paths of development:


  1. Specialization only for the case when the compiler does not allocate a temporary buffer. The gain per call site is smaller, but it covers more kinds of concatenation and barely increases code size (both machine code and Go code).
  2. More specializations for more special cases. A higher gain, but more machine code, which can hurt the instruction cache.
  3. Tons of machine code for every special case plus a specialized memmove, in the manner of glibc. Here the main question is expediency.

The current proposed version accelerates only the most frequent and simplest case of concatenation of a pair of strings (arity = 2).


If Go does not accept this change, a comparable speedup can be achieved by implementing string operations as a third-party library. Less convenient, beautiful, and elegant, but it works.



Source: https://habr.com/ru/post/417479/

