Go assembler guide

Before you start studying the implementation of the runtime and the standard library, you need to become familiar with Go's abstract assembly language. I hope this guide helps you pick up the necessary knowledge quickly.

Content

This article assumes that readers have a basic knowledge of any kind of assembler.
When it comes to architecture-specific matters, linux/amd64 is always assumed.

We will always work with compiler optimizations enabled.

All quotes are taken from official documentation and / or code base, unless otherwise noted.

"Pseudoassembler"

The Go compiler outputs an abstract, portable form of assembly that is not tied to any particular hardware. The Go assembler then uses this pseudo-assembly to generate machine-specific instructions for the target hardware.

This extra layer has many benefits, the main one being how easy it makes porting Go to new architectures. For details, see Rob Pike's talk “The Design of the Go Assembler”.

The most important thing to know about Go's assembler is that it is not a direct representation of the underlying machine. Some of the details map precisely to the machine, but some do not. This is because the compiler suite needs no assembler pass in the usual pipeline. Instead, the compiler operates on a kind of semi-abstract instruction set, and instruction selection occurs partly after code generation. The assembler works on the semi-abstract form, so when you see an instruction like MOV, what the toolchain actually generates for that operation might not be a move instruction at all, perhaps a clear or a load. Or it might correspond exactly to the machine instruction with that name. In general, machine-specific operations tend to appear as themselves, while more general concepts like memory move and subroutine call and return are more abstract. The details vary with architecture, and we apologize for the imprecision; the situation is not well-defined.

The assembler program is a way to parse a description of that semi-abstract instruction set and turn it into instructions to be input to the linker.

Simple program decomposition

Consider this Go code (direct_topfunc_call.go):

//go:noinline
func add(a, b int32) (int32, bool) { return a + b, true }

func main() { add(10, 32) }

(Note the //go:noinline compiler directive here. Without it, the compiler would be free to inline add, and the call we want to study would disappear.)

Let's compile the code to assembly:

$ GOOS=linux GOARCH=amd64 go tool compile -S direct_topfunc_call.go

0x0000 TEXT "".add(SB), NOSPLIT, $0-16
  0x0000 FUNCDATA $0, gclocals·f207267fbf96a0178e8758c6e3e0ce28(SB)
  0x0000 FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
  0x0000 MOVL "".b+12(SP), AX
  0x0004 MOVL "".a+8(SP), CX
  0x0008 ADDL CX, AX
  0x000a MOVL AX, "".~r2+16(SP)
  0x000e MOVB $1, "".~r3+20(SP)
  0x0013 RET

0x0000 TEXT "".main(SB), $24-0
  ;; ...omitted stack-split prologue...
  0x000f SUBQ $24, SP
  0x0013 MOVQ BP, 16(SP)
  0x0018 LEAQ 16(SP), BP
  0x001d FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
  0x001d FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
  0x001d MOVQ $137438953482, AX
  0x0027 MOVQ AX, (SP)
  0x002b PCDATA $0, $0
  0x002b CALL "".add(SB)
  0x0030 MOVQ 16(SP), BP
  0x0035 ADDQ $24, SP
  0x0039 RET
  ;; ...omitted stack-split epilogue...

Let's dissect these two functions line by line to get a better understanding of what the compiler is doing.

Analyzing `add`

 0x0000 TEXT "".add(SB), NOSPLIT, $0-16

0x0000: Offset of the current instruction, relative to the start of the function.
TEXT "".add: The TEXT directive declares the "".add symbol as part of the .text section (i.e. executable code) and indicates that the instructions that follow are the body of the function.

The empty string "" will be replaced by the name of the current package at link time: "".add will become main.add once linked into the final binary.
(SB): SB is the virtual register that holds the "static-base" pointer, i.e. the address of the beginning of the program's address space.

"".add(SB) declares that our symbol is located at some constant offset from the start of that address space. In other words, it has an absolute, direct address: it is a global function symbol. objdump confirms this:

$ objdump -j .text -t direct_topfunc_call | grep 'main.add'
000000000044d980 g F .text 000000000000000f main.add

All user-defined symbols are written as offsets to the pseudo-registers FP (arguments and locals) and SB (globals). The SB pseudo-register can be thought of as the origin of memory, so the symbol foo(SB) is the name foo as an address in memory.
NOSPLIT: Tells the compiler that it should not insert the stack-split preamble, which checks whether the current stack needs to be grown.

In the case of our add function, the compiler set this flag by itself: it is smart enough to figure out that, since add has no local variables and no stack frame of its own, it simply cannot outgrow the current stack. Running these checks at every call would therefore be a complete waste of CPU cycles.

"NOSPLIT": Don't insert the preamble to check if the stack must be split. The frame for the routine, plus anything it calls, must fit in the spare space at the top of the stack segment. Used to protect routines such as the stack splitting code itself.

We will say a few words about goroutines and stack splits at the end of this article.
$0-16: 0 is the size in bytes of the stack frame that will be allocated; 16 is the size in bytes of the arguments passed in by the caller.

In the general case, the frame size is followed by an argument size, separated by a minus sign. (It's not a subtraction, just idiosyncratic syntax.) The frame size $24-8 states that the function has a 24-byte frame and is called with 8 bytes of argument, which live on the caller's frame. If NOSPLIT is not specified for the TEXT, the argument size must be provided. For assembly functions with Go prototypes, go vet will check that the argument size is correct.
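To make the "$frame-args" notation concrete, here is a minimal sketch that splits such an annotation into its two numbers. The helper name parseFrameSpec is hypothetical and is not part of the Go toolchain; it only illustrates that the minus sign is a separator.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFrameSpec splits a TEXT size annotation such as "$24-8" into the
// frame size and the argument size. Hypothetical helper for illustration:
// the "-" is treated as a separator, never as a subtraction.
func parseFrameSpec(spec string) (frame, args int, err error) {
	parts := strings.SplitN(strings.TrimPrefix(spec, "$"), "-", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("missing argument size in %q", spec)
	}
	if frame, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if args, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return frame, args, nil
}

func main() {
	// The three annotations that appear in this article.
	for _, spec := range []string{"$0-16", "$24-8", "$24-0"} {
		frame, args, err := parseFrameSpec(spec)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: frame=%d bytes, args=%d bytes\n", spec, frame, args)
	}
}
```

For "$0-16" this reports a 0-byte frame and 16 bytes of arguments, which is how the TEXT line of add should be read.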

0x0000 FUNCDATA $0, gclocals·f207267fbf96a0178e8758c6e3e0ce28(SB)
0x0000 FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)

The FUNCDATA and PCDATA directives contain information for use by the garbage collector; they are introduced by the compiler.

Don't worry about them for now; we will come back to them in the article that deals with garbage collection.

0x0000 MOVL "".b+12(SP), AX
0x0004 MOVL "".a+8(SP), CX

The Go calling convention mandates that every argument be passed on the stack, using the space pre-reserved in the caller's stack frame. It is the caller's responsibility to grow (and shrink) the stack appropriately so that arguments can be passed to the callee and return values passed back to the caller. (This is the stack-based convention of the toolchain used for the listings in this article; since Go 1.17 the compiler passes arguments in registers on amd64.)

The Go compiler never generates instructions from the PUSH/POP family: the stack is grown or shrunk by respectively decrementing or incrementing the virtual hardware stack pointer SP (see the discussion in issue #21: about SP register).

The SP pseudo-register is a virtual stack pointer used to refer to frame-local variables and the arguments being prepared for function calls. It points to the top of the local stack frame, so references should use negative offsets in the range [−framesize, 0): x-8(SP), y-4(SP), and so on.

Although the official documentation states that “All user-defined symbols are written as offsets to the pseudo-register FP (arguments and locals)”, this is only ever true for hand-written code.

Like most recent compilers, the Go toolchain always uses offsets from the stack pointer directly in the code it generates to refer to arguments and locals. This frees up the frame pointer for use as an extra general-purpose register on platforms with fewer registers (e.g. x86).

Have a look at “x86-64 stack frame layout” if you enjoy this kind of nitty-gritty detail (see also issue #2: Frame pointer).

"".b+12(SP) and "".a+8(SP) refer to addresses located 12 and 8 bytes from the top of the stack (remember: the stack grows down!).

"".a and "".b are arbitrary aliases given to the referred memory locations. Although they have absolutely no semantic meaning, they are mandatory when using relative addressing on virtual registers. This is what the documentation says about the virtual frame pointer:

The FP pseudo-register is a virtual frame pointer used to refer to function arguments. The compilers maintain a virtual frame pointer and refer to the arguments on the stack as offsets from that pseudo-register. Thus 0(FP) is the first argument to the function, 8(FP) is the second (on a 64-bit machine), and so on. However, when referring to a function argument this way, it is necessary to place a name at the beginning, as in first_arg+0(FP) and second_arg+8(FP). (The meaning of the offset, an offset from the frame pointer, is distinct from its use with SB, where it is an offset from the symbol.) The assembler enforces this convention, rejecting plain 0(FP) and 8(FP). The actual name is semantically irrelevant but should be used to document the argument's name.

Finally, two more important points should be noted:

The first argument a is not located at 0(SP) but at 8(SP), because the caller stored its return address at 0(SP) via the CALL pseudo-instruction.
Arguments are passed in reverse order; that is, the first argument is the closest to the top of the stack.

0x0008 ADDL CX, AX
0x000a MOVL AX, "".~r2+16(SP)
0x000e MOVB $1, "".~r3+20(SP)

ADDL adds the two Long-words (i.e. 4-byte values) stored in AX and CX and writes the result to AX. That result is then moved to "".~r2+16(SP), a slot that the caller reserved on its stack beforehand and where it expects to find the return values. Once again, "".~r2 has no semantic meaning here.

To demonstrate how Go handles multiple return values, we also return a constant true boolean. The mechanics are exactly the same as for the first return value; only the offset relative to SP changes.

 0x0013 RET

The RET pseudo-instruction tells the Go assembler to insert whatever instructions the calling convention of the target platform requires in order to properly return from a subroutine call. Most likely this will cause the code to pop off the return address stored at 0(SP) and then jump back to it.

The last instruction in a TEXT block must be some sort of jump, usually a RET (pseudo-)instruction. (If it's not, the linker will append a jump-to-itself instruction; there is no fallthrough in TEXTs.)

That is a lot of syntax and semantics to take in at once. Here is a quick inline summary of what we have just covered:

;; Declare global function symbol "".add (actually main.add once linked)
;; Do not insert stack-split preamble
;; 0 bytes of stack-frame, 16 bytes of arguments passed in
;; func add(a, b int32) (int32, bool)
0x0000 TEXT "".add(SB), NOSPLIT, $0-16
  ;; ...omitted FUNCDATA stuff...
  0x0000 MOVL "".b+12(SP), AX      ;; move second Long-word (4B) argument from caller's stack-frame into AX
  0x0004 MOVL "".a+8(SP), CX       ;; move first Long-word (4B) argument from caller's stack-frame into CX
  0x0008 ADDL CX, AX               ;; compute AX=CX+AX
  0x000a MOVL AX, "".~r2+16(SP)    ;; move addition result (AX) into caller's stack-frame
  0x000e MOVB $1, "".~r3+20(SP)    ;; move `true` boolean (constant) into caller's stack-frame
  0x0013 RET                       ;; jump to return address stored at 0(SP)

And here is a visual representation of what the stack looks like once main.add has finished executing:

   |    +-------------------------+ <-- 32(SP)
   |    |                         |
 G |    |                         |
 R |    |                         |
 O |    | main.main's saved       |
 W |    |     frame-pointer (BP)  |
 S |    |-------------------------| <-- 24(SP)
   |    |      [alignment]        |
 D |    | "".~r3 (bool) = 1/true  | <-- 21(SP)
 O |    |-------------------------| <-- 20(SP)
 W |    |                         |
 N |    | "".~r2 (int32) = 42     |
 W |    |-------------------------| <-- 16(SP)
 A |    |                         |
 R |    | "".b (int32) = 32       |
 D |    |-------------------------| <-- 12(SP)
 S |    |                         |
   |    | "".a (int32) = 10       |
   |    |-------------------------| <-- 8(SP)
   |    |                         |
   |    |                         |
   |    |                         |
 \ | /  | return address to       |
  \|/   |     main.main + 0x30    |
   -    +-------------------------+ <-- 0(SP) (TOP OF STACK)

(diagram made with https://textik.com)
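The offsets used by add can be cross-checked with a small Go sketch. It assumes that a plain struct with the same field types is a fair stand-in for the argument and result slots; the addArgs type and its helpers are hypothetical and are not how the compiler models a frame.

```go
package main

import (
	"fmt"
	"unsafe"
)

// addArgs mimics the slots main reserves for add(a, b int32) (int32, bool).
// Hypothetical stand-in: same field sizes, laid out in the same order.
type addArgs struct {
	a  int32
	b  int32
	r2 int32
	r3 bool
}

// retAddrSize is the size of the return address that CALL stores on amd64.
const retAddrSize = 8

// layout exists only so that unsafe.Offsetof has fields to refer to.
var layout addArgs

// offsetsFromSP returns the position of each slot relative to SP as seen
// from inside add, i.e. once the return address sits at 0(SP).
func offsetsFromSP() (a, b, r2, r3 uintptr) {
	return unsafe.Offsetof(layout.a) + retAddrSize,
		unsafe.Offsetof(layout.b) + retAddrSize,
		unsafe.Offsetof(layout.r2) + retAddrSize,
		unsafe.Offsetof(layout.r3) + retAddrSize
}

// argAreaSize returns the padded size of the argument and result area.
func argAreaSize() uintptr {
	return unsafe.Sizeof(layout)
}

func main() {
	a, b, r2, r3 := offsetsFromSP()
	fmt.Printf("a+%d(SP) b+%d(SP) ~r2+%d(SP) ~r3+%d(SP)\n", a, b, r2, r3)
	fmt.Printf("argument area: %d bytes\n", argAreaSize())
}
```

Under these assumptions the sketch yields the offsets 8, 12, 16 and 20 seen in the listing, and a 16-byte argument area matching the $0-16 annotation (13 bytes of data padded up for alignment).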

Analyzing `main`

In order not to have to flip through the article, let me remind you what our main function looks like:

0x0000 TEXT "".main(SB), $24-0
  ;; ...omitted stack-split prologue...
  0x000f SUBQ $24, SP
  0x0013 MOVQ BP, 16(SP)
  0x0018 LEAQ 16(SP), BP
  ;; ...omitted FUNCDATA stuff...
  0x001d MOVQ $137438953482, AX
  0x0027 MOVQ AX, (SP)
  ;; ...omitted PCDATA stuff...
  0x002b CALL "".add(SB)
  0x0030 MOVQ 16(SP), BP
  0x0035 ADDQ $24, SP
  0x0039 RET
  ;; ...omitted stack-split epilogue...

0x0000 TEXT "".main(SB), $24-0

Nothing new:

"".main (once linked main.main ) is the symbol of a global function in the .text section, whose address is a constant offset from the beginning of our address space.
This code places a 24-byte stack frame in memory, takes no arguments, and does not return values.
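The linked symbol names mentioned above can also be observed from a running program. The following is a minimal sketch using runtime.FuncForPC; the symbolName helper is hypothetical, and the exact name printed depends on the package the code is built in.

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
)

//go:noinline
func add(a, b int32) (int32, bool) { return a + b, true }

// symbolName reports the package-qualified name that the runtime records
// for a Go function value. Hypothetical helper for illustration.
func symbolName(fn interface{}) string {
	pc := reflect.ValueOf(fn).Pointer()
	return runtime.FuncForPC(pc).Name()
}

func main() {
	// When built as package main, the "" placeholder seen in the compiler
	// output has been replaced by the package name.
	fmt.Println(symbolName(add))
}
```

The placeholder "" never shows up here: by the time the program runs, the symbol carries the package name as its prefix.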

0x000f SUBQ $24, SP
0x0013 MOVQ BP, 16(SP)
0x0018 LEAQ 16(SP), BP

As mentioned above, the calling convention in Go dictates that all arguments be passed to the stack.

The caller, main, grows its stack frame by 24 bytes by decrementing the virtual stack pointer (remember that the stack grows downward, so SUBQ here actually makes the stack frame bigger). These 24 bytes consist of:

8 bytes (16(SP)-24(SP)) are used to store the current value of the frame pointer BP (the real one!), to allow for stack unwinding and to facilitate debugging.
1+3 bytes (12(SP)-16(SP)) are reserved for the second return value (bool), plus 3 bytes of padding required for alignment on amd64.
4 bytes (8(SP)-12(SP)) are reserved for the first return value (int32).
4 bytes (4(SP)-8(SP)) are reserved for the value of argument b (int32).
4 bytes (0(SP)-4(SP)) are reserved for the value of argument a (int32).

Finally, after increasing the stack, LEAQ calculates the new frame pointer address and saves it to BP .

0x001d MOVQ $137438953482, AX
0x0027 MOVQ AX, (SP)

The caller stores the arguments for the callee as a single Quad word (8-byte value) at the top of the stack it has just grown.

Although at first glance it may seem like random garbage, in fact, 137438953482 corresponds to 4-byte values of 10 and 32 , which are combined into one 8-byte value:

$ echo 'obase=2;137438953482' | bc
10000000000000000000000000000000001010
\____/\______________________________/
   32                              10

0x002b CALL "".add(SB)

We CALL the add function as an offset relative to the static-base pointer; that is, this is a straightforward jump to a direct address.

Note that CALL also places the return address (8-byte value) on top of the stack. Therefore, each link to SP from within the add function will be offset by 8 bytes! For example, "".a is now not at 0(SP) , but at 8(SP) .
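The 8-byte constant that main stores at (SP) before the CALL can be reproduced in Go. This is only a sketch of the arithmetic, assuming the little-endian byte order of amd64; packArgs and unpackLE are hypothetical helper names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// packArgs combines two int32 arguments into one 8-byte value, with a in
// the low 4 bytes and b in the high 4 bytes.
func packArgs(a, b int32) uint64 {
	return uint64(uint32(b))<<32 | uint64(uint32(a))
}

// unpackLE writes the quad word into memory in little-endian order, as
// amd64 does, and reads back the 4-byte slots at offsets 0 and 4.
func unpackLE(q uint64) (atZero, atFour int32) {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], q)
	atZero = int32(binary.LittleEndian.Uint32(buf[0:4]))
	atFour = int32(binary.LittleEndian.Uint32(buf[4:8]))
	return atZero, atFour
}

func main() {
	q := packArgs(10, 32)
	atZero, atFour := unpackLE(q)
	fmt.Println(q, atZero, atFour) // prints "137438953482 10 32"
}
```

A single 8-byte store therefore fills both argument slots at once: 10 ends up in the first 4 bytes and 32 in the next 4.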

0x0030 MOVQ 16(SP), BP
0x0035 ADDQ $24, SP
0x0039 RET

Finally, we:

Unwind the frame pointer by one stack frame (i.e. we “go down” one level).
Shrink the stack by 24 bytes to reclaim the space we allocated earlier.
Ask the Go assembler to insert the subroutine-return instructions.

A word about goroutines, stacks and splits

This is neither the time nor the place to dig into goroutine internals, but once you start reading assembly dumps you will very quickly need to be familiar with the instructions that manage the stack.

You should be able to recognize these patterns quickly and understand, in general terms, what they do and why.

Stacks

Since the number of goroutines in a Go program is not bounded and can in practice reach several million, the runtime has to be conservative when allocating stacks for goroutines in order to avoid eating up all the available memory.

Thus, every new goroutine is initially given a small 2 KB stack by the runtime (that stack actually lives on the heap).

As it runs, a goroutine may outgrow its initial stack space, which would cause a stack overflow. To prevent this, the runtime makes sure that when a goroutine is about to run out of stack, a new stack twice the size of the old one is allocated and the contents of the old stack are copied over to it.

This process is known as a stack split and effectively makes goroutine stacks dynamically sized.
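The doubling policy described above can be modelled in a few lines. This is a toy model under the assumptions stated in the text (a 2 KB start, doubling on every growth), not the runtime's actual code; the grow function is hypothetical.

```go
package main

import "fmt"

// initialStack is the 2 KB starting size mentioned in the text.
const initialStack = 2 * 1024

// grow models stack growth by repeated doubling: it returns the stack size
// that first satisfies need, and how many times the stack had to be copied
// to a larger allocation along the way.
func grow(need int) (size, copies int) {
	size = initialStack
	for size < need {
		size *= 2
		copies++
	}
	return size, copies
}

func main() {
	for _, need := range []int{1024, 3000, 100000} {
		size, copies := grow(need)
		fmt.Printf("need %6d bytes -> stack of %6d bytes after %d copies\n", need, size, copies)
	}
}
```

Because the size doubles each time, even a goroutine that needs about 100 KB of stack only goes through six copies in this model.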

Splits

For stack splitting to work, the compiler inserts a few instructions at the beginning and end of every function that could potentially overflow its stack.

To avoid unnecessary overhead, functions that cannot possibly outgrow their stack are marked NOSPLIT, which tells the compiler not to insert these checks.
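The same request can be made from Go source with the //go:nosplit compiler directive. The sketch below uses a hypothetical leaf function, leafAdd; whether a given function is safe to mark this way depends on how much stack it and its callees need, so treat it as an illustration rather than a recommendation.

```go
package main

import "fmt"

// leafAdd is a hypothetical leaf function with no locals and no calls.
// The directive below asks the compiler to omit the stack-split check.
//
//go:nosplit
func leafAdd(a, b int32) int32 {
	return a + b
}

func main() {
	fmt.Println(leafAdd(10, 32)) // prints "42"
}
```

For a function as small as this one the compiler would most likely reach the same decision on its own, as it did for add earlier.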

Let's take a look at our main function, but this time without omitting the preamble with the split stack:

0x0000 TEXT "".main(SB), $24-0
  ;; stack-split prologue
  0x0000 MOVQ (TLS), CX
  0x0009 CMPQ SP, 16(CX)
  0x000d JLS 58

  0x000f SUBQ $24, SP
  0x0013 MOVQ BP, 16(SP)
  0x0018 LEAQ 16(SP), BP
  ;; ...omitted FUNCDATA stuff...
  0x001d MOVQ $137438953482, AX
  0x0027 MOVQ AX, (SP)
  ;; ...omitted PCDATA stuff...
  0x002b CALL "".add(SB)
  0x0030 MOVQ 16(SP), BP
  0x0035 ADDQ $24, SP
  0x0039 RET

  ;; stack-split epilogue
  0x003a NOP
  ;; ...omitted PCDATA stuff...
  0x003a CALL runtime.morestack_noctxt(SB)
  0x003f JMP 0

As you can see, the stack-split preamble is divided into a prologue and an epilogue:

The prologue checks whether the goroutine is running out of stack space and, if so, jumps to the epilogue.
The epilogue triggers the stack-growth machinery and then jumps back to the prologue.

This creates a feedback loop that keeps going until a large enough stack has been allocated for the starved goroutine.

Prologue

0x0000 MOVQ (TLS), CX   ;; store current *g in CX
0x0009 CMPQ SP, 16(CX)  ;; compare SP and g.stackguard0
0x000d JLS 58           ;; jumps to 0x3a if SP <= g.stackguard0

TLS is a virtual register maintained by the runtime that holds a pointer to the current g, i.e. the data structure that keeps track of all the state of a goroutine.

Let's look at the definition of g in the runtime source code:

type g struct {
	stack stack // 16 bytes
	// stackguard0 is the stack pointer compared in the Go stack growth prologue.
	// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
	stackguard0 uintptr
	stackguard1 uintptr

	// ...omitted dozens of fields...
}

16(CX) refers to g.stackguard0, a threshold maintained by the runtime. Comparing it against the stack pointer tells whether the goroutine is about to run out of stack space. The prologue therefore checks whether the current SP value is less than or equal to stackguard0 (since the stack grows downward, a smaller SP means more of the stack is in use) and, if so, jumps to the epilogue.
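Both the 16-byte offset and the comparison itself can be sketched in Go. The stack and g types below are simplified copies holding only the fields quoted above, not the runtime's real definitions, and needsMoreStack is a hypothetical name for what the CMPQ/JLS pair decides.

```go
package main

import (
	"fmt"
	"unsafe"
)

// wordSize is the size of a pointer-sized word on the current platform.
const wordSize = unsafe.Sizeof(uintptr(0))

// stack and g are simplified stand-ins for the runtime types; only the
// leading fields quoted in the text are reproduced.
type stack struct {
	lo uintptr
	hi uintptr
}

type g struct {
	stack       stack
	stackguard0 uintptr
	stackguard1 uintptr
}

// stackguard0Offset returns where stackguard0 sits inside g; on a 64-bit
// platform this is the 16 in 16(CX).
func stackguard0Offset() uintptr {
	var gp g
	return unsafe.Offsetof(gp.stackguard0)
}

// needsMoreStack models CMPQ SP, 16(CX) followed by JLS: the jump to the
// epilogue is taken when SP is lower than or equal to the guard.
func needsMoreStack(sp uintptr, gp *g) bool {
	return sp <= gp.stackguard0
}

func main() {
	gp := &g{stackguard0: 0x1000}
	fmt.Println("offset of stackguard0:", stackguard0Offset())
	fmt.Println(needsMoreStack(0x2000, gp), needsMoreStack(0x1000, gp))
}
```

The offset equals two machine words because the embedded stack value holds two uintptr fields, which is why the comment in the struct says "16 bytes" on amd64.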

Epilogue

0x003a NOP
0x003a CALL runtime.morestack_noctxt(SB)
0x003f JMP 0

The body of the epilogue is straightforward: it calls into the runtime, which does the actual work of growing the stack, and then jumps back to the first instruction of the function (i.e. to the prologue).

The NOP instruction just before the CALL exists so that the prologue does not jump directly onto a CALL instruction. On some platforms, doing so can lead to very bad things; it is therefore common practice to place a no-op instruction right before the actual call and to land on that NOP instead (see also the discussion in issue #4: Clarify "nop before call" paragraph).

Minus some subtleties

We considered only the tip of the iceberg. The internal mechanics of increasing the stack have much more nuances: the process is rather complicated and requires a separate article for detailed consideration.

Conclusion

As we dig deeper into Go internals in the following articles, the Go assembler will be one of our most important tools for understanding the underlying mechanics and the connections between things that are not obvious at first glance.

Links

Source: https://habr.com/ru/post/358088/
