Superscalar stack processor: we continue crossing a grass snake with a hedgehog

This continues the previous article, which demonstrated that a stack-machine front end can fully hide a superscalar out-of-order (OoO) processor behind it.
The topic of this article is function invocation.

Stack(s) and function calls

From general considerations, the architecture being described has trouble with function calls. After returning from a function, we expect the registers to be back in the state they had in the context of the current function. Modern register architectures handle this by dividing registers into two categories: the caller is responsible for preserving some, the callee for the others.

But our architecture has a stack front end, so the compiler may not be aware that any registers exist at all. The processor itself has to take care of saving and restoring the context, which looks like a non-trivial task.

But first, a lyrical digression on the stack itself.
The very concept of a stack is ambiguous.
The IBM/360, for example, has no hardware stack. Functions can still be called (including recursively): the parameters are stored in a memory area that must be requested from the OS before the call.

x86 has a hardware stack, but nobody classifies it as a stack architecture. There the stack is an excellent mechanism for storing local variables and function parameters.

AMD29K, SPARC and Itanium belong to the so-called Berkeley RISC family of architectures, where the stack performs another important function: the register pool is the top of the stack (register windows), which is meant to speed up parameter passing in function calls.
SPARC V7 appeared a couple of years earlier than AMD29K, but to the author it seems less architecturally elegant.
Itanium's RSE (Register Stack Engine) is broadly similar to the AMD29K mechanism, but appeared much later.

AMD29K deserves some kind words

It has two hardware stacks. Having several stacks in an architecture is nothing new: the Burroughs B5000 and the Soviet (and present-day) Elbrus already had them. But there the second stack is intended for the return addresses of procedures. Here both are used to store data:

memory stack - used to store large local variables (structures and arrays) as well as the tail of the parameter list if there are more than 16 parameters. Register gr125 (msp) points to the top of this stack.
register stack - 128 local registers that form the top of the stack:
- the register file gives fast access to the top of a stack held in memory (a different stack from the memory stack described above, of course)
- global registers gr126 (rab) and gr127 (rfb) mark the bounds of the part of the stack held in registers, and gr1 (rsp) holds the pointer to its top
- the register file can do two reads and one write per cycle
- there are no explicit stack operations such as push and pop; instead, when a function is called, it is allocated a number of registers determined by the compiler (the activation record, also known as the call frame)
- data in the activation record is accessed through registers, which are numbered from lr0 in each function
- lr0 and lr1 are reserved: the first holds the return address, the second refers to the activation record of the calling function
- the register windows of the calling and called functions overlap on the parameters, as in SPARC
- if there are not enough free registers to call a function, a SPILL trap occurs, and its handler writes some register values out to memory to free them
- conversely, on return, when the caller's registers are no longer in the register file, the FILL trap loads them back from memory
- for this to work, the compiler inserts the following instructions:
```
         sub   gr1, gr1, 16        ; function prologue: lr0 + lr1 + 2 local variables
         asgeu SPILL, gr1, rab     ; trap to SPILL unless gr1 >= rab (top of window, gr126)
         . . .                     ; function body
         jmpi  lr0                 ; return
         asleu FILL, lr1, rfb      ; delay slot: trap to FILL unless lr1 <= rfb (bottom of window, gr127)
```
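
The interplay of the two assertions can be pictured with a small simulation. This is only a sketch under simplifying assumptions: addresses count registers rather than bytes, the trap handlers are ordinary methods, and the class name `RegisterStack` and its fields are invented for illustration rather than taken from any AMD documentation.

```python
class RegisterStack:
    """Toy model of a 29K-style register stack; addresses count registers, not bytes."""

    def __init__(self, num_regs=128, top=1024):
        self.rfb = top              # models gr127: upper bound of the cached window
        self.rab = top - num_regs   # models gr126: lower bound of the cached window
        self.rsp = top              # models gr1: top of the register stack (grows down)
        self.regs = {}              # words currently held in the register file
        self.memory = {}            # words spilled to the in-memory part of the stack
        self.frame_ends = []        # end address of each active frame (the role of lr1)
        self.spilled = 0
        self.filled = 0

    def call(self, size):
        """Allocate a frame of `size` registers (size must not exceed num_regs)."""
        self.frame_ends.append(self.rsp)
        self.rsp -= size                      # sub gr1, gr1, size
        if self.rsp < self.rab:               # the asgeu assertion fails: SPILL
            self._spill(self.rab - self.rsp)
        for addr in range(self.rsp, self.rsp + size):
            self.regs[addr] = 0

    def ret(self):
        """Free the current frame and make sure the caller's frame is resident."""
        end = self.frame_ends.pop()
        for addr in range(self.rsp, end):
            del self.regs[addr]
        self.rsp = end
        if self.frame_ends and self.frame_ends[-1] > self.rfb:
            # the asleu assertion fails: FILL
            self._fill(self.frame_ends[-1] - self.rfb)

    def _spill(self, count):
        # Write the oldest `count` registers to memory and slide the window down.
        for addr in range(self.rfb - count, self.rfb):
            self.memory[addr] = self.regs.pop(addr)
        self.rfb -= count
        self.rab -= count
        self.spilled += count

    def _fill(self, count):
        # Read `count` registers back from memory and slide the window up.
        for addr in range(self.rfb, self.rfb + count):
            self.regs[addr] = self.memory.pop(addr)
        self.rfb += count
        self.rab += count
        self.filled += count

    def read(self, n):
        return self.regs[self.rsp + n]        # lrN of the current function

    def write(self, n, value):
        self.regs[self.rsp + n] = value
```

With a deliberately small register file, e.g. `RegisterStack(num_regs=16)`, ten nested calls of four registers each force the oldest frames out to memory, and the matching returns bring them back one frame at a time.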

What interesting ideas are worth noting here?

Each function has its own register numbering; this is a feature of Berkeley RISC.
But splitting the stack is a feature of this particular architecture. In SPARC, register windows are saved to the same stack that holds the ordinary (not fast) variables, so fill / spill operate with gaps: each window goes to its own frame.

This division of the stack into “big but slow” and “small but fast” is very important. Let us look at the motivation.

Parameter passing, function call

The idea of the stack as storage for local variables (and parameters) is beautiful in its logic and completeness. Its weak spot is that system performance is limited by memory latency and throughput. In the days of the PDP-11 nothing could be done about this, but the situation has changed since then.

First, register access has become significantly faster than memory access, which created the need for data caches. Second, it became possible to have far more registers.

Having many registers makes it tempting to use them to speed up argument passing in function calls. After all, there are usually few parameters (fewer than local data items), and their values are almost always needed by someone. Which of the local data deserves a place in registers can be left to the optimizer. This is, of course, a very crude simplification, intended only to show the general motivation.

Currently, there are two common methods for passing parameters through registers:

Assigning a special role to specific registers. For example, in MSVC (x86-64) the first four integer arguments are passed in the registers RCX, RDX, R8 and R9. This implies a single register numbering shared by all functions. Architectures using this technique also include MIPS, PPC, ARM, DEC Alpha ... Clearly, along a call chain there is still no way to preserve the parameters other than on the stack. Here all hope rests on the cache, or on the optimizer, which may decide that a particular parameter is no longer used in this function and need not be saved at all.
The register window technique. This branch of architecture grows out of the Berkeley RISC project. It includes the AMD29K already discussed, as well as i960, Itanium and SPARC. The point is that a limited number of parameters and local data of the called function live in a register window; when the next function is called, the window shifts, so this data forms a stack. Each function has its own register numbering. Whatever does not fit into the window goes to the ordinary stack, and global registers can also be used for temporary data. In i960 and SPARC the register stack is interleaved with the ordinary one, while in AMD29K and Itanium they are separate stacks. In effect, AMD29K and Itanium let the compiler choose which data it considers worthy of the “fast stack”, and everything else happens by itself. This is reminiscent of the now obsolete “register” keyword in C, except that here the decision is made by the compiler rather than in the high-level language.
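
The window shift in the second method can be sketched as a mapping from per-function register numbers to a ring of physical registers. The sketch below assumes a SPARC-like layout of 8 input, 8 local and 8 output registers; the class name `WindowedFile` is invented for illustration, the direction of the shift is arbitrary, and window overflow handling is deliberately left out.

```python
class WindowedFile:
    """Toy SPARC-like register windows: the caller's outs are the callee's ins."""

    INS = LOCALS = OUTS = 8
    SHIFT = INS + LOCALS            # the window moves by 16 registers per call

    def __init__(self, windows=8):
        self.phys = [0] * (windows * self.SHIFT)   # ring of physical registers
        self.base = 0                               # start of the current window

    def _index(self, n):
        # n is the per-function number: 0..7 ins, 8..15 locals, 16..23 outs
        return (self.base + n) % len(self.phys)

    def get(self, n):
        return self.phys[self._index(n)]

    def set(self, n, value):
        self.phys[self._index(n)] = value

    def call(self):
        self.base = (self.base + self.SHIFT) % len(self.phys)

    def ret(self):
        self.base = (self.base - self.SHIFT) % len(self.phys)
```

A caller writes an argument to its output register 16, and after `call()` the callee reads the same physical register as its input register 0; no data is copied. A return value travels back the same way.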

In terms of potential performance, the two approaches are roughly equivalent. With the first, the entire burden of optimization falls on the compiler rather than the processor, which (probably) simplifies and cheapens the development of the final system.

But we have gotten a little carried away; it is time to return to preserving the context of the current function in the architecture being designed.

Saving function context

What does this context include? The registers in use at the moment the child procedure is called.
The registers of not-yet-computed expressions are linked to each other through the topological sort order, but this does not matter to the called function. Nor does it matter in what order the micro-ops captured their output registers.

One nuance is worth noting: before the function call starts, all the micro-ops that compute its arguments must have completed. So, from the processor's point of view, a function call is a generalized instruction with an arbitrary number of arguments.
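
This ordering constraint can be pictured as a dependency graph in which the call is just another node. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the micro-op strings are hypothetical and stand for an expression such as `f(a + b, c * d) + e`.

```python
from graphlib import TopologicalSorter

# Hypothetical micro-ops: each key lists the micro-ops whose results it needs.
deps = {
    "t1 = a + b": [],
    "t2 = c * d": [],
    "call f(t1, t2)": ["t1 = a + b", "t2 = c * d"],
    "t3 = result + e": ["call f(t1, t2)"],
}

# Any valid execution order places the call after both argument micro-ops
# and before the micro-op that consumes its result.
order = list(TopologicalSorter(deps).static_order())
```

The two argument micro-ops may appear in either order, which matches the remark above that the order in which they captured their registers does not matter to the callee.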

Now we need to decide on the register numbering.
Suppose the numbering is shared, i.e. we take the MIPS path rather than the SPARC one.

If the call chain is long enough, all registers will obviously be occupied, and the question becomes which of them to write out to memory (SPILL). That is, the capture order does matter after all:
- the capture order depends on the call history
- inside a function it is determined dynamically
- there is no guarantee that a function returning its result in some register will not conflict with that register when FILL runs on the way back
- the author sees no way to avoid such conflicts, which of course does not mean that no way exists

Let's try register windows.

- the register numbering starts afresh in each function, which is simply wonderful: it saves us from having to track the call history
- the technique is the usual one: a ring buffer of registers plus FILL and SPILL
- we will use two physical stacks: one for register windows and one for everything else
- we need to remember which registers are in use when a child function is called, and we do have this information. Suppose registers r0, r5 and r11 are occupied. Only a quarter of the range is actually in use, so it is tempting to “pack” them somehow. But then they would have to be “unpacked” again on return from the child function. So the register pool of the current function (in this case) stays 12 registers in size (plus service information: return address and previous frame). Besides, the number of registers in itself is not that critical; the number of simultaneous read / write operations is much more expensive, and it does not change
- something can perhaps be done about saving registers to memory, though: we can try not to write obviously unneeded data from unused registers
- to do that, after writing out the occupied registers of a function, we also save their occupancy mask
- this in turn means FILL and SPILL must work not on the number of registers needed but on whole frames: everything belonging to one call at a time
- then the current start of the register ring buffer always holds a frame descriptor (the next candidate for SPILL)
- and the first word we retrieve with FILL contains the mask (or part of it) of the used registers, which tells us how many registers to read back from memory
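
The frame-wise SPILL and FILL with an occupancy mask can be sketched as follows. This is an illustration only, with several assumptions: memory is a Python list used as a stack, `None` marks an unused register, the service words (return address, previous frame) are omitted, and the frame size is passed in explicitly where a real design would presumably read it from the frame descriptor. The function names are invented.

```python
def spill_frame(memory, regs):
    """Push one frame onto `memory`; only occupied registers are written."""
    mask = 0
    for i, value in enumerate(regs):
        if value is not None:
            memory.append(value)
            mask |= 1 << i          # bit i set means register i was occupied
    memory.append(mask)             # written last, so FILL reads it first
    return mask


def fill_frame(memory, size):
    """Pop one frame from `memory` and rebuild a register pool of `size` slots."""
    mask = memory.pop()
    regs = [None] * size
    for i in reversed(range(size)):  # values come back in reverse push order
        if mask & (1 << i):
            regs[i] = memory.pop()
    return regs
```

For the example above, with r0, r5 and r11 occupied in a pool of 12 registers, the mask is 0x821 and the frame takes four words in memory instead of twelve; the registers return to their original numbers on FILL, so nothing needs to be repacked inside the function.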

However, by focusing on the external side of the function call, we have lost sight of how all this would be executed internally. The author does not consider himself a hardware specialist, but he still foresees problems.
Fortunately, we will deal with them in the next article.

Source: https://habr.com/ru/post/279123/
