
Superscalar stack processor: we continue to cross the grass snake with the hedgehog



This continues the previous article, which showed that the front end of a stack machine can fully hide a superscalar out-of-order (OoO) processor behind it.

The topic of this article is function invocation.



Stack(s) and function calls



On general grounds, the architecture described here has trouble with function calls. After returning from a function, we expect the registers to be back in the state they had in the context of the current function. Modern register architectures handle this by dividing registers into two categories: the caller is responsible for preserving some, the callee for the others.
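The caller/callee split can be illustrated with a toy model. This is only a sketch: the register names `r0`..`r3` and the way they are divided into two sets are assumptions made for the example, not the convention of any real ABI.

```python
# Toy model of one register file shared by caller and callee.
# Register names and the two sets below are illustrative assumptions.
CALLER_SAVED = ("r0", "r1")   # the caller preserves these if it still needs them
CALLEE_SAVED = ("r2", "r3")   # the callee must restore these before returning

regs = {"r0": 0, "r1": 0, "r2": 0, "r3": 0}
stack = []

def callee():
    # Prologue: save the callee-saved registers that are about to be clobbered.
    stack.append({name: regs[name] for name in CALLEE_SAVED})
    # Body: overwrite every register.
    for name in regs:
        regs[name] = -1
    # Epilogue: restore callee-saved registers; caller-saved ones stay clobbered.
    regs.update(stack.pop())

def caller():
    regs.update({"r0": 10, "r1": 11, "r2": 12, "r3": 13})
    # The caller still needs r0 after the call, so it saves r0 itself.
    stack.append(regs["r0"])
    callee()
    regs["r0"] = stack.pop()
    return dict(regs)

state_after_call = caller()
```

In this sketch `r1` is lost after the call because nobody saved it, while `r0`, `r2` and `r3` survive. In both cases the saving is done by explicit code that a compiler would have to emit.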



But in our architecture the front end is a stack machine, so the compiler may not even know that registers exist. The processor itself has to save and restore the context, which looks like a non-trivial task.



But first, a lyrical digression on the stack itself.

The very concept of a stack is misleading: the word covers several different mechanisms.


The IBM/360, for example, has no hardware stack. Functions can still be called (recursively too): the parameters are stored in a memory area that must be requested from the OS before the call.



x86 has a hardware stack, yet no one classifies it as a stack architecture. There the stack is an excellent mechanism for storing local variables and function parameters.



AMD29K, SPARC and Itanium follow the approach of the so-called Berkeley RISC family of architectures, where the stack performs another important function: the register pool is the top of the stack (register windows), which is meant to speed up parameter passing on function calls.

SPARC V7 appeared a couple of years before AMD29K, but to the author it seems architecturally less elegant.

Itanium's Register Stack Engine (RSE) is broadly similar to the AMD29K mechanism, but appeared much later.



AMD29K deserves some kind words



It has two hardware stacks. Having several stacks in an architecture is not new: the Burroughs B5000 and the Soviet (and present-day) Elbrus already did this. But there the second stack holds procedure return addresses. Here both stacks are used to store data.

Which ideas here are worth noting?

  1. Each function has its own register numbering; this is a feature of Berkeley RISC.
  2. Splitting the stack, however, is specific to this architecture. In SPARC, register windows are saved to the same stack that holds the ordinary (not fast) variables, and fill/spill happens with gaps: each window goes to its own frame.
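To make the fill/spill idea concrete, here is a minimal sketch of a small register-resident stack kept apart from the ordinary memory stack. It is loosely inspired by the description above and is not a model of the real AMD29K mechanism; the class `RegisterStack`, its capacity and its counters are assumptions made for the example.

```python
class RegisterStack:
    """Toy fast stack: only the top `capacity` slots live in registers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_regs = []      # topmost slots, held in the register file
        self.spilled = []      # older slots, spilled to this stack's own memory
        self.frame_sizes = []  # size of every active frame, oldest first
        self.spills = 0
        self.fills = 0

    def call(self, values):
        """Allocate the callee's frame; spill the oldest slots on overflow."""
        assert len(values) <= self.capacity
        self.frame_sizes.append(len(values))
        self.in_regs.extend(values)
        while len(self.in_regs) > self.capacity:
            self.spilled.append(self.in_regs.pop(0))
            self.spills += 1

    def ret(self):
        """Drop the callee's frame and make the caller's frame resident again."""
        size = self.frame_sizes.pop()
        del self.in_regs[len(self.in_regs) - size:]
        need = self.frame_sizes[-1] if self.frame_sizes else 0
        while len(self.in_regs) < need:
            self.in_regs.insert(0, self.spilled.pop())
            self.fills += 1

    def current_frame(self):
        size = self.frame_sizes[-1] if self.frame_sizes else 0
        return self.in_regs[len(self.in_regs) - size:]


memory_stack = []            # the ordinary stack, for data that is not "fast"
fast = RegisterStack(capacity=4)

fast.call([1, 2, 3])         # outer function: fits entirely in registers
memory_stack.append("large local array of the outer function")
fast.call([4, 5, 6])         # inner function: two oldest slots get spilled
inner_frame = fast.current_frame()
fast.ret()                   # back in the outer function: slots are filled back
outer_frame = fast.current_frame()
```

In this sketch the spilled slots form one contiguous run in their own memory area, because the bulky data lives on a different stack.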


This division of the stack into “big but slow” and “small but fast” is very important. Let us look at the motivation.



Parameter passing, function call



The idea of a stack as a store for local variables (and parameters) is beautiful in its logic and completeness. Its weak spot is that system performance is limited by memory latency and bandwidth. In the days of the PDP-11 nothing could be done about that, but the situation has changed since.



First, register access has become much faster than memory access, which created the need for data caching. Second, it became possible to have far more registers.



Having many registers creates a temptation to use them to speed up argument passing on function calls. Parameters are usually few (fewer than local data), and their values are almost always needed by someone. Which local data deserves a register is left to the optimizer. This is, of course, a very crude simplification, meant only to show the general motivation.



Currently, there are two common methods for passing parameters through registers:

  1. Assigning special roles to specific registers. For example, MSVC on x86-64 passes the first four integer arguments in RCX, RDX, R8 and R9. This implies a single register numbering shared by all functions. MIPS, PPC, ARM, DEC Alpha and others also use this technique. Clearly, along a call chain there is still no way to preserve parameters other than the stack. Here all hope rests on the cache, or on the optimizer, which may decide that a given parameter is no longer used in the function and need not be saved at all.
  2. The register-window technique. This branch of architectures grows out of the Berkeley RISC project. It includes the AMD29K discussed above, as well as i960, Itanium and SPARC. The essence is that a limited number of parameters and local data of the called function live in a register window; when the next function is called, the window shifts, so this data forms a stack. Each function has its own register numbering. Whatever does not fit into the window goes to the ordinary stack; global registers can also hold temporary data. In i960 and SPARC the register stack is interleaved with the regular one, while in AMD29K and Itanium they are separate stacks. In effect, AMD29K and Itanium let the compiler choose which data deserves a place on the “fast stack”; everything else happens automatically. This resembles the now obsolete “register” keyword in C, except that here the choice is made by the compiler (a high-level language compiler, admittedly).
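The window shift in the second method can be sketched as follows. This is a simplified illustration, not a model of any particular processor: the class `WindowedRegisterFile`, the window sizes and the in/local/out naming are assumptions made for the example, and window overflow (spill/fill) is deliberately left out.

```python
class WindowedRegisterFile:
    """Toy register file with overlapping windows (sizes are illustrative)."""

    N_IN = N_LOCAL = N_OUT = 2
    STEP = N_IN + N_LOCAL          # how far the window moves on a call

    def __init__(self, n_windows=4):
        # No overflow handling: at most n_windows nested calls are supported.
        self.phys = [0] * (self.STEP * n_windows + self.N_OUT)
        self.base = 0              # physical index where the current window starts

    def _index(self, kind, i):
        offset = {"in": 0,
                  "local": self.N_IN,
                  "out": self.N_IN + self.N_LOCAL}[kind]
        return self.base + offset + i

    def read(self, kind, i):
        return self.phys[self._index(kind, i)]

    def write(self, kind, i, value):
        self.phys[self._index(kind, i)] = value

    def call(self):
        # The caller's "out" registers become the callee's "in" registers.
        self.base += self.STEP

    def ret(self):
        self.base -= self.STEP


rf = WindowedRegisterFile()
rf.write("local", 0, 99)           # caller's private data
rf.write("out", 0, 20)             # first argument
rf.write("out", 1, 22)             # second argument

rf.call()                          # callee: same register names, new window
total = rf.read("in", 0) + rf.read("in", 1)
rf.write("local", 0, total)        # does not touch the caller's local 0
rf.write("in", 0, total)           # return value goes back through the overlap
rf.ret()

result = rf.read("out", 0)
caller_local = rf.read("local", 0)
```

No data is copied on the call: the arguments are passed simply by renumbering, and the caller's locals survive because the callee sees a different part of the physical register file.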


In terms of potential performance, the two approaches are roughly equivalent. In the first approach the entire burden of optimization falls on the compiler rather than on the processor, which (probably) simplifies the final system and makes it cheaper to develop.



But we have digressed; it is time to return to preserving the context of the current function in the architecture being designed.



Saving function context



And what does this context consist of? The registers in use at the moment the child procedure is called.

The registers of not-yet-evaluated expressions are related to each other through topological sorting, but this does not matter to the called function. Nor does it matter in which order the micro-ops captured their output registers.



One nuance is worth noting: before the function call starts, all the micro-ops that compute its arguments must have completed. So from the processor's point of view, a function call is a generalized instruction with an arbitrary number of arguments.
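A call treated as a generalized instruction can be sketched with a tiny dataflow scheduler. This is an illustration under strong assumptions: unlimited issue width, made-up latencies, and a hypothetical `MicroOp` record; it says nothing about how a real back end would do it.

```python
from dataclasses import dataclass, field

@dataclass
class MicroOp:
    name: str
    deps: list = field(default_factory=list)  # names of micro-ops it waits for
    latency: int = 1                          # cycles from start to result

def schedule(ops):
    """Return {name: finish cycle}, starting each micro-op as early as possible."""
    done = {}
    pending = list(ops)
    while pending:
        progressed = False
        for op in list(pending):
            if all(d in done for d in op.deps):
                # A micro-op starts when its slowest input is ready.
                start = max((done[d] for d in op.deps), default=0)
                done[op.name] = start + op.latency
                pending.remove(op)
                progressed = True
        if not progressed:
            raise ValueError("cyclic dependency between micro-ops")
    return done

# f(a + b, c * d): the call depends on every micro-op that produces an argument.
ops = [
    MicroOp("a+b"),
    MicroOp("c*d", latency=3),
    MicroOp("call f", deps=["a+b", "c*d"]),
]
finish = schedule(ops)
call_start = finish["call f"] - 1   # the call itself has latency 1 here
```

In this sketch the call cannot start before cycle 3, when the slower argument `c*d` is ready, even though `a+b` finished much earlier. The call behaves like any other instruction, just with as many inputs as the function has arguments.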



Now we need to decide how the registers are numbered.

Let the numbering be common, i.e. we take the MIPS path rather than the SPARC one.



Let's try register windows.



However, by focusing on the outside of a function call, we lost sight of how it would all be executed inside. The author does not consider himself a hardware specialist, but he still foresees problems.

Fortunately, we will deal with them in the next article.

Source: https://habr.com/ru/post/279123/


