
Superscalar Stack Processor: Crossing a Grass Snake with a Hedgehog


In this article, we will develop a (software) model of a superscalar processor with out-of-order execution (OoO) and a stack machine front end.

This is, in fact, a continuation of parts 1 & 2.

Introduction


Why a stack machine?
Philip Koopman writes: "From a theoretical point of view, the stack is important in itself, since it is the primary and most natural mechanism for executing well-structured code... stack machines are much more efficient than register processors when running code with high modularity. In addition, they are very simple and demonstrate excellent performance with rather modest hardware requirements."

The advantages of stack machines include compact code, easy compilation and interpretation, and a compact processor state.

Successful examples of computers with a stack architecture include the Burroughs B5000 (1961...1986) and the HP 3000 (1974...2006). And let us not forget the Soviet Elbrus.
So why have stack machines now been pushed to the margins of progress by register processors? The reasons seem to be as follows:
  1. There are two "extreme" implementations of stack processors: a fixed hardware stack, and a stack located in memory. Both extremes have significant flaws. With a fixed-size stack, the compiler and/or operating system has to think about what happens when the stack overflows or underflows and how to handle the corresponding interrupt. Without interrupt handling, we are dealing with a microcontroller doomed to be programmed by hand. With such an interrupt handler, the hardware stack plays the role of a "window" into fast memory, and shifting this window is a rather expensive procedure. As a result, we get unpredictable program behavior, which rules such a processor out of, for example, real-time systems, even though, in Koopman's view, that is precisely the ecological niche of such processors.
  2. With a stack located entirely in memory, we pay a memory access for almost every instruction executed. Overall system performance is therefore bounded above by memory bandwidth.
  3. The complexity of the processor has long ceased to be a deterrent to development.
  4. A hardware stack implies a strict ordering of events and, consequently, the impossibility of exploiting the program's implicit parallelism. Superscalar processors distribute instructions across pipelines themselves according to available resources, and VLIW processors expect the compiler to have done this work for them; for stack machines, however, an attempt to find hidden parallelism in the code runs into almost insurmountable difficulties. In other words, we are once again in a situation where the technology can deliver more than the architecture can absorb. Which is odd.
  5. Programming for register machines is more "technological". Even with only a few general-purpose registers, one can easily write code by hand that accesses memory only when it genuinely cannot be avoided; with a stack machine this is much harder to achieve. And although the optimal placement of temporary variables in registers is an NP-complete problem, there are several cheap but effective heuristics that produce quite decent code in reasonable time.

There were, of course, attempts to hide a register machine inside a stack wrapper, for example Ignite or the ICL 2900, but they had no commercial success: essentially, for the sake of compilation convenience alone, users had to pay fixed costs at runtime.

So why does the author raise this topic again? What makes him think there are prospects in this area, which problems is he trying to solve, and by what means?

Motives


Nobody disputes the fact that code for stack machines is much more compact than for register architectures. One can argue about how much this compactness matters, but it is a fact. How is this code compression achieved? After all, the code does functionally the same thing; what is the trick?

Consider a simple example: evaluating the expression x + y * z + u.
The parse tree built by the compiler looks like this:

The assembly code (x86) for this expression looks like this:
    mov  eax, dword ptr [y]
    imul eax, dword ptr [z]
    mov  ecx, dword ptr [x]
    add  ecx, eax
    mov  eax, ecx
    add  eax, dword ptr [u]

For a hypothetical stack machine, the pseudocode is:

    push x
    push y
    push z
    multiply
    add
    push u
    add
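Such a machine is also trivial to emulate in software. Here is a minimal C sketch of an evaluator for exactly this instruction sequence (the encoding is an illustrative assumption, and values are pushed as immediates rather than loaded from memory):

    /* A minimal evaluator for the stack pseudocode above (an illustrative
       sketch: the instruction encoding is assumed, and values are pushed
       as immediates instead of being loaded from memory). */
    #include <stdio.h>

    enum op { PUSH, MUL, ADD };

    struct insn { enum op op; long imm; };   /* imm is used only by PUSH */

    static long run(const struct insn *code, int n) {
        long stack[64];
        int sp = 0;                          /* number of occupied slots */
        for (int i = 0; i < n; i++) {
            switch (code[i].op) {
            case PUSH: stack[sp++] = code[i].imm; break;
            case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            }
        }
        return stack[0];                     /* the result is left on top */
    }

    int main(void) {
        long x = 2, y = 3, z = 4, u = 5;     /* x + y*z + u = 19 */
        struct insn code[] = {
            {PUSH, x}, {PUSH, y}, {PUSH, z}, {MUL, 0},
            {ADD, 0},  {PUSH, u}, {ADD, 0},
        };
        printf("%ld\n", run(code, 7));       /* prints 19 */
        return 0;
    }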

The difference is clear and obvious: besides the 4 memory accesses, the addition, and the multiplication, each instruction of the register processor also carries register identifiers.

But these registers do not really exist; they are a fiction that the superscalar processor uses to denote the connections between instructions.
The set of registers and the operations on them form the processor architecture, and (at least for x86) it took shape at a time when superscalar processors did not exist and register names referred to quite physical registers.

As we know, a great deal has changed since then, and register identifiers are now used for other purposes. The architecture today is rather an interface between the compiler and the processor. The compiler does not know how the processor actually works and can produce code for a whole range of compatible devices. And processor makers can focus on the useful and important things, producing externally compatible products.

What is the price of such unification?
  1. From the point of view of the superscalar processor. Here, for example, is an article from Intel, verbatim:
    The superscalar parallelizes sequential code. But this parallelization process is too laborious even for today's processing power, and it is precisely this that limits the machine's performance. The conversion cannot be done faster than a certain number of instructions per clock. More could be done, but then the clock frequency would drop, an obviously pointless approach.
    Here is another favorite:
    The trouble is not in the superscalar itself but in the representation of programs. Programs are represented sequentially, and they have to be converted into parallel execution at runtime. The superscalar's main problem is the unsuitability of the input code to its needs. There is a parallel kernel algorithm and parallel-organized hardware; however, between them, in the middle, sits a certain bureaucratic arrangement: a sequential instruction set.
  2. From the point of view of the compiler. From the same source:
    The compiler converts a program into a sequential instruction stream, and the overall speed of the process depends on the order in which it does so. But the compiler does not know exactly how the machine works. Therefore, generally speaking, the compiler's work today is shamanism.
    The compiler makes titanic efforts to squeeze a program into the specified number of registers, and this number is fictitious. The compiler tries to push all the parallelism it has found in the program through the limitations of the architecture, like a rope through the eye of a needle. To a large extent, unsuccessfully.

What to do?


The problem has been identified: the interface/architecture no longer matches modern requirements.
What are these (new) requirements?
  1. All the parallelism known to the compiler must be conveyed without loss.
  2. The cost of unpacking the dependencies between instructions should be minimal.

Again, consider the example of compiling the expression x + y * z + u.
x86:

    mov  eax, dword ptr [y]
    imul eax, dword ptr [z]
    mov  ecx, dword ptr [x]
    add  ecx, eax
    mov  eax, ecx
    add  eax, dword ptr [u]

Stack machine:

    push x
    push y
    push z
    multiply
    add
    push u
    add

The x86 registers serve to indicate the relationships between instructions. And how are these dependencies expressed in a stack machine? Implicitly, through position in the stack: each instruction knows its number of arguments and expects to find them at the top of the stack before execution.

But what if the stack at whose top we are looking were not a stack of data but a stack of operations (micro-operations, mops)?

What kinds of mops are there?
  1. Register operations, for example addition: the values of two registers are summed and placed in a third. These mops are binary: before such a mop is placed on the stack, the references to the two topmost mops are removed and a reference to the new one is pushed. There are also unary ones (sign change, for example): when such an operation is created, its reference replaces the reference at the top of the stack. (A sketch of such a decoder follows this list.)
  2. Memory operations: loading a value from memory into a register and vice versa. A load from memory has no predecessors; it simply pushes a reference onto the mop stack. A store to memory removes one item from the stack.
  3. A special operation, pop: simply an instruction to the decoder to remove the element at the top of the stack.
  4. Others (branches, function calls, service operations) we will not consider for now.
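Below is a minimal sketch of such a decoder (an illustration, not the author's actual model; all names, the flat mop buffer, and the fixed-size stack are assumptions):

    /* A minimal illustrative sketch (not the author's actual model):
       the decoder keeps a stack of *references to mops* instead of a
       stack of values. Each mop remembers which earlier mops produce
       its operands; physical registers can be assigned later. */
    enum mop_kind { MOP_LOAD, MOP_STORE, MOP_BIN, MOP_UN };

    struct mop {
        enum mop_kind kind;
        int src0, src1;      /* indices of producer mops, -1 if none */
        long addr;           /* memory address for LOAD/STORE        */
    };

    static struct mop mops[256];       /* decoded-mop buffer   */
    static int nmops;
    static int stk[64];                /* stack of mop indices */
    static int sp;

    static int emit(enum mop_kind k, int s0, int s1, long addr) {
        mops[nmops] = (struct mop){ k, s0, s1, addr };
        return nmops++;
    }

    /* load: no predecessors, just pushes a reference */
    static void decode_load(long addr)  { stk[sp++] = emit(MOP_LOAD, -1, -1, addr); }

    /* store: removes one reference from the stack */
    static void decode_store(long addr) { emit(MOP_STORE, stk[--sp], -1, addr); }

    /* binary op: the two topmost references are replaced by a new one */
    static void decode_bin(void) {
        int b = stk[--sp], a = stk[--sp];
        stk[sp++] = emit(MOP_BIN, a, b, 0);
    }

    /* unary op: the reference on top is replaced */
    static void decode_un(void)  { stk[sp - 1] = emit(MOP_UN, stk[sp - 1], -1, 0); }

    /* pop: just an instruction to the decoder to drop the top reference */
    static void decode_pop(void) { --sp; }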

For memory operations there is a special kind of link between mops: through the address. For example, we store a value to memory and immediately read it back into a register. Formally there is no register link between these two operations, but in fact there is one: it would be wise to wait for the write to finish before reading. Therefore, for mops in the decoder cache, such dependencies must be tracked. Outside the cache, the problem is solved by serialization through the memory access bus.
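Continuing the sketch above, the decoder can check pending stores by address before pushing a load (again an assumption; the linear scan stands in for whatever associative lookup a real mop cache would use):

    /* Before pushing a load, scan pending mops for a store to the same
       address; if one is found, record it as a predecessor of the load
       so the read waits for the write (store-to-load dependency). */
    static void decode_load_checked(long addr) {
        int dep = -1;
        for (int i = nmops - 1; i >= 0; i--)
            if (mops[i].kind == MOP_STORE && mops[i].addr == addr) {
                dep = i;
                break;
            }
        stk[sp++] = emit(MOP_LOAD, dep, -1, addr);
    }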

Software model


Speculative constructions work fine right up to the moment one attempts to implement them. So from here on we will theorize in parallel with checking against the model.

The author has implemented a compiler for a simple subset of C, sufficient to run the Ackermann function. The compiler produces code for the stack machine and immediately executes it on the fly.

Let's not rush things and practice on something simple, for example:

    long d1, d2, d3, d4, d5, d6, d7, d8;
    d1 = 1; d2 = 2; d3 = 3; d4 = 4;
    d5 = 5; d6 = 6; d7 = 7; d8 = 8;
    print_long(((d1 + d2) + (d3 + d4)) + ((d5 + d6) + (d7 + d8)));
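In the stack pseudocode used earlier, the print_long argument compiles into something like the following (an assumption about the compiler's output rather than its actual listing; the constant assignments are omitted, and in the traces below each push becomes an LLOAD mop and each add an LADD mop):

    push d1
    push d2
    add
    push d3
    push d4
    add
    add
    push d5
    push d6
    add
    push d7
    push d8
    add
    add
    add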

But first we must fix the parameters of the processor model:
  1. 32 integer registers;
  2. 4 instructions are decoded per clock;
  3. one pipeline for working with memory; loading or storing a word to/from a register takes 3 cycles;
  4. 2 ALU pipelines; addition/subtraction takes 3 cycles, comparison takes 1 cycle.
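In the model, these knobs might be gathered into a plain configuration structure (a sketch; the field names are assumptions, the values are taken from the list above):

    /* Model parameters used in the experiments below (a sketch; the
       field names are assumptions, the values come from the list). */
    struct cpu_params {
        int n_regs;          /* integer registers               */
        int decode_width;    /* instructions decoded per clock  */
        int n_mem_pipes;     /* memory pipelines                */
        int mem_latency;     /* cycles for a word load/store    */
        int n_alu_pipes;     /* ALU pipelines                   */
        int add_latency;     /* cycles for add/subtract         */
        int cmp_latency;     /* cycles for compare              */
    };

    static const struct cpu_params base = { 32, 4, 1, 3, 2, 3, 1 };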

Here and below, instructions are shown at the moment their execution completes (when the result registers are already defined), in the pipeline where they were executed:
    cycle  memory pipeline   ALU pipeline #1       ALU pipeline #2
      3    LLOAD r3
      4    LLOAD r2
      5    LLOAD r1
      6    LLOAD r0
      7    LLOAD r7
      8    LLOAD r6
      9    LLOAD r5
     10    LLOAD r4
     11    LSTORE r3
     12    LSTORE r2
     13    LSTORE r1
     14    LSTORE r0
     15    LSTORE r7
     16    LSTORE r6
     17    LSTORE r5
     18    LSTORE r4
     19    LLOAD r11
     20    LLOAD r10
     21    LLOAD r8
     22    LLOAD r15
     23    LLOAD r12         LADD r11 r10 -> r9
     24    LLOAD r19
     25    LLOAD r17         LADD r8 r15 -> r14
     26    LLOAD r16
     27                      LADD r12 r19 -> r18
     28                      LADD r9 r14 -> r13
     29                      LADD r17 r16 -> r23
     32                      LADD r18 r23 -> r22
     35                      LADD r13 r22 -> r21
And in register r21, at cycle 35, we get the coveted value of 36.
The same data, but in the form of a mop dependency diagram; in square brackets is the cycle in which the mop completes (the x and y positions carry no meaning here, only the direction of the arrows matters).

What do we see?

Let's try the same thing, but without explicit initialization of the variables:

    long d1, d2, d3, d4, d5, d6, d7, d8;
    print_long(((d1 + d2) + (d3 + d4)) + ((d5 + d6) + (d7 + d8)));
    cycle  memory pipeline   ALU pipeline #1       ALU pipeline #2
      3    LLOAD r3
      4    LLOAD r2
      5    LLOAD r0
      6    LLOAD r7
      7    LLOAD r4          LADD r3 r2 -> r1
      8    LLOAD r11
      9    LLOAD r9          LADD r0 r7 -> r6
     10    LLOAD r8
     11                      LADD r4 r11 -> r10
     12                      LADD r1 r6 -> r5
     13                      LADD r9 r8 -> r15
     16                      LADD r10 r15 -> r14
     19                      LADD r5 r14 -> r13
The calculation took 8 + 2 + 3*3 = 19 cycles (eight loads streaming down one pipeline plus two cycles of load latency, then three levels of dependent 3-cycle additions), the minimum possible; once again a single ALU pipeline was enough.

Now let's remove all the brackets and simply sum the variables:

    long d1, d2, d3, d4, d5, d6, d7, d8;
    print_long(d1 + d2 + d3 + d4 + d5 + d6 + d7 + d8);
    cycle  memory pipeline   ALU pipeline #1       ALU pipeline #2
      3    LLOAD r3
      4    LLOAD r2
      5    LLOAD r0
      6    LLOAD r6
      7    LLOAD r4          LADD r3 r2 -> r1
      8    LLOAD r10
      9    LLOAD r8
     10    LLOAD r14         LADD r1 r0 -> r7
     13                      LADD r7 r6 -> r5
     16                      LADD r5 r4 -> r11
     19                      LADD r11 r10 -> r9
     22                      LADD r9 r8 -> r15
     25                      LADD r15 r14 -> r13

In this case the data dependency chain, and the resulting inability to keep the pipeline fully loaded, cost an extra 6 cycles.

How about turning the expression inside out?

    long d1, d2, d3, d4, d5, d6, d7, d8;
    print_long(d1 + (d2 + (d3 + (d4 + (d5 + (d6 + (d7 + d8)))))));

This cost one extra cycle, since the data is no longer loaded in the optimal order.

It is strange that, with explicit parallelism present and two integer pipelines available, the second one stays idle. Let's try to shake the system up. Let's go back to:

    long d1, d2, d3, d4, d5, d6, d7, d8;
    print_long(((d1 + d2) + (d3 + d4)) + ((d5 + d6) + (d7 + d8)));

Let's try increasing the speed of memory access: let a read into a register take 1 cycle. Funny, but little changed. The total time decreased by 2 cycles; more precisely, the whole picture simply shifted up by 2 cycles. Understandably so: memory bandwidth has not changed, we merely reduced the delay before the first result.

Perhaps the summation is simply too fast. Let it take 7 cycles instead of 3.
The total time increased to 29 cycles (8 + 3*7), but in principle nothing changed: the second pipeline is still idle.

Now let's try summing a pyramid of 16 numbers and introducing 2 memory pipelines. The total time is 36 cycles (16/2 + 4*7). The second ALU pipeline was still not involved: the numbers now arrive in pairs, all 8 first-level summations start one cycle apart, and this is enough for everything to fit into one pipeline.

Only by introducing 4 memory pipelines and allowing 6 instructions to be decoded per clock does any work reach the second ALU pipeline (it executed all of 3 (sic!) instructions); the total time, however, is 33 cycles. That is, the efficiency of the summation itself actually got worse.

Indeed, a pipeline is a very effective mechanism, and quite weighty arguments are needed to justify having more than one.

Register allocation


In the examples above, registers were assigned at the moment the mops were created. As a result, the register usage map for the pyramid summation of variables looks like this (initial parameters: 4 instructions decoded per cycle, one memory pipeline, three cycles each for a read and for an addition):
[register usage maps: 8 variables / 16 variables]
If, on the other hand, registers are assigned at the moment the instructions are issued into a pipeline, the picture looks much better:
[register usage maps: 8 variables / 16 variables]
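The difference can be sketched as follows (an illustration, not the author's code): a register is drawn from the free list only when a mop is issued into a pipeline, and is returned as soon as the last consumer of the value has read it, which shortens register lifetimes compared with allocating at mop-creation time.

    /* Sketch of late (issue-time) register allocation. Registers are
       taken from a free list only when a mop actually enters a pipeline
       and are recycled once the last reader has consumed the value. */
    #include <assert.h>

    #define N_REGS 32

    static int free_list[N_REGS], n_free;

    static void regs_init(void) {
        n_free = N_REGS;
        for (int i = 0; i < N_REGS; i++) free_list[i] = i;
    }

    static int reg_alloc(void) {        /* called at issue, not at decode */
        assert(n_free > 0);
        return free_list[--n_free];
    }

    static void reg_release(int r) {    /* called when the last consumer
                                           of the value has executed      */
        free_list[n_free++] = r;
    }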
As a check, it is worth showing the instruction dependency diagram (8 numbers):

Function call


On general grounds, the described architecture should have problems with function calls. Indeed, after returning from a function we expect the register state of the current function's context to be restored. Modern register architectures divide registers into two categories for this purpose: the caller is responsible for preserving some (caller-saved), the callee for the others (callee-saved). For example, in the x86-64 System V ABI, rbx, rbp, and r12-r15 are callee-saved, while rax, rcx, rdx, rsi, rdi, and r8-r11 belong to the caller-saved group.

But in our case the architecture's front end is a stack, so the compiler may not even be aware that any registers exist. The processor itself must therefore take care of saving and restoring the context, which looks like a non-trivial task.

However, we will solve this problem in the next article.

Source: https://habr.com/ru/post/278575/

