📜 ⬆️ ⬇️

Compilation. 7: register assignment

File names are infinite in length, where infinity is set to 255 characters.
--Peter Collinson: The Unix File System

So, we have a program on the p-code, and at its disposal an unlimited number of registers (ie, 255). The number of registers in a real processor is much smaller (suppose four). What do we do?

Further in the post:

  1. P-code parsing
  2. Lifetime
  3. Implementation
  4. Simple optimizations
  5. Version splitting
  6. Work with memory
  7. What happened?

P-code parsing


In the last post there was a dump of the compiled program. Her 40 teams are easy to decipher:
 00 mov r01, 0
 01 mov r02, 0x3e8
 02 echo 0x126
 03 echo r01
 04 echo 0xa0
 05 echo r02
 06 echo 0xa7
 07 le r03, r01, r02
 08 jz r03, + 0x1d (= 0x26)
 09 add r04, r01, r02
 0a mov r05,2
 0b div r06, r04, r05
 0c echo 0x162
 0d echo r06
 0e echo 0xcc
 0f input r07
 10 mov r08, 1
 11 eq r09, r07, r08
 12 jz r09, +4 (= 0x17)
 13 mov r0a, 1
 14 sub r0b, r06, r0a
 15 add r02, r0b, 0
 16 jz 0, + e (= 0x25)
 17 mov r0c, 2
 18 eq r0d, r07, r0c
 19 jz r0d, +4 (= 0x1e)
 1a mov r0e, 1
 1b add r0f, r06, r0e
 1c add r01, r0f, 0
 1d jz 0, +7 (= 0x25)
 1e mov r10, 3
 1f eq r11, r07, r10
 20 jz r11, +3 (= 0x24)
 21 echo 0x146
 22 hlt
 23 jz 0, +1 (= 0x25)
 24 echo 0x16a
 25 jz 0, -0x1f (= 7)
 26 echo 0xff
 27 hlt

It can be seen that not all registers are used at the same time: most store intermediate values ​​that are needed only for the next command. For example, r03 used only in commands 07-08 , and r0f only in commands 1b-1c . It means that there are no obstacles to use the same physical register for the stored values: in the 07-08 commands it will be used for one value, and in 1b-1c - for another. Any two registers that are not used simultaneously for different values can be combined into one physical register.

There is a small subtlety about the definition of "being used simultaneously for different values." Firstly, it is impossible to consider the register lifetime as just a gap from its first mention to the last: the code may contain jumps in all directions, for example:
    ...
 10 mov r, 42
 11 jz 0, 20
 12 echo r
    ...
 20 ...
    ...
 30 jz 0, 12
    ...

Even though r is mentioned only in commands 10–12, it is implicitly used in command 20, because after it, a jump back to 12 is possible; therefore, r cannot be combined with other registers used in command 20.
')
The second subtlety:
    ...
 10 mov r, 42
 11 echo r
    ...
 20 ...
    ...
 30 mov r, 24
 31 echo r
    ...

In command 20, the r register is not actually used, although it is mentioned before this (10-11), and after that (30-31). We can store any other values ​​in it, because before the next use (31) it will be assigned a new value. In principle, we can even assign different physical registers for r in 10-11 and 31-31, because the values ​​will still be stored different. But for this you need to find out that r two independent “versions” and process them separately.

Lifetime


You can determine the set of commands in which the register is used iteratively:
For each command, we determine the set of registers that it needs at the input and at the output, and apply three rules until these sets are established. Here is what we get for our program:
              iteration number: 1 (input / output) 2 (input / output) 3 (input / output) 4 (input / output) ... total (input / output)
 00 mov r01, 0 / r01
 01 mov r02, 0x3e8 / r01 r01 / r01, r02
 02 echo 0x126 / r01 r01 / r01 r01 / r01 r01, r02 / r01, r02
 03 echo r01 r01 / r01 / r01 / r01 / r02 r01, r02 / r01, r02
 04 echo 0xa0 / r02 r02 / r02 r02 / r02 r01, r02 / r01, r02
 05 echo r02 r02 / r02 / r02 / r02 / r01, r02 r01, r02 / r01, r02
 06 echo 0xa7 / r01, r02 r01, r02 / r01, r02 r01, r02 / r01, r02 r01, r02 / r01, r02
 07 le r03, r01, r02 r01, r02 / r01, r02 / r03 r01, r02 / r03 r01, r02 / r01, r02, r03 r01, r02 / r01, r02, r03
 08 jz r03, + 0x1d (= 0x26) r03 / r03 / r01, r02 r01, r02, r03 / r01, r02 r01, r02, r03 / r01, r02 r01, r02, r03 / r01, r02
 09 add r04, r01, r02 r01, r02 / r01, r02 / r01, r02 / r01, r02 / r04 r01, r02 / r01, r02, r04
 0a mov r05,2 / r04, r05 r04 / r04, r05 r04 / r04, r05 r01, r02, r04 / r01, r02, r04, r05
 0b div r06, r04, r05 r04, r05 / r04, r05 / r04, r05 / r04, r05 / r06 r01, r02, r04, r05 / r01, r02, r06
 0c echo 0x162 / r06 r06 / r06 r06 / r06 r01, r02, r06 / r01, r02, r06
 0d echo r06 r06 / r06 / r06 / r06 / r01, r02, r06 / r01, r02, r06
 0e echo 0xcc r01, r02, r06 / r01, r02, r06
 0f input r07 / r07 r01, r02, r06 / r01, r02, r06, r07
 10 mov r08, 1 / r07, r08 r07 / r07, r08 r07 / r07, r08 r01, r02, r06, r07 / r01, r02, r06, r07, r08
 11 eq r09, r07, r08 r07, r08 / r07, r08 / r09 r07, r08 / r09 r07, r08 / r09 r01, r02, r06, r07, r08 / r01, r02, r06, r07, r09
 12 jz r09, +4 (= 0x17) r09 / r09 / r09 / r09 / r06, r07 r01, r02, r06, r07, r09 / r01, r02, r06, r07
 13 mov r0a, 1 / r06, r0a r06 / r06, r0a r06 / r06, r0a r01, r06 / r01, r06, r0a
 14 sub r0b, r06, r0a r06, r0a / r06, r0a / r0b r06, r0a / r0b r06, r0a / r0b r01, r06, r0a / r01, r0b
 15 add r02, r0b, 0 r0b / r0b / r0b / r0b / r01, r0b / r01, r02
 16 jz 0, + e (= 0x25) r01, r02 / r01, r02
 17 mov r0c, 2 / r07, r0c r07 / r07, r0c r07 / r07, r0c r01, r02, r06, r07 / r01, r02, r06, r07, r0c
 18 eq r0d, r07, r0c r07, r0c / r07, r0c / r0d r07, r0c / r0d r07, r0c / r0d r01, r02, r06, r07, r0c / r01, r02, r06, r07, r0d
 19 jz r0d, +4 (= 0x1e) r0d / r0d / r0d / r0d / r06, r07 r01, r02, r06, r07, r0d / r01, r02, r06, r07
 1a mov r0e, 1 / r06, r0e r06 / r06, r0e r06 / r06, r0e r02, r06 / r02, r06, r0e
 1b add r0f, r06, r0e r06, r0e / r06, r0e / r0f r06, r0e / r0f r06, r0e / r0f r02, r06, r0e / r02, r0f
 1c add r01, r0f, 0 r0f / r0f / r0f / r0f / r02, r0f / r01, r02
 1d jz 0, +7 (= 0x25) / r01, r02 r01, r02 / r01, r02
 1e mov r10, 3 / r07, r10 r07 / r07, r10 r07 / r07, r10 r01, r02, r07 / r01, r02, r07, r10
 1f eq r11, r07, r10 r07, r10 / r07, r10 / r11 r07, r10 / r11 r07, r10 / r11 r01, r02, r07, r10 / r01, r02, r11
 20 jz r11, +3 (= 0x24) r11 / r11 / r11 / r11 / r01, r02, r11 / r01, r02
 21 echo 0x146
 22 hlt
 23 jz 0, +1 (= 0x25) / r01, r02 r01, r02 / r01, r02
 24 echo 0x16a / r01, r02 r01, r02 / r01, r02
 25 jz 0, -0x1f (= 7) / r01, r02 r01, r02 / r01, r02 r01, r02 / r01, r02 r01, r02 / r01, r02
 26 echo 0xff
 27 hlt
Concerning separate iterations:
  1. At the input of each command are the registers it uses; (rule number 1)
  2. At the output of each command, copy the input of the commands reachable from it; (rule number 2)
  3. At the input of each command we copy its output, excluding the register recorded by the command; (rule number 3)
  4. At the output of each command, copy the input of the commands reachable from it; (rule number 2)
Further, the rules number 2 and number 3 are repeated in turn.

After several iterations, we get the final markup of the “survivability of the registers” - which registers are needed by each team. Then we can rename the registers of the p-code to physical ones - for example, “greedily”: we take the “most necessary” register (the one that is needed by the largest number of commands); rename to physical, which is not yet engaged in the relevant teams; repeat.
The optimal choice of which n-registers should be provided for the physical in order to last as much as possible is possible only by exhaustive search: this task is equivalent to “coloring the graph in a minimum of colors,” about which mathematicians believe that there is no effective solution for it. Our “greedy” approach is not the worst possible.

What if there is no free physical for the next p-register? For example, in our code there are commands (10-12, 17-19) that use 5 n-registers in each: at least one of them does not have enough space. It is clear that you have to keep it in memory.
But just saying the register “you are not lucky, you will be stored in memory” is not enough: after all, the commands of our p-code do not know how to work with the values ​​from memory, as well as the commands of most real processors. It is necessary at the time of use to get the value from memory in the register. Only what? After all, this is why they removed the value in memory, because all the registers are occupied!

At this point, the theory I learned ends, and guesses and heuristics come along. I decided to “pour out into memory” extra registers before a command for which there are not enough registers, and then read them back from memory. Thus, in the “lifetime” of the poured registers, a hole will appear in the place of this command, and in this hole we will be able to use the released registers for other purposes.

To be able to “push” the code and insert read / write commands into the memory, you have to change the intermediate format a bit: now it will be a linked list, and the jump target will be stored not as an offset, but as a pointer (turning the list into a digraph). During the generation of the p-code, for backpathing, we will need to refer to the commands and by index; for this we will get a separate displacement vector.
typedef std::list< struct commandn>::iterator pcommandn;
struct commandn {
command cmd;
pcommandn tgt; //
commandn( const command& cmd) : cmd(cmd), tgt( NULL ) {}
};

std::list<commandn> pcode;
std::vector<pcommandn> pcodeidx;

inline int emit( const command& cmd) {
pcodeidx.push_back(pcode.insert(pcode.end(), commandn(cmd)));
return pcode.size()- 1 ;
}
inline int emit(command::opcodes opcode, regnum dest, regnum src1, regnum src2) {
return emit(command(opcode, dest, src1, src2));
}
inline int emit(command::opcodes opcode, regnum dest, short imm) {
return emit(command(opcode, dest, imm));
}
inline void fix( int index, command::opcodes opcode, regnum dest, short imm) {
*pcodeidx[index] = commandn(command(opcode, dest, imm));
}

Pointers to a jump target are difficult to fill in during the generation of the p-code: at the time of creating the jumps, the target index is already known, but it is not yet known what kind of team will be there; to put a pointer to a team that has not yet been created is problematic. Therefore, after the end of the parsing and generation of the p-code, we bypass the generated code, and place the jump pointers; then do all the necessary work with the allocation of registers; and at the end, before dumping the p-code to a file, you will need to “serialize” it - determine the final index of each command, and fill in the jump offsets. For this, add the field int index; to the struct commandn int index;

int main() {
yyparse();
//
for ( int i= 0 ; i<pcode.size(); i++)
if (pcodeidx[i]->cmd.opcode==command::jz)
pcodeidx[i]->tgt = pcodeidx[i+ 1 +pcodeidx[i]->cmd.imm];

// /
// ...

// /
int index = 0 ;
foreach(i, pcode) i->index = index++;
foreach(i, pcode)
if (i->cmd.opcode==command::jz)
i->cmd.imm = i->tgt->index-i->index- 1 ;

// -
// ...
}

Things are easy: implement "processing / optimization". It takes, as you might expect, most of the entire compiler code.

Implementation


For assigning registers, we will store for each command (directly in the struct commandn ) sets of living at the input and output registers: separately - registers of the p-code, to determine the lifetime; and separately - physical registers, to determine the compatibility with the actual appointment.
typedef std::set<regnum> aliveset;
enum physreg { firstp = 1 , lastp = 4 };
typedef std::set<physreg> alivesetp;
struct commandn {
// ...
aliveset onenter, onexit; // liveness check
alivesetp onenterp, onexitp; // phys reg allocation
// ...
};

Another useful thing is to determine by opcode which registers are used by the command:
inline bool hasnext(pcommandn cmd) {
return (cmd->cmd.opcode!=command::hlt &&!
(cmd->cmd.opcode==command::jz && !cmd->cmd.dest));
}
inline bool has2src(pcommandn cmd) {
return cmd->cmd.opcode>=command::add;
}
inline bool writesdest(pcommandn cmd) {
return cmd->cmd.opcode>=command::mov;
}
inline bool readsdest(pcommandn cmd) {
return cmd->cmd.opcode>=command::store &&
cmd->cmd.opcode<=command::echo;
}

After that, according to the scheme above, we calculate for each command the set of registers needed at the input and at the output; at the same time we consider how many times each register is needed for a greedy assignment.
std::vector< int > liveness() { // returns usecount
std::vector< int > usecount(lastreg+ 1 );
bool changed = false ;
// rule 1
foreach(i, pcode) {
i->onenter = i->onexit = aliveset();
if (has2src(i)) {
if (i->cmd.src1) {
usecount[i->cmd.src1]++;
i->onenter.insert(i->cmd.src1);
}
if (i->cmd.src2) {
usecount[i->cmd.src2]++;
i->onenter.insert(i->cmd.src2);
}
} else if (readsdest(i) && i->cmd.dest) {
usecount[i->cmd.dest]++;
i->onenter.insert(i->cmd.dest);
}
if (i->onenter.size())
changed = true ;
}
while (changed) {
changed = false ;
foreach2(i, pcode, next) {
int oldsize = i->onenter.size()+i->onexit.size();
// rule 2 (next command)
if (hasnext(i))
i->onexit.insert(next->onenter.begin(), next->onenter.end());
// rule 2 (jmp target)
if (i->cmd.opcode==command::jz)
i->onexit.insert(i->tgt->onenter.begin(), i->tgt->onenter.end());
// rule 3
i->onenter.insert(i->onexit.begin(), i->onexit.end());
if (writesdest(i))
i->onenter.erase(i->cmd.dest);

if (i->onenter.size()+i->onexit.size() != oldsize)
changed = true ;
}
}
return usecount;
}

The foreach2 macro foreach2 allows us to look both at the current and at the next element in the loop body:
#define foreach2(i, list, next) typedef typeof(list) TOKENPASTE2(T, __LINE__ ); \
        for (TOKENPASTE2(T, __LINE__ )::iterator next = list.begin(), i=next++; i != list.end(); i=next, next++)

In principle, it was possible to reduce the total number of iterations, if we bypass the commands not “from the top down”, but “from the bottom up”, because the need “spreads” from each command to the previous ones; the top-down bypass is selected to simplify the code. In the worst case, for a code of length N commands, 2 N iterations will be required (at each iteration, the need extends to the distance of a half-team). For test.jsk 35 iterations were enough, because jumping reduces the distance.

Finally, why we did all this:
// / :
while ( 1 ) { //

//
// ...

std::vector< int > usecount = liveness();
std::vector<physreg> physmap(lastreg+ 1 );
foreach(i, pcode)
i->onenterp = i->onexitp = alivesetp();
while ( 1 ) {
//
std::vector< int >::iterator max = std::max_element(usecount.begin(), usecount.end());
if (!*max) break ; //
*max = 0 ;
regnum r = max - usecount.begin();
// . - â„–r
std::vector< int > collisions(lastp); // 0-based
for (physreg p=firstp; p<=lastp; p=(physreg)(p+ 1 )) { // error: ISO C++ forbids incrementing an enum
foreach(i, pcode) {
if (contains(i->onenter, r) && contains(i->onenterp, p))
collisions[p- 1 ]++;
if (contains(i->onexit, r) && contains(i->onexitp, p))
collisions[p- 1 ]++;
}
if (collisions[p- 1 ]) continue ; // .
physmap[r] = p;
// - â„–r â„–p
foreach(i, pcode) {
if (contains(i->onenter, r))
i->onenterp.insert(p);
if (contains(i->onexit, r))
i->onexitp.insert(p);
}
break ; //
}
if (physmap[r]) continue ; //
// . - â„–r
// : ,
foreach2(i, pcode, next)
if ((i->cmd.opcode!=command::load && i->cmd.opcode!=command::store) &&
((contains(i->onenter, r) && i->onenterp.size()==lastp) ||
(contains(i->onexit, r) && i->onexitp.size()==lastp))) {
// , ,
// ...
goto afterspill;
}
// : ?
yyerror( "don't know how to proceed" );
}
// ;
foreach(i, pcode) {
i->cmd.dest = physmap[i->cmd.dest];
if (has2src(i)) {
i->cmd.src1 = physmap[i->cmd.src1];
i->cmd.src2 = physmap[i->cmd.src2];
}
}
break ;

afterspill : // - " "
}

Without going into how exactly we poured the register, it is worth mentioning that the code above has a lot of problems. First, heuristics work only in the simplest cases; more common situations like
    ...
 10 need r1, r2, r3, r4
    ...
 20 need r1, r3, r4, r5
    ...
 30 need r2, r3, r4, r5
    ...
The r5 lacks the physical, because all the physical ones were spent on r1,r2,r3,r4 ; but there is not a single command where all five registers are needed at the same time.

A tougher problem is that each spout clogs the program with useless store/load pairs, and they need to be cleaned - otherwise the registers used in them create enough “noise” so that the registers are not enough for anything else.

Another idea: a hole in the lifetime caused by the outflow can divide the original n-register into two independent “versions” that can be placed in different physical registers. For the sake of the flexibility of our system, it is necessary to implement the splitting of versions: then with each iteration of the external cycle, the lifetimes of the registers will be reduced, and it will become easier and simpler to distribute physical ones to them.

And, since we went to remove the extra load and store , we will also remove other extra commands at the same time. Given that our p-code operations have no other side effects, other than calculating the register with the result, you can delete commands whose result is not alive at the output. Maybe the programmer assigned the value of an unused variable, or used an operator like (2+2);
We delete everything - so we will delete the unreachable code to the heap (for example, jz after hlt), and compress the chains of jumps (we will replace the jump for the jump with one jump).

To remove a command, in order not to look for and fix all jumps on it, it is convenient to replace it with nop ; as it, let's assign an unconditional jump to the next command ( jz r00, +0 ). The benefit is that at the next iteration, the jump to nop shrink along the chain, and the nop itself will be cleaned up as an unreachable command.

Simple optimizations


To detect unreachable commands we use depth traversal; to keep track of which teams have already bypassed, add a new int reachcnt; field to the struct commandn int reachcnt;
A boolean field would be enough for us; but the “degree of approach” of each team may still be useful sometime later.
void markReachable(pcommandn cmd) {
while (!cmd->reachcnt++) {
if (cmd->cmd.opcode==command::jz)
markReachable(cmd->tgt);
if (hasnext(cmd))
cmd++;
else
break ;
}
}

void nopOut(pcommandn cmd) {
pcommandn next = cmd; next++;
cmd->cmd = command(command::jz, 0 , 0 );
cmd->tgt = next;
}

void simpleopt() {
//
foreach(i, pcode)
if (i->cmd.opcode==command::jz)
while (i->tgt->cmd.opcode==command::jz && !i->tgt->cmd.dest)
i->tgt = i->tgt->tgt;
//
foreach(i, pcode) i->reachcnt = 0 ;
markReachable(pcode.begin());
foreach2(i, pcode, next)
if (!i->reachcnt)
pcode.erase(i);
// nop:
foreach2(i, pcode, next)
if (i->cmd.opcode==command::jz &&
!i->cmd.dest && i->tgt==next)
pcode.erase(i);
}

// :
simpleopt();
//
liveness();
bool changed = false ;
foreach(i, pcode)
if (writesdest(i) && !contains(i->onexit, i->cmd.dest)) {
nopOut(i);
changed = true ;
}
if (changed) continue ;
//
// ...


Version splitting


If there is a gap in the “interval of use” of the register, i.e. a value from one command set never penetrates another set — it means that a register can be “split” into a couple of registers with shorter lifetimes.
To isolate the unbroken fragments from the lifetimes, another graph algorithm is useful: the search for connected components.
We strive more for code clarity than for its effectiveness; therefore, we have pvg inside pvg, and the merging of components is performed by traversing the entire graph.

The struct commandn will require another field, the last for today: std::map<regnum,regnum> version;
In it we will store for each match command “n-register -> version number”.
If the register in the command is used, but there is still no correspondence in version , it means that the detour has not yet reached this command.
For the external pvg, in order not to start a new field, we use the same reachcnt as in markReachable()
std::map<regnum,regnum> renameto; //

void versionizeReg(pcommandn cmd, regnum r, regnum rv) { // r rv
while (!contains(cmd- > version, r) &&
(contains(cmd- > onenter, r) || contains(cmd- > onexit, r))) {
//
cmd- > version[r] = renameto[rv];
if (cmd- > cmd.dest==r) cmd- > cmd.dest=renameto[rv];
if (has2src(cmd)) {
if (cmd- > cmd.src1==r) cmd- > cmd.src 1 =renameto[rv];
if (cmd- > cmd.src2==r) cmd- > cmd.src 2 =renameto[rv];
}
if (!contains(cmd- > onexit,r)) return ; //
if (cmd- > cmd.opcode==command::jz &&
contains(cmd- > tgt- > onenter, r)) {
versionizeReg(cmd- > tgt, r, rv);
}
if (hasnext(cmd)) {
cmd++;
if (contains(cmd- > onenter, r))
continue ;
}
return ; //
}
// ?
if (contains(cmd- > version, r) && cmd- > version[r]!=renameto[rv]) {
// renameto[rv] cmd->version[r]
regnum from = std::max(cmd- > version[r], renameto[rv]),
to = std::min(cmd- > version[r], renameto[rv]);
renameto[from] = to;
foreach(i, pcode) {
if (i- > cmd.dest==from) i- > cmd.dest = to;
if (has2src(i)) {
if (i- > cmd.src1==from) i- > cmd.src 1 = to;
if (i- > cmd.src2==from) i- > cmd.src 2 = to;
}
if (contains(i- > version, r) && i- > version[r]==from)
i- > version[r] = to;
}
}
}

void versionize(pcommandn cmd) {
while (!cmd- > reachcnt++) {
// versionize registers that live on exit
foreach(r, cmd- > onexit)
if (!contains(cmd- > version, *r)) {
regnum rv = newreg();
renameto[rv] = rv;
versionizeReg(cmd, *r, rv);
}
if (cmd- > cmd.opcode==command::jz)
versionize(cmd- > tgt);
if (hasnext(cmd))
cmd++;
else
break ;
}
}

// :
foreach(i, pcode) {
i- > version.clear();
i- > reachcnt = 0 ;
}
renameto.clear();
int lastbasereg = lastreg;
versionize(pcode.begin());
// , 1
foreach(i, pcode) {
if (i- > cmd.dest) i- > cmd.dest-=lastbasereg;
if (has2src(i)) {
if (i- > cmd.src1) i- > cmd.src 1 -=lastbasereg;
if (i- > cmd.src2) i- > cmd.src 2 -=lastbasereg;
}
}
lastreg -= lastbasereg;

It remains to implement the actual register outflow. Let us concretize the heuristics: you can pour out any register that is alive, but the command is not used.
regnum spillable(pcommandn cmd) { // ,
aliveset alive;
alive.insert(cmd->onenter.begin(), cmd->onenter.end());
alive.insert(cmd->onexit.begin(), cmd->onexit.end());
alive.erase(cmd->cmd.dest);
if (has2src(cmd)) {
alive.erase(cmd->cmd.src1);
alive.erase(cmd->cmd.src2);
}
if (!alive.size()) return 0 ;
return *alive.begin();
}

In a separate map we will store by register number the “address” of its copy in memory so that all register outflows fall into the same place. The address is assigned to the register only at the moment of the first flood.
// spill slots
std::map<regnum, int > spill;
int lastspill = 0 ;

void spillAt(pcommandn cmd, regnum victim, pcommandn next) {
// /
int spslot = spill[victim];
if (!spslot) {
spslot = ++lastspill;
spill[victim] = spslot;
}
// store load
pcommandn me = pcode.insert(next, *cmd);
cmd->cmd = command(command::store, victim, spslot);
commandn load(command(command::load, victim, spslot));
if (hasnext(me))
pcode.insert(next, load);
if (me->cmd.opcode==command::jz)
me->tgt = pcode.insert(me->tgt, load);
}

Since the team is replaced with the store-cmd-load troika, it is necessary that the existing jumps on the team jump to the store inserted in front of it. In order not to look for such jumps throughout the code, we insert the cmd copy with the following command, and then replace the original command with the store .
Similarly, if the command is a conditional transition, then it is necessary that the load executed on both outcomes; therefore, we add it to the next command, and insert it in front of the goal of the jump.

(It is a bit annoying that because of the possibility of moving teams from one commandn to another, we will have to associate the lines for echo with the commands themselves by “identifier”, as in the first tree implementation: neither command indices nor pointers remain unchanged.)

The second heuristics: if you could not find a team that has all the physical registers in, then take the physical register that “best fits” (least of all collisions with already assigned registers) and try to release it wherever it is needed, but busy .
// . - â„–r
regnum victim;
// 1: ,
foreach2(i, pcode, next)
if ((i->cmd.opcode!=command::load && i->cmd.opcode!=command::store) &&
((contains(i->onenter, r) && i->onenterp.size()==lastp) ||
(contains(i->onexit, r) && i->onexitp.size()==lastp))) {
victim = spillable(i);
assert(victim);
spillAt(i, victim, next);
goto afterspill;
}
// 2: " "
physreg bestfit = (physreg)(std::max_element(collisions.begin(), collisions.end())-collisions.begin()+ 1 );
foreach2(i, pcode, next)
if ((i->cmd.opcode!=command::load && i->cmd.opcode!=command::store) &&
((contains(i->onenter, r) && contains(i->onenterp, bestfit)) ||
(contains(i->onexit, r) && contains(i->onexitp, bestfit))) &&
(victim = spillable(i))) {
spillAt(i, victim, next);
goto afterspill;
}
yyerror( "don't know how to proceed" );

When splitting versions, you need to remember to copy the slot number (all versions of one register are stored in one slot), and then, when renaming versions, remember to update the spill .

The compiler is almost working; The last problem is that a bunch of useless load and store is generated, and because of them there are no registers left for meaningful commands.

Work with memory


First, it may turn out that some registers are used exclusively in load/store commands load/store
Such commands are useless: the register is always written in the same slot from which it was read.
Let's insert cleaning after splitting versions: there we should have just got a lot of registers with a short lifetime.
//
std::vector< bool > used(lastreg+ 1 );
foreach(i, pcode) {
if (writesdest(i) && i->cmd.opcode!=command::load)
used[i->cmd.dest] = true ;
if (readsdest(i) && i->cmd.opcode!=command::store)
used[i->cmd.dest] = true ;
if (has2src(i)) {
used[i->cmd.src1] = true ;
used[i->cmd.src2] = true ;
}
}
foreach(i, pcode)
if ((i->cmd.opcode==command::load && !used[i->cmd.dest]) ||
(i->cmd.opcode==command::store && !used[i->cmd.dest]))
nopOut(i);

Secondly, if we have chains of identical load , then everything except the last one will be deleted because the read register is not alive. But if we have chains of the same store , then to remove them you need to come up with something new. The algorithm is very similar to the iterative determination of the need for registers, only now we define “unpreserved” - in order to clear all the saved registers already saved.We see that this algorithm differs from the previous one only in the direction of propagation: the need spreads from each command to the previous ones, and the non-conserved - to the subsequent ones.
In general, the iterative algorithm for calculating the sets of registers at each point of the program is called data flow analysis , and applies, in addition to the two mentioned, and for many other optimizations.

void unsaved() {
bool changed = false ;
// rule 1
foreach(i, pcode) {
i->onenter = i->onexit = aliveset();
if (writesdest(i)) {
i->onexit.insert(i->cmd.dest);
changed = true ;
}
}
while (changed) {
changed = false ;
foreach2(i, pcode, next) {
int oldsize = i->onenter.size()+i->onexit.size();
// rule 2 (next command)
if (hasnext(i))
next->onenter.insert(i->onexit.begin(), i->onexit.end());
// rule 2 (jmp target)
if (i->cmd.opcode==command::jz)
i->tgt->onenter.insert(i->onexit.begin(), i->onexit.end());
// rule 3
i->onexit.insert(i->onenter.begin(), i->onenter.end());
if (i->cmd.opcode==command::store)
i->onexit.erase(i->cmd.dest);

if (i->onenter.size()+i->onexit.size() != oldsize)
changed = true ;
}
}
}

afterspill : // - " "
unsaved();
foreach(i, pcode)
if (i->cmd.opcode==command::store && !contains(i->onenter, i->cmd.dest))
nopOut(i);
}


What happened?


All compiler together: tyomitch.net.ru/jsk.y.regs.html
Of the five hundred lines, only one-fifth is part of the parsing. Indeed, it seems that, compared to the labor costs for the rest of the compiler, the parser is an uninteresting little thing on the side. But this is only because the creation of parsers is sucked to the bone, and we can only use the ready; whereas when processing the code, there is more room for fiction.

As a result of the assignment of registers and first optimizations, the generated code was lengthened by just a couple of commands:
 00 mov r01, 0
 01 mov r02, 0x3e8
 02 echo 0x12e
 03 echo r01
 04 echo 0xa8
 05 echo r02
 06 echo 0xaf
 07 le r03, r01, r02
 08 jz r03, + 0x1f (= 0x28)
 09 add r03, r01, r02
 0a mov r04,2
 0b div r03, r03, r04
 0c echo 0x16a
 0d echo r03
 0e echo 0xd4
 0f input r04
 10 store r01, 1
 11 mov r01, 1
 12 eq r01, r04, r01
 13 jz r01, +5 (= 0x19)
 14 load r01, 1
 15 mov r02, 1
 16 sub r02, r03, r02
 17 add r02, r02, 0
 18 jz 0, -0x12 (= 0x7)
 19 mov r01,2
 1a eq r01, r04, r01
 1b jz r01, +4 (= 0x20)
 1c mov r01, 1
 1d add r01, r03, r01
 1e add r01, r01, 0
 1f jz 0, -0x19 (= 0x7)
 20 load r01, 1
 21 mov r03, 3
 22 eq r03, r04, r03
 23 jz r03, +2 (= 0x26)
 24 echo 0x14e
 25 hlt
 26 echo 0x172
 27 jz 0, -0x21 (= 7)
 28 echo 0x107
 29 hlt

But since only four registers are used, this p-code will probably be translated into machine code of a real processor.

We will continue in August: I am leaving for vacation, and I finish this post directly from the airport.
Have a nice summer everyone!

Source: https://habr.com/ru/post/99595/


All Articles