Compilation. 8: optimization

After a pleasant stay, we continue to write a compiler for our J-squeak.
In the previous post, we implemented a heuristic taken from the ceiling for register assignment , and at the same time we started optimizing the code. And before that, readers found a bug in the implementation of the assignment.

Further in the post:

Bug fix
Cleaning up
What happened?
Folding constants
Implementation

Bug fix

The fact is that we decided to cheat, and when first assigning a value to a variable, do not perform the actual copying, but simply declare the register with the intermediate value as the storage location of the variable:

ID '=' EXPR { $$ = $3 ;
if (vars[ $1 ])
emit(command::add, vars[ $1 ], $3 , 0 );
else
vars[ $1 ] = $3 ; // new var
}

Then when compiling an operation of type a=2; we will get one MOV R1, 2 vars["a"]=R1 command (from convolution 2) and we will remember vars["a"]=R1 (from convolution of assignment).
That's right, simple and natural.

The problem arose when in the right part of the assignment was used not an intermediate value, but something long-lived: for example, another variable.

 a = 2;
 b = a;

In the second assignment convolution, no new code is generated - just remember vars["b"]=R1
Both variables were in one register, and when one of them changes, the second one will change.
')
The solution lies on the surface: for each new variable we start a new register.

ID '=' EXPR { $$ = $3 ;
if (!vars[ $1 ]) vars[ $1 ] = newreg();
emit(command::add, vars[ $1 ], $3 , 0 );
}

Then from a=2; we'll get a couple of commands

 MOV R1,2
 ADD R2, R1, 0

and remember vars["a"]=R2
If it is followed by b = a; - then MOV R3, R2, 0 and vars["b"]=R3 will be added to the code

In other words, now in the generated code there will be a lot of unnecessary copies from the register to the register, and the assignment of registers will work even slower due to the growing number of registers used.
We will try to find those cases where copying is unnecessary (for example, if the variable from the right side of the assignment does not change later), and correct the commands that use the copy, so that the original is used.

Cleaning up

Copy elimination is one of the simple optimizations that I promised from the very first post of the series. As with optimizations performed during register assignment, data-flow analysis is convenient for cleaning. An important difference from the two previous DFA applications (backward pass for detecting live registers, forward pass for detecting saved registers) will be that at each point we store not one set of registers, but the partitioning of all registers into sets of identical ones. You can look at this as a more general case of DFA than the two previously discussed. (Before, registers were always divided into two sets - “included” and “excluded”.)

How we will break registers?

ADD RA, RB, 0 takes RA to RB ;
any other command that changes the register makes it into a singleton set;
RA and RB will be together in team C, if they are together in all teams from which you can go to C (directly or by jumping).

After receiving the final split, you will see where you can use the original instead of a copy: instead of RA you can use RB if they are together in all the commands where RA used.

And one more subtlety: since during transitions from team to team we do not increase sets, but truncate (intersect all incoming sets), then before launching the DFA, you need to initialize the sets not to empty, but to comprehensive ones - and as the algorithm works, the sets will be truncated, need to. In order not to spend money and do not really hold in each team the set of all existing registers, we will agree to consider the “missing iterator” indicating exactly such a comprehensive set.

For convenience, we make the three partition operations we need into a class. In the partition we store a list of sets into which registers are divided (except for the “global” set, in which the registers are all together initially), and for each register (except those in the “global” set) it is an iterator of the set in which it is located .

With woody std::set I had some incomprehensible problems, so I wrote myself a bit::set container with a similar interface, but with std::bitset inside. Template parameter, it takes the maximum value of the key in the set.
At the same time, standard operations on sets ( &= , |= ) are made in bit::set .

typedef bit::set<regnum, 255 > regset;

class regPartition {
typedef std::list<regset> regsets;
regsets sets;
std::map<regnum, regsets::iterator> byreg;
// : ""

public :
// :
bool add(regnum copy, regnum orig) {
if (byreg.count(copy)) {
if (byreg[copy]==byreg[orig]) //
return false ;
byreg[copy]->erase(copy);
// ?
if (!byreg[copy]->size())
sets.erase(byreg[copy]);
}
assert(byreg.count(orig));
byreg[copy] = byreg[orig];
byreg[copy]->insert(copy);
return true ;
}

void remove(regnum r) {
if (byreg.count(‌r)) {
if (byreg[r]->size()== 1 ) return ; //
byreg[r]->erase(‌r);
}
byreg[r] = sets.insert(sets.end(), regset());
byreg[r]->insert(‌r);
}

// :
bool isect( /* const */ regPartition& p, const command& self) {
bool changed = false ;
// , p
foreach(i, byreg) {
regnum r = i->first;
// split by p.byreg[r]
regsets::iterator withR = i->second,
withoutR = sets.insert(sets.end(), regset());
foreach2(j, (*withR), next)
// , -- mov :
//
if (!(self.opcode==command::add && !self.src2 &&
((self.src1==r && self.dest==*j)||(self.dest==r && self.src1==*j)))
&&((!p.byreg.count(‌r) && p.byreg.count(*j)) || // R in global, J isn't
(p.byreg.count(‌r) && !p.byreg[r]->count(*j)))) { // R not in global
withR->erase(*j);
withoutR->insert(*j);
byreg[*j] = withoutR;
}
if (!withoutR->size()) //
sets.erase(withoutR);
else //
changed = true ;
}
// , this, p
foreach(i, p.sets) {
regset inP;
foreach(j, (*i))
if (!byreg.count(*j)) inP.insert(*j);
if (inP.size()) {
regsets::iterator newSet = sets.insert(sets.end(), inP);
foreach(j, inP) byreg[*j] = newSet;
changed = true ;
}
}
return changed;
}

// : r ( )
void apply(regnum r, regset& global) {
if (!r) return ; // may not be replaced
assert(byreg.count(‌r));
if (!global.size()) // uninitialized set
global = *byreg[r];
else // initialized; intersect
global &= *byreg[r];
}
}

Add a new regPartition copies; field to the struct commandn regPartition copies;
Now we implement DFA in the usual way, using ready-made operations:

void copies() {
// )
bool changed = false ;
foreach(i, pcode) {
i->copies = regPartition();
// rule 2
if (writesdest(i)) {
i->copies.remove(i->cmd.dest);
changed = true ;
}
}
while (changed) {
changed = false ;
foreach2(i, pcode, next) {
// rule 1
if (i->cmd.opcode==command::add && !i->cmd.src2)
changed |= i->copies.add(i->cmd.dest, i->cmd.src1);
// rule 3 (next command)
if (hasnext(i))
changed |= next->copies.isect(i->copies, next->cmd);
// rule 3 (jmp target)
if (i->cmd.opcode==command::jz)
changed |= i->tgt->copies.isect(i->copies, i->tgt->cmd);
}
}
// )
std::vector<regset> copies(lastreg+ 1 );
foreach(i, pcode) {
if (readsdest(i))
i->copies.apply(i->cmd.dest, copies[i->cmd.dest]);
if (has2src(i)) {
i->copies.apply(i->cmd.src1, copies[i->cmd.src1]);
i->copies.apply(i->cmd.src2, copies[i->cmd.src2]);
}
}
// )
for (regnum r= 1 ; r<=lastreg; r++) {
copies[r].erase(‌r);
if (copies[r].size()) { // ?
regnum s = *(copies[r].begin()); // r s
foreach(i, pcode) { //
if (i->cmd.dest==r)
i->cmd.dest = s;
if (has2src(i)) {
if (i->cmd.src1==r) i->cmd.src1 = s;
if (i->cmd.src2==r) i->cmd.src2 = s;
}
if (i->cmd.opcode==command::add && i->cmd.src1==i->cmd.dest && !i->cmd.src2) // self-mov
nopOut(i);
}
foreach(c, copies) //
if (c->count(‌r)) {
c->erase(‌r);
c->insert(s);
}
}
}
}

Call copies(); insert it at the very beginning of the optimization loop, before checking the liveliness.

What happened?

Compared with the last time, the code has been reduced by a couple of commands:

 00 mov r01, 0
 01 mov r02, 0x3e8
 02 echo 0x126
 03 echo r01
 04 echo 0xa0
 05 echo r02
 06 echo 0xa7
 07 le r03, r01, r02
 08 jz r03, + 0x1d (= 0x26)
 09 add r03, r01, r02
 0a mov r04,2
 0b div r03, r03, r04
 0c echo 0x162
 0d echo r03
 0e echo 0xcc
 0f input r04
 10 store r01, 1
 11 mov r01, 1
 12 eq r01, r04, r01
 13 jz r01, +4 (= 0x18)
 14 load r01, 1
 15 mov r02, 1
 16 sub r02, r03, r02
 17 jz 0, -0x11 (= 0x7)
 18 mov r01,2
 19 eq r01, r04, r01
 1a jz r01, +3 (= 0x1e)
 1b mov r01, 1
 1c add r01, r03, r01
 1d jz 0, -0x17 (= 0x7)
 1e load r01, 1
 1f mov r03, 3
 20 eq r03, r04, r03
 21 jz r03, +2 (= 0x24)
 22 echo 0x146
 23 hlt
 24 echo 0x16a
 25 jz 0, -0x1f (= 7)
 26 echo 0xff
 27 hlt

It may seem that about the disappeared commands ( add r01, r01, 0 and add r02, r02, 0 ) it was immediately obvious that they are meaningless. In fact, these commands took a meaningless form only after the assignment of physical registers, i.e. at the very last stage before the output of the finished p-code. Until then, the n-register numbers of the operands differed; only the analysis we just performed allowed us to combine them, and deleting the duplicated copying — all this long before the appointment of physical registers.

Folding constants

Another standard optimization, which, like the previous ones, is implemented using DFA, is constant folding . The principle is utterly simple: if the values of the operands are known, then the operation can be performed immediately upon compilation. For example, instead of the code

 MOV R1,2
 MOV R2, 3
 ADD R3, R1, R2

we can generate right away

  MOV R3,5

Operations on constants do not necessarily indicate the negligence of a programmer who was too lazy to calculate a result known in advance: for example, pixels=1024*768; easier to read and maintain than pixels=786432;

This time, in each command we store sets of registers for which values are known, together with values: in the form of std::map<regnum,int>
As usual, we formulate three rules for computing sets:

in the MOV R, X command MOV R, X value of R is known, and this is X ;
in any other command specifying the value of R, this value is unknown;
in team C, the value of R is known if it is known and the same in all the commands from which you can go to C (directly or by jumping).

Again we see: the direction of the passage is forward (its value in the subsequent command depends on the register value in the previous command); operation in nodes - the union of unknown registers.

When the sets stabilize, we can replace each operation, both operands of which are known, with MOV .

The same data will allow us to perform another optimization - constant propagation (the substitution of a known value instead of a reference to a register). This optimization is not possible with the n-code format that we have chosen, because there are no operations on the register and the constant; such operations, however, are present in many real-world processors, so a full-fledged “constant substitution” can be performed when generating executable code. Now we confine ourselves to replacing the zero value with R0.

For example, if (1>2) { echo("unreachable"); } if (1>2) { echo("unreachable"); } , which compiles to

 MOV R1, 1
 MOV R2, 2
 GT R3, R1, R2
 JZ R3, label
 ECHO "unreachable"
 label:

will turn at the stage of folding constants in

 MOV R1, 1
 MOV R2, 2
 MOV R3, 0
 JZ R3, label
 ECHO "unreachable"
 label:

and the optimization of the “destruction of nonliving code”, already implemented by us last time, will remove the first two MOV commands.
If at the same time we replace the zero value by R0:

 MOV R3, 0
 JZ R0, label
 ECHO "unreachable"
 label:

then the last MOV will be removed along with the inanimate code, and “destroying the unreachable code” will also delete ECHO , turning JZ into NOP .

Similarly, you can delete from the JZ code with a known non-zero value. The second implemented “special case” is the replacement of ADD RX, (0), RY ADD RX, RY, R0 with ADD RX, RY, R0 , so that the copy cleaning algorithm recognizes the copy from register to register in this command.

Another benefit of folding constants is that negative values can now be used in our teams. Due to the fact that in the lexer we set the NUM token to regexp [0-9]+ , the strings like "-123" were interpreted as a unary minus and then the literal 123; so they compiled into a p-code like

 MOV R1,123
 SUB R2, R0, R1

Now in the p-code will be an honest team MOV R1, -123 .

Implementation

struct commandn is supplemented by another pair of fields:
std::map<regnum, int > known; regset unknown;

The basis of optimization, as in the previous cases, is DFA:

void constp() {
bool changed = false ;
foreach(i, pcode) {
i->known.clear(); i->unknown.clear();
if (i->cmd.opcode==command::mov) { // rule 1
i->known[i->cmd.dest] = i->cmd.imm;
changed = true ;
} else if (writesdest(i)) { // rule 2
i->unknown.insert(i->cmd.dest);
changed = true ;
}
}
while (changed) {
changed = false ;
foreach2(i, pcode, next) {
// rule 3 (next command)
if (hasnext(i))
changed |= propagate(i, next);
// rule 3 (jmp target)
if (i->cmd.opcode==command::jz)
changed |= propagate(i, i->tgt);
}
}
//
foreach(i, pcode) {
i->known[ 0 ] = 0 ; // R0
if (has2src(i) && i->known.count(i->cmd.src1) && i->known.count(i->cmd.src2))
i->cmd = command(command::mov, i->cmd.dest, ops[i->cmd.opcode](i->known[i->cmd.src1],i->known[i->cmd.src2]));
// 0
if (has2src(i)) {
if (i->known.count(i->cmd.src1) && !i->known[i->cmd.src1])
i->cmd.src1 = 0 ;
if (i->known.count(i->cmd.src2) && !i->known[i->cmd.src2])
i->cmd.src2 = 0 ;
if (i->cmd.opcode==command::add && !i->cmd.src1) { //
i->cmd.src1 = i->cmd.src2;
i->cmd.src2 = 0 ;
}
}
if (readsdest(i) && i->known.count(i->cmd.dest))
if (!i->known[i->cmd.dest])
i->cmd.dest = 0 ;
else // , 0
if (i->cmd.opcode==command::jz) nopOut(i);
}
}

The propagate() procedure implements the union of sets of unknown registers: a register with several known values is declared unknown.

bool propagate(pcommandn from, pcommandn to) {
bool changed = false ; // :
//
foreach(i, from->known) {
regnum r = i->first;
if (to->known.count(‌r))
if ((to->known[r]!=i->second) // , 1
&&!((to->cmd.opcode==command::mov) && (to->cmd.dest==r))) {
to->known.erase(‌r);
to->unknown.insert(‌r);
changed = true ;
} else ; //
else if (!to->unknown.count(‌r)) { // ,
to->known[r]=i->second;
changed = true ;
}
}
//
foreach(r, from->unknown)
if (!to->unknown.count(*r)) {
to->unknown.insert(*r);
to->known.erase(*r);
changed = true ;
}
return changed;
}

The last thing left is the actual calculation of the value, when the operands are known. In the same way as in the executer of the jay-squeak, we start by the function for each opcode:

int hlt( int src1, int src2) { assert( false ); return 0 ; }
int add( int src1, int src2) { return src1+src2; }
int sub( int src1, int src2) { return src1-src2; }
...
int lt( int src1, int src2) { return src1<src2; }
int (*ops[])( int , int ) = {hlt, hlt, hlt, hlt, hlt, hlt, hlt, add, sub, mul, div, eq, ne, ge, le, gt, lt};

Insert a call to constp(); before copies(); - and on it we will finish with optimization.

In the next post, we compile a real executable code for x86 / x64 from a p-code with physical registers.

Source: https://habr.com/ru/post/101946/

All Articles