Compilation. 9: executable code

I remind you that we are writing a compiler for the J-Squeak language . We started with a compilation of the p-code , spent a lot of effort on its optimization , and got ready for the final stage of compilation - to the conclusion of machine-dependent executable code.
There are no intricate algorithms here: by and large, only the replacement of one instruction set with another.

Further in the post:

Code selection
Loader
P-code changes
Generation
What happened?

Code selection

I would venture to suggest that you are reading this post on a computer with a x86 or x64 processor inside. I apologize to users of mobile devices: it would be interesting to generate code for a fashionable gadget, but it is too difficult for me to get such an instance for experiments.

We will try to generate a “cross-platform” code that is equally workable on both x86 and x64. We will focus on machine code, and not on the ELF format internals and the interaction with the loader; Therefore, we will create the code in a “continuous piece”, similar to the COM DOS format. Read it from the disk and run will be our own bootloader.
')
We will use only 32-bit registers, so that the same program, even if overflows occur in it during calculations, will produce the same results on both x64 and x86.

To communicate with the outside world ( ECHO and INPUT commands), the generated code must somehow call the functions of the operating system. How does he know their addresses?
Our loader will create a “communication area” for the loaded program - an array of pointers to “standard functions”, and pass the address of this array as a parameter to the program. We have three standard functions: input of a number, output of a number, output of a string; therefore, the communication area will consist of three pointers. For convenience of addressing, immediately at the entrance to the program we will save the transmitted address in EBP / RBP .

Another interesting point - access to memory (to poured variables). Instead of the absolute addressing typical for x86, x64 introduced “floating absolute” relative to RIP . In order for the code to really turn out to be cross-platform, it is necessary to use explicit relative addressing.
Without further ado, we will address the cells of the poured variables relative to the same EBP / RBP - i.e. Immediately after the communication region, at offset 0x18 , the variable region begins.

The generated p-code uses four “abstract physical registers” R01..R04 , which it is natural for these EAX,ECX,EDX,EBX registers to associate: i.e. the number of this register is obtained from the number of "abstract" by subtracting one. If we wanted to use the rest of the processor registers, the correspondence between the numbers would be more difficult.

The way parameters are passed to a function on x86 and x64 is very different. On x86, the default parameters are passed through the stack ( cdecl ), but you can ask gcc transfer up to three parameters in the ECX,EDX,EAX registers ( fastcall ). On x64, six parameters are transmitted in the registers RDI,RSI,RDX,RCX,R8,R9 , and gcc does not allow changing the method of the call. Because of this incompatibility, you will have to implement both transfer mechanisms in the code generator, and choose the one you need. Fortunately, passing parameters will be the only platform-specific place in our code.
On both processors, the called function has the right to spoil the contents of EAX,ECX,EDX , so they will need to be saved before the call (if they are alive), and restored after. Thus, information about the liveliness of physical registers, which we used during their assignment, is also needed when generating code.

What code we will generate, figured out; business for small - to realize our plans.

Loader

We take as a basis the long-standing interpreter of the p-code, leaving all the husks (checking input, displaying the file in memory), and replacing the inside (execution cycle and command implementations) with a call to the read code, as if it were a normal sish function.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>

const char * fdata = NULL ; //

int input() { int d; scanf( " %d " , &d); return d; }
void echoi( int i) { printf( " %d " , i); }
void echos( int offset) { printf( " %s " , fdata+offset); }

void * linkarea[ 1000 ] = {( void *)input, ( void *)echoi, ( void *) echos};

int main( int argc, char ** argv) {

if (argc!= 2 ) {
printf( "Missing code file name. \n " );
exit( 1 );
}

int fd = open(argv[ 1 ], O_RDONLY);
if (fd< 0 ) {
printf( "Cannot open code file. \n " );
exit( 1 );
}
struct stat finfo;
fstat(fd, &finfo);
fdata = ( const char *)mmap( 0 , finfo.st_size, PROT_READ|PROT_EXEC, MAP_PRIVATE, fd, 0 );
if (!fdata) {
printf( "Cannot read code file. \n " );
exit( 1 );
}

(( void (*)( void **)) fdata)(linkarea); //

munmap(( void *)fdata, finfo.st_size);
close(fd);
return 0 ;
}

Notice that the #include "jsk.h" that set the definition of commands disappeared: the new loader knows nothing about the jay script or about our p-code.

P-code changes

Now that the p-code has become an internal representation, invisible outside the compiler, it makes no sense to compress each command into 4 bytes; This means that neither tricky union nor field size restrictions are needed. Replace jsk.h with a regular structure declaration with int fields:

struct command {
enum opcodes {
hlt, store, jz, echo, mov, load, input, add, sub, mul, div,
eq, ne, ge, le, gt, lt
};
opcodes opcode;
regnum dest, src1, src2;
int imm;
command(opcodes opcode, regnum dest, regnum src1, regnum src2) :
opcode(opcode), dest(dest), src1(src1), src2(src2) {}
command(opcodes opcode, regnum dest, int imm) :
opcode(opcode), dest(dest), imm(imm) {}
};

The feature of the x86 / x64 instruction set is its variety and non-orthogonality. Almost any operation can be encoded in three or four different ways. We will try for each p-team to choose the most compact of the possible implementations. (This is an extreme case of "optimization through the eye" - peephole optimization , which deals not with the program as a whole, but with short blocks of adjacent teams. In general, each team is processed independently of those around.)

For example, the add n-command can be compiled into MOV , INC , DEC , into one of three ADD forms or into one of three LEA forms, depending on the combination of arguments.

Some local optimizations, like replacing mul r, r, 2 by add r, r, r , are conveniently done at the level of the p-code, before being translated into executable code. At the same stage - after choosing the physical registers, but before the final broadcast - we will get rid of the registers whose value is known in all the places where they are used: during the broadcast we will use the known value.

// :
// ...
if (i->cmd.opcode==command::mul) { //
if (i->known.count(i->cmd.src1)) std::swap(i->cmd.src1, i->cmd.src2);
if (i->known.count(i->cmd.src2)) switch (i->known[i->cmd.src2]) {
case - 1 : i->cmd = command(command::sub, i->cmd.dest, 0 , i->cmd.src1); break ;
case 0 : i->cmd = command(command::mov, i->cmd.dest, 0 ); break ;
case 1 : nopOut(i); break ;
case 2 : i->cmd = command(command::add, i->cmd.dest, i->cmd.src1, i->cmd.src1); break ;
}
}

// , ,
void postalloc() {
std::vector< bool > needed(lastreg+ 1 );
foreach(i, pcode) {
if (i->has2src()) {
if (!i->known.count(i->cmd.src1)) needed[i->cmd.src1]= true ;
if (!i->known.count(i->cmd.src2)) needed[i->cmd.src2]= true ;
else // src2 :
if (i->cmd.opcode==command::div && i->known[i->cmd.src2]!= 2 )
needed[i->cmd.src2]= true ; //
}
if (i->readsdest() && !i->known.count(i->cmd.dest))
needed[i->cmd.dest]= true ;
}
foreach(i, pcode)
if (i->writesdest() && !needed[i->cmd.dest])
nopOut(i);
}

// :
// ...
postalloc();
// NOP-
simpleopt();
// : , known
foreach(i, pcode) {
std::map<regnum, int > known;
if (i->known.count(i->cmd.dest))
known[physmap[i->cmd.dest]] = i->known[i->cmd.dest];
i->cmd.dest = physmap[i->cmd.dest];
if (i->has2src()) {
if (i->known.count(i->cmd.src1))
known[physmap[i->cmd.src1]] = i->known[i->cmd.src1];
if (i->known.count(i->cmd.src2))
known[physmap[i->cmd.src2]] = i->known[i->cmd.src2];
i->cmd.src1 = physmap[i->cmd.src1];
i->cmd.src2 = physmap[i->cmd.src2];
}
i->known = known;
}
break ;

Generation

All generated code will be stored in one large vector std::vector<char> code;
In the struct commandn we need some new fields. int offset, length; will determine the position of the command in the code vector. In addition, for JZ and ECHO commands, the offset value in which is unknown until the end of the code generation, in the int needfixat; field int needfixat; we will keep the index of the blank offset.

So, on the first pass through the p-code, we generate the entire executable code in the vector code ; then go through the code a second time, and fill in the missing offsets. After that, we output all code and all lines to the result.

I tried to remove the most uninteresting pieces from the listing in order to cut it a little. The full compiler code is posted on tyomitch.net.ru/jsk.y.natv.html

// : x86/x64
const char regsize = sizeof ( void *); // 4 8
//
if (regsize== 4 ) { // PUSH EBP / MOV EBP, [ESP+8] / PUSH EDI / PUSH ESI / PUSH EBX
code.push_back( 0x55 ); code.push_back( 0x8b ); code.push_back( 0x6c ); code.push_back( 0x24 );
code.push_back( 0x08 ); code.push_back( 0x57 ); code.push_back( 0x56 ); code.push_back( 0x53 );
} else { // PUSH EBP / MOV RBP, RDI / PUSH EBX
code.push_back( 0x55 ); code.push_back( 0x48 ); code.push_back( 0x8b ); code.push_back( 0xef );
code.push_back( 0x53 );
}
foreach(i, pcode) {
int imm, doffset, joffset;
bool jshort, saveAX, saveDX;
i->offset = code.size();
switch (i->cmd.opcode) {
case command::hlt:
if (regsize== 4 ) {
i->emit( 0x5b , 0x5e , 0x5f , 0x5d ); // POP EBX / POP ESI / POP EDI / POP EBP
i->emit( 0xc2 , 2 , 0 ); // RET 2
} else
i->emit( 0x5b , 0x5d , 0xc3 ); // POP EBX / POP EBP / RET
break ;
case command::store: // MOV [EBP+18+(imm-1)*4], dst
doffset = 0x18 +(i->cmd.imm- 1 )* 4 ;
if (i->known.count(i->cmd.dest)) {
i->emit( 0xc7 );
if (doffset< 128 )
i->emit( 0x45 , ( char )doffset);
else {
i->emit14( 0x85 , doffset);
}
i->emit4(i->known[i->cmd.dest]);
} else {
i->emit( 0x89 );
if (doffset< 128 )
i->emit( 0x45 |(i->cmd.dest- 1 )<< 3 , ( char )doffset);
else
i->emit14( 0x85 |(i->cmd.dest- 1 )<< 3 , doffset);
}
break ;
case command::jz:
joffset = i->tgt->offset - (i->offset+ 2 ); // JMP SHORT
jshort = i->tgt->offset && (joffset>=- 128 ); //
if (!i->cmd.dest) { // JMP off
if (jshort)
i->emit( 0xeb , ( char )joffset);
else {
i->emit( 0xe9 );
if (i->tgt->offset) //
i->emit4(joffset- 3 );
else {
i->needfixat = code.size();
i->emit4( 0 );
}
}
break ;
}
if (jshort && (i->cmd.dest== 2 )) { // JECXZ
i->emit( 0xe3 , ( char )joffset);
break ;
}
// OR dst, dst / JZ off
i->emit( 0x0b , 0xc0 |(i->cmd.dest- 1 )<< 3 |(i->cmd.dest- 1 ));
if (jshort && (joffset>=- 126 ))
i->emit( 0x74 , ( char )(joffset- 2 ));
else {
i->emit( 0x0f , 0x84 );
if (i->tgt->offset) //
i->emit4(joffset- 6 );
else {
i->needfixat = code.size();
i->emit4( 0 );
}
}
break ;
case command::echo: // PUSH live / PUSH dst / CALL [EBP+?] / ADD ESP, 4 / POP live
foreach(rp, i->onexitp) if (*rp!= 4 ) i->emit( 0x50 |(*rp- 1 ));
if (!i->cmd.dest) { // imm, [EBP+8]
if (regsize== 4 ) i->emit( 0x68 ); else i->emit( 0xbf );
i->needfixat = code.size();
i->emit4( 0 );
i->emit( 0xff , 0x55 , 2 *regsize);
} else {
if (i->known.count(i->cmd.dest)) { // imm, [EBP+4]
imm = i->known[i->cmd.dest];
if (regsize== 8 )
i->emit14( 0xbf , imm);
else if ((imm>=- 128 )&&(imm< 128 ))
i->emit( 0x6a , ( char )imm);
else
i->emit14( 0x68 , imm);
} else if (regsize== 4 ) // dst, [EBP+4]
i->emit( 0x50 |(i->cmd.dest- 1 ));
else
i->emit( 0x8b , 0xf8 |(i->cmd.dest- 1 ));
i->emit( 0xff , 0x55 , regsize);
}
if (regsize== 4 ) i->emit( 0x83 , 0xc4 , 4 );
foreachr(rp, i->onexitp) if (*rp!= 4 ) i->emit( 0x58 |(*rp- 1 ));
break ;
case command::mov: // MOV dst, imm
if (i->cmd.imm)
i->emit14( 0xb8 |(i->cmd.dest- 1 ), i->cmd.imm);
else // XOR dst, dst
i->emit( 0x33 , 0xc0 |(i->cmd.dest- 1 )<< 3 |(i->cmd.dest- 1 ));
break ;
case command::load: // MOV dst, [EBP+(3+i)*4]
doffset = 0x18 +(i->cmd.imm- 1 )* 4 ;
i->emit( 0x8b );
if (doffset< 128 )
i->emit( 0x45 |(i->cmd.dest- 1 )<< 3 , ( char )doffset);
else
i->emit14( 0x85 |(i->cmd.dest- 1 )<< 3 , doffset);
break ;
case command::input: // PUSH live / CALL [EBP+0] / XCHG EAX, dst / POP live
foreach(rp, i->onenterp) if (*rp!= 4 ) i->emit( 0x50 |(*rp- 1 ));
i->emit( 0xff , 0x55 , 0 );
if (i->cmd.dest!= 1 ) i->emit( 0x90 |(i->cmd.dest- 1 ));
foreachr(rp, i->onenterp) if (*rp!= 4 ) i->emit( 0x58 |(*rp- 1 ));
break ;
case command::add:
// 1 ; src2
if (i->known.count(i->cmd.src1) || (i->cmd.src2==i->cmd.dest))
std::swap(i->cmd.src1, i->cmd.src2);
if (!i->cmd.src2) // MOV dst, src1
i->emit( 0x8b , 0xc0 |(i->cmd.dest- 1 )<< 3 |(i->cmd.src1- 1 ));
else if (i->known.count(i->cmd.src2)) {
imm = i->known[i->cmd.src2];
addimm :
if (i->cmd.dest==i->cmd.src1) // ADD dst, imm
if (imm== 1 ) // INC dst
i->emit( 0xff , 0xc0 |(i->cmd.dest- 1 ));
else if (imm==- 1 ) // DEC dst
i->emit( 0xff , 0xc8 |(i->cmd.dest- 1 ));
else if ((imm>=- 128 )&&(imm< 128 ))
i->emit( 0x83 , 0xc0 |(i->cmd.dest- 1 ), ( char )imm);
else // for imm=128 we might use SUB Edst, -128
i->emit114( 0x81 , 0xc0 |(i->cmd.dest- 1 ), imm);
else // LEA dst, [src1+imm]
if ((imm>=- 128 )&&(imm< 128 ))
i->emit( 0x8d , 0x40 |(i->cmd.dest- 1 )<< 3 |(i->cmd.src1- 1 ), ( char )imm);
else
i->emit114( 0x8d , 0x80 |(i->cmd.dest- 1 )<< 3 |(i->cmd.src1- 1 ), imm);
} else if (i->cmd.dest==i->cmd.src1) // ADD dst, src2
i->emit( 0x03 , 0xc0 |(i->cmd.dest- 1 )<< 3 |(i->cmd.src2- 1 ));
else // LEA dst, [src1+src2]
i->emit( 0x8d , (i->cmd.dest- 1 )<< 3 | 4 , (i->cmd.src1- 1 )<< 3 |(i->cmd.src2- 1 ));
break ;
case command::sub: // ...
case command::mul: // ...
case command::div:
if (i->known.count(i->cmd.src2) && i->known[i->cmd.src2]== 2 ) {
if (i->cmd.dest!=i->cmd.src1) // MOV dst, src1 / SAR dst, 1
i->emit( 0x8b , 0xc0 |(i->cmd.dest- 1 )<< 3 |(i->cmd.src1- 1 ));
i->emit( 0xd1 , 0xf8 |(i->cmd.dest- 1 ));
break ;
}
// EAX,EDX;
// EDI,ESI,
// MOV EDI, EAX / MOV EAX, src1 / MOV ESI, EDX / XOR EDX, EDX
// IDIV src2 / XCHG dst, EAX / MOV EDX, ESI / MOV EAX, EDI
saveAX = i->onexitp.count((physreg) 1 ) && (i->cmd.dest!= 1 );
saveDX = i->onexitp.count((physreg) 3 ) && (i->cmd.dest!= 3 );
if (saveAX || (i->cmd.src2== 1 )) i->emit( 0x8b , 0xf8 );
if (i->cmd.src1!= 1 )
if (i->known.count(i->cmd.src1))
i->emit14( 0xb8 , i->known[i->cmd.src1]);
else
i->emit( 0x8b , 0xc0 |(i->cmd.src1- 1 ));
if (saveDX || (i->cmd.src2== 3 )) i->emit( 0x8b , 0xf2 );
i->emit( 0x33 , 0xd2 );
if (i->cmd.src2== 1 ) // EDI
i->emit( 0xf7 , 0xff );
else if (i->cmd.src2== 3 ) // ESI
i->emit( 0xf7 , 0xfe );
else
i->emit( 0xf7 , 0xf8 |(i->cmd.src2- 1 ));
if (i->cmd.dest!= 1 )
i->emit( 0x90 |(i->cmd.dest- 1 ));
if (saveDX) i->emit( 0x8b , 0xd6 );
if (saveAX) i->emit( 0x8b , 0xc7 );
break ;
case command::eq: case command::ne: case command::ge:
case command::le: case command::gt: case command::lt:
// 1 ; src2
if (i->known.count(i->cmd.src1)) {
std::swap(i->cmd.src1, i->cmd.src2);
switch (i->cmd.opcode) {
case command::ge: i->cmd.opcode = command::le; break ;
case command::le: i->cmd.opcode = command::ge; break ;
case command::gt: i->cmd.opcode = command::lt; break ;
case command::lt: i->cmd.opcode = command::gt; break ;
}
}
// CMP src1, src2 / SETcc dstL / MOVZX dst, dstL
if (i->known.count(i->cmd.src2)) {
imm = i->known[i->cmd.src2];
if ((imm>=- 128 )&&(imm< 128 ))
i->emit( 0x83 , 0xf8 |(i->cmd.src1- 1 ), ( char )imm);
else if (i->cmd.src1== 1 )
i->emit14( 0x3d , imm);
else
i->emit114( 0x81 , 0xf8 |(i->cmd.src1- 1 ), imm);
} else
i->emit( 0x3b , 0xc0 |(i->cmd.src1- 1 )<< 3 |(i->cmd.src2- 1 ));
i->emit( 0x0f );
switch (i->cmd.opcode) {
case command::eq: i->emit( 0x94 ); break ;
case command::ne: i->emit( 0x95 ); break ;
// ...
}
i->emit( 0xc0 |(i->cmd.dest- 1 ));
i->emit( 0x0f , 0xb6 , 0xc0 |(i->cmd.dest- 1 )<< 3 |(i->cmd.dest- 1 ));
break ;
default :
yyerror( "unknown opcode" );
}
}
// --
int offset = code.size();
std::vector< int > offsets(strings.size()+ 1 );
foreach(i, strings) {
offsets[i->second] = offset;
offset += i->first.length();
offset++;
}
// :
foreach(i, pcode) if (i->needfixat)
if (i->cmd.opcode==command::jz)
i->fix4(i->tgt->offset-(i->offset+i->length));
else if (i->cmd.opcode==command::echo)
i->fix4(offsets[i->cmd.imm]);

write( 1 , &*code.begin(), code.size()); //
foreach(i, strings) //
write( 1 , i->first.c_str(), i->first.length()+ 1 );

What happened?

Compile the compiler, compile a test program for them, compile the loader, and run:

[tyomitch@home ~]$ bison jsk.y
[tyomitch@home ~]$ c++ jsk.tab.c lex.yy.c -o jskc
[tyomitch@home ~]$ ./jskc < test.jsk > code
[tyomitch@home ~]$ cc jskld.c -o jskld
[tyomitch@home ~]$ ./jskld code
0 1000,
500? (1=, 2=, 3=) 1
249? (1=, 2=, 3=) 3
! !

I wonder what our program was compiled into?

 00: 55 push rbp
 01: 48 8b ef mov rbp, rdi
 04: 53 push rbx
 05: 33 c0 xor eax, eax
 07: b9 e8 03 00 00 mov ecx, 0x3e8
 0c: 50 push rax
 0d: 51 push rcx
 0e: bf 80 01 00 00 mov edi, 0x180
 13: ff 55 10 call qword ptr [rbp + 0x10]
 16: 59 pop rcx
 17: 58 pop rax
 18:50 push rax
 19: 51 push rcx
 1a: bf 00 00 00 00 mov edi, 0x0
 1f: ff 55 08 call qword ptr [rbp + 0x8]
 22: 59 pop rcx
 23: 58 pop rax
 24:50 push rax
 25: 51 push rcx
 26: bf fa 00 00 00 mov edi, 0xfa
 2b: ff 55 10 call qword ptr [rbp + 0x10]
 2e: 59 pop rcx
 2f: 58 pop rax
 30:50 push rax
 31: 51 push rcx
 32: bf e8 03 00 00 mov edi, 0x3e8
 37: ff 55 08 call qword ptr [rbp + 0x8]
 3a: 59 pop rcx
 3b: 58 pop rax
 3c: 50 push rax
 3d: 51 push rcx
 3e: bf 01 01 00 00 mov edi, 0x101
 43: ff 55 10 call qword ptr [rbp + 0x10]
 46: 59 pop rcx
 47: 58 pop rax
 48: 3b c1 cmp eax, ecx
 4a: 0f 9e c2 setle dl
 4d: 0f b6 d2 movzx edx, dl
 50: 0b d2 or edx, edx
 52: 0f 84 97 00 00 00 je 0xef
 58: 8d 14 01 lea edx, [rcx + rax * 1]
 5b: d1 fa sar edx, 1
 5d: 50 push rax
 5e: 51 push rcx
 5f: 52 push rdx
 60: bf bc 01 00 00 mov edi, 0x1bc
 65: ff 55 10 call qword ptr [rbp + 0x10]
 68: 5a pop rdx
 69: 59 pop rcx
 6a: 58 pop rax
 6b: 50 push rax
 6c: 51 push rcx
 6d: 52 push rdx
 6e: 8b fa mov edi, edx
 70: ff 55 08 call qword ptr [rbp + 0x8]
 73: 5a pop rdx
 74: 59 pop rcx
 75: 58 pop rax
 76: 50 push rax
 77: 51 push rcx
 78: 52 push rdx
 79: bf 26 01 00 00 mov edi, 0x126
 7e: ff 55 10 call qword ptr [rbp + 0x10]
 81: 5a pop rdx
 82: 59 pop rcx
 83: 58 pop rax
 84: 50 push rax
 85: 51 push rcx
 86: 52 push rdx
 87: ff 55 00 call qword ptr [rbp + 0x0]
 8a: 93 xchg ebx, eax
 8b: 5a pop rdx
 8c: 59 pop rcx
 8d: 58 pop rax
 8e: 89 45 18 mov dword ptr [rbp + 0x18], eax
 91: 83 fb 01 cmp ebx, 0x1
 94: 0f 94 c0 sete al
 97: 0f b6 c0 movzx eax, al
 9a: 0b c0 or eax, eax
 9c: 0f 84 08 00 00 00 je 0xaa
 a2: 8b 45 18 mov eax, dword ptr [rbp + 0x18]
 a5: 8d 4a ff lea ecx, [rdx-0x1]
 a8: eb 9e jmp 0x48
 aa: 83 fb 02 cmp ebx, 0x2
 ad: 0f 94 c0 sete al
 b0: 0f b6 c0 movzx eax, al
 b3: 0b c0 or eax, eax
 b5: 0f 84 04 00 00 00 je 0xbf
 bb: 03 c2 add eax, edx
 bd: eb 89 jmp 0x48
 bf: 8b 45 18 mov eax, dword ptr [rbp + 0x18]
 c2: 83 fb 03 cmp ebx, 0x3
 c5: 0f 94 c2 sete dl
 c8: 0f b6 d2 movzx edx, dl
 cb: 0b d2 or edx, edx
 cd: 0f 84 0b 00 00 00 je 0xde
 d3: bf a0 01 00 00 mov edi, 0x1a0
 d8: ff 55 10 call qword ptr [rbp + 0x10]
 db: 5b pop rbx
 dc: 5d pop rbp
 dd: c3 ret
 de: 50 push rax
 df: 51 push rcx
 e0: bf c4 01 00 00 mov edi, 0x1c4
 e5: ff 55 10 call qword ptr [rbp + 0x10]
 e8: 59 pop rcx
 e9: 58 pop rax
 ea: e9 59 ff ff ff jmp 0x48
 ef: bf 59 01 00 00 mov edi, 0x159
 f4: ff 55 10 call qword ptr [rbp + 0x10]
 f7: 5b pop rbx
 f8: 5d pop rbp
 f9: c3 ret

The naked eye can see at least two actual optimizations that need the “eye” of more than one command:

when the command ends in POP reg , and the next begins in PUSH reg , both of these instructions can be deleted;
the SETcc regl / MOVZX reg, regl / OR reg, reg / JZ offset combination can be replaced with one JNcc offset instruction if reg not alive at the JZ output.

This "optimization in the big eye" and we will do next time. The more important remaining task is to generate a real ELF so that a homemade loader is not needed.

Source: https://habr.com/ru/post/103402/

All Articles