For many years now, a person has not needed to know the structure of machine commands to call himself a programmer. Naturally, this was not always so. Before the first assemblers appeared, programming was done directly in machine code: hard work, accompanied by a large number of errors. Modern assemblers let you abstract (to a reasonable degree) from the hardware and from how commands are encoded. Compilers of high-level languages go further still. They are striking both in the complexity of their implementation and in the ease with which they let the programmer turn source code into a sequence of machine instructions (and do so reasonably optimally). All that is required of the programmer is knowledge of a favorite language and IDE. Knowing what the source listing translates into is not necessary.
Those interested in a brief description of how machine instructions are encoded, an example implementation, and the source code of a disassembler for the x86 architecture are welcome to read on.
Creating a disassembler for the x86 architecture is not a very difficult task, but it is quite a specific one. It requires a particular kind of knowledge: how a microprocessor interprets a sequence of bytes as machine code. Not every university provides this knowledge in enough depth to write a fully functional modern disassembler, so you have to look for it yourself (usually in English). This post does not claim to cover the problem of creating a disassembler completely; it only briefly describes how a disassembler was written for the x86 architecture in 32-bit command execution mode. I should also note that some concepts from the official specification may be translated inaccurately.
Command structure for Intel x86
The command structure is as follows:
• Optional prefixes (each prefix is 1 byte in size)
• Mandatory command opcode (1 or 2 bytes)
• Optional Mod_R/M byte, which defines the structure of the command operands
• Optional bytes occupied by the command operands (these may comprise one SIB [Scale, Index, Base] byte, a displacement, and an immediate value field)
Prefixes
The following prefixes exist:
The first six change the segment register used by the command when accessing a memory cell.
• 0x26 - ES segment override prefix
• 0x2E - CS segment override prefix
• 0x36 - SS segment override prefix
• 0x3E - DS segment override prefix
• 0x64 - FS segment override prefix
• 0x65 - GS segment override prefix
• 0x0F - prefix for additional commands (sometimes it is not considered a real prefix; in that case the opcode of the command is considered to consist of two bytes, the first of which is 0x0F)
• 0x66 - operand size override prefix (for example, ax will be used instead of eax)
• 0x67 - address size override prefix (see below)
• 0x9B - wait prefix (WAIT)
• 0xF0 - LOCK prefix (used to implement synchronization in multi-threaded applications)
• 0xF2 - REPNZ repeat prefix, used when working with byte sequences (strings)
• 0xF3 - REP repeat prefix, used when working with byte sequences (strings)
Each of these prefixes changes the semantics and / or structure of the machine instruction (for example, its length or the choice of mnemonics).
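Collecting the prefixes can be sketched in a few lines of C++. This is only an illustration and not the code from the downloadable source: the names isPrefix, PrefixScan and scanPrefixes are hypothetical, and the sketch assumes that 0x0F is treated as the first byte of a two-byte opcode rather than as a prefix.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if the byte is one of the prefixes listed above.
// Assumption: 0x0F is excluded and handled later as an opcode escape byte.
inline bool isPrefix(std::uint8_t b) {
    switch (b) {
        case 0x26: case 0x2E: case 0x36:
        case 0x3E: case 0x64: case 0x65:  // segment overrides
        case 0x66: case 0x67:             // operand / address size overrides
        case 0x9B:                        // WAIT
        case 0xF0:                        // LOCK
        case 0xF2: case 0xF3:             // REPNZ / REP
            return true;
        default:
            return false;
    }
}

struct PrefixScan {
    std::vector<std::uint8_t> prefixes;  // prefixes in the order they appeared
    std::size_t opcodeOffset;            // index of the first non-prefix byte
};

// Walks the buffer from the start and stops at the first non-prefix byte.
inline PrefixScan scanPrefixes(const std::uint8_t* code, std::size_t size) {
    PrefixScan result{{}, 0};
    while (result.opcodeOffset < size && isPrefix(code[result.opcodeOffset])) {
        result.prefixes.push_back(code[result.opcodeOffset]);
        ++result.opcodeOffset;
    }
    return result;
}
```

For the bytes 66 2E 89 C8, for example, the scan collects 0x66 and 0x2E and reports that the opcode starts at offset 2.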
Command opcodes
The opcode, sometimes alone and sometimes together with the prefix(es), uniquely identifies the mnemonic (name) of the command. There are many commands, and as modern microprocessors grow more complex their number does not decrease: new commands appear, and obsolete ones do not disappear (backward compatibility). The list of opcodes and the commands associated with them can, as a rule, be downloaded from the official websites of microprocessor manufacturers.
The Mod_R/M byte consists of the following fields:
• Mod - the two most significant bits, 7-6 (value from 0 to 3)
• Reg/Opcode - the next three bits, 5-3 (value from 0 to 7): a register operand or an extension of the opcode
• R/M - the three least significant bits, 2-0 (value from 0 to 7)
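Splitting the Mod_R/M byte takes only shifts and masks. The sketch below follows the standard x86 layout (Mod in bits 7-6, Reg/Opcode in bits 5-3, R/M in bits 2-0); the ModRM struct and the splitModRM function are hypothetical names used for illustration.

```cpp
#include <cstdint>

struct ModRM {
    std::uint8_t mod;  // bits 7..6: addressing form (3 means register operand)
    std::uint8_t reg;  // bits 5..3: register number or opcode extension
    std::uint8_t rm;   // bits 2..0: register or memory operand selector
};

// Splits a Mod_R/M byte into its three fields.
inline ModRM splitModRM(std::uint8_t b) {
    return ModRM{static_cast<std::uint8_t>(b >> 6),
                 static_cast<std::uint8_t>((b >> 3) & 7),
                 static_cast<std::uint8_t>(b & 7)};
}
```

For example, the byte 0xC8 (binary 11 001 000) splits into mod = 3, reg = 1, rm = 0.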
Implementation:
To write the disassembler, we will use the following page:
http://ref.x86asm.net/geek32.html
The page contains several tables. In essence, these tables and the description of their fields are all that is needed to write the disassembler, apart from some logical reasoning and free time.
The first table lists the machine commands that do not contain the 0x0F prefix. The second lists the commands that do contain it (most of these appeared in microprocessors of the “Pentium with MMX” family or later).
The next three tables convert the Mod_R/M byte into a sequence of command operands for the 32-bit addressing mode. Each subsequent table of the three refines the parsing of the Mod_R/M byte for special cases of the previous one.
The last table converts the Mod_R/M byte into a sequence of command operands for the 16-bit addressing mode. By default, a command is assumed to use 32-bit addressing; the address size override prefix (0x67) switches it to 16-bit addressing.
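As an illustration of what these tables encode, here is a sketch that computes how many bytes the Mod_R/M byte, the optional SIB byte and the displacement occupy together. It follows the standard IA-32 encoding rules as I understand them, not the code from the downloadable source; modrmLength is a hypothetical name, and the addr16 flag is assumed to be set by the caller when the 0x67 prefix was seen.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the number of bytes taken by Mod_R/M + optional SIB + displacement,
// or 0 if the buffer is too short. p points at the Mod_R/M byte.
inline std::size_t modrmLength(const std::uint8_t* p, std::size_t size, bool addr16) {
    if (size < 1) return 0;
    const unsigned mod = p[0] >> 6;
    const unsigned rm = p[0] & 7;
    std::size_t len = 1;
    if (mod == 3) return len;               // register operand, nothing follows
    if (addr16) {
        if (mod == 1) len += 1;             // disp8
        else if (mod == 2 || rm == 6) len += 2;  // disp16 (mod 0 with rm 6 is a bare disp16)
    } else {
        bool sibBase5 = false;
        if (rm == 4) {                      // a SIB byte follows
            if (size < 2) return 0;
            sibBase5 = (p[1] & 7) == 5;
            len += 1;
        }
        if (mod == 1) len += 1;             // disp8
        else if (mod == 2) len += 4;        // disp32
        else if (rm == 5 || sibBase5) len += 4;  // mod 0 special cases: disp32
    }
    return len <= size ? len : 0;
}
```

The special cases in the mod = 0 branch are exactly the kind of detail that the successive tables on the reference page spell out.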
The first thing to do is to move the first two tables into data structures that are convenient to work with. The same site offers XML versions of these tables, which can be converted into tidy C structures. I did it differently: I loaded the HTML tables into Excel and, with a simple VBA script, generated source code that, after manual corrections, became the required data structures.
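A data structure of this kind might look roughly like the sketch below. It is an assumption about a reasonable layout, not the structures from the downloadable source: the Operand enum, the OpcodeEntry struct, kTable and findEntry are all hypothetical, and the table holds only four real encodings for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical operand kinds; the real tables define over a hundred.
enum class Operand : std::uint8_t { None, RegMem32, Reg32, Imm8, Imm32 };

struct OpcodeEntry {
    bool twoByte;         // true if the opcode follows the 0x0F byte
    std::uint8_t opcode;  // opcode byte
    const char* mnemonic; // command name
    Operand operands[3];  // up to three operands, padded with None
};

// A few real encodings, just enough to show the layout.
constexpr OpcodeEntry kTable[] = {
    {false, 0x89, "mov",  {Operand::RegMem32, Operand::Reg32, Operand::None}},
    {false, 0x90, "nop",  {Operand::None, Operand::None, Operand::None}},
    {false, 0xC3, "ret",  {Operand::None, Operand::None, Operand::None}},
    {true,  0xAF, "imul", {Operand::Reg32, Operand::RegMem32, Operand::None}},
};

// Linear search is enough for a sketch; a full table would be indexed by opcode.
inline const OpcodeEntry* findEntry(bool twoByte, std::uint8_t opcode) {
    for (const OpcodeEntry& e : kTable)
        if (e.twoByte == twoByte && e.opcode == opcode) return &e;
    return nullptr;
}
```

A generated table would also carry the other fields from the reference page, such as the processor generation and the affected flags.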
The disassembly algorithm itself is quite simple:
• The list of prefixes used in the current machine instruction is collected.
• The matching entry is looked up in one of the two tables, depending on the opcode, the prefixes, and the generation (model) of the target microprocessor.
• The entry found carries a list of fields, such as the generation (model) of the microprocessor that first supported the command or the list of flags the command can change. We are mainly interested in the mnemonic (name) of the command and its list of operands. After analyzing all the operands and the fields of the Mod_R/M byte, we obtain the textual representation and the length of the command.
A command can have from zero to three operands. The source tables contain over a hundred operand types. Some are duplicates: they have different names, but the sequence of actions for processing the Mod_R/M byte (and possibly the bytes that follow) is the same.
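The steps above can be sketched as a toy decoder that recognizes only a handful of commands. It is a deliberately tiny illustration, not the table-driven implementation from the downloadable source: decodeOne, Decoded and readLe32 are hypothetical names, only the 0x66 prefix is handled, and the opcodes are hard-coded instead of being looked up in generated tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

struct Decoded {
    std::string text;    // textual representation, empty if not recognized
    std::size_t length;  // bytes consumed, 0 on failure
};

// Reads a little-endian 32-bit value.
inline std::uint32_t readLe32(const std::uint8_t* p) {
    return p[0] | (p[1] << 8) | (p[2] << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

inline Decoded decodeOne(const std::uint8_t* code, std::size_t size) {
    static const char* const kReg32[8] = {"eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"};
    static const char* const kReg16[8] = {"ax", "cx", "dx", "bx", "sp", "bp", "si", "di"};
    std::size_t i = 0;
    bool opSize16 = false;
    // Step 1: collect prefixes (this sketch knows only 0x66).
    while (i < size && code[i] == 0x66) { opSize16 = true; ++i; }
    if (i >= size) return {"", 0};
    const char* const* regs = opSize16 ? kReg16 : kReg32;
    // Step 2: identify the command by its opcode.
    const std::uint8_t op = code[i++];
    // Step 3: decode the operands and compute the length.
    if (op == 0x90) return {"nop", i};
    if (op == 0xC3) return {"ret", i};
    if (op >= 0x50 && op <= 0x57)                 // push r: register in the low 3 bits
        return {std::string("push ") + regs[op & 7], i};
    if (op >= 0xB8 && op <= 0xBF) {               // mov r, imm
        const std::size_t immSize = opSize16 ? 2 : 4;
        if (size - i < immSize) return {"", 0};
        const std::uint32_t imm = opSize16
            ? static_cast<std::uint32_t>(code[i] | (code[i + 1] << 8))
            : readLe32(code + i);
        char buf[32];
        std::snprintf(buf, sizeof buf, "mov %s, 0x%X", regs[op & 7], static_cast<unsigned>(imm));
        return {buf, i + immSize};
    }
    if (op == 0x89) {                             // mov r/m, r: register form (mod == 3) only
        if (i >= size || (code[i] >> 6) != 3) return {"", 0};
        const std::uint8_t m = code[i++];
        return {std::string("mov ") + regs[m & 7] + ", " + regs[(m >> 3) & 7], i};
    }
    return {"", 0};                               // not covered by this sketch
}
```

Calling decodeOne repeatedly and advancing by the returned length walks through a code buffer one command at a time; a length of 0 means the sketch does not know the command.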
To see how the various operands are processed, along with an example of disassembling the simplest “Hello world” function, you can download the disassembler source code for the C++ Builder 6 compiler.
PS:
It is far from certain that anyone who reads this post will ever need the information in it (few people write disassemblers), but in any case this disassembler has been tested and even included in a fairly large commercial project, and its source code is open and freely distributed.