
In this article I want to talk about how the creators of simulators achieve maximum performance of processor models without sacrificing the flexibility and extensibility of the complete solution. In short, the answer is the coexistence of several engines, each of whose best qualities is exploited at a different stage of the simulation.
The content of this note is based on my experience in developing functional simulators, as well as on publications and technical articles describing various simulators and virtual machines: Wind River Simics, VMware, QEMU, Bochs, and others. The word "functional" in the context of this article means that the accuracy of the models is limited to the level of the instruction set architecture (ISA).
Interpretation
An "interpreter" is someone (or something) that translates text or speech from one language to another "on the fly". The meaning of this term is approximately preserved when it is transferred to the field of computing. The original implementations of many programming languages (Basic, Unix shell, Rexx, PHP, etc.) were interpreters. Their characteristic feature is processing one statement of the input-language program at a time: the next statement is converted (interpreted) only when the need arises. For this reason the term "interpreter" is usually contrasted with "translator" and "compiler", which process larger units of the input language: procedures, files, modules, etc.
Work cycle
The algorithm of a simulator based on interpretation is an infinite loop (for simplicity I omit here the details related to interrupt handling, events in peripheral devices, etc.) whose structure resembles the pipeline of real processors.
On the number of stages in the pipeline: note that the pipeline lengths of real processors vary greatly. Simple microcontrollers, for example PIC, may have only two stages. The last generations of the Intel Pentium 4 reached 31 stages. There are truly monstrous designs as well: the Xelerated X10q network processor has 1040 (!) pipeline stages.
During one iteration of the cycle, the simulator executes one instruction, sequentially performing the following steps.
- Fetch: reading the machine code from memory.
- Decode: determining the operation encoded in the instruction, as well as its arguments, the operands.
- Execute: performing the operation on the operands.
- Writeback: writing the results to registers or memory.
- Advance PC: advancing the program counter (PC) to the next instruction.
Let us consider each stage in detail.
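As an illustration, the whole cycle can be sketched in a few dozen lines of C. Everything here (the toy ISA, its opcodes, the register file) is invented for the example; real models differ mainly in the volume of the Decode and Execute stages.

```c
#include <stdint.h>

/* A minimal switched interpreter for a hypothetical toy ISA.
 * Encoding: 1-byte opcode, then up to two 1-byte operands. */

enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2, OP_JMP = 3 };

typedef struct {
    uint32_t pc;
    uint32_t regs[4];
    uint8_t  mem[256];   /* unified guest memory holding code and data */
    int      halted;
} Cpu;

static void step(Cpu *c)
{
    uint8_t opcode = c->mem[c->pc];           /* Fetch  */
    uint8_t a = c->mem[c->pc + 1];            /* Decode */
    uint8_t b = c->mem[c->pc + 2];

    switch (opcode) {                         /* Execute + Writeback */
    case OP_LOADI: c->regs[a] = b;            c->pc += 3; break;
    case OP_ADD:   c->regs[a] += c->regs[b];  c->pc += 3; break;
    case OP_JMP:   c->pc = a;                             break; /* absolute */
    case OP_HALT:  c->halted = 1;                         break;
    default:       c->halted = 1;                         break; /* would raise an exception */
    }                                         /* Advance PC is done per case */
}

static void run(Cpu *c)
{
    while (!c->halted)
        step(c);
}

/* Convenience: run a tiny program (r0 = 5; r1 = 7; r0 += r1) and return r0. */
static uint32_t interp_demo(void)
{
    Cpu c = {0};
    static const uint8_t prog[] = {
        OP_LOADI, 0, 5,
        OP_LOADI, 1, 7,
        OP_ADD,   0, 1,
        OP_HALT
    };
    for (unsigned i = 0; i < sizeof prog; i++) c.mem[i] = prog[i];
    run(&c);
    return c.regs[0];
}
```

Note that each case both simulates the instruction's function and advances the PC, exactly as the stage list above prescribes.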
Fetching instructions from memory
In real computers, RAM is located quite far (in more ways than one) from the processor, and it stores both code and data. The Fetch stage must deliver a portion of machine code to the processor for further processing. Just as real microprocessors place a faster instruction cache in front of memory, the model can cache previously fetched instructions. And, again as in real systems, a memory read may well end with an exception that has to be simulated.
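A sketch of such an instruction cache: direct-mapped and storing pre-decoded instructions, so that a hit skips both the memory read and the decoder. All sizes and the DecodedInsn layout are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

#define ICACHE_SIZE 1024          /* entries; must be a power of two */

typedef struct {
    uint8_t opcode;
    uint8_t op1, op2;
    uint8_t length;               /* how far to advance the PC */
} DecodedInsn;

typedef struct {
    uint64_t      tag[ICACHE_SIZE];   /* guest PC that owns the slot */
    DecodedInsn   insn[ICACHE_SIZE];
    int           valid[ICACHE_SIZE];
    unsigned long hits, misses;
} ICache;

/* Look up a decoded instruction; on a miss the caller decodes and fills. */
static DecodedInsn *icache_lookup(ICache *ic, uint64_t pc)
{
    unsigned idx = (unsigned)(pc & (ICACHE_SIZE - 1));
    if (ic->valid[idx] && ic->tag[idx] == pc) {
        ic->hits++;
        return &ic->insn[idx];
    }
    ic->misses++;
    return NULL;
}

static void icache_fill(ICache *ic, uint64_t pc, DecodedInsn d)
{
    unsigned idx = (unsigned)(pc & (ICACHE_SIZE - 1));
    ic->tag[idx]   = pc;
    ic->insn[idx]  = d;
    ic->valid[idx] = 1;
}

/* Stores into guest memory must invalidate overlapping entries,
 * otherwise stale decodes survive self-modifying code. */
static void icache_invalidate(ICache *ic, uint64_t pc)
{
    unsigned idx = (unsigned)(pc & (ICACHE_SIZE - 1));
    if (ic->valid[idx] && ic->tag[idx] == pc)
        ic->valid[idx] = 0;
}

static int icache_demo(void)
{
    ICache ic;
    memset(&ic, 0, sizeof ic);
    DecodedInsn d = { 1, 0, 5, 3 };
    icache_lookup(&ic, 0x100);       /* cold miss             */
    icache_fill(&ic, 0x100, d);
    icache_lookup(&ic, 0x100);       /* hit                   */
    icache_invalidate(&ic, 0x100);   /* e.g. guest wrote here */
    icache_lookup(&ic, 0x100);       /* miss again            */
    return ic.hits == 1 && ic.misses == 2;
}
```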
Decoding
The task of a decoder, whether hardware or simulated, is to isolate an individual instruction in the sequence of bytes received from memory and figure out what it means.
Writing the decoder of a model is not the easiest thing. The creator of a simulator has to deal with many peculiarities of guest architectures (about one of them, inherent to IA-32, yulyugin recently wrote on Habr). The sheer number of instructions that must be distinguished can be large.
If writing a simulator begins with reading the ISA documentation, then creating the decoder within it begins with a look at the instruction encoding tables, to get an idea of the scale of the disaster. I will give a few examples of instruction encoding tables for different architectures; all of them are taken from official manuals or related materials.
For ARM [3], 32-bit instructions.
For PowerPC 3.0 [4]
For Itanium 2.3 [2]. I have always liked the vivid colorfulness of this manual!
For IA-32 [5]. A diagram showing all the fields used after the introduction of the EVEX prefix.
A frightening diagram, but it helped me understand what is going on in AVX3.
Naturally, instead of the dreary manual work of typing the contents of these tables into code, automation is often preferred: generating the code from a description written in a specialized language. For complex architectures, or when a common framework for building simulators of many different systems is needed, the cost of creating such a language and its translation tools pays off.
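To show what the manual path looks like in a simple case, here is field extraction for a hypothetical fixed-width 32-bit encoding. The bit layout below is invented and matches none of the architectures above.

```c
#include <stdint.h>

/* Invented layout of a 32-bit instruction word:
 *   bits 31..26  opcode
 *   bits 25..21  rd     (destination register)
 *   bits 20..16  rs1    (source register)
 *   bits 15..0   imm16  (sign-extended immediate) */

typedef struct {
    unsigned opcode, rd, rs1;
    int32_t  imm;
} Insn;

static Insn decode(uint32_t word)
{
    Insn i;
    i.opcode = (word >> 26) & 0x3f;
    i.rd     = (word >> 21) & 0x1f;
    i.rs1    = (word >> 16) & 0x1f;
    i.imm    = (int32_t)(int16_t)(word & 0xffff);  /* sign-extend imm16 */
    return i;
}

static int decode_demo(void)
{
    /* opcode 5, rd 3, rs1 7, imm16 = 0xfffe (-2) */
    uint32_t w = (5u << 26) | (3u << 21) | (7u << 16) | 0xfffeu;
    Insn i = decode(w);
    return i.opcode == 5 && i.rd == 3 && i.rs1 == 7 && i.imm == -2;
}
```

For a variable-length ISA such as IA-32 the same job turns into a multi-stage state machine over prefixes, opcode maps and operand bytes, which is exactly why generators pay off there.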
Execution
Execution (execute) consists of directly simulating the function of the already decoded instruction: computing the result of an arithmetic or logical operation, reading or writing memory, changing the current processor mode, or transferring control for a branch or a procedure call.
Each recognized instruction must have a corresponding service routine. In the simplest interpreter scheme the routine is chosen by the instruction's operation code (opcode) using the multi-way branch construct of the implementation language; in C this is the switch statement. The scheme is accordingly called "switched":

switch (opcode) {
case OPERATION1: do_op1(arg1, arg2); break;
case OPERATION2: do_op2(arg1, arg2); break;
case OPERATION3: do_op3(arg1, arg2); break;
...
default: do_undefined(); break;
}
There are several alternatives to the switched interpretation scheme that offer higher performance [10], but I will write about them another time.
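Just to give a taste of them: one well-known variant replaces the switch with a table of function pointers indexed by opcode, turning dispatch into a single indirect call. The three-opcode ISA and all names below are invented for the example.

```c
#include <stdint.h>

typedef struct Cpu2 { uint32_t regs[4]; } Cpu2;
typedef void (*Handler)(Cpu2 *, unsigned a, unsigned b);

static void do_loadi(Cpu2 *c, unsigned a, unsigned b) { c->regs[a] = b; }
static void do_add  (Cpu2 *c, unsigned a, unsigned b) { c->regs[a] += c->regs[b]; }
static void do_undef(Cpu2 *c, unsigned a, unsigned b) { (void)c; (void)a; (void)b; }

/* The dispatch table: index == opcode, entry == service routine. */
static Handler dispatch[256] = {
    [1] = do_loadi,
    [2] = do_add,
};

static void execute(Cpu2 *c, uint8_t opcode, unsigned a, unsigned b)
{
    Handler h = dispatch[opcode] ? dispatch[opcode] : do_undef;
    h(c, a, b);   /* one indirect call instead of a switch */
}

static uint32_t dispatch_demo(void)
{
    Cpu2 c = {{0}};
    execute(&c, 1, 0, 5);   /* r0 = 5      */
    execute(&c, 1, 1, 7);   /* r1 = 7      */
    execute(&c, 2, 0, 1);   /* r0 += r1    */
    return c.regs[0];
}
```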
As with writing a decoder by hand, writing the service routines can be tedious, especially when they all look alike. For example, an instruction set may contain families of instructions identical in meaning but differing in the width of the processed vector and its elements, or in the location of the processed data (register or memory). One is tempted to reach for generic programming, or to write a code generator that takes over the routine work.
Again, the service routines must stay in sync with the decoder: when the instruction set is edited, one must not forget to add or remove the corresponding simulation semantics. It is therefore most sensible to keep a single combined description from which the decoder, the service routines, and (for good measure) a disassembler are all generated automatically.
Writing to memory
Working with simulated memory, even at an accuracy merely sufficient for a functional model, is also nontrivial. Many things must be taken into account: the guest scheme for translating virtual addresses into physical ones, the guest system's attitude to data alignment, the byte order within a machine word (endianness), accesses to memory-mapped devices (memory-mapped input/output), page permissions for reading/writing/execution, optional segmentation (ugh!)...
Writing or reading memory can lead to an exception, which, as in the case of fetch, must be correctly simulated.
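A sketch of the load path that handles two of the listed concerns: page permissions and byte order. The value is assembled byte by byte, so a little-endian guest is simulated correctly on a host of either endianness. The page size and permission flags are invented for the example.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PERM_R 1
#define PERM_W 2

typedef struct {
    uint8_t *ram;
    uint8_t *page_perm;    /* one flag byte per guest physical page */
    int      exception;    /* set here instead of delivering a real guest fault */
} GuestMem;

static uint32_t guest_load32_le(GuestMem *m, uint64_t paddr)
{
    if (!(m->page_perm[paddr >> PAGE_SHIFT] & PERM_R)) {
        m->exception = 1;  /* a real model would raise a guest page fault */
        return 0;
    }
    /* Little-endian assembly, independent of host endianness
     * (page-crossing accesses are ignored in this sketch): */
    return (uint32_t)m->ram[paddr]
         | (uint32_t)m->ram[paddr + 1] << 8
         | (uint32_t)m->ram[paddr + 2] << 16
         | (uint32_t)m->ram[paddr + 3] << 24;
}

static int load_demo(void)
{
    static uint8_t ram[2 << PAGE_SHIFT];
    static uint8_t perm[2] = { PERM_R, 0 };   /* page 1 is unreadable */
    GuestMem m = { ram, perm, 0 };
    ram[4] = 0x78; ram[5] = 0x56; ram[6] = 0x34; ram[7] = 0x12;
    uint32_t v = guest_load32_le(&m, 4);
    int fault_before = m.exception;
    guest_load32_le(&m, 1u << PAGE_SHIFT);    /* touches page 1 -> fault */
    return v == 0x12345678 && fault_before == 0 && m.exception == 1;
}
```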
Advancing the PC
The PC (program counter) is the register storing the address of the current instruction. Different systems and books call it differently: IP / EIP / RIP, pc, and so on. For "linear" operations it only needs to be advanced by the length of the instruction just processed. For branches, procedure calls, and software interrupts it changes in a more complex way, according to their semantics.
Interim summary
An interpreter is almost always the first type of model created for a new processor architecture. For that reason there are a great many of them, and listing even the better-known ones would take a lot of space. Here are just a couple of examples: Bochs [6], Zsim [7]. I would also like to recommend the emulator FAQ by Marat Faizullin.
Binary translation
The main advantage of models based on interpretation is their simplicity; the main drawback is their low speed.
As with high-level languages, execution can be accelerated with a technique historically called translation and nowadays more often referred to as compilation. Indeed, the name of the Fortran language is an abbreviation of FORmula TRANslator.
Instead of spending time on parsing and executing each instruction every time it is encountered, one can translate a sufficiently long block of guest code into an equivalent block in the host machine language. During simulation it is then enough to transfer control to that block, and it will execute "at full speed". The idea works because, in practice, code that has been executed once is likely to be executed again soon: programs spend most of their time in relatively small loops.
In the context of simulation this technique is called binary translation (BT). The concept is very close to just-in-time (JIT) compilation.
Below I will describe one of the simple schemes used in practice.
Capsules and translation units
The input of the translation process is a group of machine instructions located nearby (for example, on the same page) and/or executed one after another (a trace). They are decoded, and then each guest instruction is matched with a block of host code called a capsule. Several capsules are assembled into larger formations called translation units. Each unit may have one or more entry points, as well as several places through which control can leave it.
What the hardware does in silicon, the simulator must do in software. For this reason the capsule is, on average, longer than the single instruction it represents.
Can a capsule be shorter? Instructions that cause no visible architectural effects during functional simulation can be represented by an empty capsule of zero host instructions, especially if the translation unit is optimized before execution. Examples: variants of the NOP (no-operation) instruction, data prefetch instructions, and memory barriers (fences).
Readers familiar with the organization of compilers will sense the similarity between what happens in BT and compilation. They are right: in BT one can distinguish the work of a front end, optimization phases, and code generation in a back end. Below I want to emphasize the differences between BT and compilation from high-level languages.
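To show the structure without getting into real code generation, here is a deliberately simplified stand-in: a genuine binary translator emits host machine code, while in this sketch each capsule is merely a host function pointer with pre-decoded operands. What it preserves is the essential property: the guest block is translated once and can then be executed repeatedly without re-decoding. The toy ISA and all structures are invented.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct TCpu { uint32_t regs[4]; uint64_t pc; } TCpu;
typedef void (*CapsuleFn)(TCpu *, uint32_t a, uint32_t b);

typedef struct {
    CapsuleFn fn;          /* host code standing in for one guest instruction */
    uint32_t  a, b;        /* pre-decoded operands */
} Capsule;

typedef struct {
    uint64_t guest_pc;     /* entry point this unit was translated from */
    Capsule  caps[32];
    size_t   ncaps;
    int      valid;        /* cleared when the guest code is overwritten */
} TransUnit;

static void cap_loadi(TCpu *c, uint32_t a, uint32_t b) { c->regs[a] = b; }
static void cap_add  (TCpu *c, uint32_t a, uint32_t b) { c->regs[a] += c->regs[b]; }

/* Executing an already-translated unit is a straight run of indirect calls,
 * with no fetch or decode work at all. */
static void tu_run(const TransUnit *tu, TCpu *c)
{
    for (size_t i = 0; i < tu->ncaps; i++)
        tu->caps[i].fn(c, tu->caps[i].a, tu->caps[i].b);
}

static uint32_t tu_demo(void)
{
    TransUnit tu = { 0x1000,
                     { {cap_loadi, 0, 5}, {cap_loadi, 1, 7}, {cap_add, 0, 1} },
                     3, 1 };
    TCpu c = {{0}, 0x1000};
    tu_run(&tu, &c);    /* first execution         */
    tu_run(&tu, &c);    /* ...and again, for free  */
    return c.regs[0];   /* r0 = 5; r1 = 7; r0 += r1 */
}
```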
Code discovery problem
Machine code contains less information about the program and its structure than the source code does. The resulting code discovery problem can be divided into several subproblems.
- In memory, data (variables, arrays, constants, strings, etc.) and the program code that processes them are stored together, with no boundaries marked between them. Binary translation of data blocks (mistaken for code!) is useless, since control will never be transferred to them, and even harmful: the time spent on it is wasted.
- In architectures that allow variable-length instructions, the address at which decoding starts is crucial: a shift of even one byte completely changes the meaning of the sequence.
- The result of decoding depends on the processor mode. For example, the ARM architecture actually has two instruction sets, the full 32-bit one and the condensed 16-bit Thumb, with transitions between them made via the BX instruction.
A special case of this problem is recalculating the relative displacements of branch instructions. The figure below shows an example in which the source code block ends with a six-byte backward jump; in the generated code, the instruction boundary that is the jump target lies in a different place.
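The displacement recalculation can be sketched as follows: the guest displacement is meaningless inside the generated block, so the translator looks up where the guest target landed in host code and computes a fresh host-relative displacement. The mapping table and the 32-bit displacement format are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t guest_pc;   /* address of a guest instruction            */
    size_t   host_off;   /* offset of its capsule in the host buffer  */
} AddrMap;

/* Find the host offset a guest address was translated to; -1 if the target
 * lies outside this unit (then the block must exit back to the simulator). */
static long map_lookup(const AddrMap *map, size_t n, uint64_t guest_pc)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].guest_pc == guest_pc)
            return (long)map[i].host_off;
    return -1;
}

/* New displacement, relative to the end of the host branch instruction. */
static int32_t patch_disp(const AddrMap *map, size_t n,
                          uint64_t guest_target, size_t host_branch_end)
{
    long t = map_lookup(map, n, guest_target);
    return (int32_t)(t - (long)host_branch_end);
}

static int32_t patch_demo(void)
{
    /* Guest instructions at 0x100, 0x103, 0x106, and their host offsets. */
    AddrMap map[] = { {0x100, 0}, {0x103, 9}, {0x106, 15} };
    /* A backward jump to guest 0x100; its host encoding ends at offset 40. */
    return patch_disp(map, 3, 0x100, 40);
}
```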
Self-modifying code
The executable program code and the data it processes reside in the same physical memory, so the machine code itself can be changed during execution. This is used, for example, by operating systems loading applications and dynamic libraries into memory, and by polymorphic viruses hiding their presence. Let us call this situation self-modifying code (SMC). For binary translation it means that translation units may become stale, i.e. rendered incorrect, when the guest code they were produced from has changed.
Have you ever had this? You write some code, make a change, launch the application to see the effect of the change, and there is no effect. You rush to investigate, tear your hair out, and then it turns out you simply forgot to recompile the application. The source code and its translation into machine code went out of sync.
During simulation one has to track all writes to memory and mark the affected translation units as stale, requiring retranslation. This operation takes time, so the model's speed on program sections with SMC will be low: execution of the generated code will be frequently interrupted for retranslation.
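The tracking itself can be sketched like this: every guest store is checked against the code ranges from which live translation units were produced, and overlapping units are marked stale. All structures are invented for the example; real simulators do this at page granularity with hardware write protection rather than with a linear scan.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t start, end;   /* guest code range [start, end) the unit covers */
    int      valid;
} TransBlock;

static void guest_store8(uint8_t *ram, TransBlock *blocks, size_t nblocks,
                         uint64_t addr, uint8_t value)
{
    ram[addr] = value;
    for (size_t i = 0; i < nblocks; i++)
        if (blocks[i].valid && addr >= blocks[i].start && addr < blocks[i].end)
            blocks[i].valid = 0;   /* stale: the guest code under it changed */
}

static int smc_demo(void)
{
    static uint8_t ram[256];
    TransBlock blocks[] = { {0x10, 0x20, 1}, {0x20, 0x30, 1} };
    guest_store8(ram, blocks, 2, 0x15, 0x90);   /* hits the first block only */
    return blocks[0].valid == 0 && blocks[1].valid == 1;
}
```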
Limited optimizations
After a translation unit has been assembled from capsules, it may turn out that it can be transformed further so that the new unit runs faster. This resembles the optimization phases of compilers. But if there is any chance that the transformed block may produce results different from the original, the transformation has to be abandoned. Since machine code usually carries less information about the algorithm than the source code did, the range of optimizations available to BT is narrower than in compilation.
To perform a number of transformations, the translation units need an intermediate representation that allows analyzing control and data flow graphs, as is done in compilers. The capsule-based BT scheme described above is not flexible enough for this.
Interim summary
More details about the binary translation techniques used for simulation can be found in a series of posts on the Intel Developer Zone: Part 1, Part 2, Part 3.
A couple of links to simulators known to me in which a binary translator is the main simulation engine: SoftSDV [8], the original implementation of QEMU [9].
Direct execution and hardware support
An important case in practice is when the guest and host architectures coincide (or almost coincide). In this case the translation can be significantly simplified: sometimes it comes down to copying the guest code as host code, or even executing it "in place", without any duplication. Such simulation modes carry the common name direct execution (DEX).
General idea
It would seem that if a program is already expressed in the machine language of the host system, it is enough to transfer control to its beginning; the hardware itself will then execute the instructions at maximum speed.
Why it doesn't work
Of course, this will not work. A guest application must not be able to detect that it is running inside a simulator, let alone affect the simulator's operation, for example by raising an exception. I will describe only some of the complications that make the naive DEX scheme untenable.
- Memory accesses. The guest address space is only a part of the simulator's memory; the data and code of the simulated system will not necessarily reside at the same addresses they occupied in reality.
- Returning control. How do you "force" the simulated application to hand control back to the simulator?
- Privileged instructions. The simulator runs in the unprivileged mode of a user application, while the guest code may contain instructions meant for system modes. An attempt to execute them would crash the simulator.
There are binary instrumentation tools that allow replacing the machine code of selected instructions transparently to the guest application, for example Pin from Intel or DynamoRIO from Hewlett-Packard. With them one can build a simulator that scans the guest code and replaces its "dangerous" regions before transferring control to it. One could say that instrumentation is a variant of BT in which almost all capsules coincide with the original instructions and only the "difficult" ones are transformed.
Hardware support
Some microprocessors can support direct execution of guest code in a sandbox at the hardware level, taking over the job of intercepting dangerous operations. This approach beats binary instrumentation in both reliability and speed. Hardware-supported DEX is virtualization put to the needs of simulation.
I tried to outline the classic conditions for effective virtualization and the state of affairs in modern systems in my post "Hardware virtualization. Theory, reality and support in processor architectures".
Restrictions
Virtualization is the fastest of the simulation engines considered. At the same time it is, in a sense, the least flexible, as it is strongly tied to the capabilities of the hardware.
- Hardware-supported direct execution is impossible if the guest and host architectures do not match. ARM code on IA-32, or a PowerPC program on MIPS, will not magically start running.
- Not all processor modes can be supported inside the guest. For example, the ancient 16-bit real mode of IA-32 is difficult to execute under VMX unless the so-called Unrestricted Guest feature is available.
- The high cost of entering and leaving guest mode can eat up the speed gain.
- Observing and controlling the guest's execution is harder: it becomes difficult to set breakpoints, to single-step at instruction granularity, to change the semantics of individual instructions, and so on.
- The simulator requires an OS kernel module/driver that switches between the modes. Writing and debugging such a module requires specific knowledge and tools.
Interim summary
Quite a few simulators use Intel VT-x hardware virtualization support for their needs: Wind River Simics, Oracle VirtualBox, QEMU, and VMware products, to name a few.
Putting it all together
So, there are at least three approaches that can be used to build software models: interpretation, binary translation, and direct execution. Let me summarize their strengths and weaknesses.
- Ease of implementation. The interpreter is the easiest to write. It is hard to say which is harder to write, BT or DEX: that depends on the properties of the host system.
- Portability across host architectures. The interpreter takes first place: written correctly, its code is portable. BT is not far behind, provided its code generator can be retargeted to another system. DEX does not work at all when the architectures differ significantly, since it is tied to specific hardware.
- Ease of adding functionality, such as new processor instructions. Again the interpreter leads by a margin. BT requires more effort: new capsules must be written. DEX does not allow going beyond the host hardware at all; new instructions will simply not be recognized.
- Speed. Here the interpreter as a whole loses badly to the other two engines. BT is usually also inferior to DEX. Remember, however, that the weak spot of BT is SMC, while DEX suffers from frequent virtualization exits, for example on privileged instructions.
The conclusion is disappointing: no single scheme satisfies both key requirements, speed and universality, at once. Moreover, at different stages of the same simulation scenario, a different scheme may be preferable.
What to do? Shift gears depending on the driving mode! Use all three engines, engaging each one where it is most profitable.
We can formulate some recommendations, or rather heuristics, that determine the conditions for switching between the modes.
- If a fast mode is unavailable, fall back to a slower one. For example, not all instructions may be supported by BT or DEX.
- Switch from a fast engine to a slow one if the fast one has stopped being fast. If SMC is detected, it is better to temporarily disable BT.
- Postpone translating portions of code until it becomes clear that the code is "hot". Why waste time generating a block that will be executed only a few times?
- Do not use DEX if there is a high probability of an early exit on an exception.
- Give the user the ability to monitor the current simulation speed, broken down by mode. Heuristics are limited in their vision; often only a human can determine what is needed to speed things up.
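The "hot code" heuristic from the list above can be sketched as an execution counter per basic block: a block is interpreted until its counter crosses a threshold, and only then is it handed to the (expensive) binary translator. The threshold value and the structures are invented; real simulators tune such numbers empirically.

```c
#include <stdint.h>

#define HOT_THRESHOLD 50

enum Engine { ENGINE_INTERP, ENGINE_BT };

typedef struct {
    uint64_t pc;          /* start of the basic block */
    unsigned exec_count;
    int      translated;
} BlockProfile;

static enum Engine choose_engine(BlockProfile *b)
{
    if (b->translated)
        return ENGINE_BT;               /* translation cost already paid */
    if (++b->exec_count >= HOT_THRESHOLD) {
        b->translated = 1;              /* block became hot: translate it */
        return ENGINE_BT;
    }
    return ENGINE_INTERP;               /* still cold: keep interpreting */
}

static int heur_demo(void)
{
    BlockProfile b = { 0x400000, 0, 0 };
    int interp_runs = 0;
    for (int i = 0; i < 100; i++)
        if (choose_engine(&b) == ENGINE_INTERP)
            interp_runs++;
    return interp_runs;   /* only the first HOT_THRESHOLD-1 runs interpret */
}
```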
The price of combining speed and flexibility is the need to maintain, in effect, three independent models: to hunt for bugs in each of them, to eliminate "minor" differences in behavior, to stumble upon the quirks of virtualization implementations across hardware generations. In short, to try to tame the three-headed Zmei Gorynych.

Ivan Bilibin. Zmei Gorynych. 1912.
Conclusion
Readers interested in the internal structure of software models of computers I refer to the textbook [1]. The electronic version of the second edition, as well as a "beta" of the third, is available on the website of the course "Fundamentals of Software Modeling".
Not a word was said in this note about yet another way to increase simulation speed: parallel execution. The problems of building such simulators, and their solutions, are described in another series of posts on the Intel Developer Zone: parts 0, 1, 2, 3.
Reverse. Looking at the gear-shift knob pictured at the beginning of the article, I suddenly thought: "Hmm, what about reverse?" In simulation terms this means that the flow of virtual time, and all the processes taking place in it, run backwards. The most interesting thing is that it is feasible. But that is a topic for a separate post, which I plan to prepare someday.
Literature
[1] Fundamentals of Computer Software Simulation: Textbook / Rechistov G.S., Yulyugin E.A., Ivanov A.A., Shishpor P.L., Schelkunov N.N., Gavrilov D.A. 2nd ed., corrected and expanded. MIPT Publishing House, 2013.
[2] Intel Corporation. Intel® Itanium® Architecture Software Developer's Manual, Rev. 2.3.
[3] ARM Limited. ARM Architecture Reference Manual. 2005.
[4] IBM Corporation. PowerPC® Microprocessor Family: The Programming Environments Manual for 64-bit Microprocessors. Version 3.0.
[5] J. C. S. Adrian et al. Systems, Apparatuses, and Methods for Blending Two Source Operands. US Patent Application Publication No. 2012/0254588 A1.
[6] D. Mihocka, S. Shwartsman. Virtualization Without Direct Execution or Jitting: Designing a Portable Virtual Machine Infrastructure. bochs.sourceforge.net
[7] Yair Lifshitz, Robert Cohn, Inbal Livni, Omer Tabach, Mark Charney, Kim Hazelwood. Zsim: A Fast Architectural Simulator for ISA Design-Space Exploration. www.cs.virginia.edu/kim/docs/wish11zsim.pdf
[8] SoftSDV: A Pre-silicon Software Development Environment for the IA-64 Architecture / Richard Uhlig, Roman Fishtein, Oren Gershon, Israel Hirsh, Hong Wang // Intel Technology Journal. 1999. No. 14. ISSN: 1535-766X. noggin.intel.com/content/softsdv-a-pre-silicon-software-development-environment-for-the-ia-64-architecture/
[9] Fabrice Bellard. QEMU, a Fast and Portable Dynamic Translator // FREENIX Track: 2005 USENIX Annual Technical Conference. 2005. www.usenix.org/publications/library/proceedings/usenix05/tech/freenix/full_papers/bellard/bellard.pdf
[10] James E. Smith, Ravi Nair. Virtual Machines: Versatile Platforms for Systems and Processes. Elsevier, 2005. ISBN: 978-1-55860-910-5.