It will hardly surprise anyone that not only Intel but also companies such as AMD and VIA are involved in developing the IA-32 architecture. More information can be found, for example, in the article by A. Fog. Today I want to talk about one ISA change introduced by AMD that, in my opinion, was not fully thought through.

When people think about AMD's impact on the IA-32 architecture, the REX prefix and support for 64-bit processor mode come to mind first. That is definitely a positive contribution that made IA-32 better. However, there were other interesting changes that I personally cannot call positive.
The IA-32 instruction encoding has, over its long evolution, become an extremely complex structure (the prefixes alone are worth a mention). When I discussed some decoding problems and their solutions in the articles "Is your disassembler working correctly?" and "How to cope with the IA-32 code or features of a Simics decoder", I forgot to mention a few interesting facts. The maximum possible length of an IA-32 instruction is 15 bytes. An encoding may contain several prefixes, and their number is in fact limited only by the instruction length. The same prefix may occur several times, and prefixes that cannot affect the instruction in any way may also be present. All of them are simply ignored.
In my opinion, the NOP instruction (No OPeration, an instruction that does nothing; encoding 0x90) illustrates this well. The sequence 0x66 0x66 0x66 0x66 0x66 0x66 0x66 0x66 0x66 0x66 0x66 0x66 0x66 0x66 0x90 is also a NOP instruction: all 14 of the 0x66 prefixes are simply ignored.
This is certainly a very strange feature, but there is no getting away from it. Some compilers may even use redundant prefixes for code alignment.
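The prefix rule described above can be sketched roughly as follows. This is a minimal illustration, not a real decoder: the names `LEGACY_PREFIXES`, `split_prefixes` and `is_padded_nop` are hypothetical, and only the 0x66 + 0x90 case from the text is checked (other prefixes can change the meaning of 0x90, so they are deliberately not accepted here).

```python
# Legacy IA-32 prefix bytes: operand/address size, segment overrides, LOCK, REPNE/REP.
LEGACY_PREFIXES = {0x66, 0x67, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0xF0, 0xF2, 0xF3}
MAX_INSN_LEN = 15  # architectural limit on the total instruction length


def split_prefixes(code: bytes):
    """Split the leading run of legacy prefix bytes from the rest of the encoding."""
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        i += 1
    return code[:i], code[i:]


def is_padded_nop(code: bytes) -> bool:
    """True if `code` is 0x90 preceded only by 0x66 prefixes and fits in 15 bytes."""
    prefixes, rest = split_prefixes(code)
    return (
        len(code) <= MAX_INSN_LEN
        and rest == b"\x90"
        and all(p == 0x66 for p in prefixes)
    )


if __name__ == "__main__":
    print(is_padded_nop(bytes([0x66] * 14 + [0x90])))  # the 15-byte example from the text
    print(is_padded_nop(bytes([0x66] * 15 + [0x90])))  # 16 bytes: over the length limit
```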
But that was only the warm-up; now the real trouble begins.

The Intel architecture has had the BSR instruction for many years. It first appeared in the Intel 80386 processor. It finds the index of the most significant bit that is set to 1. For example, for the number 0x11aa00bb this instruction returns 28.
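The result of BSR for this example can be reproduced with a small model. This is only a sketch of the arithmetic, assuming a nonzero source operand; the function name `bsr` is chosen here for illustration.

```python
def bsr(value: int) -> int:
    """Index of the most significant set bit, as BSR computes it for a nonzero source."""
    if value == 0:
        # On real hardware the destination is undefined for a zero source.
        raise ValueError("BSR result is undefined for a zero source")
    return value.bit_length() - 1


if __name__ == "__main__":
    print(bsr(0x11AA00BB))  # 28, the value quoted in the text
```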
Let's see how it can be encoded:

Nothing interesting: 0x0F 0xBD followed by a ModR/M byte for the operands.
Now let's add a prefix to the encoding of this instruction, say 0xF3. The result is still a valid instruction; the prefix is simply ignored, since it applies only to string operations and input/output instructions. Nothing criminal so far.
So what did our colleagues at AMD actually do? Having done some research, they found that the combination of the 0xF3 prefix with the BSR instruction is very rare in software, and reassigned this combination to a new instruction, LZCNT, which counts the number of leading zeros.
For the same input number 0x11aa00bb in 32-bit mode, this instruction returns not 28, but 3.
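The difference between the two results can be checked with a short model of LZCNT. This is a sketch; the function name `lzcnt` and the `width` parameter are illustrative. For a nonzero source the two instructions are related by LZCNT = (width - 1) - BSR.

```python
def lzcnt(value: int, width: int = 32) -> int:
    """Number of leading zero bits in a `width`-bit operand; returns `width` for zero."""
    value &= (1 << width) - 1  # keep only the low `width` bits
    return width - value.bit_length()


if __name__ == "__main__":
    x = 0x11AA00BB
    print(lzcnt(x))                               # 3
    print(lzcnt(x) == 31 - (x.bit_length() - 1))  # relation to the BSR result (28)
```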

This instruction appeared as part of the ABM (Advanced Bit Manipulation) extension, which consists of two instructions, LZCNT and POPCNT (I personally see nothing wrong with the latter), each of which has its own bit in CPUID. Unfortunately, this instruction cannot be disabled.
The ABM instruction set was first supported by AMD processors based on the Barcelona microarchitecture. Intel added the POPCNT instruction to the instruction set of Nehalem processors. One might have thought that Intel would stop there, but no: the LZCNT instruction appeared in Haswell processors.
Why is this bad? First, this change obviously breaks backward compatibility. But that, in my opinion, is not its worst feature. As mentioned above, according to AMD's research the BSR instruction with this prefix is extremely rare. Still, in theory such code can exist.
But this article is not about that, so let's step away from the typical needs of an ordinary user and look at the needs of developers. As you know, most of the software stack is written and debugged on a simulator before the chip itself is manufactured. So let's see how this change can affect the speed and accuracy of simulation.
Of course, everyone wants simulation to run as fast as possible. The speed of a plain interpreter is never enough: everyone wants the BIOS to boot in seconds and the operating system in minutes. For this reason the model is made much more complex by adding an optimizing binary translator, which reduces simulation time. But this is still not enough! So support for direct execution of guest instructions on the host is added, which complicates the model further while improving performance several times over. More information about the various modes of operation of the simulator can be found in the article “Programming simulation of a microprocessor. Transmission”.
It is easy to guess that neither the interpreter nor the translator should have any problems. Problems may arise when hardware virtualization is used. Neither LZCNT nor, all the more so, BSR causes an exit to the VM monitor.
As a result, if you need to simulate a Haswell or newer processor on an older host, such as Sandy Bridge, the host may execute BSR instead of LZCNT. And vice versa: if you want to model a simpler processor, for example Quark, on a Haswell host, you risk getting the opposite effect, LZCNT instead of BSR.
They broke virtualization!
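The mismatch can be illustrated with a toy model of what a host does with the bytes F3 0F BD when they are executed directly. This is only a sketch: `run_f3_0f_bd` and the `host_has_lzcnt` flag are hypothetical names, and a nonzero source operand is assumed.

```python
def run_f3_0f_bd(src: int, host_has_lzcnt: bool, width: int = 32) -> int:
    """Result a host produces for F3 0F BD /r executed directly (nonzero source)."""
    src &= (1 << width) - 1
    if src == 0:
        raise ValueError("zero source is not modelled in this sketch")
    msb_index = src.bit_length() - 1
    if host_has_lzcnt:
        return width - 1 - msb_index  # decoded as LZCNT
    return msb_index                  # prefix ignored, decoded as BSR


if __name__ == "__main__":
    x = 0x11AA00BB
    print(run_f3_0f_bd(x, host_has_lzcnt=True))   # 3: what a Haswell guest model expects
    print(run_f3_0f_bd(x, host_has_lzcnt=False))  # 28: what an older host computes
```

The same guest bytes yield different register values depending only on the host, and since neither instruction traps, the simulator never gets a chance to correct the result.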

There is, however, a solution to this problem: scanning pages in advance.
The existing virtualization mechanism lets you restrict the set of memory pages that guest software can access. So we can allow direct execution only of code located on pages that contain no LZCNT instruction encodings, and pre-scan every new page for these instructions.
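A page pre-scan of this kind might look roughly like the sketch below. It is not the scanner of any particular simulator: the name `page_may_contain_lzcnt` is hypothetical, the scan works on raw bytes rather than on decoded instruction boundaries (so it can report false positives), bytes 0x40 to 0x4F are skipped as possible REX prefixes, and instructions that straddle a page boundary are not handled.

```python
LEGACY_PREFIXES = {0x66, 0x67, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0xF0, 0xF2, 0xF3}


def page_may_contain_lzcnt(page: bytes) -> bool:
    """Conservatively report whether `page` holds an F3-prefixed 0F BD opcode."""
    pos = page.find(b"\x0f\xbd")
    while pos != -1:
        i = pos - 1
        # Walk backwards over the run of prefix-like bytes before the opcode.
        while i >= 0 and (page[i] in LEGACY_PREFIXES or 0x40 <= page[i] <= 0x4F):
            if page[i] == 0xF3:
                return True
            i -= 1
        pos = page.find(b"\x0f\xbd", pos + 1)
    return False


if __name__ == "__main__":
    clean = b"\x90" * 8 + b"\x0f\xbd\xc1"      # plain BSR
    dirty = b"\x90" * 8 + b"\xf3\x0f\xbd\xc1"  # F3-prefixed form
    print(page_may_contain_lzcnt(clean), page_may_contain_lzcnt(dirty))
```

Pages for which the function returns True would be left to the interpreter or the binary translator; all other pages could be executed directly.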
Such a change, of course, reduces performance and complicates an already far from simple simulator. This, it seems to me, is the negative effect of these changes.
P.S. This instruction is not the only one of its kind. Together with the BMI1 extension, Intel added a new TZCNT instruction, which is tied to the BSF instruction in the same way.
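For comparison, the BSF/TZCNT pair can be modelled the same way. This is a sketch with illustrative function names. For a nonzero source both instructions yield the same index, so here the visible difference in the destination register is limited to a zero source, where BSF leaves the destination undefined and TZCNT returns the operand size.

```python
def bsf(value: int) -> int:
    """Index of the least significant set bit (destination undefined for zero)."""
    if value == 0:
        raise ValueError("BSF result is undefined for a zero source")
    return (value & -value).bit_length() - 1


def tzcnt(value: int, width: int = 32) -> int:
    """Number of trailing zero bits; returns `width` for a zero source."""
    value &= (1 << width) - 1
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


if __name__ == "__main__":
    print(bsf(0x11AA00B8), tzcnt(0x11AA00B8))  # both give 3
    print(tzcnt(0))                            # 32
```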