Many years ago, I worked in Microsoft's Xbox 360 group. As we planned the release of a new console, we decided it would be great if that console could run games from the previous-generation console.
Emulation is always difficult, but it turns out to be even harder when your corporate bosses keep changing CPU types. The first Xbox (not to be confused with the Xbox One) used an x86 CPU. The second Xbox, that is, sorry, the Xbox 360, used a PowerPC processor. The third Xbox, the Xbox One, used an x86/x64 CPU. Jumping between ISAs like this did not make our lives any easier.
I was on the team that taught the Xbox 360 to emulate many first-generation Xbox games, that is, to emulate x86 on PowerPC, and for that work I earned the title of "Emulation Ninja". I was then asked to investigate emulating the Xbox 360's PowerPC CPU on an x64 CPU. I will say up front that I never found a satisfactory solution.
FMA != MMA
One of the aspects that worried me most was the fused multiply-add, or FMA, instructions. These instructions take three operands, multiply the first two, and then add the third. "Fused" means that no rounding is performed until the end of the operation: the multiplication is done at full precision, then the addition, and only then is the result rounded to produce the final answer.
To show this with a concrete example, let's imagine that we use decimal floating-point numbers with two digits of precision. Consider this calculation, shown as a function:

FMA(8.1e1, 2.9e1, 4.1e1), i.e. 8.1e1 * 2.9e1 + 4.1e1, i.e. 81 * 29 + 41

81 * 29 equals 2349, and after adding 41 we get 2390. Rounding to two digits gives 2400, or 2.4e3.
If we do not have FMA, then we have to multiply first, getting 2349, which is rounded to two digits of precision, giving 2300 (2.3e3). Then we add 41 and get 2341, which is rounded again, and the final result is 2300 (2.3e3), less accurate than the FMA answer.
Note 1: FMA(a, b, -a*b) computes the rounding error of a*b, which is actually pretty cool.
Note 2: A side effect of note 1 is that x = a * b - a * b may not return zero if the compiler automatically generates FMA instructions.
So, FMA clearly gives more accurate results than separate multiply and add instructions. We won't go deeper here; let's simply agree that if we need to multiply two numbers and then add a third, FMA is more accurate than the alternatives. In addition, FMA instructions often have lower latency than a multiply instruction followed by an add instruction. On the Xbox 360 CPU the latency and throughput of FMA were equal to those of fmul or fadd, so using FMA instead of fmul followed by a dependent fadd halved the latency.
FMA emulation
The Xbox 360 compiler always generated FMA instructions, both vector and scalar. We could not be sure that the x64 processors we would choose supported these instructions, so it was critically important to emulate them quickly and accurately. Our emulation of these instructions had to be perfect, because from previous experience emulating floating-point math I knew that "pretty close" results led to characters falling through the floor, cars sliding out of the world, and so on.
So what does it take to emulate FMA instructions perfectly if the x64 CPU does not support them?
Fortunately, the vast majority of floating-point calculations in games are performed at float precision (32 bits), so I could happily use double-precision (64-bit) instructions in the emulation.
It seems that emulating float-precision FMA instructions using double precision should be simple (narrator's voice: it wasn't; floating point is never simple). A float has 24 bits of precision and a double has 53. This means that if you convert the incoming floats to double (a lossless conversion), you can perform the multiplication without error: only 48 bits are needed to hold the fully accurate product, and we have more than that, so all is well.
Then we need to perform the addition. Just take the addend as a float, convert it to double, and add it to the result of the multiplication. Since no rounding occurs during the multiplication and rounding happens only after the addition, this seems entirely sufficient for FMA emulation. Our logic is perfect. Declare victory and go home.
The victory was so close ...
But it does not work. Or at least it fails for some of the inputs. Take a moment to think about why.
Hold music plays...
The failure occurs because, by definition, FMA performs the multiplication and addition at full precision and only then rounds the result to float precision. We almost managed to achieve this. Our multiplication happens without rounding, and rounding is performed after the addition, which is close to what we are trying to do. But that rounding after the addition happens at double precision. After that we have to store the result at float precision, which rounds again.
Uh-oh. Double rounding.
It would be hard to show this with real binary values, so let's go back to our decimal floating-point formats, where single precision is two decimal digits and double precision is four. And let's imagine that we calculate FMA(8.1e1, 2.9e1, 9.9e-1), or 81 * 29 + .99.
The exact answer is 2349.99, or 2.34999e3. Rounded to single precision (two digits), that gives 2.3e3. Now let's see what goes wrong when we try to emulate this calculation.
When we multiply 81 by 29 at double precision, we get 2349. So far, so good. Then we add .99 and get 2349.99. Still fine. This result is rounded to double precision, giving 2350 (2.350e3). Uh-oh.
We round it to single precision and, following the IEEE round-to-nearest-even rule, we get 2400 (2.4e3). This is the wrong answer: it has a slightly larger error than the correctly rounded result returned by a true FMA instruction.
You could argue that the problem is the IEEE round-to-nearest-even rule. But whatever rounding rule you choose, there will always be cases where double rounding returns a result different from a true FMA.
So how did it all end?
I could not find a fully satisfying solution to this problem.
I left the Xbox team long before the release of the Xbox One and have not paid much attention to the console since, so I do not know what solution they settled on. Modern x64 CPUs have FMA instructions that can emulate these operations perfectly. Perhaps the x87 math coprocessor can somehow be used for FMA emulation; I do not remember what conclusion I reached when I looked into that. Or perhaps the developers simply decided that the results were close enough to use.