Many years ago, I worked in Microsoft's Xbox 360 group. As we planned the release of a new console, we decided it would be great if that console could run games from the previous-generation console.
Emulation is always difficult, but it turns out to be even harder when your corporate bosses keep changing CPU types. The first Xbox (not to be confused with the Xbox One) used an x86 CPU. The second Xbox, that is, sorry, the Xbox 360, used a PowerPC processor. The third Xbox, the Xbox One, used an x86/x64 CPU. Jumping between ISAs like this did not make our lives any easier.
I was on the team that taught the Xbox 360 to emulate many first-generation Xbox games, that is, to emulate x86 on PowerPC, and for that work I earned the title of "Emulation Ninja". I was then asked to investigate emulating the Xbox 360's PowerPC CPU on an x64 CPU. I will say up front that I never found a satisfactory solution.
FMA != MMA
One of the aspects that worried me most was the fused multiply-add, or FMA, instructions. These instructions take three operands, multiply the first two, and then add the third. "Fused" means that no rounding is performed until the end of the operation: the multiplication is done at full precision, then the addition, and only then is the result rounded to produce the final answer.
To show this with a concrete example, let's imagine that we use decimal floating-point numbers with two digits of precision. Consider this calculation, shown as a function:

FMA(8.1e1, 2.9e1, 4.1e1), i.e. 8.1e1 * 2.9e1 + 4.1e1, i.e. 81 * 29 + 41

81 * 29 equals 2349, and after adding 41 we get 2390. Rounding to two digits gives 2400, or 2.4e3.
If we do not have FMA, then we have to multiply first, getting 2349, which is rounded to two digits of precision, giving 2300 (2.3e3). Then we add 41 and get 2341, which is rounded again, and the final result is 2300 (2.3e3), less accurate than the FMA answer.
Note 1: FMA(a, b, -a*b) computes the rounding error of a*b, which is actually pretty cool.
Note 2: A side effect of note 1 is that x = a * b - a * b may not return zero if the compiler automatically generates FMA instructions.
So, FMA clearly gives more accurate results than separate multiply and add instructions. We won't go deeper here; let's simply agree that if we need to multiply two numbers and then add a third, FMA is more accurate than the alternatives. In addition, FMA instructions often have lower latency than a multiply instruction followed by an add instruction. On the Xbox 360 CPU the latency and throughput of FMA were equal to those of fmul or fadd, so using FMA instead of fmul followed by a dependent fadd halved the latency.
FMA emulation
The Xbox 360 compiler always generated FMA instructions, both vector and scalar. We could not be sure that the x64 processors we would choose supported these instructions, so it was critically important to emulate them quickly and accurately. Our emulation of these instructions had to be perfect, because from previous experience emulating floating-point math I knew that "pretty close" results led to characters falling through the floor, cars sliding out of the world, and so on.
So what does it take to emulate FMA instructions perfectly if the x64 CPU does not support them?
Fortunately, the vast majority of floating-point calculations in games are performed at float precision (32 bits), so I could happily use double-precision (64-bit) instructions in the emulation.
It seems that emulating float-precision FMA instructions using double precision should be simple (narrator's voice: it wasn't; floating point is never simple). A float has 24 bits of precision and a double has 53. This means that if you convert the incoming floats to double (a lossless conversion), you can perform the multiplication without error: only 48 bits are needed to hold the fully accurate product, and we have more than that, so all is well.
Then we need to perform the addition. Just take the addend as a float, convert it to double, and add it to the result of the multiplication. Since no rounding occurs during the multiplication and rounding happens only after the addition, this seems entirely sufficient for FMA emulation. Our logic is perfect. Declare victory and go home.
The victory was so close ...
But it does not work. Or at least it fails for some of the inputs. Take a moment to think about why.
Hold music plays...
The failure occurs because, by definition, FMA performs the multiplication and addition at full precision and only then rounds the result to float precision. We almost managed to achieve this. Our multiplication happens without rounding, and rounding is performed after the addition, which is close to what we are trying to do. But that rounding after the addition happens at double precision. After that we have to store the result at float precision, which rounds again.
Uh-oh. Double rounding.
It would be hard to show this with real binary values, so let's go back to our decimal floating-point formats, where single precision is two decimal digits and double precision is four. And let's imagine that we calculate FMA(8.1e1, 2.9e1, 9.9e-1), or 81 * 29 + .99.
The exact answer is 2349.99, or 2.34999e3. Rounded to single precision (two digits), that gives 2.3e3. Now let's see what goes wrong when we try to emulate this calculation.
When we multiply 81 by 29 at double precision, we get 2349. So far, so good. Then we add .99 and get 2349.99. Still fine. This result is rounded to double precision, giving 2350 (2.350e3). Uh-oh.
We round it to single precision and, following the IEEE round-to-nearest-even rule, we get 2400 (2.4e3). This is the wrong answer: it has a slightly larger error than the correctly rounded result returned by a true FMA instruction.
You could argue that the problem is the IEEE round-to-nearest-even rule. But whatever rounding rule you choose, there will always be cases where double rounding returns a result different from a true FMA.
So how did it all end?
I could not find a fully satisfying solution to this problem.
I left the Xbox team long before the release of the Xbox One and have not paid much attention to the console since, so I do not know what solution they settled on. Modern x64 CPUs have FMA instructions that can emulate these operations perfectly. Perhaps the x87 math coprocessor can somehow be used for FMA emulation; I do not remember what conclusion I reached when I looked into that. Or perhaps the developers simply decided that the results were close enough to use.