
How we doubled the speed of working with Float in Mono


My friend Aras recently wrote the same ray tracer in several languages, including C++, C# and the Unity Burst compiler. Naturally, you would expect C# to be slower than C++, but what struck me as interesting was how much slower Mono was than .NET Core.

The numbers he published did not look good:


I decided to see what was happening and document the areas that could be improved.
This benchmark, and the investigation it prompted, turned up three areas where improvement was possible:


The baseline for this exercise was the ray tracer running on my own machine, and since my hardware is different from his, we cannot compare the numbers directly.

The results on my home iMac for Mono and .NET Core were as follows:

Runtime                                      Results, MRay/sec
.NET Core 2.1.4, dotnet run (debug build)    3.6
.NET Core 2.1.4, dotnet run -c Release       21.7
Vanilla Mono, mono Maths.exe                 6.6
Vanilla Mono with LLVM and float32           15.5

While looking into this, we found a couple of problems; fixing them produced the following results:

Runtime                                                Results, MRay/sec
Mono with LLVM and float32                             15.5
Improved Mono with LLVM, float32 and improved inlining 29.6

Overall picture:


Simply using LLVM and float32 improves the performance of floating-point code almost 2.3 times. And with the tuning we added to Mono as a result of these experiments, performance improves 4.4 times over stock Mono; these settings will become the defaults in future versions of Mono.

In this article I will explain our findings.

32-bit and 64-bit float


Aras uses 32-bit floating-point numbers for the bulk of his computations (the float type in C#, or System.Single in .NET). In Mono, we made a mistake a long time ago: all 32-bit floating-point computations were performed as 64-bit ones, even though the data was still stored in 32-bit locations.
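To make this concrete, here is a minimal sketch (my illustration, not Aras' actual ray tracer code) of the kind of single-precision arithmetic involved. Under the old Mono behavior, the float operands in a function like this were widened to double, the arithmetic was done in 64 bits, and the result was narrowed back to 32 bits only when it was stored:

    using System;

    class Float32Demo
    {
        // An illustrative dot product over 32-bit floats. Old Mono performed
        // the multiplications and additions at 64-bit precision and truncated
        // back to 32 bits only when storing the result.
        static float Dot(float ax, float ay, float az,
                         float bx, float by, float bz)
        {
            return ax * bx + ay * by + az * bz;
        }

        static void Main()
        {
            Console.WriteLine(Dot(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f)); // 32
        }
    }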

Today, my memory is not as sharp as before, and I can’t remember exactly why we made this decision.

I can only assume that we were influenced by the trends and ideas of that time.

Back then, doing float computations at higher precision had a positive aura around it. For example, Intel's x87 processors used 80-bit precision for floating-point computations even when the operands were doubles, which gave users more accurate results.

It was also fresh in my mind that in one of my previous projects, the Gnumeric spreadsheet, the statistical functions had been implemented with extra precision and produced better results than Excel's. So the idea that higher internal precision gives more accurate results was well known in those communities.

In Mono's early days, the math operations available across all platforms only took double arguments. C99, POSIX and ISO later added 32-bit versions (for example, sinf is the float version of sin, fabsf of fabs, and so on), but at the time they were not widely available across the industry.

In short, the beginning of the 2000s was a time of optimism.

Applications paid a price for this in extra computation time, but Mono was mostly used for Linux desktop applications, for serving HTTP pages and for some server processes, so floating-point performance was not a problem we ran into daily. It only showed up in the occasional scientific benchmark, and in 2003 those were rarely written for .NET.

Today, games, 3D applications, image processing, VR, AR and machine learning have made floating-point numbers a far more common data type. When it rains, it pours, and floats are no exception: they are no longer a friendly type used in just a couple of places in the code. They now arrive as an avalanche there is no hiding from; there are a lot of them and their spread cannot be stopped.

The float32 runtime flag


So a few years ago we decided to add support for performing 32-bit float operations with real 32-bit operations, just like everyone else does. We called this runtime feature "float32". In Mono it is enabled by passing the --O=float32 option to the runtime, and in Xamarin applications the setting can be changed in the project options.

This new flag was well received by our mobile users: mobile devices are mostly still not very powerful, and their users prefer faster processing over extra precision. We recommended that mobile users enable both the LLVM optimizing compiler and the float32 flag.
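For example, assuming the ray tracer is built into the Maths.exe binary from the tables above, a run with both options enabled looks roughly like this:

    mono --llvm -O=float32 Maths.exe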

Although this flag has existed for several years, we had not made it the default, to avoid unpleasant surprises for users. But we started running into cases where the default 64-bit behavior itself caused the surprises; see this bug report filed by a Unity user.

We are now going to make float32 the default in Mono; progress can be tracked here: https://github.com/mono/mono/issues/6985 .

In the meantime, back to my friend Aras' project. He used new APIs that were added in .NET Core. Although .NET Core has always performed 32-bit float operations as 32-bit floats, the System.Math API still forced conversions from float to double in places. For example, if you needed to compute the sine of a float value, your only option was to call Math.Sin(double), so you had to convert the float to a double.

To fix this, .NET Core introduced the System.MathF type, which contains single-precision floating-point math operations, and we have now ported System.MathF to Mono.
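As a rough sketch of the difference (illustrative code, not from the ray tracer):

    using System;

    class MathFDemo
    {
        static void Main()
        {
            float angle = 0.5f;

            // Old path: Math.Sin only takes a double, so the float argument is
            // widened to double and the double result is narrowed back to float.
            float a = (float)Math.Sin(angle);

            // New path: MathF.Sin operates on float end to end.
            float b = MathF.Sin(angle);

            Console.WriteLine($"{a} {b}");
        }
    }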

Moving from 64-bit to 32-bit floats improves performance considerably, as this table shows:

Runtime and options                      MRay/sec
Mono with System.Math                    6.6
Mono with System.Math and -O=float32     8.1
Mono with System.MathF                   6.5
Mono with System.MathF and -O=float32    8.2

So in this test float32 genuinely improves performance, while MathF on its own makes little difference.

Tuning LLVM


In the course of this work, we found that although Mono's fast JIT engine has float32 support, we had never added that support to the LLVM backend. That meant Mono with LLVM was still performing costly float-to-double conversions.

Therefore, Zoltan added float32 support to the LLVM code generation engine.

Then he noticed that our inliner was using the same heuristics for LLVM as it does for the fast JIT. With the fast JIT you have to balance JIT speed against execution speed, so we limit how much code gets inlined to keep the load on the JIT engine down.

But if you have opted into Mono's LLVM backend, you want the fastest code possible, so we adjusted the limits accordingly. Today the setting can be changed with the MONO_INLINELIMIT environment variable, but really it should just become the default.
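For example, a run with a larger inline limit can be forced like this (the value 100 is only an illustration; it is not necessarily the limit we settled on):

    MONO_INLINELIMIT=100 mono --llvm -O=float32 Maths.exe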

Here are the results with the modified LLVM settings:

Runtime and options                                               MRay/sec
Mono with System.Math, --llvm -O=float32                          16.0
Mono with System.Math, --llvm -O=float32, improved inlining       29.1
Mono with System.MathF, --llvm -O=float32, improved inlining      29.6

Next steps


All of these improvements required only a modest effort, and they grew out of a few casual discussions on Slack. I even managed to carve out a few hours one evening to port System.MathF to Mono.

Aras' ray tracer turned out to be an ideal subject for study, because it is self-contained and a real application rather than a synthetic benchmark. We want to find other similar programs we can use to study the binary code we generate, and to make sure we are feeding LLVM the best possible data so it can do its job well.

We are also thinking about upgrading the version of LLVM we use and taking advantage of its newly added optimizations.

Separate note


Extra precision has nice side effects. For example, while reading the Godot engine pull requests, I saw an active discussion about making the precision of floating-point operations configurable at compile time ( https://github.com/godotengine/godot/pull/17134 ).

I asked Juan why anyone would need this, since I assumed 32-bit floating point was good enough for games.

Juan explained that floats generally work great, but if you move away from the origin, say 100 kilometers from the center of the game world, the accumulated computation error becomes noticeable and can produce interesting graphical glitches. There are various strategies for mitigating this, and one of them is to use higher precision, which you pay for in performance.
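Here is a minimal sketch of the effect, assuming one world unit is one meter and a runtime that really performs float math at 32-bit precision (.NET Core, or Mono with float32). Near the origin a one-millimeter step is representable, but 100 kilometers away the spacing between adjacent 32-bit floats is already about 8 millimeters, so the same step is rounded away entirely:

    using System;

    class PrecisionDrift
    {
        static void Main()
        {
            // Near the origin, a 1 mm step survives.
            float near = 1.0f;
            Console.WriteLine(near + 0.001f - near);   // prints ~0.001

            // 100 km from the origin, adjacent floats are ~0.0078 apart,
            // so the 1 mm step is rounded away entirely.
            float far = 100000.0f;
            Console.WriteLine(far + 0.001f - far);     // prints 0
        }
    }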

Shortly after our conversation, a post demonstrating this very problem showed up in my Twitter feed: http://pharr.org/matt/blog/2018/03/02/rendering-in-camera-space.html

The problem is shown in the images below. Here we see a model of a sports car from the pbrt-v3-scenes package. Both the camera and the scene are near the origin, and everything looks great.


(The car model was created by Yasutoshi Mori.)

Then we move the camera and the scene 200,000 units away from the origin along x, y and z. You can see that the car model has become rather fragmented; this is purely due to the lack of floating-point precision.


If we move 5 times further still, to 1 million units from the origin, the model starts to fall apart; the car turns into an extremely coarse voxel approximation of itself, both fascinating and terrifying. (Keanu wonders: is Minecraft so blocky simply because everything is rendered very far from the origin?)


(I apologize to Yasutoshi Mori for what we did to his beautiful model.)

Source: https://habr.com/ru/post/432176/

