
Is C# a low-level language?

I'm a big fan of everything Fabien Sanglard does: I like his blog, and I've read both of his books cover to cover (they were discussed in a recent Hanselminutes podcast).

Recently, Fabien wrote a great post in which he dissected a tiny ray tracer, deobfuscating the code and explaining the mathematics in a beautifully clear way. I really recommend taking the time to read it!

But it got me thinking: could this C++ code be ported to C#? Since I have been writing quite a lot of C++ in my day job lately, I thought I'd give it a try.
More importantly, I wanted to get a better sense of whether C# is a low-level language.

A slightly different but related question: how suitable is C# for "systems programming"? On that topic I highly recommend Joe Duffy's excellent post from 2013.

Line by line


I started with a straightforward line-by-line port of the deobfuscated C++ code to C#. It turned out to be quite simple: apparently it's true that C# is C++++!

The main data structure is the 'vector'; here's a comparison, C++ on the left, C# on the right:



So there are a few syntactic differences, but because .NET lets you define your own value types, I was able to get the same functionality. This matters because treating the 'vector' as a struct gives us better "data locality", and the .NET garbage collector doesn't need to be involved, since the data will live on the stack (yes, I know that's an implementation detail).
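As a rough illustration (a hedged sketch: the field names and operator choices below are mine, not the exact code from the port, though the `%`-as-dot-product trick comes from Fabien's obfuscated original), a value-type vector in C# looks like this:

```csharp
// A minimal value-type 3D vector, in the spirit of the port.
// Because it is a struct, an array of Vec is one contiguous block of
// memory with no per-element heap allocations.
public struct Vec
{
    public float X, Y, Z;

    public Vec(float x, float y, float z) { X = x; Y = y; Z = z; }

    // Component-wise addition, like the C++ operator+
    public static Vec operator +(Vec a, Vec b)
        => new Vec(a.X + b.X, a.Y + b.Y, a.Z + b.Z);

    // Dot product, mapped to operator% in the original obfuscated code
    public static float operator %(Vec a, Vec b)
        => a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}
```

Operator overloading on structs is what lets the C# version read almost identically to the C++.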

For more information about structs or "value types" in .NET, see:


In particular, the last post, by Eric Lippert, contains this useful quote that makes it clear what "value types" really are:

Of course, the most relevant fact about value types is not the implementation detail of how they are allocated, but rather the by-design semantic meaning of "value type", namely that they are always copied "by value". If the allocation details were the relevant thing, we'd have called them "heap types" and "stack types". But most of the time that isn't relevant; most of the time the relevant thing is their copying and identity semantics.
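A short illustration of that copied-"by value" semantics (my own sketch, not code from the post):

```csharp
using System;

struct Point { public int X; }

class CopyDemo
{
    static void Main()
    {
        Point a = new Point { X = 1 };
        Point b = a;   // assignment copies the whole struct
        b.X = 42;      // mutates only the copy
        Console.WriteLine(a.X); // prints 1: a is unaffected
        Console.WriteLine(b.X); // prints 42
    }
}
```

Had `Point` been a class, `a` and `b` would have referred to the same object and both lines would have printed 42.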

Now let's see how some of the other methods compare (again, C++ on the left, C# on the right), first RayTracing(..):



Then QueryDatabase(..):



(see Fabien's post for an explanation of what these two functions do)

Again, the point is that C# makes it very easy to write code that closely mirrors the C++! The ref keyword helps us the most here: it lets you pass a value by reference. ref has been usable on method parameters for a long time, but more recently work has been done to allow it in other places:


In some scenarios using ref improves performance, because the struct doesn't need to be copied; see the benchmarks in Adam Sitnik's post and "Performance traps of ref locals and ref returns in C#" for more information.
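For instance (a sketch of the general pattern, not code from the port), a ref local lets you work on a struct stored in an array without copying it out and writing it back:

```csharp
using System;

struct Vec { public float X, Y, Z; }

class RefDemo
{
    static void Main()
    {
        var spheres = new Vec[3];

        // Without ref: 'copy' is an independent copy; the array is unchanged.
        Vec copy = spheres[0];
        copy.X = 1.0f;
        Console.WriteLine(spheres[0].X); // prints 0

        // With ref: 's' aliases the array element, so no copy is made.
        ref Vec s = ref spheres[0];
        s.X = 1.0f;
        Console.WriteLine(spheres[0].X); // prints 1
    }
}
```

The larger the struct, the more the avoided copy matters on a hot path.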

Most importantly, this approach gives our C# port the same behavior as the C++ source. Note, though, that so-called "managed references" are not quite the same as "pointers"; in particular, you can't do arithmetic on them. For more details, see:


Performance


So the code ports nicely, but performance also matters, especially in a ray tracer, which can take several minutes to render a single frame! The C++ code contains a sampleCount variable that controls the final image quality; with sampleCount = 2 the result looks like this:



Clearly not very realistic!

But when you get to sampleCount = 2048, everything looks much better:



Running with sampleCount = 2048 takes a long time, though, so all further runs use a value of 2, which keeps each run under about a minute. Changing sampleCount affects only the number of iterations of the outermost loop of the code; see this gist for an explanation.
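In outline, the structure is simple; here is a hypothetical sketch (the names RenderPixel and TraceSample are mine, not from the actual code) of how sampleCount drives only the outermost loop:

```csharp
using System;

class SamplingSketch
{
    // Average sampleCount ray samples for one pixel. Each iteration costs
    // the same, so run time scales linearly with sampleCount.
    public static float RenderPixel(int sampleCount)
    {
        float color = 0f;
        for (int p = sampleCount; p > 0; p--)
        {
            color += TraceSample(); // one sample ray's contribution
        }
        return color / sampleCount;
    }

    // Stand-in for the real ray trace, which returns a sampled color.
    public static float TraceSample() => 0.5f;

    static void Main()
    {
        Console.WriteLine(RenderPixel(2));    // cheap but noisy in a real tracer
        Console.WriteLine(RenderPixel(2048)); // 1024x the work, same structure
    }
}
```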

Results of the "naive" line-by-line port


To make a meaningful comparison between C++ and C#, I used the time-windows tool, a port of the Unix time command. The initial results looked like this:

|                       | C++ (VS 2017) | .NET Framework (4.7.2) | .NET Core (2.2) |
|-----------------------|---------------|------------------------|-----------------|
| Time (sec)            | 47.40         | 80.14                  | 78.02           |
| Kernel time (sec)     | 0.14 (0.3%)   | 0.72 (0.9%)            | 0.63 (0.8%)     |
| User time (sec)       | 43.86 (92.5%) | 73.06 (91.2%)          | 70.66 (90.6%)   |
| Page faults           | 1,143         | 4,818                  | 5,945           |
| Working set (KB)      | 4,232         | 13,624                 | 17,052          |
| Paged memory (KB)     | 95            | 172                    | 154             |
| Non-paged memory (KB) | 7             | 14                     | 16              |
| Page file (KB)        | 1,460         | 10,936                 | 11,024          |

Initially, the C# code is quite a bit slower than the C++ version, but it gets better (see below).

But first let's look at what the .NET JIT is doing for us, even with this "naive" line-by-line port. For one thing, it does a good job of inlining the smaller "helper methods". We can see this in the output of the excellent Inlining Analyzer tool (green = inlined):



However, it doesn't inline all methods; for example, QueryDatabase(..) is skipped because of its complexity:



Another thing the .NET Just-In-Time (JIT) compiler does is convert certain method calls into the corresponding CPU instructions. We can see this in action with the sqrt wrapper function; here is the C# source (note the Math.Sqrt call):

```csharp
// intnv square root
public static Vec operator !(Vec q)
{
    return q * (1.0f / (float)Math.Sqrt(q % q));
}
```

And here is the assembly code the .NET JIT generates: there is no call to Math.Sqrt; instead the vsqrtsd CPU instruction is used:

```
; Assembly listing for method Program:sqrtf(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->  mm0
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M8216_IG01:
       vzeroupper
G_M8216_IG02:
       vcvtss2sd xmm0, xmm0
       vsqrtsd  xmm0, xmm0
       vcvtsd2ss xmm0, xmm0
G_M8216_IG03:
       ret
; Total bytes of code 16, prolog size 3 for method Program:sqrtf(float):float
; ============================================================
```

(To get this output yourself, follow these instructions, use the Disasmo VS2019 add-in, or try SharpLab.io.)

These replacements are also known as intrinsics, and in the code below we can see how the JIT generates them. This snippet only shows the mapping for AMD64, but the JIT also targets X86, ARM and ARM64; the full method is here.

```cpp
bool Compiler::IsTargetIntrinsic(CorInfoIntrinsics intrinsicId)
{
#if defined(_TARGET_AMD64_) || (defined(_TARGET_X86_) && !defined(LEGACY_BACKEND))
    switch (intrinsicId)
    {
        // AMD64/x86 has SSE2 instructions to directly compute sqrt/abs and SSE4.1
        // instructions to directly compute round/ceiling/floor.
        //
        // TODO: Because the x86 backend only targets SSE for floating-point code,
        //       it does not treat Sine, Cosine, or Round as intrinsics (JIT32
        //       implemented those intrinsics as x87 instructions). If this poses
        //       a CQ problem, it may be necessary to change the implementation of
        //       the helper calls to decrease call overhead or switch back to the
        //       x87 instructions. This is tracked by #7097.
        case CORINFO_INTRINSIC_Sqrt:
        case CORINFO_INTRINSIC_Abs:
            return true;

        case CORINFO_INTRINSIC_Round:
        case CORINFO_INTRINSIC_Ceiling:
        case CORINFO_INTRINSIC_Floor:
            return compSupports(InstructionSet_SSE41);

        default:
            return false;
    }
    ...
}
```

As you can see, some methods, such as Sqrt and Abs, are handled as intrinsics, while for others the JIT falls back to the C++ runtime functions, for example powf.

This whole process is explained very well in the article "How is Math.Pow() implemented in the .NET Framework?", and it can also be seen in the CoreCLR source code:


Results after simple performance improvements


I wondered whether the naive line-by-line port could be improved. After some profiling, I made two major changes:


These changes are explained in more detail below.

Removing the embedded array initialization


For more information on why this is necessary, see this excellent Stack Overflow answer by Andrey Akinshin, complete with benchmarks and assembly code. He comes to the following conclusion:

Conclusion

  • Does .NET cache hard-coded local arrays? Kind of: the Roslyn compiler puts them in the metadata.
  • Is there any overhead in this case? Unfortunately, yes: on each call the JIT copies the array contents from the metadata, which takes extra time compared to a static array. The runtime also allocates objects and creates memory traffic.
  • Should we worry about it? It depends. If it's a hot method and you want to achieve a good level of performance, you should use a static array. If it's a cold method that doesn't affect application performance, you should probably write "good" source code and keep the array in the method's scope.

The changes can be seen in this diff.
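The change amounts to hoisting a locally initialized array into a static readonly field so it is created once instead of on every call. A sketch of the general pattern (names and values are illustrative, not the actual diff):

```csharp
class LookupDemo
{
    // Before: the array is re-created on every call; the runtime must
    // allocate it and copy the values from metadata each time.
    public static int SampleBefore(int i)
    {
        int[] data = { 2, 4, 8, 16 };
        return data[i & 3];
    }

    // After: the array is allocated and filled once, then reused.
    static readonly int[] Data = { 2, 4, 8, 16 };

    public static int SampleAfter(int i) => Data[i & 3];
}
```

Both methods return the same values; only the per-call allocation and copy disappear.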

Using MathF functions instead of Math


Second, and more significantly, I improved performance by making the following change:

```csharp
#if NETSTANDARD2_1 || NETCOREAPP2_0 || NETCOREAPP2_1 || NETCOREAPP2_2 || NETCOREAPP3_0
    // intnv square root
    public static Vec operator !(Vec q)
    {
        return q * (1.0f / MathF.Sqrt(q % q));
    }
#else
    public static Vec operator !(Vec q)
    {
        return q * (1.0f / (float)Math.Sqrt(q % q));
    }
#endif
```

Beginning with .NET Standard 2.1, there are dedicated float implementations of the common math functions, located in the System.MathF class. For more information about this API and its implementation, see:


After these changes, the performance gap between the C# and C++ code narrowed to about 10%:

|                       | C++ (VS 2017) | .NET Framework (4.7.2) | .NET Core (2.2) TC OFF | .NET Core (2.2) TC ON |
|-----------------------|---------------|------------------------|------------------------|-----------------------|
| Time (sec)            | 41.38         | 58.89                  | 46.04                  | 44.33                 |
| Kernel time (sec)     | 0.05 (0.1%)   | 0.06 (0.1%)            | 0.14 (0.3%)            | 0.13 (0.3%)           |
| User time (sec)       | 41.19 (99.5%) | 58.34 (99.1%)          | 44.72 (97.1%)          | 44.03 (99.3%)         |
| Page faults           | 1,119         | 4,749                  | 5,776                  | 5,661                 |
| Working set (KB)      | 4,136         | 13,440                 | 16,788                 | 16,652                |
| Paged memory (KB)     | 89            | 172                    | 150                    | 150                   |
| Non-paged memory (KB) | 7             | 13                     | 16                     | 16                    |
| Page file (KB)        | 1,428         | 10,904                 | 10,960                 | 11,044                |

TC = Tiered Compilation (which I believe will be enabled by default in .NET Core 3.0)

For completeness, here are the results of several runs:

| Run        | C++ (VS 2017) | .NET Framework (4.7.2) | .NET Core (2.2) TC OFF | .NET Core (2.2) TC ON |
|------------|---------------|------------------------|------------------------|-----------------------|
| TestRun-01 | 41.38         | 58.89                  | 46.04                  | 44.33                 |
| TestRun-02 | 41.19         | 57.65                  | 46.23                  | 45.96                 |
| TestRun-03 | 42.17         | 62.64                  | 46.22                  | 48.73                 |

Note: the difference between the .NET Core and .NET Framework results is due to the absence of the MathF API in .NET Framework 4.7.2; for more information, see the ".NET Framework (4.8?) support for netstandard 2.1" ticket.

Further performance improvements


I am sure that the code can still be improved!

If you're interested in trying to close the remaining performance gap, here is the C# code. For comparison, you can look at the C++ assembly via the excellent Compiler Explorer.

Finally, if it helps, here is the Visual Studio profiler output showing the "hot path" (after the performance improvements described above):



Is C# a low-level language?


Or more specifically:

What features of the C#/F#/VB.NET languages, or of the BCL/runtime, enable "low-level"* programming?

* Yes, I understand that “low level” is a subjective term.

Note: every C# developer has their own idea of what "low level" means; C++ or Rust programmers would take these features for granted.

Here is the list I made:


I also put out a call on Twitter and got many more suggestions for the list:


So, in the end, I'd say that C# certainly lets you write code that looks very much like C++ and, in combination with the runtime and the base class libraries, it gives you a lot of low-level functionality.

Further reading



Unity Burst Compiler:

Source: https://habr.com/ru/post/443804/

