
What is C#? Object-oriented Esperanto with a garbage collector, functional goodies and a free afternoon massage. It lets you write Really Important Things while hiding the unnecessary details of working with memory, the processor and other low-level machinery. Naturally, there are people with a high level of curiosity in their blood who want to know how .NET actually works (of course, they study .NET solely to increase the productivity of the software they develop). Today we talk to:
- Sasha Goldstein. A regular DotNext speaker, author of "Pro .NET Performance" and a multiple-time MVP. Recently the performance of the language itself stopped being enough for him, and he decided to squeeze the most out of the hardware.
- Karlen szKarlen Simonyan. Author of atomics.net, a novice WebKit committer, and a specialist in just-in-time compilation.
Sasha Goldstein
Good afternoon, Sasha. How did you get down to the hardware level? Interest or harsh necessity?
I am a consultant, I work with many clients. Sometimes there are situations when you need to optimize an algorithm or a piece of code that can no longer be optimized by other means. You may know .NET at a very deep level, but there are no options left other than the hardware. At the same time, if you understand how the processor works, if you have a clear picture in your head of what it does, you can organize the code differently: lean on other processor instructions, make better use of the cache. Some examples, like vectorization, I have already given at one of the previous DotNext conferences, and I will talk about others at the next one.

So is this a tool of last resort? Or can it be used on a permanent basis?

It depends. For many applications serious optimization is simply not needed, and in some situations you just can't go down to the lower level. On the other hand, if you have expensive machines with a lot of gigahertz on board, it makes sense to squeeze the maximum out of them. In my practice, approximately 10 percent of clients found themselves in situations where such optimizations were justified.
Why C#? In the .NET world, performance work traditionally goes to C++.

On the one hand, C++ is more flexible and its compilers are more focused on hardware-level optimization. On the other hand, there is the cost of maintaining C++ code, because not every C# developer knows C++. If you process signals or images and you know C++, go ahead. Otherwise, alas. As an example, the folks at Stack Overflow hit cases where they need maximum performance in a far from trivial piece of their code base, and dropping C++ in there is not justified: there are no suitable specialists, and the maintenance burden doubles. Bringing a dedicated C++ specialist into the team is dangerous too; nobody has repealed the bus factor.
What is the typical effect of low-level optimizations?

Of course, it strongly depends on the situation. Vectorization can speed an algorithm up by a factor of 8. Cache-friendly code, up to ten times. A few tens of percent come from choosing the right instructions. In total, you can speed a piece of code up a hundredfold. Then again, maybe there is only one loop that can be accelerated by vectorization or parallelization, or maybe there are several dozen such places. Naturally, the cumulative effect of ten optimizations is more noticeable (in absolute terms).
How do you search for hardware bottlenecks? Experience, intuition, an off-the-shelf tool?

I do not believe in optimizing at random. Intel and AMD now release utilities for tracking hardware behavior: for Intel it is VTune Amplifier, for AMD the Catalyst utilities. They used to be quite primitive: they detected cache misses and helped a little with instructions, but now they give a variety of hints, from memory optimization to vectorization. The developer is still required to know how to turn these hints into code, but the utilities handle the analysis perfectly well.
So, about the guides for Intel and AMD. How much do they differ? Will you have to learn everything twice?

For desktops and servers in the .NET world these two are enough, and the guides differ about as much as the instruction sets do. If anyone else enters this market, their instructions are likely to be almost the same.

For mobile devices there is ARM, which will require additional time to study: a different instruction set, a different architecture.
Now there are Bitcoin mining devices, there is CUDA, and rumors about dedicated neural-network hardware have been floating around. Is an era of "each task gets its own board" waiting for us? Or will that kind of zoo remain a JS-frameworks thing?

I hope we will not see the same diversity in hardware. Even now, besides the specialized boards, the same Bitcoin can be mined on ordinary video cards. The trend is certainly interesting, but it is unlikely to take off: producing a board for one individual task is too inflexible a solution. Intel is trying to reach the point where its latest processors compete with video cards on price and quality.
What about compilers? The Java folks have plenty of them; how are things with us?

The answer to this question has two parts. On the one hand, we now have the fairly smart and easily extensible Roslyn compiler, and that is enough. On the other hand, things are worse with JIT compilation. Look at CoreCLR: a lot of possible optimizations are not applied, to save compilation time. However, the project is still young and work on it continues. In any case, if new compilers appear and they can compile code on any machine and any platform, good. If a war of standards begins, the ecosystem will only suffer.
At one of the previous conferences you mentioned code that executes differently on different processors. How acute is this problem for hardware-level optimizations? In particular, how important is it when working in the clouds?

This is a persistent problem in many areas: memory, I/O. You test the program on an SSD, in the cloud it runs on a slower device, and you get a serious performance problem. Speaking of clouds: you order the processors you need, the ones that suit you in performance and price. Problems during migration are possible, but I have not run into them for several years. In any case, Microsoft is unlikely to decide to swap out every processor of a given type by tomorrow morning. Updates are incremental, so you can count on a migration taking several years, and that is enough time to prepare.
What about code complexity? How much does optimization at this level cost in maintenance?

If there is only one person on the team who can do it, it is better to buy more hardware. But usually such things can be localized in one "black box": a standard application sends requests to a database or service, (de)serializes JSON or XML, writes logs; there are few places for serious optimizations. Also keep in mind that really complex optimizations are rarely used; usually it is simple vectorization. It is the same as with collections: standard lists, arrays and dictionaries are used most often, and the other options are considered far less frequently.
How likely is the situation where you create a solution, hide it in a black box, and then a colleague kills it? For example, you work carefully with the cache, and then a colleague spins up 10 threads per processor, causing numerous errors.

This is possible. Your colleagues must understand the limitations of your black box. Plus, you need automated regression tests.
Andrey DreamWalker Akinshin said that performance tests are dangerous and not entirely reliable. Windows updates itself, or other software on the agent machine does, and all agents must be configured exactly the same.

Yes, this is a standard problem of automated benchmarks. You can usually ignore the first or second failure, but if there are 5 failures in a row, you need to run the test manually.
How useful is knowledge of the hardware when working with other languages (not .NET)?

In my talk I will deal mainly with C#, with some C++. In general, the answer depends on the language: the easier it is to get at the memory, the greater the benefit. C# lets you work with pointers; Java, as I recall, does not. And in JS it is completely pointless to bother with such things. But in general, this knowledge will clearly be useful when working with server-side languages.
Karlen Simonyan
Karlen, you have been studying the internals of .NET for a long time. Is it an interest or a practical necessity? Can you give a couple of the most effective examples of JIT optimization from your practice?

Initially it was, of course, interest. But for the last 4 years I have been developing multi-threaded and distributed applications on .NET, which requires knowing the specific capabilities of the platform and its structure. You cannot, for example, study just how the GC works; everything is very interconnected. Both the framework API and the JIT play an important role, and the latter is directly responsible for the performance of our code.
It seems to me that one of the most effective optimizations is the cooperation of the runtime itself with the JIT. The way the CLR dispatches calls on interfaces, for instance, is very interesting. One of the blocks of my talk will be devoted to this topic.
At DotNext 2016 Moscow, will you focus on RyuJIT or talk about JIT compilers in general?

At the conference we will review both RyuJIT and the updated x86 JIT. This is due to two factors. First, sooner or later 32-bit applications will have to leave the stage. Second, RyuJIT targets only 64-bit applications, and a lot of effort has gone into its development, including the ARM port. An in-depth study of the x86 compiler's behavior (which is quite predictable) is reasonable, but only in the short term. There is also the legacy x64 JIT, but it is less efficient than the x86 one.
I hope the information from my talk will be useful not only within .NET but will also help in understanding some basic concepts of the managed-platform world in general. The focus will be on the cooperation of the GC and the JIT, the internal structure of methods and objects, and benchmarks.
Choosing the right collection or algorithm can speed a program up by an order of magnitude. How effective are JIT optimizations? Can they be considered an everyday tool?

The JIT compiler constantly balances the efficiency of the generated code against the speed of generating it: the output code must be optimized, yet the compilation process itself must be fast.
Many optimizations are known from theory and implemented in C++ compilers, and they are gradually migrating to RyuJIT. Some may come as a surprise, especially the copy-propagation technique, which can break "bad" code: code that ran without side effects under the x86 JIT but whose problems are revealed under legacy x64 and RyuJIT.
With the arrival of RyuJIT, a long-requested feature appeared: SIMD support. This is where one of the advantages of just-in-time compilation shows itself: the compiler detects the processor architecture on the fly, and code using the new Vectors API is compiled to either SSE2 or AVX instructions (or any other SIMD set), unlike AOT compilation, where you have to specify the minimum CPU architecture. There are alternatives, for example "smart" C++ compilers that support conditional compilation of code sections, but that is a different story.
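For illustration, a minimal sketch (the helper name is mine) of what the Vectors API from System.Numerics looks like in practice; the number of elements processed per iteration is chosen by the JIT for the CPU it runs on:

```csharp
using System.Numerics;

static class SimdSum
{
    // Sums an array with Vector<float>; RyuJIT lowers the body to
    // SSE2 or AVX depending on the machine it is executing on.
    public static float Sum(float[] values)
    {
        var acc = Vector<float>.Zero;
        int i = 0;
        // Vector<float>.Count is 4 with SSE2, 8 with AVX.
        for (; i <= values.Length - Vector<float>.Count; i += Vector<float>.Count)
            acc += new Vector<float>(values, i);

        float sum = Vector.Dot(acc, Vector<float>.One); // horizontal sum
        for (; i < values.Length; i++)                  // scalar tail
            sum += values[i];
        return sum;
    }
}
```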
The main imperative of development is complexity management. How much does the code get complicated by accounting for JIT behavior? How much harder does maintenance become?

The main goal of any compiler is to not change the behavior of the code through its optimizations. And here the main problem comes to light: even unoptimized code can behave differently on different architectures. It sounds like stating the obvious, but it is true. Blame the fondness of some CPUs for aggressive out-of-order execution. On desktops and servers the x86-64 architecture dominates, and it is very conservative about reordering instructions, while on mobile systems ARM is everywhere. This is worth keeping in mind, because C# has already settled onto mobile devices, yet we write code on x86 systems, where the emulators, accordingly, also execute x86 code.
It may seem that all of this complexity falls onto our shoulders as developers. In part that is true, but the JIT itself tries to produce code compatible with x86-64 behavior.
In general, the memory model should answer the questions of which reorderings are permissible, which optimizations are possible, and how compatible they are. Unfortunately, in .NET (i.e. in CLI implementations) this is poorly described. More precisely, the ECMA-335 standard defines a memory model in chapter "12.6 Memory model and optimizations", but it describes a system with a weak model, while x86 is an architecture with a strong model, i.e. it uses acquire/release semantics by default. Starting with CLR 2.0, the implemented model moved closer to x86, and the JIT began to generate code compatible with it for ARM and for Itanium (which is no longer supported).
As you can see, this topic is far from trivial. The answer to the question "What to do?" is to use the ready-made primitives and APIs where these problems are already solved.
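As a small illustration (the type and field names are mine), Volatile.Read and Volatile.Write are exactly such ready-made primitives: they give acquire/release semantics whether the code ends up running on x86 or on a weakly ordered ARM:

```csharp
using System.Threading;

class Completion
{
    private int _done;      // 0 = in progress, 1 = finished
    private string _result;

    // Release semantics: the write to _result is guaranteed to be
    // visible before the write to the flag, even on ARM.
    public void Finish(string result)
    {
        _result = result;
        Volatile.Write(ref _done, 1);
    }

    // Acquire semantics: if we observe _done == 1, _result is visible.
    public bool TryGetResult(out string result)
    {
        if (Volatile.Read(ref _done) == 1)
        {
            result = _result;
            return true;
        }
        result = null;
        return false;
    }
}
```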
Usually a bottleneck is sought before optimizing. What approaches and tools allow you to find one at the lower levels?

The main tool is a profiler, whether a performance profiler or a memory one.
I would single out two classes of problems that most often occur locally in an application: suboptimal use of the processor cache together with unaligned data, and overly expensive synchronization.
A concept like false sharing has long been known in narrow circles, but only recently has it started gaining wider attention. To reduce these side effects it is sometimes necessary to refactor data structures, which leads to a general improvement in the responsiveness of the system.
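A minimal sketch of the typical refactoring (names are mine): two counters written by two different threads are pushed onto separate cache lines, so an increment on one core stops invalidating the other core's line:

```csharp
using System.Runtime.InteropServices;

// Packed together, CounterA and CounterB would share one 64-byte cache
// line, and two threads incrementing them would fight over it (false
// sharing). Explicit layout puts them 64 bytes apart.
[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounters
{
    [FieldOffset(0)]  public long CounterA; // cache line 0
    [FieldOffset(64)] public long CounterB; // cache line 1
}
```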
Solving the problem of expensive synchronization is, in fact, quite nontrivial: moving to lock-free code is not easy, and verifying it is even harder. And Amdahl's law has not been repealed either.
If a quick look at the code and the profiler does not reveal the problem, you should go down a level, to the CPU. This is where hardware performance counters come to the rescue, i.e. "iron-level" counters. You can examine cache misses and so on; there are usually plenty of them. To read them you can use the operating system's API, but that means writing such a profiler yourself. It is best to use the ready-made tools from the manufacturers of the target processors themselves.
Benchmarks. The shorter the execution time, the bigger the rake you can step on. What tools do you use for measurements?

I fully agree with the thesis. But before measuring anything, the first thing to do is define the metrics, i.e. what we will measure: memory consumption, CPU utilization efficiency, which areas are allowed to change, and so on. The thing is, .NET is a world with a GC, which significantly increases the entropy of the system. A benchmark may show that a given piece of code is "fast", but that does not mean it is efficient. Perhaps its memory consumption is very high, and that will be felt later, during garbage collection itself. Therefore I begin any study with the question: "What do we want?"
The problem with benchmarks is that they do not measure the complete picture of a real application.
There are two approaches to building a benchmark: embed "counters" into the code (the approach profilers use), or test individual pieces of the application and/or do microbenchmarking.
The first is a very comprehensive approach, and it is best done with a performance profiler together with a memory profiler. The results may differ slightly from the real code (after all, the instrumentation rewrites the code), but it shows the full picture. There are many tools of this kind; choosing between them is a question of convenience and preference.
The second way is also difficult: it requires precise measurement, a dose of statistics, and minimal impact of the benchmark harness on the results. Writing such code yourself is hard and tedious. With the advent of the remarkable BenchmarkDotNet library, this business has become easier. I use it.
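A minimal microbenchmark sketch in the BenchmarkDotNet style (the benchmark subject is my own example); the library takes care of warm-up, iteration counts and statistics, and [MemoryDiagnoser] also reports allocations, which matters precisely because "fast" does not mean "efficient" in a world with a GC:

```csharp
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // report allocations alongside timings
public class ConcatBenchmark
{
    [Params(10, 100)]
    public int N;

    [Benchmark(Baseline = true)]
    public string PlusConcat()
    {
        string s = "";
        for (int i = 0; i < N; i++) s += i; // allocates a new string each pass
        return s;
    }

    [Benchmark]
    public string BuilderConcat()
    {
        var sb = new StringBuilder();
        for (int i = 0; i < N; i++) sb.Append(i);
        return sb.ToString();
    }
}

class Program
{
    static void Main() => BenchmarkRunner.Run<ConcatBenchmark>();
}
```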
In the last couple of years the .NET world has become richer in compilers. Will this let us catch up with Java? Is it worth doing?

Of course, the JIT in HotSpot is excellent. But each platform focuses on its own usage scenarios. In the Java language, all methods are virtual by default. This creates a problem: how do you implement method calls optimally? The technique of dynamic deoptimization comes to the rescue: in the absence of an override, a method is treated as non-virtual, and if a subclass overriding it appears, the JIT recompiles the code. In .NET, all the JIT compilers use the technique of recompiling certain sections of code (stubs). The case of interfaces, as I said above, is especially interesting.
Many of the optimizations possible in HotSpot are permissible, in my opinion, because of a stricter approach to the capabilities of the environment: .NET and C# are closer to the hardware and to cooperation with native code, while Java is not. For example, both environments, the JVM and the CLR, align class fields optimally, but the latter also allows manual control, just like C. The @Contended annotation has appeared for Java, but it is not fully equivalent to the StructLayout and FieldOffset attributes in .NET. Also, more aggressive use of CPU registers and a more aggressive inliner, as well as C2 (the final optimizing compiler), give the JVM a definite edge, while the CLR JIT follows only the fastcall calling convention and inlines less.
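Besides padding (as in the false-sharing sketch above), this manual control allows C-style tricks that the JVM does not offer; a small illustrative sketch (names are mine) of overlapping fields forming a union:

```csharp
using System.Runtime.InteropServices;

// Overlapping fields reinterpret a float's bits without unsafe code,
// the way a union would in C.
[StructLayout(LayoutKind.Explicit)]
struct FloatBits
{
    [FieldOffset(0)] public float Value;
    [FieldOffset(0)] public uint  Bits;
}

// var fb = new FloatBits { Value = 1.0f };
// fb.Bits is now 0x3F800000, the IEEE 754 representation of 1.0f.
```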
But .NET also has unsafe. Some use it, some do not. If the JIT's optimizations are not enough, you just go and optimize by hand.
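A minimal sketch (the helper is mine) of the kind of hand optimization unsafe enables, walking raw pointers instead of paying an array bounds check on every element:

```csharp
// Requires compiling with /unsafe.
static unsafe void Fill(int[] array, int value)
{
    fixed (int* start = array) // pin the array so the GC cannot move it
    {
        int* p = start;
        int* end = start + array.Length;
        while (p < end)
            *p++ = value; // no bounds check on this path
    }
}
```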
RyuJIT is still a young compiler, but on some tasks I notice an improvement in the quality of the generated code (compared to the legacy x86 and x64 JITs), which is good news.
How fragile is the resulting code? What is the probability that a less experienced colleague will break it (with the best of intentions)?

It seems to me that such optimized parts of the code need comments like "Do not touch!". Optimizations usually affect the layout of structs, the placement of fields, the order of method arguments and so on, which can lead to poorly readable code. In performance-critical code, LINQ, for example, should not appear at all: either replace it with your own methods that allocate far less memory, or use simple constructs a la for loops. To beginners this looks "ugly", and they start editing the code.
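A sketch of the trade-off he describes (names are mine): the two methods compute the same thing, but the "ugly" one allocates nothing on the hot path:

```csharp
using System.Collections.Generic;
using System.Linq;

static class HotPath
{
    // Readable, but Where/Sum allocate an enumerator and a closure
    // object (capturing 'threshold') on every call.
    public static int SumAboveLinq(List<int> items, int threshold) =>
        items.Where(x => x > threshold).Sum();

    // The "ugly" version a beginner may be tempted to rewrite:
    // zero allocations and JIT friendly.
    public static int SumAboveFor(List<int> items, int threshold)
    {
        int sum = 0;
        for (int i = 0; i < items.Count; i++)
            if (items[i] > threshold)
                sum += items[i];
        return sum;
    }
}
```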
Standard optimizations: algorithms, cache, parallelization. How do cache work and parallelization interact with JIT optimizations? Do they block them? Complement them?

One of the main optimizations performed by modern JIT compilers is loop unrolling. It is hard to overstate how much this helps the branch predictor of modern CPUs. True, the legacy x64 JIT can do the unrolling itself, though only under certain conditions; RyuJIT is catching up, and x86 rarely unrolls. Parallel execution of instructions can sometimes speed code up greatly. Granted, the JITs in .NET are not the most advanced in this regard, but the overall performance of the generated code is excellent.
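For readers unfamiliar with the transformation, here is loop unrolling written out by hand (a sketch; the JIT applies the same idea itself when conditions allow): four additions per iteration mean four times fewer loop branches for the predictor to guess:

```csharp
static long SumUnrolled(int[] a)
{
    long sum = 0;
    int i = 0;
    int limit = a.Length - a.Length % 4;
    for (; i < limit; i += 4) // one branch per four elements
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < a.Length; i++) // scalar remainder
        sum += a[i];
    return sum;
}
```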
Another nice bonus is the more aggressive inlining in RyuJIT, which is good news.
How much do JIT optimizations limit Win-Linux and Intel-AMD-ARM migrations?

The optimizations themselves are not the limitation. The CLR is heavily tied to the Win32 API (and CoreCLR is no exception). Yes, it has a Platform Abstraction Layer, but things such as lightweight locks (critical sections), for example, had to be ported to the other operating systems. Until recently, background GC was unavailable in CoreCLR on Linux, but a patch born of community effort has corrected that.
As for Intel versus AMD, it seems to me that the JIT teams focus more on the features of Intel processors. The RyuJIT port to ARM is now under active development. The process is still ongoing: there are code generation bugs, but the community helps close them. If the code is single-threaded, the difference in behavior on ARM is hard to notice; fully out-of-order execution can make itself felt in multi-threaded code. But this problem is being solved.
If you have read these interviews and still want more, come to DotNext 2016 Moscow. Besides Sasha and Karlen, these speakers will also be talking about performance there: