Does the true .NET Jedi, performance guru, multi-time Microsoft MVP and regular DotNext speaker Sasha Goldstein really need an introduction? Probably not. But if he does, take a look here.

In our conversation, Sasha shares professional tips for .NET and .NET Core developers, explains what to look for when profiling and debugging applications, and recommends which tools to use.
- Sasha, there are many articles and tips about performance in .NET, ranging from advice not to overuse exceptions and to prefer StringBuilder over string concatenation, to low-level optimizations. But .NET is constantly evolving, and new opportunities and new problems keep appearing. For the modern .NET 4.7, can you give any practical advice on how to optimize code performance?
Sasha Goldstein: Many people are already familiar with the “obvious” tips about string concatenation, exceptions, or boxing/unboxing, but there are still misconceptions about performance whenever new or higher-level APIs appear. I'm worried about the excessive use of LINQ in many code bases. Even though many years have passed since LINQ appeared, and there is plenty of data showing that most LINQ queries can run 10 times faster as plain loops, people still routinely reach for LINQ in performance-sensitive code. To be clear: I have nothing against LINQ as such, but it will not serve you well if your loop runs a million times.
Another thing we still haven't managed to drive home for everyone is the danger of excessive memory allocation. The garbage collector in .NET has seen some improvements, but it still can't help you if you allocate too much, especially when objects end up dying in the older generations. It is hard to teach everyone to pay attention to allocations, although some tools make it easier: for example, the Roslyn-based Heap Allocation Analyzer, or profilers such as dotMemory.
The last point I would like to make is that, thanks to container technologies, we are returning to times when memory footprint and startup time are critical for server applications, not just for workstations. If we plan to pack 300 instances of a microservice running in Docker onto the same physical host, we need to be very careful with memory management, avoid unnecessary dependencies, and get rid of unnecessary work. When you are used to 16-core, 32-gigabyte servers as your standard runtime environment, trying to squeeze your service into “--memory 256m --cpus 0.25” is a sobering experience. By the way, we face the same problems with “serverless” technologies, but there the deployment unit is so small that it is usually easier to live within the resource limits.
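For illustration, here is roughly what such a constrained deployment looks like; the image and container names are placeholders, only the limits are the ones quoted above:

```bash
# Run a hypothetical microservice with 256 MB of memory and a quarter of one CPU core.
docker run --memory 256m --cpus 0.25 --name my-microservice my-registry/my-service:latest
```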
- Today .NET is not only for Windows: there are Mono and .NET Core. Tell us, how good are these tools in terms of performance, especially compared to native Linux applications?
Sasha Goldstein: Mono has been around for a long time, but, to be honest, I haven't used it in production projects. I know it has a mature and decent runtime, but Mono never got the adoption I was counting on. In any case, I can't make any recommendations about its performance.
Today I can tell you more about .NET Core. On Linux, .NET Core uses more or less the same stack and code base as on Windows. The JIT compiler performs the same optimizations as on Windows, there is a way to avoid JIT compilation by compiling ahead of time (the tool is called Crossgen in .NET Core), there is the garbage collector with all its features, and so on. I would say the only part of .NET Core that can raise questions is the PAL (Platform Adaptation Layer), because it is really the only place with a code base completely different from the Windows version. There have indeed been performance problems in the PAL, mainly around misuse of Linux APIs, or use of Linux APIs whose performance behavior turned out to be unexpected compared to the similar APIs on Windows. These wrinkles will naturally be smoothed out over time, as more and more developers use .NET Core in production and run into these problems.
An interesting development that will take time to mature is CoreRT (the counterpart of .NET Native on Windows), which ultimately aims to produce a managed application that has no external dependencies and involves no JIT at all. Besides simplifying deployment (nothing shared, nothing to install, no dependency management), CoreRT will reduce startup time and memory usage by “shaking off” unneeded code. This could move .NET even closer to native Linux applications.
- One of the main tips, repeated in almost every article about improving .NET performance, is to use a profiler. At DotNext 2017 Moscow you will talk about debugging and profiling .NET Core applications under Linux. How are things with Visual Studio when it comes to debugging and profiling .NET Core applications?
Sasha Goldstein: Visual Studio currently offers nothing for profiling .NET Core applications on Linux. You cannot use Windows profiling tools for them; you have to do the profiling on Linux, and the only way to analyze the results on Windows (if you are so inclined) is to use PerfView, a somewhat more complex tool than the Visual Studio profiler.
Similarly, Visual Studio cannot open core dumps of .NET Core applications from Linux. The rich debugger experience you get when opening a Windows dump file in Visual Studio does not exist for core dumps; the memory analysis features introduced in Visual Studio 2013 do not work for core dumps either. In fact, no Windows tool can open application dumps from Linux. To work with them you have to use Linux tools, which at the moment offer far fewer conveniences.
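Producing the dump in the first place also happens on the Linux side. A minimal sketch, assuming gdb's gcore utility is installed; the PID and paths are placeholders:

```bash
ulimit -c unlimited                   # let the kernel write core files when a process crashes
cat /proc/sys/kernel/core_pattern     # check where crash dumps will land
sudo gcore -o /tmp/myapp-core <pid>   # take an on-demand dump of a live process (gcore ships with gdb)
```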
As you may have noticed, I'm not exactly happy about this, though I can't say I'm very surprised. Despite all the effort Microsoft puts into developing the platform, and despite the fact that we are already at version 2.0, we still do not have good tools for profiling and debugging. Three years after .NET Core was announced for Linux, there is still no decent profiler or debugger that the average developer could pick up. This is a serious limitation, and personally it sometimes makes me angry. I honestly do not recommend that my customers switch to .NET Core on Linux, because I know they will hit a wall when they try to debug or profile their production applications.
- In the Linux world, perf and ftrace are popular and provide quite good capabilities for analyzing application performance. Will they help with .NET Core? Are there any differences between using perf or ftrace on native Linux applications and on .NET Core applications?
Sasha Goldstein: Yes. The official workflow for profiling .NET Core applications on Linux is based on perf. Microsoft provides a Bash script called “perfcollect” that runs perf, collects performance data, bundles it into a single file, and invites you to open it on Windows using PerfView. Let's ignore for a moment how ludicrous that story is and talk about how the process works.
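For reference, the perfcollect workflow looks roughly like this; the trace name is arbitrary, and the download URL and verbs follow Microsoft's published usage as I understand it:

```bash
curl -OL https://aka.ms/perfcollect && chmod +x perfcollect
sudo ./perfcollect install            # installs perf and LTTng prerequisites
sudo ./perfcollect collect mytrace    # press Ctrl+C to stop collection
# The resulting mytrace.trace.zip is meant to be opened in PerfView on Windows.
```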
Perf is a multipurpose tool with many different modes of operation. In particular, it can be used as a profiler: it hooks into various system events, collects stack traces when those events fire, and then maps the return addresses in those stacks to function names. It also has some visualization capabilities, but these are often replaced by other tools, for example Flame Graphs. And perf is not just a CPU profiler: you can attach it to processes, to cache-miss events, to disk read and write events, to scheduler context switches, and to thousands of other static and dynamic events. The closest thing we have on Windows is the ETW toolchain (PerfView, Windows Performance Recorder, and so on), but perf also offers dynamic tracing (attaching to an arbitrary function in the kernel or in a user-space library), which ETW cannot do.
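As a minimal sketch of that workflow, CPU sampling with perf plus Brendan Gregg's FlameGraph scripts might look like this; the PID and file names are placeholders:

```bash
sudo perf record -F 99 -g -p <pid> -- sleep 30    # sample call stacks at 99 Hz for 30 seconds
sudo perf script > out.stacks
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.stacks | ./FlameGraph/flamegraph.pl > cpu.svg
```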
When using perf with .NET Core applications, there is one problem to overcome, and it has to do with address translation. When perf captures a stack trace, it needs to be able to map the return addresses on the stack to method names.
Since .NET Core uses JIT compilation, there is no static location where this mapping can be found (debuginfo, symbols, call it what you like). What perf does in this case is expect the target application to write a simple text file named /tmp/perf-$PID.map that maps method addresses to method names.
Indeed, .NET Core supports this convention: if you set the COMPlus_PerfMapEnabled environment variable to 1, the JIT will append to this text file every time a method is compiled, and perf will be able to use this information to resolve addresses. It's a pity you need to do this in advance: if you did not set the environment variable and want to profile an already running process, you are out of luck. But Node.js works the same way, for example, so I think this is more or less acceptable.
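Put together, the perf-map workflow described here looks roughly like this; the application name is a placeholder:

```bash
export COMPlus_PerfMapEnabled=1       # must be set before the .NET Core process starts
dotnet MyApp.dll &                    # the JIT now appends entries to /tmp/perf-$PID.map
APP_PID=$!
sudo perf record -F 99 -g -p $APP_PID -- sleep 10
sudo perf report                      # managed frames resolve through the perf map file
```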
The story has another twist, this time with AOT-compiled assemblies. If Crossgen is used for AOT compilation (.NET Core uses it for some of the assemblies in the .NET Core package you install), you need another way to obtain debug information for those assemblies.
The Crossgen tool itself can generate this information if you run it against the Crossgen-compiled assembly. So far so good, right? Not really. First, Crossgen is not installed with .NET Core, so you need to either build CoreCLR from source to get it or resort to a silly NuGet restore trick. Second, Crossgen emits the debug information in a format that is incompatible with what perf expects, so you need to reformat it and merge it into the main perf map file. And third, perf currently does not consult perf-map entries for memory regions that are backed by files on disk, which is exactly how Crossgen-compiled code is mapped. So even if you could produce a good perf map for these assemblies, perf would ignore it. Fortunately, there are other tools that still work in this case.
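For completeness, a heavily hedged sketch of generating perf-map information for a Crossgen-compiled assembly; the /CreatePerfMap switch, the assembly name and the paths below are assumptions and may not match your CoreCLR version:

```bash
./crossgen /Platform_Assemblies_Paths /usr/share/dotnet/shared/Microsoft.NETCore.App/2.0.0 \
           /CreatePerfMap /tmp System.Private.CoreLib.dll
# The emitted map then has to be reformatted and merged into /tmp/perf-$PID.map by hand,
# and, as noted above, perf may still ignore it for file-backed mappings.
```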
By the way, there are many other tools that are aware of .NET Core processes and work with them. In my talk at DotNext I will use perf together with BPF-based tools from BCC. We discussed BPF and BCC a few months ago at the JPoint conference, in a talk about profiling the JVM with BPF.
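A couple of the BCC tools in question, applied to a .NET Core process; the install path is the usual one for distribution packages and may differ on your system, and the PID is a placeholder:

```bash
sudo /usr/share/bcc/tools/profile -F 99 -p <pid> 30    # sample on-CPU stacks for 30 seconds
sudo /usr/share/bcc/tools/offcputime -p <pid> 10       # where threads spend time blocked (off-CPU)
```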
- I would like to ask a few questions about LTTng. The official documentation states that performance is LTTng's top priority. Does that hold when tracing .NET Core applications? Are there any limitations of LTTng when using tracepoints or kprobes?
Sasha Goldstein: Let's dot the i's first. LTTng is a powerful Linux tracing framework with two modes of operation. It has a kernel module that can attach to tracepoints, statically defined trace locations scattered throughout the kernel: scheduler events, disk accesses, process execution, and so on. In addition, LTTng has a user-space library that can be used to emit events from a user application, and that is what .NET Core uses on Linux. In both cases, LTTng is optimized for high-frequency events thanks to shared-memory buffers, a compact binary format, and fast writes to disk.
As you may know, .NET on Windows emits many ETW events that can be used to profile performance and understand the system's behavior. These include GC events, assembly loading, JIT compilation, object creation, and many others. Of course, ETW (Event Tracing for Windows) is not available on Linux, so Microsoft chose to use LTTng. You get the same events, but they are emitted via LTTng rather than ETW, with a few caveats.
First, you must set an environment variable (COMPlus_EnableEventLog=1); if you don't, LTTng will not emit any events at all. Second, LTTng does not support capturing stack traces for user-space events. This means you can intercept GC events, but you don't get the call stack on which they were raised; you can capture assembly-load events, but you don't know what code loaded the assembly. These are very painful limitations that reduce the usefulness of these events in real-world troubleshooting scenarios.
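A minimal LTTng session for these runtime events might look like this; the DotNETRuntime provider name is an assumption based on how CoreCLR registers its tracepoints, and the session and application names are placeholders:

```bash
export COMPlus_EnableEventLog=1                                 # must be set before the process starts
dotnet MyApp.dll &

lttng create dotnet-session
lttng enable-event --userspace --tracepoint 'DotNETRuntime:*'   # GC, loader, JIT events, ...
lttng start
sleep 30                                                        # let the workload run
lttng stop
lttng view | head -n 20                                         # or inspect the trace with babeltrace
lttng destroy
```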
- The LLDB debugger on Linux is quite similar to WinDbg on Windows. There is even a SOS extension available for it, which allows debugging managed code. How usable is it today for debugging .NET Core applications?
Sasha Goldstein: LLDB is a very powerful debugger, and Microsoft provides the library libsosplugin.so, which is the Linux version of SOS.dll. It offers almost the same command set with the same semantics, which is good if you are already familiar with SOS (although you still have to learn LLDB's own commands, which differ significantly from WinDbg's). But that is not the whole story, is it? With LLDB and libsosplugin you will run into the following obstacles (a short session sketch follows the list):
- LLDB plugins are tightly coupled to a specific version of LLDB. Since libsosplugin.so ships with the CLR, it is built against the LLDB version Microsoft used during its build process, which, at the time of writing, was LLDB 3.6. That is a rather old release of LLDB with plenty of known bugs, and it is practically impossible to install on many modern distributions without compiling it from source.
- LLDB before 4.0 does not understand core dumps generated on demand, so you can only open core dumps produced by a crash, or attach to a running process.
- When you open a core dump or attach to a running process, you have to teach the SOS plugin to match OS thread identifiers with managed .NET thread IDs. If you have 400 threads, this is really annoying (I have a script for that).
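A hypothetical LLDB session with the SOS plugin, for orientation; the LLDB version, plugin path and dump name are placeholders, and the SOS command names mirror their WinDbg counterparts:

```bash
lldb-3.6 $(which dotnet) --core /tmp/core.12345
(lldb) plugin load /usr/share/dotnet/shared/Microsoft.NETCore.App/2.0.0/libsosplugin.so
(lldb) sos ClrStack          # managed call stack of the selected thread
(lldb) sos DumpHeap -stat    # managed heap statistics grouped by type
(lldb) clrthreads            # managed thread list, to match against OS thread IDs
```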
- Developing a high-quality multithreaded application is a difficult task, but finding bottlenecks and bugs in such an application is even harder. What can developers use to debug and profile multithreaded .NET applications?
Sasha Goldstein: One could talk about this forever, so I will be brief. There are several important techniques, and modern tools can automate them:
- Understanding which code causes frequent blocking (and not only blocking). This can be done by analyzing context-switch events.
- Understanding how the workload is distributed across threads. This is often done visually, with tools such as the Visual Studio Concurrency Visualizer or the Timeline view in dotTrace.
- Understanding the different kinds of oversubscription: CPU throttling (for example, due to containers), priority problems, or simply not having enough cores for all your threads. This can be done by analyzing context-switch events and building histograms of how long a thread runs on a CPU versus how long it waits in the run queue for its turn, assuming nothing internal is keeping it off the CPU (see the sketch after this list).
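One way to build such histograms on Linux is with BCC tools; this is a rough sketch where the tool paths are the usual package defaults and the PID is a placeholder:

```bash
sudo /usr/share/bcc/tools/cpudist -p <pid> 10    # histogram of time threads spend on-CPU per scheduling interval
sudo /usr/share/bcc/tools/runqlat -p <pid> 10    # histogram of time spent waiting in the run queue
```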
- What talks have you prepared for DotNext 2017?
Sasha Goldstein: In Moscow I will talk about profiling and debugging .NET Core applications on Linux. It is the result of many months of research. I will cover some of the tools and techniques I mentioned above, along with live demonstrations of common performance problems in .NET Core applications. As a special bonus, I will show how to use some of these tools to profile a .NET Core application running in a Docker container. All of my demos are available on GitHub, so you can experiment with them after the conference.
And for those who want full immersion, an eight-hour hands-on training, “Production Performance and Troubleshooting of .NET Applications”, will be held in Moscow on November 11, dedicated to tools and approaches for monitoring and solving performance problems in production.
If you love .NET internals as much as we do, you may also be interested in the talks by other experts at the upcoming DotNext 2017 Moscow conference, where more than 30 speakers will present on the present and future of the .NET platform, performance optimization and multithreading, the internals of .NET and the CLR, and profiling and debugging .NET code.