
Memory optimization: difficult, but sometimes necessary

On the eve of DotNext 2017, we talked with Andrey Akinshin, who works on application optimization (including .NET applications) at JetBrains. At the conference he will explain how to track down and fix various memory-related problems, both general ones and those specific to .NET. As a preface to the talk, we discussed where memory optimization fits into the overall struggle for application performance.



- Tell us about yourself and your work. What is the role of memory optimization in your work?

Andrey Akinshin: My name is Andrey Akinshin, and I work at JetBrains, where I spend a lot of time optimizing applications. Among other things, we are developing Rider, a cross-platform .NET IDE based on IntelliJ and ReSharper.
It is a very large product that can do a great many things, so naturally it has to be well optimized in terms of how it works with memory: memory handling must not slow the product down, and memory consumption should be as low as possible. To do this kind of optimization work, you need a good understanding of how memory is organized and what problems can arise with it.
In my spare time I also develop an open source project, BenchmarkDotNet. By now it has grown quite large, and its development is supported by the .NET Foundation. The library helps people write benchmarks that measure the performance of specific pieces of code.

In general, when it comes to memory, measuring performance accurately is quite difficult. You have to understand and account for many factors that can affect the results. One experiment (one benchmark) is therefore often not enough to draw conclusions about memory performance as a whole. It is also important to understand that different processor architectures and JIT compilers require separate performance studies. BenchmarkDotNet simplifies this work.
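For readers who have not seen it, a minimal BenchmarkDotNet benchmark looks roughly like the sketch below; it compares sequential and strided reads over the same array. The class and method names are illustrative and not taken from any real project.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Illustrative sketch of a BenchmarkDotNet benchmark class.
public class MemoryAccessBenchmarks
{
    private readonly int[] data = new int[1 << 20]; // ~4 MB of ints

    [Benchmark(Baseline = true)]
    public long Sequential()
    {
        long sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    [Benchmark]
    public long Strided()
    {
        long sum = 0;
        // Step of 16 ints = 64 bytes, i.e. roughly one cache line per access.
        for (int start = 0; start < 16; start++)
            for (int i = start; i < data.Length; i += 16)
                sum += data[i];
        return sum;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<MemoryAccessBenchmarks>();
}
```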

- In your opinion, where should work with memory sit in the optimization process? Or is it entirely individual and depends on the details of the application?

Andrey Akinshin: Of course, a lot depends on the application. It is very important to understand where the bottleneck of your particular program is. For example, if you work a lot with databases, the bottleneck is most likely the database itself, and that is what you should think about first. Or maybe you spend 99% of your time on network operations. You always need to understand what your overall performance is actually made of.

However, in a great many projects there are situations where the hot operations hit neither the database, nor the network, nor the disks, but the computer's main memory. And we would like those parts of the code to run as fast as possible.

But if it seems to you that some component is slow, there is no need to rush into optimizing memory, or anything else for that matter. First, you should establish that you really do have a specific problem that degrades application performance. Second, you need to formulate what exactly does not suit you and how much you need to speed the application up, because optimization for its own sake is not something you should be doing; it is important to understand the business goals behind the work. Only once the goals are set, once you know which objective metrics you want to improve and by how much, is it worth looking at what limits the program's speed. If it is memory, optimize memory. If it is something else, work in that direction.

- Why is memory benchmarking so complicated?

Andrey Akinshin: The thing is, if we measure the running time of the same program twice, we will get two different results. After performing many measurements we get a distribution with a certain variance, and for code that works with main memory this variance is quite large. To improve a program in terms of memory access performance, you need a very good understanding of what this distribution looks like and what can affect the final speed of the program regardless of our program logic.
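As a crude illustration of that variance (and not a replacement for a proper benchmarking harness), one can time the same memory-bound loop many times and look at the spread, not just a single number:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Rough sketch: repeated timings of the same memory-bound loop.
public static class DistributionDemo
{
    public static void Main()
    {
        var data = new byte[64 * 1024 * 1024];
        var timings = new double[30];
        long blackhole = 0;

        for (int run = 0; run < timings.Length; run++)
        {
            var sw = Stopwatch.StartNew();
            long sum = 0;
            for (int i = 0; i < data.Length; i++)
                sum += data[i];
            sw.Stop();
            timings[run] = sw.Elapsed.TotalMilliseconds;
            blackhole += sum; // keep the result alive so the loop is not optimized away
        }

        double mean = timings.Average();
        double stdDev = Math.Sqrt(timings.Select(t => (t - mean) * (t - mean)).Average());
        Console.WriteLine($"mean = {mean:F2} ms, stddev = {stdDev:F2} ms (blackhole: {blackhole})");
    }
}
```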

The speed of memory access is influenced by a lot of factors we usually do not think about: we can take measurements on our machine under certain conditions and perform some optimizations, while the picture on a user's computer will be completely different, because the hardware is different.

- In your talk at DotNext you are going to cover a lot of low-level things. Do .NET programmers really need to understand the nuances of how the CPU works to do optimization?

Andrey Akinshin: In most cases, no. The main performance problems are not particularly sophisticated and can be solved with common knowledge. But when the simple problems are solved, performance is still limited by memory, and the speed still needs to be increased somehow, low-level knowledge will not be superfluous.

I'll start with a familiar thing: the processor cache. It is very important to write algorithms that are reasonably cache-friendly. It is not that difficult and does not require any deep knowledge, and it can buy you a lot of speed.
Unfortunately, many profilers do not let you simply get the number of cache misses. But there are specialized tools that let you look at the hardware-level picture. I use Intel VTune Amplifier; it is a very good tool that shows problems not only with the cache but also with other things, for example with alignment (if you access a lot of unaligned data, performance may suffer because of it).
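A classic, if simplified, illustration of cache friendliness is traversal order over a two-dimensional array: the row-wise and column-wise sums below do exactly the same amount of arithmetic, yet the column-wise version typically runs noticeably slower because of cache misses. The sizes and names are illustrative only.

```csharp
using System;
using System.Diagnostics;

public static class CacheFriendliness
{
    private const int N = 4096;
    private static readonly int[,] Matrix = new int[N, N];

    // Walks memory sequentially: cache-friendly.
    public static long SumByRows()
    {
        long sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += Matrix[row, col];
        return sum;
    }

    // Jumps N * 4 bytes between consecutive accesses: cache-hostile.
    public static long SumByColumns()
    {
        long sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += Matrix[row, col];
        return sum;
    }

    public static void Main()
    {
        var sw = Stopwatch.StartNew();
        SumByRows();
        Console.WriteLine($"By rows:    {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        SumByColumns();
        Console.WriteLine($"By columns: {sw.ElapsedMilliseconds} ms");
    }
}
```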

There are things that many people do not know or do not think about, for example store forwarding or 4K aliasing: they can easily spoil our benchmarks and lead to incorrect conclusions. We can lose speed simply by accessing two addresses at the same time when the distance between them happens to be "unlucky". Optimizing for these effects is unlikely to give you some giant performance gain, but it can help where nothing else can. And a lack of understanding of these internals can easily lead to writing incorrect benchmarks. So it is useful to be able to look at hardware counters, analyze them, and draw conclusions about the direction in which optimization work should continue.
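To get a feel for "unlucky" address distances, one can sweep the offset between a source and a destination block and compare copy timings; a rough, hand-wavy sketch is below. Whether and where dips appear depends on the specific CPU, so nothing here is guaranteed - hardware counters (for example, in VTune) are the way to confirm what is actually going on.

```csharp
using System;
using System.Diagnostics;

// Sketch: copy a block within one buffer at slightly different distances
// and watch how the timing changes.
public static class OffsetSweep
{
    public static void Main()
    {
        var buffer = new byte[9 * 1024 * 1024];
        const int blockSize = 1024 * 1024;
        const int dstBase = 4 * 1024 * 1024; // a multiple of 4096, so the low
                                             // address bits of the distance
                                             // are controlled by 'offset'

        for (int offset = 0; offset <= 64; offset += 8)
        {
            var sw = Stopwatch.StartNew();
            for (int rep = 0; rep < 50; rep++)
                for (int i = 0; i < blockSize; i++)
                    buffer[dstBase + offset + i] = buffer[i];
            sw.Stop();
            Console.WriteLine($"distance = 4 MB + {offset,2} bytes: {sw.ElapsedMilliseconds} ms");
        }
    }
}
```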

- Now many people are talking about the development of cross-platform .NET-applications. Are there any differences between working with memory under different runtimes and operating systems?

Andrey Akinshin: Of course. Mainly these are high-level differences related to garbage collection. For example, the garbage collection algorithms in Mono and in the full .NET Framework are completely different: generations are handled differently, and large objects are processed differently.

By the way, large objects are a fairly common problem. In the full framework we have the Large Object Heap (LOH), which holds objects larger than 85,000 bytes. While the regular heap is constantly checked, cleaned up and defragmented by the garbage collector, the large object heap is compacted rarely (by default not at all, although in recent versions of the framework you can trigger compaction manually if you think it is needed). Therefore you have to work with this heap very carefully and make sure as few objects as possible end up there.
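The manual compaction mentioned here is exposed through a standard API starting with .NET Framework 4.5.1; a minimal sketch:

```csharp
using System;
using System.Runtime;

public static class LohCompactionExample
{
    // The setting is one-shot: the next blocking full collection compacts
    // the large object heap, and the mode then resets to Default.
    public static void CompactLargeObjectHeapOnce()
    {
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect();
    }
}
```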

And under Mono we have not 3 generations but 2; the concept of large objects exists there too, but it is handled completely differently. So our old heuristics, which we used for large objects under Windows, will not work on Linux under Mono.



- Could you tell us about some problem from production?

Andrey Akinshin: In Rider we have plenty of problems with that same large object heap.
If we create a very large array (one that falls into the LOH) for a fairly short period of time, that is not great, because it increases LOH fragmentation. The classic solution in this situation is an auxiliary class, the so-called chunk list: instead of allocating one large array, we create several small arrays (chunks), each small enough not to fall into the LOH. On the outside we wrap them in a nice interface so that they look like a single list to the user. This saves us from a bloated LOH and gives a nice win in memory consumption.
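A minimal sketch of the chunk-list idea (not the actual ReSharper implementation) might look like this; the chunk size is chosen so that a chunk of 8-byte elements (references, longs, doubles) stays well below the 85,000-byte LOH threshold:

```csharp
using System.Collections;
using System.Collections.Generic;

// Illustrative chunked list: data lives in many small fixed-size arrays
// instead of one large backing array that would land in the LOH.
public sealed class ChunkedList<T> : IEnumerable<T>
{
    private const int ChunkSize = 8192;
    private readonly List<T[]> chunks = new List<T[]>();
    private int count;

    public int Count => count;

    public T this[int index]
    {
        get => chunks[index / ChunkSize][index % ChunkSize];
        set => chunks[index / ChunkSize][index % ChunkSize] = value;
    }

    public void Add(T item)
    {
        if (count % ChunkSize == 0)
            chunks.Add(new T[ChunkSize]); // each chunk stays out of the LOH
        chunks[count / ChunkSize][count % ChunkSize] = item;
        count++;
    }

    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < count; i++)
            yield return this[i];
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```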

We have used this solution in ReSharper for a long time. However, now that we are building Rider (essentially running ReSharper on Mono under Linux), this trick does not work as intended: as I said, in Mono the heap logic is completely different, and such chunking not only fails to give a positive effect in terms of performance but in some cases even hurts how memory is used. So now we are looking at how best to optimize such places so that they work efficiently not only under Windows, but also under other operating systems (Linux or macOS).

- How should you start working with memory as part of improving the performance of the application?
Andrey Akinshin: The first thing you should always start with is measurements. I have seen plenty of programmers who try to figure things out by eye ("surely we allocate too many objects here - let's go optimize this place right now"). Especially in a large application, that approach rarely ends well. Of course, you need to take measurements.
There are various memory profilers for .NET, for example dotMemory. It is a very good tool: it lets you look for memory leaks, identify various problems, and see which objects take up how much memory and how they are distributed across heaps, generations, and so on.
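Even before reaching for a profiler, you can get very rough numbers from the runtime itself using only standard GC APIs; this is no substitute for dotMemory, but a quick sanity check might look like this:

```csharp
using System;

public static class QuickMemoryStats
{
    // A crude, profiler-free way to get a feel for allocation pressure:
    // approximate managed heap size and how often each GC generation has run.
    public static void Print()
    {
        Console.WriteLine($"Managed heap (approx.): {GC.GetTotalMemory(forceFullCollection: false) / 1024} KB");
        Console.WriteLine($"Gen0 collections: {GC.CollectionCount(0)}");
        Console.WriteLine($"Gen1 collections: {GC.CollectionCount(1)}");
        Console.WriteLine($"Gen2 collections: {GC.CollectionCount(2)}");
    }
}
```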
Under Windows there are many profilers, and they all measure fairly honestly. If we are talking about Linux / macOS and Mono, the memory profiling tooling is much more scarce, and you cannot always measure what you want.

- What profiling tools are you still missing?

Andrey Akinshin: First of all, a normal profiler for Linux and Mac.
For example, Mono has built-in profiling capabilities - you have to start it with special flags - but the possibilities there are quite limited, it is hard to work with, and the results cannot always be trusted. With CoreCLR, all the profiling tools are also still quite raw. There is a lot of useful information on this topic in Sasha Goldstein's recent posts, which explain how to start working with it. Alas, I would not say that running profiling sessions and analyzing the results is particularly convenient. So personally I am waiting for a convenient cross-platform tool for profiling (including memory profiling) to appear.

- By the way, about evolution: as .NET and the related tools evolve, is it getting easier to work with memory? Or does the headache only grow because new mechanisms keep appearing?

Andrey Akinshin: I would say life is gradually getting better. If we are talking about Windows, the full framework has not seen any significant changes for quite some time; everyone is used to how the garbage collector works, and those who need to know its internals know how to handle it properly so that everything goes well. Now cross-platform development - under Linux and Mac - is gaining popularity, and there, unfortunately, the toolkit is not as good. But it is gradually developing too.

Andrey Akinshin will talk in more detail about optimizing .NET applications, and especially about working with memory and related components, in his DotNext 2017 talk "Let's talk about memory".

Source: https://habr.com/ru/post/325178/

