
Optimization of garbage collection in a highly loaded .NET service

Every day tens of thousands of employees from several thousand organizations around the world work in the Pyrus service. We consider the responsiveness of the service (the speed of processing requests) an important competitive advantage, since it directly affects the users' experience. Our key metric is the "percentage of slow queries." Studying its behavior, we noticed that once a minute the application servers have pauses about 1000 ms long, during which the server does not respond and a queue of several dozen requests builds up. This article is about finding the causes and eliminating the bottlenecks caused by garbage collection in the application.



Modern programming languages can be divided into two groups. In languages like C/C++ or Rust memory is managed manually, so programmers spend more time writing code, managing object lifetimes, and debugging. Bugs caused by memory misuse are among the hardest to track down, which is why most modern development is done in languages with automatic memory management, for example Java, C#, Python, Ruby, Go, PHP, JavaScript, and so on. Programmers save development time, but pay for it with extra execution time that the program regularly spends on garbage collection — freeing the memory occupied by objects to which the program no longer holds references. In small programs this time is negligible, but as the number of objects and the rate of their creation grow, garbage collection starts to contribute noticeably to the total execution time.

Pyrus web servers run on the .NET platform, which uses automatic memory management. Most garbage collections are blocking ("stop the world"): they suspend all application threads for the duration of their work. Non-blocking (background) collections also stop all threads, but only for a very short time. While the threads are blocked, the server does not process requests: the requests in flight hang, and new ones queue up. As a result, requests being processed at the moment of garbage collection are slowed down directly, and requests right after the collection finishes are also slowed down by the accumulated queue. This worsens the "percentage of slow queries" metric.
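
Which GC flavor and latency mode a process is running with can be checked from code; a minimal sketch using the standard GCSettings API (not Pyrus code):

 // Minimal sketch: inspect the GC configuration of the current process.
 using System;
 using System.Runtime;

 class GcModeInfo
 {
     static void Main()
     {
         Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");
         // LatencyMode is typically Interactive when concurrent (background) GC is enabled
         Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
     }
 }
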
Armed with the recently published book by Konrad Kokosa, Pro .NET Memory Management (how we brought its first copy to Russia in 2 days deserves a separate post), which is entirely devoted to memory management in .NET, we began to study the problem.

Measurement


For profiling the Pyrus web server we used the PerfView utility (https://github.com/Microsoft/perfview), tailored for profiling .NET applications. The utility is based on the Event Tracing for Windows (ETW) mechanism and has minimal impact on the performance of the profiled application, which allows it to be used on a production server. The performance impact also depends on which event types and what information we collect; when we collect nothing, the application works as usual. PerfView requires neither recompilation nor a restart of the application.

We run a PerfView trace with the /GCCollectOnly parameter (trace time: 1.5 hours). In this mode it collects only garbage collection events and has minimal impact on performance. Let's look at the Memory Group / GCStats trace report and its summary of garbage collector events:



Here we see several interesting indicators:

Let us explain the observation about the number of garbage collections. The idea of dividing objects by lifetime is based on the generational hypothesis: a significant share of created objects dies quickly, while most of the rest live for a long time (in other words, few objects have a "medium" lifetime). The .NET garbage collector is tuned for exactly this mode, and in this mode gen 2 collections should be far fewer than gen 0 collections. So, for the garbage collector to work optimally, we must adapt our application to the generational hypothesis. Let us formulate the rule as follows: objects must either die quickly, never surviving into the older generation, or survive into it and live there forever. The same rule applies to other platforms that use automatic memory management with generations, such as Java.
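
A quick way to check this ratio in a running application is to compare the collection counts (an illustrative sketch using the standard GC API):

 // Illustrative: in an application that follows the generational hypothesis,
 // gen 2 collections are far rarer than gen 0 collections.
 Console.WriteLine($"Gen 0: {GC.CollectionCount(0)}, " +
                   $"Gen 1: {GC.CollectionCount(1)}, " +
                   $"Gen 2: {GC.CollectionCount(2)}");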

Interesting data for us can be extracted from another table in the GCStats report:



Here are the cases when an application tries to create a large object (in the .NET Framework, objects larger than 85,000 bytes are created in the LOH — Large Object Heap) and has to wait for the end of a gen 2 collection that is running concurrently in the background. These allocator pauses are not as critical as garbage collector pauses, since they affect only one thread. Before this we used .NET Framework 4.6.1; in version 4.7.1 Microsoft improved the garbage collector, and it now allows allocating memory in the Large Object Heap during a background gen 2 collection: https://docs.microsoft.com/en-us/dotnet/framework/whats-new/#common-language-runtime-clr
Therefore, we updated to version 4.7.2, the latest at that time.

Generation 2 collections


Why do we have so many gen 2 collections? The first hypothesis is a memory leak. To test it, let's look at the size of the second generation (we set up monitoring of the corresponding performance counters in Zabbix). The graphs of gen 2 size for two Pyrus servers show that it first grows (mainly while caches are being filled) and then stabilizes (the large dips on the graph are the regular restarts of the web service to deploy a new version):



This means that there is no noticeable memory leak, so the large number of gen 2 collections must have another cause. The next hypothesis is high memory traffic: many objects get into the 2nd generation, and many objects die there. PerfView has a /GCOnly mode for finding such objects. In the trace reports, let's look at 'Gen 2 Object Deaths (Coarse Sampling) Stacks', which contains a sample of objects that died in the 2nd generation together with the call stacks of the places where those objects were created. Here we see the following results:



Expanding a line, we see the call stacks of the places in the code that create objects surviving into the 2nd generation. Among them:








Interestingly, the buffers for JSON and for client cache computation are all temporary objects that live for a single request. Why do they survive into the 2nd generation? Note that all of them are rather large arrays. At a size above 85,000 bytes, memory for them is allocated in the Large Object Heap, which is collected only together with the 2nd generation.
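
This is easy to observe directly: on the .NET Framework the Large Object Heap is logically part of generation 2, so GC.GetGeneration reports 2 even for a freshly created large array (an illustrative sketch; the exact LOH threshold may differ by a few bytes because of object headers):

 // Small arrays start in gen 0; arrays over ~85,000 bytes go straight to the LOH,
 // which is collected only together with gen 2.
 var small = new byte[1000];
 var large = new byte[100000];
 Console.WriteLine(GC.GetGeneration(small)); // 0
 Console.WriteLine(GC.GetGeneration(large)); // 2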

To check this, let's open the 'GC Heap Alloc Ignore Free (Coarse Sampling) stacks' section in the PerfView /GCOnly results. There we see the LargeObject line, under which PerfView groups the creation of large objects, and inside it the same arrays we saw in the previous analysis. This confirms the root cause of the garbage collector problems: we create many temporary large objects.





Pyrus system changes


Based on the measurement results, we identified the main directions for further work: fighting large objects in client cache computation and in JSON serialization. There are several possible solutions to this problem:

Let us consider the two cases of large objects separately: client cache computation and JSON serialization.

Calculating client caches


The Pyrus web client and mobile applications cache the data available to the user (projects, forms, users, etc.). Caching is used to speed up the applications and is also necessary for working offline. Caches are computed on the server and transferred to the client. They are individual for each user, since they depend on the user's access rights, and they are updated frequently, for example when the directories the user has access to change.

Thus, many client cache computations regularly occur on the server, creating many temporary short-lived objects. If the user belongs to a large organization, they may have access to many objects, so their client caches will be large. That is why we saw memory allocated for large temporary arrays in the Large Object Heap.

Let us analyze the proposed options for eliminating the creation of large objects:

We decided to take the third path and reinvent the wheel by writing our own List and HashSet that do not load the Large Object Heap.

Chunked list


Our ChunkedList<T> implements the standard interfaces, including IList<T>, so minimal changes to existing code are required. The Newtonsoft.Json library we use serializes it automatically, since it implements IEnumerable<T>:

public sealed class ChunkedList<T> : IList<T>, ICollection<T>, IEnumerable<T>, IEnumerable, IList, ICollection, IReadOnlyList<T>, IReadOnlyCollection<T> { 
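
For example (a hedged usage sketch, not code from Pyrus), existing Newtonsoft.Json calls keep working without changes:

 // Hypothetical usage: ChunkedList<T> is serialized as an ordinary JSON array,
 // because Newtonsoft.Json sees it as IEnumerable<T>.
 var ids = new ChunkedList<int>();
 for (int i = 0; i < 200000; i++)
     ids.Add(i);

 string json = Newtonsoft.Json.JsonConvert.SerializeObject(ids); // "[0,1,2,...]"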

The standard List<T> has the following fields: an array of elements and the number of filled elements. ChunkedList<T> has an array of arrays of elements, the number of fully filled arrays, and the number of elements in the last array. Each of the element arrays is smaller than 85,000 bytes:


 private T[][] chunks;         // array of element arrays (chunks)
 private int currentChunk;     // number of fully filled chunks
 private int currentChunkSize; // number of elements in the last chunk
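
The article does not show the rest of the implementation; below is a minimal sketch of how Add and the indexer could work with these fields (the chunk size constant is an assumption, chosen so that a chunk stays under 85,000 bytes for the element type):

 // Sketch only: chunked growth instead of one ever-growing array.
 private const int ChunkSize = 8000; // assumed; must keep each chunk below 85,000 bytes

 public void Add(T item)
 {
     if (chunks == null)
         chunks = new[] { new T[ChunkSize] };

     if (currentChunkSize == ChunkSize)
     {
         // the last chunk is full: add another small chunk instead of resizing one huge array
         Array.Resize(ref chunks, chunks.Length + 1);
         chunks[chunks.Length - 1] = new T[ChunkSize];
         currentChunk++;
         currentChunkSize = 0;
     }

     chunks[currentChunk][currentChunkSize++] = item;
 }

 public T this[int index]
 {
     get => chunks[index / ChunkSize][index % ChunkSize];
     set => chunks[index / ChunkSize][index % ChunkSize] = value;
 }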

Since ChunkedList<T> is rather complex, we wrote detailed tests for it. Any operation should be tested in at least two modes: "small", when the whole list fits into one chunk of up to 85,000 bytes, and "large", when it consists of more than one chunk. For methods that change the size (for example, Add), there are even more scenarios: "small" -> "small", "small" -> "large", "large" -> "large", "large" -> "small". There are quite a few tricky boundary cases that unit tests handle well.

The situation is simplified by the fact that some of the methods from the IList interface are not used, so they may be left unimplemented (such as Insert and Remove); implementing and testing them would be quite expensive. In addition, writing unit tests is simplified by the fact that we don't need to invent new functionality: ChunkedList<T> should behave exactly like List<T>. So all the tests follow the same pattern: create a List<T> and a ChunkedList<T>, perform the same operations on both, and compare the results.
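
A hedged example of such a comparison test (NUnit is assumed here; the actual framework and test names in Pyrus may differ):

 [Test]
 public void Add_BehavesLikeStandardList()
 {
     var expected = new List<int>();
     var actual = new ChunkedList<int>();

     // enough elements to cross the "small" -> "large" boundary
     for (int i = 0; i < 100000; i++)
     {
         expected.Add(i);
         actual.Add(i);
     }

     CollectionAssert.AreEqual(expected, actual);
 }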

We measured performance using the BenchmarkDotNet library to make sure that we don't slow down our code too much when switching from List<T> to ChunkedList<T>. For example, a test for adding items to the list:

 [Benchmark]
 public ChunkedList<int> ChunkedList()
 {
     var list = new ChunkedList<int>();
     for (int i = 0; i < N; i++)
         list.Add(i);
     return list;
 }
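
For comparison, the benchmark for the standard List<T> presumably looks the same (a sketch, since the article does not show this code):

 [Benchmark]
 public List<int> StandardList()
 {
     var list = new List<int>();
     for (int i = 0; i < N; i++)
         list.Add(i);
     return list;
 }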

Results when adding 500 elements (everything fits into one array):
 Method       |     Mean | Error     | StdDev    | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op
 StandardList | 1.415 us | 0.0149 us | 0.0140 us |      0.6847 |      0.0095 |           - |             4.21 KB
 ChunkedList  | 3.728 us | 0.0238 us | 0.0222 us |      0.6943 |      0.0076 |           - |             4.28 KB

Results when adding 50,000 items (split into several arrays):
 Method       |       Mean | Error     | StdDev    | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op
 StandardList | 146.273 us | 3.1466 us | 4.8053 us |    124.7559 |    124.7559 |    124.7559 |           513.23 KB
 ChunkedList  | 287.687 us | 1.4630 us | 1.2969 us |     41.5039 |     20.5078 |           - |           256.75 KB

Detailed description of the columns in the results
 BenchmarkDotNet=v0.11.4, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
 Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
   [Host]     : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3324.0
   DefaultJob : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3324.0

 // * Hints *
 Outliers
   ListAdd.StandardList: Default -> 2 outliers were removed
   ListAdd.ChunkedList: Default  -> 1 outlier was removed

 // * Legends *
   Mean                : Arithmetic mean of all measurements
   Error               : Half of 99.9% confidence interval
   StdDev              : Standard deviation of all measurements
   Gen 0/1k Op         : GC Generation 0 collects per 1k Operations
   Gen 1/1k Op         : GC Generation 1 collects per 1k Operations
   Gen 2/1k Op         : GC Generation 2 collects per 1k Operations
   Allocated Memory/Op : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
   1 us                : 1 Microsecond (0.000001 sec)


Looking at the 'Mean' column, which shows the average test run time, you can see that our implementation is only 2-2.5 times slower than the standard one. Considering that in real code list operations are only a small part of all the work performed, this difference is insignificant. But the 'Gen 2/1k Op' column (the number of gen 2 collections per 1000 test runs) shows that we achieved the goal: with a large number of elements, ChunkedList does not create garbage in the 2nd generation, which was our task.

Chunked set


Similarly, ChunkedHashSet<T> implements the ISet<T> interface. When writing ChunkedHashSet<T> we reused the chunking logic already implemented in ChunkedList. To do this, we took a ready-made implementation of HashSet<T> from the .NET Reference Source, available under the MIT license, and replaced its arrays with ChunkedLists.

In the unit tests we use the same trick as for the lists: we compare the behavior of ChunkedHashSet<T> with the reference HashSet<T>.

Finally, the performance tests. The main operation we use is the union of sets, so that is what we test:

 public ChunkedHashSet<int> ChunkedHashSet(int[][] source)
 {
     var set = new ChunkedHashSet<int>();
     foreach (var arr in source)
         set.UnionWith(arr);
     return set;
 }

And exactly the same test for the standard HashSet. The first test for small sets:

 var source = new int[][]
 {
     Enumerable.Range(0, 300).ToArray(),
     Enumerable.Range(100, 600).ToArray(),
     Enumerable.Range(300, 1000).ToArray(),
 };

 Method          |     Mean | Error     | StdDev    | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op
 StandardHashSet | 30.16 us | 0.1046 us | 0.0979 us |      9.3079 |      1.6785 |           - |            57.41 KB
 ChunkedHashSet  | 73.54 us | 0.5919 us | 0.5247 us |      9.5215 |      1.5869 |           - |            58.84 KB

The second test is for large sets, which caused the problem with the large object heap:

 var source = new int[][]
 {
     Enumerable.Range(0, 30000).ToArray(),
     Enumerable.Range(10000, 60000).ToArray(),
     Enumerable.Range(30000, 100000).ToArray(),
 };

 Method          |        Mean | Error      | StdDev     | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op
 StandardHashSet | 3,031.30 us | 32.0797 us | 28.4378 us |    699.2188 |    667.9688 |    664.0625 |          4718.23 KB
 ChunkedHashSet  | 7,189.66 us | 25.6319 us | 23.9761 us |    539.0625 |    265.6250 |      7.8125 |          3280.71 KB

The results are similar to those for the lists: ChunkedHashSet is 2-2.5 times slower, but on large sets it loads the 2nd generation two orders of magnitude less.

JSON serialization


The Pyrus web server provides several APIs that use different serialization. We found the creation of large objects in the API used by bots and by the synchronization utility (hereinafter, the Public API). Note that the main API uses its own serialization, which is not affected by this problem; we wrote about it in the article https://habr.com/ru/en/post/227595/ , in the section "2. You don't know where your application's bottleneck is." That is, the main API already works well, while the problem showed up in the Public API as the number of requests and the amount of data in the responses grew.

Let's optimize the Public API. As we know from the example of the main API, the response can be returned to the user in streaming mode: instead of creating intermediate buffers containing the entire response, we can write the response directly to the stream.

Upon closer inspection, we found that in the process of serializing the response, we create a temporary buffer for the intermediate result ('content' is an array of bytes containing JSON in UTF-8 encoding):

 var serializer = Newtonsoft.Json.JsonSerializer.Create(...);
 byte[] content;
 var ms = new MemoryStream();
 var sw = new StreamWriter(ms, new UTF8Encoding(false));
 using (var writer = new Newtonsoft.Json.JsonTextWriter(sw))
 {
     serializer.Serialize(writer, result);
     writer.Flush();
     content = ms.ToArray();
 }

Let's trace where content is used. For historical reasons, the Public API is based on WCF, for which XML is the standard request and response format. In our case, there is a single 'Binary' element in the XML response, inside of which JSON encoded in Base64 is written:

 public class RawBodyWriter : BodyWriter
 {
     private readonly byte[] _content;

     public RawBodyWriter(byte[] content) : base(true)
     {
         _content = content;
     }

     protected override void OnWriteBodyContents(XmlDictionaryWriter writer)
     {
         writer.WriteStartElement("Binary");
         writer.WriteBase64(_content, 0, _content.Length);
         writer.WriteEndElement();
     }
 }

Note that the temporary buffer is not needed here. JSON can be written directly into the XmlWriter buffer that WCF provides us, encoding it in Base64 on the fly. Thus, we take the first path and get rid of the memory allocation:

 protected override void OnWriteBodyContents(XmlDictionaryWriter writer)
 {
     var serializer = Newtonsoft.Json.JsonSerializer.Create(...);
     writer.WriteStartElement("Binary");
     Stream stream = new Base64Writer(writer);
     var sw = new StreamWriter(stream, new UTF8Encoding(false));
     using (var jsonWriter = new Newtonsoft.Json.JsonTextWriter(sw))
     {
         serializer.Serialize(jsonWriter, _result);
         jsonWriter.Flush();
     }
     writer.WriteEndElement();
 }

Here, Base64Writer is a simple wrapper around XmlWriter that implements the Stream interface and writes to the XmlWriter as Base64. Of the entire interface, it is enough to implement only the Write method, which is called by the StreamWriter:

 public class Base64Writer : Stream
 {
     private readonly XmlWriter _writer;

     public Base64Writer(XmlWriter writer)
     {
         _writer = writer;
     }

     public override void Write(byte[] buffer, int offset, int count)
     {
         _writer.WriteBase64(buffer, offset, count);
     }

     <...>
 }

Induced GC


Let's try to deal with the mysterious induced garbage collections. We checked our code ten times for GC.Collect calls and found nothing. We managed to catch these events in PerfView, but the call stack is not particularly telling (the DotNETRuntime/GC/Triggered event):



There is a small clue: the call to RecycleLimitMonitor.RaiseRecycleLimitEvent before the induced garbage collection. Let's trace the call stack of the RaiseRecycleLimitEvent method:

 RecycleLimitMonitor.RaiseRecycleLimitEvent(...)
 RecycleLimitMonitor.RecycleLimitMonitorSingleton.AlertProxyMonitors(...)
 RecycleLimitMonitor.RecycleLimitMonitorSingleton.CollectInfrequently(...)
 RecycleLimitMonitor.RecycleLimitMonitorSingleton.PBytesMonitorThread(...)

The names of the methods fully correspond to their functions:


Further investigation shows that IObserver<RecycleLimitInfo> handlers are added in the RecycleLimitMonitor.Subscribe() method, which is called from AspNetMemoryMonitor.Subscribe(). The AspNetMemoryMonitor class also registers a default IObserver<RecycleLimitInfo> handler (the RecycleLimitObserver class), which cleans ASP.NET caches and sometimes requests a garbage collection.

The riddle of the induced GC is almost solved. It remains to find out why this garbage collection is triggered. RecycleLimitMonitor monitors the memory usage of IIS (more precisely, the private bytes figure), and when it approaches a certain limit, it starts raising the RaiseRecycleLimitEvent event according to a rather convoluted algorithm. The AspNetMemoryMonitor.ProcessPrivateBytesLimit value is used as the memory limit, and it in turn is computed by the following logic:

The conclusion of the investigation is as follows: ASP.NET memory usage approaches its limit and starts regularly triggering garbage collections. 'Private Memory Limit (KB)' was not set, so ASP.NET limited itself to 60% of physical memory. The problem was masked by the fact that Task Manager on the server showed plenty of free memory, which seemed sufficient. We increased the 'Private Memory Limit (KB)' value in the Application Pool settings in IIS to 80% of physical memory, which encourages ASP.NET to use more of the available memory. We also added monitoring of the '.NET CLR Memory / # Induced GC' performance counter so as not to miss the next time ASP.NET decides it is approaching its memory limit.
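
We collect this counter through Zabbix; purely for illustration, here is a hedged sketch of reading it from code with System.Diagnostics (the "w3wp" instance name is an assumption and depends on the IIS worker process):

 using System;
 using System.Diagnostics;

 // Sketch: reading the induced-GC counter for an IIS worker process.
 var inducedGc = new PerformanceCounter(
     ".NET CLR Memory",  // category
     "# Induced GC",     // counter
     "w3wp",             // instance name is an assumption; pick the right process instance
     readOnly: true);

 Console.WriteLine($"Induced GC so far: {inducedGc.NextValue()}");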

Repeated measurements


Let's see what happened to garbage collection after all these changes. Let's start with perfview /GCCollectOnly (trace time: 1 hour), the GCStats report:



We can see that gen 2 collections are now two orders of magnitude fewer than gen 0 and gen 1 collections, and that their duration has also decreased. Induced collections are no longer observed. Let's look at the list of gen 2 collections:



The Gen column shows that all gen 2 collections have become background ones ('2B' means 2nd generation, Background). That is, most of the work is done in parallel with the application, and all threads are blocked only briefly (the 'Pause MSec' column). Let's look at the pauses when creating large objects:



We can see that the number of such pauses when creating large objects has dropped several-fold.

Results


Thanks to the changes described in this article, we managed to significantly reduce the number and duration of gen 2 collections. We found the cause of the induced collections and got rid of them. The number of gen 0 and gen 1 collections increased, but their average duration decreased (from ~200 ms to ~60 ms); their maximum duration also decreased, though less noticeably. Gen 2 collections became faster, and the long pauses of up to 1000 ms are completely gone.

As for the key metric, the "percentage of slow queries", it decreased by 40% after all the changes.

Thanks to this work, we understood which performance counters are needed to assess the memory and garbage collection situation, and added them to Zabbix for continuous monitoring. Here is a list of the most important ones that we watch in order to find the cause of anomalies (for example, an increased flow of requests, a large amount of transferred data, or a bug in the application):
 Performance counter                        | Description                                         | When to pay attention
 \Process(*)\Private Bytes                  | The amount of memory allocated to the application   | Values greatly exceed the threshold. As a threshold, you can take the median of the daily maximums over 2 weeks.
 \.NET CLR Memory(*)\# Gen 2 Collections    | The amount of memory in the older generation        |
 \.NET CLR Memory(*)\Large Object Heap size | Memory size for large objects                       |
 \.NET CLR Memory(*)\% Time in GC           | The percentage of time spent on garbage collection  | The value is greater than 5%.
 \.NET CLR Memory(*)\# Induced GC           | The number of induced collections                   | The value is greater than 0.

Source: https://habr.com/ru/post/452298/

