Optimizing programs for the Garbage Collector
Not long ago, a great article appeared on Habré: "Optimizing garbage collection in a highly loaded .NET service". It is interesting because the authors, armed with theory, achieved what used to seem impossible: they optimized their application using knowledge of how the GC works. And where we previously had little idea of how the GC operates, it is now handed to us on a platter thanks to the efforts of Konrad Kokosa in his book Pro .NET Memory Management. What conclusions did I draw for myself? Let's make a list of problem areas and think about how to solve them.
At the recent CLRium #5: Garbage Collector workshop we talked about the GC for a whole day. Still, I decided to publish one of the talks with a text transcript: the one about conclusions for application optimization.
Reduce cross-generational connectivity
Problem
To speed up garbage collection, the GC collects the younger generation whenever possible. But to do this it also needs information about references from older generations (in this case they act as additional roots): the card table.
A single reference from an older generation to a younger one forces the GC to cover a whole region with a card-table entry:
4 bytes cover 4 KB, i.e. at most 320 objects, on the x86 architecture;
8 bytes cover 8 KB, again at most 320 objects, on the x64 architecture.
That is, when the GC finds a non-zero value while checking the card table, it has to examine up to 320 objects for outgoing references into the generation being collected.
Sparse references to the younger generation therefore make garbage collection more time-consuming.
Solution
Keep objects that hold references to the younger generation close together;
If heavy gen0 object traffic is expected, use pooling: create a pool of objects (no new allocations means no gen0 objects), then "warm up" the pool with two consecutive GCs so that its contents are guaranteed to be promoted to generation 2, thereby avoiding references to the younger generation and keeping zeros in the card table (see the sketch after this list);
Avoid references to the younger generation.
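A minimal sketch of such a pool, assuming a class-based pooled type; WarmObjectPool and its Rent/Return methods are illustrative names, not an existing API:

```csharp
using System;
using System.Collections.Concurrent;

public sealed class WarmObjectPool<T> where T : class, new()
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();

    public WarmObjectPool(int size)
    {
        for (int i = 0; i < size; i++)
            _items.Add(new T());

        // "Warm-up": two consecutive blocking collections promote the
        // pooled objects gen0 -> gen1 -> gen2, so afterwards the pool
        // holds no young objects and produces no old-to-young
        // card-table entries.
        GC.Collect();
        GC.Collect();
    }

    public T Rent() => _items.TryTake(out T item) ? item : new T();

    public void Return(T item) => _items.Add(item);
}
```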
Do not allow strong connectivity
Problem
As follows from the compaction phase of the collection algorithm for objects in the SOH:
to compact the heap, the GC must traverse the object graph and fix up every reference to the objects' new addresses;
and references recorded in the card table pull in whole groups of objects at once.
So a generally strongly connected object graph can cause noticeable slowdowns during GC.
Solution
Place strongly connected objects next to each other, in the same generation;
Avoid unnecessary references in general (for example, instead of duplicating a reference as this.Handle, use the existing path this.Service.Handle);
Avoid code with hidden connectivity, such as closures (see the sketch after this list).
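To illustrate the hidden connectivity of closures, a small sketch (the ReportBuilder type is invented for the example): a lambda that touches an instance field silently captures this, keeping the whole object graph alive.

```csharp
using System;

public class ReportBuilder
{
    private readonly byte[] _buffer = new byte[1_000_000];

    // Hidden connectivity: the lambda needs _buffer, so the compiler
    // captures `this`; the returned delegate keeps the whole
    // ReportBuilder (and its megabyte buffer) reachable as long as
    // the delegate itself lives.
    public Func<int> CapturesThis() => () => _buffer.Length;

    // Copying the value into a local first keeps the closure minimal:
    // only one int is captured, not the object graph behind `this`.
    public Func<int> CapturesValueOnly()
    {
        int length = _buffer.Length;
        return () => length;
    }
}
```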
Monitor segment usage
Problem
Under intensive load, a situation may arise where allocating new objects causes delays: new segments are allocated for the heap and later decommitted during garbage collection.
Solution
Using PerfMon / Sysinternals utilities, find the points where new segments are allocated, decommitted, and released;
If the LOH sees heavy buffer traffic, use ArrayPool<T> (see the sketch after this list);
For the SOH, make sure objects with the same lifetime are allocated next to each other, so that the GC can do a Sweep instead of a Collect;
SOH: use object pools.
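For the ArrayPool<T> point, a short sketch using the real System.Buffers API; the 128 KB buffer size is just an illustration of an LOH-sized allocation (the LOH threshold is 85,000 bytes):

```csharp
using System;
using System.Buffers;

byte[] buffer = ArrayPool<byte>.Shared.Rent(128 * 1024);
try
{
    // ... fill and process the buffer; note that Rent may return an
    // array larger than requested ...
}
finally
{
    // Returning the buffer lets the pool reuse it, avoiding repeated
    // LOH allocations for each operation.
    ArrayPool<byte>.Shared.Return(buffer);
}
```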
Do not allocate memory in hot sections of code
Problem
A hot section of code that allocates memory:
makes the GC hand out an allocation window of 8 KB instead of 1 KB;
triggers a GC and growth of the committed region whenever the window runs out of space.
A dense stream of new objects also pushes short-lived objects from other threads into older generations, where garbage collection conditions are worse,
which increases garbage collection time,
and that means longer stop-the-world pauses, even in Concurrent mode.
Solution
Completely ban closures in critical sections of code;
Completely ban boxing in critical sections of code (if necessary, emulate it via pooling);
Where a temporary object is needed to hold data, use a struct, better yet a ref struct; when it has more than two fields, pass it by ref (see the sketch after this list).
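A minimal sketch of the ref struct advice; the ParsePosition type and Advance method are invented for the example:

```csharp
using System;

// As a ref struct, this temporary data carrier can only live on the
// stack, so it creates no GC traffic at all.
public ref struct ParsePosition
{
    public int Line;
    public int Column;
    public int Offset;
}

public static class Parser
{
    // More than two fields, so the struct is passed by ref to avoid
    // copying all of it on every call.
    public static void Advance(ref ParsePosition pos, char c)
    {
        pos.Offset++;
        if (c == '\n') { pos.Line++; pos.Column = 0; }
        else { pos.Column++; }
    }
}
```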
Avoid unnecessary memory allocations in the LOH
Problem
Placing arrays in the LOH leads either to its fragmentation or to a heavier GC procedure.
Solution
Split large arrays into sub-arrays and wrap them in a class that encapsulates the logic of working with them (i.e., instead of a List<T> that stores one mega-array, use your own MyList backed by a T[][] that divides the data into several shorter arrays; see the sketch after this list);
The sub-arrays then go to the SOH;
After a couple of garbage collections they settle next to the long-lived objects and no longer affect collection times;
Keep an eye on double[] arrays longer than 1,000 elements: they go to the LOH despite being smaller than the usual 85,000-byte threshold.
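One possible shape of such a wrapper, sketched under the assumption of a fixed chunk size; ChunkedList is an illustrative name, not a BCL type:

```csharp
using System;
using System.Collections.Generic;

public sealed class ChunkedList<T>
{
    // Each chunk stays well under the 85,000-byte LOH threshold for
    // typical element sizes, so the backing storage lives in the SOH.
    private const int ChunkSize = 4096;
    private readonly List<T[]> _chunks = new List<T[]>();
    private int _count;

    public int Count => _count;

    public void Add(T item)
    {
        if (_count == _chunks.Count * ChunkSize)
            _chunks.Add(new T[ChunkSize]); // grow one SOH-sized chunk at a time
        _chunks[_count / ChunkSize][_count % ChunkSize] = item;
        _count++;
    }

    public T this[int index]
    {
        get => _chunks[index / ChunkSize][index % ChunkSize];
        set => _chunks[index / ChunkSize][index % ChunkSize] = value;
    }
}
```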
Where justified and possible, use the thread stack
Problem
There are objects that are ultra short-lived, or that live only within a single method call (including its nested calls). They create object traffic.
Solution
Allocate memory on the stack where possible:
it does not load the heap;
it does not load the GC;
freeing the memory is instantaneous.
Use Span<T> x = stackalloc T[size]; instead of new T[size] where possible (see the sketch after this list);
Use Span<T> / Memory<T> where possible;
Port algorithms to stack-based ref struct types (e.g., a StackList struct or ValueStringBuilder).
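A small sketch combining stackalloc and Span<T>; the SumOfDigits function is invented for the example:

```csharp
using System;

// The buffer lives on the thread stack: no heap allocation, no GC
// traffic, and the memory is reclaimed instantly when the method returns.
static int SumOfDigits(int value)
{
    Span<char> buffer = stackalloc char[16]; // 16 chars fit any Int32
    value.TryFormat(buffer, out int written);
    int sum = 0;
    foreach (char c in buffer.Slice(0, written))
        if (char.IsDigit(c))
            sum += c - '0';
    return sum;
}

Console.WriteLine(SumOfDigits(-407)); // prints 11
```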
Release objects as early as possible.
Problem
Objects conceived as short-lived end up in gen1, and sometimes in gen2. The result is a heavier GC that takes longer.
Solution
Release object references as early as possible;
If a lengthy algorithm works with some objects in places scattered through the code, but those places can be grouped together, group them: that lets the objects be collected earlier.
For example, a collection is fetched on line 10 but only filtered on line 120, so it stays reachable the whole time in between (see the sketch after this list).
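A sketch of the grouping idea; Order, LoadOrders, and GetRecentOrders are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Order(DateTime CreatedAt); // hypothetical domain type

public static class OrderService
{
    private static List<Order> LoadOrders() => new List<Order>(); // stub

    // Fetching and filtering side by side: the full collection becomes
    // unreachable as soon as the filter runs, instead of staying alive
    // across a hundred lines of unrelated work.
    public static List<Order> GetRecentOrders()
    {
        List<Order> all = LoadOrders();
        return all.Where(o => o.CreatedAt > DateTime.UtcNow.AddDays(-7))
                  .ToList();
        // `all` is not used past this point, so the JIT already treats
        // it as dead and the GC can reclaim it at the next collection.
    }
}
```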
Calling GC.Collect() is not necessary
Problem
It often seems that calling GC.Collect() will fix the situation.
Solution
It is much more productive to learn the GC's algorithms and to examine the application with ETW and other diagnostic tools (JetBrains dotMemory, ...).
Avoid pinning
Problem
Pinning may leave some objects stuck in a younger generation, thereby forming references tracked through the card table.
Solution
If there is no other way, use fixed () { }: this construct does not pin the object up front; the object is only actually pinned if a GC runs while execution is inside the curly braces (see the sketch below).
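A minimal sketch of the fixed () { } form (requires compiling with unsafe enabled):

```csharp
public static class PinExample
{
    // The array is pinned only while execution is inside the braces,
    // and the pin costs nothing unless a GC actually happens during
    // that window.
    public static unsafe int FirstByte(byte[] data)
    {
        fixed (byte* p = data)
        {
            return *p;
        }
    }
}
```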
Avoid finalization
Problem
Finalization is non-deterministic:
an uncalled Dispose() leads to finalization, which holds on to all the object's outgoing references;
dependent objects are kept alive longer than planned;
they grow older, moving into older generations;
if they contain references to younger objects, they generate card-table entries;
this complicates collection of the older generations, fragments them, and leads to a Compact instead of a Sweep.
Solution
Call Dispose() diligently (see the sketch below).
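A short sketch: the using statement is the usual way to make sure Dispose() runs even on the exception path.

```csharp
using System;
using System.IO;

public static class FirstLineReader
{
    // `using` guarantees Dispose() is called even if an exception is
    // thrown; a properly implemented Dispose() calls GC.SuppressFinalize,
    // so the object never reaches the finalization queue.
    public static string ReadFirstLine(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return reader.ReadLine() ?? string.Empty;
        }
    }
}
```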
Avoid lots of threads
Problem
With a large number of threads, the number of allocation contexts grows, since each thread gets its own:
as a result, GC is triggered sooner;
and due to the lack of space in the ephemeral segment, a Collect follows instead of a Sweep.
Solution
Keep the number of threads in line with the number of cores.
Avoid traffic of objects of different sizes
Problem
Traffic of objects with different sizes and lifetimes causes fragmentation:
the fragmentation ratio grows;
compacting collections are triggered, with an address-fixup phase that walks all referencing objects.
Solution
If object traffic is unavoidable:
check for excessive slack and roughly over-estimated object sizes;
make sure strings are not being manipulated: where possible, replace them with ReadOnlySpan<char> / ReadOnlyMemory<char> (see the sketch after this list);
release references as early as possible;
use pooling;
"warm up" caches and pools with a double GC so that their objects are compacted; this way you avoid card-table problems.
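To close, a sketch of the ReadOnlySpan<char> advice; ConfigParser, ExtractKey, and the "key=value" format are illustrative:

```csharp
using System;

public static class ConfigParser
{
    // Slicing a ReadOnlySpan<char> allocates nothing, unlike Substring,
    // which creates a new string object on every call.
    public static ReadOnlySpan<char> ExtractKey(ReadOnlySpan<char> line)
    {
        int separator = line.IndexOf('=');
        return separator < 0 ? line : line.Slice(0, separator).Trim();
    }
}

// Usage: strings convert implicitly to ReadOnlySpan<char>, so
// ExtractKey("timeout = 30") yields "timeout" without allocating.
```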