Good day.
About two weeks ago, our monitoring tool (NewRelic) started reporting a large number of short outages of the site, each lasting no more than a minute but occurring very frequently. It was also visually noticeable that the overall performance of the web application (Umbraco 6.1.6, .NET 4.0) had dropped.
The red bars in the picture are those outages.

A quick caveat: shortly before we noticed the problem, a new blog module had been installed and the company's blog had been migrated from WordPress to Umbraco.
So the input data was: the application now stores much more data (far more than before) + a third-party module was installed = High CPU.
Let's hit the road
Before starting the investigation, we checked Google Analytics to make sure the number of users had not changed (it had not; everything was as before), and we decided to run a load test to determine the throughput.
Here we were in for a complete disappointment: our application died at 30 concurrent sessions. The site would not open in a browser at all. And this was production.
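For reference, such a load test does not have to be anything sophisticated. A minimal sketch of that kind of test (the URL, session count, and request count below are placeholders, not our real test plan) might look like this:

using System;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class SimpleLoadTest
{
    static void Main()
    {
        string url = "https://example.com/";   // placeholder target, not the real site
        int sessions = 30;                     // concurrent "users"
        int requestsPerSession = 20;
        int ok = 0, failed = 0;

        using (var client = new HttpClient())
        {
            // One request loop per simulated session, all running concurrently.
            var workers = Enumerable.Range(0, sessions).Select(s => Task.Run(async () =>
            {
                for (int i = 0; i < requestsPerSession; i++)
                {
                    try
                    {
                        var response = await client.GetAsync(url);
                        if (response.IsSuccessStatusCode) Interlocked.Increment(ref ok);
                        else Interlocked.Increment(ref failed);
                    }
                    catch (HttpRequestException) { Interlocked.Increment(ref failed); }
                }
            })).ToArray();

            Task.WaitAll(workers);
        }

        Console.WriteLine("OK: {0}, failed: {1}", ok, failed);
    }
}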
Step 1 - collect performance dumps under load using the Debug Diagnostic Tool (DebugDiag)
1. Install it on the production server.
2. Run it and create a new rule of type “Performance”.

3. Specify that dumps should be collected using Performance Counters.

4. In our case, select % Processor Time with a threshold of 80% and a duration of 15 seconds.

This means that a dump will be collected whenever CPU usage stays above 80% for 15 seconds.
5. Study the results. What you need to pay attention to is highlighted with red rectangles.


Namely:
- At the moment the dump was captured, a garbage collection was running (at first I paid no attention to this);
- The heap size is very large;
- All 4 of the busiest threads belong to the Garbage Collector and are eating 100% of the CPU.
Here I would like to stress that the problem is not the GC itself, but improper memory allocation that forces it to behave this way.
Some theory
For the GC, the most expensive work is collecting generation 2 (which also triggers Gen 1 and Gen 0 collections). Each generation has its own threshold, and when it is exceeded a collection runs automatically. So the more often the threshold is exceeded, the more often garbage collection runs.
A small example:
Suppose the Gen 2 threshold is 300 MB.
In one second the GC can clear 100 MB of Gen 2.
Each user causes 10 MB per second to be allocated in Gen 2.
With 10 users that is 10 * 10 = 100 MB per second, so there is no problem.
With 40 users, 400 MB is allocated every second, the threshold is exceeded, a garbage collection is triggered, and so on.
In other words, the more users there are, the more memory is allocated, the more often garbage collection is triggered, and the more time is spent on each collection.
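To make the arithmetic above tangible, here is a minimal, self-contained console sketch (not code from our application; the 10 MB per user figure simply mirrors the hypothetical numbers above). It allocates roughly 10 MB per simulated user per second and prints how often each generation gets collected; switching users from 10 to 40 makes Gen 2 collections fire noticeably more often:

using System;
using System.Collections.Generic;
using System.Threading;

class GcPressureDemo
{
    static void Main()
    {
        var survivors = new List<byte[]>();
        int users = 40;                     // try 10 vs. 40 to see the difference

        for (int second = 0; second < 30; second++)
        {
            for (int u = 0; u < users; u++)
            {
                // Arrays over 85 KB land on the Large Object Heap,
                // which is collected together with Gen 2.
                var chunk = new byte[10 * 1024 * 1024];
                if (u % 4 == 0)
                    survivors.Add(chunk);   // keep some data alive so the heap keeps growing
            }
            if (survivors.Count > 40)
                survivors.RemoveRange(0, 20);

            Console.WriteLine("t={0,2}s  Gen0={1}  Gen1={2}  Gen2={3}  heap={4:N0} MB",
                second,
                GC.CollectionCount(0),
                GC.CollectionCount(1),
                GC.CollectionCount(2),
                GC.GetTotalMemory(false) / (1024 * 1024));

            Thread.Sleep(1000);
        }
    }
}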
In .NET 4.0, when garbage collection starts, the GC threads are given the highest priority. This means that all server resources go to garbage collection and, in addition, all other threads (the ones processing incoming requests) are temporarily suspended until the collection finishes.
This is why the server did not respond to requests even under moderate load.
So we can conclude: the cause is the incorrect allocation of large amounts of memory over short periods of time. To solve the problem, we need to find where in our code these so-called “memory leaks” occur.
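As a side note, before hunting for allocations it can be worth confirming which GC flavor the application is actually running under, since workstation and server GC behave differently under this kind of pressure. A tiny sketch (not part of the original investigation; the properties are standard System.Runtime.GCSettings members available on .NET 4.0) that can be dropped into a debug page or a scratch console app:

using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // Server GC uses one GC thread per core and suspends managed threads while it runs.
        Console.WriteLine("Server GC:    {0}", GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: {0}", GCSettings.LatencyMode);
        Console.WriteLine("CLR version:  {0}", Environment.Version);
    }
}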
Step 2 - find the objects that occupy the most memory (memory profiling)
For this I used dotMemory as the memory profiler.
We attach dotMemory to the application under load and take a memory snapshot when memory usage starts to grow rapidly. (The green area in the image below is Gen 2.)

Next, we proceed to analyzing the snapshot.

The largest amount of memory is occupied by HttpContext, DynamicNode, and Node objects.
HttpContext can be excluded, since it merely holds references to the DynamicNode and Node objects.
Next we group by generations, since we only need Gen 2 objects.

Within Gen 2, we group again, this time by dominators.

This reliably pinpoints the objects that occupy the most memory. After that, you work with specific instances to determine exactly which objects they are (id, properties, and so on).

At this point it became clear which data was the source of the problem; all that remained was to find the place in the code where these objects are created, and fix it.
Step 3 - Fix Problems
In my particular case, the problem was in the control that generated the site's main navigation. The control was not cached, so it ran on every page request. The specific “memory leak” was tied to calling the native Umbraco method DynamicNode.isAncestor(). As it turned out, in order to determine whether a node is an ancestor, the method loads the entire site tree into memory. This matched the fact that the problem only started to show up as the amount of data grew, specifically after the blog import.
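As a rough sketch of the kind of replacement that helps here (this is not the article's original code; the Parent and Id members are used as they exist on Umbraco 6's DynamicNode, but treat the exact API details as assumptions), an ancestor check can simply walk up the parent chain instead of materializing the whole tree:

using umbraco.MacroEngines;

public static class NavigationHelpers
{
    // Returns true if possibleAncestor lies on the path from node up to the content root.
    // Only the nodes on that path are touched, so the full site tree is never loaded.
    public static bool IsAncestorOf(this DynamicNode possibleAncestor, DynamicNode node)
    {
        var current = node != null ? node.Parent : null;
        while (current != null && current.Id > 0)
        {
            if (current.Id == possibleAncestor.Id)
                return true;
            current = current.Parent;
        }
        return false;
    }
}

On top of that, the navigation control itself can be cached with the standard ASP.NET OutputCache directive in the .ascx (for example Duration="600", VaryByParam="None"; the concrete values here are illustrative, not the ones from production).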
So the fix was to replace the isAncestor method with our own implementation, plus applying OutputCache to the control.
Findings
- High CPU is not only about recursion or heavy load; it can also be the GC;
- Object creation must be thought through and consistent with the application's architecture;
- Output cache - always and everywhere;
- Everything that is not visible during normal testing will show up under load!
And a note: at the time of writing, NewRelic did not help me find the source of the High CPU, but the % Time in GC performance counter pointed to the source of the problem easily.
If the CPU peaks on the graph line up with the peaks of the % Time in GC graph, and % Time in GC stays above the 20% line, then the High CPU is caused by garbage collection.
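If you want to watch those two counters side by side outside of perfmon, a minimal sketch looks like this (the category and counter names are the standard Windows/.NET ones; the "w3wp" instance name is an assumption, so substitute your own worker process):

using System;
using System.Diagnostics;
using System.Threading;

class GcCpuMonitor
{
    static void Main()
    {
        var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        var timeInGc = new PerformanceCounter(".NET CLR Memory", "% Time in GC", "w3wp");

        // Prime both counters: the first NextValue() of many counters returns 0.
        cpu.NextValue();
        timeInGc.NextValue();

        while (true)
        {
            Thread.Sleep(1000);
            float cpuValue = cpu.NextValue();
            float gcValue = timeInGc.NextValue();

            Console.WriteLine("CPU: {0,5:F1}%   % Time in GC: {1,5:F1}%{2}",
                cpuValue, gcValue, gcValue > 20 ? "   <-- likely GC-bound" : "");
        }
    }
}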
Thanks for your attention. I hope it was interesting.