
To the global heap without traffic jams: exploring memory managers

While analyzing a defect in a product under development, I came across an architectural peculiarity of the memory manager we were using. The defect caused the creation time of certain objects to grow. The peculiarity was the use of the Singleton pattern for the memory manager (hereinafter referred to as the X allocator). Schematically it looks like this:

Figure 1 - Block diagram of the X allocator

From the diagram it is clear that access to the global heap is protected by a mutex. Under intensive creation of objects of the same type from several threads, such an architecture can lead to threads queuing up on this mutex. Yet one of the key features of the product is its ability to scale by increasing the number of processing threads (threads performing the same work). So this approach could well become a bottleneck.
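For illustration, here is a minimal C++ sketch of this architecture. The names and internals are hypothetical (the real X allocator manages its own heap; std::malloc merely stands in for it here); the point is that every allocating thread serializes on the same mutex:

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Hypothetical sketch of the singleton X allocator from Figure 1:
// one global instance, one mutex guarding the global heap.
class XAllocator {
public:
    static XAllocator& instance() {        // Singleton access point
        static XAllocator inst;
        return inst;
    }

    void* allocate(std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);  // every thread queues here
        return std::malloc(size);                  // stand-in for the real heap
    }

    void deallocate(void* p) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::free(p);
    }

private:
    XAllocator() = default;
    std::mutex mutex_;  // protects the global heap
};
```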

Realizing the scale of the threat, I decided to investigate and went online to look for information about multi-threaded memory managers (multi-threaded pooled allocators). During the search I came across several projects:
• Hoard (http://www.hoard.org/), distributed under the GPL v.2.0 license;
• TBB Memory Allocator from Intel (rejected, because the library is paid);
• Thread-Caching Malloc from Google gperftools (https://code.google.com/p/gperftools/).
During the search I also found a very interesting article describing various approaches to working with dynamic memory: www.realcoding.net/article/view/2747.

With several solutions in hand, I decided to experiment: to compare the performance of the various memory managers and to look for a way to optimize the X allocator.

After analyzing the implementations of Hoard and TCMalloc, I came to understand that both memory managers use per-thread caching (an alternative to this approach is lock-free lists). With caching, when an object is deleted, its memory is not returned to the global heap but placed in the cache of the current thread. Then, when the same kind of object is created again, the thread does not have to take the mutex and go to the shared heap. Schematically, this approach can be depicted as follows (Figure 2):

Figure 2 - Block diagram of the X allocator with per-thread caching
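The caching idea itself can be sketched in a few lines of C++. This is a simplified illustration of the general approach, not code from Hoard or TCMalloc; for brevity it assumes a single object size, so the cache can ignore block sizes:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Simplified sketch of per-thread caching: freed blocks go into a
// thread_local free list instead of back to the global heap.
class ThreadCachingAllocator {
public:
    void* allocate(std::size_t size) {
        std::vector<void*>& cache = localCache();
        if (!cache.empty()) {            // fast path: no mutex, no global heap
            void* p = cache.back();
            cache.pop_back();
            return p;
        }
        return allocateFromGlobalHeap(size);  // slow path: mutex-protected
    }

    void deallocate(void* p) {
        localCache().push_back(p);       // keep the block in this thread's cache
    }

private:
    static std::vector<void*>& localCache() {
        thread_local std::vector<void*> cache;  // one cache per thread
        return cache;
    }

    void* allocateFromGlobalHeap(std::size_t size) {
        // In a real allocator this path would take the global mutex.
        return std::malloc(size);
    }
};
```

A real implementation also has to handle different size classes and return excess cached memory to the global heap; this is where the details of Hoard and TCMalloc differ.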

Having implemented quick-and-dirty caching in the X allocator (hereinafter the pooled X allocator), I proceeded to the performance comparison.

The behavior of the memory managers was studied under intensive CPU load. The test consisted of creating X objects on N threads, checking the hypothesis that the test time is inversely proportional to the number of threads. In an ideal world, creating 1000 objects on 1 thread would take twice as long as creating 1000 objects on 2 threads (500 objects per thread).
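A rough sketch of such a test in C++ might look like this (the object count, the helper name, and the exact allocation pattern are illustrative assumptions, not the original test code):

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Create 'totalObjects' objects of 'objectSize' bytes, split evenly across
// 'nThreads' threads, and return the wall-clock time in seconds.
double runTest(unsigned nThreads, std::size_t objectSize,
               std::size_t totalObjects = 1000000) {
    auto worker = [&](std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            char* p = new char[objectSize];  // goes through the allocator under test
            p[0] = 0;                        // touch the memory
            delete[] p;                      // freed block may land in a thread cache
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nThreads; ++t)
        threads.emplace_back(worker, totalObjects / nThreads);
    for (auto& th : threads)
        th.join();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
}
```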

The tests performed can be divided into the following categories:
1. OS Type:
a. Windows;
b. Linux.
2. Size of objects:
a. 128 B;
b. 1 KB;
c. 1 MB.
3. Memory Manager:
a. system memory manager;
b. TCMalloc (Google);
c. Hoard;
d. pooled X allocator (the X allocator singleton replaced with an array of X allocators; on allocation and deallocation, the allocator index is computed from the thread id - see the sketch after this list).
e. X allocator;
4. Number of threads: from 1 to 8;
5. Each experiment was run 5 times, and the arithmetic mean was taken.
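The pooled X allocator variant from item 3d can be sketched as follows (again, the names are hypothetical and std::malloc stands in for the real per-allocator heap); the idea is that threads mostly contend on different mutexes:

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical sketch of the pooled X allocator: an array of independent
// mutex-protected allocators, indexed by a hash of the calling thread's id.
class PooledXAllocator {
    struct Shard {                       // one mutex-protected heap
        std::mutex mutex;
        void* allocate(std::size_t size) {
            std::lock_guard<std::mutex> lock(mutex);
            return std::malloc(size);    // stand-in for the real X heap
        }
        void deallocate(void* p) {
            std::lock_guard<std::mutex> lock(mutex);
            std::free(p);
        }
    };

    static constexpr std::size_t kShards = 8;  // e.g. one per worker thread
    Shard shards_[kShards];

    Shard& shardForThisThread() {
        std::size_t h = std::hash<std::thread::id>{}(std::this_thread::get_id());
        return shards_[h % kShards];
    }

public:
    void* allocate(std::size_t size) { return shardForThisThread().allocate(size); }
    void deallocate(void* p)         { shardForThisThread().deallocate(p); }
};
```

Note that with this scheme a block must be freed by the thread that allocated it (or the shard index must be stored with the block), since deallocate picks the shard from the current thread's id.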

Below I give only one graph (Figure 3) and the results of the experiment for it (Table 1), since the results were similar in all cases. The results shown are for Linux with 1 KB objects.

Figure 3 - Results of the memory manager performance study

On the graph, the abscissa axis is the number of threads, and the ordinate axis is the relative test run time (100% is the run time of the test on one thread using the system memory manager).

Table 1 - Results of the memory manager performance study

Based on the results obtained, the following conclusions can be drawn:
1. the X allocator that we use showed good performance only on 1 thread; as the number of threads grows, performance drops (due to contention on the mutex);
2. the other memory managers handled the parallel workload well: as the number of threads increased, the test indeed ran faster;
3. TCMalloc showed the best performance.

Of course, I understood that the experiment was artificial and that under real conditions the performance gain would not be as significant (some tests showed an 80x speed-up). When I replaced the X allocator with TCMalloc in my project and ran an experiment on real data, the performance gain was 10%, which is quite a good improvement.
Performance is great, of course, but the X allocator also gives us the ability to analyze memory consumption and hunt for leaks. What about the other memory managers in this respect? I will try to cover that comparison in detail in the next article.

Source: https://habr.com/ru/post/238787/

