
Optimizing dynamic memory allocation performance in a multi-threaded library


Foreword


This article grew out of a problem I had to solve fairly recently: the performance of code designed to run in several threads simultaneously dropped sharply after yet another feature was added, but only on Windows XP / Server 2003. Using Process Explorer, I found that at almost any moment only one thread was actually running, the rest were waiting, and the TID of the active thread was constantly changing. Clearly, the threads were contending for a shared resource, and that resource turned out to be the default process heap. The new code makes heavy use of dynamic memory allocation and deallocation (copying strings, copying and modifying large STL containers), which is what led to the problem.

Some theory


As is well known, the default allocator for STL containers and std::basic_string (std::allocator) takes its memory from the process heap, and allocation and deallocation on that heap are serialized by a lock. Consequently, with frequent HeapAlloc / HeapFree calls we risk keeping the heap locked most of the time and starving the other threads. That is exactly what happened in my case.
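A minimal sketch of the kind of workload that provokes this contention (the thread count, iteration count and string size below are arbitrary and chosen purely for illustration): every thread repeatedly copies a large std::string, and every copy and destruction goes through HeapAlloc / HeapFree on the shared default heap.

#include <windows.h>
#include <process.h>
#include <string>
#include <vector>

unsigned __stdcall worker(void *)
{
    std::string big(64 * 1024, 'x');          // one large source string
    for (int i = 0; i < 100000; ++i)
    {
        std::string copy(big);                // allocates from the default heap
        copy[0] = 'y';                        // keep the optimizer honest
    }                                         // frees back to the default heap
    return 0;
}

int main()
{
    const int threads = 8;                    // arbitrary thread count
    std::vector<HANDLE> h(threads);
    for (int i = 0; i < threads; ++i)
        h[i] = (HANDLE)_beginthreadex(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(threads, &h[0], TRUE, INFINITE);
    for (int i = 0; i < threads; ++i)
        CloseHandle(h[i]);
    return 0;
}

Run under Process Explorer, a workload like this shows the same picture described above: one runnable thread at a time, the rest waiting on the heap lock.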

A simple solution


The first solution to this problem is to enable the so-called low-fragmentation heap (LFH). When a heap is switched to this mode, memory consumption grows, because a noticeably larger block than requested is often allocated, but the overall throughput of the heap increases. This mode is enabled by default starting with Windows Vista (a plausible explanation for the slowdown on XP / 2003 and its absence on 2008 / 7 / 2008 R2). The code that switches the default heap into this mode is extremely simple:

#define HEAP_LFH 2

ULONG HeapInformation = HEAP_LFH;
HeapSetInformation(GetProcessHeap(), HeapCompatibilityInformation,
                   &HeapInformation, sizeof(HeapInformation));
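As a side note, you can check whether the switch actually took effect by reading the mode back with HeapQueryInformation (the LFH cannot be enabled, for instance, while heap debugging tools are active). A small sketch, with error handling kept minimal:

#include <windows.h>
#include <stdio.h>

int main()
{
    ULONG  mode = 0;
    SIZE_T returned = 0;
    if (HeapQueryInformation(GetProcessHeap(), HeapCompatibilityInformation,
                             &mode, sizeof(mode), &returned))
    {
        // 0 = standard heap, 1 = look-aside lists (pre-Vista), 2 = LFH
        printf("heap compatibility mode: %lu\n", mode);
    }
    return 0;
}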

Built with Visual Studio 2013, the fix did its job: the CPU is now loaded at 100%, and the benchmark runtime dropped by a factor of N! However, when building with our (it pains me to say this...) Visual Studio 2003 (yes, it is still in use), the fix had zero effect... Houston, we have a problem...

Universal solution


The second possible solution to the problem is to give each thread its own heap. Let's get started.

To store the handles of the individual heaps I suggest using TLS slots. Since my code is only part of a library and does not control thread creation or termination, I have no way of knowing when a heap should be destroyed (creation is trivial, destruction is harder). As a solution, I suggest a pool of heaps of the following form:

#include <windows.h>
#include <stack>

class HeapPool
{
private:
    std::stack<HANDLE> pool;
    CRITICAL_SECTION   sync;
    bool               isOlderNt6;
public:
    HeapPool()
    {
        InitializeCriticalSection(&sync);
        DWORD dwMajorVersion = (DWORD)(GetVersion() & 0xFF);
        isOlderNt6 = (dwMajorVersion < 6);
    }
    ~HeapPool()
    {
        DeleteCriticalSection(&sync);
        while (!pool.empty()) // destroy every heap that was returned to the pool
        {
            HeapDestroy(pool.top());
            pool.pop();
        }
    }
    HANDLE GetHeap()
    {
        EnterCriticalSection(&sync);
        HANDLE hHeap = NULL;
        if (pool.empty())
        {
            hHeap = HeapCreate(0, 0x100000, 0);
            if (isOlderNt6) // starting with NT 6.0 the LFH is already enabled by default
            {
                ULONG info = 2 /* HEAP_LFH */;
                HeapSetInformation(hHeap, HeapCompatibilityInformation,
                                   &info, sizeof(info));
            }
        }
        else
        {
            hHeap = pool.top();
            pool.pop();
        }
        LeaveCriticalSection(&sync);
        return hHeap;
    }
    void PushHeap(HANDLE hHeap)
    {
        EnterCriticalSection(&sync);
        pool.push(hHeap);
        LeaveCriticalSection(&sync);
    }
};

HeapPool heapPool; // global pool of heaps

Now create a global slot variable:

class TlsHeapSlot
{
private:
    DWORD index;
public:
    TlsHeapSlot()
    {
        index = TlsAlloc();
    }
    void set(HANDLE hHeap)
    {
        TlsSetValue(index, hHeap);
    }
    HANDLE get()
    {
        return (HANDLE)TlsGetValue(index);
    }
};

/* Note: in any given thread, get() returns NULL until set() has been called there */
TlsHeapSlot heapSlot;

Since our goal is to optimize the behavior of std::basic_string and the STL containers, we will need a custom allocator:

template <typename T>
class parallel_allocator : public std::allocator<T>
{
public:
    typedef size_t   size_type;
    typedef T*       pointer;
    typedef const T* const_pointer;

    template<typename _Tp1>
    struct rebind
    {
        typedef parallel_allocator<_Tp1> other;
    };

    pointer allocate(size_type n, const void *hint = 0)
    {
        // allocate from the heap attached to the current thread
        return (pointer)HeapAlloc(heapSlot.get(), 0, sizeof(T) * n);
    }
    void deallocate(pointer p, size_type n)
    {
        HeapFree(heapSlot.get(), 0, p);
    }

    parallel_allocator() throw() : std::allocator<T>() {}
    parallel_allocator(const parallel_allocator &a) throw() : std::allocator<T>(a) {}
    template <class U>
    parallel_allocator(const parallel_allocator<U> &a) throw() : std::allocator<T>(a) {}
    ~parallel_allocator() throw() {}
};
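For convenience, the container types bound to this allocator can be hidden behind typedefs; the names par_string and par_int_list below are just an illustration, not part of the library:

// Hypothetical convenience typedefs for containers bound to the per-thread heap.
typedef std::basic_string<char, std::char_traits<char>, parallel_allocator<char> > par_string;
typedef std::list<int, parallel_allocator<int> > par_int_list;

// Inside a function that has already attached a heap to the current thread:
//     par_string   str("some text");   // allocated via HeapAlloc on the thread's heap
//     par_int_list lst(1000, 42);      // likewise

Note that such objects must not outlive the call that borrowed the heap from the pool: once the heap is handed back, it may be reused by another thread or destroyed.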

And the final touch:

class HeapWatch // hands the heap back to the pool when the scope is left
{
private:
    HANDLE hHeap;
public:
    HeapWatch(HANDLE heap) : hHeap(heap) {}
    ~HeapWatch()
    {
        heapPool.PushHeap(hHeap);
    }
};

extern "C" int my_api(const char * arg) // a library function called from user threads
{
    HANDLE hHeap = heapPool.GetHeap();
    HeapWatch watch(hHeap);
    heapSlot.set(hHeap);
    /* Here goes the code that runs concurrently in several threads and actively
       allocates / frees memory. Objects are declared as
       std::basic_string<char, std::char_traits<char>, parallel_allocator<char> > str
       and std::list<int, parallel_allocator<int> > lst, so all of their allocations
       go to the heap attached to the current thread and never touch the default heap. */
    return 0;
}
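To measure the effect, a driver along the following lines can be used: it calls my_api from several threads at once and reports the wall-clock time. This is only a sketch of a benchmark, not the one used for the figures below; the thread and iteration counts are arbitrary.

#include <windows.h>
#include <process.h>
#include <stdio.h>

extern "C" int my_api(const char * arg); // the library function shown above

unsigned __stdcall caller(void *)
{
    for (int i = 0; i < 10000; ++i)
        my_api("some input");             // each call borrows a heap from the pool
    return 0;
}

int main()
{
    const int threads = 4;                // arbitrary
    HANDLE h[4];
    DWORD start = GetTickCount();
    for (int i = 0; i < threads; ++i)
        h[i] = (HANDLE)_beginthreadex(NULL, 0, caller, NULL, 0, NULL);
    WaitForMultipleObjects(threads, h, TRUE, INFINITE);
    printf("elapsed: %lu ms\n", GetTickCount() - start);
    for (int i = 0; i < threads; ++i)
        CloseHandle(h[i]);
    return 0;
}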

Results:


The main thing is that the problem is solved, and not with a crutch. The second solution is more involved, but it gives a performance gain on all tested platforms (from Windows XP through Windows 10) when the project is built with either VS2013 or VS2003 (the biggest effect is on XP / 2003 with VS2003, where the test runtime dropped by almost a factor of N, N being the number of CPU cores; on newer platforms the gain is 3 to 10 percent). The simple solution is ideal for those who use recent compiler versions. I hope this information will be useful to someone besides me.

Source: https://habr.com/ru/post/267155/

