Formulation of the problem
One of the algorithms I implemented had some interesting memory-usage characteristics:
- It could allocate a huge number of small objects of the same type, up to tens or hundreds of millions.
- The objects were POD types: each contains a handful of variables and can be treated as a plain data structure.
- It was not known in advance how many objects would be needed; it could turn out to be a hundred, or maybe a hundred million.
- Objects are never deleted one by one; at some point they all become unneeded at once.
- The algorithm parallelizes well, so several threads (one per processor core) allocate objects simultaneously.
Using the standard new/delete under these conditions leads to a very large amount of time lost deleting objects. Without a debugger the deletion finished within a few seconds, but under the debugger freeing the memory slows down by roughly a factor of 100 (!), and debugging the project becomes impossible. In addition, because of the large number of allocated objects, the memory overhead of the allocator's internal bookkeeping became quite noticeable.
To solve the problem of allocating a huge number of objects of the same type and deleting them in one batch, a lock-free container, MassAllocator, was written. The code compiles with Visual Studio 2012. The full source of the project is available on github.
Additional features
In my case the objects could refer to each other, and to save memory a small hack was used: instead of a pointer, the object's number is stored, and the object itself is obtained by requesting it from the storage. The number of objects is guaranteed to be less than four billion, so a 32-bit index is used instead of a 64-bit pointer, which saves 4 bytes per reference. This reduced memory consumption by about 12%.
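A minimal sketch of the trick, assuming a hypothetical element type Node and a storage instance (neither name, nor the header name, is from the project):

#include <cstdint>
#include "MassAllocator.h"  // header name assumed

// 'next' holds a 32-bit object number instead of a 64-bit pointer,
// saving 4 bytes per reference.
struct Node
{
    uint32_t next;
    float payload;
};

MassAllocator<Node> storage;

// The referenced object is obtained by asking the storage for it.
Node &follow(const Node &n)
{
    return storage[n.next];
}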
A nice bonus was that the storage can easily be iterated over, which makes it possible to apply standard library algorithms to it, for example std::sort.
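For instance, continuing the Node/storage sketch above and assuming MassAllocator exposes begin()/end() iterators (the project source on github is authoritative here), sorting all elements could look like this:

#include <algorithm>

// Sort every allocated element in place; the comparator is illustrative.
std::sort(storage.begin(), storage.end(),
          [](const Node &a, const Node &b) { return a.payload < b.payload; });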
Implementation
The idea is to allocate elements sequentially, block by block. A block of memory is allocated with the standard malloc and is logically treated as an array of elements. On each user request for an element, a pointer to the next element of that array is returned and the counter of allocated elements is incremented. When all the elements of the array have been handed out, the next block of memory is requested, and so on. Freeing the memory is very fast: all the allocated blocks are simply released, without any destructor calls on the elements.
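Before the code, here is a rough sketch of the container state this description implies. The member names match the operator[] and createElement code below; everything else, such as keeping the blocks in a std::vector, is an assumption of this sketch:

#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <vector>

template <typename T>
class MassAllocator
{
public:
    typedef T*     pointer;
    typedef T&     reference;
    typedef size_t size_type;

    MassAllocator()
        : elementsInBlockCount_(100000)  // elements per block (example value)
        , curAtomicIndex_(0)
    {
        // Allocate block 0 up front so the very first allocation
        // has a block to hand elements out from.
        blocks_.push_back(static_cast<T*>(std::malloc(sizeof(T) * elementsInBlockCount_)));
    }

    ~MassAllocator()
    {
        // Batch release: just free the blocks, with no destructor calls on
        // the elements (safe because T is a POD type).
        for (T *block : blocks_)
            std::free(block);
    }

private:
    std::vector<T*>       blocks_;               // blocks obtained from malloc
    size_t                elementsInBlockCount_; // capacity of each block
    std::atomic<uint64_t> curAtomicIndex_;       // see "Allocating a new element"
};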
All blocks have the same size, so it is very easy to access an element by its global index:
template <typename T>
typename MassAllocator<T>::reference MassAllocator<T>::operator[](size_type index)
{
    size_t indexOfBlock = index / elementsInBlockCount_;
    size_t indexInBlock = index % elementsInBlockCount_;
    return blocks_[indexOfBlock][indexInBlock];
}
Allocating a new element
So, all elements are laid out sequentially in blocks. Allocating a new element looks very simple: increment the counter of allocated elements, check that the current block, from which elements are being handed out, has not run out, and allocate a new block if necessary. Of course, a counter incremented from different threads must be implemented as a std::atomic variable, and the algorithm itself must be lock-free!
The atomic counter hands out indices sequentially, and everything is fine as long as there is room in the block. But then the block comes to an end, and a new one has to be allocated. Exactly one thread should allocate the block, while the rest pause and resume work after the block has been allocated. Using a single atomic counter, I managed to implement this logic under one assumption: allocating a memory block must be fast enough that the waiting threads cannot overflow a 32-bit counter in their idle-wait loop. A 64-bit atomic variable is used to synchronize access; it is logically split into two parts: the low 32 bits are the element counter within the block, and the high 32 bits are the counter of allocated memory blocks. The counter is declared as:
std::atomic<uint64_t> curAtomicIndex_;
Each memory block holds the same number of elements, for example 100000. After the counter has been incremented, three different situations can arise for the element counter within the block:
- the element counter is less than the block capacity: the index is valid, and a pointer to the corresponding element is returned;
- the element counter is exactly equal to the block capacity: this thread is the first one past the end of the block, so it allocates a new block and publishes a fresh counter value (block counter incremented, element counter reset);
- the element counter is greater than the block capacity: another thread is currently allocating a new block, so this thread waits in an idle loop and retries once the fresh counter value appears.
All three cases are handled by the retry loop in createElement, shown below.
At first I tried an implementation with two separate 32-bit counters, but then a race appears: for example, a thread first queries the index within the block, and then the block number. It may happen that the thread obtains a valid element index itemIndex, but before it can read the block index blockIndx, the thread scheduler puts it to sleep, and while it sleeps another thread allocates a new block; the element index would then be combined with the wrong block number. Therefore the element index and the block index must be obtained atomically, either inside a critical section or through a single atomic variable.
The element allocation code has one peculiarity: together with the pointer to the allocated element it can also return the element's index, and access to the upper and lower halves of the 64-bit integer is organized through a union rather than bit arithmetic:
template <typename T>
typename MassAllocator<T>::pointer MassAllocator<T>::createElement(size_type *returningIndex)
{
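    // NOTE: the body below is a sketch reconstructed from the description in
    // this article; the exact implementation lives in the project on github.

    // Union view of the 64-bit counter: the low 32 bits are the element index
    // inside the block, the high 32 bits are the block index.
    union Index
    {
        uint64_t whole;
        struct
        {
            uint32_t inBlock;
            uint32_t ofBlock;
        } parts;
    };

    for (;;)
    {
        Index idx;
        idx.whole = curAtomicIndex_.fetch_add(1);   // atomically claim the next slot

        if (idx.parts.inBlock < elementsInBlockCount_)
        {
            // Common case: the slot is inside an already allocated block.
            // (Concurrent readers assume blocks_ never reallocates; the real
            // code must guarantee this, e.g. by reserving capacity up front.)
            if (returningIndex != nullptr)
                *returningIndex = idx.parts.ofBlock * elementsInBlockCount_ + idx.parts.inBlock;
            return &blocks_[idx.parts.ofBlock][idx.parts.inBlock];
        }

        if (idx.parts.inBlock == elementsInBlockCount_)
        {
            // This thread is the first one past the end of the block: it
            // allocates the next block and publishes a fresh counter value.
            // Slot 0 of the new block is taken by this thread, hence inBlock = 1.
            blocks_.push_back(static_cast<pointer>(std::malloc(sizeof(T) * elementsInBlockCount_)));
            Index fresh;
            fresh.parts.ofBlock = idx.parts.ofBlock + 1;
            fresh.parts.inBlock = 1;
            curAtomicIndex_.store(fresh.whole);
            if (returningIndex != nullptr)
                *returningIndex = fresh.parts.ofBlock * elementsInBlockCount_;
            return &blocks_[fresh.parts.ofBlock][0];
        }

        // Otherwise another thread is allocating the new block right now: keep
        // spinning (this is the idle-wait loop that must not overflow the low
        // 32 bits of the counter) and retry.
    }
}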
Performance
On the x64 platform, allocating 80 million elements in 8 threads with MassAllocator takes about 2000 ms on an i5 2500K, and releasing them takes 70 ms. Allocation with new takes approximately 1350 ms, but deletion with delete takes a whopping 17,400 ms! Under the debugger, even with the project built in the release configuration, I never managed to wait for the test to finish.
For testing on the x86 platform I had to halve the number of allocated objects, since new/delete has a large overhead and the address space is not enough for 80 million objects. MassAllocator allocates 40 million objects in 2400 ms and releases them in 35 ms, while new does its work in 750 ms and delete in 6430 ms.
As expected, profiling shows the bottleneck: the increment of the atomic counter, especially on x86. I have no radical ideas for speeding this fragment up.
Conclusion
Lock-free algorithms are a new area for me, so I will be glad to hear thoughts on the correctness of the algorithm and/or suggestions for speeding the code up.
Update 1
As profiling showed, most of the time is spent on the atomic increment of the 64-bit counter curAtomicIndex_. On x64 this compiles to a single assembly instruction:
lock xadd QWORD PTR
while in x86 mode it turns into a whole poem:
$again$158:
    mov   ecx, edx
    mov   ebx, eax
    add   ebx, DWORD PTR $T7[ebp]
    adc   ecx, DWORD PTR $T7[ebp+4]
    lock  cmpxchg8b QWORD PTR [esi]
    jne   SHORT $again$158
This is why allocating an element in 64-bit mode is much faster.
Memory is released quickly on both x64 and x86.
As an alternative, boost::object_pool could be used; it fits ideologically, but it is single-threaded and cannot work with multiple threads.
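For illustration, a minimal sketch of why it fits ideologically (MyPod is an illustrative type, not from the project): boost::object_pool also releases all of its memory at once, in the pool's destructor, with no per-element delete:

#include <boost/pool/object_pool.hpp>

struct MyPod { int a; float b; };   // illustrative POD type

void example()
{
    boost::object_pool<MyPod> pool;
    MyPod *p = pool.construct();    // allocate a single element
    p->a = 42;
    // no per-element delete: the pool destructor releases everything at once
}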
You can try boost::fast_pool_allocator as a replacement for new/delete; it shows better speed than new/delete, and in 32-bit mode it even ties with MassAllocator in the overall standings (allocation + release) thanks to faster allocation.
In terms of memory efficiency all the alternatives lose to MassAllocator: in 32-bit mode neither new/delete nor boost::fast_pool_allocator has enough address space to fit 80 million objects, while MassAllocator consumes only 1570 MB.