Formulation of the problem
One of the algorithms I implemented had some interesting memory-usage characteristics:
- It could allocate a huge number of small objects of the same type, up to tens or hundreds of millions.
- The objects were POD types: each contains a handful of variables and can be treated as a plain data structure.
- It was not known in advance how many objects would be needed; it could turn out to be a hundred, or maybe a hundred million.
- Objects are never deleted one by one; at some point they all become unneeded at once.
- The algorithm parallelizes well, so several threads (one per processor core) allocate objects simultaneously.
Using the standard new/delete under these conditions leads to a very large amount of time lost deleting objects. Without a debugger the deletion finished within a few seconds, but under the debugger freeing the memory slows down by roughly a factor of 100 (!), and debugging the project becomes impossible. In addition, because of the large number of allocated objects, the memory overhead of the allocator's internal bookkeeping became quite noticeable.
To solve the problem of allocating a huge number of objects of the same type and deleting them in one batch, a lock-free container, MassAllocator, was written. The code compiles with Visual Studio 2012. The full source of the project is available on github.
Additional features
In my case the objects could refer to each other, and to save memory a small hack was used: instead of a pointer, the object's number is stored, and the object itself is obtained by requesting it from the storage. The number of objects is guaranteed to be less than four billion, so a 32-bit index is used instead of a 64-bit pointer, which saves 4 bytes per reference. This reduced memory consumption by about 12%.
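A minimal sketch of the trick, assuming a hypothetical element type Node and a storage instance (neither name, nor the header name, is from the project):

#include <cstdint>
#include "MassAllocator.h"  // header name assumed

// 'next' holds a 32-bit object number instead of a 64-bit pointer,
// saving 4 bytes per reference.
struct Node
{
    uint32_t next;
    float payload;
};

MassAllocator<Node> storage;

// The referenced object is obtained by asking the storage for it.
Node &follow(const Node &n)
{
    return storage[n.next];
}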
A nice bonus was that the storage can easily be iterated over, which makes it possible to apply standard library algorithms to it, for example std::sort.
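For instance, continuing the Node/storage sketch above and assuming MassAllocator exposes begin()/end() iterators (the project source on github is authoritative here), sorting all elements could look like this:

#include <algorithm>

// Sort every allocated element in place; the comparator is illustrative.
std::sort(storage.begin(), storage.end(),
          [](const Node &a, const Node &b) { return a.payload < b.payload; });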
Implementation
The idea is to allocate elements sequentially, block by block. A block of memory is allocated with the standard malloc and is logically treated as an array of elements. On each user request for an element, a pointer to the next element of that array is returned and the counter of allocated elements is incremented. When all the elements of the array have been handed out, the next block of memory is requested, and so on. Freeing the memory is very fast: all the allocated blocks are simply released, without any destructor calls on the elements.
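Before the code, here is a rough sketch of the container state this description implies. The member names match the operator[] and createElement code below; everything else, such as keeping the blocks in a std::vector, is an assumption of this sketch:

#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <vector>

template <typename T>
class MassAllocator
{
public:
    typedef T*     pointer;
    typedef T&     reference;
    typedef size_t size_type;

    MassAllocator()
        : elementsInBlockCount_(100000)  // elements per block (example value)
        , curAtomicIndex_(0)
    {
        // Allocate block 0 up front so the very first allocation
        // has a block to hand elements out from.
        blocks_.push_back(static_cast<T*>(std::malloc(sizeof(T) * elementsInBlockCount_)));
    }

    ~MassAllocator()
    {
        // Batch release: just free the blocks, with no destructor calls on
        // the elements (safe because T is a POD type).
        for (T *block : blocks_)
            std::free(block);
    }

private:
    std::vector<T*>       blocks_;               // blocks obtained from malloc
    size_t                elementsInBlockCount_; // capacity of each block
    std::atomic<uint64_t> curAtomicIndex_;       // see "Allocating a new element"
};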
All blocks have the same size, so it is very easy to access an element by its global index:
template <typename T>
typename MassAllocator<T>::reference MassAllocator<T>::operator[](size_type index)
{
    size_t indexOfBlock = index / elementsInBlockCount_;
    size_t indexInBlock = index % elementsInBlockCount_;
    return blocks_[indexOfBlock][indexInBlock];
}
Allocating a new element
So, all elements are laid out sequentially in blocks. Allocating a new element looks very simple: increment the counter of allocated elements, check that the current block, from which elements are being handed out, has not run out, and allocate a new block if necessary. Of course, a counter incremented from different threads must be implemented as a std::atomic variable, and the algorithm itself must be lock-free!
The atomic counter hands out indices sequentially, and everything is fine as long as there is room in the block. But then the block comes to an end, and a new one has to be allocated. Exactly one thread should allocate the block, while the rest pause and resume work after the block has been allocated. Using a single atomic counter, I managed to implement this logic under one assumption: allocating a memory block must be fast enough that the waiting threads cannot overflow a 32-bit counter in their idle-wait loop. A 64-bit atomic variable is used to synchronize access; it is logically split into two parts: the low 32 bits are the element counter within the block, and the high 32 bits are the counter of allocated memory blocks. The counter is declared as:
std::atomic<uint64_t> curAtomicIndex_;
Each memory block holds the same number of elements, for example 100000. After the counter has been incremented, three different situations can arise for the element counter within the block:
- the element counter is less than the block capacity: the index is valid, and a pointer to the corresponding element is returned;
- the element counter is exactly equal to the block capacity: this thread is the first one past the end of the block, so it allocates a new block and publishes a fresh counter value (block counter incremented, element counter reset);
- the element counter is greater than the block capacity: another thread is currently allocating a new block, so this thread waits in an idle loop and retries once the fresh counter value appears.
All three cases are handled by the retry loop in createElement, shown below.
At first I tried an implementation with two separate 32-bit counters, but then a race appears: for example, a thread first queries the index within the block, and then the block number. It may happen that the thread obtains a valid element index itemIndex, but before it can read the block index blockIndx, the thread scheduler puts it to sleep, and while it sleeps another thread allocates a new block; the element index would then be combined with the wrong block number. Therefore the element index and the block index must be obtained atomically, either inside a critical section or through a single atomic variable.
The element allocation code has one peculiarity: together with the pointer to the allocated element it can also return the element's index, and access to the upper and lower halves of the 64-bit integer is organized through a union rather than bit arithmetic:
template <typename T>
typename MassAllocator<T>::pointer MassAllocator<T>::createElement(size_type *returningIndex)
{
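    // NOTE: the body below is a sketch reconstructed from the description in
    // this article; the exact implementation lives in the project on github.

    // Union view of the 64-bit counter: the low 32 bits are the element index
    // inside the block, the high 32 bits are the block index.
    union Index
    {
        uint64_t whole;
        struct
        {
            uint32_t inBlock;
            uint32_t ofBlock;
        } parts;
    };

    for (;;)
    {
        Index idx;
        idx.whole = curAtomicIndex_.fetch_add(1);   // atomically claim the next slot

        if (idx.parts.inBlock < elementsInBlockCount_)
        {
            // Common case: the slot is inside an already allocated block.
            // (Concurrent readers assume blocks_ never reallocates; the real
            // code must guarantee this, e.g. by reserving capacity up front.)
            if (returningIndex != nullptr)
                *returningIndex = idx.parts.ofBlock * elementsInBlockCount_ + idx.parts.inBlock;
            return &blocks_[idx.parts.ofBlock][idx.parts.inBlock];
        }

        if (idx.parts.inBlock == elementsInBlockCount_)
        {
            // This thread is the first one past the end of the block: it
            // allocates the next block and publishes a fresh counter value.
            // Slot 0 of the new block is taken by this thread, hence inBlock = 1.
            blocks_.push_back(static_cast<pointer>(std::malloc(sizeof(T) * elementsInBlockCount_)));
            Index fresh;
            fresh.parts.ofBlock = idx.parts.ofBlock + 1;
            fresh.parts.inBlock = 1;
            curAtomicIndex_.store(fresh.whole);
            if (returningIndex != nullptr)
                *returningIndex = fresh.parts.ofBlock * elementsInBlockCount_;
            return &blocks_[fresh.parts.ofBlock][0];
        }

        // Otherwise another thread is allocating the new block right now: keep
        // spinning (this is the idle-wait loop that must not overflow the low
        // 32 bits of the counter) and retry.
    }
}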
Performance
On the x64 platform, allocating 80 million elements in 8 threads with MassAllocator takes about 2000 ms on an i5 2500K, and releasing them takes 70 ms. Allocation with new takes approximately 1350 ms, but deletion with delete takes a whopping 17,400 ms! Under the debugger, even with the project built in the release configuration, I never managed to wait for the test to finish.
For testing on the x86 platform I had to halve the number of allocated objects, since new/delete has a large overhead and the address space is not enough for 80 million objects. MassAllocator allocates 40 million objects in 2400 ms and releases them in 35 ms, while new does its work in 750 ms and delete in 6430 ms.
As expected, profiling shows the bottleneck: the increment of the atomic counter, especially on x86. I have no radical ideas for speeding this fragment up.
Conclusion
Lock-free algorithms are a new area for me, so I will be glad to hear thoughts on the correctness of the algorithm and/or suggestions for speeding the code up.
Update 1
As profiling showed, most of the time is spent on the atomic increment of the 64-bit counter curAtomicIndex_. On x64 this compiles to a single assembly instruction:
lock xadd QWORD PTR
while in x86 mode it turns into a whole poem:
$again$158:
    mov   ecx, edx
    mov   ebx, eax
    add   ebx, DWORD PTR $T7[ebp]
    adc   ecx, DWORD PTR $T7[ebp+4]
    lock  cmpxchg8b QWORD PTR [esi]
    jne   SHORT $again$158
This is why allocating an element in 64-bit mode is much faster.
Memory is released quickly on both x64 and x86.
As an alternative, boost::object_pool could be used; it fits ideologically, but it is single-threaded and cannot work with multiple threads.
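For illustration, a minimal sketch of why it fits ideologically (MyPod is an illustrative type, not from the project): boost::object_pool also releases all of its memory at once, in the pool's destructor, with no per-element delete:

#include <boost/pool/object_pool.hpp>

struct MyPod { int a; float b; };   // illustrative POD type

void example()
{
    boost::object_pool<MyPod> pool;
    MyPod *p = pool.construct();    // allocate a single element
    p->a = 42;
    // no per-element delete: the pool destructor releases everything at once
}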
You can try boost::fast_pool_allocator as a replacement for new/delete; it shows better speed than new/delete, and in 32-bit mode it even ties with MassAllocator in the overall standings (allocation + release) thanks to faster allocation.
In terms of memory efficiency all the alternatives lose to MassAllocator: in 32-bit mode neither new/delete nor boost::fast_pool_allocator has enough address space to fit 80 million objects, while MassAllocator consumes only 1570 MB.