
Radix Compact sorting algorithm. Part 1: implementation on the CPU

In one of my projects, related to computer vision, I needed to sort a large array of numbers (about 100 million items). The sorting code had to run as fast as possible, be able to execute on multiple processors, and preferably on the GPU. The sort implemented in the standard C++ library did not fit: it is based on the Quick Sort algorithm, which currently cannot be massively parallelized on the GPU. The search for other methods led to the Radix Sort algorithm, but the sources I found described an implementation with high memory consumption; more precisely, the memory required is (the number of array elements) * (the size of the radix array). For an array of 100 million elements and a radix array of 256 elements, 25.6 GB would be required, which is hardly realistic for current computing hardware. However, Radix Sort is well suited to parallel computation, so the author set out to modify this method and bring its memory consumption down to acceptable values.

First, let's take a look at the general scheme of the Radix Sort algorithm, step by step.
Take an array of two-digit decimal numbers (decimal to simplify the reasoning) that needs to be sorted in ascending order:

12, 10, 45, 29, 74, 32, 11, 47, 22, 27.

To perform the sort, you need to create an auxiliary array (in this article it is called the radix array). The array should have 10 elements; note that the number of elements must equal the number of possible values of the sorting digit (what the sorting digit is will become clear from the reasoning below). The auxiliary array is initially empty and looks like this:

Idx0: - | Idx1: - | Idx2: - | Idx3: - | Idx4: - | Idx5: - | Idx6: - | Idx7: - | Idx8: - | Idx9: -

Pass 1: filling the radix array.
We begin to fill the radix array: take the first number from the source array, 12, take its last digit (2), and store the number 12 in the element of the radix array with index 2. The radix array now looks like this:

Idx0: - | Idx1: - | Idx2: 12 | Idx3: - | Idx4: - | Idx5: - | Idx6: - | Idx7: - | Idx8: - | Idx9: -
Next, take the second number from the source array, 10, and store it in the element with index 0:

Idx0: 10 | Idx1: - | Idx2: 12 | Idx3: - | Idx4: - | Idx5: - | Idx6: - | Idx7: - | Idx8: - | Idx9: -
In the same way we copy all the numbers from the source array into the radix array; in the end it takes the following form:

Idx0: 10 | Idx1: 11 | Idx2: 12, 32, 22 | Idx3: - | Idx4: 74 | Idx5: 45 | Idx6: - | Idx7: 47, 27 | Idx8: - | Idx9: 29

After all the numbers have been copied into the radix array, we copy them out into a new array, starting with the first element of the radix array, and get the following result:
10,11,12,32,22,74,45,47,27,29.
As you can see, the numbers are not sorted yet; an additional pass over the radix array is needed (note that two-digit decimal numbers require exactly 2 passes, three-digit numbers 3, four-digit numbers 4, and so on).

Pass 2: filling the radix array.
We take the first number from the intermediate array, 10, take its penultimate digit (1), and store the number 10 in the element of the radix array with index 1:

Idx0: - | Idx1: 10 | Idx2: - | Idx3: - | Idx4: - | Idx5: - | Idx6: - | Idx7: - | Idx8: - | Idx9: -
Next, take the second number from the intermediate array, 11, and store it in the element with index 1:

Idx0: - | Idx1: 10, 11 | Idx2: - | Idx3: - | Idx4: - | Idx5: - | Idx6: - | Idx7: - | Idx8: - | Idx9: -
And so on: we fill the entire radix array with the numbers from the intermediate array, where the index of a number's position in the radix array equals the value of its penultimate digit:

Idx0: - | Idx1: 10, 11, 12 | Idx2: 22, 27, 29 | Idx3: 32 | Idx4: 45, 47 | Idx5: - | Idx6: - | Idx7: 74 | Idx8: - | Idx9: -
That's it: now it is enough to copy the numbers from the radix array into a new array, and the numbers come out sorted in ascending order:
10,11,12,22,27,29,32,45,47,74.
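To make the two passes above concrete, here is a minimal C++ sketch of this scheme (illustrative only: it uses std::vector buckets, i.e. the straightforward, memory-hungry form of the radix array, not yet the compact variant discussed below):

#include <array>
#include <cstdio>
#include <vector>

// One distribution pass: numbers are placed into 10 buckets by the chosen
// decimal digit, then collected back in bucket order. Two such passes (last
// digit, then the penultimate one) sort two-digit numbers.
static void radix_pass(std::vector<int>& a, int divisor)
{
    std::array<std::vector<int>, 10> radix;       // the "radix array"
    for (int x : a)
        radix[(x / divisor) % 10].push_back(x);   // place by the chosen digit
    a.clear();
    for (const auto& bucket : radix)              // collect in index order
        for (int x : bucket)
            a.push_back(x);
}

int main()
{
    std::vector<int> a = {12, 10, 45, 29, 74, 32, 11, 47, 22, 27};
    radix_pass(a, 1);    // pass 1: by the last digit
    radix_pass(a, 10);   // pass 2: by the penultimate digit
    for (int x : a)
        std::printf("%d ", x);                    // 10 11 12 22 27 29 32 45 47 74
    std::printf("\n");
}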

Now let's consider how writes to the radix array are performed in practice when the sorting code runs on the CPU. Clearly, the values of the numbers that fill the radix array must be stored somewhere, and this must be done as quickly as possible (which is crucial for sorting algorithms). In the "classic" Radix Sort implementation, the values are stored in the radix array itself, i.e. each of its elements is in turn an array into which the source numbers are written; this is the reason for the high memory consumption. Because we do not know in advance which numbers we will have to sort, we have to reserve room in every element of the radix array for all the numbers of the source array. For our simplified example we would have to reserve a block of 10x10 elements; in the working version (a radix array of 256 elements and about 100 million numbers) it would be 256 * N elements, where N is the size of the source array, which is far too much. Here is an illustration of how the memory of the radix array is organized in this scheme for our example, on the second filling pass:

Idx0 | Idx1 | Idx2 | Idx3 | Idx4 | Idx5 | Idx6 | Idx7 | Idx8 | Idx9
 -   | 10   | 22   | 32   | 45   | -    | -    | 74   | -    | -
 -   | 11   | 27   | -    | 47   | -    | -    | -    | -    | -
 -   | 12   | 29   | -    | -    | -    | -    | -    | -    | -
(plus seven more reserved but unused rows in every column)

As you can see, such a memory layout leads to overspending: the scheme is redundant and costly. Let's try to find a way to organize the radix array that does not waste memory. First, let's think: how much memory do we actually need? As many elements as there are in the source array! How can the storage of the source numbers be implemented in practice? The author considered several schemes. The first is dynamic memory allocation; with this scheme the radix array looks like this (simplified):

void* | void* | void* | void* | void* | void* | void* | void* | void* | void*

The second scheme allocates memory in an additional array: the numbers belonging to one element of the radix array are stored as a linked list, and the radix array holds the index of the first element of each list.
Both of these methods reduce memory consumption to an acceptable level, but in practice they turned out to be slow: the slowest was the one with dynamic memory allocation, the linked-list variant was faster. The reason is the costly memory reallocation calls in the first case and the high latency of random memory accesses in the second.
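As an illustration, here is a rough sketch of how the linked-list variant could look (the identifiers and the extra per-bucket tail index are illustrative assumptions, not code from the article). Walking the per-bucket lists during collection is exactly the random memory access that makes this variant slow:

#include <cstdint>
#include <vector>

// The radix array keeps the index of the first node of each bucket's list;
// the nodes themselves live in one preallocated auxiliary array, so memory
// use is N nodes plus the radix array. A tail index per bucket is kept so
// that insertion order inside a bucket is preserved (the sort must be stable).
struct Node {
    uint32_t value;
    int32_t  next;   // index of the next node in the bucket, -1 = end of list
};

static void radix_pass_lists(std::vector<uint32_t>& a, uint32_t shift)
{
    const int RADIX = 256;
    std::vector<int32_t> head(RADIX, -1), tail(RADIX, -1);
    std::vector<Node> nodes;
    nodes.reserve(a.size());                          // one allocation for all N numbers

    for (uint32_t x : a) {
        int bucket = (x >> shift) & 0xFF;             // 8-bit digit
        int idx = static_cast<int>(nodes.size());
        nodes.push_back({x, -1});
        if (head[bucket] == -1) head[bucket] = idx;   // first node of the bucket
        else nodes[tail[bucket]].next = idx;          // append to the bucket's list
        tail[bucket] = idx;
    }

    // Collect: walk every bucket's list in index order (random memory access).
    std::size_t out = 0;
    for (int b = 0; b < RADIX; ++b)
        for (int32_t i = head[b]; i != -1; i = nodes[i].next)
            a[out++] = nodes[i].value;
}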

As a result, a third method was found that uses memory more efficiently, at the cost of one extra pass over the radix array after it has been filled (since the radix array has few elements, this is acceptable). The scheme is this: create an additional array in which the source numbers are stored, not as linked lists, but as pairs (source number, index of that number in the radix array), while the radix array itself stores only the total count of source numbers per element. After pass 2 these arrays look like this:
Radix array:

Idx0: count = 0 | Idx1: count = 3 | Idx2: count = 3 | Idx3: count = 1 | Idx4: count = 2 | Idx5: count = 0 | Idx6: count = 0 | Idx7: count = 1 | Idx8: count = 0 | Idx9: count = 0

Additional array (it stores pairs: source value, index of that value in the radix array):

(10,1), (11,1), (12,1), (32,3), (22,2), (74,7), (45,4), (47,4), (27,2), (29,2)
Next, we make an additional pass over the radix array, in which we compute the indices of the positions at which the final array will be filled with numbers from the additional array. The result of this additional pass (the intermediate radix array):

Idx0: IdxR = -1 | Idx1: IdxR = 0 | Idx2: IdxR = 3 | Idx3: IdxR = 6 | Idx4: IdxR = 7 | Idx5: IdxR = -1 | Idx6: IdxR = -1 | Idx7: IdxR = 9 | Idx8: IdxR = -1 | Idx9: IdxR = -1
Note: A value of -1 indicates that the item is not set.

Finally, we fill the final array with numbers from the additional array, using the values from the intermediate array: for example, for the first pair (10,1) the copy position is 0, for the fourth pair (32,3) the copy position is 6, and for the sixth pair (74,7) the copy position is 9. If a number's copy position is already taken, we look for the first unoccupied element of the final array.

The resulting array:
10,11,12,22,27,29,32,45,47,74.
As you can see, the result is correct.
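For the working case of 32-bit numbers and an 8-bit radix, one pass of this scheme could be sketched as follows. This is a sketch, not the article's actual code: identifiers are illustrative, and the final scatter uses a per-bucket write cursor, which always lands each number in a free cell, instead of searching for the first unoccupied element:

#include <cstdint>
#include <utility>
#include <vector>

// One pass of the compact scheme: the radix array stores only per-bucket
// counts, an additional array stores (value, bucket index) pairs, a prefix
// scan over the counts gives the start position of each bucket in the output,
// and a final scatter writes every value to its place.
static void radix_compact_pass(std::vector<uint32_t>& a, uint32_t shift)
{
    const int RADIX = 256;
    std::vector<uint32_t> count(RADIX, 0);
    std::vector<std::pair<uint32_t, uint32_t>> pairs;   // (value, bucket index)
    pairs.reserve(a.size());

    // Pass over the data: remember each value's bucket and count bucket sizes.
    for (uint32_t x : a) {
        uint32_t b = (x >> shift) & 0xFF;
        pairs.emplace_back(x, b);
        ++count[b];
    }

    // Extra pass over the (small) radix array: exclusive prefix sum -> offsets.
    std::vector<uint32_t> offset(RADIX, 0);
    uint32_t sum = 0;
    for (int b = 0; b < RADIX; ++b) { offset[b] = sum; sum += count[b]; }

    // Scatter the pairs into the output using per-bucket write cursors.
    for (const auto& p : pairs)
        a[offset[p.second]++] = p.first;
}

// For 32-bit keys and an 8-bit radix, four such passes sort the array:
//   for (uint32_t shift = 0; shift < 32; shift += 8) radix_compact_pass(a, shift);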

The described sorting method was named Radix Compact, after its defining property: reduced memory consumption compared to the original Radix Sort.
Now let's compare the general properties of the Radix Compact algorithm with those of Quick Sort (for sorting 32-bit numbers with an 8-bit radix):

Property | Radix Compact | Quick Sort
Computational complexity, best case | N * 2 + 256 * 1 | N * log(N)
Computational complexity, worst case | N * 8 + 256 * 4 | N^2
Extra memory | N + size of the radix array | log(N)
Massively parallel execution possible | Yes | No
Memory access pattern | Moderately random | Linear
Data locality during processing (important for keeping data in the CPU cache) | Sufficient | High

Theoretically, the Radix Compact algorithm has quite good characteristics, so let's test it in practice. To begin with, let's compare the algorithms on Quick Sort's "home field": a single-threaded implementation on arrays of 32-bit numbers:



Testing was conducted on an Intel i5-3330 CPU (3000 MHz) with 16 GB of memory under Windows 8 64-bit. The Radix Compact sorting code is written in C++; the Quick Sort implementation is std::sort(). Note: a slightly modified Radix Compact was tested, and the reported Radix Compact running time does not include memory allocation time.
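The article does not list the benchmark code itself; a harness along these lines could be used to reproduce this kind of single-threaded comparison (the data size, random seed and output format here are arbitrary assumptions):

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Times a sorting routine on its own copy of the input data.
template <typename SortFn>
double time_sort(std::vector<uint32_t> data, SortFn sort_fn)
{
    auto t0 = std::chrono::steady_clock::now();
    sort_fn(data);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const std::size_t N = 100000000;                  // ~100 million 32-bit numbers
    std::vector<uint32_t> data(N);
    std::mt19937 gen(42);
    for (auto& x : data) x = gen();

    double t_std = time_sort(data, [](std::vector<uint32_t>& v) {
        std::sort(v.begin(), v.end());                // Quick Sort baseline
    });
    std::printf("std::sort: %.2f s\n", t_std);

    // To time the Radix Compact sketch from above, pass a lambda that performs
    // four radix_compact_pass() calls (shift = 0, 8, 16, 24) instead of std::sort.
}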

As we can see, even under the most favorable conditions for Quick Sort, Radix Compact outperforms it by up to 7 times in speed, losing only on small arrays (up to 100 elements). This is understandable: Radix Compact requires additional processing of the small radix array; for large arrays this overhead barely affects the result, but for small arrays it is significant. Overall, though, the Radix Compact results are quite decent. In the next part of the article I will describe the results of adapting the algorithm to run on the GPU; parallelizing the computation promises an even greater gain over the standard Quick Sort.

Source: https://habr.com/ru/post/260343/

