String sorting algorithm

Hi, Habr! I propose to your court a new sorting algorithm. Not theoretical studies, namely the practical implementation. Sorting internal and external, “in one bottle”, automatically switching from one to another, depending on the availability of available RAM. Sort strings, numbers, structures, "as well as everything that is needed in the future."

Funnel sorting is a significantly improved version of the merge sorting algorithm. The principal difference is that sorted data arrays for a merge are generated dynamically, and their size is determined by the sequence of input data itself. In particular, if the array is already sorted (in direct or reverse order - it almost doesn’t matter), immediately after the data is loaded into RAM, they will already be organized as a fully sorted single list, which can only be sent to the output stream.

The algorithm can be divided into several steps:

Step 1. Read the source data

A funnel is a collection of k ordered lists (initially empty). To minimize the number of comparisons, the current input element, regardless of the number of elements in the list, is compared only with the head and tail of the corresponding list: if it is less than or equal to the first element, it becomes its head, if it is greater than or equal to the last element, it is added to the tail. If neither one nor the other, it goes to comparison with the head and tail of the next list (if there is one), or it forms the next list itself, being in it at the same time head and tail. Thus, the elements of these lists form a “funnel”: the head of each following list is guaranteed to be larger than the head of the previous one, and its tail, respectively, is smaller. Therefore, when moving along a “funnel,” the probability that the next element will be attached somewhere without increasing the size of the funnel is rather high.
')
The elements of the input stream are placed in RAM, which is also ordered dynamically, in blocks of at least the maximum size of the element (usually much larger), once the ordered RAM has been exhausted. Each element is preceded by a passport of the form “element size in bytes” (determined when reading an element) and “address of the next element” (initially zeroed out), that is, the input data stream is converted into a list. Such an organization of data allows sorting without moving the elements themselves and provides an economical use of memory for storing heterogeneous data (usually, these are lines that may vary thousands of times in length), and each volume is allocated only the amount that is required by this particular element.

Step 2. Merging lists

The need to merge lists arises if:

element cannot be attached to the funnel (all k of the funnel lists are full);
input stream exhausted (all data is loaded into RAM);
the available RAM is exhausted (there is no place to place the next element).

The merging algorithm of ordered arrays is well known: at each step, the smaller of the current elements of the subarrays is taken and sent to the resulting array. The counters of the numbers of the elements of the resulting array and the subarray from which the element was taken are increased by one. When one of the subarrays is over, the remaining elements of the second subarray are added to the resulting array. Since arrays in RAM are presented as lists, the merging of two subarrays occurs without moving the data elements themselves, which is much faster.

The mechanism for merging lists when an overflow funnel can be implemented, for example, as follows: the two smallest (by volume or by number of elements) lists merge into one - into the one that takes the place of the older one by the funnel of the merged list (in this case the funnel remains a funnel, that is the head and tail of any list are guaranteed to be within the range of any of the higher lists and outside the range of any of the younger ones, all the lists are smaller than the youngest of the merged ones are shifted by one, and the last list of the funnel is reset.

If the input data stream or available RAM is exhausted, all the lists are first merged into pairs in pairs, starting with the smallest ones, and then this list is sent to the output stream. The sorting process is complete.

Step 3. Merge files

This step is performed only when there is a shortage of RAM to accommodate the entire sortable data array. The temporary files are merged into a new file with the subsequent deletion of two merged files, until the temporary files are completely exhausted. The merge algorithm is the same as when merging lists.

The complexity of the algorithm

The funnel sorting algorithm belongs to a class of algorithms based on the comparison of elements. They are flexible, but have a fundamental performance limit for the worst case: it cannot be better than O (n * log2n). The worst case for this algorithm is well known: if the input stream is also sorted by a “funnel” and does not have identical elements, i.e. a) the minimum element b) the maximum c) the minimum of the remaining d) the maximum of the remaining (for example, when sorting numbers: 1, 99, 2, 98, 3, 97, ...), etc. In this case, only two elements fit into the funnel array, and the algorithm is actually reduced to the classical merge sorting algorithm, in which the first step is combined with the process of reading elements into RAM. As you know, this algorithm has exactly the complexity of O (n * log2n) - both for the worst case and for the best. Practice, however, shows that in the real world, sortable data sets often contain ordered subarrays, and any ascending or descending subsequence will be guaranteed to be “caught” by a funnel the size of even one list! Therefore, the funnel sorting for the best case (when the input data has already been sorted) will be performed in linear time O (n).

The average complexity with a random distribution of input data can only be estimated as probabilistic. Obviously, the more monotonically increasing or decreasing subsequences in the source data, the more repeated elements there are, the more linear the sorting time will be - especially if you take into account the loading times of the source data and the recording of results. Moreover, when reading and writing data, access to them is consistent, that is, as fast as possible. Experiments have shown that the sorting time is comparable with the time of copying files or any other linear data processing. Testing on specially prepared arrays for the best and worst case with a volume from a million to one hundred million rows showed: despite the fact that the number of comparisons of elements differs by orders of magnitude, the total sorting time slows down only by tens of percent (in all experiments less than twice) . For comparison: the lack of RAM due to a sharp increase in the number of file operations slows down the sorting time by tens of times. Thus, if the theoretical complexity for the worst case is still logarithm, the real complexity of the program is linear: the worst case time is only twice as long as the best, obviously linear.

Advantages and disadvantages

Advantages:

one of the fastest (in practice) of general-purpose internal and external sorting algorithms;
universal Sorts practically any data types - in particular, arrays with elements of constant or variable length;
stream processing of input data: if other algorithms need to pre-download the entire data array into RAM, after which the actual sorting begins, then the “funnel” has almost finished it;
natural behavior: when processing already ordered or partially ordered data, the sorting time is noticeably reduced;
works in conditions of limited resources (lack of RAM);
works on sequential access data structures;
does not have "difficult" input data;
easy to implement;
requires only O (n) additional memory for its work.

Disadvantages:

unstable.

Generally speaking, the stability of the algorithm is not a problem if all the keys are different (it depends on the input data) or when identical elements are indistinguishable - for example, when the entire element is a key (this is exactly how the element comparison algorithm in this implementation is arranged). In extreme cases, the algorithm can be easily refined to ensure stable sorting by expanding the key with the initial index of the element in the array. In the case of equality of the main keys, the comparison is made by index, thus excluding the possibility of changing the relative position of equal elements. This modification requires additional memory in the size of the index for each element, which is initialized by the value at the time of reading the array in RAM.

Implementation

The most obvious improvement in the algorithm is to separate the processes of filling the funnel lists and merging them, as well as decide on the optimal size of the funnel itself: as the maximum number of lists increases, the time for finding a place to place the next element increases, albeit slightly. to slow down the process of merging lists - after all, they can be from two elements to many thousands or even millions! Experiments have shown that the size of the funnel has almost no effect on efficiency (the funnels were tested in 1, 2, 4, 8, ... 64 lists): the increase in time for adding an element is almost offset by a decrease in the number of mergers of larger lists. Therefore, the simplest version was implemented with the size of a funnel in one list, requiring no more than two comparisons for the extension of an element. But the size of the buffer of sorted lists before the merge, on the contrary, was chosen large enough (1024 lists), with overflow of which three quarters of the smallest lists of this buffer (up to 256) merge at once. This technology allows you to effectively sort failed (close to the worst case) source data sets, combining large arrays just before writing the results to a file.

Optimizing the process of merging temporary files is not very relevant and noticeable only when there is a serious lack of RAM, when the number of cuts to files and subsequent merges reaches hundreds and even thousands - for example, when sorting gigabyte files with a utility compiled for MS-DOS with 300-400 RAM available kilobyte But even in this case, the sorting is quite confident and not very long. The maximum number of files is defined as 36 (numbers and letters of the Latin alphabet) with a collapse at overflow to 16 at once.

Development

The funnel sorting algorithm (as well as other algorithms based on comparisons) has quite a good potential for increasing functionality. In the current implementation, sorting is carried out by increasing the value of the data byte in the string - the most universal comparison mechanism, which allows sorting texts in different languages and data structures containing non-text characters. But, ideally, the method of comparing elements should be customizable by the user (or generally user). In particular, it is possible to provide the possibility of sorting in ascending or descending order, alphabetically Russian (in WIN-DOS-KOI-UTF et cetera encodings), Arabic, Chinese and other languages, with or without ignoring the case of characters, sorting data of different types (integer , real, dates, time), sorting by fields of tables and other structures, the ability to override the terminator value (by default - the line feed character 0x0A) or the array length (when sorting data structures of constant length), etc. In any case, only the procedure of comparing elements is set up - the rest of the sorting algorithm “funnel” remains unchanged.

Source: https://habr.com/ru/post/448542/

All Articles

String sorting algorithm

Step 1. Read the source data

Step 2. Merging lists

Step 3. Merge files

More articles: