
External Sort with O(1) Extra Memory

After reading this article, I was reminded of an external sort I once wrote that used O(1) additional memory. The function received a binary file and the maximum amount of memory it was allowed to allocate for its array:

void ext_sort(const std::string filename, const size_t memory) 

I used the algorithm from Effective Performance of External Sorting with No Additional Disk Space:

  1. Divide the file into blocks that fit into the available memory. Denote these blocks Block_1, Block_2, ..., Block_(S−1), Block_S. Set P = 1.
  2. Read Block_P into memory.
  3. Sort the data in memory and write it back to Block_P. Set P = P + 1; if P ≤ S, read Block_P into memory and repeat this step. In other words, sort each block of the file.
  4. Divide each block into two smaller blocks, B_1 and B_2. Each of them occupies half of the available memory.
  5. Read block B_1 of Block_1 into the first half of the available memory. Set Q = 2.
  6. Read block B_1 of Block_Q into the second half of the available memory.
  7. Merge the two arrays in memory with an in-place merge, write the second half of memory back to block B_1 of Block_Q, and set Q = Q + 1; if Q ≤ S, read block B_1 of Block_Q into the second half of memory and repeat this step.
  8. Write the first half of the available memory to block B_1 of Block_1. Since we always kept the smaller half of the elements in memory and merged it with every block, this half of memory now holds the M smallest elements of the entire file.
  9. Read block B_2 of Block_S into the second half of the available memory. Set Q = S − 1.
  10. Read block B_2 of Block_Q into the first half of the available memory.
  11. Merge the arrays in memory with an in-place merge, write the first half of memory back to block B_2 of Block_Q, and set Q = Q − 1. If Q ≥ 1, read block B_2 of Block_Q into the first half of memory and repeat this step.
  12. Write the second half of the available memory to block B_2 of Block_S. By the same argument as in step 8, the M largest elements of the entire file end up here.
  13. Starting from block B_2 of Block_1 and ending with block B_1 of Block_S, define new blocks in the file and renumber them Block_1 through Block_(S−1); since the minima and maxima are already in place, the remaining region is one block shorter, so set S = S − 1. Divide each block into blocks B_1 and B_2. Set P = 1.
  14. Read B_1 and B_2 of Block_P into memory, merge the two arrays in memory, write the sorted array back to Block_P, and set P = P + 1. If P ≤ S, repeat this step.
  15. If S > 1, return to step 5. On each pass we pick off the M smallest and M largest elements, write them to the beginning and end of the file respectively, and then do the same with the remaining elements until we reach the middle of the file.
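
Before touching a real file, here is a tiny in-memory simulation of steps 5-8 (my own illustration, not part of the original code) showing why the first half-block ends up holding the smallest elements of the whole array; std::inplace_merge stands in for the in-place merge used later:

  #include <algorithm>
  #include <cassert>
  #include <cstddef>
  #include <vector>

  int main() {
      const std::size_t S = 4, chunk = 8, half = chunk / 2;
      std::vector<int> file(S * chunk);
      for (std::size_t i = 0; i < file.size(); ++i)
          file[i] = static_cast<int>((i * 37) % 101); // distinct pseudo-random values
      std::vector<int> reference(file);
      std::sort(reference.begin(), reference.end());

      // Steps 2-3: sort every block.
      for (std::size_t p = 0; p < S; ++p)
          std::sort(file.begin() + p * chunk, file.begin() + (p + 1) * chunk);

      std::vector<int> mem(chunk);
      // Step 5: load B_1 of Block_1 into the first half of "memory".
      std::copy_n(file.begin(), half, mem.begin());
      for (std::size_t q = 1; q < S; ++q) {
          // Step 6: load B_1 of Block_Q into the second half.
          std::copy_n(file.begin() + q * chunk, half, mem.begin() + half);
          // Step 7: merge; the larger half goes back to B_1 of Block_Q.
          std::inplace_merge(mem.begin(), mem.begin() + half, mem.end());
          std::copy_n(mem.begin() + half, half, file.begin() + q * chunk);
      }
      // Step 8: the first half of memory holds the `half` smallest elements.
      std::copy_n(mem.begin(), half, file.begin());
      assert(std::equal(file.begin(), file.begin() + half, reference.begin()));
  }

The invariant holds because each of the globally smallest elements necessarily lies in the first half of its own sorted block, and the running merge never lets one escape.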

The advantage of this algorithm, besides needing no scratch space on disk, is that it reads data from the disk in relatively large chunks, which speeds it up.
Let's implement the algorithm in C++.

To begin with, we determine the number of blocks and the block size, in bytes and in elements, and allocate the memory:

  const size_t type_size = sizeof(T);
  const uint64_t filesize = file_size(filename);
  std::fstream data(filename, std::ios::in | std::ios::out | std::ios::binary);
  const uint64_t chunk_number = filesize / memory;
  const size_t chunk_size = memory / type_size - (memory / type_size) % 2,
          chunk_byte_size = chunk_size * type_size,
          half_chunk_byte_size = chunk_byte_size / 2,
          half_chunk_size = chunk_size / 2;
  std::vector<T> *chunk = new std::vector<T>(chunk_size);
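
The snippet uses a file_size helper the article never shows. A minimal sketch of such a helper (my assumption; the author's version may differ), using the C++17 filesystem library:

  // Hypothetical helper, not from the original source: size of a file in bytes.
  #include <cstdint>
  #include <filesystem>
  #include <string>

  uint64_t file_size(const std::string &filename) {
      return static_cast<uint64_t>(std::filesystem::file_size(filename));
  }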

Now steps 2-3: sort each block.

  for (uint64_t i = 0; i < chunk_number; ++i) {
      data.seekg(chunk_byte_size * i);
      data.read((char *) chunk->data(), chunk_byte_size);
      flat_quick_sort(chunk->begin(), chunk->end());
      data.seekp(chunk_byte_size * i);
      data.write((char *) chunk->data(), chunk_byte_size);
  }

We'll write the sort itself a little later.

Now for the merges. First, the bottom half:

  int64_t s = chunk_number, start = 0;
  while (s > 0) {
      data.seekg(half_chunk_byte_size * start);
      data.read((char *) chunk->data(), half_chunk_byte_size);
      for (int64_t q = 1; q < s; ++q) {
          data.seekg(half_chunk_byte_size * start + chunk_byte_size * q);
          data.read((char *) (chunk->data() + half_chunk_size), half_chunk_byte_size);
          in_place_merge(chunk->begin(), chunk->begin() + half_chunk_size - 1,
                         chunk->begin() + chunk_size);
          data.seekp(half_chunk_byte_size * start + chunk_byte_size * q);
          data.write((char *) (chunk->data() + half_chunk_size), half_chunk_byte_size);
      }
      data.seekp(half_chunk_byte_size * start);
      data.write((char *) chunk->data(), half_chunk_byte_size);

And the same for the top half:

      data.seekg(half_chunk_byte_size * start + chunk_byte_size * s - half_chunk_byte_size);
      data.read((char *) (chunk->data() + half_chunk_size), half_chunk_byte_size);
      for (int64_t q = s - 2; q >= 0; --q) {
          data.seekg(half_chunk_byte_size * (start + 1) + chunk_byte_size * q);
          data.read((char *) chunk->data(), half_chunk_byte_size);
          in_place_merge(chunk->begin(), chunk->begin() + half_chunk_size - 1,
                         chunk->begin() + chunk_size);
          data.seekp(half_chunk_byte_size * (start + 1) + chunk_byte_size * q);
          data.write((char *) chunk->data(), half_chunk_byte_size);
      }
      data.seekp(half_chunk_byte_size * start + chunk_byte_size * s - half_chunk_byte_size);
      data.write((char *) (chunk->data() + half_chunk_size), half_chunk_byte_size);

Renumber the blocks, close the loop, and don't forget to free the memory:

      --s;
      ++start;
      for (int64_t p = 0; p < s; ++p) {
          data.seekg(half_chunk_byte_size * start + chunk_byte_size * p);
          data.read((char *) chunk->data(), chunk_byte_size);
          in_place_merge(chunk->begin(), chunk->begin() + half_chunk_size - 1,
                         chunk->begin() + chunk_size);
          data.seekp(half_chunk_byte_size * start + chunk_byte_size * p);
          data.write((char *) chunk->data(), chunk_byte_size);
      }
  }
  delete chunk;

Full function
  template<typename T>
  void ext_sort(const std::string filename, const size_t memory) {
      const size_t type_size = sizeof(T);
      const uint64_t filesize = file_size(filename);
      std::fstream data(filename, std::ios::in | std::ios::out | std::ios::binary);
      const uint64_t chunk_number = filesize / memory;
      const size_t chunk_size = memory / type_size - (memory / type_size) % 2,
              chunk_byte_size = chunk_size * type_size,
              half_chunk_byte_size = chunk_byte_size / 2,
              half_chunk_size = chunk_size / 2;
      std::vector<T> *chunk = new std::vector<T>(chunk_size);
      // Steps 2-3: sort each block.
      for (uint64_t i = 0; i < chunk_number; ++i) {
          data.seekg(chunk_byte_size * i);
          data.read((char *) chunk->data(), chunk_byte_size);
          flat_quick_sort(chunk->begin(), chunk->end());
          data.seekp(chunk_byte_size * i);
          data.write((char *) chunk->data(), chunk_byte_size);
      }
      int64_t s = chunk_number, start = 0;
      while (s > 0) {
          // Steps 5-8: sweep the bottom halves, collecting the minimums.
          data.seekg(half_chunk_byte_size * start);
          data.read((char *) chunk->data(), half_chunk_byte_size);
          for (int64_t q = 1; q < s; ++q) {
              data.seekg(half_chunk_byte_size * start + chunk_byte_size * q);
              data.read((char *) (chunk->data() + half_chunk_size), half_chunk_byte_size);
              in_place_merge(chunk->begin(), chunk->begin() + half_chunk_size - 1,
                             chunk->begin() + chunk_size);
              data.seekp(half_chunk_byte_size * start + chunk_byte_size * q);
              data.write((char *) (chunk->data() + half_chunk_size), half_chunk_byte_size);
          }
          data.seekp(half_chunk_byte_size * start);
          data.write((char *) chunk->data(), half_chunk_byte_size);
          // Steps 9-12: sweep the top halves, collecting the maximums.
          data.seekg(half_chunk_byte_size * start + chunk_byte_size * s - half_chunk_byte_size);
          data.read((char *) (chunk->data() + half_chunk_size), half_chunk_byte_size);
          for (int64_t q = s - 2; q >= 0; --q) {
              data.seekg(half_chunk_byte_size * (start + 1) + chunk_byte_size * q);
              data.read((char *) chunk->data(), half_chunk_byte_size);
              in_place_merge(chunk->begin(), chunk->begin() + half_chunk_size - 1,
                             chunk->begin() + chunk_size);
              data.seekp(half_chunk_byte_size * (start + 1) + chunk_byte_size * q);
              data.write((char *) chunk->data(), half_chunk_byte_size);
          }
          data.seekp(half_chunk_byte_size * start + chunk_byte_size * s - half_chunk_byte_size);
          data.write((char *) (chunk->data() + half_chunk_size), half_chunk_byte_size);
          // Steps 13-14: renumber the blocks and merge the remaining pairs.
          --s;
          ++start;
          for (int64_t p = 0; p < s; ++p) {
              data.seekg(half_chunk_byte_size * start + chunk_byte_size * p);
              data.read((char *) chunk->data(), chunk_byte_size);
              in_place_merge(chunk->begin(), chunk->begin() + half_chunk_size - 1,
                             chunk->begin() + chunk_size);
              data.seekp(half_chunk_byte_size * start + chunk_byte_size * p);
              data.write((char *) chunk->data(), chunk_byte_size);
          }
      }
      delete chunk;
  }
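
A quick way to try it out. This is a hypothetical driver (the file name, sizes, and seed are mine, not from the article); it assumes the file_size helper plus flat_quick_sort and in_place_merge from below, and a file size that divides evenly into blocks, which the code above requires:

  #include <algorithm>
  #include <cstdint>
  #include <fstream>
  #include <iostream>
  #include <random>
  #include <vector>

  int main() {
      const std::string name = "data.bin";
      const size_t n = 131072; // 1 MB of int64_t
      std::vector<int64_t> v(n);
      std::mt19937_64 gen(42);
      for (auto &x : v) x = static_cast<int64_t>(gen());
      std::ofstream(name, std::ios::binary)
              .write(reinterpret_cast<const char *>(v.data()), n * sizeof(int64_t));

      ext_sort<int64_t>(name, 128 * 1024); // 128 KB budget -> 8 blocks

      std::ifstream(name, std::ios::binary)
              .read(reinterpret_cast<char *>(v.data()), n * sizeof(int64_t));
      std::cout << (std::is_sorted(v.begin(), v.end()) ? "sorted" : "NOT sorted") << '\n';
  }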


It remains to implement the flat_quick_sort and in_place_merge functions. I took the idea (and most of the implementation) of the flat quick sort from an article by Habr user ripatti. I just replaced the median of medians (it seemed like overkill for the average case) with a median of nine and added insertion sort for small parts of the array.

flat_quicksort.h
  #ifndef EXTERNAL_SORT_FLAT_QUICKSORT_H
  #define EXTERNAL_SORT_FLAT_QUICKSORT_H

  #include <algorithm> // std::is_sorted, std::swap
  #include <cstddef>   // ptrdiff_t
  #include <utility>   // std::pair

  template<class ForwIt>
  void insertion_sort(ForwIt be, ForwIt en) {
      for (ForwIt ii = be + 1; ii != en; ii++) {
          ForwIt jj = ii - 1;
          auto val = *ii;
          while (jj >= be and *jj > val) {
              *(jj + 1) = *jj;
              --jj;
          }
          *(jj + 1) = val;
      }
  }

  template<class ForwIt>
  ForwIt median_of_3(ForwIt it1, ForwIt it2, ForwIt it3) {
      return (*it1 > *it2) ? (*it3 > *it2) ? (*it1 > *it3) ? it3 : it1 : it2
                           : (*it3 > *it1) ? (*it2 > *it3) ? it3 : it2 : it1;
  }

  template<class ForwIt>
  ForwIt choose_pivot(ForwIt be, ForwIt en) {
      ptrdiff_t s = (en - be) / 8;
      ForwIt mid = be + (en - be) / 2;
      ForwIt best1 = median_of_3(be, be + s, be + 2 * s),
             best2 = median_of_3(mid - s, mid, mid + s),
             best3 = median_of_3(en - 2 * s, en - s, en);
      return median_of_3(best1, best2, best3);
  }

  // search for the end of the current block
  template<class ForwIt>
  ForwIt block_range(ForwIt be, ForwIt en) {
      ForwIt it = be;
      while (it != en) {
          if (*be < *it)
              return it;
          ++it;
      }
      return it;
  }

  // warning: after the partition the outer pivot may point to a random element
  template<class ForwIt>
  std::pair<ForwIt, ForwIt> partition_range(ForwIt be, ForwIt en, ForwIt pivot) {
      std::pair<ForwIt, ForwIt> re;
      ForwIt j = be;
      for (ForwIt i = be; i != en; ++i)
          if (*i < *pivot) {
              if (pivot == i) pivot = j;
              else if (pivot == j) pivot = i;
              std::swap(*j, *i);
              ++j;
          }
      re.first = j;
      for (ForwIt i = j; i != en; ++i)
          if (*pivot == *i) {
              if (pivot == i) pivot = j;
              else if (pivot == j) pivot = i;
              std::swap(*j, *i);
              ++j;
          }
      re.second = j;
      return re;
  }

  // makes the largest element the first
  template<class ForwIt>
  void blockify(ForwIt be, ForwIt en) {
      for (ForwIt it = be; it != en; ++it)
          if (*be < *it)
              std::swap(*be, *it);
  }

  template<class ForwIt>
  void flat_quick_sort(ForwIt be, ForwIt en) {
      ForwIt tmp = en; // the current end of the block
      while (be != en) {
          if (std::is_sorted(be, tmp)) {
              be = tmp;
              tmp = block_range(be, en);
              continue;
          }
          if (tmp - be < 32)
              insertion_sort(be, tmp);
          else {
              ForwIt pivot = choose_pivot(be, tmp);
              std::pair<ForwIt, ForwIt> range = partition_range(be, tmp, pivot);
              blockify(range.second, tmp);
              tmp = range.first;
          }
      }
  }

  #endif //EXTERNAL_SORT_FLAT_QUICKSORT_H
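
A quick sanity check of the sort on its own (my test, assuming the header above is saved as flat_quicksort.h):

  #include <algorithm>
  #include <cassert>
  #include <random>
  #include <vector>
  #include "flat_quicksort.h"

  int main() {
      std::vector<int> v(1000000);
      std::mt19937 gen(1);
      for (auto &x : v) x = static_cast<int>(gen());
      flat_quick_sort(v.begin(), v.end());
      assert(std::is_sorted(v.begin(), v.end()));
  }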


Merging was harder. To begin with, I used a stub that needs O(n) extra memory:

  template<typename T>
  void merge(std::vector<T> &chunk, size_t s, size_t q, size_t r) {
      std::vector<T> *chunk2 = new std::vector<T>(q - s + 1);
      auto it2 = chunk2->begin(), it1 = chunk.begin() + q + 1, it = chunk.begin() + s;
      std::copy(it, it1, it2);
      while (it2 != chunk2->end() && it1 != chunk.begin() + r + 1) {
          if (*it1 > *it2) {
              *it = *it2;
              ++it2;
          } else {
              *it = *it1;
              ++it1;
          }
          ++it;
      }
      if (it1 == chunk.begin() + r + 1)
          std::copy(it2, chunk2->end(), it);
      else
          std::copy(it1, chunk.begin() + r + 1, it);
      delete chunk2;
  }
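
For what it's worth, the standard library already ships a buffered merge: std::inplace_merge tries to allocate a temporary buffer and only falls back to a slower genuinely in-place strategy when the allocation fails. With the same [s, q] / [q+1, r] convention as the stub above, the whole function collapses to one call (my equivalent sketch):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  template<typename T>
  void merge_std(std::vector<T> &chunk, size_t s, size_t q, size_t r) {
      // Merges the sorted runs chunk[s..q] and chunk[q+1..r].
      std::inplace_merge(chunk.begin() + s, chunk.begin() + q + 1,
                         chunk.begin() + r + 1);
  }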

When I wanted to replace the stub with a true in-place version, it turned out that fast in-place merge algorithms are mostly quite complicated (see, for example, On Optimal and Efficient in Place Merging). I needed something simpler, so I chose the algorithm from the article A Simple Algorithm for In-Place Merging:

in-place merge
  #include <algorithm> // std::min_element, std::max_element, std::swap
  #include <cmath>     // sqrt, floor
  #include <cstdint>   // int64_t
  #include <cstddef>   // ptrdiff_t, size_t

  // assumes flat_quicksort.h is included
  template<typename Iter>
  void merge_B_and_Y(Iter z, Iter y, Iter yn) {
      for (; z < y && y < yn; ++z) {
          Iter j = std::min_element(z, y);
          if (*j <= *y)
              std::swap(*z, *j);
          else {
              std::swap(*z, *y);
              ++y;
          }
      }
      if (z < y)
          flat_quick_sort(z, yn);
  }

  template<typename Iter, typename Val>
  Iter find_next_X_block(Iter x0, Iter z, Iter y, size_t k, size_t f,
                         Iter b1, Iter b2, Val max) {
      Val min1 = max, min2 = max;
      Iter m = x0 + (ptrdiff_t) floor((z - x0 - f) / k) * k + f, x = x0;
      if (m <= z)
          m += k;
      for (auto i = m; i + k <= y; i += k) {
          if (i != b1 && i != b2) {
              Iter j = (i < b1 && b1 < i + k) ? m - 1 : i + k - 1;
              if (*i <= min1 && *j <= min2) {
                  x = i;
                  min1 = *i;
                  min2 = *j;
              }
          }
      }
      return x;
  }

  // default arguments added so the three-argument calls in ext_sort compile
  template<typename Iter>
  void in_place_merge(Iter x0, Iter y0, Iter yn, int64_t k = -1, bool rec = false) {
      if (k == -1)
          k = (int64_t) sqrt(yn - x0);
      size_t f = (y0 - x0) % k;
      Iter x = (f == 0) ? y0 - 2 * k : y0 - k - f;
      auto t = *x, max = *std::max_element(x0, yn);
      *x = *x0;
      Iter z = x0, y = y0, b1 = x + 1, b2 = y0 - k;
      int i = 0;
      while (y - z > 2 * k) {
          ++i;
          if (*x <= *y || y >= yn) {
              *z = *x;
              *x = *b1;
              ++x;
              if ((x - x0) % k == f)
                  if (z < x - k)
                      b2 = x - k;
              x = find_next_X_block(x0, z, y, k, f, b1, b2, max);
          } else {
              *z = *y;
              *y = *b1;
              ++y;
              if ((y - y0) % k == 0)
                  b2 = y - k;
          }
          ++z;
          *b1 = *z;
          if (z == x)
              x = b1;
          if (z == b2)
              b2 = yn + 1;
          ++b1;
          if ((b1 - x0) % k == f)
              b1 = (b2 == yn + 1) ? b1 - k : b2;
      }
      *z = t;
      if (rec)
          merge_B_and_Y(z, y, yn);
      else {
          flat_quick_sort(z, y);
          in_place_merge(z, y, yn, (int64_t) sqrt(k), true);
      }
  }


But on my computer, replacing merge with the in-place merge slowed the algorithm down by almost an order of magnitude. Maybe I made a mistake in the implementation, or simply chose a slow algorithm in pursuit of simplicity. There was no time to dig in, as always, and gprof kept crashing for some reason. And then it dawned on me: if we allocate M bytes of dynamic memory, it doesn't matter how we use them, it is still O(1). So just give two thirds to the data and one third to a merge buffer, and the slowdown will be much smaller. And indeed:
  Algorithm        | 7.5 MB int64 in 75 KB memory  | 75 MB int64 in 7.5 MB memory  | 750 MB int64 in 75 MB memory
                   | time / speed                  | time / speed                  | time / speed
  In-place merge   | 6.04329 s / 1,241,045 B/s     | 24.2993 s / 3,086,508 B/s     | -- / --
  Merge            | 0.932663 s / 8,041,489 B/s    | 2.73895 s / 27,382,756 B/s    | 47.7946 s / 15,691,689 B/s
  SLY_G algorithm  | 1.79629 s / 4,175,272 B/s     | 3.84775 s / 19,491,910 B/s    | 39.77 s / 18,858,436 B/s
Unfortunately, the algorithm slows down on large volumes, which is expected, since we use no disk buffer at all. Nevertheless, its speed is quite adequate, and I am sure it can be improved.
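
The last row comes from that buffered variant. A sketch of how the split can look (my reconstruction of the idea, not the author's exact code): the chunk takes two thirds of the budget, and since every merge in this algorithm combines two half-chunks, a buffer of half a chunk, i.e. the remaining third, is enough:

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Hypothetical reconstruction: merge the sorted runs chunk[s..q] and
  // chunk[q+1..r] through a preallocated buffer of >= q - s + 1 elements,
  // so no allocation happens inside the merge itself.
  template<typename T>
  void buffered_merge(std::vector<T> &chunk, std::vector<T> &buf,
                      std::size_t s, std::size_t q, std::size_t r) {
      std::copy(chunk.begin() + s, chunk.begin() + q + 1, buf.begin());
      auto left = buf.begin(), left_end = buf.begin() + (q - s + 1);
      auto right = chunk.begin() + q + 1, right_end = chunk.begin() + r + 1;
      auto out = chunk.begin() + s;
      while (left != left_end && right != right_end)
          *out++ = (*right < *left) ? *right++ : *left++;
      // If the left run ran out first, the rest of the right run is already
      // in place; otherwise copy the buffer's leftovers back.
      std::copy(left, left_end, out);
  }

In ext_sort itself the buffer would then be allocated once, next to the chunk: roughly a chunk sized to 2/3 of `memory` and a buffer of chunk_size / 2 elements.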

All sources are here.

Source: https://habr.com/ru/post/268535/

