
Sorting integers when memory is low

Author of the original (in English): Habr user dzeban

Introduction


Last time we discussed how to artificially limit the memory available to a program. As a bonus I ended up with libmemrestrict, a library of wrappers around functions like malloc for tracking memory usage, and ptrace-restrict, a ptrace-based tool that hooks the brk, sbrk and mmap calls for the same purpose.

So why bother limiting memory at all - does it come up often? When did the OOM killer last hit your app? Do you think about memory consumption while programming? Memory is cheap: if you run out, just add another couple of gigabytes.
And yet you cannot add memory indefinitely - and not only because you lack an infinite supply of it. When processing Big Data, the input simply cannot fit into an array: the data has to be distributed between RAM, storage and the network. That kind of processing calls for its own algorithms and techniques.

So I started working through such problems, beginning with a simple one: how do you sort a million integers (4 MiB of data) with only 2 MiB of memory? The problem generalizes to any case where there isn't enough memory to hold all the data.

Given


We must write a program that sorts a set of integers stored in a file. To create such files, I wrote the simple tools randints and rangeints.

The program should output a sorted array to stdout as text.

It should measure the sorting time and print it to stderr. We can't simply run the program under the time utility, because that would also count the time spent reading the file and writing the output. (A timing sketch appears after these requirements.)

It should work with at most half as much memory as the file size. For this we will use libmemrestrict or ptrace-restrict.

For some methods these tools are useless. For example, they don't work for mmap - there the memory has to be limited physically.

The solutions will be checked against the original problem (sorting 4 MiB in 2 MiB). I will also run them on a virtual machine with 128 MiB of memory to sort 500 MB (125 million four-byte integers).
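For the timing requirement, here is a sketch of how it can be done (the article's actual harness isn't shown, so the names here are illustrative):

#include <stdio.h>
#include <time.h>

/* Wall-clock timestamp in seconds. We time only the sort itself and
   report to stderr, so that stdout stays clean for the sorted data. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Usage around the sorting call:
 *
 *     double t0 = now_seconds();
 *     sort_somehow(arr, n);        // whichever method is being benchmarked
 *     fprintf(stderr, "%zu bytes sorted in %f seconds\n",
 *             n * sizeof(int), now_seconds() - t0);
 */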

Naive approach


Let's try to sort the numbers head-on and measure the memory usage (and time): simply read the file into an array of integers and call qsort.
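The massif output below mentions functions named readfile and main, so naive.c presumably looks roughly like this sketch (my reconstruction, not the author's exact code):

#include <stdio.h>
#include <stdlib.h>

/* Read the whole file into one heap-allocated array of ints:
   this alone needs as much memory as the file is large. */
static int *readfile(const char *path, size_t *n)
{
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); exit(1); }

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    int *arr = malloc(size);
    if (!arr || fread(arr, 1, size, f) != (size_t)size) {
        fprintf(stderr, "failed to read %s\n", path);
        exit(1);
    }
    fclose(f);

    *n = size / sizeof(int);
    return arr;
}

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);            /* (x - y) could overflow */
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    size_t n;
    int *arr = readfile(argv[1], &n);

    qsort(arr, n, sizeof(int), compar);  /* may grab O(n) extra memory */

    for (size_t i = 0; i < n; i++)
        printf("%d\n", arr[i]);
    free(arr);
    return 0;
}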

Let's try a file with 4 MB of data. Without limitations, everything works:

$ ./naive 4M.bin > /dev/null
4000000 bytes sorted in 0.323146 seconds


but that's not interesting. Let's restrict the memory to 2 MiB:

$ LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted
Segmentation fault


Let's raise the limit to 5 MB (libmemrestrict reads its settings from environment variables) - and fail again:

$ MR_THRESHOLD=5000000 LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted
Segmentation fault


Obviously, qsort needs more memory. Let's see how much it wants, using massif from Valgrind:

$ valgrind --tool=massif ./naive ints
$ ms_print massif.out.10676


Here is a nice graph:

[massif graph: heap usage climbs to ~4.5 MB while the file is read, then jumps to a peak of 8.819 MB during qsort; the x-axis spans ~1.72 billion instructions]


There are a couple of allocations that matter: memory usage jumps from 4 MiB to 8 MiB - the first 4 MiB is my array, and then qsort asks for 4 MiB more. The statistics:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 21    173,222,581        5,247,504        4,000,568     1,246,936            0
 22    173,223,802        5,246,920        4,000,000     1,246,920            0
 23    173,226,655        5,247,504        4,000,568     1,246,936            0
 24    173,229,202        5,246,920        4,000,000     1,246,920            0
 25    173,229,311        9,246,928        8,000,000     1,246,928            0
 26    869,058,772        9,246,928        8,000,000     1,246,928            0
86.52% (8,000,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.26% (4,000,000B) 0x400A26: readfile (in /home/avd/dev/cs/sorting/external/naive)
| ->43.26% (4,000,000B) 0x400ACD: main (in /home/avd/dev/cs/sorting/external/naive)
|
->43.26% (4,000,000B) 0x35D40383F7: qsort_r (in /usr/lib64/libc-2.18.so)
| ->43.26% (4,000,000B) 0x400B3D: main (in /home/avd/dev/cs/sorting/external/naive)
|
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)


4 million bytes are used by me, another 4 million by qsort_r, and massif reports a further 1.2 MB of extra-heap overhead.

Apparently, qsort's space complexity here is O(n), which is strange: qsort works in place and uses various optimization techniques that should guarantee O(log n) space complexity. See the additional reading on the glibc qsort implementation.

Well, can it sort 500 MB in 128 MiB of RAM?

$ ./naive 500M.bin > /dev/null
Segmentation fault


Of course not. As for speed:

$ ./naive 4M.bin > /dev/null
4000000 bytes sorted in 0.322712 seconds


So: naive sorting works well without restrictions, doesn't work under restrictions at all, and qsort wants O(n) memory. Oddly, if you limit the memory to 5.3 MB, it works and no longer requires O(n) memory. I'm still figuring this out. As far as I can tell, glibc's qsort actually performs a merge sort with an O(n) temporary buffer when it believes enough memory is available, and only falls back to in-place quicksort when that allocation fails - which would explain both observations.

File and mmap


mmap is a hackish way to sort a large amount of data under tight memory: it shifts the burden of juggling data between memory and swap onto the kernel's shoulders.

It works like this:

- open the input file and mmap it into virtual memory;
- the file contents are now accessible through a pointer, like an ordinary array of integers;
- call qsort on that pointer, just as before.
And that's it! No more running out of memory, even when sorting a file larger than the available memory. To understand why this works, you need to know a little about memory management in the OS.
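Here is a minimal sketch of this approach (my illustration, not the author's mmaped.c; MAP_PRIVATE keeps the input file unmodified, so dirty sorted pages become anonymous memory that can go to swap):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file. Nothing is read from disk yet: pages are
       faulted in lazily as qsort touches them. */
    int *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* The mapping looks like an ordinary array, so qsort just works;
       the kernel balances pages between RAM and swap behind our back. */
    qsort(data, st.st_size / sizeof(int), sizeof(int), compar);

    for (size_t i = 0; i < st.st_size / sizeof(int); i++)
        printf("%d\n", data[i]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}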

Each program is represented by a process with its own private virtual address space, isolated from the others. Its size is limited by the width of the CPU's address bus: for a 32-bit CPU it is 2^32, i.e. 4 GiB.

All memory allocation a process does happens in virtual memory. The kernel's memory subsystem, together with the MMU, maps this virtual memory onto physical memory, and usually does so lazily: when a process asks for memory, the kernel grants it immediately but does not back it physically right away - the virtual page is not yet mapped onto a physical one. When such a page is first accessed (for a write, say), the MMU raises a page fault exception, which the kernel handles by mapping the virtual page onto a physical one. From then on, writes to that page are translated by the MMU into writes to a specific physical memory address.

On the other hand, since the virtual address space is limited only by the width of the address bus, you can end up in a situation where a program wants more memory than is physically available. For example, on a 32-bit system with 256 MiB of RAM, a process can happily allocate and use 1 GiB of memory. In that case pages get swapped out: instead of RAM they go to a drive, such as a hard disk. When such a page is accessed again, the kernel reads it back from the drive into memory (evicting some other page to make room).

The kernel is good at balancing data between memory and the drive, so it is natural to exploit this for our task. When we mmap our file, the kernel reserves a range of virtual addresses that are not populated immediately. When we touch them, the kernel loads the data from the input file into memory. When physical memory runs out, the kernel evicts pages to swap. This way the data is balanced between the file on disk, memory and swap.

The only limits are the virtual address space (4 GiB on a 32-bit system, 256 TiB on a 64-bit one) and swap space - and drives are inexpensive these days.

Because mmap places the entire file in virtual memory, we can't use libmemrestrict or ptrace-restrict here - they limit exactly that virtual memory. Trying to sort 100M of data with virtual memory limited to 10M, we get an error from mmap:

$ qemu-x86_64 -R 10M ./mmaped 100M.bin
mmap stack: Cannot allocate memory


No wonder: the file is mapped into virtual memory, and the kernel distributes it between physical memory and swap. So we need at least 100M of virtual memory, plus some extra room for qemu.

Therefore, for this method, I use a virtual machine with 128 MiB of memory. Here is my sorting program using mmap. And it works!

$ free -m
             total       used       free     shared    buffers     cached
Mem:           119         42         76          0          4         16
-/+ buffers/cache:          21         97
Swap:          382          0        382
$ ll -h 500M.bin
-rw-r--r-- 1 root root 477M Feb  3 05:39 500M.bin
$ ./mmaped 500M.bin > /dev/null
500000000 bytes sorted in 32.250000 seconds


Information from top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3167 root      20   0  480m  90m  90m R 84.6 76.4  1:18.65 mmaped


We use 500 MB of virtual memory, while actual resident memory is 90 MiB. Note that MiB is 2^20 bytes, while MB is 10^6 = 1 million bytes: 500 MB = 500,000,000 bytes, and 500,000,000 >> 20 ≈ 476 MiB.

Looking at the details from vmstat while sorting 500 MB, we will see how the kernel balances between swap, disk cache, buffers and free memory:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0  77776   2120  15104    1   27   710   971   24   34  3  1 95  1
 1  1      0   2060    488  90068    1   27   785  1057   25   37  3  1 95  1
 1  0    928   3400     60  89744    1   27   799  1092   25   38  3  1 94  1
 0  2   1908   1928    212  92040    1   27   852  1174   26   40  4  1 94  1
 0  2   3436   2360    280  93056    1   27   911  1282   28   42  4  1 94  2
 0  0   5272   3688    196  94688    1   27  1066  1471   31   48  4  1 93  2
 0  0   5272   3720    204  94700    1   27  1064  1469   31   48  4  1 93  2


At first we had ~70 MiB of free memory, an empty swap, and a little memory allocated to I/O buffers and cache. Then, after mmap, all that memory went to the cache. When free memory ran out, the kernel started using swap, which grows along with the I/O load. We end up with almost all memory given over to the disk cache - which is normal, since page-cache pages are the first to be evicted when an application needs memory.

So, sorting via mmap is a neat hack that requires a basic understanding of memory management, and gives a simple solution for processing lots of data in little memory.

External merge sort


Suppose mmap cannot be used - say, you want to sort a 5 GiB file on a 32-bit system.

There is a well-known, popular method called external merge sort. If memory isn't big enough, use an external drive - a hard disk, for example. We will just have to work with the data piece by piece, since it doesn't all fit in memory.

External merge sort works like this (a condensed sketch follows the list):

- split the data into chunks the size of the available memory buffer;
- read each chunk into the buffer, sort it in memory and write the sorted chunk to its own temporary file;
- merge the sorted chunks: repeatedly take the smallest of the chunks' current head elements and send it to the output.
I wrote a simple, non-optimized implementation along these lines:

$ LD_PRELOAD=./libmemrestrict.so ./external-merge 4M.bin 1048576 > /dev/null
4194304 bytes sorted in 0.383171 seconds


It sorted within 2 MiB of memory, using a 1 MiB buffer.

Now let's sort 500 MB. First, turn off swap - we manage the paging of data chunks manually now:

 $ swapoff /dev/vda5 


Flush the caches:

 $ echo 3 > /proc/sys/vm/drop_caches 


Check available memory:

$ free -m
             total       used       free     shared    buffers     cached
Mem:           119         28         90          0          0          6
-/+ buffers/cache:          21         97
Swap:            0          0          0


We will use a buffer of 50 MB - 10 times smaller than the file size.

$ ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 120.115180 seconds


Two minutes! Why? I/O operations, of course. Three things slow the process down. In the split phase we read the file sequentially through a small buffer. In the merge phase we constantly open and close the chunk files. And there is the output: in the merge phase we write the whole 500 MB of data to stdout, which, despite the redirection to /dev/null, still costs us. Disabling the output gives a 25% performance boost:

$ ./external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 87.140000 seconds


Memory usage looks fine to me. Running the program under massif shows that peak consumption is the buffer size plus a small heap overhead.

--------------------------------------------------------------------------------
Command:            ./external-merge 500M.bin 50000000
Massif arguments:   (none)
ms_print arguments: massif.out.17423
--------------------------------------------------------------------------------
[massif graph: heap usage rises immediately to ~47.75 MB (the 50 MB buffer) and stays flat for the entire run]
Number of snapshots: 98
 Detailed snapshots: [4 (peak), 10, 20, 29, 32, 35, 38, 45, 48, 54, 64, 74, 84, 94]
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1        119,690              584              568            16            0
  2        123,141       50,004,496       50,000,568         3,928            0
  3      4,814,014       50,005,080       50,001,136         3,944            0
  4      4,817,234       50,005,080       50,001,136         3,944            0
99.99% (50,001,136B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.99% (50,000,000B) 0x400FA2: external_merge_sort (in /root/external-merge)
| ->99.99% (50,000,000B) 0x40128E: main (in /root/external-merge)
|
->00.00% (1,136B) in 1+ places, all below ms_print's threshold (01.00%)


We can limit memory to 50 MB plus a little extra for the temporary strings holding the chunk file paths:

$ LD_PRELOAD=./libmemrestrict.so MR_THRESHOLD=51000000 ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 87.900000 seconds


So memory-wise we're fine, but speed-wise we're not: mmap managed this operation in 32 seconds. Let's improve our approach.

Let's profile the program with gprof. Build an instrumented binary:

 $ gcc -pg -g -Wall -Wextra external-merge.c -o external-merge-gprof 


Then run the program repeatedly to accumulate statistics, using my handy script from the article on gprof. Here is the result:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 81.98    432.87   432.87                             compar
 18.17    528.82    95.95                             print_arr
  0.00    528.82     0.00     1100     0.00     0.00  form_filename
  0.00    528.82     0.00      100     0.00     0.00  merge
  0.00    528.82     0.00      100     0.00     0.00  save_buf
  0.00    528.82     0.00       10     0.00     0.00  external_merge_sort
  0.00    528.82     0.00       10     0.00     0.00  split


Most of the time goes to sorting and output. But remember that gprof does not show time spent in system calls and I/O.

What can be improved?

- tune the way the standard algorithm does its I/O, and try parallelizing it;
- replace the generic comparison sort with an algorithm specialized for our data.
Generic external merge sort is a simple idea for sorting big data in a small amount of memory, but without improvements it is slow.

Tuning the sort


You could, of course, use multithreading for the split and merge phases - but that is a bad idea (more on this below). Using it in the split phase makes no sense, because the data sits in a single buffer. Instead, we can try to influence how the kernel reads the data. There are two functions for this:

- readahead (Linux-specific);
- posix_fadvise, e.g. with the POSIX_FADV_SEQUENTIAL hint.
They tell the memory-management subsystem how we are going to read the data. In our case the reading is sequential, so it would be good to have the file contents prefetched into the page cache.
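For illustration, the hints might look like this (a sketch; readahead is Linux-specific, and the offsets and lengths depend on the reading strategy):

#define _GNU_SOURCE      /* for readahead() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Tell the kernel we will read this range of the file sequentially,
   and ask it to pull the range into the page cache right away. */
static void hint_sequential(int fd, off_t offset, size_t len)
{
    /* POSIX hint: sequential access, so do more aggressive readahead. */
    int err = posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
    if (err != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    /* Linux-specific: populate the page cache for the range now. */
    if (readahead(fd, offset, len) < 0)
        perror("readahead");
}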

In the merge phase we can avoid the constant opening and closing of files by creating a dedicated thread for each chunk file. Each thread keeps its file open and keeps a buffer filled from it, and readahead then works for each file.

Here is my improved multithreaded version of external merge sort. Well, as I said, multithreading is not a great idea here: on a single-core machine there is no difference.

$ ./mt-ext-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 117.380000 seconds


This is with data output. And without output:

$ ./mt-ext-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 91.040000 seconds


Well, let's try on a quad-core machine (Intel Core i7-3612QM CPU @ 2.10GHz):

$ ./naive 500M.bin > /dev/null
500000000 bytes sorted in 23.040499 seconds
$ ./mmaped 500M.bin > /dev/null
500000000 bytes sorted in 23.542076 seconds
$ ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 39.228695 seconds
$ ./mt-external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 41.062793 seconds
$ ./external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 28.893745 seconds
$ ./mt-external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 28.368976 seconds

And with a buffer ten times smaller:

$ ./external-merge-no-output 500M.bin 5000000 > /dev/null
500000000 bytes sorted in 27.107728 seconds
$ ./mt-external-merge-no-output 500M.bin 5000000 > /dev/null
500000000 bytes sorted in 28.558468 seconds


There is no difference between external-merge and mt-external-merge. Why? Because multithreading does not solve I/O-bound problems. It pays off when:

- the task is CPU-bound rather than I/O-bound;
- the threads can work independently, without waiting on one another.
Our threads are interdependent: the main thread has to wait for a buffer to be sorted before it can start the next read from the file. And reading is faster than sorting the buffer, so most of the time the threads just sit waiting.

We need other ways to improve the algorithm.

Special sorting algorithms


Let's try something other than quicksort. Since we know we are sorting integers, we should exploit that. There are specific algorithms for specific data types, and they can be divided into two groups:

- sorts based on counting (counting sort and its variations);
- sorts based on digits or addresses (radix sort, bucket sort).
They beat O(n log n) - the lower bound for comparison sorts like quicksort. But not all of them cope with memory limits, so I settled on counting sort.

Counting sort


If the data is large but its range of values is small, counting sort is the tool. The idea is simple: instead of the data, we keep an array of counters in memory. We read the data sequentially and increment the corresponding counters. The algorithm is linear in time, and its space usage is proportional to the range of the data.

A simple implementation works with an array covering values from 0 to N; integers map directly to array indices. Here is my version, which works well without any tuning. The second argument is the buffer size in elements; buffering speeds things up considerably, because the program no longer reads the file 4 bytes at a time.
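A sketch of the approach (my illustration, not the author's exact code; the hardcoded range and the zero-based value interval are assumptions):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: %s file bufsize\n", argv[0]); return 1; }

    size_t range = 1000000;        /* assumed: values lie in [0, range) */
    size_t bufn  = strtoul(argv[2], NULL, 10);   /* buffer, in elements */
    size_t *count = calloc(range, sizeof(size_t));
    int *buf = malloc(bufn * sizeof(int));
    FILE *in = fopen(argv[1], "rb");
    size_t n;

    /* Counting pass: one counter per possible value, no data stored. */
    while ((n = fread(buf, sizeof(int), bufn, in)) > 0)
        for (size_t i = 0; i < n; i++)
            if (buf[i] >= 0 && (size_t)buf[i] < range)
                count[buf[i]]++;   /* guard against out-of-range values */
    fclose(in);

    /* Output pass: the counters, read in order, *are* the sorted data. */
    for (size_t v = 0; v < range; v++)
        for (size_t c = 0; c < count[v]; c++)
            printf("%zu\n", v);

    free(buf);
    free(count);
    return 0;
}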

$ ./counting-array 500M-range.bin 1000000 > /dev/null
Range is 1000000
500000000 bytes sorted in 3.240000 seconds


Yep. Half a gigabyte of data sorted in just over 3 seconds, on 128 MiB of memory and a single CPU. Compare that to qsort over mmap:

$ ./mmaped 500M-range.bin > /dev/null
500000000 bytes sorted in 76.150000 seconds


23 times faster!

But keep the restrictions in mind: it works only for integers (or things that map to them) from a small, contiguous range of values. I tried a variant for non-contiguous ranges using hashing and binary search, but its performance was terrible.

And if we can assume our numbers are unique, the counters only ever have two states - present or absent - so a single bit each suffices, and the array shrinks dramatically. In fact we don't need an array of counters at all: we can store the numbers themselves as bits, turning the array into a bit vector. We read the file and set bit N whenever the number N occurs. Then we simply walk the vector and output the numbers whose bits are set.
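A sketch of this bitmap variant (valid only under the uniqueness assumption; the range is again an assumption, and buffered reading is omitted for brevity):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    size_t range = 1000000;      /* assumed: unique values in [0, range) */
    unsigned char *bits = calloc((range + 7) / 8, 1);
    FILE *in = fopen(argv[1], "rb");
    int v;

    /* Set bit N for every number N found in the input. */
    while (fread(&v, sizeof(v), 1, in) == 1)
        if (v >= 0 && (size_t)v < range)
            bits[v / 8] |= 1u << (v % 8);
    fclose(in);

    /* Walk the bitmap in order; the set bits give the sorted output. */
    for (size_t i = 0; i < range; i++)
        if (bits[i / 8] & (1u << (i % 8)))
            printf("%zu\n", i);

    free(bits);
    return 0;
}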

Such solutions need care, since the limits can still be exceeded. For example, to sort all the numbers of the full 32-bit integer range (2^32), you need 1 bit per number: 4,294,967,296 bits = 536,870,912 bytes = 512 MiB. We only have 128 MiB, so that's not enough here. But in some cases the gain is enormous - see the story on this topic in Jon Bentley's "Programming Pearls".

Knowing your data is very helpful.

Summary


Over the five months spent on this article I did a lot: a dozen programs, some good ideas, many bad ones. And there is still plenty left to do and to fix.

The simple problem of sorting data without enough memory exposed a whole set of things we usually don't think about:

- a generic qsort can demand O(n) extra memory;
- virtual memory and mmap let the kernel balance data between RAM, disk and swap for you;
- multithreading is useless when the bottleneck is I/O;
- knowing your data opens the way to algorithms dramatically faster than generic sorting.
Summary table:

Test               | Naive QuickSort | mmap and quicksort | External merge sort | Multithreaded external merge sort | Counting sort
4 MiB in 2 MiB     | Segfault        | N/A                | 0.38s               | 0.41s                             | 0.01s
500 MB in 128 MiB  | Segfault        | 32.25s             | 87.14s              | 91.04s                            | 3.24s


Know your data, and develop a simple algorithm to work with it!



Source: https://habr.com/ru/post/266557/

