To solve one of the assignments from the Stanford cryptography course, I needed a Word64 -> Integer lookup table, with several million rounds of checking whether an element is present and adding a new one. The obvious solution: hash tables. Hoogle suggested Data.HashTable, the program was written, it worked, and that could have been the end of it, but I wanted to practice profiling and optimization.
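The post does not include the original source, so here is only a minimal sketch of what that first version could look like, using the legacy Data.HashTable module that used to ship with base; the keys, values and sizes are placeholders, not the actual course data.

    import qualified Data.HashTable as H   -- legacy module from old versions of base
    import Data.Word (Word64)

    main :: IO ()
    main = do
        -- new takes an equality test and a hash function returning an Int32
        t <- H.new (==) (H.hashInt . fromIntegral)
                 :: IO (H.HashTable Word64 Integer)
        mapM_ (\k -> H.insert t k (fromIntegral k)) [1 .. 1000000 :: Word64]
        present <- H.lookup t 42           -- membership check: Maybe Integer
        print present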
Running with +RTS -sstderr came as a mild shock: almost half the time was spent on garbage collection.
13,902,917,208 bytes allocated in the heap
3,275,041,156 bytes copied during GC
MUT time 26.69s (40.10s elapsed)
GC time 27.84s (32.47s elapsed)
% GC time 51.1% (44.7% elapsed)
The heap profile and collective wisdom explained what was going on: the hash table's data is interleaved on the heap with the intermediate results of the computation, which distresses the GC, and growing and copying the internal tables as they fill up finishes performance off. Optimizing the computation code to reduce the amount of intermediate data gained almost nothing; the hash table itself had to be replaced.
To begin with, Data.HashTable was replaced with the trio from the hashtables package: Basic, Cuckoo and Linear. The results were somewhat surprising: with Basic and Cuckoo the total run time barely changed, yet GC now took up to 70% of it; my own code ran faster, and GC ate the rest of the time. Significantly less memory was copied, though.
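The swap itself is tiny; a minimal sketch, assuming the hashtables package's Data.HashTable.IO interface and a Hashable instance for the key (the Table alias and demo are mine, not the course code).

    import qualified Data.HashTable.IO as H
    import Data.Word (Word64)

    -- Switching between the three implementations is just a change of alias.
    type Table = H.CuckooHashTable Word64 Integer
    -- type Table = H.BasicHashTable  Word64 Integer
    -- type Table = H.LinearHashTable Word64 Integer

    demo :: IO ()
    demo = do
        t <- H.new :: IO Table
        H.insert t 42 1
        v <- H.lookup t 42               -- Maybe Integer
        print v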
Basic:
12,781,006,960 bytes allocated in the heap
408,935,780 bytes copied during GC
MUT time 15.00s (15.16s elapsed)
GC time 43.29s (43.79s elapsed)
% GC time 74.3% (74.3% elapsed)
Cuckoo:
11,611,257,712 bytes allocated in the heap
419,308,168 bytes copied during GC
MUT time 17.31s (17.65s elapsed)
GC time 38.08s (38.42s elapsed)
% GC time 68.7% (68.5% elapsed)
The linear table is, as expected, designed for entirely different scenarios and drops out of the competition right away:
12,547,294,380 bytes allocated in the heap
17,997,812,764 bytes copied during GC
MUT time 26.65s (27.16s elapsed)
GC time 123.72s (124.30s elapsed)
% GC time 82.3% (82.1% elapsed)
A few more optimizations of the computation code, plus specifying the right table size up front, and CuckooHashTable confidently took the lead:
3,835,164,148 bytes allocated in the heap
179,941,588 bytes copied during GC
MUT time 7.84s (7.97s elapsed)
GC time 20.05s (20.18s elapsed)
% GC time 71.9% (71.7% elapsed)
Total run time has already been cut in half, but 70% of it on GC? Clearly something more fundamental has to change.
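The pre-sizing mentioned above is a one-line change in that setup; a sketch, with the expected element count invented for illustration:

    import qualified Data.HashTable.IO as H
    import Data.Word (Word64)

    type Table = H.CuckooHashTable Word64 Integer

    -- newSized allocates the internal arrays for the expected number of
    -- elements up front, so the table does not have to grow and re-copy itself.
    newTable :: IO Table
    newTable = H.newSized (4 * 1024 * 1024)   -- expected size: an assumption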
Data.Judy is a wrapper around the C implementation of judy arrays; its data is not subject to the GC at all, and it drastically changes the rules of the game:
5,366,256,252 bytes allocated in the heap
3,725,456 bytes copied during GC
MUT time 8.99s (9.12s elapsed)
GC time 1.44s (1.45s elapsed)
% GC time 13.4% (13.3% elapsed)
The heap profile shows that the amount of allocated memory does not change at all: a flat horizontal line, because all the data lives outside the GC's reach.
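A minimal sketch of the basic Data.Judy calls, assuming the judy package's interface; keys are machine Words, and I store Word values here too, leaving out the array-of-arrays workaround for Word64 keys mentioned below.

    import qualified Data.Judy as J

    demo :: IO ()
    demo = do
        j <- J.new :: IO (J.JudyL Word)   -- the judy array lives outside the GC heap
        J.insert 42 1 j
        v <- J.lookup 42 j                -- Maybe Word
        print v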
Three disadvantages:
- you need to patch the package's -fvia-C flag, otherwise completely unintelligible link errors show up;
- the mapping is strictly Word to Word, so for Word64 keys you have to build an array of arrays. There is an example of storing a ByteString in the sources, but it is still extra work;
- and finally, I could just as well write all of this directly in C! Are we writing Haskell here or what?
What is left? Data.Map, of course. No IO, no C, faithful Haskell code that even looks prettier. The results came as a surprise:
5,569,024,484 bytes allocated in the heap
4,241,345,748 bytes copied during GC
MUT time 18.55s (18.77s elapsed)
GC time 12.66s (13.07s elapsed)
% GC time 40.6% (41.1% elapsed)
Only about 10% slower than CuckooHashTable, and no comparison at all with Data.HashTable. A ready-made illustration of "what is there to think about, just shake it": why all the dancing around, when you can write completely straightforward code and the loss is barely noticeable?
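A sketch of the pure Data.Map version; the names Table, addKey, hasKey and build are illustrative, not the actual course code.

    import qualified Data.Map as M
    import Data.List (foldl')
    import Data.Word (Word64)

    type Table = M.Map Word64 Integer

    addKey :: Word64 -> Integer -> Table -> Table
    addKey = M.insertWith (+)        -- combine on collision; adjust to the task

    hasKey :: Word64 -> Table -> Bool
    hasKey = M.member

    -- The table is threaded through a fold instead of living in IO.
    build :: [Word64] -> Table
    build = foldl' (\m k -> addKey k 1 m) M.empty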
And the final chord: unordered-containers with Data.HashMap.Strict, a drop-in replacement for Data.Map: just change the import and the type of the structure, everything else is compatible.
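Applied to the sketch above, the drop-in swap really is just the import and the type; everything else compiles unchanged.

    import qualified Data.HashMap.Strict as M   -- was: import qualified Data.Map as M
    import Data.List (foldl')
    import Data.Word (Word64)

    type Table = M.HashMap Word64 Integer       -- was: M.Map Word64 Integer

    addKey :: Word64 -> Integer -> Table -> Table
    addKey = M.insertWith (+)

    hasKey :: Word64 -> Table -> Bool
    hasKey = M.member

    build :: [Word64] -> Table
    build = foldl' (\m k -> addKey k 1 m) M.empty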
5,688,939,936 bytes allocated in the heap
3,659,088,964 bytes copied during GC
227,864,212 bytes maximum residency (23 sample(s))
MUT time 10.16s (10.75s elapsed)
GC time 8.93s (9.61s elapsed)
% GC time 46.8% (47.2% elapsed)
Too much is being copied, and the profile clearly shows the table being re-allocated. Let's hand it more memory up front: +RTS -A250M.
5,663,421,460 bytes allocated in the heap
887,689,168 bytes copied during GC
MUT time 11.40s (11.70s elapsed)
GC time 3.61s (3.83s elapsed)
% GC time 24.1% (24.7% elapsed)
Ta-da! It doesn't quite catch judy, but every other option is left behind. And as a free bonus, the same perks as Data.Map: no need to write your own key-to-hash conversion as with hashtables, no need to shove everything into a monad, and the code is half the length.
And the traditional conclusion: beware premature optimization. The straightforward, head-on solution may well turn out faster than the rest.