📜 ⬆️ ⬇️

Compressing files in a multi-core era

We did a daily big backup for Stack Overflow , so again I played around a bit with file compression.

Our server has the latest 64-bit version 7zip (4.64). I believe that a dual-core processor is enough for a desktop system. But everything is simple with servers: the more cores, the merrier! The server has two quad-core processors - only eight cores, so I was a little disappointed to find that neither RAR nor 7zip use more than two cores.

But, even using only two cores for compression, the 7zip algorithm is incredibly efficient, and recently it has become quite fast. I used to recommend using RAR instead of Zip. Given the free, unlike RAR, and the effectiveness of 7zip, now it is logical to choose it.

Here are a couple of my tests for compressing a 4.76 GB backup database file (using a server with two 2.5 GHz Xeon E5420 quad-core processors on board):
7zipfastest5 min14 MB / sec973 MB
7zipfast7 min11 MB / sec926 MB
7zipnormal34 min2.5 MB / sec752 MB
7zipmaximum41 min2.0 MB / sec714 MB
7zipultra48 min1.7 MB / sec698 MB
If you think that, since 7zip shows good results with maximum and ultra compression ratios, then at ultra-plus the result will be incredible and will disappoint you. There are certain reasons why vendors of archivers set the normal compression ratio by default. If the compression ratio is reduced, the size of the output archive increases dramatically, while an increase in the compression ratio will lead to a relatively small decrease in the size of the output archive instead of a large increase in the compression time.
')
And now let's see what happens if I switch 7zip to use bzip2 :
7zip with bzip2 selected

We will compress the same file (4.76 GB) on the same machine:
bzip2fastest2 min36 MB / sec1092 MB
bzip2fast2.5 min29 MB / sec1011 MB
bzip2normal3.5 min22 MB / sec989 MB
bzip2maximum7 min12 MB / sec987 MB
bzip2ultra21 min4 MB / sec986 MB
Why is bzip2 so much faster than 7zip? It's simple:
CPU usage when using 7zip
7zip multithreaded cpu usage
CPU usage when using bzip2
bzip2-multithreaded-cpu-usage.png

Bzip2 uses more than two cores for parallelization. I do not know how many kernels it can use, but in the drop-down menu, 7zip offers no more than sixteen cores for bzip2. Our server has eight cores, so I chose so much during the test.

Unfortunately, the increase in the speed of bzip2 is meaningless with a high degree of compression - the difference between normal, maximum, and ultra is a measly 0.06%. The algorithm scales well in time, but practically does not scale to the size of the output file. This is very strange, because it would be to reduce the size I would like to spend the time won by parallelization.

But even a minimal size reduction can make sense, depending on the circumstances:

= + n * ( / + )

For example, if you compress a file to send it over the network, n = 1, so the compression time has a noticeable effect on the total time. If you want to put the file on the Internet, n is large, so a long compression time will have little effect on the total time.
After all, slow networks work well with slow but effective algorithms, while fast networks need fast, but perhaps less efficient algorithms.

On the other hand, the ability to compress a five-gigabyte file five times in two minutes is impressive. Therefore, the idea that how quickly the 7zip algorithm would work if it was rewritten and parallelized to work more than on two cores did not leave me at all.

Source: https://habr.com/ru/post/53217/


All Articles