
Data compression

I have been interested in this question for a long time. The idea of saving memory and disk space has been in the air for ages and probably always will be. My first attempt to write a compression algorithm, similar to RLE, was in 2000; a few more original compression concepts followed, but none of them went further than preliminary tests. In 2005 I took up the idea of compression seriously. Six years later, I am sharing some excerpts from that research.

To make this article easier to follow, let us introduce some terminology:

The number of unique words in a block is the number of distinct word values that occur in it; a block here consists of 2^N words of N bits each. Example, four blocks with N = 3 (eight words per block):
1 1 1 1 1 1 1 1   (1 unique word)
8 8 8 8 8 8 8 8   (1 unique word)
1 8 2 5 1 5 1 2   (4 unique words)
4 6 7 3 3 3 3 3   (4 unique words)

Counting unique words in blocks with N = 3
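
As a minimal Python sketch of this definition (the function name is mine, not the author's), counting unique words reduces to counting distinct values:

    def unique_words(block):
        """Number of distinct word values in a block."""
        return len(set(block))

    print(unique_words([1, 1, 1, 1, 1, 1, 1, 1]))  # 1
    print(unique_words([1, 8, 2, 5, 1, 5, 1, 2]))  # 4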

Going through all possible blocks and tallying how many of them contain exactly k unique words, we get a block distribution table. Example:
Unique words   Blocks
1              4
2              84
3              144
4              24

256 = 2^8 = 2^(N*2^N) possible blocks in total

Block distribution table for N = 2
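
For small N this table can be produced by brute force over all 2^(N*2^N) blocks; a Python sketch (illustrative, not the author's code):

    from collections import Counter
    from itertools import product

    def brute_force_table(n):
        """Tally unique-word counts over all blocks of 2**n n-bit words."""
        counts = Counter(len(set(block))
                         for block in product(range(2 ** n), repeat=2 ** n))
        return dict(sorted(counts.items()))

    print(brute_force_table(2))  # {1: 4, 2: 84, 3: 144, 4: 24}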

Unique words   Blocks
01             16
02             7864080
03             23996064960
04             7504175995680
05             574579238688000
06             15768930151054080
07             189225474428390400
08             1111400775560275200
09             3407360398041600000
10             5630409646557696000
11             5045340384103219200
12             2403608358767616000
13             577538743541760000
14             62977597562880000
15             2510734786560000
16             20922789888000

18446744073709551616 = 2^64 = 2^(N*2^N) possible blocks in total

Block distribution table for N = 4

For N = 4 and above, the distribution table can no longer be obtained by brute force in a few seconds: there are already 2^64 blocks at N = 4. It can, however, be computed combinatorially, as in the sketch below.
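
Assuming a block is 2^N words of N bits, as the totals of the form 2^(N*2^N) suggest, the number of blocks with exactly k unique words is C(2^N, k) choices of values times the number of surjections of the 2^N positions onto those k values. A Python sketch of this (my formulation, not from the source):

    from math import comb

    def distribution_table(n):
        """Blocks of 2**n n-bit words that contain exactly k unique words."""
        words = 2 ** n
        table = {}
        for k in range(1, words + 1):
            # inclusion-exclusion count of surjections of 2**n positions onto k values
            surj = sum((-1) ** j * comb(k, j) * (k - j) ** words
                       for j in range(k + 1))
            table[k] = comb(words, k) * surj
        return table

    for k, count in distribution_table(4).items():
        print(k, count)

For N = 4 this reproduces the table above in a fraction of a second.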

The first part of the algorithm analyzes the input stream, creating the Table of occurrences. The incoming stream is divided into blocks, and each block is checked for its number of unique words. That number is used as a row index into the Table of occurrences, and the value in that row is incremented by one. The resulting table reveals some properties of the data. A couple of examples (a sketch of this first pass follows them):
Different types of files
(plot built over the table of occurrences with N = 8)

Different degrees of compression aggressiveness
(plot built over the table of occurrences with N = 8)

Different cryptographic algorithms
(plot built over the table of occurrences with N = 8)
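
A rough Python sketch of this first pass, assuming N = 8 so that a word is one byte and a block is 2^8 = 256 bytes (a trailing partial block is ignored; names are illustrative):

    def occurrence_table(data, n=8):
        """Tally how many blocks of the stream have each unique-word count."""
        block_size = 2 ** n              # assumes n = 8: one word = one byte
        table = [0] * (block_size + 1)   # row k = blocks with k unique words
        for i in range(0, len(data) - block_size + 1, block_size):
            table[len(set(data[i:i + block_size]))] += 1
        return table

    rows = occurrence_table(open("input.bin", "rb").read())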

Analyzing data streams this way makes it clear that even compressed and encrypted data leave unused rows, rows into which not a single block has fallen. Because blocks are classified by their number of unique words, the number of block values that can never occur can be read off the distribution table.

In each row of the distribution table, a zero is written wherever the corresponding row of the table of occurrences is zero; rows whose occurrence count is non-zero are left untouched. Summing the modified distribution table gives the number of block values actually used. The incoming stream is then re-encoded in a number system whose base is that sum, and since this base is smaller than the total number of possible blocks, the data is compressed. To invert the transformation, the decoder needs a small dictionary, N bits in size, recording which rows of the distribution table are used. The compression is not efficient, but it proves that it is possible to further reduce the size of data already processed by classical compression and encryption algorithms. A sketch of the re-encoding step:
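
A minimal Python sketch of re-encoding by change of base, assuming blocks have already been mapped to integer values and the set of used values is known (all names are illustrative, not the author's):

    def encode(blocks, used):
        """Treat the block sequence as digits of one big number whose base
        is the count of block values actually used."""
        digit = {b: i for i, b in enumerate(sorted(used))}
        base = len(digit)
        number = 0
        for b in blocks:
            number = number * base + digit[b]
        return number, base

    def decode(number, base, used, count):
        """Recover the original block sequence from the encoded number."""
        ordered = sorted(used)
        blocks = []
        for _ in range(count):
            number, d = divmod(number, base)
            blocks.append(ordered[d])
        return blocks[::-1]

    blocks = [5, 0, 5, 7, 0]
    used = {0, 5, 7}                 # only 3 of the possible values occur
    n, base = encode(blocks, used)
    assert decode(n, base, used, len(blocks)) == blocks

Written out in binary, the number takes about count * log2(base) bits instead of count * log2(number of all possible blocks), which is where the gain comes from.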

Demonstration of a special algorithm

The video was created to demonstrate the possibility of improving data compression on data "without redundancy": it shows a special algorithm that deliberately creates redundancy in the source data.

P.S. The first algorithm runs too long, so I did not record a video of it. The second algorithm, shown in the video, works on the same idea as the BWT algorithm: it first prepares the data, and then the data is compressed by a classical algorithm.

Source: https://habr.com/ru/post/126314/

