I have been interested in this question for a long time; the idea of saving memory and disk space has been in the air for ages and has probably occurred to everyone. My first attempt to write a compression algorithm, similar to RLE, was in 2000; a few more original compression concepts followed, but none went further than preliminary tests. In 2005 I took up the idea of compression seriously. Six years later, I am sharing some excerpts from that research. To make the article easier to follow, I introduce some terminology:
N - a natural number
Word - N bits grouped together (a byte is an 8-bit word)
Combinations in a word - 2^N (a byte has 256 combinations)
Block (sentence) - N * 2^N bits (word size times the number of word combinations), i.e. 2^N words
Combinations in a block - 2^(N * 2^N)
Unique words in a block - at most 2^N
Distribution table - 2^N rows (row k holds the number of block combinations with exactly k unique words)
Table of occurrences - 2^N rows (row k holds the number of blocks encountered with exactly k unique words)
Incoming stream - a finite sequence of bytes (the piece of data being processed)
Unique words in a block - the number of distinct word values occurring in the block. Example: 1 1 1 1 1 1 1 1 8 8 8 8 8 8 8 8 1 8 2 5 1 5 1 2 4 6 7 3 3 3 3 3
Finding the unique words in the blocks with N = 3: the stream splits into four blocks of 2^3 = 8 words, which contain 1, 1, 4 and 4 unique words respectively, as the sketch below shows.
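This count is easy to reproduce; here is a minimal Python sketch (my own illustration; the text writes the eight possible 3-bit word values as 1 through 8):

```python
# Example stream from the text: 3-bit words written as decimal values 1..8.
stream = [1, 1, 1, 1, 1, 1, 1, 1,
          8, 8, 8, 8, 8, 8, 8, 8,
          1, 8, 2, 5, 1, 5, 1, 2,
          4, 6, 7, 3, 3, 3, 3, 3]

N = 3
block_len = 2 ** N  # 8 words per block

# Count the distinct word values in each block.
for i in range(0, len(stream), block_len):
    block = stream[i:i + block_len]
    print(block, "->", len(set(block)), "unique words")  # prints 1, 1, 4, 4
```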
After going through all the combinations of a block, we get the distribution table. Example for N = 2 (256 block combinations in total; left column is the number of unique words, right column the number of block combinations):
1 - 4
2 - 84
3 - 144
4 - 24
With N = 4 or more, the distribution table can no longer be obtained by brute force in a few seconds: for N = 4 there are already 2^64 block combinations to enumerate.
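For small N, though, brute force is trivial; this sketch (my own illustration) reproduces the N = 2 table above in well under a second:

```python
from collections import Counter

N = 2
block_len = 2 ** N            # 4 words per block
mask = 2 ** N - 1             # low N bits

distribution = Counter()
for value in range(2 ** (N * block_len)):   # all 256 block values
    # Unpack the block value into 2^N words of N bits each.
    words = [(value >> (N * i)) & mask for i in range(block_len)]
    distribution[len(set(words))] += 1

for unique, count in sorted(distribution.items()):
    print(unique, "-", count)  # 1 - 4, 2 - 84, 3 - 144, 4 - 24
```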
The first part of the algorithm analyzes the input stream and builds the table of occurrences. The incoming stream is divided into blocks, and each block is checked for its number of unique words. That number is taken as the row index in the table of occurrences, and the value in that row is incremented by one. A sketch of this pass is given below, followed by a couple of example plots in which some properties of the resulting table are already visible.
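A minimal sketch of this first pass, assuming N = 8 so that a word is one byte and a block is 256 bytes; the original does not say how a trailing partial block is handled, so this sketch simply skips it:

```python
from collections import Counter

def occurrence_table(data: bytes) -> Counter:
    """First pass with N = 8: a word is one byte, a block is 2^8 = 256 bytes.
    Row k counts the blocks that contain exactly k distinct byte values."""
    block_len = 256
    table = Counter()
    for i in range(0, len(data) - block_len + 1, block_len):
        block = data[i:i + block_len]
        table[len(set(block))] += 1  # row index = number of unique words
    return table

# Example usage (the file name is a placeholder):
# with open("sample.bin", "rb") as f:
#     print(sorted(occurrence_table(f.read()).items()))
```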
[Three plots built from the table of occurrences with N = 8; the test files can be downloaded here.]
Analyzing data streams in this way shows that even on compressed and encrypted data there are unused rows that received no blocks at all. Since the blocks are classified by their number of unique words, the number of block combinations that can never occur can be read off the distribution table.
In the distribution table, a zero is written into every row whose counterpart in the table of occurrences is zero; rows with a non-zero occurrence count are left untouched. Summing the modified distribution table gives the number of block values actually in use. The incoming stream is then re-encoded as a number whose base equals that sum, and this change of base is where the compression comes from. To invert the transformation, a small dictionary of 2^N bits is needed (one bit per row of the distribution table) recording which rows are used. The compression is not dramatic, but it proves that data already compressed or encrypted with classical algorithms can be reduced further.
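To make the encoding step concrete, here is an end-to-end sketch for N = 2; this is my own reconstruction of the scheme as described above, not the author's code, and all names in it are mine. It finds the used rows, keeps only the block values that fall into them, and re-encodes the stream as one large number in a base equal to the count of surviving block values:

```python
N = 2
BLOCK_LEN = 2 ** N  # 4 words per block, so one block is exactly one byte

def unique_words(value: int) -> int:
    """Number of distinct 2-bit words in an 8-bit block value."""
    words = [(value >> (N * i)) & (2 ** N - 1) for i in range(BLOCK_LEN)]
    return len(set(words))

def encode(data: bytes):
    # Rows of the table of occurrences that this stream actually uses.
    used_rows = {unique_words(b) for b in data}
    # Every block value falling into a used row stays a possible symbol.
    allowed = sorted(v for v in range(256) if unique_words(v) in used_rows)
    rank = {v: i for i, v in enumerate(allowed)}
    base = len(allowed)
    # Re-encode the whole stream as one large number in the smaller base.
    number = 0
    for b in data:
        number = number * base + rank[b]
    return number, used_rows, len(data)

def decode(number: int, used_rows: set, length: int) -> bytes:
    allowed = sorted(v for v in range(256) if unique_words(v) in used_rows)
    base = len(allowed)
    out = []
    for _ in range(length):
        number, digit = divmod(number, base)
        out.append(allowed[digit])
    return bytes(reversed(out))

data = bytes([0b00000000, 0b01010101, 0b11100100, 0b00011011])
number, rows, length = encode(data)
print(f"used rows: {sorted(rows)}; {len(data) * 8} bits -> {number.bit_length()} bits")
assert decode(number, rows, length) == data
```

In this toy input only rows 1 and 4 are used, so just 28 of the 256 block values can occur, and 32 input bits fit into at most 20. In the full scheme, the set of used rows would be stored alongside the encoded number as the 2^N-bit dictionary mentioned above.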
Demonstration of a special algorithm
The video was created to demonstrate the possibility of improving compression of data "without redundancy." It shows a special algorithm that creates redundancy in the source data.
P.S. The first algorithm runs too long, so I did not record a video of it. The second algorithm is the one shown in the video. It works on the same idea as the BWT algorithm: first the data is prepared, and then it is compressed by a classical algorithm.