Literary Archiver

First of all, I congratulate all the Orthodox and they sympathize with Easter and the end of Lent, all the rest - with the onset of spring. In the sandbox, only a month ago, my Cyrillic programming debut finally drowned. I do not know what attracted the attention of readers to the green, but commented on the sheets, like a real article. In his sheet TrllServ proposed to use the idea for archiving. I love people who know how to find practical application for ideas. Having unfolded the notebook, I tried to sketch an algorithm based on the property of my encoding, namely, the unique typing of a symbol by the first bits. Compressing such an algorithm is conveniently the text , that is, articles, books, or copy-paste from the Internet — that is, which consists of words, and where the case of letters has a grammatical meaning. Subsequently, the average, based on the rules of the Russian language, was added to the simple algorithm, and it all gathered into one complex program that effectively compressed a literature textbook. Let's call it "Literary Archiver".

Archiving methods

As mentioned, the text consists of words. In experimental archivers, symbols are always replaced with sequences of bits, because there are few of them, and the codes take up less memory.

A = 00, B = 01, C = 10, D = 110

When you need to squeeze a huge book, where each letter occurs millions of times, the saved memory is also huge. However, this method works not only with symbols. Words are also repeating text units, and there are not as many of them as it seems. Dahl’s dictionary contains about 200,000 words, and the average anonym in everyday speech uses no more than 3-4 thousand. Of course, in the Russian language, words may be inclined, conjugate, etc., but if you believe tribayana.ru/?p=4143 , only 38873 original lexemes are found in the “War and Peace” of L. Tolstoy. With one register (see method 2) and a fixed code length, each word will be encoded with two bytes. Not bad, if the source file is written in UTF-8, where two bytes is a Cyrillic grapheme. The average word length in Russian is 7.2 letters. That is, it turns out a gain of 7.2 times. In the article, the list of words in the file is called a "dictionary."
')
“Dad” = 0, “soup” = 10, “herbs” = 11
If you took to do the archiver, limit the encoding. In normal cases, we do not need all unicode characters to encode a file. Most often only the ASCII page and the Russian alphabet are useful. 256 characters - not so much, but TrllServ knowingly looked in the sandbox. Letters are uppercase and lowercase. Let's forget about hieroglyphs for now. If the lowercase letters are the basis of the text, then the uppercase letters most often stand alone in each sentence. At the same time, if you build the code table according to the usual rules, they will have to select rather long bit sequences (with prefix encoding), and, even worse, the “dictionary” will be filled with the same words, differing only in case. A lot of memory is wasted. For us, “” and “” are the same letter, and for a machine, two completely different values from the code table. In order not to produce unnecessary words, all our letters are already stored in pairs in the archiver itself. He omits all capital letters to lowercase. At the same time, the words of the text where the capital letters were marked are marked, and their number in each word is preserved. Capital letters are always at the beginning of the word, “CamelStyle” is cut into separate lexemes: “Camel” and “Style”. So the number of the word in the text with the number of capital in it is uniquely decoded.

"Start" - 1
"KAPS" - 4
“OUT” - 3
"Optovik": "Op" - 1, "TOB" - 2, "Ik" - 1
Each non-alphabetic character (arithmetic or punctuation) is encoded by a separate lexeme (one character). Only the space is encoded separately. Words are almost always separated by spaces, manually inserting each one is meaningless. It is enough to set the standard: when decoding, between the two words implies one space, unless otherwise specified. And the other is indicated by the space in the archive. There are several exceptions:
- A series of spaces (greater than 1) in the archive means the same sequence of spaces in the text. Useful when compressing python scripts with strict indentation.
  
  “Word1 ____ word2” -> “word1” + "____" + "word"
  * underscore here - spaces
- A space between two words in the archive means that there is no space between them in the text. Yes, like this. The crutch to camel style.
  
  “ZabBocherG” -> “for” + "" + "for" + "" + "che" + "" + "rg"
  * all 4 words are marked, the number of capital letters for each is saved
- By default, non-letter characters “attach” closely to the word on the left, and are separated by a space from the word (or sign) standing on the right. The space character in the archive changes the rules.
  
  “Saltpeter -, _ sugar” -> “saltpeter” + "," + "sugar"
  “Comma _, _ sir” -> “comma” + "" + "," + "sir"
  "Hey -, - goon!" -> "hey" + "," + "" + "goon" + "!"
  "Zatrolil _, - school student" -> "zatrolil" + "" + "," + "" + "school student"
  * here "_" is a space, and "-" is its absence. Due to Habr's features, auto-correct.
- A single space at the very beginning of the archive (the first character) means a single space at the beginning of the text.
  
  [the beginning of the text] "first" -> [the beginning of the archive] "" + "first"
A series of spaces replaces single automatic spaces.
Numbers are encoded as letters, numbers as words. Letters with numbers interlink. Changing the case of numbers is not possible.

"265" -> "265"
“228 articles” -> “228 articles”
“Adyn1111shmel” -> “adyn1111shmel”

These 4 methods provide "folding" of the source file without loss and distortion.

General algorithm

So what happens?

The text is read from the source file and is parallelly cut by the parser, a “dictionary” is compiled, “vocabulary” indices are substituted for the words of the text themselves. At the same time, the frequency of occurrence of each “vocabulary” word is calculated, and labels of capital letters are collected.
The “dictionary” is sorted by frequency of occurrence. In the implementation of the program, the index substitution list is sorted for speeding up.
An “alphabet” is formed - a list of all characters existing in the text, it is sorted by the frequency of occurrence of characters in the “dictionary”. Instead of characters in the "dictionary" substituted indices of the "alphabet". The implementation also sorts the wildcard list.
"Alphabet" and "Dictionary" are encoded in any way, for example, by Huffman.
The archive file for recording opens. The code table of an ordered “alphabet” is introduced. The end is separated by a vertical tab.
With the help of the “alphabet” code table, the “dictionary” code table is entered bitwise. The end is separated by a vertical tab.
Introduced indices of capital and their number for those words where there are several. The end is separated by a vertical tab.
With the help of the code table of the "dictionary" text is entered bit by bit.

Implementation

The figure is sad. While I was writing the code, he got confused. In the article, the algorithm looks simple, but understandable to the machine, it has become incomprehensible to man. All substitutions and indices are in fact a terrible pyramid of pointers. I did everything I could, but I could almost everything. The program works and is almost ready, it does not implement only the prefix coding algorithm of the “dictionary” and “alphabet” - not the most difficult, but the most important part. Without it, the application works and correctly performs the other functions, but the size of the archive decreases slightly. Why niasilil? Yes, ashamed. Lost in the array of pointers, and therefore not encoded. To unravel and refactor your creation, time is needed, and people of my specialization lack it now. Minobra course: less knowledge - more sample tests for preparation. It would be possible to finish the project together with colleagues, ask them for help, but as it turned out, the archiver's algorithm is deadly for most of them. However, I'll show you what it is. I did not find “War and Peace” without the “introductory part”, therefore I carved the archive of “The Master and Margarita”. Fewer pages makes more sense.

source size: 1.4 MB.
archive size: 1.6 MB.

Eureka! Archiver increased the file size. Without coding, it is meaningless. We better consider a specific example.
Two lines with the exceptions mentioned:

JLK ijo, ewrt lhuiOij jfds + huk uih EGE! *
897 y98 uih - 124 jfds JLK

turned into this:

Of course, there is no question of any compression without proper coding, but the structure of the archive is clearly shown (the essence).

In general, if, while the exams are going, there is a kind anonymous author who adores dancing with tambourines and is ready to write a crutch for crutches, here are the sources in C.

PS Implementation is ready.

With love.

Source: https://habr.com/ru/post/353144/

All Articles

Literary Archiver

Archiving methods

General algorithm

Implementation

More articles: