First, it is worth explaining a couple of terms. A code page is a table of a fixed, known size, in which each position (or code) either corresponds to a single character or is left empty. For example, a code page of size 256 might have the letter "G" at position 71. An encoding is the rule for converting a character into its numeric representation, and every encoding is defined for a specific code page. For example, the character "G" in the Abrul encoding takes the value 71. The simplest encodings do exactly this: they represent characters by their positions in the code table. ASCII is one of them.
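To make the distinction concrete, here is a minimal Python sketch (Python and the exact calls are my choice of illustration, not something from the original text) showing that ASCII works exactly this way:

```python
# ASCII places the letter "G" at position 71 of its code table,
# and the ASCII encoding stores that position as the byte value.
print(ord("G"))                # 71 -- the character's position (code)
print("G".encode("ascii"))     # b'G', a single byte
print("G".encode("ascii")[0])  # 71 -- the raw byte value equals the position
```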
In the early days, 7 bits per character was enough. Why? 7 bits give 128 distinct characters, which covered everything users needed at the time: the English alphabet, punctuation, digits, and a handful of special characters. The principal English 7-bit encoding, with its corresponding code page, was named ASCII (American Standard Code for Information Interchange), and it laid the foundation for what came later. When computers spread to non-English-speaking countries, support for national languages became necessary, and here the ASCII foundation came in handy. Computers process information at the byte level, while ASCII codes occupy only the first 7 bits. Using the 8th bit expanded the space to 256 positions without losing compatibility, and therefore without losing English support, which was important. Most non-English code pages and encodings are built on this fact: the lower 128 positions are the same as in ASCII, while the upper 128 are reserved for national characters and are encoded with the high bit set. However, giving every language (or sometimes a group of similar languages) its own page and encoding created a real support burden for operating system developers and for software developers in general.
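As a sketch of how such pages behave (the specific pages Windows-1251 and Latin-1 are my illustrative choices; the article does not name any particular page), a byte below 128 is the same ASCII character everywhere, while a byte above 127 means a different character on each national page:

```python
data = bytes([71, 200])        # 71 = "G" (ASCII half); 200 sits in the upper, national half

print(data.decode("cp1251"))   # 'GИ' -- Windows-1251, the Cyrillic page
print(data.decode("latin-1"))  # 'GÈ' -- ISO 8859-1, the Western European page
```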
To get out of this situation, a consortium was organized, which developed and proposed the Unicode standard. It was meant to combine the characters of all the world's languages in one large table, and it also defined encodings. At first the designers thought that 65,535 positions should be enough for everyone, so they introduced UCS-2, an encoding with a fixed 16-bit code length. But then Asian scripts with enormous character sets arrived, and the calculations collapsed. The code space was expanded, UCS-2 could no longer cope, and the 32-bit UCS-4 appeared. The tangible benefits of the UCS encodings were a constant code length (a multiple of two bytes) and a simple encoding algorithm, both of which help a computer process text quickly. But there was also an unjustified, wasteful use of space: a character stored as 00010101 in ASCII becomes 00000000 00010101 in UCS-2 and 00000000 00000000 00000000 00010101 in UCS-4.
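A small Python sketch of that waste (Python's utf-16-be and utf-32-be codecs are used here as stand-ins, since for a character in the old 16-bit range they produce the same bytes as UCS-2 and UCS-4):

```python
ch = chr(0b00010101)             # the code from the example above (U+0015)

print(ch.encode("ascii"))        # b'\x15'              -- 1 byte
print(ch.encode("utf-16-be"))    # b'\x00\x15'          -- 2 bytes, one of them zero padding
print(ch.encode("utf-32-be"))    # b'\x00\x00\x00\x15'  -- 4 bytes, three of them zero padding
```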
Something had to be done about this, and Unicode development turned toward variable-length encoding of the resulting codes. The representatives of this approach became UTF-8, UTF-16 and UTF-32, though the last is variable-length only nominally, since at the moment it is identical to UCS-4. Each character in UTF-8 takes from 8 to 32 bits, and UTF-8 is compatible with ASCII. In UTF-16 a character takes 16 or 32 bits; in UTF-32, 32 bits (or 32 or 64 bits, should the Unicode space ever be doubled); these two are not compatible with ASCII. The number of bytes a character occupies depends on its position in the Unicode table.
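For example (the sample characters below are my own illustration), an ASCII letter, a Cyrillic letter, the euro sign and an emoji occupy different numbers of bytes in UTF-8 and UTF-16, but always four bytes in UTF-32:

```python
for ch in ("G", "Ж", "€", "😀"):         # ASCII, Cyrillic, BMP symbol, character outside the BMP
    print(ch,
          len(ch.encode("utf-8")),       # 1, 2, 3, 4 bytes
          len(ch.encode("utf-16-be")),   # 2, 2, 2, 4 bytes (a surrogate pair for the emoji)
          len(ch.encode("utf-32-be")))   # always 4 bytes
```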
Clearly the most practical of these is UTF-8. Thanks to its ASCII compatibility, its modest appetite for memory and its fairly simple encoding rules, it is the most widespread and most promising Unicode encoding. Finally, here is the scheme for converting a character code to UTF-8:
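A minimal Python sketch of that scheme follows; the helper name utf8_encode is my own, the bit patterns in the comments are the standard UTF-8 layout, and the built-in encoder is used to double-check the result:

```python
def utf8_encode(cp: int) -> bytes:
    """Pack one Unicode code point into UTF-8 bytes by hand."""
    if cp < 0x80:                                   # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:                                # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

# Sanity check against the built-in encoder.
for ch in "GЖ€😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```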
