
About encodings and code pages

This is unlikely to be of much practical relevance now, but it may be interesting to someone (or simply a way to remember years past).

I'll start with a little excursion into computer history. Since the computer was used to process information, it simply had to present this information in a "human" form. A computer stores information as numbers (bytes), while a person perceives characters (letters, digits, various signs). So we need to establish a correspondence number <-> character, and the problem is solved. First, let's count how many characters we need (and let's not forget that "we" are Americans using the Latin alphabet). We need 10 digits + 26 uppercase letters of the English alphabet + 26 lowercase letters + mathematical signs (at least + - / * = > < %) + punctuation marks (. , ! ? : ; ') + various brackets + service characters (_ ^ % $ @ |) + 32 non-printable control characters for working with devices (first of all, with a teletype). All in all, 128 characters are just barely enough, and "we" called this standard character set ASCII, the "American Standard Code for Information Interchange".
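
Just to make that correspondence tangible, here is a minimal Python sketch (Python, like most languages, exposes the ASCII number <-> character mapping directly):

```python
# The fixed number <-> character correspondence that ASCII standardized.
print(ord('A'), ord('a'), ord('0'))  # 65 97 48 -- printable characters
print(chr(65))                       # 'A' -- and back from number to character
print(ord('\n'), ord('\t'))          # 10 9 -- two of the 32 control characters
```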

Well, 7 bits are enough for 128 characters. On the other hand, a byte has 8 bits, and communication channels are 8-bit (let's forget about the "prehistoric" times when bytes and channels had fewer bits). Over an 8-bit channel we can transmit 7 bits of the character code and 1 control bit (for increased reliability and error detection). And everything was great until computers began to be used in other countries, where the Latin alphabet contains more than 26 letters, or a non-Latin alphabet is used altogether. Instead of everyone learning English, the people of the USSR, France, Germany, Georgia and dozens of other countries wanted the computer to communicate with them in their native language. The approaches differed depending on the severity of the problem: it is one thing to add 2-3 national characters to the 26 Latin ones (you can sacrifice a few special characters), and quite another to "squeeze in" the whole Cyrillic alphabet.

Now "we" are Russians, seeking to "Russify" the hardware. The first solutions were based on replacing the lowercase English letters with uppercase Russian ones. However, there are 33 Russian letters and they do not fit into 26 slots. They had to be "compacted", and the first victim of this compaction was the letter Ё (it was simply replaced everywhere with Е). Another trick: instead of the "Russian" А, Е, К, М, Н, О, Р, С, Т, the similar-looking English letters were used (there are even more such look-alikes than needed, but in some pairs only the uppercase forms are similar, and the lowercase ones do not match well: Hh, Tt, Bb, Kk, Mm). Still, everything was "squeezed in", and as a result all output was in CAPITAL LETTERS, which is inconvenient and ugly, but people eventually got used to it. The second technique was "language switching": the code of a Russian character coincided with the code of an English one, but the device remembered that it was now in Russian mode and printed the Cyrillic character (and in English mode, the Latin one). The mode was switched by two control characters: Shift Out (SO, code 14) into Russian and Shift In (SI, code 15) into English. (Interestingly, typewriters once used a two-color ribbon: SO physically lifted the ribbon so the print came out red, and SI put the ribbon back so the print was black again.) Text with uppercase and lowercase letters began to look quite decent. All these options more or less worked on large computers, but after the release of the IBM PC the mass distribution of personal computers began around the world, and something had to be standardized.
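
The text does not spell out the exact control-bit scheme, but it was typically a parity bit; a minimal sketch, assuming even parity over the 7 data bits:

```python
# Sketch of sending a 7-bit ASCII code in an 8-bit byte with an even-parity
# bit: bit 7 is set so that the total number of 1-bits is even, letting the
# receiver detect any single flipped bit.
def add_even_parity(code7: int) -> int:
    parity = bin(code7).count("1") % 2  # 1 if the 7 data bits have odd weight
    return code7 | (parity << 7)        # place the parity bit in bit 7

byte = add_even_parity(ord('A'))        # 0b1000001 has two 1-bits -> parity 0
assert bin(byte).count("1") % 2 == 0    # receiver's check: even total weight
```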

The solution was the code-page technology developed by IBM. By that time the "control bit" in transmission had lost its relevance, and all 8 bits could be used for the character code. Instead of the code range 0-127, the range 0-255 became available. A code page (or encoding) is a mapping from a code in the range 0-255 to a certain graphic image (for example, the Cyrillic letter "Я" or the Greek letter "Omega"). You cannot say "the character with code 211 looks like this", but you can say "the character with code 211 in code page CP1251 looks like this: У, and in CP1253 (Greek) it looks like this: Σ". In all (or almost all) code pages the first 128 codes correspond to the ASCII table; only for the first 32 non-printable codes did IBM "assign" its own pictures (which are shown when those codes are output to the monitor). In the upper half of the code page IBM placed pseudographic characters (for drawing various frames), additional Latin characters used in Western European countries, some mathematical symbols, and some characters of the Greek alphabet. This code page was called CP437 (IBM developed many other code pages) and was used by default in video adapters. In addition, various standardization bodies (international and national) created code pages for displaying national characters.

Our computing "minds" offered two options: the main DOS encoding and the alternative DOS encoding. The main one was intended for use everywhere, and the alternative one for special cases where the main one was inconvenient. It turned out that the majority of cases were exactly such "special cases", and the main encoding (not by name, but by usage) was precisely the "alternative" one. I think this outcome was clear from the very beginning to most specialists (except for "pundits" detached from real life). The fact is that mostly English-language programs were used, which "for beauty" actively used pseudographics for drawing various frames and the like. A typical example is the super-popular Norton Commander, installed on most computers at the time. The main encoding placed Russian characters in the spots used for pseudographics, so Norton's panels looked simply awful (as did any other pseudographic output). The alternative encoding, in contrast, carefully preserved the pseudographic characters, using other places for the Russian letters. As a result, it was quite possible to keep working with Norton Commander and other programs. From the "big" computers, where UNIX dominated, came the KOI8-R (KOI8) encoding, developed by Andrei Chernov (a well-known personality at the time). Its peculiarity was that if a Russian character lost its 8th bit, the result was a Latin character that sounded similar to the original Russian one. Instead of "Привет" you got "pRIWET", which is not quite the same, but at least readable. As a result, three different code pages were used on computers in the USSR (main, alternative and KOI8). And this is not counting the various "variations", when in the alternative encoding, for example, individual characters (or even whole rows of the table) were changed. KOI8 also "budded off" variants: Ukrainian, Belarusian, Tajik, Caucasian and others. Hardware (printers, video adapters) also had to be configured (or "reflashed") to work with these encodings. Merchants could bring in a cheap batch of printers (from the Emirates, for example, by barter), and they would not work with Russian encodings.
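
That 8th-bit property is easy to demonstrate with Python's built-in koi8-r codec; a minimal sketch:

```python
# KOI8 design property: masking off the 8th bit of each byte turns Russian
# text into a readable (case-inverted) Latin transliteration.
koi8_bytes = "Привет".encode("koi8-r")
stripped = bytes(b & 0x7F for b in koi8_bytes)  # "lose" the 8th bit
print(stripped.decode("ascii"))                 # -> pRIWET
```
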
Nevertheless, on the whole, code pages solved the problem of outputting national characters (a device simply had to be able to work with the corresponding code page), but they gave rise to the problem of encoding multiplicity, when a mail program sends data in one encoding and the receiving program displays it in another. As a result, the user sees the so-called "krakozyabry" (instead of "привет" something like "ЏўҐ" appears). Transcoders were needed to translate data from one encoding to another. Alas, letters passing through mail servers were sometimes automatically recoded several times (or even had their 8th bit cut off), and then the whole chain of inverse transformations had to be found and executed.
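
A minimal sketch of how such "krakozyabry" arise, using Python's standard code-page codecs (the specific encoding pair is just for illustration):

```python
# Krakozyabry are simply bytes decoded with the wrong code page.
data = "привет".encode("cp1251")  # the sender wrote CP1251 bytes
print(data.decode("koi8-r"))      # the reader assumes KOI8: gibberish
print(data.decode("cp866"))       # the reader assumes CP866: other gibberish
print(data.decode("cp1251"))      # only the right code page recovers "привет"
```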

After the mass transition to Windows, a fourth code page was added to the three (Windows-1251, aka CP1251, aka ANSI) and then a fifth (CP866, aka OEM or DOS). Do not be surprised: for Cyrillic in the console, Windows by default uses the CP866 encoding (the Russian characters are the same as in the "alternative encoding", only some special characters differ), and for other purposes the CP1251 encoding. Why did Windows need two encodings, couldn't it really manage with one? Alas, no: the DOS encoding is used in file names (a heavy DOS legacy), and console commands like dir and copy must display and handle DOS file names correctly. On the other hand, in that encoding many codes are assigned to pseudographic characters (various frames, etc.), whereas Windows runs in graphics mode and neither it nor Windows applications need pseudographics (but they do need the codes those characters occupy, which in CP1251 are used for other useful characters). Five Cyrillic encodings at first aggravated the situation even further, but over time Windows-1251 and KOI8 became the most popular, and the DOS encodings simply came to be used less. With Windows it also stopped mattering which encoding was in the video adapter (only occasionally could "krakozyabry" be seen in diagnostic messages before Windows loaded).
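
A minimal sketch of why conversion between the two is unavoidable: the very same letter is a different byte in each code page (using Python's cp866 and cp1251 codecs):

```python
# The same Cyrillic letter has different codes in the console ("OEM", CP866)
# and the GUI ("ANSI", CP1251) code pages.
for cp in ("cp866", "cp1251"):
    print(cp, "Я".encode(cp))  # b'\x9f' in CP866, b'\xdf' in CP1251
```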

The problem of encodings was finally resolved when the Unicode system began to be adopted everywhere (on both personal operating systems and servers). Unicode assigns each national character a number fixed once and for all (a "code point" in the Unicode code space); most of the time 16 bits are enough, since the longer codes (up to 21 bits) are used for rare characters and hieroglyphs. So there is no longer any need to transcode (for more on Unicode, see the next log entry). Now for any pair <byte code> + <code page> you can determine the corresponding Unicode code (code page tables now list the Unicode code for each of the 256 byte codes) and then, if necessary, output this character in any code page where it is present. Today the problem of encodings and transcoding has practically disappeared for users, but occasionally letters still arrive where either the subject line or the body is in the "wrong" encoding.
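
A minimal sketch of that <byte code> + <code page> -> Unicode -> other code page chain (reusing the code-211 example from above):

```python
# A <byte, code page> pair determines one fixed Unicode code point, which
# can then be re-encoded into any other code page containing the character.
ch = bytes([211]).decode("cp1251")  # byte 211 in CP1251 -> 'У'
print(ch, f"U+{ord(ch):04X}")       # U+0423 -- assigned once and for all
print(ch.encode("cp866"))           # the same letter is byte 0x93 in CP866
```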

Interestingly, about a year ago the problem of encodings briefly resurfaced when the FAS (the Federal Antimonopoly Service) attacked the cellular operators, claiming that they discriminate against Russian-speaking users by charging more for transmitting Cyrillic. This is due to the technical solution chosen by the developers of the SMS protocol. If Russians had developed it, they would probably have given priority to the Cyrillic alphabet. In that article, "the head of the transport and communications control department, Dmitry Rutenberg, noted that there are eight-bit encodings for the Cyrillic alphabet that operators could use." Just think: it is the 21st century, Unicode is conquering the world, and Mr. Rutenberg is pulling us back to the early 90s, when the "war of encodings" raged and the transcoding problem stood at full height. I wonder in which encoding Vasya Pupkin, using a Finnish phone while on vacation in Turkey, should receive an SMS from his wife with her Korean phone, sent from Kazakhstan? And from his French companion (with a Japanese phone) located in Spain? I think no official can answer that question. Fortunately, this "economical" proposal was never realized.
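
The arithmetic behind the tariff difference is simple (the figures below come from the GSM 03.38 standard, which the article itself does not cite):

```python
# An SMS payload is 140 bytes: the 7-bit default GSM alphabet packs 160
# Latin characters into it, while Cyrillic falls back to 2-byte UCS-2.
PAYLOAD_BYTES = 140
print(PAYLOAD_BYTES * 8 // 7)  # 160 characters per Latin-only SMS
print(PAYLOAD_BYTES // 2)      # 70 characters per Cyrillic SMS
```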

A young reader may ask: what prevented using Unicode right away, why were all these troubles with code pages invented? I think the point is the financial side of the problem. Unicode requires twice as much memory, and memory costs money (both disk and RAM). Would an American buy a computer for 1-2 thousand more because "the new OS requires more memory, but lets you work with Russian, European and Arabic languages without any problems"? I'm afraid a simple English-speaking buyer would take such an argument "inadequately" (and would turn to other manufacturers).

Source: https://habr.com/ru/post/238497/
