This is the first part of a translation of the article "What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work with Text."

If you work with text on a computer, you definitely need to know about encodings. Even if you send emails. Even if you only receive them. You don't have to understand every detail, but you should at least know what encodings are. And here is the first bit of good news: the article may be a little confusing, but the basic idea is very, very simple.
This article is about encodings and character sets.
Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is a good introduction, and I take great pleasure in re-reading it from time to time. I hesitate to refer people who have trouble understanding encoding problems to it, though, because it is quite light on technical details. I hope this article sheds some light on exactly what encodings are and why all your text gets mangled at the worst possible moment. The article is aimed at developers (mainly PHP), but any computer user can benefit from it.
The basics
Everyone has heard about this in one way or another, but somehow the knowledge evaporates when it comes up in conversation, so here it is: a computer cannot store letters, numbers, pictures, or anything else. It can only store bits. A bit has only two values: YES or NO, TRUE or FALSE, 1 or 0, or any other pair you can imagine. Since the computer works with electricity, a bit is represented by an electric charge: either it is there or it is not. It is easier for people to think of this as 1 and 0, so I will stick to that notation.
To use bits to represent anything useful, we need rules. We need to convert a sequence of bits into something like letters, numbers, and pictures using an encoding scheme, or encoding for short. Like this:
01100010 01101001 01110100 01110011
"bits"
In this encoding, 01100010 stands for 'b', 01101001 for 'i', 01110100 for 't', and 01110011 for 's'. A specific sequence of bits corresponds to a letter, and a letter corresponds to a specific sequence of bits. If you can memorize the sequences for all 26 letters, or can find the right match really quickly, you can read bits like a book.
The scheme above happens to be called ASCII. A string of 1s and 0s is split into pieces of 8 bits (bytes), and the ASCII encoding specifies a table translating each byte into a human-readable character. Here is a small piece of that table:
bits character
01000001 A
01000010 B
01000011 C
01000100 D
01000101 E
01000110 F
It contains 95 human-readable characters: the letters A through Z in upper and lower case, the digits 0 through 9, a dozen or so punctuation marks, the ampersand, the dollar sign, and others. It also includes 33 values for things such as space, tab, newline, carriage return, and so on. These are unprintable characters, although they are still visible to a human and useful to him. Some values are useful only to a computer, such as the codes marking the beginning and end of a text. In total, the ASCII encoding comprises 128 characters, a nice round number for anyone who understands computers, since it uses every combination of 7 bits (0000000 to 1111111).
And here is a way to represent a human string using only 1s and 0s:
01001000 01100101 01101100 01101100 01101111 00100000
01010111 01101111 01110010 01101100 01100100
"Hello World"
Important Terms
To encode something into ASCII, follow the table from right to left, replacing letters with bits. To decode bits back into characters, follow the table from left to right, replacing bits with letters.
encode |enˈkōd|
verb [with obj.]
convert into a coded form

code |kōd|
noun
a system of words, letters, figures, or other symbols substituted for other words, letters, figures, or symbols
To encode something is to represent it as something else. An encoding is the set of rules describing how to translate one representation into another.
Other terms that deserve clarification:
Character set, charset - the set of characters that can be encoded. "The ASCII encoding covers a character set of 128 characters." Essentially a synonym for encoding.
Code page - a "page" of codes that maps characters to sequences of bits. A table. Essentially a synonym for encoding.
String - a bunch of items strung together. A bit string is a bunch of bits, such as 00011011. A character string is a bunch of characters, for example "This one". A synonym for sequence.
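A quick PHP aside, jumping slightly ahead, to make the bit string vs. character string distinction concrete (a sketch, assuming the source file is saved as UTF-8): strlen() counts bytes, while mb_strlen() counts characters.

<?php
$s = 'あ'; // one character, but three bytes when stored as UTF-8

echo strlen($s), PHP_EOL;             // 3 - counts bytes
echo mb_strlen($s, 'UTF-8'), PHP_EOL; // 1 - counts characters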
Binary, octal, decimal, hexadecimal
There are many ways to write numbers. 10011111 in binary is 237 in octal, 159 in decimal, and 9F in hexadecimal. They all denote the same value, but hexadecimal is shorter and easier to read than binary. I will stick to binary in this article to improve understanding and strip away an extra layer of abstraction. Do not be alarmed when you encounter character codes in other notations; all the values are equivalent.
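You can let PHP do the base conversions for you: printf's %b, %o, %d, and %X conversions print one and the same value in all four notations.

<?php
// One value, four notations: binary, octal, decimal, hexadecimal.
printf("%b %o %d %X\n", 159, 159, 159, 159);
// 10011111 237 159 9F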
Excusez-Moi?
Now that we know what we are talking about, let's say it: 95 characters is very few when it comes to languages. The set covers basic English, but what about French accents? The Straßenübergangsänderungsgesetz in German? An invitation to a smörgåsbord in Swedish? None of that will work. Not in ASCII. A specification for representing é, ß, ü, ä, and ö is simply missing.
"Wait a minute," the Europeans said, "in ordinary computers with 8 bits per byte, ASCII leaves a whole bit unused, always set to 0! We can use it to extend the table by another 128 values." And so they did. But even so, there are too many ways to mark up vowel sounds: not all the combinations of letters and diacritics used in European languages fit into a table of 256 entries. So the world ended up with an abundance of encodings, standards, de facto standards, and non-standards, each covering a different subset of characters. Someone needed to write a document in Swedish or Czech and, not finding an encoding that fit, simply invented another one. Or so I imagine it all happened.
And do not forget Russian, Hindi, Arabic, Korean, and the many other living languages of the planet, to say nothing of the dead ones. Once you have found a way to write a document that mixes several languages, try adding Chinese. Or Japanese. Both contain thousands of characters. And you have only 256 values. Good luck!
Multibyte encodings
To build tables holding more than 256 characters, one byte is simply not enough. Two bytes (16 bits) are enough to encode 65,536 distinct values. Big-5, for example, is a two-byte encoding. Instead of splitting a bit sequence into blocks of 8, it splits it into blocks of 16 and has a big (I mean BIG) table matching them to characters. Big-5 in its basic form covers most characters of Traditional Chinese. GB18030 is a similar encoding, but it covers both Traditional and Simplified Chinese. And before you ask: yes, there are encodings that cover only Simplified Chinese. As if one alone weren't enough.
Here is a small piece of the GB18030 table:
bits character
10000001 01000000 丂
10000001 01000001 丄
10000001 01000010 丅
10000001 01000011 丆
10000001 01000100 丏
GB18030 covers a fairly large range of characters, including most Latin characters, but in the end it is just one more encoding among many.
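In PHP you would decode such bytes with mb_convert_encoding(). A sketch, assuming your mbstring build supports GB18030 (most do):

<?php
// The two bytes 10000001 01000000 from the table above.
$bytes = "\x81\x40";

// Interpret them as GB18030 and re-encode as UTF-8 for output.
echo mb_convert_encoding($bytes, 'UTF-8', 'GB18030'), PHP_EOL; // 丂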
Unicode confusion
Eventually, someone who had had enough of the mess came up with the idea of a single standard uniting all the encodings. That standard became Unicode. It defines an incredible table of 1,114,112 code points, used for every variation of letter and symbol. That is enough to encode all the European, Central Asian, Far Eastern, southern, northern, western, prehistoric, and future characters known to mankind. Unicode lets you create a document in any language, with any character you can enter into a computer. This was impossible, or very difficult, before the Unicode era. There is even an unofficial Klingon section in the standard. You see, Unicode is so big that it can afford unofficial sections.
So how many bytes does Unicode use for encoding?
None. Because Unicode is not an encoding.
Confused? You're not alone. Unicode first and foremost defines a table of code points for characters. That is a fancy way of saying "65 stands for A, 66 for B, and 9,731 for ☃" (I'm not kidding, there is a snowman). How these code points are encoded into bytes is a separate conversation. Two bytes are not enough to represent 1,114,112 values. Three would be enough, but 3 is an awkward number, so 4 is the comfortable minimum. But as long as you are not writing Chinese, or another language with a huge number of characters requiring many bits, you would never dream of using a fat sausage of 4 bytes per character. If "A" is always encoded as 00000000 00000000 00000000 01000001, and "B" as 00000000 00000000 00000000 01000010, a document in such an encoding swells to four times its size.
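The four-fold swelling is easy to observe in PHP. A sketch, again assuming a UTF-8 source file:

<?php
$text = 'Hello World';

echo strlen($text), PHP_EOL; // 11 bytes in ASCII/UTF-8

// The same text with 32 bits per character.
echo strlen(mb_convert_encoding($text, 'UTF-32BE', 'UTF-8')), PHP_EOL; // 44 bytes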
There are several ways to solve this problem. UTF-32 is an encoding that translates every character into a set of 32 bits. It is a simple algorithm, but it wastes a lot of space. UTF-16 and UTF-8 are variable-length encodings. If a character can be encoded in a single byte (because its code point is very small), UTF-8 encodes it in a single byte. If it needs 2 bytes, it uses 2 bytes. The high bits of each byte signal how many bytes the current character occupies. This saves space, but also spends it whenever those signal bits are needed. UTF-16 is a compromise: every character takes at least two bytes, but can grow to 4 bytes if needed.
character encoding bits
A UTF-8 01000001
A UTF-16 00000000 01000001
A UTF-32 00000000 00000000 00000000 01000001
あ UTF-8 11100011 10000001 10000010
あ UTF-16 00110000 01000010
あ UTF-32 00000000 00000000 00110000 01000010
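You can reproduce this table in PHP with mb_convert_encoding() and bin2hex(); a sketch (hexadecimal output is used here because it is more compact than binary):

<?php
$char = 'あ'; // source file saved as UTF-8

foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
    $bytes = mb_convert_encoding($char, $encoding, 'UTF-8');
    printf("%-8s %s\n", $encoding, bin2hex($bytes));
}
// UTF-8    e38182
// UTF-16BE 3042
// UTF-32BE 00003042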
And that's all there is to it. Unicode is a huge table matching characters to numbers, and the various UTF encodings specify how those numbers are encoded as bits. All in all, Unicode is just one more scheme. There is nothing special about it; it simply tries to cover everything while remaining efficient. And that is a good thing.
Code points
Characters are referred to by their Unicode code point. Code points are written in hexadecimal, preceded by "U+" (which means nothing except "this is a Unicode code point"). The character Ḁ has the code point U+1E00. In other (decimal) words, it is the 7,680th character of the Unicode table. It is officially named "LATIN CAPITAL LETTER A WITH RING BELOW".
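PHP 7.2 and later can convert between characters and code points directly, using mb_chr() and mb_ord():

<?php
// 0x1E00 is the hexadecimal code point written as U+1E00 above.
echo mb_chr(0x1E00, 'UTF-8'), PHP_EOL; // Ḁ
echo mb_ord('Ḁ', 'UTF-8'), PHP_EOL;    // 7680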
TL;DR
The gist of the above: any character can be encoded into many different bit sequences, and any bit sequence can represent many different characters, depending on which encoding is used to read or write it. The reason is that different encodings use different numbers of bits per character and different values to represent different characters.
bits encoding characters
11000100 01000010 Windows Latin 1 ÄB
11000100 01000010 Mac Roman ƒB
11000100 01000010 GB18030
characters encoding bits
Føö Windows Latin 1 01000110 11111000 11110110
Føö Mac Roman 01000110 10111111 10011010
Føö UTF-8 01000110 11000011 10111000 11000011 10110110
Misconceptions, Confusion, and Problems
With all of the above in hand, we come to the pressing problems that many users and developers run into every day, how those problems relate to everything above, and what the solutions are. The biggest problem of all is:
Why the hell is my text unreadable?
ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ
If you open a document and it looks like the text above, there is one reason: your program got the encoding wrong. That's all. The document is not corrupted (at least not yet), and no magic is needed. You just need to select the correct encoding to display the text. The document above contains these bits:
10000011 01000111 10000011 10010011 10000011 01010010 10000001 01011011
10000011 01100110 10000011 01000010 10000011 10010011 10000011 01001111
10000010 11001101 10010011 11101111 10000010 10110101 10000010 10101101
10000010 11001000 10000010 10100010
So, can you quickly guess the encoding? If you just shrugged, you're right. Who knows?
Let's try ASCII. Most of these bytes start with 1. If you remember correctly, ASCII doesn't use that bit at all. So it's not ASCII. What about UTF-8? Most of these bytes are not valid values in that encoding. What about Mac Roman (yet another European encoding)? Hmm, for Mac Roman every one of these bytes is a valid value. 10000011 decodes to "É", 01000111 to "G", and so on. In Mac Roman the text reads: ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ. Right? Wrong? Maybe? How is the computer supposed to know? Maybe someone wanted to write exactly that. For all I know, it could be a DNA sequence! So let's decide: it's Mac Roman, and it's DNA.
Of course, that is complete nonsense. The correct answer is: the text is encoded in Japanese Shift-JIS and is supposed to read エンコーディングは難しくない ("encoding is not difficult"). Who would have thought?
The first reason your text is unreadable is that someone is interpreting a sequence of bytes with the wrong encoding. The computer always has to be told. It will not guess on its own. Some document types declare the encoding of their content, but a byte sequence by itself always remains a black box.
Most browsers let you specify the page encoding through a dedicated menu item. Other programs have similar options.
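In PHP, that same "menu item" is the $from_encoding argument of mb_convert_encoding(). A sketch decoding the document above; mb_detect_encoding() can make an educated guess, but a guess is all it is:

<?php
// The Shift-JIS bytes of the document shown above.
$bytes = "\x83\x47\x83\x93\x83\x52\x81\x5B\x83\x66\x83\x42\x83\x93\x83\x4F"
       . "\x82\xCD\x93\xEF\x82\xB5\x82\xAD\x82\xC8\x82\xA2";

// Naming the right encoding is all it takes.
echo mb_convert_encoding($bytes, 'UTF-8', 'SJIS'), PHP_EOL;
// エンコーディングは難しくない

// An educated guess - never a guarantee.
var_dump(mb_detect_encoding($bytes, ['UTF-8', 'SJIS'], true)); // string(4) "SJIS"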
The author did not divide the article into parts, but it is quite long, so this translation does: the continuation will follow in a couple of days.