This is the second part of a translated article. The first part is here.
My document is complete nonsense in any encoding!
If the sequence of bits does not look reasonable (from a human point of view), then most likely the document was converted incorrectly at some point. Say we take the text ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢ and, for lack of a better idea, save it as UTF-8. The text editor assumed it had correctly read Mac Roman text and that this text now needs to be saved in another encoding. After all, every one of these characters is valid in Unicode: Unicode has a code point for É, a code point for G, and so on. So we simply save it as UTF-8:
11000011 10001001 01000111 11000011 10001001 11000011 10101100 11000011 10001001 01010010 11000011 10000101 01011011 11000011 10001001 01100110 11000011 10001001 01000010 11000011 10001001 11000011 10101100 11000011 10001001 01001111 11000011 10000111 11000011 10010101 11000011 10101100 11000011 10010100 11000011 10000111 11000010 10110101 11000011 10000111 11100010 10001001 10100000 11000011 10000111 11000010 10111011 11000011 10000111 11000010 10100010
This is the text ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢ represented as a UTF-8 bit sequence. That bit sequence is completely divorced from what was in the original document. Whatever encoding we open it in, we will never see the original text エンコーディングは難しくない. It is simply lost. It could be recovered if we knew that the original was Shift-JIS, that we had mistakenly read it as Mac Roman, and that we had then saved it as UTF-8. But such miracles are rare.
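The chain of mistakes above can be reproduced directly. The sketch below uses Python rather than PHP, because its codecs make each step explicit; the sample string and the encoding names are the ones from the article, and the rest is purely illustrative:

```python
# The original text, as the author typed it.
original = "エンコーディングは難しくない"

# 1. It was saved in Shift-JIS.
sjis_bytes = original.encode("shift_jis")

# 2. An editor misread those bytes as Mac Roman, producing mojibake
#    (Mac Roman assigns a character to every possible byte, so nothing fails).
mojibake = sjis_bytes.decode("mac_roman")

# 3. The mojibake was then saved as UTF-8: the damage is now baked in.
stored = mojibake.encode("utf-8")

# No single decoding of `stored` yields the original text. Recovery works
# only if you know the entire chain and walk it backwards:
recovered = stored.decode("utf-8").encode("mac_roman").decode("shift_jis")
print(recovered == original)  # True
```

The round trip only succeeds because every step happened to be lossless; drop or replace a single byte along the way and the backwards walk fails.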
Often a given bit sequence is simply invalid in a given encoding. If we tried to open the original document as ASCII, some of the characters would be recognized and some would not. The program you are using might decide to silently discard the bytes that are invalid in the current encoding, or replace them with question marks, or with the special Unicode replacement character � (U+FFFD). If you save the document after such unsuitable characters have been removed, you lose them forever.
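What "dropping or replacing unsuitable bytes" looks like can be shown in a couple of lines. This is a Python illustration of the behavior described above, not PHP:

```python
# Three UTF-8 bytes for the character 縧; none of them is valid ASCII.
data = "縧".encode("utf-8")  # b'\xe7\xb8\xa7'

# Strict decoding refuses outright.
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print("invalid in ASCII:", e.reason)

# A lenient program substitutes U+FFFD for each offending byte...
print(data.decode("ascii", errors="replace"))  # ���

# ...or silently drops them. Save this result and the bytes are gone forever.
print(repr(data.decode("ascii", errors="ignore")))  # ''
```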
If you guessed the encoding wrong and then saved the text in another encoding, you have corrupted the document. You can try to repair it, but such attempts rarely succeed. Shuffling the bits around usually helps about as much as a poultice helps a corpse.
So how do you change encodings correctly?
It is actually easy! You need to know the encoding of a given piece of text (bit sequence) and use that encoding to decode it. That is all you have to do. If you are writing a program that accepts text from the user, decide in which encoding it will accept it. Every text field should know what encoding its data arrives in. For every kind of file the user can load into the program, the encoding must be defined, or there must be a way to ask the user about it. The information can be supplied by the file format or by the user (though most users will hardly know it until they finish reading this article).
If you want to convert text from one encoding to another, use dedicated tools. Conversion is the tedious job of comparing two code pages, deciding that character 152 in encoding A corresponds to character 4122 in encoding B, and then changing the bits accordingly. There is no need to reinvent this wheel: every common programming language ships with tools, abstracted away from bits and code pages, for converting text between encodings.
Say your application must accept files in GB18030, but internally you work in UTF-32. PHP's iconv function does the conversion in one line: iconv('GB18030', 'UTF-32', $string). The characters remain the same even though their bit representation has changed:
character   GB18030 encoding    UTF-32 encoding
縧          10111111 01101100   00000000 00000000 01111110 00100111
That's all. The content of the string, in human terms, has not changed, but it is now a valid UTF-32 string. If you keep working with it as UTF-32, you will have no problems with unreadable characters. As we discussed earlier, though, not all encodings can represent all characters: the character 縧 cannot be encoded in any of the encodings meant for European languages, and then something terrible happens.
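The same round trip can be sketched outside PHP. The Python below mirrors the iconv call and the failure mode for European encodings; the character and the encoding names are the ones from the article:

```python
# GB18030 bytes for 縧 arriving from outside...
gb_bytes = "縧".encode("gb18030")

# ...are decoded with the correct table, then re-encoded for internal use.
text = gb_bytes.decode("gb18030")
utf32_bytes = text.encode("utf-32-be")
print(utf32_bytes.hex())  # 00007e27, i.e. code point U+7E27

# Encodings for European languages simply have no slot for this character.
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError:
    print("縧 cannot be represented in ISO-8859-1")
```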
Everything in Unicode
That is why there is no excuse, in the 21st century, not to use Unicode. Some specialized encodings may be more compact than Unicode for text in particular languages. But unless you have to work with terabytes of such text (which is a LOT), there is nothing to worry about: the problems caused by incompatible encodings are far worse than a few lost gigabytes, and this argument only gets stronger as storage and bandwidth grow cheaper.
If the system needs to work with other encodings, convert incoming text to Unicode first of all, and convert it back only when it has to be output somewhere. Otherwise you will have to carefully track every access to the data and perform the necessary conversions each time, hopefully without losing information.
Happy accidents
I had a website connected to a database. My application processed everything as UTF-8 and stored it in the database, and everything was great; but when I opened the database admin tool, I could not understand a thing.

- an anonymous code monkey
There are situations where encodings are handled incorrectly and yet everything still works fine. A frequent case: the database encoding is set to latin-1 while the application works in UTF-8 (or anything else). Pretty much any combination of ones and zeros is valid in single-byte latin-1. If the database receives 11100111 10111000 10100111 from the application, it happily stores it, thinking the application meant ç¸§. Why not? Later it returns those same bits to the application, which is happy too, since it got back the UTF-8 character 縧 it intended. The database admin interface, however, knows that latin-1 is in use, and the result is that nothing can be understood.
The fool simply won the lottery even though the odds were against him. Any operation on the text inside the database may work, or may not do what was intended, because the database is not interpreting the text correctly. In the worst case, the database inadvertently destroys all the text while performing some routine operation two years after installation, because it was misreading the encoding the whole time (and, of course, there is no backup).
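The lucky accident is easy to reproduce. Here is a Python sketch of the latin-1 database scenario described above; the bytes are the article's UTF-8 encoding of 縧:

```python
# The application sends UTF-8 bytes.
utf8_bytes = "縧".encode("utf-8")            # 11100111 10111000 10100111

# A latin-1 database happily accepts any byte sequence...
what_the_db_thinks = utf8_bytes.decode("latin-1")
print(what_the_db_thinks)                    # ç¸§, what the admin UI shows

# ...and hands the very same bytes back, so the application never notices.
returned = what_the_db_thinks.encode("latin-1")
print(returned == utf8_bytes)                # True
print(returned.decode("utf-8"))              # 縧
```

The round trip works only because latin-1 maps every byte value to some character; the moment the database tries to actually interpret the text (sorting, case folding, truncation), the luck runs out.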
UTF-8 and ASCII
The genius of UTF-8 is its binary compatibility with ASCII, which is the de facto baseline for all encodings. Every ASCII character occupies exactly one byte in UTF-8 and uses the same bits as in ASCII; in other words, ASCII maps 1:1 into UTF-8. Every non-ASCII character occupies 2 or more bytes in UTF-8. Most programming languages that expect their source code to be ASCII therefore let you put UTF-8 text directly into string literals:
$string = "漢字";
Saving in UTF-8 will give the sequence:
00100100 01110011 01110100 01110010 01101001 01101110 01100111 00100000
00111101 00100000 00100010 11100110 10111100 10100010 11100101 10101101
10010111 00100010 00111011
Only 6 of the 19 bytes (those starting with 1) belong to UTF-8 multi-byte sequences: 2 characters of 3 bytes each. All the other bytes are plain ASCII. The parser reads this as:
$string = "11100110 10111100 10100010 11100101 10101101 10010111";
The parser treats everything between quotation marks as a sequence of bits to be taken as-is, all the way to the closing quotation mark. If you simply output this sequence, you output UTF-8 text. Nothing else needs to be done. The parser does not need to support UTF-8 specifically; it only has to take the string literally. Simple parsers can carry Unicode this way without actually supporting Unicode. Many other programming languages, however, support Unicode explicitly.
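The claim that a naive parser gets UTF-8 "for free" rests on the byte values shown above. A quick check in Python (illustrative, not PHP):

```python
source_line = '$string = "漢字";'

# Saved as UTF-8, the literal between the quotes is exactly these 6 bytes:
assert "漢字".encode("utf-8") == b"\xe6\xbc\xa2\xe5\xad\x97"

# Every byte of a multi-byte UTF-8 sequence has its high bit set, so none
# of them can ever collide with the ASCII quote (00100010) the parser scans for.
payload = source_line.encode("utf-8")
multibyte = [b for b in payload if b >= 0x80]
print(len(payload), len(multibyte))  # 19 6
```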
Encodings and PHP
PHP does not support Unicode. Or rather, it supports it quite well. The previous section showed that UTF-8 characters can be placed directly into program text without any problems, because UTF-8 is backward compatible with ASCII, and that is all PHP needs. Still, the statement "PHP does not support Unicode" is true in a sense, and it causes plenty of confusion in the PHP community.
False promises
The utf8_encode and utf8_decode functions have become my pet peeve. I often see nonsense like "To use Unicode in PHP, call utf8_encode on input text and utf8_decode on output." These two functions promise some kind of automatic conversion of text to UTF-8, supposedly mandatory because "PHP does not support Unicode." If you have not just skimmed this article, you already know that:
- There is nothing special about UTF-8
- You cannot encode text "into UTF-8" after the fact
Let me clarify the second point: any text is already encoded. When you write string literals in source code, they are already encoded, specifically in whatever encoding your text editor is currently using. If you get strings from a database, they are already encoded. If you read them from a file... you get the idea, right?
Text either is encoded in UTF-8 or it is not. If not, it is encoded in ASCII, ISO-8859-1, UTF-16, or something else. If it is not in UTF-8 but is supposed to contain "UTF-8 characters", you have a case of cognitive dissonance. If it really does contain the needed characters encoded in UTF-8, then it is in UTF-8.
So what the hell does utf8_encode do?
"Encodes an ISO-8859-1 string to UTF-8"
Aha. What this means is that the function converts text from ISO-8859-1 to UTF-8. That is all it is for. The terrible name was probably chosen by some short-sighted European. The same goes for utf8_decode. These functions are unusable for anything other than converting between ISO-8859-1 and UTF-8. If you need a different pair of encodings, use iconv.
utf8_encode is not a magic wand to wave over every string because "PHP does not support Unicode." Used that way, it causes more problems than it solves; you can thank that European and a generation of ignorant programmers.
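What goes wrong when utf8_encode is "waved over" text that is already UTF-8 can be shown concretely. utf8_encode assumes its input is ISO-8859-1; the Python below imitates that assumption (it is a model of the PHP behavior, not PHP itself):

```python
s = "ö"
once = s.encode("utf-8")                     # b'\xc3\xb6': correct UTF-8

# utf8_encode treats its input as ISO-8859-1. Applying that logic to bytes
# that are already UTF-8 encodes them a second time:
twice = once.decode("iso-8859-1").encode("utf-8")
print(twice)                                 # b'\xc3\x83\xc2\xb6'
print(twice.decode("utf-8"))                 # Ã¶, classic double-encoding garbage
```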
Native-shmative
So what do people mean when they say that a language supports Unicode? What matters is whether the language assumes that one character occupies one byte. For example, PHP lets you access a given character by treating the string as a character array:
echo $string[0];
If $string is in a single-byte encoding, this gives us the first character. But only because "character" coincides with "byte" in a single-byte encoding. PHP simply returns the first byte without any thought about characters. Strings, to PHP, are byte sequences, no more and no less. Your "readable characters" are nothing but a human invention, and PHP could not care less about them:
01000100 01101111 01101110 00100111 01110100
D        o        n        '        t
01100011 01100001 01110010 01100101 00100001
c        a        r        e        !
The same applies to many standard functions such as substr, strpos, trim, and so on. The support ends exactly where the one-byte-one-character correspondence ends:
11100110 10111100 10100010 11100101 10101101 10010111
漢                          字

$string[0] on the string above again returns only the first byte, 11100110. In other words, the first of the three bytes of the character 漢. The sequence 11100110 is invalid on its own in UTF-8, so the string is now invalid as well. If you feel like it, you can try some other encoding in which 11100110 happens to denote a valid character. You can have fun that way, but not on a production server.
That's all. "PHP does not support Unicode" means that most functions in the language assume one byte equals one character, which leads to multi-byte characters being chopped up and string lengths being counted incorrectly. It does not mean that you cannot use Unicode in PHP, or that every string must be run through utf8_encode, or any other such nonsense.
Fortunately, there is a dedicated extension that provides all the important string functions with support for multi-byte encodings. mb_substr($string, 0, 1, 'UTF-8') on the string above correctly returns the sequence 11100110 10111100 10100010, i.e. the character 漢. Because the mb_ functions have to think about what they are doing, they need to be told the encoding, which is why they take an $encoding parameter. The encoding can also be set globally for all mb_ functions with mb_internal_encoding.
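The difference between byte indexing and character-aware functions maps directly onto Python's bytes/str split, which makes a handy illustration of what mb_substr fixes:

```python
data = "漢字".encode("utf-8")   # the 6 bytes from the example above

# Byte indexing, the analogue of PHP's $string[0]: one byte, not one character.
first_byte = data[:1]
print(first_byte)               # b'\xe6': invalid UTF-8 on its own

# Character-aware access, the analogue of mb_substr($string, 0, 1, 'UTF-8'):
# decode with the right table first, then take the first character.
first_char = data.decode("utf-8")[0]
print(first_char)                  # 漢
print(first_char.encode("utf-8"))  # b'\xe6\xbc\xa2', i.e. 11100110 10111100 10100010
```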
Use and abuse of PHP error handling
The problem with PHP's (non-)support of Unicode is that the interpreter simply does not care. Byte sequences, ha! What they mean is none of PHP's business. Nothing is done beyond storing the strings in memory; PHP does not even have a concept of "encoding". And as long as the strings are not manipulated, it does not matter: work is done on byte sequences that a human may happen to read as characters. All PHP requires is that the source code be saved in something ASCII-compatible. The PHP parser looks for specific bytes that tell it what to do: 00100100 ($) says "here comes a variable", 00111101 (=) says "assign", 00100010 (") marks the beginning or end of a string, and so on. Anything that has no special meaning to the parser is taken as a literal byte sequence, which includes everything between quotes. This means:
- You cannot save PHP source code in an ASCII-incompatible encoding. In UTF-16, for example, a quotation mark is encoded as 00000000 00100010. To PHP, which treats everything as ASCII, that is a NUL byte followed by a quotation mark, and PHP would likely hiccup on the NUL byte in front of every character.
- You can save PHP source in any ASCII-compatible encoding. If the first 128 code points of an encoding match ASCII, PHP will swallow it. Every character that means something to PHP lies within the first 128 code points defined by ASCII; if string literals contain anything beyond that, PHP pays no attention. You can save your source as ISO-8859-1, Mac Roman, UTF-8, or anything else. The string literals in your code will have whatever encoding you saved the file in.
- Any file external to PHP can be in any encoding whatsoever. As long as the parser does not have to interpret it, everything is fine.
$foo = file_get_contents('bar.txt');
The line above simply reads the bytes of bar.txt into the variable $foo. PHP makes no attempt to interpret, convert, or otherwise manipulate the content. The file may contain binary data or an image; PHP does not care.
- If external and internal encodings have to match, then they really have to match. A common case is localization: in the code you write something like echo localize('Foobar'), and an external file contains:
msgid "foobar"
msgstr "フーバー"
Both "foobar" strings must have an identical bit representation. If the source code is in ASCII but the localization file is in UTF-16, you are out of luck: an extra conversion is required.
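The localization lookup fails for a purely bitwise reason, which a few comparisons make obvious. Python is used for illustration; the key follows the article's example:

```python
# A lookup of this kind compares raw bytes, not "characters".
ascii_key = "foobar".encode("ascii")
utf8_key = "foobar".encode("utf-8")
utf16_key = "foobar".encode("utf-16-be")

print(ascii_key == utf8_key)    # True: UTF-8 is a superset of ASCII
print(ascii_key == utf16_key)   # False: every other byte is NUL
print(utf16_key)                # b'\x00f\x00o\x00o\x00b\x00a\x00r'
```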
An astute reader may ask whether it is possible to save, say, a UTF-16 byte sequence inside an otherwise ASCII source file, and the answer is: of course.
echo "UTF-16";
If you can convince your editor to save the echo " and "; parts in ASCII and only the UTF-16 part in UTF-16, it will work. Here is the binary representation:
01100101 01100011 01101000 01101111 00100000 00100010
echo "
11111110 11111111 00000000 01010101 00000000 01010100
(UTF-16 marker) UT
00000000 01000110 00000000 00101101 00000000 00110001
F - 1
00000000 00110110 00100010 00111011
6 ";
The first line and the last two bytes are ASCII. The rest is UTF-16, two bytes per character. The leading 11111110 11111111 on the second line is the byte order mark that the standard puts at the start of UTF-16 text (PHP, of course, has never heard of it). This script prints the string "UTF-16" encoded in UTF-16, because it simply outputs the bytes between the two quotation marks. On the other hand, the source file is now valid neither as pure ASCII nor as pure UTF-16, so you can open it in an editor and enjoy the view.
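The mixed-encoding file can be assembled byte by byte, which confirms the dump above. A Python sketch; the byte counts match the listing:

```python
import codecs

# ASCII prefix, a UTF-16BE byte order mark, the UTF-16BE payload, ASCII suffix.
mixed = (b'echo "'
         + codecs.BOM_UTF16_BE
         + "UTF-16".encode("utf-16-be")
         + b'";')

print(len(mixed))        # 22 bytes, as in the listing above
print(mixed[6:8].hex())  # feff, the byte order mark
```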
Summary
PHP supports Unicode, or rather any encoding, well enough, as long as you can make the parser do its job and the developer understands what they are doing. You only need to be careful when working with strings: splitting, trimming, counting, and every other operation that deals in characters rather than bytes. If you do nothing with your strings besides reading and outputting them, you will hardly run into problems that do not exist in other languages as well.
Languages with encoding support
So what does it mean for a language to support Unicode? JavaScript, for example, supports Unicode: every string in JavaScript is encoded in UTF-16, and that is the only encoding JavaScript works with. You simply cannot have a non-UTF-16 string in JavaScript. JavaScript worships Unicode to such a degree that the core language has no tools for working with any other encoding at all. Since JavaScript usually runs in a browser, this causes no problems: the browser handles the mundane logic of encoding and decoding input and output.
Other languages simply support many encodings. Internal work is done in a single encoding, often UTF-16, but that means you need to tell them what encoding a given text is in, or they will try to detect it. You must specify the encoding the source code is stored in, the encoding of a file being read, and the encoding to use for output. The language performs the conversion on the fly if told to use Unicode. Such languages do semi-automatically, somewhere in the background, everything that PHP forces you to do by hand. No better and no worse than PHP, just different. The good news is that the string functions finally just work, and you do not have to think about whether a string contains multi-byte characters or which functions to pick, as you would in PHP.
The wilds of Unicode
Because Unicode solves so many different problems and works in so many different scenarios, you pay for it with complexity. For example, the Unicode standard deals with problems like Han unification: the set of ideographs shared by Japan, China, and Korea, which are drawn slightly differently in each. Or the problem of converting characters between lowercase and uppercase and back, which is not always as simple as it is for Western European scripts. Some characters can also be represented by different code points. The letter ö, for example, can be represented by U+00F6 ("LATIN SMALL LETTER O WITH DIAERESIS") or by the pair U+006F ("LATIN SMALL LETTER O") and U+0308 ("COMBINING DIAERESIS"), i.e. an o with two dots on top. In UTF-8 that is either 2 bytes or 3 bytes, and in both cases it is a perfectly normal character. For this reason the standard also contains normalization rules, i.e. how to convert one of these forms into the other. All this and much more is beyond the scope of this article, but you should know that it exists.
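The two spellings of ö and their normalization can be checked with Python's unicodedata module; the code points are the ones named above:

```python
import unicodedata

composed = "\u00f6"        # ö as a single code point
decomposed = "o\u0308"     # o followed by COMBINING DIAERESIS

# Different code point sequences, different byte lengths in UTF-8...
print(composed == decomposed)           # False
print(len(composed.encode("utf-8")))    # 2
print(len(decomposed.encode("utf-8")))  # 3

# ...but normalization maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```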
TL;DR, again!
- Text is always a sequence of bits that must be translated into human-readable form using tables. Wrong table, wrong characters.
- You never work with "text" directly; you always work with bits wrapped in abstractions. Encoding errors are errors in one of those abstractions.
- Systems that pass information to each other must always declare the working encoding. A website, for example, tells the browser that it is serving UTF-8.
- Nowadays you should use UTF-8: it is backward compatible with ASCII, it can encode practically every character, and it is still reasonably efficient in most cases. Other encodings see use too, but you need a good reason to suffer with an encoding that supports only part of Unicode.
- Whether a byte corresponds to a character is something both the program and the programmer have to keep straight.
Now you have no excuse the next time you garble someone's text.