
The sad story of forgotten characters: how not to go crazy when working with encodings in C++



When it comes to text, most C++ programmers think of arrays of character codes plus the encoding those codes correspond to. The most experienced developers never consider text without specifying its encoding; the least experienced simply treat a byte array of character codes as given and interpret it in terms of the operating system's encoding. The fundamental difference between these two approaches lies not only in the developer's experience, but also in the fact that not thinking about the encoding is much easier. It is time to look at how to stop worrying about storing the encoding and re-encoding text, how to get cheap access to individual characters, and at the same time get an unambiguous representation of the text no matter who looks at the string and where: in China, in the USA, or on the island of Madagascar.

8 bits and all-all-all ...


Let's start with the main thing. The creators of the C language were minimalists: to this day the C/C++ standard does not provide a "byte" type. Instead there is the type char. Char means character. Accordingly, when speaking of the char type in C/C++ we mean "byte", and vice versa. This is where the fun begins: the maximum number of characters that can be encoded in 8 bits is 256, while today the Unicode table contains hundreds of thousands of characters.

The sly creators of ASCII immediately reserved the first 128 codes for standard characters, with which you can safely encode almost everything in the English-speaking world, leaving everyone else only half a byte for their own needs — or rather, only one free high bit. As a result, in the early years of computing, everyone tried to squeeze into those remaining "negative" values from –128 to –1. Each such set of codes was standardized under some name and from that moment on was called an encoding. At some point there were more encodings than values in a byte, and all of them were mutually incompatible beyond the first 128 ASCII characters. As a result, if you guess the encoding wrong, everything outside the set of symbols essential to the American community is displayed as so-called krakozyabry (mojibake) — characters that are, as a rule, entirely unreadable.
Moreover, different systems introduced completely mismatched encodings for the same alphabets, even when the two systems came from the same company. For Cyrillic, MS-DOS used code pages 855 and 866, Windows used 1251, Mac OS had its own encoding for the very same Cyrillic, KOI8 and KOI7 stand apart, and there is even ISO 8859-5 — and each of them interprets the same char values as completely different characters. Not only was it impossible to use several encodings at once when processing differently encoded text — for example, when translating from Russian to German with umlauts — but in addition, some alphabets simply refused to fit into the 128 positions left for them. As a result, in international programs characters could be interpreted in different encodings even on adjacent lines; you had to remember which string was in which encoding, which inevitably led to text-display errors ranging from funny to not funny at all.

Install on a virtual machine any operating system whose default encoding differs from your host system's — for example, Windows with code page 1251 if you run Linux with UTF-8 by default, or vice versa. Then try to write code that prints a Cyrillic string to std::cout and, without any changes, builds and behaves identically on both systems. Admit it: internationalizing cross-platform code is not such an easy task.

The coming of Unicode


The idea behind Unicode was simple: each character is assigned one code once and for all eternity, that code is standardized in the next version of the Unicode character table specification, and character codes are no longer limited to one byte. A great idea in every way but one: in the C and C++ programming languages, and not only in them, the char character type had been forever associated with a byte. Code everywhere assumed sizeof(char) to be one, and text strings were ordinary sequences of those very chars, terminated by a character with code zero. In defense of the creators of C, Ritchie and Kernighan, it should be said that back in those distant 1970s nobody could have imagined that so many codes would be needed to encode a character — a byte was quite enough to encode a typewriter. Either way, the main evil had been done: any change to the char type would break compatibility with all code already written. The sensible solution was to introduce a new "wide character" type, wchar_t, and to duplicate all the standard C string functions for the new, "wide" strings. The string container of the C++ standard library also gained a "wide" sibling, wstring.

Everyone would have been happy were it not for one "but": everyone was already used to writing code around byte strings, and the L prefix before string literals did not add enthusiasm among C/C++ developers. People preferred to avoid characters outside ASCII and accept the limitations of the Latin alphabet rather than write unfamiliar constructions incompatible with existing code that worked with char. The situation was complicated by the fact that wchar_t has no standardized size: for example, in modern GCC g++ it is 4 bytes, in Visual C++ it is 2 bytes, and the Android NDK developers cut it down to one byte, making it indistinguishable from char. The result is a so-so solution that works far from everywhere. On the one hand, a 4-byte wchar_t is closest to the truth, since by the standard one wchar_t should correspond to one Unicode character; on the other hand, nobody guarantees that it really will be 4 bytes in code using wchar_t.

An alternative solution was the byte-oriented UTF-8 encoding, which is not only compatible with ASCII (a zero high bit marks single-byte characters), but also uses multi-byte sequences — up to 4 bytes in the current standard, and in its original design able to represent over two billion code values. The price, however, is substantial: characters have different sizes, and, for example, to replace the Latin character 'R' with the Cyrillic character 'Я' the entire string has to be rebuilt, which is much more expensive than a simple code replacement in the case of a 4-byte wchar_t. Thus, any intensive per-character work on a UTF-8 string can put an end to the idea of using this encoding. On the other hand, the encoding stores text quite compactly, contains protection against read errors and, most importantly, is international: anyone anywhere in the world will see the same characters from the Unicode table when reading a UTF-8 encoded string. Except, of course, when the string is interpreted in a different encoding — everyone remembers the krakozyabry produced by opening Cyrillic UTF-8 text as Windows-1251, the default encoding on Windows.

How byte-oriented Unicode works


UTF-8 is a very entertaining encoding. Here are its basic principles:

  1. A character is encoded as a sequence of bytes; in each byte, the leading bits encode the byte's position in the sequence, and in the first byte also the length of the sequence. For example, here is the character 'Я' in UTF-8: [1101 0000] [1010 1111]
  2. Bytes of the sequence, starting from the second, always begin with the bits 10; accordingly, the first byte of a character's sequence can never begin with 10. This is the basis of the basic validity check when decoding a character code from UTF-8.
  3. The first byte may be the only one; in that case its leading bit is 0 and the character corresponds to an ASCII code, since 7 low bits remain for the code.
  4. If the character is not ASCII, the first byte starts with as many one bits as there are bytes in the sequence, including the leading byte, followed by a 0 terminating the run of ones, and then the significant bits of the first byte. As the example above shows, the encoding of the character 'Я' occupies 2 bytes, which can be recognized from the two high bits of the first byte of the sequence.
  5. All significant bits are concatenated into a single bit sequence and interpreted as a number. For example, for any character encoded in two bytes, the significant bits (marked here with x) are: [110x xxxx] [10xx xxxx]

Concatenating them, as you can see, yields a number of up to 11 bits, i.e. code points up to 0x7FF of the Unicode table. That is enough for the Cyrillic characters, which start at 0x400. Gluing together the bits of the example character gives: 10000 101111

That is exactly 0x42F — the code of the character 'Я' in the Unicode character table.

In other words, if you do not need to manipulate individual characters in a string, replacing them with other characters from the Unicode table, you can use UTF-8: it is reliable, compact and compatible with the char type in the sense that string elements are the same size as a byte — they are just not necessarily characters.

In fact, it is precisely the efficiency and popularity of UTF-8 that explains the forced introduction of a single-byte wchar_t in the Android NDK: its developers urge you to use UTF-8 and do not recognize "wide" strings as a viable form. On the other hand, Google not so long ago also rejected exceptions in C++, but you cannot go against the whole world, even if you are Google three times over — exception handling had to be supported. As for wchar_t characters one byte in size, many libraries refused to put up with such wchar_t atrocities and simply duplicate the "wide" functionality by processing ordinary byte strings.

UTF (Unicode Transformation Format) is essentially a byte representation of text that packs character codes from the Unicode table into an array according to standardized rules. The most popular are UTF-8 and UTF-16, which represent characters in 8-bit and 16-bit elements respectively. In both cases a character does not necessarily occupy exactly 8 or 16 bits: in UTF-16, for example, surrogate pairs are used — pairs of 16-bit values interpreted together. As a result there are fewer significant bits than bits in the group representing the character (20 in the case of a surrogate pair), but the number of encodable characters exceeds the limit of 256 or 65,536 values, and any character from the Unicode table can be encoded. UTF-32, which differs favorably from its siblings in simplicity, is less popular because of the redundancy of its representation, which matters for large amounts of text.

Writing Russian in code


The tribulations and language discrimination begin when we try to use a non-ASCII string in source code. Visual Studio on Windows, for example, creates all files in the system's default encoding (1251), and when you open code containing Russian strings on Linux with its default UTF-8, you get a pile of incomprehensible characters instead of the original text.

The situation is partially saved by re-saving the source files in UTF-8 with a mandatory BOM; without it, Visual Studio starts interpreting "wide" strings with Cyrillic in a very peculiar way. By specifying the BOM (Byte Order Mark) of the UTF-8 encoding — a marker consisting of the three bytes 0xEF, 0xBB and 0xBF — we get the UTF-8 encoding recognized on any system.

The BOM is a standard header of bytes used to recognize the Unicode encoding of a text; it looks different for each of the UTF encodings. Feel free to use your native language in your programs. Even if you later have to localize them for other countries, internationalization mechanisms will help turn any string in one language into a string in another. That is, of course, if the product is developed in the Russian-speaking segment.

Try to use "wide" strings both for string constants and for storing and processing intermediate text values. Cheap character replacement, and a number of elements that matches the number of characters, are worth a lot. True, not all libraries have learned to work with "wide" characters yet — even Boost has a whole range of libraries where wide-string support is done carelessly — but the situation is improving, thanks in large part to developers filing bugs in the libraries' issue trackers. Do not hesitate to report such errors to the library developers yourself.

Writing the names of constants, variables and functions in Cyrillic, however, is still not worth it. Writing string constants in your native language is one thing; writing code while constantly switching keyboard layouts is quite another. Not the best option.

Distinguishing the "bytes" type from the "text" type


The main thing to keep in mind is that the "text" type is completely different from the "set of bytes" type. A message string is text; a text file in some encoding is a set of bytes that can be read as text. If text data arrives over the network, it arrives as bytes, along with an indication of the encoding that tells us how to obtain text from those bytes.

Looking at Python 3 versus Python 2, the third version made a really serious leap forward by separating these two concepts. I highly recommend that even an experienced C/C++ developer spend a little time with Python 3, to experience the full depth with which text and bytes were separated at the language level. Text in Python 3 is divorced from the concept of encoding — which sounds extremely unusual to a C/C++ developer — and Python 3 strings are displayed identically anywhere in the world; if we want to work with a representation of a string in some encoding, we must convert the text into a set of bytes, specifying the encoding. The internal representation of a str object is, in fact, not as important as the understanding that it stores Unicode and is ready for conversion to any encoding — but in the form of a set of bytes of type bytes.

In C/C++ we are prevented from introducing such a mechanism by the lack of a luxury Python 3 allowed itself with respect to the second version: breaking backward compatibility. Merely splitting the char type into an analogue of wchar_t and a byte type in some future edition of the standard would lead to the collapse of the language and loss of compatibility with the exorbitant amount of C/C++ code already written. More precisely, with everything you are working on right now.

Fun with transcoding


So the original problem remains unsolved. We still have byte-oriented encodings: both UTF-8 and the old, unkind single-byte encodings like Windows-1251. On the other hand, we set our string constants as wide strings and process text through wchar_t — "wide" characters.

Here the transcoding mechanism comes to our aid. Knowing the encoding of a set of bytes, we can always convert it to an array of wchar_t characters and back. Do not rush to write your own conversion code — I understand that the character codes of any encoding can be found in a minute, as can the entire Unicode code table of the latest edition — but there are enough transcoding libraries without it. There is the cross-platform library libiconv, licensed under the LGPL and today the most popular choice for cross-platform development. Transcoding reduces to a few calls:

    iconv_t conv = iconv_open("UTF-8", "CP1251");
    iconv(conv, &src_ptr, &src_len, &dst_ptr, &dst_len);
    iconv_close(conv);

Accordingly, we first create a conversion handle from one encoding to another, then convert one set of bytes to another (even if one of the byte sets is in fact the bytes of a wchar_t array), and finally close the handle. There is also the more ambitious ICU library, which provides both a C++ interface for transcoding and a special type, icu::UnicodeString, for storing text directly in its Unicode representation. ICU is also cross-platform, and far more options are provided for its use. Pleasantly, when you use its C++ API, the library itself takes care of creating, caching and using the transcoding handles.

For example, to create a Unicode string you use an ordinary constructor of the icu::UnicodeString class:

    icu::UnicodeString text(source_bytes, source_encoding);

ICU thus proposes to abandon the wchar_t type entirely. The catch, however, is that the internal Unicode representation of such a string is two bytes per element, which causes complications when a code point does not fit into those two bytes. In addition, the icu::UnicodeString interface is completely incompatible with the standard wstring. Still, using ICU is a good option for a C++ developer.

In addition, there is the pair of standard functions mbstowcs and wcstombs. With a correctly set locale they convert a (multi-)byte string to a "wide" one and back, respectively. The abbreviations mbs and wcs stand for Multi-Byte String and Wide Character String. By the way, most of the usual C string functions are duplicated by versions in which str in the name is replaced by wcs — for example, wcslen instead of strlen, or wcscpy instead of strcpy.

We cannot forget about Windows development either. Happy WinAPI owners are given another pair of functions with a heap of parameters: WideCharToMultiByte and MultiByteToWideChar. These functions do exactly what their names say: specify the encoding, the parameters of the input and output arrays, and the flags, and get the result. Despite their unsightly appearance, they do their job quickly and efficiently. Not always precisely, though: they may try to convert a character to a similar one, so be careful with the flags passed as the second parameter — it is better to specify WC_NO_BEST_FIT_CHARS.

Usage example:

    WideCharToMultiByte(CP_UTF8, WC_NO_BEST_FIT_CHARS,
                        pszWideSource, nWideLength,
                        pszByteSource, nByteLength,
                        NULL, NULL);

Of course, this code is not portable to any platform other than Windows, so I strongly recommend using the cross-platform ICU4C or libiconv libraries instead.

The most popular library, libiconv, uses exclusively char* parameters. This should not be frightening: in any case, an array of numbers of any width is just a set of bytes. You should, however, remember the byte order of two-byte and wider numbers — that is, the order in which the component bytes of a number appear in the byte array. There is big-endian and little-endian. The order used by the overwhelming majority of machines is little-endian: the low byte comes first and the high byte last. Big-endian is familiar to anyone who works with network protocols, where numbers are traditionally transmitted from the high byte (often carrying service information) to the low byte. Be careful and remember that UTF-16, UTF-16BE and UTF-16LE are not the same thing.

Text class


Now let's put the accumulated knowledge to work and solve the original problem: we need to create an entity — essentially a class — that is initialized from a string, either "wide" or byte with an encoding specified, and that provides the interface of the usual std::string container: access to character elements, changing and deleting them, and converting a copy of the text into either a "wide" string or a byte string with a given encoding. In short, we need to significantly simplify working with Unicode on the one hand, and remain compatible with previously written code on the other.

Our text class will thus get the following constructors:

    text(char const* byte_string, char const* encoding);
    text(wchar_t const* wide_string);

It is worth adding overloads taking std::string and std::wstring as well, plus overloads taking begin/end iterators of a source container.

Element access obviously must be provided, but the result cannot be the byte-sized char or the platform-dependent wchar_t; we need an abstraction over the integer code in the Unicode table — a symbol:

    symbol& operator [] (int index);
    symbol const& operator [] (int index) const;

It thus becomes obvious that we cannot store the Unicode string as a char or wchar_t string. We need at least std::basic_string<int32_t>, since at the moment the UTF-8 and UTF-16 encodings encode characters within int32_t, to say nothing of UTF-32.

On the other hand, nobody outside the text class needs our `std::basic_string<int32_t>` — let's call it `unicode_string`. All libraries love working with `std::string` and `std::wstring`, or `char const*` and `wchar_t const*`. It is therefore best to cache both the incoming std::string or std::wstring and the result of converting the text to a byte string in a given encoding. Moreover, our text class is often needed only as temporary storage for a string passing through — say, a UTF-8 byte string travelling from a database into a JSON string — so transcoding into unicode_string is only needed when text elements are actually accessed. The text class and its internal representation should be optimized as much as possible, since intensive use is implied, and should never re-encode without reason — only on first demand. The user of the text class API must explicitly indicate the wish to convert the text to a byte string in a specific encoding or to obtain the system-specific "wide" string:

    std::string const& byte_string(std::string const& encoding) const;
    std::wstring const& wide_string() const;

As you can see above, we return a reference to a string we have computed and saved in a class field. Of course, we will need to invalidate the cached `std::string` and `std::wstring` at the first change to even one character; here the non-const operator -> of the text::data class comes to our aid. How to do this was shown in the previous two lessons of the C++ Academy.

It is also important not to forget about obtaining char const* and wchar_t const*, which is easy given that std::string and std::wstring are cached in fields of text:

    char const* byte_c_str(char const* encoding) const;
    wchar_t const* wide_c_str() const;

The implementation reduces to calling `c_str()` on the results of byte_string and wide_string, respectively.

We can make UTF-8 the default encoding for byte strings; this is much better than trying to work with the system default encoding, since that would make the code behave differently on different systems. By introducing a number of additional overloads that omit the encoding when working with byte strings, we also gain the ability to overload the assignment operator:

    text& operator = (std::string const& byte_string); // assumes UTF-8
    text& operator = (std::wstring const& wide_string);

We must also not forget to overload the + and += operators; in general, the remaining operations can be reduced to an argument and a result of type text — a universal value providing the text regardless of encoding.

Of course, the C++ Academy would not be an academy if I did not now suggest that you implement the text class yourself. Try to create a text class based on the material in this article. The implementation must satisfy two simple properties:


Here it also makes sense to handle the non-const `operator ->` to reset the string cache, but I leave that to the discretion of the developer. That is, you. Good luck!

Of course, the implementation cannot do without the `copy_on_write` class from the previous articles. As usual, just in case, here is its simplified form:

    template <class data_type>
    class copy_on_write
    {
    public:
        copy_on_write(data_type* data)
            : m_data(data)
        {
        }
        data_type const* operator -> () const
        {
            return m_data.get();
        }
        data_type* operator -> ()
        {
            if (!m_data.unique())
                m_data.reset(new data_type(*m_data));
            return m_data.get();
        }
    private:
        std::shared_ptr<data_type> m_data;
    };

What we get


By implementing the text class we obtain an abstraction from the whole zoo of encodings; wherever text is needed, a single overload taking text is enough. For example:

    text to_json() const;
    void from_json(text const& source);

We no longer need a heap of overloads for `std::string` and `std::wstring`, and we do not need to migrate to "wide" string support — it is enough to replace the string references in the API with text, and we get Unicode automatically. In addition, we get excellent cross-platform behavior regardless of which library we chose as the transcoding engine, ICU4C or libiconv, because the internal representation when characters are unpacked is always UTF-32, and we are not tied to the platform-specific wchar_t.

In total: we get compatibility and interconversion with the standard types, which means simpler Unicode support on the side of code written in C++. After all, when writing high-level logic in C++, the last thing we want is problems with wchar_t characters and heaps of boilerplate code for processing and re-encoding text.

Since the transcoding itself is already implemented in ICU4C and libiconv, the algorithm of the text class's inner workings is quite simple. Dare — and perhaps tomorrow it will be your text-processing library that is used everywhere as a high-level abstraction for any textual data, from simple JSON from a client to complex text structures from various databases.


First published in Hacker Magazine # 191.
Author: Vladimir Qualab Kerimov, Lead C ++ Developer, Parallels


Source: https://habr.com/ru/post/257895/

