
Encodings


Sooner or later everyone has to deal with different encodings. After noticing a variety of approaches to these problems in my team's code, some of them strange, I had to hold an explanatory talk. Below I share my view of how to work correctly with non-ASCII characters in code. Constructive criticism is welcome.



Principle of operation


The logic of working with different encodings in C++ is simple and transparent. In outline, the scheme is this: the program works in a single internal encoding, and correctly localized streams are responsible for converting data from the external encoding to the internal one and back. The internal encoding is best fixed once and for all. If the program works with non-ASCII characters, the most logical choice of internal encoding is Unicode. Using UTF-8 with char as the character type for the STL is usually unjustified (although there are situations where it is necessary); it makes more sense to switch to wide characters, wchar_t, and use UCS-2. The codecvt facet is responsible for converting data between the external and the internal encoding. Localized streams call the appropriate facet functions themselves when they read or write data (I wrote about facets in an earlier post).
I will illustrate this with a commented example in which we read data from a cp1251 file, show how boost::xpressive works with Unicode, and print Cyrillic text through wcout in cp866 (the Windows console default).
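
Before the custom facets, the same principle can be sketched with a facet that ships with the standard library. This is only an illustration under stated assumptions: std::codecvt_utf8 stands in for a hand-written facet (it is not one of the facets used in this article and is deprecated since C++17), and the function names write_utf8 and read_utf8_line are made up for the sketch. The program itself only ever sees wide characters; the imbued stream does the conversion.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Writes a wide string to a file; the imbued facet encodes it as UTF-8.
void write_utf8(const char* path, const std::wstring& text)
{
    std::wofstream out(path);
    // The locale takes ownership of the facet created with new.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out << text;
}

// Reads the first line of a UTF-8 file; the imbued facet decodes it
// into wide characters as the stream pulls data from the file buffer.
std::wstring read_utf8_line(const char* path)
{
    std::wifstream in(path);
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    std::wstring line;
    std::getline(in, line);
    return line;
}
```

Neither function converts anything by hand: swapping the facet in the locale is all it takes to change the external encoding.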


Source encoding


Before turning to the example, a few words (just in case) about the encoding of the program's source code. I keep all my sources in UTF-8 (adding a BOM to files that contain wide string constants with non-ASCII characters), and I advise everyone to do the same. Modern compilers themselves convert wide literals, marked in the source as L"", to UCS-2 (or UCS-4). Obviously, correct conversion depends on the source encoding. gcc assumes by default that the source is UTF-8; to convince it otherwise you have to set the -finput-charset option explicitly. The MS compiler needs a little help: add a BOM (Byte Order Mark) to the UTF-8 file. Unfortunately, Borland C++ Compiler 5.5 has problems with UTF-8.
For those about to throw stones at me, two points. First, I find code written in "Unicode escape" form inconvenient to read:
std::wstring wstr(L"\u0410\u0411\u0412\u0413\u0413");
Second, non-ASCII text is not only a matter of the user interface, so moving all wide string constants into a separate module and handling them there (as gettext does) is not an option.
So it is settled: sources in UTF-8 with a BOM. In case anyone does not know, vim adds a BOM to a file with the command “set bomb”. If the file already has a BOM, vim keeps it.
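
A quick way to check that the compiler decoded the source file correctly is to dump the code points of a wide literal. The helper below is a small sketch for that purpose; the name code_points is invented here and is not part of any library.

```cpp
#include <cstdio>
#include <string>

// Formats every element of a wide string as U+XXXX, separated by spaces.
std::string code_points(const std::wstring& text)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        char buf[24];
        std::snprintf(buf, sizeof buf, "U+%04lX",
                      static_cast<unsigned long>(text[i]));
        if (i != 0)
            out += ' ';
        out += buf;
    }
    return out;
}
```

For the escaped literal shown earlier, code_points(L"\u0410\u0411\u0412\u0413\u0413) should give "U+0410 U+0411 U+0412 U+0413 U+0413". If the same literal typed directly as L"АБВГГ" produces a different dump, the compiler has misread the source encoding.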

An example of working with different encodings


Now for the most interesting part. As I said, the code is simple and straightforward. One note about the standard streams: by default they are synchronized with stdio, and in that mode their facets are not used. You have to call sync_with_stdio(false).

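    #include <boost/xpressive/xpressive.hpp>
    #include <locale>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include "codecvt_cp866.hpp"
    #include "codecvt_cp1251.hpp"
    #include "unicyr_ctype.hpp"

    using namespace std;
    using namespace boost::xpressive;

    int main()
    {
        // Create input.txt in cp1251 containing the string
        // "Привет, мир!"
        ofstream ofile("input.txt", std::ios::binary);
        ostreambuf_iterator<char> writer(ofile);
        writer = 0xCF;   // П
        ++writer = 0xF0; // р
        ++writer = 0xE8; // и
        ++writer = 0xE2; // в
        ++writer = 0xE5; // е
        ++writer = 0xF2; // т
        ++writer = 0x2C; // ,
        ++writer = 0x20; // space
        ++writer = 0xEC; // м
        ++writer = 0xE8; // и
        ++writer = 0xF0; // р
        ++writer = 0x21; // !
        ofile.close();

        // Read the file back through a wide stream
        wifstream ifile("input.txt");
        // Locale with the cp1251 codecvt facet
        locale cp1251(locale(""), new codecvt_cp1251<wchar_t, char, mbstate_t>);
        ifile.imbue(cp1251);
        wchar_t wstr[14];
        ifile.getline(wstr, 13);

        // By default the standard C++ streams cout, cin, cerr and clog are
        // synchronized with stdio, and in that mode the locale's codecvt
        // facet is not applied (the note concerns gcc and msvc 7).
        // Tell ios not to synchronize them with stdio.
        ios_base::sync_with_stdio(false);

        // Locale with the cp866 codecvt facet for console output
        locale cp866(locale(""), new codecvt_cp866<wchar_t, char, mbstate_t>);
        // From now on, wide characters written to wcout are converted
        // to cp866
        wcout.imbue(cp866);

        // Locale with the Cyrillic ctype facet
        locale cyrr(locale(""), new unicyr_ctype);
        wsregex_compiler xpr_compiler;
        xpr_compiler.imbue(cyrr);
        // Lowercase pattern against the capitalized word in the input, so a
        // match requires icase plus the Cyrillic ctype facet (the pattern
        // literal is reconstructed; any case variant of an input word works)
        wsregex xpr = xpr_compiler.compile(L"привет", regex_constants::icase);
        wsmatch match;
        // Keep the string in a named variable so that the iterators stored
        // in match stay valid
        const wstring line(wstr);
        if(regex_search(line, match, xpr))
            wcout << L"icase работает" << endl;    // "icase works"
        else
            wcout << L"icase не работает" << endl; // "icase does not work"
        return 0;
    }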


Codecvt facet to convert Cyrillic from cp1251 to ucs-2 and back

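    #include <cstddef>
    #include <locale>
    #include <map>

    /**@brief codecvt facet that converts Cyrillic text from cp1251 to UCS-2
     *
     * cp1251 is a stateless single-byte encoding, so the State argument is
     * never used. Only the input direction (do_in) is implemented in this
     * listing; do_out, do_length and do_max_length are left as declarations
     * in the comment at the end of the protected section. */
    template<class I, class E, class State>
    class codecvt_cp1251 : public std::codecvt<I, E, State>
    {
    public:
        // Result codes, repeated here for brevity
        typedef typename std::codecvt_base::result result;
        const result ok,      // conversion succeeded
                     partial, // only part of the input was converted
                     error,   // a character could not be converted
                     noconv;  // no conversion was needed

        explicit codecvt_cp1251(std::size_t r = 0)
            : std::codecvt<I, E, State>(r),
              ok(std::codecvt_base::ok),
              partial(std::codecvt_base::partial),
              error(std::codecvt_base::error),
              noconv(std::codecvt_base::noconv)
        {
            // Characters outside the contiguous letter range go into tables
            in_tab[static_cast<E>(0xA8)] = 0x401; // Ё
            out_tab[0x401] = static_cast<E>(0xA8);
            in_tab[static_cast<E>(0xB8)] = 0x451; // ё
            out_tab[0x451] = static_cast<E>(0xB8);
            // ... remaining table entries
        }

        ~codecvt_cp1251() { }

    protected:
        /**@brief Converts the external characters in [from, from_end) and
         * writes the internal characters to [to, to_end).*/
        virtual result do_in(State&, const E* from, const E* from_end,
                             const E*& from_next, I* to, I* to_end,
                             I*& to_next) const
        {
            while(from != from_end)
            {
                if(to == to_end)
                {
                    // Output buffer is full; *from has not been consumed
                    from_next = from;
                    to_next = to;
                    return partial;
                }
                const unsigned char c = static_cast<unsigned char>(*from);
                if(c <= 0x7F)
                {
                    // ASCII maps to itself
                    *to = static_cast<I>(c);
                }
                else if(c >= 0xC0)
                {
                    // 0xC0..0xFF are the letters U+0410..U+044F
                    *to = static_cast<I>(c + 0x350);
                }
                else
                {
                    // Everything else needs an exact match in the table
                    typename std::map<E, I>::const_iterator s = in_tab.find(*from);
                    if(s == in_tab.end())
                    {
                        // No mapping: both pointers stay at the character
                        // that could not be converted
                        from_next = from;
                        to_next = to;
                        return error;
                    }
                    *to = s->second;
                }
                ++to;
                ++from;
            }
            from_next = from_end;
            to_next = to;
            return ok;
        }

        /**@brief Every external character maps to exactly one internal
         * character, so return 1*/
        virtual int do_encoding() const throw()
        {
            return 1;
        }

        /**@brief A conversion is always required, so return false*/
        virtual bool do_always_noconv() const throw()
        {
            return false;
        }

        /* Not shown here:
        virtual result do_out(State&, const I* from, const I* from_end,
                              const I*& from_next, E* to, E* to_end,
                              E*& to_next) const;
        virtual int do_length(State& s, const E* from, const E* from_end,
                              std::size_t max) const;
        virtual int do_max_length() const throw();
        */

    private:
        // Lookup tables for characters outside the contiguous ranges
        std::map<E, I> in_tab;
        std::map<I, E> out_tab;
    };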
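
The listing leaves the reverse direction (do_out) as a declaration only. As a rough sketch of what that mapping involves, here is the same arithmetic run backwards as free functions. This is not the author's implementation from the repository: the names ucs2_to_cp1251 and to_cp1251 are invented, only ASCII, the letters U+0410..U+044F and Ё/ё are covered, and replacing unknown characters with '?' is a choice made for this example.

```cpp
#include <string>

// Converts one UCS-2 code point to a cp1251 byte.
// Returns false when this sketch has no mapping for the character.
bool ucs2_to_cp1251(wchar_t wc, char& out)
{
    const unsigned long cp = static_cast<unsigned long>(wc);
    unsigned char byte = 0;
    if (cp <= 0x7F)
        byte = static_cast<unsigned char>(cp);          // ASCII
    else if (cp >= 0x410 && cp <= 0x44F)
        byte = static_cast<unsigned char>(cp - 0x350);  // contiguous letters
    else if (cp == 0x401)
        byte = 0xA8;                                    // Ё
    else if (cp == 0x451)
        byte = 0xB8;                                    // ё
    else
        return false;
    out = static_cast<char>(byte);
    return true;
}

// Encodes a whole string; unmappable characters become '?'
// (an assumption of this sketch, a real facet would return error).
std::string to_cp1251(const std::wstring& text)
{
    std::string result;
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        char byte = '?';
        ucs2_to_cp1251(text[i], byte);
        result += byte;
    }
    return result;
}
```

Applied to the wide string "Привет, мир!", to_cp1251 yields the same twelve bytes that the example program writes to input.txt.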


Notes


I did not clutter the article with extra code: the codecvt facet for cp866 is implemented in the same way, and I have written about ctype before. If anyone needs a working example, these facets are available on github: git://github.com/hoxnox/cyrillic-facets.git.
I also apologize for the lack of line numbers: to get the UFO to "swallow" the post, I had to cut back the code highlighting.

Read more about Unicode here.
More about facets: Stroustrup, The C++ Programming Language, third special edition, appendix.
More about boost::xpressive here.

UPD 20150813:
Habr user Cyapa investigated and found that the codecvt facet is only used by streams that work with basic_filebuf. According to the standard (clause 27.9.1.1; Stroustrup notes this as well), implementations are not required to call the codecvt facet methods for other buffers, in particular basic_stringbuf. So if you create a locale with your own codecvt facet and expect a std::stringstream imbued with that locale to use the facet, you are mistaken.
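
This is easy to check with a facet that counts its own calls. The sketch below is an assumption-laden probe rather than anything from the original investigation: counting_codecvt, calls_via_stringstream and calls_via_filestream are names invented here, and the facet simply widens each byte to a wchar_t.

```cpp
#include <cwchar>    // std::mbstate_t
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical probe facet: widens bytes one-to-one and counts do_in calls.
class counting_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit counting_codecvt(int* calls) : calls_(calls) {}
protected:
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override
    {
        ++*calls_;
        while (from != from_end && to != to_end)
            *to++ = static_cast<wchar_t>(static_cast<unsigned char>(*from++));
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }
    int do_encoding() const noexcept override { return 1; }
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }
private:
    int* calls_;  // counter owned by the caller
};

// Number of do_in calls made while reading a word from a string stream.
int calls_via_stringstream()
{
    int calls = 0;
    std::wistringstream in(L"abc");
    in.imbue(std::locale(std::locale::classic(), new counting_codecvt(&calls)));
    std::wstring word;
    in >> word;
    return calls;
}

// Number of do_in calls made while reading a word from a file stream.
int calls_via_filestream(const char* path)
{
    { std::ofstream out(path, std::ios::binary); out << "abc"; }
    int calls = 0;
    std::wifstream in(path);
    in.imbue(std::locale(std::locale::classic(), new counting_codecvt(&calls)));
    std::wstring word;
    in >> word;
    return calls;
}
```

With the common standard library implementations, the string stream version should report zero calls, while the file stream version reports at least one.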

Source: https://habr.com/ru/post/107679/

