Cross-platform work with strings in C++

Not long ago, I was puzzling over cross-platform string handling in C++ applications. The task, roughly speaking, was a case-insensitive substring search in any encoding on any platform.

So, the first thing I had to understand was that on Linux you work with strings in UTF-8 using the std::string type, while on Windows strings should be in UTF-16LE (the std::wstring type). Why? Because that is how the operating systems are designed. Storing strings in std::wstring on Linux is extremely expensive, since a single wchar_t takes 4 bytes there (on Windows, 2 bytes), and working with std::string on Windows was only necessary back in the days of Windows 98. To work with strings, we define our own platform-independent type:

    #include <string>

    #ifdef _WIN32
    typedef std::wstring mstring;
    #else
    typedef std::string mstring;
    #endif // _WIN32
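The storage argument above can be made concrete with a small sketch. The helper names narrow_bytes and wide_bytes are hypothetical (they come from no library) and only count the bytes taken by the character data, not the terminator or any allocator overhead:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only.
// Bytes occupied by the character data of a narrow string.
std::size_t narrow_bytes(const std::string& s)
{
    return s.size() * sizeof(char);
}

// Bytes occupied by the character data of a wide string.
// sizeof(wchar_t) is platform-dependent: typically 4 on Linux, 2 on Windows.
std::size_t wide_bytes(const std::wstring& s)
{
    return s.size() * sizeof(wchar_t);
}
```

For the five-letter ASCII word "hello", narrow_bytes returns 5, while wide_bytes returns 5 * sizeof(wchar_t), which is 20 on a typical Linux build and 10 on Windows.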

The second task is converting text from any encoding to the mstring type. There are not many options. The first is to use std::locale and the related standard facilities. What immediately stood out was the need to find the matching locale for every charset (for example, the "Windows-1251" encoding corresponds to the Russian_Russia.1251 locale, and so on). I did not find such a table in the standard library (maybe I did not look hard enough?), and I did not want to hunt for a third-party add-on with a list of locales. And in general, working with locales in C++ is a very non-obvious thing, in my opinion.

The forums advised using the libiconv or ICU library. libiconv looked lightweight and simple and handled conversion from any charset to mstring perfectly, but when it came to converting an mstring to lowercase, I hit a wall. It turned out that libiconv cannot do this, and I did not manage to find a simple and elegant way to lowercase a UTF-8 string on Linux. So the choice fell on ICU, which handled both tasks (conversion and lowercasing) with honor. Platform-independent transcoding with the ICU library looks like this:

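    #include <string>
    #include <vector>
    #include <unicode/ucnv.h>
    #include <unicode/unistr.h>

    // Converts source_str from the given charset to UTF-8, optionally lowercasing it.
    // Returns an empty string on any error, including a too-small target buffer.
    std::string to_utf8(const std::string& source_str, const std::string& charset, bool lowercase)
    {
        if (source_str.empty())
            return std::string();

        const int32_t srclen = static_cast<int32_t>(source_str.size());
        std::vector<UChar> target(srclen);

        UErrorCode status = U_ZERO_ERROR;
        UConverter* conv = ucnv_open(charset.c_str(), &status);
        if (U_FAILURE(status))
            return std::string();

        int32_t len = ucnv_toUChars(conv, target.data(), srclen,
                                    source_str.data(), srclen, &status);
        ucnv_close(conv); // close before the check so the converter is not leaked on error
        if (U_FAILURE(status))
            return std::string();

        icu::UnicodeString ustr(target.data(), len);
        if (lowercase)
            ustr.toLower();

        std::string retval;
        ustr.toUTF8String(retval);
        return retval;
    }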
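To tie this back to the original task: once both the text and the pattern have gone through to_utf8 with lowercase set to true, the search itself can be a plain byte-wise find. In valid UTF-8 a complete character sequence cannot match starting in the middle of another character, so no decoding is needed at this step. A minimal sketch, where contains_lowered is a hypothetical name and both arguments are assumed to be valid, already lowercased UTF-8:

```cpp
#include <string>

// Hypothetical helper for illustration only.
// Assumes haystack and needle are valid UTF-8 that has already been lowercased
// (for example by to_utf8(..., true)); it does no case handling of its own.
// An empty needle is treated as found, following std::string::find.
bool contains_lowered(const std::string& haystack, const std::string& needle)
{
    return haystack.find(needle) != std::string::npos;
}
```

The case-insensitivity lives entirely in the conversion step, which keeps the search independent of the source encoding.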

I will not describe working with Unicode on Windows: everything is well documented there.

Source: https://habr.com/ru/post/145187/
