📜 ⬆️ ⬇️

Non ASCII String Handling

If you know what ICU is, then you probably won't learn anything new from this post.

Sometimes we hear from comrades that during the interview they were asked to write code that would draw the line. And even in Cracking the Coding Interview is the second question in the tests. However, the correct, from my point of view, solution goes far beyond the library call or one while loop.

In the case when the transferred string is an ASCII string, the simplest approach will work. But everything becomes more interesting if the UTF-8 string can be passed to the input. There are two problems at once: variable width and modifying symbols .
To make it clearer, I will start with the tests that the program should pass (on the left is the input, on the right is the expected output):
const char* cases[][2] = { //  ascii- {"", ""}, {"a", "a"}, {"ab", "ba"}, {"a b", "b a"}, //   ,   UTF-8 {"\xd1\x84", "\xd1\x84"}, //      {"x\xd1\x84", "\xd1\x84x"}, {"y\xd1\x84z", "z\xd1\x84y"}, {"\xd1\x84\xd1\x85", "\xd1\x85\xd1\x84"}, //   ,     {"\xd0\x98\xcc\x86", "\xd0\x98\xcc\x86"}, //       {"i\xd0\x98\xcc\x86", "\xd0\x98\xcc\x86i"}, {"\xd0\x98\xcc\x86i", "i\xd0\x98\xcc\x86"}, {"\xd0\x98\xcc\x86\xd1\x84", "\xd1\x84\xd0\x98\xcc\x86"}, //  : z̆̈y {"z\xd0\x98\xcc\x86\xcc\x88y", "y\xd0\x98\xcc\x86\xcc\x88z"} }; 
As you can see, the solution requires something that knows that Unicode is “under the hood”. And the responsibility for this is usually placed on the ICU library. Therefore, we can take this note as an ICU review for those who, like me, are going to start using it.

Problem solving algorithm

To solve the problem, you need to do the following steps:
  1. decode the UTF-8 string into a structure independent of the encoding,
  2. divide this structure into subsequences so that all non-extended characters and only they are in the same subsequence with their base,
  3. rearrange the subsequences in places
  4. convert back to UTF-8.

Read data

The class icu :: UnicodeString is an abstraction over the representation of data in memory, allowing you to simply work with data as with a sequence of Unicode characters. It has many methods for decoding and encoding, in particular, I will need the following:
 // Decoding icu::UnicodeString s = icu::UnicodeString::fromUTF8(cases[test_case][0]); // Encoding std::string result; s.toUTF8String(result); 

Character separation

The icu :: BreakIterator class allows you to get the coordinates of the beginning of the next / previous extended_character / words / sentences in UnicodeString (this is how the text is broken up into words when searching a page in the Chromium browser and its derivatives). It should be noted that for proper operation the iterator must know the language of the text.
 // Initialize iterator UErrorCode ec = U_ZERO_ERROR; icu::Locale ru_locale = icu::Locale("ru"); std::unique_ptr<icu::BreakIterator> iter; iter.reset(icu::BreakIterator::createCharacterInstance(ru_locale, ec)); iter->setText(my_unicode_string); // Set it to the beginning of my_unicode_string and get next character's position iter->first(); int32_t next_char = iter->next(); // Or set it to the after-last-character position and get previous character position iter->last(); int32_t this_char = iter->previous(); 

Transferring Subsequences

After the phrase is divided into characters, you need to swap the subsequence. This is not quite easy to do, since they can be of different lengths. The option to put everything on the stack, and then reading and writing it seemed unsportsmanlike, and I came to a decision with two queues: the queue of those who want to sign up at the beginning (fills up as the characters bite off the end) and those who want to sign up at the end ( is filled in as biting off characters from the beginning). When the number of "bitten off", for example, from the beginning of the characters is not less than the length of the waiting in the hour subsequence, it frees up its place in the queue.
')

Conclusion

This concludes my summary of the ICU. Hope it was useful and interesting. Full source code with examples and tests can be found on github .

Source: https://habr.com/ru/post/222331/


All Articles