.NET, UTF-16 and regular expressions

One day I needed to check whether an XML name was valid. What could be easier? We look at the standard, which clearly describes which characters a name can start with and which it can continue with. Everything is simple and clear:

[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*

That is almost a ready-made regular expression. A little Ctrl+H and...

public const string NameStartCharPattern = @"\:|[A-Z]|_|[a-z]|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]|[\u10000-\uEFFFF]";
public const string NameCharPattern = NameStartCharPattern + @"|-|\.|[0-9]|\u00B7|[\u0300-\u036F]|[\u203F-\u2040]";
public const string NamePattern = @"(?:" + NameStartCharPattern + @")(?:" + NameCharPattern + @")*";

Writing a test ...

// Anchored so that the whole string, not just a substring, must be a Name.
Assert.That(Regex.IsMatch("4a", "^" + Patterns.NamePattern + "$"), Is.False);

Clean, simple, clear... and it failed!

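The root of the evil was the last component of the first line: [\u10000-\uEFFFF]. It catches practically every character, although it should not... Wait, how can it? In .NET a \u escape takes exactly four hexadecimal digits, so \u10000 is read as \u1000 followed by a literal 0, and the class ends up containing the range from 0 to \uEFFF. And anyway, we are in UTF-16, so isn't a character limited to two bytes?.. Or is it not?..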
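The behaviour of that character class can be checked in isolation. The following is only a minimal sketch: the class name EscapeParsingDemo and the helper Accepts are hypothetical and are not part of the original code.

```csharp
using System.Text.RegularExpressions;

public static class EscapeParsingDemo
{
    // The character class exactly as written in the first attempt,
    // anchored so that the input must be a single accepted character.
    public const string BrokenClass = @"^[\u10000-\uEFFFF]$";

    // True when the input is accepted by the class.
    // \u consumes four hex digits, so the class holds \u1000,
    // the range '0'..\uEFFF, and a literal 'F'.
    public static bool Accepts(string input) => Regex.IsMatch(input, BrokenClass);
}
```

With this sketch, Accepts("4") and Accepts("z") are true, because both characters fall between '0' and \uEFFF, while Accepts("!") is false, because '!' sorts below '0'.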

I urgently had to cure my own illiteracy in the field of encodings, and I present the results of that education here in brief. If these facts have long been familiar to you, feel free to skip the next paragraph.

It turns out that Unicode can encode far more than 65,536 characters. Unicode code points are divided into so-called planes, each with a capacity of 0x10000 code points, and the standard defines 17 of them. That number, so "crooked" from a programmer's point of view, is no accident: in effect we have one plane that is handled one way and 16 that are handled another way. The first, the Basic Multilingual Plane (BMP), contains the vast majority of the characters in use today. In UTF-16, every character from it is written as a single two-byte word whose value is simply the character code. The same plane reserves a special range, 0xD800-0xDFFF, of 2048 values called surrogates. These values never appear on their own in valid UTF-16, only in pairs: two words (two bytes each) encode a character from the remaining sixteen planes as follows. 0x10000 is subtracted from the character code, which leaves exactly a twenty-bit number. Those 20 bits are split 10 and 10 between the first and second word, which uses up all 2048 reserved codes. Because the first word carries the prefix 0b110110 (giving the values 0xD800-0xDBFF, the high or leading surrogate) and the second carries 0b110111 (0xDC00-0xDFFF, the low or trailing surrogate), the role of each word can be determined unambiguously, without looking at its context.
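The arithmetic described above can be sketched in a few lines. This is an illustration only: the class SurrogateDemo and the method ToSurrogatePair are hypothetical names. In real code the built-in char.ConvertFromUtf32 does the same job.

```csharp
using System;

public static class SurrogateDemo
{
    // Splits a supplementary code point (U+10000..U+10FFFF)
    // into a UTF-16 surrogate pair.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // a 20-bit number
        char high = (char)(0xD800 | (offset >> 10));   // top 10 bits, prefix 0b110110
        char low = (char)(0xDC00 | (offset & 0x3FF));  // bottom 10 bits, prefix 0b110111
        return (high, low);
    }
}
```

For example, ToSurrogatePair(0x1D11E) gives the pair 0xD834, 0xDD1E, which is the same two-word string that char.ConvertFromUtf32(0x1D11E) returns.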

... So what does .NET have to do with all this? The point is that although .NET provides tools for working with surrogates, the regular expression engine ignores them. Ignores them completely, that is: it treats each pair as two separate characters. As usual in such cases, I was not the first to find this problem. And, again as usual, Microsoft's verdict is Won't Fix.
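A small check makes this "pair of characters" behaviour visible. It is a sketch under stated assumptions: the class RegexSurrogateDemo and the helper CountDotMatches are hypothetical names introduced only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class RegexSurrogateDemo
{
    // U+1D11E (MUSICAL SYMBOL G CLEF): one code point, two UTF-16 words.
    public const string GClef = "\U0001D11E";

    // Counts how many separate "any character" matches the engine finds.
    public static int CountDotMatches(string input) =>
        Regex.Matches(input, ".", RegexOptions.Singleline).Count;
}
```

Here CountDotMatches(RegexSurrogateDemo.GClef) returns 2: the engine sees the high and the low surrogate as two independent characters, so a single "." never covers the whole symbol.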

So we have to live with it somehow. Calling a third-party engine through P/Invoke, as suggested in the bug report, is like shooting sparrows with a cannon. The second idea, to throw out support for these surrogates altogether, was tempting, but I decided not to give up... And then I suddenly realized that the bug could be used as a feature!

The group that has to deal with surrogates is very simple in our case: it allows any character from the first 14 supplementary planes and forbids the last two. In other words, it forbids a certain range of high surrogate values (0xEFFFF - 0x10000 = 0xDFFFF, whose top ten bits are 0x37F, so the last allowed high surrogate is 0xD800 + 0x37F = 0xDB7F), and we can replace our expression with the following:

[\u10000-\uEFFFF] -> (?:[\uD800-\uDB7F][\uDC00-\uDFFF])
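Put together, the patterns with this replacement might look like the sketch below. The class name XmlNamePatterns, the constant SupplementaryPattern and the helper IsValidName are assumptions made for illustration. The anchors \A and \z are added so that the whole string has to be a Name.

```csharp
using System.Text.RegularExpressions;

public static class XmlNamePatterns
{
    // Planes 1-14 (U+10000..U+EFFFF) written as a surrogate pair:
    // a high surrogate D800..DB7F followed by any low surrogate.
    private const string SupplementaryPattern = @"(?:[\uD800-\uDB7F][\uDC00-\uDFFF])";

    public const string NameStartCharPattern =
        @":|[A-Z]|_|[a-z]|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u02FF]" +
        @"|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]" +
        @"|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]" +
        "|" + SupplementaryPattern;

    public const string NameCharPattern =
        NameStartCharPattern + @"|-|\.|[0-9]|\u00B7|[\u0300-\u036F]|[\u203F-\u2040]";

    public const string NamePattern =
        @"(?:" + NameStartCharPattern + @")(?:" + NameCharPattern + @")*";

    // Hypothetical helper: true when the entire string is an XML Name.
    public static bool IsValidName(string candidate) =>
        Regex.IsMatch(candidate, @"\A" + NamePattern + @"\z");
}
```

With this sketch, IsValidName("4a") is false and IsValidName("a4") is true. A name containing U+10000 is accepted, while U+F0000 from plane 15 is rejected, because its high surrogate 0xDB80 lies just outside the allowed range.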

This method is not very universal, and specifying narrower character ranges with it would be terribly inconvenient, but it seemed beautiful to me, so I decided to share it with you.

Source: https://habr.com/ru/post/140585/
