
Translator's notes
In this translation I have allowed myself a few anglicisms, such as "valid", "native" and "binary". I hope they cause no confusion.
Identifier is a term from the C# specification that covers everything you can refer to by name: the name of a class, the name of a variable, and so on.
Roslyn is a C# compiler written in C#; it was created to replace the existing csc.exe. I usually omit the word "compiler" in this text.
First, a few things you might not have heard about:
- Identifiers in C# may include Unicode escape sequences (such as \u1234).
- Identifiers in C# may include Unicode characters of the Cf category (other, format), but these characters are ignored when identifiers are compared for equality.
- The character "Mongolian vowel separator" (U+180E) belongs, depending on the Unicode version, either to the Cf category (other, format) or to the Zs category (separator, space).
- .NET stores its own list of Unicode categories, independent of the one in Win32.
- Roslyn is a .NET application and therefore uses the Unicode categories stored in the .NET tables. The native compiler (csc.exe) either uses the system (Win32) categories or carries its own copy of the Unicode tables.
- None of the Unicode character tables (neither .NET's nor Win32's) exactly follows any single version of the Unicode standard.
- Compilers may have bugs.
All of this leads to some problems ...
Vladimir is to blame
It all started with a discussion at an ECMA technical group meeting last week. We were looking at "normative references", and in particular at which version of the Unicode standard we should use. At the moment ECMA-335 (4th edition) uses Unicode 4.0, while Microsoft's C# 5 specification uses Unicode 3.0. I don't know for sure whether compiler developers actually pay attention to this. In my opinion, it would be better if ECMA and Microsoft did not pin a specific Unicode version in their specifications and instead let compiler developers use the latest Unicode version available. However, compilers would then have to ship with their own copy of the Unicode tables, which is a bit strange, in my opinion.
During our discussion, Vladimir Reshetnikov casually mentioned the "Mongolian vowel separator" (U+180E), which has had a rather troubled life. This character was added in Unicode 3.0.0 in the Cf category (other, format). Then, in Unicode 4.0.0, it was moved to the Zs category (separator, space), and in Unicode 6.3.0 it was returned to the Cf category again.
Of course, I couldn't help condemning such behavior. My initial goal was to show you code that would behave differently depending on the version of the Unicode tables the compiler uses. It turned out that things are actually a little more complicated. But to start with, let's assume a "hypothetical compiler" that contains no bugs and uses whatever Unicode version we want (which, strictly speaking, violates the current C# specification, but we'll set that subtlety aside).
Hypothetical example 1: valid or invalid
For simplicity, let's forget about UTF encodings for a while and use plain ASCII:
using System;

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        string\u180ex = "b";
        Console.WriteLine(stringx);
    }
}
If the compiler uses Unicode 6.3 or higher (or a version lower than 4.0), U+180E is considered a Cf character and is therefore allowed in an identifier. And if a character is allowed in an identifier, we can use an escape sequence instead of the character itself, and the compiler will happily process it. The identifier on the second line of this method is considered "identical" to stringx, so "b" is printed.
And what about a compiler that uses a Unicode version from 4.0 to 6.2 inclusive? In that case U+180E is considered a Zs character, which makes it whitespace. Whitespace is allowed in C# code, but not inside identifiers. And since this character is not permitted in an identifier and is not inside a character or string literal, using the escape sequence for it here is an error, so this code simply does not compile.
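To make the distinction clearer: the same escape sequence is always legal inside a string or character literal, whatever the character's category; it is only in identifiers that the category matters. Here is a minimal sketch (the class name is mine):

using System;

class LiteralVsIdentifier
{
    static void Main()
    {
        // Inside a string literal, \u180e is always fine: it simply produces the character.
        string s = "\u180e";
        Console.WriteLine((int)s[0]);   // prints 6158 (0x180E)

        // In an identifier, the escape is legal only if the character itself is allowed
        // there, which is exactly what depends on the compiler's Unicode category tables.
    }
}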
Hypothetical example 2: valid in two different ways
However, we can write the same code without using an escape sequence. To do this, create a regular ASCII file:
using System;

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        stringAAAx = "b";
        Console.WriteLine(stringx);
    }
}
Then open it in a hex editor and replace the AAA characters with the bytes E1 A0 8E. This gives a file containing the UTF-8 representation of U+180E in the same position where the escape sequence appeared in the first example.
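If you would rather not use a hex editor, the same file can be produced programmatically. Here is a minimal sketch (the generator class name and the output file name MvsTest.cs are my own choices); the \u180e escape inside the generator's string literal produces the actual character, which is then written to disk as the bytes E1 A0 8E:

using System.IO;
using System.Text;

class MakeMvsSource
{
    static void Main()
    {
        string source =
            "class MvsTest\n" +
            "{\n" +
            "    static void Main()\n" +
            "    {\n" +
            "        string stringx = \"a\";\n" +
            "        string\u180ex = \"b\";\n" +
            "        System.Console.WriteLine(stringx);\n" +
            "    }\n" +
            "}\n";
        // new UTF8Encoding(false) writes UTF-8 without a byte order mark.
        File.WriteAllText("MvsTest.cs", source, new UTF8Encoding(false));
    }
}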
A compiler that successfully compiled the first example will also compile this variant (assuming you managed to tell it that the file is encoded in UTF-8), and the result will be exactly the same: "b" is printed, because the second statement in the method is a simple assignment to the existing variable.
However, even a compiler that treats U+180E as whitespace (that is, one that refuses to compile the program from example 1) has no problem with this variant: it will treat the second statement in the method as declaring a new local variable x and assigning it an initial value. You may get a compiler warning about an unused local variable, but the code compiles and "a" is printed.
Reality: Microsoft Compilers
When we talk about the Microsoft C # compiler, we need to distinguish between the native compiler (csc.exe) and Roslyn (rcsc, although I usually call it simply Roslyn).
Since csc.exe is written in native code, it either uses the Unicode facilities built into Windows or simply embeds a table of Unicode character data in its executable. (I scoured MSDN looking for a native Win32 function that would tell me which Unicode category a character belongs to, but found nothing. A pity; such a function would be very useful...)
Roslyn, on the other hand, is written in C# and (as far as I know) uses the char.GetUnicodeCategory() method, which relies on the tables built into mscorlib.dll, to determine Unicode categories.
My experiments suggest that, whatever the native compiler uses to determine the category, it always treats U+180E as a Cf character. I at least tried to find old machines (including VM images) that had not received any updates since September 2013 (when the Unicode 6.3 standard was published), and all of them compiled the program from the first example without errors. I'm starting to suspect that csc.exe probably has a copy of the Unicode 3.0 tables built into the binary: it definitely treats U+180E as a formatting character, but it "dislikes" the characters U+0600 and U+00AD in identifiers (U+0600 did not appear until Unicode 4.0, but has always been a formatting character; U+00AD was a punctuation character (dash) in Unicode 3.0, but has been a formatting character since Unicode 4.0).
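To reproduce this, you can feed the native compiler a tiny probe like the one below (the class and variable names are mine). Based on the observations above, csc.exe accepts the identifier containing U+180E, while uncommenting either of the other two declarations makes it refuse to compile:

class CategoryProbe
{
    static void Main()
    {
        // Accepted by csc.exe: U+180E is treated as Cf, so it is allowed in the
        // identifier and ignored in comparisons, and "ab" below refers to the same variable.
        int a\u180eb = 1;

        // Rejected by csc.exe: U+00AD is Pd (dash punctuation) in Unicode 3.0.
        // int c\u00add = 2;

        // Rejected by csc.exe: U+0600 does not exist before Unicode 4.0.
        // int e\u0600f = 3;

        System.Console.WriteLine(ab);
    }
}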
However, the table embedded in mscorlib.dll has definitely changed between versions of the .NET Framework. If you run this program:
using System;
class Test
{
    static void Main()
    {
        Console.WriteLine(Environment.Version);
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}
then under CLRv2 it prints "SpaceSeparator", while under CLRv4 (at least on a recently updated system) it prints "Format".
Of course, Roslyn will not run on older versions of the CLR. However, there is still hope in the form of csharppad.com, which runs Roslyn in some sort of environment (of unknown origin, maybe Mono? I'm not sure) in which "SpaceSeparator" is printed. I am sure the program from the first example would not compile there. The second example is harder to check: csharppad.com does not allow uploading a source file, and copy/paste gives strange results.
Reality: mcs (the Mono C# compiler)
The Mono compiler also uses the char.GetUnicodeCategory() method, which makes our experiments much easier, but unfortunately the Mono parser has at least two bugs:
- It allows any escape sequence in an identifier, regardless of whether the escaped character is valid in an identifier or not. For example, from the Mono compiler's point of view, string\u0020x = ""; is valid. Reported as bug 24968.
- It does not allow formatting characters within identifiers: it accepts characters from the categories Mn, Mc, Nd and Pc, but not Cf. Reported as bug 24969.
For this reason, the program from the first example always compiles and prints "b", while the program from the second example produces a compilation error regardless of which category (Zs or Cf) the compiler thinks U+180E belongs to.
So what version is it?
Next, let's look at the Unicode tables in .NET itself, since it is not entirely clear which Unicode version the various BCL implementations use. Run this program:
using System;
class Test
{
    static void Main()
    {
        Console.WriteLine(char.GetUnicodeCategory('\u00ad'));
        Console.WriteLine(char.GetUnicodeCategory('\u0600'));
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}
On my machine, this program prints "DashPunctuation, Format, Format" under CLRv4, and "DashPunctuation, Format, SpaceSeparator" under Mono (3.3.0) and CLRv2.
This is odd, to say the least. As far as I can tell, this behavior does not correspond to any version of the Unicode standard:
- U+00AD was a Po character (punctuation, other) in Unicode 1.x, then Pd (punctuation, dash) in 2.x and 3.x, and has been a Cf character since Unicode 4.0.
- U+0600 was first introduced in Unicode 4.0 and has always been a Cf character.
- U+180E was introduced as a Cf character in Unicode 3.0, became a Zs character in Unicode 4.0, and finally returned to the Cf category in Unicode 6.3.
Thus, no version of the Unicode standard matches the first or the third line of output. Now I'm really confused...
What about nameof and CallerMemberName?
Identifiers are not only used for comparison; they are also available as strings, without any use of reflection. Since C# 5 we have had the CallerMemberName attribute, which lets us do things like this:
public static void X\u0600y()
{
    ShowCaller();
}

public static void ShowCaller([CallerMemberName] string caller = null)
{
    Console.WriteLine("Called by {0}", caller);
}
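As an aside, this fragment is not self-contained; one way to wrap it into a runnable program is sketched below (the class name and Main method are mine, and using System.Runtime.CompilerServices is needed for the attribute). As discussed just below, it prints the caller's name with the formatting character stripped out:

using System;
using System.Runtime.CompilerServices;

class CallerDemo
{
    static void Main()
    {
        X\u0600y();
    }

    public static void X\u0600y()
    {
        ShowCaller();
    }

    public static void ShowCaller([CallerMemberName] string caller = null)
    {
        // Prints "Called by Xy": the U+0600 formatting character is not part of the reported name.
        Console.WriteLine("Called by {0}", caller);
    }
}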
And in C # 6 we can write this:
string x\u0600y = "";
Console.WriteLine("nameof = {0}", nameof(x\u0600y));
What will these two examples print? They simply print "Xy" and "xy" as the names, as if the compiler had thrown away the formatting character entirely. But what should they print? Bear in mind that in the second case we could simply have written nameof(xy), and that string would still match the declared identifier.
We can't even ask "what is the name of the declared member?", because it can be overloaded with a "different, yet equal" identifier:
public static void Xy() {}
public static void X\u0600y() {}
public static void X\u070fy() {}
...
Console.WriteLine(nameof(X\u200by));
What should be printed here? I'm sure you'll be relieved to know that the C# team has a plan for this, but it really is one of those scenarios where "there is no obviously right answer". Things get even stranger when the CLI specification comes into play. Section I.8.5.1 of ECMA-335 (6th edition) says:
Assemblies shall follow Annex 7 of Technical Report 15 of the Unicode Standard 3.0 governing the set of characters permitted to start and be included in identifiers, available online at www.unicode.org/unicode/reports/tr15/tr15-18.html. Identifiers shall be in the canonical format defined by Unicode Normalization Form C. For CLS purposes, two identifiers are the same if their lowercase mappings (as specified by the Unicode locale-insensitive, one-to-one lowercase mappings) are the same. That is, for two identifiers to be considered different under the CLS they shall differ in more than simply their case. However, in order to override an inherited definition the CLI requires the precise encoding of the original declaration be used.
I would love to explore the impact of this by adding a Cf character in IL, but unfortunately I haven't yet been able to work out how to affect the encoding used by ilasm in order to persuade it that my "modified" IL is what I want it to be.
Conclusion
As mentioned earlier, text is complex. It turns out that even if you limit yourself to identifiers alone, "text is complex". Who would have thought?
From the translator: I would like to thank user impwx for translating the previous post by Jon Skeet.