As part of my “work” on the standardization of C# 5 in the ECMA-334 TC49-TG2 technical group, I was lucky enough to see several interesting ways in which Vladimir Reshetnikov has put C# through its paces. This article describes one of the issues he raised. Of course, it will most likely never affect 99.999% of C# developers in any way... but it's still interesting to understand.
Specifications used in the article:
- the C# 5 language specification (ECMA-334)
- the CLI specification (ECMA-335)
- the Unicode Standard
What is a string?
How would you describe the string type (or System.String)? I can suggest several answers to this question, from vague to fairly specific:
- "any text in quotes"
- a character string
- a sequence of Unicode characters
- a sequence of 16-bit characters
- a sequence of UTF-16 code units
Only the last statement is completely true. The C# 5 specification (section 1.3) says:

Character and string processing in C# uses UTF-16. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.
So far, so good. But this is C#. What about IL? What is used there, and does it matter? It turns out that it does... Strings need to be declared in IL as constants, and the nature of that representation matters: not only the encoding, but also the interpretation of the encoded data. In particular, a sequence of UTF-16 code units cannot always be represented as a sequence of UTF-8 code units.
Everything is very bad(ly formed)
For example, take the string literal “X\uD800Y”. This is the string made up of the following UTF-16 code units:
- 0x0058 - 'X'
- 0xD800 - the first (high) part of a surrogate pair
- 0x0059 - 'Y'
This is a perfectly legal string; it is even a Unicode string according to the specification (section D80). But it is ill-formed (section D84), because the UTF-16 code unit 0xD800 does not correspond to any Unicode scalar value (section D76): surrogate code points are explicitly excluded from the list of scalar values.
For those hearing about surrogate pairs for the first time: UTF-16 uses only 16-bit code units and therefore cannot cover the full range of valid Unicode code points, which runs from U+0000 to U+10FFFF inclusive. To represent a character above U+FFFF in UTF-16, two code units are used: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF). A high surrogate on its own is therefore meaningless; it is a valid UTF-16 code unit, but it only acquires a meaning when followed by a low surrogate.
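To make this concrete, here is a minimal sketch of my own (not from the original article; the class name is just for illustration) showing how .NET represents a code point above U+FFFF as two code units:

using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+10000 is the first code point that needs a surrogate pair in UTF-16.
        string s = char.ConvertFromUtf32(0x10000);

        Console.WriteLine(s.Length);                     // 2 - two UTF-16 code units
        Console.WriteLine(((int) s[0]).ToString("x4"));  // d800 - high surrogate
        Console.WriteLine(((int) s[1]).ToString("x4"));  // dc00 - low surrogate

        // A high surrogate is a valid code unit, but not a Unicode scalar value.
        Console.WriteLine(char.IsHighSurrogate(s[0]));   // True
    }
}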
Show the code!
So how does all this relate to C#? Well, constants have to be represented somehow at the IL level. It turns out that there are two representations here: in most cases UTF-16 is used, but attribute constructor arguments use UTF-8.
Here is an example:
using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}
On .NET, the output of this program is as follows:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059
As you can see, the “constant” remained unchanged, but U+FFFD characters appeared in the attribute property value (U+FFFD is the Unicode replacement character, used to mark data that could not be decoded when converting binary data to text). Let's dig deeper and look at the IL that describes the attribute and the constant:
.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
    = ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value = bytearray (58 00 00 D8 59 00 )
The format of the constant (Value) is quite simple: it is little-endian UTF-16. The attribute format is described in section II.23.3 of the ECMA-335 specification. Let's break it down in detail:
- Prologue (01 00)
- Fixed arguments (for the chosen constructor)
  - 05 58 ED A0 80 59 (a single packed string)
    - 05 (the length, PackedLen, equal to 5)
    - 58 ED A0 80 59 (the UTF-8-encoded string value)
- Number of named arguments (00 00)
- The named arguments themselves (there are none)
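For illustration, here is a small sketch of my own (not from the original article) that walks through this exact blob by hand; the byte values come from the IL dump above, and the structure follows ECMA-335 II.23.3:

using System;
using System.Text;

class BlobDemo
{
    static void Main()
    {
        // The custom attribute blob from the IL dump above.
        byte[] blob = { 0x01, 0x00, 0x05, 0x58, 0xED, 0xA0, 0x80, 0x59, 0x00, 0x00 };

        // Prologue: the two-byte value 0x0001, stored little-endian.
        int prologue = blob[0] | (blob[1] << 8);

        // Fixed argument: a packed string - PackedLen (a single byte for short
        // strings), followed by that many UTF-8 bytes.
        int length = blob[2];
        byte[] payload = new byte[length];
        Array.Copy(blob, 3, payload, 0, length);

        // Number of named arguments: a two-byte count (zero here).
        int namedCount = blob[3 + length] | (blob[4 + length] << 8);

        Console.WriteLine("Prologue:    {0:x4}", prologue);
        Console.WriteLine("Payload:     {0}", BitConverter.ToString(payload));
        Console.WriteLine("Named count: {0}", namedCount);

        // Decoding the payload replaces the ill-formed ED A0 80 sequence
        // with U+FFFD replacement characters.
        Console.WriteLine("Decoded:     {0}", Encoding.UTF8.GetString(payload));
    }
}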
The most interesting part here is the "UTF-8 encoded string value". It is not valid UTF-8, because the string is ill-formed. The compiler took the high surrogate, found that it was not followed by a low surrogate, and simply encoded it the same way it would encode any other code unit in the range U+0800 to U+FFFF inclusive, producing the three bytes ED A0 80.
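As a quick sanity check (again my own sketch, not from the article), you can reproduce those bytes by applying the ordinary three-byte UTF-8 layout to the code unit 0xD800, something a conformant encoder would normally refuse to do:

using System;

class LoneSurrogateEncoding
{
    static void Main()
    {
        int codeUnit = 0xD800;

        // The standard three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx.
        // Applying it blindly to a surrogate code unit gives ED A0 80.
        byte b1 = (byte) (0xE0 | (codeUnit >> 12));
        byte b2 = (byte) (0x80 | ((codeUnit >> 6) & 0x3F));
        byte b3 = (byte) (0x80 | (codeUnit & 0x3F));

        Console.WriteLine("{0:X2} {1:X2} {2:X2}", b1, b2, b3);  // ED A0 80
    }
}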
It is worth noting that if we had a complete surrogate pair, UTF-8 would encode it as a single Unicode scalar value, using 4 bytes. For example, change the Value declaration to the following:
const string Value = "X\ud800\udc00Y";
In this case, at the IL level, we get the byte sequence 58 F0 90 80 80 59, where F0 90 80 80 is the UTF-8 representation of the code point U+10000. This string is well-formed, and its value in the attribute and in the constant would be identical.
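You can observe the same four bytes from managed code; this little sketch (mine, not from the article) simply round-trips the well-formed literal through Encoding.UTF8:

using System;
using System.Text;

class WellFormedPair
{
    static void Main()
    {
        // A well-formed string: the surrogate pair D800 DC00 encodes U+10000.
        string value = "X\ud800\udc00Y";

        byte[] utf8 = Encoding.UTF8.GetBytes(value);
        Console.WriteLine(BitConverter.ToString(utf8));             // 58-F0-90-80-80-59

        // Round-tripping through UTF-8 preserves the string exactly.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == value);  // True
    }
}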
However, in our original example, the constant's value is decoded without checking whether it is well-formed, while the attribute's value goes through an additional check that detects and replaces invalid sequences.
Encoding behavior
So which approach is the right one? According to the Unicode specification (conformance clause C10), both are:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

And at the same time:

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate on code units, without interpreting them as characters.
It is not entirely clear to me whether constant values and attribute argument values "purport to be in a Unicode character encoding form". In my experience, specifications almost never state whether a well-formed string is required or not.
In addition, System.Text.Encoding implementations can be configured to specify what happens when they are asked to encode or decode ill-formed data. For example:

Encoding.UTF8.GetBytes(Value)

returns the byte sequence 58 EF BF BD 59; in other words, it detects the invalid data and replaces it with U+FFFD, so decoding it back will work without problems. But:
new UTF8Encoding(true, true).GetBytes(Value)
throws an exception. The first constructor argument indicates whether to emit a BOM; the second indicates whether to throw on invalid data (the EncoderFallback and DecoderFallback properties can also be used to customize this behavior).
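Here is a small sketch of my own (not from the original article) that contrasts the two behaviors on the same ill-formed string:

using System;
using System.Text;

class FallbackDemo
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        // Default behavior: the lone surrogate is replaced with U+FFFD (EF BF BD).
        byte[] replaced = Encoding.UTF8.GetBytes(Value);
        Console.WriteLine(BitConverter.ToString(replaced));   // 58-EF-BF-BD-59

        // Strict behavior: throwOnInvalidBytes = true makes encoding throw instead.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetBytes(Value);
        }
        catch (EncoderFallbackException e)
        {
            Console.WriteLine("Encoding failed: " + e.Message);
        }
    }
}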
Language behavior
So should this code compile at all?
At the moment the language specification does not prohibit it, but specifications can be fixed :)
That said, both csc and Roslyn already prohibit ill-formed strings in some attributes, for example DllImportAttribute:
[DllImport(Value)]
static extern void Foo();
This code produces a compiler error if Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute
There may be other attributes with the same behavior; I'm not sure.
Given that an attribute argument value will not be decoded back to its original form when the attribute instance is created, it would be entirely reasonable to make this a compile-time error. (Unless, of course, the execution environment were changed so that it preserved the exact value of an ill-formed string.)
But what about the constant? Should it be valid? Could it make sense? Hardly in the form used in the example, but one can imagine a case where a string is meant to end with a high surrogate and later be concatenated with another string that starts with the matching low surrogate, producing a well-formed result. Of course, extreme caution is needed here: Unicode Technical Report #36 (Security Considerations) describes some quite alarming ways this can go wrong.
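As a sketch of that hypothetical scenario (my own illustration, not from the original article), two individually ill-formed fragments can combine into a well-formed string:

using System;
using System.Text;

class RecombineDemo
{
    static void Main()
    {
        // Each half is ill-formed on its own: a lone high or low surrogate.
        string first = "X\uD800";
        string second = "\uDC00Y";

        string combined = first + second;

        // The halves now form a proper surrogate pair, so the result is well-formed.
        Console.WriteLine(char.IsSurrogatePair(combined[1], combined[2]));  // True

        // A well-formed string survives a UTF-8 round trip unchanged.
        byte[] utf8 = Encoding.UTF8.GetBytes(combined);
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == combined);       // True
    }
}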
Implications of all this
One of the interesting aspects of all this is that "string encoding arithmetic" may not work the way you think:
// Bad code!
string SplitEncodeDecodeAndRecombine(string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);
}
You might think that nothing can go wrong here as long as nothing is null and splitPoint is within range. However, if the split point falls in the middle of a surrogate pair, things get very sad. There may also be additional problems due to things like normalization forms; most likely not, but at this point I'm not 100% sure of anything.
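To see it fail, here is a quick usage sketch (mine, not from the original article) that splits a well-formed string right in the middle of its surrogate pair:

using System;
using System.Linq;
using System.Text;

class SplitDemo
{
    static void Main()
    {
        string input = "X\ud800\udc00Y";            // well-formed: X, U+10000, Y

        // Splitting at index 2 lands between the high and low surrogate.
        string result = SplitEncodeDecodeAndRecombine(input, 2, Encoding.UTF8);

        Console.WriteLine(Dump(input));              // 0058 d800 dc00 0059
        Console.WriteLine(Dump(result));             // 0058 fffd fffd 0059 - data lost
    }

    static string SplitEncodeDecodeAndRecombine(string input, int splitPoint, Encoding encoding)
    {
        byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
        byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
        return encoding.GetString(firstPart) + encoding.GetString(secondPart);
    }

    static string Dump(string text)
    {
        return string.Join(" ", text.Select(c => ((uint) c).ToString("x4")));
    }
}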
If this example seems divorced from reality, imagine a large piece of text split across several network packets or several files; it doesn't matter which. You might think you were being careful by making sure the binary data is never split in the middle of a UTF-16 code unit, but even that will not save you. Eek.
It almost makes me want to give up text processing entirely. Floating-point numbers are a nightmare, dates and times... well, you know what I think of those. I wonder whether there are any projects that deal only with integers that are guaranteed never to overflow. If you have such a project, let me know!
Conclusion
Text is hard!
Translator's note:
I found a link to the original of this article in the post “Let's talk about the differences between Mono and MS.NET”. Thank you, DreamWalker! His blog, incidentally, also has a short note on how the same example behaves under Mono.