Strings in C# and .NET

From the translator: Jon Skeet has written several articles about strings, and this is the first one I decided to translate. Next I plan to translate his article on string concatenation, and then the one on Unicode in .NET.

The System.String type (aliased as string in C#) is one of the most commonly used and most important types in .NET, and also one of the most misunderstood. This article covers the basics of the type and debunks the myths and misunderstandings surrounding it.

So what is a string?

A string in .NET (from here on simply string; I will not spell out System.String every time) is a sequence of characters. Each character is a Unicode character in the range U+0000 to U+FFFF (more on this later). The string type has the following characteristics:

A string is a reference type

There is a common misconception that string is a value type. This misconception stems from the immutability of strings (see the next point), since to an inexperienced programmer an immutable type often appears to behave like a value type. Nevertheless, string is a reference type, with all the characteristics of one. I describe the differences between reference types and value types in more detail in my articles "Parameter passing in C#" and "Memory in .NET - what goes where".

A string is immutable

The contents of a string cannot be changed once it has been created, at least not in safe code and without reflection. So when you "change" a string, you are not changing the string itself but the value of the variable that refers to it. For example, the code s = s.Replace("foo", "bar"); does not change the contents of the string that s referred to before the call to Replace; it simply reassigns s to a newly created string, which is a copy of the old one with every occurrence of "foo" replaced by "bar".
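The reassignment described above can be observed by keeping a second reference to the original string. This is only a minimal sketch; the class and method names (ImmutabilityDemo, ReplaceFoo) are made up for illustration.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Returns both the original string and the result of Replace,
    // so the caller can see that the original was not modified.
    public static (string Original, string Replaced) ReplaceFoo(string s)
    {
        string original = s;            // second reference to the same string object
        s = s.Replace("foo", "bar");    // s now refers to a newly created string
        return (original, s);
    }

    public static void Main()
    {
        var (original, replaced) = ReplaceFoo("foo and more foo");
        Console.WriteLine(original);    // foo and more foo
        Console.WriteLine(replaced);    // bar and more bar
    }
}
```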

A string can contain null characters

In C, strings are sequences of characters terminated by '\0', also called "nul" or "null". I call it "null" because that is the name the character '\0' has in the Unicode code charts. Do not confuse the "null" character with the null keyword in C#: System.Char is a value type, so it can never be null. In .NET, strings can contain the "null" character anywhere and work with it without any problems. However, some classes (in Windows Forms, for example) may treat a "null" character as the end of the string and ignore everything after it, so using "null" characters can still cause problems.
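As a small illustration of the point that a "null" character is just another character to .NET, the sketch below builds a string with an embedded '\0' and inspects it. The names NullCharDemo and WithEmbeddedNull are hypothetical.

```csharp
using System;

public static class NullCharDemo
{
    // A seven-character string whose fourth character is U+0000.
    public static string WithEmbeddedNull() => "abc\0def";

    public static void Main()
    {
        string s = WithEmbeddedNull();
        Console.WriteLine(s.Length);          // 7: the null character is counted
        Console.WriteLine(s.IndexOf('\0'));   // 3
        Console.WriteLine(s.Substring(4));    // def: the text after the null is intact
    }
}
```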

A string overloads the equality operator ==

When you use the == operator to compare two strings, the Equals method is called, which compares the contents of the strings rather than the references. For example, the expression "hello".Substring(0, 4) == "hell" returns true, even though the references on the two sides of the operator are different (they refer to two different string instances that happen to contain the same characters). Keep in mind, however, that the comparison is by value rather than by reference only when both operands have the compile-time type string; operator overloads are not applied polymorphically. If either operand has the compile-time type object (even though at run time it is still a string), the references are compared, not the contents of the strings.
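The difference between the two compile-time types can be sketched as follows. EqualityDemo and its methods are illustrative names only; the input is passed as a parameter so that Substring really does produce a new string object at run time.

```csharp
using System;

public static class EqualityDemo
{
    // Both operands have compile-time type string: contents are compared.
    public static bool CompareAsStrings(string source)
    {
        string a = source.Substring(0, 4);   // a new string object
        string b = "hell";
        return a == b;
    }

    // Both operands have compile-time type object: references are compared.
    public static bool CompareAsObjects(string source)
    {
        object a = source.Substring(0, 4);
        object b = "hell";
        return a == b;
    }

    public static void Main()
    {
        Console.WriteLine(CompareAsStrings("hello"));   // True
        Console.WriteLine(CompareAsObjects("hello"));   // False
    }
}
```

If a content comparison is needed while holding object references, calling Equals (or the static object.Equals) still compares the characters.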

Interning

In .NET there is the concept of an "intern pool". At its core this is just a set of strings, but it ensures that when you use strings with the same content in different places in your program, that content is stored only once rather than being created anew each time. The intern pool is probably language-dependent, but it certainly exists in C# and VB.NET, and I would be very surprised to see a .NET language that does not use it; in MSIL the intern pool is very easy to use, much easier than avoiding it. As well as string literals being interned automatically, strings can be interned manually with the Intern method, and you can check whether a string is already interned with the IsInterned method. IsInterned is not intuitive: you would expect it to return a Boolean, but it does not. If an equal string already exists in the intern pool, the method returns a reference to that string; otherwise it returns null. Similarly, the Intern method returns a reference to an interned string: either the string you passed in, if it was already in the pool or has just been added to it, or an equal string that was already in the pool.
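The behaviour of Intern and IsInterned can be tried out with a string built at run time. The sketch below uses a GUID as content on the assumption that no equal string is already in the pool; InternDemo and Run are hypothetical names.

```csharp
using System;

public static class InternDemo
{
    // Returns a set of observations; each is expected to be true.
    public static bool[] Run()
    {
        // Built at run time, so (almost certainly) not yet in the intern pool.
        string built = Guid.NewGuid().ToString();
        bool wasAbsent = string.IsInterned(built) == null;

        // Intern adds the string to the pool and returns the pooled reference,
        // which here is the very object we passed in.
        string pooled = string.Intern(built);
        bool pooledIsBuilt = object.ReferenceEquals(pooled, built);

        // An equal but separate string object maps to the pooled instance.
        string copy = new string(built.ToCharArray());
        bool copyIsSeparate = !object.ReferenceEquals(copy, built);
        bool internReturnsPooled = object.ReferenceEquals(string.Intern(copy), built);
        bool isInternedReturnsPooled = object.ReferenceEquals(string.IsInterned(copy), built);

        return new[] { wasAbsent, pooledIsBuilt, copyIsSeparate, internReturnsPooled, isInternedReturnsPooled };
    }

    public static void Main()
    {
        foreach (bool observation in Run())
        {
            Console.WriteLine(observation);
        }
    }
}
```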

Literals

A literal is, roughly speaking, a string value hard-coded in the source. There are two kinds of string literal in C#: standard (regular) and verbatim. Standard literals in C# are similar to those in most programming languages: they are enclosed in double quotes ("), and certain characters (the double quote (") itself, the backslash (\), carriage return (CR), line feed (LF) and some others) must be escaped. Verbatim literals allow almost the same content as standard ones, but a verbatim literal ends at the first double quote that is not doubled. To put a double quote inside a verbatim literal you must double it (""). Also, unlike a standard literal, a verbatim literal may contain carriage returns and line feeds without any escaping. To write a verbatim literal, put @ before the opening quote. The following table gives examples that show the differences between the two kinds of literal.

Standard literal	Verbatim literal	Resulting string
`"Hello"`	`@"Hello"`	`Hello`
`" : \\"`	`@" : \"`	`: \`
`" : \""`	`@" : """`	`: "`
`"CRLF:\r\n CRLF"`	`@"CRLF:` (line break) `CRLF"`	`CRLF:` (line break) `CRLF`

Note that the distinction between standard and verbatim literals exists only for you and the C# compiler. Once the code is compiled, all literals are the same.

Here is the complete list of escape sequences:

\' - single quote, needed in character literals (System.Char)
\" - double quote, needed in string literals
\\ - backslash
\0 - Unicode character 0 (null)
\a - alert (character 7)
\b - backspace (character 8)
\f - form feed (character 12)
\n - line feed (character 10)
\r - carriage return (character 13)
\t - horizontal tab (character 9)
\v - vertical tab (character 11)
\uxxxx - the Unicode character with hexadecimal code xxxx
\xn[n][n][n] - the Unicode character with hexadecimal code nnnn; a variable-length version of \uxxxx
\Uxxxxxxxx - the Unicode character with hexadecimal code xxxxxxxx, used to produce surrogate pairs

In my experience, \a, \f, \v, \x and \U are rarely used.
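To see that the two kinds of literal and the various escape sequences produce ordinary, identical strings once compiled, something like the following can be used. LiteralDemo and its members are illustrative names, and the sample texts are arbitrary.

```csharp
using System;

public static class LiteralDemo
{
    // A standard literal with an escaped backslash equals the verbatim form.
    public static bool BackslashFormsMatch() => "C:\\temp" == @"C:\temp";

    // An escaped quote equals a doubled quote in a verbatim literal.
    public static bool QuoteFormsMatch() => "Say \"hi\"" == @"Say ""hi""";

    // \u, \x and \U are three ways of writing the same character, here U+0041.
    public static bool UnicodeEscapesMatch() =>
        "\u0041" == "A" && "\x41" == "A" && "\U00000041" == "A";

    // A \U escape above U+FFFF becomes a surrogate pair, that is, two chars.
    public static int LengthAboveBmp() => "\U0001F600".Length;

    public static void Main()
    {
        Console.WriteLine(BackslashFormsMatch());   // True
        Console.WriteLine(QuoteFormsMatch());       // True
        Console.WriteLine(UnicodeEscapesMatch());   // True
        Console.WriteLine(LengthAboveBmp());        // 2
    }
}
```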

Strings and the debugger

People quite often run into problems when looking at strings in the debugger (in VS.NET 2002 and VS.NET 2003). Ironically, these problems are usually caused by the debugger trying to be helpful. Sometimes it displays a string as a standard literal, escaping all special characters with backslashes, and sometimes it displays it as a verbatim literal, prefixed with @. As a result, many people ask how to remove the @ from their string, although it is not actually there. In addition, the debuggers in some versions of VS.NET do not display anything after the first null character \0 in a string and, even worse, report the wrong length, because they compute it themselves instead of asking managed code. Naturally, all this happens because the debugger treats \0 as the end of the string.

Given this confusion, I have come to the conclusion that when debugging suspicious strings you should examine them in several different ways to rule out any misunderstanding. I suggest using the method below, which prints the contents of a string to the console in a "correct", unambiguous way. Depending on the kind of application you are developing, you may want to write the output to a log file, send it to debug or trace listeners, show it in a message box, and so on, instead of writing to the console.

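    using System;

    public static class StringDebugging
    {
        // Names of the 32 ASCII control characters (U+0000 to U+001F).
        static readonly string[] LowNames =
        {
            "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
            "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
            "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
            "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US"
        };

        // Prints the length of the string, then one line per char with its code.
        public static void DisplayString(string text)
        {
            Console.WriteLine("String length: {0}", text.Length);
            foreach (char c in text)
            {
                if (c < 32)
                {
                    // Control character: show its name instead of the character itself.
                    Console.WriteLine("<{0}> U+{1:x4}", LowNames[c], (int)c);
                }
                else if (c > 127)
                {
                    // Outside ASCII: the console may not be able to display it.
                    Console.WriteLine("(Possibly non-printable) U+{0:x4}", (int)c);
                }
                else
                {
                    Console.WriteLine("{0} U+{1:x4}", c, (int)c);
                }
            }
        }
    }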

Memory usage and internal structure

In the current implementation of the .NET Framework, each string takes up 20 + (n / 2) × 4 bytes (with n / 2 rounded down), where n is the number of characters in the string, that is, its length. The string type is unusual in that the size of the object itself varies. As far as I know, the only other types that do this are arrays. In essence, a string is a character array in memory, plus a number giving the size of the array in memory and a number giving the count of characters actually in use. As you may have guessed, the length of the array is not necessarily the same as the length of the string, because strings can be over-allocated within mscorlib.dll to make them easier to build up. StringBuilder, for example, does exactly that. Although strings are immutable to the outside world, inside mscorlib they are mutable. So when building a string, StringBuilder allocates a character array somewhat larger than the current content requires and then appends new characters to it for as long as they fit. Once the array is full, a new, larger array is created and the contents of the old one are copied into it. In addition, the top bit of the number that holds the string length is reserved for a flag indicating whether or not the string contains any non-ASCII characters. This flag allows the runtime to perform extra optimisations in some cases.
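As a worked example of the size formula quoted above, the sketch below simply evaluates 20 + (n / 2) × 4 with integer division. It reflects only the formula as stated in the text for the implementation described there, not a measurement of any particular runtime; StringSizeEstimate and ApproximateSizeInBytes are made-up names.

```csharp
using System;

public static class StringSizeEstimate
{
    // Evaluates the formula from the text: 20 + (n / 2) * 4 bytes.
    // Integer division rounds n / 2 down.
    public static int ApproximateSizeInBytes(int length) => 20 + (length / 2) * 4;

    public static void Main()
    {
        foreach (int n in new[] { 0, 1, 2, 5, 10 })
        {
            // Prints 20, 20, 24, 28 and 40 bytes respectively.
            Console.WriteLine("{0} chars: {1} bytes", n, ApproximateSizeInBytes(n));
        }
    }
}
```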

Although strings are not null-terminated as far as the API is concerned, the character arrays that represent them internally are. This means that .NET strings can be passed directly to unmanaged code without any copying, provided that the interop call marshals the string as Unicode.

String Encodings

If you are not familiar with character encodings and Unicode, please read my article on Unicode first (or its translation on Habr).

As I said at the beginning of the article, strings are always stored in Unicode. Any talk of "Big-5 strings" or "UTF-8 strings" is a mistake (at least as far as .NET is concerned) and comes from not understanding either the encodings themselves or how .NET handles strings. It is very important to understand this point: treating a string as if it held valid text in some non-Unicode encoding is almost always an error.

Furthermore, the character set supported by Unicode (one of the flaws of Unicode is that the same term is used for several different things, including coded character sets and character encoding schemes) contains more than 65536 characters. This means that a single char (System.Char) cannot represent every Unicode character. This leads to the concept of surrogate pairs, where characters with a code above U+FFFF are represented as two chars. In essence, strings in .NET use the UTF-16 encoding. Most developers probably do not need to go into the details, but it is at least worth being aware of.
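Surrogate pairs can be inspected directly with the static helpers on System.Char. The sketch below uses U+1D11E (musical symbol G clef) as an arbitrary example of a character above U+FFFF; SurrogateDemo and Clef are hypothetical names.

```csharp
using System;

public static class SurrogateDemo
{
    // Builds the string for code point U+1D11E, which needs two chars in UTF-16.
    public static string Clef() => char.ConvertFromUtf32(0x1D11E);

    public static void Main()
    {
        string s = Clef();
        Console.WriteLine(s.Length);                        // 2
        Console.WriteLine(char.IsSurrogatePair(s, 0));      // True
        Console.WriteLine("{0:X4} {1:X4}", (int)s[0], (int)s[1]);   // D834 DD1E
        Console.WriteLine("U+{0:X}", char.ConvertToUtf32(s, 0));    // U+1D11E
    }
}
```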

Regional and international oddities

Some of the oddities of Unicode lead to oddities when working with strings and characters. Most string methods are culture-sensitive; in other words, what they do depends on the culture of the thread they are running on. For example, what do you think "i".ToUpper() returns? Most people would say "I", but not always! Under the Turkish culture it returns "İ" (U+0130, "Latin capital letter I with dot above"). To change case in a culture-independent way, pass CultureInfo.InvariantCulture to the overload of String.ToUpper that accepts a CultureInfo.
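The following sketch contrasts the invariant culture with an explicitly requested Turkish culture. It assumes that culture data for "tr-TR" is available on the machine (it may not be when the runtime is configured for invariant globalization), so the Turkish result is printed rather than asserted. CasingDemo and its methods are illustrative names.

```csharp
using System;
using System.Globalization;

public static class CasingDemo
{
    // Culture-independent upper-casing.
    public static string UpperInvariant(string s) => s.ToUpper(CultureInfo.InvariantCulture);

    // Upper-casing under Turkish rules; assumes "tr-TR" culture data is installed.
    public static string UpperTurkish(string s) => s.ToUpper(new CultureInfo("tr-TR"));

    public static void Main()
    {
        Console.WriteLine(UpperInvariant("i"));   // I

        // Per the text, this is expected to be U+0130 under the Turkish culture.
        string turkish = UpperTurkish("i");
        Console.WriteLine("U+{0:X4}", (int)turkish[0]);
    }
}
```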

There are other oddities to do with comparing and sorting strings, and with finding the index of a substring. Some of these operations are culture-sensitive and some are not. For example, in every culture (as far as I can tell) the literals "lassen" and "la\u00dfen" (the escape in the second literal is the "sharp S", or "eszett") are considered equal by CompareTo and Compare, but not by Equals. The IndexOf method treats the eszett as "ss" (a double "s"), but if you use one of the CompareInfo.IndexOf overloads and specify CompareOptions.Ordinal, the eszett is handled as a distinct character.
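The ordinal side of this comparison is deterministic and easy to check; the culture-sensitive side depends on the runtime and its globalization data, so the sketch below prints that result without claiming what it will be. EszettDemo and its members are made-up names.

```csharp
using System;
using System.Globalization;

public static class EszettDemo
{
    public const string WithDoubleS = "lassen";
    public const string WithEszett = "la\u00dfen";

    // Equals(string, string) is an ordinal comparison: the strings differ.
    public static bool OrdinalEquals() => string.Equals(WithDoubleS, WithEszett);

    // An ordinal search for "ss" does not match the single eszett character.
    public static int OrdinalIndexOfDoubleS() =>
        CultureInfo.InvariantCulture.CompareInfo.IndexOf(WithEszett, "ss", CompareOptions.Ordinal);

    // Culture-sensitive comparison; the result depends on the platform's culture data.
    public static int CultureSensitiveCompare() =>
        string.Compare(WithDoubleS, WithEszett, StringComparison.CurrentCulture);

    public static void Main()
    {
        Console.WriteLine(OrdinalEquals());            // False
        Console.WriteLine(OrdinalIndexOfDoubleS());    // -1
        Console.WriteLine(CultureSensitiveCompare());  // platform-dependent
    }
}
```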

Some Unicode characters are completely invisible to the ordinary IndexOf method. Someone once asked in a C# newsgroup why their search-and-replace method went into an infinite loop. They were using Replace to replace every double space with a single space, and then using IndexOf to check whether any double spaces remained. If IndexOf reported a double space, the string went back to Replace for another pass. Unfortunately this broke, because the string contained some "odd" character sitting exactly between two spaces. IndexOf reported a double space, ignoring that character, while Replace made no replacement because it did "see" the character. I never found out what the character actually was, but the situation is easy to reproduce with U+200C, the "zero-width non-joiner" character (whatever that means!). Put it, or a character like it, into your string, and IndexOf will ignore it but Replace will not. Again, to make the two methods behave consistently, you can use CompareInfo.IndexOf with CompareOptions.Ordinal. I suspect a lot of code has already been written that will fail on "awkward" data like this, and I would not for a moment claim that my own code is immune.
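One way to write the collapse-double-spaces loop so that the check and the replacement agree is to make the check ordinal as well. This is a sketch under that assumption, using StringComparison.Ordinal (an ordinal check equivalent in effect to CompareInfo.IndexOf with CompareOptions.Ordinal); ZeroWidthDemo and CollapseSpaces are hypothetical names.

```csharp
using System;

public static class ZeroWidthDemo
{
    // Two spaces separated by U+200C (zero-width non-joiner).
    public const string Awkward = "a \u200c b";

    // Collapses runs of spaces. Both the check and Replace(string, string)
    // are ordinal, so they always agree and the loop terminates.
    public static string CollapseSpaces(string s)
    {
        while (s.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            s = s.Replace("  ", " ");
        }
        return s;
    }

    public static void Main()
    {
        Console.WriteLine(CollapseSpaces("a    b"));            // a b
        Console.WriteLine(CollapseSpaces(Awkward) == Awkward);  // True: nothing to replace

        // The culture-sensitive check may ignore U+200C; its result is platform-dependent.
        Console.WriteLine(Awkward.IndexOf("  "));
    }
}
```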

Microsoft has published some recommendations on string handling, and although they date back to 2005, they are still well worth reading.

Conclusion

For such a basic type, strings (and text in general) in .NET are much more complicated than you might expect. It is important to understand the basics described in this article, even if some of the finer points of comparison and casing in multi-culture contexts escape you. In particular, being able to diagnose encoding errors by logging the actual string data is vital.

From the translator: Since the article is relatively recent, I decided to check the oddities and problems with strings that Jon Skeet describes. I was able to reproduce everything described in the section on regional and international oddities using .NET Framework 3.5, 4.0 and 4.5. However, I have never come across the debugger's odd display of literals described in the section on strings and the debugger, at least not in MS Visual Studio 2008, 2010 and 2012.

Source: https://habr.com/ru/post/165597/
