The string data type is one of the most important in any programming language. It is hardly possible to write a useful program without using this data type. However, many developers do not know some of the nuances associated with this type. So let's look at some of the features of this type in .NET.
Let's start with how strings are represented in memory.
In .NET, strings are laid out according to the BSTR convention (Basic string or binary string). This way of representing string data is used in COM (the word "basic" comes from the Visual Basic programming language, where it was originally used). In C/C++, as is well known, strings are represented as PWSZ, which stands for Pointer to Wide-character String, Zero-terminated. With this layout, a null character sits at the end of the string, and it is what marks where the string ends. The length of a PWSZ string is limited only by the amount of free memory.

With BSTR, things are a little different.

The main features of the BSTR in-memory string representation are:
- The string length is limited to a certain number, unlike PWSZ, where the string length is limited only by the available free memory.
- A BSTR always points to the first character in the buffer. A PWSZ can point to any character in the buffer.
- In a BSTR, as in a PWSZ, there is always a null character at the end, but unlike in PWSZ, the null character is a valid character and can appear anywhere in the string.
- Because of the null character at the end, a BSTR is compatible with PWSZ, but not vice versa.
So, strings in .NET are represented in memory according to the BSTR convention. The buffer contains a four-byte string length, followed by the two-byte characters of the string in UTF-16 format, followed by two zero bytes (\u0000).
This implementation has several advantages: the length of the string does not need to be recalculated because it is stored in the header; the string can contain null characters anywhere; and, most importantly, the address of a (pinned) string can be passed to unmanaged code without any trouble wherever a WCHAR* is expected.
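The point about null characters and the stored length can be checked with a small sketch. The class and method names here (EmbeddedNullDemo, MakeStringWithNull) are made up for illustration and are not part of the original article.

```csharp
using System;

static class EmbeddedNullDemo
{
    // Hypothetical helper: returns a string with a null character in the middle.
    public static string MakeStringWithNull() => "ab\0cd";

    static void Main()
    {
        string s = MakeStringWithNull();

        // Length comes from the stored length field, not from scanning for '\0'.
        Console.WriteLine(s.Length);        // 5
        Console.WriteLine(s.IndexOf('\0')); // 2
        Console.WriteLine(s[3]);            // c
    }
}
```

A zero-terminated C string with the same bytes would appear to end after "ab", which is why a BSTR can be handed to code expecting a PWSZ, but not the other way round without losing information.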
Moving on.
How much memory does a string object occupy?
I have come across articles claiming that the size of a string object is size = 20 + (length / 2) * 4, but this formula is not entirely correct.
To begin with, a string is a reference type, so (on a 32-bit system) the first 4 bytes contain the SyncBlockIndex and the next 4 bytes contain a pointer to the type.
String size = 4 + 4 + ...
As mentioned above, the length of the string is stored in the buffer. It is an int field, which means another 4 bytes.
String size = 4 + 4 + 4 + ...
So that a string can be passed to unmanaged code quickly (without copying), each string ends with a null terminator that takes 2 bytes, which means
String size = 4 + 4 + 4 + 2 + ...
It remains to recall that each character in the string is UTF-16 encoded and therefore also takes 2 bytes, so
String size = 4 + 4 + 4 + 2 + 2 * length = 14 + 2 * length
One more nuance and we are done. The memory manager in the CLR allocates memory in multiples of 4 bytes (4, 8, 12, 16, 20, 24, ...), so if a string needs 34 bytes, 36 bytes will be allocated. We need to round our value up to the nearest multiple of four:
String size = 4 * ((14 + 2 * length + 3) / 4) (integer division, naturally)
Version note: In .NET before version 4.0, the String class stores an additional field, m_arrayLength, of type int, which takes 4 bytes. This field holds the actual length of the buffer allocated for the string, including the null terminator, that is, length + 1. In .NET 4.0 this field was removed from the class, so a string object became 4 bytes smaller.
The size of an empty string without the m_arrayLength field (that is, in .NET 4.0 and above) is 4 + 4 + 4 + 2 = 14 bytes, and with this field (that is, below .NET 4.0) it is 4 + 4 + 4 + 4 + 2 = 18 bytes. Rounded up to a multiple of 4 bytes, that gives 16 and 20 bytes, respectively.
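The arithmetic above can be written down as a small helper. This is only a sketch of the formula from the text for a 32-bit CLR; the names StringSizeDemo and EstimateSize32 and the hasArrayLengthField parameter are invented for illustration, and the result is an estimate rather than a value reported by the runtime.

```csharp
using System;

static class StringSizeDemo
{
    // Estimated size in bytes of a string object on a 32-bit CLR, following the
    // formula in the text. hasArrayLengthField models the extra 4-byte
    // m_arrayLength field that existed before .NET 4.0.
    public static int EstimateSize32(int length, bool hasArrayLengthField = false)
    {
        int raw = 4                               // sync block index
                + 4                               // pointer to the type
                + 4                               // length field
                + (hasArrayLengthField ? 4 : 0)   // m_arrayLength (pre-4.0 only)
                + 2                               // null terminator
                + 2 * length;                     // UTF-16 characters
        return 4 * ((raw + 3) / 4);               // round up to a multiple of 4
    }

    static void Main()
    {
        Console.WriteLine(EstimateSize32(0));       // 16
        Console.WriteLine(EstimateSize32(0, true)); // 20
        Console.WriteLine(EstimateSize32(10));      // 36
    }
}
```

The last call reproduces the example from the text: a 10-character string needs 34 bytes, which is rounded up to 36.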
String Features
So, we have looked at how strings are represented and how much space they actually take up in memory. Now let's talk about their features.
Main features of strings in .NET:
- They are reference types.
- They are immutable. Once a string has been created, we can no longer change it (not by honest means, at least). Each call to a method of this class returns a new string, and the previous string becomes prey for the garbage collector.
- They override the Object.Equals method, so that it compares not the references but the characters in the strings.
Consider each item in more detail.
Strings - Reference Types
Strings are genuine reference types, that is, they always live on the heap. Many people confuse them with value types because they behave similarly: they are immutable and are compared by value rather than by reference. Still, it must be remembered that string is a reference type.
Strings are immutable
Strings are immutable, and for good reason. Immutable strings have a lot of advantages:
- The string type is thread safe, since no thread can change the contents of a string.
- Using immutable strings reduces memory load, since there is no need to store two instances of the same string. Less memory is spent, and comparison is faster because only the references need to be compared. The mechanism that implements this in .NET is called string interning (a string pool); we will talk about it a bit later.
- When passing an immutable parameter to a method, we need not worry that it will be changed (unless, of course, it was passed as ref or out).
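A minimal sketch of the last point: a method that "modifies" its string argument actually produces a new object, and the caller's string stays as it was. The names ImmutabilityDemo and Shout are hypothetical.

```csharp
using System;

static class ImmutabilityDemo
{
    // Hypothetical helper: builds a new string from the argument.
    // The argument itself is never modified.
    public static string Shout(string text) => text.ToUpperInvariant() + "!";

    static void Main()
    {
        string original = "hello";
        string result = Shout(original);

        Console.WriteLine(original); // hello
        Console.WriteLine(result);   // HELLO!
        Console.WriteLine(object.ReferenceEquals(original, result)); // False
    }
}
```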
Data structures can be divided into two types - ephemeral and persistent. Data structures that store only their latest version are called ephemeral. Persistent structures are structures that retain all their previous versions when they change. The latter are virtually immutable, since their operations do not change the structure in place; instead, they return a new one based on the previous structure.
Given that strings are immutable, they could be persistent, but they are not. In .NET, strings are ephemeral. Eric Lippert explains in more detail why this is so (link).
For comparison, take strings in Java. They are immutable, as in .NET, but in addition they are persistent. The implementation of the String class in Java looks like this:
public final class String {
    private final char value[];
    private final int offset;
    private final int count;
    private int hash;
    // ...
}
In addition to the same 8 bytes of object header, which hold the reference to the type and the reference to the synchronization object, strings contain the following fields:
- A reference to a char array;
- The index of the first character of the string in the char array (the starting offset);
- The number of characters in the string;
- The hash code, computed on the first call to the hashCode() method.
As you can see, strings in Java occupy more memory than in .NET, because they contain additional fields that allow them to be persistent. Thanks to persistence, the String.substring() method in Java runs in O(1), since it does not need to copy the string as in .NET, where this method runs in O(n).
The implementation of the String.substring() method in Java:
public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0)
        throw new StringIndexOutOfBoundsException(beginIndex);
    if (endIndex > count)
        throw new StringIndexOutOfBoundsException(endIndex);
    if (beginIndex > endIndex)
        throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
    // Share the original char array; only offset and count differ.
    return ((beginIndex == 0) && (endIndex == count)) ? this
        : new String(offset + beginIndex, endIndex - beginIndex, value);
}

public String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}
However, as Eric Lippert likes to say, there is no free lunch, and things are not quite so rosy. If the source string is large and the substring cut from it is only a couple of characters, the entire character array of the original string stays in memory for as long as there is a reference to the substring. Likewise, if you serialize the resulting substring by standard means and send it over the network, the entire original array is serialized, and the number of bytes sent over the network is large. Therefore, in this case, instead of the code
s = ss.substring(3)
you can use the code
s = new String(ss.substring(3))
which does not keep a reference to the character array of the source string but copies only the part of the array that is actually used. By the way, if this constructor is called on a string whose length equals the length of its character array, no copying takes place, and the reference to the original array is used.
As it turned out, the implementation of the string type has changed in the latest version of Java (xonix pointed this out). The class no longer has the offset and count fields, and a new hash32 field (with a different hashing algorithm) has appeared. This means that strings are no longer persistent: the String.substring method now creates a new string every time.
Strings override Object.Equals
The String class overrides the Object.Equals method, so comparison happens by value rather than by reference. I think developers are grateful to the creators of the String class for also overloading the == operator, because code that uses == to compare strings looks more elegant than a method call:
if (s1 == s2)
compared with
if (s1.Equals(s2))
By the way, in Java the == operator compares by reference, and to compare strings character by character you have to use the String.equals() method.
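The difference between value comparison and reference comparison in .NET can be seen in a short sketch. The names EqualityDemo and Build are made up for illustration; the strings are built at runtime so that they are guaranteed to be two separate objects.

```csharp
using System;

static class EqualityDemo
{
    // Hypothetical helper: builds a fresh string object at runtime,
    // so the result is not a literal.
    public static string Build() => new string('a', 3);

    static void Main()
    {
        string s1 = Build();
        string s2 = Build();

        Console.WriteLine(s1 == s2);                       // True  (overloaded ==, by value)
        Console.WriteLine(s1.Equals(s2));                  // True  (overridden Equals, by value)
        Console.WriteLine(object.ReferenceEquals(s1, s2)); // False (two distinct objects)
        Console.WriteLine((object)s1 == (object)s2);       // False (== on object compares references)
    }
}
```

The last line shows a common pitfall: the overloaded operator is chosen at compile time, so comparing strings through variables of type object falls back to reference comparison.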
String interning
Finally, let's talk about string interning.
Consider a simple example: code that reverses a string.
var s = "Strings are immutable";
int length = s.Length;
for (int i = 0; i < length / 2; i++)
{
    var c = s[i];
    s[i] = s[length - i - 1];   // compile error: the string indexer is read-only
    s[length - i - 1] = c;      // compile error: the string indexer is read-only
}
Obviously, this code does not compile. The compiler complains about these lines because we are trying to change the contents of the string. No method of the String class changes the string's contents; a method that produces a modified string returns a new instance instead.
In fact, a string can be changed, but this requires unsafe code. Consider an example:
var s = "Strings are immutable";
int length = s.Length;
unsafe
{
    // Pin the string and get a pointer to its first character.
    fixed (char* c = s)
    {
        for (int i = 0; i < length / 2; i++)
        {
            var temp = c[i];
            c[i] = c[length - i - 1];
            c[length - i - 1] = temp;
        }
    }
}
After executing this code, the string will, as expected, contain elbatummi era sgnirtS.
The fact that strings can be changed after all leads to one very curious effect. It is connected with string interning.
String interning is a mechanism by which identical literals represent one object in memory.
Without going too deeply into the details, the idea of string interning is this: within the process (the process, not the application domain) there is a single internal hash table whose keys are strings and whose values are references to them. During JIT compilation, literal strings are added to the table one by one (each string occurs in the table only once). At run time, references to literal strings are assigned from this table. You can place a string in the internal table at run time using the String.Intern method. You can also check whether a string is contained in the internal table using the String.IsInterned method.
var s1 = "habrahabr";
var s2 = "habrahabr";
var s3 = "habra" + "habr";
Console.WriteLine(object.ReferenceEquals(s1, s2)); // True
Console.WriteLine(object.ReferenceEquals(s1, s3)); // True
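For strings built at run time, interning has to be requested explicitly. The following is a minimal sketch of String.Intern and String.IsInterned; the names InternDemo and BuildAtRuntime are hypothetical, and the string is built from a char array so that it is not a literal.

```csharp
using System;

static class InternDemo
{
    // Hypothetical helper: builds a string at runtime from a char array,
    // so the result is a new object and not a literal.
    public static string BuildAtRuntime() => new string(new[] { 'h', 'a', 'b', 'r' });

    static void Main()
    {
        string a = BuildAtRuntime();
        string b = BuildAtRuntime();

        // Two separate objects with equal contents.
        Console.WriteLine(object.ReferenceEquals(a, b));   // False

        // Intern returns one shared reference for equal contents.
        string ia = string.Intern(a);
        string ib = string.Intern(b);
        Console.WriteLine(object.ReferenceEquals(ia, ib)); // True

        // IsInterned returns the pooled reference, or null if there is none.
        Console.WriteLine(string.IsInterned(b) != null);   // True
    }
}
```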
It is important to note that only string literals are interned by default. Since interning is implemented with an internal hash table, a lookup in it is performed during JIT compilation, and that takes time; if all strings were interned, this would cancel out the whole optimization. When compiling to IL, the compiler concatenates all literal strings, since there is no need to store them in parts, which is why the second comparison returns true. So what is the curious effect? Consider the following code:
var s = "Strings are immutable";
int length = s.Length;
unsafe
{
    fixed (char* c = s)
    {
        for (int i = 0; i < length / 2; i++)
        {
            var temp = c[i];
            c[i] = c[length - i - 1];
            c[length - i - 1] = temp;
        }
    }
}
Console.WriteLine("Strings are immutable");
It seems that everything here is obvious and that this code should print Strings are immutable. But no! The code prints elbatummi era sgnirtS. The reason is interning: by modifying the string s we change its contents, and since it is a literal, it is interned and represented by a single instance of the string.
String interning can be turned off by applying the special CompilationRelaxationsAttribute attribute to an assembly. The CompilationRelaxationsAttribute attribute controls the accuracy of the code generated by the CLR JIT compiler. Its constructor accepts the CompilationRelaxations enumeration, which currently contains only CompilationRelaxations.NoStringInterning; this value marks the assembly as not requiring interning.
By the way, this attribute is not processed in .NET Framework 1.0, so there it was not possible to disable the default interning. Starting with version 2.0, the mscorlib assembly is marked with this attribute.
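Applying the attribute might look like the sketch below. The Program class is only a placeholder so that the file compiles on its own. Note that the attribute says the assembly does not require interning; whether literals actually end up interned is still decided by the runtime.

```csharp
using System;
using System.Runtime.CompilerServices;

// Marks the whole assembly as not requiring literal strings to be interned.
[assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)]

static class Program
{
    static void Main()
    {
        Console.WriteLine("Assembly marked with NoStringInterning");
    }
}
```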
It turns out that strings in .NET can be changed after all, if you really want to, by using unsafe code.
What about without unsafe?
It turns out that the contents of a string could also be changed without unsafe code, by using reflection. This trick worked in .NET up to and including version 2.0; after that the developers of the String class took this opportunity away from us.
In .NET 2.0, the String class has two internal methods that set the specified character at a given index: SetChar, which checks the bounds, and InternalSetCharNoBoundsCheck, which does not. Here is their implementation:
internal unsafe void SetChar(int index, char value)
{
    if ((uint)index >= (uint)this.Length)
        throw new ArgumentOutOfRangeException("index", Environment.GetResourceString("ArgumentOutOfRange_Index"));
    fixed (char* chPtr = &this.m_firstChar)
        chPtr[index] = value;
}

internal unsafe void InternalSetCharNoBoundsCheck(int index, char value)
{
    fixed (char* chPtr = &this.m_firstChar)
        chPtr[index] = value;
}
Thus, with the following code you can change the contents of a string even without unsafe code.
// Requires: using System; using System.Reflection;
var s = "Strings are immutable";
int length = s.Length;
var method = typeof(string).GetMethod("InternalSetCharNoBoundsCheck", BindingFlags.Instance | BindingFlags.NonPublic);
for (int i = 0; i < length / 2; i++)
{
    var temp = s[i];
    method.Invoke(s, new object[] { i, s[length - i - 1] });
    method.Invoke(s, new object[] { length - i - 1, temp });
}
Console.WriteLine("Strings are immutable");
As expected, this code prints elbatummi era sgnirtS.
Version note: In different versions of the .NET Framework, string.Empty may or may not be interned.
Consider the code:
// Requires: using System; using System.Text;
string str1 = String.Empty;
StringBuilder sb = new StringBuilder().Append(String.Empty);
string str2 = String.Intern(sb.ToString());
if (object.ReferenceEquals(str1, str2))
    Console.WriteLine("Equal");
else
    Console.WriteLine("Not Equal");
In the .NET Framework 1.0, the .NET Framework 1.1 and the .NET Framework 3.5 Service Pack 1 (SP1), str1 and str2 are equal. In the .NET Framework 2.0 Service Pack 1 (SP1) and the .NET Framework 3.0, str1 and str2 are not equal. Currently string.Empty is interned.
Performance features
Interning has a negative side effect. The reference to an interned String object held by the CLR can persist even after the application, and even the application domain, has terminated. Therefore, large literal strings should be avoided, or, if they are necessary, interning should be disabled by applying the CompilationRelaxations attribute to the assembly.
I hope this article was helpful ...