📜 ⬆️ ⬇️

StringBuilder Past and Present

Introduction


My last article was devoted to the features of the string data type String in .NET. This article continues the tradition, but this time we will look at the StringBuilder class.

As you know, strings in .NET are immutable (without using unsafe), and therefore it is not a good idea to perform concatenation operations with them in large quantities. This means that the following code has very serious problems with memory load:

string s = string.Empty; for (int i = 0; i < 100; i++) { s += "T"; } 

What is wrong with this code? And the fact that each iteration creates a string one unit longer than the previous step, and then copying characters from the old string. Thus, the total number of characters involved is equal to:

')
This formula is nothing more than the sum of an arithmetic progression:

That is, such a concatenation script requires memory in proportion to O (n 2 ), n is the length of the string.

To eliminate problems in such code, we use the StringBuilder class, knowing that operations with it do not lead to such memory wastes as with a String. StringBuilder is actually a modifiable string.

 var strB = new StringBuilder(); for (int i = 0; i < 100; i++) { strB.Append("T"); } string str = strB.ToString(); 

This code, though not devoid of all the flaws and spends some memory on the wind, but it does it more restrained, compared with the previous code.

The implementation of the StringBuilder class has changed dramatically in .NET 4.0 compared to previous versions, and therefore I think it will be interesting to write what happened to it.

StringBuilder in .NET 2.0


The StringBuilder class in .NET 2.0 has the following fields:

 public sealed class StringBuilder : ISerializable { internal const int DefaultCapacity = 16; private const string CapacityField = "Capacity"; private const string MaxCapacityField = "m_MaxCapacity"; private const string StringValueField = "m_StringValue"; private const string ThreadIDField = "m_currentThread"; internal IntPtr m_currentThread; internal int m_MaxCapacity; internal volatile string m_StringValue; <------------------------------------- } 

m_currentThread - the identifier of the thread in which the object instance was created;
m_MaxCapacity - the maximum capacity of this instance;
m_StringValue is a string containing characters.

In fact, the StringBuilder class internally works with the string data type String. Since the strings on the mscorlib.dll side are mutable, then StringBuilder does not cost anything to change the string that is in m_StringValue .

The initial length is 16 characters, and if there is not enough space to add new characters, StringBuilder replaces the internal string with a string two times longer and copies to the newly created all the characters from the previous one + new ones. Doubling the length of a string leads to linear complexity (O (n)) from memory, as opposed to quadratic, which is inherent in ordinary strings.

An important addition to the performance increase is setting the required capacity when creating an instance of the StringBuilder class. Thus, we do not need another doubling of the size and copying of data in case of insufficient memory. However, this is possible only if we know in advance the size of the resulting string.

Consider the implementation of the most commonly used methods. For code readability, I removed the conditions for checking the input parameters.

Append () method

Source
  public StringBuilder Append(string value) { if (value == null) return this; string currentString = this.m_StringValue; IntPtr currentThread = Thread.InternalGetCurrentThread(); if (this.m_currentThread != currentThread) currentString = string.GetStringForStringBuilder(currentString, currentString.Capacity); int length = currentString.Length; int requiredLength = length + value.Length; if (this.NeedsAllocation(currentString, requiredLength)) //       { //    2          string newString = this.GetNewString(currentString, requiredLength); newString.AppendInPlace(value, length); this.ReplaceString(currentThread, newString); //    } else { currentString.AppendInPlace(value, length); this.ReplaceString(currentThread, currentString); } return this; } 


This method simply checks whether there is enough space in the current instance to add a new line, if so, it simply copies it in place to the unoccupied part of the line, otherwise doubling the size and copying the old and new lines.

Insert method ()

Source
  public StringBuilder Insert(int index, string value) { if (value == null) return this.Insert(index, value, 0); else return this.Insert(index, value, 1); } public StringBuilder Insert(int index, string value, int count) { IntPtr tid; string threadSafeString = this.GetThreadSafeString(out tid); int length = threadSafeString.Length; if (value != null && value.Length != 0) { if (count != 0) { int requiredLength; try { requiredLength = checked (length + value.Length * count);//    } catch (OverflowException ex) { throw new OutOfMemoryException(); } if (this.NeedsAllocation(threadSafeString, requiredLength))//       { //    2          string newString = this.GetNewString(threadSafeString, requiredLength); newString.InsertInPlace(index, value, count, length, requiredLength);//   this.ReplaceString(tid, newString);//    } else { threadSafeString.InsertInPlace(index, value, count, length, requiredLength); this.ReplaceString(tid, threadSafeString); } return this; } } return this; } 


This method, similarly to the previous one, checks whether there is enough space in the current instance to insert a new line and, depending on this, doubles the size of the line or inserts it in place in the original line without resizing.

Remove () method

Source
 public StringBuilder Remove(int startIndex, int length) { IntPtr tid; string threadSafeString = this.GetThreadSafeString(out tid); int length1 = threadSafeString.Length; //  ,      threadSafeString.RemoveInPlace(startIndex, length, length1); this.ReplaceString(tid, threadSafeString);//    return this; } 


This method removes unnecessary characters by moving the rest of the line to the left. When deleting the last character, you don’t actually need to shift anything, so deletion from the end is much faster than from any other part of the string.

ToString () method

Source
 public override string ToString() { string str = this.m_StringValue; if (this.m_currentThread != Thread.InternalGetCurrentThread() || 2 * str.Length < str.ArrayLength) return string.InternalCopy(str);//   str.ClearPostNullChar(); this.m_currentThread = IntPtr.Zero; return str; //     } 


This method, as can be seen from the implementation, returns either a copy of the string, or a string with which it operates. As a rule, the first call to this method returns a reference to the source string, so it is executed very quickly, but each subsequent call results in copying the string. The StringBuilder class in .NET 2.0 focuses precisely on the speed of this method.

In general, the StringBuilder class in .NET 2.0 is fairly simple to implement. It uses a variable string, and if there is not enough space, it creates a new string, the length of which is twice the previous one. Such a scenario of doubling the length leads to linear complexity in memory, which is an order of magnitude better than a quadratic one. However, with large lengths of lines and it is not effective. By the way, because of its larger size, the line can often be placed in a heap for large objects (LOH), which is also not good.

StringBuilder in .NET 4.0


As I said, in .NET 4.0 the implementation of the StringBuilder class has changed. Now Char [] is used to store characters instead of String, and the class itself is a linked list of StringBuilders, like RopeString.

The reason for this change is quite obvious: with such an implementation, it is not necessary to re-allocate memory when it is scarce, which is inherent in the previous implementation. This also means that the ToString () method works a little slower, since the final string must first be formed, and the Append () method works faster, since it does not require copying. However, this fits into a typical use case for StringBuilder: many calls to Append (), and then one call to ToString ().

The StringBuilder class in .NET 4.0 has the following fields:

 public sealed class StringBuilder : ISerializable { internal const int DefaultCapacity = 16; internal const int MaxChunkSize = 8000; internal char[] m_ChunkChars; <------------------------------------- internal StringBuilder m_ChunkPrevious; <------------------------------------- internal int m_ChunkLength; internal int m_ChunkOffset; internal int m_MaxCapacity; private const string CapacityField = "Capacity"; private const string MaxCapacityField = "m_MaxCapacity"; private const string StringValueField = "m_StringValue"; private const string ThreadIDField = "m_currentThread"; } 

m_ChunkChars - an array containing the characters of the current linked list element (a piece of string);
m_ChunkPrevious - a link to the previous element (StringBuilder) in the list;
m_ChunkLength - the actual length of the current list item (the number of characters used);
m_ChunkOffset - the total number of characters used by the string (logical length);
m_MaxCapacity - the maximum capacity of the current StringBuilder instance.

In .NET Framework 4 and .NET Framework 4.5, when you create an instance of a StringBuilder object by calling the StringBuilder (Int32, Int32) constructor, the length and capacity of the StringBuilder instance can increase beyond the value of its MaxCapacity property. This can happen in particular when calling the Append and AppendFormat methods.

The maximum length of the MaxChunkSize list item is 8000. As you understand, this is not just done. Here is a comment from class developers:

We want to keep the heap (<85K bytes ~ 40K chars) to be sure. There are few signs of slowing down.

We want the array of characters not to fall into a heap for large objects. Making the maximum size of a list element (piece) large would mean that less memory allocation is needed, but more characters remain unused and insert / replace operations are slower.

Consider the implementation of the most commonly used methods.

Append () method

Source
 public unsafe StringBuilder Append(string value) { if (value != null) { char[] chArray = this.m_ChunkChars; int index = this.m_ChunkLength; int length = value.Length; int num = index + length; if (num < chArray.Length)//           { if (length <= 2) { if (length > 0) chArray[index] = value[0]; if (length > 1) chArray[index + 1] = value[1]; } else { fixed (char* smem = value) fixed (char* dmem = &chArray[index]) string.wstrcpy(dmem, smem, length); } this.m_ChunkLength = num; } else this.AppendHelper(value); } return this; } private unsafe void AppendHelper(string value) { fixed (char* chPtr = value) this.Append(chPtr, value.Length); } internal unsafe StringBuilder Append(char* value, int valueCount) { //  int num1 = valueCount + this.m_ChunkLength; if (num1 <= this.m_ChunkChars.Length) { StringBuilder.ThreadSafeCopy(value, this.m_ChunkChars, this.m_ChunkLength, valueCount); this.m_ChunkLength = num1; } else { //   () int count = this.m_ChunkChars.Length - this.m_ChunkLength; if (count > 0) { StringBuilder.ThreadSafeCopy(value, this.m_ChunkChars, this.m_ChunkLength, count); this.m_ChunkLength = this.m_ChunkChars.Length; } //  ,     int num2 = valueCount - count; this.ExpandByABlock(num2); //    () StringBuilder.ThreadSafeCopy(value + count, this.m_ChunkChars, 0, num2); this.m_ChunkLength = num2; } return this; } 


The Append () method works as follows: if there are enough characters in the current list item to insert a new line, then it is copied into it, if not, then the part that fits is copied, and for something that does not fit, a new list item is created ( an instance of StringBuilder-a), in which the length of the array is equal to the length of the entire source string or the length of the remaining string, whichever is greater. However, as mentioned above, the maximum array length is 8000.

In general, the formula for calculating the length of a new list item is as follows:

 int length = Math.Max(minBlockCharCount, Math.Min(this.Length, 8000)) 

where minBlockCharCount is the remaining length of the string after copying its part that fits into the current instance.

Thus, as a result of the work of such a code

 StringBuilder s = new StringBuilder (); for (int i = 0; i < 10000; i++) { s.Append ("T"); } 

the lengths of the arrays of the elements of the list will be equal: 8000 , 4092, 2048, 1024, 512, 256, 128, 64, 32, 16, 16.

With such lengths of arrays, the operation of accessing a certain character in the source string is performed quite quickly almost in O (1), since there are not so many elements of the list.

Insert method ()

Source
 public unsafe StringBuilder Insert(int index, string value) { if (value != null) { fixed (char* chPtr = value) this.Insert(index, chPtr, value.Length); } return this; } private unsafe void Insert(int index, char* value, int valueCount) { if (valueCount <= 0) return; StringBuilder chunk; int indexInChunk; //          (StringBuilder) this.MakeRoom(index, valueCount, out chunk, out indexInChunk, false); this.ReplaceInPlaceAtChunk(ref chunk, ref indexInChunk, value, valueCount); } 


The Insert () method works as follows: if there is enough space to insert in the current list item (StringBuilder), the available characters are shifted to give space to the new text. Otherwise, a new list item (StringBuilder) is created, into which some characters from the previous item that do not fit are copied. Subsequent characters are not shifted to the left.

What will be the result of this code?

 StringBuilder s = new StringBuilder (); for (int i = 0; i < 10000; i++) { s.Insert (0, "T"); } 

The result will be different from the code using Append (), and very seriously!

We will get a very large list of StringBuilders, each element that will be 16 characters long. As a result, the operation of accessing a certain symbol by index will be performed more slowly than expected, namely, in proportion to the list length, that is, O (n).

Remove () method

Source
 public StringBuilder Remove(int startIndex, int length) { if (this.Length == length && startIndex == 0) { // .     . this.Length = 0; return this; } else { if (length > 0) { StringBuilder chunk; int indexInChunk; this.Remove(startIndex, length, out chunk, out indexInChunk); } return this; } } private void Remove(int startIndex, int count, out StringBuilder chunk, out int indexInChunk) { int num = startIndex + count; //  ()         . chunk = this; StringBuilder stringBuilder = (StringBuilder) null; int sourceIndex = 0; while (true) { if (num - chunk.m_ChunkOffset >= 0) { if (stringBuilder == null) { stringBuilder = chunk; sourceIndex = num - stringBuilder.m_ChunkOffset; } if (startIndex - chunk.m_ChunkOffset >= 0) break; } else chunk.m_ChunkOffset -= count; chunk = chunk.m_ChunkPrevious; } indexInChunk = startIndex - chunk.m_ChunkOffset; int destinationIndex = indexInChunk; int count1 = stringBuilder.m_ChunkLength - sourceIndex; //       if (stringBuilder != chunk) { destinationIndex = 0; //    startIndex     () chunk.m_ChunkLength = indexInChunk; //       . stringBuilder.m_ChunkPrevious = chunk; stringBuilder.m_ChunkOffset = chunk.m_ChunkOffset + chunk.m_ChunkLength; //                () if (indexInChunk == 0) { stringBuilder.m_ChunkPrevious = chunk.m_ChunkPrevious; chunk = stringBuilder; } } stringBuilder.m_ChunkLength -= sourceIndex - destinationIndex; if (destinationIndex == sourceIndex) //     return; //          StringBuilder.ThreadSafeCopy(stringBuilder.m_ChunkChars, sourceIndex, stringBuilder.m_ChunkChars, destinationIndex, count1); } 


The implementation of this method has become significantly more complicated. However, it should be noted that the previous implementation copied a large number of characters, shifting them to the left. Here it is necessary to make an offset only within one element (StringBuilder-a) in the list.

ToString () method

Source
 public override unsafe string ToString() { if (this.Length == 0) return string.Empty; string str = string.FastAllocateString(this.Length); StringBuilder stringBuilder = this; fixed (char* chPtr = str) { do { if (stringBuilder.m_ChunkLength > 0) { char[] chArray = stringBuilder.m_ChunkChars; int num = stringBuilder.m_ChunkOffset; int charCount = stringBuilder.m_ChunkLength; fixed (char* smem = chArray) string.wstrcpy(chPtr + num, smem, charCount); } stringBuilder = stringBuilder.m_ChunkPrevious; } while (stringBuilder != null); } return str; } 


This method traverses the entire linked list of StringBuilders, and successively copies the characters of each of the elements of the list into the resulting string.

Performance comparison


Perhaps the most interesting part is the performance comparison between the two versions of the class.

Test 1. How much memory is required to store a string of a given length.



As you can see, with a small length of the string, the new implementation loses the old one. It is understandable, because for each element of the list (StringBuilder) information is required on the length, capacity, offset from the beginning of the string + for the array of overhead characters. But as soon as the string length becomes larger than 16384, the old implementation starts to lose (due to the doubling of the string size, it contains many unused characters).

Test 2. Append () method



Perhaps this is the very method in which the new implementation wins. Since there is no need to double the length of the line when there is not enough memory and no characters are required to copy into it, this method is performed much faster, almost twice (more precisely, 1.8 times).

Test 3. Method Insert ()

We will insert into the string already filled with characters, 1000 characters long.

1. Insert at the beginning of the line



2. Insert in the middle of the line



3. Insert at the end of the line



Comments are unnecessary - the new implementation loses when inserted in any place.

Test 4. Remove () method

We will remove 10 characters from the line already filled with characters until we have exhausted it.



The new implementation wins when deleting almost from any place, since now it is not necessary to shift the characters of the remaining line to the left (more precisely, it is required, but not as often and much as before).

Test 5. Method ToString ()

As mentioned above, this method loses the previous implementation. The previous implementation returned simply a reference to the string with which it operated (the first time it was called), and the new one is forced to assemble the resulting string in pieces, bypassing each element of the linked list.



The new implementation works much slower if the string was formed using the Insert () method, since the list will consist of a set of elements (StringBuilders) with a length of 16 characters.

Test 6. Appeal on a specific index

Given that now StringBuilder is a linked list, the operation of accessing a string at a specific index becomes expensive. Especially if it was formed using the Insert method.



Test 7. Common Scenario: Multiple calls to Append () and then ToString ()

As a rule, we work with this class in a specific scenario: a multiple call to the Append () method, followed by a single ToString () call. The implementation of this class has changed exactly in the calculation for this scenario.



Conclusion


As we have seen, the StringBuilder class in .NET 2.0 has been optimized for the speed of the ToString () method, while in .NET 4.0 it has been optimized for the speed of the Append () method. The new implementation of the Append () method works almost 2 times faster, while the Insert () and ToString () methods are slower. But since we are working with this class in a specific scenario: we call the Append () method multiple times, followed by a single call to the ToString () method, then an increase in performance takes place.

Given the new implementation of the class, in which only a multiple call of the Append () method leads to an increase in performance, the class could now be called StringAppender *)

Source: https://habr.com/ru/post/172689/


All Articles