Is the string operator + simple?

Introduction

The string data type is one of the fundamental types, along with the numeric types (int, long, double) and the logical type (bool). It is hard to imagine a useful program that does not use it.

On the .NET platform, the string type is represented by the immutable class String. It is also tightly integrated into the CLR (Common Language Runtime) and has dedicated support in the C# compiler.

In this article I would like to talk about concatenation, an operation performed on strings as often as addition is performed on numbers. It might seem there is little to say here, since we all know the string operator +, but it turns out to have subtleties of its own.

Language specification about string operator +

The C# language specification provides three overloads of operator + for strings:
    string operator +(string x, string y);
    string operator +(string x, object y);
    string operator +(object x, string y);

If one of the operands of string concatenation is null, an empty string is substituted. Otherwise, any argument that is not a string is converted to its string representation by calling the virtual method ToString. If ToString returns null, an empty string is substituted. Note that, according to the specification, this operation never returns null.
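The null-handling rules above can be checked with a small sketch. The type NullToString and the class NullOperandDemo are hypothetical names introduced only for this illustration, and `#nullable disable` is there only so that the deliberate nulls do not produce warnings on recent compilers.

```csharp
#nullable disable
using System;

// Hypothetical type whose ToString() returns null (illustration only).
sealed class NullToString
{
    public override string ToString() => null;
}

static class NullOperandDemo
{
    // A null string operand is treated as an empty string.
    public static string NullPlusString()
    {
        string left = null;
        return left + "abc";
    }

    // Even when both operands are null, the result is "" rather than null.
    public static string NullPlusNull()
    {
        string left = null;
        string right = null;
        return left + right;
    }

    // A non-string operand is converted with ToString();
    // a null result from ToString() is treated as an empty string.
    public static string StringPlusNullToString()
    {
        object value = new NullToString();
        return "abc" + value;
    }

    static void Main()
    {
        Console.WriteLine(NullPlusString());              // abc
        Console.WriteLine(NullPlusNull() == "");          // True
        Console.WriteLine(StringPlusNullToString());      // abc
    }
}
```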

The description of the operator looks quite clear, but if we look at the implementation of the String class, we find explicit definitions of only two operators: == and !=. A reasonable question arises: what happens behind the scenes of string concatenation? How does the compiler handle the string operator +?

The answer turns out to be fairly simple: take a closer look at the static method String.Concat. It concatenates one or more instances of String, or the string representations of one or more instances of Object. The method has the following overloads:

    public static String Concat(String str0, String str1)
    public static String Concat(String str0, String str1, String str2)
    public static String Concat(String str0, String str1, String str2, String str3)
    public static String Concat(params String[] values)
    public static String Concat(IEnumerable<String> values)
    public static String Concat(Object arg0)
    public static String Concat(Object arg0, Object arg1)
    public static String Concat(Object arg0, Object arg1, Object arg2)
    public static String Concat(Object arg0, Object arg1, Object arg2, Object arg3, __arglist)
    public static String Concat<T>(IEnumerable<T> values)

Suppose we have the following expression s = a + b, where a and b are strings. The compiler converts it to a call to the static method Concat, that is, to

 s = string.Concat(a, b)

String concatenation, like any other addition operation in C#, is left-associative.

With two strings everything is clear, but what if there are more? Given the left-associativity of the operation, the expression s = a + b + c could be replaced by

 s = string.Concat(string.Concat(a, b), c)

however, given the presence of an overload that takes three arguments, it will be converted to

 s = string.Concat(a, b, c)

The same applies to concatenating four strings. For five or more strings there is the overload string.Concat(params string[]), so you need to take into account the overhead of allocating memory for the array.
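As a rough sketch of the five-string case, the two methods below should produce the same result: one uses the operator, the other calls the array overload explicitly. The class and method names are hypothetical, and which overload the compiler actually emits for the operator form may depend on the compiler and runtime version.

```csharp
using System;

static class ConcatOverloadsDemo
{
    // Operator form: with five operands the compiler needs an overload
    // that accepts more than four strings.
    public static string ViaOperator(string a, string b, string c, string d, string e)
        => a + b + c + d + e;

    // Explicit form: the array is allocated first and then passed to
    // string.Concat(params string[]).
    public static string ViaArray(string a, string b, string c, string d, string e)
        => string.Concat(new[] { a, b, c, d, e });

    static void Main()
    {
        Console.WriteLine(ViaOperator("a", "b", "c", "d", "e")); // abcde
        Console.WriteLine(ViaArray("a", "b", "c", "d", "e"));    // abcde
    }
}
```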

Note also that string concatenation is fully associative: the order in which the strings are grouped does not matter. Therefore the expression s = a + (b + c), despite the explicit grouping with parentheses, is treated as

 s = (a + b) + c = string.Concat(a, b, c)

instead of the expected

 s = string.Concat(a, string.Concat(b, c))

To sum up: string concatenation is always evaluated from left to right and is implemented as a call to the static method String.Concat.

Compiler optimizations for literal strings

The C# compiler has optimizations for string literals. For example, the expression s = "a" + "b" + c, which by the left-associativity of the operator + is equivalent to s = ("a" + "b") + c, is converted to

 s = string.Concat("ab", c)

The expression s = c + "a" + "b", despite the left-associativity of the concatenation operation (s = (c + "a") + "b"), is converted to

 s = string.Concat(c, "ab")

In general, no matter where the literals are, the compiler concatenates everything it can, and then it tries to select the appropriate overload of the Concat method. The expression s = a + "b" + "c" + d is converted to

 s = string.Concat(a, "bc", d)

The optimizations for empty and null strings also deserve a mention. The compiler knows that adding an empty string does not affect the result of concatenation, so the expression s = a + "" + b is converted to

 s = string.Concat(a, b),

instead of the expected

 s = string.Concat(a, "", b)

Similarly, for a const string whose value is null, the code

    const string nullStr = null;
    s = a + nullStr + b;

is converted to

 s = string.Concat(a, b)

The expression s = a + nullStr is converted to s = a ?? "" if a is a string, and to a call to string.Concat(a) if a is not a string. For example, s = 17 + nullStr is converted to s = string.Concat(17).
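The generated calls themselves are only visible in a decompiler, but the observable results of these rewrites can be sketched as follows. The class LiteralFoldingDemo and its members are hypothetical names for this illustration, and `#nullable disable` only silences warnings about the deliberate null constant.

```csharp
#nullable disable
using System;

static class LiteralFoldingDemo
{
    const string NullStr = null;

    // An empty literal in the middle does not change the result.
    public static string EmptyInMiddle(string a, string b) => a + "" + b;

    // A null constant in the middle behaves like an empty string.
    public static string NullConstInMiddle(string a, string b) => a + NullStr + b;

    // Behaves like a ?? "": a null operand yields "", never null.
    public static string StringPlusNullConst(string a) => a + NullStr;

    // A non-string operand is still converted to its string form.
    public static string IntPlusNullConst() => 17 + NullStr;

    static void Main()
    {
        Console.WriteLine(EmptyInMiddle("x", "y"));          // xy
        Console.WriteLine(NullConstInMiddle("x", "y"));      // xy
        Console.WriteLine(StringPlusNullConst(null) == "");  // True
        Console.WriteLine(IntPlusNullConst());               // 17
    }
}
```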

There is an interesting consequence of combining literal optimization with the left-associativity of the string operator +.

Consider the expression:

 var s1 = 17 + 17 + "abc";

Given the left-associativity, it is equivalent to

 var s1 = (17 + 17) + "abc"; // string.Concat(34, "abc")

The numbers are added at compile time, so the result is 34abc.

On the other hand, the expression

 var s2 = "abc" + 17 + 17;

is equivalent to

 var s2 = ("abc" + 17) + 17; // string.Concat("abc", 17, 17)

As a result, we get abc1717.

So what looks like the same concatenation operation produces different results.
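The two cases can be put side by side in a short sketch, together with a third, parenthesized variant that forces the numeric addition to happen first. The class MixedOperandsDemo and its method names are hypothetical.

```csharp
using System;

static class MixedOperandsDemo
{
    // (17 + 17) is numeric addition, then the sum is concatenated.
    public static string NumbersFirst() => 17 + 17 + "abc";

    // ("abc" + 17) is already a string, so the second 17 is appended as text.
    public static string StringFirst() => "abc" + 17 + 17;

    // Explicit parentheses restore numeric addition on the right-hand side.
    public static string Parenthesized() => "abc" + (17 + 17);

    static void Main()
    {
        Console.WriteLine(NumbersFirst());   // 34abc
        Console.WriteLine(StringFirst());    // abc1717
        Console.WriteLine(Parenthesized());  // abc34
    }
}
```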

String.Concat vs StringBuilder.Append

A few words should be said about this comparison. Consider the following code:

    string name = "Timur";
    string surname = "Guev";
    string patronymic = "Ahsarbecovich";
    string fio = surname + name + patronymic;

It can be replaced with code using StringBuilder:

    var sb = new StringBuilder();
    sb.Append(surname);
    sb.Append(name);
    sb.Append(patronymic);
    string fio = sb.ToString();

In this situation, however, StringBuilder gives us hardly any benefit. Besides making the code less readable, it also makes it less efficient: the implementation of the Concat method calculates the length of the resulting string and allocates memory only once, whereas StringBuilder knows nothing about the length of the resulting string.
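For comparison, here is a sketch of both approaches as methods, with the StringBuilder given an explicit initial capacity. The class FullNameDemo and its method names are hypothetical. Pre-sizing is one way to avoid buffer growth, but ToString() still has to copy the characters into a new string, so for a fixed, small number of operands the plain operator remains the simpler choice.

```csharp
using System;
using System.Text;

static class FullNameDemo
{
    // The compiler turns this into a single Concat call for three strings.
    public static string WithConcat(string surname, string name, string patronymic)
        => surname + name + patronymic;

    // StringBuilder variant; the capacity argument is our own addition,
    // computed by hand from the operand lengths (operands assumed non-null).
    public static string WithStringBuilder(string surname, string name, string patronymic)
    {
        var sb = new StringBuilder(surname.Length + name.Length + patronymic.Length);
        sb.Append(surname).Append(name).Append(patronymic);
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(WithConcat("Guev", "Timur", "Ahsarbecovich"));        // GuevTimurAhsarbecovich
        Console.WriteLine(WithStringBuilder("Guev", "Timur", "Ahsarbecovich")); // GuevTimurAhsarbecovich
    }
}
```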

The implementation of the Concat method for three strings:

    public static string Concat(string str0, string str1, string str2)
    {
        if (str0 == null && str1 == null && str2 == null)
            return string.Empty;

        // Null arguments are treated as empty strings.
        if (str0 == null) str0 = string.Empty;
        if (str1 == null) str1 = string.Empty;
        if (str2 == null) str2 = string.Empty;

        // Allocate the result string once, using the total length.
        string dest = string.FastAllocateString(str0.Length + str1.Length + str2.Length);

        // Copy each argument into its position in the result.
        string.FillStringChecked(dest, 0, str0);
        string.FillStringChecked(dest, str0.Length, str1);
        string.FillStringChecked(dest, str0.Length + str1.Length, str2);
        return dest;
    }

Java operator +

A few words about the string operator + in Java. Although I do not program in Java, it is still interesting to see how things work there. The Java compiler optimizes the + operator so that it uses the StringBuilder class and calls to the append method.

The previous code is converted to

 String fio = new StringBuilder(String.valueOf(surname)).append(name).append(patronymic).toString();

Note that this optimization was deliberately rejected in C#; Eric Lippert has a post on the topic. The point is that it is not an optimization as such but a rewriting of the code. In addition, the creators of C# believe that developers should know the specifics of working with the String class and switch to StringBuilder when necessary.

By the way, it was Eric Lippert who worked on the C# compiler optimizations related to string concatenation.

Conclusion

At first glance it may seem strange that the String class does not define the + operator, until we consider the optimizations available to a compiler that sees a larger fragment of code. For example, if the + operator were defined in the String class, the expression s = a + b + c + d would create two intermediate strings, whereas a single call to string.Concat(a, b, c, d) performs the join more efficiently.

Source: https://habr.com/ru/post/220921/
