String.Intern makes strings more interesting.

Preface from translator:

When sitting or conducting interviews, you run into questions that probe a general understanding of how .NET works. In my experience, questions about the garbage collector are the most popular of these, but once I was asked about string interning, and frankly, it stumped me. A search of the Russian-language web turned up several articles, but they did not answer the questions I had. I hope my translation of this article by Andrew Stellman (author of the book "Head First C#") fills that gap. I think the material will be useful to beginner .NET developers and to anyone curious about what string interning in .NET is.

String.Intern makes strings more interesting.

One of the first things every novice C# developer encounters is working with strings. I show the basics of working with strings near the beginning of Head First C#, as almost every other C# book does. So it should be no surprise that junior and mid-level C# developers feel they have a pretty good handle on strings. But strings are more interesting than they seem. One of the most interesting aspects of strings in C# and .NET is the String.Intern method, and understanding how it works can improve your C# development skills. In this post, I will give a short tutorial on String.Intern to show you how it works.
Note: At the end of this post, I'm going to show something “under the hood” using ILDasm. If you have never worked with ILDasm before, this will be a good opportunity to get acquainted with a very useful .NET tool.

Some basics of working with strings

Let's start with a brief overview of what to expect from the System.String class. (I will not go into detail - if you would like a post about the basics of strings in .NET, add a comment or contact me at Building Better Software, and I will be happy to discuss a possible article!)

Create a new console application in Visual Studio. (Everything works the same way from the command line if you prefer to compile the code with csc.exe, but to keep things easy to follow, let's stick with Visual Studio.) Here is the code for the Main() method, the entry point of the console application:

Program.cs:

using System;

class Program
{
    static void Main(string[] args)
    {
        string a = "hello world";
        string b = a;
        a = "hello";
        Console.WriteLine("{0}, {1}", a, b);
        Console.WriteLine(a == b);
        Console.WriteLine(object.ReferenceEquals(a, b));
    }
}

There should be no surprises in this code. The program prints three lines to the console (remember, if you are working in Visual Studio, use Ctrl-F5 to start the program outside the debugger; Visual Studio then also adds a "Press any key ..." prompt so that the console window does not close):

hello, hello world
False
False

The first WriteLine() prints the two strings. The second compares them using the equality operator ==, which returns False because the strings do not match. The last one checks whether both variables refer to the same String object. Since they do not, it prints False.
Then add these two lines to the end of the Main() method:

  Console.WriteLine((a + " world") == b);
  Console.WriteLine(object.ReferenceEquals((a + " world"), b));

And again you get a pretty obvious answer. The equality operator returns True, since both strings are equal. But when you concatenate the strings "hello" and " world", the + operator combines them and returns a new instance of System.String. That is why object.ReferenceEquals() quite reasonably returns False: the ReferenceEquals() method returns True only if both arguments refer to the same object.
This is how objects normally work. Two different objects can hold the same values. This behavior is practical and predictable: if you create two "house" objects and set all their properties to the same values, you will have two identical objects of the "house" type, but they will still be different objects.
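
The "house" comparison can be sketched in a few lines. This is only an illustration: the House class, its Color and Floors properties, and the HouseDemo helpers are hypothetical names invented for this sketch, not something from the book.

```csharp
using System;

// Hypothetical class used only for illustration.
class House
{
    public string Color { get; set; } = "";
    public int Floors { get; set; }
}

static class HouseDemo
{
    // True only when both variables point at the very same object.
    public static bool SameObject(House x, House y) => object.ReferenceEquals(x, y);

    // True when the two houses carry the same property values.
    public static bool SameValues(House x, House y) =>
        x.Color == y.Color && x.Floors == y.Floors;

    static void Main()
    {
        var first = new House { Color = "red", Floors = 2 };
        var second = new House { Color = "red", Floors = 2 };
        House alias = first; // a second reference to the first object

        Console.WriteLine(SameValues(first, second)); // True
        Console.WriteLine(SameObject(first, second)); // False
        Console.WriteLine(SameObject(first, alias));  // True
    }
}
```

Two houses built separately match on values but not on identity, and only the alias shares identity with the original. Strings follow the same rule, with == playing the role of the value comparison.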

Does this still seem a bit confusing? If so, I definitely recommend the first few chapters of "Head First C#", which will give you a feel for writing programs, debugging, and using objects and classes. You can download them as free excerpts from the book.
So as long as we are working with string values, everything is fine. But as soon as we start playing with references to strings, things get a little weird.

Something is off about this reference ...

Create a new console application. The code for it is below. But before compiling and running it, look at the code carefully and try to guess what it will print to the console.

Program.cs:

using System;

class Program
{
    static void Main(string[] args)
    {
        string hello = "hello";
        string helloWorld = "hello world";
        string helloWorld2 = hello + " world";
        Console.WriteLine("{0}, {1}: {2}, {3}", helloWorld, helloWorld2,
            helloWorld == helloWorld2,
            object.ReferenceEquals(helloWorld, helloWorld2));
    }
}

Now run the program. Here is what it displays in the console:

hello world, hello world: True, False

This is exactly what we expected. The helloWorld and helloWorld2 variables both hold the string "hello world", so they are equal, but they refer to different objects.
Now add this code to the bottom of your program:

  helloWorld2 = "hello world";
  Console.WriteLine("{0}, {1}: {2}, {3}", helloWorld, helloWorld2,
      helloWorld == helloWorld2,
      object.ReferenceEquals(helloWorld, helloWorld2));

Run it. This time the code will display the following line in the console:

hello world, hello world: True, True

Wait, so now helloWorld and helloWorld2 refer to the same string object? This may seem strange, or at least a little unexpected. We did not change the value of helloWorld2 at all. Many people end up thinking something like this: "The variable was already equal to "hello world". Setting it to "hello world" one more time should not change anything." So what's the deal? Let's figure it out.

What is String.Intern? (diving into the intern pool ...)

When you use strings in C#, the CLR does something clever called string interning: a way to keep just one copy of any given string. If a hundred or, even worse, a million string variables hold the same value, memory for that value would otherwise be allocated again and again. String interning is a way around this problem. The CLR maintains a table called the intern pool. This table contains a single reference to each unique string that is either declared as a literal in your program or added programmatically while it runs. The .NET Framework provides two useful methods for interacting with the intern pool: String.Intern() and String.IsInterned().
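
To make the memory argument concrete, here is a minimal sketch under stated assumptions: the InternPoolDemo class and its BuildCopies and AllSameObject helpers are hypothetical names made up for illustration. It builds many separate string objects with the same text and then replaces each with its interned reference.

```csharp
using System;

static class InternPoolDemo
{
    // Builds 'count' separate string objects that all hold the same characters.
    public static string[] BuildCopies(string value, int count)
    {
        var copies = new string[count];
        for (int i = 0; i < count; i++)
        {
            copies[i] = new string(value.ToCharArray()); // a fresh object every time
        }
        return copies;
    }

    // True when every element refers to the very same string object.
    public static bool AllSameObject(string[] items)
    {
        for (int i = 1; i < items.Length; i++)
        {
            if (!object.ReferenceEquals(items[0], items[i])) return false;
        }
        return true;
    }

    static void Main()
    {
        string[] copies = BuildCopies("duplicate value", 1000);
        Console.WriteLine(AllSameObject(copies)); // False: 1000 separate objects

        for (int i = 0; i < copies.Length; i++)
        {
            copies[i] = String.Intern(copies[i]); // swap each copy for the pooled reference
        }
        Console.WriteLine(AllSameObject(copies)); // True: one shared object
    }
}
```

After the loop, the array holds a thousand references to a single object, and the original copies become eligible for garbage collection.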

The String.Intern() method works in a very simple way. You pass it a string as an argument. If a string with that value is already in the intern pool, the method returns a reference to the pooled string. If not, it adds the string to the pool and returns a reference to it. Here is an example:

  Console.WriteLine(object.ReferenceEquals(
      String.Intern(helloWorld), String.Intern(helloWorld2)));

This code prints True even when helloWorld and helloWorld2 refer to two different string objects, because both contain the string "hello world".
Stop for a minute. String.Intern() is worth a closer look, because the method sometimes gives results that seem illogical at first glance. Here is an example of this behavior:

  string a = new string(new char[] {'a', 'b', 'c'});
  object o = String.Copy(a);
  Console.WriteLine(object.ReferenceEquals(o, a));
  String.Intern(o.ToString());
  Console.WriteLine(object.ReferenceEquals(o, String.Intern(a)));

Running the code prints two lines to the console. The first WriteLine() prints False, which is understandable, since String.Copy() creates a new copy of the string and returns a reference to the new object. The second prints True. But why, after String.Intern(o.ToString()) has been executed, does String.Intern(a) return a reference to o? Stop for a moment to think about it. It gets even more puzzling if you add three more lines:

  object o2 = String.Copy(a);
  String.Intern(o2.ToString());
  Console.WriteLine(object.ReferenceEquals(o2, String.Intern(a)));

These lines seem to do the same thing, only with a new object variable, o2. But this last WriteLine() prints False. So what is going on?

This little puzzle helps us figure out what goes on under the hood of String.Intern() and the intern pool. The first thing to understand is that calling ToString() on a string object always returns a reference to that same object. The variable o points to a string object containing the value "abc", so calling its ToString() method returns a reference to that very string. So here is what happens.
At the start, a points to string object #1, which contains "abc". The variable o points to a different object, string object #2, which also contains "abc". The call to String.Intern(o.ToString()) adds a reference to string #2 to the intern pool. Now that string object #2 is in the intern pool, any call to String.Intern() with an "abc" argument returns a reference to string object #2.
Therefore, when you pass the variable o and String.Intern(a) to ReferenceEquals(), it returns True, because String.Intern(a) returned a reference to string object #2. Next we created a new variable, o2, and used String.Copy() to create another object of type String. This is string object #3, which also contains "abc". The call to String.Intern(o2.ToString()) adds nothing to the intern pool this time, because "abc" is already there; instead it returns a reference to string #2.
So this Intern() call actually returns a reference to string #2, but we discard it instead of assigning it to a variable. We could have written something like string q = String.Intern(o2.ToString()), which would make the variable q a reference to string object #2. That is why the last WriteLine() prints False: it compares a reference to string #3 with a reference to string #2.
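
The point about the discarded return value can be sketched as follows. This is a hedged illustration, not the article's original code: InternReturnDemo and FreshCopy are hypothetical names, a GUID is used so that the text is practically guaranteed not to be in the pool already, and new string(char[]) stands in for String.Copy(), which recent .NET versions flag as obsolete.

```csharp
using System;

static class InternReturnDemo
{
    // Returns a brand-new string object with the same characters as the input.
    public static string FreshCopy(string source) => new string(source.ToCharArray());

    static void Main()
    {
        string first = Guid.NewGuid().ToString(); // string object #1, built at run time
        string second = FreshCopy(first);         // string object #2, same characters

        string pooled = String.Intern(first);     // #1 is added to the pool and returned
        string q = String.Intern(second);         // #2 is NOT added; the pool hands back #1

        Console.WriteLine(object.ReferenceEquals(pooled, first)); // True
        Console.WriteLine(object.ReferenceEquals(q, first));      // True
        Console.WriteLine(object.ReferenceEquals(q, second));     // False
    }
}
```

Whichever object reaches the pool first becomes the canonical one. Later calls with equal text return that object, so the result of Intern() has to be kept if you want the pooled reference.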

Use String.IsInterned() to check whether a string is in the intern pool.

There is another, somewhat paradoxically named method that is useful when dealing with interned strings: String.IsInterned(). It takes a reference to a string object. If a string with that value is in the intern pool, it returns a reference to the interned string; if not, it returns null.
Its name sounds a bit illogical because the method starts with "Is" but does not return a Boolean, as many programmers would expect.
When you use IsInterned() and want to show that a string is not in the intern pool, the null-coalescing operator ?? comes in handy. For example:

  string o = String.IsInterned(str) ?? "not interned";

Now the variable o receives the result of IsInterned() if it is not null, or the string "not interned" if the string is not in the intern pool.
Without this, Console.WriteLine() would print an empty line (which is what it does when it is given null).
Here is a simple example of how String.IsInterned() works:

  string s = new string(new char[] {'x', 'y', 'z'});
  Console.WriteLine(String.IsInterned(s) ?? "not interned");
  String.Intern(s);
  Console.WriteLine(String.IsInterned(s) ?? "not interned");
  Console.WriteLine(object.ReferenceEquals(
      String.IsInterned(new string(new char[] { 'x', 'y', 'z' })), s));

The first WriteLine() statement prints "not interned" to the console, because "xyz" is not yet in the intern pool. The second WriteLine() statement prints "xyz", because the intern pool now contains "xyz". And the third WriteLine() prints True, since s points to the very object that was added to the intern pool.
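
Since IsInterned() returns a string or null rather than a Boolean, it can help to wrap it. The sketch below assumes a hypothetical helper named IsInPool (not part of .NET) and uses a GUID so the test string cannot already be in the pool.

```csharp
using System;

static class IsInternedDemo
{
    // Hypothetical helper: turns the "string or null" result into a real bool.
    public static bool IsInPool(string value) => String.IsInterned(value) != null;

    static void Main()
    {
        string s = Guid.NewGuid().ToString(); // built at run time, so not in the pool yet

        Console.WriteLine(IsInPool(s)); // False
        String.Intern(s);               // s itself becomes the pooled object
        Console.WriteLine(IsInPool(s)); // True
        Console.WriteLine(object.ReferenceEquals(String.IsInterned(s), s)); // True
    }
}
```

The helper answers only "is there a pooled string with this text?". To learn whether the pooled object is the same one you are holding, you still need ReferenceEquals(), as in the last line.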

Literals are interned automatically.

Add just one line to the end of the method and run the program again:

  Console.WriteLine(object.ReferenceEquals("xyz", s));

Something completely unexpected happens!
The program never prints "not interned", and the last two WriteLine() calls print False! If you comment out the last line, the program behaves exactly as you would expect. Why?! How did adding code at the end of the program change the behavior of the code above it? This is very, very strange!

It seems really strange the first time you come across it, but it does make sense. The behavior of the whole program changes because the code now contains the literal "xyz". When you add a literal to your program, the CLR automatically adds it to the intern pool even before the program starts. Comment out that line, and you remove the literal from the program, so the intern pool no longer contains the string "xyz".
Once you realize that "xyz" is already in the intern pool when the program starts, because it appears in the code as a literal, the change in behavior immediately makes sense. String.IsInterned(s) no longer returns null. Instead, it returns a reference to the literal "xyz", which also explains why ReferenceEquals() returns False: the string s is never added to the intern pool, because the pool already holds "xyz", pointing to a different object.
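
Here is a small sketch of the "literal wins" rule. LiteralDemo and FromChars are hypothetical names for illustration. Exactly when the runtime puts a literal into the pool is an implementation detail that may differ between runtimes, so the sketch executes the literal assignment before any pool lookup to keep the result independent of that timing.

```csharp
using System;

static class LiteralDemo
{
    // Builds "xyz" from characters, producing an object separate from any literal.
    public static string FromChars() => new string(new[] { 'x', 'y', 'z' });

    static void Main()
    {
        string literal = "xyz";     // literal: the runtime interns it
        string built = FromChars(); // different object, same text

        string pooled = String.Intern(built);

        Console.WriteLine(object.ReferenceEquals(pooled, literal)); // True
        Console.WriteLine(object.ReferenceEquals(pooled, built));   // False
        Console.WriteLine(object.ReferenceEquals(String.IsInterned(built), literal)); // True
    }
}
```

Because the literal already owns the "xyz" slot in the pool, Intern() and IsInterned() both return the literal's object, and the string built from characters never becomes the pooled one.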

The compiler is smarter than you think!

Change the last line of code to this:

  Console.WriteLine( object.ReferenceEquals("x" + "y" + "z", s));

Run the program. It works exactly as if you had used the literal "xyz"! But isn't + an operator? Isn't it a method that the CLR executes at run time? If so, you would expect the concatenation to produce a new string at run time, and then "xyz" would not be interned as a literal.
In fact, that is exactly what happens if you replace "x" + "y" + "z" with String.Format("{0}{1}{2}", 'x', 'y', 'z'). Both expressions return "xyz". So why does concatenation with the + operator behave as if you had used the "xyz" literal, while String.Format() runs at run time?
The easiest way to answer this question is to see what we actually get when the code "x" + "y" + "z" is compiled.

Program.cs:

using System;

class Program
{
    public static void Main()
    {
        Console.WriteLine("x" + "y" + "z");
    }
}

The next step is to look at what the compiler built into the executable. For this we will use ILDasm.exe, the MSIL disassembler. This tool is installed with every version of Visual Studio (including the Express editions). Even if you do not know how to read IL, you will be able to understand what is going on.

Run ILDasm.exe. If you are using a 64-bit version of Windows, run the following command: "%ProgramFiles(x86)%\Microsoft SDKs\Windows\v7.0A\Bin\ildasm.exe" (including the quotes), either from the Start >> Run window or from the command line. If you are using a 32-bit version of Windows, run this command instead: "%ProgramFiles%\Microsoft SDKs\Windows\v7.0A\Bin\ildasm.exe".

If you have .NET Framework 3.5 or earlier

If you have .NET Framework 3.5 or earlier, you may need to look for ildasm.exe in neighboring folders. Open an Explorer window and navigate to the Program Files folder. The program is usually located in the Microsoft SDKs\Windows\vX.X\bin folder. Alternatively, open the "Visual Studio Command Prompt" from the Start menu and type "ILDASM" to launch it.

[Screenshot: the ILDasm window at first launch]

Then compile your code into an executable. Click the project in Solution Explorer; the Properties window should show a Project Folder field. Double-click it and copy the path. Switch to the ILDasm window, choose File >> Open from the menu, and paste the folder path. Then go to the bin folder. Your executable should be in either the bin\Debug or the bin\Release folder. Open it, and ILDasm will show you the contents of the assembly.

(If you need a refresher on how assemblies are built, see this post on understanding C# and .NET assemblies and namespaces.)
Expand the Program class and double-click the Main() method. The disassembled code of the method should appear.

You do not need to know IL to spot the literal "xyz" in the code. If you close ILDasm and then change the code to use "xyz" instead of "x" + "y" + "z", the disassembled IL looks exactly the same! This is because the compiler is smart enough to replace "x" + "y" + "z" with "xyz" at compile time, so no extra operations are spent on method calls that would always return "xyz". And when a literal is compiled into a program, the CLR adds it to the intern pool when the program starts.
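
The same effect can be observed without a disassembler. The sketch below is an illustration under stated assumptions: FoldingDemo, Folded, and Concatenated are hypothetical names, and the second method takes its parts as parameters so the compiler cannot fold them into a constant.

```csharp
using System;

static class FoldingDemo
{
    // Constant expression: the C# compiler folds it into the single literal "xyz".
    public static string Folded() => "x" + "y" + "z";

    // The parts arrive as parameters, so the concatenation happens at run time.
    public static string Concatenated(string left, string right) => left + right;

    static void Main()
    {
        string literal = "xyz";

        Console.WriteLine(object.ReferenceEquals(Folded(), literal));               // True
        Console.WriteLine(object.ReferenceEquals(Concatenated("x", "yz"), literal)); // False
        Console.WriteLine(Concatenated("x", "yz") == literal);                       // True
    }
}
```

The folded expression and the plain literal end up as the same interned object, while the run-time concatenation yields an equal but separate string.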

The material in this article should give you a good idea of how string interning works in C# and .NET. In fact, it is more than you strictly need to understand string interning. If you are interested in learning more, a good starting point is the "Performance Considerations" section of the String.Intern page on MSDN.

PS: Thanks to the team for diligent reading and objective criticism of the translation.

Source: https://habr.com/ru/post/224281/
