📜 ⬆️ ⬇️

Where do gethashcode's hands grow in .NET

Introduction


This article focuses on the generation of hash codes on the .NET platform. The topic is quite interesting, and I think any self-respecting .NET developer should know it. So let's go!

What is stored in objects besides their fields?


We begin our article by finding out what is stored in objects of the reference type in addition to their fields.

Each object of reference type has a so-called header (Header), which consists of two fields: a pointer to the type of which this object is (MethodTablePointer), as well as a synchronization index (SyncBlockIndex).

What are they needed for?
')
The first field is necessary so that each managed object can provide information about its type at runtime, that is, you cannot issue one type after another, this is done for type safety. Also, this pointer is used to implement dynamic dispatching of methods; in fact, methods of this object are called through it. The Object.GetType method actually returns exactly the MethodTablePointer pointer.

The second field is necessary for a multithreaded environment, namely, in order that each object can be used thread-safe.

When the CLR is loaded, it creates a so-called pool of synchronization blocks, you can say a regular array of these synchronization blocks. When an object needs to work in a multi-threaded environment (this is done using the Monitor.Enter method or a C # lock language construct), the CLR searches for a free synchronization block in its list and writes its index to the same field in the object header. As soon as the object no longer needs a multi-threaded environment, the CLR simply assigns a value of -1 to this field and thereby frees the synchronization block.

Synchronization blocks are in fact a new incarnation of critical sections from C ++. The creators of the CLR considered that it would be too expensive to associate with each managed object the structure of the critical section, given that most of the objects do not use a multi-threaded environment.

For a better understanding of the situation, consider the following picture:


As seen in the picture, ObjectA and ObjectB are of the same type, since their MethodTablePointer points to the same type. ObjectC has a different type. It is also seen that ObjectA and ObjectC involve a pool of synchronization blocks, that is, they actually use a multi-threaded environment. ObjectB does not use the pool because its SyncBlockIndex = -1.

Now after we have considered how objects are stored, we can proceed to the generation of hash codes.

How GetHashCode works with reference types


The fact that the GetHashCode method returns the address of an object on the managed heap is a myth. This can not be, in view of its not consistency, the garbage collector, compacting the heap, displaces objects and accordingly changes all their addresses to them.

It was not for nothing that I started the article by explaining what SyncBlock is, since in the first versions of the framework, the free index of some SyncBlock was used as a hash code of the reference type. Thus, in .NET 1.0 and .NET 1.1, calling the GetHashCode method led to the creation of SyncBlock and entering its index into the object header in the SyncBlockIndex field. As you understand, this is not a very good implementation for the hash function, because firstly, internal structures that are not needed are created, which take up memory + time is wasted creating them, secondly, hash codes will go in a row, that is, they will be predictable. Here is a link to a blog in which one of the CLR developers says that such implementation is bad and that they will change it in the next version.

Since .NET 2.0, the hashing algorithm has changed. Now it uses the manage identifier of the thread in which the method is executed. If you believe the implementation in SSCLI20, then the method looks like this:

inline DWORD GetNewHashCode() { // Every thread has its own generator for hash codes so that we won't get into a situation // where two threads consistently give out the same hash codes. // Choice of multiplier guarantees period of 2**32 - see Knuth Vol 2 p16 (3.2.1.2 Theorem A) DWORD multiplier = m_ThreadId*4 + 5; m_dwHashCodeSeed = m_dwHashCodeSeed*multiplier + 1; return m_dwHashCodeSeed; } 

Thus, each thread has its own generator for hash codes, so we cannot get into a situation where two threads sequentially generate the same hash codes.

As before, the hash code is calculated once and stored in the object header in the SyncBlockIndex field (this is CLR optimization). Now the question arises, what if after calling the GetHashCode method we need to use the synchronization index? Where to write it? And what to do with the hash code?

To answer these questions, consider the structure of SyncBlock.


The first time the GetHashCode method is called, the CLR computes the hash code and puts it in the SyncBlockIndex field. If SyncBlock is associated with the object, that is, the SyncBlockIndex field is used, the CLR writes the hash code into SyncBlock itself, the figure shows the place in the SyncBlock that is responsible for storing the hash code. As soon as SyncBlock is released, the CLR copies the hash code from its body to the header of the SyncBlockIndex object. That's all.

How GetHashCode works for significant types


Now let's talk about how the GetHashCode method works for significant types. I will say in advance, it works quite interestingly.

The first thing to say is that the creators of the CLR recommend always overriding this method with user-defined types, since it may not work very quickly, and its default behavior may not always satisfy you.

In fact, the CLR has two versions of the implementation of the GetHashCode method for significant types, and then which version will be used depends solely on the type itself.

First version:
If the structure does not have reference fields, and there is no free space between its fields, the fast version of the GetHashCode method is used. The CLR is simply xor - um every 4 bytes of the structure and gets the answer. This is a good hash, since the entire contents of the structure are involved. For example, a structure that has a bool type field and an int will have a free space of 3 bytes, since the JIT allocates fields by 4 bytes when placed, and therefore the second version will be used to get the hash code.

By the way, there was a bug in the implementation of this version, which was fixed only in .NET 4. It lies in the fact that the hash code for the decimal type was not calculated correctly.

Consider the code

 decimal d1 = 10.0m; decimal d2 = 10.00000000000000000m; 

From the point of view of numbers, d1 and d2 are equal, but their bit representations differ (due to the decimal representation of the representation). And since CLR xor - it is every 4 bytes (of which only 4 since decimal occupies 16 bytes), different hash codes are obtained. By the way, this bug was manifested not only in decimal, but also in any structure that contains this type, and also uses the fast version to calculate the hash code.

The second version:
If the structure contains reference fields or there is free space between its fields, then a slower version of the method is used. The CLR selects the first structure field, on the basis of which it creates a hash code. This field should be immutable whenever possible, for example, of the string type, otherwise when it is changed, the hash code will also change, and we will not be able to find our structure in the hash table if it was used as a key. It turns out that if the first field of the structure is changeable, then it breaks the standard logic of the GetHashCode method. This is another reason why structures should not be changed. CLR xor-it hash code of this field with a pointer to the type of this field (MethodTablePointer). The CLR does not consider static fields, since a static field can be a field with the same data types, with the result that we fall into infinite recursion.

CLR developers comment to GetHashCode on ValueType:
 /*=================================GetHashCode================================== ** Action: Our algorithm for returning the hashcode is a little bit complex. We look ** for the first non-static field and get it's hashcode. If the type has no ** non-static fields, we return the hashcode of the type. We can't take the ** hashcode of a static member because if that member is of the same type as ** the original type, we'll end up in an infinite loop. **Returns: The hashcode for the type. **Arguments: None. **Exceptions: None. ==============================================================================*/ [MethodImplAttribute(MethodImplOptions.InternalCall)] public extern override int GetHashCode(); // Note that for correctness, we can't use any field of the value type // since that field may be mutable in some way. If we use that field // and the value changes, we may not be able to look up that type in a // hash table. For correctness, we need to use something unique to // the type of this object. // HOWEVER, we decided that the perf of returning a constant value (such as // the hash code for the type) would be too big of a perf hit. We're willing // to deal with less than perfect results, and people should still be // encouraged to override GetHashCode. 


On a note

Structures cannot contain instance fields of their own type. That is, the following code will not compile:

 public struct Node { int data; Node node; } 

This is due to the fact that the structure can not be null. The following code confirms that this is not possible:

 var myNode = new Node(); myNode.node.node.node.node.node.node.node.node.node....... 

However, static fields of their own type are completely allowed, since they are stored in a single copy of the type of this structure. That is, the following code is quite valid.

 public struct Node { int data; static Node node; } 

On a note

To understand the situation, consider the following code:

 var k1 = new KeyValuePair<int, int>(10, 29); var k2 = new KeyValuePair<int, int>(10, 31); Console.WriteLine("k1 - {0}, k2 - {1}", k1.GetHashCode(), k2.GetHashCode()); var v1 = new KeyValuePair<int, string>(10, "abc"); var v2 = new KeyValuePair<int, string>(10, "def"); Console.WriteLine("v1 - {0}, v2 - {1}", v1.GetHashCode(), v2.GetHashCode()); 

In the first case, the structure has no reference fields and there is no free distance between the fields, since the int field takes 4 bytes, so the fast version is used to calculate the hash code, so it will be output to the console:

k1 - 411217769, k2 - 411217771

In the second case, the structure has a reference field (string), so a slower version is used. The CLR selects the field with the int type as the field for generating the hash code, and the string field is simply ignored, with the result that the following will be output to the console:

v1 - 411217780, v2 - 411217780

Now I think it is clear why the CLR developers say that all user-significant data types (and not only significant, but all in general) override the GetHashCode method. Firstly, it may not work very quickly, secondly, to avoid misunderstanding why the hash codes of different objects are equal, as in the second case of the example.

If you do not override the GetHashCode method, you can get a big hit on performance by using a meaningful type as a key in a hash table.

How GetHashCode works with string type


The String class overrides the GetHashCode method. Its implementation in .NET 4.5 looks like this:

GetHashCode X64
 public override unsafe int GetHashCode() { if (HashHelpers.s_UseRandomizedStringHashing) return string.InternalMarvin32HashString(this, this.Length, 0L); fixed (char* chPtr1 = this) { int num1 = 5381; int num2 = num1; char* chPtr2 = chPtr1; int num3; while ((num3 = (int) *chPtr2) != 0) { num1 = (num1 << 5) + num1 ^ num3; int num4 = (int) chPtr2[1]; if (num4 != 0) { num2 = (num2 << 5) + num2 ^ num4; chPtr2 += 2; } else break; } return num1 + num2 * 1566083941; } } 


This is the code for a 64-bit machine, but if we look at the common code with directives

Gethashcode
 public int GetHashCode() { #if FEATURE_RANDOMIZED_STRING_HASHING if(HashHelpers.s_UseRandomizedStringHashing) { return InternalMarvin32HashString(this, this.Length, 0); } #endif // FEATURE_RANDOMIZED_STRING_HASHING unsafe { fixed (char* src = this) { #if WIN32 int hash1 = (5381<<16) + 5381; #else int hash1 = 5381; #endif int hash2 = hash1; #if WIN32 // 32 bit machines. int* pint = (int *)src; int len = this.Length; while (len > 2) { hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1]; pint += 2; len -= 4; } if (len > 0) { hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; } #else int c; char* s = src; while ((c = s[0]) != 0) { hash1 = ((hash1 << 5) + hash1) ^ c; c = s[1]; if (c == 0) break; hash2 = ((hash2 << 5) + hash2) ^ c; s += 2; } #endif return hash1 + (hash2 * 1566083941); } } } 


then we note that there are differences depending on which machine is 32 or 64 bit.

It should be said that the implementation of this method changes with each .NET release. This was written by Eric Lippert. He warned that our code would not in any way save the hashes that are generated in the standard way in the database or on disk, since they are likely to change the implementation in the next .NET release. So it happened during the last 4 releases of .NET.

Implementing hashing on a string type does not imply caching the result. That is, each time you call the GetHashCode method, we will re-calculate the hash code for the string. According to Eric Lippert, this is done to save memory, the extra 4 bytes for each object of the string type are not worth it. Given that the implementation is very fast, I think this is the right decision.

If you noticed, in the implementation of the GetHashCode method, code appeared that was not there before:

 if (HashHelpers.s_UseRandomizedStringHashing) return string.InternalMarvin32HashString(this, this.Length, 0L); 

It turns out that in .NET 4.5 it is possible to calculate the hash code for the strings for each domain. Thus, by setting the attribute value to 1, you can ensure that the hash code is calculated based on the domain in which the method is called. Thus, identical strings in different domains will have different hash codes. The method that generates this hash code is secret and its implementation is not disclosed.

How does GetHashCode work for delegates?


Before proceeding to the delegates' discussion of the implementation of the GetHashCode method, let's talk about how they are implemented.

Each delegate is inherited from the MulticastDelegate class, which in turn is inherited from the Delegate class. This hierarchy has developed historically, since it would be possible to do with one class MulticastDelegate.

The implementation of the GetHashCode method in the Delegate class looks like this

 public override int GetHashCode() { return this.GetType().GetHashCode(); } 

that is, a hash of the delegate type is actually returned. It turns out that delegates of the same type containing different methods for calling always return the same hash code.

As you know, delegates can contain chains of methods, that is, calling one delegate will lead to calling several methods, in which case such an implementation is not suitable because delegates of one type, regardless of the number of methods, would have one hash code, which is not very good, therefore, in MulticastDelegate, the GetHashCode method is redefined in such a way that it uses each method the underlying delegate. However, if the number of methods and the type of delegates do coincide, the hash codes will be the same.

The implementation of the method in the MulticastDelegate class looks like this

 public override sealed int GetHashCode() { if (this.IsUnmanagedFunctionPtr()) return ValueType.GetHashCodeOfPtr(this._methodPtr) ^ ValueType.GetHashCodeOfPtr(this._methodPtrAux); object[] objArray = this._invocationList as object[]; if (objArray == null) return base.GetHashCode(); int num = 0; for (int index = 0; index < (int) this._invocationCount; ++index) num = num * 33 + objArray[index].GetHashCode(); return num; } 

As you know, the delegate stores its methods in the _invocationList list only if there are more than one of them.

If the delegate contains only one method, then in the code above objArray = null and, accordingly, the delegate hash code will be equal to the hash code of the delegate type.

 object[] objArray = this._invocationList as object[]; if (objArray == null) return base.GetHashCode(); 

To clarify the situation, consider the following code.

 Func<int> f1 = () => 1; Func<int> f2 = () => 2; 

the hash codes of these delegates are equal to the hash code of the type Func <int>, that is, they are equal to each other.

 Func<int> f1 = () => 1; Func<int> f2 = () => 2; f1 += () => 3; f2 += () => 4; 

In this case, the delegates' hash codes also coincide, although the methods are different. In this case, the following code is used to calculate the hash code.

 int num = 0; for (int index = 0; index < (int) this._invocationCount; ++index) num = num * 33 + objArray[index].GetHashCode(); return num; 

And the last case

 Func<int> f1 = () => 1; Func<int> f2 = () => 2; f1 += () => 3; f1 += () => 5; f2 += () => 4; 

Hash codes will differ due to the fact that the number of methods for these delegates is not equal (each method has an effect on the resulting hash code).

How GetHashCode works for anonymous types


As you know, anonymous types are a new feature in C # 3.0. Moreover, this is a feature of the language, the so-called syntactic sugar, because the CLR knows nothing about them.

Their GetHashCode method is redefined in such a way that it uses each field. Using this implementation, two anonymous types return the same hash code if and only if all their fields are equal. Such an implementation makes anonymous types well suited for keys in hash tables.

 var newType = new { Name = "Timur", Age = 20, IsMale = true }; 

For this anonymous type, the following code will be generated:

 public override int GetHashCode() { return -1521134295 * (-1521134295 * (-1521134295 * -974875401 + EqualityComparer<string>.Default.GetHashCode(this.Name)) + EqualityComparer<int >.Default.GetHashCode(this.Age)) + EqualityComparer<bool>.Default.GetHashCode(this.IsMale); } 

On a note

Given that the GetHashCode method is overridden by the Equals method, it must also be overridden accordingly.

 var newType = new { Name = "Timur", Age = 20, IsMale = true }; var newType1 = new { Name = "Timur", Age = 20, IsMale = true }; if (newType.Equals(newType1)) Console.WriteLine("method Equals return true"); else Console.WriteLine("method Equals return false"); if (newType == newType1) Console.WriteLine("operator == return true"); else Console.WriteLine("operator == return false"); 

The following will be displayed on the console:

method Equals return true
operator == return false

The thing is that anonymous types override the Equals method in such a way that it checks all fields as it is done with ValueType (only without reflection), but does not override the equality operator. Thus, the Equals method compares by value, while the equality operator compares by reference.

Why was it necessary to override the Equals and GetHashCode methods?
Given that anonymous types were created to simplify working with LINQ, the answer becomes clear. Anonymous types are conveniently used as hash keys in grouping operations (group) and joins in LINQ.

On a note

Conclusion


As you can see the generation of hash codes is not a very simple matter. In order to generate good hashes, you need to make a lot of effort, and the CLR developers had to make many concessions to make our lives easier. It is good to generate the hash code, it is impossible for all cases, therefore it is better to override the GetHashCode method for your custom types, thereby adjusting them to your specific situation.

Thanks for reading! Hope the article has been helpful.

Thanks to the DreamWalker user for kindly providing updated images to the article.

Source: https://habr.com/ru/post/188038/


All Articles