📜 ⬆️ ⬇️

C # string comparison (default)

It often happens that we combine 2 collections or group a collection using LINQ to Objects . When this happens, the keys selected for grouping or binding are compared.
Fortunately, the cost of these operations is O (n). But in the case of large collections, the effectiveness of the comparison itself is important to us. If strings are selected as keys, which of the comparison implementations will be used by default, is this implementation suitable for your strings, and can you specify the IEqualityComparer <string> explicitly to make this operation faster?
clients.Join(orders, c => c.Name, o => o.ClientName, (c, o) => CreateOrederDto(c, o)); 

How is the implementation of the comparator chosen, if the user does not explicitly specify it?

In the source code of the Join method, you can see the following behavior:
 public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(this IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector) { if (outer == null) throw Error.ArgumentNull("outer"); if (inner == null) throw Error.ArgumentNull("inner"); if (outerKeySelector == null) throw Error.ArgumentNull("outerKeySelector"); if (innerKeySelector == null) throw Error.ArgumentNull("innerKeySelector"); if (resultSelector == null) throw Error.ArgumentNull("resultSelector"); return JoinIterator<TOuter, TInner, TKey, TResult>(outer, inner, outerKeySelector, innerKeySelector, resultSelector, null); } static IEnumerable<TResult> JoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector, IEqualityComparer<TKey> comparer) { Lookup<TKey, TInner> lookup = Lookup<TKey, TInner>.CreateForJoin(inner, innerKeySelector, comparer); foreach (TOuter item in outer) { Lookup<TKey, TInner>.Grouping g = lookup.GetGrouping(outerKeySelector(item), false); if (g != null) { for (int i = 0; i < g.count; i++) { yield return resultSelector(item, g.elements[i]); } } } } 

Well, null is passed to the JoinIterator method, there are no checks inside, and the null value is passed as a parameter when the Lookup is created to the CreateForJoin method.

Using Lookup is rarely found explicitly. This class is a collection with access to elements by key, and several elements can be stored for each key, and in the case of an attempt to access by a non-existing key, an empty collection will simply be returned .

We are interested in the CreateForJoin method:
 internal static Lookup<TKey, TElement> CreateForJoin(IEnumerable<TElement> source, Func<TElement, TKey> keySelector, IEqualityComparer<TKey> comparer) { Lookup<TKey, TElement> lookup = new Lookup<TKey, TElement>(comparer); ... } Lookup(IEqualityComparer<TKey> comparer) { if (comparer == null) comparer = EqualityComparer<TKey>.Default; this.comparer = comparer; groupings = new Grouping[7]; // 7 -  ,    7 } 

')

EqualityComparer


The IEqualityComparer <T> generic interface has a basic implementation as an abstract class EqualityComparer <T> , which in turn has the static Default property for selecting the default comparator of a particular class T. That is, for any T, class, structure, or simple type, The current object comparator of this type is selected.
The complete selection code depending on the type is rather florid, but interesting from the point of view of understanding the work of our Join and GroupBy.
  1. For byte will be selected its own special ByteEqualityComparer. This class is designed to improve performance when comparing arrays of bytes, because it contains an IndexOf implementation for a byte array.
  2. If T implements the IEquatable <T> interface, then a GenericEqualityComparer <T> will be selected, which compares the objects based on a call to their implementations of the IEquatable <T>. Equals (T) method. In addition, before calling, both parameters will be checked for null inequality.
  3. If T is a Nullable <U>, the value of which implements IEquatable <U>, the NullableEqualityComparer <T> class will be used, similar to the previous one and containing additional HasValue checks.
  4. For enumerations, depending on the base type, one of the EnumEqualityComparer <T> implementations will be selected (for the long type there is its own special implementation), which differ only in JIT optimizations of how the enumeration value will be reduced to a numeric value.
  5. In all other cases, an ObjectEqualityComparer <T> is used, which compares objects based on Object.Equals . Here, everything is as usual - for reference types, the equality of references is checked, for significant types, the coincidence of object types (in the case of the ObjectEqualityComparer <T> is always true) and the coincidence of the values ​​of all object fields.

The String class implements the IEquatable <T> interface and therefore when comparing strings, the implementation of this interface will be called.
For .NET Core, the implementation of the default selection of a comparator is different - in the case of strings, EqualityComparerForString will be selected, which uses only the equality operator == when comparing.

String comparison


The String.Equals (string value) method checks the equality of the string references, the equality of the length of the strings, and if equality is not calculated based on these properties, it causes (almost) a byte-by-byte string buffer comparison :
 [System.Security.SecuritySafeCritical] // auto-generated [ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)] private unsafe static bool EqualsHelper(String strA, String strB) { int length = strA.Length; fixed (char* ap = &strA.m_firstChar) fixed (char* bp = &strB.m_firstChar) { char* a = ap; char* b = bp; //   #if AMD64 //   AMD64   12    3 qword  . while (length >= 12) { if (*(long*)a != *(long*)b) return false; if (*(long*)(a+4) != *(long*)(b+4)) return false; if (*(long*)(a+8) != *(long*)(b+8)) return false; a += 12; b += 12; length -= 12; } #else while (length >= 10) { if (*(int*)a != *(int*)b) return false; if (*(int*)(a+2) != *(int*)(b+2)) return false; if (*(int*)(a+4) != *(int*)(b+4)) return false; if (*(int*)(a+6) != *(int*)(b+6)) return false; if (*(int*)(a+8) != *(int*)(b+8)) return false; a += 10; b += 10; length -= 10; } #endif //          \0 while (length > 0) { if (*(int*)a != *(int*)b) break; a += 2; b += 2; length -= 2; } return (length <= 0); } } 

To improve performance, Microsoft even unwound cycles .
So, by default, byte-by-byte comparison of the contents of these strings is used to compare strings . Effectively? Probably. How else can you compare strings?

StringComparer


The String class in .net has its own abstract implementation of the StringComparer comparator and a number of classes inherited from it, access to instances of which can be obtained through the static properties of this class. There are 6 implementations of comparing strings of 3 types with and without case-sensitive characters each:

Since StringComparer implements IEqualityComparer <string>, all its descendants can be specified as a parameter where IEqualityComparer <string> is expected.
I have repeatedly met with descriptions of these methods of comparison, but I never really thought about what it means to “compare by words with regard to current culture”.
Will the byte comparison be identical to that when choosing EqualityComparer <String> .Default or explicitly calling Equals? In this case, the OrdinalComparer class is responsible for comparing :
 public override bool Equals(string x, string y) { if (Object.ReferenceEquals(x ,y)) return true; if (x == null || y == null) return false; if( _ignoreCase) { if( x.Length != y.Length) { return false; } return (String.Compare(x, y, StringComparison.OrdinalIgnoreCase) == 0); } return x.Equals(y); } 

In the case of case-sensitive registration, the difference lies in duplicating the checks for the equality of object references and checking for null. This is not very much, although from the point of view of MSIL it is a couple dozen instructions:
  IL_0000: nop IL_0001: ldarg.0 IL_0002: ldarg.1 IL_0003: call bool [mscorlib]System.Object::ReferenceEquals(object, object) IL_0008: ldc.i4.0 IL_0009: ceq IL_000b: stloc.1 IL_000c: ldloc.1 IL_000d: brtrue.s IL_0013 IL_000f: ldc.i4.1 IL_0010: stloc.0 IL_0011: br.s IL_003e IL_0013: ldarg.0 IL_0014: brfalse.s IL_001f IL_0016: ldarg.1 IL_0017: ldnull IL_0018: ceq IL_001a: ldc.i4.0 IL_001b: ceq IL_001d: br.s IL_0020 IL_001f: ldc.i4.0 IL_0020: nop IL_0021: stloc.1 IL_0022: ldloc.1 IL_0023: brtrue.s IL_0029 IL_0025: ldc.i4.0 IL_0026: stloc.0 IL_0027: br.s IL_003e 

How about saving case insensitive? Frankly, I did not expect to see something like ToLower , since this operation depends on the culture. But the result still exceeded expectations. To call String.Compare (x, y, StringComparison.OrdinalIgnoreCase), the following code branch is executed:
 case StringComparison.OrdinalIgnoreCase: if (this.Length != value.Length) return false; //      ASCII ,      . if (this.IsAscii() && value.IsAscii()) { return (CompareOrdinalIgnoreCaseHelper(this, value) == 0); } #if FEATURE_COREFX_GLOBALIZATION return CompareInfo.CompareOrdinalIgnoreCase(strA, 0, strA.Length, strB, 0, strB.Length); #else //   return TextInfo.CompareOrdinalIgnoreCase(strA, strB); #endif 

I wonder how much worse a slow solution should be so that IsAscii is justified?
In the case of ASCII strings, validation is carried out really character-by-character, each character is checked for a register and, if necessary, converted to upper case by simple subtraction of 0x20.
 private unsafe static int CompareOrdinalIgnoreCaseHelper(String strA, String strB) { int length = Math.Min(strA.Length, strB.Length); fixed (char* ap = &strA.m_firstChar) fixed (char* bp = &strB.m_firstChar) { char* a = ap; char* b = bp; while (length != 0) { int charA = *a; int charB = *b; Contract.Assert((charA | charB) <= 0x7F, "strings have to be ASCII"); //            if ((uint)(charA - 'a') <= (uint)('z' - 'a')) charA -= 0x20; if ((uint)(charB - 'a') <= (uint)('z' - 'a')) charB -= 0x20; if (charA != charB) return charA - charB; //   a++; b++; length--; } return strA.Length - strB.Length; } } 

For non-ASCII strings, native C ++ code is called and then, depending on the conditions, a string comparison method from the Windows kernel can be invoked or (judging by the comments, this is possible only for Windows XP) a method that performs the same character-by-character comparison, translating each character into the upper case based on the operating system character tables.

This is indeed of interest to the issue of performance, since the performance of the same comparator may differ depending on the input data, in a degenerate case, from changing one character per line.

What about cultures? The CultureAwareComparer class accepts a culture as input, on the basis of which it will compare strings and a flag indicating whether the case of characters is ignored. Culture information contains the CompareInfo property, the object of which contains methods for comparing strings based on a given culture, which are used in the CultureAwareComparer.
 return (_compareInfo.Compare(x, y, _ignoreCase? CompareOptions.IgnoreCase : CompareOptions.None) == 0); 

Unfortunately, inside there is nothing interesting, because in fact the native code is called, which again climbs into the kernel to call the string sorting function. To compensate for the lack of code, here's an excerpt for you, which is repeatedly found in the coreclr string functions:
 // TODO: Remove this workaround after Vista SP2 &/or turkic CompareStringEx() gets fixed on Vista. // If its Vista and we want a turkik sort, then call CompareStringW not CompareStringEx LPCWSTR pLingLocaleName = AvoidVistaTurkishBug ? GetLingusticLocaleName((LPWSTR)lpLocaleName, dwCmpFlags) : lpLocaleName; // TODO: End of workaround for turkish CompareStringEx() on Vista/Win2K8 

That is, the comparison of .net strings with regard to culture and language always takes place at the kernel level of the operating system .

Performance tests


After such a trip, I could not help but be interested in the real difference in performance. Initially, my scenario was to combine sequences, since the result is a large number of comparisons. For a clean test, I left only the string comparison without additional operations. The source code of the test can be viewed or grabbed on GitHub . The results floated a little, but I decided that 10,000 iterations of 1,000,000 would be sufficient without a confidence interval.
ScenarioMilliseconds / 1,000,000 operationsRelative difference
string.Equals25.81x
EqualityComparer <string> .Default33.51.3x
StringComparer.Ordinal29.81.16x
StringComparer.OrdinalIgnoreCase50.31.95x
StringComparer.OrdinalIgnoreCase non ASCII82.23.19x
StringComparer.CurrentCulture1365.27x
StringComparer.CurrentCulture non ASCII174.36.76x
StringComparer.CurrentCultureIgnoreCase134.55.21x
StringComparer.CurrentCultureIgnoreCase non ASCII172.16.67x
StringComparer.InvariantCulture132.25.12x
StringComparer.InvariantCulture non ASCII189.57.34x
StringComparer.InvariantCultureIgnoreCase134.15.2x
StringComparer.InvariantCultureIgnoreCase non ASCII1887.29x
The results confirm the code — an explicit call to string.Equals is the fastest, GenericEqualityComparer <string> is slower due to additional checks of input parameters. OrdinalComparer also has additional checks. And then either unwound cycles are invoked, or methods of unmanaged code that, generally speaking, behave differently on different platforms, but in the case of working with culture, call methods from the operating system kernel.

Comparison in other operations on strings


It would seem that by default the fastest and simplest comparison is used, the programmer can not worry. In fact, everything is not quite so. There are a number of row operations. For example, determining whether a string begins with a specific substring :
 public Boolean StartsWith(String value) { return StartsWith(value, StringComparison.CurrentCulture); } 

At the same time, checking the occurrence of a substring (it would seem the same thing) is again performed byte-by-byte:
 public bool Contains( string value ) { return ( IndexOf(value, StringComparison.Ordinal) >=0 ); } 

And checking the occurrence of a substring that returns the index of the beginning of this substring is again with the current culture:
 public int IndexOf(String value) { return IndexOf(value, StringComparison.CurrentCulture); } 

LastIndexOf generally calls native code. What is going on in it? Only Sathya knows.

findings


What to do with all this information?

Source: https://habr.com/ru/post/274491/


All Articles