It often happens that we combine 2 collections or group a collection using
LINQ to Objects . When this happens, the keys selected for grouping or binding are compared.
Fortunately, the cost of these operations is O (n). But in the case of large collections, the effectiveness of the comparison itself is important to us. If strings are selected as keys, which of the comparison implementations will be used by default, is this implementation suitable for your strings, and can you
specify the IEqualityComparer <string> explicitly to make this operation faster?
clients.Join(orders, c => c.Name, o => o.ClientName, (c, o) => CreateOrederDto(c, o));
How is the implementation of the comparator chosen, if the user does not explicitly specify it?
In the
source code of the Join method, you can see the following behavior:
public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(this IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector) { if (outer == null) throw Error.ArgumentNull("outer"); if (inner == null) throw Error.ArgumentNull("inner"); if (outerKeySelector == null) throw Error.ArgumentNull("outerKeySelector"); if (innerKeySelector == null) throw Error.ArgumentNull("innerKeySelector"); if (resultSelector == null) throw Error.ArgumentNull("resultSelector"); return JoinIterator<TOuter, TInner, TKey, TResult>(outer, inner, outerKeySelector, innerKeySelector, resultSelector, null); } static IEnumerable<TResult> JoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector, IEqualityComparer<TKey> comparer) { Lookup<TKey, TInner> lookup = Lookup<TKey, TInner>.CreateForJoin(inner, innerKeySelector, comparer); foreach (TOuter item in outer) { Lookup<TKey, TInner>.Grouping g = lookup.GetGrouping(outerKeySelector(item), false); if (g != null) { for (int i = 0; i < g.count; i++) { yield return resultSelector(item, g.elements[i]); } } } }
Well, null is passed to the JoinIterator method, there are no checks inside, and the null value is passed as a parameter when the Lookup is created to the CreateForJoin method.
Using Lookup is rarely found explicitly. This class is a collection with access to elements by key, and several elements can be stored for each key, and in the case of an attempt to access by a non-existing key, an empty collection will simply be returned .
We are interested in the
CreateForJoin method:
internal static Lookup<TKey, TElement> CreateForJoin(IEnumerable<TElement> source, Func<TElement, TKey> keySelector, IEqualityComparer<TKey> comparer) { Lookup<TKey, TElement> lookup = new Lookup<TKey, TElement>(comparer); ... } Lookup(IEqualityComparer<TKey> comparer) { if (comparer == null) comparer = EqualityComparer<TKey>.Default; this.comparer = comparer; groupings = new Grouping[7];
')
EqualityComparer
The
IEqualityComparer <T> generic interface has a basic implementation as an abstract class
EqualityComparer <T> , which in turn has the static
Default property for selecting the default comparator of a particular class T. That is, for any T, class, structure, or simple type, The current object comparator of this type is selected.
The complete selection code depending on the type is rather florid, but interesting from the point of view of understanding the work of our Join and GroupBy.
- For byte will be selected its own special ByteEqualityComparer. This class is designed to improve performance when comparing arrays of bytes, because it contains an IndexOf implementation for a byte array.
- If T implements the IEquatable <T> interface, then a GenericEqualityComparer <T> will be selected, which compares the objects based on a call to their implementations of the IEquatable <T>. Equals (T) method. In addition, before calling, both parameters will be checked for null inequality.
- If T is a Nullable <U>, the value of which implements IEquatable <U>, the NullableEqualityComparer <T> class will be used, similar to the previous one and containing additional HasValue checks.
- For enumerations, depending on the base type, one of the EnumEqualityComparer <T> implementations will be selected (for the long type there is its own special implementation), which differ only in JIT optimizations of how the enumeration value will be reduced to a numeric value.
- In all other cases, an ObjectEqualityComparer <T> is used, which compares objects based on Object.Equals . Here, everything is as usual - for reference types, the equality of references is checked, for significant types, the coincidence of object types (in the case of the ObjectEqualityComparer <T> is always true) and the coincidence of the values ​​of all object fields.
The
String class implements the IEquatable <T> interface and therefore when comparing strings, the implementation of this interface will be called.
For .NET Core, the implementation of the default selection of a comparator is different - in the case of strings, EqualityComparerForString will be selected, which uses only the equality operator == when comparing.
String comparison
The
String.Equals (string value) method checks the equality of the string references, the equality of the length of the strings, and if equality is not calculated based on these properties, it causes (almost) a
byte-by-byte string buffer comparison :
[System.Security.SecuritySafeCritical]
To improve performance, Microsoft even
unwound cycles .
So,
by default, byte-by-byte comparison of the contents of these strings is used to compare strings . Effectively? Probably. How else can you compare strings?
StringComparer
The String class in .net has its own abstract implementation of the
StringComparer comparator and a number of classes inherited from it, access to instances of which can be obtained through the static properties of this class. There are 6 implementations of comparing strings of 3 types with and without case-sensitive characters each:
- CurrentCulture / CurrentCultureIgnoreCase - comparison by words, taking into account the rules of the current culture and language (culture of the current Thread)
- InvariantCulture / InvariantCultureIgnoreCase - word comparison without taking into account the rules of language and culture and language (used by CultureInfo.InvariantCulture)
- Ordinal / OrdinalIgnoreCase - byte comparison
Since StringComparer implements IEqualityComparer <string>, all its descendants can be specified as a parameter where IEqualityComparer <string> is expected.
I have repeatedly met with descriptions of these methods of comparison, but I never really thought about what it means to “compare by words with regard to current culture”.
Will the byte comparison be identical to that when choosing EqualityComparer <String> .Default or explicitly calling Equals? In this case, the
OrdinalComparer class is responsible for
comparing :
public override bool Equals(string x, string y) { if (Object.ReferenceEquals(x ,y)) return true; if (x == null || y == null) return false; if( _ignoreCase) { if( x.Length != y.Length) { return false; } return (String.Compare(x, y, StringComparison.OrdinalIgnoreCase) == 0); } return x.Equals(y); }
In the case of case-sensitive registration, the difference lies in duplicating the checks for the equality of object references and checking for null. This is not very much, although from the point of view of MSIL it is a couple dozen instructions:
IL_0000: nop IL_0001: ldarg.0 IL_0002: ldarg.1 IL_0003: call bool [mscorlib]System.Object::ReferenceEquals(object, object) IL_0008: ldc.i4.0 IL_0009: ceq IL_000b: stloc.1 IL_000c: ldloc.1 IL_000d: brtrue.s IL_0013 IL_000f: ldc.i4.1 IL_0010: stloc.0 IL_0011: br.s IL_003e IL_0013: ldarg.0 IL_0014: brfalse.s IL_001f IL_0016: ldarg.1 IL_0017: ldnull IL_0018: ceq IL_001a: ldc.i4.0 IL_001b: ceq IL_001d: br.s IL_0020 IL_001f: ldc.i4.0 IL_0020: nop IL_0021: stloc.1 IL_0022: ldloc.1 IL_0023: brtrue.s IL_0029 IL_0025: ldc.i4.0 IL_0026: stloc.0 IL_0027: br.s IL_003e
How about saving case insensitive? Frankly, I did not expect to see something like
ToLower , since this operation depends on the culture. But the result still exceeded expectations. To call String.Compare (x, y, StringComparison.OrdinalIgnoreCase), the following
code branch is executed:
case StringComparison.OrdinalIgnoreCase: if (this.Length != value.Length) return false;
I wonder how much worse a slow solution should be so that IsAscii is justified?
In the case of ASCII strings, validation is carried out really character-by-character, each character is checked for a register and, if necessary, converted to upper case by simple subtraction of 0x20.
private unsafe static int CompareOrdinalIgnoreCaseHelper(String strA, String strB) { int length = Math.Min(strA.Length, strB.Length); fixed (char* ap = &strA.m_firstChar) fixed (char* bp = &strB.m_firstChar) { char* a = ap; char* b = bp; while (length != 0) { int charA = *a; int charB = *b; Contract.Assert((charA | charB) <= 0x7F, "strings have to be ASCII");
For non-ASCII strings,
native C ++
code is called and then, depending on the conditions, a string comparison method from the Windows kernel can be invoked or (judging by the comments, this is possible only for Windows XP) a
method that performs the same character-by-character comparison, translating each character into the upper case based on the operating system character tables.
This is indeed of interest to the issue of performance, since the performance of the same comparator may differ depending on the input data, in a degenerate case, from changing one character per line.
What about cultures? The
CultureAwareComparer class accepts a culture as input, on the basis of which it will compare strings and a flag indicating whether the case of characters is ignored. Culture information contains the
CompareInfo property, the object of which contains methods for comparing strings based on a given culture, which are used in the CultureAwareComparer.
return (_compareInfo.Compare(x, y, _ignoreCase? CompareOptions.IgnoreCase : CompareOptions.None) == 0);
Unfortunately, inside there is nothing interesting, because in fact the
native code is called, which again climbs into the kernel to call the string sorting function. To compensate for the lack of code, here's an excerpt for you, which is repeatedly found in the coreclr string functions:
That is, the
comparison of .net strings with regard to culture and language always takes place at the kernel level of the operating system .
Performance tests
After such a trip, I could not help but be interested in the real difference in performance. Initially, my scenario was to combine sequences, since the result is a large number of comparisons. For a clean test, I left only the string comparison without additional operations. The source code of the test can be viewed or grabbed on
GitHub . The results floated a little, but I decided that 10,000 iterations of 1,000,000 would be sufficient without a confidence interval.
Scenario | Milliseconds / 1,000,000 operations | Relative difference |
---|
string.Equals | 25.8 | 1x |
EqualityComparer <string> .Default | 33.5 | 1.3x |
StringComparer.Ordinal | 29.8 | 1.16x |
StringComparer.OrdinalIgnoreCase | 50.3 | 1.95x |
StringComparer.OrdinalIgnoreCase non ASCII | 82.2 | 3.19x |
StringComparer.CurrentCulture | 136 | 5.27x |
StringComparer.CurrentCulture non ASCII | 174.3 | 6.76x |
StringComparer.CurrentCultureIgnoreCase | 134.5 | 5.21x |
StringComparer.CurrentCultureIgnoreCase non ASCII | 172.1 | 6.67x |
StringComparer.InvariantCulture | 132.2 | 5.12x |
StringComparer.InvariantCulture non ASCII | 189.5 | 7.34x |
StringComparer.InvariantCultureIgnoreCase | 134.1 | 5.2x |
StringComparer.InvariantCultureIgnoreCase non ASCII | 188 | 7.29x |
The results confirm the code — an explicit call to string.Equals is the fastest, GenericEqualityComparer <string> is slower due to additional checks of input parameters. OrdinalComparer also has additional checks. And then either unwound cycles are invoked, or methods of unmanaged code that, generally speaking, behave differently on different platforms, but in the case of working with culture, call methods from the operating system kernel.
Comparison in other operations on strings
It would seem that by default the fastest and simplest comparison is used, the programmer can not worry. In fact, everything is not quite so. There are a number of row operations. For example, determining
whether a string begins with a specific substring :
public Boolean StartsWith(String value) { return StartsWith(value, StringComparison.CurrentCulture); }
At the same time,
checking the occurrence of a substring (it would seem the same thing) is again performed byte-by-byte:
public bool Contains( string value ) { return ( IndexOf(value, StringComparison.Ordinal) >=0 ); }
And checking the occurrence of a substring
that returns the index of the beginning of this substring is again with the current culture:
public int IndexOf(String value) { return IndexOf(value, StringComparison.CurrentCulture); }
LastIndexOf generally calls native code. What is going on in it? Only Sathya knows.
findings
What to do with all this information?
- First, with a large number of comparisons, you can get an increase in the performance of these operations by about 10% by implementing your string comparator, which will not duplicate the checks that are already in the string.Equals method.
- Secondly, if the lines are keys, then you can decide to use only Ascii characters, which will give a gain of about 20% for most comparisons.
- Thirdly, it is necessary to understand and limit well the case when the comparison is applied taking into account the culture. Often in the code I met exactly the StringComparer.InvariantCulture option since the author considered such a comparison to be the most general. But most often it is enough to use StringComparer.Ordinal or, in special cases, StringComparer.OrdinalIgnoreCase.
- Using various methods of the String class it is better to double-check their behavior in the source code. Perhaps a surprise is waiting for you after the tests with “TestString” and “AnotherTestString”, when the code goes to production.