It is commonly held that the overhead of calling methods and organizing the execution flow should not exceed 15% of an application's running time; otherwise you should seriously consider refactoring the application and optimizing its logic. With this in mind, I came across the QuickSort method of the standard ArraySortHelper<T> class used to sort arrays in .NET. An interesting detail here is how elements are compared: to provide flexibility, the comparison is delegated to a separate class that implements the IComparer<T> interface. Armed with these thoughts and Visual Studio, I decided to estimate what this flexibility costs and what can be done about it. Under the cut is an analysis of the cost of comparing elements during QuickSort.
So, we have a standard implementation of Hoare's quicksort, which compares array elements via the Compare(T x, T y) method of an object implementing the IComparer<T> interface. For our experiments, we extract the code of the sorting method with a reflector; it looks like this:
static public void Sort<T, TValue>(T[] keys, TValue[] values, int left, int right, IComparer<T> comparer)
{
    do
    {
        int index = left;
        int num2 = right;
        T y = keys[index + ((num2 - index) >> 1)];
        do
        {
            try
            {
                /* Cmp 1 */ while (comparer.Compare(keys[index], y) < 0)
                {
                    index++;
                }
                /* Cmp 2 */ while (comparer.Compare(y, keys[num2]) < 0)
                {
                    num2--;
                }
            }
            catch (IndexOutOfRangeException)
            {
                throw new ArgumentException(null, "keys");
            }
            catch (Exception)
            {
                throw new InvalidOperationException();
            }
            if (index > num2)
            {
                break;
            }
            if (index < num2)
            {
                T local2 = keys[index];
                keys[index] = keys[num2];
                keys[num2] = local2;
                if (values != null)
                {
                    TValue local3 = values[index];
                    values[index] = values[num2];
                    values[num2] = local3;
                }
            }
            index++;
            num2--;
        }
        while (index <= num2);
        if ((num2 - left) <= (right - index))
        {
            if (left < num2)
            {
                /* Call 1 */ Sort<T, TValue>(keys, values, left, num2, comparer);
            }
            left = index;
        }
        else
        {
            if (index < right)
            {
                /* Call 2 */ Sort<T, TValue>(keys, values, index, right, comparer);
            }
            right = num2;
        }
    }
    while (left < right);
}
Since we are investigating only the call to the comparison operation, all changes to this function will affect only the method signature and the lines marked in the listing with the /* Cmp 1 */, /* Cmp 2 */, /* Call 1 */ and /* Call 2 */ comments.
Part one. Experiments with arrays of numbers (int)
To begin, let us estimate how much the overhead of calling the comparison of two numbers contributes to the total sorting time. To do this, we replace the "comparer.Compare(a, b)" calls in the function above with an expression of the form "a - b".
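For reference, here is a minimal sketch of such a specialized variant (the name SortIntDirect and the omission of the values array are mine; only the /* Cmp */ lines differ from the reflected code above):

// Hypothetical int-only specialization used for the measurement: the same algorithm,
// with comparer.Compare replaced by direct subtraction.
static void SortIntDirect(int[] keys, int left, int right)
{
    do
    {
        int index = left;
        int num2 = right;
        int y = keys[index + ((num2 - index) >> 1)];
        do
        {
            // note: a subtraction-based comparison can overflow for extreme values;
            // it mirrors the experiment, it is not production-quality code
            /* Cmp 1 */ while (keys[index] - y < 0) index++;   // was: comparer.Compare(keys[index], y) < 0
            /* Cmp 2 */ while (y - keys[num2] < 0) num2--;     // was: comparer.Compare(y, keys[num2]) < 0
            if (index > num2) break;
            if (index < num2)
            {
                int tmp = keys[index];
                keys[index] = keys[num2];
                keys[num2] = tmp;
            }
            index++;
            num2--;
        }
        while (index <= num2);
        if ((num2 - left) <= (right - index))
        {
            /* Call 1 */ if (left < num2) SortIntDirect(keys, left, num2);
            left = index;
        }
        else
        {
            /* Call 2 */ if (index < right) SortIntDirect(keys, index, right);
            right = num2;
        }
    }
    while (left < right);
}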
We measure the running time of both versions and see... a terrible picture: almost two thirds of the time is spent on organizing the call to the number-comparison predicate:
What could slow the work down so much? Obviously, the point is the excessive complexity of the comparison process (and this despite the fact that the operation itself boils down to subtracting two numbers!):
- checking the comparer object for null;
- finding the Compare method in the object's virtual method table;
- calling the Compare method on the comparer object;
- calling the CompareTo method;
- actually comparing the numbers.
The first two points are, in essence, the difference between the CIL instructions callvirt and call. Recall that callvirt, because of its built-in null check, is generated for calls to all non-static class methods, regardless of whether they are virtual.
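A quick way to observe this is to compile a trivial pair of calls and inspect them with ildasm; the class below and the IL mnemonics in the comments are my illustration, not code from the article:

class Widget
{
    public void Poke() { }              // ordinary, non-virtual instance method
    public static void PokeStatic() { }
}

class CallSites
{
    static void Use(Widget w)
    {
        w.Poke();            // compiled to: callvirt instance void Widget::Poke()  (null check built in)
        Widget.PokeStatic(); // compiled to: call void Widget::PokeStatic()         (no null check needed)
    }
}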
Well, the fourth point is caused by the standard implementation of the comparer (all the null checks are naturally removed by the JIT compiler when int is substituted for T):
class GenericComparer<T> : IComparer<T> where T : IComparable<T>
{
    public int Compare(T x, T y)
    {
        if (x != null)
        {
            if (y != null)
            {
                return x.CompareTo(y);
            }
            return 1;
        }
        if (y != null)
        {
            return -1;
        }
        return 0;
    }
}
We will peel off the layers one by one, starting with the call to CompareTo, since getting rid of it is as simple as writing our own comparer of the following form:
class IntComparer : IComparer<int>
{
    public int Compare(int x, int y)
    {
        return x - y;
    }
}
Measurements show a gain of 15% of the time. And this despite the fact that we are sorting an array of integers, for which the call to CompareTo, like any method call on a struct, is non-virtual. The next layer, the difference between callvirt and call at the call site, is harder to remove, especially while keeping the sorting function just as flexible. But nothing is impossible: when a struct is used through its own type rather than through an interface reference, all of its methods, including implementations of interface methods, are invoked with the call instruction. Thanks to this, we can perform the following optimization:
static public void SortNoVirt<T, TValue, TCmp>(T[] keys, TValue[] values, int left, int right, TCmp comparer) where TCmp : IComparer<T>
{
    // ...
}

struct IntComparerNoVirt : IComparer<int>
{
    public int Compare(int x, int y)
    {
        return x - y;
    }
}
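A possible call site, assuming the body of SortNoVirt repeats the Sort method shown earlier and both methods are in scope (the sample data is arbitrary):

static void Main()
{
    int[] keys = { 5, 3, 8, 1, 4 };

    // TCmp is the struct IntComparerNoVirt: the JIT specializes SortNoVirt for this
    // exact type and invokes Compare with a plain call (no interface dispatch, no boxing).
    SortNoVirt<int, int, IntComparerNoVirt>(keys, null, 0, keys.Length - 1, new IntComparerNoVirt());

    // For contrast, the original version receives the comparer as IComparer<int>
    // and reaches Compare through callvirt on every comparison.
    Sort<int, int>(keys, null, 0, keys.Length - 1, new IntComparer());
}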
Because the actual comparer type is passed to the sorting function through a generic parameter, it is known when the generic is expanded, and the more efficient call instruction is used to invoke the comparison. Time measurements show a performance gain of about 35% of the running time of the "standard" version, while the additional gain from replacing the Compare call with a subtraction is 18%.
Note: in the C++ STL all predicates are passed exactly this way, with the type traveling through a template parameter. In addition, thanks to this trick the C++ compiler has complete type information and, as a rule, inlines the predicate code, eliminating the cost of the call entirely. My experiments with disassembling the output of the .NET JIT showed that, despite all the tricks, no such inlining happens here. Apparently this is either related to generics, or it works only for static methods.
Note #2: It was not possible to save on passing the comparer by using an empty struct: the size (sizeof) of a field-less struct in C# turns out to be one byte, not zero as one might hope. In VC++ the size of an empty struct is also one, and the C++ standard requires the size of an empty class (struct) to be greater than zero; this is done because of the popular idiom for computing array length, "sizeof(array) / sizeof(array[0])". In C#/.NET this was apparently done for binary compatibility when interoperating with code written in C++.
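A small sketch of how this can be checked (compiling with /unsafe is required, since sizeof on a user-defined struct needs an unsafe context):

using System;

struct Empty { }   // a field-less struct, e.g. a stateless comparer

static class SizeCheck
{
    static unsafe void Main()
    {
        Console.WriteLine(sizeof(Empty));   // prints 1, not the hoped-for 0
        Console.WriteLine(sizeof(int));     // prints 4, for comparison
    }
}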
If we consolidate the data into a single diagram, we obtain the following breakdown of the running time of the standard sorting method:
The obvious conclusion: when developing computationally heavy libraries, it makes sense to pass predicates as structs, with their type supplied through generic parameters.
Part two. And what about objects?
But our applications do not live by numbers alone :) So the question arose of what the impact is when working with arrays of objects. Shall we check? Easy! As a guinea pig, take a simple class like this:
class IntObj : IComparable<IntObj>
{
    public int value;
    public int CompareTo(IntObj other)
    {
        return value - other.value;
    }
}
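For illustration, one possible way to set up this experiment; the array size, the random seed and the use of Comparer<IntObj>.Default (which, for a type implementing IComparable<IntObj>, goes through the GenericComparer path shown earlier) are my assumptions, and the Sort method from the beginning of the article is assumed to be in scope:

using System;
using System.Collections.Generic;

class ObjectSortExperiment
{
    static void Main()
    {
        var rnd = new Random(42);
        var keys = new IntObj[1000000];
        for (int i = 0; i < keys.Length; i++)
            keys[i] = new IntObj { value = rnd.Next() };

        // Comparer<IntObj>.Default implements IComparer<IntObj> over CompareTo.
        Sort<IntObj, object>(keys, null, 0, keys.Length - 1, Comparer<IntObj>.Default);
    }
}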
We carry out similar experiments, but now with an array of objects, and get a somewhat different picture:
While the ratio between the parts of the comparison process remained unchanged, their contribution to the total time has noticeably decreased. A natural question arises: what is going on, what has become slow? A quick glance at the code of the sort function is enough to understand: the only piece of code that behaves differently, the exchange of values between array elements, has slowed down. But we are working with a reference type; only pointers are swapped, and deep down they are just numbers! To find out why it is so slow, reading the CIL gives little: the exchange code is absolutely identical, except that the *.i4 instructions are used for numbers and *.ref for objects.
If CIL did not help, then the matter lies in what the JIT produces, so let's look at the assembler :) Here a small nuisance awaits us: regardless of the configuration (Debug/Release), the JIT compiler looks at its environment and generates different code depending on whether a debugger is attached to the process. Therefore, to get at the real code through the Disassembly window, we start the application without debugging and set "breakpoints" in the form of calls to Debugger.Break().
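A sketch of what this instrumentation might look like inside the swap branch of Sort (not a complete method, just the fragment of interest):

// Hypothetical instrumentation: start the program without a debugger (Ctrl+F5) so the
// JIT emits optimized code, then attach the debugger when Debugger.Break() fires and
// open the Disassembly window.
if (index < num2)
{
    System.Diagnostics.Debugger.Break();   // marks the start of the code of interest
    T local2 = keys[index];
    keys[index] = keys[num2];
    keys[num2] = local2;
    System.Diagnostics.Debugger.Break();   // marks the end (the values swap is omitted here)
}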
Below are the listings of the code generated to swap two cells of the array:
; ==============================================
; swap in int array
; 124:  T local2 = keys[index];
00000142  movsxd rcx, ebx
00000145  mov rsi, qword ptr [rbp + 68h]
00000149  mov rax, qword ptr [rsi + 8]
0000014d  cmp rcx, rax
00000150  jae 0000000000000275
00000156  mov edx, dword ptr [rsi + rcx * 4 + 10h]
; 125:  keys[index] = keys[num2];
0000015a  movsxd r8, edi
0000015d  cmp r8, rax
00000160  jae 0000000000000275
00000166  mov eax, dword ptr [rsi + r8 * 4 + 10h]
0000016b  mov dword ptr [rsi + rcx * 4 + 10h], eax
; 126:  keys[num2] = local2;
0000016f  mov dword ptr [rsi + r8 * 4 + 10h], edx

; ==============================================
; swap in IntObj array
; 124:  T local2 = keys[index];
00000146  movsxd r12, ebx
00000149  mov rsi, qword ptr [rbp + 68h]
0000014d  mov rax, qword ptr [rsi + 8]
00000151  cmp r12, rax
00000154  jae 0000000000000285
0000015a  mov r14, qword ptr [rsi + r12 * 8 + 18h]
; 125:  keys[index] = keys[num2];
0000015f  movsxd r13, edi
00000162  cmp r13, rax
00000165  jae 0000000000000285
0000016b  mov r8, qword ptr [rsi + r13 * 8 + 18h]
00000170  mov edx, ebx
00000172  mov rcx, rsi
00000175  call FFFFFFFFEF5F7CB0
; 126:  keys[num2] = local2;
0000017a  mov r8, r14
0000017d  mov edx, edi
0000017f  mov rcx, rsi
00000182  call FFFFFFFFEF5F7CB0
In these listings one immediately notices a call to some function when writing into the array of objects, a call that was absent while we were looking at the CIL. Obviously, this function does something more than simply copying an address, and besides the cost of organizing the call we pay for something else, which means the details lie in the CLR implementation. Indeed, after reading the source code of the public version of CLR 2.0 (the SSCLI 2.0 package), we find the following code:
/*****************************************************************************/
/* assigns 'val' to 'array[idx]', after doing all the proper checks */
HCIMPL3(void, JIT_Stelem_Ref_Portable, PtrArray* array, unsigned idx, Object* val)
{
    if (!array)
    {
        FCThrowVoid(kNullReferenceException);
    }
    if (idx >= array->GetNumComponents())
    {
        FCThrowVoid(kIndexOutOfRangeException);
    }
    if (val)
    {
        MethodTable* valMT = val->GetMethodTable();
        TypeHandle arrayElemTH = array->GetArrayElementTypeHandle();
        if (arrayElemTH != TypeHandle(valMT) && arrayElemTH != TypeHandle(g_pObjectClass))
        {
            TypeHandle::CastResult result = ObjIsInstanceOfNoGC(val, arrayElemTH);
            if (result != TypeHandle::CanCast)
            {
                HELPER_METHOD_FRAME_BEGIN_2(val, array);
                // This is equivalent to ArrayStoreCheck(&val, &array);
#ifdef STRESS_HEAP
                // Force a GC on every jit if the stress level is high enough
                if (g_pConfig->GetGCStressLevel() != 0
#ifdef _DEBUG
                    && !g_pConfig->FastGCStressLevel()
#endif
                    )
                    GCHeap::GetGCHeap()->StressHeap();
#endif
#if CHECK_APP_DOMAIN_LEAKS
                // If the instance is agile or check agile
                if (!arrayElemTH.IsAppDomainAgile() && !arrayElemTH.IsCheckAppDomainAgile() && g_pConfig->AppDomainLeaks())
                {
                    val->AssignAppDomain(array->GetAppDomain());
                }
#endif
                if (!ObjIsInstanceOf(val, arrayElemTH))
                {
                    COMPlusThrow(kArrayTypeMismatchException);
                }
                HELPER_METHOD_FRAME_END();
            }
        }
        // The performance gain of the optimized JIT_Stelem_Ref in
        // jitinterfacex86.cpp is mainly due to calling JIT_WriteBarrierReg_Buf.
        // By calling write barrier directly here,
        // we can avoid translating in-line assembly from MSVC to gcc
        // while keeping most of the performance gain.
        HCCALL2(JIT_WriteBarrier, (Object**)&array->m_Array[idx], val);
    }
    else
    {
        // no need to go through write-barrier for NULL
        ClearObjectReference(&array->m_Array[idx]);
    }
}
HCIMPLEND
As can be seen, when writing an element into an array of objects, in addition to the standard array-bounds check, type compatibility is also verified and, if necessary, the appropriate checks and conversions are performed. It is worth noting that C#, being a strongly typed language, performs type control itself, but since by the time the CIL store instruction executes the value arrives as a System.Object, the CLR checks once more, just to be safe.
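Why the runtime cannot skip this check is easy to see from array covariance; the following self-contained example (mine, not from the article) triggers exactly the ArrayTypeMismatchException thrown above:

using System;

class CovarianceDemo
{
    static void Main()
    {
        string[] strings = new string[1];
        object[] objects = strings;        // legal: arrays of reference types are covariant

        try
        {
            objects[0] = new object();     // compiles, but the real element type is string
        }
        catch (ArrayTypeMismatchException)
        {
            // this is the exception thrown by the store check in JIT_Stelem_Ref_Portable
            Console.WriteLine("store rejected at run time");
        }
    }
}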
Part three. Something like conclusions
What conclusions can be drawn from all this? I will not suggest hardcore optimizations, but it makes sense to avoid virtual calls inside large loops, especially for miniature predicate functions. In practice this means, for example:
- implementing predicates as structs rather than classes, and passing the type information through generic parameters;
- replacing small utility instance methods with static extension methods, which, being static, are called bypassing the virtual method table (see the sketch below).
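A minimal sketch of the second point (the Vector type and method names are mine, purely for illustration):

// A small utility method as an instance member: the compiler calls it with
// callvirt (null check included), even though it is not virtual.
class Vector
{
    public double X, Y;
    public double LengthSquared() { return X * X + Y * Y; }
}

// The same logic as an extension method: under the hood it is a static method,
// so the call site uses a plain call and bypasses the virtual table.
static class VectorExtensions
{
    public static double LengthSquaredExt(this Vector v)
    {
        return v.X * v.X + v.Y * v.Y;
    }
}

// Usage looks the same at the call site:
//   double a = v.LengthSquared();     // callvirt
//   double b = v.LengthSquaredExt();  // call (static)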
Of course, these are not the only possible conclusions, but the reader should be left some room for thought :)