
High-performance code on the .NET platform

Hello dear readers!

Some time ago we began working on a translation of the book Writing High-Performance .NET Code, which still has not appeared in Russian, although nearly a year has passed.


Of course, we were not surprised that the book is already being quoted, but it turned out that its author, Ben Watson, has even published an entire article based on one of its chapters on CodeProject. Unfortunately, that material is too long for a single Habr post, so we decided to translate at least the first part of the article so that you can judge the book's material for yourself. We invite you to read it and take part in the poll. If you think the second part is worth translating as well, tell us in the comments and we will try to take your wishes into account.


Context

This article is based on Chapter 5 of my book Writing High-Performance .NET Code.

Introduction

This article covers general principles of coding and type design. The .NET platform offers many ways to implement the same scenario; some of them have no effect on performance at all, while others can hurt it badly. Certain choices are neither good nor bad in themselves, they simply are what they are. You have to find the most appropriate solution for each situation.

If I tried to formulate a general principle underlying this article, it would be this:

Deep performance optimization often violates software abstractions.

This means that to achieve extreme performance you must understand the implementation details at every level, and possibly exploit those details. Many of them are described in this article.

Comparing classes and structures

Instances of a class are always allocated on the heap and accessed by dereferencing a pointer. Passing them around is cheap, because only the pointer is copied (4 or 8 bytes). However, every object also carries fixed overhead: 8 bytes in 32-bit processes and 16 bytes in 64-bit processes, covering the method table pointer and the sync block field. If you create an empty object and inspect it in the debugger, you will find its actual size is 12 bytes, not 8 (24 bytes in a 64-bit process), because the minimum object size is dictated by memory alignment. Fortunately, those "extra" 4 bytes will be used by the first field.

A struct carries no such overhead; the memory it uses is simply the sum of the sizes of its fields. If a struct is declared as a local variable in a method, it is allocated on the stack. If a struct is declared as part of a class, its memory is part of the memory block occupied by that class (and therefore lives on the heap). When you pass a struct to a method, it is copied byte for byte. Because a struct is not allocated on the heap, allocating one never triggers a garbage collection.

There is therefore a trade-off. You can find various recommendations about the maximum size of a struct, but I would not commit to a specific number. As a rule, keep structs very small, especially if they are passed around a lot; however, structs can also be passed by reference, so their size may not be a significant problem. The only way to know for sure whether this technique helps is to look closely at your usage pattern and do your own profiling.
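To illustrate passing a struct by reference, here is a minimal sketch. It assumes C# 7.2 or later for the `in` modifier; the `Measurements` type and method names are illustrative, not from the book.

```csharp
using System;

// Hypothetical 32-byte struct: large enough that copying it on every call adds up.
struct Measurements
{
    public double Min, Max, Mean, StdDev;
}

class Program
{
    // Passed by value: all 32 bytes are copied into the callee.
    static double RangeByValue(Measurements m) => m.Max - m.Min;

    // Passed by readonly reference ('in'): only an address is passed, no copy.
    static double RangeByRef(in Measurements m) => m.Max - m.Min;

    static void Main()
    {
        var m = new Measurements { Min = 1.0, Max = 9.0, Mean = 5.0, StdDev = 2.0 };
        Console.WriteLine(RangeByValue(m)); // 8
        Console.WriteLine(RangeByRef(in m)); // 8
    }
}
```

Both calls compute the same result; the difference is only in how much data crosses the call boundary, which profiling can confirm for your workload.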

In some situations the difference in efficiency can be dramatic. The overhead of a single object may look trivial, but consider an array of objects and compare it with an array of structs. Suppose each data structure contains 16 bytes of data, the array length is 1,000,000, and we are running in a 32-bit process.

For an array of objects, the total space consumption is:

12 bytes of array overhead +
(4-byte pointer × 1,000,000) +
((8 bytes of overhead + 16 bytes of data) × 1,000,000)
= 28 MB

For an array of structs, the result is fundamentally different:

12 bytes of array overhead +
(16 bytes of data × 1,000,000)
= 16 MB

In a 64-bit process, the array of objects takes up more than 40 MB, while the array of structs still requires only 16 MB.

As you can see, an array of structs holds the same amount of data in less memory. On top of the overhead of the objects themselves, you also get more frequent garbage collections because of the higher memory pressure.
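The arithmetic above can be checked empirically with `GC.GetTotalMemory`. This is a rough sketch: the `PointClass`/`PointStruct` names are illustrative, and the exact numbers printed depend on pointer size and runtime version (the struct array should come out near 16 MB on any platform).

```csharp
using System;

class PointClass { public int X, Y, Z, W; }   // 16 bytes of data + object header
struct PointStruct { public int X, Y, Z, W; } // 16 bytes of data, no header

class Program
{
    static void Main()
    {
        const int N = 1_000_000;

        long before = GC.GetTotalMemory(forceFullCollection: true);
        var classes = new PointClass[N];
        for (int i = 0; i < N; i++) classes[i] = new PointClass();
        long afterClasses = GC.GetTotalMemory(true);

        var structs = new PointStruct[N]; // one contiguous ~16 MB block
        long afterStructs = GC.GetTotalMemory(true);

        Console.WriteLine($"Array of objects: ~{(afterClasses - before) / (1 << 20)} MB");
        Console.WriteLine($"Array of structs: ~{(afterStructs - afterClasses) / (1 << 20)} MB");

        GC.KeepAlive(classes);
        GC.KeepAlive(structs);
    }
}
```

The `GC.KeepAlive` calls prevent the collector from reclaiming the arrays before the second measurement is taken.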

Besides space consumption, there is also the question of CPU efficiency. The processor has several levels of cache. The levels closest to the processor are very small but extremely fast and optimized for sequential access.

A struct array stores its values sequentially in memory, so accessing an element is very simple: once the right entry is located, we already have the value itself. This can make a huge difference in access speed when scanning a large array. A value that is already in the CPU cache can be accessed an order of magnitude faster than one sitting in RAM.

Accessing an element of an object array requires reading the array's memory and then dereferencing the pointer it stores, which may point anywhere on the heap. Scanning an object array thus involves an extra pointer dereference per element, hopping around the heap, and more frequent eviction of data from the CPU cache, potentially throwing away data you still need.

Avoiding this overhead, both in the CPU and in memory, is in many cases the main reason to prefer structs. Used wisely, this technique can deliver a significant performance gain thanks to improved memory locality.
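The locality effect can be sketched with a crude `Stopwatch` comparison. A caveat, stated as an assumption: objects allocated in a tight loop often end up sequential on the heap too, so a microbenchmark like this understates the penalty of a long-lived, fragmented heap; the `BoxedInt` name is illustrative.

```csharp
using System;
using System.Diagnostics;

class BoxedInt { public int Value; }

class Program
{
    static void Main()
    {
        const int N = 10_000_000;
        var structs = new int[N];
        var objects = new BoxedInt[N];
        for (int i = 0; i < N; i++)
        {
            structs[i] = i;
            objects[i] = new BoxedInt { Value = i };
        }

        var sw = Stopwatch.StartNew();
        long sum1 = 0;
        for (int i = 0; i < N; i++) sum1 += structs[i];       // sequential, cache-friendly
        sw.Stop();
        Console.WriteLine($"value array:  {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        long sum2 = 0;
        for (int i = 0; i < N; i++) sum2 += objects[i].Value; // extra dereference per element
        sw.Stop();
        Console.WriteLine($"object array: {sw.ElapsedMilliseconds} ms");
    }
}
```

For serious measurements, prefer a benchmarking harness over raw `Stopwatch` timings.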

Because structs are always copied by value, you can inadvertently create interesting situations for yourself. For example, the following code is buggy and will not compile:

 struct Point
 {
     public int x;
     public int y;
 }

 public static void Main()
 {
     List<Point> points = new List<Point>();
     points.Add(new Point() { x = 1, y = 2 });
     points[0].x = 3;
 }


The problem is the last line, which tries to modify an existing Point in the list. This is impossible, because points[0] returns a copy of the original value, and that copy is not stored anywhere else. The correct way to modify the point is:

 Point p = points[0];
 p.x = 3;
 points[0] = p;


However, it may be wise to adopt an even stricter policy: make your structs immutable. Once created, such a struct can never take on a new value. This makes the situation described above impossible by design and simplifies working with structs in general.
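Reworking the earlier Point example as an immutable struct might look like this. It assumes C# 7.2 or later, where the `readonly struct` modifier lets the compiler enforce immutability; the `WithX` helper is an illustrative convention, not a framework API.

```csharp
using System;
using System.Collections.Generic;

// All state is set in the constructor and can never change afterwards.
readonly struct Point
{
    public int X { get; }
    public int Y { get; }

    public Point(int x, int y) { X = x; Y = y; }

    // "Mutation" produces a new value instead of modifying this one.
    public Point WithX(int x) => new Point(x, Y);
}

class Program
{
    static void Main()
    {
        var points = new List<Point> { new Point(1, 2) };
        // points[0].X = 3;             // does not compile: the struct is immutable
        points[0] = points[0].WithX(3); // replace the element wholesale instead
        Console.WriteLine(points[0].X); // 3
    }
}
```

With immutability enforced by the compiler, the lost-update bug from the mutable version simply cannot be written.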

I mentioned above that structs should be compact so that copying them is cheap, but occasionally you will need to use large structs. Consider an object that tracks many details of a business process, such as a series of timestamps:

 class Order
 {
     public DateTime ReceivedTime {get;set;}
     public DateTime AcknowledgeTime {get;set;}
     public DateTime ProcessBeginTime {get;set;}
     public DateTime WarehouseReceiveTime {get;set;}
     public DateTime WarehouseRunnerReceiveTime {get;set;}
     public DateTime WarehouseRunnerCompletionTime {get;set;}
     public DateTime PackingBeginTime {get;set;}
     public DateTime PackingEndTime {get;set;}
     public DateTime LabelPrintTime {get;set;}
     public DateTime CarrierNotifyTime {get;set;}
     public DateTime ProcessEndTime {get;set;}
     public DateTime EmailSentToCustomerTime {get;set;}
     public DateTime CarrerPickupTime {get;set;}
     // ... and so on
 }


To simplify the code, it would be nice to pull each of these timestamps into a sub-structure of its own, which the Order class could still expose like this:

 Order order = new Order();
 order.Times.ReceivedTime = DateTime.UtcNow;


All of these timestamps could be moved into a separate class:

 class OrderTimes
 {
     public DateTime ReceivedTime {get;set;}
     public DateTime AcknowledgeTime {get;set;}
     public DateTime ProcessBeginTime {get;set;}
     public DateTime WarehouseReceiveTime {get;set;}
     public DateTime WarehouseRunnerReceiveTime {get;set;}
     public DateTime WarehouseRunnerCompletionTime {get;set;}
     public DateTime PackingBeginTime {get;set;}
     public DateTime PackingEndTime {get;set;}
     public DateTime LabelPrintTime {get;set;}
     public DateTime CarrierNotifyTime {get;set;}
     public DateTime ProcessEndTime {get;set;}
     public DateTime EmailSentToCustomerTime {get;set;}
     public DateTime CarrerPickupTime {get;set;}
 }

 class Order
 {
     public OrderTimes Times;
 }


However, this adds 12 or 24 bytes of overhead to every Order object.

If you need to pass the entire OrderTimes object to different methods, this overhead may be justified, but then why not just pass a reference to the whole Order object? If you have thousands of Order objects being processed at once, this design also causes significantly more garbage collection, and every access pays for an extra pointer dereference.

Instead, make OrderTimes a struct. Accessing its individual properties through a property of the Order object (for example, order.Times.ReceivedTime) does not copy the struct (.NET specifically optimizes this common access pattern). The OrderTimes struct essentially becomes part of the memory reserved for the Order class, as if there were no sub-structure at all, and the code becomes cleaner too.

This technique does not conflict with the principle of immutable structs; the trick is to treat the fields of the OrderTimes struct exactly as if they were fields of the Order object itself. If you never need to pass OrderTimes around as a single entity, the sub-structure is purely organizational.

Overriding Equals and GetHashCode methods for structures

When working with structs, it is extremely important to override the Equals and GetHashCode methods. If you do not, you get the default implementations, which are terrible for performance. To see just how bad, open an IL viewer and look at the code of ValueType.Equals: it uses reflection over all of the struct's fields. There is, however, an optimization for blittable types. A blittable type is one that has the same memory representation in managed and unmanaged code; only the primitive numeric types qualify (for example, Int32 and UInt64, but not Decimal, which is not primitive), plus IntPtr/UIntPtr. If a struct consists only of blittable types, the Equals implementation can do a byte-for-byte comparison of the entire struct's memory. Rather than rely on this uncertainty, implement your own Equals method.

If you only override Equals(object other), performance is still needlessly poor, because that method involves casting and boxing of value types. Instead, also implement Equals(T other), where T is the type of your struct. This is exactly what the IEquatable&lt;T&gt; interface is for, and all structs should implement it. The compiler will prefer the more strongly typed overload whenever possible. Consider an example:

 struct Vector : IEquatable<Vector>
 {
     public int X { get; set; }
     public int Y { get; set; }
     public int Z { get; set; }
     public int Magnitude { get; set; }

     public override bool Equals(object obj)
     {
         if (obj == null)
         {
             return false;
         }
         if (obj.GetType() != this.GetType())
         {
             return false;
         }
         return this.Equals((Vector)obj);
     }

     public bool Equals(Vector other)
     {
         return this.X == other.X
             && this.Y == other.Y
             && this.Z == other.Z
             && this.Magnitude == other.Magnitude;
     }

     public override int GetHashCode()
     {
         return X ^ Y ^ Z ^ Magnitude;
     }
 }


If a type implements IEquatable&lt;T&gt;, the generic .NET collections detect this and use it for more efficient searching and sorting.

You may also want to implement the == and != operators for your value types and have them call the existing Equals(T) method.
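One way to wire the operators to the strongly typed Equals(T), sketched on a smaller illustrative struct (`Vector2` is my name here, not the book's):

```csharp
using System;

struct Vector2 : IEquatable<Vector2>
{
    public int X { get; }
    public int Y { get; }

    public Vector2(int x, int y) { X = x; Y = y; }

    public bool Equals(Vector2 other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is Vector2 v && Equals(v);
    public override int GetHashCode() => X ^ Y;

    // Delegate the operators to the strongly typed Equals(T): no boxing involved.
    public static bool operator ==(Vector2 a, Vector2 b) => a.Equals(b);
    public static bool operator !=(Vector2 a, Vector2 b) => !a.Equals(b);
}

class Program
{
    static void Main()
    {
        Console.WriteLine(new Vector2(1, 2) == new Vector2(1, 2)); // True
        Console.WriteLine(new Vector2(1, 2) != new Vector2(3, 4)); // True
    }
}
```

Keeping all equality paths funneled through Equals(T) means there is exactly one place where the comparison logic lives.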

Even if you never compare your structs or put them in collections, I still recommend implementing these methods. You can never predict how they will be used in the future, and they cost only a few minutes to write and a few bytes of IL that will never even be JIT-compiled if unused.

Overriding Equals and GetHashCode on classes is less important, because by default they compute equality based only on the object reference. As long as that default implementation is acceptable for your code, leave it alone.

Virtual methods and sealed classes

Do not mark methods virtual by default, "just in case." However, if virtual methods are necessary for a coherent design of your program, you probably should not go too far out of your way to eliminate them.

Marking methods virtual prevents certain optimizations by the JIT compiler, notably method inlining. A method can be inlined only if the compiler knows with 100% certainty which method will actually be called. Marking a method virtual removes that certainty, although other factors can also rule out this optimization.

A closely related concept is sealing a class, for example:

 public sealed class MyClass {} 


A class marked sealed declares that no other class can inherit from it.

In theory, the JIT compiler could use this information to inline more aggressively, but it currently does not. Even so, you should declare classes sealed and leave methods non-virtual by default, unless there is a real need otherwise. That way your code is ready for any such optimizations the JIT performs now, as well as those it may gain in the future.

If you are writing a class library meant to be used in many situations, especially outside your organization, you need to be more careful. There, having a virtual API may matter more than raw performance, so that your library can be reused and customized. But for code that is internal and changes often, lean toward better performance.

Interface dispatch

The first time you call a method through an interface, .NET has to determine which type and method to invoke. It first calls a stub that finds the right method for the object implementing the interface. After this has happened a few times, the CLR notices that the same concrete type keeps being used, and the indirect call through the stub is reduced to a stub of just a few assembly instructions that calls the correct method directly. This group of instructions is called a monomorphic stub, because it knows how to call the method for only one type. It is ideal for call sites that always invoke the interface methods on the same type.

A monomorphic stub can also detect when it is wrong. If at some point the call site starts using an object of another type, the CLR will eventually replace the stub with a new monomorphic stub for the new type.

If the situation is more complex, with several types involved and less predictability (for example, an array of an interface type containing several concrete types), the stub is turned into a polymorphic stub that uses a hash table to pick which method to call. The table lookup is fast, but still not as fast as a monomorphic stub. Moreover, the size of this hash table is strictly limited, and if you have too many types, the runtime may fall back to the generic dispatch lookup code. That can be very expensive.

If such a problem arises, it can be solved in one of two ways:

  1. Avoid calling these objects through the common interface.
  2. Replace the common base interface with an abstract base class.


This problem is not common, but it can hit you if you have a huge hierarchy of types that all implement a common interface whose methods you call through that interface. You may notice CPU activity at the call sites of these methods that cannot be explained by the work the methods themselves do.
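The second remedy from the list above, replacing the root interface with an abstract base class, can be sketched as follows. The `Shape`/`Square` names are illustrative; the point is that a call through an abstract class is a plain vtable-based virtual call and does not go through the interface-dispatch stub machinery.

```csharp
using System;

// Instead of: interface IShape { double Area(); }
// use an abstract base class at the root of the hierarchy.
abstract class Shape
{
    public abstract double Area();
}

class Square : Shape
{
    private readonly double _side;
    public Square(double side) { _side = side; }
    public override double Area() => _side * _side;
}

class Program
{
    static void Main()
    {
        // Dispatch through the abstract base: an ordinary virtual call,
        // regardless of how many Shape subclasses exist.
        Shape s = new Square(3);
        Console.WriteLine(s.Area()); // 9
    }
}
```

The trade-off is that C# allows only single inheritance, so reserving the base class for this purpose is a design decision worth making early, as in the story below.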

Story: when designing a large system, we knew in advance that we might have thousands of types descending from a common ancestor, and in a couple of places we would need to work with them through the base type. Because our team had someone who understood how interface dispatch behaves at that scale, we decided to use an abstract base class rather than a root interface.

A good write-up on interface dispatch can be found on Vance Morrison's blog.

Avoid boxing

Boxing is the process of wrapping a value type, such as a primitive or a struct, in an object on the heap so that it can be passed to methods that require an object reference. Unboxing is retrieving the original value back out.

Boxing costs CPU time for allocating, copying and casting the object, but, more importantly, it causes allocations on the GC heap. Careless boxing can produce a lot of allocations, all of which put extra pressure on the garbage collector.

Explicit boxing occurs whenever you do something like this:

 int x = 32;
 object o = x;


The IL for this looks as follows:

 IL_0001: ldc.i4.s 32
 IL_0003: stloc.0
 IL_0004: ldloc.0
 IL_0005: box [mscorlib]System.Int32
 IL_000a: stloc.1


This means it is relatively easy to find most sources of boxing in your code: just use ILDASM to convert all of your IL to text and search for the box instruction.

A very common cause of inadvertent boxing is APIs that take object or object[] as a parameter. The most obvious of these is String.Format, along with the old non-generic collections, which store only object references and should be avoided entirely for this and other reasons.
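A small sketch of this case: passing an int to String.Format goes through an object parameter and boxes the value, while converting it to a string first does not. (The exact overload resolution can vary across framework versions; inspecting the IL for the box instruction, as described above, is the reliable check.)

```csharp
using System;

class Program
{
    static void Main()
    {
        int x = 42;

        // The object parameter forces x to be boxed.
        string a = string.Format("Value: {0}", x);

        // Calling ToString() first passes a string reference instead: no boxing.
        string b = "Value: " + x.ToString();

        Console.WriteLine(a); // Value: 42
        Console.WriteLine(b); // Value: 42
    }
}
```

Both lines produce the same text; only the allocation behavior differs.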

Boxing can also occur when a struct is assigned to an interface variable, for example:

 interface INameable
 {
     string Name { get; set; }
 }

 struct Foo : INameable
 {
     public string Name { get; set; }
 }

 void TestBoxing()
 {
     Foo foo = new Foo() { Name = "Bar" };
     // Boxing!
     INameable nameable = foo;
     ...
 }


If you decide to test this code yourself, note that if the boxed variable is never actually used, the compiler will simply remove the boxing instruction as an optimization. As soon as you call a method on it or use the value in some other way, the boxing instruction will be there.

Another thing to keep in mind about boxing is the result of the following code:

 int val = 13;
 object boxedVal = val;
 val = 14;


What will be the value of boxedVal after this?

Boxing copies the value, and there is no link between the original and the copy. In this example, val changes to 14, but boxedVal keeps the original value of 13.

Boxing can sometimes show up in a CPU profile, but many boxing calls are inlined, so there is no reliable way to find them that way. What you will see in a CPU profile with excessive boxing is just heavy memory allocation through new.

If you find yourself boxing structs heavily and conclude that you cannot avoid it, you should probably just convert the struct to a class, which may turn out to be cheaper overall.

Finally, note that passing a value type by reference is not boxing. Check the IL and you will see that no boxing takes place: the address of the value type is passed to the method.
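This can be sketched by reworking the earlier val/boxedVal example with a ref parameter; no box instruction appears in the IL because only the variable's address crosses the call. The `Increment` helper is illustrative.

```csharp
using System;

class Program
{
    // The struct's address is passed; no heap allocation or boxing occurs.
    static void Increment(ref int value)
    {
        value++;
    }

    static void Main()
    {
        int x = 13;
        Increment(ref x); // the callee modifies the original variable
        Console.WriteLine(x); // 14
    }
}
```

Unlike the boxing example, the caller observes the change, because caller and callee share the same storage location.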

Comparing for and foreach

Use the MeasureIt program to see the difference between iterating over collections with for and foreach loops. Plain for loops are much faster in almost all cases. However, if you run your own simple test, you may find, depending on the scenario, that the two loops perform about equally. Indeed, in many situations .NET converts simple foreach statements into standard for loops.

Consider the ForEachVsFor project, which contains the following code:

 int[] arr = new int[100];

 for (int i = 0; i < arr.Length; i++)
 {
     arr[i] = i;
 }

 int sum = 0;
 foreach (int val in arr)
 {
     sum += val;
 }

 sum = 0;
 IEnumerable<int> arrEnum = arr;
 foreach (int val in arrEnum)
 {
     sum += val;
 }


After building it, decompile the result with an IL decompiler. You will see that the first foreach is in fact compiled as a for loop. The IL looks like this:

 // loop start (head: IL_0034)
 IL_0024: ldloc.s CS$6$0000
 IL_0026: ldloc.s CS$7$0001
 IL_0028: ldelem.i4
 IL_0029: stloc.3
 IL_002a: ldloc.2
 IL_002b: ldloc.3
 IL_002c: add
 IL_002d: stloc.2
 IL_002e: ldloc.s CS$7$0001
 IL_0030: ldc.i4.1
 IL_0031: add
 IL_0032: stloc.s CS$7$0001
 IL_0034: ldloc.s CS$7$0001
 IL_0036: ldloc.s CS$6$0000
 IL_0038: ldlen
 IL_0039: conv.i4
 IL_003a: blt.s IL_0024
 // end of loop


There are lots of loads, stores, adds and a single branch: it is all quite simple.

However, as soon as we cast the array to IEnumerable&lt;int&gt; and do the same thing, the work becomes much more expensive:

 IL_0043: callvirt instance class [mscorlib]System.Collections.Generic.IEnumerator`1<!0> class [mscorlib]System.Collections.Generic.IEnumerable`1<int32>::GetEnumerator()
 IL_0048: stloc.s CS$5$0002
 .try
 {
     IL_004a: br.s IL_005a
     // loop start (head: IL_005a)
     IL_004c: ldloc.s CS$5$0002
     IL_004e: callvirt instance !0 class [mscorlib]System.Collections.Generic.IEnumerator`1<int32>::get_Current()
     IL_0053: stloc.s val
     IL_0055: ldloc.2
     IL_0056: ldloc.s val
     IL_0058: add
     IL_0059: stloc.2
     IL_005a: ldloc.s CS$5$0002
     IL_005c: callvirt instance bool [mscorlib]System.Collections.IEnumerator::MoveNext()
     IL_0061: brtrue.s IL_004c
     // end of loop
     IL_0063: leave.s IL_0071
 } // end .try
 finally
 {
     IL_0065: ldloc.s CS$5$0002
     IL_0067: brfalse.s IL_0070
     IL_0069: ldloc.s CS$5$0002
     IL_006b: callvirt instance void [mscorlib]System.IDisposable::Dispose()
     IL_0070: endfinally
 } // end finally


Here we have four virtual method calls, a try-finally block and, not shown here, a memory allocation for the local enumerator variable that tracks the enumeration state. That is much more expensive than the plain for loop: it uses more CPU time and more memory!

Remember that the underlying data structure is still an array, so a for loop would work, but we have obscured that by casting to the IEnumerable&lt;int&gt; interface. This illustrates the point made at the start of the article: deep performance optimization often goes against code abstractions. foreach is an abstraction of a loop, and IEnumerable is an abstraction of a collection. Combined, they rule out the simple optimization of iterating over the array with a for loop.

Casting

In general, avoid casting wherever possible. Casting often indicates poor class design, but sometimes it really is required. For instance, converting between signed and unsigned integers when working with various third-party APIs forces a cast. Casting objects should be much rarer.
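As a small sketch of the signed/unsigned case: a numeric cast reinterprets or converts the value, and the `checked`/`unchecked` context decides whether an out-of-range conversion throws.

```csharp
using System;

class Program
{
    static void Main()
    {
        // Reinterpreting -1 as unsigned: the bit pattern is unchanged,
        // so the result is uint.MaxValue.
        int signed = -1;
        uint unsigned = unchecked((uint)signed);
        Console.WriteLine(unsigned == uint.MaxValue); // True

        // In a checked context the same conversion throws at runtime.
        try
        {
            uint u = checked((uint)signed);
            Console.WriteLine(u); // not reached
        }
        catch (OverflowException)
        {
            Console.WriteLine("overflow");
        }
    }
}
```

Numeric casts like these are cheap; it is the object casts discussed next that carry meaningfully different costs.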

Casting an object is never free, but the cost can differ dramatically depending on the relationship between the types. Casting from a parent type to the desired child type (a downcast) is much more expensive than the reverse operation, and the deeper the hierarchy, the higher the cost. Casting to an interface is more expensive than casting to a concrete type.

What you absolutely cannot afford is an invalid cast. If one happens, you get an InvalidCastException, whose cost is orders of magnitude greater than the "price" of the cast itself.

See the CastingPerf project in this book's sample source code, which benchmarks the cost of various kinds of casts.

A test run on my computer produced the following result:

 JIT (ignore): 1.00x
 No cast: 1.00x
 Up cast (1 gen): 1.00x
 Up cast (2 gens): 1.00x
 Up cast (3 gens): 1.00x
 Down cast (1 gen): 1.25x
 Down cast (2 gens): 1.37x
 Down cast (3 gens): 1.37x
 Interface: 2.73x
 Invalid Cast: 14934.51x
 as (success): 1.01x
 as (failure): 2.60x
 is (success): 2.00x
 is (failure): 1.98x


The 'is' operator performs a cast, tests the result, and returns a boolean.

The 'as' operator is similar to a standard cast, but returns null if the cast fails. As the results above show, this is much faster than throwing an exception.

Never use a pattern like the following, which performs two casts:

 if (a is Foo)
 {
     Foo f = (Foo)a;
 }


Instead, if you are going to use the object anyway, cast once with 'as' and check the result:

 Foo f = a as Foo;
 if (f != null)
 {
     ...
 }


This way, you pay for only one cast instead of two.

Note: a cast requirement can also come from an API mismatch. For example, MemoryStream.Length returns a long, while APIs that work with the underlying buffer (such as the one returned by MemoryStream.GetBuffer) deal in int offsets and lengths, so you end up repeatedly casting the long down to int.

Source: https://habr.com/ru/post/260623/

