This article offers a brief overview of algorithm vectorization capabilities in the .NET Framework and .NET Core. Its purpose is to introduce these techniques to readers who have never encountered them and to show that .NET does not lag far behind the "real, compiled" languages used for native development.
I am only beginning to learn vectorization techniques, so if someone from the community points out an obvious mistake or suggests improved versions of the algorithms described below, I will be delighted.
SIMD first appeared in .NET in 2015 with the release of the .NET Framework 4.6, which added the types Matrix3x2, Matrix4x4, Plane, Quaternion, Vector2, Vector3, and Vector4 for vectorized calculations. The `Vector<T>` type was added later and provided more opportunities for vectorizing algorithms. Still, many programmers remained unhappy, because these types constrain the programmer and do not expose the full power of the SIMD instructions in modern processors. More recently, in .NET Core 3.0 Preview, the System.Runtime.Intrinsics namespace appeared, which provides much more freedom in the choice of instructions. To get the best speed, you need to use RyuJIT and either build for x64 or disable Prefer 32-bit and build for AnyCPU. I ran all the benchmarks on a computer with an Intel Core i7-6700 CPU 3.40GHz (Skylake).
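As a quick sanity check before benchmarking (a minimal sketch of mine, not part of the original article), you can verify at runtime whether `Vector<T>` is hardware-accelerated on your machine and how many lanes a vector holds:

```csharp
using System;
using System.Numerics;

class SimdInfo
{
    static void Main()
    {
        // True when RyuJIT emits real SIMD instructions for Vector<T>.
        Console.WriteLine($"Accelerated: {Vector.IsHardwareAccelerated}");

        // Number of int lanes per Vector<int>; 8 on AVX2 hardware (256 bits / 32 bits).
        Console.WriteLine($"Vector<int>.Count: {Vector<int>.Count}");
    }
}
```

If `IsHardwareAccelerated` prints `False`, check that you are running under RyuJIT on x64 (or AnyCPU without Prefer 32-bit), as described above.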
I decided to start with a classic problem that often comes up first when vectorization is discussed: finding the sum of the elements of an array. Let us write four implementations of this task, each summing the elements of the Array field:
The most obvious:
```csharp
public int Naive()
{
    int result = 0;
    foreach (int i in Array)
    {
        result += i;
    }
    return result;
}
```
Using LINQ:
```csharp
public long LINQ() => Array.Aggregate<int, long>(0, (current, i) => current + i);
```
Using vectors from System.Numerics:
```csharp
public int Vectors()
{
    int vectorSize = Vector<int>.Count;
    var accVector = Vector<int>.Zero;
    int i;
    var array = Array;
    for (i = 0; i < array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<int>(array, i);
        accVector = Vector.Add(accVector, v);
    }
    int result = Vector.Dot(accVector, Vector<int>.One);
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
```
Using code from the System.Runtime.Intrinsics namespace:
```csharp
public unsafe int Intrinsics()
{
    int vectorSize = 256 / 8 / 4; // 8 ints per 256-bit vector
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array)
    {
        for (i = 0; i < array.Length - vectorSize; i += vectorSize)
        {
            var v = Avx2.LoadVector256(ptr + i);
            accVector = Avx2.Add(accVector, v);
        }
    }
    int result = 0;
    var temp = stackalloc int[vectorSize];
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++)
    {
        result += temp[j];
    }
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
```
I ran a benchmark of these four methods on my computer and got the following results:
Method | ItemsCount | Median |
---|---|---|
Naive | 10 | 75.12 ns |
LINQ | 10 | 1,186.85 ns |
Vectors | 10 | 60.09 ns |
Intrinsics | 10 | 255.40 ns |
Naive | 100 | 360.56 ns |
LINQ | 100 | 2,719.24 ns |
Vectors | 100 | 60.09 ns |
Intrinsics | 100 | 345.54 ns |
Naive | 1,000 | 1,847.88 ns |
LINQ | 1,000 | 12,033.78 ns |
Vectors | 1,000 | 240.38 ns |
Intrinsics | 1,000 | 630.98 ns |
Naive | 10,000 | 18,403.72 ns |
LINQ | 10,000 | 102,489.96 ns |
Vectors | 10,000 | 7,316.42 ns |
Intrinsics | 10,000 | 3,365.25 ns |
Naive | 100,000 | 176,630.67 ns |
LINQ | 100,000 | 975,998.24 ns |
Vectors | 100,000 | 78,828.03 ns |
Intrinsics | 100,000 | 41,269.41 ns |
It can be seen that the Vectors and Intrinsics solutions are far faster than both the obvious solution and LINQ. Now let us figure out what is happening in these two methods.
Let us consider the Vectors method in more detail:
```csharp
public int Vectors()
{
    int vectorSize = Vector<int>.Count;
    var accVector = Vector<int>.Zero;
    int i;
    var array = Array;
    for (i = 0; i < array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<int>(array, i);
        accVector = Vector.Add(accVector, v);
    }
    int result = Vector.Dot(accVector, Vector<int>.One);
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
```
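The only non-obvious step here is the horizontal sum at the end: `Vector.Dot(accVector, Vector<int>.One)` multiplies every lane of the accumulator by 1 and adds the products, which collapses the vector into a single scalar. A small sketch (the names are mine, not from the article) showing that the dot product with the all-ones vector equals a plain lane-by-lane sum:

```csharp
using System;
using System.Numerics;

class DotTrick
{
    static void Main()
    {
        // Build a vector whose lanes are 1, 2, 3, ... purely for illustration.
        var lanes = new int[Vector<int>.Count];
        for (int i = 0; i < lanes.Length; i++) lanes[i] = i + 1;
        var v = new Vector<int>(lanes);

        // Dot product with the all-ones vector == sum of the lanes.
        int viaDot = Vector.Dot(v, Vector<int>.One);

        // The same sum computed lane by lane.
        int viaLoop = 0;
        for (int i = 0; i < Vector<int>.Count; i++) viaLoop += v[i];

        Console.WriteLine(viaDot == viaLoop); // True
    }
}
```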
If you look at the code for the Intrinsics method:
```csharp
public unsafe int Intrinsics()
{
    int vectorSize = 256 / 8 / 4; // 8 ints per 256-bit vector
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array)
    {
        for (i = 0; i < array.Length - vectorSize; i += vectorSize)
        {
            var v = Avx2.LoadVector256(ptr + i);
            accVector = Avx2.Add(accVector, v);
        }
    }
    int result = 0;
    var temp = stackalloc int[vectorSize];
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++)
    {
        result += temp[j];
    }
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
```
You can see that it is very similar to Vectors, with a few exceptions: the vector size is computed by hand, the array is accessed through a fixed pointer, and the code is tied to one specific instruction set, AVX2. Because of that last point, before using a particular family of intrinsics you must check whether the current processor supports it and fall back to other implementations otherwise:
```csharp
if (Avx2.IsSupported)
{
    DoThingsForAvx2();
}
else if (Avx.IsSupported)
{
    DoThingsForAvx();
}
// ...
else if (Sse2.IsSupported)
{
    DoThingsForSse2();
}
// ...
```
The next task is comparing two byte arrays. This is actually the problem that got me started studying SIMD in .NET. Let us again write several methods for the benchmark; they compare the two arrays ArrayA and ArrayB:
The most obvious solution:
```csharp
public bool Naive()
{
    for (int i = 0; i < ArrayA.Length; i++)
    {
        if (ArrayA[i] != ArrayB[i])
        {
            return false;
        }
    }
    return true;
}
```
Solution through LINQ:
```csharp
public bool LINQ() => ArrayA.SequenceEqual(ArrayB);
```
Solution via the memcmp function:
```csharp
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
static extern int memcmp(byte[] b1, byte[] b2, long count);

public bool MemCmp() => memcmp(ArrayA, ArrayB, ArrayA.Length) == 0;
```
Using vectors from System.Numerics:
```csharp
public bool Vectors()
{
    int vectorSize = Vector<byte>.Count;
    int i = 0;
    for (; i < ArrayA.Length - vectorSize; i += vectorSize)
    {
        var va = new Vector<byte>(ArrayA, i);
        var vb = new Vector<byte>(ArrayB, i);
        if (!Vector.EqualsAll(va, vb))
        {
            return false;
        }
    }
    for (; i < ArrayA.Length; i++)
    {
        if (ArrayA[i] != ArrayB[i])
        {
            return false;
        }
    }
    return true;
}
```
Using Intrinsics:
```csharp
public unsafe bool Intrinsics()
{
    int vectorSize = 256 / 8; // 32 bytes per 256-bit vector
    int i = 0;
    const int equalsMask = unchecked((int)0b1111_1111_1111_1111_1111_1111_1111_1111);
    fixed (byte* ptrA = ArrayA)
    fixed (byte* ptrB = ArrayB)
    {
        for (; i < ArrayA.Length - vectorSize; i += vectorSize)
        {
            var va = Avx2.LoadVector256(ptrA + i);
            var vb = Avx2.LoadVector256(ptrB + i);
            var areEqual = Avx2.CompareEqual(va, vb);
            if (Avx2.MoveMask(areEqual) != equalsMask)
            {
                return false;
            }
        }
        for (; i < ArrayA.Length; i++)
        {
            if (ArrayA[i] != ArrayB[i])
            {
                return false;
            }
        }
        return true;
    }
}
```
The result of the benchmark on my computer:
Method | ItemsCount | Median |
---|---|---|
Naive | 10,000 | 66,719.1 ns |
LINQ | 10,000 | 71,211.1 ns |
Vectors | 10,000 | 3,695.8 ns |
MemCmp | 10,000 | 600.9 ns |
Intrinsics | 10,000 | 1,607.5 ns |
Naive | 100,000 | 588,633.7 ns |
LINQ | 100,000 | 651,191.3 ns |
Vectors | 100,000 | 34,659.1 ns |
MemCmp | 100,000 | 5,513.6 ns |
Intrinsics | 100,000 | 12,078.9 ns |
Naive | 1,000,000 | 5,637,293.1 ns |
LINQ | 1,000,000 | 6,622,666.0 ns |
Vectors | 1,000,000 | 777,974.2 ns |
MemCmp | 1,000,000 | 361,704.5 ns |
Intrinsics | 1,000,000 | 434,252.7 ns |
I think the code of these methods is clear, with the exception of two lines in Intrinsics:
```csharp
var areEqual = Avx2.CompareEqual(va, vb);
if (Avx2.MoveMask(areEqual) != equalsMask)
{
    return false;
}
```
In the first line, the two vectors are compared for equality and the result is stored in the areEqual vector: for each position where the elements of va and vb are equal, the corresponding element of areEqual has all its bits set to 1. It follows that if the byte vectors va and vb are entirely equal, every element of areEqual must be 255 (11111111b). Since Avx2.CompareEqual is a wrapper over _mm256_cmpeq_epi8, the pseudocode of this operation can be found on Intel's website.
The MoveMask method turns a vector into a 32-bit number whose bits are the high-order bits of each of the 32 single-byte elements of the vector. Its pseudocode can be found on Intel's website as well.
Thus, if some bytes in va and vb do not match, the corresponding bytes in areEqual will be 0, so the high-order bits of those bytes will also be 0; therefore the corresponding bits in the result of Avx2.MoveMask will be 0 as well, and the comparison with equalsMask will fail.
Let us walk through a small example, assuming the vector length is 8 bytes (to keep the notation short):
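The example can be sketched as a plain scalar simulation (my sketch; real AVX2 operates on 32 bytes, and the input values here are arbitrary): every equal byte pair produces 0xFF, and MoveMask then collects the top bit of each byte into a bitmask:

```csharp
using System;

class MoveMaskExample
{
    static void Main()
    {
        // Two 8-byte "vectors" that differ only at index 5.
        byte[] va = { 1, 2, 3, 4, 5, 6, 7, 8 };
        byte[] vb = { 1, 2, 3, 4, 5, 0, 7, 8 };

        // CompareEqual: equal bytes -> 0xFF (all bits set), unequal -> 0x00.
        var areEqual = new byte[8];
        for (int i = 0; i < 8; i++)
            areEqual[i] = va[i] == vb[i] ? (byte)0xFF : (byte)0x00;

        // MoveMask: take the high-order bit of every byte.
        int mask = 0;
        for (int i = 0; i < 8; i++)
            mask |= (areEqual[i] >> 7) << i;

        // For 8 bytes the "all equal" mask would be 0b1111_1111; here bit 5 is 0,
        // so the comparison with the all-ones mask fails and the arrays are unequal.
        Console.WriteLine(Convert.ToString(mask, 2).PadLeft(8, '0')); // 11011111
    }
}
```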
Sometimes you need to count how many times a particular element occurs in a collection, for example of ints; this algorithm can also be accelerated. Let us write several methods for comparison; they look for the element Item in the Array field.
The most obvious:
```csharp
public int Naive()
{
    int result = 0;
    foreach (int i in Array)
    {
        if (i == Item)
        {
            result++;
        }
    }
    return result;
}
```
Using LINQ:
```csharp
public int LINQ() => Array.Count(i => i == Item);
```
Using vectors from System.Numerics:
```csharp
public int Vectors()
{
    var mask = new Vector<int>(Item);
    int vectorSize = Vector<int>.Count;
    var accResult = new Vector<int>();
    int i;
    var array = Array;
    for (i = 0; i < array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<int>(array, i);
        var areEqual = Vector.Equals(v, mask);
        accResult = Vector.Subtract(accResult, areEqual);
    }
    int result = 0;
    for (; i < array.Length; i++)
    {
        if (array[i] == Item)
        {
            result++;
        }
    }
    result += Vector.Dot(accResult, Vector<int>.One);
    return result;
}
```
Using Intrinsics:
```csharp
public unsafe int Intrinsics()
{
    int vectorSize = 256 / 8 / 4; // 8 ints per 256-bit vector
    //var mask = Avx2.SetAllVector256(Item);
    //var mask = Avx2.SetVector256(Item, Item, Item, Item, Item, Item, Item, Item);
    var temp = stackalloc int[vectorSize];
    for (int j = 0; j < vectorSize; j++)
    {
        temp[j] = Item;
    }
    var mask = Avx2.LoadVector256(temp);
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array)
    {
        for (i = 0; i < array.Length - vectorSize; i += vectorSize)
        {
            var v = Avx2.LoadVector256(ptr + i);
            var areEqual = Avx2.CompareEqual(v, mask);
            accVector = Avx2.Subtract(accVector, areEqual);
        }
    }
    int result = 0;
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++)
    {
        result += temp[j];
    }
    for (; i < array.Length; i++)
    {
        if (array[i] == Item)
        {
            result++;
        }
    }
    return result;
}
```
The result of the benchmark on my computer:
Method | ItemsCount | Median |
---|---|---|
Naive | 1,000 | 2,824.41 ns |
LINQ | 1,000 | 12,138.95 ns |
Vectors | 1,000 | 961.50 ns |
Intrinsics | 1,000 | 691.08 ns |
Naive | 10,000 | 27,072.25 ns |
LINQ | 10,000 | 113,967.87 ns |
Vectors | 10,000 | 7,571.82 ns |
Intrinsics | 10,000 | 4,296.71 ns |
Naive | 100,000 | 361,028.46 ns |
LINQ | 100,000 | 1,091,994.28 ns |
Vectors | 100,000 | 82,839.29 ns |
Intrinsics | 100,000 | 40,307.91 ns |
Naive | 1,000,000 | 1,634,175.46 ns |
LINQ | 1,000,000 | 6,194,257.38 ns |
Vectors | 1,000,000 | 583,901.29 ns |
Intrinsics | 1,000,000 | 413,520.38 ns |
The Vectors and Intrinsics methods are identical in logic and differ only in the implementation of the individual operations. The whole idea is this: comparing a vector of elements against the mask produces a vector in which every matching element has all bits set, i.e. equals -1 as a signed integer, while every non-matching element equals 0. Subtracting that comparison result from the accumulator therefore adds 1 to the accumulator lane for every match; after the loop, summing the accumulator's lanes gives the total count.
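The counting trick can be checked against a plain scalar count in a few lines (a sketch of mine using System.Numerics, with arbitrary input data):

```csharp
using System;
using System.Linq;
using System.Numerics;

class CountTrick
{
    static void Main()
    {
        int[] array = { 3, 1, 3, 3, 2, 3, 1, 3, 3, 2, 3, 1 };
        int item = 3;

        // Each Vector.Equals lane is -1 (all bits set) on a match, 0 otherwise,
        // so subtracting it from the accumulator increments the matching lanes.
        var mask = new Vector<int>(item);
        var acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= array.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            acc -= Vector.Equals(new Vector<int>(array, i), mask);
        }

        // Horizontal sum of the per-lane counts, then a scalar tail loop.
        int count = Vector.Dot(acc, Vector<int>.One);
        for (; i < array.Length; i++)
        {
            if (array[i] == item) count++;
        }

        Console.WriteLine(count == array.Count(x => x == item)); // True
    }
}
```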
All the code from the article can be found on GitHub
I covered only a very small part of the possibilities that .NET provides for vectorizing computations. For a complete and current list of the intrinsics available in .NET Core on x86, you can refer to the source code. Conveniently, the summary of each intrinsic in those C# files includes its name from the C world, which makes it easier both to understand the purpose of a given intrinsic and to port existing C/C++ algorithms to .NET. Documentation on System.Numerics.Vector is available on MSDN.
In my opinion, .NET has a big advantage over C++ here: since JIT compilation happens on the client machine, the compiler can optimize the code for that specific client's processor, delivering maximum performance, and the programmer can stay within a single language and technology to write fast code.
Source: https://habr.com/ru/post/435840/