This article offers a brief overview of algorithm vectorization capabilities in the .NET Framework and .NET Core. Its purpose is to introduce these techniques to readers who have never encountered them and to show that .NET does not lag far behind the "real, compiled" languages used for native development.
I am only beginning to learn vectorization techniques, so if someone from the community points out an obvious mistake or suggests improved versions of the algorithms described below, I will be delighted.
SIMD first appeared in .NET in 2015 with the release of the .NET Framework 4.6, which added the types Matrix3x2, Matrix4x4, Plane, Quaternion, Vector2, Vector3, and Vector4 for vectorized calculations. The `Vector<T>` type was added later and provided more opportunities for vectorizing algorithms. Still, many programmers remained unhappy, because these types constrain the programmer and do not expose the full power of the SIMD instructions in modern processors. More recently, in .NET Core 3.0 Preview, the System.Runtime.Intrinsics namespace appeared, which provides much more freedom in the choice of instructions. To get the best speed, you need to use RyuJIT and either build for x64 or disable Prefer 32-bit and build for AnyCPU. I ran all the benchmarks on a computer with an Intel Core i7-6700 CPU 3.40GHz (Skylake).
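As a quick sanity check before benchmarking (a minimal sketch of mine, not part of the original article), you can verify at runtime whether `Vector<T>` is hardware-accelerated on your machine and how many lanes a vector holds:

```csharp
using System;
using System.Numerics;

class SimdInfo
{
    static void Main()
    {
        // True when RyuJIT emits real SIMD instructions for Vector<T>.
        Console.WriteLine($"Accelerated: {Vector.IsHardwareAccelerated}");

        // Number of int lanes per Vector<int>; 8 on AVX2 hardware (256 bits / 32 bits).
        Console.WriteLine($"Vector<int>.Count: {Vector<int>.Count}");
    }
}
```

If `IsHardwareAccelerated` prints `False`, check that you are running under RyuJIT on x64 (or AnyCPU without Prefer 32-bit), as described above.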
I decided to start with a classic problem that often comes up first when vectorization is discussed: finding the sum of the elements of an array. Let us write four implementations of this task, each summing the elements of the Array field:
The most obvious:
```csharp
public int Naive()
{
    int result = 0;
    foreach (int i in Array)
    {
        result += i;
    }
    return result;
}
```
Using LINQ:
```csharp
public long LINQ() => Array.Aggregate<int, long>(0, (current, i) => current + i);
```
Using vectors from System.Numerics:
```csharp
public int Vectors()
{
    int vectorSize = Vector<int>.Count;
    var accVector = Vector<int>.Zero;
    int i;
    var array = Array;
    for (i = 0; i < array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<int>(array, i);
        accVector = Vector.Add(accVector, v);
    }
    int result = Vector.Dot(accVector, Vector<int>.One);
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
```
Using code from the System.Runtime.Intrinsics namespace:
```csharp
public unsafe int Intrinsics()
{
    int vectorSize = 256 / 8 / 4; // 8 ints per 256-bit vector
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array)
    {
        for (i = 0; i < array.Length - vectorSize; i += vectorSize)
        {
            var v = Avx2.LoadVector256(ptr + i);
            accVector = Avx2.Add(accVector, v);
        }
    }
    int result = 0;
    var temp = stackalloc int[vectorSize];
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++)
    {
        result += temp[j];
    }
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
```
I ran a benchmark of these four methods on my computer and got the following results:
Method | ItemsCount | Median |
---|---|---|
Naive | 10 | 75.12 ns |
LINQ | 10 | 1,186.85 ns |
Vectors | 10 | 60.09 ns |
Intrinsics | 10 | 255.40 ns |
Naive | 100 | 360.56 ns |
LINQ | 100 | 2,719.24 ns |
Vectors | 100 | 60.09 ns |
Intrinsics | 100 | 345.54 ns |
Naive | 1,000 | 1,847.88 ns |
LINQ | 1,000 | 12,033.78 ns |
Vectors | 1,000 | 240.38 ns |
Intrinsics | 1,000 | 630.98 ns |
Naive | 10,000 | 18,403.72 ns |
LINQ | 10,000 | 102,489.96 ns |
Vectors | 10,000 | 7,316.42 ns |
Intrinsics | 10,000 | 3,365.25 ns |
Naive | 100,000 | 176,630.67 ns |
LINQ | 100,000 | 975,998.24 ns |
Vectors | 100,000 | 78,828.03 ns |
Intrinsics | 100,000 | 41,269.41 ns |
It can be seen that the Vectors and Intrinsics solutions are far faster than both the obvious solution and LINQ. Now let us figure out what is happening in these two methods.
Let us consider the Vectors method in more detail:
```csharp
public int Vectors()
{
    int vectorSize = Vector<int>.Count;
    var accVector = Vector<int>.Zero;
    int i;
    var array = Array;
    for (i = 0; i < array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<int>(array, i);
        accVector = Vector.Add(accVector, v);
    }
    int result = Vector.Dot(accVector, Vector<int>.One);
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
```
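The only non-obvious step here is the horizontal sum at the end: `Vector.Dot(accVector, Vector<int>.One)` multiplies every lane of the accumulator by 1 and adds the products, which collapses the vector into a single scalar. A small sketch (the names are mine, not from the article) showing that the dot product with the all-ones vector equals a plain lane-by-lane sum:

```csharp
using System;
using System.Numerics;

class DotTrick
{
    static void Main()
    {
        // Build a vector whose lanes are 1, 2, 3, ... purely for illustration.
        var lanes = new int[Vector<int>.Count];
        for (int i = 0; i < lanes.Length; i++) lanes[i] = i + 1;
        var v = new Vector<int>(lanes);

        // Dot product with the all-ones vector == sum of the lanes.
        int viaDot = Vector.Dot(v, Vector<int>.One);

        // The same sum computed lane by lane.
        int viaLoop = 0;
        for (int i = 0; i < Vector<int>.Count; i++) viaLoop += v[i];

        Console.WriteLine(viaDot == viaLoop); // True
    }
}
```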
If you look at the code for the Intrinsics method:
```csharp
public unsafe int Intrinsics()
{
    int vectorSize = 256 / 8 / 4; // 8 ints per 256-bit vector
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array)
    {
        for (i = 0; i < array.Length - vectorSize; i += vectorSize)
        {
            var v = Avx2.LoadVector256(ptr + i);
            accVector = Avx2.Add(accVector, v);
        }
    }
    int result = 0;
    var temp = stackalloc int[vectorSize];
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++)
    {
        result += temp[j];
    }
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
```
You can see that it is very similar to Vectors, with a few exceptions: the vector size is computed by hand, the array is accessed through a fixed pointer, and the code is tied to one specific instruction set, AVX2. Because of that last point, before using a particular family of intrinsics you must check whether the current processor supports it and fall back to other implementations otherwise:
```csharp
if (Avx2.IsSupported)
{
    DoThingsForAvx2();
}
else if (Avx.IsSupported)
{
    DoThingsForAvx();
}
// ...
else if (Sse2.IsSupported)
{
    DoThingsForSse2();
}
// ...
```
The next task is comparing two byte arrays. This is actually the problem that got me started studying SIMD in .NET. Let us again write several methods for the benchmark; they compare the two arrays ArrayA and ArrayB:
The most obvious solution:
```csharp
public bool Naive()
{
    for (int i = 0; i < ArrayA.Length; i++)
    {
        if (ArrayA[i] != ArrayB[i])
        {
            return false;
        }
    }
    return true;
}
```
Solution through LINQ:
```csharp
public bool LINQ() => ArrayA.SequenceEqual(ArrayB);
```
Solution via the memcmp function:
```csharp
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
static extern int memcmp(byte[] b1, byte[] b2, long count);

public bool MemCmp() => memcmp(ArrayA, ArrayB, ArrayA.Length) == 0;
```
Using vectors from System.Numerics:
```csharp
public bool Vectors()
{
    int vectorSize = Vector<byte>.Count;
    int i = 0;
    for (; i < ArrayA.Length - vectorSize; i += vectorSize)
    {
        var va = new Vector<byte>(ArrayA, i);
        var vb = new Vector<byte>(ArrayB, i);
        if (!Vector.EqualsAll(va, vb))
        {
            return false;
        }
    }
    for (; i < ArrayA.Length; i++)
    {
        if (ArrayA[i] != ArrayB[i])
        {
            return false;
        }
    }
    return true;
}
```
Using Intrinsics:
```csharp
public unsafe bool Intrinsics()
{
    int vectorSize = 256 / 8; // 32 bytes per 256-bit vector
    int i = 0;
    const int equalsMask = unchecked((int)0b1111_1111_1111_1111_1111_1111_1111_1111);
    fixed (byte* ptrA = ArrayA)
    fixed (byte* ptrB = ArrayB)
    {
        for (; i < ArrayA.Length - vectorSize; i += vectorSize)
        {
            var va = Avx2.LoadVector256(ptrA + i);
            var vb = Avx2.LoadVector256(ptrB + i);
            var areEqual = Avx2.CompareEqual(va, vb);
            if (Avx2.MoveMask(areEqual) != equalsMask)
            {
                return false;
            }
        }
        for (; i < ArrayA.Length; i++)
        {
            if (ArrayA[i] != ArrayB[i])
            {
                return false;
            }
        }
        return true;
    }
}
```
The result of the benchmark on my computer:
Method | ItemsCount | Median |
---|---|---|
Naive | 10,000 | 66,719.1 ns |
LINQ | 10,000 | 71,211.1 ns |
Vectors | 10,000 | 3,695.8 ns |
MemCmp | 10,000 | 600.9 ns |
Intrinsics | 10,000 | 1,607.5 ns |
Naive | 100,000 | 588,633.7 ns |
LINQ | 100,000 | 651,191.3 ns |
Vectors | 100,000 | 34,659.1 ns |
MemCmp | 100,000 | 5,513.6 ns |
Intrinsics | 100,000 | 12,078.9 ns |
Naive | 1,000,000 | 5,637,293.1 ns |
LINQ | 1,000,000 | 6,622,666.0 ns |
Vectors | 1,000,000 | 777,974.2 ns |
MemCmp | 1,000,000 | 361,704.5 ns |
Intrinsics | 1,000,000 | 434,252.7 ns |
I think the code of these methods is clear, with the exception of two lines in Intrinsics:
```csharp
var areEqual = Avx2.CompareEqual(va, vb);
if (Avx2.MoveMask(areEqual) != equalsMask)
{
    return false;
}
```
In the first line, the two vectors are compared for equality and the result is stored in the areEqual vector: for each position where the elements of va and vb are equal, the corresponding element of areEqual has all its bits set to 1. It follows that if the byte vectors va and vb are entirely equal, every element of areEqual must be 255 (11111111b). Since Avx2.CompareEqual is a wrapper over _mm256_cmpeq_epi8, the pseudocode of this operation can be found on Intel's website.
The MoveMask method turns a vector into a 32-bit number whose bits are the high-order bits of each of the 32 single-byte elements of the vector. Its pseudocode can be found on Intel's website as well.
Thus, if some bytes in va and vb do not match, the corresponding bytes in areEqual will be 0, so the high-order bits of those bytes will also be 0; therefore the corresponding bits in the result of Avx2.MoveMask will be 0 as well, and the comparison with equalsMask will fail.
Let us walk through a small example, assuming the vector length is 8 bytes (to keep the notation short):
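The example can be sketched as a plain scalar simulation (my sketch; real AVX2 operates on 32 bytes, and the input values here are arbitrary): every equal byte pair produces 0xFF, and MoveMask then collects the top bit of each byte into a bitmask:

```csharp
using System;

class MoveMaskExample
{
    static void Main()
    {
        // Two 8-byte "vectors" that differ only at index 5.
        byte[] va = { 1, 2, 3, 4, 5, 6, 7, 8 };
        byte[] vb = { 1, 2, 3, 4, 5, 0, 7, 8 };

        // CompareEqual: equal bytes -> 0xFF (all bits set), unequal -> 0x00.
        var areEqual = new byte[8];
        for (int i = 0; i < 8; i++)
            areEqual[i] = va[i] == vb[i] ? (byte)0xFF : (byte)0x00;

        // MoveMask: take the high-order bit of every byte.
        int mask = 0;
        for (int i = 0; i < 8; i++)
            mask |= (areEqual[i] >> 7) << i;

        // For 8 bytes the "all equal" mask would be 0b1111_1111; here bit 5 is 0,
        // so the comparison with the all-ones mask fails and the arrays are unequal.
        Console.WriteLine(Convert.ToString(mask, 2).PadLeft(8, '0')); // 11011111
    }
}
```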
Sometimes you need to count how many times a particular element occurs in a collection, for example of ints; this algorithm can also be accelerated. Let us write several methods for comparison; they look for the element Item in the Array field.
The most obvious:
```csharp
public int Naive()
{
    int result = 0;
    foreach (int i in Array)
    {
        if (i == Item)
        {
            result++;
        }
    }
    return result;
}
```
Using LINQ:
```csharp
public int LINQ() => Array.Count(i => i == Item);
```
Using vectors from System.Numerics:
```csharp
public int Vectors()
{
    var mask = new Vector<int>(Item);
    int vectorSize = Vector<int>.Count;
    var accResult = new Vector<int>();
    int i;
    var array = Array;
    for (i = 0; i < array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<int>(array, i);
        var areEqual = Vector.Equals(v, mask);
        accResult = Vector.Subtract(accResult, areEqual);
    }
    int result = 0;
    for (; i < array.Length; i++)
    {
        if (array[i] == Item)
        {
            result++;
        }
    }
    result += Vector.Dot(accResult, Vector<int>.One);
    return result;
}
```
Using Intrinsics:
```csharp
public unsafe int Intrinsics()
{
    int vectorSize = 256 / 8 / 4; // 8 ints per 256-bit vector
    //var mask = Avx2.SetAllVector256(Item);
    //var mask = Avx2.SetVector256(Item, Item, Item, Item, Item, Item, Item, Item);
    var temp = stackalloc int[vectorSize];
    for (int j = 0; j < vectorSize; j++)
    {
        temp[j] = Item;
    }
    var mask = Avx2.LoadVector256(temp);
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array)
    {
        for (i = 0; i < array.Length - vectorSize; i += vectorSize)
        {
            var v = Avx2.LoadVector256(ptr + i);
            var areEqual = Avx2.CompareEqual(v, mask);
            accVector = Avx2.Subtract(accVector, areEqual);
        }
    }
    int result = 0;
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++)
    {
        result += temp[j];
    }
    for (; i < array.Length; i++)
    {
        if (array[i] == Item)
        {
            result++;
        }
    }
    return result;
}
```
The result of the benchmark on my computer:
Method | ItemsCount | Median |
---|---|---|
Naive | 1,000 | 2,824.41 ns |
LINQ | 1,000 | 12,138.95 ns |
Vectors | 1,000 | 961.50 ns |
Intrinsics | 1,000 | 691.08 ns |
Naive | 10,000 | 27,072.25 ns |
LINQ | 10,000 | 113,967.87 ns |
Vectors | 10,000 | 7,571.82 ns |
Intrinsics | 10,000 | 4,296.71 ns |
Naive | 100,000 | 361,028.46 ns |
LINQ | 100,000 | 1,091,994.28 ns |
Vectors | 100,000 | 82,839.29 ns |
Intrinsics | 100,000 | 40,307.91 ns |
Naive | 1,000,000 | 1,634,175.46 ns |
LINQ | 1,000,000 | 6,194,257.38 ns |
Vectors | 1,000,000 | 583,901.29 ns |
Intrinsics | 1,000,000 | 413,520.38 ns |
The Vectors and Intrinsics methods are identical in logic and differ only in the implementation of the individual operations. The whole idea is this: comparing a vector of elements against the mask produces a vector in which every matching element has all bits set, i.e. equals -1 as a signed integer, while every non-matching element equals 0. Subtracting that comparison result from the accumulator therefore adds 1 to the accumulator lane for every match; after the loop, summing the accumulator's lanes gives the total count.
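The counting trick can be checked against a plain scalar count in a few lines (a sketch of mine using System.Numerics, with arbitrary input data):

```csharp
using System;
using System.Linq;
using System.Numerics;

class CountTrick
{
    static void Main()
    {
        int[] array = { 3, 1, 3, 3, 2, 3, 1, 3, 3, 2, 3, 1 };
        int item = 3;

        // Each Vector.Equals lane is -1 (all bits set) on a match, 0 otherwise,
        // so subtracting it from the accumulator increments the matching lanes.
        var mask = new Vector<int>(item);
        var acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= array.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            acc -= Vector.Equals(new Vector<int>(array, i), mask);
        }

        // Horizontal sum of the per-lane counts, then a scalar tail loop.
        int count = Vector.Dot(acc, Vector<int>.One);
        for (; i < array.Length; i++)
        {
            if (array[i] == item) count++;
        }

        Console.WriteLine(count == array.Count(x => x == item)); // True
    }
}
```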
All the code from the article can be found on GitHub
I covered only a very small part of the possibilities that .NET provides for vectorizing computations. For a complete and current list of the intrinsics available in .NET Core on x86, you can refer to the source code. Conveniently, the summary of each intrinsic in those C# files includes its name from the C world, which makes it easier both to understand the purpose of a given intrinsic and to port existing C/C++ algorithms to .NET. Documentation on System.Numerics.Vector is available on MSDN.
In my opinion, .NET has a big advantage over C++ here: since JIT compilation happens on the client machine, the compiler can optimize the code for that specific client's processor, delivering maximum performance, and the programmer can stay within a single language and technology to write fast code.
Source: https://habr.com/ru/post/435840/