
The amazing performance of parallel C++17 algorithms. Myth or Reality?

Good evening!

From our “C++ Developer” course we offer you a small but interesting study of parallel algorithms.

Let's go.
With the advent of parallel algorithms in C++17, you can easily update your “computational” code and benefit from parallel execution. In this article, I want to look at one STL algorithm that naturally exposes the idea of independent computation. Can you expect a 10-fold speedup on a 10-core processor? Maybe more? Maybe less? Let's talk about it.

Introduction to parallel algorithms



C++17 offers an execution policy parameter for most algorithms. The standard defines three policies:

- sequenced_policy (the global object std::execution::seq)
- parallel_policy (std::execution::par)
- parallel_unsequenced_policy (std::execution::par_unseq)

Briefly:

- seq : the algorithm runs sequentially, as before;
- par : the algorithm may be executed on multiple threads;
- par_unseq : the algorithm may be executed on multiple threads and may also be vectorised, interleaving iterations within a single thread.
As a quick example, we call std::sort in parallel:

 std::sort(std::execution::par, myVec.begin(), myVec.end());
           // ^^^^^^^^^^^^^^^^^
           // execution policy

Note how easy it is to add the parallel execution parameter to an algorithm! But does it bring significant performance improvements? Does it make things faster? Or are there cases where it slows things down?

Parallel std::transform

In this article, I want to draw attention to the std::transform algorithm, which can potentially be the backbone of other parallel techniques (along with std::transform_reduce, for_each, scan, sort...).

Our test code will be based on the following pattern:

 std::transform(execution_policy, // par, seq, par_unseq
                inVec.begin(), inVec.end(),
                outVec.begin(),
                ElementOperation);

Suppose that ElementOperation does not use any synchronisation; in that case the code has the potential for parallel execution, or even vectorisation. The computation for each element is independent and the order does not matter, so the implementation can spawn several threads (possibly on a thread pool) to process the elements independently.
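As a conceptual illustration (this is only my sketch, not MSVC's actual implementation; the names naive_parallel_transform, nThreads and chunk are mine), an implementation could split the range into chunks and hand each chunk to its own thread:

 #include <algorithm>
 #include <thread>
 #include <vector>

 // Conceptual sketch: split [first, last) into roughly equal chunks and
 // transform each chunk on its own thread. Assumes random-access iterators.
 // Real implementations use thread pools and smarter partitioning.
 template <typename It, typename OutIt, typename Op>
 void naive_parallel_transform(It first, It last, OutIt out, Op op)
 {
     const auto size = std::distance(first, last);
     const auto nThreads = std::max(1u, std::thread::hardware_concurrency());
     const auto chunk = size / nThreads;

     std::vector<std::thread> threads;
     for (unsigned i = 0; i < nThreads; ++i)
     {
         const auto begin = first + i * chunk;
         const auto end = (i + 1 == nThreads) ? last : begin + chunk;
         threads.emplace_back([=] { std::transform(begin, end, out + i * chunk, op); });
     }
     for (auto& t : threads)
         t.join();
 }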

I would like to experiment with the following things:

- the size of the vector, i.e. the number of elements being processed;
- the kind of transformation, from trivial operations up to heavier, ALU-intensive computations.

As you can see, I want to test not only the number of elements at which a parallel algorithm becomes “good” to use, but also ALU operations that keep the processor busy.
Other algorithms, such as sorting or accumulation (in the form of std::reduce), also offer parallel execution, but they require more work to combine their results. So we will leave them as candidates for another article.
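For completeness, here is what such a parallel reduction could look like, as a minimal sketch (not benchmarked in this article):

 #include <execution>
 #include <numeric>
 #include <vector>

 // Parallel sum of a vector. std::reduce may reorder and regroup the
 // additions, which is exactly what permits the parallel execution.
 double parallel_sum(const std::vector<double>& vec)
 {
     return std::reduce(std::execution::par, vec.begin(), vec.end(), 0.0);
 }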

Benchmark Note

For my tests I use Visual Studio 2017, 15.8, since at the moment (November 2018) it is the only popular compiler/STL implementation that ships the parallel algorithms (GCC is on the way!). Moreover, I focused only on execution::par, since execution::par_unseq is not available in MSVC (it behaves the same as execution::par).

There are two test machines:

- an Intel i7 4720 (4 cores), my laptop;
- an Intel i7 8700 (6 cores), a faster desktop machine.
The code is compiled for x64 in Release mode; auto-vectorisation is enabled by default, and I also enabled the extended instruction set (SSE2) and OpenMP (2.0).

The code is in my github: github/fenbf/ParSTLTests/TransformTests/TransformTests.cpp

For OpenMP (2.0), I use parallelism only for loops:

 #pragma omp parallel for
 for (int i = 0; ...)
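For example, the OpenMP variant of the “multiply by 2” benchmark from the next section might look like this (my sketch, assuming vec and out are the input and output vectors):

 // OpenMP 2.0 requires a signed loop index.
 #pragma omp parallel for
 for (int i = 0; i < static_cast<int>(vec.size()); ++i)
     out[i] = vec[i] * 2.0;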

I run each benchmark 5 times and take the minimum result.
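The measurement loop follows this pattern (a simplified sketch; the helper name best_of is mine):

 #include <algorithm>
 #include <chrono>
 #include <limits>

 // Run `func` several times and return the best (minimum) time in milliseconds.
 template <typename Func>
 double best_of(Func func, int runs = 5)
 {
     double best = std::numeric_limits<double>::max();
     for (int i = 0; i < runs; ++i)
     {
         const auto start = std::chrono::steady_clock::now();
         func();
         const auto stop = std::chrono::steady_clock::now();
         const std::chrono::duration<double, std::milli> elapsed = stop - start;
         best = std::min(best, elapsed.count());
     }
     return best;
 }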

Warning: the results reflect only rough observations; check on your own system/configuration before relying on them in production. Your requirements and environment may differ from mine.

More details about the MSVC implementation can be found in this post , as well as in the recent CppCon 2018 talk by Billy O'Neal (Billy implemented Parallel STL in MSVC).

Well, let's start with simple examples!

Simple conversion

Consider a case where you apply a very simple operation to each element of the input vector, for example copying it or multiplying it by a constant.

For example:

 std::transform(std::execution::par,
                vec.begin(), vec.end(),
                out.begin(),
                [](double v) { return v * 2.0; });

My machines have 4 and 6 cores... can I expect a 4 to 6 times speedup over sequential execution? Here are my results (time in milliseconds):

Operation           | Vector size | i7 4720 (4 cores) | i7 8700 (6 cores)
execution::seq      | 10k         | 0.002763          | 0.001924
execution::par      | 10k         | 0.009869          | 0.008983
openmp parallel for | 10k         | 0.003158          | 0.002246
execution::seq      | 100k        | 0.051318          | 0.028872
execution::par      | 100k        | 0.043028          | 0.025664
openmp parallel for | 100k        | 0.022501          | 0.009624
execution::seq      | 1000k       | 1.69508           | 0.52419
execution::par      | 1000k       | 1.65561           | 0.359619
openmp parallel for | 1000k       | 1.50678           | 0.344863

On the faster machine we can see that it takes about 1 million elements before we start to notice any improvement in performance. On my laptop, all parallel implementations were slower.

Thus, with transformations this cheap it is hard to get any significant performance improvement, even with a larger number of elements.

Why so?

Since the operations are elementary, the processor cores can execute them almost instantly, using only a few cycles. However, the cores spend far more time waiting for main memory. So in this case they will mostly wait rather than compute.

Reading and writing a variable in memory takes about 2-3 clock cycles if it is cached, and a few hundred clock cycles if it is not.
https://www.agner.org/optimize/optimizing_cpp.pdf

Roughly speaking, if your algorithm is memory-bound, you should not expect parallel execution to make it any faster.
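A quick back-of-envelope check makes this plausible (the bandwidth figure below is my assumption, not a measurement of the test machines):

 // For the "multiply by 2" test on 1M doubles we read 8 MB and write 8 MB.
 // At an assumed ~20 GB/s of sustained memory bandwidth, moving those 16 MB
 // takes ~0.8 ms, the same order of magnitude as the measured times above,
 // so the cores really are mostly waiting on memory.
 constexpr double bytesMoved   = 2.0 * 1'000'000 * sizeof(double);   // read + write
 constexpr double bandwidthBps = 20.0e9;                             // assumed
 constexpr double lowerBoundMs = bytesMoved / bandwidthBps * 1000.0; // ~0.8 ms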

More calculations

Since memory bandwidth is crucial here and can limit the speed of everything... let's increase the amount of computation per element.

The idea is to make better use of processor cycles instead of wasting time waiting for memory.

To begin with, I use trigonometric functions, for example sqrt(sin*cos) (these are arbitrary computations in a non-optimal form, chosen just to keep the processor busy).

We use sqrt, sin and cos, which can take roughly ~20 cycles for sqrt and ~100 cycles per trigonometric function. This amount of computation can hide the memory access latency.

For more information about instruction latencies, see the excellent Perf Guide from Agner Fog.

Here is the benchmark code:

 std::transform(std::execution::par,
                vec.begin(), vec.end(),
                out.begin(),
                [](double v) { return std::sqrt(std::sin(v) * std::cos(v)); });

And now what? Can we expect to improve performance compared to the previous attempt?

Here are some results (time in milliseconds):

Operation           | Vector size | i7 4720 (4 cores) | i7 8700 (6 cores)
execution::seq      | 10k         | 0.105005          | 0.070577
execution::par      | 10k         | 0.055661          | 0.03176
openmp parallel for | 10k         | 0.096321          | 0.024702
execution::seq      | 100k        | 1.08755           | 0.707048
execution::par      | 100k        | 0.259354          | 0.17195
openmp parallel for | 100k        | 0.898465          | 0.189915
execution::seq      | 1000k       | 10.5159           | 7.16254
execution::par      | 1000k       | 2.44472           | 1.10099
openmp parallel for | 1000k       | 4.78681           | 1.89017

Finally, we see quite good numbers :)

For 1000 elements (not shown here) the parallel and sequential timings were similar, so above 1000 elements we start to see improvements from the parallel version.

For 100 thousand elements, the result on the faster machine is about 4 times better than the sequential version (and the OpenMP version achieves a similar ratio there).

For the largest case of a million elements, the result is about 4.3 times faster on the 4-core machine and 6.5 times faster on the 6-core one.
For such computations I achieved a speedup that is close to “linear” in the number of processor cores, which is what I expected.

Fresnel and 3D Vectors

In the section above I used “made-up” computations, but what about real code?
Let's solve the Fresnel equations, which describe the reflection and refraction of light at a smooth, flat surface. This is a popular technique for generating realistic lighting in 3D games.




Photo from Wikimedia

As a good sample, I found this description and implementation .

About using the GLM library

Instead of creating my own implementation, I used the glm library. I often use it in my OpenGL projects.

The library is easy to get through the Conan Package Manager, so I will use it as well. Link to the package .

Conan file:

 [requires]
 glm/0.9.9.1@g-truc/stable

 [generators]
 visual_studio

and the command line to install the library (it generates props files that I can use in the Visual Studio project):

 conan install . -s build_type=Release -if build_release_x64 -s arch=x86_64 

GLM is a header-only library, so you can also just download it manually if you prefer.

Actual code and benchmark

I adapted the glm code from scratchapixel.com :

 // based on https://www.scratchapixel.com
 float fresnel(const glm::vec4 &I, const glm::vec4 &N, const float ior)
 {
     float cosi = std::clamp(glm::dot(I, N), -1.0f, 1.0f);
     float etai = 1, etat = ior;
     if (cosi > 0) { std::swap(etai, etat); }

     // compute sini using Snell's law
     float sint = etai / etat * sqrtf(std::max(0.f, 1 - cosi * cosi));
     // total internal reflection
     if (sint >= 1)
         return 1.0f;

     float cost = sqrtf(std::max(0.f, 1 - sint * sint));
     cosi = fabsf(cosi);
     float Rs = ((etat * cosi) - (etai * cost)) / ((etat * cosi) + (etai * cost));
     float Rp = ((etai * cosi) - (etat * cost)) / ((etai * cosi) + (etat * cost));
     return (Rs * Rs + Rp * Rp) / 2.0f;
 }

The code uses several mathematical operations: a dot product, multiplications, divisions, so the processor has plenty to do. Also, instead of a vector of double we operate on 4-element vectors, which increases the amount of memory used.
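The input data could be prepared along these lines (my sketch, not the article's exact test harness; the variable names match the benchmark below):

 #include <glm/glm.hpp>
 #include <cstddef>
 #include <vector>

 // Normalised incidence vectors, normals, and the output buffer.
 const std::size_t count = 1'000'000;
 std::vector<glm::vec4> vec(count, glm::normalize(glm::vec4(1.f, 2.f, 3.f, 0.f)));
 std::vector<glm::vec4> vecNormals(count, glm::normalize(glm::vec4(0.f, 0.f, 1.f, 0.f)));
 std::vector<float> vecFresnelTerms(count); // filled by std::transform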

Benchmark:

 std::transform(std::execution::par,
                vec.begin(), vec.end(),   // first input range: incidence vectors
                vecNormals.begin(),       // second input range: normals
                vecFresnelTerms.begin(),  // output: Fresnel terms
                [](const glm::vec4& v, const glm::vec4& n) {
                    return fresnel(v, n, 1.0f);
                });

And here are the results (time in milliseconds):

Operation           | Vector size | i7 4720 (4 cores) | i7 8700 (6 cores)
execution::seq      | 1k          | 0.032764          | 0.016361
execution::par      | 1k          | 0.031186          | 0.028551
openmp parallel for | 1k          | 0.005526          | 0.007699
execution::seq      | 10k         | 0.246722          | 0.169383
execution::par      | 10k         | 0.090794          | 0.067048
openmp parallel for | 10k         | 0.049739          | 0.029835
execution::seq      | 100k        | 2.49722           | 1.69768
execution::par      | 100k        | 0.530157          | 0.283268
openmp parallel for | 100k        | 0.495024          | 0.291609
execution::seq      | 1000k       | 25.0828           | 16.9457
execution::par      | 1000k       | 5.15235           | 2.33768
openmp parallel for | 1000k       | 5.11801           | 2.95908

With “real” computations we see that parallel algorithms provide good performance. For such operations on my two Windows machines I achieved a speedup that is almost linear in the number of cores.

For all tests I also showed the OpenMP results: the two implementations, MSVC's parallel STL and OpenMP, behave similarly.

Conclusion

In this article I examined three cases of using parallel computing and parallel algorithms. Replacing the standard algorithms with the std::execution::par version may look very tempting, but it is not always worth it! Every operation you use inside an algorithm may behave differently and be bound more by the CPU or by memory. So consider each change separately.

What is worth remembering:

- parallel execution does more work than the sequential version, since the library has to set up and coordinate the threads;
- the number of elements matters: for small workloads the overhead can eat all the gains;
- what matters even more is the amount of CPU work per element: the memory-bound transformation showed no improvement, while the compute-heavy ones scaled almost linearly with the number of cores;
- always measure on your own hardware before shipping.
Special thanks to JFT for helping with the article!

Also note my other sources about parallel algorithms:


Check out another article related to Parallel Algorithms: How to Boost Performance with Intel Parallel STL and C++17 Parallel Algorithms

THE END

We look forward to your comments and questions, which you can leave here or bring to our instructor at the open lesson .

Source: https://habr.com/ru/post/433588/

