Good evening!
From our “C++ Developer” course we bring you a small but interesting study of parallel algorithms.
Let's go!
With the arrival of parallel algorithms in C++17, you can easily update your “computational” code and benefit from parallel execution. In this article, I want to examine one STL algorithm that naturally exposes the idea of independent computation. Can we expect a 10x speedup on a 10-core processor? Maybe more? Maybe less? Let's talk about it.
Introduction to parallel algorithms
C++17 offers an execution policy parameter for most of the algorithms:
- sequenced_policy: an execution policy type used as a unique type to disambiguate parallel algorithm overloads and to require that a parallel algorithm's execution may not be parallelized; the corresponding global object is std::execution::seq;
- parallel_policy: an execution policy type used as a unique type to disambiguate parallel algorithm overloads and to indicate that a parallel algorithm's execution may be parallelized; the corresponding global object is std::execution::par;
- parallel_unsequenced_policy: an execution policy type used as a unique type to disambiguate parallel algorithm overloads and to indicate that a parallel algorithm's execution may be parallelized and vectorized; the corresponding global object is std::execution::par_unseq.
Briefly:
- use std::execution::seq to run an algorithm sequentially;
- use std::execution::par to run an algorithm in parallel (usually backed by some thread-pool implementation);
- use std::execution::par_unseq to run an algorithm in parallel with the ability to use vector instructions (such as SSE or AVX).
As a quick example, let's call std::sort in parallel:
std::sort(std::execution::par, myVec.begin(), myVec.end());
Notice how easy it is to add the parallel execution parameter to an algorithm! But does it bring significant performance improvements? Does it always make things faster? Or are there cases where it slows things down?
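To try it yourself, here is a minimal sketch of a complete program (assuming a compiler and standard library that ship the parallel algorithms, such as MSVC 15.8 or later):

#include <algorithm>
#include <execution>
#include <random>
#include <vector>

int main() {
    // Fill a vector with pseudo-random doubles.
    std::vector<double> myVec(1'000'000);
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::generate(myVec.begin(), myVec.end(), [&] { return dist(gen); });

    // The same call with the three policies; only the first argument changes.
    std::sort(std::execution::seq, myVec.begin(), myVec.end());
    std::sort(std::execution::par, myVec.begin(), myVec.end());
    std::sort(std::execution::par_unseq, myVec.begin(), myVec.end());
}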
Parallel std::transform
In this article, I want to focus on the std::transform algorithm, which can potentially serve as a building block for other parallel techniques (along with std::transform_reduce, for_each, scan, sort...).
Our test code will be based on the following pattern:
std::transform(execution_policy, // par, seq or par_unseq
               begin(vec), end(vec),
               begin(out),
               ElementOperation);
Suppose that the ElementOperation function does not use any synchronization; in that case, the code has the potential for parallel execution or even vectorization. The computation for each element is independent and the order does not matter, so the implementation can spawn several threads (possibly on a thread pool) to process elements independently.
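As a quick illustration (my own sketch, not part of the benchmark code): the first lambda below is safe under std::execution::par because it touches only its own element, while the second one mutates shared state and is a data race under the parallel policy:

#include <algorithm>
#include <execution>
#include <vector>

void example(const std::vector<double>& vec, std::vector<double>& out) {
    // Fine: each invocation reads and writes only its own element.
    std::transform(std::execution::par, vec.begin(), vec.end(), out.begin(),
                   [](double v) { return v * 2.0; });

    // Not fine: 'sum' is shared between threads, so this races under
    // std::execution::par unless it is made atomic or otherwise protected.
    double sum = 0.0;
    std::transform(std::execution::par, vec.begin(), vec.end(), out.begin(),
                   [&sum](double v) { sum += v; return v; });
}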
I would like to experiment with the following things:
- the size of the vector: large and small;
- a simple transformation that spends most of its time on memory access;
- a transformation with more arithmetic (ALU) operations;
- ALU-heavy work in a more realistic scenario.
As you can see, I want to test not only the number of elements that is “good enough” for a parallel algorithm, but also the ALU operations that keep the processor busy.
Other algorithms, such as sorting and accumulation (in the form of std::reduce), also offer parallel execution, but they require more work to compute their results, so we will consider them candidates for another article.
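Just for reference, a parallel reduction is a one-liner in the same spirit; a minimal sketch:

#include <execution>
#include <numeric>
#include <vector>

// Sums a vector in parallel with std::reduce (C++17). Unlike std::accumulate,
// std::reduce may reorder and regroup the additions, so the operation should
// be associative and commutative.
double parallelSum(const std::vector<double>& vec) {
    return std::reduce(std::execution::par, vec.begin(), vec.end(), 0.0);
}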
Benchmark Note
For my tests, I use Visual Studio 2017, 15.8, since it is the only one among the popular compiler/STL implementations that ships the parallel algorithms at the moment (November 2018; GCC is on the way!). Moreover, I focused only on execution::par, since execution::par_unseq is not available in MSVC (it works the same way as execution::par).
There are two machines:
- i7 8700 - PC, Windows 10, 3.2 GHz clock speed, 6 cores / 12 threads (Hyperthreading);
- i7 4720 - laptop, Windows 10, 2.6 GHz clock speed, 4 cores / 8 threads (Hyperthreading).
The code is compiled for x64 in Release mode; auto-vectorization is enabled by default, and I also enabled the enhanced instruction set (SSE2) and OpenMP (2.0).
The code is in my GitHub repo: github/fenbf/ParSTLTests/TransformTests/TransformTests.cpp
For OpenMP (2.0), I use only loop-level parallelism:
#pragma omp parallel for
for (int i = 0; ...)
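Applied to an element-wise transform, the OpenMP variant looks roughly like this sketch (my reconstruction, not the exact code from the repository; element doubling is a stand-in operation, and the build needs the /openmp compiler switch):

#include <vector>

// OpenMP 2.0 version of an element-wise transform.
// OpenMP 2.0 requires a signed integer loop counter.
void transformOpenMP(const std::vector<double>& vec, std::vector<double>& out) {
    const int size = static_cast<int>(vec.size());
    #pragma omp parallel for
    for (int i = 0; i < size; ++i)
        out[i] = vec[i] * 2.0;
}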
I run each test five times and look at the minimum time.
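The measurement loop itself can be a small helper like this sketch (the five-run minimum comes from the text; everything else is an assumed implementation detail):

#include <algorithm>
#include <chrono>
#include <limits>

// Runs 'func' five times and returns the best (minimum) time in milliseconds.
template <typename Func>
double measureMinMs(Func func) {
    double best = std::numeric_limits<double>::max();
    for (int run = 0; run < 5; ++run) {
        const auto start = std::chrono::steady_clock::now();
        func();
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}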
Warning: the results reflect only rough observations; check them on your system/configuration before relying on them in production. Your requirements and environment may differ from mine.
You can read more about the MSVC implementation in this post and in Billy O'Neal's talk from CppCon 2018 (Billy implemented Parallel STL in MSVC).
Well, let's start with simple examples!
Simple transformation
Consider the case when you apply a very simple operation to an input vector, such as copying or multiplying elements.
For example:
std::transform(std::execution::par, vec.begin(), vec.end(), out.begin(), [](double v) { return v * 2.0; } );
My machines have 6 and 4 cores... can I expect a 4x-6x speedup over sequential execution? Here are my results (time in milliseconds):
| Operation | Vector size | i7 4720 (4 cores) | i7 8700 (6 cores) |
| --- | --- | --- | --- |
| execution::seq | 10k | 0.002763 | 0.001924 |
| execution::par | 10k | 0.009869 | 0.008983 |
| openmp parallel for | 10k | 0.003158 | 0.002246 |
| execution::seq | 100k | 0.051318 | 0.028872 |
| execution::par | 100k | 0.043028 | 0.025664 |
| openmp parallel for | 100k | 0.022501 | 0.009624 |
| execution::seq | 1000k | 1.69508 | 0.52419 |
| execution::par | 1000k | 1.65561 | 0.359619 |
| openmp parallel for | 1000k | 1.50678 | 0.344863 |
On the faster machine, it takes about 1 million elements before we notice any performance improvement. On my laptop, on the other hand, all parallel implementations were slower.
Thus, it is hard to see any significant performance improvement with such transformations, even as the number of elements grows.
Why is that?
Since the operation is elementary, the processor cores can execute it almost instantly, using only a few cycles. However, the cores spend more time waiting for main memory, so in this case they will mostly be waiting rather than computing.
Reading or writing a variable in memory takes only 2-3 clock cycles if it is cached, but a few hundred cycles if it is not: https://www.agner.org/optimize/optimizing_cpp.pdf
Roughly speaking, if your algorithm is memory-bound, you should not expect better performance from parallel execution.
More computations
Since memory bandwidth is essential and can affect how fast things go... let's increase the number of computations affecting each element.
The idea is to make better use of processor cycles rather than waste time waiting for memory.
To begin with, I use trigonometric functions, for example sqrt(sin*cos) (arbitrary computations in a non-optimal form, just to keep the processor busy).
We use sqrt, sin, and cos, which can take roughly 20 cycles per sqrt and roughly 100 cycles per trigonometric function. That amount of computation can hide the latency of the memory accesses.
For more information about instruction latencies, see the excellent Perf Guide from Agner Fog.
Here is the benchmark code:
std::transform(std::execution::par, vec.begin(), vec.end(), out.begin(), [](double v) { return std::sqrt(std::sin(v)*std::cos(v)); } );
What now? Can we expect better performance than in the previous attempt?
Here are some results (time in milliseconds):
| Operation | Vector size | i7 4720 (4 cores) | i7 8700 (6 cores) |
| --- | --- | --- | --- |
| execution::seq | 10k | 0.105005 | 0.070577 |
| execution::par | 10k | 0.055661 | 0.03176 |
| openmp parallel for | 10k | 0.096321 | 0.024702 |
| execution::seq | 100k | 1.08755 | 0.707048 |
| execution::par | 100k | 0.259354 | 0.17195 |
| openmp parallel for | 100k | 0.898465 | 0.189915 |
| execution::seq | 1000k | 10.5159 | 7.16254 |
| execution::par | 1000k | 2.44472 | 1.10099 |
| openmp parallel for | 1000k | 4.78681 | 1.89017 |
Finally, we see quite good numbers :)
For 1000 elements (not shown here), parallel and sequential timings were similar, so above 1000 elements we start to see improvements in the parallel version.
For 100k elements, the result on the faster machine is almost 9 times better than the sequential version (and similar for the OpenMP version).
For the largest set of a million elements, the result is 5x to 8x faster.
For such computations, I achieved a speedup roughly “linear” in the number of processor cores, which is what we expected.
Fresnel and 3D vectors
In the section above I used “made-up” computations, but what about real code?
Let's compute the Fresnel equations, which describe the reflection and refraction of light at a smooth, flat surface. It is a popular technique for generating realistic lighting in 3D games.
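For reference (my addition, in the standard physics convention): for unpolarized light, the effective reflectance averages the two polarized terms, where \theta_i and \theta_t are the incidence and transmission angles and n_1, n_2 the refractive indices:

R_s = \left(\frac{n_1\cos\theta_i - n_2\cos\theta_t}{n_1\cos\theta_i + n_2\cos\theta_t}\right)^2, \qquad
R_p = \left(\frac{n_2\cos\theta_i - n_1\cos\theta_t}{n_2\cos\theta_i + n_1\cos\theta_t}\right)^2, \qquad
R = \frac{R_s + R_p}{2}.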

(Photo from Wikimedia.)
As a good sample, I found this description and implementation.
About using the GLM library
Instead of creating my own implementation, I used the glm library; I often use it in my OpenGL projects.
The library is easy to get through the Conan package manager, so I will use it that way as well. Link to the package.
Conan file:
[requires]
glm/0.9.9.1@g-truc/stable

[generators]
visual_studio
and the command line to install the library (it generates props files that I can use in the Visual Studio project):
conan install . -s build_type=Release -if build_release_x64 -s arch=x86_64
The library is header-only, so you can also simply download it manually if you prefer.
Actual code and benchmark
I adapted the glm code from scratchapixel.com.
The code uses several math operations (dot products, multiplications, divisions), so the processor has plenty to do. Instead of a vector of doubles, we use a vector of 4-element vectors to increase the amount of memory used.
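The adapted function is roughly the following sketch (my reconstruction in the spirit of the scratchapixel code; the exact version lives in the repository linked above, and ior, the index of refraction, is an assumed parameter):

#include <glm/glm.hpp>
#include <algorithm>
#include <cmath>
#include <utility>

// Reconstructed sketch: Fresnel reflectance for an incident vector I and a normal N.
float fresnel(const glm::vec4& I, const glm::vec4& N, const float ior = 1.5f) {
    float cosi = std::clamp(glm::dot(I, N), -1.0f, 1.0f);
    float etai = 1.0f, etat = ior;
    if (cosi > 0.0f) std::swap(etai, etat);
    // Snell's law: sine of the transmission angle
    const float sint = etai / etat * std::sqrt(std::max(0.0f, 1.0f - cosi * cosi));
    if (sint >= 1.0f)
        return 1.0f; // total internal reflection
    const float cost = std::sqrt(std::max(0.0f, 1.0f - sint * sint));
    cosi = std::fabs(cosi);
    // reflectance for the two polarizations, then the unpolarized average
    const float Rs = (etat * cosi - etai * cost) / (etat * cosi + etai * cost);
    const float Rp = (etai * cosi - etat * cost) / (etai * cosi + etat * cost);
    return (Rs * Rs + Rp * Rp) / 2.0f;
}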
Benchmark:
std::transform(std::execution::par, vec.begin(), vec.end(), vecNormals.begin(),
               // reconstructed: output range and the Fresnel lambda
               vecFresnelTerms.begin(),
               [](const glm::vec4& v, const glm::vec4& n) { return fresnel(v, n); });
And here are the results (time in milliseconds):
| Operation | Vector size | i7 4720 (4 cores) | i7 8700 (6 cores) |
| --- | --- | --- | --- |
| execution::seq | 1k | 0.032764 | 0.016361 |
| execution::par | 1k | 0.031186 | 0.028551 |
| openmp parallel for | 1k | 0.005526 | 0.007699 |
| execution::seq | 10k | 0.246722 | 0.169383 |
| execution::par | 10k | 0.090794 | 0.067048 |
| openmp parallel for | 10k | 0.049739 | 0.029835 |
| execution::seq | 100k | 2.49722 | 1.69768 |
| execution::par | 100k | 0.530157 | 0.283268 |
| openmp parallel for | 100k | 0.495024 | 0.291609 |
| execution::seq | 1000k | 25.0828 | 16.9457 |
| execution::par | 1000k | 5.15235 | 2.33768 |
| openmp parallel for | 1000k | 5.11801 | 2.95908 |
With “real” computations, we see that parallel algorithms provide good performance. For such operations on my two Windows machines, I achieved a speedup almost linear in the number of cores.
For all tests, I also showed the OpenMP results; the two implementations, MSVC and OpenMP, behave similarly.
Conclusion
In this article, I examined three cases of using parallel computing and parallel algorithms. Replacing the standard algorithms with their std::execution::par versions may seem very tempting, but it is not always worth it! Each operation you use inside an algorithm may behave differently, being more CPU-bound or more memory-bound, so consider each change separately.
What is worth remembering:
- parallel execution usually does more work than sequential execution, since the library has to prepare for parallel execution;
- not only the number of elements matters, but also the number of instructions that keep the processor busy;
- it is best to have tasks that do not depend on each other or on shared resources;
- parallel algorithms offer an easy way to split work across separate threads;
- if your operations are memory-bound, you should not expect performance improvements, and in some cases the algorithms may even be slower;
- to get a decent performance improvement, always measure the timings for each problem; in some cases the results may be completely different.
Special thanks to JFT for helping with the article!
Also see my other materials about parallel algorithms, and check out this related article:
How to Boost Performance with Intel Parallel STL and C++17 Parallel Algorithms
THE END
We look forward to comments and questions, which you can leave here or ask our teacher at the open day.