Some time ago I wrote about how to get reproducible results and what difficulties come with that. I also discussed in detail the models that control how the compiler handles floating-point numbers, and noted separately that if we use any libraries or standards, we must make sure the necessary flags are specified for them as well. Just recently I came across an interesting problem related to reproducibility of results when working with OpenMP.
What is reproducibility? It's simple: we want to get the same numbers from run to run, because they matter to us. This is critical in many areas where parallel computing is actively used.
So, as you remember, summation order plays an important role in machine arithmetic, and if we have loops parallelized with any technology, the problem of reproducibility inevitably arises, because no one knows in what order the summation will be performed, or into how many "chunks" our original loop will be split. In particular, this shows up when using reductions in OpenMP.
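Why the summation order matters can be shown in a couple of lines. This is a minimal demonstration of the underlying cause (not specific to OpenMP): floating-point addition is not associative, so regrouping the same terms changes the rounded result.

```python
# Floating-point addition is not associative: regrouping the same three
# terms changes the rounded result (IEEE 754 doubles).
big = 2.0 ** 53   # above this, a step of 1.0 falls below the spacing of doubles

left = (big + 1.0) - big    # the 1.0 is absorbed when added to big first
right = big + (1.0 - big)   # 1.0 - big is exactly representable

print(left, right)  # 0.0 1.0
```

Two chunkings of the same sum are exactly these two groupings, just on a larger scale.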
Consider a simple example:
!$OMP PARALLEL DO schedule(static)
do i=1,n
  a=a+b(i)
enddo
In this case, with the static scheduler, we simply divide the entire iteration space into equal parts and let each thread execute its share of the iterations. For example, with 1000 iterations and 4 threads running, we get 250 iterations apiece. The accumulator a is shared between threads, so we need to take care of the code's correctness. The working option is to use a reduction: each thread computes its own value, and the "intermediate" results are then added together:
!$OMP PARALLEL DO REDUCTION(+:a) schedule(static)
do i=1,n
  a=a+b(i)
enddo
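What REDUCTION(+:a) does can be sketched in plain code. This is a model of the mechanism, not Intel's actual runtime implementation: each thread sums its chunk into a private partial, and the partials are combined at the end. The combine order is whatever order the runtime chooses, which is exactly where run-to-run variation can creep in.

```python
import threading

def parallel_sum(data, nthreads):
    """Model of an OpenMP sum reduction with a static schedule."""
    partials = [0.0] * nthreads
    chunk = (len(data) + nthreads - 1) // nthreads  # static schedule: equal chunks

    def worker(tid):
        lo, hi = tid * chunk, min((tid + 1) * chunk, len(data))
        s = 0.0                       # thread-private accumulator
        for x in data[lo:hi]:
            s += x
        partials[tid] = s

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)              # combine the per-thread partials

# With integer-valued doubles every partial sum is exact, so any thread
# count gives the same answer here; with general data that is not guaranteed.
data = [float(i) for i in range(1000)]
print(parallel_sum(data, 4))  # 499500.0
```

With real-valued data the partial sums round differently depending on the chunking, and the combined total inherits that difference.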
Even on such a simple example it is easy to get varying values.
I changed the number of threads with OMP_SET_NUM_THREADS and found that with 2 threads a = 204.5992, while with 4 it was already 204.6005. I omit the initialization of the array b(i) and of a.
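This dependence on the number of threads can be reproduced without OpenMP at all. Below is a small model of the static scheduler: split the array into equal chunks, sum each chunk, then add the partial sums. The data here is deliberately crafted (not the b(i) from the example above, whose initialization is omitted) so that the grouping provably changes the answer.

```python
def chunked_sum(data, nchunks):
    """Split data into equal chunks, sum each, then add the partials."""
    size = (len(data) + nchunks - 1) // nchunks
    partials = [sum(data[i:i + size]) for i in range(0, len(data), size)]
    total = 0.0
    for p in partials:
        total += p
    return total

# Crafted data where the chunking provably matters (IEEE 754 doubles):
big = 2.0 ** 53
data = [big, 1.0, 1.0, -big]

print(chunked_sum(data, 1))  # 0.0 (both 1.0s are absorbed by big)
print(chunked_sum(data, 2))  # 1.0 (1.0 - big is exact in the second chunk)
```

The 204.5992 vs. 204.6005 discrepancy above is the same effect, just spread across thousands of less dramatic roundings.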
Interestingly, talking about reproducible results makes sense only under a number of conditions. The architecture, the OS, the version of the compiler used to build the application, and the number of threads must all stay constant from run to run. If we change the number of threads, the results will differ, and that is perfectly normal. However, even when all these conditions are met, the result can still vary, and this is where the KMP_DETERMINISTIC_REDUCTION environment variable together with the static scheduler should help us. I should note that using it does not guarantee that the results of the parallel version will match the sequential one, nor those of a run with a different number of threads. It is important to understand this.
This is a fairly narrow case: we really did not change anything, and yet the results did not agree. The main surprise is that in some cases KMP_DETERMINISTIC_REDUCTION does not work even when we "played by the rules."
This code, slightly more complicated than the first example, gives different results:
!$OMP PARALLEL DO REDUCTION(+:ue) schedule(static)
do is=1,ns
  do y=1,ny
    do x=1,nx
      ue(x,y)=ue(x,y) + ua(x,y,is)
    enddo
  enddo
enddo
!$OMP END PARALLEL DO
Even after setting the KMP_DETERMINISTIC_REDUCTION variable, nothing changed. Why? It turns out that in some cases, for performance reasons, the compiler generates its own implementation of the reduction using locks and simply ignores the variable. These cases are easy to spot in the assembly: in the "good" variants there must be a call to __kmp_reduce_nowait. For my example it was not there, which somewhat undermines one's confidence in KMP_DETERMINISTIC_REDUCTION.
So, if you have written your code and the results of your calculations jump from run to run, do not rush to put on sackcloth and ashes. Disable optimization and run the application. Check that your data is aligned, and set a "strict" floating-point model. The following compiler options may be useful for the test:
ifort -O0 -openmp -fp-model strict -align array32byte
If the results still surprise you even with this set of options, check the loops parallelized with OpenMP and their reductions, and turn on KMP_DETERMINISTIC_REDUCTION. This may well fix the problem. If not, look at the assembly and check for the __kmp_reduce_nowait call. If the call is present, the problem is probably not in OpenMP or the compiler, but in an error that has crept into your code. By the way, the problem with KMP_DETERMINISTIC_REDUCTION should be solved soon. But for now, keep this quirk in mind.