
In this series of posts we will solve one simple problem using a number of more or less relevant parallel programming technologies (native threads, OpenMP, TBB, MPI, CUDA, OpenCL, OpenACC, Chapel, and perhaps something more exotic), comparing them in a hands-on way.
So that everything fits into one post (in several parts) and can be covered in a single sitting, the problem was deliberately chosen to be trivial. As a result, many aspects of the technologies mentioned, and the main difficulties that arise when actually programming modern massively parallel systems, will remain behind the scenes. Also, please do not treat the examples as a "GPU vs. CPU" benchmark or anything of the sort: the only goal is to show how it all works. The post is aimed at people who are not very familiar with parallel programming. There will be a lot of code under the cut. Specifically, we will compute the number Pi by numerical integration.

Source codes are also available at github.com/undertherain/pi (more will be added as the following parts of the post are written).
Sequential version
Let's first write a sequential version, i.e. one that uses a single core of an ordinary CPU. We will take the simplest of the numerical integration methods, the method of rectangles, and code it in C (in general we will use a C/C++ mix, for several reasons).
Let's declare the number of rectangles cntSteps into which we will divide the area under the integral, and compute the base of each rectangle:
step = 1./static_cast<double>(cntSteps);
and then the entire area, evaluating the function at the midpoint of each rectangle:
for (unsigned long i = 0; i < cntSteps; i++) {
    x = (i + .5) * step;
    sum = sum + 4.0 / (1. + x * x);
}
However, the multiplication by the base step is better moved outside the loop, and that's basically all.
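For reference, the calculus identity being exploited here (a standard fact; below, N stands for cntSteps and Δx for step):

```latex
\pi \;=\; \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+\bigl((i+\tfrac12)\,\Delta x\bigr)^2}\,\Delta x,
\qquad \Delta x = \frac{1}{N}
```

Factoring the common Δx out of the sum is exactly the "multiply by step after the loop" optimization.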
Here is the complete program code:
#include <iostream>
#include <iomanip>
#include <sys/times.h>
#include <unistd.h>
#include <cmath>

int main(int argc, char** argv)
{
    const unsigned long cntSteps = 500000000; /* # of rectangles */
    double step;
    const double PI25DT = 3.141592653589793238462643; // reference Pi value
    double pi = 0;
    double sum = 0.0;
    double x;
    std::cout << "calculating pi on CPU single-threaded\n";
    // start timing
    clock_t clockStart, clockStop;
    tms tmsStart, tmsStop;
    clockStart = times(&tmsStart);
    // integrate by the method of rectangles
    step = 1. / static_cast<double>(cntSteps);
    for (unsigned long i = 0; i < cntSteps; i++) {
        x = (i + .5) * step;
        sum = sum + 4.0 / (1. + x * x);
    }
    pi = sum * step;
    // stop timing
    clockStop = times(&tmsStop);
    std::cout << "The value of PI is " << pi << " Error is " << fabs(pi - PI25DT) << "\n";
    std::cout << "The time to calculate PI was ";
    double secs = (clockStop - clockStart) / static_cast<double>(sysconf(_SC_CLK_TCK));
    std::cout << secs << " seconds\n";
    return 0;
}
A copy is at http://dumpz.org/195276/.
Native threads
Perhaps the most accessible parallel architecture for ordinary users is a regular multi-core processor (it seems hard nowadays to find one with a single core), or several processors on one motherboard. In principle, even a single core can execute several threads "in parallel"; this mode is called pseudo-parallel, or concurrent. The core switches between processes (a process is, roughly speaking, a program in memory), giving each a slice of time. Such a mode of execution can already yield a performance gain by hiding memory latency, if not on ordinary "home" processors, then on specialized multi-threaded ones, but more on that later. In our case, an excess of threads would only slow things down due to the overhead of switching between them.
The most "historical" way to use several cores of a processor at once is the operating system's thread mechanism, which existed long before truly parallel processors, if only for the sake of structuring programs more conveniently. From a programmer's point of view, what matters is that parallel threads executed on different cores or processors see the same address space, so there is no need to transfer data between threads explicitly. But if different threads suddenly write to and read from the same variable, you have to take care of synchronization.
Okay, let's get closer to the code: from the point of view of the C language, a thread is an ordinary function (or class method) satisfying a certain prototype. Let's call it static void *worker(void *ptrArgs); as its argument it receives a pointer to a structure, in which you can put as many fields as needed to pass arguments to the thread. In our case we tell the thread function over which range to integrate. The body of the thread function contains the already familiar loop, iterating over the range passed in via the parameters:
for (long long i = args->left; i < args->right; i++) {
    x = (i + .5) * step;
    sum = sum + 4.0 / (1. + x * x);
}
We will compute the integration interval for each thread in advance, based on its sequence number. If one of the threads finishes its part earlier, the corresponding core will sit idle, i.e. we will lose performance. Ideally, the interval would be divided into many small chunks handed out to the threads as they finish their work. But for now let's leave it as it is.
arrArgsThread[idThread].left = idThread * cntStepsPerThread;
arrArgsThread[idThread].right = (idThread + 1) * cntStepsPerThread;
For execution in a separate thread, the function is launched via the pthread_create call in the POSIX case (on Linux, for example); on Windows there is a similar call in the Win32 API that looks a little different but works in much the same way.
The result of each thread will be added to the shared variable: pi += sum * step; (remember that we are in a shared address space).
So that memory is not corrupted when two threads modify one cell at the same time, we need to somehow guarantee that only one thread at a time gets access to the variable pi, i.e. create a so-called "critical section". For this we can use a special operating system mechanism called a mutex (from the words mutual exclusion): if one thread has locked a mutex, another thread trying to acquire it will wait until the first thread releases it.
pthread_mutex_lock(&mutexReduction);
pi += sum * step;
pthread_mutex_unlock(&mutexReduction);
Altogether it comes out something like this:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <pthread.h>
#include <sys/times.h>
#include <unistd.h>

#define cntThreads 4

pthread_mutex_t mutexReduction = PTHREAD_MUTEX_INITIALIZER;
double pi = 0.;

// arguments passed to each thread
struct ArgsThread {
    long long left, right;
    double step;
};

// thread function: integrates over [left, right)
static void *worker(void *ptrArgs)
{
    ArgsThread *args = reinterpret_cast<ArgsThread *>(ptrArgs);
    double x;
    double sum = 0.;
    double step = args->step;
    for (long long i = args->left; i < args->right; i++) {
        x = (i + .5) * step;
        sum = sum + 4.0 / (1. + x * x);
    }
    // critical section: only one thread at a time updates pi
    pthread_mutex_lock(&mutexReduction);
    pi += sum * step;
    pthread_mutex_unlock(&mutexReduction);
    return NULL;
}

int main(int argc, char **argv)
{
    const unsigned long num_steps = 500000000;
    const double PI25DT = 3.141592653589793238462643;
    pthread_t threads[cntThreads];
    ArgsThread arrArgsThread[cntThreads];
    std::cout << "POSIX threads. number of threads = " << cntThreads << std::endl;
    clock_t clockStart, clockStop;
    tms tmsStart, tmsStop;
    clockStart = times(&tmsStart);
    double step = 1. / (double)num_steps;
    long long cntStepsPerThread = num_steps / cntThreads;
    // create the threads, each with its own range
    for (unsigned int idThread = 0; idThread < cntThreads; idThread++) {
        arrArgsThread[idThread].left = idThread * cntStepsPerThread;
        arrArgsThread[idThread].right = (idThread + 1) * cntStepsPerThread;
        arrArgsThread[idThread].step = step;
        if (pthread_create(&threads[idThread], NULL, worker, &arrArgsThread[idThread]) != 0) {
            return EXIT_FAILURE;
        }
    }
    // join: wait for all threads to finish
    for (unsigned int idThread = 0; idThread < cntThreads; idThread++) {
        if (pthread_join(threads[idThread], NULL) != 0) {
            return EXIT_FAILURE;
        }
    }
    clockStop = times(&tmsStop);
    std::cout << "The value of PI is " << pi << " Error is " << fabs(pi - PI25DT) << std::endl;
    std::cout << "The time to calculate PI was ";
    double secs = (clockStop - clockStart) / static_cast<double>(sysconf(_SC_CLK_TCK));
    std::cout << secs << " seconds\n" << std::endl;
    return 0;
}
A copy is at http://dumpz.org/195404/, in case my formatting displays incorrectly for someone.
In fact, in this particular example (we will not always be so lucky), we could have done without mutexes by storing each thread's result in a separate variable (an element of the array ArgsThread arrArgsThread[cntThreads];) and then, after waiting for all the threads to finish, summing up the results.
Here is the code without mutexes:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <pthread.h>
#include <sys/times.h>
#include <unistd.h>

#define cntThreads 4

// arguments passed to each thread, plus a slot for its partial result
struct ArgsThread {
    long long left, right;
    double step;
    double partialSum;
};

static void *worker(void *ptrArgs)
{
    ArgsThread *args = reinterpret_cast<ArgsThread *>(ptrArgs);
    double x;
    double sum = 0.;
    double step = args->step;
    for (long long i = args->left; i < args->right; i++) {
        x = (i + .5) * step;
        sum = sum + 4.0 / (1. + x * x);
    }
    args->partialSum = sum * step;  // each thread writes only its own slot
    return NULL;
}

int main(int argc, char **argv)
{
    const unsigned long num_steps = 500000000;
    const double PI25DT = 3.141592653589793238462643;
    pthread_t threads[cntThreads];
    ArgsThread arrArgsThread[cntThreads];
    std::cout << "POSIX threads. number of threads = " << cntThreads << std::endl;
    clock_t clockStart, clockStop;
    tms tmsStart, tmsStop;
    clockStart = times(&tmsStart);
    double step = 1. / (double)num_steps;
    long long cntStepsPerThread = num_steps / cntThreads;
    for (unsigned int idThread = 0; idThread < cntThreads; idThread++) {
        arrArgsThread[idThread].left = idThread * cntStepsPerThread;
        arrArgsThread[idThread].right = (idThread + 1) * cntStepsPerThread;
        arrArgsThread[idThread].step = step;
        if (pthread_create(&threads[idThread], NULL, worker, &arrArgsThread[idThread]) != 0) {
            return EXIT_FAILURE;
        }
    }
    // join the threads and reduce their partial sums
    double pi = 0.;
    for (unsigned int idThread = 0; idThread < cntThreads; idThread++) {
        if (pthread_join(threads[idThread], NULL) != 0) {
            return EXIT_FAILURE;
        }
        pi += arrArgsThread[idThread].partialSum;
    }
    clockStop = times(&tmsStop);
    std::cout << "The value of PI is " << pi << " Error is " << fabs(pi - PI25DT) << std::endl;
    std::cout << "The time to calculate PI was ";
    double secs = (clockStop - clockStart) / static_cast<double>(sysconf(_SC_CLK_TCK));
    std::cout << secs << " seconds\n" << std::endl;
    return 0;
}
And a copy is at http://dumpz.org/195415/.
As you can see, the code turned out rather cumbersome and non-portable. The latter is partly solved by boost::threads (but not everyone wants to pull in Boost), and in the new C++11 threads become part of the language itself (and it turned out very well), but most software still uses older C++. The problem of code bulkiness, however, remains.
OpenMP
OpenMP is a language extension (for C/C++/Fortran) that lets you do approximately the same thing we did with the operating system thread APIs, but much more simply and concisely, with the help of so-called pragmas. A pragma essentially tells the compiler "take this code and execute it in parallel", and the compiler does the rest.
To parallelize the for loop in our first, sequential example, it suffices to add one line:
#pragma omp parallel for private(x) reduction(+:sum)
for (long i = 0; i < numSteps; i++) {
    x = (i + .5) * step;
    sum = sum + 4.0 / (1. + x * x);
}
This pragma tells the compiler to parallelize the loop that follows, to make the variable x private (a separate copy per thread), and to perform a reduction of the variable sum by summation: each thread first accumulates into its own copy, and at the end the copies are added together. This is roughly what we did by hand in the previous, mutex-free example. OpenMP also provides a small API for service needs.
#include <iostream>
#include <iomanip>
#include <sys/times.h>
#include <unistd.h>
#include <cmath>
#include <omp.h>

int main(int argc, char** argv)
{
    const unsigned long numSteps = 500000000; /* default # of rectangles */
    double step;
    const double PI25DT = 3.141592653589793238462643;
    double pi = 0;
    double sum = 0.0;
    double x;
    // report how many threads OpenMP will actually use
    #pragma omp parallel
    {
        #pragma omp master
        {
            int cntThreads = omp_get_num_threads();
            std::cout << "OpenMP. number of threads = " << cntThreads << std::endl;
        }
    }
    clock_t clockStart, clockStop;
    tms tmsStart, tmsStop;
    step = 1. / static_cast<double>(numSteps);
    clockStart = times(&tmsStart);
    #pragma omp parallel for private(x) reduction(+:sum)
    for (long i = 0; i < numSteps; i++) {
        x = (i + .5) * step;
        sum = sum + 4.0 / (1. + x * x);
    }
    pi = sum * step;
    clockStop = times(&tmsStop);
    std::cout << "The value of PI is " << pi << " Error is " << fabs(pi - PI25DT) << std::endl;
    std::cout << "The time to calculate PI was ";
    double secs = (clockStop - clockStart) / static_cast<double>(sysconf(_SC_CLK_TCK));
    std::cout << secs << " seconds\n" << std::endl;
    return 0;
}
A copy is at http://dumpz.org/195550/.
Compile with the -fopenmp option in the case of g++; for another compiler, consult its manual.
Questions and comments are welcome; to be continued!