It so happened that I recently had to dive into the study of parallel computing, and MPI in particular. This area looks quite promising today, so I would like to show Habr readers the basics of the process.
Basic principles and example
As an example, we will use the calculation of the number e (the exponent). One way to obtain it is the Taylor series:
e^x = Σ (x^n / n!), where the summation runs over n from 0 to infinity.
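For example, for x = 1, summing the terms for n = 0 through 4 already comes close to the true value: 1/0! + 1/1! + 1/2! + 1/3! + 1/4! = 1 + 1 + 0.5 + 0.1667 + 0.0417 ≈ 2.7083, while e ≈ 2.71828.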
This formula is easy to parallelize: the desired number is simply a sum of terms, so each processor can compute its own subset of those terms.
The number of terms calculated on each individual processor depends both on the length of the interval n and on the number of processors k participating in the calculation. For example, if the interval length is n = 4 and five processors take part in the calculations (k = 5), the first through fourth processors each receive one term, while the fifth is not used. If n = 10 and k = 5, each processor gets two terms to calculate. In code, this distribution comes down to the simple loop sketched below.
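Here is a minimal sketch of that loop; myid is the rank of the current processor and numprocs is the number of processors, the same names as in the full listing later on:

for (int i = myid; i <= n; i += numprocs)
{
    // the processor with rank myid handles the terms myid, myid + numprocs, myid + 2*numprocs, ...
    // at each iteration the term 1 / i! is calculated and added to the local partial sum
}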
First, the first processor uses the broadcast function MPI_Bcast to send the value of the user-defined variable n to all the others. In general, the MPI_Bcast function has the following format:
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm), where buffer is the address of the buffer holding the elements, count is the number of elements, datatype is the corresponding MPI data type, root is the rank of the main processor that performs the broadcast, and comm is the name of the communicator.
In my case, as already mentioned, the role of the main processor is played by the first processor, the one with rank 0.
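In our program this boils down to a single call, where every parameter of the signature above gets a concrete value (the same call appears in the full listing below):

// buffer = &n, count = 1, datatype = MPI_INT, root = 0, communicator = MPI_COMM_WORLD
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
// after this call, every process in MPI_COMM_WORLD holds the same value of n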
After the number n has been sent successfully, each processor starts calculating its terms. To do this, at every step of the loop, the number i, which initially equals the processor's rank, is incremented by the number of processors participating in the calculation. As soon as i exceeds the user-specified number n, the loop on that processor stops.
While the loop runs, the terms are accumulated in a separate variable, and once it finishes, the resulting sum is sent to the main processor. For this, the reduction function MPI_Reduce is used. In general, it looks like this:
int MPI_Reduce(void *buf, void *result, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
It combines the elements from the input buffer of every process in the group using the op operation and returns the combined value in the output buffer of the process with rank root. The result of such an operation is a single value, which is why the function is called a reduction.
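In our program the reduction is performed by the following call (also present in the full listing):

// each process passes its partial sum drobSum; MPI_SUM adds them together element-wise,
// and only the process with rank 0 receives the total in Result
MPI_Reduce(&drobSum, &Result, 1, MPI_LONG_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);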
After the program has run on all processors, the first processor receives the total sum of the terms, which is the exponent value we are looking for.
It should be noted that both the parallel and the sequential methods of calculating the exponent use a recursive function to find the factorial. While deciding how to parallelize the task, I considered the option of distributing the factorial calculation across processors as well, but in the end rejected this option as irrational.
The primary task is, after all, to find the value of the exponent, and if the processors start calculating every factorial of every term separately from one another, this can produce exactly the opposite effect, namely a significant loss in performance and calculation speed.
The reason is that this would put a very heavy load on the communication environment, which is already a weak link in parallel computing systems. If the factorial is calculated privately on each processor, the load on the communication lines is minimal. This is a good example of how the parallelization of a task must sometimes have its limits.
Code execution algorithm
1. The value of the number n is passed to the program from the visual shell and then sent to all processors using the broadcast function.
2. When the first, main processor is initialized, a timer is started.
3. Each processor runs a loop whose increment equals the number of processors in the system. On each iteration of the loop a term is calculated, and the sum of such terms is accumulated in the variable drobSum.
4. When the loop completes, each processor adds its drobSum value into the Result variable using the reduction function MPI_Reduce.
5. After the calculations on all processors have finished, the first, main processor stops the timer and writes the resulting Result variable to the output stream.
6. The time measured by our timer is also written to the output stream, in milliseconds.
Code listing
The program is written in C++; we will assume that the arguments for execution are passed in from the external shell. The code looks like this:
#include "mpi.h"
#include <iostream>
#include <windows.h>
using namespace std;
double Fact( int n)
{
if (n==0)
return 1;
else
return n*Fact(n-1);
}
int main( int argc, char *argv[])
{
SetConsoleOutputCP(1251);
int n;
int myid;
int numprocs;
int i;
int rc;
long double drob,drobSum=0,Result, sum;
double startwtime = 0.0;
double endwtime;
n = atoi(argv[1]);
if (rc= MPI_Init(&argc, &argv))
{
cout << " , " << endl;
MPI_Abort(MPI_COMM_WORLD, rc);
}
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
if (myid == 0)
{
startwtime = MPI_Wtime();
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
for (i = myid; i <= n; i += numprocs)
{
drob = 1/Fact(i);
drobSum += drob;
}
MPI_Reduce(&drobSum, &Result, 1, MPI_LONG_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
cout.precision(20);
if (myid == 0)
{
cout << Result << endl;
endwtime = MPI_Wtime();
cout << (endwtime-startwtime)*1000 << endl;
}
MPI_Finalize();
return 0;
}
Conclusion
Thus, we have obtained a simple program that computes the exponent using several processors at once. The bottleneck is probably the storage of the result itself, because as the number of digits grows the value simply cannot be held in standard types, and this point deserves further study. A fairly rational solution might be to write the result to a file, although, given the purely educational purpose of this example, there is no great need to dwell on this.