
Speedup in parallel computing

The main purpose behind creating and developing the many kinds of parallel machines we discussed in the previous article is speed. Supercomputers and multiprocessor systems can and should do everything faster! Let's try to work out exactly how much faster.

It seems logical that if one processor does a job in n seconds, then four processors should do it in n/4 seconds. The "speedup factor" is defined as the ratio of the time a single processor spends on a job to the time a multiprocessor system spends on the same job:

S(p) = T_s / T_p
For the calculation it is important to use the best possible T_s, that is, the running time of the best available sequential algorithm. For example, if the best sequential algorithm takes 80 seconds and the parallel version takes 25 seconds, the speedup factor is 80 / 25 = 3.2.

Now the bad news: this speedup has a limit. It is called Amdahl's Law, and here is its essence. This is how a task looks on an ordinary single-processor system:

[Figure: execution on one processor - the total time T_s is split into a serial part fT_s and a parallelizable part (1 - f)T_s]
The total execution time T_s, already familiar to us, consists of the part that can be distributed across several processors (parallelized), (1 - f)T_s, and the part that cannot be parallelized (the serial part), fT_s.

What does the scheme of the same task look like on multiple processors?

[Figure: execution on p processors - the serial part fT_s runs first, then the parallelizable part (1 - f)T_s is divided among the p processors]
Now the total execution time T_p consists of the serial time plus the longest of the pieces we distributed across the processors (they all run at the same time, but we have to wait for the slowest one). For simplicity, assume they all take the same time.

The speedup factor is then:

S(p) = T_s / (fT_s + (1 - f)T_s / p) = 1 / (f + (1 - f) / p)
In other words, everything depends on how big a piece of the program we can parallelize (f is the fraction of serial code, expressed as a percentage). If f = 0%, that is, absolutely all the code is parallelized (which is practically impossible), then the more processors we use, the faster the task completes, and the relationship is linear: want it 10 times faster - use ten processors; want it a million times faster - buy a million processors. But leave just 5% of the code serial while parallelizing the other 95%, and the speedup factor caps at 20: as p grows, S(p) approaches 1/f = 1/0.05 = 20, and it cannot jump any higher. Even if we add another million processors to the first million, the program will still run at most 20 times faster. Here is how this sad fact looks on a graph, with curves for several values of f:

[Figure: speedup S(p) versus number of processors for several values of f; each curve flattens out at 1/f]
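To get a feel for these numbers, here is a small sketch in plain C that tabulates Amdahl's formula for a few values of f and p (the function name and the chosen values are mine, not from the original article):

#include <stdio.h>

/* Amdahl's law: speedup on p processors when a fraction f
   of the execution time is inherently serial. */
double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double fractions[] = { 0.0, 0.05, 0.25 };
    int procs[] = { 1, 2, 4, 16, 1024, 1000000 };

    for (int i = 0; i < 3; i++) {
        printf("f = %.0f%%\n", fractions[i] * 100.0);
        for (int j = 0; j < 6; j++)
            printf("  p = %7d  ->  S = %10.2f\n",
                   procs[j], amdahl_speedup(fractions[i], procs[j]));
    }
    return 0;
}

With f = 5% the printed speedups flatten out just below 20, exactly the 1/f ceiling described above.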
It turns out that the most important factor in the performance of parallel code is good design of the algorithm itself. Every detail can strongly affect scalability: if an algorithm makes decent use of a quad-core processor today, it may well fail to use more than 8 cores, and in a year the program will have to be written all over again - because it is now the number of cores, not the number of transistors, that grows every six months!

To write a good parallel algorithm, you need to understand the essence of the problem and think about how it can be divided into separate, (ideally) independent parts, so that all of them can execute in parallel on different processors. Let's look at a fairly sizeable example: multiplying a matrix A by a vector b. The result is written into a vector y of the same size as b. To get the first element of y, multiply the first row of the matrix by the vector b; for the second element of y, the second row by b; and so on. It turns out that each element of y can be computed independently of the others - that is, in parallel.

[Figure: y = A * b - each element y_i is the dot product of row i of A with the vector b]
Each such task is the same size - all rows of A have the same length (for now, let's not touch details such as the sparsity of the matrix; it may contain many zeros, but let's assume this does not strongly affect execution time). There are no dependencies between the tasks, and they all share the same b.
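A sequential version in plain C makes this independence explicit (the function name and the flat row-major layout are my own choices for the sketch):

/* y = A * b for an n x n matrix A stored row by row.
   Each iteration of the outer loop computes one element of y,
   reads only row i of A and the shared vector b, and writes
   nothing that any other iteration needs - so all n iterations
   could run in parallel on different processors. */
void mat_vec(int n, const double *A, const double *b, double *y)
{
    for (int i = 0; i < n; i++) {      /* one independent task per row */
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * b[j];
        y[i] = sum;
    }
}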

Another example is a database query:

MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE");

The query asks for all the green or white 2001 Civics in this database:

[Figure: sample table of car records with model, year, and color columns]
This task can be decomposed into three parts: one processor builds the table of all Civic 2001 records, another builds the table of all white and green cars (these two queries can run in parallel), and then the join of the two tables gives the answer.

[Figure: task graph of the decomposed query - two selections running in parallel, followed by a join]
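In code, the same decomposition might look like the following plain-C sketch (the Car record and all the names here are made up for illustration; the article shows no code at this point). The two selection passes are fully independent - they read the same table but write different masks - and only the final join needs both results:

#include <string.h>

struct Car {
    char model[16];
    int  year;
    char color[16];
};

/* Selection 1: mark all Civic 2001 records (can run on one processor). */
void select_model_year(const struct Car *db, int n, int *mask)
{
    for (int i = 0; i < n; i++)
        mask[i] = (strcmp(db[i].model, "CIVIC") == 0 && db[i].year == 2001);
}

/* Selection 2: mark all green or white records (can run on another
   processor at the same time, since it writes a different mask). */
void select_color(const struct Car *db, int n, int *mask)
{
    for (int i = 0; i < n; i++)
        mask[i] = (strcmp(db[i].color, "GREEN") == 0 ||
                   strcmp(db[i].color, "WHITE") == 0);
}

/* Join: a record is in the answer only if both selections marked it. */
void join_masks(const int *m1, const int *m2, int n, int *result)
{
    for (int i = 0; i < n; i++)
        result[i] = m1[i] && m2[i];
}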
You can also change the structure of the query, which affects how it can be parallelized:

[Figure: an alternative query plan with a different parallel structure]
Dividing a task into parts for subsequent distribution across different processors is called decomposition. Matrix-vector multiplication is an example of decomposing by the result of the problem: the result (the vector y) was divided into several independent parts, and each was computed separately from the others. The same can be done with matrix multiplication:

[Figure: matrix multiplication decomposed by the elements of the result matrix]
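Here is the corresponding plain-C sketch for matrix multiplication (again with names of my own choosing): every element of the result is an independent dot product, so the (i, j) iterations can be split among processors by element, by row, or by block:

/* C = A * B for n x n matrices stored row by row.
   Each (i, j) pair computes one element of C and depends only on
   row i of A and column j of B, so every element of the result
   can be computed in parallel with all the others. */
void mat_mul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}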
Our first parallel program



To write our first parallel program, we will use Cilk. It is a programming language that is essentially C with convenient tools for spawning and synchronizing tasks. Cilk was developed at MIT in 1994 and was free and open source, but by 2006 it had been commercialized and gained C++ support, and a year ago it was bought outright by Intel - which, of course, finds it useful to have a good programming language for multicore systems, since it builds such systems itself. I'm not sure how to get the compiler for free; our university has an academic version for students (a Cilk developer is a good friend of our professor), so forgive me if you can't find it. But I think it's worth looking for, because Cilk is wonderfully simple and convenient - no comparison with Pthreads. All we need to know to start programming are three keywords: cilk, spawn, and sync.

It is best to start with an example, so here is everyone's favorite recursive task - the Fibonacci numbers - in C:

int fib(int n)
{
    if (n < 2) return n;
    else {
        int x, y;
        x = fib(n - 1);
        y = fib(n - 2);
        return (x + y);
    }
}


And here is the same program in Cilk:

cilk int fib(int n)
{
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
}


Notice the difference? ;)

The cilk keyword marks a function as a Cilk procedure. The most important words are spawn and sync. spawn is placed before a call to a function that we want to run on another core, while sync waits for all the functions spawned so far to finish. That is, the line

x = spawn fib(n - 1);

runs on another core, and the line y = spawn fib(n - 2) starts immediately after it. Before returning the result (return (x + y)) we have to wait for both calls to finish, otherwise x and y will not hold valid values. That is exactly what sync does - as the name implies.

Easy, isn't it?

A program like this will give plenty of cores something to do! Here is the call graph of this algorithm for n = 4:

[Figure: call graph of fib(4); the colors of the nodes match the highlighting colors in the code]
The colors highlighting the code correspond to the colors of the graph nodes. Each level of the graph executes simultaneously: the first call, for n = 4, spawns two calls - for 3 and 2; those in turn spawn calls for 2 and 1, and for 1 and 0, respectively.



Here is the source code for this program:
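Only fib itself appears above, so what follows is a reconstruction - a sketch in classic MIT Cilk-5 syntax, where main is itself a cilk procedure (spawn may only be used inside one) and the argument handling is my addition:

#include <stdio.h>
#include <stdlib.h>

cilk int fib(int n)
{
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
}

cilk int main(int argc, char *argv[])
{
    int n = (argc > 1) ? atoi(argv[1]) : 10;   /* default n is my choice */
    int result;

    result = spawn fib(n);
    sync;
    printf("fib(%d) = %d\n", n, result);
    return 0;
}

If memory serves, MIT Cilk programs are compiled with cilkc, and the number of worker processors is selected at run time with the --nproc option.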

In the next article we will talk about how the scalability of parallel algorithms is measured, what kinds of scheduling exist and which one is used in Cilk, and also touch on sorting and parallel dynamic programming.

Source: https://habr.com/ru/post/127007/

