
Parallel computing, or a bit about rakes

"Beetles, beetles, beetles" - snorting, sentenced Pumba in a dream, and he dreamed of them!

Everyone knows that when writing multi-threaded applications one must pay special attention to possible races at the moments when variables shared between threads are modified. There is also a widespread opinion that such races are very rare and for that reason very hard to detect: they manifest themselves only in rare situations, lead to unpredictable behavior, and can even bring the system down. I agree with the first part of that opinion, but in this post I intend to argue with the second and show that the manifestation of these troubles can be intensified, which makes it far easier to find their causes and eliminate them. The main idea behind this intensification is to use parallel execution on multi-core systems.

A bit of theory. When we create a thread in a modern system, it is merely a wish that certain actions be performed; exactly when they will be executed is decided by the runtime system according to its own internal algorithms, which we cannot influence directly. An indirect influence is possible by setting priorities and scheduling intervals, but again these are only wishes that the system may take into account, or may ignore in any particular situation. One of the parameters affecting the order of execution is the number of hardware cores available to the system, on which different processes can run simultaneously.

With microcontrollers we have traditionally been used to dealing with a single core and, accordingly, with various methods of switching between processes. The possible troubles occur exactly at the switching moments, and since the switching frequency is determined by the minimum quantum of process execution time, a parameter we have very limited ability to influence, such events are extremely rare, and not everyone is lucky enough to observe their consequences during debugging. But the world of affordable computing systems is not limited to MCUs, and many PCs have as many cores as the salespeople in computer stores tell you about when you buy a motherboard. So how can we benefit from these cores? It turns out that many runtime systems really do run different threads on different cores simultaneously, without context switching, and the possible races then become far more intense (more frequent by orders of magnitude), so their influence on program execution can be observed with the naked eye.
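As a side note, you can ask the runtime how many hardware threads it actually sees. A minimal sketch of mine using the standard C++ facility (the returned value is only a hint and may be 0 when unknown):

#include <iostream>
#include <thread>

int main() {
    // hardware_concurrency() returns a hint about how many hardware
    // threads the implementation can run in parallel (0 if unknown)
    std::cout << std::thread::hardware_concurrency()
              << " hardware threads available" << std::endl;
}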

Consider a simple example: we start two threads, one of which increments the value of a variable a million times, while the second decrements the value of the same variable a million times. What will the value of the variable be after both threads have finished, if it was zero before they started? Of course, not zero, "otherwise Radio Yerevan would not have been asked about it." The correct answer is any number in the range from minus one million to plus one million, inclusive, unless we take special precautions (more on those later). To back this up, here is the program and its results for two runs.
#include <iostream>
#include <thread>

#define NUMBER 1000000

using namespace std;

static int counter = 0;      // shared variable

void f1(void) {
    for (int i = 0; i < NUMBER; i++) {
        counter++;           // ldi r1,#counter; ld r0,@r1; inc r0; st r0,@r1
    };
};

void f2(void) {
    for (int i = 0; i < NUMBER; i++) {
        counter--;
    };
};

int main() {
    thread t1(f1);
    thread t2(f2);           // start the threads
    t1.join();
    t2.join();               // wait for both to finish
    cout << counter << endl;
}

Results of two runs: 340865 and -557870

Somewhat unexpected for the unprepared user? Those who have dealt with threads on MCUs might expect a number within a dozen or so of zero, but hundreds of thousands? This is the beauty of intensifying indeterminate states: the problem rises to its full height. First of all, why do we not get zero as the result (of course we can get it, but the probability of that event is very small)? The point is that in most RISC architectures direct operations on data in RAM are impossible, so the increment of counter turns into the sequence of assembler instructions shown in the comment. Now consider an example with switching. Suppose counter holds 2; the first thread executes instructions 1 and 2, so 2 is now in register r0, and at this moment the first thread is interrupted. The second thread starts working: counter still holds 2, and the second thread subtracts 1 from it a thousand times, stepping through the values 1, 0, -1 and so on down to -998, executing its last store and leaving -998 in counter. Then the second thread is interrupted and the first thread is scheduled again; it executes instructions 3 and 4 and writes 3 into counter. As a result, the effect of a thousand subtractions has vanished. Of course, for this to happen the first thread must be interrupted after executing instruction 2 and before executing instruction 4, and that will not happen too often, because besides this fragment even our simple program contains instructions for organizing the loop; but the probability of hitting exactly the spot where the crash happens is by no means zero.
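To make that window visible in C++ terms, here is my own illustrative expansion of counter++ into the separate steps the pseudo-assembly comment describes (a sketch of the effective behavior, not what the compiler literally emits; it assumes the counter global from the listing above):

// counter++ on a load/store architecture, spelled out step by step
void increment_expanded() {
    int r0 = counter;   // instructions 1-2: load the address, then the value
    r0 = r0 + 1;        // instruction 3: increment the register copy
    counter = r0;       // instruction 4: store the copy back to memory
    // any updates made by another thread between the load and the store
    // are silently overwritten by this stale store
}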

If our MCU had an instruction that could increment or decrement a number directly in memory rather than in a register, this particular trouble would not have happened, but only under preemptive multitasking. If we are working in truly parallel mode, even the presence of such an instruction does not save us from a subtler but essentially identical problem related to the operation of the data bus. The point is that even an instruction that modifies the contents of a memory cell cannot be executed directly on the memory (there may be systems where it can, but I personally do not know of any): at the bus level there will still be a read from memory into the ALU, the actual modification, and a write of the result back to memory at the same address. So once again there is a window during which one thread can corrupt the results of another. It is possible to determine which of the two kinds of contention we are observing (we leave this as an exercise for the inquisitive reader), but it does not change the outcome in any way: we have obtained a result that is knowingly wrong, and moreover distorted at random.
The difference between execution on two cores and execution with switching is as follows. In the first case conflicts occur on literally every iteration (one could say that switching happens very often), but each one distorts the result by no more than 1, so we get essentially random numbers with a relatively uniform distribution. In the second case the distortion occurs far less often (only at switching moments, and even then not always), but then all the operations performed during the thread's next time slice are lost at once, so we get a value that changes in jumps, and the data cluster around discrete steps. Moreover, under certain conditions we may even get the correct result, which means we might not notice the problem at all during debugging. That is why I strongly recommend running programs on multiple cores in order to intensify the manifestation of such errors.
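To see the two distributions for yourself, here is a minimal test harness (my own sketch, not from the original text) that repeats the experiment and prints the final values. On a multi-core machine the numbers scatter widely; with time-slicing on a single core they cluster in jumps:

#include <iostream>
#include <thread>

#define NUMBER 1000000

static int counter;

static void inc() { for (int i = 0; i < NUMBER; i++) counter++; }
static void dec() { for (int i = 0; i < NUMBER; i++) counter--; }

int main() {
    for (int run = 0; run < 20; run++) {
        counter = 0;                   // reset before each experiment
        std::thread t1(inc), t2(dec);
        t1.join();
        t2.join();
        std::cout << counter << std::endl;
    }
}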
These rakes are widely known and lie in a well-lit place, which does not prevent people from stepping on them from time to time; I have merely shown how parallel execution turns their action from rare and stunning, as it is under preemptive multitasking, into frequent, no less destructive, but far more observable. Now that we have examined the rakes in detail and learned how to trigger them with the frequency needed for debugging, let us talk a little about the ways around them and about how long the detour is.
We do have the means to guarantee exactly the result we expected, namely zero, by modifying our program to take the concurrency of the computation into account. First of all, there are the objects known as mutexes, that is, flags that grant access to a shared resource, in our case the memory location called counter. Look at the code: everything is fine, clear and correct, with just one small drawback: the program execution time, including waiting for the threads to finish, has grown more than 20-fold. In other words, we walk around the rake along a rather wide arc.
#include <mutex>

using namespace std;

static mutex m;

void f1(void) {
    for (int i = 0; i < NUMBER; i++) {
        m.lock();
        counter++;
        m.unlock();
    };
};
It is clear why this happened: instead of a simple modification we additionally acquire the mutex (while the second thread blocks when it tries to acquire it), perform the operation, and release the mutex, all of which takes time. Nevertheless, the main advantage of this option, besides its correctness, is that it works everywhere the thread and mutex packages exist. Its main drawback is that we always pay for the extra operations, regardless of whether there is actual contention for the resource at the moment.
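A slightly more idiomatic variant of the same mutex protection (my sketch, assuming the counter and NUMBER from the first listing): std::lock_guard releases the mutex automatically when it goes out of scope, which also keeps the code correct if the loop body ever throws.

#include <mutex>

#define NUMBER 1000000

static std::mutex m;
static int counter = 0;

void f1(void) {
    for (int i = 0; i < NUMBER; i++) {
        std::lock_guard<std::mutex> guard(m);   // acquires the mutex here
        counter++;
    }                                           // releases it at end of scope
}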
There is a somewhat trickier but much shorter detour, based on the CompareAndSwap (CAS) operation.
using namespace std;

void f1(void) {
    int j;
    for (int i = 0; i < NUMBER; i++) {
        do {
            j = counter;
        } while (__sync_val_compare_and_swap(&counter, j, j + 1) != j);
    };
};
Here, too, everything is correct, just a little less obvious, but after reading the relevant materials on non-blocking operations everything falls into place; the execution time is half that of the previous variant, although still 10+ times slower than the simplest version. This is all quite expected, except that the materials on non-blocking operations promised us negligible overhead, and here the losses are still considerable. Nevertheless everything is consistent: under frequent contention for a resource (and in our case the contention is extreme) this method is not the preferred one, which we were honestly warned about. The drawback of this option is that the system must implement such an instruction in hardware, which is not always the case. Its advantage is that the operation is repeated only when there is real contention for the resource; otherwise the extra cost is minimal.
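For reference, the same retry loop can be written with the portable C++11 atomic interface instead of the GCC builtin. A sketch of mine: compare_exchange_weak may fail spuriously, and on failure it reloads j with the freshly observed value, hence the loop.

#include <atomic>

#define NUMBER 1000000

static std::atomic<int> counter{0};

void f1(void) {
    for (int i = 0; i < NUMBER; i++) {
        int j = counter.load();
        // retry until no other thread has changed counter between
        // our read and our write
        while (!counter.compare_exchange_weak(j, j + 1)) {
            // on failure, j now holds the current value of counter
        }
    };
};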
We can come even closer to the rake without stepping on it: the method of atomic operations. Look at the code:
void f1(void) {
    for (int i = 0; i < NUMBER; i++) {
        __sync_add_and_fetch(&counter, 1);
    };
};
Everything is absolutely correct, almost understandable at first glance, and after reading the materials on atomic operations completely clear; it is compact and fast enough: my execution time grew only 2+ times (your results may differ). The minus is that hardware atomic operations are a rare beast, just like CAS, and if you fall back to their software emulations you instantly roll down to mutex-level performance.
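The portable C++11 equivalent, for comparison (my sketch; with std::atomic even a plain counter++ compiles to an atomic read-modify-write):

#include <atomic>

#define NUMBER 1000000

static std::atomic<int> counter{0};

void f1(void) {
    for (int i = 0; i < NUMBER; i++) {
        counter.fetch_add(1);   // the same atomic increment as __sync_add_and_fetch
    };
};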
To summarize: if you have atomic operations, using them for elementary access to a shared resource gives the fastest and most compact code, and the form of parallelism is completely irrelevant, it will always work. The CAS instruction is a good alternative, although its overhead can be higher under heavy contention for a resource. And mutexes are the most expensive option, for when there is nothing else. By the way, a little reasoning about mutexes: how are they implemented at all? For the context-switching case everything is clear, we can forbid switching either by disabling interrupts or by disabling the scheduler (the latter is preferable, since it makes restoring the state easier), but for the multi-core case the answer is not so obvious. Most likely one cannot do without a long bus lock or without the CAS instruction. There is also a very interesting option in which an analogue of CAS is implemented through an XCHG (exchange register with memory) instruction, provided that instruction executes atomically. If anyone knows other options, please share information or links.
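To illustrate that last point, here is a toy spinlock built on an atomic exchange, in the spirit of the XCHG idea above (my sketch; real mutexes add backoff and fall back to the OS scheduler instead of spinning forever):

#include <atomic>

struct SpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        // atomically set the flag to true; if the previous value was
        // already true, someone else holds the lock, so keep spinning
        while (locked.exchange(true)) { }
    }

    void unlock() {
        locked.store(false);    // hand the lock back
    }
};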

Source: https://habr.com/ru/post/235445/

