Parallel code without dependencies
Introduction
In the first part of this article we discuss approaches to parallel processing of loops in the fortunate cases where there are no dependencies between individual iterations, so they can be executed correctly in parallel.
In the second part we will look at the mechanisms that appeared in .NET 4.0 for controlling such parallelization and examine the subtleties of how these mechanisms work.
Consider a loop that processes data:
for (int i = 0; i < upperBound; i++) { /* ... */ }
If the logic of the loop body is such that the result computed on one iteration does not depend on the result of any other iteration, the loop is "ideally parallel" (also known as embarrassingly parallel): all of its iterations can be executed in parallel, given enough processor cores.
The same definition applies to an analogous foreach loop; it also implies that the order in which the iterations are computed does not matter.
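For instance, a loop like the following (the data and the computation here are purely illustrative) has no dependencies between iterations:

double[] input = { 1.0, 4.0, 9.0, 16.0 };
double[] results = new double[input.Length];

for (int i = 0; i < input.Length; i++)
{
    // each iteration reads only input[i] and writes only results[i],
    // so the iterations can run in any order, including in parallel
    results[i] = Math.Sqrt(input[i]) * 2.0;
}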
Approaches to implementing a parallel loop
Let us try to implement a method for running the loop in parallel ourselves and see what difficulties arise.
We are going to write a method with the following signature:
public static void ParallelFor(int from, int to, Action<int> body);
It must execute body for every value in the range from from (inclusive) to to (exclusive), distributing the work across several threads to achieve maximum performance. How far can it be parallelized?
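Assuming we manage to write such a method, a call to it might look like this (the array and the per-index computation are illustrative):

int[] data = new int[1000];

// hypothetical usage of the ParallelFor we are about to write:
// each index is processed independently of the others
ParallelFor(0, data.Length, i =>
{
    data[i] = i * i;
});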
Determining the degree of parallelism
A sensible starting point is to use as many threads as there are processor cores on the machine. With this choice each core is fully loaded; with more threads we pay extra overhead for switching between them, and with fewer some of the cores sit idle.
This approach is basically correct in many cases, although it rests on an idealized model of computation in which the threads do nothing but compute on the CPU. Sometimes, however, it is useful to have more threads: if, for example, the threads spend half their time waiting for resources, the processor is idle during that time, and increasing the number of worker threads can improve performance.
So it is worth starting with the simple "one thread per core" option, while being prepared to consider other strategies.
To get the number of available cores we can use the System.Environment.ProcessorCount property. Note that it takes Hyper-Threading into account: when it is enabled, the "logical", already doubled number of cores is returned.
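A quick way to see the value on a given machine:

// on a 4-core machine with Hyper-Threading enabled this typically prints 8
int numProcs = Environment.ProcessorCount;
Console.WriteLine("Logical processors: " + numProcs);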
In view of the above, here is a logical (albeit somewhat naive) implementation:
public static void ParallelFor(int from, int to, Action<int> body)
{
    // determine the number of threads and the size of the data block for each of them
    int size = to - from;
    int numProcs = Environment.ProcessorCount;
    int range = size / numProcs;

    // partition the data, start all the threads and wait for completion
    var threads = new List<Thread>(numProcs);
    for (int p = 0; p < numProcs; p++)
    {
        int start = p * range + from;
        int end = (p == numProcs - 1) ? to : start + range;
        threads.Add(new Thread(() =>
        {
            for (int i = start; i < end; i++) body(i);
        }));
    }
    foreach (var thread in threads) thread.Start();
    foreach (var thread in threads) thread.Join();
}
An important disadvantage of this version is that new threads are created and destroyed on every call, which is a rather expensive operation (in particular, each thread reserves 1 MB of stack memory even if no function is currently running on it).
Another problem also stems from creating new threads. Suppose the code passed in body itself contains a ParallelFor call. Then each of the numProcs outer threads will spawn numProcs more, so we end up with roughly numProcs squared threads instead of numProcs, and the cost of switching between them may become too high ("excessive multithreading", or oversubscription).
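A sketch of how such nesting can arise (the matrix-style workload here is purely illustrative):

int rows = 100, cols = 100;
var results = new double[rows, cols];

// the outer call creates numProcs threads, and each of them calls
// ParallelFor again, so roughly numProcs * numProcs threads exist at once
ParallelFor(0, rows, row =>
{
    ParallelFor(0, cols, col =>
    {
        results[row, col] = row * col;
    });
});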
Static distribution of iterations
Therefore, instead of creating threads manually, we would prefer to use the thread pool:
public static void ParallelFor(int from, int to, Action<int> body)
{
    int size = to - from;
    int numProcs = Environment.ProcessorCount;
    int range = size / numProcs;
    int remaining = numProcs;

    // synchronization object used to detect completion
    using (ManualResetEvent mre = new ManualResetEvent(false))
    {
        // create all the tasks
        for (int p = 0; p < numProcs; p++)
        {
            int start = p * range + from;
            int end = (p == numProcs - 1) ? to : start + range;
            ThreadPool.QueueUserWorkItem(delegate
            {
                for (int i = start; i < end; i++)
                    body(i);

                // check whether this was the last task to complete
                if (Interlocked.Decrement(ref remaining) == 0)
                    mre.Set();
            });
        }
        // wait for all tasks to complete
        mre.WaitOne();
    }
}
This solution no longer suffers from excessive multithreading or from the cost of creating threads. What else could prevent the loop from completing as fast as possible?
If all iterations are short and roughly equal in cost, the approach above is close to optimal.
But suppose some iterations finish quickly while others take many times longer. Then the pool thread that was unlucky enough to receive the "long" iterations will run several times longer than the rest. Since the running time of the whole parallel operation is determined by its slowest component, the other threads will end up idle, having finished "their" work, while the one unlucky thread delays everyone.
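To make the problem concrete, here is a deliberately skewed workload (purely illustrative) where the cost of an iteration grows with its index:

double[] sums = new double[1000];

// the cost of iteration i grows roughly linearly with i, so under static
// partitioning the thread that receives the last block of indices
// does far more work than the thread that receives the first block
ParallelFor(0, sums.Length, i =>
{
    double sum = 0;
    for (int k = 0; k < i * 1000; k++)
        sum += Math.Sqrt(k);
    sums[i] = sum;
});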
To guarantee the minimum time in such a situation, we need
- either to know the nature of the iterations in advance, so that the "long" ones can be distributed evenly across the threads,
- or to change the general approach to dividing the work between threads.
Dynamic distribution of iterations
From static distribution we can move to dynamic: each thread is given a "portion" of work, and when it finishes it requests the next portion, as long as unassigned work still remains.
This approach can be illustrated with this code:
public static void ParallelFor(int from, int to, Action<int> body)
{
    int numProcs = Environment.ProcessorCount;
    // number of work items remaining
    int remainingWorkItems = numProcs;
    int nextIteration = from;

    using (ManualResetEvent mre = new ManualResetEvent(false))
    {
        // create the tasks
        for (int p = 0; p < numProcs; p++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                int index;
                // grab one element at a time for execution
                while ((index = Interlocked.Increment(ref nextIteration) - 1) < to)
                {
                    body(index);
                }
                if (Interlocked.Decrement(ref remainingWorkItems) == 0)
                    mre.Set();
            });
        }
        // wait for all tasks to complete
        mre.WaitOne();
    }
}
This code works well for long iterations with unpredictable running times, but for fast iterations its synchronization overhead is too high.
Balanced approach
So, depending on the nature of the loop being parallelized, either the static or the dynamic strategy for sharing work between threads can win. It is natural to assume that some balanced strategy will work well for the intermediate cases.
Here is one variant of such code. The idea is to make the "portion" somewhat larger than a single element, but smaller than in the static distribution.
public static void ParallelFor(int from, int to, Action<int> body)
{
    int numProcs = Environment.ProcessorCount;
    int remainingWorkItems = numProcs;
    int nextIteration = from;
    // size of one portion of data
    const int batchSize = 3;

    using (ManualResetEvent mre = new ManualResetEvent(false))
    {
        for (int p = 0; p < numProcs; p++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                int index;
                // grab the next batch of iterations
                while ((index = Interlocked.Add(ref nextIteration, batchSize) - batchSize) < to)
                {
                    int end = index + batchSize;
                    if (end >= to)
                        end = to;
                    for (int i = index; i < end; i++)
                        body(i);
                }
                if (Interlocked.Decrement(ref remainingWorkItems) == 0)
                    mre.Set();
            });
        }
        mre.WaitOne();
    }
}
Here the portion size is set by a constant, and by changing it we choose the point we need between static and dynamic distribution of the work.
Conclusion
In any case, the optimal settings are determined by the nature of the computation. Still, a compromise based on dynamic distribution of reasonably large "portions" of work is acceptable for most situations. This is essentially how the library methods for parallel loop execution in .NET 4 are implemented; they are discussed in the second part of the article.
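As a small preview of the second part: with the library method the loop from the beginning of the article is written roughly like this (Parallel.For lives in the System.Threading.Tasks namespace; upperBound and the loop body are as in the first example above):

// the runtime itself chooses how to partition the iterations between threads
Parallel.For(0, upperBound, i =>
{
    // ... the body of the original loop for index i ...
});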
P.S. This article was inspired by the book "Patterns of Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4 and Visual C#" and can be considered a free retelling of part of it.
UPD: The second part, covering the practice, is here: habrahabr.ru/blogs/net/104103