Hello, dear readers.
I invite everyone interested to discuss some of the main ideas of sequential and parallel programming in a C++ extension built around procedures/functions with re-entrant scheduling (planned re-enterable procedures/functions, PPPV/FPPV). In the simplest case, such a procedure or function has a static or dynamic execution plan.
A plan is, generally speaking, a deque of structures, each of which has fields matching, in type and name, the parameters of the corresponding procedure/function. A plan can be replenished from the outside (before the procedure/function is called, using a few simple tricks) or, which is the main approach, from the inside, with the help of the special calls plan_first(...) (insert at the beginning of the plan) and plan_last(...) (insert at the end of the plan).
A PPPV/FPPV is executed sequentially or in parallel, according to its plan. In sequential mode, it is called anew for each element of the plan (by default, elements are retrieved from the beginning of the plan) and runs each one to completion (until the natural end of the PPPV/FPPV body or until a return). On each call, the parameters of the PPPV/FPPV are filled with the data from the identically named fields of the current plan element. As already mentioned, while any stage is executing, new stages can be inserted into the plan, and the PPPV/FPPV will process them later.
To support parallel mode (in the simplest case), special barrier markers can be inserted between the stages, dividing the plan into groups. The plan_group_parallelize directive enables parallel execution of the current group (the one at the beginning of the plan): the group is treated as a task pool from which processors/cores pick up stages to execute. The plan_group_vectorize directive sends the current group to a vector device, such as a multi-core video card (this may require some extra care in the program's design; for example, you may need to mark explicitly which blocks of the program are intended only for the vector device, which only for the CPU, and which for both).
Even this basic approach provides, at a minimum:
- another way of programming the many tasks that use a stack, a deque, a queue, and sometimes even an array;
- another approach to programming parallel processing for SMP systems and vector devices (multi-core graphics cards).
To keep things easy to follow, I will give a couple of examples right away.
Parallel summation of the elements of a tree. The reduction parameter is SumResult.

    reenterable[ARR_SIZE] TreeSum(TreeNode * Cur, reduction(+) DataItem & SumResult)
    {
      if (Cur==Root) plan_group_parallelize;
      if (Cur->Left)  plan_last(Cur->Left, SumResult);
      if (Cur->Right) plan_last(Cur->Right, SumResult);
      SumResult = Cur->Data;
    }
Lee wave algorithm: finding a path in a maze.

    const int NL = 10;
    const unsigned char W = 0xFF;
    unsigned char Labirint[NL][NL] = {
      {W,W,W,W,W,W,W,W,W,W},
      {W,0,0,0,0,0,0,0,0,W},
      {W,0,W,W,0,0,W,0,0,W},
      {W,0,0,W,0,0,W,0,0,W},
      {W,0,0,W,W,0,W,0,0,W},
      {W,0,0,0,W,W,W,0,0,W},
      {W,0,0,0,0,0,0,0,0,W},
      {W,0,0,0,0,0,0,0,0,W},
      {W,0,0,0,0,0,0,0,0,W},
      {W,W,W,W,W,W,W,W,W,W},
    };

    signed char OffsX[4] = {-1,0,0,+1};
    signed char OffsY[4] = {0,-1,+1,0};

    const char FirstX = 8;
    const char FirstY = 8;
    const char LastX = 5;
    const char LastY = 4;

    chain[NL*NL] FindLi(unsigned char X, unsigned char Y, int Num)
      throw(unsigned char X, unsigned char Y)
    {
      char Found = 0;
      for (int i = 0; !Found && i < 4; i++) {
        unsigned char X1 = X + OffsX[i];
        unsigned char Y1 = Y + OffsY[i];
        if (X1>=0 && X1<NL && Y1>=0 && Y1<NL && Labirint[Y1][X1]==0) {
          Labirint[Y1][X1] = Num;
          if (X1==LastX && Y1==LastY) Found = 1;
          else plan_last(X1, Y1, Num+1);
        }
      }
      if (Found) {
        clear_plan;
        X = LastX; Y = LastY;
        throw_last(X, Y);
        while (X!=FirstX || Y!=FirstY) {
          char PointFound = 0;
          for (int i = 0; !PointFound && i < 4; i++) {
            unsigned char X1 = X + OffsX[i];
            unsigned char Y1 = Y + OffsY[i];
            if (X1>=0 && X1<NL && Y1>=0 && Y1<NL &&
                Labirint[Y1][X1] && Labirint[Y1][X1]<Labirint[Y][X]) {
              PointFound = 1;
              throw_first(X1, Y1);
              X = X1; Y = Y1;
            }
          }
        }
      } else if (plan_empty)
        cout<<"NO PATH\n";
    }

    chain[NL*2] OutLi(unsigned char X, unsigned char Y)
    {
      cout<<"("<<(int)Y<<","<<(int)X<<") ";
    }

    int main()
    {
      cout<<"Find the path in the simple labirint (Li algorithm) :\n";
      Labirint[FirstY][FirstX] = 1;
      plan_chain(0, FindLi(FirstX,FirstY,2), OutLi(0,0));
      cout<<"\n";
      return 0;
    }
And one more abstract example, this time of working with a multi-core video card, which I will give without explanation. Perhaps someone will find it interesting to work out how it does what it does.
    #pragma plan vectorized
    #include <iostream>
    using namespace std;

    #pragma plan common begin
    #define N 5
    #define threads 100
    #pragma plan common end

    #pragma plan gpu begin
    #define addition 0.01
    #pragma plan gpu end

    float MUL = 3.14;
    float * _OUT = NULL;

    reenterable void proc(bool init, int k, _global(1) float * mul,
                          _global(threads) int * sizes, int n,
                          _local(__planned__.n) float * out)
    {
      int i;
      if (init) {
        for (i = 0; i < threads; i++) {
          plan_last(false, i, mul, sizes, sizes[i], out);
          out += sizes[i];
        }
        plan_group_vectorize(NULL);
      } else
        for (i = 0; i < n; i++) {
          *out = k*(*mul);
    #ifdef __GPU__
          *out += addition;
    #endif
          out++;
        }
    }

    int main()
    {
      int * sizes = new int[threads];
      int NN = 0;
      for (int i = 0; i < threads; i++) {
        sizes[i] = 1 + i % 2;
        NN += sizes[i];
      }
      _OUT = new float[NN];
      cout<<"MAX group size = "<<vector_max_size(NULL)<<endl;
      proc(true, 0, &MUL, sizes, 0, _OUT);
      for (int i = 0; i < NN; i++)
        cout<<_OUT[i]<<" ";
      cout<<endl;
      delete[] _OUT;
      delete[] sizes;
      return 0;
    }
Now, note that if a PPPV/FPPV is treated as an elementary node of a computational topology (graph), and constructions are defined that allow one PPPV/FPPV to replenish the plan of another, neighboring PPPV/FPPV, then it becomes possible to work with quite complex computational topologies, both with shared and with distributed memory (for example, on a cluster, where plan elements travel along the graph's arcs via ordinary network transfer operations). The operations that replenish the plan of another PPPV/FPPV are called throw_first(...) and throw_last(...). Their parameters are determined by the call parameters of the receiving PPPV/FPPV. If a PPPV/FPPV has only one receiving neighbor in the topology (as in a pipeline), the parameters are ordinary ones. If there are several receiving neighbors, one of the parameters is made special: an addressing parameter. All identically named topology nodes (those corresponding to one PPPV/FPPV) are numbered, so the addressing parameter holds the name of the receiving PPPV/FPPV followed by its number in square brackets. For the time being, topologies are described either statically (with special constructions for a vector/pipeline, or with lists of arcs for an arbitrary graph) or semi-statically, when the list of arcs is produced by special code-generating inserts, macromodules. These can be written in a philosophy similar to PHP: the inserts generate fragments of the program text, and any language can be used depending on the task, be it PHP or GNU Prolog. Technically, ordinary dynamic generation of the topology by calls to certain functions is not ruled out either.
For full-fledged work, channels/lazy variables, transactional memory, barriers, semaphores, and so on can always be used in addition.
I will give a few examples with different topologies.
Pipeline computation of the minimum and maximum elements of a tree.

    chain[ARR_SIZE] TreeMax(TreeNode * Cur, reduction(max) DataItem & MaxResult)
    {
      static DataItem DummyMin;
      throw_last(Cur, DummyMin);
      if (Cur==Root) plan_group_parallelize;
      if (Cur->Left)  plan_last(Cur->Left, MaxResult);
      if (Cur->Right) plan_last(Cur->Right, MaxResult);
      MaxResult = Cur->Data;
    }

    chain[ARR_SIZE] TreeMin(TreeNode * Cur, reduction(min) DataItem & MinResult)
    {
      if (Cur==Root) plan_group_parallelize;
      MinResult = Cur->Data;
    }

    void TreeMinMax(DataItem & Min, DataItem & Max)
    {
      Max.fill(0.0);
      Min.fill(1000.0);
      plan_parallel_chain(0, TreeMax(Root,Max), TreeMin(Root,Min));
    }
An example with a "needle-and-eye" topology, which can be used to test the performance of a particular implementation.

    #include <iostream>
    using namespace std;

    bool stop;

    chain A(bool init) throw(bool init, int N)
    {
      stop = false;
      throw_first(false, 1);
      Sleep(2000);
      stop = true;
    }

    chain B(bool init, int N) throw(bool init, bool _stop, int N)
    {
      if (!init) {
        if (stop) throw_first(false, true, N);
        else      throw_first(false, false, N+1);
      }
    }

    chain C(bool init, bool _stop, int N) throw(bool init, int N)
    {
      if (!init) {
        if (_stop) {
          cout<<N;
          plan_topology_quit();
        } else
          throw_first(false, N+1);
      }
    }

    int main()
    {
      plan_topology {
        plan_parallel_chain(A(true)->B(true,0)->C(true,false,0));
        plan_parallel_reverse(C(true,false,0)->B(true,0));
      }/3;
      return 0;
    }
Everything described above is implemented as a C++ extension (Planning C). There is a simple demo translator that implements, in addition to the above, a few more interesting things.