
LLVM compiler for MultiClet: the Whetstone benchmark

In earlier conversations about the multicellular architecture, its applicability to a given task was often discussed in terms of how much natural parallelism the task contains. When running various benchmarks, CoreMark in particular, we noted how poorly such programs fit the multicellular architecture: the strictly sequential nature of the algorithm does not let the cells in a group extract enough instructions that can execute simultaneously. In this article we evaluate the MultiClet under more favorable conditions, using the Whetstone benchmark.


Whetstone differs favorably from CoreMark in the nature of its computations: all of its constituent tests, with the exception of the conditional-branch speed test, exhibit internal parallelism to some degree. Testing was performed in two configurations. In the first, labeled Multiclet R1, the benchmark was compiled with the current version of the LLVM compiler using the options:

-ffast-math -fno-builtin -O3 

In the second configuration, labeled Multiclet R1*, testing was carried out with the prospective optimizations that are currently being added to the compiler applied by hand. The manual rework helped the compiler lengthen the linear sections by merging several loop iterations into one.

Results

| System | MHz | MWIPS/MHz | MFLOPS1/MHz | MFLOPS2/MHz | MFLOPS3/MHz | COS MOPS/MHz | EXP MOPS/MHz | FIXPT MOPS/MHz | IF MOPS/MHz | EQUAL MOPS/MHz |
|---|---|---|---|---|---|---|---|---|---|---|
| Multiclet R1 * | 100 | 0.721 | 0.256 | 0.212 | 0.162 | 0.018 | 0.008 | 3.569 | 0.417 | 1.57 |
| RPi 2 v7-A7 | 1000 | 0.585 | 0.28 | 0.291 | 0.248 | 0.011 | 0.006 | 1.314 | 1.209 | 0.981 |
| RPi 3 v8-A53 | 1200 | 0.604 | 0.276 | 0.29 | 0.248 | 0.01 | 0.007 | 1.267 | 1.561 | 1.014 |
| ARM v8-A53 | 1300 | 0.642 | 0.268 | 0.241 | 0.239 | 0.028 | 0.004 | 1.197 | 1.436 | 0.439 |
| Core i7 4820K | 3900 | 0.887 | 0.341 | 0.308 | 0.167 | 0.023 | 0.014 | 0.998 | 1.504 | 0.251 |
| Core i7 1 CP | 3066 | 0.873 | 0.325 | 0.295 | 0.174 | 0.025 | 0.013 | 0.892 | 0.958 | 0.167 |
| Phenom II | 3000 | 0.799 | 0.307 | 0.27 | 0.111 | 0.026 | 0.016 | 0.835 | 1.001 | 0.167 |
| Athlon 64 | 2211 | 0.785 | 0.308 | 0.272 | 0.104 | 0.026 | 0.016 | 0.832 | 0.999 | 0.187 |
| Turion 64 M | 1900 | 0.884 | 0.302 | 0.258 | 0.145 | 0.026 | 0.016 | 0.827 | 0.988 | 0.187 |
| Core i5 2467M | 2300 | 0.853 | 0.296 | 0.298 | 0.163 | 0.022 | 0.013 | 0.807 | 0.993 | 0.222 |
| Core 2 Duo 1 CP | 2400 | 0.885 | 0.337 | 0.307 | 0.198 | 0.024 | 0.012 | 0.804 | 0.81 | 0.176 |
| Celeron C2 M | 2000 | 0.868 | 0.297 | 0.296 | 0.194 | 0.023 | 0.012 | 0.778 | 0.781 | 0.172 |
| Core 2 Duo M | 1830 | 0.878 | 0.337 | 0.305 | 0.197 | 0.024 | 0.012 | 0.751 | 0.785 | 0.174 |
| Multiclet R1 | 100 | 0.311 | 0.157 | 0.153 | 0.029 | 0.018 | 0.008 | 0.714 | 0.081 | 0.143 |
| Celeron M | 1295 | 0.832 | 0.324 | 0.297 | 0.178 | 0.022 | 0.012 | 0.631 | 0.923 | 0.173 |
| Raspberry Pi | 1000 | 0.391 | 0.137 | 0.146 | 0.123 | 0.009 | 0.004 | 0.617 | 1.014 | 0.805 |
| Athlon XP | 2088 | 0.856 | 0.307 | 0.274 | 0.139 | 0.026 | 0.016 | 0.576 | 0.998 | 0.166 |
| Pentium Pro | 200 | 0.79 | 0.332 | 0.278 | 0.146 | 0.023 | 0.013 | 0.575 | 0.755 | 0.149 |
| Athlon4 Barton | 1800 | 0.846 | 0.305 | 0.272 | 0.137 | 0.026 | 0.016 | 0.571 | 0.988 | 0.165 |
| Celeron A | 450 | 0.76 | 0.291 | 0.276 | 0.14 | 0.022 | 0.012 | 0.569 | 0.751 | 0.147 |
| Pentium 4E | 3000 | 0.39 | 0.182 | 0.164 | 0.058 | 0.014 | 0.006 | 0.323 | 0.27 | 0.126 |
| Atom M | 1600 | 0.348 | 0.176 | 0.157 | 0.051 | 0.01 | 0.007 | 0.252 | 0.744 | 0.11 |
| Pentium 4 | 1900 | 0.383 | 0.214 | 0.188 | 0.056 | 0.012 | 0.006 | 0.241 | 0.427 | 0.118 |
| Pentium MMX | 200 | 0.615 | 0.328 | 0.267 | 0.079 | 0.025 | 0.013 | 0.198 | 0.73 | 0.186 |
| Pentium | 100 | 0.604 | 0.322 | 0.267 | 0.078 | 0.025 | 0.013 | 0.192 | 0.568 | 0.183 |
| 80486DX2 | 66 | 0.182 | 0.076 | 0.068 | 0.026 | 0.008 | 0.005 | 0.105 | 0.212 | 0.017 |
* Results obtained with the prospective LLVM compiler optimizations applied

It can be seen that in terms of MWIPS/MHz the MultiClet looks much more convincing than it did in terms of CoreMark/MHz (figures published in the earlier article). We can note the following:



The tests included in Whetstone can be divided into four groups. The first group contains the floating-point computation tests, whose results determine the MFLOPS1, MFLOPS2, and MFLOPS3 indicators. It can be seen that the additional LLVM compiler optimizations give a significant speedup in all three of these tests.
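All scores in the table above are normalized per megahertz, so an absolute throughput figure can be recovered by multiplying a table entry by the clock frequency. A minimal sketch of this bookkeeping (the 25.6 MFLOPS figure below is simply 0.256 MFLOPS1/MHz from the Multiclet R1* row times its 100 MHz clock):

```c
#include <assert.h>
#include <math.h>

/* Per-MHz normalization used throughout the results table. */
static double per_mhz(double absolute_score, double clock_mhz) {
    return absolute_score / clock_mhz;
}

/* Recover the absolute score from a normalized table entry. */
static double absolute(double per_mhz_score, double clock_mhz) {
    return per_mhz_score * clock_mhz;
}
```

This normalization is what makes a 100 MHz R1 comparable to multi-GHz desktop cores in the table: it measures work done per clock rather than raw speed.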

Let us examine the nature of this speedup using the MFLOPS1 indicator and investigate how the result is achieved in the first floating-point arithmetic test. The C code of the test:

 timea = dtime();
 {
     for (ix = 0; ix < xtra; ix++)
     {
         for (i = 0; i < n1 * n1mult; i += 5)
         {
             e1[0] = ( e1[0] + e1[1] + e1[2] - e1[3]) * t;
             e1[1] = ( e1[0] + e1[1] - e1[2] + e1[3]) * t;
             e1[2] = ( e1[0] - e1[1] + e1[2] + e1[3]) * t;
             e1[3] = (-e1[0] + e1[1] + e1[2] + e1[3]) * t;
         }
         t = 1.0 - t;
     }
     t = t0;
 }
 timeb = dtime();

Compiling the test with the current version of LLVM produces the following assembly code for the body of the inner loop:

 jmp LBB2_4
 SR4  := rdq #IR7, 2160
 SR5  := rdq #IR7, 2152
 SR6  := rdq #IR7, 2144
 SR7  := rdq #IR7, 2136
 SR8  := rdq #IR7, 2128
 SR9  := rdq #IR7, 2120
 SR10 := subf @SR6, @SR5
 SR11 := subf @SR5, @SR6
 SR12 := addf @SR5, @SR6
 SR5  := addf @SR10, @SR4
 SR10 := addf @SR11, @SR4
 SR4  := addf @SR10, @SR7
 SR7  := mulf @SR4, @SR8
 SR10 := addf @SR5, @SR7
 SR5  := subf @SR4, @SR10
 SR11 := subf @SR10, @SR4
 SR4  := mulf @SR10, @SR8
 SR10 := mulf @SR5, @SR8
 SR5  := addf @SR12, @SR10
 SR10 := addf @SR11, @SR5
 SR11 := mulf @SR5, @SR8
 SR5  := mulf @SR10, @SR8
 SR10 := addf @SR5, @SR6
 SR5  := mulf @SR10, @SR8
 wrq @SR9, #IR7, 2760
 wrq @SR5, #IR7, 2752
 wrq @SR11, #IR7, 2744
 wrq @SR4, #IR7, 2736
 wrq @SR7, #IR7, 2728

It can be seen that many instructions in such a paragraph can execute in parallel. Merging several iterations into one, however, significantly lengthens the section of code that executes entirely within the intercellular environment without touching memory, and also saves the time otherwise spent storing intermediate results between iterations.
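At the C source level, the merging amounts to placing several copies of the loop body in one trip and advancing the induction variable accordingly. A hand-written sketch with two copies per trip (the compiler merges more), including the remainder loop needed when the trip count is not a multiple of the unroll factor:

```c
#include <assert.h>

/* One iteration of the Whetstone module-1 body, from the C test code above. */
#define N1_BODY(e1, t)                                  \
    do {                                                \
        (e1)[0] = ( (e1)[0] + (e1)[1] + (e1)[2] - (e1)[3]) * (t); \
        (e1)[1] = ( (e1)[0] + (e1)[1] - (e1)[2] + (e1)[3]) * (t); \
        (e1)[2] = ( (e1)[0] - (e1)[1] + (e1)[2] + (e1)[3]) * (t); \
        (e1)[3] = (-(e1)[0] + (e1)[1] + (e1)[2] + (e1)[3]) * (t); \
    } while (0)

/* Original loop: one body per trip. */
static void n1_original(double e1[4], double t, int iters) {
    for (int i = 0; i < iters; i++)
        N1_BODY(e1, t);
}

/* Merged loop: two bodies per trip, plus a remainder loop. */
static void n1_merged(double e1[4], double t, int iters) {
    int i = 0;
    for (; i + 2 <= iters; i += 2) {
        N1_BODY(e1, t);   /* first merged iteration  */
        N1_BODY(e1, t);   /* second merged iteration */
    }
    for (; i < iters; i++)  /* leftover iteration when iters is odd */
        N1_BODY(e1, t);
}
```

Since the merged version performs the identical sequence of operations, its results are bit-for-bit equal to the original; the gain comes purely from longer linear sections between memory accesses.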

After the iteration-merging procedure, the loop body looks as follows:
 jmp LBB2_4
 SR4  := rdq #IR7, 272
 SR5  := rdq #IR7, 264
 SR6  := rdq #IR7, 256
 SR7  := rdq #IR7, 248
 SR8  := rdq #IR7, 320
 SR9  := rdq #IR7, 240
 SR10 := rdq #IR7, 232
 SR11 := addf @SR4, @SR5
 SR5  := addf @SR7, @SR6
 SR12 := addf @SR11, @SR6
 SR13 := subf @SR12, @SR7
 SR12 := mulf @SR13, @SR8
 SR14 := addf @SR12, @SR4
 SR4  := addf @SR14, @SR11
 SR11 := subf @SR14, @SR6
 SR6  := addf @SR11, @SR7
 SR11 := subf @SR13, @SR6
 SR12 := subf @SR6, @SR13
 SR13 := mulf @SR11, @SR8
 SR11 := addf @SR5, @SR13
 SR5  := addf @SR12, @SR11
 SR12 := addf @SR11, @SR4
 SR13 := mulf @SR5, @SR8
 SR5  := mulf @SR12, @SR8
 SR12 := addf @SR13, @SR7
 SR7  := mulf @SR12, 0x3f000000
 SR12 := subf @SR5, @SR7
 SR5  := addf @SR12, @SR6
 SR6  := addf @SR12, @SR11
 SR13 := subf @SR5, @SR11
 SR11 := addf @SR5, @SR4
 SR4  := mulf @SR13, @SR8
 SR5  := mulf @SR11, @SR9
 SR11 := addf @SR4, @SR7
 SR4  := subf @SR11, @SR12
 SR12 := subf @SR6, @SR11
 SR6  := mulf @SR12, @SR8
 SR11 := subf @SR13, @SR12
 SR12 := addf @SR6, @SR7
 SR6  := mulf @SR8, @SR11
 SR11 := addf @SR4, @SR12
 SR4  := mulf @SR12, @SR8
 SR13 := mulf @SR11, @SR8
 SR11 := addf @SR4, @SR5
 SR4  := addf @SR13, @SR7
 SR5  := mulf @SR4, 0x3f000000
 SR4  := subf @SR11, @SR5
 SR7  := addf @SR4, @SR12
 SR12 := addf @SR6, @SR4
 SR6  := mulf @SR12, @SR8
 SR12 := addf @SR6, @SR5
 SR13 := addf @SR6, @SR11
 SR6  := subf @SR12, @SR4
 SR4  := subf @SR7, @SR12
 SR7  := mulf @SR4, @SR8
 SR4  := addf @SR7, @SR5
 SR7  := addf @SR6, @SR4
 SR6  := addf @SR4, @SR13
 SR11 := mulf @SR7, @SR8
 SR7  := mulf @SR6, @SR8
 SR6  := addf @SR11, @SR5
 SR5  := mulf @SR6, 0x3f000000
 SR6  := subf @SR7, @SR5
 SR7  := addf @SR6, @SR12
 SR11 := addf @SR6, @SR4
 SR12 := subf @SR7, @SR4
 SR4  := addf @SR7, @SR13
 SR8  := moveq @SR8
 SR9  := moveq @SR9
 SR10 := moveq @SR10
 SR7  := mulf @SR12, @SR8
 SR13 := mulf @SR4, @SR9
 SR4  := addf @SR7, @SR5
 SR7  := subf @SR4, @SR6
 SR6  := subf @SR11, @SR4
 SR4  := mulf @SR6, @SR8
 SR9  := subf @SR12, @SR6
 SR6  := addf @SR4, @SR5
 SR4  := mulf @SR8, @SR9
 SR9  := addf @SR7, @SR6
 SR7  := mulf @SR6, @SR8
 SR11 := mulf @SR9, @SR8
 SR9  := addf @SR7, @SR13
 SR7  := addf @SR11, @SR5
 SR5  := mulf @SR7, 0x3f000000
 SR7  := subf @SR9, @SR5
 SR11 := addf @SR7, @SR6
 SR6  := addf @SR4, @SR7
 SR4  := mulf @SR6, @SR8
 SR12 := addf @SR4, @SR5
 SR13 := addf @SR4, @SR9
 SR4  := subf @SR12, @SR7
 SR7  := subf @SR11, @SR12
 SR9  := mulf @SR7, @SR8
 SR11 := subf @SR6, @SR7
 SR6  := addf @SR9, @SR5
 SR7  := mulf @SR8, @SR11
 SR9  := addf @SR4, @SR6
 SR4  := addf @SR13, @SR6
 SR11 := mulf @SR9, @SR8
 SR9  := mulf @SR4, @SR8
 SR4  := addf @SR11, @SR5
 SR5  := mulf @SR4, 0x3f000000
 SR4  := subf @SR9, @SR5
 SR9  := addf @SR7, @SR4
 SR7  := addf @SR4, @SR6
 SR6  := mulf @SR4, @SR8
 SR11 := mulf @SR9, @SR8
 SR9  := addf @SR11, @SR5
 SR11 := subf @SR7, @SR9
 SR7  := subf @SR9, @SR4
 SR4  := mulf @SR9, @SR8
 SR9  := mulf @SR11, @SR8
 SR11 := addf @SR9, @SR5
 SR9  := addf @SR7, @SR11
 SR7  := mulf @SR11, @SR8
 SR11 := mulf @SR9, @SR8
 SR8  := addf @SR11, @SR5
 SR5  := mulf @SR8, 0x3f000000
 wrq @SR10, #IR7, 384
 wrq @SR6, #IR7, 376
 wrq @SR4, #IR7, 368
 wrq @SR7, #IR7, 360
 wrq @SR5, #IR7, 352

The second group of tests evaluates the speed of basic mathematical functions and is characterized by the COS MOPS and EXP MOPS indicators. Compiler optimizations have no noticeable effect on these tests, since the bulk of the work falls on the mathematical library. The results here were significantly hurt by the fact that the math library in use was written for the older P1 processor and takes no advantage of the newer R1.

The third group combines the integer-arithmetic test (the FIXPT MOPS indicator) and the array-access test (the EQUAL MOPS indicator). These tests benefit from all the effects that improve performance in the first group; in addition, the lengthened linear sections obtained by merging loop iterations can be further optimized by LLVM's standard optimization passes. These optimizations significantly reduce the number of intermediate calculations and result in R1 scores 1.5 to 2 times higher than those of the Intel and ARM processors.
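The integer part of this group, as it appears in classic whets.c, is roughly the following. Note that j, k, and l are constructed so that their values are invariant across iterations; this short dependent chain is exactly the kind of code that standard LLVM passes can partly fold away once several iterations are merged into one linear section, which is what inflates the FIXPT score:

```c
#include <assert.h>

/* Approximate body of the Whetstone fixed-point module: with the
   standard initial values j=1, k=2, l=3, every iteration reproduces
   them, so the loop is pure dependent integer arithmetic. */
static void fixpt_module(int *pj, int *pk, int *pl, double e1[4], int iters) {
    int j = *pj, k = *pk, l = *pl;
    for (int i = 0; i < iters; i++) {
        j = j * (k - j) * (l - k);
        k = l * k - (l - j) * k;
        l = (l - k) * (k + j);
        e1[l - 2] = j + k + l;   /* with l == 3 this writes e1[1] */
        e1[k - 2] = j * k * l;   /* with k == 2 this writes e1[0] */
    }
    *pj = j;
    *pk = k;
    *pl = l;
}
```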

The last group contains the conditional-jump performance test, reflected in the IF MOPS indicator. The low score on this test is due to its strictly sequential nature and, as a consequence, the lack of sufficient parallelism.
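The branch test body in classic whets.c is approximately a chain of dependent conditionals like the following. Each comparison depends on the immediately preceding assignment to the same variable, so the cells in a group have essentially no independent instructions to execute side by side:

```c
#include <assert.h>

/* Approximate body of the Whetstone branching module: three chained
   conditionals on a single variable, leaving no independent work. */
static int branch_module(int j, int iters) {
    for (int i = 0; i < iters; i++) {
        if (j == 1) j = 2; else j = 3;
        if (j > 2)  j = 0; else j = 1;
        if (j < 1)  j = 1; else j = 0;
    }
    return j;
}
```

Starting from j = 1, the value simply alternates between 0 and 1 from one iteration to the next, so the loop is one long serial dependency chain through the branch unit.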

Thus, given a sufficiently long linear section with enough mutually independent instructions within it, the current revision of the processor delivers execution speeds comparable to current ARM and Intel cores. Good results are achieved on the MFLOPS1, MFLOPS2, and MFLOPS3 indicators. The excellent FIXPT MOPS and EQUAL MOPS results stem not only from the features of the multicellular architecture but also from the compiler optimizations performed on the lengthened linear sections, which somewhat inflates these scores by reducing the number of operations actually executed. The weak COS MOPS and EXP MOPS figures reflect the lack of attention so far to optimizing the mathematical library and will be improved in the future.

As for the compiler itself, since the previous article its functionality for the multicellular architecture has been significantly extended:

  1. Added support for 64-bit integer arithmetic.
  2. Added the ability to generate debug information.
  3. Added a target (the -target option) that generates assembly code using only genuine single-precision arithmetic (the double and long double types are 32 bits in size, like float).
  4. Added compiler options that ensure only 32-bit write instructions are used (necessary because of the implementation of the R1 processor's external memory, to which only 32-bit values can be written).
  5. The library functions memset(), memcpy(), and memmove() have been implemented optimally.
  6. Investigated the possibility of compiler support for vector instructions; the results showed no need to implement it, given the limited set of vector instructions supported by the R1 processor itself.
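Items 4 and 5 in the list above interact: because R1 external memory accepts only 32-bit stores, a fast memcpy() naturally copies a 32-bit word at a time. A hypothetical sketch of the idea follows (this is not the actual MultiClet library code; the name memcpy32 and the byte-wise tail loop are illustrative assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative word-at-a-time copy in the spirit of item 5. Assumes
   dst and src are 4-byte aligned. A real R1 version would have to
   read-modify-write the last word instead of issuing byte stores,
   since external memory only accepts 32-bit writes (item 4). */
static void *memcpy32(void *dst, const void *src, size_t n) {
    uint32_t *d = dst;
    const uint32_t *s = src;
    size_t words = n / 4;
    for (size_t i = 0; i < words; i++)   /* bulk of the copy: 32-bit stores */
        d[i] = s[i];
    unsigned char *db = (unsigned char *)(d + words);
    const unsigned char *sb = (const unsigned char *)(s + words);
    for (size_t i = 0; i < n % 4; i++)   /* leftover tail bytes */
        db[i] = sb[i];
    return dst;
}
```

Copying word-wise quarters the trip count of the copy loop, which on a multicellular core also means fewer, longer paragraphs between memory accesses.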

In general, the LLVM compiler has been updated to version 3.8.1.

Source: https://habr.com/ru/post/307512/

