
Code optimization for the Elbrus platform using simple examples

"Usually a hacker does not write programs for the benefit of
but for your own pleasure. Such a program
may be useful or may remain
just a game of intelligence. "
Henry S. Warren. Algorithmic tricks for programmers [1]


Today we continue our notes on Elbrus. The first article, dedicated to getting our passport recognition system running on Elbrus and optimizing it, can be found here.




At some point, my colleagues and I became curious about how the simplest optimization techniques perform on Elbrus.


The Elbrus processor has a VLIW (Very Long Instruction Word) architecture, that is, it operates on “wide” instruction words. This means that the lcc compiler analyzes the source code at compile time, determines the dependencies between instructions, and packs them into wide instruction words. One such word can hold up to 23 operations that are executed simultaneously; with SIMD (Single Instruction Multiple Data) this number can grow to 33 or more. The instructions within a wide word execute in parallel, keeping all 6 arithmetic logic units on each processor busy. Parallelization and the loading of all the execution units fall entirely on the shoulders of the optimizing compiler, which makes it possible to significantly simplify the hardware for instruction analysis and execution and to reduce power consumption to 45 W for the Elbrus-4C [2, 3].


We at Smart Engines wondered how the usual optimizations, such as, for example, loop unrolling, would behave on such an unusual platform with a “smart” compiler.


We looked at simple C++ examples and compared the results on the Elbrus-4C and the Intel Core i7-6700K. On Elbrus we used the lcc compiler version 1.20.17; for the Core i7, the Microsoft Visual Studio Community 2015 compiler. For lcc the compilation flags -O3 and -ffast-math were used.


Let us start with the specifications of the Elbrus-4C and the Intel Core i7-6700K:


 | Elbrus-4C | Intel Core i7-6700K
Architecture | Elbrus | Skylake-S
Frequency, GHz | 0.8 | 4
Number of cores | 4 | 4 (8 with Hyper-Threading)
Process technology | 65 nm | 14 nm
L1 cache, data | 64 KB | 32 KB
L1 cache, instructions | 128 KB | 32 KB
L2 cache | 8 MB | 1 MB
L3 cache | - | 8 MB
RAM type | DDR3-1600 | DDR4-2133
Power consumption, W | 45 | 91

The clock frequencies of these processors differ significantly: 800 MHz for the Elbrus-4C versus 4 GHz for the Intel Core i7. Also note that Elbrus has a different cache structure: there is no L3 cache, but the L2 cache is 8 MB (2 MB per core), whereas the Core i7 under consideration has 1 MB (0.25 MB per core). The L1 caches on Elbrus are also larger, especially the instruction cache (128 KB vs. 32 KB).


Example 1. Loop unrolling


This optimization aims to reduce the number of loop iterations (and hence the number of exit-condition checks) by enlarging the loop body. It is well suited for the simple loops that occur in virtually every program.


Now consider an example:


    #define N 64000

    unsigned int x = 0;
    unsigned int a[N];
    for (int i = 0; i < N; ++i)
        a[i] = i;

    // the loop we will try to unroll
    for (int k = 0; k < N;) {
        x += a[k];
        a[k++] = k;
    }

We tried unrolling the last loop (a sketch of a possible unrolled variant is shown below). The timing results (for 10,000 repetitions of the loop) are given in Table 1. Note that on Elbrus the 32-bit addressing mode was used (compiler flag -mpr32). We also converted the running time into its equivalent at 1 GHz by multiplying the measured time by the clock frequency in GHz (for example, 255 ms on the 4 GHz Core i7 becomes 255 × 4 = 1020). The values obtained this way are proportional to the number of clock cycles spent and make it possible to compare the Elbrus and the Core i7 while accounting for the difference in clock frequency.
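
For reference, here is what a manually unrolled variant with an unroll factor of 4 might look like (a sketch; the article does not include the unrolled code that was actually benchmarked):

    for (int k = 0; k < N;) {  // N is a multiple of 4
        x += a[k]; a[k++] = k;
        x += a[k]; a[k++] = k;
        x += a[k]; a[k++] = k;
        x += a[k]; a[k++] = k;
    }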


Table 1. Running time as a function of N, the number of unrolled iterations.


N | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
1 | 401 | 320 | 255 | 1020
2 | 400 | 320 | 275 | 1100
4 | 401 | 320 | 261 | 1044
8 | 401 | 320 | 247 | 988
16 | 401 | 320 | 361 | 1444
32 | 452 | 362 | 262 | 1048
64 | 451 | 362 | 262 | 1048

It can be seen that unrolling the loop in this example gives no gain in running time on either the modern Core i7 or the Elbrus-4C. For the very simple loop we considered, the Elbrus-4C works more efficiently than the Core i7 once the frequency ratio is taken into account.


Example 2. Processing data of various lengths


In this example we process an array in elements of 1, 4, or 8 bytes. The original array was aligned to 8 bytes.


    #define MASK1 0xF1
    #define MASK4 0xF1F2F3F4
    #define MASK8 0xF1F2F3F4F5F6F7F8

    // 1-byte elements; MASK4 and MASK8 are used in the 4- and 8-byte variants
    for (k = 0; k < n; ++k) {
        x8[k] &= MASK1;
    }
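
The article shows only the 1-byte loop; the 4- and 8-byte variants presumably look like the sketch below, which continues the listing above (the array names x8, x32, x64 and the reinterpretation of the same 8-byte-aligned buffer are our assumptions):

    // the same n-byte buffer viewed as wider elements
    uint32_t *x32 = reinterpret_cast<uint32_t *>(x8);
    uint64_t *x64 = reinterpret_cast<uint64_t *>(x8);

    // 4-byte elements: n / 4 iterations instead of n
    for (k = 0; k < n / 4; ++k) {
        x32[k] &= MASK4;
    }

    // 8-byte elements: n / 8 iterations
    for (k = 0; k < n / 8; ++k) {
        x64[k] &= MASK8;
    }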

The results of the time measurements are shown in Table 2.


Table 2. Running time depending on N, the element size in bytes.


N | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
1 | 2400 | 1920 | 811 | 3244
4 | 600 | 480 | 201 | 804
8 | 300 | 240 | 102 | 408

It can be seen that processing the data in 4- and 8-byte elements is faster on both the modern Core i7 and the Elbrus-4C, with the running time decreasing in proportion to the element size. Moreover, Elbrus works more efficiently than the Core i7 once the frequency ratio is taken into account.


Example 3. Using SIMD


In this example we decided to try intrinsics and looked at the dot product of float arrays with n = 12800.
Unoptimized loop:


    float result = 0.0;
    for (int i = 0; i < n; ++i) {
        result += x[i] * y[i];
    }

Using SSE:


    float *pX = x;
    float *pY = y;
    __m128 Sum = _mm_setzero_ps();
    int num = n / 4;
    for (int i = 0; i < num; ++i, pX += 4, pY += 4) {
        Sum = _mm_add_ps(Sum, _mm_mul_ps(_mm_load_ps(pX), _mm_load_ps(pY)));
    }
    // horizontal sum of the four partial sums
    float sums[4];
    _mm_storeu_ps(sums, Sum);
    float result = sums[0] + sums[1] + sums[2] + sums[3];
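
Note that _mm_load_ps expects 16-byte-aligned pointers. The article does not show how x and y are allocated; one possible way to obtain suitably aligned buffers is the following sketch (not the authors' code):

    #include <xmmintrin.h>  // _mm_malloc / _mm_free

    const int n = 12800;
    float *x = static_cast<float *>(_mm_malloc(n * sizeof(float), 16));
    float *y = static_cast<float *>(_mm_malloc(n * sizeof(float), 16));
    // ... fill x and y, run the loops above ...
    _mm_free(y);
    _mm_free(x);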

Using EML [4] (an optimized library for Elbrus):


    double result;
    eml_Vector_DotProd_32F(x, y, n, &result);

The timing results are shown in Table 3. The SIMD registers on the Elbrus-4C are 64 bits wide (instruction set version 3), which on the whole matches the observed 2x speedup between the version without optimization and the SIMD version. On the Core i7 the numbers are also plausible: we operated on 128-bit registers, which hold four floats. It is also noticeable that without intrinsics Elbrus works more efficiently than the Core i7 once frequency is taken into account, while with intrinsics the running times are almost the same.


Table 3. Running time of the dot product computation.


Variant | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
No optimization | 263 | 210 | 99 | 396
With SIMD | 110 | 88 | 24 | 96

Example 4. Counting the Hamming distance between two arrays


Here we calculated the Hamming distance between the binary representations of two arrays, i.e. we took the Hamming distances between the binary representations of the corresponding numbers in the arrays and summed them. For arrays of 8-bit data we used a bitwise XOR and a precomputed distance table:


    int result = 0;
    for (int i = 0; i < n; ++i) {
        result += popcnt_table[x[i] ^ y[i]];
    }

For 32-bit and 64-bit data we used a bitwise XOR together with the population count intrinsics _mm_popcnt_u32 and _mm_popcnt_u64 on Intel and the corresponding popcount intrinsics on Elbrus. The total length of the x and y arrays in bytes remained unchanged and equal to n = 512. The timing results are shown in Table 4.
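
For the 32- and 64-bit cases the inner loop might look like the following sketch (the function names and pointer types are our assumptions; on Elbrus the corresponding popcount intrinsics would be substituted):

    #include <cstdint>
    #include <nmmintrin.h>  // _mm_popcnt_u32 / _mm_popcnt_u64 (SSE4.2)

    // Hamming distance over the buffers viewed as 32-bit words
    int hamming32(const uint32_t *x, const uint32_t *y, int words) {
        int result = 0;
        for (int i = 0; i < words; ++i)
            result += _mm_popcnt_u32(x[i] ^ y[i]);
        return result;
    }

    // ... and viewed as 64-bit words
    int hamming64(const uint64_t *x, const uint64_t *y, int words) {
        int result = 0;
        for (int i = 0; i < words; ++i)
            result += static_cast<int>(_mm_popcnt_u64(x[i] ^ y[i]));
        return result;
    }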


Table 4. Hamming distance computation time depending on N, the element size in bytes.


N | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
1 | 630 | 504 | 155 | 620
4 | 110 | 88 | 47 | 188
8 | 76 | 61 | 15 | 60

In this example we see that the intrinsics for counting set bits in 32-bit and 64-bit registers give a significant speedup over the version with a precomputed table, on both Elbrus and Core i7. Moreover, the 32-bit popcnt on Elbrus is faster than on the Core i7 once the frequency ratio is taken into account, while in the 64-bit case the times at 1 GHz on the Core i7 and Elbrus are equal.


Example 5. Eliminating data dependencies


This example is borrowed from Kris Kaspersky's book “Code Optimization: Effective Memory Usage” [5]. It demonstrates how eliminating a data dependency helps improve performance. The array is filled with zeros, n = 2097152.
Code with a data dependency:


    int x = 0;
    for (size_t a = 0; a < BLOCK_SIZE; a += 8 * sizeof(int)) {
        x = *(int *)((char *)p1 + a + 0 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 1 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 2 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 3 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 4 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 5 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 6 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 7 * sizeof(int)); a += x;
    }

Each subsequent address depends on the value loaded by the previous instruction, so the loads from memory are forced to execute one after another, each waiting for the previous one to complete.
Now the code without the dependency:


    int x = 0;
    for (size_t a = 0; a < BLOCK_SIZE; a += 8 * sizeof(int)) {
        x += *(int *)((char *)p2 + a + 0 * sizeof(int));
        x += *(int *)((char *)p2 + a + 1 * sizeof(int));
        x += *(int *)((char *)p2 + a + 2 * sizeof(int));
        x += *(int *)((char *)p2 + a + 3 * sizeof(int));
        x += *(int *)((char *)p2 + a + 4 * sizeof(int));
        x += *(int *)((char *)p2 + a + 5 * sizeof(int));
        x += *(int *)((char *)p2 + a + 6 * sizeof(int));
        x += *(int *)((char *)p2 + a + 7 * sizeof(int));
    }

The timing results are shown in Table 5. Eliminating the data dependency pays off on both Elbrus and Core i7: the running time on the Core i7 changes by a factor of about 11, and on Elbrus by almost 20! With the data dependency the code ran more slowly on Elbrus than on the Core i7 even in terms of time at 1 GHz, but without the dependency Elbrus is only 4 times slower than the Core i7 (with a 5-fold difference in clock frequency). Such results on Elbrus can be explained by its asynchronous array prefetch mechanism (array prefetch buffer), which works efficiently when the fetched elements are laid out sequentially.


Table 5. Time to read dependent and independent data.


Variant | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
Dependent data | 605 | 484 | 87 | 348
Independent data | 32 | 26 | 8 | 32

Example 6. Multi-threaded calculations


Of course, we could not leave out such an optimization technique as parallelization. For the purity of the experiment we took completely independent tasks (computing the dot product of two arrays of type double). Table 6 shows the time of running N tasks sequentially (T_seq), the time of running N tasks in N threads (T_par), and the speedup E:


Table 6. Time of sequential and parallel computation of the dot product depending on the number of tasks and threads N.


N | Elbrus-4C: T_seq, ms | Elbrus-4C: T_par, ms | Elbrus-4C: E = T_seq / T_par | Core i7: T_seq, ms | Core i7: T_par, ms | Core i7: E = T_seq / T_par
2 | 2628 | 1316 | 2.00 | 1033 | 500 | 2.07
4 | 5259 | 1318 | 3.99 | 1994 | 500 | 3.99
8 | 10513 | 2634 | 3.99 | 3987 | 503 | 7.93
16 | 21045 | 5268 | 4.00 | 7980 | 1009 | 7.91
20 | 26321 | 6583 | 4.00 | 9967 | 1263 | 7.89
32 | 42053 | 10535 | 3.99 | 15948 | 2014 | 7.92
40 | 52566 | 13170 | 3.99 | 19936 | 2528 | 7.89

It can be seen that on the Core i7 a speedup of almost 8x is reached at 8 threads and then changes little, while on Elbrus a 4x speedup is reached at 4 threads and likewise does not change as the number of threads grows. The ratio of the running times between Elbrus and the Core i7 turned out to be approximately 2.6, while the frequency ratio is 5.
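
The article does not show the benchmarking code; below is a minimal sketch of how N independent dot-product tasks can be run in N threads with std::thread (the function names and the absence of timing code are our assumptions, not the authors' exact benchmark):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // one independent task: dot product of two double arrays
    static double dot(const double *x, const double *y, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }

    // run num_tasks identical, independent tasks in num_tasks threads
    void run_parallel(const double *x, const double *y, std::size_t n,
                      int num_tasks, std::vector<double> &results) {
        results.assign(num_tasks, 0.0);
        std::vector<std::thread> threads;
        threads.reserve(num_tasks);
        for (int t = 0; t < num_tasks; ++t)
            threads.emplace_back([&, t] { results[t] = dot(x, y, n); });
        for (std::thread &th : threads)
            th.join();
    }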


Conclusions


The usual methods of speeding up computations work on Elbrus just fine, so in this respect programming for it does not require any special knowledge or skills. In the elementary code optimization examples we considered, Elbrus performed remarkably well given its 800 MHz clock frequency and a power consumption half that of the Core i7.


P.S. We have also got our passport recognition system running on SPARC! This means we can now write articles about recognition on yet another computing architecture.


Update. An error crept into the results of Example 1; thanks to VLev's comments we found and corrected it. The results have been updated.


References


[1] Henry S. Warren, Jr. Hacker's Delight (Russian edition: Algorithmic Tricks for Programmers). Moscow: Williams, 2014. ISBN 978-5-8459-1838-3. 512 pp.
[2] http://www.elbrus.ru/arhitektura_elbrus .
[3] http://www.mcst.ru/mikroprocessor-elbrus4s .
[4] http://www.mcst.ru/vysokoproizvoditelnye_biblioteki .
[5] Kris Kaspersky. Code Optimization: Effective Memory Usage. St. Petersburg: BHV-Petersburg, 2003. ISBN 5-94157-232-8. 464 pp.



Source: https://habr.com/ru/post/317672/

