
Code optimization for the Elbrus platform using simple examples

"Usually a hacker does not write programs for the benefit of
but for your own pleasure. Such a program
may be useful or may remain
just a game of intelligence. "
Henry S. Warren. Algorithmic tricks for programmers [1]


Today we continue our notes on Elbrus. The first article, dedicated to getting our passport recognition system running on Elbrus and optimizing it, can be found here.




At some point, my colleagues and I became curious about how the simplest optimization techniques perform on Elbrus.


The Elbrus processor has a VLIW (Very Long Instruction Word) architecture, that is, it operates on “wide” instruction words. This means that the lcc compiler analyzes the source code at compile time, determines the dependencies between instructions, and packs them into wide instruction words. One such word can hold up to 23 operations that are executed simultaneously; with SIMD (Single Instruction Multiple Data) this number can grow to 33 or more. The instructions within a wide word execute in parallel, keeping all 6 arithmetic logic units on each processor busy. Parallelization and the loading of all the execution units fall entirely on the shoulders of the optimizing compiler, which makes it possible to significantly simplify the hardware for instruction analysis and execution and to reduce power consumption to 45 W for the Elbrus-4C [2, 3].


We at Smart Engines wondered how the usual optimizations, such as, for example, loop unrolling, would behave on such an unusual platform with a “smart” compiler.


We looked at simple C++ examples and compared the results on the Elbrus-4C and the Intel Core i7-6700K. On Elbrus we used the lcc compiler version 1.20.17; for the Core i7, the Microsoft Visual Studio Community 2015 compiler. For lcc the compilation flags -O3 and -ffast-math were used.


Let us start with the specifications of the Elbrus-4C and the Intel Core i7-6700K:


 | Elbrus-4C | Intel Core i7-6700K
Architecture | Elbrus | Skylake-S
Frequency, GHz | 0.8 | 4
Number of cores | 4 | 4 (8 with Hyper-Threading)
Process technology | 65 nm | 14 nm
L1 cache, data | 64 KB | 32 KB
L1 cache, instructions | 128 KB | 32 KB
L2 cache | 8 MB | 1 MB
L3 cache | - | 8 MB
RAM type | DDR3-1600 | DDR4-2133
Power consumption, W | 45 | 91

The clock frequencies of these processors differ significantly: 800 MHz for the Elbrus-4C versus 4 GHz for the Intel Core i7. Also note that Elbrus has a different cache structure: there is no L3 cache, but the L2 cache is 8 MB (2 MB per core), whereas the Core i7 under consideration has 1 MB (0.25 MB per core). The L1 caches on Elbrus are also larger, especially the instruction cache (128 KB vs. 32 KB).


Example 1. Loop unrolling


This optimization aims to reduce the number of loop iterations (and hence the number of exit-condition checks) by enlarging the loop body. It is well suited for the simple loops that occur in virtually every program.


Now consider an example:


    #define N 64000

    unsigned int x = 0;
    unsigned int a[N];
    for (int i = 0; i < N; ++i)
        a[i] = i;

    // the loop we will try to unroll
    for (int k = 0; k < N;) {
        x += a[k];
        a[k++] = k;
    }

We tried unrolling the last loop (a sketch of a possible unrolled variant is shown below). The timing results (for 10,000 repetitions of the loop) are given in Table 1. Note that on Elbrus the 32-bit addressing mode was used (compiler flag -mpr32). We also converted the running time into its equivalent at 1 GHz by multiplying the measured time by the clock frequency in GHz (for example, 255 ms on the 4 GHz Core i7 becomes 255 × 4 = 1020). The values obtained this way are proportional to the number of clock cycles spent and make it possible to compare the Elbrus and the Core i7 while accounting for the difference in clock frequency.
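
For reference, here is what a manually unrolled variant with an unroll factor of 4 might look like (a sketch; the article does not include the unrolled code that was actually benchmarked):

    for (int k = 0; k < N;) {  // N is a multiple of 4
        x += a[k]; a[k++] = k;
        x += a[k]; a[k++] = k;
        x += a[k]; a[k++] = k;
        x += a[k]; a[k++] = k;
    }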


Table 1. Running time as a function of N, the number of unrolled iterations.


N | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
1 | 401 | 320 | 255 | 1020
2 | 400 | 320 | 275 | 1100
4 | 401 | 320 | 261 | 1044
8 | 401 | 320 | 247 | 988
16 | 401 | 320 | 361 | 1444
32 | 452 | 362 | 262 | 1048
64 | 451 | 362 | 262 | 1048

It can be seen that unrolling the loop in this example gives no gain in running time on either the modern Core i7 or the Elbrus-4C. For the very simple loop we considered, the Elbrus-4C works more efficiently than the Core i7 once the frequency ratio is taken into account.


Example 2. Processing data of various lengths


In this example we process an array in elements of 1, 4, or 8 bytes. The original array was aligned to 8 bytes.


    #define MASK1 0xF1
    #define MASK4 0xF1F2F3F4
    #define MASK8 0xF1F2F3F4F5F6F7F8

    // 1-byte elements; MASK4 and MASK8 are used in the 4- and 8-byte variants
    for (k = 0; k < n; ++k) {
        x8[k] &= MASK1;
    }
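
The article shows only the 1-byte loop; the 4- and 8-byte variants presumably look like the sketch below, which continues the listing above (the array names x8, x32, x64 and the reinterpretation of the same 8-byte-aligned buffer are our assumptions):

    // the same n-byte buffer viewed as wider elements
    uint32_t *x32 = reinterpret_cast<uint32_t *>(x8);
    uint64_t *x64 = reinterpret_cast<uint64_t *>(x8);

    // 4-byte elements: n / 4 iterations instead of n
    for (k = 0; k < n / 4; ++k) {
        x32[k] &= MASK4;
    }

    // 8-byte elements: n / 8 iterations
    for (k = 0; k < n / 8; ++k) {
        x64[k] &= MASK8;
    }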

The results of the time measurements are shown in Table 2.


Table 2. Running time depending on N, the element size in bytes.


N | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
1 | 2400 | 1920 | 811 | 3244
4 | 600 | 480 | 201 | 804
8 | 300 | 240 | 102 | 408

It can be seen that processing the data in 4- and 8-byte elements is faster on both the modern Core i7 and the Elbrus-4C, with the running time decreasing in proportion to the element size. Moreover, Elbrus works more efficiently than the Core i7 once the frequency ratio is taken into account.


Example 3. Using SIMD


In this example we decided to try intrinsics and looked at the dot product of float arrays with n = 12800.
Unoptimized loop:


    float result = 0.0;
    for (int i = 0; i < n; ++i) {
        result += x[i] * y[i];
    }

Using SSE:


    float *pX = x;
    float *pY = y;
    __m128 Sum = _mm_setzero_ps();
    int num = n / 4;
    for (int i = 0; i < num; ++i, pX += 4, pY += 4) {
        Sum = _mm_add_ps(Sum, _mm_mul_ps(_mm_load_ps(pX), _mm_load_ps(pY)));
    }
    // horizontal sum of the four partial sums
    float sums[4];
    _mm_storeu_ps(sums, Sum);
    float result = sums[0] + sums[1] + sums[2] + sums[3];
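
Note that _mm_load_ps expects 16-byte-aligned pointers. The article does not show how x and y are allocated; one possible way to obtain suitably aligned buffers is the following sketch (not the authors' code):

    #include <xmmintrin.h>  // _mm_malloc / _mm_free

    const int n = 12800;
    float *x = static_cast<float *>(_mm_malloc(n * sizeof(float), 16));
    float *y = static_cast<float *>(_mm_malloc(n * sizeof(float), 16));
    // ... fill x and y, run the loops above ...
    _mm_free(y);
    _mm_free(x);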

Using EML [4] (an optimized library for Elbrus):


    double result;
    eml_Vector_DotProd_32F(x, y, n, &result);

The timing results are shown in Table 3. The SIMD registers on the Elbrus-4C are 64 bits wide (instruction set version 3), which on the whole matches the observed 2x speedup between the version without optimization and the SIMD version. On the Core i7 the numbers are also plausible: we operated on 128-bit registers, which hold four floats. It is also noticeable that without intrinsics Elbrus works more efficiently than the Core i7 once frequency is taken into account, while with intrinsics the running times are almost the same.


Table 3. Running time of the dot product computation.


Variant | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
No optimization | 263 | 210 | 99 | 396
With SIMD | 110 | 88 | 24 | 96

Example 4. Counting the Hamming distance between two arrays


Here we calculated the Hamming distance between the binary representations of two arrays, i.e. we took the Hamming distances between the binary representations of the corresponding numbers in the arrays and summed them. For arrays of 8-bit data we used a bitwise XOR and a precomputed distance table:


    int result = 0;
    for (int i = 0; i < n; ++i) {
        result += popcnt_table[x[i] ^ y[i]];
    }

For 32-bit and 64-bit data we used a bitwise XOR together with the population count intrinsics _mm_popcnt_u32 and _mm_popcnt_u64 on Intel and the corresponding popcount intrinsics on Elbrus. The total length of the x and y arrays in bytes remained unchanged and equal to n = 512. The timing results are shown in Table 4.
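
For the 32- and 64-bit cases the inner loop might look like the following sketch (the function names and pointer types are our assumptions; on Elbrus the corresponding popcount intrinsics would be substituted):

    #include <cstdint>
    #include <nmmintrin.h>  // _mm_popcnt_u32 / _mm_popcnt_u64 (SSE4.2)

    // Hamming distance over the buffers viewed as 32-bit words
    int hamming32(const uint32_t *x, const uint32_t *y, int words) {
        int result = 0;
        for (int i = 0; i < words; ++i)
            result += _mm_popcnt_u32(x[i] ^ y[i]);
        return result;
    }

    // ... and viewed as 64-bit words
    int hamming64(const uint64_t *x, const uint64_t *y, int words) {
        int result = 0;
        for (int i = 0; i < words; ++i)
            result += static_cast<int>(_mm_popcnt_u64(x[i] ^ y[i]));
        return result;
    }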


Table 4. Hamming distance computation time depending on N, the element size in bytes.


N | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
1 | 630 | 504 | 155 | 620
4 | 110 | 88 | 47 | 188
8 | 76 | 61 | 15 | 60

In this example we see that the intrinsics for counting set bits in 32-bit and 64-bit registers give a significant speedup over the version with a precomputed table, on both Elbrus and Core i7. Moreover, the 32-bit popcnt on Elbrus is faster than on the Core i7 once the frequency ratio is taken into account, while in the 64-bit case the times at 1 GHz on the Core i7 and Elbrus are equal.


Example 5. Eliminating data dependencies


This example is borrowed from Kris Kaspersky's book “Code Optimization: Effective Memory Usage” [5]. It demonstrates how eliminating a data dependency helps improve performance. The array is filled with zeros, n = 2097152.
Code with a data dependency:


    int x = 0;
    for (size_t a = 0; a < BLOCK_SIZE; a += 8 * sizeof(int)) {
        x = *(int *)((char *)p1 + a + 0 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 1 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 2 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 3 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 4 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 5 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 6 * sizeof(int)); a += x;
        x = *(int *)((char *)p1 + a + 7 * sizeof(int)); a += x;
    }

Each subsequent address depends on the value loaded by the previous instruction, so the loads from memory are forced to execute one after another, each waiting for the previous one to complete.
Now the code without the dependency:


    int x = 0;
    for (size_t a = 0; a < BLOCK_SIZE; a += 8 * sizeof(int)) {
        x += *(int *)((char *)p2 + a + 0 * sizeof(int));
        x += *(int *)((char *)p2 + a + 1 * sizeof(int));
        x += *(int *)((char *)p2 + a + 2 * sizeof(int));
        x += *(int *)((char *)p2 + a + 3 * sizeof(int));
        x += *(int *)((char *)p2 + a + 4 * sizeof(int));
        x += *(int *)((char *)p2 + a + 5 * sizeof(int));
        x += *(int *)((char *)p2 + a + 6 * sizeof(int));
        x += *(int *)((char *)p2 + a + 7 * sizeof(int));
    }

The timing results are shown in Table 5. Eliminating the data dependency pays off on both Elbrus and Core i7: the running time on the Core i7 changes by a factor of about 11, and on Elbrus by almost 20! With the data dependency the code ran more slowly on Elbrus than on the Core i7 even in terms of time at 1 GHz, but without the dependency Elbrus is only 4 times slower than the Core i7 (with a 5-fold difference in clock frequency). Such results on Elbrus can be explained by its asynchronous array prefetch mechanism (array prefetch buffer), which works efficiently when the fetched elements are laid out sequentially.


Table 5. Time to read dependent and independent data.


Variant | Elbrus-4C: time, ms | Elbrus-4C: time at 1 GHz | Core i7: time, ms | Core i7: time at 1 GHz
Dependent data | 605 | 484 | 87 | 348
Independent data | 32 | 26 | 8 | 32

Example 6. Multi-threaded calculations


Of course, we could not leave out such an optimization technique as parallelization. For the purity of the experiment we took completely independent tasks (computing the dot product of two arrays of type double). Table 6 shows the time of running N tasks sequentially (T_seq), the time of running N tasks in N threads (T_par), and the speedup E:


Table 6. Time of sequential and parallel computation of the dot product depending on the number of tasks and threads N.


N | Elbrus-4C: T_seq, ms | Elbrus-4C: T_par, ms | Elbrus-4C: E = T_seq / T_par | Core i7: T_seq, ms | Core i7: T_par, ms | Core i7: E = T_seq / T_par
2 | 2628 | 1316 | 2.00 | 1033 | 500 | 2.07
4 | 5259 | 1318 | 3.99 | 1994 | 500 | 3.99
8 | 10513 | 2634 | 3.99 | 3987 | 503 | 7.93
16 | 21045 | 5268 | 4.00 | 7980 | 1009 | 7.91
20 | 26321 | 6583 | 4.00 | 9967 | 1263 | 7.89
32 | 42053 | 10535 | 3.99 | 15948 | 2014 | 7.92
40 | 52566 | 13170 | 3.99 | 19936 | 2528 | 7.89

It can be seen that on the Core i7 a speedup of almost 8x is reached at 8 threads and then changes little, while on Elbrus a 4x speedup is reached at 4 threads and likewise does not change as the number of threads grows. The ratio of the running times between Elbrus and the Core i7 turned out to be approximately 2.6, while the frequency ratio is 5.
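
The article does not show the benchmarking code; below is a minimal sketch of how N independent dot-product tasks can be run in N threads with std::thread (the function names and the absence of timing code are our assumptions, not the authors' exact benchmark):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // one independent task: dot product of two double arrays
    static double dot(const double *x, const double *y, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }

    // run num_tasks identical, independent tasks in num_tasks threads
    void run_parallel(const double *x, const double *y, std::size_t n,
                      int num_tasks, std::vector<double> &results) {
        results.assign(num_tasks, 0.0);
        std::vector<std::thread> threads;
        threads.reserve(num_tasks);
        for (int t = 0; t < num_tasks; ++t)
            threads.emplace_back([&, t] { results[t] = dot(x, y, n); });
        for (std::thread &th : threads)
            th.join();
    }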


Conclusions


The usual methods of speeding up computations work on Elbrus just fine, so in this respect programming for it does not require any special knowledge or skills. In the elementary code optimization examples we considered, Elbrus performed remarkably well given its 800 MHz clock frequency and a power consumption half that of the Core i7.


P.S. We have also got our passport recognition system running on SPARC! This means we can now write articles about recognition on yet another computing architecture.


Update. An error crept into the results of Example 1; thanks to VLev's comments we found and corrected it. The results have been updated.


References


[1] Henry S. Warren, Jr. Hacker's Delight (Russian edition: Algorithmic Tricks for Programmers). Moscow: Williams, 2014. ISBN 978-5-8459-1838-3. 512 pp.
[2] http://www.elbrus.ru/arhitektura_elbrus .
[3] http://www.mcst.ru/mikroprocessor-elbrus4s .
[4] http://www.mcst.ru/vysokoproizvoditelnye_biblioteki .
[5] Kris Kaspersky. Code Optimization: Effective Memory Usage. St. Petersburg: BHV-Petersburg, 2003. ISBN 5-94157-232-8. 464 pp.



Source: https://habr.com/ru/post/317672/

