→ Original article
Posted by: Martin Krastev
A friend of mine drew my attention to an interesting article on habrahabr.ru - a Russian translation of the article by Daniel Lemire. Quick removal of spaces from lines on ARM processors . This article intrigued me for two reasons: first, someone actually spent the time and effort to find the optimal solution to a common problem on a non-x86 architecture (hurray!), And second, the author gave some results at the end of the article puzzled me: about 6-fold advantage for Intel? The author made an unequivocal conclusion that ARM is very far from the ratio of "efficiency per cycle" to "big hardware" from Intel in this simple task.
Challenge accepted!
But let's start from the beginning! The author began with a certain baseline - a consistent implementation, so I also decided to start from there and move upwards. Let's call this basis testee00
for greater confusion:
inline void testee00() { size_t i = 0, pos = 0; while (i < 16) { const char c = input[i++]; output[pos] = c; pos += (c > 32 ? 1 : 0); } }
I ran testee00
on several amd64 processors and one arm64 processor using different versions of the GCC and Clang compiler, always taking as a basis the best compilation result. Below are the tact / character ratio calculated by perf -e cycles
divided by the number of characters processed (in our case, 5 10 ^ 7 16) and truncated to the 4th digit after the decimal point:
CPU | Compiler & flags | clocks / character |
---|---|---|
Intel Xeon E5-2687W (SNB) | g ++ - 4.8 -Ofast | 1.6363 |
Intel Xeon E3-1270v2 (IVB) | g ++ - 5.1 -Ofast | 1.6186 |
Intel i7-5820K (HSW) | g ++ - 4.8 -Ofast | 1.5223 |
AMD Ryzen 7 1700 (Zen) | g ++ - 5.4 -Ofast | 1.4113 |
Marvell 8040 (Cortex-A72) | g ++ - 5.4 -Ofast | 1.3805 |
Table 1. testee00
on desktop cores
Interesting isn't it? A small phone chip (3-Decoder OoO) really gives a better tact / symbol ratio than a larger desktop chip (at the end of this article you can see the actual statistics).
So let's go to SIMD. I do not pretend to be considered an experienced coder for NEON, but sometimes I bother with ARM SIMD. I will not inline the SIMD routines into the main part of this article so as not to scare the reader away; Instead, the entire test code and participating test procedures can be found at the end of this article.
I took the liberty to change Daniel’s original SSSE3 pruning procedure — in fact, I used my version for the test. Cause? I can’t just take 2 ^ 16 * 2 ^ 4 = 1 MB of the lookup table in my code - this would be a big cache eater for any scenarios where we not only cut off ascii threads, but calling the subroutine makes other work easier. The LSS-less SSSE3 version comes with a price of a few calculations, but it works only on registers, and, as you will see, the price excluding the table is not prohibitive, even with intensive cropping loads. Moreover, both the new version of SSSE3 and the version of NEON (ASIMD2) use the same algorithm, so the comparison is as direct as physically possible:
CPU | Compiler & flags | clocks / character |
---|---|---|
Intel Xeon E5-2687W (SNB) | g ++ - 4.8 -Ofast -mssse3 | .4230 |
Intel Xeon E3-1270v2 (IVB) | g ++ - 5.4 -Ofast -mssse3 | .3774 |
Marvell 8040 (Cortex-A72) | g ++ - 5.4 -Ofast -mcpu = cortex-a57 | 1.0503 |
Table 2. testee01
on desktop cores
Note: Tuning the microarchitecture for the A57 is transferred to the arm64 build, since the default scheduler from the compiler is clearly worse in this version, as far as the NEON code is concerned, and A57 is a fairly “common” common denominator of ARMv8 when it comes to scheduling.
As you can see, the efficiency per tact is 2x in favor of Sandy Bridge - the core, which, with the same (or similar) fabnode, will be 4 times larger in area A72. So it's not so bad for phone chips; )
Bonus material: the same test on small arm64 and amd64 CP:
CPU | Compiler & flags | clocks / character, scalar | clocks / character, vector |
---|---|---|---|
AMD C60 (Bobcat) | g ++ - 4.8 -Ofast -mssse3 | 3.5751 | 1.8215 |
MediaTek MT8163 (Cortex-A53) | clang ++ - 3.6 -march = armv8-a -mtune = cortex-a53 -Ofast | 2.6568 | 1.7100 |
Table 3. testee00
on testee01
on entry-level cores
Xeon E5-2687W @ 3.10GHz
Scalar version
$ g++-4.8 prune.cpp -Ofast $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234 Performance counter stats for './a.out': 421.886991 task-clock (msec) # 0.998 CPUs utilized 1,309,087,898 cycles # 3.103 GHz 4,603,132,268 instructions # 3.52 insns per cycle 0.422602570 seconds time elapsed $ echo "scale=4; 1309087898 / (5 * 10^7 * 16)" | bc 1.6363
SSSE3 version (batch of 16, misaligned write)
$ g++-4.8 prune.cpp -Ofast -mssse3 $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234a Performance counter stats for './a.out': 109.063426 task-clock (msec) # 0.997 CPUs utilized 338,414,215 cycles # 3.103 GHz 1,052,118,398 instructions # 3.11 insns per cycle 0.109422808 seconds time elapsed $ echo "scale=4; 338414215 / (5 * 10^7 * 16)" | bc .4230
Xeon E3-1270v2 @ 1.60GHz
Scalar version
$ g++-5 -Ofast prune.cpp $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234 Performance counter stats for './a.out': 810.515709 task-clock (msec) # 0.999 CPUs utilized 1,294,903,960 cycles # 1.598 GHz 4,601,118,631 instructions # 3.55 insns per cycle 0.811646618 seconds time elapsed $ echo "scale=4; 1294903960 / (5 * 10^7 * 16)" | bc 1.6186
SSSE3 version (batch of 16, misaligned write)
$ g++-5 -Ofast prune.cpp -mssse3 $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234a Performance counter stats for './a.out': 188.995814 task-clock (msec) # 0.997 CPUs utilized 301,931,101 cycles # 1.598 GHz 1,050,607,539 instructions # 3.48 insns per cycle 0.189536527 seconds time elapsed $ echo "scale=4; 301931101 / (5 * 10^7 * 16)" | bc .3774
Intel i7-5820K
Scalar version
$ g++-4.8 -Ofast prune.cpp $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234 Performance counter stats for './a.out': 339.202545 task-clock (msec) # 0.997 CPUs utilized 1,204,872,493 cycles # 3.552 GHz 4,602,943,398 instructions # 3.82 insn per cycle 0.340089829 seconds time elapsed $ echo "scale=4; 1204872493 / (5 * 10^7 * 16)" | bc 1.5060
AMD Ryzen 7 1700
Scalar version
$ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234 Performance counter stats for './a.out': 356,169901 task-clock:u (msec) # 0,999 CPUs utilized 1129098820 cycles:u # 3,170 GHz 4602126161 instructions:u # 4,08 insn per cycle 0,356353748 seconds time elapsed $ echo "scale=4; 1129098820 / (5 * 10^7 * 16)" | bc 1.4113
Marvell ARMADA 8040 (Cortex-A72) @ 1.30GHz
Scalar version
$ g++-5 prune.cpp -Ofast $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234 Performance counter stats for './a.out': 849.549040 task-clock (msec) # 0.999 CPUs utilized 1,104,405,671 cycles # 1.300 GHz 3,251,212,918 instructions # 2.94 insns per cycle 0.850107930 seconds time elapsed $ echo "scale=4; 1104405671 / (5 * 10^7 * 16)" | bc 1.3805
ASIMD2 version (batch of 16, misaligned write)
$ g++-5 prune.cpp -Ofast -mcpu=cortex-a57 -mtune=cortex-a57 $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234 Performance counter stats for './a.out': 646.394560 task-clock (msec) # 0.999 CPUs utilized 840,305,966 cycles # 1.300 GHz 801,000,092 instructions # 0.95 insns per cycle 0.646946289 seconds time elapsed $ echo "scale=4; 840305966 / (5 * 10^7 * 16)" | bc 1.0503
ASIMD2 version (batch of 32, misaligned write)
$ clang++-3.7 prune.cpp -Ofast -mcpu=cortex-a57 -mtune=cortex-a57 $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234 Performance counter stats for './a.out': 1140.643640 task-clock (msec) # 0.999 CPUs utilized 1,482,826,308 cycles # 1.300 GHz 1,504,011,807 instructions # 1.01 insns per cycle 1.141241760 seconds time elapsed $ echo "scale=4; 1482826308 / (5 * 10^7 * 32)" | bc .9267
AMD C60 (Bobcat) @ 1.333GHz
Scalar version
$ g++-4.8 prune.cpp -Ofast $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234 Performance counter stats for './a.out': 2208.190651 task-clock (msec) # 0.997 CPUs utilized 2,860,081,604 cycles # 1.295 GHz 4,602,968,860 instructions # 1.61 insns per cycle 2.214173331 seconds time elapsed $ echo "scale=4; 2860081604 / (5 * 10^7 * 16)" | bc 3.5751
SSSE3 version (batch of 16, misaligned write)
$ clang++-3.5 prune.cpp -Ofast -mssse3 $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234a Performance counter stats for './a.out': 1098.519499 task-clock (msec) # 0.998 CPUs utilized 1,457,266,396 cycles # 1.327 GHz 1,053,073,591 instructions # 0.72 insns per cycle 1.101240320 seconds time elapsed $ echo "scale=4; 1457266396 / (5 * 10^7 * 16)" | bc 1.8215
MediaTek MT8163 (Cortex-A53) @ 1.50GHz (sans perf)
Scalar version
$ ../clang+llvm-3.6.2-aarch64-linux-gnu/bin/clang++ prune.cpp -march=armv8-a -mtune=cortex-a53 -Ofast $ time ./a.out alabalanica1234 real 0m1.417s user 0m1.410s sys 0m0.000s $ echo "scale=4; 1.417 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc 2.6568
ASIMD2 version (batch of 16, misaligned write)
$ ../clang+llvm-3.6.2-aarch64-linux-gnu/bin/clang++ prune.cpp -march=armv8-a -mtune=cortex-a53 -Ofast $ time ./a.out alabalanica1234 real 0m0.912s user 0m0.900s sys 0m0.000s $ echo "scale=4; 0.912 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc 1.7100
Martin Krastev.
Translation: Dmitry Alexandrov
Source: https://habr.com/ru/post/334142/
All Articles