Quick removal of spaces from strings on ARM processors - alternative analysis

→ Original article
Posted by: Martin Krastev

A friend of mine drew my attention to an interesting article on habrahabr.ru: a Russian translation of Daniel Lemire's article "Quick removal of spaces from strings on ARM processors". The article intrigued me for two reasons. First, someone had actually spent the time and effort to find an optimal solution to a common problem on a non-x86 architecture (hurray!). Second, some of the results the author gave at the end of the article puzzled me: a roughly 6-fold advantage for Intel? The author concluded unequivocally that, on this simple task, ARM is nowhere near the per-cycle efficiency of Intel's "big iron".

Challenge accepted!

But let's start from the beginning. The author started from a baseline, a sequential (scalar) implementation, so I decided to start there too and work my way up. Let's call this baseline testee00, just to add to the confusion:

    inline void testee00() {
        size_t i = 0, pos = 0;
        while (i < 16) {
            const char c = input[i++];
            output[pos] = c;
            pos += (c > 32 ? 1 : 0);
        }
    }
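For readers who want to try the baseline outside the test harness, here is a minimal sketch of the same loop as a free function. The name prune_scalar and its signature are assumptions made for illustration; the article's testee00 works on fixed 16-byte input and output buffers instead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical free-function form of the scalar baseline (name and signature
// are illustrative assumptions, not the article's code).
// Copies every byte greater than 32 (' ') from `in` to `out` and returns the
// number of bytes kept. `out` needs room for `len` bytes, because each byte
// is stored before we know whether it will be kept.
inline std::size_t prune_scalar(const char* in, std::size_t len, char* out) {
    assert(in != nullptr && out != nullptr);
    std::size_t pos = 0;
    for (std::size_t i = 0; i < len; ++i) {
        const char c = in[i];
        out[pos] = c;             // unconditional store
        pos += (c > 32 ? 1 : 0);  // advance only past kept bytes, no branch
    }
    return pos;
}
```

The store is unconditional and only the write position depends on the comparison, so the loop body contains no data-dependent branch.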

I ran testee00 on several amd64 CPUs and one arm64 CPU, using various versions of GCC and Clang and always taking the best compilation result. Below are the cycles per character, computed as the cycle count reported by perf stat -e cycles divided by the number of characters processed (in our case 5 * 10^7 * 16) and truncated to the fourth digit after the decimal point:
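The arithmetic behind each table entry can be sketched as follows. This is only an illustration, not part of the article's test code; the function name cycles_per_char and its parameters are assumptions. Integer division is used to mirror the truncation that bc performs with scale=4.

```cpp
#include <cassert>
#include <cstdint>

// Cycles per character: total cycles divided by (iterations * batch size),
// truncated (not rounded) to four decimal places.
// Name and parameters are illustrative assumptions.
// Note: cycles * 10000 must fit in 64 bits, which holds for the counts here.
inline double cycles_per_char(std::uint64_t cycles,
                              std::uint64_t iterations,
                              std::uint64_t batch) {
    assert(iterations > 0 && batch > 0);
    const std::uint64_t chars = iterations * batch;
    const std::uint64_t scaled = cycles * 10000u / chars;  // integer division truncates
    return static_cast<double>(scaled) / 1e4;
}
```

For example, cycles_per_char(1309087898, 50000000, 16) corresponds to the bc expression 1309087898 / (5 * 10^7 * 16) used for the Xeon E5-2687W scalar run.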

CPU	Compiler & flags	clocks / character
Intel Xeon E5-2687W (SNB)	g++-4.8 -Ofast	1.6363
Intel Xeon E3-1270v2 (IVB)	g++-5.1 -Ofast	1.6186
Intel i7-5820K (HSW)	g++-4.8 -Ofast	1.5223
AMD Ryzen 7 1700 (Zen)	g++-5.4 -Ofast	1.4113
Marvell 8040 (Cortex-A72)	g++-5.4 -Ofast	1.3805

Table 1. testee00 on desktop cores

Interesting, isn't it? A small phone chip (3-wide decode, out-of-order) actually delivers a better cycles-per-character ratio than the much larger desktop chips (the raw measurements are at the end of this article).

So let's move on to SIMD. I don't claim to be an experienced NEON coder, but I do dabble in ARM SIMD now and then. I will not inline the SIMD routines in the body of this article, so as not to scare the reader away; instead, the complete test code and the routines under test can be found at the end of this article.

I took the liberty of changing Daniel's original SSSE3 pruning routine; in fact, I used my own version for the test. Why? I cannot simply afford a 2^16 * 2^4 = 1 MB lookup table (LUT) in my code: it would be a major cache hog in any scenario where we don't just prune ASCII streams but call the routine as a helper for other work. The LUT-less SSSE3 version pays for this with a few extra computations, but it works entirely in registers and, as you will see, the price of dropping the table is not prohibitive even under intensive pruning workloads. Moreover, the new SSSE3 version and the NEON (ASIMD2) version use the same algorithm, so the comparison is as direct as physically possible:
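To make the trade-off concrete, here is a portable scalar emulation of table-free pruning for one 16-byte block. It is a sketch of the general idea only, not the SSSE3 or NEON routine used in the tests: the names prune16_lutless and kLutBytes are assumptions, and a real SIMD version would derive the shuffle control with vector instructions rather than a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size of a table indexed by a 16-bit keep mask, 16 control bytes per entry.
constexpr std::size_t kLutBytes = (std::size_t{1} << 16) * 16;
static_assert(kLutBytes == 1024 * 1024, "2^16 * 2^4 bytes is 1 MiB");

// Scalar emulation of table-free pruning of one 16-byte block (illustrative
// assumption, not the article's routine). The shuffle control is computed
// from the input itself instead of being loaded from a table.
inline std::size_t prune16_lutless(const std::uint8_t in[16], std::uint8_t out[16]) {
    std::uint8_t control[16];
    std::size_t kept = 0;
    // Step 1: record the source lane of every kept byte at the next free
    // destination slot (a running count of the keep mask).
    for (std::uint8_t lane = 0; lane < 16; ++lane) {
        const bool keep = static_cast<signed char>(in[lane]) > 32;  // signed compare, as in the scalar baseline
        control[kept] = lane;  // overwritten by the next lane if this one is dropped
        kept += keep ? 1 : 0;
    }
    // Step 2: apply the byte shuffle; unused destination lanes are zeroed.
    for (std::size_t dst = 0; dst < 16; ++dst) {
        out[dst] = dst < kept ? in[control[dst]] : 0;
    }
    return kept;
}
```

The control array here plays the role of the 16-byte entry that a table lookup would otherwise fetch; computing it costs a few operations per block but touches no memory beyond the block itself.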

CPU	Compiler & flags	clocks / character
Intel Xeon E5-2687W (SNB)	g++-4.8 -Ofast -mssse3	.4230
Intel Xeon E3-1270v2 (IVB)	g++-5.4 -Ofast -mssse3	.3774
Marvell 8040 (Cortex-A72)	g++-5.4 -Ofast -mcpu=cortex-a57	1.0503

Table 2. testee01 on desktop cores

Note: the arm64 build is tuned for the A57 microarchitecture because the compiler's default scheduling is clearly worse for this version as far as the NEON code is concerned, and the A57 is a fairly common denominator among ARMv8 cores when it comes to scheduling.

As you can see, the per-cycle efficiency is about 2.5x in favor of Sandy Bridge (0.4230 vs. 1.0503 cycles per character), a core that, at the same (or a similar) fabrication node, would take about four times the area of an A72. So things are not so bad for phone chips ;)

Bonus material: the same test on small arm64 and amd64 CPUs:

CPU	Compiler & flags	clocks / character, scalar	clocks / character, vector
AMD C60 (Bobcat)	g++-4.8 -Ofast -mssse3	3.5751	1.8215
MediaTek MT8163 (Cortex-A53)	clang++-3.6 -march=armv8-a -mtune=cortex-a53 -Ofast	2.6568	1.7100

Table 3. testee00 and testee01 on entry-level cores

Xeon E5-2687W @ 3.10GHz

Scalar version

    $ g++-4.8 prune.cpp -Ofast
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234
     Performance counter stats for './a.out':
            421.886991      task-clock (msec)    #    0.998 CPUs utilized
         1,309,087,898      cycles               #    3.103 GHz
         4,603,132,268      instructions         #    3.52  insns per cycle
           0.422602570 seconds time elapsed
    $ echo "scale=4; 1309087898 / (5 * 10^7 * 16)" | bc
    1.6363

SSSE3 version (batch of 16, misaligned write)

    $ g++-4.8 prune.cpp -Ofast -mssse3
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234a
     Performance counter stats for './a.out':
            109.063426      task-clock (msec)    #    0.997 CPUs utilized
           338,414,215      cycles               #    3.103 GHz
         1,052,118,398      instructions         #    3.11  insns per cycle
           0.109422808 seconds time elapsed
    $ echo "scale=4; 338414215 / (5 * 10^7 * 16)" | bc
    .4230

Xeon E3-1270v2 @ 1.60GHz

Scalar version

    $ g++-5 -Ofast prune.cpp
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234
     Performance counter stats for './a.out':
            810.515709      task-clock (msec)    #    0.999 CPUs utilized
         1,294,903,960      cycles               #    1.598 GHz
         4,601,118,631      instructions         #    3.55  insns per cycle
           0.811646618 seconds time elapsed
    $ echo "scale=4; 1294903960 / (5 * 10^7 * 16)" | bc
    1.6186

SSSE3 version (batch of 16, misaligned write)

    $ g++-5 -Ofast prune.cpp -mssse3
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234a
     Performance counter stats for './a.out':
            188.995814      task-clock (msec)    #    0.997 CPUs utilized
           301,931,101      cycles               #    1.598 GHz
         1,050,607,539      instructions         #    3.48  insns per cycle
           0.189536527 seconds time elapsed
    $ echo "scale=4; 301931101 / (5 * 10^7 * 16)" | bc
    .3774

Intel i7-5820K

Scalar version

    $ g++-4.8 -Ofast prune.cpp
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234
     Performance counter stats for './a.out':
            339.202545      task-clock (msec)    #    0.997 CPUs utilized
         1,204,872,493      cycles               #    3.552 GHz
         4,602,943,398      instructions         #    3.82  insn per cycle
           0.340089829 seconds time elapsed
    $ echo "scale=4; 1204872493 / (5 * 10^7 * 16)" | bc
    1.5060

AMD Ryzen 7 1700

Scalar version

    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234
     Performance counter stats for './a.out':
            356,169901      task-clock:u (msec)  #    0,999 CPUs utilized
            1129098820      cycles:u             #    3,170 GHz
            4602126161      instructions:u       #    4,08  insn per cycle
           0,356353748 seconds time elapsed
    $ echo "scale=4; 1129098820 / (5 * 10^7 * 16)" | bc
    1.4113

Marvell ARMADA 8040 (Cortex-A72) @ 1.30GHz

Scalar version

    $ g++-5 prune.cpp -Ofast
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234
     Performance counter stats for './a.out':
            849.549040      task-clock (msec)    #    0.999 CPUs utilized
         1,104,405,671      cycles               #    1.300 GHz
         3,251,212,918      instructions         #    2.94  insns per cycle
           0.850107930 seconds time elapsed
    $ echo "scale=4; 1104405671 / (5 * 10^7 * 16)" | bc
    1.3805

ASIMD2 version (batch of 16, misaligned write)

    $ g++-5 prune.cpp -Ofast -mcpu=cortex-a57 -mtune=cortex-a57
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234
     Performance counter stats for './a.out':
            646.394560      task-clock (msec)    #    0.999 CPUs utilized
           840,305,966      cycles               #    1.300 GHz
           801,000,092      instructions         #    0.95  insns per cycle
           0.646946289 seconds time elapsed
    $ echo "scale=4; 840305966 / (5 * 10^7 * 16)" | bc
    1.0503

ASIMD2 version (batch of 32, misaligned write)

    $ clang++-3.7 prune.cpp -Ofast -mcpu=cortex-a57 -mtune=cortex-a57
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234
     Performance counter stats for './a.out':
           1140.643640      task-clock (msec)    #    0.999 CPUs utilized
         1,482,826,308      cycles               #    1.300 GHz
         1,504,011,807      instructions         #    1.01  insns per cycle
           1.141241760 seconds time elapsed
    $ echo "scale=4; 1482826308 / (5 * 10^7 * 32)" | bc
    .9267

AMD C60 (Bobcat) @ 1.333GHz

Scalar version

    $ g++-4.8 prune.cpp -Ofast
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234
     Performance counter stats for './a.out':
           2208.190651      task-clock (msec)    #    0.997 CPUs utilized
         2,860,081,604      cycles               #    1.295 GHz
         4,602,968,860      instructions         #    1.61  insns per cycle
           2.214173331 seconds time elapsed
    $ echo "scale=4; 2860081604 / (5 * 10^7 * 16)" | bc
    3.5751

SSSE3 version (batch of 16, misaligned write)

    $ clang++-3.5 prune.cpp -Ofast -mssse3
    $ perf stat -e task-clock,cycles,instructions -- ./a.out alabalanica1234a
     Performance counter stats for './a.out':
           1098.519499      task-clock (msec)    #    0.998 CPUs utilized
         1,457,266,396      cycles               #    1.327 GHz
         1,053,073,591      instructions         #    0.72  insns per cycle
           1.101240320 seconds time elapsed
    $ echo "scale=4; 1457266396 / (5 * 10^7 * 16)" | bc
    1.8215

MediaTek MT8163 (Cortex-A53) @ 1.50GHz (sans perf)

Scalar version

    $ ../clang+llvm-3.6.2-aarch64-linux-gnu/bin/clang++ prune.cpp -march=armv8-a -mtune=cortex-a53 -Ofast
    $ time ./a.out alabalanica1234
    real    0m1.417s
    user    0m1.410s
    sys     0m0.000s
    $ echo "scale=4; 1.417 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
    2.6568

ASIMD2 version (batch of 16, misaligned write)

    $ ../clang+llvm-3.6.2-aarch64-linux-gnu/bin/clang++ prune.cpp -march=armv8-a -mtune=cortex-a53 -Ofast
    $ time ./a.out alabalanica1234
    real    0m0.912s
    user    0m0.900s
    sys     0m0.000s
    $ echo "scale=4; 0.912 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
    1.7100
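Since perf was not available on the MT8163, the two figures above are estimates derived from wall-clock time and the nominal clock. A minimal sketch of that estimate follows; the function name and parameters are assumptions, and the result is only as good as the assumption that the core ran at its nominal 1.5 GHz for the whole run.

```cpp
#include <cassert>
#include <cstdint>

// Estimated cycles per character when no cycle counter is available:
// wall-clock seconds times nominal clock frequency, divided by the number of
// characters processed. Name and parameters are illustrative assumptions.
inline double estimated_cycles_per_char(double seconds,
                                        double clock_hz,
                                        std::uint64_t iterations,
                                        std::uint64_t batch) {
    assert(seconds >= 0.0 && clock_hz > 0.0);
    assert(iterations > 0 && batch > 0);
    const double chars = static_cast<double>(iterations) * static_cast<double>(batch);
    return seconds * clock_hz / chars;
}
```

With the measured times, estimated_cycles_per_char(1.417, 1.5e9, 50000000, 16) and estimated_cycles_per_char(0.912, 1.5e9, 50000000, 16) correspond to the two bc expressions above. Unlike a perf cycle count, wall-clock time also includes process startup, so the estimate is slightly pessimistic.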

Martin Krastev.

Translation: Dmitry Alexandrov

Source: https://habr.com/ru/post/334142/
