
Suppose I give you a relatively long string and you want to remove all spaces from it. In ASCII, we can define spaces as the space character (' ') plus the line terminators ('\r' and '\n'). I am mostly interested in the algorithmic and performance questions, so let's simplify the task: delete all bytes with values less than or equal to 32.
In the previous article, where I asked how to remove spaces quickly, the best answer was vectorization using 128-bit registers (SSE4). It turned out to be 5-10 times faster than the naive approach.
Conveniently, all ARM processors have 128-bit vector registers, just like x64 processors. So can ARM processors handle this task as fast as x64 processors?
Let's first consider a fast scalar implementation:
size_t i = 0, pos = 0;
while (i < howmany) {
  char c = bytes[i++];
  bytes[pos] = c;
  pos += (c > 32 ? 1 : 0);
}
It removes all characters with values less than or equal to 32 and writes the data back. It works very quickly.
Is it possible to achieve even greater speed with vector instructions?
On x64 processors, the best strategy is to grab 16 bytes of data, quickly compare them against whitespace, and then extract a 16-bit mask (or bitset), one bit per character, indicating whether that position holds a whitespace character. Such a bitset is cheap to compute on x64 processors because there is a dedicated instruction (movemask). There is no such instruction on ARM processors: you have to emulate movemask with several instructions.
So, we cannot process data on ARM as on x86 processors. What can be done?
Just as with SSE4, we can quickly check whether the byte values are less than or equal to 32, and thus detect whitespace characters:
static inline uint8x16_t is_white(uint8x16_t data) {
  const uint8x16_t wchar = vdupq_n_u8(' ');
  uint8x16_t isw = vcleq_u8(data, wchar);
  return isw;
}
Now we can quickly check whether any of the 16 characters is whitespace, using just two instructions:
static inline uint64_t is_not_zero(uint8x16_t v) {
  uint64x2_t v64 = vreinterpretq_u64_u8(v);
  uint32x2_t v32 = vqmovn_u64(v64);
  uint64x1_t result = vreinterpret_u64_u32(v32);
  return result[0];
}
This suggests a useful strategy. Instead of comparing characters one by one, we can compare all 16 characters at once. If none of them is whitespace, we simply copy the 16 characters to the output and move on. Otherwise, we fall back to the slow scalar approach, but with the added advantage that we do not need to repeat the comparisons:
const uint8x16_t justone = vdupq_n_u8(1);
uint8x16_t vecbytes = vld1q_u8((uint8_t *)bytes + i);
uint8x16_t w = is_white(vecbytes);
uint64_t haswhite = is_not_zero(w);
w = vaddq_u8(justone, w); // whitespace lanes (0xFF) wrap to 0, the rest become 1
if (!haswhite) {
  vst1q_u8((uint8_t *)bytes + pos, vecbytes);
  pos += 16;
  i += 16;
} else {
  for (int k = 0; k < 16; k++) {
    bytes[pos] = bytes[i++];
    pos += w[k];
  }
}
The advantages of this approach are most pronounced when you expect many 16-byte runs free of whitespace. In practice, that is true for many applications.
I wrote a benchmark that estimates the cost of removing spaces, per character, on inputs with a small number of whitespace characters scattered randomly.
The source code is available, but you need an ARM processor to run it. I ran it on a 64-bit ARM processor (built from A57 cores). John Regehr has a few more benchmarks on the same machine. I believe the Nintendo Switch uses the same cores.
The published technical specifications are sparse. However, the processor runs at 1.7 GHz, as anyone can verify by running perf stat. Here is the number of cycles per character:
| | ARM | recent x64 |
|---|---|---|
| scalar | 2.4 cycles | 1.2 cycles |
| vectorized (NEON and SSE4) | 1.8 cycles | 0.25 cycles |
For comparison: on the x64 processor the scalar version uses about 1.2 cycles per character, while the ARM needs roughly twice as many cycles per character. That was to be expected, since the A57 cores can hardly compete with x64 in per-cycle performance. However, using SSE4 on an x64 machine I managed to reach 0.25 cycles per character, more than five times faster than ARM NEON.
Such a big difference comes down to a difference in algorithms. On x64 processors we use a movemask/pshufb combination, which takes very few instructions. The ARM NEON version is much weaker.
There are many nice things about ARM processors. The assembly code is much more elegant than comparable x86/x64 code, and even the ARM NEON instructions feel cleaner than SSE/AVX. However, for many tasks the complete lack of a movemask instruction may hold you back.
But maybe I am underestimating ARM NEON... Can you accomplish this task more efficiently than I did?
Note. The article was updated: as one commenter pointed out, 64-bit ARM processors can permute 16 bytes with a single instruction (tbl).