I recently read two articles about combinatorial code generation in C++ in the context of a linear search through a database: "Optimization possibilities in the C and C++ languages" and "Development and execution speeds unattainable in C".
After I started compiling the C++ code from the second article, I got curious: would I manage to write an analogue in C that runs faster, in the time that code takes to... compile? I did not manage it: the code compiled in 5 minutes, while the C analogue took a whole 15 minutes to write.
So, the problem statement: there is a structure with several fields, and a filter that checks, for each field independently, whether the value falls into a specified range (or skips the check for that field). We need code that performs this check against a fixed filter very quickly. The data is random, so the fewer conditional branches the better: branch prediction works poorly on random data.
The problem is solved in two stages:
1. Learn to replace comparison operations with something simpler, such as addition and bitwise operations.
2. Learn to combine several operations from step 1 together, for free.
Now the first point. There is a value b and a range [a, c]; we need to compute a <= b and b <= c without comparisons. For definiteness, let a, b, c be non-negative and fit into 7 bits, i.e. range from 0 to 127.
It is easy to see that the expression (128 - a) + b is guaranteed to fit into 8 bits; moreover, the 8th bit of the result equals 1 if and only if a <= b. For example, if a = 0, then in the value of the expression 128 + b the 8th bit is always 1; if a = 1, then in the value of the expression 127 + b the 8th bit is 1 if b is 1 or more, and so on.
Comparing b and c works almost the same way: the expression (127 - c) + b fits into 8 bits, and the 8th bit of the result is 0 if and only if b <= c.
So, hooray! Instead of computing a <= b, we will compute ((128 - a) + b) & 128. It would seem: why bother?..
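As a sanity check, here is a minimal sketch of this trick for 7-bit values; the function name `in_range7` is mine, not from the original code.

```c
/* Comparison-free range check for 7-bit values (0..127).
   Bit 7 (value 128) of (128 - a) + b is set if and only if a <= b,
   and bit 7 of (127 - c) + b is set if and only if b > c. */
static int in_range7(unsigned a, unsigned b, unsigned c)
{
    unsigned ge_a = ((128u - a) + b) & 128u;  /* nonzero iff a <= b */
    unsigned gt_c = ((127u - c) + b) & 128u;  /* nonzero iff b > c  */
    return ge_a && !gt_c;                     /* a <= b && b <= c   */
}
```

The point so far is only that the comparison result landed in a known bit position; the payoff comes in point two.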
Point two. Bit operations have a wonderful property that everyone knows: one instruction can perform many homogeneous bit operations at once. It is already 2013, and a processor without SSE is nowhere to be found, so let's assume that 64 bits at a time are definitely available.
Not everyone knows that arithmetic operations have a similarly beautiful property, if you are careful and prepare the data in advance. For example, suppose we have 32-bit arithmetic and 4 pairs of seven-bit unsigned numbers. Here is how they can all be added at once:
1. Add to each number an extra high bit, which is initialized to zero:
aaaaaaa -> 0aaaaaaa
2. Write 4 such groups of 8 bits sequentially:
0aaaaaaa0bbbbbbb0ccccccc0ddddddd
3. Add two such 32-bit numbers; we get 4 independent 8-bit addition results inside one 32-bit number.
4. If 7-bit arithmetic is wanted, the carry bits can then be cleared.
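The four steps above can be sketched as follows; the helper names `pack4x7` and `add4x7` are mine, and the lane layout simply follows the diagram above.

```c
#include <stdint.h>

/* Pack four 7-bit values into a 32-bit word, one spare (carry) bit per lane. */
static uint32_t pack4x7(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return (uint32_t)a << 24 | (uint32_t)b << 16 | (uint32_t)c << 8 | d;
}

/* Add four pairs of 7-bit numbers with a single 32-bit addition.
   Each lane sum is at most 127 + 127 = 254, so it stays within its
   8 bits and never carries into the neighboring lane. Clearing bit 7
   of every lane gives 7-bit (mod 128) results; keeping it would give
   the per-lane carry information instead. */
static uint32_t add4x7(uint32_t x, uint32_t y)
{
    return (x + y) & 0x7F7F7F7Fu;  /* reset carry bits -> 7-bit arithmetic */
}
```

Note that it is exactly the carry information discarded on the last line that the filter below keeps.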
It now becomes clear what needs to be done for our task:
1. Lay out the data in a 64-bit register so that each value has an extra high bit. By a lucky coincidence, the example from the article fits into 62 bits this way; there is even room left for one more single-bit field!
struct T_cash_account_row {
    unsigned reserved0:1;
    unsigned code:20;                 // 0 - 1000000
    unsigned carry_code:1;
    unsigned gender:1;                // 0 - 1
    unsigned carry_gender:1;
    unsigned age:7;                   // 0 - 100
    unsigned carry_age:1;
    unsigned reserved1:1;
    unsigned amount_of_money:20;      // 0 - 1000000
    unsigned carry_amount_of_money:1;
    unsigned height:9;                // 0 - 300
    unsigned carry_height:1;
};
Attention: this relies on knowing how a specific compiler lays out bit fields. Properly, the structure should be replaced with an unsigned long long, with the values packed into it via bit operations, but that would require changing the reference code. The code has become somewhat less cross-platform than it was; this is fixable.
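The portable alternative mentioned above could look like this sketch; the shift constants are mine and mirror the bit-field layout above, assuming low-to-high allocation of bit fields (as MSVC and gcc do on x86).

```c
#include <stdint.h>

/* Pack a row into a 64-bit word via shifts instead of bit fields.
   Layout, low to high: reserved0:1, code:20, carry:1, gender:1, carry:1,
   age:7, carry:1, reserved1:1, amount_of_money:20, carry:1, height:9, carry:1.
   The spare carry bit after each field starts out as zero. */
static uint64_t pack_row(uint32_t code, uint32_t gender, uint32_t age,
                         uint32_t amount_of_money, uint32_t height)
{
    return (uint64_t)code            << 1
         | (uint64_t)gender          << 22
         | (uint64_t)age             << 24
         | (uint64_t)amount_of_money << 33
         | (uint64_t)height          << 54;
}
```

With this representation the layout no longer depends on the compiler's bit-field allocation order.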
2. In advance, generate the two addends from numbers of the form (128 - a) and (128 - 1 - c) from the example above, substituting each field's width in bits for 128. To keep our sanity, we use macros.
#define FILL(enum, field, bits) \
    if (range_filters->use_filter[enum]) { \
        begin_add.field = (1 << bits) - range_filters->begin.field; \
        begin_add.carry_##field = range_filters->begin.field == 0; \
        end_add.field = (1 << bits) - 1 - range_filters->end.field; \
        mask_add.carry_##field = 1; \
    }
Note how carry_##field is filled for begin: bit fields perform arithmetic without the extra carry bit, so if begin == 0, the field itself must be set to zero and carry_##field to one. If an unsigned long long with bit operations is used instead of bit fields, you can simply write (1 << bits) - begin.
3. Add the 64-bit number containing our laid-out fields to the two control addends; the information we need ends up in the carry bits:
unsigned long long value = *(unsigned long long*)&array_ptr[i];
unsigned long long begin_result = (value + begin_add_i) ^ mask_add_i;
unsigned long long end_result = value + end_add_i;
In this piece of code we flagrantly violate strict aliasing (and to hell with it), and the XOR also replaces the 1s in the carry bits with 0s.
4. Check that all the carry bits of interest (those the filter should consider) are 1 after the first addition and 0 after the second.
if (((begin_result | end_result) & mask_add_i) == 0)
Since we replaced the 1s with 0s, it is enough to check that all bits of interest are zero.
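Steps 2-4 can be put together into a minimal end-to-end sketch. The layout here is a toy one of my own (two fields only: age:7 plus a carry bit in bits 0..7, height:9 plus a carry bit in bits 8..17), not the article's structure, but the begin/end/mask machinery is the same.

```c
#include <stdint.h>

#define AGE_SHIFT     0
#define AGE_BITS      7
#define HEIGHT_SHIFT  8
#define HEIGHT_BITS   9

static uint64_t pack(unsigned age, unsigned height)
{
    return (uint64_t)age << AGE_SHIFT | (uint64_t)height << HEIGHT_SHIFT;
}

/* Does the record match age in [a0, a1] and height in [h0, h1]?
   Both fields are filtered; a real filter would build begin/end/mask
   only for the fields it actually uses. */
static int match(uint64_t value,
                 unsigned a0, unsigned a1, unsigned h0, unsigned h1)
{
    uint64_t begin_add =                      /* (2^bits - begin) per field */
        ((uint64_t)((1u << AGE_BITS)    - a0) << AGE_SHIFT) |
        ((uint64_t)((1u << HEIGHT_BITS) - h0) << HEIGHT_SHIFT);
    uint64_t end_add =                        /* (2^bits - 1 - end) per field */
        ((uint64_t)((1u << AGE_BITS)    - 1 - a1) << AGE_SHIFT) |
        ((uint64_t)((1u << HEIGHT_BITS) - 1 - h1) << HEIGHT_SHIFT);
    uint64_t mask =                           /* the carry bit of each field */
        (uint64_t)1 << (AGE_SHIFT + AGE_BITS) |
        (uint64_t)1 << (HEIGHT_SHIFT + HEIGHT_BITS);

    /* begin: carry must be 1 (XOR turns that into 0); end: carry must be 0;
       so the OR of the two results must have no mask bits set. */
    uint64_t begin_result = (value + begin_add) ^ mask;
    uint64_t end_result = value + end_add;
    return ((begin_result | end_result) & mask) == 0;
}
```

Each lane sum stays within the field width plus its carry bit, so the two additions never spill into a neighboring field.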
5. Measure performance; we get a speedup of almost 2x over the C++ code that takes 5 minutes to compile.
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x64
Generated rows: 100000000
C-search took 0.778000 seconds.
C++-optimized search took 0.307000 seconds.
C-optimized-search took 0.171000 seconds.
And just in case, here is the entire code. It compiles under MSVC with the /TP switch (because it is not C89, neither before my changes nor after), and under gcc with no switches (by chance).
Of course, you cannot measure performance this way.
Of course, I got lucky with the 62 bits (and the author of the original article got lucky with the 4000 variants).
But let everyone draw their own conclusions.
Finally, I would note that a real production solution would, it seems, use an SoA data layout: instead of an array of structures, one array per field (with, for example, bit packing). Then, first, a linear scan reads less data; second, no compile-time combinatorics are needed; third, less code is needed; fourth, the number of fields and the structure of the query are easier to change dynamically. And of course, an actual production solution would use (oh, the horror) platform-dependent SIMD.
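For illustration, a minimal SoA sketch of my own (not the article's code), with two fields and a branchless scan; comparisons here compile to 0/1 flags that are ANDed and accumulated, so random data causes no mispredicted branches.

```c
#include <stdint.h>
#include <stddef.h>

/* One array per field instead of an array of structs;
   a scan touches only the fields the filter actually uses. */
struct rows_soa {
    uint8_t  *age;     /* 0..100 */
    uint16_t *height;  /* 0..300 */
    size_t    count;
};

static size_t count_matches(const struct rows_soa *rows,
                            unsigned a0, unsigned a1,
                            unsigned h0, unsigned h1)
{
    size_t n = 0;
    for (size_t i = 0; i < rows->count; ++i) {
        /* branchless accumulation: each comparison is a 0/1 flag */
        n += (rows->age[i] >= a0) & (rows->age[i] <= a1)
           & (rows->height[i] >= h0) & (rows->height[i] <= h1);
    }
    return n;
}
```

A loop of this shape is also exactly what compilers auto-vectorize most readily.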