Development and execution speeds not reachable in C

In the continuation of the article on cross-platform and cross-hardware optimization, using the example of a search task with a full pass through a table of 5 fields and 10,000,000 rows, and the inevitability of this task even during an index search, I will show how to speed up such a search 3.5-5.3 times using C ++ is independent of the hardware platform.
In the previous article, we were able to speed up the search by 1.3 times: GitHub.com
We will not simply describe the language constructs, but show the advantages of C ++ in solving one of the stages of the real problem.
We still write cross-platform for MSVC11 (MSVS2012) and GCC 4.7.2, and use C and the partially implemented C ++ 11 standard in them.
To simplify the understanding, we still write without an index search, but this solution will later be used in an index search.

1. Hardcore C ++ solution

This would be a really hardcore solution on C. This is because, unfortunately for C-developers, you have to create 3840 functions and never make a mistake in them. As long as C-developers write them, we will treat them with great respect, and I will show you how the C ++ compiler can write these 3840 functions for you. He will not do it quickly, but much faster and more accurately than C-developers will write them.
Everything is very simple, in the theory of a DBMS there is such a thing as a predicate selectivity ( selectivity ) - the percentage of matching rows of a table under one of the conditions (predicate). The lower the selectivity of the condition, the earlier it should be compared, and the sooner we understand that the line is not suitable. Those. let's get rid of comparisons for the following conditions.
In real DBMS, statistics for this are collected by the optimizer and the necessary formula for calculating the selectivity depending on the data distribution graph is selected. I think discrete mathematics and combinatorics with brute force fields are enough for you today, so for simplicity we will consider cardinality selectivity based on uniform data distribution. (Cardinality column - the percentage of unique values in the column.)
Selectivity (number of rows) will be equal to:
selectivity = c_array_size * (1 + field.end – field.begin) / cardinality;
The size of the array and cardinality for evenly distributed data are known to us at the compilation stage. And we find the limits of the search conditions for each field during execution and calculate the selectivity.
Then we select the fields in ascending order of selectivity starting from the lowest. And depending on the order of the selected field numbers, we form a numerical index of the most optimal function, which we then call to search.
We generate the optimized functions for each variant of the search conditions, or rather the polymorphic classes, with two nested promotion templates:

Brute force search options (as was the case with the previous C ++ solution )
Enumerate all variants of the order of search conditions

The total is created and filled with an array of (2^5)*5! = 32*120 = 3840 (2^5)*5! = 32*120 = 3840 objects with different implementations, but with a common base abstract ancestor.
Checks of the range for more and less for one field always go together. We will not rearrange the minimum and maximum values for one field independently or change the order so as not to delay the compilation time.
Now let's look at how the code will look.

2. Reducing dependencies

First, to implement our solution, we will have to implement access to the fields of the string through the template compile-time-getters. It will also add a nice bonus - reducing dependencies - now any changes in types and the number of fields in the code will affect only the structure of the T_cash_account_row. line T_cash_account_row.
The interface of our search class itself will remain unchanged, and the implementation of a search on an arbitrary table will be instantiated by a template search class.
This is how the line structure code with fields will look like:

 /* Row */ struct T_cash_account_row { // Fields: enum T_field_enum { amount_of_money_e, gender_e, age_e, code_e, height_e, /*<<<<<- add fields here (with index) */ last_e /*<<<<<- add included fields here (without index) */ }; static_assert(last_e > 0, "Number of indexed fields in enum of T_cash_account_row must be greater than 0!"); unsigned code:20; // 0 - 1000000 unsigned gender:1; // 0 - 1 unsigned age:7; // 0 - 100 unsigned amount_of_money:20; // 0 - 1000000 unsigned height:9; // 0 – 300 // Get field value template<int field_enum> inline typename std::enable_if<field_enum == code_e, decltype(code)>::type get_field_value() const { return code; } template<int field_enum> inline typename std::enable_if<field_enum == gender_e, decltype(gender)>::type get_field_value() const { return gender; } template<int field_enum> inline typename std::enable_if<field_enum == age_e, decltype(age)>::type get_field_value() const { return age; } template<int field_enum> inline typename std::enable_if<field_enum == amount_of_money_e, decltype(amount_of_money)>::type get_field_value() const { return amount_of_money; } template<int field_enum> inline typename std::enable_if<field_enum == height_e, decltype(height)>::type get_field_value() const { return height; } template<int field_enum> inline typename std::enable_if<field_enum == last_e, bool>::type get_field_value() const { return true; } static_assert(5 == last_e, "Add/delete new field-case and correct this assert!"); }; /* ----------------------------------------------------------------------- */

enum T_field_enum always have default values in ascending order starting with 0 ( C ++ 03 7.2 Enumeration declarations ).
The std::enable_if<> from the #include <type_traits> library allows you to get instances of functions / classes only for certain values of the template parameter. Otherwise, a substitution error occurs during instantiation, and according to the SFINAE principle (substitution-failure-is-not-an-error) , if there is a substitution error, no compilation error occurs, and an instance of the function is simply not created. More in detail in the standard ( C ++ 03 14.8.2 Template argument deduction ).
This is necessary to access the required fields of different types through the parameters of the getter template, for example:
auto code = row->get_field_value<T_cash_account_row::code_e>();
In our case, all 5 fields are of the same type and one could do without std::enable_if<> , but we think ahead. In my example, you can change the type of any of the 5 fields, and the getters will compile and work fine. Suppose if we made them through the specialization of the template function, we would get errors when accessing different types, for example, such code will not compile:

 /* Row */ struct T_cash_account_row { // Fields: enum T_field_enum { amount_of_money_e, gender_e, age_e, code_e, height_e, /*<<<<<- add fields here (with index) */ last_e /*<<<<<- add included fields here (without index) */ }; static_assert(last_e > 0, "Number of indexed fields in enum of T_user_row must be greater than 0!"); unsigned code:20; // 0 - 1000000 unsigned gender:1; // 0 - 1 unsigned age:7; // 0 - 100 unsigned amount_of_money:20; // 0 - 1000000 int height; // 0 – 300 // Get field value template<int field_enum> inline unsigned get_field_value(); template<int field_enum> inline int get_field_value(); static_assert(5 == last_e, "Add/delete new field-case and correct this assert!"); }; template<> inline unsigned T_cash_account_row::get_field_value<T_cash_account_row::code_e>() { return code; } template<> inline unsigned T_cash_account_row::get_field_value<T_cash_account_row::gender_e>() { return gender; } template<> inline unsigned T_cash_account_row::get_field_value<T_cash_account_row::age_e>() { return age; } template<> inline unsigned T_cash_account_row::get_field_value<T_cash_account_row::amount_of_money_e>() { return amount_of_money; } template<> inline int T_cash_account_row::get_field_value<T_cash_account_row::height_e>() { return height; } template<> inline unsigned T_cash_account_row::get_field_value<T_cash_account_row::last_e>() { return 1; } /* ----------------------------------------------------------------------- */ int main() { T_cash_account_row *row = new T_cash_account_row; auto code = row->get_field_value<T_cash_account_row::code_e>(); return 0; }

The compiler will naturally generate an error about an ambiguous get_field_value<>() function call. Ideone.com
Our version is perfectly compiled: ideone.com
You have figured out the getters, now let's apply the unrolling of template similar to the one from the previous article, but not for the designer, but for the functor (the functor is a class with the overloaded call operator()(…) ):

 // The templated functor of unrolling of the input templated functor (reusable) template<unsigned unroll_count, template<unsigned> class T> struct T_unroll_functor { T_unroll_functor<unroll_count-1, T> next_unroll; T<unroll_count-1> functor; template<typename T1, typename T2> inline bool operator()(T1 const& val1, T2 const& val2) { return next_unroll(val1, val2) && functor(val1, val2); } }; // End of unroll template<template<unsigned> class T> struct T_unroll_functor<0, T> { template<typename T1, typename T2> inline bool operator()(T1 const&, T2 const&) { return true; } }; // -------------------------------------------------------------------------

last_e such promotion, we expand the comparisons for each field, based on the value of the last_e enumeration and access to the fields through our sample getters.
And we will expand the functor struct T_test_pred as follows:

 // The filters for each search variant (range_filters) template<typename T_row, unsigned index_pred> struct T_custom_filter : T_filter<T_row> { // Test predicates functor for unrolling template<unsigned field_num> struct T_test_pred { bool inline operator()(T_row *const __restrict row, T_range_filters *const __restrict range_filters) const { typedef typename T_row::T_field_enum T_field_enum; // Without fields where use_filter==0 return ( T_filter<T_row>::template T_get_use_filter<index_pred, field_num>::value || (row->template get_field_value<static_cast<T_field_enum>(field_num)>() >= range_filters->begin.template get_field_value<static_cast<T_field_enum>(field_num)>()&& row->template get_field_value<static_cast<T_field_enum>(field_num)>() <= range_filters->end.template get_field_value<static_cast<T_field_enum>(field_num)>()) ); } }; // ----------------------------------------------------------------------- // search virtual size_t search(T_row *const __restrict array_ptr, const size_t c_array_size, T_row *const __restrict result_ptr, T_range_filters *const __restrict range_filters) final { size_t result_size = 0; size_t i; // loop index T_unroll_functor<T_row::last_e, T_test_pred> test_predicate; for(i = 0; i < c_array_size; ++i) { if(test_predicate(array_ptr + i, range_filters)) result_ptr[result_size] = array_ptr[i], ++result_size; } return result_size; } }; // -------------------------------------------------------------------------

I also note that now in the T_filter<>, T_custom_filter<> T_optimized_search<> T_row , the template parameter T_row , into which the row type of the table is transferred - in our case T_cash_account_row .
Now all changes in the number, names and types of table fields are concentrated solely inside T_cash_account_row , and the entire search code is generated automatically from the template classes; you just need to instantiate the template with the type of the necessary string:
T_optimized_search<T_cash_account_row> optimized_search; // C++ optimized search
Here is the link to the resulting working code: GitHub.com
This example clearly shows why in critical parts of the code team-lead and performance-architect should pay attention not only to the external interfaces of the modules, but also to the types of objects transmitted through these interfaces. In our case, these are: T_cash_account_row T_range_filters . Having set the opportunity in T_cash_account_row pass parameters through template getters, we predetermine a more flexible implementation by the developer of the module through template promotion, as well as a more flexible implementation of tests when using development techniques through TDD testing, which I will show in the next article.
On the MSVC11 compiler (MSVS2012) and one CPU Core i5 K750 core, the result is:

Generated rows: 10,000,000
C ++ - Searching ...
C ++ - optimized search took 0.056000 seconds.
Found rows: 38
C-Searching ...
C-search took 0.074000 seconds.
The C ++ faster than C: 1.321429 times
Found rows: 38

As you can see, the C ++ code is still 1.3 times faster than the C code. The resulting assembler code of the test_predicate function is still identical to the previous C ++ version, test_pradicate_enum_cpp.asm .
“Picture disasm of TortoiseDiff”

But now we have focused all the changes in one place, which will greatly facilitate the use of this search module.
In addition, we prepared the code for another, more efficient optimization.

3. Acceleration 3.5 - 5.3 times

Quite often there are articles on C ++, where compile-time calculations are considered or, for example, the search for primes on templates. The articles are interesting, but as the authors themselves note: “I will tell you how to do a completely useless thing.”
I will tell you how to get a noticeable acceleration of the search from compile-time calculations and optimizations using templates.
')

3.1 Simple calculations on templates

Let's start with the simplest of what we need, let's implement the compile-time factorial calculation:

 template <unsigned N> struct factorial : std::integral_constant<unsigned, N * factorial<N-1>::value> {}; template <> struct factorial<0> : std::integral_constant<unsigned, 1> {};

Calculation of factorial we need to create a static array std::array<> with the number of elements equal to (2^5)*5! = 3840 (2^5)*5! = 3840 .
From discrete mathematics and combinatorics it is known that the number of different permutations of unique elements is equal to the factorial of their number. A number of options for the presence of elements is 2 in the degree of the number of elements. (Of course, it doesn’t matter to us how the missing elements are rearranged, and the absence of redundant options could speed up the compilation, but I try to simplify this article as much as possible.)
Now let's see what optimizations will be inside the structure.
struct T_custom_filter when instantiating it.
First, it now also takes the index_order parameter. And secondly, the searchable functor has changed in it.

 // The filters for each search variant (range_filters) template<typename T_row, unsigned index_pred, unsigned index_order = 0> struct T_custom_filter : T_filter<T_row> { // Test predicates functor for unrolling template<unsigned field_num> struct T_test_pred { bool inline operator()(T_row *const __restrict row, T_range_filters *const __restrict range_filters) const { typedef typename T_row::T_field_enum T_field_enum; // Without fields where use_filter==0 enum { ordered_field_number = T_filter<T_row>::template T_number_field<index_order, field_num>::value }; return ( T_filter<T_row>::template T_get_use_filter<index_pred, ordered_field_number>::value || (row->template get_field_value<static_cast<T_field_enum>(ordered_field_number)>() >= range_filters->begin.template get_field_value<static_cast<T_field_enum>(ordered_field_number)>() && row->template get_field_value<static_cast<T_field_enum>(ordered_field_number)>() <= range_filters->end.template get_field_value<static_cast<T_field_enum>(ordered_field_number)>()) ); } }; // ----------------------------------------------------------------------- // search virtual size_t search(T_row *const __restrict array_ptr, const size_t c_array_size, T_row *const __restrict result_ptr, T_range_filters *const __restrict range_filters) final; }; // -------------------------------------------------------------------------

As before, the T_get_use_filter template returns a flag of using (or not using) a filter by the number_filter field for a given index_pred index.

 // whether to use the filter for a given predicate? (strictly at compile time) template <unsigned index_pred, unsigned number_filter> struct T_get_use_filter : std::integral_constant<bool, !(index_pred & 1<<number_filter)> {};

From the T_number_field<> template, based on the index_order - the index of the order of comparing fields and the sequence number of the field field_num , we get the field number ordered_field_number to get its value from the getter get_field_value<static_cast<T_field_enum>(ordered_field_number)>() .

3.2 More complex calculations on templates

Now more difficult. Implement the calculation of the order of the field in the template class T_number_field<> . The index of the order of the fields index_order is formed as follows - for example, the index for 5 fields is composed of the following formula:
index_order = field0 + 5*field1 + 5*4*field2 + 5*4*3*field3 + 5*4*3*2*field4;
where field1 is the number of the field from 0 to 4 (out of 5 available), going first when comparing; field2 - field number from 0 to 3 (out of 4 remaining ones), coming second when comparing; ...; and field5 is the field number from 0 to 0 (of the 1 remaining one) going last, since it is always 0, then it can be omitted:
index_order = field0 + 5*field1 + 5*4*field2 + 5*4*3*field3;
or, which is the same:
index_order = field0 + 5*(field1 + 4*(field2 + 3*(field3)));
It may be easier for someone to represent the value of index_order as a number with a variable basis of the calculus system, where field0 is the first digit with a base 5 (can take 5 values 0-4), in field1 is the second digit with a base 4 (it can take 4 values 0 -3), and so on. Thus, in (5!) Combinations, we can encode any field permutations. For a better understanding, it is advisable to familiarize yourself with the basics of the theory of probability or a search from discrete mathematics.
Visually, this can be represented as follows: with round brackets () we denote the lowest priority value from the available ones (the most priority), square [] - its number from the existing ones, and curly {} - the real field number. For example, initially we have 5 fields with such priorities: 5, 3, 1, 2, 4

 Field number: 0 1 2 3 4
 Comparison priority: 5 3 (1) 2 4 - [2] - current field number (field0) {2}
 Comparison priority: 5 3 (2) 4 - [2] - current field number (field1) {3}
 Comparison priority: 5 (3) 4 - [1] - current field number (field2) {1}
 Comparison priority: 5 (4) - [1] - current field number (field3) {4}
 Comparison priority: (5) - [0] - current field number (field4) {0}

Substituting the numbers in: index_order = field0 + 5*(field1 + 4*(field2 + 3*(field3)));
We get the index value: index_order = 2 + 5*(2 + 4*(1 + 3*(1))) = 92;
Later we will show how this is implemented in the run-time function T_optimized_search::get_index_order().
Now, as a result of the promotion, this index is known to us at the compilation stage, and there is an inverse problem of how to get a sequence number from it for each field in the compile-time.
First, learn to get the numbers in square brackets []. In our particular case, the “current field numbers” are as follows, for the index_order = 2 + 5*(2 + 4*(1 + 3*(1))) = 92 we index_order = 2 + 5*(2 + 4*(1 + 3*(1))) = 92 :

 field0 = index_order% 5 = 92% 5 = 2;
 field1 = (index_order / 5)% 4 = (92/5)% 4 = 2;
 field2 = (index_order / (5 * 4))% 3 = (92/5 * 4)% 3 = 1;
 field3 = (index_order / (5 * 4 * 3))% 2 = (92/5 * 4 * 3)% 2 = 1;
 field4 = 0;

In compile-time, this makes the following template class:

 // Get the sequence number the remaining field for the number_filter, after removal of all previous template <unsigned index_order, unsigned number_filter, unsigned index = T_row::last_e> struct T_number_remaining_field : std::integral_constant<unsigned, T_number_remaining_field<index_order / index, number_filter - 1, index - 1>::value> {}; // End of unroll template <unsigned index_order, unsigned index> struct T_number_remaining_field<index_order, 0, index> : std::integral_constant<unsigned, index_order % index> {}; // -------------------------------------------------------------------------

And we get the value of the “current field number” equal to number_filter, by instantiating the template in this way T_number_remaining_field<index_order, number_filter>::value .
There is a moment where some may get confused: someone might think that with ( number_filter=0 ), i.e. T_number_remaining_field<index_order, 0>::value immediately causes partial specialization with two parameters:
template <unsigned index_order, unsigned index> struct T_number_remaining_field<index_order, 0, index>
Where index_order substituted in index_order , and 0 is substituted in index . But it is not so!
According to the standard ( C ++ 03 14.8.2 Template argument deduction ), first the default template is substituted from the common template index = T_row::last_e and in fact our instantiation is converted to T_number_remaining_field<index_order, 0, T_row::last_e>::value . And then there is an appeal to the specialization:
template <unsigned index_order, unsigned index> struct T_number_remaining_field<index_order, 0, index>
And for a zero current field number of 5 fields ( T_row::last_e=5 ), we get index_order % 5 , as required.
We implemented a way to find the "current field numbers" of those in square brackets. Now we need to implement getting real field numbers for those in curly braces {}.
As we saw from our visual presentation, the “current field numbers” in square brackets [] do not always coincide with the real field numbers in curly brackets {}. If you present a deck of numbered cards in ascending order from bottom to top, then initially their numbers will coincide with their order in the deck. But removing cards from the middle of those cards that lie above them, each time will decrease by one the ordinal number in the deck relative to their real number.
In our case, this leads to the fact that all fields with a smaller real number {} and a smaller priority value () (higher priority, which are previously withdrawn), reduce by one the “current field value” [], whose real number and priority value more. These offsets are calculated by the following template classes:

  // Get 1 or 0 offset if current_field (for number_filter) affect to number_field template <unsigned index_order, unsigned number_filter, unsigned number_field> struct T_offset { enum { current_field = T_number_remaining_field<index_order, number_filter>::value, value = (current_field <= number_field)?1:0 }; }; // Get offset of number_field (enum in row-type) for number_filter template <unsigned index_order, unsigned number_filter, unsigned number_field> struct T_offset_number_field { enum { offset = T_offset<index_order, number_filter-1, number_field>::value, value = T_offset_number_field<index_order, number_filter-1, number_field + offset>::value + offset }; }; // End of unroll template <unsigned index_order, unsigned number_field> struct T_offset_number_field<index_order, 0, number_field> : std::integral_constant<unsigned, 0> {};

Finally, we get the real field numbers in the following template class:

  // (Main) Get number of field (enum in row-type) for number_filter template <unsigned index_order, unsigned number_filter> struct T_number_field { enum { remaining_field = T_number_remaining_field<index_order, number_filter>::value, value = remaining_field + T_offset_number_field<index_order, number_filter, remaining_field>::value }; }; // -------------------------------------------------------------------------

Is done.

3.3 Nested template promotion

Now we need to instantiate the T_custom_filter<> filter template by the presence of search fields and by their order. To do this, we apply the nested promotion templates:

 template<typename T_row> class T_optimized_search { // unroll tamplates template<unsigned index_pred> struct T_unroll_find { template<unsigned index_order> struct T_unroll_order { template<typename T> T_unroll_order(T &filters) { filters[index_pred + index_order*(1<<T_row::last_e)].reset( new T_custom_filter<T_row, index_pred, index_order>() ); } }; T_unroll_constructor<factorial<T_row::last_e>::value, T_unroll_order> fill_ordered_filter; template<typename T> T_unroll_find(T &filters) : fill_ordered_filter(filters) {} }; // ------------------------------------------------------------------------- std::array<std::unique_ptr<T_filter<T_row>>, (1<<T_row::last_e)*factorial<T_row::last_e>::value> filters; T_unroll_constructor< 1<<T_row::last_e, T_unroll_find> fill_filter; public: T_optimized_search() : fill_filter(filters) {} // C++ optimized search inline size_t search(T_row *const __restrict array_ptr, const size_t c_array_size, T_row *const __restrict result_ptr, T_range_filters *const __restrict range_filters) { auto const& filter = filters[T_filter<T_row>::get_index_pred(range_filters) + T_filter<T_row>::get_index_order(range_filters)*(1<<T_row::last_e)]; return filter->search(array_ptr, c_array_size, result_ptr, range_filters); } }; // -------------------------------------------------------------------------

Here we used the compile-time promotion of the presence of field comparisons struct T_unroll_find<>, and inside it we used the promotion of the field comparison order struct T_unroll_order<>.
And in the search function, we use the run-time function get_index_pred()— retrieve an index depending on the need to compare fields, and function get_index_order()— retrieve an index depending on the required order of comparison of fields.
The first function remains the same:

  // Get index of T_test_pred version for current search range static inline const unsigned get_index_pred(T_range_filters *const __restrict range_filters) { unsigned result = 0; for(size_t i = 0; i < T_row::last_e; ++i) result |= range_filters->use_filter[i]?(1<<i):0; return result; } // -------------------------------------------------------------------------

And the second one was added get_index_order(). First, I’ll clarify that struct T_cash_account_rowa function appeared in the structure of the line get_bitf_size()— getting cardinality (it is in this form, and not through switch / case, it can act as constexpr, that is, compile-time, but MSVC11 does not yet support this):

 /* Row */ struct T_cash_account_row { ... // Get cardinality template<int field_enum> // constexpr // not supported in the MSVS2012(MVCC11) static inline const unsigned int get_bitf_size() { return (field_enum == gender_e)? 1: (field_enum == age_e)? 100: (field_enum == height_e)? 300: (field_enum == code_e)? 1000000: (field_enum == amount_of_money_e)?1000000: 0; static_assert(5 == last_e, "Add/delete new field-case and correct this assert!"); } };

And once again I will remind the selectivity formula (number of lines):
selectivity = c_array_size * (1 + field.end – field.begin) / cardinality;
And now the function itself for obtaining the index, depending on the order of comparison of the fields, using the template promotion of the constructor to calculate the selectivity for each condition / field:

  // The unrolling filling of selectivity in a compile-time template<unsigned field_num> struct T_selectivity_fill { T_selectivity_fill(std::map<unsigned, unsigned> *const __restrict ordered_filters, T_range_filters *const __restrict range_filters) { // selectivity for each field-filter const unsigned selectivity = range_filters->use_filter[field_num]? ((1 + range_filters->end.template get_field_value<field_num>() - range_filters->begin.template get_field_value<field_num>() )*c_array_size / T_row::template get_bitf_size<field_num>()) // cardinality :c_array_size; ordered_filters->insert(std::make_pair(field_num, selectivity)); } }; // Get index of T_test_pred version for current search range static inline const unsigned get_index_order( T_range_filters *const __restrict range_filters) { unsigned result = 0; std::map<unsigned, unsigned> ordered_filters; T_unroll_constructor<T_row::last_e, T_selectivity_fill> selectivity_fill(&ordered_filters, range_filters); unsigned multiplexor = 1; for(size_t i = 0; i < T_row::last_e; ++i) { unsigned cur_index = 0, min_index = 0; unsigned field_num = ordered_filters.cbegin()->first; unsigned min = c_array_size; for(auto const& it : ordered_filters) { if (it.second < min) min = it.second, field_num = it.first, min_index = cur_index; ++cur_index; } ordered_filters.erase(field_num); result += min_index*multiplexor; multiplexor *= (T_row::last_e - i); } return result; } // -------------------------------------------------------------------------

Here the key is std::map<>filled with the field numbers, and the value is filled with the expected selectivity. Then in the cycle we find the minimum selectivity, take the corresponding field number from the key, delete the record from it std::map<>, and use this field number to compile the index. And so in the loop for each field. We have already described how this index returned by the function is obtained get_index_ordermathematically:
index_order = field1 + 5*(field2 + 4*(field3 + 3*(field4)));
So we were able to unwind the search pattern 3840once, populate a static array with these objects and get the index of the necessary object with a search function optimized for the given search conditions at run-time.
The resulting assembly code functiontest_predicatechanged compared to the previous version in C ++ - two comparisons are reversed, the first of them is now the last: github.com asm .
"Picture from TortoiseDiff"

4. Test results

Here is a fully working version of this solution in C ++: GitHub.com
On GCC4.7.2 with the –O3 –march = native key, CPU Core i5 K750 and the size of the exe file of 1.3MB is the result:

Generated rows: 10,000,000
C ++ - Searching ...
C ++ - optimized search took 0.017000 seconds.
Found rows: 38
C-Searching ...
C-search took 0.090000 seconds.
The C ++ faster than C: 5.294118 times
Found rows: 38

And on MSVC11 (MSVS2012) with the / O2 / Ob2 / Oi key, CPU Core i5 K750 and exe-file size of 1.42MB, the result is:

Generated rows: 10,000,000
C ++ - Searching ...
C ++ - optimized search took 0.021000 seconds.
Found rows: 38
C-Searching ...
C-search took 0.074000 seconds.
The C ++ is faster than C: 3.523809 times
Found rows: 38

As we see, the execution time fell from 74ms to C, to 21ms to C ++, i.e. speed increased 3.5 times . Fine.And as you can see, GCC shows an even more noticeable difference in speed, 5.3 times . And this is absolutely a cross-hardware solution, unlike SIMD instructions and cache-prefetch. Unlike PGO , which creates one optimized program execution path, we have created 3840 paths optimized for all cases of input data.
Maybe this solution will seem difficult to you, but now it is clear what it means to really know C ++. If you don’t know C ++ well, then the only alternative you have is to write in C.
To dispel doubts that this can be solved in C through an array of function pointers, I’ll give an example. SinceI know that a good optimization on C in this case will not work, then I will not implement the mechanism for calculating selectivity and rearrangement of conditions, but simply with my hands will optimally change the functions of places when filling in an array of pointers.
Here is a working version of this solution on C: GitHub.com
How it differs from the C-optimization example from the previous article can be viewed in diff with the previous version: GitHub.com
On GCC4.7.2 with the –O3 key –march = native, CPU Core i5 K750 result is as follows:

Generated rows: 10,000,000
C-optimized-Searching ...
C-optimized-search took 0.049000 seconds.
Found rows: 38
C-Searching ...
C-search took 0.086000 seconds.
The C ++ is faster than C: 1.755102 times
Found rows: 38

And on MSVC11 with the / O2 / Ob2 / Oi key, CPU Core i5 K750 the result is:

Generated rows: 10,000,000
C-optimized-Searching ...
C-optimized-search took 0.045000 seconds.
Found rows: 38
C-Searching ...
C-search took 0.074000 seconds.
The C ++ is faster than C: 1.644444 times
Found rows: 38

As we can see, the result with optimization in C lags behind our optimization in C ++ by more than 2 to 3 times .
Acceleration on GCC on C by 1.8 times versus 5.3 times on C ++, and acceleration on MSVC on C by 1.6 times versus 3.5 times on C ++. Moreover, C optimization is easily compiled by the C ++ compiler without a single change, but not vice versa.

5. Conclusion

I showed how using C ++ templates to implement computations and cross-hardware optimization at compile time to get acceleration 3.5-5.3 times for a real task.
There are no such optimizations in C that could not be applied in C ++, since there is almost complete backward compatibility. But in C - the only alternative to C ++ - templates for such optimizations is to write dozens, hundreds and thousands of functions - this is where copy-paste really is.
Let me remind you that this optimization is not applicable in Java and C #, since in generics it is impossible to use values in parameters, and only types are possible. Java and C # have other advantages - these are advanced libraries and protection against stupid bugs.
In the following articles, I will describe the index search option that is most suitable for this task, why the multi-index connections Index Hash Join and Bitmap AND will not work here, but the Index Organized Tables data access strategy and the Index Only Scan data access strategy should be used. I will describe the shortcomings of the binary search, and why in our case an ordered hash index will do. We will reach speeds of 0.15 to 3.6 million requests per second on a single Core i5 K750 core, and then we will show how this is implemented in a multi-threaded version with weak competitive access and rare changes. We will speed up this search on the GPU and show why the future is behind a multithreaded pattern - distributed states, with one memory barrier per group of records. Then come close to the search task in a high-load web project and compare its implementation using FastCGI and boost :: asio.But not all at once.

Source: https://habr.com/ru/post/182428/

All Articles