
Optimization options in C and C++

It is commonly believed that C++ carries significant overhead compared to C and is therefore slower. There are even articles claiming speed advantages for languages with just-in-time (JIT) compilation, such as Java and C#. We will leave that last claim to those who consider these languages fast, but here we will explain why it is not so, comparing C and C++ on the example of a data retrieval task.
Data retrieval tasks come up frequently in web services, database management systems (DBMS), geo-search, and analytics.
First, for simplicity of explanation, we set the task of finding elements by a full pass over an array of 10,000,000 elements (structures), each containing 5 fields with the following value ranges: amount_of_money (0-1000000), gender (0-1), age (0-100), code (0-1000000), height (0-300). In the following articles we will add index-based search to the solution.
We will write cross-platform code for MSVC11 (MSVS 2012) and GCC 4.7.2, using the parts of the C++11 standard they implement.

1. Solution in C


The simplest solution to this problem in C is to create a structure of bit fields occupying 8 bytes (as a general rule, in the absence of #pragma pack(push,1), bit fields cannot cross the size boundary of their underlying type, in our case unsigned, 32 bits):
    /* Fields */
    enum T_field_enum { amount_of_money_e, gender_e, age_e, code_e, height_e, last_e };

    struct T_cash_account_row {
        // 1 - double word (32 bits)
        unsigned code:20;            // 0 - 1000000
        unsigned gender:1;           // 0 - 1
        unsigned age:7;              // 0 - 100
        // 2 - double word (32 bits)
        unsigned amount_of_money:20; // 0 - 1000000
        unsigned height:9;           // 0 - 300
    };
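A quick sanity check of the layout (my addition, not part of the original code; _Static_assert requires C11, and in C++11 the equivalent is static_assert):

    /* Two 32-bit units: (20+1+7) bits and (20+9) bits -> 8 bytes in total */
    _Static_assert(sizeof(struct T_cash_account_row) == 8,
                   "T_cash_account_row is expected to occupy 8 bytes");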

Allocate memory for 10 million such elements:
    const size_t c_array_size = 10000000;
    struct T_cash_account_row *const array_ptr = (struct T_cash_account_row *)
        calloc(c_array_size, sizeof(struct T_cash_account_row));
    if (array_ptr == NULL) {
        printf("calloc error\n");
        exit(1);
    }

In a loop, fill it with random values within the ranges specified above:
    /* Generate random data for the one row */
    static inline struct T_cash_account_row generate_row() {
        struct T_cash_account_row cash_account_row;
        cash_account_row.age = rand() % 100;
        cash_account_row.amount_of_money = (rand() % 1000) * (rand() % 1000);
        cash_account_row.code = (rand() % 1000) * (rand() % 1000);
        cash_account_row.gender = rand() % 2;
        cash_account_row.height = rand() % 300;
        return cash_account_row;
    }
    /* ----------------------------------------------------------------------- */

    /* in int main() { ... } */
    /* Fill table with random data */
    for (i = 0; i < c_array_size; ++i)
        array_ptr[i] = generate_row();

Create a search query filter structure, where last_e is an enumerator whose value equals the number of fields in a row (C++03 7.2, Enumeration declarations):
    /* Filters */
    struct T_range_filters {
        struct T_cash_account_row begin, end;
        /* bytes array or bitset from https://gist.github.com/jmbr/667605 */
        unsigned char use_filter[last_e];
    };
    /* ----------------------------------------------------------------------- */

Here, use_filter[] specifies whether to filter by the corresponding field or not.
The search is then performed by checking the specified fields while passing through all the elements of the array in a loop:
    /* Compare row with filters */
    static inline unsigned char test_predicate(struct T_cash_account_row const*const row,
                                               struct T_range_filters const*const range_filters)
    {
        return
            (!range_filters->use_filter[amount_of_money_e] ||
                (row->amount_of_money >= range_filters->begin.amount_of_money &&
                 row->amount_of_money <= range_filters->end.amount_of_money)) &&
            (!range_filters->use_filter[gender_e] ||
                (row->gender >= range_filters->begin.gender &&
                 row->gender <= range_filters->end.gender)) &&
            (!range_filters->use_filter[age_e] ||
                (row->age >= range_filters->begin.age &&
                 row->age <= range_filters->end.age)) &&
            (!range_filters->use_filter[code_e] ||
                (row->code >= range_filters->begin.code &&
                 row->code <= range_filters->end.code)) &&
            (!range_filters->use_filter[height_e] ||
                (row->height >= range_filters->begin.height &&
                 row->height <= range_filters->end.height));
    }
    /* ----------------------------------------------------------------------- */

    /* search */
    static inline size_t search(struct T_cash_account_row const*const array_ptr, const size_t c_array_size,
                                struct T_cash_account_row *const result_ptr,
                                struct T_range_filters const*const range_filters)
    {
        size_t result_size = 0;
        size_t i; /* loop index */
        for (i = 0; i < c_array_size; ++i) {
            if (test_predicate(array_ptr + i, range_filters))
                result_ptr[result_size] = array_ptr[i], ++result_size;
        }
        return result_size;
    }
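For example, a query over two of the five fields, say age in [18, 45] and code in [100, 20000], is set up and executed like this (the values and the allocation of the result buffer here are purely illustrative; the full program on GitHub does the equivalent):

    struct T_range_filters range_filters = {0};   /* all use_filter[] flags are off */
    range_filters.use_filter[age_e] = 1;
    range_filters.begin.age = 18;   range_filters.end.age = 45;
    range_filters.use_filter[code_e] = 1;
    range_filters.begin.code = 100; range_filters.end.code = 20000;

    struct T_cash_account_row *const result_ptr = (struct T_cash_account_row *)
        calloc(c_array_size, sizeof(struct T_cash_account_row));
    const size_t result_size = search(array_ptr, c_array_size, result_ptr, &range_filters);
    printf("Found rows: %u\n", (unsigned)result_size);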

The complete working code is on GitHub.com.
It would seem: what else is there to speed up and optimize here, given a full pass without indexes?

This is where low-level (non-architectural) optimization in C ends, and all of these optimizations are just as applicable in C++.

2. What other optimization is possible in C and C++?


  1. First, the above C solution compiles with a C++ compiler without any changes, since C++ is for the most part backward compatible with C.
    Result in the online C compiler: ideone.com
    Result in the online C++11 compiler: ideone.com
    I commented out the initialization of the random seed
      /* srand (time(NULL)); */ 
    so that the results of different program runs can be compared. You can uncomment it and verify that the absolute execution time varies, while the relative speedup from these optimizations stays about the same.
  2. Second, particularly clever C developers could suggest one more optimization: create 2^5 = 32 variants of the test_predicate/search functions, one for each combination of search conditions, store pointers to them in an array, and select the required variant at run time depending on the search conditions. This significantly reduces the number of comparisons.

Say a query arrives that filters on 2 of the 5 fields: age and code. Then we call the search_12() function:
    /* Compare row with filters */
    static inline unsigned char test_predicate_12(struct T_cash_account_row const*const __restrict row,
                                                  struct T_range_filters const*const __restrict range_filters)
    {
        return (row->age >= range_filters->begin.age && row->age <= range_filters->end.age) &&
               (row->code >= range_filters->begin.code && row->code <= range_filters->end.code);
    }
    /* ----------------------------------------------------------------------- */

    /* search */
    static inline size_t search_12(struct T_cash_account_row const*const __restrict array_ptr, const size_t c_array_size,
                                   struct T_cash_account_row *const __restrict result_ptr,
                                   struct T_range_filters const*const __restrict range_filters)
    {
        size_t result_size = 0;
        size_t i; /* loop index */
        for (i = 0; i < c_array_size; ++i) {
            if (test_predicate_12(array_ptr + i, range_filters))
                result_ptr[result_size] = array_ptr[i], ++result_size;
        }
        return result_size;
    }
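The remaining part of that scheme, storing pointers to the 32 hand-written variants and picking one at run time by the bitmask of active filters, might look roughly like this (my sketch; search_dispatch and get_filter_index are illustrative names, not code from the author's repository):

    /* Common signature of all 32 hand-written search variants */
    typedef size_t (*T_search_func)(struct T_cash_account_row const*const,
                                    const size_t,
                                    struct T_cash_account_row *const,
                                    struct T_range_filters const*const);

    static T_search_func search_dispatch[1 << last_e];  /* 2^5 = 32 slots */

    /* Bit i of the index is set if use_filter[i] is enabled */
    static inline unsigned get_filter_index(struct T_range_filters const*const range_filters) {
        unsigned index = 0, i;
        for (i = 0; i < last_e; ++i)
            if (range_filters->use_filter[i]) index |= 1u << i;
        return index;
    }

    /* Registration, repeated for all 32 variants, e.g.:
       search_dispatch[(1u << age_e) | (1u << code_e)] = search_12;   -> index 12
       Call site:
       result_size = search_dispatch[get_filter_index(&range_filters)](
                         array_ptr, c_array_size, result_ptr, &range_filters); */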


The number of conditions in test_predicate_12 has decreased to 4 from the original 15. In reality, though, the C solution executed only 9 of the 15 comparisons for a query on 2 fields, as can be seen in the disassembly of the test_predicate function: github.com. This is because whenever use_filter[] was false, a conditional jump skipped straight to the comparisons for the next field. That is, in addition to these 4 comparisons, only 5 more comparisons with use_filter[] were performed there.
The approach described here, for a search on 2 fields out of 5, gives a speedup of about 1.3 times. Not bad, but there is a small problem: C developers have to write these 32 functions by hand, make no mistakes in any of them, and walk through all the variants updating the affected functions whenever the number or the names of the fields change. And if you need a similar solution for a table with 10 fields, you have to create 1024 functions. In general, just a fairy tale to maintain!
In addition, confusion arises when fields with non-trivial comparison are added, such as a char[] string compared via strcmp(). In C++ this is solved by creating a user-defined type with an overloaded comparison operator. (Operators cannot be overloaded for fundamental types in C++: at least one operator parameter must be a user-defined class.)
In C++, the task of automatically generating the required number of optimized functions is easily solved by template unrolling.
It may seem that this can also be solved in C at run time, with no need to build 32 to 1024 functions optimized at compile time. Suppose you create an array of function pointers whose length equals the number of conditions, in our case 5, and on each search fill this array with only those check functions whose conditions are used in the query, adding at the end a pointer to a function that simply returns 1 (true). Each such function receives a pointer to an array of functions of the same type as itself and the index of the next function to call. I have to disappoint you: in this case the functions are not inlined, and calling them is no faster than the comparisons with conditional jumps.
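For clarity, a minimal sketch of that run-time scheme (my own illustration with hypothetical names; the author's actual version is linked below). Each active check calls the next entry in the table, and the chain is terminated by a function that simply returns 1:

    struct T_check_chain;   /* forward declaration for the self-referential type */
    typedef unsigned char (*T_check_func)(struct T_cash_account_row const*,
                                          struct T_range_filters const*,
                                          struct T_check_chain const*, size_t next);
    struct T_check_chain { T_check_func checks[last_e + 1]; };

    /* Terminator: always true */
    static unsigned char check_true(struct T_cash_account_row const*row,
                                    struct T_range_filters const*f,
                                    struct T_check_chain const*chain, size_t next) {
        (void)row; (void)f; (void)chain; (void)next;
        return 1;
    }

    /* One such function per field; only age is shown here */
    static unsigned char check_age(struct T_cash_account_row const*row,
                                   struct T_range_filters const*f,
                                   struct T_check_chain const*chain, size_t next) {
        return row->age >= f->begin.age && row->age <= f->end.age
            && chain->checks[next](row, f, chain, next + 1);
    }

    /* Before each search, chain.checks[] is filled with only the active checks,
       followed by check_true; a row is then tested by calling
       chain.checks[0](row, range_filters, &chain, 1). */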
Here is a working version of this run-time solution in C: GitHub.com
As you can see, with MSVC the time grew from 74 ms to 84 ms, and with GCC even more, up to 117 ms. So this kind of optimization does not pay off in C, and the only option left is optimization through the creation of a large number of functions.

3. C++ Solution


Template unrolling is performed by having one template instantiate another instance with a template parameter value one less than its own. For the template with parameter value 0 we create an empty, do-nothing specialization. As a result, by instantiating the unrolling template with parameter N we get N instances of the unrolled template, and in each of them the constructor or an inline operator is called, which in turn calls the next instance. Both template functions and template classes can take part in such unrolling.
To keep the unrolling separate from the logic of the templates themselves, we create a template unrolling class. Its first parameter is the count up to which to unroll, and the second is the template that has to be unrolled:
    // The templated constructor of unrolling of the class (no return, mandatory call to the constructor, called once)
    template<unsigned unroll_count, template<unsigned> class T>
    struct T_unroll_constructor {
        T_unroll_constructor<unroll_count-1, T> next_unroll;
        T<unroll_count-1> functor;
        template<typename T1>
        inline T_unroll_constructor(T1 &val1) : next_unroll(val1), functor(val1) {}
    };
    // End of unroll
    template<template<unsigned> class T>
    struct T_unroll_constructor<0, T> {
        template<typename T1>
        inline T_unroll_constructor(T1 &) {}
    };
    // -------------------------------------------------------------------------
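To see what this does, here is a tiny illustration of my own (not from the article) that unrolls a functor which merely prints its index; it assumes the T_unroll_constructor definition above is in scope:

    #include <iostream>

    // Functor template used only for this illustration
    template<unsigned I> struct T_print_index {
        template<typename T1> inline T_print_index(T1 &) { std::cout << I << ' '; }
    };

    int main() {
        int dummy = 0;
        // Instantiates T_print_index<0> ... T_print_index<4> and prints "0 1 2 3 4"
        T_unroll_constructor<5, T_print_index> unroll(dummy);
    }

The innermost instance is constructed first, so the functors run in order from 0 to N-1.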


Now create the abstract base search class. From it we derive a template child class that takes a 32-bit unsigned value as a template parameter; each bit of that value indicates whether the corresponding filter is used:
    // Abstract base class of filters for each search variant (range_filters)
    struct T_filter {
        // search
        virtual size_t search(T_cash_account_row const*const __restrict array_ptr, const size_t c_array_size,
                              T_cash_account_row *const __restrict result_ptr,
                              T_range_filters const*const __restrict range_filters) = 0;
    };
    // -------------------------------------------------------------------------
    // The filters for each search variant (range_filters)
    template<unsigned index_pred>
    struct T_custom_filter : T_filter {
        inline unsigned char test_predicate(T_cash_account_row const*const __restrict row,
                                            T_range_filters const*const __restrict range_filters) {
            return
                (!(index_pred & 1<<amount_of_money_e) ||
                    (row->amount_of_money >= range_filters->begin.amount_of_money &&
                     row->amount_of_money <= range_filters->end.amount_of_money)) &&
                (!(index_pred & 1<<gender_e) ||
                    (row->gender >= range_filters->begin.gender &&
                     row->gender <= range_filters->end.gender)) &&
                (!(index_pred & 1<<age_e) ||
                    (row->age >= range_filters->begin.age &&
                     row->age <= range_filters->end.age)) &&
                (!(index_pred & 1<<code_e) ||
                    (row->code >= range_filters->begin.code &&
                     row->code <= range_filters->end.code)) &&
                (!(index_pred & 1<<height_e) ||
                    (row->height >= range_filters->begin.height &&
                     row->height <= range_filters->end.height));
        }
        // -------------------------------------------------------------------------
        // search
        virtual size_t search(T_cash_account_row const*const __restrict array_ptr, const size_t c_array_size,
                              T_cash_account_row *const __restrict result_ptr,
                              T_range_filters const*const __restrict range_filters) final {
            size_t result_size = 0;
            size_t i; // loop index
            for (i = 0; i < c_array_size; ++i) {
                if (test_predicate(array_ptr + i, range_filters))
                    result_ptr[result_size] = array_ptr[i], ++result_size;
            }
            return result_size;
        }
    };
    // -------------------------------------------------------------------------

Since the template parameter index_pred and the enumerators amount_of_money_e, gender_e, etc. are known at compile time, the compiler throws out the conditions that are always true. In effect, we help the compiler optimize our program. This is the key idea of the whole solution!
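For example, for index_pred equal to (1<<age_e) | (1<<code_e), that is 12, T_custom_filter<12>::test_predicate effectively collapses to the following (an illustration of the result of constant folding, not actual compiler output):

    inline unsigned char test_predicate(T_cash_account_row const*const __restrict row,
                                        T_range_filters const*const __restrict range_filters) {
        return (row->age  >= range_filters->begin.age  && row->age  <= range_filters->end.age) &&
               (row->code >= range_filters->begin.code && row->code <= range_filters->end.code);
    }

which is exactly the hand-written test_predicate_12 from the C section.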
Now we will show how to unroll this template child class template<unsigned index_pred> struct T_custom_filter into 32 classes, create one object of each of them, and store base-class pointers to them in a static array std::array<>. At execution time we will polymorphically dispatch to the required object, depending on the search conditions:
    class T_optimized_search {
        // unroll templates
        template<unsigned index_pred>
        struct T_unroll_find {
            template<typename T>
            T_unroll_find(T &filters) {
                filters[index_pred].reset( new T_custom_filter<index_pred>() );
            }
        };
        // -------------------------------------------------------------------------
        // Get index of T_test_pred version for current search range
        inline unsigned get_index_pred(T_range_filters const*const __restrict range_filters) {
            unsigned result = 0;
            for (size_t i = 0; i < last_e; ++i)
                result |= range_filters->use_filter[i] ? (1<<i) : 0;
            return result;
        }
        std::array<std::unique_ptr<T_filter>, 1<<last_e> filters;
        T_unroll_constructor< 1<<last_e, T_unroll_find> fill_filter;
    public:
        T_optimized_search() : fill_filter(filters) {}
        // C++ optimized search
        inline size_t search(T_cash_account_row const*const __restrict array_ptr, const size_t c_array_size,
                             T_cash_account_row *const __restrict result_ptr,
                             T_range_filters const*const __restrict range_filters) {
            auto const& filter = filters[get_index_pred(range_filters)];
            return filter->search(array_ptr, c_array_size, result_ptr, range_filters);
        }
    };
    // -------------------------------------------------------------------------

Here, unsigned get_index_pred(T_range_filters const*const __restrict range_filters) returns the index of the search object required for the given search conditions range_filters.
It is used in much the same way as the C solution:
    T_optimized_search optimized_search;
    // C++ optimized search
    result_size = optimized_search.search(array_ptr, c_array_size, result_ptr, &range_filters);


Here is a comparison of the disassembled code of the two test_predicate functions, the C one and the optimized C++ one, compiled with MSVC11 (MSVS 2012), with my comments; the difference is clearly visible in TortoiseDiff: diff link on GitHub.com
We see that of the 15 comparisons (9 of which were actually executed for our search conditions), only 4 remain, i.e. 4 cmp assembler instructions.
[Image: disassembly diff in TortoiseDiff with the author's comments]

In effect, with the help of templates we have hoisted the check of whether each filter is used out of the loop. Inside the loop the use_filter[] values are known at compile time, which allows the compiler to eliminate those checks during optimization. That is, this optimization is applicable to all similar cases of moving calculations or checks out of a loop.

In the C++ example I used the C-style way of passing parameters to a function via a constant pointer *const, so that the diff between the C and C++ versions contains only the changes related to the optimization under discussion. With C++-style interfaces, a function can also take parameters by reference &, which removes the possibility of forgetting const after * and is somewhat shorter. The Google C++ Style Guide, however, recommends passing immutable parameters by constant reference const& and mutable ones by a pointer *const. If the code is written in this style, you fully control whether the variables you pass to another function may be changed or not. That is, if you pass by value
    void func(int const& a, int *b) {} // function of another developer in Google C++ Style
    int* a = new int(1);
    int b = 2;
    func(*a, b);

then the compiler will report an error, saying in effect that the function wants to modify your parameter b. This is especially important when developing through TDD, where external test code rigidly fixes the interface format; in that case such a call in the external tests would tell the developer of the function that b must not be changed.
And if we pass by pointer (that is, take the address):
    void func(int const& a, int *b) {} // function of another developer in Google C++ Style
    int* a = new int(1);
    int b = 2;
    func(*a, &b);

then it compiles without errors. Moreover, the call site alone makes it obvious that the function does not change variable a but does change variable b. In the TDD case this tells the function's developer that b must be taken by pointer and is therefore meant to be changed, while a has to be taken by constant reference or by value and its external value cannot be changed.
But in C, where there are no references, this approach is impossible: if a function always accepts parameters only by pointer, the caller has no guarantee that they will not be modified, while passing variables of a user-defined type by value can carry significant overhead.


4. Conclusion


Here is a fully working version of this solution in C++: GitHub.com
With GCC 4.7.2 and the -O3 -march=native flags, on a CPU Core i5 K750, with an executable size of 74 KB, the result is:
Generated rows: 10,000,000
C++ - Searching...
C++ - optimized search took 0.061000 seconds.
Found rows: 38
C-Searching...
C-search took 0.089000 seconds.
The C++ is faster than C: 1.459016 times
Found rows: 38

And on MSVC11 (MSVS 2012) with the /O2 /Ob2 /Oi flags, the same CPU Core i5 K750, and an executable size of 138 KB, the result was:
Generated rows: 10,000,000
C++ - Searching...
C++ - optimized search took 0.056000 seconds.
Found rows: 38
C-Searching...
C-search took 0.074000 seconds.
The C++ is faster than C: 1.321429 times
Found rows: 38

As we can see, the execution time dropped from 74 ms to 56 ms, i.e. the speed increased 1.3 times. Not bad at all.
Only 1.3 times? And what about a 3.5-5.3 times speedup for a full-pass search, any ideas?
The conclusion: the more the compiler knows at compile time, the better it can optimize the program, and templates help it with this like nothing else.
By the way, this optimization is not applicable in Java and C#, since their generics accept only type parameters, not value parameters.
In the next article: a hardcore solution with a 3.5-5.3 times speedup, still without indexes. That solution will later be used as part of the index search.

Source: https://habr.com/ru/post/182356/

