⬆️ ⬇️

Search optimization when choosing a campaign

Blah blah blah



When the head starts to swell from the main work, I like to switch the context and come up with a task related to the improvement of some algorithm, optimization. Since more than 10 years I have been developing advertising management systems, all my tasks are also related to this area.



Recently, my thoughts are busy with the question of how to increase the speed of search for advertising campaigns according to specified parameters. An advertising campaign is, in fact, a set of restrictions that must be checked in order to choose the right one for a request. Trying to speed up the exhaustive search, I reached intrinsic, but the achievements could not be called outstanding. At the same time, there was always a feeling that I was not going the right way, that somewhere I had to turn something upside down and the speed would increase by an order of magnitude.



Yesterday I was lit up after a conversation with Pasha Beyzman, who spoke the other day with a report on HighLoad ++. Bit operations !!! Yes, I perfectly understand the power of bit operations, but I didn’t prepare them correctly;) The test results exceeded my most optimistic expectations.



Idea



One type of constraint when choosing a campaign is to check whether a value is in the set. For example, a campaign can be restricted by country, region, city, category, etc. The first thing that comes to mind is a tree with all the values ​​at the campaign level. The second is an array of values; if the values ​​in the set are few (about 10), brute force will be faster than using trees. To select a campaign, you must go through everything and in each of them check the occurrence of a given value in the set.



The algorithm with correctly prepared bit operations is as follows. For a given restriction value, we build a large bit mask whose size is equal to the number of campaigns and set the bits for the campaigns that are assigned to this value (or there are no restrictions of a given type). Bitmasks are placed in a tree, where the key is the value of the constraint. For example, for countries, this will be a tree, where the key is the country identifier, and the value is a mask with bits set for campaigns that can be shown to this country.



It is very easy to work with such a structure. Knowing which country to select campaigns for, we select the bit mask by identifier and perform the AND operation with masks of other restrictions. Very fast and beautiful.



Tests



Where without performance tests ... I tested the processing of one limit for 1 million campaigns. For each of the campaigns, a restriction with a set of 10 values ​​from 0 to 256 was randomly generated (checks for memory allocation errors, etc., from the code were removed):



int nels = 10; int campaigns = 1000000; int *values = (int *) malloc(sizeof(int) * campaigns * nels); srand(13); for (int i = 0; i < campaigns; i++) { for (int j = 0; j < nels; j++) { values[i * nels + j] = rand() % 256; } } 


Brute force with values ​​in an array



We just try to sort through all the campaigns with all the restrictions. Here and below, we compile with the -O3 option.



We are testing.



  clock_t cs, ce; int found_campaigns = 0; int reference = 13; cs = clock(); for (int i = 0; i < campaigns; i++) { for (int j = 0; j < nels; j++) { if (values[i * nels + j] == reference) { found_campaigns++; break; } } } ce = clock(); printf("test #1 bruteforce, found: %d, clocks: %ld, sec: %f\n", found_campaigns, (long) (ce - cs), (double) (ce - cs) / CLOCKS_PER_SEC); 


Result: clocks: 7034, sec: 0.007034.



Brute force tree



We try to use a tree instead of an array of constraint values ​​( Judy Arrays ).



Preparing data
 Pvoid_t *sets = (Pvoid_t) malloc(sizeof(Pvoid_t) * campaigns); memset(sets, 0, sizeof(Pvoid_t) * campaigns); for (int i = 0; i < campaigns; i++) { for (int j = 0; j < nels; j++) { PWord_t pw; JLI(pw, sets[i], values[i * nels + j]); } } 


We are testing.



  found_campaigns = 0; cs = clock(); for (int i = 0; i < campaigns; i++) { PWord_t pw; JLG(pw, sets[i], reference); if (pw) { found_campaigns++; } } ce = clock(); printf("test #2 set, found: %d, clocks: %ld, sec: %f\n", found_campaigns, (long) (ce - cs), (double) (ce - cs) / CLOCKS_PER_SEC); 


Result: clocks: 7930, sec: 0.007930.



The enumeration of 10 array values ​​turned out to be faster than using a tree. There is nothing surprising here; brute force through the array utilizes the processor cache remarkably well. The approximate equality between the arrays and the tree was on 12 elements.



Use bit masks



We prepare bit masks
  #define SET_BIT(_n, _i) _n[_i / 64] |= 1ULL << (_i % 64) #define CHECK_BIT(_n, _i) _n[_i / 64] & (1ULL << (_i % 64)) PWord_t pw; Pvoid_t pv = (Pvoid_t) 0; int mask_size = sizeof(uint64_t) * (campaigns / 64 + 1); for (int i = 0; i < campaigns; i++) { for (int j = 0; j < nels; j++) { int id = values[i * nels + j]; JLI(pw, pv, id); if (*pw) { uint64_t *mask = (uint64_t *) *pw; SET_BIT(mask, i); } else { uint64_t *mask = (uint64_t *) malloc(mask_size); memset(mask, 0, mask_size); SET_BIT(mask, i); *pw = (Word_t) mask; } } } 


We are testing.



  found_campaigns = 0; cs = clock(); JLG(pw, pv, reference); if (pw) { uint64_t *mask = (uint64_t *) *pw; for (int i = 0; i < campaigns; i++) { if (CHECK_BIT(mask, i)) { found_campaigns++; } } } ce = clock(); printf("test #3 bitmaps, found: %d, clocks: %ld, sec: %f\n", found_campaigns, (long) (ce - cs), (double) (ce - cs) / CLOCKS_PER_SEC); 


Result: clocks: 1393, sec: 0.001393.



5 times faster than brute force. Not bad, but better.



Accelerate the use of bitmasks



To speed up, use the built-in function gcc __builtin_ffsll, which returns the index of the first bit set.



We are testing.



  found_campaigns = 0; cs = clock(); JLG(pw, pv, reference); if (pw) { uint64_t *mask = (uint64_t *) *pw; for (int i = 0; i < (campaigns / 64 + 1); i++) { uint64_t m = mask[i]; while (m) { int n = __builtin_ffsll(m); m &= ~(1ULL << (n - 1)); found_campaigns++; } } } ce = clock(); printf("test #4 bitmaps 2, found: %d, clocks: %ld, sec: %f\n", found_campaigns, (long) (ce - cs), (double) (ce - cs) / CLOCKS_PER_SEC); 


Result: clocks: 28, sec: 0.000028.



In this place, my eyes went up to my forehead. Acceleration is 250 times compared with brute force.



How the processor works, what SSE, AVX and similar instructions are, how the cache works - this knowledge allows you to speed up the program, often significantly. But a beautiful algorithm can speed up the work by orders of magnitude.



The text of the program can be found here: https://github.com/saterenko/labs/tree/master/bit_limits



PS



Thanks to PkXwmpgN for the error found in the initialization, because of it, not the entire array of test data was filled, but only a tenth of it. After correcting the error, the execution time of the last test increased by an order of magnitude, i.e. acceleration compared to brute force - 25 times, eyes on the forehead no longer climb, but also impressive;)



')

Source: https://habr.com/ru/post/314818/



All Articles