Optimization of code execution speed

This article is about how classic algorithms can make our code faster. We will look at the zero-bit search algorithm and how it can help us increase the efficiency of a character search algorithm from a range (for example, find the first occurrence of a character from the range 0-9 in a line).
Those. ~~just a spherical horse in a vacuum is~~ quite common situation when processing text, when it is necessary to find the position of the first digit in a free text. What many begin to learn regular expressions with.
This article describes the use of gcc. If desired, everything can be repeated under other compilers and languages with minor modifications. Carefully, there is an assembler under the cut!

To solve this problem, we can use the following ways:
1) Take advantage of the regular expression library.
2) Write a function to search for yourself.
If the lines are small, the question of optimization does not arise. This task becomes interesting when the length of the lines increases, or there is a large number of checks.
We will measure the execution time using the clock () function from the <time.h> file.

Use REGEX

Consider a method using regular expressions. One of the first methods that comes to mind is to use the REGEX library. The regular expression will look like this:
"[[: digit:]]"
The code for using this library can be:

clock_t tbefore; long i; double eplased; int pmatch[1]; regex_t regex; int reti; …. /*REGEXP usage*/ tbefore = clock(); reti = regcomp( & regex, "[[:digit:]]", 0); for (i=1;i<100000;i++){ reti = regexec(& regex, test_string, 1, pmatch, 0); } eplased=clock()-tbefore; printf("Function strchr used %.3f seconds %d\n", eplased/CLOCKS_PER_SEC, pmatch[0]); regfree( & regex); /**************/

Execution speed: 1.74 sec.
In this code, I deliberately removed the checks on the success of the compilation and search, that is, you cannot use this code to solve real problems. But to test the speed in "greenhouse" conditions - the most it. In addition, I measure the execution time based on the compilation of a regular expression and the release of memory after using it.
')

Own function to search

Let's try to use the traditional approach - in the forehead. Search function
will look like this ( Algorithm 1 ):

 int search1(char *str){ int l=0; int i; l = strlen(str); for (i=0;i<l;i++){ if ((str[i] >= '0') && (str[i] <= '9')){ return (i); } } return (-1); }

Execution speed: 7.19 seconds.
And the results of measurements show more than four times lower speed of execution! What should be expected.

Optimized algorithm

Let's try to optimize. Once I came across the book “Algorithmic Tricks for Programmers” by G. Warren. And in this book there was a description of the algorithms for finding the zero character in the string. These algorithms are used for example in the C language to search for the end of a line. Also in one of the paragraphs proposed a generalization of one of these algorithms for the range tag.
The idea of the algorithm is that, in order to search for the index of a character in a string, they search through not four characters at once, but four at a time, reading double words from the array (dword = 4 byte). Those. In terms of the C language, we read from the array of characters (char) long integers (unsigned).
After this, arithmetic operations are performed with the word. For a range from zero to nine, it looks like this:

 int zbytel3(unsigned z) { unsigned y; x = z ^ 0x30303030; // Original byte: 00 80 other y = (x & 0x7F7F7F7F) + 0x76767676; // 7F 7F 1xxxxxxx y = ~(y | x | 0x7F7F7F7F); // 80 00 00000000 // These steps map: if (y == 0) return 4; // 00000000 ==> 4, else if (y > 0x0000FFFF) // 80xxxxxx ==> 0, return (y >> 31) ^ 1; // 0080xxxx ==> 1, else // 000080xx ==> 2, return (y >> 15) ^ 3; // 00000080 ==> 3. }

i.e. x = z ^ 0x30303030 is an exclusive operation or, which turns the minimum number of the range „0“ - 0x30 into 0. 0x76 = 0x7F-9, where 9 = n-1, where n is the number of characters in the range.
Then simple byte arithmetic, as a result of which you can get the number that is processed in the branch, and as a result we get the numbers from 1 to 4. That is, if the character is not found, then 4.
Now our function will look like this:

 int search2( char *str){ int n, j = 0; unsigned x, y, *r; const char *c_ptr; //   ,   . for (c_ptr=str; ((unsigned long int)c_ptr & (sizeof(x) - 1)) != 0; ++c_ptr) if ((*c_ptr >= '0') && (*c_ptr <= '9')) return c_ptr - str; r = (unsigned *) c_ptr; j = c_ptr - str ; while (1){ x = *r ^ 0x30303030; y = (x & 0x7F7F7F7F) + 0x76767676; y = ~(y | x | 0x7F7F7F7F); // These steps map: if (y == 0) n = 4; // 00000000 ==> 4, else if (y > 0x0000FFFF) // 80xxxxxx ==> 0, n= (y >> 31) ^ 1; // 0080xxxx ==> 1, else // 000080xx ==> 2, n= (y >> 15) ^ 3; // 00000080 ==> 3. j=j+n ; if (n<4) {j =j +3 -2*n; break;} r++; } return (j); }

Execution speed: 1.71 sec.
The measurement results show a speed close to regexp. Can we do it faster? Yes!
We use assembler inserts. The strength of the described method is that it allows to reduce the number of cycles that the processor spends on execution due to the application of elementary arithmetic operations. Programming in a high-level language does not fully allow the possibilities of this algorithm to be realized.
Therefore, now our function will take the following form:

 int search3( char *str){ int n, j = 0; unsigned x, y, *r; const char *c_ptr; for (c_ptr=str; ((unsigned long int)c_ptr & (sizeof(x) - 1)) != 0; ++c_ptr) if ((*c_ptr >= '0') && (*c_ptr <= '9')) return c_ptr - str; r = (unsigned *) c_ptr; j = c_ptr - str ; while (1){ __asm__( "movl $5, %%edx\n" "movl (%%ebx), %%eax\n" "xor $0x30303030, %%eax\n" // "and $0x7F7F7F7F, %%eax\n" "add $0x76767676, %%eax\n" "movl %%eax, %%ecx\n" "or %%eax, %%ecx\n" "or $0x7F7F7F7F, %%ecx\n" "notl %%ecx\n" :"=c"(y) :"b"( r ) ); // These steps map: if (y == 0) n = 4; // 00000000 ==> 4, else if (y > 0x0000FFFF) // 80xxxxxx ==> 0, n= (y >> 31) ^ 1; // 0080xxxx ==> 1, else // 000080xx ==> 2, n= (y >> 15) ^ 3; // 00000080 ==> 3. j=j+n ; if (n<4) {j =j +3 -2*n; break;} r++; } return (j); }

Execution speed: 1.23 seconds.
I replaced the inset assembly with only arithmetic operations. Branching can also be replaced by an assembler code, but I’m not sure of a significant gain, but a few branches with labels will significantly complicate the debugging process.
The win is present, but not many times in comparison with regexp. In principle, optimizing assembly inserts can still improve the code, but this all entails the complication of the debugging process. If you optimize the branching and the cycle, then you can increase the speed one and a half more. But debugging the code will be much harder. You can also optimize this algorithm for 64 bits instead of 32, try to parallelize the process.

So we get:

Regex	1.74
Algorithm 1	7.19
Algorithm 2	1.71
Algorithm 2 + asm	1.23

The conclusion suggests itself: The invention of bicycles is not welcome, and the gain from assembler optimization may not always be so significant.

Link to the source archive:
webfile.ru/5706721

UPD:
Optimization for the processor architecture was specified, adding the -O 2 parameter.
Also I add summation inside test functions to somehow use the result.
Result:

Regex	4.56
Algorithm 1	2.99
Algorithm 2	one
Algorithm 2 + asm	1.7 *

* With the -O2 key, it stopped working properly, __volatile_ did not help, the result with the -O1 key but with added summation.

UPD 2:

The user Maratyszcza (thanks to him!) Offered an even faster option using SIMD extensions .

 #include <intrin.h> const char* find_digit(const char* str) { static const __m128i str_mask[16] = { _mm_set_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x00FFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x0000FFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x000000FF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x00FFFFFF, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x0000FFFF, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x000000FF, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x00000000, 0x00FFFFFF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x00000000, 0x0000FFFF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x00000000, 0x000000FF, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF), _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00FFFFFF), _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x0000FFFF), _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x000000FF) }; static const __m128i str_offset = _mm_set1_epi8(127 - '9'); static const __m128i str_threshold = _mm_set1_epi8(127 - 10); const size_t str_misalignment = ((size_t)str) & ((size_t)15); const __m128i* str_aligned = (const __m128i*)(((size_t)str) - str_misalignment); __m128i str_vector = _mm_load_si128(str_aligned); str_vector = _mm_and_si128(str_vector, str_mask[str_misalignment]); str_vector = _mm_add_epi8(str_vector, str_offset); int str_bitmask = _mm_movemask_epi8(_mm_cmpgt_epi8(str_vector, str_threshold)); unsigned long index; _BitScanForward(&index, str_bitmask); while (str_bitmask == 0) { str_aligned += 1; str_vector = _mm_load_si128(str_aligned); str_vector = _mm_add_epi8(str_vector, str_offset); str_bitmask = _mm_movemask_epi8(_mm_cmpgt_epi8(str_vector, str_threshold)); _BitScanForward(&index, str_bitmask); } return ((const char*)str_aligned) + index; }

While this is the fastest option.

Source: https://habr.com/ru/post/133903/

All Articles

Optimization of code execution speed

Use REGEX

Own function to search

Optimized algorithm

More articles: