Many people “know” that programming in C/C++ lets you produce programs that run almost as fast as programs written in assembler, which in turn are as fast as theoretically possible.
In reality this is not quite true (and occasionally not true at all), but on the whole C/C++ programs really are fast, use little memory and start up instantly. Provided you write them correctly.
And that is exactly what I would like to talk about a little: how to write C/C++ properly. Today I want to discuss string lists, that is, routines that give you a string from a number and a number from a string.
Where do such lists come up? Well, for example, they can be lists of HTML tokens your program works with, or the list of commands your command interpreter accepts. But most often, of course, such sets appear as lists of all kinds of errors: strerror, gai_strerror, regerror and so on. I think every programmer has run into this task at least once.
Let me note right away that the description below applies directly only to operating systems that use the ELF format: Linux, FreeBSD, Solaris and so on; on Windows or on embedded systems things may look somewhat different. Also, this time I will confine myself to the direct problem (getting a string from a number), because it is, firstly, simpler and, secondly, many solutions to the inverse problem contain the direct problem as a part.
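To pin down the terminology, here is a minimal sketch of the two interfaces involved; only errstr is used in the rest of the article, and the name errnum is purely illustrative:

```c
/* Direct problem: number -> string (the subject of this article). */
const char *errstr(int nr);

/* Inverse problem: string -> number (hypothetical name, not discussed further). */
int errnum(const char *msg);
```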
So let's state the task: we want an errstr function that returns a string for a given “error” number. How can such a function be written?
Well, for example, head-on:
```c
const char *errstr(int nr)
{
    switch (nr) {
    case 0:
        return "X00000X";
    case 1:
        return "X00001X";
    ...
    case 9999:
        return "X09999X";
    default:
        return "Unknown error";
    }
}
```

Does this variant work? Certainly. But how many resources does it take? Let's have a look:
```
$ gcc -march=native -fPIC -O2 -shared -o libgen1.so gen1.c
$ strip --strip-unneeded libgen1.so
$ size libgen1.so
   text    data     bss     dec     hex filename
 201116     272      28  201416   312c8 libgen1.so
```
We have spent about 20 bytes per string, while the string itself takes only 8 bytes. Can we do better? Hmm... Yes, of course! It is enough to put the strings into an array and write a trivial lookup function. The pointers in the array take 4 bytes each (in a 32-bit build), so instead of 20 bytes per string we get only 12, a saving of 40%! The game is definitely worth the candle.
No sooner said than done:
```c
static const char *const msgs[] = {
    [0] = "X00000X",
    [1] = "X00001X",
    ...
    [9999] = "X09999X",
};

const char *errstr(int nr)
{
    if (nr < 0 || nr >= sizeof(msgs) / sizeof(msgs[0]))
        return "Unknown error";
    return msgs[nr];
}
```
```
$ gcc -march=native -fPIC -O2 -shared -o libgen2.so gen2.c
$ strip --strip-unneeded libgen2.so
$ size libgen2.so
   text    data     bss     dec     hex filename
 161110   40272      28  201410   312c2 libgen2.so
```
An awesome achievement: we saved six bytes! What happened? 80,000 bytes are taken by the strings themselves (they cannot be modified, so they are protected as text), but where did another ~80,000 bytes of text come from (we expected the pointer table to add 40,000 bytes), and why do we have ~40,000 bytes of data (there should be none at all: all our tables are constant, aren't they)? Let's figure it out:
```
$ relinfo libgen2.so
libgen2.so: 10007 relocations, 10002 relative (99%), 3 PLT entries, 0 for local syms (0%), 0 users
```

Aha, now it is clear! The compiler cannot compute this table and write it into the library: to do that it would have to know the address at which the library will be loaded at run time, and that information is not available to it. Moreover, as a rule it is not available to the linker either, so the linker places the pointer table not in the text segment but in the data segment (those are our ~40,000 bytes of data) and creates yet another table, this time for the dynamic linker (each 32-bit relocation record takes 8 bytes, hence the extra ~80,000 bytes of text), which fills the pointer table in when the program starts. Hmm... And all of this for a table that most likely will never be used at all! Who was talking about instantly starting programs?
What can we do? Obviously it would be nice to reorganize the table so that the computation is performed not by the dynamic linker on every program start, but by our program, and only when it is actually needed. How? Surely we are not the first to hit this problem; how is it done in, say, glibc?
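Before looking at how glibc automates it, here is a tiny hand-written sketch of the idea it relies on (the names blob, offs and errstr_sketch are mine, for illustration only): keep all the strings in one character blob and store integer offsets into it. Integers need no relocations, so the whole thing stays constant:

```c
/* A hand-written miniature of the idea for just three strings:
 * one string blob plus integer offsets into it.
 * No pointers => nothing for the dynamic linker to patch. */
static const char blob[] =
    "X00000X\0"   /* offset 0, 8 bytes including the NUL */
    "X00001X\0"   /* offset 8 */
    "X00002X";    /* offset 16 */
static const unsigned int offs[] = { 0, 8, 16 };

const char *errstr_sketch(int nr)
{
    if (nr < 0 || nr >= (int)(sizeof(offs) / sizeof(offs[0])))
        return "Unknown error";
    return blob + offs[nr];
}
```

Maintaining such offsets by hand for 10,000 strings is hopeless, of course, so glibc generates them automatically.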
It is done like this: all the strings are placed in a separate file which the main program includes. The file stringtab.h contains the following:

```c
_S(0, "X00000X")
_S(1, "X00001X")
...
_S(9999, "X09999X")
```

And the program itself looks like this:
```c
#include <stddef.h>

#define MSGSTRFIELD(line) MSGSTRFIELD1(line)
#define MSGSTRFIELD1(line) str##line

static const union msgstr_t {
    struct {
#define _S(n, s) char MSGSTRFIELD(__LINE__)[sizeof(s)];
#include "stringtab.h"
#undef _S
    };
    char str[0];
} msgstr = {
    {
#define _S(n, s) s,
#include "stringtab.h"
#undef _S
    }
};

static const unsigned int msgidx[] = {
#define _S(n, s) offsetof(union msgstr_t, MSGSTRFIELD(__LINE__)),
#include "stringtab.h"
#undef _S
};

const char *errstr(int nr)
{
    if (nr < 0 || nr >= sizeof(msgidx) / sizeof(msgidx[0]))
        return "Unknown error";
    return msgstr.str + msgidx[nr];
}
```
```
$ gcc -march=native -fPIC -O2 -shared -o libgen3.so gen3.c
$ strip --strip-unneeded libgen3.so
$ size libgen3.so
   text    data     bss     dec     hex filename
 121136     272      28  121436   1da5c libgen3.so
```
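If the preprocessor gymnastics look opaque, here is roughly what the above expands to for a two-entry stringtab.h (the field names here are written by hand; the real ones are generated from __LINE__):

```c
#include <stddef.h>

/* Approximate expansion for a two-entry stringtab.h. */
static const union msgstr_t {
    struct {
        char str1[sizeof("X00000X")];   /* 8 bytes, NUL included */
        char str2[sizeof("X00001X")];   /* 8 bytes, NUL included */
    };
    char str[0];                        /* overlays the struct: one big blob */
} msgstr = {
    { "X00000X", "X00001X" }            /* the strings laid out back to back */
};

/* Offsets of each string inside the blob: plain integers, no relocations. */
static const unsigned int msgidx[] = {
    offsetof(union msgstr_t, str1),     /* 0 */
    offsetof(union msgstr_t, str2),     /* 8 */
};
```

In other words, it is the same offsets-into-a-blob scheme as in the hand-written sketch above, only the offsets are computed by the compiler via offsetof, so the table can be of any size and still contain not a single pointer that would need relocating.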
Now everything is in order. The only remaining question: we have moved the computation from the dynamic linker into our own program, so how much did that cost in speed? In other words, can this technique be used everywhere string lists are needed, or only where speed is not critical (for example, in error handling)?
To find out, I wrote a small program and, using gcc 4.2.2, examined its behaviour on three systems: an AMD Opteron 175, an Intel Core 2 CPU 6600 and an Intel Pentium D 3.00GHz.
```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include "gen.h"

#define NUMTESTS 1000
#define NUMSTRINGS 10000

#ifdef __i386__
static inline unsigned long long read_tsc(void)
{
    unsigned long long val;
    __asm__ __volatile__ ("rdtsc" : "=A" (val));
    return val;
}
#endif

#ifdef __x86_64__
static inline unsigned long long read_tsc(void)
{
    unsigned int __a, __d;
    __asm__ __volatile__ ("rdtsc" : "=a" (__a), "=d" (__d));
    return ((unsigned long) __a) | (((unsigned long) __d) << 32);
}
#endif

int main()
{
    cpu_set_t affinity;
    CPU_ZERO(&affinity);
    CPU_SET(1, &affinity);
    sched_setaffinity(0, sizeof(cpu_set_t), &affinity);
    unsigned long long testresults[NUMTESTS];
    for (int i = 0; i < NUMTESTS; ++i) {
        unsigned long long start = read_tsc();
        for (int i = 0; i < NUMSTRINGS; ++i) {
            // Do nothing - measure overhead
        }
        testresults[i] = read_tsc() - start;
    }
    unsigned long long overhead = testresults[0];
    for (int i = 1; i < NUMTESTS; ++i) {
        if (testresults[i] < overhead)
            overhead = testresults[i];
    }
    for (int i = 0; i < NUMTESTS; ++i) {
        unsigned long long start = read_tsc();
        for (int i = 0; i < NUMSTRINGS; ++i) {
            errstr(i);
        }
        testresults[i] = read_tsc() - start;
    }
    unsigned long long testrun = testresults[0];
    for (int i = 1; i < NUMTESTS; ++i) {
        if (testresults[i] < testrun)
            testrun = testresults[i];
    }
    printf("errstr takes %g CPU cycles per call\n",
           (testrun - overhead) / (NUMSTRINGS * 1.0));
    return 0;
}
```
```
$ gcc -march=native -O2 -std=c99 -o main1 main.c -L. -lgen1
$ gcc -march=native -O2 -std=c99 -o main2 main.c -L. -lgen2
$ gcc -march=native -O2 -std=c99 -o main3 main.c -L. -lgen3
$ LD_LIBRARY_PATH=. ./main1
errstr takes 44.748 CPU cycles per call
$ LD_LIBRARY_PATH=. ./main2
errstr takes 12.9996 CPU cycles per call
$ LD_LIBRARY_PATH=. ./main3
errstr takes 13 CPU cycles per call
```

|                          | Option 1          | Option 2      | Option 3      |
|--------------------------|-------------------|---------------|---------------|
| AMD Opteron (32 bit)     | 28.13 CPU cycles  | 15 CPU cycles | 16 CPU cycles |
| AMD Opteron (64 bit)     | 30.83 CPU cycles  | 10 CPU cycles | 11 CPU cycles |
| Intel Core 2 (32 bit)    | 44.67 CPU cycles  | 13 CPU cycles | 13 CPU cycles |
| Intel Core 2 (64 bit)    | 39.37 CPU cycles  | 9 CPU cycles  | 10 CPU cycles |
| Intel Pentium D (32 bit) | 111.34 CPU cycles | 26 CPU cycles | 26 CPU cycles |
| Intel Pentium D (64 bit) | 99.05 CPU cycles  | 16 CPU cycles | 14 CPU cycles |
The results speak for themselves: the switch-based lookup tends to fight with the branch predictor and is therefore quite slow; the “standard” variant is usually the fastest, but the “economical” variant keeps up with it in half of the cases, and where it does lag behind, it loses only a single CPU cycle. So unless you need this function in the very innermost loop, you can use it (and if there are many strings, I would recommend it). Interestingly, in one configuration (Intel Pentium D, 64 bit) the “economical” variant is actually faster, most likely because it needs half the memory, and the cache is, after all, not infinite...