
The OpenSSL library has a rather curious function with a promising name: CRYPTO_memcmp(). The comments to it explain that the usual memcmp() has a fatal flaw: its running time depends not only on the size of the compared blocks but also on their contents, and this can help an attacker mount a so-called timing attack.
Similar functions exist in a number of other projects; a search for “constant time memcmp” returns several thousand results.
We will not question the need for CRYPTO_memcmp() itself; instead, let's consider whether it actually solves the problem it was written for.
First, let's compare the “normal” and “protected” implementations of memcmp().
"Normal" is arranged as follows:
int memcmp( const void* f, const void* s, size_t length )
{
    const unsigned char* first = f;
    const unsigned char* second = s;
    while( length != 0 )
    {
        if( *first > *second )
            return 2;
        if( *first < *second )
            return -5;
        length--;
        first++;
        second++;
    }
    return 0;
}
The function can return zero, a negative integer, or a positive integer (not necessarily 1 and -1). Naturally, as soon as the first difference is found there is no point in continuing, so the loop is cut short and the function returns.
For comparison, this is how CRYPTO_memcmp() works:
int CRYPTO_memcmp( const void* f, const void* s, size_t length )
{
    size_t i;
    const unsigned char* first = f;
    const unsigned char* second = s;
    unsigned char magic = 0;

    for( i = 0; i < length; i++ )
    {
        magic |= (first[i] ^ second[i]);
    }
    return magic;
}
The first difference is that it returns only “zero” or “non-zero”, so it is not a drop-in replacement for memcmp(). Everywhere the function is used in OpenSSL this difference is taken into account: it is used only to check whether two blocks are equal.
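For illustration, typical usage looks something like the sketch below (my own example, not actual OpenSSL code; the variable names are invented): only equality matters, never ordering.

/* Hypothetical call site: only the "equal / not equal" answer is used. */
if( CRYPTO_memcmp( received_mac, computed_mac, mac_len ) != 0 )
{
    /* MACs differ - reject the message */
}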
The second difference is that the loop never exits early: it always walks both blocks of data in full. This difference is the entire reason the function exists.
Alas.
An optimizing compiler has the right to “break” CRYPTO_memcmp() and generate the same machine code for it as for the ordinary memcmp() (adjusted, of course, for the slight difference in the semantics of their return values). Below we will look in detail at what opportunities the compiler has for such an optimization. At the time of writing, none of the common compilers available on gcc.godbolt.org perform it. All the more interesting: you can stock up on popcorn and watch how things develop.
Now look at the implementation of CRYPTO_memcmp() again, this time with the C99 Standard at hand. It does not contain a definition of “observable behavior” like the one in 1.9/6 of the C++03 Standard, but 5.1.2.3/3 says the compiler may remove code that does not produce “needed side effects”. Further, 5.1.2.3/6 lists the minimum requirements, among them that volatile objects are stable at sequence points, in the sense that previous accesses are complete and subsequent accesses have not yet started. In essence, this requirement is similar to 1.9/6 of C++03: nothing at all is said about accesses to objects that are not volatile-qualified.
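To see what that means in practice, here is a minimal illustration of my own (not from the original post): everything below touches only non-volatile objects and produces no value that is ever used, so the compiler is free to delete the loop entirely.

/* A sketch: no volatile objects, no needed side effects, so under C99
   5.1.2.3 the whole loop may legitimately be optimized away. */
void busy_wait( void )
{
    unsigned long i;
    unsigned long dummy = 0;

    for( i = 0; i < 1000000UL; i++ )
    {
        dummy += i;  /* result never used, object is not volatile */
    }
}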
Look closely at this wonderful loop:
/* the value of i is not used after the loop */
for( i = 0; i < length; i++ )
{
    magic |= (first[i] ^ second[i]);
}
The intent is that it will force the compiler to generate code that fully traverses both data blocks. But the Standard requires nothing of the kind. The loop is merely one way to compute the value of the magic variable. From the Standard's point of view, this code is completely equivalent to the following (the value of i after the loop may differ from the first variant, but that value is not used after the loop, so the optimization is permissible):
for( i = 0; i < length; i++ )
{
    magic |= (first[i] ^ second[i]);
    if( magic == UCHAR_MAX )
        break;
}
The compiler has enough information for this optimization. Obviously, the |= operation can either leave the bits of the magic variable unchanged or set them to 1; it can never clear them to 0. Once all the bits are set, no further change to the value of magic is possible.
Would the compiler decide to apply this optimization? The second variant of the code may be slower when the compared blocks are equal or when the first difference lies far from the beginning. Let's check: we compile both versions with the Visual C++ 9 compiler and first make sure that the machine code differs as expected.
This is what the machine code of the loop without break looks like:
00AF10D3  mov  bl,byte ptr [eax+edi]
00AF10D6  xor  bl,byte ptr [eax]
00AF10D8  inc  eax
00AF10D9  or   dl,bl
00AF10DB  sub  ebp,1
00AF10DE  jne  wmain+0D3h (0AF10D3h)
And this is the loop with break:
00AF1116  mov  edx,dword ptr [esp+1Ch]
00AF111A  mov  dl,byte ptr [eax+edx]
00AF111D  xor  dl,byte ptr [eax]
00AF111F  or   bl,dl
00AF1121  cmp  bl,0FFh
00AF1124  je   wmain+12Ch (0AF112Ch)
00AF1126  inc  ecx
00AF1127  inc  eax
00AF1128  cmp  ecx,esi
00AF112A  jl   wmain+116h (0AF1116h)
In the second case there are 10 instructions instead of 6, and the comparison (the cmp instruction) and the conditional jump (the je instruction that follows the cmp) are honestly present.
Surely that must be tens of percent slower, and nobody would do such a thing, right?
But in the worst case (blocks that are equal byte for byte), on an Intel Core i5 660 the second code is no more than 3 percent slower than the first (the result may differ on other processors, of course): branch prediction handles the second loop well and avoids pipeline flushes, so there is almost no slowdown. And if the accumulated differences reach UCHAR_MAX (a single pair of bytes in which one byte is the bitwise inverse of the other is enough for that), you can count on a speedup, provided the difference is not too far from the beginning; with only a 3 percent penalty in the worst case, “not too far” covers roughly 97% of the block.
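For the curious, a rough benchmarking harness along these lines might look like the sketch below. This is my own code, not the one used for the measurements above; the block size, repetition count and timing method are arbitrary, and clock() is only a coarse timer.

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <limits.h>

#define BLOCK_SIZE (64 * 1024)
#define REPS       10000

static unsigned char a[BLOCK_SIZE], b[BLOCK_SIZE];

/* variant 1: always walks the whole block */
static unsigned char scan_full( const unsigned char* f, const unsigned char* s, size_t n )
{
    unsigned char magic = 0;
    size_t i;
    for( i = 0; i < n; i++ )
        magic |= (f[i] ^ s[i]);
    return magic;
}

/* variant 2: may leave the loop once all bits of magic are set */
static unsigned char scan_break( const unsigned char* f, const unsigned char* s, size_t n )
{
    unsigned char magic = 0;
    size_t i;
    for( i = 0; i < n; i++ )
    {
        magic |= (f[i] ^ s[i]);
        if( magic == UCHAR_MAX )
            break;
    }
    return magic;
}

int main( void )
{
    volatile unsigned char sink = 0;  /* keeps the calls from being discarded */
    clock_t t;
    int r;

    memset( a, 0x5A, BLOCK_SIZE );
    memset( b, 0x5A, BLOCK_SIZE );    /* equal blocks: the worst case for scan_break() */

    t = clock();
    for( r = 0; r < REPS; r++ )
        sink |= scan_full( a, b, BLOCK_SIZE );
    printf( "full pass:  %ld\n", (long)( clock() - t ) );

    t = clock();
    for( r = 0; r < REPS; r++ )
        sink |= scan_break( a, b, BLOCK_SIZE );
    printf( "with break: %ld\n", (long)( clock() - t ) );

    return 0;
}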
Compiler developers could add a heuristic that, in cases like this, always compiles the first version of the code as the second.
The compiler may also offer profile-guided optimization (something like PGO in Visual C++), which lets the code be run on a test workload and then optimized for the most typical scenarios. In that case, the compiler has objective data with which to judge whether the transformation pays off.
And remember that the machine code shown above is for an Intel Core i5, and that architecture is not the only one in the world. Some other processor may well have an instruction that, say, combines |=, the comparison and the jump, with zero overhead compared to a bare |=. A compiler well tuned for such an architecture would certainly use that instruction and compile the first version of the code as the second.
Those are three obvious ways to optimize the code using nothing but the code itself.
Stirring slowly, add LTO, LTCG or whatever optimization across translation units is called in your compiler. With this technology, the compiler can analyze the function itself and the calling code together. The calling code will look something like this:
if( CRYPTO_memcmp( first, second, size ) != 0 )
{
}
And here it is completely obvious that the loop inside the function is equivalent to this:
for( i = 0; i < length; i++ )
{
    if( first[i] != second[i] )
        return 3;
}
Armed with this knowledge, the compiler finds it much easier to apply the three optimizations listed above.
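Putting the pieces together, what the compiler is allowed to produce at the call site is sketched below. This is my own illustration of the end result, not output from any particular compiler.

/* After inlining and the optimizations above, the call site may effectively
   become an ordinary early-exit comparison, defeating the constant-time intent. */
size_t i;
int blocks_differ = 0;

for( i = 0; i < size; i++ )
{
    if( first[i] != second[i] )
    {
        blocks_differ = 1;
        break;
    }
}

if( blocks_differ )
{
}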
Code like that in CRYPTO_memcmp() is not portable: it works as expected only on specific compilers with specific build settings.
As usual, the volatile keyword is needed here.
A fully standard-compliant way is to add the volatile qualifier to the magic variable. The compiler is then obliged to generate code that performs the |= operation on every one of the specified loop iterations: it must place magic in automatic storage and emit the corresponding update of the variable on each pass. Because of this, on an Intel Core i5 660 the loop runs roughly half as fast. How significant that is for the surrounding code cannot be said without careful profiling.
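A minimal sketch of this variant, assuming the function signature stays the same, is shown below.

int CRYPTO_memcmp( const void* f, const void* s, size_t length )
{
    size_t i;
    const unsigned char* first = f;
    const unsigned char* second = s;
    volatile unsigned char magic = 0;  /* forces a real update on every iteration */

    for( i = 0; i < length; i++ )
    {
        magic |= (first[i] ^ second[i]);
    }
    return magic;
}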
If it turns out that adding the volatile qualifier to magic slows the code down unacceptably, then, subject to the reservations below, try “volatile pointers”:
int CRYPTO_memcmp( const void* f, const void* s, size_t length )
{
    size_t i;
    const volatile unsigned char* first = f;
    const volatile unsigned char* second = s;
    unsigned char magic = 0;

    for( i = 0; i < length; i++ )
    {
        magic |= (first[i] ^ second[i]);
    }
    return magic;
}
If const volatile puzzles you, it shouldn't: const says the data cannot be modified by your code, while volatile says it may, in principle, be modified outside your code, so reads of it cannot be optimized away.
As in the previous post, this gives no hard guarantee, because the data itself is not volatile-qualified (dereferencing the pointer yields an lvalue whose type does not change the underlying object), but the odds are good: compilers usually treat code with such pointers conservatively and do not optimize the reads away.
Using volatile this way makes the code somewhat more portable.
Dmitry Mescheryakov,
product department for developers