Comparing the performance of integer multiplication

When I started writing the test program for this article, I privately expected the Intel CPU to beat AMD hands down, and the compiler of the same name to defeat Visual Studio without a fight. But things turned out to be less simple. Perhaps the choice of test workload influenced the outcome?

For the test I used integer multiplication of two 128-bit numbers producing a 256-bit result. The multiplication was repeated 1 billion times, which took between 12 and 85 seconds depending on the configuration. The processors were an AMD FX-8150 at 3.60 GHz and an Intel Core i5-2500 at 3.30 GHz. No multi-threading, no overclocking.

The compilers were Intel Parallel Studio XE version 12.0.0.104 Build 20101006, its newer incarnation 12.1.5.344 Build 20120612, Visual Studio 2010 SP1, and the latest Visual Studio 2012 (the one with the Metro look and the ALL-CAPS menus), that is, Visual C++ 11.0 Release Candidate. The -O2 option was not forgotten: it is enabled for Visual Studio. The Intel compiler optimizes with -O2 by default, so there the option does not have to be given explicitly, and -O3 was enabled instead.

Here is the test itself. I agree that for 64-bit code BN_WORD should be made an __int64; BN_DWORD would then have to be split into a low and a high part, and the multiplication itself done with the _mul128 intrinsic, which these compilers support. All of that is planned for later. The purpose of this article is to compare optimizing compilers, not the speed of 32-bit versus 64-bit multiplication, and also to dispel one myth.

#include <stdio.h>
#include <string.h>   /* memcmp */
#include <windows.h>

#define QUANTITY 4

typedef unsigned int BN_WORD;
typedef unsigned __int64 BN_DWORD;

/* C[0..2*QUANTITY-1] = A[0..QUANTITY-1] * B[0..QUANTITY-1], least significant word first */
void Mul(BN_WORD *C, BN_WORD *A, BN_WORD *B)
{
    BN_WORD Carry = 0;
    BN_WORD h = *(B++);
    int i, j;
    union {
        BN_DWORD sd;
        BN_WORD sw[2];   /* sw[0] = low word, sw[1] = high word (little-endian) */
    } s;

    /* first row: C = A * B[0] */
    for (i = QUANTITY; i > 0; --i) {
        s.sd = (BN_DWORD) *(A++) * h + Carry;
        *C++ = s.sw[0];
        Carry = s.sw[1];
    }
    *C = Carry;

    /* remaining rows: C += A * B[j], shifted by one word per row */
    for (j = QUANTITY-1; j > 0; --j) {
        A -= QUANTITY;
        h = *(B++);
        C -= QUANTITY-1;
        Carry = 0;
        for (i = QUANTITY; i > 0; --i) {
            s.sd = (BN_DWORD) *(A++) * h + *C + Carry;
            *C++ = s.sw[0];
            Carry = s.sw[1];
        }
        *C = Carry;
    }
}

typedef void (*my_proc)(BN_WORD*, BN_WORD*, BN_WORD*);

/* write the address of Mul to a file */
void put_addr(void)
{
    FILE *f = fopen("tmp.$$$", "wb");
    my_proc proc = Mul;
    fwrite(&proc, 1, sizeof(proc), f);
    fclose(f);
}

/* read the address back, so the compiler cannot know what it points to */
my_proc get_addr(void)
{
    FILE *f = fopen("tmp.$$$", "rb");
    my_proc proc = NULL;
    fread(&proc, 1, sizeof(proc), f);
    fclose(f);
    return proc;
}

int main(void)
{
    int i, j;
    LARGE_INTEGER lFrequency, lStart, lEnd;
    double dfTime1;
    BN_WORD A[QUANTITY], B[QUANTITY], C[QUANTITY*2];
    BN_WORD RES[QUANTITY*2] = {
        0xd7a44a41, 0xf6e4895c, 0x1624c878, 0x35650795,
        0xa55cb22f, 0x861c7313, 0x66dc33f7, 0x479bf4db
    };
    // call through a pointer so that the compiler does not inline Mul into main
    void (*mul)(BN_WORD *C, BN_WORD *A, BN_WORD *B);

    put_addr();
    mul = get_addr();

    for (i = 0; i < QUANTITY; ++i) {
        A[i] = B[i] = 0x87654321;
    }

    QueryPerformanceFrequency(&lFrequency);
    QueryPerformanceCounter(&lStart);
    for (i = 0; i < 1000; ++i) {
        for (j = 0; j < 1000000; ++j) {
            mul(C, A, B);
        }
        if (memcmp(RES, C, sizeof(RES)) != 0) {
            printf("Something wrong!\n");
        }
    }
    QueryPerformanceCounter(&lEnd);

    dfTime1 = (double)(lEnd.QuadPart - lStart.QuadPart) / (double)lFrequency.QuadPart;
    printf("Time = %g sec\n", dfTime1);
    return 0;
}
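As an aside, the 64-bit variant mentioned before the listing might look roughly like the sketch below. It was not part of the measurements and is not the author's planned code. Instead of the _mul128 intrinsic it uses a portable helper that builds the 64x64 -> 128-bit product from 32-bit halves. The names LIMBS, mul64x64 and mul_limbs64 are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LIMBS 2   /* 2 x 64 bits = one 128-bit operand (illustrative name) */

/* Portable 64x64 -> 128-bit multiply from four 32x32 -> 64 partial products.
   Stands in for a compiler intrinsic: returns the low 64 bits, stores the high 64 in *hi. */
static uint64_t mul64x64(uint64_t a, uint64_t b, uint64_t *hi)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Middle column: three values below 2^32 each, so the sum fits in 64 bits. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)p0;
}

/* C[0..2*LIMBS-1] = A[0..LIMBS-1] * B[0..LIMBS-1], least significant limb first.
   C must not alias A or B. */
static void mul_limbs64(uint64_t *C, const uint64_t *A, const uint64_t *B)
{
    assert(C != A && C != B);
    memset(C, 0, 2 * LIMBS * sizeof(uint64_t));

    for (int j = 0; j < LIMBS; ++j) {
        uint64_t carry = 0;
        for (int i = 0; i < LIMBS; ++i) {
            uint64_t hi;
            uint64_t lo = mul64x64(A[i], B[j], &hi);

            lo += carry;      hi += (lo < carry);     /* add the incoming carry */
            lo += C[i + j];   hi += (lo < C[i + j]);  /* add the accumulated limb */

            C[i + j] = lo;
            carry = hi;       /* cannot overflow: a*b + carry + c < 2^128 */
        }
        C[j + LIMBS] = carry;
    }
}
```

With both operands set to two limbs of 0x8765432187654321, the result should match the RES array from the listing when its 32-bit words are read in pairs, least significant first. On a compiler that offers a 64x64 -> 128 intrinsic, mul64x64 would be the one place to swap it in.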

The results are shown in the table:

Compiler	AMD FX-8150 3.60 GHz, 64-bit	AMD FX-8150 3.60 GHz, 32-bit	Core i5-2500 3.30 GHz, 64-bit	Core i5-2500 3.30 GHz, 32-bit
Intel Parallel Studio XE 12.0.0.104 Build 20101006	22.6235 sec	25.913 sec	13.0921 sec	23.1986 sec
Intel Parallel Studio XE 12.1.5.344 Build 20120612	22.2398 sec	26.0347 sec	12.9242 sec	23.1603 sec
Visual Studio 2010 C++ 10.0 SP1	22.5853 sec	84.1714 sec	12.4991 sec	53.633 sec
Visual Studio 2012 C++ 11.0 Release Candidate	22.2952 sec	72.8279 sec	12.6212 sec	47.1136 sec
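
To put the totals in perspective, they can be converted to time per multiplication and to an approximate cycle count. The helpers below are only a sketch with hypothetical names (ns_per_call, cycles_per_call). They assume the CPU ran at its nominal clock for the whole run; any turbo mode would change the cycle figure.

```c
#include <assert.h>

/* Nanoseconds spent per call, given the total run time and the number of calls. */
static double ns_per_call(double total_sec, double calls)
{
    assert(calls > 0.0);
    return total_sec / calls * 1e9;
}

/* Approximate clock cycles per call.
   clock_ghz is the assumed constant clock; 1 GHz equals 1 cycle per nanosecond. */
static double cycles_per_call(double total_sec, double calls, double clock_ghz)
{
    return ns_per_call(total_sec, calls) * clock_ghz;
}
```

For example, cycles_per_call(12.4991, 1e9, 3.3) comes to roughly 41 cycles per 128-bit multiplication for VS2010 64-bit code on the Core i5, while cycles_per_call(84.1714, 1e9, 3.6) gives roughly 303 cycles for VS2010 32-bit code on the FX-8150.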

In 64-bit code all the compilers tested give approximately the same result.

In 32-bit code Intel wins by a wide margin. VS2010 lags far behind, and the newest VS2012 shows a solid improvement but is still a long way from Intel.

It is also interesting to compare the AMD and the Core i5. At a similar price of about 7,000 rubles, the two processors show similar performance in 32-bit applications built with the Intel compiler, although I had expected the Core i5 to always come out ahead in a single-threaded test. I plan to write a multi-threaded test to tap the full power of the AMD's 8 cores. There the AMD will most likely win, since it has 8 integer cores (but only 4 floating-point units) against 4 cores in the Core i5, and my Core i5 has no Hyper-Threading.

One more important conclusion suggests itself: the compiler vendors have put all their effort into their optimizing 64-bit compilers and have reached similar results. The processor vendors have likewise concentrated on the 64-bit platform, and there Intel significantly outperforms AMD.

Another interesting result is that one myth is debunked: that the Intel compiler supposedly generates code that runs well only on Intel and performs disastrously on AMD (2 or more times slower). It is easy to see that 32-bit code from the Intel compiler gives about the same result on the AMD and the Intel CPU, whereas code from the VS compilers speeds up noticeably on the Core i5. Why the Core i5 gains so much with VS-generated code I have not figured out yet. (Really, why?)

UPD1. The first version of this article contained an error: thanks to a clever optimization, VS2010 did not perform the multiplication at all, and its code came out 5.5 times faster. The source has now been corrected (a pointer mul to the function Mul was introduced, and the pointer is written to a file and read back to fool the compiler), and the results have been updated. Also, _WORD was changed to BN_WORD for clarity, following the first comment.
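
A lighter alternative to the file round trip, shown here only as a sketch and not as what the article's test does, is to call the function through a volatile function pointer. The compiler has to reload a volatile pointer on every use, so in practice it cannot inline the call or hoist it out of the loop. The names mul32x32, g_mul and run_through_pointer are hypothetical, and the kernel is reduced to a single 32x32 -> 64 multiply to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*mul_fn)(uint32_t, uint32_t);

/* The function under test: one 32x32 -> 64-bit multiply. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Volatile pointer object: each read is an observable access, so the compiler
   may not assume it still holds the address of mul32x32. */
static mul_fn volatile g_mul = mul32x32;

/* Calls the function `iterations` times through the volatile pointer
   and returns the last result (0 if iterations is 0). */
static uint64_t run_through_pointer(uint32_t a, uint32_t b, int iterations)
{
    uint64_t last = 0;
    for (int i = 0; i < iterations; ++i) {
        mul_fn f = g_mul;    /* fresh volatile read on every iteration */
        assert(f != NULL);
        last = f(a, b);
    }
    return last;
}
```

The timed loop would then call run_through_pointer instead of going through put_addr and get_addr. Whether a given compiler still optimizes around this is worth confirming in the generated assembly.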

Source: https://habr.com/ru/post/147272/
