Hello! I decided to tell you how I optimized the Litecoin mining algorithm, and I will tell the story in the form of a diary.
Day 0: I stumbled upon a topic where a certain bitless shared with the community a way to speed up mining by a few percent. I asked myself: am I any worse than him? I started analyzing the code.
Day 1: I refreshed my memory of the syntax and looked up information about the OpenCL functions.
Day 2: I stumbled upon very strange lines of code:
#define Coord(x,y,z) x+y*(x ## SIZE)+z*(y ## SIZE)*(x ## SIZE)
#define CO Coord(z,x,y)
Substituting the first definition into the second, I got:
#define CO z+x*zSIZE+y*xSIZE*zSIZE
Why this indirection was needed, I don't understand. Besides, the scrypt_core function contained an unused variable, ySIZE.
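The expansion can be checked with a small plain-C sketch. The two macros are the ones quoted above; the concrete sizes (xSIZE = 4, ySIZE = 3) and the function names index_via_macro and index_by_hand are assumptions made only for this illustration.

```c
#include <stdint.h>

/* Hypothetical sizes for illustration only: in the kernel, xSIZE comes from
   the miner settings and zSIZE is 8. */
#define xSIZE 4u
#define ySIZE 3u   /* defined, but never referenced by the expansion of CO */
#define zSIZE 8u

/* The macros as quoted in the post: ## pastes the argument name onto SIZE. */
#define Coord(x,y,z) x+y*(x ## SIZE)+z*(y ## SIZE)*(x ## SIZE)
#define CO Coord(z,x,y)

/* Index computed through the macro; it relies on locals named x, y and z. */
uint32_t index_via_macro(uint32_t x, uint32_t y, uint32_t z)
{
    return CO;   /* expands to z+x*(zSIZE)+y*(xSIZE)*(zSIZE) */
}

/* The same index written out by hand. */
uint32_t index_by_hand(uint32_t x, uint32_t y, uint32_t z)
{
    return z + x * zSIZE + y * xSIZE * zSIZE;
}
```

Because Coord is called with the arguments (z, x, y), only zSIZE and xSIZE are ever pasted together, which matches the observation that ySIZE is unused.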
Also, in the search function I found several uses of i * 2 and replaced them with rotl(i, 1U). It did not give a performance boost, but I left it in.
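For reference, here is a plain-C sketch of when the two forms agree. It assumes that rotl is a 32-bit left rotate, as the name suggests; rotl32 and doubling_matches_rotate are hypothetical helper names, not functions from scrypt.cl.

```c
#include <stdint.h>

/* Plain-C stand-in for a 32-bit left rotate (valid for n in 1..31). */
uint32_t rotl32(uint32_t v, uint32_t n)
{
    return (v << n) | (v >> (32u - n));
}

/* Returns 1 when rotating i left by one bit gives the same value as i * 2.
   This holds whenever the top bit of i is clear; with the top bit set, the
   rotate wraps that bit around to bit 0 while the multiplication drops it. */
int doubling_matches_rotate(uint32_t i)
{
    return rotl32(i, 1u) == i * 2u;
}
```

So the replacement is arithmetically safe only for values below 2^31, which small loop counters satisfy.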
Day 3: I realized that, at my current level of knowledge, my only chance to optimize anything was to help the compiler with “CO”, because with the default settings z + x * zSIZE + y * xSIZE * zSIZE is evaluated 8704 times. Most likely the compiler already optimizes these calculations somehow, but a little help won't hurt :) Besides, zSIZE is a constant equal to 8, and xSIZE is a constant derived from the program settings.
I began by examining the first loop that uses "CO":
for(uint y=0; y<1024/LOOKUP_GAP; ++y)
{
#pragma unroll
    for(uint z=0; z<zSIZE; ++z)
        lookup[CO] = X[z];
It can be seen that x * zSIZE is a constant, since x does not change during the execution of the loop. It is also obvious that y increases by 1 with each iteration of the loop.
This suggests turning CO into a variable that initially holds x * zSIZE and has xSIZE * zSIZE added to it on each iteration of the loop.
To keep z out of the index calculation, we create a local variable inside the loop and increment it by one after each iteration of the inner loop.
Besides the reason above, this may allow the compiler to keep that variable in a register, which should also speed things up.
The result is the following code:
uint CO=rotl(x,3U);
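Only the first line of the resulting code is shown above, so the following plain-C sketch is a reconstruction of the idea from the description, not the author's exact code. The function name first_loop_indices_match and the parameters x_size (standing in for xSIZE) and y_count (standing in for 1024/LOOKUP_GAP) are assumptions for the illustration. It walks the loop with a running index and checks it against the original formula.

```c
#include <stdint.h>

#define zSIZE 8u

/* Returns 1 if the incrementally updated index equals
   z + x*zSIZE + y*xSIZE*zSIZE for every store of the first loop. */
int first_loop_indices_match(uint32_t x, uint32_t x_size, uint32_t y_count)
{
    uint32_t CO = x * zSIZE;                /* initial value: x * zSIZE */
    const uint32_t stride = x_size * zSIZE; /* added once per outer iteration */

    for (uint32_t y = 0; y < y_count; ++y, CO += stride) {
        uint32_t CO_reg = CO;               /* local copy that absorbs z */
        for (uint32_t z = 0; z < zSIZE; ++z, ++CO_reg) {
            uint32_t expected = z + x * zSIZE + y * x_size * zSIZE;
            if (CO_reg != expected)
                return 0;
        }
    }
    return 1;
}
```

In the kernel, the store would use the running index (CO_reg here) directly, so no multiplication remains inside either loop.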
Day 4: I analyzed the remaining uses of "CO". I skipped the code in the preprocessor section, because CO is used there only once.
for (uint i=0; i<1024; ++i)
{
    uint4 V[8];
    uint j = X[7].x & K[85];
Here y changes according to a rather complicated algorithm, so its value can hardly be predicted.
Therefore, all I can do is calculate x * zSIZE and xSIZE * zSIZE in advance.
CO_tmp=rotl(x,3U);
CO=rotl(xSIZE,3U);
for (uint i=0; i<1024; ++i)
{
    uint4 V[8];
    uint j = X[7].x & K[85];
    uint y = (j/LOOKUP_GAP);
    uint CO_reg=CO_tmp+CO*y;
    for(uint z=0; z<zSIZE; ++z, ++CO_reg)
        V[z] = lookup[CO_reg];
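The hoisting can be checked in plain C. This is a sketch under stated assumptions: second_loop_indices_match, x_size (standing in for xSIZE) and lookup_gap are hypothetical names, a simple pseudo-random sequence stands in for X[7].x & K[85], and the mask is assumed to keep j within 0..1023. A plain shift by 3 is used in place of rotl(x, 3U); the two agree as long as the top three bits of the operand are clear.

```c
#include <stdint.h>

#define zSIZE 8u

/* Returns 1 if CO_tmp + CO*y (+z) equals z + x*zSIZE + y*xSIZE*zSIZE for
   every read, with y changing unpredictably between iterations.
   lookup_gap must be nonzero. */
int second_loop_indices_match(uint32_t x, uint32_t x_size, uint32_t lookup_gap)
{
    const uint32_t CO_tmp = x << 3;     /* x * zSIZE, computed once */
    const uint32_t CO = x_size << 3;    /* xSIZE * zSIZE, computed once */
    uint32_t state = 12345u;            /* seed of the stand-in sequence */

    for (uint32_t i = 0; i < 1024u; ++i) {
        state = state * 1103515245u + 12345u;   /* wraps modulo 2^32 */
        uint32_t j = (state >> 16) & 1023u;     /* stand-in for X[7].x & K[85] */
        uint32_t y = j / lookup_gap;
        uint32_t CO_reg = CO_tmp + CO * y;      /* one multiply per outer step */
        for (uint32_t z = 0; z < zSIZE; ++z, ++CO_reg) {
            if (CO_reg != z + x * zSIZE + y * x_size * zSIZE)
                return 0;
        }
    }
    return 1;
}
```

Compared with the original macro, each outer iteration needs one multiplication instead of recomputing both products for all eight values of z.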
Day 5: I compiled the kernel and observed an increase of ~3% on my configuration. After re-checking the results several times, I cleaned up the code, optimized the code in the preprocessor section in the same way as in the second loop, and removed the earlier replacement of i * 2 with rotl(i, 1U).
The tests of the “cleaned-up” code surprised me: the speed was significantly lower than before I started optimizing. After a little investigation, I found that the cause was reverting from rotl(i, 1U) back to i * 2.
On its own the replacement seemed to give nothing at all (I checked this several times), yet together with my optimizations it increased the speed.
I emailed the results of my work to Con Kolivas.
Day 12: Having received no response after a week, I posted my results and instructions on the official Litecoin forum.
For a few people it worked and really did give a speed boost. However, a problem was soon discovered: I had run my tests with the 13.4 drivers, and with earlier versions of the drivers (and of OpenCL) the speed dropped by about a third.
I installed the 13.1 driver (not without problems: the OpenCL version refused to downgrade until the system had been completely cleared of the AMD and OpenCL drivers) and began investigating.
Day 13: I found that the very same mysterious replacement of i * 2 with rotl(i, 1U) causes the drop in performance. However, with this replacement removed, the speed only returns to the initial level.
I understood that two versions of the optimizations were needed: one for 13.4 and one for older driver versions.
Days 13-40: I gathered volunteer testers from among those who had reported that the optimization did not work on 13.1, and got to work.
During the tests, hardware-dependent optimizations turned up, which I rejected immediately (for example, creating an array of 1024 elements holding the precomputed values of y * xSIZE * zSIZE for y = 0..1023: it sped things up for me, but not for the testers). I also saw a nonlinear effect of clock frequencies on the results of some optimizations: on my 7850 at 1000/1300, I recorded anomalous results such as ~340 kilohashes per second at intensity 13 (a setting that lets you work comfortably at the computer while the graphics card is mining), instead of ~200 kilohashes per second without my optimization and/or at other frequencies.
Still, I had to give up such optimizations (almost: I kept this version for myself).
A “magic value” of the --thread-concurrency setting was also found: the speed increases when it equals 2^n + 1, for example 4097 or 16385. For any other value, the mining speed is lower.
My guess is that multiplication by 2^n + 1 may be especially fast, but that is not entirely logical, because multiplication by 2^n is a simple left bit shift and, in theory, should be faster...
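To make the comparison concrete, here is how the two multiplications reduce to shifts in plain C. The function names mul_pow2 and mul_pow2_plus_one are hypothetical, and the sketch only restates the arithmetic behind the guess: it says nothing about why 4097 or 16385 would be faster on the GPU.

```c
#include <stdint.h>

/* v * 2^n as a single left shift (n in 0..31, result wraps modulo 2^32). */
uint32_t mul_pow2(uint32_t v, uint32_t n)
{
    return v << n;
}

/* v * (2^n + 1) as one shift plus one add, i.e. strictly more work than
   the plain power-of-two case. */
uint32_t mul_pow2_plus_one(uint32_t v, uint32_t n)
{
    return (v << n) + v;
}
```

For example, mul_pow2_plus_one(v, 12) multiplies by 4097 and mul_pow2_plus_one(v, 14) multiplies by 16385.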
As a result, I came to the following code:
uint CO_tmp=xSIZE<<3U;
I published the version for drivers older than 13.4 on the same forum and mentioned the "magic value" of --thread-concurrency. With that, I considered my work complete.
I hope that knowledgeable people can tell me what happens when i * 2 is replaced by rotl(i, 1U), and explain the nature of the “magic value” of the --thread-concurrency parameter.
My topic on the Litecoin forum:
forum.litecoin.net/index.php?topic=4082.0
PS: All work was done in the file scrypt.cl, which ships with any miner that supports Litecoin mining.