
A Multithreaded Tale of Lost Time

In the post The Tale of Lost Time, user crea7or described how he refuted Euler's sum of powers conjecture on a modern CPU.

I was curious how a GPU would fare, so I compared single-threaded code against multi-threaded code on the CPU and a fully parallel version on the GPU, using the CUDA parallel computing architecture.

Hardware and compilers

CPU: Core i5 4440
GPU: GeForce GTX 770
MSVS 2015 Update 3
CUDA Toolkit 8

Naturally, the build was a 64-bit release with /O2 and /LTCG optimizations.
Disabling the video adapter reset
Since the calculations take longer than two seconds, the video driver kills the CUDA kernel. To prevent this, you need to set a sufficiently long timeout via the registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrDelay and restart the computer.
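As a sketch, the value can be set from an elevated command prompt; the key path and REG_DWORD type come from Microsoft's TDR documentation, while the 60-second delay is an arbitrary choice of mine:

```shell
:: Allow CUDA kernels to run for up to 60 seconds before Windows
:: triggers a Timeout Detection and Recovery (TDR) driver reset.
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" ^
    /v TdrDelay /t REG_DWORD /d 60 /f
```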


CPU


To start with, I parallelized the algorithm for the CPU. The function is given a range of values of the outer loop variable a to iterate over; the whole range 1..N is then split into pieces and fed to the processor cores.

```cpp
void Algorithm_1(const uint32_t N, const std::vector<uint64_t>& powers,
                 const uint32_t from, const uint32_t to)
{
    for (uint32_t a = from; a < to; ++a) {
        for (uint32_t b = 1; b < a; ++b) {
            for (uint32_t c = 1; c < b; ++c) {
                for (uint32_t d = 1; d < c; ++d) {
                    const uint64_t sum = powers[a] + powers[b] + powers[c] + powers[d];
                    if (std::binary_search(std::begin(powers), std::end(powers), sum)) {
                        const auto it = std::lower_bound(std::begin(powers), std::end(powers), sum);
                        uint32_t e = static_cast<uint32_t>(std::distance(std::begin(powers), it));
                        std::cout << a << " " << b << " " << c << " " << d << " " << e << "\n";
                    }
                }
            }
        }
    }
}
```
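The `powers` vector is assumed to hold precomputed fifth powers, so that `powers[i] == i^5`; since the array is strictly increasing, `std::binary_search` and `std::lower_bound` apply directly. A minimal sketch of that setup (the function name is mine):

```cpp
#include <cstdint>
#include <vector>

// Precompute powers[i] = i^5 for i = 0..N. For N up to a few thousand
// the values fit comfortably in uint64_t (1500^5 ~ 7.6e15).
std::vector<uint64_t> MakeFifthPowers(const uint32_t N)
{
    std::vector<uint64_t> powers(N + 1);
    for (uint64_t i = 0; i <= N; ++i)
        powers[i] = i * i * i * i * i;
    return powers;
}
```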

In my case there were 4 cores, and I ran the calculations through a slightly modified thread pool that I took as a basis from here. Since there are no dependencies between the data, the speed grows almost proportionally to the number of cores.

GPU


For the GPU it is a little more complicated. The two outer search loops (over a and b) are unrolled into the launch grid, and the kernel contains only the two inner loops (over c and d). One can imagine the program running in parallel for all values of a = 1..N and b = 1..N. In reality, of course, not everything executes simultaneously, but it is CUDA's job to parallelize the execution as much as possible, and we can help it by setting up the configuration of blocks and threads properly.

Blocks and threads
```cpp
int blocks_x = N;
int threads = std::min(1024, static_cast<int>(N));
int blocks_y = (N + threads - 1) / threads;
dim3 grid(blocks_x, blocks_y);

NaiveGPUKernel<<<grid, threads>>>(N);
```

blocks_x is the iteration range of the first loop, a.
Iterating over the second loop, b, had to be made trickier because of the video card's limit on the number of threads per block (1024): the counter is spread across both threads and blocks_y.
a and b are then computed as:

```cpp
const int a = blockIdx.x + 1;
const int b = threadIdx.x + blockIdx.y * blockDim.x + 1;
```

This is certainly not an optimal configuration, since many threads end up idle. But I didn't tune the values any further.
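The mapping from (blockIdx, threadIdx) to (a, b) can be checked on the host. The sketch below is my own sanity check, not code from the article: it simulates the grid and verifies that every pair with 1 ≤ b < a ≤ N is visited exactly once.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Host-side simulation of the launch configuration: counts how many
// simulated threads land on each pair (a, b) with b < a <= N.
bool CoversEachPairOnce(const uint32_t N)
{
    const int threads  = std::min(1024, static_cast<int>(N));
    const int blocks_x = static_cast<int>(N);
    const int blocks_y = (static_cast<int>(N) + threads - 1) / threads;

    std::vector<uint32_t> hits((N + 1) * (N + 1), 0);
    for (int bx = 0; bx < blocks_x; ++bx)
        for (int by = 0; by < blocks_y; ++by)
            for (int tx = 0; tx < threads; ++tx) {
                const int a = bx + 1;
                const int b = tx + by * threads + 1;
                if (b >= a)  // same guard as in the kernel
                    continue;
                ++hits[a * (N + 1) + b];
            }

    for (uint32_t a = 2; a <= N; ++a)
        for (uint32_t b = 1; b < a; ++b)
            if (hits[a * (N + 1) + b] != 1)
                return false;
    return true;
}
```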

The GPU code is quite straightforward and very similar to the CPU version; only the binary search has to be written by hand.

```cpp
__constant__ uint64_t gpu_powers[8192];

inline __device__ int gpu_binary_search(const uint32_t N, const uint64_t search)
{
    int l = 0;
    int r = static_cast<int>(N) - 1;  // signed, so r = m - 1 can go below zero safely
    while (l <= r) {
        const int m = (l + r) / 2;
        if (search < gpu_powers[m])
            r = m - 1;
        else if (search > gpu_powers[m])
            l = m + 1;
        else
            return m;  // index of the found element
    }
    return -1;
}

__global__ void NaiveGPUKernel(const uint32_t N)
{
    const int a = blockIdx.x + 1;
    const int b = threadIdx.x + blockIdx.y * blockDim.x + 1;
    if (b >= a)
        return;
    for (uint32_t c = 1; c < b; ++c) {
        for (uint32_t d = 1; d < c; ++d) {
            const uint64_t sum = gpu_powers[a] + gpu_powers[b] + gpu_powers[c] + gpu_powers[d];
            const auto s = gpu_binary_search(N, sum);
            if (s > 0) {
                printf("%d %d %d %d %d\n", a, b, c, d, s);
            }
        }
    }
}
```
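The hand-rolled search can be sanity-checked on the host against std::lower_bound. This harness is mine: it is the same algorithm with the __device__ qualifier dropped and the array passed in rather than read from constant memory.

```cpp
#include <cstdint>
#include <vector>

// Host mirror of gpu_binary_search: returns the index of `search`
// in the sorted array, or -1 if it is absent.
int BinarySearchIndex(const std::vector<uint64_t>& powers, const uint64_t search)
{
    int l = 0;
    int r = static_cast<int>(powers.size()) - 1;
    while (l <= r) {
        const int m = (l + r) / 2;
        if (search < powers[m])
            r = m - 1;
        else if (search > powers[m])
            l = m + 1;
        else
            return m;
    }
    return -1;
}
```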

I also implemented the optimizations suggested here.

Speed measurement


CPU measurements were taken twice, since they turned out to be much more stable; GPU measurements were taken seven times each, and the best time was selected. The GPU spread could be as much as twofold, which I attribute to the fact that the only video adapter in the system was also being used for the CUDA calculations.
| Variant | N | CPU time, ms | CPU (4 threads) time, ms | GPU time, ms |
|---|---|---|---|---|
| @antonkrechetov | 100 | 58.6 | 16.7 | 14.8 |
| Base | 100 | 45.3 | 13.0 | 10.7 |
| Optimization #1 | 100 | 6.3 | 2.1 | 12.7 |
| Optimization #2 | 100 | 1.4 | 0.7 | 0.8 |
| @antonkrechetov | 250 | 2999.7 | 782.9 | 119.0 |
| Base | 250 | 2055.6 | 550.1 | 90.9 |
| Optimization #1 | 250 | 227.2 | 60.3 | 109.2 |
| Optimization #2 | 250 | 42.9 | 11.9 | 6.0 |
| @antonkrechetov | 500 | 72034.2 | 19344.1 | 1725.83 |
| Base | 500 | 38219.7 | 10200.8 | 976.7 |
| Optimization #1 | 500 | 3725.1 | 926.5 | 1140.36 |
| Optimization #2 | 500 | 630.7 | 170.2 | 48.7 |
| @antonkrechetov | 750 | 396566.0 | 105003.0 | 11521.2 |
| Base | 750 | 218615.0 | 57639.2 | 5742.5 |
| Optimization #1 | 750 | 19082.7 | 4736.8 | 6402.1 |
| Optimization #2 | 750 | 3272.0 | 846.9 | 222.9 |
| Optimization #2 | 1000 | 10204.4 | 2730.3 | 1041.9 |
| Optimization #2 | 1250 | 25133.1 | 6515.3 | 2445.5 |
| Optimization #2 | 1500 | 51940.1 | 14005.0 | 4895.2 |

And now let's give the CPU a shot of adrenaline!


That adrenaline is profile-guided optimization.
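For reference, a PGO build with the MSVC toolchain goes roughly like this; the flags are the standard /GENPROFILE and /USEPROFILE linker options, while the file name and workload are illustrative:

```shell
:: 1. Build an instrumented binary
cl /O2 /GL euler.cpp /link /LTCG /GENPROFILE

:: 2. Run it on a representative workload to collect the profile data
euler.exe 500

:: 3. Rebuild using the collected profile
cl /O2 /GL euler.cpp /link /LTCG /USEPROFILE
```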

For PGO I list only the CPU times, because the GPU results barely changed, which is expected.
| Variant | N | CPU time, ms | CPU (4 threads) time, ms | CPU + PGO time, ms | CPU + PGO (4 threads) time, ms |
|---|---|---|---|---|---|
| @antonkrechetov | 100 | 58.6 | 16.7 | 55.3 | 15.0 |
| Base | 100 | 45.3 | 13.0 | 42.2 | 12.1 |
| Optimization #1 | 100 | 6.3 | 2.1 | 5.2 | 1.9 |
| Optimization #2 | 100 | 1.4 | 0.7 | 1.3 | 0.8 |
| @antonkrechetov | 250 | 2999.7 | 782.9 | 2966.1 | 774.1 |
| Base | 250 | 2055.6 | 550.1 | 2050.2 | 544.6 |
| Optimization #1 | 250 | 227.2 | 60.3 | 200.0 | 53.2 |
| Optimization #2 | 250 | 42.9 | 11.9 | 40.4 | 11.4 |
| @antonkrechetov | 500 | 72034.2 | 19344.1 | 68662.8 | 17959.0 |
| Base | 500 | 38219.7 | 10200.8 | 38077.7 | 10034.0 |
| Optimization #1 | 500 | 3725.1 | 926.5 | 3132.9 | 822.2 |
| Optimization #2 | 500 | 630.7 | 170.2 | 618.1 | 160.6 |
| @antonkrechetov | 750 | 396566.0 | 105003.0 | 404692.0 | 103602.0 |
| Base | 750 | 218615.0 | 57639.2 | 207975.0 | 54868.2 |
| Optimization #1 | 750 | 19082.7 | 4736.8 | 15496.4 | 4082.3 |
| Optimization #2 | 750 | 3272.0 | 846.9 | 3093.8 | 812.7 |
| Optimization #2 | 1000 | 10204.4 | 2730.3 | 9781.6 | 2565.9 |
| Optimization #2 | 1250 | 25133.1 | 6515.3 | 23704.3 | 6244.1 |
| Optimization #2 | 1500 | 51940.1 | 14005.0 | 48717.5 | 12793.5 |

It can be seen that PGO managed to speed up optimization #1 by as much as 18%; for the rest the gains were more modest.

Here be dragons


I forgot to mention an amusing quirk. At first I was searching only for the first solution, so I placed a return immediately after the printf. This cut performance by an order of magnitude. When I started searching for all solutions instead, performance jumped right back up.
```cpp
__global__ void NaiveGPUKernel(const uint32_t N)
{
    const int a = blockIdx.x + 1;
    const int b = threadIdx.x + blockIdx.y * blockDim.x + 1;
    if (b >= a)
        return;
    for (uint32_t c = 1; c < b; ++c) {
        for (uint32_t d = 1; d < c; ++d) {
            const uint64_t sum = gpu_powers[a] + gpu_powers[b] + gpu_powers[c] + gpu_powers[d];
            const auto s = gpu_binary_search(N, sum);
            if (s > 0) {
                printf("%d %d %d %d %d\n", a, b, c, d, s);
                return; // <- this early exit cost an order of magnitude in performance
            }
        }
    }
}
```


What else can you do


For the CPU, one could try unrolling the loops, computing two numbers at a time, and using the processor's vector (SIMD) instructions.

For the GPU, one could play with the order of computation, try to economize on registers, and pick better block and thread parameters.

Conclusions


1. Writing in CUDA is easy.
2. Without any tricks, straightforward CUDA code turned out to be 10 times faster than the single-threaded CPU implementation.
3. On number-crunching tasks, the GPU is much faster than a CPU of the same price.
4. The GPU needs time to "spin up", so on very short tasks there may be no speedup at all.
5. A more complex algorithm may be faster on the CPU but slower on the GPU.
6. Profile-guided optimization speeds up the code and reduces the file size at the same time.

Sources


Traditionally, you can poke around in the code on GitHub.

Source: https://habr.com/ru/post/318066/
