
A new round of CUDA architecture

Hello!
In early April, I saw the announcement of NVIDIA's new video card with a new major compute capability version: 3.0. After carefully studying the specs, I was surprised: it looked as though branching would now have the worst possible consequences, namely big performance losses. I had liked the way branching mattered less and less from version to version, and Kepler seemed like a step backwards in this respect. Rationally I understood that this was hardly possible, so I decided to wait a bit.
And this week I got hold of the whitepaper on the new Kepler-architecture number cruncher, which clarified a lot.

Beginnings


Initially, this note was intended only for those already in the know, but just in case I will explain what compute capability in CUDA is.
So, NVIDIA has been developing video cards for general-purpose computing for about five years now. Not that you couldn't compute on a GPU before that, but doing so was very uncomfortable for a normal person. In 2007, the Compute Unified Device Architecture (CUDA) for video cards was introduced. It allowed the power of the devices to grow extensively while the main features of the architecture were preserved. That is, the number of processors and the amount of memory keep growing, but the division of memory into shared / global / texture plus registers has survived since those ancient times (CC 2.0 also added surface memory and a couple of caches, but that is another story).
However, nothing stands still, and the architecture and instruction set change over time, sometimes significantly, sometimes not so much. To denote families of GPUs with an identical architecture, Compute Capability (CC) versions were introduced. For example: devices with CC 1.0 could not do atomic operations at all, those with CC 1.1 could do them in global memory, and those with CC 1.2 in both global and shared memory. A complete list of the capabilities of the different CCs is traditionally given at the end of the CUDA C Programming Guide.

What will Kepler bring us?


Firstly, a new, huge multiprocessor. Where earlier multiprocessors had 8 (CC 1.x), 32 (CC 2.0) or 48 (CC 2.1) stream processors, Kepler's new SMX packs 192. The remaining characteristics are also impressive:
|                                     | Fermi GF100 | Fermi GF104 | Kepler GK104    | Kepler GK110    |
|-------------------------------------|-------------|-------------|-----------------|-----------------|
| CC version                          | 2.0         | 2.1         | 3.0             | 3.5             |
| Threads per warp                    | 32          | 32          | 32              | 32              |
| Warps per multiprocessor            | 48          | 48          | 64              | 64              |
| Threads per multiprocessor          | 1536        | 1536        | 2048            | 2048            |
| Blocks per multiprocessor           | 8           | 8           | 16              | 16              |
| 32-bit registers per multiprocessor | 32768       | 32768       | 65536           | 65536           |
| Maximum registers per thread        | 63          | 63          | 63              | 255             |
| Shared memory configurations        | 16K / 48K   | 16K / 48K   | 16K / 32K / 48K | 16K / 32K / 48K |
| Maximum grid size (X dimension)     | 2^16-1      | 2^16-1      | 2^16-1          | 2^32-1          |
| Hyper-Q                             | No          | No          | No              | Yes             |
| Dynamic parallelism                 | No          | No          | No              | Yes             |

What conclusions can be drawn from this table? The warp size stayed the same, and that is good. The main thing I feared was an increase in warp size, which would mean extra idling during branching. But not everything is rosy. You now have to be even more careful with registers: to load an SMX as fully as possible (in terms of simultaneously resident blocks / threads) you have to use even fewer of them per thread. On the other hand, the architecture now allows up to 255 registers per thread, and where the old postulate was "better to recompute than to store," things are no longer so simple. So we have another fork in the optimization path: you can keep intermediate results in the thread, but then fewer blocks will fit on the SMX, and read/write stalls against RAM become quite possible.
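The registers-vs-occupancy trade-off can be steered explicitly with `__launch_bounds__`, which tells the compiler how many resident blocks per SMX to aim for. A hypothetical sketch (kernel name and computation are made up for illustration):

```cuda
#include <cuda_runtime.h>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// with 65536 registers per SMX and a target of 8 blocks of 256 threads,
// the compiler must fit within 65536 / (8 * 256) = 32 registers per
// thread, spilling to local memory if the kernel needs more.
__global__ void
__launch_bounds__(256, 8)
heavyKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Placeholder computation with a register-resident accumulator.
    float acc = in[i];
    for (int k = 0; k < 16; ++k)
        acc = acc * acc + 0.5f;
    out[i] = acc;
}
```

Dropping the second argument (or omitting `__launch_bounds__` entirely) lets the compiler spend up to the per-thread maximum on registers, which is exactly the other branch of the fork described above.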
You can also note in passing that double-precision calculations and the special functions (sin, cos, etc.) will now run faster, which is pleasing.
The second notable innovation is the new warp scheduler. Each multiprocessor has 4 of them, and each can issue 2 instructions per warp per clock. In short, an interesting thing: it still makes sense to ensure that branching inside a warp produces at most 2 paths.
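Since the warp size is still 32, divergent branches within a warp are serialized just as before. A minimal illustration (kernel names are made up): the first kernel splits every warp down the middle, the second keeps each warp on a single path.

```cuda
// Divergent: inside every warp, even and odd threads take different
// branches, so the two sides execute one after the other while the
// inactive half of the warp idles.
__global__ void divergent(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// Uniform: the condition depends only on the block index, so every
// thread of a given warp takes the same path and nothing is serialized.
__global__ void uniform(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}
```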
Next, the things I don't yet fully understand:
Dynamic Parallelism: I didn't quite work out how it will function, but it looks as if threads will be able to launch new grids... We'll see how it turns out. It seems that soon only I/O will remain on the host, and everything else can be run on the GPU, since it is now possible to manage kernels from there.
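Based on the whitepaper's description, a device-side launch would look roughly like this; a minimal sketch assuming CC 3.5 hardware and relocatable device code (compiled along the lines of `nvcc -arch=sm_35 -rdc=true -lcudadevrt`):

```cuda
#include <cstdio>

__global__ void childKernel(int parentBlock)
{
    printf("child grid launched by block %d, thread %d\n",
           parentBlock, threadIdx.x);
}

__global__ void parentKernel()
{
    // One thread per block launches a child grid from the device,
    // with no round-trip to the host.
    if (threadIdx.x == 0)
        childKernel<<<1, 4>>>(blockIdx.x);
}

int main()
{
    parentKernel<<<2, 32>>>();
    cudaDeviceSynchronize();  // waits for parent and child grids alike
    return 0;
}
```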
GPUDirect: now you can transfer data from one GPU to another directly, bypassing the host, and even over the network. Seeing is believing; it sounds too cool.
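For the host-bypassing part, the runtime API already exposes peer-to-peer copies. A sketch assuming two peer-capable GPUs on the same machine (the network variant is a separate GPUDirect feature not shown here):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) {
        printf("GPUs 0 and 1 cannot access each other directly\n");
        return 1;
    }

    size_t bytes = 1 << 20;
    float *src, *dst;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // let device 0 reach device 1
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // Copy straight from card to card, without staging in host RAM.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    return 0;
}
```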
There are all sorts of other pleasant things as well.

And finally, the really cool part: NVIDIA sold the souls of a couple of engineers to the gods of power consumption, and now the GTX 680 promises to consume less than a comparable Fermi in terms of W/FLOPS. Nice, isn't it? Makes you want to throw the old 280 "ovens" straight out: they heat up to 130 degrees, and 1.5 kW of power had to be held in reserve for them.

Conclusion


In short, I don't know about you, but I am eagerly awaiting the new Kepler! I just don't know whether I will manage to get hold of two computers with CC 3.5 GPUs so I can give GPUDirect a feel. If anyone has such a setup, let me come and touch it, okay?


Source: https://habr.com/ru/post/144202/
