Hello!
In early April I saw the announcement of a new video card from nVidia with a new major compute capability version: 3.0. Having carefully studied the specs, I was surprised: at first glance it looked as if branching would now lead to the worst consequences, namely big losses in performance. I liked that from version to version branching played an ever smaller role, and Kepler seemed to be a step backwards in this respect. Deep down I understood that this was hardly possible and decided to wait a bit.
And this week I finally got hold of the whitepaper on the new number cruncher built on the Kepler architecture, which cleared up a lot.
Beginnings
Initially this note was intended only for those already familiar with the subject, but just in case I will explain what compute capability in CUDA means.
So, nVidia has been developing video cards for general-purpose computing for the past five years. Not that you could not compute on the GPU before that, but doing so was very uncomfortable for a normal person. In 2007 the Compute Unified Device Architecture (CUDA) for video cards was proposed. It allowed the power of the devices to grow extensively while the main features of the architecture were preserved. That is, the number of processors and the amount of memory keep growing, but the division of memory into shared / global / texture memory and registers has been preserved since those ancient times (with CC 2.0 surface memory and a couple of caches appeared as well, but that is another story).
However, nothing stands still, and the architecture and instruction set change over time, sometimes significantly, sometimes not so much. To denote families of GPUs with identical architecture, Compute Capability (CC) versions were introduced. For example, devices with CC 1.0 could not do atomic operations at all, with CC 1.1 they could in global memory, and with CC 1.2 in both global and shared memory. A complete list of the capabilities of the different CCs is traditionally given at the end of the CUDA C Programming Guide.
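To make the atomics example concrete, here is a toy kernel of my own (not from the guide): it needs CC 1.1 for the global-memory atomic and CC 1.2 for the shared-memory one.

```cuda
// Toy example of the CC-dependent atomics mentioned above.
// atomicAdd on shared memory requires CC 1.2+, on 32-bit global memory CC 1.1+.
__global__ void countEven(const int *in, int n, int *evenCount)
{
    __shared__ int localCount;                       // per-block counter
    if (threadIdx.x == 0) localCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] % 2 == 0)
        atomicAdd(&localCount, 1);                   // shared-memory atomic: CC 1.2+
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(evenCount, localCount);            // global-memory atomic: CC 1.1+
}
```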
What will Kepler bring us?
First, a new huge multiprocessor. Where earlier multiprocessors had 8 (CC 1.x), 32 (CC 2.0) or 48 (CC 2.1) stream processors, Kepler's new SMX packs 192. The other characteristics are also impressive:
| | FERMI GF100 | FERMI GF104 | KEPLER GK104 | KEPLER GK110 |
|---|---|---|---|---|
| CC version | 2.0 | 2.1 | 3.0 | 3.5 |
| Threads per warp | 32 | 32 | 32 | 32 |
| Warps per multiprocessor | 48 | 48 | 64 | 64 |
| Threads per multiprocessor | 1536 | 1536 | 2048 | 2048 |
| Blocks per multiprocessor | 8 | 8 | 16 | 16 |
| 32-bit registers per multiprocessor | 32768 | 32768 | 65536 | 65536 |
| Maximum registers per thread | 63 | 63 | 63 | 255 |
| Shared memory configurations | 16 KB / 48 KB | 16 KB / 48 KB | 16 KB / 32 KB / 48 KB | 16 KB / 32 KB / 48 KB |
| Maximum grid size in X | 2^16 - 1 | 2^16 - 1 | 2^16 - 1 | 2^32 - 1 |
| Hyper-Q | No | No | No | Yes |
| Dynamic Parallelism | No | No | No | Yes |
What conclusions can be drawn from this table? The warp size stayed the same, and that is good. The main thing I was afraid of was an increase in warp size, which would lead to extra idling on branching. But not everything is so rosy. Now you need to be even more careful with registers: to load the SMX as fully as possible (in terms of simultaneously resident blocks / threads), you have to use even fewer of them per thread... On the other hand, the architecture allows up to 255 registers per thread, and where the old postulate "better to recompute than to store" used to hold, now it is not so simple. So we have another fork in the optimization path: you can keep intermediate results in a thread's registers, but then fewer blocks will fit on an SMX, and it is quite possible that there will be stalls on reads and writes to device memory.
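To make that trade-off concrete, here is a minimal sketch (my own toy kernel, not from the whitepaper): __launch_bounds__ asks the compiler to keep register usage compatible with a target occupancy, and -maxrregcount does the same for a whole compilation unit.

```cuda
// Hypothetical kernel used only to illustrate the register/occupancy trade-off.
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells nvcc
// to keep register usage low enough for the requested number of resident blocks.
__global__ void __launch_bounds__(256, 8)
lowRegisterKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // trivial body, just a placeholder
}

// The opposite choice is to let one thread hold a lot of intermediate state
// (up to 255 registers per thread on CC 3.5) and accept lower occupancy.
// A file-wide cap instead of the attribute: nvcc -maxrregcount=32 ...
```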
In passing, one can also note that double-precision calculations and the special functions (sin, cos, etc.) will now be executed faster, which is pleasing.
The second notable innovation is the new warp scheduler. Each multiprocessor has four of them, and each is able to issue two instructions per warp per clock. In short, an interesting thing. (Just keep branching inside a warp down to at most two paths.)
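As a reminder of why divergence inside a warp is painful at all, here is a generic toy kernel (mine, not tied to Kepler specifically): threads of one warp that take different branches execute one path after the other.

```cuda
// Hypothetical kernel illustrating warp divergence in general.
// Threads of one warp that disagree at the branch are serialized:
// the warp runs the if-path and then the else-path, masking lanes as needed.
__global__ void divergentKernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] % 2 == 0)          // lanes of one warp may disagree here
        out[i] = in[i] * 3;      // path A
    else
        out[i] = in[i] - 7;      // path B: executed after path A for the same warp
}
```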
Next, the things I have not fully figured out yet:
Dynamic Parallelism - I have not understood exactly how it will work, but it looks like threads will be able to launch new grids... We will see. It seems that soon only I/O will be left on the host, and everything else can be run on the GPU, since there is now a way to manage kernels from the device.
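Judging by the whitepaper, on CC 3.5 a kernel can launch another kernel with the usual triple-angle-bracket syntax. A minimal sketch under that assumption (kernel names are mine; it has to be built with relocatable device code, something like nvcc -arch=sm_35 -rdc=true -lcudadevrt):

```cuda
#include <cstdio>

// Hypothetical child kernel, present only to show the launch mechanics.
__global__ void childKernel(int parentBlock)
{
    printf("child thread %d launched by parent block %d\n", threadIdx.x, parentBlock);
}

// Parent kernel: with dynamic parallelism (CC 3.5) a device thread
// may launch a new grid itself, without going back to the host.
__global__ void parentKernel()
{
    if (threadIdx.x == 0)
        childKernel<<<1, 4>>>(blockIdx.x);
    // A parent grid is not considered finished until its child grids complete.
}

int main()
{
    parentKernel<<<2, 32>>>();
    cudaDeviceSynchronize();   // waits for the parents and, through them, the children
    return 0;
}
```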
GPUDirect - now you can transfer data from one GPU to another directly, bypassing the host. And even over the network. Seeing is believing; it sounds almost too cool.
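The peer-to-peer half of this is already visible in the runtime API; a minimal sketch assuming two peer-capable GPUs in one box (buffer size and variable names are mine):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 1 << 20;                    // 1 MB, arbitrary size for the sketch
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);       // can device 0 reach device 1?
    if (!canAccess) { printf("peer access not supported\n"); return 0; }

    float *buf0, *buf1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);                // enable P2P from device 0 to device 1
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Copy directly between the two GPUs, without staging through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```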
All sorts of pleasant things:
- Shuffle Instruction - a new way to exchange data between threads within a warp. If we are too lazy to allocate separate shared memory and manage access to it, we just take a local variable and juggle it from thread to thread (see the sketch after this list).
- Atomic operations with double precision. Well, a couple more were added. The main thing is that they promise a speed-up of up to 9x!
- New read-only data cache. L1 and L2 data caches appeared in Fermi, and in Kepler you can mark data as read-only (const __restrict) so that the compiler routes it through this new cache. Before that I had to stuff such data into textures. Eh. (A sketch of this, together with the shuffle trick, follows the list.)
- Improved L2 cache. First, it is twice as large; second, access to it is twice as fast.
- Hyper-Q, Grid Management Unit - Fermi already had the ability to run several tasks simultaneously, but these two technologies allow it to be done properly. They work at the hardware level, though, so we will only notice them as a fuller load on the GPU.
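Here is a minimal sketch of the shuffle and read-only-cache points as I understand them (my own kernel, not from the whitepaper): a per-warp sum kept entirely in registers, with the input passed through a const __restrict pointer. __shfl_down is the CC 3.0 intrinsic (later toolkits replaced it with __shfl_down_sync), and the read-only data cache path needs CC 3.5.

```cuda
// Hypothetical kernel: per-warp reduction without shared memory.
// 'in' is marked const __restrict__ so the compiler may route its loads
// through the read-only data cache on CC 3.5 (__ldg(&in[i]) is the explicit form).
__global__ void warpSum(const float * __restrict__ in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Shuffle the running sum down the warp: 16, 8, 4, 2, 1 lanes apart.
    // __shfl_down is the original CC 3.0 intrinsic (deprecated since CUDA 9
    // in favor of __shfl_down_sync with an explicit lane mask).
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);

    if ((threadIdx.x & 31) == 0)      // lane 0 of each warp holds that warp's sum
        atomicAdd(out, v);
}
```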
And finally, the really cool part: nVidia sold the souls of a couple of engineers to the gods of power consumption, and now the GTX 680 promises to consume less than a comparable Fermi in terms of watts per FLOPS. Cool, right? Makes you want to throw out the old GTX 280 "ovens" right away - they heat up to 130 degrees and you had to set aside 1.5 kW of power for them just in case.
Conclusion
In short, I do not know about you, but I am eagerly looking forward to the new Kepler! Though I do not know whether I will manage to get access to two machines with CC 3.5 GPUs to give GPUDirect a proper feel. If anyone gets such a setup, let me poke at it, okay?