
HPC: nVidia, AMD, Sony Cell, x86

Many have heard the mythical words: nVidia CUDA, AMD Brook+, IBM/Sony Cell... Rumor has it that the mere mention of them makes your program run hundreds of times faster. Let's try to figure out what they are, what this magical high-performance programming looks like in general terms, and what gains they can offer compared with the good old x86 processors.


I recently got into high-performance computing (HPC, a bit of a tongue-twister of a term), and I want to share a brief overview of where things stand with programming and performance on various hardware platforms, namely nVidia CUDA, AMD Brook+, Sony/Toshiba/IBM Cell (the main horsepower of the PS3) and, for comparison, the old x86 represented by the Core2Quad/i7 and AMD Phenom/Phenom II. Note that many nuances affect performance (cache sizes and latencies at each level, which kinds of instructions can be combined), and fitting all of that into one article is impossible, so I will only paint the picture in broad strokes.

Everything below refers to 32-bit numbers. For reference: with 64-bit numbers the GTX285's performance drops roughly 10-12 times, the AMD 4870's by 4 times, and x86 and Cell by 2 times.
As a teaser, here is the bottom line up front:

                      nVidia GTX285   AMD 4870   Sony Cell   x86 i7 3.2GHz
Real speed, GFLOP*    500             600        153         153
Die area              470 mm2         256 mm2    120 mm2     263 mm2
Process               55nm            55nm       65nm        45nm

* GFLOP - billions of floating-point operations per second

PS: The 45nm Cell has not been launched yet, and Sony does not use it so far.

nVidia CUDA


nVidia has put perhaps the most effort into promoting its technology. I had heard the magic word CUDA long before I decided to try out what it is. CUDA is a C++ compiler (fairly straightforward and stable) that compiles specially written programs for execution by the SIMD processors inside a video card. For example, the GTX285 has 240 processors, each doing 1-3 operations per clock (multiplication + addition + a special operation; 2-3 are realistic in graphics shaders, but in general-purpose computation it is better to count on 1.5) at a frequency of 1476MHz. The marketing performance is just above 1 TFLOP (a trillion operations per second); the realistically achievable speed is about 500 GFLOP.

The program looks like this:

__device__ unsigned int *data_d;

__global__ void gpu_thing(int somevalue)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    data_d[ix] *= somevalue;
}

Runs on the GPU like this:

dim3 threads(256);
dim3 grid(256);
int src[256 * 256], result[256 * 256];

// copy the data to the video card's memory
// (data_d must have been allocated on the device beforehand, e.g. with cudaMalloc)
cudaMemcpyAsync(data_d, src, sizeof(int) * 256 * 256, cudaMemcpyHostToDevice, 0);

// run on the video card, 256 * 256 threads at once
gpu_thing<<<grid, threads>>>(125);

// get the result back from the video card's memory
cudaMemcpyAsync(result, data_d, sizeof(int) * 256 * 256, cudaMemcpyDeviceToHost, 0);

Limitations: threads are executed in groups of 16, and within a group all must execute the same instruction (otherwise performance drops; the compiler/hardware keeps an eye on this).

Thread launch time is not 1 ms, of course, but it is noticeable, so it is better not to offload small tasks to the video card. While a program runs on the video card the monitor is connected to, Windows cannot refresh the screen; even the mouse cursor does not move. The problem is solved by slicing the workload into 10-20 ms chunks so the user does not suffer too much. If a kernel runs longer than 5 seconds, Windows decides the driver has hung and restarts it, with sad consequences for your program (again, this only applies if a monitor is connected to that video card) :-) Linux and MacOS are supported. Multiple video cards work only if SLI mode is disabled. All video cards from the 8400GS up are supported (although that one usually turns out slower than the CPU). Formally, video cards of different models are not required to work together, but in practice any combination works (the tasks have to be split between them manually).

AMD Brook+


AMD did its best with the hardware: its 4870 has 160 processors, each of which can do 5 operations per clock (with minor restrictions) at a frequency of 750MHz. Theoretical and practical performance: 600 GFLOP. AMD fans, skip the next paragraph, but do be aware: on real applications (if you manage to write them), the 4870 beats the GTX285 by 10-20%.

And now the bad part: with Brook+ things are simply dire. You need to be a very calm person to work with it. The compiler is C (without the ++), so we meet again both long-forgotten restrictions (like declaring local variables only at the beginning of a function body) and completely unfamiliar ones: local arrays are not supported; structures are supported, but the compiler almost never compiles them correctly (it crashes or emits wrong code; this is true for Stream SDK 1.3). In general, be prepared that the version is called 1.3 but works like 0.3 alpha :-D Although this is a VLIW architecture and the ability to unroll loops would be useful (for example, as in Intel C++ with #pragma unroll(5)), that cannot be done. There is no preprocessor either; roll your own. And to top it all off, it cannot work with several video cards (or rather, one process (not thread) per video card). AMD CAL deserves credit: programming at a low level (not quite machine code, but almost) has almost none of these problems, though its drawbacks are obvious. The program looks like this:

kernel void gpu_thing(int input<>, int somevalue, out int output<>)
{
    output = somevalue * input;
}

unsigned int size[] = {800 * 256};
::brook::Stream<int> input(1, size);
::brook::Stream<int> output(1, size);

int src[800 * 256], dst[800 * 256];

input.read(src);              // prepare the data for the video card
gpu_thing(input, 45, output); // run 800 * 256 threads
output.write(dst);            // fetch the result

Restrictions: launch time is also significant; again everything freezes except the mouse cursor; again a 5-second maximum kernel time under Windows.
Supported hardware: HD 2xxx (with limitations), HD 3xxx and HD 4xxx.

Sony Cell


Many are delighted with the performance of the Sony Cell. It certainly is impressive, if we forget that x86 has also moved forward (not to mention the video cards). The Cell consists of an ordinary Power core called the PPE (with modest performance compared to the monsters mentioned above) and 8 SPE cores, of which 1 is potentially defective (fully working chips are sold separately for much more money) and 1 is reserved by the operating system (if you put Linux on the console you can get access to 7, but that is out of reach for end users ;-)), leaving 6 in total. Each of these cores does 8 operations per clock (2 * 4) at 3.2GHz, which gives a total real performance of 153.6 GFLOP. I will not describe the programming here, because those 153 GFLOPs are impractical enough that I have not tried to use them (as far as releasing software goes, it is not a platform for home consumption).

x86


Have you already written off x86? It can still shake the old days: on the i7/Core2Quad, each of the 4 cores can execute 3 SSE2 operations on 4 numbers per clock, 48 operations per clock in total (and relatively recently it was considered difficult or impossible to exceed 1 operation per clock, which is why everyone headed toward VLIW/RISC), which at 3.2GHz gives us 153.6 GFLOP (oops, exactly as much as the Cell). The Phenom/Phenom II can do this too, but its SSE2 units are specialized (one multiplication, one addition, one universal), so its throughput is 20-40% lower. If using 48 operations per clock sounds hard, here is an example:

int data[1024 * 1024 * 12];
const int value1 = 123;
for (int i = 0; i < 1024 * 1024 * 12; i++) data[i] ^= value1;

If you compile this with Intel C++ and run it in 4 threads (these days it can sometimes even figure out how to spawn the threads itself), it will combine the operations into groups convenient for SSE2. (Admittedly, this does not always work smoothly for everyone; you can give hints with numerous #pragmas or use SSE intrinsics, but that is a topic for a separate article.)

What can we conclude? Video cards (especially dual-chip ones) do show phenomenal performance today, but on a fairly narrow class of tasks: about 8 times faster than good code on an ordinary x86 processor. 8 is far from the advertised 30-100, but still enough to occasionally make the impossible possible. In addition, a motherboard for 4 video cards is much cheaper than a board for 4 processors. The euphoria over the Cell's performance in the PS3, however, should already subside: mass-market x86 processors have caught up with it.

And the Habr parser refuses to render the <code> tag properly, so I removed it. :(

Source: https://habr.com/ru/post/49910/

