History
Heterogeneous computing has already existed in silicon for many years: the Cell architecture. It is a single processor in which one core is the "smart" one and does everything a modern desktop CPU does, plus eight much simpler cores, only slightly more complex than the cores in a video card. Each of them has its own instruction stream, but that stream is set up by the control core itself; these little cores do not even handle interrupts, they just crunch numbers. The architecture has a scaling problem, though: if you simply add a second control core of this kind, the thread running on one control core may have more work for its "daughter" cores than the other, and chip utilization suffers. Yes, such hardware is hard to write software for, but the architecture could have been developed in several directions.
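A minimal sketch of that split in ordinary C++ threads (the names and the use of std::async are mine, not anything from the Cell SDK): one control thread cuts the work up and hands it out, and the worker "cores" do nothing but arithmetic on the chunk they were given.

```cpp
// Toy model of the Cell split: one "smart" control core prepares and hands
// out the work; simple worker cores only crunch numbers on what they receive.
// Illustrative only; not the real Cell programming model.
#include <array>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// A self-contained chunk of number crunching, as a worker core would get it.
struct Task {
    std::vector<float> data;
    float run() const {                     // no I/O, no interrupts: just math
        return std::accumulate(data.begin(), data.end(), 0.0f);
    }
};

int main() {
    constexpr int kWorkers = 8;             // eight simple cores, as in Cell
    std::vector<float> input(8 * 1024, 1.0f);

    // The control core decides the split up front; the workers never ask
    // for more work on their own.
    std::array<std::future<float>, kWorkers> results;
    const size_t chunk = input.size() / kWorkers;
    for (int w = 0; w < kWorkers; ++w) {
        Task t{std::vector<float>(input.begin() + w * chunk,
                                  input.begin() + (w + 1) * chunk)};
        results[w] = std::async(std::launch::async,
                                [t = std::move(t)] { return t.run(); });
    }

    // Only the control core combines the partial results.
    float total = 0.0f;
    for (auto& r : results) total += r.get();
    std::cout << "sum = " << total << "\n";  // 8192
}
```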
Philosophical Reflections on Computing
Before Fermi, an nVidia compute unit contained just 8 "little cores", which really count as execution units. One could build something that combines Cell and a GPU on a single chip: 2-4 Cell-style control cores hand out tasks to 4-8 SIMD blocks similar to those used in the older nVidia video cards. The control core simply puts a task into a queue, and a hardware scheduler assigns tasks from the 2-4 queues (one per control core) to whichever SIMD block is free. This scheme would simplify programming such chips, because from the programmer's point of view the extra cores are just another SIMD unit inside the core: we write programs for an ordinary processor without even knowing how many SIMD units or execution units there are.

This is very similar to what AMD is doing in the Bulldozer architecture (Sun's processors likewise hold many threads per core and share resources between them effectively): there is a common scheduler for 2 threads, and it manages the resources of 2 cores. If you widen the pipeline in a module and make the scheduler handle not 2 but 4 or more threads, you could keep the execution units busy even in single-threaded tasks that make heavy use of SSE (and AVX in the future). For that you need a rather interesting instruction-set extension, and someone has already thought of one: Intel AVX, which Intel itself intended to use both for the Larrabee video cards and for its processors. Since we know those chips have 16 execution units per core, while the SIMD unit executing SSE in any modern processor is 4 times narrower, we can conclude that the people at Intel have already figured out how to virtualize these instructions; in other words, the programming problem simply does not exist, and the only real problem is that they never shipped the video card, which is the only reason left to criticize heterogeneous computing.

A Bulldozer-style approach made 4 times wider would lead to genuinely productive heterogeneous computing sooner than naively putting a GPU on the same chip as the CPU: that does reduce the latency of talking to the GPU, but it does not make working with it any simpler. Smart virtual machines could help if you pair a processor with a graphics chip that has a small number of execution units per SIMD unit, but the 80 execution units per SIMD unit in the Fusion chips are too big and unwieldy a bludgeon for hammering such tiny nails. AMD's video card architecture is poorly suited for most computational tasks, unlike nVidia's, and that is no secret to anyone. It is, however, very well suited for games. The conclusion: without creating a new GPU architecture, AMD's chances of making candy out of Fusion look poor.
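To make the queue-plus-scheduler idea concrete, here is a toy model in C++ (all names are hypothetical, and threads only stand in for hardware): each control core fills its own queue, and free "SIMD blocks" pull tasks from any non-empty queue, which is roughly what a hardware scheduler would do in silicon.

```cpp
// Toy model of the dispatch scheme above: each control core owns a task
// queue; free SIMD blocks (worker threads here) drain whichever queue has
// work. This only models the flow of tasks, not real hardware.
#include <atomic>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

using SimdTask = std::function<void()>;

struct TaskQueue {                       // one queue per control core
    std::deque<SimdTask> q;
    std::mutex m;
};

int main() {
    constexpr int kControlCores = 2;     // 2..4 control cores
    constexpr int kSimdBlocks   = 4;     // 4..8 SIMD blocks
    constexpr int kTasksPerCore = 16;
    std::vector<TaskQueue> queues(kControlCores);
    std::atomic<int> done{0};
    std::atomic<bool> stop{false};

    // Control cores only enqueue work; they never drive the SIMD blocks.
    for (int c = 0; c < kControlCores; ++c)
        for (int i = 0; i < kTasksPerCore; ++i) {
            std::lock_guard<std::mutex> lk(queues[c].m);
            queues[c].q.push_back([&done] {
                volatile float x = 0;              // stand-in for a kernel
                for (int k = 0; k < 1000; ++k) x += k;
                ++done;
            });
        }

    // Free SIMD blocks grab the next task from any non-empty queue.
    std::vector<std::thread> blocks;
    for (int b = 0; b < kSimdBlocks; ++b)
        blocks.emplace_back([&] {
            while (!stop) {
                bool ran = false;
                for (auto& tq : queues) {
                    SimdTask t;
                    {
                        std::lock_guard<std::mutex> lk(tq.m);
                        if (!tq.q.empty()) {
                            t = std::move(tq.q.front());
                            tq.q.pop_front();
                        }
                    }
                    if (t) { t(); ran = true; }
                }
                if (!ran) std::this_thread::yield();
            }
        });

    while (done < kControlCores * kTasksPerCore) std::this_thread::yield();
    stop = true;
    for (auto& th : blocks) th.join();
    std::cout << "executed " << done.load() << " tasks on "
              << kSimdBlocks << " SIMD blocks\n";
}
```

The point of the sketch is that the programmer only ever writes the tasks; which SIMD block runs them, and in what order, is the scheduler's business.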
Near future
But back to reality: nVidia has said that heterogeneous computing is a dubious idea, apparently only because they have decided to build their own processor, a multi-core ARM CPU combined with a Fermi-architecture GPU on a single chip. In my opinion that will be far more convenient and interesting for computation than a standalone high-end Fermi card. The reduce half of a map/reduce pair is performed recursively, hence sequentially, which means that to use the map/reduce paradigm successfully on a GPU you need several processor cores with high clock speed, their own schedulers and out-of-order execution. Sitting on the same chip lets all of this run without the overhead of copying data through memory, and lets new tasks and threads be created directly on the chip without requiring changes to the video card architecture, only some changes to the drivers.
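A toy illustration of why the reduce half wants fast sequential cores nearby (plain C++ threads standing in for GPU lanes, purely illustrative): the map step parallelizes perfectly, but each reduction pass halves the number of active elements until the tail is effectively single-threaded.

```cpp
// Map/reduce in miniature: the map step is embarrassingly parallel, while the
// reduction tree keeps shrinking until almost no parallelism is left, which is
// the part a high-clocked out-of-order core next to the SIMD blocks would own.
#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<float> v(1 << 20, 2.0f);

    // Map: every element independently; this is what the GPU part is good at.
    const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    const size_t chunk = v.size() / hw;
    for (unsigned t = 0; t < hw; ++t)
        pool.emplace_back([&v, t, chunk, hw] {
            size_t lo = t * chunk;
            size_t hi = (t + 1 == hw) ? v.size() : lo + chunk;
            for (size_t i = lo; i < hi; ++i) v[i] = v[i] * v[i];  // square
        });
    for (auto& th : pool) th.join();

    // Reduce: each pass halves the number of active elements; the final
    // passes have almost no parallelism, so they run here sequentially.
    size_t n = v.size();
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; ++i) v[i] += v[n - 1 - i];
        n = (n + 1) / 2;
    }
    std::cout << "sum of squares = " << v[0] << "\n";  // 4194304
}
```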