
How we tested dual-core Opterons

Well, to tell the truth, we are still testing them, but one thing has already become obvious: the memory can't keep up with two cores. Well, not in every case. We have a new cluster built from blades with dual-processor boards carrying dual-core Opteron 285 chips; each processor has 4 gigabytes of memory attached and is connected to the outside world via HyperTransport like this:

outside world (including 1 Gb copper Ethernet) --HT-- cpu0 --2xHT-- cpu1


We run the simplest test (well, actually the program is not that simple, it is MG from the NAS Parallel Benchmarks) in two configurations: either we nail all 4 processes of the task to one node, one process per core, or we spread the 4 processes across different processors in two different blades. The first configuration gives 2700 parrots (our arbitrary benchmark units), the second gives 4500. Keep in mind that the second case communicates over Ethernet, which is far less efficient than zero-copy within a NUMA node; even so, with all the traffic squeezed through a single link and plain TCP instead of an optimized compute protocol, it is still much faster.
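By "nailing" a process to a core I mean setting its CPU affinity. Below is a minimal sketch of such a launcher on Linux; the file name and usage are mine for illustration only, on the cluster itself the MPI machinery handles the placement:

/* pin_and_run.c: nail ourselves to one core, then exec the real program.
   Illustrative helper, not the exact scripts from our cluster.
   Build: gcc -O2 pin_and_run.c -o pin_and_run
   Usage: ./pin_and_run <core> <command> [args...] */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <core> <command> [args...]\n", argv[0]);
        return 1;
    }

    int core = atoi(argv[1]);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* pid 0 means "the calling process"; the mask is inherited across exec */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    execvp(argv[2], &argv[2]);
    perror("execvp");
    return 1;
}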
Of course, the reason is obvious: two cores compete for a single memory controller. When we use 4 controllers with asynchronous data delivery between them, everything runs much faster. But then a question arises: who are these guys making us buy multi-core processors while assuring us of their unsurpassed performance? Honestly, it would be better to put some useful logic in place of the second core, a DMA engine for example. Or to cut power consumption by simply throwing that core away.
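To make the contention tangible, here is a crude sketch of my own (not the MG benchmark): a STREAM-style triad run by one or by two threads. On a machine where both threads sit behind the same memory controller, the aggregate bandwidth with two threads grows far less than twofold.

/* triad_contention.c: crude illustration of memory-controller contention.
   Build: gcc -O2 triad_contention.c -o triad -lpthread (add -lrt on old glibc)
   Run:   ./triad 1   then   ./triad 2   and compare the GB/s line. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32L * 1024 * 1024)          /* 32M doubles per array, far beyond any cache */

static volatile double sink;           /* keeps the compiler from discarding the work */

static void *triad(void *arg)
{
    (void)arg;
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) { perror("malloc"); exit(1); }

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];      /* classic STREAM-style triad */

    sink = a[N - 1];
    free(a); free(b); free(c);
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
    if (nthreads < 1) nthreads = 1;
    if (nthreads > 16) nthreads = 16;
    pthread_t tid[16];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, triad, NULL);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* rough figure: counts only the triad's 3*N doubles per thread, even though
       the timed region also includes array initialization */
    printf("%d thread(s): %.2f s, ~%.1f GB/s aggregate\n",
           nthreads, sec, nthreads * 3.0 * N * 8 / sec / 1e9);
    return 0;
}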

Eh. What is most depressing is that the designers keep at it: cramming even more cores into the processor without reducing the competition for memory access. Yes, yes, Intel has a shared second-level cache and AMD has a third-level one, but that is still a shared resource, albeit presumably with several banks. Yes, AMD will hang it all off two memory controllers, but why do it in such a complicated and power-hungry way if it does not remove the problem? Because, as before, two cores will compete for one controller (even if by some miracle you manage to convince Linux to lay out address spaces so that they land behind different memory controllers and in different banks of the L3 cache).
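On the Linux side, the most you can easily do is pin memory to the local node with libnuma; a minimal sketch, assuming a reasonably recent libnuma (v2 API) and glibc with sched_getcpu. It only controls which controller the pages live behind; it does nothing about two cores of the same chip fighting over that controller.

/* numa_local.c: allocate a buffer on the NUMA node the calling thread runs on.
   Build: gcc -O2 numa_local.c -o numa_local -lnuma */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support in this kernel\n");
        return 1;
    }

    int cpu  = sched_getcpu();              /* which core are we on right now  */
    int node = numa_node_of_cpu(cpu);       /* and which memory node that is   */
    size_t sz = 256UL * 1024 * 1024;

    double *buf = numa_alloc_onnode(sz, node);  /* pages bound to our node */
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    printf("running on cpu %d, allocated %zu MB on node %d\n",
           cpu, sz >> 20, node);
    numa_free(buf, sz);
    return 0;
}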

Now, attention, drum roll and all that: processors with integrated graphics cores roll onto the scene. My question is: in their favorite SMP configurations with a single memory subsystem, how are they going to feed their beloved cores with data? Why do we need processors that sit idle (how much is that, if you count it in parrots?) 30% of the time? Why not build simpler processors, but with DMA and their own memory? By the way, did you know that BlueGene/L runs on processors designed for embedded electronics? :)

Fine, even if Intel and AMD insist on selling single-chip solutions, why not simply give each core its own memory? It would be far more effective. Eh. I do not understand the logic of the manufacturers. Even more, I do not understand the logic of those who spend money on all this 'magnificence', because the shortcomings are obvious.

In short, everything is bad except for IBM, which did exactly that in Cell. Though even there the manic desire to shove everything into one chip is puzzling. Yes, external buses used to be slow for a long time. But now there are serial ones: insanely fast, consuming little power, and you can even run them over optical fiber without trouble.

There it is. No, of course the second core can be put to use: some kind of virtualization, or, as in our case, you can run all the system processes on, say, the even cores and the compute processes on the odd ones; that can win you maybe 10 percent, which is not small when a run takes 10 days. But still, intuition protests against complicating something that is already as complicated as a processor. The complication is, of course, only quantitative. But... heh. All the same, the transistors switch, the air warms, the ice melts, America drowns, nations emigrate, and here in the Urals we already live packed tight :)
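For completeness, here is the kind of thing I mean by splitting even and odd cores; again only an illustrative sketch using an affinity mask (the same effect can be had with taskset or cpusets):

/* odd_cores.c: confine a compute job to the odd-numbered cores, leaving the
   even ones to system daemons. Illustrative only.
   Build: gcc -O2 odd_cores.c -o odd_cores
   Usage: ./odd_cores <command> [args...] */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
        return 1;
    }

    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    cpu_set_t set;
    CPU_ZERO(&set);
    for (long c = 1; c < ncpus; c += 2)    /* cores 1, 3, 5, ... */
        CPU_SET((int)c, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    execvp(argv[1], &argv[1]);             /* the mask survives the exec */
    perror("execvp");
    return 1;
}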

PS. Surely someone will come along and say that we are the fools here, that x86-64 got its 16 registers precisely to unload the memory subsystem during calculations, and that one simply has to use a proper optimizing compiler. To which I answer: we used PGI, the compiler recommended by AMD. We tried Intel's compiler too, carefully checking that it optimized the memory accesses. The result is the same.

Source: https://habr.com/ru/post/8802/
