
Specialized chips will not save us from the "accelerator wall"



Improvements in CPU speed are slowing, and the semiconductor industry is turning to accelerator cards to keep delivering noticeable gains. Nvidia has benefited most from this transition, but it is part of the same trend driving research into neural-network accelerators, FPGAs, and products such as Google's TPU. These accelerators have delivered impressive speedups in recent years, and many have come to hope that they represent a new path forward as Moore's law slows. But new research suggests the picture is not as rosy as some would like.

Specialized architectures such as GPUs, TPUs, FPGAs, and ASICs, even though they work very differently from general-purpose CPUs, are still built from the same kinds of functional units as x86, ARM, or POWER processors. That means the speedups these accelerators deliver also depend, to some extent, on improvements in transistor scaling. But how much of the gain comes from advances in manufacturing technology and the density increases associated with Moore's law, and how much from improvements specific to the domains these processors target? In other words, how much of the improvement is due to transistors alone?

David Wentzlaff, an associate professor of electrical engineering at Princeton University, and his graduate student Adi Fuchs built a model that lets them measure the rate of these improvements. Their model uses the characteristics of 1,612 CPUs and 1,001 GPUs of various capabilities, built from various functional units, to numerically evaluate the benefits attributable to node improvements. Wentzlaff and Fuchs defined a metric for performance gains driven by progress in CMOS (CMOS-Driven Return, CDR), which can be compared against the gains acquired through chip specialization (Chip Specialization Return, CSR).
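The intuition behind splitting a chip's speedup into a CMOS-driven part and a specialization part can be shown with a toy calculation. This is my own illustrative sketch, not the authors' code: the numbers are made up, and the paper derives CMOS potential from transistor-count, frequency, and energy data across the real chips in its dataset.

```python
# Toy illustration of separating CMOS-driven gains (CDR) from
# chip-specialization gains (CSR). The numbers below are hypothetical;
# the actual study estimates CMOS potential from measured chip data.

def specialization_return(total_speedup, cmos_speedup):
    """CSR-style metric: the gain left after normalizing out CMOS scaling."""
    return total_speedup / cmos_speedup

# Hypothetical: a new accelerator generation is 4.0x faster overall,
# but moving to the newer CMOS node alone would explain a 2.5x gain.
total = 4.0
cmos = 2.5
csr = specialization_return(total, cmos)
print(f"CMOS-driven return: {cmos:.2f}x")          # prints 2.50x
print(f"Chip-specialization return: {csr:.2f}x")   # prints 1.60x
```

Read this way, a shrinking CSR means most of each generation's headline speedup is being paid for by the process node, not by cleverer specialization.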


The team came to a discouraging conclusion. In the long run, the advantages gained from chip specialization are fundamentally tied to the number of transistors that fit in a square millimeter of silicon, as well as to the improvements in those transistors that each new process node brings. Worse, there are fundamental limits to how much speed can be extracted from improving an accelerator's design without improvements in CMOS scaling.

Importantly, all of the above applies over the long term. Wentzlaff and Fuchs's study shows that speed often rises sharply when accelerators are first deployed. Over time, as optimal acceleration methods are studied and best practices are documented, researchers converge on the best approach. Well-defined tasks from well-studied, parallelizable domains (the GPU's home turf) are handled well by accelerators. But this also means that the very properties that make a task amenable to acceleration limit the long-term advantage that acceleration can deliver. The team calls this problem the "accelerator wall."

The high-performance computing market has probably been feeling this for some time. Back in 2013 we wrote about the difficult road to exascale supercomputers. Even then, Top500 projections suggested that accelerators would give a one-time jump in benchmark ratings but would not increase the rate at which performance improves.



The implications of these findings, however, extend beyond the high-performance computing market. For example, after examining GPUs, Wentzlaff and Fuchs found that the gains that could not be attributed to CMOS improvements were very small.



The figure shows the growth in the absolute speed of GPUs (including the benefits derived from CMOS advances) alongside the gains attributable solely to CSR. CSR captures the improvements that remain once all CMOS technology advances are factored out of the GPU design's speedup.

The following figure clarifies the relationship of values:



A decline in CSR does not mean GPUs are slowing down in absolute terms. As Fuchs wrote:
CSR normalizes gains "by CMOS potential," and this "potential" accounts for transistor counts and the differences in speed, energy efficiency, area, and so on across CMOS generations. In Fig. 6 we gave an approximate comparison of "architecture + CMOS node" combinations by triangulating the measured speeds of all applications across the different combinations, and applying transitivity to those pairs of combinations that do not have enough applications in common (fewer than five).
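The triangulation step the quote describes can be sketched as a small graph problem. This is a hypothetical reconstruction of the idea, not the authors' methodology: combination pairs with enough shared applications get a directly measured relative-speed edge, and pairs without enough overlap are compared by chaining through intermediate combinations and multiplying the relative speedups along the path. All combination names and numbers below are invented for illustration.

```python
# Sketch of comparing "architecture + CMOS node" combinations via
# transitivity (illustrative reconstruction, not the paper's code).
# relative_speed[a][b] = how much faster combo b is than combo a,
# measured on applications the two combos share (>= 5 common apps).
relative_speed = {
    "GPU-A@28nm": {"GPU-A@16nm": 2.0},   # same design, newer node
    "GPU-A@16nm": {"GPU-B@16nm": 1.5},   # newer design, same node
}

def inferred_speedup(src, dst, graph):
    """Find a path src -> dst and multiply relative speedups along it."""
    stack = [(src, 1.0, {src})]
    while stack:
        node, ratio, seen = stack.pop()
        if node == dst:
            return ratio
        for nxt, r in graph.get(node, {}).items():
            if nxt not in seen:
                stack.append((nxt, ratio * r, seen | {nxt}))
    return None  # combos not connected by any chain of shared apps

# GPU-A@28nm and GPU-B@16nm share too few apps for a direct measurement,
# but transitivity gives 2.0 * 1.5 = 3.0x.
print(inferred_speedup("GPU-A@28nm", "GPU-B@16nm", relative_speed))
```

With a dense-enough graph of direct measurements, every pair of combinations can be ranked this way even when no single benchmark suite runs on all of them.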

Intuitively, these graphs can be read as follows: Fig. 6a shows "what engineers and managers see," while Fig. 6b shows "what we see once CMOS potential is excluded." I would venture to guess that you care more about whether your new chip beats the previous one than about whether it does so because of better transistors or because of better specialization.

The GPU market is well defined, mature, and specialized, and both AMD and Nvidia have every incentive to outdo each other through better circuit design. Even so, we see that the speedups are mostly due to CMOS-related factors, not to CSR.

The FPGAs and dedicated video-codec cards the researchers studied fit the same pattern, even if their relative improvements over time varied with the growth of their markets. The same characteristics that make a workload respond well to acceleration ultimately limit an accelerator's ability to keep improving. On GPUs, Fuchs and Wentzlaff write: "Although GPU graphics frame rates have improved 16x, we expect further speed and energy-efficiency improvements of only 1.4-2.4x and 1.4-1.7x, respectively." That leaves AMD and Nvidia little room to maneuver beyond the gains that improving CMOS provides.

The implications of this work are significant. It suggests that domain-specific architectures will no longer deliver substantial speed improvements once Moore's law stops working. And even if chip designers can focus on improving performance with a fixed number of transistors, those improvements will be limited by the fact that well-studied workloads leave little room for further optimization.

The work points to the need for fundamentally new approaches to computing. One potential alternative is Intel's MESO architecture. Fuchs and Wentzlaff have also suggested exploring alternative materials and other beyond-CMOS options, including research into the possibility of using non-volatile memory as a substrate for accelerators.

Source: https://habr.com/ru/post/444964/

