
How processors are designed and manufactured: the future of computer architectures


Despite constant improvements and gradual progress with each new generation, there have been no fundamental changes in the processor industry for a long time. The move from vacuum tubes to transistors was a huge step forward, as was the move from discrete components to integrated circuits. After those, however, paradigm shifts of the same magnitude have not occurred.

Yes, transistors have become smaller, chips have become faster, and performance has increased hundreds of times over, but we are beginning to see stagnation...

This is the fourth and final part of a series on CPU development that covers the design and manufacture of processors. Starting at a high level, we learned how computer code is compiled into assembly language and then into the binary instructions that the CPU interprets. We discussed how processor architectures are designed and how processors handle instructions. Then we looked at the various structures that make up a processor.
Digging a little deeper into the topic, we saw how these structures are built and how billions of transistors work together inside a processor. We looked at how processors are physically manufactured from raw silicon, learned about the properties of semiconductors, and saw what the inside of a chip looks like. If you missed any of the topics, here is the list of articles in the series:

Part 1: Basics of computer architecture (instruction set architectures, caching, pipelines, hyperthreading)
Part 2: The CPU design process (circuit schematics, transistors, logic gates, clocking)
Part 3: Compiling the design and physically manufacturing the chip (VLSI and silicon fabrication)
Part 4: Current trends and important future directions in computer architecture (sea of accelerators, three-dimensional integration, FPGAs, near-memory computing)

Now we come to the fourth part. Chip companies do not share their research or the details of their current technology with the public, so it is hard to get a clear picture of what exactly is inside the CPU in your computer. What we can do, however, is look at current research and see where the industry is heading.

One of the most famous observations in the processor industry is Moore's law. It states that the number of transistors in a chip doubles roughly every 18 months. For a long time this rule of thumb held true, but growth is starting to slow. Transistors are becoming so tiny that we are approaching the limits of physically achievable sizes. Without a revolutionary new technology, we will have to explore other ways of growing performance in the future.
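To make the rule of thumb concrete, here is a minimal sketch in Python. The 1971 baseline of 2,300 transistors (the Intel 4004) and the fixed 18-month doubling period are assumptions used purely for illustration; the fact that the formula overshoots today's real chips is itself a hint of the slowdown.

```python
# A rough sketch of Moore's law as a rule of thumb: transistor counts
# doubling every 18 months. The 1971 baseline (Intel 4004, ~2,300
# transistors) and the fixed doubling period are illustrative assumptions.

def transistor_estimate(year, base_year=1971, base_count=2300,
                        doubling_months=18):
    months = (year - base_year) * 12
    return base_count * 2 ** (months / doubling_months)

for year in (1971, 1980, 1990, 2000, 2010, 2020):
    print(f"{year}: ~{transistor_estimate(year):.2e} transistors")
```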


Moore's law over 120 years. The graph becomes even more interesting once you notice that the last seven points belong to Nvidia GPUs rather than general-purpose processors. Illustration by Steve Jurvetson.

One conclusion follows from this analysis: to increase performance, companies have started adding cores instead of raising clock frequency. This is why we are seeing eight-core processors become widespread rather than dual-core processors running at 10 GHz. We simply do not have much room for growth beyond adding new cores.
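Adding cores has limits of its own, because only part of a workload can run in parallel. As a hedged illustration (the 90% parallel fraction below is an assumed example, not a measured figure), Amdahl's law shows how quickly the returns from extra cores diminish:

```python
# Amdahl's law: overall speedup from n cores when a fraction p of the
# work can be parallelized. p = 0.9 here is an assumed example value.

def amdahl_speedup(n_cores, parallel_fraction=0.9):
    serial_part = 1.0 - parallel_fraction
    return 1.0 / (serial_part + parallel_fraction / n_cores)

for cores in (1, 2, 4, 8, 16, 64):
    print(f"{cores:>2} cores -> {amdahl_speedup(cores):.2f}x speedup")
```

Even with 90% of the work parallelized, 64 cores give less than a 9x speedup, which is why piling on cores is not a free replacement for higher clock speeds.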

On the other hand, the field of quantum computing promises huge room for future growth. I am not an expert, and since the technology is still being developed there are few real "specialists" in this area in any case. To dispel the myths: quantum computing will not give you 1000 frames per second in a realistic renderer or anything like that. For now, the main advantage of quantum computers is that they enable more complex algorithms that were previously out of reach.


One of IBM's quantum computer prototypes.

In a traditional computer, a transistor is either in the on or the off state, corresponding to 0 or 1. In a quantum computer, superposition is possible, that is, a bit can be in the 0 and 1 states at the same time. Thanks to this new capability, scientists can develop new methods of computing and will be able to solve problems for which we currently lack the computing power. The point is not so much that quantum computers are faster, but that they are a new model of computation that lets us tackle different kinds of problems.
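As a toy illustration of superposition (not a real quantum simulator), a qubit's state can be written as two complex amplitudes whose squared magnitudes give the measurement probabilities. The short sketch below puts a qubit into an equal superposition with a Hadamard gate:

```python
import numpy as np

# Toy model of one qubit: a state is a pair of amplitudes (a, b) with
# |a|^2 + |b|^2 = 1. Measuring yields 0 with probability |a|^2 and 1
# with probability |b|^2. This is an illustration, not a quantum simulator.

ket_zero = np.array([1.0, 0.0])                      # definite 0, like a classical bit
hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate

state = hadamard @ ket_zero                          # equal superposition of 0 and 1
probabilities = np.abs(state) ** 2
print(probabilities)                                 # [0.5 0.5]
```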

Mass adoption of this technology is still a decade or two away, so what trends are we starting to see in real processors today? Dozens of active research efforts are under way, but I will touch only on the areas that, in my opinion, will have the greatest impact.

One growing trend is heterogeneous computing. This approach combines a variety of different computational elements in a single system. Most of us benefit from it in the form of a discrete GPU in our computers. The central processor is very flexible and can perform a wide range of computational tasks at decent speed. GPUs, on the other hand, are designed specifically for graphics calculations such as matrix multiplication. They do this very well and are orders of magnitude faster than a CPU at these kinds of operations. By offloading part of the graphics work from the CPU to the GPU, we can speed up the computation. Any programmer can optimize software by changing the algorithm, but optimizing hardware is much harder.
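As a hedged sketch of this kind of offloading (it assumes an Nvidia GPU and the CuPy library, whose array API mirrors NumPy, are available), the same matrix multiplication can run on either device:

```python
import numpy as np

# The same matrix multiplication on the CPU (NumPy) and, if available,
# on the GPU via CuPy. Assumes an Nvidia GPU and the `cupy` package.

a = np.random.rand(2048, 2048).astype(np.float32)
b = np.random.rand(2048, 2048).astype(np.float32)

c_cpu = a @ b                                    # computed on the CPU

try:
    import cupy as cp
    a_gpu, b_gpu = cp.asarray(a), cp.asarray(b)  # copy the operands to GPU memory
    c_gpu = a_gpu @ b_gpu                        # computed on the GPU
    print(np.allclose(c_cpu, cp.asnumpy(c_gpu), rtol=1e-3))
except ImportError:
    print("CuPy not installed; CPU result only")
```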

But the GPU is not the only area where accelerators are becoming more popular. Most smartphones contain dozens of hardware accelerators designed to speed up very specific tasks. This style of computing is called a Sea of Accelerators; examples include cryptographic processors, image processors, machine learning accelerators, video encoders/decoders, biometric processors, and more.

Workloads are becoming more and more specialized, so designers include more and more accelerators in their chips. Cloud providers such as AWS have begun offering developers FPGA cards to accelerate their cloud computing. Unlike traditional computing elements such as CPUs and GPUs, which have a fixed internal architecture, an FPGA is flexible. It is, in effect, programmable hardware that can be configured to fit your needs.

If you need image recognition, you can implement those algorithms directly in hardware. If you want to simulate a new hardware architecture, you can test it on an FPGA before actually manufacturing it. An FPGA delivers better performance and energy efficiency than a GPU, but still less than an ASIC (application-specific integrated circuit). Companies such as Google and Nvidia are developing dedicated machine learning ASICs to speed up image recognition and analysis.


Die shots of popular mobile processors, showing their structure.

Looking at die shots of modern processors, you can see that most of the chip area is not actually occupied by the CPU cores themselves. An ever larger share goes to various accelerators. This has made it possible to speed up highly specialized computations and to significantly reduce power consumption.

Previously, when video processing had to be added to a system, developers had to install a separate chip for it. This, however, is very inefficient in terms of energy. Every time a signal has to leave a chip over a physical wire to another chip, a large amount of energy is required per bit. A tiny fraction of a joule may not seem like much, but moving data inside a chip rather than off it can be 3 to 4 orders of magnitude more efficient. The integration of such accelerators alongside the CPU is why we have recently seen a growing number of chips with ultra-low power consumption.
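A back-of-envelope calculation shows why this matters. The per-bit energy figures below are assumed, order-of-magnitude values chosen only to illustrate the gap described above; real numbers depend heavily on the process and the I/O interface.

```python
# Rough energy comparison for streaming one uncompressed 1080p frame
# 60 times per second. The per-bit energies are illustrative assumptions,
# not measurements.

ON_CHIP_PJ_PER_BIT = 0.1      # moving a bit over on-chip wires
OFF_CHIP_PJ_PER_BIT = 100.0   # driving a bit off the package to another chip

bits_per_frame = 1920 * 1080 * 24
frames_per_second = 60

def power_watts(pj_per_bit):
    joules_per_bit = pj_per_bit * 1e-12
    return joules_per_bit * bits_per_frame * frames_per_second

print(f"on-chip : {power_watts(ON_CHIP_PJ_PER_BIT) * 1000:.2f} mW")
print(f"off-chip: {power_watts(OFF_CHIP_PJ_PER_BIT) * 1000:.2f} mW")
```

With these assumed figures, the off-chip path burns roughly a thousand times more power for exactly the same data, which is the argument for integrating accelerators on the same die.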

Accelerators are not perfect, however. The more of them we add to a design, the less flexible the chip becomes, and we start to sacrifice overall general-purpose performance in favor of peak performance for specialized kinds of computation. At some point the whole chip simply turns into a collection of accelerators and stops being a useful CPU. The balance between specialized performance and general-purpose performance is always tuned very carefully. This mismatch between general-purpose hardware and specialized workloads is called the specialization gap.

Although some people think we are at the peak of a GPU/machine learning bubble, we can most likely expect an ever larger share of computation to be handed off to specialized accelerators. Cloud computing and AI continue to grow, so GPUs look like the best option for reaching the scale of computation required.

Another area where designers are looking for more performance is memory. Traditionally, reading and writing values has always been one of the most serious bottlenecks for processors. Fast, large caches help, but a read from RAM or from an SSD can take tens of thousands of clock cycles. Engineers therefore often regard memory access as more costly than the computation itself. If the processor wants to add two numbers, it first has to calculate the memory addresses where the numbers are stored, find out at which level of the memory hierarchy the data resides, read the data into registers, perform the calculation, calculate the destination address, and write the value to the right place. For simple instructions that may take a cycle or two to execute, this is extremely inefficient.
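A toy cost model makes the imbalance visible. The cycle counts below are assumed, typical-order values, not figures for any particular CPU:

```python
# Toy model: cycles spent adding two numbers depending on where the
# operands live. All latencies are assumed, order-of-magnitude values.

LOAD_LATENCY_CYCLES = {
    "L1 cache": 4,
    "L2 cache": 12,
    "L3 cache": 40,
    "DRAM":     200,
}
ADD_CYCLES = 1  # the arithmetic itself

for level, load_cycles in LOAD_LATENCY_CYCLES.items():
    total = 2 * load_cycles + ADD_CYCLES   # two operand loads plus the add
    useful = ADD_CYCLES / total
    print(f"{level:<8}: ~{total:>3} cycles, {useful:.1%} spent on real work")
```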

A new idea under active investigation is a technique called near-memory computing. Instead of pulling small pieces of data out of memory and processing them on a fast processor, researchers are turning the job upside down: they are experimenting with building small processors directly into the memory controllers of RAM or SSDs. Because computation moves closer to the memory, there is potential for enormous savings in energy and time, since the data no longer has to be shuttled around as often. The compute units have direct access to the data they need because it sits right in the memory. The idea is still in its infancy, but the results look promising.
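Here is a purely conceptual sketch of the idea, counting bytes moved rather than simulating any real hardware; the "near-memory" path below is hypothetical and simply stands in for a small processor sitting inside the memory controller.

```python
import numpy as np

# Conceptual comparison: summing a large array by shipping every element
# to the CPU versus computing the sum next to the data and shipping only
# the result. Only the bytes crossing the memory interface are counted.

data = np.arange(1_000_000, dtype=np.int64)   # data resident "in memory"

# Conventional path: every element crosses the memory bus to the CPU.
bytes_moved_conventional = data.nbytes
result_conventional = int(data.sum())

# Hypothetical near-memory path: the reduction runs beside the data and
# only the final 8-byte result crosses over to the CPU.
result_near_memory = int(data.sum())          # same math, different location
bytes_moved_near_memory = 8

print(result_conventional == result_near_memory)
print(f"bytes moved: {bytes_moved_conventional:,} vs {bytes_moved_near_memory}")
```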

One of the obstacles to near-memory computing is the constraint imposed by the manufacturing process. As mentioned in the third part, silicon fabrication is very complex and involves dozens of steps. These processes are usually specialized for producing either fast logic elements or densely packed storage elements. If you try to build a memory chip using a logic-optimized process, you get an extremely low-density chip. If you try to build a processor using a storage-oriented process, you get very poor performance and high latencies.


3D integration example showing vertical connections between transistor layers.

One potential solution to this problem is 3D integration. Traditional processors have a single, very wide layer of transistors, but this has its limits. As the name implies, three-dimensional integration is the process of stacking several layers of transistors on top of one another to increase density and reduce delays. Vertical pillars manufactured in different processes can then be used to interconnect the layers. The idea was proposed long ago, but the industry lost interest in it because of serious implementation difficulties. Recently, with the emergence of 3D NAND storage, this field of research has been revived.

Beyond physical and architectural changes, one more trend will strongly influence the entire semiconductor industry - a greater emphasis on security. Until recently, processor security was almost an afterthought. This is similar to how the Internet, e-mail, and many other systems we actively use today were developed with almost no regard for security. Existing protections were bolted on after incidents happened so that we could feel safe. In the processor world, this tactic has hurt companies, Intel in particular.

The Spectre and Meltdown bugs are probably the best-known examples of designers adding features that significantly speed up a processor without fully understanding the security risks involved. In modern processor development, far more attention is paid to security as a key part of the architecture. When security improves, performance often suffers, but given the damage that serious security bugs can cause, it is safe to say it is worth focusing on security as much as on performance.

In earlier parts of the series we touched on techniques such as high-level synthesis, which lets designers first describe a structure in a high-level language and then have sophisticated algorithms determine the optimal hardware configuration to perform that function. Design cycles become more expensive with every generation, so engineers are looking for ways to speed up development. We should expect this trend of designing hardware with the help of software to keep growing.

Of course, we cannot predict the future, but the innovative ideas and research areas discussed in this article can serve as guideposts for what to expect in the design of future processors. We can safely say that we are nearing the end of routine improvements in the manufacturing process. To keep increasing performance with each generation, engineers will have to invent ever more sophisticated solutions.

We hope this four-part series has sparked your interest in the design, verification, and manufacturing of processors. There is an endless amount of material on the subject, and if we had tried to cover it all, each article could have grown into an entire university course. I hope you have learned something new and now have a better understanding of how complex computers are at every level.

Source: https://habr.com/ru/post/458670/

