
Hardware in the service of the algorithm (continued)

Boris Babayan about the past, present and future of computing technology


Boris Babayan


Almost three months have passed since the publication of the first part of this work. All this time the second part was maturing, and... it has finally matured!
As in the previous part, the story is told in Babayan's voice. Most of my own comments are set off as framed notes or as links to web pages.


After the failure of the 432 machine, the whole world gave up on hardware support for high-level languages. It was forgotten around 1984 and has not been remembered since. Everyone believes it is a beautiful idea, but not a practical one. They believe this despite the fact that we had a complete implementation of the idea, and the implementation was efficient.

But what have high-level languages come to? All current high-level languages can be divided into two groups: one includes languages like Java, the other the C-like languages. Java has strict type checking, but it is very inefficient. And there is no dynamism there. Nobody would even think of writing an operating system in Java. By my standards, that is not a universal high-level language. The other group is the C-like languages. The general opinion is that these are not really high-level languages; they are more like a variant of assembly.

It turns out that no one in the world, except our Elbrus guys, knows the real advantages of high-level languages. And that is just awful!

Now, what do you need in order to get these benefits? Very little. It is easy to do.

First of all, two types of descriptors are introduced into the languages and the hardware: spatial and temporal. A spatial descriptor is essentially a pointer to data. It may, for example, contain the size of an array and the address of its beginning. The typical operations on such a descriptor are indexing (access to a specific array element) and taking a subarray. In all, there is only a very small number of operations on pointers to data.

And the important point here is that the base address for memory read and write operations can be taken only from such pointers. With their help the hardware checks for violations of array bounds. I will say more about what else spatial descriptors give when I get to "context protection".
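To make the idea concrete, here is a minimal sketch (C++, purely illustrative; the struct layout and names are my assumptions, not taken from Elbrus documentation) of a spatial descriptor with the two typical operations, indexing and taking a subarray, both checked the way the hardware would check them:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Hypothetical model of a spatial descriptor: base address plus element count.
// In the real machine this is a hardware-checked tagged word; here it is a struct.
struct SpatialDescriptor {
    int64_t* base;      // address of the first element
    std::size_t size;   // number of elements the descriptor covers

    // Indexing: the hardware would raise an interrupt on an out-of-bounds access.
    int64_t& index(std::size_t i) const {
        if (i >= size) throw std::out_of_range("array bound violated");
        return base[i];
    }

    // Taking a subarray: an unprivileged operation that can only narrow the view.
    SpatialDescriptor subarray(std::size_t offset, std::size_t count) const {
        if (offset > size || count > size - offset)
            throw std::out_of_range("subarray outside parent descriptor");
        return SpatialDescriptor{base + offset, count};
    }
};

int main() {
    int64_t storage[8] = {0};
    SpatialDescriptor d{storage, 8};
    d.index(3) = 42;                    // fine
    SpatialDescriptor s = d.subarray(2, 4);
    std::cout << s.index(1) << "\n";    // prints 42 (element 3 of the parent)
    try { d.index(8); } catch (const std::exception& e) {
        std::cout << "trapped: " << e.what() << "\n";
    }
}
```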

A temporal descriptor, in turn, is a pointer to a function: the right to run some procedure. A function pointer contains the jump address. In Elbrus times, up-level addressing, that is, statically nested procedures, was popular.
Up-level addressing
Babayan believes that up-level addressing is still needed today, since it allows globals to be structured. Such structuring should have a positive effect on data protection.
Therefore, back then the function pointer also included a reference to the global data of the procedure within which the function was declared. That is, if procedure A is declared inside procedure B, then procedure A, generally speaking, has access to all the data available to procedure B. So the descriptor of function A contains, firstly, the jump address and, secondly, a reference to the global data of procedure B, more precisely, to the global data of one specific incarnation of B. Let me explain. As a result of a chain of recursive calls, procedure B can have several simultaneously living incarnations. Each incarnation has its own global data. Because of this, the pointers to function A generated within different incarnations of B are different pointers: they refer to the data of different copies of the enclosing procedure.
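A minimal illustration of this idea (C++; the names are mine, not Elbrus terminology): the descriptor of a nested function carries, besides the code address, a reference to the frame of one particular incarnation of the enclosing procedure, so pointers produced in different recursive incarnations are different values.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Data belonging to one incarnation of the enclosing procedure B.
struct FrameOfB {
    int depth;
};

// A hypothetical "function descriptor": code address plus a link to the
// enclosing procedure's frame, modelled here as a capturing std::function.
using FuncDescriptor = std::function<void()>;

// B declares a nested procedure A; each recursive incarnation of B produces
// a distinct descriptor of A, bound to that incarnation's own frame.
FuncDescriptor B(int depth) {
    auto frame = std::make_shared<FrameOfB>(FrameOfB{depth});
    FuncDescriptor a = [frame]() {
        std::cout << "A sees B's frame at depth " << frame->depth << "\n";
    };
    if (depth > 0) {
        FuncDescriptor deeperA = B(depth - 1);  // descriptor from a deeper incarnation
        deeperA();                              // prints a different depth
    }
    return a;
}

int main() {
    FuncDescriptor a = B(2);
    a();  // still bound to the outermost incarnation's frame: prints depth 2
}
```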

Now look at the general conclusion that can be drawn about the universality of the descriptor approach: there is no privileged mode in the operating system. No such thing is needed. And if there is no privileged mode, then there is no notion of system programming either. That is, the operating system is an ordinary user program. It is simply more widely known to other applications, in the sense that some of its procedures are accessible from any application. Let me illustrate why such a system does not need a privileged mode.

When a procedure is started, initially it has access only to registers. All the registers are available to it. But if there is not a single descriptor in them, then the procedure cannot work with anything except those registers: neither a memory read nor a memory write will succeed, since no spatial pointer is available to it. If the registers contain exactly one pointer, then the procedure can access only one object, the one that pointer refers to.

That is, each procedure has a context. We even called it "context protection".

Now suppose that at some point the procedure wants to allocate a new array. In modern systems this is one of the most privileged actions. In Elbrus, the operating system context was used for it. It contained pointers to functions, including a pointer to the "give me a new array" function, the one that in the Burroughs operating system was called "get space".

When the get space procedure is started, the context of the procedure that called it is closed away on the stack and the context of get space itself is opened. In that context there is, generally speaking, a giant descriptor covering all of virtual memory. Suppose the user, calling get space, requested an array of 25 words. Get space has a table recording which areas are occupied and which are free. It finds free space and performs an unprivileged operation: take a subarray. The result is a descriptor for the memory it wants to hand over for the new array. The table is then adjusted accordingly.

After the return from get space, the context of the calling procedure has grown by one descriptor. And no privileged action was required for this.
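Here is a minimal sketch of the scheme just described (C++, purely illustrative; get_space, the arena layout and the bump-pointer "free table" are my simplifications, not the real Elbrus OS interface). The allocator keeps a descriptor over its whole arena in its own context and hands out pieces of it with the ordinary, unprivileged subarray operation:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <new>
#include <vector>

// A stripped-down spatial descriptor (base + length), as in the earlier sketch.
struct Descriptor {
    int64_t* base;
    std::size_t size;
    Descriptor subarray(std::size_t off, std::size_t n) const {
        return Descriptor{base + off, n};   // unprivileged narrowing operation
    }
};

// Hypothetical get_space: an ordinary procedure whose *context* happens to
// contain a descriptor over the whole arena plus a table of free regions.
class GetSpace {
public:
    explicit GetSpace(Descriptor arena) : arena_(arena), nextFree_(0) {}

    // Allocate n words: find free space, cut out a subarray, update the table.
    Descriptor operator()(std::size_t n) {
        if (nextFree_ + n > arena_.size) throw std::bad_alloc();
        Descriptor d = arena_.subarray(nextFree_, n);
        nextFree_ += n;                      // the "occupied/free" bookkeeping
        return d;                            // the caller's context grows by this descriptor
    }
private:
    Descriptor arena_;       // the "giant" descriptor, visible only to get_space
    std::size_t nextFree_;   // simplest possible free table: a bump pointer
};

int main() {
    std::vector<int64_t> memory(1024, 0);
    GetSpace get_space(Descriptor{memory.data(), memory.size()});
    Descriptor a = get_space(25);    // the user asked for an array of 25 words
    a.base[0] = 7;                   // the caller may now use only this region
    std::cout << "got " << a.size << " words\n";
}
```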

Of course, one must remember that each procedure is responsible for the semantics of its own actions. If the get space procedure contains errors, memory will be allocated badly. But that is the semantics of the procedure itself, and the semantics of a procedure cannot be checked at the hardware level.

Next, look... It might seem to follow from the above that descriptors can never be modified. But that is not so. Descriptors can be modified when necessary, again without a privileged mode. Here is how. The machine has operations that modify descriptors, but they are wired into special procedures, in whose context a special parameter is set that permits the use of these operations. There are very few such procedures, and pointers to them exist only in the contexts of a small number of system procedures. No other manual modification of descriptors is possible!

For example, the get space procedure has a pointer to a function that can modify descriptors. The fact is that in reality we did not use the giant descriptor I mentioned above; I brought it up only as an illustration. In practice a huge descriptor is very inconvenient, since it leads to large virtual addresses, and newly created descriptors would take up a lot of space. Besides, get space can in effect hand out any pointer anyway, since all of memory is available to it. So why not let it create descriptors directly?

That reduces their size. Let get space itself, when allocating memory, forge the necessary descriptor. No one else can do it anyway, because only get space has the necessary context.

To be fair, it should be noted that the special parameter permitting the descriptor-modifying operations does in effect define a privileged mode. But this is a purely technical privilege. As I showed with the example of the huge descriptor over all of virtual memory, philosophically this privilege is not needed.

Now look at what possibilities context protection opens up. Debugging is simplified enormously. Suppose you have a giant program that is known to contain no errors; suppose it has even been proven correct theoretically. I am exaggerating here, but let us suppose. Now a user, who does not know that the program is absolutely correct, adds some new, undebugged procedure to it. That procedure may produce the correct result and yet corrupt other people's data. To the user this looks like an error somewhere in the whole huge program. Moreover, since his procedure returns the correct result, he may even decide that the error is in the main program rather than in his own code. It turns out he must be prepared to debug code that is already debugged. And this applies to all modern architectures. The situation is made worse by the fact that the code being extended can be huge, and the people who wrote it may have left long ago. Just try to debug such a system... In Elbrus this simply could not happen. You just run the program and set breakpoints at all exits of the procedure being debugged. Then at those exits you check the results of its work. If they satisfy you, you can forget about that procedure: it cannot do any further harm. This dramatically simplifies programming.

Because everything in memory is typed, you can debug simply by inspecting memory dumps. When we were handing over the Elbrus-2 machine, Lyonya Raikov, a project manager at NICEVT, was present. NICEVT was the organization that copied IBM machines. We had only just written the operating system; often a hundred terminals were running, and errors kept turning up. Our guys would take a memory dump. It is structured, it has types. The programmers would sit over it for a couple of hours and then go fix the operating system. Raikov was indignant! Because at their place, when an error is found in the IBM operating system, it takes a few months to get to it.

Let us see what is needed to support descriptors in a machine. The hardest part is allocating one extra tag bit for every 64 bits.
How tags work in Elbrus-3
One bit specifies whether an object is ordinary user data (a set of bits interpreted differently depending on the operations performed on it) or has internal structure (thanks to which the object can be interpreted, for example, as an "empty" value, a pointer to data, a pointer to a function, ...).
If an ECC code is implemented in memory, this is easy. Elbrus-3 has an ECC code, so there were no difficulties there. But many machines today have no ECC code. Then something has to be invented, if, of course, we want to bolt descriptor support onto them.

The extra bit is the only difficulty. Everything else is simple. Two types of descriptors need to be introduced, which is elementary. Operations on descriptors do not affect performance: they are executed rarely, and implementing them is easy. The most important thing is to add the necessary checks to the memory read and write operations.
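A rough model of that check (C++; the encoding below is invented for the illustration, the real Elbrus tag layout differs): every 64-bit word carries an extra tag saying whether it is plain data or an internally structured word such as a descriptor, and a load refuses to use anything but a data descriptor as its base address.

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>

// One memory word plus its tag. The extra bit distinguishes plain user data
// from internally structured words; the sub-kinds here are invented.
enum class Tag : uint8_t { Data, DataDescriptor, FuncDescriptor };

struct Word {
    uint64_t bits = 0;
    Tag tag = Tag::Data;
};

// A "read memory" operation: the base may be taken only from a data descriptor.
uint64_t checked_load(const Word& base, const Word* memory, uint64_t index) {
    if (base.tag != Tag::DataDescriptor)
        throw std::runtime_error("base of a memory access is not a descriptor");
    uint64_t addr = base.bits >> 16;         // invented packing: address | size
    uint64_t size = base.bits & 0xFFFF;
    if (index >= size)
        throw std::runtime_error("array bound violated");
    return memory[addr + index].bits;
}

int main() {
    Word memory[16];
    memory[4].bits = 99;
    Word desc{(4ull << 16) | 8, Tag::DataDescriptor};   // base 4, size 8
    std::cout << checked_load(desc, memory, 0) << "\n"; // prints 99
    Word notADesc{12345, Tag::Data};
    try { checked_load(notADesc, memory, 0); }
    catch (const std::exception& e) { std::cout << "trapped: " << e.what() << "\n"; }
}
```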

After all this, it remains to add memory cleaning. Let me explain what that is.

Suppose a programmer used some memory and freed it. The freed memory must be cleaned, because it may contain old descriptors. If the memory is not cleaned, those descriptors may end up in the wrong hands.
Explanation
For example, after the freed memory is handed over to another application.
And that would be a protection violation. Therefore the memory must be cleaned. But that is simple; operating systems can do it even now.

It is harder to deal with pointers into the freed memory. Once memory is freed, all pointers to it must be found and destroyed. Otherwise, after the freed memory is reused, these pointers could be used to reach other objects.

How did we solve this problem? We did not introduce a garbage collector; that is difficult. Our operating system always hands out memory incrementally and never reuses it. When memory runs out, the operating system collects the holes in it. This is done in one pass over all of memory, in the background, without stopping user applications. If during this pass it is found that some descriptor points into a hole, that descriptor is zeroed out. That's it! Then all of memory is simply compacted, and the addresses in descriptors are adjusted as needed.
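Below is a toy model of that one-pass scheme (C++; the data structures are invented for the illustration, not the actual Elbrus OS algorithm): memory is handed out by a bump pointer, and when it runs out one pass zeroes every descriptor that points into a freed hole, slides the live regions together and patches the surviving descriptors' bases.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// A toy descriptor: offset into one big arena plus length; size == 0 marks
// a descriptor that has been invalidated ("zeroed") by the cleaner.
struct Desc {
    std::size_t base = 0, size = 0;
    bool live = false;   // set to false when the program frees the region
};

// One background pass: kill descriptors that point into freed holes,
// move the live regions down, and fix up the remaining descriptors.
void compact(std::vector<int>& arena, std::vector<Desc>& descs, std::size_t& next) {
    std::size_t dst = 0;
    for (Desc& d : descs) {
        if (!d.live) { d.base = d.size = 0; continue; }     // descriptor into a hole: zeroed
        for (std::size_t i = 0; i < d.size; ++i)            // move the live data down
            arena[dst + i] = arena[d.base + i];
        d.base = dst;                                        // fix up the descriptor
        dst += d.size;
    }
    next = dst;                                              // allocation resumes from here
}

int main() {
    std::vector<int> arena(16, 0);
    std::size_t next = 0;
    std::vector<Desc> descs;
    auto alloc = [&](std::size_t n) {                        // bump-pointer allocation
        descs.push_back({next, n, true});
        next += n;
        return descs.size() - 1;
    };
    std::size_t a = alloc(4), b = alloc(4), c = alloc(4);
    arena[descs[c].base] = 7;        // remember something in the third array
    descs[b].live = false;           // free the middle array: a hole appears
    compact(arena, descs, next);
    std::cout << arena[descs[c].base] << " next=" << next << "\n";  // prints: 7 next=8
    (void)a;
}
```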

In those days it rarely happened that a program used up all of memory, so in practice compaction almost never ran. In modern systems that would probably no longer be the case.

The algorithm I have just described is for the heap. Cleaning dangling pointers on the stack is a little harder. You cannot treat the stack the way you treat the heap, because stack memory is actively reused. But with a small hardware addition the stack can be cleaned too. In the first and second Elbrus we cleaned it in software, but that will no longer do; now it has to be done efficiently, in hardware. Or you can avoid it altogether by simply banning the troublesome situations at the language level. Let me explain.

Complex stack-cleaning mechanisms are needed because of a single problem: descriptors referring to a procedure's local data can be written into globals. This is very bad programming style, but it does occur, for example in operating system development. Properly speaking, such situations should simply be forbidden. Then the hardware has no problems at all.

That's it! Beyond the difficulties I have described there are no difficulties. In other words, supporting objects costs almost nothing! And yet not a single machine has it!
Modern incarnation of Elbrus-2
Just a few days ago, when I was no longer planning to make significant changes to the article, I came across the CRASH project, which the British company BAE Systems is carrying out today on an order from the American defense industry. The goal of the project is to develop a general-purpose computer and an operating system for it. The details of the project, for obvious reasons, are not disclosed, but the surface description of the machine practically reproduces the description of Elbrus-2. For example: "...under the auspices of the DARPA CRASH program". From the description it is clear that the machine will be tagged. The phrase "zero kernel" may also point to the absence of a privileged mode in the operating system. The language description contains the phrase "dynamically typed", i.e. the language will be dynamically typed.

So it may well be that in the near future we will see the ideas of Elbrus-2 reincarnated under modern conditions.
It is simply a disaster, do you understand? A property like genuine high-level programming still does not exist anywhere.

Everything I have said above relates almost entirely to the spatial component.

Now let's look at the temporal component. As I said, it all started with bit-serial machines. Then the hardware quietly grew and grew. The first thing done was to replace bit-serial arithmetic with word-parallel arithmetic. When I came to ITMIVT (I was then a student at the Moscow Institute of Physics and Technology), a discussion was under way there, started by Sergey Alekseevich Lebedev. The argument was that machines should be parallel. At that time parallelism meant parallelism inside arithmetic operations; no one yet imagined that the operations themselves could be executed in parallel. So parallelism inside operations was introduced. After that, the architecture could be improved only by speeding up the logic implementing the arithmetic. There was no other way, because program execution time was still the sum of the execution times of its arithmetic operations.

And then work began on speeding up arithmetic. Here I can boast: my first scientific work, done in 1954, in my fourth year at Phystech, was the introduction of carry-save arithmetic. Those who work on arithmetic now know it very well. It is multiplication, division and square root without carry propagation between digits. A fairly obvious and simple thing, and it is still the basis of all floating-point operations. There are two basic algorithms. I presented one of them in 1955 at the Phystech conference. The first open Western publication on the same topic appeared in 1956, by Nadler.
Publication
Probably referring to the following publication: Nadler, M., "A High-Speed Electronic Arithmetic Unit for Automatic Computing Machines," Acta Technica (Prague), No. 6, pp. 464-478, 1956.
That is, I presented this thing publicly a year earlier. However, the article with the results of my work was, unfortunately, published only in 1958, in the first issue of the MIPT conference proceedings.

The second basic algorithm of carry-save arithmetic specifies how the operations of the first algorithm are performed in number systems with a radix greater than two, for example quaternary or octal. Higher-radix arithmetic reduces the total number of steps. This idea and its implementation belong to Robertson.
Robertson
The author's card in the University of Illinois library: http://archives.library.illinois.edu/archon/?p=collections/controlcard&id=1000 . Link to the report on visits to computer centers in the USSR: http://dl.acm.org/citation.cfm?id=368342&dl=ACM&coll=DL&CFID=293666434&CFTOKEN=68165159
He came to Moscow in 1958, and we talked with him a great deal.

The two algorithms I have mentioned still form the basis of machine arithmetic. The struggle for parallelism began with them.
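To show what "arithmetic without carry propagation" means in practice, here is a small C++ sketch of the textbook 3:2 carry-save step (this is the standard formulation, not Babayan's or Nadler's original presentation): three numbers are reduced to a sum word and a carry word with no carry ever propagating across the digits, and a single ordinary addition at the very end produces the final result.

```cpp
#include <cstdint>
#include <iostream>

// One carry-save step (a 3:2 compressor applied bitwise to whole words):
// three addends are replaced by two, with no carry propagation inside the step.
void carry_save_add(uint64_t a, uint64_t b, uint64_t c,
                    uint64_t& sum, uint64_t& carry) {
    uint64_t s = a ^ b ^ c;                          // per-digit sum, carries ignored
    uint64_t k = ((a & b) | (a & c) | (b & c)) << 1; // per-digit carries, shifted left
    sum = s;
    carry = k;
}

// Add many numbers: keep the running value in (sum, carry) form and perform a
// carry-propagating addition only once, at the end. This is how partial
// products are accumulated in a carry-save multiplier.
uint64_t add_all(const uint64_t* v, int n) {
    uint64_t sum = 0, carry = 0;
    for (int i = 0; i < n; ++i)
        carry_save_add(sum, carry, v[i], sum, carry);
    return sum + carry;                              // the only real carry propagation
}

int main() {
    uint64_t v[] = {123, 4567, 89, 1000000, 42};
    std::cout << add_all(v, 5) << " vs "
              << (123 + 4567 + 89 + 1000000 + 42) << "\n";  // both print 1004821
}
```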

Then, of course, Cray introduced the pipeline. Only one pipelined unit, but that too is parallelism. A very serious step.

We took this up as well. After we introduced carry-save arithmetic we had specialized multipliers, specialized dividers, adders, logic units, all different devices. We saw that the machine had about ten devices and thought: good grief, these should be able to start in parallel too. Yet the instruction set was sequential. Moreover, our mistaken idea at the time was that supporting high-level languages required a simple compiler. So we used reverse Polish notation: an addressless, stack-based instruction set. First the data is loaded onto the top of the stack, then an addressless operation is performed, the result lands back on top of the stack... everything goes through the stack. This produces many more instructions than, say, a three-address instruction set. So we thought: it would be good to be able to execute all these operations in parallel. That, of course, would let us catch up with and overtake the three-address machines, to which we were losing badly because of the bloated Polish notation. But the main motivation was different: we were thinking about overall performance, about how to keep busy the numerous execution units we already had. That was in 1972.
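For readers unfamiliar with an addressless instruction set, here is a tiny C++ sketch of how the expression (a + b) * (c - d) looks on a stack machine in reverse Polish order versus the three instructions a three-address machine needs; the instruction names are invented.

```cpp
#include <iostream>
#include <vector>

// A toy stack machine with an addressless (zero-address) instruction set.
struct StackMachine {
    std::vector<long> stack;
    void push(long v) { stack.push_back(v); }           // load a value onto the top
    void binop(char op) {                                // addressless ADD/SUB/MUL
        long b = stack.back(); stack.pop_back();
        long a = stack.back(); stack.pop_back();
        stack.push_back(op == '+' ? a + b : op == '-' ? a - b : a * b);
    }
};

int main() {
    long a = 2, b = 3, c = 10, d = 4;

    // Reverse Polish form of (a + b) * (c - d): seven stack instructions.
    StackMachine m;
    m.push(a); m.push(b); m.binop('+');
    m.push(c); m.push(d); m.binop('-');
    m.binop('*');
    std::cout << "stack machine: " << m.stack.back() << "\n";

    // The same computation as three three-address instructions:
    //   t1 = a + b;  t2 = c - d;  t3 = t1 * t2
    long t1 = a + b, t2 = c - d, t3 = t1 * t2;
    std::cout << "three-address: " << t3 << "\n";
}
```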

By 1978 we had made the world's first out-of-order superscalar.
the world's first out-of-order super-scalar
The first out-of-order superscalar in history is a topic over which many lances have been broken. The Elbrus people naturally claim priority, and in response they usually get "but Wikipedia says..." (usually followed by a link to the IBM 360/91).

Wikipedia is a fine thing, of course, but the Elbrus engineers are professionals nonetheless. So I had to dig into this question thoroughly. Sergey Shishlov, an engineer at Intel, helped me with it; many thanks to him!

First of all, one has to establish what is commonly meant by a superscalar. The question is not idle, because the term is not rigorously fixed. But in the most authoritative sources the definition necessarily includes two basic mechanisms: pipelining and the execution of several instructions per clock. It is from this standpoint that Sergey and I analyzed the most prominent projects of the 1960s-80s. The results of that analysis are below.

1) IBM 7030 (Stretch) (1961)
2) CDC 6600 (1964) – out-of-order issue, but only one instruction issued per clock and the functional units were not pipelined; pipelined units appeared in the CDC 7600
3) IBM ACS-1 (1960s) – an out-of-order design; the project, begun in 1961, was cancelled in 1969 in favor of IBM 360 compatibility; the ACS-1 design itself dates to about 1965
4) IBM 360/91 (1967) – out-of-order execution, but only in the floating-point unit (a response to the CDC 6600)
5) Floating Point Systems AP-120B (1976) – an early VLIW-style machine, not out-of-order
6) RISC machines of the 1980s, going back to the IBM 801 (1975); the first superscalar RISC, the IBM RS/6000, appeared only in 1989
7) Cheetah (1980s)
8) GS-1000 (1986)
9) America (1987) – the IBM research design that grew into the RS/6000
10) Astronautics ZS-1 (1987) – out-of-order
11) RS/6000 (1989) – superscalar, but not out-of-order
12) Intel i860 (1989) – Intel's own RISC design, not out-of-order

The terms Reorder Buffer and Checkpoint themselves appeared in publications only in 1988 and 1986, respectively.

The overall conclusion: up to the beginning of the 1980s no Western machine combined pipelining, the issue of several instructions per clock and out-of-order execution at once, so the claim that Elbrus-2 was "the world's first out-of-order superscalar" appears to stand.
True, it had only two pipelines. Today's superscalars have reached six pipelines; Power 8 has as many as ten, but that is RISC, where a large number of pipelines is easier to build.

Intel made an out-of-order machine in 1995. That is a very large gap: 1978 and 1995, seventeen years! Dreadful!

Such was the struggle for parallelism.

The problem with today's superscalar is that it still has one single instruction stream, a single instruction pointer. In the first superscalar this was tolerable, because back then not much hardware was needed to maintain that stream.

Now look at a modern superscalar microprocessor. It is something incredible, a dreadful thing. First the compiler turns an initially parallel algorithm into a linear sequence. Then the superscalar machine extracts parallelism back out of that sequence. And then, because the languages are sequential, the retirement unit lines the instructions that were executed in parallel back up into a sequence again.
Retirement unit
“Intel Architecture Optimization Reference Manual”: In-Order Retirement Unit. For semantically-correct execution, the results of instructions must be processed in original program order. Likewise, any exceptions that occur must be processed in program order. When a μop completes and writes its result, it is retired. Up to three μops may be retired per cycle. The unit in the processor which buffers completed μops is the reorder buffer (ROB). ROB updates the architectural state in order, that is, updates the state of instructions and registers in the program semantics order. ROB also manages the ordering of exceptions
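Below is a minimal sketch (C++; an invented structure, not Intel's actual hardware) of the in-order retirement the excerpt describes: results may be produced out of order, but they are committed to architectural state strictly in program order from a reorder buffer.

```cpp
#include <cstdio>
#include <vector>

// One reorder-buffer entry: an instruction in program order, which may finish
// execution at any time but may update architectural state only at the head.
struct RobEntry {
    int  destReg;
    long value = 0;
    bool done = false;       // set when the (possibly out-of-order) execution finishes
};

int main() {
    long archRegs[4] = {0, 0, 0, 0};     // architectural state
    std::vector<RobEntry> rob = {{0}, {1}, {2}, {3}};  // four instructions in program order

    // Suppose execution completes out of order: 2, 0, 3, 1.
    int completionOrder[4] = {2, 0, 3, 1};
    std::size_t head = 0;                // retirement pointer, in program order
    for (int idx : completionOrder) {
        rob[idx].value = 100 + idx;      // the out-of-order "execution" result
        rob[idx].done  = true;
        // Retirement: drain the ROB from the head, but only past finished entries.
        while (head < rob.size() && rob[head].done) {
            archRegs[rob[head].destReg] = rob[head].value;
            std::printf("retired instruction %zu in program order\n", head);
            ++head;
        }
    }
    std::printf("r0..r3 = %ld %ld %ld %ld\n",
                archRegs[0], archRegs[1], archRegs[2], archRegs[3]);
}
```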
What a zigzag you get! How can that be tolerated? It is simply awful. Yet people tolerate it.

In my opinion superscalar machines are very imperfect (although in their time they proved very useful). My colleagues and I realized long ago that the superscalar has a limit... Why does it have a limit? Because this transformation from sequential to parallel has a limit. The decoder in a modern Intel superscalar must deliver four to six decoded instructions every cycle, while the instruction stream it reads is sequential. So decoding six operations every clock is not at all easy; the x86 decoder is dreadfully complicated.

A RISC decoder is, of course, much simpler, because the instruction length is fixed. But there we run into the next bottleneck: renaming. Renaming links the arguments and results of different operations to one another, i.e. it effectively processes the operations one after another along a chain, which of course limits parallelization. And even if we somehow overcome the limits of renaming, the bypass logic comes to the fore.
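For illustration, a toy register-renaming pass in C++ (the table layout is invented): each instruction's sources are looked up in the current map and its destination gets a fresh physical register, and this walk along the chain has to happen in order, which is exactly the serial bottleneck being described.

```cpp
#include <cstdio>
#include <vector>

struct Instr { int dst, src1, src2; };           // architectural register numbers

int main() {
    // r1 = r1 + r2;  r3 = r1 + r4;  r1 = r5 + r6   (program order)
    std::vector<Instr> program = {{1, 1, 2}, {3, 1, 4}, {1, 5, 6}};

    int renameMap[8];                            // architectural -> physical
    for (int i = 0; i < 8; ++i) renameMap[i] = i;
    int nextPhys = 8;                            // free physical registers start here

    // Renaming must walk the instructions in order: instruction 2 can only be
    // renamed after instruction 1 has claimed a new physical register for r1.
    for (const Instr& in : program) {
        int p1 = renameMap[in.src1];             // read the current mappings
        int p2 = renameMap[in.src2];
        int pd = nextPhys++;                     // fresh destination register
        renameMap[in.dst] = pd;                  // later readers of dst will see pd
        std::printf("p%d = p%d op p%d\n", pd, p1, p2);
    }
}
```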
Bypass
Bypass logic forwards the result of one operation directly to the inputs of the operations that depend on it, bypassing the register file. Its complexity grows roughly as the square of the machine's width: with 6 pipelines about 36 forwarding paths are already needed, and so on.
And even if you cope with that, everything finally runs into the branch predictor. Branch prediction accuracy drops sharply as the nesting depth of branches grows.

Thus the superscalar has a limit, a performance limit. We understood this after the second generation of our superscalars, after Elbrus-2. Although in the days of the second Elbrus the superscalar was a good idea. We beat IBM with it, in the sense that we beat our competitors from NICEVT, who made the ES-1060, a full copy of the IBM System/360 (with a processor like the 3033, which was not a superscalar).

We beat them by a factor of two. But really it was a competition not with NICEVT but with IBM, because NICEVT made a clock-accurate copy of the American machine, reproduced down to the exact cycle, and all their software was IBM's. Ours was all our own.

However, no matter how beautiful the idea of the superscalar was, we realized it had a limit. We needed to come up with something to replace it. At the time Volodya Volkonsky and I thought a lot about this and came up with "tsugs": parallel execution threads within a single core.
Vladimir Volkonsky
See: http://www.mcst.ru/nashi_uchenye
We thought: make 32 tsugs and we will have real parallelism! But then I made a mistake. I decided it was too difficult and that there was a simpler approach, the wide instruction word (VLIW) approach. Well, we decided to try it. After all, at the time we did not know the drawbacks of the wide instruction word...

As a result we made a wide-instruction machine. But not like Itanium. Itanium is also, so to speak, a complete misunderstanding, because its width is the same as a superscalar's, six instructions, and with static scheduling at that. Why should it be faster than a superscalar? I do not understand.

In Elbrus-3 we applied the cluster approach.
Clusters
In a VLIW machine there is no renaming, but the bypass logic remains, and with a very wide instruction it threatens to grow out of control. The cluster approach splits the execution units into groups (clusters), each with its own register file and its own bypass network, which keeps the bypass logic manageable.

This is how it was done in Elbrus-3.
Since a wide-instruction machine does no dynamic conversion from sequential to parallel, maintaining the single instruction pointer is no longer a burden.

Each wide instruction could contain up to 22 operations; that is, there were 22 execution units.

We worked with this and realized that it too was no marvel, because it relies on static scheduling.

Thus in our lifetime we tried extreme dynamic scheduling, but narrow, with two pipelines, and extreme static scheduling, wide, with two clusters. We saw that both are bad.

After that, already at Intel, we began a project that would avoid the limitations of both approaches.

So I have reached the present day. I have told how the temporal component changed and how the spatial one changed. Now let us take a closer look at the present.

What needs to change? Why is everything bad? Well, first, I have already said that the superscalar is very complex. But the worst thing is that it has a performance limit. Its limit is six pipelines; more is no longer feasible. Power 8 has ten, but that is RISC, where it comes easier; in x86 it is harder. That "harder", though, is now more than compensated by the high qualification of the Intel team. Qualification matters a great deal here.

Now look at what this limit has led to. People start thinking how to widen the machine, how to raise the logical speed. How can that be done? Raising the frequency is now limited. Hence vector arithmetic is introduced.

It is horrible. At one time Cray invented the vector. Cray was a brilliant man. He had only one floating-point unit then. He introduced vector registers, and through that one unit, every clock cycle, he launched the same operation on different registers. That is, his vectors ran not "across" the device, as now, but "along" it. He had a single pipeline. I read his paper. He says that people love his machine for its fast scalar speed, and the vectors come for free.

Now look at vector arithmetic today. Free, indeed! Giant vector units! God knows what goes on in memory because of them. And the utilization of the vectors is very low. If a conditional branch occurs inside a vectorized loop, it is simply awful, because you have to introduce a mask and in effect run the whole loop twice.
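A small C++ sketch of what the mask costs (scalar code standing in for real vector instructions): when the loop body contains a branch, the masked version computes both arms for every element and then blends them, so the work of both branches is done regardless of the condition.

```cpp
#include <cstdio>

int main() {
    const int N = 8;
    float a[N] = {1, -2, 3, -4, 5, -6, 7, -8};
    float out_scalar[N], out_vector[N];

    // Scalar loop: each iteration executes only one arm of the branch.
    for (int i = 0; i < N; ++i)
        out_scalar[i] = (a[i] > 0) ? a[i] * 2.0f : a[i] * -3.0f;

    // "Vectorized" loop with a mask, written out element-wise: both arms are
    // computed for every element, and the mask selects which result survives.
    for (int i = 0; i < N; ++i) {
        bool  mask     = (a[i] > 0);       // the per-element mask register
        float then_arm = a[i] * 2.0f;      // executed for all lanes
        float else_arm = a[i] * -3.0f;     // also executed for all lanes
        out_vector[i]  = mask ? then_arm : else_arm;
    }

    for (int i = 0; i < N; ++i)
        std::printf("%g %g\n", out_scalar[i], out_vector[i]);
}
```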

Iteration-level parallelism would be natural in this situation, and with vectors it cannot be used.

A vector is a hardware complication, and in many cases this complication is simply not used. It is not very popular.

The second thing one can think of to raise logical speed is multithreading. But that is also bad, because if you have a single task you must divide it into threads by hand. We felt this keenly when we worked on graphics. Intel was working on Larrabee at the time. Larrabee set itself a huge goal: programmable graphics. But there was a giant mistake. They decided to build this programmable graphics on x86, where explicit parallelism is not supported.

Let us look at how rasterization is done. Rasterization is a mass of very small pieces of computation that are highly parallel with one another. We have objects and we project them onto the plane. The typical work is this: take the next object, then for each pixel ask: has something already been projected there? If so, is the previously projected object hidden by the current one? If yes, the pixel value is overwritten. If the current object is not closer to the viewer, it is ignored.

To parallelize such an algorithm we need synchronization, because many objects will try to project onto a given pixel in parallel. But there can be a great many objects, and synchronization takes unacceptably long, because it goes through memory: as a rule, the semaphores live in shared memory.
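A sketch of that synchronization problem in C++ (a std::mutex standing in for whatever semaphores the hardware would use; the structure is invented for illustration): many threads rasterize objects onto the same pixel, so every depth-test-and-write must lock the pixel in shared memory, and that locking is the expensive part.

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// One pixel of the depth buffer, shared by all rasterizing threads.
struct Pixel {
    std::mutex lock;          // the per-pixel "semaphore in shared memory"
    float depth = 1e30f;      // distance of the closest object seen so far
    int   color = 0;
};

// Depth test and write. Every fragment of every object has to take the lock,
// and going to shared memory for it is exactly the costly synchronization.
void write_fragment(Pixel& p, float depth, int color) {
    std::lock_guard<std::mutex> guard(p.lock);
    if (depth < p.depth) {    // is the current object closer than what is stored?
        p.depth = depth;
        p.color = color;
    }
}

int main() {
    Pixel pixel;
    std::vector<std::thread> threads;
    for (int obj = 0; obj < 8; ++obj)
        threads.emplace_back([&pixel, obj] {
            write_fragment(pixel, 10.0f + obj, obj);   // object obj lies at depth 10+obj
        });
    for (auto& t : threads) t.join();
    std::printf("closest depth %.1f from object %d\n", pixel.depth, pixel.color);
}
```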

This was one of the reasons why the Larrabee project failed.

And what is the ideal for graphics? Adequate hardware support for explicit parallelism.

So what is explicit parallelism? Consider a loop with an enormous number of iterations. If explicit parallelism is supported, the iterations of this loop may be executed in any order, and many at once. Now we reach a synchronization point: a request goes to memory.

If there is no explicit parallelism (as in Larrabee, for example), then the thread waiting on the synchronization, which, by the way, takes about 200 cycles, does nothing all that time and still occupies resources!
What is really needed is to hand the resources of a thread that is waiting on synchronization over to another thread that can run. Such a thread always exists, because there are iterations in abundance. One does need, however, an efficient way to transfer resources from one thread to another. But we already know how to do that.
Why not make a superscalar?
Why not simply take a superscalar here? Because a superscalar must preserve program order, i.e. the sequential semantics of the program, all the way through in-order retirement, and that prevents it from freely handing the resources of a stalled iteration to another one.
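A rough illustration of the point (C++; a software simulation of a single core, standing in for the fine-grained hardware switching being described, with invented parameters): while one iteration is stalled on its 200-cycle "memory request", the core's cycles go to other iterations instead of idling.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

// A toy single-core simulation: each loop iteration issues one "memory request"
// that takes MEM_LATENCY cycles and then does one cycle of computation. Instead
// of idling while a request is in flight, the core switches to another iteration.
constexpr int MEM_LATENCY = 200;

struct Iteration {
    int id;
    int readyAt = 0;      // cycle when its memory request completes
    bool issued = false;
};

int main() {
    const int N = 8;
    std::vector<Iteration> iters;
    for (int i = 0; i < N; ++i) iters.push_back({i});

    int cycle = 0, busyCycles = 0;
    std::queue<int> runnable;                 // iterations the core may pick up
    for (int i = 0; i < N; ++i) runnable.push(i);
    std::queue<int> waiting;                  // iterations with a request in flight

    while (!runnable.empty() || !waiting.empty()) {
        // Wake up any iteration whose memory request has completed.
        for (std::size_t k = waiting.size(); k > 0; --k) {
            int id = waiting.front(); waiting.pop();
            if (iters[id].readyAt <= cycle) runnable.push(id); else waiting.push(id);
        }
        if (!runnable.empty()) {
            int id = runnable.front(); runnable.pop();
            ++busyCycles;                     // the core does useful work this cycle
            if (!iters[id].issued) {          // phase 1: issue the memory request
                iters[id].issued = true;
                iters[id].readyAt = cycle + MEM_LATENCY;
                waiting.push(id);
            } else {                          // phase 2: compute once the data arrived
                std::printf("iteration %d finished at cycle %d\n", id, cycle);
            }
        }
        ++cycle;                              // otherwise the core idles this cycle
    }
    // The 200-cycle waits of all iterations overlap instead of adding up.
    std::printf("total %d cycles, core busy for %d of them\n", cycle, busyCycles);
}
```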


With this approach the utilization of the arithmetic (in graphics, mostly floating-point arithmetic) will be one hundred percent! It will simply be impossible to go faster for a given machine width. And that is thanks to explicit parallelism.

So what was actually needed? Explicit parallelism. The sequential representation of the program, in memory and in time, has become a burden. Because of the linear address space we have no type safety, because there is no support for objects: in space everything is flattened into a single order.

And what does ordering in time lead to? To languages being made sequential as well. Language designers did understand that explicit parallelism is important, and they introduced parallel languages: Algol 68, Ada. There you have explicit parallelism of iterations. Explicit parallelism means synchronization is required. So the designers made parallel languages with synchronization, and then the compiler translates them onto a sequential machine. There is no benefit from the language being parallel, only a headache for programmers, because where a language has explicit parallelism you sometimes get deadlocks. That is why such languages did not take root. And now they are extremely important!

So, once again, to follow the logic: the modern superscalar is very complex and also has a speed limit. To overcome this limit, vectorization and multithreading are introduced. And what is really needed is explicit parallelism.

Now, another problem of modern computing... When the first machines appeared, the programmer or compiler produced a program intended for a specific machine, and the program was optimized for that machine. True, for those sequential machines there was little to optimize. But later, when caches and the rest appeared, the translator did its optimization for each particular machine.

This situation held until the IBM 360 appeared. With the advent of the 360, IBM in effect declared: "Until now, programs were written anew for each machine. A new instruction set meant a new generation and new programs. Let the 360 have a single instruction set across many models. There are fast models and slow ones, but the compiler translates the code only once, and that code runs on all the models." And that is what we have to this day.

In its day this approach may have looked fine, but now it is simply awful! Look at the trouble that arises. The compiler translates the program and optimizes it, that is, it optimizes how the algorithm uses resources. To optimize well the compiler must know two things. First, it must know the algorithm well, thoroughly: what its structure is, what parallelism it contains... The second thing the compiler must know is the resources: how many cache levels there are, how they are organized, how many registers...

Now look at how things stand today. The compiler does not properly know the algorithm, because the languages have been spoiled by old machines: they are all sequential. If the algorithm has parallelism, the user himself is forced to linearize it in the language, and he does not linearize it in the best way at all; for one model it should be linearized one way, for another model another way. So the programmer has already spoiled the program. Fine, he spoiled it; then the compiler translates it for some piece of hardware, not knowing on which model the program will actually run. (Here I simplify a little, because with compilation options you can, at least in theory, get code optimized for particular hardware features; but specializing code for specific models is not common practice.) So the compiler has to compile the program well for the old model, for the current one and for the future one. That it cannot do, so it compiles haphazardly. Today compilation for a superscalar is shamanism. How do people develop optimizing compilers? I admire their work: rearrange something, and suddenly everything runs faster.

But look at what, for example, is impossible with this approach. It is impossible to do even such a simple optimization as code prefetching. Volodya Pentkovsky explained this to me; he worked on teams that implemented prefetching.

When the prefetching is tuned on a specific model, the program runs faster. Two or three years later other models appear, and on them the same program may run slower than on the old machine. One colleague told me a story: when he had to visit users whose program ran slowly, the first thing he did was switch off the software prefetching, and the program usually began to run faster. All because prefetching is model-dependent; the compiler cannot do it efficiently.

So what do we have? The compiler does not really know the algorithm and does not know the hardware at all. So the hardware has to cope on its own; in particular, the cache takes on this work. But how is that work organized? The hardware cannot prefetch ahead of the algorithm; it brings data in by looking only backwards, in least-recently-used fashion. And all of this can run counter to how the algorithm actually uses its data.

To optimize well you need to know the algorithm, which, of course, is inaccessible to the hardware.

Now, what has to be done to correct the situation? The compiler must be made model-dependent: pushed right down to the hardware.
That is the first point. Secondly, the information that travels from the algorithm to the compiler must be made absolutely equivalent to the algorithm itself.

Now let us look at another very important thing. The first machines were bus-based, and memory was shared. If something was written, it went straight into the memory system. There were no local caches. Memory accesses from different processors were indistinguishable. Moreover, a memory access completed faster than an arithmetic operation.
Fast memory and slow arithmetic
In the CDC 6600, for example, an arithmetic operation took about 10 cycles, so a memory access was no slower than arithmetic.


Now an access to shared memory takes about 200 clock cycles, far longer than any operation takes to execute. So we have caches. A cache is by its nature a local thing. And because the bus is being emulated, working with the cache reproduces, in a sense, work with shared memory. The result is a complex cache-interaction protocol: to write into a particular cache you must take ownership of the line, and so on.

A simpler approach here would be to introduce local memory instead of a cache. Then there would be almost no misses at all. The local memory would serve only its own core.

After all, how is it done now? There is a program, some core is running it, and it works with some memory. And it is unknown whether another core is working with the same memory in parallel or not. Just in case, we must assume that it is. The memory model has to be upheld. This slows the machine down enormously. It is stupid!

If we had explicit parallelism, all this could be done far more efficiently. If the programmer, in a language with explicit parallelism, stated that no one works in parallel through shared memory with a particular part of the computation, then the core would not need to go anywhere: it could work with directly addressable local memory. The current cache is a wild tribute to the past. The technology changed long ago, yet the bus structure is still being imitated.
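A cautious sketch of the contrast in C++ (an ordinary local variable stands in for a real scratchpad, and std::atomic for coherence traffic; this is an analogy, not a proposal for a concrete mechanism): if the programmer has declared that a chunk of work is private to one core, it can run entirely in local storage and touch shared, coherent memory only once at the end.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Shared, coherence-maintained state: every access is an atomic read-modify-write,
// i.e. the core must in effect "own the line" before each update.
std::atomic<long long> shared_sum{0};

// Variant 1: every partial result goes straight to shared memory.
void worker_shared(const std::vector<long long>& chunk) {
    for (long long v : chunk)
        shared_sum.fetch_add(v);          // coherence traffic on every element
}

// Variant 2: the chunk is declared private to this core, so it is summed in
// "local memory" (a plain local variable standing in for a scratchpad) and
// shared memory is touched exactly once.
void worker_local(const std::vector<long long>& chunk) {
    long long local = 0;                  // lives in the core's local storage
    for (long long v : chunk) local += v;
    shared_sum.fetch_add(local);          // a single synchronized update
}

int main() {
    std::vector<long long> a(100000), b(100000);
    for (long long i = 0; i < 100000; ++i) { a[i] = i; b[i] = 2 * i; }

    for (auto worker : {worker_shared, worker_local}) {
        shared_sum = 0;
        std::thread t1(worker, std::cref(a)), t2(worker, std::cref(b));
        t1.join(); t2.join();
        std::printf("sum = %lld\n", shared_sum.load());   // same result both ways
    }
}
```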

Let me add the caveat that I propose local memory very cautiously. I have not yet worked the idea out, but I think it can be done.

Let us look at another archaism: retirement. With the introduction of explicit parallelism it will have to be removed. A single sequence of instructions is not needed. Full linearization makes efficiency very hard to achieve. Of course, the dependences between reads from memory and writes to memory still have to be resolved somehow, but that is a separate matter. I believe such dependences should be indicated explicitly by the programmer, and handling them efficiently is then the job of the compiler and the hardware. In particular, if a dependence does not lie on the critical path of the algorithm, nothing special needs to be done about it: the dependent operation simply waits for the operation it depends on. With dependences on the critical path things are a bit more complicated. If we strictly order all the dependent operations, we can lose a great deal of performance. So one needs data speculation and good scheduling that tries to push dependent operations as far apart as possible.

Now... when I grasped the flaws of modern machines, I was simply horrified! The hardware has long been completely different, and yet the very first machines are still being imitated. And no one even thinks that this is absurd. Everyone is afraid.

I have already said a great deal. But what, in the end, is my picture of the most progressive machine of the future?

Look. There are algorithms. Both the spatial and the temporal components of programs must be made fully equivalent to the algorithm. For this, a language with explicit parallelism is introduced; it still needs to be worked out. Type safety is required in this language. That is the first point.

Second, a simple transcoder is introduced. God forbid any optimization is done at this stage; that is the compiler's job. The transcoder should simply translate the program into a binary representation.

Third, the distribution produced by the transcoder travels all the way to the compiler, which is pushed down onto the hardware. No other compiler variants are needed, because they would only hide the algorithm from the compiler sitting on the hardware. If everything is done as I have just said, this single compiler will know both the algorithm and the hardware well.

Next, you need to resolve the issue with the cache.

This is the most difficult spot, the least studied. With some caution I will venture the prediction that caches should simply be thrown out. Caches are a tribute to the old ways. Directly addressable memory should be introduced instead.

Then retirement also needs to be thrown out.

After all this the machine becomes wildly simple. Such a system will need speculation, both data speculation and control speculation. Then a very simple scheduler, far simpler than in a superscalar. Something will have to be added for the proper execution of conditional branches and procedure entries. And, of course, execution units. Nothing else is needed.

Now a little about the universality of such a machine, though I speak with some caution... Recently I looked at the die layout of a Qualcomm smartphone chip. There are about eleven blocks. One of them is the general-purpose processor, and it occupies only 15% of the area. Next to it sits the graphics. Graphics cannot be done well with modern languages and machines; this we know. But on a machine with support for explicit parallelism it is elementary. So the graphics block folds into the general-purpose one. Next we see a DSP. A DSP is the same kind of loop, and not the hardest one; on a general-purpose machine it will be easy to execute. As for the remaining blocks, I do not really know the algorithms they are meant for, but I have asked people to look into each of them. I also contacted a person in the US who works with smartphones and asked him about the tasks these blocks perform.

And do you know what this man told me? Previously each of those blocks was a simple little specialized device. Then the devices grew more complicated and turned into programmable controllers. That is, every smartphone developer now builds more than a dozen home-grown programmable processors. This is what smartphone developers compete on: their general-purpose processors, occupying only 15% of the chip, are roughly the same, and they compete on the other devices.

My own thought is this: what can there be in those blocks besides discrete arithmetic and loops? Most likely nothing. But then, apparently, a general-purpose processor could replace all these devices. Today's general-purpose processors cannot, because they are very inefficient at specialized tasks. But a processor with support for explicit parallelism can. Of course, this cannot be claimed with certainty; we are only beginning to work on it.

However, if this is so, then the life of smartphone developers will become much easier.

Right now they are forced to deal with a large number of complex devices themselves, and the competition becomes very burdensome.

If all this variety of devices can be replaced by a single general-purpose processor, then all the competition will move into software, because the algorithms that are now baked into specialized hardware can be written as programs and run on a universal machine.

This thought first came to me long ago, when I worked with people from Samsung. At one time they made special chips for refrigerators, and then they realized: if the controller is programmable, one device can serve many products. So now I think: if smartphones switch to a single universal processor, it will be just great!

Once more, about what is happening now: the current state of computing is not far removed from the original one, when everything was linear.

Compatibility dominates everything: sequential time, sequential space. And the bus is still being emulated. Now, if all this is removed and things are done as I have just described, then the superscalar limit disappears and we approach the natural limit of the algorithms themselves.

And this limit is, in a sense, unimprovable. Here people may tell me: something must be wrong with your head. What does "unimprovable" mean? Has progress stopped, or what?

Well, it happens. Everyone now admits that the superscalar has almost stopped developing. A machine is a finite construct. If the hardware is finite, then the number of distinct processors is, at least theoretically, finite too. Among this finite set some processor is the best. So unimprovability exists in principle. The genius of Intel's developers is that they have practically reached this unimprovability, not in theory but in practice. Now we need to step over it, to introduce explicit parallelism. Then the limit will be the algorithm itself. Whatever parallelism the algorithm contains, you cannot jump above it. If someone says, "I can modify the algorithm and it will get faster," I will answer: fine, go ahead and modify it; we will take the modified algorithm and execute it in the best possible way. Everything is fine.

Well, that is perhaps everything I wanted to say. I don't know about you, but I am very much impressed.

The author thanks Alexander Ostanevich and Sergey Shishlov for help in preparing the material.

I also want to turn to you, dear readers, with a request. I am interested in continuing to cover the history (and present day) of Soviet and Russian computing. So if you have any access to the founding fathers (of, for example, the BESM machines, Setun, and so on), please contact me. I think the history of these machines and their creators still holds many blank spots that would be good to fill in.

Thank you!

Source: https://habr.com/ru/post/214377/

