No, I will not reveal to you the riddle hiding in the name of
MCp0411100101 , but I will try to respond
fully to the comment
nerudo recorded in the topic
“Multiklet” processors became more accessible :
Reading the description of the architectural innovations of this multiclet, I would like to use the phrase from the next topic: "I do not understand."
In short, MCp is a streaming (from dataflow) processor with the original EPIC architecture. EPIC is Explicitly Parallel Instruction Computing, computation with explicit instruction parallelism. I use this term here precisely in this sense, as an abbreviation, and not as a reference to the Itanium architecture. Explicit concurrency in MCp is of a completely different kind.
About the benefits of MCp
For starters, I’ll say that EPIC in MCp is such that it gives the processor some attractive properties (at least for me personally).
')
- Good energy efficiency. Which is ensured by the fact that MCp can:
- coordinate their architectural state much less frequently than processors of traditional architectures;
- It is natural to combine in parallel asynchronous execution instructions for working with memory, arithmetic instructions and code prefetching.
- The non-standard organization of branches provides interesting possibilities for the implementation of the so-called managed runtime, or (in classical terminology) safe programming languages.
- Functional languages ​​can be put on the MCp architecture “more naturally” than on traditional machines.
- The coding feature of the programs (implied by the machine code) and the feature of their execution make it possible to make MCp a gradually degrading processor. That is, it may not break down entirely if the functional devices fail, but it is natural to switch to the operation mode with a different distribution of computations, with lower performance. Unlike traditional fault-tolerant processors, in which functional devices (or the whole processors themselves) are simply triple, MCp in normal mode, when hardware errors do not occur, can more efficiently use its computational power.
- To all this, MCp is still relatively easy to scale (would allow the process) and run in multi-threading mode (meaning SMT - Simultaneous Multi Threading) and also with dynamic resource sharing between threads.
Now I will try to explain where these properties come from, and where such a name is “multicellular”, that is, what a “cell” is. I don’t know anything about the mysterious markings. Maybe this is the key of some “MultiClet” quest? :) No, I seriously do not know.
Cells
A cell (such is the name) is the main element of the MCp microarchitecture. Probably, there is a simpler explanation of what it is, but it is easier for me to begin by describing existing processors.
Any modern CPU contains a set of some functional devices. They can be divided into several types. Let's divide (to explain the features of the MCp, I don’t need their very detailed description, so everything is superficial).
- ALU : arithmetic logic units, in a broad sense. That is, devices that perform various transformations on the data. They have input ports to which the opcode and operands are sent and the output on which the results are generated.
- LSU : memory access devices (Load / Store Unit). Naturally, this piece of data does not convert, but writes or reads from memory. It has its own input and output ports.
- RF : register file. These devices save data from some buses (not quite the correct name, but not the essence) and give them to other buses, focusing on the commands and values ​​on their input ports. These tires are connected to LSU or ALU ports. It is often said that registers are such a fast internal processor memory. Rather, RF is such a very efficient internal processor memory, and register semantics are interfaces to access it. Because there is His Majesty ...
- CU : control device. This is a device that manages multiple ALUs, LSUs and RFs (it is now fashionable to make one common RF per core, but this was not always the case; just clarify), controlling the transmission of signals between them, executing the program. In modern traditional high-performance processors, the CU itself is very complex, and consists of other components: schedulers, transition predictors, decoders, queues, acknowledgment buffers, etc. But for the purposes of this story, it is more convenient for me to consider all this as one device. Similarly, I do not lay out on the adders and shifters ALU.
We can say that the CU is a device that determines the type of processor. In almost all modern traditional processors, ALU, LSU and RF are roughly the same from the functional point of view (if you don’t go into subtle details of the implementation and make no distinction between vector and scalar ALUs, yes, the statement was conditional). All the variety of models of CPU, GPU, PPU, SPU and other xPU is provided by the difference in the logic of different CU variants (and this difference is much more significant than the difference between vector and scalar ALUs).
This may be the logic of a simple stack processor, whose CU should operate in a trivial cycle. Read both registers from your RF, which consists of two IP registers (a pointer to the current instruction) and SP (a cycle tip). Put the read operation code and IP contents on the LSU input ports (most likely, the CU in this case simply switches the RF output and the LSU input), get the answer - the instruction code. If, say, this is a jump instruction code, then the CU should set a request on the LSU ports to read the value from the top of the stack, change the SP by one, send this value in RF, and on the next clock cycle switch the output LSU port to the input port RF (to another port setting the value corresponding to the entry in the IP). Then, repeat the cycle. In fact, very simple. It seems that in vain in our specialized universities do not develop stack processors as an exercise.
This can be the logic of a sophisticated superscalar and multi-thread POWER8 with extraordinary execution, which selects several instructions for each clock cycle, decodes the previous sample, renames registers, drives a huge register file (even in i686 with its visible 16th registers, register files could be 128x64 bits), predicts branching, etc. Such a processor can not be done in the form of homework.
Or it may also be a fairly simple RISC-like CU, which in the GPU distributes all the ALU, LSU and RF, packed in a multiprocessor, the same command.
In today's high-performance CPU, CU is the most complex device that occupies most of the chip. But this is not important yet. The main thing is that in all the cases listed above, the CU is
one , although it can at the same time load with work and control many other functional devices. What can be equivalently formulated like this:
in modern processors, you can perform several control threads (chains of instructions) using one CU (SMT, for example, Hyper Threading); but one control flow cannot be performed with multiple CUs.Aha My young Padawan (we are all young in spirit and know that we don’t know anything :) the solution to the mystery of the Multiklet is close. Naturally, now I will say that the design of a multicellular processor is such that it contains several CUs that work according to a certain protocol and form some kind of distributed CU that can execute one execution thread (one thread, that is, thread) on several cores . But first I will say more.
So, the
cell is an analogue of the core in the usual CPU. It contains its CU, ALU (one, but sufficiently advanced even in the early version of the processor, capable of performing operations with float [2] values, including operations of complex arithmetic; the version currently being developed will support calculations with double). Cells can have access to shared RF and LSU, or they can have their own, which can work in mirror mode or even RAID-5 (if necessary; remember, the most important at this stage of the project’s development is the word “fault tolerance”). And one of the most pleasant places in the MCp architecture is that although in such RF modes it will work much slower, this will not significantly reduce the MCp performance, since the main data exchange during the calculation goes not through RF and shunts (bypass), but through another non-memory device - the switch.
The main feature of cells is that their CUs can, working under a special protocol and with a special presentation of the program, together constitute one distributed CU that can execute one thread (in the sense of thread, in the sense of control flow). And they can execute this thread in parallel, asynchronous, combined mode, when they simultaneously occur: selection of instructions, work with memory and RF (this is done in a very easy way), arithmetic transformations, calculation of the transition goal (and from this I personally bastard, for Pattern-matching from high-level languages ​​fits perfectly in this). And what is even more remarkable, these CUs turned out to be much more substantial :) it’s actually simpler than the CUs of modern superscalar processors with an extraordinary performance. They are also capable of such a parallel execution of the program (to clarify: but not at the expense of its simplicity and distribution, but on the contrary, at the expense of its complexity and centralization, which are needed to form special knowledge about the executable program; more in the next part of the text).
In my opinion (which may differ from the opinion of the engineers who developed and improve the MCp), the most important achievement in the processor is these CUs, which provide fault tolerance and energy efficiency that are important at the current stage of the processor's existence. And the proposed principle of their construction is important not only for microprocessors, but also for other high-performance distributed computing systems (for example, the RiDE system is built according to similar principles).
Energy efficiency. MCp is a parallel processor capable of executing 4 instructions per clock; this is not bad at all. And for this, it does not need a complex and large (in size) central CU, it costs relatively small local devices for each cell. Small means they consume less energy. Local means that you can get by with shorter wiring for transmitting signals, which means that less energy will be dissipated, and the frequency potential will be higher. This is all +3 energy efficiency.
Fault tolerance. If the CU dies in the traditional processor, the entire processor dies. If one of the CU dies in the MCp, one of the cells dies. But the calculation process can continue on the remaining cells, albeit more slowly. Conventional processors, traditionally, troop to ensure reliability. That is, they put three processors that run the same program. If one of them starts to fail, it is detected and disconnected. The MCp architecture allows the processor to work in this mode on its own, and this can be controlled programmatically: if necessary, it can be considered in high-performance mode, when necessary, it can be read in cross-checking mode, without spending additional hardware resources on it, which can also to refuse Other modes are possible (for the time being, as far as I know, they have not been patented, so I will not be distributed).
The birth of nonlinearity
Now I will try to explain why such a distributed CU is possible, that it can really be simple, why you need another way to encode a program, and why this method proposed by the authors of MCp is cool. It's easier for me to start again with a description of traditional (GPU and VLIW are also considered traditional) architectures.
Let's compile something already, but I haven’t compiled anything for two days already, my hands itch already.
cat test-habr.c && gcc -S test-habr.c && cat test-habr.s
typedef struct arrst Arrst;
struct arrst
{
void * p;
char a[27];
unsigned x;
};
struct st2
{
Arrst a[23];
struct st2 * ptr;
};
struct st2 fn5(unsigned x, char y, int z, char w, double r, Arrst a, Arrst b)
{
int la[27];
char lb[27];
double lc[4];
struct st2 ld[1];
return ((struct st2 *)b.p)[a.a[((Arrst *)b.p)->a[13]]].ptr->ptr->ptr[lb[10]];
}
.file "test-habr.c"
.text
.globl fn5
.type fn5, @function
fn5:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $1016, %rsp
movq %rdi, -1112(%rbp)
movl %esi, -1116(%rbp)
movl %edx, %eax
movl %ecx, -1124(%rbp)
movl %r8d, %edx
movsd %xmm0, -1136(%rbp)
movb %al, -1120(%rbp)
movb %dl, -1128(%rbp)
movq 56(%rbp), %rdx
movq 56(%rbp), %rax
movzbl 21(%rax), %eax
movsbl %al, %eax
cltq
movzbl 24(%rbp,%rax), %eax
movsbq %al, %rax
imulq $928, %rax, %rax
addq %rdx, %rax
movq 920(%rax), %rax
movq 920(%rax), %rax
movq 920(%rax), %rdx
movzbl -134(%rbp), %eax
movsbq %al, %rax
imulq $928, %rax, %rax
leaq (%rdx,%rax), %rcx
movq -1112(%rbp), %rax
movq %rax, %rdx
movq %rcx, %rsi
movl $116, %eax
movq %rdx, %rdi
movq %rax, %rcx
rep movsq
movq -1112(%rbp), %rax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size fn5, .-fn5
.ident "GCC: (GNU) 4.7.2"
.section .note.GNU-stack,"",@progbits
. , CU, (, AT&T ):
imulq $928, %rax, %rax
addq %rdx, %rax
movq 920(%rax), %rax
movq 920(%rax), %rax
movq 920(%rax), %rdx
movzbl -134(%rbp), %eax
, , , , 6-, . , , . , : - .
, , , — ( , , , , ; CU ; , ) — ( , ). CU :
, , . , , CU. Core iX, AMD FX, POWER8 . CU «», , , . . CU — , CPU. . , . , .
, . VIA Isaiah ( ) . , , PLL, FP, IU, Load/Store — .

ARM big.LITTLE, , , . , NVIDIA , 4 , Sandy Bridge. SB - .
VLIW EPIC Intel HP (Itanium) : , ( , ). , . . CU VLIW- Itanium- . , CU , . , MCp. , VLIW : 6- ( Itanium) , , 12. : , ( ), VLIW . Itanium prefetch-, . , . () RISC SMT-, , .
. - , , CU ( , ALU). (, , CU ). .
«» :
? :
!
! compilation time! (, ) MCp.
cat test-habr.c && rcc -target=mcp < test-habr.c
typedef struct arrst Arrst;
struct arrst
{
void * p;
char a[27];
unsigned x;
};
struct st2
{
Arrst a[23];
struct st2 * ptr;
};
struct st2 fn5(unsigned x, char y, int z, char w, double r, Arrst a, Arrst b)
{
int la[27];
char lb[27];
double lc[4];
struct st2 ld[1];
return ((struct st2 *)b.p)[a.a[((Arrst *)b.p)->a[13]]].ptr->ptr->ptr[lb[10]];
}
: .local/.global ( ).
.alias
,
#define
. . . beta- ( , , ). . . , . , .
.alias SP 39 ; stack pointer
.alias BP 38 ; function frame base pointer
.alias SI 37 ; source address
.alias DI 36 ; destination address
.alias CX 35 ; counter
.text
fn5:
.alias fn5.2.0C #BP,8
.alias fn5.x.4C #BP,12
.alias fn5.y.8C #BP,16
.alias fn5.z.12C #BP,20
.alias fn5.w.16C #BP,24
.alias fn5.r.20C #BP,28
.alias fn5.a.24C #BP,32
.alias fn5.b.60C #BP,68
.alias fn5.2.0A #BP,8
.alias fn5.x.4A #BP,12
.alias fn5.y.8A #BP,16
.alias fn5.z.12A #BP,20
.alias fn5.w.16A #BP,24
.alias fn5.r.20A #BP,28
.alias fn5.a.24A #BP,32
.alias fn5.b.60A #BP,68
.alias fn5.lb.27AD #BP,-27
.alias fn5.1.32RT #BP,-32
.alias fn5.2.36RT #BP,-36
.alias fn5.3.40RT #BP,-40
.alias fn5.4.44RT #BP,-44
.alias fn5.5.48RT #BP,-48
jmp fn5.P0
getl #SP
getl #BP
subl @2, 4
subl @3, 56
wrl @3, @2
setl #SP, @2
setl #BP, @4
complete
fn5.P0:
jmp fn5.P1
rdsl fn5.y.8C
wrsb @1, fn5.y.8A
complete
fn5.P1:
jmp fn5.P2
rdsl fn5.w.16C
wrsb @1, fn5.w.16A
complete
fn5.P2:
jmp fn5.P3
getsl 0x340
wrsl @1, fn5.1.32RT
complete
fn5.P3:
jmp fn5.P4
rdsb fn5.lb.27AD + 10
rdsl fn5.1.32RT
mulsl @1, @2
wrsl @1, fn5.2.36RT
complete
fn5.P4:
jmp fn5.P5
rdl fn5.b.60A
wrl @1, fn5.3.40RT
complete
fn5.P5:
jmp fn5.P6
rdl fn5.3.40RT
addl @1, 0x11
rdsb @1
exa fn5.a.24A + 4
addl @2, @1
rdsb @1
rdsl fn5.1.32RT
mulsl @1, @2
wrsl @1, fn5.4.44RT
complete
fn5.P6:
jmp fn5.P7
getsl 0x33c
wrsl @1, fn5.5.48RT
complete
fn5.P7:
jmp fn5.P7.blkloop
rdl fn5.3.40RT
rdsl fn5.4.44RT
rdsl fn5.5.48RT
addl @2, @3
addl @1, @2
rdsl fn5.5.48RT
rdl @2
addl @1, @2
rdsl fn5.5.48RT
rdl @2
addl @1, @2
rdl @1
rdsl fn5.2.36RT
addl @1, @2
rdl fn5.2.0A
; - :)
getl 0x0000ffff
patch @1, @3
patch @2, @3
setq #SI, @2
setq #DI, @2
getl 0xfcc1ffff
patch @1, 0
setq #CX, @1
getl #MODR
or @1, 0x38
setl #MODR, @1
complete
; , CX, SI DI
fn5.P7.blkloop:
exa #CX
jne @1, fn5.P7.blkloop
je @2, fn5.P7.blkclean
rdb #SI
wrb @1, #DI
complete
fn5.P7.blkclean:
jmp fn5.PF
getl #MODR
and @1, 0xffffffc7
setl #MODR, @1
complete
fn5.1L:
fn5.PF:
rdl #BP, 4
jmp @1
getl #BP
rdl #BP, 0
addl @2, 4
setl #BP, @2
setl #SP, @2
complete
, . , ,
complete
.
.
. , (
#
), , .
@N
, N —
, .
, , , .
, ,
@
- . , , , , , . MCp «» . (
setX
) (
wrX
).
, , , , . .
, N . n, , N*k+n (k = 0, 1, ...) , complete ( , ). .
, . , . ,
@
- (
) - , .
. , , — , - , CU . VLIW EPIC-, ( ), , , CU , ( ).
. RF. MCp , , RF . , RF .
Load/Store/MOB Isaiah. , — MOB, Memory Ordering Buffer. , , / ( ) . : . .
: ! RF . , ,
@
- . , :
volatile a;
a += a;
( ):
rdsl a
rdsl a
addsl @1, @2
wrsl @1, a
, , . , MOB, +1 MCp.
. . , , . . , . . .
MCp . , (, , ) . : jmp ( ) j ( ), , .
,
complete
, . , jmp , .
MCp . , , :
doSomething;
if(condition)
{
doStuff;
}
,
doSomething
.
condition
, ( , ).
doSomething
.
doSomething
. , CU , , , . , . : , ! ; : , ! , .
MCp , , ,
doSomething
(, ). MCp ( ), ( ) ( ). +1 .
, MCp CU, CU . CU , : , RF.
MCp. . , . MCp , , .
— . . , RF , ( ), . , . , ( ). .
-, , , . -, WriteBack-, . WB-, , , MCp RF (, , , MCp , , +1 ), . .
— (MMU). . , OS, : Linux, Plan9 :) , MMU — . , SPEC MMU 17.5% ; SUN Java 40%. , MMU ( MCp)? CUDA . , () , Java, .Net, JavaScript, Go, : ?
, - , , . , (TLB) MMU 32 , 33 . . - . .. , .
MCp
, , .
, :) , CU. , . , , , . , , N*k+n (k = 0, 1, ...) N=1 n=0, . , . Profit? PROFIT! big.LITTLE .
. , - , , N n. (, , SMS- , ! !).
(
managed runtimes) . (, , , ?). MCp (, ) , . , - , .
, MCp , , , . , , , (dependable) . , pattern-matching . , (- ):
fib :: (Integral t) => t -> t
fib 0 = 1
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)
CPU
fib
, , 1 .. , . MCp , .
,
. , . , ( , ..). 16- , , , : ? ( ). , . , , .
MCp. , ( , ) , . . , , .
, !
, ( , ) - , .