The Ballad of "Multiklet"

No, I will not reveal to you the riddle hiding in the name of MCp0411100101 , but I will try to respond fully to the comment nerudo recorded in the topic “Multiklet” processors became more accessible :

Reading the description of the architectural innovations of this multiclet, I would like to use the phrase from the next topic: "I do not understand."

In short, MCp is a streaming (from dataflow) processor with the original EPIC architecture. EPIC is Explicitly Parallel Instruction Computing, computation with explicit instruction parallelism. I use this term here precisely in this sense, as an abbreviation, and not as a reference to the Itanium architecture. Explicit concurrency in MCp is of a completely different kind.

About the benefits of MCp

For starters, I’ll say that EPIC in MCp is such that it gives the processor some attractive properties (at least for me personally).
')

Good energy efficiency. Which is ensured by the fact that MCp can:
- coordinate their architectural state much less frequently than processors of traditional architectures;
- It is natural to combine in parallel asynchronous execution instructions for working with memory, arithmetic instructions and code prefetching.
The non-standard organization of branches provides interesting possibilities for the implementation of the so-called managed runtime, or (in classical terminology) safe programming languages.
Functional languages can be put on the MCp architecture “more naturally” than on traditional machines.
The coding feature of the programs (implied by the machine code) and the feature of their execution make it possible to make MCp a gradually degrading processor. That is, it may not break down entirely if the functional devices fail, but it is natural to switch to the operation mode with a different distribution of computations, with lower performance. Unlike traditional fault-tolerant processors, in which functional devices (or the whole processors themselves) are simply triple, MCp in normal mode, when hardware errors do not occur, can more efficiently use its computational power.
To all this, MCp is still relatively easy to scale (would allow the process) and run in multi-threading mode (meaning SMT - Simultaneous Multi Threading) and also with dynamic resource sharing between threads.

Now I will try to explain where these properties come from, and where such a name is “multicellular”, that is, what a “cell” is. I don’t know anything about the mysterious markings. Maybe this is the key of some “MultiClet” quest? :) No, I seriously do not know.

Cells

A cell (such is the name) is the main element of the MCp microarchitecture. Probably, there is a simpler explanation of what it is, but it is easier for me to begin by describing existing processors.

Any modern CPU contains a set of some functional devices. They can be divided into several types. Let's divide (to explain the features of the MCp, I don’t need their very detailed description, so everything is superficial).

ALU : arithmetic logic units, in a broad sense. That is, devices that perform various transformations on the data. They have input ports to which the opcode and operands are sent and the output on which the results are generated.
LSU : memory access devices (Load / Store Unit). Naturally, this piece of data does not convert, but writes or reads from memory. It has its own input and output ports.
RF : register file. These devices save data from some buses (not quite the correct name, but not the essence) and give them to other buses, focusing on the commands and values on their input ports. These tires are connected to LSU or ALU ports. It is often said that registers are such a fast internal processor memory. Rather, RF is such a very efficient internal processor memory, and register semantics are interfaces to access it. Because there is His Majesty ...
CU : control device. This is a device that manages multiple ALUs, LSUs and RFs (it is now fashionable to make one common RF per core, but this was not always the case; just clarify), controlling the transmission of signals between them, executing the program. In modern traditional high-performance processors, the CU itself is very complex, and consists of other components: schedulers, transition predictors, decoders, queues, acknowledgment buffers, etc. But for the purposes of this story, it is more convenient for me to consider all this as one device. Similarly, I do not lay out on the adders and shifters ALU.

We can say that the CU is a device that determines the type of processor. In almost all modern traditional processors, ALU, LSU and RF are roughly the same from the functional point of view (if you don’t go into subtle details of the implementation and make no distinction between vector and scalar ALUs, yes, the statement was conditional). All the variety of models of CPU, GPU, PPU, SPU and other xPU is provided by the difference in the logic of different CU variants (and this difference is much more significant than the difference between vector and scalar ALUs).

This may be the logic of a simple stack processor, whose CU should operate in a trivial cycle. Read both registers from your RF, which consists of two IP registers (a pointer to the current instruction) and SP (a cycle tip). Put the read operation code and IP contents on the LSU input ports (most likely, the CU in this case simply switches the RF output and the LSU input), get the answer - the instruction code. If, say, this is a jump instruction code, then the CU should set a request on the LSU ports to read the value from the top of the stack, change the SP by one, send this value in RF, and on the next clock cycle switch the output LSU port to the input port RF (to another port setting the value corresponding to the entry in the IP). Then, repeat the cycle. In fact, very simple. It seems that in vain in our specialized universities do not develop stack processors as an exercise.

This can be the logic of a sophisticated superscalar and multi-thread POWER8 with extraordinary execution, which selects several instructions for each clock cycle, decodes the previous sample, renames registers, drives a huge register file (even in i686 with its visible 16th registers, register files could be 128x64 bits), predicts branching, etc. Such a processor can not be done in the form of homework.

Or it may also be a fairly simple RISC-like CU, which in the GPU distributes all the ALU, LSU and RF, packed in a multiprocessor, the same command.

In today's high-performance CPU, CU is the most complex device that occupies most of the chip. But this is not important yet. The main thing is that in all the cases listed above, the CU is one , although it can at the same time load with work and control many other functional devices. What can be equivalently formulated like this: in modern processors, you can perform several control threads (chains of instructions) using one CU (SMT, for example, Hyper Threading); but one control flow cannot be performed with multiple CUs.

Aha My young Padawan (we are all young in spirit and know that we don’t know anything :) the solution to the mystery of the Multiklet is close. Naturally, now I will say that the design of a multicellular processor is such that it contains several CUs that work according to a certain protocol and form some kind of distributed CU that can execute one execution thread (one thread, that is, thread) on several cores . But first I will say more.

So, the cell is an analogue of the core in the usual CPU. It contains its CU, ALU (one, but sufficiently advanced even in the early version of the processor, capable of performing operations with float [2] values, including operations of complex arithmetic; the version currently being developed will support calculations with double). Cells can have access to shared RF and LSU, or they can have their own, which can work in mirror mode or even RAID-5 (if necessary; remember, the most important at this stage of the project’s development is the word “fault tolerance”). And one of the most pleasant places in the MCp architecture is that although in such RF modes it will work much slower, this will not significantly reduce the MCp performance, since the main data exchange during the calculation goes not through RF and shunts (bypass), but through another non-memory device - the switch.

The main feature of cells is that their CUs can, working under a special protocol and with a special presentation of the program, together constitute one distributed CU that can execute one thread (in the sense of thread, in the sense of control flow). And they can execute this thread in parallel, asynchronous, combined mode, when they simultaneously occur: selection of instructions, work with memory and RF (this is done in a very easy way), arithmetic transformations, calculation of the transition goal (and from this I personally bastard, for Pattern-matching from high-level languages fits perfectly in this). And what is even more remarkable, these CUs turned out to be much more substantial :) it’s actually simpler than the CUs of modern superscalar processors with an extraordinary performance. They are also capable of such a parallel execution of the program (to clarify: but not at the expense of its simplicity and distribution, but on the contrary, at the expense of its complexity and centralization, which are needed to form special knowledge about the executable program; more in the next part of the text).

In my opinion (which may differ from the opinion of the engineers who developed and improve the MCp), the most important achievement in the processor is these CUs, which provide fault tolerance and energy efficiency that are important at the current stage of the processor's existence. And the proposed principle of their construction is important not only for microprocessors, but also for other high-performance distributed computing systems (for example, the RiDE system is built according to similar principles).

Energy efficiency. MCp is a parallel processor capable of executing 4 instructions per clock; this is not bad at all. And for this, it does not need a complex and large (in size) central CU, it costs relatively small local devices for each cell. Small means they consume less energy. Local means that you can get by with shorter wiring for transmitting signals, which means that less energy will be dissipated, and the frequency potential will be higher. This is all +3 energy efficiency.

Fault tolerance. If the CU dies in the traditional processor, the entire processor dies. If one of the CU dies in the MCp, one of the cells dies. But the calculation process can continue on the remaining cells, albeit more slowly. Conventional processors, traditionally, troop to ensure reliability. That is, they put three processors that run the same program. If one of them starts to fail, it is detected and disconnected. The MCp architecture allows the processor to work in this mode on its own, and this can be controlled programmatically: if necessary, it can be considered in high-performance mode, when necessary, it can be read in cross-checking mode, without spending additional hardware resources on it, which can also to refuse Other modes are possible (for the time being, as far as I know, they have not been patented, so I will not be distributed).

The birth of nonlinearity

Now I will try to explain why such a distributed CU is possible, that it can really be simple, why you need another way to encode a program, and why this method proposed by the authors of MCp is cool. It's easier for me to start again with a description of traditional (GPU and VLIW are also considered traditional) architectures.

Let's compile something already, but I haven’t compiled anything for two days already, my hands itch already.

cat test-habr.c && gcc -S test-habr.c && cat test-habr.s

typedef struct arrst Arrst;

struct arrst
{
	void * p;
	char a[27];
	unsigned x;
};

struct st2
{
	Arrst a[23];
	struct st2 * ptr;
};

struct st2 fn5(unsigned x, char y, int z, char w, double r, Arrst a, Arrst b)
{
	int la[27];
	char lb[27];
	double lc[4];
	struct st2 ld[1];

	return ((struct st2 *)b.p)[a.a[((Arrst *)b.p)->a[13]]].ptr->ptr->ptr[lb[10]];
}

	.file	"test-habr.c"
	.text
	.globl	fn5
	.type	fn5, @function
fn5:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$1016, %rsp
	movq	%rdi, -1112(%rbp)
	movl	%esi, -1116(%rbp)
	movl	%edx, %eax
	movl	%ecx, -1124(%rbp)
	movl	%r8d, %edx
	movsd	%xmm0, -1136(%rbp)
	movb	%al, -1120(%rbp)
	movb	%dl, -1128(%rbp)
	movq	56(%rbp), %rdx
	movq	56(%rbp), %rax
	movzbl	21(%rax), %eax
	movsbl	%al, %eax
	cltq
	movzbl	24(%rbp,%rax), %eax
	movsbq	%al, %rax
	imulq	$928, %rax, %rax
	addq	%rdx, %rax
	movq	920(%rax), %rax
	movq	920(%rax), %rax
	movq	920(%rax), %rdx
	movzbl	-134(%rbp), %eax
	movsbq	%al, %rax
	imulq	$928, %rax, %rax
	leaq	(%rdx,%rax), %rcx
	movq	-1112(%rbp), %rax
	movq	%rax, %rdx
	movq	%rcx, %rsi
	movl	$116, %eax
	movq	%rdx, %rdi
	movq	%rax, %rcx
	rep movsq
	movq	-1112(%rbp), %rax
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	fn5, .-fn5
	.ident	"GCC: (GNU) 4.7.2"
	.section	.note.GNU-stack,"",@progbits

. , CU, (, AT&T ):

	imulq	$928, %rax, %rax
	addq	%rdx, %rax
	movq	920(%rax), %rax
	movq	920(%rax), %rax
	movq	920(%rax), %rdx
	movzbl	-134(%rbp), %eax

, , , , 6-, . , , . , : - .

, , , — ( , , , , ; CU ; , ) — ( , ). CU :

CU ?
, ( ?) , ?
?
?

, , . , , CU. Core iX, AMD FX, POWER8 . CU «», , , . . CU — , CPU. . , . , .

, . VIA Isaiah ( ) . , , PLL, FP, IU, Load/Store — .

ARM big.LITTLE, , , . , NVIDIA , 4 , Sandy Bridge. SB - .

VLIW EPIC Intel HP (Itanium) : , ( , ). , . . CU VLIW- Itanium- . , CU , . , MCp. , VLIW : 6- ( Itanium) , , 12. : , ( ), VLIW . Itanium prefetch-, . , . () RISC SMT-, , .

. - , , CU ( , ALU). (, , CU ). .

«» : ? : !

! compilation time! (, ) MCp.

cat test-habr.c && rcc -target=mcp < test-habr.c

typedef struct arrst Arrst;

struct arrst
{
	void * p;
	char a[27];
	unsigned x;
};

struct st2
{
	Arrst a[23];
	struct st2 * ptr;
};

struct st2 fn5(unsigned x, char y, int z, char w, double r, Arrst a, Arrst b)
{
	int la[27];
	char lb[27];
	double lc[4];
	struct st2 ld[1];

	return ((struct st2 *)b.p)[a.a[((Arrst *)b.p)->a[13]]].ptr->ptr->ptr[lb[10]];
}

: .local/.global ( ). .alias, #define. . . beta- ( , , ). . . , . , .

.alias SP 39	; stack pointer
.alias BP 38	; function frame base pointer
.alias SI 37	; source address
.alias DI 36	; destination address
.alias CX 35	; counter

.text

fn5:
	.alias fn5.2.0C #BP,8
	.alias fn5.x.4C #BP,12
	.alias fn5.y.8C #BP,16
	.alias fn5.z.12C #BP,20
	.alias fn5.w.16C #BP,24
	.alias fn5.r.20C #BP,28
	.alias fn5.a.24C #BP,32
	.alias fn5.b.60C #BP,68

	.alias fn5.2.0A #BP,8
	.alias fn5.x.4A #BP,12
	.alias fn5.y.8A #BP,16
	.alias fn5.z.12A #BP,20
	.alias fn5.w.16A #BP,24
	.alias fn5.r.20A #BP,28
	.alias fn5.a.24A #BP,32
	.alias fn5.b.60A #BP,68

	.alias fn5.lb.27AD #BP,-27
	.alias fn5.1.32RT #BP,-32
	.alias fn5.2.36RT #BP,-36
	.alias fn5.3.40RT #BP,-40
	.alias fn5.4.44RT #BP,-44
	.alias fn5.5.48RT #BP,-48

	jmp	fn5.P0
	getl	#SP
	getl	#BP
	subl	@2, 4
	subl	@3, 56
	wrl	@3, @2
	setl	#SP, @2
	setl	#BP, @4
	complete

fn5.P0:
	jmp	fn5.P1
	rdsl	fn5.y.8C
	wrsb	@1, fn5.y.8A
	complete

fn5.P1:
	jmp	fn5.P2
	rdsl	fn5.w.16C
	wrsb	@1, fn5.w.16A
	complete

fn5.P2:
	jmp	fn5.P3
	getsl	0x340
	wrsl	@1, fn5.1.32RT
	complete

fn5.P3:
	jmp	fn5.P4
	rdsb	fn5.lb.27AD + 10
	rdsl	fn5.1.32RT
	mulsl	@1, @2
	wrsl	@1, fn5.2.36RT
	complete

fn5.P4:
	jmp	fn5.P5
	rdl	fn5.b.60A
	wrl	@1, fn5.3.40RT
	complete

fn5.P5:
	jmp	fn5.P6
	rdl	fn5.3.40RT
	addl	@1, 0x11
	rdsb	@1
	exa	fn5.a.24A + 4
	addl	@2, @1
	rdsb	@1
	rdsl	fn5.1.32RT
	mulsl	@1, @2
	wrsl	@1, fn5.4.44RT
	complete

fn5.P6:
	jmp	fn5.P7
	getsl	0x33c
	wrsl	@1, fn5.5.48RT
	complete

fn5.P7:
	jmp	fn5.P7.blkloop
	rdl	fn5.3.40RT
	rdsl	fn5.4.44RT
	rdsl	fn5.5.48RT
	addl	@2, @3
	addl	@1, @2
	rdsl	fn5.5.48RT
	rdl	@2
	addl	@1, @2
	rdsl	fn5.5.48RT
	rdl	@2
	addl	@1, @2
	rdl	@1
	rdsl	fn5.2.36RT
	addl	@1, @2
	rdl	fn5.2.0A

;   -     :)
	getl	0x0000ffff
	patch	@1, @3
	patch	@2, @3
	setq	#SI, @2
	setq	#DI, @2
	getl	0xfcc1ffff
	patch	@1, 0
	setq	#CX, @1

	getl	#MODR
	or	@1, 0x38
	setl	#MODR, @1
	complete

;  ,    CX, SI  DI  
fn5.P7.blkloop:
	exa	#CX
	jne	@1, fn5.P7.blkloop
	je	@2, fn5.P7.blkclean
	rdb	#SI
	wrb	@1, #DI
	complete

fn5.P7.blkclean:
	jmp	fn5.PF
	getl	#MODR
	and	@1, 0xffffffc7
	setl	#MODR, @1
	complete

fn5.1L:
fn5.PF:
	rdl	#BP, 4
	jmp	@1
	getl	#BP
	rdl	#BP, 0
	addl	@2, 4
	setl	#BP, @2
	setl	#SP, @2
	complete

, . , , complete. .

. , ( #), , . @N, N — , . , , , .

, , @- . , , , , , . MCp «» . (setX) (wrX).

, , , , . .

, N . n, , N*k+n (k = 0, 1, ...) , complete ( , ). .

, . , . , @- ( ) - , .

. , , — , - , CU . VLIW EPIC-, ( ), , , CU , ( ).

. RF. MCp , , RF . , RF .

Load/Store/MOB Isaiah. , — MOB, Memory Ordering Buffer. , , / ( ) . : . .

: ! RF . , , @- . , :

volatile a;
a += a;

( ):

	rdsl	a
	rdsl	a
	addsl	@1, @2
	wrsl	@1, a

, , . , MOB, +1 MCp.

. . , , . . , . . .

MCp . , (, , ) . : jmp ( ) j ( ), , .

, complete , . , jmp , .

MCp . , , :

	doSomething;
	if(condition)
	{
		doStuff;
	}

, doSomething. condition, ( , ). doSomething.

doSomething . , CU , , , . , . : , ! ; : , ! , .

MCp , , , doSomething (, ). MCp ( ), ( ) ( ). +1 .

, MCp CU, CU . CU , : , RF.

MCp. . , . MCp , , .

— . . , RF , ( ), . , . , ( ). .

-, , , . -, WriteBack-, . WB-, , , MCp RF (, , , MCp , , +1 ), . .

— (MMU). . , OS, : Linux, Plan9 :) , MMU — . , SPEC MMU 17.5% ; SUN Java 40%. , MMU ( MCp)? CUDA . , () , Java, .Net, JavaScript, Go, : ?

, - , , . , (TLB) MMU 32 , 33 . . - . .. , .

MCp

, , .

, :) , CU. , . , , , . , , N*k+n (k = 0, 1, ...) N=1 n=0, . , . Profit? PROFIT! big.LITTLE .

. , - , , N n. (, , SMS- , ! !).

(managed runtimes) . (, , , ?). MCp (, ) , . , - , .

, MCp , , , . , , , (dependable) . , pattern-matching . , (- ):

fib :: (Integral t) => t -> t
fib 0 = 1
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

CPU fib , , 1 .. , . MCp , .

, . , . , ( , ..). 16- , , , : ? ( ). , . , , .

MCp. , ( , ) , . . , , .

, !

, ( , ) - , .

Source: https://habr.com/ru/post/163057/

All Articles

The Ballad of "Multiklet"

About the benefits of MCp

Cells

The birth of nonlinearity

MCp

, !

More articles: