Many have heard about multicellular architecture, processors and even the first devices on them. Especially advanced users tried their algorithms. The first simple performance tests were conducted, as well as the user
Barsmonster , etched the P1 processor chip. The first R1 processor is already undergoing tests and will soon be available to everyone. But the answer to the question of how multicellular architecture works and what is its difference, not everyone knows. Let us now try to bring up to date.
1. MulticellularityA multicellular nucleus is a group of identical processor units (2 or more) united by a fully connected unidirectional switching medium. The interaction features of the processor units between themselves follow from the presentation of the algorithm (see below). The processor unit in a multicellular architecture is called a cell. The set of commands that it can perform is determined by the specific implementation and does not depend on the architecture.
Algorithms "eyes" multicellular nucleus
First of all.Any formula can be presented in the form of a parallel-layered form (JFM). Will consider
A simple example: g = e * (a + b) + (ac) * f, in YPF it might look like this:
Figure 1. Example of JFM
')
In general, to perform operations, nodes located on the i-th tier can use the results obtained on tiers from 1 to (i-1) -th. From this it follows that teams that are on one tier are independent.
Any algorithm is a set of formulas, which is divided into subsets - linear sections connected by control transfer operators. The linear section (LU) is understood as such a subset of formulas that is calculated if and only when control is transferred to the given linear section. Inside the linear region, informationally unrelated formulas can be executed in any order.
As you know, the result of the execution by the processor of any command is expressed in the change of the processor state. The use of this new state by subsequent teams forms an information link between the result source team and the receiving teams of this result. Such a relationship can be both indirect (indirect) and direct (direct).
In the case of indirect communication, the result is available only after its alienation: recordings in public registers or memory or other devices. The source command must place the result in these devices, and the receiving command will have to take data from there. The device name is specified as an operand. Such a connection between the teams is the basis of absolutely all modern processors, except multicellular ones.
In the case of a direct informational link, the commands are named and the names must identify the command itself, not its location or other implementation features. Access to the results can be done by the names of both source commands and receiving commands. In the first case, a broadcast is used, followed by a list-by-name selection, in which the name of the source command is set in the operand field of the receiving command. In the second, a list-by-name distribution is performed, in which the name of the command-receiver of the result is specified in the source command.
Theoretically, we can name nodes of the JFM in any way. The only requirement is unambiguity. To do this, in a multicellular processor, when fetching from memory, a tag (tag) is obtained - thus they and their results are given a local name - and then the interaction between the teams is organized through tags. Data can be obtained from any team with a tag less than its own. The tag number of the desired result is determined by the difference in the values ​​of the command tag and the tag of the desired result. For example, if the operand contains @ 5, and the command tag is 7, then the result of the command with tag 2 is used as the operand. All command execution results are sent to the switching environment, from which the required ones are selected by the tag. Thus, in a multicellular processor, a broadcast of the results is used with their subsequent selection by name.
It is worth noting that, due to physical limitations, there is the concept of “visibility window”, which determines the maximum distance of the source command from the receiver's command. The tag in the process of selecting commands changes cyclically. Its maximum value is determined by the size of the command buffer. In fact, the size of the tag determines the number of teams that can simultaneously be at different stages of execution. Tag value cannot be used if the command to which it was previously assigned has not yet been executed.Secondly.A multicellular processor operates with structures that we call paragraphs.
A paragraph is an informationally closed sequence of commands. A paragraph is an analogue of a command, after the execution of which, the state of the processor and / or systems within it (registers, buses, memory cells, input / output channels, etc.) changes. Those. for a multicellular nucleus, a paragraph is a command.
The main feature of the multicellular architecture is that it directly implements the YPF representation of the algorithm. From this understanding of the algorithm, many properties of the multicellular architecture follow.
2. Properties of multicellular architecture- independence from the count of cells;
- dynamic distribution of computing resources;
- all commands ready for execution are executed simultaneously;
- energy consumption reduction. Works when there is work;
- scalability, no limit on the number of cells.
Let us consider in more detail how the paragraph will look for the algorithm described in Section 2.
For a multicellular nucleus, the following scheme for constructing a paragraph is characteristic (it is not mandatory, in the given example there are also digressions):
- data acquisition for work (commands with tag 0-2,4,6 in the example);
- data processing (commands with a tag 3,5,7-9 in the example);
- saving the results (command with tag 10 in the example).
It should be recalled that the change in the state of the machine in the multicellular processor occurs at the end of the paragraph.
The algorithm does not depend on the number of cells.Informational links between teams are clearly indicated and there is no need to know about the number of processor units when a program is written. The teams just wait for the readiness of their operands and after that they leave for execution and it doesn’t matter who and when they prepare the operands. Imagine a watering can - you do not think about how many holes in its nozzle when watering. The cells are identical and there is no difference in which cell this or that team will be executed.
Commands are distributed into cells in the order they are followed, 1 team is in 0 cell, 2 team is in 1 cell, etc., when the team is transferred to the last available cell, we begin to redistribute it from 0 cell. Consider the work of the 4-cell nucleus. The paragraph above will be executed as follows:
Figure 2. The distribution of teams in cells
The number of simultaneously executed commands depends on the number of cells. Choosing the optimal number of cells for each task is a task that requires analysis. It’s still impossible to create a universal system, so you can rely on the following data: 4 cells do a good job on common tasks, and 16 cells on signal processing tasks, and for video processing, their number can be on the order of hundreds.
You can analyze your code and imagine what would be the optimal number of cells for them. To do this, you need to submit your algorithm to YPF and count the number of teams in tiers. The average number of teams in them will hint at the optimal number of cells in your particular case.All commands ready for execution are executed simultaneously.As mentioned above, the multicellular nucleus implements the YPF representation of the algorithm, independent commands are located on the tiers. Everything that can be done at the moment - will be done without any special instructions from the programmer, the main condition - the team must receive all the data for execution.
It is necessary to take into account technological limitations: the number of cells is limited, therefore, there will be as many commands simultaneously as there are cells in the core.
"Similarity collapse" or dynamic distribution of computing resourcesSince the code does not depend on the number of cells with which it will be executed, the ability of the multicellular nucleus to distribute its computational resources during operation appears, and the control is programmed.
The ability of the multicellular architecture to redistribute its resources we call
reconfiguration . For example, cells of the multicellular nucleus can be arbitrarily distributed to execute any algorithm or part of it. A group (better names have not yet been invented) is a part of the cells of the multicellular nucleus, which are interconnected to carry out an algorithm or part of an algorithm. A group may contain 1 or more cells. When the multicellular nucleus divides a group into parts, this is called
decomposition . The process of grouping is
composition . Figure 3 below shows how the 4-cell nucleus is reconfigured over time (example). Cells that perform the same task have the same color fill.
Figure 3. Cell Reconfiguration Example
Reduced power consumption. Works when there is work.The principal basis of the multicellular architecture is broadcasting. To date, this is still the only architecture that uses direct informational communication to transfer data between teams in the program.
If you pay attention to Figure 1, which shows an example of how to execute a code with a 4-cell processor, you can see that the cells execute commands located on tiers when they are ready to be executed. The result is placed in the switching environment and is waiting for its consumer.
The multicellular nucleus does not commit unnecessary actions, it works when there is work. If any of the components of the processor is not ready to execute the command, then work with it is suspended until the release.
Scalability, no restrictions on the number of cells.The architecture does not limit the number of cells. The case is behind the technology.
There are several options for organizing a large number of cells, but this is a topic for a separate article, which we will prepare later.
3. ImplementationMulticellular processor "in the flesh"A multicellular processor consists of N identical cells with numbers from 0 to n-1, connected by a switching medium. Each cell contains a program memory block (PM), a control unit (CU), a buffer device (BUF), and also each cell has a switching device (SU), which together form a switching medium (SB). The processor also contains data memory (DM), a block of general-purpose registers (GPR) and an executive device (EU) consisting of an arithmetic logic unit for floating-point numbers (ALU_FLOAT), ALU for integer numbers (ALU_INTEGER) and a memory access block data (DMS).
The program for multicellular assembler consists of paragraphs. For example:
Paragraph: ; jmp paragraph_next ; complete
In fact, the transition team to the next paragraph can stand anywhere in the current paragraph, but so far we do not focus on this. In this example, a paragraph is all commands from the “Paragragh” label to the end of the command section - “complete”.
In a paragraph, commands can be arranged in an arbitrary order, commands can refer to the result of previous commands using the “@” operator, but note that you cannot refer to the result of a command that is 64 commands higher than the current one. Those. The window of visibility of the result for each command is 64, while the size of the paragraph is not limited, i.e. it may consist of at least several thousand teams. Consider an example:
Paragraph: getl 2 ; getl 3 ; addl @1, @2 ; 2 + 3 addl @1, @3 ; 5 +2 jmp paragraph2 complete
In this example, the addl @ 1, @ 2 command performs the addition of the two previous commands, and the addl @ 1, @ 3 command performs the addition of the result of the previous command and the result of the command that is 3 lines higher.
The state of the processor changes when moving from one paragraph to another. Write to memory, registers occur at the end of the paragraph in case the read write control is enabled or direct read and write commands are not used. But it is possible to disable read and write controls, i.e. writing to memory will take place in the paragraph itself, without waiting for its completion (in cases where it is possible to disable read and write control, working with memory can bring a noticeable increase in performance). In fact, it is easy to get the most out of the multicellular processor, but this is a topic for a separate article.
For the execution of a context-sensitive program, the commands of each paragraph sequentially, starting with the same address, which is the address of this paragraph, are placed in the PM cells. The location address of the first executable paragraph is equal to the address from which the cells select the commands. As a result, in the PM of the i – th cell, starting from the address of a paragraph, its commands with the numbers i, N + i, 2 * N + i, ..., are sequentially placed.
Each cell provides, starting from the address given to it, a sequential selection of commands and placing them on the command register for subsequent decoding. Commands cells are selected synchronously.
When decoding, each next selected group of N commands is assigned the next tag value (t), and the number from the group and, accordingly, its result is assigned a number. Sampling and decoding of commands continues until a command is selected that is marked with the control attribute “end of paragraph” (Complete). The address of a new paragraph can be received either at any moment of the selection of commands of the current paragraph, or after the completion of the selection of commands. It enters all cells simultaneously. If the address of the next paragraph is not calculated by the time the last paragraph command is selected, the sample is suspended until the address is received. If the address is received, the sample continues from this address. The selected command enters the buffer device, which consists of a buffer of the first operand, a buffer of the second operand, and a buffer of commands (operation codes, tags, etc.). The buffer devices form commands and transfer them to the corresponding execution unit.
The operating part (command buffer), in addition to the command code, includes all the necessary service information for sending and receiving results, namely, the command number and signs of readiness of the first (second) operand to execute the command.
The storage buffer operands is associative addressing. Associative
the address is the tag of the requested result. As operands when performing operations can be used:
- the requested results of source commands from the switch;
- values ​​calculated during decoding of the control word, as well as those directly present in the control word or taken from general registers.
In the first case, the readiness flag of this operand is set to “not ready” when writing a command, and in the second, to the “ready” state. After receiving the requested result from the switching device, the sign, in the first case, is also set to the “ready” state. The command that received all the operands passes priority selection among other ready-made commands in the buffer device, after which it is issued for execution under the condition that the executive device is empty.
In the P1 processor, each cell was allocated its own section of memory, i.e. program memory and data memory do not overlap. In processor R1, the program memory and data memory are in the same address space, i.e. cells have channels of access to program memory and data memory. Figure 4 shows the general structure of the processor P1. Figure 5 shows the structure of the cell and its description for the processor R1.
Fig 4. General structure of the multicellular processor
Arithmetic logic units (ALUs), along with the DMS unit, are part of the set of actuators available in each cell (CELL).
The cell consists of the following devices:
1. Device selection commands IDU (Instruction Distribution Unit).
2. Control device.
3. Switching device SU (Switch Unit).
4. Buffer device.
5. Integer ALU.
6. ALU with a floating point.
7. Block of access to the data memory DMS (Data Memory Service).
8. Multiplexer results.
9. A set of registers GPR (General-Purpose Registers).
10. Interrupt Controller IC (Interrupt Controller).
11. JTAG-GPR debugging unit.
The detailed structure of the cell is shown in Figure 5:
Figure 5. Cell structure.
The result of the execution of commands the actuators transmit to the switching device.
If the results of executing the instructions of the integer ALU, floating-point ALU and DMS are ready at the same time, the switching device receives the result of executing the floating-point ALU command. According to the priority of reading the result, the actuators are distributed as follows:
1. ALU with a floating point;
2.DMS;
3. Integer ALU.
Executive devices, the result of which the current clock can not be issued in the switching device, write the result of the command in the output register and wait for their turn to issue the result.
The situation described above has the effect that the number of ticks needed to execute a command is non-deterministic.
In each cell, there are two ALUs: an integer ALU and an ALU floating point processing.
Integer ALU is designed to execute instructions on integer operands, and operands can be presented in the following formats (depending on the instruction):
- 64-bit;
- 32-bit;
- 16-bit;
- 8-bit;
- packed 32-bit;
- packed 16-bit.
The floating-point ALU is designed to execute instructions on operands represented in the format of floating-point single (32 bits) or double precision (64 bits) in accordance with the requirements of IEEE-754.
Structurally, the floating-point ALU consists of three parts:
- Single precision divider (division, square root)
- Double precision divider
- Calculator (execution of other commands)
What is the difference between multicellular processors and multicore processors?In conventional processors consisting of one or several cores, the elementary unit of execution is the command that is executed in a certain order. In a multicellular processor, an elementary unit of execution is a paragraph, i.e. a linear section consisting of an unlimited number of commands, after which a transition to another linear section with a given label occurs. Commands from this section will be executed in parallel where possible. Parallelization occurs hardware, i.e. the programmer does not need to take care of which cell the command will fall into. This is what we call “natural parallelism”. As a result, the same code can be executed on any number of cells. Another difference in the multicellular processor is the transmission of the results of the work of the teams via broadcasting, whereas in multi-core and single-core systems the results are transmitted through memory and registers. Of course, in multicellular processors there is a memory and registers for transferring data between linear sections, but inside the linear section basic operations can be carried out without the use of registers and memory. This fact gives the multicellular architecture simplicity of implementation and reduction of memory accesses, as a result, reduction of energy consumption. In addition, the execution of the same program on one, two, ..., two hundred and fifty-six cells provides opportunities for reconfiguring (combining cells into task groups) of cells, creating a fault-tolerant processor, and scalability.
We will return to the issues of parallelism and reconfiguration later in this article.
4. ExamplesA multicellular processor consists of 4 cells (maybe 256 or more), the cells are completely equal and are connected by a switching environment (switch). The result of the execution of commands cells are stored in the switch.
The program in assembler is divided into sections and paragraphs that contain commands.
For the exchange of information between cells serves as a switch, and for the exchange between paragraphs, there are RONs, index registers, data memory. There are peripheral registers for working with peripherals.
To weed out statements like:
- Even if this is a “page” of memory, but who “steers” all of memory? Where is the senior (top) control unit that controls the cells? Who loads cell memory?
- Parallel programmer, not processor
Consider a simple program:
Figure 2. The distribution of teams in cells
As shown in Figure 2, the commands from the paragraph will be divided into cells. Commands in each cells are executed in parallel (“natural parallelism”). Execution of commands occurs upon readiness of arguments. In this case, the program is executed by a group of 4 cells. The program will be successfully executed both on 4, and on any other number of cells. The basic principle of the multicellular architecture is that the cells are independent from each other and from anyone and the same. This principle applies to processors with reconfiguration.
Reconfiguration is the ability of the processor cells to composition (collection) and decomposition (analysis) into groups, i.e. the ability of cells to unite in groups from one cell to N (for the N cell processor) and execute their own part of the code. By default, when starting any program, all cells are in the same group. It should be noted that each group has its own set of RONS, index, control registers, you can assign your own interrupt handler.
In the
article about the von Neumann architecture, Leonid Chernyak identifies the following three types of reconfigurable processors:
1) specialized processors
2) configurable processors
3) dynamically reconfigurable processors (as an example, the only class of this type that exists at the time of publication of L. Chernyak's article - FPGA) is given.
The first two types acquire their own specifics in the manufacturing process, and the third can be programmed.
The multicellular processor R1, according to this classification, is dynamically reconfigurable, but it is a processor on a chip and, due to the independence of the machine code, the redistribution of resources (cells), unlike FPGA, occurs without stopping or rebooting the processor and without losing information. Thus, MultiClet R1 is a new class of the third type (along with the first one - FPGA).
Nobody has ever made such processors in the world today.
The classification proposed by L. Chernyak, adjusted for future development, can be developed by the fourth type, which we formulate as
4)
self-adapting processors ,
able to independently ensure the operation of all systems, automatically redistributing resources. In case of damage, failures or when additional tasks appear, the processor or system from the processors must be able to adapt to the new conditions.
The first step to the creation of such systems may be the processor Multiclet L1 or SVK based on it.
Consider an example of decomposition (division into groups) of cells:
In this example, we see a division into 3 groups: cells with numbers 0 and 1 perform calculations, cell 2 collects information from sensors, and cell 3 compiles reports on work and interacts with the external environment. If it happened that cells 0 and 1 do not have time to process the data, then cell 3 can come to their aid and two groups will turn out. It is important to note that a processor reset is not required to reconfigure cells. Cell 3 will speak the same language with cells 0 and 1, i.e. will acquire their set of RONS, index registers and control registers.
A simple example of an ordinary assembler program (you can write in standard C):
.text habr: getl 4 ; getl 5 ; addl @1, @2 ; 4 + 5 = 9 mull @3, @2 ; 4 * 5 = 20 subl @1, @2 ; 20 – 9 = 11 slrl @1, 2 ; 11 << 2 = 44 subl @1, @2 ; 44 – 11 = 33 mull @4, @5 ; 20 * 9 = 180 jmp habrahabr ; complete habrahabr: getl 5 ; addl @1, [10] ; 10 setl #32, @1 ; 32 complete
Paragraphs are in the section marked up as “.text”. A paragraph can contain an unlimited number of commands (as long as the program memory allows), but each command can refer to a result only for a command that is not more than 63 lines higher.
For convenience, the assembler can set a label for each command, for example:
habr: arg1 := getl 5 ; sum1 := addl @arg1, [10] ; 10 setl #32, @sum1 ; 32 complete
In a multicellular processor, there is no hardware to detect information links between the selected operations (commands) and their distribution among functional devices, i.e. dynamic paralleling is missing. There is no static parallelization, since The program, although it describes informational links, has a linear form and does not contain any indication of what can be done in parallel and how. This is a fundamental difference from other processors. Due to this feature, the survivability of the multicellular processor is potentially ensured, i.e. the possibility of continuous execution of the program without recompiling or reloading if its individual cells fail (processor degradation).
Due to the property of the multicellular processor, we can dynamically allocate our resources, we can create systems capable of performing the task even if part of the cells fail.
Degradation is associated with loss of productivity and, consequently, an increase in the time to solve problems. But, for a number of embedded applications, the survivability of a multicellular processor allows a controlled object to perform basic functions, either by reducing their quality or by abandoning the solution of secondary tasks.
Asynchronous and decentralized organization of a multicellular processor, both at the system level - between cells (when implementing parallelism) and at the intracellular level - between cell blocks (when implementing commands), additionally provides:
- minimization of the nomenclature of design objects and their reduction
difficulties; - reduction of the processor chip area (the amount of equipment at
decentralized control is less than with centralized); - increased productivity and reduced energy consumption due to
implement a more efficient computational process; - use of an individual synchronization system for each cell
when implemented on a single crystal tens and hundreds of cells.
As a result, a well-structured and modular system is obtained, which makes it possible to drastically reduce the complexity of the processor and, consequently, reduce costs and improve the quality of design. In this case, compared with the von Neumann model, the quantitative characteristics of the processor are also improved.
At the moment, tests of simple algorithms, such as POCNT and encryption algorithms, have been conducted. For the popcnt test, an assembler program was written and we compared the result with a simple program for the Pentium Dual Core 5700 C.
The test results can be summarized in the following table (the number of clock cycles per 32-bit calculation cycle):

Those. we can conclude that the Multiklet R1 processor on the “table” test is 15% faster than the Intel processor, and on the BitHacks test, the Multiklet R1 processor is faster than Intel by more than 2 times. The popcnt test for a multicellular processor was simply converted for parallel computing. Of course, in the new Intel processors there is a popcnt hardware implementation, but in this test we showed the capabilities of our processors.
5.What we have nowA Multiclet P1 processor is currently released and is available for ordering, as well as two debug boards.
In addition, the first multicellular processor developed a device for protecting information Multiclet Key_P1, the first production batch is scheduled for September 2014.
In addition, based on the P1 processor, a foil stamping printer from Virshke LLC was developed and supplied as standard.
Users actively master multicellular processors and complement libraries, as well as create new useful examples.The first batch of R1 processors has been received, testing of new processors has begun.The first revision of the processor is called R1-1.Again, the new processor is a topic for a separate article, the characteristics of the first revision processor will be published a little later.For this processor, LDM-Systems has released a debug board, consisting of a base one and a processor one. In the photo version with a block. Custom version will be slightly different. It will be possible to choose the configuration of the baseboard for your tasks, in the maximum configuration the base + processor board will have an acceptable cost.The software bundle includes:• Assembler• C compiler• Functional model• Libraries• Debugger for IDE Geany• FreeRTOS OS• Program examples• PlotterPS If you liked this article despite the large volume, then I can write a continuation of processor programming, an overview of the R1 processor and principles of building high-performance systems on multicellular processors.UPD: The program for checking the operation time of the popcnt algorithm is taken from www.strchr.com/crc32_popcnt. As already noted in the article, for Intel processors, there are other faster algorithms for implementing the popcnt test, which require certain commands implemented at the hardware level. To evaluate the capabilities of the multicellular architecture, Table and Bit hacks algorithms were taken.UPD2: Figure 2 was updated. The add @ 3, @ 2 command should be stretched to two lines, and the mpy @ 4, @ 3 command was reduced in duration, since the arguments for it were ready earlier, which reduced the total duration of the paragraph. Also the mpy command is replaced with mul.UPD3: Each cell contains a buffer of 64 commands. The selection of paragraph commands by cells continues in parallel with execution, the selection of the next paragraph can be started, only if the address of the transition to it is known.