We continue our review of the new domestic computer. After a brief acquaintance with the architectural features of "Elbrus", let us look at the software development tools it offers.


We remind the reader of the structure of the article:
- hardware overview:
  - the acquisition process;
  - the hardware itself;
- software review:
  - launching the operating system;
  - bundled software;
- development tools overview;
- performance benchmarking:
  - description of competing computers;
  - benchmark results;
- summary.
Enjoy reading!
Architecture features
The essence of the E2K architecture can be summed up in one phrase: 64-bit registers, explicit instruction-level parallelism, and strictly controlled memory access.
By contrast, x86 or SPARC processors capable of executing more than one instruction per clock (superscalar), and sometimes also out of order, have implicit parallelism: the processor analyzes the dependencies between instructions on the fly, within a small window of code, and, where it deems it possible, loads several execution units simultaneously. Sometimes it acts too optimistically, speculatively, discarding the result or rolling back the transaction on a failed prediction. Sometimes, on the contrary, it is too pessimistic, assuming dependencies between registers or parts of registers that, from the point of view of the executing program, do not actually exist.
With explicit parallelism (EPIC), the same analysis takes place at compile time, and all machine instructions designated for parallel execution are packed into one very long instruction word (VLIW); in "Elbrus" the length of this "word" is not fixed and ranges from 1 to 8 double words (in this context, a single word is 32 bits wide).
Undoubtedly, the compiler has far greater resources in terms of the amount of code it can survey, the time it can spend and the memory available to it, and when writing machine code by hand the programmer can perform even smarter optimizations. But that is the theory; in practice you are unlikely to use assembler, so everything depends on how good the optimizing compiler is, and writing one is not a simple task, to put it mildly. Moreover, whereas under implicit parallelism "slow" instructions can keep working without blocking the flow of subsequent instructions to other execution units, under explicit parallelism the whole wide instruction word waits until it completes in its entirety. Finally, an optimizing compiler is of little help when interpreting dynamic languages.
All this is well understood at MCST, of course, and therefore "Elbrus" also implements speculative execution, prefetching of code and data, and combined computational operations. So, instead of theorizing ad infinitum about how many hypothetical gigaflops the platform could deliver under ideal circumstances, in the fourth part of the article we will simply measure the actual performance of real programs, both application and synthetic.
VLIW: breakthrough or dead end?
There is an opinion that the VLIW concept is poorly suited for general-purpose processors: Transmeta Crusoe, they say, did not take off once released. It is strange for the author to hear such claims, since ten years ago he tested a laptop based on Efficeon (the next generation of the same line) and found it very promising. If you did not know that x86 code was being translated into native commands under the hood, you could never have guessed it. True, it could not keep up with the Pentium M, but it delivered performance at the level of a Pentium 4 while consuming far less power. And it was certainly head and shoulders above the VIA C3, which is a genuine x86.
No less interesting, owing to its exoticism, is the technology of protected execution of programs in C/C++, languages where pointers provide ample opportunities to shoot yourself in the foot. The concept of contextual protection, implemented jointly by the compiler at build time, the processor at run time, and the operating system in its memory management, does not allow the scope of variables to be violated, be it a reference to a private class member, the private data of another module, or the local variables of a calling function. Any manipulation of access levels is allowed only in the direction of decreasing rights. Storing references to short-lived objects in long-lived structures is blocked. Attempts to use dangling pointers are prevented as well: if the object a pointer once referred to has already been deleted, even the placement of another, new object at the same address is no excuse for accessing its contents. Nor will a program manage to execute data as code or transfer control who knows where.
Indeed, as soon as we descend from high-level idioms to low-level pointers, all these scopes turn out to be nothing more than syntactic sugar. Static source-code analyzers can sometimes help catch some of the simplest cases of erroneous pointer use. But once the program has been translated into x86 or SPARC machine instructions, nothing prevents it from reading or writing a value at the wrong address, or of the wrong size, leading to a crash in a completely different place; and there you are, staring at a corrupted stack with no idea where to start debugging, because on another machine the same code runs just fine. Stack overflows and the vulnerabilities they enable are simply a scourge of the popular platforms. It is gratifying that our developers are approaching these problems systematically rather than piling up ever more crutches whose effect still resembles stepping on a rake. After all, nobody cares how fast your program runs if it does not run correctly. Besides, stricter control by the compiler forces us to rewrite "smelly" and non-portable code, and thus indirectly improves programming culture.
The byte order when storing numbers in "Elbrus" memory, unlike SPARC, is little-endian (the low byte comes first), that is, the same as on x86. Likewise, since the platform aims to support x86 code, there are no restrictions on data alignment in memory.
Order, alignment and portability
For programmers spoiled by Intel's cozy world, it can be a revelation that outside that world, accessing memory without alignment (for example, writing a 32-bit value at address 0x04000005) is not just an undesirable operation that runs slower than usual, but a forbidden action that raises a hardware exception. Porting a nominally cross-platform project that initially required minimal edits may therefore hit a dead end after the first launch, when it turns out that all the serialization and deserialization of data (integers, floating-point numbers, UTF-16 text) scattered across megabytes of code is done directly, without a dedicated platform-abstraction layer, and is written differently in each place. Definitely, if every programmer had the opportunity to test his imperishable masterpieces on alternative platforms, SPARC for example, the quality of code worldwide would probably improve.
More information about MCST processors of the SPARC and E2K architectures can be found in the book "Microprocessors and computer systems of the Elbrus family", which was published by the "Piter" publishing house in a minimal print run and has long been passed from hand to hand, but is available free of charge as a PDF (6 MB) and, for a small fee, on Google Play. Given the absence of other detailed information in the public domain, this edition is simply a treasure trove of knowledge. But the text focuses mainly on hardware: the algorithms of buffers and pipelines, caches and arithmetic logic units; the topic of writing [efficient] programs is not touched at all, and even mere mentions of machine instructions can be counted on one hand.
Machine language
In addition to compilers for the high-level languages C, C++ and Fortran, the documentation never misses a chance to mention the possibility of writing programs directly in assembler, but nowhere does it specify how exactly one can join this filigree art, or where to find at least a reference for the machine commands. Fortunately, the system ships with the GDB debugger, which can disassemble previously compiled programs. To stay within the scope of the article, we will write the simplest arithmetic function, one that lends itself well to parallelization.
uint64_t CalcParallel( uint64_t a, uint64_t b, uint64_t c,
                       uint32_t d, uint32_t e,
                       uint16_t f, uint16_t g, uint8_t h )
{
    return (a * b) + (c * d) - (e * f) + (g / h);
}
This is what it translates to when compiled in ‑O3 mode:
0x0000000000010490 <+0>:
    muld,1 %dr0, %dr1, %dg20
    sxt,2 6, %r3, %dg19
    getfs,3 %r6, _f32,_lts2 0x2400, %g17
    getfs,4 %r5, _lit32_ref, _lts2 0x00002400, %g18
    getfs,5 %r7, _f32,_lts3 0x200, %g16
    return %ctpr3
    setwd wsz = 0x5, nfx = 0x1
    setbp psz = 0x0
0x00000000000104c8 <+56>:
    nop 5
    muld,0 %dr2, %dg19, %dg18
    muls,3 %r4, %g18, %g17
    sdivs,5 %g17, %g16, %g16
0x00000000000104e0 <+80>:
    sxt,0 6, %g17, %dg17
    addd,1 %dg20, %dg18, %dg18
0x00000000000104f0 <+96>:
    nop 5
    subd,0 %dg18, %dg17, %dg17
0x00000000000104f8 <+104>:
    sxt,0 2, %g16, %dg16
0x0000000000010500 <+112>:
    ct %ctpr3
    ipd 3
    addd,0 %dg17, %dg16, %dr0
The first thing that catches the eye is that each command word decodes into several instructions executed in parallel. The mnemonics are generally intuitive, although some names look unusual after Intel: for example, the unsigned-extension instruction is called sxt rather than movzx. Besides the operands themselves, many computational commands take the number of the execution unit as a parameter; not for nothing does ELBRUS stand for Explicit Basic Resources Utilization Scheduling.
To access the full 64-bit value of a register, the "d" prefix is added; in theory, it is also possible to refer to the lower 16 and 8 bits of a value. The designations of the global general-purpose registers, of which there are 32, carry a "g" prefix before the number, while the local procedure registers carry an "r" prefix. The window of local registers requested by the setwd instruction can reach 224 registers, and spilling to the stack is performed automatically as needed.
The way some instructions are used is puzzling: for example, return, as you might guess, serves to return control to the calling procedure, yet in all the code samples we examined this instruction occurs long before the last command word (where some kind of context manipulation also takes place), sometimes even in the very first word, as here. Although the aforementioned book devotes a whole paragraph to this question, it has not become any clearer to us yet.
Update as of February 9, 2016: it has been suggested in the comments that the return statement only prepares the ground for returning from the subroutine and lets the processor start loading the next commands of the calling procedure, while control is actually transferred when execution reaches the ct instruction.
However, "easy-to-read code" and "efficient code" are not the same thing when it comes to machine commands. If you compile without optimization, the code is more sequential and resembles a "head-on" calculation, but at the cost of length: instead of 6 densely packed command words, 8 sparse ones are generated.
Let us end this session of reading the coffee grounds here, before we fantasize our way to outright ridiculous conclusions. Hopefully, one day the command-set reference and the programming and optimization manual will be made public.
Development tools
The standard C/C++ compiler in the Elbrus operating system is LCC, an in-house development of the MCST company compatible with GCC. Detailed information about the structure and operating principles of this compiler is not published, but according to an interview with a former developer of one of its several variants, the frontend from the Edison Design Group is used for high-level source parsing, while the low-level translation into machine commands can be performed either without optimization or with it. It is the optimizing compiler that is shipped to end users, and not only on the E2K platform, for which there simply are no alternative machine-code generators, but also on the SPARC family, where the usual GCC is also available as part of the operating system.
Considering the architectural features listed earlier (explicit parallelism, protected execution of programs), the LCC compiler obviously implements many unique solutions worthy of the most rigorous study and practical testing. Unfortunately, at the time of writing the author has neither the qualifications nor the time for such studies; I hope that sooner or later a much wider circle of the IT community, including more competent representatives, will take up this issue.
From what could nevertheless be noticed with the naked eye while building programs for performance testing, LCC on E2K is more eager to issue warnings about possible errors, illiterate constructs, or simply suspicious places in the code. True, the author does not know GCC well enough to distinguish the messages unique to LCC from those merely translated into Russian (the translation, moreover, is selective), and cannot be sure that the denser stream of warnings is not a consequence of the automatically generated build configuration. Also, without knowing the semantics of a particular piece of code, it is sometimes hard to tell whether the compiler has skillfully found a hidden bug or raised a false alarm. For example, in the PostgreSQL code the same construct occurs four times in the same file with slight variations:
for (i = 0, ptr = cont->cells; *ptr; i++, ptr++)
{
    //....//
    /* is string only whitespace? */
    if ((*ptr)[strspn(*ptr, " \t")] == '\0')
        fputs(" ", fout);
    else
        html_escaped_print(*ptr, fout);
    //....//
}
The compiler predicts a possible out-of-bounds access to a one-dimensional array in the line with the strspn call. Under what circumstances this could happen, the author does not understand (and other platforms issued no such warning, although the ‑Warray-bounds check is standard for GCC). Still, what draws attention is the repeated copy-pasting of the same non-trivial construct (without even explaining its purpose in a comment) instead of extracting it into a separate function with an eloquent name that needs no explanation. Even if the alarm was false, detecting smelly code is a useful side effect; at this rate the authors of the PVS-Studio static analyzer will be left without work. But seriously, it would be interesting and useful to compare what additional errors in code LCC can really detect thanks to the unique features of the E2K architecture; at the same time the world of free software could receive another batch of bug reports.
Another curious result of acquaintance with the talkative LCC was the education of the author, and then of his more experienced colleagues, on the topic of trigraphs in C/C++, and why they are, fortunately, not supported by default. So you live without suspecting that a seemingly innocuous combination of punctuation characters in a string literal or comment can be a time bomb, or excellent material for covert backdoors, depending on which side of the barricades you are on.
An unpleasant consequence of LCC's separatism is that the format of its messages differs from GCC's, so when compiling from a development environment (for example, Qt Creator), these messages end up only in the general build log, not in the list of recognized issues. Perhaps this is somehow configurable, either on the compiler side or in the development environment, but at least out of the box the two do not understand each other.
Traditionally for domestic platforms, given their relatively low performance, the question of cross-compilation arises: building programs for the target architecture and a specific set of system libraries using the resources of more powerful computers with a different architecture and different software. Judging by the identification strings in the Elbrus kernel and in the LCC compiler itself, they are built on Linux i386, but this x86 toolchain, of course, is not included in the distribution of the system itself. One wonders whether the reverse is possible: building programs for other platforms on "Elbrus"? (The author got no further than the first phase of a GCC build for i386.)
Versions of the most significant packages for the developer:
- compilers: lcc 1.19.18 (gcc 4.4.0 compatible);
- interpreters: erlang 15.b.1, gawk 4.0.2, lua 5.1.4, openjdk 1.6.0_27 (jvm 20.0-b12), perl 5.16.3, php 5.4.11, python 2.7.3, slang 2.2.4, tcl 8.6.1;
- build tools: autoconf 2.69, automake 1.13.1, cmake 2.8.10.2, distcc 3.1, m4 1.4.16, make 3.81, makedepend 1.0.4, pkgtools 13.1, pmake 1.45;
- binary utilities: binutils 2.23.1, elfutils 0.153, patchelf 0.6;
- frameworks: boost 1.53.0, qt 4.8.4, qt 5.2.1;
- libraries: expat 2.1.0, ffi 3.0.10, gettext 0.18.2, glib 2.36.3, glibc 2.16.0, gmp 4.3.1, gtk+ 2.24.17, mesa 10.0.4, ncurses 5.9, opencv 2.4.8, pcap 1.3.0, popt 1.7, protobuf 2.4.1, sdl 1.2.13, sqlite 3.6.13, tk 8.6.0, usb 1.0.9, wxgtk 2.8.12, xml-parser 2.41, zlib 1.2.7;
- testing and debugging tools: cppunit 1.12.1, dprof 1.3, gdb 7.2, perf 3.5.7;
- development environments: anjuta 2.32.1.1, glade 2.12.0, glade 3.5.1, qt-creator 2.7.1;
- version control systems: bzr 2.2.4, cvs 1.11.22, git 1.8.0, patch 2.7, subversion 1.7.7.
Again, if you were expecting GCC 5, PHP 7 and Java 9, then, as one famous footballer put it, those are your problems. One should be thankful it is at least not GCC 3.4.6 (LCC 1.16.12), as in previous versions of the Elbrus system, or GCC 3.3.6 as in MSVS 3.0; incidentally, the main compiler in MSVS 3.0 to this day is GCC 2.95.4 (and why be surprised, when the kernel there is from the 2.4 branch?). Compared with the earlier situation, when you could stumble upon a GCC bug fixed upstream ten years ago, the conditions are now almost heavenly: you can even indulge in C++11 if you do not need to maintain backward compatibility.
The appearance of OpenJDK, at least in some form, can already be called a major breakthrough, because the dislike of Java and Mono in such systems is long known; and this dislike is understandable when even native programs barely crawl. Since there are many Java programmers among the author's colleagues who, owing to the above circumstances, have been forced to restrain their souls' beautiful impulses, it was decided to devote a separate series of performance tests to Java. Looking ahead, we note that the results were discouraging even in relative terms: you might as well write interpreted scripts in PHP or Python.
Compiled-language support is not limited to C and C++ with their declared compatibility with the GNU Compiler Collection: the system also ships a Fortran compiler. Since the author's acquaintance with the language is limited to Professor Fortran, anyone interested can be referred to the December topic on "Made by Us", where the comments cover the use of this language as a benchmark.
For dessert, we have saved the tastiest part: the last part of the article is devoted to studying the speed of "Elbrus" in comparison with a variety of hardware and software platforms, including domestic ones.