📜 ⬆️ ⬇️

We write our programming language, part 1: we write the language VM

Introduction


Good day to all habrachitelemy!

So, perhaps it should be said that the purpose of my work, on the basis of which a number of articles will be written, was to go all the way to create a fully functional PL with 0 itself and then share my knowledge, experience and experience with interested people.

I will describe the creation of a language that I described earlier here .
')
He was interested in many and caused a heated discussion in the comments. Therefore - the topic is interesting to many.

I think that it is worthwhile to immediately post information about the project:

Site (will be filled with documentation a little later).
Repository

To touch the project yourself and see everything in action, it is better to download the repository and launch everything from the bin folder. In the release, I'm not in a hurry to post the latest versions of the language and the runtime, because I sometimes just lazy to do it.

I know how to code in C / C ++ and Object Pascal. I wrote the project on FPC, because In my opinion this language is much simpler and better suited for writing this. The second determining factor was that FPC supports a huge number of target platforms and you can rebuild the project under the desired platform with a minimum of rework. If for some reason I don’t like Object Pascal, then do not rush to close the post and run to throw stones at the comments. This language is very beautiful and clear, and I will not give so much code. Just what you need.

So, perhaps I'll start my story.

Set goals


First of all, any project needs its goals and TK, which will have to be implemented in the future. You need to decide in advance what type of language will be created to write the primary VM for it.

The key points that determined the further development of my VM are as follows:


So, I chose the above priorities and I started to implement a language virtual machine. Strangely, of course, usually some parsers / translators are written first, and then VM. Well, I began to develop the project in this order and will describe it further in the order in which I developed it.

I’ll say right away that I called the VM as eloquently as possible - SVM (Stack-based Virtual Machine).

Let's start with the implementation of the class variable


Initially, I just used the variant type, because it's easier and faster. It was a crutch, but he propped up the project and allowed me to quickly implement the first version of VM and language. Later I sat down at the code and wrote the implementation of my “variant”. In essence, you need to write a class that stores a pointer to a value in memory, in my implementation it is null/cardinal/int64/double/string/array . It would be possible to use case typing, but I thought it would be better to implement it the way I implemented it.

Before starting to write class code, I decided to immediately put the {$ H +} directive in the module header for more flexible support of strings in a future language.
Ps for those who may not be aware of the difference between the H- and H + FPC mode.

When assembling code in H-mode, strings will be presented as an array of characters. With H + - in the form of a pointer to a piece of memory. In the first case, the lines will be initially fixed length and limited to 256 characters by default. In the second case, the lines will be dynamically expandable and you can shove much more characters into them. Will work a little slower, but more functional. With H +, you can also declare strings as an array of characters, like this:

 var s:string[256]; 
So, to begin with, we will declare the Enum type, which we will use as a certain flag, to determine the data type by the pointer:

 type TSVMType = (svmtNull, svmtWord, svmtInt, svmtReal, svmtStr, svmtArr); 

Next, we describe the basic structure of our variable type and some methods:

  TSVMMem = class m_val: pointer; m_type: TSVMType; constructor Create; destructor Destroy; procedure Clear; end; ... constructor TSVMMem.Create; begin m_val := nil; m_type := svmtNull; end; destructor TSVMMem.Destroy; begin Clear; end; procedure TSVMMem.Clear; inline; begin case m_type of svmtNull: { nop }; svmtWord: Dispose(PCardinal(m_val)); svmtInt: Dispose(PInt64(m_val)); svmtReal: Dispose(PDouble(m_val)); svmtStr: Dispose(PString(m_val)); svmtArr: begin SetLength(PMemArray(m_val)^, 0); Dispose(PMemArray(m_val)); end; else Error(reVarInvalidOp); end; end; 

The class is not inherited from anything, so the inherited calls in the constructor and destructor can be avoided. Pay attention to the directive inline. In the header of the file it is better to add {$ inline on}, so for sure. Its active use in VM rather significantly increased performance (mb somewhere as much as 15-20%!). She tells the compiler that the body of the method is best embedded in the place of its call. The output code will be slightly larger in the end, but it will work faster. In this case, using inline is advisable.

Ok, we wrote down the basis of our class at this stage. Now we need to describe a number of setters and getters (setter & getter) in our class.

The task is to write a couple of methods that allow you to stuff and later get the values ​​back from our class.

First, let's look at the assignment of values ​​for our class. First you can write a generalized setter, well, then, for individual data types:

 procedure TSVMMem.SetV(const value; t:TSVMType); inline; begin if (m_val <> nil) and (m_type = t) then begin case t of svmtWord: PCardinal(m_val)^ := Cardinal(value); svmtInt: PInt64(m_val)^ := Int64(value); svmtReal: PDouble(m_val)^ := Double(value); svmtStr: PString(m_val)^ := String(value); end; end else begin if m_val <> nil then FreeMem(m_val); m_type := t; case t of svmtWord: begin New(PCardinal(m_val)); PCardinal(m_val)^ := Cardinal(value); end; svmtInt: begin New(PInt64(m_val)); PInt64(m_val)^ := Int64(value); end; svmtReal: begin New(PDouble(m_val)); PDouble(m_val)^ := Double(value); end; svmtStr: begin New(PString(m_val)); PString(m_val)^ := String(value); end; else Error(reVarTypeCast); end; end; end; ... procedure TSVMMem.SetW(value:cardinal); inline; begin if (m_val <> nil) and (m_type = svmtWord) then PCardinal(m_val)^ := value else begin if m_val <> nil then FreeMem(m_val); m_type := svmtWord; New(PCardinal(m_val)); PCardinal(m_val)^ := value; end; end; 

Now you can write code for a couple of getters:

 function TSVMMem.GetW:cardinal; inline; begin Result := 0; case m_type of svmtWord: Result := PCardinal(m_val)^; svmtInt: Result := PInt64(m_val)^; svmtReal: Result := Trunc(PDouble(m_val)^); svmtStr: Result := StrToQWord(PString(m_val)^); else Error(reVarTypeCast); end; end; 

Ok, great, now, after you've spent some time staring at the IDE and typing the code for setters and getters with enthusiasm, we are faced with the task of implementing support for our type of mathematical and logical operations. As an example, I will give the implementation of the addition operation:

 procedure TSVMMem.OpAdd(m:TSVMMem); inline; begin case m_type of svmtWord: case m.m_type of svmtWord: SetW(GetW + m.GetW); svmtInt: SetI(GetW + m.GetI); svmtReal: SetD(GetW + m.GetD); svmtStr: SetD(GetW + StrToFloat(m.GetS)); else Error(reVarInvalidOp); end; svmtInt: case m.m_type of svmtWord: SetI(GetI + m.GetW); svmtInt: SetI(GetI + m.GetI); svmtReal: SetD(GetI + m.GetD); svmtStr: SetD(GetI + StrToFloat(m.GetS)); else Error(reVarInvalidOp); end; svmtReal: case m.m_type of svmtWord: SetD(GetD + m.GetW); svmtInt: SetD(GetD + m.GetI); svmtReal: SetD(GetD + m.GetD); svmtStr: SetD(GetD + StrToFloat(m.GetS)); else Error(reVarInvalidOp); end; svmtStr: case m.m_type of svmtWord: SetS(GetS + IntToStr(m.GetW)); svmtInt: SetS(GetS + IntToStr(m.GetI)); svmtReal: SetS(GetS + FloatToStr(m.GetD)); svmtStr: SetS(GetS + m.GetS); else Error(reVarInvalidOp); end; else Error(reVarInvalidOp); end; end; 

It's simple. Similarly, further operations can be described and our class is ready.
For arrays, of course, we need a couple of methods, an example of getting an element by index:

 function TSVMMem.ArrGet(index: cardinal; grabber:PGrabber): pointer; inline; begin Result := nil; case m_type of svmtArr: Result := PMemArray(m_val)^[index]; svmtStr: begin Result := TSVMMem.CreateFW(Ord(PString(m_val)^[index])); grabber^.AddTask(Result); end; else Error(reInvalidOp); end; end; 

Super. Now we can move on.

We implement the stack


After a while I came to such thoughts. The stack must be both static (for speed) and dynamic (for flexibility) at the same time.

Therefore, the stack is implemented block. Those. how it should work - initially the stack array has a certain size (I decided to set the block size to 256 elements so that it was beautiful and not small). Accordingly, complete with an array is a counter indicating the current top of the stack. Re-allocating memory is an extra long operation that can be performed less frequently. If more values ​​are added to the stack, then its size can always be expanded by the size of one more block.

I cite the implementation of the entire stack:

 type TStack = object public items: array of pointer; size, i_pos: cardinal; parent_vm: pointer; procedure init(vm: pointer); procedure push(p: pointer); function peek: pointer; procedure pop; function popv: pointer; procedure swp; procedure drop; end; PStack = ^TStack; procedure TStack.init(vm: pointer); begin SetLength(items, StackBlockSize); i_pos := 0; size := StackBlockSize; parent_vm := vm; end; procedure TStack.push(p: pointer); inline; begin items[i_pos] := p; inc(i_pos); if i_pos >= size then begin size := size + StackBlockSize; SetLength(items, size) end; end; function TStack.peek: pointer; inline; begin Result := items[i_pos - 1]; end; procedure TStack.pop; inline; begin dec(i_pos); if size - i_pos > StackBlockSize then begin size := size - StackBlockSize; SetLength(items, size); end; end; function TStack.popv: pointer; inline; begin dec(i_pos); Result := items[i_pos]; if size - i_pos > StackBlockSize then begin size := size - StackBlockSize; SetLength(items, size); end; end; procedure TStack.swp; inline; var p: pointer; begin p := items[i_pos - 2]; items[i_pos - 2] := items[i_pos - 1]; items[i_pos - 1] := p; end; procedure TStack.drop; inline; begin SetLength(items, StackBlockSize); size := StackBlockSize; i_pos := 0; end; 

The external methods of the VM will pass a pointer to the stack so that they can take the necessary arguments from there. A pointer to the VM thread was added later so that callback calls from external methods could be implemented and, in general, to transfer more power over VM methods.

So, as with how the stack is arranged you have read. The callback stack is arranged in the same way, for simplicity and convenience of call & return operations and the garbage collector stack. The only thing - other block sizes.

Talk about trash


It is usually a lot, a lot. And you need to do something with it.

First I want to talk about how garbage collectors are organized in other languages, for example, in Lua, Ruby, Java, Perl, PHP, etc. They work on the principle of counting pointers to objects in memory.

Those. we allocated memory for something, it is logical - the pointer was immediately placed in a variable / array / somewhere else. The runtime garbage collector immediately adds this pointer to itself with a list of possible garbage objects. In the background, the garbage collector constantly monitors all variables, arrays, etc. If there does not appear to be a pointer to something from the list of possible garbage, then this is garbage and the memory from under it must be removed.

I decided to sell my bike. I am more accustomed to working with memory on the principle of Taras Bulba. I gave birth to you - I will kill you, I mean, when I call the next Free from the next class. Therefore, the garbage collector from my VM is semi-automatic. Those. it needs to be called manually and work with it accordingly. Pointers to the declared temporary objects are added to its turn (this role falls mostly on the translator and a little on the developer). To release memory from under other objects, you can use a separate opcode.

Those. at the time of the call, the garbage collector has a ready list of pointers to run through and free up memory.

So, now let's deal with the compilation into an abstract executable file.


The idea was originally that applications written in my language can run without source, as is the case with many similar languages. Those. It can be used for commercial purposes.

To do this, determine the format of executable files. I got the following:

  1. The header, for example, "SVMEXE_CNS".
  2. Section containing the list of libraries from which methods will be imported.
  3. The import section of the required methods, the libraries from which the methods are imported are indicated by their number in the section above.
  4. Section constants.
  5. Code section

I don’t think it’s worthwhile to lay out the detailed steps for implementing parsers for these sections, because you can see everything in my repository yourself.

Code execution


After parsing the sections listed above and initializing the VM, we are left with one section with the code. In my VM, an unaligned bytecode is executed, i.e. instructions can be of any length.

A set of opcodes - instructions for a virtual machine with small comments I show in advance below:

 type TComand = ( {** for stack **} bcPH, // [top] = [var] bcPK, // [var] = [top] bcPP, // pop bcSDP, // stkdrop bcSWP, // [top] <-> [top-1] {** jump's **} bcJP, // jump [top] bcJZ, // [top] == 0 ? jp [top-1] bcJN, // [top] <> 0 ? jp [top-1] bcJC, // jp [top] & push callback point as ip+1 bcJR, // jp to last callback point & rem last callback point {** for untyped's **} bcEQ, // [top] == [top-1] ? [top] = 1 : [top] = 0 bcBG, // [top] > [top-1] ? [top] = 1 : [top] = 0 bcBE, // [top] >= [top-1] ? [top] = 1 : [top] = 0 bcNOT, // [top] = ![top] bcAND, // [top] = [top] and [top-1] bcOR, // [top] = [top] or [top-1] bcXOR, // [top] = [top] xor [top-1] bcSHR, // [top] = [top] shr [top-1] bcSHL, // [top] = [top] shl [top-1] bcNEG, // [top] = -[top] bcINC, // [top]++ bcDEC, // [top]-- bcADD, // [top] = [top] + [top-1] bcSUB, // [top] = [top] - [top-1] bcMUL, // [top] = [top] * [top-1] bcDIV, // [top] = [top] / [top-1] bcMOD, // [top] = [top] % [top-1] bcIDIV, // [top] = [top] \ [top-1] bcMV, // [top]^ = [top-1]^ bcMVBP, // [top]^^ = [top-1]^ bcGVBP, // [top]^ = [top-1]^^ bcMVP, // [top]^ = [top-1] {** memory operation's **} bcMS, // memory map size = [top] bcNW, // [top] = @new bcMC, // copy [top] bcMD, // double [top] bcRM, // rem @[top] bcNA, // [top] = @new array[ [top] ] of pointer bcTF, // [top] = typeof( [top] ) bcSF, // [top] = sizeof( [top] ) {** array's **} bcAL, // length( [top] as array ) bcSL, // setlength( [top] as array, {stack} ) bcPA, // push ([top] as array)[top-1] bcSA, // peek [top-2] -> ([top] as array)[top-1] {** memory grabber **} bcGPM, // add pointer to TMem to grabber task-list bcGC, // run grabber {** constant's **} bcPHC, // push copy of const bcPHCP, // push pointer to original const {** external call's **} bcPHEXMP, // push pointer to external method bcINV, // call external method bcINVBP, // call external method by pointer [top] {** for thread's **} bcPHN, // push null bcCTHR, // [top] = thread(method = [top], arg = [top+1]):id bcSTHR, // suspendthread(id = [top]) bcRTHR, // resumethread(id = [top]) bcTTHR, // terminatethread(id = [top]) {** for try..catch..finally block's **} bcTR, // try @block_catch = [top], @block_end = [top+1] bcTRS, // success exit from try/catch block bcTRR, // raise exception, message = [top] {** for string's **} bcSTRD, // strdel bcCHORD, bcORDCH, {** [!] directly memory operations **} bcALLC, //alloc memory bcRALLC, //realloc memory bcDISP, //dispose memory bcGTB, //get byte bcSTB, //set byte bcCBP, //mem copy bcRWBP, //read word bcWWBP, //write word bcRIBP, //read int bcWIBP, //write int bcRFBP, //read float bcWFBP, //write float bcRSBP, //read string bcWSBP, //write string bcTHREXT,//stop code execution bcDBP //debug method call ); 

So, you are fluent in what operations a VM written by me can perform. Now I want to say how this all works.

VM is implemented as an object, so you can easily implement support for multithreading.

It has a pointer to an array with opcodes, IP (Instruction Pointer) is the offset of the instruction being executed and pointers to other VM structures.

Code execution is a big switch-case.

Just give a description of VM:

 type TSVM = object public ip, end_ip: TInstructionPointer; mainclasspath: string; mem: PMemory; stack: TStack; cbstack: TCallBackStack; bytes: PByteArr; grabber: TGrabber; consts: PConstSection; extern_methods: PImportSection; try_blocks: TTRBlocks; procedure Run; procedure RunThread; procedure LoadByteCodeFromFile(fn: string); procedure LoadByteCodeFromArray(b: TByteArr); end; 

Little about exception handling


To do this, the VM has a stack of exception handlers and a large try / catch block in which the execution of the code is wrapped. From the stack, you can put a structure that has an entry point offset at the catch and finally / end of the exception handling block. I also provided a trs opcode that is put in front of catch and flips the code to finally / end, if it succeeds, simultaneously deleting a block with information about exception handlers from the top of the corresponding stack. Simply? Simply. Conveniently? Conveniently.

Talk about external methods and libraries.


I have already mentioned them before. Imports, libraries ... Without them, the language will not have the desired flexibility and functionality.

First of all, in the implementation of the VM, we will declare the type of the external method and its call protocol.

 type TExternalFunction = procedure(PStack: pointer); cdecl; PExternalFunction = ^TExternalFunction; 

When importing the VM, the import section parser fills in an array of pointers to external methods. Consequently, each method has a static address, which is calculated at the stage of building the application under the VM and by which the desired method can be called.

The call then goes on like this during the execution of the code:

 TExternalFunction(self.extern_methods^.GetFunc(TSVMMem(self.stack.popv).GetW))(@self.stack); 

Let's write a simple library for our VM


And let it implement the Sleep method for starters:

 library bf; {$mode objfpc}{$H+} uses SysUtils, svm_api in '..\svm_api.pas'; procedure DSleep(Stack:PStack); cdecl; begin sleep(TSVMMem(Stack^.popv).GetW); end; exports DSleep name 'SLEEP'; end. 

Results


At this point I will probably finish my first article from the conceived cycle.

Today I described in some detail the creation of the language runtime. I believe that this article will be very useful for people who decide to try to write their own programming language or deal with how such programming languages ​​work.

The full VM code is available in the repository, in the / runtime / svm branch.

If you liked this article, do not be lazy to throw a plus into karma and pick it up in the top, I tried and I will try for you.

If something is not clear to you, then welcome to the comments or to the forum .

Perhaps your questions and answers to them will be interesting not only to you.

Source: https://habr.com/ru/post/435202/


All Articles