Writing a script interpreter and a stack machine

This article is about a rather unusual project. At some point I got the urge to write my own interpreter for a scripting language, together with the machine that executes it, just to see how such things work inside. That did not sound like a very noble goal, so I shelved it indefinitely in the hope of finding a more useful motivation.
Then a friend of mine complained that he needed to write an automation script for WSH but knew neither VBScript nor JavaScript. The "noble" motivation appeared by itself: someone needed help. The result is a compiler and an execution machine that run scripts for Windows Script Host without resorting to VBScript or JS. Below the cut: a brief background of the project, its internal structure, and the programming language itself.

Brief background

Gentlemen, I will let you in on a terrible secret. I know one is supposed to be ashamed of such a sin, but I confess anyway: I am a 1C developer. Yes, yes, yes, I write solutions large and small on the yellow framework, and my conscience does not torment me))).

Besides 1C I know other languages, but I am young and can afford that. The friend I mentioned in the introduction is the opposite case: a retired serviceman and a good electronics engineer who retrained as a programmer when he returned to civilian life. For a competent engineer such retraining is not hard, the pay in the 1C world is decent, and good specialists do very well. With age, however, it gets harder and harder to chase new technologies and keep up with the rapidly changing world of IT. During that conversation he said something like: “It takes so long to learn a new language. If only a script like this could be written in 1C, but run from the console...” That is when I remembered my old idea of writing an interpreter, which now turned from pure research into something quite practical: running scripts that work with the WSH infrastructure, but written in the 1C language.

A little bit about the programming language

People say that the 1C language is Visual Basic run through the PROMT machine translator, and that is really so. In spirit the language is very close to VB: it is weakly typed and has no OOP, lambda expressions, or closures. Its syntax is wordy rather than C-like. In addition, since all keywords have English equivalents, 1C code written with English keywords is almost impossible to tell from VB at a glance. The one giveaway is the comment marker: two slashes instead of an apostrophe, as in BASIC.
The language distinguishes between procedures and functions; it has the classic “For” and “While” loops as well as the iterating “For Each ... In” loop. Object properties and methods are accessed with a dot, and array elements with square brackets. Each statement is terminated by a semicolon.

What a stack machine is and how it works

As far as I know, stack machines are the most common kind of virtual machine. For example, the .NET CLR and the JVM are both stack-based. There is even a separate article on Habr about this. Nevertheless, so as not to send the reader chasing links, I think it is worth describing how they work here. A stack machine performs all operations on data organized as a stack. Each operation pops the required number of operands from the stack, acts on them, and pushes the result back onto the stack. This approach allows for a lightweight bytecode with a minimal set of commands. It also works quite quickly.
Bytecode is the set of commands that the machine will execute.

Each instruction is a single-byte opcode from 0 to 255, followed by parameters such as registers or memory addresses (Wikipedia).

In our machine a command has a fixed length and consists of two numbers: an operation code and an operation argument. Each command performs a set of actions on the data in the stack.
For example, pseudocode for the addition “1 + 2” might look like this:

 Push 1
 Push 2
 Add

The first two commands push values onto the stack; the third performs the addition and pushes the result onto the stack. The assignment “A = 1 + 2” might then look like this:

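 Push 1     ; push "1" onto the stack
 Push 2     ; push "2" onto the stack
 Add        ; pop "1" and "2", add them, and push the result
 LoadVar A  ; pop the value from the stack and store it in variable "A"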

Note that the Push and LoadVar commands take an argument, while the Add command does not need one.
The advantage of stack-based computation is that operations are executed in the order in which they appear, with no need to think about operator precedence: what is written is what gets done. Before an operation is performed, its operands are pushed onto the stack. This way of writing expressions is called reverse Polish notation.

The task of writing a stack machine comes down to the “invention” of the necessary set of commands that this machine will understand.
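The three commands used so far are already enough for a toy machine. Below is a minimal sketch in Python (the article's own implementation is in C#); the function name run and the use of a dictionary for variables are assumptions made only for this illustration.

```python
# A toy stack machine: every command is an (opcode, argument) pair.
# "run" and the dictionary of variables are illustrative names only.
def run(program):
    stack = []
    variables = {}
    for opcode, arg in program:
        if opcode == "Push":
            stack.append(arg)              # put the argument on the stack
        elif opcode == "Add":
            right = stack.pop()            # operands come off in reverse order
            left = stack.pop()
            stack.append(left + right)     # the result goes back on the stack
        elif opcode == "LoadVar":
            variables[arg] = stack.pop()   # store the top of the stack
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return variables


# A = 1 + 2
program = [("Push", 1), ("Push", 2), ("Add", None), ("LoadVar", "A")]
result = run(program)
```

Here result maps the variable name A to the computed sum.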

Compiler structure

The task of the compiler is to convert code in a given language into bytecode for the machine that will execute it. A classic compiler has three components:

Lexical analyzer (lexer)
Syntactic analyzer (parser)
Code generator

The lexical analyzer breaks the chaotic stream of incoming characters into lexemes (tokens). It picks words, numbers, string constants, and operator signs out of the source text and turns them into atomic entities that are convenient to work with further.

The syntactic analyzer checks that the set of tokens handed to it by the lexical analyzer is meaningful: that we have a program, not a meaningless jumble of letters.

The analyzer builds an abstract syntax tree (AST), which is then fed to the code generator.
The code generator produces bytecode by walking the nodes of the syntax tree.

I started writing the compiler according to this three-stage scheme, but then it seemed overly academic and, for this particular task, unnecessary. So I dropped the AST and combined parsing with code generation.
The compiler requests the next lexeme from the lexer in a loop and, looking at it, generates code. Along the way it checks that the token is legal in the current context. An AST is superfluous in this scheme.
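To show what “parsing combined with code generation” can look like, here is a hedged sketch in Python (the article's compiler is in C#). It is a small recursive-descent compiler for assignments that emits commands as soon as it recognizes a construct, without building a tree. The class Compiler, the tokenize helper, and the Mul, Sub, Div and PushVar command names are assumptions made for this illustration; note that the regular expression silently skips characters it does not recognize.

```python
import re


def tokenize(text):
    # Numbers, identifiers and single-character operators; whitespace is skipped.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()=]", text)


class Compiler:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0
        self.code = []          # emitted (opcode, argument) pairs

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def compile_assignment(self):
        name = self.next()
        if not name.isidentifier() or self.next() != "=":
            raise SyntaxError("expected 'name = expression'")
        self.expression()
        self.code.append(("LoadVar", name))
        return self.code

    def expression(self):       # term (('+' | '-') term)*
        self.term()
        while self.peek() in ("+", "-"):
            op = self.next()
            self.term()
            self.code.append(("Add" if op == "+" else "Sub", None))

    def term(self):             # factor (('*' | '/') factor)*
        self.factor()
        while self.peek() in ("*", "/"):
            op = self.next()
            self.factor()
            self.code.append(("Mul" if op == "*" else "Div", None))

    def factor(self):
        token = self.next()
        if token.isdigit():
            self.code.append(("Push", int(token)))
        elif token.isidentifier():
            self.code.append(("PushVar", token))
        elif token == "(":
            self.expression()
            if self.next() != ")":
                raise SyntaxError("expected ')'")
        else:
            raise SyntaxError(f"unexpected token: {token}")


code = Compiler("A = 1 + 2 * 3").compile_assignment()
```

Operator precedence is handled entirely by the order of the recursive calls, so the emitted commands already come out in reverse Polish order and no tree is needed.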

State machine

The lexer and the compiler are most conveniently implemented as finite state machines. When the next character is read, a decision is made about which mode to switch to in order to interpret it. For example, if a letter appears at the input, we switch to word-reading mode; if a digit appears, we read in number mode, and so on. Each state defines a set of valid input characters. Once the token has been extracted, the machine returns to the previous state. Code generation works the same way: if IF appears at the input, we switch to condition-generation mode, and when it ends we return to the previous state.
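The mode switching described above can be sketched as a tiny lexer in Python. This is only an illustration under assumptions: the state names, the token kinds, and the function lex are invented here and do not come from the project's source.

```python
def lex(text):
    tokens = []
    state = "start"
    current = ""
    for ch in text + "\n":            # the trailing newline flushes the last token
        if state == "word":
            if ch.isalnum() or ch == "_":
                current += ch
                continue
            tokens.append(("word", current))
            state = "start"           # token finished, back to the initial state
        elif state == "number":
            if ch.isdigit():
                current += ch
                continue
            tokens.append(("number", current))
            state = "start"
        elif state == "string":
            if ch != '"':
                current += ch
                continue
            tokens.append(("string", current))
            state = "start"
            continue                  # the closing quote is consumed

        # Initial state: the character decides which mode to enter.
        if ch.isalpha() or ch == "_":
            state, current = "word", ch
        elif ch.isdigit():
            state, current = "number", ch
        elif ch == '"':
            state, current = "string", ""
        elif ch.isspace():
            pass
        else:
            tokens.append(("operator", ch))

    if state == "string":
        raise SyntaxError("unterminated string literal")
    return tokens


tokens = lex('A = 10 + "hi";')
```

Each branch accepts only the characters that are valid in its state; anything else ends the token and is re-examined from the initial state.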

Virtual machine structure

Visibility contexts (scopes)

Every variable and method belongs to some context. There is a global context, a module context, and a local context inside a method. Together they form a stack of available (visible) names that can be accessed during execution. This stack is used actively at both compile time and run time.
Moreover, if you take the idea further and assume that instances of contexts can be created, modularity arises naturally. If we have a module as a set of methods and variables, then we have a context in which all the code of that module runs. If we now create two instances of the same module, we have in effect created two instances of the same class. Each of them sees its own context and has its own state.
The 1C documentation does not say so explicitly, but the observed behavior suggests that module execution in the 1C runtime works exactly this way. The runtime works not with object instances as such but with contexts that are “attached” to the virtual machine as needed.

Memory

The machine's notional “memory” is the list of contexts attached to it. Each context has its own number and its own state (the current values of its variables). Commands that read or write variables take as their argument a cell number in the context table. Each entry in the table gives the context number and the number of the variable within that context on which the action is performed.

Index	Context number	Variable number
0	0	0
1	0	1
2	1	3

For example, when the command PushVar 1 arrives, the entry at index 1 of the table is selected; it says to take variable number 1 from context 0.
Variables are a plain array that each attached context reports to the machine. The machine reads and writes data in this array, thereby changing the state of the attached contexts. It makes no difference what kind of context is attached, whether an instance of some class or the global context. The machine can change variables; whose variables they are does not matter.
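The lookup through the table can be sketched in a few lines of Python. The class names Context and Machine and the method names are assumptions for illustration; the table contents are the three rows shown above.

```python
class Context:
    """An attached context: just an array of variable slots."""
    def __init__(self, variable_count):
        self.variables = [None] * variable_count


class Machine:
    def __init__(self, contexts, variable_refs):
        self.contexts = contexts            # list index is the context number
        self.variable_refs = variable_refs  # (context number, variable number)
        self.stack = []

    def push_var(self, arg):
        context_index, variable_index = self.variable_refs[arg]
        self.stack.append(self.contexts[context_index].variables[variable_index])

    def load_var(self, arg):
        context_index, variable_index = self.variable_refs[arg]
        self.contexts[context_index].variables[variable_index] = self.stack.pop()


global_context = Context(2)
module_context = Context(4)
machine = Machine([global_context, module_context], [(0, 0), (0, 1), (1, 3)])

global_context.variables[1] = "hello"
machine.push_var(1)   # table entry 1: variable 1 of context 0
machine.load_var(2)   # table entry 2: variable 3 of context 1
```

The machine never asks what a context represents; it only indexes into the arrays it was given.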

Call stack

Code execution is organized around so-called “frames”, each of which holds a set of local variables and a pointer to the current instruction. When a method is called, the current frame is pushed onto the call stack. This saves the state of the machine at the moment of the call.

Then a new frame is created with an empty array of local variables and an instruction number equal to the start of the called method. This new frame becomes current, and the command execution loop continues from there.
When the Return command is reached, the current frame is destroyed (along with the local variable values that have gone out of scope), the saved frame with the previous state is popped from the call stack, and the command execution loop continues from the point of the call.

A fragment of pseudocode with a method call:

 0: Push 1
 1: Push 2
 2: Add
 3: Return
 4: Nop
 5: Nop
 6: Call 0
 7: LoadVar

Suppose the current instruction is number 6. It is a call to address 0. The local variables and the number of the current instruction are saved on the call stack. Control is then transferred to address 0, and after Return it comes back to address 6, from which execution moves on to the next instruction.
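The frame mechanics can be traced with a short Python sketch. It is an assumption-laden illustration: the Frame class, the run function, and the trace list are invented here, and the final LoadVar of the fragment is replaced with a Nop so that the sketch needs no variable storage.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    instruction_pointer: int
    local_variables: list = field(default_factory=list)


def run(code, entry):
    operand_stack = []
    call_stack = []
    frame = Frame(entry)
    trace = []                                 # addresses in execution order
    while 0 <= frame.instruction_pointer < len(code):
        opcode, arg = code[frame.instruction_pointer]
        trace.append(frame.instruction_pointer)
        if opcode == "Call":
            call_stack.append(frame)           # save the caller's state
            frame = Frame(arg)                 # fresh locals, jump to the method
        elif opcode == "Return":
            if not call_stack:
                break                          # returning from the entry point
            frame = call_stack.pop()           # restore the caller
            frame.instruction_pointer += 1     # continue after the Call
        else:
            if opcode == "Push":
                operand_stack.append(arg)
            elif opcode == "Add":
                right = operand_stack.pop()
                left = operand_stack.pop()
                operand_stack.append(left + right)
            frame.instruction_pointer += 1     # Nop and the rest just advance
    return operand_stack, trace


code = [("Push", 1), ("Push", 2), ("Add", None), ("Return", None),
        ("Nop", None), ("Nop", None), ("Call", 0), ("Nop", None)]
stack, trace = run(code, entry=6)
```

The saved frame still points at the Call instruction, which is why the return path increments the pointer before resuming.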

Command implementation

All command implementations are methods of the MachineInstance class, and references to these methods are stored in an array indexed by operation code. When the next command is fetched, its code is used to look up the implementation in the array, and that implementation is invoked. Exception handling is performed along the way.

Main command execution loop

 private void ExecuteCode()
 {
     while (true)
     {
         try
         {
             MainCommandLoop();
             break;
         }
         catch (RuntimeException exc)
         {
             // No handler registered by the script: let the exception escape.
             if (_exceptionsStack.Count == 0)
                 throw;

             // Otherwise resume execution at the nearest handler.
             var handler = _exceptionsStack.Pop();
             SetFrame(handler.handlerFrame);
             _currentFrame.InstructionPointer = handler.handlerAddress;
             _lastException = exc;
         }
     }
 }

 private void MainCommandLoop()
 {
     try
     {
         while (_currentFrame.InstructionPointer >= 0
             && _currentFrame.InstructionPointer < _module.Code.Length)
         {
             var command = _module.Code[_currentFrame.InstructionPointer];
             // Dispatch through the array of command implementations.
             _commands[(int)command.Code](command.Argument);
         }
     }
     catch (RuntimeException)
     {
         throw;
     }
     catch (Exception exc)
     {
         // Wrap foreign exceptions so that the script can handle them.
         throw new ExternalSystemException(exc);
     }
 }

Structure of the executable module

This machine uses commands with a single numeric argument. Each command is defined by an operation code and an argument whose interpretation depends on the command itself. The whole instruction set and the structure of the executable module are built on this principle.
To begin with, let's look at what a source module in 1C consists of.
It has three distinct sections:

The variable section declares all module-level variables, i.e. those visible everywhere inside the module.
The module methods are the actual procedures and functions with code.
The module body is the code that runs when the module is loaded. Execution starts with it.

Since the module body is essentially a method without a name, we can say that there are only two kinds of constructs: variable declarations and executable code.

Constants

Obviously, all operations are performed on values of some kind. Values, in turn, are represented by variables and constants. Constants include number, string, and date literals, as well as the literal keywords (such as True, False, and Undefined).
For any computer to add 2 and 2, it has to be told what “2” is and where to get it. For this purpose the compiled module includes a description of every constant used in the code. If the code says A = "Hello", then the word "Hello" must be in the list of constants.

Variables

Variable names matter only at compile time, for resolving scopes. At run time they are not needed. To allocate memory for the array of variables it is enough to know how many there are. When the machine is created, an array of variables is allocated and accessed by number. The compiled module stores only the number of variables.
With variables, however, there is a slight complication. Besides the module's own variables there are also global variables declared somewhere outside the module. As mentioned above, there is a stack of visible names, and when the compiler meets a name it searches for it in the stack of declared names. In theory you can attach several independent libraries whose properties and functions will be visible to the client script.

If the code uses a variable, the machine must know which context it belongs to. Since a command has only one numeric argument, that single argument has to identify the context in which the variable is declared. For this purpose a variable table has been added to the module. Each entry in the table contains a context number and a variable number within that context. The command argument is interpreted as the number of an entry in the variable table (see the "Memory" section).
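The compile-time side of this can be sketched in Python: the compiler looks a name up in the stack of scopes, innermost first, and turns it into an index into the variable table. The class ModuleCompiler, the method reference, and the sample variable names are assumptions for this illustration, not the project's API.

```python
class ModuleCompiler:
    def __init__(self, scopes):
        # scopes[i] lists the names declared in context number i,
        # outermost context first.
        self.scopes = scopes
        self.variable_refs = []      # (context number, variable number) pairs

    def reference(self, name):
        """Return the command argument that stands for this variable."""
        for context_index in range(len(self.scopes) - 1, -1, -1):
            names = self.scopes[context_index]
            if name in names:
                entry = (context_index, names.index(name))
                if entry not in self.variable_refs:
                    self.variable_refs.append(entry)
                return self.variable_refs.index(entry)
        raise NameError(f"unknown variable: {name}")


compiler = ModuleCompiler([["Args", "Env"],          # context 0: global
                           ["Counter", "Total"]])    # context 1: module
first = compiler.reference("Total")    # new entry (1, 1)
second = compiler.reference("Env")     # new entry (0, 1)
again = compiler.reference("Total")    # reuses the existing entry
```

After compilation only the table and the indexes survive; the names themselves are no longer needed.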

Variable operations on the stack

 private void PushVar(int arg)
 {
     var vm = _module.VariableRefs[arg];
     var scope = _scopes[vm.ContextIndex];
     _operationStack.Push(scope.Variables[vm.CodeIndex]);
     NextInstruction();
 }

 private void LoadVar(int arg)
 {
     var vm = _module.VariableRefs[arg];
     var scope = _scopes[vm.ContextIndex];
     scope.Variables[vm.CodeIndex].Value = BreakVariableLink(_operationStack.Pop());
     NextInstruction();
 }

Methods

Methods are divided into procedures and functions. Only functions can return values. By default, parameters are passed to methods by reference. To pass a parameter by value, it must be marked with the “Value” keyword (the analogue of ByVal in BASIC).
The compiled module contains information about each method's parameters, how they are passed, whether there is a return value, and so on. Each method has its own number. In addition, the number of local variables is recorded for each method.
Methods, like variables, can be external to the module (i.e., declared somewhere globally). Method calls are organized the same way as variable access: through a mapping table that identifies the context in which the called method lives.

The final structure of the module

Putting all of the above together, the compiled module is a structure that includes:

Number of module level variables
List of constants
List of method signatures
Module byte code
Variable map
Method map
The number of the method that is the entry point to the module (module body)

Work with values

The 1C language is not strictly typed. A variable gets its type at the moment a value is assigned to it. Any value is essentially a universal VARIANT type. When an operation is performed, the universal value is cast to the required type. For example, the operands of arithmetic operations are cast to a number, and those of Boolean operations to a Boolean value.
The following basic value types exist:

Undefined
String
Number
Date
Boolean
Object
Type

The last one is a primitive type for working with types (similar to System.Type in .NET).
The universal value is represented by the IValue interface:

 interface IValue : IComparable<IValue>, IEquatable<IValue>
 {
     DataType DataType { get; }
     TypeDescriptor SystemType { get; }

     double AsNumber();
     DateTime AsDate();
     bool AsBoolean();
     string AsString();
     TypeDescriptor AsType();
     IRuntimeContextInstance AsObject();
 }

The interface lets you find out the actual type of a value and cast it to the basic types. Such a cast is needed, for example, to perform arithmetic. The class that implements a specific value type is itself responsible for trying to cast its value to each of the basic types.
When an expression is evaluated, the type of the result is determined by the type of the first operand. Thus the expression "12345" + 10 produces a string: during the addition the second operand is cast to a string and concatenated.
In contrast, 10 + "12345" attempts to cast the string "12345" to a number. If the cast is impossible, the exception “casting error to the type Number” is raised.
In these examples AsString() is called on the value of type “Number”, and AsNumber() is called on the value of type “String”.

Addition operation

 private void Add(int arg)
 {
     var op2 = _operationStack.Pop();
     var op1 = _operationStack.Pop();

     // The type of the first operand determines the kind of addition.
     var type1 = op1.DataType;
     if (type1 == DataType.String)
     {
         var result = op1.AsString() + op2.AsString();
         _operationStack.Push(ValueFactory.Create(result));
     }
     else if (type1 == DataType.Date && op2.DataType == DataType.Number)
     {
         var date = op1.AsDate();
         var result = date.AddSeconds(op2.AsNumber());
         _operationStack.Push(ValueFactory.Create(result));
     }
     else
     {
         // Everything else is treated as a number.
         var result = op1.AsNumber() + op2.AsNumber();
         _operationStack.Push(ValueFactory.Create(result));
     }

     NextInstruction();
 }
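The “first operand decides” rule is easy to try out in isolation. The following Python sketch mimics only the string and number branches of the addition (the date branch is left out); the helper names as_number, as_string, and add are assumptions made for the illustration.

```python
def as_number(value):
    # Cast to a number, or fail the way a "cast to Number" error would.
    try:
        return float(value)
    except ValueError:
        raise TypeError(f"cannot cast {value!r} to Number") from None


def as_string(value):
    # Whole numbers are rendered without a trailing ".0".
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


def add(left, right):
    # The type of the FIRST operand selects the operation.
    if isinstance(left, str):
        return left + as_string(right)
    return as_number(left) + as_number(right)


text_result = add("12345", 10)     # concatenation
number_result = add(10, "12345")   # numeric addition
```

Swapping the operands changes the result type, which is exactly the behavior described for "12345" + 10 versus 10 + "12345".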

Access to object properties and methods

Each object can have properties and methods that are accessed with a dot. Access is by name. The mechanics of addressing by name are expressed by a special interface that lets you find out whether the object has the required members and then access them.

 interface IRuntimeContextInstance
 {
     bool IsIndexed { get; }
     IValue GetIndexedValue(IValue index);
     void SetIndexedValue(IValue index, IValue val);

     int FindProperty(string name);
     bool IsPropReadable(int propNum);
     bool IsPropWritable(int propNum);
     IValue GetPropValue(int propNum);
     void SetPropValue(int propNum, IValue newVal);

     int FindMethod(string name);
     MethodInfo GetMethodInfo(int methodNumber);
     void CallAsProcedure(int methodNumber, IValue[] arguments);
     void CallAsFunction(int methodNumber, IValue[] arguments, out IValue retValue);
 }

When a property or method is accessed, the object is first asked for the number of that property or method. Then, using this number, the machine queries whether the property is readable and writable, how many parameters the method has, whether it returns a value, and so on. Once this information is known, the call is made.
For properties this means setting or reading a value. For methods it means a call as a procedure or a call as a function. In the latter case the returned value is pushed onto the virtual machine's stack.
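The “name, then number, then access” sequence for properties can be sketched in Python as follows. The class RecordContext and the two helper functions are hypothetical; only the order of the calls follows the interface shown above.

```python
class RecordContext:
    """A hypothetical object that exposes its properties by number."""
    def __init__(self, values, read_only=()):
        self._names = list(values)
        self._values = [values[name] for name in self._names]
        self._read_only = set(read_only)

    def find_property(self, name):
        if name not in self._names:
            raise AttributeError(f"unknown property: {name}")
        return self._names.index(name)

    def is_prop_readable(self, number):
        return True

    def is_prop_writable(self, number):
        return self._names[number] not in self._read_only

    def get_prop_value(self, number):
        return self._values[number]

    def set_prop_value(self, number, value):
        self._values[number] = value


def read_property(obj, name):
    number = obj.find_property(name)        # step 1: name -> number
    if not obj.is_prop_readable(number):    # step 2: check access
        raise AttributeError(f"property {name} is not readable")
    return obj.get_prop_value(number)       # step 3: the actual access


def write_property(obj, name, value):
    number = obj.find_property(name)
    if not obj.is_prop_writable(number):
        raise AttributeError(f"property {name} is read-only")
    obj.set_prop_value(number, value)


point = RecordContext({"X": 1, "Length": 5}, read_only=["Length"])
write_property(point, "X", 10)
x_value = read_property(point, "X")
```

Because the machine talks only to this small protocol, any object that can answer these questions can be scripted.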

Working with objects as contexts and vice versa

Above I mentioned modularity and creating objects as instances of contexts. Let's take a closer look at what was meant and how access to object members with a dot is organized.

Imagine that we have a library of functions. We can call functions from this library and enjoy the result. Now imagine that the library is an instance of a class with a set of public methods and properties. This instance is quietly “embedded” in our scope so that we can invoke its methods directly, as if they were global functions.

If an instance of the class "MathLibrary" is attached to the stack of scopes, then the functions Sin, Cos, and Sqrt can be called directly. They are visible just like ordinary methods declared somewhere at the library level.

 A = Sin(X); // Sin is not declared in the script; the call is resolved in the attached library instance.

And now let's do the reverse. If we have written a script, it is itself a context: it provides the scope in which its code runs. But what if we create an instance of this script and store it in a variable? It turns out that our script is a class with its own properties and methods, and it can be worked with as an object. This principle underlies the ability to attach (import) external script files and work with them as objects: create instances, call methods, and so on.
Moreover, when script A calls script B, script B is attached to the machine's memory as a context, and the code of B runs as if B were the only script. On return from module B it is detached, and execution continues in script A.

Byte code examples

Below are examples of how the execution of certain operations in bytecode is organized.

Addition and Assignment

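 A = 1;
 B = 2;
 C = A + B;

 .constants
 0 :type: Number, val: 1
 1 :type: Number, val: 2
 .code
 0 :(PushConst 0) ; push constant 0 onto the stack
 1 :(LoadLoc 0) ; store the top of the stack in local variable 0
 2 :(PushConst 1) ; push constant 1 onto the stack
 3 :(LoadLoc 1) ; store the top of the stack in local variable 1
 4 :(PushLoc 0) ; push local variable 0 onto the stack
 5 :(PushLoc 1) ; push local variable 1 onto the stack
 6 :(Add 0) ; add the two values and push the result
 7 :(LoadLoc 2) ; store the result in local variable 2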

Precondition loop (While)

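 A = 1;
 While A < 5 Do
     A = A + 1;
 EndDo;

 .constants
 0 :type: Number, val: 1
 1 :type: Number, val: 5
 .code
 0 :(PushConst 0)
 1 :(LoadLoc 0) ; initialize the variable
 2 :(PushLoc 0) ; start of the loop (condition)
 3 :(PushConst 1)
 4 :(Less 0) ; compare "less than" with constant №1 (value 5)
 5 :(JmpFalse 11) ; if the condition is false, leave the loop (address 11)
 6 :(PushLoc 0) ; loop body: A = A + 1
 7 :(PushConst 0)
 8 :(Add 0)
 9 :(LoadLoc 0)
 10 :(Jmp 2) ; jump back to the loop condition
 11 :(Nop 0) ; end of the loop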

Condition

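 If 1 > 2 Then
     A = 1;
 Else
     A = 0;
 EndIf;

 .constants
 0 :type: Number, val: 1
 1 :type: Number, val: 2
 2 :type: Number, val: 0
 .code
 0 :(PushConst 0)
 1 :(PushConst 1)
 2 :(Greater 0) ; compare "greater than"
 3 :(JmpFalse 7) ; if the condition is false, jump to the "Else" branch
 4 :(PushConst 0) ; the "Then" branch
 5 :(LoadLoc 0)
 6 :(Jmp 9) ; branch done, jump to the end of the condition
 7 :(PushConst 2) ; the "Else" branch
 8 :(LoadLoc 0)
 9 :(Nop 0) ; end of the condition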

Method call

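 Message("Hello");

 .constants
 0 :type: String, val: Hello
 .code
 0 :(PushConst 0) ; push the argument onto the stack
 1 :(ArgNum 1) ; push the number of arguments
 2 :(CallProc 1) ; call the procedure described by entry 1 of the method map
 .procmap
 0 :(1,2)
 1 :(0,4) ; entry 1 refers to context 0, method number 4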

So what about WSH?

The WSH scripting infrastructure is exposed as a set of COM objects. Scripting languages access these objects by referring to their members by name through IDispatch. Doesn't that remind you of anything?
It is enough to write a small wrapper around IDispatch that lets the machine work with these COM objects through the IRuntimeContextInstance interface described above.

The figure shows the test application window that I used to debug scripts. , Scripting.FileSystemObject

, WSH . , WSH , , . , , , .

dll .
TestApp — GUI-based , , .
oscript .
oscript , .
, exe-. .
bitbucket . , wiki .
, , — setup.exe

1, , 1 - .
. , , « », , , . - . , .
, . , , . Good luck.

PS 1 « ?» :)

Source: https://habr.com/ru/post/223887/
