Writing our own programming language without moms, dads, or Bison. Part 0: Theory

The idea of writing my own programming language has been nagging at me for about half a year. I did not set out to "kill" CoffeeScript, TypeScript, Elm, or the thousands of others; I just wanted to understand the internals and how such languages are written in general.

To my unpleasant surprise, most of these languages use Jison (Bison for JavaScript), which did not fit my goal of "understanding", because Jison essentially does everything for you: it builds the AST from the rules you specify. (Jison is a great tool that does the lion's share of the work for you, but that is not what this article is about.)

In the end, through trial and error (more precisely, by reading articles and reverse engineering), I learned how to write full-fledged programming languages, from splitting the source text into lexemes to translating it into JS code.

It is worth noting that this guide is not tied to JavaScript. It was chosen solely for development speed and readability, so you can write your "Lisp" / "Python" / "absolutely new syntax" in any language you are familiar with.

Also, up to the point where the compiler (in our case, the translator) is written, the process does not differ from creating a language compiled to ASM / JVM bytecode / LLVM bitcode / etc., which means this guide is not limited to languages translated to JavaScript.

All the code written in this and subsequent articles is on GitHub. Tags mark the beginning and end of each article for convenience.

Some theory

Without diving into Wikipedia, the process of translating source code into the final JS code goes as follows:

source code -(Lexer)-> tokens -(Parser)-> AST -(Compiler)-> js code

What's going on here:

1) Lexer

The source code of our program is split into lexemes. Simply put, this means finding keywords, literals, symbols, identifiers, etc. in the source text.

That is, from this input (CoffeeScript):

    a = true

    if a
      console.log('Hello, lexer')

We get this (abbreviated notation):

    [IDENTIFIER:"a"] [ASSIGN:"="] [BOOLEAN:"true"] [NEWLINE:"\n"]
    [NEWLINE:"\n"]
    [KEYWORD:"if"] [IDENTIFIER:"a"] [NEWLINE:"\n"]
    [INDENT:" "] [IDENTIFIER:"console"] [DOT:"."] [IDENTIFIER:"log"] [ROUND_BRAKET_START:"("] [STRING:"'Hello, lexer'"] [ROUND_BRAKET_END:")"] [NEWLINE:"\n"]
    [OUTDENT:""]
    [EOF:"EOF"]

Since CoffeeScript is indentation-sensitive and has no explicit block delimiters such as { and }, blocks are delimited by indentation. The INDENT and OUTDENT tokens essentially replace the braces.
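To make this step more concrete, here is a minimal sketch of such a lexer. It is not the lexer from the Habrlang repository: the `tokenize` function, the `RULES` table, and the space-only indentation handling are assumptions made for this example. The token names are taken from the listing above.

```javascript
// Minimal lexer sketch (hypothetical, for illustration only).
// Each rule is [token type, regex anchored at the start of the input].
// Order matters: BOOLEAN and KEYWORD must be tried before IDENTIFIER.
const RULES = [
  ['BOOLEAN', /^(?:true|false)\b/],
  ['KEYWORD', /^(?:if|else)\b/],
  ['IDENTIFIER', /^[A-Za-z_$][\w$]*/],
  ['STRING', /^'[^']*'/],
  ['ASSIGN', /^=/],
  ['DOT', /^\./],
  ['ROUND_BRAKET_START', /^\(/],
  ['ROUND_BRAKET_END', /^\)/],
];

function tokenize(source) {
  const tokens = [];
  const indents = [0]; // stack of indentation widths of the open blocks

  for (const line of source.split('\n')) {
    if (line.trim() === '') {
      // Blank line: emit only NEWLINE, its indentation is ignored.
      tokens.push({ type: 'NEWLINE', value: '\n' });
      continue;
    }

    // Compare leading spaces with the current block depth
    // (tabs are not handled in this sketch).
    const width = line.match(/^ */)[0].length;
    if (width > indents[indents.length - 1]) {
      indents.push(width);
      tokens.push({ type: 'INDENT', value: line.slice(0, width) });
    }
    while (width < indents[indents.length - 1]) {
      indents.pop();
      tokens.push({ type: 'OUTDENT', value: '' });
    }

    let rest = line.slice(width);
    while (rest.length > 0) {
      if (rest[0] === ' ') { // spaces between tokens carry no meaning
        rest = rest.slice(1);
        continue;
      }
      const rule = RULES.find(([, regex]) => regex.test(rest));
      if (!rule) throw new SyntaxError(`Unexpected character: ${rest[0]}`);
      const value = rest.match(rule[1])[0];
      tokens.push({ type: rule[0], value });
      rest = rest.slice(value.length);
    }
    tokens.push({ type: 'NEWLINE', value: '\n' });
  }

  // Close the blocks that are still open at the end of the input.
  while (indents.length > 1) {
    indents.pop();
    tokens.push({ type: 'OUTDENT', value: '' });
  }
  tokens.push({ type: 'EOF', value: 'EOF' });
  return tokens;
}
```

Calling `tokenize` on the CoffeeScript snippet above should produce the same sequence of token types as the abbreviated listing. The indent stack is the whole trick: a deeper line pushes a width and emits INDENT, a shallower one pops widths and emits one OUTDENT per closed block.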

2) Parser

The parser builds an AST from the tokens. It walks the entire array and recursively matches suitable patterns based on the type of a token or on a sequence of tokens.

From the tokens obtained in step 1, the parser will build approximately the following tree (abbreviated notation):

    {
      type: 'ROOT',
      nodes: [{
        type: 'VARIABLE', // a = true
        id: { type: 'IDENTIFIER', value: 'a' },
        init: { type: 'LITERAL', value: true }
      }, {
        type: 'IF_STATEMENT',
        test: { type: 'IDENTIFIER', value: 'a' },
        consequent: {
          type: 'BLOCK_STATEMENT',
          nodes: [{
            type: 'EXPRESSION_STATEMENT', // console.log
            expression: {
              type: 'CALL_EXPRESSION',
              callee: {
                type: 'MEMBER_EXPRESSION',
                object: { type: 'IDENTIFIER', value: 'console' },
                property: { type: 'IDENTIFIER', value: 'log' }
              },
              arguments: [{ type: 'LITERAL', value: 'Hello, lexer' }]
            }
          }]
        }
      }]
    }

Do not be intimidated by the size of the tree: it is generated recursively, and building it is not difficult.
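To show what "recursively" means here, below is a minimal recursive-descent sketch that understands just enough to handle the example: assignments, `if` with an indented block, member access, and calls with at most one argument. The function names (`parse`, `parseExpression`, `parseStatement`, and so on) are made up for this illustration and are not taken from the Habrlang repository. The token and node names follow the listings above.

```javascript
// Minimal recursive-descent parser sketch (hypothetical, for illustration only).
// Expects an array of { type, value } tokens that ends with an EOF token.
function parse(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const is = (type) => peek().type === type;
  const eat = (type) => {
    if (!is(type)) throw new SyntaxError(`Expected ${type}, got ${peek().type}`);
    return tokens[pos++];
  };
  const skipNewlines = () => { while (is('NEWLINE')) pos++; };

  // An identifier or literal, followed by any number of ".name" or "(arg)" suffixes.
  function parseExpression() {
    let node;
    if (is('IDENTIFIER')) {
      node = { type: 'IDENTIFIER', value: eat('IDENTIFIER').value };
    } else if (is('BOOLEAN')) {
      node = { type: 'LITERAL', value: eat('BOOLEAN').value === 'true' };
    } else if (is('STRING')) {
      // Strip the surrounding quotes from the lexeme.
      node = { type: 'LITERAL', value: eat('STRING').value.slice(1, -1) };
    } else {
      throw new SyntaxError(`Unexpected token ${peek().type}`);
    }

    while (is('DOT') || is('ROUND_BRAKET_START')) {
      if (is('DOT')) {
        eat('DOT');
        const property = { type: 'IDENTIFIER', value: eat('IDENTIFIER').value };
        node = { type: 'MEMBER_EXPRESSION', object: node, property };
      } else {
        eat('ROUND_BRAKET_START');
        // Zero or one argument only; comma-separated lists are omitted for brevity.
        const args = is('ROUND_BRAKET_END') ? [] : [parseExpression()];
        eat('ROUND_BRAKET_END');
        node = { type: 'CALL_EXPRESSION', callee: node, arguments: args };
      }
    }
    return node;
  }

  function parseStatement() {
    if (is('KEYWORD') && peek().value === 'if') {
      eat('KEYWORD');
      const test = parseExpression();
      eat('NEWLINE');
      return { type: 'IF_STATEMENT', test, consequent: parseBlock() };
    }
    const expression = parseExpression();
    if (expression.type === 'IDENTIFIER' && is('ASSIGN')) {
      eat('ASSIGN');
      return { type: 'VARIABLE', id: expression, init: parseExpression() };
    }
    return { type: 'EXPRESSION_STATEMENT', expression };
  }

  // A block is everything between INDENT and the matching OUTDENT.
  function parseBlock() {
    eat('INDENT');
    const nodes = parseStatements('OUTDENT');
    eat('OUTDENT');
    return { type: 'BLOCK_STATEMENT', nodes };
  }

  // Collects statements until a token of the given closing type is reached.
  function parseStatements(endType) {
    const nodes = [];
    skipNewlines();
    while (!is(endType)) {
      nodes.push(parseStatement());
      skipNewlines();
    }
    return nodes;
  }

  return { type: 'ROOT', nodes: parseStatements('EOF') };
}
```

Each grammar construct gets its own function, and the functions call each other: a statement contains expressions, an `if` contains a block, a block contains statements again. That mutual recursion is what produces the nesting in the tree.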

3) Compiler

Building the final code from the AST. This step can be replaced with compilation to bytecode, or even with direct execution at runtime, but in this series of articles we will implement a translator to another programming language.

The compiler (read: translator) converts the abstract syntax tree into JavaScript code:

    var a = true;

    if (a) {
      console.log('Hello, lexer');
    }
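A translator of this kind can be sketched as a single recursive function that turns each node type into a piece of JavaScript text. This is only an illustration under assumptions: the `compile` function and its `indent` parameter are invented for the example, and it covers only the node types that appear in the tree from step 2.

```javascript
// Minimal code generator sketch (hypothetical, for illustration only).
// Walks the AST recursively and returns JavaScript source as a string.
function compile(node, indent = '') {
  switch (node.type) {
    case 'ROOT':
      return node.nodes.map((child) => compile(child, indent)).join('\n');
    case 'VARIABLE':
      return `${indent}var ${compile(node.id)} = ${compile(node.init)};`;
    case 'IF_STATEMENT':
      return `${indent}if (${compile(node.test)}) ${compile(node.consequent, indent)}`;
    case 'BLOCK_STATEMENT': {
      // Statements inside a block are indented one level deeper.
      const body = node.nodes.map((child) => compile(child, indent + '  ')).join('\n');
      return `{\n${body}\n${indent}}`;
    }
    case 'EXPRESSION_STATEMENT':
      return `${indent}${compile(node.expression)};`;
    case 'CALL_EXPRESSION': {
      const args = node.arguments.map((arg) => compile(arg)).join(', ');
      return `${compile(node.callee)}(${args})`;
    }
    case 'MEMBER_EXPRESSION':
      return `${compile(node.object)}.${compile(node.property)}`;
    case 'IDENTIFIER':
      return node.value;
    case 'LITERAL':
      // JSON.stringify quotes strings and leaves booleans and numbers as they are.
      return JSON.stringify(node.value);
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}
```

For the tree from step 2, this sketch produces the same program as above, except that string literals come out in double quotes because of `JSON.stringify`, and there is no blank line between statements.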

That's all. Most compilers work on this principle, with minor variations: sometimes they add a step that streams the source into a stream of characters, sometimes they combine parsing and compilation into a single step, but it is not for us to judge them.

Habrlang

So, now that the theory is clear, it is time to put together our own programming language, which will have approximately the following syntax (to keep things simple, we will make a mix of Ruby, Python, and CoffeeScript):

    #!/bin/habrlang

    # Hello habrlang

    def hello
      <- "world"
    end

    console.log(hello())

In the next part, we will implement all the main classes of our translator and teach it to translate Habrlang comments into JavaScript.

GitHub repo: https://github.com/SuperPaintman/habrlang

Source: https://habr.com/ru/post/316460/
