⬆️ ⬇️

Creating a programming language using LLVM. Part 1: Introduction and Lexical Analysis

Welcome to the tutorial "Creating a programming language with LLVM." This tutorial introduces you to creating a simple programming language, and at the same time shows how it can be easy and interesting, and also gives you a basic knowledge that you can then apply in other programming languages. The code in this tutorial can also be used as a launching pad for your creations using LLVM.



The purpose of this textbook is a gradual presentation of our language, a description of its step-by-step creation. This will allow us to cover a fairly wide range of language design issues and LLVM usage, simultaneously showing and explaining the code without a huge amount of unnecessary details.



It is useful to note in advance that this textbook does teach some compilation techniques and LLVM specificity, but does not teach modern software development principles. In practice, this means that we will use the most simplified and abbreviated solutions to simplify the presentation. So, for example, the code for working with memory uses global variables everywhere, without using excellent design patterns such as Visitor and others — this is not the best way, but it is very simple and understandable. If you understand the code and use it as a basis for future projects, the correction of such flaws will not be a problem for you.



I tried to build this tutorial in such a way that you could easily skip parts that you already know or are not interested in. The structure of the textbook is as follows:

By the end of the textbook, we wrote a little less than 700 lines of code without taking into account comments and empty lines. With this small amount of code, we have created a very good compiler for a non-trivial programming language, including a handwritten lexical analyzer, a parser, AST, and support for code generation with JIT. While other systems have “Hello, World” style textbooks, the width of this textbook is a great testament to the power of LLVM. If you are interested in building languages ​​and compilers, you should definitely read this tutorial.

')

And one more thing: we expect you to expand the language and play with it on your own. Take the code and start frantically programming, get crazy, the compilers should not scare you - it's so much fun to play with languages!



Kaleidoscope language

This textbook will illustrate a toy programming language, which we called “ Kaleidoscope ” (the name comes from three Greek words: “beautiful,” “look,” and “look, I watch”). Kaleidoscope is a procedural language that allows you to define functions, use conditions, mathematics, etc. In the course of the study, we will extend Kaleidoscope to support if / then / else constructs, loops, user operators, JIT compilation with a simple console interface, etc.



For simplicity, Kaleidoscope has only one data type — a 64-bit floating point number (like “double” in C language). Thus, all values ​​are real double precision numbers and the language does not require type declarations. This makes the language very beautiful and simplifies the syntax. For example, the following simple example calculates Fibonacci numbers :



#  x-   def fib(x) if x < 3 then 1 else fib(x-1)+fib(x-2) #    40- . fib(40) 


We also allowed Kaleidoscope to call standard library functions (LLVM JIT makes it completely trivial). This means that you can use the "extern" keyword to declare a function before using it (this is also useful for mutually recursive functions). For example:



 extern sin(arg); extern cos(arg); extern atan2(arg1 arg2); atan2(sin(.4), cos(42)) 


A more interesting example is given in Chapter 6, where we write a small application on Kaleidoscope, which displays a set of Mandelbrot in different degrees of approximation.



And now let's dive into the implementation of this language!



Lexical analyzer

When it comes to implementing a programming language, the first thing that is needed is the ability to process a text file and recognize the code in it. The traditional way to do this is to use a “ lexical analyzer ” (or “scanner”) to break the input stream into “tokens”. Each token returned by a lexical analyzer includes a unit of code and potentially some metadata (for example, a numeric value). First, we will determine the possibilities:



 //     [0-255],   , //       enum Token { tok_eof = -1, //  ( ) tok_def = -2, tok_extern = -3, //  (, ) tok_identifier = -4, tok_number = -5, }; static std::string IdentifierStr; // ,  tok_identifier (  - ) static double NumVal; // ,  tok_number (  - ) 


Each token returned by our lexical analyzer will either be one of the token enum values ​​or it will be an “unknown” like “+” character that is returned as an ASCII value. If the current token is an identifier, then the global variable IdentifierStr contains the name of the identifier. If the current token is a numeric literal (for example, 1.0), NumVal contains its value. Please note that we use global variables for simplicity, but this is not the best choice for real use :).



In fact, our lexical analyzer is implemented by a single function called gettok. The call to gettok function returns the next token from the standard input stream. Its definition begins like this:



 /// gettok -       . static int gettok() { static int LastChar = ' '; //  . while (isspace(LastChar)) LastChar = getchar(); 


gettok calls the C function getchar () to read the characters one by one from the standard input. It recognizes them and stores the last character read in LastChar, but does not process it. The first thing she should do is ignore the spaces between the tokens. This is achieved in the above cycle.



The next gettok action is to recognize identifiers and special keywords, such as "def." Kaleidoscope does this with this simple loop:



 if (isalpha(LastChar)) { // : [a-zA-Z][a-zA-Z0-9]* IdentifierStr = LastChar; while (isalnum((LastChar = getchar()))) IdentifierStr += LastChar; if (IdentifierStr == "def") return tok_def; if (IdentifierStr == "extern") return tok_extern; return tok_identifier; } 


Note that this code collects the global variable "IdentifierStr" while it is analyzing the identifier. In addition, language keywords are checked in the same loop. For numerical values, everything is the same:



 if (isdigit(LastChar) || LastChar == '.') { // : [0-9.]+ std::string NumStr; do { NumStr += LastChar; LastChar = getchar(); } while (isdigit(LastChar) || LastChar == '.'); NumVal = strtod(NumStr.c_str(), 0); return tok_number; } 


This is a fairly simple code for processing input data. When reading a numeric value, we use the C-function strtod to convert it to a numeric value, which we store in NumVal. Note that sufficient error checking does not occur here: the value "1.23.45.67" will be read and processed as if you entered in "1.23". You can easily add code to check for similar errors :). Now let's look at the comments:



 if (LastChar == '#') { //    . do LastChar = getchar(); while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); if (LastChar != EOF) return gettok(); } 


Having defined the comment, we skip the characters to the end of the line, and then return the next token. Then, if the input does not match one of the above cases, then this is either an operator like "+" or the end of the file. They are processed by this code:



 //   . if (LastChar == EOF) return tok_eof; //         ASCII int ThisChar = LastChar; LastChar = getchar(); return ThisChar; } 


At the same time, we have a full-fledged lexical analyzer for the Kaleidoscope language (a full listing of the code of the lexical analyzer is available in the next chapter from the textbook). Next, we will develop a simple parser that we use to build the Abstract Syntax Tree. Then we can use the lexical and parser together.



UPD.

Useful links that will be useful in the study:

Source: https://habr.com/ru/post/119850/



All Articles