Creating a programming language using LLVM. Part 1: Introduction and Lexical Analysis

Welcome to the tutorial "Creating a programming language with LLVM." This tutorial introduces you to creating a simple programming language, and at the same time shows how it can be easy and interesting, and also gives you a basic knowledge that you can then apply in other programming languages. The code in this tutorial can also be used as a launching pad for your creations using LLVM.

The purpose of this textbook is a gradual presentation of our language, a description of its step-by-step creation. This will allow us to cover a fairly wide range of language design issues and LLVM usage, simultaneously showing and explaining the code without a huge amount of unnecessary details.

It is useful to note in advance that this textbook does teach some compilation techniques and LLVM specificity, but does not teach modern software development principles. In practice, this means that we will use the most simplified and abbreviated solutions to simplify the presentation. So, for example, the code for working with memory uses global variables everywhere, without using excellent design patterns such as Visitor and others — this is not the best way, but it is very simple and understandable. If you understand the code and use it as a basis for future projects, the correction of such flaws will not be a problem for you.

I tried to build this tutorial in such a way that you could easily skip parts that you already know or are not interested in. The structure of the textbook is as follows:

Chapter 1 : Introduction to the Kaleidoscope language and the definition of its lexical analyzer - Shows the direction in which we are moving and the basic functionality that we have to develop. In order to make this tutorial more understandable, and examples closer to reality, we decided to implement everything only in pure C ++ without using parser-generators. LLVM, of course, works great with such tools, so you can easily use one of them.
Chapter 2 : Implementation of the parser and AST - With the ready lexical analyzer, we can already talk about the methods of parsing and the basics of building AST. This manual describes the analysis of recursive descent and analysis of the priority of operators. Nothing in Chapters 1 or 2 refers to LLVM, the code is not even associated with LLVM in these chapters. :)
Chapter 3 : Code Generation LLVM IR - With a ready AST, we can show how easy it is to actually generate LLVM IR.
Chapter 4 : Adding JIT and Optimizer Support - Since many people are interested in using LLVM as a JIT, we dive into it and show you 3 steps to add support for JIT. LLVM is also useful in many other ways, but it is one of the easiest and “sexy” ways to demonstrate your power. :)
Chapter 5 : Language Expansion: Control Flow - Now that we have a working programming language, we can expand it with control structures (if / then / else and the "for" loop). This gives us the opportunity to talk about simple SSA building and flow control.
Chapter 6 : Language Expansion: User-defined operators are a simple but interesting chapter on language extensions that allow the user to define his own unary and binary operators (with an assignable priority!). This will allow us to build a significant piece of "programming language" as a library of subroutines.
Chapter 7 : Language Expansion: Variable Variables — This chapter describes how to add custom local variables using an assignment operator. The most interesting thing here is how easy and trivial it is to build in the SSA form: and no, LLVM does NOT require using the SSA form in your front end!
Chapter 8 : Conclusion and other goodies of LLVM - This chapter completes the series, telling you about possible ways to extend the language, and also includes a bunch of links to various topics, such as Garbage Collection support, exceptions, debugging, spaghetti support -stak "(" spaghetti stacks "), and a bunch of other tips and tricks.

By the end of the textbook, we wrote a little less than 700 lines of code without taking into account comments and empty lines. With this small amount of code, we have created a very good compiler for a non-trivial programming language, including a handwritten lexical analyzer, a parser, AST, and support for code generation with JIT. While other systems have “Hello, World” style textbooks, the width of this textbook is a great testament to the power of LLVM. If you are interested in building languages and compilers, you should definitely read this tutorial.
')
And one more thing: we expect you to expand the language and play with it on your own. Take the code and start frantically programming, get crazy, the compilers should not scare you - it's so much fun to play with languages!

Kaleidoscope language

This textbook will illustrate a toy programming language, which we called “ Kaleidoscope ” (the name comes from three Greek words: “beautiful,” “look,” and “look, I watch”). Kaleidoscope is a procedural language that allows you to define functions, use conditions, mathematics, etc. In the course of the study, we will extend Kaleidoscope to support if / then / else constructs, loops, user operators, JIT compilation with a simple console interface, etc.

For simplicity, Kaleidoscope has only one data type — a 64-bit floating point number (like “double” in C language). Thus, all values are real double precision numbers and the language does not require type declarations. This makes the language very beautiful and simplifies the syntax. For example, the following simple example calculates Fibonacci numbers :

#  x-   def fib(x) if x < 3 then 1 else fib(x-1)+fib(x-2) #    40- . fib(40)

We also allowed Kaleidoscope to call standard library functions (LLVM JIT makes it completely trivial). This means that you can use the "extern" keyword to declare a function before using it (this is also useful for mutually recursive functions). For example:

 extern sin(arg); extern cos(arg); extern atan2(arg1 arg2); atan2(sin(.4), cos(42))

A more interesting example is given in Chapter 6, where we write a small application on Kaleidoscope, which displays a set of Mandelbrot in different degrees of approximation.

And now let's dive into the implementation of this language!

Lexical analyzer

When it comes to implementing a programming language, the first thing that is needed is the ability to process a text file and recognize the code in it. The traditional way to do this is to use a “ lexical analyzer ” (or “scanner”) to break the input stream into “tokens”. Each token returned by a lexical analyzer includes a unit of code and potentially some metadata (for example, a numeric value). First, we will determine the possibilities:

 //     [0-255],   , //       enum Token { tok_eof = -1, //  ( ) tok_def = -2, tok_extern = -3, //  (, ) tok_identifier = -4, tok_number = -5, }; static std::string IdentifierStr; // ,  tok_identifier (  - ) static double NumVal; // ,  tok_number (  - )

Each token returned by our lexical analyzer will either be one of the token enum values or it will be an “unknown” like “+” character that is returned as an ASCII value. If the current token is an identifier, then the global variable IdentifierStr contains the name of the identifier. If the current token is a numeric literal (for example, 1.0), NumVal contains its value. Please note that we use global variables for simplicity, but this is not the best choice for real use :).

In fact, our lexical analyzer is implemented by a single function called gettok. The call to gettok function returns the next token from the standard input stream. Its definition begins like this:

 /// gettok -       . static int gettok() { static int LastChar = ' '; //  . while (isspace(LastChar)) LastChar = getchar();

gettok calls the C function getchar () to read the characters one by one from the standard input. It recognizes them and stores the last character read in LastChar, but does not process it. The first thing she should do is ignore the spaces between the tokens. This is achieved in the above cycle.

The next gettok action is to recognize identifiers and special keywords, such as "def." Kaleidoscope does this with this simple loop:

 if (isalpha(LastChar)) { // : [a-zA-Z][a-zA-Z0-9]* IdentifierStr = LastChar; while (isalnum((LastChar = getchar()))) IdentifierStr += LastChar; if (IdentifierStr == "def") return tok_def; if (IdentifierStr == "extern") return tok_extern; return tok_identifier; }

Note that this code collects the global variable "IdentifierStr" while it is analyzing the identifier. In addition, language keywords are checked in the same loop. For numerical values, everything is the same:

 if (isdigit(LastChar) || LastChar == '.') { // : [0-9.]+ std::string NumStr; do { NumStr += LastChar; LastChar = getchar(); } while (isdigit(LastChar) || LastChar == '.'); NumVal = strtod(NumStr.c_str(), 0); return tok_number; }

This is a fairly simple code for processing input data. When reading a numeric value, we use the C-function strtod to convert it to a numeric value, which we store in NumVal. Note that sufficient error checking does not occur here: the value "1.23.45.67" will be read and processed as if you entered in "1.23". You can easily add code to check for similar errors :). Now let's look at the comments:

 if (LastChar == '#') { //    . do LastChar = getchar(); while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); if (LastChar != EOF) return gettok(); }

Having defined the comment, we skip the characters to the end of the line, and then return the next token. Then, if the input does not match one of the above cases, then this is either an operator like "+" or the end of the file. They are processed by this code:

 //   . if (LastChar == EOF) return tok_eof; //         ASCII int ThisChar = LastChar; LastChar = getchar(); return ThisChar; }

At the same time, we have a full-fledged lexical analyzer for the Kaleidoscope language (a full listing of the code of the lexical analyzer is available in the next chapter from the textbook). Next, we will develop a simple parser that we use to build the Abstract Syntax Tree. Then we can use the lexical and parser together.

UPD.
Useful links that will be useful in the study:

LLVM official website
Articles in the Russian and English wikipedia
A bunch of English documentation
LLVM from the inside: how it works [article in Habré]
LLVM overview [article on Habré]

Source: https://habr.com/ru/post/119850/

All Articles

Creating a programming language using LLVM. Part 1: Introduction and Lexical Analysis

Kaleidoscope language

Lexical analyzer

More articles: