
Creating a programming language using LLVM. Part 10: Conclusion and other LLVM goodies

Table of contents:
Part 1: Introduction and Lexical Analysis
Part 2: Implementing a Parser and AST
Part 3: LLVM IR Code Generation
Part 4: Adding JIT and Optimizer Support
Part 5: Extending the Language: Control Flow
Part 6: Extending the Language: User-defined Operators
Part 7: Extending the Language: Mutable Variables
Part 8: Compiling to Object Code
Part 9: Adding Debug Information
Part 10: Conclusion and other LLVM goodies



10.1. Conclusion


Welcome to the final part of the “Creating a programming language using LLVM” tutorial. Over the course of this tutorial, we have grown our little Kaleidoscope language from a useless toy into a rather interesting (though perhaps still useless) toy.

It is interesting to look back at how far we have come, and how little code it required. We built a complete lexical analyzer, parser, AST, code generator, and interactive execution (with a JIT!), and we emitted debug information into a standalone executable - all in under 1000 lines of code (excluding blank lines and comments).
Our small language supports a couple of interesting features: user-defined binary and unary operators, JIT compilation for immediate execution, and several control-flow constructs generated in SSA form.

Part of the idea of this tutorial was to show you how easy and fun it can be to define, build, and play with a language. Building a compiler doesn’t have to be a scary or mystical process! Now that you have seen the basics, I strongly recommend that you take the code and hack on it. For example, try adding:

global variables - although the value of global variables in modern software engineering is questionable, they are often handy for small quick hacks, such as the Kaleidoscope compiler itself. Fortunately, globals are very easy to add to our program: for each variable reference, simply check the global symbol table first. To create a new global variable, create an instance of the LLVM GlobalVariable class.
typed variables - right now Kaleidoscope supports only variables of type double. This keeps the language very elegant, because supporting only one type means you never have to specify types. Different languages solve this problem in different ways. The easiest way is to require the user to specify a type for each variable definition, and to record variable types in the symbol table along with their Value*.
arrays, structures, vectors, etc. - once you introduce types, you can start extending the type system in all sorts of interesting ways. Simple arrays are easy to add and are useful for many kinds of applications. Adding them is a great exercise for learning how the LLVM getelementptr instruction works: it is so elegant and unconventional that it has its own FAQ!
standard runtime - in its current form, the language lets the user access arbitrary external functions, and we use this for things like “printd” and “putchard”. As you extend the language with higher-level constructs, it often makes sense to lower such constructs to runtime-library calls rather than emitting them as inline instruction sequences.
memory management - right now Kaleidoscope only has access to the stack. It would also be useful to allocate heap memory, either by calling the standard libc malloc/free interfaces or by using a garbage collector. If you choose a garbage collector, note that LLVM fully supports Accurate Garbage Collection, including algorithms that move objects and those that need to scan/update the stack.
exception support - LLVM supports generating zero-cost exceptions that interoperate with code compiled in other languages. Alternatively, you can generate code that makes every function return an error value and check it, or implement exceptions explicitly with setjmp/longjmp. There are many different ways to go here.
OOP, generic types, database access, complex numbers, geometric programming, ... - really, there is no end to the crazy things you can add to the language.
unusual applications - we have talked about applying LLVM to a domain that many people are interested in: building a compiler for a specific language. But there are many other domains where, at first glance, a compiler does not seem applicable. For example, LLVM has been used to accelerate OpenGL graphics, translate C++ code into ActionScript, and many other interesting things. Maybe you will be the first to build a JIT compiler for regular expressions to native code with LLVM?
have fun - try doing something crazy and unusual. Making a language just like everyone else’s is much less fun than trying something wild. If you want to talk about it, feel free to post to the llvm-dev mailing list: it is full of people who are interested in languages and are often willing to help.
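As a sketch of the “global variables” suggestion above, here is roughly what the emitted IR could look like (the names @G and @useG are made up for illustration; a real front-end would construct the @G definition via the GlobalVariable class):

```llvm
; A Kaleidoscope-style global: every value in the language is a double.
@G = global double 0.0

define double @useG() {
entry:
  ; A reference to G loads from the global instead of a local alloca.
  %v = load double, double* @G
  ret double %v
}
```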

Before we finish the tutorial, I want to share some tips on generating LLVM IR. There are some subtleties that may not be obvious, but are very useful if you want to take full advantage of LLVM’s capabilities.

10.2. LLVM IR Properties


There are a couple of common questions about code in LLVM IR form - let’s look at them now.

10.2.1. Target platform independence


Kaleidoscope is an example of a “portable language”: any program written in Kaleidoscope will run the same way on any target platform it runs on. Many other languages have this property, for example Lisp, Java, Haskell, JavaScript, Python, etc. (note that while these languages are portable, not all of their libraries are).

One nice aspect of LLVM is that it maintains target independence at the IR level: you can take the LLVM IR for a Kaleidoscope-compiled program and run it on any target that LLVM supports, or even emit C code and compile it for targets that LLVM does not support directly. You can say that the Kaleidoscope compiler generates target-independent code because it never asks for any target information when emitting code.

The fact that LLVM provides a compact, target-independent representation of code is very appealing. Unfortunately, people often think only about C or C-like languages when they ask about language portability. I say “unfortunately” because there is really no general way to make C code portable: C source code itself is simply not portable in general, even when just porting an application from 32 to 64 bits.

The problem with C (again, in general) is that it is heavily laden with target-specific assumptions. As a simple example, the preprocessor makes the code non-portable the moment it processes text like this:

    #ifdef __i386__
      int X = 1;
    #else
      int X = 42;
    #endif

Although this particular problem can be worked around in various complicated ways, it cannot be solved in general.

But a subset of C can be made portable. If you fix the sizes of the primitive types (for example, int = 32 bits, long = 64 bits), don’t care about ABI compatibility with existing binaries, and give up a few other features, you can get portable code. This makes sense for certain special cases.

10.2.2. Security guarantees


Many of the languages above are also “safe”: a program written in Java cannot corrupt its address space and crash the process (assuming the JVM is bug-free). Safety is an interesting property that requires a combination of language design, runtime support and, often, operating system support.

It is certainly possible to implement a safe language in LLVM, but LLVM IR itself does not guarantee safety. LLVM IR allows unsafe pointer casts, use-after-free bugs, buffer overruns, and a variety of other problems. Safety has to be implemented as a layer on top of LLVM and, fortunately, several groups have investigated this topic. Ask on the llvm-dev mailing list if you are interested in the details.

10.2.3. Language-specific optimizations


There is one thing about LLVM that many people dislike: it does not solve all of the world’s problems within one system (sorry, starving children, someone else will have to solve your problem, not today). One common complaint is that LLVM cannot perform high-level, language-specific optimizations: LLVM “loses too much information”.

Unfortunately, this is not the place to write a complete and unified version of “compiler design theory” for you. Instead, I will make a few observations:

First, it is true that LLVM loses information. For example, at the LLVM IR level you cannot tell whether an SSA value was generated from a C “int” or a C “long” on an ILP32 machine (other than from debug info). Both compile to an “i32” value, and the information about the original type is lost. A more general issue is that the LLVM type system uses structural equivalence instead of name equivalence. Another thing that surprises people is that if you have two types in a high-level language with the same structure (for example, two different structs that each have a single int field), they compile to a single LLVM type, and it becomes impossible to tell which source struct a given value came from.
Second, although LLVM does lose information, it is not a fixed target: we keep extending and improving it in many directions. Along with adding new features (LLVM did not always support exceptions or debug info), we also extend the IR to capture information important for optimization (e.g. whether an argument is zero- or sign-extended, information about pointer aliasing, etc.). Many of the improvements are user-driven: people want LLVM to include some specific feature, so they go ahead and extend it.
Third, it is possible and easy to add language-specific optimizations, and there are a number of ways to do it. As a trivial example, it is easy to add an optimization pass that “knows” various things about the code being compiled. In the case of C-like languages, such a pass can “know” about the functions of the standard C library. If you call “exit(0)” in main(), it knows that the call can safely be transformed into “return 0”, because the C standard specifies what the “exit” function must do.

Beyond simple library knowledge, it is possible to embed a variety of other language-specific information into the LLVM IR. If you have specific needs, please bring them up on the llvm-dev mailing list. In the worst case, you can always treat LLVM as a “dumb code generator” and implement the high-level optimizations you like in your front-end, on the AST specific to your language.
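To illustrate the exit(0) example, such a library-aware pass would rewrite the IR along these lines (a hand-written before/after sketch, not the output of a real pass):

```llvm
; Before: main calls the C library's exit, which never returns.
declare void @exit(i32)

define i32 @main() {
entry:
  call void @exit(i32 0)
  unreachable
}

; After: since the C standard guarantees what exit(0) does, the call
; in main can be replaced by an ordinary return.
define i32 @main() {
entry:
  ret i32 0
}
```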

10.3. Tips and Tricks


There are a variety of useful tips and tricks that you come to know after working with/on LLVM that are not obvious at first glance. So that everyone does not have to rediscover them, this section covers some of them.

10.3.1. Implementing a portable offsetof / sizeof


One interesting thing that comes up when you try to keep the code generated by your compiler target-independent is that you need to know the size of LLVM types and the offsets of specific fields within structures. For example, you might need to pass the size of a type to a function that allocates memory.
Unfortunately, type sizes can vary widely across targets: the width of a pointer is the simplest example. The neat trick for handling this is to use the getelementptr instruction.
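The trick (described in LLVM’s getelementptr FAQ) is to index off a null pointer and convert the resulting address to an integer; the struct %T and the field index below are placeholders for your own types:

```llvm
%T = type { i8, i32 }   ; an example struct

; sizeof(%T): the address of "element 1" of a null %T* is the type's
; size, computed by the target when the IR is lowered.
%SizePtr = getelementptr %T, %T* null, i32 1
%Size    = ptrtoint %T* %SizePtr to i64

; offsetof(%T, field 1): the address of field 1 of a null %T*.
%OffPtr  = getelementptr %T, %T* null, i32 0, i32 1
%Offset  = ptrtoint i32* %OffPtr to i64
```

Because the getelementptr is folded by the backend, the IR itself stays free of any hard-coded, target-specific sizes.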

10.3.2. Garbage-collected stack frames


Some languages want to manage their stack frames explicitly, often to support a garbage collector or to make closures easier. There are often better ways to implement these features than explicit stack frames, but LLVM does support this if you want it. It requires your front-end to convert the code into Continuation Passing Style and to use tail calls (which LLVM also supports).
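A minimal sketch of what the CPS/tail-call shape looks like in IR (the function name and the continuation type are invented for illustration):

```llvm
; Each "frame" receives an explicit continuation %k instead of
; returning to a caller. The 'tail' marker plus the fastcc calling
; convention let LLVM compile the call as a jump, so @step's machine
; frame is gone before %k runs.
define fastcc void @step(i64 %n, void (i64)* %k) {
entry:
  %n1 = add i64 %n, 1
  tail call fastcc void %k(i64 %n1)
  ret void
}
```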

Source: https://habr.com/ru/post/337240/

