In my Advanced Compilers course last fall, we spent some time studying the LLVM source tree. A million lines of C++ look scary, but I found it an interesting exercise, and at least some students agreed, so I thought I would write up something similar. We will use LLVM 3.9, but earlier (and probably later) releases are not much different.

I don’t want to spend much time on the theoretical foundations of LLVM, but there are a few things you should know.
The LLVM core does not contain frontends, only the middle-end optimizers, several backends, documentation, and a lot of auxiliary code. Frontends, such as Clang, live in separate projects.
The intermediate representation (IR) in the LLVM core lives in RAM and is manipulated through a large C++ API. This representation can be saved as readable text and parsed back into memory, but only as a debugging convenience: during normal compilation with LLVM, textual IR is never generated. Usually, the frontend builds IR using LLVM API calls, then runs some optimization passes, and then invokes a backend, which generates assembly or machine code. When LLVM code is written to disk (which does not happen during normal compilation of C and C++ projects with Clang), it is stored as “bitcode”, a compact binary representation.
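To make the "built through API calls, printed as text only for debugging" idea concrete, here is a minimal sketch. It uses a hypothetical toy IR (ToyInst, ToyFunction, emit, and toText are all invented for illustration), not LLVM's real classes, and it ignores types, basic blocks, and everything else a real IR needs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy IR (not LLVM's classes): "%result = opcode lhs, rhs".
struct ToyInst {
  std::string result, opcode, lhs, rhs;
};

struct ToyFunction {
  std::string name;
  std::vector<ToyInst> body; // the in-memory representation
  unsigned nextTemp = 0;

  // The kind of "API call" a frontend would make: append one instruction
  // to the in-memory IR and return the name of the value it defines.
  std::string emit(const std::string &opcode, const std::string &lhs,
                   const std::string &rhs) {
    assert(!opcode.empty() && "an instruction needs an opcode");
    std::string result = "%t" + std::to_string(nextTemp++);
    body.push_back({result, opcode, lhs, rhs});
    return result;
  }
};

// Debugging aid only: render the in-memory form as readable text.
std::string toText(const ToyFunction &f) {
  std::string out = "func @" + f.name + " {\n";
  for (const ToyInst &i : f.body)
    out += "  " + i.result + " = " + i.opcode + " " + i.lhs + ", " + i.rhs + "\n";
  out += "}\n";
  return out;
}
```

A frontend in this toy world would call emit("add", "%x", "1") and feed the returned name into the next emit call; toText would be called only when a human wants to look at the result.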
The main documentation on the LLVM API is generated by doxygen and can be found here. It is difficult to use unless you already know exactly what you need to do and what to look for. The tutorials referenced below are a better starting point for exploring the LLVM API.
Now, on to the code.
The root directory contains:
bindings: these let you use the LLVM API from languages other than C++. There are other bindings as well, for C (discussed below) and for Haskell (not in this tree).
cmake: LLVM uses CMake, not autoconf. Just be thankful that someone else maintains it for you.
docs: documentation in reStructuredText. See, for example, the language reference, which defines the meaning of each LLVM instruction (GitHub renders .rst files as HTML by default; you can see the raw file here). The material in the tutorial subdirectory is particularly interesting, but do not read it there; better go here. This is the best way to learn LLVM!
examples: the sources that accompany the tutorial. As an LLVM hacker, you should take code, CMakeLists.txt, and so on from here whenever possible.
include: the first subdirectory, llvm-c, contains the C bindings, which I have not used but which look quite reasonable. Importantly, the LLVM developers try to keep these bindings stable, while the C++ API changes with each release, although the rate of change seems to have slowed over the last few years.
The second subdirectory, llvm, is large: it contains 878 header files that define the LLVM API. In general, it is easier to use the doxygen versions of these files than to read them directly, but you often end up searching through these files to find a function.
lib: contains the really important code; we will look at it separately below.
projects: empty by default, but LLVM components that live in other repositories are checked out here, such as compiler-rt (runtime libraries for things like the sanitizers), OpenMP support, and the LLVM C++ library.
resources: something for Visual C++ that neither you nor I need (more here).
runtimes: another placeholder for external projects, added only last summer (2016, translator's note); I do not actually know what it is for.
test: a large directory containing thousands of LLVM unit tests; they run when you build the check target (make check-all, translator's note). Most of them are .ll files containing LLVM IR in textual form. They check, for example, that an optimization pass produces the expected result. I will look at LLVM's tests in detail in a future post.
tools: LLVM itself is just a collection of libraries, with no dedicated main function. Most of the subdirectories of tools contain executable tools that link against the LLVM libraries. For example, llvm-dis is a disassembler that translates bitcode into the textual assembly format.
unittests: more unit tests, also run when you build the check target. These are C++ files that use the Google Test framework to call the API directly, unlike the tests in the test directory, which exercise LLVM functionality indirectly by running the assembler, disassembler, or optimizer.
utils: Emacs and Vim modes that follow the LLVM coding style, a Valgrind suppression file for false positives, the lit and FileCheck tools that support testing, and many other odds and ends. You probably do not need most of them.
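The test and utils entries above mention .ll tests and FileCheck. As a rough illustration of the idea, here is a minimal sketch of a FileCheck-like checker. It is an assumption-laden toy: the names extractChecks and checksPass are invented, only plain "CHECK:" directives are handled, and patterns are matched as ordered substrings. The real FileCheck supports much more (regular expressions, CHECK-NEXT, CHECK-NOT, and so on).

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Collect the pattern text that follows each "CHECK:" directive in a test file.
std::vector<std::string> extractChecks(const std::string &testFile) {
  const std::string tag = "CHECK:";
  const std::string blanks = " \t\r";
  std::vector<std::string> patterns;
  std::istringstream in(testFile);
  std::string line;
  while (std::getline(in, line)) {
    std::size_t pos = line.find(tag);
    if (pos == std::string::npos)
      continue; // not a directive line
    std::string pat = line.substr(pos + tag.size());
    std::size_t first = pat.find_first_not_of(blanks);
    if (first == std::string::npos)
      continue; // directive with an empty pattern: ignored in this sketch
    std::size_t last = pat.find_last_not_of(blanks);
    assert(last != std::string::npos); // guaranteed because `first` exists
    patterns.push_back(pat.substr(first, last - first + 1));
  }
  return patterns;
}

// True if every pattern occurs in the tool output, in the order in which
// the directives appear in the test file.
bool checksPass(const std::string &testFile, const std::string &output) {
  std::size_t cursor = 0;
  for (const std::string &pat : extractChecks(testFile)) {
    std::size_t hit = output.find(pat, cursor);
    if (hit == std::string::npos)
      return false;
    cursor = hit + pat.size(); // later patterns must match after this one
  }
  return true;
}
```

In this sketch, a test file carries both the input IR and the expectations as comments, and checksPass is given whatever text the tool under test printed.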
OK, so far everything has been pretty simple. We skipped the lib directory, which contains almost everything important. Let's look at its subdirectories:
Analysis: contains many static analyses, such as alias analysis and global value analysis. Some analyses are structured as LLVM passes and must be run by the pass manager; others are libraries and can be called directly. The odd member of the analysis family is InstructionSimplify.cpp, which is really a transformation, not an analysis. I am sure many readers will miss the comment explaining what this file is doing here.
Here is that comment: this code does not itself change the IR. The rule is that llvm::SimplifyInstruction may only return constants and already-existing Value objects, which satisfies the requirements for an analysis. The pass that calls SimplifyInstruction on each instruction is a transformation pass (lib/Transforms/Utils/SimplifyInstructions.cpp).
AsmParser: parses textual IR into memory.
Bitcode: serializes IR into the compact binary format and reads it back into RAM.
CodeGen: the target-independent LLVM code generator: a framework on which the LLVM backends are built, plus a set of libraries those backends can use. There is a lot of code here (>100 KLOC) and, unfortunately, I don't know much about it.
DebugInfo: a library that supports the mapping between LLVM instructions and source code locations. There is a lot of good information in these slides from a talk at the 2014 LLVM Developers' Meeting.
ExecutionEngine: although LLVM IR is usually translated to assembly or machine code, it can also be executed directly by an interpreter. The non-JIT interpreter did not work properly the last time I tried it, and in any case it is slower than the JIT. The latest JIT API, Orc, is here.
Fuzzer: this is libFuzzer, a fuzzing tool similar to AFL. It uses LLVM functionality to fuzz programs that are compiled with LLVM.
IR: assorted IR-related code: printing IR in textual form, upgrading bitcode files created by earlier versions of LLVM, folding constants as IR nodes are created, and so on.
IRReader, LibDriver, LineEditor: almost nobody cares about these, and there is hardly any useful code in them anyway.
Linker: an LLVM module, like a C or C++ compilation unit, contains functions and variables. The linker combines multiple modules into one larger module.
LTO: link-time optimization, the subject of many blog posts and research papers, lets the optimizer see across the boundaries of separately compiled modules. LLVM gets link-time optimization “for free” by using its linker to create one large module and then optimizing it with the usual optimization passes. This is a good approach, but it does not scale to very large projects. The modern approach is ThinLTO, which delivers most of the benefit at a fraction of the cost.
MC: a compiler usually emits assembly code and lets an assembler produce the machine code. The MC subsystem in LLVM cuts out the middleman and generates machine code directly. This speeds up compilation and is especially useful when LLVM is used as a JIT compiler.
Object: implementation of the details of object file formats, such as ELF.
ObjectYAML: supports encoding object files as YAML. I do not know why this is needed.
Option: command line parsing.
Passes: the part of the pass manager that schedules LLVM passes so that their dependencies are respected.
ProfileData: reads and writes profile data to support profile-guided optimization.
Support: miscellaneous support code, including APInt (arbitrary-precision integers, which are used widely in LLVM) and much more.
TableGen: a kind of Swiss Army knife: a tool that takes .td files (of which there are more than 200 in LLVM) containing structured data and generates C++ code that is compiled into LLVM. TableGen is used, for example, to implement the assemblers and disassemblers.
Target: the backends for the various processors live here. There are a lot of TableGen files. You can create a new backend by cloning the one whose architecture is closest to yours and then spending a couple of years developing it.
Transforms: this is my favorite directory; the middle-end optimizers live here. IPO contains interprocedural optimizations, which work across function boundaries; they are usually not very aggressive, but they see a lot of code at once. InstCombine is a peephole optimizer. Instrumentation supports the sanitizers. ObjCARC supports this. Scalar contains the “textbook” compiler optimizations; I will try to write a more detailed post about the contents of this directory. Utils is helper code. Vectorize is LLVM's autovectorizer, the subject of a great deal of work in recent years.
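For readers who have not met the term, a peephole optimizer looks at one instruction (or a small window of instructions) at a time and replaces it with a cheaper equivalent. The following is a minimal sketch of that idea over a hypothetical toy instruction format; ToyInst, isPowerOfTwo, and peephole are invented names, the two rewrite rules assume unsigned arithmetic, and none of this reflects how InstCombine is actually implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy instruction: dst = op src, imm (unsigned arithmetic).
struct ToyInst {
  std::string op;    // e.g. "mul", "udiv", "add"
  std::string dst;   // name of the value being defined
  std::string src;   // name of the variable operand
  std::uint64_t imm; // constant operand
};

// Returns true if v is a power of two and stores log2(v) in shift.
static bool isPowerOfTwo(std::uint64_t v, unsigned &shift) {
  if (v == 0 || (v & (v - 1)) != 0)
    return false;
  shift = 0;
  while ((v >>= 1) != 0)
    ++shift;
  assert(shift < 64);
  return true;
}

// One peephole sweep: inspect each instruction in isolation and rewrite it
// in place when a cheaper form exists. Returns the number of rewrites.
unsigned peephole(std::vector<ToyInst> &code) {
  unsigned changed = 0;
  for (ToyInst &inst : code) {
    unsigned shift = 0;
    if (inst.op == "mul" && isPowerOfTwo(inst.imm, shift)) {
      inst.op = "shl"; // x * 2^k  ==>  x << k
      inst.imm = shift;
      ++changed;
    } else if (inst.op == "udiv" && isPowerOfTwo(inst.imm, shift)) {
      inst.op = "lshr"; // x / 2^k (unsigned)  ==>  x >> k
      inst.imm = shift;
      ++changed;
    }
  }
  return changed;
}
```

Running peephole on a sequence that multiplies by 8 and later divides by 4 would turn those two instructions into shifts by 3 and 2 and leave everything else untouched.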
That concludes our tour. I hope it was useful and, as always, let me know if I got something wrong or left something out.