I was planning to write an article about how LLVM optimizes a function, but first I need to cover how Clang translates C or C++ into LLVM IR.

We will look at the high-level picture without plunging into the depths of Clang. I want to show how Clang's output relates to its input, and we will not touch on any non-trivial features of C++. We will use this little function, which I borrowed from some excellent
lectures on loop optimization:
bool is_sorted(int *a, int n) {
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      return false;
  return true;
}
Since Clang does not perform any optimizations, and since LLVM IR was designed from the start to be a good fit for C and C++, the translation is relatively straightforward. I will use Clang 6.0.1 (or something close to it, since that version has not yet been released) on x86-64.
The command line is as follows:
clang++ is_sorted.cpp -O0 -S -emit-llvm
In other words: compile the file is_sorted.cpp as C++, and then tell the rest of the LLVM toolchain: do not optimize, and emit assembly, but as a textual representation of LLVM IR. Textual LLVM IR is verbose and slow to print and parse; the binary bitcode format is always preferable unless a person actually needs to read the code.
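As an aside (this invocation is my addition, not part of the original command above), you can produce the bitcode form directly and convert it back to text with llvm-dis:

clang++ is_sorted.cpp -O0 -c -emit-llvm -o is_sorted.bc
llvm-dis is_sorted.bc -o is_sorted.ll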
Here is the full LLVM IR; we will look at it piece by piece.
Let's start with the top of the file:
; ModuleID = 'is_sorted.cpp'
source_filename = "is_sorted.cpp"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
Everything between a semicolon and the end of the line is a comment, so the first line does nothing; but, in case you are curious, in LLVM a "module" is a compilation unit, a container for code and data. The second line does not need to concern us either. The third line describes some assumptions made by the compiler; they do not matter for this article, but you can read more
here .
The target triple is historical baggage inherited from gcc, and we will not need it here.
An LLVM function has optional attributes:
; Function Attrs: noinline nounwind optnone uwtable
Some of them (like these) are generated by the frontend, others are added later by optimization passes. These attributes do not affect the meaning of the code, so I will not discuss them here, but you can read about them
here if you are interested.
And finally, our function:
define zeroext i1 @_Z9is_sortedPii(i32* %a, i32 %n) #0
"zeroext" means that the function's return value (an i1, a one-bit integer) must be zero-extended by the backend to whatever width the ABI requires. Next comes the mangled function name, and then the parameter list, which is essentially the same as in the C++ code, except that i32 denotes a 32-bit integer. The #0 connects this function to
an attribute group near the end of the file.
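A hedged aside of my own: the mangled name simply encodes the signature under the Itanium C++ ABI, so _Z9is_sortedPii decodes to is_sorted(int*, int). Declaring the function with C linkage would suppress the mangling:

// Hypothetical variant (not in the original post): with C linkage the symbol
// name is simply "is_sorted_c"; the C++ function above instead gets the
// mangled name _Z9is_sortedPii, which decodes to is_sorted(int*, int).
extern "C" bool is_sorted_c(int *a, int n) {
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      return false;
  return true;
}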
Here is the first basic block:
entry:
  %retval = alloca i1, align 1
  %a.addr = alloca i32*, align 8
  %n.addr = alloca i32, align 4
  %i = alloca i32, align 4
  store i32* %a, i32** %a.addr, align 8
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond
Every LLVM instruction must live inside a basic block: a group of instructions with a single entry point at its top and a single exit point at its bottom. The last instruction of a basic block must be a
terminator instruction: falling through into the next basic block is not allowed. Every function must have an entry block that has no predecessors, that is, no other blocks branch to it. These and
other properties are checked when the IR is parsed, and they can also be checked multiple times during compilation by the "module verifier." The verifier is useful for debugging situations where a pass generates malformed IR.
The first four instructions in this basic block are allocas: stack memory allocations. The first three are for variables created implicitly during compilation, the fourth is for the loop induction variable. Storage allocated this way can only be accessed with load and store instructions. The next three instructions initialize three of the stack slots: a.addr and n.addr are initialized with the values passed into the function as parameters, and i is initialized to zero. The return value does not need to be initialized: any code whose behavior is not undefined at the C or C++ level will take care of that later. The last instruction is an unconditional branch to the next basic block (do not worry about it: most of these unnecessary jumps will be elided by the LLVM backend).
You might ask: why does Clang allocate stack slots for a and n? Why doesn't it simply use these values directly? In this particular function, since a and n are never modified, that strategy would work, but noticing this is the optimizer's job and outside Clang's competence. If a and n can be modified, they must live in memory rather than being SSA values, which by definition are given a value at exactly one point in the program. Memory locations are outside the SSA world and can be modified at any time. This may seem odd, but it is this choice that lets every part of the compiler do its job in a natural and efficient way.
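To make this concrete, here is a minimal sketch of my own (not from the article): a function whose parameter is reassigned, so at -O0 it genuinely needs a memory slot rather than a single SSA value.

// Hypothetical example: because n is assigned more than once, it cannot be a
// single SSA value. At -O0 Clang gives it an alloca'd stack slot and emits a
// load before and a store after the update; the optimizer can later promote
// the slot back into SSA registers.
int clamp_to_zero(int n) {
  if (n < 0)
    n = 0;   // a second assignment to n
  return n;
}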
One way to think of Clang's output is as degenerate SSA code: it satisfies all the SSA requirements, but only because the basic blocks exchange information through memory. Generating non-degenerate code would require some care and some analysis, and Clang's developers deliberately avoid it in order to keep code generation and optimization as separate responsibilities. I have not seen measurements, but my understanding is that a huge number of memory operations are generated and then almost immediately deleted by the optimizer, without adding much compile-time overhead.
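For the curious (this aside is mine, not from the post): the optimizer pass that promotes these stack slots back into SSA registers is called mem2reg, and you can run it by hand on the textual IR with something like:

opt -S -mem2reg is_sorted.ll -o is_sorted.mem2reg.ll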
Consider how the for loop is translated. In general, it looks like this:
for (initializer; condition; modifier) {
  body
}
This translates to something like this:
  initializer
  goto COND
COND:
  if (condition)
    goto BODY
  else
    goto EXIT
BODY:
  body
  modifier
  goto COND
EXIT:
Of course, this translation is not specific to Clang; any C or C++ compiler would do the same.
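As a rough hand-written sketch of my own (Clang does not literally emit this), here is is_sorted restructured with explicit labels that mirror the basic blocks we are about to walk through:

// Hypothetical restructuring of is_sorted: the labels correspond to the
// for.cond, for.body, for.inc, and for.end basic blocks in the IR below.
bool is_sorted_gotos(int *a, int n) {
  int i = 0;              // the loop initializer ends up in the entry block
for_cond:
  if (!(i < n - 1))
    goto for_end;         // condition failed: leave the loop
  // for.body:
  if (a[i] > a[i + 1])
    return false;         // the if.then block
  // for.inc:
  i++;
  goto for_cond;
for_end:
  return true;
}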
In our example, the loop initializer has been folded into the entry basic block. The next basic block is the loop condition test:
for.cond:                              ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %n.addr, align 4
  %sub = sub nsw i32 %1, 1
  %cmp = icmp slt i32 %0, %sub
  br i1 %cmp, label %for.body, label %for.end
As a helpful comment, Clang tells us that this basic block can be reached either from for.inc or from the entry block. This block loads i and n from memory, subtracts one from n (the nsw flag preserves the C++-level fact that signed overflow is undefined; without it, the LLVM subtraction would have two's complement semantics), compares the decremented value against i using "slt" (signed less than), and finally branches either to the for.body or to the for.end basic block.
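As a small illustration of my own of what nsw records: signed overflow is undefined behavior in C++, so the optimizer is allowed to assume the subtraction never wraps.

// Hypothetical example (mine): for an int n, "n - 1" can only overflow when
// n == INT_MIN, and signed overflow is undefined behavior in C++. The "nsw"
// (no signed wrap) flag carries that guarantee into the IR; without it the
// subtraction would be defined to wrap around in two's complement.
int sub_one(int n) {
  return n - 1;
}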
The loop body can only be entered from the for.cond block:
for.body:
  %2 = load i32*, i32** %a.addr, align 8
  %3 = load i32, i32* %i, align 4
  %idxprom = sext i32 %3 to i64
  %arrayidx = getelementptr inbounds i32, i32* %2, i64 %idxprom
  %4 = load i32, i32* %arrayidx, align 4
  %5 = load i32*, i32** %a.addr, align 8
  %6 = load i32, i32* %i, align 4
  %add = add nsw i32 %6, 1
  %idxprom1 = sext i32 %add to i64
  %arrayidx2 = getelementptr inbounds i32, i32* %5, i64 %idxprom1
  %7 = load i32, i32* %arrayidx2, align 4
  %cmp3 = icmp sgt i32 %4, %7
  br i1 %cmp3, label %if.then, label %if.end
The first two lines load a and i into SSA registers; i is then sign-extended to 64 bits so it can take part in the address computation. The
getelementptr instruction (gep for short) is LLVM's famously intricate pointer-computation instruction; it even has its own
help section . Unlike a machine language, LLVM does not treat pointers as integers, which makes alias analysis and other memory optimizations easier. This code then loads a[i] and a[i + 1], compares them, and branches on the result.
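Here is a short sketch of my own (the helper name is made up) showing, at the C++ level, the address arithmetic that the first gep performs:

#include <cstdint>

// Hypothetical helper: the gep computes the same address as &a[i], i.e. the
// base pointer plus i elements (i * sizeof(int) bytes). The sext instruction
// corresponds to widening the 32-bit index to 64 bits before scaling.
int load_element(int *a, int i) {
  int64_t idx = i;       // sext i32 %3 to i64
  int *p = a + idx;      // getelementptr inbounds i32, i32* %a, i64 %idx
  return *p;             // load i32, i32* %arrayidx
}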
The if.then block stores 0 into the stack slot for the function's return value and branches unconditionally to the function's exit block:
if.then:
  store i1 false, i1* %retval, align 1
  br label %return
The else block is trivial:
if.end:
  br label %for.inc
And the block for adding one to the loop variable is also very simple:
for.inc:
  %8 = load i32, i32* %i, align 4
  %inc = add nsw i32 %8, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond
This code goes back to the loop condition check.
If the loop ends normally, we return true:
for.end:
  store i1 true, i1* %retval, align 1
  br label %return
Finally, whatever was stored into the return-value stack slot is loaded and returned:
return:
  %9 = load i1, i1* %retval, align 1
  ret i1 %9
There is nothing else of interest at the end of the function. This post turned out to be longer than I expected; in the next one we will look at the IR-level optimizations performed on this function.
(Thanks to Xi Wang and Alex Rosenberg for sending corrections.)