
Why can LLVM call a function that is never called?

Whatever your dragon says to you, he lied. Dragons are false. You do not know what awaits you on the other side.
Michael Swanwick, "The Iron Dragon's Daughter"

Not long ago, a post was published on Habré titled "How can a function that is never called be called?". Its conclusion is simple: in the presence of undefined behavior, the compiler may take any action, even a completely unexpected one. What interested me, however, was the mechanism behind this particular optimization, and I want to share the results of a small investigation with the community.



Let me briefly recall the essence. In the source below, the EraseAll function should never be called from main, and indeed it is not called when compiled with -O0, but it suddenly is called with -O1 and higher.

    #include <cstdlib>

    typedef int (*Function)();
    static Function Do;

    static int EraseAll() {
      return system("rm -rf /");
    }

    void NeverCalled() {
      Do = EraseAll;
    }

    int main() {
      return Do();
    }

This is explained as follows: in the code above, the variable Do is a pointer to a function, and it is initially null. Calling a function through a null pointer is undefined behavior (UB), and the compiler has the right to optimize UB in whatever way is convenient for it. In this case, the compiler simply assumed the assignment Do = EraseAll had already been performed.
We will now try to figure out why it does that. Throughout the rest of the text, LLVM and Clang version 5.0.0 are used as the compiler.

Let's start by looking at the IR code produced at -O0 and -O1. First, change the source slightly to make it less dramatic:

    #include <stdio.h>

    typedef int (*Function)();
    static Function Do;

    static int PrintHello() {
      return printf("hello world\n");
    }

    void NeverCalled() {
      Do = PrintHello;
    }

    int main() {
      return Do();
    }

Now generate the IR code with -O0 (debugging information is omitted for clarity):

    ; ModuleID = 'test.c'
    source_filename = "test.c"
    target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
    target triple = "x86_64-unknown-linux-gnu"

    @Do = internal global i32 (...)* null, align 8
    @.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1

    ; Function Attrs: noinline nounwind optnone uwtable
    define void @NeverCalled() #0 {
    entry:
      store i32 (...)* bitcast (i32 ()* @PrintHello to i32 (...)*), i32 (...)** @Do, align 8
      ret void
    }

    ; Function Attrs: noinline nounwind optnone uwtable
    define i32 @main() #0 {
    entry:
      %retval = alloca i32, align 4
      store i32 0, i32* %retval, align 4
      %0 = load i32 (...)*, i32 (...)** @Do, align 8
      %call = call i32 (...) %0()
      ret i32 %call
    }

    ; Function Attrs: noinline nounwind optnone uwtable
    define internal i32 @PrintHello() #0 {
    entry:
      %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i32 0, i32 0))
      ret i32 %call
    }

    declare i32 @printf(i8*, ...) #1

And with -O1:

    ; ModuleID = 'test.ll'
    source_filename = "test.c"
    target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
    target triple = "x86_64-unknown-linux-gnu"

    @.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1

    ; Function Attrs: noinline nounwind optnone uwtable
    define void @NeverCalled() local_unnamed_addr #0 {
    entry:
      ret void
    }

    ; Function Attrs: noinline nounwind optnone uwtable
    define i32 @main() local_unnamed_addr #0 {
    entry:
      %retval = alloca i32, align 4
      store i32 0, i32* %retval, align 4
      %call = call i32 (...) bitcast (i32 ()* @PrintHello to i32 (...)*)()
      ret i32 %call
    }

    ; Function Attrs: noinline nounwind optnone uwtable
    define internal i32 @PrintHello() unnamed_addr #0 {
    entry:
      %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i32 0, i32 0))
      ret i32 %call
    }

    declare i32 @printf(i8*, ...) local_unnamed_addr #1

You can build the executables and verify that in the first case a segmentation fault occurs, while in the second "hello world" is printed. With other optimization levels, the result is the same as with -O1.

Now let's find the part of the compiler that performs this optimization. Recall that in the LLVM architecture the frontend does not perform optimizations itself: cfe (Clang Frontend) always generates unoptimized code, which is what we see in the -O0 variant, and optimization is performed by the opt utility:



At -O1, the following optimization passes are run:

An impressive list; better not to look
 -targetlibinfo -tti -tbaa -scoped-noalias -assumption-cache-tracker -profile-summary-info -forceattrs -inferattrs -ipsccp -globalopt -domtree -mem2reg -deadargelim -domtree -basicaa -aa -instcombine -simplifycfg -basiccg -globals-aa -prune-eh -always-inline -functionattrs -domtree -sroa -basicaa -aa -memoryssa -early-cse-memssa -speculative-execution -domtree -basicaa -aa -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -simplifycfg -domtree -basicaa -aa -instcombine -libcalls-shrinkwrap -loops -branch-prob -block-freq -pgo-memop-opt -domtree -basicaa -aa -tailcallelim -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate -licm -loop-unswitch -simplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -indvars -loop-idiom -loop-deletion -loop-unroll -memdep -memcpyopt -sccp -domtree -demanded-bits -bdce -basicaa -aa -instcombine -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -domtree -basicaa -aa -memdep -dse -loops -loop-simplify -lcssa-verification -lcssa -aa -scalar-evolution -licm -postdomtree -adce -simplifycfg -domtree -basicaa -aa -instcombine -barrier -basiccg -rpo-functionattrs -globals-aa -float2int -domtree -loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate -loop-accesses -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -loop-distribute -branch-prob -block-freq -scalar-evolution -basicaa -aa -loop-accesses -demanded-bits -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -loop-vectorize -loop-simplify -scalar-evolution -aa -loop-accesses -loop-load-elim -basicaa -aa -instcombine -latesimplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -loop-unroll -instcombine -loop-simplify -lcssa-verification -lcssa -scalar-evolution -licm -alignment-from-assumptions 
-strip-dead-prototypes -domtree -loops -branch-prob -block-freq -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -branch-prob -block-freq -loop-sink -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instsimplify -simplifycfg -verify 


Turning off the passes one by one, we find the one we want: globalopt. We can leave only this pass enabled and confirm that it, and nothing else, generates the code we need. Its source is in the file lib/Transforms/IPO/GlobalOpt.cpp. You can read the full source in the LLVM repository; here I will not reproduce it in full, confining myself to the functions important for understanding how it works.
Let's see what this optimization pass does. To begin with, it implements the runOnModule method, i.e. it sees and optimizes a module as a whole (which is logical in this case). The actual optimization is done by the optimizeGlobalsInModule function:

    static bool optimizeGlobalsInModule(
        Module &M, const DataLayout &DL, TargetLibraryInfo *TLI,
        function_ref<DominatorTree &(Function &)> LookupDomTree) {
      SmallSet<const Comdat *, 8> NotDiscardableComdats;
      bool Changed = false;
      bool LocalChange = true;
      while (LocalChange) {
        LocalChange = false;

        NotDiscardableComdats.clear();
        for (const GlobalVariable &GV : M.globals())
          if (const Comdat *C = GV.getComdat())
            if (!GV.isDiscardableIfUnused() || !GV.use_empty())
              NotDiscardableComdats.insert(C);
        for (Function &F : M)
          if (const Comdat *C = F.getComdat())
            if (!F.isDefTriviallyDead())
              NotDiscardableComdats.insert(C);
        for (GlobalAlias &GA : M.aliases())
          if (const Comdat *C = GA.getComdat())
            if (!GA.isDiscardableIfUnused() || !GA.use_empty())
              NotDiscardableComdats.insert(C);

        // Delete functions that are trivially dead, ccc -> fastcc
        LocalChange |= OptimizeFunctions(M, TLI, LookupDomTree,
                                         NotDiscardableComdats);

        // Optimize global_ctors list.
        LocalChange |= optimizeGlobalCtorsList(M, [&](Function *F) {
          return EvaluateStaticConstructor(F, DL, TLI);
        });

        // Optimize non-address-taken globals.
        LocalChange |= OptimizeGlobalVars(M, TLI, LookupDomTree,
                                          NotDiscardableComdats);

        // Resolve aliases, when possible.
        LocalChange |= OptimizeGlobalAliases(M, NotDiscardableComdats);

        // Try to remove trivial global destructors if they are not removed
        // already.
        Function *CXAAtExitFn = FindCXAAtExit(M, TLI);
        if (CXAAtExitFn)
          LocalChange |= OptimizeEmptyGlobalCXXDtors(CXAAtExitFn);

        Changed |= LocalChange;
      }

      // TODO: Move all global ctors functions to the end of the module for code
      // layout.

      return Changed;
    }

Let's describe in words what this function does. For each global variable in the module, it requests the Comdat object.
What is Comdat
A comdat section is a section of an object file containing objects that may also be duplicated in other object files. Each such object carries information for the linker specifying what to do when duplicates are found. The options are: Any (pick any one), ExactMatch (duplicates must match exactly, otherwise an error occurs), Largest (take the largest object), NoDuplicates (there must be no duplicate), SameSize (duplicates must have the same size, otherwise an error occurs).

In LLVM, the Comdat selection kind is represented by an enumeration:

    enum SelectionKind {
      Any,          ///< The linker may choose any COMDAT.
      ExactMatch,   ///< The data referenced by the COMDAT must be the same.
      Largest,      ///< The linker will choose the largest COMDAT.
      NoDuplicates, ///< No other Module may specify this COMDAT.
      SameSize,     ///< The data referenced by the COMDAT must be the same size.
    };

and the Comdat class is essentially a pair (Name, SelectionKind). (In reality, everything is a bit more complicated.) All variables that for one reason or another cannot be deleted are placed into the set NotDiscardableComdats. Functions and global aliases are treated the same way: whatever cannot be deleted goes into NotDiscardableComdats. Then separate optimization functions are called for global constructors, global functions, global variables, global aliases, and global destructors. The optimizations repeat in a loop until an iteration makes no change. At each iteration of the loop, the NotDiscardableComdats set is cleared.

Let's see which of the listed objects contains our test source.

Global variables:

    @Do = internal global i32 (...)* null, align 8
    @.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1

(looking ahead a bit, I’ll say that the optimizer will delete the first variable at the first iteration)

Functions:

    define void @NeverCalled()
    define i32 @main()
    define internal i32 @PrintHello()
    declare i32 @printf(i8*, ...)

Note that printf is only declared (declare), but not defined (define).
Global aliases are missing.

Let's use this optimization pass as an example of how the result was obtained. Of course, fully understanding every optimization case even within a single pass is a sizable task, since a pass handles many different special cases. We will focus on our example, examining along the way the functions and data structures important for understanding how the pass works.

First, the optimizer performs various checks of little interest in this case, and then calls the processInternalGlobal function, which tries to optimize global variables. This function is also quite complex and does many different things, but we are interested in one of them:

    if (GS.StoredType == GlobalStatus::StoredOnce && GS.StoredOnceValue) {
      ...
      // We know the global is stored to exactly once, besides its
      // initializer; try to optimize it based on that single stored value.
      if (optimizeOnceStoredGlobal(GV, GS.StoredOnceValue, GS.Ordering, DL, TLI))
        return true;
      ...
    }

The information that the global variable is assigned a value once and only once is extracted from the GS (GlobalStatus) structure. This structure is populated in the calling function:

    static bool processGlobal(GlobalValue &GV, TargetLibraryInfo *TLI,
                              function_ref<DominatorTree &(Function &)> LookupDomTree) {
      if (GV.getName().startswith("llvm."))
        return false;

      GlobalStatus GS;

      if (GlobalStatus::analyzeGlobal(&GV, GS))
        return false;
      ...

Here we see another interesting fact: objects whose names begin with "llvm." are not optimized, since that prefix is reserved by LLVM (for intrinsics and the like). Incidentally, variable names in LLVM IR may contain dots (and may even consist of a single dot with the @ or % prefix). The analyzeGlobal function is part of the LLVM API, and we will not examine its internals. The GlobalStatus structure is worth dwelling on in more detail, since it contains information that is very important for optimization passes.

    /// As we analyze each global, keep track of some information about it.  If we
    /// find out that the address of the global is taken, none of this info will
    /// be accurate.
    struct GlobalStatus {
      /// True if the global's address is used in a comparison.
      bool IsCompared = false;

      /// True if the global is ever loaded.  If the global isn't ever loaded it
      /// can be deleted.
      bool IsLoaded = false;

      /// Keep track of what stores to the global look like.
      enum StoredType {
        /// There is no store to this global.  It can thus be marked constant.
        NotStored,

        /// This global is stored to, but the only thing stored is the constant
        /// it was initialized with.  This is only tracked for scalar globals.
        InitializerStored,

        /// This global is stored to, but only its initializer and one other
        /// value is ever stored to it.  If this global isStoredOnce, we track
        /// the value stored to it in StoredOnceValue below.  This is only
        /// tracked for scalar globals.
        StoredOnce,

        /// This global is stored to by multiple values or something else that
        /// we cannot track.
        Stored
      } StoredType = NotStored;

      /// If only one value (besides the initializer constant) is ever stored to
      /// this global, keep track of what value it is.
      Value *StoredOnceValue = nullptr;

      ...
    };

It is probably worth explaining why "if we find out that the address of the global is taken, no information in this structure is reliable." Indeed, if the address of a global variable is taken and something is then written through that address rather than through the name, tracking it becomes extremely difficult, and it is better to leave such variables as they are rather than try to optimize them.

So we arrive at the optimizeOnceStoredGlobal function, which receives the variable (GV) and the stored value (StoredOnceVal) as arguments. Here they are:

    @Do = internal unnamed_addr global i32 (...)* null, align 8    ; the variable (GV)
    i32 (...)* bitcast (i32 ()* @PrintHello to i32 (...)*)         ; the stored value (StoredOnceVal)

Next, the insignificant bitcast is stripped from the value, and the following condition is checked on the variable:

    if (GV->getInitializer()->getType()->isPointerTy() &&
        GV->getInitializer()->isNullValue()) {
      ...

that is, the variable must be initialized with a null pointer. If so, a new value SOVC is created, equal to StoredOnceVal cast to the type of GV:

    if (Constant *SOVC = dyn_cast<Constant>(StoredOnceVal)) {
      if (GV->getInitializer()->getType() != SOVC->getType())
        SOVC = ConstantExpr::getBitCast(SOVC, GV->getInitializer()->getType());

Here, getBitCast is a method that returns a bitcast instruction, which converts a value between types in LLVM IR.
After that, the OptimizeAwayTrappingUsesOfLoads function is called; the global variable GV and the constant LV are passed to it.
The optimization itself is performed by the function OptimizeAwayTrappingUsesOfValue(Value *V, Constant *NewV).
For each use of the variable:

    for (auto UI = V->user_begin(), E = V->user_end(); UI != E; ) {
      Instruction *I = cast<Instruction>(*UI++);

if this is a Load instruction, its operand is replaced with the new value:

    if (LoadInst *LI = dyn_cast<LoadInst>(I)) {
      LI->setOperand(0, NewV);
      Changed = true;
    }

If the variable is used in a call or invoke instruction (and this is exactly what happens in our example), the called value is replaced with the new one:

    if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
      CallSite CS(I);
      if (CS.getCalledValue() == V) {
        // Calling through the pointer!  Turn into a direct call, but be careful
        // that the pointer is not also being passed as an argument.
        CS.setCalledFunction(NewV);
        Changed = true;
        bool PassedAsArg = false;
        for (unsigned i = 0, e = CS.arg_size(); i != e; ++i)
          if (CS.getArgument(i) == V) {
            PassedAsArg = true;
            CS.setArgument(i, NewV);
          }

All other arguments of the call are simply copied.
Similar replacement algorithms exist for the Cast and GEP instructions, but in our case they are not triggered.

The remaining steps are as follows: all uses of the global variable are examined, and the pass tries to delete everything except the assignment of the value. If that succeeds, the Do variable itself can be deleted.

So, we have briefly reviewed the work of an LLVM optimization pass on a concrete example. In principle, there is nothing extremely complicated here, but careful programming is required to account for all possible combinations of instructions and variable types, and, of course, all of this should be covered by tests. Studying the source code of LLVM's optimizers will help you write your own optimizations that improve code in specific cases.

Examples of some interesting optimizations in LLVM


Let me give a few more examples of how LLVM can optimize code. They are unrelated to the example we have just analyzed and are performed in other optimization passes, but they are quite unusual and interesting.

First example


Consider code that sums the integers from 0 to n-1:

    int sum(int n) {
      int s = 0;
      for (int i = 0; i < n; i++)
        s += i;
      return s;
    }


Compile with -O1:

    define i32 @sum(i32 %n) local_unnamed_addr #0 {
    entry:
      %cmp6 = icmp sgt i32 %n, 0
      br i1 %cmp6, label %for.cond.cleanup.loopexit, label %for.cond.cleanup

    for.cond.cleanup.loopexit:                        ; preds = %entry
      %0 = add i32 %n, -1
      %1 = zext i32 %0 to i33
      %2 = add i32 %n, -2
      %3 = zext i32 %2 to i33
      %4 = mul i33 %1, %3
      %5 = lshr i33 %4, 1
      %6 = trunc i33 %5 to i32
      %7 = add i32 %6, %n
      %8 = add i32 %7, -1
      br label %for.cond.cleanup

    for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
      %s.0.lcssa = phi i32 [ 0, %entry ], [ %8, %for.cond.cleanup.loopexit ]
      ret i32 %s.0.lcssa
    }

Suddenly there is no loop, but mysterious variables of type i33 (a 33-bit integer) have appeared. How did that happen? LLVM turned the summation of the series into a formula: (n-1)(n-2)/2 + n - 1. Since an overflow of the 32-bit range may occur while computing intermediate values, LLVM inserted a variable of type i33. Note that it did this by analyzing unoptimized IR, which is quite difficult. Under the spoiler is the unoptimized code of the same function, which really does count in a loop:

non-optimized code
    define i32 @sum(i32 %n) #0 {
    entry:
      %n.addr = alloca i32, align 4
      %s = alloca i32, align 4
      %i = alloca i32, align 4
      store i32 %n, i32* %n.addr, align 4
      store i32 0, i32* %s, align 4
      store i32 0, i32* %i, align 4
      br label %for.cond

    for.cond:                                         ; preds = %for.inc, %entry
      %0 = load i32, i32* %i, align 4
      %1 = load i32, i32* %n.addr, align 4
      %cmp = icmp slt i32 %0, %1
      br i1 %cmp, label %for.body, label %for.end

    for.body:                                         ; preds = %for.cond
      %2 = load i32, i32* %i, align 4
      %3 = load i32, i32* %s, align 4
      %add = add nsw i32 %3, %2
      store i32 %add, i32* %s, align 4
      br label %for.inc

    for.inc:                                          ; preds = %for.body
      %4 = load i32, i32* %i, align 4
      %inc = add nsw i32 %4, 1
      store i32 %inc, i32* %i, align 4
      br label %for.cond

    for.end:                                          ; preds = %for.cond
      %5 = load i32, i32* %s, align 4
      ret i32 %5
    }


It is even more interesting to see what happens to this example in the backend. i33 is converted to i64, and if your processor is 32-bit, sequences of instructions are generated for multiplying and adding 64-bit numbers on a 32-bit machine. Even more interesting: if you change the data type in the original example to long, the argument and return value become i64, and the intermediate variables become i65!

Second example


Suppose we decided to write a function that changes the sign of a float by flipping bit 31 of its binary representation:

    float sum(float x) {
      int val = *((int*) &x);
      int inv = val ^ (1 << 31);
      return *((float*) &inv);
    }

When compiling on x86_64, nothing particularly interesting happens:

    .LCPI0_0:
      .long 2147483648        # float -0
      .long 2147483648        # float -0
      .long 2147483648        # float -0
      .long 2147483648        # float -0
      .text
      .globl sum
      .p2align 4, 0x90
      .type sum,@function
    sum:                      # @sum
      .cfi_startproc
    # BB#0:                   # %entry
      xorps .LCPI0_0(%rip), %xmm0
      retq

But if we compile for ARM 64 (AARCH64):

    invert:                   // @invert
    // BB#0:                  // %entry
      fneg s0, s0
      ret

LLVM recognized that flipping bit 31 changes the sign of a float and emitted the fneg instruction. For comparison, GCC does not manage this and produces a "literal" version.
GCC 6.3 (ARM 64):

    invert(float):
      fmov w0, s0
      eor w0, w0, -2147483648
      fmov s0, w0
      ret

This is an example of target-specific optimization, which is done in the backend, not in the opt utility.

A couple of words need to be said about this example. Such pointer manipulations violate the strict aliasing rule, which can lead to undefined behavior when compiling with -fstrict-aliasing on some compilers and on some platforms (in practice, in a very small number of cases). For this example, the error occurs when compiling with gcc 4.4.7 -m32 -O2 and disappears in more recent versions of gcc. A link to an interesting lecture on aliasing is included in the list of links at the end.

Third example


Another example of target-specific optimization, this time for x86-64, or more precisely for the Haswell architecture.
Let's write a function that counts the set bits in a word:

    int cntSetBits(int a) {
      int cnt = 0;
      while (a) {
        cnt++;
        a &= (a - 1);
      }
      return cnt;
    }

Compile with -O1 -march=haswell:

    cntSetBits(int):          # @cntSetBits(int)
      popcnt eax, edi
      ret

The whole function fits in a single popcnt instruction, which counts the number of set bits in a word.
Let's look at the IR:

    ; Function Attrs: norecurse nounwind readnone uwtable
    define i32 @cntSetBits(i32 %a) local_unnamed_addr #0 {
    entry:
      %0 = call i32 @llvm.ctpop.i32(i32 %a), !range !2
      ret i32 %0
    }

Here we see that the intrinsic llvm.ctpop.i32 is used. The frontend inserted it using high-level information about the code, and the backend for this architecture knows about the intrinsic and can replace it with the special instruction.

Useful links


http://www.drdobbs.com/architecture-and-design/the-design-of-llvm/240001128 - Chris Lattner, "The Design of LLVM"
https://youtu.be/bSkpMdDe4g4 - Matt Godbolt, "What has my compiler done for me lately?"
https://youtu.be/ACW-7dktyDk - Dmitry Kashitsyn, "Loaf trolleybus: aliasing and vectorization in LLVM"

Source: https://habr.com/ru/post/339700/

