One of the most interesting and important components of a modern optimizing compiler is interprocedural analysis and optimization. Good programming style and the need to divide work among developers dictate that a large project be partitioned into separate modules, in which the main utilities are implemented as "black boxes" for most of their users. At best, the implementation details are known to two or three specific developers, and sometimes, after enough years have passed, they are not known to anyone at all. (What can you do - specialization, globalization.) Your code often contains calls to external functions whose bodies are defined in other files or libraries, and your knowledge of these functions is minimal. In addition, large projects tend to accumulate all sorts of global variables through which the components of the project exchange valuable information, and to understand how your part of the code behaves when something goes wrong, you have to dig through a lot of code. Obviously, all of this greatly complicates the work of an optimizing compiler. What negative effects modularity produces and what special tools the compiler has for overcoming them is a topic for a long conversation. Here I will try to start that conversation. I will talk about some interesting aspects of the work of an optimizing compiler using the Intel compiler as an example. I will assume OS Windows, so the compiler options I cite are the ones characteristic of that platform. And, to make life easier for myself, I will sometimes use the abbreviation IPA for interprocedural analysis and IPO for interprocedural optimization. I will begin with a story about the models of interprocedural analysis and the call graph.
The impact of modularity on performance
To understand what problems modularity may cause, you need to look at what options the compiler has, how it tries to solve particular problems, and what restrictions apply to the analysis methods it uses. The compiler must work conservatively, i.e. with complete distrust of the programmer and the program, and perform a given optimization only if its correctness has been fully proven.
The simplest mode of operation is one-pass compilation. In this mode the compiler processes, in turn, all the functions from a source file together with all the preceding declarations. That is, the compiler sequentially parses the characters of the source file, fills its tables with function definitions and descriptions of the defined types, and sequentially parses the bodies of all the functions in the file being processed. First the compiler checks the program for compliance with the requirements of the language standard; then it can perform various optimizations - loop, scalar, control-flow-graph optimizations, etc. At the end of processing, the result is saved to an object file. Working this way, the task of getting a working application is solved successfully (and usually relatively quickly), but the result is very far from optimal. The reason is that the compiler has insufficient information about objects that are not local to the function being processed, and very little information about the functions it calls. Does this situation affect the performance of the application, and what problems are associated with the use of function calls? There is no need to look far for examples. I will list some of the most important ones, in my opinion:
- Problems with disambiguating memory accesses. For example, in my previous posts I discussed the problem of recognizing loops and the accompanying problem of memory disambiguation (alias analysis). That is, if the program has pointer arguments and/or global variables, there is a possibility that the same memory is accessed through several different objects. Such uncertainty prevents the compiler from recognizing loops, creates assumed dependencies that block loop optimizations, complicates data-flow analysis, etc. The compiler often tries to solve such problems by creating multi-versioned code, i.e. it inserts runtime checks. But this makes the code heavier and works well only in simple cases where the assumed dependencies are few; what if there are a lot of unresolved questions? (See the sketch after this list.)
- When actual arguments are passed to a function, many of their properties that could affect the quality of optimization are lost. For example, information about the placement of objects in memory (e.g. their alignment) affects the gain from vectorization. If an actual argument is a constant, propagating it into the function can also simplify the computations performed there. And so on.
- Calling a function with unknown properties makes data-flow analysis harder. It becomes necessary to assume that this unknown call can change all the data it could possibly reach according to the language standard (for example, global variables and objects that are accessible through pointers passed to the function as arguments). This interferes with optimizations such as constant propagation, common subexpression elimination, and many others.
- Loops that call unknown functions cannot be correctly classified (as loops with a known number of iterations), because an unknown function may contain an exit from the program. As a result, almost all loop optimizations are suppressed for such loops.
- The function call itself is quite expensive. It is necessary to prepare the arguments, allocate space on the stack for local variables, and load instructions from a new address for execution; chances are those addresses will not be in the cache. If the body of the function is small, it may turn out that the cost of calling it is huge compared to the time the function actually runs.
- With static linking, the program will contain redundant code — unused functions from sources and libraries.
Probably, the list goes on.
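Here is a minimal C sketch of the first and third problems from the list (the names axpy and update are mine, purely illustrative). Without interprocedural information the compiler must assume that dst and src may overlap and that update() may modify the global scale or the arrays:

extern void update(void);          /* body lives in another file, properties unknown */

float scale;                       /* global, visible to update() */

void axpy(float *dst, const float *src, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        update();                  /* unknown call: may change scale, dst or src */
        dst[i] += scale * src[i];  /* so scale must be reloaded on every iteration */
    }
}

The compiler either gives up on vectorizing this loop or generates multi-versioned code with runtime checks; interprocedural information about update() and about the actual arguments passed to axpy can remove both obstacles.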
This gives rise to the desire to collect various properties of functions and then use them when optimizing and processing each particular function. That, in essence, is the main idea of interprocedural analysis. To perform such an analysis, the relationships between all the functions involved in the computation must be taken into account. Clearly, to establish the properties of a certain function, it is necessary to analyze all the functions it calls, and so on. For example, some languages have the notion of a "pure" function, and it is useful to compute a similar attribute for all functions. A pure function does not change its arguments or any global variables and does not perform any I/O. Clearly, this attribute can be set for a function only if all the functions it calls are pure as well.
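A minimal C sketch of this distinction (the function names are mine and purely illustrative; the attribute itself is something the compiler computes, not something you write):

#include <stdio.h>

int pure_add(int a, int b)
{
    return a + b;                   /* reads only its arguments, no side effects: pure */
}

int g_counter;                      /* global state */

int impure_add(int a, int b)
{
    g_counter++;                    /* modifies a global variable ... */
    printf("impure_add called\n");  /* ... and calls printf, which performs I/O: not pure */
    return a + b;
}

Even without the write to g_counter, impure_add could not be marked pure, because it calls printf, which is itself not pure.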
To reflect these mutual influences and relationships, the call graph is used. Its vertices are the functions and its edges are the calls: if function foo calls bar, then the vertices corresponding to these functions are connected by an edge. Since there may be several calls, there may be several edges between the same vertices, i.e. the call graph is a multigraph.
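For example, in the following C sketch foo calls bar twice, so the two corresponding vertices are connected by two edges:

static int bar(int x)
{
    return x + 1;
}

int foo(int x)
{
    return bar(x) + bar(x + 1);   /* two call sites: two edges from foo to bar */
}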
IPA works with a static call graph, i.e. a graph reflecting all possible paths that control could take during program execution. Sometimes, when analyzing performance, you need to look at a dynamic call graph instead, i.e. the actual paths along which control was transferred when the program was run on some particular data set.
To build a call graph, you must first go through all the functions and determine which calls each function makes, then analyze the constructed graph and determine the properties of each function, then apply the various interprocedural optimizations, and only after that process each individual function. With this order of work, the body of every function has to be processed at least twice, which is why this compilation method is called two-pass.

A scheme of two-pass versus one-pass compilation can be drawn something like this.
When the programmer specifies the option that enables interprocedural optimization, the compiler performs lexical and syntactic analysis of each source file and, if the file conforms to the language standard, creates an internal representation and packs it into an object file. When all the necessary files with the internal representation have been created, the compiler goes through them all again, collects information about each function, finds all the calls inside the functions and builds the call graph. Analyzing the call graph, the compiler propagates various properties of functions through it, analyzes the properties of various objects, and so on. After that, interprocedural optimizations are performed that change the call graph or the properties of objects. (I hope that after some time I will get around to discussing the most interesting of these optimizations in more detail.) Many optimizations are applicable only if the compiler manages to prove that an executable file is being generated and that the call graph contains all the functions of the program. Such optimizations are called whole program optimizations. One example is the removal from the call graph of individual functions, or whole parts of the graph, that cannot be reached from main (dead code elimination). Another famous example is the modification of data structures. Most interprocedural optimizations, however, can be performed for both complete and incomplete call graphs. The best-known interprocedural optimization is inlining, which substitutes the body of a function in place of a call to it.
After all interprocedural optimizations have been performed, all the functions included in the final call graph are processed in turn by the optimizing part of the compiler (the backend). The difference from single-pass compilation is that some additional information may now be known about each called function, for example whether it changes a particular object. This additional information allows the optimizations to be performed more effectively.
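A minimal sketch of the kind of opportunity this opens up (the names log_event and g_limit are mine, purely illustrative):

extern void log_event(void);   /* defined in another source file */
int g_limit = 100;

int check(int x)
{
    if (x >= g_limit)
        return 0;
    log_event();               /* without IPA the compiler must assume g_limit may have changed */
    return x < g_limit;        /* if IPA proves log_event() never touches g_limit,
                                  this reload and comparison can be folded away */
}

Without interprocedural information the second read of g_limit cannot be removed; with it, the function collapses to a single comparison.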
In fact, it is clear from my description that the term "two-pass compilation" is rather conventional: there may actually be more than two passes over a function body.
So, if you use the compiler to create object files, for example with the -c option, then different entities will be created depending on the presence or absence of the -Qipo option. With -Qipo, the object file is a packaged internal representation rather than a regular object file. The standard linker (link) and the library-building utility (lib) will not understand such files; to work with them you need to use their extended Intel analogues, xilink and xilib. A library created from object files with the internal representation will contain sections with that representation, which means that the functions it contains will be able to participate in interprocedural analysis at the moment such a library is linked into an application (built with the Intel compiler and -Qipo). It follows that to get the maximum benefit from the Intel compiler and interprocedural analysis, you should use libraries created with the Intel compiler and -Qipo. As for system functions, the compiler has built-in information about them so that interprocedural analysis can handle them as well.
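A hypothetical build sequence illustrating this on Windows (the file names are mine; the xilib syntax mirrors Microsoft's lib tool, so treat the exact flags as an assumption):

icl -c -Qipo util1.c util2.c
xilib /out:myutils.lib util1.obj util2.obj
icl -Feapp.exe -Qipo main.c myutils.lib

Here the object files and the library contain the internal representation, so when main.c is built with -Qipo the functions from myutils.lib take part in the interprocedural analysis, and xilink performs the final link.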
By the way, if for some reason you mix flies and cutlets, i.e. pass both ordinary object files and files compiled with -Qipo to the Intel compiler (icl) for processing, nothing terrible will happen. The compiler will notice that you are passing it the internal representation and will enable interprocedural analysis and optimization automatically. IPA and IPO will be performed on an incomplete call graph (i.e. a call graph containing unknown functions) that includes all the functions for which there is an internal representation. The compiler will create object files from them and then call the linker to link them with the ordinary object files. But in the general case, an "ordinary" object file included in a build with -Qipo can disable a whole layer of optimizations, the whole program optimizations.
Call graph
It is clear that interprocedural optimization does not come for free. Working with a call graph can be quite complex and resource-intensive, and much is determined by the size of the graph. Out of curiosity, I looked at the well-known benchmark 447.dealII from the CPU2006 performance suite (it is a C++ application) and printed information about the number of vertices in the call graph before the start of interprocedural analysis. It turned out that there were approximately 320,000 different functions and approximately 380,000 different edges connecting them. Large projects may therefore run into problems when using two-pass compilation and interprocedural analysis.
The Intel compiler implements two different models of interprocedural analysis, namely the single-file and the multi-file model. The single-file model is enabled by default, unless the multi-file interprocedural optimization mode is specified. In the option descriptions it looks something like this:
/Qip[-]    enable(DEFAULT)/disable single-file IP optimization within files
/Qipo[n]   enable multi-file IP optimization between files
Well, plus there are still two additional options, which I will mention later.
/Qipo-c    generate a multi-file object file (ipo_out.obj)
/Qipo-S    generate a multi-file assembly file (ipo_out.asm)
Here I will allow myself to include a diagram of an abstract call graph, typical of a C project, so that the article looks a bit nicer and the compiler's work with the call graph under both models is easier to picture.

Formally, one can say that -Qipo requires more resources, while -Qip requires fewer but also delivers lower-quality optimization.
Reality is not so simple, at least as far as resources are concerned. The point is that the call graphs for individual files are not simply subgraphs of the call graph for the entire project. For example, the graphs for different files may contain the same functions, which came from header files and are local to each file; this is especially typical of large C++ projects. Such functions are duplicated in the call graphs of several files. In my opinion, they can loosely be thought of as an analogue of weak functions. Therefore, if you sum up the number of vertices of the individual graphs in the single-file optimization model, the total can be much larger than the number of vertices in the complete call graph. Moreover, in the single-file model all these weak functions go into processing and the compiler "fiddles" with each of them, only for the linker to leave a single version of each such function in the executable file or library. The amount of work in multi-file processing is also reduced by removing the subgraphs and functions that are unreachable from main. If a function has been inlined into all of its call sites (and a few more conditions are met), that function will not be processed by the optimizing part of the compiler at all, i.e. the final executable will not contain its body. In the single-file model this holds only if the function has the static attribute. In other words, with single-file IPA a certain number of "dead" functions end up being processed.
In general, it may happen that in total the compiler does much more work with -Qip than with -Qipo. Instead of one big call graph, it has to handle many smaller ones. Each single-file compilation requires far fewer resources than working with the full call graph, but the resources spent on compiling all the files together can be much greater.
I will illustrate this with the already mentioned benchmark 447.dealII from CPU2006. I already wrote that the full call graph of this application has about 320,000 vertices. Almost 120 files are involved in the compilation. The total number of vertices across the call graphs of all the processed files is ~1,500,000 (although each individual file has fewer than 50,000), i.e. about 5 times the number of vertices in the full call graph. I turned off all parallel compilation options and found that with -Qipo the program compiles on my Nehalem in approximately 160s, while with -Qip the compilation time is ~495s, i.e. 3 times longer. If you count the number of functions processed by the backend with -Qipo, you get ~19,500. The total number of functions processed when compiling the project with -Qip is ~55,000. And if you look at the size of the executable file, for this application the file compiled with -Qipo is almost a third smaller than the one compiled with -Qip.
The main advantage of single-file optimization is that when you edit a single file you do not need to rebuild the entire application; you recompile just that one file and link it with the other object files. In addition, work with -Qip is easy to parallelize, whereas building an application with -Qipo is harder to parallelize. Although the -Qipo option can take a numeric argument (for example, -Qipo2 implies working in two threads), IPA is performed on the full call graph, and no good way of parallelizing this analysis has yet been devised, so this part of the work is done in a single thread (or repeated in each thread); moreover, each thread receives a large amount of IPA-collected data as extra baggage. Processing all the functions in the backend can be distributed between threads effectively, but because of the difficulties listed above such automatic parallelization is less efficient than in the single-file model.
By the way, I would like to mention here another useful option.
If you decide to go the simplest route and build your project with a single call to the Intel compiler, passing it all the options and the full list of sources at once, then by default the compilation will be done in a single process, and to speed it up you need to apply a special option. This option only affects compilation without multi-file IPA.
/MP[<n>] create multiple processes that can be used to compile large numbers of source files at the same time
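For example, an invocation along these lines (the file names are hypothetical) would compile up to four source files at a time:

icl /MP4 -Qip -Feapp.exe file1.c file2.c file3.c file4.c file5.c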
If the aforementioned 447.dealII is compiled with different values of -Qipo, the compile time varies as follows: -Qipo1: 160s, -Qipo2: 136s, -Qipo4: 150s. That is, the threads very quickly start to interfere with each other.
If you replace -Qipo with -Qip in the same icl call (or remove it altogether, since -Qip works by default) and add the /MP option, you get the following: /MP1: 495s; /MP2: 259s; /MP4: 145s; /MP8: 117s; /MP16: 84s.
If we talk about performance, multi-file IPA is much more effective (although there is always some chance of "woe from wit"). It is easy to come up with a simple example demonstrating the advantages of the multi-file model in terms of the optimizations that become possible thanks to the additional information from IPA.
test_main.c:
#include <stdio.h>
#include <stdlib.h>
#define N 10000

extern float calculate(float *a, float *b);

int main()
{
    float *a, *b, res;
    int i;
    a = (float*)malloc(N*sizeof(float));
    b = (float*)malloc(N*sizeof(float));
    for(i=0; i<N; i++) {
        a[i] = i;
        b[i] = 1 - i;
    }
    res = calculate(a, b);
    printf("res = %f", res);
}
test1.c:
#include <stdio.h>
#define N 10000

extern float calc(float a, float b);

float calculate(float * restrict a, float * restrict b)
{
    int i;
    float res;
    for(i=0; i<N; i++) {
        res = calc(a[i], b[i]);
    }
    return res;
}
test2.c:
float calc(float a, float b)
{
    return (a + b);
}
icl -Feaaa.exe -Qparallel -Qpar_report3 test_main.c test1.c test2.c -Qstd=c99
icl -Feaaa.exe -Qipo -Qparallel -Qpar_report3 test_main.c test1.c test2.c -Qstd=c99
Here the -Qparallel option enables the Intel compiler's auto-parallelizer, and -Qpar_report3 reports which loops were processed or rejected by the auto-parallelizer and why.
Output without –Qipo reports the following:
...
test_main.c
procedure: main
..\test_main.c(14): (col. 1) remark: LOOP WAS AUTO-PARALLELIZED
test1.c
procedure: calculate
..\test1.c(11): (col. 1) remark: loop was not parallelized: existence of parallel dependence
test2.c
procedure: calc
Output with –Qipo:
test_main.c
test1.c
test2.c
procedure: main
C:\iusers\aanufrie\for_habrahabr\7a\test_main.c(13): (col. 1) remark: LOOP WAS AUTO-PARALLELIZED
C:\iusers\aanufrie\for_habrahabr\7a\test_main.c(17): (col. 8) remark: LOOP WAS AUTO-PARALLELIZED
The output with -Qipo makes it clear that interprocedural analysis helped the auto-parallelizer handle the loop.
It is clear that in the -Qipo case inlining also helps, but even if inlining is forbidden, analyzing the calc function gives the compiler the opportunity to prove that auto-parallelization is valid.
What should you do if for some reason the multi-file model is not applicable, for example because building the program takes too long and that does not suit you? You can increase the usefulness of the single-file model by grouping related functions together in one file. The static attribute is very useful for functions: if a function has it, IPA sees all the calls to that function even in single-file mode, and if those calls have any special properties, they can be propagated into the function. If all calls to a static function have been inlined, the body of the function itself can be deleted. The same goes for static objects: interprocedural analysis knows that it sees all of their uses and can perform various optimizations on them. In general, any restriction of an object's scope to a single file helps the single-file interprocedural analysis do a better job.
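A minimal sketch of the point about static functions (the names are mine):

static float scale(float x, float factor)
{
    return x * factor;            /* static: every call site is visible in this file */
}

float sum_scaled(const float *a, int n)
{
    float s = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        s += scale(a[i], 2.0f);   /* the only call site; the constant 2.0f can be propagated in */
    return s;
}

Since scale is static and all its calls can be inlined, its body need not be emitted at all; without the static attribute the compiler would have to keep it, because a call could come from another file.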
And what if the problem cannot be solved within a single file, because the important functionality is split across several files?
The -Qipo-c option makes it possible to organize a kind of intermediate model: the call graph is built not from the functions of a single file or of all files, but from a group of files of your choosing. This is useful if your project consists of several closely interrelated pieces. (Our compiler is an example of such a project: it has several components that use common utilities and live in separate directories.) -Qipo-c creates an object file with the fixed name ipo_out.obj, and all the functions from the files listed in the compiler invocation end up in it. In the example above, the application could be compiled as follows:
icl -Qipo-c -Qparallel -Qpar_report3 -Qstd=c99 test1.c test2.c
icl test_main.c ipo_out.obj
The output from compiling the first two files shows that in this case both auto-parallelization and inlining work, since the critical functions end up in one call graph, which makes interprocedural analysis and optimization possible.
And here is a question: has any reader ever used such a compilation scheme, or at least heard of it? There is an opinion that only 5% of developers are interested in performance problems. Quite often the desire to give the user some extra levers for controlling the compiler runs into a simple question: "Who needs this at all, and who is going to use it?"
By the way, you can include some files in interprocedural analysis and exclude others using the "flies and cutlets" method, i.e. by creating object files with and without -Qipo. In this case interprocedural analysis is performed on the incomplete call graph built from the functions defined in the files compiled with -Qipo. In the example I described, this also works.
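For the example above, that could look roughly like this (a sketch; the options are the ones already used earlier, and I assume -Qparallel is repeated at the final step, where the deferred code generation for the -Qipo objects happens):

icl -c -Qipo -Qparallel -Qstd=c99 test1.c test2.c
icl -c test_main.c
icl -Feaaa.exe -Qparallel test_main.obj test1.obj test2.obj

Here test1.obj and test2.obj carry the internal representation, so the compiler runs IPA/IPO over them before linking in the ordinary test_main.obj.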
Finally, the -Qipo-S option is useful for those who want to examine or study the assembly code after interprocedural optimizations.
With this, I will finish the story about the call graphs as the basis of IPO and IPA. I hope to continue the story about interprocedural optimizations and I will try to answer all the questions.