
I continue the conversation about application optimization that I began in the post "Is there a simple assessment of the quality of application optimization?". One can talk about processors at length and in great detail, and there are surely many readers of Habr capable of such conversations. But my view of the processor is purely pragmatic. Since I am interested in application performance through the lens of processor performance, it is enough for me to understand the basic principles of the computational core, as well as the methods that exist to influence those principles. I will focus on the Intel64 architecture, because in our performance analysis team we analyze the work of the Intel optimizing compiler mainly for this architecture. In the high-performance computing market this and compatible architectures hold the lion's share, so most performance problems are of a fairly general nature. Let me briefly list the main problems and opportunities that determine the performance of the core and of the computing system, and offer a short list of optimizations designed to address these problems.
Instruction-level parallelism
A modern computational core features pipelining and superscalar execution. The core has mechanisms for detecting independent instructions and scheduling them: they scan the buffer of instructions arriving for processing and select instructions for execution whenever suitable execution units are free and the instructions do not depend on other, not-yet-completed instructions. The Intel64 architecture is thus characterized by out-of-order execution, so the compiler does not need to determine the exact order of instructions in detail. How fully the pipeline is loaded is determined by how many independent instructions go into processing, the so-called instruction-level parallelism.
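A small hand-written sketch (not from the original post) of what this means in practice: both functions below sum the same array, but the first one forms a single chain of dependent additions, while the second one gives the out-of-order core two independent chains to execute in parallel (note that reassociating floating-point additions may change rounding slightly):

double sum_chained(const double* a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];              // every addition depends on the previous one
    return s;
}

double sum_two_chains(const double* a, int n) {
    double s0 = 0.0, s1 = 0.0;  // two independent dependency chains
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    for (; i < n; ++i)          // remainder
        s0 += a[i];
    return s0 + s1;
}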
The number of conditional branches and the quality of the branch predictor
The order in which instructions are executed is determined by the control flow graph, which connects straight-line sections of code. Where the flow of control branches, conditional branch instructions are normally used. That is, until the condition is computed, it is not known which instruction should be executed next. So that the pipeline does not stall, the computational core has a branch prediction mechanism, which picks one of the possible control transfer paths and keeps delivering instructions from that direction to the pipeline. The speculatively executed instructions wait for validation, which happens once the condition is computed. If the predictor was wrong, the core is delayed by the need to flush its buffers and load new instructions into them. This situation is called a branch misprediction. The branch prediction mechanism is quite complex and includes a static part, which tries to guess the direction on the first encounter with a conditional branch, and a dynamic part, which accumulates and uses statistics.
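A small hypothetical illustration: the branch in the loop below depends on the data, so on random input the dynamic predictor mispredicts roughly half the time and pays for a pipeline flush each time, while on sorted input it quickly learns the pattern (a compiler may also turn such a branch into a branchless form, as mentioned further below):

#include <cstddef>

long count_large(const int* a, std::size_t n, int threshold) {
    long count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] > threshold)   // data-dependent branch, hard to predict on random input
            ++count;
    }
    return count;
}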
The quality of the memory subsystem
Since all the main data and instructions reside in RAM, performance depends on how quickly this data can be delivered to the execution units. The latency of fetching data from main memory amounts to hundreds of processor cycles. To speed things up, the processor has a cache, or rather a cache subsystem. The cache stores copies of frequently used data from main memory. In modern multi-core processors a typical arrangement is that each core owns a first-level and a second-level cache, while the processor also has a third-level cache shared by all the cores. The first-level cache is the smallest and fastest; the caches higher up the hierarchy are slower but larger. Intel64 processors have an inclusive cache: if an address is present in a cache of some level, it is also contained in the caches above it in the hierarchy. Therefore, when data is needed from memory, the memory subsystem checks for the address in the first-, second-, and third-level caches in turn. On a cache miss (i.e. the needed information is not in the cache), each subsequent step adds more delay. For example, for Nehalem the approximate access latencies are 4 cycles for the first-level cache and 11 and 38 cycles for the second- and third-level caches respectively. The latency of accessing RAM is roughly 100 to 200 cycles. To understand the memory subsystem it is important to know that data is transferred to and stored in the cache in fixed-length blocks called cache lines, which are (for now?) 64 bytes. When a specific memory address is requested, some "neighboring" data is loaded into the cache along the way. The system bus transfers data between main memory and the cache subsystem. An important characteristic of the memory subsystem is bus bandwidth, i.e. how many cache lines can be transferred to the cache per unit of time. Accordingly, the system bus often becomes a "bottleneck" when an application runs.
Given this cache organization, a large role is played by the locality of the data the program uses. There is temporal locality, which characterizes the reuse of the same objects and data during the application's run, and spatial locality, the use of data stored at addresses close to each other. The computational core implements mechanisms that exploit both kinds of locality. Temporal locality is served by the caching mechanism itself, which tends to keep the most frequently used data in the caches: when a cache line must be freed to load a new one, the line that has gone unused the longest is evicted. Spatial locality is served by the prefetching mechanism, which tries to determine in advance the memory addresses that will be needed by subsequent work. At the same time, the higher the spatial locality (the more accesses hit neighboring elements), the less data needs to be loaded into the cache and the lower the load on the system bus.
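To make spatial locality concrete, here is a small hand-written sketch (not from the original post; the matrix size is arbitrary). In C and C++ a two-dimensional array is stored row by row, so a row-wise walk touches consecutive addresses within the same cache lines, while a column-wise walk jumps by a whole row on every access:

const int N = 1024;
float m[N][N];

float sum_by_rows() {              // good spatial locality
    float s = 0.0f;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i][j];          // consecutive addresses, few cache misses
    return s;
}

float sum_by_columns() {           // poor spatial locality
    float s = 0.0f;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i][j];          // stride of N*sizeof(float) bytes, a new cache line each time
    return s;
}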
The registers of the computational core are also worth mentioning here as the fastest memory. It is profitable to keep reused data in registers. There is a performance problem known as register spilling, when the computation constantly copies data from registers to the stack and back.

Using vectorization
Intel64 processors support vector instructions: it is possible to form vectors of various types and apply instructions to them that operate on whole vectors and produce vector results. These are the instructions of the SSE family. The instruction set keeps evolving and gaining new instructions and capabilities: SSE2, SSE3, SSE4 and AVX are extensions of the original idea. The problem is that programming languages are inherently scalar, and some effort is needed to exploit this feature of the computational cores. Vectorization is a code modification that replaces scalar code with vector code: scalar data is packed into vectors and scalar operations are replaced with operations on vectors (packed operands). You can program vectorization by hand using xmmintrin.h, use an optimizing compiler with auto-vectorization, or use optimized libraries and utilities built on vector instructions. There are many restrictions on when this modification can be applied, and many factors that determine whether it pays off. That is why I wrote "modification": an optimization, by definition, is something that improves performance, and for vectorization to improve it, certain conditions must be met. In other words, vectorization is just an "opportunity", and we need to work to turn this opportunity into a real gain.
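As an illustration of manual vectorization with xmmintrin.h, here is a hedged sketch of my own (not from the post): it assumes the number of elements is a multiple of 4 and the pointers are 16-byte aligned, which a real implementation would have to check or handle with remainder and alignment code:

#include <xmmintrin.h>   // SSE intrinsics

void add_scalar(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

void add_sse(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);            // load 4 packed floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb));   // add and store 4 results at once
    }
}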
Using multi-threaded computing
The presence of several cores in a modern processor makes it possible to achieve high application performance by distributing computations between these cores. But this, too, is just an "opportunity". Having a million highly professional programmers gives no hope that they could write an optimizing compiler in an hour; likewise, converting single-threaded code into multi-threaded code requires serious effort. Some optimizing compilers have an auto-parallelizer, which reduces creating a multi-threaded application to adding a single option at compile time. But there are so many pitfalls in the auto-parallelizer's path that in many cases the optimizing compiler needs help, and in many others it is simply powerless. Good results can be achieved with OpenMP directives, but in most cases effort is still required to parallelize an initially single-threaded algorithm. I will not mention other parallelization methods here, since in my opinion they involve serious changes to the computational algorithms and lie beyond what can be called non-algorithmic code optimization.
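For example, the simplest case of an independent loop written with OpenMP looks like this (a minimal sketch of mine, built with the compiler's OpenMP option, e.g. /Qopenmp for the Intel compiler on Windows); it only pays off when the iterations are independent and there is enough work per iteration:

void scale(float* a, int n, float factor) {
    #pragma omp parallel for       // iterations are split between threads
    for (int i = 0; i < n; ++i)
        a[i] *= factor;
}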
Based on the above problems, there are three main characteristics that determine the performance of the executing application and two great opportunities that can be exploited:
1.) The quality of the memory subsystem.
2.) The number of conditional branches and the quality of the branch predictor.
3.) The level of instruction-level parallelism.
4.) The use of vectorization.
5.) The use of multi-threaded computing.
Optimization
What means do the compiler and the programmer have to influence the computation characteristics listed above?
Many loop optimizations, such as loop interchange, loop blocking, and loop fusion / distribution, are designed to improve the performance of the memory subsystem. These optimizations have a drawback: not every loop is suitable for them, and such loops are found mostly in computational programs in C and Fortran. The compiler can try to help the hardware prefetcher by automatically inserting software prefetch instructions, but again this is a very specialized operation that requires the stars to align. If an executable module is built with a dynamic profile, the compiler is able to improve the spatial locality of the application data by reordering the fields of structures and moving cold fields into separate structures. There is even an optimization that moves fields from one structure to another. It is quite difficult for the compiler to prove the correctness and profitability of such optimizations, but it is much easier for a programmer to do such things; the main thing is to understand the idea behind each code modification.
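To illustrate one of these loop optimizations, here is a sketch of loop blocking (tiling) for a matrix transpose (my own example; the matrix and tile sizes are arbitrary, and the blocked version assumes the size is a multiple of the tile). The naive transpose walks the output with a large stride and evicts cache lines before they are fully used; the blocked version keeps a small tile of both matrices in the cache:

const int N = 2048, B = 64;

void transpose_naive(const float* in, float* out) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[j * N + i] = in[i * N + j];
}

void transpose_blocked(const float* in, float* out) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; ++i)       // work tile by tile
                for (int j = jj; j < jj + B; ++j)
                    out[j * N + i] = in[i * N + j];
}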
An important role in improving data locality can also be played by scalar optimizations, i.e. optimizations based on data-flow analysis: constant propagation, control-flow graph optimization, common subexpression elimination, dead code elimination. These optimizations reduce the amount of computation the application needs to perform. As a result, the number of instructions is reduced and data locality improves, since instructions also need to be stored in the cache. How well these optimizations are performed could be called the efficiency of the computation. All of them are helped by interprocedural analysis. With it, data-flow analysis becomes global: the propagation of object properties is analyzed not only within the specific procedure processed by the compiler, but across the whole program. Inside a procedure, the properties of the specific arguments with which it is called from different places in the application are studied, the properties of global objects are studied, a list of objects that can be used or modified is built for each function, and so on. Function inlining and partial inlining can also have a positive effect on data locality. Many of the above optimizations can no longer reasonably be done by hand, since the effect of each individual propagated constant may be negligible, but in the end constant propagation gives a good cumulative effect and often makes it possible to apply, and to evaluate more correctly, the effect of more complex optimizations.
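A tiny hand-written before/after sketch of what constant propagation, common subexpression elimination and dead code elimination achieve together (the functions are, of course, hypothetical):

int before(int x) {
    int scale = 4;                  // constant
    int a = x * scale + x * scale;  // the same subexpression computed twice
    int unused = x * 100;           // never read: dead code
    return a;
}

// What the compiler effectively reduces it to:
int after(int x) {
    int t = x * 4;                  // constant propagated, subexpression computed once
    return t + t;
}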
In addition, the optimizing compiler pays a lot of attention to register allocation.
An optimizing compiler can take certain actions to reduce the number of conditional branches and to improve the behavior of the branch predictor. The most trivial one is to improve the work of the static branch predictor by changing the condition under which the branch is taken. The static predictor assumes that a forward branch in the control flow graph will not be taken, so if the compiler believes that the if branch is less probable than the else branch, it can swap them. A very powerful optimization is hoisting invariant checks out of loops. Another well-known optimization is loop unrolling. There is a useful scalar optimization that propagates conditions along the control flow graph, which makes it possible to remove redundant checks, merge them, and simplify the control flow graph. The compiler is also able to recognize certain patterns and replace branches with instructions like cmov. One could give more examples, but even these show that the optimizing compiler has ways of dealing with delays caused by mispredictions.
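For instance, hoisting an invariant check out of a loop (the compiler may do this as loop unswitching) looks roughly like this, in a sketch of my own:

void process(float* a, int n, bool mode) {
    for (int i = 0; i < n; ++i) {
        if (mode)                   // invariant condition checked n times
            a[i] *= 2.0f;
        else
            a[i] += 1.0f;
    }
}

void process_unswitched(float* a, int n, bool mode) {
    if (mode) {                     // checked once
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;
    } else {
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;
    }
}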
To improve instruction-level parallelism, instruction scheduling is used, which is performed in the final stages of the compiler's work, during code generation.
To improve application performance through vectorization, our compiler has an auto-vectorizer that vectorizes loops, replacing scalar code with vector code. This component is constantly being developed: it picks up new features of the vector extensions, recognizes more idioms, vectorizes loops with mixed data types, and so on. In addition, the vectorizer inserts various runtime checks into the code and creates multiversioned code, since the benefit of vectorization depends on the number of loop iterations, on the placement of various objects in memory, and on the SIMD extension being used. To work more effectively, the vectorizer needs to interact with the other loop optimizations. The compiler cannot always prove that vectorization is legal, so various directives are supported with which the programmer can apply his knowledge of the program and help create faster code.
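As an example of such a directive, the Intel compiler supports #pragma ivdep, with which the programmer asserts that the loop has no assumed dependences between iterations. A hedged sketch of mine (the function and its parameters are hypothetical): without extra information the compiler must assume that dst and src may overlap and refuse to vectorize.

void shift_add(float* dst, const float* src, int n, int offset) {
    #pragma ivdep                   // programmer promises there is no loop-carried dependence
    for (int i = 0; i < n; ++i)
        dst[i] = src[i + offset] + 1.0f;
}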
The Intel compiler offers developers auto-parallelization, an automatic loop parallelization mechanism. In terms of labor costs this is the cheapest way to create a multi-threaded application: just build it with one extra option. But from the point of view of the optimizing compiler it is not an easy task. It must prove that the optimization is legal, and it must correctly estimate the amount of work performed inside the loop so that the decision to parallelize actually improves performance. Auto-parallelization, like auto-vectorization, has to interact with the other loop optimizations. Probably the best optimization algorithms in this area have not yet been written, and there is still plenty of room for experiments and innovative ideas. New language extensions and support libraries, such as Cilk Plus or Threading Building Blocks, are being created to give programmers more intuitive parallelization methods. And the good old, proven OpenMP is also too early to write off.
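For comparison, the same kind of independent loop written with Threading Building Blocks looks roughly like this (a sketch of mine; it assumes a TBB version with lambda support). The runtime splits the range into chunks and distributes them over worker threads:

#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void scale_tbb(float* a, std::size_t n, float factor) {
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [=](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                a[i] *= factor;
        });
}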
Conclusion:
In the previous post I suggested that there is no simple criterion for understanding how well an application is optimized. This is because there are many different optimizations, each of which can be used only when certain conditions are met. Whether those conditions hold, whether a given optimization is possible at all, is exactly what needs to be figured out in the course of analyzing the application's performance. Therefore, in my opinion, performance analysis comes down to identifying the performance-critical areas of the application, identifying the causes of possible processor stalls in those areas, trying to determine whether optimizations that address the identified causes are possible, and determining whether the application can gain performance from vectorization and parallelization.
Is a vector vectorized? (a small example of analysis)
Sometimes even the same application can show completely surprising results when you analyze its performance. Here, for example, let us ask: is a vector vectorized? Consider this test:
#include <vector>
#include <stdio.h>
using namespace std;

int main()
{
    vector<float> myvector;
    vector<float>::iterator it;

    for (int j = 0; j < 1000; j++) {
        it = myvector.end();
        it = myvector.insert(it, j);
    }
    for (int i = 1; i < myvector.size(); ++i)
        myvector[i]++;

    printf("%f\n", myvector[0]);
    return 0;
}
I use the Intel compiler integrated into VS2008 to compile this code. Here I am interested in whether the loop at line 14 of the source is vectorized. The -Qvec_report3 option produces the auto-vectorizer's report:
icl -Qvec_report3 -Qipo -Qansi_alias vector.cpp
Intel C++ Intel 64 Compiler XE for applications running on Intel 64, Version Mainline Beta Build x
...
...\vector.cpp(14): (col. 30) remark: loop was not vectorized: unsupported loop structure.
...
Now I use the same compiler, but integrated into VS2010.
icl -Qvec_report3 -Qipo -Qansi_alias vector.cpp
...
...\vector.cpp(14): (col. 30) remark: LOOP WAS VECTORIZED.
...
I will not describe here the ways one can get to the bottom of why the compiler behaves differently when compiling the same program (you can look at the assembly, or at the preprocessed code). Since the Intel compiler is integrated into different versions of VS, different versions of the STL are used in the two cases. In the VS2008 STL the [] operator contains a check for out-of-bounds access. As a result, a call to __invalid_parameter_noinfo, which exits the program, appears inside the loop, and the compiler is unable to vectorize such a loop. In VS2010 this check is no longer there. I wonder why?
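This is not the actual VS2008 STL source, only a rough sketch of mine of why such a check gets in the way: the loop body acquires a branch that may leave the loop early, and the compiler cannot vectorize a loop whose body might not complete on every iteration.

#include <vector>
#include <cstdlib>

float& checked_index(std::vector<float>& v, std::size_t i) {
    if (i >= v.size())
        std::abort();   // stands in for the __invalid_parameter_noinfo path
    return v[i];        // a plain, unchecked operator[] would be just this line
}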
It is funny that if you remove the -Qansi_alias option, the vectorization disappears as well. The same thing happens if you create a vector of integers instead of floats. But the reasons for these "miracles" deserve a separate article.
Primary sources
Research literature on various performance issues:
The Software Optimization Cookbook, Second Edition, and The Software Vectorization Handbook.
Basic sources on various optimization techniques: Randy Allen, Ken Kennedy. Optimizing Compilers for Modern Architectures, as well as David F. Bacon, Susan L. Graham, Oliver J. Sharp. Compiler Transformations for High-Performance Computing.
For programmers interested in improving their code by hand, I can recommend Agner Fog's book Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.
I was also not ashamed to borrow some thoughts from my own article "Optimizing Compilers" in the journal "Open Systems".