
Intel Architecture Code Analyzer 2.0.1

My needs for analyzing the performance of software on x86 are covered by three tools. One of them, VTune XE, is probably familiar to everyone who has dealt with optimization.



The second tool, unfortunately, is not as widely known. It has already been mentioned on Habr in the context of optimizing AVX code, but its area of application is somewhat broader.



Sometimes, after VTune has found the most important hotspot (and often the developer already knows where it is), some effort is needed to reduce the number of clock cycles spent executing it. For almost three years now, I have been using the Intel Architecture Code Analyzer (IACA) to analyze the performance of such small but critical sections of code.


It is easy to use; here is a recursive algorithm of just 6 steps:

1. Add the following include to the relevant .c / .cpp file:

#include "iacaMarks.h"

2. Put the IACA dll / so libraries in a location where the system can find them.

3. Add the macros

IACA_START, IACA_END

to the source, respectively before and after the code being optimized. For example:



 IACA_START
	 for (int i = 0; i < len / 32; i++)
	 {
		 reg2 = _mm_load_si128((__m128i *) src + i * 2);
		 reg3 = _mm_load_si128((__m128i *) src + i * 2 + 1);
		 reg1 = reg0;
		 reg0 = reg3;
		 reg3 = _mm_alignr_epi8(reg3, reg2, 16 - 1);
		 reg2 = _mm_alignr_epi8(reg2, reg1, 16 - 1);
		 _mm_store_si128((__m128i *) dst + i * 2, reg2);
		 _mm_store_si128((__m128i *) dst + i * 2 + 1, reg3);
	 }
 IACA_END


4. Compile with your favorite compiler (I hope it is ICC, but GCC and MSVC work without problems too) with all the optimizations you normally use (except PGO, unfortunately). There is no need to link the entire project; a single object file is enough.



5. Then, on the command line, feed the resulting object file to the analyzer:

iaca -64 -arch IVB -cp DATA_DEPENDENCY -mark 0 -o output.txt ssecpy.obj

The parameters for IACA are as follows:

-64 means 64-bit code; -32 is also possible.

-arch IVB tells IACA to analyze the performance of this code on Ivy Bridge. Other possible values: nehalem, westmere, SNB.

-analysis LATENCY asks IACA to show which instructions are on the critical path for the data (that is, which instructions need to be optimized so that the result of this code is computed sooner). Another possible value, -analysis THROUGHPUT, asks IACA to show which instructions clog the processor pipeline.

-mark 0 tells IACA to analyze all the marked parts of the code. If you specify -mark n, IACA will analyze only the n-th marked section.

6. Think hard about the result, make changes to the code (I confess, at this stage I often just mindlessly shuffle instructions around and see what happens), and repeat steps 4-6 until you get bored.
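Steps 1 and 3 above can be sketched as one complete translation unit. This is only an illustration under some assumptions: the function scale_add and its loop are hypothetical and not from the original example, and the fallback macros are my own addition so that the same file still builds on a machine where iacaMarks.h is not installed (it relies on the compiler supporting __has_include).

```c
#include <assert.h>
#include <stddef.h>

/* Use the real markers when iacaMarks.h (shipped with IACA) is on the
 * include path; otherwise fall back to empty macros so the file still
 * compiles. The fallback is an assumption of this sketch, not part of IACA. */
#if defined(__has_include)
#  if __has_include("iacaMarks.h")
#    include "iacaMarks.h"
#  endif
#endif
#ifndef IACA_START
#  define IACA_START
#  define IACA_END
#endif

/* Hypothetical hot spot: dst[i] = a[i] * k + b[i] for i in [0, n). */
void scale_add(int *dst, const int *a, const int *b, int k, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    IACA_START
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] * k + b[i];
    }
    IACA_END
}
```

There is deliberately no main here: as step 4 says, one object file is enough for the analysis, so the file only needs to compile, not link.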



Now about the limitations. Although uops and execution ports are mentioned in the tool's output, IACA is not a simulator: it performs static analysis. For example, if the analyzed code contains a conditional branch, IACA assumes that the branch is not taken. It also assumes that there are no misses in L1D and L1I, let alone in the higher cache levels. Some instructions are not supported; IACA simply skips them, and a "!" appears in their place in the output.
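To make the branch limitation concrete, here is a small sketch with two hypothetical functions (neither is from the original example) that compute the same sum. In the first, the addition sits behind a data-dependent if, so if the compiler emits a real branch, a static analysis that assumes branches are not taken would see only one of the two paths. The second expresses the same work with a mask and no if in the source. Whether the compiler actually emits a branch for either form depends on the compiler and flags, so it is worth checking the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Branchy form: the addition happens only when v[i] > 0. */
long sum_positive_branchy(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > 0) {
            sum += v[i];
        }
    }
    return sum;
}

/* Mask form: mask is all ones when x > 0 and zero otherwise,
 * so x & mask is either x or 0 and no if is needed. */
long sum_positive_branchless(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        long x = v[i];
        long mask = -(long)(x > 0);
        sum += x & mask;
    }
    return sum;
}
```

Either loop can be wrapped in IACA_START / IACA_END as in step 3 to compare what the analyzer reports for each.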



Recently, version 2.0.1 appeared on the site with support for code analysis for SNB and IVB. The documentation describes in more detail how to use the tool and gives several examples of typical optimizations that can be found with IACA. It is said to work on Linux, Windows, and Mac OS X, though I have not personally checked the last one.

Source: https://habr.com/ru/post/144195/


