
A report on attending ISC-2015

On September 17, 2015, the annual Intel Software Conference was held in Moscow. The conference program included general presentations (opening remarks, an overview of the company's technologies for developers, success stories from Intel customers) and two parallel sessions: the first was devoted to code optimization and parallel computing, the second to mobile development and media.


Of the first session, the talk that interested me most was “Vectorizing code with Intel Advisor XE”. Besides demonstrating the capabilities of this code optimization tool, it covered general questions of vectorization, gave recommendations for writing vectorizable code, and analyzed examples of constructs that impede automatic vectorization, along with advice on eliminating them. But first things first.

Background


This little story began when one of my colleagues came by and said: “Listen, another Intel software conference is coming up soon, why don't we go?”. He had been there several times before and always spoke of it positively. No sooner said than done: registration is free, and we managed to arrange with management (special thanks for that) to attend the event in the middle of the working week.

By the way, when I mentioned the upcoming trip to the conference to an old friend of mine (former schoolmate, university classmate, and colleague all in one), he also took the opportunity to “break free” from the routine of working days, since his workload allowed it. And although he contacted the organizers after official registration on the site had closed, credit where credit is due: no problems arose, and a positive reply arrived the same day.

Overall impression


The conference left a mixed impression on me. There were not that many visitors (perhaps everyday life is to blame, people are busy at their main jobs), and, as it seemed to me, they were mostly representatives of the older generation (among the visitors we even met one of the respected employees of our university; greetings to you, Boris Mikhailovich!)

The opening speech was given by Greg Anderson (Director, Worldwide Software Sales, Intel Software and Services Group, Portland, Oregon, USA), who spoke in his native language about the line of Intel development tools in general, the company's vision of future optimization technologies, and some details and new features of their latest products. It is always useful to listen to a native speaker, and here the most inquisitive had the opportunity to ask their questions directly.

Retelling the whole program makes no sense; it is available on the official conference website.



"Briefly about the main thing"


Discussing this event with my colleagues and friends, I found that the topic of code vectorization is interesting to many, but few remember the details well. Based on what I heard at the conference, I wrote this small report, which can serve as a brief refresher.
Vectorization is a special case of the SIMD (Single Instruction, Multiple Data) parallel computing model. In essence, it means executing a single instruction that processes several scalar data elements at once (as opposed to scalar operations in the SISD model, Single Instruction, Single Data, which process one data element at a time). Different processor architectures support different SIMD extensions and/or SIMD instructions.

This approach allows a significant gain in the speed of certain types of calculations. The simplest classic example, often cited, is the addition of two integer arrays, which can be written as follows (here and below, variable definitions are omitted):

    for (i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];

Without vectorization (considering one 128-bit SSE2 register, which can hold four 32-bit integers), a significant part of the register remains unused (Figure 1).


Figure 1. Illustration of the process of adding two arrays (without vectorization)

With vectorization enabled, the compiler can pack four 32-bit elements into the register and perform four additions in a single operation (Figure 2).


Figure 2. Illustration of vectorized addition of two arrays

In general, this can lead to a noticeable increase in performance. However, specific results can vary greatly depending on the source code, the compiler, and the processor.
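
To make the idea more concrete, here is a minimal sketch of my own (not from the talk) of what the vectorized body of the loop above roughly corresponds to, written with SSE2 intrinsics. It assumes the element count is a multiple of 4 and uses unaligned loads and stores for simplicity:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Adds two int arrays, four elements per iteration.
       For brevity, n is assumed to be a multiple of 4. */
    void add_arrays_sse2(const int *a, const int *b, int *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128i va = _mm_loadu_si128((const __m128i *)&a[i]); /* load 4 ints from a */
            __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]); /* load 4 ints from b */
            __m128i vc = _mm_add_epi32(va, vb);                   /* 4 additions in one instruction */
            _mm_storeu_si128((__m128i *)&c[i], vc);               /* store 4 results into c */
        }
    }

With automatic vectorization, the compiler generates essentially this kind of code for you from the plain scalar loop.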

Modern versions of popular compilers (for example, Microsoft Visual C++ 14.0, GCC 5.3, Clang 3.7, ICC) support automatic vectorization to one degree or another. For more detailed and up-to-date information, refer to the documentation of your specific compiler version.
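
Purely as an illustration (my own example, not from the talk; the file name add.c is hypothetical, and the exact flags differ between compiler versions), GCC and Clang can be asked to report which loops they vectorized:

    $ gcc -O3 -fopt-info-vec -c add.c              # GCC: print notes about vectorized loops
    $ clang -O2 -Rpass=loop-vectorize -c add.c     # Clang: remarks from the loop vectorizer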

I would like to list the main conditions under which loops can be vectorized. It is worth emphasizing that these are the most general recommendations (giving an idea of the key principles underlying vectorization), which will not necessarily hold in a particular environment. Each specific compiler may have additional restrictions and/or, conversely, may be able to cope with some of the items listed below.

  1. The number of loop iterations must be known by the time the loop starts to execute (i.e., the loop exit is not data dependent);

  2. No data-dependent loop exit points. For example, the following loop does not meet this condition:

     for (i = 0; i <= MAX; i++) {
         c[i] = a[i] + b[i];
         // data-dependent exit condition
         if (a[i] < 0)
             break;
     }

  3. Straight-line code execution (no logical branching). Thus, the presence of a switch statement prevents vectorization, while an if statement is not as critical and often still allows vectorization;

  4. Usually, only the innermost loop is vectorized. The exceptions are outer loops that have been converted into inner ones by some optimization step (for example, loop unrolling / collapsing);

  5. No function calls. The exceptions are, as a rule, built-in mathematical functions (compiler documentation usually lists the vectorizable functions) and functions that get inlined (see the sketch after this list).
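
As a rough illustration of point 5 (my own sketch; the function and variable names are hypothetical, and the exact behavior depends on the compiler), a loop calling an opaque external function typically cannot be vectorized, while the same computation written as a function the compiler can inline usually can:

    double transform(double x);   /* defined in another translation unit: opaque call */

    static inline double transform_inl(double x) { return 2.0 * x + 1.0; }  /* body visible, can be inlined */

    void apply(const double *a, double *b, int n)
    {
        for (int i = 0; i < n; i++)
            b[i] = transform(a[i]);       /* call to unknown code: usually not vectorized */

        for (int i = 0; i < n; i++)
            b[i] = transform_inl(a[i]);   /* inlined body: plain arithmetic loop, a good candidate for vectorization */
    }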

There are also several circumstances that do not necessarily prevent vectorization, but may adversely affect it. Let us consider them:

  1. Accesses to non-contiguous memory regions. As an example, consider 4 adjacent integers, which can be loaded into a register with a single SSE instruction. If these 4 numbers are not located in a contiguous region of memory, more operations may be required to load them, which is clearly less efficient. In such cases, vectorization is possible, but the compiler may consider it not worthwhile.

    The most common cases are loops with non-sequential access (for example, when the array index is incremented by a value other than 1) or with indirect addressing (the array index is taken from another array); see Figure 3 and the sketch after this list.


    Figure 3. Memory access patterns (from top to bottom): contiguous, constant stride ≠ 1, variable stride

  2. Presence of data dependencies. Since each SIMD instruction processes several scalar values at a time, vectorization is possible only if changing the order in which the values are processed does not affect the result of the calculation.

    A simple example of “independent data” in calculations is when the data accessed in a given loop iteration is not used in subsequent iterations. In that case, each iteration is independent and can be performed in any order without affecting the final result.

    If, for example, a variable is written in one iteration of a loop and read in the next, there is a read-after-write (flow) dependency; see the sketch below.
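
A small sketch of my own (not from the talk; variable definitions, including the hypothetical index array idx, are omitted as before) illustrating both points: the first three loops correspond to the access patterns of Figure 3, the last one shows a loop-carried read-after-write dependency that prevents straightforward vectorization:

    for (i = 0; i < n; i++)     c[i] = a[i]      + b[i];   /* contiguous access: ideal for vectorization */
    for (i = 0; i < n; i += 2)  c[i] = a[i]      + b[i];   /* constant stride != 1: less efficient loads */
    for (i = 0; i < n; i++)     c[i] = a[idx[i]] + b[i];   /* indirect addressing (gather): often judged not worthwhile */

    for (i = 1; i < n; i++)     a[i] = a[i - 1]  + b[i];   /* a[i] depends on a[i-1] written in the previous
                                                              iteration: read-after-write (flow) dependency */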

General recommendations for writing vectorizable code also include the following:

  1. Use aligned data structures (for example, aligned to powers of two). A properly designed data structure allows operations on it to be performed most efficiently, both in terms of execution time and the amount of memory used;

  2. Prefer structures of arrays (Structure of Arrays, SoA) to arrays of structures (Array of Structures, AoS). The choice of data organization has a significant impact on whether code can be vectorized. If we compare the two structures:
     struct color { int r, g, b; };      // AoS
     struct color { int *r, *g, *b; };   // SoA
and how the data will be laid out in memory (Figure 4):


Figure 4. Graphical representation of the location of data in memory.

it becomes obvious that the SoA variant is a more vectorization-“friendly” way of storing the data (since it provides good data locality, and therefore allows efficient vectorization by operating on adjacent values).
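
A short illustrative sketch of my own (the type, field, and function names are hypothetical) of why this matters: with SoA, a loop over one channel walks through contiguous memory, while with AoS the same loop has to stride over the other fields:

    struct colors_aos { int r, g, b; };      /* array of structures */
    struct colors_soa { int *r, *g, *b; };   /* structure of arrays */

    void brighten_aos(struct colors_aos *p, int n)
    {
        for (int i = 0; i < n; i++)
            p[i].r += 10;      /* r values are sizeof(struct colors_aos) bytes apart: strided access */
    }

    void brighten_soa(struct colors_soa *p, int n)
    {
        for (int i = 0; i < n; i++)
            p->r[i] += 10;     /* r values are adjacent: contiguous access, easy to vectorize */
    }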

It should also be noted that the applicability of the techniques described here may be limited, since the real benefit can only be established by careful analysis of the generated code.

Summing up


Overall, my impressions of the conference were positive. It does seem, though, that the program changes only slightly from year to year, so I would sum it up as follows: if you have the chance to attend it once, it is definitely worth it for anyone interested in Intel products and technologies.

If I were a student, then whenever possible (and, on average, there are more such opportunities at that stage of life than, say, with full-time employment), I would attend this conference to broaden my horizons and make interesting acquaintances.

The same can be recommended to people whose professional activities are closely connected with high-performance computing and/or deep code optimization: here you can meet like-minded people, listen to success stories, and get answers to your questions, as they say, “first-hand”.

Thanks to the organizers and speakers!



List of sources:
1. A Guide to Vectorization with Intel® C++ Compilers, Intel Corporation, 2010;
2. Optimizing Software in C++, A. Fog, Technical University of Denmark, 2015;
3. “Vectorize code with Intel Advisor XE”, K. Rogozhin, Intel Software Conference, 2015.

Source: https://habr.com/ru/post/276541/

