Intel Parallel Studio XE 2016: New C/C++ Compiler Features
Last week a new version of the Intel C/C++ compiler was released: 16.0, aka Parallel Studio XE 2016 Composer Edition for C++. Support for new standards (C11, C++14, OpenMP 4.1) and the capabilities for working with Xeon Phi have been significantly expanded, new versions of the libraries have shipped, and there is plenty more besides. Let's take a closer look at what appeared in the latest release. Go!
Added support for SIMD operators on the integer SSE types on Linux. The following operators now work: + - * / & | ^ += -= *= /= &= |= ^= == != > < >= <=. A simple example that did not compile before (don't forget to include immintrin.h):
__m128i x, y, z;
x = y + z;
Note that this works only on Linux; the compiler on Windows still complains: operation not supported for these simd operands. In addition, only 128- and 256-bit SIMD types are supported, and only the two-operand form, as in the example. The operands must be of the same type; for example, Intel's SSE types cannot be mixed with GNU types declared with the vector_size attribute.
Support for the standards has been significantly expanded; this is perhaps one of the most important directions of the compiler's development in the new version. While version 15 supported only binary literals from the C11 standard (for the C language, not to be confused with C++11), which start with the prefix 0b or 0B, now almost everything is there. The latest public draft of the standard is freely available here. I didn't find a good overview article about C11, so I'll describe each language feature in more detail. Don't forget that to make all of this work you need to compile with /Qstd=c11 on Windows and -std=c11 on Linux and Mac OS X:
New keywords (as in C++11) for data alignment, _Alignas and _Alignof, which let you avoid compiler-specific solutions:
I already wrote earlier about the necessity and significance of data alignment.
Type-generic expressions using the _Generic keyword. This is a kind of "templates" for C. For example, the following macro for taking a square root translates sqrt(x) into sqrtl(x), sqrt(x) or sqrtf(x) depending on the type of the parameter x:
#define sqrt(x) _Generic((x), long double: sqrtl, default: sqrt, float: sqrtf)(x)
Previously you had to work hard and implement this by hand!
The _Noreturn function specifier lets you declare functions that never return to the calling code. This avoids compiler warnings for functions without a return, and also enables a number of optimizations that are only possible for "non-returning" functions.
_Noreturn void func(); // func never returns
The new keyword _Static_assert produces a compilation error if a constant expression evaluates to zero. A simple example:
_Static_assert(sizeof(int) < sizeof(char), "app requires sizeof(int) to be less than sizeof(char)");
// error: static assertion failed with "app requires sizeof(int) to be less than sizeof(char)"
Unlike the #if and #error preprocessor directives, _Static_assert is evaluated after preprocessing, so it can catch errors (for instance, ones involving sizeof) that cannot be detected during preprocessing.
Anonymous structures and unions. No, not a non-profit society of anonymous alcoholics... just kidding, checking your concentration. They are used for nesting structures and unions. For example:
struct T // C11
{
    int m;
    union
    {
        char *index;
        int key;
    };
};
struct T t;
t.key = 1300; // the anonymous union's member is accessed directly
A peculiarity of the C11 standard is that it standardizes multithreading in the C language. Of course, developers have long used parallelism in C, but through libraries and other language extensions; now it is written into the standard. One of the new keywords supported by the Intel compiler is _Thread_local. With it you can specify that a variable is not shared between threads: each thread receives its own local copy.
According to the C and C++ standards, the compiler is not required to honor parentheses when evaluating expressions. For example, it is far from guaranteed that the addition of B and C in the expression A + (B + C) will be performed first, which leads to differences in numerical results. Finally, a compiler option has appeared that disables the optimization that changes the order of summation (reassociation) for floating-point types. Now, if the -fprotect-parens (Linux* OS and OS X*) or /Qprotect-parens (Windows*) option is used, the order of operations is determined by the parentheses. Using this option may slow down the code; it is off by default.
Having fully implemented C++11 support in version 15.0, the compiler developers have gotten to grips with the next standard: C++14, more than half of which is now supported. As with C11, there is a page that tracks support for the various standard features across compiler versions. You can enable C++14 support with /Qstd=c++14 on Windows and -std=c++14 on Linux and Mac OS X. So, what is supported starting with the new release:
- Generic (generalized) lambdas
- Init-captures (capture expressions) for lambdas
- Digit separators
- The [[deprecated]] attribute
- Return type deduction for functions
- Aggregate initialization of classes with member initializers
#if __cpp_binary_literals
int const packed_zero_to_three = 0b00011011;
#else
int const packed_zero_to_three = 0x1B;
#endif
Now we can very easily determine whether the compiler supports binary literals. For more details (for example, the table with macro names such as __cpp_binary_literals, __cpp_digit_separators, etc.), you can read about this useful feature-testing facility here.
A very useful pragma, block_loop, has been added, which lets you control the loop-blocking optimization; I wrote about it in detail in this post.
Support for the next version of the OpenMP standard, 4.1 (Technical Report 3), mainly expands the capabilities for offloading computations to the Xeon Phi coprocessor and other possible accelerators:
- A new omp target enter data directive has been added for mapping variables to the coprocessor (the map clause accepts the to and alloc modifiers). Whereas the omp target directive both maps variables and executes code on the device, the data directives deal only with data. Accordingly, there is now also an omp target exit data directive for unmapping variables (the map clause accepts the from, release and delete modifiers).
- Improved capabilities for asynchronous code execution: the target region is now a task, so asynchronous offload is possible using the existing tasking model and the nowait clause.
- The depend clause, which allows offloading with dependencies.
- New always and delete modifiers for the map clause.
In addition to the significantly expanded accelerator capabilities within OpenMP 4.1, the Intel-compiler-specific support for coprocessors has also been refined:
Previously it was impossible to offload a field of an object accessed through a pointer, in the form ptr->field, to the coprocessor; this restriction has now been removed. It has also become possible to transfer structures whose fields are pointers. In this case the structures themselves are transferred bitwise: the pointers are copied, but the data they point to is not.
It became possible to allocate memory only on the coprocessor, without allocating memory on the host, using the targetptr and preallocated modifiers.
The concept of a stream has appeared (along with a new stream clause for the pragma offload directive): a logical queue for offloads. With it you can now offload several independent computations to the Xeon Phi from a single CPU thread. The workflow is as follows. First, create a stream using the _Offload_stream_create API function:
OFFLOAD_STREAM *handle = _Offload_stream_create(
    int device,          // Intel MIC Architecture device number
    int number_of_cpus); // threads allocated to the stream
Then offload into the stream using the offload directive with the stream clause, specifying a signal value to identify the offload. This helps determine whether a particular offload has completed:
// Issue offload to a stream and identify with signal value s1
#pragma offload … stream(handle) signal(s1)
{ … }
…
// Issue offload to a stream and identify with signal value s2
#pragma offload … stream(handle) signal(s2)
{ … }
…
// Check if offload with signal value s1 has completed
if (_Offload_signaled(s1)) …
Much more detail can be found in the documentation, which has been significantly expanded with the release of the new version.
In addition to all of the above, new versions of all the libraries (Intel IPP, TBB, MKL) have naturally been released, where you can also find plenty of interesting things. Besides these well-known "three-letter" libraries, the new Intel DAAL library has been added, which I already mentioned in a separate post. My list of improvements and additions is not exhaustive, but I have tried to cover the most significant things. Changes in Intel Cilk Plus, new annotated listings from the compiler, improvements in compilation speed and many other minor improvements were left out. Try the latest version, which is available with a 30-day trial license (without any functional limitations and with full support), and share your experience with us!