In the first week of September this year, a new version of Intel Parallel Studio XE 2017 was released. Let's take a look at what's interesting in it.
As usual, the suite comes in three editions: Composer, Professional, and Cluster Edition.
The Composer Edition includes compilers (C/C++ and Fortran) and various libraries. As of the previous version, XE 2016, there were already four of them:
- Intel Math Kernel Library (MKL), a fast math library
- Intel Integrated Performance Primitives (IPP), a data processing and multimedia library
- Intel Threading Building Blocks (TBB), a C++ template library
- Intel Data Analytics Acceleration Library (DAAL), a library for machine learning and data analytics
The Professional Edition adds tools for dynamic analysis of applications: the Intel VTune Amplifier profiler, Intel Advisor for prototyping parallelism and working with code vectorization, and Intel Inspector for finding memory and threading errors. In the flagship Intel Parallel Studio XE Cluster Edition we get the entire toolset for creating and optimizing our application on a cluster, using Intel MPI and the tools for working with it.
In addition to all this, there is the third-party Rogue Wave IMSL library, often used in numerical computing. It is available both bundled with the Fortran compiler and as a separate add-on. The following table shows the distribution of development tools across the editions:

An important addition to all editions of the studio (and not only) is the Intel Distribution for Python* package, which gives you Python (2.7 and 3.5) "optimized" by Intel's caring engineers for Windows, Linux, and OS X. Nothing new has appeared in the language itself; it is just that the NumPy, SciPy, pandas, scikit-learn, Jupyter, matplotlib, and mpi4py packages are now tuned to the hardware via the Intel MKL, TBB, DAAL, and MPI libraries, so applications run faster and more efficiently. By the way, you can download this package absolutely free of charge from the Intel site, where you can also see just how "souped-up" this Python is:

Obviously, the performance of Intel Python brings it close to C with Intel MKL in speed, while significantly outperforming the competition. By the way, the well-known VTune Amplifier profiler now supports Python as well.
Of the globally new items in the Intel Parallel Studio XE package, this is perhaps the most important. Now let's look at what has appeared in each of the package's components.
Compilers
About them, I will definitely write a separate, more detailed post. Here I will "briefly" list what has appeared:
- The C++ SIMD Data Layout Templates (SDLT) template library, which solves the problem of converting from Array of Structures (AoS) to Structure of Arrays (SoA), reducing non-consecutive memory accesses and gather/scatter instructions under vectorization.
- Support for second-generation Intel Xeon Phi processors and the new mic_avx512 target name.
- More complete support for OpenMP (both the 4.0 standard and the later 4.5).
- New offload functionality for Intel Xeon Phi using OpenMP.
- The ability to generate files (HTML or text) with the source code and the optimization report integrated into it, which will be very useful for developers working from the command line without an IDE. In addition, the informativeness of the optimization and vectorization reports themselves has been improved in many areas.
- A new attribute, directive, and compiler option for code alignment (aligning not data but the instructions of functions or loops themselves).
- Wider support for the C++14 standard: in particular, we can now use variable templates, free memory of a given size via the global operator delete (sized deallocation), and write constexpr functions with noticeably fewer restrictions than in C++11. A full list of supported features of the standard can be found on this page.
- The C11 standard is now fully supported (not to be confused with C++11, which has long been supported), with the exception of the _Atomic keyword and the corresponding __attribute__((atomic)) attribute.
- A number of new compiler options have appeared, designed to simplify its use. For example, the /fp:consistent (Windows) and -fp-model consistent (Linux) switches, which imply a number of others (/fp:precise, /Qimf-arch-consistency:true, /Qfma-).
- New capabilities for offloading computation to integrated graphics using OpenMP. For example, asynchronous offload is now possible using the DEPEND clause of the TARGET directive. In addition, the compiler can now vectorize code with the short type.
- The Fortran compiler also has many changes. Especially interesting is the substantial increase in the performance of Coarray Fortran code: in some cases the speedup reaches three times compared with the previous implementation. It is also now possible to tell the compiler about the alignment of dynamically allocated arrays, and the Fortran 2008 and 2015 standards are supported even more broadly. In particular, an experienced developer now gets implied-shape constant arrays, the ability to use BIND(C) for internal procedures, and pointer initialization at declaration (yes, before the 2008 standard Fortran did not support this).
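To make the first item above concrete, here is a minimal sketch of the AoS-to-SoA idea that SDLT automates. It deliberately avoids the SDLT API itself; the `PointAoS`/`PointsSoA` types and the `shift_x` function are illustrative names, not library code.

```cpp
#include <cstddef>
#include <vector>

// AoS: fields of each point are interleaved in memory, so a loop that
// touches only x strides through memory and forces gather/scatter.
struct PointAoS { float x, y, z; };

// SoA: each field lives in its own contiguous array, giving the
// unit-stride accesses the vectorizer wants.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Shift every x coordinate; with the SoA layout the loads and stores
// are consecutive, so the loop vectorizes cleanly.
void shift_x(PointsSoA& p, float dx) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += dx;
}
```

Roughly speaking, SDLT's containers give you this SoA storage while letting you keep writing code against an AoS-style interface.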
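The C++14 additions mentioned above are easy to demonstrate. This snippet is plain standard C++, nothing Intel-specific: a variable template, plus a constexpr function with a loop and a local variable, both of which C++11 would have rejected.

```cpp
// Variable template (C++14): a single name parameterized by type.
template <typename T>
constexpr T pi = T(3.1415926535897932385L);

// Relaxed constexpr (C++14): loops and mutable locals are now allowed
// inside constexpr functions, so this runs entirely at compile time.
constexpr long factorial(int n) {
    long r = 1;
    for (int i = 2; i <= n; ++i)
        r *= i;
    return r;
}

static_assert(factorial(5) == 120, "evaluated at compile time");
```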
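And a sketch of the asynchronous offload idea: two TARGET regions ordered by a DEPEND clause, with NOWAIT making the first one asynchronous. The `scale_then_sum` function is illustrative, not from any Intel sample; on a machine without an offload device (or with OpenMP disabled) the pragmas are simply ignored and the loops run serially on the host.

```cpp
// Scale an array on the device, then sum it; the depend clauses order
// the two target regions while nowait lets the host run ahead.
void scale_then_sum(float* a, int n, float factor, float* sum) {
    float s = 0.0f;

#pragma omp target map(tofrom: a[0:n]) depend(out: a) nowait
    for (int i = 0; i < n; ++i)
        a[i] *= factor;

    // This region waits for the first one via depend(in: a).
#pragma omp target map(to: a[0:n]) map(tofrom: s) depend(in: a)
    for (int i = 0; i < n; ++i)
        s += a[i];

#pragma omp taskwait  // belt and braces: drain any outstanding target tasks
    *sum = s;
}
```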
And this is just the tip of the iceberg; in reality there is a huge number of changes. Overall the compiler, as expected, has begun to generate even more efficient code while supporting all the latest hardware innovations. A benchmark comparison of the Intel compiler with its competitors proves it once again.

As you can see, on the SPECfp and SPECint tests, the advantage of version 17.0 of the compiler is quite impressive.
Intel VTune Amplifier XE 2017
Since the second-generation Intel Xeon Phi processor (codenamed Knights Landing, KNL) has entered the market, the profiler now supports various analysis types for it, so that after optimization your applications simply "fly": memory bandwidth analysis (to understand which data should be placed in the fast MCDRAM), microarchitectural analysis, analysis of MPI and OpenMP scalability, and much more.
By the way, VTune now has an HPC Performance Characterization analysis, a kind of entry point for evaluating the performance of compute-intensive applications. It lets you look in detail (unlike the Application Performance Snapshot, which will be discussed later) at three important aspects of hardware utilization: CPU, memory, and FPU:
As I said, VTune can now profile Python applications, performing hotspot analysis via software sampling (Basic Hotspots analysis). Another innovation is the ability to profile Go applications using hardware PMU events.
Version 2017 introduces a new analysis type: disk input/output analysis (Disk Input and Output analysis). With it you can track the usage of the disk subsystem, the CPU, and the PCIe bus:

In addition, this analysis helps find I/O operations with high latency, as well as imbalance between I/O and actual computation.
The maximum DRAM bandwidth is now determined automatically during Memory Access analysis, which lets you understand how well it is actually used. This analysis type has also gained support for custom memory allocators, allowing objects in memory to be identified correctly: cache misses can now be correlated with a specific data structure, and not just with the code that leads to them. By the way, there is no longer any need to install special drivers on Linux to perform this analysis.
There are additions to OpenCL and GPU profiling. VTune will show an overall problem summary with a detailed description:
In addition, you can view the source code or assembly of OpenCL kernels, as well as catch uses of shared virtual memory (Shared Virtual Memory).
Using the profiler remotely and through the command line has become even easier.
The new Arbitrary Targets group lets you generate a command line for performance analysis on a system that is not accessible from the current host. This can be very useful for microarchitectural analysis, since it gives access to the hardware events available on the platform we have chosen:

The Parallel Studio package includes special "lightweight" tools that give a general picture of our application. They bear the Performance Snapshot label, and there are currently three variants:
- Application - for non-MPI applications
- MPI - for MPI applications
- Storage - for analyzing storage usage
What do they look like? Let's consider, for example, the Application Performance Snapshot tool:

Running it, we get at the output a high-level summary of where our application can realistically gain, or, put another way, its weak points that need fixing. We get the application's execution time, its performance in FLOPS, and the answers to three questions: how effectively the CPU was used (CPU Utilization), how much time was spent waiting on memory (Memory Bound), and how the FPU was used (FPU Utilization). As a result, we see the overall picture of the application. If CPU Utilization shows too low a value, the application has problems with parallelism and core-usage efficiency: for example, load imbalance, too much scheduling overhead, synchronization problems, or simply many regions that execute serially. The next metric tells us whether the application is limited by memory. If its numbers are high, there may be problems with cache usage or false sharing of data, or we are simply hitting the memory bandwidth limit.
Finally, FPU Utilization can show that there are problems with vectorization (its absence or inefficiency), provided the application does floating-point computation.
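Since false sharing came up above, here is a minimal sketch of what it looks like at the data-layout level (the `HotPair`/`PaddedPair` names are illustrative, not anything the snapshot tools report):

```cpp
#include <cstddef>

// In HotPair, both counters sit in the same 64-byte cache line; two
// threads incrementing a and b respectively will bounce that line
// between cores even though they never touch each other's data.
struct HotPair {
    long a;
    long b;  // same cache line as a -> false sharing
};

// Padding each counter to its own 64-byte line removes the contention
// at the cost of some memory.
struct PaddedPair {
    alignas(64) long a;
    alignas(64) long b;  // starts a fresh cache line
};

static_assert(offsetof(PaddedPair, b) == 64,
              "b begins on its own 64-byte cache line");
```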
Intel Advisor
Back in the previous 2016 version, Advisor gained extensive functionality for working with vectorization; in the new version it has been expanded and improved.
Of course, there is support for the new generation of Xeon Phi, as in the other tools, and for AVX-512. For example, Advisor can show the use of the new KNL-specific AVX-512 ERI instruction set. But even if we don't have the new hardware, we can still get some data for AVX-512: instruct the compiler to generate two code paths (AVX and AVX-512) for loops using the -axMIC-AVX512 -xAVX options, and Advisor will display information in its results even for the path that was not actually executed:

In addition, it is now possible to measure FLOPS on all processors (except previous-generation Xeon Phi), based on instrumentation and sampling. Memory Access analysis has also been improved and now gives us even more useful information (comparison with cache sizes, identification of unnecessary gather/scatter operations). There are a number of usability improvements as well: Smart Mode shows only the loops that are most interesting to us, at the press of a single button:

It is now also possible to select several analysis types and run them all at once using Batch Mode.
The bottom line: with the new 2017 version we have a large set of new features that are sure to be useful to developers in all areas when optimizing applications. As always, you can try the new version for 30 days, absolutely free of charge, on this page. As one of my colleagues says: "with registration, but without SMS".