
Intel® Parallel Studio XE 2017: “Python Comes to Us” and More


In the first week of September this year, a new version of the suite, Intel Parallel Studio XE 2017, was released. Let's take a look at what's interesting in it.

As usual, the studio package is available in three versions: Composer, Professional and Cluster Edition.
The first edition includes the compilers (C/C++ and Fortran) and various libraries; as of the previous XE 2016 version there were already four of these:


The Professional Edition adds tools for dynamic analysis of applications: the Intel VTune Amplifier profiler; Intel Advisor, a tool for prototyping parallelism and working with code vectorization; and Intel Inspector, for finding memory and threading errors. In the flagship version, Intel Parallel Studio XE Cluster Edition, we get the entire set of tools for creating and optimizing an application on a cluster, using Intel MPI and the tools for working with it.

In addition to all this, there is the third-party Rogue Wave IMSL library, often used in numerical computing. It is available both bundled with the Fortran compiler and as a separate add-on. The following table shows the distribution of development tools across the editions:


An important addition to all editions of the studio (and not only there) is the Intel Distribution for Python* package, which gives you Python (2.7 and 3.5) “optimized” by Intel's caring engineers for Windows, Linux, and OS X. Nothing new appeared in the language itself; rather, the NumPy, SciPy, pandas, scikit-learn, Jupyter, matplotlib, and mpi4py packages are now tuned for the hardware using the Intel MKL, TBB, DAAL, and MPI libraries, which makes applications run faster and more efficiently. By the way, you can download this package absolutely free of charge from the Intel site, where you can also see just how good this “souped-up” Python is:
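As a quick, hypothetical illustration (my own sketch, not from the article), you can check whether your NumPy build is linked against MKL, which is exactly what distinguishes the Intel Distribution packages from a stock build:

```python
import io
import contextlib
import numpy as np

# Capture the build configuration report. An MKL-backed NumPy (as shipped
# with the Intel Distribution for Python) mentions "mkl" here; a stock
# build usually reports openblas or another BLAS implementation.
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()
config_text = buf.getvalue().lower()

print("Built against MKL:", "mkl" in config_text)

# The optimized packages matter most for BLAS/LAPACK-heavy calls like this:
x = np.random.rand(500, 500)
eigvals = np.linalg.eigvals(x)
print("computed", eigvals.shape[0], "eigenvalues")
```

The script itself runs on any NumPy build; only the printed configuration tells you which BLAS it was compiled against.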



Clearly, Intel Python's performance brings it close to the speed of C with Intel MKL, while significantly outperforming the competition. By the way, the well-known VTune Amplifier profiler now also supports Python.

Of the genuinely new additions to the Intel Parallel Studio XE package, this is perhaps the most important. Now let's look at what's new in each component of the suite.

Compilers


I will definitely write a separate, more detailed post about the compilers. Here I will just briefly list what's new:


And this is just the tip of the iceberg; in fact, there is a huge number of changes. Overall, the compiler, as expected, now generates even more efficient code and supports all the latest hardware innovations. Benchmark comparisons of the Intel compiler against its competitors prove it once again.



As you can see, on the SPECfp and SPECint tests, the advantage of version 17.0 of the compiler is quite impressive.

Intel VTune Amplifier XE 2017


Now that the second-generation Intel Xeon Phi processor (codenamed Knights Landing, KNL) has entered the market, the profiler supports various types of analysis for it so that your applications simply “fly” after optimization: memory-bandwidth analysis, to understand which data to place in the fast MCDRAM; microarchitectural analysis; MPI and OpenMP scalability analysis; and much more.

By the way, VTune now has an HPC Performance Characterization analysis, a kind of entry point for evaluating the performance of computationally intensive applications. It lets you look in detail (unlike Application Performance Snapshot, which will be discussed later) at three important aspects of hardware utilization: CPU, memory, and FPU:


As I said, VTune can now profile Python applications, performing hotspot analysis via software sampling (Basic Hotspots analysis). Another innovation is the ability to profile Go applications using PMU hardware events.

A new type of analysis appeared in version 2017: Disk Input and Output analysis. With it you can track usage of the disk subsystem, the CPU, and the PCIe bus:



In addition, the analysis helps find I/O operations with high latency, as well as imbalance between I/O and actual computation.

The maximum DRAM bandwidth is now determined automatically during Memory Access analysis, letting you see how well it is actually utilized. This analysis also adds support for custom memory allocators, which lets it correctly identify objects in memory. It is now possible to attribute cache misses to a specific data structure, and not just to the code that touches it. By the way, special drivers are no longer required on Linux to perform this analysis.
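To illustrate the kind of problem Memory Access analysis surfaces (a toy example of mine, not Intel's), compare cache-friendly row-wise traversal of a C-ordered NumPy array with column-wise traversal, which strides through memory:

```python
import time
import numpy as np

a = np.zeros((2000, 2000))  # C-ordered: rows are contiguous in memory

def traversal_time(row_wise):
    """Sum the array slice by slice, timing the traversal."""
    start = time.perf_counter()
    if row_wise:
        # Row-wise: walks memory sequentially, cache lines are fully used.
        s = sum(a[i, :].sum() for i in range(a.shape[0]))
    else:
        # Column-wise: strided access; each element may touch a new cache line.
        s = sum(a[:, j].sum() for j in range(a.shape[1]))
    return time.perf_counter() - start, s

row_t, _ = traversal_time(True)
col_t, _ = traversal_time(False)
print(f"row-wise: {row_t:.4f} s, column-wise: {col_t:.4f} s")
```

On most machines the column-wise pass is noticeably slower; a memory-access profiler would attribute the cache misses to the strided traversal of `a`.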

There are also additions for profiling OpenCL and GPU workloads. VTune shows a general problem summary with a detailed description:


In addition, you can view the source code or assembly for OpenCL kernels, as well as catch uses of Shared Virtual Memory.
Using the profiler remotely and through the command line has become even easier.
The new group Arbitrary Targets allows you to generate a command line to analyze performance on a system that is not accessible from the current host. This can be very useful for microarchitectural analysis, as it gives access to hardware events available on the platform we have chosen:



The Parallel Studio package also includes special “lightweight” tools that give a general overview of our application. They go under the name Performance Snapshots, and there are currently three of them:


What are they like? Consider, for example, the Application Performance Snapshot tool:



Running it gives us a high-level summary of where our application can realistically gain performance; in other words, the weak points that need fixing. We get the application's execution time, its performance in FLOPS, and answers to three questions: how effectively the CPU was used (CPU Utilization), how much time was spent waiting on memory (Memory Bound), and how the FPU was used (FPU Utilization). As a result, we see the overall picture of the application. If CPU Utilization shows values that are too low, there are problems with the application's parallelism and the efficiency of core usage: for example, load imbalance, too much scheduling overhead, synchronization problems, or simply many regions that execute sequentially. The next metric tells us whether the application is limited by memory. If those numbers are high, there may be problems with cache usage, false sharing of data, or we may be hitting the bandwidth limit.

Finally, the FPU Utilization metric can reveal problems with vectorization (its absence or inefficiency), provided the application does floating-point computation.
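As a rough sketch of what the FLOPS metric means (my own back-of-the-envelope estimate, not Application Performance Snapshot itself), you can estimate achieved GFLOPS for a dense matrix multiply by hand:

```python
import time
import numpy as np

n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

# A dense n x n matrix multiply performs roughly 2*n**3 floating-point
# operations (n multiplies and n-1 adds per output element).
flops = 2 * n**3
gflops = flops / elapsed / 1e9
print(f"~{gflops:.1f} GFLOPS achieved")
```

APS measures this from hardware counters rather than from an operation count, but the idea — achieved floating-point throughput versus what the hardware can deliver — is the same.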

Intel Advisor


Since the previous 2016 version, Advisor has gained extensive functionality for working with vectorization, and the new version expands and improves it.

Of course, there is support for the new-generation Xeon Phi, as in the other tools, as well as for AVX-512. For example, Advisor can show the use of the new KNL-specific AVX-512 ERI instruction set. But even without the new hardware, we can get some AVX-512 data. To do this, instruct the compiler to generate two code branches (AVX and AVX-512) for loops using the -axMIC-AVX512 -xAVX options, and Advisor will display information in its results even for the branch that was not actually executed:



In addition, FLOPS can now be measured on all processors except the previous-generation Xeon Phi, based on instrumentation and sampling. Memory Access Analysis has also been improved and now gives even more useful information (comparison with cache sizes, identification of unnecessary gather/scatter operations). There are a number of usability improvements as well: Smart Mode shows only the loops that are most interesting to us, at the press of a single button:



It is also now possible to select several types of analysis and run them all at once using Batch Mode.

The bottom line: with the new 2017 version we get a large set of new features that will certainly be useful to developers optimizing applications in every field. As always, you can try the new version for 30 days absolutely free of charge on this page; as one of my colleagues puts it: “With registration, but without SMS”.

Source: https://habr.com/ru/post/311160/
