
Pilot video course "Parallel Programming and Optimization for Intel Xeon Phi Coprocessors"

Hi, Habr!

As stated in the title of this post, I am actively working on a training video course on parallel programming and code optimization for high-performance systems based on Intel architectures. Below you will find more information about the course, a list of the topics and lab exercises it covers, and pilot episodes that give an idea of its content and format.

The pilot module uses a histogram-construction example to demonstrate two optimization techniques that improve automatic vectorization of the code by the compiler, along with performance results for an Intel Xeon CPU and an Intel Xeon Phi coprocessor.
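To give a flavor of the kind of transformation discussed in the pilot module, here is a minimal sketch of a histogram kernel. It is my own illustration under stated assumptions, not code taken from the course: the naive form `hist[(int)(data[i]/width)]++` mixes a division with a scattered increment inside one loop, which blocks auto-vectorization; splitting the work into a vectorizable index-computation pass over a small strip, plus a scalar increment pass, is one common remedy.

```c
#define STRIP 256  /* strip size chosen to fit comfortably in L1 cache */

/* Bin n float values into hist[]; bin i covers [i*width, (i+1)*width).
   The caller must zero hist[] and ensure all values fall in range. */
void histogram(const float *data, int n, int *hist, float width)
{
    int index[STRIP];                  /* per-strip buffer of bin indices */
    const float recip = 1.0f / width;  /* multiply instead of divide      */

    for (int ii = 0; ii < n; ii += STRIP) {
        const int len = (ii + STRIP < n) ? STRIP : n - ii;

        /* Pass 1: pure arithmetic with unit-stride stores --
           the compiler can auto-vectorize this loop. */
        for (int i = 0; i < len; i++)
            index[i] = (int)(data[ii + i] * recip);

        /* Pass 2: scattered increments stay scalar, but the
           expensive math has been moved out of this loop. */
        for (int i = 0; i < len; i++)
            hist[index[i]]++;
    }
}
```

The exact techniques and measured speedups shown in the episode may differ; this only sketches the general idea of strip-mining and strength reduction in service of vectorization.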



This course is being filmed for Intel in English and will be used where I or my colleagues from the company cannot deliver the training in person. The list of topics in the video course is based on our one-day training. The slides for this course are available at the following link: http://research.colfaxinternational.com/post/2014/10/13/CDT-Slides.aspx. When downloading the slides, an email address and name are requested for internal statistics only and are not added to mailing lists without the client's consent.
List of topics for training (in English)
  1. Welcome
    • About This Document
    • Disclaimer
    • Intel Many Integrated Core (MIC) Architecture
    • Purpose of the Intel MIC Architecture
    • Details of the MIC Architecture
    • Software Tools for Intel Xeon Phi Coprocessors
    • Will My Application Benefit from the MIC architecture?
    • Models for Intel Xeon Phi Coprocessor Programming
  2. Overview of Programming Options
    • Native Coprocessor Applications
    • Explicit Offload
    • Data and Memory Buffer Retention
    • Virtual-Shared Memory Offload Model
    • Handling Multiple Coprocessors
    • Heterogeneous Programming with Coprocessors using MPI
    • File I/O in MPI Applications on Coprocessors
  3. Expressing Parallelism on Intel Architectures
    • SIMD Parallelism and Automatic Vectorization
    • Thread Parallelism and OpenMP
    • Thread Synchronization in OpenMP
    • Reduction Across Threads: Avoiding Synchronization
    • Distributed Memory Parallelism and MPI
    • Summary and Additional Resources
  4. Optimization Using Intel Software Development Tools
    • Optimization Roadmap
    • Library Solution: Intel Math Kernel Library (MKL)
    • Node-Level Tuning with Intel VTune Amplifier XE
    • Cluster-Level Tuning with Intel Trace Analyzer and Collector
  5. Optimization of Scalar Arithmetic
    • Compiler-friendly Practices
    • Accuracy Control
    • Optimization of Vectorization
    • Diagnostics and Facilitation of Automatic Vectorization
    • Vector-friendly Data Structures
    • Data Alignment for Vectorization
    • Strip-Mining for Vectorization
    • Additional Vectorization "Tuning Knobs"
  6. Optimization of Thread Parallelism
    • Reduction instead of Synchronization
    • Elimination of False Sharing
    • Expanding Iteration Space
    • Controlling Thread Affinity
  7. Optimization of Data Traffic
    • Memory Access and Cache Utilization
    • PCIe Traffic Optimization in Offload Applications
    • MPI Traffic Optimization: Fabric Selection
  8. Optimization of MPI Applications
    • Load Balancing in Heterogeneous Applications
    • Inter-Operation with OpenMP
    • Additional Resources
  9. Course Wrap-Up
    • Knights Landing, the Next Manycore Architecture
    • Where to get more information
    • How to Obtain an Intel Xeon Phi Coprocessor
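The topics "Reduction Across Threads: Avoiding Synchronization" and "Reduction instead of Synchronization" above can be illustrated with a minimal OpenMP sketch. This is my own example under stated assumptions, not code from the course: protecting a shared accumulator with a critical section serializes the threads, whereas the `reduction` clause gives each thread a private partial sum that is combined once at the end.

```c
/* Sum an array in parallel. The reduction(+:total) clause replaces
   what would otherwise be a synchronized update of a shared variable:
   each thread accumulates into its own private copy of `total`, and
   OpenMP combines the copies after the loop. Compile with -fopenmp
   (GCC) or -qopenmp (Intel); without it, the pragma is ignored and
   the loop simply runs serially, producing the same result. */
double sum_reduce(const double *x, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}
```

The same pattern applies to the histogram example: per-thread private histograms merged at the end avoid the contention of atomic increments on a shared array.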


The course will also include lab exercises that walk step by step through the stages of code optimization using specific examples. The names of these practical exercises are listed below.
Lab names (in English)
  • 2.1-native
  • 2.2-explicit-offload
  • 2.3-explicit-offload-persistence
  • 2.4-explicit-offload-matrix
  • 2.5-sharing-complex-objects
  • 2.6-multiple-coprocessors
  • 2.7-asynchronous-offload
  • 2.8-MPI
  • 2.9-openmp4.0
  • 3.1-vectorization
  • 3.2-OpenMP
  • 3.3-Cilk-Plus
  • 3.4-MPI
  • 4.1-vtune
  • 4.2-itac
  • 4.3-serial-optimization
  • 4.4-vectorization-data-structure
  • 4.5-vectorization-compiler-hints
  • 4.6-optimize-shared-mutexes
  • 4.7-optimize-scheduling
  • 4.8-insufficient-parallelism
  • 4.9-affinity
  • 4.a-tiling
  • 4.b-Nbody
  • 4.c-cache-oblivious-recursion
  • 4.d-cache-loop-fusion
  • 4.e-offload
  • 4.f-MPI-load-balance
  • 4.g-hybrid
  • 4.h-MKL


Work on recording and editing has only just begun, so I would very much like to hear Habr's opinion on the questions presented below. Translating just the audio track of a single 10-minute episode into Russian takes me a few hours, and there will be 50-60 episodes, so I would like to know in advance whether this idea is valuable to Habr's readers. Any constructive criticism of the content or presentation, or simply a comment, is welcome.

Source: https://habr.com/ru/post/246055/

