
Intel and Facebook jointly improve the performance of the Caffe2 library


Every day the world around us generates more and more information: text, graphics, multimedia, and so on. In recent years, artificial intelligence and deep learning technologies have improved a range of applications that help people make better sense of this information, adding speech, video, and image recognition as well as recommendation functionality.

Over the past year, Intel has added CPU optimizations to several deep learning frameworks to speed up inference workloads. These optimizations are built on the Intel Math Kernel Library (Intel MKL), which uses the Intel Advanced Vector Extensions 512 (Intel AVX-512) instructions for enhanced deep learning support.
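As a quick sanity check, the snippet below (a minimal sketch, not from the original article, and Linux-specific) reads /proc/cpuinfo to see whether the processor advertises the AVX-512 feature flags these MKL kernels rely on:

```python
# Minimal sketch: check whether the CPU advertises the AVX-512 flags
# that Intel MKL's deep learning kernels can take advantage of.
# Linux only; flag names are the standard kernel spellings.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx2", "avx512f", "avx512dq", "avx512bw", "avx512vl"):
    print("{}: {}".format(feature, "yes" if feature in flags else "no"))
```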

Caffe2 is an open source deep learning framework created by Facebook that emphasizes speed and modular execution. Caffe2 is designed to help researchers train large machine learning models and develop mobile AI applications.

Intel and Facebook are jointly integrating Intel MKL functions into Caffe2 for optimal inference performance. The table below shows inference throughput obtained with the Intel MKL and Eigen BLAS libraries. In the table, OMP_NUM_THREADS is the number of physical cores used. The results show that Caffe2 can be well optimized for the CPU. For small-batch workloads, it is recommended to dedicate a separate physical core to each workload and run them in parallel.
Batch size    OMP_NUM_THREADS = 44            OMP_NUM_THREADS = 1
              Intel MKL      Eigen BLAS       Intel MKL      Eigen BLAS
              (images/sec)   (images/sec)     (images/sec)   (images/sec)
1             173.4          5.2              28.6           5.1
32            1500.2         29.3             64.6           15.4
64            1596.3         35.3             66.0           15.5
256           1735.2         44.9             67.3           16.2
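As an illustration of how such throughput figures can be measured, here is a minimal sketch using Caffe2's Python API. The tiny network, the blob names, and the thread count are placeholders rather than the actual benchmark setup, and whether MKL kernels are exercised depends on how Caffe2 was built:

```python
# Minimal sketch of an inference throughput measurement with Caffe2's
# Python API. The toy ConvNet below is a stand-in, not the model behind
# the table above.
import os
import numpy as np

# The thread count must be set before the OpenMP runtime is loaded.
os.environ["OMP_NUM_THREADS"] = "44"   # number of physical cores to use

from caffe2.python import brew, model_helper, workspace

batch_size = 32
model = model_helper.ModelHelper(name="bench")
brew.conv(model, "data", "conv1", dim_in=3, dim_out=64, kernel=3)
brew.relu(model, "conv1", "conv1")
brew.average_pool(model, "conv1", "pool1", kernel=222)  # global pooling
brew.fc(model, "pool1", "fc1", dim_in=64, dim_out=10)

workspace.FeedBlob(
    "data", np.random.rand(batch_size, 3, 224, 224).astype(np.float32))
workspace.RunNetOnce(model.param_init_net)
workspace.CreateNet(model.net)

# BenchmarkNet(name, warmup_runs, main_runs, run_individual_ops)
# returns per-iteration times in milliseconds; entry 0 is the whole net.
ms_per_iter = workspace.BenchmarkNet(model.net.Proto().name, 10, 100, False)
print("images/sec:", batch_size * 1000.0 / ms_per_iter[0])
```

For the small-batch case, the recommendation above translates into launching one such process per physical core with OMP_NUM_THREADS=1.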
Earlier this year, a new generation of Intel Xeon processors (codenamed Skylake) came to market. One of Skylake's new features is the 512-bit Fused Multiply-Add (FMA) instructions, part of the Intel AVX-512 instruction set, which deliver a significant performance boost over the previous 256-bit AVX2 instructions for both model training and inference. The 512-bit FMAs double the FLOPS the processor can achieve and significantly accelerate the single-precision matrix arithmetic used in convolutional and recurrent neural networks. Inference parallelizes well and will benefit from the higher core counts of the new processors. In addition, higher memory frequency and a larger mid-level cache (MLC) per core will also benefit performance.
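To see where the doubling comes from: a 512-bit register holds 16 single-precision values, an FMA counts as two floating-point operations (a multiply and an add) per value, and high-end Skylake parts carry two AVX-512 FMA units per core. A back-of-the-envelope calculation, where the core count and clock are illustrative assumptions rather than a specific SKU's figures:

```python
# Back-of-the-envelope peak single-precision FLOPS for an AVX-512 CPU.
cores = 22          # physical cores (hypothetical)
ghz = 2.1           # sustained AVX-512 clock in GHz (hypothetical)
lanes = 512 // 32   # 16 single-precision lanes per 512-bit register
fma_ops = 2         # an FMA counts as a multiply and an add
fma_units = 2       # AVX-512 FMA units per core on high-end Skylake parts

peak_tflops = cores * ghz * lanes * fma_ops * fma_units / 1000.0
print("peak: {:.1f} TFLOPS single precision".format(peak_tflops))
# Halving the register width (AVX2: 8 lanes) halves the peak, which is
# exactly the 2x gap the 512-bit FMA instructions close.
```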


Source: https://habr.com/ru/post/329682/

