
Why CNTK?

Hi, Habr! My name is Zhenia. At the beginning of my career I was a data scientist, back when it wasn't yet mainstream. I then switched to pure T-SQL development, which eventually grew into business analytics. Now I'm a technology evangelist at Microsoft with an obvious focus on the data platform, although that doesn't stop me from doing other cool things in my free time, like Docker containers or Mixed Reality.

Recently I spoke with one of our partners, and he asked why we hardly ever talk about the advantages of CNTK here on Habr. At first we thought it might look too self-serving: a story in a company blog about that same company's product. But then we decided it was actually a great chance to share our view and hear yours. Everyone interested in CNTK and TensorFlow, welcome under the cut.



Microsoft Cognitive Toolkit (CNTK) is a free, open-source toolkit for deep learning. If you take GitHub stars as a measure, today it is the third most popular specialized deep learning package after TensorFlow and Caffe, ahead of platforms such as MXNet, Theano, and Torch.
Disclaimer! This article does not claim to be the ultimate truth; it simply highlights the key features of CNTK. We will be glad to hear your opinion in the comments.

CNTK and TensorFlow: In Brief


So, to the point. How does CNTK differ from TensorFlow? In brief: speed, accuracy, API structure, scalability, scoring, extensibility, and built-in reader modules.

If these brief arguments raise questions or doubts, read on: below we look at each of them in more detail.

CNTK and TensorFlow: Details


Speed


Deep learning processes huge amounts of data, which requires large computational resources. Whether you are developing an application or preparing a scientific paper, success largely depends on how fast you can run experiments.

The results of the HKBU study and this article showed that on all tested networks CNTK delivers performance comparable to TensorFlow on both CPUs and GPUs. In fact, if we consider only GPU runs, CNTK showed the best results among all the tested packages.

When working with images, CNTK typically provides a performance boost of two to three times over TensorFlow. As for recurrent neural networks, here CNTK is the undisputed leader: as the article above states, when running on CPUs, "CNTK shows much better performance (up to 5-10 times) than Torch and TensorFlow", and when executed on GPUs, "CNTK has shown an order of magnitude better results than other tools."

The speed gain is not just a lucky coincidence. CNTK was originally developed by a team of speech recognition specialists at Microsoft Research and has been optimized for processing sequences. For example, it is used to build natural language understanding models with training sets of more than 4 billion examples.

If your project involves sequence processing, for example speech recognition, natural language understanding, or machine translation, CNTK will be the best choice for you in terms of performance. It is also worth trying CNTK for video processing and image recognition.

Accuracy


If you are familiar with deep learning, you probably know how hard it is to develop toolkits. Bugs in toolkit code can be subtle and in many cases do not prevent you from obtaining reasonably good models. However, such bugs often keep a network architecture from reaching its full potential and artificially lower the results. That is why our colleagues developing CNTK pay great attention to finding such errors, making sure the toolkit lets you train models from scratch and achieve the highest accuracy.

A good example is the story of the Inception V3 network, developed by researchers at Google. TensorFlow engineers published an Inception V3 training script and a pre-trained model for download and testing. However, it proved impossible to retrain the model from scratch and reach the published accuracy, because that required additional information, including details of the data preprocessing. The best accuracy achieved by a third party (in this case, Keras) was about 0.6% below the figure the developers reported in their paper. As a result of their experiments, researchers from the CNTK team managed to train an Inception V3 model in CNTK with an error of 5.972%, which is even better than the figure in the original paper. You can verify this result yourself: the training script is available on GitHub.

In addition, CNTK's automatic batching algorithm can pack sequences of different lengths into a single minibatch, achieving high execution efficiency for recurrent neural networks. It also improves data randomization during training and often yields 1-2% higher accuracy than other packing methods. It is thanks to this approach that researchers at Microsoft Research were the first to teach a computer to recognize speech as well as a human does.
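To illustrate the feeding mechanics, here is a minimal sketch (the dimensions and the toy operation are purely illustrative): a minibatch is simply a list of NumPy arrays of different lengths, and CNTK packs and pads them behind the scenes.

```python
import numpy as np
import cntk as C

# Each sample is a sequence of 10-dimensional vectors of arbitrary length.
x = C.sequence.input_variable(10)
# A trivial operation over the sequence axis, just to show the feeding mechanics.
z = C.sequence.reduce_sum(x)

# Three sequences of different lengths (7, 3 and 12 steps) in one minibatch;
# CNTK pads and packs them into a single batch automatically.
minibatch = [np.random.rand(n, 10).astype(np.float32) for n in (7, 3, 12)]
print(z.eval({x: minibatch}).shape)  # -> (3, 10)
```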

API structure


From the very beginning, we assumed that applications would embed more than just the scoring of ready-made models. Training capabilities can also be tightly integrated into "intelligent" applications such as Office or Windows. Virtually all CNTK functionality is written in C++. This not only improves performance but also makes CNTK usable as a C++ API, ready for integration into any application. It also makes it easy to add further bindings, such as Python, Java, .NET, and so on.

It should also be noted that the Python API in CNTK has both a low-level and a high-level implementation. The high-level Python API is built on the functional programming paradigm; it is very compact and intuitive, which is especially noticeable when working with recurrent neural networks. This is the main difference from the Python API in TensorFlow, which many experts consider "too low-level".
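For example, a sequence classifier in the high-level functional style reads almost like a description of the architecture itself. In this sketch the vocabulary size and layer dimensions are illustrative only:

```python
import cntk as C

# Each layer is a function; the model is a composition of functions.
def create_model():
    return C.layers.Sequential([
        C.layers.Embedding(300),                  # word ids -> dense vectors
        C.layers.Recurrence(C.layers.LSTM(128)),  # run an LSTM over the sequence
        C.sequence.last,                          # keep the final hidden state
        C.layers.Dense(5)                         # 5 output classes
    ])

x = C.sequence.input_variable(20000, is_sparse=True)  # one-hot word ids
z = create_model()(x)
```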

Scalability


Modern deep learning tasks train on billions of examples, so it must be possible to run training on multiple GPUs and multiple machines. Many toolkits can use several GPUs, but only within one computer. Scaling out to more machines is possible, but implementing it usually takes considerable effort.

CNTK, by contrast, was designed with distributed training at its core. Going from training on a single GPU to a configuration with multiple GPUs on several machines is very easy: just a few lines of code, as the examples in the CNTK repository confirm. Microsoft researchers have run training jobs with CNTK on hundreds of GPUs across many machines. In addition, the framework ships with several highly efficient parallel training schemes: 1-bit SGD and Block-Momentum SGD. These algorithms greatly simplify hyperparameter tuning and speed up the production of high-quality models; thanks to them, Microsoft Research specialists were able to significantly improve speech recognition quality, surpassing human performance on the Switchboard telephone benchmark.
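As a rough sketch of those "few lines" (assuming a model `z`, a `criterion` and a training loop already defined, as in a typical CNTK script), you wrap an ordinary learner in a distributed one and launch the script under MPI, e.g. `mpiexec -n 4 python train.py`:

```python
import cntk as C

# A normal local learner; model `z` and `criterion` are assumed defined above.
local_learner = C.sgd(z.parameters,
                      lr=C.learning_rate_schedule(0.1, C.UnitType.minibatch))

# Wrap it for data-parallel training: gradients are exchanged between
# workers, quantized down to 1 bit per value (1-bit SGD).
distributed_learner = C.train.distributed.data_parallel_distributed_learner(
    local_learner,
    num_quantization_bits=1)

trainer = C.Trainer(z, criterion, [distributed_learner])
# ... the usual training loop goes here ...

# Shut down the MPI workers cleanly at the end of the script.
C.train.distributed.Communicator.finalize()
```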

Scoring


TensorFlow provides excellent scoring capabilities. The platform supports multiple model versions, saves them in a format optimized for serving, and the use of different metagraphs within one model allows it to serve different types of devices. In addition, thanks to XLA AOT compilation, TensorFlow can compile a model into an executable, which significantly reduces model size for mobile and embedded devices and minimizes latency.

Unlike TensorFlow, CNTK focuses on integrating CNTK Eval directly into user applications. Besides Python and C++, CNTK supports C#/.NET and Java for scoring. All of these APIs sit on top of the same C++ core, which gives them the same level of performance. If you are building a .NET application and choosing a deep learning toolkit for inference, CNTK may be the more convenient option.
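In Python, scoring a saved model takes just a few lines; the C#/.NET and Java APIs follow the same load-and-evaluate pattern. The file name and input shape below are hypothetical:

```python
import numpy as np
import cntk as C

# Load a previously trained model; the file name here is hypothetical.
model = C.load_model('my_model.cntk')

# Score a single input; the shape must match what the model was trained on.
sample = np.random.rand(1, 784).astype(np.float32)
scores = model.eval({model.arguments[0]: sample})
print(scores)
```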

CNTK supports evaluating a trained model from multiple threads in parallel, with only a small increase in memory consumption. This opens up great opportunities for deploying models as services, for example behind a web application or a REST API. CNTK also supports deployment on Intel and ARM based edge devices.

Extensibility


TensorFlow is a very flexible toolkit that lets you implement almost any model. However, if you currently use Caffe, you cannot simply convert an existing script into a TensorFlow script; you will have to rewrite everything from scratch. Likewise, to try out a new layer that another developer built with a different toolkit, you will have to reimplement it yourself.

Against this background, CNTK can be called a highly extensible toolkit. The UserFunction abstraction allows you to implement any operator in pure Python. Using NumPy arrays as the intermediary between CNTK and your extension, you simply implement the forward and backward passes, after which the new operator can be plugged straight into the network graph. Moreover, a graph from another toolkit can often be wrapped directly in a CNTK UserFunction, which greatly speeds up porting projects while letting you use capabilities unique to CNTK.
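As a minimal sketch, following the UserFunction pattern from the CNTK documentation, here is a custom sigmoid operator implemented entirely in NumPy:

```python
import numpy as np
import cntk as C
from cntk.ops.functions import UserFunction

class MySigmoid(UserFunction):
    def __init__(self, arg, name='MySigmoid'):
        super(MySigmoid, self).__init__([arg], name=name)

    def forward(self, argument, device=None, outputs_to_retain=None):
        # Compute the activation and keep it as state for the backward pass.
        sigmoid_x = 1.0 / (1.0 + np.exp(-argument))
        return sigmoid_x, sigmoid_x

    def backward(self, state, root_gradients):
        sigmoid_x = state
        return root_gradients * sigmoid_x * (1 - sigmoid_x)

    def infer_outputs(self):
        # Output has the same shape, type and dynamic axes as the input.
        return [C.output_variable(self.inputs[0].shape, self.inputs[0].dtype,
                                  self.inputs[0].dynamic_axes)]

# Plug the custom operator into a graph like any built-in op.
x = C.input_variable(3)
s = C.user_function(MySigmoid(x))
print(s.eval({x: np.array([[1.0, 2.0, 3.0]], dtype=np.float32)}))
```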

The same applies to weight update procedures. Most popular algorithms, such as RMSProp or Adam, already ship with CNTK, but you can also implement new training approaches in pure Python.
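Recent CNTK versions expose a universal learner for exactly this: it takes an update function written with ordinary CNTK ops. Below is a rough sketch of an AdaGrad-like rule; the model `z` and the hard-coded learning rate are assumptions for illustration:

```python
import cntk as C

# A hand-rolled AdaGrad-style update expressed in pure Python/CNTK ops.
def my_adagrad(parameters, gradients):
    accumulators = [C.constant(0, shape=p.shape, dtype=p.dtype)
                    for p in parameters]
    updates = []
    for p, g, a in zip(parameters, gradients, accumulators):
        accum = C.assign(a, a + g * g)  # running sum of squared gradients
        updates.append(C.assign(p, p - 0.01 * g / C.sqrt(accum + 1e-6)))
    return C.combine(updates)

# Attach the custom rule to a model's parameters (model `z` assumed defined
# elsewhere); the resulting learner plugs into a normal Trainer.
learner = C.learners.universal(my_adagrad, z.parameters)
```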

Built-in reader modules


It is an obvious fact: the more training data we have, the better the results. In some situations there is so much data that it does not fit in RAM, and sometimes a single machine's resources are not enough at all. Even when the data does fit in RAM, moving it from RAM to GPU memory inside the training loop often takes too long.

CNTK's built-in reader modules solve these problems by providing highly efficient iteration over data sets without loading them into RAM. You can work with a single disk or with a distributed file system such as HDFS. Aggressive prefetching eliminates GPU downtime. CNTK readers also guarantee that the model always receives data in well-shuffled order (which improves convergence), even if the underlying data set is sorted. Finally, all of this is available both to the built-in and to user-defined readers: even if you write a reader for your own custom format, you do not have to worry about implementing prefetching yourself.
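For example, a reader over the CNTK text format can be set up like this (a minimal sketch; the file name, field names, and dimensions are hypothetical):

```python
import cntk as C
from cntk.io import MinibatchSource, CTFDeserializer, StreamDefs, StreamDef

# Stream data from disk in CNTK text format; nothing is loaded into RAM
# up front, and shuffling and prefetching happen inside the reader.
source = MinibatchSource(
    CTFDeserializer('train.ctf', StreamDefs(
        features=StreamDef(field='x', shape=784),
        labels=StreamDef(field='y', shape=10))),
    randomize=True,                      # well-shuffled order
    max_sweeps=C.io.INFINITELY_REPEAT)   # iterate over the data indefinitely

# Pull one minibatch of 64 samples; in a real script this feeds a Trainer.
mb = source.next_minibatch(64)
print(mb[source.streams.features].number_of_samples)  # -> 64
```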

In conclusion: we will be glad to hear your comments on this article, and also to receive pull requests so that together we can turn CNTK into the coolest and most convenient deep learning toolkit.


P.S. We thank Konstantin Kichinsky (Quantum Quintum) for the illustration for this article.

Source: https://habr.com/ru/post/336552/

