My name is Grigory Sapunov, I am the CTO of the company Intento. I have been involved in neural networks and machine learning for a long time; in particular, I have built neural network recognizers for road signs and license plates, I take part in a project on neural network stylization of images, and I help many companies.
Let's get straight to the point. My goal is to give you basic terminology and an understanding of what is happening in this area: what building blocks neural networks are assembled from and how to use them.
The outline of the talk is as follows. First, a short introduction to what a neuron, a neural network, and a deep neural network are, so that we can speak the same language.
Then I will describe the important trends happening in this area. After that we will dive into neural network architectures and look at the three main classes. This will be the most informative part.
After that, we will look at two relatively advanced topics and finish with a short overview of frameworks and libraries for working with neural networks.
At the conference, Natalya Efremova from NTechLab spoke about practical use cases. I, in turn, will tell you how neural networks are built inside and what building blocks they consist of.
Summary
Recap: neuron, neural network, deep neural network
Brief reminder
An artificial neuron bears only a very distant resemblance to a biological neuron.
What is an artificial neuron? It is actually a simple function. It has inputs; each input is multiplied by a weight, everything is summed, passed through some non-linear function, and the result is the output. That is all, that is one neuron.
If you are familiar with logistic regression, where the non-linear function is the sigmoid, then one neuron is a complete analog of logistic regression, a simple linear classifier.
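To make this concrete, here is a minimal sketch of one artificial neuron with a sigmoid activation, assuming NumPy; the input and weight values are made-up illustration numbers, not anything from a real model.

```python
# A minimal sketch of a single artificial neuron with a sigmoid activation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(inputs, weights, bias):
    # each input is multiplied by its weight, everything is summed,
    # then passed through a non-linearity
    return sigmoid(np.dot(inputs, weights) + bias)

x = np.array([0.5, -1.2, 3.0])   # inputs (illustrative values)
w = np.array([0.8, 0.1, -0.4])   # weights (these are what training finds)
print(neuron(x, w, bias=0.2))    # output of one neuron, a number between 0 and 1
```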
In fact, there are many different activation functions, including the hyperbolic tangent (tanh), the sigmoid, and ReLU, shown in the figure.
In reality everything is much more complicated, but we will not touch on that topic here.
I have given a very basic picture of the artificial neuron as a distant counterpart of the biological one.
An artificial neural network is a way of assembling neurons into a network so that it solves a specific task, for example a classification task. Neurons are grouped into layers. There is an input layer, where the input signal is fed; there is an output layer, from which the result of the neural network is taken; and between them there are hidden layers - one, two, three, many. If there is more than one hidden layer, the network is considered deep; if there is only one, shallow.
There is a huge variety of different architectures; we will look at the main ones. But keep in mind that there are many of them. If you are interested, follow the link - look, read.
Another useful thing to know when discussing neural networks. I have already described how a single neuron works: each input is multiplied by a weight, the results are summed and passed through a non-linearity. This is, let's say, the production mode of the neuron - inference - how it works once it has already been trained.
There is a completely different task: to train the neuron. Training means finding those correct weights. Training is built on a simple idea: if at the output of a neuron we know what the answer should be and we know what actually came out, we know the difference - the error. This error can be sent back to all the inputs of the neuron to understand which input influenced this error the most, and accordingly adjust the weight on that input so that the error decreases.
This is the main idea behind backpropagation, the error back-propagation algorithm. This process can be run through the whole network, finding for every neuron how its weights should be modified. For this you need to take derivatives, but lately that is no longer required by hand: all packages for working with neural networks differentiate automatically. If two years ago you had to write complex derivatives for tricky layers manually, now the packages do it themselves.
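As an illustration of the idea, here is a hand-written sketch of one backpropagation step for the single sigmoid neuron from the sketch above, assuming a squared-error loss; in practice packages such as TensorFlow or Theano compute these derivatives automatically, so you would not write this by hand.

```python
# One gradient-descent update for a single sigmoid neuron (squared error loss),
# written out by hand purely to show how the error is sent back to the weights.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, -1.2, 3.0])    # inputs (illustrative values)
w = np.array([0.8, 0.1, -0.4])    # current weights
target = 1.0                      # the answer we know we should get
lr = 0.1                          # learning rate

y = sigmoid(np.dot(w, x))         # forward pass: what the neuron actually outputs
error = y - target                # the difference we become aware of
grad_w = error * y * (1 - y) * x  # how strongly each input weight influenced the error
w -= lr * grad_w                  # adjust the weights so the error decreases
```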
Recap: important trends
What is happening with the quality and complexity of models
First, the accuracy of neural networks is growing, and growing very strongly. There are already several examples of neural networks coming into some area and pushing out the classical algorithms entirely. This has already happened in image processing and speech recognition, and it will happen in other areas. That is, neural networks appear that greatly reduce the error.
In the diagram, Deep Learning is highlighted in purple and the classic computer vision algorithms in blue. You can see that Deep Learning appeared, the error dropped and continues to drop further. That is why Deep Learning is completely displacing, conditionally speaking, all the classical algorithms.
Another important milestone is that we are beginning to surpass human quality. At the ImageNet competition this happened for the first time in 2015. But in fact, neural network systems superior to humans in quality appeared earlier. The first clearly documented case is 2011, when a system was built that recognized German road signs and did it twice as well as a human.
The second important trend is that the complexity of neural networks is growing. In terms of depth, depth is increasing. The winner of ImageNet 2012, the AlexNet network, had fewer than 10 layers; in 2014 there were already more than 20, in 2015 around 150. This year it seems to be beyond 200 already. What happens next is unclear; perhaps there will be even more.
In addition to the growing depth, the architectural complexity is growing as well. Instead of simply stacking layers one after another, they begin to branch, and blocks and structure appear. In general, the architectural complexity is growing too.
This is a graph of the accuracy of various neural networks plotted against the time it takes to evaluate the network, that is, its computational load. The size of the circle is the number of parameters the neural network is described by. It is interesting to compare the classic AlexNet, the 2012 winner, with later networks. They are better in accuracy but usually contain fewer parameters. This is also an important trend: neural networks are becoming more complex in a very clever way. That is, the architecture changes so that even with 150 layers the total number of parameters is smaller than in the 6-7-layer network of 2012. The architecture is complicated in a very interesting way.
Another trend is data growth. In 1998, the convolutional neural network that recognized handwritten digits on checks was trained on about 10^7 pixels; in 2012 (ImageNet) it was about 10^14.
Seven orders of magnitude in 14 years is a crazy difference and a huge shift!
At the same time, the number of transistors in processors is also growing, computing power is increasing - Moore's law is at work. Over these 14 years processors have become, conditionally, about 1000 times faster. This is illustrated by the example of GPUs, which now dominate the Deep Learning area: almost everything is computed on graphics accelerators.
NVIDIA has repositioned itself from a gaming company into, effectively, an artificial intelligence company. Its exponential growth curve has left Intel's far behind, and Intel does not look impressive at all against this background.
This is a picture from 2013, when the top-end video card delivered 4.5 TFLOPS. Now the new TITAN X is already at 11 TFLOPS. In general, the exponential growth continues!
In fact, we can expect FPGAs to appear in the near future that will partially squeeze out GPUs, and perhaps even neuromorphic processors will appear over time. Watch this space - a lot of interesting things are happening there too.
Neural network architectures. Feed-forward neural networks
Fully Connected Feed-Forward Neural Networks, FNN
The first, classical architecture is the fully connected feed-forward neural network (Fully Connected Feed-Forward Neural Network, FNN).
The multilayer perceptron is the general classic of neural networks. The picture of neural networks you have already seen is exactly this - a multilayer fully connected network. Fully connected means that each neuron is connected to all the neurons of the previous layer. It is a good, working network, suitable for classification; many classification problems are solved with it successfully.
However, it has two problems:
Many parameters
For example, if you take a neural network with 3 hidden layers that needs to process 100 * 100 px pictures, this means there will be 10,000 px at the input, which then pass through 3 layers. Honestly counting all the parameters, such a network will have about a million of them. That is really a lot. To train a neural network with a million parameters you need many training examples, which are not always available. Such examples exist now, but they did not exist before - which is one reason, in particular, why such networks could not be trained properly.
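Here is a rough back-of-the-envelope sketch of that count, assuming (purely as an illustration) 100 neurons in each of the three hidden layers and a 10-class output; the exact layer sizes are my assumption, not something stated in the talk.

```python
# Rough parameter count for the fully connected example above:
# 100x100 input pixels, three hidden layers of 100 neurons (assumed), 10 outputs.
layer_sizes = [100 * 100, 100, 100, 100, 10]

params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params += n_in * n_out + n_out   # weights + biases of one fully connected layer

print(params)   # 1,021,310 - about a million parameters
```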
In addition, a network with many parameters has an extra tendency to overfit. It can latch onto something that does not really exist: some noise in the dataset. Even if the network ends up memorizing the training examples, it will work poorly on examples it has not seen and cannot be used normally.
Besides that, there is another problem, called:
Vanishing gradients
Remember the story about backpropagation, when the error from the outputs is sent back to the inputs, distributed across all the weights and passed further along the network? These derivatives - that is, the gradient (the derivative of the error) - are run back through the neural network. When there are many layers in the network, only a very, very small part of this gradient may remain at the very end. In that case it becomes almost impossible to change the input weights, because this gradient is practically "dead", it is not there.
This is also a problem because of which deep neural networks are hard to train. We will return to this subject later, especially with recurrent networks.
There are various variations of FNN networks. For example, a very interesting variation is the autoencoder. This is a feed-forward network with a so-called bottleneck in the middle - a very small layer of, say, only 10 neurons.
What are the advantages of such a neural network?
The purpose of this network is to take some input, pass it through itself and reproduce the same input at the output, so that they match. What is the point? If we can train a network that takes an input, passes it through itself and generates exactly the same output, it means that those 10 neurons in the middle are enough to describe this input. That is, you can greatly reduce the space, reduce the amount of data, and economically encode any input data in new terms - a vector of 10 numbers.
It is convenient and it works. Such networks can help you, for example, reduce the dimensionality of your task or find interesting features that you can use.
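A minimal sketch of such an autoencoder in Keras might look like this; the layer sizes and the 784-dimensional input (a flattened 28x28 image) are illustrative assumptions, only the 10-neuron bottleneck follows the example above.

```python
# A minimal autoencoder sketch: the 10-neuron bottleneck in the middle
# is forced to encode the whole input, because the target is the input itself.
from tensorflow.keras import layers, models

inp = layers.Input(shape=(784,))                # e.g. a flattened 28x28 image
h = layers.Dense(128, activation='relu')(inp)
code = layers.Dense(10, activation='relu')(h)   # the bottleneck: only 10 neurons
h = layers.Dense(128, activation='relu')(code)
out = layers.Dense(784, activation='sigmoid')(h)

autoencoder = models.Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_train, x_train, ...)  # input and target are the same data

encoder = models.Model(inp, code)   # this part compresses any input into 10 numbers
```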
There is another interesting model, the RBM (Restricted Boltzmann Machine). I listed it as a variation of FNN, but in reality this is not true. Firstly, it is not deep, and secondly, it is not feed-forward. But it is often mentioned alongside FNN networks.
What is it?
It is a shallow model (on the slide it is drawn in the corner) that has an input and some hidden layer. You feed a signal to the input and try to train the hidden layer so that it can generate this input.
This is a generative model. Once you have trained it, you can generate analogs of your input signals, but slightly different ones. It is stochastic, that is, every time it generates something slightly different. If, for example, you have trained such a model on handwritten digits, it will generate a number of slightly different ones.
What is good about RBMs is that they can be used to train deep networks. There is a term for this - Deep Belief Networks (DBN) - which is essentially a way to train deep networks: the two lowest layers of a deep network are taken separately, the input is fed in, and an RBM is trained on these first two layers. After that, these weights are fixed. Then the next layer is taken, treated as a separate RBM, and trained in the same way. And so on through the whole network. Then these RBMs are joined together, combined into one neural network. The result is the deep neural network we wanted.
But now there is a huge advantage: whereas previously you would have trained it from some random state, now it is not random - the network is trained to reconstruct or generate the data of the previous layer. That is, its weights are sensible, and in practice this means such neural networks already train quite well. You can then fine-tune them a bit with some labeled examples, and the quality of such a network will be good.
Plus there is an additional advantage. When you use an RBM, you are essentially working on unlabeled data, which is called unsupervised learning. You just have pictures, you do not know their classes. You run through millions, billions of pictures downloaded from Flickr or elsewhere, and a structure appears in the network itself that describes these pictures.
You do not yet know what it is, but these are sensible weights, which can then be taken and fine-tuned with a small number of labeled pictures, and the result will be good. This is a nice way to use a combination of two neural networks.
Further on you will see that this whole story is really about Lego. You have separate networks - recurrent neural networks, some other networks - and they are all blocks that can be combined. They combine well for different tasks.
These were the classic feed-forward neural networks. Next, we turn to convolutional neural networks.
Convolutional networks are typically used for three image tasks. Classification: you feed in a picture, and the neural network simply says - this picture is about a dog, about a horse, about something else - and outputs a class.
Detection is a more advanced task, where the neural network does not simply say there is a dog or a horse in the picture, but also finds the bounding box - where exactly it is in the picture.
Segmentation. In my opinion, this is the coolest task. In essence, it is per-pixel classification. Here we are talking about every pixel of the image: this pixel belongs to the dog, this one to the horse, this one to something else. In fact, if you can solve the segmentation problem, the other two tasks follow automatically.
What is a convolutional neural network? In fact, a convolutional neural network is an ordinary feed-forward network, just of a slightly special kind. The Lego starts now.
What is in a convolutional network? It has:
Convolutional layers - I will explain below what these are;
Subsampling, or pooling, layers, which reduce the size of the image;
Ordinary fully connected layers - the same multilayer perceptron - which are simply stacked on top of these first two special kinds of layers.
A little more detail about all these layers.
Convolutional layers are usually drawn as a set of planes or as volumes. Each plane in such a picture, or each slice in this volume, is in fact one neuron that implements the convolution operation. Again, I will explain below what that is. In essence, it is a matrix filter that transforms the original image into something else, and this can be done many times.
Subsampling layers (I will just call them subsampling, it is simpler) reduce the size of the image: it was 200 * 200 px, after subsampling it becomes 100 * 100 px. In fact it is a bit more clever than plain averaging.
Fully connected layers - the usual perceptron - are used for classification. There is nothing special about them.
What is the convolution operation? It scares everyone, but in reality it is a very simple thing. If you have worked in Photoshop and applied Gaussian Blur, Emboss, Sharpen and a bunch of other filters, those are all matrix filters. Matrix filters are in fact the convolution operation.
How is it implemented? There is a matrix called the filter kernel (labelled kernel in the figure). For a blur it is all ones. There is an image. The matrix is superimposed on a piece of the image, the corresponding elements are simply multiplied, the results are summed and written to the centre point.
It is clearer in the picture. There is the input image and there is a filter. You run the filter over the entire image, honestly multiplying the corresponding elements, summing, and writing the result. Run it everywhere and you have built a new image. That is the whole convolution operation.
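A naive sketch of this operation in NumPy looks like the following; the image is random made-up data, and strictly speaking the loop computes cross-correlation, which is what deep learning frameworks actually compute and still call convolution.

```python
# A naive sketch of the convolution (matrix filter) operation described above:
# the kernel slides over the image, corresponding elements are multiplied,
# summed, and one output value is written.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

blur_kernel = np.ones((3, 3)) / 9.0          # "all ones" (normalized) gives a blur
image = np.random.rand(200, 200)             # a made-up 200x200 grayscale image
print(convolve2d(image, blur_kernel).shape)  # (198, 198)
```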
That is, convolution in convolutional neural networks is essentially a clever digital filter (blur, emboss, anything else) that is itself trained.
In fact, convolutional layers work on volumes. Even if we take an ordinary RGB image, there are already 3 channels - it is, in fact, not a plane but a volume of depth 3.
The convolution in this case is no longer a matrix but a tensor - effectively a small cube.
You have a filter, you run it over the entire image; it looks at all 3 color channels at once and generates one new point of this volume. Running it over the whole image builds one channel, one plane of the new image. If you have 5 neurons, you build 5 planes.
This is how a convolutional layer works. The task of training a convolutional layer is the same as in ordinary neural networks: to find the weights, that is, to actually find the convolution matrix, which is completely equivalent to the weights of the neurons.
What do these neurons do? They actually learn to look for features, some local patterns, in the small patch they see - and that is all. Running one such filter builds a kind of map of where these features occur in the image.
Then, once you have built many such planes, you use them as an image, feeding them into the following layers.
The pooling operation is even simpler. It is just averaging or taking the maximum. It also works on small squares, for example 2 * 2. You overlay it on the image, select for example the maximum element of this 2 * 2 box, and send it to the output.
Thus, you have reduced the image, but not with a plain average - with a slightly smarter operation: you took the maximum. This gives a small shift invariance. That is, it does not matter whether a feature was found in this position or 2 px to the right. This makes the neural network slightly more robust to image shifts.
This is how the pooling layer works. There is a cube of some size - 3 channels, 10, or 100 channels, which you have computed with convolutions. Pooling simply reduces it in width and height; it does not touch the other dimension. It is a primitive thing.
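For completeness, a sketch of 2 * 2 max pooling on a single channel, again with made-up data:

```python
# 2x2 max pooling: each 2x2 box of the input is replaced by its maximum,
# halving the width and height of the feature map.
import numpy as np

def max_pool_2x2(image):
    h, w = image.shape
    trimmed = image[:h - h % 2, :w - w % 2]           # drop an odd last row/column if any
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.random.rand(200, 200)
print(max_pool_2x2(feature_map).shape)   # (100, 100)
```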
What is good about convolutional networks?
They are good because they have far fewer parameters than the usual fully connected network. Recall the fully connected example we considered, where we got about a million weights. If we take a comparable - strictly speaking it cannot be called similar, but a close - convolutional neural network with the same input and the same output, one fully connected output layer and two convolutional layers with 100 neurons each, it turns out that the number of parameters has decreased by more than an order of magnitude.
It is great that there are so many fewer parameters - the network is easier to train. We see this; it really is easier to train.
What does a convolutional neural network do?
In fact, it automatically learns hierarchical features for images: first basic detectors - lines of different slopes, gradients, etc. From these it assembles more complex objects, and then even more complex ones.
If you think of a neuron as a simple logistic regression, a simple classifier, then a neural network is just a hierarchical classifier. First you extract simple features, combine them into complex features, then even more complex ones, and in the end you can assemble some very complex feature - a specific person, a specific car, an elephant, anything else.
Modern architectures of convolutional neural networks have become very complex. The networks that won the latest ImageNet competitions are no longer just stacks of convolutional and pooling layers - they are whole ready-made blocks. The figure shows examples from the Inception network (Google) and ResNet (Microsoft).
In essence, the same basic components are inside: the same convolutions and pooling. There are just more of them now, and they are combined in cleverer ways. Plus, there are now direct (skip) connections that allow the image not to be transformed at all but simply passed through to the output. This, by the way, helps gradients not to vanish: it is an additional path for the gradient from the end of the neural network back to the beginning. It also helps to train such networks.
Those were the fairly classic convolutional neural networks. Yes, there are different kinds of layers that can be used for classification. But there are more interesting uses.
For example, there is a kind of convolutional neural network called fully convolutional networks (FCN). People rarely talk about them, but this is a cool thing. You can take the final multilayer perceptron, which is not needed, and throw it out. Then the neural network can magically work on images of arbitrary size.
That is, the network has learned, say, to determine 1000 classes in images of cats, dogs and so on, and then we do not tear off the last layer but transform it into a convolutional layer. There is no problem - the weights can be recomputed. Then it turns out that this neural network works, as it were, with the same window it was trained on, 100 * 100 px, but it can now slide this window across the entire image and build a heat map at the output - where in this particular image a specific class occurs.
You can build, for example, 1000 such heat maps for all your classes and then use them to determine the location of the object in the picture.
This is the first example where a convolutional neural network is used not for classification but, in effect, for generating an image.
A more advanced example is deconvolution networks. They are also rarely talked about, but this is an even cooler thing.
In fact, "deconvolution" is the wrong term: in digital signal processing this word is reserved for a completely different thing - similar, but not the same.
What is it? In essence, it is trained upsampling. At some point you have reduced your image to some small size, maybe even 1 px. More precisely, not to a pixel, but to some small vector. Then you can take this vector and unfold it back.
Or, if at some point you have a 10 * 10 px image, you can now upsample this image, but in a clever way in which the upsampling weights are also trained.
This is not magic; it works, and in effect it allows us to train neural networks that produce an output image from an input image. That is, you can provide input/output examples, and the part in the middle will learn by itself. That is interesting.
In fact, many tasks can be reduced to image generation. Classification is a cool task, but it is not all-encompassing. There are many tasks where pictures need to be generated; segmentation is basically the classic task where a picture is needed at the output.
Moreover, if you have learned to do this, you can do it in another, more interesting way.
You can tear off the first part, for example, and attach some fully connected network that we will train - what goes to its input is, for example, a class number: generate me a chair at such-and-such an angle and in such-and-such a form. These layers generate some internal representation of this chair, and then it unfolds into a picture.
This example is taken from a paper where the neural network was really taught to generate different chairs and other objects. It also works, and it is fun. It is assembled, in principle, from the same basic blocks, but they are wired together differently.
There are non-classical tasks, for example style transfer, which we have all been hearing about over the last year. There are a bunch of applications that can do this. They work on roughly the same technology.
There is a ready trained classification network. It turned out that if we take an arbitrary picture and feed it into this neural network, different convolutional layers are responsible for different things: the early convolutional layers carry the stylistic features of the image, the later ones carry the content features, and this can be exploited.
You can take a picture as a style reference, run it through a ready-made neural network that was not trained for this at all, extract the stylistic features and remember them. You can take any other picture, run it through, take the content features and remember them. Then a random image (noise) can be run through the same neural network, the features on the same layers obtained and compared with those that should have been obtained. Now you have a task for backpropagation. In fact, further gradient descent can transform the random image into one whose activations on the desired layers are as required. And you get a stylized picture.
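As a rough sketch of the style-matching part of this idea (following the Gram-matrix formulation commonly used for it; the shapes and normalization here are illustrative assumptions, not the exact recipe from the talk):

```python
# The Gram matrix of a layer's feature maps captures its "stylistic" statistics;
# gradient descent then pushes the generated image until its Gram matrices
# match those of the style image.
import numpy as np

def gram_matrix(features):
    # features: (height, width, channels) activations of one convolutional layer
    h, w, c = features.shape
    f = features.reshape(h * w, c)
    return f.T @ f / (h * w)            # (channels x channels) correlations of feature maps

def style_loss(generated_features, style_features):
    g_gen = gram_matrix(generated_features)
    g_style = gram_matrix(style_features)
    return np.sum((g_gen - g_style) ** 2)

# In the full method this loss (plus a content loss on deeper layers) is minimized
# with respect to the pixels of the generated image via backpropagation.
```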
The only problem with this method is that it is slow. This iterative running of the picture back and forth takes a long time. Anyone who has played with this code for style generation knows that the classic code is slow, and you have to suffer. All the services like Prisma and so on, which generate results more or less quickly, work differently.
Since then, people have learned to train networks that generate the picture in a single pass. This is the same image-transformation setup you have already seen: there is something at the input, there is something at the output, and everything in the middle can be trained.
In this case the trick is that the loss function - the very error you compute for this neural network - is taken from an ordinary neural network that was trained for classification.
These are somewhat hacky ways of using neural networks, but it turned out that they work, and they lead to great results.
Recurrent neural networks are actually a very cool thing. At first glance, their main difference from ordinary FNN networks is that some kind of cyclic connection simply appears: the hidden layer sends its own values to itself at the next step. It seems like a minor thing, but there is a fundamental difference.
About an ordinary feed-forward neural network it is known that it is a universal approximator: it can approximate more or less any continuous function (there is the Cybenko theorem for this). That is great, but recurrent neural networks are Turing complete: they can compute anything computable.
Essentially, a recurrent neural network is an ordinary computer. The task is to train it correctly. Potentially, it can implement any algorithm. The other side is that it is hard to train.
In addition, ordinary feed-forward neural networks have no way to take order in time into account - it is simply not represented in them. Recurrent networks do this explicitly: the concept of time is built into them.
Ordinary feed-forward networks have no memory except what was acquired at the training stage, while recurrent networks do. Because the contents of the layer are fed back into the neural network, this acts as its memory, kept while the network is running. This also adds a lot.
How are recurrent neural networks trained? In fact, almost the same way. Besides backpropagation there are of course many other algorithms, but at the moment backpropagation works best.
For recurrent neural networks there is a variation of this algorithm - backpropagation through time. The idea is very simple: you take the recurrent neural network and simply unroll the loop for a few steps, for example 10, 20 or 100, and you get an ordinary deep neural network, which you then train with ordinary backpropagation.
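To see what "unrolling" means, here is a sketch of a simple recurrent layer run forward over a sequence, with illustrative sizes and random weights; backpropagation through time differentiates through exactly this unrolled chain as if it were a deep feed-forward network.

```python
# A simple recurrent layer unrolled in time: the hidden state is fed back in
# at every step, one step per element of the sequence.
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:                           # one sequence element per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)     # new state depends on the input and the old state
        states.append(h)
    return states

T, input_dim, hidden_dim = 10, 8, 16             # illustrative sizes
inputs = [np.random.randn(input_dim) for _ in range(T)]
W_x = np.random.randn(hidden_dim, input_dim) * 0.1
W_h = np.random.randn(hidden_dim, hidden_dim) * 0.1
b = np.zeros(hidden_dim)
print(len(rnn_forward(inputs, W_x, W_h, b)))     # 10 hidden states, one per time step
```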
But there is a problem. As soon as we start talking about deep neural networks - with 10, 20, 100 layers - of the gradient that should travel from the end back to the very beginning, nothing is left after 100 layers. Something has to be done about this. In this place a certain hack, a beautiful engineering solution, was invented, called LSTM or GRU - memory cells.
Their idea is that the usual neuron is replaced with a clever unit that has memory and gates, which control when this memory needs to be reset, overwritten or kept, etc. These gates are trained in the same way as everything else. In effect, once it has learned, such a cell can tell the network: we are now keeping this internal state for a long time, for example 100 steps. Then, when the network has used this state for something, it can be reset: it has become unnecessary, and we move on to a new count.
On all more or less serious benchmarks these networks strongly outperform the usual classical recurrent ones built simply on plain neurons. Almost all recurrent networks are currently built on either LSTM or GRU.
I will not go into what is inside, but these are tricky blocks, much more complicated than ordinary neurons, yet essentially similar. There are gates that control this very "remember - do not remember", "pass on - do not pass on".
These were the classic recurrent neural networks. Then begins a topic that is often kept quiet but is also important.
When we work with a sequence in a recurrent network, we usually feed one element, then the next, and feed the previous state of the network back to the input; this natural direction arises - from left to right. But it is not the only one! If we have, for example, a sentence, we usually feed its words into the neural network in their ordinary order - yes, that is the normal way, but why not feed it from the end?
That is, in many cases the whole sequence is given from the very start. We have the whole sentence, and there is no reason to prefer one direction over the other. We can run a neural network over it from one side and from the other side - actually having 2 neural networks - and then combine their results.
This is called Bidirectional - a bidirectional recurrent neural network. Their quality is even higher than that of ordinary recurrent networks because there is more context: for each point there are now 2 contexts - what came before and what comes after. For many tasks this adds quality, especially for language-related tasks.
For example, in German something will definitely be tacked on at the end of the sentence and change its meaning - such a network helps here.
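As a small illustration of how such blocks are assembled in practice, here is a minimal Keras sketch of an LSTM sequence classifier and its bidirectional variant; the vocabulary size, dimensions and three output classes are illustrative assumptions.

```python
# A minimal LSTM / bidirectional LSTM sketch for classifying sequences
# (e.g. sentences); all sizes here are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),   # e.g. a 10,000-word vocabulary
    layers.Bidirectional(layers.LSTM(128)),              # two LSTMs: left-to-right and right-to-left
    layers.Dense(3, activation='softmax'),                # e.g. positive / negative / neutral
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Replacing Bidirectional(layers.LSTM(128)) with plain layers.LSTM(128)
# gives the ordinary one-directional variant.
```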
Moreover, we have considered one-dimensional cases - sentences, for example. But there are multidimensional sequences: the same image can also be treated as a sequence. Then it has 4 directions, each reasonable in its own way. For an arbitrary point of the image there are, with such a traversal, in fact 4 contexts.
There are interesting multidimensional recurrent neural networks: they are both multidimensional and multidirectional. Right now they are a bit forgotten. This, by the way, is an old development, probably 10 years old already, but now it is starting to resurface.
Here are recent works (2015). This neural network is analogous to the classic LeNet network that classified handwritten digits. But now it is not convolutional at all, but recurrent and multidirectional. There are arrows going in different directions across the image.
The second example is a tricky neural network used for segmentation of brain sections. It is also not convolutional at all, but recurrent, and it won some regular competition.
In fact, this is a cool technology. I think that in the near future recurrent networks will press convolutional networks very strongly, because even for images they add a lot. This is a potentially more powerful class.
And there is also a very recent development, the Grid LSTM, which has not yet been fully digested. The idea is simple: in a recurrent network the neurons were at some point replaced with clever cells so that the state could be stored for a long time. But if our network is deep in the other direction, there are no gates there and the gradients are lost just the same. What if we add the same thing in that direction as well? They added it, and it turned out great!
The only problem is that at the moment there are almost no ready-made software libraries where this is implemented. There are 1-2 pieces of code you can try to use. I hope that in the coming year these things will become publicly available, and that will be quite cool.
It is a wonderful thing; watch what happens with it, it is good.
Now the advanced topics begin.
Multimodal Learning
Mixing different modalities in one neural network, for example, image and text
Multimodal training is also ideologically a simple thing: we take and mix 2 modalities in one neural network, for example pictures and text. Before this we considered work on a single modality - only pictures, only sound, only text. But you can mix!
For example, there is a cool case: generating descriptions of pictures. You feed a picture into the neural network, and it generates text at the output - say, in normal English - that describes what is happening in this picture. A few years ago this technology did not seem possible at all, because it was not clear how to do it. But now it has been implemented.
Inside, everything is simple. There is a convolutional neural network that processes the image, extracts features from it and writes them into some clever state vector. There is a recurrent network that is trained to generate and unfold text from this state.
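A rough Keras sketch of this kind of setup is below: a vector of CNN features conditions a recurrent decoder that predicts the caption word by word. The way the image vector is injected (as the initial LSTM state), the 2048-dimensional feature size, the vocabulary and the dimensions are all illustrative assumptions, not the exact model from the talk.

```python
# Image captioning sketch: CNN features condition an LSTM that generates text.
from tensorflow.keras import layers, models

vocab, dim = 10000, 256

img_feats = layers.Input(shape=(2048,))            # features from a pre-trained CNN
img_state = layers.Dense(dim, activation='relu')(img_feats)

words_in = layers.Input(shape=(None,))             # the caption generated so far
emb = layers.Embedding(vocab, dim)(words_in)
seq = layers.LSTM(dim)(emb, initial_state=[img_state, img_state])  # image vector as the starting state

next_word = layers.Dense(vocab, activation='softmax')(seq)         # probability of the next word
captioner = models.Model([img_feats, words_in], next_word)
captioner.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```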
This combination of 2 modalities is very productive. There are many such examples.
There is a certain clever space that we cannot see, but it exists inside the neural network in the form of the weights it computes for itself. It turns out that during training we are teaching two different neural networks - a convolutional one and a recurrent one - to map the picture and the text that describes it into one and the same place in this clever space. That is, to bring the 2 modalities together into one.
If we have learned to do this, then to some extent it no longer matters which way we go: feed a picture - generate text; feed text - find a picture. You can play with different things and build interesting things.
By the way, there are already attempts to build networks that generate pictures from text. This is interesting, and it also works - not very well yet, but the potential is huge.
Sequence Learning and the seq2seq paradigm
When it is necessary to work with sequences of arbitrary length at the input and/or output
The second interesting topic is Sequence Learning, or the seq2seq paradigm. I will not even translate it. The idea is that a lot of your tasks come down to sequences: not just a picture that needs to be classified into a single number, but one sequence at the input and another sequence needed at the output.
For example, translation is the classic Sequence-to-Sequence Learning task: you are given text in English and want to get it in French.
There are very many such tasks. The picture-description case is one of them.
The ordinary neural networks we considered - feed something in, run it through the network, take one output - are not interesting here.
There is an option called One to many: you feed the picture into the network, and then it works, works, and generates a description of this picture. Great.
You can go in the opposite direction. For example, classification of texts. This is the favorite task of all marketers: to classify tweets as positive or negative in their emotional coloring. You feed your sentence into the recurrent neural network, and at the end it gives one number: yes, a positively colored tweet; no, a negatively colored one; or neutral, for example.
There is the translation story: you feed in a long sequence in one language, then the network works and starts generating a sequence in another language. This is generally the most common setting.
There is another interesting setting where the inputs and outputs are synchronized, for example when you need to annotate every frame of a video: whether something is present in it or not.
The figure shows all the variants of Sequence-to-Sequence Learning, and it is a very powerful paradigm. It is powerful because if everything inside the neural network is differentiable - and the neural networks we discussed are all differentiable inside - then you can train the neural network end-to-end, so to speak: feed some sequences to the input and others to the output, and what happens inside does not matter to you. The neural network will cope by itself: a bunch of examples in English at the input, a bunch of examples in French at the output - great, it will learn translation. And really with good quality, if you have a large database and good computing power to crunch through it all.
Another insanely important thing, which is almost never talked about but without which neither Google's speech recognition, nor Baidu's, nor Microsoft's would work - CTC.
CTC is a clever output layer. What does it do? There are many tasks in which the exact alignment within the sequence is not actually important. Take the speech recognition task: you take the sound, cut it into short frames of 50 ms, for example, and then you need to generate at the output which word it was - a sequence of phonemes. By and large, you do not care where exactly in the original signal each phoneme was; only their order matters, so that you get a word at the output.
The fact that you can throw away all the information about exact positions actually adds a lot. For example, you do not need precise phoneme markup across all the frames of the sound, because obtaining such markup is insanely expensive: you would have to sit a person down to annotate everything.
You can simply throw all of that away: there is the input data, there is the output - what the output sequence should be, the word - and there is this clever CTC layer that performs some alignment inside itself. This, again, lets you train such a tricky network end-to-end without having annotated anything at that level at all.
It is a very powerful thing, and it is also not implemented in all modern packages. But, for example, a year ago Baidu released its implementation of the CTC layer - that is great.
Just a couple of words about different architectures.
There is the classic encoder-decoder architecture. The translation example I talked about reduces almost entirely to this architecture.
There is one input (encoder) neural network into which words are fed. The output of this network is ignored, as it were, until the end-of-sentence character arrives. After that the second network kicks in, reads the state of the first network and starts generating the output words from it, receiving its own results from the previous step as input.
It works. Many translation systems work this way.
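A minimal Keras sketch of such an encoder-decoder pair might look like this; the vocabulary sizes and dimensions are illustrative assumptions, and real systems add plenty of details on top.

```python
# Encoder-decoder sketch: one LSTM reads the source sequence, its final state
# is handed to a second LSTM that generates the target sequence.
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, dim = 8000, 8000, 256

# Encoder: only its final state (the fixed-size vector discussed below) is kept.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, dim)(enc_in)
_, state_h, state_c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: starts from the encoder state and generates the output words one by one.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out, _, _ = layers.LSTM(dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(tgt_vocab, activation='softmax')(dec_out)

model = models.Model([enc_in, dec_in], probs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```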
But this architecture has a problem - it is also a bottleneck. The state vector (the size of the hidden layer) that is passed over is limited and fixed. That is, it is the same for a short sentence and for an insanely long one, and this is not very good: a long sentence may simply not fit into this volume.
Then architectures appeared with, as they say, attention.
Attention is a tricky-sounding thing which is in fact very simple. The idea is that now the decoder network looks not just at the final output value of the previous network but at all of its intermediate states, with some weights. The weights are coefficients saying how much of each of those states to take into the final large sum that the decoder will work with.
That is, attention is actually a simple linear combination of all the previous states of the encoder, and this combination is also trained.
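Stripped to its core, the mechanism can be sketched like this in NumPy; the dot-product scoring used here is one common, simplified choice standing in for the trained scoring function, and the sizes are illustrative.

```python
# Attention as a weighted linear combination of all encoder states:
# a score per state, a softmax to turn scores into weights, a weighted sum.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(encoder_states, decoder_state):
    # encoder_states: (T, dim) - all intermediate states of the encoder
    # decoder_state:  (dim,)   - the current state of the decoder
    scores = encoder_states @ decoder_state   # one (dot-product) score per encoder state
    weights = softmax(scores)                 # how much of each state to take
    context = weights @ encoder_states        # the linear combination the decoder works with
    return context, weights

T, dim = 12, 64
context, weights = attention(np.random.randn(T, dim), np.random.randn(dim))
print(weights.round(2))   # these weights are what gets visualized, as in the STOP-sign example
```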
Neural networks with attention really do work very well. On translation tasks and other complex tasks they are very much superior in quality to networks without attention.
An extra bonus of such networks. The figure shows a combination of 2 different neural networks: the convolutional network from which we obtained features, and the recurrent network that then generates the text. If we have implemented the concept of attention and trained on some pictures, then for the generation of a specific word we can look at which weights were large. This effectively shows which pixels of the image played a role in generating this particular word at this particular moment - what the neural network, so to speak, paid attention to.
By the way, the concept of attention is far from being implemented in every library; there are no ready-made boxed solutions. Sometimes you can find ready-made code that someone published as part of their work.
When the neural network generates text about a STOP sign, it really looks at this sign: its weight, its contribution to the generation of the specific word STOP, is very high, and all the other pixels play little role.
This is an interesting concept; follow it too. It will be used in many places.
Frameworks and libraries for working with neural networks
Very brief overview
In fact, you could talk about this for hours. My goal is not to tell you "use this library or that one", because that would not be honest - there is a huge number of libraries. I will just give a more or less current list of different libraries.
First, there are universal libraries, many of which you have heard of.
For example, TensorFlow (Google) is probably one of the most popular, although quite fresh. It can be used from Python and C++.
There is the Torch library, which Facebook actively supports at the moment. It uses the Lua language. Do not be afraid of it - it is actually a nice language. A lot has been implemented in this library, and a lot of fresh research is released right away as Lua code. That is great.
There is the Theano library. TensorFlow has squeezed it a bit lately, but many cool high-level wrappers have been built around Theano - you can write a neural network in a few lines. That is really very cool!
Some of these wrappers, for example Keras, can also work with TensorFlow as a backend, as they say. That is, TensorFlow is rather low-level code in neural network terms, while Keras is high-level - one line per layer, which is convenient.
Microsoft has published its toolkit too; there is Neon, and there is Deeplearning4j - a rare case of a Java library for Deep Learning. There are few of them in Java; most are in Python and C++, fewer in other languages.
In addition, there are special tools for image and video processing.
I included OpenCV here. It is of course not a Deep Learning library at all, but it integrates well with the others.
Caffe is an excellent library; we used it in production. It is a C++ library and it is very fast - there is little that is faster. It is still cool, even though those who develop neural networks now for some reason think only of TensorFlow. Keep in mind that there are a bunch of other great solutions, including Caffe - a very cool thing.
In addition, there are a number of different APIs that can be used on the web.
With speech recognition, things are actually worse.
Speech recognition:
Microsoft Cognitive Toolkit (CNTK) (http://www.cntk.ai/) [Python, C++, C#, BrainScript]
There is one cool library for speech recognition, KALDI; it is written in C++. But in general, speech recognition is more or less locked up inside large corporations, because nobody else has large datasets of speech and sound. However, there are quite a few APIs: from Yandex, Google, Baidu; Microsoft seems to have one too.
For texts there is nothing particularly special, but all the universal libraries work great. Take Keras (or any other, it is not fundamental), write a few lines, and you have a neural network for working with text.
That is all, thank you. There is no universal answer to the question of which library to take. Look at your task. There are many subtleties: what technology stack you have, what it has to embed into, and what code already exists in the wild - on http://github.com/ there really is a lot of code that can be used. This is an engineering problem that needs to be approached thoughtfully. There is not and cannot be one universal answer.
Questions and Answers
Question: Can you recommend some literature for a beginner - what is best to read and watch in order to understand more deeply how to program neural networks?
In English there is a huge number of cool blogs where different examples are worked through. There are a lot of them; just google and you will find something on a specific topic. There are various tutorials, again in English, more or less small. There is the 800-page Deep Learning book, which is now out on paper from MIT Press and has long been available as a PDF.
In general, there is literature. There are online courses, for example on Coursera, and there are attempts to run courses offline. In particular, I will soon be taking part in one of these courses.
In fact, there are quite a few different options. Look on the Internet - there really are many possibilities. Most of it still means reading various foreign literature, but it is really good and comprehensive.
The code on GitHub is also good. A lot of code is published that you can at least look at. Often it can be read; it is not very scary. And often such code comes with intelligible comments about how it all works. Just go to the Internet - there is plenty of everything there.
Question: What approaches to training neural networks exist in general? Can you just google a bunch of pictures from the Internet, or can you take some neural networks that train other neural networks?
Answer: Yes, that is a cool question. I think there will also be a lot of progress in the training of neural networks in the coming years. There are different approaches. First, yes - you gather a dataset and train on it; that is the classic approach. It exists, there is no getting away from it, and in many cases it is the basic one.
But now, by the way, another approach often becomes the basic one, called transfer learning. There are already published neural networks trained on various tasks - on the same ImageNet. You can take a ready-made ImageNet network for 1000 classes and fine-tune it for your own special classes. This may be easier, because you have, say, only 1000 pictures of your classes, and you cannot train a good neural network from scratch on such a volume. To train a deep neural network you really need a lot of data - we are talking about hundreds of thousands and millions of objects. But if you take a ready-made network and train it a little further, you already get a sensible result. Transfer learning is a good approach. It works.
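As a minimal sketch of that workflow (the choice of ResNet50 as the pre-trained ImageNet network, the head sizes and the 10 target classes are illustrative assumptions):

```python
# Transfer learning sketch: take a network pre-trained on ImageNet,
# freeze its weights and train a small new head on your own classes.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights='imagenet', include_top=False, pooling='avg')
base.trainable = False                        # keep the pre-trained weights fixed at first

model = models.Sequential([
    base,
    layers.Dense(256, activation='relu'),
    layers.Dense(10, activation='softmax'),   # e.g. 10 classes of your own
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(...) on your ~1000 images; later some of the base layers can be
# unfrozen and fine-tuned with a small learning rate.
```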
The option where neural networks are taught by other neural networks also exists, but it is more research than production. It is a very cool topic and great to follow. I do not know of really good production solutions, but if the scientific side interests you, then yes, read up - there are some really cool papers where a teacher neural network, which contains a model of some world, teaches another neural network, and it works.
Question: Are there any tools in which you can modify existing neural network architectures or create, for example, your own?
Answer: Look, if you want a visual tool, then rather no than yes, although there are some plugins for TensorFlow that visualize something. But in general this is actually not a big problem, because the neural network is usually specified as some kind of structure - a text file or code - and it is not very hard to change: you can add layers, rewire it. This is not even really programming; it is essentially a kind of special DSL. You just take it and add a couple of layers.
The most difficult thing in all this is keeping the dimensions between the layers consistent. If you do not really understand how the tensors - these multidimensional arrays - are arranged there, there is a chance of getting confused with the dimensions. That is the hardest part of it all.
Question: You talked about quite a lot of different recurrent neural network architectures. Which ones did you actually use, and for which tasks?
Answer: For most tasks, simple out-of-the-box LSTM networks of sufficient depth and sufficient size work well. There are many tasks of classifying text and classifying anything else in sequences. Starting with one of the LSTM networks is, in principle, a normal start. If you see that bidirectionality helps in your case, you make it a Bidirectional LSTM.
It would be great to start right away with all sorts of cool options with attention and so on, but they are simply hard to start with because they are difficult to program from scratch - it is not trivial after all, and there is not much good ready-made code that you can just take and use. For me, the baseline for now is LSTM networks, unidirectional or bidirectional. I have used them to classify texts and images (recognition of numbers).
Question: I know that neural networks are used to crack cryptographic algorithms: for example, plaintext is fed to the input and ciphertext at the output, and then in the opposite direction the ciphertext is fed in and, through training, the plaintext is simply obtained at the output. The question is: what progress is there in this area now, does it really work, and what architectures can be used for this?
Answer: I cannot say much about this. I am not, let's say, competent enough in this area; I do not work at the intersection with security and cryptography. I saw some fresh work from Google where one neural network was taught to generate a cipher and another to crack it. But it seems to me these examples are far from good cryptographically strong algorithms; this is research of the "it is interesting to see what comes of it" kind. I have not heard of impressive work on breaking serious ciphers.
This report is a transcript of one of the best talks at the HighLoad++ conference for developers of high-load systems.