
What exactly makes deep learning and neural networks work well?

There are many articles reporting on the successes of neural networks, in particular in the field of natural language understanding that interests us. For practical work, however, it is just as important to understand under what conditions these algorithms do not work, or work poorly. Negative results, for obvious reasons, often stay outside the scope of publications. A typical paper reads: we used method A together with B and C and obtained such-and-such a result; whether B and C were actually needed remains an open question. For a developer putting well-known methods into practice these questions matter a great deal, so today we will talk about negative results and their significance, drawing both on well-known examples and on our own practice.


1. How important are the amount of data and the power of computers?

It is often said that progress in the field of machine intelligence is largely due to the increase in computer performance. To a certain extent, this is true. Consider one illustrative example.
There is such a thing as a language model. A language model, roughly speaking, is a function that estimates the probability that a given sequence of words is valid for a given language. It can also be used to generate text in that language. A language model is used, for example, to correct errors made by a speech recognition system, which greatly improves the quality of the result.
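As a minimal illustration of what a language model does (a toy sketch, not the neural model discussed below), here is an add-one-smoothed bigram model used to pick the more plausible of two recognition hypotheses; the corpus and hypotheses are invented for the example:

```python
import math
from collections import Counter

# Toy training corpus; in practice the counts come from millions of words.
corpus = "we recognize speech today . it is hard to recognize speech in noise .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_logprob(sentence):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    words = sentence.split()
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        score += math.log(p)
    return score

# Rescoring two hypothetical speech-recognition outputs: the one better supported
# by the corpus gets the higher score.
for hyp in ["recognize speech in noise", "wreck a nice speech in noise"]:
    print(hyp, bigram_logprob(hyp))
```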

In 2010, Tomas Mikolov demonstrated a recurrent neural network implementing a language model that outperformed traditional n-gram-based models (the new model made 18% fewer errors than the best known approach, a very significant improvement) (Mikolov et al., 2010). Three years later, in 2013, the same kind of network showed the best result on the task of recognizing semantic roles (Yao et al., 2013), surpassing the widely used conditional random field (CRF) method. At about the same time it was demonstrated that the same network can be used to obtain vector representations of words (where words with similar meanings are mapped to nearby vectors). A modification of this method formed the basis of the now widely known word2vec (vector representations of words were known earlier, but had not become widespread, partly because the algorithms for computing them were rather complex and slow).
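For orientation, here is a minimal PyTorch sketch of a recurrent language model of this kind (an Elman-style network predicting the next word); the sizes are illustrative and this is not the exact model from the paper:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Elman-style recurrent language model: predict the next word from the history."""
    def __init__(self, vocab_size, emb_dim=100, hidden=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden, batch_first=True)  # simple (Elman) recurrence
        self.out = nn.Linear(hidden, vocab_size)               # scores for the next word

    def forward(self, token_ids):                  # (batch, seq_len)
        hidden_states, _ = self.rnn(self.embedding(token_ids))
        return self.out(hidden_states)             # (batch, seq_len, vocab_size)

# Log-probability of a (toy) word sequence, usable for rescoring recognition hypotheses.
model = RNNLanguageModel(vocab_size=10_000)
ids = torch.randint(0, 10_000, (1, 12))
logits = model(ids[:, :-1])
logp = torch.log_softmax(logits, dim=-1).gather(2, ids[:, 1:].unsqueeze(2)).sum()
```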

What is interesting here? The model used was first described by Jeffrey Elman back in 1990 (Elman, 1990). When asked what prevented such results from being obtained back then, people usually answer: the speed of computers and the size of the training data.

Indeed, when we ourselves tried to reproduce the results of these papers and obtain word vectors for the Russian language, for a long time nothing came of it. We checked everything we could and had already begun to suspect that Russian is somehow special and unsuited to this method, but the answer turned out to be simpler. "Beautiful" vectors for words that are close in meaning only emerge after a certain amount of data, and until that point absolutely nothing hints that they ever will:

Nearest neighbors for the word "phone":

50,000 words: second, rivals, download, cube
500,000 words: lot, used, bought, good, cell phone
2 million words: microphone, cell phone, device, handset, screen


Interestingly, as the volume of text grows, the first neighbors to appear are not words close in meaning but words that frequently stand next to "phone": for a 500,000-word sample, the nearest neighbors come from collocations such as "used the phone", "bought a phone" and so on. Only at 2 million words do we see the beautiful picture of synonyms that is usually shown in papers. At the time this fact had not been published anywhere, or at least we could not find it, so it took stubbornness to get the desired result. Hence the conclusion: neural network models cannot always be checked on "simple" or "reduced" cases, however great the temptation to do so.
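A sketch of how such a check could be run today with the gensim library; the corpus file names and hyperparameters here are placeholders, not the setup we actually used at the time:

```python
from gensim.models import Word2Vec

def load_sentences(path):
    """Read a whitespace-tokenized corpus, one sentence per line."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

# Hypothetical corpora of increasing size, mirroring the table above.
for path in ["corpus_50k.txt", "corpus_500k.txt", "corpus_2m.txt"]:
    sentences = load_sentences(path)
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
    # Neighbors of "phone" drift from noise to collocates to synonyms as the
    # corpus grows (raises KeyError if the word is too rare in the small corpus).
    print(path, model.wv.most_similar("phone", topn=5))
```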

It would all seem simple. But a question arises: were 20 years of progress in computing hardware really required to get this result? As an experiment, we took an old Pentium III computer from around 2000-2002 (the latest thing back then) and set it to work on this task. Instead of an hour it ran for several days, but it produced the same result. So the issue is not only hardware. One could say that the developer's mindset and belief in the method used play a significant role.

Our excursion into history does not claim to be complete (the first neural language models appeared as early as 2001, but they used a sliding window rather than the Elman network; we have simplified a number of other points as well), yet on the whole the situation is revealing. Many "modern" neural network models date from the 1990s (bidirectional recurrent networks in 1997, LSTM in 1997, ...), and probably even more interesting developments are patiently waiting in the wings. One wonders whether a similar situation exists in other areas of technology, and what our life would be like now if these "useless" inventions had found their way into practice earlier. That, however, is a question for philosophy.

2. New architectural solutions versus data volume

Gated neural modules, new activation functions (ReLU, the rectified linear unit), maxout, diagonal initialization, new neural network architectures, new regularization methods (dropout, dropconnect), and much more. How decisive a role does all of this play in building text analysis systems? Here is another illustrative example.

In 2014, the paper (Kalchbrenner et al, 2014) applied convolutional neural networks (not a new idea in themselves) to the problem of sentence classification. The architecture described in the paper has seven layers and implements several techniques unusual for convolutional networks (dynamic k-max pooling). All of this allowed it to "set a new accuracy record" in predicting the sentiment of sentences from movie reviews and in classifying question types (an important task for systems that answer user questions in natural language). An indisputable achievement. A few months after this paper appeared, we came across another one (Kim, 2014) that reported even better results, yet the convolutional network in that paper has only three layers, does not use k-max pooling, and in general has the simplest possible structure.
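A minimal PyTorch sketch of the kind of simple convolutional classifier described in (Kim, 2014): an embedding layer, parallel convolutions over word windows, max-over-time pooling and a linear output layer. The hyperparameters here are illustrative, not the values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """Kim-style CNN: embeddings -> parallel convolutions -> max pooling -> classifier."""
    def __init__(self, vocab_size, emb_dim=300, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=100, pretrained=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        if pretrained is not None:                      # e.g. word2vec vectors
            self.embedding.weight.data.copy_(pretrained)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

# Example: a batch of 8 sentences of length 20 over a 10,000-word vocabulary.
logits = SentenceCNN(vocab_size=10_000)(torch.randint(0, 10_000, (8, 20)))
```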

We tried to reproduce (Kim, 2014), and for a very long time it did not work, failing catastrophically (on one of the datasets we got about 65% accuracy against the published 81%). The reason, again, was simple: we used different word vectors. Initially we took vectors trained on a corpus of 400 million words. That would seem to be a lot, but the original used word2vec vectors trained on 100 billion words. Quite a difference. Taking vectors trained on 5 billion words, we immediately got 78%, with no other effort. Moreover, it turned out that the set of vectors matters more than the neural network architecture: if the word vectors obtained from 5 billion words are fed to a simple NBOW model (neural bag of words), the result is 69-72%. Once again the volume of data plays a significant role.
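The NBOW baseline mentioned above is essentially just an average of pretrained word vectors fed into a linear classifier; a sketch, with the same illustrative shapes as above:

```python
import torch
import torch.nn as nn

class NBOW(nn.Module):
    """Neural bag of words: average pretrained word vectors, then classify."""
    def __init__(self, pretrained_vectors, num_classes=2):
        super().__init__()
        # pretrained_vectors: (vocab_size, emb_dim) tensor, e.g. from word2vec.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.fc = nn.Linear(pretrained_vectors.size(1), num_classes)

    def forward(self, token_ids):              # (batch, seq_len)
        return self.fc(self.embedding(token_ids).mean(dim=1))

# Random vectors stand in here for vectors trained on a large corpus.
logits = NBOW(torch.randn(10_000, 300))(torch.randint(0, 10_000, (8, 20)))
```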

From the above examples it may seem that you can improve the architecture of a system as much as you like, but without large amounts of data it is useless. This, however, is not always the case. Take, for example, the task of recognizing named entities, such as the names of people or organizations. Classical methods such as CRF with a mass of hand-crafted features, trained on a huge number of examples (millions of annotated words), solve this problem easily, yielding an F1 of 0.89-0.92. But what if there are no millions of annotated words and no manual features? Once we needed to solve a similar problem quickly. In about a week we manually annotated a small sample (about 100K words); there was neither time nor opportunity to engineer features or annotate more data. A standard bidirectional Elman recurrent network with word vectors trained on Wikipedia data gave an unremarkable 77% on the test set. Then we decided to tweak the architecture: we added many layers, applied the ReLU activation function, a special weight initialization scheme, and recurrent connections from the upper layers to the lower ones. The result rose to a respectable 86.7%. In this case, the new architectural solutions played the decisive role.
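A rough PyTorch sketch of a multi-layer bidirectional recurrent tagger of the kind described above, using the built-in Elman RNN with a ReLU nonlinearity; the custom weight initialization and the top-down recurrent connections from our experiment are not reproduced here:

```python
import torch
import torch.nn as nn

class BiRNNTagger(nn.Module):
    """Multi-layer bidirectional Elman-style RNN for sequence labelling (e.g. NER)."""
    def __init__(self, pretrained_vectors, num_tags, hidden=128, layers=3):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.rnn = nn.RNN(pretrained_vectors.size(1), hidden, num_layers=layers,
                          nonlinearity="relu", bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_tags)       # one score per tag, per token

    def forward(self, token_ids):                       # (batch, seq_len)
        outputs, _ = self.rnn(self.embedding(token_ids))
        return self.fc(outputs)                         # (batch, seq_len, num_tags)

# Illustrative shapes: 300-dim vectors (e.g. trained on Wikipedia), 9 BIO tags.
tags = BiRNNTagger(torch.randn(10_000, 300), num_tags=9)(torch.randint(0, 10_000, (4, 30)))
```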

These are of course just particular examples that do not claim any global generalizations or conclusions, but they seem interesting for understanding which factors actually matter in practical work with neural networks. We will stop here for now, but if there is interest in the topic, we will continue in future articles.

Short list of references

1. Mikolov, Tomas, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. "Recurrent neural network based language model." In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Chiba, Japan, 2010.
2. Yao, Kaisheng, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. "Recurrent neural networks for language understanding." In INTERSPEECH, pp. 2524-2528, 2013.
3. Elman, Jeffrey L. "Finding structure in time." Cognitive Science 14 (1990): 179-211.
4. Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. "A convolutional neural network for modelling sentences." arXiv preprint arXiv:1404.2188 (2014).
5. Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).

Source: https://habr.com/ru/post/266961/

