
Deep Learning, NLP, and Representations

I offer the readers of Habrahabr a translation of the post "Deep Learning, NLP, and Representations" by Christopher Olah. The illustrations are taken from the original.

In recent years, methods using deep neural network learning have taken a leading position in pattern recognition. Thanks to them, the bar for the quality of computer vision techniques has risen significantly. Speech recognition is moving in the same direction.

The results speak for themselves, but why do these networks solve problems so well?


This post highlights several impressive results of applying deep neural networks to natural language processing (NLP). In doing so, I hope to lay out clearly one of the answers to the question of why deep neural networks work.

Neural networks with one hidden layer


There is a frequently cited (and even more frequently misunderstood and misapplied) theorem: a neural network with a single hidden layer is universal, that is, with a sufficiently large number of hidden nodes it can approximate any function.

This is true because the hidden layer can simply be used as a “lookup table”.

For simplicity, consider the perceptron. This is a very simple neuron that fires if its value exceeds a threshold and does not fire otherwise. The perceptron has binary inputs and a binary output (0 or 1). The number of possible combinations of input values is finite, and each of them can be assigned a hidden-layer neuron that fires only for that particular input.

Handling the "case" for every individual input in this way requires 2^n hidden neurons (for n binary inputs). In reality, things are usually not that bad: there may be "cases" that cover several inputs at once, and there may be overlapping "cases" that reach the right answer at their intersection.

Then the connections between such a neuron and the output neurons can be used to set the final value for that particular case.
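To make the lookup-table picture concrete, here is a minimal sketch (my own illustration, not from the original post) of how a one-hidden-layer perceptron network can memorize an arbitrary binary function: one hidden neuron per input pattern, with output weights that simply read off the desired value.

```python
import itertools
import numpy as np

def step(x):
    # Perceptron activation: fires (1) when the input exceeds 0, otherwise 0.
    return (x > 0).astype(float)

def lookup_table_network(target):
    # target: dict mapping each binary input tuple to the desired 0/1 output.
    n = len(next(iter(target)))
    patterns = list(itertools.product([0, 1], repeat=n))          # all 2**n inputs
    # Hidden layer: one neuron per pattern, firing only for that exact pattern.
    W_hidden = np.array([[2 * b - 1 for b in p] for p in patterns], dtype=float)
    b_hidden = np.array([0.5 - sum(p) for p in patterns], dtype=float)
    # Output layer: copy the desired value of whichever hidden neuron fired.
    w_out = np.array([target[p] for p in patterns], dtype=float)

    def net(x):
        h = step(W_hidden @ np.asarray(x, dtype=float) + b_hidden)
        return step(w_out @ h - 0.5)
    return net

# Example: memorize XOR with 2**2 = 4 hidden neurons.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
net = lookup_table_network(xor)
print([int(net(x)) for x in xor])   # [0, 1, 1, 0]
```

This is exactly the "lookup table" argument: the network fits the training points perfectly, but nothing about the construction suggests it will behave sensibly on anything else.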

Universality is not limited to perceptrons. Networks of sigmoid neurons (and neurons with other activation functions) are also universal: with enough hidden neurons they can approximate any continuous function arbitrarily well. Demonstrating this is much harder, since you can no longer simply isolate the inputs from each other.



So it turns out that neural networks with a single hidden layer are indeed universal. However, there is nothing impressive or surprising in that. The fact that the model can act as a lookup table is not the strongest argument in favor of neural networks; it merely means that the model is, in principle, capable of coping with the task. Universality only means that the network can fit any training sample, not that it will interpolate sensibly to new data.

No, universality by itself does not explain why neural networks work so well. The real answer lies somewhat deeper. To get to it, let us first look at several concrete results.

Word representations (word embeddings)


I will begin with a particularly interesting sub-area of deep learning: word embeddings. In my opinion, word embeddings are currently one of the most exciting research topics in deep learning, even though they were first proposed by Bengio et al. more than ten years ago.

Word embeddings were first proposed in Bengio et al. 2001 and Bengio et al. 2003, several years before the 2006 resurgence of deep learning, when neural networks were not yet in fashion. The idea of distributed representations as such is even older (see, for example, Hinton 1986).

Besides, I think this is one of the best tasks for building an intuition of why deep learning is so effective.

A word embedding is a parameterized function W that maps words of some natural language to high-dimensional vectors (say, 200 to 500 dimensions). For example, it might look like this:

W("cat") = (0.2, -0.4, 0.7, ...)
W("mat") = (0.0, 0.6, -0.1, ...)

(Typically, this function is given by a lookup table, that is, by a matrix in which each word corresponds to a row.)

W is initialized with a random vector for each word. It is then trained so that its values become meaningful for solving some task.
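As a rough sketch of this (my own illustration, with an invented toy vocabulary), W is literally just a randomly initialized matrix with one row per word:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "sat", "on", "the", "mat", "dog"]
word_to_index = {w: i for i, w in enumerate(vocab)}

dim = 200                                                # embedding dimensionality
theta = rng.normal(scale=0.1, size=(len(vocab), dim))    # one row per word, random init

def W(word):
    # Lookup-table embedding: W(word) is simply the corresponding row of theta.
    return theta[word_to_index[word]]

print(W("cat")[:5])   # first few components of a (so far meaningless) random vector
```

Training then consists of adjusting the rows of this matrix, together with the rest of the network, by gradient descent.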

For example, we can train a network to determine whether a 5-gram (a sequence of five words, such as 'cat sat on the mat') is "valid". 5-grams are easy to collect from Wikipedia; we can then "break" half of them by replacing one word in each with a random word (for example, 'cat sat song the mat'), which almost always makes the 5-gram meaningless.


A modular network that determines whether a 5-gram is "valid" (Bottou (2011)).

The model we are training passes each word of the 5-gram through W, obtains their vector representations, and feeds them into another module, R, which tries to predict whether the 5-gram is "valid" or not. We would like to get:

R(W("cat"), W("sat"), W("on"), W("the"), W("mat")) = 1
R(W("cat"), W("sat"), W("song"), W("the"), W("mat")) = 0

In order to predict these values accurately, the network needs a good choice of parameters for W and R.

This task by itself is rather boring. Perhaps the resulting model could help find grammatical errors in texts, or something like that. What is really valuable is the learned W.

(In fact, the whole point of the task is learning W. We could have considered other tasks; a common one is predicting the next word in a sentence. But that is not our goal here. In the rest of this section we will discuss many results about word embeddings without dwelling on the differences between the approaches.)
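For concreteness, here is one way such a modular W + R network could be written in PyTorch. This is only a sketch under my own assumptions (embedding size, the shape of R, the optimizer), not the exact architecture of Bottou (2011):

```python
import torch
import torch.nn as nn

class FiveGramClassifier(nn.Module):
    """W embeds each of the 5 words; R scores the concatenated embeddings."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.W = nn.Embedding(vocab_size, dim)          # the lookup table W
        self.R = nn.Sequential(                         # the "judge" module R
            nn.Linear(5 * dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, word_ids):                        # word_ids: (batch, 5)
        vectors = self.W(word_ids)                      # (batch, 5, dim)
        return self.R(vectors.flatten(1)).squeeze(-1)   # logit: "valid" vs "broken"

model = FiveGramClassifier(vocab_size=10_000)
loss_fn = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step on a fake batch: real 5-grams labelled 1, corrupted ones 0.
word_ids = torch.randint(0, 10_000, (32, 5))
labels = torch.randint(0, 2, (32,)).float()
loss = loss_fn(model(word_ids), labels)
loss.backward()
optim.step()
```

The gradients flow back through R into the rows of W, which is how the embeddings pick up their structure as a side effect of the classification task.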

To get a feel for how the space of word embeddings is organized, we can visualize them with t-SNE, a clever technique for visualizing high-dimensional data.


Visualization of word embeddings with t-SNE. On the left is the "region of numbers", on the right the "region of professions" (from Turian et al. (2010)).
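A minimal sketch of how such a picture can be produced with scikit-learn's t-SNE (the embeddings here are random stand-ins; with trained vectors the clusters would be meaningful):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Pretend these are trained embeddings: one 200-dimensional row per word.
words = ["one", "two", "three", "king", "queen", "doctor", "nurse", "teacher"]
embeddings = np.random.default_rng(0).normal(size=(len(words), 200))

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()
```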

Such a "map of words" looks quite meaningful: "similar" words end up close together, and if you look at which words are nearest to a given one, they turn out to be similar as well.


Which words have embeddings closest to the embedding of a given word? (Collobert et al. (2011).)

It seems natural that the network should map words with similar meanings to nearby vectors. If you replace a word with a synonym ("some sing well" → "a few sing well"), the "validity" of the sentence does not change. From the network's point of view the input changes substantially, but since W pulls the representations of synonyms ("some" and "a few") close together, little changes for R.

This is a powerful tool. The number of possible 5-grams is enormous, while the training sample is comparatively small. Bringing the representations of similar words close together lets us treat one sentence as a whole class of "similar" sentences. And it is not limited to synonyms: we can also substitute a word of the same class ("the wall is blue" → "the wall is red"). Moreover, we can replace several words at once ("the wall is blue" → "the ceiling is red"). The number of such "similar phrases" grows exponentially with the number of words.

Already in the foundational paper A Neural Probabilistic Language Model (Bengio et al. 2003), substantial explanations are given of why word embeddings are such a powerful tool.

Obviously, this property of W would be very useful. But how is it learned? It is quite likely that W encounters the sentence "the wall is blue" many times and learns that it is valid before ever seeing the sentence "the wall is red". Shifting "red" closer to "blue" then improves the network's performance.

We still need to see examples of every word being used, but analogies let us generalize to new combinations of words. You have previously encountered every word whose meaning you understand, yet you can understand a sentence you have never heard before. Neural networks can do the same.


Mikolov et al. (2013a)

Word embeddings have another, much more remarkable property: analogy relations between words seem to be encoded by the difference vector between their representations. For example, the "male-female" difference vector appears to be constant:

W("woman") − W("man") ≈ W("aunt") − W("uncle")
W("woman") − W("man") ≈ W("queen") − W("king")

Perhaps this is not so surprising. After all, gendered pronouns mean that swapping such a word "kills" the grammaticality of the sentence. We write "she is the aunt" but "he is the uncle"; similarly, "he is the king" and "she is the queen". If we see "she is the uncle" in a text, it is most likely a grammatical error; and if half of the sentences were corrupted at random, this is probably one of them.

"Of course!" we say in hindsight. "Embeddings will learn to represent gender. There is probably a separate dimension for gender, and likewise for singular/plural. Such simple relations are easy to pick out!"

It turns out, however, that much more complex relationships are "encoded" in the same way. It seems almost miraculous!


Relationship pairs (from Mikolov et al. (2013b)).
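A sketch of how such analogies are typically queried, assuming `emb` is a dictionary of trained word vectors (the function names and the tiny hand-made vectors below are my own, just to show the mechanics):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(emb, a, b, c):
    """Find the word d such that a : b is roughly as c : d, via d ~ b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], target))

# 'emb' would normally hold trained vectors (e.g. loaded from a word2vec model);
# here is a toy set just to make the snippet runnable.
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 0.9]),
    "queen": np.array([1.0, 1.0, 0.9]),
    "apple": np.array([0.0, 0.3, 0.1]),
}
print(analogy(emb, "man", "king", "woman"))   # -> "queen" on this toy data
```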

It is important that all these properties of W are side effects. We did not require that representations of similar words be close to each other. We did not try to engineer analogies via difference vectors. We merely tried to learn to check whether a sentence is "valid", and these properties emerged on their own in the course of solving the optimization problem.

It seems that the great strength of neural networks is that they automatically learn to build "good" representations of the data. Data representation, in turn, is an essential part of solving many machine learning problems. And word embeddings are one of the most striking examples of representation learning.

Shared representations


The properties of word embeddings are certainly curious, but can we do something useful with them, beyond silly little tasks like checking whether a 5-gram is "valid"?


W and F learn to perform task A. Later, G can learn to solve task B using W.

We trained word embeddings to do well on a simple task, but given the remarkable properties we have observed, we might suspect they would be useful for more general problems. In fact, word embeddings like these are terribly important:
"The use of word representations... has recently become a key 'secret sauce' in many natural language processing systems, including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling."

(Luong et al. (2013).)

The general strategy of learning a good representation on task A and then using it to solve task B is one of the major tricks in the deep learning toolbox. Depending on the details, it goes by different names: pretraining, transfer learning, multi-task learning. One of the strengths of this approach is that the representation can be learned from more than one kind of data.
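As a sketch of the pretraining idea (my own toy example, not a specific published system): a representation W trained on task A is plugged into a new model for task B, optionally frozen so that only the new head G is trained.

```python
import torch
import torch.nn as nn

# A stand-in for an embedding table already trained on task A
# (e.g. the 5-gram "validity" task sketched earlier).
pretrained_W = nn.Embedding(10_000, 64)

class Tagger(nn.Module):
    """Reuses the pretrained representation W; only the new head G is task-specific."""
    def __init__(self, W, num_tags):
        super().__init__()
        self.W = W
        self.G = nn.Linear(W.embedding_dim, num_tags)

    def forward(self, word_ids):                  # word_ids: (batch, seq_len)
        return self.G(self.W(word_ids))           # per-word tag logits

tagger = Tagger(pretrained_W, num_tags=17)
tagger.W.weight.requires_grad_(False)             # freeze W, train only G on task B
```

Whether W is frozen or merely fine-tuned is a design choice that usually depends on how much data task B has.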

There is another way to play this trick. Instead of learning a representation for one kind of data and using it for several kinds of tasks, we can map several kinds of data into a single representation!

One nice example of this is the bilingual word embedding proposed by Socher et al. (2013a). We can learn to embed words from two languages into a single shared space; in that paper, the words come from English and Mandarin Chinese.



We train two word embeddings, W_en and W_zh, just as we did above. However, we also know that certain English and Chinese words have similar meanings, so we optimize an additional criterion: the representations of translations known to us should lie a small distance apart.
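A sketch of what such an extra criterion might look like, under my own simplifying assumptions (a plain squared-distance penalty between the two embedding tables, which captures only the spirit of the Socher et al. objective, not its exact form):

```python
import torch
import torch.nn as nn

W_en = nn.Embedding(5_000, 128)      # English embedding table
W_zh = nn.Embedding(5_000, 128)      # Chinese embedding table

def translation_alignment_loss(pairs):
    # pairs: (n, 2) tensor of (english_id, chinese_id) for known translations.
    # The criterion simply penalizes the distance between the two embeddings.
    en_vecs = W_en(pairs[:, 0])
    zh_vecs = W_zh(pairs[:, 1])
    return ((en_vecs - zh_vecs) ** 2).sum(dim=1).mean()

known_pairs = torch.tensor([[12, 7], [401, 398], [90, 1033]])   # invented ids
loss = translation_alignment_loss(known_pairs)   # added to each language's own loss
```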

Naturally, we then observe that the known "similar" words end up close together. That is not surprising, since we optimized for it. Much more interesting is this: translations we did not know about also end up close together.

Perhaps this is not surprising given our experience with word embeddings. Embeddings pull similar words together, so if we know that an English and a Chinese word mean roughly the same thing, the representations of their synonyms should also end up nearby. We also know that pairs of words in a relation such as the gender difference differ by a roughly constant vector. It seems that, given enough translation pairs to align on, these difference vectors get adjusted to be the same in both languages. As a result, if the "male versions" of words in the two languages translate into each other, we automatically get that the "female versions" are translated correctly as well.

Intuitively, the two languages have a similar "shape", and by forcing them to line up at selected points, we pull the rest of the representations into the right places.


Visualization of the bilingual word embedding with t-SNE. Green is Chinese, yellow is English (Socher et al. (2013a)).

With two languages, we learn a shared representation of two rather similar kinds of data. But we can also embed very different kinds of data into a single space.

Recently, deep learning has been used to build models that embed images and words into a single representation space.
(Earlier work modeled the joint distribution of tags and images, but the approach here is somewhat different.)

The basic idea is that an image is classified by outputting a vector in the word-embedding space. Images of dogs are mapped to vectors near the representation of the word "dog", images of horses near "horse", images of cars near "car", and so on.
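A sketch of the idea, with invented dimensions and random stand-in vectors: a learned map sends image features into the word space, and the nearest word vector is taken as the label, which is exactly what makes unseen classes like "cat" possible.

```python
import torch
import torch.nn as nn

dim = 300                                         # dimensionality of the word space

# Stand-ins: trained word vectors for the class names, and a mapping that would be
# trained to send dog images near W("dog"), horse images near W("horse"), etc.
word_vectors = {w: torch.randn(dim) for w in ["dog", "horse", "car", "cat", "truck"]}
image_to_word_space = nn.Linear(4096, dim)        # input: e.g. CNN image features

def classify(image_features):
    v = image_to_word_space(image_features)
    # The nearest word vector wins -- even for classes the mapping was never trained
    # on (like "cat"), which is what makes zero-shot recognition possible here.
    sims = {w: float(torch.cosine_similarity(v, wv, dim=0))
            for w, wv in word_vectors.items()}
    return max(sims, key=sims.get)

print(classify(torch.randn(4096)))                # some class name; random here
```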



The most interesting part is what happens when the model is tested on new classes of images. For example, what happens if we ask the model to classify images of cats, even though it was never trained to recognize them, that is, never trained to map them to a vector near the "cat" vector?


Socher et al. (2013b)

It turns out that the network copes with new classes of images quite well. Images of cats are not mapped to random points in the space; on the contrary, they land in the vicinity of the "dog" vector and fairly close to the "cat" vector. Similarly, images of trucks land near the "truck" vector, which is close to the related "automobile" vector.


Socher et al. (2013b)

The Stanford group did this with 8 known classes and 2 unknown ones. The results are already impressive, but with so few classes there are few points from which to interpolate the relationship between images and the semantic space.

A Google research group built a much larger version of the same idea at about the same time, using 1000 categories instead of 8 (Frome et al. (2013)), and then proposed another variant (Norouzi et al. (2014)). Both are based on a strong image classification model (Krizhevsky et al. (2012)), but they embed images into the word-embedding space in different ways.

And the results are impressive. Even if images of unknown classes cannot be mapped exactly to the right vector, they at least end up in the right neighborhood. So if you try to classify images from unknown categories that are substantially different from each other, the classes can at least be told apart.

Even if I have never seen an Aesculapian snake or an armadillo, if you show me their pictures I can tell which is which, because I have a general idea of what kind of animal each name refers to. Such networks can do the same.

(So far we have mostly relied on "these words are similar" reasoning, but it seems that much stronger results are possible based on relationships between words. In our word-embedding space there is a consistent difference vector between the "male" and "female" versions of words. Similarly, in image space there are reproducible features that distinguish male from female: beards, mustaches and baldness are strong, visible indicators of being male; breasts and, less reliably, long hair, makeup and jewelry are obvious indicators of being female. Of course, physical indicators of sex can be misleading; I would not claim that everyone who is bald is a man, or that everyone who has breasts is a woman, only that these cues are more often right than wrong and give us a good starting point. Even if you have never seen a king, then upon seeing a queen, identified as such by her crown, who suddenly has a beard, you will probably decide to use the "male version" of the word "queen".)

Shared embeddings are a breathtaking research area, and a very convincing argument for pushing the representation-learning angle of deep learning.

Recursive neural networks



We began the discussion of word embeddings with this network:


The modular network that learns word embeddings (Bottou (2011)).

The diagram shows a modular network:

R(W(w1), W(w2), W(w3), W(w4), W(w5))

It is built from two modules, W and R. This approach of building neural networks from smaller "neural network modules" is not yet widespread, but it has proved very successful in natural language processing tasks.

The models described so far are powerful, but they have one annoying limitation: the number of inputs is fixed.

We can get around this by adding an association module A, which "merges" two vector representations.


From Bottou (2011)

"Merging" the sequence of words, A allows you to represent phrases and even whole sentences. And since we want to “merge” a different number of words, the number of entries should not be limited.

It is not obvious that merging the words of a sentence simply in order is the right thing to do. The sentence 'the cat sat on the mat' can be bracketed as '((the cat) (sat (on (the mat))))'. We can then apply A following this bracketing:


From Bottou (2011)

Such models are often called recursive neural networks, because the output of one module is fed as the input of another module of the same type. They are also sometimes called tree-structured neural networks.
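A minimal sketch of a tree-structured (recursive) network, assuming a parse is given as nested pairs; the merge module A here is a single tanh layer, which is a simplification of the models used in the cited papers:

```python
import torch
import torch.nn as nn

dim = 64

class Merge(nn.Module):
    """The association module A: combines two representations into one."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(2 * dim, dim)

    def forward(self, left, right):
        return torch.tanh(self.linear(torch.cat([left, right], dim=-1)))

A = Merge(dim)
W = nn.Embedding(10_000, dim)

def encode(tree, word_to_id):
    # tree is either a word (leaf) or a pair (left_subtree, right_subtree);
    # A is applied recursively, following the bracketing of the parse.
    if isinstance(tree, str):
        return W(torch.tensor([word_to_id[tree]])).squeeze(0)   # (dim,) leaf vector
    left, right = tree
    return A(encode(left, word_to_id), encode(right, word_to_id))

vocab = {w: i for i, w in enumerate("the cat sat on mat".split())}
sentence = (("the", "cat"), ("sat", ("on", ("the", "mat"))))
print(encode(sentence, vocab).shape)        # a single vector for the whole sentence
```

The same encode function handles sentences of any length, which is exactly the point of the recursive construction.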

Recursive neural networks have had considerable success on several natural language processing tasks. For example, Socher et al. (2013c) use them to predict the sentiment of a sentence:


(From Socher et al. (2013c).)

One major goal is to create a "reversible" sentence representation, that is, one from which a sentence with roughly the same meaning can be reconstructed. For example, we could try to introduce a dissociation module D that attempts to undo A:


From Bottou (2011)

If this succeeds, we will have an incredibly powerful tool. For example, we could try to build a joint sentence representation for two languages and use it for machine translation.

Unfortunately, this turns out to be very hard. Terribly hard. But given the grounds for hope, many people are working on the problem.

Recently, Cho et al. (2014) made progress on phrase representations, with a model that "encodes" an English phrase and "decodes" it as a French phrase. Just look at the phrase representations it learns!


A small section of the phrase-representation space, projected with t-SNE (Cho et al. (2014)).

Criticism


I have heard some of the results above criticized by researchers from other fields, in particular by linguists and NLP specialists. It is not the results themselves that are criticized, but the conclusions drawn from them and the way they are compared with other approaches.

I do not feel well enough prepared to articulate exactly what the concerns are. I would be glad if someone did so in the comments.

Conclusion


Deep learning in the service of representation learning is a powerful approach that seems to answer the question of why neural networks are so effective. There is also a remarkable elegance in it: why are neural networks effective? Because better ways of representing the data emerge on their own in the course of optimizing multilayer models.

Deep learning is a very young field, where theories are not yet settled and views change quickly. With that caveat, I would say that representation learning with neural networks currently looks very promising to me.

This post covers many research results that I find impressive, but my main goal is to set the stage for a future post exploring the connections between deep learning, type theory and functional programming. If you are interested, you can subscribe to my RSS feed so as not to miss it.

The author asks readers to report any inaccuracies in the comments; see the original article.

Thanks


Eliana Lorch, Yoshua Bengio, Michael Nielsen, Laura Ball, Rob Gilson, and Jacob Steinhardt.


Source: https://habr.com/ru/post/253227/

