
Introduction to adversarial networks

Hello. In this article, I begin a series of posts about adversarial networks. As in the previous article, I have prepared a docker image in which everything needed to reproduce what is written below is already set up. I will not copy all the code from the example here, only its main parts, so for convenience I advise you to keep it side by side for easier understanding. The docker container is available here, and the notebook, utils.py and the dockerfile here.


Although the adversarial network framework was proposed by Ian Goodfellow in his now famous paper Generative Adversarial Networks, the key idea came to him from his work on domain adaptation, so we will start the discussion of adversarial networks with this topic.


Imagine that you have two sources of data about similar sets of objects. For example, these could be medical records from different socio-demographic groups (men / women, adults / children, Asians / Europeans ...). Typical blood tests of representatives of different groups will differ, so a model that predicts, say, the risk of cardiovascular disease (CVD) trained on one sample cannot be applied to representatives of the other.


A typical solution to this problem would be to add a feature identifying the sample to the model input, but, unfortunately, this approach has many drawbacks:


  1. Unbalanced samples - there are more Asians than Europeans
  2. Different statistics - children suffer from CVD less often than adults
  3. Insufficient labeling of one of the samples - men born in the 1960s died in Afghanistan, so for them there is less data on CVD by region of birth than for women.
  4. The data have different sets of features - blood tests of humans and of mice, etc.

All these issues can greatly complicate the training of the model. But maybe we should not bother: just train a separate model for each sample and call it a day? It turns out it is worth the trouble. If you can level out the difference in statistics between the training samples, then in effect you get one sample that is larger than each of the originals. And what if you have not two data sources, but many more?


I will not talk about how the domain adaptation problem was solved in the "pre-neural-network" era, but will go straight to the basic architecture.


Network architecture from the article

In 2014, our compatriot Yaroslav Ganin, together with Viktor Lempitsky, published a very important paper, "Unsupervised Domain Adaptation by Backpropagation". It demonstrates how to transfer a classification model from one data source to another without using labels for the second source. The presented model consists of 3 subnetworks: a feature extractor (E), a label predictor (P) and a domain classifier (C), connected as in the figure.

The pair of networks E + P is an ordinary classifier cut somewhere in the middle. The layer where it is cut is called the feature layer. Network C receives input from this layer and tries to guess which source the example came from. The task of network E is to extract features from the data such that, on the one hand, P can correctly guess the example's label, and on the other hand, C cannot determine its source.
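
To make the wiring concrete, here is a minimal sketch of the three subnetworks in PyTorch. The layer sizes are my assumptions for a 28x28 grayscale input, not the architecture from the paper.

import torch
import torch.nn as nn

E = nn.Sequential(                      # feature extractor: image -> feature vector
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
)
P = nn.Sequential(nn.Linear(64, 10))    # label predictor: features -> 10 class logits
C = nn.Sequential(nn.Linear(64, 2))     # domain classifier: features -> 2 domain logits

x = torch.randn(32, 1, 28, 28)          # a dummy batch of 28x28 grayscale images
z = E(x)                                # the shared feature layer
label_logits = P(z)
domain_logits = C(z)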


To better understand why this is needed and why it should work, let's talk about information. We can say that each example contains information about its label and some other information. In the case of MNIST, all this information can be recorded, for example, in the form of a black-and-white 28x28-pixel image. If you can train a perfect autoencoder on MNIST, then you can write the same information down in a different form. Of course, in some cases the label information in the example itself may be incomplete: from an image it is not always possible to tell exactly which digit was written, but a certain amount of information about the label is still contained in it. Besides the label, the image has a number of explicit and a huge number of implicit properties: handwriting properties (thickness, slant, "curlicues"), position (centered or shifted), noise, etc. When we train a classifier, we try to extract as much information about the label as possible, but this can be done in a huge number of ways. On MNIST alone we can train 100 equally effective classifiers, each of which will have its own hidden representation of the images, let alone the case when the data sources are different.


Ganin's idea is that if we can maximize information with the help of a neural network, then nothing prevents us from minimizing it. If we consider data from two different sources (for example, MNIST and SVHN), then we can say that each example contains information about its label and about its source. If we manage to train a neural network E to extract features that contain information about the label, and to do so in the same way regardless of which source the example came from, then the network P trained only on examples from one source should be able to predict the labels for the second source as well.
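
The paper implements this tug-of-war with a gradient reversal layer: in the forward pass it acts as the identity, and in the backward pass it flips the sign of the gradient, so minimizing the domain loss trains C normally while pushing E to confuse it. A minimal sketch (the lambda scheduling from the paper is omitted):

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)             # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # flip and scale the gradient flowing back into the feature extractor
        return -ctx.lambd * grad_output, None

# usage: domain_logits = C(GradReverse.apply(E(x), 1.0))
# minimizing the domain loss then updates C as usual but pushes E the other way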


Result table

Indeed, a neural network trained on examples from SVHN using domain adaptation classifies images from MNIST more accurately than a network trained only on SVHN: 71% accuracy versus 59%. At the same time, neither model, of course, saw a single label from MNIST during training. In practice this means that you can transfer a trained classifier from one sample to another even if you do not know the labels for the second sample.

Although the task of classifying digits is quite simple, training the networks used in the paper may require substantial resources, and besides, code solving this problem can easily be found on the Internet. So I will instead walk through another example of this technique, which I hope will better demonstrate the idea of "separating" information in the feature layer.


Very often, when it comes to extracting information or representing it in a special way, autoencoders appear on the scene. Let's learn how to represent MNIST examples in a compressed form, but do it in such a way that the compressed representation contains no information about which digit was depicted. Then, if the decoder of our network, given the extracted features and the label of the original digit, is able to restore the original image, we can assume that the encoder loses no information other than the label. At the same time, if a classifier fed the extracted features cannot guess the label, then all the information about the label has been forgotten.


To do this, we will have to create and train 3 networks - the encoder (E), the decoder (D) and the classifier (C).


This time we will make the encoder convolutional by adding a couple of convolutional layers; for this we will use the Sequential class.


conv1 = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=16),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Dropout(0.2)
)
conv2 = nn.Sequential(
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=32),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Dropout(0.2)
)
self.conv = nn.Sequential(
    conv1,
    conv2
)

The Sequential class lets us define a whole subnetwork at once, in this case a sequence of convolution, batch normalization, activation, pooling and dropout layers. Information about these layers is available in abundance on the Internet (or, for example, in our book), so I will not analyze them in detail here.


Sequential layers (or subnetworks) can be used in the forward function just like any other layer.


def forward(self, x):
    x = self.conv(x)
    x = x.view(-1, 7*7*32)
    x = self.fc(x)
    return x
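
For reference, here is a hedged sketch of how the whole Encoder module might be assembled. The conv blocks repeat the ones shown above; the size of the feature layer (z_dim = 8) is my assumption and is not taken from the notebook.

import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, z_dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2), nn.Dropout(0.2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2), nn.Dropout(0.2),
        )
        self.fc = nn.Linear(7 * 7 * 32, z_dim)   # 28x28 -> 7x7 after two poolings

    def forward(self, x):
        x = self.conv(x)
        x = x.view(-1, 7 * 7 * 32)
        x = self.fc(x)                           # the feature layer z
        return x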

To make the decoder mirror the encoder, its last layers are built from transposed convolutions:


conv1 = nn.Sequential(
    nn.ConvTranspose2d(in_channels=32, out_channels=16, kernel_size=3, stride=2),
    nn.BatchNorm2d(num_features=16),
    nn.ReLU(),
    nn.Dropout(0.2)
)
conv2 = nn.Sequential(
    nn.ConvTranspose2d(in_channels=16, out_channels=1, kernel_size=2, padding=1, stride=2),
    nn.Tanh()
)

The essential difference of the decoder is that it receives not only the features produced by the encoder, but also the label:


def forward(self, x, y):
    x = torch.cat([x, y], 1)
    x = self.fc(x)
    x = x.view(-1, 32, 7, 7)
    x = self.deconv(x)
    return x

torch.cat allows us to combine the features and the label into one vector, and then we simply restore the image from this vector.
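
To keep the picture complete, here is a hedged sketch of how the whole Decoder might look. The fc dimensions follow from the forward pass above (the feature vector plus 10 one-hot label entries in, 32*7*7 out); z_dim = 8 is the same assumption as in the encoder sketch.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, z_dim=8):
        super().__init__()
        self.fc = nn.Linear(z_dim + 10, 7 * 7 * 32)   # features + one-hot label
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2),            # 7x7 -> 15x15
            nn.BatchNorm2d(16), nn.ReLU(), nn.Dropout(0.2),
            nn.ConvTranspose2d(16, 1, kernel_size=2, padding=1, stride=2),  # 15x15 -> 28x28
            nn.Tanh(),
        )

    def forward(self, x, y):
        x = torch.cat([x, y], 1)
        x = self.fc(x)
        x = x.view(-1, 32, 7, 7)
        x = self.deconv(x)
        return x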


And the third network will be an ordinary classifier that predicts the label of the original image from the features extracted by the encoder.
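
There is no classifier code in the excerpt, so here is a minimal sketch of what it could look like; the hidden layer size and the LogSoftmax output (chosen to pair with the NLL loss used below) are assumptions.

import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, z_dim=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(z_dim, 64), nn.ReLU(),
            nn.Linear(64, 10),
            nn.LogSoftmax(dim=1),   # log-probabilities for the NLL loss
        )

    def forward(self, z):
        return self.fc(z)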
The learning cycle of the entire model now looks like this:


for x, y in mnist_train:
    y_onehot = utils.to_onehot(y, 10)

    # train classifier C
    C.zero_grad()
    z = E(x)
    C_loss = NLL_loss(C(z), y)
    C_loss.backward(retain_graph=True)
    C_optimizer.step()

    # train decoder D and encoder E
    E.zero_grad()
    D.zero_grad()
    AE_loss = MSE_loss(D(z, y_onehot), x)
    C_loss = NLL_loss(C(z), y)
    FADER_loss = AE_loss - beta*C_loss
    FADER_loss.backward()
    D_optimizer.step()
    E_optimizer.step()

First, we use the encoder to extract the image features and update only the classifier's weights so that it predicts the label better:


z = E(x)
C_loss = NLL_loss(C(z), y)
C_loss.backward(retain_graph=True)
C_optimizer.step()

At the same time, we ask PyTorch to keep the computation graph (retain_graph=True) so that it can be reused. The next step is to train the autoencoder and impose an additional requirement: extract features from which it is harder for the classifier to recover the label:


AE_loss = MSE_loss(D(z, y_onehot), x)
C_loss = NLL_loss(C(z), y)
FADER_loss = AE_loss - beta*C_loss
FADER_loss.backward()
D_optimizer.step()
E_optimizer.step()

Note that the classifier's weights are not updated here, while the encoder's weights are updated in the direction of increasing the classifier's error. Thus, we alternately train the classifier and the autoencoder, which pursue opposite goals. This is the idea behind adversarial networks.


As a result of training such a model, we want to obtain an encoder that extracts from an example all the information needed to restore it, with the exception of the label. At the same time, we train the decoder to use this information together with the label to restore the original example. But what if we give the decoder a different label? In the image below, each row is obtained by reconstructing an image from the features of one of the digits combined with each of the 10 possible labels. The digits taken as the basis lie on the diagonal (more precisely, not the original examples themselves, but their reconstructions with the "correct" label).


Transferring style between digits


In my opinion, this example demonstrates the idea of extracting information other than the label very well, since it is clear that within one row all the digits are "written" in the same style. It is also noticeable that the row obtained from the digit "1" is unstable. I explain this by the fact that a written one does not carry much information about the style: perhaps only the line thickness and slant, and certainly nothing about curlicues. Therefore, the other digits written in "its" style can be quite diverse; in each case the style is still the same across the whole row, but it changes at different stages of training.
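
A hedged sketch of how such a grid could be produced: take one digit, encode it, then decode its features together with each of the 10 possible one-hot labels. E, D, mnist_train and utils.to_onehot are the objects used in the training loop above; the exact signature of to_onehot is an assumption.

import torch

E.eval(); D.eval()
with torch.no_grad():
    x, _ = next(iter(mnist_train))            # any batch of images
    z = E(x[:1])                              # features of a single example
    row = []
    for digit in range(10):
        y = utils.to_onehot(torch.tensor([digit]), 10)
        row.append(D(z, y))                   # same "style", a different label
    row = torch.cat(row, dim=0)               # 10 reconstructions, one per label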


It only remains to add that a similar approach was published at NIPS'17 in the Fader Networks paper from the Facebook team. In the same way, the model extracts features from photographs of faces and "forgets" labels such as the presence of a beard or glasses. Here is an example of what the paper achieved:


An example from the FADER Networks article
Although in this post we drew "new" digits, we had to use already existing digits to choose the style. In the next article I will talk about how to generate images from scratch and why this particular model cannot do that.

Source: https://habr.com/ru/post/358946/

