Generative adversarial networks (GANs) are becoming increasingly popular. Many people talk about them, some even use them... but, as it turns out, very few people, even among those who use them, understand them well enough to explain how they work. ;-)
Let's take the simplest possible example and see how they work, what they learn, and what they actually produce.
To begin with, I recommend that everyone read the excellent article "Counterfeiters vs. Bankers: pitting adversarial networks against each other in Theano". We will take our example from it.
To some, this article may seem verbose, with redundant and even repetitive explanations. That is indeed the case. But do not forget that repetition is the mother of learning. Besides, different people perceive text very differently, so sometimes even small changes in wording make a big difference. In any case, if something is already clear to you, just scroll past it.
Adversarial networks
So, we have two neural networks. We denote the first, generating network as a function y_G = G(z): it takes some value z as input and produces a value y_G as output. The second network is the discriminating one, and we denote it as a function y_D = D(x): input x, output y_D. The generating network must learn to produce samples y_G that the discriminating network D cannot distinguish from certain real, reference samples. And the discriminating network, in turn, must learn, on the contrary, to distinguish these generated samples from the real ones.
Initially, the networks know nothing, so they need to be trained. But how? After all, training an ordinary network requires a special set of training data, (X_i, Y_i). The X_i are fed to the network one by one and the outputs y_i = N(X_i) are computed. Then, based on the difference between Y_i (the true value) and y_i (the network's output), the network coefficients are updated, and so on around the loop.
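For concreteness, here is a toy version of that classic loop (my sketch, not from the article): a one-coefficient "network" y = w*x trained by gradient descent on the squared error.

import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X                       # target values Y_i; the "right answer" is w = 2
w = 0.0                           # the network's only coefficient
for step in range(100):
    y = w * X                     # network outputs y_i = N(X_i)
    grad = ((y - Y) * X).mean()   # gradient of the mean squared error (up to a factor of 2)
    w -= 0.2 * grad               # recalculate the coefficient from the difference
print(w)                          # prints a value very close to 2.0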
But it does not have to work that way. In fact, at each training step the value of a loss function is calculated, and its gradient is then used to update the network coefficients. The loss function usually takes the difference between Y_i and y_i into account in one way or another, but strictly speaking it does not have to. First, the loss function can be constructed quite differently, for example taking into account the values of the network coefficients or the value of some other function. Second, instead of minimizing a loss function during training, you can solve a different optimization problem, for example maximizing a success function. And this is exactly what is used in GANs.
Training the first, generating network consists in maximizing the functional D(G(z)). That is, this network strives to maximize not its own output, but the output of the second, discriminating network. Simply put, the generating network must learn to produce, for any value fed to its input, an output which, when fed to the input of the discriminating network, yields the maximum possible value at its output (perhaps it is worth reading that twice).
If the output of the discriminating network is a sigmoid, then we can say that it returns the probability that the input is a "correct" value. Thus, the generating network strives to maximize the probability that the discriminating network fails to distinguish the generating network's output from the "reference" samples (which means that the generating network produces correct samples).
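In the standard GAN formulation this same objective is usually written through a logarithm, and that is exactly what the code below computes:

max_G  E_z[ log D(G(z)) ]

(the logarithm is monotonic, so the maximum is in the same place, but the gradients are better behaved).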
Here it is important to remember that in order to make one training step for the generating network, we must compute not only the output of the generating network but also the output of the discriminating network. Let me put it clumsily but intuitively: the generating network learns from the gradient of the discriminating network's output.
Training the second, discriminating network consists in maximizing the functional D(x)(1 - D(G(z))). Roughly speaking, it (the network, that is, the function D()) should produce "ones" for reference samples and "zeros" for samples produced by the generating network.
The network itself knows nothing about whether its input is a reference or a fake. Only the functional "knows" this. However, the network learns along the gradient of this functional, which "as if" transparently hints to it how well it is doing.
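For a single pair of samples, maximizing this product is the same as maximizing its logarithm (the log of a product is the sum of the logs), and it is the logarithmic form that the code below averages over the batch:

max_D  E_x[ log D(x) ] + E_z[ log(1 - D(G(z))) ]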
Note that at each training step of the discriminating network, the output of the generating network is computed once, and the output of the discriminating network twice: the first time with a reference sample fed to its input, and the second time with the output of the generating network.
In general, the two networks are joined in an unbreakable loop of mutual training. And for this whole construction to work, we need reference samples: a set of training data (X_i). Note that no Y_i is needed here, although, clearly, it is implicitly assumed that every X_i corresponds to Y_i = 1.
During training, at each step some values are fed to the input of the generating network, for example completely random numbers (or even non-random numbers: "recipes" for controlled generation). At each step, the next reference sample and the next output of the generating network are fed to the input of the discriminating network.
As a result, the generating network must learn to generate samples as close as possible to the reference ones. And the discriminating network must learn to tell reference samples from generated ones.
It is worth remembering, though, that the generated samples may still turn out poorly: the goal of training the generating network is to maximize the similarity functional, and the maximum it actually reaches may be only slightly above zero, meaning the G network has not really learned anything. So be sure to check that everything came out fine.
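Putting it all together, the procedure has roughly this shape (an outline assembled from the walkthrough below, which defines sample_noise, D_train and G_train; the constants epochs, K, M, mu and sigma are assumed):

for i in range(epochs):
    for j in range(K):                                      # several steps for the discriminating network...
        x = np.float32(np.random.normal(mu, sigma, (M, 1))) # reference samples
        z = sample_noise(M)                                 # noise for the generating network
        D_train(z, x)
    z = sample_noise(M)
    G_train(z)                                              # ...then one step for the generating network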
Parsing the code
Now let's take a look at the code, because things are not so simple there either.
First, we import all the necessary modules:
import numpy as np
import lasagne
import theano
import theano.tensor as T
from lasagne.nonlinearities import rectify, sigmoid, linear, tanh
We define a function that returns almost-uniform noise on the interval [-5, 5], which we will later feed to the input of the generating network. Note that it is not pure noise: linspace produces M ordered points from -5 to 5, to which only a tiny random jitter is added.
def sample_noise(M):
    return np.float32(np.linspace(-5.0, 5.0, M) + np.random.random(M) * 0.01).reshape(M, 1)
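A quick look at what it returns (my example, not from the original article):

z = sample_noise(5)
print(z.shape)    # (5, 1): a column of values, one per row, as the networks expect
print(z.ravel())  # approximately [-5.0, -2.5, 0.0, 2.5, 5.0] plus the tiny jitter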
Create a symbolic variable that will be the input of the generating network:
G_input = T.matrix('Gx')
And we describe the network itself:
G_l1 = lasagne.layers.InputLayer((None, 1), G_input)
G_l2 = lasagne.layers.DenseLayer(G_l1, 10, nonlinearity=rectify)
G_l3 = lasagne.layers.DenseLayer(G_l2, 10, nonlinearity=rectify)
G_l4 = lasagne.layers.DenseLayer(G_l3, 1, nonlinearity=linear)
G = G_l4
G_out = lasagne.layers.get_output(G)
In terms of the lasagne library, this network has 4 layers. But from the academic point of view the input and output layers are not counted, so we get a two-layer network.
The output of the network will end up in the variable G_out once some value has been fed to its input (into G_input). Later, G_out will be passed to the input of the discriminating network, so G_out and D_input must match in format.
Now let's create a symbolic variable that will be the input of the discriminating network, and to which we will feed the "reference" samples:
D1_input = T.matrix('D1x')
And we describe the discriminating network. It is almost no different from the generator, except that its hidden layers use tanh and its output is a sigmoid.
D1_target = T.matrix('D1y')
D1_l1 = lasagne.layers.InputLayer((None, 1), D1_input)
D1_l2 = lasagne.layers.DenseLayer(D1_l1, 10, nonlinearity=tanh)
D1_l3 = lasagne.layers.DenseLayer(D1_l2, 10, nonlinearity=tanh)
D1_l4 = lasagne.layers.DenseLayer(D1_l3, 1, nonlinearity=sigmoid)
D1 = D1_l4
And now for a tricky trick. As you remember, we must feed the discriminating network first a reference sample, then the output of the generating network. But in a computation graph (that is, in Theano, TensorFlow and similar libraries) this cannot be done through a single input. Therefore, we create a third network that is an exact copy of the discriminating network described above.
D2_l1 = lasagne.layers.InputLayer((None, 1), G_out)
D2_l2 = lasagne.layers.DenseLayer(D2_l1, 10, nonlinearity=tanh, W=D1_l2.W, b=D1_l2.b)
D2_l3 = lasagne.layers.DenseLayer(D2_l2, 10, nonlinearity=tanh, W=D1_l3.W, b=D1_l3.b)
D2_l4 = lasagne.layers.DenseLayer(D2_l3, 1, nonlinearity=sigmoid, W=D1_l4.W, b=D1_l4.b)
D2 = D2_l4
Here, the input of the network is the value G_out, that is, the output of the generating network. Moreover, the coefficients of all layers of the third network are shared with the coefficients of the second network, so the third and second networks are exact copies of each other.
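A quick sanity check of that claim (my addition): passing W=D1_l2.W to a layer does not copy the weights, it reuses the very same shared variable object, so any training update to the second network instantly applies to the third.

assert D2_l2.W is D1_l2.W and D2_l2.b is D1_l2.b  # the same objects, not copies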
However, the outputs of these two identical networks will end up in different variables.
D1_out = lasagne.layers.get_output(D1)
D2_out = lasagne.layers.get_output(D2)
So we have arrived at the optimization functionals themselves:
G_obj = (T.log(D2_out)).mean()
D_obj = (T.log(D1_out) + T.log(1 - D2_out)).mean()
Now you can see why we needed two output variables for the discriminating network.
Next, we create a training function for the generating network:
G_params = lasagne.layers.get_all_params(G, trainable=True)
G_lr = theano.shared(np.array(0.01, dtype=theano.config.floatX))
G_updates = lasagne.updates.nesterov_momentum(1 - G_obj, G_params, learning_rate=G_lr, momentum=0.6)
G_train = theano.function([G_input], G_obj, updates=G_updates)
G_params holds the list of all coefficients of all layers of the generating network. The learning rate is stored in G_lr. G_updates is, in effect, the rule for updating the coefficients by gradient descent. Notice that its first parameter is treated as a loss function: instead of maximizing G_obj we minimize (1 - G_obj), which amounts to the same thing (this is simply how lasagne's update functions work: they always minimize). The second parameter passes in all the network coefficients, followed by the learning rate and the momentum constant (needed only because Nesterov momentum was chosen as the gradient descent method).
As a result, G_train is the training function of the network: its input is G_input, and its computed result is G_obj, that is, the optimization functional of the generating network.
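A single training step then looks like this (a sketch; M is the batch size, which is assumed here and pinned down only later):

z = sample_noise(M)       # a batch of M noise values for the generator
g_obj_value = G_train(z)  # performs one update of G's weights and returns G_obj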
Now everything is the same for the discriminating network:
D_params = lasagne.layers.get_all_params(D1, trainable=True)
D_lr = theano.shared(np.array(0.1, dtype=theano.config.floatX))
D_updates = lasagne.updates.nesterov_momentum(1 - D_obj, D_params, learning_rate=D_lr, momentum=0.6)
D_train = theano.function([G_input, D1_input], D_obj, updates=D_updates)
Notice that D_train is now a function of two variables: G_input (the input of the generating network) and D1_input (the reference samples).
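One caveat before the loop: the constants epochs, K, M, mu and sigma are used below but never defined in the excerpt. Values along these lines (my assumption, not from the original) would make it run:

epochs = 400          # number of training epochs
K = 20                # discriminating-network steps per generating-network step
M = 200               # batch size
mu, sigma = 4.0, 0.5  # parameters of the "reference" normal distribution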
Finally, we start the training, in a loop over epochs:
for i in range(epochs):
First, we train the discriminating network, and not just once but K times:
for j in range(K):
Here x is the reference samples (in this case, numbers drawn from a normal distribution with parameters mu and sigma), and z is the random noise:
x = np.float32(np.random.normal(mu, sigma, (M,1)))
z = sample_noise(M)
Next comes the magic of Theano:
- the reference samples are fed to the input of the discriminating network;
- the random noise z is fed to the input of the generating network, and its output to the input of the discriminating network;
- after which the optimization functional of the discriminating network is computed.
D_train(z, x)
The value returned by D_train (which, as you remember, is the optimization functional D_obj) is not needed in itself; nevertheless, it is precisely what drives the training of this network, albeit in a somewhat invisible way, through the gradient updates built from it.
Then we train the generating network: again we form a vector of random values and generate a result which, likewise, is used during training only for computing the optimization functional.
z = sample_noise(M)
G_train(z)
In theory, judging by the original description of the problem, both networks should take a single real value as input, but to speed up training we feed a vector of M values at once, as if performing M iterations of the network in one go.
Every 10 epochs, we slightly decrease the learning rate of both networks.
if i % 10 == 0:
    # use set_value: a plain `G_lr *= 0.999` would only rebind the Python name
    # to a symbolic expression and would not affect the compiled functions
    G_lr.set_value(np.float32(G_lr.get_value() * 0.999))
    D_lr.set_value(np.float32(D_lr.get_value() * 0.999))
In the end, training is complete, and you can use the G network to generate samples by feeding it random inputs, or even non-random data ("recipes") that will produce samples with certain properties.
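The excerpt does not show that last step explicitly; one way it might look (a sketch, assuming the variables defined above are still in scope):

G_sample = theano.function([G_input], G_out)  # compile: noise in, generated samples out
samples = G_sample(sample_noise(100))         # 100 generated values, shape (100, 1)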