In the
last article, we looked at the simplest linear generative model PPCA. The second generative model we’ll look at is Generative Adversarial Networks, abbreviated GAN. In this article, we will look at the most basic version of this model, leaving the advanced versions and comparison with other approaches in generative modeling for the following chapters.
Story
Generative modeling assumes approximation of noncomputable a posteriori distributions. Because of this, most effective methods developed for teaching discriminative models do not work with generative models. Past methods for solving this problem are computationally difficult and mainly based on the use of
Markov Chain Monte Carlo , which is poorly scalable. Therefore, a method based on such scalable techniques as
Stochastic Gradient Descent (SGD) and
backpropagation was needed to train generative models. One such method is Generative Adversarial Networks (GAN). For the first time GANs were proposed
in this article in 2014. High-level, this model can be described as two sub-models that compete with each other, and one of these models (the generator) tries to learn in a certain sense to deceive the second (the discriminator). For this, the generator generates random objects, and the discriminator tries to distinguish these generated objects from real objects from the training sample. In the process of learning, the generator generates objects that are more and more similar to the sample, and it becomes increasingly difficult for the discriminator to distinguish them from the real ones. Thus, the generator turns into a generative model that generates objects from a certain complex distribution, for example, from the distribution of photographs of human faces.
Model
To begin with, we introduce the necessary terminology. Through

we will denote some space of objects. For example, pictures

pixel. On some probability space

set vector random variable

with a probability distribution having a density

such that a subset of space

, on which

accepts nonzero values — these are, for example, photographs of human faces. We are given a random iid sample of faces for size

. Additionally, we define the auxiliary space

and random variable

with a probability distribution having a density

.

- discriminator function. This function accepts an object as input.

(in our example, a picture of the appropriate size) and returns the probability that the input picture is a photograph of a human face.

- generator function. It takes value

and gives the space object

, that is, in our case, the picture.
')
Suppose we already have the perfect discriminator

. For any example

it gives the true probability that this example belongs to a given subset

the sample from

. Rephrasing the problem of deception of a discriminator in a probabilistic language, we obtain that it is necessary to maximize the probability given out by the ideal discriminator using the generated examples. Thus, the optimal generator is as

. Because

- a monotonically increasing function and does not change the position of the extrema of the argument; this formula is rewritten as

that will be convenient in the future.
In reality, there is usually no perfect discriminator and it needs to be found. Since the discriminator's task is to provide a signal for training the generator, instead of the ideal discriminator, it is enough to take a discriminator that ideally separates the real examples from those generated by the current generator, i.e. perfect only on subset

from which examples are generated by the current generator. This task can be reformulated as the search for such a function.

, which maximizes the likelihood of properly classifying examples as genuine or generated. This is called a binary classification task and in this case we have an infinite training set: a finite number of real examples and a potentially infinite number of examples generated. Each example has a label: it is real or generated. The
first article described the solution of the classification problem using the maximum likelihood method. Let's write it for our case.
So, our sample

. Determine the density of distribution

then

- this is a reformulation of the discriminator.

probability class

(real example) as a distribution on classes

. Because

This definition specifies the correct probability density. Then the optimal discriminator can be found as:
Group the multipliers for

and

:
And when the sample size tends to infinity, we get:
So, we get the following iterative process:
- Set arbitrary initial
. - Begins
iteration,
. - We are looking for an optimal discriminator for the current generator:
. - We improve the generator using the optimal discriminator:
. It is important to be in the vicinity of the current generator. If you move far away from the current generator, the discriminator will no longer be optimal and the algorithm will no longer be correct. - The task of learning a generator is considered solved when
for anyone
. If the process is not converged, then go to the next iteration in paragraph (2).
In the original article, this algorithm is summed up in one formula, which in a sense defines a minimax game between the discriminator and the generator:
Both functions

can be represented as neural networks:

after which the task of finding optimal functions is reduced to the problem of optimization by parameters and it can be solved using traditional methods: backpropagation and SGD. Additionally, since the neural network is a universal function approximator,

can approximate an arbitrary probability distribution, which removes the question of choosing a distribution

. It can be any continuous distribution in some reasonable limits. For example,

or

. The correctness of this algorithm and the convergence

to

under rather general assumptions, it is proved in the original article.
Finding Normal Distribution Parameters
We've dealt with math, now let's see how it works. Let's say

i.e. solve one-dimensional problem.

. Let's use a linear generator

where

. The discriminator will be a fully connected three-layer neural network with a binary classifier at the end. The solution to this problem is

, i.e,

. Let us now try to program a numerical solution of this problem using Tensorflow. The full code can be found
here , in the article only key points are covered.
The first thing to ask is the input sample:

. Since the training goes on minicabits, we will generate a vector of numbers at once. Additionally, the sample is parameterized by the mean and standard deviation.
def data_batch(hparams): """ Input data are just samples from N(mean, stddev). """ return tf.random_normal( [hparams.batch_size, 1], hparams.input_mean, hparams.input_stddev)
Now we will set random inputs for the generator

:
def generator_input(hparams): """ Generator input data are just samples from N(0, 1). """ return tf.random_normal([hparams.batch_size, 1], 0., 1.)
We define the generator. Take the absolute value of the second parameter to give it a meaning of the standard deviation:
def generator(input, hparams): mean = tf.Variable(tf.constant(0.)) stddev = tf.sqrt(tf.Variable(tf.constant(1.)) ** 2) return input * stddev + mean
Let's create a vector of real examples:
generator_input = generator_input(hparams) generated = generator(generator_input)
And the vector of generated examples:
generator_input = generator_input(hparams) generated = generator(generator_input)
Now let's run all the examples through the discriminator. Here it is important to remember that we do not want two different discriminators, but one, because Tensorflow should be asked to use the same parameters for both inputs:
with tf.variable_scope("discriminator"): real_ratings = discriminator(real_input, hparams) with tf.variable_scope("discriminator", reuse=True): generated_ratings = discriminator(generated, hparams)
The loss function on real examples is the cross-entropy between the unit (the expected response of the discriminator on real examples) and the discriminator estimates:
loss_real = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits( labels=tf.ones_like(real_ratings), logits=real_ratings))
The loss function in fake examples is the cross-entropy between zero (the expected discriminator response in fake examples) and the discriminator scores:
loss_generated = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits( labels=tf.zeros_like(generated_ratings), logits=generated_ratings))
The discriminator's loss function is the sum of losses on real examples and on fake examples:
discriminator_loss = loss_generated + loss_real
The generator loss function is the cross-entropy between the unit (the desired discriminator error response on fake examples) and the evaluations of these fake examples by the discriminator:
generator_loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits( labels=tf.ones_like(generated_ratings), logits=generated_ratings))
Optionally, L2-regularization is added to the loss function of the discriminator.
Training of the model is reduced to alternate training of the discriminator and the generator in a cycle to convergence:
for step in range(args.max_steps): session.run(model.discriminator_train) session.run(model.generator_train)
Below are graphs for four discriminator models:
- three-layer neural network.
- three-layer neural network with L2 regularization ..
- three-layer neural network with dropout regularization.
- three-layer neural network with L2 and dropout regularization.
Fig. 1. Probability of classification by the discriminator of a real example as a real one.
Fig. 2. Probability of classification by the discriminator of the generated example as real.
All four models quite quickly converge to the fact that the discriminator produces

on all entrances. Because of the simplicity of the problem that the generator solves, there is almost no difference between the models. From the graphs it can be seen that the mean and standard deviation converge fairly quickly to the values from the data distribution:
Fig. 3. Mean generated distributions.
Fig. 4. The standard deviation of the generated distributions.
Below are the distribution of these and generated examples in the learning process. It can be seen that by the end of the training the generated examples are practically indistinguishable from the real ones (they are distinguishable in the graphs because Tensorboard chose different scales, but if you look at the values, they are the same).
Fig. 5. Distribution of real data. Does not change in time. The learning step is postponed on the vertical axis.



Fig. 6. Distribution of real data. Does not change in time. The learning step is postponed on the vertical axis.
Let's look at the learning process of the model:
Fig. 7. Visualization of the model learning process. The motionless Gaussian is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the work of the discriminator, i.e. the likelihood of an example being present.
It can be seen that at the beginning of training the discriminator very well shares the data, but the distribution of the generated examples very quickly “creeps” to the distribution of the real examples. In the end, the generator approximates the data so well that the discriminator becomes a constant

and the task converges.
Approximation of a mixture of normal distributions I
Let's try to replace

on

, thereby simulating a multimodal distribution of the source data. For this model, you only need to change the code for generating real examples. Instead of returning a normally distributed random variable, we return a mixture of several:
def data_batch(hparams): count = len(hparams.input_mean) componens = [] for i in range(count): componens.append( tf.contrib.distributions.Normal( loc=hparams.input_mean[i], scale=hparams.input_stddev[i])) return tf.contrib.distributions.Mixture( cat=tf.contrib.distributions.Categorical( probs=[1./count] * count), components=componens) .sample(sample_shape=[hparams.batch_size, 1])
Below are graphs for the same models as in the previous experiment, but for data with two modes:
Fig. 8. Probability of classification by the discriminator of a real example as a real one.
Fig. 9. Probability of classification by the discriminator of the generated example as real.
It is interesting to note that regularized models show themselves to be significantly better than non-regularized ones. However, regardless of the model, it can be seen that now the generator fails to deceive the discriminator so well. Let's understand why this happened.
Fig. 10. Mean generated distributions.
Fig. 11. The standard deviation of the generated distributions.
As in the first experiment, the generator approximates the data with a normal distribution. The reason for the decline in quality is that now the data cannot be accurately approximated by a normal distribution, because they are sampled from a mixture of two normal ones. The modes of the mixture are symmetric about zero, and it can be seen that all four models approximate the data with a normal distribution centered near zero and a fairly large dispersion. Let's look at the distribution of real and fake examples to see what is going on:
Figure 12. Distribution of real data. Does not change in time. The learning step is postponed on the vertical axis.



Figure 13. Distributions of the generated data from four models. The learning step is postponed on the vertical axis.
This is how the model is taught:
Fig. 14. Visualization of the model learning process. The motionless mixture of Gaussians is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the work of the discriminator, i.e. the likelihood of an example being present.
This animation shows in detail the case studied above. The generator, not possessing sufficient expressiveness and having the opportunity to approximate data only to a Gaussian, spreads into a wide Gaussian, trying to cover both modes of data distribution. As a result, the generator reliably deceives the discriminator only in places where the areas under the generator and source data curves are close, that is, in the area of intersection of these curves.
However, this is not the only possible case. Let's move the right-hand mode a little more to the right so that the initial approximation of the generator does not capture it.
Fig. 15. Visualization of the model learning process. The motionless mixture of Gaussians is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the work of the discriminator, i.e. the likelihood of an example being present.
It is seen that in this case it is most advantageous for the generator to try to bring the left distribution mode closer. After this happens, the generator tries to make an attempt to capture the left fashion too. It looks like oscillations of the generator standard deviation in the second half of the animation. But all these attempts fail, because the discriminator locks the generator somehow and in order to capture the left mode it needs to overcome the barrier of the high loss function, which it cannot do due to the insufficiently high learning speed. This effect is called fashion collapse.
In the two examples described above, we saw two types of problems that arise if the generator is not powerful enough to express the initial distribution of data: mode averaging, when the generator approximates the entire distribution, but everywhere is bad enough; and the collapse of the mode, when the generator learns a subset of modes, and those that he did not learn, do not affect him in any way.
In addition to the fact that both of these problems lead to the divergence of the discriminator to

they also lead to a decrease in the quality of the generative model. The first problem leads to the fact that the generator produces examples “between” modes, which should not be, the second problem leads to the fact that the generator gives examples only from some modes, thereby reducing the richness of the original data distribution.
Approximation of a mixture of normal distributions II
The reason that the previous section did not succeed until the end of the deception of the discriminator was the triviality of the generator, which simply did a linear transformation. Let us now try to use a fully connected three-layer neural network as a generator:
def generator(self, input, hparams):
Let's look at the training schedules.
Fig. 16. Probability of classification by the discriminator of a real example as real.
Fig. 17. Probability of classification by the discriminator of the generated example as real.
It can be seen that because of the large number of parameters, the training has become much more noisy. Discriminators of all models converge to about

but behave unstably around this point. Let's look at the form of the generator.
Figure 18. Distribution of real data. Does not change in time. The learning step is postponed on the vertical axis.



Figure 19. Distributions of the generated data from four models. The learning step is postponed on the vertical axis.
It can be seen that the distribution of the generator does not at least coincide with the distribution of data, but is quite similar to it. The most regularized model again showed itself better than anyone. It can be seen that she learned two modes that roughly coincide with the modes of data distribution. Peak sizes are also not very accurate, but approximate the distribution of data. Thus, the neural network generator is able to learn the multimodal distribution of data.
This is how the model is taught:
Fig. 20. Visualization of the learning process of a model with similar mods. The motionless mixture of Gaussians is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the work of the discriminator, i.e. the likelihood of an example being present.
Fig. 21. Visualization of the learning process of a model with distant modes. The motionless mixture of Gaussians is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the work of the discriminator, i.e. the likelihood of an example being present.
These two animations show learning on the data distributions from the previous section. From these animations it can be seen that when using a sufficiently large generator with a variety of parameters, it is, albeit rather crudely, but capable of approximating the multimodal distribution, thereby indirectly confirming that the problems from the previous section arise from the insufficiently complex generator. The discriminators in these animations are much more noisy than in the section on finding the parameters of the normal distribution, but by the end of the training they begin to resemble a noisy horizontal line.

.
Results
GAN is a model for approximating an arbitrary distribution only by sampling from this distribution. In this article, we looked at the details of how the model works on a trivial example of finding parameters of a normal distribution and on a more complex example of approximating a bimodal distribution by a neural network. Both tasks were solved with good accuracy, which required only the use of a rather complex generator model. In the next article, we will move from these model examples to real examples of generating samples from complex distributions using the example of image distribution.
Thanks
Thank you
Olga Talanova and
Ruslan Login for the review of the text. Thank you
Ruslan Login for help in preparing images and animations. Thanks to
Andrei Tarashkevich for help with the layout of this article.