In the previous article, using toy models as an example, I tried to analyze why we can train GANs effectively at all. Now we will summarize some of those results and, most importantly, try to analyze how the architecture of the neural networks affects the stability of the training process.
Introduction
The GAN training process can be viewed either as minimization of a lower bound on a divergence, or as a dynamical system consisting of a discriminator and a generator. The first approach is very well developed; anyone interested can find many works on this topic, of which I will mention only the most significant ones in my opinion: f-GAN and WGAN. On the other hand, The Numerics of GANs (Lars Mescheder et al., 2017) showed that one cannot look at a GAN only from the point of view of divergence minimization (although that remains the main goal); the properties of the dynamical system must also be taken into account. Our goal is to combine these two approaches and see what theoretical properties the trajectories of such a dynamical system have.
Next, we will consider a fairly general form of the GAN objective, which can be written as

$$I(\theta, \phi) = \mathbb{E}_{x \sim p_{data}}\left[f_D\big(D_\theta(x)\big)\right] + \mathbb{E}_{z \sim p_z}\left[f_G\big(D_\theta(G_\phi(z))\big)\right],$$

where $\theta$ are the parameters of the discriminator, $\phi$ are the parameters of the generator, and $f_D$, $f_G$ are scalar functions whose choice determines the particular GAN flavor (this form covers, in particular, the standard GAN and WGAN). We also assume that the generator and the discriminator are twice differentiable functions.
The learning process can be mathematically formulated as

$$\phi_{opt} = \underset{\phi}{\arg\min}\ \underset{\theta}{\max}\ I(\theta, \phi),$$

where we assume (Assumption 1) that the optimal discriminator

$$\theta_{opt}(\phi) = \underset{\theta}{\arg\max}\ I(\theta, \phi)$$

is defined for all values of $\phi$. It may well be ambiguous, that is, there may be many different optimal discriminators. We will call moving within this set of optimal discriminators a "reparameterization", and the set $\{\theta_{opt}\}$ itself the discriminator reparameterization set. Let us make a small clarification: by the maximum we mean not only the global maximum, but any local one.
Mathematically, $\theta_{opt}$ is determined by the system:

$$\nabla_\theta I(\theta_{opt}, \phi) = 0$$
$$\nabla^2_\theta I(\theta_{opt}, \phi) \preceq 0$$
However, when the generator and the discriminator are neural networks and we are training an image generator, this assumption (Assumption 1) does not hold. In the following articles I will show how it can be enforced with a simple regularizer, but for now we will have to take it on faith.
The second assumption ( Assumption 2 ) is that for any infinitesimal change in the generator parameters there exists an optimal discriminator infinitesimally close to the current one. You can try to work out when this holds for neural networks, but it seems to me an intuitively clear requirement that is fulfilled with high probability for sufficiently flexible networks.
Mathematically, this requirement can be formulated as follows: the system of equations

$$\nabla^2_\theta I(\theta_{opt}, \phi)\, d\theta + \nabla^2_{\theta\phi} I(\theta_{opt}, \phi)\, d\phi = 0 \qquad (1)$$

has solutions $d\theta$ for any $d\phi$. This condition is easily obtained by expanding $\nabla_\theta I(\theta_{opt}, \phi) = 0$ in a Taylor series to first order, $\nabla_\theta I(\theta_{opt} + d\theta, \phi + d\phi) \approx \nabla_\theta I(\theta_{opt}, \phi) + \nabla^2_\theta I\, d\theta + \nabla^2_{\theta\phi} I\, d\phi$, and requiring that the gradient remain zero at the new point.
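A quick numerical sanity check of (1) can be done on a toy problem. The sketch below assumes a simple quadratic objective $I(\theta, \phi) = -\tfrac{1}{2}\theta^2 + \theta\phi$ (a Dirac-GAN-like model, chosen purely for illustration), for which the optimal discriminator is $\theta_{opt} = \phi$ in closed form:

```python
import numpy as np

# Toy objective I(theta, phi) = -theta^2 / 2 + theta * phi.
# Then: grad_theta I = -theta + phi  =>  theta_opt(phi) = phi,
#       d^2/dtheta^2 I = -1,  d^2/(dtheta dphi) I = 1.
H_tt = -1.0   # nabla^2_theta I
H_tp = 1.0    # nabla^2_{theta phi} I

phi = 0.7
theta_opt = phi           # optimal discriminator for the current generator
dphi = 1e-3               # small generator perturbation

# Equation (1): H_tt * dtheta + H_tp * dphi = 0  =>  dtheta = dphi here.
dtheta = -H_tp * dphi / H_tt

# The updated discriminator should still be (near-)optimal:
grad_theta_new = -(theta_opt + dtheta) + (phi + dphi)
print(grad_theta_new)     # ~0.0: the perturbed discriminator stays optimal
```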
We will now show that, under Assumption 2, the generator gradient $\nabla_\phi I(\theta_{opt}, \phi)$ does not change if we move along the discriminator's reparameterization set. Moving along the reparameterization set means that $d\theta_{opt}$ belongs to the kernel of $\nabla^2_\theta I(\theta, \phi)$, that is:

$$\nabla^2_\theta I(\theta, \phi)\, d\theta_{opt} = 0$$
From this, for the gradient not to change, it is necessary that $d\theta_{opt}$ belong to the kernel of $\nabla^2_{\phi\theta} I(\theta, \phi)$. But if $d\theta_{opt}$ belongs to the kernel of $\nabla^2_\theta I(\theta, \phi)$ and (1) holds, then $d\theta_{opt}$ does belong to the kernel of $\nabla^2_{\phi\theta} I(\theta, \phi)$. This is easy to show by multiplying (1) by $d\theta_{opt}^T$ and using the fact that $d\phi$ can be arbitrary:

$$d\theta_{opt}^T \nabla^2_\theta I\, d\theta + d\theta_{opt}^T \nabla^2_{\theta\phi} I\, d\phi = 0 \;\Rightarrow\; d\theta_{opt}^T \nabla^2_{\theta\phi} I\, d\phi = 0 \;\Rightarrow\; d\theta_{opt}^T \nabla^2_{\theta\phi} I = 0$$

(the first term vanishes because $\nabla^2_\theta I$ is symmetric and $\nabla^2_\theta I\, d\theta_{opt} = 0$).
We have just proved the remarkable fact that the gradient $\nabla_\phi\, \underset{\theta}{\max}\, I(\theta, \phi)$ does not depend on which $\theta$ is selected from the discriminator reparameterization set, which means $\underset{\theta}{\max}\, I(\theta, \phi)$ is a differentiable function of $\phi$ and we can minimize it using gradient-based optimization methods.
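This invariance is easy to observe numerically. The sketch below uses a toy objective $I(\theta, \phi) = -\tfrac{1}{2}(\theta_1 + \theta_2 - \phi)^2 + c\,(\theta_1 + \theta_2)$, invented for illustration: its optimal discriminators form a whole line $\theta_1 + \theta_2 = \phi + c$ (a reparameterization set), yet the generator gradient is the same at every point of that line:

```python
import numpy as np

c = 0.3   # arbitrary constant of the toy objective
phi = 1.0

def grad_phi(theta1, theta2, phi):
    # I(theta, phi) = -(theta1 + theta2 - phi)^2 / 2 + c * (theta1 + theta2)
    # => nabla_phi I = theta1 + theta2 - phi
    return theta1 + theta2 - phi

# Walk along the optimal line theta1 + theta2 = phi + c
for t1 in np.linspace(-5.0, 5.0, 5):
    t2 = (phi + c) - t1
    print(t1, t2, grad_phi(t1, t2, phi))   # the gradient is always c = 0.3
```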
What can be said about the minimum of this function? Unfortunately, no very general conclusions can be drawn yet. We have to make one more assumption ( Assumption 3 ) about the properties of the discriminator, namely that $\underset{\theta}{\max}\, I(\theta, \phi) \geq 0$. From real analysis we know that a continuous function bounded from below either attains its minimum value, or there exist sequences of points along which it decreases monotonically (that is, the function approaches its infimum at infinity).
Assumption 3 states that no matter from what point we start training the discriminator (how we initialize the neural network), the optimal discriminator always assigns a higher (or equal) value, on average, to the data than to samples from the generator, and any discriminator that does the opposite cannot be optimal. In my opinion, this assumption is internally very logical, and sufficiently flexible neural networks should satisfy it. By the way, it is easy to show that a linear discriminator satisfies this assumption.
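As a sketch of that last claim, take a WGAN-style objective (one admissible choice within our general form, with $f_D(t) = t$ and $f_G(t) = -t$) and a linear discriminator $D_\theta(x) = \theta^T x$:

$$\underset{\theta}{\max}\ I(\theta, \phi) = \underset{\theta}{\max}\ \theta^T\left(\mathbb{E}_{x \sim p_{data}}[x] - \mathbb{E}_{z \sim p_z}[G_\phi(z)]\right) \geq I(0, \phi) = 0,$$

since $\theta = 0$ is always available; the maximum is non-negative, exactly as Assumption 3 requires.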
"The optimal trajectory"
We have just found out that if the discriminator satisfies these rather "soft" assumptions, then either a rest point exists, or it lies at infinity and we can approach it asymptotically. Let us now construct a trajectory that approaches it.
To do this, suppose $\theta$ and $\phi$ are the current parameters and the discriminator has been trained to convergence. Now we change the generator a little bit ($\phi_{k+1} = \phi_k + d\phi$), so that $\underset{\theta}{\max}\, I(\theta, \phi)$ decreases; for example, we make one gradient descent step. How should we update the discriminator? The answer is given by formula (1):

$$d\theta = -\nabla^2_\theta I(\theta, \phi)^\dagger\, \nabla^2_{\theta\phi} I(\theta, \phi)\, d\phi \qquad (2)$$
where $\nabla^2_\theta I(\theta, \phi)^\dagger$ is the pseudoinverse matrix. To be mathematically strict, formula (2) defines a tangent hyperplane to the surface in parameter space on which the optimal discriminators live. We will call this surface "optimal".
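In code, the tangent step (2) is a single pseudoinverse multiplication. The sketch below uses a toy quadratic objective $I(\theta, \phi) = -\tfrac{1}{2}\theta^T A \theta + \theta^T B \phi$ (my own illustrative choice, with $A \succ 0$), where the optimal surface is known exactly: $\theta_{opt}(\phi) = A^{-1} B \phi$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic objective I = -theta^T A theta / 2 + theta^T B phi:
#   nabla^2_theta I = -A,  nabla^2_{theta phi} I = B,  theta_opt = A^{-1} B phi
M = rng.normal(size=(3, 3))
A = M @ M.T + 3 * np.eye(3)            # symmetric positive definite
B = rng.normal(size=(3, 3))

phi = rng.normal(size=3)
theta = np.linalg.solve(A, B @ phi)    # start on the optimal surface
dphi = 1e-3 * rng.normal(size=3)       # small generator step

# Formula (2): dtheta = -pinv(nabla^2_theta I) @ nabla^2_{theta phi} I @ dphi
dtheta = -np.linalg.pinv(-A) @ B @ dphi

# Compare with the exact change of the optimal discriminator:
dtheta_exact = np.linalg.solve(A, B @ dphi)
print(np.allclose(dtheta, dtheta_exact))   # True: we stayed on the surface
```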
But what happens if we started from a very close point where the discriminator was nevertheless not optimal? Our formula, having been derived only for optimal discriminators, then describes completely unpredictable trajectories. Let's fix this situation. It would be logical to move not only parallel to the "optimal" surface, but also towards it, and the step in the direction indicated by Newton's method is well suited for this. Our final formula looks like this:

$$d\theta = -\nabla^2_\theta I(\theta, \phi)^\dagger \left[\nabla_\theta I(\theta, \phi) + \nabla^2_{\theta\phi} I(\theta, \phi)\, d\phi\right] \qquad (3)$$
What have we just done? First, note that on the optimal surface $\nabla_\theta I(\theta, \phi) = 0$, i.e. if the discriminator was trained to convergence, then we have not changed the trajectory. On the other hand, if we start somewhere near the "optimal" surface, the additional term will "pull" us onto the "optimal" trajectory. That is, we have made the "optimal" trajectory robust to noise.
Unfortunately, I cannot prove this mathematically, but we will not need to. Let's take a closer look at formula (3): since $\nabla_\theta I(\theta, \phi) + \nabla^2_{\theta\phi} I(\theta, \phi)\, d\phi \approx \nabla_\theta I(\theta, \phi + d\phi)$ to first order, it can be rewritten in the form:

$$d\theta = -\nabla^2_\theta I(\theta, \phi)^\dagger\, \nabla_\theta I(\theta, \phi + d\phi)$$
Does it look familiar? It is almost Newton's method. This suggests that alternating updates of the generator and discriminator parameters, where the discriminator is updated by Newton's method (in fact, we don't have to make the full Newton step; we can take a small step in the direction indicated by Newton's method), and the generator, generally speaking, can step in any direction as long as $\underset{\theta}{\max}\, I(\theta, \phi)$ decreases, are a good approximation of movement along the optimal trajectory.
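Here is a minimal sketch of such an alternating scheme on the toy objective $I(\theta, \phi) = -\tfrac{a}{2}\theta^2 + b\,\theta\phi$ (again an illustrative choice, with closed-form derivatives): a full Newton step puts the discriminator exactly on the optimal surface, after which a plain gradient step on the generator monotonically shrinks $\phi$ towards the rest point at zero.

```python
import numpy as np

a, b = 1.0, 1.0        # toy objective I = -a * theta^2 / 2 + b * theta * phi
eta = 0.1              # generator learning rate

theta, phi = 0.5, 1.0  # deliberately start off the optimal surface
for step in range(50):
    # Discriminator: Newton step  theta -= (nabla^2_theta I)^-1 * nabla_theta I
    grad_theta = -a * theta + b * phi
    theta -= grad_theta / (-a)       # lands exactly on theta_opt = b * phi / a
    # Generator: plain gradient descent on I
    grad_phi = b * theta
    phi -= eta * grad_phi

print(theta, phi)      # both -> 0: the rest point
```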
By the way, perhaps this is precisely the reason for Adam's success in GAN training. After all, it is well known that Adam estimates a diagonal preconditioner from gradient statistics and uses this information for the gradient step, crudely approximating Newton's method.
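For reference, the Adam update is

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$

where $\hat m_t$ and $\hat v_t$ are bias-corrected averages; dividing by $\sqrt{\hat v_t}$ acts as a per-coordinate (diagonal) preconditioner, which is what makes the step loosely Newton-like.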
Trajectory stability
To analyze the stability of the trajectories, we will use the same tools as the authors of Which Training Methods for GANs do actually Converge? . If you have the time, I recommend studying that work properly; it is worth it.
I believe that the main drawback of that article is that the authors analyze the stability of the system at a rest point, which simply does not exist for most real GANs (for toy 2D or 3D examples it is of course possible). I do not mean that no rest point exists at all, but that there is no point at which the generator distribution coincides with the data distribution and the discriminator is identically 0. Therefore, we will not analyze the stability of the system at a rest point; we will go a slightly different way and analyze the stability of whole trajectories. This analysis transfers easily to a rest point, because a rest point is also a trajectory.
As the trajectories under study, we choose those obtained by gradient-descent training. Why is this important? First, we found out earlier what the "optimal" trajectory is, and that it is well approximated by discriminator updates in the direction indicated by Newton's method; and Newton's method is still a gradient method, just one that uses second-derivative information to correct the direction. But the main reason we do this is very simple: because we can. Any deviation from this setting leads to such monstrous formulas that it is simply impossible to understand anything in them.
The standard way to analyze the stability of trajectories is to analyze the linearized model. Following the authors of Which Training Methods for GANs do actually Converge? , we study the Jacobian of the training update along the trajectory:
$$A = \begin{bmatrix} \nabla^2_\theta I(\theta, \phi) & \nabla^2_{\theta\phi} I(\theta, \phi) \\ -\nabla^2_{\theta\phi} I(\theta, \phi)^T & -\nabla^2_\phi I(\theta, \phi) \end{bmatrix}$$

$$J = I + \eta A$$

where $I$ is the identity matrix and $\eta$ is the step size.
We assume that the discriminator, although not optimal, is close enough to optimal that $\nabla^2_\theta I(\theta, \phi) \preceq 0$. We also assume that the regularizer (I already mentioned that in a real GAN the discriminator has no optimal point and that we must create this point with the help of a regularizer) does not depend on the generator. Running a little ahead: the best regularizer we found in practice does depend on the generator, but for our theoretical study this assumption does not limit us much.
For what follows, we need one elementary fact about matrices: if $\nabla^2_\theta I(\theta, \phi) \preceq 0$ and $\nabla^2_\phi I(\theta, \phi) \succeq 0$, then there exists $\eta > 0$ such that the spectral norm of the matrix $J$ is at most 1. The proof is elementary; write out $J^T J$:
$$J^T J = [I + \eta A]^T [I + \eta A] = I + \eta\left[A + A^T + \eta A^T A\right]$$
but:
$$A + A^T = \begin{bmatrix} 2\nabla^2_\theta I(\theta, \phi) & 0 \\ 0 & -2\nabla^2_\phi I(\theta, \phi) \end{bmatrix}$$
Hence, by the continuity of eigenvalues, one can choose $\eta > 0$ such that the eigenvalues of $J^T J$ are at most 1. This yields the required fact: $\left\|J\right\| \leqslant 1$.
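A numerical illustration of this fact, with the blocks of $A$ replaced by scalars I picked for the example ($\nabla^2_\theta I = -1$, $\nabla^2_{\theta\phi} I = 1$, $\nabla^2_\phi I = 0.5$, so that both definiteness conditions hold strictly):

```python
import numpy as np

# A 2x2 instance of the Jacobian block matrix:
#   [[ nabla^2_theta I,       nabla^2_{theta phi} I],
#    [-nabla^2_{theta phi} I, -nabla^2_phi I       ]]
A = np.array([[-1.0,  1.0],
              [-1.0, -0.5]])

for eta in [2.0, 0.5, 0.1, 0.01]:
    J = np.eye(2) + eta * A
    print(eta, np.linalg.norm(J, 2))
# The spectral norm exceeds 1 for a large step (eta = 2.0)
# but drops below 1 once eta is small enough.
```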
We know that $\nabla^2_\theta I(\theta, \phi) \preceq 0$ holds on our trajectory (since the trajectory is "optimal"), and if $\nabla^2_\phi I(\theta, \phi) \succeq 0$ also held, then any trajectory (including the equilibrium point) would be stable; that is, after a slight deviation from the trajectory, the system would tend to return to it. This is because any discrete trajectory can be represented as a product of Jacobians, and a product of matrices with norm at most 1 cannot have norm greater than 1.
But a great many observations say that this is not always the case for GANs. A striking example is mode collapse, which, moreover, does not happen always or for all models. I argue that the difficulty of GAN training is related to two facts: there is no optimal discriminator (treated with a regularizer), and the generator Hessian $\nabla^2_\phi I(\theta, \phi)$ is not a positive semi-definite matrix. Now we will try to understand how, if not to get rid of the second problem completely, then at least to minimize its effect on stability.
To do this, let us write out the generator Hessian. Writing the generator term of the objective as $\mathbb{E}_z\left[f_G\big(D_\theta(G_\phi(z))\big)\right]$ and applying the chain rule twice, it splits into three components:

$$J_1 = \mathbb{E}_z\left[f''_G\, \nabla_\phi D\big(G_\phi(z)\big)\, \nabla_\phi D\big(G_\phi(z)\big)^T\right]$$
$$J_2 = \mathbb{E}_z\left[f'_G\, \left(\nabla_\phi G_\phi(z)\right)^T \nabla^2_x D\big(G_\phi(z)\big)\, \nabla_\phi G_\phi(z)\right]$$
$$J_3 = \mathbb{E}_z\left[f'_G\, \sum_i \big(\nabla_x D\big)_i\, \nabla^2_\phi G_{\phi,i}(z)\right]$$
Since the expressions for these components are very complex and it is impossible to discern any structure in them, all our efforts will be directed at making these matrices as close to 0 as possible.
From $J_1$ it is clear that this term is very easy to zero out: just choose the function $f_G$ so that $f''_G(x) = 0$, or so that the "working point" lies in the region where the second derivative is close to 0. By the way, the choice $f''_G(x) = 0$ corresponds to WGAN. Maybe that is why WGAN is known for its stability?
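A tiny sketch of this point, comparing the WGAN generator loss $f_G(x) = -x$ with the non-saturating loss $f_G(x) = -\log \sigma(x)$ (my choice of comparison; other saturating losses behave similarly):

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    # Central finite-difference estimate of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

f_wgan = lambda x: -x                      # WGAN generator loss: f'' = 0 everywhere
f_ns   = lambda x: np.log1p(np.exp(-x))    # non-saturating loss -log sigmoid(x)

for x in [-2.0, 0.0, 2.0]:                 # a few discriminator "working points"
    print(x, second_derivative(f_wgan, x), second_derivative(f_ns, x))
# WGAN's J1 term vanishes identically; the non-saturating loss contributes
# f''(x) = sigmoid(x) * (1 - sigmoid(x)) > 0 at every working point.
```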
The other two components can only be influenced through the neural network architecture. For example, they vanish for a linear generator and discriminator (the second derivatives are zero). ReLU activations come to mind here; unfortunately, this analysis does not apply to such networks, because the differentiability assumption is violated. But it is known from practice that GANs with piecewise-linear activations really do train much more stably than, for example, ones with hyperbolic tangents. Whether this is due to the Hessian or to another, more important reason is anyone's guess, but I plan to investigate this question in the future.
You can also conclude that GANs "prefer" wide networks. Indeed, we know that GANs with deep networks collapse more often.
Conclusion
In this article I tried to give answers to long-standing questions in GAN theory. We saw that for a rest point to exist, it is enough for the discriminator to have fairly simple properties. We also saw that the probabilistic (divergence-minimization) interpretation of GANs remains perfectly valid, and that a fairly simple algorithm lets us approximate the "optimal" trajectories very well. Moreover, the Adam algorithm already does something like this and in theory should stabilize GAN training, which, by the way, has been noted in many papers.
Also, quite importantly, we saw that the empirical recommendations for choosing network architectures and loss functions agree well with the theory. And most importantly, all the training problems come from nonlinearity: linear models train very stably.
What was left out of the article? Practice. In the next article I will describe the regularizer I use to provide the discriminator with a rest point, and a bunch of tricks for improving the stability of GAN training (some of them, by the way, motivated by this analysis).
And as an example, here is a deep network (a 50-layer generator and a 50-layer discriminator) trained on the LSUN bedroom dataset (only 100 thousand iterations, and we did not use generator averaging).