Good day! I want to tell you about the optimization method known as Hessian-Free, or Truncated Newton, and about its implementation with the TensorFlow deep learning library. It takes advantage of second-order optimization while never needing to form the matrix of second derivatives explicitly. This article describes the HF algorithm itself and demonstrates it by training a feedforward network on the XOR and MNIST datasets.
A little about optimization techniques
Training a neural network amounts to minimizing an error (loss) function with respect to its parameters (weights), of which there can be a great many. Accordingly, many optimization methods exist to solve this problem.
Gradient descent
Gradient descent is the simplest method for iteratively finding the minimum of a differentiable function (in the case of neural networks, the cost function). Given several parameters (the weights of the network) and differentiating the function with respect to them, we obtain the vector of partial derivatives, or gradient:
$$\nabla f(\theta) = \left( \frac{\partial f}{\partial \theta_1}, \frac{\partial f}{\partial \theta_2}, \dots, \frac{\partial f}{\partial \theta_N} \right)$$
The gradient always points in the direction of steepest growth of the function. If we move in the opposite direction, i.e. along $-\nabla f(\theta)$, we will eventually arrive at a minimum, which is what we need. The simplest gradient descent algorithm:
Initialization: choose the parameters $\theta_0$ at random.
Compute the gradient $\nabla f(\theta_n)$.
Update the parameters in the direction of the negative gradient: $\theta_{n+1} = \theta_n - \alpha \nabla f(\theta_n)$, where $\alpha$ is the learning rate.
Repeat the previous steps until the gradient is close enough to zero (a minimal code sketch of this loop follows below).
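For illustration, here is a minimal NumPy sketch of this loop; the quadratic objective, the starting point and the learning rate are arbitrary choices made only for the example.

import numpy as np

def grad_f(theta):
    # Gradient of a toy quadratic f(theta) = theta_1^2 + 10 * theta_2^2
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta = np.array([3.0, -2.0])   # "random" initialization
alpha = 0.05                    # learning rate

for _ in range(1000):
    g = grad_f(theta)
    if np.linalg.norm(g) < 1e-6:   # stop when the gradient is close to zero
        break
    theta = theta - alpha * g      # step in the direction of the negative gradient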
Gradient descent is fairly simple to implement and is a well-proven optimization method, but it has a downside: it is a first-order method, meaning it uses only the first derivative of the cost function. This imposes a restriction: it implicitly assumes that the cost function locally looks like a plane and ignores its curvature.
Newton's Method
But what if we also use the information that the second derivatives of the cost function give us? The best-known optimization method using second derivatives is Newton's method. Its main idea is to minimize a quadratic approximation of the cost function. What does this mean? Let's see.
Take the one-dimensional case. Suppose we have a function $f(x)$. To find a minimum, we need to find a zero of its derivative, because we know that $f'(x) = 0$ at a minimum. We approximate $f$ by its second-order Taylor expansion around the current point $x$:
$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$
We want to find the step $\Delta x$ that takes us to the minimum of this approximation. To do so, we take the derivative with respect to $\Delta x$ and set it to zero:
$$f'(x) + f''(x)\,\Delta x = 0 \;\;\Rightarrow\;\; \Delta x = -\frac{f'(x)}{f''(x)}$$
If $f$ is a quadratic function, this step lands exactly at the absolute minimum. If we want to find the minimum iteratively, we take an initial $x_0$ and update it according to the rule:
$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$
Over the iterations, the solution converges to a minimum.
Now consider the multidimensional case. Suppose we have a function $f(\theta)$ of a vector of parameters $\theta$; then the second-order expansion is:
$$f(\theta + \delta) \approx f(\theta) + \nabla f(\theta)^T \delta + \frac{1}{2} \delta^T H \delta$$
where $H$ is the Hessian, the matrix of second derivatives. From this, the parameter update formula is:
$$\theta_{n+1} = \theta_n - H^{-1} \nabla f(\theta_n)$$
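A minimal NumPy sketch of the Newton iteration, again on a toy quadratic (the function, its gradient and Hessian are made up purely for the example):

import numpy as np

def grad_f(theta):
    # Gradient of f(theta) = theta_1^2 + 10 * theta_2^2
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

def hessian_f(theta):
    # Hessian of the same quadratic (constant in this toy case)
    return np.array([[2.0, 0.0],
                     [0.0, 20.0]])

theta = np.array([3.0, -2.0])
for _ in range(10):
    g = grad_f(theta)
    H = hessian_f(theta)
    # Solve H * delta = -g instead of explicitly inverting H
    delta = np.linalg.solve(H, -g)
    theta = theta + delta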
Problems with Newton's Method
As we can see, Newton's method is a second-order method and can work better than ordinary gradient descent: instead of taking a small step toward the minimum at each iteration, it jumps straight to the minimum of the quadratic model, provided the second-order Taylor expansion is a good approximation of the function.
But this method has one big drawback: to optimize the cost function, you must compute the Hessian matrix. Let $\theta = (\theta_1, \dots, \theta_N)$ be the vector of parameters; then:
$$H_{ij} = \frac{\partial^2 f(\theta)}{\partial \theta_i\, \partial \theta_j}, \qquad i, j = 1, \dots, N$$
As you can see, the Hessian is an $N \times N$ matrix of second derivatives, and computing it requires on the order of $N^2$ operations, which becomes critical for networks with a large number of parameters. In addition, to apply Newton's method we need the inverse Hessian $H^{-1}$, and for that the Hessian must be positive definite at every $\theta$.
Positive definite matrix
A matrix $A$ of size $N \times N$ is called positive semi-definite if it satisfies the condition $x^T A x \ge 0$ for all $x \in \mathbb{R}^N$. If the strict inequality $x^T A x > 0$ holds for all $x \ne 0$, the matrix is called positive definite. An important property of positive definite matrices is that they are non-singular, i.e. the inverse matrix $A^{-1}$ exists.
Hessian-free optimization
The main idea of HF optimization is to take Newton's method as a basis but use a more suitable way of minimizing the quadratic function. First, let us introduce the notation that will be needed later. Let $\theta = (W, b)$ be the network parameters, where $W$ are the weight matrices and $b$ the bias vectors; the network output is $y(x, \theta)$, where $x$ is an input vector. Let $L(y(x, \theta), t)$ be the loss function and $t$ the target value. The function we will minimize is the average of the losses over all examples in the training batch $S$:
$$f(\theta) = \frac{1}{|S|} \sum_{(x, t) \in S} L\big(y(x, \theta),\, t\big)$$
Further, following Newton's method, we define a quadratic function obtained by expanding $f$ around the current parameters $\theta$ into a second-order Taylor series:
$$q_\theta(\delta) = f(\theta) + \nabla f(\theta)^T \delta + \frac{1}{2} \delta^T B \delta \qquad (1)$$
where $B$ is the curvature matrix: the Hessian $H$, or a substitute for it discussed later.
Next, taking the derivative of the formula above with respect to $\delta$ and setting it to zero, we get:
$$B\,\delta = -\nabla f(\theta) \qquad (2)$$
To find $\delta$ we will use the conjugate gradient method.
Conjugate gradient method
The conjugate gradient method (CG) is an iterative method for solving systems of linear equations of the form $Ax = b$ with a symmetric positive definite matrix $A$.
Now, using the conjugate gradient method, we can solve equation (2) and find the $\delta$ that minimizes (1). In our case $A = B$ and $b = -\nabla f(\theta)$. Stopping the CG algorithm. The conjugate gradient method can be stopped based on different criteria. We will do it based on the relative progress in minimizing the quadratic function $q_\theta$:
$$s_i = \frac{q_\theta(\delta_i) - q_\theta(\delta_{i-k})}{q_\theta(\delta_i)} \qquad (3)$$
where $k$ is the size of the "window" over which progress is measured (the implementation below uses $k = \max(\text{gap},\, i/\text{gap})$ with a configurable gap). The stopping condition is $s_i < \varepsilon$ for some small tolerance ($10^{-4}$ in the code below). And here you can already see the main feature of HF optimization: we never need to form the curvature matrix itself, we only need the result of multiplying it by a vector.
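For reference, here is a minimal NumPy sketch of the conjugate gradient method for $Ax = b$ with the relative-progress stopping rule described above; the window size and tolerance are illustrative defaults, not the article's exact settings.

import numpy as np

def conjugate_gradient(A, b, x0, max_iters=250, gap=10, eps=1e-4):
    """Minimize q(x) = 0.5 * x^T A x - b^T x, i.e. solve A x = b."""
    x = x0.copy()
    r = b - A @ x                     # residual
    d = r.copy()                      # search direction
    q_track = [0.5 * x @ (A @ x) - b @ x]
    for i in range(max_iters):
        if r @ r < 1e-12:             # residual is tiny: system already solved
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
        q_track.append(0.5 * x @ (A @ x) - b @ x)
        # relative-progress stopping criterion (3)
        k = max(gap, i // gap)
        if i > k:
            progress = (q_track[-1] - q_track[-1 - k]) / q_track[-1]
            if not np.isnan(progress) and progress < eps:
                break
    return x

# Example: a small symmetric positive definite system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, np.zeros(2))   # approximately [0.0909, 0.6364]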
Multiplying the Hessian by a vector
As mentioned earlier, the beauty of this method is that we do not have to compute the Hessian directly; we only need the product of the matrix of second derivatives with a vector. One way to get it is to view $Hv$ as the directional derivative of the gradient in the direction $v$ and approximate it with finite differences:
$$Hv \approx \frac{\nabla f(\theta + \varepsilon v) - \nabla f(\theta)}{\varepsilon}$$
In practice, however, this formula causes numerical problems when $\varepsilon$ is sufficiently small. Fortunately, there is a way to compute the matrix-vector product exactly. We introduce the differential operator $R_v\{\cdot\}$, which denotes the derivative of a quantity depending on $\theta$ in the direction $v$:
$$R_v\{g(\theta)\} = \lim_{\varepsilon \to 0} \frac{g(\theta + \varepsilon v) - g(\theta)}{\varepsilon} = \frac{\partial g(\theta)}{\partial \theta}\, v$$
From this it follows that to compute the product of the Hessian by the vector $v$ we must calculate:
$$Hv = R_v\{\nabla f(\theta)\} = \nabla\big( \nabla f(\theta)^T v \big) \qquad (4)$$
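As a sanity check, here is a toy NumPy comparison of the finite-difference approximation with the exact product on a quadratic function whose Hessian is known in closed form; the function is made up for the example.

import numpy as np

H_true = np.array([[2.0, 0.5],
                   [0.5, 20.0]])

def grad_f(theta):
    # Gradient of f(theta) = 0.5 * theta^T H_true theta
    return H_true @ theta

theta = np.array([3.0, -2.0])
v = np.array([1.0, 1.0])
eps = 1e-6

Hv_fd = (grad_f(theta + eps * v) - grad_f(theta)) / eps   # finite-difference approximation
Hv_exact = H_true @ v                                      # exact product, what (4) computes
print(Hv_fd, Hv_exact)   # the two vectors should match up to ~1e-6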
Some improvements to HF optimization
1. The generalized Gauss-Newton matrix. Indefiniteness of the Hessian is a problem when optimizing non-convex functions: the quadratic model may have no lower bound and, as a consequence, no minimum to find. This problem can be addressed in several ways, for example by introducing a trust region that constrains the optimization, or by damping based on a penalty that adds a positive semi-definite component to the curvature matrix and makes it positive definite.
In practice, the best way to solve this problem turns out to be using the generalized Gauss-Newton matrix $G$ instead of the Hessian:
$$G = J^T H_L J \qquad (5)$$
where $J$ is the Jacobian of the network outputs with respect to the parameters and $H_L$ is the matrix of second derivatives of the loss function with respect to the network outputs. To find the product of this matrix with a vector $v$, $Gv = J^T H_L J v$, we first compute the product of the Jacobian with the vector:
$$Jv = R_v\{y(x, \theta)\}$$
then we multiply the result by the matrix $H_L$, and finally multiply the transposed Jacobian $J^T$ by the resulting vector.
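To make the order of operations concrete, here is a toy NumPy sketch with explicit (made-up) matrices; in the real implementation neither $J$ nor $H_L$ is ever formed, only their products with vectors.

import numpy as np

# Toy shapes: 3 network outputs, 4 parameters (matrices are invented for illustration)
J   = np.random.randn(3, 4)      # Jacobian of outputs w.r.t. parameters
H_L = np.diag([1.0, 2.0, 3.0])   # second derivatives of the loss w.r.t. outputs
v   = np.random.randn(4)

Jv  = J @ v        # step 1: Jacobian-vector product
HJv = H_L @ Jv     # step 2: multiply by the loss Hessian
Gv  = J.T @ HJv    # step 3: multiply by the transposed Jacobian

# Same result as forming G explicitly (which HF avoids doing):
assert np.allclose(Gv, (J.T @ H_L @ J) @ v)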
2. Damping. The standard Newton method can perform poorly on highly non-linear objective functions. The reason is that at the early stages of optimization the step $\delta$ can be very large and aggressive, since the starting point is far from the minimum. Damping methods deal with this either by modifying the quadratic function or by constraining the minimization so that the new $\delta$ stays within a region where the quadratic model remains a good approximation of $f$. Tikhonov regularization, or Tikhonov damping (not to be confused with the term "regularization" as commonly used in machine learning), is the best-known damping method; it adds a quadratic penalty to the function:
$$\hat{q}_\theta(\delta) = q_\theta(\delta) + \frac{\lambda}{2} \|\delta\|^2$$
where $\lambda \ge 0$ is the damping parameter. The damped curvature-vector product is then computed as follows:
$$\hat{B}v = (B + \lambda I)\,v = Bv + \lambda v \qquad (6)$$
3. The Levenberg-Marquardt heuristic. Tikhonov damping calls for dynamic adjustment of the parameter $\lambda$. We will change $\lambda$ according to the Levenberg-Marquardt rule, often used in the LM method (an optimization method that is an alternative to Newton's method). To apply the LM heuristic, we compute the so-called reduction ratio:
$$\rho = \frac{f(\theta_n + \delta_n) - f(\theta_n)}{q_{\theta_n}(\delta_n) - q_{\theta_n}(0)} \qquad (7)$$
where $n$ is the step number of the HF algorithm and $\delta_n$ is the result of the CG minimization. According to the Levenberg-Marquardt heuristic, we obtain the update rule for $\lambda$:
$$\lambda \leftarrow \begin{cases} \tfrac{3}{2}\,\lambda, & \rho < \tfrac{1}{4} \\ \lambda, & \tfrac{1}{4} \le \rho \le \tfrac{3}{4} \\ \tfrac{2}{3}\,\lambda, & \rho > \tfrac{3}{4} \end{cases} \qquad (8)$$
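The same rule in a few lines of Python (a hypothetical helper, mirroring the adjustment done in the minimize() method shown later):

def adjust_damping(damping, reduction_ratio):
    # Levenberg-Marquardt heuristic, rule (8)
    if reduction_ratio < 0.25:
        damping *= 1.5      # the quadratic model fits poorly: damp more
    elif reduction_ratio > 0.75:
        damping /= 1.5      # the model fits well: trust it more
    return damping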
4. Preconditioning the conjugate gradient algorithm. In the context of HF optimization, preconditioning means taking some invertible transformation matrix $C$ and changing variables $\delta = C^{-1} \hat{\delta}$, so that instead of $q_\theta(\delta)$ we minimize the transformed quadratic in $\hat{\delta}$. Inside the CG algorithm this amounts to additionally computing a transformed residual $y_i = M^{-1} r_i$, where $M = C^T C$.
Choosing the matrix $M$ is not quite a trivial task. In practice, a diagonal matrix (instead of a full-rank one) gives quite good results. One choice of $M$ is the empirical Fisher diagonal:
$$M = \left[ \mathrm{diag}\!\left( \sum_{i \in S} \nabla f_i(\theta) \odot \nabla f_i(\theta) \right) + \lambda I \right]^{\alpha} \qquad (9)$$
where $\nabla f_i$ is the gradient for the $i$-th training example, $\odot$ is the element-wise product, and $\alpha < 1$ is an exponent (the implementation below uses $\alpha = 3/4$ and applies $M^{-1}$ inside CG).
5. Initialization of the CG algorithm. A good practice is to initialize the starting point $\delta_0^{(n)}$ of the conjugate gradient algorithm with the value found at the previous step of the HF algorithm, possibly scaled by a decay constant: $\delta_0^{(n)} = \zeta\, \delta^{(n-1)}$. Note that the index $n$ refers to the step number of the HF algorithm, while the index 0 in $\delta_0^{(n)}$ refers to the initial step of the CG algorithm.
Putting it all together, one iteration of the HF algorithm is:
Find $\delta_n$ by solving the optimization problem (2) with CG or PCG.
Update the damping parameter $\lambda$ using the Levenberg-Marquardt heuristic (8).
Update the parameters: $\theta_{n+1} = \theta_n + \alpha\, \delta_n$, where $\alpha$ is the learning rate.
Thus, the Hessian-Free optimization method makes it possible to find the minimum of a function of very high dimensionality without ever forming the Hessian matrix explicitly.
Implementing HF optimization in TensorFlow
The theory is all well and good, but let's implement this optimization method in practice and see what happens. I wrote the HF algorithm in Python using the TensorFlow deep learning library. Then, as a sanity check, I trained a feedforward network with several layers on the XOR and MNIST datasets, using the HF method as the optimizer.
Implementation (building the TensorFlow computation graph) of the conjugate gradient method:
def __conjugate_gradient(self, gradients):
    """ Performs conjugate gradient method to minimize quadratic equation
    and find best delta of network parameters.
    gradients: list of Tensorflow tensor objects
        Network gradients.
    return: Tensorflow tensor object
        Update operation for delta.
    return: Tensorflow tensor object
        Residual norm, used to prevent numerical errors.
    return: Tensorflow tensor object
        Delta loss. """
    with tf.name_scope('conjugate_gradient'):
        cg_update_ops = []

        prec = None  # preconditioner, formula (9)
        if self.use_prec:
            if self.prec_loss is None:
                graph = tf.get_default_graph()
                lop = self.loss.op.node_def
                self.prec_loss = graph.get_tensor_by_name(lop.input[0] + ':0')

            batch_size = None
            if self.batch_size is None:
                self.prec_loss = tf.unstack(self.prec_loss)
                batch_size = self.prec_loss.get_shape()[0]
            else:
                self.prec_loss = [tf.gather(self.prec_loss, i)
                                  for i in range(self.batch_size)]
                batch_size = len(self.prec_loss)

            prec = [[g**2 for g in tf.gradients(tf.gather(self.prec_loss, i), self.W)]
                    for i in range(batch_size)]
            prec = [(sum(tensor) + self.damping)**(-0.75)
                    for tensor in np.transpose(np.array(prec))]

        # Product of the curvature matrix with the current delta: Ax
        Ax = None
        if self.use_gnm:
            Ax = self.__Gv(self.cg_delta)
        else:
            Ax = self.__Hv(gradients, self.cg_delta)

        b = [-grad for grad in gradients]
        bAx = [b - Ax for b, Ax in zip(b, Ax)]

        condition = tf.equal(self.cg_step, 0)
        r = [tf.cond(condition, lambda: tf.assign(r, bax), lambda: r)
             for r, bax in zip(self.residuals, bAx)]

        d = None
        if self.use_prec:
            d = [tf.cond(condition, lambda: tf.assign(d, p * r), lambda: d)
                 for p, d, r in zip(prec, self.directions, r)]
        else:
            d = [tf.cond(condition, lambda: tf.assign(d, r), lambda: d)
                 for d, r in zip(self.directions, r)]

        Ad = None
        if self.use_gnm:
            Ad = self.__Gv(d)
        else:
            Ad = self.__Hv(gradients, d)

        residual_norm = tf.reduce_sum([tf.reduce_sum(r**2) for r in r])

        alpha = tf.reduce_sum([tf.reduce_sum(d * ad) for d, ad in zip(d, Ad)])
        alpha = residual_norm / alpha

        if self.use_prec:
            beta = tf.reduce_sum([tf.reduce_sum(p * (r - alpha * ad)**2)
                                  for r, ad, p in zip(r, Ad, prec)])
        else:
            beta = tf.reduce_sum([tf.reduce_sum((r - alpha * ad)**2)
                                  for r, ad in zip(r, Ad)])

        self.beta = beta
        beta = beta / residual_norm

        for i, delta in reversed(list(enumerate(self.cg_delta))):
            update_delta = tf.assign(delta, delta + alpha * d[i],
                                     name='update_delta')
            update_residual = tf.assign(self.residuals[i], r[i] - alpha * Ad[i],
                                        name='update_residual')
            p = 1.0
            if self.use_prec:
                p = prec[i]
            update_direction = tf.assign(self.directions[i],
                                         p * (r[i] - alpha * Ad[i]) + beta * d[i],
                                         name='update_direction')
            cg_update_ops.append(update_delta)
            cg_update_ops.append(update_residual)
            cg_update_ops.append(update_direction)

        with tf.control_dependencies(cg_update_ops):
            cg_update_ops.append(tf.assign_add(self.cg_step, 1))

        cg_op = tf.group(*cg_update_ops)

        dl = tf.reduce_sum([tf.reduce_sum(0.5 * (delta * ax) + grad * delta)
                            for delta, grad, ax in zip(self.cg_delta, gradients, Ax)])

    return cg_op, residual_norm, dl
The code that computes the matrix used for preconditioning is shown below. Since TensorFlow sums the gradients over the entire batch of training examples, I had to resort to some contortions to obtain the gradient separately for each example, which affected the numerical stability of the solution. So preconditioning should be used, as they say, at your own risk.
prec = [[g**2 for g in tf.gradients(tf.gather(self.prec_loss, i), self.W)]
        for i in range(batch_size)]
The calculation of the product of the Hessian with a vector (4). It also applies Tikhonov damping (6).
def __Hv(self, grads, vec):
    """ Computes Hessian vector product.
    grads: list of Tensorflow tensor objects
        Network gradients.
    vec: list of Tensorflow tensor objects
        Vector that is multiplied by the Hessian.
    return: list of Tensorflow tensor objects
        Result of multiplying Hessian by vec. """
    grad_v = [tf.reduce_sum(g * v) for g, v in zip(grads, vec)]
    Hv = tf.gradients(grad_v, self.W, stop_gradients=vec)
    Hv = [hv + self.damp_pl * v for hv, v in zip(Hv, vec)]
    return Hv
When I wanted to use the generalized Gauss-Newton matrix (5), I ran into a small problem: TensorFlow cannot compute a Jacobian-vector product the way the Theano deep learning framework does (Theano has an Rop function designed specifically for this). I had to implement an analogous operation in TensorFlow myself.
def __Rop(self, f, x, vec):
    """ Computes Jacobian vector product.
    f: Tensorflow tensor object
        Objective function.
    x: list of Tensorflow tensor objects
        Parameters with respect to which computes Jacobian matrix.
    vec: list of Tensorflow tensor objects
        Vector that is multiplied by the Jacobian.
    return: list of Tensorflow tensor objects
        Result of multiplying Jacobian (df/dx) by vec. """
    r = None
    if self.batch_size is None:
        try:
            r = [tf.reduce_sum([tf.reduce_sum(v * tf.gradients(f, x)[i])
                                for i, v in enumerate(vec)])
                 for f in tf.unstack(f)]
        except ValueError:
            assert False, clr.FAIL + clr.BOLD + 'Batch size is None, but used '\
                'dynamic shape for network input, set proper batch_size in '\
                'HFOptimizer initialization' + clr.ENDC
    else:
        r = [tf.reduce_sum([tf.reduce_sum(v * tf.gradients(tf.gather(f, i), x)[j])
                            for j, v in enumerate(vec)])
             for i in range(self.batch_size)]
    assert r is not None, clr.FAIL + clr.BOLD +\
        'Something went wrong in Rop computation' + clr.ENDC
    return r
And then implement the product of the generalized Gauss-Newton matrix with a vector:
def __Gv(self, vec):
    """ Computes the product G by vec = JHJv (G is the Gauss-Newton matrix).
    vec: list of Tensorflow tensor objects
        Vector that is multiplied by the Gauss-Newton matrix.
    return: list of Tensorflow tensor objects
        Result of multiplying Gauss-Newton matrix by vec. """
    Jv = self.__Rop(self.output, self.W, vec)
    Jv = tf.reshape(tf.stack(Jv), [-1, 1])
    HJv = tf.gradients(tf.matmul(tf.transpose(tf.gradients(self.loss, self.output)[0]),
                                 Jv), self.output, stop_gradients=Jv)[0]
    JHJv = tf.gradients(tf.matmul(tf.transpose(HJv), self.output), self.W,
                        stop_gradients=HJv)
    JHJv = [gv + self.damp_pl * v for gv, v in zip(JHJv, vec)]
    return JHJv
The main training function is presented below. First, the quadratic function is minimized with CG/PCG, then the actual update of the network weights takes place. The damping parameter is also adjusted according to the Levenberg-Marquardt heuristic.
def minimize(self, feed_dict, debug_print=False):
    """ Performs main training operations.
    feed_dict: dictionary
        Input training batch.
    debug_print: bool
        If True prints CG iteration number. """
    self.sess.run(tf.assign(self.cg_step, 0))
    feed_dict.update({self.damp_pl: self.damping})

    if self.adjust_damping:
        loss_before_cg = self.sess.run(self.loss, feed_dict)

    dl_track = [self.sess.run(self.ops['dl'], feed_dict)]
    self.sess.run(self.ops['set_delta_0'])

    for i in range(self.cg_max_iters):
        if debug_print:
            d_info = clr.OKGREEN + '\r[CG iteration: {}]'.format(i) + clr.ENDC
            sys.stdout.write(d_info)
            sys.stdout.flush()

        k = max(self.gap, i // self.gap)

        rn = self.sess.run(self.ops['res_norm'], feed_dict)
        # Stop if the residual norm becomes too small (numerical safety check)
        if rn < self.cg_num_err:
            break

        self.sess.run(self.ops['cg_update'], feed_dict)
        dl_track.append(self.sess.run(self.ops['dl'], feed_dict))

        # Relative-progress stopping criterion (3)
        if i > k:
            stop = (dl_track[i+1] - dl_track[i+1-k]) / dl_track[i+1]
            if not np.isnan(stop) and stop < 1e-4:
                break

    if debug_print:
        sys.stdout.write('\n')
        sys.stdout.flush()

    if self.adjust_damping:
        feed_dict.update({self.damp_pl: 0})
        dl = self.sess.run(self.ops['dl'], feed_dict)
        feed_dict.update({self.damp_pl: self.damping})

    self.sess.run(self.ops['train'], feed_dict)

    if self.adjust_damping:
        loss_after_cg = self.sess.run(self.loss, feed_dict)
        # Reduction ratio (7)
        reduction_ratio = (loss_after_cg - loss_before_cg) / dl

        # Levenberg-Marquardt heuristic (8)
        if reduction_ratio < 0.25 and self.damping > self.damp_num_err:
            self.damping *= 1.5
        elif reduction_ratio > 0.75 and self.damping > self.damp_num_err:
            self.damping /= 1.5
HF Optimization Testing
To test the HF optimizer we have written, we will use a simple example with the XOR dataset and a more complex one with the MNIST dataset. To inspect the training results and visualize some information, we use TensorBoard. I would also note that the resulting TensorFlow computation graph is rather complicated.
The TensorFlow computation graph.
Network architecture and training on XOR. We create a simple network with 2 input neurons, 2 hidden neurons and 1 output neuron, using the sigmoid as the activation function and log-loss as the loss function. A sketch of how such a graph might be defined is given below, followed by the actual training code.
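The training snippet below uses the names x, y, out and y_out without showing their definitions, so here is only an assumed sketch of how the graph might be built; the variable names, shapes and initialization are my guesses chosen to match the training code, not the author's original definitions.

import tensorflow as tf

# Assumed graph definition for the 2-2-1 XOR network
x = tf.placeholder(tf.float64, shape=[4, 2], name='input')
y = tf.placeholder(tf.float64, shape=[4, 1], name='target')

W1 = tf.Variable(tf.random_uniform([2, 2], -1, 1, dtype=tf.float64), name='W1')
b1 = tf.Variable(tf.zeros([2], dtype=tf.float64), name='b1')
W2 = tf.Variable(tf.random_uniform([2, 1], -1, 1, dtype=tf.float64), name='W2')
b2 = tf.Variable(tf.zeros([1], dtype=tf.float64), name='b2')

hidden = tf.sigmoid(tf.matmul(x, W1) + b1)
y_out = tf.matmul(hidden, W2) + b2       # pre-activation output passed to the optimizer
out = tf.sigmoid(y_out, name='output')   # prediction used in the log-loss below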
""" Log-loss cost function """
loss = tf.reduce_mean(((y * tf.log(out)) + ((1 - y) * tf.log(1.0 - out))) * -1,
                      name='log_loss')

# XOR training data
XOR_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
XOR_Y = [[0], [1], [1], [0]]

# Create the session and the HF optimizer
sess = tf.Session()
hf_optimizer = HFOptimizer(sess, loss, y_out, dtype=tf.float64,
                           use_gauss_newton_matrix=True)
init = tf.initialize_all_variables()
sess.run(init)

# Training loop
max_epoches = 100
print('Begin Training')
for i in range(max_epoches):
    feed_dict = {x: XOR_X, y: XOR_Y}
    hf_optimizer.minimize(feed_dict=feed_dict)
    if i % 10 == 0:
        print('Epoch:', i, 'cost:', sess.run(loss, feed_dict=feed_dict))
        print('Hypothesis ', sess.run(out, feed_dict=feed_dict))
Now let's compare the training results for HF optimization with the Hessian matrix, HF optimization with the Gauss-Newton matrix, and ordinary gradient descent with a learning rate of 0.01. The number of iterations is 100.
Loss for gradient descent (red line), for HF optimization with the Hessian matrix (orange line), and for HF optimization with the Gauss-Newton matrix (blue line).
It is clear that HF optimization with the Gauss-Newton matrix converges fastest, while 100 iterations turned out to be far too few for gradient descent. For the loss under gradient descent to become comparable to that of HF optimization, about 100,000 iterations were needed.
Loss for gradient descent, 100,000 iterations.
Network architecture and training on the MNIST dataset. To recognize handwritten digits, we create a network with 784 input neurons, 300 hidden neurons and 10 outputs. Cross-entropy is used as the loss function, and the mini-batch size during training is 50. A hedged sketch of such a network is given below.
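As with XOR, the exact graph definition is not shown here, so the following is only an assumed sketch of a 784-300-10 network with a softmax cross-entropy loss; the layer names, initialization and placeholder shapes are illustrative, not the author's original code.

import tensorflow as tf

x = tf.placeholder(tf.float64, shape=[None, 784], name='input')
t = tf.placeholder(tf.float64, shape=[None, 10], name='target')

W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.01, dtype=tf.float64))
b1 = tf.Variable(tf.zeros([300], dtype=tf.float64))
W2 = tf.Variable(tf.random_normal([300, 10], stddev=0.01, dtype=tf.float64))
b2 = tf.Variable(tf.zeros([10], dtype=tf.float64))

hidden = tf.sigmoid(tf.matmul(x, W1) + b1)
y_out = tf.matmul(hidden, W2) + b2   # logits passed to the optimizer
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=t, logits=y_out))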
Just as with XOR, we compare HF optimization with the Hessian matrix, HF optimization with the Gauss-Newton matrix, and ordinary gradient descent with a learning rate of 0.01. The number of iterations is 200; with a mini-batch size of 50, 200 iterations do not even make up a full epoch (not all examples from the training set are used). I did this to run the tests faster, but even so the general trend is visible.
The figure on the left shows accuracy on the test set, the figure on the right accuracy on the training set. Accuracy for gradient descent (red line), for HF optimization with the Hessian matrix (orange line), and for HF optimization with the Gauss-Newton matrix (blue line).
Loss for gradient descent (red line), for HF optimization with the Hessian matrix (orange line), and for HF optimization with the Gauss-Newton matrix (blue line).
As the figures above show, HF optimization with the Hessian matrix is not very stable, but over several epochs of training it still converges. The best result is shown by HF optimization with the Gauss-Newton matrix.
One complete epoch of training. The figure on the left shows accuracy on the test set, the figure on the right accuracy on the training set: gradient descent (turquoise line) and HF optimization with the Gauss-Newton matrix (pink line).
One complete epoch of training. Loss for gradient descent (turquoise line) and for HF optimization with the Gauss-Newton matrix (pink line).
When using the preconditioned conjugate gradient method (PCG), the computations themselves became slower and converged no faster than with plain CG.
Loss for HF optimization using the PCG algorithm.
From all these graphs we can conclude that HF optimization with the Gauss-Newton matrix and the standard conjugate gradient method showed the best result. The full code is available on GitHub.
Results
In the end, an implementation of the HF algorithm was created in Python using the TensorFlow library. Along the way I ran into problems implementing the key features of the algorithm, namely support for the Gauss-Newton matrix and preconditioning. This is because TensorFlow is not as flexible a library as one would like and is not really aimed at research; for experimental purposes Theano is still a better fit, as it gives more freedom. But I had decided from the start to do all of this in TensorFlow. Testing the program showed that the best results are obtained by the HF algorithm with the Gauss-Newton matrix. Preconditioning the conjugate gradient algorithm gave numerically unstable results and slowed the computations down, but it seems to me that this is due to TensorFlow (implementing preconditioning required significant workarounds).
Sources
In this article the theoretical aspects of Hessian-Free optimization are described only briefly, enough to grasp the essence of the algorithm. If a more detailed treatment is needed, below are the sources from which I took the basic theory and on which the Python implementation of the HF method is based.