
The Levenberg-Marquardt algorithm for the nonlinear least squares method and its implementation in Python



Finding an extremum (minimum or maximum) of an objective function is an important task in mathematics and its applications (in particular, machine learning has the curve-fitting problem). Surely everyone has heard of the steepest descent method and of Newton's method. Unfortunately, these methods have a number of significant drawbacks: in particular, steepest descent can take a very long time to converge near the end of the optimization, while Newton's method requires computing second derivatives, which is expensive.



To eliminate these shortcomings, as is often the case, you need to dive deeper into the subject area and add restrictions on the input data. In particular, steepest descent and Newton's method deal with arbitrary functions. In statistics and machine learning one often has to deal with the least squares method (OLS). This method minimizes the sum of squared errors, i.e. the objective function has the form



\frac{1}{2}\sum\limits_{i=1}^{N}(y_i'-y_i)^2 = \frac{1}{2}\sum\limits_{i=1}^{N} r_i^2 \tag{1}
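To make (1) concrete: given observations and model predictions, the residuals and the objective can be computed directly (a tiny illustration with made-up numbers, not data from the article):

import numpy as np

# Objective (1) for made-up observations y and model predictions y_pred.
y      = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
r = y_pred - y                      # residuals r_i
objective = 0.5 * np.sum(r ** 2)    # 0.5 * (0.01 + 0.01 + 0.04) = 0.03
print(objective)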


The Levenberg-Marquardt algorithm is a method for solving nonlinear least squares problems. This article walks through the methods it builds on, the algorithm itself, and its implementation in Python.




The code uses additional libraries, such as numpy and matplotlib. If you don't have them, I highly recommend installing them from the Anaconda Python distribution.



Dependencies
The Levenberg-Marquardt algorithm relies on the methods outlined in the flowchart.




Therefore we first need to study them, and that is what we will do.

Definitions



Function selection



In mathematical optimization there are standard functions on which new methods are tested.
One such function is the Rosenbrock function. In the case of a function of two variables it is defined as



f(x, y) = (a - x)^2 + b(y - x^2)^2


I took a = 0.5, b = 0.5, i.e. the function has the form:



f(x, y) = \frac{1}{2}(1 - x)^2 + \frac{1}{2}(y - x^2)^2


We will consider the behavior of the function on the interval -2 \le x, y \le 2




This function is non-negative and has a minimum z = 0 at the point (x = 1, y = 1).
In code it is convenient to encapsulate all the data about a function in a single class and then take whichever function class is required. The result depends on the starting point of the optimization; we choose it as (x = -2, y = -2). As can be seen from the graph, at this point the function takes its largest value on the interval.



functions.py


import numpy as np


class Rosenbrock:
    initialPoint = (-2, -2)
    camera = (41, 75)
    interval = [(-2, 2), (-2, 2)]

    """The function itself"""
    @staticmethod
    def function(x):
        return 0.5*(1-x[0])**2 + 0.5*(x[1]-x[0]**2)**2

    """The vector of residuals r"""
    @staticmethod
    def function_array(x):
        return np.array([1 - x[0], x[1] - x[0] ** 2]).reshape((2, 1))

    @staticmethod
    def gradient(x):
        return np.array([-(1-x[0]) - (x[1]-x[0]**2)*2*x[0], (x[1] - x[0]**2)])

    @staticmethod
    def hesse(x):
        return np.array(((1 - 2*x[1] + 6*x[0]**2, -2*x[0]), (-2*x[0], 1)))

    @staticmethod
    def jacobi(x):
        return np.array([[-1, 0], [-2*x[0], 1]])

    """Vectorized computation of function values for plotting:
    http://www.mathworks.com/help/matlab/matlab_prog/vectorization.html"""
    @staticmethod
    def getZMeshGrid(X, Y):
        return 0.5*(1-X)**2 + 0.5*(Y - X**2)**2
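A quick sanity check of the class (a small sketch of mine; it assumes Rosenbrock is defined in, or imported from, functions.py):

import numpy as np

# Evaluate the Rosenbrock class at the known minimum and at the starting point.
r = Rosenbrock
print(r.function(np.array([1.0, 1.0])))    # 0.0   -> the known minimum
print(r.function(np.array([-2.0, -2.0])))  # 22.5  -> value at the starting point
print(r.gradient(np.array([1.0, 1.0])))    # [0. 0.] -> the gradient vanishes at the minimum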

The steepest descent method



The method itself is extremely simple. We take F(x) = f(x), i.e. the objective function coincides with the given one.
We need to find d_{ns}(x), the direction of steepest descent of the function f at the point x.
f(x) can be linearly approximated at x:



f(x + d) \approx f(x) + \nabla f(x)^T d, \quad d \in R^n, \ ||d|| \to 0 \tag{2}


\lim_{d \to 0} f(x) - f(x + d) = -\nabla f(x)^T d \stackrel{(3.a)}{=} -||\nabla f(x)^T|| \, ||d|| \cos\theta \tag{3}


where \theta is the angle between the vectors d and \nabla f(x)^T. Equality (3.a) follows from the definition of the dot product.



Since we are minimizing f(x), the larger the difference (3), the better. The expression is maximal at \theta = \pi (then \cos\theta = -1, and the norm of a vector is always non-negative), and \theta = \pi holds only when the vectors d and \nabla f(x)^T are opposite; therefore



d_{ns} = -\nabla f(x)^T


We have the right direction, but a step of length ||d_{ns}|| can take us the wrong way. Make the step smaller:



d_{ns} = -\alpha \nabla f(x)^T, \quad 0 < \alpha < 1


In theory, the smaller the step, the better; but then the convergence rate suffers. The recommended value is \alpha = 0.05.



In code it looks like this. First, the base optimizer class. We pass in everything that will be needed later (the Hessian and Jacobian matrices are not needed now, but will be needed for the other methods):


from abc import ABCMeta, abstractmethod


class Optimizer:
    def __init__(self, function, initialPoint, gradient=None, jacobi=None, hesse=None,
                 interval=None, epsilon=1e-7, function_array=None, metaclass=ABCMeta):
        self.function_array = function_array
        self.epsilon = epsilon
        self.interval = interval
        self.function = function
        self.gradient = gradient
        self.hesse = hesse
        self.jacobi = jacobi
        self.name = self.__class__.__name__.replace('Optimizer', '')
        self.x = initialPoint
        self.y = self.function(initialPoint)

    """Computes the next point of the optimization"""
    @abstractmethod
    def next_point(self):
        pass

    """Moves to the next point"""
    def move_next(self, nextX):
        nextY = self.function(nextX)
        self.y = nextY
        self.x = nextX
        return self.x, self.y


Optimizer code itself:
class SteepestDescentOptimizer(Optimizer):
    ...

    def next_point(self):
        # learningRate is assumed to be set in the elided __init__ ("...")
        nextX = self.x - self.learningRate * self.gradient(self.x)
        return self.move_next(nextX)
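The article does not show the loop that drives an optimizer and produces the tables below; a minimal sketch of such a driver (the run name, the tolerance, and the iteration budget are my own choices, and the actual sources on GitHub may differ) could be:

# Hypothetical driver: repeatedly ask the optimizer for the next point until the
# objective value is close to zero or the iteration budget runs out.
# Assumes an optimizer with next_point(), .x and .y, as in the Optimizer class above.
def run(optimizer, max_iterations=1000, tolerance=1e-6):
    history = [(0, optimizer.x, optimizer.y)]
    for i in range(1, max_iterations + 1):
        x, y = optimizer.next_point()
        history.append((i, x, y))
        if y < tolerance:  # the Rosenbrock function is non-negative, so this is a sensible stop
            break
    return history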


Optimization result:




Iteration | X | Y | Z
25 | 0.383 | -0.409 | 0.334
75 | 0.693 | 0.32 | 0.058
532 | 0.996 | 0.990 | 10^{-6}

It is striking how quickly the optimization proceeded in iterations 0-25, how it slowed down in iterations 25-75, and that at the end it took another 457 iterations to get close to zero. This behavior is very characteristic of steepest descent: a very good convergence rate at the beginning and a poor one at the end.



Newton's method



Newton's method itself searches for a root of an equation, i.e. an x such that f(x) = 0. That is not quite what we need, because a function does not necessarily have its extremum at a zero.



And then there is Newton's method for optimization. When people talk about Newton's method in the context of optimization, this is the one they mean. Back at university I foolishly confused the two and could not understand the phrase "Newton's method has a drawback: the need to compute second derivatives."



Consider f(x): R \to R.
We take F(x) = f(x), i.e. the objective function coincides with the given one.



Expand f(x) in a Taylor series; unlike in steepest descent, we need a quadratic approximation:



f(x + d) \approx f(x) + f^{'}(x) d + \frac{1}{2} f^{''}(x) d^2, \quad d \in R, \ ||d|| \to 0 \tag{4}


It is easy to show that if f^{'}(x) \ne 0, the function cannot have an extremum at x. A point x^* at which f^{'}(x^*) = 0 is called stationary.



Differentiate both sides with respect to d. Since our goal is f^{'}(x + d) = 0, we solve the equation:



0 = f^{'}(x + d) = f^{'}(x) + f^{''}(x) d \\ d_{n} = -\frac{f^{'}(x)}{f^{''}(x)}


d_n is the direction toward the extremum, but that extremum can be either a maximum or a minimum. To find out whether the point x + d_n is a minimum, we analyze the second derivative: if f^{''}(x) > 0, then f(x + d_n) is a local minimum; if f^{''}(x) < 0, a maximum.
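To make the one-dimensional case concrete, here is a minimal sketch (the test function f(x) = (x - 3)^2 + 1, its derivatives, and the iteration count are my own example, not part of the article):

# One-dimensional Newton's method for optimization: x <- x - f'(x) / f''(x).
# Example: f(x) = (x - 3)**2 + 1, with a minimum at x = 3.
def newton_1d(df, d2f, x, iterations=10):
    for _ in range(iterations):
        x = x - df(x) / d2f(x)
    return x

df = lambda x: 2 * (x - 3)   # f'(x)
d2f = lambda x: 2.0          # f''(x) > 0, so the stationary point is a minimum
print(newton_1d(df, d2f, x=0.0))  # 3.0 -- a quadratic is minimized in a single step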



In the multidimensional case the first derivative is replaced by the gradient and the second by the Hessian matrix. Matrices cannot be divided; instead we multiply by the inverse (keeping track of the side, since matrix multiplication is not commutative):



f(x): R^n \to R \\ H(x) d_{n} = -\nabla f(x) \\ d_{n} = -H^{-1}(x) \nabla f(x)


As in the one-dimensional case, we need to check that we are going the right way. If the Hessian matrix is positive definite, the direction is correct; otherwise we fall back to steepest descent.



In the code:


def is_pos_def(x):
    return np.all(np.linalg.eigvals(x) > 0)


class NewtonOptimizer(Optimizer):
    def next_point(self):
        hesse = self.hesse(self.x)
        # if the Hessian matrix is positive definite - OK, otherwise we are going
        # in the wrong direction, so fall back to gradient descent
        if is_pos_def(hesse):
            hesseInverse = np.linalg.inv(hesse)
            nextX = self.x - self.learningRate * np.dot(hesseInverse, self.gradient(self.x))
        else:
            nextX = self.x - self.learningRate * self.gradient(self.x)

        return self.move_next(nextX)


Result:




Iteration | X | Y | Z
25 | -1.49 | 0.63 | 4.36
75 | 0.31 | -0.04 | 0.244
179 | 0.995 | 0.991 | 10^{-6}

Compare this with steepest descent. There, the descent was very steep up to iteration 25 (we practically fell off the mountain), but then convergence slowed down a lot. With Newton's method it is the other way around: at first we descend slowly, but then move faster. Steepest descent spent iterations 25 through 532 getting from z = 0.334 to near zero, while Newton's method optimized away 4.36 over its last 154 iterations.



This is typical behavior: Newton's method has a quadratic convergence rate if you start from a point close to a local extremum, while steepest descent works well far from the extremum.



Newton's method uses curvature information, as was seen in the figure above (the smooth descent down the slope).
Another example demonstrating this idea: in the figure below, the red vector is the steepest descent direction and the green vector is the Newton direction.




Nonlinear vs. linear least squares



In the least squares method we have a model y = f(\beta_1, .., \beta_n; x) with n parameters, which are tuned so as to minimize



\frac{1}{2}\sum\limits_{i=1}^{N}(y_i'-y_i)^2 = \frac{1}{2}\sum\limits_{i=1}^{N} r_i^2


where y_i' is the i-th observation.



In linear least squares we have m equations, each of which can be represented as a linear equation



x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{in}\beta_n = y_i


For linear least squares the solution is unique. There are powerful methods, such as QR decomposition and SVD decomposition, that can find the linear least-squares solution in one step as an approximate solution of the matrix equation Ax = b.
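For illustration, a linear least-squares problem Ax = b can be solved in one call with NumPy's lstsq (which relies on an SVD-based routine); the data below are made up for the example:

import numpy as np

# Fit a straight line y = b0 + b1 * t to a few noisy points with linear least squares.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
A = np.column_stack([np.ones_like(t), t])          # design matrix of the linear model
beta, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(beta)                                        # approximately [1.09, 1.94]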



In nonlinear least squares a parameter \beta_i may itself appear under a function, for example as \beta_i^2. There may also be products of parameters, for example

\beta_1 \beta_2

and so on.
Here we have to find a solution iteratively, and the solution depends on the choice of the starting point.



The methods below deal with the nonlinear case. But first, let us look at nonlinear least squares in the context of our task of minimizing the function



f: R^2 \to R \\ F(x_1, x_2) = f(x_1, x_2) = \frac{1}{2}(1 - x_1)^2 + \frac{1}{2}(x_2 - x_1^2)^2 = \frac{1}{2} r_1^2(x_1, x_2) + \frac{1}{2} r_2^2(x_1, x_2)


Look familiar? This is exactly the least squares form! Let us introduce a vector function r:



r: R^2 \to R^2 \\ r = \left[\begin{matrix} 1 - x_1 \\ x_2 - x_1^2 \end{matrix}\right]


and we will choose x_1, x_2 so as to solve the system of equations (at least approximately):



\begin{cases} r_1 = 1 - x_1 = 0 \\ r_2 = x_2 - x_1^2 = 0 \end{cases}


Then we need a measure of how good our approximation is. Here it is:



F(x) = \frac{1}{2}\sum\limits_i^m r_i^2(x) = \frac{1}{2} r^T r = \frac{1}{2} ||r||^2 \tag{5}


Here I went in the reverse direction: I fitted the vector function r to the target F. But it also works the other way around: given a vector function r: R^n \to R^m, build F(x) from (5). For example:



r = \left[\begin{matrix} x_1^2 \\ x_2^2 \end{matrix}\right], \quad F(x) = \frac{1}{2} x_1^2 + \frac{1}{2} x_2^2


Finally, one very important point: the condition m \ge n must hold, otherwise the method cannot be used. In our case the condition is satisfied.
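As a small sketch of the reverse direction described above — building F(x) from a given vector function r via formula (5) — using the example r from the text:

import numpy as np

# Build F(x) = 0.5 * ||r(x)||^2 from a vector function r, as in formula (5).
def make_objective(r):
    def F(x):
        rx = r(x)
        return 0.5 * float(np.dot(rx, rx))
    return F

r = lambda x: np.array([x[0] ** 2, x[1] ** 2])  # the example r from the text
F = make_objective(r)
print(F(np.array([1.0, 2.0])))                  # 0.5 * (1 + 16) = 8.5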



Gauss-Newton Method



The method is based on the same linear approximation, only now we are dealing with two functions:



r(x + d) \approx l(d) \equiv r(x) + J(x) d \\ F(x + d) \approx L(d) \equiv \frac{1}{2} l^T(d) l(d)


Then we do the same as in Newton's method: we solve the equation (but for L(d)):



L^{''} d_{n} = -L^{'}


It is easy to show that for d \to 0:



L^{''}(d) = J_r^T J_r, \quad L^{'}(d) = J_r^T r \\ (J^T J) d_{n} = -J^T r \\ d_{n} = -(J^T J)^{-1} J^T r


Optimizer code:


class NewtonGaussOptimizer(Optimizer):
    def next_point(self):
        # Solve (J^T * J) d_ng = -J^T * r
        jacobi = self.jacobi(self.x)
        jacobisLeft = np.dot(jacobi.T, jacobi)
        jacobiLeftInverse = np.linalg.inv(jacobisLeft)
        jjj = np.dot(jacobiLeftInverse, jacobi.T)  # (J^T * J)^-1 * J^T
        nextX = self.x - self.learningRate * np.dot(jjj, self.function_array(self.x)).reshape((-1))
        return self.move_next(nextX)
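Forming the explicit inverse of J^T J is fine for this tiny problem; as a side note (a sketch of mine, not part of the original code), the same direction can be obtained with np.linalg.solve, which avoids the explicit inverse:

import numpy as np

# Hypothetical alternative for the inner step: solve (J^T J) d = -J^T r directly.
def gauss_newton_direction(J, r):
    return np.linalg.solve(J.T @ J, -J.T @ r.reshape(-1))

# e.g. inside next_point():
#   d = gauss_newton_direction(jacobi, self.function_array(self.x))
#   nextX = self.x + self.learningRate * d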


The result exceeded my expectations: in just 3 iterations we arrived at the point (x = 1, y = 1). To demonstrate the trajectory of the movement, I reduced learningRate to 0.2.




The Levenberg-Marquardt algorithm



It is based on one of the versions of the Gauss-Newton method (the "damped version"):



(J^T J + \mu I) d_{lm} = -J^T r, \quad \mu \ge 0


\mu is called the regularization parameter. Sometimes I is replaced by diag(J^T J) to improve convergence.
The diagonal elements of J^T J are positive, because the element a_{ii} of the matrix J^T J is the dot product of the i-th row vector of J^T with itself.



For large \mu the method behaves like steepest descent, for small \mu like Newton's method.
During optimization the algorithm itself selects the appropriate \mu based on the gain ratio, defined as:



g = \frac{F(x) - F(x_{new})}{L(0) - L(d_{lm})}


If g > 0, then L(d) is a good approximation of F(x + d); otherwise \mu needs to be increased.
The initial value of \mu is set as \tau \cdot \max\{a_{ij}\}, where a_{ij} are the elements of the matrix J^T J.
\tau is recommended to be set to 10^{-3}. The stopping criterion is reaching the global minimum, i.e. F^{'}(x^*) = g(x^*) = 0.




In optimizers, I did not implement the stopping criterion — the user is responsible for this. I only needed to move to the next point.


class LevenbergMarquardtOptimizer(Optimizer):
    def __init__(self, function, initialPoint, gradient=None, jacobi=None, hessian=None,
                 interval=None, function_array=None, learningRate=1):
        self.learningRate = learningRate
        functionNew = lambda x: np.array([function(x)])
        super().__init__(functionNew, initialPoint, gradient, jacobi, hessian, interval,
                         function_array=function_array)
        self.v = 2
        self.alpha = 1e-3
        self.m = self.alpha * np.max(self.getA(jacobi(initialPoint)))

    def getA(self, jacobi):
        return np.dot(jacobi.T, jacobi)

    def getF(self, d):
        function = self.function_array(d)
        return 0.5 * np.dot(function.T, function)

    def next_point(self):
        if self.y == 0:  # finished. Y can't be less than zero
            return self.x, self.y

        jacobi = self.jacobi(self.x)
        A = self.getA(jacobi)
        g = np.dot(jacobi.T, self.function_array(self.x)).reshape((-1, 1))
        leftPartInverse = np.linalg.inv(A + self.m * np.eye(A.shape[0], A.shape[1]))
        d_lm = -np.dot(leftPartInverse, g)  # moving direction
        x_new = self.x + self.learningRate * d_lm.reshape((-1))  # line search
        gain_numerator = self.getF(self.x) - self.getF(x_new)
        gain_divisor = 0.5 * np.dot(d_lm.T, self.m * d_lm - g) + 1e-10
        gain = gain_numerator / gain_divisor
        if gain > 0:  # it's a good function approximation.
            self.move_next(x_new)  # ok, the step is acceptable
            self.m = self.m * max(1 / 3, 1 - (2 * gain - 1) ** 3)
            self.v = 2
        else:
            self.m *= self.v
            self.v *= 2

        return self.x, self.y
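Wiring it up with the Rosenbrock class from functions.py, a usage sketch might look like this (the fixed iteration count is an arbitrary choice of mine; the original sources may drive the optimizer differently):

# Hypothetical usage: optimize the Rosenbrock function with the Levenberg-Marquardt optimizer.
r = Rosenbrock
lm = LevenbergMarquardtOptimizer(function=r.function,
                                 initialPoint=r.initialPoint,
                                 gradient=r.gradient,
                                 jacobi=r.jacobi,
                                 function_array=r.function_array)
for i in range(15):
    x, y = lm.next_point()
    print(i + 1, x, y)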


The result was good too:


Iteration | X | Y | Z
0 | -2 | -2 | 22.5
4 | 0.999 | 0.998 | 10^{-7}
11 | 1 | 1 | 0

With learningRate = 0.2:




Method comparison

Method | Objective function | Advantages | Disadvantages | Convergence
Steepest descent method | differentiable | wide range of applications; simple implementation; low cost per iteration | the global minimum is found worse than with the other methods; low convergence rate near the extremum | local
Newton's method | twice differentiable | high convergence rate near the extremum; uses curvature information | the function must be twice differentiable; fails if the Hessian matrix is degenerate (has no inverse); there is a chance of going the wrong way far from the extremum | local
Gauss-Newton method | nonlinear least squares | very high convergence rate; works well for curve-fitting | the columns of the matrix J must be linearly independent; imposes restrictions on the form of the objective function | local
Levenberg-Marquardt algorithm | nonlinear least squares | the most robust of the considered methods; the best chance of finding a global extremum; very high (adaptive) convergence rate; works well for curve-fitting | the columns of the matrix J must be linearly independent; imposes restrictions on the form of the objective function; more complicated implementation | local

Despite the good results on this particular example, none of the methods considered guarantee global convergence (finding a global minimum is an extremely difficult task). An example of the few methods that can achieve it is the basin-hopping algorithm.
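For completeness, SciPy ships a basin-hopping implementation (SciPy is an extra dependency not listed at the beginning of the article); a minimal sketch on the same function:

import numpy as np
from scipy.optimize import basinhopping

# Global optimization of the same Rosenbrock-type function with basin-hopping.
f = lambda x: 0.5 * (1 - x[0]) ** 2 + 0.5 * (x[1] - x[0] ** 2) ** 2
result = basinhopping(f, x0=np.array([-2.0, -2.0]), niter=100)
print(result.x)  # expected to be close to [1, 1]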



Combined result (the speed of the last two methods is deliberately reduced):




The sources can be downloaded from GitHub.



Sources


  1. K. Madsen, H. B. Nielsen, O. Tingleff (2004): Methods for Non-Linear Least Squares Problems
  2. Florent Brunet (2011): Basics on Continuous Optimization
  3. Least Squares Problems

Source: https://habr.com/ru/post/308626/

