Good day, Habr! It's time to get back to optimization problems. This time we will deal with linear regression and find out whether cats are merely fluffy household freeloaders or also a decent tool for solving applied problems.
Since then (and quite a lot of time has passed), I have tidied up the repository a bit, written a more or less meaningful ReadMe, and restructured the projects. What has changed since the last article, and what is the current state of the project?
the Algebras project contains the Algebra trait, which lists all the basic operations that must be implemented by any object inheriting it; real and interval algebras are currently implemented in this project,
the Transformations project contains the main types of transformations with the corresponding links; transformation properties (for example, numerical differentiability) were added to this project, which will be useful later when implementing gradient-based optimization algorithms,
the Algorithms project contains the main classes of algorithms; it will be filled in as general types of algorithms are identified (so far there is only an algorithm for optimizing real-valued functions),
the Metaheuristic Optimization and Machine Learning projects store implementations of optimization and machine learning algorithms, respectively,
the Tools project contains various procedures needed by the algorithms (for example, random number generators for the required distributions).
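To give a feel for how the Algebras project is organized, here is a minimal sketch of what such a trait might look like; the names (Algebra, RealAlgebra) and the chosen set of operations are my assumptions for illustration, not the repository's actual code:

```scala
// A minimal sketch of an algebra trait listing the basic operations
// that any inheriting object must implement (names are hypothetical).
trait Algebra[T] {
  def plus(a: T, b: T): T
  def times(a: T, b: T): T
  def negate(a: T): T
  def zero: T
  def one: T
}

// Real numbers form the simplest instance of such an algebra.
object RealAlgebra extends Algebra[Double] {
  def plus(a: Double, b: Double): Double  = a + b
  def times(a: Double, b: Double): Double = a * b
  def negate(a: Double): Double           = -a
  def zero: Double                        = 0.0
  def one: Double                         = 1.0
}
```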
As I promised in the first article, at the beginning I will outline the range of tasks to be solved, and then dwell on each in more detail. Links to the repository are given at the end.
So, in this work we will talk:
about linear regression models,
about how to reduce the problem of finding a linear regression to an optimization problem,
about a metaheuristic algorithm for global constrained optimization that simulates the behavior of cats.
Linear regression models and how to reduce them to an optimization problem
Let's start with the formulation of the regression problem.
Suppose there is a set of measurements, which is conveniently presented as a matrix:
$X \in \mathbb{R}^{m \times n},$
i.e., each measurement $x_i = \left(x_{i,1}, x_{i,2}, \dots, x_{i,n}\right)^T$ is represented by a vector from $\mathbb{R}^n$. There is also a set of values of the dependent variable
$Y \in \mathbb{R}^m.$
We will work with the simple case where the dependent variable is one-dimensional. To solve the regression problem, one needs to build a model that, with respect to some quality metric (performance criterion), captures the relationship between the set of measurements and the dependent variable, i.e. to find a model
$f\left(\cdot, \theta\right): \mathbb{R}^n \rightarrow \mathbb{R},$
where $\theta \in \mathbb{R}^p$ is the vector of model parameters, such that $f\left(x_i, \theta\right) = \hat{y}_i \approx y_i, \ \forall i = 1, \dots, m$. Obviously, the approximation and regression problems are closely related.
Thus, if the shape of the desired model has been fixed expertly, then the whole task reduces to determining the values of the parameter vector $\theta$. It should be noted that the following principle applies in machine learning: the more degrees of freedom (read: parameters) a model has, the more likely it is to overfit; with too few degrees of freedom, there is every chance that the model will not be accurate enough.
In this regard, linear models are arguably a kind of transitional link: despite their seeming simplicity, in many situations they solve the regression problem with high accuracy. However, when the data are very noisy, linear models sometimes need an artificial restriction (regularization).
In the repository, the linear regression model provides:
a method that returns the weights of the regression model,
apply - a method that calculates the value of the dependent variable from the given input,
conversion to an inhomogeneous transformation.
The same object also contains
the constant responsible for the shift,
the RSS metric.
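To make this structure a bit more tangible, below is a minimal Scala sketch of such a model; the names (LinearModel, BiasInput, rss) are hypothetical and the conversion to an inhomogeneous transformation is omitted, so treat it as an illustration rather than the repository's actual code:

```scala
// Hypothetical sketch of the linear regression model described above.
// weights(0) plays the role of the bias term.
class LinearModel(val weights: Vector[Double]) {

  // Value of the dependent variable for a single input x (given without the leading 1).
  def apply(x: Vector[Double]): Double = {
    require(x.length + 1 == weights.length, "dimension mismatch")
    weights(0) + x.zip(weights.drop(1)).map { case (xi, w) => xi * w }.sum
  }
}

object LinearModel {
  // The constant responsible for the shift (the implicit leading 1 of the inhomogeneous input).
  val BiasInput: Double = 1.0

  // RSS metric: residual sum of squares of the model over a dataset.
  def rss(model: LinearModel, xs: Seq[Vector[Double]], ys: Seq[Double]): Double =
    xs.zip(ys).map { case (x, y) => math.pow(model(x) - y, 2) }.sum
}
```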
Ordinary Least Squares
As is customary, let us begin with the simplest model and gradually complicate it. In the general case, linear regression is given by the following formula:
$\hat{y} = f\left(x, \theta\right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n,$
which shows that the dimension of the vector $\theta$ is equal to the dimension of the measurement vector plus one. The zeroth component is called the bias term or intercept term, by analogy with the simple function $f\left(x\right) = a + bx$, where the constant $a$ is responsible for the vertical shift of the graph.
Thus, the linear model can be expressed through the scalar product:
$f\left(x, \theta\right) = \left\langle \tilde{x}, \theta \right\rangle, \quad \tilde{x} = \left(1, x_1, \dots, x_n\right)^T.$
It is convenient to pose the problem of finding the optimal value of the parameter vector $\theta$ using the residual sum of squares (RSS):
$RSS\left(\theta\right) = \left|\left| Y - \tilde{X}\theta \right|\right|_2^2 \rightarrow \min_{\theta},$
where the matrix $\tilde{X}$ is obtained from the matrix $X$ by adding a column of ones on the left.
As can be seen from the description above, the optimization problem is already posed; from here on it only remains to apply the chosen optimization algorithm to it.
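As an illustration of how the regression problem becomes an objective for the optimizer, here is a small sketch (plain Scala, hypothetical names) that computes RSS for a given parameter vector $\theta$, with $\theta_0$ as the bias term:

```scala
// Sketch: RSS as a function of the parameter vector theta (theta(0) is the bias term).
// Any optimizer that can minimize a Vector[Double] => Double objective can be pointed at it.
def rss(xs: Seq[Vector[Double]], ys: Seq[Double])(theta: Vector[Double]): Double =
  xs.zip(ys).map { case (x, y) =>
    val prediction = theta(0) + x.zip(theta.drop(1)).map { case (xi, t) => xi * t }.sum
    val residual   = prediction - y
    residual * residual
  }.sum

// Usage: fix the data to obtain an objective for the optimization algorithm.
// val objective: Vector[Double] => Double = rss(xs, ys)
```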
If for one reason or another it is necessary to reduce the variability of the model without changing its structure, one can use regularization, which imposes restrictions on the model parameters.
Ridge regression uses L2 regularization of the model parameters:
$RSS_{ridge}\left(\theta\right) = \left|\left| Y - \tilde{X}\theta \right|\right|_2^2 + \alpha \left|\left| \left(\theta_1, \dots, \theta_n\right)^T \right|\right|_2^2 \rightarrow \min_{\theta},$
where $\left|\left|\left(\theta_1, \dots, \theta_n\right)^T\right|\right|_2 = \sqrt{\sum_{j=1}^{n} \theta_j^2}$ is the L2 norm. The parameter $\alpha$ controls the shrinkage of the coefficients: as $\alpha$ grows, the model parameters tend towards zero. This is clearly demonstrated on the official website of the scikit package:
As in the case of simple linear regression, for ridge regression the solution can be expressed analytically:
$\theta^{*} = \left( \tilde{X}^T \tilde{X} + \alpha I \right)^{-1} \tilde{X}^T Y.$
Lasso regression, in turn, uses L1 regularization, i.e. the penalty $\alpha \sum_{j=1}^{n} \left|\theta_j\right|$, which, unlike the L2 penalty, tends to drive some coefficients exactly to zero. Thus, from the point of view of optimization, Ridge regression and Lasso regression differ only in how the minimization problem is posed.
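From the optimizer's point of view the difference is just an extra penalty term added to the RSS; a sketch building on the rss function from the previous snippet (again with hypothetical names, and assuming the bias term is not penalized) might look like this:

```scala
// Sketch: regularized objectives built on top of the plain rss(...) sketch above.
// The bias term theta(0) is not penalized, matching the norm over (theta_1, ..., theta_n).
def ridgeObjective(xs: Seq[Vector[Double]], ys: Seq[Double], alpha: Double)(theta: Vector[Double]): Double =
  rss(xs, ys)(theta) + alpha * theta.drop(1).map(t => t * t).sum        // L2 penalty (squared norm)

def lassoObjective(xs: Seq[Vector[Double]], ys: Seq[Double], alpha: Double)(theta: Vector[Double]): Double =
  rss(xs, ys)(theta) + alpha * theta.drop(1).map(t => math.abs(t)).sum  // L1 penalty
```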
Cat Swarm Optimization
As the name suggests, the algorithm imitates the behavior of animals of the cat family (domestic cats included). What can you say about your pet? It can play the role of a cute lazybones (although we all know what insidious thoughts are actually swarming in its head), it can fancy itself a great (but cautious) explorer, or it can simply dash around the apartment chasing a non-existent (or rather, invisible to you) opponent. We will leave the cats that lie as immovable as the Great Wall of China alone and let them rest, but let us dwell on the last two kinds of behavior. For any optimization algorithm it is good to have several search stages: a global one, during which we should fall into the basin of attraction of a local extremum (ideally a global one), and a refining one, during which we should move from a neighborhood of the extremum closer to its true location. Sound familiar? Indeed, cats chasing an invisible enemy are obvious candidates for the global search procedure, while the careful explorers will help find the best place to rest. These two heuristics underlie the Cat Swarm Optimization algorithm. To complete the picture, it only remains to imagine that you have not one cat but a whole flock. But that's even better, right?
The pseudocode of the algorithm is presented below:
1. Create an initial population of N cats with random positions and velocities.
2. Evaluate the fitness of each cat and remember the "best" cat found so far (its position, the value of the function at it).
3. Move each cat according to its current mode (seeking or tracing).
4. Redistribute the cats between the two modes according to the ratio MR.
5. If the stopping criterion is not met, go to step 2.
If we try to formalize these ideas mathematically, we get the following theses:
each cat is associated with a point in the search space,
a cat's fitness is the value of the function being optimized at the point corresponding to its current position,
each cat can be in one of two modes: seeking (search) or tracing (chase),
an iteration of the algorithm moves every cat in accordance with the mode it is currently in.
Parameters affecting the operation of the algorithm
number of cats ( numberOfCats ) - the more cats, the longer the algorithm runs (if it is limited by the number of iterations), but also the potentially higher the accuracy of the answer found,
proportion of cats in seeking and tracing modes ( MR ) - this parameter lets you steer the search towards the strategy the user considers preferable; for example, if you know the neighborhood in which the global optimum lies, it makes sense to initialize the population in this neighborhood and keep a larger share of seeking cats in the population to refine the initial solution,
number of attempts in seeking mode ( SMP ) - how many different displacements a seeking cat will try; large values of this parameter increase the time of a single iteration but improve the accuracy of locating the extremum,
displacement fraction in seeking mode ( SRD ) - the fraction by which a seeking cat shifts from its current position; large values push the search from refinement towards global exploration,
number of directions to search ( CDC ) - this parameter controls how many dimensions of the cat's current position are changed in seeking mode; smaller values make the search more coordinate-wise,
desire to stay in the old place ( SPC ) - a boolean variable that determines whether a seeking cat may remain in its current place,
velocity constant ( velocityConstant ) - determines how sharply a cat turns during the chase; larger values change the cat's current velocity vector faster,
maximum speed ( velocityRatio ) - after all, you are the master of the house, so if one of the cats gets going too fast, you may well shout at it to slow down; this parameter limits the maximum speed at which the cats can move.
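For reference, all of these knobs fit naturally into a single parameter container; a hypothetical sketch is shown below (the field names follow the list above, and the default values are arbitrary placeholders rather than recommended settings):

```scala
// Hypothetical container for the Cat Swarm Optimization parameters described above.
case class CatSwarmParameters(
  numberOfCats: Int                = 50,   // population size
  mixtureRatio: Double             = 0.3,  // MR: share of cats in tracing (chase) mode
  seekingMemoryPool: Int           = 5,    // SMP: number of candidate moves in seeking mode
  seekingRangeOfDimension: Double  = 0.2,  // SRD: relative displacement in seeking mode
  countsOfDimensionToChange: Int   = 2,    // CDC: how many coordinates are perturbed
  selfPositionConsidering: Boolean = true, // SPC: may a seeking cat stay where it is
  velocityConstant: Double         = 2.0,  // c: how sharply the velocity vector changes
  velocityRatio: Double            = 1.0   // cap on the maximum speed
)
```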
So, what are the modes in which cats can be? It's simple.
In seeking mode, starting from the current position $l \in \mathbb{R}^n$, new candidate positions $\left\{ l^k \right\}_{k=1}^{N}$ are generated. The new position is selected by the roulette-wheel method, where the probability of choosing the $k$-th position is determined by the formula
$P_k = \frac{\left| f\left(l^k\right) - \max_{j} f\left(l^j\right) \right|}{\max_{j} f\left(l^j\right) - \min_{j} f\left(l^j\right)}.$
For the maximization problem, the maximum in the numerator should be replaced by a minimum.
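A possible sketch of the seeking step for a minimization problem is shown below (self-contained Scala with hypothetical names; it generates SMP candidates, computes the probabilities from the formula above and draws one of them by roulette; the SPC flag is omitted for brevity):

```scala
import scala.util.Random

// Sketch of the seeking (search) mode for a minimization problem.
// f   - objective function, l - current position of the cat,
// smp - number of candidate moves, srd - relative displacement, cdc - coordinates to perturb.
def seek(f: Vector[Double] => Double, l: Vector[Double],
         smp: Int, srd: Double, cdc: Int, rng: Random): Vector[Double] = {
  // Generate SMP candidates by perturbing CDC randomly chosen coordinates by up to +-SRD.
  val candidates = Vector.fill(smp) {
    val dims = rng.shuffle(l.indices.toVector).take(cdc).toSet
    l.zipWithIndex.map { case (value, j) =>
      if (dims.contains(j)) value * (1.0 + srd * (2.0 * rng.nextDouble() - 1.0)) else value
    }
  }

  val fitness = candidates.map(f)
  val (fMax, fMin) = (fitness.max, fitness.min)

  // P_k = |f_k - f_max| / (f_max - f_min); if all candidates are equal, choose uniformly.
  val probabilities =
    if (fMax == fMin) Vector.fill(smp)(1.0 / smp)
    else {
      val raw = fitness.map(fk => math.abs(fk - fMax) / (fMax - fMin))
      val sum = raw.sum
      raw.map(_ / sum)
    }

  // Roulette-wheel selection over the candidates.
  val cumulative = probabilities.scanLeft(0.0)(_ + _).tail
  val r          = rng.nextDouble()
  val idx        = cumulative.indexWhere(_ >= r)
  candidates(if (idx >= 0) idx else smp - 1)
}
```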
Now a little about how the chase is implemented. So as not to frighten the owner, a cat should chase not a fictional enemy but a real one, namely the cat that has currently found the best place for itself (its position will be denoted by $\tilde{l}$). The cat's velocity changes according to the following rule:
$v^k = v^k + \xi \cdot c \cdot \left( \tilde{l} - l^k \right),$
where $\xi \sim U\left(0, 1\right)$ is a uniformly distributed random variable and $c$ is the velocity constant (the one responsible for turning). The new position of the cat is computed as $l^k = l^k + v^k$.
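A sketch of the tracing step under the same assumptions (hypothetical names, coordinate-wise clipping of the velocity to the maximum speed) might look like this:

```scala
import scala.util.Random

// Sketch of the tracing (chase) mode: the cat accelerates towards the best cat's position.
// c is the velocity constant, vMax the speed cap (velocityRatio).
def trace(position: Vector[Double], velocity: Vector[Double], best: Vector[Double],
          c: Double, vMax: Double, rng: Random): (Vector[Double], Vector[Double]) = {
  val xi = rng.nextDouble() // xi ~ U(0, 1)

  // v_k = v_k + xi * c * (l_best - l_k), clipped coordinate-wise to [-vMax, vMax].
  val newVelocity = velocity.zip(position).zip(best).map { case ((v, p), b) =>
    val updated = v + xi * c * (b - p)
    math.max(-vMax, math.min(vMax, updated))
  }

  // l_k = l_k + v_k
  val newPosition = position.zip(newVelocity).map { case (p, v) => p + v }
  (newPosition, newVelocity)
}
```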
Now that all the key points of the algorithm have been covered, it is time to figure out whether cats can actually build a regression.
So, can cats build regression or not?
Let's generate several test datasets (the notebook also contains the corresponding regression models computed with scikit; a small data-generation sketch is given right after the list):
one-dimensional linear dependence without noise: $y = 2 + 3 \cdot x$,
Ordinary Least Squares model parameters: $\theta = \left(2.001, 3.001\right)^T$,
Ridge Regression model parameters ($\alpha = 0.7$): $\theta = \left(2.0008, 2.9993\right)^T$,
Lasso Regression model parameters ($\alpha = 0.7$): $\theta = \left(1.997, 2.976\right)^T$,
multidimensional linear dependence without noise: $y = x_1 - 2 \cdot x_2 + 3 \cdot x_3 - 4 \cdot x_4 + 5 \cdot x_5$,
Ordinary Least Squares model parameters: $\theta = \left(0.018, 0.999, -1.9999, 3.005, -3.994, 5.002\right)^T$,
Ridge Regression model parameters ($\alpha = 0.7$): $\theta = \left(0.025, 1.003, -2.002, 2.998, -3.997, 4.999\right)^T$,
Lasso Regression model parameters ($\alpha = 0.7$): $\theta = \left(0.011, 0.972, -1.989, 2.995, -3.969, 4.971\right)^T$,
one-dimensional linear dependence with noise: $y = 2 + 3 \cdot x + \xi$,
Ordinary Least Squares model parameters: $\theta = \left(2.124, 3.06\right)^T$,
Ridge Regression model parameters ($\alpha = 0.7$): $\theta = \left(2.121, 3.061\right)^T$,
Lasso Regression model parameters ($\alpha = 0.7$): $\theta = \left(2.093, 3.039\right)^T$,
multidimensional linear dependence with noise: $y = x_1 - 2 \cdot x_2 + 3 \cdot x_3 + \xi$,
Ordinary Least Squares model parameters: $\theta = \left(0.073, 0.889, -2.005, 2.949\right)^T$,
Ridge Regression model parameters ($\alpha = 0.7$): $\theta = \left(-0.068, 0.884, -2.004, 2.945\right)^T$,
Lasso Regression model parameters ($\alpha = 0.7$): $\theta = \left(-0.082, 0.857, -1.982, 2.921\right)^T$,
where $\xi \sim U\left(-5, 5\right)$ is a uniformly distributed random variable.
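For reference, generating, say, the noisy one-dimensional dataset takes only a couple of lines; here is a sketch (the sample size and the range of $x$ are arbitrary choices of mine):

```scala
import scala.util.Random

// Sketch: generate the noisy one-dimensional dataset y = 2 + 3 * x + xi, xi ~ U(-5, 5).
// Sample size (100) and the range of x (U(0, 10)) are illustrative assumptions.
val rng = new Random(42)
val xs: Seq[Vector[Double]] = Seq.fill(100)(Vector(10.0 * rng.nextDouble()))
val ys: Seq[Double]         = xs.map(x => 2.0 + 3.0 * x(0) + (10.0 * rng.nextDouble() - 5.0))

// With the rss(...) sketch from the OLS section, the cat swarm can now minimize
// the objective below and should recover theta close to (2, 3)^T.
// val objective: Vector[Double] => Double = rss(xs, ys)
```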
Results obtained using scikit for the noisy data
For the one-dimensional linear dependence with noise:
For the multidimensional linear dependence with noise:
It can be seen that the values found are quite close to the results obtained using scikit.
Conclusion
Thus, the given problem formulation, despite being a simple toy example, made it possible to get acquainted with the metaheuristic Cat Swarm Optimization algorithm. When developing optimization procedures, it is often useful to observe the world around us, because, as you know, "nature knows best".