
Do cats know how to build regression?

Good day, Habr! It's time to get back to optimization tasks. This time we will deal with linear regression and find out whether cats are merely fluffy household troublemakers or also a decent tool for solving applied problems.


Since then (and quite a lot of time has passed), I have "brushed up" the repository a bit, written a more or less meaningful ReadMe, and restructured the projects. What has changed since the last article, and what is the current state of the project?


As promised in the first article, I will start by outlining the range of tasks to be solved and then dwell on each of them in more detail. Links to the repository are given at the end.
So, in this article we will talk about:

  1. linear regression models,
  2. how to reduce the problem of fitting a linear regression to an optimization problem,
  3. a meta-heuristic global constrained optimization algorithm that simulates the behavior of cats.

Linear regression models and how to reduce them to an optimization problem


Let's start with the formulation of the regression problem.

Suppose there is a set of measurements, which is conveniently presented as a matrix:

$X \in \mathbb{R}^{m \times n},$

i.e. every measurement $x_i = \left( x_{i,1}, x_{i,2}, \dots, x_{i,n} \right)^T$ is represented by a vector from $\mathbb{R}^n$. There is also a set of values of the dependent variable:

$Y \in \mathbb{R}^m.$

We will work with the simple case where the dependent variable is one-dimensional. To solve the regression problem, we need to build a model that, with respect to a chosen quality metric (performance criterion), captures the relationship between the set of measurements and the dependent variable, i.e. find a model

$f\left( \cdot, \theta \right): \mathbb{R}^n \rightarrow \mathbb{R},$

where $\theta \in \mathbb{R}^p$ is the vector of model parameters, such that $f\left( x_i, \theta \right) = \hat{y}_i \approx y_i$ for all $i = 1, \dots, m$. Obviously, the approximation and regression problems are closely related.

Thus, if the shape of the desired model is fixed by an expert, the whole task reduces to determining the values of the parameter vector $\theta$. It should be noted that the following principle applies in machine learning: the more degrees of freedom (read: parameters) a model has, the higher the risk that it will overfit. With too few degrees of freedom, there is every chance that the model will not be accurate enough.

In this regard, linear models are probably a kind of middle ground. Despite their apparent simplicity, in many situations they solve the regression problem with high accuracy. However, when the data are very noisy, linear models sometimes need to be artificially constrained (regularization).

General linear regression model code
 trait GeneralizedLinearModel {

   // Weights of the regression model (including the bias component).
   def getWeights(): Vector[Real]

   // Predicted value of the dependent variable for a single measurement.
   def apply(v: Vector[Real]): Real = getWeights().dot(v + GeneralizedLinearModel.bias)

   // Predictions for a whole set of measurements.
   def apply(vectors: Seq[Vector[Real]]): Seq[Real] = vectors.map(this.apply(_))

   def convertToTransformation(): InhomogeneousTransformation[Vector[Real], Real] =
     new InhomogeneousTransformation[Vector[Real], Real](v => this.apply(v))
 }

 object GeneralizedLinearModel {

   // Constant component appended to every measurement (the "1" that multiplies the bias term).
   val bias = Vector("bias" -> Real(1.0))

   object Metrics {

     // Residual sum of squares averaged over the number of measurements.
     def RSS(generalizedLinearModel: GeneralizedLinearModel,
             input: Seq[Vector[Real]],
             output: Seq[Real]): Real = {
       val predictions = generalizedLinearModel(input)
       predictions.zip(output)
         .map { case (pred, real) => (pred - real) ^ 2.0 }
         .reduce(_ + _) / input.length
     }
   }
 }
The trait itself defines:
  • a method that returns the weights of the regression model,
  • apply - a method that calculates the value of the dependent variable based on the received input,
  • conversion to inhomogeneous transformation.

The companion object contains:
  • the constant responsible for the bias term,
  • the RSS metric.


Ordinary Least Squares


As is customary, let us begin with the simplest model and gradually complicate it. In its general form, linear regression is given by the following formula:

$\hat{y}_i = f\left( x_i, \theta \right) = \theta_0 + \sum_{j=1}^{n} \theta_j x_{i,j},$

which shows that the dimension of the vector $\theta$ equals the dimension of the measurement vector plus one. The zeroth component is called the bias term or intercept term. It plays the same role as the constant $a$ in the simple function $f(x) = a + bx$, which shifts the graph relative to the x-axis.

Thus, the linear model can be expressed through the scalar product

$\hat{y}_i = f\left( x_i, \theta \right) = \theta \cdot \left( 1, x_{i,1}, \dots, x_{i,n} \right)^T.$
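To make the notation concrete, here is a tiny sketch in plain Scala (standard collections only, no project types) of how such a prediction is computed: the measurement is extended with a leading 1 and dotted with the parameter vector.

 // Minimal illustration: theta = (theta_0, theta_1, ..., theta_n), x = (x_1, ..., x_n).
 def predict(theta: Seq[Double], x: Seq[Double]): Double =
   (1.0 +: x).zip(theta).map { case (xi, ti) => xi * ti }.sum

 // For theta = (2, 3), i.e. y = 2 + 3 * x:
 predict(Seq(2.0, 3.0), Seq(10.0)) // 32.0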


It is convenient to pose the task of finding the optimal value of the parameter vector $\theta$ using the residual sum of squares (RSS):

$\theta = \arg\min_{\theta} \sum_{i=1}^{m} \left( y_i - f\left( x_i, \theta \right) \right)^2.$

With this formulation of the problem, it becomes possible to find an analytical solution, which is expressed by the following formula:

$\theta = \left( \tilde{X}^T \cdot \tilde{X} \right)^{-1} \cdot \tilde{X}^T \cdot Y,$

where the matrix $\tilde{X}$ is obtained from the matrix $X$ by prepending a column of ones.
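For reference, here is a minimal sketch of this closed-form solution; it assumes the Breeze linear algebra library (not used elsewhere in the project) purely for the matrix operations.

 import breeze.linalg.{DenseMatrix, DenseVector, inv}

 // theta = (X~^T * X~)^{-1} * X~^T * Y, where X~ is X with a prepended column of ones.
 def olsClosedForm(x: DenseMatrix[Double], y: DenseVector[Double]): DenseVector[Double] = {
   val xTilde = DenseMatrix.horzcat(DenseMatrix.ones[Double](x.rows, 1), x)
   inv(xTilde.t * xTilde) * (xTilde.t * y)
 }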

As can be seen from the above, the optimization problem has already been stated; all that remains is to apply the chosen optimization algorithm to it.

Ordinary Least Squares
 case class OrdinaryLeastSquaresRegression(w: Vector[Real]) extends GeneralizedLinearModel {
   override def getWeights(): Vector[Real] = w
 }

 object OrdinaryLeastSquaresRegression {

   class Task(input: Seq[Vector[Real]], output: Seq[Real]) extends General.Task {

     // Turns the regression task into a real-valued optimization task over the weight vector.
     def toOptimizationTask(searchArea: Map[String, (Double, Double)]):
         (Optimization.Real.Task, InhomogeneousTransformation[Vector[Real], OrdinaryLeastSquaresRegression]) = {
       val vectorToRegressor =
         new InhomogeneousTransformation[Vector[Real], OrdinaryLeastSquaresRegression](
           (w: Vector[Real]) => OrdinaryLeastSquaresRegression(w))
       val task = new Optimization.Real.Task(
         new Function[Real]((w: Vector[Real]) =>
           GeneralizedLinearModel.Metrics.RSS(vectorToRegressor(w), input, output)),
         searchArea)
       (task, vectorToRegressor)
     }
   }
 }


Ridge & Lasso Regression


If for one reason or another it is necessary to reduce the degree of variation of the model without changing its structure, regularization can be used, which imposes restrictions on the model parameters.

Ridge regression uses L2 regularization of the model parameters:

$\theta = \arg\min_{\theta} \sum_{i=1}^{m} \left( y_i - f\left( x_i, \theta \right) \right)^2 + \alpha \left\| \left( \theta_1, \dots, \theta_n \right)^T \right\|_2^2,$

where $\left\| \left( \theta_1, \dots, \theta_n \right)^T \right\|_2 = \sqrt{\sum_{j=1}^{n} \theta_j^2}$ is the L2 norm. The parameter $\alpha$ is responsible for shrinking the coefficients: as $\alpha$ grows, the model parameters tend to zero. This is clearly demonstrated on the official website of the scikit-learn package:


As in the case of simple linear regression, for ridge regression the solution can be expressed analytically:

$\theta = \left( \tilde{X}^T \cdot \tilde{X} + \alpha \cdot \tilde{E}_{n+1} \right)^{-1} \cdot \tilde{X}^T \cdot Y,$

where $\tilde{E}_{n+1}$ is the identity matrix of order $n+1$ whose top-left entry is zero (so that the bias term is not penalized).
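The same sketch, under the same Breeze assumption as above, extends naturally to the ridge closed form: the only change is the regularization matrix added before the inversion.

 import breeze.linalg.{DenseMatrix, DenseVector, inv}

 // theta = (X~^T * X~ + alpha * E~)^{-1} * X~^T * Y, with E~ the identity matrix
 // whose top-left entry is zeroed so that the bias term is not penalized.
 def ridgeClosedForm(x: DenseMatrix[Double], y: DenseVector[Double], alpha: Double): DenseVector[Double] = {
   val xTilde = DenseMatrix.horzcat(DenseMatrix.ones[Double](x.rows, 1), x)
   val e = DenseMatrix.eye[Double](xTilde.cols)
   e(0, 0) = 0.0
   inv(xTilde.t * xTilde + e * alpha) * (xTilde.t * y)
 }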

Ridge regression
 class RidgeRegression(w: Vector[Real], alpha: Double) extends OrdinaryLeastSquaresRegression(w) { }

 object RidgeRegression {

   class Task(input: Seq[Vector[Real]], output: Seq[Real]) extends General.Task {

     def toOptimizationTask(searchArea: Map[String, (Double, Double)], alpha: Double):
         (Optimization.Real.Task, InhomogeneousTransformation[Vector[Real], RidgeRegression]) = {
       val vectorToRegressor =
         new InhomogeneousTransformation[Vector[Real], RidgeRegression](
           (w: Vector[Real]) => new RidgeRegression(w, alpha))
       val task = new Optimization.Real.Task(
         new Function[Real]((w: Vector[Real]) =>
           // Data term plus the L2 penalty; the bias component is excluded from the penalty.
           GeneralizedLinearModel.Metrics.RSS(vectorToRegressor(w), input, output) +
             alpha * w.components.filterKeys(_ != "bias").values.map(_ ^ 2.0).reduce(_ + _)),
         searchArea)
       (task, vectorToRegressor)
     }
   }
 }


For Lasso regression, the problem is posed similarly; the difference is that L1 regularization of the model parameters is used instead:

$\theta = \arg\min_{\theta} \frac{1}{2m} \cdot \sum_{i=1}^{m} \left( y_i - f\left( x_i, \theta \right) \right)^2 + \alpha \left\| \left( \theta_1, \dots, \theta_n \right)^T \right\|_1,$

where $\left\| \left( \theta_1, \dots, \theta_n \right)^T \right\|_1 = \sum_{j=1}^{n} \left| \theta_j \right|$ is the L1 norm.

Lasso Regression
 class LassoRegression(w: Vector[Real], alpha: Double) extends OrdinaryLeastSquaresRegression(w) { }

 object LassoRegression {

   class Task(input: Seq[Vector[Real]], output: Seq[Real]) extends General.Task {

     def toOptimizationTask(searchArea: Map[String, (Double, Double)], alpha: Double):
         (Optimization.Real.Task, InhomogeneousTransformation[Vector[Real], LassoRegression]) = {
       val vectorToRegressor =
         new InhomogeneousTransformation[Vector[Real], LassoRegression](
           (w: Vector[Real]) => new LassoRegression(w, alpha))
       val task = new Optimization.Real.Task(
         new Function[Real]((w: Vector[Real]) =>
           // Data term plus the L1 penalty; the bias component is excluded from the penalty.
           GeneralizedLinearModel.Metrics.RSS(vectorToRegressor(w), input, output) / (2.0 * input.length) +
             alpha * w.components.filterKeys(_ != "bias").values.map(Algebra.abs(_)).reduce(_ + _)),
         searchArea)
       (task, vectorToRegressor)
     }
   }
 }


Thus, from the point of view of optimization, ridge regression and Lasso regression differ only in the way the minimization problem is formulated.
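To underline the point, here is a small sketch in plain Scala (illustrative names, not the project's API) that spells out the only difference between the two objectives: the penalty added to the data term.

 // L2 (ridge) and L1 (lasso) penalties over the non-bias weights.
 def l2Penalty(weights: Seq[Double], alpha: Double): Double = alpha * weights.map(w => w * w).sum
 def l1Penalty(weights: Seq[Double], alpha: Double): Double = alpha * weights.map(math.abs).sum

 // Both objectives share the same data term and differ only in the penalty:
 def ridgeObjective(rss: Double, weights: Seq[Double], alpha: Double): Double =
   rss + l2Penalty(weights, alpha)
 def lassoObjective(rss: Double, m: Int, weights: Seq[Double], alpha: Double): Double =
   rss / (2.0 * m) + l1Penalty(weights, alpha)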

Cat Swarm Optimization


As the name suggests, the algorithm imitates the behavior of animals of the cat family (including domestic cats). What can you say about your pet? It can play the role of a sweet lazybones (although we all know what insidious thoughts are actually swarming in its head), it can fancy itself a great (but cautious) explorer, or it can simply dash around the apartment chasing a non-existent (or rather, invisible to you) opponent. We will leave the cats lying as immovable as the Great Wall of China alone, let them rest, and dwell on the last two activities. For any optimization algorithm it is good to have several search stages: a global one, during which we should land in the basin of attraction of a local extremum (ideally the global one), and a refining one, during which we move from a neighborhood of the extremum closer to its true location. Sound familiar? Indeed, cats chasing an invisible enemy are obvious candidates for the global search procedure, while careful explorers help find the best place to rest. These two heuristics underlie the Cat Swarm Optimization algorithm. To complete the picture, it only remains to imagine that you have not one cat but a whole flock. But that's even better, right?

The pseudocode of the algorithm is presented below:

1. Create an initial population of N cats (random positions and velocities).
2. Evaluate each cat and remember the "best" one (according to the fitness function; it holds the current candidate solution).
3. Move each cat according to its current mode (seek or trace).
4. Redistribute the cats between the two modes according to the MR parameter.
5. If the stopping criterion is not met, go to step 2.
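Below is a minimal, self-contained Scala sketch of this loop, written with standard collections only. The names (SimpleCat, seekStep, traceStep) are illustrative and not the project's API, and the seeking step here greedily keeps the best candidate instead of using roulette selection.

 import scala.util.Random

 final case class SimpleCat(pos: Seq[Double], vel: Seq[Double], seeking: Boolean)

 // Seeking: try several random displacements of the current position and keep the best one.
 def seekStep(cat: SimpleCat, f: Seq[Double] => Double, srd: Double, smp: Int, rng: Random): SimpleCat = {
   val candidates = cat.pos +: Seq.fill(smp)(cat.pos.map(x => x * (1.0 + (2 * rng.nextDouble() - 1) * srd)))
   cat.copy(pos = candidates.minBy(f))
 }

 // Tracing: v <- v + xi * c * (best - pos); pos <- pos + v.
 def traceStep(cat: SimpleCat, best: Seq[Double], c: Double, rng: Random): SimpleCat = {
   val newVel = cat.vel.zip(best.zip(cat.pos)).map { case (v, (b, x)) => v + rng.nextDouble() * c * (b - x) }
   SimpleCat(cat.pos.zip(newVel).map { case (x, v) => x + v }, newVel, cat.seeking)
 }

 def runCSO(f: Seq[Double] => Double, initial: Seq[SimpleCat], mr: Double,
            srd: Double, smp: Int, c: Double, iterations: Int, rng: Random): Seq[Double] = {
   var cats = initial
   var best = cats.map(_.pos).minBy(f)
   for (_ <- 0 until iterations) {
     cats = cats.map(cat => if (cat.seeking) seekStep(cat, f, srd, smp, rng) else traceStep(cat, best, c, rng))
     val candidate = cats.map(_.pos).minBy(f)                   // remember the best cat found so far
     if (f(candidate) < f(best)) best = candidate
     cats = cats.map(_.copy(seeking = rng.nextDouble() >= mr))  // redistribute modes according to MR
   }
   best
 }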

If we try to formalize all of these ideas, we arrive at the following.


Parameters affecting the operation of the algorithm
 class CatSwarmOptimization(numberOfCats: Int,
                            MR: Double,
                            SMP: Int,
                            SRD: Double,
                            CDC: Int,
                            SPC: Boolean,
                            velocityConstant: Double,
                            velocityRatio: Double,
                            generator: DiscreteUniform with ContinuousUniform = Kaimere.Tools.Random.GoRN) extends Algorithm

  • number of cats (numberOfCats) - the more cats, the longer the algorithm runs (if it is limited by the number of iterations), but also the potentially higher the accuracy of the answer found,
  • proportion of cats in seeking vs. tracing mode (MR) - this parameter lets you steer the search towards the strategy the user considers preferable; for example, if you know the neighborhood in which the global optimum lies, it makes sense to initialize the population in that neighborhood and keep a larger share of seeking cats in the population to refine the initial solution,
  • number of attempts in seeking mode (SMP) - how many candidate displacements a seeking cat generates; large values of this parameter increase the time of one iteration but improve the accuracy of locating the extremum,
  • displacement share in seeking mode (SRD) - the fraction by which a seeking cat shifts from its current position; larger values shift the search from refinement towards global exploration,
  • number of directions to vary (CDC) - this parameter controls how many coordinates of the seeking cat's current position are changed; smaller values make the search more coordinate-wise,
  • the desire to stay in place (SPC) - a boolean flag determining whether a seeking cat may remain at its current position,
  • velocity constant (velocityConstant) - determines how sharply a cat turns during the chase; larger values change the cat's current velocity vector faster,
  • maximum speed (velocityRatio) - after all, you are the master of the house, so if one of the cats gets carried away and accelerates too much, you can always shout at it to slow down; this parameter limits the maximum speed of the cats.
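For illustration, the algorithm could be instantiated as follows; the constructor signature is the one shown above, while the concrete values are purely an example, not recommended defaults.

 val cso = new CatSwarmOptimization(
   numberOfCats = 25,
   MR = 0.2,               // roughly 20% of the cats trace, the rest seek
   SMP = 5,                // each seeking cat tries 5 candidate moves
   SRD = 0.1,              // chosen coordinates may shift by up to 10%
   CDC = 2,                // 2 coordinates are varied per candidate
   SPC = true,             // the current position competes with the candidates
   velocityConstant = 2.0,
   velocityRatio = 0.1)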


So, what are the modes in which cats can be? It's simple.

During the search mode, new candidate positions $\left\{ l^k \right\}_{k=1}^{N}$ are generated from the current position $l \in \mathbb{R}^n$. The new position is then chosen by the roulette-wheel method, where the probability of selecting the $k$-th position is given by

$p_k = \frac{\displaystyle \left| \max_{j=1,\dots,N} f\left( l^j \right) - f\left( l^k \right) \right|}{\displaystyle \max_{j=1,\dots,N} f\left( l^j \right) - \min_{j=1,\dots,N} f\left( l^j \right)}.$

For the maximization problem, the maximum in the numerator should be replaced by a minimum.
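As a quick numerical check of this formula (minimization case), here is what the probabilities look like for three candidate positions in plain Scala:

 // Fitness values of three candidates: the worst one gets probability 0, the best gets 1.
 val fs = Seq(3.0, 1.0, 2.0)
 val (mx, mn) = (fs.max, fs.min)
 val ps = fs.map(f => math.abs(mx - f) / (mx - mn)) // Seq(0.0, 1.0, 0.5)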

Now a little about how the chase is implemented. So as not to frighten the owner, a cat should chase not an imaginary enemy but a real one: the cat that has currently found the best place for itself (its position will be denoted by $\tilde{l}$). The velocity of a cat changes according to the following rule:

$v^k = v^k + \xi \cdot c \cdot \left( \tilde{l} - l^k \right),$

where $\xi \sim U\left( 0, 1 \right)$ is a uniformly distributed random variable and $c$ is the velocity constant (the one responsible for turning). The new position of the cat is computed as $l^k = l^k + v^k$.

Cat implementation
 case class Cat(location: Vector[Real], velocity: Vector[Real])
               (implicit generator: DiscreteUniform with ContinuousUniform) {

   def getFromSeries[T](data: Seq[T], n: Int, withReturn: Boolean): Seq[T] = withReturn match {
     case true => Seq.fill(n)(generator.getUniform(0, data.size - 1)).map(x => data(x))
     case false => data.sortBy(_ => generator.getUniform(0.0, 1.0)).take(n)
   }

   def seek(task: Task, SPC: Boolean, SMP: Int, CDC: Int, SRD: Double): Cat = {
     // Generate candidate positions (optionally keeping the current one among them).
     val newLocations = (if (SPC) Seq(location) else Seq()) ++
       Seq.fill(SMP - (if (SPC) 1 else 0))(location)
         .map { loc =>
           val ratio = getFromSeries(task.searchArea.keys.toSeq, CDC, false)
             .map { key => (key, generator.getUniform(1.0 - SRD, 1.0 + SRD)) }
             .toMap
           (loc * ratio).constrain(task.searchArea)
         }
     val fitnessValues = newLocations.map(task(_)).map(_.value)
     val newLocation =
       if (fitnessValues.tail.forall(_ == fitnessValues.head)) newLocations(generator.getUniform(0, SMP - 1))
       else {
         // Roulette-wheel selection over the candidate positions.
         val maxFitness = fitnessValues.max
         val minFitness = fitnessValues.min
         val probabilities = fitnessValues.map(v => (maxFitness - v) / (maxFitness - minFitness))
         val roulette = 0.0 +: probabilities.tail
           .foldLeft(Seq(probabilities.head)) { case (prob, curr) => (curr + prob.head) +: prob }
           .reverse
         val chosen = generator.getUniform(0.0, roulette.last)
         val idChosen = roulette.sliding(2).indexWhere { case Seq(a, b) => a <= chosen && chosen <= b }
         newLocations(idChosen)
       }
     new Cat(newLocation, newLocation - this.location)
   }

   def updateVelocity(bestCat: Cat, velocityConstant: Double, maxVelocity: Map[String, Double]): Vector[Real] = {
     val newVelocity = this.velocity +
       (bestCat.location - this.location) * velocityConstant * generator.getUniform(0.0, 1.0)
     // Clamp each velocity component to [-maxVelocity(key), +maxVelocity(key)].
     Vector(newVelocity.components
       .map { case (key, value) =>
         if (value.value > maxVelocity(key)) (key, Real(maxVelocity(key)))
         else if (value.value < -maxVelocity(key)) (key, Real(-maxVelocity(key)))
         else (key, value)
       })
   }

   def trace(task: Task, bestCat: Cat, velocityConstant: Double, maxVelocity: Map[String, Double]): Cat = {
     val newVelocity = this.updateVelocity(bestCat, velocityConstant, maxVelocity)
     val newLocation = (location + newVelocity).constrain(task.searchArea)
     new Cat(newLocation, newVelocity)
   }

   def move(mode: Int, task: Task, bestCat: Cat,
            SPC: Boolean, SMP: Int, CDC: Int, SRD: Double,
            velocityConstant: Double, maxVelocity: Map[String, Double]): Cat = mode match {
     case 0 => seek(task, SPC, SMP, CDC, SRD)
     case 1 => trace(task, bestCat, velocityConstant, maxVelocity)
   }
 }


Now that all the key points of the algorithm have been covered, it is time to find out whether cats can actually build a regression.

So, can cats build regression or not?


Let's generate several test datasets (the notebook also contains the corresponding computations with the regression models from scikit-learn):

  1. one-dimensional linear dependence without noise: $y = 2 + 3 \cdot x$,
    • Ordinary Least Squares model parameters: $\theta = \left( 2.001, 3.001 \right)^T$,
    • Ridge regression model parameters ($\alpha = 0.7$): $\theta = \left( 2.0008, 2.9993 \right)^T$,
    • Lasso regression model parameters ($\alpha = 0.7$): $\theta = \left( 1.997, 2.976 \right)^T$,
  2. multidimensional linear dependence without noise: $y = x_1 - 2 \cdot x_2 + 3 \cdot x_3 - 4 \cdot x_4 + 5 \cdot x_5$,
    • Ordinary Least Squares model parameters: $\theta = \left( 0.018, 0.999, -1.9999, 3.005, -3.994, 5.002 \right)^T$,
    • Ridge regression model parameters ($\alpha = 0.7$): $\theta = \left( 0.025, 1.003, -2.002, 2.998, -3.997, 4.999 \right)^T$,
    • Lasso regression model parameters ($\alpha = 0.7$): $\theta = \left( 0.011, 0.972, -1.989, 2.995, -3.969, 4.971 \right)^T$,
  3. one-dimensional linear dependence with noise: $y = 2 + 3 \cdot x + \xi$,
    • Ordinary Least Squares model parameters: $\theta = \left( 2.124, 3.06 \right)^T$,
    • Ridge regression model parameters ($\alpha = 0.7$): $\theta = \left( 2.121, 3.061 \right)^T$,
    • Lasso regression model parameters ($\alpha = 0.7$): $\theta = \left( 2.093, 3.039 \right)^T$,
  4. multidimensional linear dependence with noise: $y = x_1 - 2 \cdot x_2 + 3 \cdot x_3 + \xi$,
    • Ordinary Least Squares model parameters: $\theta = \left( 0.073, 0.889, -2.005, 2.949 \right)^T$,
    • Ridge regression model parameters ($\alpha = 0.7$): $\theta = \left( 0.068, 0.884, -2.004, 2.945 \right)^T$,
    • Lasso regression model parameters ($\alpha = 0.7$): $\theta = \left( 0.082, 0.857, -1.982, 2.921 \right)^T$,
where $\xi \sim U\left( -5, 5 \right)$ is a uniformly distributed random variable.
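For reference, a dataset like the third one could be generated with a few lines of plain Scala (an illustrative sketch, not the notebook's exact code):

 import scala.util.Random

 // y = 2 + 3 * x + xi, with xi ~ U(-5, 5).
 val rng = new Random(42)
 val xs = Seq.fill(100)(10.0 * rng.nextDouble())
 val ys = xs.map(x => 2.0 + 3.0 * x + (10.0 * rng.nextDouble() - 5.0))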

Results obtained using scikit-learn for the noisy data
For one-dimensional linear dependence with noise:

For multidimensional linear dependence with noise:


It can be seen that the values found are quite close to the results obtained using scikit-learn.

Conclusion


Thus, the given problem statement, despite being a simple, model case, made it possible to get acquainted with the meta-heuristic Cat Swarm Optimization algorithm. When developing optimization procedures, it is often useful to observe the world around us, because, as you know, "nature knows best".

References and literature

Source: https://habr.com/ru/post/328760/

