Do cats know how to build regression?

Good day, Habr! It's time to get back to the optimization tasks. This time we will deal with linear regression and find out who the cats are - only fluffy ~~pet bastards~~ or even a good tool for solving applied problems.

Since that time (and it has been quite a lot), I “brushed” the repository a bit, wrote more or less meaningful ReadMe, and also restructured the projects. What has changed since the last article , and what is the current state of the project?

the Algebras project contains the implementation of the trait of algebra, which lists all the basic operations that must be implemented for the object that inherits it; in addition, real and interval algebras are currently implemented in this project,
in the Transformations project there are main types of transformations with corresponding links; in this project, an addition was made to the properties of transformations (for example, numerical differentiability), which is useful later in the implementation of optimization algorithms using a gradient,
in the Algorithms project are the main classes of algorithms; this project will be filled as the general types of algorithms are highlighted (as long as there is only an algorithm for optimizing real-valued functions),
Metaheuristic Optimization and Machine Learning projects store implementations of optimization and machine learning algorithms, respectively,
Tools project contains various procedures necessary for the operation of algorithms (for example, random number generators in accordance with distributions).

As I promised in the first paper, at the beginning of the article I will outline the circle of tasks that will be solved, and then dwell on each in more detail. References to the repository will be given at the end.
')
So, in this work we will talk:

about linear regression models,
about how to reduce the problem of finding a linear regression to an optimization problem,
about the meta-heuristic global conditional optimization algorithm that simulates the behavior of cats.

Linear regression models and how to reduce them to an optimization problem

Let's start with the formulation of the regression problem.

Suppose there is a set of measurements, which is conveniently presented as a matrix:

$X \ in \ mathbb {R} ^ {m \ times n},$

those. every measurement

$x_i = \ left (x_ {i, 1}, x_ {i, 2}, \ dots, x_ {i, n} \ right) ^ T$ represented by a vector of

$\ mathbb {R} ^ n$ . There is also a set of values for the dependent variable.

$Y \ in \ mathbb {R} ^ m.$

We will work with a simple case when the dependent variable is one-dimensional. Solving the regression problem, you need to build a model that, taking into account a certain quality metric (performance criterion), will realize the connection between the set of measurements and the dependent variable, i.e. find model

$f \ left (\ cdot, \ theta \ right): \ mathbb {R} ^ n \ rightarrow \ mathbb {R},$

Where

$\ theta \ in \ mathbb {R} ^ p$ - vector of model parameters, t.ch.

$f \ left (x_i, \ theta \ right) = \ hat {y} _i \ approx y_i, \ forall i = 1, \ dots, n$ . Obviously, the problem of approximation and regression are closely related.

Thus, if you expertly fixed the shape of the desired model, then the whole task is reduced to determining the values of the parameters vector.

$\ theta$ . It should be noted that the following principle applies to machine learning: the more degrees of freedom (read parameters) the model has, the more likely it is that the model used is retrained. In the case of a few degrees of freedom, then there is every chance that the model will not be sufficiently accurate.

In this regard, linear models are probably some kind of transitional link. Despite the seeming simplicity, for many situations they solve the regression problem with high accuracy. However, when data is very noisy, linear models sometimes need artificial limitation (regularization).

General linear regression model code

trait GeneralizedLinearModel { def getWeights(): Vector[Real] def apply(v: Vector[Real]): Real = getWeights().dot(v + bias) def apply(vectors: Seq[Vector[Real]]): Seq[Real] = vectors.map(this.apply(_)) def convertToTransformation(): InhomogeneousTransformation[Vector[Real], Real] = new InhomogeneousTransformation[Vector[Real], Real](v => this.apply(v)) } object GeneralizedLinearModel { val bias = Vector("bias" -> Real(1.0)) object Metrics { def RSS(generalizedLinearModel: GeneralizedLinearModel, input: Seq[Vector[Real]], output: Seq[Real]): Real = { val predictions = generalizedLinearModel(input) predictions.zip(output) .map { case (pred, real) => (pred - real) ^ 2.0 } .reduce(_ + _) / input.length } } }

In the trait itself are defined:

a method that returns the weights of the regression model,
apply - a method that calculates the value of the dependent variable based on the received input,
conversion to inhomogeneous transformation.

In the same object there are

the constant responsible for the shift
RSS metrics.

Ordinary Least Squares

As is customary, let us begin with the simplest model and will quietly complicate it. So, in general, linear regression is given by the following formula:

$\ hat {y} _i = f \ left (x_i, \ theta \ right) = \ theta_0 + \ sum_ {j = 1} ^ {n} \ theta_j x_ {i, j},$

which shows that the dimension of the vector

$\ theta$ equal to the dimension of the dimension vector plus one. The zero component is called bias term or intercept term. Similar to the simple function.

$f \ left (x \ right) = a + bx$ where is a constant

$a$ responsible for the displacement of the graph relative to the x-axis.

Thus, the linear model can be expressed through the scalar product

$\ hat {y} _i = f \ left (x_i, \ theta \ right) = \ theta \ cdot \ left (1, x_ {i, 1}, \ dots, x_ {i, n} \ right) ^ T .$

It is convenient to pose the task of finding the optimal value of the parameter vector.

$\ theta$ using residual sum of squares (RSS, residual sum of squares)

$\ theta ^ * = Arg \ min _ {\ theta} \ sum_ {i = 1} ^ {n} \ left (y_i-f \ left (x_i, \ theta \ right) \ right) ^ 2.$

With this formulation of the problem, it becomes possible to find an analytical solution, which is expressed by the following formula:

$\ theta ^ * = \ left (\ tilde {X} ^ T \ cdot \ tilde {X} \ right) ^ {- 1} \ cdot \ tilde {X} ^ T \ cdot Y,$

where is the matrix

$\ tilde {X}$ obtained from the matrix

$X$ by adding to the left a column consisting of units.

As can be seen from the above description, the optimization problem has already been set. So in the future it remains only to apply the selected optimization algorithm to it.

Ordinary Least Squares

 case class OrdinaryLeastSquaresRegression(w: Vector[Real]) extends GeneralizedLinearModel { override def getWeights(): Vector[Real] = w } object OrdinaryLeastSquaresRegression { class Task(input: Seq[Vector[Real]], output: Seq[Real]) extends General.Task { def toOptimizationTask(searchArea: Map[String, (Double, Double)]): (Optimization.Real.Task, InhomogeneousTransformation[Vector[Real], OrdinaryLeastSquaresRegression]) = { val vectorToRegressor = new InhomogeneousTransformation[Vector[Real], OrdinaryLeastSquaresRegression]((w: Vector[Real]) => OrdinaryLeastSquaresRegression(w)) val task = new Optimization.Real.Task( new Function[Real]((w: Vector[Real]) => GeneralizedLinearModel.Metrics.RSS(vectorToRegressor(w), input, output)), searchArea) (task, vectorToRegressor) } } }

Ridge & Lasso Regression

If it is necessary for one reason or another to reduce the degree of variation of the model without its structural change, you can use regularization, which imposes restrictions on the model parameters.

Ridge regression (ridge regression) uses L2 regularization of model parameters:

$\ theta ^ * = Arg \ min _ {\ theta} \ sum_ {i = 1} ^ {n} \ left (y_i-f \ left (x_i, \ theta \ right) \ right) ^ 2 + \ alpha \ left | \ left | \ left (\ theta_1, \ dots, \ theta_n \ right) ^ T \ right | \ right | _2 ^ 2,$

Where

$\ left | \ left | \ left (\ theta_1, \ dots, \ theta_n \ right) ^ T \ right | \ right | _2 = \ sqrt {\ sum_ {j = 1} ^ {n} \ theta_j ^ 2}$ - L2 rate. Parameter

$\ alpha$ responsible for compressing coefficients: with increasing

$\ alpha$ model parameters tend to zero. This is clearly demonstrated on the official website of the scikit package :

As in the case of simple linear regression for a ridge regression, it is possible to analytically express the solution:

$\ theta ^ * = \ left (\ tilde {X} ^ T \ cdot \ tilde {X} + \ alpha \ cdot \ tilde {E} _ {n + 1} \ right) ^ {- 1} \ cdot \ tilde {X} ^ T \ cdot Y,$

Where

$\ tilde {E} _ {n + 1}$ - order matrix

$n + 1$ , in the upper left corner of which is zero.

Ridge regression

 class RidgeRegression(w: Vector[Real], alpha: Double) extends OrdinaryLeastSquaresRegression(w) { } object RidgeRegression { class Task(input: Seq[Vector[Real]], output: Seq[Real]) extends General.Task { def toOptimizationTask(searchArea: Map[String, (Double, Double)], alpha: Double): (Optimization.Real.Task, InhomogeneousTransformation[Vector[Real], RidgeRegression]) = { val vectorToRegressor = new InhomogeneousTransformation[Vector[Real], RidgeRegression]((w: Vector[Real]) => new RidgeRegression(w, alpha)) val task = new Optimization.Real.Task( new Function[Real]((w: Vector[Real]) => GeneralizedLinearModel.Metrics.RSS(vectorToRegressor(w), input, output) + alpha * w.components.filterKeys(_ != "bias").values.map(_ ^ 2.0).reduce(_ + _)), searchArea) (task, vectorToRegressor) } } }

For Lasso Regression, the problem is similar, the difference is that L1 now uses regularization of model parameters:

$\ theta ^ * = Arg \ min _ {\ theta} \ frac {1} {2m} \ cdot \ sum_ {i = 1} ^ {n} \ left (y_i-f \ left (x_i, \ theta \ right) \ right) ^ 2 + \ alpha \ left | \ left | \ left (\ theta_1, \ dots, \ theta_n \ right) ^ T \ right | \ right | _1,$

Where

$\ left | \ left | \ left (\ theta_1, \ dots, \ theta_n \ right) ^ T \ right | \ right | _1 = \ sum_ {j = 1} ^ {n} \ left | \ theta_j \ right |$ - L1 rate

Lasso Regression

 class LassoRegression(w: Vector[Real], alpha: Double) extends OrdinaryLeastSquaresRegression(w) { } object LassoRegression { class Task(input: Seq[Vector[Real]], output: Seq[Real]) extends General.Task { def toOptimizationTask(searchArea: Map[String, (Double, Double)], alpha: Double): (Optimization.Real.Task, InhomogeneousTransformation[Vector[Real], LassoRegression]) = { val vectorToRegressor = new InhomogeneousTransformation[Vector[Real], LassoRegression]((w: Vector[Real]) => new LassoRegression(w, alpha)) val task = new Optimization.Real.Task( new Function[Real]((w: Vector[Real]) => GeneralizedLinearModel.Metrics.RSS(vectorToRegressor(w), input, output) / (2.0 * input.length) + alpha * w.components.filterKeys(_ != "bias").values.map(Algebra.abs(_)).reduce(_ + _)), searchArea) (task, vectorToRegressor) } } }

Thus, from the point of view of optimization, Ridge regression and Lasso Regression differ only in the way of setting the minimization problem.

Cat Swarm Optimization

As it became clear from the name, the algorithm imitates the behavior of animals of the cat family (including domestic cats). What can you say about your pet? He can play the role of a sweet lazybird (although we actually know what kind of insidious thoughts swarm in his head), can imagine himself a great (but cautious) researcher, and can simply rush around the apartment for a non-existent (or rather invisible to you) opponent. We will leave the cats lying and immovable like the Great Wall of China alone, let them rest, but let's dwell on the last two actions. For any optimization algorithm, it is good to have several search stages: global , during which we should fall into the region of attraction of a local extremum (and ideally global), and specifying , during which we should move from a neighborhood of extremum closer to its true location. Nothing like? In fact, cats chasing an invisible enemy are obvious candidates for the global search procedure, but neat researchers will help you find the best place to rest. These two heuristics underlie the Cat Swarm Optimization algorithm. For the full picture, it remains to imagine that you have not one cat, but a whole flock. But it's even better, right?

The pseudocode of the algorithm is presented below:

1. N .
2. . "" ( , ).
3. ( ).
4. MR.
5. . 2.

If we try to formalize all ideas, then in mathematical expression we have the following theses:

each cat is associated with a point in the study space,
fitness cat - the value of the function to be optimized at the point corresponding to its current position,
Each cat can be in one of two modes: search or chase.
iteration of the algorithm involves the implementation of the process of moving in accordance with the mode in which the cat is located.

Parameters affecting the operation of the algorithm

 class CatSwarmOptimization(numberOfCats: Int, MR: Double, SMP: Int, SRD: Double, CDC: Int, SPC: Boolean, velocityConstant: Double, velocityRatio: Double, generator: DiscreteUniform with ContinuousUniform = Kaimere.Tools.Random.GoRN) extends Algorithm

number of cats ( numberOfCats ) - the more cats, the longer the algorithm runs (if it is limited by the number of iterations), but also the potentially greater accuracy of the answer found,
proportion of cats in the search mode and chase ( MR ) - this parameter allows you to direct the search according to the strategy that the user considers preferable; For example, if you know the neighborhood in which the global optimum lies, then it is logical to initialize the population in this neighborhood and support a larger number of male researchers in the population to clarify the initial solution,
the number of attempts for the search mode ( SMP ) - how many different offsets will be made by the cat-researcher; large values of this parameter increase the time of one iteration, but allow to increase the accuracy of determining the position of the extremum,
bias share for the search mode ( SRD ) - the fraction by which the research cat shifts from its current position, large values shift the refinement search to the global one,
the number of directions to be searched ( CDC ) - this parameter controls the number of measurements that will change at the current position of the cat in search mode; smaller values make the search coordinate-wise,
the desire to stay at the old place ( SPC ) is a boolean variable that allows you to choose whether the research cat can remain at the current place,
velocity constant ( velocityConstant ) - determines the degree of tilting of the cat during the chase; larger values change the current velocity vector of the cat faster,
maximum speed ( velocityRatio ) - after all, you are the master of the house, so if someone from the cats overclocked too much, then you may well shout at him,
so he slowed down, so This parameter limits the maximum speed of movement of cats.

So, what are the modes in which cats can be? It's simple.

During the search mode from the current position

$l \ in \ mathbb {R} ^ n$ new possible positions are generated

$\ left \ {l ^ k \ right \} _ {k = 1} ^ N$ . The new position is selected by the roulette method , where the probability of sampling

$k$ th position is determined by the formula

$p ^ k = \ frac {\ displaystyle \ left | \ max_ {j = 1, \ dots, N} f \ left (l ^ j \ right) -f \ left (l ^ k \ right) \ right |} {\ displaystyle \ max_ {j = 1, \ dots, N} f \ left (l ^ j \ right) - \ min_ {j = 1, \ dots, N} f \ left (l ^ j \ right)}.$

For the maximization problem, the maximum in the numerator should be replaced by a minimum.

Now a little about how the chase is implemented. In order not to frighten the host, one should not chase after a fictional enemy, but rather a real one - a cat that has currently found a better place for itself (its position will be indicated by

$\ tilde {l}$ ). The speed of the cat varies according to the following rule:

$v ^ k = v ^ k + \ xi \ cdot c \ cdot \ left (\ tilde {l} -l ^ k \ right),$

Where

$\ xi \ sim U \ left (0,1 \ right)$ - a uniformly distributed random variable, and

$c$ - speed constant (the one that is responsible for turning). The new position of the cat is calculated as follows:

$l ^ k = l ^ k + v ^ k$ .

Cat implementation

 case class Cat(location: Vector[Real], velocity: Vector[Real])(implicit generator: DiscreteUniform with ContinuousUniform) { def getFromSeries[T](data: Seq[T], n: Int, withReturn: Boolean): Seq[T] = withReturn match { case true => Seq.fill(n)(generator.getUniform(0, data.size - 1)).map(x => data(x)) case false => data.sortBy(_ => generator.getUniform(0.0, 1.0)).take(n) } def seek(task: Task, SPC: Boolean, SMP: Int, CDC: Int, SRD: Double): Cat = { val newLocations = (if (SPC) Seq(location) else Seq()) ++ Seq.fill(SMP - (if (SPC) 1 else 0))(location) .map { loc => val ratio = getFromSeries(task.searchArea.keys.toSeq, CDC, false) .map { key => (key, generator.getUniform(1.0 - SRD, 1.0 + SRD)) }.toMap (loc * ratio).constrain(task.searchArea) } val fitnessValues = newLocations.map(task(_)).map(_.value) val newLocation = if (fitnessValues.tail.forall(_ == fitnessValues.head)) newLocations(generator.getUniform(0, SMP - 1)) else { val maxFitness = fitnessValues.max val minFitness = fitnessValues.min val probabilities = fitnessValues.map(v => (maxFitness - v) / (maxFitness - minFitness)) val roulette = 0.0 +: probabilities.tail .foldLeft(Seq(probabilities.head)) { case (prob, curr) => (curr + prob.head) +: prob } .reverse val chosen = generator.getUniform(0.0, roulette.last) val idChosen = roulette.sliding(2).indexWhere{ case Seq(a, b) => a <= chosen && chosen <= b} newLocations(idChosen) } new Cat(newLocation, newLocation - this.location) } def updateVelocity(bestCat: Cat, velocityConstant: Double, maxVelocity: Map[String, Double]): Vector[Real] = { val newVelocity = this.velocity + (bestCat.location - this.location) * velocityConstant * generator.getUniform(0.0, 1.0) Vector(newVelocity.components .map { case (key, value) => if (value.value > maxVelocity(key)) (value, Real(maxVelocity(key))) if (value.value < -maxVelocity(key)) (value, Real(-maxVelocity(key))) (key, value) }) } def trace(task: Task, bestCat: Cat, velocityConstant: Double, maxVelocity: Map[String, Double]): Cat = { val newVelocity = this.updateVelocity(bestCat, velocityConstant, maxVelocity) val newLocation = (location + newVelocity).constrain(task.searchArea) new Cat(newLocation, newVelocity) } def move(mode: Int, task: Task, bestCat: Cat, SPC: Boolean, SMP: Int, CDC: Int, SRD: Double, velocityConstant: Double, maxVelocity: Map[String, Double]): Cat = mode match { case 0 => seek(task, SPC, SMP, CDC, SRD) case 1 => trace(task, bestCat, velocityConstant, maxVelocity) } }

Now that all the highlights of the algorithm are listed, it's time to figure out whether the cats can build a regression?

So, can cats build regression or not?

Let's generate several test datasets (in the notebook there is also a calculation from the regression models using scikit):

one-dimensional linear dependence without noise: y=2+3 cdotx ,
- Ordinary Least Squares model parameters: $\ theta = \ left (2.001, 3.001 \ right) ^ T$ ,
- Ridge Regression Model Parameters ( $\ alpha = 0.7$ ): $\ theta = \ left (2.0008, 2.9993 \ right) ^ T$ ,
- Lasso Regression Model Parameters ( $\ alpha = 0.7$ ): $\ theta = \ left (1.997, 2.976 \ right) ^ T$ ,
multidimensional linear dependence without noise: y=x1−2 cdotx2+3 cdotx3−4 cdotx4+5 cdotx5 ,
- Ordinary Least Squares model parameters: $\ theta = \ left (0.018, 0.999, -1.9999, 3.005, -3.994, 5.002 \ right) ^ T$ ,
- Ridge Regression Model Parameters ( $\ alpha = 0.7$ ): $\ theta = \ left (0.025, 1.003, -2.002, 2.998, -3.997, 4.999 \ right) ^ T$ ,
- Lasso Regression Model Parameters ( $\ alpha = 0.7$ ): $\ theta = \ left (0.011, 0.972, -1.989, 2.995, -3.969, 4.971 \ right) ^ T$ ,
one-dimensional linear dependence with noise: y=2+3 cdotx+ xi ,
- Ordinary Least Squares model parameters: $\ theta = \ left (2.124, 3.06 \ right) ^ T$ ,
- Ridge Regression Model Parameters ( $\ alpha = 0.7$ ): $\ theta = \ left (2.121, 3.061 \ right) ^ T$ ,
- Lasso Regression Model Parameters ( $\ alpha = 0.7$ ): $\ theta = \ left (2.093, 3.039 \ right) ^ T$ ,
multidimensional linear dependence with noise: y=x1−2 cdotx2+3 cdotx3+ xi ,
- Ordinary Least Squares model parameters: $\ theta = \ left (0.073, 0.889, -2.005, 2.949 \ right) ^ T$ ,
- Ridge Regression Model Parameters ( $\ alpha = 0.7$ ): $\ theta = \ left (-0.068, 0.884, -2.004, 2.945 \ right) ^ T$ ,
- Lasso Regression Model Parameters ( $\ alpha = 0.7$ ): $\ theta = \ left (-0.082, 0.857, -1.982, 2.921 \ right) ^ T$ ,

Where

$\ xi \ sim U \ left (-5,5 \ right)$ - uniformly distributed random variable.

Results obtained using scikit for noise data

For one-dimensional linear dependence with noise:

For multidimensional linear dependence with noise:

It can be seen that the values found are quite close to the results obtained using scikit.

Conclusion

Thus, the given formulation of the problem, despite its model and simplicity, made it possible to get acquainted with the meta-heuristic algorithm of Cat Swarm Optimization . When developing optimization procedures, it is often useful to work on observing the world around us, because, as you know, “nature knows better” .

References and literature

Project code
Description of linear regression models: [ 1 ], [ 2 ]
Machine Learning Book
Cat Swarm Optimization Article

Source: https://habr.com/ru/post/328760/

All Articles

Do cats know how to build regression?

Linear regression models and how to reduce them to an optimization problem

Ordinary Least Squares

Ridge & Lasso Regression

Cat Swarm Optimization

So, can cats build regression or not?

Conclusion

More articles: