
Welford method and one-dimensional linear regression

One-dimensional linear regression is one of the simplest regression methods, and one of the simplest machine learning methods in general: it describes a linear dependence of the observed quantity on a single feature. In general machine learning problems one usually deals with a large number of different features; in that setting, one-dimensional linear regression picks the feature that yields the best correlation with the target function.


In the previous post of this series, we discussed the accuracy of computing means and covariances and got acquainted with the Welford method, which in many cases avoids computational errors in these problems. Today we will look at a practical application of the Welford method: the one-dimensional linear regression problem.



Contents

1. One-dimensional linear regression
2. Solution for centered samples
3. Solution in the general case
4. Application of the Welford method
5. Experimental comparison of methods
Conclusion
Literature



1. One-dimensional linear regression


In the one-dimensional linear regression problem we assume that there are two sequences of real numbers: features $x$ and targets $y$, together with a vector of corresponding weights $w$. As usual, we will assume that these sequences potentially contain an infinite number of elements, but that at the moment only the first $n$ elements of each sequence are available.


Our task is to recover the linear relationship between features and targets, that is, to build a linear decision function $f: \mathbb{R} \rightarrow \mathbb{R}$:


$$f(x_i) = \alpha \cdot x_i + \beta$$


The parameters are chosen so that the root-mean-square loss functional is minimized:


$$Q(f, x, y, w) = \sqrt{\frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \cdot (f(x_i) - y_i)^2}$$


For analysis, it is easier to work with a formula free of the radical and the normalization:


$$Q_1(f, x, y, w) = \sum_{i=1}^{n} w_i \cdot (f(x_i) - y_i)^2 = \sum_{i=1}^{n} w_i \cdot (\alpha \cdot x_i + \beta - y_i)^2$$


Since the functionals $Q$ and $Q_1$ attain their minima at the same points, this substitution is valid.
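
As a reference point, here is a minimal sketch (my own illustration, not code from the article's repository; all names are made up) of how $Q_1$ could be computed directly for a candidate pair of parameters:

#include <cstddef>
#include <vector>

// Illustrative sketch: the weighted squared loss Q1 for given alpha and beta.
// The function and parameter names are hypothetical.
double WeightedSquaredLoss(const std::vector<double>& x,
                           const std::vector<double>& y,
                           const std::vector<double>& w,
                           const double alpha,
                           const double beta) {
    double loss = 0.;
    for (size_t i = 0; i < x.size(); ++i) {
        const double residual = alpha * x[i] + beta - y[i];
        loss += w[i] * residual * residual;
    }
    return loss;
}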


2. Solution for centered samples


For the loss functional $Q_1$ it is easy to write out the partial derivatives with respect to $\alpha$ and $\beta$:


$$\frac{\partial Q_1(f, x, y, w)}{\partial \alpha} = 2 \cdot \sum_{i=1}^{n} w_i x_i (\alpha x_i + \beta - y_i)$$


$$\frac{\partial Q_1(f, x, y, w)}{\partial \beta} = 2 \cdot \sum_{i=1}^{n} w_i (\alpha x_i + \beta - y_i)$$


If we equate them to zero, we get:


$$\alpha \cdot \sum_{i=1}^{n} w_i x_i^2 + \beta \cdot \sum_{i=1}^{n} w_i x_i - \sum_{i=1}^{n} w_i x_i y_i = 0$$


$$\alpha = \frac{\sum_{i=1}^{n} w_i x_i y_i - \beta \cdot \sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i x_i^2}$$


$$\beta = \frac{\sum_{i=1}^{n} w_i y_i - \alpha \cdot \sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$


An important digression. Equating the derivatives to zero is legitimate here, because:
1. The loss functional is convex in the optimized parameters, so any local optimum is also a global optimum.
2. The loss functional is a quadratic (an upward-opening paraboloid) in the optimized parameters, so the stationary point found is a minimum.
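
As a quick sanity check of these formulas, consider a toy example (my own illustration): $x = (1, 2, 3)$, $y = (3, 5, 7)$ with unit weights, so that $y_i = 2 x_i + 1$ exactly. Here $\sum w_i x_i = 6$, $\sum w_i y_i = 15$, $\sum w_i x_i^2 = 14$ and $\sum w_i x_i y_i = 34$. The expression for $\beta$ gives $\beta = (15 - 6\alpha)/3 = 5 - 2\alpha$, and substituting it into the expression for $\alpha$ yields $14\alpha = 34 - 6 \cdot (5 - 2\alpha)$, i.e. $\alpha = 2$ and $\beta = 1$, as expected.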


If the optimal parameter $\beta$ were equal to zero, finding a solution would not be difficult. It turns out that centering, a standard sample preprocessing technique in machine learning, leads to exactly this effect. Indeed, consider the problem for the centered variables (denoted here with primes):


$$x'_i = x_i - \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$


$$y'_i = y_i - \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$


The sum of the weighted features is now zero, as is the sum of the weighted targets:


$$\sum_{k=1}^{n} w_k x'_k = \sum_{k=1}^{n} w_k \cdot \Big( x_k - \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \Big) = \sum_{k=1}^{n} w_k x_k - \sum_{i=1}^{n} w_i x_i = 0$$


$$\sum_{k=1}^{n} w_k y'_k = \sum_{k=1}^{n} w_k \cdot \Big( y_k - \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \Big) = \sum_{k=1}^{n} w_k y_k - \sum_{i=1}^{n} w_i y_i = 0$$


Then the optimal value of the free parameter is zero:


$$\beta = \frac{\sum_{i=1}^{n} w_i y'_i - \alpha \cdot \sum_{i=1}^{n} w_i x'_i}{\sum_{i=1}^{n} w_i} = 0$$


This means that the optimal value of the parameter $\alpha$ is easy to find:


$$\alpha = \frac{\sum_{i=1}^{n} w_i x'_i y'_i}{\sum_{i=1}^{n} w_i (x'_i)^2}$$


3. Solution in the general case


Now let's return to the general case of non-centered data. If $f'$ is the decision function for the centered case, whose values are given by the formula


$$f'(x'_k) = \alpha \cdot x'_k$$


and it approximates the value $y'_k = y_k - \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$, then the following decision function approximates the value $y_k$:


$$f(x_k) = \alpha \cdot \Big( x_k - \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \Big) + \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} = \alpha \cdot x_k + \Big( \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} - \alpha \cdot \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \Big)$$


Therefore, the solution of the original one-dimensional linear regression problem can be written as follows:


$$\alpha = \frac{\sum_{i=1}^{n} w_i (x_i - m_n^{wx})(y_i - m_n^{wy})}{\sum_{i=1}^{n} w_i (x_i - m_n^{wx})^2}$$


$$\beta = m_n^{wy} - \alpha \cdot m_n^{wx}$$


Here we use the notation introduced in the previous article for weighted means:


$$m_n^{wx} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$


$$m_n^{wy} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$
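
To make these formulas concrete, here is a minimal two-pass sketch (my own illustration, not the article's code; the function name and the assumptions about the input are mine) that first computes the weighted means and then applies the expressions for $\alpha$ and $\beta$ above:

#include <cstddef>
#include <vector>

// Illustrative sketch of the closed-form solution: pass 1 computes the weighted
// means, pass 2 the weighted covariance and variance; alpha and beta follow.
// Assumes a non-empty sample with a positive sum of weights and a non-constant feature.
void SolveSLRTwoPass(const std::vector<double>& x,
                     const std::vector<double>& y,
                     const std::vector<double>& w,
                     double& alpha,
                     double& beta) {
    double sumWeights = 0.;
    double meanX = 0.;
    double meanY = 0.;
    for (size_t i = 0; i < x.size(); ++i) {
        sumWeights += w[i];
        meanX += w[i] * x[i];
        meanY += w[i] * y[i];
    }
    meanX /= sumWeights;
    meanY /= sumWeights;

    double covariance = 0.;
    double variance = 0.;
    for (size_t i = 0; i < x.size(); ++i) {
        covariance += w[i] * (x[i] - meanX) * (y[i] - meanY);
        variance += w[i] * (x[i] - meanX) * (x[i] - meanX);
    }

    alpha = covariance / variance;
    beta = meanY - alpha * meanX;
}

Such a two-pass scheme requires storing the whole sample; the point of the Welford-based approach discussed below is to obtain the same quantities in a single pass.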


There is another way to see that this transition from centered to non-centered data is correct. If the solution for the centered data is optimal, then the parameters $\alpha$ and $\beta$ deliver the minimum of the loss functional $Q_1$:


$$Q_1(f', x', y', w) = \sum_{i=1}^{n} w_i \cdot (\alpha \cdot x'_i + \beta - y'_i)^2$$


Now let's change the variables, returning to the non-centered data (recall that for the centered data the optimal $\beta$ is zero):


$$Q_1(f, x, y, w) = \sum_{i=1}^{n} w_i \cdot \Big( \alpha \cdot (x_i - m_n^{wx}) - y_i + m_n^{wy} \Big)^2 =$$


$$= \sum_{i=1}^{n} w_i \cdot \Big( \alpha \cdot x_i + (m_n^{wy} - \alpha \cdot m_n^{wx}) - y_i \Big)^2$$


The resulting expression is the value of the loss functional $Q_1$ on the non-centered data for the formulas for $\alpha$ and $\beta$ obtained above. At the same time, this value of the functional is the minimal one, so the problem is indeed solved correctly!


4. Application of the Welford method


Now note that computing the parameter $\alpha$ involves the same covariances that we computed in the previous article. In fact, using the notation from that article, we can write:


$$\alpha = \frac{D_n^{wxy}}{D_n^{wxx}} = \frac{C_n^{wxy}}{C_n^{wxx}}$$


This means that to compute the regression coefficient it is enough to run the Welford covariance computation twice: once for the covariance of features with targets and once for the variance of features. In the course of these computations we also obtain the mean values needed to compute the free coefficient (the intercept).


The code for adding a new element to the sample updates the means and variances for the features and targets, as well as the covariance between features and targets:


void TWelfordSLRSolver::Add(const double feature, const double goal, const double weight) {
    SumWeights += weight;
    if (!SumWeights) {
        return;
    }

    const double weightedFeatureDiff = weight * (feature - FeaturesMean);
    const double weightedGoalDiff = weight * (goal - GoalsMean);

    // Welford-style updates of the means and of the weighted sums of squared deviations.
    FeaturesMean += weightedFeatureDiff / SumWeights;
    FeaturesDeviation += weightedFeatureDiff * (feature - FeaturesMean);

    GoalsMean += weightedGoalDiff / SumWeights;
    GoalsDeviation += weightedGoalDiff * (goal - GoalsMean);

    // Welford-style update of the weighted sum of cross-deviations.
    Covariation += weightedFeatureDiff * (goal - GoalsMean);
}

And here is the method that finds the optimal parameters of the one-dimensional linear regression once all the elements have been added:


template <typename TFloatType>
void TWelfordSLRSolver::Solve(TFloatType& factor, TFloatType& intercept, const double regularizationParameter = 0.1) const {
    if (!FeaturesDeviation) {
        // Degenerate case: the feature is constant, so the best model is a constant prediction.
        factor = 0.;
        intercept = GoalsMean;
        return;
    }

    factor = Covariation / (FeaturesDeviation + regularizationParameter);
    intercept = GoalsMean - factor * FeaturesMean;
}

The GoalsDeviation value is not used here, but we will need it in future articles.
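
For illustration, a possible usage of this class might look as follows. The Add and Solve signatures are taken from the snippets above; everything else (default construction, zero-initialized members, the surrounding data vectors) is my assumption and is not shown in the article:

// Hypothetical usage sketch; assumes the solver is default-constructible with
// zero-initialized accumulators and that features, goals and weights are
// std::vector<double> of equal length.
TWelfordSLRSolver solver;
for (size_t i = 0; i < features.size(); ++i) {
    solver.Add(features[i], goals[i], weights[i]);
}

double factor = 0.;
double intercept = 0.;
solver.Solve(factor, intercept);

// Prediction for a new feature value:
const double prediction = factor * newFeature + intercept;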


Combining all the computations within a single class avoids some overhead. For instance, if the implementation used two separate objects for storing means and three objects for storing covariances (features with features, targets with targets, features with targets), the sum of weights would be updated five times for every element of the sample.


5. Experimental comparison of methods


For a practical comparison, I wrote a program that implements various ways of solving one-dimensional and multidimensional linear regression problems. Multidimensional regression will be discussed in the following articles; for now we focus on the one-dimensional case.


As usual, we will compare "naive" methods, methods based on Kahan summation, and methods based on the Welford method.


"Naive" methods directly apply the formula to calculate the covariances :


void Add(const double feature, const double goal, const double weight = 1.) {
    SumFeatures += feature * weight;
    SumSquaredFeatures += feature * feature * weight;

    SumGoals += goal * weight;
    SumSquaredGoals += goal * goal * weight;

    SumProducts += goal * feature * weight;
    SumWeights += weight;
}

The class is a template and has specializations with double and TKahanAccumulator accumulators.
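
The corresponding solving step is not shown here; a sketch of how $\alpha$ and $\beta$ could be recovered from these accumulated sums (my own illustration following the closed-form formulas above, written in the same member-function style; Factor and Intercept are hypothetical names) might look like this:

// Illustrative sketch, not repository code: alpha and beta from the raw sums.
// Assumes SumWeights > 0 and a non-constant feature.
double Factor() const {
    const double meanFeature = SumFeatures / SumWeights;
    const double meanGoal = SumGoals / SumWeights;
    const double covariance = SumProducts / SumWeights - meanFeature * meanGoal;
    const double variance = SumSquaredFeatures / SumWeights - meanFeature * meanFeature;
    return covariance / variance;
}

double Intercept() const {
    return SumGoals / SumWeights - Factor() * SumFeatures / SumWeights;
}

It is exactly the subtraction of two close large numbers in expressions like SumSquaredFeatures / SumWeights - meanFeature * meanFeature that makes this scheme numerically fragile.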


In addition, there is the TTypedBestSLRSolver class, which selects the best feature for building a one-dimensional regression model. This is done very simply: the one-dimensional linear regression problem is solved for each of the features, and then the best of the resulting models is selected.
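
Schematically, the selection could look like the following sketch (my own illustration, not the repository code; the data layout, the loop variables, and using the weighted training error as the selection criterion are assumptions):

#include <limits>

// Illustrative sketch: fit a one-dimensional regression on every feature and
// keep the model with the smallest weighted squared error on the training data.
size_t bestFeature = 0;
double bestError = std::numeric_limits<double>::max();
double bestFactor = 0.;
double bestIntercept = 0.;

for (size_t featureIndex = 0; featureIndex < featureCount; ++featureIndex) {
    TWelfordSLRSolver solver;
    for (size_t i = 0; i < instanceCount; ++i) {
        solver.Add(features[i][featureIndex], goals[i], weights[i]);
    }

    double factor = 0.;
    double intercept = 0.;
    solver.Solve(factor, intercept);

    double error = 0.;
    for (size_t i = 0; i < instanceCount; ++i) {
        const double residual = factor * features[i][featureIndex] + intercept - goals[i];
        error += weights[i] * residual * residual;
    }

    if (error < bestError) {
        bestError = error;
        bestFeature = featureIndex;
        bestFactor = factor;
        bestIntercept = intercept;
    }
}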


To test the implemented methods, we use model data from the LIAC collection. For convenience, some of the datasets are placed in the data directory in a format that the program understands.


The data in each problem is "corrupted" in a simple way: feature and target values are multiplied by some number, after which another number is added to them. In this way we obtain a case that is problematic from a computational point of view: mean values that are large compared with the spread of the values.
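
In code, such a transformation might look like this simple sketch (the variable names injureFactor and injureOffset are my guesses based on the program output below, not necessarily the names used in the repository):

// Illustrative sketch of the "corruption" transform: every feature and target
// value is scaled and shifted, which inflates the means relative to the spread.
for (size_t i = 0; i < features.size(); ++i) {
    features[i] = features[i] * injureFactor + injureOffset;
    goals[i] = goals[i] * injureFactor + injureOffset;
}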


In the research-bslr mode, the sample is transformed in this way several times in a row, and each time a cross-validation procedure is run. The result of the test is the mean value of the coefficient of determination over the test folds.
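
For reference, the coefficient of determination reported here is the usual $R^2$; a minimal sketch of its computation for a fitted model (my own illustration, unweighted for simplicity, not the benchmark code) is:

#include <cstddef>
#include <vector>

// Illustrative sketch: R^2 = 1 - SS_res / SS_tot for the model f(x) = factor * x + intercept.
double DeterminationCoefficient(const std::vector<double>& x,
                                const std::vector<double>& y,
                                const double factor,
                                const double intercept) {
    double meanY = 0.;
    for (const double value : y) {
        meanY += value;
    }
    meanY /= y.size();

    double residualSum = 0.;
    double totalSum = 0.;
    for (size_t i = 0; i < x.size(); ++i) {
        const double prediction = factor * x[i] + intercept;
        residualSum += (y[i] - prediction) * (y[i] - prediction);
        totalSum += (y[i] - meanY) * (y[i] - meanY);
    }
    return 1. - residualSum / totalSum;
}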


For example, for the kin8nm dataset the results look as follows:


injure factor: 1       injure offset: 1
fast_bslr                time: 0.001322   R^2: 0.27359
kahan_bslr               time: 0.002999   R^2: 0.27359
welford_bslr             time: 0.00432    R^2: 0.27359
normalized_welford_bslr  time: 0.004288   R^2: 0.27359

injure factor: 0.1     injure offset: 10
fast_bslr                time: 0.001256   R^2: 0.27359
kahan_bslr               time: 0.002948   R^2: 0.27359
welford_bslr             time: 0.004303   R^2: 0.27359
normalized_welford_bslr  time: 0.004275   R^2: 0.27359

injure factor: 0.01    injure offset: 100
fast_bslr                time: 0.001283   R^2: 0.27359
kahan_bslr               time: 0.003015   R^2: 0.27359
welford_bslr             time: 0.004304   R^2: 0.27359
normalized_welford_bslr  time: 0.004285   R^2: 0.27359

injure factor: 0.001   injure offset: 1000
fast_bslr                time: 0.001262   R^2: 0.27324
kahan_bslr               time: 0.002977   R^2: 0.27359
welford_bslr             time: 0.004329   R^2: 0.27359
normalized_welford_bslr  time: 0.00428    R^2: 0.27359

injure factor: 0.0001  injure offset: 10000
fast_bslr                time: 0.00128    R^2: -59.271
kahan_bslr               time: 0.003009   R^2: -0.0005269
welford_bslr             time: 0.004304   R^2: 0.27359
normalized_welford_bslr  time: 0.00428    R^2: 0.27359

full learning time:
fast_bslr                0.006403s
kahan_bslr               0.014948s
welford_bslr             0.02156s
normalized_welford_bslr  0.021408s

In this case, scaling all the values in the sample down by a factor of ten thousand while simultaneously adding 10,000 to them makes the standard algorithm unusable, even with Kahan summation. Similar results are obtained on other datasets, including those that come up in real production use.


Conclusion


So, today we discussed the one-dimensional linear regression problem, saw how the analytical solution to this problem is obtained, and how the Welford method can be used to find it.


The Welford method makes the solution much more robust to potential problems in the data. However, it turns out to be 2-4 times slower than the standard algorithm, so in practice you need to decide for yourself what is more important at the moment: being independent of possible problems in the data, or working as fast as possible.


If you need to build models on varied data many times and there is no way to check the quality of each resulting model, I would recommend using the Welford method.


In the next article we will talk about the use of the Welford method for solving the problem of multidimensional linear regression.


Literature


  1. habrahabr.ru: Exact calculation of averages and covariances by the Welford method
  2. github.com: Linear regression problem solvers with different computation methods
  3. github.com: The Collection of ARFF datasets of the Connection Artificial Intelligence Laboratory (LIAC)
  4. machinelearning.ru: One-dimensional linear regression


Source: https://habr.com/ru/post/335522/

