
Welford method and one-dimensional linear regression

One-dimensional linear regression is one of the simplest regression methods, and one of the simplest machine learning methods in general: it describes a linear dependence of the observed quantity on a single feature. In general machine learning problems one usually deals with a large number of different features; in that setting, one-dimensional linear regression picks the feature that yields the best correlation with the target function.


In the previous post of this series, we discussed the accuracy of computing means and covariances and got acquainted with the Welford method, which in many cases avoids computational errors in these problems. Today we will look at a practical application of the Welford method: the one-dimensional linear regression problem.



Contents

1. One-dimensional linear regression
2. Solution for centered samples
3. Solution in the general case
4. Application of the Welford method
5. Experimental comparison of methods
Conclusion
Literature



1. One-dimensional linear regression


In the one-dimensional linear regression problem we assume that there are two sequences of real numbers: features $x$ and targets $y$, together with a vector of corresponding weights $w$. As usual, we will assume that these sequences potentially contain an infinite number of elements, but that at the moment only the first $n$ elements of each sequence are available.


Our task is to recover the linear relationship between features and targets, that is, to build a linear decision function $f: \mathbb{R} \rightarrow \mathbb{R}$:


$$f(x_i) = \alpha \cdot x_i + \beta$$


The parameters are chosen so that the root-mean-square loss functional is minimized:


$$Q(f, x, y, w) = \sqrt{\frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \cdot (f(x_i) - y_i)^2}$$


For analysis, it is easier to work with a formula free of the radical and the normalization:


$$Q_1(f, x, y, w) = \sum_{i=1}^{n} w_i \cdot (f(x_i) - y_i)^2 = \sum_{i=1}^{n} w_i \cdot (\alpha \cdot x_i + \beta - y_i)^2$$


Since the functionals $Q$ and $Q_1$ attain their minima at the same points, this substitution is valid.
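
As a reference point, here is a minimal sketch (my own illustration, not code from the article's repository; all names are made up) of how $Q_1$ could be computed directly for a candidate pair of parameters:

#include <cstddef>
#include <vector>

// Illustrative sketch: the weighted squared loss Q1 for given alpha and beta.
// The function and parameter names are hypothetical.
double WeightedSquaredLoss(const std::vector<double>& x,
                           const std::vector<double>& y,
                           const std::vector<double>& w,
                           const double alpha,
                           const double beta) {
    double loss = 0.;
    for (size_t i = 0; i < x.size(); ++i) {
        const double residual = alpha * x[i] + beta - y[i];
        loss += w[i] * residual * residual;
    }
    return loss;
}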


2. Solution for centered samples


For the loss functional $Q_1$ it is easy to write out the partial derivatives with respect to $\alpha$ and $\beta$:


$$\frac{\partial Q_1(f, x, y, w)}{\partial \alpha} = 2 \cdot \sum_{i=1}^{n} w_i x_i (\alpha x_i + \beta - y_i)$$


$$\frac{\partial Q_1(f, x, y, w)}{\partial \beta} = 2 \cdot \sum_{i=1}^{n} w_i (\alpha x_i + \beta - y_i)$$


If we equate them to zero, we get:


$$\alpha \cdot \sum_{i=1}^{n} w_i x_i^2 + \beta \cdot \sum_{i=1}^{n} w_i x_i - \sum_{i=1}^{n} w_i x_i y_i = 0$$


$$\alpha = \frac{\sum_{i=1}^{n} w_i x_i y_i - \beta \cdot \sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i x_i^2}$$


$$\beta = \frac{\sum_{i=1}^{n} w_i y_i - \alpha \cdot \sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$


An important digression. Equating the derivatives to zero is legitimate here, because:
1. The loss functional is convex in the optimized parameters, so any local optimum is also a global optimum.
2. The loss functional is a quadratic (an upward-opening paraboloid) in the optimized parameters, so the stationary point found is a minimum.
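
As a quick sanity check of these formulas, consider a toy example (my own illustration): $x = (1, 2, 3)$, $y = (3, 5, 7)$ with unit weights, so that $y_i = 2 x_i + 1$ exactly. Here $\sum w_i x_i = 6$, $\sum w_i y_i = 15$, $\sum w_i x_i^2 = 14$ and $\sum w_i x_i y_i = 34$. The expression for $\beta$ gives $\beta = (15 - 6\alpha)/3 = 5 - 2\alpha$, and substituting it into the expression for $\alpha$ yields $14\alpha = 34 - 6 \cdot (5 - 2\alpha)$, i.e. $\alpha = 2$ and $\beta = 1$, as expected.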


If the optimal parameter $\beta$ were equal to zero, finding a solution would not be difficult. It turns out that centering, a standard sample preprocessing technique in machine learning, leads to exactly this effect. Indeed, consider the problem for the centered variables (denoted here with primes):


$$x'_i = x_i - \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$


$$y'_i = y_i - \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$


The sum of the weighted features is now zero, as is the sum of the weighted targets:


$$\sum_{k=1}^{n} w_k x'_k = \sum_{k=1}^{n} w_k \cdot \Big( x_k - \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \Big) = \sum_{k=1}^{n} w_k x_k - \sum_{i=1}^{n} w_i x_i = 0$$


$$\sum_{k=1}^{n} w_k y'_k = \sum_{k=1}^{n} w_k \cdot \Big( y_k - \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \Big) = \sum_{k=1}^{n} w_k y_k - \sum_{i=1}^{n} w_i y_i = 0$$


Then the optimal value of the free parameter is zero:


$$\beta = \frac{\sum_{i=1}^{n} w_i y'_i - \alpha \cdot \sum_{i=1}^{n} w_i x'_i}{\sum_{i=1}^{n} w_i} = 0$$


This means that the optimal value of the parameter $\alpha$ is easy to find:


$$\alpha = \frac{\sum_{i=1}^{n} w_i x'_i y'_i}{\sum_{i=1}^{n} w_i (x'_i)^2}$$


3. Solution in the general case


Now let's return to the general case of non-centered data. If $f'$ is the decision function for the centered case, whose values are given by the formula


$$f'(x'_k) = \alpha \cdot x'_k$$


and it approximates the value $y'_k = y_k - \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$, then the following decision function approximates the value $y_k$:


$$f(x_k) = \alpha \cdot \Big( x_k - \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \Big) + \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} = \alpha \cdot x_k + \Big( \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} - \alpha \cdot \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \Big)$$


Therefore, the solution of the original one-dimensional linear regression problem can be written as follows:


$$\alpha = \frac{\sum_{i=1}^{n} w_i (x_i - m_n^{wx})(y_i - m_n^{wy})}{\sum_{i=1}^{n} w_i (x_i - m_n^{wx})^2}$$


$$\beta = m_n^{wy} - \alpha \cdot m_n^{wx}$$


Here we use the notation introduced in the previous article for weighted means:


$$m_n^{wx} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$


$$m_n^{wy} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$
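
To make these formulas concrete, here is a minimal two-pass sketch (my own illustration, not the article's code; the function name and the assumptions about the input are mine) that first computes the weighted means and then applies the expressions for $\alpha$ and $\beta$ above:

#include <cstddef>
#include <vector>

// Illustrative sketch of the closed-form solution: pass 1 computes the weighted
// means, pass 2 the weighted covariance and variance; alpha and beta follow.
// Assumes a non-empty sample with a positive sum of weights and a non-constant feature.
void SolveSLRTwoPass(const std::vector<double>& x,
                     const std::vector<double>& y,
                     const std::vector<double>& w,
                     double& alpha,
                     double& beta) {
    double sumWeights = 0.;
    double meanX = 0.;
    double meanY = 0.;
    for (size_t i = 0; i < x.size(); ++i) {
        sumWeights += w[i];
        meanX += w[i] * x[i];
        meanY += w[i] * y[i];
    }
    meanX /= sumWeights;
    meanY /= sumWeights;

    double covariance = 0.;
    double variance = 0.;
    for (size_t i = 0; i < x.size(); ++i) {
        covariance += w[i] * (x[i] - meanX) * (y[i] - meanY);
        variance += w[i] * (x[i] - meanX) * (x[i] - meanX);
    }

    alpha = covariance / variance;
    beta = meanY - alpha * meanX;
}

Such a two-pass scheme requires storing the whole sample; the point of the Welford-based approach discussed below is to obtain the same quantities in a single pass.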


There is another way to see that this transition from centered to non-centered data is correct. If the solution for the centered data is optimal, then the parameters $\alpha$ and $\beta$ deliver the minimum of the loss functional $Q_1$:


$$Q_1(f', x', y', w) = \sum_{i=1}^{n} w_i \cdot (\alpha \cdot x'_i + \beta - y'_i)^2$$


Now let's change the variables, returning to the non-centered data (recall that for the centered data the optimal $\beta$ is zero):


$$Q_1(f, x, y, w) = \sum_{i=1}^{n} w_i \cdot \Big( \alpha \cdot (x_i - m_n^{wx}) - y_i + m_n^{wy} \Big)^2 =$$


$$= \sum_{i=1}^{n} w_i \cdot \Big( \alpha \cdot x_i + (m_n^{wy} - \alpha \cdot m_n^{wx}) - y_i \Big)^2$$


The resulting expression is the value of the loss functional $Q_1$ on the non-centered data for the formulas for $\alpha$ and $\beta$ obtained above. At the same time, this value of the functional is the minimal one, so the problem is indeed solved correctly!


4. Application of the Welford method


Now note that computing the parameter $\alpha$ involves the same covariances that we computed in the previous article. In fact, using the notation from that article, we can write:


$$\alpha = \frac{D_n^{wxy}}{D_n^{wxx}} = \frac{C_n^{wxy}}{C_n^{wxx}}$$


This means that to compute the regression coefficient it is enough to run the Welford covariance computation twice: once for the covariance of features with targets and once for the variance of features. In the course of these computations we also obtain the mean values needed to compute the free coefficient (the intercept).


The code for adding a new element to the sample updates the means and variances for the features and targets, as well as the covariance between features and targets:


void TWelfordSLRSolver::Add(const double feature, const double goal, const double weight) {
    SumWeights += weight;
    if (!SumWeights) {
        return;
    }

    const double weightedFeatureDiff = weight * (feature - FeaturesMean);
    const double weightedGoalDiff = weight * (goal - GoalsMean);

    // Welford-style updates of the means and of the weighted sums of squared deviations.
    FeaturesMean += weightedFeatureDiff / SumWeights;
    FeaturesDeviation += weightedFeatureDiff * (feature - FeaturesMean);

    GoalsMean += weightedGoalDiff / SumWeights;
    GoalsDeviation += weightedGoalDiff * (goal - GoalsMean);

    // Welford-style update of the weighted sum of cross-deviations.
    Covariation += weightedFeatureDiff * (goal - GoalsMean);
}

And here is the method that finds the optimal parameters of the one-dimensional linear regression once all the elements have been added:


template <typename TFloatType>
void TWelfordSLRSolver::Solve(TFloatType& factor, TFloatType& intercept, const double regularizationParameter = 0.1) const {
    if (!FeaturesDeviation) {
        // Degenerate case: the feature is constant, so the best model is a constant prediction.
        factor = 0.;
        intercept = GoalsMean;
        return;
    }

    factor = Covariation / (FeaturesDeviation + regularizationParameter);
    intercept = GoalsMean - factor * FeaturesMean;
}

The GoalsDeviation value is not used here, but we will need it in future articles.
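
For illustration, a possible usage of this class might look as follows. The Add and Solve signatures are taken from the snippets above; everything else (default construction, zero-initialized members, the surrounding data vectors) is my assumption and is not shown in the article:

// Hypothetical usage sketch; assumes the solver is default-constructible with
// zero-initialized accumulators and that features, goals and weights are
// std::vector<double> of equal length.
TWelfordSLRSolver solver;
for (size_t i = 0; i < features.size(); ++i) {
    solver.Add(features[i], goals[i], weights[i]);
}

double factor = 0.;
double intercept = 0.;
solver.Solve(factor, intercept);

// Prediction for a new feature value:
const double prediction = factor * newFeature + intercept;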


Combining all the computations within a single class avoids some overhead. For instance, if the implementation used two separate objects for storing means and three objects for storing covariances (features with features, targets with targets, features with targets), the sum of weights would be updated five times for every element of the sample.


5. Experimental comparison of methods


For a practical comparison, I wrote a program that implements various ways of solving one-dimensional and multidimensional linear regression problems. Multidimensional regression will be discussed in the following articles; for now we focus on the one-dimensional case.


As usual, we will compare "naive" methods, methods based on Kahan summation, and methods based on the Welford method.


"Naive" methods directly apply the formula to calculate the covariances :


void Add(const double feature, const double goal, const double weight = 1.) {
    SumFeatures += feature * weight;
    SumSquaredFeatures += feature * feature * weight;

    SumGoals += goal * weight;
    SumSquaredGoals += goal * goal * weight;

    SumProducts += goal * feature * weight;
    SumWeights += weight;
}

The class is a template and has specializations with double and TKahanAccumulator accumulators.
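
The corresponding solving step is not shown here; a sketch of how $\alpha$ and $\beta$ could be recovered from these accumulated sums (my own illustration following the closed-form formulas above, written in the same member-function style; Factor and Intercept are hypothetical names) might look like this:

// Illustrative sketch, not repository code: alpha and beta from the raw sums.
// Assumes SumWeights > 0 and a non-constant feature.
double Factor() const {
    const double meanFeature = SumFeatures / SumWeights;
    const double meanGoal = SumGoals / SumWeights;
    const double covariance = SumProducts / SumWeights - meanFeature * meanGoal;
    const double variance = SumSquaredFeatures / SumWeights - meanFeature * meanFeature;
    return covariance / variance;
}

double Intercept() const {
    return SumGoals / SumWeights - Factor() * SumFeatures / SumWeights;
}

It is exactly the subtraction of two close large numbers in expressions like SumSquaredFeatures / SumWeights - meanFeature * meanFeature that makes this scheme numerically fragile.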


In addition, there is the TTypedBestSLRSolver class, which selects the best feature for building a one-dimensional regression model. This is done very simply: the one-dimensional linear regression problem is solved for each of the features, and then the best of the resulting models is selected.
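
Schematically, the selection could look like the following sketch (my own illustration, not the repository code; the data layout, the loop variables, and using the weighted training error as the selection criterion are assumptions):

#include <limits>

// Illustrative sketch: fit a one-dimensional regression on every feature and
// keep the model with the smallest weighted squared error on the training data.
size_t bestFeature = 0;
double bestError = std::numeric_limits<double>::max();
double bestFactor = 0.;
double bestIntercept = 0.;

for (size_t featureIndex = 0; featureIndex < featureCount; ++featureIndex) {
    TWelfordSLRSolver solver;
    for (size_t i = 0; i < instanceCount; ++i) {
        solver.Add(features[i][featureIndex], goals[i], weights[i]);
    }

    double factor = 0.;
    double intercept = 0.;
    solver.Solve(factor, intercept);

    double error = 0.;
    for (size_t i = 0; i < instanceCount; ++i) {
        const double residual = factor * features[i][featureIndex] + intercept - goals[i];
        error += weights[i] * residual * residual;
    }

    if (error < bestError) {
        bestError = error;
        bestFeature = featureIndex;
        bestFactor = factor;
        bestIntercept = intercept;
    }
}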


To test the implemented methods, we use model data from the LIAC collection. For convenience, some of the datasets are placed in the data directory in a format that the program understands.


The data in each problem is "corrupted" in a simple way: feature and target values are multiplied by some number, after which another number is added to them. In this way we obtain a case that is problematic from a computational point of view: mean values that are large compared with the spread of the values.
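
In code, such a transformation might look like this simple sketch (the variable names injureFactor and injureOffset are my guesses based on the program output below, not necessarily the names used in the repository):

// Illustrative sketch of the "corruption" transform: every feature and target
// value is scaled and shifted, which inflates the means relative to the spread.
for (size_t i = 0; i < features.size(); ++i) {
    features[i] = features[i] * injureFactor + injureOffset;
    goals[i] = goals[i] * injureFactor + injureOffset;
}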


In the research-bslr mode, the sample is transformed in this way several times in a row, and each time a cross-validation procedure is run. The result of the test is the mean value of the coefficient of determination over the test folds.
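
For reference, the coefficient of determination reported here is the usual $R^2$; a minimal sketch of its computation for a fitted model (my own illustration, unweighted for simplicity, not the benchmark code) is:

#include <cstddef>
#include <vector>

// Illustrative sketch: R^2 = 1 - SS_res / SS_tot for the model f(x) = factor * x + intercept.
double DeterminationCoefficient(const std::vector<double>& x,
                                const std::vector<double>& y,
                                const double factor,
                                const double intercept) {
    double meanY = 0.;
    for (const double value : y) {
        meanY += value;
    }
    meanY /= y.size();

    double residualSum = 0.;
    double totalSum = 0.;
    for (size_t i = 0; i < x.size(); ++i) {
        const double prediction = factor * x[i] + intercept;
        residualSum += (y[i] - prediction) * (y[i] - prediction);
        totalSum += (y[i] - meanY) * (y[i] - meanY);
    }
    return 1. - residualSum / totalSum;
}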


For example, for the kin8nm dataset the results look as follows:


injure factor: 1       injure offset: 1
fast_bslr                time: 0.001322   R^2: 0.27359
kahan_bslr               time: 0.002999   R^2: 0.27359
welford_bslr             time: 0.00432    R^2: 0.27359
normalized_welford_bslr  time: 0.004288   R^2: 0.27359

injure factor: 0.1     injure offset: 10
fast_bslr                time: 0.001256   R^2: 0.27359
kahan_bslr               time: 0.002948   R^2: 0.27359
welford_bslr             time: 0.004303   R^2: 0.27359
normalized_welford_bslr  time: 0.004275   R^2: 0.27359

injure factor: 0.01    injure offset: 100
fast_bslr                time: 0.001283   R^2: 0.27359
kahan_bslr               time: 0.003015   R^2: 0.27359
welford_bslr             time: 0.004304   R^2: 0.27359
normalized_welford_bslr  time: 0.004285   R^2: 0.27359

injure factor: 0.001   injure offset: 1000
fast_bslr                time: 0.001262   R^2: 0.27324
kahan_bslr               time: 0.002977   R^2: 0.27359
welford_bslr             time: 0.004329   R^2: 0.27359
normalized_welford_bslr  time: 0.00428    R^2: 0.27359

injure factor: 0.0001  injure offset: 10000
fast_bslr                time: 0.00128    R^2: -59.271
kahan_bslr               time: 0.003009   R^2: -0.0005269
welford_bslr             time: 0.004304   R^2: 0.27359
normalized_welford_bslr  time: 0.00428    R^2: 0.27359

full learning time:
fast_bslr                0.006403s
kahan_bslr               0.014948s
welford_bslr             0.02156s
normalized_welford_bslr  0.021408s

In this case, scaling all the values in the sample down by a factor of ten thousand while simultaneously adding 10,000 to them makes the standard algorithm unusable, even with Kahan summation. Similar results are obtained on other datasets, including those that come up in real production use.


Conclusion


So, today we discussed the one-dimensional linear regression problem, saw how the analytical solution to this problem is obtained, and how the Welford method can be used to find it.


The Welford method makes the solution much more robust to potential problems in the data. However, it turns out to be 2-4 times slower than the standard algorithm, so in practice you need to decide for yourself what is more important at the moment: being independent of possible problems in the data, or working as fast as possible.


If you need to build models on varied data many times and there is no way to check the quality of each resulting model, I would recommend using the Welford method.


In the next article we will talk about the use of the Welford method for solving the problem of multidimensional linear regression.


Literature


  1. habrahabr.ru: Exact calculation of averages and covariances by the Welford method
  2. github.com: Linear regression problem solvers with different computation methods
  3. github.com: The Collection of ARFF datasets of the Connection Artificial Intelligence Laboratory (LIAC)
  4. machinelearning.ru: One-dimensional linear regression


Source: https://habr.com/ru/post/335522/

