One-dimensional linear regression is one of the simplest regression methods (and, in general, one of the simplest machine learning methods): it describes a linear dependence of the observed value on a single feature. In real machine learning problems one usually deals with a large number of different features; in that case one-dimensional linear regression selects the feature that gives the best approximation of the target function.
In the previous post of this series we discussed the accuracy of computing means and covariances, and got acquainted with the Welford method, which in many cases avoids computational errors in these problems. Today we will look at a practical application of the Welford method: the one-dimensional linear regression problem.
In the one-dimensional linear regression problem we assume that there are two sequences of real numbers: features x_1, x_2, ... and targets y_1, y_2, .... In addition, there is a sequence of corresponding weights w_1, w_2, .... As always, we will assume that these sequences are potentially infinite, but that at the moment only the first n elements of each sequence are available.
Our task is to recover the linear relationship between features and targets, that is, to build a linear decision function f:

f(x) = β·x + α
At the same time, the root-mean-square loss functional is minimized:

Q(β, α) = sqrt( Σ_i w_i·(f(x_i) − y_i)² / Σ_i w_i )
For analysis it is easier to work with a functional free of the radical and the normalization:

Q₂(β, α) = Σ_i w_i·(β·x_i + α − y_i)²
Since the minimum points of the functionals Q and Q₂ coincide, such a replacement is correct.
For the functional Q₂ it is easy to write out the derivatives with respect to β and α:

∂Q₂/∂β = 2·Σ_i w_i·x_i·(β·x_i + α − y_i)
∂Q₂/∂α = 2·Σ_i w_i·(β·x_i + α − y_i)
Equating them to zero, we get a system of two linear equations:

β·Σ_i w_i·x_i² + α·Σ_i w_i·x_i = Σ_i w_i·x_i·y_i
β·Σ_i w_i·x_i + α·Σ_i w_i = Σ_i w_i·y_i
An important digression. Equating the derivatives to zero is correct in this case, because:
1. The loss functional is convex in the optimized parameters, so any local optimum is also a global optimum.
2. With respect to each of the optimized parameters the loss functional is a parabola, so the extremum points found are minimum points.
If the optimal value of the intercept α were equal to zero, finding the solution would not be difficult. One may notice that centering, a standard preprocessing method in machine learning, leads to exactly this effect. Indeed, consider the problem for the centered variables

x'_i = x_i − m_x,   y'_i = y_i − m_y,

where m_x and m_y are the weighted means of the features and the targets.
The weighted sum of the features is now equal to zero, as is the weighted sum of the targets:

Σ_i w_i·x'_i = 0,   Σ_i w_i·y'_i = 0
Then the optimal value of the intercept is zero: indeed, the second equation of the system gives α'·Σ_i w_i = Σ_i w_i·y'_i − β'·Σ_i w_i·x'_i = 0, so

α' = 0
This means that the optimal value of the parameter β' is easy to find from the first equation:

β' = Σ_i w_i·x'_i·y'_i / Σ_i w_i·(x'_i)²
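A quick numerical check (the numbers here are made up purely for illustration): take unit weights and the already centered data x' = (−1, 0, 1), y' = (−2, 0, 2). Then

β' = ((−1)·(−2) + 0·0 + 1·2) / ((−1)² + 0² + 1²) = 4 / 2 = 2,

and the model y' = 2·x' indeed fits these points exactly.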
Now let's return to the general case of non-centered data. If f' is the decision function for the centered case, whose values are given by the formula

f'(x) = β'·x,

and it approximates the value y − m_y, then the following decision function approximates the value y:

f(x) = f'(x − m_x) + m_y = β'·(x − m_x) + m_y
Therefore, the solution of the original one-dimensional linear regression problem can be written as follows:

β = β',   α = m_y − β·m_x
Here we use the notation introduced in the previous article for the weighted mean:

m_x = Σ_i w_i·x_i / Σ_i w_i

(and m_y is defined in the same way for the targets).
That this transition is correct can also be understood in another way. If the solution for the centered data is optimal, then the parameters β' and α' = 0 deliver the minimum of the loss functional on the centered data:

Σ_i w_i·(β'·x'_i − y'_i)²
Now let's change the variables, returning to the non-centered data:

Σ_i w_i·(β'·(x_i − m_x) − (y_i − m_y))² = Σ_i w_i·(β'·x_i + (m_y − β'·m_x) − y_i)²
The resulting expression is the value of the loss functional on the non-centered data for the parameters β and α obtained by the formulas above. Since this value attains the minimum, the problem is indeed solved correctly.
Now note that the calculation of the parameter β uses the same covariances whose computation we discussed in the previous article. In fact, using the notation from it, one can write:

β = C_xy / D_x,   α = m_y − β·m_x,

where D_x = Σ_i w_i·(x_i − m_x)² is the (non-normalized) scatter of the features and C_xy = Σ_i w_i·(x_i − m_x)·(y_i − m_y) is the covariance between features and targets.
This means that to compute the regression coefficient it is enough to maintain two covariance-type sums (the scatter of the features and the covariance between features and targets) using the Welford method. In the course of these computations we simultaneously obtain the mean values needed to compute the free coefficient of the regression.
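For reference, these quantities can be updated incrementally when the n-th element arrives; the recurrences below are written in my own notation to match the code that follows (S is the sum of weights, m the means, D the scatter of the features, C the covariance):

S_n = S_{n−1} + w_n
m_{x,n} = m_{x,n−1} + w_n·(x_n − m_{x,n−1}) / S_n
m_{y,n} = m_{y,n−1} + w_n·(y_n − m_{y,n−1}) / S_n
D_{x,n} = D_{x,n−1} + w_n·(x_n − m_{x,n−1})·(x_n − m_{x,n})
C_{xy,n} = C_{xy,n−1} + w_n·(x_n − m_{x,n−1})·(y_n − m_{y,n})

Note that the last two updates mix the mean before the update with the mean after it; this is exactly what makes the recurrences exact.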
The code for adding the next element of the sample consists of updating the means and scatters for the features and the targets, as well as the covariance between features and targets:
void TWelfordSLRSolver::Add(const double feature, const double goal, const double weight) {
    SumWeights += weight;
    if (!SumWeights) {
        return;
    }

    const double weightedFeatureDiff = weight * (feature - FeaturesMean);
    const double weightedGoalDiff = weight * (goal - GoalsMean);

    FeaturesMean += weightedFeatureDiff / SumWeights;
    FeaturesDeviation += weightedFeatureDiff * (feature - FeaturesMean);

    GoalsMean += weightedGoalDiff / SumWeights;
    GoalsDeviation += weightedGoalDiff * (goal - GoalsMean);

    Covariation += weightedFeatureDiff * (goal - GoalsMean);
}
And here is the method that finds the optimal parameters of the one-dimensional linear regression after all elements have been added:
template <typename TFloatType>
void TWelfordSLRSolver::Solve(TFloatType& factor, TFloatType& intercept, const double regularizationParameter = 0.1) const {
    if (!FeaturesDeviation) {
        factor = 0.;
        intercept = GoalsMean;
        return;
    }

    factor = Covariation / (FeaturesDeviation + regularizationParameter);
    intercept = GoalsMean - factor * FeaturesMean;
}
The GoalsDeviation value is not used here, but we will need it in future articles.
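For completeness, here is a minimal usage sketch of such a solver (the FitSLR function and its input vectors are made up for this example; only the Add and Solve methods of the class above are used, and the class is assumed to be default constructible):

#include <utility>
#include <vector>

// Feed a weighted sample into the solver element by element and
// return the pair (factor, intercept) of the fitted model.
std::pair<double, double> FitSLR(const std::vector<double>& features,
                                 const std::vector<double>& goals,
                                 const std::vector<double>& weights)
{
    TWelfordSLRSolver solver;
    for (size_t i = 0; i < features.size(); ++i) {
        solver.Add(features[i], goals[i], weights[i]);
    }

    double factor = 0.;
    double intercept = 0.;
    solver.Solve(factor, intercept);

    // the prediction for a new feature value x is factor * x + intercept
    return {factor, intercept};
}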
Combining all the computations within a single class avoids some overhead. Say, if the implementation used two separate objects for storing means and three for storing covariances (features with features, targets with targets, features with targets), the sum of weights would be updated five times for each element of the sample.
For a practical comparison I wrote a program that implements various ways of solving the one-dimensional and multidimensional linear regression problems. Multidimensional regression will be discussed in the following articles; for now we focus on the one-dimensional case.
We will compare, as usual, "naive" methods, methods based on Kahan summation, and methods based on the Welford method.

The "naive" methods apply the covariance formula directly:

C_xy = Σ_i w_i·x_i·y_i − (Σ_i w_i·x_i)·(Σ_i w_i·y_i) / Σ_i w_i

accumulating the necessary sums element by element:
void Add(const double feature, const double goal, const double weight = 1.) {
    SumFeatures += feature * weight;
    SumSquaredFeatures += feature * feature * weight;

    SumGoals += goal * weight;
    SumSquaredGoals += goal * goal * weight;

    SumProducts += goal * feature * weight;
    SumWeights += weight;
}
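For comparison with the Welford version, here is a sketch of how the coefficients could be recovered from these sums; this is my reconstruction based on the formulas above, not necessarily the exact Solve method used in the repository:

// Sketch of a "naive" Solve: express the scatter and the covariance
// through the raw sums accumulated by Add (assumes SumWeights is non-zero).
void Solve(double& factor, double& intercept) const {
    const double featuresMean = SumFeatures / SumWeights;
    const double goalsMean = SumGoals / SumWeights;

    // non-normalized scatter and covariance: sum w*x^2 - (sum w*x)^2 / sum w, etc.
    const double featuresDeviation = SumSquaredFeatures - SumFeatures * featuresMean;
    const double covariation = SumProducts - SumFeatures * goalsMean;

    if (!featuresDeviation) {
        factor = 0.;
        intercept = goalsMean;
        return;
    }

    factor = covariation / featuresDeviation;
    intercept = goalsMean - factor * featuresMean;
}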
The class is a template and has specializations with double and TKahanAccumulator counters.
In addition, there is a TTypedBestSLRSolver class, which selects the best feature for building a one-dimensional regression model. This is done very simply: the one-dimensional linear regression problem is solved for each of the features, and then the best of the resulting models is selected.
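A useful observation in this context (my own note, not a quote from the repository): for the unregularized solution the residual error of a single-feature model can be expressed through the very statistics the Welford solver already maintains:

Σ_i w_i·(y_i − β·x_i − α)² = D_y − C_xy² / D_x,

where D_y is the scatter of the targets. Comparing the models therefore amounts to comparing the values C_xy² / D_x across features, and this is one of the places where the GoalsDeviation value stored above could come in handy.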
To test the developed methods we use model data sets from the LIAC collection. For convenience, some of the data sets are placed in the data directory in a format that the program understands.
The data in these problems is "corrupted" in a simple way: the values of the features and the targets are multiplied by some number, after which some other number is added to them. In this way we obtain a case that is problematic from the computational point of view: mean values that are large compared to the spread of the values.
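The transformation itself can be sketched in a couple of lines (the Injure name and the free-function form are mine; the program may implement it differently):

#include <vector>

// Corrupt a column of values: scale by injureFactor and shift by injureOffset,
// so that the means become large relative to the spread of the values.
void Injure(std::vector<double>& values, const double injureFactor, const double injureOffset) {
    for (double& value : values) {
        value = value * injureFactor + injureOffset;
    }
}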
In research-bslr mode the sample is corrupted several times in a row, and each time a cross-validation procedure is run. The result of the test is the mean value of the coefficient of determination over the test folds.
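For reference, the coefficient of determination is computed in the usual way: the closer it is to 1, the better the model explains the data, and values below zero mean the model is worse than simply predicting the mean:

R² = 1 − Σ_i (y_i − f(x_i))² / Σ_i (y_i − m_y)²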
For example, for the kin8nm data set the results look as follows:
injure factor: 1        injure offset: 1
fast_bslr                time: 0.001322   R^2: 0.27359
kahan_bslr               time: 0.002999   R^2: 0.27359
welford_bslr             time: 0.00432    R^2: 0.27359
normalized_welford_bslr  time: 0.004288   R^2: 0.27359

injure factor: 0.1      injure offset: 10
fast_bslr                time: 0.001256   R^2: 0.27359
kahan_bslr               time: 0.002948   R^2: 0.27359
welford_bslr             time: 0.004303   R^2: 0.27359
normalized_welford_bslr  time: 0.004275   R^2: 0.27359

injure factor: 0.01     injure offset: 100
fast_bslr                time: 0.001283   R^2: 0.27359
kahan_bslr               time: 0.003015   R^2: 0.27359
welford_bslr             time: 0.004304   R^2: 0.27359
normalized_welford_bslr  time: 0.004285   R^2: 0.27359

injure factor: 0.001    injure offset: 1000
fast_bslr                time: 0.001262   R^2: 0.27324
kahan_bslr               time: 0.002977   R^2: 0.27359
welford_bslr             time: 0.004329   R^2: 0.27359
normalized_welford_bslr  time: 0.00428    R^2: 0.27359

injure factor: 0.0001   injure offset: 10000
fast_bslr                time: 0.00128    R^2: -59.271
kahan_bslr               time: 0.003009   R^2: -0.0005269
welford_bslr             time: 0.004304   R^2: 0.27359
normalized_welford_bslr  time: 0.00428    R^2: 0.27359

full learning time:
fast_bslr                0.006403s
kahan_bslr               0.014948s
welford_bslr             0.02156s
normalized_welford_bslr  0.021408s
In this case, scaling all the values in the sample down by a factor of ten thousand while simultaneously adding 10,000 to them makes the standard algorithm unusable, even with Kahan summation. Similar results are obtained on other data sets, including ones that occur in real production tasks.
So, today we discussed the one-dimensional linear regression problem, figured out how the analytical solution to this problem is obtained, and saw how the Welford method can be used to find it.
The Welford method makes the solution much more robust to possible problems in the data. However, it turns out to be 2-4 times slower than the standard algorithm, so in practice you have to decide for yourself what matters more at the moment: being independent of possible problems in the data, or working as fast as possible.
If you need to build models on varied data many times and there is no way to control the quality of each resulting model, I would recommend using the Welford method.
In the next article we will talk about using the Welford method to solve the multidimensional linear regression problem.
Source: https://habr.com/ru/post/335522/