Machine learning is usually associated with large amounts of data - millions or even billions of transactions - from which you need to draw a complex conclusion about the behavior, interests or current state of a user, a buyer or some device (a robot, a car, a drone or a machine).
However, in the everyday life of an ordinary analyst at an ordinary company, large amounts of data are rare. More often it is the opposite: you have little or very little data - literally dozens or hundreds of records. But the analysis still has to be done, and not just any analysis - a sound and reliable one.
The situation is often aggravated by the fact that it is easy to generate many features for each record (most often polynomial terms, the difference with the previous value or with the value a year ago, one-hot encoding of categorical features, etc.). What is not at all easy is figuring out which of them are really useful and which only complicate the model and increase the error of your predictions.
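For illustration, here is a minimal sketch of that kind of feature generation (the column names and data are invented for the example, not taken from the article):

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Toy data: a couple of numeric columns and one categorical column
df = pd.DataFrame({
    "sales":      [120.0, 135.0, 128.0, 150.0],
    "sales_prev": [115.0, 120.0, 135.0, 128.0],
    "format":     ["mini", "super", "mini", "hyper"],
})

# Difference with the previous value
df["sales_diff"] = df["sales"] - df["sales_prev"]

# One-hot encoding for the categorical feature
df = pd.get_dummies(df, columns=["format"])

# Polynomial features: squares and pairwise products of the numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["sales", "sales_prev", "sales_diff"]])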
To separate the useful features from the noise, you can use methods of Bayesian statistics, for example Automatic Relevance Determination (ARD). It is a relatively recent method, proposed in 1992 by David MacKay (it all started with his doctoral dissertation). A very brief but hard-to-follow summary of the method exists in presentation form; clearer, though overly verbose, explanations are also available.
To put it very simply: ARD derives a posterior estimate of the variance for each coefficient, and the coefficients whose variance turns out to be small are set to zero.
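In slightly more detail (this is the standard formulation of ARD, not a quote from the article): each weight w_i gets its own zero-mean Gaussian prior with an individual precision α_i,

p(w | α) = Π_i N(w_i | 0, 1/α_i),

and the precisions α (together with the noise variance) are tuned by maximizing the marginal likelihood of the data. For irrelevant features the optimal α_i grows without bound, so the prior variance 1/α_i collapses to zero and the corresponding weight is pinned at zero; only the relevant features keep a finite α_i and a nonzero weight.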
Let's see how it works in practice. Suppose we have only 30 data points (say, data on 30 stores), and each store is described by 30 features. The task is to build a regression model (for example, to predict sales volume from location, format, sales area, layout, headcount and other store parameters).
Building an ordinary linear regression under such conditions would be pure madness. Let's make the problem even harder: only 5 of the features actually matter, and the rest are completely irrelevant.
So let the true dependence be Y = X * w + e, where e is random normal noise and the coefficients w are [1, 2, 3, 4, 5, 0, 0, ..., 0] - that is, only the first five coefficients are nonzero, and features 6 through 30 do not affect the true value of Y at all. We, however, do not know this. All we have is the data - X and Y - and we need to estimate the coefficients w.
Now run ARD:
import numpy as np
from sklearn.linear_model import ARDRegression

N = 30  # number of points (and number of features)

# feature matrix: 30 points in a 30-dimensional space
X = np.random.random(size=(N, N)) * 10 + 1

# true coefficients: [1 2 3 4 5 0 0 ... 0]
w = np.zeros(N)
w[:5] = np.arange(5) + 1

# random normal error
e = np.random.normal(0, 1, size=N)

# target values
Y = np.dot(X, w) + e

ard = ARDRegression()
ard.fit(X, Y)
print(ard.coef_)
And we get a genuinely impressive result:
array([ 1.01, 2.14, 2.95, 3.89, 4.79, 0., 0., 0., 0., 0.01, 0., 0., 0.31, 0.04, -0.05, 0. , 0., 0.01, 0., 0., 0., 0., 0.01, 0., 0., 0., 0., 0.17, 0., 0. ])
Let me remind you that the real coefficients are equal:
array([ 1., 2., 3., 4., 5., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0. ])
Thus, having only 30 points in a 30-dimensional space, we were able to build a model that almost exactly reproduces the true dependence.
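You can also peek at how the model decided which features to keep: after fitting, sklearn's ARDRegression exposes the estimated per-coefficient precisions. A small sketch continuing the example above (the cutoff of 10 000 is simply sklearn's default threshold_lambda):

# ard.lambda_ holds the estimated precision (inverse variance) of each weight;
# a huge precision means the weight has effectively been pushed to zero.
relevant = ard.lambda_ < 1e4  # same cutoff as sklearn's default threshold_lambda
print("features kept:", np.where(relevant)[0])
print("estimated noise precision:", ard.alpha_)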
For comparison, here are the coefficients computed by ordinary linear regression:
array([ 0.39 2.07 3.16 2.86 4.8 -0.21 -0.13 0.42 0.6 -0.21 -0.96 0.03 -0.46 0.57 0.89 0.15 0.24 0.11 -0.38 -0.36 -0.28 -0.01 0.43 -1.22 0.23 0.15 0.12 0.43 -1.11 -0.3 ])
and by linear regression with L2 regularization (ridge):
array([-0.36 1.48 2.67 3.44 3.99 -0.4 1.01 0.58 -0.81 0.78 -0.13 -0.23 -0.26 -0.24 -0.38 -0.24 -0.38 -0.25 0.54 -0.31 -0.21 -0.42 0.14 0.88 1.09 0.66 0.12 -0.07 0.08 -0.58])
Neither of them holds water.
Linear regression with L1 regularization (lasso), however, gives a result close to ARD's:
array([ 0.68 1.9 2.88 3.86 4.88 -0.05 0.09 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01 0. 0. 0. 0. 0. 0. 0. ])
As you can see, L1 regularization zeroes out the insignificant coefficients even more aggressively, but it estimates the significant ones with a slightly larger error.
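For completeness, the baselines above can be fit on the same X and Y roughly like this (the regularization strengths are my guesses - the article does not say which values were used):

from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Ordinary least squares, ridge (L2) and lasso (L1) on the same 30x30 problem.
# The alpha values are illustrative, not the ones used for the arrays above.
ols = LinearRegression().fit(X, Y)
ridge = Ridge(alpha=1.0).fit(X, Y)
lasso = Lasso(alpha=0.1).fit(X, Y)

print(ols.coef_)
print(ridge.coef_)
print(lasso.coef_)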
In general, ARD is a wonderful method, but there is a catch. Like many (one might even say almost all) Bayesian methods, ARD is computationally expensive (although it parallelizes well). It runs quickly on tens or hundreds of points (fractions of a second), slowly on a few thousand (tens to hundreds of seconds), and painfully on tens or hundreds of thousands (minutes to hours). It also needs a huge amount of RAM.
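If the fitting time becomes a problem, a couple of the estimator's parameters let you trade accuracy for speed (a sketch; the values are illustrative, and defaults may differ between sklearn versions):

# A looser convergence tolerance stops the evidence maximization earlier;
# threshold_lambda controls how aggressively near-zero weights are pruned.
ard_fast = ARDRegression(tol=1e-2, threshold_lambda=1e4)
ard_fast.fit(X, Y)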
However, this is not so scary. If you have a lot of data, then you can safely use classical statistical methods, and they will give a fairly good result. Serious problems begin when data is scarce and conventional methods no longer work. And then Bayes comes to the rescue.
ARD is actively used in various kernel methods - for example, the Relevance Vector Machine (RVM), which is essentially a Support Vector Machine (SVM)-style model built on top of ARD. It is also handy in classifiers when you need to assess how significant the available features are. In short, try it - you will like it.
Source: https://habr.com/ru/post/313566/