Machine learning is sweeping across the planet. Artificial intelligence, creaking with neural networks, is gradually overtaking humans at the tasks they used to handle with their own neurons. Still, we should not forget about the humble linear regression model. First, because many complex machine-learning methods, including neural networks, are built on top of it. Second, because applied business problems are often solved easily, quickly, and effectively by linear models.
Let's start with a little test. Can a linear model describe:
- the dependence of a person's weight on their height?
- the waiting time in a store queue at different times of day?
- site traffic during a phase of exponential growth?
- the number of people waiting for a train at a metro station over time?
- the probability that a customer will not place an order on a site, depending on the site's performance?
As you may have guessed, the answer to every question is “Yes, it can.” So linear models are not as simple as they might seem at first glance. Let's get to know their rich variety.
Formulation of the problem
Let a set of M pairs of input data (Xᵢ, Yᵢ) be given, where
- ℝ is the set of real numbers and ℕ is the set of natural numbers,
- M ∈ ℕ, i ∈ ℕ, 1 ≤ i ≤ M,
- Xᵢ ∈ ℝᵈ, d ∈ ℕ, i.e. each Xᵢ is a sequence of real numbers of length d,
- Yᵢ ∈ ℝᵏ, k ∈ ℕ, i.e. each Yᵢ is also a sequence of real numbers, but of length k.
The X are called independent variables (also factors or regressors), and the Y are called dependent (or explained) variables.
In the simplest case, d = k = 1, that is, we are given pairs of numbers, for example (person's height, person's weight), (part volume, part weight) or (waiting time in a queue, an indicator of whether the client left).
Most often, d > 1 and k = 1, that is, several X are given for each Y, for example ((height, age), weight), ((sales yesterday, sales the day before yesterday, sales two days before that), sales today) or ((the number of tasks assigned, the total complexity of the tasks), the number of days the employee skipped work under a plausible pretext).
It also happens that Y consists of several values, and then the pairs (Xᵢ, Yᵢ) can look, for example, like this: ((duration of exercise, level of load, body mass index, level of fitness), (blood lactate level, duration of the recovery period)).
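To make the notation concrete, here is a minimal sketch of how such data might be laid out in code, using NumPy arrays (the numbers are made up purely for illustration):

```python
import numpy as np

M, d, k = 5, 2, 1                        # 5 observations, 2 factors, scalar target
X = np.array([[170, 25],                 # (height, age) for each person
              [182, 31],
              [165, 44],
              [158, 22],
              [177, 37]], dtype=float)   # shape (M, d)
Y = np.array([[65.0], [81.0], [70.0], [52.0], [74.0]])  # weight, shape (M, k)
assert X.shape == (M, d) and Y.shape == (M, k)
```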
And, as you may have already guessed, we now make a sudden assumption: our Y is not just dependent, but linearly dependent on X, that is, Y = bᵀX, where b holds the model parameters in the form of a real matrix of dimension d × k.
However, let us immediately recall two important aspects. Firstly, there is no reason for bᵀX to be strictly equal to Y. After all, we could have measured the initial data with an error (and most likely did). In addition, it is possible (indeed almost certain) that we have overlooked some factors beyond X that also affect Y. Therefore, the model must account for a random error.
Secondly, there is no guarantee that Y depends linearly on X directly. Perhaps it depends on something else, which in turn depends on X. For example, Y could equal b·X² or b·log X. And then, why do we necessarily want a dependency for Y itself, rather than for some value derived from Y, say log Y = bᵀX? Thus, we arrive at a generalized formulation of the linear problem:
Y* = bᵀX* + ε
where X* = f(X), Y* = g(Y), and ε is a random variable.
Even though f and g can be non-linear functions, so that Y can end up depending on X very nonlinearly, the model remains linear with respect to the parameters b. That is why it is called a linear model.
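As a rough sketch of this idea, here is how one might fit a model that is nonlinear in X (a polynomial) and in Y (a log transform) while staying linear in b; the data is synthetic and the coefficients are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=100)
Y = np.exp(0.5 + 0.3 * X + 0.02 * X**2 + rng.normal(0, 0.1, size=100))

X_star = np.column_stack([np.ones_like(X), X, X**2])  # f(X): intercept, X, X^2
Y_star = np.log(Y)                                    # g(Y): log transform

b, *_ = np.linalg.lstsq(X_star, Y_star, rcond=None)   # still linear in b
print(b)                                              # roughly [0.5, 0.3, 0.02]
```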
But this is not the whole formulation of the problem. So far we only know X and Y, and that is no merit of ours, since they were simply given to us. How do we find b and ε?
Recall that the essence of the model is precisely the coefficients b. After all, it was for their sake that we started all this fuss. And if there were no random errors ε, then b could easily be calculated by solving an elementary system of d equations, where d is the number of factors, that is, the length of the vector Xᵢ. In other words, it is enough to obtain exactly d pairs (Xᵢ, Yᵢ) as source data, where each Xᵢ consists of exactly d values, and the exact model is ready.
For example, if the source data has the form ((part size, material price), cost of the part), then data from just two parts is enough to build an exact cost formula.
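A minimal sketch of this "exact" case: with no random error and exactly d = 2 observations, the coefficients come straight out of a 2 x 2 linear system (the numbers are invented):

```python
import numpy as np

# ((part size, material price), cost), generated exactly as 2*size + 2*price
X = np.array([[2.0, 10.0],
              [5.0,  8.0]])
Y = np.array([24.0, 26.0])

b = np.linalg.solve(X, Y)   # exact coefficients, no fitting required
print(b)                    # [2. 2.]
```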
But in real life this does not happen. We are hardly lucky enough to have an exhaustive list of influencing factors, all measured perfectly. Therefore, the formula of our model will always contain the fateful random error ε. And most often it will not be a microscopic error that can be safely neglected. No, it will have to be accounted for. And instead of two parts, we will have to measure and crunch hundreds or thousands.
And yet they exist
Because of these random errors, we cannot simply calculate the desired coefficients b of our model. Instead, we have to solve an optimization problem of the form b* = argmin F(b | X, Y), where F is some functional (for example, Σᵢ (Yᵢ - bᵀXᵢ)², which leads us to the least squares method, though other functionals can be used as well).
Note that the correct and exact coefficients b still exist; we just do not know them. The best we can count on is b*, which may be close to b. Moreover, different functionals lead to different b*, each differing in its own way from the ideal b (which, again, remains unknown to us).
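For the least squares functional, this argmin has a well-known closed form via the normal equations (XᵀX)b = XᵀY. A small sketch on synthetic data, where the "true" b is known only because we generated it ourselves:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_b = np.array([1.5, -2.0])
Y = X @ true_b + rng.normal(0, 0.5, size=200)   # noisy observations

b_star = np.linalg.solve(X.T @ X, X.T @ Y)      # least squares estimate
print(b_star)                                   # close to, but not exactly, true_b
```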
We will deal with errors
The random errors ε themselves are even trickier. We do not know them, and we have nothing from which to compute them directly, yet information about them is needed at the very least to state the aforementioned optimization problem precisely (and how else would we solve it?). So we have to make some parametric assumptions.
Suppose, based on the nature of the process generating the source data, we declare a priori that ε ~ D(θ), that is, our random variable is described by some known distribution D with a set of parameters θ, for example a normal distribution.
Or, without knowing the distribution itself, we can make assumptions about its important properties. For example, that E[ε] = 0, that is, the mathematical expectation of the errors is zero. Or that D[εᵢ] = σ² = const, that is, the variances of all errors are identical and finite.
But why all this? After all, we could simply take the argmin of the chosen functional and obtain a value of b.
Indeed we could. The only question is the quality of that value. If we take arbitrary data and minimize an arbitrary functional, we will get something arbitrary, not a model.
For example, if ε has a Cauchy distribution, then the OLS solution will give a chaotic and meaningless result.
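A quick simulation makes this visible: fitting the same model several times under Gaussian noise gives stable estimates, while under Cauchy noise the OLS estimate jumps around wildly (a sketch on synthetic data only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 1))
true_b = np.array([3.0])

for noise in ("normal", "cauchy"):
    estimates = []
    for _ in range(5):
        eps = rng.normal(size=500) if noise == "normal" else rng.standard_cauchy(500)
        Y = X @ true_b + eps
        b_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
        estimates.append(b_hat[0])
    print(noise, np.round(estimates, 2))   # Cauchy estimates vary far more between runs
```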
But if, for example, the following conditions hold simultaneously:
- E[ε] = 0,
- E[ε | X] = 0, that is, the errors do not depend on X,
- D[εᵢ] = σ² = const,
- cov(εᵢ, εⱼ) = 0 for i ≠ j,
then the OLS calculation gives us not just some estimate of the required parameters b, but an unbiased, consistent and efficient (among linear estimators) estimate. And all of this is thoroughly substantiated, proved, and backed up by theorems from all sides.
Note an important fact. Since the product bᵀX* is entirely deterministic, we can say that Y* is also a random variable, with the same form of distribution as ε. This lets us draw a convenient conclusion by rewriting the model as E[Y*] = bᵀX*: our linear model predicts not the value of Y* itself but its mathematical expectation. And that allows us to bring the rich apparatus of statistics to bear on our problem.
In particular, given the original data and a parametric assumption about the form of the distribution of ε (and hence of Y*), we can construct the likelihood function L(b) and then maximize it, which is equivalent to fitting our linear model. In other words, by selecting the distribution parameters (which are, again, the same coefficients b), we learn to generate random variables that are as close as possible to Y, which is exactly what we were after.
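A sketch of that equivalence for normally distributed errors: numerically maximizing the Gaussian likelihood (i.e. minimizing the negative log-likelihood) recovers essentially the same b as ordinary least squares. SciPy's generic optimizer is used here purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
Y = X @ np.array([2.0, -1.0]) + rng.normal(0, 1.0, size=300)

def neg_log_likelihood(b):
    resid = Y - X @ b
    return 0.5 * np.sum(resid ** 2)   # Gaussian NLL up to constants, sigma held fixed

b_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
b_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b_mle, b_ols)                   # practically identical
```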
Variety of linear models
We repeat the description of our model:
Y* = bᵀX* + ε;  b* = argmin F(b | X*, Y*).
By changing the format and structure of its components (Y*, X*, b*, ε, F), we obtain different models with different properties that are applicable to different tasks.
By the dimension of the independent variable X, one can distinguish one-factor (univariable) and multi-factor (multivariable) models.
If the dependent variable Y is a scalar value, we have a single (univariate) model. When the dependent variable is multidimensional, that is, represented by a vector, we obtain a multiple or general (multivariate) model.
Also, let's not forget that the independent variables can contain both the source data and transformations of it, including repeated ones. For example, suppose we have a single variable X at the input; we can turn it into several factors, [X, X², X³], and obtain, however strange the combination may sound, a polynomial linear model.
Another great example is the conversion of a categorical variable. Say one of the variables in the source data takes the values "metro", "car", "bicycle". Three factors are immediately created from it for the model: X*_metro, X*_car, X*_bicycle, such that X*_metro = 1 if X_transport = "metro", and so on.
Thanks to this, instead of a rather ambiguous model of the form Y = b·X_transport, we move to a convenient and flexible model of the form Y = b₁·X*_metro + b₂·X*_car + b₃·X*_bicycle.
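One convenient way to do this encoding in practice is dummy (one-hot) variables, for example with pandas; the column names below are simply what get_dummies generates for this toy data:

```python
import pandas as pd

df = pd.DataFrame({"transport": ["metro", "car", "bicycle", "metro", "car"]})
dummies = pd.get_dummies(df["transport"], prefix="X")   # one 0/1 column per category
df = pd.concat([df, dummies], axis=1)
print(df.columns.tolist())   # ['transport', 'X_bicycle', 'X_car', 'X_metro']
```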
A dependent variable can also contain either the source data (giving a simple model) or a transformation of it. Moreover, when the transformed dependent variable belongs to an exponential family of distributions, we are talking about a so-called generalized linear model (GLM), which in particular includes the normal, logistic, Poisson, exponential, binomial, and many other models. Generalized models are very important and convenient to use, because for them the convergence of the parameters, the quality of the resulting estimates, and the influence of different kinds of functionals have all been worked out and proved. Ideally, try to reduce your problem to some GLM.
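For instance, a Poisson GLM for a count-valued dependent variable might be fit like this with statsmodels (a sketch on synthetic data; the coefficients are arbitrary):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(200, 2)))     # intercept plus two factors
mu = np.exp(X @ np.array([0.3, 0.8, -0.5]))        # Poisson mean via the log link
Y = rng.poisson(mu)

model = sm.GLM(Y, X, family=sm.families.Poisson()).fit()
print(model.params)                                # close to [0.3, 0.8, -0.5]
```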
It is also worth remembering that a dependent variable can be of a very different nature. In particular, it can be continuous (a real number, e.g. weight or probability) or discrete. The latter, in turn, can be an integer (e.g. the number of clients or days) or a categorical parameter, which can be binary (yes/no) or multinomial, either unordered (bicycle, bus, metro) or ordered ("good", "normal", "bad").
Naturally, different types of variables require different models. You cannot predict both the probability of customers leaving and the volume of their purchases with the same model, even if the influencing factors are the same.
Let's also not forget about the random errors, which may have different distributions and thereby strongly influence the model and the way it is built. For example, the logit and probit models look exactly the same from the outside: they take the same data and predict the probability of some event Y given X. The only difference is that in the probit model the errors are assumed to be normally distributed, while in the logit model they follow a logistic distribution. Naturally, the two models give different results, so do not confuse them.
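A short sketch of the two side by side, again with statsmodels and synthetic data: both fit the same binary outcome, and the fitted probabilities are similar, but the coefficients live on different scales because the assumed error distributions differ:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(500, 1)))
p = 1 / (1 + np.exp(-(X @ np.array([-0.5, 2.0]))))   # data generated from a logistic model
Y = rng.binomial(1, p)

logit = sm.Logit(Y, X).fit(disp=0)
probit = sm.Probit(Y, X).fit(disp=0)
print(logit.params, probit.params)   # similar fits, different coefficient scales
```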
Loss function
Now that we have more or less sorted out the formula of the model, let's move on to the functional being optimized. It can be simple, taking into account only the loss function, that is, the difference between the values predicted by the model and the actual values. Typical examples of simple functionals (a small code sketch follows the list):
- least squares: min Σᵢ (Yᵢ* - bᵀXᵢ*)²
- weighted least squares: min Σᵢ Wᵢ (Yᵢ* - bᵀXᵢ*)², which lets us, for example, give more weight to recent data and thereby reduce the significance of data from previous years
- generalized least squares based on the Mahalanobis distance: min Σᵢ (Yᵢ* - bᵀXᵢ*)ᵀ Ω⁻¹ (Yᵢ* - bᵀXᵢ*)
- Huber's loss function, which is interesting because it behaves quadratically near the minimum and linearly everywhere else
- the inverse Huber (berHu) loss, which, on the contrary, is quadratic everywhere and linear only in the neighborhood of the minimum.
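As an illustration of why the choice of loss matters, here is a sketch comparing plain least squares with the Huber loss on data containing one gross outlier, using scikit-learn's HuberRegressor (synthetic data, arbitrary numbers):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(6)
X = np.linspace(0, 10, 100).reshape(-1, 1)
Y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=100)
Y[-1] = 500.0                    # a single wild outlier at the far end

ols = LinearRegression().fit(X, Y)
huber = HuberRegressor().fit(X, Y)
print(ols.coef_, huber.coef_)    # OLS slope is dragged far above 2, Huber stays close to 2
```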
Of course, the possible functionals are by no means limited to these. Perhaps the broadest class is given by the maximum likelihood approach. By introducing a parametric assumption about the distribution of the random errors ε (and hence of Y*), one can write the likelihood function explicitly and then, by building a functional that maximizes it, compute the desired parameters.
By the way, a curious fact: if Y* is normally distributed, then the maximum likelihood functional is in fact equivalent to the least squares functional.
Regularization
A more complex functional also contains regularization, which is usually represented as an additional term:
min ℒ(b) + λ·ℛ(b),
where ℒ(b) is the loss function, ℛ(b) is the regularization function, and λ is a parameter specifying how strongly the regularization is applied.
Regularization is intended to control the complexity of the model, and its purpose is to simplify the model. This, in particular, helps to fight overfitting and increases the generalizing ability of the model.
Typical examples of regularization functions (a code sketch for the first three follows the list):
1. L1 = Σⱼ |bⱼ|
Known as LASSO regularization (Least Absolute Shrinkage and Selection Operator); as the name suggests, it shrinks the coefficients and turns some of them into exact zeros, which is very convenient when the source data is highly correlated.
2. L2 = Σⱼ bⱼ²
Sometimes called ridge regularization; it shrinks the values of the model's coefficients and makes the model robust to small changes in the source data. It is also well differentiable, so the model can be computed analytically.
3. L_EN = α·L1 + (1 - α)·L2
Combining LASSO and ridge, we get ElasticNet, which joins the two worlds with all their pros and cons.
4. L_N = E[A(b, X̃)] - A(b, X), where A is the log-partition function
Introducing a new variable of the form X̃ = X + ξ, where ξ is a random variable, we effectively add random noise to the original data, which obviously helps to fight overfitting.
For the simplest linear regression, additive noise is equivalent to L2 regularization, but for other models it can produce very interesting effects. For example, in logistic regression it essentially penalizes predictions close to 1/2 (in other words, it encourages confident predictions and punishes uncertainty).
5. Dropout
Another tricky approach, actively used in neural networks. We introduce a new variable of the form X̃ = X ⊙ ξ, where ξ is a vector of Bernoulli random variables of length d. Simply put, we randomly select a subset of the factors X and build a model on it, and then choose the model that depends least of all on this randomness.
For the simplest linear regression, dropout is again analogous to L2 regularization. But, for example, in logistic regression it lets rare but very characteristic factors make themselves felt (in other words, for some very small Xᵢⱼ it will select large coefficients bⱼ, thereby increasing their influence on the result).
Of course, the available types of regularization are not limited to this. Although linear models rarely require anything more.
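Here is a small sketch of the first three regularizers from the list above, using scikit-learn; the alpha and l1_ratio values are arbitrary, and the data is synthetic with only two truly relevant factors:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
Y = X @ np.array([3.0, -2.0] + [0.0] * 8) + rng.normal(0, 0.5, size=200)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, Y)
    # Lasso tends to zero out the irrelevant factors; Ridge only shrinks them
    print(type(model).__name__, np.round(model.coef_, 2))
```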
And now the solution
Once the functional is built, we can start solving it. There are two main approaches:
- a solution in analytical form
- a numerical solution.
Simple functionals such as OLS, and even OLS with ridge regularization, can be solved analytically, that is, a formula for the desired coefficients can be derived. As a rule, such solutions are written directly in matrix form, for example via the SVD or the Cholesky decomposition.
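A sketch of two such analytical routes, side by side: Cholesky factorization of the normal equations and the SVD-based pseudoinverse (synthetic data; both give the same answer up to floating-point noise):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 3))
Y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.3, size=300)

# Cholesky: factor X^T X = L L^T, then solve two triangular systems
L = np.linalg.cholesky(X.T @ X)
b_chol = np.linalg.solve(L.T, np.linalg.solve(L, X.T @ Y))

# SVD route via the Moore-Penrose pseudoinverse
b_svd = np.linalg.pinv(X) @ Y
print(b_chol, b_svd)
```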
However, if there is a lot of data, or the data is very sparse, then even in these cases it is better to use iterative numerical methods.
Usually, however, an analytical solution is not available at all, and we have to resort to numerical methods. Here, of course, we are faced with a huge variety of algorithms, for example:
- stochastic gradient descent
- stochastic average gradient
- the conjugate gradient method
- the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, as well as its limited-memory modification L-BFGS (a sketch of which follows below).
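For illustration, here is a sketch of minimizing the least squares functional numerically with SciPy's L-BFGS implementation instead of a closed-form formula (synthetic data; the analytic gradient is supplied to the optimizer):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 5))
Y = X @ np.arange(1.0, 6.0) + rng.normal(0, 0.2, size=1000)

def loss_and_grad(b):
    resid = X @ b - Y
    return 0.5 * resid @ resid, X.T @ resid   # functional value and its gradient

result = minimize(loss_and_grad, x0=np.zeros(5), jac=True, method="L-BFGS-B")
print(result.x)                               # close to [1, 2, 3, 4, 5]
```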
Summarizing
As you can see, simple linear models are not so simple at all. A huge number of common business problems are solved well by linear models; in particular, banks and insurance companies have been actively using them for many, many years. In any case, before taking up neural networks and other machine-learning methods, it is worth studying linear regression carefully, because it is very much a building block for more complex analytical models.