Hello. This is my first article on Habr, and I welcome criticism and comments.
This article describes a simple but convenient way of constructing predictors of a special kind in SQL-like languages. These predictors capture a linear trend in the data and can be used in machine learning tasks. The idea is to compute linear trends quickly and efficiently from transactional data.
We will consider supervised learning: a regression or classification problem. Solving such problems can be broken into many stages, and here we are interested in the stage of creating new predictors. We will assume that the raw data includes, among other things, transaction records. The details are not important, so assume we have users performing actions over time that characterize their activity (the Amount field), as well as events we want to predict (the Target field):
UserID | TimeID | Amount | Target
---|---|---|---
101 | 1 | 20.0 | 0
101 | 2 | 30.0 | 1
101 | 3 | 50.0 | 0
102 | 2 | 80.0 | 0
102 | 4 | 50.0 | 0
102 | 5 | 50.0 | 0
102 | 7 | 30.0 | 1
In this case, time is indexed by integers. When building such an index, it is important that consecutive index values correspond to time units that actually follow one another (it will become clear below why this matters). For example, to index days, each day can be assigned an index equal to the number of days elapsed since the earliest day in the data. If the earliest day in the table is January 1, 2017, then January 2 gets index 1, January 28 gets index 27, and February 4 gets index 34. Sometimes it makes sense to add, for each user, records for all missing values of the time index, filling the Amount field with zeros. For example, we may be interested in a product's consumption during each of the previous three months; then every user will always have exactly three records in the table.
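As an illustration, the day-indexing rule above can be sketched in Python (the function name `day_index` is my own, not from the article):

```python
from datetime import date

def day_index(d: date, first_day: date) -> int:
    """Number of days elapsed since the earliest day in the data,
    so consecutive indexes correspond to consecutive calendar days."""
    return (d - first_day).days

first = date(2017, 1, 1)
print(day_index(date(2017, 1, 2), first))   # 1
print(day_index(date(2017, 1, 28), first))  # 27
print(day_index(date(2017, 2, 4), first))   # 34
```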
The Target field in this example stores information about a binary event that either occurred or did not occur at the corresponding point in time.
It is important to note that in practice, whenever possible, I separate the creation of predictors from modeling. Most of the transformations and tricks that can be done in R, Python, or elsewhere can also be done in SQL, so whenever possible I create datasets and all predictors in SQL. The advantages are, first, that the stages of machine learning are separated from each other and, second, that all data manipulations happen immediately and only in the database. Moreover, some predictor transformations are actually easier to do in SQL.
So, we have a table with transactional data: users, integer points in time, and a quantitative description of these users' actions. Our goal is to find a linear trend in this data for each user. Since there is a single predictor, we will use simple linear regression, and since we are interested in the trend, we will compute the slope coefficient of the regression of y (Amount) on x (TimeID):

    slope = (avg(x·y) − avg(x)·avg(y)) / (avg(x²) − avg(x)²)
From the formula it is clear that the slope depends only on averages and averages of products. Consequently, it can be computed "on the fly" in SQL, which is well suited for exactly this kind of operation. It is also possible to track how this coefficient changes: fix, say, a trend window of five time values, then compute the trend for each set of five consecutive records. Those familiar with SQL can already see how to do this: with window functions. Window functions can compute moving averages, and that is exactly what we need. The goal is to compute moving averages for both the Amount field and the TimeID field; the first plays the role of the dependent variable, the second of the independent one:
```sql
SELECT
    UserID,
    TimeID,
    Amount,
    AVG(Amount)            OVER(PARTITION BY UserID ORDER BY TimeID ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS [y_bar],
    AVG(1.0*TimeID)        OVER(PARTITION BY UserID ORDER BY TimeID ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS [x_bar],
    AVG(Amount*TimeID)     OVER(PARTITION BY UserID ORDER BY TimeID ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS [xy_bar],
    AVG(1.0*TimeID*TimeID) OVER(PARTITION BY UserID ORDER BY TimeID ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS [x2_bar]
INTO #tmp_LinearTrendAmount
FROM MyTransTable
```
After running this SELECT, we have all the components needed to compute the slope coefficient. A second SELECT can assemble them into the desired indicator, with the necessary checks (for example, against division by zero, which occurs when a user has only one record). In this example, the slope is computed from the 5 records preceding the current one (hence ... AND 1 PRECEDING in the code): the Target field refers to the point in time of the current row, so including that row in the window would mean "looking into the future."
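As a sketch, here are both steps run end to end on the sample table, using Python's sqlite3 (SQLite 3.25+ supports window functions). The table and column names follow the article; the CTE layout, the named window, and the NULLIF division guard are my own choices, not from the original:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTransTable (UserID INT, TimeID INT, Amount REAL, Target INT)")
conn.executemany(
    "INSERT INTO MyTransTable VALUES (?, ?, ?, ?)",
    [(101, 1, 20.0, 0), (101, 2, 30.0, 1), (101, 3, 50.0, 0),
     (102, 2, 80.0, 0), (102, 4, 50.0, 0), (102, 5, 50.0, 0), (102, 7, 30.0, 1)],
)

# Step 1: moving averages over the 5 rows strictly before the current one.
# Step 2: slope = (avg(xy) - avg(x)avg(y)) / (avg(x^2) - avg(x)^2),
# with NULLIF guarding against a zero denominator (window of one row).
rows = conn.execute("""
    WITH bars AS (
        SELECT UserID, TimeID,
               AVG(Amount)            OVER w AS y_bar,
               AVG(1.0*TimeID)        OVER w AS x_bar,
               AVG(Amount*TimeID)     OVER w AS xy_bar,
               AVG(1.0*TimeID*TimeID) OVER w AS x2_bar
        FROM MyTransTable
        WINDOW w AS (PARTITION BY UserID ORDER BY TimeID
                     ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING)
    )
    SELECT UserID, TimeID,
           (xy_bar - x_bar*y_bar) / NULLIF(x2_bar - x_bar*x_bar, 0) AS slope
    FROM bars
""").fetchall()

for user, t, slope in rows:
    print(user, t, slope)
```

For user 101 at TimeID 3 the window holds the points (1, 20.0) and (2, 30.0), so the slope is 10.0; rows whose window is empty or contains a single record yield NULL.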
So, using window functions, you can quickly obtain the slope coefficients. Importantly, we get these coefficients for every record in the table, because SQL passes over the entire table, computes the moving averages, and writes them into new columns (keeping in mind that for each user the slope is undefined for the first record and is computed from fewer than five points for the next few). As a result, you can train the model on the entire table, since each record can be treated as a separate observation. Window functions work quite fast, especially compared with fitting a separate linear regression for each window.
Remark 1. On why all this is needed. For supervised learning tasks on transactional data, I usually build predictors of several kinds: averages, medians, and modes that describe the level of the values; standard deviations and coefficients of variation that describe their variability; and trends in the values. In general this is a sound approach, because at least one predictor of each kind usually makes it into the model.
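For the level and variability predictors mentioned above, the same per-user aggregation can be sketched in plain SQL (run here via sqlite3). The variance is computed as AVG(x·x) − AVG(x)², the population form; medians and modes need extra machinery and are omitted. Column names are my own illustrations:

```python
import sqlite3
import math

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTransTable (UserID INT, TimeID INT, Amount REAL)")
conn.executemany("INSERT INTO MyTransTable VALUES (?, ?, ?)",
                 [(101, 1, 20.0), (101, 2, 30.0), (101, 3, 50.0),
                  (102, 2, 80.0), (102, 4, 50.0), (102, 5, 50.0), (102, 7, 30.0)])

# Mean and population variance per user; the coefficient of variation
# (std / mean) is derived from these two aggregates afterwards.
stats = conn.execute("""
    SELECT UserID,
           AVG(Amount) AS mean_amount,
           AVG(Amount*Amount) - AVG(Amount)*AVG(Amount) AS var_amount
    FROM MyTransTable
    GROUP BY UserID
""").fetchall()

for user, mean, var in stats:
    print(user, mean, math.sqrt(var) / mean)
```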
Remark 2. Now it should be clear why correct TimeID indexing is important: it serves as our independent variable, so its values must be computed correctly.
Remark 3. The devil is in the details. For example, if you take a fixed number of time values, the trends may be computed over different time spans: for one user, 5 records cover 1 week; for another, 1 month. The business logic must allow such use. If, for example, we store each user's consumption for each of the previous 3 months, this problem disappears.
Instead of a conclusion. Applications are always individual: much depends on the data, the transactions, their intervals, and how the target variable is computed. My main aim was to describe an approach that can be adapted to many problems: a one-dimensional linear trend depends only on averages, and averages are easy to compute, including in SQL, using window functions.
Source: https://habr.com/ru/post/326124/