Quite often they ask whether there are analogues of analytical (window) functions in MySQL.
Note. At the time of writing, there were no such analogues, however, the article still represents academic interest in terms of analyzing the original MySQL approach to using variables.To replace analytic functions, often use queries with self-connections, complex subqueries, and so on. Most of these solutions are inefficient in terms of performance.
Also in MySQL there is no recursion. However, with some of the problems that are usually solved by analytical functions or recursion, you can cope with the tools of MySQL.
')
One of these tools is a unique, uncharacteristic for other DBMS, mechanism for working with variables within an SQL query. We can declare a variable inside a query, change its value and substitute it in a SELECT for output. Moreover, the order of processing the rows in the query and, as a result, the order in which values ​​are assigned to variables can be specified in custom sorting!
A warning. The article assumes that the processing of expressions in the SELECT clause is carried out from left to right, but there is no official confirmation of such processing order in the MySQL documentation. This must be borne in mind when changing the server version. To guarantee the calculation sequence, you can use a dummy CASE or IF statement.
Analogue of recursion
Consider a simple example that generates the Fibonacci sequence (in the Fibonacci sequence, each term is equal to the sum of the two previous ones, and the first 2 are equal to one):
SELECT IF(X=1, Fn_1, Fn_2) F FROM( SELECT @I := @I + @J Fn_1, @J := @I + @J Fn_2 FROM (SELECT 0 dummy UNION ALL SELECT 0 UNION ALL SELECT 0)a, (SELECT 0 dummy UNION ALL SELECT 0 UNION ALL SELECT 0)b, (SELECT @I := 1, @J := 1)IJ )T, (SELECT 1 X UNION ALL SELECT 2)X;
This query generates 18 Fibonacci numbers, not counting the first two:
2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765
We now analyze how it works.
In lines 5) 6) 9 entries are generated. There is nothing unusual.
In line 7) we declare two variables @I, @J and assign them 1.
In line 3) the following happens: first, the sum of two variables is assigned to the variable @I. Then we assign the same to the @J variable, taking into account the fact that the value of @I has already changed.
In other words, calculations in SELECT are performed from left to right - see also the remark at the beginning of the article.
And change of variables is carried out in each of our 9 records, i.e. during the processing of each new line, the variables @I and @J will contain the values ​​calculated during the processing of the previous line.
To solve this problem using other DBMS, we would have to write a
recursive query!Note:Variables need to be declared in a separate subquery (line 7), if we declared a variable in the SELECT clause, it would most likely be calculated only once (although the specific behavior will depend on the version of the server). The type of a variable is determined by the value by which it is initialized. This type may change dynamically. If a variable is assigned NULL, its type will be BLOB.The order in which rows are processed in SELECT, as mentioned above, depends on custom sorting. A simple example of numbering lines in a given order:
SELECT val, @I:=@I+1 Num FROM (SELECT 30 val UNION ALL SELECT 20 UNION ALL SELECT 10 UNION ALL SELECT 50)a, (SELECT @I := 0)I ORDER BY val;
Val Num 10 1 20 2 30 3 50 4
Analogs of analytical functions
Variables can also be used to replace analytic functions. Further some examples. For simplicity, we assume that all fields are NOT NULL, and sorting and partitioning (PARTITION BY) occur in a single field. Using NULL values ​​and more complex sorts will make the examples more cumbersome, but the essence does not change.
For examples, create a TestTable table:
CREATE TABLE TestTable( group_id INT NOT NULL, order_id INT UNIQUE NOT NULL, value INT NOT NULL );
Where
group_id - group identifier (analogue of the analytic function window);
order_id is a unique field that will be sorted;
value is a numeric value.
Fill out our table with test data:
INSERT TestTable(order_id, group_id, value) SELECT * FROM( SELECT 1 order_id, 1 group_id, 1 value UNION ALL SELECT 2, 1, 2 UNION ALL SELECT 3, 1, 2 UNION ALL SELECT 4, 2, 1 UNION ALL SELECT 5, 2, 2 UNION ALL SELECT 6, 2, 3 UNION ALL SELECT 7, 3, 1 UNION ALL SELECT 8, 3, 2 UNION ALL SELECT 9, 4, 1 UNION ALL SELECT 11, 3, 2 )T;
Examples of the replacement of some analytical functions.
1) ROW_NUMBER () OVER (ORDER BY order_id)
SELECT T.*, @I:=@I+1 RowNum FROM TestTable T,(SELECT @I:=0)I ORDER BY order_id;
group_id order_id value RowNum
1 1 1 1
1 2 2 2
1 3 2 3
2 4 1 4
2 5 2 5
2 6 3 6
3 7 1 7
3 8 2 8
4 9 1 9
3 11 2 10
2) ROW_NUMBER () OVER (PARTITION BY group_id ORDER BY order_id)
SELECT group_id, order_id, value, RowNum FROM( SELECT T.*, IF(@last_group_id = group_id, @I:=@I+1, @I:=1) RowNum, @last_group_id := group_id FROM TestTable T,(SELECT @last_group_id:=NULL, @I:=0)I ORDER BY group_id, order_id )T;
group_id order_id value RowNum
1 1 1 1
1 2 2 2
1 3 2 3
2 4 1 1
2 5 2 2
2 6 3 3
3 7 1 1
3 8 2 2
3 11 2 3
4 9 1 1
3) SUM (value) OVER (PARTITION BY group_id ORDER BY order_id)
SELECT group_id, order_id, value, RunningTotal FROM( SELECT T.*, IF(@last_group_id = group_id, @I:=@I+value, @I:=value) RunningTotal, @last_group_id := group_id FROM TestTable T, (SELECT @last_group_id:=NULL, @I:=0)I ORDER BY group_id, order_id )T;
group_id order_id value RunningTotal
1 1 1 1
1 2 2 3
1 3 2 5
2 4 1 1
2 5 2 3
2 6 3 6
3 7 1 1
3 8 2 3
3 11 2 5
4 9 1 1
4) LAG (value) OVER (PARTITION BY group_id ORDER BY order_id)
SELECT group_id, order_id, value, LAG FROM( SELECT T.*, IF(@last_group_id = group_id, @last_value, NULL) LAG, @last_group_id := group_id, @last_value := value FROM TestTable T,(SELECT @last_value:=NULL, @last_group_id:=NULL)I ORDER BY group_id, order_id )T;
group_id order_id value LAG
1 1 1 NULL
1 2 2 1
1 3 2 2
2 4 1 NULL
2 5 2 1
2 6 3 2
3 7 1 NULL
3 8 2 1
3 11 2 2
4 9 1 NULL
For LEAD, everything is the same, only you need to change the sorting to ORDER BY group_id, order_id DESC
For the COUNT, MIN, MAX functions, everything is somewhat more complicated, since until we analyze all the lines in the group (window), we will not be able to find out the meaning of the function. MS SQL, for example, “spools” a window for this purpose (temporarily places the rows of a window in a hidden buffer table for repeated access to them), in MySQL there is no such possibility. But for each window we can calculate the value of the function in the last line for a given sort (that is, after analyzing the entire window), and then, sorting the lines in the window in reverse order, put the calculated value over the entire window.
So we need two sorts. To ensure that the final sorting remains the same as in the examples above, first sort by ASC group_id, order_id DESC fields, then ASC by group_id field, ASC order_id.
5) COUNT (*) OVER (PARTITION BY group_id)
In the first sort, we simply number the entries. In the second window we assign the maximum number to all the lines, which will correspond to the number of lines in the window.
SELECT group_id, order_id, value, Cnt FROM( SELECT group_id, order_id, value, IF(@last_group_id = group_id, @MaxRowNum, @MaxRowNum := RowNumDesc) Cnt, @last_group_id := group_id FROM( SELECT T.*, IF(@last_group_id = group_id, @I:=@I+1, @I:=1) RowNumDesc, @last_group_id := group_id FROM TestTable T,(SELECT @last_group_id:=NULL, @I:=0)I ORDER BY group_id, order_id DESC )T,(SELECT @last_group_id:=NULL, @MaxRowNum:=NULL)I ORDER BY group_id, order_id )T;
group_id order_id value Cnt
1 1 1 3
1 2 2 3
1 3 2 3
2 4 1 3
2 5 2 3
2 6 3 3
3 7 1 3
3 8 2 3
3 11 2 3
4 9 1 1
The functions MAX and MIN are calculated by analogy. I will give only an example for MAX:
6) MAX (value) OVER (PARTITION BY group_id)
SELECT group_id, order_id, value, MaxVal FROM( SELECT group_id, order_id, value, IF(@last_group_id = group_id, @MaxVal, @MaxVal := MaxVal) MaxVal, @last_group_id := group_id FROM( SELECT T.*, IF(@last_group_id = group_id, GREATEST(@MaxVal, value), @MaxVal:=value) MaxVal, @last_group_id := group_id FROM TestTable T,(SELECT @last_group_id:=NULL, @MaxVal:=NULL)I ORDER BY group_id, order_id DESC )T,(SELECT @last_group_id:=NULL, @MaxVal:=NULL)I ORDER BY group_id, order_id )T;
group_id order_id value MaxVal
1 1 1 2
1 2 2 2
1 3 2 2
2 4 1 3
2 5 2 3
2 6 3 3
3 7 1 2
3 8 2 2
3 11 2 2
4 9 1 1
7) COUNT (DISTINCT value) OVER (PARTITION BY group_id)
An interesting thing that is missing in MS SQL Server, but it can be calculated with a subquery, taking MAX from RANK. So do the same here. In the first sorting, we calculate RANK () OVER (PARTITION BY group_id ORDER BY value DESC), then in the second sort we set the maximum value to all the rows in each window:
SELECT group_id, order_id, value, Cnt FROM( SELECT group_id, order_id, value, IF(@last_group_id = group_id, @Rank, @Rank := Rank) Cnt, @last_group_id := group_id FROM( SELECT T.*, IF(@last_group_id = group_id, IF(@last_value = value, @Rank, @Rank:=@Rank+1) , @Rank:=1) Rank, @last_group_id := group_id, @last_value := value FROM TestTable T,(SELECT @last_value:=NULL, @last_group_id:=NULL, @Rank:=0)I ORDER BY group_id, value DESC, order_id DESC )T,(SELECT @last_group_id:=NULL, @Rank:=NULL)I ORDER BY group_id, value, order_id )T;
group_id order_id value Cnt
1 1 1 2
1 2 2 2
1 3 2 2
2 4 1 3
2 5 2 3
2 6 3 3
3 7 1 2
3 8 2 2
3 11 2 2
4 9 1 1
Performance
To begin with, the numbering of lines in a query is comparable in performance using a self-join and using variables.
1) The classic self-joining method
SELECT COUNT(*)N, T1.* FROM TestTable T1 JOIN TestTable T2 ON T1.order_id >= T2.order_id GROUP BY T1.order_id;
With 10,000 entries in the table, TestTable gives:
Duration / Fetch
16.084 sec / 0.016 sec
2) Using variables:
SELECT @N:=@N+1 N, T1.* FROM TestTable T1, (SELECT @N := 0)M ORDER BY T1.order_id;
Issues:
Duration / Fetch
0.016 sec / 0.015 sec
The result speaks for itself. However, it should be understood that the values ​​calculated using variables are not optimal for use in filtering conditions. Sorting and computing will occur for ALL rows, despite the fact that in the end we need only a small part of them.
Let us consider in more detail on the example of such a task:
Print the first 2 lines from the TestTable table for each group_id, sorted by order_id.Here is how this task would be solved in a DBMS with the support of analytical functions:
SELECT group_id, order_id, value FROM( SELECT *, ROW_NUMBER()OVER(PARTITION BY group_id ORDER BY order_id) RowNum FROM TestTable )T WHERE RowNum <= 2;
However, the MySQL optimizer knows nothing about the rules for calculating the RowNum field. He will have to number all the lines, and only then select the necessary ones.
Now imagine that we have 1 million records and 20 unique group_id values. Those. To select 40 rows, MySQL will calculate the RowNum value for a million rows! There is no beautiful solution to this problem in one query in MySQL. But you can first get a list of unique group_id values, for example, like this:
SELECT DISTINCT group_id FROM TestTable;
Then by means of any other programming language to generate a query of the form:
SELECT * FROM TestTable WHERE group_id=1 ORDER BY order_id LIMIT 2 UNION ALL SELECT * FROM TestTable WHERE group_id=2 ORDER BY order_id LIMIT 2 UNION ALL … SELECT * FROM TestTable WHERE group_id=20 ORDER BY order_id LIMIT 2;
20 light queries will work much faster than calculating the RowNum for a million rows.