What the article is about should be clear enough from the title. In addition, the author will explain why, in his opinion, it is needed, and argue that the subject is not just a fashionable technology but "a doubly necessary thing - both pleasant and useful."
It is always interesting to watch a few talented people build something (a programming language, why not), knowing exactly what problem they are solving and what goals they have set for themselves - and then test their creation on themselves. Quite unlike the monumental creations of giant committees, each of which is busy maintaining the harmony of the universe as it understands it.
Compare, for example, the fates of FORTRAN and PL/I. Who even remembers PL/I these days?
From this point of view, the AWK language, for example, is very successful. In its name, A stands for Alfred Aho, one of the authors of the Dragon Book; W for Peter Weinberger, who also had a hand in Fortran-77; and K for Brian Kernighan - where would we be without him. The language is designed to process a text stream on the fly, in pipes between processes.
The language is typeless (this is not quite so), the syntax is very similar to C; it has filtering, associative arrays, stream start/end events, a new-line event...
The author has always been impressed by the fact that its interpreter does not need to be installed: it is always available on UNIX-like systems, and under Windows it is enough to copy the executable file and everything just works. However, that is beside the point.
In his work the author often has to use the SQL + AWK combination, and here is why. SQL is, at its core, a declarative language for managing data flows. The means it provides for working with the execution context of a query - aggregate functions - are very limited.
How, for example, would you build a two-dimensional histogram in SQL? You can, of course, use GROUP BY, but that may involve sorting, which is not cheap when you have hundreds of millions (or even more) of rows.
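For reference, the straightforward GROUP BY formulation might look like this (a sketch; the table samples(x, y) and the bucket size of 100 are assumptions):

SELECT floor(x / 100) AS cx,
       floor(y / 100) AS cy,
       count(*)       AS cnt
FROM   samples
GROUP  BY floor(x / 100), floor(y / 100);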
UPD: in the comments I was corrected that this is not quite so (or not so at all): the SQL processor is able to evaluate aggregate functions while building a hash keyed by the grouping criterion. For that it needs enough free memory to hold the hash map. The group contexts are then updated as the table is read, and by the end of the read the result has already been computed.
The same technique extends to window functions (see below); the context simply becomes "thicker".
When the number of groups is not known in advance, or is very large, the SQL processor is forced to build a temporary index and make a second pass over it.
In simple cases - a plain COUNT, as here - a universal option is possible: a temporary index (cx, cy, count); with a small number of groups it will live entirely in memory on cached pages. In complex cases, such as window functions, the group state becomes non-trivial, and constantly (de)serializing it is not at all what the doctor ordered.
In short: the SQL processor resorts to sorting when it cannot estimate the number of groups produced by a GROUP BY. And grouping by computed values is (often) exactly that case.
So you have to do something like:
psql -t -q -c 'select x, y from samples' | gawk -f mk_hist2d.awk
where mk_hist2d.awk accumulates statistics in an associative array and prints them when the stream ends:
# mk_hist2d.awk
# fields: $1 - weight, $2 - x, $3 - y
{ bucket[int($2*0.01), int($3*0.01)] += $1; }
END {
  for (i = 0; i < 500; i++)
    for (j = 0; j < 500; j++) {
      if ((i, j) in bucket)
        print i*100 " " j*100 " " bucket[i, j];
      else
        print i*100 " " j*100 " 0";
    }
}
There is one BUT: the full data stream has to be shipped from the server to the client machine, and that is not exactly cheap either.
Is it possible to combine business with pleasure - to accumulate the statistics during the execution of the SQL query, but without resorting to sorting? Yes, for example, with custom aggregate functions.
Custom aggregate functions
They exist in various systems, implemented a little differently everywhere.
- PostgreSQL. The documentation is here; more details here.
This one computes the maximum account balance.
And this one determines which value occurs more often in a boolean column - true or false.
It looks like this:
CREATE AGGREGATE mode(boolean) (
  SFUNC     = mode_bool_state,
  STYPE     = INT[],
  FINALFUNC = mode_bool_final,
  INITCOND  = '{0,0}'
);
Here SFUNC is the function called for every row of the stream; its first argument is of type STYPE. FINALFUNC finishes the computation and returns the value of the aggregate. INITCOND initializes the internal state (of type STYPE) that is passed in as the first argument.
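For illustration, the state and final functions of the mode(boolean) aggregate above could look roughly like this (a PL/pgSQL sketch; only the names come from the example, the bodies are an assumption):

CREATE FUNCTION mode_bool_state(state INT[], val BOOLEAN) RETURNS INT[] AS $$
BEGIN
  IF val THEN
    state[1] := state[1] + 1;       -- count of TRUE values
  ELSIF NOT val THEN
    state[2] := state[2] + 1;       -- count of FALSE values
  END IF;                           -- NULLs are ignored
  RETURN state;
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION mode_bool_final(state INT[]) RETURNS BOOLEAN AS $$
BEGIN
  RETURN state[1] >= state[2];      -- TRUE if true occurs at least as often as false
END;
$$ LANGUAGE plpgsql;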
Given that these functions can be written in C (which means the internal state can use memory that is released automatically when the query is closed), this is a very powerful tool - one still has to be able to put it to full use.
- MS SQL.
Previously (2000), one had to create an ActiveX object before the query and do the aggregation through it.
Now (2016+) this is done in the CLR environment. You have to create a custom function, then create and register an assembly, and after that you can create an aggregate.
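The T-SQL side of it might look roughly like this (a sketch; the assembly path and the names MyAggregates / GeometricMean are assumptions):

CREATE ASSEMBLY MyAggregates
FROM 'C:\clr\MyAggregates.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE AGGREGATE dbo.GeometricMean (@value FLOAT)
RETURNS FLOAT
EXTERNAL NAME MyAggregates.GeometricMean;
GO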
There is an example of calculating a geometric mean, as well as of concatenating strings: with additional parameters and a user-defined type for storing the intermediate state.
- Oracle.
In Oracle this is done through the ODCIAggregate Data Cartridge interface.
To create your own aggregate, you write an object type that implements four methods (a skeleton follows the list below):
- initialization (ODCIAggregateInitialize), static; must create an instance of the required type and return it through a parameter
- iteration (ODCIAggregateIterate), called for every data row
- merge (ODCIAggregateMerge), used to combine states computed in parallel
- termination (ODCIAggregateTerminate), produces the result
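Such a type might be sketched as follows (the names SumAggImpl / my_sum and the attribute are assumptions; only the ODCIAggregate* method names are dictated by the interface, and the type body with the four implementations is omitted):

CREATE OR REPLACE TYPE SumAggImpl AS OBJECT (
  running_sum NUMBER,
  STATIC FUNCTION ODCIAggregateInitialize(ctx IN OUT SumAggImpl)
    RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT SumAggImpl, value IN NUMBER)
    RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT SumAggImpl, ctx2 IN SumAggImpl)
    RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateTerminate(self IN SumAggImpl,
                                         returnValue OUT NUMBER,
                                         flags IN NUMBER)
    RETURN NUMBER
);
/

CREATE OR REPLACE FUNCTION my_sum(x NUMBER) RETURN NUMBER
  PARALLEL_ENABLE AGGREGATE USING SumAggImpl;
/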
Examples: 1, 2, 3, 4.
- DB2.
There is no explicit way to create custom aggregates in DB2.
But you can feed a user-defined type (written in Java) to a standard function (even MAX) and make the system execute queries like
CREATE TYPE Complex AS (
  real DOUBLE,
  i    DOUBLE
) …

CREATE TABLE complexNumbers (
  id     INTEGER NOT NULL PRIMARY KEY,
  number Complex
) …

SELECT sum..real, sum..i
FROM (
  SELECT GetAggrResult(MAX(BuildComplexSum(number)))
  FROM complexNumbers
) AS t(sum)
What attracts attention in all these systems?
- One way or another, you will need to create objects in the database, be it an AGGREGATE or a TYPE. At the very least, that requires the corresponding rights. And all you wanted was to add up a few numbers on the back of an envelope.
- You may have to write something in another language, be it C, C# or Java.
To integrate what you have written into the system you, again, need rights. And all you wanted was...
- Initialization is awkward. Suppose you want to build histograms with different bucket sizes. It would seem simple: specify the desired INITCOND when declaring the aggregate (PostgreSQL) and be done with it. But then every bucket size needs its own aggregate, and that again requires rights.
Here you can resort to a dirty trick and slip the processor a union of the initialization string (first) and the data, building the context not in the constructor but upon receiving the first row.
- Nevertheless, even with the described limitations, custom aggregates let us compute pretty much anything.
- It is important that aggregates can be parallelized; at least PostgreSQL and Oracle (Enterprise Edition) can do this. To achieve that, though, you have to learn to serialize/deserialize the intermediate states and to merge states coming from different workers.
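In PostgreSQL terms this amounts to adding a combine function (and, for non-SQL state types, serialization functions) to the aggregate declaration; roughly (a sketch, mode_bool_combine is an assumed function that adds the two counter arrays element-wise):

CREATE AGGREGATE mode(boolean) (
  SFUNC       = mode_bool_state,
  STYPE       = INT[],
  FINALFUNC   = mode_bool_final,
  COMBINEFUNC = mode_bool_combine,  -- merges two partial states
  INITCOND    = '{0,0}',
  PARALLEL    = SAFE
);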
Window functions
Window functions appeared in the SQL:2003 standard. By now they are supported by all of the systems mentioned above. In essence, window functions are an extension of the aggregate machinery, and, of course, custom aggregate functions also work in a window context.
The extension is this: before SQL:2003, aggregate functions operated over a fixed window - either the entire resultset or the part of it corresponding to a combination of values of the GROUP BY fields. Now the user has some freedom in shaping that window.
The difference is that values computed over windows are added to the output as a separate column and do not require collapsing the whole stream the way aggregate functions do. So in one query you can use several window aggregates, each in its own context (window). Before, there could also be several aggregate functions, but they all worked over the same window.
In broad strokes:
- OVER ()
the window is the entire resultset. Suppose the query 'select count(1) from Samples' returns 169. Then, running 'select count(1) over () from Samples', we get a column in which the value 169 is repeated 169 times.
- OVER (PARTITION BY)
an analogue of GROUP BY: for each combination of values a window is created in which the aggregate functions are evaluated. Suppose the Samples table has a single integer column val containing the numbers from 1 to 169.
Then the query 'select count(1) over (partition by (12 + val) / 13) from Samples' returns a column in which the value 13 is written 169 times.
- OVER (ORDER BY)
can be combined with PARTITION BY; it allows the window to change dynamically as the cursor moves - the window extends from the start of the group to the current cursor position. As a result, the aggregate column contains not one value per group but its own value for every row, which is convenient for running totals. The result of the query
'select sum(val) over (order by val) from Samples' is a column whose n-th element contains the sum of the integers from 1 to n.
- OVER (ROWS)
lets you define the window frame relative either to the cursor position or to the start/end of the ORDER BY range.
For example, '... ROWS 1 PRECEDING ...' means the window consists of the current row and the one before it, while '... ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING ...' means the window consists of the two rows immediately after the cursor.
CURRENT ROW in this mode means the current cursor position. For example, 'ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING' means from the current row to the end of the range.
- OVER (RANGE)
differs from ROWS in that CURRENT ROW refers not to a single row but to the group of rows sharing the current ORDER BY value: as the start of the frame it means the first row of that group, and as the end of the frame it means the last row of that group.
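A small illustration of the difference (a sketch, using the same Samples table with the integer column val):

SELECT val,
       sum(val) OVER (ORDER BY val
                      ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_rows,
       sum(val) OVER (ORDER BY val
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_range
FROM Samples;

If val contains duplicates, sum_rows grows one row at a time, while sum_range already includes all the peer rows with the same val.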
The syntax for using window functions varies slightly across systems.
Summing up, there remains a slightly uneasy feeling that the developers, having analyzed how various reports are built in SQL, identified the most frequent cases and set them in concrete in the syntax.
Functions returning a recordset
In the output of aggregate/window functions, each result row corresponds to some range of rows of the incoming data stream. In real life such a correspondence does not always exist.
For example, suppose you need to build a 10x10 covariance matrix (the real-life case called for a 672x672 one). This can be done in a single pass: we run a custom aggregate function with 10 numeric parameters, and the result of its work is a recordset of 10 rows of 10 values each, where every element of the matrix refers to all the rows of the input stream (however many there are).
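The single-pass claim rests on the fact that covariances can be accumulated from running sums, cov(x, y) = E[xy] - E[x]E[y]. For just two columns PostgreSQL, for example, already ships such an aggregate:

-- a single pass over the table, no sorting involved
SELECT covar_pop(x, y) FROM samples;

A custom aggregate for the 10x10 case would do the same thing, keeping the ten sums and the pairwise products in its state and emitting the matrix in the final function.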
You could say: so what, in PostgreSQL, for example, a function can return a two-dimensional array (e.g. 'ARRAY[[1,2],[3,4]]'), or the matrix can simply be serialized into a string.
True, but it is not always possible to keep the size of the result within the limits this approach can tolerate.
A lyrical digression. Suppose, for example, that our task is to generalize geometry.
The sizes of the geometries are unknown to us; one of them may be the coastline of Eurasia with tens of millions of points. Or, on the contrary, the geometry may be very coarse and need to be smoothed with splines. One would like to pass parameters to the aggregate and get back a data stream rather than a vector or a string.
You can, of course, say that the task is contrived, that nobody does this, that geometries are stored in a DBMS in a special way, that there are special programs for processing geometries, ...
In fact, it is quite convenient to store geometry in point-by-point tables, if only because moving a single point does not require rewriting the entire blob. Before spatial data migrated wholesale into the DBMS, this is how it was stored, for example, in ArcSDE.
As soon as the average size of a geometry blob exceeds the page size, it becomes more profitable to work with points directly. If there were a physical possibility to operate on streams of points, perhaps the wheel of history would turn once again.
Still, the covariance matrix is not a very good example of desynchronization between the input and output streams, since the whole result appears at once at the end. Suppose instead that you want to process/compress a stream of source data, where
- there is a lot of data; it sits in a "heap" without indexes and is, in effect, simply written to disk "quickly"
- the data must be sorted into categories, of which there are relatively few
- within each category it must be averaged over time intervals, storing only the mean, the number of measurements and the variance
- and all of this has to be done quickly
What are the options?
- Within plain SQL this requires sorting by time interval/category, which contradicts the last requirement (a sketch of this plain-SQL formulation follows the list).
- If the data is already sorted by time (which, generally speaking, is not guaranteed) and you manage to convey this fact to the SQL processor, you can get by with window functions and a single pass.
- Write a separate application that does all of this - in PL/SQL or, more likely, given the amount of data, in C/C++.
- Functions returning a recordset. Perhaps these can help us.
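For reference, the plain-SQL formulation of the task could be as simple as this (a sketch; the table raw(ts, category, value) and the hourly interval are assumptions):

SELECT category,
       date_trunc('hour', ts) AS t_interval,
       avg(value)             AS mean,
       count(*)               AS n,
       var_pop(value)         AS variance
FROM   raw
GROUP  BY category, date_trunc('hour', ts);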
More detail on point 4. There are two mechanisms here - pipelined functions and temporary tables.
- Pipelined functions.
This mechanism appeared in Oracle (starting with 9i, 2001) and allows a function that returns a recordset not to accumulate the data but to produce it as it is needed (by analogy with the way stdout and stdin of two processes connected by a pipe are synchronized).
That is, the results of a pipelined function can be consumed before the function has finished. For this it is enough to declare the function as
FUNCTION f_trans(p refcur_t) RETURN outrecset PIPELINED IS …
and to emit result rows in the body:
LOOP
  …
  out_rec.var_char1 := in_rec.email;
  out_rec.var_char2 := in_rec.phone_number;
  PIPE ROW(out_rec);
  …
END LOOP;
As a result, we have
SELECT * FROM TABLE(
  refcur_pkg.f_trans(
    CURSOR(SELECT * FROM employees WHERE department_id = 60)));
With pipelined functions, custom aggregates are simply not needed.
Bravo, Oracle!
Not so long ago (2014) pipelined functions also appeared in DB2 (IBM i 7.1 TR9, i 7.2 TR1).
- Temporary tables.
To begin with, neither MS SQL nor PostgreSQL, it seems, allows a cursor to be returned from an aggregate function.
Well then, by analogy with pipelined functions, let us receive a cursor as a parameter, process it, write the results into a temporary table and return a cursor over it.
However, in MS SQL a cursor cannot be passed into a stored procedure as a parameter; a cursor can only be created inside the procedure and returned through an output parameter. The same can be said of PostgreSQL.
All right then, let us simply open the cursor ourselves, read it, process the values, compute the result, write it into a temporary table and hand back a cursor over that.
Or, even simpler, dump the query results into one temporary table, process it, and return the results through a cursor over another temporary table.
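In PostgreSQL such an approach could look roughly like this (a PL/pgSQL sketch; the function name and the reuse of the histogram query are assumptions):

CREATE FUNCTION hist2d_cursor() RETURNS refcursor AS $$
DECLARE
  result refcursor := 'hist2d_cur';
BEGIN
  -- materialize the processed result in a temporary table
  CREATE TEMP TABLE tmp_hist ON COMMIT DROP AS
    SELECT floor(x / 100) AS cx, floor(y / 100) AS cy, count(*) AS cnt
    FROM samples
    GROUP BY 1, 2;
  -- and hand back a cursor over it
  OPEN result FOR SELECT * FROM tmp_hist;
  RETURN result;
END;
$$ LANGUAGE plpgsql;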
What can one say? First, and most importantly, reading the data through a cursor is slower than processing it in the stream. Second, why do we need the SQL processor at all then - let us read the tables with cursors, create temporary tables by hand, write the join logic in loops... It is like assembler inserts in C/C++: you can indulge occasionally, but better not to abuse it.
So, having considered functions that return a recordset, we arrive at the following conclusions:
- Custom aggregates do not help us much here.
- In any case, you will need to create objects in the database, be it functions or temporary tables. At the very least, that requires the corresponding rights. And all you wanted was to crunch a few numbers.
- Still, even with the described limitations, and sometimes not too elegantly, the problem can be solved this way.
What else
Indeed, if we can already solve these problems, what more does the author want? Well, a Turing machine can also compute anything; it is just not very fast and not very convenient.
We formulate the requirements as follows:
- it should be a relational operator that can be used alongside the others (selection, projection, ...)
- it should be an operator that turns one data stream into another
- no synchronization between the input and output streams is required
- the structure of the output stream is defined when the operator is declared
- the operator has a dynamic initializer - a function whose body is given directly in the operator definition
- likewise a destructor, as a function (...)
- likewise a function (...) that is called every time a new row arrives from the input stream
- the operator has an execution context - a user-defined set of variables and/or collections needed for its work
- running this operator must not require creating database objects or any additional rights
- everything needed for the job is defined in one place, in one language
Once, a long time ago, the author built such an operator while extending a home-grown processor for a subset of TTM/Tutorial D. Now the same idea is proposed for SQL.
Fair warning: this is where SQL ends and improvisation begins. The syntax is kept as it was in the original; in the end, syntactic sugar can look like anything, it does not change the essence.
So, the chew operator consists of:
- A header, which lists the output fields and their types. Each output (and input) field is a local variable.
Ex: chew {"var1" float, "var2" integer} means that the output stream will have two columns - a floating-point one and an integer one.
- A body - a list of callbacks for events; at the moment these are the start of the stream, the end of the stream, and a row. The syntax of these functions is similar to PL/SQL. The predefined function __interrupt() is an analogue of PIPE: it takes the values of the variables corresponding to the output columns and places them into the output stream. If the output buffer overflows, the handler is suspended and the receiving side of the stream gets to work.
Ex: hook "init" { var1 := 0; var2 := -1; }
It is easiest to show this with examples.
- An analogue of the SUM aggregate function (a possible sketch follows this list).
It looks cumbersome, but this is just an example; nobody forces you to write a C program to add a pair of numbers.
- SUM + AVG
Here, note that the summation happens only once.
- SUM + GROUP BY
- ROW_NUMBER() OVER ()
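The original example listings are not reproduced here, so the following is only a guess at what the SUM analogue might look like, assembled from the syntax fragments above (the hook names "row" and "fini", the Samples table and its column val are assumptions):

select * from (select val from Samples)
chew {"s" float}
hook "init" { s := 0; }          -- stream start: reset the accumulator
hook "row"  { s := s + val; }    -- every input row: add it up
hook "fini" { __interrupt(); }   -- stream end: emit the single output row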
Can we offer an example where this approach gives results fundamentally unattainable by conventional means? We can.
Sometimes it happens that the data is almost sorted. It may even be completely sorted, but that is not known for certain.