
Lambda functions in SQL ... let me think


What the article is about should already be clear from the title.

In addition, the author will explain why, from his point of view, this is needed, and argue that the subject is not just a fashionable technology but "a doubly needed thing - both pleasant and useful."

It is always interesting to watch how a few talented people build something (a programming language, why not), knowing exactly which problem they are solving and which goals they have set for themselves, and then test their creation on themselves. No comparison with the monumental creations of giant committees, each standing guard over the harmony of the universe as it understands it.

Compare, for example, the fates of FORTRAN and PL/1. Who even remembers PL/1 now?
From this point of view the AWK language, for example, is very successful. It is worth recalling that in its name A is Alfred Aho, one of the authors of the Dragon Book; W is Peter Weinberger, who also had a hand in Fortran-77; K is Brian Kernighan, who needs no introduction. The language is designed to process a text stream on the fly in pipes between processes.

The language is typeless (this is not quite true), the syntax is very C-like; it has filtering capabilities, associative arrays, stream start/end events, a new-line event...

What has always impressed the author about this language is that its interpreter does not need to be installed: it is always available on UNIX-like systems, and on Windows it is enough to copy the executable file and everything works. However, this is beside the point.

In his work the author often has to use the SQL + AWK combination, and here is why. SQL is still an inherently declarative language for managing data flows. It provides very limited facilities for working with the execution context of a query, in the form of aggregate functions.

How, for example, would one build a two-dimensional histogram in SQL?

 -- histogram with 100 x 100 buckets
 SELECT count(*), round(x, -2) AS cx, round(y, -2) AS cy
 FROM samples
 GROUP BY cx, cy

But GROUP BY makes me pay for sorting, which is not cheap when there are hundreds of millions (or even more) of rows.
UPD: in the comments I was corrected that this is not quite so (or not so at all).
The SQL processor can evaluate aggregate functions while building a hash table keyed by the grouping criterion. For that it needs enough free memory to hold the hash map in memory.

Then the group contexts are updated as the table is read, and by the end of the read we already have the computed result.
The same technique extends to window functions (see below), only the context becomes "thicker".

When the number of groups is not known in advance or is very large, the SQL processor is forced to build a temporary index and make a second pass over it.

In simple cases, for example like this one (a plain COUNT), a universal option is possible: a temporary index on (cx, cy, count); with a small number of groups it will sit entirely in memory on cached pages. In complex cases (window functions) the state of a group becomes non-trivial, and constantly (de)serializing it is not at all what the doctor ordered.
In short: the SQL processor resorts to sorting when it cannot estimate the number of groups produced by a GROUP BY. And grouping by computed values is (often) exactly that case.
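A quick way to see which strategy the planner actually picks is to look at the query plan. A minimal sketch for PostgreSQL follows; the plan node names are real, but which one you get depends on table statistics and the work_mem setting, so treat it as an illustration rather than a guarantee.

 -- Check whether the planner chooses hash aggregation or sorting for the histogram query.
 EXPLAIN
 SELECT count(*), round(x, -2) AS cx, round(y, -2) AS cy
 FROM samples
 GROUP BY cx, cy;

 -- Enough memory / few estimated groups -> HashAggregate (single pass, group contexts in a hash map)
 -- Otherwise                            -> Sort (possibly an external merge) + GroupAggregate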

So you have to do something like:

 psql -t -q -c 'select x, y from samples' | gawk -f mk_hist2d.awk 

where mk_hist2d.awk accumulates statistics in an associative array and displays it upon completion

 # mk_hist2d.awk
 # psql's default aligned output yields rows like " 123 | 456", so $1 = x and $3 = y.
 { bucket[int($1*0.01), int($3*0.01)] += 1; }
 END {
     for (i = 0; i < 500; i++)
         for (j = 0; j < 500; j++) {
             if ((i, j) in bucket)
                 print i*100, j*100, bucket[i, j];
             else
                 print i*100, j*100, 0;
         }
 }

There is one BUT: the full data stream has to be shipped from the server to the client machine, and that is not exactly cheap.

Is it possible to combine business with pleasure: accumulate the statistics while the SQL query is executing, but without resorting to sorting? Yes, for example, by using custom aggregate functions.

Custom aggregate functions


They exist in various systems, and everywhere they are done a little differently.

  1. PostgreSQL. The documentation is here, with more details here.
    One of the linked examples computes the maximum account balance.
    Another one computes which value is more frequent in a boolean column - true or false.

    It looks like this -

     CREATE AGGREGATE mode(boolean) (
         SFUNC = mode_bool_state,
         STYPE = INT[],
         FINALFUNC = mode_bool_final,
         INITCOND = '{0,0}'
     );

    Here, SFUNC is a function that is called for every row of the stream;
    its first argument is of type STYPE.

    FINALFUNC completes the calculation and returns the value of the aggregate.
    INITCOND is the initial value of the internal state (STYPE) that is passed as the first argument.
    Given that the functions can be written in C (which means the internal state can use memory that is freed automatically when the query is closed), this is a very powerful tool; one still has to get around to actually using it, though. A sketch of what such state and final functions might look like is given right after this list.
  2. MS SQL.
    Before (SQL Server 2000), one had to create an ActiveX object ahead of the query and do the aggregation through that object.
    Now (2016+) this is done in the CLR environment. You have to create a custom function, create and register an assembly, and then you can create an aggregate.
    There are examples of computing a geometric mean and of concatenating strings: with additional parameters and with a user-defined type for storing the intermediate state.
  3. Oracle .
    In Oracle, this is done using the ODCIAggregate Data Cartridge (interface).
    To create your own aggregate, you have to write a custom type that implements four methods:
    - initialization (ODCIAggregateInitialize), static; it must create an instance of the required type and return it through a parameter
    - iteration (ODCIAggregateIterate), called for each row of the data
    - merge (ODCIAggregateMerge), used to combine partial aggregates computed in parallel
    - finish (ODCIAggregateTerminate), which produces the result
    Examples: 1, 2, 3, 4.
  4. DB2.
    There is no explicit way to create custom aggregates in DB2.
    But you can feed a user-defined type (in Java) to a standard aggregate function (even MAX) and make the system execute queries of the form

     CREATE TYPE Complex AS (real DOUBLE, i DOUBLE) …
     CREATE TABLE complexNumbers (
         id INTEGER NOT NULL PRIMARY KEY,
         number Complex
     ) …
     SELECT sum..real, sum..i
     FROM (
         SELECT GetAggrResult(MAX(BuildComplexSum(number)))
         FROM complexNumbers
     ) AS t(sum)
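Returning to item 1, here is a minimal sketch of what the state and final functions behind mode(boolean) could look like. The bodies below are an illustrative guess written in plain SQL, not the code from the linked example.

 -- Hedged sketch of the two functions referenced by CREATE AGGREGATE mode(boolean) above.
 -- state[1] counts false values, state[2] counts true values; INITCOND '{0,0}' matches this layout.
 CREATE FUNCTION mode_bool_state(state INT[], val BOOLEAN) RETURNS INT[] AS $$
     SELECT CASE WHEN val
                 THEN ARRAY[state[1], state[2] + 1]
                 ELSE ARRAY[state[1] + 1, state[2]]   -- NULLs end up counted as false here
            END;
 $$ LANGUAGE sql;

 CREATE FUNCTION mode_bool_final(state INT[]) RETURNS BOOLEAN AS $$
     SELECT state[2] >= state[1];   -- ties go to true (an arbitrary choice)
 $$ LANGUAGE sql;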

What attracts attention in all these systems?


Window functions


Window functions appeared in the SQL:2003 standard. At the moment they are supported by all the systems mentioned above. In essence, window functions are an extension of the aggregate machinery. And, of course, custom aggregate functions also work in a window context.

The extension is this: before SQL:2003, aggregate functions worked over an implicit window, which was either the entire resultset or the part of it corresponding to a combination of values of the GROUP BY fields. Now the user has some freedom in shaping this window.

The difference is that values computed over windows are added to the output as separate columns and do not force the whole stream to be collapsed the way aggregate functions do. So in one query you can use several window aggregates, each in its own context (window). Several aggregate functions were possible before as well, but they all worked over the same window.

In broad strokes:


The syntax for using window functions varies slightly across systems.
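For illustration, here is a hedged example reusing the samples(x, y) table from the histogram above: three window aggregates in one query, each over its own window.

 -- Three windows in one query over the samples(x, y) table from the earlier example.
 SELECT
     x,
     count(*) OVER ()                                         AS total_rows,   -- window = whole resultset
     avg(x)   OVER (PARTITION BY round(y, -2))                AS avg_in_band,  -- window = rows in the same y band
     sum(x)   OVER (ORDER BY x
                    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running3      -- window = sliding frame of 3 rows
 FROM samples;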

To sum up the above, there remains a slightly painful feeling that the developers, having analyzed the construction of various reports in SQL, singled out the most frequent cases and set them in concrete in the syntax.

Functions returning a recordset


In the output of aggregate/window functions, each result row corresponds to a certain range of rows of the incoming data stream. In real life such a correspondence does not always exist.

For example, you need to build a 10x10 covariance matrix (this case would need 672x672). It can be done in a single pass: we run a custom aggregate function with 10 numeric parameters. The result of its work is a recordset of 10 rows of 10 values, and every element of the matrix relates to all the rows of the input stream (however many there are).

One can object: so what, in PostgreSQL, for example, you can return a two-dimensional array from a function (e.g. ARRAY[[1,2],[3,4]]). Or simply serialize the matrix into a string.
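A minimal sketch of that workaround; the function name and its contents are invented for illustration, and a real implementation would aggregate over the input rows instead of returning constants.

 -- Invented illustration of returning a two-dimensional array from a function.
 CREATE FUNCTION toy_cov_matrix() RETURNS float8[][] AS $$
     SELECT ARRAY[[1.0, 2.0],
                  [3.0, 4.0]]::float8[];
 $$ LANGUAGE sql;

 SELECT toy_cov_matrix();             -- the whole matrix as one value
 SELECT (toy_cov_matrix())[1][2];     -- a single element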

That is fine, but it is not always possible to keep the size of the result within the limits this approach can tolerate.

Lyrical digression
Suppose, for example, that our task is to generalize geometries.

The size of the geometries is unknown to us; it may be the coastline of Eurasia with tens of millions of points. Or, on the contrary, a very coarse geometry that needs to be smoothed with splines. One would like to pass parameters to the aggregate and get back a data stream rather than a vector or a string.

One can, of course, say that the task is contrived, that nobody does this, that geometries are stored in a DBMS in a special way, that there are special programs for processing geometries, and so on.

In fact, it is quite convenient to store geometries in point-by-point tables, if only because moving one point does not require rewriting the whole blob. Before spatial data migrated wholesale into the DBMS, this is how it was done, for example, in ArcSDE.

As soon as the average size of a geometry blob exceeds the page size, it becomes more profitable to work with the points directly. If there were a physical possibility to operate on streams of points, perhaps the wheel of history would turn once more.

The covariance matrix is still not a very good example of desynchronization between the input and output streams, since the entire result appears at once at the end. Suppose instead that you want to process/compress the source data stream, and in doing so...


What are the options?

  1. Within plain SQL this requires sorting by time interval/category, which contradicts the last point.
  2. If the data is already sorted by time (which, generally speaking, is not guaranteed) and you manage to convey this fact to the SQL processor, window functions and a single pass will do.
  3. Write a separate application that does all of this. In PL/SQL or, more likely, given the amount of data, in C/C++.
  4. Functions returning a recordset. Perhaps they can help us.

More detail on point 4. There are two mechanisms for this: temporary tables and pipelined functions.

  1. Pipelined functions.
    This mechanism appeared in Oracle (starting with 9i, 2001) and lets a function that returns a recordset compute the data on demand instead of accumulating it (by analogy with the way stdout and stdin of two pipe-connected processes are synchronized).
    That is, the results of a pipelined function can be consumed before the function has finished. For this it is enough to say in the function definition

      FUNCTION f_trans(p refcur_t) RETURN outrecset PIPELINED IS … 

    and in the body emit result rows

     LOOP
       …
       out_rec.var_char1 := in_rec.email;
       out_rec.var_char2 := in_rec.phone_number;
       PIPE ROW(out_rec);
       …
     END LOOP;

    As a result, we have

     SELECT * FROM TABLE(
         refcur_pkg.f_trans(
             CURSOR(SELECT * FROM employees WHERE department_id = 60)));

    Custom aggregates are simply not needed when pipelined functions are available.

    Bravo, Oracle!

    Not so long ago (2014) pipelined functions also appeared in DB2 (IBM i 7.1 TR9, i 7.2 TR1). The closest PostgreSQL counterpart, a set-returning function, is sketched after this list.
  2. Temporary tables.
    To begin with, it seems impossible to return a cursor from an aggregate function either in MS SQL or in PostgreSQL.

    Well then, by analogy with pipelined functions, let us take a cursor as a parameter, process it, write the results into a temporary table and return a cursor over it.

    However, in MS SQL a cursor cannot be passed to a stored procedure as a parameter; a cursor can only be created inside the procedure and returned through an output parameter. The same goes for PostgreSQL.

    All right then, we simply open a cursor ourselves, read it through, process the values, compute the result, write it into a temporary table and hand back a cursor over it.

    Or even simpler: we put the query results into one temporary table, process it, and return the results through a cursor over another temporary table.

    What can one say. First, and most importantly, reading data through a cursor is slower than processing it in the stream. Secondly, why do we need the SQL processor at all then; let's read tables with cursors, create temporary tables by hand, write the join logic in loops... This is like assembler inserts in C/C++: you can indulge once in a while, but it is better not to abuse it.
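As for the PostgreSQL counterpart mentioned in item 1, the closest built-in mechanism is a set-returning function. A hedged sketch follows, reusing the employees table from the Oracle example (table and column names are assumptions). Note that a PL/pgSQL function like this first materializes its whole result into a tuplestore, so it is not truly pipelined; row-at-a-time delivery requires a C-language function running in value-per-call mode.

 -- Hedged PostgreSQL counterpart: a set-returning function over the employees table
 -- from the Oracle example above (table and column names are assumptions).
 CREATE FUNCTION f_trans(dept int)
 RETURNS TABLE (var_char1 text, var_char2 text) AS $$
 BEGIN
     RETURN QUERY
         SELECT email::text, phone_number::text
         FROM employees
         WHERE department_id = dept;
     -- PL/pgSQL collects all rows into a tuplestore before returning them;
     -- true pipelined behaviour needs a C-language function in value-per-call mode.
 END;
 $$ LANGUAGE plpgsql;

 SELECT * FROM f_trans(60);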

So, having considered the question of functions that return a recordset, we arrive at the following conclusions:


What else


Really, if we can already solve these problems, what else does the author need?
Well, a Turing machine can also compute anything; it just does not do it very quickly or very conveniently.

We formulate the requirements as follows:

  1. it should be a relational operator that can be used alongside the others (selection, projection, ...)
  2. it must be an operator that turns one data stream into another
  3. there is no synchronization between the input and output streams
  4. the structure of the output stream is defined when the operator is declared
  5. the operator can be initialized dynamically (via a function, or rather its body, specified directly in the operator definition)
  6. likewise a destructor, as a function (...)
  7. likewise a function (...) that is called every time a new row arrives from the input stream
  8. the operator has an execution context: a user-defined set of variables and/or collections needed for its work
  9. running this operator must not require creating database objects or extra privileges
  10. everything required for the work is defined in one place, in one language

Once, a long time ago, the author built such an operator by extending a home-grown processor of an implemented subset of TTM/Tutorial D. Now the same idea is proposed for SQL.

Fair warning: here SQL ends and improvisation begins. The syntax is kept as it was in the original; in the end, syntactic sugar can look like anything, it does not change the essence.

So, the chew operator consists of

  1. A header, which lists the output fields and their types.
    Each output (and input) field is a local variable.
    Ex: chew {"var1" float, "var2" integer} means that the output stream will have two columns: a floating-point one and an integer one.
  2. A body: a list of callbacks on events; at the moment these are start of the stream, end of the stream, and a row. The syntax of the functions is similar to PL/SQL. The predefined function __interrupt() is an analogue of PIPE ROW; it takes the values of the variables corresponding to the output columns and places them into the output stream. If the output buffer overflows, the handler is suspended and the receiving side of the stream gets to work.
    Ex: hook "init" { var1 := 0; var2 := -1; }

The easiest way is to show examples.
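The author's original examples are not reproduced here, but a rough sketch can be assembled from the fragments quoted above. Only chew {...}, hook "init" {...} and __interrupt() actually appear in the description; the other hook names and the body comments are guesses, shown just to convey the shape of the thing.

 -- Hypothetical sketch only; hook names other than "init" are guesses.
 chew {"cx" integer, "cy" integer, "cnt" integer}    -- output stream: three integer columns
 hook "init" {
     -- start of stream: set up an associative collection for the histogram buckets
 }
 hook "row" {
     -- per-row event: bump the bucket that the incoming (x, y) pair falls into
 }
 hook "done" {
     -- end of stream: for every bucket set cx, cy, cnt and call __interrupt()
     -- to push one row into the output stream
 }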


Is it possible to offer an example in which this approach gives results fundamentally unattainable in the usual way? There is one.

Sometimes it happens that the data is almost sorted. It may even be completely sorted, but this is not known for certain.

[The rest of this example did not survive translation: it discusses a data stream that arrives almost ordered by time, i.e. a row with timestamp T1 read before a row with timestamp T2 almost always has T1 < T2, and how the proposed operator could exploit that; it closes by observing that what is described is, in essence, a lambda function inside an SQL query.]

Conclusion


[The concluding remarks also did not survive translation; they mention PL/SQL, GROUP BY and the SQL processor, and end with a short postscript.]

Source: https://habr.com/ru/post/435120/

