How to calculate everything in the world by a single SQL query. PostgreSQL window functions

I was surprised to find that many developers, even for a long time using postgresql, do not understand window functions, considering them to be some kind of special magic for the elect. Well, or at best, copy-paste with StackOverflow expressions of the type “row_number () OVER ()”, without going into details. But window functions are a useful PostgreSQL functional .
I will try to explain in a simple way how to use them.

First I want to immediately explain that the window functions do not change the selection, but only add some additional information about it. Those. for ease of understanding, we can assume that postgres first performs the entire request (except sorting and limit), and then only calculates window expressions.
The syntax is something like this:

 OVER

A window is some expression that describes the set of strings that the function will handle and the order of this processing.
Moreover, the window can be simply given with empty brackets (), i.e. the window is all the rows of the query result.

For example, in this select, line numbers are simply added to the usual id, header, and score fields.
')

 SELECT id, section, header, score, row_number() OVER () AS num FROM news; id | section | header | score | num ----+---------+-----------+-------+----- 1 | 2 |  | 23 | 1 2 | 1 |  | 6 | 2 3 | 4 |  | 79 | 3 4 | 3 |  | 36 | 4 5 | 2 |  | 34 | 5 6 | 2 |  | 95 | 6 7 | 4 |  | 26 | 7 8 | 3 |  | 36 | 8

ORDER BY can be added to the window expression, then the processing order can be changed.

 SELECT id, section, header, score, row_number() OVER (ORDER BY score DESC) AS rating FROM news ORDER BY id; id | section | header | score | rating ----+---------+-----------+-------+-------- 1 | 2 |  | 23 | 7 2 | 1 |  | 6 | 8 3 | 4 |  | 79 | 2 4 | 3 |  | 36 | 4 5 | 2 |  | 34 | 5 6 | 2 |  | 95 | 1 7 | 4 |  | 26 | 6 8 | 3 |  | 36 | 3

Please note that I also added the ORDER BY id at the end of the entire order, and the rating was calculated correctly anyway. Those. Posgres simply sorted the result along with the result of the window function, one order does not interfere with another.

Further more. In the window expression you can add the word PARTITION BY [expression] ,
for example row_number () OVER (PARTITION BY section) , then the counting will go in each group separately:

 SELECT id, section, header, score, row_number() OVER (PARTITION BY section ORDER BY score DESC) AS rating_in_section FROM news ORDER BY section, rating_in_section; id | section | header | score | rating_in_section ----+---------+-----------+-------+------------------- 2 | 1 |  | 6 | 1 6 | 2 |  | 95 | 1 5 | 2 |  | 34 | 2 1 | 2 |  | 23 | 3 4 | 3 |  | 36 | 1 8 | 3 |  | 36 | 2 3 | 4 |  | 79 | 1 7 | 4 |  | 26 | 2

If you do not specify a partition, then the partition is the entire request.

Here it is necessary to say a little about the functions that can be used, since there is a very important nuance.
As a function, you can use, so to speak, the true window functions from the manual - this is row_number (), rank (), lead (), etc., or you can use functions-aggregates, such as: sum (), count () etc. So, this is important, the aggregate functions work slightly differently: if ORDER BY is not specified in the window, the whole partition is counted once, and the result is written in all rows (the same for all rows of the partition). If ORDER BY is specified, then the counting on each line goes from the beginning of the partition to this line.

Let's look at this with an example. For example, we have a certain (spherical in vacuum) table of balance replenishment.

 SELECT transaction_id, change FROM balance_change ORDER BY transaction_id; transaction_id | change ----------------+-------- 1 | 1.00 2 | -2.00 3 | 10.00 4 | -4.00 5 | 5.50

and we want to know at the same time how the balance on the balance changed:

 SELECT transaction_id, change, sum(change) OVER (ORDER BY transaction_id) as balance FROM balance_change ORDER BY transaction_id; transaction_id | change | balance ----------------+--------+--------- 1 | 1.00 | 1.00 2 | -2.00 | -1.00 3 | 10.00 | 9.00 4 | -4.00 | 5.00 5 | 5.50 | 10.50

Those. for each row is counted in a separate frame . In this case, a frame is a set of lines from the beginning to the current line (if it were PARTITION BY, then from the beginning of the partition).

If we do not use ORDER BY in the window for the aggregate function sum, then we simply calculate the total amount and show it in all the rows. Those. the frame for each of the lines will be the entire set of lines
from start to finish partitions.

 SELECT transaction_id, change, sum(change) OVER () as result_balance FROM balance_change ORDER BY transaction_id; transaction_id | change | result_balance ----------------+--------+---------------- 1 | 1.00 | 10.50 2 | -2.00 | 10.50 3 | 10.00 | 10.50 4 | -4.00 | 10.50 5 | 5.50 | 10.50

This is the feature of aggregate functions, if they are used as window functions. In my opinion, this is a rather strange, intuitively unobvious moment of the SQL standard.

Window functions can be used at once in several pieces, they do not interfere with each other at all, so that you will not write in them there.

 SELECT transaction_id, change, sum(change) OVER (ORDER BY transaction_id) as balance, sum(change) OVER () as result_balance, round( 100.0 * sum(change) OVER (ORDER BY transaction_id) / sum(change) OVER (), 2 ) AS percent_of_result, count(*) OVER () as transactions_count FROM balance_change ORDER BY transaction_id; transaction_id | change | balance | result_balance | percent_of_result | transactions_count ----------------+--------+---------+----------------+-------------------+-------------------- 1 | 1.00 | 1.00 | 10.50 | 9.52 | 5 2 | -2.00 | -1.00 | 10.50 | -9.52 | 5 3 | 10.00 | 9.00 | 10.50 | 85.71 | 5 4 | -4.00 | 5.00 | 10.50 | 47.62 | 5 5 | 5.50 | 10.50 | 10.50 | 100.00 | 5

If you have a lot of identical expressions after OVER, then you can give them a name and take out separately with the keyword WINDOW to avoid duplication of code. Here is an example from the manual:

 SELECT sum(salary) OVER w, avg(salary) OVER w FROM empsalary WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);

Here w after the word OVER comes without parentheses.

The result of the window function cannot be filtered in the query using WHERE, because window functions are performed after all the filtering and grouping, i.e. with what happened. Therefore, to select, for example, the top 5 news items in each group, you need to use a subquery:

 SELECT * FROM ( SELECT id, section, header, score, row_number() OVER (PARTITION BY section ORDER BY score DESC) AS rating_in_section FROM news ORDER BY section, rating_in_section ) counted_news WHERE rating_in_section <= 5;

Another example to fix. In addition to row_number () there are several other functions. For example, lag, which searches for a line before the last row of a frame. For example, we can find how many points the news lags behind the previous one in the rating:

 SELECT id, section, header, score, row_number() OVER w AS rating, lag(score) OVER w - score AS score_lag FROM news WINDOW w AS (ORDER BY score DESC) ORDER BY score desc; id | section | header | score | rating | score_lag ----+---------+-----------+-------+--------+----------- 6 | 2 |  | 95 | 1 | 3 | 4 |  | 79 | 2 | 16 8 | 3 |  | 36 | 3 | 43 4 | 3 |  | 36 | 4 | 0 5 | 2 |  | 34 | 5 | 2 7 | 4 |  | 26 | 6 | 8 1 | 2 |  | 23 | 7 | 3 2 | 1 |  | 6 | 8 | 17

I ask in comments to distribute examples where it is especially convenient to use window functions. And also, what problems may arise with them, if any.

Source: https://habr.com/ru/post/268983/

All Articles

How to calculate everything in the world by a single SQL query. PostgreSQL window functions

More articles: