
Sum optimization in PostgreSQL

Consider the following situation: there is a statistics table with identifier columns and counter columns, and we need to sum the counters over some subset of rows. How that subset is selected is not our concern here: plenty of books and articles have been written about indexes and partitioning. We will assume the data has already been selected in the optimal way and focus on summing it up faster.

This is not the first place to optimize when a query is slow; rather, it is the last. The ideas below make sense when the execution plan (explain) already looks flawless and there is nothing left to find fault with, but you still want to squeeze out a little more.

Let's make a test table and write 10 million records into it:
 create table s (
   d date,
   browser_id int not null,
   banner_id int not null,
   views bigint,
   clicks bigint,
   primary key (d, browser_id, banner_id)
 );

 insert into s
 select d, browser_id, banner_id, succ + insucc, succ
 from (
   select d, browser_id, banner_id,
          (array[0,0,50,500])[ceil(random()*4)] succ,
          (array[0,0,400,400000])[ceil(random()*4)] insucc
   from generate_series(now() - interval '99 day', now(), '1 day') d
   cross join generate_series(0, 999) banner_id
   cross join generate_series(0, 99) browser_id
 )_;
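A quick sanity check (assuming the generation above: 100 days × 1,000 banners × 100 browsers):

 select count(*) from s;  -- expect 10,000,000 rows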

The date, the IDs, and the primary key are there only for the sake of appearances: our task is simple, to sum over the entire table. The odd-looking generation of views and clicks values simulates a real situation in which values are often either zero or fairly large.

So let's get started. All timings are taken on a repeated execution of the query, to eliminate the effect of a cold cache.
I did not run rigorous benchmarks; I executed each query a few times, took the average and rounded it. My machine is slow, so your numbers will be better!
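For reproducing the measurements, psql's built-in timing or explain analyze is enough (an illustration, not the author's exact setup); the absolute numbers will of course differ from machine to machine:

 \timing on
 explain (analyze, buffers) select sum(clicks) from s;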

Method 1: head-on


 select sum(clicks) from s; 

9 seconds.

Method 2: change the type


Let's recreate our table, this time making the counter columns numeric (and loading the same data as before):
 create table s (
   d date,
   browser_id int not null,
   banner_id int not null,
   views numeric,
   clicks numeric,
   primary key (d, browser_id, banner_id)
 );

8 seconds. It would seem that numeric should be the more cumbersome type, since it can store numbers of practically unlimited size and is therefore far from the machine representation. In practice it turns out to be faster.

The explanation is as follows: sum, whether it takes bigint or numeric as input, returns numeric in both cases. This is done to avoid overflow. By storing numeric in the first place, we avoid the implicit conversion.

First note: if an int counter is enough (it can hold values up to about 2 billion), working with it is even faster. sum over int returns bigint.

Second note: ordinary arithmetic operations (+, -, *, /) are slower on numeric than on bigint. Yet sum is faster on numeric.
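The return types are easy to check directly (a small illustration of my own, not from the original article):

 select pg_typeof(sum(1::int))     as sum_of_int,     -- bigint
        pg_typeof(sum(1::bigint))  as sum_of_bigint,  -- numeric
        pg_typeof(sum(1::numeric)) as sum_of_numeric; -- numeric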

Method 3: do not count zeros


 select sum(clicks) from s where clicks <> 0; 

This gets us down to 7 seconds. But it has a drawback: when summing values from several columns it is not obvious how to apply it, especially if the columns go to zero independently of each other.
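For instance, the naive multi-column variant (my illustration, not the author's) can only skip rows where every counter is zero, so rows with a single zero column are still processed:

 select sum(views), sum(clicks)
 from s
 where views <> 0 or clicks <> 0;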

Method 4: replace zeros with nulls


 select sum(nullif(clicks, 0)) from s; 

Same 7 seconds, but the method works better than the previous one when summing up several columns.

The reason is that sum, as a strict aggregate function, ignores incoming nulls.
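With several columns the trick composes naturally (a sketch on the test table above):

 select sum(nullif(views, 0))  as views,
        sum(nullif(clicks, 0)) as clicks
 from s;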

Methods 3 and 4 are appropriate when there is a significant number of zeros in the column (at least 10%, and preferably half).

Method 5: replace zeros with nulls directly in the table


This optimization method is somewhat "unethical": we brazenly trample on the semantics of null. But it works.
 create table s2 (like s including all);

 insert into s2
 select d, browser_id, banner_id, nullif(views, 0), nullif(clicks, 0)
 from s;


 select sum(clicks) from s2; 

6 seconds. Apparently the reason is that postgres stores nulls in the per-row null bitmap, so the tuple gets smaller.
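The size difference is easy to see (a quick check, assuming both tables are populated as above):

 select pg_size_pretty(pg_relation_size('s'))  as size_s,
        pg_size_pretty(pg_relation_size('s2')) as size_s2;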

A surprised reader may ask: why didn't we simply exclude the zero rows at selection time, for example with a partial index?
The answer: real tables have many columns, and the zeros in them occur independently.
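For a single counter such an index is easy to build (a sketch; the index name is mine), but it only covers that one column's zeros:

 create index s_clicks_nonzero_idx on s (d, browser_id, banner_id) where clicks <> 0;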

Links to documentation:
  1. Numeric types
  2. Sum aggregate function
  3. User-defined aggregate functions, strictness

Source: https://habr.com/ru/post/186788/

