📜 ⬆️ ⬇️

Thinning timeframes (cryptocurrency, forex, stock exchange)

Some time ago, I was tasked to write a procedure that performs thinning out of the Forex market quotes (more precisely, timeframe data).

Task statement: data is sent to the input at 1 second intervals in this format:


It is necessary to provide recalculation and synchronization of data in the tables: 5 s, 15 s, 1 min, 5 min, 15 min, etc.
')
The described data storage format is called OHLC, or OHLCV (Open, High, Low, Close, Volume). It is often used, and it is immediately possible to construct the “Japanese Candles” graph.

image

Under the cut, I described all the options that I could think of, how to thin out (enlarge) the data, for analyzing, for example, the winter jump in Bitcoin prices, and based on the data, you immediately build the “Japanese Candles” chart (in MS Excel, too) ). In the picture above, this graph is built for the 1 month timeframe, for the “bitstampUSD” tool. The white body of the candle means an increase in price in the interval, black - a decrease in price, the upper and lower wicks mean the maximum and minimum prices that were achieved in the interval. Background - the volume of transactions. It is clearly seen that in December 2017, the price came close to the 20K mark.

The solution will be given for the two database engines, for Oracle and MS SQL, which, in some way, will make it possible to compare them on this particular task (we will not generalize the comparison to other tasks).

Then I solved the problem in a trivial way: calculating the correct thinning into a temporary table and synchronizing with the target table - deleting rows that exist in the target table but do not exist in the temporary table and adding rows that exist in the temporary table but do not exist in the target table. At that time, the Customer was satisfied with the decision, and I closed the task.

But now I decided to consider all the options, because the above solution contains one feature - it is difficult to optimize it for two cases at once:


This is due to the fact that the procedure will have to connect the target table and the temporary table, and you need to join to the larger one, and not vice versa. In the above two cases, the larger / smaller are swapped. The optimizer will decide on the join order based on statistics, and the statistics may be outdated, and the decision may be made wrong, leading to significant performance degradation.

In this article I will describe the methods of one-time thinning, which can be useful for readers to analyze, for example, the winter jump in the price of Bitcoin.

Procedures for online thinning can be downloaded from github at the link at the bottom of the article.

To the point ... My task concerned thinning from the “1 sec” timeframe to the following, but here I consider thinning from the transaction level (in the source table, the fields STOCK_NAME, UT, ID, APRICE, AVOLUME). Because such data gives the site bitcoincharts.com.
Actually, thinning from the transaction level to the “1 second” level is performed by such a command (the operator is easily translated into thinning from the “1 second” level to the upper levels):

On Oracle:

select 1 as STRIPE_ID , STOCK_NAME , TRUNC_UT (UT, 1) as UT , avg (APRICE) keep (dense_rank first order by UT, ID) as AOPEN , max (APRICE) as AHIGH , min (APRICE) as ALOW , avg (APRICE) keep (dense_rank last order by UT, ID) as ACLOSE , sum (AVOLUME) as AVOLUME , sum (APRICE * AVOLUME) as AAMOUNT , count (*) as ACOUNT from TRANSACTIONS_RAW group by STOCK_NAME, TRUNC_UT (UT, 1); 

The avg () keep (dense_rank first order by UT, ID) function works like this: since the query is grouped by GROUP BY, each group is calculated independently of the others. Within each group, the rows are sorted by UT and ID, numbered by the function dense_rank . Since next is the function first, then the line is selected where dense_rank returned 1 (in other words, the minimum is chosen) - the first transaction is selected within the interval. For this minimum UT, ID, if there were several lines, it would be considered an average. But in our case there will be guaranteed one line (due to the uniqueness of the ID), so the resulting value is immediately returned as AOPEN. It is easy to see that the first function replaces the two aggregate.

On MS SQL

There are no first / last functions (there is first_value / last_value , but this is not the case). Therefore, you have to connect the table with itself.

I will not give the request separately, but it can be viewed below in the dbo.THINNING_HABR_CALC procedure. Of course, without first / last, it is not so elegant, but it will work.

How can this problem be solved by one operator? (Here, the term "one operator" means not that the operator will be one, but that there will be no cycles that "pull" the data one line at a time.)

I will list all the options for solving this problem known to me:

  1. SIMP (simple, simple, Cartesian product),
  2. CALC (calculate, iterative thinning of the upper levels),
  3. CHIN (china way, cumbersome request for all levels at once),
  4. UDAF (user-defined aggregate function),
  5. PPTF (pipelined and parallel table function, procedural solution, but with only two cursors, in fact, two SQL statements),
  6. MODE (model, MODEL phrase),
  7. and IDEA (ideal, the ideal solution that cannot work now).

Looking ahead to say that this is the rare case when the procedural solution PPTF is the most effective on Oracle.

Download transaction files from http://api.bitcoincharts.com/v1/csv
I recommend choosing kraken files *. The localbtc * files are very noisy - they contain distracting lines with unrealistic prices. All kraken * contain about 31M transactions, I recommend to exclude krakenEUR from there, then the transaction becomes 11M. This is the most convenient volume for testing.

Run the script in Powershell to generate control files for SQLLDR for Oracle and to generate an import request for MSSQL.

  # MODIFY PARAMETERS THERE $OracleConnectString = "THINNING/aaa@P-ORA11/ORCL" # For Oracle $PathToCSV = "Z:\10" # without trailing slash $filenames = Get-ChildItem -name *.csv Remove-Item *.ctl -ErrorAction SilentlyContinue Remove-Item *.log -ErrorAction SilentlyContinue Remove-Item *.bad -ErrorAction SilentlyContinue Remove-Item *.dsc -ErrorAction SilentlyContinue Remove-Item LoadData-Oracle.bat -ErrorAction SilentlyContinue Remove-Item LoadData-MSSQL.sql -ErrorAction SilentlyContinue ForEach ($FilenameExt in $Filenames) { Write-Host "Processing file: "$FilenameExt $StockName = $FilenameExt.substring(1, $FilenameExt.Length-5) $FilenameCtl = '.'+$Stockname+'.ctl' Add-Content -Path $FilenameCtl -Value "OPTIONS (DIRECT=TRUE, PARALLEL=FALSE, ROWS=1000000, SKIP_INDEX_MAINTENANCE=Y)" Add-Content -Path $FilenameCtl -Value "UNRECOVERABLE" Add-Content -Path $FilenameCtl -Value "LOAD DATA" Add-Content -Path $FilenameCtl -Value "INFILE '.$StockName.csv'" Add-Content -Path $FilenameCtl -Value "BADFILE '.$StockName.bad'" Add-Content -Path $FilenameCtl -Value "DISCARDFILE '.$StockName.dsc'" Add-Content -Path $FilenameCtl -Value "INTO TABLE TRANSACTIONS_RAW" Add-Content -Path $FilenameCtl -Value "APPEND" Add-Content -Path $FilenameCtl -Value "FIELDS TERMINATED BY ','" Add-Content -Path $FilenameCtl -Value "(ID SEQUENCE (0), STOCK_NAME constant '$StockName', UT, APRICE, AVOLUME)" Add-Content -Path LoadData-Oracle.bat -Value "sqlldr $OracleConnectString control=$FilenameCtl" Add-Content -Path LoadData-MSSQL.sql -Value "insert into TRANSACTIONS_RAW (STOCK_NAME, UT, APRICE, AVOLUME)" Add-Content -Path LoadData-MSSQL.sql -Value "select '$StockName' as STOCK_NAME, UT, APRICE, AVOLUME" Add-Content -Path LoadData-MSSQL.sql -Value "from openrowset (bulk '$PathToCSV\$FilenameExt', formatfile = '$PathToCSV\format_mssql.bcp') as T1;" Add-Content -Path LoadData-MSSQL.sql -Value "" } 

Create a transaction table on Oracle.

 create table TRANSACTIONS_RAW ( ID number not null , STOCK_NAME varchar2 (32) , UT number not null , APRICE number not null , AVOLUME number not null) pctfree 0 parallel 4 nologging; 

On Oracle, launch the LoadData-Oracle.bat file , having previously corrected the connection parameters at the beginning of the Powershell script.

I work in a virtual machine. Downloading all 11M transaction files in 8 kraken * files (I missed the EUR file) took about 1 minute.

And create functions that will truncate dates to the boundaries of intervals:

 create or replace function TRUNC_UT (p_UT number, p_StripeTypeId number) return number deterministic is begin return case p_StripeTypeId when 1 then trunc (p_UT / 1) * 1 when 2 then trunc (p_UT / 10) * 10 when 3 then trunc (p_UT / 60) * 60 when 4 then trunc (p_UT / 600) * 600 when 5 then trunc (p_UT / 3600) * 3600 when 6 then trunc (p_UT / ( 4 * 3600)) * ( 4 * 3600) when 7 then trunc (p_UT / (24 * 3600)) * (24 * 3600) when 8 then trunc ((trunc (date '1970-01-01' + p_UT / 86400, 'Month') - date '1970-01-01') * 86400) when 9 then trunc ((trunc (date '1970-01-01' + p_UT / 86400, 'year') - date '1970-01-01') * 86400) when 10 then 0 when 11 then 0 end; end; create or replace function UT2DATESTR (p_UT number) return varchar2 deterministic is begin return to_char (date '1970-01-01' + p_UT / 86400, 'YYYY.MM.DD HH24:MI:SS'); end; 

Consider the options. First, the code for all the options, then the scripts to run and test. First, the task is described for Oracle, then for MS SQL

Option 1 - SIMP (Trivial)


The whole set of transactions is multiplied by a Cartesian product into a set of 10 lines with numbers from 1 to 10. This is needed to get 10 lines with dates truncated to the boundaries of 10 intervals from a single transaction line.

After that, the lines are grouped by interval number and truncated date, and the above query is executed.

Create a VIEW:

 create or replace view THINNING_HABR_SIMP_V as select STRIPE_ID , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID) as UT , avg (APRICE) keep (dense_rank first order by UT, ID) as AOPEN , max (APRICE) as AHIGH , min (APRICE) as ALOW , avg (APRICE) keep (dense_rank last order by UT, ID) as ACLOSE , sum (AVOLUME) as AVOLUME , sum (APRICE * AVOLUME) as AAMOUNT , count (*) as ACOUNT from TRANSACTIONS_RAW , (select rownum as STRIPE_ID from dual connect by level <= 10) group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID); 

Option 2 - CALC (calculated iteratively)


In this case, we iteratively thin out from transactions to level 1, from level 1 to level 2, and so on.

Create a table:

 create table QUOTES_CALC ( STRIPE_ID number not null , STOCK_NAME varchar2 (128) not null , UT number not null , AOPEN number not null , AHIGH number not null , ALOW number not null , ACLOSE number not null , AVOLUME number not null , AAMOUNT number not null , ACOUNT number not null ) /*partition by list (STRIPE_ID) ( partition P01 values (1) , partition P02 values (2) , partition P03 values (3) , partition P04 values (4) , partition P05 values (5) , partition P06 values (6) , partition P07 values (7) , partition P08 values (8) , partition P09 values (9) , partition P10 values (10) )*/ parallel 4 pctfree 0 nologging; 

You can create an index on the STRIPE_ID field, but it has been experimentally established that in the 11M transaction volume without an index it turns out to be more profitable. For larger quantities, the situation may change. And you can partition a table by uncommenting a block in a query.

Create a procedure:

 create or replace procedure THINNING_HABR_CALC_T is begin rollback; execute immediate 'truncate table QUOTES_CALC'; insert --+ append into QUOTES_CALC select 1 as STRIPE_ID , STOCK_NAME , UT , avg (APRICE) keep (dense_rank first order by ID) , max (APRICE) , min (APRICE) , avg (APRICE) keep (dense_rank last order by ID) , sum (AVOLUME) , sum (APRICE * AVOLUME) , count (*) from TRANSACTIONS_RAW a group by STOCK_NAME, UT; commit; for i in 1..9 loop insert --+ append into QUOTES_CALC select --+ parallel(4) STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, i + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from QUOTES_CALC a where STRIPE_ID = i group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, i + 1); commit; end loop; end; / 

For symmetry, create a simple VIEW:

 create view THINNING_HABR_CALC_V as select * from QUOTES_CALC; 

Option 3 - CHIN (Chinese code)


The method is brutal straightforward approach, and is to abandon the principle of "Do not repeat yourself." In this case, the rejection of cycles.

A variant is given here only for completeness.

Looking ahead I will say that in terms of performance on this particular task, it ranks second.

Big request
 create or replace view THINNING_HABR_CHIN_V as with T01 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select 1 , STOCK_NAME , UT , avg (APRICE) keep (dense_rank first order by ID) , max (APRICE) , min (APRICE) , avg (APRICE) keep (dense_rank last order by ID) , sum (AVOLUME) , sum (APRICE * AVOLUME) , count (*) from TRANSACTIONS_RAW group by STOCK_NAME, UT) , T02 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T01 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) , T03 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T02 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) , T04 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T03 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) , T05 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T04 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) , T06 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T05 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) , T07 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T06 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) , T08 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T07 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) , T09 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T08 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) , T10 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T09 group by STRIPE_ID, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1)) select * from T01 union all select * from T02 union all select * from T03 union all select * from T04 union all select * from T05 union all select * from T06 union all select * from T07 union all select * from T08 union all select * from T09 union all select * from T10; 


Option 4 - UDAF


The variant with the User Defined Aggregated Function will not be given here, but it can be viewed on github.

Option 5 - PPTF (Pipelined and Parallel table function)


Create a function (in the package):

 create or replace package THINNING_PPTF_P is type TRANSACTION_RECORD_T is record (STOCK_NAME varchar2(128), UT number, SEQ_NUM number, APRICE number, AVOLUME number); type CUR_RECORD_T is ref cursor return TRANSACTION_RECORD_T; type QUOTE_T is record (STRIPE_ID number, STOCK_NAME varchar2(128), UT number , AOPEN number, AHIGH number, ALOW number, ACLOSE number, AVOLUME number , AAMOUNT number, ACOUNT number); type QUOTE_LIST_T is table of QUOTE_T; function F (p_cursor CUR_RECORD_T) return QUOTE_LIST_T pipelined order p_cursor by (STOCK_NAME, UT, SEQ_NUM) parallel_enable (partition p_cursor by hash (STOCK_NAME)); end; / create or replace package body THINNING_PPTF_P is function F (p_cursor CUR_RECORD_T) return QUOTE_LIST_T pipelined order p_cursor by (STOCK_NAME, UT, SEQ_NUM) parallel_enable (partition p_cursor by hash (STOCK_NAME)) is QuoteTail QUOTE_LIST_T := QUOTE_LIST_T() ; rec TRANSACTION_RECORD_T; rec_prev TRANSACTION_RECORD_T; type ut_T is table of number index by pls_integer; ut number; begin QuoteTail.extend(10); loop fetch p_cursor into rec; exit when p_cursor%notfound; if rec_prev.STOCK_NAME = rec.STOCK_NAME then if (rec.STOCK_NAME = rec_prev.STOCK_NAME and rec.UT < rec_prev.UT) or (rec.STOCK_NAME = rec_prev.STOCK_NAME and rec.UT = rec_prev.UT and rec.SEQ_NUM < rec_prev.SEQ_NUM) then raise_application_error (-20010, 'Rowset must be ordered, ('||rec_prev.STOCK_NAME||','||rec_prev.UT||','||rec_prev.SEQ_NUM||') > ('||rec.STOCK_NAME||','||rec.UT||','||rec.SEQ_NUM||')'); end if; end if; if rec.STOCK_NAME <> rec_prev.STOCK_NAME or rec_prev.STOCK_NAME is null then for j in 1 .. 10 loop if QuoteTail(j).UT is not null then pipe row (QuoteTail(j)); QuoteTail(j) := null; end if; end loop; end if; for i in reverse 1..10 loop ut := TRUNC_UT (rec.UT, i); if QuoteTail(i).UT <> ut then for j in 1..i loop pipe row (QuoteTail(j)); QuoteTail(j) := null; end loop; end if; if QuoteTail(i).UT is null then QuoteTail(i).STRIPE_ID := i; QuoteTail(i).STOCK_NAME := rec.STOCK_NAME; QuoteTail(i).UT := ut; QuoteTail(i).AOPEN := rec.APRICE; end if; if rec.APRICE < QuoteTail(i).ALOW or QuoteTail(i).ALOW is null then QuoteTail(i).ALOW := rec.APRICE; end if; if rec.APRICE > QuoteTail(i).AHIGH or QuoteTail(i).AHIGH is null then QuoteTail(i).AHIGH := rec.APRICE; end if; QuoteTail(i).AVOLUME := nvl (QuoteTail(i).AVOLUME, 0) + rec.AVOLUME; QuoteTail(i).AAMOUNT := nvl (QuoteTail(i).AAMOUNT, 0) + rec.AVOLUME * rec.APRICE; QuoteTail(i).ACOUNT := nvl (QuoteTail(i).ACOUNT, 0) + 1; QuoteTail(i).ACLOSE := rec.APRICE; end loop; rec_prev := rec; end loop; for j in 1 .. 10 loop if QuoteTail(j).UT is not null then pipe row (QuoteTail(j)); end if; end loop; exception when no_data_needed then null; end; end; / 

Create a VIEW:

 create or replace view THINNING_HABR_PPTF_V as select * from table (THINNING_PPTF_P.F (cursor (select STOCK_NAME, UT, ID, APRICE, AVOLUME from TRANSACTIONS_RAW))); 

Option 6 - MODE (model clause)


The variant iteratively calculates thinning for all 10 levels by using the MODEL clause clause with the phrase ITERATE .

The option is also impractical because it turns out to be slow. In my environment, 1000 transactions on 8 instruments are calculated for 1 minute. Most of the time is spent on calculating the phrase MODEL .

Here I give this option only for completeness and as confirmation of the fact that almost all complex calculations on Oracle can be performed in one query, without using PL / SQL.

One of the reasons for the poor performance of the MODEL phrase in this query is that the search by the criteria on the right is made for each rule, of which we have 6. The first two rules are calculated quite quickly, because there is direct explicit addressing, without jokers. In the other four rules there is the word any - there calculations are made more slowly.

The second difficulty is that the reference model has to be calculated. It is necessary because the dimension list must be known before calculating the MODEL phrase; we cannot calculate new dimensions within this phrase. Perhaps this can be circumvented with the help of two MODEL phrases, but I did not do this because of the poor performance of a large number of rules.

I ’ll add that you could not count UT_OPEN and UT_CLOSE in the reference model, but use the same avg () keep (dense_rank first / last order by) functions directly in the MODEL phrase. But it would have been even slower.
Due to performance limitations, I will not include this option in the testing procedure.

 with --       SOURCETRANS (STRIPE_ID, STOCK_NAME, PARENT_UT, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select 1, STOCK_NAME, TRUNC_UT (UT, 2), UT , avg (APRICE) keep (dense_rank first order by ID) , max (APRICE) , min (APRICE) , avg (APRICE) keep (dense_rank last order by ID) , sum (AVOLUME) , sum (AVOLUME * APRICE) , count (*) from TRANSACTIONS_RAW where ID <= 1000 --       group by STOCK_NAME, UT) --   PARENT_UT, UT  2...10    UT_OPEN, UT_CLOSE --    , REFMOD (STRIPE_ID, STOCK_NAME, PARENT_UT, UT, UT_OPEN, UT_CLOSE) as (select b.STRIPE_ID , a.STOCK_NAME , TRUNC_UT (UT, b.STRIPE_ID + 1) , TRUNC_UT (UT, b.STRIPE_ID) , min (TRUNC_UT (UT, b.STRIPE_ID - 1)) , max (TRUNC_UT (UT, b.STRIPE_ID - 1)) from SOURCETRANS a , (select rownum + 1 as STRIPE_ID from dual connect by level <= 9) b group by b.STRIPE_ID , a.STOCK_NAME , TRUNC_UT (UT, b.STRIPE_ID + 1) , TRUNC_UT (UT, b.STRIPE_ID)) --        , MAINTAB as ( select STRIPE_ID, STOCK_NAME, PARENT_UT, UT, AOPEN , AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT, null, null from SOURCETRANS union all select STRIPE_ID, STOCK_NAME, PARENT_UT, UT, null , null, null, null, null, null, null, UT_OPEN, UT_CLOSE from REFMOD) select STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT from MAINTAB model return all rows --      2...10 reference RM on (select * from REFMOD) dimension by (STRIPE_ID, STOCK_NAME, UT) measures (UT_OPEN, UT_CLOSE) main MM partition by (STOCK_NAME) dimension by (STRIPE_ID, PARENT_UT, UT) measures (AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) rules iterate (9) ( AOPEN [iteration_number + 2, any, any] = AOPEN [cv (STRIPE_ID) - 1, cv (UT) , rm.UT_OPEN [cv (STRIPE_ID), cv (STOCK_NAME), cv (UT)]] , ACLOSE [iteration_number + 2, any, any] = ACLOSE [cv (STRIPE_ID) - 1, cv (UT) , rm.UT_CLOSE[cv (STRIPE_ID), cv (STOCK_NAME), cv (UT)]] , AHIGH [iteration_number + 2, any, any] = max (AHIGH)[cv (STRIPE_ID) - 1, cv (UT), any] , ALOW [iteration_number + 2, any, any] = min (ALOW)[cv (STRIPE_ID) - 1, cv (UT), any] , AVOLUME [iteration_number + 2, any, any] = sum (AVOLUME)[cv (STRIPE_ID) - 1, cv (UT), any] , AAMOUNT [iteration_number + 2, any, any] = sum (AAMOUNT)[cv (STRIPE_ID) - 1, cv (UT), any] , ACOUNT [iteration_number + 2, any, any] = sum (ACOUNT)[cv (STRIPE_ID) - 1, cv (UT), any] ) order by 1, 2, 3, 4; 

Option 6 - IDEA (ideal, ideal, but inoperative)


The query described below would potentially be the most efficient and consume a quantity of resources equal to the theoretical minimum.

But neither Oracle nor MS SQL allow you to write a query in this form. I guess this is dictated by the standard.

 with QUOTES_S1 as (select 1 as STRIPE_ID , STOCK_NAME , TRUNC_UT (UT, 1) as UT , avg (APRICE) keep (dense_rank first order by ID) as AOPEN , max (APRICE) as AHIGH , min (APRICE) as ALOW , avg (APRICE) keep (dense_rank last order by ID) as ACLOSE , sum (AVOLUME) as AVOLUME , sum (APRICE * AVOLUME) as AAMOUNT , count (*) as ACOUNT from TRANSACTIONS_RAW -- where rownum <= 100 group by STOCK_NAME, TRUNC_UT (UT, 1)) , T1 (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) as (select 1, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT from QUOTES_S1 union all select STRIPE_ID + 1 , STOCK_NAME , TRUNC_UT (UT, STRIPE_ID + 1) , avg (AOPEN) keep (dense_rank first order by UT) , max (AHIGH) , min (ALOW) , avg (ACLOSE) keep (dense_rank last order by UT) , sum (AVOLUME) , sum (AAMOUNT) , sum (ACOUNT) from T1 where STRIPE_ID < 10 group by STRIPE_ID + 1, STOCK_NAME, TRUNC_UT (UT, STRIPE_ID + 1) ) select * from T1 

This query corresponds to the following part of the Oracle documentation:

If a subquery_factoring_clause refers to the query_name in the subquery that is, then it is said to be recursive. A recursive subquery_factoring_clause must contain two query blocks. The anchor member must appear before the reference. UNION ALL, UNION, INTERSECT or MINUS. You must follow the reference link query_name exactly once. You must combine the recursive member with the anchor member using the UNION ALL set operator.

But contrary to the following paragraph of the documentation:

Cant keep the following elements:
The DISTINCT keyword or a GROUP BY clause
An aggregate function. However, analytic functions are permitted in the select list.

Thus, in a recursive member, aggregates and grouping are not allowed.

Testing



Let's spend at first for Oracle .

Perform the calculation procedure for the CALC method and write down the execution time:

 exec THINNING_HABR_CALC_T 

The calculation results for the four methods are in four views:


The execution time for all methods has already been measured by me and is shown in the table at the end of the article.

For the rest of VIEW, we will execute queries and record the execution time:

 select count (*) as CNT , sum (STRIPE_ID) as S_STRIPE_ID, sum (UT) as S_UT , sum (AOPEN) as S_AOPEN, sum (AHIGH) as S_AHIGH, sum (ALOW) as S_ALOW , sum (ACLOSE) as S_ACLOSE, sum (AVOLUME) as S_AVOLUME , sum (AAMOUNT) as S_AAMOUNT, sum (ACOUNT) as S_ACOUNT from THINNING_HABR_XXXX_V 

where XXXX is SIMP, CHIN, PPTF.

These VIEW count set digest. To calculate the digest, you need to fetch all the strings, and using the digest you can compare the sets with each other.

You can also compare sets using the dbms_sqlhash package, but this is much slower, because the initial set is required to be sorted, and the calculation of the hash is not fast.
Also in 12c there is a DBMS_COMPARISON package.

You can simultaneously check the correctness of all algorithms. We calculate the digests with such a request (with 11M records on a virtual machine, it will be relatively long, about 15 minutes):

 with T1 as (select 'SIMP' as ALG_NAME, a.* from THINNING_HABR_SIMP_V a union all select 'CALC', a.* from THINNING_HABR_CALC_V a union all select 'CHIN', a.* from THINNING_HABR_CHIN_V a union all select 'PPTF', a.* from THINNING_HABR_PPTF_V a) select ALG_NAME , count (*) as CNT , sum (STRIPE_ID) as S_STRIPE_ID, sum (UT) as S_UT , sum (AOPEN) as S_AOPEN, sum (AHIGH) as S_AHIGH, sum (ALOW) as S_ALOW , sum (ACLOSE) as S_ACLOSE, sum (AVOLUME) as S_AVOLUME , sum (AAMOUNT) as S_AAMOUNT, sum (ACOUNT) as S_ACOUNT from T1 group by ALG_NAME; 

We see that the digests are the same, so all the algorithms gave the same results.

Now let's reproduce all the same in MS SQL . I tested on version 2016.

Pre-create the DBTEST database, then create a transaction table in it:

 use DBTEST go create table TRANSACTIONS_RAW ( STOCK_NAME varchar (32) not null , UT int not null , APRICE numeric (22, 12) not null , AVOLUME numeric (22, 12) not null , ID bigint identity not null ); 

Download the downloaded data.

In MSSQL, create a file format_mssql.bcp:

 12.0 3 1 SQLCHAR 0 0 "," 3 UT "" 2 SQLCHAR 0 0 "," 4 APRICE "" 3 SQLCHAR 0 0 "\n" 5 AVOLUME "" 

And run the LoadData-MSSQL.sql script in SSMS (this script was generated by a single powershell script given in the section of this article for Oracle).

Create two functions:

 use DBTEST go create or alter function TRUNC_UT (@p_UT bigint, @p_StripeTypeId int) returns bigint as begin return case @p_StripeTypeId when 1 then @p_UT when 2 then @p_UT / 10 * 10 when 3 then @p_UT / 60 * 60 when 4 then @p_UT / 600 * 600 when 5 then @p_UT / 3600 * 3600 when 6 then @p_UT / 14400 * 14400 when 7 then @p_UT / 86400 * 86400 when 8 then datediff (second, cast ('1970-01-01 00:00:00' as datetime), dateadd(m, datediff (m, 0, dateadd (second, @p_UT, cast ('1970-01-01 00:00:00' as datetime))), 0)) when 9 then datediff (second, cast ('1970-01-01 00:00:00' as datetime), dateadd(yy, datediff (yy, 0, dateadd (second, @p_UT, cast ('1970-01-01 00:00:00' as datetime))), 0)) when 10 then 0 when 11 then 0 end; end; go create or alter function UT2DATESTR (@p_UT bigint) returns datetime as begin return dateadd(s, @p_UT, cast ('1970-01-01 00:00:00' as datetime)); end; go 

:

1 — SIMP


:

 use DBTEST go create or alter view dbo.THINNING_HABR_SIMP_V as with T1 (STRIPE_ID) as (select 1 union all select STRIPE_ID + 1 from T1 where STRIPE_ID < 10) , T2 as (select STRIPE_ID , STOCK_NAME , dbo.TRUNC_UT (UT, STRIPE_ID) as UT , min (1000000 * cast (UT as bigint) + ID) as AOPEN_UT , max (APRICE) as AHIGH , min (APRICE) as ALOW , max (1000000 * cast (UT as bigint) + ID) as ACLOSE_UT , sum (AVOLUME) as AVOLUME , sum (APRICE * AVOLUME) as AAMOUNT , count (*) as ACOUNT from TRANSACTIONS_RAW, T1 group by STRIPE_ID, STOCK_NAME, dbo.TRUNC_UT (UT, STRIPE_ID)) select t.STRIPE_ID, t.STOCK_NAME, t.UT, t_op.APRICE as AOPEN, t.AHIGH , t.ALOW, t_cl.APRICE as ACLOSE, t.AVOLUME, t.AAMOUNT, t.ACOUNT from T2 t join TRANSACTIONS_RAW t_op on (t.STOCK_NAME = t_op.STOCK_NAME and t.AOPEN_UT / 1000000 = t_op.UT and t.AOPEN_UT % 1000000 = t_op.ID) join TRANSACTIONS_RAW t_cl on (t.STOCK_NAME = t_cl.STOCK_NAME and t.ACLOSE_UT / 1000000 = t_cl.UT and t.ACLOSE_UT % 1000000 = t_cl.ID); 

first/last .

2 — CALC


, view:

 use DBTEST go create table dbo.QUOTES_CALC ( STRIPE_ID int not null , STOCK_NAME varchar(32) not null , UT bigint not null , AOPEN numeric (22, 12) not null , AHIGH numeric (22, 12) not null , ALOW numeric (22, 12) not null , ACLOSE numeric (22, 12) not null , AVOLUME numeric (38, 12) not null , AAMOUNT numeric (38, 12) not null , ACOUNT int not null ); go create or alter procedure dbo.THINNING_HABR_CALC as begin set nocount on; truncate table QUOTES_CALC; declare @StripeId int; with T1 as (select STOCK_NAME , UT , min (ID) as AOPEN_ID , max (APRICE) as AHIGH , min (APRICE) as ALOW , max (ID) as ACLOSE_ID , sum (AVOLUME) as AVOLUME , sum (APRICE * AVOLUME) as AAMOUNT , count (*) as ACOUNT from TRANSACTIONS_RAW group by STOCK_NAME, UT) insert into QUOTES_CALC (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) select 1, t.STOCK_NAME, t.UT, t_op.APRICE, t.AHIGH, t.ALOW, t_cl.APRICE, t.AVOLUME, t.AAMOUNT, t.ACOUNT from T1 t join TRANSACTIONS_RAW t_op on (t.STOCK_NAME = t_op.STOCK_NAME and t.UT = t_op.UT and t.AOPEN_ID = t_op.ID) join TRANSACTIONS_RAW t_cl on (t.STOCK_NAME = t_cl.STOCK_NAME and t.UT = t_cl.UT and t.ACLOSE_ID = t_cl.ID); set @StripeId = 1; while (@StripeId <= 9) begin with T1 as (select STOCK_NAME , dbo.TRUNC_UT (UT, @StripeId + 1) as UT , min (UT) as AOPEN_UT , max (AHIGH) as AHIGH , min (ALOW) as ALOW , max (UT) as ACLOSE_UT , sum (AVOLUME) as AVOLUME , sum (AAMOUNT) as AAMOUNT , sum (ACOUNT) as ACOUNT from QUOTES_CALC where STRIPE_ID = @StripeId group by STOCK_NAME, dbo.TRUNC_UT (UT, @StripeId + 1)) insert into QUOTES_CALC (STRIPE_ID, STOCK_NAME, UT, AOPEN, AHIGH, ALOW, ACLOSE, AVOLUME, AAMOUNT, ACOUNT) select @StripeId + 1, t.STOCK_NAME, t.UT, t_op.AOPEN, t.AHIGH, t.ALOW, t_cl.ACLOSE, t.AVOLUME, t.AAMOUNT, t.ACOUNT from T1 t join QUOTES_CALC t_op on (t.STOCK_NAME = t_op.STOCK_NAME and t.AOPEN_UT = t_op.UT) join QUOTES_CALC t_cl on (t.STOCK_NAME = t_cl.STOCK_NAME and t.ACLOSE_UT = t_cl.UT) where t_op.STRIPE_ID = @StripeId and t_cl.STRIPE_ID = @StripeId; set @StripeId = @StripeId + 1; end; end; go create or alter view dbo.THINNING_HABR_CALC_V as select * from dbo.QUOTES_CALC; go 

3 (CHIN) 4 (UDAF) MS SQL.

5 — PPTF


view. , parallel pipelined table function, Oracle:

 use DBTEST go create or alter function dbo.THINNING_HABR_PPTF () returns @rettab table ( STRIPE_ID bigint not null , STOCK_NAME varchar(32) not null , UT bigint not null , AOPEN numeric (22, 12) not null , AHIGH numeric (22, 12) not null , ALOW numeric (22, 12) not null , ACLOSE numeric (22, 12) not null , AVOLUME numeric (38, 12) not null , AAMOUNT numeric (38, 12) not null , ACOUNT bigint not null) as begin declare @i tinyint; declare @tut int; declare @trans_STOCK_NAME varchar(32); declare @trans_UT int; declare @trans_ID int; declare @trans_APRICE numeric (22,12); declare @trans_AVOLUME numeric (22,12); declare @trans_prev_STOCK_NAME varchar(32); declare @trans_prev_UT int; declare @trans_prev_ID int; declare @trans_prev_APRICE numeric (22,12); declare @trans_prev_AVOLUME numeric (22,12); declare @QuoteTail table ( STRIPE_ID bigint not null primary key clustered , STOCK_NAME varchar(32) not null , UT bigint not null , AOPEN numeric (22, 12) not null , AHIGH numeric (22, 12) , ALOW numeric (22, 12) , ACLOSE numeric (22, 12) , AVOLUME numeric (38, 12) not null , AAMOUNT numeric (38, 12) not null , ACOUNT bigint not null); declare c cursor fast_forward for select STOCK_NAME, UT, ID, APRICE, AVOLUME from TRANSACTIONS_RAW order by STOCK_NAME, UT, ID; -- THIS ORDERING (STOCK_NAME, UT, ID) IS MANDATORY open c; fetch next from c into @trans_STOCK_NAME, @trans_UT, @trans_ID, @trans_APRICE, @trans_AVOLUME; while @@fetch_status = 0 begin if @trans_STOCK_NAME <> @trans_prev_STOCK_NAME or @trans_prev_STOCK_NAME is null begin insert into @rettab select * from @QuoteTail; delete @QuoteTail; end; set @i = 10; while @i >= 1 begin set @tut = dbo.TRUNC_UT (@trans_UT, @i); if @tut <> (select UT from @QuoteTail where STRIPE_ID = @i) begin insert into @rettab select * from @QuoteTail where STRIPE_ID <= @i; delete @QuoteTail where STRIPE_ID <= @i; end; if (select count (*) from @QuoteTail where STRIPE_ID = @i) = 0 begin insert into @QuoteTail (STRIPE_ID, STOCK_NAME, UT, AOPEN, AVOLUME, AAMOUNT, ACOUNT) values (@i, @trans_STOCK_NAME, @tut, @trans_APRICE, 0, 0, 0); end; update @QuoteTail set AHIGH = case when AHIGH < @trans_APRICE or AHIGH is null then @trans_APRICE else AHIGH end , ALOW = case when ALOW > @trans_APRICE or ALOW is null then @trans_APRICE else ALOW end , ACLOSE = @trans_APRICE, AVOLUME = AVOLUME + @trans_AVOLUME , AAMOUNT = AAMOUNT + @trans_APRICE * @trans_AVOLUME , ACOUNT = ACOUNT + 1 where STRIPE_ID = @i; set @i = @i - 1; end; set @trans_prev_STOCK_NAME = @trans_STOCK_NAME; set @trans_prev_UT = @trans_UT; set @trans_prev_ID = @trans_ID; set @trans_prev_APRICE = @trans_APRICE; set @trans_prev_AVOLUME = @trans_AVOLUME; fetch next from c into @trans_STOCK_NAME, @trans_UT, @trans_ID, @trans_APRICE, @trans_AVOLUME; end; close c; deallocate c; insert into @rettab select * from @QuoteTail; return; end go create or alter view dbo.THINNING_HABR_PPTF_V as select * from dbo.THINNING_HABR_PPTF (); 

QUOTES_CALC CALC :
 use DBTEST go exec dbo.THINNING_HABR_CALC 

view:


VIEW :

 select count (*) as CNT , sum (STRIPE_ID) as S_STRIPE_ID, sum (UT) as S_UT , sum (AOPEN) as S_AOPEN, sum (AHIGH) as S_AHIGH, sum (ALOW) as S_ALOW , sum (ACLOSE) as S_ACLOSE, sum (AVOLUME) as S_AVOLUME , sum (AAMOUNT) as S_AAMOUNT, sum (ACOUNT) as S_ACOUNT from THINNING_HABR_XXXX_V 

XXXX — SIMP, PPTF.

MS SQL. . :

 use DBTEST go with T1 as (select 'SIMP' as ALG_NAME, a.* from THINNING_HABR_SIMP_V a union all select 'CALC', a.* from THINNING_HABR_CALC_V a union all select 'PPTF', a.* from THINNING_HABR_PPTF_V a) select ALG_NAME , count (*) as CNT, sum (cast (STRIPE_ID as bigint)) as STRIPE_ID , sum (cast (UT as bigint)) as UT, sum (AOPEN) as AOPEN , sum (AHIGH) as AHIGH, sum (ALOW) as ALOW, sum (ACLOSE) as ACLOSE, sum (AVOLUME) as AVOLUME , sum (AAMOUNT) as AAMOUNT, sum (cast (ACOUNT as bigint)) as ACOUNT from T1 group by ALG_NAME; 

— .

, , MS SQL .

MS SQL, , : :

 create unique clustered index TRANSACTIONS_RAW_I1 on TRANSACTIONS_RAW (STOCK_NAME, UT, ID); create unique clustered index QUOTES_CALC_I1 on QUOTES_CALC (STRIPE_ID, STOCK_NAME, UT); 

, :

image

github : Oracle, THINNING — , THINNING_LIVE — - bitcoincharts.com - ( - 5 ), MS SQL .

:

Oracle, MS SQL. .

Oracle PPTF. , . — 367 ( PPTF ).

MS SQL (CALC).

PPTF Oracle ? - - — , parallel pipelined table function, :

image

Source: https://habr.com/ru/post/418757/


All Articles