Operational vs analytical databases: column vs row-by-line data storage

Databases can be implemented using Excel, GSheet or using large ORM systems. In my practice of business analyst, I came across different solutions. And since I came to business analysis from finance and auditing, every time I met a new system I asked myself questions - how do they all differ from each other and what tasks do they solve? Found some answers. This article will discuss the two main purposes of databases:

1 - accounting operations
2 - data analysis

The first type of problem is solved by the OLTP systems: from O n L ine T ransaction P rocessing. The second type is solved by OLAP systems: from O n L ine nalytical P rocessing

OLTP

The data storage model in OLTP can be compared with the entries in the phone book. The row in the table is presented as an index and data corresponding to this index: (indexN, data). Therefore, such a table cannot be called a table. It is rather a regular book, with numbered lines. If you need to write a new operation in the book, add a line, assign an index and close the book. From the book, stick labels on which you can quickly O (log n), find the desired line and make CRUD.

For accounting purposes, this is a friendly mapping. But it is unfriendly for data analysis, in which we are not concerned with the lines themselves, but with calculations based on the contents of these lines. And if you make an analytical query based on the content of the lines, i.e. for non-indexed fields, such queries will work slower.

Indexing all records is not an option. Although the book becomes similar to the table, since the attributes become available for a quick search, but at the same time, the creation of new lines and the update of existing lines slow down significantly. Since these operations will require re-sorting of the entire array.

Compromise between OLAP and OLTP

In 1C solutions, the compromise is implemented as follows. Events when writing to the database are written in several places at once. In one place, records have few indexes and are optimized for OLTP loads, in another place records are indexed across all fields and adapted for OLAP loads. Such tables are called accumulation registers and information registers. Since an entry in several places multiplies the occupied space, for saving, not all transaction attributes are included in the registers, but only those that are considered important for this section of analytical accounting. Such a compromise is called the ROLAP model, i.e. relational-analytical mapping.

OLAP

In SAP, the German 1C analog went further. The relational OLTP model in this software can be replicated to the OLAP model. SAP HANA has a storage column structure. This means that the "tables" are stored there not as a set of rows, but as a set of columns.

A similar storage scheme is implemented in solutions such as Google Bigquery, Microsoft SSAS Tabular, Amazon Redshift, Yandex ClickHouse.

The difference between column and line storage

If in a line-by-line structure, data is stored as "horizontal" tuples, each of which is a transaction:

period, product, department (Q1, SKU1, 1) (Q1, SKU2, 1) (Q1, SKU1, 1) ... (Q2, SKU1, 1) (Q2, SKU1, 1) (Q3, SKU1, 1) (Q3, SKU1, 1) ...

Then in the column such data is stored "vertically":

 (Q1, Q1, Q1, ... Q2, Q2, Q3, Q3, ...) (SKU1, SKU2, SKU1, ... SKU1, SKU1, SKU1, SKU1, ...) (1,1,1, ... 1,1,1,1, ...)

Replays can be optimized, conditionally, as follows:

 period = (Q1, {start: 0, count: n}, Q2, {start: n+1; count: m}, ...) product = (SKU1, {start: 0, count: 1}, SKU2, {start: 1; count: 1}, SKU1, {start: 2; count: m}, ...) department = (1,{start:0, count:m}...)

If there is a column for which such optimization will not reduce the initial volume, then the data is stored in its original form.

The column table engine itself selects the column sorting sequence, but if you know your data and sort it manually, this often increases compression and eases the analytical workload. My compression of individual tables exceeded 300 times. In practice, this storage structure is:

allows you to compress data to the level when they are placed in RAM, i.e. make available in-memory calculations that are not comparable in speed with queries to relational databases
sets its own rules for building a data model, without requiring such normalization as in OLTP
sets its semantics for the construction of analytical expressions.

The specificity of expressions is described in detail:
here for Google BigQuery.
here for the Microsoft DAX.

BI as infrastructure of column bases

BI is a solution serving analytical loads. And they make life much easier if they are built on top of column databases. This could be a self-made ClickHouse-Grafana-Python bundle or a Google stack bundle: Bigquery-Data Studio-Dataprep-Dataflow or monolithic Power BI.

Multidimensional cubes are another OLAP alternative storage column scheme. But for me, MDX expressions, when compared with SQL in BQ or DAX, are redundant and complex.

Source: https://habr.com/ru/post/442754/

All Articles