KDB

Hi, Habr!

In the article I will describe the idea of storage in a fairly well-known KDB column database, as well as examples of how to access this data. The base has existed since 2001, and at the moment it occupies high places on websites with a comparison of similar systems (see, for example, here )

What for?

Time Series Storage

If you have currency fluctuations every second for the last 20 years, the relational database will not be the fastest and most effective solution for storing and processing the accumulated (that is, slightly more than 120 * 10 ^ 9 lines for 200 currencies). In this case, it is most logical to use a fast column database, which means KDB will help us.

Similarly, if you do not store numbers, as in the example above, but serialized objects. In this case, the task of storing a large number of rows adds the complication of the large size of each row.

Calculations

After you have a large amount of data, tasks on analyzing this data often begin to arise - finding correlations between them, creating aggregations, etc. That is, this requires the ability to write a function (with cycles, conditions, everything is supposed) would be executed as close as possible to the data (ideally in the database itself), so as not to drive the data over the network.

Production ready

Having solved all the direct technical problems, you will have the following:

Backup
Replication (+ active-active services work)
Sharding
Support for major technologies and programming languages (Java, .Net, R, etc.)

How does it work on one server?

Physical data in KDB is stored at minimal cost. So, a column with integers is just a sequence of integers that is stored in a single file on disk.

Physical storage

As mentioned above, KDB is a column database, i.e. each column is stored separately. In reality, a column is just a separate file, nothing more than. That is, a table t with columns a, b, c and d will simply represent on the disk a folder "t", in which there are four files - a, b, c and d. And plus a small metadata file. If you need to copy the table - you can simply copy the files (and make the metadata generate). If you need to transfer some data to a new server - just copy the files.

As any reader understands, storing millions of objects in one file is extremely inefficient. In this case, even the task of re-sorting will be solved already difficult and expensive (after all, you can’t take everything into memory - it’s not so much). From here in KDB (as in every decent column database) the entire table is initially divided into sections (see the sectionsitions), see the documentation . Sections are assigned to the entire database and most often are just a date.

The latter already slightly complicates the file structure. If you have two tables (t1 and t2), and they have a date column (we divide data by folders according to it), then the following structure will be on the disk:

\ 2017.01.01 \ t1 \ t2 \ 2017.01.02 \ t1 \ t2

That is, in the folder with the date there are folders with tables, in which there are files with columns.

Data is always stored on the disk without the possibility of changing or deleting it. You can only add more data. Those. if you need to update or discard data for date d, take them all into memory ( select from t where date = d ), do all the necessary operations, save for date d1, and then change the folder names on the disk.

After we have learned to share files on a disk, you can still optimize their storage using compression (for example, gzip or google snappy ). An effective column base must be able to do it on its own, for otherwise it will be necessary to compress either the file system (ie, store uncompressed data in the RAM cache), or not compress the data at all (and increase IO) or compress the data already in the application layer (and lose possibility of compression of adjacent lines).

In addition to efficient data storage, KDB provides the ability to quickly read data into memory. For this, the table must be ordered, that is, one to choose from:

The data in each section is kept intact. That is, if we have a table t with date (partition) columns, as well as b, c and d, then to execute the query select v from t where date=2017.01.01 and k=12 will have to load all data from columns k and v for a specific date. Or, in the language of relational databases, you have to do an index scan.
One of the columns will be sorted. If we continue the example above and sort the data by column k, then the query select v from t where date=2017.01.01 and k=12 will work much faster - KDB will load only part of the data into memory, it will find them for logarithm. What is important - from this attribute the table does not grow on the disk, i.e. no additional data is required.
One of the columns will be unique. In this case, KDB will additionally create a hash table for the values, which will make an index seek in the example select v from t where date=2017.01.01 and k=12 . Obviously, in this case, the hash table is stored nearby and takes away precious space.
Multiple columns are grouped. In essence, this is about the same as the primary key index in relational databases. In such a table, a tuple of identical column values is stored together; moreover, a hash table is separately stored, by which you can immediately refer to the desired value. That is, for queries of the form select v from t where date=2017.01.01 and k=12 , index seek will occur, and KDB will instantly jump to the desired value on the disk. However, queries of the form select v from t where date=2017.01.01 and k<12 and k > 10 will do an index scan, since the hash table will not sort the data. However, the task is easily solved with the help of an additional table and a sorted column.

RDB and HDB

The attentive reader will note that the two statements above are somewhat difficult to combine: the data in KDB can be stored sorted and cannot be inserted into the table on the disk in the middle, you can only add to the end . To combine this two statements (and not to lose in performance), KDB uses the following approach:

All historical data is stored in HDB (historical DB) . They are stored compressed and orderly on disk, they can be quickly read into memory and analyzed.
All data for the last day is stored in RDB (realtime DB) , whose task is to collect data from the application as soon as possible. In this case, the numbers can be stored in RAM (the last day of 20 years is unlikely to take up much space), which will allow them to be quickly accessed even if they are not sorted. If the data stream is large enough, of course, you can remove numbers from RAM at the time they are flushed to disk.

If it is completely superficial, then the algorithm of RDB operation is as follows:

We take data from the application
Once in N seconds / minutes, we reset the data to disk and call the user function to which we transfer the reset. She is:
2.1. Or appends newly arrived data to an object in memory (filtering, aggregating, whatever)
2.2. Does nothing (after all, we do not always need the current day to analyze history)
At the end of the day - we take away all the data accumulated in the RDB, sort / group them and drop them in HDB

Q

Speaking about KDB, one cannot but mention the Q language, on which all queries (and all functions) in KDB are built. If the functions of the sample more or less everything is clear (see the example above - select v from t where date=2017.01.01 and k=12 ), then the rest of the things look a bit more unusual.

The idea of Q can be associated with the proverb brevity - the sister of talent .

So, create a new variable:

 tv: select v from t where date=2017.01.01 and k=12;

Simplify the request - we do not need and to list the conditions:

 tv: select v from t where date=2017.01.01,k=12;

Add grouping and aggregation:

 tv: select count by v from t where date=2017.01.01,k=12;

rename column:

 tv: select c: count by v from t where date=2017.01.01,k=12;

Back to first request

 tv: select v from t where date=2017.01.01,k=12;

And rename the column

 tv: select v from t where date=2017.01.01,k=12; tv: `v1 xcol tv;

Sort the column:

 tv: select v from t where date=2017.01.01,k=12; tv: `v1 xcol tv; tv: `v1 xasc tv;

Or, more conveniently, we will combine the query into a more familiar one line:

 tv: `v1 xasc `v1 xcol select v from t where date=2017.01.01,k=12;

Wrap our query into a function (the ':' symbol at the beginning of the expression means return, not assignment, as was the case in the examples above):

 f: {[] tv: `v1 xasc `v1 xcol select v from t where date=2017.01.01,k=12; :tv; }

Add parameters:

 f: {[i_d; i_k] tv: `v1 xasc `v1 xcol select v from t where date=i_d,k=i_k; :tv; }

And we will call the function (in the end we will not write ";" - this will give us output to the console, as a useful side effect):

 f: {[i_d; i_k] tv: `v1 xasc `v1 xcol select v from t where date=i_d,k=i_k; :tv; }; f[2017.01.01; 12]

Let's pass the arguments in the dictionary, so that later it would be more convenient to forward them from other functions (without an explicit return, that is, without ":", the result of the last expression is considered the result of the lambda operation):

 f: {[d] i_d: d[`date]; i_k: d[`key]; `v1 xasc `v1 xcol select v from t where date=i_d,k=i_k; }; f[(`date`key)!(2017.01.01;12)]

In the last example, we did several things at once:

Declared a dictionary using the expression (`date`key)!(2017.01.01;12)
Transferred the dictionary to the function
Read the variables from the dictionary i_d: d[`date] ;

Next, we add a throwing error for the case when there is no data:

 f: {[d] i_d: d[`date]; i_k: d[`key]; r: `v1 xasc `v1 xcol select v from t where date=i_d,k=i_k; $[0 = count r;'`no_data;:r]; }; f[(`date`key)!(2017.01.01;12)]

So, now our function will throw an exception with the words "no_data" for the case when there is no data in the table for our query.
The construction of $[1=0;`true;`false] is a conditional transition in which the condition is first, then the expression that should be executed if the condition is true. At the end is an else block. However, in reality it is rather pattern matching than if, for the following construction is also admissible: $[a=0;`0; a=1;`2; `unknown] $[a=0;`0; a=1;`2; `unknown] $[a=0;`0; a=1;`2; `unknown] . That is, on all odd positions (except the last one) there are conditions, on all even ones - what needs to be done. And at the end - the else block.

As seen in the examples, the language is logical (albeit concise). In Q there is:

Lambda
Conditional transitions
Cycles
Special instructions for joining tables (including complex joines, pivot tables)
The ability to add modules (for example, to calculate analytics on the GPU at the same time)

And in conclusion

If you are working with a large amount of data - KDB will help you
If you have time series analysis tasks - KDB will help you
If you have a task of fast recording (and subsequent analysis) of a large data stream - KDB will help you

Source: https://habr.com/ru/post/346576/

All Articles

KDB