Hi, Habr!
In the article I will describe the idea of storage in a fairly well-known KDB column database, as well as examples of how to access this data. The base has existed since 2001, and at the moment it occupies high places on websites with a comparison of similar systems (see, for example, here )
If you have currency fluctuations every second for the last 20 years, the relational database will not be the fastest and most effective solution for storing and processing the accumulated (that is, slightly more than 120 * 10 ^ 9 lines for 200 currencies). In this case, it is most logical to use a fast column database, which means KDB will help us.
Similarly, if you do not store numbers, as in the example above, but serialized objects. In this case, the task of storing a large number of rows adds the complication of the large size of each row.
After you have a large amount of data, tasks on analyzing this data often begin to arise - finding correlations between them, creating aggregations, etc. That is, this requires the ability to write a function (with cycles, conditions, everything is supposed) would be executed as close as possible to the data (ideally in the database itself), so as not to drive the data over the network.
Having solved all the direct technical problems, you will have the following:
Physical data in KDB is stored at minimal cost. So, a column with integers is just a sequence of integers that is stored in a single file on disk.
As mentioned above, KDB is a column database, i.e. each column is stored separately. In reality, a column is just a separate file, nothing more than. That is, a table t with columns a, b, c and d will simply represent on the disk a folder "t", in which there are four files - a, b, c and d. And plus a small metadata file. If you need to copy the table - you can simply copy the files (and make the metadata generate). If you need to transfer some data to a new server - just copy the files.
As any reader understands, storing millions of objects in one file is extremely inefficient. In this case, even the task of re-sorting will be solved already difficult and expensive (after all, you can’t take everything into memory - it’s not so much). From here in KDB (as in every decent column database) the entire table is initially divided into sections (see the sectionsitions), see the documentation . Sections are assigned to the entire database and most often are just a date.
The latter already slightly complicates the file structure. If you have two tables (t1 and t2), and they have a date column (we divide data by folders according to it), then the following structure will be on the disk:
\ 2017.01.01 \ t1 \ t2 \ 2017.01.02 \ t1 \ t2
That is, in the folder with the date there are folders with tables, in which there are files with columns.
Data is always stored on the disk without the possibility of changing or deleting it. You can only add more data. Those. if you need to update or discard data for date d, take them all into memory ( select from t where date = d
), do all the necessary operations, save for date d1, and then change the folder names on the disk.
After we have learned to share files on a disk, you can still optimize their storage using compression (for example, gzip or google snappy ). An effective column base must be able to do it on its own, for otherwise it will be necessary to compress either the file system (ie, store uncompressed data in the RAM cache), or not compress the data at all (and increase IO) or compress the data already in the application layer (and lose possibility of compression of adjacent lines).
In addition to efficient data storage, KDB provides the ability to quickly read data into memory. For this, the table must be ordered, that is, one to choose from:
select v from t where date=2017.01.01 and k=12
will have to load all data from columns k and v for a specific date. Or, in the language of relational databases, you have to do an index scan.select v from t where date=2017.01.01 and k=12
will work much faster - KDB will load only part of the data into memory, it will find them for logarithm. What is important - from this attribute the table does not grow on the disk, i.e. no additional data is required.select v from t where date=2017.01.01 and k=12
. Obviously, in this case, the hash table is stored nearby and takes away precious space.select v from t where date=2017.01.01 and k=12
, index seek will occur, and KDB will instantly jump to the desired value on the disk. However, queries of the form select v from t where date=2017.01.01 and k<12 and k > 10
will do an index scan, since the hash table will not sort the data. However, the task is easily solved with the help of an additional table and a sorted column.The attentive reader will note that the two statements above are somewhat difficult to combine: the data in KDB can be stored sorted and cannot be inserted into the table on the disk in the middle, you can only add to the end . To combine this two statements (and not to lose in performance), KDB uses the following approach:
If it is completely superficial, then the algorithm of RDB operation is as follows:
Speaking about KDB, one cannot but mention the Q language, on which all queries (and all functions) in KDB are built. If the functions of the sample more or less everything is clear (see the example above - select v from t where date=2017.01.01 and k=12
), then the rest of the things look a bit more unusual.
The idea of Q can be associated with the proverb brevity - the sister of talent .
So, create a new variable:
tv: select v from t where date=2017.01.01 and k=12;
Simplify the request - we do not need and to list the conditions:
tv: select v from t where date=2017.01.01,k=12;
Add grouping and aggregation:
tv: select count by v from t where date=2017.01.01,k=12;
rename column:
tv: select c: count by v from t where date=2017.01.01,k=12;
Back to first request
tv: select v from t where date=2017.01.01,k=12;
And rename the column
tv: select v from t where date=2017.01.01,k=12; tv: `v1 xcol tv;
Sort the column:
tv: select v from t where date=2017.01.01,k=12; tv: `v1 xcol tv; tv: `v1 xasc tv;
Or, more conveniently, we will combine the query into a more familiar one line:
tv: `v1 xasc `v1 xcol select v from t where date=2017.01.01,k=12;
Wrap our query into a function (the ':' symbol at the beginning of the expression means return, not assignment, as was the case in the examples above):
f: {[] tv: `v1 xasc `v1 xcol select v from t where date=2017.01.01,k=12; :tv; }
Add parameters:
f: {[i_d; i_k] tv: `v1 xasc `v1 xcol select v from t where date=i_d,k=i_k; :tv; }
And we will call the function (in the end we will not write ";" - this will give us output to the console, as a useful side effect):
f: {[i_d; i_k] tv: `v1 xasc `v1 xcol select v from t where date=i_d,k=i_k; :tv; }; f[2017.01.01; 12]
Let's pass the arguments in the dictionary, so that later it would be more convenient to forward them from other functions (without an explicit return, that is, without ":", the result of the last expression is considered the result of the lambda operation):
f: {[d] i_d: d[`date]; i_k: d[`key]; `v1 xasc `v1 xcol select v from t where date=i_d,k=i_k; }; f[(`date`key)!(2017.01.01;12)]
In the last example, we did several things at once:
(`date`key)!(2017.01.01;12)
i_d: d[`date]
;Next, we add a throwing error for the case when there is no data:
f: {[d] i_d: d[`date]; i_k: d[`key]; r: `v1 xasc `v1 xcol select v from t where date=i_d,k=i_k; $[0 = count r;'`no_data;:r]; }; f[(`date`key)!(2017.01.01;12)]
So, now our function will throw an exception with the words "no_data" for the case when there is no data in the table for our query.
The construction of $[1=0;`true;`false]
is a conditional transition in which the condition is first, then the expression that should be executed if the condition is true. At the end is an else block. However, in reality it is rather pattern matching than if, for the following construction is also admissible: $[a=0;`0; a=1;`2; `unknown]
$[a=0;`0; a=1;`2; `unknown]
$[a=0;`0; a=1;`2; `unknown]
. That is, on all odd positions (except the last one) there are conditions, on all even ones - what needs to be done. And at the end - the else block.
As seen in the examples, the language is logical (albeit concise). In Q there is:
Source: https://habr.com/ru/post/346576/
All Articles