KDB + database: from finance to Formula 1

KDB +, a KX product, is a well-known in narrow circles, an exceptionally fast, column-based database for storing time series and analytical calculations based on them. Initially, she enjoyed (and enjoys) great popularity in the finance industry — all top 10 investment banks and many well-known hedge funds, stock exchanges, and other organizations use it. Recently, KX decided to expand its client base and now offer solutions in other areas where there is a large amount of data ordered by time or otherwise - telecom, bioinformatics, production, etc. In particular, they became a partner of the Aston Martin Red Bull Racing team in Formula 1, where they help collect and process data from car sensors and analyze tests in a wind tunnel. In this article I want to tell you what features of KDB + make it overproductive, why companies are willing to spend a lot of money on it, and finally, why this is not really a database.

In this article I will try to tell in general what KDB + is, what opportunities and limitations it has, what is its use for companies wishing to process large amounts of data. I will not go into the details of KDB + implementation and the details of its Q programming language. Both of these topics are very extensive and deserve separate articles. A lot of information on these topics can be found on code.kx.com, including a book on Q - Q For Mortals (see the link below).

Some terms

In-memory database. A database that stores data in RAM for faster access. The advantages of such a database are clear, and the disadvantages are the possibility of data loss, the need to have a lot of memory on the server.
Column database. A database where data is stored by colic, and not record by record. The main advantage of such a database is that data from one column is stored together on disk and in memory, which greatly speeds up access to it. There is no need to load columns that are not used in the query. The main disadvantage is that it is difficult to modify and delete records.
Time series Data with a date or time type column. As a rule, the ordering of time is important for such data so that it is easy to determine which record precedes or follows the current one, or to use functions whose result depends on the order of the records. Classic databases are built on a completely different principle - the representation of a set of records as a set, where the order of records is not defined in principle.
Vector. In the context of KDB +, this is a list of elements of one atomic type, for example, numbers. In other words, an array of elements. Arrays, unlike lists, can be stored compactly and processed using vector instructions of the processor.

History reference

KX was founded in 1993 by Arthur Whitney, who had previously worked at Morgan Stanley Bank on the A + language, the heir to APL, a very original and once popular language in the financial world. Of course, at KX, Arthur continued in the same vein and created the vector-functional language K, guided by the ideas of radical minimalism. Programs in K look like a jumbled set of punctuation marks and special characters, the meaning of signs and functions depends on the context, and each operation carries a lot more sense than it does in familiar programming languages. Due to this, the K program takes up minimal space — a few lines can replace text pages of a verbose language like Java — and is an over-concentrated implementation of the algorithm.
')
A function on K that implements a large part of the LL1 parser generator according to a given grammar:

1. pp:{q:{(x;p3(),y)};r:$[-11=@x;$x;11=@x;q[`N;$*x];10=abs@@x;q[`N;x]  2.   ($)~*x;(`P;p3 x 1);(1=#x)&11=@*x;pp[{(1#x;$[2=#x;;,:]1_x)}@*x]  3.      (?)~*x;(`Q;pp[x 1]);(*)~*x;(`M;pp[x 1]);(+)~*x;(`MP;pp[x 1]);(!)~*x;(`Y;p3 x 1)  4.      (2=#x)&(@x 1)in 100 101 107 7 -7h;($[(@x 1)in 100 101 107h;`Ff;`Fi];p3 x 1;pp[*x])  5.      (|)~*x;`S,(pp'1_x);2=#x;`C,{@[@[x;-1+#x;{x,")"}];0;"(",]}({$[".sC"~4#x;6_-2_x;x]}'pp'x);'`pp];  6.   $[@r;r;($[1<#r;".s.";""],$*r),$[1<#r;"[",(";"/:1_r),"]";""]]}

Arthur embodied this philosophy of extreme efficiency with a minimum of body movements in KDB +, which appeared in 2003 (I think it’s clear where the letter K in the title is from) and is nothing more than an interpreter of the fourth version of the K language. K is called Q. Q also adds support for a specific SQL dialect, QSQL, and the interpreter adds support for tables as a system data type, tools for working with tables in memory and on disk, etc.

Thus, from the user's point of view, KDB + is simply an interpreter of the Q language with support for tables and SQL-like expressions in LINQ style from C #. This is the most important difference between KDB + from other databases and its main competitive advantage, which is often overlooked. This is not a database + auxiliary disabled language, but a full-fledged powerful programming language + built-in support for database functions. This distinction will play a decisive role in listing all the benefits of KDB +. For example…

The size

By modern standards, KDB + has just a microscopic size. It is literally one executable file smaller than a megabyte and one small text file with some system functions. In reality, it is less than one megabyte and companies pay tens of thousands of dollars a year for a single processor on a server for this program.

This size allows KDB + to feel great on any hardware - from the Pi microcomputer to servers with terabytes of memory. It does not affect the functionality in any way; moreover, Q starts instantly, which allows using it as a scripting language as well.
With this size, the Q interpreter is completely placed in the processor's cache, which speeds up the execution of programs.
With this size of the executable file, the Q process takes up very little memory space, you can run them in the hundreds. At the same time, if necessary, Q can operate with tens or hundreds of gigabytes of memory in a single process.

Versatility

Q is perfect for a variety of tasks. Q process can serve as a historical database and provide quick access to terabytes of information. For example, we have dozens of historical databases, in some of which one uncompressed data day takes up more than 100 gigabytes. However, with reasonable restrictions, the request to the database will be executed in tens to hundreds of milliseconds. In general, we have a universal timeout - 30 seconds - for user requests, and it rarely works.

With the same ease, Q can be an in-memory database. Adding new data to tables in memory is so fast that user requests are a limiting factor. The data in the tables is stored in columns, which means that any operation on the column will use the processor's cache at full capacity. In addition to this, KX tried to implement all the basic operations like arithmetic through the vector instructions of the processor, maximizing their speed. Q can also perform tasks not inherent in databases - for example, process streaming data and calculate in "real time" (with a delay of tens of milliseconds to several seconds depending on the task) different aggregation functions for financial instruments for different time intervals or build a model of perfect influence. transactions on the market and carry out its profiling almost immediately after its commission. In such tasks, most often the main time delay is made not by Q, but by the need to synchronize data from different sources. High speed is achieved due to the fact that the data and functions that process them are in the same process, and processing is reduced to the execution of several QSQL expressions and joins that are not interpreted, but are executed by binary code.

Finally, you can write any service processes on Q. For example, Gateway processes that automatically distribute user requests to the right databases and servers. The programmer has complete freedom to implement any algorithm for balancing, prioritization, fault tolerance, access rights, quotas, and in general what your heart desires. The main problem here is that you have to implement all this yourself.

For example, I will list what types of processes we have. All of them are actively used and work together, combining dozens of different databases into one, processing data from multiple sources and serving hundreds of users and applications.

Connectors (feedhandler) to data sources. As a rule, these processes use external libraries that are loaded into Q. The C-interface in Q is extremely simple and allows you to easily create proxy functions for any C / C ++ library. Q is fast enough to handle, for example, processing the flow of FIX messages from all European stock exchanges simultaneously.
Data distributors (tickerplant), which serve as an intermediate link between connectors and consumers. At the same time, they write incoming data to a special binary log, providing consumers with resistance to connection losses or restarts.
In-memory database (rdb). These databases provide the fastest possible access to raw, fresh data by storing them in memory. As a rule, they accumulate data in tables during the day and reset them at night.
Persist Database (pdb). These databases provide the preservation of data for today in the historical database. As a rule, unlike rdb, they do not store data in memory, but use a special cache on the disk during the day and copy the data at midnight into the historical database.
Historical bases (hdb). These databases provide access to data for previous days, months, and years. Their size (in days) is limited only by the size of the hard drives. Data can be located anywhere, in particular on different disks to speed up access. It is possible to compress data using several algorithms to choose from. The structure of the database is well documented and simple, the data is stored in standard files, so that they can be processed, including by means of the operating system.
Base with aggregated information. They store various aggregations, usually with, grouped by instrument name and time interval. In-memory databases update their state with each incoming message, and historical data stores pre-calculated data to speed up access to historical data.
Finally, gateway processes serving applications and users. Q allows you to implement a fully asynchronous processing of incoming messages, their distribution across databases, checking access rights, etc. I note that messages are not limited and most often are not SQL expressions, as is the case with other databases. Most often, the SQL expression is hidden in a special function and is constructed based on the parameters requested by the user — time conversion is performed, filtering is performed, data is normalized (for example, the stock price is leveled if dividends are paid), etc.

Typical architecture for one data type:

Speed

Although Q is an interpreted language, it is also a vector language. This means that many built-in functions, in particular, arithmetic, take arguments of any form — numbers, vectors, matrices, lists, and the programmer is expected to implement the program as operations on arrays. In such a language, if you add two vectors of a million elements, it no longer plays the role that the language is interpretable, addition will be performed by a super-optimized binary function. Since the lion's share of time in Q programs is spent on operations with tables using these basic vectorized functions, we have a very decent speed at the output, which allows us to process a huge amount of data even in one process. This is similar to the mathematical libraries in Python - although Python itself is a very slow language, it has many excellent numpy libraries that allow you to process numeric data at the speed of the compiled language (by the way, numpy is ideologically close to Q).

In addition, the KX very carefully approached the design of tables and optimize the work with them. First, several types of indexes are supported, which are supported by built-in functions and can be applied not only to columns of tables, but also to any vectors - grouping, sorting, unique attribute and special grouping for historical bases. The index is superimposed elementary and automatically adjusted when adding elements to a column / vector. Indexes equally well can be superimposed on columns of tables both in memory and on disk. When executing a QSQL query, indexes are used automatically, if possible. Secondly, work with historical data is done through the OS file mechanism (memory map). Large tables are never loaded into memory, instead, the necessary columns are displayed directly into memory and only that part of them is actually loaded (here they also help with indexes), which is necessary. There is no difference for a programmer whether data is in memory or not, the mechanism for working with mmap is completely hidden in the depths of Q.

KDB + is not a relational database, tables can contain arbitrary data, while the order of rows in the table does not change when new items are added and can and should be used when writing queries. This feature is urgently needed to work with time series (data from stock exchanges, telemetry, event logs), because if the data is sorted by time, the user does not need to use any SQL tricks to find the first or last time line or N rows in the table , determine which line follows the Nth line, etc. The joins of tables are simplified even more, for example, finding the last quote in a table of 500 million items for 16000 VOD.L transactions takes about a second on the disk and ten milliseconds in memory.

An example of a time join - a quote table is mapped into memory, so there is no need to specify VOD.L in where, an index on the sym column is implicitly used, and the fact that the data is sorted by time. Almost all joins in Q are normal functions, not the select part of an expression:

 1. aj[`sym`time;select from trade where date=2019.03.26, sym=`VOD.L;select from quote where date=2019.03.26]

Finally, it is worth noting that the engineers at KX, starting with Arthur Whitney himself, are really obsessed with efficiency and are making every effort to get the most out of the standard Q functions and optimize the most frequent patterns of use.

Total

KDB + is popular with businesses primarily because of its exceptional versatility - it serves equally well as an in-memory database, as a base for storing terabytes of historical data, and as a platform for analyzing data. Due to the fact that data processing takes place directly in the database, a high speed of work and saving of resources are achieved. A full-fledged programming language integrated with database functions allows you to implement the entire stack of necessary processes on one platform, from receiving data to processing user requests.

additional information

disadvantages

A significant disadvantage of KDB + / Q is the high threshold of entry. The language has a strange syntax, some functions are heavily overloaded (value, for example, has about 11 use cases). Most importantly, it requires a radically different approach to writing programs. In a vector language, you need to think all the time in terms of array transformations, implement all cycles through several variants of map / reduce functions (which are called adverbs in Q), never try to save money by replacing vector operations with atomic ones. For example, to find the index of the Nth occurrence of an element in an array, you should write:

 1. (where element=vector)[N]

although it looks terribly inefficient by C / Java standards (= creates a Boolean vector, where it returns indices of true elements in it). But such a record makes the meaning of the expression more understandable and you use fast vector operations instead of slow atomic ones. The conceptual difference between the vector language and the rest is comparable to the difference between the imperative and functional approaches to programming, and we must be ready for this.

Some users are also unhappy with QSQL. The fact is that it only looks like real SQL. In fact, this is just an interpreter of SQL-like expressions that does not support query optimization. The user must write the optimal queries himself, and in Q, to which many are not ready. On the other hand, of course, you can always write the optimal query yourself, and not rely on the black box optimizer.

As a plus, the Q - Q For Mortals book is available for free on the company's website , and many other useful materials are also collected there.

Another big minus is the cost of the license. These are tens of thousands of dollars a year for one CPU. Only large firms can afford such expenses. Recently, KX has made the licensing policy more flexible and provides an opportunity to pay only during use or to rent KDB + in Google and Amazon clouds. KX also offers a free version for non-commercial purposes (32-bit or 64-bit on request).

Competitors

There are quite a few specialized databases built on similar principles - columnar, in-memory, focused on very large amounts of data. The problem is that these are specialized databases. A bright example is Clickhouse. This database has a very similar to KDB + principle of storing data on disk and the index structure, it performs some queries faster than KDB +, although it is not essential. But even as the Clickhouse database is more specialized than KDB + - web analytics vs arbitrary time series (this distinction is very important - because of it, for example, Clickhouse does not have the ability to use orderliness of records). But, most importantly, Clickhouse does not have the universality of KDB +, a language that would allow to process data directly in the database, and not load it in advance into a separate application, build arbitrary SQL expressions, use arbitrary functions in the query, create processes not related to the performance of the historical database . Therefore, it is difficult to compare KDB + with other databases, they may be better in certain use cases or simply better if we are talking about classical database tasks, but I don’t know another equally effective and versatile tool for processing temporary data.

Python integration

To simplify working with KDB + for people who are not familiar with the technology, KX created libraries for tight integration with Python in a single process. It is possible to call any python function from Q, and vice versa, to call any Q function from Python (in particular, QSQL expressions). Libraries convert, if necessary (for the sake of efficiency, not always) data from the format of one language into the format of another. As a result, Q and Python live in such a close symbiosis that the boundaries between them are erased. As a result, the programmer, on the one hand, has full access to numerous useful Python libraries, on the other hand, he gets a fast database integrated with Python, which is especially useful to those involved in machine learning or modeling.

Working with Q in Python:

 1. >>> q() 2.q)trade:([]date:();sym:();qty:()) 3. q)\ 4. >>> q.insert('trade', (date(2006,10,6), 'IBM', 200)) 5. k(',0') 6. >>> q.insert('trade', (date(2006,10,6), 'MSFT', 100)) 7. k(',1')

Links

Company website - https://kx.com/
Website for developers - https://code.kx.com/v2/
Book Q For Mortals (in English) - https://code.kx.com/q4m3/
Articles on the topic of KDB + / Q applications from kx employees - https://code.kx.com/v2/wp/

Source: https://habr.com/ru/post/446412/

All Articles