
Virtual cube instead of OLAP

When you do the opposite and get the same ...

When faced with analytical (computational / aggregative) data processing, we have to find a compromise between responsiveness, speed, and convenience.


Some systems index and search well, others can quickly calculate and aggregate data, and others are simply convenient. In some cases you have to organize preloading and indexing of the data, with all the attendant difficulties; in others the user is given an abstraction of his model of source and aggregated data on top of the built-in or external physical storages and databases used directly during calculations. In any case, the user, whether programmer or analyst, has to do a fair amount of work: preparing the raw data and queries, then the computation model, and finally the visual design of the result on widgets, which of course must be "sexy" (beautiful, responsive, and understandable), otherwise all the work done will go down the drain.

And often, unfortunately, after the agony of decision making, we notice how a task that seemed simple and understandable at first glance grows into an eerie monster that is useless to fight with the available means, and we urgently need to invent something: our own bicycle, with blackjack and hookers ©. Our bicycle rode off, rides well over bumps, and copes with obstacles that previously could only be guessed at.


Below we describe one aspect of the internals of our fictional "Rubik's Cube": computational processing for interactive data visualization.


A simple task should be solved simply, and a complex one also simply, just longer ...

Setting out to build the system with a small team, we went from simple to complex. While creating the constructor, we were internally convinced that we understood the purpose of the system well, all the while fighting both the urge not to do too much and the opposite urge to automate everything and everyone by creating a framework for everything. Moreover, one wonderful framework of ours was already ready and even running in production: jsBeans. So we started developing a new data processing system, which has grown into a self-sufficient product of its own, a constructor and platform for developing a whole class of data processing systems. In this article we will call it "Rubik's Cube", to avoid advertising while still describing the solutions we find interesting.


Cube, slice, dimension


The main task: given a set of unrelated data, including heterogeneous external databases and files, form a multidimensional model of interconnected elements of the source data and the results of their analytical processing, for visualization on dynamic dashboards with interrelated widgets.


Simply put, for example, in the form of a clickable dashboard:


Schools rated dashboard example


Such a multidimensional data model in our system is called a "Cube" and literally represents an abstract collection of variable data sets called "Slices", interconnected by common output (displayed) fields / columns or by internal fields called "Dimensions", which are used for filtering and for linking slices to each other.


A slice can be represented as a virtual table or a view (CTE) with parameters and a variable query body that depends on the filtering conditions. The key point is that the output data changes depending on the contextual search conditions (inside the widget) and the global filter, which is built by selecting values on the widgets and applying basic logic functions (AND / OR / NOT) and their combinations.


The global filter allows you to "rotate the Rubik's Cube", as in the video:



If an output field of a slice is simultaneously a dimension in another slice (i.e. it has the same name), then the system treats the values of this field as "facts" (in OLAP terms) specified in the global filter, which changes the initial data sets during calculation and aggregation. The result is a dynamic interaction between widgets, in which the values of the displayed indicators depend on the selected elements and filters.


A slice is a data set, either initial data or analytical results, that varies "by dimensions". It is characterized by output fields / columns, a list of supported dimensions, and a set of parameters with default values, and is described by a relatively elegant query in a visual editor that supports filtering, sorting, grouping / aggregation, joins (JOIN), unions (UNION), recursion, and other manipulations.
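To make this concrete, here is a minimal sketch of what a slice declaration could look like. All names and the object shape are illustrative assumptions for this article, not the actual "Rubik's Cube" API:

```javascript
// Hypothetical slice declaration (illustrative only): output fields,
// supported dimensions, parameters with defaults, and a query body
// that groups a source slice and aggregates a score.
const schoolScores = {
  name: 'schoolScores',
  // output (displayed) columns available to widgets
  fields: ['school', 'district', 'avgScore'],
  // dimensions usable for filtering and for linking slices together
  dimensions: ['district', 'year'],
  // parameters with default values
  params: { year: 2018 },
  // variable query body over another slice as a source
  query: {
    from: 'examResults',
    where: { year: '${year}' },          // parameter substituted at run time
    groupBy: ['school', 'district'],
    select: { school: 'school', district: 'district', avgScore: { avg: 'score' } }
  }
};
```

A declaration like this is enough for the system to know which global-filter clauses apply to the slice (those on `district` or `year`) and which must be pushed further down into its sources.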


Slices that use each other as sources describe the internal structure of the cube, for example:


Cube structure example


An example of a slice in the editor:


Example slice query editor


A slice supports dimensions explicitly specified in its output fields and also inherits dimensions from the sources of its query. This means the output of a slice can change even as a result of changes in other, source slices. In other words, the results of a slice can be filtered not only by its output fields but also by the internal dimension fields of its sources, somewhere deep in the query, down to the primary database tables.


The query structure is expanded and modified by the system automatically at execution time, depending on the current global filter and the input parameters, pushing them deeper into the query according to the cube model and the declared dimensions and slices.
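This push-down can be sketched as a recursive walk over the query tree: at each level, the clauses of the global filter that match a slice's declared dimensions are injected into that slice's WHERE conditions, and the remainder travels down toward the primary tables. The following is a simplified illustration under that assumption, not the system's actual code:

```javascript
// Sketch: push a global filter (a map of field -> condition) down a
// nested query. dimensionsOf(sliceName) returns the dimensions that
// slice declares; clauses it handles stay here, the rest recurse deeper.
function pushDown(query, globalFilter, dimensionsOf) {
  const dims = dimensionsOf(query.slice);
  const local = {}, rest = {};
  for (const [field, cond] of Object.entries(globalFilter)) {
    (dims.includes(field) ? local : rest)[field] = cond;
  }
  const out = { ...query };
  if (Object.keys(local).length) {
    // combine with any existing conditions of this level
    out.where = { $and: [query.where, local].filter(Boolean) };
  }
  if (out.from && typeof out.from === 'object') {
    // recurse toward the primary tables with the unhandled clauses
    out.from = pushDown(out.from, rest, dimensionsOf);
  }
  return out;
}
```

For a two-level query where the outer slice declares `district` and the inner one declares `year`, a filter `{ district: 'CAO', year: 2018 }` ends up split across the two levels automatically.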


An example of a simple global filter, literally what is produced when a user has entered or selected values on several widgets:


An example of a global filter on a dashboard


The global filter is stored as JSON in the request:


Global filter JSON example in request body
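For readers without the screenshot, a filter of this kind could plausibly look like the object below. The exact field names and operator spelling are an assumption made for illustration; the real request format may differ:

```javascript
// Hypothetical shape of a global filter in the request body:
// basic logic functions (AND / OR / NOT) over field conditions.
const globalFilter = {
  $and: [
    { district: { $eq: 'Central' } },
    { $or: [
        { year: { $eq: 2017 } },
        { year: { $eq: 2018 } }
    ] },
    { $not: { schoolType: { $eq: 'evening' } } }
  ]
};
```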


The request reaches the primary source (the database) in prepared form, passing through several basic steps:



The final stage is translating the request into the format of the primary source, for example into SQL:


Example of a final query with an inline filter
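The idea of this last translation step can be shown with a toy translator from a JSON-like query form to a SQL string. This is a deliberately minimal sketch (no dialects, no escaping, illustrative field names), nowhere near the real translator:

```javascript
// Toy translation of a prepared JSON query into SQL.
function toSql(q) {
  // render a conjunction of equality conditions
  const cond = w => Object.entries(w)
    .map(([f, v]) => `${f} = ${typeof v === 'string' ? `'${v}'` : v}`)
    .join(' AND ');
  let sql = `SELECT ${q.select.join(', ')} FROM ${q.from}`;
  if (q.where) sql += ` WHERE ${cond(q.where)}`;
  if (q.groupBy) sql += ` GROUP BY ${q.groupBy.join(', ')}`;
  return sql;
}

// toSql({ select: ['district', 'AVG(score)'], from: 'exam_results',
//         where: { year: 2018 }, groupBy: ['district'] })
// → "SELECT district, AVG(score) FROM exam_results WHERE year = 2018 GROUP BY district"
```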


When sources are different


As a rule, everything is simple and clear when you work with a single data store. But when there are several, fundamentally different ones, you have to apply different tricks for each specific task. And you always want a universal solution that fits every case, ideally "out of the box", or at most with a few modifications. Here another abstraction suggests itself, this time over the data warehouses themselves: one that, first, reconciles formats and query languages and, second, ensures data interdependence, at least at the level of additional filtering conditions in queries to one source based on values from another.


To do this, we developed a universal query language suitable both for representing the virtual cube data model and for working with arbitrary storages, by translating a query into the desired format and language. Luckily, the query language, originally designed for simple retrieval and filtering of data from different sources, easily expanded into a full-fledged search and data processing language, allowing you to build computational structures from the very simple to the very complex, spanning several pages and many subqueries.


Sources can be divided into three types:


  1. data files that must be loaded into the system;
  2. databases that support full data processing and other operations;
  3. storages that support only data retrieval, with or without filtering, including various external services.

With the first type, everything is unambiguous: an import module integrated into the system parses various input formats and loads the results into the storage. A special import constructor has also been developed, which deserves a separate article.


The second type is self-sufficient databases; to work with them, you only need to translate the original request into the desired format, query language, and dialect.


The third type requires at least post-processing of the data. And all types, when used together, may also require post-processing: intersections, joins, aggregation, and final calculations. This occurs when data processing in one database must take into account the results of filtering in another, external one.


The simplest example: a fuzzy search is performed in one database, and the output must be an aggregation, taking the search results into account, of indicators stored in another database on another server.


To implement such a scheme, our system uses a simple algorithm: the initial request is prepared simultaneously by several interpreters, each of which can refuse the request if it is incompatible, return an iterator with data, or convert the request and hand it to the next interpreter in the preparation chain. In the end, for one query we get one to several lazy iterators that produce the same result in different ways, from which the best one is selected (according to various criteria defined by the developer in the configuration).
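The chain described above can be sketched as follows. Each interpreter is modeled as a function with three possible outcomes; the names and return shape are illustrative, not the real internal interfaces:

```javascript
// Sketch of the interpreter chain. Each interpreter returns one of:
//   null           - refuses an incompatible query
//   { iterator }   - a lazy iterator producing the result directly
//   { rewritten }  - a converted query handed to the rest of the chain
function prepare(query, interpreters) {
  const candidates = [];
  interpreters.forEach((interp, i) => {
    const res = interp(query);
    if (!res) return;                                   // refused
    if (res.iterator) candidates.push(res.iterator);    // direct plan
    if (res.rewritten) {                                // converted: continue down the chain
      candidates.push(...prepare(res.rewritten, interpreters.slice(i + 1)));
    }
  });
  return candidates; // one to several lazy iterators for the same result
}
```

A universal query can thus reach an SQL-capable interpreter only after another interpreter rewrites it, while incompatible interpreters simply drop out, which is exactly the "refuse / return / convert" behavior described above.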


The iterator selection strategy is specified in the configuration or in the query parameters. Several basic strategies are currently supported:



As a result of this combination, for a single input request we get several variants of its execution, both over different sources and with different execution strategies: the choice of the main / target database in which the main part of the query and the final assembly of results will be performed.


If the target DBMS supports connecting external sources, a return circuit becomes possible, in which the DBMS connects to the system API to receive small amounts of data from the system, for example to filter large amounts "on site". Such integration is transparent to the end user and the analyst: the cube model does not change, and all operations are performed by the system automatically.


Simplified sequence diagram when querying multiple integrated databases in a single query


For more complex cases, the system implements an internal in-memory query interpreter on the remarkable H2 Embedded database engine, which allows integrating any supported databases out of the box. Literally, it works like this: the request is broken into pieces by groups of sources and sent for execution, after which the results are assembled and post-processed in memory, in H2.
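Conceptually, that assembly step looks like the sketch below. In the real system the final join and aggregation run inside H2; here plain JavaScript structures stand in for it, and the piece names and the shared `key` column are invented for the example:

```javascript
// Conceptual sketch of federated execution with in-memory assembly.
// pieces: sub-queries grouped by source; fetch(piece): rows from that source.
function executeFederated(pieces, fetch) {
  // 1. each piece runs against its own source (lazily in the real system)
  const partials = pieces.map(p => ({ name: p.name, rows: fetch(p) }));
  // 2. assemble in memory: a simple hash join on a shared key,
  //    standing in for the SQL that H2 would execute over the partial results
  const byKey = new Map(partials[0].rows.map(r => [r.key, { ...r }]));
  for (const part of partials.slice(1)) {
    for (const row of part.rows) {
      const acc = byKey.get(row.key);
      if (acc) Object.assign(acc, row); // merge columns from the other source
    }
  }
  return [...byKey.values()];
}
```

The first partial result acts as the driving set (e.g. the fuzzy-search hits), and rows from the other sources only enrich keys that survived it, which matches the search-then-aggregate example above.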


At first glance, such a data integration scheme at the level of the internal interpreter seems "heavy", and this is true if you have to work with large amounts of input data and perform calculations after intersecting and combining sets from external sources. In practice this is partially mitigated: the request is simultaneously executed by several processors in different variants, so the interpreter is used only in the most extreme cases, as the universal out-of-the-box fallback. In the end, any integration is limited by the typical transport costs of preparing, transmitting over the network, and receiving data, and that is an entirely different problem.


Technical side


On the technical side, which a topic like this can hardly avoid, the system follows the same principle: be clever on the inside, but simplify everything as much as possible on the outside.


The data processing system is implemented on top of the jsBeans client-server framework as a set of additional modules and specific build projects. jsBeans, in turn, is implemented in Java and works as an application server; by and large it is a bundle of Rhino, Jetty, and Akka, and it also includes the client-server bean technology developed by our team and a rich library of components built up over several years of successful use.


Rubik's Cube is implemented entirely in JavaScript as a set of js-beans (*.jsb files). Some of them run only on the server, others only on the client, and the rest are integral components functioning as a distributed whole whose parts interact with each other; this is transparent to the developer but remains under his control. Js-beans can have different life strategies, for example bound to a user session or not, and much more. A bean is isomorphic: both on the client and on the server you work with it as with a virtual instance of an ordinary class. A bean is described in one file and includes three sections: fields and methods running on the client, those running on the server, and a section of common synchronized fields.


Since the article has already become verbose, so as not to bore readers it is time to wrap up, with the intention of soon describing the details and the most interesting architectural solutions in the implementation of jsBeans and our projects based on it: the visualization subsystem, analytical processes, the ontological designer of subject areas, the query language, data import, and more ...


Why is that?


This has never happened before, and here we go again ...

At first there were few primary data sets, and the subject areas and tasks were quite specific. Why such torment, then? The task looked simple, and everyone wanted a result right away, especially since the quick solution lay on the surface while the right one required diligence, balanced decisions, and adherence to the initial plan. We went the opposite way, from difficult and long solutions to simple and fast ones, along the path of generalizing particular problems.


The main requirement is that new dashboards must be built quickly, even when a new subject area and its analytical needs differ greatly from previous ones. Obviously, you cannot guess even half of the future requirements; above all, the system should be malleable. Extending the library of components and analytical algorithms and connecting new types of sources is an integral part of adapting the system. In other words, the pipeline works: analysts build queries and dashboards, and programmers quickly implement new capabilities for them. And we, as programmers, sought from the start to simplify our own future work, trying not to harm usability.


So the system was created universal and adaptive from the start: we built a "constructor of constructors", developing a framework on top of a framework previously created with a similar but even more general goal.


The rating of Moscow schools based on the results of the Unified State Exam and Olympiads is an example of a dashboard constructed in the manner described above from a data set downloaded from the open data portal of the Government of Moscow.


"Rubik-Rubik" is a basic platform for the development of information and analytical systems. Developed as a branch and a logical continuation of jsBeans. It includes all the necessary modules for solving problems of collection, processing, analysis (computational and process-oriented) and visualization.


jsBeans is an isomorphic "full-stack" web framework implementing client-server JavaScript bean technology, developed under an open license as a universal tool. In use it has proven itself well, in most cases fitting ideally into the tasks before us.



Source: https://habr.com/ru/post/452154/

