
Aggregates of multidimensional OLAP cubes in RAM

Problem statement


The task of Business Intelligence (Business Analytics) solutions is to provide interested parties with statistical and analytical information about the results of an automated process or a set of processes.
For example, consider a business process that records purchases made by people in an online store. A relational model of this business process will naturally contain sellers, buyers, products, and other entities. If the business process is successful, i.e. it generates a fairly intensive data flow, the need arises to analyze this data in order to solve various tasks, including economic ones. For financiers, this will be a collection of data reflecting:

At the same time, if we are talking about a holding company that runs shops, restaurants, and other lines of business, the volume of data grows, which often multiplies the kinds of analytical views required.
The developer is thus faced with the task of providing the broadest, most efficient, and most convenient tool for data analysis. This is where the OLAP solutions offered by vendors such as Oracle, SAP, Microsoft, MicroStrategy, Pentaho, and many others come to the rescue.

I want to make a few reservations in advance:
  1. The article proposes a method for storing multiparameter data in RAM without using the specialized storage systems (DBMSs) that usually underlie Business Intelligence solutions.
  2. The article will not discuss OLAP cubes or the principles of their construction in the classical MOLAP, ROLAP, and HOLAP forms. For an introduction to them, I recommend the following articles:
    habrahabr.ru/post/66356 - Multidimensional cubes, OLAP and MDX
    habrahabr.ru/post/187782 - Launching a Pentaho-based OLAP server step by step
    habrahabr.ru/post/67272 - Creating an OLAP cube. Part 1
    habrahabr.ru/company/eastbanctech/blog/173711 - 6 practical tips for beginners building a simple BI solution


Suppose that the solutions offered by these vendors are not entirely suitable for a specific task (sometimes they are not suitable at all), or that the enterprise's budget does not allow the use of one brand or another.

Model design


For the past three years my work has consisted of creating business dashboards that present multiparameter data to end users using HTML technologies and libraries such as DevExpress, HighCharts, and ExtJS. A distinguishing feature of aggregate storage is intensive reading combined with block-wise updating. In the course of this work, various kinds of relational data models for storing aggregates were tried, but none of them gave the desired effect: high read speed. This led to the solution described in this article, storing all aggregated data in RAM, which entails the following tasks:


First of all, it is necessary to define the entities we will operate on:

The following are examples of data that should be stored in the repository:

Having defined the entities in this way, it is necessary to design an object model of classes that captures their interrelations.



As can be seen from the figure, the class diagram contains utility classes such as ObjectNameRepository and DataWarehouse. These classes are convenient when working with collections of data and define the main interfaces that provide objects according to a given rule.
The Index class describes the “Indicator” entity, which includes characteristics such as the indicator's name, its visibility attributes, and any other attributes a specific application needs. The Index class contains the Cube class.
The Cube class is a store for the qualitative characteristics of an indicator's values and for the values themselves (it is, in effect, the data cube). Each set of characteristics is a point at the intersection of all dimensions of the cube. For example, if we have a cube with 3 characteristics, say “weight”, “color”, and “size”, then each point in this 3-dimensional space is a fixed criterion for each of these dimensions. Any data cube can be represented as a two-dimensional matrix in which the dimensions become the columns and specific sets of characteristics become the rows. For example, row 1 of such a matrix can hold the values “Weight - 100 kg”, “Color - red”, “Size - small”. To store data aggregates it is then necessary to have a matrix describing all possible combinations of characteristics. When implementing such a class, two methods can be used to optimize data storage in RAM:


The Cube class contains the Measure class, which defines a common unit of measurement for the entire set of qualitative characteristics. This allows indicator values to be stored in different units of measurement depending on the available characteristics; i.e. the same indicator can have a different unit of measurement in each cube. For example, the indicator “Products sold” may be measured both in physical terms (“pieces”) and in monetary terms (“rubles”). This article covers only simple indicators, whose unit of measure is the same for a whole cube. In practice there are indicators whose unit of measurement varies with the qualitative characteristics inside the cube; those require moving to a new level of abstraction and introducing classes that serve both simple and complex data sets.
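To make the relationships concrete, here is a minimal C# sketch of the object model described above. Only the class names Index, Cube, and Measure come from the article; all fields and their types are illustrative assumptions:

    using System.Collections.Generic;

    public class Measure
    {
        public string Name;          // e.g. "piece" or "ruble"
    }

    public class Cube
    {
        public Measure Measure;      // one unit of measure per cube
        public string[] Dimensions;  // matrix columns, e.g. "Weight", "Color", "Size"
        public List<string[]> Rows;  // each row is one combination of characteristics
        public List<double> Values;  // Values[i] is the aggregate for Rows[i]
    }

    public class Index
    {
        public string Name;          // indicator name
        public bool IsVisible;       // visibility attribute
        public List<Cube> Cubes;     // the same indicator may live in several cubes
    }

Here the two-dimensional matrix from the description is represented by the pair Dimensions (the columns) and Rows (the combinations of characteristics).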
The Cube class must also contain the values themselves, the Value sets; i.e. each set of characteristics must have a value expressed as a number. At this stage, you can classify the data to optimize its storage in RAM. When working with a database and choosing a field type for a numeric value, in most cases the widest floating-point number is chosen. However, that is very memory-intensive, especially where RAM is concerned, and in practice not every indicator needs such precision. For example, the indicator “Percentage of plan fulfillment” may need only integer values from 0 to 100, for which one byte is enough. You can therefore run a preliminary analysis of each indicator's data to determine the narrowest sufficient type, from BYTE to DECIMAL, and then derive from the Value abstraction all the concrete types the application requires (a sketch of such an analysis follows). In practice, this analysis gave significant data compression: with 100 million values in the database, only 5% of them required the DECIMAL type.
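A possible form of that preliminary analysis, as a hedged sketch (the helper name and the exact rule set are mine, not the author's):

    using System;
    using System.Collections.Generic;

    // Picks the narrowest storage type able to hold every value of an
    // indicator, mirroring the BYTE-to-DECIMAL analysis described above.
    public static Type ChooseStorageType(IEnumerable<decimal> values)
    {
        decimal min = decimal.MaxValue, max = decimal.MinValue;
        bool integral = true;
        foreach (var v in values)
        {
            if (v < min) min = v;
            if (v > max) max = v;
            if (v != decimal.Truncate(v)) integral = false;
        }
        if (integral)
        {
            if (min >= byte.MinValue && max <= byte.MaxValue) return typeof(byte);
            if (min >= short.MinValue && max <= short.MaxValue) return typeof(short);
            if (min >= int.MinValue && max <= int.MaxValue) return typeof(int);
        }
        return typeof(decimal);  // the widest type is the fallback
    }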
When modeling such a storage, it is also necessary to implement methods for writing all the data to the hard disk and reading it back. I strongly recommend avoiding the standard serializers and writing your own based on BinaryWriter and BinaryReader; I also recommend heavy use of the MemoryStream class. Writing to the hard disk is then divided into several stages (a code sketch of both paths follows the two lists):
  1. Serialize the data into a memory stream
  2. Compress the stream with standard algorithms
  3. Write the compressed stream to the hard disk

And vice versa:
  1. Read from the hard disk into a memory stream
  2. Decompress the stream
  3. Deserialize the data
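Here is a sketch of both paths under two assumptions of mine: the values are stored as a flat array of doubles, and GZip stands in for the “standard algorithms” (any stream compressor would do):

    using System.IO;
    using System.IO.Compression;

    public static void SaveValues(string path, double[] values)
    {
        using (var buffer = new MemoryStream())
        {
            // Stages 1-2: serialize into a memory stream and compress it
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            using (var writer = new BinaryWriter(gzip))
            {
                writer.Write(values.Length);
                foreach (var v in values) writer.Write(v);
            }
            // Stage 3: one sequential write to the hard disk
            File.WriteAllBytes(path, buffer.ToArray());
        }
    }

    public static double[] LoadValues(string path)
    {
        // Stage 1: one sequential read from the hard disk into memory
        using (var buffer = new MemoryStream(File.ReadAllBytes(path)))
        // Stages 2-3: decompress the stream and deserialize the data
        using (var gzip = new GZipStream(buffer, CompressionMode.Decompress))
        using (var reader = new BinaryReader(gzip))
        {
            var values = new double[reader.ReadInt32()];
            for (int i = 0; i < values.Length; i++) values[i] = reader.ReadDouble();
            return values;
        }
    }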


This approach makes it possible to spread the compression and decompression work across several CPU cores without overloading the hard disk; i.e. reading from and writing to the disk happen sequentially, while reading and writing the in-memory streams can be parallelized. In practice, reading 100 million values from a hard disk (approximately 2 GB of compressed data), decompressing them in memory, and instantiating the classes (up to 35 GB) takes about 5 minutes on a 32-core Xeon server. Extracting the same data from a relational DBMS took about 5 days.
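The parallel part can be expressed, for instance, with PLINQ; ReadBlocksSequentially and DecompressAndBuildCube are hypothetical helpers standing for the sequential disk pass and a per-block version of LoadValues above:

    using System.Linq;

    // Disk I/O stays sequential; decompression of independent blocks
    // fans out across all available CPU cores.
    byte[][] compressedBlocks = ReadBlocksSequentially("cubes.dat");
    Cube[] cubes = compressedBlocks
        .AsParallel()
        .Select(block => DecompressAndBuildCube(block))
        .ToArray();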
This article does not consider options for updating data in such a repository; the repository's main task, reading data, is something this model handles completely. To update data in such a storage, exclusive access to memory must be provided, and to grow the system beyond the technical limits on RAM it must be extended to distributed RAM (sharding).
This article assumes that the data aggregation procedure takes place in an external environment and that ready-made aggregates enter the system.
Application code is likewise mostly omitted because of its triviality, especially given the use of LINQ, where in most cases the operations compare sets of characteristics for equality or difference; a minimal sketch of such a comparison follows.
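For illustration only, the kind of LINQ comparison meant here might look like this (the filter maps a dimension index to a required characteristic value; all names are mine):

    using System.Collections.Generic;
    using System.Linq;

    // Returns the indices of cube rows whose characteristics match the filter.
    public static IEnumerable<int> MatchingRows(List<string[]> rows, Dictionary<int, string> filter)
    {
        return Enumerable.Range(0, rows.Count)
                         .Where(i => filter.All(f => rows[i][f.Key] == f.Value));
    }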

Conclusion



For all the elegance and versatility of relational databases, there are tasks that cannot be solved with ready-made templates. Using aggregate data warehouses as an example, an alternative approach to designing BI systems has been shown. As we can see, this approach virtually eliminates reading data from hard drives during intensive extraction, which is the bottleneck of any information system. By eliminating reads from the hard disk, we remove the main limit on data access speed imposed by the technical capabilities of hard disks.
In fact, when working with an aggregate storage, only a limited set of functions needs to be implemented, for example:



In most cases, the data output consists of matrices filled according to specified criteria: the same RecordSets and DataTables one obtains from relational databases.
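As a closing sketch (built on the hypothetical Cube and MatchingRows above), such an output function could fill a matrix of dimension columns plus a value column, analogous to a RecordSet or DataTable:

    using System.Collections.Generic;
    using System.Linq;

    public static object[,] GetMatrix(Cube cube, Dictionary<int, string> filter)
    {
        var rows = MatchingRows(cube.Rows, filter).ToList();
        var result = new object[rows.Count, cube.Dimensions.Length + 1];
        for (int r = 0; r < rows.Count; r++)
        {
            for (int c = 0; c < cube.Dimensions.Length; c++)
                result[r, c] = cube.Rows[rows[r]][c];               // characteristic columns
            result[r, cube.Dimensions.Length] = cube.Values[rows[r]]; // aggregate value
        }
        return result;
    }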

Source: https://habr.com/ru/post/214643/

