
Aggregates of multidimensional OLAP cubes in RAM

Problem statement


The task of Business Intelligence (Business Analytics) solutions is to provide interested parties with statistical and analytical information about the results of an automated process or a set of processes.
For example, consider a business process that records purchases made by people in an online store. A relational model of this business process will naturally contain sellers, buyers, products, and other entities. If the business process is successful, i.e. it generates a fairly intensive data flow, the need arises to analyze this data in order to solve various tasks, including economic ones. For financiers, this will be a collection of data reflecting:

At the same time, if we are talking about a holding company that runs shops, restaurants, and other lines of business, the volume of data grows, which often multiplies the kinds of analytical views required.
The developer is thus faced with the task of providing the broadest, most efficient, and most convenient tool for data analysis. This is where the OLAP solutions offered by vendors such as Oracle, SAP, Microsoft, MicroStrategy, Pentaho, and many others come to the rescue.

I want to make a few reservations in advance:
  1. The article proposes a method for storing multiparameter data in RAM without using the specialized storage systems (DBMSs) that usually underlie Business Intelligence solutions.
  2. The article will not discuss OLAP cubes or the principles of their construction in the classical MOLAP, ROLAP, and HOLAP forms. For an introduction to them, I recommend the following articles:
    habrahabr.ru/post/66356 - Multidimensional cubes, OLAP and MDX
    habrahabr.ru/post/187782 - Launching a Pentaho-based OLAP server step by step
    habrahabr.ru/post/67272 - Creating an OLAP cube. Part 1
    habrahabr.ru/company/eastbanctech/blog/173711 - 6 practical tips for beginners building a simple BI solution


Suppose that the solutions offered by these vendors are not entirely suitable for a specific task (sometimes they are not suitable at all), or that the enterprise's budget does not allow the use of one brand or another.

Model design


For the past three years my work has consisted of creating business dashboards that present multiparameter data to end users using HTML technologies and libraries such as DevExpress, HighCharts, and ExtJS. A distinguishing feature of aggregate storage is intensive reading combined with block-wise updating. In the course of this work, various kinds of relational data models for storing aggregates were tried, but none of them gave the desired effect: high read speed. This led to the solution described in this article, storing all aggregated data in RAM, which entails the following tasks:


First of all, it is necessary to define the entities we will operate on:

The following are examples of data that should be stored in the repository:

Having defined the entities in this way, it is necessary to design an object model of classes that captures their interrelations.



As can be seen from the figure, the class diagram contains utility classes such as ObjectNameRepository and DataWarehouse. These classes are convenient when working with collections of data and define the main interfaces that provide objects according to a given rule.
The Index class describes the “Indicator” entity, which includes characteristics such as the indicator's name, its visibility attributes, and any other attributes a specific application needs. The Index class contains the Cube class.
The Cube class is a store for the qualitative characteristics of an indicator's values and for the values themselves (it is, in effect, the data cube). Each set of characteristics is a point at the intersection of all dimensions of the cube. For example, if we have a cube with 3 characteristics, say “weight”, “color”, and “size”, then each point in this 3-dimensional space is a fixed criterion for each of these dimensions. Any data cube can be represented as a two-dimensional matrix in which the dimensions become the columns and specific sets of characteristics become the rows. For example, row 1 of such a matrix can hold the values “Weight - 100 kg”, “Color - red”, “Size - small”. To store data aggregates it is then necessary to have a matrix describing all possible combinations of characteristics. When implementing such a class, two methods can be used to optimize data storage in RAM:


The Cube class contains the Measure class, which defines a common unit of measurement for the entire set of qualitative characteristics. This allows indicator values to be stored in different units of measurement depending on the available characteristics; i.e. the same indicator can have a different unit of measurement in each cube. For example, the indicator “Products sold” may be measured both in physical terms (“pieces”) and in monetary terms (“rubles”). This article covers only simple indicators, whose unit of measure is the same for a whole cube. In practice there are indicators whose unit of measurement varies with the qualitative characteristics inside the cube; those require moving to a new level of abstraction and introducing classes that serve both simple and complex data sets.
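To make the relationships concrete, here is a minimal C# sketch of the object model described above. Only the class names Index, Cube, and Measure come from the article; all fields and their types are illustrative assumptions:

    using System.Collections.Generic;

    public class Measure
    {
        public string Name;          // e.g. "piece" or "ruble"
    }

    public class Cube
    {
        public Measure Measure;      // one unit of measure per cube
        public string[] Dimensions;  // matrix columns, e.g. "Weight", "Color", "Size"
        public List<string[]> Rows;  // each row is one combination of characteristics
        public List<double> Values;  // Values[i] is the aggregate for Rows[i]
    }

    public class Index
    {
        public string Name;          // indicator name
        public bool IsVisible;       // visibility attribute
        public List<Cube> Cubes;     // the same indicator may live in several cubes
    }

Here the two-dimensional matrix from the description is represented by the pair Dimensions (the columns) and Rows (the combinations of characteristics).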
The Cube class must also contain the values themselves, the Value sets; i.e. each set of characteristics must have a value expressed as a number. At this stage, you can classify the data to optimize its storage in RAM. When working with a database and choosing a field type for a numeric value, in most cases the widest floating-point number is chosen. However, that is very memory-intensive, especially where RAM is concerned, and in practice not every indicator needs such precision. For example, the indicator “Percentage of plan fulfillment” may need only integer values from 0 to 100, for which one byte is enough. You can therefore run a preliminary analysis of each indicator's data to determine the narrowest sufficient type, from BYTE to DECIMAL, and then derive from the Value abstraction all the concrete types the application requires (a sketch of such an analysis follows). In practice, this analysis gave significant data compression: with 100 million values in the database, only 5% of them required the DECIMAL type.
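A possible form of that preliminary analysis, as a hedged sketch (the helper name and the exact rule set are mine, not the author's):

    using System;
    using System.Collections.Generic;

    // Picks the narrowest storage type able to hold every value of an
    // indicator, mirroring the BYTE-to-DECIMAL analysis described above.
    public static Type ChooseStorageType(IEnumerable<decimal> values)
    {
        decimal min = decimal.MaxValue, max = decimal.MinValue;
        bool integral = true;
        foreach (var v in values)
        {
            if (v < min) min = v;
            if (v > max) max = v;
            if (v != decimal.Truncate(v)) integral = false;
        }
        if (integral)
        {
            if (min >= byte.MinValue && max <= byte.MaxValue) return typeof(byte);
            if (min >= short.MinValue && max <= short.MaxValue) return typeof(short);
            if (min >= int.MinValue && max <= int.MaxValue) return typeof(int);
        }
        return typeof(decimal);  // the widest type is the fallback
    }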
When modeling such a storage, it is also necessary to implement methods for writing all the data to the hard disk and reading it back. I strongly recommend avoiding the standard serializers and writing your own based on BinaryWriter and BinaryReader; I also recommend heavy use of the MemoryStream class. Writing to the hard disk is then divided into several stages (a code sketch of both paths follows the two lists):
  1. Serialize the data into a memory stream
  2. Compress the stream with standard algorithms
  3. Write the compressed stream to the hard disk

And vice versa:
  1. Read from the hard disk into a memory stream
  2. Decompress the stream
  3. Deserialize the data
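Here is a sketch of both paths under two assumptions of mine: the values are stored as a flat array of doubles, and GZip stands in for the “standard algorithms” (any stream compressor would do):

    using System.IO;
    using System.IO.Compression;

    public static void SaveValues(string path, double[] values)
    {
        using (var buffer = new MemoryStream())
        {
            // Stages 1-2: serialize into a memory stream and compress it
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            using (var writer = new BinaryWriter(gzip))
            {
                writer.Write(values.Length);
                foreach (var v in values) writer.Write(v);
            }
            // Stage 3: one sequential write to the hard disk
            File.WriteAllBytes(path, buffer.ToArray());
        }
    }

    public static double[] LoadValues(string path)
    {
        // Stage 1: one sequential read from the hard disk into memory
        using (var buffer = new MemoryStream(File.ReadAllBytes(path)))
        // Stages 2-3: decompress the stream and deserialize the data
        using (var gzip = new GZipStream(buffer, CompressionMode.Decompress))
        using (var reader = new BinaryReader(gzip))
        {
            var values = new double[reader.ReadInt32()];
            for (int i = 0; i < values.Length; i++) values[i] = reader.ReadDouble();
            return values;
        }
    }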


This approach makes it possible to spread the compression and decompression work across several CPU cores without overloading the hard disk; i.e. reading from and writing to the disk happen sequentially, while reading and writing the in-memory streams can be parallelized. In practice, reading 100 million values from a hard disk (approximately 2 GB of compressed data), decompressing them in memory, and instantiating the classes (up to 35 GB) takes about 5 minutes on a 32-core Xeon server. Extracting the same data from a relational DBMS took about 5 days.
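The parallel part can be expressed, for instance, with PLINQ; ReadBlocksSequentially and DecompressAndBuildCube are hypothetical helpers standing for the sequential disk pass and a per-block version of LoadValues above:

    using System.Linq;

    // Disk I/O stays sequential; decompression of independent blocks
    // fans out across all available CPU cores.
    byte[][] compressedBlocks = ReadBlocksSequentially("cubes.dat");
    Cube[] cubes = compressedBlocks
        .AsParallel()
        .Select(block => DecompressAndBuildCube(block))
        .ToArray();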
This article does not consider options for updating data in such a repository; the repository's main task, reading data, is something this model handles completely. To update data in such a storage, exclusive access to memory must be provided, and to grow the system beyond the technical limits on RAM it must be extended to distributed RAM (sharding).
This article assumes that the data aggregation procedure takes place in an external environment and that ready-made aggregates enter the system.
Application code is likewise mostly omitted because of its triviality, especially given the use of LINQ, where in most cases the operations compare sets of characteristics for equality or difference; a minimal sketch of such a comparison follows.
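For illustration only, the kind of LINQ comparison meant here might look like this (the filter maps a dimension index to a required characteristic value; all names are mine):

    using System.Collections.Generic;
    using System.Linq;

    // Returns the indices of cube rows whose characteristics match the filter.
    public static IEnumerable<int> MatchingRows(List<string[]> rows, Dictionary<int, string> filter)
    {
        return Enumerable.Range(0, rows.Count)
                         .Where(i => filter.All(f => rows[i][f.Key] == f.Value));
    }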

Conclusion



For all the elegance and versatility of relational databases, there are tasks that cannot be solved with ready-made templates. Using aggregate data warehouses as an example, an alternative approach to designing BI systems has been shown. As we can see, this approach virtually eliminates reading data from hard drives during intensive extraction, which is the bottleneck of any information system. By eliminating reads from the hard disk, we remove the main limit on data access speed imposed by the technical capabilities of hard disks.
In fact, when working with an aggregate storage, only a limited set of functions needs to be implemented, for example:



In most cases, the data output consists of matrices filled according to specified criteria: the same RecordSets and DataTables one obtains from relational databases.
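As a closing sketch (built on the hypothetical Cube and MatchingRows above), such an output function could fill a matrix of dimension columns plus a value column, analogous to a RecordSet or DataTable:

    using System.Collections.Generic;
    using System.Linq;

    public static object[,] GetMatrix(Cube cube, Dictionary<int, string> filter)
    {
        var rows = MatchingRows(cube.Rows, filter).ToList();
        var result = new object[rows.Count, cube.Dimensions.Length + 1];
        for (int r = 0; r < rows.Count; r++)
        {
            for (int c = 0; c < cube.Dimensions.Length; c++)
                result[r, c] = cube.Rows[rows[r]][c];               // characteristic columns
            result[r, cube.Dimensions.Length] = cube.Values[rows[r]]; // aggregate value
        }
        return result;
    }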

Source: https://habr.com/ru/post/214643/

