Following up on this post.
Users of the system who process and analyze data should be able to view specialized reports on changes in the system. Any user who needs reporting information naturally applies it to the domain of activity registered in the System. It follows that the System should be able to provide enough report forms that every user can cover their work with some combination of them.
This problem can be solved in two ways:
- Plan, at the design stage, some number of static (ready-made) report templates and fill them with data on user request.
- Develop a dynamic model for building reports “on the fly” using OnLine Analytical Processing technologies (hereafter OLAP).
Obviously, the first method is very difficult in practice, because the probability of making a mistake at the template design stage is high. The disadvantage of the second method is the limited set of dimensions available for analysis. If the analysis task is stated as minimizing the data sets inaccessible to reporting while maximizing the user needs satisfied, the solution is a combination of the two methods. Given that the customer’s expectations of the system exist a priori, the majority of user requirements are covered by static templates. The OLAP mechanisms, in turn, allow building dynamic reports for more refined analysis conditions, describing how the System fills with data over time and showing the direction of change in that data.
To turn this into a concrete technological solution at the design stage, one should turn to database design methods. There are several, but for the case described it is worth considering only two:
- Online Transaction Processing (OLTP)
- Online Analytical Processing (OLAP)
The first approach implies “standard” database design, in which normalization rules of the form “permissible abstraction → entity → table” are applied. The System’s own database is designed this way. It is ideal for data storage because of its minimal redundancy: a set of interconnected data sits in the database with a minimum of duplicated fields and, as a result, occupies minimal space, all other things being equal. This approach is the inverse of the second (OLAP), whose imperative is the maximum permissible denormalization of the data and the links between them.
OLAP data analysis is based on building so-called multidimensional OLAP cubes, which are nothing more than a function of many variables (dimensions) whose discrete set of values (the so-called facts) describes some qualitative state of the system. By fixing some of the dimensions of this function (that is, of the cube) as input parameters, you obtain a data set suitable for use in a dynamic report template. Obviously, to obtain such a data set quickly, the finite set of values of each dimension should be known a priori. Thus the redundancy of links and duplicated data in OLAP design is not a drawback, given the goals it achieves:
- Minimizing the cost of computing the value along any required dimension
- Minimizing the cost of computing the system-state (fact) value
Commonly understood examples of such dimensions and facts are:
- Time as a dimension
- Entity attributes as dimensions
- Entity counts as the cube value
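The cube-as-a-function idea above can be sketched in a few lines. This is a minimal illustration with hypothetical data: the dimensions are a time period and an entity attribute, and the fact is the entity count for each dimension combination.

```python
from collections import defaultdict

# Hypothetical event records: each row is (time period, entity attribute).
rows = [
    ("2023-Q1", "order"), ("2023-Q1", "order"), ("2023-Q1", "invoice"),
    ("2023-Q2", "order"), ("2023-Q2", "invoice"), ("2023-Q2", "invoice"),
]

# The "cube" maps a tuple of dimension values to a fact:
# here the fact is the entity count for (time, attribute).
cube = defaultdict(int)
for period, attribute in rows:
    cube[(period, attribute)] += 1

# Fully specifying the dimensions yields a single fact...
q1_orders = cube[("2023-Q1", "order")]

# ...while aggregating a dimension away ("rolling up") yields a data set
# suitable for a report broken down by period only.
per_period = defaultdict(int)
for (period, _attr), count in cube.items():
    per_period[period] += count
```

Fixing both dimensions returns one fact; rolling up the attribute dimension returns a per-period series, which is exactly the shape a dynamic report template consumes.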
An OLTP database design minimizes the number of redundant values along any of the permissible dimensions, and is therefore unsuitable as a data source for a cube. For the dimensions to be sufficiently complete, the System’s original database should be transformed into a so-called data warehouse: a specialized OLAP database that meets the conditions of the task. It will serve as the source of the data used in reports.
From the above, it follows that when solving the problem of data analysis, the following statements are true:
- Using a special transformation, the warehouse should be filled with data from the main database.
- Reports (both static and dynamic) will use the warehouse, while the main database of the System continues to serve the workstations and the rest of the system’s tooling.
- The resources the system consumes for building reports and for the rest of its database work are thereby split into two independent processes running in parallel, without interfering with each other.
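The transformation from the first bullet can be sketched as follows. This is a minimal, illustrative version using SQLite in place of a real DBMS; the OLTP schema (`customers`, `orders`) and the warehouse fact table (`fact_orders`) are hypothetical names, and the join-and-aggregate query stands in for the denormalizing transformation.

```python
import sqlite3

# A toy OLTP ("main") database in normalized form.
main_db = sqlite3.connect(":memory:")
main_db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         created_on TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'North'), (2, 'South');
    INSERT INTO orders VALUES (1, 1, '2023-01-10', 100.0),
                              (2, 1, '2023-01-15', 50.0),
                              (3, 2, '2023-02-01', 75.0);
""")

# The warehouse: one flat fact table keyed by the (month, region) dimensions.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""CREATE TABLE fact_orders
    (month TEXT, region TEXT, order_count INTEGER, total_amount REAL)""")

# The denormalizing transformation: join the normalized tables and
# pre-aggregate them into warehouse facts.
rows = main_db.execute("""
    SELECT substr(o.created_on, 1, 7) AS month, c.region,
           COUNT(*), SUM(o.amount)
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY month, c.region
""").fetchall()
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
```

After the transformation runs, report queries hit only `fact_orders` and never touch the main database, which is what makes the two workloads independent.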
This approach has two drawbacks:
- A data-conversion routine must be created to push values from the main database into the warehouse.
- Since the transformation is theoretically resource-intensive, it can only be run at set intervals.
These intervals can be critical for users, because between runs the system is in a state that can be described as “data obsolescence”. If a report is built at a point in time later than the last run of the denormalizing transformation and earlier than the next one, it will show stale data. The age of such data, however, will never exceed the configured launch interval.
Thus a secondary task of data analysis is minimizing these time intervals while keeping the transforming function as simple as possible. System developers should give this task high priority.
Practical solution
For a cube to use the denormalized data most effectively, the warehouse structure must meet the following conditions:
- The database consists of dimension tables and fact tables.
- A fact is a computed value of the multidimensional function (the cube), a qualitative characteristic of the system.
- Facts can themselves serve as dimensions, and vice versa.
- The tables must be linked so that there is at least one algorithmic way to obtain a fact from the specified dimensions.
In practice these four conditions are satisfied by the “snowflake” table topology:
Figure 1: an example of a practical implementation of the “snowflake” warehouse database
Here Value denotes the values of dimensions and facts.
In practice, the denormalizing transformation is written as stored procedures and executed by JOB-like database processes at the specified time intervals.
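The JOB-like periodic execution can be sketched with the standard library. Here `refresh_warehouse` is a stand-in for the stored-procedure call described above; the interval and run count are illustrative, and a real deployment would use the DBMS’s own job scheduler instead.

```python
import sched
import time

run_log = []  # timestamps of completed refresh runs

def refresh_warehouse():
    # Stand-in for invoking the denormalizing stored procedure.
    run_log.append(time.monotonic())

def schedule_refresh(scheduler, interval_s, runs_left):
    # Run the transformation, then re-arm the timer for the next run.
    refresh_warehouse()
    if runs_left > 1:
        scheduler.enter(interval_s, 1, schedule_refresh,
                        (scheduler, interval_s, runs_left - 1))

scheduler = sched.scheduler(time.monotonic, time.sleep)
scheduler.enter(0, 1, schedule_refresh, (scheduler, 0.01, 3))
scheduler.run()  # blocks until all scheduled runs finish
```

The interval passed to `schedule_refresh` is exactly the upper bound on data obsolescence discussed earlier: no report can be staler than one interval.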
Static report templates differ from dynamic ones only in how they are obtained. In the dynamic variant the user builds the template with a special program, whereas static templates cannot be edited by the user and must be picked manually from a fixed list.
Templates can be parameterized. In that case the parameter values are passed to the cube through a special query-language interface.
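Parameterization can be sketched against the warehouse from the earlier example. The table name, template text, and parameter are all hypothetical; the point is that the user-supplied value is bound as a query parameter rather than spliced into the template text.

```python
import sqlite3

# A toy warehouse fact table to run the report against.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE fact_orders (month TEXT, region TEXT, order_count INTEGER);
    INSERT INTO fact_orders VALUES ('2023-01', 'North', 2),
                                   ('2023-01', 'South', 1),
                                   ('2023-02', 'South', 1);
""")

# A parameterized report template: ':month' is filled in at run time.
REPORT_TEMPLATE = """
    SELECT region, SUM(order_count) FROM fact_orders
    WHERE month = :month GROUP BY region ORDER BY region
"""

def run_report(month):
    # Bind the user's parameter value and return the report rows.
    return db.execute(REPORT_TEMPLATE, {"month": month}).fetchall()
```

The same template thus serves any month the user picks, which is the practical payoff of parameterized templates over a fixed static list.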
Examples of industrial solutions to the analysis and reporting task are Microsoft Analysis Services and Microsoft Reporting Services.