Introduction to Multivariate Analysis

Some time ago I had the opportunity to organize a new development group, which was supposed to be engaged in the development of OLAP and BI products in a friendly software company. And since the group was assembled from fresh university graduates, I had to write a “short course for a young fighter” in order to give the most accessible initial concepts about OLAP to people who had never encountered him, but already had programming experience and working with DB .

I am posting now this Introduction to Public Domain.

The article somewhat confuses the notion of OLAP, Business Intelligence, and Data Warehouse, but in life it is often difficult to understand where the border lies. And in real projects, and even more so, they all walk side by side. Therefore, please do not judge strictly.
')

Introduction

Information systems of a serious enterprise, as a rule, contain applications designed for complex analysis of data, their dynamics, trends, etc. Accordingly, top management becomes the main consumers of the analysis results. Such an analysis, ultimately, is intended to facilitate decision-making. And in order to make any management decision, it is necessary to have the necessary information, usually quantitative. To do this, it is necessary to collect these data from all information systems of the enterprise, lead to a common format and then analyze it. To do this, create a data warehouse (Data Warehouses).

What is a data warehouse?

Usually - a place to collect all the information of analytical value. The requirements for such repositories are consistent with the classic OLAP definition, and will be explained below.

Sometimes the Warehouse has another goal - the integration of all enterprise data, to maintain the integrity and relevance of information within all information systems. So the repository accumulates not only analytical, but almost all information, and can issue it in the form of directories back to the rest of the systems.

A typical data warehouse is usually different from a regular relational database. First, regular databases are designed to help users do their daily work, while data warehouses are for decision-making. For example, the sale of goods and invoicing are made using a database designed for processing transactions, and analyzing the dynamics of sales over several years, allowing you to plan work with suppliers, using a data warehouse.

Secondly, the usual databases are subject to constant changes in the process of users, and the data warehouse is relatively stable: the data in it is usually updated according to a schedule (for example, weekly, daily or hourly - depending on needs). Ideally, the replenishment process is simply the addition of new data for a certain period of time without changing the old information already in the repository.

And, thirdly, conventional databases are most often the source of data that goes to the repository. In addition, storage may be replenished by external sources, such as statistical reports.

How to build a repository?

ETL - basic concept: Extraction, Transformation, Loading. Three stages:

Extraction - extraction of data from external sources in a clear format;
Transformation - the transformation of the structure of the source data into structures that are convenient for building an analytical system;
Download - download data to the repository.

Add another step - data cleaning ( Cleaning ) - the process of eliminating irrelevant or correcting erroneous data on the basis of statistical or expert methods. In order not to generate reports of the type “Sales for the year 20011”.

Let's return to the analysis.

What is analysis and what is it for?

Analysis - research data for decision making. Analytical systems are called - decision support systems ( DSS ).

Here it is worth pointing out the difference between working with DSS and a simple set of regulated and unregulated reports. The analysis in the DSS is almost always interactive and iterative. Those. The analyst delves into the data, compiling and correcting analytical queries, and receives reports whose structure may not be known in advance. We will return to this in more detail below when we discuss the MDX query language.

OLAP

Decision support systems usually have the means to provide the user with aggregate data for different samples from the initial set in a convenient form for perception and analysis (tables, charts, etc.). The traditional approach to segmentation of source data uses the selection from the source data of one or more multidimensional data sets (often called the hypercube or metacube), whose axes contain attributes, and the cells contain aggregated quantitative data. (Moreover, such data can be stored in relational tables, but in this case we are talking about the logical organization of data, and not about the physical implementation of their storage.) Along each axis, attributes can be organized as hierarchies representing different levels of detail. Thanks to this data model, users can formulate complex queries, generate reports, and receive subsets of data.

The technology of complex multidimensional data analysis is called OLAP (On-Line Analytical Processing). OLAP is a key component in the organization of traditional data warehouses. The concept of OLAP was described in 1993 by Edgar Codd , a well-known database researcher and author of the relational data model. In 1995, based on the requirements set forth by Codd, the so-called FASMI (Fast Analysis of Shared Multidimensional Information) test was formulated, including the following requirements for applications for multidimensional analysis:

providing the user with the results of the analysis in a reasonable time (usually not more than 5 s), even if at the cost of a less detailed analysis
the possibility of carrying out any logical and statistical analysis characteristic of this application, and its preservation in a form accessible to the end user;
multi-user data access with the support of appropriate locking mechanisms and authorized access tools;
a multidimensional conceptual view of data, including full support for hierarchies and multiple hierarchies (this is a key OLAP requirement);
the ability to access any necessary information regardless of its volume and storage location.

It should be noted that the OLAP-functionality can be implemented in various ways, starting with the simplest data analysis tools in office applications and ending with distributed analytical systems based on server products. Those. OLAP is not a technology, but an ideology .

Before talking about the different implementations of OLAP, let's take a closer look at what cubes are from a logical point of view.

Multidimensional concepts

We will use the Northwind database, included in the Microsoft SQL Server delivery kits and illustrating a typical database of wholesale food supply companies, to illustrate OLAP principles. Such data include information about suppliers, customers, a list of goods supplied and their categories, data on orders and ordered goods, a list of company employees.

Cube

Take for example the table Invoices1, which contains the orders of the company. The fields in this table will be as follows:

Order date
A country
City
Customer name
Delivery Company
Product Name
Quantity of goods
Order price

What aggregate data can we get based on this view? Usually these are answers to questions like:

What is the total value of orders placed by customers from a particular country?
What is the total value of orders placed by customers from a particular country and delivered by a particular company?
What is the total value of orders placed by customers from a particular country in a given year and delivered by a particular company?

All this data can be obtained from this table with quite obvious SQL queries with grouping.

The result of this query will always be a column of numbers and a list of attributes describing it (for example, a country) - this is a one-dimensional data set or, in mathematical terms, a vector.

Let us imagine that we need to get information on the total value of orders from all countries and their distribution among delivery companies - we will get a table (matrix) from numbers, where suppliers will be listed in column headings, countries will be listed in row headings, and amount of orders. This is a two-dimensional data array. Such a dataset is called a pivot table or cross table.

If we want to get the same data, but still in the context of years, then another change will appear, i.e. the data set will become three-dimensional (conditional third-order tensor or three-dimensional "cube").

Obviously, the maximum number of measurements is the number of all attributes (Date, Country, Customer, etc.) describing our aggregated data (the sum of orders, the quantity of goods, etc.).

So we come to the concept of multidimensionality and its embodiment - a multidimensional cube . Such a table will be called “the table of facts ”. Dimensions or Cube Axes ( dimensions ) are attributes whose coordinates are expressed by the individual values of these attributes that are present in the fact table. Those. For example, if information about orders was conducted in the system from 2003 to 2010, then this axis of the years will consist of 8 corresponding points. If orders come from three countries, the countries axis will contain 3 points, etc. No matter how many countries are listed in the Country Directory. The points on the axis are called its “members” ( Members ).

The aggregated data themselves in this case will be called “measures” ( Measure ). To avoid confusion with the “dimensions”, it is preferable to call the latter “axes”. The set of measures forms another axis “Measures” ( Measures ). There are as many members (points) as measures (aggregated columns) in the fact table.

Members of dimensions or axes can be combined by one or several hierarchies. What is a hierarchy, let us explain with an example: cities from orders can be combined into districts, districts in regions, regions of a country, countries into continents or other formations. Those. there is a hierarchical structure - continent- country-region-district-city - 5 levels ( Level ). For the district, the data are aggregated for all cities that are included in it. For the region in all districts that contain all the cities, etc. Why do you need several hierarchies? For example, we may want to group points (ie, days) according to the Year-Month-Day hierarchy or Year-Week-Day along the axis with the date of the order: in both cases, there are three levels. Obviously, the Week and the Month group the days differently. There are also hierarchies, the number of levels in which is not deterministic and depends on the data. For example, folders on a computer disk.

Data aggregation can occur using several standard functions: sum, minimum, maximum, average, quantity.

Mdx

Let us turn to the language of queries in multidimensional data.
The SQL language was originally designed not for programmers, but for analysts (and therefore has a syntax that resembles natural language). But over time, he became more and more complicated and now few analysts know how to use it well, if at all. He became a tool for programmers. The MDX query language, which was rumored by our former compatriot Moisha (or Moshe) Posumansky ( Mosha Pasumansky ) in the wilds of Microsoft, was also originally intended for analysts, but its concepts and syntax (which is a bit like SQL, and for nothing, so. to. it only confuses), even more complicated than SQL. Nevertheless, its basics are still easy to understand.

We will consider it in detail because it is the only language that received standard status within the framework of the general XMLA protocol standard , and secondly, because its open-source implementation exists in the form of a Mondrian project from Pentaho . Other OLAP analysis systems (for example, Oracle OLAP Option) usually use their SQL language syntax extensions, however, they declare support for MDX.

Working with analytical data arrays only means reading them and does not imply writing. So in MDX, there are no suggestions for changing data, but there is only one selection clause — select.

In OLAP, multi-dimensional cubes can be sliced — i.e. when data is filtered along one or several axes, or projections — when along one or several axes the cube “collapses”, aggregating the data. For example, our first example with the sum of orders from countries is the projection of the cube onto the axis of the Country. The mdx request for this case will look like this:

select [Territory].[Cities by Countries].[All].Children on rows from [invoices1]

What is what?

Select is a key word and is included in the syntax solely for beauty.
[Territory] is the name of the axis. All proper names in MDX are written in square brackets.
[Cities by Countries] is the name of the hierarchy. In our case, this is the Country-City hierarchy.
[All] is the name of the axis member at the first level of the hierarchy (ie, country). All is a meta member that unites all members of the axis. There is such a meta member in each axis. For example, in the axis of the years there is “All years”, etc.
Children is a member function. Each member has several features available. Such as Parent. Level, Hierarchy, returning, respectively, the ancestor, the level in the hierarchy and the hierarchy itself, to which the member belongs in this case. Children - returns a set of descendant members of the member. Those. in our case, the country.
on rows - Indicates how to arrange this data in the summary table. In this case, in the header lines. Possible values here: on columns, on pages, on paragraphs, etc. It is also possible to indicate simply by index, starting from 0.
from [invoices1] is an indication of the cube from which to fetch.

What if we do not need all the countries, but only need a couple of specific ones? To do this, you can explicitly indicate in the request those countries that we need, and not select all with the Children function.

 select { [Territory].[Cities by Countries].[All].[Russia], [Territory].[Cities by Countries].[All].[Ukrain] } on rows from [invoices1]

In this case, curly braces are set declarations . A set is a list, a list of members from one axis .

Now we will write a request for our second example - output in the context of the deliveryman:

 select [Territory].[Cities by Countries].[All].Children on rows [Shipper].Members on columns from [invoices1]

Here is added:
[Shipper] - axis;
.Members is an axis function that returns all members on it. The hierarchy and the level have the same function. Because in this axis, the hierarchy is one, then its indication can be omitted, since the level and hierarchy are also the same, then all members can be displayed in one list.

I think it is already clear how to proceed with this in our third example, with details by year. But let's better not detail by year, and filter - i.e. build a slice. To do this, we write the following query:

 select [Territory.Cities by Countries].[All].Children on rows [Shipper].Members on columns from [invoices1] where ([Date].[2007])

And where is the filtering?

where is the keyword
[2007] is one member of the [Date] hierarchy. The full name for all the terms would be: [Date.By months]. [All dates]. [2007] but the name of this member is unique within the axis, then all intermediate name qualifications can be omitted.

Why is the date term in brackets? Parentheses are a tuple. A tuple is one or more coordinates along different axes. For example, for filtering along two axes in parentheses at once, we will list two terms from different dimensions, separated by commas. That is, a tuple defines a “slice” of a cube (or “filtering,” if such terminology is closer).

The tuple is not only used for filtering. Tuples can also be in row / column / page headers, etc.

This is necessary, for example, in order to display the result of a three-dimensional query in a two-dimensional table.

 select crossjoin( [Territory].[Cities by Countries].[All].Children, [Date.By months].[All dates].Children ) on rows [Shipper].Members on columns from [invoices1] where ([Date].[2007])

Crossjoin is a feature. It returns a set of tuples (yes, a set can contain tuples!) Resulting from the Cartesian product of two sets. Those. the result set will contain all possible combinations of Countries and Years. The row headers will thus contain a couple of values: Country-Year .

The question is, where is the indication of which numerical characteristics should be output? In this case, the default measure specified for this cube is used, i.e. Order price. If we want to derive another measure, then we recall that measures are members of the Measures dimension. And we act in the same way as with the other axes. Those. Filtering a query by one of the measures will display exactly this measure in the cells.

The question is: how does filtering in where differ from filtering by specifying axis members in on rows. Answer: practically nothing. Just in where a slice is specified for those axes that do not participate in the formation of headers. Those. the same axis cannot be simultaneously present in on rows , and in where .

Calculated members

For more complex queries, you can declare calculated members. Members of both attribute axes and measure axes. Those. You can declare, for example, a new measure that will display the contribution of each country to the total amount of orders:

 with member [Measures].[Part] as '[Territory].CurrentMember / [Territory].[Cities by Countries].[All]', FORMAT_STRING='0.00%' select [Territory].[Cities by Countries].[All].Children on rows from [invoices1] where [Measures].[Part]

The calculation takes place in the context of a cell, which has all its known coordinate attributes. The corresponding coordinates (members) can be obtained by the CurrentMember function on each of the cube axes. Here we must understand that the expression [Territory] .CurrentMember / [Territory]. [Cities by Countries]. [All] 'does not divide one member by another, but divides the corresponding aggregated data of cube slices! Those. the cut in the current territory is divided into a cut in all territories, i.e. total value of all orders. FORMAT_STRING - sets the format for outputting values, i.e. %

Another example of a calculated member, but already along the axis of the years:

 with member [Date].[2007 and 2006 difference] as '[Date].[2007] - [Date].[2006]'

Obviously, the report will contain not the unit, but the difference of the corresponding cuts, i.e. the difference in the amount of orders in these two years.

Display in ROLAP

OLAP systems are somehow based on some kind of data storage and organization system. When it comes to RDBMS, they talk about ROLAP (we will keep MOLAP and HOLAP for self-study). ROLAP - OLAP on a relational database, i.e. described in the form of ordinary two-dimensional tables. ROLAP systems convert MDX queries to SQL. The main computational problem for the database is fast aggregation. To aggregate faster, the data in the database is usually strongly denormalized, i.e. stored not very efficient in terms of disk space and database integrity monitoring. Plus, additionally contain auxiliary tables that store partially aggregated data. Therefore, for OLAP, a separate database schema is usually created, which only partially repeats the structure of the original transactional databases in the reference part.

Navigation

Many OLAP systems offer tools for interactive navigation of an already formed query (and, accordingly, selected data). This uses the so-called "drilling" or "drilling" (drill). A more adequate translation into Russian would be the word "deepening". But this is a matter of taste., The word "dilling" has stuck in some environments.

Drill is a report drill down by reducing the degree of data aggregation combined with filtering along some other axis (or several axes). Drilling can be of several types:

drill-down - filtering by one of the original report axes with output of detailed information by descendants within the hierarchy of the selected filtering member. For example, if there is a report on the distribution of orders in the context of Countries and Years, then when you click on the year 2007, a report will be displayed in the context of the same Countries and months of 2007.
drill-aside - filtering under one or several selected axes and removing aggregation along one or several other axes. For example, if there is a report on the distribution of orders in the context of Countries and Years, then when clicking on 2007, another report will appear in the context of, for example, the Countries and Suppliers with filtration by 2007.
drill-trough — removing aggregations across all axes and simultaneously filtering them — allows you to see the raw data from the fact table from which the value is obtained in the report. Those. when you click on the cell value, a report is displayed with all the orders that gave this amount. Such instant drilling in the very "depths" of the cube.

That's all. Now, if you decide to devote yourself to Business Intelligence and OLAP, it's time to start reading serious literature.

Source: https://habr.com/ru/post/126810/

All Articles