Create an OLAP cube. Part 1

Continuing the theme of Multidimensional Cubes, OLAP and MDX and olap for a small company , traditionally, I suggest starting with a simple “Hello World” cube that will analyze the processes and trends of voting on Habré.

So let's try to create our first OLAP system.
But, before, rubbing your hands, run Business Intelligence Studio, I suggest first creating a data storage for habr-voices, the so-called Data Warehouse .
What for? There are several reasons for this:

the very essence of Data Warehouse is to store “cleaned” data, ready for analysis, so even its original structure can be very different from the structure of our Habr-OLTP database
in HabraDW (as we will call it), we will carry out only the information that we need for analysis, nothing superfluous
Data Warehouse does not impose normalization requirements. On the contrary, by denormalizing some data, a clearer scheme for building a cube can be achieved, as well as a speed of loading data into a cube.

A bit of theory.

In essence, Data Warehouse can be:

purely virtual (for example, defined as a set of SELECTs or even calls to complex stored procedures that somehow define the input to the cube)
quite real, that is, to exist physically on some server (or servers)

In the latter case, you most likely want to implement ETL processes (using Integration Services or something else), but this is a reason for another, no less interesting, article.
')

What should be Data Warehouse?

It's very simple - your Data Warehouse should have a star- shape structure or a snowflake model and consist of facts (facts) and dimensions.

Facts are actual records (records) about some process that we want to analyze, for example, the voting process on Habr, or the process of changing the price of goods on the exchange. Very often, the facts contain some numerical data, for example, the actual value of the voice or price.

Dimensions are the defining attributes of facts, and usually answer all kinds of questions: when did the fact happen, over or with what exactly, who was the object or subject, etc. In general, measurements have a more descriptive (i.e. textual) character, for example, user name or month name, since it will be much easier for the end user to perceive the results described by the text (for example, the name of the month) rather than numbers (the month number of the year).

Determining where we have the facts, and where the measurements - it is very easy to build a star model.

Star.

In the center we indicate our fact table, and with the rays we derive measurements.

And now the snowflake.

A snowflake is the same star , only measurements can depend on measurements of the next level, and those in turn can include more levels.

Each of these models has its own advantages and disadvantages, and the choice of the model itself should be based on the requirements for cube design, data loading speed, disk space, etc.
Naturally, the final Data Warehouse is usually much more complicated and consists of several stars or snowflakes that can share common measurements.

HabraDW.

Let's move on to the actual development of our Data Warehouse.

Our goal is to analyze the tendencies of voting on Habré, finding patterns and trends.
The main trends that we want to identify:

what time of year / month / week votes better / worse / more often
how to vote on Fridays and Mondays (for example)
How does Microsoft affect the result of a vote, or Karma
average activity of users, "peaks" of voting
etc.

For clarity, our first model will be absolutely simple - we will include only what is relevant to the voting and exclude all unnecessary, including the time of user registration and the fact of who posted the article, as well as the voting time (only date) and other attributes (all these data can be included in the following articles and try to analyze more complex things).

As a result, we have the following tables:

Fact table FactHabravote - determines who, when, for what and how exactly voted. The value of Vote in our case will be + - 1, but the field type allows you to expand the delta with votes, for example, + - 10
Time Measurement DimTime - determines the time attributes (values and names) necessary for analysis
DimUser user dimension - defines Habr users, while only a nickname
Measurement of posts DimPost - defines posts, in our case contains a header and boolean fields that determine whether the post contains the words Microsoft and Karma.

The final scheme of our star will be like this.

And here is the original SQL script that creates and fills (so far only random data) our repository.

Well, now everything is ready to load the data into the cube.
See you in the next article.

Source: https://habr.com/ru/post/67272/

All Articles