Overclock the median in OLAP

This post is for those who faced a performance problem when calculating the median in the OLAP cube.
One of the main advantages of OLAP technology is the speed of obtaining results when accessing the database. Calculations occur on the fly. However, with the median, not everything is so simple.
For reference: the median is a type of medium. This is the value that is in the middle of a series of values sorted in ascending order. For example, for a series of values {1, 2, 5, 6, 9} the median is 5.

Consider the situation on the example of the OLAP server from Microsoft - SSAS 2008 (SQL Server Analysis Services).
To calculate the median, SSAS suggests using the MDX function of Median. With it, you can create a calculated measure (Calculated Member) and use it in calculations.

We were faced with the task of designing an OLAP cube for analyzing data on job vacancies obtained from various sources. The total number of vacancies was about 10 million. We implemented the median to determine the average salary using the Median function. No calculation "on the fly" when working with a cube did not work. However, other aggregations, for example, the number of vacancies were considered quickly.
The problem is that such aggregations as a quantity or sum are “pre-aggregated” - when the cube is updated, they are calculated in advance, and when the data is requested, the server returns ready-made results. In the case of a median, the server does not calculate its values in advance, but calculates it each time the cube is accessed.
')
Consider a sample report:

For the specialty "Developer 1C" in 2011, 8,354 vacancies were found. To calculate the salary median for this specialty, the server needs to perform the following operations: selecting all the values related to this cell of the report, sorting them by the value of the salary, determining the value of the row of values (tuple) in the middle. And so for each cell. Therefore, the execution time of the report greatly increases. Most of the time it goes on sorting.

Decision

The values for calculating the median are selected from the fact table of the data warehouse, on the basis of which the OLAP cube was created. What if we assume that the fact table will be sorted in advance by the value of the salary. Then we do not need to sort the values for each cell. You just need to determine for each cell the number of elements and calculate the number of the element that is in the middle of the row. The value of this element will be the median.

Code to create Calculate Member:

//    CREATE MEMBER CURRENTCUBE.[measures].[AdvCount] AS Count(NonEmpty([Advertisement].[ID].members,[Measures].[Salary]) as AdvSet), VISIBLE = 0 ; //     CREATE MEMBER CURRENTCUBE.[Measures].[MedianReal] AS ([measures].[AdvCount]-1) * 50 / 100, VISIBLE = 0; CREATE MEMBER CURRENTCUBE.[Measures].[MedianInt] AS Int([Measures].[MedianReal]), VISIBLE = 0; CREATE MEMBER CURRENTCUBE.[Measures].[MedianFrac] AS [Measures].[MedianReal]- [Measures].[MedianInt], VISIBLE = 0; //  CREATE MEMBER CURRENTCUBE.[Measures].[MedianLow] AS (NonEmpty([Advertisement].[ID].members,[Measures].[Salary]).Item([Measures].[MedianInt]).Item(0),[Measures].[Salary]), VISIBLE = 0; CREATE MEMBER CURRENTCUBE.[Measures].[MedianHigh] AS (NonEmpty([Advertisement].[ID].members,[Measures].[Salary]).Item([Measures].[MedianInt] + 1).Item(0),[Measures].[Salary]), VISIBLE = 0; CREATE MEMBER CURRENTCUBE.[Measures].[Salary Median] AS ([Measures].[MedianLow] * [Measures].[MedianFrac]) +([Measures].[MedianHigh] * (1 - [Measures].[MedianFrac])), FORMAT_STRING = "# ### ### ##0;-# ### ### ##0", VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Advertisement';

This code takes into account the situation when the tuple contains an even number of elements. In this case, the median is calculated as the arithmetic average of two values located in the middle of the series. If your task does not require absolute accuracy, then in this case you can assume the median is the left or right value. To do this, you have to slightly change the above code, but this will further reduce the calculation time.

Now how to sort out the fact table in advance. Suppose you have a source fact table, in which data is accumulated over time. Make a copy of this table and insert there the data from the source table sorted by the desired value.

Sample SQL query:

 INSERT INTO CopyBasicTable SELECT * FROM BasicTable ORDER BY ValueField

This operation will need to be done every time before updating the OLAP cube. Of course, in this method there is a serious minus - with a large amount of data, the operation time will be significant. However, this method is quite suitable for relatively small volumes. In a similar way, it is possible to calculate percentiles and quartiles.

Source: https://habr.com/ru/post/153869/

All Articles

Overclock the median in OLAP

Decision

More articles: