Greetings.
When you build, or even just maintain, an existing data warehouse, there inevitably comes a stage when the multitude of user wishes collides with the physical limitations of the DBMS used for storage. After all, nobody has infinite disk space, unlimited processor power, or an arbitrarily long window for refreshing the data.
At that point management starts asking questions, if it has not already: what exactly takes up so much space in the database, why has the load still not finished, and so on.
To know what to answer, you have to keep some accounts. Building a data warehouse is a long process; the people who designed its architecture may be long gone, not to mention that business requirements sometimes change as fast as new versions of the Firefox browser come out.
When my customer came to me asking how much more data we could store and process in the current DWH, all I could say was that I would be able to answer only after calculating everything first, that is, after collecting data warehouse metrics.
I tried to put together a set of basic metrics that should be calculated first, and from which derived metrics can be obtained that, in turn, could be used for forecasting.
Once this set was ready, I decided to google the topic and found an article by Bill Inmon with a list of metrics that would be worth collecting in most data warehouses.
Start of translation
Everything has metrics. Roads have speed limits, people have weight, days have a temperature, cars have tachometers.
Metrics help us organize our reasoning and make meaningful comparisons. Even without referring to metrics as such, we compare events and the conditions under which they occur. Data warehouses are no exception. We need measurable characteristics if we are going to compare the data warehouses of different companies. It is part of human nature to have criteria by which you can describe yourself or your work and compare them with the achievements of others.
Given the above, the list of metrics for the data warehouse system could be as follows:
Size
- In the DWH tables.
- In the DWH indexes.
- The volume of the transformation area (temporary tables and/or spool).
ETL
- The number of ETL programs.
- Their launch frequency.
- The amount of data processed by ETL.
- Where the ETL programs are executed.
Data marts
- The number of data marts in the DWH.
- The amount of data in the data marts.
- How often the data marts are refreshed.
Detailed DWH data
- Data access speed (bytes per second).
- The size of a physical record.
DWH structures
- The number of tables.
- The number of rows in each table.
- The average row size for each table.
Data exported from the DWH
- The number of rows per table leaving the DWH.
- The export frequency for each table.
- The export criteria.
Queries to the DWH
- The number of queries processed per day.
- The average amount of data per query.
History
- The period of time for which the DWH keeps history.
Other
- Use of data mining.
- Adaptive data marts.
- Decision support (DSS) applications.
- ERP applications.
- ODS.
- Use of tape or other long-term storage.
- An archive repository.
These are the kinds of metrics you could use to measure your data warehouse.
Of course, this list can be expanded and changed in many ways. One possible improvement is to add a time dimension to these metrics, since it is better to track their values over a period of time. Then, in addition to the size metric itself, you can also calculate, for example, how much the size has changed.
Or you can break some tables down by company department. It often makes sense to group data together even if such a grouping does not exist in the physical database tables. Of course, you can also show which DBMS is used for which tables, for example:
- Table ABC - Teradata
- Table DEF - IBM UDB
- Etc.
So who can use DWH metrics?
- Database administrators.
- Database architects.
- Database developers.
- DWH users.
- Contracted (external) administrators.
- System programmers.
- And many others ...
End of translation
I chose four groups of metrics, reasoning that such a set would cover the metric-collection needs specific to our multi-tenant data warehouse (Multitenant DWH) project. These groups are:
- Metrics for the number of rows in the DWH tables.
- Metrics for the space (number of bytes) occupied by each schema. The warehouse data is spread across several Oracle schemas.
- Metrics for the data load (ETL) process.
- Metrics for the largest fact table in the DWH.
Metrics from each group were collected for a specific date and for each individual owner, using PL/SQL procedures that ran after every loading process.
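For context, here is a minimal sketch of how such per-date, per-owner measurements could be stored; the table name and columns are hypothetical, not the project's actual DDL:
-- Hypothetical storage for the collected metrics: one row per collection date,
-- data owner, metric group and metric name.
create table dwh_metric_values (
  metric_date   date          not null,  -- date the metric was collected
  owner_id      number        not null,  -- data owner (tenant)
  metric_group  varchar2(30)  not null,  -- e.g. 'ROWS', 'BYTES', 'ETL', 'FACT'
  metric_name   varchar2(200) not null,  -- e.g. table name or mapping name
  metric_value  number        not null
);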
Let's take a closer look at each group.
Row count metrics
This group collects information about how many rows each owner had in the DWH schemas at every point in time when the metrics were calculated.
These indicators also make it possible to see the ratio between the different DWH layers: how many tables hold detailed data, data marts, or the staging area; when a new table appeared; or how many dimension (reference) tables there are, on average, per fact table.
They also let you compare, in numbers, how much each owner contributes to every specific table.
The calculation was done in the standard way: generating and running SQL code in a loop over all tables from a given list of schemas.
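A minimal sketch of that approach, assuming the hypothetical dwh_metric_values table above; the schema names are placeholders, and per-owner counts would additionally filter on each table's tenant column:
declare
  v_cnt number;
begin
  for t in (select owner, table_name
              from all_tables
             where owner in ('STG', 'DWH', 'DM'))  -- placeholder schema list
  loop
    -- generate and run a row count for every table
    execute immediate 'select count(*) from ' || t.owner || '.' || t.table_name
      into v_cnt;
    insert into dwh_metric_values (metric_date, owner_id, metric_group, metric_name, metric_value)
    values (trunc(sysdate), 0, 'ROWS', t.owner || '.' || t.table_name, v_cnt);  -- 0 = "all owners"
  end loop;
  commit;
end;
/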
Volume metrics
For the volume metrics of the DWH tables, I decided to count only the space occupied by the data in the schemas, as reported by the DBMS:
SELECT owner, to_date(v_cur_date, 'RRRR-MM-DD'), sum(bytes)
  from dba_segments
 where owner in (' ')
 group by owner
In addition to this metric, I also want to add a calculation of the space the data actually occupies on disk. By comparing the two indicators you can obtain, for example, a ratio showing how efficiently ETL uses disk space.
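One possible way to get that second indicator is to compare the space allocated to segments with the physical size of the data files, per tablespace (both dictionary views require elevated privileges). This is a sketch, not the query used in the project:
select df.tablespace_name,
       round(sum(df.bytes) / 1024 / 1024 / 1024, 2)  as datafile_gb,
       round(seg.used_bytes / 1024 / 1024 / 1024, 2) as segment_gb
  from dba_data_files df
  join (select tablespace_name, sum(bytes) as used_bytes
          from dba_segments
         group by tablespace_name) seg
    on seg.tablespace_name = df.tablespace_name
 group by df.tablespace_name, seg.used_bytes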
ETL metrics
The ETL tool used for loading is Oracle Warehouse Builder. The script below collects execution statistics for a given loading process, aggregating the number of processed rows from its subordinate mappings.
select ae.top_level_execution_audit_id, ae.execution_name, ae.created_on, ae.elapse_time,
       dt.Selected, dt.Inserted, dt.Updated, dt.Deleted
  from owbsys.All_Rt_Audit_Executions ae,
       (select sum(coalesce(au.number_records_selected, 0)) as Selected,
               sum(coalesce(au.number_records_inserted, 0)) as Inserted,
               sum(coalesce(au.number_records_updated, 0))  as Updated,
               sum(coalesce(au.number_records_deleted, 0))  as Deleted
          from OWBSYS.ALL_RT_AUDIT_MAP_RUNS au
         where au.execution_audit_id in (select execution_audit_id
                                           from owbsys.All_Rt_Audit_Executions ae
                                          where ae.top_level_execution_audit_id = < >)
       ) dt
 where ae.top_level_execution_audit_id = < >
If you wish to monitor the performance of each individual mapping, you can expand the set of data collected.
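For example, a per-mapping breakdown of the same loading process could look roughly like this; it assumes the MAP_NAME and ELAPSE_TIME columns are available in ALL_RT_AUDIT_MAP_RUNS:
-- Per-mapping statistics for one top-level loading process (sketch)
select au.map_name,
       au.elapse_time,
       coalesce(au.number_records_selected, 0) as selected,
       coalesce(au.number_records_inserted, 0) as inserted,
       coalesce(au.number_records_updated, 0)  as updated,
       coalesce(au.number_records_deleted, 0)  as deleted
  from owbsys.all_rt_audit_map_runs au
 where au.execution_audit_id in (select execution_audit_id
                                   from owbsys.all_rt_audit_executions ae
                                  where ae.top_level_execution_audit_id = < >)
 order by au.elapse_time desc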
Metrics of the largest table
The largest fact table consists of transactions recorded once a month for every active employee of an owner. The following script:
select p_Customer_Num, t.month_start_date,
       count(distinct f.Person_Id) as uniq,
       count(f.Person_Id)          as t_cnt
  from dm.time_dimension t
  left outer join dm.trans_fact f
    on f.time_id = t.id
   and f.customer = p_Customer_Num
 where month_start_date between p_Start_Date and p_End_Date
 group by month_start_date;
allows you to collect, for each month, the number of rows per data owner plus the number of unique employees.
This metric is the anchor point around which the remaining metrics can be linked in our BI tool, producing the following derived indicators (a rough calculation sketch follows the list):
- The average space cost (in GB) of adding a new owner, per 1000 employees.
- The average time cost (in hours) of adding a new owner, per 1000 employees.
- The average DBMS row-processing cost (in millions of rows) of adding a new owner, per 1000 employees.
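As an illustration, the first of these indicators could be derived roughly as follows, assuming the hypothetical dwh_metric_values table above and an equally hypothetical dwh_owner_employees(owner_id, metric_date, employee_cnt) table filled from the fact-table query:
-- GB of warehouse space per 1000 employees, per owner and collection date (sketch)
select m.owner_id,
       m.metric_date,
       round(sum(m.metric_value) / 1024 / 1024 / 1024
             / (e.employee_cnt / 1000), 3) as gb_per_1000_employees
  from dwh_metric_values m
  join dwh_owner_employees e
    on e.owner_id    = m.owner_id
   and e.metric_date = m.metric_date
 where m.metric_group = 'BYTES'
 group by m.owner_id, m.metric_date, e.employee_cnt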
You can also build plenty of nice graphs showing the average daily loading time, how that time depends on the number of owners processed, and so on, and compare these metrics with each other after ETL changes or after DBMS improvements or degradations. This helps you evaluate the results of improvements to the DWH-building process more objectively and answer the questions I started this article with.
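As a sketch of the kind of time series behind such graphs, built directly on the OWB audit views, one could aggregate the daily elapsed time of top-level runs; this assumes that top-level runs can be identified by execution_audit_id = top_level_execution_audit_id and that elapse_time is reported in seconds:
select trunc(ae.created_on)                 as load_date,
       count(*)                             as runs,
       round(avg(ae.elapse_time) / 3600, 2) as avg_hours
  from owbsys.all_rt_audit_executions ae
 where ae.execution_audit_id = ae.top_level_execution_audit_id  -- top-level processes only
 group by trunc(ae.created_on)
 order by load_date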
