The quality of the data in a warehouse is a key prerequisite for obtaining valuable information. Poor quality sets off a negative chain reaction in the long run.
First, trust in the information provided is lost. People start using the business intelligence applications less, and the potential of those applications goes unclaimed.
As a result, further investment in the analytical project is called into question.
Responsibility for data quality
Improving data quality is critically important in BI projects. However, it is not the exclusive domain of technical specialists.
Data quality is also influenced by aspects such as:
Corporate culture
- Are the employees themselves interested in producing good quality data?
- If not, why not? There may be a conflict of interest.
- Are there corporate rules that define who is responsible for quality?
Processes
- What data is produced at the end of the process chains?
- Perhaps the operational systems are configured in such a way that users have to resort to workarounds to record a real-world situation.
- Do the operational systems themselves validate and verify the data?
Everyone in the organization is responsible for the quality of data in the reporting systems.
Definition and meaning
Quality is demonstrated customer satisfaction.
Data quality, however, has no universal definition; it always reflects the context of use. A data warehouse and BI system serve purposes different from those of the operational systems from which the data is taken.
For example, in an operational system a customer attribute may be an optional field. In the warehouse, that attribute may be used as a dimension, and filling it becomes mandatory, which in turn introduces the need to fill in default values.
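A minimal sketch of this defaulting step during the load, assuming a hypothetical source table S_CUSTOMER and dimension table DIM_CUSTOMER (both names are illustrative):

-- Fill the mandatory dimension attribute with a default
-- where the optional source field is empty.
INSERT INTO DIM_CUSTOMER (CUSTOMER_ID, REGION)
SELECT CUSTOMER_ID,
       NVL(REGION, 'UNKNOWN')  -- optional in the source, mandatory here
  FROM S_CUSTOMER;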
Requirements in the data warehouse change constantly, and they are usually stricter than those of the operational systems. But it can also be the other way round, when there is no need to keep the full detail of the operational system in the warehouse.
To make data quality measurable, its standards must be described. The people who use the information and figures in their work should be involved in writing the descriptions. The result of this involvement can be a rule that lets you tell, at a glance at a table, whether there is an error or not. Such a rule should then be drawn up as a script or code for subsequent automated verification.
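For instance, a rule such as "every order must reference an existing customer" can be expressed directly as a query; a sketch with hypothetical table names:

-- Number of orders violating the rule; 0 means the rule holds.
SELECT COUNT(*) AS OUTPUT
  FROM FACT_ORDER f
 WHERE NOT EXISTS (SELECT 1
                     FROM DIM_CUSTOMER d
                    WHERE d.CUSTOMER_ID = f.CUSTOMER_ID);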
Data quality improvement
It is impossible to clean up and fix every hypothetical error while loading data into the warehouse. Good data quality can only be achieved through close cooperation of everyone involved. The people who enter data into the operational systems should learn which of their actions lead to errors.
Data quality is a process. Unfortunately, many organizations have no strategy for its continuous improvement. Many limit themselves to storing data and do not use the full potential of their analytical systems. As a rule, 70-80% of a data warehouse budget is spent on implementing data integration, while the process of monitoring and improving quality is left half-finished, if it exists at all.
Tools
Software tools can help automate the process of monitoring and improving data quality. For example, they can fully automate the technical verification of warehouse structures: field formats, the presence of default values, and compliance with table and field naming requirements.
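As a sketch of such a technical check in Oracle, the data dictionary itself can be queried; here the (assumed) convention is that every nullable column of a dimension table must carry a default value:

-- Dimension columns that allow NULLs but have no default defined.
SELECT table_name, column_name
  FROM user_tab_columns
 WHERE table_name LIKE 'DIM\_%' ESCAPE '\'
   AND nullable = 'Y'
   AND default_length IS NULL;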
Checking the contents is harder. As warehouse requirements change, the interpretation of the data may change as well, and the tool itself can turn into a huge project that requires ongoing support.
Tip
Relational databases, on which warehouses are usually built, offer a remarkable facility: views. Views make it possible to check the data quickly if you know the peculiarities of the content. Each discovered error or data problem can be recorded as a query against the database.
In this way, a knowledge base about the contents is built up. Such queries should, of course, be fast. As a rule, maintaining views takes less human time than maintaining checks materialized in tables, and a view is always ready to display the current result of its check.
For important reports, the view can include a column with the addressee of the notification. It also makes sense to use the same BI tools for reporting on the quality of the data in the warehouse.
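A minimal sketch of such an addressee column, built on top of a check view like the one defined in the example below (the e-mail address is purely illustrative):

-- Route only failed checks, together with the person to notify.
CREATE OR REPLACE VIEW V_QC_NOTIFY AS
SELECT RESULT,
       REPORT,
       'dwh-team@example.com' AS ADDRESSEE
  FROM V_QC_DIM_PRODUCT_01
 WHERE RESULT = 'ERROR';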
Example
The query below is written for an Oracle database. In this example, a check returns a numeric value that can be interpreted unambiguously, and T_MIN and T_MAX adjust the alarm thresholds. The REPORT field was once used as the message body in a commercial ETL product that could not send properly formatted e-mails, so the rpad calls are a workaround ("crutch").
For a large table you can add, for example, AND ROWNUM <= 10 to the inner query: if there are already 10 errors, that is enough to raise the alarm.
CREATE OR REPLACE VIEW V_QC_DIM_PRODUCT_01 AS
SELECT CASE
           WHEN OUTPUT >= T_MIN AND OUTPUT <= T_MAX THEN 'OK'
           ELSE 'ERROR'
       END AS RESULT,
       DESCRIPTION,
       TABLE_NAME,
       OUTPUT,
       T_MIN,
       T_MAX,
       -- fixed-width message line for the ETL tool (the "crutch" mentioned above)
       rpad(DESCRIPTION, 60, ' ') || rpad(OUTPUT, 8, ' ') ||
       rpad(T_MIN, 8, ' ') || rpad(T_MAX, 8, ' ') AS REPORT
  FROM (
       -- The inner query is cut off in the source. A plausible body, with
       -- illustrative table and column names, counts violations of one rule:
       SELECT 'Products without product group' AS DESCRIPTION,
              'DIM_PRODUCT'                    AS TABLE_NAME,
              COUNT(*)                         AS OUTPUT,
              0                                AS T_MIN,
              0                                AS T_MAX
         FROM DIM_PRODUCT
        WHERE PRODUCT_GROUP_ID IS NULL
       );
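To run the check, the view is simply queried, for example:

SELECT RESULT, REPORT
  FROM V_QC_DIM_PRODUCT_01;

Each new problem found in the data adds one more view of this shape, and together they form the knowledge base about the contents described above.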
This publication contains material from the book:
Ronald Bachmann, Dr. Guido Kemper. Raus aus der BI-Falle: Wie Business Intelligence zum Erfolg wird.