
Data Warehouse Testing

Published on behalf of IvanovAleksey.



There is little information online about testing a Data Warehouse.
You can find general requirements: data completeness, quality, and so on.
But nowhere is there a description of how to organize the process and which checks can cover these requirements.
In this article I will try to explain how we test the data warehouse at Tinkoff Bank.


Our Data Warehouse


First, I will briefly describe our warehouse.
Data loading and processing are implemented as jobs. Jobs are started by the scheduler, usually at night, when no one is using the external systems (the load on them is minimal) and after the business day has closed.
The database structure is a set of tables of different sizes (from a few rows to 3 billion and more), with different numbers of columns (from 1 to 200), with and without history tracking. The database size is about 35 TB.
We use external data sources: Oracle, CSV/XLS files, MySQL, Informix, BIN files, CDH (Cloudera Hadoop). We also upload data to external systems, through Oracle, SAS tables and CSV/XLS files.
Warehouse objects (descriptions of databases, jobs, tables, views, etc.) are developed in SAS Data Integration Studio and stored on the SAS server as metadata. Physically, the tables live in Greenplum and SAS databases.
To run the jobs, code is generated from the metadata and transferred to the deployment server, after which the jobs can be run by the scheduler.
Changes are transferred between environments as packages consisting of metadata and scripts (to create or modify physical tables and data). To make transfers easier, the developers wrote a dedicated program called "Authors".
There is a web portal for starting jobs manually. It also shows the running schedulers and the status of the jobs they execute.



Test objects


Any DWH change is the creation or modification of the physical structures and metadata of tables and jobs, or a script that corrects already loaded data.
For example, suppose a new data mart is being created. The package will contain the metadata of the new job and its target table, plus a script that creates the physical table in the database.
Thus, the test objects are the modified or created job, the data in its target table, and the data in the target tables of dependent jobs (if any).


Levels and types of testing


Our test environment fully replicates production: the same hardware, the same data of the same volume, loaded and processed by the same processes. Given this, and the fact that tasks are developed when the sources already contain all the necessary data, we can reduce the number of checks without losing quality.
We perform load testing (performance testing) and testing on large data volumes (volume testing) only when the task is transferred to the test environment: we check the job run time, the load on the environment (for example, using Grafana) and the volume of data processed (I will not describe these checks in detail in this article).
At the system level, jobs and data quality are checked automatically. Jobs are monitored by the night scheduler itself and by a script that controls the volumes of loaded data; after loading, the data is checked by the Data Quality Daemon (more on it below). If something is wrong with the data, the responsible people receive emails with the errors.
As for “white box” checks, we only verify that the environment is specified correctly (there have been errors with hardcoded test and dev environments). In the future, we plan to check this automatically when the developer publishes a package.
The main checks are functional testing (“black box”) and regression testing, at the component and integration levels.
Taking into account the test objects defined in the previous section, the full set of checks looks like this:



Under the "comparison with backup" refers to the reconciliation of the results of the new and previous versions of the job. That is, the old and new jobs are run on the same data. And their target tables are compared.
A “prototype” is a data set collected from a TOR, and which should be in the storefront after revision. This may be the layout of either a completely new table, or just changed columns in the old table.
Depending on the task, some of these checks may be redundant. By determining the types of improvements, you can get rid of redundant checks and reduce the testing time.
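
Both comparisons boil down to reconciling two tables by a business key. A minimal sketch of such a reconciliation in plain SQL (the table names dwh.client_balance and dwh.client_balance_bkp are hypothetical; in practice the comparison is done by a SAS macro described below):

    -- Hypothetical tables: dwh.client_balance is the new target, dwh.client_balance_bkp its backup.
    -- Quick sanity check: row counts must match.
    select
        (select count(*) from dwh.client_balance)     as target_cnt,
        (select count(*) from dwh.client_balance_bkp) as backup_cnt;

    -- Full reconciliation: rows present in one table but missing (or different) in the other.
    (select * from dwh.client_balance
     except
     select * from dwh.client_balance_bkp)
    union all
    (select * from dwh.client_balance_bkp
     except
     select * from dwh.client_balance);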


Types of changes and checks


The Data Warehouse project at the bank is constantly evolving: new data marts are created, old ones are reworked, and loading processes are optimized. In practice, though, all tasks can be divided into four groups, each with its own set of checks:


  1. Technical.
    Optimization, migration, etc.: tasks where the algorithm does not change, and the data in the target table should not change either.
    A regression check is enough: compare the target of the modified job with the backup. Dependent jobs do not need to be checked, because if the target matches the backup, the dependent jobs will process it in the same way.
  2. Changes to existing functionality.
    The algorithm or filters change (the number of rows changes), new fields or sources are added. That is, the data set in the target table changes.
    All checks should be performed: compare the data in the target table of the modified job with the backup and the prototype, and check the quality of the data in the target tables of dependent jobs and in external systems.
  3. Development of new data marts.
    New tables are created, along with the jobs that load them.
    We perform only functional testing: we compare the target tables with the prototypes.
    If the data is uploaded to an external system, we additionally check the integration: we look at how the loaded data is displayed in the external system.
  4. Data corrections.
    Deleting duplicates and old records, fixing versioning, writing correct values.
    Checking these changes is rather involved and cannot be covered in a couple of sentences; I will describe it in detail in the next article.

If a project or task involves several types of changes, the test set includes the checks for each of them.


For most tasks, these checks are enough to verify that the development results meet the requirements.


Blitz check


Building a prototype and running the comparison can take a long time (depending on the performance of the environment and the amount of data). We ran into this problem when the test environment was weaker than the production one. To avoid wasting time on the comparison and to catch critical defects right away, we use quick checks.
They are suitable for any type of task and are performed before the main testing:



If they fail, you can immediately return the task to development and raise a Critical defect.
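
The specific blitz checks are listed above; purely as an illustration, quick checks of this kind typically come down to a few cheap queries against the freshly loaded target (table and key names below are hypothetical):

    -- The target is not empty after the load.
    select count(*) as row_cnt from dwh.client_balance;

    -- No duplicates by the entity key.
    select client_id, count(*) as cnt
    from dwh.client_balance
    group by client_id
    having count(*) > 1;

    -- The key is always filled.
    select count(*) as null_keys
    from dwh.client_balance
    where client_id is null;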


Tools


The main actions during testing are deploying tasks to the test environment and comparing tables.
As already mentioned, the transfer is done with our in-house tool, the "Authors" program, which saves time and avoids the errors of manual transfer.


The work of the program looks like this:



It runs from the console; the inputs are the task number, the environment name and additional parameters (for example, a pause after each step).


To compare tables (a target with a prototype or backup), we use a macro that matches rows by a specified key, compares the values in them, and compares how the fields are populated.
The macro receives the table names and the comparison key as input.
Example of its output:


The number of differences per column (the differing values themselves are then inspected separately):


| Obs | column_name | differ_base_to_comp |
|-----|-------------|----------------------|
| 1   | column_1    | 0                    |
| 2   | column_2    | 20                   |
| 3   | column_3    | 0                    |

Row counts grouped by the values of _cd and _flg fields:


| Obs | column_name | column_group | base_groups | compare_groups | diff    | base_group_pct | compare_group_pct | diff_group_pct |
|-----|-------------|--------------|-------------|----------------|---------|----------------|-------------------|----------------|
| 1   | column_2    | A            | 18743       | 63             | 18680   | 0.0021         | 0.0024            | -0.0003        |
| 2   | column_2    | B            | 4451740     | 17756          | 4433984 | 0.4897         | 0.6877            | -0.1980        |
| 3   | column_2    | C            | 4619311     | 7813           | 4611498 | 0.5082         | 0.3026            | 0.2056         |
| 4   | column_2    | null         | 191         | 188            | 3       | 0.0000         | 0.0073            | -0.0073        |
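
The macro itself is internal SAS code, but the idea behind the first report (the number of differing values per column for key-matched rows) can be sketched in plain SQL, assuming hypothetical table and key names:

    -- For rows matched by the key, count how many values differ in each column.
    select
        sum(case when b.column_1 is distinct from c.column_1 then 1 else 0 end) as column_1_diff,
        sum(case when b.column_2 is distinct from c.column_2 then 1 else 0 end) as column_2_diff,
        sum(case when b.column_3 is distinct from c.column_3 then 1 else 0 end) as column_3_diff
    from dwh.client_balance     b  -- base table (new target), hypothetical name
    join dwh.client_balance_bkp c  -- compare table (backup or prototype), hypothetical name
      on b.client_id = c.client_id;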

To check data quality, we use a profiling macro. It counts the number and percentage of null entries in each column, duplicates and nulls in the key, row counts grouped by flags and values, and min/max/avg for amount columns.
The input is the table name and the key.
The output is a report with a table for each of these calculations.
Example:


The number of missing values (nulls) per column:


| Obs | column_name | base_nulls | nulls_pct |
|-----|-------------|------------|-----------|
| 1   | column_1    | 0          | 0.00      |
| 2   | column_2    | 0          | 0.00      |
| 3   | column_3    | 7          | 0.03      |
| 4   | column_4    | 0          | 0.00      |
| 5   | column_5    | 0          | 0.00      |
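
As a sketch of what such profiling computes, a single SQL pass over a hypothetical table could gather the null counts, key duplicates and aggregates roughly like this:

    select
        count(*)                                                  as row_cnt,
        -- nulls per column (repeated for every column of interest)
        sum(case when column_3 is null then 1 else 0 end)         as column_3_nulls,
        round(100.0 * sum(case when column_3 is null then 1 else 0 end)
              / count(*), 2)                                      as column_3_nulls_pct,
        -- duplicates and nulls by the key
        count(*) - count(distinct client_id)                      as key_duplicates_and_nulls,
        sum(case when client_id is null then 1 else 0 end)        as key_nulls,
        -- aggregates for amount columns
        min(balance_amt) as min_amt,
        max(balance_amt) as max_amt,
        avg(balance_amt) as avg_amt
    from dwh.client_balance;   -- hypothetical table name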

There is also a macro for comparing the profiles of two different tables with each other (for example, with the same table in production). The principle is the same: profiling is run for each table and the results are compared.
The output report looks like the ordinary profiling report, only with data from both tables.
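
The same idea expressed in SQL: profile each table and join the results, for example by the null percentage per column (dwh_prod here is a hypothetical schema holding the production copy):

    with prof_test as (
        select 'column_3' as column_name,
               round(100.0 * sum(case when column_3 is null then 1 else 0 end)
                     / count(*), 2) as nulls_pct
        from dwh.client_balance
    ),
    prof_prod as (
        select 'column_3' as column_name,
               round(100.0 * sum(case when column_3 is null then 1 else 0 end)
                     / count(*), 2) as nulls_pct
        from dwh_prod.client_balance
    )
    select t.column_name,
           t.nulls_pct               as test_nulls_pct,
           p.nulls_pct               as prod_nulls_pct,
           t.nulls_pct - p.nulls_pct as diff_pct
    from prof_test t
    join prof_prod p on p.column_name = t.column_name;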


Data quality control


To control data quality across the entire warehouse, we use an in-house Data Quality Daemon (DQD), which checks all tables against rules compiled by analysts and by the data quality control department.
DQD is a process that runs every 10 minutes, finds updated tables and executes the specified SQL queries. It compares the results with reference values (presets) and sends a report on any deviations.
Report example:


| Constraint Definition | SQL Script | Corrupted Rows Cnt |
|-----------------------|------------|--------------------|
| test_schema.table1 / Unique Entity Key [id] | select sum(cnt) as cnt from (select 0 as cnt union all select 1 as cnt from test_schema.table1 group by id having count(*) > 1 union all select 1 as cnt from test_schema.table1 where not ((id) is not null)) sq | 15 |
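
The SQL script from the report, reformatted for readability; it returns the number of violations of the uniqueness and non-nullness of the id key (15 in the example above):

    select sum(cnt) as cnt
    from (
        select 0 as cnt
        union all
        select 1 as cnt                    -- one row per id value that occurs more than once
        from test_schema.table1
        group by id
        having count(*) > 1
        union all
        select 1 as cnt                    -- one row per record with a null id
        from test_schema.table1
        where not ((id) is not null)
    ) sq;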

Making test cases


In our bank, testing lives in Zephyr (a Jira add-on). Development tasks are created as tickets in Jira, and test cases are kept in Zephyr.
After trying several options, we settled on creating a case for each changed job. The case is named “<task number in Jira>: <job name>” and linked to the ticket.
The main advantages of this approach are:


  1. in the task you can see the case coverage (which jobs will be checked)
  2. it is easy to calculate the percentage of run/passed/failed cases
  3. a simple search by job name returns all the written cases, their status, who wrote and executed them, when, and for which task
  4. from the case name you can find the number of the development task and open it via the link

Conclusion


Testing a DWH is a complex process with its own specifics. If you follow the classical methodology, it becomes very cumbersome.
Our approach lets us test quickly (on average, one tester completes a task in three days), and the number of missed errors drops to zero. Over six months, more than 400 tasks were delivered to production.
We are not going to rest on our laurels: the next goal is to automate most of the checks.



Source: https://habr.com/ru/post/302670/

