
Data Warehouse Testing

Published on behalf of IvanovAleksey.



There is little information online about testing a Data Warehouse.
You can find general requirements: data completeness, quality, and so on.
But nowhere is there a description of how to organize the process and which checks can cover these requirements.
In this article I will try to explain how we test the data warehouse at Tinkoff Bank.


Our Data Warehouse


First, I will briefly describe our warehouse.
Data loading and processing are implemented as jobs. Jobs are started by the scheduler, usually at night, when no one is using the external systems (the load on them is minimal) and after the business day has closed.
The database structure is a set of tables of different sizes (from a few rows to 3 billion and more), with different numbers of columns (from 1 to 200), with and without history tracking. The database size is about 35 TB.
We use external data sources: Oracle, CSV/XLS files, MySQL, Informix, BIN files, CDH (Cloudera Hadoop). We also upload data to external systems, through Oracle, SAS tables and CSV/XLS files.
Warehouse objects (descriptions of databases, jobs, tables, views, etc.) are developed in SAS Data Integration Studio and stored on the SAS server as metadata. Physically, the tables live in Greenplum and SAS databases.
To run the jobs, code is generated from the metadata and transferred to the deployment server, after which the jobs can be run by the scheduler.
Changes are transferred between environments as packages consisting of metadata and scripts (to create or modify physical tables and data). To make transfers easier, the developers wrote a dedicated program called "Authors".
There is a web portal for starting jobs manually. It also shows the running schedulers and the status of the jobs they execute.



Test objects


Any DWH change is the creation or modification of the physical structures and metadata of tables and jobs, or a script that corrects already loaded data.
For example, suppose a new data mart is being created. The package will contain the metadata of the new job and its target table, plus a script that creates the physical table in the database.
Thus, the test objects are the modified or created job, the data in its target table, and the data in the target tables of dependent jobs (if any).


Levels and types of testing


Our test environment fully replicates production: the same hardware, the same data of the same volume, loaded and processed by the same processes. Given this, and the fact that tasks are developed when the sources already contain all the necessary data, we can reduce the number of checks without losing quality.
We perform load testing (performance testing) and testing on large data volumes (volume testing) only when the task is transferred to the test environment: we check the job run time, the load on the environment (for example, using Grafana) and the volume of data processed (I will not describe these checks in detail in this article).
At the system level, jobs and data quality are checked automatically. Jobs are monitored by the night scheduler itself and by a script that controls the volumes of loaded data; after loading, the data is checked by the Data Quality Daemon (more on it below). If something is wrong with the data, the responsible people receive emails with the errors.
As for “white box” checks, we only verify that the environment is specified correctly (there have been errors with hardcoded test and dev environments). In the future, we plan to check this automatically when the developer publishes a package.
The main checks are functional testing (“black box”) and regression testing, at the component and integration levels.
Taking into account the test objects defined in the previous section, the full set of checks looks like this:



Under the "comparison with backup" refers to the reconciliation of the results of the new and previous versions of the job. That is, the old and new jobs are run on the same data. And their target tables are compared.
A “prototype” is a data set collected from a TOR, and which should be in the storefront after revision. This may be the layout of either a completely new table, or just changed columns in the old table.
Depending on the task, some of these checks may be redundant. By determining the types of improvements, you can get rid of redundant checks and reduce the testing time.
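
Both comparisons boil down to reconciling two tables by a business key. A minimal sketch of such a reconciliation in plain SQL (the table names dwh.client_balance and dwh.client_balance_bkp are hypothetical; in practice the comparison is done by a SAS macro described below):

    -- Hypothetical tables: dwh.client_balance is the new target, dwh.client_balance_bkp its backup.
    -- Quick sanity check: row counts must match.
    select
        (select count(*) from dwh.client_balance)     as target_cnt,
        (select count(*) from dwh.client_balance_bkp) as backup_cnt;

    -- Full reconciliation: rows present in one table but missing (or different) in the other.
    (select * from dwh.client_balance
     except
     select * from dwh.client_balance_bkp)
    union all
    (select * from dwh.client_balance_bkp
     except
     select * from dwh.client_balance);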


Types of changes and checks


The Data Warehouse project at the bank is constantly evolving: new data marts are created, old ones are reworked, and loading processes are optimized. In practice, though, all tasks can be divided into four groups, each with its own set of checks:


  1. Technical.
    Optimization, migration, etc.: tasks where the algorithm does not change, and the data in the target table should not change either.
    A regression check is enough: compare the target of the modified job with the backup. Dependent jobs do not need to be checked, because if the target matches the backup, the dependent jobs will process it in the same way.
  2. Changes to existing functionality.
    The algorithm or filters change (the number of rows changes), new fields or sources are added. That is, the data set in the target table changes.
    All checks should be performed: compare the data in the target table of the modified job with the backup and the prototype, and check the quality of the data in the target tables of dependent jobs and in external systems.
  3. Development of new data marts.
    New tables are created, along with the jobs that load them.
    We perform only functional testing: we compare the target tables with the prototypes.
    If the data is uploaded to an external system, we additionally check the integration: we look at how the loaded data is displayed in the external system.
  4. Data corrections.
    Deleting duplicates and old records, fixing versioning, writing correct values.
    Checking these changes is rather involved and cannot be covered in a couple of sentences; I will describe it in detail in the next article.

If a project or task involves several types of changes, the test set includes the checks for each of them.


For most tasks, these checks are enough to verify that the development results meet the requirements.


Blitz check


Building a prototype and running the comparison can take a long time (depending on the performance of the environment and the amount of data). We ran into this problem when the test environment was weaker than the production one. To avoid wasting time on the comparison and to catch critical defects right away, we use quick checks.
They are suitable for any type of task and are performed before the main testing:



If they fail, you can immediately return the task to development and raise a Critical defect.
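
The specific blitz checks are listed above; purely as an illustration, quick checks of this kind typically come down to a few cheap queries against the freshly loaded target (table and key names below are hypothetical):

    -- The target is not empty after the load.
    select count(*) as row_cnt from dwh.client_balance;

    -- No duplicates by the entity key.
    select client_id, count(*) as cnt
    from dwh.client_balance
    group by client_id
    having count(*) > 1;

    -- The key is always filled.
    select count(*) as null_keys
    from dwh.client_balance
    where client_id is null;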


Tools


The main actions during testing are deploying tasks to the test environment and comparing tables.
As already mentioned, the transfer is done with our in-house tool, the "Authors" program, which saves time and avoids the errors of manual transfer.


The work of the program looks like this:



It runs from the console; the inputs are the task number, the environment name and additional parameters (for example, a pause after each step).


To compare tables (a target with a prototype or backup), we use a macro that matches rows by a specified key, compares the values in them, and compares how the fields are populated.
The macro receives the table names and the comparison key as input.
Example of its output:


The number of differences per column (the differing values themselves are then inspected separately):


| Obs | column_name | differ_base_to_comp |
|-----|-------------|----------------------|
| 1   | column_1    | 0                    |
| 2   | column_2    | 20                   |
| 3   | column_3    | 0                    |

Row counts grouped by the values of _cd and _flg fields:


| Obs | column_name | column_group | base_groups | compare_groups | diff    | base_group_pct | compare_group_pct | diff_group_pct |
|-----|-------------|--------------|-------------|----------------|---------|----------------|-------------------|----------------|
| 1   | column_2    | A            | 18743       | 63             | 18680   | 0.0021         | 0.0024            | -0.0003        |
| 2   | column_2    | B            | 4451740     | 17756          | 4433984 | 0.4897         | 0.6877            | -0.1980        |
| 3   | column_2    | C            | 4619311     | 7813           | 4611498 | 0.5082         | 0.3026            | 0.2056         |
| 4   | column_2    | null         | 191         | 188            | 3       | 0.0000         | 0.0073            | -0.0073        |
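
The macro itself is internal SAS code, but the idea behind the first report (the number of differing values per column for key-matched rows) can be sketched in plain SQL, assuming hypothetical table and key names:

    -- For rows matched by the key, count how many values differ in each column.
    select
        sum(case when b.column_1 is distinct from c.column_1 then 1 else 0 end) as column_1_diff,
        sum(case when b.column_2 is distinct from c.column_2 then 1 else 0 end) as column_2_diff,
        sum(case when b.column_3 is distinct from c.column_3 then 1 else 0 end) as column_3_diff
    from dwh.client_balance     b  -- base table (new target), hypothetical name
    join dwh.client_balance_bkp c  -- compare table (backup or prototype), hypothetical name
      on b.client_id = c.client_id;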

To check data quality, we use a profiling macro. It counts the number and percentage of null entries in each column, duplicates and nulls in the key, row counts grouped by flags and values, and min/max/avg for amount columns.
The input is the table name and the key.
The output is a report with a table for each of these calculations.
Example:


The number of missing values (nulls) per column:


| Obs | column_name | base_nulls | nulls_pct |
|-----|-------------|------------|-----------|
| 1   | column_1    | 0          | 0.00      |
| 2   | column_2    | 0          | 0.00      |
| 3   | column_3    | 7          | 0.03      |
| 4   | column_4    | 0          | 0.00      |
| 5   | column_5    | 0          | 0.00      |
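
As a sketch of what such profiling computes, a single SQL pass over a hypothetical table could gather the null counts, key duplicates and aggregates roughly like this:

    select
        count(*)                                                  as row_cnt,
        -- nulls per column (repeated for every column of interest)
        sum(case when column_3 is null then 1 else 0 end)         as column_3_nulls,
        round(100.0 * sum(case when column_3 is null then 1 else 0 end)
              / count(*), 2)                                      as column_3_nulls_pct,
        -- duplicates and nulls by the key
        count(*) - count(distinct client_id)                      as key_duplicates_and_nulls,
        sum(case when client_id is null then 1 else 0 end)        as key_nulls,
        -- aggregates for amount columns
        min(balance_amt) as min_amt,
        max(balance_amt) as max_amt,
        avg(balance_amt) as avg_amt
    from dwh.client_balance;   -- hypothetical table name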

There is also a macro for comparing the profiles of two different tables with each other (for example, with the same table in production). The principle is the same: profiling is run for each table and the results are compared.
The output report looks like the ordinary profiling report, only with data from both tables.
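
The same idea expressed in SQL: profile each table and join the results, for example by the null percentage per column (dwh_prod here is a hypothetical schema holding the production copy):

    with prof_test as (
        select 'column_3' as column_name,
               round(100.0 * sum(case when column_3 is null then 1 else 0 end)
                     / count(*), 2) as nulls_pct
        from dwh.client_balance
    ),
    prof_prod as (
        select 'column_3' as column_name,
               round(100.0 * sum(case when column_3 is null then 1 else 0 end)
                     / count(*), 2) as nulls_pct
        from dwh_prod.client_balance
    )
    select t.column_name,
           t.nulls_pct               as test_nulls_pct,
           p.nulls_pct               as prod_nulls_pct,
           t.nulls_pct - p.nulls_pct as diff_pct
    from prof_test t
    join prof_prod p on p.column_name = t.column_name;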


Data quality control


To control data quality across the entire warehouse, we use an in-house Data Quality Daemon (DQD), which checks all tables against rules compiled by analysts and by the data quality control department.
DQD is a process that runs every 10 minutes, finds updated tables and executes the specified SQL queries. It compares the results with reference values (presets) and sends a report on any deviations.
Report example:


| Constraint Definition | SQL Script | Corrupted Rows Cnt |
|-----------------------|------------|--------------------|
| test_schema.table1 / Unique Entity Key [id] | select sum(cnt) as cnt from (select 0 as cnt union all select 1 as cnt from test_schema.table1 group by id having count(*) > 1 union all select 1 as cnt from test_schema.table1 where not ((id) is not null)) sq | 15 |
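
The SQL script from the report, reformatted for readability; it returns the number of violations of the uniqueness and non-nullness of the id key (15 in the example above):

    select sum(cnt) as cnt
    from (
        select 0 as cnt
        union all
        select 1 as cnt                    -- one row per id value that occurs more than once
        from test_schema.table1
        group by id
        having count(*) > 1
        union all
        select 1 as cnt                    -- one row per record with a null id
        from test_schema.table1
        where not ((id) is not null)
    ) sq;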

Making test cases


In our bank, testing lives in Zephyr (a Jira add-on). Development tasks are created as tickets in Jira, and test cases are kept in Zephyr.
After trying several options, we settled on creating a case for each changed job. The case is named “<task number in Jira>: <job name>” and linked to the ticket.
The main advantages of this approach are:


  1. in the task you can see the case coverage (which jobs will be checked)
  2. it is easy to calculate the percentage of run/passed/failed cases
  3. a simple search by job name returns all the written cases, their status, who wrote and executed them, when, and for which task
  4. from the case name you can find the number of the development task and open it via the link

Conclusion


Testing a DWH is a complex process with its own specifics. If you follow the classical methodology, it becomes very cumbersome.
Our approach lets us test quickly (on average, one tester completes a task in three days), and the number of missed errors drops to zero. Over six months, more than 400 tasks were delivered to production.
We are not going to rest on our laurels: the next goal is to automate most of the checks.



Source: https://habr.com/ru/post/302670/

