
7 errors of the ETL developer

Data warehousing projects have long been part of the IT infrastructure of most large enterprises. ETL processes are part of these projects, but developers sometimes make the same mistakes when designing and maintaining these processes. Some of these errors are described in this post.

I would like to immediately narrow the scope of the discussion and agree on terminology:


The term ETL is often interpreted differently because of the "simple" decoding of its abbreviation. In fact, ETL tasks are only a subset of Data Movement tasks. Kimball, in his book "The Data Warehouse ETL Toolkit", names three operations that an ETL process must perform:
  1. Deliver data in the form most convenient for analytical applications;
  2. Enrich the data with additional information while loading it;
  3. Record and document the lineage (origin) of the data.

The first point is fairly obvious, so I will skip it. The second point says that data should not simply be copied from one place to another, but also enriched along the way, for example, with new calculated attributes and technical attributes (load session id, load date, source system, etc.). The third says that for any record it must be possible to trace where and when it appeared in the DWH, and when and by which process it was changed.

In general, the essence of most ETL developer mistakes comes down to ignoring one simple rule of life.


The examples below assume a DWH built on Oracle 11g. So let's get started.

1. Using the system date (or a similar function) in business logic


One of the easiest and most common mistakes, especially for inexperienced developers. Suppose there is a business rule: during the nightly "load window", extract the orders that were closed that day (based on the close_date field). The result is sometimes an SQL statement like this:

insert into target_table (<....>)
select <....> from orders
where close_date >= sysdate - 1

Even if we set aside the fact that sysdate returns not only a date but also a time, this script runs into trouble whenever the regular schedule of the ETL process is disrupted, which happens for entirely ordinary reasons: the source system is being upgraded to a new version, the connection to the source system is lost, a new ETL process has used up the free space in the temporary tablespace, and so on. That is, problems appear the moment our ETL process has to be restarted for some reason, or suspended for a while and then rerun. Something interesting can also happen if for some reason the process is launched twice in one day.

The fix for this mistake is usually simple: parameterize the call of the process, and if necessary use sysdate only as a default value that can be overridden. Still, using a datetime field to drive the delta is not ideal from the point of view of DWH maintenance; it is better to base the delta on some discrete field (for example, an integer load session id or something similar).
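
A minimal sketch of the parameterized variant (the bind variable :p_load_date is illustrative; it could just as well come from a load-control table):

-- :p_load_date is supplied by the scheduler and normally equals trunc(sysdate) - 1,
-- but it can be overridden when the process is restarted or rerun for an older day
insert into target_table (<....>)
select <....> from orders
where close_date >= :p_load_date
  and close_date <  :p_load_date + 1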

2. Not profiling the data before starting development


Even the best-documented source system, developed by all the rules and methodologies, usually contains incorrect or inconsistent data, despite numerous assurances from its developers and support teams. Relying on assurances of correctness from the other side of the barricades usually ends in problems late in development. Any data source (table, file, XML, JSON, etc.) must be checked against the DWH logical model. There are various data profiling tools, both built into ETL tools and standalone. I will list the most common checks:

Check #1: Uniqueness of identifiers and natural keys of source data


The difference between an identifier and a natural key is that an identifier is usually some surrogate value that technically identifies a row, while a natural key is a value or combination of values that carries business meaning.

Order_details table:

order_details_id    document_position    order_id
35346346            10                   1224114
35346365            20                   1224114
...                 ...                  ...
35345464            10                   1224438

In this example, order_details_id is an identifier, and the combination document_position + order_id is a natural key.

Example: I took part in a project loading data into a DWH from a distributed (multi-instance) system that kept track of network infrastructure objects. The developers of this system assured us with a straight face that the object id was unique, and even showed a unique index on the table in the source system to back up their words. The catch did not surface right away: it turned out that these ids were unique only within a single instance of the system, and when we tried to load the data from all instances, we ran into uniqueness violations. As a result, we had to change the data model and extend the natural key of the "network object" entity with an additional "instance" field to ensure uniqueness.
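
A profiling query for this check might look like the sketch below (written against the order_details example above):

-- natural keys that occur more than once in the source
select document_position, order_id, count(*) as cnt
from order_details
group by document_position, order_id
having count(*) > 1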

Check #2: Data Types


If a field is called Order_nr, it does not necessarily contain only numeric values; alphanumeric sequences may well appear there. It is also always worth checking field lengths. This problem is typical for file data sources; database tables are usually well typed.
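
A sketch of such a check on a staging table (the staging_orders table and the 10-character limit are assumptions made for the example):

-- rows where order_nr is not purely numeric or does not fit the target column
select order_nr
from staging_orders
where not regexp_like(order_nr, '^[0-9]+$')
   or length(order_nr) > 10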

Check #3: Referential Integrity (FK Check)


The fact that a developer shows you the ER diagrams of his source system, demonstrates the existing FKs between tables on his DEV environment, and generally swears on his mother that he has everything under control is not a reason to skip checking for "dangling" records. He may simply not know that on the production environment the DBA has already disabled that constraint to improve performance (having agreed it with the development manager, of course, so nobody is to blame). Problems with referential integrity are also very common for file data sources. Finally, do not forget about the late-arriving-data scenario: even if the data arrives in the agreed order today, it is far from guaranteed that it still will in six months.
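
A sketch of the "dangling records" check, again using the order_details / orders example:

-- child records whose parent order does not exist
select d.order_id, count(*) as dangling_rows
from order_details d
left join orders o
  on o.order_id = d.order_id
where o.order_id is null
group by d.order_id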

Check #4: NULL values


The main problem with NULL values is that NULL <> NULL, so any query joining on a field that may contain NULLs will return unpredictable results. Therefore all important fields should be wrapped in the nvl() construct. There is a separate holy war over whether to load NULLs into non-key fields or to replace them with some default value. I lean towards universally replacing NULLs for a more standardized way of using the DWH, but I will not insist that it must always be done.
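
A sketch of the nvl() wrapping on a join key that may contain NULLs (the -1 default value and the table names are illustrative):

-- without nvl(), rows with a NULL customer_id would silently drop out of the join
select o.order_id, c.customer_id
from orders o
join customers c
  on nvl(o.customer_id, -1) = nvl(c.customer_id, -1)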

Check #5: Dates


Checks of date fields are usually the most complicated, because beyond the standard checks it must be taken into account that not every date that is acceptable to the database is acceptable to the DWH: the date "21-07-1007" is hardly a plausible signing date for a mobile services contract. DWH models usually define so-called "beginning of time" and "end of time" dates (other names are possible), and any date outside that range should be replaced with some default value.

Special mention goes to storing dates in types like varchar(8) (in a format like '20151201'); the number of checks there has to be even greater.
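
A sketch of the range check against the DWH boundary dates (the contracts table and the boundary values are assumptions for the example):

-- dates outside the range accepted by the DWH
select count(*) as bad_dates
from contracts
where contract_date not between date '1000-01-01' and date '9999-12-31'

For varchar(8) dates the format itself has to be validated first (for example with regexp_like(contract_date_str, '^[0-9]{8}$')) before any to_date() conversion.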

3. Removing duplicates via GROUP BY or DISTINCT


Although everything that comes from a source is usually loaded into the DWH, there are scenarios where knowingly duplicated data arrives, while the uniqueness of the natural key requires keeping just one record out of each set of duplicates. There are two wrong ways to remove duplicates:

Wrong way #1: GROUP BY


Suppose we are loading customer addresses and we know that, in theory, several records with address information may arrive for one customer (they are usually exact duplicates caused by, for example, synchronization problems). Giving in to the temptation to solve the problem head-on, the developer may write a query like this:

insert into customer_address
select customer_id, max(street_name), max(house_nr)
from source_table
group by customer_id

The problems begin when two genuinely different records arrive for the same customer (for example, an operator made an input error and then corrected it, but both versions of the record ended up in the data source):
customer_id    street_name         house_nr
1321           Moskovskaya str     127
1321           Pushkinskaya str    34

The query can return the following result (depending on the locale):
customer_id    street_name         house_nr
1321           Pushkinskaya str    127

There was no such record in the source data, and DWH users may reasonably ask where it came from. In fact, the third requirement for an ETL process was violated here: a record was loaded into the DWH that cannot be traced back to the source system, in other words, a record that does not exist there. And that is clearly the ETL developer's fault.

Wrong way #2: DISTINCT


The second head-on solution in the scenario described above is to use DISTINCT to remove duplicate records:

insert into customer_address
select distinct customer_id, street_name, house_nr
from source_table

In this case, a pair of duplicates with different attributes will at least be caught earlier: instead of one record there will be two, the uniqueness of the natural key will be violated, and the ETL process will fail with an error.

One of the right ways


So how should the problem of two records with the same natural key but different attributes be solved? Obviously, if the data model is not to be changed, only one correct record must be selected out of the set. It should be chosen by a predetermined criterion: if the information is critical, various Data Quality scenarios can be implemented; if not, the most recently loaded record can be taken as the valid one.

insert into customer_address
select customer_id, street_name, house_nr from (
  select customer_id, street_name, house_nr,
         row_number() over (partition by customer_id order by change_datetime desc) row_num
  from source_table)
where row_num = 1

In general, remember that any record in the DWH must be traceable back to its data source(s) according to the business rules; no "unexplainable" records should be created.

4. Using "static" scripts from source systems


Very often the business logic for DWH entities comes from developers or analysts of the source systems in the form of SQL scripts. This is a great help for an ETL developer, but, as they say, "beware of Greeks bearing gifts": as a rule, these scripts capture a conditionally "static" state of the source system at some point in time, whereas the ETL developer usually tracks the dynamics of the data and loads only the changes (the "delta"). What should raise a red flag in these "static" SQL scripts? Above all, references to other tables (in subqueries or joins) whose own changes are not covered by the delta logic.


An example of such a script:

insert into orders_from_calls (order_id)
select order_id from orders
where order_id in (select order_id from calls where order_id <> -1)
  and changed_date > $last_loaded_date

It all seems logical: load into our orders_from_calls table every order that is referenced in the calls table and whose last modification date is later than the last load date. Now imagine that the calls table in the DWH was not refreshed (for example, it is loaded from another source system and the connection to it broke for some reason), and this query did not pick up some order ids. Later the calls table was reloaded correctly and those missed order ids appeared in it, but we will never load them into orders_from_calls, because nothing has changed in the orders table and new runs of this query will return nothing. Therefore, in this case the delta should be tracked not only on the orders table but also on the calls table.
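
One possible fix is to drive the delta from both tables; a sketch under the assumption that calls also carries a changed_date column and that already loaded ids must not be inserted twice:

insert into orders_from_calls (order_id)
select o.order_id
from orders o
where o.order_id in (select order_id from calls where order_id <> -1)
  and (o.changed_date > $last_loaded_date
       or exists (select 1 from calls c
                  where c.order_id = o.order_id
                    and c.changed_date > $last_loaded_date))
  and not exists (select 1 from orders_from_calls t
                  where t.order_id = o.order_id)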

5. Developing on a small amount of data


As a rule, for development the ETL developer is given a small extract of data from the production system in the DEV environment, and is expected to develop and debug the ETL processes on it. Unfortunately, solutions developed on such a small amount of data usually run into various problems on the production system: insufficient performance, lack of space for intermediate tables (for example, the developer decided to separate the business logic steps neatly into a set of intermediate tables, sequentially reloading data from one into the next, but on production data volumes this turned out to be too much and the tablespace for temporary tables suddenly ran out), and so on.

Unfortunately, the ETL developer cannot always fix this on their own, because of various regulations and policies, the lack of budget for a full-fledged DEV environment with the same data volume as production, and so on. So it should be treated as a project risk.

One way out is to split the project into smaller stages and release more often, so that such problems surface not at the end of the project but at least in the middle.

6. Improper use of technical and business dates


There are two kinds of dates in a DWH: business dates and technical dates. The difference is in their origin: a business date came from a data source or was created according to business rules; a technical date was generated by the ETL process or by the DWH itself. And they are very often used incorrectly:

#1 Business dates are used as technical dates


If an entity is historized as SCD2 (Slowly Changing Dimension type 2) and the data source contains "_from" and "_to" fields that the ETL developer is invited to use as the validity ranges of the data, then he should have truly cast-iron guarantees that, for every natural key, the validity ranges will be: 1) non-overlapping, 2) free of gaps, and 3) such that together they cover the whole range from the "beginning of time" to the "end of time" defined for your DWH (for example, the pair of dates "01.01.1000" and "31.12.9999", or "11.11.1111" and "09.09.9999"). As a rule, the developers of source systems do not worry much about this, and while the "non-overlapping ranges" rule is usually respected, points 2 and 3 usually cause problems. In any case, the general recommendation is not to use business dates for SCD2, but to generate your own technical dates.
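
A sketch of generating technical validity ranges from load timestamps instead of trusting the source "_from"/"_to" fields (the table and column names are illustrative, and 31.12.9999 plays the role of the "end of time"):

-- each version is valid from its load timestamp until the next version appears
select customer_id,
       street_name,
       load_dttm as valid_from_dttm,
       nvl(lead(load_dttm) over (partition by customer_id order by load_dttm),
           date '9999-12-31') as valid_to_dttm
from customer_address_history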

#2 Technical dates are used as business dates


Very often data sources do not supply fields for tracking key dates: for example, a document has only a "closed" status, but no timestamp of when that event happened, and the tempting solution is to use the technical "_from" and "_to" dates generated by the ETL process. However, this solution only works until the first ETL failure (say, the ETL processes stop for a couple of days): the failure happened on Monday, recovery on Wednesday, and since the source system worked perfectly well all that time, every document closed in between will be loaded as if it was closed on Wednesday. In general, the "historical truth" scenario cannot be implemented if the data source does not supply all the dates the users need; it can only be emulated (using technical dates), but in that case this should be agreed upon and described in the documentation, so that a year later users are not surprised by zero closed documents on Monday and Tuesday and three times the usual number on Wednesday.

7. “Mechanical” implementation


This is one of the hardest errors to identify and, truth be told, it is not really an ETL developer's error but rather the DWH architect's. But the same team works on the project, so colleagues need to be helped out too.

Sometimes the target entity in the DWH is modeled incorrectly because of terminology mismatches between the source system developer and the architect. The source system developer thinks in the categories of his own system, while the DWH architect also has to think through various integration schemes and how to connect multiple objects from heterogeneous source systems within a single DWH.

I will describe one typical problem of this kind using the "customer" entity as an example: the data source has a "customer" table with a unique natural key and referential integrity in perfect order. Based on this table, the "customer" entity was created in the DWH. Judging by the name, it is logical to assume that one record in this table corresponds to one client, but it turned out that the same real customer could have several records with identical attributes but different natural keys. This would have led to an unpleasant collision for the DWH users who used this entity, for example, to count the company's total number of customers. As a result, it was decided to split the entity into two, "customer_record" and "customer", connected by an FK with an M:1 relationship.
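
A sketch of the resulting model (all names and columns are hypothetical; only the M:1 relationship matters):

create table customer (
  customer_key   number         primary key,
  customer_name  varchar2(200)
);

create table customer_record (
  customer_record_key  number        primary key,
  source_system        varchar2(30)  not null,
  source_customer_id   varchar2(50)  not null,
  customer_key         number        not null references customer (customer_key)
);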

If the ETL developer had "mechanically" implemented everything according to the specification, he certainly would not have been to blame, but he had the chance to notice the issue: compared to the architect he works, so to speak, "on the ground", while the architect "hovers in the clouds".

In general, some symptoms of a “mechanical” implementation can be mentioned:


What should be done to minimize the risks of "mechanical" implementation:

Summing up this point: you should always understand what exactly you are loading into the DWH, whether the name matches the content, and load neither more nor less data than is required.

Conclusion


Of course, this list is not exhaustive, but I hope this article can bring a bit of order to heads that are already crowded with deadlines, milestones, releases, and bugfixes.

Source: https://habr.com/ru/post/273243/

