
Introduction to Data Vault





Most companies today accumulate a wide variety of data as a by-product of their operations. The data come from many sources, some structured and some not so much, sometimes in real time and sometimes only at strictly defined intervals. All this diversity needs to be stored in a structured way so that it can be analyzed, turned into clear reports, and monitored for anomalies in time. This is what a data warehouse (DWH) is designed for.



There are several approaches to building such a universal repository that help the architect avoid common pitfalls and, most importantly, achieve the proper level of flexibility and extensibility in the DWH. I want to tell you about one of these approaches.


Who will be interested in this article?





If you work with data warehouses and are not yet familiar with Data Vault, look under the cut!



Data Vault is a hybrid approach that combines the advantages of the familiar star schema and third normal form. The methodology was first published in 2000 by Dan Linstedt. The approach was conceived while developing a data warehouse for the US Department of Defense and has proven itself well. Later, in 2013, Dan announced version 2.0, updated to take into account rapidly growing technologies (NoSQL, Hadoop) and new requirements for DWH. This article is about Data Vault 2.0.



Data Vault consists of three main components: Hubs, Links, and Satellites.



Hub



A Hub is the basic representation of a business entity (Customer, Product, Order). A Hub table contains one or more fields that identify the entity in business terms; collectively, these fields are called the business key. An ideal candidate for a business key is something like an organization's TIN or a car's VIN, while a system-generated ID is the worst option. The business key must always be unique and immutable.

The Hub also contains the meta-fields load timestamp and record source, which store the time the entity was first loaded into the warehouse and its origin (the name of the system, database, or file the data was loaded from). It is recommended to use an MD5 or SHA-1 hash of the business key as the Hub's primary key.



(Figure: Hub table structure)
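As a minimal sketch of what a Hub might look like (PostgreSQL syntax; the table and column names here and in the sketches below, such as h_customer and customer_tin, are hypothetical):

```sql
-- Hub for the Customer entity.
-- The primary key is a hash of the business key; the metadata fields
-- record when and from where the entity first entered the warehouse.
CREATE TABLE h_customer (
    customer_hash_key CHAR(32)     NOT NULL PRIMARY KEY, -- MD5 of the business key
    customer_tin      VARCHAR(20)  NOT NULL UNIQUE,      -- business key (e.g. TIN)
    load_ts           TIMESTAMP    NOT NULL,             -- load timestamp
    record_source     VARCHAR(100) NOT NULL              -- record source
);
```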



Link



Link tables tie two or more Hubs together with a many-to-many relationship. A Link contains the same metadata as a Hub. A Link may be associated with another Link, but this approach creates problems during loading, so it is better to promote one of the Links to a separate Hub.



(Figure: Link table structure)
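Continuing the sketch, a Link between the hypothetical h_customer and h_order Hubs might look like this (foreign-key constraints to the Hubs are deliberately omitted; see the section on loading below):

```sql
-- Link between Customer and Order: a many-to-many relationship.
-- The primary key is a hash of the combined parent business keys.
CREATE TABLE l_customer_order (
    customer_order_hash_key CHAR(32)     NOT NULL PRIMARY KEY,
    customer_hash_key       CHAR(32)     NOT NULL,  -- parent key -> h_customer
    order_hash_key          CHAR(32)     NOT NULL,  -- parent key -> h_order
    load_ts                 TIMESTAMP    NOT NULL,
    record_source           VARCHAR(100) NOT NULL
);
```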



Satellite



All descriptive attributes of a Hub or Link (the context) are placed in Satellite tables. In addition to the context, a Satellite contains the standard set of metadata (load timestamp and record source) and exactly one "parent" key. Satellites make it easy to keep the history of context changes: each time the context is updated in the source system, a new record is added. To simplify updating a large Satellite, you can add a hash diff field to the table: an MD5 or SHA-1 hash of all of its descriptive attributes. A Hub or Link can have any number of Satellites; the context is usually split by update frequency, and context from different source systems is conventionally placed in separate Satellites.



(Figure: Satellite table structure)
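A sketch of a Satellite attached to the hypothetical h_customer Hub, including a hash diff column:

```sql
-- Satellite holding the Customer context, with full change history:
-- every context change in the source produces a new row.
CREATE TABLE s_customer (
    customer_hash_key CHAR(32)     NOT NULL,  -- parent key (h_customer)
    load_ts           TIMESTAMP    NOT NULL,
    record_source     VARCHAR(100) NOT NULL,
    hash_diff         CHAR(32)     NOT NULL,  -- MD5 of all descriptive attributes
    customer_name     VARCHAR(200),           -- the context itself
    customer_city     VARCHAR(100),
    PRIMARY KEY (customer_hash_key, load_ts)  -- one context version per load time
);
```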





How to work with it?



(Figure: Data Vault architecture)

* The picture is based on an illustration from the book Building a Scalable Data Warehouse with Data Vault 2.0



First, data from the operational systems arrives in the staging area. The staging area serves as an intermediate stop in the data-loading process; one of its main functions is to reduce the load on the operational databases while queries run. The tables here exactly mirror the source structure, but any constraints on inserting data, such as NOT NULL or foreign-key integrity checks, should be disabled so that even damaged or incomplete data can be inserted (this is especially relevant for Excel tables and other files). In addition, stage tables contain hashes of the business keys and information about the load time and the data source.
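As a sketch, a staging table for a hypothetical customers feed mirrors the source, carries no constraints, and adds the hash and metadata columns:

```sql
-- Staging table: mirrors the source structure, nothing enforced,
-- plus a precomputed business-key hash and load metadata.
CREATE TABLE stg_customers (
    customer_tin      VARCHAR(20),   -- raw source columns, no constraints
    customer_name     VARCHAR(200),
    customer_city     VARCHAR(100),
    customer_hash_key CHAR(32),      -- MD5 of the business key, computed on load
    load_ts           TIMESTAMP,
    record_source     VARCHAR(100)
);

-- Hash computation during staging (PostgreSQL); normalizing the key
-- before hashing keeps the hash stable across sources.
UPDATE stg_customers
SET customer_hash_key = md5(upper(trim(customer_tin)));
```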



After that, the data is split into Hubs, Links, and Satellites and loaded into the Raw Data Vault. During loading, the data is neither aggregated nor recalculated.



Business Vault is an optional auxiliary layer on top of the Raw Data Vault. It is built on the same principles but contains processed data: aggregated results, converted currencies, and so on. The separation is purely logical; physically, the Business Vault lives in the same database as the Raw Data Vault, and its main purpose is to simplify building data marts.



(Figure: sample Business Satellite, b_sat_order_total_price)
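A Business Satellite like the b_sat_order_total_price shown above could be populated with a simple aggregation; a sketch assuming a hypothetical s_order_line Satellite carrying price and quantity attributes:

```sql
-- Business Satellite: derived, aggregated context on top of the Raw Vault.
-- (Simplified: a real load would take only the current Satellite rows.)
CREATE TABLE b_sat_order_total_price AS
SELECT order_hash_key,
       CURRENT_TIMESTAMP     AS load_ts,
       'business_vault'      AS record_source,
       SUM(price * quantity) AS total_price  -- precomputed result for data marts
FROM s_order_line
GROUP BY order_hash_key;
```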



When the necessary tables have been created and filled, it is the turn of the data marts. Each mart is a separate database or schema designed for the tasks of particular users or departments. It can contain a specially assembled star schema or a collection of denormalized tables. Where possible, the tables inside the marts should be made virtual, that is, computed on the fly; SQL views are usually used for this.
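For example, a virtual dimension for a star schema can be assembled from a Hub and the current rows of its Satellite; a sketch built on the hypothetical tables above:

```sql
-- Virtual "customer" dimension for a data mart: the Hub joined
-- to the most recent Satellite row per customer.
CREATE VIEW dim_customer AS
SELECT h.customer_hash_key,
       h.customer_tin,
       s.customer_name,
       s.customer_city
FROM h_customer h
JOIN s_customer s ON s.customer_hash_key = h.customer_hash_key
WHERE s.load_ts = (SELECT MAX(s2.load_ts)
                   FROM s_customer s2
                   WHERE s2.customer_hash_key = s.customer_hash_key);
```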



Loading the Data Vault



Everything here is quite simple: Hubs are loaded first, then Links, and then Satellites. Hubs can be loaded in parallel, as can Satellites and Links, unless, of course, a link-to-link relationship is used.
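A sketch of an idempotent Hub load from the staging table above: only business keys not yet present in the Hub are inserted, so the step can be rerun safely.

```sql
-- Load new business keys from staging into the Hub; existing keys
-- are left untouched (DISTINCT ON picks one staged row per key).
INSERT INTO h_customer (customer_hash_key, customer_tin, load_ts, record_source)
SELECT DISTINCT ON (s.customer_hash_key)
       s.customer_hash_key,
       upper(trim(s.customer_tin)),
       s.load_ts,
       s.record_source
FROM stg_customers s
WHERE s.customer_tin IS NOT NULL
ON CONFLICT (customer_hash_key) DO NOTHING;
```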



There is also an option to turn off integrity checking entirely and load all the data simultaneously. This approach matches one of the basic tenets of DV, "Load all of the data, all of the time," and it is here that business keys play their crucial role. The point is that potential loading problems should be minimized, and one of the most common problems is an integrity violation. The approach is certainly debatable, but I use it personally and find it genuinely convenient: the data is still checked, just after loading. Otherwise, you often hit the problem of records missing from several Hubs while loading Links, and have to figure out one by one why a particular Hub did not fill completely, restarting the process and discovering a new error each time. The alternative is to surface the missing data after the load and see all the problems at once. As a bonus, we gain fault tolerance and freedom from the table loading order.
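A post-load check in this spirit might look for Link rows whose parent key never made it into a Hub; a sketch against the hypothetical tables above:

```sql
-- Post-load integrity check: Link rows referencing a Customer
-- that is absent from the Hub (run after all loads finish).
SELECT l.customer_order_hash_key, l.record_source
FROM l_customer_order l
LEFT JOIN h_customer h ON h.customer_hash_key = l.customer_hash_key
WHERE h.customer_hash_key IS NULL;
```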



Advantages and disadvantages



[+] Flexibility and extensibility.

With Data Vault, expanding the warehouse structure is no longer a problem, nor is adding and matching data from new sources. The most complete storage of raw data and the convenient warehouse structure let us build a data mart for any business requirement, while existing database products on the market handle huge volumes of information well and execute even very complex queries quickly, which makes it possible to virtualize most marts.

[+] Agile approach out of the box.

Modeling a warehouse with the Data Vault methodology is fairly simple: new data simply "plugs into" the existing model without breaking or modifying the existing structure. We can then solve each task in the most isolated way, loading only the necessary minimum, and our time estimates for such tasks will likely become more accurate. Sprint planning becomes easier, and the results are predictable from the very first iteration.

[-] Abundance of JOINs

Due to the large number of join operations, queries can be slower than in traditional data warehouses where tables are denormalized.

[-] Complexity.

The methodology described above has many important details that are unlikely to be grasped in a couple of hours. Add to this the small amount of information available online and the almost complete absence of materials in Russian (I hope to fix that). As a result, implementing Data Vault raises problems with training the team and many questions about the nuances of a particular business. Fortunately, there are resources where these questions can be asked. Another big drawback of the complexity is the mandatory need for data marts, since the Data Vault itself is poorly suited for direct queries.

[-] Redundancy.

Quite a controversial flaw, but I often see questions about redundancy, so I will comment on it from my point of view.



Many people dislike the idea of creating a layer in front of the data marts, especially considering that this layer has roughly three times more tables than a third-normal-form model would, which means roughly three times more ETL processes. That is true, but the ETL processes themselves become much simpler because of their uniformity, and all the objects in the warehouse are easy enough to understand.



The seemingly redundant architecture is built to solve very specific problems and, of course, is not a silver bullet. In any case, I would not recommend changing anything unless the Data Vault benefits described above are actually in demand.



In conclusion



In this article, I covered only the main components of Data Vault, the minimum required for an introduction. Point-in-Time and Bridge tables, the rules for choosing the components of a business key, and the method for tracking deleted records remained behind the scenes. I plan to cover them in the next article if the topic interests the community.

Source: https://habr.com/ru/post/348188/


