
HP Vertica, Data Warehouse, Big Data Design

UPD: A follow-up article is available at habrahabr.ru/company/avito/blog/322510

What is the article about?

A year has flown by since work began on designing and implementing a data warehouse on the HP Vertica platform.
There are already articles on Habr about the Vertica DBMS itself; I especially recommend this one: HP Vertica, the first project launched in the Russian Federation, because its author helped us a lot at the initial stage. Alexey, thanks again.
I would like to describe the methodology we used to design the physical structure of the warehouse so as to make the most of HP Vertica's capabilities.
In this article I want to justify why the chosen methodology is the right fit; in the next one, I will describe the techniques that let us analyze data sets containing tens of billions of rows.

Formulation of the problem

Consider a high-load website of a large Russian Internet company (now it can be said: this is Avito ;)).
The company's activity is described by the following figures: ~10 million active users, ~100 million page views per day, about 1 thousand new items posted by users per minute, and ~10 thousand user searches per minute.
A rough estimate of the number of actions to be stored in the warehouse is 100 million new records per day (~100 GB of new data per day).
That is, when building a classic data warehouse that never erases previously received data, the storage volume after 3 months of operation will reach 10 TB of raw data. Big Data as it is.
We needed to build a warehouse that would store at least 6 months of data, allow it to be analyzed and visualized, and lag behind real life as little as possible (at worst by a day, at best by minutes).
The question of choosing a platform is left out of scope: the warehouse had to run on HP Vertica, an MPP column-store database (see the introductory article mentioned above).

The choice of methodology


Several methodologies are currently popular for building data warehouses: Kimball's dimensional "Star" model, Inmon's approach, Data Vault, and Anchor Modeling.


The data warehouse being designed had to satisfy several interrelated requirements: non-destructive extensibility of the model, support for data history, fast and parallelizable loading, and manageability as data volumes grow.


Based on the analysis of these requirements, the "Star" methodology was rejected first. It has no standard mechanisms for non-destructive extension of the model: to add new entities (new dimensions, new facts), old tables have to be extended, which means modifying the ETL processes.
Thus, every extension carries the risk of making a mistake and breaking functionality that already works.
In addition, the "Star" model has difficulty maintaining data history. History for dimension data can be implemented via slowly changing dimensions type 2 (SCD2). However, this approach multiplies the number of rows in the fact table, and maintaining history of the facts themselves is extremely difficult in this model.

Inmon's approach is too general; it can be used to build your own methodology... once all the existing, ready-made and proven methodologies have been discarded.

The Data Vault methodology looked very promising in the context of this task. Data Vault does not divide tables into facts and dimensions, and it allows the history of any data field to be maintained independently, with the exception of business keys. History is kept by implementing slowly changing dimensions type 2 (SCD2, with two dates: from_date, the start of data validity, and to_date, the end of data validity).
Data Vault supports non-destructive growth of the model. In other words, when new business entities or new relationships between existing entities are discovered, the Data Vault model can be extended without changing old tables and ETL processes; the old logic simply keeps working regardless of the changes.
Data Vault also greatly simplifies the parallelization of ETL processes, because Hubs (entities) and Satellites (entity attributes) are loaded completely independently of each other. Dependencies that force some ETL processes to run before others can arise only when loading Links (relationships between entities).
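
To make the Hub/Satellite/Link terminology concrete, here is a minimal DDL sketch; the entity, table and column names are invented for illustration and do not reproduce our actual model:

```sql
-- Hub: only the business key of an entity plus technical columns.
CREATE TABLE hub_item (
    item_hk  INT          NOT NULL,  -- surrogate key
    item_bk  VARCHAR(64)  NOT NULL,  -- business key
    load_ts  TIMESTAMP    NOT NULL
);

-- Satellite: entity attributes with SCD2 history (two dates).
CREATE TABLE sat_item_details (
    item_hk    INT           NOT NULL,
    from_date  TIMESTAMP     NOT NULL,  -- validity start
    to_date    TIMESTAMP,               -- validity end, updated when a new version arrives
    price      NUMERIC(12,2),
    title      VARCHAR(256)
);

-- Link: a relationship between two entities.
CREATE TABLE link_item_user (
    item_hk  INT        NOT NULL,
    user_hk  INT        NOT NULL,
    load_ts  TIMESTAMP  NOT NULL
);
```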
All the advantages of Data Vault listed above also hold for Anchor Modeling, so the final decision was made on the basis of loading experiments with production data in the Vertica database.

The experiments revealed the following differences between models built with the Data Vault and Anchor Modeling methodologies:
  1. Maintaining history. Data Vault implements slowly changing dimensions type 2 with two dates: the start and the end of data validity. In other words, when a new version of the data arrives, the old version must be updated to set its expiration date. Anchor Modeling stores only one date, the start of validity. This approach makes insert-only ETL possible: data is loaded with inserts alone, without any updates. In a conventional DBMS the two-date approach is slightly slower during ETL (one insert plus one update) but much faster when searching for the current version of the data (with both dates, the current record can be found without sorting the versions, simply by checking which record's date range contains the current date). In an MPP database, however, the picture changes: the update operation is far slower than the insert (in Vertica, by tens or hundreds of times). At the same time, window functions from the ANSI SQL:2003 standard (the OVER (PARTITION BY ... ORDER BY ...) construction) make it possible to find the current version of the data using only the start date, with almost no performance loss compared to two dates (see the query sketch after this list). Thus, the inefficiency of updates in Vertica makes the Anchor Modeling approach to maintaining data history preferable.
  2. Maintaining ETL performance as volumes grow. Data Vault stores entity attributes grouped into satellite tables. For example, if an entity has 50 attributes with a comparable likelihood of change, the Data Vault methodology recommends storing them in a single table. The Anchor Modeling methodology in the same case requires creating 50 tables, one per attribute. At first glance the Anchor Modeling approach looks redundant, even though it satisfies sixth normal form (6NF): assembling the current values of all 50 attributes is practical only through dedicated materialized views, and without them collecting the attributes is too cumbersome for an analyst working with the warehouse (a schema sketch is given after this list). But how different are the two approaches on Big Data volumes? As part of the experiment, web-traffic records containing 20 columns were loaded in parallel into two structures: a single wide table of 20 columns, corresponding to the Data Vault methodology, and 20 maximally normalized narrow tables, corresponding to the Anchor Modeling methodology. In the first week after the start of the experiment the loading speed of the two approaches differed little, while Data Vault gave better performance for data analysis (no need to join 20 tables into one data mart). However, after a month of loading, when the number of records in each of the tables had exceeded 5 billion, the following results were obtained:
    • A regular run of the ETL process loading the data for the last 20 minutes inserted about 5 million new rows.
    • Collecting the data from the source systems, cleansing and enrichment, identical for both methodologies, took about 10 minutes.
    • Inserting the data into the narrow Anchor Modeling tables ran as 20 independent streams; all of them started simultaneously and each finished in 40 seconds to 2 minutes.
    • Inserting the data into the single wide Data Vault table took from 7 to 9 minutes.
    • A test run of inserting into the wide Data Vault table without the parallel inserts into the Anchor Modeling tables showed the same figures, 7 to 9 minutes.
    The experiment showed that once the number of rows already stored in a table reaches 5 billion, insertion into the Data Vault model becomes about 4 times slower than into the Anchor Modeling model.
    After a month of operation, the ETL processes filling the wide Data Vault table needed a total of 17 to 19 minutes to load the data produced by the external systems over a 20-minute interval (versus 11-12 minutes for the Anchor Modeling table structure).
    Further degradation of the ETL processes populating the wide Data Vault table broke the near-real-time synchronization of the warehouse with the operational data: the warehouse began to lag behind the production systems, and old historical data had to be moved from the wide Data Vault table to an archive. The Anchor Modeling model showed no such flaws.
  3. Maintaining manageability and scalability of the warehouse. Although Vertica is a column store and might seem indifferent to the number of columns in a table, the example above is far from the only demonstration that many narrow tables are better than one wide one. For instance, over time it may become necessary to repartition a table. Vertica makes this operation simple: it creates a copy of the table and repartitions the copy. It would seem that everything is easy. But repartitioning is usually done precisely to drop old historical partitions when disk space runs out, and deleting data in Vertica is a very inefficient operation: it is feasible for millions of rows, but for billions of rows a delete can run for days, blocking all other processes in the system. When disk space is scarce, there may simply not be enough room to create a copy of a table with many columns, and even if there is, repartitioning such a table takes about a day. If, instead of a single wide table, many narrow tables of 3-4 columns are used, then even at 50-60 billion rows they can be repartitioned within hours, and the process can be run table by table, which makes it much easier to find the required free space (a partitioning sketch is given after this list).
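
To illustrate point 1, here is a minimal sketch, assuming a hypothetical Anchor-style attribute table item_price(item_id, price, from_date) that stores only the validity start date, of how the current version can be found with a window function instead of a to_date column:

```sql
-- Insert-only attribute table: the current version is simply the row
-- with the latest from_date per item (table and column names invented).
SELECT item_id, price, from_date
FROM (
    SELECT item_id,
           price,
           from_date,
           ROW_NUMBER() OVER (PARTITION BY item_id
                              ORDER BY from_date DESC) AS rn
    FROM item_price
) versions
WHERE rn = 1;
```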
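
For point 2, here is a simplified sketch of the two physical layouts compared in the experiment, showing only 3 of the 20 attributes; the table and column names are invented, and the reassembling view is only one possible way to give analysts a wide structure back:

```sql
-- Data Vault style: one wide satellite holding all attributes of a hit.
CREATE TABLE sat_hit_wide (
    hit_id     INT       NOT NULL,
    load_ts    TIMESTAMP NOT NULL,
    url        VARCHAR(2048),
    user_agent VARCHAR(1024),
    ip         VARCHAR(45)
    -- ... the remaining columns
);

-- Anchor Modeling style: one narrow table per attribute (6NF).
CREATE TABLE hit_url        (hit_id INT NOT NULL, from_date TIMESTAMP NOT NULL, url        VARCHAR(2048));
CREATE TABLE hit_user_agent (hit_id INT NOT NULL, from_date TIMESTAMP NOT NULL, user_agent VARCHAR(1024));
CREATE TABLE hit_ip         (hit_id INT NOT NULL, from_date TIMESTAMP NOT NULL, ip         VARCHAR(45));

-- A view that reassembles the narrow tables into a wide structure for analysts.
CREATE VIEW hit_wide_v AS
SELECT u.hit_id, u.url, ua.user_agent, i.ip
FROM hit_url u
LEFT JOIN hit_user_agent ua ON ua.hit_id = u.hit_id
LEFT JOIN hit_ip         i  ON i.hit_id  = u.hit_id;
```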
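
For point 3, a hedged sketch of how the same kind of narrow attribute table can be partitioned by month, so that old history is removed by dropping whole partitions rather than deleting rows; the table name is invented, and the partition-management function names vary between Vertica versions, so verify them against the documentation for your release:

```sql
-- Narrow 3-column attribute table partitioned by month.
CREATE TABLE hit_url (
    hit_id    INT       NOT NULL,
    from_date TIMESTAMP NOT NULL,
    url       VARCHAR(2048)
)
PARTITION BY EXTRACT(YEAR FROM from_date) * 100 + EXTRACT(MONTH FROM from_date);

-- Dropping an old month removes whole storage containers instead of
-- deleting billions of individual rows (assumed function name; check
-- whether your Vertica version uses DROP_PARTITION or DROP_PARTITIONS).
SELECT DROP_PARTITION('public.hit_url', 201401);
```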


Conclusions

A brief summary: the Anchor Modeling methodology turned out to be surprisingly well suited to building a Big Data warehouse on HP Vertica.
The examples given here are only a part of what we had to deal with, and it almost always turned out that had we not followed the Anchor Modeling path, things would have been much harder for us.
However, I do not recommend assuming that Anchor Modeling by itself will solve all of an architect's problems. The strength and the weakness of this methodology lie in its very rigid structuring of all operations: for the warehouse to remain maintainable, generation of the data models, of the DDL for database objects and of the ETL procedures must be fully automated.
We solved this with our own Python code, storing the metadata descriptions in Excel. As a result, the entire infrastructure was built in a month, and two months after first getting acquainted with Vertica the warehouse was solving its first business tasks and had reached a volume of 1 TB.
That is about it. If the topic is of interest, I plan to write about other nuances of working with a highly normalized warehouse in HP Vertica, in particular how to emulate the MapReduce paradigm on top of SQL when plain SQL is not enough.

Source: https://habr.com/ru/post/227111/

