In this article I want to talk about a task you will have to think about, and be able to solve, once a component like Hadoop appears in your analytical data platform: the task of integrating Hadoop data with the data in the corporate DWH. In the Data Lake at Tinkoff Bank we have learned to solve this problem efficiently, and below I will explain how we did it.
This article continues a series of articles about the Data Lake at Tinkoff Bank (the previous article was "Data Lake: from theory to practice. A tale about how we build ETL on Hadoop").
Task
A lyrical digression. The picture above shows a picturesque lake, or rather a system of lakes: one smaller, one bigger. The smaller one, beautiful and refined, with yachts, is the corporate DWH. The one visible on the horizon, which does not fit into the picture because of its size, is Hadoop. Lyrical digression over, to the point.
Our task was fairly trivial in terms of requirements and not at all trivial in terms of technology selection and implementation. We had to dig a channel between these two lakes: to establish a simple and efficient way to publish data from Hadoop to the DWH and back, as part of the scheduled processes that run in our Data Lake.
Technology selection
It would seem that the task is very simple: learn to move data quickly from a Hive table to a Greenplum table and back. Such problems are usually solved with ETL. But, keeping in mind the size of the tables (tens of millions of rows, gigabytes of data), we started with a study. In it we compared four approaches:
- Sqoop - a tool from the Hadoop ecosystem for transferring data between structured data stores and HDFS;
- Informatica Big Data Edition - used as an ELT platform for batch data processing in Hadoop;
- SAS Data Integration Studio - used as an ELT platform for data processing in the corporate DWH (Greenplum);
- gphdfs - a utility included in the Greenplum DBMS for working with HDFS (reading/writing data).
Below I describe the advantages and disadvantages of each of them.
Sqoop
Sqoop is a tool for transferring data between Hadoop clusters and relational databases. With it you can import data from a relational DBMS such as SQL Server, MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop using MapReduce or Hive, and then export it back to the relational database.
Since no transformation was required by the task, it would seem that Sqoop is ideally suited to solve it: as soon as a table needs to be published (either to Hadoop or to Greenplum), you write a Sqoop job and call it from one of the schedulers (SAS or Informatica), depending on the schedule.
Everything is fine, but Sqoop talks to Greenplum via JDBC, and we ran into extremely low performance: a 30 GB test table took about an hour to load into Greenplum. The result is extremely unsatisfactory, so we rejected Sqoop. That said, it is a very convenient tool when, for example, you need a one-off load into Hadoop of a not very large table from a relational database. But to build scheduled processes on Sqoop, you need a clear understanding of the performance requirements of those processes and should base your decision on that.
Informatica Big Data Edition
We use Informatica Big Data Edition as an ELT engine for data processing in Hadoop. That is, with Informatica BDE we build in Hadoop the data marts that need to be published to Greenplum, where they become available to the bank's other application systems. It seems logical: once the ELT processes on the Hadoop cluster have built a data mart, push that mart to Greenplum. To work with Greenplum, Informatica BDE has a PWX for Greenplum adapter, which can run in Native mode and in Hive mode. That is, as soon as a table needs to be published from Hadoop to Greenplum, you write a mapping in Informatica BDE and run it on the Informatica scheduler.
Everything is good, but there is a nuance. PWX for Greenplum in Native mode works like classic ETL: it reads data from Hive to the ETL server, and on the ETL server it starts a gpload session and loads the data into Greenplum. As a result, the entire data flow is bottlenecked on the ETL server.
Next we experimented with Hive mode. PWX for Greenplum in Hive mode works without the ETL server: the ETL server only orchestrates the process, while all data handling happens on the Hadoop cluster side (Informatica BDE components are also installed on the Hadoop cluster). In this case gpload sessions are started on the Hadoop cluster nodes and load the data into Greenplum. Here there is no bottleneck in the form of an ETL server, and the performance of this approach turned out to be quite good: a 30 GB test table was loaded into Greenplum in about 15 minutes. But at the time of our research, PWX for Greenplum in Hive mode was unstable. And there is one more important point: if you want to publish data in the opposite direction (from Greenplum to Hadoop), PWX for Greenplum works over ODBC.
As a result, we decided not to use Informatica BDE for this task.
SAS Data Integration Studio
We use SAS Data Integration Studio as an ELT engine for data processing in Greenplum. Here the picture is different: Informatica BDE builds the required data mart in Hadoop, then SAS DIS pulls this mart into Greenplum. Or, the other way round, SAS DIS builds a mart in Greenplum and then pushes it into Hadoop. Looks nice. SAS DIS has dedicated SAS Access Interface to Hadoop components for working with Hadoop. But, drawing a parallel with PWX for Greenplum, SAS Access Interface to Hadoop has no Hive-like mode of operation, so all the data flows through the ETL server. The resulting process performance was unsatisfactory.
gphdfs
gphdfs is a utility included in the Greenplum DBMS that makes it possible to organize parallel data transport between the Greenplum segment servers and the Hadoop data nodes. We ran experiments publishing data from Hadoop to Greenplum and back, and the performance was simply amazing: a 30 GB test table was loaded into Greenplum in about 2 minutes.
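To give an idea of how gphdfs is used, here is a minimal sketch of a readable external table in Greenplum backed by the gphdfs protocol. The namenode host and port, HDFS path, schema and column names are illustrative assumptions, not our actual objects:

```sql
-- Readable external table over text files in HDFS via the gphdfs protocol
-- (hypothetical namenode address and HDFS path)
CREATE EXTERNAL TABLE work.ext_sample (
    id         bigint,
    event_dt   date,
    amount     numeric(18,2)
)
LOCATION ('gphdfs://namenode:8020/user/etl/work/sample_txt')
FORMAT 'TEXT' (DELIMITER ',');

-- Greenplum segments read the HDFS files in parallel
SELECT count(*) FROM work.ext_sample;
```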
Analysis of the results
For clarity, the table below summarizes the results of the study.
| Technology | Complexity of integration into scheduled processes | Complexity of process development | Process performance (Hadoop -> Greenplum) | Process performance (Greenplum -> Hadoop) |
| --- | --- | --- | --- | --- |
| Sqoop | Complicated | Low | Poor (JDBC) | Poor (JDBC) |
| Informatica Big Data Edition (PWX for Greenplum in Native mode) | Easy | Low | Unsatisfactory (gpload on the ETL server) | Poor (ODBC) |
| Informatica Big Data Edition (PWX for Greenplum in Hive mode) | Easy | Low | Satisfactory (gpload on Hadoop cluster nodes) | Poor (ODBC) |
| SAS Data Integration Studio (SAS Access Interface to Hadoop) | Easy | Low | Unsatisfactory | Unsatisfactory |
| gphdfs | Complicated | High | Very high (gphdfs) | Very high (gphdfs) |
The conclusion is mixed: the technology with the fewest performance problems is a utility that is completely unacceptable to use as-is for developing ETL processes. We thought about it... The SAS Data Integration Studio platform lets you develop your own components (transforms) for it, so we decided to reduce both the complexity of developing the ETL processes and the complexity of integrating them into the scheduled processes by developing two transforms that make working with gphdfs easy without losing the performance of the target processes. Below I describe the implementation details.
Implementation of transforms
These two transforms have a rather simple job: to perform a series of operations around Hive and gphdfs, one after another.
An example (the design) of the transform that publishes data from Hadoop to Greenplum:
- Hive Table - the table in Hive, registered in the SAS DI metadata;
- Transform - the transform, whose steps I describe below;
- Greenplum Table - the target or work table in Greenplum.
What the transform does (a sketch of the SQL it generates follows the list):
- Creates an external table in the work database in Hive. The external table is created with a serializer that gphdfs understands (i.e. either CSV or TEXT);
- Copies the data from the source Hive table into the work table in Hive created in the previous step. We do this to bring the data into a format gphdfs understands. Since this step runs on the cluster, we lose nothing in performance; in addition, we become independent of the data format used by the source table in Hive (PARQUET, ORC, etc.);
- Creates, in the job's work schema in Greenplum, a gphdfs external table that points at the files in HDFS written in the previous step;
- Executes a select from the external table created in the previous step: profit! The data flows from the Hadoop cluster data nodes to the segment servers of the Greenplum cluster.
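A rough sketch of the sequence of statements such a transform generates. All database, schema, table and column names, the HDFS path and the namenode address are hypothetical and only illustrate the idea:

```sql
-- 1. In Hive: a work external table in a text format that gphdfs can read
CREATE EXTERNAL TABLE work.stg_customers_txt (
    customer_id BIGINT,
    full_name   STRING,
    balance     DECIMAL(18,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/etl/work/stg_customers_txt';

-- 2. In Hive: copy the source table (PARQUET, ORC, ...) into the text work table;
--    this runs on the Hadoop cluster, so nothing flows through the ETL server
INSERT OVERWRITE TABLE work.stg_customers_txt
SELECT customer_id, full_name, balance
FROM dm.customers;

-- 3. In Greenplum: a gphdfs external table over the files written in step 2
CREATE EXTERNAL TABLE work.ext_customers (
    customer_id bigint,
    full_name   text,
    balance     numeric(18,2)
)
LOCATION ('gphdfs://namenode:8020/user/etl/work/stg_customers_txt')
FORMAT 'TEXT' (DELIMITER ',');

-- 4. In Greenplum: the segments pull the data from the HDFS data nodes in parallel
INSERT INTO dwh.customers
SELECT customer_id, full_name, balance
FROM work.ext_customers;
```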
The developer only needs to add this transform to the job and specify the names of the input and output tables.
The development of such a process takes about 15 minutes.
An analogous transform was implemented for publishing data from Greenplum to Hadoop.
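For the reverse direction the idea is mirrored: Greenplum writes text files to HDFS through a writable gphdfs external table, and Hive registers an external table over those files. Again, all names, paths and the namenode address below are illustrative assumptions:

```sql
-- In Greenplum: writable external table backed by gphdfs;
-- the segments write the files to HDFS in parallel
CREATE WRITABLE EXTERNAL TABLE work.wext_transactions (
    transaction_id bigint,
    account_id     bigint,
    amount         numeric(18,2)
)
LOCATION ('gphdfs://namenode:8020/user/etl/work/transactions_out')
FORMAT 'TEXT' (DELIMITER ',');

INSERT INTO work.wext_transactions
SELECT transaction_id, account_id, amount
FROM dwh.transactions;

-- In Hive: external table over the files written by Greenplum
CREATE EXTERNAL TABLE work.transactions_txt (
    transaction_id BIGINT,
    account_id     BIGINT,
    amount         DECIMAL(18,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/etl/work/transactions_out';
```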
IMPORTANT. One more benefit we got from solving this problem: we are now potentially ready to organize the offloading of data from the corporate DWH to cheaper Hadoop storage.
Conclusion
What I wanted to say comes down to two main points:
1. When you work with large amounts of data, be very careful when choosing technology. Study the task you are going to solve from all sides, pay attention to the strengths and weaknesses of each technology, and try to avoid bottlenecks. The wrong choice of technology can strongly affect, even if not immediately, the performance of the system and, as a result, the business process your system supports;
2. Do not be afraid of, but rather welcome, extending your data integration platform with components written in-house. This can reduce the cost and time of further development and support by orders of magnitude.