
The Dual ETL project, or how we built Disaster Recovery for Greenplum

In this article I want to talk about another stage in the development of the DWH at Tinkoff Bank.

It's no secret that Disaster Recovery (hereinafter DR) requirements for modern business information systems fall into the "must have" category. So, a little over a year ago, the team developing the DWH at the bank was tasked with implementing DR for the DWH on which both the offline and online processes of the bank are built.




Problem statement


So, the task was as follows:


Objective: within an hour of a failure that renders Greenplum at the primary site inoperable, restore the operation of all DWH processes on the backup Greenplum cluster. In essence, the task was to build a hot standby for Greenplum.

Research


We were given about a month for research and for working out the concept.

Naturally, the first thing that occurred to us was to look toward the vendor, Pivotal (i.e. EMC). The research showed that Greenplum has no built-in tool for building a hot standby, and that a DR solution for Greenplum could potentially be built on top of EMC Data Domain. But a deeper look at Data Domain made it clear that the technology is optimized for storing a large number of backups and is therefore quite expensive. Out of the box, Data Domain also lacks the ability to keep a second cluster in an up-to-date state. We dropped EMC Data Domain from consideration.

The second option we explored was a third-party Greenplum-to-Greenplum replication tool. This option was dropped quickly: at the time, there simply were no replication tools that supported Greenplum as a source.

The third option we took up was a Dual Apply class solution. We looked at the Informatica dual load for Teradata dual active solution (Teradata Magazine) from Teradata and Informatica, and then surveyed the technology market for a way to build something similar for Greenplum. We found nothing ready-made.

After this research we decided that in-house development would be the best option. So we started writing our own system and called it Dual ETL (within the development team the project was nicknamed "Duet").

Duet and how we built it


Conceptual architecture


The system is built around a single principle: as soon as the ETL has built a table on the primary Greenplum cluster, the system must catch this event and transfer that table's data to the backup Greenplum cluster. By following this principle we build the DWH, synchronously but with a small time lag, on two clusters in two geographically distant data centers.


Fig.1 Conceptual architecture

During the development of the architecture, the system was divided into six components:


Extending the existing ETL


To tell the system that a table is ready, we implemented a queue and taught our ETL to put an object (i.e., a table) into it as soon as the ETL finished building that table. This gave us event delivery into the system, which then had to rebuild the table on the backup Greenplum cluster.
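The article does not describe how the queue itself is implemented. Below is a minimal sketch of the idea, assuming the queue is a plain table in a service database and the ETL calls a shell hook as its final step; the host, database, schema, and column names are all invented for illustration:

    #!/bin/bash
    # Hypothetical post-build hook: the ETL calls this right after it
    # finishes building a table. The queue table dual_etl.queue and the
    # service database are assumptions, not the authors' actual schema.
    TABLE_NAME="$1"   # e.g. dwh.fct_payments

    psql -h etl-service-host -d dual_etl_db -c \
      "INSERT INTO dual_etl.queue (table_name, status, enqueued_at)
       VALUES ('${TABLE_NAME}', 'NEW', now());"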

Implementation of the control component


Because our ETL task scheduler is written in SAS, and the DWH team has extensive expertise in the SAS Macro language, we decided to write the control mechanism in SAS as well.

The implemented mechanism performs a few simple actions: take a new table from the queue, start the Backup component, ship the resulting table dump to the backup site, start the Restore component. On top of that, it is multithreaded, and the number of threads can be tuned for each type of task (Backup, transport, Restore, moving backups to the storage system); of course, there is also such essential functionality as logging and e-mail notifications.
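The controller itself is SAS Macro code, which we will not reproduce here. Purely as an illustration of the flow, here is a simplified bash sketch with a tunable worker pool per stage; the queue table and the stage scripts are the hypothetical ones from the previous sketch:

    #!/bin/bash
    # Simplified illustration of the control loop; the real controller is SAS.
    # Each stage gets its own worker pool, tunable per task type.
    BACKUP_THREADS=4; TRANSPORT_THREADS=8; RESTORE_THREADS=4

    run_stage() {                  # $1 = stage script, $2 = table name
      echo "$(date '+%F %T') $1 start $2" >> /var/log/dual_etl.log
      "./$1.sh" "$2"               # backup.sh / transport.sh / restore.sh
    }
    export -f run_stage

    # Take fresh tables from the queue and push them through the stages.
    tables=$(psql -At -d dual_etl_db \
      -c "SELECT table_name FROM dual_etl.queue WHERE status = 'NEW';")

    echo "${tables}" | xargs -r -P "${BACKUP_THREADS}"    -I{} bash -c 'run_stage backup "$1"'    _ {}
    echo "${tables}" | xargs -r -P "${TRANSPORT_THREADS}" -I{} bash -c 'run_stage transport "$1"' _ {}
    echo "${tables}" | xargs -r -P "${RESTORE_THREADS}"   -I{} bash -c 'run_stage restore "$1"'   _ {}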

Implementation of the Backup component


For a given table, the Backup component calls the gp_dump utility, which leaves the table's dump spread across the segment servers of the primary Greenplum site. An example of calling gp_dump:

gp_dump --username=gpadmin --gp-d=$DIRECTORY --gp-r=$REPORT_DIR $COMPRESS -t $TABLE db_$DWH_DP_MODE &> /tmp/gp_dump_$NOW 
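As a sketch of how the variables in this call might be populated, here is a hypothetical backup.sh wrapper; the directory paths and the database-name suffix are assumptions:

    #!/bin/bash
    # backup.sh <table> -- hypothetical wrapper around the gp_dump call above.
    TABLE="$1"
    DIRECTORY=/data/dual_etl/dumps     # assumed dump directory on each segment host
    REPORT_DIR=/data/dual_etl/reports
    COMPRESS=--gp-c                    # gzip-compress the per-segment dump slices
    DWH_DP_MODE=prod                   # assumed suffix of the database name
    NOW=$(date +%Y%m%d_%H%M%S)

    gp_dump --username=gpadmin --gp-d=$DIRECTORY --gp-r=$REPORT_DIR \
            $COMPRESS -t "$TABLE" db_$DWH_DP_MODE &> /tmp/gp_dump_$NOW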


Implementation of the transport component


The main task of the transport component is to quickly move dump files from each Greenplum segment server of the primary site to the corresponding segment server of the backup site. Here we ran into a network constraint: the segment servers of the primary cluster cannot reach the segment servers of the backup cluster over the network. Thanks to our administrators' expertise, we found a workaround using SSH tunnels, which were set up on the standby Greenplum master servers of each cluster. This way, each slice of a table's dump was transferred from a segment server of the primary site to a segment server of the backup site.
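The article gives no tunnel commands; the sketch below shows the general idea, with every hostname and port invented for illustration:

    # Run on the standby master of the primary site: listen on port 2201 and
    # forward connections to a backup-site segment host through the backup
    # standby master, which both sides can reach. (Hostnames are invented.)
    ssh -f -N -L '*:2201:seg01.backup.dc:22' gpadmin@standby-master.backup.dc

    # Run on a primary-site segment server: the SSH connection lands on
    # seg01.backup.dc:22, so the destination path is on the backup segment.
    scp -P 2201 /data/dual_etl/dumps/gp_dump_slice.gz \
        gpadmin@standby-master.primary.dc:/data/dual_etl/dumps/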

Implementation of the Restore component


After the transport component finishes, we have the table's dump spread across the Greenplum segment servers of the backup site. For this table the Restore component runs the gp_restore utility, which gives us an up-to-date table at the backup site. An example of calling gp_restore:

 gp_restore $COMPRESS --gp-d=$SOURCE --gp-r=$REPORT_DIR --gp-k=$KEY -s --status=$SOURCE -d db_$DWH_DP_MODE > /tmp/gp_restore_$(echo $KEY)_$RAND 2>>/tmp/gp_restore_$(echo $KEY)_$RAND 
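For symmetry, a hypothetical restore.sh wrapper around this call; --gp-k takes the timestamp key that gp_dump assigned to the dump set, so gp_restore can find the matching slices on each segment server (the paths and database-name suffix are assumptions):

    #!/bin/bash
    # restore.sh <key> -- hypothetical wrapper around the gp_restore call above.
    KEY="$1"                        # timestamp key of the gp_dump set
    SOURCE=/data/dual_etl/dumps     # assumed directory holding the dump slices
    REPORT_DIR=/data/dual_etl/reports
    COMPRESS=--gp-c
    DWH_DP_MODE=prod
    RAND=$RANDOM

    gp_restore $COMPRESS --gp-d=$SOURCE --gp-r=$REPORT_DIR --gp-k=$KEY \
        -s --status=$SOURCE -d db_$DWH_DP_MODE \
        > /tmp/gp_restore_${KEY}_${RAND} 2>> /tmp/gp_restore_${KEY}_${RAND}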


Monitoring implementation


After the main components were developed and the first runs were made, we had a system that generally worked. Tables were rebuilt at the backup site, logging worked, e-mails arrived, and the system seemed to work, but we had no clear picture of what was happening at any specific moment. So we took up monitoring, splitting the work into two steps: selecting the metrics to monitor and the technical implementation of the monitoring itself.

The metrics were chosen quite quickly; taken together, they should, in our view, make it unambiguously clear what is happening in the system at any given moment:


The technical implementation was also settled quickly: Graphite + Grafana. Deploying Graphite on a separate virtual machine and instrumenting the chosen metrics was not difficult, and on top of the running Graphite we built a nice dashboard in Grafana.
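The article does not show how the metric values get into Graphite; one straightforward way, assuming the standard Carbon plaintext protocol on its default port, is sketched below (the host and metric path are invented):

    #!/bin/bash
    # Push one sample to Graphite's Carbon plaintext listener (default port 2003).
    # The host name and metric path are invented for illustration.
    GRAPHITE_HOST=graphite.dwh.local

    queue_len=$(psql -At -d dual_etl_db \
      -c "SELECT count(*) FROM dual_etl.queue WHERE status = 'NEW';")

    # -q0: close the connection right after sending (GNU netcat; use -w1 elsewhere).
    echo "dual_etl.queue.new ${queue_len} $(date +%s)" | nc -q0 "$GRAPHITE_HOST" 2003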

All of the above metrics got a visualization, and, just as importantly, a live one:


Fig.2 Number of objects in statuses


Fig.3 Number of errors in statuses


Fig.4 Backlog from the moment of entering the queue


Fig.5 Delay at the start of the stage


Fig.6 Average duration of the stage (for the last 10 objects)


Fig.7 Number of working threads at each stage

Post-processing


After the restore completes, the dump files are moved to the storage system at the backup site. Replication between the backup-site storage and the primary-site storage is configured on top of NetApp SnapMirror. Thus, if a failure at the primary site ever calls for data recovery, a prepared backup is already waiting on the storage system.

Results


What we got


The system was developed, and it turned out rather well. The tasks it was meant to solve are fully covered. Once development was complete, a set of procedures was drawn up that made switching to the backup site part of the regular DWH support process.
The main thing we got is, of course, the ability to switch to the backup site within 30 minutes of a failure at the primary one. This keeps DWH downtime to a minimum and, as a result, lets the business keep working with the reports provided for analysis, keep running ad-hoc queries, and keep a number of online processes going. The system also allowed us to abandon the daily routine backup procedure in favor of the backups produced by Dual ETL.

Job statistics


About 6,500 objects (tables) with a total volume of about 20 TB pass through the system per day.
Take the "Backlog from the moment of entering the queue" metric as an example (see Fig. 8):


Fig.8 Backlog from the moment of entering the queue

It shows how far the backup site lags behind the primary one. While the nightly scheduler is running and the ETL is actively building the warehouse, the lag peaks at 2-3 hours. By the time the warehouse build finishes, around 10 a.m., the backlog shrinks and then stays at 5-10 minutes. During the day, while the online scheduler runs, lag spikes of up to 30 minutes can occur.

Also, fairly recently our system had a small anniversary: the 1,000,000th (millionth) object flew through it!

Epilogue


The DWH team at Tinkoff Bank is taking a strategic course toward Hadoop. In upcoming articles we plan to cover the topics of "ETL/ELT on Hadoop" and "Ad-hoc reporting on Hadoop".

Source: https://habr.com/ru/post/248853/

