
In this article I would like to continue describing the technologies used at TCS Bank for building its DWH. The article may be of interest to those who plan to use log-mining Change Data Capture (CDC) to replicate data from operational sources into the online layer of a warehouse built on the Greenplum database.
Instead of an introduction
Every modern IT or financial organization runs many different operational accounting systems designed to store and process operational data. Since putting reporting load on these sources is not acceptable (it can affect the performance of business-critical processes), organizations build a DWH, and in most cases data is loaded into the warehouse at night, during the period of lowest load on the operational systems.
Log-mining CDC technology solves several problems at once:
- reduced load on operational sources while data is loaded into the DWH. Products that implement this technology typically read the transaction log files of the source systems, parse them, and apply the changes to the target system. Because the transaction logs are parsed by third-party software, the load on the source systems stays low;
- an online (or near-online) replica of the source system's data in the target system, which makes near-online reporting possible;
- ETL developers can access the replicas of the source tables at any time, not only in the windows allocated for reading the source systems;
- most products can maintain, alongside the replicas, so-called change tables that record the entire history of changes to the source tables (a rough sketch of such a table follows this list).
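As a rough illustration of what a change table might contain (the table and column names here are hypothetical, not the layout of any particular product), it usually holds the replicated columns plus metadata about each change:

-- Hypothetical change table: replicated source columns plus change metadata.
CREATE TABLE accounts__ct (
    change_seq   bigint,         -- monotonically increasing change number
    change_oper  char(1),        -- I = insert, U = update, D = delete
    change_ts    timestamp,      -- commit time of the source transaction
    account_id   bigint,         -- the replicated source columns follow
    balance      numeric(18,2),
    status       varchar(16)
);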
Summary of Bank Source Systems Required for Replication
The Bank's DWH is built on the Greenplum database; all data sources that need replication run on the Oracle DBMS. Summary figures for the replicated sources are given in the table:
| Parameter | Value |
|---|---|
| Number of source systems for replication | 5 |
| Number of replicated tables | ~200 |
| Volume of transaction logs (GB/hour) | 10 / 160 / ~40 |
| Total number of operations per day | ~50 million |
Searching for a ready-made solution
As a CDC solution, we considered the following products:
- Oracle GoldenGate. A product currently developed and supported by Oracle. It struck us as a remarkable product in terms of stability and the performance of reading changes from source systems. Its disadvantages, however, are that it does not support Greenplum out of the box and has no built-in means for initial data loading.
- Informatica Data Replicator. A product that supports Oracle as a source and Greenplum as a target and has tools for initial data loading. It guarantees data integrity between source and target; to do so, however, it needs the transaction logs of all transactions open at the moment replication starts, which is not always acceptable given the limited disk space available for transaction logs.
- Attunity Replicate. A newer product that, like Informatica Data Replicator, supports Oracle as a source and Greenplum as a target and has tools for initial data loading. The solution has a simple management console and aims to work out of the box.
Each CDC solution has its own mechanism for applying changes to the target, and that mechanism depends largely on the level of supplemental logging at the source (I recommend the excellent article by Alexander Ryndin on supplemental logging). Oracle GoldenGate works remarkably well with full supplemental logging (consistently applying all INSERTs to the target as a batch, only the last UPDATE per key, and all DELETEs). A peculiarity of one of our sources is that its tables have many columns and a large number of rows get updated, yet each individual UPDATE touches only a few fields. Enabling full supplemental logging on all fields of this source led to an incredible growth of the transaction logs, so we decided to enable supplemental logging only on primary keys. Because of that, we had to write rather complex scripts for aggregating and applying changes.
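For reference, the difference between the two logging levels is a single piece of Oracle DDL (standard Oracle syntax; src.accounts is just an example table name):

-- Full supplemental logging: every column value is written to redo,
-- inflating the transaction logs on wide, frequently updated tables.
ALTER TABLE src.accounts ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- Primary-key-only supplemental logging: the variant we settled on.
ALTER TABLE src.accounts ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;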
Launching the project in production
So, we chose Attunity Replicate: it turned out to be the only product that met all the requirements of our pilot project. At the time of rollout the current version was 2.0.
In that version the only implemented method of applying changes was row-by-row, in which each change at the source is applied to the target as a separate SQL statement:
update target set f1='a', f2='b', f3='c' where key='key1'
The disadvantages of this approach in Greenplum surfaced as soon as we added tables to replication whose volumes, both in row counts and in numbers of operations, exceeded those used in the pilot project. Greenplum simply cannot digest the combined transaction rate of the source databases: on the one hand it is an analytical DBMS not designed for high transactional load, and on the other hand the combined transaction rate of all the source databases is so high that it would be a problem for any DBMS.

For this kind of CDC system the optimal mode, in our opinion, is mini-batch, in which changes are applied to the target database as pre-aggregated net transactions: changes accumulate in the CDC system and are then applied on the target once per interval, for example once a minute. We had already used this mode when investigating the applicability of GoldenGate, and the aggregation and apply scripts we had developed were handed over to Attunity for study. After studying the scripts in detail, Attunity released Replicate version 2.1, which introduced a mini-batch mode. Updates can now be applied as follows:
update target set f1 = decode(net.f1, null, target.f1, '<ATT_NULL>', null, net.f1) ,f2 = decode(net.f2, null, target.f2, '<ATT_NULL>', null, net.f2) ,... from net where target.key=net.key
Here, net is a table into which the changes are pre-aggregated by Attunity Replicate itself. With this refinement the number of queries against the target database dropped dramatically, and we are now able to handle the full volume of changes produced by our sources. Attunity called this technology turbo stream CDC, and the name is fully justified: under average source load, data is replicated to Greenplum with an average delay of 60-80 seconds. There are, of course, problems. Under peak load at the source (for example, at the close of the business day) the delay grows, sometimes significantly, up to 1.5 hours. The reason is that Attunity Replicate works through LogMiner, which does not always keep up with the database logs at peak load.
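The decode() calls above are Oracle-flavored shorthand; in plain Greenplum/PostgreSQL syntax the same apply logic can be sketched with CASE expressions, using the article's own conventions: net.f1 IS NULL means the field was not touched by any change in the batch, while the literal '<ATT_NULL>' means it was explicitly set to NULL.

-- One mini-batch apply: a single UPDATE ... FROM instead of thousands
-- of row-by-row statements.
UPDATE target
SET f1 = CASE WHEN net.f1 IS NULL        THEN target.f1  -- field untouched
              WHEN net.f1 = '<ATT_NULL>' THEN NULL       -- explicit NULL
              ELSE net.f1 END,
    f2 = CASE WHEN net.f2 IS NULL        THEN target.f2
              WHEN net.f2 = '<ATT_NULL>' THEN NULL
              ELSE net.f2 END
FROM net
WHERE target.key = net.key;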
While changes are being applied, so-called apply conflicts are possible, i.e. conflicts in applying changes to the target. The most frequent are 0 rows affected (no record was updated on the target) and duplicate key (inserting a record that already exists on the target). In normal operation these conflicts can arise during the initial load, because, to guarantee data consistency on the target, Replicate starts capturing changes slightly before the full unload begins. Attunity Replicate detects such conflicts, switches to row-by-row mode, and saves them to a special system table for further analysis.
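A generic way to tolerate both conflict types (a common pattern on Greenplum/PostgreSQL of that generation, which has no native upsert; this is not a description of Attunity's internal logic) is to update first and then insert only the keys that found no match:

-- Sketch of a conflict-tolerant apply. The UPDATE absorbs rows that would
-- otherwise raise "duplicate key" on insert; the INSERT covers keys that
-- would otherwise yield "0 rows affected" on update.
UPDATE target SET f1 = net.f1 FROM net WHERE target.key = net.key;

INSERT INTO target (key, f1)
SELECT net.key, net.f1
FROM net
WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.key = net.key);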
Working with LogMiner
CDC solutions can use two different mechanisms for reading source transaction logs: their own binary reader, or the facilities of the source DBMS itself. In Oracle's case that facility is Oracle LogMiner, which can return data in two modes:
- committed data (returns DML for committed transactions only)
- raw data (returns data about all transactions)
In committed data mode, Oracle LogMiner's performance is dreadful and it consumes additional resources on the source. In raw data mode, performance is much better and fewer resources are needed, but the application must then take over the work of analyzing the results returned by LogMiner and generating from them the SQL statements to be applied to the target.
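For illustration, the mode is selected when the LogMiner session is started (these are the standard Oracle DBMS_LOGMNR calls; the archived log path is an example):

-- Committed-data mode: convenient but slow; LogMiner itself reassembles
-- transactions and returns only committed DML.
BEGIN
  DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => '/u01/arch/redo_0001.arc',
                          OPTIONS     => DBMS_LOGMNR.NEW);
  DBMS_LOGMNR.START_LOGMNR(OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG
                                    + DBMS_LOGMNR.COMMITTED_DATA_ONLY);
END;
/

-- Raw mode: drop COMMITTED_DATA_ONLY. Every operation, committed or not,
-- appears in V$LOGMNR_CONTENTS, and the reader must track COMMIT/ROLLBACK
-- records itself.
SELECT scn, operation, sql_redo FROM V$LOGMNR_CONTENTS;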
At this stage Attunity Replicate works through Oracle LogMiner in raw mode, but Attunity plans to stabilize its own Binary Reader, which can be configured to read the transaction logs directly from the storage array where they reside, so that even the small load placed on the source system during change capture disappears completely.
Current project status
Currently version 2.2 is successfully running in production, and version 3.0 of Attunity Replicate is being tested in parallel. The load on the source systems (outside initial-load periods) has proved insignificant, because LogMiner is invoked in raw data mode. On the Greenplum side everything is in order as well: although the number of transactions executed by Attunity Replicate is large, it is far smaller than the number of transactions executed on the source systems, and since each transaction carries only a small amount of changes, the load on the target is also insignificant.
Instead of a conclusion
This article is the second step in our coverage of DWH technologies at the Bank. Ahead lies the story of the online warehouse (whose foundation, as you can easily guess, is log-mining CDC), and much more of interest.