
Three letters you don't know about

Speaking at meetings and presentations about database technology, I am constantly amazed at how wide the gap can be between what a person does and the field he supposedly works in. It is even more striking that such a gap exists among technical specialists. Let me clarify with an example: a person's business card says "Database Specialist", but it turns out that this person can configure backups in RMAN, create indexes, and restart instances. All of his database knowledge fits into knowing how to use these three tools. Oh, and of course he knows what the SGA is. Remarkably, one gets the feeling that such a person has drawn a boundary around what he knows, and whatever lies beyond that boundary seems to him uninteresting and unworthy of attention.

Having finished the introductory part of a meeting and moved on to concepts such as enterprise data warehouses, slowly changing dimensions, data marts, and multidimensional OLAP cubes, I begin to notice the business people reaching into their pockets for their phones, pretending they urgently need to answer a message or an email. But they can be forgiven; they have their own no less interesting abbreviations to worry about improving: EBITDA, ROI, NPV, IRR. The technicians, however, cannot afford this luxury, so they put on serious expressions suggesting that their mental abilities are about to be put to the test.

Reading the pseudo-scientific journals of a well-known publisher about the present and future of information technology in Russia, which try to present information to the reader in an academic form, with a long introduction and sometimes even theory, I can imagine how, with the arrival of a technology, a whole universe of possibilities opens up, promising companies great profits and bringing relief to the technicians. Yet in real life there are far more people who know the theory poorly and have acquired their knowledge through practice. It is no wonder that the horizons of such people are limited to the things they have been able to "touch" live. Anything outside their practice seems a dark forest, sometimes even obscurantism and sophistry. "What are you on about, what Kimball, what does Inmon have to do with anything? That is all the fruit of idle speculation, and we have work to do here!"

But I have gotten carried away; Inmon probably really has nothing to do with it. Still, the technology I am talking about should be used as one of the components of such a solution. Finding no response in the hearts of my listeners, I nevertheless get a question from the inquisitive technical specialists who are more active than their colleagues and, perhaps thanks to that, are already paying off a mortgage :) It is a difficult question, for some even a sacramental one. It sounds like this: "What does Oracle think about this, and how does it work there?"
Well, if you please, here is the definition Oracle gives: "Change Data Capture effectively identifies and captures data that has been added, updated, or deleted from Oracle tables and makes this changed data available to applications or individuals."

Although this formulation is notable for its conciseness, it does not give a complete picture of the benefits that CDC technology can offer its consumers.

And what does an information-technology giant like IBM think about all this? Does it support the technology, and how does it describe it in its products? Unfortunately, I could not find anything like a complete description on the IBM website. I will quote the best, and perhaps the only, thing I found:
"The log-based Change Data Capture (CDC) technology that IBM obtained through its acquisition of DataMirror replicates information about the most important data events in real time, without affecting system performance."

After such a description everyone must, of course, understand the essence of CDC technology and why it is needed; at least, that is what IBM is counting on when it puts this information on its website. I strongly doubt that descriptions of this kind can in any way help one advance in understanding, let alone in using, this technology.

You can also read what Microsoft writes on the subject, if you have a lot of free time you do not know how to spend and have decided to kill it on a completely useless activity.

That is why, in this article, I will try to explain what this technology is and why it is needed. Imagine that you need to track the data changes occurring in certain tables of a database (why you would need this, I will explain later). At the moment this need is born, you start thinking about how to implement it. At the dawn of its development, CDC offered the following solution: add a new column with a timestamp, and voila, we now know which rows were updated after a given moment. But for this we have to read the entire table, and if changes happen too often this becomes a long and expensive process, with a non-zero probability of losing a change altogether. To avoid losing changes, row versioning was tried, with roughly the same result: it works, but poorly, and it is resource-intensive. Moreover, implementing either of these methods requires changing the application logic, and what if the application was bought off the shelf and nobody will let you change anything in it? That problem was solved with the advent of triggers: we attach a trigger to the table, and it tracks every change and writes it to some intermediate, or even final, table that stores the changes we need. What more could you want? And yet it is still not right: now every data change in the table fires a trigger, which needs CPU time to execute; the number of write operations doubles, and that is in the best case; and the overall complexity of such a database grows. Nevertheless, some products built on CDC technology are trigger-based to this day.
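The two early approaches can be sketched in a few lines with SQLite (the `orders` table and column names here are invented for illustration; in a real system this would be DDL against Oracle, SQL Server, and so on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Approach 1: a timestamp column on the tracked table itself.
CREATE TABLE orders (
    id         INTEGER PRIMARY KEY,
    amount     INTEGER,
    updated_at TEXT DEFAULT (datetime('now'))
);

-- Approach 2: a trigger that copies every change into a shadow table.
CREATE TABLE orders_changes (
    order_id   INTEGER,
    operation  TEXT,
    amount     INTEGER,
    changed_at TEXT DEFAULT (datetime('now'))
);
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (order_id, operation, amount)
    VALUES (NEW.id, 'UPDATE', NEW.amount);
END;
""")

conn.execute("INSERT INTO orders (id, amount) VALUES (1, 100)")
conn.execute("UPDATE orders SET amount = 150, updated_at = datetime('now') WHERE id = 1")

# Timestamp approach: finding rows touched since a cutoff scans the whole table.
changed = conn.execute(
    "SELECT id FROM orders WHERE updated_at >= datetime('now', '-1 minute')"
).fetchall()

# Trigger approach: the shadow table already holds the change history,
# at the cost of an extra write on every modification.
history = conn.execute(
    "SELECT order_id, operation, amount FROM orders_changes"
).fetchall()
```

Both drawbacks described above are visible even in this toy: the timestamp query has to read `orders` in full, and the trigger doubles the writes for every update.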

Other companies, however, went further and remembered that databases are, after all, transactional and have a feature called change logging. All you need to do is get at the log of these changes (the transaction log) and select from it the changes you want to monitor. The hardest part here is reading the transaction log correctly and picking out the changes that actually interest us. Previously, most products used an agent for this: it was installed on the source of the changes, read the transaction log, and either wrote the changes to a file or applied them directly to the recipient. To be honest, most products have not changed since then and still use agents. But there are several leaders in the CDC field that read the transaction log remotely, process it in memory on a separate server, extract the required changes, and apply them to the recipient. This architecture puts no load on the source and achieves impressive performance, making it possible to track the changes happening to the data and apply them to the recipient in real time. Imagine: someone pays by card in a sex shop, and three seconds later you can see it in your report. In English, this is called real-time data replication. CDC technology has reached the point where it is no longer necessary to accumulate changes somewhere and then apply them to the recipient at a scheduled time; it all happens in real time, and the data is always available and correct. I forgot to mention that the leading solutions in this area work with heterogeneous sources and recipients: SQL Server to Oracle, Oracle to SQL Server, Oracle to DB2, CICS to SQL Server, Informix to a file, Enscribe to Oracle. Any combination, any volume, any direction.
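Log-based CDC can be illustrated with a toy model. The record layout below is invented for the sketch; a real redo/WAL stream is a binary, vendor-specific format, and the hard part the article mentions (decoding it correctly) is exactly what is elided here:

```python
# A toy transaction log: each record is (LSN, table, operation, key, value),
# where LSN is the log sequence number, i.e. the position in the log.
transaction_log = [
    (1, "accounts", "INSERT", 42, 1000),
    (2, "audit",    "INSERT", 7,  None),   # a table we do not track
    (3, "accounts", "INSERT", 43, 500),
    (4, "accounts", "UPDATE", 42, 900),
    (5, "accounts", "DELETE", 43, None),
]

def capture(log, tracked_tables, from_lsn=0):
    """Pick out only the changes we care about, past the last applied position."""
    for lsn, table, op, key, value in log:
        if lsn > from_lsn and table in tracked_tables:
            yield lsn, op, key, value

def apply_changes(changes, target):
    """Replay captured changes onto the replica (the 'recipient')."""
    last_lsn = 0
    for lsn, op, key, value in changes:
        if op in ("INSERT", "UPDATE"):
            target[key] = value
        elif op == "DELETE":
            target.pop(key, None)
        last_lsn = lsn
    return last_lsn  # remembered so the next pass resumes from this position

replica = {}
pos = apply_changes(capture(transaction_log, {"accounts"}), replica)
```

The returned position is the key design point: by persisting it, the reader can resume from where it stopped, which is what lets real CDC products stream changes continuously instead of rescanning tables.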

Supported topologies can be seen in the picture:



Someone may ask: why not read the changes directly from memory (say, from the Redo Buffer or Log Buffer), since that should be much faster than reading from disk? For that, send a big, heartfelt greeting to Intel and the x86 architecture.

It should be noted that products that use CDC technology for data replication are not full-fledged ETL tools. Although they allow simple transformations and row filtering, that is not enough to earn the right to call themselves ETL tools. They do, however, expose the data to any ETL tool via ODBC, JDBC, JCA, or OLE DB, or simply by writing the changes to a file. Meanwhile, the ELT approach is gaining more and more popularity, and CDC technology fits the concept of this approach perfectly: first the data is replicated as-is, and only then is it transformed. It is also worth noting that some well-known ETL tools are trying to embed CDC into their data-extraction mechanisms. However, they use triggers, and that approach has the many drawbacks listed above; one can say outright that trigger-based CDC makes no sense in production systems.

At this point, the light of a timid hope of understanding the technological essence of CDC flickers in the listeners' eyes. The technicians, in principle, need no further information. But now the business people, the ones with some insight, ask the question:

So why is it needed at all, this CDC?



In literature, this delicate moment is called the climax. Right now all the i's will be dotted, and it will become clear whether a technology such as CDC has the right to exist, let alone to be used in real life. The moment comes when you can take a deep breath and look around. If at that moment I notice on the wall a portrait of the company president, smiling happily at the camera from his yacht, then, in principle, it becomes clear that this company has never had CDC, does not have it, and most likely never will. Fortunately, the fashion for portraits survives only in state institutions, which gives hope that business in Russia will someday become civilized.

Turning to the essence of the question, I can say that the most obvious use of CDC technology is to take the load off operational databases by moving reporting to a separate database (a so-called Operational Data Store). Let me explain with an example. Say we have a database that a number of users work with. They create requests there, set statuses, check which stage the requests are at, close them, correct them; in short, they perform a series of actions. It could be a CRM, an ERP, a core banking system, a billing system, or even POS terminals. The business needs to understand the state of those requests, or of the other entities stored in the system: accounts, deposits, production, and so on. So, on a periodic basis, the business generates a report that reflects all the key indicators it is interested in and needs for making further management decisions. But here is the trouble: while the report is being generated, all the users notice that the system slows down significantly and becomes uncomfortable to work with. This is where CDC technology comes to the rescue: we decide which data interests us, that is, we track changes in the relevant tables and apply them to another database (the ODS), and it is on the ODS that the reporting process runs. The result is that the wolves are fed and the sheep are safe: the users can work with the application, and management can get the reports it needs for its decisions at any convenient time.

The second type of CDC application follows from the first: we can use the technology to maintain an identical copy of a database for disaster recovery. It is implemented as follows: we track all the changes happening to the data in the database in the first data center and apply them in real time to a second database located in the second data center. Thus, when a disaster strikes, we always have an identical copy of the first database from which we can continue working, without pausing to restore the database from a backup. In the same vein, CDC technology can be used to migrate to a new DBMS version, or to migrate from a big-endian platform to a little-endian one and vice versa. CDC has indisputable advantages here, chief among them that the business does not have to be interrupted for operations such as migrating to a new version or a new platform. Since the changes are tracked and the necessary data is transferred in real time, once the necessary testing has been done we can switch to the new database version or the new server instantly, without forcing the business to stop.

There is another interesting case where CDC technology may be needed. Imagine a company with a head office and several branches scattered across the country. Both the head office and each branch have an HR department that hires, dismisses, and keeps records of employees. The company's management wants to know, at any moment, who works for them and in what position. To do this, they decide to implement MDM: to create a single place where information about the company's employees is stored. An MDM_HR database is created, into which information flows from the head-office database as well as from the branch databases. But for the information in MDM_HR to stay relevant, a mechanism is needed that propagates every change to employee data (a position, salary, or last name changes; an employee retires and another is hired to replace him) and every newly added record into the MDM_HR database. Data replication solutions based on CDC are ideally suited to be that mechanism. Now the company's management, as well as the HR department, can obtain a correct and up-to-date report on the company's employees at any time. Used in this way, CDC is not limited to personnel records: it fits any task where data is distributed yet grouped by some logical attribute, and where you want to build reports or track characteristics over the data as a whole rather than over its separate parts. These can be geographically distributed warehouses, various information systems, and the like that require integration and data consolidation to produce correct reporting.
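A minimal sketch of such consolidation might look as follows; the source names, employee records, and event layout are all hypothetical:

```python
# Change events arriving from the head office and branch databases. Each
# event carries its source system, so that local employee ids from
# different sites do not collide in the consolidated store.
events = [
    {"source": "head_office", "emp_id": 1, "op": "UPSERT",
     "data": {"name": "Ivanov", "position": "Manager"}},
    {"source": "branch_nsk",  "emp_id": 1, "op": "UPSERT",
     "data": {"name": "Petrov", "position": "Clerk"}},
    {"source": "branch_nsk",  "emp_id": 1, "op": "UPSERT",
     "data": {"name": "Petrov", "position": "Senior Clerk"}},  # a promotion
    {"source": "head_office", "emp_id": 2, "op": "DELETE", "data": None},
]

def consolidate(events, mdm=None):
    """Apply per-site change streams to a single MDM_HR-style store,
    keyed by (source system, local employee id)."""
    mdm = {} if mdm is None else mdm
    for e in events:
        key = (e["source"], e["emp_id"])
        if e["op"] == "UPSERT":
            mdm[key] = e["data"]
        elif e["op"] == "DELETE":
            mdm.pop(key, None)  # a departed employee disappears from MDM
    return mdm

mdm_hr = consolidate(events)
```

The compound key is the essential design choice: without tagging each change with its source, employee #1 from the head office and employee #1 from a branch would overwrite each other in the consolidated database.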

I have saved the most interesting area of CDC application for last. Suppose the CIO of a company has managed to convince management of the need to implement full-fledged BI. He has argued long and persuasively that business analysis is necessary for making correct and timely financial decisions, for financial planning and control; that BI is needed to track the effectiveness of marketing and advertising campaigns; that it is needed like air for setting and monitoring various KPIs, working with partners, reducing logistics costs, managing risks, and so on. In the end the board of directors decided that it all sounded not so bad, so why not try? BI was to be implemented in several stages: building a corporate data warehouse, filling it with data, forming data marts on top of the warehouse, and visualizing the data. So they built the warehouse, determined what data should go into it, and began to think about how that data could be loaded. There are two ways. The first is to use ETL tools: extract the data, transform it on an intermediate server, and load it into the warehouse. But this path has significant drawbacks: the ETL process is not continuous and runs only from time to time, which, first, puts a load on the data sources and, second, means that for most of the time the data in the warehouse does not match the latest data in the sources. This seriously upsets the business: the competition never sleeps, and the business wants to learn tomorrow's news not tomorrow morning but literally tonight; by the evening it already wants to know what will happen tomorrow. Such analysis requires data to be available in real time; we must analyze the very latest data, for only then can we keep up with an ever-changing environment. And this is where the second way, built on CDC technology, comes to the rescue: the changes are replicated into the data warehouse in real time, as they are, and are transformed only afterwards, inside the warehouse. This is the ELT approach: Extract-Load-Transform. The sources carry no extra load, the warehouse always holds the freshest data, and the business gets its analysis exactly when it needs it.

As you can see, the range of applications for CDC technology is wide: offloading operational systems for reporting, maintaining standby copies for disaster recovery, migrating between versions and platforms, MDM, feeding a data warehouse and real-time BI, and so on. Wherever changes in data need to be captured and delivered somewhere else in real time, CDC has a place.

I hope this article has shed at least some light on these three letters. If you have questions about CDC, ask them, and I will try to answer. Good luck!

Source: https://habr.com/ru/post/170267/

