In this article I want to show how you can solve the problem of impact analysis in the complex, multi-level infrastructure of a corporate data warehouse, using our DWH at Tinkoff Bank as an example.
Everyone who has worked with a DWH has probably asked these questions at least once:
- “What happens if I change a field in this table?”
- “Which ETL processes will it affect?”
- “Which reports will be affected?”
- “Which business processes might suffer?”
Answering these questions is usually not easy: you have to look through a dozen ETL processes, then dig into the BI tool and find the relevant reports, keep things in your head, and remember that something over there is built with hand-written code. It all adds up to a big headache.
Even the most innocent change can affect, say, a report that lands in the bank chairman's inbox every morning. Exaggerating a bit, of course :)
In the rest of the article I will show how, and with what tools, you can reduce that headache and carry out impact analysis in a DWH infrastructure quickly.
DWH at Tinkoff Bank
Before plunging into impact analysis, I will briefly describe what our DWH is. In structure, our warehouse most resembles Bill Inmon's Corporate Information Factory. There are many data processing layers; groups of target data marts and a normalized model make up the presentation layer of the warehouse. All of this runs on the Greenplum MPP DBMS. The ETL processes that build the warehouse are developed in SAS Data Integration Studio. SAP Business Objects is used as the bank's reporting platform. Upstream of the warehouse sits an ODS, implemented on an Oracle DBMS. From my previous article about the DataLake at Tinkoff Bank, you may know that we build ETL on Hadoop with Informatica Big Data Edition.
Impact analysis
As a rule, impact analysis within a single enterprise tool is easy: every tool has metadata and functionality for working with it, such as getting the list of objects that depend on a selected object.
The real complexity of impact analysis arises when the process goes beyond the boundaries of a single tool, for example, when the DWH environment consists of source databases, the DWH DBMS itself, ETL, and BI. To reduce the headache here, you need to be able to consolidate metadata from the different tools and build dependencies between them. The task is not new, and there are industrial systems on the market that solve it.
It was important for us that such a system could build an entire complex tree, or rather a graph, of our metadata, starting from the Oracle tables in ODS and ending with the reports in SAP Business Objects (see Fig. 1).
Fig. 1 Example of metadata dependencies across several systems, using the systems at Tinkoff Bank as an example
The system had to link Greenplum objects to each other, both through SAS jobs, in which Greenplum tables act as data sources and targets, and directly, by associating tables with the views built on them.
We chose Informatica Metadata Manager and successfully implemented the first metadata models at Tinkoff Bank. Later in the article I will describe what we learned to do with this tool and how.
Informatica Metadata Manager
Informatica Metadata Manager is, in fact, a large metadata repository that allows you to:
- Model metadata, i.e. create metadata models of, for example, a DBMS, an ETL tool, or even a business application
- Based on models you created or that ship with the product, build processes that load and update metadata in your repository
- Create linking rules between metadata objects, both within a model and across models
- Create individual links that cannot be expressed as rules, again both within a model and across models
- Work with the loaded metadata of your systems in a visual web interface
Informatica Metadata Manager Models
Now, in order and in more detail, about what the tool can do and how to prepare Informatica Metadata Manager.
Informatica Metadata Manager ships with a certain set of models that, provided you have purchased the license, you can start working with immediately after installing the product.
Fig. 2 Informatica Metadata Manager models
In Fig. 2, the models that do not begin with the Tinkoff prefix come out of the box. A model, in Informatica Metadata Manager terms, is a set of classes arranged into a hierarchy that corresponds to the structure of the source metadata, i.e. of some information system. For example, the Oracle DBMS metadata model in Informatica Metadata Manager looks as shown in Fig. 3. I think those who have worked with an Oracle DBMS will see much that is familiar in this hierarchy.
Fig. 3 Oracle DBMS metadata model
Metadata models for the Greenplum DBMS or SAS Data Integration Studio are not provided out of the box, so we designed them ourselves for our tasks. Whether you start from a ready-made model or design your own, it is very important to understand what tasks the model must solve. Our models turned out simple (see Fig. 4 and Fig. 5), but they met our requirements. And our main requirement was the ability to build lineage from the ODS tables, which live in Oracle, to the reports built on universes in SAP Business Objects.
Fig. 4 Greenplum DBMS metadata model
Fig. 5 SAS Data Integration Studio metadata model
For the SAP Business Objects metadata model, a dilemma arose: use the pre-built model or develop our own.
The pre-built model, or rather the first level of its hierarchy, looks like this (see Fig. 6).
Fig. 6 Pre-built SAP Business Objects metadata model
We made a comparison:
Criterion | Pre-built model | Own model
---|---|---
Model completeness | Excessive | Optimal
Meets the originally formulated requirements | Probably | Yes
Cost of developing the metadata load/update process | Free | Depends on the complexity of the model and on knowledge of the SAP Business Objects metadata structure
Cost of the model | Requires a license for the SAP Business Objects metadata exchange option | Free
One argument against our own model was the complexity of the SAP Business Objects metadata structure. However, we audit the BI platform with a third-party product, 360eyes, and we took it as the source of SAP Business Objects metadata for Informatica Metadata Manager. The resulting model turned out very simple (see Fig. 7) and met our stated requirements.
Fig. 7 SAP Business Objects metadata model
Updating metadata in Informatica Metadata Manager
The models are created; now they need to be filled with metadata. The models that ship with Informatica Metadata Manager come with predefined loaders that already know how to extract, for example, the list of tables and views from the Oracle DBMS dictionaries. For the metadata models you design yourself, you will have to develop the loader yourself. There is nothing to be afraid of here: the process is quite transparent and resembles developing ETL procedures against a clearly written spec. From a model you have created in Informatica Metadata Manager, a simple command generates a load template, and you end up with a set of CSV files that you need to fill with your system's metadata. From there, everything depends on how well you know, and can work with, the metadata of your systems. Incidentally, extracting metadata from your systems is probably one of the most labor-intensive steps in building the whole metadata workflow in Informatica Metadata Manager.
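To make “fill the CSV files with metadata” concrete, here is a sketch of what a few filled rows for a custom Greenplum model might look like. The column layout and values are illustrative assumptions, not the exact template the product generates (that one is derived from your model's classes):

    Class,Name,ParentPath
    Schema,dds,
    Table,account,dds
    Column,account_id,dds/account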
We wrote all the necessary code that collects the data for the created load templates from Greenplum's pg_catalog, from SAS metadata, and from the 360eyes repository, and launched it as a regular process. To update the metadata of the Oracle DBMS on which our ODS runs, we used the pre-built model (see Fig. 3). The metadata in Informatica Metadata Manager is updated every night.
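To give an idea of what that code looks like, below is a minimal sketch of a query against pg_catalog that lists Greenplum tables and their columns for a load template; a real loader would, of course, filter schemas and objects more carefully:

    SELECT n.nspname AS schema_name,
           c.relname AS table_name,
           a.attname AS column_name,
           format_type(a.atttypid, a.atttypmod) AS data_type
    FROM pg_catalog.pg_class c
    JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
    JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid
    WHERE c.relkind = 'r'                      -- ordinary tables only
      AND a.attnum > 0                         -- skip system columns
      AND NOT a.attisdropped                   -- skip dropped columns
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY schema_name, table_name, a.attnum;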
Linking rules in Informatica Metadata Manager
The system metadata is now regularly refreshed in the Informatica Metadata Manager repository; next, the metadata objects of the different systems need to be linked to each other. For this, Informatica Metadata Manager lets you write rules (a Rule Set), and the rules can work both within a model and across models. The rules are an XML file with a simple structure:
<?xml version="1.0" encoding="UTF-8"?>
<ruleSet name="SAPBOBJ_to_SAPBOBJ_rules">
    <sourceResource name="SAPBOBJ"/>
    <targetResource name="SAPBOBJ"/>
    <rule name="unv_to_rpt_unv" direction="SourceToTarget">
        <sourceFilter>
            <element class="Universe"/>
        </sourceFilter>
        <targetFilter>
            <element class="Report">
                <element class="Report Universe"/>
            </element>
        </targetFilter>
        <link condition="source.Name = target.Name"/>
    </rule>
</ruleSet>
The rule set above says that, within the SAPBOBJ model, a link must be built between objects of the Universe class and objects of the Report Universe class (nested under Report) on the condition that their names are equal.
There are situations where metadata objects need to be linked, but no rule can be written for the connection. A simple example: view “A” is built on tables “B” and “C”. For such situations, Informatica Metadata Manager can load individual links, so-called Enumerated Links. Enumerated Links is a CSV file containing the full paths, as they appear in the Informatica Metadata Manager repository, of each pair of metadata objects to be linked. It is with Enumerated Links that we build the links between Greenplum tables and views.
Fig. 8 Enumerated Links properties
Fig. 9 CSV file for loading Enumerated Links
We generate the Enumerated Links CSV file ourselves, just like the metadata for loading, based on pg_catalog. Through Enumerated Links, in the Greenplum model, we associate objects of the “Table” class with objects of the “View” class, specifying the full repository path of each metadata object.
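The table-to-view pairs themselves can be pulled from pg_catalog through the dependencies recorded by each view's rewrite rule. Below is a minimal sketch; the “Greenplum/DWH” path prefix is a made-up example, and in practice it must match the resource and object paths in your repository:

    SELECT DISTINCT
           'Greenplum/DWH/' || sn.nspname || '/' || s.relname AS from_path, -- the table
           'Greenplum/DWH/' || vn.nspname || '/' || v.relname AS to_path    -- the view built on it
    FROM pg_catalog.pg_depend d
    JOIN pg_catalog.pg_rewrite r    ON r.oid = d.objid      -- the view's rewrite rule
    JOIN pg_catalog.pg_class v      ON v.oid = r.ev_class   -- the view itself
    JOIN pg_catalog.pg_class s      ON s.oid = d.refobjid   -- an object the view reads from
    JOIN pg_catalog.pg_namespace vn ON vn.oid = v.relnamespace
    JOIN pg_catalog.pg_namespace sn ON sn.oid = s.relnamespace
    WHERE d.classid = 'pg_rewrite'::regclass
      AND d.refclassid = 'pg_class'::regclass
      AND v.relkind = 'v'                                   -- views
      AND s.relkind = 'r'                                   -- plain tables
      AND v.oid <> s.oid;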
Impact analysis in Informatica Metadata Manager
What did we get? The main thing is the ability to build lineage for every metadata object in our DWH infrastructure, i.e. to perform impact analysis in two directions: Impact Upstream and Impact Downstream.
For example, we want to see what depends on a table in ODS. We find the table we need, either in the model catalog or via search, and run lineage on it. We get a beautiful picture like the one in Fig. 10: a number of ETL processes in SAS, as well as two universes and seven reports in SAP Business Objects, depend on the selected table.
Fig. 10 Lineage from an ODS table
In the web interface, you can expand each area of the resulting lineage, and from any object in the diagram you can run lineage directly in the same window.
Or, for example, we want to see which tables' data goes into building a report. We find the report we need and run lineage on it. We get another beautiful picture, see Fig. 11: the selected report is built on one universe, which uses Greenplum tables from four schemas, which in turn are filled from ODS and by ETL processes in SAS.
Fig. 11 Lineage from a SAP Business Objects report
Any lineage result can be exported to Excel; the export fully reflects all dependencies of the selected metadata object.
Results and future plans
We have learned how to prepare Informatica Metadata Manager. What remains is to work with it: look for new ways to use it, implement new models, and connect new users.
New models we want to build:
- Source banking systems
- Informatica Big Data Edition
- Hadoop (HDFS + Hive)
- Flume
Users we have connected and plan to connect:
- DWH systems analysts
- Owners of the source banking systems
Model development:
- Refine our SAP Business Objects model by adding metadata objects such as measures and indicators
Among the downsides we have encountered is the long build time of a full lineage (over 10 minutes) on areas with a large number of connections (over 1,000). Overall, Informatica Metadata Manager is a very flexible, easy-to-use tool that, in our opinion, solves its tasks well.