Organizational changes, which sooner or later occur in the life of any company, usually bring the need to integrate various information systems. What is integration for? It lets different systems share a single information space, exchange data, and store, analyze and process it to support management and operational decisions. If decisions rest on data from only one system, chaos will eventually set in, mainly because the same data is represented and detailed differently in different systems, and because of errors caused by human factors. Experience shows that the most effective way to store information for later analysis and processing is an analytical data warehouse with data marts, against which users can run analytical queries and obtain the indicators they need.
Integration methods: pros and cons
There are various methods of integrating information systems, each with its own advantages and disadvantages. The federation method, for instance, does not move data: it stays with its owners and is accessed on request. However, this approach has significant limitations. All federated distributed databases that serve as data sources must share the format of a single application or DBMS; otherwise special software is required to integrate heterogeneous environments. In addition, all sources must be constantly available, which is not always feasible. If data exchange with one of the sources is slow, the whole integration mechanism slows down. When a user request involves two different sources, data from both must be combined on the fly at request time. This carries relatively high overhead, since it requires transferring a fairly large amount of information.
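The federation idea can be illustrated with a minimal sketch. It assumes two hypothetical sources, modeled here as in-memory SQLite databases named `north` and `south` with an invented `orders` table; real federated environments rely on dedicated middleware, so this only shows the principle that data is queried in place at request time.

```python
import sqlite3


def make_source(rows):
    """Create a hypothetical source database holding an 'orders' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


# The data stays with its owners; nothing is copied into a central store.
sources = {
    "north": make_source([("ACME", 10.0), ("Globex", 5.0)]),
    "south": make_source([("ACME", 7.5)]),
}


def federated_total(customer):
    """Query every source at request time and combine the answers on the fly."""
    total = 0.0
    for conn in sources.values():
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
            (customer,),
        ).fetchone()
        total += row[0]
    return total


acme_total = federated_total("ACME")
```

Because every request touches every source, a single slow or unavailable source delays or breaks the whole answer, which is the limitation described above.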
Another integration method, based on an enterprise service bus (ESB), also has a number of functional limitations. The main one is throughput, since the bus is a service with a built-in logging mechanism that guarantees delivery. If you need to transfer megabytes of data, exchange master data, or periodically synchronize individual documents, a service bus is practical and convenient. But for a constant data flow, including one generated by smart devices, bus throughput is clearly insufficient. For example, in the RedSys project to deploy infrastructure for high-speed trams in St. Petersburg, a service bus is used to transmit master data and to transfer employee information from the personnel system to the traffic management system.
The replication-based integration method imposes limitations of its own. The first is the requirement of a single platform from one vendor. Replication in a heterogeneous environment requires special tools, and the choice of tool depends on the database used in the organization. This inevitably affects the final cost of the integration project.
ETL integration mechanisms
Today we live and work in the era of big data, generated not only by the information systems people work with, but also by smart devices, Internet of Things sensors and many other machines. The ETL (Extract, Transform, Load) integration method is best suited for receiving and processing such volumes of data. It lets you receive data, validate it, unify it and store it so that analytical information can later be prepared from it.
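The receive, validate, unify and store sequence can be sketched in a few lines. This is a minimal illustration, not an industrial pipeline: the sensor export, its column names and the `readings` table are all assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system; the columns are assumptions.
RAW_CSV = """sensor_id,reading,unit
s1,21.5,C
s2,68.0,F
s3,not-a-number,C
"""


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: validate each reading and unify the unit to Celsius."""
    clean = []
    for row in rows:
        try:
            value = float(row["reading"])
        except ValueError:
            continue  # the reading failed validation, so the row is skipped
        if row["unit"] == "F":
            value = (value - 32) * 5 / 9
        clean.append((row["sensor_id"], round(value, 1)))
    return clean


def load(rows, conn):
    """Load: store the unified rows for later analysis."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, celsius REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
stored = conn.execute(
    "SELECT sensor_id, celsius FROM readings ORDER BY sensor_id"
).fetchall()
```

Here the invalid row is simply dropped; a production process would more likely keep it aside for correction.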
In the ETL integration mechanism, the data source can act either as a client or as a server. When the source acts as a server, it is advisable to use Change Data Capture (CDC) modules, which extract only the data that has changed. When the data source is busy with other tasks at the same time, or users are working with it, CDC avoids placing additional load on it. The data source can also act as a client. In that case the source system itself exports the data to CSV, XML or another common industry format, and the ETL process periodically collects these files for further processing.
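The point of extracting only changed data can be shown with a simplified stand-in. Real CDC modules usually read the database transaction log; the sketch below instead assumes a hypothetical `version` column that the source increments on every change, and uses it as a watermark. The `employees` table and its contents are invented for illustration.

```python
import sqlite3

# Hypothetical source table with a version column bumped on every change.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)"
)
source.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)", [(1, "Ann", 1), (2, "Boris", 2)]
)


def extract_changes(conn, last_version):
    """Return the rows changed after last_version, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, version FROM employees WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_version
    return rows, new_watermark


# The first run picks up everything; later runs pick up only what changed.
first_batch, watermark = extract_changes(source, 0)
source.execute("UPDATE employees SET name = 'Anna', version = 3 WHERE id = 1")
second_batch, watermark = extract_changes(source, watermark)
```

The second run reads a single row instead of rescanning the whole table, which is how incremental extraction keeps the load on a busy source low.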
Data processing
The ETL integration method is characterized by deep data transformation. It is carried out in a data warehouse, which in the classical model consists of three layers. The first layer holds copies of the data sources, with special keys added to ensure the uniqueness and historicity of the information. The second layer is responsible for logical processing and unification: it forms an object model from the data, organized by analytical dimension. The third layer loads the data marts for analysis, breaking the indicators down by the required dimensions and performing all the necessary calculations.
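The three layers can be sketched as three small functions. Everything here is an assumption chosen for illustration: two hypothetical sources (`crm` and `erp`) with different field names, a surrogate key as the "special key", and a per-customer total as the data mart.

```python
from collections import defaultdict
from itertools import count

# Hypothetical extracts from two source systems with different conventions.
crm_rows = [{"client": "ACME", "amount": "100.50"}]
erp_rows = [
    {"customer_name": "acme", "sum": 200.0},
    {"customer_name": "Globex", "sum": 50.0},
]

surrogate = count(1)


def stage(rows, source, batch):
    """Layer 1: copy source rows as-is, adding a surrogate key, source and batch."""
    return [
        {"sk": next(surrogate), "source": source, "batch": batch, **row}
        for row in rows
    ]


def unify(staged):
    """Layer 2: map both source layouts onto one object model."""
    unified = []
    for row in staged:
        name = row.get("client") or row.get("customer_name")
        amount = row.get("amount", row.get("sum"))
        unified.append(
            {"sk": row["sk"], "customer": name.upper(), "amount": float(amount)}
        )
    return unified


def build_mart(unified):
    """Layer 3: aggregate the unified data into a per-customer data mart."""
    totals = defaultdict(float)
    for row in unified:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


staging = stage(crm_rows, "crm", 1) + stage(erp_rows, "erp", 1)
core = unify(staging)
mart = build_mart(core)
```

Keeping the staging copy untouched means the unification rules can be changed and re-run without going back to the source systems.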
When there are several large unsynchronized data sources with scattered reference data, this data must be brought into a unified form. This is where master data management (MDM) systems come to the rescue. In branched holding structures, where dozens of systems from different vendors may be in use, it is sometimes very difficult without MDM even to calculate income and expenses for particular budget items. Poor-quality data can significantly reduce the value of the BI information on which management decisions are based. To keep data quality under control, a link back to the source is needed, through which information can be sent for correction. A special data mart, or "basket", of low-quality data that was rejected for one reason or another can also be formed. ETL mechanisms help detect this data.
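A "basket" of rejected data might look like the following sketch. The validation rules, the field names (`item`, `amount`, `source`) and the sample records are hypothetical; the point is only that each rejected record keeps its reasons and a reference to its source so that it can be sent back for correction.

```python
def validate(record):
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if not record.get("item"):
        problems.append("missing item")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("invalid amount")
    return problems


def split_by_quality(records):
    """Route clean records onward and rejected ones into the basket."""
    accepted, basket = [], []
    for record in records:
        problems = validate(record)
        if problems:
            basket.append(
                {"record": record, "source": record.get("source"), "reasons": problems}
            )
        else:
            accepted.append(record)
    return accepted, basket


records = [
    {"source": "erp", "item": "rent", "amount": 1200.0},
    {"source": "crm", "item": "", "amount": 300.0},
    {"source": "erp", "item": "travel", "amount": -50},
]
accepted, basket = split_by_quality(records)
```

The basket can then be exposed to data owners as its own report, while only the accepted records continue to the warehouse.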
Tools and specialists
As for ETL integration tools, there is a wide range to choose from. There are specialized solutions for specific DBMSs, most often developed by the DBMS vendors themselves (Oracle Data Integrator, IBM DataStage, Informatica PowerCenter, Integration Services (SSIS) as part of MS SQL Server), and there are universal products. The leader of Gartner's Magic Quadrant for ETL integration solutions is Informatica. All of these are industrial-grade systems that can dynamically distribute load between the sources and the BI warehouse, support parallel operations and offer a number of other functions. As a rule, the customer's data warehouse platform has already been chosen, so it is advisable to use the ETL solution from the vendor of that platform. Informatica solutions are very expensive, but in terms of technical capabilities they are also the most advanced, the most productive and the most scalable. They can be used to integrate with a data warehouse on any platform.
The data warehouse project implemented by RedSys at the Pension Fund of the Russian Federation is based on IBM solutions, with IBM DataStage used as the ETL tool. Supporting ETL integration projects in an organization requires three categories of specialists: architects, who design the data warehouse; analysts, both business and system, who collect data and business requirements and draft specifications; and programmers, who debug the ETL processes.
Business benefits
What does ETL integration mean for business? The benefit can be summed up briefly: "expensive but effective." It is, of course, a very costly component of a comprehensive project to build an analytical data warehouse. All of its advantages should therefore be weighed in terms of whether or not the company has a single BI warehouse. Without one, the business simply does not get prompt answers to strategically important questions. A unified view of the production and sale of products and services, their profitability and their cost helps, among other things, to minimize disagreements between the company's production and financial departments and, if necessary, to redistribute resources in favor of more profitable and efficient areas. Cost reduction should not be forgotten either. Instead of being gathered by individual departments and employees within their business units, data is collected automatically and much faster, and its quality is incomparably higher. Finally, ETL integration lets the customer focus on the organizational rather than the technical side of the BI project.
Source: https://habr.com/ru/post/359070/