
Why the E in the abbreviation EDW is about business processes

A data warehouse without the E


Today, for any large or medium-sized business, having a data warehouse is the de facto corporate standard. Whatever industry a company operates in, it cannot maintain a competitive advantage without analyzing the available data about customers, suppliers, and finances. As automation and optimization reach every level of producing a product or service, an organization relies on more and more IT systems that generate data: production, accounting, planning, HR management, and others.

In this article I will discuss how to build the data warehouse creation process so that it is most effective in terms of the global optimization of enterprise resources and of new and current business needs, and why maintaining metadata matters.

Accumulated data is most often used for the following classes of tasks:


Often, for the most pressing purposes, a single source is enough: for example, when providing a controller with a certain level of detail from a particular system, or sending a customer their full order history from the CRM. Even when information systems are replaced, producing such reports usually poses no difficulty.

Methods and types of data warehouses


However, when an organization grows large enough, or when a competitive edge is required, it is no longer enough simply to create a product and bring it to market. The current trend is a comprehensive study of the consumer in order to increase their loyalty. The business must be analyzed from different angles, and costs must be assessed more accurately. Typical must-have tasks include the following:


Each of the above examples requires more than one data source. Moreover, the methods for reconciling data across sources must be consistent. Otherwise, a situation inevitably arises where, say, the strategy director and the sales director bring the same report to the CEO, but with different numbers. They then spend a month figuring out whose numbers were "more correct," tying up nearly half the staff at their disposal.

The most primitive way to organize data storage is the so-called data lake, where we simply pile up data from different sources. This gives us a single technical platform for working with data and isolates heavy analytical queries from the primary workloads of the source information systems. Such a repository may well be non-relational. However, in that case you can forget about complex analysis and operate only with simple queries. Moreover, the people working with the data must know not only the business domain but also the data models of the source systems.
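As a sketch of the idea (the folder layout, source names, and columns are invented purely for illustration), the crudest form of a data lake is just landing raw extracts per source, with no unified model on top:

```python
from pathlib import Path

import pandas as pd

# A "data lake" in its crudest form: one folder per source system,
# raw extracts dumped as-is, no unified model on top.
LAKE = Path("lake")  # illustrative location

def land_raw_extract(source: str, table: str, df: pd.DataFrame) -> Path:
    """Land a raw extract from a source system into the lake, as-is."""
    target = LAKE / source / f"{table}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(target, index=False)
    return target

# The same real-world customers live under different names and keys
# in each system; nothing in the lake reconciles them.
crm_customers = pd.DataFrame(
    {"client_id": [1, 2], "client_name": ["Alpha LLC", "Beta JSC"]}
)
billing_accounts = pd.DataFrame(
    {"acct_no": ["A-1", "A-2"], "holder": ["Alpha LLC", "Gamma PJSC"]}
)
land_raw_extract("crm", "customers", crm_customers)
land_raw_extract("billing", "accounts", billing_accounts)

# Simple per-source queries are easy; cross-source analysis is not,
# and it requires knowing each source's own data model.
print(pd.read_csv(LAKE / "crm" / "customers.csv"))
```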

Next, by level of organization, comes a data warehouse built along the lines of the Kimball methodology. Dimensions from different systems are unified, producing something like a network of two types of tables: facts and dimensions. This is the primary enrichment of reference data: when similar tables from different sources share a common natural key (for example, the TIN in a directory of organizations), we obtain a single, unified directory.
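A minimal sketch of this primary enrichment (the column names and sample TINs are invented): organization directories from two systems are merged into one conformed dimension using the TIN as the natural key:

```python
import pandas as pd

# Organization directory as seen by the CRM system.
crm_orgs = pd.DataFrame({
    "tin":  ["7701234567", "7812345678"],   # TIN: the shared natural key
    "name": ["Alpha LLC", "Beta JSC"],
})

# The same entities in the billing system: different surrogate keys,
# slightly different attributes.
billing_orgs = pd.DataFrame({
    "tin":     ["7812345678", "5009876543"],
    "name":    ["BETA JSC", "Gamma PJSC"],
    "segment": ["enterprise", "smb"],
})

# Conform the dimension: a full outer join on the natural key yields
# one row per real-world organization across both systems.
dim_organization = crm_orgs.merge(
    billing_orgs, on="tin", how="outer", suffixes=("_crm", "_billing")
)

# Assign a single surrogate key for all fact tables to reference.
dim_organization["org_sk"] = range(1, len(dim_organization) + 1)

print(dim_organization)
```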

Next in complexity and reliability is a data warehouse with a single data model reflecting the most important objects that describe the organization's activities. The reliability comes from the fact that data presented in a form close to third normal form, with a properly constructed model, is a universal means of describing the life of the entire business; thus the data model can easily be adapted not only for analytical and regulatory reporting but also for the operation of some enterprise systems.
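To make "a form close to third normal form" concrete, here is a toy sketch of such a central layer (SQLite and all table names are purely illustrative): each business entity lives in its own table, and facts reference entities by key instead of duplicating their attributes:

```python
import sqlite3

# A toy central layer in (roughly) third normal form: every attribute
# depends on the key of its own entity; facts duplicate nothing.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE organization (
    org_id INTEGER PRIMARY KEY,
    tin    TEXT UNIQUE NOT NULL,   -- natural key
    name   TEXT NOT NULL
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE sale (                -- fact table: references only
    sale_id    INTEGER PRIMARY KEY,
    org_id     INTEGER NOT NULL REFERENCES organization(org_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    sale_date  TEXT NOT NULL,
    amount     NUMERIC NOT NULL
);
""")

# The same model serves analytics and, if needed, operational systems.
con.execute("INSERT INTO organization VALUES (1, '7701234567', 'Alpha LLC')")
con.execute("INSERT INTO product VALUES (1, 'Widget')")
con.execute("INSERT INTO sale VALUES (1, 1, 1, '2018-07-01', 1000)")
for row in con.execute("""
    SELECT o.name, s.sale_date, SUM(s.amount)
    FROM sale s JOIN organization o USING (org_id)
    GROUP BY o.name, s.sale_date
"""):
    print(row)
```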

E as in "unified"


Turning to the thesis of this article, I will list the main problems faced by those responsible for building data warehouses:

" Horse in a vacuum ." The storage is built, but nobody uses it.

" Black box ". The storage is built, but what is in it and how it works is incomprehensible. Because of this, there are constant errors, and if part of the development team has also quit, then as a result, we slip into point a.

" Calculator ". The storage is built, but it satisfies only primitive requests, the business changes much faster than the implementation of the requirements, new business requests are not taken into account in it. In addition, some data may be outdated or rarely updated.

" Crystal Vase ". Storage requires a lot of manual control, checks and non-automated control actions, if one of the support participants is not at work, there is a big risk to get invalid data or not to get them at all.

Let us examine all four cases in more detail.

"Horse in a vacuum." If you get this result, it happened for one of two reasons:

  1. Less likely: you did not collect requirements from business units (or, equivalently, collected them poorly). This seemingly absurd situation arises when the idea of creating a warehouse comes not from the business but from the IT department, which simply has some "spare" budget, and the warehouse is built because everyone else has one. The thinking goes: we will find customers later (better yet, "they will come running with open arms") once we load everything in. Those responsible for allocating the budget consider it something necessary: they have read about it, heard about it, it sounds like modernization, and they nod in agreement.
  2. More likely: the customers of the data warehouse have been identified, say, the sales department, and then a bright idea appears: "with a little extra effort we will also connect finance, HR, and a bit more, and the entire enterprise will use the warehouse." The warehouse is built, but only the sales department uses it. Everything there may be wonderful, the data flowing like milk and honey, free for the taking; yet the other departments have no time for milk and honey, because they are busy digging out their own piece of data from morning to night. After all, it is a piece earned by sweat and blood (read: spent working hours).

In both cases, the missing element is a top manager taking responsibility and cascading it down the hierarchy. It works like corporate culture: if the CEO of the enterprise has two deputies, then only the CEO personally can drive adoption of the warehouse at the enterprise level; otherwise the warehouse gets built only for the part of the enterprise supervised by the highest-ranking manager who understands the need for the EDW.

To avoid such situations, the following is necessary:

  1. Formally designate the sponsor of the data warehouse project: the person who will answer for the result both financially and personally
  2. Approve the project scope, possibly with phasing, and indicate approximate dates
  3. Coordinate with all departments, preferably including as-is and to-be business process designs

Only after that can you begin implementing the project: requirements gathering, architectural design, and so on.

" Black box ". So, you claim that you built a repository, that all requirements are taken into account, however, no one understands how to use it, and if one of the key developers left, it becomes almost impossible to understand what was done and how it was done.

In this case, obviously, the documentation process was never established during development. The principle of "document first, then develop" should be elevated, if not to an absolute, then at least to fairly strict control, and not only within the team responsible for developing the data warehouse. Ideally, downstream report developers (analytical and regulatory), the owners of the company's internal information systems, and of course the data consumers themselves should all be involved in keeping the documentation continuous and up to date.

In addition, the documentation process should follow these principles:


There are now software products that make life noticeably easier by linking design and development, but so far none of them is a complete solution for data warehouses:


Without up-to-date documentation, the cost of developing new requirements will grow; with competent documentation, it will fall.
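One inexpensive way to keep technical documentation from going stale is to generate it from the database's own metadata. A minimal sketch, using SQLite purely for illustration (a real warehouse would query its platform's system catalog instead):

```python
import sqlite3

def describe_schema(con: sqlite3.Connection) -> str:
    """Render a plain-text data dictionary straight from the catalog,
    so this part of the documentation can never drift from reality."""
    lines = []
    tables = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        lines.append(f"Table: {table}")
        for cid, name, coltype, notnull, dflt, pk in con.execute(
            f"PRAGMA table_info({table})"
        ):
            flags = " PK" if pk else ""
            lines.append(f"  - {name}: {coltype}{flags}")
    return "\n".join(lines)

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE organization (org_id INTEGER PRIMARY KEY, tin TEXT, name TEXT)"
)
print(describe_schema(con))
```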

" Calculator ". If we assume that we have not received a “horse in a vacuum,” then this situation is about when the requirements seem to be met, but they are formally fulfilled. You wanted to count the balances by day - please. If you want to get them in the context of counterparties' regions - there was no such requirement, you need to upload to excel, then take unloading from counterparty to system X with the choice of the Y field, and then go to it.

This situation points to a lack of experience in the team: no architectural view of the warehouse's further evolution, not even a primitive data model. Such warehouses usually remain temporary or are quickly forgotten. Ideally, a warehouse should have the momentum of a snowball rolling down a mountain: at first, while the lump is small and the snow ahead is loose, you will have to gather and push it yourself; at some point, word of your product spreads, and users turn to the warehouse more and more often.

So, for the warehouse not to turn into a calculator, you must ensure:

  1. qualified personnel: architects, analysts, ETL and SQL developers
  2. a project charter that sets the warehouse's goals not only for the next budget period but also for the years after
  3. quantitative and qualitative criteria for the data warehouse; if you lack the staff, it is advisable to involve consultants
  4. a clear picture of what will help optimize the data warehouse in the future: staff costs, software costs, faster report development, and so on


" Crystal Vase ". The storage is built, it seems to cope with its tasks, but to support it you need a lot of effort: maintaining some manual directories, constant reloading of some sources, failures in loading, duplicate data, etc.

This situation may occur for the following reasons:

  1. As already mentioned above, a lack of qualified personnel;
  2. An architecture-free approach: different parts of the warehouse are built by different people or teams without a common approved concept, so we end up with multiple ways to extract, transform, and load data;
  3. A very common case: development and support are outsourced, while acceptance of the work is done poorly;
  4. At some stage of the warehouse's development the budget runs out, and the warehouse is then maintained and extended not by the team that created it but by whoever needs the data.

To prevent these situations, the following actions are recommended:

  1. Everything covered in the sections above: qualified personnel, a project charter, a long-term plan and budget, and an interested party among the top managers.
  2. It is not the outsourcer who drives the process; an internal employee (a chief analyst or architect) directs the outsourcer.
  3. Any failure should be brought to the warehouse architect for review; if there are several architects, to an architecture committee.
  4. It is advisable to introduce a data warehouse quality metric and tie it to the team's KPI (a minimal sketch follows this list).
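On the last point, a warehouse quality metric can start very simply. Below is a minimal sketch; the statistics structure, the field names, and the 50/50 weights are illustrative assumptions, not an established standard:

```python
from dataclasses import dataclass

@dataclass
class LoadStats:
    """Per-day statistics assumed to be collected by the ETL framework."""
    loads_total: int
    loads_failed: int
    rows_total: int
    rows_rejected_by_checks: int

def dw_quality_score(s: LoadStats) -> float:
    """Blend load reliability and data-check pass rate into one 0..1
    score that can be tracked over time and tied to the team's KPI.
    The 0.5 / 0.5 weights are an illustrative choice."""
    load_ok = 1 - s.loads_failed / max(s.loads_total, 1)
    rows_ok = 1 - s.rows_rejected_by_checks / max(s.rows_total, 1)
    return 0.5 * load_ok + 0.5 * rows_ok

today = LoadStats(loads_total=120, loads_failed=3,
                  rows_total=1_000_000, rows_rejected_by_checks=1_200)
print(f"DW quality score: {dw_quality_score(today):.3f}")  # 0.987
```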

As you can see, in all the cases above, even though creating a data warehouse is a project activity, the creation processes themselves must be regulated to produce a quality result.

From a data warehouse to a unified one


As mentioned above, the success of a data warehouse project is determined by quite a few inputs (budget, sponsor, team, goals, customers). However, we have barely touched on the business processes aimed at developing and maintaining the DW itself. Below I will try to formulate the main business processes that are meant to make working with data at the enterprise genuinely unified:

  1. Processes for keeping technical and user documentation up to date
  2. Processes for keeping the business dictionary (glossary) of data up to date
  3. Data quality control processes
  4. Processes for collecting and managing requirements for the DW and the reporting system
  5. Processes for managing the storage and data processing infrastructure
  6. Processes for optimizing the warehouse and data retrieval

In the modern paradigm, this set of business processes forms the basis of the Data Governance concept.

Very often, when you try to introduce these processes, the DW and reporting team actively resists them or ignores them. That is understandable: in the local sense, they add overhead to development.

Therefore, it will be useful to take the following actions:


Although in the local sense the transition looks noticeably "bureaucratic" and heavy, in the global sense it brings significant advantages and time savings, since the biggest waste of time is reinventing existing solutions from scratch because it is impossible, or nobody wants, to understand the existing mechanism.

A little bit about the target architectural solution


Although EDW architecture deserves a separate large article, or even a book, I will also outline the main technical requirements for a mature data warehouse:

  1. The data lake paradigm does not replace the corporate data warehouse but coexists with it
  2. The EDW should offer different data delivery interfaces: BI tools, the ability to run ad-hoc SQL queries, standard data delivery in JSON, XML, and so on
  3. A role-based data access model should be implemented (see the sketch after this list)
  4. Response time when accessing data: under 1 second for 90% of typical queries, under 10 seconds for 99%; there should be a healthy resource margin
  5. A single, coherent central DW layer (preferably following the Inmon methodology)
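To illustrate the role-based access requirement from point 3, here is a minimal sketch of row-level security; the role names, the rules, and the helper function are invented for illustration:

```python
import pandas as pd

# Illustrative role model: each role sees only "its" slice of the data.
ROLE_FILTERS = {
    "sales_north": lambda df: df[df["region"] == "North"],
    "sales_south": lambda df: df[df["region"] == "South"],
    "cfo":         lambda df: df,  # sees everything
}

def read_with_role(df: pd.DataFrame, role: str) -> pd.DataFrame:
    """Apply the row-level security rule attached to the caller's role."""
    try:
        rule = ROLE_FILTERS[role]
    except KeyError:
        raise PermissionError(f"role {role!r} has no access to this dataset")
    return rule(df)

sales = pd.DataFrame({
    "region": ["North", "South", "North"],
    "amount": [100, 250, 80],
})
print(read_with_role(sales, "sales_north"))       # only North rows
print(read_with_role(sales, "cfo").amount.sum())  # full picture: 430
```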

In the end, what makes a data warehouse unified is not only the sources connected to it but the presence of data consumers. And that is far harder than writing a universal ETL and provisioning petabytes of storage.

Source: https://habr.com/ru/post/418361/

