
Integrating XML data is another way.

This article describes an "unconventional" but rather powerful XML processing technique that lets you import XML data of any structure and transform it efficiently and simply, while the same processing pipeline handles source data of any structure without any change to the program code.



As an intermediate data exchange format, XML has become "overgrown" with an ecosystem of technologies and tools: specialized editors, DOM parsers, XQuery/XPath, XSLT, dedicated ETL modules, and so on. All this diversity and sophistication of tooling has ideologically led us to a point where we have a technology for working not simply with data, but with special XML data. It is as if, alongside a "science of chemistry," there were a separate "science of chemistry for substances stored in blue boxes."



The processing principles described below allowed us to rewind this conditional technological progress a little and move almost all work with XML data to the level of a "pure" DBMS, without XML-specific tools. This approach made it possible to organize unified storage of arbitrary XML data and provide fast random access to the necessary parts of the information. In addition, it became possible to implement ETL functionality in a universal, low-cost (practically coding-free) way.



The described approach works particularly well with large data sources and complex structures whose data schemas change frequently. If you have to deal with a small amount of information and/or a simple structure, you may not need this technique yet.


A good demo case for this technology is the public procurement server zakupki.gov.ru, available over the corresponding FTP: daily updates amount to tens or hundreds of thousands of XML files totaling gigabytes or tens of gigabytes, and on average a new version of the data schemas comes out every few weeks.



The data structure follows the requirements of the legislation; for example, information on public procurement notifications is provided by more than a dozen document types fcsNotification*, depending on the type of purchase (electronic auction fcsNotificationEF, request for quotes fcsNotificationZK, purchase from a single supplier fcsNotificationEP, etc.).

All these documents are based on the same basic notification type but differ in details; therefore, for analysis purposes, all this diversity must be "collapsed" at some point during import and brought to a common denominator.



The described approach has been applied successfully and works effectively on this public procurement data.



Briefly, the stages/elements of the described technology:



(1) Import all XML data into a table of unified structure. We are not talking about saving documents into a document database as a whole: we import the data element by element, as "element name"-"value" or "attribute name"-"value" pairs. As a result of this stage, we get rid of XML as a storage format and gain fast, unified access to the data of all imported XML documents of any structure (and we no longer need XQuery/XPath).



(2) The second element of the technology is the creation of specifications for "pulling out" the data we need: identifying the resulting tables into which we will pour the data, and mapping source fields to result fields. This stage can be carried out either on the basis of the documents' XSD schemas or without schemas, by analyzing the sample documents loaded at the first stage. It requires no programming or special skills; the main tool here is a spreadsheet.



(3) The final steps are selecting the necessary information from the primary import storage (1) using the specifications (2), converting it to a "columnar" view ("pivoting"), and automatically transforming it into the final "analytical" format: in data warehouse terms, fact tables in a star schema, with references to dimension tables and numeric measures.



1. Primary import.



We treat an XML document as a tree whose vertices are "name"-"value" pairs. The described approach is therefore quite universal and can be applied to any tree-like data representation.



Table structure for loading data from XML:

| Field | Purpose |
| --- | --- |
| Record_ID | hierarchical identifier of the tree node |
| File_ID | code of the source file |
| Path | path to the element within the document |
| Element_Name | name of the element or attribute |
| Element_Value | value |
| Type | node type ("E" for element in the example below) |





The idea of loading a tree into a table is fairly obvious. MS SQL (I cannot speak for other DBMSs, I have not looked) has a built-in capability for this: XML imported without specifying a schema ends up in a so-called EDGE table. This is not exactly what we need, because in the EDGE format the element name and its value are stored as separate records (the name is the parent record for the value), which is simply inconvenient for further manipulation. In addition, in the EDGE table the links in the tree are recorded by specifying a ParentID.



In short, you can build the desired data representation from the EDGE table, but you will have to sweat a little to "glue" the element names and values together, recreate the XPath to each element, and build a hierarchical identifier (how we build it is described just below). With a large amount of data, solving these problems can be quite resource-intensive, but you can get by with a single tool/language.



A better way is to obtain the document tree using an XML parser (there is an implementation in almost every language and development environment) and fill in the necessary information in a single pass through the document.
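
As an illustration, here is a minimal single-pass loader sketched in Python with the standard xml.etree.ElementTree parser. It assumes the receiving table layout described above (Record_ID, File_ID, Path, Element_Name, Element_Value, Type), three-digit zero-padded level counters and a backslash separator as in the example tables below; the function name and the "A" type code for attributes are illustrative assumptions, not part of the original toolchain.

```python
import xml.etree.ElementTree as ET

LEVEL_WIDTH = 3      # fixed counter width per tree level, e.g. "001"
SEP = "\\"           # separator used inside Record_ID and Path

def iter_rows(file_id, xml_file):
    """Yield (Record_ID, File_ID, Path, Element_Name, Element_Value, Type) tuples."""
    root = ET.parse(xml_file).getroot()

    def walk(elem, record_id, path):
        value = (elem.text or "").strip()
        yield record_id, file_id, path, elem.tag, value, "E"
        # attributes become rows of their own; "A" as the type code is an assumption
        for attr_name, attr_value in elem.attrib.items():
            yield record_id, file_id, path + elem.tag + SEP, attr_name, attr_value, "A"
        for i, child in enumerate(elem, start=1):
            child_id = record_id + SEP + str(i).zfill(LEVEL_WIDTH)
            yield from walk(child, child_id, path + elem.tag + SEP)

    yield from walk(root, "1".zfill(LEVEL_WIDTH), "")

# e.g. collect the rows for deliveries.xml (file code 2006) for a bulk insert
rows = list(iter_rows(2006, "deliveries.xml"))
```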



Let's look at a specific example. We have two demo XML files, deliveries.xml and returns.xml. The deliveries.xml file contains the root element Deliveries; at the top level are the start and end dates of the period for which the data is unloaded, then come the products with their name and supplier, and for each product the delivery details are given: date, quantity, price.



deliveries.xml
<Deliveries>
  <PeriodBegin>2017-01-01</PeriodBegin>
  <PeriodEnd>2017-01-31</PeriodEnd>
  <Products>
    <Product>
      <Supplier>Zaanse Snoepfabriek</Supplier>
      <ProductName>Chocolade</ProductName>
      <Details>
        <Detail>
          <DeliveryDate>2017-01-03</DeliveryDate>
          <UnitPrice>10.2000</UnitPrice>
          <Quantity>70</Quantity>
        </Detail>
      </Details>
    </Product>
    <Product>
      <Supplier>Mayumi's</Supplier>
      <ProductName>Tofu</ProductName>
      <Details>
        <Detail>
          <DeliveryDate>2017-01-09</DeliveryDate>
          <UnitPrice>18.6000</UnitPrice>
          <Quantity>12</Quantity>
        </Detail>
        <Detail>
          <DeliveryDate>2017-01-13</DeliveryDate>
          <UnitPrice>18.7000</UnitPrice>
          <Quantity>20</Quantity>
        </Detail>
      </Details>
    </Product>
  </Products>
</Deliveries>




The returns.xml file (returns) is structured exactly the same, only the root element is called Returns and the date element in the details has a different name.



returns.xml
<Returns>
  <PeriodBegin>2017-02-01</PeriodBegin>
  <PeriodEnd>2017-02-28</PeriodEnd>
  <Products>
    <Product>
      <Supplier>Pavlova, Ltd.</Supplier>
      <ProductName>Pavlova</ProductName>
      <Details>
        <Detail>
          <ReturnDate>2017-02-21</ReturnDate>
          <UnitPrice>13.9000</UnitPrice>
          <Quantity>2</Quantity>
        </Detail>
      </Details>
    </Product>
    <Product>
      <Supplier>Formaggi Fortini srl</Supplier>
      <ProductName>Mozzarella di Giovanni</ProductName>
      <Details>
        <Detail>
          <ReturnDate>2017-02-27</ReturnDate>
          <UnitPrice>27.8000</UnitPrice>
          <Quantity>4</Quantity>
        </Detail>
      </Details>
    </Product>
  </Products>
</Returns>




The names of the uploaded files are stored in a separate table, the codes of our files there are 2006 (deliveries) and 2007 (returns).



In our receiving table, the image of our demo documents will look like this:



(Not all of it, just the beginning of the table:)

| Record_ID | File_ID | Path | Element_Name | Element_Value | Type |
| --- | --- | --- | --- | --- | --- |
| 001 | 2006 | | Deliveries | | E |
| 001\001 | 2006 | Deliveries\ | PeriodBegin | 2017-01-01 | E |
| 001\002 | 2006 | Deliveries\ | PeriodEnd | 2017-01-31 | E |
| 001\003 | 2006 | Deliveries\ | Products | | E |
| 001\003\001 | 2006 | Deliveries\Products\ | Product | | E |
| 001\003\001\001 | 2006 | Deliveries\Products\Product\ | Supplier | Zaanse Snoepfabriek | E |
| 001\003\001\002 | 2006 | Deliveries\Products\Product\ | ProductName | Chocolade | E |
| 001\003\001\003 | 2006 | Deliveries\Products\Product\ | Details | | E |
| 001\003\001\003\001 | 2006 | Deliveries\Products\Product\Details\ | Detail | | E |
| 001\003\001\003\001\001 | 2006 | Deliveries\Products\Product\Details\Detail\ | DeliveryDate | 2017-01-03 | E |
| 001\003\001\003\001\002 | 2006 | Deliveries\Products\Product\Details\Detail\ | UnitPrice | 10.2000 | E |
| 001\003\001\003\001\003 | 2006 | Deliveries\Products\Product\Details\Detail\ | Quantity | 70 | E |




Regarding the hierarchical identifier Record_ID: its purpose is to uniquely number the nodes of the document tree while preserving information about the links to all ancestors.



Note:
SQL Server itself has a special data type (a CLR extension) called hierarchyid that serves exactly this purpose.



In the example above we use a simple, platform-independent implementation: a sequential concatenation of element counters at each level of the tree. We pad each level's counter with zeros to a given fixed width, so that an ancestor identifier of any level can be extracted easily and quickly as a substring of fixed length.
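
A couple of lines make the benefit concrete (a sketch; the three-character width and backslash separator follow the example above, and the same trick becomes a plain LEFT()/SUBSTRING() expression in SQL):

```python
LEVEL_WIDTH = 3
SEP = "\\"

def ancestor_id(record_id: str, depth: int) -> str:
    """Trim a hierarchical identifier to the given number of levels."""
    return record_id[: depth * LEVEL_WIDTH + (depth - 1) * len(SEP)]

rid = "001\\003\\001\\003\\002"       # a Detail node from deliveries.xml
print(ancestor_id(rid, 3))            # 001\003\001  (the parent Product)
print(ancestor_id(rid, 1))            # 001          (the document root)
```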



That is all for this stage. We can now use simple SQL queries to isolate subsets of product data, delivery details or returns, and link them together through the file and element identifiers.



In many cases this approach is already enough for efficient data storage and access; the "advanced" techniques described below may not be necessary.



Indeed, we have already loaded all our data into a single repository regardless of possible differences in document structure, obtained a fairly effective way to extract the necessary information from the whole array with simple SQL queries, and we can link the extracted data subsets together.
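
For example, queries against the primary table might look like this (a sketch using sqlite3 as a stand-in for the actual DBMS; the table name XML_Elements is an assumption, the columns are those described above):

```python
import sqlite3

conn = sqlite3.connect("xml_store.db")

# All supplier names across every imported file, whatever the root element was:
suppliers = conn.execute(
    """
    SELECT DISTINCT Element_Value
    FROM   XML_Elements
    WHERE  Element_Name = 'Supplier'
    """
).fetchall()

# Delivery/return detail fields for file 2006, each linked to its Product
# through a fixed-width prefix of Record_ID (3 levels = 3*3 + 2 = 11 characters):
details = conn.execute(
    """
    SELECT substr(Record_ID, 1, 11) AS Product_ID,
           Element_Name, Element_Value
    FROM   XML_Elements
    WHERE  File_ID = 2006
      AND  Path LIKE '%\\Details\\Detail\\'
    """
).fetchall()
```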



The performance of such a solution as a storage system for XML sources, even for one-off queries and without optimizing indexes and other tweaks, will clearly be much higher than if you stored the XML in files or even wrote it into special database fields. In our case there is no need to run an XPath search (which implies re-parsing) for each document: we parsed once and then calmly use the saved result through fairly simple queries.



But:
At this stage we have not yet made DeliveryDate, Quantity and UnitPrice into fields of one table; that is done by the "pivoting" process described in the third section.



In the next steps we will look at transforming these XML documents into a single data structure containing three tables: MovementReports (movement type, i.e. delivery or return, plus the start and end dates from the document root), Products (name and supplier), and MovementDetails (price, quantity, date; the date field in the result will be the same for both source documents, even though the corresponding fields in the source files have different names).



2. Creating transformation specifications for the resulting tables.



Consider the process of creating specifications for mapping the source data to the resulting tables. To create such specifications, we will need something else first.



2.1. Getting a table with the document structure.


For further processing we need a detailed structure of all our XML documents, so that based on it we can decide into which table structure we will transform all this.

A single sample XML document is not enough for this task: a specific document may lack optional elements that suddenly appear in other documents.



If we do not have an XSD schema, or do not want to deal with it, it may be enough to load a representative sample of XML documents into our table and build the required list using the Path and Element_Name fields.



However, do not forget that we ultimately want to load the information into certain target "final" tables, so we need to know where one-to-many relationships live in the XML. That is, we must understand which elements form a "child" table, i.e. where multiplication occurs.



Sometimes, if the document schema is not very complicated, this is immediately clear empirically, "by eye." Also, by grouping the data of our "representative sample" we can count the number of elements and see from these statistics where they begin to "multiply." But in general, if we have a proper XSD schema, it is better to use it: we can "catch" one-to-many data multiplication by looking for the XSD construct maxOccurs="unbounded".
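
The "counting" variant boils down to one GROUP BY over the primary table: elements that occur more often than once per document are candidates for a child table (again a sketch with the assumed XML_Elements table name, sqlite3 standing in for the real DBMS):

```python
import sqlite3

conn = sqlite3.connect("xml_store.db")

stats = conn.execute(
    """
    SELECT Path, Element_Name,
           COUNT(*)                AS occurrences,
           COUNT(DISTINCT File_ID) AS documents
    FROM   XML_Elements
    GROUP  BY Path, Element_Name
    """
).fetchall()

for path, name, occurrences, documents in stats:
    if occurrences > documents:      # appears more than once per document: a 1:N candidate
        print(f"{path}{name}: {occurrences} occurrences in {documents} document(s)")
```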



As you can see, the task is a bit more complicated: we want not just a simple table listing the XPath of every element in our documents, but also an indication of where the multiplication of data begins. (And when analyzing a well-written XSD schema, as a pleasant bonus you may be able to pull out element descriptions and types as well.)



It would be very convenient to use the functionality of some XML editor to obtain such a table, but we could not find a tool that would produce the document structure we needed from the XSD schema (we searched and experimented for a long time).



All these tools (Oxygen, Altova, Liquid and less sophisticated ones) undoubtedly use the necessary information internally, yet none of them can output it in the required form. As a rule, an advanced editor can generate a sample XML from a schema, but an XSD may contain choice constructs, where one of several different elements may appear in the document, so real "combat" document samples are better for analysis. And in any case, from a sample or samples of documents we will not explicitly catch the point where one-to-many multiplication of information occurs.



As a result, I had to reinvent the wheel and write a generator of such a table (in essence, a special kind of XSD parser) myself. Fortunately, an XSD is also XML, so it too can be loaded into our repository and the desired view pulled out with relational operations. If the schema is simple, without references to complex element types and without inheritance from base types, this is quite easy. When all this type inheritance is present (as in the public procurement schemas, for example), the task is more complicated.
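
For the simple case such an XSD walker fits in a few lines; the sketch below handles only inline complex types with xs:sequence (no type references, choice, or inheritance), which is exactly the "easy" case mentioned above, and the schema file name is hypothetical:

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"   # XML Schema namespace in Clark notation

def child_elements(element_node):
    """Direct child xs:element declarations of an element with an inline complexType."""
    return element_node.findall(f"./{XS}complexType/{XS}sequence/{XS}element")

def walk(element_node, path):
    """Yield (path, element_name, multiplies) for every element declared below this node."""
    for el in child_elements(element_node):
        name = el.get("name") or el.get("ref")
        multiplies = el.get("maxOccurs") == "unbounded"
        yield path, name, multiplies
        yield from walk(el, path + name + "\\")

schema = ET.parse("deliveries.xsd").getroot()          # hypothetical schema file
for root_el in schema.findall(f"./{XS}element"):
    for path, name, multiplies in walk(root_el, root_el.get("name") + "\\"):
        print(path, name, "maxOccurs=unbounded" if multiplies else "")
```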



For our example, we get something like the following document structure table (the maxoccurs column holds the path of the nearest repeating, i.e. maxOccurs="unbounded", ancestor):

| Path | Element_Name | maxoccurs |
| --- | --- | --- |
| Deliveries\ | PeriodBegin | |
| Deliveries\ | PeriodEnd | |
| Deliveries\Products\Product\ | ProductName | Products\Product |
| Deliveries\Products\Product\ | Supplier | Products\Product |
| Deliveries\Products\Product\Details\Detail\ | DeliveryDate | Products\Product\Details\Detail |
| Deliveries\Products\Product\Details\Detail\ | Quantity | Products\Product\Details\Detail |
| Deliveries\Products\Product\Details\Detail\ | UnitPrice | Products\Product\Details\Detail |
| Returns\ | PeriodBegin | |
| Returns\ | PeriodEnd | |
| Returns\Products\Product\ | ProductName | Products\Product |
| Returns\Products\Product\ | Supplier | Products\Product |
| Returns\Products\Product\Details\Detail\ | Quantity | Products\Product\Details\Detail |
| Returns\Products\Product\Details\Detail\ | ReturnDate | Products\Product\Details\Detail |
| Returns\Products\Product\Details\Detail\ | UnitPrice | Products\Product\Details\Detail |




2.2. Transformation Description


Having obtained the initial data in the form of a tabular specification of the documents, we proceed to designing the data transformation. As a result of this transformation, the data from our primary XML repository will be turned into new tables with new field names and unique keys for the target table records. We do not need any special software tools; we just add a few new columns to this table:



In the table below, some paths are abbreviated ("...") to improve readability.

| Path | Element_Name | maxoccurs | TargetTable | TargetField | ElementDepth | AdditionalInfo |
| --- | --- | --- | --- | --- | --- | --- |
| Deliveries\ | PeriodBegin | | MovementReports | PeriodBegin_Date | 1 | Deliveries |
| Deliveries\ | PeriodEnd | | MovementReports | PeriodEnd_Date | 1 | Deliveries |
| Deliveries\...\Product\ | ProductName | ...\Product | Products | ProductName_Dim | 3 | |
| Deliveries\...\Product\ | Supplier | ...\Product | Products | Supplier_Dim | 3 | |
| Deliveries\...\Detail\ | DeliveryDate | ...\Detail | MovementDetails | MovementDate_Date | 5 | |
| Deliveries\...\Detail\ | Quantity | ...\Detail | MovementDetails | Quantity_Val | 5 | |
| Deliveries\...\Detail\ | UnitPrice | ...\Detail | MovementDetails | UnitPrice_Val | 5 | |
| Returns\ | PeriodBegin | | MovementReports | PeriodBegin_Date | 1 | Returns |
| Returns\ | PeriodEnd | | MovementReports | PeriodEnd_Date | 1 | Returns |
| Returns\...\Product\ | ProductName | ...\Product | Products | ProductName_Dim | 3 | |
| Returns\...\Product\ | Supplier | ...\Product | Products | Supplier_Dim | 3 | |
| Returns\...\Detail\ | Quantity | ...\Detail | MovementDetails | Quantity_Val | 5 | |
| Returns\...\Detail\ | ReturnDate | ...\Detail | MovementDetails | MovementDate_Date | 5 | |
| Returns\...\Detail\ | UnitPrice | ...\Detail | MovementDetails | UnitPrice_Val | 5 | |




Here are the fields we added:

  • TargetTable is the target table into which the element's value will go;
  • TargetField is the name of the field in that target table;
  • ElementDepth is the depth to which the hierarchical identifier Record_ID is trimmed when forming the key of the target record;
  • AdditionalInfo is optional additional information passed along with the record (in our example, the type of movement: Deliveries or Returns).





This technique opens up several interesting possibilities for manipulating the data transformation. Without going into details, we will simply list what we can do:





Once our transformation specification table is ready, we load it into the database and join it to our primary repository on the Path and Element_Name fields.



By concatenating the File_ID, the Record_ID trimmed in accordance with the ElementDepth value, and the AdditionalInfo value, we form the composite key of our target table.



The result of the join is poured, for each target table, into a separate "temporary" table (depending on the amount of data, you can try using the query result "on the fly"); in the next step the final utilities of our "pipeline" work with these tables.
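
A sketch of this join-and-key step (sqlite3 again as a stand-in; the Mapping table is assumed to hold the spreadsheet columns Path, Element_Name, TargetTable, TargetField, ElementDepth and AdditionalInfo, and the key layout "trimmed Record_ID;File_ID@AdditionalInfo" follows the result tables shown below):

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect("xml_store.db")

rows = conn.execute(
    """
    SELECT m.TargetTable,
           substr(x.Record_ID, 1, m.ElementDepth * 3 + (m.ElementDepth - 1))
             || ';' || x.File_ID
             || CASE WHEN m.AdditionalInfo <> ''
                     THEN '@' || m.AdditionalInfo
                     ELSE '' END                      AS Composite_Key,
           m.TargetField,
           x.Element_Value
    FROM   XML_Elements AS x
    JOIN   Mapping      AS m
      ON   m.Path = x.Path AND m.Element_Name = x.Element_Name
    """
).fetchall()

# One "temporary" set of <key> - <field name> - <value> triples per target table:
per_target = defaultdict(list)
for target_table, key, field, value in rows:
    per_target[target_table].append((key, field, value))
```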



A little more about the composite key, perhaps repeating ourselves a little, but this is an important and subtle point:
  • Before pouring, we have a data set at the level of individual fields (elements). To join the fields into a record of the resulting table, we need a key that uniquely identifies the target table record into which the corresponding fields will fall.

  • The hierarchical identifier Record_ID can have different lengths, depending on the "depth" of individual elements in the document outline. However, if going a level deeper is not accompanied by one-to-many multiplication of elements, we cut our Record_ID to the minimum sufficient depth, defined by the ElementDepth parameter, which ensures that the identifier is the same for all fields of our target table. There is no such situation in our demo documents, but imagine, for example, that our UnitPrice were "split" into two values, the wholesale and retail prices UnitPrice\Wholesale and UnitPrice\Retail.

  • Since our base repository holds the contents of many files, we cannot do without the File_ID value in our key.
  • The subsequent stages of data transformation work only with the "transformed" tables obtained at this step; we do not have any end-to-end system of settings. The type of a field (dimension/measure) is passed through name suffixes, but sometimes we also need to pass "along the chain" information about which section of the document the data was taken from (remember that we can transform documents encoded with different schemas into the same form). To transfer this information to the next stage, we use the optional AdditionalInfo parameter of our transformation, "hooking" it onto our composite key so that the identification of the target records is not violated.




Let's see what we get at the output in our example:



The result of the transformation:

MovementReports:

| KEY | TargetField | Element_Value |
| --- | --- | --- |
| 001;2006@Deliveries | PeriodBegin_Date | 2017-01-01 |
| 001;2006@Deliveries | PeriodEnd_Date | 2017-01-31 |
| 001;2007@Returns | PeriodBegin_Date | 2017-02-01 |
| 001;2007@Returns | PeriodEnd_Date | 2017-02-28 |

Products:

| KEY | TargetField | Element_Value |
| --- | --- | --- |
| 001\003\001;2006 | Supplier_Dim | Zaanse Snoepfabriek |
| 001\003\001;2006 | ProductName_Dim | Chocolade |
| 001\003\002;2006 | Supplier_Dim | Mayumi's |
| 001\003\002;2006 | ProductName_Dim | Tofu |
| 001\003\001;2007 | Supplier_Dim | Pavlova, Ltd. |
| 001\003\001;2007 | ProductName_Dim | Pavlova |
| 001\003\002;2007 | Supplier_Dim | Formaggi Fortini srl |
| 001\003\002;2007 | ProductName_Dim | Mozzarella di Giovanni |

MovementDetails:

| KEY | TargetField | Element_Value |
| --- | --- | --- |
| 001\003\001\003\001;2006 | MovementDate_Date | 2017-01-03 |
| 001\003\001\003\001;2006 | UnitPrice_Val | 10.2000 |
| 001\003\001\003\001;2006 | Quantity_Val | 70 |
| 001\003\002\003\001;2006 | MovementDate_Date | 2017-01-09 |
| 001\003\002\003\001;2006 | UnitPrice_Val | 18.6000 |
| 001\003\002\003\001;2006 | Quantity_Val | 12 |
| 001\003\002\003\002;2006 | MovementDate_Date | 2017-01-13 |
| 001\003\002\003\002;2006 | UnitPrice_Val | 18.7000 |
| 001\003\002\003\002;2006 | Quantity_Val | 20 |
| 001\003\001\003\001;2007 | MovementDate_Date | 2017-02-21 |
| 001\003\001\003\001;2007 | UnitPrice_Val | 13.9000 |
| 001\003\001\003\001;2007 | Quantity_Val | 2 |
| 001\003\002\003\001;2007 | MovementDate_Date | 2017-02-27 |
| 001\003\002\003\001;2007 | UnitPrice_Val | 27.8000 |
| 001\003\002\003\001;2007 | Quantity_Val | 4 |




Note that the resulting key is the same for all fields that will end up in the corresponding record of each target table.



3. Final processing.



3.1. Pivoting


Having obtained, with the previous transformation, a "blank" of the target table with the source data split into <key> - <field name> - <value> triples, we need to translate it into the more familiar form of a table with many fields. The algorithm of this transformation is obvious: first, by grouping the values of our composite key, we get the "skeleton" of the table; then we join this "skeleton" with the transformation result table on the value of the composite key ("growing meat on the skeleton," so to speak).



That is, there will be N joins of the "skeleton" with subsets of the transformation result table identified by field name, where N is the number of field names in the target table.



We safely "carried" the AdditionalInfo field through to this stage by encoding it inside the composite key. Now we need to free our key from this "burden" and cut the AdditionalInfo part off into a new AdditionalInfo_Dim field.



We combined the file code and record identifier in order to carry the key to the pivoting stage in a single field. For "final" storage it is better to split the file code and the hierarchical identifier back into two fields; this makes it easier to link the resulting tables with each other.
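
The mechanics of the pivot can be shown in a few lines of Python; in SQL the same thing is usually done with one join per field or with conditional aggregation. The key-splitting convention matches the composite key built earlier; everything else is an illustrative sketch:

```python
from collections import defaultdict

def pivot(triples):
    """Fold (key, field, value) triples into one record per composite key."""
    records = defaultdict(dict)               # the "skeleton": one dict per key
    for key, field, value in triples:
        records[key][field] = value           # "growing meat" on the skeleton

    result = []
    for key, fields in records.items():
        id_part, _, additional = key.partition("@")     # cut AdditionalInfo off the key
        record_id, _, file_id = id_part.partition(";")  # split it back into two fields
        row = {"Record_ID": record_id, "File_ID": int(file_id)}
        if additional:
            row["AdditionalInfo_Dim"] = additional
        row.update(fields)
        result.append(row)
    return result

demo = [
    ("001;2006@Deliveries", "PeriodBegin_Date", "2017-01-01"),
    ("001;2006@Deliveries", "PeriodEnd_Date", "2017-01-31"),
]
print(pivot(demo))
# [{'Record_ID': '001', 'File_ID': 2006, 'AdditionalInfo_Dim': 'Deliveries',
#   'PeriodBegin_Date': '2017-01-01', 'PeriodEnd_Date': '2017-01-31'}]
```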



The result will be the following pivoted tables:

MovementReports:

| Record_ID | File_ID | AdditionalInfo_Dim | PeriodBegin_Date | PeriodEnd_Date |
| --- | --- | --- | --- | --- |
| 001 | 2006 | Deliveries | 2017-01-01 | 2017-01-31 |
| 001 | 2007 | Returns | 2017-02-01 | 2017-02-28 |

Products:

| Record_ID | File_ID | Supplier_Dim | ProductName_Dim |
| --- | --- | --- | --- |
| 001\003\001 | 2006 | Zaanse Snoepfabriek | Chocolade |
| 001\003\002 | 2006 | Mayumi's | Tofu |
| 001\003\001 | 2007 | Pavlova, Ltd. | Pavlova |
| 001\003\002 | 2007 | Formaggi Fortini srl | Mozzarella di Giovanni |

MovementDetails:

| Record_ID | File_ID | MovementDate_Date | UnitPrice_Val | Quantity_Val |
| --- | --- | --- | --- | --- |
| 001\003\001\003\001 | 2006 | 2017-01-03 | 10.2000 | 70 |
| 001\003\001\003\001 | 2007 | 2017-02-21 | 13.9000 | 2 |
| 001\003\002\003\001 | 2006 | 2017-01-09 | 18.6000 | 12 |
| 001\003\002\003\001 | 2007 | 2017-02-27 | 27.8000 | 4 |
| 001\003\002\003\002 | 2006 | 2017-01-13 | 18.7000 | 20 |



3.2. Normalization


The next stage can be conventionally called normalization: we replace all fields with the _Dim suffix with references to the corresponding dictionaries (dimension tables), and convert fields with the _Date and _Val suffixes into dates and numbers, respectively. Other data type suffixes can be used if necessary; this is not a dogma.



For each _Dim field, we check whether the corresponding dictionary exists (if not, we create it) and add any new values found in the pivoted table.



At the end of the process, we transfer the data into the final storage structure by joining the pivoted table with the updated dictionaries on the field values.
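
A sketch of this normalization pass, with a plain dictionary standing in for the dimension tables and the _Dim/_Date/_Val suffix convention taken from above; the surrogate-key generation and the chosen Python types are illustrative assumptions:

```python
from datetime import date
from decimal import Decimal

dimensions = {}    # dictionary name -> {value: surrogate id}, stands in for dimension tables

def dim_id(dim_name, value):
    """Return the surrogate key for a dictionary value, creating it if needed."""
    table = dimensions.setdefault(dim_name, {})
    if value not in table:
        table[value] = len(table) + 1            # simple auto-increment surrogate
    return table[value]

def normalize(row):
    out = {}
    for field, value in row.items():
        if field.endswith("_Dim"):
            out[field] = dim_id(field, value)    # store a reference instead of the text
        elif field.endswith("_Date"):
            out[field] = date.fromisoformat(value)
        elif field.endswith("_Val"):
            out[field] = Decimal(value)
        else:
            out[field] = value                   # Record_ID, File_ID, ...
    return out

print(normalize({"Record_ID": "001\\003\\001", "File_ID": 2006,
                 "Supplier_Dim": "Zaanse Snoepfabriek", "ProductName_Dim": "Chocolade"}))
```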



As a result, we get our target tables, which contain the file code, the hierarchical identifier, references to dictionary values, and the remaining fields converted to the necessary types.



To conclude the description of the process



The hierarchical identifier in each target table gives us the ability to link these tables with each other (in any order; intermediate links can be skipped if necessary). As noted above, the fixed size of the hierarchical identifier's components makes it fairly easy to build expressions for linking our target tables.
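
For instance, linking MovementDetails back to Products is a join on File_ID plus a fixed-length prefix of Record_ID (a sketch; the column names follow the pivoted tables from section 3.1, sqlite3 again stands in for the real DBMS):

```python
import sqlite3

conn = sqlite3.connect("xml_store.db")

report = conn.execute(
    """
    SELECT p.Supplier_Dim, p.ProductName_Dim,
           d.MovementDate_Date, d.Quantity_Val, d.UnitPrice_Val
    FROM   MovementDetails AS d
    JOIN   Products        AS p
      ON   p.File_ID   = d.File_ID
     AND   p.Record_ID = substr(d.Record_ID, 1, 11)   -- depth-3 ancestor = the Product
    """
).fetchall()
```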



Of course, if you wish, you can supplement the table structure with links via ordinary auto-increment keys.



Note how tolerant the described approach is to errors and omissions by the developer (more precisely, by the person configuring the mapping). If you missed a required field while preparing the transformation, or if there were changes in the file schema, such as added or renamed fields, that were not tracked in time, that is fine, even if a large amount of information has already been processed without taking the error into account.



All the information is saved and available in our "primary" table storage, so the source files will not have to be re-parsed.



You can register a field mapping for a "small" data transformation covering only the missing fragment. After we "pull out" the missing piece of data, it can quite easily be "attached" to the main processed array by joining on the hierarchical identifier.



Conclusions



As we have seen, from the very beginning we put all the source information, regardless of the document schemas, into a uniform form, which makes it possible to quickly "pull out" the necessary data in ad hoc mode.



In the course of further processing, no animals were harmed: there was no need to create or modify any scripts or ETL-task-like objects specific to the structure of the source XML files.



Of course, the software environment of our "framework" (the loading and data transfer infrastructure) had to be created once, but after that the integration of XML data of any structure is reduced to editing the mapping table of source and target views.

Source: https://habr.com/ru/post/325186/


