This article describes an “unconventional” but rather powerful XML processing technique that lets you import arbitrary XML data and transform its structure efficiently and simply; the same processing pipeline handles source data of any structure without any changes to the program code.
XML as an intermediate data exchange format has become “overgrown” with an ecosystem of technologies and tools: specialized editors, DOM parsers, XQuery/XPath, XSLT, dedicated ETL modules, and so on. All this diversity and sophistication means that, ideologically, we now have technologies for working not simply with data, but with a special kind of data: XML data. It is as if, alongside the science of chemistry, there were a separate “science of chemistry for substances stored in blue boxes.”
The processing principles described below allowed us to rewind this notional technological progress a little and move almost all work with XML data down to the level of a “plain” DBMS, without XML-specific tools. The approach made it possible to organize unified storage for any XML data and to provide fast random access to the needed parts of the information. In addition, it became possible to implement ETL functionality in a universal and inexpensive way, with practically no coding.
The approach has proven particularly useful for large data sources with complex structures and frequently changing data schemas. If you are dealing with a small amount of information and/or a simple structure, you may not need this technology yet.
A good demonstration case for this technology is the public procurement server zakupki.gov.ru, whose data is available over FTP: daily updates amount to tens or hundreds of thousands of XML files totaling gigabytes or tens of gigabytes, and on average a new version of the data schemas is released every few weeks.
The data structure follows the requirements of the legislation, so, for example, procurement notifications are delivered as more than a dozen document types fcsNotification*, depending on the type of purchase (electronic auction fcsNotificationEF, request for quotes fcsNotificationZK, purchase from a single supplier fcsNotificationEP, and so on).
All these documents are based on the same basic notification type but differ in details, so for analysis purposes this diversity must at some point during import be “collapsed” and reduced to a “common denominator”.
The described approach has been successfully applied to this public procurement data and works effectively on it.
Briefly, the stages/elements of the described technology are:
(1) Import of all XML data into a single table of unified structure. We are not storing documents as a whole in a document database; we import the data element by element as pairs “element name” - “value” or “attribute name” - “value”. As a result of this stage we get rid of XML as a storage format and gain fast, uniform access to the data of all imported XML documents of any structure (and we no longer need XQuery/XPath).
(2) The second element of the technology is the creation of specifications for “pulling out” the data we need: identifying the resulting tables into which we will pour the data and mapping the source fields to the result fields. This stage can be carried out either on the basis of the documents' XSD schemas or without schemas, by analyzing sample documents loaded at the first stage. It requires no programming or special skills; the main tool here is a spreadsheet.
(3) The final steps are selecting the necessary information from the primary import storage (1) using the specifications (2), converting it to a “columnar” view (“pivoting”), and automated transformation into the final “analytical” format: in terms of analytical data warehouses, fact tables in a star schema with references to dimension tables (dimensions) and numeric measures.
1. Primary import.
We treat an XML document as a tree whose nodes are “name” - “value” pairs. The described approach is therefore quite universal and can be applied to any tree-like data representation.
Table structure for loading data from XML:
- Record_ID : an element identifier of a special hierarchical form that makes it possible to relate the different levels of the document to one another
- File_ID : since you will need to load the contents of multiple XML files into a table, you must also store the file ID
- Path : the full path to the element from the document root (essentially the XPath path to this element)
- Element_Name : the name of the element or attribute
- Element_Value : the value of the element or attribute (as a string, exactly as it is stored in the XML)
- Type : the record type (element or attribute); we save it just in case we ever need to reconstruct the XML from the table
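As an illustration, such a receiving table could be sketched like this (SQLite via Python; the table name xml_import and the column types are my assumptions, any RDBMS would do):

import sqlite3

conn = sqlite3.connect("xml_store.db")
conn.execute(r"""
    CREATE TABLE IF NOT EXISTS xml_import (
        Record_ID     TEXT NOT NULL,    -- hierarchical identifier, e.g. '001\003\001'
        File_ID       INTEGER NOT NULL, -- code of the source file
        Path          TEXT NOT NULL,    -- XPath-like path from the document root
        Element_Name  TEXT NOT NULL,
        Element_Value TEXT,             -- value as a string, exactly as in the XML
        Type          TEXT NOT NULL     -- 'E' = element, 'A' = attribute
    )
""")
conn.commit()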
The idea of loading a tree into a table is fairly obvious. MS SQL Server (I cannot speak for other DBMSs, I have not checked) even has a built-in capability: XML imported without a schema lands in a so-called EDGE table. This is not quite what we need, because in the EDGE format an element's name and its value are stored as separate records (the name is the parent record of the value), which is simply inconvenient for further manipulation. In addition, in the EDGE table the links of the tree are recorded by means of a ParentID.
In short, you can build the desired representation from the EDGE table, but you have to sweat a little to “glue” element names to their values, recreate the XPath leading to each element, and construct a hierarchical identifier (how we build it is described just below). With large amounts of data this can be quite resource-intensive, but it does let you get by with a single tool/language.
A better way is to obtain the document tree with an XML parser (there is an implementation in almost every language and development environment) and fill in all the necessary information in a single pass over the document.
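A minimal sketch of such a single-pass loader in Python (xml.etree.ElementTree). The 3-character zero-padded segments and the backslash separators mirror the example tables below; the function name load_xml and the way attributes are numbered alongside child elements are my assumptions, the article does not prescribe them.

import xml.etree.ElementTree as ET

SEG = 3        # fixed width of one Record_ID segment
SEP = "\\"     # separator between segments

def load_xml(path, file_id):
    """Walk an XML file once and return rows for the primary import table."""
    rows = []

    def visit(elem, record_id, parent_path):
        tag = elem.tag.split("}")[-1]                  # drop a namespace prefix, if any
        rows.append((record_id, file_id, parent_path,
                     tag, (elem.text or "").strip(), "E"))
        elem_path = parent_path + tag + SEP
        child_no = 0
        for name, value in elem.attrib.items():        # attributes become 'A' rows
            child_no += 1
            rows.append((record_id + SEP + str(child_no).zfill(SEG),
                         file_id, elem_path, name, value, "A"))
        for child in elem:                             # child elements get padded counters
            child_no += 1
            visit(child, record_id + SEP + str(child_no).zfill(SEG), elem_path)

    visit(ET.parse(path).getroot(), "001", "")
    return rows

for row in load_xml("deliveries.xml", 2006):
    print(row)

The rows returned here can then be bulk-inserted into the receiving table sketched above.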
Let's look at a specific example. We have two demo XML files, deliveries.xml and returns.xml. The deliveries.xml file has the root element Deliveries; at the top level it contains the start and end dates of the period for which the data was exported, then the products with their name and supplier, and for each product the delivery details: date, quantity, price.
deliveries.xml:

<Deliveries>
  <PeriodBegin>2017-01-01</PeriodBegin>
  <PeriodEnd>2017-01-31</PeriodEnd>
  <Products>
    <Product>
      <Supplier>Zaanse Snoepfabriek</Supplier>
      <ProductName>Chocolade</ProductName>
      <Details>
        <Detail>
          <DeliveryDate>2017-01-03</DeliveryDate>
          <UnitPrice>10.2000</UnitPrice>
          <Quantity>70</Quantity>
        </Detail>
      </Details>
    </Product>
    <Product>
      <Supplier>Mayumi's</Supplier>
      <ProductName>Tofu</ProductName>
      <Details>
        <Detail>
          <DeliveryDate>2017-01-09</DeliveryDate>
          <UnitPrice>18.6000</UnitPrice>
          <Quantity>12</Quantity>
        </Detail>
        <Detail>
          <DeliveryDate>2017-01-13</DeliveryDate>
          <UnitPrice>18.7000</UnitPrice>
          <Quantity>20</Quantity>
        </Detail>
      </Details>
    </Product>
  </Products>
</Deliveries>
The returns.xml file has exactly the same structure, except that the root element is called Returns and the date element inside Detail has a different name.
returns.xml:

<Returns>
  <PeriodBegin>2017-02-01</PeriodBegin>
  <PeriodEnd>2017-02-28</PeriodEnd>
  <Products>
    <Product>
      <Supplier>Pavlova, Ltd.</Supplier>
      <ProductName>Pavlova</ProductName>
      <Details>
        <Detail>
          <ReturnDate>2017-02-21</ReturnDate>
          <UnitPrice>13.9000</UnitPrice>
          <Quantity>2</Quantity>
        </Detail>
      </Details>
    </Product>
    <Product>
      <Supplier>Formaggi Fortini srl</Supplier>
      <ProductName>Mozzarella di Giovanni</ProductName>
      <Details>
        <Detail>
          <ReturnDate>2017-02-27</ReturnDate>
          <UnitPrice>27.8000</UnitPrice>
          <Quantity>4</Quantity>
        </Detail>
      </Details>
    </Product>
  </Products>
</Returns>
The names of the loaded files are stored in a separate table; in it, our files have the codes 2006 (deliveries) and 2007 (returns).
In our receiving table, the image of our demo documents will look like this:
(This is not the whole table, just its beginning.)

Record_ID | File_ID | Path | Element_Name | Element_Value | Type |
001 | 2006 | | Deliveries | | E |
001\001 | 2006 | Deliveries\ | PeriodBegin | 2017-01-01 | E |
001\002 | 2006 | Deliveries\ | PeriodEnd | 2017-01-31 | E |
001\003 | 2006 | Deliveries\ | Products | | E |
001\003\001 | 2006 | Deliveries\Products\ | Product | | E |
001\003\001\001 | 2006 | Deliveries\Products\Product\ | Supplier | Zaanse Snoepfabriek | E |
001\003\001\002 | 2006 | Deliveries\Products\Product\ | ProductName | Chocolade | E |
001\003\001\003 | 2006 | Deliveries\Products\Product\ | Details | | E |
001\003\001\003\001 | 2006 | Deliveries\Products\Product\Details\ | Detail | | E |
001\003\001\003\001\001 | 2006 | Deliveries\Products\Product\Details\Detail\ | DeliveryDate | 2017-01-03 | E |
001\003\001\003\001\002 | 2006 | Deliveries\Products\Product\Details\Detail\ | UnitPrice | 10.2000 | E |
001\003\001\003\001\003 | 2006 | Deliveries\Products\Product\Details\Detail\ | Quantity | 70 | E |
Regarding the hierarchical identifier Record_ID: its purpose is to number the nodes of the document tree uniquely while preserving the links to all ancestors.
Note: SQL Server has a special data type (an object extension) called hierarchyid that serves exactly this purpose.
In the example above we use a simple, platform-independent implementation: the element counters of each level of the tree are concatenated in sequence. We pad the counter of each level with zeros to a fixed width, so that the identifier of an ancestor at any level can be extracted easily and quickly as a fixed-length substring.
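A sketch of that extraction, under the 3-character segment and backslash-separator convention used in the tables above (the helper name ancestor_id is mine):

SEG = 3      # width of one zero-padded counter segment
SEP = "\\"   # segment separator

def ancestor_id(record_id: str, depth: int) -> str:
    """Record_ID of the ancestor at the given depth (1 = document root)."""
    # each segment takes SEG characters plus one separator character
    return record_id[: depth * (SEG + len(SEP)) - len(SEP)]

# the Detail-level row '001\003\001\003\001\001' belongs to Product '001\003\001'
assert ancestor_id("001\\003\\001\\003\\001\\001", 3) == "001\\003\\001"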
That is all for this stage. We can now use simple SQL queries to isolate subsets of product data, delivery details or returns, and link them together through the file and record identifiers.
In many cases this alone is enough for efficient storage of and access to the data, and the “advanced” techniques described below may not be needed.
Indeed, we have already loaded all our data into a single repository without worrying about possible differences in document structure, we have a fairly effective way to extract the necessary information from the whole array with simple SQL queries, and we can link the extracted data subsets together.
Even without index tuning and other tweaks, the performance of such a solution as a store of source XML for ad-hoc queries will clearly be much higher than if the XML were kept in files or even written into dedicated XML database columns. We do not have to run an XPath search (which implies re-parsing) over each document: we parsed everything once and then simply use the saved result through fairly simple queries.
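As an illustration, here is one such ad-hoc query against the primary table (SQLite via Python; the table name xml_import follows the sketch above, and the query itself is only an example of linking rows through a fixed-length Record_ID prefix):

import sqlite3

conn = sqlite3.connect("xml_store.db")

# every Quantity together with the ProductName of its Product,
# linked through a fixed-length prefix of Record_ID (3 segments = 11 characters)
rows = conn.execute("""
    SELECT d.File_ID, p.Element_Value AS product, d.Element_Value AS quantity
    FROM   xml_import d
    JOIN   xml_import p
           ON  p.File_ID = d.File_ID
           AND substr(p.Record_ID, 1, 11) = substr(d.Record_ID, 1, 11)
    WHERE  d.Element_Name = 'Quantity'
      AND  p.Element_Name = 'ProductName'
""").fetchall()

for file_id, product, quantity in rows:
    print(file_id, product, quantity)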
But: at this stage we have not yet made DeliveryDate, Quantity and UnitPrice fields of one and the same table; that is the “pivoting” process described in the third section.
In the next steps we will look at transforming these XML documents into a single data structure of three tables:
MovementReports (the movement type, delivery or return, plus the period start and end dates from the document root),
Products (name and supplier), and
MovementDetails (price, quantity, date; the resulting date field will be the same for both source documents even though it is named differently in the source files).
2. Creating specifications for transformation into the resulting tables.
Let's consider the process of creating specifications that map the source data to the resulting tables. To build such specifications we first need some additional input.
2.1. Obtaining a table describing the document structure.
For further processing we need the detailed structure of all our XML documents, so that based on it we can decide into which table structure we will transform all of this.
A single sample XML document is not enough for this task: a particular document may lack optional elements that suddenly appear in other documents.
If we do not have an XSD schema, or do not want to deal with it, it may be enough to load a representative sample of XML documents into our table and build the required list from the Path and Element_Name fields.
However, remember that in the end we want to load the information into certain target “final” tables, so we need to know where the one-to-many relationships live in the XML, that is, to understand which elements form a “child” table, where the data starts to multiply.
Sometimes, if the document schema is not very complicated, this is immediately obvious “by eye”. Also, when grouping the data of our “representative sample” we can count the elements and see from those statistics where they begin to “multiply”. But in general, if a proper XSD schema is available, it is better to use it: we catch one-to-many multiplication by detecting the XSD construct maxOccurs="unbounded".
As you can see, the task is slightly more involved: I want to obtain not just a flat table listing the XPath paths of all elements in our documents, but one that also indicates where the multiplication of data begins. (And as a pleasant bonus, when analyzing a well-written XSD schema you can also pull out element descriptions and types.)
It would have been nice to use the functionality of some XML editor to get such a table, but we could not find a tool that would produce the document structure we needed from an XSD schema (we searched and experimented for a long time).
Oxygen, Altova, Liquid and their less sophisticated peers undoubtedly use the necessary information internally, but none of them can output it in the form we need. An advanced editor can usually generate a sample XML from a schema, but XSD allows a choice construct, where the document may contain one of several alternative elements; in that case it is better to analyze real “production” document samples. And in any case, from a document sample (or samples) we will not explicitly catch the points of one-to-many multiplication.
As a result, I had to reinvent the wheel and write a generator of such a table (in effect, a special kind of XSD parser) myself. Conveniently, XSD is itself XML, so it too can be loaded into our repository and the desired view extracted with relational operations. If the schema is simple, without references to complex types and without inheritance from base types, this is quite easy. When all that type inheritance is present (as in the public procurement schemas, for example), the task becomes more complicated.
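For the simple case just described (inline complexTypes with xs:sequence, no type references or inheritance), such a structure extractor can be sketched roughly as follows; the function name xsd_structure and the path conventions are mine, and a real schema would need more cases (choice, attributes, type libraries):

import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def xsd_structure(xsd_path):
    """Rows (Path, Element_Name, maxoccurs) for a 'flat' XSD:
    inline complexTypes with xs:sequence only, no type references or inheritance."""
    rows = []

    def walk(decl, path, repeating):
        name = decl.get("name")
        if decl.get("maxOccurs") == "unbounded":
            # remember where the data starts to multiply, as a path relative to the root
            repeating = (path + name).split("\\", 1)[-1]
        children = decl.findall(f"./{XS}complexType/{XS}sequence/{XS}element")
        if not children:
            rows.append((path, name, repeating))   # a leaf element: one structure row
        for child in children:
            walk(child, path + name + "\\", repeating)

    for decl in ET.parse(xsd_path).getroot().findall(XS + "element"):
        walk(decl, "", "")
    return rows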
For our example we get roughly the following document structure table:

Path | Element_Name | maxoccurs |
Deliveries\ | PeriodBegin | |
Deliveries\ | PeriodEnd | |
Deliveries\Products\Product\ | ProductName | Products\Product |
Deliveries\Products\Product\ | Supplier | Products\Product |
Deliveries\Products\Product\Details\Detail\ | DeliveryDate | Products\Product\Details\Detail |
Deliveries\Products\Product\Details\Detail\ | Quantity | Products\Product\Details\Detail |
Deliveries\Products\Product\Details\Detail\ | UnitPrice | Products\Product\Details\Detail |
Returns\ | PeriodBegin | |
Returns\ | PeriodEnd | |
Returns\Products\Product\ | ProductName | Products\Product |
Returns\Products\Product\ | Supplier | Products\Product |
Returns\Products\Product\Details\Detail\ | Quantity | Products\Product\Details\Detail |
Returns\Products\Product\Details\Detail\ | ReturnDate | Products\Product\Details\Detail |
Returns\Products\Product\Details\Detail\ | UnitPrice | Products\Product\Details\Detail |
2.2. Transformation Description
Having obtained the document structure as a table, we proceed to designing the data transformation. This transformation will turn the data of our primary XML repository into new tables with new field names and unique keys for the target table records. No special software tools are needed; we just add a few new columns to the structure table:
In the table below, some paths are abbreviated with “...” for readability. (Note that the date and quantity fields of the Returns rows map to MovementDate_Date and Quantity_Val in the same way as for Deliveries.)

Path | Element_Name | maxoccurs | TargetTable | TargetField | ElementDepth | AdditionalInfo |
Deliveries\ | PeriodBegin | | MovementReports | PeriodBegin_Date | 1 | Deliveries |
Deliveries\ | PeriodEnd | | MovementReports | PeriodEnd_Date | 1 | Deliveries |
Deliveries\...\Product\ | ProductName | ...\Product | Products | ProductName_Dim | 3 | |
Deliveries\...\Product\ | Supplier | ...\Product | Products | Supplier_Dim | 3 | |
Deliveries\...\...\...\Detail\ | DeliveryDate | ...\Detail | MovementDetails | MovementDate_Date | 5 | |
Deliveries\...\...\...\Detail\ | Quantity | ...\Detail | MovementDetails | Quantity_Val | 5 | |
Deliveries\...\...\...\Detail\ | UnitPrice | ...\Detail | MovementDetails | UnitPrice_Val | 5 | |
Returns\ | PeriodBegin | | MovementReports | PeriodBegin_Date | 1 | Returns |
Returns\ | PeriodEnd | | MovementReports | PeriodEnd_Date | 1 | Returns |
Returns\...\Product\ | ProductName | ...\Product | Products | ProductName_Dim | 3 | |
Returns\...\Product\ | Supplier | ...\Product | Products | Supplier_Dim | 3 | |
Returns\...\...\...\Detail\ | Quantity | ...\Detail | MovementDetails | Quantity_Val | 5 | |
Returns\...\...\...\Detail\ | ReturnDate | ...\Detail | MovementDetails | MovementDate_Date | 5 | |
Returns\...\...\...\Detail\ | UnitPrice | ...\Detail | MovementDetails | UnitPrice_Val | 5 | |
Here are the fields we added:
- TargetTable : the name of the target table. Note that we use the one-to-many multiplication information (the maxoccurs column) to decide which data goes into which table.
- TargetField : the field name in the target table. Here we follow a convention-over-configuration approach: the _Dim suffix marks fields that will become dimensions, the _Date suffix marks date fields, and the _Val suffix marks numeric fields (measures). In the subsequent stages the corresponding utilities will understand from the suffix what to do with the field: build and update the required lookup table or convert the value to the appropriate format.
- ElementDepth : the effective nesting depth of the element. For the subsequent transformations we need to derive a single record key for the target table from the contents of the Record_ID and File_ID fields. Elements of different depth in the XML may have to end up in the same target table, so we indicate which part of the hierarchical code Record_ID to keep, discarding the rest. Thanks to the fixed length of each code segment, this is a fairly “cheap” substring operation of length [characters per code segment] * ElementDepth.
- AdditionalInfo : additional information. As a result of the transformation we reload the source data, split by target tables, into a similar structure with new field names, but in some places we need to preserve information about which XPath path the source data was taken from.
This technique opens up several interesting possibilities for manipulating the data during transformation. Without going into details, here is what we can do:
- If the source XML comes in several encoding schemas (for example, the “old” public procurement schema under law 94-FZ and the “new” one under 44-FZ), we can bring them to a single data structure by unifying the field names at the transformation description stage.
- As mentioned earlier, we can “collapse” different sections within one or several document schemas into a single, more compact storage structure, as we do in our example.
- You can also “collapse” data of different depths into a uniform view. For example, in the public procurement notification schemas, lot information may live at \lot for single-lot purchases (a single element) and at \lots\lot for multi-lot purchases (multiplication). With this approach, the whole zoo can quite easily be mapped into a single table of lot information.
Once our transformation specification table is ready, we load it into the database and join it to our primary repository on the Path and Element_Name fields.
By concatenating the Record_ID trimmed according to the ElementDepth value, the File_ID, and the AdditionalInfo value, we form the composite key of our target table records.
For each target table, the result of this join is poured into a separate “staging” table (depending on data volume you can also try using the query result “on the fly”); the final utilities of our “pipeline” will work with these tables in the next step.
A little more about the composite key, at the risk of some repetition, because this is an important and subtle point (a sketch of the key construction follows this list):
- Before pouring, we have data at the level of individual fields (elements). To combine fields into a record of the resulting table, we need a key that uniquely identifies the target table record the corresponding fields will fall into.
- The hierarchical identifier Record_ID can have different lengths, depending on the “depth” of individual elements in the document tree. When going deeper is not accompanied by one-to-many multiplication, we cut Record_ID down to the minimum sufficient depth given by the ElementDepth parameter, which guarantees that the identifier is the same for all fields of one target table record. Our demo documents have no such situation, but imagine, for example, that UnitPrice “split” into two values, the wholesale and retail prices UnitPrice\Wholesale and UnitPrice\Retail.
- Since our primary repository holds the contents of many files, the key cannot do without the File_ID value.
- The subsequent stages of the transformation work only with the “staging” tables obtained at this step; there is no end-to-end configuration system. The field type (dimension/measure) is passed through the name suffixes, but sometimes we also need to pass “down the chain” information about which part of the document the data came from (remember that documents encoded by different schemas may be transformed into a common form). To hand this information to the next stage, we use the optional AdditionalInfo parameter of the transformation, “hooking” it onto the composite key in a way that does not break the identification of target records.
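A minimal sketch of the key construction, matching the key format in the result tables below (the helper name composite_key and the constants are mine; the ';' and '@' separators are taken from those examples):

SEG = 3
SEP = "\\"

def composite_key(record_id: str, file_id: int, element_depth: int,
                  additional_info: str = "") -> str:
    """Trim Record_ID to element_depth segments, attach File_ID and,
    if present, AdditionalInfo."""
    trimmed = record_id[: element_depth * (SEG + len(SEP)) - len(SEP)]
    key = f"{trimmed};{file_id}"
    if additional_info:
        key += "@" + additional_info
    return key

# the PeriodBegin element of deliveries.xml (ElementDepth = 1, AdditionalInfo = 'Deliveries')
assert composite_key("001\\001", 2006, 1, "Deliveries") == "001;2006@Deliveries"
# a Detail-level element (ElementDepth = 5, no AdditionalInfo)
assert composite_key("001\\003\\001\\003\\001\\003", 2006, 5) == "001\\003\\001\\003\\001;2006"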
Let's see what we get as output in our example:
The result of the transformation:

MovementReports:
KEY | TargetField | Element_Value |
001;2006@Deliveries | PeriodBegin_Date | 2017-01-01 |
001;2006@Deliveries | PeriodEnd_Date | 2017-01-31 |
001;2007@Returns | PeriodBegin_Date | 2017-02-01 |
001;2007@Returns | PeriodEnd_Date | 2017-02-28 |
Products:
KEY | TargetField | Element_Value |
001\003\001;2006 | Supplier_Dim | Zaanse Snoepfabriek |
001\003\001;2006 | ProductName_Dim | Chocolade |
001\003\002;2006 | Supplier_Dim | Mayumi's |
001\003\002;2006 | ProductName_Dim | Tofu |
001\003\001;2007 | Supplier_Dim | Pavlova, Ltd. |
001\003\001;2007 | ProductName_Dim | Pavlova |
001\003\002;2007 | Supplier_Dim | Formaggi Fortini srl |
001\003\002;2007 | ProductName_Dim | Mozzarella di Giovanni |
MovementDetails:
KEY | TargetField | Element_Value |
001\003\001\003\001;2006 | MovementDate_Date | 2017-01-03 |
001\003\001\003\001;2006 | UnitPrice_Val | 10.2000 |
001\003\001\003\001;2006 | Quantity_Val | 70 |
001\003\002\003\001;2006 | MovementDate_Date | 2017-01-09 |
001\003\002\003\001;2006 | UnitPrice_Val | 18.6000 |
001\003\002\003\001;2006 | Quantity_Val | 12 |
001\003\002\003\002;2006 | MovementDate_Date | 2017-01-13 |
001\003\002\003\002;2006 | UnitPrice_Val | 18.7000 |
001\003\002\003\002;2006 | Quantity_Val | 20 |
001\003\001\003\001;2007 | MovementDate_Date | 2017-02-21 |
001\003\001\003\001;2007 | UnitPrice_Val | 13.9000 |
001\003\001\003\001;2007 | Quantity_Val | 2 |
001\003\002\003\001;2007 | MovementDate_Date | 2017-02-27 |
001\003\002\003\001;2007 | UnitPrice_Val | 27.8000 |
001\003\002\003\001;2007 | Quantity_Val | 4 |
Note that the resulting key is identical for all fields that will end up in the same record of a target table.
3. Final processing.
3.1. Pivoting
Having obtained, in the previous transformation, a “blank” of the target table with the source data split into triples <key> - <field name> - <value>, we need to turn it into the more familiar form of a table with many fields. The algorithm is straightforward: first, by grouping the values of the composite key, we get the “skeleton” of the table; then we join this “skeleton” with the transformation result table on the composite key value (putting “meat on the bones”, so to speak).
That is, there will be N joins of the “skeleton” with subsets of the transformation result table selected by field name, where N is the number of distinct field names in that table.
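A minimal in-memory sketch of this pivot over the (key, field, value) triples (the function name pivot is mine; a production version would do the same thing with GROUP BY and N joins in SQL, as described above):

from collections import defaultdict

def pivot(triples):
    """triples: iterable of (key, target_field, value) -> list of wide rows."""
    skeleton = defaultdict(dict)              # grouping by key gives the "skeleton"
    for key, field, value in triples:
        skeleton[key][field] = value          # each field name becomes a column
    return [{"KEY": key, **fields} for key, fields in skeleton.items()]

details = [
    ("001\\003\\001\\003\\001;2006", "MovementDate_Date", "2017-01-03"),
    ("001\\003\\001\\003\\001;2006", "UnitPrice_Val", "10.2000"),
    ("001\\003\\001\\003\\001;2006", "Quantity_Val", "70"),
]
print(pivot(details))
# [{'KEY': '001\\003\\001\\003\\001;2006', 'MovementDate_Date': '2017-01-03',
#   'UnitPrice_Val': '10.2000', 'Quantity_Val': '70'}]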
We safely “carried” the AdditionalInfo field to this stage by encoding it inside the composite key. Now we need to free the key of this “burden” and cut the AdditionalInfo part out into a new field, AdditionalInfo_Dim.
We combined the file code and the record identifier so that the key could be carried to the pivot stage as a single field. For “final” storage it is better to split the file code and the hierarchical identifier back into two fields; this makes it easier to relate the resulting tables to one another.
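A sketch of this key splitting, assuming the ';' and '@' separators used above (the helper name split_key is mine):

def split_key(key: str):
    """'001;2006@Deliveries' -> ('001', 2006, 'Deliveries')."""
    record_id, _, rest = key.partition(";")
    file_id, _, additional_info = rest.partition("@")
    return record_id, int(file_id), additional_info   # '' if there is no AdditionalInfo

assert split_key("001;2006@Deliveries") == ("001", 2006, "Deliveries")
assert split_key("001\\003\\001;2007") == ("001\\003\\001", 2007, "")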
The result is the following pivoted tables:

MovementReports:
Record_ID | File_ID | AdditionalInfo_Dim | PeriodBegin_Date | PeriodEnd_Date |
001 | 2006 | Deliveries | 2017-01-01 | 2017-01-31 |
001 | 2007 | Returns | 2017-02-01 | 2017-02-28 |
Products:
Record_ID | File_ID | Supplier_Dim | ProductName_Dim |
001\003\001 | 2006 | Zaanse Snoepfabriek | Chocolade |
001\003\002 | 2006 | Mayumi's | Tofu |
001\003\001 | 2007 | Pavlova, Ltd. | Pavlova |
001\003\002 | 2007 | Formaggi Fortini srl | Mozzarella di Giovanni |
MovementDetails:
Record_ID | File_ID | MovementDate_Date | UnitPrice_Val | Quantity_Val |
001\003\001\003\001 | 2006 | 2017-01-03 | 10.2000 | 70 |
001\003\001\003\001 | 2007 | 2017-02-21 | 13.9000 | 2 |
001\003\002\003\001 | 2006 | 2017-01-09 | 18.6000 | 12 |
001\003\002\003\001 | 2007 | 2017-02-27 | 27.8000 | 4 |
001\003\002\003\002 | 2006 | 2017-01-13 | 18.7000 | 20 |
3.2. Normalization
The next stage can loosely be called normalization: we replace all fields with the _Dim suffix by references to the corresponding lookup tables, and convert the fields with the _Date and _Val suffixes into dates and numbers, respectively. Other data type suffixes can be used if needed; this is not dogma.
For each _Dim field, we check whether the corresponding lookup table exists (creating it if it does not) and add any new values found in that field of the pivoted table.
At the end of the process we move the data into the final storage structure by joining the pivoted table to the updated lookup tables on the field values.
As a result we get our target tables, containing the file code, the hierarchical identifier, references to lookup table values, and the remaining fields converted to the required types.
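A rough in-memory sketch of this suffix-driven normalization (the function name normalize and the concrete Python types date/Decimal are my choices; in practice this would be UPDATE/INSERT statements against real lookup tables):

from datetime import date
from decimal import Decimal

def normalize(rows, dimensions):
    """rows: pivoted dicts; dimensions: {field_name: {value: surrogate_id}}."""
    result = []
    for row in rows:
        out = {}
        for field, value in row.items():
            if field.endswith("_Dim"):
                lookup = dimensions.setdefault(field, {})              # create the lookup table if missing
                out[field] = lookup.setdefault(value, len(lookup) + 1) # add new values, reuse existing ids
            elif field.endswith("_Date"):
                out[field] = date.fromisoformat(value)
            elif field.endswith("_Val"):
                out[field] = Decimal(value)
            else:
                out[field] = value                                     # Record_ID, File_ID, ...
        result.append(out)
    return result

dims = {}
facts = normalize([{"Record_ID": "001\\003\\001", "File_ID": 2006,
                    "Supplier_Dim": "Zaanse Snoepfabriek",
                    "ProductName_Dim": "Chocolade"}], dims)
print(facts, dims)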
To conclude the description of the process
The hierarchical identifier in each target table lets us link these tables to one another (and in any order; intermediate links can be skipped if necessary). As noted above, the fixed size of the hierarchical identifier's segments makes it fairly easy to build the expressions that join our target tables.
Of course, if desired, the table structure can be supplemented with links via ordinary auto-increment keys.
Note how tolerant the described approach is to errors and omissions by the developer (or rather, by the person configuring the mapping). If you missed a required field while preparing the transformation, or the file schema changed (fields were added or renamed) and this was not caught in time, that is fine, even if a large amount of information has already been processed without accounting for the error.
All the information is already saved and available in our “primary” tabular storage, so the source files do not have to be re-parsed.
You can set up a field mapping for a “small” transformation covering only the missing fragment. Once the missing piece of data has been “pulled out”, it can easily be attached to the main processed array by joining on the hierarchical identifier.
Conclusions
As we can see, from the very beginning we store all the source information in a uniform form, regardless of document schemas, which lets us quickly “pull out” the necessary data in ad-hoc mode.
In the course of further processing no animals were harmed: there was no need to create or modify any scripts or ETL-task-type objects specific to the structure of the source XML files.
Of course, the software environment of our “framework” (the loading and data transfer infrastructure) had to be created once, but after that, integrating data of any XML structure comes down to editing the mapping table between the source and target representations.