
Integrating XML data is another way.

This article describes an "unconventional" but rather powerful XML processing technique that lets you import XML data of any structure and transform it efficiently and simply, while the same processing pipeline handles source data of any structure without any change to the program code.



As an intermediate data exchange format, XML has become "overgrown" with an ecosystem of technologies and tools: specialized editors, DOM parsers, XQuery/XPath, XSLT, dedicated ETL modules, and so on. All this diversity and sophistication of tooling has ideologically led us to a point where we have a technology for working not simply with data, but with special XML data. It is as if, alongside a "science of chemistry," there were a separate "science of chemistry for substances stored in blue boxes."



The processing principles described below allowed us to rewind this conditional technological progress a little and move almost all work with XML data to the level of a "pure" DBMS, without XML-specific tools. This approach made it possible to organize unified storage of arbitrary XML data and provide fast random access to the necessary parts of the information. In addition, it became possible to implement ETL functionality in a universal, low-cost (practically coding-free) way.



The described approach works particularly well with large data sources and complex structures whose data schemas change frequently. If you have to deal with a small amount of information and/or a simple structure, you may not need this technique yet.


A good demo case for this technology is the public procurement server zakupki.gov.ru, available over the corresponding FTP: daily updates amount to tens or hundreds of thousands of XML files totaling gigabytes or tens of gigabytes, and on average a new version of the data schemas comes out every few weeks.



The data structure follows the requirements of the legislation; for example, information on public procurement notifications is provided by more than a dozen document types fcsNotification*, depending on the type of purchase (electronic auction fcsNotificationEF, request for quotes fcsNotificationZK, purchase from a single supplier fcsNotificationEP, etc.).

All these documents are based on the same basic notification type but differ in details; therefore, for analysis purposes, all this diversity must be "collapsed" at some point during import and brought to a common denominator.



The described approach has been applied successfully and works effectively on this public procurement data.



Briefly, the stages/elements of the described technology:



(1) Import all XML data into a table of unified structure. We are not talking about saving documents into a document database as a whole: we import the data element by element, as "element name"-"value" or "attribute name"-"value" pairs. As a result of this stage, we get rid of XML as a storage format and gain fast, unified access to the data of all imported XML documents of any structure (and we no longer need XQuery/XPath).



(2) The second element of the technology is the creation of specifications for "pulling out" the data we need: identifying the resulting tables into which we will pour the data, and mapping source fields to result fields. This stage can be carried out either on the basis of the documents' XSD schemas or without schemas, by analyzing the sample documents loaded at the first stage. It requires no programming or special skills; the main tool here is a spreadsheet.



(3) The final steps are selecting the necessary information from the primary import storage (1) using the specifications (2), converting it to a "columnar" view ("pivoting"), and automatically transforming it into the final "analytical" format: in data warehouse terms, fact tables in a star schema, with references to dimension tables and numeric measures.



1. Primary import.



We treat an XML document as a tree whose vertices are "name"-"value" pairs. The described approach is therefore quite universal and can be applied to any tree-like data representation.



Table structure for loading data from XML:

| Field | Purpose |
| --- | --- |
| Record_ID | hierarchical identifier of the tree node |
| File_ID | code of the source file |
| Path | path to the element within the document |
| Element_Name | name of the element or attribute |
| Element_Value | value |
| Type | node type ("E" for element in the example below) |





The idea of loading a tree into a table is fairly obvious. MS SQL (I cannot speak for other DBMSs, I have not looked) has a built-in capability for this: XML imported without specifying a schema ends up in a so-called EDGE table. This is not exactly what we need, because in the EDGE format the element name and its value are stored as separate records (the name is the parent record for the value), which is simply inconvenient for further manipulation. In addition, in the EDGE table the links in the tree are recorded by specifying a ParentID.



In short, you can build the desired data representation from the EDGE table, but you will have to sweat a little to "glue" the element names and values together, recreate the XPath to each element, and build a hierarchical identifier (how we build it is described just below). With a large amount of data, solving these problems can be quite resource-intensive, but you can get by with a single tool/language.



A better way is to obtain the document tree using an XML parser (there is an implementation in almost every language and development environment) and fill in the necessary information in a single pass through the document.
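
As an illustration, here is a minimal single-pass loader sketched in Python with the standard xml.etree.ElementTree parser. It assumes the receiving table layout described above (Record_ID, File_ID, Path, Element_Name, Element_Value, Type), three-digit zero-padded level counters and a backslash separator as in the example tables below; the function name and the "A" type code for attributes are illustrative assumptions, not part of the original toolchain.

```python
import xml.etree.ElementTree as ET

LEVEL_WIDTH = 3      # fixed counter width per tree level, e.g. "001"
SEP = "\\"           # separator used inside Record_ID and Path

def iter_rows(file_id, xml_file):
    """Yield (Record_ID, File_ID, Path, Element_Name, Element_Value, Type) tuples."""
    root = ET.parse(xml_file).getroot()

    def walk(elem, record_id, path):
        value = (elem.text or "").strip()
        yield record_id, file_id, path, elem.tag, value, "E"
        # attributes become rows of their own; "A" as the type code is an assumption
        for attr_name, attr_value in elem.attrib.items():
            yield record_id, file_id, path + elem.tag + SEP, attr_name, attr_value, "A"
        for i, child in enumerate(elem, start=1):
            child_id = record_id + SEP + str(i).zfill(LEVEL_WIDTH)
            yield from walk(child, child_id, path + elem.tag + SEP)

    yield from walk(root, "1".zfill(LEVEL_WIDTH), "")

# e.g. collect the rows for deliveries.xml (file code 2006) for a bulk insert
rows = list(iter_rows(2006, "deliveries.xml"))
```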



Let's look at a specific example. We have two demo XML files, deliveries.xml and returns.xml. The deliveries.xml file contains the root element Deliveries; at the top level are the start and end dates of the period for which the data is unloaded, then come the products with their name and supplier, and for each product the delivery details are given: date, quantity, price.



deliveries.xml
<Deliveries>
  <PeriodBegin>2017-01-01</PeriodBegin>
  <PeriodEnd>2017-01-31</PeriodEnd>
  <Products>
    <Product>
      <Supplier>Zaanse Snoepfabriek</Supplier>
      <ProductName>Chocolade</ProductName>
      <Details>
        <Detail>
          <DeliveryDate>2017-01-03</DeliveryDate>
          <UnitPrice>10.2000</UnitPrice>
          <Quantity>70</Quantity>
        </Detail>
      </Details>
    </Product>
    <Product>
      <Supplier>Mayumi's</Supplier>
      <ProductName>Tofu</ProductName>
      <Details>
        <Detail>
          <DeliveryDate>2017-01-09</DeliveryDate>
          <UnitPrice>18.6000</UnitPrice>
          <Quantity>12</Quantity>
        </Detail>
        <Detail>
          <DeliveryDate>2017-01-13</DeliveryDate>
          <UnitPrice>18.7000</UnitPrice>
          <Quantity>20</Quantity>
        </Detail>
      </Details>
    </Product>
  </Products>
</Deliveries>




The returns.xml file (returns) is structured exactly the same, only the root element is called Returns and the date element in the details has a different name.



returns.xml
<Returns>
  <PeriodBegin>2017-02-01</PeriodBegin>
  <PeriodEnd>2017-02-28</PeriodEnd>
  <Products>
    <Product>
      <Supplier>Pavlova, Ltd.</Supplier>
      <ProductName>Pavlova</ProductName>
      <Details>
        <Detail>
          <ReturnDate>2017-02-21</ReturnDate>
          <UnitPrice>13.9000</UnitPrice>
          <Quantity>2</Quantity>
        </Detail>
      </Details>
    </Product>
    <Product>
      <Supplier>Formaggi Fortini srl</Supplier>
      <ProductName>Mozzarella di Giovanni</ProductName>
      <Details>
        <Detail>
          <ReturnDate>2017-02-27</ReturnDate>
          <UnitPrice>27.8000</UnitPrice>
          <Quantity>4</Quantity>
        </Detail>
      </Details>
    </Product>
  </Products>
</Returns>




The names of the uploaded files are stored in a separate table, the codes of our files there are 2006 (deliveries) and 2007 (returns).



In our receiving table, the image of our demo documents will look like this:



(Not all of it, just the beginning of the table:)

| Record_ID | File_ID | Path | Element_Name | Element_Value | Type |
| --- | --- | --- | --- | --- | --- |
| 001 | 2006 | | Deliveries | | E |
| 001\001 | 2006 | Deliveries\ | PeriodBegin | 2017-01-01 | E |
| 001\002 | 2006 | Deliveries\ | PeriodEnd | 2017-01-31 | E |
| 001\003 | 2006 | Deliveries\ | Products | | E |
| 001\003\001 | 2006 | Deliveries\Products\ | Product | | E |
| 001\003\001\001 | 2006 | Deliveries\Products\Product\ | Supplier | Zaanse Snoepfabriek | E |
| 001\003\001\002 | 2006 | Deliveries\Products\Product\ | ProductName | Chocolade | E |
| 001\003\001\003 | 2006 | Deliveries\Products\Product\ | Details | | E |
| 001\003\001\003\001 | 2006 | Deliveries\Products\Product\Details\ | Detail | | E |
| 001\003\001\003\001\001 | 2006 | Deliveries\Products\Product\Details\Detail\ | DeliveryDate | 2017-01-03 | E |
| 001\003\001\003\001\002 | 2006 | Deliveries\Products\Product\Details\Detail\ | UnitPrice | 10.2000 | E |
| 001\003\001\003\001\003 | 2006 | Deliveries\Products\Product\Details\Detail\ | Quantity | 70 | E |




Regarding the hierarchical identifier Record_ID: its purpose is to uniquely number the nodes of the document tree while preserving information about the links to all ancestors.



Note:
SQL Server itself has a special data type (a CLR extension) called hierarchyid that serves exactly this purpose.



In the example above we use a simple, platform-independent implementation: a sequential concatenation of element counters at each level of the tree. We pad each level's counter with zeros to a given fixed width, so that an ancestor identifier of any level can be extracted easily and quickly as a substring of fixed length.
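
A couple of lines make the benefit concrete (a sketch; the three-character width and backslash separator follow the example above, and the same trick becomes a plain LEFT()/SUBSTRING() expression in SQL):

```python
LEVEL_WIDTH = 3
SEP = "\\"

def ancestor_id(record_id: str, depth: int) -> str:
    """Trim a hierarchical identifier to the given number of levels."""
    return record_id[: depth * LEVEL_WIDTH + (depth - 1) * len(SEP)]

rid = "001\\003\\001\\003\\002"       # a Detail node from deliveries.xml
print(ancestor_id(rid, 3))            # 001\003\001  (the parent Product)
print(ancestor_id(rid, 1))            # 001          (the document root)
```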



That is all for this stage. We can now use simple SQL queries to isolate subsets of product data, delivery details or returns, and link them together through the file and element identifiers.



In many cases this approach is already enough for efficient data storage and access; the "advanced" techniques described below may not be necessary.



Indeed, we have already loaded all our data into a single repository regardless of possible differences in document structure, obtained a fairly effective way to extract the necessary information from the whole array with simple SQL queries, and we can link the extracted data subsets together.
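
For example, queries against the primary table might look like this (a sketch using sqlite3 as a stand-in for the actual DBMS; the table name XML_Elements is an assumption, the columns are those described above):

```python
import sqlite3

conn = sqlite3.connect("xml_store.db")

# All supplier names across every imported file, whatever the root element was:
suppliers = conn.execute(
    """
    SELECT DISTINCT Element_Value
    FROM   XML_Elements
    WHERE  Element_Name = 'Supplier'
    """
).fetchall()

# Delivery/return detail fields for file 2006, each linked to its Product
# through a fixed-width prefix of Record_ID (3 levels = 3*3 + 2 = 11 characters):
details = conn.execute(
    """
    SELECT substr(Record_ID, 1, 11) AS Product_ID,
           Element_Name, Element_Value
    FROM   XML_Elements
    WHERE  File_ID = 2006
      AND  Path LIKE '%\\Details\\Detail\\'
    """
).fetchall()
```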



The performance of such a solution as a storage system for XML sources, even for one-off queries and without optimizing indexes and other tweaks, will clearly be much higher than if you stored the XML in files or even wrote it into special database fields. In our case there is no need to run an XPath search (which implies re-parsing) for each document: we parsed once and then calmly use the saved result through fairly simple queries.



But:
At this stage we have not yet made DeliveryDate, Quantity and UnitPrice into fields of one table; that is done by the "pivoting" process described in the third section.



In the next steps we will look at transforming these XML documents into a single data structure containing three tables: MovementReports (movement type, i.e. delivery or return, plus the start and end dates from the document root), Products (name and supplier), and MovementDetails (price, quantity, date; the date field in the result will be the same for both source documents, even though the corresponding fields in the source files have different names).



2. Creating transformation specifications for the resulting tables.



Consider the process of creating specifications for mapping the source data to the resulting tables. To create such specifications, we will need something else first.



2.1. Getting a table with the document structure.


For further processing we need a detailed structure of all our XML documents, so that based on it we can decide into which table structure we will transform all this.

A single sample XML document is not enough for this task: a specific document may lack optional elements that suddenly appear in other documents.



If we do not have an XSD schema, or do not want to deal with it, it may be enough to load a representative sample of XML documents into our table and build the required list using the Path and Element_Name fields.



However, do not forget that we ultimately want to load the information into certain target "final" tables, so we need to know where one-to-many relationships live in the XML. That is, we must understand which elements form a "child" table, i.e. where multiplication occurs.



Sometimes, if the document schema is not very complicated, this is immediately clear empirically, "by eye." Also, by grouping the data of our "representative sample" we can count the number of elements and see from these statistics where they begin to "multiply." But in general, if we have a proper XSD schema, it is better to use it: we can "catch" one-to-many data multiplication by looking for the XSD construct maxOccurs="unbounded".
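
The "counting" variant boils down to one GROUP BY over the primary table: elements that occur more often than once per document are candidates for a child table (again a sketch with the assumed XML_Elements table name, sqlite3 standing in for the real DBMS):

```python
import sqlite3

conn = sqlite3.connect("xml_store.db")

stats = conn.execute(
    """
    SELECT Path, Element_Name,
           COUNT(*)                AS occurrences,
           COUNT(DISTINCT File_ID) AS documents
    FROM   XML_Elements
    GROUP  BY Path, Element_Name
    """
).fetchall()

for path, name, occurrences, documents in stats:
    if occurrences > documents:      # appears more than once per document: a 1:N candidate
        print(f"{path}{name}: {occurrences} occurrences in {documents} document(s)")
```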



As you can see, the task is a bit more complicated: we want not just a simple table listing the XPath of every element in our documents, but also an indication of where the multiplication of data begins. (And when analyzing a well-written XSD schema, as a pleasant bonus you may be able to pull out element descriptions and types as well.)



It would be very convenient to use the functionality of some XML editor to obtain such a table, but we could not find a tool that would produce the document structure we needed from the XSD schema (we searched and experimented for a long time).



All these tools (Oxygen, Altova, Liquid and less sophisticated ones) undoubtedly use the necessary information internally, yet none of them can output it in the required form. As a rule, an advanced editor can generate a sample XML from a schema, but an XSD may contain choice constructs, where one of several different elements may appear in the document, so real "combat" document samples are better for analysis. And in any case, from a sample or samples of documents we will not explicitly catch the point where one-to-many multiplication of information occurs.



As a result, I had to reinvent the wheel and write a generator of such a table (in essence, a special kind of XSD parser) myself. Fortunately, an XSD is also XML, so it too can be loaded into our repository and the desired view pulled out with relational operations. If the schema is simple, without references to complex element types and without inheritance from base types, this is quite easy. When all this type inheritance is present (as in the public procurement schemas, for example), the task is more complicated.
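
For the simple case such an XSD walker fits in a few lines; the sketch below handles only inline complex types with xs:sequence (no type references, choice, or inheritance), which is exactly the "easy" case mentioned above, and the schema file name is hypothetical:

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"   # XML Schema namespace in Clark notation

def child_elements(element_node):
    """Direct child xs:element declarations of an element with an inline complexType."""
    return element_node.findall(f"./{XS}complexType/{XS}sequence/{XS}element")

def walk(element_node, path):
    """Yield (path, element_name, multiplies) for every element declared below this node."""
    for el in child_elements(element_node):
        name = el.get("name") or el.get("ref")
        multiplies = el.get("maxOccurs") == "unbounded"
        yield path, name, multiplies
        yield from walk(el, path + name + "\\")

schema = ET.parse("deliveries.xsd").getroot()          # hypothetical schema file
for root_el in schema.findall(f"./{XS}element"):
    for path, name, multiplies in walk(root_el, root_el.get("name") + "\\"):
        print(path, name, "maxOccurs=unbounded" if multiplies else "")
```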



For our example, we get something like the following document structure table (the maxoccurs column holds the path of the nearest repeating, i.e. maxOccurs="unbounded", ancestor):

| Path | Element_Name | maxoccurs |
| --- | --- | --- |
| Deliveries\ | PeriodBegin | |
| Deliveries\ | PeriodEnd | |
| Deliveries\Products\Product\ | ProductName | Products\Product |
| Deliveries\Products\Product\ | Supplier | Products\Product |
| Deliveries\Products\Product\Details\Detail\ | DeliveryDate | Products\Product\Details\Detail |
| Deliveries\Products\Product\Details\Detail\ | Quantity | Products\Product\Details\Detail |
| Deliveries\Products\Product\Details\Detail\ | UnitPrice | Products\Product\Details\Detail |
| Returns\ | PeriodBegin | |
| Returns\ | PeriodEnd | |
| Returns\Products\Product\ | ProductName | Products\Product |
| Returns\Products\Product\ | Supplier | Products\Product |
| Returns\Products\Product\Details\Detail\ | Quantity | Products\Product\Details\Detail |
| Returns\Products\Product\Details\Detail\ | ReturnDate | Products\Product\Details\Detail |
| Returns\Products\Product\Details\Detail\ | UnitPrice | Products\Product\Details\Detail |




2.2. Transformation Description


Having obtained the initial data in the form of a tabular specification of the documents, we proceed to designing the data transformation. As a result of this transformation, the data from our primary XML repository will be turned into new tables with new field names and unique keys for the target table records. We do not need any special software tools; we just add a few new columns to this table:



In the table below, some paths are abbreviated ("...") to improve readability.

| Path | Element_Name | maxoccurs | TargetTable | TargetField | ElementDepth | AdditionalInfo |
| --- | --- | --- | --- | --- | --- | --- |
| Deliveries\ | PeriodBegin | | MovementReports | PeriodBegin_Date | 1 | Deliveries |
| Deliveries\ | PeriodEnd | | MovementReports | PeriodEnd_Date | 1 | Deliveries |
| Deliveries\...\Product\ | ProductName | ...\Product | Products | ProductName_Dim | 3 | |
| Deliveries\...\Product\ | Supplier | ...\Product | Products | Supplier_Dim | 3 | |
| Deliveries\...\Detail\ | DeliveryDate | ...\Detail | MovementDetails | MovementDate_Date | 5 | |
| Deliveries\...\Detail\ | Quantity | ...\Detail | MovementDetails | Quantity_Val | 5 | |
| Deliveries\...\Detail\ | UnitPrice | ...\Detail | MovementDetails | UnitPrice_Val | 5 | |
| Returns\ | PeriodBegin | | MovementReports | PeriodBegin_Date | 1 | Returns |
| Returns\ | PeriodEnd | | MovementReports | PeriodEnd_Date | 1 | Returns |
| Returns\...\Product\ | ProductName | ...\Product | Products | ProductName_Dim | 3 | |
| Returns\...\Product\ | Supplier | ...\Product | Products | Supplier_Dim | 3 | |
| Returns\...\Detail\ | Quantity | ...\Detail | MovementDetails | Quantity_Val | 5 | |
| Returns\...\Detail\ | ReturnDate | ...\Detail | MovementDetails | MovementDate_Date | 5 | |
| Returns\...\Detail\ | UnitPrice | ...\Detail | MovementDetails | UnitPrice_Val | 5 | |




Here are the fields we added:

  • TargetTable is the target table into which the element's value will go;
  • TargetField is the name of the field in that target table;
  • ElementDepth is the depth to which the hierarchical identifier Record_ID is trimmed when forming the key of the target record;
  • AdditionalInfo is optional additional information passed along with the record (in our example, the type of movement: Deliveries or Returns).





This technique opens up several interesting possibilities for manipulating the data transformation. Without going into details, we will simply list what we can do:





Once our transformation specification table is ready, we load it into the database and join it to our primary repository on the Path and Element_Name fields.



By concatenating the File_ID, the Record_ID trimmed in accordance with the ElementDepth value, and the AdditionalInfo value, we form the composite key of our target table.



The result of the join is poured, for each target table, into a separate "temporary" table (depending on the amount of data, you can try using the query result "on the fly"); in the next step the final utilities of our "pipeline" work with these tables.
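
A sketch of this join-and-key step (sqlite3 again as a stand-in; the Mapping table is assumed to hold the spreadsheet columns Path, Element_Name, TargetTable, TargetField, ElementDepth and AdditionalInfo, and the key layout "trimmed Record_ID;File_ID@AdditionalInfo" follows the result tables shown below):

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect("xml_store.db")

rows = conn.execute(
    """
    SELECT m.TargetTable,
           substr(x.Record_ID, 1, m.ElementDepth * 3 + (m.ElementDepth - 1))
             || ';' || x.File_ID
             || CASE WHEN m.AdditionalInfo <> ''
                     THEN '@' || m.AdditionalInfo
                     ELSE '' END                      AS Composite_Key,
           m.TargetField,
           x.Element_Value
    FROM   XML_Elements AS x
    JOIN   Mapping      AS m
      ON   m.Path = x.Path AND m.Element_Name = x.Element_Name
    """
).fetchall()

# One "temporary" set of <key> - <field name> - <value> triples per target table:
per_target = defaultdict(list)
for target_table, key, field, value in rows:
    per_target[target_table].append((key, field, value))
```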



A little more about the composite key, perhaps repeating ourselves a little, but this is an important and subtle point:
  • Before pouring, we have a data set at the level of individual fields (elements). To join the fields into a record of the resulting table, we need a key that uniquely identifies the target table record into which the corresponding fields will fall.

  • The hierarchical identifier Record_ID can have different lengths, depending on the "depth" of individual elements in the document outline. However, if going a level deeper is not accompanied by one-to-many multiplication of elements, we cut our Record_ID to the minimum sufficient depth, defined by the ElementDepth parameter, which ensures that the identifier is the same for all fields of our target table. There is no such situation in our demo documents, but imagine, for example, that our UnitPrice were "split" into two values, the wholesale and retail prices UnitPrice\Wholesale and UnitPrice\Retail.

  • Since our base repository holds the contents of many files, we cannot do without the File_ID value in our key.
  • The subsequent stages of data transformation work only with the "transformed" tables obtained at this step; we do not have any end-to-end system of settings. The type of a field (dimension/measure) is passed through name suffixes, but sometimes we also need to pass "along the chain" information about which section of the document the data was taken from (remember that we can transform documents encoded with different schemas into the same form). To transfer this information to the next stage, we use the optional AdditionalInfo parameter of our transformation, "hooking" it onto our composite key so that the identification of the target records is not violated.




Let's see what we get at the output in our example:



The result of the transformation:

MovementReports:

| KEY | TargetField | Element_Value |
| --- | --- | --- |
| 001;2006@Deliveries | PeriodBegin_Date | 2017-01-01 |
| 001;2006@Deliveries | PeriodEnd_Date | 2017-01-31 |
| 001;2007@Returns | PeriodBegin_Date | 2017-02-01 |
| 001;2007@Returns | PeriodEnd_Date | 2017-02-28 |

Products:

| KEY | TargetField | Element_Value |
| --- | --- | --- |
| 001\003\001;2006 | Supplier_Dim | Zaanse Snoepfabriek |
| 001\003\001;2006 | ProductName_Dim | Chocolade |
| 001\003\002;2006 | Supplier_Dim | Mayumi's |
| 001\003\002;2006 | ProductName_Dim | Tofu |
| 001\003\001;2007 | Supplier_Dim | Pavlova, Ltd. |
| 001\003\001;2007 | ProductName_Dim | Pavlova |
| 001\003\002;2007 | Supplier_Dim | Formaggi Fortini srl |
| 001\003\002;2007 | ProductName_Dim | Mozzarella di Giovanni |

MovementDetails:

| KEY | TargetField | Element_Value |
| --- | --- | --- |
| 001\003\001\003\001;2006 | MovementDate_Date | 2017-01-03 |
| 001\003\001\003\001;2006 | UnitPrice_Val | 10.2000 |
| 001\003\001\003\001;2006 | Quantity_Val | 70 |
| 001\003\002\003\001;2006 | MovementDate_Date | 2017-01-09 |
| 001\003\002\003\001;2006 | UnitPrice_Val | 18.6000 |
| 001\003\002\003\001;2006 | Quantity_Val | 12 |
| 001\003\002\003\002;2006 | MovementDate_Date | 2017-01-13 |
| 001\003\002\003\002;2006 | UnitPrice_Val | 18.7000 |
| 001\003\002\003\002;2006 | Quantity_Val | 20 |
| 001\003\001\003\001;2007 | MovementDate_Date | 2017-02-21 |
| 001\003\001\003\001;2007 | UnitPrice_Val | 13.9000 |
| 001\003\001\003\001;2007 | Quantity_Val | 2 |
| 001\003\002\003\001;2007 | MovementDate_Date | 2017-02-27 |
| 001\003\002\003\001;2007 | UnitPrice_Val | 27.8000 |
| 001\003\002\003\001;2007 | Quantity_Val | 4 |




Note that the resulting key is the same for all fields that will end up in the corresponding record of each target table.



3. Final processing.



3.1. Pivoting


Having obtained, with the previous transformation, a "blank" of the target table with the source data split into <key> - <field name> - <value> triples, we need to translate it into the more familiar form of a table with many fields. The algorithm of this transformation is obvious: first, by grouping the values of our composite key, we get the "skeleton" of the table; then we join this "skeleton" with the transformation result table on the value of the composite key ("growing meat on the skeleton," so to speak).



That is, there will be N joins of the "skeleton" with subsets of the transformation result table identified by field name, where N is the number of field names in the target table.



We safely "carried" the AdditionalInfo field through to this stage by encoding it inside the composite key. Now we need to free our key from this "burden" and cut the AdditionalInfo part off into a new AdditionalInfo_Dim field.



We combined the file code and record identifier in order to carry the key to the pivoting stage in a single field. For "final" storage it is better to split the file code and the hierarchical identifier back into two fields; this makes it easier to link the resulting tables with each other.
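
The mechanics of the pivot can be shown in a few lines of Python; in SQL the same thing is usually done with one join per field or with conditional aggregation. The key-splitting convention matches the composite key built earlier; everything else is an illustrative sketch:

```python
from collections import defaultdict

def pivot(triples):
    """Fold (key, field, value) triples into one record per composite key."""
    records = defaultdict(dict)               # the "skeleton": one dict per key
    for key, field, value in triples:
        records[key][field] = value           # "growing meat" on the skeleton

    result = []
    for key, fields in records.items():
        id_part, _, additional = key.partition("@")     # cut AdditionalInfo off the key
        record_id, _, file_id = id_part.partition(";")  # split it back into two fields
        row = {"Record_ID": record_id, "File_ID": int(file_id)}
        if additional:
            row["AdditionalInfo_Dim"] = additional
        row.update(fields)
        result.append(row)
    return result

demo = [
    ("001;2006@Deliveries", "PeriodBegin_Date", "2017-01-01"),
    ("001;2006@Deliveries", "PeriodEnd_Date", "2017-01-31"),
]
print(pivot(demo))
# [{'Record_ID': '001', 'File_ID': 2006, 'AdditionalInfo_Dim': 'Deliveries',
#   'PeriodBegin_Date': '2017-01-01', 'PeriodEnd_Date': '2017-01-31'}]
```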



The result will be the following pivoted tables:

MovementReports:

| Record_ID | File_ID | AdditionalInfo_Dim | PeriodBegin_Date | PeriodEnd_Date |
| --- | --- | --- | --- | --- |
| 001 | 2006 | Deliveries | 2017-01-01 | 2017-01-31 |
| 001 | 2007 | Returns | 2017-02-01 | 2017-02-28 |

Products:

| Record_ID | File_ID | Supplier_Dim | ProductName_Dim |
| --- | --- | --- | --- |
| 001\003\001 | 2006 | Zaanse Snoepfabriek | Chocolade |
| 001\003\002 | 2006 | Mayumi's | Tofu |
| 001\003\001 | 2007 | Pavlova, Ltd. | Pavlova |
| 001\003\002 | 2007 | Formaggi Fortini srl | Mozzarella di Giovanni |

MovementDetails:

| Record_ID | File_ID | MovementDate_Date | UnitPrice_Val | Quantity_Val |
| --- | --- | --- | --- | --- |
| 001\003\001\003\001 | 2006 | 2017-01-03 | 10.2000 | 70 |
| 001\003\001\003\001 | 2007 | 2017-02-21 | 13.9000 | 2 |
| 001\003\002\003\001 | 2006 | 2017-01-09 | 18.6000 | 12 |
| 001\003\002\003\001 | 2007 | 2017-02-27 | 27.8000 | 4 |
| 001\003\002\003\002 | 2006 | 2017-01-13 | 18.7000 | 20 |



3.2. Normalization


The next stage can be conventionally called normalization: we replace all fields with the _Dim suffix with references to the corresponding dictionaries (dimension tables), and convert fields with the _Date and _Val suffixes into dates and numbers, respectively. Other data type suffixes can be used if necessary; this is not a dogma.



For each _Dim field, we check whether the corresponding dictionary exists (if not, we create it) and add any new values found in the pivoted table.



At the end of the process, we transfer the data into the final storage structure by joining the pivoted table with the updated dictionaries on the field values.
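
A sketch of this normalization pass, with a plain dictionary standing in for the dimension tables and the _Dim/_Date/_Val suffix convention taken from above; the surrogate-key generation and the chosen Python types are illustrative assumptions:

```python
from datetime import date
from decimal import Decimal

dimensions = {}    # dictionary name -> {value: surrogate id}, stands in for dimension tables

def dim_id(dim_name, value):
    """Return the surrogate key for a dictionary value, creating it if needed."""
    table = dimensions.setdefault(dim_name, {})
    if value not in table:
        table[value] = len(table) + 1            # simple auto-increment surrogate
    return table[value]

def normalize(row):
    out = {}
    for field, value in row.items():
        if field.endswith("_Dim"):
            out[field] = dim_id(field, value)    # store a reference instead of the text
        elif field.endswith("_Date"):
            out[field] = date.fromisoformat(value)
        elif field.endswith("_Val"):
            out[field] = Decimal(value)
        else:
            out[field] = value                   # Record_ID, File_ID, ...
    return out

print(normalize({"Record_ID": "001\\003\\001", "File_ID": 2006,
                 "Supplier_Dim": "Zaanse Snoepfabriek", "ProductName_Dim": "Chocolade"}))
```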



As a result, we get our target tables, which contain the file code, the hierarchical identifier, references to dictionary values, and the remaining fields converted to the necessary types.



To conclude the description of the process



The hierarchical identifier in each target table gives us the ability to link these tables with each other (in any order; intermediate links can be skipped if necessary). As noted above, the fixed size of the hierarchical identifier's components makes it fairly easy to build expressions for linking our target tables.
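
For instance, linking MovementDetails back to Products is a join on File_ID plus a fixed-length prefix of Record_ID (a sketch; the column names follow the pivoted tables from section 3.1, sqlite3 again stands in for the real DBMS):

```python
import sqlite3

conn = sqlite3.connect("xml_store.db")

report = conn.execute(
    """
    SELECT p.Supplier_Dim, p.ProductName_Dim,
           d.MovementDate_Date, d.Quantity_Val, d.UnitPrice_Val
    FROM   MovementDetails AS d
    JOIN   Products        AS p
      ON   p.File_ID   = d.File_ID
     AND   p.Record_ID = substr(d.Record_ID, 1, 11)   -- depth-3 ancestor = the Product
    """
).fetchall()
```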



Of course, if you wish, you can supplement the table structure with links via ordinary auto-increment keys.



Note how tolerant the described approach is to errors and omissions by the developer (more precisely, by the person configuring the mapping). If you missed a required field while preparing the transformation, or if there were changes in the file schema, such as added or renamed fields, that were not tracked in time, that is fine, even if a large amount of information has already been processed without taking the error into account.



All the information is saved and available in our "primary" table storage, so the source files will not have to be re-parsed.



You can register a field mapping for a "small" data transformation covering only the missing fragment. After we "pull out" the missing piece of data, it can quite easily be "attached" to the main processed array by joining on the hierarchical identifier.



Conclusions



As we have seen, from the very beginning we put all the source information, regardless of the document schemas, into a uniform form, which makes it possible to quickly "pull out" the necessary data in ad hoc mode.



In the course of further processing, no animals were harmed: there was no need to create or modify any scripts or ETL-task-like objects specific to the structure of the source XML files.



Of course, the software environment of our "framework" (the loading and data transfer infrastructure) had to be created once, but after that the integration of XML data of any structure is reduced to editing the mapping table of source and target views.

Source: https://habr.com/ru/post/325186/


