
Data acquisition, part 2

In the first part of my story about data acquisition, I wrote about what tools are used to get HTML from the Internet. In this post I will tell you in more detail how to get the necessary data from this HTML, and how to transform this data into the format we need.

Well-formed HTML


When you get HTML from some resource, there are two possibilities: either it is well-formed HTML that can be treated as XML right away (that is, taken and used), or it is poorly formed. Most HTML, unfortunately, is poorly formed. In that case you again have two options: either use the HTML Agility Pack to pull out the data you need directly, or use the same library to “correct” the HTML and make it more XML-shaped. Here is a minimal example that removes all unclosed IMG elements:

var someHtml = "<p><img src='a.gif'>hello</p>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(someHtml);
// fix images: SelectNodes() returns null when nothing matches
var images = doc.DocumentNode.SelectNodes("//img");
if (images != null)
{
    // drop every <img> that is not self-closed
    foreach (var node in images)
        if (!node.OuterHtml.EndsWith("/>"))
            node.Remove();
}
Console.WriteLine(doc.DocumentNode.OuterHtml);
Console.ReadLine();

It may seem that fixing HTML is unnecessary: with the same SelectNodes() method you can get any element, even a malformed one. But there is one advantage that should not be forgotten. If you have valid XML, then a) you can write (or generate) an XSD for that piece of XML; and b) with an XSD in hand, you can generate mappings from the XML structure onto POCOs, which are much easier to work with.

Mappings


Data mapping usually features in integration systems like BizTalk. The idea is to convert one dataset into something else, which is usually just another dataset. In many cases this is a one-to-one mapping, but conversions are often needed: for example, everything in HTML is text, so to get a number you need to convert it (int.Parse(), etc.). Let's look at how this is done.
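As a small illustration of the text-to-number conversion mentioned above, here is a minimal sketch. The CellConverter class and its ToInt method are hypothetical names introduced only for this example; the sketch assumes that a cell which does not contain a valid integer should yield null rather than throw.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the original sources.
public static class CellConverter
{
    // Tries to turn the text of an HTML cell into an int.
    // Returns null when the text is not a valid integer.
    public static int? ToInt(string cellText)
    {
        int value;
        bool ok = int.TryParse(
            (cellText ?? string.Empty).Trim(),
            NumberStyles.Integer,
            CultureInfo.InvariantCulture,
            out value);
        return ok ? value : (int?)null;
    }
}
```

With this helper, CellConverter.ToInt(" 42 ") gives 42 and CellConverter.ToInt("n/a") gives null. Using int.TryParse instead of int.Parse avoids an exception for every cell that turns out not to be numeric, which is common with scraped data.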
Suppose we get the following (primitive) structure when parsing:

<table>
  <tr>
    <td>Alexander</td>
    <td>RD</td>
  </tr>
  <tr>
    <td>Sergey</td>
    <td>MVP, RD</td>
  </tr>
  <tr>
    <td>Dmitri</td>
    <td>MVP</td>
  </tr>
</table>

And now let's imagine that we need to map this data onto the following structure:

// public, so that the public PersonCollection below and XmlSerializer can use it
public class Person
{
    public string Name { get; set; }
    public bool IsMVP { get; set; }
    public bool IsRD { get; set; }
}

For this class, it is better to create a collection class right away:

public class PersonCollection : Collection<Person> {}
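For comparison, here is roughly what the same mapping looks like when written by hand with LINQ to XML. This is only a sketch: the ManualMapper class is a hypothetical name, and it assumes that the first cell of each row is the name and the second is a comma-separated list of titles, as in the sample table.

```csharp
using System;
using System.Collections.ObjectModel;
using System.Linq;
using System.Xml.Linq;

public class Person
{
    public string Name { get; set; }
    public bool IsMVP { get; set; }
    public bool IsRD { get; set; }
}

public class PersonCollection : Collection<Person> { }

// Hypothetical helper, not part of the original sources.
public static class ManualMapper
{
    // Maps each <tr> to a Person: the first <td> is the name,
    // the second <td> is a comma-separated list of titles.
    public static PersonCollection Map(string tableXml)
    {
        var result = new PersonCollection();
        foreach (XElement row in XDocument.Parse(tableXml).Root.Elements("tr"))
        {
            var cells = row.Elements("td").Select(td => td.Value.Trim()).ToList();
            if (cells.Count < 2) continue; // skip rows that do not fit the expected shape

            var titles = cells[1].Split(',').Select(t => t.Trim()).ToList();
            result.Add(new Person
            {
                Name = cells[0],
                IsMVP = titles.Contains("MVP"),
                IsRD = titles.Contains("RD")
            });
        }
        return result;
    }
}
```

A hand-written mapper like this is fine for a three-row table, but every change in the source layout means editing code, which is the motivation for describing the mapping declaratively.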

Now we will generate the XSD for the source data. The result looks like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="table">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="tr" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="td" type="xs:string" />
              <xs:element name="td" type="xs:string" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

That was easy, probably too easy. What's harder is getting the schema for our collection class. (NB: instead of a schema you could use, for example, a database directly, but I'll stick with XSD.) And now for a magic trick: we compile an assembly containing the PersonCollection type and then run the following command:

xsd -t:PersonCollection "04 Mapping.exe"

Believe it or not, this command generates an XSD from a CLR type! Note that xsd.exe only works when the assembly matches the bitness of your system: although I normally compile everything for x86, I had to make a 64-bit build to get xsd.exe to work. The result is the following XSD file, which can be used for the mapping:

<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ArrayOfPerson" nillable="true" type="ArrayOfPerson" />
  <xs:complexType name="ArrayOfPerson">
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" name="Person" nillable="true" type="Person" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Person">
    <xs:sequence>
      <xs:element minOccurs="1" maxOccurs="1" name="Name" type="xs:string" />
      <xs:element minOccurs="1" maxOccurs="1" name="IsMVP" type="xs:boolean" />
      <xs:element minOccurs="1" maxOccurs="1" name="IsRD" type="xs:boolean" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>
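To see what kind of document this schema describes, it can help to serialize a PersonCollection with XmlSerializer and look at the output. The sketch below assumes the Person and PersonCollection types from earlier; SchemaShapeDemo is a hypothetical name used only here.

```csharp
using System;
using System.Collections.ObjectModel;
using System.IO;
using System.Xml.Serialization;

public class Person
{
    public string Name { get; set; }
    public bool IsMVP { get; set; }
    public bool IsRD { get; set; }
}

public class PersonCollection : Collection<Person> { }

// Hypothetical helper, not part of the original sources.
public static class SchemaShapeDemo
{
    // Serializes the collection to an XML string using the default
    // XmlSerializer conventions (no attributes on the types).
    public static string ToXml(PersonCollection people)
    {
        var serializer = new XmlSerializer(typeof(PersonCollection));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, people);
            return writer.ToString();
        }
    }
}
```

The root element of the serialized document should be ArrayOfPerson, with one Person child per item and Name, IsMVP and IsRD elements inside it. That is the target shape the mapping has to produce.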

Well, now we have the left and right sides of the mapping. The mapping itself can be created with an application like Stylus Studio or MapForce. Mappings are created visually, but the process is not intuitive, so if you have never worked with visual mappings, expect to struggle a little at first.

To create my own mapping, I used Altova MapForce. In short, this program can do many kinds of mappings, including XSD-to-XSD, which is what we need. Mappings can be generated as XSLT 1.0/2.0, XQuery, Java, C#, and C++. Personally, I use XSLT 2.0, and to run the transformations I use the free AltovaXML engine, since what Microsoft ships in .NET for XSLT is really poor. And XQuery is not in .NET at all. And no, the Mvp.Xml library does not really help either, although its developers deserve credit for the effort.

The first thing we do is describe the mapping visually, using the primitives available to us. The result looks like this:

[Figure: the mapping as laid out visually in MapForce]

Now we generate the XSLT for the mapping. All that remains is to decide how to invoke it. Since we use AltovaXML for the transformation, the code looks like this:

public static string XsltTransform(string xml, string xslt)
{
    // Application is the AltovaXML engine object
    var app = new Application();
    var x = app.XSLT2;
    x.InputXMLFromText = xml;
    x.XSLFromText = xslt;
    return x.ExecuteAndGetResultAsString();
}
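If AltovaXML is not available and the stylesheet is plain XSLT 1.0, the built-in XslCompiledTransform can stand in for it. The sketch below is an assumption-laden alternative, not the code used in this post: BuiltInXslt and SampleStylesheet are hypothetical names, and the stylesheet is a hand-written illustration of the table-to-ArrayOfPerson mapping rather than the output of MapForce.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, not part of the original sources.
public static class BuiltInXslt
{
    // Hand-written XSLT 1.0 illustration: one Person per <tr>,
    // flags derived from the text of the second <td>.
    public const string SampleStylesheet =
@"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' omit-xml-declaration='yes'/>
  <xsl:template match='/table'>
    <ArrayOfPerson>
      <xsl:for-each select='tr'>
        <Person>
          <Name><xsl:value-of select='td[1]'/></Name>
          <IsMVP><xsl:value-of select='contains(td[2], ""MVP"")'/></IsMVP>
          <IsRD><xsl:value-of select='contains(td[2], ""RD"")'/></IsRD>
        </Person>
      </xsl:for-each>
    </ArrayOfPerson>
  </xsl:template>
</xsl:stylesheet>";

    // XSLT 1.0 only: XslCompiledTransform does not support XSLT 2.0.
    public static string Transform(string xml, string xslt)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(xsltReader);
        }

        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            // Writing to a TextWriter lets the stylesheet's xsl:output settings apply.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling BuiltInXslt.Transform(tableXml, BuiltInXslt.SampleStylesheet) on the sample table should yield an ArrayOfPerson document that XmlSerializer can read back into a PersonCollection. The trade-off is the one already mentioned: anything that relies on XSLT 2.0 features will not run this way.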

In order to deserialize XML into a collection, we use the following method:

public static T FromXml<T>(string xml) where T : class
{
    var s = new XmlSerializer(typeof(T));
    using (var sr = new StringReader(xml))
    {
        return s.Deserialize(sr) as T;
    }
}

That's all: once we have our XML, it is easy to transform:

string xml = File.ReadAllText("Input.xml");
string xslt = File.ReadAllText("../../output/MappingProjectMapToPersonCollection.xslt");
string result = XsltTransform(xml, xslt);
var pc2 = FromXml<PersonCollection>(result);

A digression on mappings


It may seem that mappings are superfluous, and for simple cases this may be true. But mappings, as an additional level of abstraction, let you control the result better and adapt it to changing conditions, which really matters when a website's design changes.

Mappings, and working with XML in general, do not come free: Visual Studio (even 2010) does a really poor job with them, so I used a specialized, paid program. Although no, that is not quite true, because mappings are supported in BizTalk (and therefore in VS2008). And naturally our task can, in a sense, be “transposed” onto BizTalk. So for personal use you can give it a try if you have an MSDN subscription.

That's all for today. Sources, as always, here. Comments welcome.

Source: https://habr.com/ru/post/94128/

