
Correct HTML serialization in .NET

Greetings, everyone!

Those who actively use XSLT to generate HTML (not XHTML) have probably often faced situations where it is necessary to generate not only valid XML (XHTML) but also, for browsers that do not support XHTML, valid HTML, which is, in general, not the same thing. To do this, we have had to resort to dirty hacks in XSLT.
In this post I will talk about a cleaner and more elegant method, which, unfortunately, is not often used.

The method is specific to the .NET platform, but similar tools probably exist on other platforms.
Now, let's take things in order.

Introduction


It is clear that the XML information model is sufficient to describe any HTML document; moreover, HTML fits nicely within the scope of XML. The problem is that the textual representation of XML may differ from the HTML representation of the same document.

The essence of the problem


The critical differences between XML serialization and HTML serialization are quite simple:

In addition, there are restrictions that depend on the content itself (the serializer has nothing to do with it), but it is important that these restrictions are met:

Method itself


The point is that .NET serializes XML through an XmlWriter, which takes care of all the proper XML formatting. This class is used in almost every operation that writes XML. In particular, XSL transformations (XslCompiledTransform.Transform) accept an instance of this class as the destination.
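To make the role of XmlWriter as a transformation destination concrete, here is a minimal sketch. The stylesheet, the input document, and the class name TransformToWriterDemo are assumptions made up for this illustration; only XslCompiledTransform, XmlReader, XmlWriter, and XmlWriterSettings come from the framework.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class TransformToWriterDemo
{
    // Hypothetical stylesheet that emits a tiny page with an empty <script> element.
    const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:template match='/'>
    <html><head><script src='app.js'></script></head><body><br/></body></html>
  </xsl:template>
</xsl:stylesheet>";

    public static string Run()
    {
        var transform = new XslCompiledTransform();
        using (var stylesheetReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(stylesheetReader);
        }

        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        // The XmlWriter is the destination: every node the transformation
        // produces is reported to it through WriteStartElement, WriteString, etc.
        using (var input = XmlReader.Create(new StringReader("<root/>")))
        using (var writer = XmlWriter.Create(output, settings))
        {
            transform.Transform(input, writer);
        }

        // A plain XML writer typically serializes the empty <script> element in
        // self-closing form, which HTML parsers do not treat as a closed element.
        return output.ToString();
    }
}
```

Calling TransformToWriterDemo.Run() and printing the result shows what the stock writer produces; this is the point where a custom writer would be substituted.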

So all that is needed is to implement our own XmlWriter that formats the XML according to the rules of HTML. Let us introduce HtmlXmlWriter!

Theory


We take the HTML specification, specifically HTML5 (where would we be without it these days), and see that it defines five kinds of elements: void elements, raw text elements, RCDATA elements, foreign elements, and normal elements.

Our HtmlXmlWriter must make sure that no content is ever added to void (empty) elements and that they are always written as self-closing (<col />).

Raw text elements can contain only text (no entities or comments), and that text must not contain a sequence that could be interpreted as the element's closing tag (regardless of case).

RCDATA elements cannot have child elements; they can contain only text, including entity references. Comments do not seem to be allowed in them either.

Foreign elements can be anything: this is plain XML, with no restrictions.

Normal elements can also contain whatever they want, but they always need a closing tag.
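The five kinds described above can be captured in a small lookup. This is only a sketch: the names HtmlElementKind and HtmlElementKinds are hypothetical, the element lists follow my reading of the HTML5 specification and should be checked against the current version, and treating every namespace other than XHTML as foreign is an assumption.

```csharp
using System;
using System.Collections.Generic;

enum HtmlElementKind { Void, RawText, RcData, Foreign, Normal }

static class HtmlElementKinds
{
    const string XhtmlNamespace = "http://www.w3.org/1999/xhtml";

    // Helper that builds a case-insensitive set of element names.
    static HashSet<string> Names(params string[] names) =>
        new HashSet<string>(names, StringComparer.OrdinalIgnoreCase);

    // Assumed lists; verify against the HTML specification before relying on them.
    static readonly HashSet<string> VoidNames = Names(
        "area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr");
    static readonly HashSet<string> RawTextNames = Names("script", "style");
    static readonly HashSet<string> RcDataNames = Names("textarea", "title");

    public static HtmlElementKind Classify(string localName, string namespaceUri)
    {
        // Assumption: HTML elements arrive in no namespace or in the XHTML
        // namespace; anything else (SVG, MathML, ...) is treated as foreign.
        if (!string.IsNullOrEmpty(namespaceUri) && namespaceUri != XhtmlNamespace)
            return HtmlElementKind.Foreign;

        if (VoidNames.Contains(localName)) return HtmlElementKind.Void;
        if (RawTextNames.Contains(localName)) return HtmlElementKind.RawText;
        if (RcDataNames.Contains(localName)) return HtmlElementKind.RcData;
        return HtmlElementKind.Normal;
    }
}
```

The writer can call Classify once per WriteStartElement and remember the result until the matching end element.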

Implementation


I will not give the implementation itself here; it is not complicated, and anyone can write it themselves. I wrote one for myself, and perhaps, once I have documented it and if people ask insistently enough, I will publish it in some repository. Here are just some useful notes (perhaps a little disorganized).

HtmlXmlWriter is a subclass of XmlWriter. It wraps another XmlWriter instance (passed to the constructor) and by default delegates each call to the corresponding method of that instance.

HtmlXmlWriter must keep track of which element it is currently in (the name and kind of the innermost open element), updating this in its overrides of XmlWriter.WriteStartElement and WriteEndElement. It must also track whether it is inside an attribute (WriteStartAttribute / WriteEndAttribute).

When closing an element (WriteEndElement / WriteFullEndElement), call either WriteEndElement or WriteFullEndElement on the wrapped writer, depending on the kind of element.
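The notes so far (wrap a writer, track the open element and attribute state, choose the closing form) might look roughly like this. It is a deliberately reduced sketch and not the author's implementation: it distinguishes only void elements from everything else, does not yet reject content inside void elements, and the void element list and the field names are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

class HtmlXmlWriter : XmlWriter
{
    // Assumed list of void elements; verify against the HTML specification.
    static readonly HashSet<string> VoidNames = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source", "track", "wbr" };

    readonly XmlWriter inner;
    readonly Stack<string> openElements = new Stack<string>();

    public HtmlXmlWriter(XmlWriter inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    // True between WriteStartAttribute and WriteEndAttribute.
    public bool InAttribute { get; private set; }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        openElements.Push(localName);
        inner.WriteStartElement(prefix, localName, ns);
    }

    public override void WriteEndElement() => CloseCurrentElement();
    public override void WriteFullEndElement() => CloseCurrentElement();

    void CloseCurrentElement()
    {
        string name = openElements.Pop();
        if (VoidNames.Contains(name))
            inner.WriteEndElement();      // short form for an empty void element
        else
            inner.WriteFullEndElement();  // always an explicit closing tag
    }

    public override void WriteStartAttribute(string prefix, string localName, string ns)
    {
        InAttribute = true;
        inner.WriteStartAttribute(prefix, localName, ns);
    }

    public override void WriteEndAttribute()
    {
        InAttribute = false;
        inner.WriteEndAttribute();
    }

    // Everything else is delegated to the wrapped writer unchanged.
    public override WriteState WriteState => inner.WriteState;
    public override void WriteStartDocument() => inner.WriteStartDocument();
    public override void WriteStartDocument(bool standalone) => inner.WriteStartDocument(standalone);
    public override void WriteEndDocument() => inner.WriteEndDocument();
    public override void WriteDocType(string name, string pubid, string sysid, string subset) =>
        inner.WriteDocType(name, pubid, sysid, subset);
    public override void WriteCData(string text) => inner.WriteCData(text);
    public override void WriteComment(string text) => inner.WriteComment(text);
    public override void WriteProcessingInstruction(string name, string text) =>
        inner.WriteProcessingInstruction(name, text);
    public override void WriteEntityRef(string name) => inner.WriteEntityRef(name);
    public override void WriteCharEntity(char ch) => inner.WriteCharEntity(ch);
    public override void WriteWhitespace(string ws) => inner.WriteWhitespace(ws);
    public override void WriteString(string text) => inner.WriteString(text);
    public override void WriteSurrogateCharEntity(char lowChar, char highChar) =>
        inner.WriteSurrogateCharEntity(lowChar, highChar);
    public override void WriteChars(char[] buffer, int index, int count) => inner.WriteChars(buffer, index, count);
    public override void WriteRaw(char[] buffer, int index, int count) => inner.WriteRaw(buffer, index, count);
    public override void WriteRaw(string data) => inner.WriteRaw(data);
    public override void WriteBase64(byte[] buffer, int index, int count) => inner.WriteBase64(buffer, index, count);
    public override string LookupPrefix(string ns) => inner.LookupPrefix(ns);
    public override void Flush() => inner.Flush();
    public override void Close() => inner.Close();
}
```

An instance is created around any existing writer, for example new HtmlXmlWriter(XmlWriter.Create(textWriter, settings)), and can then be used wherever an XmlWriter is expected. With this sketch an empty script element is written with an explicit closing tag, while an empty br element keeps the short form.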

The hardest part is raw text elements, because XmlWriter escapes some characters. Therefore, inside them the text output methods (WriteCharEntity, WriteString, WriteSurrogateCharEntity) must be redirected to WriteRaw. Here we must not forget to check that the text does not contain the closing tag.
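The closing-tag check could be sketched as a small helper that the writer calls before passing text to WriteRaw. The name RawTextGuard is hypothetical, and the check is deliberately conservative: it rejects any "</" followed by the element name, in any letter case, even where the HTML rules would still allow the text (for example "</scripts"). It also looks at one string at a time, so a closing tag split across two write calls would go unnoticed.

```csharp
using System;

static class RawTextGuard
{
    // Throws if 'text' contains something that could end <elementName> early
    // when it is written without escaping.
    public static void EnsureSafe(string elementName, string text)
    {
        if (elementName == null) throw new ArgumentNullException(nameof(elementName));
        if (string.IsNullOrEmpty(text)) return;

        string closingStart = "</" + elementName;

        // Case-insensitive search, because HTML tag names are case-insensitive.
        if (text.IndexOf(closingStart, StringComparison.OrdinalIgnoreCase) >= 0)
        {
            throw new InvalidOperationException(
                "Raw text inside <" + elementName + "> must not contain '" + closingStart + "'.");
        }
    }
}
```

Inside a raw text element, an override of WriteString would first call RawTextGuard.EnsureSafe with the current element name and then forward the text through WriteRaw instead of WriteString.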

Conclusion


Once you have such a class, you can easily pass it to an XSL transformation (or anywhere else an XmlWriter is accepted) and get proper HTML from XHTML, so that even the dumbest HTML parser will understand it.

Source: https://habr.com/ru/post/82034/

