Correct HTML serialization in .NET

Hello everyone!

Those who actively use XSLT to generate HTML (not XHTML) have probably often faced situations where it is necessary to produce not only valid XML (XHTML) but also, for browsers that do not support XHTML, valid HTML, which is not quite the same thing. Until now, this was usually done with dirty hacks in XSLT.
In this post I will describe a cleaner and more elegant method, which, unfortunately, is rarely used.

The method is specific to the .NET platform, but similar tools probably exist on other platforms.
Now, one thing at a time.

Introduction

It is clear that XML's information model is sufficient to describe any HTML document, and, moreover, HTML fits comfortably within the scope of XML. The problem is that the textual representation of XML may differ from the representation of the same document in HTML.

The essence of the problem

The critical differences between XML serialization and HTML serialization are quite simple:

The document must not have an XML declaration;
Some elements must have a closing tag: a standard XML serializer will wrongly write an empty div as the self-closing tag <div />, while an HTML parser expects a separate closing tag for div;
Some elements cannot contain entity references: an HTML parser does not process entity references inside elements such as script or style.
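The last two differences are easy to observe with any generic XML serializer. As a rough illustration outside .NET, here is a small sketch using Python's standard xml.etree.ElementTree, which happens to offer both an XML and an HTML output method. It is only an analogy on another platform and says nothing about how XmlWriter itself behaves.

```python
import xml.etree.ElementTree as ET

# An empty div, and a script whose text contains characters XML must escape.
div = ET.Element("div")
script = ET.Element("script")
script.text = "if (a < b && c) run();"

# XML rules: the empty element collapses and the script text is escaped.
div_as_xml = ET.tostring(div, encoding="unicode", method="xml")
script_as_xml = ET.tostring(script, encoding="unicode", method="xml")

# HTML rules: explicit closing tag, script text written verbatim.
div_as_html = ET.tostring(div, encoding="unicode", method="html")
script_as_html = ET.tostring(script, encoding="unicode", method="html")

for line in (div_as_xml, script_as_xml, div_as_html, script_as_html):
    print(line)
```

With the XML method the div comes out self-closing and the script body is full of entity references; with the HTML method the div gets its closing tag and the script body is left alone.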

In addition, there are restrictions that depend on the content itself (the serializer is not responsible for them), but it is important that they are met:

Some elements may not have content, i.e. they must be empty; for example, the link element allows no content, so even if a link has no content but is written with a separate closing tag, that is an error for the HTML parser (which, of course, it will ignore);
Some elements cannot contain child elements or comments, for example title and textarea;
General restrictions on the structure of the document, which we will not consider here and leave to inquiring minds =)

The method itself

The point is that .NET serializes XML through XmlWriter, which takes care of all the proper XML formatting. This class is used in almost every operation that writes XML. In particular, XSL transformations (XslCompiledTransform.Transform) accept an instance of this class as the output destination.

So all that is needed is to implement our own XmlWriter that formats our XML according to the rules of HTML. Meet HtmlXmlWriter!

Theory

Take the HTML specification, specifically HTML5 (where would we be without it these days), and we see that it defines five kinds of elements:

void (empty) elements - area, base, br, col, command, embed, hr, img, input, keygen, link, meta, param, source;
raw text elements - script, style;
RCDATA elements (text only) - textarea, title;
foreign elements - any non-HTML elements, in particular from MathML and SVG; here we will treat any element outside the XHTML namespace as foreign;
normal elements - all other HTML elements.
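The classification above boils down to a namespace check followed by a few name lookups. Here is a minimal sketch in Python (used only as a neutral notation, not the .NET implementation). The names ElementKind and classify are made up for illustration, and the name lists are copied from the article as written; the current HTML standard's void list differs slightly (for example, it adds track and wbr).

```python
from enum import Enum

XHTML_NS = "http://www.w3.org/1999/xhtml"


class ElementKind(Enum):
    VOID = "void"
    RAW_TEXT = "raw text"
    RCDATA = "RCDATA"
    FOREIGN = "foreign"
    NORMAL = "normal"


# Lists taken from the article; check them against the spec you target.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "command", "embed", "hr", "img",
    "input", "keygen", "link", "meta", "param", "source",
})
RAW_TEXT_ELEMENTS = frozenset({"script", "style"})
RCDATA_ELEMENTS = frozenset({"textarea", "title"})


def classify(local_name, namespace_uri):
    """Return the kind of an element from its local name and namespace."""
    # Anything outside the XHTML namespace is treated as foreign (plain XML).
    if namespace_uri != XHTML_NS:
        return ElementKind.FOREIGN
    if local_name in VOID_ELEMENTS:
        return ElementKind.VOID
    if local_name in RAW_TEXT_ELEMENTS:
        return ElementKind.RAW_TEXT
    if local_name in RCDATA_ELEMENTS:
        return ElementKind.RCDATA
    return ElementKind.NORMAL


print(classify("br", XHTML_NS))
print(classify("svg", "http://www.w3.org/2000/svg"))
```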

Our HtmlXmlWriter must make sure that no content is ever added to void elements, and it must always write them as self-closing (<col />).

Raw text elements can contain only text (no entity references or comments), and that text must not contain a sequence that could be interpreted as the element's closing tag (regardless of case).

RCDATA elements cannot have child elements; they can contain only text, including entity references. Comments do not seem to be allowed in them either.

Foreign elements can be anything: this is plain XML, with no restrictions.

Normal elements can also contain anything, but they always need a closing tag.
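The content rules for the five kinds can be summarized as a small lookup table. The sketch below is in Python purely as notation; the names ALLOWED and check_content and the string labels are hypothetical. The ban on comments inside RCDATA follows the article's tentative reading rather than a verified rule.

```python
TEXT = "text"
ENTITY = "entity reference"
COMMENT = "comment"
ELEMENT = "child element"
ANYTHING = frozenset({TEXT, ENTITY, COMMENT, ELEMENT})

# What each kind of element may contain, per the rules described above.
ALLOWED = {
    "void": frozenset(),                 # must stay empty
    "raw text": frozenset({TEXT}),       # text only, written unescaped
    "RCDATA": frozenset({TEXT, ENTITY}), # text and entity references
    "foreign": ANYTHING,                 # plain XML, no restrictions
    "normal": ANYTHING,                  # anything, but needs a closing tag
}


def check_content(element_kind, content_kind):
    """Raise ValueError if this content may not appear in this element kind."""
    if content_kind not in ALLOWED[element_kind]:
        raise ValueError(
            f"{content_kind} is not allowed inside a {element_kind} element"
        )


check_content("normal", COMMENT)   # fine
check_content("RCDATA", ENTITY)    # fine
try:
    check_content("void", TEXT)
except ValueError as error:
    print(error)
```

A writer would call such a check from each of its write methods before delegating, so that invalid content is reported at the point where it is produced.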

Implementation

I will not give the implementation itself here; it is not complicated, and anyone can write it themselves. I wrote one for myself, and perhaps, once I document it and if people ask insistently enough, I will publish it in some repository. Here are just some useful notes (perhaps a little disorganized).

HtmlXmlWriter derives from XmlWriter. It must wrap another XmlWriter instance (passed to the constructor) and, by default, delegate every call to the corresponding method of that instance.

HtmlXmlWriter should keep track of which element it is currently in (the name and type of the most recently opened element), updating this in XmlWriter.WriteStartElement / WriteEndElement. It must also track whether it is currently inside an attribute (WriteStartAttribute / WriteEndAttribute).

When closing an element (WriteEndElement / WriteFullEndElement), call either WriteEndElement or WriteFullEndElement on the wrapped writer, depending on the type of the element.
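To make the bookkeeping concrete, here is a toy sketch of the tracking and the choice of closing form. It is written in Python and writes to a text buffer directly, so it is not the .NET XmlWriter API: the class TinyHtmlWriter and its methods are hypothetical stand-ins for WriteStartElement, WriteString and WriteEndElement. It uses a stack because elements nest, abbreviates the void list, and ignores attributes, namespaces and foreign elements.

```python
import io
from xml.sax.saxutils import escape

VOID = frozenset({"br", "col", "hr", "img", "input", "link", "meta"})  # abbreviated


class TinyHtmlWriter:
    """Toy writer that picks the HTML closing form for each element."""

    def __init__(self, out):
        self._out = out
        self._stack = []        # names of the currently open elements
        self._tag_open = False  # True while a start tag still lacks its '>'

    def _reject_if_void(self):
        if self._stack and self._stack[-1] in VOID:
            raise ValueError(f"<{self._stack[-1]}> must stay empty")

    def _close_start_tag(self):
        if self._tag_open:
            self._out.write(">")
            self._tag_open = False

    def start_element(self, name):
        self._reject_if_void()
        self._close_start_tag()
        self._out.write("<" + name)
        self._stack.append(name)
        self._tag_open = True

    def text(self, value):
        self._reject_if_void()
        self._close_start_tag()
        self._out.write(escape(value))

    def end_element(self):
        name = self._stack.pop()
        if name in VOID:
            # Short form, the analogue of WriteEndElement.
            self._out.write(" />")
            self._tag_open = False
        else:
            # Full form, the analogue of WriteFullEndElement.
            self._close_start_tag()
            self._out.write("</" + name + ">")


buf = io.StringIO()
writer = TinyHtmlWriter(buf)
writer.start_element("div")
writer.start_element("br")
writer.end_element()
writer.start_element("p")
writer.end_element()
writer.end_element()
result = buf.getvalue()
print(result)
```

The empty p still gets a separate closing tag, while br is written in the short form.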

The hardest part is raw text elements, because XmlWriter escapes some characters. Therefore, inside them the text output methods (WriteCharEntity, WriteString, WriteSurrogateCharEntity) must be redirected to WriteRaw. But do not forget to check that the text contains no closing tag.
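The closing-tag check can be a single case-insensitive search. Below is a sketch in Python (as notation only; the function name raw_text_for is hypothetical). The set of characters that may follow the tag name reflects my reading of the HTML5 restriction on raw text; a stricter check that rejects any "</name" regardless of what follows would also be safe.

```python
import re


def raw_text_for(element_name, text):
    """Return text unchanged if it is safe inside a raw text element.

    Raises ValueError when the text contains something an HTML parser
    could read as the closing tag of element_name.
    """
    # "</name" followed by whitespace, "/" or ">", in any letter case.
    closing = re.compile(
        "</" + re.escape(element_name) + r"(?=[\t\n\f\r />])",
        re.IGNORECASE,
    )
    if closing.search(text):
        raise ValueError(
            f"text would terminate the <{element_name}> element early"
        )
    return text


print(raw_text_for("script", "if (a < b && c) run();"))
try:
    raw_text_for("script", 'document.write("</SCRIPT>");')
except ValueError as error:
    print(error)
```

In the writer, the returned string is what would be handed to WriteRaw instead of WriteString.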

Conclusion

Once you have such a class, you can simply pass it to an XSL transformation (or anywhere else) and get proper HTML from XHTML, which even the dumbest HTML parser will understand.

Source: https://habr.com/ru/post/82034/
