Using XSLT to Prevent XSS by Filtering Custom Content

The problem

I doubt any web developer needs an explanation of what XSS is and why it is dangerous. Yet many sites, such as forums, blogs, and social networks, want to let users embed their own content in a page. For the convenience of inexperienced users there are WYSIWYG editors, which make adding a nicely formatted comment easy and enjoyable. Behind this facade, however, lies a security threat. Any WYSIWYG editor sends the server not just the text of the comment but HTML code. Even if the editor itself offers no way to insert dangerous HTML tags (for example, <iframe>), that will not stop an attacker: they can send arbitrary HTML to the server, and it may be dangerous for other visitors to the site. Few people would be happy to receive something like this in their browser:

<script type="text/javascript">window.location="http://hardcoresex.com/";</script>

This raises the problem: HTML received from the user must be filtered. But what does “filter” mean? What should the filtering algorithm be so that it places no unreasonable restrictions on legitimate users while making an XSS attack impossible? Unfortunately, HTML is rather complicated, writing a good parser is quite difficult, and any mistake in it can leave an attacker a loophole to exploit.

Problem statement

To begin, let me state the problem formally. The filter should:

Parse the received HTML.
Apply the filter rules to it, deleting or converting unsafe elements.
Return the resulting safe HTML for further processing.

To parse HTML you can use existing libraries. In PHP, for example, this is almost trivial:

  function htmlToDOM($html)
  {
      $doc = new DOMDocument();
      $doc->loadHTML($html);
      return $doc;
  }

But what should be done with the resulting DOM? How should the rules that apply to it be expressed? I wanted a solution that would be:

Reliable. By reliability I mean, first of all, a low probability of an error in the code that would let dangerous tags, attributes, or attribute values through.
Versatile. By versatility I mean the ability to filter HTML with an arbitrary degree of detail: from "no tags, only text" to "<iframe> is allowed if its src attribute contains a YouTube address, and nothing else is" or "<p> tags may use the style attribute, but every property except color and background-color is removed from its value".
Easily configurable. It should be possible to describe the rules in an understandable way, and simple rules should be simple to describe, without scrolling through five screens of checkboxes and drop-down lists just to ban all tags.

Finding a solution

I returned to this task from time to time but never found a satisfying solution: everything turned out to be either very complex (in both configuration and implementation) or rather limited. The answer came suddenly. I was considering XSL templates for formatting XML content when it dawned on me: XSLT is used to transform documents, which means it can also be used to filter out unwanted elements!

The solution really satisfies the requirements stated above:

Reliability. All the work is performed by an XSLT processor, where the probability of an error is quite low, much lower than in a home-grown solution.
Versatility. With XSLT you can formulate filtering rules with any degree of detail.
Ease of configuration. Simple configuration comes down to adding items to a whitelist or blacklist following an existing template. Complex cases will of course require additional rules, but that complexity arises only when filtering needs fine-tuning. Another advantage of XSLT is that the configuration can be read, understood, and modified by anyone who knows XSLT.

Creating a filter with XSLT

Blacklist implementation

To find out whether the idea works at all, I decided to create an XSL file that simply copies the source document into the result.

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" encoding="utf-8"/>
    <xsl:template match="@*|*">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()" />
      </xsl:copy>
    </xsl:template>
  </xsl:stylesheet>

As you can see, the key part is:

  <xsl:template match="@*|*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

This fragment handles all elements of the document and their attributes. Text nodes are handled by the built-in rule, which simply copies them into the result. With this template, the node being processed is also copied into the result, and templates are applied recursively to its child nodes and attributes (in practice, this same universal template). So to filter out particular elements, you add templates for them. For example, this filters out <script> tags along with their contents:

  <xsl:template match="script" />

One line! If the content should be kept, another option is available. For example, once the following fragment is added, links stop being links:

  <xsl:template match="a">
    <xsl:apply-templates />
  </xsl:template>

This fragment removes the <a> tags but keeps their contents (which, of course, are filtered too). And this is how to deal with undesirable attributes, for example by removing the style attribute from all elements:

  <xsl:template match="@style" />

As you can see, the rules are simple to write and need minimal explanation even for someone unfamiliar with the system. But listing everything that is forbidden in a blacklist is inconvenient. A blacklist is an additional feature rather than a defense: new tags and attributes keep appearing, and filtering rules that have not been updated can become a threat to the site. I therefore consider a whitelist more appropriate for protecting against XSS (everything that is not explicitly allowed is forbidden).
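
For reference, the blacklist rules from this section can be assembled into one stylesheet. This is only a sketch under stated assumptions: the last template, which drops attributes whose names start with "on" (event handlers such as onclick), is an illustrative addition that does not appear in the fragments above, and it assumes the HTML parser has already lower-cased attribute names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" encoding="utf-8"/>

  <!-- Default: copy every element and attribute, then recurse -->
  <xsl:template match="@*|*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop <script> together with its contents -->
  <xsl:template match="script"/>

  <!-- Drop the <a> tag itself but keep (and filter) its contents -->
  <xsl:template match="a">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Drop the style attribute on any element -->
  <xsl:template match="@style"/>

  <!-- Illustrative addition (assumption): drop on* event-handler attributes -->
  <xsl:template match="@*[starts-with(name(), 'on')]"/>
</xsl:stylesheet>
```

The more specific patterns win over the universal copy rule because XSLT 1.0 gives a pattern such as script or @style a higher default priority than * or @*.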

Whitelist implementation

To implement the white list, the universal rule must be rewritten as follows:

  <xsl:template match="*">
    <xsl:apply-templates />
  </xsl:template>
  <xsl:template match="@*" />

Without additional permitting rules, this leaves only the text nodes of the HTML, removing all tags and their attributes (the empty attribute template is needed because the built-in rule would otherwise copy attribute values into the result as text). To allow, for example, links and images, add:

  <xsl:template match="a|img">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

This rule allows the tags themselves but not their attributes: they are deleted, which makes the tags useless. This is easy to fix:

  <xsl:template match="a/@href|img/@src">
    <xsl:copy />
  </xsl:template>

This rule allows the href attribute on the <a> tag and the src attribute on the <img> tag. Since attributes have no child nodes, they are simply copied into the result. Additional checks can be added as well, for example that the link uses the http:// or https:// protocol (which gets rid of unsafe schemes such as data: or javascript:):

  <xsl:template match="a[@href]">
    <xsl:variable name="target" select="@href" />
    <xsl:choose>
      <!-- Keep the link only for the http:// and https:// protocols -->
      <xsl:when test="starts-with($target, 'http://') or starts-with($target, 'https://')">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
      </xsl:when>
      <!-- Otherwise drop the tag and keep only its contents -->
      <xsl:otherwise>
        <xsl:apply-templates />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  <xsl:template match="a/@href">
    <xsl:copy/>
  </xsl:template>

This rule checks the link target and decides accordingly whether to copy the tag. <a> tags without an href attribute fall under the default rule and are stripped, with their contents kept. Images can be handled in the same way. An alternative is to check the value in the attribute template, although this splits the logic across two places:

  <xsl:template match="a[@href]">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="a/@href">
    <xsl:variable name="target" select="." />
    <!-- Copy the attribute only for the http:// and https:// protocols -->
    <xsl:if test="starts-with($target, 'http://') or starts-with($target, 'https://')">
      <xsl:copy/>
    </xsl:if>
  </xsl:template>
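
The same kind of check for images, mentioned above, might look like the sketch below. It is not taken from the original code: the $upper and $lower variables and the case-insensitive comparison via translate() are assumptions added for illustration, as is allowing the alt attribute. XSLT 1.0 has no lower-case function, so translate() is the usual workaround.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" encoding="utf-8"/>

  <!-- Helper alphabets for case-insensitive comparison (illustrative names) -->
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <!-- Whitelist defaults: strip unknown tags, drop unknown attributes -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="@*"/>

  <!-- Keep <img> only when src points to an http(s) address -->
  <xsl:template match="img[@src]">
    <xsl:variable name="target"
                  select="translate(normalize-space(@src), $upper, $lower)"/>
    <xsl:if test="starts-with($target, 'http://') or starts-with($target, 'https://')">
      <xsl:copy>
        <xsl:apply-templates select="@*"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Attributes allowed on <img> (alt is an assumption) -->
  <xsl:template match="img/@src|img/@alt">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
```

An <img> without src, or with any other scheme, falls through to the default rule and disappears, since it has no contents to keep.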

Another typical task is to add the rel="nofollow" attribute to links:

  <xsl:template match="a[@href]">
    <xsl:copy>
      <xsl:attribute name="rel">nofollow</xsl:attribute>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
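
Note that the link examples above all use the same a[@href] pattern, so a single stylesheet should contain only one of them. A combined version, given here as a sketch rather than as the author's final code, could check the protocol and add rel="nofollow" in one template:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" encoding="utf-8"/>

  <!-- Whitelist defaults: strip unknown tags, drop unknown attributes -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="@*"/>

  <!-- Links: protocol check and rel="nofollow" in a single rule -->
  <xsl:template match="a[@href]">
    <xsl:variable name="target" select="@href"/>
    <xsl:choose>
      <xsl:when test="starts-with($target, 'http://') or starts-with($target, 'https://')">
        <xsl:copy>
          <!-- Attributes must be added before any child nodes -->
          <xsl:attribute name="rel">nofollow</xsl:attribute>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:when>
      <xsl:otherwise>
        <!-- Unsafe target: keep only the link text -->
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- The only attribute copied from the source <a> -->
  <xsl:template match="a/@href">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
```

Because the default attribute rule drops everything except href, a rel attribute supplied by the user cannot override the generated one.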

Finally, the most difficult case: manipulating an attribute's value. I will demonstrate a solution to the problem from the requirements: allow the style attribute, but remove everything from its value except the color and background-color properties. First, create a template that examines a single property and either lets it through or not:

  <xsl:template name="filter-style-value">
    <xsl:param name="value" />
    <!-- Property name, trimmed so that " color: red" is still recognized -->
    <xsl:variable name="key" select="normalize-space(substring-before($value, ':'))" />
    <xsl:if test="($key = 'color') or ($key = 'background-color')">
      <xsl:value-of select="$value" />
    </xsl:if>
  </xsl:template>

Now the second step: iterating over all the properties in the value and checking each one:

  <xsl:template name="filter-style">
    <xsl:param name="value" />
    <xsl:param name="filtered" select="''" />
    <xsl:choose>
      <!-- The value still contains a separator: more than one property is left -->
      <xsl:when test="contains($value, ';')">
        <!-- Split into the first property and the remainder -->
        <xsl:variable name="head" select="substring-before($value, ';')" />
        <xsl:variable name="tail" select="substring-after($value, ';')" />
        <!-- Filter the first property -->
        <xsl:variable name="fltr">
          <xsl:call-template name="filter-style-value">
            <xsl:with-param name="value" select="$head" />
          </xsl:call-template>
        </xsl:variable>
        <!-- Recurse on the remainder -->
        <xsl:call-template name="filter-style">
          <xsl:with-param name="value" select="$tail" />
          <xsl:with-param name="filtered">
            <!-- Append the property to the result (only if it passed the filter) -->
            <xsl:choose>
              <xsl:when test="string-length($fltr) > 0">
                <xsl:value-of select="concat($filtered, $fltr, ';')"/>
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="$filtered" />
              </xsl:otherwise>
            </xsl:choose>
          </xsl:with-param>
        </xsl:call-template>
      </xsl:when>
      <!-- No separator: this is the last property -->
      <xsl:otherwise>
        <!-- Filter it -->
        <xsl:variable name="fltr">
          <xsl:call-template name="filter-style-value">
            <xsl:with-param name="value" select="$value" />
          </xsl:call-template>
        </xsl:variable>
        <!-- Output the accumulated result -->
        <xsl:choose>
          <xsl:when test="string-length($fltr) > 0">
            <xsl:value-of select="concat($filtered, $fltr, ';')"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="$filtered" />
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

This is the largest and most complex template, but the task is not trivial either. It could be simplified somewhat by moving the duplicated code into another auxiliary template, but I did not do that. The template is commented, so I think a detailed description of how it works is unnecessary. The last template is the one that actually filters the tags:

  <xsl:template match="p[@style]">
    <xsl:copy>
      <!-- Replace the style attribute with its filtered value -->
      <xsl:attribute name="style">
        <xsl:call-template name="filter-style">
          <xsl:with-param name="value" select="@style"/>
        </xsl:call-template>
      </xsl:attribute>
      <xsl:apply-templates />
    </xsl:copy>
  </xsl:template>
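
The duplication in filter-style can also be avoided without an extra helper. The sketch below is a rewrite under my own assumptions, not the author's code: it appends a ';' to the value so that the last property is handled like the others, writes each accepted property straight to the output instead of passing an accumulator, and omits the style attribute entirely when nothing survives. Like the original, it checks only property names, so property values would still need validation before real use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" encoding="utf-8"/>

  <!-- Whitelist defaults: strip unknown tags, drop unknown attributes -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="@*"/>

  <!-- Allow <p>, with a filtered style attribute if anything remains -->
  <xsl:template match="p">
    <xsl:variable name="style">
      <xsl:call-template name="filter-style">
        <xsl:with-param name="value" select="string(@style)"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:copy>
      <xsl:if test="string-length($style) > 0">
        <xsl:attribute name="style">
          <xsl:value-of select="$style"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Emit the allowed properties of $value, each followed by ';' -->
  <xsl:template name="filter-style">
    <xsl:param name="value"/>
    <!-- First property; the appended ';' covers a value without a separator -->
    <xsl:variable name="head"
                  select="normalize-space(substring-before(concat($value, ';'), ';'))"/>
    <xsl:variable name="key" select="normalize-space(substring-before($head, ':'))"/>
    <xsl:if test="$key = 'color' or $key = 'background-color'">
      <xsl:value-of select="concat($head, ';')"/>
    </xsl:if>
    <!-- Recurse while properties remain -->
    <xsl:if test="contains($value, ';')">
      <xsl:call-template name="filter-style">
        <xsl:with-param name="value" select="substring-after($value, ';')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

For example, a paragraph with style="color: red; font-size: 99px; background-color:#fff" comes out with style="color: red;background-color:#fff;".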

Conclusion

With this, I believe I have come as close as possible to the stated goal: a reliable and flexible filter for user-entered content. One caveat up front: the XSL shown here contains inaccuracies and is meant solely to demonstrate the concept; it is not code that can be used in production. I also have not designed the system as a whole, but it would obviously store the filter's output, so the transformation is performed only once, when the content is added, and the safe version is what gets displayed on the page.

Thank you for reading to the end. I hope the community will find this article useful.

Source: https://habr.com/ru/post/171557/
