The fundamental HTML vulnerability when embedding scripts

To describe the essence of the problem, I first need to explain how HTML works. You probably have a general idea already, but I will briefly go over the main points needed to understand the issue. If you can't wait, skip straight to the section on the <script> tag.

HTML is a hypertext markup language. To speak this language, you must follow its format, otherwise whoever reads what you wrote will not understand you. For example, tags in HTML have attributes:

<p name="value">

Here [name] is the name of the attribute, and [value] is its value. In this article, I put square brackets around code to make it clear where it begins and ends. The name is followed by an equals sign, and then by the value enclosed in quotation marks. The attribute value begins immediately after the first quotation mark and ends immediately before the next quotation mark, wherever that is. This means that if instead of [value] you write [OOO "  ".], the value of the name attribute will be [OOO ], and everything after the second quotation mark will be read as further attribute names without values (here, ["."]).

 <p name="OOO "  "."></p>

If this is not what you expected, you need to change the attribute value somehow so that it does not contain a quotation mark. The simplest thing you can think of is to just strip the quotation marks.

 <p name="OOO   ."></p>

Then the HTML parser will read the value correctly, but the trouble is that it will be a different value. You wanted [OOO "  ".], and you got [OOO   .]. In some cases, this difference may be critical.

So that you can specify any string as a value, the HTML format lets you escape attribute values. Instead of a quotation mark in the value, you can write the sequence of characters [&quot;], and the parser will understand that the source string had a quotation mark at this place. Such sequences are called HTML entities.

 <p name="OOO &quot;  &quot;."></p>

At the same time, if your source string really did contain the sequence of characters [&quot;], you can still write it in such a way that the parser does not turn it into a quotation mark. To do this, replace the [&] sign with the sequence [&amp;], that is, instead of [&quot;] you write [&amp;quot;] in the raw text.

It turns out that the conversion from the source string to the text we write between the two quotation marks is unambiguous and reversible. Thanks to these transformations, you can write and read any string as an attribute of an HTML tag without caring about its contents. You just follow the format, and everything works.

In fact, this is how most of the formats we come across work: there is a syntax, there is a way of escaping content so that it is not mistaken for syntax, and there is a way of escaping the escape characters themselves, in case such a sequence occurs in the source string. Most, but not all...
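The reversibility of attribute escaping can be checked with a small sketch. The function names `escapeAttribute` and `unescapeAttribute` are made up for this illustration, and the decoder only handles the two entities discussed here, whereas a real HTML parser decodes many more.

```javascript
// Escape a string for use inside a double-quoted HTML attribute value.
// "&" is replaced first; otherwise the "&" of a freshly written "&quot;"
// would itself be escaped a second time.
function escapeAttribute(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Inverse transformation for the two entities produced above.
// "&amp;" is decoded last, for the mirror-image reason.
function unescapeAttribute(escaped) {
  return escaped.replace(/&quot;/g, '"').replace(/&amp;/g, '&');
}

// A source string that contains both real quotation marks
// and a literal "&quot;" sequence.
const source = 'OOO " ". &quot;';
const escaped = escapeAttribute(source);

console.log('<p name="' + escaped + '"></p>');
console.log(unescapeAttribute(escaped) === source);
```

The order of the two replacements is the whole trick: swapping them would make the conversion lossy.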

The <script> tag

The <script> tag is used to embed fragments written in other languages into HTML. Today, in 99% of cases, that language is JavaScript. The script begins immediately after the opening <script> tag and ends immediately before the closing </script> tag. The HTML parser does not look inside the tag: to it, the content is just some text that it then hands over to the JavaScript parser.

JavaScript, in turn, is an independent language with its own syntax, and, generally speaking, it was not designed in any special way for being embedded in HTML. As in any other language, it has string literals that can contain anything. And, as you may have guessed, they can contain the sequence of characters that forms the closing </script> tag.

 <script> var s = "surprise!</script><script>alert('whoops!')</script>"; </script>

What should happen here: a harmless string is assigned to the variable s.

What actually happens here: the script in which the variable s is declared ends like this: [var s = "surprise!], which leads to a syntax error. All the text after it is interpreted as plain HTML, and any markup can be injected into it. In this case, a new <script> tag is opened and the malicious code is executed.
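To make the cut-off point concrete, here is a deliberately simplified model of how the HTML parser finds the end of a script. The helper names are invented for this sketch, and the model ignores the special handling of [<!--] that real parsers also perform.

```javascript
// Simplified model: the script content ends at the first "</script"
// that is followed by whitespace, "/" or ">", matched case-insensitively.
function scriptContentAsSeenByHtmlParser(textAfterOpeningTag) {
  const end = textAfterOpeningTag.search(/<\/script[\s\/>]/i);
  return end === -1 ? textAfterOpeningTag : textAfterOpeningTag.slice(0, end);
}

// Compile (but do not run) a piece of JavaScript to see whether it parses.
function isValidScript(sourceCode) {
  try {
    new Function(sourceCode);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Everything that follows the opening <script> tag in the example above.
const afterOpeningTag =
  ' var s = "surprise!</script><script>alert(\'whoops!\')</script>"; </script>';

const seen = scriptContentAsSeenByHtmlParser(afterOpeningTag);
console.log(JSON.stringify(seen));
console.log(isValidScript(seen));
```

The JavaScript parser never gets a chance to notice that the closing tag was inside a string literal, because it only receives the truncated text.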

We get the same effect as when a quotation mark is present in an attribute value. But unlike attribute values, the <script> tag offers no way to escape the source content. HTML entities do not work inside the <script> tag: they are passed to the JavaScript parser unchanged, so they either cause an error or change the meaning of the code. The HTML standard explicitly states that the content of the <script> tag cannot contain the sequence of characters </script> in any form. And the JavaScript standard does not forbid such a sequence from appearing anywhere in string literals.

The result is a paradox: by embedding valid JavaScript into a valid HTML document using entirely valid means, we can get an invalid result.

In my opinion, this is a vulnerability of HTML markup itself, and it leads to vulnerabilities in real applications.

How the vulnerability is exploited

Of course, when you simply write code by hand, it is hard to imagine that you would put </script> in a string and not notice any problems. At the very least, syntax highlighting will show you that the tag closed too early; at worst, the code will not run and you will spend a long time figuring out what happened. But this is not the main problem with this vulnerability. The problem arises where you insert some content into JavaScript while generating HTML. Here is a common piece of code from an application with server-side rendering:

 <script> window.__INITIAL_STATE__ = <%- JSON.stringify(initialState) %>; </script>

The sequence </script> may appear in initialState anywhere the data comes from the user or from other systems. JSON.stringify will not change such strings during serialization, because they fully comply with the JSON and JavaScript formats, so they will simply end up on the page and allow the attacker to execute arbitrary JavaScript in the user's browser.
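A quick check confirms that JSON.stringify leaves the closing tag untouched. The object shape and the user name below are invented sample data.

```javascript
// Sample state with a user-controlled field (hypothetical data).
const initialState = {
  user: { name: 'Mallory</script><script>alert("whoops!")</script>' },
};

const serialized = JSON.stringify(initialState);

// What a template such as the one above would produce.
const html =
  '<script> window.__INITIAL_STATE__ = ' + serialized + '; </script>';

// JSON.stringify escapes the quotation marks but not "<" or "/".
console.log(serialized);
console.log(serialized.includes('</script><script>'));
```

The serialized text is perfectly valid JSON and valid JavaScript, which is exactly why nothing in the pipeline objects to it.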

Another example:

 <script>
     analytics.identify(
         '<%- user.id.replace(/(\'|\\)/g, "\\$1") %>',
         '<%- request.HTTP_REFERER.replace(/(\'|\\)/g, "\\$1") %>',
         ...
     );
 </script>

Here, the user id and the referer that came to the server are written into string literals with the appropriate escaping. And while user.id is unlikely to contain anything other than digits, an attacker can cram anything into the referer.
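The escaping used in this template can be tried in isolation. The function name and the referer value are made up for the sketch; the regular expression is the one from the template above.

```javascript
// Escape a value for a single-quoted JavaScript string literal:
// put a backslash before every single quote and every backslash.
function escapeForSingleQuotedLiteral(value) {
  return value.replace(/('|\\)/g, '\\$1');
}

// A referer fully controlled by the attacker (hypothetical value).
const referer = 'http://evil.example/</script><script>alert(1)</script>';
const escapedReferer = escapeForSingleQuotedLiteral(referer);

// The value has no quotes or backslashes, so the escaping changes nothing.
console.log(escapedReferer === referer);

// Quotes, by contrast, are handled as intended.
console.log(escapeForSingleQuotedLiteral("it's"));
```

The escaping is correct for JavaScript, but it was never meant to protect against the HTML parser, which reads the page first.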

But the fun does not end with the closing </script> tag. The opening <script> tag is also dangerous if the characters [<!--], which in normal HTML mark the beginning of a multi-line comment, appear anywhere before it. And in this case, the syntax highlighting of most editors will not help you.

 <script>
     var a = 'Consider this string: <!--';
     var b = '<script>';
 </script>

 <p>Any text</p>

 <script>
     var s = 'another script';
 </script>

What do a sane person and most syntax highlighters see in this code? Two <script> tags with a paragraph between them.

What does the sick HTML5 parser see? It sees one (!) unclosed (!) <script> tag containing all the text from the second line to the last.

I don't understand why it works like this. I only understand that once it encounters the characters [<!--], the HTML parser seems to start matching opening and closing <script> tags and does not consider the script finished until every opened <script> tag has been closed. That is, in most cases this script will run to the very bottom of the page (unless someone manages to inject one more extra closing </script> tag further down, hehe). If you have not come across this before, you might think that I am joking. Unfortunately not.

[Screenshot: the DOM tree of the example above]

The most annoying thing is that, unlike the closing </script> tag, which in JavaScript can occur only inside string literals, the sequences <!-- and <script can also occur in the code itself, and they will have exactly the same effect.

 <script> if (x<!--y) { ... } if ( player<script ) { ... } </script>
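Since these sequences can hide in ordinary expressions, a crude text scan can at least flag them before a script is inlined. This is only a sketch with an invented function name; it looks at raw text and does not model the parser's actual state machine.

```javascript
// Report every occurrence of the sequences that affect how the HTML
// parser reads script content: "<!--", "<script" and "</script".
function findRiskySequences(scriptSource) {
  return scriptSource.match(/<!--|<\/?script/gi) || [];
}

console.log(findRiskySequences('if (x<!--y) { }'));
console.log(findRiskySequences('if (player<script) { }'));
console.log(findRiskySequences('var limit = 1 < 2;'));
```

Adding a space, as in [x < !--y] or [player < script], keeps the meaning of the code and removes the match.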

What does the specification say?

The HTML specification, besides prohibiting legal sequences of characters inside the <script> tag and providing no way of escaping them within HTML, also gives the following advice:

The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape "<!--" as "<\!--", "<script" as "<\script", and "</script" as "<\/script" when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions.

In short: escape these sequences inside literals and avoid them in the code itself. I find this recommendation touching. It makes several naive assumptions at once:

In an embedded script (and this is not necessarily Javascript), the above sequences of characters can either be inside string literals, or they can be easily avoided in the syntax of the language.
In an embedded script, non-special characters can be escaped in string literals, and this does not change the values of the literals.
Anyone who embeds a script knows what the script is, deeply understands its syntax and is able to make changes in its structure.

And while the first two assumptions hold at least for JavaScript, the last one does not hold even there. A script is not always inserted into HTML by a qualified person; it may be inserted by some HTML generator. Here is an example of how the browser itself fails to cope with this:

 var script = document.createElement('script');
 script.innerText = 'var s = "</script><script>alert(\'whoops!\')</script>"';
 console.log(script.outerHTML);

 >>> <script>var s = "</script><script>alert('whoops!')</script>"</script>

As you can see, the string with the serialized element will not be parsed back into an element similar to the original one. The transformation from DOM tree to HTML text is, in general, neither unambiguous nor reversible. Some DOM trees simply cannot be represented as HTML source text.

How to avoid problems?

As you have already understood, there is no way to safely insert JavaScript into HTML. But there are ways to make JavaScript safe for insertion into HTML (feel the difference). However, this requires you to stay extremely attentive whenever you write something inside the <script> tag, especially if you insert data using a template engine.

First, the probability that the sequences [<!--] and [<script] appear in your source code outside string literals (even after minification) is extremely small. You are unlikely to write something like that yourself, and if an attacker can write directly into the <script> tag, these characters are the least of your worries.

That leaves the problem of these sequences inside strings. In this case, as the specification says, all you need to do is replace every "<!--" with "<\!--", "<script" with "<\script", and "</script" with "<\/script". The trouble is that if you output some structure using JSON.stringify(), you will hardly want to parse the result again to find all the string literals and escape something in them. I also don't want to recommend other serialization packages that already account for this problem, because situations differ, and the protection should always be available and universal. Therefore, I suggest escaping the characters [/] and [!] with a backslash after serialization. These characters cannot occur in JSON anywhere except inside strings, so a simple replacement is safe. This leaves the sequence "<script" unchanged, but it is not dangerous on its own.

 <script> window.__INITIAL_STATE__ = <%- JSON.stringify(initialState).replace(/(\/|\!)/g, "\\$1") %>; </script>

Similarly, individual strings can be escaped.

Another tip is not to embed anything in the <script> tag at all. Store data in places where the transformation for inserting data is unambiguous and reversible, for example in the attributes of other elements. Admittedly, this looks rather dirty and works only with strings; JSON has to be parsed separately.
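The replacement can be wrapped in a helper and checked by evaluating the result as a JavaScript expression. The name `serializeForInlineScript` and the sample data are assumptions of this sketch. Note that the output is meant to be read as JavaScript: the [\!] escape is accepted in JavaScript string literals, but it is not part of the JSON grammar, so JSON.parse would reject the escaped text.

```javascript
// Serialize data for embedding in an inline <script>, putting a
// backslash before every "/" and "!" in the JSON text.
function serializeForInlineScript(data) {
  return JSON.stringify(data).replace(/(\/|!)/g, '\\$1');
}

// Hypothetical data containing all the problematic sequences.
const state = {
  bio: '</script><script>alert("whoops!")</script>',
  note: 'a <!-- b',
  path: 'a\\/b',
};

const embedded = serializeForInlineScript(state);
console.log(embedded);

// Evaluate the text the way a browser would: as a JavaScript expression.
const evaluated = new Function('return ' + embedded + ';')();
console.log(JSON.stringify(evaluated) === JSON.stringify(state));
```

Both "</script" and "<!--" disappear from the page source, while the values the script sees stay exactly the same.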

 <var id="s" data="surprise!</script><script>alert(&quot;whoops!&quot;)</script>"></var>

 <script>
     var s = document.getElementById('s').getAttribute('data');
     console.log(s);
 </script>
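For structured data, the same idea could look roughly like the following server-side sketch. The helper names are invented, and the client-side part is shown only as a comment because it needs a browser.

```javascript
// Escape a string for a double-quoted HTML attribute value ("&" first).
function escapeAttributeValue(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Render an element that carries JSON data in its "data" attribute.
function renderDataHolder(id, data) {
  return '<var id="' + escapeAttributeValue(id) + '" data="' +
    escapeAttributeValue(JSON.stringify(data)) + '"></var>';
}

const holderMarkup = renderDataHolder('s', { s: 'x"y' });
console.log(holderMarkup);

// In the browser, the data would then be read back with something like:
//   var data = JSON.parse(document.getElementById('s').getAttribute('data'));
```

Here the HTML parser undoes the attribute escaping and JSON.parse undoes the serialization, so each layer is reversed by the tool that owns it.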

But really, if you want to develop applications normally instead of tiptoeing through a minefield, you need a reliable way to embed scripts in HTML. Therefore, I believe the right decision is to abandon the <script> tag completely as unsafe.

The <safescript> tag

If you do not use embedded scripts, then what? Of course, loading all scripts from external files is not an option: sometimes having some JavaScript with data right inside the HTML document is very convenient, since there are no extra HTTP requests and no need to add routes on the server side.

Therefore, I propose introducing a new tag, <safescript>, whose contents fully obey the usual rules of HTML: HTML entities work for escaping content, and therefore embedding any script into it is absolutely safe.

 <safescript>
     var s = "surprise!&lt;/script&gt;&lt;script&gt;alert('whoops!')&lt;/script&gt;";
 </safescript>

 <safescript>
     var a = 'Consider this string: &lt;!--';
     var b = '&lt;script&gt;';
 </safescript>

There is no need to wait for browsers to implement this tag. I wrote a very simple polyfill, safescript, that lets you use it right now. Here is all that is needed:

 <script type="text/javascript" src="/static/safescript.js"></script>
 <style type="text/css">safescript {display: none !important}</style>

The code inside <safescript> looks ugly and unusual. But that is the code that ends up in the HTML itself. In the template engine you use, you can write a simple filter that inserts the tag and escapes all of its contents. Here is how this might look in the Django template engine:

 {% safescript %}
     var s = "surprise!</script><script>alert('whoops!')</script>";
 {% endsafescript %}

 {% safescript %}
     var a = 'Consider this string: <!--';
     var b = '<script>';
 {% endsafescript %}
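What such a filter has to do can be sketched in a few lines. This is not the author's polyfill or an actual Django tag, only an illustration with invented function names of the escaping step a template filter would perform.

```javascript
// Escape text for use as ordinary HTML element content ("&" first).
function escapeHtmlText(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap a script in a <safescript> element, escaping all of its contents.
function renderSafescript(code) {
  return '<safescript>' + escapeHtmlText(code) + '</safescript>';
}

console.log(renderSafescript("var b = '<script>';"));
console.log(renderSafescript('var s = "surprise!</script>";'));
```

Because the filter treats the script as opaque text, it needs no knowledge of JavaScript syntax, which is the property the <script> tag lacks.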

This approach lets you forget about escaping JavaScript and avoid many vulnerabilities. It would be very nice if the people developing the HTML specification added such a tag to the standard set or came up with some other way to solve the problem of unsafe embedding of scripts in HTML.

Source: https://habr.com/ru/post/348558/
