
The parser in Nimbus Note, or how we solved the problem of "clean" HTML

One of the key features of Nimbus Note is saving and/or editing notes in the form of an HTML document. These notes are created and edited in the browser or on mobile devices and then sent to the server. And, as professional paranoia suggests, information that comes from the user cannot be trusted: it can contain anything — XSS, a document that turns the layout into an abstractionist's dream, or not text at all. Consequently, data coming from the user needs preprocessing. In this article I will describe some features of our solution to this problem.







It might seem that there is nothing difficult here: plug in any HTML purifier before saving, and that's it. Yes, that is how it could have been done, were it not for a number of circumstances.





The first three points clearly require a solution that runs separately from the main code. The fourth rules out the use of message queues (RabbitMQ, for example) or, equivalently, leads to the need for non-trivial workarounds when using them.



And, finally, the last three points require deep processing of the markup, bearing in mind that the input is most likely invalid to begin with (stray and/or unclosed tags, bogus attributes and values). For example, if the width of some element is set to 100500, that value does not fit the definition of "permissible" and must be removed or replaced (depending on the settings).
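The kind of attribute validation described above can be sketched with the standard library's `html.parser`. The whitelist, the `MAX_WIDTH` bound, and all names here are illustrative assumptions, not the actual Nimbus Note rules:

```python
# A minimal sketch of whitelist-based attribute validation.
# ALLOWED, MAX_WIDTH and the validator functions are hypothetical
# examples, not the real Nimbus Note configuration.
from html.parser import HTMLParser

MAX_WIDTH = 2000  # assumed upper bound for a "permissible" width


def valid_width(value):
    try:
        return 0 <= int(value) <= MAX_WIDTH
    except ValueError:
        return False


# Per-tag whitelist: attribute name -> predicate on its value.
ALLOWED = {
    "img": {"src": lambda v: v.startswith(("http://", "https://")),
            "width": valid_width},
    "p": {},
}


class AttrFilter(HTMLParser):
    """Re-emits HTML, dropping unknown tags and disallowed attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop unknown tags entirely
        kept = [(k, v) for k, v in attrs
                if k in ALLOWED[tag] and ALLOWED[tag][k](v or "")]
        attr_str = "".join(f' {k}="{v}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def clean(html):
    f = AttrFilter()
    f.feed(html)
    return "".join(f.out)
```

With these rules, `clean('<img src="https://x/a.png" width="100500">')` keeps the `src` but silently drops the absurd `width`.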



All of the above arguments led us to decide to reinvent the wheel and write our own parser/validator. Python was chosen as the language: the main project is written in it, and, of course, aesthetic preferences played a role.



In order not to write everything from scratch, we decided to simplify our lives and use some lightweight framework. The choice fell on Tornado, since we already had experience with it.
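A service of the kind described could look roughly like the following Tornado sketch. The handler name, the `/parse` route, and the `sanitize` stub are all illustrative assumptions, not the actual Nimbus Note code:

```python
# Minimal sketch of an HTML-cleaning service on Tornado.
# ParseHandler, /parse and sanitize() are hypothetical names.
import tornado.ioloop
import tornado.web


def sanitize(html: str) -> str:
    # Placeholder for the real cleaning/validation pipeline.
    return html


class ParseHandler(tornado.web.RequestHandler):
    def post(self):
        raw = self.request.body.decode("utf-8", errors="replace")
        self.set_header("Content-Type", "text/html; charset=utf-8")
        self.write(sanitize(raw))


def make_app():
    return tornado.web.Application([(r"/parse", ParseHandler)])


if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()
```

Each such process is stateless, which is what makes the "just add more instances" scaling below possible.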



For reasons of scalability, we put an nginx load balancer in front of the system. Such a structure allows processing capacity to be increased over a fairly wide range simply by adding parser instances. And a timeout on the client side while waiting for a response from the parser lets us cap the maximum wait at a value that keeps users in their comfort zone (no feeling that "everything is hanging").
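A balancing setup of this kind might look like the following nginx fragment; the upstream name, ports, and timeout value are illustrative, not the actual production config:

```nginx
# Hypothetical upstream pool; scaling out means adding servers here.
upstream parsers {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    # server 127.0.0.1:8003;  # more parser instances as load grows
}

server {
    listen 80;

    location /parse {
        proxy_pass http://parsers;
        proxy_read_timeout 5s;  # cap the wait so clients never feel a "hang"
    }
}
```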



At first, lxml was chosen as the engine for the HTML parser: a good, fast parser written in C. And all would have been well with it, if not for a couple of "surprises."



Firstly, in the course of the work, a well-known fact showed itself in all its "glory": lxml interprets HTML documents roughly as "broken" XML. This feature, which at first caused no concern, began to demand an increasing number of workarounds. For example, lxml persistently believed that an empty "<span></span>" is a single tag and regularly performed the conversion "<span></span>" => "<span/>".
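This XML-serializer habit is easy to reproduce with the standard library's ElementTree, shown here only to illustrate the behavior (it is not the library the project used): an XML serializer is free to collapse an empty element into a self-closing tag, which browsers do not treat as closed when it comes back as HTML.

```python
# Illustration of XML-style treatment of empty elements: the XML
# serializer collapses <span></span> into a self-closing tag.
import xml.etree.ElementTree as ET

elem = ET.fromstring("<span></span>")
print(ET.tostring(elem))  # b'<span />'
```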



However, we could have put up with the workarounds if it were not for the second surprise. During a test run on a copy of real data, the parser reliably crashed with a segmentation fault. The reason remains unknown: the crash was guaranteed to occur after processing roughly five hundred records, regardless of their contents (the sample was drawn from different parts of the table).



Thus, having collected our share of bumps, we settled on the combination of Beautiful Soup and html5lib, plus our own workarounds.
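As a rough sketch of that combination (assuming the third-party `beautifulsoup4` and `html5lib` packages), html5lib follows the HTML5 parsing algorithm, so broken markup is repaired the way a browser would repair it instead of being treated as defective XML:

```python
# Sketch of the Beautiful Soup + html5lib pairing (third-party
# packages: beautifulsoup4, html5lib). Unclosed tags are closed,
# and the fragment is normalized into a full document tree.
from bs4 import BeautifulSoup

broken = "<p>unclosed <b>bold text"
soup = BeautifulSoup(broken, "html5lib")
print(soup.body)  # the <b> and <p> come back properly closed
```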



After this decision it almost began to feel like "this is it, happiness." And that happiness lasted until a page from msn.com, processed by the parser, caught my eye. The noteworthy features of that page turned out to be its inventive use of the "type" attribute on "input" tags, and its layout authors' love of "position: absolute;". Once the problem was localized, it was relatively easy to fix: adjust the configs, a bit of code and, of course, tests covering the weak spots we had found.
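A fix for pages like that can be sketched as a bit of config plus a small style filter. The whitelist, the regex, and the fallback behavior below are illustrative assumptions, not the actual Nimbus Note configuration:

```python
# Hypothetical config: which <input type="..."> values are accepted,
# and which CSS declarations are stripped from style attributes.
import re

ALLOWED_INPUT_TYPES = {"text", "checkbox", "radio", "submit"}
BANNED_CSS = re.compile(r"position\s*:\s*absolute\s*;?", re.IGNORECASE)


def normalize_input_type(value):
    """Fall back to a safe default for exotic or unknown input types."""
    return value if value in ALLOWED_INPUT_TYPES else "text"


def clean_style(style):
    """Drop declarations like 'position: absolute' that wreck the note layout."""
    return BANNED_CSS.sub("", style).strip()
```

So `normalize_input_type("month")` degrades to a plain `"text"` input, and `clean_style("position:absolute; color:red")` keeps only the harmless declaration.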



Now we are not just abstractly confident that many pages on the net contain invalid HTML — we are waiting for the next "surprise" to arrive. We wait, taking preventive measures, knowing that one day we will see it slip past all the filters and all the tricks. We will see a page that is the product of an abstract artist's delirium...

Source: https://habr.com/ru/post/217435/


