
How I developed a markup language translator

This story began back in 2008. In one of my projects, I decided to use BB codes to counter XSS attacks and started looking for a suitable Java library. Despite thorough googling, I found nothing. Of course, there were then, and still are, many libraries for translating BB codes into HTML, but they all failed me on one criterion: the inability to add or remove tags. Remembering my university course "Programming Languages and Translation Methods" (the benefit of a classical education!), I set about implementing my own library for parsing BB codes. The result was my longest-running project, KefirBB.


KefirBB is now a configurable markup translator. The library ships with configurations for BBCode translation, HTML filtering, Textile, and partial Markdown support. Users can also implement their own markup languages. The configuration can be specified in an XML file or programmatically.


Briefly about the configuration

There are two main entities: codes and scopes. A code is both a pattern for parsing a piece of BB code and a template for generating the corresponding HTML.


 <code name="bold">
     <pattern ignoreCase="true">[b]<var inherit="true"/>[/b]</pattern>
     <template>&lt;b&gt;<var/>&lt;/b&gt;</template>
 </code>

The text inside a code can be parsed with a completely different set of codes than the text outside it. For this purpose there are scopes, each of which groups several codes. Different scopes may include the same codes. In the example above, the scope is inherited from the parent code. There is also a default scope, ROOT.
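For illustration, here is a sketch of how a scope grouping several codes might be declared in the XML configuration. The element and attribute names (`scope`, `coderef`) are assumptions given from memory and may not match KefirBB's actual schema exactly; the User Guide is the authoritative reference.

```xml
<!-- Hypothetical sketch: a scope grouping several codes.
     Element names may differ from the real KefirBB schema. -->
<scope name="ROOT">
    <coderef name="bold"/>
    <coderef name="italic"/>
</scope>
```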


More information about all of this can be found in the documentation: the User Guide.


Here I would like to share the experience I gained while developing this translator.


The first problem I encountered was error handling. I designed the translator for context-sensitive grammars from the start, which means it cannot be built on finite state machines: the parser is a recursive algorithm. At university we were, of course, taught how to handle errors, but with the expectation that the mistakes come from a programmer slipping up, not from an attacker trying to hack us. The very first version of the translator would hang indefinitely on this plain text:


 [b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b][b] 

This happened because when the parser encountered the tag [b], it recognized it as the start of a [b][/b] pair and began parsing what followed as if it were already inside that tag. Then the 2nd, 3rd, 4th, and so on. When it reached the last tag and the end of the text, the parser realized it had been deceived. It decided the last tag was erroneous and backtracked to the one before it. It immediately concluded that the penultimate tag was also erroneous and began parsing the last tag in the context of the one preceding that. First it decided the tag was correct, then that it was wrong, then backtracked again, and so on. The process would end somewhere in eternity: the complexity of the algorithm is O(2^N).


And I must say, it was a serious problem. I was close to despair and ready to give up. Fortunately, my fellow student Renat Mullakhanov, a world champion in programming, who is sadly no longer with us, suggested a solution: remember the places where errors have already occurred and never re-parse them. In our example, once the translator realizes the last tag is broken, it will not attempt to parse it again in the context of the penultimate tag, but will immediately return an error. The complexity of the algorithm drops sharply, to O(N). There were no scopes in the translator back then, and with their arrival the algorithm had to be rewritten, but the basic idea remains.
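The memoization idea can be sketched in a few dozen lines. This is a minimal toy, not KefirBB's actual code: a recursive-descent parser for a single hypothetical [b]...[/b] tag that remembers the positions where a parse attempt has already failed, so an unclosed tag is rejected once instead of being retried inside every enclosing context.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy sketch of the error-memoization idea (not KefirBB's real parser).
 * Without the failedAt set, a run of unclosed [b] tags is re-parsed in
 * every enclosing context, giving exponential blowup; with it, each
 * position fails at most once.
 */
public class MemoParser {
    private final String text;
    // Positions where an attempt to parse a tag has already failed.
    private final Set<Integer> failedAt = new HashSet<>();

    public MemoParser(String text) {
        this.text = text;
    }

    /** Tries to parse a [b]...[/b] tag at pos; returns end index or -1. */
    private int parseTag(int pos) {
        if (failedAt.contains(pos)) {
            return -1; // already known to fail here: do not re-parse
        }
        if (!text.startsWith("[b]", pos)) {
            return -1;
        }
        int i = pos + 3;
        while (i < text.length() && !text.startsWith("[/b]", i)) {
            int inner = parseTag(i);          // nested tag?
            i = (inner >= 0) ? inner : i + 1; // else consume one character
        }
        if (i >= text.length()) {
            failedAt.add(pos); // unclosed tag: remember the failure
            return -1;
        }
        return i + 4; // index just past [/b]
    }

    /** Returns true if the text contains only balanced [b] tags and plain chars. */
    public boolean parse() {
        int i = 0;
        while (i < text.length()) {
            int end = parseTag(i);
            i = (end >= 0) ? end : i + 1;
        }
        return failedAt.isEmpty();
    }
}
```

With the `failedAt` set removed, the article's 32-tag example makes this sketch revisit each suffix from every enclosing tag; with it, the pathological input is rejected after a single pass over each position.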


The translator kept developing, its capabilities expanded, and it became firmly embedded in all our projects. Once it was even used for translation. And one day we noticed that on one of the sites translating text was causing quite serious problems: translating a text of average size took 2 seconds, which was unacceptable. A long round of optimization began.


The first thing you should learn when optimizing code: never trust your intuitive guesses. Only a profiler can help you identify performance problems. But even that did not fully save me. I looked in the profiler and saw that most of the time was spent in the method that decides which code should be used for parsing at the current position. I optimized it long and hard and sped it up hundreds of times. But as it turned out, most of the time is spent parsing text that contains no codes at all. That is, what needed optimizing was not the parsing of codes, but the handling of code-free text, since there is ten times more of it in real texts.


How was that done? As a first step, KefirBB determines the characters with which a code can begin. There are usually few such characters: for BB codes it is [ , and for HTML it is < . The scan then reduces to comparing the current character against that small set. Surprisingly, this optimization accelerated text processing by dozens of times.
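The start-character fast path can be illustrated with a short sketch. This is hypothetical code, not KefirBB's implementation: plain characters are copied straight to the output, and the (expensive) code-matching machinery runs only when the current character could begin a code. For brevity the slow path handles just a literal [b]...[/b] pair.

```java
/**
 * Sketch of the "possible start characters" optimization (hypothetical,
 * not KefirBB's actual code). Characters that cannot begin any code are
 * copied through without invoking the parser at all.
 */
public class StartCharFilter {
    public static String translate(String text, char startChar) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != startChar) {
                out.append(c); // fast path: plain text, no parsing attempt
                i++;
            } else {
                // Slow path: only here would real code matching run.
                // For the sketch, translate one literal [b]...[/b] pair.
                int close = text.indexOf("[/b]", i);
                if (text.startsWith("[b]", i) && close >= 0) {
                    out.append("<b>").append(text, i + 3, close).append("</b>");
                    i = close + 4;
                } else {
                    out.append(c); // a lone startChar is ordinary text
                    i++;
                }
            }
        }
        return out.toString();
    }
}
```

Since real texts are mostly code-free, almost every character takes the one-comparison fast path, which is where the dozens-of-times speedup comes from.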


I continue to develop KefirBB, adding new features and support for new markup languages. Any criticism and comments are welcome.



Source: https://habr.com/ru/post/236907/

