
Translation of the article
The MediaWiki parser, uncovered
The MediaWiki parser is a fundamental part of the MediaWiki engine's code. Without it, you would not be able to insert the various tags into your Wikipedia articles: sections, links, or pictures; you could not even view, let alone quickly change, the layout of other articles. The wiki markup is flexible enough to make writing articles equally easy for beginners and for HTML experts. Because of this, the parser code is rather complicated, and over the years it has gone through many attempts to improve it. Even so, it still works fairly quickly today for Wikipedia, one of the largest websites in the world. Let's take a look at the insides of this valuable (if slightly abstruse) piece of code.
A brief history
Disclaimer: this history, as I understand it, is drawn mainly from discussions I took part in over many years on the Wikimedia mailing list, as well as from discussions at the Wikimania 2006 conference. Until 2008, the MediaWiki parser suffered badly from the extraordinary complexity of fitting everything into a single pass (for speed), and also from the fact that new rules were constantly being bolted onto the existing code. Over time, the parser became real spaghetti code, very difficult to debug and even harder to improve. Rewriting it was almost impossible, because it belongs to the engine core: millions of Wikipedia pages could break in an instant if the new code had an error somewhere.
What to do
There was a lot of discussion about how to solve this problem. Some suggested rewriting the parser in C, which would make it faster and thus allow parsing the text not in a single pass but in a loop, something demanded by the ever-growing number of templates and subtemplates included on Wikipedia pages. There were also proposals to change the MediaWiki syntax itself so as to eliminate ambiguities when parsing certain constructs (the markup for bold and italic, for example, or the relationship between triple and double curly braces in templates).
In the end it was decided, and I consider this a brilliant idea, to keep the parser in PHP (rewriting it in C would have split MediaWiki developers into two camps) and to divide parsing into two stages: preprocessing and parsing. The preprocessor's job is to represent the wikitext as an XML DOM. The actual parsing stage then processes the DOM tree in a loop of as many iterations as are required (for template substitution, for example) to produce valid static HTML. Looping over the DOM is extremely fast, it is very natural from an XHTML point of view, and PHP supports this kind of processing very well.
Preprocessor
In the MediaWiki source folder you will find two versions of the preprocessor, a Hash version and a DOM version, located at /includes/parser/Preprocessor_Hash.php and /includes/parser/Preprocessor_DOM.php respectively.
We will focus on the DOM version: it is almost identical to the Hash version, but it is faster because it uses PHP's XML support, an optional PHP component. The most important function in the preprocessor class is called preprocessToObj(). The Preprocessor_DOM.php file also contains several other important classes that the preprocessor uses: PPDStack, PPDStackElement, PPDPart, PPFrame_DOM and PPNode_DOM.
The preprocessor does less than you think
So what does MediaWiki's XML look like? Here is how the wikitext "{{mytemplate}} this is a [[test]]" looks in its XML representation:
<root><template><title>mytemplate</title></template> this is a [[test]]</root>
Note that the internal link is not processed at all. The preprocessor avoids any work that can be done at a later stage (and with good reason), so its only real job is to create XML elements for templates and a couple of other things. Those things, the base node types, are (full list): template, tplarg, comment, ext, ignore, h.
If you have ever worked with wikitext, you already know which markup corresponds to these base nodes. Just in case, here it is:
- template = double curly braces {{...}}
- tplarg = triple curly braces {{{...}}}
- comment = an HTML comment of any kind
- ext = anything that needs to be handled by a separate extension
- ignore = noinclude tags, as well as includeonly tags and the content inside them
- h = section headings
That's all. Everything else is ignored and returned to the parser as raw wikitext.
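To make this concrete, here is a small sketch of how the DOM preprocessor can be invoked directly on wikitext containing several of the constructs listed above. It assumes you already have an initialized Parser object in $parser; the bootstrap code and exact method behaviour vary between MediaWiki versions.

// A sketch, not production code: feed wikitext to the DOM preprocessor and dump the XML it builds.
$wikitext = "== Heading ==\n{{foo|bar}} and {{{arg}}} <!-- a note -->";
$preprocessor = $parser->getPreprocessor();           // Preprocessor_DOM (or Preprocessor_Hash)
$root = $preprocessor->preprocessToObj( $wikitext );  // returns a PPNode_DOM wrapping the <root> element
echo $root->__toString();                             // roughly: <root><h>...</h><template>...</template><tplarg>...</tplarg><comment>...</comment></root>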
How the preprocessor works
There is nothing special here, but it is worth saying a few words. To get the XML representation we need, the preprocessor walks the wikitext in a loop with as many iterations as there are characters in the text. There is no other way to correctly handle recursive templates, which the syntax allows to appear in the text in any form whatsoever. So if a Wikipedia article contains 40,000 characters, the loop will likely run 40,000 iterations. Now you see why speed matters so much to the parser.
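To give a feel for what that loop does, here is a deliberately tiny, hypothetical sketch that recognizes only double curly braces; the real loop in Preprocessor_DOM.php tracks far more state (via PPDStack and friends) and handles comments, headings, pipes, triple braces and unbalanced markup.

// A toy, hypothetical illustration of the character-by-character scan.
// It only recognizes {{...}}; everything else is copied through untouched.
function toyPreprocess( $text ) {
    $out = '';
    $stack = array();                 // a stand-in for PPDStack: remembers open "{{"
    $i = 0;
    $len = strlen( $text );
    while ( $i < $len ) {             // one iteration per character, as described above
        if ( substr( $text, $i, 2 ) === '{{' ) {
            $stack[] = $i;            // the real code can backtrack on unbalanced braces; this toy cannot
            $out .= '<template><title>';
            $i += 2;
        } elseif ( substr( $text, $i, 2 ) === '}}' && $stack ) {
            array_pop( $stack );
            $out .= '</title></template>';
            $i += 2;
        } else {
            $out .= htmlspecialchars( $text[$i], ENT_NOQUOTES );
            $i += 1;
        }
    }
    return '<root>' . $out . '</root>';
}

echo toyPreprocess( '{{mytemplate}} this is a [[test]]' );
// prints: <root><template><title>mytemplate</title></template> this is a [[test]]</root>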
The parsing itself
Let's skip the rest of the preprocessor details and the extra classes used to generate the XML, and turn our attention to the parser itself, looking at the typical path through it when you open a Wikipedia article. Keep in mind, though, that wiki pages are cached in every possible way, so it is unlikely that every click on a page actually makes the parser parse it.
Here is a typical parser call tree for rendering the current version of a page, starting from the Article object:
01. Article->view
02. -- Article->getContent
03. ---- Article->loadContent
04. ------ Article->fetchContent: returns the wikitext fetched from the database
05. -- Article->outputWikiText: prepares for the parse
06. ---- Parser->parse
07. ------ Parser->internalParse
08. -------- Parser->replaceVariables
09. ---------- Parser->preprocessToDom
10. ------------ Preprocessor->preprocessToObj
11. ---------- Frame->expand
12. -------- Parser->doTableStuff
13. -------- Parser->replaceInternalLinks
14. -------- Parser->replaceExternalLinks
15. ------ Parser->doBlockLevels
16. ------ Parser->replaceLinkHolders
Let's take a look at these functions. Again, these are only the main functions, not everything that gets called in this scenario. Items 2-4 retrieve the article's wikitext from the database. That text is passed to outputWikiText, which prepares it for the Parser::parse() call.
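For orientation, steps 5-6 reduce to something like the following (a sketch only; the exact Parser::parse() signature and the way the Title and ParserOptions objects are obtained have shifted between MediaWiki versions):

// Sketch of the parser entry point from steps 5-6.
// Assumes $wikitext, $title (a Title object) and $user are already available.
$parser  = new Parser();
$options = ParserOptions::newFromUser( $user );           // per-user rendering preferences
$output  = $parser->parse( $wikitext, $title, $options ); // returns a ParserOutput object
echo $output->getText();                                  // the final HTML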
Things get interesting again at items 8-11. Inside the replaceVariables function the text is converted into its DOM representation: a loop over every character of the article looks for the opening and closing markers of templates, subtemplates and the other nodes mentioned above.
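Simplified, the heart of those steps inside the Parser class looks roughly like this (a sketch based on the structure described here, not a verbatim excerpt of the source):

// Rough shape of Parser::replaceVariables() in steps 8-11 (simplified sketch).
$dom   = $this->preprocessToDom( $text );          // steps 9-10: wikitext -> XML/DOM tree
$frame = $this->getPreprocessor()->newFrame();     // a PPFrame_DOM that will hold template arguments
$text  = $frame->expand( $dom );                   // step 11: recursively expand the tree back into text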
Item 11 is an interesting step that I will skip for now, because it requires some knowledge of the other classes in the Preprocessor_DOM.php file (mentioned above). expand() is a very important function that does a lot of things (including calling itself recursively), but suffice it to say that it does the actual work of extracting text from the DOM nodes (remember that templates can be nested, so you do not always get the final text of each included article right away) and turning it into valid HTML in which all wiki markup has been expanded, except for three kinds: tables, links and lists. So for our example "{{mytemplate}} this is a [[test]]", expand() will return text of the form:
“I have included [[text]] from my template, this is a [[test]]” (assuming {{mytemplate}} itself contains the text “I have included [[text]] from my template”).
As you can see from this simple example, at this stage everything has been expanded except tables, links and lists.
Links are a special case.
Yes, links get their own section. Not only because they are perhaps the most significant part of what makes a wiki a wiki (besides the ability to edit it), but also because in the parser code they are handled quite differently from the rest of the markup. What makes them special is that they are processed in two stages: in the first stage each link is assigned a unique id, and in the second stage valid HTML is inserted in place of each link holder. In our example, this is the result after the first stage:
“I have included <!--LINK 0--> from my template, this is a <!--LINK 1-->”.
As you might guess, there is also an array that maps each LINK id back to the text of its link; it is a Parser class member named mLinkHolders. Besides this mapping, the variable also stores a Title object for each link.
So in the second stage of link processing we simply use this array for a search and replace. And that's it! We send the finished text out the door!
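As a hypothetical, stripped-down illustration of the two stages (the real code lives in Parser::replaceInternalLinks(), the mLinkHolders array and Parser::replaceLinkHolders(), and does far more, including building Title objects and checking whether the target pages exist):

// Toy two-stage link handling: replace links with placeholders, then swap in HTML.
function replaceInternalLinksToy( $text, &$holders ) {
    // Stage 1: every [[target]] or [[target|label]] becomes a <!--LINK #--> placeholder.
    return preg_replace_callback( '/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/', function ( $m ) use ( &$holders ) {
        $id = count( $holders );
        $holders[$id] = array( 'target' => $m[1], 'text' => isset( $m[2] ) ? $m[2] : $m[1] );
        return "<!--LINK $id-->";
    }, $text );
}

function replaceLinkHoldersToy( $text, $holders ) {
    // Stage 2: simple search and replace of each placeholder with its final HTML.
    foreach ( $holders as $id => $link ) {
        $html = '<a href="/wiki/' . urlencode( $link['target'] ) . '">' . htmlspecialchars( $link['text'] ) . '</a>';
        $text = str_replace( "<!--LINK $id-->", $html, $text );
    }
    return $text;
}

$holders = array();
$text = replaceInternalLinksToy( 'I have included [[text]] from my template, this is a [[test]]', $holders );
// "I have included <!--LINK 0--> from my template, this is a <!--LINK 1-->"
echo replaceLinkHoldersToy( $text, $holders );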
Next stage
In the next post I will focus in more detail on the preprocessor and the classes in the Preprocessor_DOM.php file, namely how they are used to build the original XML DOM tree. I will also describe how I used them to cache infoboxes in the Unbox extension.