Hello!
We continue the series of articles on the development of a browser engine.
In this article I will tell you how to create the fastest HTML parser with a DOM. We will look at the HTML specification and at why it is bad for performance and resource consumption when parsing HTML.
I gave a talk on this topic at the last HighLoad++. Not everyone can attend the conference, and besides, the article contains more details.
I assume that the reader has a basic knowledge of HTML: tags, nodes, elements, namespaces.
Before we start influencing the implementation of an HTML parser in any way, we need to understand which HTML specification to trust.
There are two HTML specifications: WHATWG and W3C.
Naturally, the choice fell on the industry leader, WHATWG: a living standard maintained by large companies, each of which has its own browser or browser engine.
UPDATE: Unfortunately, the links to the specifications above do not open from Russia. Apparently, an "echo of the war" with Telegram.
The process of building an HTML tree can be divided into four parts:
- decoder
- preprocessing
- tokenizer
- tree building
Consider each stage separately.
The tokenizer accepts Unicode code points as input. Accordingly, we need to convert the incoming byte stream into Unicode characters. For this, the Encoding specification is used.
If we have HTML with an unknown encoding (no HTTP header), then we need to determine the encoding before decoding. To do this, we use the encoding sniffing algorithm.
Very briefly, the essence of the algorithm is that we wait 500ms, or for the first 1024 bytes of the byte stream, and then run an algorithm that tries to find a <meta> tag with the http-equiv, content or charset attributes and tries to understand which encoding the HTML developer specified.
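The real prescan in the specification is more involved (it actually walks through the markup and parses the <meta> element and its http-equiv/content/charset attributes), but a minimal sketch of the idea, with a hypothetical helper and a POSIX strncasecmp, might look like this:

#include <stddef.h>
#include <strings.h>
#include <ctype.h>

/* A very rough sketch, not the full sniffing algorithm from the specification:
 * look through the first 1024 bytes for "charset=" and return a pointer to
 * the encoding label, or NULL if nothing was found. */
static const char *sniff_charset(const char *data, size_t len, size_t *label_len) {
    size_t limit = len < 1024 ? len : 1024;

    for (size_t i = 0; i + 8 <= limit; i++) {
        if (strncasecmp(&data[i], "charset=", 8) != 0) {
            continue;
        }

        size_t begin = i + 8;

        /* The label may be wrapped in quotes: charset="utf-8". */
        if (begin < limit && (data[begin] == '"' || data[begin] == '\'')) {
            begin++;
        }

        size_t end = begin;
        while (end < limit && (isalnum((unsigned char) data[end])
                               || data[end] == '-' || data[end] == '_')) {
            end++;
        }

        *label_len = end - begin;
        return &data[begin];
    }

    return NULL;
}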
The Encoding specification specifies the minimum set of encodings supported by the browser engine (21 in total): UTF-8, ISO-8859-2, ISO-8859-7, ISO-8859-8, windows-874, windows-1250, windows-1251, windows-1252, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, gb18030, Big5, ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE, UTF-16LE and x-user-defined.
After we have decoded the bytes into Unicode characters, we need to perform a "sweep". Namely, replace all carriage returns (\r) followed by a line feed (\n) with a carriage return (\r). Then, replace all carriage returns with a line feed (\n).
This is how the specification describes it. That is, \r\n => \r, \r => \n.
But in reality nobody does it in two passes. It is done more simply: if we encounter a carriage return character (\r), we look whether a line feed character (\n) follows. If it does, we replace both characters with a single line feed (\n); if not, we replace just the carriage return (\r) with a line feed (\n).
That completes the preliminary data processing. Yes, all we need to do is get rid of the carriage return characters so that they never reach the tokenizer. The tokenizer does not expect them and does not know what to do with them.
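A minimal sketch of this normalization in C (not Lexbor's actual code, just the idea described above) could look like this:

#include <stddef.h>

/* Replace "\r\n" and lone "\r" with "\n" in place.
 * Returns the new length of the buffer. */
size_t html_normalize_newlines(char *data, size_t len) {
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        if (data[i] == '\r') {
            /* Swallow the '\n' that follows a '\r', if any. */
            if (i + 1 < len && data[i + 1] == '\n') {
                i++;
            }
            data[out++] = '\n';
        }
        else {
            data[out++] = data[i];
        }
    }

    return out;
}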
So that no further questions arise, it is worth explaining right away what a parse error is.
In fact, it is nothing terrible. It sounds menacing, but it is really just a warning that we expected one thing and got another.
A parse error does not stop data processing or tree building; it is a message signalling that the HTML is not valid.
A parse error can be reported for surrogate pairs, \0, a tag in the wrong place, an invalid <!DOCTYPE> and much more.
By the way, some parse errors do have consequences. For example, if you specify a "bad" <!DOCTYPE>, the HTML tree will be marked as QUIRKS and the logic of some DOM functions will change.
As mentioned earlier, the tokenizer accepts Unicode characters as input. It is a state machine with 80 states; each state has conditions on the incoming Unicode characters. Depending on the incoming character, the tokenizer may:
- switch to another state;
- emit a token;
- do nothing and wait for the next character.
The tokenizer creates six kinds of tokens: DOCTYPE, Start Tag, End Tag, Comment, Character, End-Of-File. They are fed into the tree building stage.
It is noteworthy that the tokenizer itself only uses part of its states, say about 40% (a number taken out of thin air, just for illustration). "What about the rest?" you may ask. The tree building stage knows about the remaining 60%.
This is done in order to parse tags such as <textarea>, <style>, <script>, <title> and so on properly. That is, tags in which we usually do not expect any other tags, only their own closing tag.
For example, the <title> tag cannot contain other tags. Any tags inside <title> are treated as text until the closing </title> tag is met.
Why is it done this way? After all, we could simply tell the tokenizer: if we meet a <title> tag, take "the path we need". And that would be true if it were not for namespaces! Yes, the namespace affects the behaviour of the tree building stage, which in turn changes the behaviour of the tokenizer.
For example, consider the behaviour of the <title> tag in the HTML and SVG namespaces:
HTML
<title><span></span></title>
Result of building a tree:
<title> "<span></span>"
SVG
<svg><title><span></span></title></svg>
Result of building a tree:
<svg> <title> <span> ""
We see that in the first case (HTML namespace) the <span> tag became text; a span element was not created. In the second case (SVG namespace) an element was created from the <span> tag. That is, tags behave differently depending on the namespace.
But that is not all. The icing on the cake of this "celebration of life" is the fact that the tokenizer itself needs to know which namespace the tree building stage is currently in. And this is needed solely to handle CDATA properly.
Consider two examples with CDATA, in the two namespaces:
HTML
<div><![CDATA[ ]]></div>
Result of building a tree:
<div> <!--[CDATA[ ]]-->
SVG
<div><svg><![CDATA[ ]]></svg></div>
Result of building a tree:
<div> <svg> " "
In the first case (HTML namespace) the tokenizer took the CDATA for a comment. In the second case the tokenizer parsed the CDATA structure and extracted the data from it. In general, the rule is this: if we meet CDATA outside the HTML namespace, we parse it; otherwise we treat it as a comment.
This is the result of the hard coupling between the tokenizer and tree building: the tokenizer must know which namespace the tree building stage is currently in, and the tree building stage can change the states of the tokenizer.
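In code this coupling boils down to something like the following fragment (hypothetical names, not the Lexbor API): when a <title> start tag arrives in the HTML namespace, the tree building stage switches the tokenizer into the RCDATA state, so everything up to </title> comes back as text; in the SVG namespace no switch happens and <span> becomes an element.

/* Sketch: the tree building stage drives the tokenizer (hypothetical names). */
if (token->tag_id == TAG_TITLE) {
    if (current_namespace == NAMESPACE_HTML) {
        /* Everything up to </title> will come back as text. */
        tokenizer_set_state(tokenizer, STATE_RCDATA);
    }
    /* In the SVG namespace the tokenizer keeps parsing tags as usual. */
}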
Below, all six types of tokens created by the tokenizer are considered. It is worth noting that all tokens come with prepared data, that is, data already processed and "ready to use". This means that all named character references, such as &copy;, are already converted into Unicode characters.
Unlike the other tokens, the DOCTYPE token has its own structure. The token contains: a name, a public identifier and a system identifier.
In modern HTML, the only valid / valid DOCTYPE should look like this:
<!DOCTYPE html>
Any other <!DOCTYPE> will be considered a parse error.
The opening tag can contain: a tag name, attributes and the self-closing flag.
For example,
<div key="value" />
An opening tag may carry the self-closing flag. This flag does not affect the closing of the tag in any way, but it may cause a parse error for non-void elements.
The closing tag has all the properties of the opening tag token, but with a slash / before the tag name.
</div key="value" />
A closing tag may carry the self-closing flag, which will cause a parse error. Attributes on a closing tag also cause a parse error; they are parsed correctly but are thrown away at the tree building stage.
The comment token contains the entire text of the comment. That is, the comment is copied verbatim from the stream into the token.
For example:
<!-- -->
Perhaps the most interesting token: the character token. It is a Unicode character token and can contain one (and only one) character.
A token is created for every character in the HTML and sent to the tree building stage. This is very expensive.
Let's take a look at how this works.
Take HTML data:
<span> ! ®</span>
How many tokens do you think will be created for this example? The answer: 22.
Consider the list of generated tokens:
Start tag token: <span>
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token:
Character token: !
Character token:
Character token: ®
End tag token: </span>
End-of-file token
Not very encouraging, right? But, of course, many HTML parser authors actually have only one token during processing: it is reused, overwritten with new data each time.
Let's go further and answer the question: why is it done this way? Why not take this text in one piece? The answer lies in the tree building stage.
The tokenizer is useless without the HTML tree building stage. It is at the tree building stage that the text is glued together, under various conditions.
The conditions are roughly these:
- If a U+0000 (NULL) character token arrives, we raise a parse error and ignore the token.
- If a U+0009 (CHARACTER TABULATION), U+000A (LINE FEED (LF)), U+000C (FORM FEED (FF)) or U+0020 (SPACE) character token arrives, we call the algorithm to reconstruct the active formatting elements and insert the token into the tree.
- Any other character token is added to the tree according to the algorithm: if a Text node already exists immediately before the insertion position, the character is appended to it; otherwise a new Text node is created.
This behaviour creates many problems: the need to create a token for every character and send it off for analysis at the tree building stage. We do not know the size of the resulting text node, so we have to either allocate a lot of memory in advance or do reallocations. All of this is extremely expensive in memory and in time.
A simple and clear token: the data has ended, and we report this to the tree building stage.
Tree building is a state machine with 23 states and a multitude of conditions on tokens (tags, text). The tree building stage is the largest; it occupies a significant part of the specification and is also capable of causing lethargic sleep and irritation.
Everything is very simple. Tokens are accepted as input, and depending on the token the tree building state is switched. At the output we have a real DOM.
The following problems look quite obvious:
Character copying
Each tokenizer state receives one character at a time, which it copies or converts when needed: tag names, attributes, comments, characters.
This is very wasteful both in memory and in time. We are forced to pre-allocate an unknown amount of memory for each attribute, tag name, comment, and so on. This leads to reallocations, and reallocations lead to lost time.
And if we imagine that HTML contains 1000 tags, and each tag has at least one attribute, then we get a hellishly slow parser.
Character token
The second problem is the character token. It turns out that we create a token for every character and give it to the tree building. Tree building does not know how many of these tokens there will be and cannot allocate memory for the required number of characters right away. Hence, again, reallocations plus constant checks for the presence of a text node in the current state of the tree.
Monolithic system
The big problem is that everything depends on everything. The tokenizer depends on the state of the tree building, and the tree building can control the tokenizer. And namespaces are to blame for all of it.
Next, I will describe the implementation of the HTML parser in my Lexbor project, as well as solving all the voiced problems.
We remove the preliminary data processing. We teach the tokenizer to treat the carriage return (\r) as a whitespace character. It will thus be passed on to the tree building stage, where we deal with it.
With a flick of the wrist, we unify all tokens. We will have one token for everything. In general, there will be only one token in the entire parsing process.
Our unified token will contain the following fields: Tag ID, Begin/End (pointers to the data in the incoming buffer), Attributes and Flags; a rough sketch is shown below.
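In C, such a token might look roughly like this (a sketch with illustrative names and types, not the exact Lexbor structure):

#include <stddef.h>

typedef size_t tag_id_t;                  /* identifier from the tag enumeration */
typedef struct token_attr token_attr_t;   /* an attribute is also just begin/end pointers */

typedef struct {
    const unsigned char *begin;   /* start of the raw token data in the incoming buffer */
    const unsigned char *end;     /* end of the raw token data */

    token_attr_t *attr_first;     /* list of attributes */
    token_attr_t *attr_last;

    tag_id_t      tag_id;         /* numeric tag identifier instead of a tag name */
    unsigned int  type;           /* Flags bitmap: open/close/text/... */
} html_token_t;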
We will not work with the textual representation of the tag name. We translate everything into numbers. The numbers are easy to compare, easier to work with.
We create a static hash table of all known tags and an enumeration (enum) of all known tags. That is, we rigidly assign an identifier to each tag. Accordingly, in the hash table the key is the tag name and the value is the identifier from the enumeration.
For example:
typedef enum {
    LXB_TAG__UNDEF       = 0x0000,
    LXB_TAG__END_OF_FILE = 0x0001,
    LXB_TAG__TEXT        = 0x0002,
    LXB_TAG__DOCUMENT    = 0x0003,
    LXB_TAG__EM_COMMENT  = 0x0004,
    LXB_TAG__EM_DOCTYPE  = 0x0005,
    LXB_TAG_A            = 0x0006,
    LXB_TAG_ABBR         = 0x0007,
    LXB_TAG_ACRONYM      = 0x0008,
    LXB_TAG_ADDRESS      = 0x0009,
    LXB_TAG_ALTGLYPH     = 0x000a,
    /* ... */
}
From the example you can see that we created tags for the END-OF-FILE token, for text and for the document. All of this is for later convenience. Lifting the curtain a little, I will say that we will keep the Tag ID in the DOM Node Interface. This is done so as not to make two comparisons, one on the node type and one on the element: if we need a DIV element, we do a single check in the node:
if (node->tag_id == LXB_TAG_DIV) {
    /* Best code */
}
But, you can certainly do this:
if (node->type == LXB_DOM_NODE_TYPE_ELEMENT && node->tag_id == LXB_TAG_DIV) {
    /* Oh, code */
}
The two underscores in LXB_TAG__ are needed to separate common tags from system ones. In other words, a user can create a tag named text or end-of-file, and if we later search by tag name, no errors will occur. All system tags start with #.
Still, a node can store a textual representation of its tag name. For 98.99999% of nodes this parameter will be NULL. In some namespaces we need to prefix a tag or keep its name in a fixed letter case, for example baseProfile in the SVG namespace.
The logic here is simple: if we have a tag whose letter case must be preserved, we still work with its Tag ID as usual, but additionally keep the correct textual name in the node.
Custom tags (custom elements)
A developer can create any tags in HTML. Since the static hash table contains only the tags we know about, while the user can create anything, we also need a dynamic hash table.
Everything looks very simple. When a tag arrives, we check whether it is in the static hash table. If it is not there, we look in the dynamic one; if it is not there either, we increment the identifier counter and add the tag to the dynamic table, roughly as in the sketch below.
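A self-contained sketch of this lookup (a plain array instead of the real hash tables, identifiers taken from the enumeration above; not the actual Lexbor code):

#include <string.h>
#include <stdlib.h>

typedef size_t tag_id_t;

typedef struct {
    const char *name;
    tag_id_t    id;
} tag_entry_t;

/* "Static table" of known tags (in a real parser this is a hash table;
 * a plain array is enough to show the idea). */
static const tag_entry_t known_tags[] = {
    {"a", 0x0006}, {"abbr", 0x0007}, {"address", 0x0009},
};

/* Dynamic table of custom tags seen in this document. */
typedef struct {
    tag_entry_t *entries;
    size_t       length;
    tag_id_t     next_id;     /* next free identifier */
} custom_tags_t;

tag_id_t tag_id_by_name(custom_tags_t *custom, const char *name) {
    /* 1. Look in the static table of known tags. */
    for (size_t i = 0; i < sizeof(known_tags) / sizeof(known_tags[0]); i++) {
        if (strcmp(known_tags[i].name, name) == 0) {
            return known_tags[i].id;
        }
    }

    /* 2. Look in the dynamic table of custom tags. */
    for (size_t i = 0; i < custom->length; i++) {
        if (strcmp(custom->entries[i].name, name) == 0) {
            return custom->entries[i].id;
        }
    }

    /* 3. Unknown tag: register it and give it the next identifier
     *    (error handling omitted for brevity). */
    custom->entries = realloc(custom->entries,
                              (custom->length + 1) * sizeof(tag_entry_t));
    custom->entries[custom->length].name = name;
    custom->entries[custom->length].id = custom->next_id;
    custom->length++;

    return custom->next_id++;
}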
Everything described happens at the tokenizer stage. Inside the tokenizer and after it, comparisons are done by Tag ID (with rare exceptions).
Now the tokenizer does no data processing at all. We do not copy or convert anything; we simply take pointers to the beginning and end of the data.
All data processing, such as character references, takes place at the tree building stage.
This way we also know the size of the data for the subsequent memory allocation.
Attributes are handled just as simply. We do not copy anything; we simply store pointers to the beginning/end of the attribute name and value. All conversions happen at the moment the tree is built.
Since our token is now unified, we need to somehow inform the tree building about the token type. For this we use the bitmap field Flags.
The field can contain the following values:
enum {
    LXB_HTML_TOKEN_TYPE_OPEN         = 0x0000,
    LXB_HTML_TOKEN_TYPE_CLOSE        = 0x0001,
    LXB_HTML_TOKEN_TYPE_CLOSE_SELF   = 0x0002,
    LXB_HTML_TOKEN_TYPE_TEXT         = 0x0004,
    LXB_HTML_TOKEN_TYPE_DATA         = 0x0008,
    LXB_HTML_TOKEN_TYPE_RCDATA       = 0x0010,
    LXB_HTML_TOKEN_TYPE_CDATA        = 0x0020,
    LXB_HTML_TOKEN_TYPE_NULL         = 0x0040,
    LXB_HTML_TOKEN_TYPE_FORCE_QUIRKS = 0x0080,
    LXB_HTML_TOKEN_TYPE_DONE         = 0x0100
};
Besides the token type, opening or closing, there are values for the data converter. Only the tokenizer knows how to convert the data correctly; accordingly, the tokenizer marks in the token how its data should be processed.
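For example, the tree building stage can then check a token with simple bit operations; a hypothetical fragment, assuming the enum above:

if (token->type & LXB_HTML_TOKEN_TYPE_CLOSE) {
    /* A closing tag: close the current element. */
}
else if (token->type & LXB_HTML_TOKEN_TYPE_TEXT) {
    /* Text between tags: convert it as the flags prescribe. */
    if (token->type & LXB_HTML_TOKEN_TYPE_NULL) {
        /* The data contains '\0' characters that must be dropped or replaced. */
    }
}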
From what was described earlier you can see that the character token has disappeared. Yes, now we have a new token type: LXB_HTML_TOKEN_TYPE_TEXT. Now we create one token for all the text between tags and mark in it how that text should be processed later.
Because of this, we have to change the conditions in the tree building. We need to teach it to work not with character tokens but with text tokens: convert, remove unnecessary characters, skip whitespace, and so on.
There is nothing complicated about it: the changes at the tree building stage are minimal. But the tokenizer now does not follow the specification at all, literally. We do not need it to, though; that is fine. Our task is to get an HTML/DOM tree that fully conforms to the specification.
To ensure high processing speed in the tokenizer, we add an iterator to every state. According to the specification, every state receives one character at a time and makes decisions based on it. The truth is, this is very expensive.
For example, to move from the ATTRIBUTE_NAME state to the ATTRIBUTE_VALUE state we need to find a whitespace character in the attribute name, which marks its end. According to the specification we must feed characters one by one into the ATTRIBUTE_NAME state until a whitespace character is met, and only then does the state switch to another one. This is very expensive; it is usually implemented through a function call for every character, or a callback like "tkz->next_code_point()".
Instead, we add a loop to the ATTRIBUTE_NAME state and pass the incoming buffer to it as a whole. In the loop we look for the characters on which we need to switch, and then continue working in the next state. This gives us a big win, even from compiler optimizations alone.
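A simplified, self-contained illustration of such a loop for the ATTRIBUTE_NAME state (not the actual Lexbor code): instead of looking at one character per function call, we scan the whole buffer for the character that ends the attribute name.

#include <stddef.h>

/* Scan the buffer for the end of an attribute name. Returns the number of
 * name bytes consumed; *terminator receives the byte that stopped the scan
 * (0 if the buffer ended first). */
static size_t scan_attribute_name(const unsigned char *data, size_t len,
                                  unsigned char *terminator)
{
    size_t i = 0;
    *terminator = 0;

    for (; i < len; i++) {
        switch (data[i]) {
            case 0x09:  /* TAB   */
            case 0x0A:  /* LF    */
            case 0x0C:  /* FF    */
            case 0x20:  /* SPACE */
            case '=':
            case '>':
            case '/':
                /* The name has ended; the caller switches to the next
                 * tokenizer state depending on this character. */
                *terminator = data[i];
                return i;

            default:
                break;
        }
    }

    /* Buffer exhausted: stay in the ATTRIBUTE_NAME state and wait for
     * the next chunk. */
    return i;
}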
But! The worst thing is that by doing so we broke out-of-the-box support for chunks. Thanks to per-character processing in every state of the tokenizer we had chunk support, and now we have broken it.
How to fix it? How to implement chunk support? It's simple: we introduce the concept of incoming buffers (Incoming Buffer).
HTML is often parsed in chunks, for example when we receive data over the network. In order not to sit idle waiting for the remaining data, we can send the already received data for processing/parsing. Naturally, the data can be "torn" anywhere. For example, we have two buffers:
The first
<div clas
Second
s="oh-no-oh-no">
Since we copy nothing at the tokenization stage and only take pointers to the beginning and end of the data, we have a problem: pointers into different user buffers. And given that developers often reuse the same buffer for data, we can end up with a pointer to the beginning of data that no longer exists.
The way out is the Incoming Buffer entity. Every chunk the parser receives is registered, so at any moment we know which buffer a pointer refers to and whether that buffer is still available. If the data of a token turns out to be split between buffers (the beginning in one, the end in another), or the user's buffer cannot be relied upon, the needed piece is copied into our own buffer; in all other cases we keep working with pointers into the incoming data. Such copying happens rarely and is cheap.

Tree building stage

As mentioned earlier, the tree building stage now receives text tokens instead of character tokens, so its conditions have to be adjusted. Consider, for example, the processing of a character token in the "in body" insertion mode.
Here is roughly what it looks like according to the specification:
tree_build_in_body_character(token) {
    if (token.code_point == '\0') {
        /* Parse error, ignore token */
    }
    else if (token.code_point == whitespaces) {
        /* Insert element */
    }
    /* ... */
}
Lexbor HTML
tree_build_in_body_character(token) {
    lexbor_str_t str = {0};
    lxb_html_parser_char_t pc = {0};

    pc.drop_null = true;

    tree->status = lxb_html_token_parse_data(token, &pc, &str,
                                             tree->document->mem->text);
    if (token->type & LXB_HTML_TOKEN_TYPE_NULL) {
        /* Parse error */
    }

    /* Insert element if not empty */
}
As you can see, instead of checking every character we set up flags and process the data of the whole token in one call. The main fields of the pc object:
pc.replace_null /* Replace every '\0' with REPLACEMENT CHARACTER (U+FFFD) */
pc.drop_null    /* Drop every '\0' */
pc.is_attribute /* The data is an attribute value and must be processed accordingly */
pc.state        /* Processing stage: with or without character reference conversion, and so on */
Different insertion modes of the tree building treat the data differently: somewhere '\0' must be dropped, somewhere replaced with the REPLACEMENT CHARACTER, somewhere character references must be converted and somewhere not. Each mode simply sets the flags it needs before the conversion is called.
One more nuance. For example, before the <head> element whitespace character tokens are, according to the specification, simply discarded: "ignore the token". With a text token we skip the leading whitespace and, if nothing remains, do not create a text node at all.
By the way, reading the HTML specification (the tree building stage) you can stumble upon the sarcasm tag. See for yourself:

An end tag whose tag name is "sarcasm"
Take a deep breath, then act as described in the "any other end tag" entry below.

The authors of the specification clearly have a sense of humour.
Besides the parser itself, the project implements the DOM and HTML Interfaces, so the result of parsing is a full-fledged DOM/HTML tree.

A few numbers: on an i7 from 2012, using a single core, the parser processes around 235MB of HTML per second (Amazon pages).

The source code of the HTML parser described here is available in the Lexbor project. Right now I am busy with CSS (its Grammar and a parser generated from that Grammar); together with the HTML parser it will become the basis of the future "engine". I expect the first results in about 6-8 months.

The project is developed by me alone (for now), and anyone who wants to take part is welcome.
Thank you for your attention!
Source: https://habr.com/ru/post/430736/