Modest: developing an open-source HTML rendering engine in “bare” C

Hello! My name is Alexander Borisov, and I am developing Modest, an open-source HTML rendering engine written in “bare” C with no external dependencies (hereinafter "the engine"). To be clear about what “no external dependencies” means: all the code is written from scratch, and none of it is borrowed from anywhere.

A lot of time has passed since my last publication. Much has changed since then, and I want to share what has been achieved in development.

About the project

The main idea of the project is simplicity, which means:

- Ability to compile / install on any device where there is C
- Speed
- The minimum possible resource consumption

One of the end products of this work should be a truly fast, lightweight, full-featured browser.

Taking each point in order

Ability to compile / install on any device where there is C

The idea is not only to be able to compile / install the engine on any piece of hardware (C is available almost everywhere), but also to be able to connect the engine to another programming language and expose the engine's full API to that language. Simply put, it should be easy to write a binding for your programming language.

What will bindings for other languages give us?

A simple and clear engine API will let us work directly with the HTML Tree, CSS, and the Render Tree / Layers from our favorite programming language, without using JavaScript.

In practice, this means the following:

- Fast access to, creation of, and modification of elements / layers.
- Easy creation of interfaces, games, and applications in a familiar programming language, using all the features of HTML / DOM Events / CSS.

Moreover, I went further and tested the following.

I should say up front that this is just an experiment. Why you would need it in real life may not be obvious, but it can be done.

1) Take the existing Perl binding and modify it.
2) Add a Perl type to script processing: <script type="Perl">
3) Parse the HTML in Perl.
4) When the parser encounters a script tag of the Perl type, it executes that code in the current interpreter.

The Perl script is as follows:

    use utf8;
    use strict;
    use warnings;

    use HTML::MyHTML::Fun;

    my $html = q~
    <div>
        <span>text</span>
    </div>
    <script type="Perl">
        # $MyHTML_TREE global var
        my $nodes = $MyHTML_TREE->get_elements_by_tag_name("span");

        foreach my $node (@$nodes) {
            $node->delete($MyHTML_TREE);
        }
    </script>
    <span>footer</span>
    ~;

    # parse HTML
    my $myhtml = HTML::MyHTML->new(MyHTML_OPTIONS_DEFAULT, 1);
    my $tree = $myhtml->new_tree();

    $myhtml->parse($tree, MyHTML_ENCODING_UTF_8, $html);

Result of processing:

    <html>
        <head>
        <body>
            <div>
            <script type="Perl">
                <-text>: ...
            <span>
                <-text>: footer
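The dispatch step of the experiment above (pick an interpreter by the script's type attribute and hand it the script text) can be sketched in C roughly as follows. This is only an illustration under assumptions: script_binding_t, dispatch_script, and count_runs are hypothetical names invented here, not part of the MyHTML API or of the Perl binding.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical callback type: receives the text of a <script> element. */
typedef void (*script_handler)(const char *source, void *ctx);

/* Hypothetical table entry mapping a script type to its handler. */
typedef struct {
    const char *type;
    script_handler handler;
} script_binding_t;

/* ASCII case-insensitive string equality. */
static int ascii_ieq(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Runs the handler registered for `type`.
 * Returns 1 if a handler ran, 0 if the type is unknown. */
int dispatch_script(const script_binding_t *bindings, size_t count,
                    const char *type, const char *source, void *ctx)
{
    assert(type != NULL && source != NULL);

    for (size_t i = 0; i < count; i++) {
        if (ascii_ieq(bindings[i].type, type)) {
            bindings[i].handler(source, ctx);
            return 1;
        }
    }
    return 0;
}

/* Example handler that only counts how many scripts it was given. */
void count_runs(const char *source, void *ctx)
{
    (void)source;
    (*(int *)ctx)++;
}
```

A real binding would pass the script text to the embedding language's interpreter inside the handler instead of counting calls.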

Speed

Nobody likes to wait, me least of all. One of the key goals of development is to make every stage of HTML / CSS / Render processing fast. For example, the average time to process a typical HTML page is 0.001 sec., that is, 1 ms, which is 1000 pages per second. Parsing the Bootstrap CSS file together with its selectors takes about 1.5 ms. At the moment we have the fastest full-fledged HTML and CSS parser. And this is not the limit.

Minimum possible resource consumption

While speed is more or less straightforward, resource consumption is much more complicated. Specifications usually "advise / require" keeping everything in memory. More precisely, they are written as if everything is already at hand and already created.

How does this show up?

Take, for example, the CSS Syntax specification. It tells us to create a token for each character or sequence of characters and then arrange the tokens into groups. Put bluntly, we must create a token for every character not covered by the general tokenization rules (delim-token), as well as for each ";", ":", "(", ")", ",", and others; the full list of rules is given in the specification.

You will agree that creating a token for every comma or semicolon is rather wasteful. It is also worth noting that the tokenization rules contain conditions like this: given the current character, check whether the next N characters are equal to X; otherwise create Y.

Later, once all the tokens have been created, they need to be parsed, that is, groups have to be selected from the sequence of tokens. It is these groups that the CSS modules work with, and there are quite a few of those modules.

That's where the fun begins. We have to comply fully with the specification, yet avoid creating a token for every little thing. After a good deal of reading the specifications and thinking about how to avoid creating "extra" tokens and save memory, I arrived at the following conditions:

1) CSS parsing must support chunks, i.e. stream parsing. The specification does not require this, but it is important for further development.

Because of this condition, we do not know when the data stream will end; for us it is infinite. At any moment we are guaranteed to have only the current and all previous characters, with no idea of what comes next. Accordingly, we cannot evaluate a condition like “if character A has arrived, check whether an opening bracket follows”.

2) We do not create all the tokens. We create only one, which is overwritten again and again. At any given time there is exactly one token, the current one. Accordingly, we cannot look at the previous or the next token; they do not exist. At first this looked like a problem, since the specification has plenty of conditions like “if token X arrives, look at the following three tokens, and if they are not H, then Y”.
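To make condition 1 concrete, here is a minimal sketch of chunk-safe scanning. Instead of peeking at the next character, the scanner remembers what it has already seen in a state variable, so a "/*" or "*/" pair split across two chunks is still recognized. It only counts completed CSS comments and deliberately ignores strings and URLs. The names scanner_t, scanner_init, and scanner_feed are assumptions made for this illustration and are not the MyCSS API.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    ST_DATA,         /* outside a comment */
    ST_SLASH,        /* saw '/', the next character decides what it was */
    ST_COMMENT,      /* inside a comment */
    ST_COMMENT_STAR  /* inside a comment, saw '*', a '/' would close it */
} scan_state_t;

typedef struct {
    scan_state_t state;  /* survives between chunks */
    size_t comments;     /* number of completed comments seen so far */
} scanner_t;

void scanner_init(scanner_t *s)
{
    assert(s != NULL);
    s->state = ST_DATA;
    s->comments = 0;
}

/* Processes one chunk; may be called any number of times. */
void scanner_feed(scanner_t *s, const char *chunk, size_t len)
{
    assert(s != NULL && (chunk != NULL || len == 0));

    for (size_t i = 0; i < len; i++) {
        char c = chunk[i];

        switch (s->state) {
        case ST_DATA:
            if (c == '/')
                s->state = ST_SLASH;
            break;
        case ST_SLASH:
            if (c == '*')
                s->state = ST_COMMENT;
            else if (c != '/')
                s->state = ST_DATA;
            /* on another '/', stay here: it may still open a comment */
            break;
        case ST_COMMENT:
            if (c == '*')
                s->state = ST_COMMENT_STAR;
            break;
        case ST_COMMENT_STAR:
            if (c == '/') {
                s->comments++;
                s->state = ST_DATA;
            } else if (c != '*') {
                s->state = ST_COMMENT;
            }
            break;
        }
    }
}
```

Feeding the chunks "a /", "* x *", and "/ b /* y */" one after another reports two comments, even though the first comment's opening and closing markers are both split across chunk boundaries.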

All of the above is implemented in MyCSS. MyCSS already parses selectors and some CSS properties successfully while consuming a minimal amount of memory. And since we are not constantly requesting new "pieces" of memory, speed improves significantly as well.

On top of all this, the MyCSS parser has remained clear, flexible, and easy to extend with new modules.

By the way, MyHTML takes exactly the opposite approach: it emphasizes creating tokens and then working with them. This shows clearly that there is no “silver bullet” in writing an HTML renderer; every part needs its own approach. And, of course, none of this can be built without a full understanding of the specifications and of what they require.
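Condition 2, a single token that is overwritten instead of a stored token list, can be sketched like this. The tokenizer fills one token_t, hands it to a callback, and then reuses the same storage for the next token, so memory use does not grow with the input. For brevity the sketch works on a complete buffer and uses a much simpler token set than the CSS Syntax specification; all names here (token_t, tokenize, count_token) are hypothetical and not taken from MyCSS.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_IDENT, TOK_WHITESPACE, TOK_DELIM } token_type_t;

typedef struct {
    token_type_t type;
    const char *begin;  /* points into the input; nothing is copied */
    size_t length;
} token_t;

typedef void (*token_cb)(const token_t *tok, void *ctx);

static int is_ident_char(unsigned char c)
{
    return isalnum(c) || c == '-' || c == '_';
}

/* Splits `data` into tokens and reports each one through `cb`.
 * Only one token_t exists at any time; it is overwritten per token. */
void tokenize(const char *data, size_t len, token_cb cb, void *ctx)
{
    assert(cb != NULL && (data != NULL || len == 0));

    token_t tok;  /* the single, reused token */
    size_t i = 0;

    while (i < len) {
        size_t start = i;
        unsigned char c = (unsigned char)data[i];

        if (is_ident_char(c)) {
            tok.type = TOK_IDENT;
            while (i < len && is_ident_char((unsigned char)data[i]))
                i++;
        } else if (isspace(c)) {
            tok.type = TOK_WHITESPACE;
            while (i < len && isspace((unsigned char)data[i]))
                i++;
        } else {
            tok.type = TOK_DELIM;  /* any other single character */
            i++;
        }

        tok.begin = data + start;
        tok.length = i - start;
        cb(&tok, ctx);  /* the consumer must act now; the token is reused */
    }
}

/* Example consumer: counts tokens by type. */
typedef struct { size_t idents, whitespace, delims; } token_counts_t;

void count_token(const token_t *tok, void *ctx)
{
    token_counts_t *n = ctx;

    switch (tok->type) {
    case TOK_IDENT:      n->idents++;     break;
    case TOK_WHITESPACE: n->whitespace++; break;
    case TOK_DELIM:      n->delims++;     break;
    }
}
```

The trade-off is the one described above: the consumer cannot look at the previous or next token, so any rule that needs context has to keep that context in its own state.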

Current project structure

Currently the following parts of the project are implemented:

MyHTML - HTML parser
MyCSS - CSS parser (Selectors, Values, Namespace, Property)
MyFONT - parser for .otf and .ttf files. It retrieves glyph metrics: width, height, baseline, x-height, and others. Note that these are character sizes as browsers compute them. See an example.
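As background for the glyph metrics mentioned above: TrueType and OpenType fonts store metrics in font design units, and the font declares how many units make up one em. Converting a metric to pixels at a given font size is a simple scaling. The function below is a minimal sketch of that arithmetic; font_units_to_px is a name made up for this example and is not a MyFONT function.

```c
#include <assert.h>

/* Scales a metric from font design units to pixels.
 * value         - metric in font units (advance width, x-height, ...)
 * units_per_em  - the font's units-per-em value, must be non-zero
 * font_size_px  - the requested font size in pixels */
double font_units_to_px(int value, unsigned units_per_em, double font_size_px)
{
    assert(units_per_em > 0);
    return (double)value * font_size_px / (double)units_per_em;
}
```

For example, a glyph 1024 units wide in a font with 2048 units per em comes out 8 pixels wide at a 16 px font size.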

About selectors

I see no point in writing a separate article about selectors, but I do want to brag. You can already use selectors to find nodes in the tree:

 div > :nth-child(2n+1):not(:has(a))

or a comma-separated list:

 .header, :nth-child(2n+2 of div:not([id])) >> :not(:has(> [class ~= "bukabyaka" i]))

They work quite fast. On a typical Habr article, the first selector above runs in 0.00015 sec., that is, 0.15 ms, and finds 247 elements. This time includes parsing the CSS, parsing and creating the selectors, and searching the tree. You can create a selector in advance and reuse it, which reduces the time further.

Currently, all selectors from the Selectors Level 4 specification (Selectors 4) are supported except:

1) All pseudo-elements
2) :dir, :lang, :scope, "Time-dimensional Pseudo-classes", :drop
3) :nth-column, :nth-last-column

The first and second groups will be added as the engine develops. I have not gotten around to the third yet, but it will of course also be implemented in the near future.
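The :nth-child(2n+1) part of the first selector above uses the An+B notation: an element at 1-based position i among its siblings matches if i = A*n + B for some integer n >= 0. A minimal sketch of that check is shown below; nth_matches is a name chosen for this illustration and says nothing about how the engine implements it internally.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the 1-based sibling position `index` can be written
 * as a*n + b for some integer n >= 0. */
bool nth_matches(long a, long b, long index)
{
    assert(index >= 1);

    long diff = index - b;

    if (a == 0)
        return diff == 0;  /* plain :nth-child(b) */

    /* n = diff / a must be a whole, non-negative number */
    return diff % a == 0 && diff / a >= 0;
}
```

With a = 2 and b = 1 this accepts positions 1, 3, 5, and so on, which is why :nth-child(2n+1) selects the odd children; a negative a, as in -n+3, limits the match to the first few positions.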

Future of the project

It (the future) looks very bright. I work on the project during working hours: my employer has allowed me to spend all my work time on it, although of course, if I am asked to do something else, I have to break off.

I have now started building the rendering tree (Render Tree, Layers). That is, in the near future it will be possible to get computed metrics for HTML nodes, such as width, height, font-size, border-color, and others.

I have plenty of ideas, and even more “fuel” in me! Thanks for your attention!

PS: If anyone would like to help with or take part in the project, feel free to write to me by email.

References:

» Modest
» Modest Examples
» MyHTML Examples
» MyCSS Examples
» MyHTML Binding for Perl

Source: https://habr.com/ru/post/309756/
