
MyHTML: an HTML parser in plain C with POSIX Threads support

Hello!

As you may have guessed from the title, this post is about HTML parsing (hereinafter simply html).
Preamble

At some point I came up with an idea, call it "X", but implementing it required a computed DOM with all the styles and other goodies. Searching Google and Yandex turned up nothing good. There are all sorts of wrappers around WebKit, but they do not work on every platform and are heavily stripped down. There are also projects where WebKit is wrapped in some kind of frontend that you drive via JavaScript. I tried a few of them, but the results were dismal. The resource consumption alone was prohibitive.

Wishlist

What I wanted did not seem like much:
  1. An html renderer without heavy dependencies. Just the renderer; networking would be left to the user. In other words, the full computation of html up to the point of drawing it in a window.
  2. The ability to hook up a JavaScript engine through a wrapper
  3. The ability to easily write bindings for other programming languages


And I entered into an unequal battle!

I studied the existing HTML and CSS parsers.
As a back-end developer, I had not always been satisfied with the existing HTML parsers. They all fall roughly into three categories:
  1. Those that parse however they please, with only their own idea of how html should be tokenized
  2. Those that follow the specification after a fashion
  3. Those that parse in strict accordance with the specification

It would seem that the third category exists, so the topic could be closed. But no, and here is why: all existing parsers are built on the "parse and die" principle. You hand the program an entire html document, it returns the result, and no further manipulation is possible; the result is read-only. This greatly limits where such parsers can be used. To be fair, some projects push the DOM work up to a higher level: you parse with the parser and then work with the DOM through a wrapper in, say, Python, which is somewhat absurd.
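
To make the contrast with "parse and die" concrete, here is a minimal sketch of a tree that stays editable after it has been built. The `node_t` layout and the `node_detach` / `node_append` helpers are hypothetical names invented for this illustration; they are not MyHTML's actual structures or API.

```c
#include <stddef.h>

/* Hypothetical node layout, assumed for this sketch only. */
typedef struct node {
    const char  *tag;
    struct node *parent;
    struct node *first_child;
    struct node *last_child;
    struct node *prev;   /* previous sibling */
    struct node *next;   /* next sibling */
} node_t;

/* Unlink a node, together with its subtree, from its parent. */
static void node_detach(node_t *n) {
    if (n->parent == NULL) return;

    if (n->prev) n->prev->next = n->next;
    else         n->parent->first_child = n->next;

    if (n->next) n->next->prev = n->prev;
    else         n->parent->last_child = n->prev;

    n->parent = n->prev = n->next = NULL;
}

/* Move a node to the end of another node's child list.
 * The caller must not append a node to its own descendant. */
static void node_append(node_t *parent, node_t *child) {
    node_detach(child);

    child->parent = parent;
    child->prev   = parent->last_child;
    child->next   = NULL;

    if (parent->last_child) parent->last_child->next = child;
    else                    parent->first_child = child;
    parent->last_child = child;
}
```

With nodes linked this way, moving a subtree is just a call such as `node_append(&p, &div)` made at any time after parsing, with no second pass over the markup.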

Furthermore, none of them let you inject data into the stream (of html, that is) while parsing is in progress. This is essential for hooking up a JavaScript engine. Rather than explain at length, I will show you why:

Fragment of html document:
<script>document.write("<div cl");</script>ass="future"></div> 

The result of any browser with JS:
 ... <div class="future"></div> 


That is, a complete DIV element is created in the end. By the way, tokenizing the SCRIPT tag is a special case all of its own. I even had to draw a diagram of it.
[diagram: tokenization of the SCRIPT tag]
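
The document.write fragment above can be reproduced with a toy input stream that accepts insertions at the current read position. This is only a sketch under loud assumptions: `stream_t`, `stream_write` and `effective_markup` are hypothetical names, the "script engine" recognizes nothing but a literal `document.write("...")` without escapes, and the script element is simply cut out of the buffer. It is not how MyHTML or any browser implements this.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable input stream, assumed for this sketch only. */
typedef struct {
    char  *data;  /* NUL-terminated buffer */
    size_t len;   /* bytes in use, excluding the NUL */
    size_t pos;   /* next byte the tokenizer will read */
} stream_t;

/* Insert text at the current read position, as document.write() does. */
static int stream_write(stream_t *s, const char *text) {
    size_t n = strlen(text);
    char *grown = realloc(s->data, s->len + n + 1);
    if (grown == NULL) return -1;
    s->data = grown;
    /* Shift the unread tail (including the NUL) to make room. */
    memmove(s->data + s->pos + n, s->data + s->pos, s->len - s->pos + 1);
    memcpy(s->data + s->pos, text, n);
    s->len += n;
    return 0;
}

static const char OPEN[]  = "<script>document.write(\"";
static const char CLOSE[] = "\");</script>";

/* Returns a malloc'd copy of the markup the tokenizer ends up seeing,
 * or NULL on allocation failure. The caller frees the result. */
static char *effective_markup(const char *html) {
    stream_t s;
    s.len = strlen(html);
    s.pos = 0;
    s.data = malloc(s.len + 1);
    if (s.data == NULL) return NULL;
    memcpy(s.data, html, s.len + 1);

    while (s.pos < s.len) {
        if (strncmp(s.data + s.pos, OPEN, strlen(OPEN)) != 0) {
            s.pos++;                 /* ordinary byte: consume it */
            continue;
        }
        char *arg = s.data + s.pos + strlen(OPEN);
        char *end = strstr(arg, CLOSE);
        if (end == NULL) break;      /* unterminated script: stop here */

        /* Copy the string literal passed to document.write(). */
        size_t arg_len = (size_t)(end - arg);
        char *text = malloc(arg_len + 1);
        if (text == NULL) { free(s.data); return NULL; }
        memcpy(text, arg, arg_len);
        text[arg_len] = '\0';

        /* Cut the script element out of the stream ... */
        char  *after   = end + strlen(CLOSE);
        size_t removed = (size_t)(after - (s.data + s.pos));
        memmove(s.data + s.pos, after, s.len - (size_t)(after - s.data) + 1);
        s.len -= removed;

        /* ... and splice the written text in where reading resumes. */
        int rc = stream_write(&s, text);
        free(text);
        if (rc != 0) { free(s.data); return NULL; }
    }
    return s.data;
}
```

Because `pos` does not advance after the splice, the tokenizer reads `<div cl` immediately followed by the original `ass="future"></div>`, which is why a parser that cannot accept input mid-parse cannot reproduce the browser's result.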

After everything I had seen, I decided to write it all from scratch in C. The requirements for the code came up right away:
  1. C99 support
  2. The ability to separate the html parser from the renderer so that it can be used on its own
  3. No external dependencies

Why go straight for something as hardcore as C? Because the solution has to be embeddable, so that bindings for any other programming language are easy to write.

With varying degrees of success, I managed to implement the following in draft form:
  1. An html parser
  2. A CSS parser
  3. Selectors
  4. A renderer for inline, inline-block, block, table ...

I could write about the renderer for a long time. A great deal hides behind the short phrase "renderer of inline elements": working with fonts according to the specification, calculating text dimensions, calculating vertical-align, building an auxiliary tree for drawing text, and a whole lot more.

In the end, after two or three years of unhurried development, I started turning the draft into working code. The first piece, logically enough, was the html parser.

The html parser currently has the following features:

+ A whole bunch of small but necessary things that I could write about at length.

Next up are the CSS parser and the renderer. I am doing all of this on my own, but I should have enough "fuel".
Any help is extremely welcome!

Thank you for your attention! I hope you find this useful!

The parser itself

PS: If the community shows interest in this topic, I can write more narrowly focused articles on how the rendering calculation works and what difficulties I have run into and still face.

Source: https://habr.com/ru/post/277031/

