
Benchmarking HTML parsers

The other day I rewrote a piece of one service from Python to Erlang. The service downloads a large number of similar HTML pages over HTTP and extracts some information from them. The main CPU load of the service falls on parsing the HTML into a DOM tree.

At first I just wanted to compare the performance of the Erlang parser mochiweb_html with the lxml.etree.HTML() parser used on the Python side. I ran the simplest benchmark, drew the necessary conclusions, and then thought it would be nice to add a couple more parsers and platforms, give the whole thing a nicer look, publish the code and write an article.
So far I have managed to write benchmarks for Erlang, Python, PyPy, NodeJS and C in the following combinations:

The test compares the speed of N iterations of each parser and its peak memory consumption.

The intrigue: which is faster, Python or PyPy? How does Erlang's immutability affect parsing speed and memory consumption? How fast is NodeJS's V8? And how does all of this compare to code in pure C?

Terms


You are most likely familiar with them, but why not recap?
A lenient HTML parser is an HTML parser that can handle invalid HTML code (unclosed tags, > and < characters inside <script> tags, unescaped ampersand characters & , attribute values without quotes, etc.). Clearly, not every piece of broken HTML can be recovered, but it can be brought to the form a browser would normalize it to. Notably, most of the HTML found on the Internet is invalid to some degree!
DOM tree (Document Object Model): strictly speaking, the DOM is the API provided to JavaScript in the browser for manipulating HTML documents. We will simplify things a bit and treat it as a data structure, a tree-like representation of the structure of an HTML document. At the root of the tree is the <html> element, its child elements are <head> and <body> , and so on. For example, in Python the document
 <html lang="ru-RU"> <head></head> <body>Hello, World!</body> </html> 

can be represented in its simplest form as
 ("html", {"lang": "ru-RU"}, [ ("head", {}, []), ("body", {}, ["Hello, World!"]) ]) 

HTML is usually converted into a DOM tree either to transform it or to extract data from it. For extracting data from the tree, XPath or CSS selectors are very convenient.
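As a quick illustration, here is a minimal sketch of this workflow using lxml (one of the contestants below); the document and the XPath expressions are made up for the example:

 from lxml import etree

 html = b"""<html lang="ru-RU">
   <head><title>Hello</title></head>
   <body><p class="greeting">Hello, World!</p></body>
 </html>"""

 # Build the DOM tree with the lenient HTML parser.
 tree = etree.HTML(html)

 # Extract data from the tree with XPath.
 print(tree.xpath("//p[@class='greeting']/text()"))  # ['Hello, World!']
 print(tree.xpath("//title/text()"))                 # ['Hello']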

Contestants




Goals


In general, parsing HTML (like JSON) is interesting because the document has to be scanned character by character. It contains no hints like "the next 10 KB is plain text, copy it as is". If we encounter a <p> tag in the text, we have to look through all the subsequent characters one by one for the presence of a </ . The fact that HTML can be invalid forces us to double-check everything. For example, if we encounter an <option> tag, it is far from certain that we will ever encounter the closing </option> . The second problem that usually arises with such formats is the escaping of special characters. For example, if the entire document is <html>...100 KB of text... &amp; ...100 KB of text...</html> , the parser will have to create in memory a complete copy of the tag contents with a single change: "&amp;" converted into "&" (although some parsers simply break such text into 3 separate pieces).
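To see this entity handling in action, here is a tiny sketch with lxml (any of the DOM parsers below behaves similarly): the text node in the resulting tree already contains the unescaped character, so the parser could not simply reference the original bytes as-is.

 from lxml import etree

 # The raw document contains the escaped entity &amp; ...
 doc = etree.HTML(b"<html><body>fish &amp; chips</body></html>")

 # ... but the text node in the tree holds a real "&", so the
 # parser had to produce a modified copy of the text.
 print(doc.findtext(".//body"))  # fish & chips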

The need to build a rather large structure in memory, a tree of many small objects, imposes fairly strict requirements on memory management and the garbage collector, and on the overhead of creating many small objects.

What we want from our benchmark is to compare parsing speed and peak memory consumption across parsers and platforms.

We will not check the quality of the parsers' work in terms of how completely they recover broken documents. Comparing the convenience of the APIs and the availability of tools for working with the tree will also be left behind the scenes.

Test conditions and methods


The program reads the document from disk into memory once and then parses it N times sequentially in a loop.
On each iteration the parser must build the full DOM tree in memory.
After N iterations the program prints the loop time and exits.
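For the Python variants, the core of each benchmark looks roughly like this (a minimal sketch under my own naming, not the exact repository code):

 import sys
 import time
 from lxml import etree  # or bsoup3/bsoup4/html5lib for the other variants

 def main():
     path, loops = sys.argv[1], int(sys.argv[2])
     with open(path, "rb") as f:
         html = f.read()          # read the document from disk once

     started = time.time()
     for _ in range(loops):       # build the full DOM tree on every iteration
         etree.HTML(html)
     print("%.6f" % (time.time() - started))  # loop time only, startup excluded

 if __name__ == "__main__":
     main()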

Each parser is launched on a set of several HTML documents at N = 10, 50, 100, 400, 600 and 1000 iterations.
We measure user CPU, system CPU, real runtime and (approximate) peak memory usage with /usr/bin/time .
HTML documents:

To be honest, I only realized at the end that most of the documents are about the same size, but I did not redo the tests, since the measurement process itself takes quite a while. It would also be interesting to track how various things depend on page size. UPD: the second part of the article is in preparation; in it we will parse the sites from the Alexa TOP1000.

The tests run sequentially on Ubuntu 3.5.0-19-generic x86_64, Intel Core i7-3930K CPU @ 3.20GHz × 12. (Ha, what good are 12 cores if the tests run sequentially? Ehh...)

Code


All the code is available on github . You can try running it yourself; detailed instructions are in the readme file. Better yet, I strongly recommend not taking my word for it but checking how the tests behave in your environment!
Tip: if you want to test only some of the platforms (for example, you don't feel like installing Erlang or PyPy), this is easily configured via the PLATFORMS environment variable.
I will be glad to receive pull requests with parser implementations in other languages (PHP? Java? .NET? Ruby?) and will try to add them to the results (native implementations are the most interesting; bindings to libxml usually do not differ much in speed). It would also be interesting to try running the tests on other interesting HTML files (deeply nested tags, various file sizes).

Results


Here are the raw measurement results as CSV files: results-1000.csv, results-600.csv, results-400.csv, results-100.csv, results-50.csv, results-10.csv. Let's try to analyze them; for this we use a script in the R language (located in the stats/ folder of the benchmark repository).

Speed


To study how parser speed depends on the number of iterations, we plot histograms of [time to process one page] against [number of iterations]. The time to process one page is the parser's runtime divided by the number of iterations. Ideally, the speed of the parser should not depend on the number of iterations, or better yet should increase (thanks to JIT warm-up, for example).
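This metric is easy to recompute from the raw CSVs without R as well; a sketch in Python, assuming the CSV carries the column names used in the tables below (the exact layout of the files may differ):

 import csv

 with open("results-1000.csv") as f:
     for row in csv.DictReader(f):
         # time per page = parser loop time / number of iterations
         per_page = float(row["parser.s"]) / float(row["loops"])
         print(row["platform"], row["parser"], row["file"],
               "%.4f s/page" % per_page)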
All the graphs are clickable! Don't strain your eyes!

The dependence of document processing time on the number of parser iterations (here and below there is a separate chart for each HTML page).
[Charts: html_parser_bench_pre-001 … html_parser_bench_pre-004]
Bars of the same height are good, different heights are bad. You can see that for most parsers there is no dependence (all bars have the same height). The only exceptions are BeautifulSoup 4 and html5lib under PyPy: for some reason their performance degrades as the number of iterations grows. That is, if your parser on PyPy has to run for a long time, its performance will gradually decrease. Surprise...

Now for the most interesting graph: the average speed of processing one page by each parser. We construct a box plot.
The average time to process a document.
[Charts: html_parser_bench_pre-005 … html_parser_bench_pre-008]
The higher the box, the slower the parser. The larger the box's area, the greater the spread of values (i.e., the stronger the dependence of performance on the number of iterations). You can see that the C parser is in the lead, followed by lxml.etree , with the NodeJS and Erlang parsers close behind, then the bsoup3 parser on PyPy, the CPython parsers, and then, by a large margin, the same parsers running on PyPy. What a surprise! PyPy loses across the board.
Another oddity: the bsoup3 parser on CPython really didn't like the Wikipedia page :-).

Sample tabular data:
 > subset(res, (file == "page_google.html") & (loops == 1000))[c("platform", "parser", "parser.s", "real.s", "user.s")]
     platform                parser   parser.s real.s user.s
 6  c-libxml2 libxml2_html_parser.c   2.934295   2.93   2.92
 30    erlang     mochiweb_html.erl  13.346997  13.51  13.34
 14    nodejs     cheerio_parser.js   5.303000   5.37   5.36
 38    nodejs  htmlparser_parser.js   6.686000   6.72   6.71
 22    nodejs       jsdom_parser.js  98.288000  98.42  98.31
 33      pypy      bsoup3_parser.py  40.779929  40.81  40.62
 57      pypy      bsoup4_parser.py 434.215878 434.39 433.91
 41      pypy    html5lib_parser.py 361.008080 361.25 360.46
 65    python      bsoup3_parser.py  78.566026  78.61  78.58
 49    python      bsoup4_parser.py  33.364880  33.45  33.43
 60    python    html5lib_parser.py 200.672682 200.71 200.70
 67    python        lxml_parser.py   3.060202   3.08   3.08


Memory


Now let's look at memory usage. First, let's see how memory consumption depends on the number of iterations. We build histograms again. Ideally, all bars for the same parser should be the same height. If consumption grows with the number of iterations, this points to memory leaks or problems with the garbage collector.
Memory consumption depending on the number of iterations of the parser.
[Charts: html_parser_bench_pre-009, html_parser_bench_pre-010]
This is interesting. Bsoup4 and html5lib under PyPy occupied 5 GB of memory after 1000 iterations over a 1 MB file. (I included only two graphs here because the rest show the same picture.) You can see that the consumed memory grows almost linearly with the number of iterations. It turns out that PyPy is simply incompatible with the bsoup4 and html5lib parsers. I don't know what the reason is or who is to blame, but clearly, using PyPy without thoroughly checking its compatibility with every library you use is a very risky business.
So the combination of PyPy with these parsers is ruled out. Let's remove them from the charts:
Memory consumption depending on the number of iterations of the parser (without Bsoup4 and html5lib on PyPy).
[Charts: html_parser_bench_dropped_pre-009 … html_parser_bench_dropped_pre-012]
We see that for the C parser all bars are almost identical in height. The same goes for lxml.etree . For most parsers, memory consumption at 10 iterations is slightly lower; perhaps /usr/bin/time simply doesn't have time to measure it. The jsdom NodeJS parser behaves quite strangely: its memory consumption on some pages jumps around quite randomly, but overall a growth of memory consumption over time is visible. Perhaps some problems with garbage collection.

Let's compare the average memory consumption of the remaining parsers. We construct a box plot.
Average memory consumption.
[Charts: html_parser_bench_dropped_pre-013 … html_parser_bench_dropped_pre-016]
We see that the ranking is about the same as in the speed comparison, except that Erlang's memory consumption turned out to be lower than NodeJS's. lxml.etree requires approximately 2 times more memory than C's libxml2 , but less than any other parser. NodeJS's jsdom parser falls out of the overall picture, consuming roughly 2 times more memory than the other NodeJS parsers; apparently it has a significant overhead from creating additional attributes on the DOM tree elements.
Sample tabular data:
 > subset(res, (file == "page_google.html") & (loops == 1000))[c("platform", "parser", "maximum.RSS")]
     platform                parser maximum.RSS
 6  c-libxml2 libxml2_html_parser.c        2240
 30    erlang     mochiweb_html.erl       21832
 14    nodejs     cheerio_parser.js       49972
 38    nodejs  htmlparser_parser.js       48740
 22    nodejs       jsdom_parser.js      119256
 33      pypy      bsoup3_parser.py       61756
 57      pypy      bsoup4_parser.py     1701676
 41      pypy    html5lib_parser.py     1741944
 65    python      bsoup3_parser.py       42192
 49    python      bsoup4_parser.py       54116
 60    python    html5lib_parser.py       45496
 67    python        lxml_parser.py        9364


Program startup overhead


This is not so much a test of HTML parsers as an attempt to find out which platform is better suited for writing console utilities. Just a small addition, since we already have the data. The platform overhead is the time a program spends not on its actual work but on preparing for it (initializing libraries, reading the HTML file, etc.). To calculate it, we subtract the time measured around the parser loop ("parser.s") from the total time reported by the time utility ("time.s").
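The same subtraction over the raw CSV in Python, assuming the columns are named "time.s" and "parser.s" as described above:

 import csv

 with open("results-1000.csv") as f:
     for row in csv.DictReader(f):
         # startup overhead = external wall time - parser loop time
         overhead = float(row["time.s"]) - float(row["parser.s"])
         print(row["platform"], row["parser"], "%.2f s overhead" % overhead)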

[Chart: html_parser_bench_overhead_pre-001]

You can see that the overhead is insignificant in most cases. It is noteworthy that Erlang's is comparable to Python's. One can assume it depends on how heavy the libraries imported by the program are.

Findings


As you can see, the C implementation is ahead of the rest of the planet (but its code turned out to be the longest).

The libxml2 binding for Python ( lxml.etree.HTML ) works at almost the same speed but consumes 2 times more memory (most likely the overhead of the interpreter itself). So the preferred parser for Python is lxml.

The parser in pure Erlang shows surprisingly good results, despite the "copies data on every sneeze" reputation attributed to Erlang ©. Its speed is comparable to the simple NodeJS parsers and higher than that of the Python parsers. Only C and lxml consume less memory. Stability is excellent. Such a parser can be released to production (which is what I did).

The simple NodeJS parsers work very quickly, only about 2 times slower than the C libxml. V8 works extremely efficiently. But memory is consumed at Python's level, and not very steadily (memory consumption can grow as the number of iterations goes from 10 to 100, but then it stabilizes). The jsdom parser is clearly unsuitable for simple parsing: it has too much overhead. So for HTML parsing in NodeJS the best choice is cheerio.

Parsers in pure Python lose in both speed and memory consumption, and their results jump around a lot on different pages. At the same time they behave stably across different numbers of iterations (does the GC work evenly?).

But PyPy surprised me most of all. Either there are some problems with the GC, or the task doesn't suit it, or the parsers are unlucky, or I messed something up, but the performance of the parsers on PyPy decreases as the number of iterations grows, and memory consumption grows linearly. The bsoup3 parser more or less copes, but its performance is on par with CPython. That is, only bsoup3 is suitable for parsing on PyPy, and it gives no significant advantage over CPython.

Links


Benchmark code
My blog
Document Object Model

Send your implementations! It is very easy!

UPD:
Results for the added parsers (graphs, CSV):
golang (3 of them; one is faster than C (!!!))
haskell (very good)
java (openjdk, oracle-jre) (automatically parallel, taking ~150% CPU, user CPU > real CPU)
perl
php + tidy
ruby (binding to libxml)
dart (very slow)
mono (2 of them) (automatically parallel, taking ~150% CPU, user CPU > real CPU)

A full article, plus a study of how speed and memory consumption depend on page size using the Alexa TOP1000 sites, will follow later (hopefully).

Source: https://habr.com/ru/post/163979/

