
Parse HTML in .NET and Survive: Analyzing and Comparing Libraries


While working on a personal project, I ran into the need to parse HTML. A Google search led me to a comment by Athari and his micro-review of current HTML parsers in .NET, for which many thanks to him.

Unfortunately, I found no figures or arguments in favor of any particular parser, which is what prompted this article.

Today I will test the currently popular libraries for working with HTML, namely: AngleSharp, CsQuery, Fizzler, HtmlAgilityPack and, of course, the Regex way. I will compare them in speed and usability.

TL;DR: The code for all benchmarks can be found on GitHub, along with the test results. The most relevant parser at the moment is AngleSharp: a convenient, fast, young parser with a pleasant API.

Those interested in a detailed review: welcome under the cut.





Library Description


This section gives brief descriptions of the libraries under consideration, their licenses, and so on.

HtmlAgilityPack


One of the most (if not the most) well-known HTML parsers in the .NET world. A lot has been written about it in both Russian and English, for example on Habrahabr.

In short, this is a fast, reasonably convenient library for working with HTML (as long as the XPath queries stay simple). The repository has not been updated for a long time.
License: MS-PL.

The parser is convenient when the task is typical and well described by an XPath expression; for example, getting all the links from a page takes very little code:

/// <summary>
/// Extract all anchor tags using HtmlAgilityPack
/// </summary>
public IEnumerable<string> HtmlAgilityPack()
{
    HtmlDocument htmlSnippet = new HtmlDocument();
    htmlSnippet.LoadHtml(Html);
    List<string> hrefTags = new List<string>();

    foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}

However, if you need to work with CSS classes, XPath will give you a lot of headaches:

/// <summary>
/// Extract anchor tags inside h3.r elements using HtmlAgilityPack
/// </summary>
public IEnumerable<string> HtmlAgilityPack()
{
    HtmlDocument hap = new HtmlDocument();
    hap.LoadHtml(Html);
    HtmlNodeCollection nodes = hap
        .DocumentNode
        .SelectNodes("//h3[contains(concat(' ', @class, ' '), ' r ')]/a");
    List<string> hrefTags = new List<string>();

    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            hrefTags.Add(node.GetAttributeValue("href", null));
        }
    }

    return hrefTags;
}

Among the observed oddities: a peculiar API, sometimes unclear and confusing. If nothing is found, it returns null rather than an empty collection. And the library updates have stalled: nobody has committed for a long time, and bugs are not being fixed (Athari mentioned a critical bug, "Incorrect parsing of HTML4 optional end tags", which leads to incorrect processing of HTML tags whose closing tags are optional).
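That null behaviour is worth guarding against every time you call SelectNodes. A minimal sketch of such a guard (the helper name is my own, not part of the library; it relies only on HtmlNodeCollection implementing IEnumerable&lt;HtmlNode&gt;):

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HtmlDocumentExtensions
{
    // Hypothetical helper: SelectNodes returns null when nothing matches,
    // so coalesce the result into an empty sequence before iterating.
    public static IEnumerable<HtmlNode> SafeSelect(this HtmlDocument doc, string xpath)
    {
        return (IEnumerable<HtmlNode>)doc.DocumentNode.SelectNodes(xpath)
               ?? Enumerable.Empty<HtmlNode>();
    }
}
```

With this in place, a foreach over the result never throws a NullReferenceException on empty pages.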

Fizzler


An add-on to HtmlAgilityPack that allows the use of CSS selectors.
The code, in this case, is the clearest description of the problem Fizzler solves:

// Load the document
var html = new HtmlDocument();
html.LoadHtml(@"
    <html>
        <head></head>
        <body>
            <div>
                <p class='content'>Fizzler</p>
                <p>CSS Selector Engine</p>
            </div>
        </body>
    </html>");

// Fizzler works on top of HtmlAgilityPack objects,
// adding QuerySelectorAll to HtmlNode
var document = html.DocumentNode;

// Returns: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");

// Returns: [<p class="content">Fizzler</p>, <p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");

// Empty result
document.QuerySelectorAll("body>p");

// Returns: [<p class="content">Fizzler</p>, <p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");

// Returns: [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");

The speed is almost the same as HtmlAgilityPack, but it is more convenient thanks to CSS selectors.

Commits have the same problem as HtmlAgilityPack: there have been no updates for a long time and, apparently, none are expected.

License: LGPL .

CsQuery


It was one of the modern HTML parsers for .NET. It is based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko engine (Firefox).

The API was inspired by jQuery, with CSS as the selector language. Method names are copied almost one-to-one, so learning is easy for programmers familiar with jQuery.

Currently, the development of CsQuery is in a passive stage.

Message from the developer
CsQuery is not being actively maintained. I no longer work with .NET on a day-to-day basis, so it is difficult for me to address problems or questions. If you post issues, I may not be able to fix them.

The current release is stable and has proven itself. However, I cannot say anything about a new release.

I would be happy for someone to make this project active again. If you use CsQuery and are interested in being a collaborator on the project, please contact me directly.


The author himself recommends AngleSharp as an alternative for your project.

The code for getting links from the page looks nice and familiar to anyone using jQuery:

/// <summary>
/// Extract all anchor tags using CsQuery
/// </summary>
public IEnumerable<string> CsQuery()
{
    List<string> hrefTags = new List<string>();
    CQ cq = CQ.Create(Html);

    foreach (IDomObject obj in cq.Find("a"))
    {
        hrefTags.Add(obj.GetAttribute("href"));
    }

    return hrefTags;
}

License: MIT

AngleSharp


Unlike CsQuery, it is written from scratch, by hand, in C#. It also includes parsers for related languages.

The API is based on the official JavaScript HTML DOM specification. In some places there are oddities unusual for .NET developers (for example, accessing an out-of-range index in a collection returns null instead of throwing an exception; there is a separate Url class; the namespaces are very granular), but overall nothing critical.

The library is developing very quickly. The number of goodies that make work easier is simply amazing, for example IHtmlTableElement, IHtmlProgressElement, and so on.

The code is clean, neat, convenient.
For example, extracting links from a page is practically no different from Fizzler:

/// <summary>
/// Extract all anchor tags using AngleSharp
/// </summary>
public IEnumerable<string> AngleSharp()
{
    List<string> hrefTags = new List<string>();
    var parser = new HtmlParser();
    var document = parser.Parse(Html);

    foreach (IElement element in document.QuerySelectorAll("a"))
    {
        hrefTags.Add(element.GetAttribute("href"));
    }

    return hrefTags;
}

And for more complex cases there are dozens of specialized interfaces that will help solve the problem.
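As an illustration of those interfaces, here is a rough sketch (the method name and structure are my own, not from the benchmark repository) of reading table cells through IHtmlTableElement, assuming the same parser setup as in the snippet above:

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Dom.Html;
using AngleSharp.Parser.Html;

public partial class Samples
{
    private const string Html = "<table><tr><td> a </td><td> b </td></tr></table>"; // illustrative input

    /// <summary>
    /// Sketch: read table cells via AngleSharp's typed DOM interfaces.
    /// </summary>
    public IEnumerable<IEnumerable<string>> TableCells()
    {
        var parser = new HtmlParser();
        var document = parser.Parse(Html);

        // Elements can be cast to specialized interfaces such as IHtmlTableElement
        var table = document.QuerySelector("table") as IHtmlTableElement;
        if (table == null)
            return Enumerable.Empty<IEnumerable<string>>();

        // Rows and Cells come typed, so no selector gymnastics are needed
        return table.Rows
            .Select(row => row.Cells.Select(cell => cell.TextContent.Trim()));
    }
}
```

The point is that Rows and Cells arrive already typed, so there is no need to hand-write selectors for thead/tbody quirks.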

License: MIT

Regex


An ancient and not the most successful approach to working with HTML. I really liked Athari's commentary on it, so I will duplicate it here:
Scary and horrible regular expressions. It is undesirable to use them, but sometimes the need arises, since DOM-building parsers are noticeably more gluttonous than Regex: they consume more CPU time and memory.

If it comes to regular expressions, you need to understand that you cannot build a universal and absolutely reliable solution on them. However, if you want to parse one specific site, this problem may not be so critical.

For God's sake, do not turn regular expressions into an unreadable mess. You do not write C# code in one line with single-letter variable names; regular expressions do not deserve to be spoiled either. The regular expression engine in .NET is powerful enough to allow writing high-quality code.

The code for getting links from the page looks more or less clear:
/// <summary>
/// Extract all anchor tags using Regex
/// </summary>
public IEnumerable<string> Regex()
{
    List<string> hrefTags = new List<string>();
    Regex reHref = new Regex(@"(?inx)
        <a \s [^>]*
            href \s* = \s*
                (?<q> ['""] )
                    (?<url> [^""]+ )
                \k<q>
        [^>]* >");

    foreach (Match match in reHref.Matches(Html))
    {
        hrefTags.Add(match.Groups["url"].ToString());
    }

    return hrefTags;
}


But if you suddenly need to work with tables, especially ones in an elaborate format, please look here first.

The license is listed on this site .

Benchmark


The speed of a parser is, whatever one may say, one of its most important attributes: how long your task takes depends directly on the speed of HTML processing.

To measure the performance of the parsers, I used DreamWalker's BenchmarkDotNet library, for which many thanks to him.
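A benchmark in BenchmarkDotNet boils down to marking methods with an attribute and running the class. A minimal sketch (the class name, HTML constant, and the naive counting method are illustrative, not the actual benchmark code from the repository):

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class HtmlParsingBenchmark
{
    // Illustrative input; the real benchmarks load a full page
    private const string Html = "<html><body><a href='#'>link</a></body></html>";

    // Each [Benchmark] method is measured and reported in the summary table
    [Benchmark]
    public int CountLinksNaively()
        => Html.Split(new[] { "<a " }, StringSplitOptions.None).Length - 1;
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<HtmlParsingBenchmark>();
}
```

BenchmarkDotNet takes care of warm-up, iteration counts and statistics, which is exactly what produces the "average time / standard deviation / operations per second" columns below.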

The measurements were made on an Intel® Core (TM) i7-4770 CPU @ 3.40GHz, but experience suggests that the relative time will be the same on any other configurations.

A few words about Regex: do not repeat this at home. Regex is a very good tool in capable hands, but working with HTML is definitely not where it should be used. As an experiment, though, I tried to implement a minimally working version of the code. It completed its task successfully, but the amount of time spent writing it suggests that I will definitely not repeat this.

Well, let's look at the benchmarks.

Getting addresses from links on the page


This task, it seems to me, is the basic one for all parsers; more often than not, the fascinating acquaintance with the world of parsers (and sometimes with Regex) begins with exactly this formulation.

The benchmark code can be found on github , and below is a table with the results:

Library         | Average time | Standard deviation | Operations/sec
----------------|--------------|--------------------|---------------
AngleSharp      | 8.7233 ms    | 0.4735 ms          | 114.94
CsQuery         | 12.7652 ms   | 0.2296 ms          | 78.36
Fizzler         | 5.9388 ms    | 0.1080 ms          | 168.44
HtmlAgilityPack | 5.4742 ms    | 0.1205 ms          | 182.76
Regex           | 3.2897 ms    | 0.1240 ms          | 304.37


As expected, Regex turned out to be the fastest, though not the most convenient. HtmlAgilityPack and Fizzler showed roughly the same processing time, slightly ahead of AngleSharp. CsQuery, unfortunately, lagged hopelessly behind. Quite possibly I simply don't know how to cook it; I would be glad to hear comments from people who have worked with this library.

It is hard to evaluate convenience here, since the code is almost identical. But other things being equal, I liked the CsQuery and AngleSharp code more.

Getting data from a table


I encountered this task in practice. Moreover, the table I had to work with was far from simple.

A note about life in Belarus
I wanted up-to-date information about currency exchange rates in the glorious city of Minsk. I found no services providing information about bank rates, but I stumbled upon http://select.by/kurs/. The information there is updated frequently and contains exactly what I need, but in a very inconvenient format.
Guys, if you are reading this: make a normal service, or at least fix the HTML.


I tried to hide away as much as possible of what is not specific to HTML processing, but due to the specifics of the task not everything worked out.

The code for all the libraries is about the same; the differences are only in the API and how the results are returned. However, two things are worth mentioning: first, AngleSharp has specialized interfaces that made solving the task easier; second, Regex does not fit this task at all.

Let's see the results:

Library         | Average time | Standard deviation | Operations/sec
----------------|--------------|--------------------|---------------
AngleSharp      | 27.4181 ms   | 1.1380 ms          | 36.53
CsQuery         | 42.2388 ms   | 0.7857 ms          | 23.68
Fizzler         | 21.7716 ms   | 0.6842 ms          | 45.97
HtmlAgilityPack | 20.6314 ms   | 0.3786 ms          | 48.49
Regex           | 42.2942 ms   | 0.1382 ms          | 23.64


As in the previous example, HtmlAgilityPack and Fizzler showed roughly the same, very good time. AngleSharp is behind them, though perhaps I did not do everything in the most optimal way. To my surprise, CsQuery and Regex showed equally bad processing times. With CsQuery everything is clear: it is simply slow. With Regex it is not so simple; most likely the problem can be solved in a more optimal way.

Conclusions


Everyone has probably drawn their own conclusions. For my part, I will add that right now AngleSharp is the best choice: it is actively developed, has an intuitive API, and shows good processing times. Does it make sense to migrate from HtmlAgilityPack to AngleSharp? Most likely not; we install Fizzler on top and enjoy a very fast and convenient library.

Thank you all for your attention.
All the code can be found in the GitHub repository. Any additions and/or changes are welcome.

Source: https://habr.com/ru/post/273807/

