Html Agility Pack - convenient .NET HTML parser

Hello!
One day I had the idea of analyzing the job postings on Habr. Specifically, I wanted to know whether salary correlates with having a higher education. Students (me included) are in the middle of exam session right now, so perhaps someone who is already worn out by exams will find this analysis useful.
Since I am a .NET programmer, I decided to parse the Habr postings in C#. I did not want to pick apart HTML strings by hand, so I went looking for an HTML parser that would help with the task.
Looking ahead, I will say that nothing interesting came out of the analysis, and the exams will still have to be passed :(
But I will tell you a little about the very useful Html Agility Pack library.

Parser selection

I found this library through a discussion on Stack Overflow. The comments suggested other solutions as well, for example the SgmlReader library, which converts HTML into an XmlDocument, and .NET has a full set of tools for working with XML. But for some reason that did not win me over, so I went and downloaded Html Agility Pack.

A quick inspection of the Html Agility Pack

The library's help file can be downloaded from the project page. The functionality is genuinely pleasing.
There are twenty main classes available to us.

Method names follow the DOM interfaces (as k12th noted), plus some extras: GetElementbyId(), CreateAttribute(), CreateElement() and so on, so the library will feel especially familiar if you have worked with JavaScript.
It seems that the HTML is still converted into XML internally, and that HtmlDocument and the other classes are wrappers around it. That is fine, because it makes the following options available:

LINQ to Objects (via LINQ to XML)
XPath
XSLT
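
To illustrate the first two options, here is a minimal sketch that reaches the same links once with XPath and once with LINQ. It assumes the HtmlAgilityPack NuGet package is referenced. The HTML fragment, the QueryDemo class and its method names are invented for the example and do not come from Habr or from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class QueryDemo
{
    // Hypothetical fragment, not taken from Habr.
    const string Sample =
        "<div id='links'><a href='/a'>First</a><a href='/b'>Second</a></div>";

    // XPath version: collect the href of every link directly inside the div.
    public static List<string> HrefsViaXPath(HtmlDocument html)
    {
        // By default SelectNodes returns null, not an empty collection,
        // when nothing matches, so the result has to be checked.
        var nodes = html.DocumentNode.SelectNodes("//div[@id='links']/a");
        if (nodes == null) return new List<string>();
        return nodes.Select(a => a.GetAttributeValue("href", "")).ToList();
    }

    // LINQ version: the same links, reached through GetElementbyId.
    public static List<string> TextsViaLinq(HtmlDocument html)
    {
        return html.GetElementbyId("links")
                   .Descendants("a")
                   .Select(a => a.InnerText)
                   .ToList();
    }

    static void Main()
    {
        var html = new HtmlDocument();
        html.LoadHtml(Sample);
        Console.WriteLine(string.Join(", ", HrefsViaXPath(html)));
        Console.WriteLine(string.Join(", ", TextsViaLinq(html)));
    }
}
```

The null check after SelectNodes is the part most worth copying: forgetting it is an easy way to get a NullReferenceException on a page whose layout differs from what you expected.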

Parsing Habr!

Job postings on Habr are presented as a table whose rows give the required specialty and the salary. Since we also need the education requirement, we have to open each posting's page and parse that too.
So let's begin. First we need to pull the links, along with the position and salary, out of the table:

// Requires: using System.Linq; using HtmlAgilityPack;
static void GetJobLinks(HtmlDocument html)
{
    // Every <tr> directly under the element with id "job-items" is one posting.
    var trNodes = html.GetElementbyId("job-items").ChildNodes.Where(x => x.Name == "tr");
    foreach (var item in trNodes)
    {
        var tdNodes = item.ChildNodes.Where(x => x.Name == "td").ToArray();
        if (tdNodes.Length != 0)
        {
            // The third cell holds the links for country, region and city.
            var location = tdNodes[2].ChildNodes.Where(x => x.Name == "a").ToArray();
            // jobList is a collection of HabraJob declared elsewhere in the program.
            jobList.Add(new HabraJob()
            {
                Url = tdNodes[0].ChildNodes.First().Attributes["href"].Value,
                Title = tdNodes[0].FirstChild.InnerText,
                Price = tdNodes[1].FirstChild.InnerText,
                Country = location[0].InnerText,
                Region = location[1].InnerText,
                City = location[2].InnerText
            });
        }
    }
}

Then all that remains is to follow each link and pull out the education information and, while we are at it, the employment type. There is a small problem here: the table of links sat inside an element with a known id, but the posting details sit in a table with no id, so I had to walk down the tree by child index, which is not pretty:

static void GetFullInfo(HabraJob job)
{
    HtmlDocument html = new HtmlDocument();
    // wClient is a web client instance declared elsewhere in the program.
    html.LoadHtml(wClient.DownloadString(job.Url));
    // html.LoadHtml(GetHtmlString(job.Url));
    // you can't do it this way :-(
    // The details table has no id, so it is reached by position from "main-content".
    var table = html.GetElementbyId("main-content")
        .ChildNodes[1].ChildNodes[9].ChildNodes[1].ChildNodes[2].ChildNodes[1].ChildNodes[3]
        .ChildNodes.Where(x => x.Name == "tr").ToArray();
    foreach (var tr in table)
    {
        // The <th> of each row names the field, the <td> holds its value.
        string category = tr.ChildNodes.FindFirst("th").InnerText;
        switch (category)
        {
            case "Company":
                job.Company = tr.ChildNodes.FindFirst("td").FirstChild.InnerText;
                break;
            case "Education:":
                job.Education = HabraJob.ParseEducation(tr.ChildNodes.FindFirst("td").InnerText);
                break;
            case "Employment:":
                job.Employment = HabraJob.ParseEmployment(tr.ChildNodes.FindFirst("td").InnerText);
                break;
            default:
                continue;
        }
    }
}
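
A chain of child indices like the one above breaks as soon as the page layout shifts. One possible alternative, offered only as a sketch and not as what the code in this article does, is to let XPath find a row by the text of its header cell. It assumes the HtmlAgilityPack NuGet package is referenced. The VacancyFields class, the FindField helper and the sample table are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

static class VacancyFields
{
    // Returns the trimmed text of the <td> that shares a row with the given
    // <th> label, or null when no such row exists.
    public static string FindField(HtmlDocument html, string label)
    {
        // The label is pasted into the XPath string as-is, so a label that
        // contains a single quote would need escaping; the ones used here do not.
        var td = html.DocumentNode.SelectSingleNode(
            "//tr[th[normalize-space(.)='" + label + "']]/td");
        return td == null ? null : HtmlEntity.DeEntitize(td.InnerText).Trim();
    }

    static void Main()
    {
        // Hypothetical markup that mimics a details table without an id.
        var html = new HtmlDocument();
        html.LoadHtml(
            "<table><tr><th>Education:</th><td> Higher </td></tr>" +
            "<tr><th>Employment:</th><td>Full-time</td></tr></table>");
        Console.WriteLine(FindField(html, "Education:"));
        Console.WriteLine(FindField(html, "Employment:"));
    }
}
```

The trade-off is that this ties the code to the header text instead of the position in the tree, so it survives layout changes but not a renamed label.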

Results

Then we save the results to XML and look at what came out in Excel... and we see that nothing good came of it, because most companies either do not state the salary, or do not state the education requirement (they forget, they put it in the body of the posting, or it really does not matter), or state neither.
For anyone interested, here are the results in xlsx and xml, and here is the source code.

PS

While parsing I ran into one problem: the pages downloaded very slowly. I first tried WebClient and then WebRequest, but it made no difference. A Google search suggested that the proxy should be explicitly disabled in code and then everything would be fine, but that did not help either.
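
For reference, the proxy advice usually looks something like the sketch below. This is an assumption about what the search results meant, not the exact code from this project, and as noted above it did not speed anything up here. The Downloader class and its CreateClient method are hypothetical names.

```csharp
using System;
using System.Net;

static class Downloader
{
    // Builds a WebClient with the proxy explicitly cleared, so that no
    // automatic proxy detection is attempted before the first request.
    public static WebClient CreateClient()
    {
        return new WebClient { Proxy = null };
    }

    static void Main()
    {
        using (var client = CreateClient())
        {
            // Prints True: the explicitly assigned null is what the getter returns.
            Console.WriteLine(client.Proxy == null);
        }
    }
}
```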

Source: https://habr.com/ru/post/112325/
