Parsing HTML in C++ with Gumbo

Gumbo is an HTML5 parser written in C. So far, Gumbo provides only a parse tree, with no convenient functions for working with it. So I wrote a few helper classes:

an STL-compatible depth-first iterator;
comparators for searching by tag and attribute;
a pair of facades.

Below is an example of parsing the page habrahabr.ru/all/

Reading the file and parsing the HTML

    std::string readAll(const std::string &fileName);
    //...
    using namespace EasyGumbo;
    auto page = readAll(argv[1]);
    Gumbo parser(&page[0]);

The constructor of the Gumbo class accepts a pointer to a memory buffer terminated by '\0'.
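The null-terminator requirement is met by std::string itself: since C++11 its storage is contiguous and the character at index size() is '\0', so &page[0] can be handed to code that expects a C string. A minimal sketch of a file reader that relies on this follows; readWholeFile is a hypothetical name used for illustration and is not part of EasyGumbo.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not part of EasyGumbo): reads a whole file into a string.
// Returns an empty string if the file cannot be opened.
std::string readWholeFile(const std::string& fileName) {
    std::ifstream in(fileName, std::ios::binary);
    if (!in) {
        return std::string();
    }
    // Copy every byte of the stream; std::string appends the '\0' itself.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Unlike a seekg/tellg approach, this version does not need the file size up front and reports a missing file as an empty result instead of reading from a failed stream.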

std::find_if, comparators and Element

Let's print a list of articles and their URLs. To do this, we find each a (anchor) tag whose class attribute has the value post_title.

    Gumbo::iterator iter = parser.begin();
    while (iter != parser.end())
    {
        iter = std::find_if(iter, parser.end(),
                            And(Tag(GUMBO_TAG_A), HasAttribute("class", "post_title")));
        if (iter == parser.end())
        {
            break;
        }
        Element titleA(*iter);
        auto text = titleA.children()[0];
        std::cout << "***\n";
        std::cout << std::setw(8) << "Title" << " : " << text->v.text.text << std::endl;
        std::cout << std::setw(8) << "Url" << " : " << titleA.attribute("href")->value << std::endl;
        ++iter;
    }

This relies on the standard algorithm std::find_if together with the Tag and HasAttribute comparators. The And template function creates an instance of the LogicalAnd comparator. Element is a facade over GumboNode.
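To illustrate how such comparators compose, here is a minimal sketch over a toy node type. ToyNode, ToyTag, ToyHasAttribute, ToyLogicalAnd and ToyAnd are all hypothetical stand-ins invented for this example; the real EasyGumbo comparators operate on GumboNode and may be implemented differently.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a parsed node (assumption: the real code uses GumboNode).
struct ToyNode {
    std::string tag;
    std::vector<std::pair<std::string, std::string>> attributes;
};

// Matches nodes with the given tag name.
struct ToyTag {
    std::string name;
    bool operator()(const ToyNode& node) const { return node.tag == name; }
};

// Matches nodes that carry attribute `name` with exactly the value `value`.
struct ToyHasAttribute {
    std::string name;
    std::string value;
    bool operator()(const ToyNode& node) const {
        return std::any_of(node.attributes.begin(), node.attributes.end(),
                           [this](const std::pair<std::string, std::string>& a) {
                               return a.first == name && a.second == value;
                           });
    }
};

// Combines two predicates; matches only when both of them match.
template <typename Lhs, typename Rhs>
struct ToyLogicalAnd {
    Lhs lhs;
    Rhs rhs;
    bool operator()(const ToyNode& node) const { return lhs(node) && rhs(node); }
};

// Factory function, so the template arguments are deduced at the call site.
template <typename Lhs, typename Rhs>
ToyLogicalAnd<Lhs, Rhs> ToyAnd(Lhs lhs, Rhs rhs) {
    return ToyLogicalAnd<Lhs, Rhs>{std::move(lhs), std::move(rhs)};
}
```

With these in place, a call such as std::find_if(nodes.begin(), nodes.end(), ToyAnd(ToyTag{"a"}, ToyHasAttribute{"class", "post_title"})) reads much like the EasyGumbo call above. The factory function exists only to spare the caller from spelling out the template arguments of the combined predicate.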

The Element interface

    struct Element
    {
        typedef Vector<GumboNode*> ChildrenList;

        Element(GumboNode &element) noexcept
            : m_node(element)
        {
            assert(GUMBO_NODE_ELEMENT == m_node.type);
        }

        ChildrenList children() const noexcept
        {
            return ChildrenList(m_node.v.element.children);
        }

        const GumboSourcePosition& start() const noexcept
        {
            return m_node.v.element.start_pos;
        }

        const GumboSourcePosition& end() const noexcept
        {
            return m_node.v.element.end_pos;
        }

        const GumboAttribute* attribute(const char* name) const noexcept
        {
            return gumbo_get_attribute(&m_node.v.element.attributes, name);
        }

        GumboNode &m_node;
    };

In Gumbo, text is stored in nodes of type GUMBO_NODE_TEXT, so we read the text not from the a tag itself but from its child, titleA.children()[0].

findAll and iterator

Sometimes walking element by element is inconvenient, and we want to get a list of the required nodes right away.

    iter = std::find_if(iter, parser.end(),
                        And(Tag(GUMBO_TAG_DIV), HasAttribute("class", "hubs")));
    std::cout << std::setw(8) << "Hubs" << " : ";
    auto hubs = findAll(iter.fromCurrent(), parser.end(), Tag(GUMBO_TAG_A));
    for (auto hub: hubs)
    {
        Element hubA(*hub);
        if (hub != hubs[0])
        {
            std::cout << ", ";
        }
        std::cout << hubA.children()[0]->v.text.text;
    }
    std::cout << std::endl;

Here we find the node containing the hubs, then create a new iterator with the fromCurrent method and use findAll to pull out all the a tags.
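The idea behind a findAll-style helper can be sketched generically: walk a range once and collect pointers to every matching element. The function collectAll below is a hypothetical illustration written against plain iterators; the actual findAll in EasyGumbo may differ in signature and return type.

```cpp
#include <vector>

// Hypothetical generic helper: collects pointers to every element in
// [first, last) for which `pred` returns true.
template <typename Iterator, typename Predicate>
auto collectAll(Iterator first, Iterator last, Predicate pred)
    -> std::vector<decltype(&*first)> {
    std::vector<decltype(&*first)> found;
    for (; first != last; ++first) {
        if (pred(*first)) {
            found.push_back(&*first);  // store the address, not a copy
        }
    }
    return found;
}
```

Returning pointers rather than copies keeps the results tied to the original nodes, which is why the loop over hubs above can compare hub against hubs[0] to detect the first entry.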
The iterator remembers the node at which it started the traversal. If it returns to that node while walking, the traversal ends. This is convenient because there is no need to worry about leaving the subtree, and it makes constructions like the following possible:

    auto posts = findAll(parser.begin(), parser.end(),
                         And(Tag(GUMBO_TAG_A), HasAttribute("class", "post shortcuts_item")));
    for (auto post : posts)
    {
        Gumbo::iterator iter(post);
        /* */
        ...
    }

This is convenient, but it means that some subtrees are traversed twice. The iterator also exposes a public gotoAdj method, which moves to the adjacent element and thereby skips the current subtree. The rest of the code is simple.
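The two behaviours described above, stopping at the starting node and skipping a subtree, can be illustrated with a small pre-order walker over a toy tree. TreeNode, attach and SubtreeWalker are hypothetical names made up for this sketch; it is not the EasyGumbo iterator, and skipSubtree only plays the role that gotoAdj is described as having.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy tree node with a parent link (assumption: stands in for a DOM node).
struct TreeNode {
    explicit TreeNode(std::string n) : name(std::move(n)), parent(nullptr) {}
    std::string name;
    TreeNode* parent;
    std::vector<TreeNode*> children;
};

// Links `child` under `parent`; helper for building toy trees.
inline void attach(TreeNode& parent, TreeNode& child) {
    child.parent = &parent;
    parent.children.push_back(&child);
}

// Pre-order walker bounded by the node it started from.
class SubtreeWalker {
public:
    explicit SubtreeWalker(TreeNode* root) : m_root(root), m_current(root) {}

    TreeNode* current() const { return m_current; }  // nullptr once finished
    bool done() const { return m_current == nullptr; }

    // Step to the next node in pre-order: first child if any, else adjacent.
    void next() {
        if (m_current == nullptr) return;
        if (!m_current->children.empty()) {
            m_current = m_current->children.front();
        } else {
            skipSubtree();
        }
    }

    // Skip the current subtree: move to the next sibling, climbing towards
    // the starting node when there is none. Reaching it ends the walk.
    void skipSubtree() {
        while (m_current != nullptr && m_current != m_root) {
            TreeNode* parent = m_current->parent;
            const std::vector<TreeNode*>& siblings = parent->children;
            for (std::size_t i = 0; i + 1 < siblings.size(); ++i) {
                if (siblings[i] == m_current) {
                    m_current = siblings[i + 1];
                    return;
                }
            }
            m_current = parent;  // last child: climb one level and retry
        }
        m_current = nullptr;  // back at the starting node: the walk is over
    }

private:
    TreeNode* m_root;     // node the walk started from; never left
    TreeNode* m_current;  // current position, or nullptr when done
};
```

Because the walker never climbs above m_root, a loop such as for (SubtreeWalker w(&node); !w.done(); w.next()) visits exactly the subtree under node, and calling skipSubtree() instead of next() avoids descending into a node that has already been processed.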
The complete program looks like this:

    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <algorithm>
    #include <string>
    #include <gumbo.h>
    #include "Gumbo.hpp"

    using namespace std;

    // Reads the whole file into a string.
    std::string readAll(const std::string &fileName)
    {
        std::ifstream ifs;
        ifs.open(fileName);
        ifs.seekg(0, std::ios::end);
        size_t length = ifs.tellg();
        ifs.seekg(0, std::ios::beg);
        std::string buff(length, 0);
        ifs.read(&buff[0], length);
        ifs.close();
        return buff;
    }

    int main(int argc, char *argv[])
    {
        if (argc != 2)
        {
            return 0;
        }

        using namespace EasyGumbo;
        auto page = readAll(argv[1]);
        Gumbo parser(&page[0]);

        Gumbo::iterator iter = parser.begin();
        while (iter != parser.end())
        {
            // Title and URL of the next post.
            iter = std::find_if(iter, parser.end(),
                                And(Tag(GUMBO_TAG_A), HasAttribute("class", "post_title")));
            if (iter == parser.end())
            {
                break;
            }
            Element titleA(*iter);
            auto text = titleA.children()[0];
            std::cout << "***\n";
            std::cout << std::setw(8) << "Title" << " : " << text->v.text.text << std::endl;
            std::cout << std::setw(8) << "Url" << " : " << titleA.attribute("href")->value << std::endl;

            // Hubs: every a tag inside the div with class "hubs".
            iter = std::find_if(iter, parser.end(),
                                And(Tag(GUMBO_TAG_DIV), HasAttribute("class", "hubs")));
            std::cout << std::setw(8) << "Hubs" << " : ";
            auto hubs = findAll(iter.fromCurrent(), parser.end(), Tag(GUMBO_TAG_A));
            for (auto hub: hubs)
            {
                Element hubA(*hub);
                if (hub != hubs[0])
                {
                    std::cout << ", ";
                }
                std::cout << hubA.children()[0]->v.text.text;
            }
            std::cout << std::endl;

            // View count: the text node right after the matching div.
            iter = std::find_if(iter, parser.end(),
                                And(Tag(GUMBO_TAG_DIV), HasAttribute("class", "views-count_post")));
            ++iter;
            std::cout << std::setw(8) << "Views" << " : " << iter->v.text.text << std::endl;

            // Favorites count: the text node right after the matching span.
            iter = std::find_if(iter, parser.end(),
                                And(Tag(GUMBO_TAG_SPAN), HasAttribute("class", "favorite-wjt__counter js-favs_count")));
            ++iter;
            std::cout << std::setw(8) << "Stars" << " : " << iter->v.text.text << std::endl;

            // Author name.
            iter = std::find_if(iter, parser.end(),
                                And(Tag(GUMBO_TAG_A), HasAttribute("class", "post-author__link")));
            Element authorA(*iter);
            std::cout << std::setw(8) << "Author" << " : " << authorA.children()[2]->v.text.text << std::endl;

            // Comment count, printed only when exactly one link is found.
            iter = std::find_if(iter, parser.end(),
                                And(Tag(GUMBO_TAG_DIV), HasAttribute("class", "post-comments")));
            auto comments = findAll(iter.fromCurrent(), parser.end(), Tag(GUMBO_TAG_A));
            if (comments.size() == 1)
            {
                Element commentsA(*comments[0]);
                std::cout << std::setw(8) << "Comments" << " : " << commentsA.children()[0]->v.text.text << std::endl;
            }
        }
        return 0;
    }

As always, the code is available on GitHub.

Source: https://habr.com/ru/post/280270/
