
On a gloomy autumn morning, as an experiment, we set up our search for Habr, with structure and speed. The whole job took about 10 minutes. If you are too lazy to read,
click here to see the new search (the search engine is embedded right in the body of a blog post).
To build this search, we did not request access to the database or push articles in through our API. Everything is done very simply, with an ordinary crawler. For this example, we crawled about 5000 articles.
Background
Hello. Let me remind you that we are building Indexisto, a fast structured search for websites (see also our earlier post on Habré). Nothing has been heard from us for a long time, and now we are finally coming out with a release.
After the first publication, many people liked what we were doing and wanted to plug it in. I am very grateful to the first adopters who connected: they surfaced a lot of, let's say, “subtleties” that we would never have found on our own. We fixed everything fairly quickly, and never dropped a single “live” index in the process. However, there was a structural problem:
- A very high barrier to entry: a complicated connection through the database, and complex configuration of templates, search queries, analyzers, and so on.
We worked around this by configuring clients' setups by hand (for example, you can see our results on maximonline.ru) and by telling ourselves that this is what early adopters are for. Meanwhile development (apart from bug fixes) almost ground to a halt, and we realized that we were either turning into a systems integrator or had to change something to remain an Internet project.
How things developed
Today we want to present a radical solution to the connection problem: you just enter the site's URL and get a ready-made search. That's it.

Everything else happens automatically. A large set of complex settings is taken from a ready-made template and applied to your index. At the same time, the admin panel has been pared down to the essentials: checkboxes and drop-down lists. For hardcore fans, we kept the option to switch to advanced mode.
Crawler and parser
So, we now have a crawler and a content parser. The crawler delivers reasonably sensible pages: we have more or less learned to discard pagination, various feeds, and view variations (like ?sort=date.asc). But even if the crawler works perfectly, article pages still contain a ton of excess: the menu, the blocks in the left and right columns. Let's face it, we would not want to see any of that in the search results if we are to stick to our positioning.
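To make that URL filtering concrete, here is a rough sketch in Python; the patterns are purely illustrative assumptions, not our actual crawler rules.

import re
from urllib.parse import urlsplit, parse_qsl

# Illustrative patterns only (assumptions): skip pagination, feeds and
# view variations such as ?sort=date.asc before a URL is crawled.
SKIP_PATH_PATTERNS = [
    re.compile(r"/page/\d+/?$"),    # pagination
    re.compile(r"/(rss|feed)/?$"),  # feeds
]
SKIP_QUERY_PARAMS = {"sort", "view", "order"}

def is_worth_indexing(url: str) -> bool:
    parts = urlsplit(url)
    if any(p.search(parts.path) for p in SKIP_PATH_PATTERNS):
        return False
    if any(key in SKIP_QUERY_PARAMS for key, _ in parse_qsl(parts.query)):
        return False
    return True

print(is_worth_indexing("https://example.com/post/123/"))             # True
print(is_worth_indexing("https://example.com/posts/?sort=date.asc"))  # False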
This is where our über-system, the parser, comes in: it lets you extract arbitrary data from a page.
Conceptually, the system combines two approaches:
- automatic content extraction based on algorithms, for example Boilerplate Detection using Shallow Text Features. That will be the subject of a separate post.
- data extraction "in the forehead" using xpath . Let me remind you, if by simple - xpath allows you to search for text in certain tags, for example
//span[contains(@class, 'post_title')]
- pull the title from the span tag with the class post_title .
The system can operate with no additional configuration at all, or with manual settings for a specific site.
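For readers who have never touched XPath, here is a tiny illustration of what the expression above does, using Python and lxml (just a demo, not our code):

from lxml import html

# Parse a minimal page and pull the title out of the span with class post_title.
page = html.fromstring(
    "<html><body>"
    "<span class='post_title'>Structured search for Habr</span>"
    "<div class='html_format'>Article text...</div>"
    "</body></html>"
)
titles = page.xpath("//span[contains(@class, 'post_title')]")
print(titles[0].text_content())  # Structured search for Habr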
Parser masks for content extraction
We store all the XPath settings in masks.
The parser takes a page as input and runs it through the masks, passing it from one to the next. Each mask tries to extract something from the HTML page and add it to the resulting document: the title, an image, the article text. For example, there is a mask that extracts Open Graph tags and appends their contents to the document:
<mask name="ogHighPrecision" level="0.50123"> <document name="ogHighPrecisionTags"> <field name="_url">//meta[contains(@property, 'og:url')]/@content</field> <field name="_subtype">//meta[contains(@property, 'og:type')]/@content</field> <field name="_image">//meta[contains(@property, 'og:image')]/@content</field> <field name="title">//meta[contains(@property, 'og:title')]/@content</field> <field name="description">//meta[contains(@property, 'og:description')]/@content</field> <field name="siteName">//meta[contains(@property, 'og:site_name')]/@content</field> </document> </mask>
As you can see, masks are described in XML. The code needs no special explanation)
We have quite a few such masks: for Open Graph, microdata, garbage pages marked with noindex, and so on.
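To show what running such a mask actually yields, here is an illustrative snippet (Python with lxml, not the real Indexisto code) that loads a trimmed-down ogHighPrecision mask and applies it to a page:

import xml.etree.ElementTree as ET
from lxml import html

# A trimmed-down version of the ogHighPrecision mask shown above.
MASK_XML = """
<mask name="ogHighPrecision" level="0.50123">
  <document name="ogHighPrecisionTags">
    <field name="title">//meta[contains(@property, 'og:title')]/@content</field>
    <field name="_image">//meta[contains(@property, 'og:image')]/@content</field>
  </document>
</mask>
"""

PAGE_HTML = """
<html><head>
  <meta property="og:title" content="Structured search for Habr"/>
  <meta property="og:image" content="https://example.com/cover.png"/>
</head><body>...</body></html>
"""

# Read the field name -> XPath mapping from the mask and run each XPath on the page.
fields = {f.get("name"): f.text for f in ET.fromstring(MASK_XML).iter("field")}
page = html.fromstring(PAGE_HTML)
document = {name: page.xpath(xpath) for name, xpath in fields.items()}
print(document)
# {'title': ['Structured search for Habr'], '_image': ['https://example.com/cover.png']}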
So, in principle, you can just enter a site's address and get acceptable search results.
However, many want not just acceptable results, but perfect ones. For that, we give you the option of writing the XPath yourself.
Custom masks
Without further ado, here is how we extracted data from Habr:
<?xml version="1.0" encoding="UTF-8"?> <mask name="habrahabrBody" level="0.21"> <allowUrl>/company/</allowUrl> <allowUrl>/events/</allowUrl> <allowUrl>/post/</allowUrl> <allowUrl>/qa/</allowUrl> <document name="habrahabrBody"> <field name="body" required="true">//div[contains(@class, 'html_format')]</field> <field name="title" required="true">//span[contains(@class, 'post_title')]</field> </document> </mask>
The code needs no explanation) In essence, we told this mask: work only on post, company, event, and Q&A pages, take the article body from the div with the html_format class, and the title from the span with the post_title class. Image extraction happens at the level of the system (built-in) masks via the Open Graph tag, so we did not mention the image in our mask at all.
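To tie it all together, here is a rough sketch of how such a mask pipeline might behave. Everything in it is an assumption made for illustration: the dictionary representation of masks, the substring matching of allowUrl, and the "earlier masks win" merge rule are simplifications, not the real Indexisto parser.

from lxml import html

# Assumed, simplified mask representation: allowed URL substrings plus
# a mapping of field name -> (XPath, required flag).
MASKS = [
    {   # custom mask, analogous to habrahabrBody above
        "name": "habrahabrBody",
        "allow_url": ["/company/", "/events/", "/post/", "/qa/"],
        "fields": {
            "body":  ("//div[contains(@class, 'html_format')]", True),
            "title": ("//span[contains(@class, 'post_title')]", True),
        },
    },
    {   # system-level mask: picks up the image from the Open Graph tag
        "name": "ogImage",
        "allow_url": [],  # empty list = allowed on any URL
        "fields": {
            "_image": ("//meta[contains(@property, 'og:image')]/@content", False),
        },
    },
]

def apply_masks(url, page_html):
    page = html.fromstring(page_html)
    document = {}
    for mask in MASKS:
        if mask["allow_url"] and not any(p in url for p in mask["allow_url"]):
            continue  # this mask is not allowed to run on this URL
        extracted, ok = {}, True
        for name, (xpath, required) in mask["fields"].items():
            nodes = page.xpath(xpath)
            if not nodes:
                if required:
                    ok = False  # a required field is missing: drop this mask's output
                    break
                continue
            node = nodes[0]
            extracted[name] = node if isinstance(node, str) else node.text_content()
        if ok:
            for name, value in extracted.items():
                document.setdefault(name, value)  # earlier masks take priority
    return document

With such a scheme, a page under /post/ passes the allowUrl check of the custom mask and gets its body and title from it, while the _image field still comes from the system Open Graph mask, exactly as described above.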
In the future we will try to make this process even easier, along the lines of Google's Webmaster Tools panel (video).