Vertical search engines - some parts of the report

Excerpts from my talk at SPIC in St. Petersburg.

Before the talk, several people asked me what a vertical search engine actually is, so I added a few clarifying points ...

If by vertical search we mean structured information on a single topic, then the idea of vertical or niche search engines is far from new. In the early days of the Internet (when there were no major players and the market had not yet taken shape), webmasters built thematic sites: for example, car search, car news, reviews, a catalog of car-related sites.

In RuNet, for example, Avto.ru is a typical vertical: the information is structured and covers one subject. The difference is that its database is filled by the users themselves and does not consist of ads from other sites. The price comparison site kelkoo.com (since 2000) and the price aggregator prices.ru, which appeared 10 years ago, are also verticals to some extent. Their database is filled by site editors who supply content in a given format.

In general, to avoid getting lost in all this diversity, I propose to treat vertical search as search across sites of the same subject, and a vertical search engine as a distributor of traffic rather than a holder of it.

How is vertical search better than horizontal search?

1. Everything in one pile.

Let's enter the query "nissan x-trail new" in Rambler and Google. In the top ten we get links to ...

SERP pos	Rambler	Google
1	Dealer	Dealer
2	Car catalog	Dealer - model card
3	Dealer	Car review
4	Dealer	Paid article ("jeans")
5	Photo album	Review (same as position 3)
6	Forum post	Catalog, lineup
7	News (or rather, a junk page)	News
8	News	Dealer (for some reason, Chinese cars)
9	News	Review
10	News	Dealer

As you can see, results from sources that differ completely in nature and structure (news, descriptions, prices, catalogs, etc.) are presented as one long numbered list, with little or no means of sorting them and with the structure of the original data lost.

2. Multi-click

Search results are a list of links. To get the information, the user has to click again on one of those links (once, at best!). If the user is looking for information to compare (prices, terms, etc.), he has to click through links for a long time, and it takes a lot of patience to put together a complete picture.

3. Garbage

A large part of the Internet consists of textual garbage: spam sites, paid articles ("jeans"), reprints, junk, and deliberately distorted information. Moreover, these sites are increasingly hard to tell apart from "normal" ones, especially since large resources are spent on promoting all this garbage. In this situation it becomes ever harder to pick out (objectively, in a way a machine can apply) the sources of authoritative, reliable and relevant information.

On the other hand, horizontal search has its own strength: even a thousand verticals (and there are hardly more than 50 popular topics) cannot cover the entire Internet.

What kinds of verticals are there? A bit of classification

By how the information is obtained:

We collect the content and normalize it, i.e. bring it "to a common denominator". Examples: 100work, avto.yandeks.ru.
Content providers supply the content themselves in a uniform format. Example: price.ru.
Mixed: we collect content and also accept it from suppliers.
Web-based (search across selected sites). Example: YellowSearch.

By topic: news, mp3 files, videos, books, source code, electronics, dictionaries. I think examples are not needed.

By type of information: text, pictures, video, music.

By geography (web-based): countrywide or regional. For example, a search engine covering only the sites of one city.

Pitfalls in the development of vertical search engines

The topic of job-vacancy verticals is covered in more detail here.
In general, the problems are as follows:

Lack of interest from the large players (who have aggregated a large share of the market's data). There is a risk that players will collude to shut out any vertical search (and it is hard to fall back on offline ads: newspaper ads are short and uninformative, and so do not fit the format).
Thematic databases. These are the same verticals, but offline (available to professional players). They exist, for example, in real estate and in tourism. They will always be more relevant and more complete than what appears on the Internet.
Verticals are hard to monetize. There are strong offline competitors (jobs, tourism, leisure and entertainment, construction, real estate, beauty and health, cars, goods, etc.) that hold most of the client's budget.

Each vertical search strives to be as complete as possible (to contain the most data), as up to date as possible, and to present its data as well as possible. I think that in a year or two we will have to think about what to offer users on top of all this ...

Why did we build Beta?

Beta is an experimental project. It is a set of specialized (vertical) searches “implanted” into the body of traditional search across the pages of Internet sites. It combines the comprehensiveness of Internet search with the ability to structure results by topic, and it searches various sources of information simultaneously, in one click.

The objectives of the project were entirely pragmatic:
1. Collect user feedback on the new design, interfaces, visual solutions and new verticals (for example, “reviews and user opinions”).
2. Collect statistics (among other things, to improve the relevance of verticals) and run various studies. For example, when the query is a product name such as “canon 40d”, what do users most often mean? Buy a camera? Read a review? Find out the news? And when the query is "card" (the Russian word also means "map")? A geographic map? (And of what?) A graphics card? A playing card?
3. Monetization.

How do we determine the relevance of verticals?

Static relevance

We have assumptions about which vertical a given query belongs to. Currently these are based on the frequency of the query words in a particular subject (vertical package) plus a manually defined list of keywords and expressions. In the first step we make an initial assessment of the query, called static relevance. The assessment is produced by the internal module QueryBroker. There is a lower threshold of static relevance; a vertical is polled only if the query reaches it.
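As a rough illustration of this first step, here is a minimal Python sketch of a static relevance estimate. The scoring rule (average relative frequency of the query words, plus a fixed bonus for a manually listed keyword), the sample data and the names static_relevance, verticals_to_poll, VERTICALS, POLL_THRESHOLD and KEYWORD_BONUS are all assumptions made for the example; the talk does not describe how QueryBroker actually computes the score.

```python
from collections import Counter

# Hypothetical per-vertical data: word frequencies in the vertical's texts
# and a manually defined keyword list. All numbers are made up.
VERTICALS = {
    "cars": {
        "word_freq": Counter({"nissan": 120, "x-trail": 40, "new": 300, "dealer": 90}),
        "keywords": {"x-trail", "test drive"},
    },
    "jobs": {
        "word_freq": Counter({"new": 50, "vacancy": 400, "salary": 350}),
        "keywords": {"vacancy", "resume"},
    },
}

POLL_THRESHOLD = 0.5  # assumed lower threshold: below it the vertical is not polled
KEYWORD_BONUS = 0.3   # assumed bonus when a manually listed keyword occurs


def static_relevance(query, vertical):
    """Estimate, in [0, 1], how likely the query belongs to the vertical."""
    words = query.lower().split()
    if not words:
        return 0.0
    freq = vertical["word_freq"]
    top = max(freq.values())
    # Average frequency of the query words relative to the most frequent word.
    score = sum(freq[w] / top for w in words) / len(words)
    if any(keyword in query.lower() for keyword in vertical["keywords"]):
        score += KEYWORD_BONUS
    return min(score, 1.0)


def verticals_to_poll(query):
    """Return the verticals whose static relevance reaches the threshold."""
    return [name for name, vertical in VERTICALS.items()
            if static_relevance(query, vertical) >= POLL_THRESHOLD]
```

With this sample data, verticals_to_poll("nissan x-trail new") selects only the "cars" vertical, so the "jobs" vertical is never queried.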

Dynamic and Resultant Relevance

Dynamic relevance is the vertical's own assessment of how well it matches the query. This decision can be based on a number of estimates, for example the number of results currently available for the query. The algorithms for determining dynamic relevance are agreed with each vertical separately.

The resulting relevance is computed by a formula whose main parameters are the static and dynamic relevance, along with other parameters and constants. Verticals are sorted by their resulting relevance. If the resulting relevance is below a certain value, the vertical is not shown.
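The formula itself is not given in the talk, so the following Python sketch only shows the general shape of the step: combine the two scores, drop verticals below a cutoff, and sort the rest. The weighted sum, the weights, the threshold, the result-count rule for dynamic relevance and all names (VerticalAnswer, dynamic_relevance, resulting_relevance, order_verticals) are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed weights and cutoff; the real formula and constants are not published.
W_STATIC = 0.4
W_DYNAMIC = 0.6
SHOW_THRESHOLD = 0.35


@dataclass
class VerticalAnswer:
    name: str
    static_rel: float   # estimated before polling, in [0, 1]
    result_count: int   # reported by the vertical for this query


def dynamic_relevance(result_count, saturation=100):
    """Assumed rule: more results means a better match, capped at 1."""
    return min(result_count / saturation, 1.0)


def resulting_relevance(answer):
    """Assumed combination: a weighted sum of static and dynamic relevance."""
    return (W_STATIC * answer.static_rel
            + W_DYNAMIC * dynamic_relevance(answer.result_count))


def order_verticals(answers):
    """Drop verticals below the cutoff and return the rest, best first."""
    scored = [(resulting_relevance(a), a.name) for a in answers]
    shown = [(score, name) for score, name in scored if score >= SHOW_THRESHOLD]
    return [name for score, name in sorted(shown, reverse=True)]
```

For example, given answers for "news" (static 0.6, 40 results), "cars" (static 0.8, 250 results) and "jobs" (static 0.1, no results), order_verticals places "cars" before "news" and hides "jobs".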

The result of applying the resulting relevance to the ordering of verticals can be seen in our new car vertical.

XAG

The XAG system (eXtended AGgregator) is the core of our vertical search. It handles receiving, analyzing and processing information, as well as searching through it. What makes the system unique is that it can be adapted to a new vertical (subject area) relatively easily, without spending many resources.

Data collection. For each site we create a parser application that extracts the required information from the HTML document. For example, in job search we extract such parameters as the vacancy title, company name, salary, description, etc. Moreover, the extraction runs in semi-automatic mode.
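A per-site parser of this kind might look roughly like the Python sketch below, which uses only the standard html.parser module. The site name, the CSS class names in SITE_RULES and the helper names VacancyParser and parse_vacancy are hypothetical; XAG's real parsers are not described in the talk, and this version only handles flat markup where each field sits in its own element.

```python
from html.parser import HTMLParser

# Hypothetical rules for one site: which CSS class holds which vacancy field.
SITE_RULES = {
    "example-jobs": {
        "title": "vac-title",
        "company": "vac-company",
        "salary": "vac-salary",
    },
}


class VacancyParser(HTMLParser):
    """Collects the text of elements whose class matches a field rule."""

    def __init__(self, rules):
        super().__init__()
        self.class_to_field = {cls: field for field, cls in rules.items()}
        self.current_field = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.class_to_field:
                self.current_field = self.class_to_field[cls]

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field.
        self.current_field = None

    def handle_data(self, data):
        text = data.strip()
        if self.current_field and text:
            self.record[self.current_field] = text


def parse_vacancy(site, html):
    """Extract the configured fields from one vacancy page of the given site."""
    parser = VacancyParser(SITE_RULES[site])
    parser.feed(html)
    parser.close()
    return parser.record
```

Keeping the rules in a table separate from the parser class means a new site needs only a new entry in SITE_RULES, which is one plausible way to keep the cost of adding sources low.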

Data processing. This consists of analyzing the collected information by generalizing and structuring it. For example, if we have a database of most employers, and a posting does not name the employer but gives only a phone number, we can determine the employer's name from that number. The database can also be updated with new data about employers. In this way we can identify staffing agencies even when this is not stated explicitly on the site. We also detect duplicate vacancies and vacancies of a dubious nature, such as network marketing. The search indexes used directly in search are built from the "cleaned" data.
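The cleaning steps described above can be sketched in Python as follows. The employer table, the marker list, the duplicate key (title, company, phone) and the names normalize_phone and clean_vacancies are assumptions for illustration only; the criteria XAG actually uses are not given in the talk.

```python
import re

# Hypothetical employer database keyed by normalized phone number.
EMPLOYERS_BY_PHONE = {"78121234567": "Acme Staffing"}
# Assumed markers of dubious vacancies.
DUBIOUS_MARKERS = ("network marketing",)


def normalize_phone(raw):
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw or "")


def clean_vacancies(vacancies):
    """Fill in missing employers by phone, drop dubious and repeated vacancies."""
    seen = set()
    cleaned = []
    for original in vacancies:
        vac = dict(original)  # work on a copy, leave the input untouched
        phone = normalize_phone(vac.get("phone"))
        if not vac.get("company") and phone in EMPLOYERS_BY_PHONE:
            vac["company"] = EMPLOYERS_BY_PHONE[phone]
        title = (vac.get("title") or "").lower()
        description = (vac.get("description") or "").lower()
        if any(marker in title + " " + description for marker in DUBIOUS_MARKERS):
            continue
        # Assumed duplicate key: same title, company and phone.
        key = (title, (vac.get("company") or "").lower(), phone)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(vac)
    return cleaned
```

Resolving the employer before building the duplicate key matters here: two postings of the same vacancy, one with a company name and one with only a phone number, then collapse into a single record.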

This takes synonyms of expressions into account: for example, “med. insurance” and “medical insurance” correspond to one term. By the way, synonyms will also be applied to company names, such as "Google OJSC" and "Google". The same is planned for vacancy titles: “interface specialist” and “usability”.
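One simple way to handle such synonyms is to map every variant to a canonical form before indexing and before searching. The Python sketch below assumes a hand-made synonym table and a list of legal-form words to strip from company names; TERM_SYNONYMS, LEGAL_FORMS, canonical_term and canonical_company are hypothetical names, not part of XAG.

```python
import re

# Hypothetical synonym table: variant -> canonical term.
TERM_SYNONYMS = {
    "med. insurance": "medical insurance",
    "usability": "interface specialist",
}
# Assumed legal-form words that are ignored in company names.
LEGAL_FORMS = {"ojsc", "cjsc", "llc"}


def canonical_term(text):
    """Lowercase, collapse whitespace, then replace a known variant."""
    key = " ".join(text.lower().split())
    return TERM_SYNONYMS.get(key, key)


def canonical_company(name):
    """Lowercase the name and drop legal-form words such as OJSC."""
    words = re.findall(r"\w+", name.lower())
    return " ".join(w for w in words if w not in LEGAL_FORMS)
```

With these tables, canonical_company("Google OJSC") and canonical_company("Google") produce the same key, so both spellings end up under one index entry.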

What will be in Beta 2.0

All of that will be covered at a conference in Kaliningrad, where I am going at the end of the week. They say it is a very beautiful city.

P.S. “What is the difference between the old search and the new one? You still have to type in what you are looking for ...” (from user reviews).

Source: https://habr.com/ru/post/27194/
