Smart search, or what can be improved in site search

There is probably no need to describe every search technology in detail as background to the problem our project solves; interested readers most likely know them already. So I will go over them briefly, only to identify the problem itself.

For the most part, site search as built by developers comes down to keyword search that takes into account the proximity of the query words to each other, plus various ranking options based on word co-occurrence. Add some morphology, synonyms, and sometimes, as at RCO for example, certain aspects of query syntax used to set search operators (see the publication on search on their website). That is essentially where search technology for a limited corpus of documents ends. The cores of these tools are built into the Sphinx and Lucene search engines, so they are available to any mere mortal programmer.
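The keyword search with word proximity described above can be illustrated with a small sketch. This is not how Sphinx or Lucene actually score documents; the function names and the scoring formula (number of query words divided by the smallest window containing all of them) are assumptions chosen only to show the idea.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def proximity_score(query, document):
    """Toy proximity score (hypothetical formula, for illustration only).

    Returns 0.0 if any query word is missing from the document.
    Otherwise returns len(query words) / smallest window that contains
    every query word, so 1.0 means the words are adjacent.
    """
    words = list(dict.fromkeys(tokenize(query)))  # unique, order kept
    tokens = tokenize(document)
    positions = [[i for i, t in enumerate(tokens) if t == w] for w in words]
    if not words or any(not p for p in positions):
        return 0.0
    # Brute force over every combination of positions: fine for a sketch,
    # too slow for long documents with frequent words.
    span = min(max(c) - min(c) + 1 for c in product(*positions))
    return len(words) / span


adjacent = proximity_score("site search", "site search works")
spread = proximity_score("site search", "search on the site")
missing = proximity_score("site search", "a site map")
```

Here the adjacent match scores higher than the spread-out one, and a document missing a query word scores zero. Nothing in this score knows what the words mean, which is exactly the limitation discussed in the rest of the post.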

As a result, all we have for site search is keyword search, extended with morphology and sometimes synonyms. But site search is not Internet search. The results are much worse, and here is why.

On the Internet, searching is easier because there is (almost) always a text phrased in the same words as the query. That is why keyword search works there. Ranking can also rely on factors such as millions of views of the results for the same query, links from other sources, and so on. A corporate website or document base has nothing like that: the number of texts is far smaller than on the whole Internet, there are even fewer users, and many pages are not linked to by anyone at all. According to one search engine, it uses about 100 parameters for searching and ranking results on corporate websites. For comparison, the same engine uses more than 1000 different parameters for searching the Internet.

As a result, keyword search under such limited conditions gives sad results. An indirect sign of this is that the search box on most sites sits somewhere in a corner, as if it were not the most important interface element, while every site tries to build universal navigation through sections and links. I will not explain why this is bad for the user; I will only say that it ends in either a lot of sections or long texts that nobody reads. Of course, there are ontology-based technologies such as ABBYY's Compreno. But they are not in mass use because of their cost, and they are applicable only in the domains their engineers have already worked through. We are talking about mere mortals.

We decided to add some semantics to the search. This is the only way to search in a limited corpus of texts, such as a website or corporate documents. We believe that, to achieve the best results, search should be driven not so much by keywords as by the semantic similarity between the query and the text being searched: the similarity of the entire query, not of individual words from it. Almost all words in a language are polysemous, and their meaning depends mostly on context, that is, on how they are used together. That is why search by an individual keyword does not work: a single word found in a text may be there for an entirely different reason. Even a search by one word (a topic) should be more precise: the result should correspond to the most frequently used sense of the word, in the hope that this is the sense the user meant.

We look for similarity "in meaning", that is, taking into account all the words of the query and, accordingly, the subject of the query. Results are ranked by semantic similarity parameters rather than by the frequency or co-occurrence of the query keywords (although those are also taken into account as one of the parameters). The result is very encouraging: when searching documents, the top of the results contains only those sentences from the texts that are most relevant to the query. We also strive to make the answer unambiguous, in the form of a single result in the output. Soon we will open several demo pages on the site where our technologies are implemented, and I will describe them here.
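To make the idea of ranking sentences against the whole query more concrete, here is a minimal sketch. It is not our method, which this post does not describe: plain word counts and cosine similarity stand in for a real semantic representation, and the names vectorize, cosine and rank_sentences are hypothetical. What it does show is the shape of the approach: split documents into sentences, compare each sentence with the entire query, and return the best one.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Stand-in representation: word counts, NOT a semantic model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(query, documents, top_k=1):
    """Score every sentence against the whole query; return the top_k."""
    q = vectorize(query)
    sentences = [
        s.strip()
        for doc in documents
        for s in re.split(r"(?<=[.!?])\s+", doc)  # naive sentence split
        if s.strip()
    ]
    scored = [(cosine(q, vectorize(s)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


docs = ["The bank raised interest rates. We walked along the river bank."]
best = rank_sentences("bank interest rates", docs)
```

Both sentences contain the word "bank", but only the first one matches the query as a whole, so it is the single result returned. A real implementation would replace vectorize with a representation that captures meaning rather than surface words.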

For now, we will not talk about how and by what means we achieve this result. But gradually, in posts about each individual project, the ways we achieve it will stop being a secret. In the meantime, we invite you to contact us; all features will be available through an API.

Source: https://habr.com/ru/post/289866/
