Full-text site search: the scourge of the modern Internet

Implementing good site search is a badly undervalued task. Search is a weak point of so many sites that when I see a search box, I immediately expect a fiasco. So, to avoid being disappointed yet again, I send my question straight to Google or Yandex and quickly find what I need. What can be done to improve this situation?

A site search form from Yandex or Google

The creators of the popular search engines understand search best and have already built it for us. We can take advantage of their work simply by placing a Yandex or Google search form on the site. This is the easy way, but it has drawbacks:

The search may not cover every page of the site. The search engine does not guarantee that all pages will be indexed, and some pages may not be reachable by the search robot at all.
There is a long delay between a new page appearing on the site and it becoming available in search.
You cannot refine the search, for example by limiting it to one subsection of the site or to a price range.
You cannot fully integrate the search results into the site's design. For most reputable portals this cancels out all the advantages of such a search.
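
In its simplest form, this approach boils down to forwarding the visitor's query to the search engine with a restriction to your domain. Below is a minimal sketch, assuming Google's public `site:` query operator; the function name and the `example.com` domain are hypothetical. It also shows why refinements are hard: the only thing you control is the query string.

```python
from urllib.parse import urlencode


def site_search_url(query: str, domain: str) -> str:
    """Build a Google web-search URL limited to one domain via the site: operator."""
    params = {"q": f"site:{domain} {query}"}
    return "https://www.google.com/search?" + urlencode(params)


# Hypothetical domain and query, for illustration only.
print(site_search_url("delivery terms", "example.com"))
```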

This is an incomplete list of the problems that a newcomer to site search may run into. So this solution can be recommended only for sites that are not particularly concerned about their commercial image.

Search quality

First we need to understand what search quality consists of. It depends on many factors. You can read about many of them in the book by Igor Ashmanov, a well-known search engine optimization specialist and Candidate of Technical Sciences. (Between us, I recently saw it on torrents.ru.) All the factors can be divided into three categories: completeness, accuracy and ranking.

Completeness

Completeness is how many of the site's pages the search actually covers. There are two approaches to indexing data for search: "from the inside" and "from the outside".

"From the inside" means indexing the site's source data, which is usually stored in a database. This keeps "junk" pages out of the search results, but it carries the risk of reducing completeness.
"From the outside" means indexing by a search robot. In most cases this approach guarantees high completeness, but it also creates many problems, which will be described in future articles.

If a user sees a search box on the site with no accompanying text, he expects that entering the query "contacts" will take him to the contacts page. If it does not, that is the webmaster's mistake, because the customer is always right :)

The usual reason is that most sites search only dynamic data, because the search code pulls its data from the database. Moreover, the webmaster (or the CMS author) usually decides which database tables are the most important and which are unworthy of attention. As a result, some "insignificant" dynamic data and all static pages are left out of the search.
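
The situation described above is easy to reproduce. The following is a minimal sketch of "from the inside" indexing, assuming a hypothetical `articles` table in SQLite; the table, its columns and the sample rows are invented for illustration. Because only that table is indexed, a query for "contacts" finds nothing even if the site has a static contacts page.

```python
import re
import sqlite3
from collections import defaultdict


def build_index(conn):
    """Map each lowercase word to the set of article ids that contain it."""
    index = defaultdict(set)
    rows = conn.execute("SELECT id, title, body FROM articles")
    for article_id, title, body in rows:
        for word in re.findall(r"\w+", f"{title} {body}".lower()):
            index[word].add(article_id)
    return index


# Hypothetical data: the only table the search code knows about.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title, body) VALUES (?, ?, ?)",
    [
        (1, "Delivery", "We deliver goods by courier"),
        (2, "Payment", "Pay by card or cash on delivery"),
    ],
)

index = build_index(conn)
print(sorted(index["delivery"]))  # ids of the articles that mention "delivery"
print(sorted(index["contacts"]))  # nothing: static pages never reached the index
```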

On the other hand, if we aim for maximum completeness, the results may fill up with "junk" and duplicate pages, which also hurts user loyalty.
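
One common way to limit duplicates when indexing "from the outside" is to compare page texts before adding them to the index. This is a minimal sketch under the assumption that duplicates differ only in case and whitespace; the URLs and helper names are hypothetical, and real crawlers typically need fuzzier comparisons.

```python
import hashlib
import re


def content_key(text: str) -> str:
    """Hash of the text with case and whitespace differences removed."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep the first URL seen for each distinct page text."""
    seen = set()
    unique = []
    for url, text in pages:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


# Hypothetical crawl result: the print version duplicates the article.
pages = [
    ("/news/1", "Site search is hard."),
    ("/news/1?print=1", "Site  search is hard.\n"),
    ("/contacts", "Call us any time."),
]
print(drop_duplicates(pages))
```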

Accuracy

Search accuracy describes how well the pages found match the search query. It covers handling morphology, resolving homonymy, tolerating typos, searching by synonyms and so on. For example, if the user searches for "the number of Arshavin's goals" (in Russian this word form coincides with the one for "heads"), it is clear that heads have nothing to do with it and only information about goals scored should be shown. Here is another interesting example of homonymy. But that is advanced territory; the simplest thing the user wants is a search across all possible word forms.
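
Of the accuracy features listed above, typo tolerance is the easiest to sketch. The example below uses `difflib.get_close_matches` from the Python standard library to replace an unknown query word with the closest word from the index vocabulary. The vocabulary, the function name and the 0.8 cutoff are assumptions for illustration; production engines use more elaborate methods.

```python
import difflib


def correct_query(query, vocabulary, cutoff=0.8):
    """Replace each unknown word with its closest vocabulary word, if any."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Leave the word untouched when nothing is similar enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


# Hypothetical vocabulary taken from the site's index.
vocabulary = ["delivery", "payment", "contacts", "warranty"]
print(correct_query("delivry contcts", vocabulary))
```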

Various algorithms are used to handle morphology: stemmers, morphological dictionaries and hybrid algorithms. All of them are imperfect to some degree. For example, the word "is" can take forms such as "was" and "will be" (and in Russian the same word also means "to eat"), which share no common stem. A simple stemmer will not cope with this. A morphological dictionary is unlikely to contain the forms of a slang word such as "upyachka". More complex hybrid algorithms that combine dictionaries with sets of heuristics do better, but even they are not perfect. The current situation with morphology is roughly as follows:

When searching a database with SQL tools alone, a stemmer is usually used. This is the weakest way of handling morphology.
Open source search engines such as Sphinx, Lucene and Xapian usually let you plug in your own morphological analyzer, but the built-in algorithm for Russian is usually a stemmer as well.
Yandex.Server, FAST and Google Appliance have advanced hybrid algorithms for handling morphology. Yandex.Server and Google Appliance probably have the best Russian morphological analyzers available, since they use the same algorithms as the corresponding web search engines.
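
The difference between a stemmer, a dictionary and a hybrid can be shown with a toy English example. The suffix list and the tiny irregular-forms dictionary below are assumptions made up for illustration; they are not how any of the engines named above work internally.

```python
# Toy English suffix list (an assumption, far from a real stemmer).
SUFFIXES = ("ing", "es", "ed", "s")

# Tiny sample dictionary of irregular forms (an assumption).
IRREGULAR = {"is": "be", "was": "be", "are": "be", "were": "be"}


def stem(word: str) -> str:
    """Naive stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word: str) -> str:
    """Hybrid: look the word up in the dictionary, fall back to the stemmer."""
    word = word.lower()
    return IRREGULAR.get(word, stem(word))


# Regular forms collapse to one stem.
print({stem(w) for w in ("search", "searches", "searching", "searched")})
# Irregular forms stay apart with the stemmer alone...
print(stem("is"), stem("was"))
# ...but the dictionary lookup brings them together.
print(normalize("is"), normalize("was"))
```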

Ranking

Ranking is the order in which the documents found are sorted on the results page. Sometimes it is enough to sort by a simple criterion such as modification date, but most often the documents need to be ordered by decreasing relevance to the query. The developers of the large search engines have fought long and hard over ranking, so their products give the best results. The ranking situation is roughly as follows:

When using SQL search, ranking is available only by simple criteria, such as date.
Open Source systems (Sphinx, Lucene, etc.) have built-in advanced ranking algorithms. Usually these are modifications of a textual relevance algorithm.
Commercial products (Yandex.Server, FAST, Google Appliance, etc.) have complex multifactor ranking algorithms whose details are kept under lock and key, much like the Coca-Cola recipe.
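
To give a feel for what ranking by textual relevance means, here is a minimal TF-IDF sketch. TF-IDF is chosen here as a well-known textbook scheme, not because the source says any of the engines above use exactly this formula; the documents and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))
    scores = {}
    for doc_id, words in tokens.items():
        tf = Counter(words)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                idf = math.log(1 + n_docs / df[term])
                score += (tf[term] / len(words)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents.
documents = {
    "a": "site search engine ranking",
    "b": "ranking ranking of search results by relevance",
    "c": "contacts and delivery",
}
for doc_id, score in rank("ranking", documents):
    print(doc_id, round(score, 3))
```

Documents that do not contain any query word are left out, and a document that mentions the query word more densely scores higher.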

Conclusions

For a small non-commercial site, a Yandex or Google search form will do.
To search a section of the site that holds a small amount of data and needs neither morphological analysis of the query nor complex ranking, an SQL query plus a stemmer is enough.
For a sufficiently large site with non-trivial articles, use an engine with good morphology and ranking: Yandex.Server, FAST, Google Appliance, etc.
Lucene, Sphinx and the like are suitable if your heart lies with Open Source and the engine's capabilities meet your search requirements.
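
For the "SQL query + stemmer" option, a minimal sketch might look like the following. It assumes SQLite, a hypothetical `news` table and a toy stemmer; results are ordered by date because, as noted above, SQL search only offers ranking by simple criteria.

```python
import sqlite3


def stem(word):
    """Toy stemmer (an assumption): drop a trailing "es" or "s" from longer words."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search(conn, query):
    """Match every stemmed query word as a substring, newest rows first."""
    stems = [stem(w) for w in query.lower().split()]
    if not stems:
        return []
    where = " AND ".join("lower(body) LIKE ?" for _ in stems)
    sql = f"SELECT id FROM news WHERE {where} ORDER BY published DESC"
    # Note: % and _ inside a stem are not escaped in this sketch.
    return [row[0] for row in conn.execute(sql, [f"%{s}%" for s in stems])]


# Hypothetical table and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, body TEXT, published TEXT)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?, ?)",
    [
        (1, "New search engine released", "2009-01-10"),
        (2, "Search engines compared", "2009-03-05"),
        (3, "Office moved", "2009-02-01"),
    ],
)
print(search(conn, "engines"))
```

Substring matching on stems finds both "engine" and "engines", but it will also match unrelated words that happen to contain the stem, which is one reason this option suits only small, simple sections.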