
Search on a site is more than just search on a site

Not every site needs a standalone search module. If a site has five pages, no search is needed at all. If a site is updated once a month, or all updates are reflected on the front page, you can get by with an external site search from Google or Yandex. But there are problems an external search cannot solve. This article is about the functions a built-in search module can perform, and some of these functions are not directly related to searching at all.

What Yandex cannot do


Large search engines let us benefit from their many years of experience in relevance, search-spam detection, morphology and other aspects of ranking. But there are tasks an external search engine cannot cope with, simply because it is external.

Instant reindexing

You add an article to the site, and it is immediately in the index and available for search. You delete an abusive comment, and no one will ever find it. If your site's search form queries a global search engine, you will sometimes have to wait weeks for re-indexing. If the built-in search module can re-index a fragment of the site on an "add", "change" or "delete" event, it becomes a matter of minutes or even seconds.
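As an illustration, here is a minimal Python sketch of event-driven re-indexing. The class and method names are invented for this example and are not NetCat's actual API.

```python
# Hypothetical sketch: an in-memory index updated on content events
# instead of waiting for a full crawl (all names are illustrative).

class SearchIndex:
    def __init__(self):
        self.docs = {}  # url -> page text

    def on_event(self, event, url, text=None):
        """Re-index a single fragment on 'add', 'change' or 'delete'."""
        if event in ("add", "change"):
            self.docs[url] = text
        elif event == "delete":
            self.docs.pop(url, None)

    def search(self, word):
        return [url for url, text in self.docs.items()
                if word in text.lower()]

index = SearchIndex()
index.on_event("add", "/articles/1", "pink elephants for sale")
index.on_event("add", "/forum/42", "an abusive comment")
index.on_event("delete", "/forum/42")  # gone from results at once
```

The deleted comment disappears from results the moment the event fires, with no crawl in between.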


Synonyms

The name of our product is often written in Russian transliteration as "netkat", "netket" and the like. Yandex will not guess that for such queries it should show the pages where NetCat is written (except for the "netkat" query, which Yandex does recognize). And that is to say nothing of cases like "CMS: csm, cmska, siemes". With a built-in search, we can specify such synonyms explicitly.
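A sketch of the idea in Python: every known variant maps to one canonical token, at index time and at query time alike. The dictionary contents come from the examples above; the function name is made up.

```python
# Illustrative synonym dictionary: misspellings and transliterations
# are normalized to a canonical token before matching.

SYNONYMS = {
    "netkat": "netcat",
    "netket": "netcat",
    "csm": "cms",
    "cmska": "cms",
    "siemes": "cms",
}

def normalize(tokens):
    """Replace each token with its canonical form, if one is known."""
    return [SYNONYMS.get(t.lower(), t.lower()) for t in tokens]
```

Applied to both documents and queries, "netket csm" then matches pages indexed as "netcat cms".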



Managing tag weights when calculating relevance

Writing texts "for Yandex" is fundamentally different from writing texts for people. In the first case, we need Yandex to show our page above a million others "on the same topic". In the second, we need a person who has ALREADY come to the site to quickly find the page they need and buy our product. So if a person searches our site for a "pink elephant", we should show them not a long article with a carefully tuned keyword density, but a page with a couple of photos and a "buy" button. If the internal search engine lets us set weights for tags and for individual blocks of a web page (for example, by the class attribute), we can prepare the content so that the path from "entered a query" to "bought" takes the user a minimum of time.
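The scoring idea can be sketched in a few lines of Python. The weights and the "class:buy" convention are invented for illustration; any real module would have its own scheme.

```python
# Tag-weighted relevance sketch: a hit inside <title> or next to a
# "buy" button counts for more than a hit in plain body text.

TAG_WEIGHTS = {"title": 10.0, "h1": 5.0, "class:buy": 8.0, "p": 1.0}

def score(occurrences, weights=TAG_WEIGHTS):
    """occurrences: list of (tag, hit_count) pairs for one page."""
    return sum(weights.get(tag, 1.0) * n for tag, n in occurrences)

# A product card with the phrase in the title and near the buy button...
card = score([("title", 1), ("class:buy", 1)])   # 10 + 8 = 18.0
# ...outranks a long article that merely mentions it twelve times.
article = score([("p", 12)])                     # 12 * 1 = 12.0
assert card > article
```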



Flexible exclusion of pages from indexing

In robots.txt we can write a Disallow instruction that keeps external search engines from indexing certain parts of the site. As last summer's scandals with private information leaking into search engines showed, this does not always help. And even leaving that aside, the Disallow syntax is very primitive; it would be much better to specify forbidden areas with a regular expression. Example: the page sloniki.html?action=add, meant for the content administrator to add content to the corresponding section, may well end up in the index even if it contains nothing but the title "Pink Elephants" and an authorization form. Why clutter the search results with it?
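A minimal Python sketch of regex-based exclusion, using the example URL from the text (the pattern list is illustrative):

```python
import re

# Hypothetical indexer setting: skip any URL matching one of these
# patterns -- precision that Disallow's prefix syntax cannot express.

EXCLUDE = [
    re.compile(r"\?action=(add|edit|delete)\b"),  # admin forms
    re.compile(r"^/private/"),
]

def should_index(url):
    return not any(p.search(url) for p in EXCLUDE)

assert not should_index("/sloniki.html?action=add")
assert should_index("/sloniki.html")
```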

Suggesting query options as you type

Everyone knows the drop-down list that Yandex or Google shows as you type a query. But that hint is just a list of the most popular queries. An internal search can suggest not only popular queries but also page titles (that is, document names) matching the input. Start typing "pink" and you see a list of title tags containing that fragment; click the "Pink Elephants" entry and you land on the page itself instead of on the search results.
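A toy Python version of title-based suggestions; the page list and function name are made up for the example.

```python
# Suggestions drawn from page titles rather than popular queries:
# each hit carries the URL, so a click goes straight to the page.

PAGES = {
    "/elephants/pink": "Pink Elephants",
    "/elephants/care": "Caring for Elephants",
    "/news": "Company News",
}

def suggest(prefix, pages=PAGES, limit=10):
    """Return (title, url) pairs whose title contains the typed text."""
    prefix = prefix.lower()
    return [(title, url) for url, title in pages.items()
            if prefix in title.lower()][:limit]
```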



Flexible re-indexing schedule

If the site is large, a complete re-indexing can take a lot of time and resources. But if the "Forum" section needs to be re-indexed every hour, "News" every day, and "About the Company" never, it would be great to do exactly that. An internal search can easily afford different schedules for different sections. Of course, in the sitemap we can suggest a re-indexing frequency with the changefreq attribute, but Yandex and Google are unlikely to treat our wish as a direct order.
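The per-section schedule from the example, as a small Python sketch (section names and intervals mirror the text; the function is illustrative):

```python
from datetime import datetime, timedelta

# Hypothetical per-section re-indexing schedule.

SCHEDULE = {
    "/forum/": timedelta(hours=1),
    "/news/":  timedelta(days=1),
    "/about/": None,               # never re-index
}

def is_due(section, last_indexed, now):
    """True if the section's interval has elapsed since last indexing."""
    interval = SCHEDULE.get(section)
    if interval is None:
        return False
    return now - last_indexed >= interval

now = datetime(2012, 1, 1, 12, 0)
assert is_due("/forum/", now - timedelta(hours=2), now)
assert not is_due("/about/", now - timedelta(days=365), now)
```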




... and not only search


In its spare time, a search module can perform not only its direct duties but other socially useful tasks as well. Here are some examples.

Automatic sitemap.xml generation

Who can compile the most thorough list of the site's pages for external search engines, if not the local search robot? At the same time, we can set the changefreq and priority parameters at the level of the site structure, with different values for different sections.
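A compact Python sketch of serializing the crawler's page list as sitemap.xml, with per-section changefreq/priority (the sections and values are invented):

```python
# Build a sitemap.xml string from a crawled URL list, taking
# changefreq/priority from per-section settings (illustrative).

SECTIONS = {
    "/forum/": ("hourly", "0.5"),
    "/news/":  ("daily", "0.8"),
}

def build_sitemap(urls, base="http://example.com"):
    items = []
    for url in urls:
        freq, prio = next((v for k, v in SECTIONS.items()
                           if url.startswith(k)), ("monthly", "0.5"))
        items.append(
            "  <url><loc>%s%s</loc>"
            "<changefreq>%s</changefreq>"
            "<priority>%s</priority></url>" % (base, url, freq, prio))
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(items) + "\n</urlset>")
```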



Search for broken links

There are at least two ways to find internal links to non-existent pages. The first is to write a 404 handler that sends an email with the page address and the referrer (or logs a message in the site's database) every time someone hits such a page. The second is to entrust the job to the search robot, which is clearly the more correct way.
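The crawler-based approach reduces to a set comparison, sketched here in Python over a toy site graph (the data is made up):

```python
# Find internal links pointing at pages the crawler never discovered.

SITE = {
    "/": ["/about", "/elephants"],
    "/about": ["/"],
    "/elephants": ["/elephants/pink", "/gone.html"],  # dead link
    "/elephants/pink": [],
}

def broken_links(site):
    """Return sorted (source_page, dead_link) pairs."""
    return sorted({(page, link)
                   for page, links in site.items()
                   for link in links
                   if link not in site})

assert broken_links(SITE) == [("/elephants", "/gone.html")]
```

Unlike the 404-handler approach, this reports the dead link before any visitor ever hits it.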

Collection of statistics on requests

If the search engine collects statistics on queries and their results, this data can help us a great deal. First, we can see the queries for which users find nothing, and add relevant pages. Second, having spotted common typos, we can add them to the synonym dictionary. Third, if a page is searched for too often, it is apparently hard to find without search, and it may be worth promoting it into the menu. And so on.
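The first of those reports, zero-result queries, is a one-liner over a query log; the log format here is an assumption for the example.

```python
from collections import Counter

# Hypothetical log of (query, result_count) pairs recorded by the
# search module.

LOG = [("pink elephant", 12), ("pynk elephant", 0),
       ("pink elephant", 7), ("delivery", 0), ("delivery", 0)]

def zero_result_queries(log):
    """Queries that never return anything: candidates for new pages
    or for the synonym dictionary (common typos)."""
    counts = Counter(q for q, n in log if n == 0)
    return counts.most_common()
```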



By the way, statistics on the queries of specific registered users are a separate item. Just don't think I am urging you to spy on them :)

Keeping up with the "grown-ups"


All these bells and whistles are good only if the search itself is genuinely convenient, finds what is needed, and is easy to use. So our search module should be able to do what the big search engines do, and do it just as well. Well, or almost as well.

Full morphology

Many local search engines make do with stemming to handle word forms. The term means cutting off the end of a word in an attempt to find its root and, through it, the word's other forms. Take the Russian word «розовый» ("pink"): stemming reduces it to «роз», and every word beginning with this "root" is now treated as a word form, so a query for «розовый» will also find «розового», «розовым» and so on. But stemming gives too large an error and fails, for example, on suppletive verb forms ("go — went — gone"). The most accurate word-form search is provided by morphological dictionaries. For everyday or business texts they are not that large (NetCat uses a free dictionary from aot.ru; the Russian and English dictionaries together take up only 15 megabytes, which is not much for modern hosting; you can use other dictionaries, and you can add dictionaries for other languages).
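A deliberately naive English stemmer in Python makes both the idea and its failure mode visible; the suffix list is arbitrary and far cruder than any real stemmer.

```python
# Toy suffix-stripping stemmer: good enough for regular plurals,
# hopeless for suppletive forms -- which is exactly the point.

SUFFIXES = ("ing", "ed", "er", "s")

def stem(word):
    word = word.lower()
    for suf in SUFFIXES:
        # keep at least a 3-letter stem so short words survive intact
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

assert stem("elephants") == stem("elephant")  # regular plural: works
assert stem("went") != stem("go")             # suppletive form: fails
```

A morphological dictionary handles the second case by storing the paradigm explicitly instead of guessing at a root.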



By the way, taking this opportunity, I want to say a huge thank you to the author of the morphological-analysis library phpMorphy, which turned out to be very useful for our purposes.

Fighting typos

There are two ways to deal with typos. The first is to find the most similar word, as Yandex itself does; the second is fuzzy search.
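The "most similar word" approach is usually built on edit distance. A minimal Python sketch, with invented function names and a toy vocabulary:

```python
# "Did you mean": suggest the closest known word when a query
# returns nothing, using Levenshtein distance.

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def did_you_mean(query, vocabulary, max_dist=2):
    best = min(vocabulary, key=lambda w: edit_distance(query, w))
    return best if edit_distance(query, best) <= max_dist else None

assert did_you_mean("elefant", ["elephant", "delivery"]) == "elephant"
```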



Fixing a query typed in the wrong keyboard layout, though, is much easier.
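It is just a character-for-character substitution by physical key position. A Python sketch for the English-to-Russian case (the standard QWERTY/ЙЦУКЕН correspondence):

```python
# Recover a Russian query typed with the English layout on: map each
# English character to the Russian letter on the same physical key.

EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU = "йцукенгшщзхъфывапролджэячсмитьбю"
EN2RU = str.maketrans(EN, RU)

def fix_layout(query):
    return query.translate(EN2RU)

# "слоники" ("little elephants") typed with the wrong layout
# comes out as "ckjybrb":
assert fix_layout("ckjybrb") == "слоники"
```

If a query yields nothing, the converted variant can be tried automatically before falling back to fuzzy search.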

Exotic Cases


We did not include the following capabilities in our search module because they are exotic or laborious, although in some cases they may be useful.

RTL-languages ​​and hieroglyphs

European languages (including Russian) are written from left to right (LTR). Arabic is written in the opposite direction. If your project targets such an audience, be prepared to write (or plug in an existing) stemmer. Hieroglyphic writing is a separate case altogether; there, a stemmer alone will not suffice.

Search in closed areas

Access-control systems in Internet projects vary, up to quite paranoid ones (I want to write a separate article about this). An example of a complex case: publishing systems. A journalist may have the right to add an article (to certain sections only!) and to correct it until the editor has touched it. The editor has the right to view, correct, and publish or unpublish all materials within his subject area. The editor-in-chief has no right to correct commissioned articles without the approval of the commercial department. From the point of view of a paranoid access-control system, the ideal search module would index all content on the project, check the user's rights at query time, and show only the materials available to that user. And it would also allow filtering by document status.

Automatic detection of page topics

By analyzing the text of an indexed page, we can determine its topic with some margin of error. Useful applications of this feature: automatic cataloguing of materials and building a tag cloud, analyzing the interests of a community or its individual members (for UGC projects), building lists of related materials (you have seen the "see also" blocks). Most often, though, this kind of analysis is used to target contextual advertising.

This category also includes caching search results, searching by images, indexing pages generated by form submissions or Ajax, and finding duplicate pages.

Another interesting use of a search engine occurred to me while writing this article. It suits, for example, collective blogs and media. By analyzing the texts of different authors, we can rank them by various parameters. The first that comes to mind is vocabulary size. Beyond that: a rating of choleric authors (who uses exclamation marks most often), of lovers of lengthy reasoning (who favors question marks), of filler-word usage, and so on. Maybe the Habr administration should build something like that? :) True, I have not yet come up with a commercial justification for such a toy.

If you are writing your own search...


...first answer the question of whether it is needed at all, or whether an existing solution would do: Yandex.Server, Sphinx, etc. The main advantage of your own search engine is the possibility of tight integration with the other CMS modules used on the site. This is not just about embedding the management interface in the admin area, but about integration with access control, structure management, user management and so on (I have written about this before).

As for technology, there are plenty of flexible and powerful platforms. By default, the NetCat search module uses Zend_Search_Lucene. This solution has drawbacks, for example relatively low speed. In our case this is justified: a NetCat site should run on any standard UNIX hosting without additional components, and Zend_Search_Lucene requires nothing but PHP. To be fair, we made the module extensible: not only the dictionaries but also the software components can be replaced. If the project is large enough and the server is dedicated, you can swap out the components responsible for storing and retrieving information, indexing, normalization to base forms and so on, for example with the same Sphinx or Solr (and, if need be, Yandex.XML).

If you are developing not a universal CMS, but a specific large project, choosing and setting up the optimal platform is not a problem. It is much more important to understand how to use its capabilities as efficiently as possible.

All screenshots in the article were taken on our website and in its administration system.

Source: https://habr.com/ru/post/136492/

