Do-it-yourself site search

Many of you have probably wondered at some point how to build a search for your site. For large sites with a lot of content, search is simply indispensable. In most cases, a user who visits your site for the first time looking for something important will not bother to figure out the navigation bars, drop-down menus and other navigation elements, and will hurry to find something that looks like a search bar. And if the site offers no such luxury, or the search cannot cope with the query, the visitor will simply close the tab. But this article is not about the importance of search for a site or the psychology of visitors. I will show how to implement a small full-text search algorithm, which I hope will save novice developers some headaches.

The reader may ask: why write everything from scratch when it was all written long ago? Yes, the major search engines have APIs, and there are great projects like Sphinx and Apache Solr. But each of these solutions has its advantages and disadvantages. Using search engine services such as Google and Yandex gets you a lot of goodies: powerful morphological analysis, correction of typos and errors in the query, recognition of a wrong keyboard layout. But there is a fly in the ointment. First, such a search does not integrate into the structure of the site: it is external, and you cannot tell it which data matters most and which matters less. Second, the site content is indexed only at a certain interval, which depends on the chosen search engine, so when something is updated on the site you have to wait until the changes reach the index and become searchable. Sphinx and Apache Solr do much better with integration and indexing, but not every hosting provider will let you run them.

Nothing prevents you from writing a search engine yourself. We will assume that the site runs on PHP together with some database server, such as MySQL. Let's first define what is required of a site search.

Search that takes language morphology into account. Regardless of case, ending and the other delights of the great and mighty Russian language, the search must find what the user needs. In other words, the different inflected forms of "apple" (such as "apples") are forms of one and the same word, and the search algorithm must account for that. One way to achieve this is to reduce each word of the search query and of the site content to its basic form.
The ability to specify the search context. That is, the ability to choose which parts of the site's content the search algorithm will work on, and to set the significance of each of them. For example, consider an online store. The search query will most likely contain the name of the desired product, so a search by product name gets the highest priority. Next in priority comes a search by product properties, and then a search by description.
Indexing of site content. Imagine the situation: about 30 people run search queries at the same time. The server accepts each connection and passes control to the PHP interpreter. On every request the search engine is initialized again and splits the site's content into words again... It is hard to say how much time and how many resources it would take to handle all these requests. Indexing was invented precisely so that the same work is not done a hundred times over. Indexing is performed only when content is added or changed, and the search runs against the index, not the content.
A ranking mechanism. Ranking means sorting the search results by an estimate of how significant the found data is. For example, the query "space" is run on some blog. The word occurs in two articles: 16 times in the first and 5 times in the second. Most likely, the first article matters more to the person searching. In addition, each type of site content is assigned a coefficient during indexing, which affects its position in the search results.

Now a few words about what we have to implement:

a morphological analyzer,
a ranking algorithm,
an indexing algorithm,
a search algorithm.

At the end of the article, the search implementation will be demonstrated using a simple online store. If you are too lazy to study all this and just need a ready-made search engine, feel free to take it from the FireWind repository on GitHub.

Principle of operation

On the back end, the search works like this:

the site content is indexed,
the user sends a request,
function words (auxiliary parts of speech) are removed from the request,
the resulting string is split into an array of words, each reduced to its basic form,
each word of the resulting array is looked up in the index,
the search results are ranked, sorted and returned to the user.
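
The request-side steps above can be sketched without any libraries. This is only an illustration under loud assumptions: the stop-word list, the tiny lemma table and the function name normalize_query are hypothetical stand-ins for the morphological analyzer introduced later, and the "index" is a plain array rather than the real index format.

```php
<?php
// Hypothetical stand-ins for a real morphological analyzer
const STOP_WORDS = [ 'THE', 'A', 'AN', 'OF', 'IN', 'AND' ];
const LEMMAS     = [ 'APPLES' => 'APPLE', 'PIES' => 'PIE' ];

/**
 * Turns a raw query into an array of basic word forms.
 */
function normalize_query( string $query ): array {
    // Split the upper-cased query into words
    preg_match_all( '/[A-Z0-9]+/', strtoupper( $query ), $matches );

    $words = [];
    foreach ( $matches[ 0 ] as $word ) {
        // Drop function words
        if ( in_array( $word, STOP_WORDS, true ) ) {
            continue;
        }
        // Reduce the word to its basic form when one is known
        $words[] = LEMMAS[ $word ] ?? $word;
    }
    return $words;
}

// A toy index of one document: basic form => rank
$index = [ 'APPLE' => 10, 'JUICE' => 3 ];

// Look up every query word in the index and sum the ranks
$rank = 0;
foreach ( normalize_query( 'the apples in pies' ) as $word ) {
    $rank += $index[ $word ] ?? 0;
}
echo $rank, "\n";
```

Here only "apples" contributes to the rank: "the" and "in" are dropped, and "pies" is not present in the toy index.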

Preparation

The task is set, so we can get down to business. I use Linux as my working OS, but I will try to avoid its exotic features so that Windows fans can "build" the search engine by analogy. All you need is a basic knowledge of PHP and the ability to work with MySQL. Let's go!

Our project will consist of a core, which collects all the vital functions, and a module for morphological analysis and text processing. To begin with, we create the project's root folder, firewind, and in it the file core.php, which will be the core.

    $ mkdir firewind
    $ cd firewind
    $ touch core.php

Now we arm ourselves with our favorite text editor and prepare the skeleton:

    <?php
    class firewind {
        public $VERSION = "1.0.0";

        function __construct() {
            // Initialization will go here //
        }
    }
    ?>

Here we have created the main class, which you can use on your sites. That is the end of the preparatory part; it is time to move on.

Morphological analyzer

Russian is a rather complicated language, which delights with its diversity and shocks foreigners with constructions such as "yes no, probably". Teaching a machine to understand it, or any other language, is a rather difficult task. The most successful here are search companies such as Google and Yandex, which constantly improve their algorithms and keep them secret. We will have to do something different and simpler. Fortunately, there is no need to reinvent the wheel: everything has already been done for us. Meet phpMorphy, a morphological analyzer that supports Russian, English and German. More detailed information is available on the project's site, but we are interested in only two of its features: lemmatization, that is, obtaining the basic form of a word, and obtaining grammatical information about a word (gender, number, case, part of speech, and so on).

We need the library and a dictionary for it. Both can be found in the project's downloads: the library is in the folder of the same name, "phpmorphy", and the dictionaries are in "phpmorphy-dictionaries". Download the latest version of the project into the root folder and unpack it:

    # Unpack the library
    $ unzip phpmorphy-0.3.7.zip
    $ mv phpmorphy-0.3.7 phpmorphy

    # Unpack the dictionary into phpmorphy/dicts
    $ unzip morphy-0.3.x-ru_RU-withjo-utf-8.zip -d phpmorphy/dicts/

    # Remove the archives
    $ rm phpmorphy-0.3.7.zip morphy-0.3.x-ru_RU-withjo-utf-8.zip

Great! The library is ready to use. It is time to write a "wrapper" that abstracts the work with phpMorphy. To do this, create another file, morphyus.php, in the root directory:

    <?php
    require_once __DIR__.'/phpmorphy/src/common.php';

    class morphyus {
        private $phpmorphy     = null;
        private $regexp_word   = '/([a-zа-я0-9]+)/ui';
        private $regexp_entity = '/&([a-zA-Z0-9]+);/';

        function __construct() {
            $directory = __DIR__.'/phpmorphy/dicts';
            $language  = 'ru_RU';
            $options   = [ 'storage' => PHPMORPHY_STORAGE_FILE ];

            // Initialize the library //
            $this->phpmorphy = new phpMorphy( $directory, $language, $options );
        }

        /**
         * Splits text into an array of words
         *
         * @param  {string}  content Source text
         * @param  {boolean} filter  Strip HTML tags and entities
         * @return {array}   Array of words
         */
        public function get_words( $content, $filter=true ) {
            // Strip HTML tags and HTML entities //
            if ( $filter ) {
                $content = strip_tags( $content );
                $content = preg_replace( $this->regexp_entity, ' ', $content );
            }

            // Convert to upper case //
            $content = mb_strtoupper( $content, 'UTF-8' );

            // Replace "Ё" with "Е" //
            $content = str_ireplace( 'Ё', 'Е', $content );

            // Extract the words from the text //
            preg_match_all( $this->regexp_word, $content, $words_src );
            return $words_src[ 1 ];
        }

        /**
         * Finds the lemmas of a word
         *
         * @param  {string} word Source word
         * @return {array|boolean} Array of possible lemmas, or false
         */
        public function lemmatize( $word ) {
            // Get the basic forms of the word //
            $lemmas = $this->phpmorphy->lemmatize( $word );
            return $lemmas;
        }
    }
    ?>

So far, only two methods have been implemented. get_words splits the text into an array of words, filtering out HTML tags and entities such as "&nbsp;" along the way. The lemmatize method returns an array of the word's lemmas, or false if none were found.

The mechanism of ranking at the level of morphology

Let's look at a unit of language such as the sentence. The most important part of a sentence is its grammatical core: the subject and/or the predicate. Most often the subject is a noun and the predicate is a verb. The secondary parts of the sentence mainly refine the meaning of the core. In different sentences the same parts of speech can carry completely different meaning, and today only a human can assess that meaning accurately in the context of a text. Still, the significance of a word can be estimated programmatically, although less precisely. In that case the ranking algorithm should rely on a so-called text profile, which is defined by the author of the text. A profile is an associative array whose keys are parts of speech and whose values are the rank (or weight) of each of them. I will show an example of a profile in the conclusion; for now, let's translate these reflections into PHP by adding another method to the morphyus class:

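    <?php
    require_once __DIR__.'/phpmorphy/src/common.php';

    class morphyus {
        private $phpmorphy     = null;
        private $regexp_word   = '/([a-zа-я0-9]+)/ui';
        private $regexp_entity = '/&([a-zA-Z0-9]+);/';

        // ... //

        /**
         * Estimates the significance of a word
         *
         * @param  {string}  word    Source word
         * @param  {array}   profile Text profile
         * @return {integer} Score from 0 to 5
         */
        public function weigh( $word, $profile=false ) {
            // Try to determine the part of speech //
            $partsOfSpeech = $this->phpmorphy->getPartOfSpeech( $word );

            // Default profile //
            if ( !$profile ) {
                $profile = [
                    // Function words //
                    'ПРЕДЛ' => 0, // preposition
                    'СОЮЗ'  => 0, // conjunction
                    'МЕЖД'  => 0, // interjection
                    'ВВОДН' => 0, // parenthetical word
                    'ЧАСТ'  => 0, // particle
                    'МС'    => 0, // pronoun

                    // The most significant parts of speech //
                    'С'     => 5, // noun
                    'Г'     => 5, // verb
                    'П'     => 3, // adjective
                    'Н'     => 3, // adverb

                    // All other parts of speech //
                    'DEFAULT' => 1
                ];
            }

            // If the part of speech could not be determined //
            if ( !$partsOfSpeech ) {
                return $profile[ 'DEFAULT' ];
            }

            // Collect the weight of every possible part of speech //
            $range = [];
            for ( $i = 0; $i < count( $partsOfSpeech ); $i++ ) {
                if ( isset( $profile[ $partsOfSpeech[ $i ] ] ) ) {
                    $range[] = $profile[ $partsOfSpeech[ $i ] ];
                } else {
                    $range[] = $profile[ 'DEFAULT' ];
                }
            }

            return max( $range );
        }
    }
    ?>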
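
The profile lookup can be tried in isolation, without phpMorphy. A minimal sketch, assuming the parts of speech are already known: the function name weigh_tags and the tag names ('NOUN', 'PREP' and so on) are hypothetical placeholders, not the tags that the analyzer actually returns.

```php
<?php
/**
 * Picks the highest weight among a word's possible parts of speech.
 * $tags stands in for what a morphological analyzer would return.
 */
function weigh_tags( array $tags, array $profile ): int {
    // Nothing is known about the word: fall back to the default weight
    if ( !$tags ) {
        return $profile[ 'DEFAULT' ];
    }

    $weights = [];
    foreach ( $tags as $tag ) {
        $weights[] = $profile[ $tag ] ?? $profile[ 'DEFAULT' ];
    }
    return max( $weights );
}

// Placeholder tag names, not the analyzer's real ones
$profile = [ 'PREP' => 0, 'NOUN' => 5, 'VERB' => 5, 'ADJ' => 3, 'DEFAULT' => 1 ];

echo weigh_tags( [ 'NOUN', 'ADJ' ], $profile ), "\n"; // ambiguous word: the larger weight wins
echo weigh_tags( [ 'PREP' ], $profile ), "\n";        // function word: weight 0
echo weigh_tags( [], $profile ), "\n";                // unknown word: default weight
```

Taking the maximum means that an ambiguous word is treated as its most significant reading, and a weight of 0 marks words that should not reach the index at all.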

Indexing site content

As mentioned above, indexing significantly speeds up a search query, because the search engine does not need to process the content every time: the search runs against the index. But what exactly happens during indexing? Step by step:

First, an array of words is built from the text using the get_words method.
According to the profile, insignificant parts of speech are discarded.
The significant ones are scored on a five-point scale using the weigh method.
For each word, its lemmas (in other words, its basic forms) are looked up.
The number of repetitions of each word and its total rank are calculated.
All the data is collected into an object and written to the database as JSON.

The result is an object of the following format:

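    {
        "range" : "<content significance coefficient>",
        "words" : [
            // List of indexed words //
            {
                "source" : "<source form of the word>",
                "range"  : "<rank of the word>",
                "count"  : "<number of repetitions of the word>",
                "weight" : "<weight of the word>",
                "basic"  : [
                    // List of lemmas (basic forms) //
                ]
            }
        ]
    }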

We write the initializer and the first method of the search engine core:

    <?php
    require_once 'morphyus.php';

    class firewind {
        public  $VERSION = "1.0.0";
        private $morphyus;

        function __construct() {
            $this->morphyus = new morphyus;
        }

        /**
         * Indexes a text
         *
         * @param  {string}  content Text to index
         * @param  {integer} [range] Significance coefficient of the indexed data
         * @return {object}  Index
         */
        public function make_index( $content, $range=1 ) {
            $index = new stdClass;
            $index->range = $range;
            $index->words = [];

            // Extract the words from the text //
            $words = $this->morphyus->get_words( $content );

            foreach ( $words as $word ) {
                // Estimate the significance of the word //
                $weight = $this->morphyus->weigh( $word );

                if ( $weight > 0 ) {
                    // Number of words already in the index //
                    $length = count( $index->words );

                    // Check whether the source word is already in the index //
                    for ( $i = 0; $i < $length; $i++ ) {
                        if ( $index->words[ $i ]->source === $word ) {
                            // The word is already indexed: update its counters //
                            $index->words[ $i ]->count++;
                            $index->words[ $i ]->range =
                                $range * $index->words[ $i ]->count * $index->words[ $i ]->weight;

                            // Move on to the next word //
                            continue 2;
                        }
                    }

                    // The word is not in the index yet: lemmatize it //
                    $lemma = $this->morphyus->lemmatize( $word );

                    if ( $lemma ) {
                        // Check whether the lemmas are already in the index //
                        for ( $i = 0; $i < $length; $i++ ) {
                            // Only for words that have basic forms //
                            if ( $index->words[ $i ]->basic ) {
                                $difference = count(
                                    array_diff( $lemma, $index->words[ $i ]->basic )
                                );

                                // All lemmas of the word are already in the index //
                                if ( $difference === 0 ) {
                                    $index->words[ $i ]->count++;
                                    $index->words[ $i ]->range =
                                        $range * $index->words[ $i ]->count * $index->words[ $i ]->weight;

                                    // Move on to the next word //
                                    continue 2;
                                }
                            }
                        }
                    }

                    // Neither the word nor its lemmas are in the index, //
                    // so add a new entry //
                    $node = new stdClass;
                    $node->source = $word;
                    $node->count  = 1;
                    $node->range  = $range * $weight;
                    $node->weight = $weight;
                    $node->basic  = $lemma;

                    $index->words[] = $node;
                }
            }

            return $index;
        }
    }
    ?>

Now, when data is added to or changed in the tables, it is enough to call this method to index it. This does not have to happen immediately: indexing can be deferred. The first argument of the make_index method is the source text, the second is the significance coefficient of the data being indexed. The rank of each word, by the way, is calculated by the formula:

    <?php
    // $range = <significance coefficient> * <number of repetitions> * <weight of the word>;

    // In the code it looks like this: //
    $index->words[ $i ]->range = $range * $index->words[ $i ]->count * $index->words[ $i ]->weight;
    ?>
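
As a quick worked example of the formula, take the blog scenario from the beginning of the article: the word "space" occurs 16 times in one article and 5 times in another. The sketch below assumes that the word gets weight 5 and that both articles are indexed with coefficient 1; the helper name word_rank is hypothetical and is not part of the engine.

```php
<?php
/**
 * Rank of one word: significance coefficient * repetitions * weight.
 */
function word_rank( int $coefficient, int $count, int $weight ): int {
    return $coefficient * $count * $weight;
}

$first   = word_rank( 1, 16, 5 ); // article with 16 occurrences
$second  = word_rank( 1, 5, 5 );  // article with 5 occurrences
$keyword = word_rank( 2, 1, 5 );  // a single keyword indexed with coefficient 2

printf( "%d %d %d\n", $first, $second, $keyword ); // prints "80 25 10"
```

The coefficient is what lets a single occurrence in an important field outweigh several occurrences in a less important one.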

Storage of indexed data

Obviously, the index must be stored somewhere, and it must stay attached to the original data. The most suitable place for it is the database. If you index the contents of files, you can create a separate table that holds an index for each file; for content that is already stored in the database, you can add one more field to the table structure. This approach lets you separate content types when searching, for example the titles and descriptions of articles in the case of a blog.

The only unresolved question is the format of the indexed content, because make_index returns an object, which cannot simply be written to a database or a file as is. You can use JSON and store it in fields of type LONGTEXT, or use BSON or CBOR with the LONGBLOB data type. The latter two formats represent the data more compactly than the first.

As the saying goes, it's your call, so you decide where and how everything is stored.
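
For the JSON option, a round trip might look like the sketch below. The index here is built by hand in the format that make_index returns, so the values are made up for illustration; only json_encode and json_decode from PHP's standard JSON extension are used.

```php
<?php
// A hand-made index entry in the format returned by make_index
$node = new stdClass;
$node->source = 'SPACE';
$node->count  = 2;
$node->range  = 10;
$node->weight = 5;
$node->basic  = [ 'SPACE' ];

$index = new stdClass;
$index->range = 1;
$index->words = [ $node ];

// Serialize for a text column; JSON_UNESCAPED_UNICODE keeps
// non-ASCII words readable instead of \uXXXX escapes
$json = json_encode( $index, JSON_UNESCAPED_UNICODE );

// Restore: by default json_decode turns JSON objects back into stdClass
$restored = json_decode( $json );

echo $json, "\n";
var_dump( $restored->words[ 0 ]->basic[ 0 ] === 'SPACE' );
```

Because json_decode returns objects by default, the restored index can be used with the same property syntax as a freshly built one.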

Benchmark

Let's check what we have built. I took the text of my favorite article, "Dark Matter of the Internet", namely the contents of its #content html_format node, and saved it to a separate file.

    <?php
    require_once '../src/core.php';
    $firewind = new firewind;

    // Read the source text //
    $source = file_get_contents( './source.html' );

    // Record the start time //
    $begin_time = microtime( true );
    echo "Indexing started: $begin_time\n";

    // Indexing //
    $index = $firewind->make_index( $source );

    // Record the finish time //
    $finish_time = microtime( true );
    echo "Indexing finished: $finish_time\n";

    // Result //
    $total_time = $finish_time - $begin_time;
    echo "Total time: $total_time\n";
    ?>

On my machine, with the following configuration:
CPU: Intel Core i7-4510U @ 2.00GHz, 4M Cache
RAM: 2x4096 MB
OS: Ubuntu 14.04.1 LTS, x64
PHP: 5.5.9-1ubuntu4.5

Indexing took a little over a second:

    $ php benchmark.php
    Indexing started: 1417343592.3094
    Indexing finished: 1417343593.5604
    Total time: 1.2510349750519

A pretty good result, I think.

Implementation of the search

The last and most important method remains: the search method. It takes the index of the search query as its first argument and the index of the content being searched as its second. It returns the total rank, computed from the ranks of the words found, or 0 if nothing was found. This value is what lets you sort the search results.

    <?php
    require_once 'morphyus.php';

    class firewind {
        public  $VERSION = "1.0.0";
        private $morphyus;

        // ... //

        /**
         * Searches for the words of one index in another index
         *
         * @param  {object}  target Index of the search query
         * @param  {object}  index  Index in which the search is performed
         * @return {integer} Total rank based on the data found
         */
        public function search( $target, $index ) {
            $total_range = 0;

            // Go through the words of the query //
            foreach ( $target->words as $target_word ) {
                // Go through the words of the index //
                foreach ( $index->words as $index_word ) {
                    if ( $index_word->source === $target_word->source ) {
                        $total_range += $index_word->range;
                    } else if ( $index_word->basic && $target_word->basic ) {
                        // Both words have basic forms: compare them //
                        $index_count  = count( $index_word->basic );
                        $target_count = count( $target_word->basic );

                        for ( $i = 0; $i < $target_count; $i++ ) {
                            for ( $j = 0; $j < $index_count; $j++ ) {
                                if ( $index_word->basic[ $j ] === $target_word->basic[ $i ] ) {
                                    $total_range += $index_word->range;
                                    continue 2;
                                }
                            }
                        }
                    }
                }
            }

            return $total_range;
        }
    }
    ?>
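
To see how ranks turn into an ordering, here is a self-contained sketch that scores hand-built indexes. It is deliberately simplified and is not the method above: the hypothetical function score counts an index word once if it equals the query word or shares at least one basic form with it, and the helper make_node and all the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: builds one index entry by hand
function make_node( string $source, array $basic, int $range ): stdClass {
    $node = new stdClass;
    $node->source = $source;
    $node->basic  = $basic;
    $node->range  = $range;
    return $node;
}

/**
 * Simplified scoring: an index word counts once if it equals the query
 * word or shares at least one basic form with it.
 */
function score( stdClass $query, stdClass $index ): int {
    $total = 0;
    foreach ( $query->words as $q ) {
        foreach ( $index->words as $w ) {
            if ( $w->source === $q->source || array_intersect( $w->basic, $q->basic ) ) {
                $total += $w->range;
            }
        }
    }
    return $total;
}

$query = (object) [ 'words' => [ make_node( 'APPLES', [ 'APPLE' ], 5 ) ] ];

$documents = [
    'pie'   => (object) [ 'words' => [
        make_node( 'APPLE', [ 'APPLE' ], 15 ),
        make_node( 'PIE', [ 'PIE' ], 5 ),
    ] ],
    'juice' => (object) [ 'words' => [ make_node( 'APPLES', [ 'APPLE' ], 5 ) ] ],
    'tea'   => (object) [ 'words' => [ make_node( 'TEA', [ 'TEA' ], 10 ) ] ],
];

$results = [];
foreach ( $documents as $id => $index ) {
    $rank = score( $query, $index );
    if ( $rank > 0 ) {
        $results[ $id ] = $rank;
    }
}

arsort( $results ); // highest rank first, keys preserved
print_r( $results );
```

The "pie" document matches only through the shared basic form, yet it ranks first because its matching word carries a higher rank; "tea" scores 0 and is left out of the results.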

That's it! The search engine is ready to use. But there is one catch... It is not a magic genie, and simply dropping it onto your website will get you nothing. It has to be integrated, and that process depends largely on the architecture of your site. Let's look at it using a small online store as an example.

The implementation of the search on the example of an online store

Suppose information about the products sold is stored in the production table:

    CREATE TABLE `production` (
        `uid`          INT NOT NULL AUTO_INCREMENT, -- Unique identifier
        `name`         VARCHAR(45) NOT NULL,        -- Product name
        `manufacturer` VARCHAR(45) NOT NULL,        -- Manufacturer
        `price`        INT NOT NULL,                -- Price
        `keywords`     TEXT NULL,                   -- Index of keywords
        PRIMARY KEY ( `uid` )
    );

    SHOW COLUMNS FROM `production`;
    +--------------+-------------+------+-----+---------+-------+
    | Field        | Type        | Null | Key | Default | Extra |
    +--------------+-------------+------+-----+---------+-------+
    | uid          | int(11)     | NO   | PRI | NULL    |       |
    | name         | varchar(45) | NO   |     | NULL    |       |
    | manufacturer | varchar(45) | NO   |     | NULL    |       |
    | price        | int(11)     | NO   |     | NULL    |       |
    | keywords     | text        | YES  |     | NULL    |       |
    +--------------+-------------+------+-----+---------+-------+

And the descriptions are stored in the description table:

    CREATE TABLE `description` (
        `uid`         INT NOT NULL AUTO_INCREMENT, -- Unique identifier
        `fid`         INT NOT NULL,                -- Identifier of the product the description belongs to
        `description` LONGTEXT NOT NULL,           -- Description text
        `index`       TEXT NULL,                   -- Index of the description
        PRIMARY KEY ( `uid` )
    );

    SHOW COLUMNS FROM `description`;
    +-------------+----------+------+-----+---------+-------+
    | Field       | Type     | Null | Key | Default | Extra |
    +-------------+----------+------+-----+---------+-------+
    | uid         | int(11)  | NO   | PRI | NULL    |       |
    | fid         | int(11)  | NO   |     | NULL    |       |
    | description | longtext | NO   |     | NULL    |       |
    | index       | text     | YES  |     | NULL    |       |
    +-------------+----------+------+-----+---------+-------+

The production.keywords field will hold the index of the product keywords, and description.index will hold the indexed description. Both are stored in JSON format.

Here is an example of a function that adds a new product:

    <?php
    require_once 'firewind/core.php';

    $firewind   = new firewind;
    $connection = new mysqli( 'host', 'user', 'password', 'database' );

    if ( $connection->connect_error ) {
        die( 'Cannot connect to database.' );
    }

    $connection->set_charset( 'UTF8' );

    function add_product( $name, $manufacturer, $price, $description, $keywords ) {
        global $firewind, $connection;

        // Index the description //
        $description_index = $firewind->make_index( $description );
        $description_index = json_encode( $description_index );

        // Index the keywords, with a higher coefficient //
        $keywords_index = $firewind->make_index( $keywords, 2 );
        $keywords_index = json_encode( $keywords_index );

        // Prepare the queries //
        $production_query = $connection->prepare(
            "INSERT INTO `production` ( `name`, `manufacturer`, `price`, `keywords` ) VALUES ( ?, ?, ?, ? )"
        );
        $description_query = $connection->prepare(
            "INSERT INTO `description` ( `fid`, `description`, `index` ) VALUES ( LAST_INSERT_ID(), ?, ? )"
        );

        if ( !$production_query || !$description_query ) {
            die( "Cannot prepare requests!\n" );
        }

        if (
            // Bind the parameters //
            $production_query->bind_param( 'ssis', $name, $manufacturer, $price, $keywords_index ) &&
            $description_query->bind_param( 'ss', $description, $description_index ) &&

            // Execute the queries; the product must be inserted first //
            $production_query->execute() &&
            $description_query->execute()
        ) {
            // Both queries succeeded //
            echo( "Product successfully added!\n" );

            // Release the statements //
            $production_query->close();
            $description_query->close();

            return true;
        } else {
            die( "An error occurred while executing query...\n" );
        }
    }
    ?>

Here the search engine is integrated into the function that adds a new product to the store. And now the search request handler:

    <?php
    require_once '../src/core.php';

    $firewind   = new firewind;
    $connection = new mysqli( 'host', 'user', 'password', 'database' );

    if ( $connection->connect_error ) {
        die( 'Cannot connect to database.' );
    }

    $connection->set_charset( 'UTF8' );

    // Get the search query //
    $query = isset( $_GET[ 'query' ] ) ? trim( $_GET[ 'query' ] ) : false;

    if ( $query ) {
        // Index the search query //
        $query_index = $firewind->make_index( $query );

        // Fetch the product indexes; description.fid refers to production.uid //
        $production = $connection->query("
            SELECT p.`uid`, p.`name`, p.`keywords`, d.`index`
            FROM `production` p, `description` d
            WHERE p.`uid` = d.`fid`
        ");

        if ( !$production ) {
            die( "Cannot get production info.\n" );
        }

        // Search through the products //
        $result = [];

        while ( $product = $production->fetch_assoc() ) {
            // Decode the indexes //
            $keywords = json_decode( $product[ 'keywords' ] );
            $index    = json_decode( $product[ 'index' ] );

            $range  = $firewind->search( $query_index, $keywords );
            $range += $firewind->search( $query_index, $index );

            if ( $range > 0 ) {
                $result[ $product[ 'uid' ] ] = $range;
            }
        }

        // Was anything found? //
        if ( $result ) {
            // Sort the results by rank, highest first //
            arsort( $result );

            // Print the results //
            $i = 1;
            foreach ( $result as $uid => $range ) {
                printf( "#%d. Found product with id %d and range %d.\n", $i++, $uid, $range );
            }
        } else {
            echo( "Sorry, no results found.\n" );
        }
    } else {
        echo( "Query cannot be empty. Try again.\n" );
    }
    ?>

This script takes the search query from the GET parameter query and performs the search. The products found are printed as the result.

Conclusion

This article described one way to implement a site search. It is the very first version, so I will be glad to hear your comments, opinions and suggestions. Join my project on GitHub: https://github.com/axilirator/firewind . There are plans to add a bunch of other features, such as caching of search queries, hints while typing a query, and a letter-by-letter comparison algorithm that will help deal with typos.

Thank you all for your attention, and happy Information Security Day!

Source: https://habr.com/ru/post/244561/
