📜 ⬆️ ⬇️

Sphinx Sample Search on Real Project - Tecdoc Auto Parts Store

In short: the article will be useful to those programmers who have already become interested in the relevant search and have read the articles on the initial installation of the sphinx search, drove on test examples and the same synthetic problems. Often, these examples do not provide an answer to the question, but how can you feel the real benefits of the Sphinx search module in comparison with other simpler search options? Code examples in the article are on php + smarty, Sphinx 2.0.1-beta, the database is mysql, the source code and the dump of the database structure are laid out in a separate archive in the basement. The article describes an example of using such features of the Sphinx as:

I would also like to contribute to the development of the project and frankly insufficient Russian documentation, despite the fact that the project was created and supported by a Russian-speaking programmer. Therefore, it was decided: the uninterrupted flow of task blocker goes through the forest, instead of him, as a thank you to the developers of the sphinx in general, and to the user Andrei Aksyonov, aka shodan, I am writing this article.

1. Introduction


If the sphinx is not yet installed and want to start, a link to the article for a newbie: Creating a trial search engine in Sphinx + php . You can test and see how this search works at autoklad.biz/?action=search .

I’ll say right away that our search module is in the beta testing version: there are many more rough spots, potential holes and other less obvious bugs, and therefore it is not recommended to set the current example for your production headlong because you need another testing stage with real loads and real requests from numerous users of our sites.

Our company has been dealing with auto parts, or rather selling online stores for parts, for a long time, fruitfully and quite successfully. But for a number of reasons, the relevant search was needed just now. The main reason most likely is that full text search is not suitable for most parts. But there are 5-10% of the goods for which he is catastrophically needed without him. And our standard search with inherently direct cross-links and indicating a clear model and brand of car from tecdoc does not work for this product group. An example of such "wrong" products: oils, tires, batteries and other similar.
')
The average price for spare parts of a small ordinary company is 2-10 million positions, respectively, 10% of this base will occupy the data we need. That is, the index in the examples below is built on the basis of about 300 thousand documents.

2. Creating a single config file for windows development and linux production OS


The problem to be solved is that the configuration files of the developer’s machine and the server’s production are different, and when developing, it is necessary to promptly update the unstable structure and constantly change these sphinx configs. In our case, it was aggravated by the fact that these configs on the server should be shared with a separate project by another development team, and such a section as “include * .conf” in the Sphinx is not yet provided.

On the local windows of the machine, the config is in “D: \ Sphinx \ sphinx.conf”, on the server in "/etc/sphinx/sphinx.conf", and on Linux the machine has a symbolic link to the updated Search-> CreateConfigFile () file in /var/www/autoklad.com.ua/imgbank/sphinx/sphinx.conf. The local file is updated directly to the folder, as it does not interfere with the neighbors.

Source code methods:
public function CreateConfigFile() { $sConfigFilePath=Db::GetConstant('sphinx:config_file_path',SERVER_PATH.'/imgbank/sphinx/'); $sConfigFileName='sphinx.conf'; $sConfigTemplate=Db::GetConstant('sphinx:config_template','production'); if (!file_exists($sConfigFilePath)) mkdir($sConfigFilePath); $sTopSection.=$this->GetPriceGroupConfig(); Base::$tpl->assign('sTopSection',$sTopSection); $sFileContent=Base::$tpl->fetch($this->sPrefix.'/config_sphinx_'.$sConfigTemplate.'.tpl'); file_put_contents($sConfigFilePath.$sConfigFileName,$sFileContent); } private function GetPriceGroupConfig() { Base::$tpl->assign('sDataFilePath',Base::GetConstant('sphinx:data_file_path','/var/data/')); return Base::$tpl->fetch($this->sPrefix.'/config_price_group.tpl'); } 


Template config_price_group.tpl, the rest - in the archive, so as not to stretch the article
 source price_group {ldelim} type = mysql sql_host = {$aDbConf.Host} sql_user = {$aDbConf.User} sql_pass = {$aDbConf.Password} sql_db = {$aDbConf.Database} sql_query_pre = SET NAMES utf8 sql_query_pre = SET CHARACTER SET utf8 sql_query = \ select p.id \ , p.code as code \ , c.title as brand \ , if(ifnull(cp.name_rus,'')<>'', cp.name_rus, ifnull(p.part_rus,'')) as part_name \ , pgr.name as price_group_name \ , p.id_price_group as id_price_group \ from price as p \ left join cat_part as cp on cp.item_code=p.item_code \ inner join cat as c on p.pref=c.pref \ inner join provider_virtual as pv on p.id_provider=pv.id_provider \ inner join user_provider as up on pv.id_provider_virtual=up.id_user \ inner join provider_group as pg on up.id_provider_group=pg.id \ inner join user as u on up.id_user=u.id and u.visible=1 \ inner join currency as cu on up.id_currency=cu.id \ inner join price_group as pgr on pgr.id=p.id_price_group \ where 1=1 sql_attr_uint = id_price_group sql_query_info = SELECT * FROM price WHERE id=$id {rdelim} index price_group {ldelim} source = price_group path = {$sDataFilePath}price_group/index morphology = stem_ru min_word_len = 3 charset_type = utf-8 min_infix_len = 3 #min_prefix_len = 3 enable_star = 1 {rdelim} 


The request could be simplified with a view, but as far as I understood, they do not recommend doing this, and a direct query to the data with joins is more efficient for reasons of the load created during indexing.

The value of the constants that are taken from the database for the local site
 sphinx:data_file_path D:/Sphinx/data/ sphinx:config_template local sphinx:config_file_path D:/Sphinx/ 


The value of the constants that are taken from the database for the production site
 sphinx:data_file_path /var/data/ sphinx:config_template production sphinx:config_file_path /var/www/autoklad.com.ua/imgbank/sphinx/ 


As a result of the work for the local sphinx, we have the following config file:
http://www.mstarproject.com/temp/3/sphinx/sphinx.conf

3. SetMatchMode (SPH_MATCH_EXTENDED2) and why SPH_MATCH_ANY and others are not suitable for a real search.


In order for the morphology to work and in the query “oil Castrol 5W40” there were documents with the text “Oil” and “15W40” - you need to simultaneously use the "*" symbol and search for the word "oil", and for this you need a query builder that works it is in the “SPH_MATCH_EXTENDED2” mode. There are also SPH_MATCH_EXTENDED, but as I understand it is the old version and recommend using the new version of the mode.

In the SPH_MATCH_ANY mode, it is impossible to ensure that when words are increased in the query, the number of results decreases. In the SPH_MATCH_ALL mode, it is impossible to achieve simultaneous operation of the mode by partial occurrence and word form. Other modes casually looked through, until they were useful, so I can not say anything about them.

The very request to the Sphinx for the phrase “Castrol 5W40 oils” will look like this:
 ( | **) & (Castrol | *Castrol*) & (5W40 | *5W40*) 


Important: in the config of the used index there should be 2 lines:
 min_infix_len = 3 enable_star = 1 

The first allows you to search for a partial occurrence of the word on the right and left, that is, from the end and from the beginning of the word. The second line allows to use "*" in the query. You can use min_prefix_len, if you need, for example, only occurrences on the left (from the beginning) of a word.

The function that processes the incoming string and forms the correct query:
 private function GetSphinxKeyword($sQuery) { $aRequestString=preg_split('/[\s,-]+/', $sQuery, 5); if ($aRequestString) { foreach ($aRequestString as $sValue) { if (strlen($sValue)>3) { $aKeyword[] .= "(".$sValue." | *".$sValue."*)"; } } $sSphinxKeyword = implode(" & ", $aKeyword); } return $sSphinxKeyword; } 


The result of the query can be tested at: http://autoklad.biz/?action=search&search[query[=%D0%BC%D0%B0%D1%81%D0%BB%D0%BE%20Castrol%205W40&search[id_price_group] = 35
Below the search results, the resulting array is displayed, which returns the sphinx for processing - pay attention to the [words] section, which indicates by what words how many documents were found. Other sections are no less important, but we are not talking about them yet.

Also a very frequent question on the forum and developer site is “How to raise higher the exact phrase entry?”, I.e., so that the weight of the “Search for word” document was higher than the “Search for word and also a lot of text”. Answer - you need to use SPH_RANK_SPH04, specially created for this typical task, as I understood.

4. SetSortMode (SPH_SORT_RELEVANCE), SetFieldWeights - sort by relevance and set weights for index fields


This method determines which results will be higher in the sorted array of data returned by the sphinx. In the case of SPH_SORT_RELEVANCE, the result will be sorted by so-called. "Relevance." Relevance, as much as we would like, works by purely arithmetic rules, and not like Google or Yandex search. That is, no magic: multiplying and adding up the weight of the index, the weight of the field, the number of occurrences of the desired word in the document and the frequency of the word in other documents.

In the simplest case, we set the weights for the index fields:
 $oSphinxClient->SetFieldWeights(array ( 'code' => 50, 'brand' => 40, 'part_name' => 10, 'price_group_name' => 5, )); 

and at the output we get an array sorted by “relevance” = “total weight”, where weight is an integer value. These numbers can be controlled by adjusting the relevance to yourself, that is, the more important field you need to assign more weight. In our example, the most important field is the “code” part code.

5. SetLimits (0.20) - limiting the output of results


This method is the simplest, works similarly to the Mayskle limit 0.20, and is needed respectively for the same: to get batch data for building steppers. In our project, we just need the first 20 (constant) results, as we will be further down the steps, if there are 3 or more, there is no sense to go.

6. AddQuery, RunQueries - building multi-queries


Multi-requests are a very convenient solution for batch requests when you need to send more than one request to the sphinx, but several. In our example, this is sending all parts groups the same request for a list of groups and the number of records in each group. That is, about 100 requests are sent, and one result is returned in one connection to the sphinx. It is also “resolved” to limit the maximum 32 allowed simultaneous requests in one batch of requests.

Code example:
 $aPriceGroup=Db::GetAll(Base::GetSql("Price/Group",array( 'visible'=>1, "where"=>" and pg.code_name is not null", ))); if ($aPriceGroup) { $aResultAll=array(); $i=0; foreach ($aPriceGroup as $aValue) { $oSphinxClient->SetFilter('id_price_group', array($aValue['id'])); $iQuery = $oSphinxClient->AddQuery($sSphinxKeyword, 'price_group'); $oSphinxClient->ResetFilters(); $bAddedUnrunQuery=true; $aPriceGroupAssoc[$iQuery+(32*$i)]=$aValue; if ($iQuery && !($iQuery % 31) ) { $aResultQuery=$oSphinxClient->RunQueries(); $aResultAll=array_merge($aResultAll,$aResultQuery); $sLastError=$oSphinxClient->GetLastError(); $i++; $bAddedUnrunQuery=false; } } if ($bAddedUnrunQuery) { $aResultQuery=$oSphinxClient->RunQueries(); $aResultAll=array_merge($aResultAll,$aResultQuery); } } 

Due to the fact that the task being performed had deadlines, it was not the goal to delve into all the subtleties of requests at the start. Therefore, I probably wrote a bike that solves the problem of a “grouped” query, similar to the group by in mysql. On the other hand, if I had dealt with the grouping in the sphinx - there would be no example where you can use multi-queries.

So in the comments, a more correct example of a request to receive the same is welcomed, but with the help of sphinx grouping.

7. SetFilter, ResetFilters - add filtering in the multi-request to limit the received data


In order to use filters, you must first register the fields in the index index that will be used for filtering. In our example, this is the id_price_group field:
 sql_attr_uint = id_price_group 

Accordingly, the code is used like this:
 foreach ($aPriceGroup as $aValue) { $oSphinxClient->SetFilter('id_price_group', array($aValue['id'])); $iQuery = $oSphinxClient->AddQuery($sSphinxKeyword, 'price_group'); $oSphinxClient->ResetFilters(); //... } 

That is, in the foreach loop, for each request in a multi-request, a filter is first set, and after adding, it is reset so that it does not work for other requests. In my opinion, everything is logical, obviously, and difficulties should arise.

8. Wordforms - using synonyms and overcoming limitations for non-standard word forms, like “C #”


In order for synonyms and non-standard (own) word forms to work, you need to include a file with the following word forms in the index config:
 wordforms = D:\Sphinx\data\wordforms.txt 

The file itself may contain, for example, such a data set in UTF-8 encoding:
 bosh > bosch  > bosch CASTROL > CASTROLL  > CASTROLL  > CASTROLL 

That is, on the left side all possible synonyms are from the right meaning of these words. And on the left should not be for example "bosch", if it is already on the right. At least if you do this - the search does not behave as I expected.

In our example, you can use the query "Castrol oil 5W40" and it will find the same as Castrol 5W40 oil. In the example with “C #” you need to include such non-standard word forms so that they are not processed according to the standard index scheme and work manually exactly as you configure them. Only you know the exact meaning of the phrase in your project, for example “C #” = “TO DIEZ for musicians”

There is no such functionality in the config file and the example on the server, only an example is given, but not yet implemented in the existing structure of the project synonyms.

Source codes, archives, links to useful sites


* Unofficial wiki documentation, including non-English http://sphinxsearch.com/wiki/doku.php

* Archive of cut-down sources of auto parts sample http://www.mstarproject.com/temp/3/sphinx/sphinxsearch_soruce.zip

* Archive of the structure of the curtailed example DB http://www.mstarproject.com/temp/3/sphinx/sphinxsearch_db_structure.zip

* Archive of the curtailed example DB (43 MB) with the data http://www.mstarproject.com/temp/3/sphinx/sphinxsearch_db_data.zip

* Sponsor's sponsored link: tecdoc + sphinxsearch online store development

I will be glad to constructive criticism and try to answer your questions. Most likely I will not go to the conference in St. Petersburg : a very inconvenient flight, and indeed a winter. I decided that there would be more benefit from the article, and you can meet the author of the Sphinx in Ukraine, you just have to wait.

Source: https://habr.com/ru/post/132118/


All Articles