
Searching MediaWiki with Sphinx

Hello, reader!

Some time ago, I was tasked with rolling out MediaWiki on a corporate network.
The main problem with this rollout was searching the information stored in the wiki.
In this article, I would like to describe how to make Sphinx search work with MediaWiki.
I am writing this because Russian-language documentation is scarce, and I could not find a reasonably decent guide that would help my colleagues start using this excellent search engine quickly and simply.
Or maybe I just do not know how to use Google...

Why this is needed


The goal of this rollout in our organization is to move the corporate knowledge base into a format that is more convenient to present, correct, and extend.

By the way, our company delivers end-to-end document automation projects.
The customers are large, and the solutions are complex and sometimes non-standard.
The wiki is meant to hold articles not only about projects, but also about technical solutions, their specifics, and innovative methods and technologies.
We also plan to use it as a source of information for new employees: they have to study a fairly large amount of material, and access to it currently leaves much to be desired.

As mentioned above, the popular MediaWiki engine was taken as the basis. The key problem I anticipated from the very beginning was information retrieval.
The standard search is widely considered poor, so the natural question was how to fix that.

Preparation


So, everything is deployed on Windows Server 2012 R2 64-bit, with IIS running, naturally:

image

This was the latest version at the time of installation. The SphinxSearch extension is already connected in the screenshot; I describe how to do this below.

You need to download the search engine itself from the official site. I chose 2.1.9-release (July 2014).
You also need to download the extension for MediaWiki.
I took it from the Wikimedia Git repository.
The current version at the time was 0.9.0.

Installing and configuring the Sphinx search engine


After downloading the engine, I unpacked it into C:\inetpub\wwwroot\mw\sphinx.
The next step is to prepare the config. I took the file sphinx.conf.in as a basis.
Here is the configuration I ended up with, comments included.

# data source definition for the main index
source src_wiki_main
{
    type = mysql

    # data source
    sql_host = 127.0.0.1 # localhost Win7+
    sql_user = mwuser
    sql_pass =
    sql_db =
    sql_port = 3306 # optional, default is 3306

    # pre-query, executed before the main fetch query.
    sql_query_pre = SET NAMES utf8

    # main document fetch query - change the table names if you are using a prefix
    sql_query = SELECT page_id, page_title, page_namespace, page_is_redirect, old_id, old_text FROM page, revision, text WHERE rev_id=page_latest AND old_id=rev_text_id

    # attribute columns
    sql_attr_uint = page_namespace
    sql_attr_uint = page_is_redirect
    sql_attr_uint = old_id

    # collect all category ids for category filtering
    sql_attr_multi = uint category from query; SELECT cl_from, page_id AS category FROM categorylinks, page WHERE page_title=cl_to AND page_namespace=14

    # used by command-line search utility to display document information
    sql_query_info = SELECT page_title, page_namespace FROM page WHERE page_id=$id
}

# data source definition for the incremental index
source src_wiki_incremental : src_wiki_main
{
    # adjust this query based on the time you run the full index
    # in this case, full index runs at 7 AM UTC
    sql_query = SELECT page_id, page_title, page_namespace, page_is_redirect, old_id, old_text FROM page, revision, text WHERE rev_id=page_latest AND old_id=rev_text_id AND page_touched>=DATE_FORMAT(CURDATE(), '%Y%m%d070000')

    # plain
    type = plain
}

# main index definition
index wiki_main
{
    type = plain

    # which document source to index
    source = src_wiki_main

    # this is path and index file name without extension
    # you may need to change this path or create this folder
    path = C:/inetpub/wwwroot/mw/sphinx/data/wiki_main

    # docinfo (ie. per-document attribute values) storage strategy
    docinfo = extern

    # morphology
    morphology = stem_en, stem_ru

    # stopwords file
    #stopwords = /var/data/sphinx/stopwords.txt

    # minimum word length
    min_word_len = 1

    # allow wildcard (*) searches
    min_infix_len = 1
    enable_star = 1

    # charset encoding type
    charset_type = utf-8

    # charset definition and case folding rules "table"
    charset_table = 0..9, A..Z->a..z, a..z, \
        U+C0->a, U+C1->a, U+C2->a, U+C3->a, U+C4->a, U+C5->a, U+C6->a, \
        U+C7->c,U+E7->c, U+C8->e, U+C9->e, U+CA->e, U+CB->e, U+CC->i, \
        U+CD->i, U+CE->i, U+CF->i, U+D0->d, U+D1->n, U+D2->o, U+D3->o, \
        U+D4->o, U+D5->o, U+D6->o, U+D8->o, U+D9->u, U+DA->u, U+DB->u, \
        U+DC->u, U+DD->y, U+DE->t, U+DF->s, \
        U+E0->a, U+E1->a, U+E2->a, U+E3->a, U+E4->a, U+E5->a, U+E6->a, \
        U+E7->c,U+E7->c, U+E8->e, U+E9->e, U+EA->e, U+EB->e, U+EC->i, \
        U+ED->i, U+EE->i, U+EF->i, U+F0->d, U+F1->n, U+F2->o, U+F3->o, \
        U+F4->o, U+F5->o, U+F6->o, U+F8->o, U+F9->u, U+FA->u, U+FB->u, \
        U+FC->u, U+FD->y, U+FE->t, U+FF->s, U+410..U+42F->U+430..U+44F, \
        U+430..U+44F, U+0400->U+0435, U+0401->U+0435, U+0402->U+0452, \
        U+0452, U+0403->U+0433, U+0404->U+0454, U+0454, U+0405->U+0455, \
        U+0455, U+0406->U+0456, U+0407->U+0456, U+0457->U+0456, U+0456, \
        U+0408..U+040B->U+0458..U+045B, U+0458..U+045B, U+040C->U+043A, \
        U+040D->U+0438, U+040E->U+0443, U+040F->U+045F, U+045F, \
        U+0450->U+0435, U+0451->U+0435, U+0453->U+0433, U+045C->U+043A, \
        U+045D->U+0438, U+045E->U+0443, U+0460->U+0461, U+0461, U+0462->U+0463, \
        U+0463, U+0464->U+0465, U+0465, U+0466->U+0467, U+0467, U+0468->U+0469, \
        U+0469, U+046A->U+046B, U+046B, U+046C->U+046D, U+046D, U+046E->U+046F, \
        U+046F, U+0470->U+0471, U+0471, U+0472->U+0473, U+0473, U+0474->U+0475, \
        U+0476->U+0475, U+0477->U+0475, U+0475, U+0478->U+0479, U+0479, \
        U+047A->U+047B, U+047B, U+047C->U+047D, U+047D, U+047E->U+047F, U+047F, \
        U+0480->U+0481, U+0481, U+048A->U+0438, U+048B->U+0438, U+048C->U+044C, \
        U+048D->U+044C, U+048E->U+0440, U+048F->U+0440, U+0490->U+0433, \
        U+0491->U+0433, U+0490->U+0433, U+0491->U+0433, U+0492->U+0433, \
        U+0493->U+0433, U+0494->U+0433, U+0495->U+0433, U+0496->U+0436, \
        U+0497->U+0436, U+0498->U+0437, U+0499->U+0437, U+049A->U+043A, \
        U+049B->U+043A, U+049C->U+043A, U+049D->U+043A, U+049E->U+043A, \
        U+049F->U+043A, U+04A0->U+043A, U+04A1->U+043A, U+04A2->U+043D, \
        U+04A3->U+043D, U+04A4->U+043D, U+04A5->U+043D, U+04A6->U+043F, \
        U+04A7->U+043F, U+04A8->U+04A9, U+04A9, U+04AA->U+0441, U+04AB->U+0441, \
        U+04AC->U+0442, U+04AD->U+0442, U+04AE->U+0443, U+04AF->U+0443, U+04B0->U+0443, \
        U+04B1->U+0443, U+04B2->U+0445, U+04B3->U+0445, U+04B4->U+04B5, U+04B5, \
        U+04B6->U+0447, U+04B7->U+0447, U+04B8->U+0447, U+04B9->U+0447, U+04BA->U+04BB, \
        U+04BB, U+04BC->U+04BD, U+04BE->U+04BD, U+04BF->U+04BD, U+04BD, U+04C0->U+04CF, \
        U+04CF, U+04C1->U+0436, U+04C2->U+0436, U+04C3->U+043A, U+04C4->U+043A, \
        U+04C5->U+043B, U+04C6->U+043B, U+04C7->U+043D, U+04C8->U+043D, U+04C9->U+043D, \
        U+04CA->U+043D, U+04CB->U+0447, U+04CC->U+0447, U+04CD->U+043C, U+04CE->U+043C, \
        U+04D0->U+0430, U+04D1->U+0430, U+04D2->U+0430, U+04D3->U+0430, U+04D4->U+00E6, \
        U+04D5->U+00E6, U+04D6->U+0435, U+04D7->U+0435, U+04D8->U+04D9, U+04DA->U+04D9, \
        U+04DB->U+04D9, U+04D9, U+04DC->U+0436, U+04DD->U+0436, U+04DE->U+0437, \
        U+04DF->U+0437, U+04E0->U+04E1, U+04E1, U+04E2->U+0438, U+04E3->U+0438, \
        U+04E4->U+0438, U+04E5->U+0438, U+04E6->U+043E, U+04E7->U+043E, U+04E8->U+043E, \
        U+04E9->U+043E, U+04EA->U+043E, U+04EB->U+043E, U+04EC->U+044D, U+04ED->U+044D, \
        U+04EE->U+0443, U+04EF->U+0443, U+04F0->U+0443, U+04F1->U+0443, U+04F2->U+0443, \
        U+04F3->U+0443, U+04F4->U+0447, U+04F5->U+0447, U+04F6->U+0433, U+04F7->U+0433, \
        U+04F8->U+044B, U+04F9->U+044B, U+04FA->U+0433, U+04FB->U+0433, U+04FC->U+0445, \
        U+04FD->U+0445, U+04FE->U+0445, U+04FF->U+0445, U+0410..U+0418->U+0430..U+0438, \
        U+0419->U+0438, U+0430..U+0438, U+041A..U+042F->U+043A..U+044F, U+043A..U+044F,
}

# incremental index definition
index wiki_incremental : wiki_main
{
    type = plain
    path = C:/inetpub/wwwroot/mw/sphinx/data/wiki_incremental
}

# indexer settings
indexer
{
    # memory limit (default is 32M)
    mem_limit = 64M
}

# searchd settings
searchd
{
    # IP address and port on which search daemon will bind and accept
    listen = 127.0.0.1:9312

    # searchd run info is logged here - create or change the folder
    log = C:/inetpub/wwwroot/mw/sphinx/log/searchd.log

    # all the search queries are logged here
    query_log = C:/inetpub/wwwroot/mw/sphinx/log/query.log

    # client read timeout, seconds
    read_timeout = 5

    # maximum amount of children to fork
    max_children = 30

    # a file which will contain searchd process ID
    pid_file = C:/inetpub/wwwroot/mw/sphinx/log/searchd.pid

    # maximum amount of matches this daemon would ever retrieve
    # from each index and serve to client
    max_matches = 1000

    workers = threads
}

# --eof--
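To make the charset_table rules in the config more concrete, here is a minimal Python sketch of how such a table could be interpreted. It is not Sphinx's own parser: the function names (build_table, fold) are made up for illustration, only the plain forms used in this config (a..z, A..Z->a..z, U+XXXX) are handled, and characters missing from the table are simply turned into spaces to mimic word separators.

```python
def parse_code(token):
    """Turn 'a' or 'U+0430' into a Unicode code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError(f"unsupported token: {token!r}")


def parse_range(text):
    """Turn 'a..z' or a single code into an inclusive (low, high) pair."""
    if ".." in text:
        low, high = text.split("..")
        return parse_code(low), parse_code(high)
    code = parse_code(text)
    return code, code


def build_table(spec):
    """Build {source code point: folded code point} from a rule list."""
    table = {}
    for rule in spec.split(","):
        rule = rule.strip()
        if not rule:
            continue  # tolerate a trailing comma
        source, arrow, target = rule.partition("->")
        src_low, src_high = parse_range(source)
        dst_low, dst_high = parse_range(target) if arrow else (src_low, src_high)
        if dst_high - dst_low != src_high - src_low:
            raise ValueError(f"range sizes differ in rule: {rule!r}")
        for offset in range(src_high - src_low + 1):
            table[src_low + offset] = dst_low + offset
    return table


def fold(text, table):
    """Apply the table; characters outside it become separators (spaces)."""
    return "".join(chr(table[ord(ch)]) if ord(ch) in table else " " for ch in text)


# A shortened rule list in the same syntax as the config (assumption: illustration only)
SPEC = "0..9, A..Z->a..z, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F,"
TABLE = build_table(SPEC)
print(fold("Wiki-Поиск 42", TABLE))  # wiki поиск 42
```

The point of the sketch is that uppercase Latin and Cyrillic letters are folded to lowercase before indexing, so a query for "поиск" can match "Поиск".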


This completes the Sphinx configuration.
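One detail of the incremental source is worth spelling out. Its query compares page_touched with a string built by DATE_FORMAT(CURDATE(), '%Y%m%d070000'). MediaWiki stores timestamps as 14-digit YYYYMMDDHHMMSS strings, so the cutoff is today's date with 07:00:00 appended, and a plain string comparison orders such values chronologically. A small Python sketch of the same computation (the function names and sample dates are hypothetical; this only illustrates the comparison and is not part of Sphinx or MediaWiki):

```python
from datetime import date


def incremental_cutoff(today):
    """Mirror DATE_FORMAT(CURDATE(), '%Y%m%d070000'): today's date plus 07:00:00."""
    return today.strftime("%Y%m%d") + "070000"


def is_in_incremental(page_touched, today):
    """True when a page was touched at or after the cutoff (string comparison)."""
    return page_touched >= incremental_cutoff(today)


day = date(2014, 7, 15)  # sample date, assumption for the example
print(incremental_cutoff(day))                   # 20140715070000
print(is_in_incremental("20140715093000", day))  # True
print(is_in_incremental("20140714235959", day))  # False
```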

Installing the search service


Now we install the service.
To do this, run the following on the command line:
C:/inetpub/wwwroot/mw/sphinx/bin/searchd --install --config C:/inetpub/wwwroot/mw/sphinx/bin/sphinx.conf --servicename SphinxSearch
The command should finish without errors, and the service should appear under Administration - Services with the name SphinxSearch.
Do not start it yet: the data has not been indexed, so starting the service would produce an error.
Note that the slashes used are forward slashes (/), not backslashes (\). Otherwise there will be access errors for the log files and the PID file of the search daemon.
Also note that the conf file is in the folder with the binaries (bin), so when running the tools from the console in that folder you do not need to specify the path to the config.
When installing the service, however, it is better to specify the config path explicitly.
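The note about slashes can be illustrated with a short Python sketch that builds the same install command from a Windows-style path and converts every path to forward slashes. The helper name is hypothetical, and the sketch only builds the argument list; it does not run searchd.

```python
from pathlib import PureWindowsPath


def searchd_install_command(sphinx_root, service_name="SphinxSearch"):
    """Build the searchd --install argument list using forward slashes only."""
    root = PureWindowsPath(sphinx_root)
    searchd = (root / "bin" / "searchd").as_posix()
    config = (root / "bin" / "sphinx.conf").as_posix()
    return [searchd, "--install", "--config", config, "--servicename", service_name]


command = searchd_install_command(r"C:\inetpub\wwwroot\mw\sphinx")
print(" ".join(command))
```

On a real server the resulting list could be passed to subprocess.run from an elevated prompt; that step is left out here on purpose.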

Now, on the command line, go to the folder with the binaries (bin) and run
indexer --all
We get a result like this:
  Sphinx 2.1.9-release (r4761)
  Copyright (c) 2001-2014, Andrew Aksyonoff
  Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com)

  using config file './sphinx.conf'...
  indexing index 'wiki_main'...
  collected 159 docs, 0.5 MB
  collected 0 attr values
  sorted 0.0 Mvalues, 100.0% done
  sorted 1.6 Mhits, 100.0% done
  total 159 docs, 494176 bytes
  total 0.596 sec, 827807 bytes/sec, 266.34 docs/sec
  indexing index 'wiki_incremental'...
  collected 159 docs, 0.5 MB
  collected 0 attr values
  sorted 0.0 Mvalues, 100.0% done
  sorted 1.6 Mhits, 100.0% done
  total 159 docs, 494176 bytes
  total 0.584 sec, 844808 bytes/sec, 271.81 docs/sec
  total 4 reads, 0.005 sec, 2107.7 kb/call avg, 1.4 msec/call avg
  total 38 writes, 0.022 sec, 479.7 kb/call avg, 0.5 msec/call avg

That's it, the index has been created.
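If the indexer is run from a script, the "total N docs, M bytes" lines are a convenient thing to check. Below is a minimal Python sketch that extracts them from captured output; the function name is hypothetical and the regular expression is based only on the output format shown in this article, so other Sphinx versions may print something different.

```python
import re

# Matches lines such as: total 159 docs, 494176 bytes
TOTAL_LINE = re.compile(r"total (\d+) docs, (\d+) bytes")


def indexed_totals(indexer_output):
    """Return a list of (docs, bytes) pairs, one per 'total N docs, M bytes' line."""
    return [(int(docs), int(size)) for docs, size in TOTAL_LINE.findall(indexer_output)]


sample = """\
indexing index 'wiki_main'...
collected 159 docs, 0.5 MB
total 159 docs, 494176 bytes
total 0.596 sec, 827807 bytes/sec, 266.34 docs/sec
"""
totals = indexed_totals(sample)
print(totals)  # [(159, 494176)]
```

An empty list, or a pair with zero documents, would be a hint that the source query or the database credentials need another look.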

Checking that the search engine works


So, the index has been created. On the command line we are still in the folder with the binaries. Now start the SphinxSearch service and run something like:

search wiki

I got this result:

  Sphinx 2.1.9-release (r4761)
  Copyright (c) 2001-2014, Andrew Aksyonoff
  Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com)

  using config file './sphinx.conf'...
  index 'wiki_main': query 'wiki ': returned 13 matches of 13 total in 0.004 sec

  displaying matches:
  1. document=76, weight=1719, page_namespace=0, page_is_redirect=0, old_id=929, category=()
          page_title=???????_CompanyNameWiki
          page_namespace=0
  2. document=77, weight=1670, page_namespace=0, page_is_redirect=0, old_id=1136, category=()
          page_title=FAQ_CompanyNameWiki
          page_namespace=0
  3. document=79, weight=1670, page_namespace=0, page_is_redirect=0, old_id=864, category=()
          page_title=CompanyNameWiki:_?????
          page_namespace=0
  4. document=81, weight=1670, page_namespace=12, page_is_redirect=0, old_id=939, category=()
          page_title=C???????_?????_??????
          page_namespace=12
  5. document=128, weight=1670, page_namespace=0, page_is_redirect=0, old_id=1075, category=()
          page_title=?????
          page_namespace=0
  6. document=1, weight=1648, page_namespace=0, page_is_redirect=0, old_id=1091, category=()
          page_title=?????????_????????
          page_namespace=0
  7. document=4, weight=1648, page_namespace=0, page_is_redirect=0, old_id=10, category=()
          page_title=?????????_????????
          page_namespace=0
  8. document=5, weight=1648, page_namespace=0, page_is_redirect=0, old_id=181, category=()
          page_title=?????????_?????????_????_(???????_??????)
          page_namespace=0
  9. document=2, weight=1608, page_namespace=8, page_is_redirect=0, old_id=1135, category=()
          page_title=Sidebar
          page_namespace=8
  10. document=12, weight=1608, page_namespace=0, page_is_redirect=0, old_id=719, category=()
          page_title=?????????_CRM
          page_namespace=0
  11. document=71, weight=1608, page_namespace=0, page_is_redirect=0, old_id=701, category=()
          page_title=??????_???????
          page_namespace=0
  12. document=80, weight=1608, page_namespace=12, page_is_redirect=0, old_id=862, category=()
          page_title=?????????_CompanyNameWiki
          page_namespace=12
  13. document=129, weight=1608, page_namespace=0, page_is_redirect=0, old_id=1085, category=()
          page_title=????
          page_namespace=0

  words:
  1. 'wiki': 13 documents, 37 hits

  index 'wiki_incremental': query 'wiki ': returned 13 matches of 13 total in 0.000 sec

  displaying matches:
  1. document=76, weight=1719, page_namespace=0, page_is_redirect=0, old_id=929, category=()
          page_title=???????_CompanyNameWiki
          page_namespace=0
  2. document=77, weight=1670, page_namespace=0, page_is_redirect=0, old_id=1136, category=()
          page_title=FAQ_CompanyNameWiki
          page_namespace=0
  3. document=79, weight=1670, page_namespace=0, page_is_redirect=0, old_id=864, category=()
          page_title=CompanyNameWiki:_?????
          page_namespace=0
  4. document=81, weight=1670, page_namespace=12, page_is_redirect=0, old_id=939, category=()
          page_title=C???????_?????_??????
          page_namespace=12
  5. document=128, weight=1670, page_namespace=0, page_is_redirect=0, old_id=1075, category=()
          page_title=?????
          page_namespace=0
  6. document=1, weight=1648, page_namespace=0, page_is_redirect=0, old_id=1091, category=()
          page_title=?????????_????????
          page_namespace=0
  7. document=4, weight=1648, page_namespace=0, page_is_redirect=0, old_id=10, category=()
          page_title=?????????_????????
          page_namespace=0
  8. document=5, weight=1648, page_namespace=0, page_is_redirect=0, old_id=181, category=()
          page_title=?????????_?????????_????_(???????_??????)
          page_namespace=0
  9. document=2, weight=1608, page_namespace=8, page_is_redirect=0, old_id=1135, category=()
          page_title=Sidebar
          page_namespace=8
  10. document=12, weight=1608, page_namespace=0, page_is_redirect=0, old_id=719, category=()
          page_title=?????????_CRM
          page_namespace=0
  11. document=71, weight=1608, page_namespace=0, page_is_redirect=0, old_id=701, category=()
          page_title=??????_???????
          page_namespace=0
  12. document=80, weight=1608, page_namespace=12, page_is_redirect=0, old_id=862, category=()
          page_title=?????????_CompanyNameWiki
          page_namespace=12
  13. document=129, weight=1608, page_namespace=0, page_is_redirect=0, old_id=1085, category=()
          page_title=????
          page_namespace=0

  words:
  1. 'wiki': 13 documents, 37 hits


Because of an encoding mismatch, we got "?????" instead of Russian letters. But the main thing is that results are returned. So the search works!
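The question marks are what typically appears when text passes through an encoding that has no Cyrillic letters. Where exactly that conversion happened in this setup is not stated, so the following Python sketch is only an illustration of the effect, with latin-1 picked as an assumed example of such an encoding and a made-up page title:

```python
def through_encoding(text, encoding="latin-1"):
    """Round-trip text through an encoding, replacing unsupported characters with '?'."""
    raw = text.encode(encoding, errors="replace")
    return raw.decode(encoding)


title = "Поиск_CompanyNameWiki"  # hypothetical page title
print(through_encoding(title))           # ?????_CompanyNameWiki
print(through_encoding(title, "utf-8"))  # Поиск_CompanyNameWiki
```

Latin characters and underscores survive, which matches the pattern in the output above: only the Russian parts of the titles are replaced.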

That's all: we have installed Sphinx, indexed our database, and now have a working search engine!

Automating index updates


For the search to work fully, the index must also be updated regularly: articles keep being added, and they need to show up in search results too.

To do this, create a task in Task Scheduler that runs, at a set interval (every 5 minutes in my case), a bat file with the following content:
c:\inetpub\wwwroot\mw\sphinx\bin\indexer --all --config c:\inetpub\wwwroot\mw\sphinx\bin\sphinx.conf --rotate

I run the task as a local administrator. You must first explicitly grant that account rights to the entire sphinx folder.
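The same schedule could also be registered from a script instead of the Task Scheduler GUI. The Python sketch below only builds a schtasks command line and does not execute it; the task name and the location of the bat file are hypothetical placeholders, and the 5-minute interval is taken from the text above.

```python
def schtasks_create_command(task_name, bat_path, every_minutes=5):
    """Build a 'schtasks /Create' argument list for a task repeating every N minutes."""
    return [
        "schtasks", "/Create",
        "/SC", "MINUTE",           # schedule type: minutes
        "/MO", str(every_minutes),  # modifier: run every N minutes
        "/TN", task_name,           # task name
        "/TR", bat_path,            # program to run
    ]


# Hypothetical task name and script location
command = schtasks_create_command("SphinxReindex", r"C:\inetpub\wwwroot\mw\sphinx\reindex.bat")
print(" ".join(command))
```

The account the task runs under still has to be set separately, for example in the Task Scheduler GUI as described above.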

Connecting Sphinx search to MediaWiki


Now you need to connect the search engine to MediaWiki. Otherwise MediaWiki has no way of knowing that it should search with Sphinx rather than with its built-in mechanism.

Open the file LocalSettings.php (it is in the MediaWiki folder) and add the following:

  #Sphinx search
  $wgSearchType = 'SphinxMWSearch';
  require_once "$IP/extensions/SphinxSearch/SphinxSearch.php";
  $wgSphinxSearch_host = "127.0.0.1";
  $wgSphinxSearch_port = 9312;
  $wgSphinxSearch_matches = 50;
  $wgEnableSphinxPrefixSearch = true;
  $wgFooterIcons['poweredby']['sphinxsearch'] = array(
      'src' => "$wgScriptPath/extensions/SphinxSearch/skins/images/Powered_by_sphinx.png",
      'url' => 'http://www.mediawiki.org/wiki/Extension:SphinxSearch',
      'alt' => 'Search Powered by Sphinx',
  );


Create a new folder named SphinxSearch in the extensions folder and put the downloaded extension files into it.
An important note from vedmaka:
Addition: after installing Sphinx, go to http://sphinxsearch.com/downloads/archive/, download the source package of the matching version, and copy the sphinxapi.php file from it into the directory with the SphinxSearch extension.

Save the file. Restart the site through IIS Manager. Check the search by hand through the MediaWiki web page. Everything should work.
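If the search page does not work at this point, one quick thing to rule out is whether searchd is reachable at the host and port set in LocalSettings.php. A minimal Python sketch of such a check (the function name is hypothetical; it only tests that a TCP connection can be opened and says nothing about the index itself):

```python
import socket


def searchd_reachable(host="127.0.0.1", port=9312, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(searchd_reachable())
```

A False result usually means the SphinxSearch service is not running or listens on a different address than the one configured for the extension.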

image
Suggestions shown while typing in the search box.

image
And the search results themselves.

Conclusion


As a result, we got a better search across the materials in our wiki.
By default, results are sorted with SPH_SORT_RELEVANCE.
If desired, this can be changed by setting the following parameter explicitly in LocalSettings.php:

$wgSphinxSearch_sortby

More information about the available sorting options can be found in the documentation.

In this article, I drew not only on my own findings, but also on information gathered while getting this search engine to work.

I did not cover the errors that may come up along the way. I thought it right to share a working configuration and a sequence of steps that leads to a working solution as a whole. And there was a sea of such errors, from missing file permissions and the "wrong" slashes to the non-working Sphinx configuration shipped with the extension.

Source: https://habr.com/ru/post/230073/

