Search on Drupal 7 using Apache Solr, Part 7: full-text search in Russian



I finally got around to writing another article in this series. This time I will show how to set up good full-text search in Russian for Drupal on Apache Solr.





In principle, this material applies to any language, but for obvious reasons I chose Russian. At the end of the first article in this series I described a way to improve search in Russian. That method is simple but not very effective: by default, the most it can do is handle word endings. Consider a simple example. The word "climate" is found when the query differs from it only in the ending.







But for other forms of the word "climate" there are no results.


To make the search more flexible, we can plug in an additional dictionary. I used the HunspellStemFilterFactory class for stemming.

Download dictionaries for the Russian language from here - download.services.openoffice.org/files/contrib/dictionaries/ru_RU-pack.zip



We need two files: ru_RU.aff and ru_RU.dic. They must be converted to UTF-8, otherwise Apache Solr will not work with them.

At first I tried to change their encoding with iconv, but Solr did not work with the result.

In the end I re-saved the files as UTF-8 in Krusader, and after that everything worked fine.
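
The conversion can also be scripted. Below is a minimal Python sketch, not the method used in this article (the files here were re-saved by hand in Krusader). It assumes the downloaded files are in KOI8-R, which you should verify for your copy; the function name convert_to_utf8 is made up for illustration. Hunspell affix files usually declare their encoding in a SET line, so the sketch also rewrites that line to SET UTF-8. Whether that line had anything to do with the failed iconv attempt is only a guess.

```python
def convert_to_utf8(src, dst, source_encoding="koi8-r"):
    """Re-save a Hunspell .aff or .dic file as UTF-8.

    source_encoding is an ASSUMPTION (KOI8-R); check the SET line
    of your own ru_RU.aff before relying on it.
    """
    with open(src, "r", encoding=source_encoding) as f:
        text = f.read()

    # If the file declares its encoding ("SET KOI8-R"), update the
    # declaration so that it matches the new encoding of the file.
    lines = [
        "SET UTF-8" if line.startswith("SET ") else line
        for line in text.split("\n")
    ]

    with open(dst, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines))


# Hypothetical usage (output file names are illustrative):
# convert_to_utf8("ru_RU.aff", "ru_RU.utf8.aff")
# convert_to_utf8("ru_RU.dic", "ru_RU.utf8.dic")
```

If you write to separate output files as in the commented usage, remember to rename them back to ru_RU.aff and ru_RU.dic, since those are the names referenced in schema.xml.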



After you convert the files, put them in the same Solr folder as schema.xml.



Now, in the schema (schema.xml), we declare that we are going to use HunspellStemFilterFactory with our dictionaries:



<filter class="solr.HunspellStemFilterFactory" dictionary="ru_RU.dic" affix="ru_RU.aff" ignoreCase="true" /> 




<fieldType name="text" class="solr.TextField" indexed="true" stored="true" multiValued="true" positionIncrementGap="100">
  <analyzer type="index">
    ...
    <filter class="solr.HunspellStemFilterFactory" dictionary="ru_RU.dic" affix="ru_RU.aff" ignoreCase="true" />
    ...
  </analyzer>
  <analyzer type="query">
    ...
    <filter class="solr.HunspellStemFilterFactory" dictionary="ru_RU.dic" affix="ru_RU.aff" ignoreCase="true" />
    ...
  </analyzer>
</fieldType>


In addition, after the HunspellStemFilterFactory declaration in the index analyzer, we add filters that break words into parts (n-grams). This makes the search more flexible.



<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="front" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="back" />
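
To see what these two filters do to a token, here is a rough plain-Python imitation. It is an illustration only, not Solr code, and the function names are made up. It assumes the filters run one after another in the order they are listed in the analyzer, so the second filter receives the grams produced by the first.

```python
def edge_ngrams(token, min_size=2, max_size=25, side="front"):
    """Return the edge n-grams of a token, shortest first.

    side="front" takes prefixes, side="back" takes suffixes.
    Tokens shorter than min_size produce nothing.
    """
    longest = min(max_size, len(token))
    grams = []
    for size in range(min_size, longest + 1):
        grams.append(token[:size] if side == "front" else token[-size:])
    return grams


def front_then_back(token):
    """Imitate the two chained filters: prefixes first, then the
    suffixes of every prefix."""
    grams = []
    for prefix in edge_ngrams(token, side="front"):
        grams.extend(edge_ngrams(prefix, side="back"))
    return grams


print(edge_ngrams("климат", side="front"))
print(edge_ngrams("климат", side="back"))
print(front_then_back("климат"))
```

In this imitation the chained filters end up indexing every fragment of the word that is at least two characters long, which is why a query can match the middle or the end of a word and not only its beginning. The price is a noticeably larger index.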


If you had the Porter (Snowball) stemming filter enabled

 <filter class="solr.SnowballPorterFilterFactory" language="Russian" protected="protwords.txt"/> 

do not forget to comment it out.



Now restart Solr and re-index the content. You can see that the search works much better! Here are a couple of examples:











The same can be done with dictionaries for other languages. The full list of dictionaries is at download.services.openoffice.org/files/contrib/dictionaries



Besides using ready-made dictionaries as they are, you can also add your own entries.

I noticed that the Russian dictionary I gave to Solr has no entry for the word "вомбат" (wombat) and decided to add it. To do this, first open ru_RU.aff and look for a suitable set of endings. The word "вомбат" has a zero ending, and the following rules apply to it:



SFX K 0 а [^ей]
SFX K 0 у [^ей]
SFX K 0 ом [^ежйцчшщ]
SFX K 0 е [^ей]
SFX K 0 ы [^гжйкхчшщ]
SFX K 0 и [гжкхчшщ]
SFX K 0 ей [жцчшщ]
SFX K 0 ов [^ежйцчшщ]
SFX K 0 ам [^ей]
SFX K 0 ами [^ей]
SFX K 0 ах [^ей]

These rules give the forms вомбатА, вомбатУ ... вомбатАХ.



The code for this ending is K.
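
To make the rule format more concrete, here is a small Python sketch that applies suffix rules to a word. Each Hunspell SFX line has the fields flag, characters to strip (0 means none), suffix to add, and a condition on the end of the word. The four rule lines in the sketch are my Cyrillic reading of some of the rules listed above and should be treated as an assumption; copy the real lines from your own ru_RU.aff. The function names are made up, and the code only imitates the idea; it is not how Solr or Hunspell is implemented.

```python
import re

# ASSUMPTION: a Cyrillic reading of four of the K rules quoted in the article.
RULES = """\
SFX K 0 а [^ей]
SFX K 0 у [^ей]
SFX K 0 и [гжкхчшщ]
SFX K 0 ах [^ей]
"""


def parse_rules(text):
    """Parse 'SFX flag strip suffix condition' lines into tuples."""
    rules = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "SFX":
            _, flag, strip, suffix, condition = parts
            # In the affix format a literal 0 stands for "nothing".
            strip = "" if strip == "0" else strip
            suffix = "" if suffix == "0" else suffix
            rules.append((flag, strip, suffix, condition))
    return rules


def apply_suffixes(word, flag, rules):
    """Return the forms that the rules with this flag generate for word."""
    forms = []
    for rule_flag, strip, suffix, condition in rules:
        if rule_flag != flag:
            continue
        # The condition is a pattern that the end of the word must match.
        if not re.search(condition + "$", word):
            continue
        if strip and not word.endswith(strip):
            continue
        stem = word[: len(word) - len(strip)] if strip else word
        forms.append(stem + suffix)
    return forms


print(apply_suffixes("вомбат", "K", parse_rules(RULES)))
```

The stemmer works in the opposite direction: given a form found in the text or in a query, it looks for a rule whose suffix can be removed so that the remainder is a dictionary word carrying the matching flag.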



Now open the file ru_RU.dic and add the new word with the appropriate code.







The code describes how the word changes. Of course, the screenshot is only an example; the new word should be inserted in alphabetical order.
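
For larger additions, editing the file by hand gets tedious. Here is a hedged Python sketch of the same step. It relies on the usual Hunspell .dic layout: the first line holds the number of entries, and each following line is a word, optionally followed by a slash and its flags (so our entry is вомбат/K). The helper name add_entry and the two sample words are made up, and the sketch assumes the existing entries are already sorted the way Python compares strings, which may not hold exactly for the real file.

```python
import bisect


def add_entry(dic_text, word, flag):
    """Insert 'word/flag' into the text of a .dic file, keeping order.

    ASSUMPTION: the first line is the entry count and the remaining
    lines are sorted by word in Python's string order.
    """
    lines = dic_text.split("\n")
    count = int(lines[0].strip())
    entries = [line for line in lines[1:] if line]

    # Compare by the word itself, ignoring the flags after the slash.
    keys = [entry.split("/", 1)[0] for entry in entries]
    pos = bisect.bisect_left(keys, word)
    if pos < len(keys) and keys[pos] == word:
        return dic_text  # the word is already in the dictionary

    entries.insert(pos, word + "/" + flag)
    return "\n".join([str(count + 1)] + entries) + "\n"


# Made-up miniature dictionary, for illustration only.
sample = "2\nволк\nворон\n"
print(add_entry(sample, "вомбат", "K"))
```

The function works on the file contents as a string, so reading and writing ru_RU.dic (in UTF-8) is left to the caller.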



Restart Solr, re-index the content, and check the results.







As a reminder, I use Apache Solr 3.6.1. The version is not critical, but I have run into cases where something does not work on certain versions. Often the cause lies in how queries to Solr are built through Search API, or in what is declared in schema.xml and config.xml.



Just in case, I am attaching my own schema; if something goes wrong for you, try using it.

Source: https://habr.com/ru/post/213085/


