
Installing and integrating Solr with Django under Ubuntu 12.04




Introduction



As you know, many sites and web applications need to implement search in one form or another. Everyone wants search to be fast and high-quality, and developers additionally want the engine to be easy to install and use. Since we are talking about Django, we also run into a number of constraints on how search can be implemented (assuming nobody has canceled deadlines, which loom 24 hours a day). So here is a small tutorial on how to install a search engine as powerful as Apache Solr and integrate it into Django as painlessly as possible. If that interests you, read on.






Our rakes



In the past we had a not-so-pleasant experience with django-sphinx. Sphinx itself is an excellent search engine, and django-sphinx is a fine library too, but after Django 1.3 you have to jump through hoops to get the whole stack running. And even once everything is set up and working, the feeling remains that it all lives somewhere off to the side, separate from the project, with its cron-driven indexing and configs that have no connection to the project's models. Yes, there is an auto-generator for the configs, but we never got it to work under 1.4+, and after each generation you would still have to hand-edit the configs for the search daemon itself (searchd). In short, search with django-sphinx works, quickly and well, but it is not very convenient to maintain, nor to integrate into a project.



How we found a needle in the haystack



Then came the turn of a new project, and once again the question of choosing a search engine came up. We ruled out django-sphinx immediately and started looking for an alternative. The first library we noticed was django-haystack, which is not surprising, since it is the most popular library for integrating search engines into Django. After a look at its API (a sample query follows the list below), we settled on it. It promised full integration with any of the supported search engines: elasticsearch, solr, whoosh, xapian. After a quick inspection, we made the following decisions:



  1. Whoosh was eliminated because it lacks a heap of small features, some of which the project needed. It is also said to be slower than the other options.
  2. Xapian was eliminated after an unsuccessful attempt to integrate its third-party backend.
  3. That left two options: elasticsearch and solr. Both run on Java, both are built on Lucene, and their feature sets are almost identical. A coin toss decided it: solr.
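For reference, here is roughly what the haystack API that won us over looks like. A hypothetical query, where Post is a placeholder model rather than anything from our project:

    from haystack.query import SearchQuerySet

    from myapp.models import Post  # placeholder model

    # full-text search, narrowed to a single model; load_all() pulls
    # the matching database objects in bulk
    results = SearchQuerySet().models(Post).filter(content='django').load_all()
    for result in results:
        print(result.object.title)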






More rakes



I’ll say right away that this article mostly covers working with solr rather than haystack. Information on haystack is well laid out in its official docs, and there is no point duplicating it here.



The library was chosen, the search engine was chosen; time to get to work. Our first mistake was installing solr from the official Ubuntu 12.04 repository. The version in the official repos is solr 1.4, while the haystack documentation recommends 3.5+. So we decided to install it ourselves. After a series of trials and errors we arrived at a solution that worked equally well on the developers' local machines and on the test and production servers. Here is the approximate sequence of steps:



  1. Install the system packages, a JRE and jetty:
    apt-get install openjdk-7-jre-headless jetty
  2. Install the Python libraries (haystack will also need to be pointed at solr in settings.py; see the sketch right after this list):
     pip install pysolr django-haystack
  3. A bit of “dark magic” for the global setup of solr itself:

    #!/bin/sh

    # variables
    LIB_DIRECTORY="/opt/solr"
    CONF_DIRECTORY="/etc/solr"
    OLD_CONF_DIRECTORY="$LIB_DIRECTORY/example/solr/conf"

    # check if already installed
    if [ -d $LIB_DIRECTORY ] && [ -d $CONF_DIRECTORY ]; then
        echo "solr is already installed"
        exit
    fi

    # install if not
    # download and unpack
    wget http://archive.apache.org/dist/lucene/solr/3.6.2/apache-solr-3.6.2.tgz
    tar -xvzf apache-solr-3.6.2.tgz

    # install: move solr into place and symlink the configs to /etc/solr
    mv apache-solr-3.6.2 $LIB_DIRECTORY
    mv $OLD_CONF_DIRECTORY $CONF_DIRECTORY
    ln -s $CONF_DIRECTORY $OLD_CONF_DIRECTORY

    # cleanup
    rm apache-solr-3.6.2.tgz
    echo "solr installed"
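With the packages in place, haystack still needs to know where solr lives. A minimal settings.py sketch, assuming haystack 2.x and solr answering on the default jetty port:

    # settings.py (do not forget 'haystack' in INSTALLED_APPS)
    HAYSTACK_CONNECTIONS = {
        'default': {
            'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
            # default port of the jetty bundled with the solr example
            'URL': 'http://127.0.0.1:8983/solr',
        },
    }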






Of course, if you do not want to install solr this “globally”, you can tweak the script a bit and put it in any other folder: solr works right after unpacking, with no extra dances. We simply wanted the configs to live in the usual places, and this latest “third-party” application to be laid out the same way as the rest.



In fact, you can already start it up and everything will work:

 cd /opt/solr/example/ && java -jar start.jar 
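And a quick way to check from Python that solr is actually answering, using the pysolr client installed earlier (the URL assumes the default port):

    import pysolr

    # a match-all query; raises an exception if solr is unreachable
    solr = pysolr.Solr('http://127.0.0.1:8983/solr')
    print(solr.search('*:*').hits)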




But we cannot leave it at that. What about the configs, reindexing, and a proper startup? Let's take them in order.



Configs



In this context we are interested in two configs:



  1. /etc/solr/schema.xml - the data schema and other information for indexing and searching, directly tied to our data. This config needs to be regenerated every time you change the indexes in your project (an example index class is sketched after this list).
  2. /etc/solr/solrconfig.xml - the settings of solr itself: modules, handlers, and other options not directly related to the data. This config becomes useful when you need solr's additional features.
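schema.xml is generated from the SearchIndex classes declared in the project, so for illustration here is a hypothetical index for a Post model, following the usual haystack conventions (the model and the field names are placeholders):

    # search_indexes.py
    from haystack import indexes

    from myapp.models import Post  # placeholder model


    class PostIndex(indexes.SearchIndex, indexes.Indexable):
        # the main document field, rendered from a template
        text = indexes.CharField(document=True, use_template=True)
        title = indexes.CharField(model_attr='title')

        def get_model(self):
            return Post

        def index_queryset(self, using=None):
            # everything is indexed; filter here if you need less
            return self.get_model().objects.all()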




After updating either config, you must restart solr for the changes to take effect. (Strictly speaking, a restart is not required: you can limit yourself to a core RELOAD, which is clearly the more correct move in production. See wiki.apache.org/solr/CoreAdmin.)

To generate schema.xml, haystack provides the handy build_solr_schema command. It has one small minus: when using it you have to remember where to put the config, because it either prints the schema to the console or writes it to a file you explicitly specify. This is easy to fix by making your own command, a thin copy of build_solr_schema that merely sets a default value for the file name (in our case /etc/solr/schema.xml). That slightly simplifies the developers' life: now it is enough to run this one command and the config gets updated in place.
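A minimal sketch of such a command, assuming haystack 2.x (where build_solr_schema accepts -f/--filename); the name update_solr_schema is our own invention:

    # myapp/management/commands/update_solr_schema.py
    from haystack.management.commands.build_solr_schema import \
        Command as BuildSolrSchema


    class Command(BuildSolrSchema):
        """build_solr_schema that writes to /etc/solr/schema.xml by default."""

        def handle(self, **options):
            # fall back to the path chosen during installation unless
            # the caller explicitly passed -f/--filename
            if not options.get('filename'):
                options['filename'] = '/etc/solr/schema.xml'
            super(Command, self).handle(**options)

After that, ./manage.py update_solr_schema followed by a solr restart (or RELOAD) is the whole schema-update routine.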



Reindexing



In our case there were two options for keeping the indexes up to date:



  1. The “classic” option: a full reindex every N minutes. The only advantage of this approach that comes to mind is its simplicity. The minuses are weightier: with a large database a full pass takes a long time, and between reindexes the data goes stale. (A sketch of this option as a periodic task follows the list.)
  2. “Atomic” reindexing: reindex each object on the post_save, post_delete, etc. signals. The main advantage is obviously speed, regardless of the database size. Just as important, with this approach all control over reindexing sits in plain sight in the project code, rather than in some cron job or, at best, a celery task. But this option is not without drawbacks either: stale data if the implementation is wrong, and the overhead of reindexing on every signal. The latter is eliminated by pushing the work into an asynchronous task (celery, in our case).
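For completeness, the “classic” option does not have to live in cron either: it can be a periodic celery task. A sketch assuming haystack's update_index management command, whose --age option limits the pass to objects updated within the last N hours:

    from celery import shared_task
    from django.core.management import call_command


    @shared_task
    def full_reindex():
        # reindex only what changed within the last hour
        call_command('update_index', age=1)

Scheduled, for example, via CELERYBEAT_SCHEDULE every hour.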




We chose the second option (nothing forbids choosing both at once) and implemented it with a simple mixin for the models plus a couple of tasks:



Mixin


    from celery import current_app


    class RefreshIndexMixin(object):
        # index is the index class, e.g. PostIndex or CarIndex
        def update_index(self, index):
            current_app.send_task('project.tasks.update_index',
                                  args=[self, index])

        def remove_index(self, index):
            current_app.send_task('project.tasks.remove_index',
                                  args=[self, index])




In the model, accordingly, it is enough to call update_index or remove_index in the corresponding signal handler.
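A hypothetical example of that wiring, with Post and PostIndex as placeholder names:

    from django.db.models.signals import post_delete, post_save
    from django.dispatch import receiver

    from myapp.models import Post  # placeholder model using RefreshIndexMixin
    from myapp.search_indexes import PostIndex


    @receiver(post_save, sender=Post)
    def post_saved(sender, instance, **kwargs):
        instance.update_index(PostIndex)


    @receiver(post_delete, sender=Post)
    def post_deleted(sender, instance, **kwargs):
        instance.remove_index(PostIndex)

Note that send_task passes the model instance itself as a task argument, which assumes a celery serializer able to handle it (e.g. pickle); with the json serializer you would pass the pk instead and re-fetch the object inside the task.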



Tasks


    from celery import shared_task


    @shared_task
    def update_index(obj, index):
        index().update_object(obj)


    @shared_task
    def remove_index(obj, index):
        index().remove_object(obj)




Proper startup



We use supervisor everywhere, so in our case there was not much to think about. Here is an example config for a local/test server:



    [program:solr]
    command=java -jar start.jar
    directory=/opt/solr/example/
    stderr_logfile=/var/log/solr.error.log
    stdout_logfile=/var/log/solr.log
    autorestart=true


Why only for local/test? Because solr comes “out of the box” with its own admin panel where you can inspect the indexes, the settings, and so on, and also perform any manipulations with that data. On a production server you will most likely want to give up this feature, so here is a start command that listens only on localhost, on the default port:



 command=java -Djetty.host=127.0.0.1 -Djetty.port=8983 -jar start.jar 




Summary



The bottom line: in all the time we have been using this bundle, it has shown itself excellently from every side. Users and the customer are happy with fast and relevant search on the site. For the developers, the presence of this Java monster is almost unnoticeable, since all they deal with are the index classes, a couple of manage.py commands, and a solr reload via supervisor.

Source: https://habr.com/ru/post/225999/


