
Full text search in web projects: Sphinx, Apache Lucene, Xapian

The full author's version of my blog post. The original material was written specifically for Developers.org.ua.



It is hard to imagine a modern web project without content. Yes, it is content in its various forms that "rules the ball" in web projects today. Whether created automatically by users or obtained from other sources, information is the foundation of any (well, almost any) project, and so the question of finding the right information becomes very serious. It grows more acute every day as the amount of content expands rapidly, mostly due to user-generated material (forums, blogs, and today's fashionable communities such as Habrahabr.ru). Any developer building a project today therefore faces the need to implement search in their web application, and the requirements for that search are far more demanding than they were even a year or two ago. Of course, for some projects a simple solution such as Google Custom Search is perfectly adequate. But the more complex the application and its content structure, the more likely you are to need special kinds of search and result processing; or the volume or format of your data may simply be unusual. In that case you need your own search engine: your own system, your own search server or service, not a third-party one, however flexible and customizable it may be. But what should you choose? What search projects are currently on the market that are ready for use in real business applications, rather than research or scientific ones? Below we briefly review the options suitable for embedding into your web application or deploying on your own server.



General architecture and terms



So, for a deeper understanding of the essence of search, let us briefly review the concepts and terms involved. By a search server (or simply "search engine") we mean a library or component, in general a software solution, that independently maintains its own database of documents in which the search actually takes place (it may be a DBMS, plain files, or a distributed storage platform), and that allows third-party applications to add, delete, and update documents in this database. This process is called indexing, and it may be implemented as a separate component or server (the indexer). The other key component, the searcher, accepts a search query, processes the database built by the indexer, and selects the data matching the query. It can also compute additional parameters for the results (rank documents, calculate the degree of relevance to the query, and so on). These are the core subsystems of a search engine; they may be implemented monolithically in a single library, or as independent servers accessed through various application protocols and APIs. Additionally, the search server may pre-process documents before indexing (for example, extracting text from files of various formats or from databases), and various APIs may be implemented by additional components. The server itself may store its index in a database (either embedded or an external server, for example MySQL), in files of its own optimized format, or even in a special distributed file system.
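The split between indexer and searcher described above can be sketched with a toy inverted index. This is a minimal illustration of the architecture, not any particular engine's code; all names here are invented for the example:

```python
from collections import defaultdict

class SearchEngine:
    """Toy search engine: an indexer that maintains an inverted index,
    and a searcher that matches queries against it."""

    def __init__(self):
        # term -> set of ids of documents containing that term
        self.index = defaultdict(set)
        self.documents = {}

    def add(self, doc_id, text):
        """Indexer: register or update a document."""
        self.remove(doc_id)  # an update is a delete followed by a re-add
        self.documents[doc_id] = text
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def remove(self, doc_id):
        """Indexer: delete a document from the index."""
        old = self.documents.pop(doc_id, None)
        if old is not None:
            for term in old.lower().split():
                self.index[term].discard(doc_id)

    def search(self, query):
        """Searcher: return ids of documents containing all query terms."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = self.index[terms[0]].copy()
        for term in terms[1:]:
            result &= self.index[term]
        return result

engine = SearchEngine()
engine.add(1, "full text search in web projects")
engine.add(2, "web projects need content")
print(engine.search("web projects"))  # {1, 2}
print(engine.search("full text"))     # {1}
```

A real engine adds tokenization, stemming, ranking, and persistent storage on top of this scheme, but the division of labor stays the same.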


Separately, I would highlight the presence of a web crawling module: a built-in ability to retrieve documents from web sites over HTTP and index them in the search server. Such a module is usually called a "spider" or "crawler", and with it the search server starts to resemble the "real" search engines familiar to everyone, like Google or Yandex. You could use it to implement your own search over a chosen set of sites, for example sites dedicated to a single topic: just create a list of addresses and configure their periodic crawling. However, this task is far more complex and serious, both technically and organizationally, so we will not dwell on the details of its implementation. Among the projects we will consider there is one server that implements exactly this kind of web search, that is, it contains everything needed to create a "Yandex killer". Interested?



What parameters are important?



When choosing a search engine, consider the following parameters:



Of course, there are many other parameters, and the field of data search itself is complex and serious, but for our applications this is quite enough. You are not building a competitor to Google, are you?



Now let me briefly describe the search solutions worth paying attention to once you decide to tackle the question of search. I intentionally do not consider the solutions built into your DBMS: FULLTEXT in MySQL and FTS in PostgreSQL (integrated into the database since version 8.3, if I am not mistaken). The MySQL solution cannot be used for serious search, especially over large amounts of data; the PostgreSQL search is much better, but mainly makes sense if you already use this database. Although installing a separate database server used only for data storage and search is also an option. Unfortunately, I do not have on hand data on real-world deployments with large data volumes and complex queries (units and tens of GB of text).





Sphinx search engine



Sphinx is probably the most powerful and fastest of all the open engines we will look at. It is especially convenient in that it integrates directly with popular databases and supports advanced search capabilities, including ranking and stemming for Russian and English. The project has excellent support for Russian, not least because its author, Andrei Aksenov, is our compatriot. Non-trivial features such as distributed search and clustering are supported, but the engine's signature trait is very, very high indexing and search speed, together with an ability to parallelize well and fully utilize the resources of modern servers. Very serious installations holding terabytes of data are known, so Sphinx can be recommended as a dedicated search engine for projects of any complexity and data volume. Transparent integration with the most popular databases, MySQL and PostgreSQL, lets it fit into a typical web development environment, and APIs for several languages are provided out of the box, first of all for PHP, without additional modules or extension libraries. The engine itself, however, must be compiled and installed separately, so it is not applicable on ordinary shared hosting: you need a VDS or your own server, preferably with plenty of memory. The index is monolithic, so you have to improvise a little and set up a delta index to handle a large flow of new or modified documents correctly, although the huge indexing speed makes it possible to rebuild the index on a schedule without affecting live searches.
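The "main + delta" scheme mentioned above can be sketched as follows. This is a conceptual illustration only, with invented names, not Sphinx's actual API: the large main index is rebuilt rarely on a schedule, a small delta index holds recently added or changed documents, and every query searches both, with the delta taking precedence:

```python
def search_index(index, term):
    """Return {doc_id: text} of documents in `index` containing `term`."""
    return {doc_id: text for doc_id, text in index.items()
            if term in text.lower().split()}

def search(main_index, delta_index, term):
    """Query both indexes; freshly re-indexed documents in the delta
    override their stale copies from the main index."""
    results = search_index(main_index, term)
    results.update(search_index(delta_index, term))
    return results

# Main index: rebuilt nightly. Delta index: small, rebuilt every few minutes.
main_index = {1: "old article about search", 2: "notes on mysql"}
delta_index = {2: "updated notes on sphinx", 3: "fresh post about sphinx"}

print(search(main_index, delta_index, "sphinx"))  # docs 2 (updated) and 3
print(search(main_index, delta_index, "search"))  # doc 1
```

Periodically the delta is merged into the main index and emptied, keeping the frequently rebuilt part small.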



SphinxSE is a variant that works as a data storage engine for MySQL (it requires patching and recompiling the server), and Ultrasphinx is a configurator and client for Ruby (in addition to the API included in the distribution). Beyond that, there are plug-ins for many well-known CMSs, blog platforms, and wikis that replace the standard search (see the full list here: http://www.sphinxsearch.com/contribs.html).





Apache Lucene Family



Lucene is the best-known of the search engines, initially designed for embedding into other programs. In particular, it is widely used in Eclipse (documentation search) and even at IBM (products in the OmniFind line). The project's strengths include advanced search capabilities and a good system for building and storing the index, which can be replenished, have documents deleted, and be optimized concurrently with searching, as well as parallel search across multiple indices with merged results. The index itself is built from segments, but to improve speed it is recommended to optimize it, which often costs almost as much as full re-indexing. Out of the box there are analyzers for different languages, including Russian with stemming support (reducing words to their normal form). The downsides are still a low indexing speed (especially compared with Sphinx), awkward work with databases, and the absence of APIs other than the native Java one. And although Lucene can be clustered and can store indexes in a distributed file system or a database to achieve serious performance, this requires third-party solutions, as do all the other extras; for example, out of the box it can index only plain text. But it is precisely as a building block of third-party products that Lucene is "ahead of the rest": no other engine has so many ports to other languages and so many derived uses. One factor behind this popularity is the very good index file format, which third-party solutions reuse, so it is possible to build tools that work with the index and search but have no indexer of their own (this is easier to implement and much less demanding).
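The relevance ranking mentioned above is, at its core, a TF-IDF computation: a term matters more when it occurs often in a document but rarely in the collection. The sketch below shows a simplified TF-IDF ranking; it is only an illustration of the idea, not Lucene's actual scoring formula, which adds length normalization, boosts, and coordination factors:

```python
import math
from collections import Counter

def tf_idf_rank(docs, query):
    """Rank documents by a simplified TF-IDF relevance score.
    Illustrative only: real Lucene scoring is more elaborate."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # document frequency: in how many documents each term occurs
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                idf = math.log(n / df[term])  # rarer terms weigh more
                score += tf[term] * idf
        if score > 0:
            scores[doc_id] = score
    # best-matching documents first
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    1: "lucene index segments and lucene search",
    2: "search with the sphinx engine",
    3: "web content and blogs",
}
print(tf_idf_rank(docs, "lucene search"))  # [1, 2]
```

Document 1 wins because "lucene" is both frequent in it and rare in the collection, while document 3 does not match at all.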



Solr is the best solution based on Lucene, significantly expanding its capabilities. It is a standalone enterprise-level server that exposes extensive search functionality as a web service. Solr normally accepts documents over HTTP in XML format and returns results over HTTP as well (in XML, JSON, or another format). Clustering and replication to several servers are fully supported; support for additional fields in documents is extended (unlike in Lucene, various standard data types are supported for them, which brings the index closer to a database); there is facet search and filtering, advanced configuration and administration tools, and the ability to back up the index while it is running. Built-in caching also improves performance. On the one hand this is an independent solution based on Lucene; on the other, its capabilities are very substantially expanded over the base ones, so if you need a separate search server, look first at Solr.
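The facet search mentioned above means counting, for each value of a chosen field, how many matching documents carry it, so the user can drill down by category, year, and so on. A minimal sketch of the idea (illustrative only; Solr computes these counts inside the server and returns them alongside the result list):

```python
from collections import Counter

def facet_counts(results, field):
    """Count how many matching documents fall into each value of `field`.
    A toy illustration of facet search over an in-memory result set."""
    return Counter(doc[field] for doc in results if field in doc)

# Imagine these are the documents that matched some query.
results = [
    {"title": "Sphinx review", "category": "search", "year": 2008},
    {"title": "Solr intro", "category": "search", "year": 2008},
    {"title": "MySQL tuning", "category": "database", "year": 2007},
]
print(facet_counts(results, "category"))  # Counter({'search': 2, 'database': 1})
print(facet_counts(results, "year"))      # Counter({2008: 2, 2007: 1})
```

Clicking a facet value in the UI then simply adds a filter on that field to the original query.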



Nutch is the second best-known project based on Lucene. It is a web search engine (a search engine plus a web spider for crawling sites) combined with the Hadoop distributed storage system. Out of the box Nutch can work with remote nodes on the network and indexes not only HTML but also MS Word, PDF, RSS, PowerPoint, and even MP3 files (their meta tags, of course); in effect it is a full-fledged Google-killer search engine. Just kidding: the price for this is a significant reduction in functionality, even relative to base Lucene; for example, Boolean operators are not supported in queries, and no stemming is applied. If the task is to build a small search engine for local resources or a predetermined, limited set of sites, if you need full control over every aspect of the search, or if you are creating a research project to test new algorithms, then Nutch will be your best choice. However, keep in mind its hardware requirements and the need for a wide channel: for a real web search engine, traffic runs into terabytes.
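The core step of a spider like Nutch's is fetching a page, extracting its links, and queuing them for the next round. The sketch below shows just the link-extraction step with the standard library; the page is inlined as a string rather than downloaded, and this is a concept illustration, not Nutch's code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from an HTML page: the step a
    crawler repeats to discover new URLs for its fetch queue."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # resolve relative links against the page address
                    self.links.append(urljoin(self.base_url, value))

# A fetched page (inlined; a real spider would download it over HTTP).
page = '<html><body><a href="/about">About</a> <a href="http://example.org/">Ext</a></body></html>'
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
print(extractor.links)  # ['http://example.com/about', 'http://example.org/']
```

A production crawler adds politeness delays, robots.txt handling, deduplication, and distributed fetch queues on top of this loop.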



Do you think nobody uses Nutch for "grown-up" work? You are mistaken: one of the best-known projects you may have heard of, the source-code search engine Krugle (http://krugle.com/), uses it.



But Lucene is known and popular not only through add-on projects. As the leader among open-source solutions, embodying many excellent ideas, Lucene is the first candidate for porting to other platforms and languages. At present the following ports exist (I mean those that are more or less actively developed and most complete):



Xapian



For now this is the only contender able to challenge the dominance of Lucene and Sphinx. It compares favorably in having a "live" index that does not require rebuilding when documents are added, a very powerful query language with built-in stemming and even spell checking, and support for synonyms. This library will be the best choice if you have a Perl system, or if you need advanced capabilities for building a search engine and the index is updated very often, with new documents that must be immediately available for searching. However, I could not find information about the ability to attach arbitrary additional fields to documents and retrieve them along with the search results, so connecting the search system to your own application may be difficult. The package includes Omega, an add-on library ready for use as a standalone search engine, which is responsible for indexing various document types and for the CGI interface.
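The spell checking mentioned above can be illustrated with edit distance against the dictionary of indexed terms. This is only a concept sketch, not Xapian's actual algorithm; a real engine uses smarter data structures (such as n-gram tables over indexed terms) instead of scanning the whole dictionary:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Suggest the closest indexed term for a possibly misspelled word,
    or None if nothing is close enough. Concept illustration only."""
    best = min(dictionary, key=lambda term: edit_distance(word, term))
    return best if edit_distance(word, best) <= max_dist else None

dictionary = ["search", "engine", "xapian", "stemming"]
print(suggest("serach", dictionary))   # search
print(suggest("xapain", dictionary))   # xapian
```

The engine can then either silently rank the corrected term or show the familiar "did you mean...?" prompt.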



Perhaps our review can end here. There are still many other search engines, but some of them are ports of, or add-ons on top of, those already reviewed. For example, ezFind, the industrial-grade search server for eZ's own CMS, is not really a separate search engine but an interface to the standard Java Lucene, which it includes in its distribution. The same applies to the Search component of their eZ Components package: it provides a unified interface for accessing external search servers, in particular the Lucene server. And even such interesting and powerful solutions as Carrot and SearchBox are seriously modified versions of the same Lucene, substantially extended with new features. There are not many independent open-source search solutions on the market that fully implement indexing and search with their own algorithms. Which one to choose depends on you and on the features, often far from obvious, of your project.



Conclusions



Although only you can make the final decision about whether a particular search engine suits your project, and often only after detailed research and testing, some conclusions can be drawn already.



Sphinx is right for you if you need to index large amounts of data stored in MySQL and you care about indexing and search speed, but you do not need specific capabilities such as "fuzzy search" and you are willing to allocate a separate server, or even a cluster, for it.



If you need to integrate a search module into your application, the best option is a ready port of the Lucene library for your language; ports exist for all common languages, though not all features of the original are implemented in them. If you are developing in Java, Lucene is definitely the best choice. Keep in mind, however, its fairly slow indexing and the need for frequent index optimization (and its demands on CPU and disk speed). For PHP this seems to be the only acceptable option for fully implementing search without additional modules and extensions.



Xapian is a good, high-quality product, but less common and less flexible than the rest. For C++ applications with demanding requirements on the query language it will be the best choice, but it requires manual work and modification to be embedded in your own code or used as a separate search server.



Related Links

Source: https://habr.com/ru/post/30594/


