Full author's version of my blog post. The original material was written specifically for Developers.org.ua.

Probably no modern web project can be imagined without... without content! Yes, it is content in its various forms that rules the roost in today's web projects. It does not matter much whether it is created automatically by users or obtained from other sources: information is the core of any (well, almost any) project. And if so, the question of finding the necessary information is a very serious one, and it grows more acute every day due to the rapid growth of this very content, mainly user-generated (forums, blogs, and today's fashionable communities such as Habrahabr.ru). Thus, any developer implementing a project today faces the need to implement search in their web application. At the same time, the requirements for such a search are far more complicated and broader than even a year or two ago. Of course, a simple solution will suit some projects; for example, you can use Google Custom Search. But the more complex the application and the more complex the content structure, when you need special types of search and result processing, or simply when the amount or format of the data in your project is unusual, you need your own search engine: your own system, your own search server or service, not a third-party one, however flexible and customizable it may be. But what should you choose, and in general, what search projects on the market today are ready for use in real projects, not research or scientific ones, but real business applications? Below, we will briefly review various search solutions suitable for embedding into your web application or deploying on your own server.
General architecture and terms
So, for a deeper understanding of the essence of search, let us briefly go over the concepts and terms used. By a search server (or simply "search engine") we mean a library or component, in general a software solution, that independently maintains its own database of documents (which may in fact be a DBMS, plain files, or a distributed storage platform) in which the search actually takes place, and that lets third-party applications add, delete, and update documents in this database. This process is called indexing and can be implemented as a separate component or server (the indexer). The other component, the search engine proper, accepts a search query, processes the database built by the indexer, and selects the data matching the query. In addition, it can calculate extra parameters for the search results (rank documents, compute the degree of relevance to the query, and so on). These are the most important components of a search system, and they can either be implemented monolithically in a single library or be independent servers accessed through various application protocols and APIs. Additionally, the search server may pre-process documents before indexing (for example, extracting text from files of various formats or from databases), and various APIs can be provided by additional components. The server itself can store its index data in a database (either built-in or an external server, for example MySQL), in files of its own optimized format, or even in a special distributed file system.
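To make the indexer/searcher split concrete, here is a toy sketch (not modeled on any particular engine) of the two components around a shared inverted index; real engines add ranking, stemming, and compressed on-disk storage on top of this idea:

```python
from collections import defaultdict

class ToyIndexer:
    """Maintains the document database: an inverted index plus stored fields."""
    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of documents containing it
        self.stored = {}                   # doc id -> stored meta-fields

    def add(self, doc_id, text, **fields):
        self.stored[doc_id] = fields
        for term in text.lower().split():
            self.postings[term].add(doc_id)

class ToySearcher:
    """Answers queries against the database built by the indexer."""
    def __init__(self, indexer):
        self.ix = indexer

    def search(self, query):
        # Boolean AND: intersect the posting lists of all query terms
        lists = [self.ix.postings.get(t, set()) for t in query.lower().split()]
        hits = set.intersection(*lists) if lists else set()
        # A real engine would rank the hits here; we just return stored fields
        return [(doc_id, self.ix.stored[doc_id]) for doc_id in sorted(hits)]

indexer = ToyIndexer()
indexer.add(1, "open source search servers", title="Overview")
indexer.add(2, "crawling the web with a spider", title="Crawlers")
print(ToySearcher(indexer).search("search servers"))   # -> [(1, {'title': 'Overview'})]
```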
Separately, I would highlight the presence of a web search module, that is, a built-in ability to retrieve documents from websites over HTTP and index them in the search server. Such a module is usually called a "spider" or "crawler", and with it the search server starts to resemble the "real" search engines everyone knows, like Google or Yandex. This way you can implement your own search engine over a chosen set of sites, for example ones dedicated to a single topic: just create a list of addresses and configure their periodic crawling. However, this task is much more complex and serious, both technically and organizationally, so we will not dwell on the details of its implementation. Among the projects we will consider, there is one server that implements exactly such a web search engine, that is, it contains everything needed to create a "Yandex killer". Interested?
What parameters are important?
When choosing a search engine, consider the following parameters:
- indexing speed: how quickly the search server "grinds" documents and puts them into its index, making them available for search. Usually measured in megabytes of plain text per second.
- reindexing speed: as the system works, documents change and new ones are added, so information has to be re-indexed. If the server supports incremental indexing, only new documents are processed, and rebuilding the entire index is postponed or skipped altogether. Other servers require a complete rebuild of the index when new information is added, or use an additional index (a delta index) containing only the new information (a small sketch of this scheme follows this list).
- supported APIs: if you use a search engine together with a web application, pay attention to the presence of a built-in API for your language or platform. Most search engines have APIs for all popular platforms: Java, PHP, Ruby, Python.
- supported protocols: besides the APIs, access protocols are important, in particular if you access the engine from another server or from an application that has no native API. XML-RPC (or variations such as JSON-RPC), SOAP, or access over HTTP/sockets are usually supported.
- database size and search speed: these parameters are closely interrelated. If you are building something unique and expect millions of documents or more in the database with instant search over them, look at known installations of the chosen platform. Although no one states explicit limits on the number of documents, and on small collections (say, several thousand or tens of thousands of documents) all engines will behave roughly the same, with millions of documents this can become a problem. By itself, though, this parameter does not always matter: you need to look at the particulars of each system and its search algorithms, as well as other parameters such as reindexing speed or the available index types and their storage scheme.
- supported document types: of course, any server supports plain text (although you should check its ability to work with multilingual documents and the UTF-8 encoding), but if you need to index different file types, for example HTML, XML, DOC, or PDF, look for solutions with a built-in component for indexing various formats and extracting text from them. Of course, all this can be done in your own application, but it is better to find a ready-made solution. This also covers indexing and searching information stored in a DBMS: it is no secret that such storage is the most common for web applications, and it is better if the search server works with the database directly, without you having to manually extract documents and "feed" them to it for indexing.
- work with different languages and stemming: for correct search in different languages you need native support not only for encodings but also for language features. Every engine supports English, which is fairly simple to search and process, but for Russian and similar languages automatic morphology tools are needed. A stemming module lets the engine normalize and inflect the words in a search query for more accurate matching. If search in Russian is critical for you, check for the presence of this module and its features (dictionary-based stemming is better than algorithmic but harder to implement, and algorithmic stemmers vary greatly in quality).
- support for additional field types in documents: besides the text itself, which is indexed and searched, a document should be able to carry an unlimited number of other fields storing meta-information about it, needed for further work with the search results. It is highly desirable that the number and types of fields are not limited and that the indexability of each field is configurable. For example: one field stores the title, the second the abstract, the third keywords, the fourth the document identifier in your system. You should be able to flexibly set the search scope (all fields or only specified ones), as well as which fields are retrieved from the search engine's database and shown in the results.
- platform and language: just as important, though to a lesser extent. If you are going to separate search into its own module, or even move it to a dedicated server (in the hardware sense), the platform does not matter that much. It is usually either C++ or Java, although there are options in other languages (usually ports of Java solutions).
- built-in mechanisms for ranking and sorting: it is especially good if the search engine can be extended (and is written in a language you know) so that you can implement the ranking functions you need yourself; there are many different algorithms, and it is not a given that the engine's default suits you.
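Here is a conceptual sketch of the "main + delta" scheme mentioned above (not tied to any particular engine): new documents go into a small delta index that is cheap to rebuild, queries consult both indexes, and a scheduled merge folds the delta into the main index so it never grows large.

```python
class TinyIndex:
    def __init__(self, docs):
        # Rebuild cost is proportional to the number of documents indexed
        self.postings = {}
        for doc_id, text in docs:
            for term in text.lower().split():
                self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())

class MainPlusDelta:
    def __init__(self):
        self.main_docs, self.delta_docs = [], []
        self.main, self.delta = TinyIndex([]), TinyIndex([])

    def add(self, doc_id, text):
        # Only the small delta index is rebuilt when a new document arrives
        self.delta_docs.append((doc_id, text))
        self.delta = TinyIndex(self.delta_docs)

    def search(self, term):
        # Queries see both indexes, so new documents are searchable at once
        return self.main.search(term) | self.delta.search(term)

    def merge(self):
        # Run on a schedule (e.g., nightly): fold the delta into the main index
        self.main_docs += self.delta_docs
        self.delta_docs = []
        self.main, self.delta = TinyIndex(self.main_docs), TinyIndex([])

store = MainPlusDelta()
store.add(1, "fresh document about search")
print(store.search("search"))   # {1}, found before any merge has run
store.merge()
```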
Of course, there are many more parameters, and the field of data search itself is quite complex and serious, but for our applications this is quite enough. You are not building a competitor to Google, are you?
Now let me briefly describe the search solutions worth your attention once you decide to tackle search. I intentionally leave out the solutions built into your DBMS: FULLTEXT in MySQL and FTS in PostgreSQL (integrated into the core since version 8.3, if I am not mistaken). The MySQL solution cannot be used for serious search, especially over large amounts of data; the search in PostgreSQL is much better, but only if you already use that database. As an option, though, installing a separate database server used purely for data storage and search is also possible. Unfortunately, I have no data at hand on real-world installations with large data volumes and complex queries (units and tens of GB of text).
Sphinx search engine
- Type: standalone server, MySQL storage engine
- Platform: C++ / cross-platform
- Index: monolithic + delta index, distributed search possible
- Search capabilities: Boolean search, phrase search, etc., with the ability to group, rank, and sort the results
- APIs and protocols: SQL DB (native support for MySQL and PostgreSQL), native XML interface, built-in APIs for PHP, Ruby, Python, Java, Perl
- Language support: built-in English and Russian stemming, soundex for morphology
- Additional fields: yes, unlimited
- Document types: plain text or the native XML format
- Index size and search speed: very fast; indexing at about 10 MB/s (depending on the CPU), search at about 0.1 s on a 2-4 GB index; supports indexes of hundreds of GB and hundreds of millions of documents without clustering, and installations working on terabyte databases are known.
- License: open source, GPLv2 or commercial.
- URL : http://sphinxsearch.com
Sphinx is probably the most powerful and fastest of all the open engines we will look at. It is especially convenient in that it integrates directly with popular databases and supports advanced search capabilities, including ranking and stemming for Russian and English. The project's excellent support for the Russian language seems to owe to the fact that its author, Andrei Aksenov, is our compatriot. Non-trivial features like distributed search and clustering are supported, but the engine's signature trait is its very, very high indexing and search speed and its ability to parallelize well and utilize the resources of modern servers. Very serious installations holding terabytes of data are known, so Sphinx can be recommended as a dedicated search engine for projects of any level of complexity and data volume. Transparent work with the most popular databases, MySQL and PostgreSQL, lets you use it in a typical web development environment, and APIs for different languages are available out of the box, primarily for PHP, without any additional modules or extension libraries. The search engine itself, however, must be compiled and installed separately, so it is not applicable on ordinary shared hosting: only a VDS or your own server, preferably with plenty of memory. The engine's index is monolithic, so you have to jump through some hoops setting up a delta index to handle a steady stream of new or modified documents correctly, although the huge indexing speed lets you rebuild the index on a schedule without affecting the actual search.
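For illustration, a minimal sketch of querying the searchd daemon through the sphinxapi.py client bundled with the distribution; the index name "articles" and the port are assumptions (older versions of searchd listen on 3312, newer ones on 9312), and the field syntax requires the extended match mode:

```python
from sphinxapi import SphinxClient, SPH_MATCH_EXTENDED2

client = SphinxClient()
client.SetServer('localhost', 3312)        # adjust to your searchd port
client.SetMatchMode(SPH_MATCH_EXTENDED2)   # enables the extended query syntax
client.SetLimits(0, 20)                    # first 20 results

# "@title sphinx" restricts one term to the title field
result = client.Query('search engine @title sphinx', 'articles')
if result:
    for match in result['matches']:
        # 'id' is the document id from your own database;
        # 'attrs' holds the additional fields configured in sphinx.conf
        print(match['id'], match['weight'], match['attrs'])
else:
    print('query failed:', client.GetLastError())
```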
SphinxSE is a version that functions as a storage engine for MySQL (it requires patching and recompiling the server),
Ultrasphinx is a configurator and client for Ruby (in addition to the API included in the distribution). Besides these, there are plug-ins for many well-known CMS, blog platforms, and wikis that replace the standard search (see the full list here: http://www.sphinxsearch.com/contribs.html ).
Apache Lucene Family
- Type: standalone server or servlet, embeddable library
- Platform: Java / cross-platform (there are ports to many languages and platforms)
- Index: incremental index, but requiring a segment-merge operation (which can run in parallel with search)
- Search capabilities: Boolean search, phrase search, fuzzy search, etc., with the ability to group, rank, and sort the results
- APIs and protocols: Java API
- Language support: no full morphology, but stemming (Snowball) and analyzers for a number of languages (including Russian)
- Additional fields: yes, unlimited
- Document types: text; database indexing is possible via JDBC
- Index size and search speed: about 20 MB/min; index file size is limited to 2 GB on 32-bit OSes. Parallel search across multiple indexes and clustering are possible (requires third-party platforms)
- License: open source, Apache License 2.0
- URL : http://lucene.apache.org/
Lucene is the most famous of the search engines, initially focused on being embedded into other programs. In particular, it is widely used in Eclipse (documentation search) and even at IBM (products of the OmniFind series). The advantages of the project include advanced search capabilities; a good system for building and storing the index, which can be replenished, have documents deleted, and be optimized concurrently with searching; and parallel search across multiple indexes with merged results. The index itself is built from segments, but to improve speed it is recommended to optimize it, which often costs almost as much as re-indexing. Out of the box there are analyzer variants for different languages, including Russian with stemming support (reducing words to their normal form). The minuses, however, are the still-low indexing speed (especially compared to Sphinx), the complexity of working with databases, and the lack of APIs other than the native Java one. And although Lucene can reach serious performance by clustering and storing indexes in a distributed file system or database, this requires third-party solutions, as do all the other extras: out of the box, for example, it indexes only plain text. But it is precisely in being used as part of third-party products that Lucene is ahead of everyone else: no other engine has so many ports to other languages and so many uses. One factor behind this popularity is the very good index file format, which third-party solutions use directly, so it is possible to build solutions that work with the index and search but have no indexer of their own (this is easier to implement and far less demanding).
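To give a feel for the Lucene indexing and search cycle, here is a sketch using the PyLucene port mentioned below, written against the 2.x-era API; the exact initialization call and class names changed between releases, so treat this as an illustration of the flow rather than copy-paste code:

```python
import lucene
lucene.initVM(lucene.CLASSPATH)   # start the embedded JVM (JCC-based PyLucene)

store = lucene.FSDirectory.getDirectory('/tmp/index')
analyzer = lucene.StandardAnalyzer()

# Indexer: add a document with stored/indexed fields, then optimize
writer = lucene.IndexWriter(store, analyzer, True)
doc = lucene.Document()
doc.add(lucene.Field('title', 'Search servers overview',
                     lucene.Field.Store.YES, lucene.Field.Index.TOKENIZED))
doc.add(lucene.Field('body', 'Lucene is an embeddable search library',
                     lucene.Field.Store.NO, lucene.Field.Index.TOKENIZED))
writer.addDocument(doc)
writer.optimize()   # the costly merge/optimize step discussed above
writer.close()

# Searcher: parse a query against the "body" field and iterate over hits
searcher = lucene.IndexSearcher(store)
query = lucene.QueryParser('body', analyzer).parse('embeddable AND search')
hits = searcher.search(query)
for i in range(hits.length()):
    print(hits.doc(i).get('title'), hits.score(i))
```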
Solr is the best Lucene-based solution, significantly expanding its capabilities. It is a standalone enterprise-level server that provides extensive search capabilities as a web service. Solr accepts documents over HTTP in XML format and returns results over HTTP as well (XML, JSON, or another format). Clustering and replication across several servers are fully supported; support for additional document fields is extended (unlike Lucene, various standard data types are supported for them, which brings the index closer to a database); there is facet search and filtering, advanced configuration and administration tools, and the ability to back up the index while running. Built-in caching further improves performance. On the one hand, it is an independent solution based on Lucene; on the other, its capabilities are expanded very significantly relative to the base ones, so if you need a separate search server, look at Solr first.
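Since the whole interface is plain HTTP, no client library is required; a sketch of indexing and querying follows, where the host, port, and the schema fields ("id", "title", "text") are assumptions that must match your schema.xml, and the URL layout is the classic single-core one:

```python
import urllib.request

SOLR = 'http://localhost:8983/solr'

# Index: POST an <add> command with the document, then commit
doc = ('<add><doc>'
       '<field name="id">42</field>'
       '<field name="title">Solr as a search web service</field>'
       '<field name="text">Documents go in and results come out over HTTP</field>'
       '</doc></add>')
for body in (doc, '<commit/>'):
    req = urllib.request.Request(SOLR + '/update', body.encode('utf-8'),
                                 {'Content-Type': 'text/xml; charset=utf-8'})
    urllib.request.urlopen(req).read()

# Search: a plain GET request, asking for the JSON response format
resp = urllib.request.urlopen(SOLR + '/select?q=text:http&wt=json')
print(resp.read().decode('utf-8'))
```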
Nutch is the second most famous Lucene-based project. It is a web search engine (a search engine plus a web spider for crawling sites) combined with the distributed storage system Hadoop. Out of the box, Nutch can work with remote nodes on the network and indexes not only HTML but also MS Word, PDF, RSS, PowerPoint, and even MP3 files (their meta tags, of course); in effect, it is a full-fledged "Google killer" search engine. Just kidding: the price for this is a significant reduction in functionality, even relative to the Lucene baseline; for example, Boolean operators are not supported in search, and no stemming is used. If the task is to build a small search engine over local resources or a predetermined set of sites, if you need full control over every aspect of the search, or if you are creating a research project to test new algorithms, then Nutch will be your best choice. However, take into account its hardware requirements and the need for a wide channel: for a real web search engine, traffic runs into terabytes.
Do you think nobody uses Nutch for serious work? You are mistaken: among the best-known projects you may have heard of, it powers the source code search engine Krugle ( http://krugle.com/ ).
But Lucene is known and popular not only through its add-on projects. Being the leader among open-source solutions and embodying many excellent ideas, Lucene is the first candidate for porting to other platforms and languages. The following ports now exist (I mean those that are more or less actively developed and most complete):
- Lucene.Net is a full port of Lucene to MS .NET/Mono and the C# language, algorithmically identical and with the same classes and API. The project is still in the incubator, and the last release is dated April 2007 (a port of the final 2.0 version). Project page.
- Ferret is a port to the Ruby language
- CLucene is a C++ version promising a significant performance boost. According to some tests, it indexes 3-5 times faster than the original, sometimes more (search is comparable or only 5-10% faster). It turns out a large number of projects and companies use this version: ht://Dig, Flock, Kat (a search engine for KDE), BitWeaver CMS, and even companies such as Adobe (documentation search) and Nero. Project page
- Plucene is a Perl implementation
- PyLucene is an implementation for Python applications, but not a complete one, and it partly requires Java.
- Zend_Search_Lucene is the only PHP port, available as part of the Zend Framework. It is, by the way, quite usable as an independent solution outside the framework: I experimented with extracting it, and the entire search engine now fits into a single 520 KB PHP file. Project home page: http://framework.zend.com/manual/en/zend.search.lucene.htm
Xapian
- Type: embeddable library
- Platform: C++
- Index: incremental index, transparently updated in parallel with searching; work with several indexes; in-memory indexes for small databases.
- Search capabilities: Boolean search, phrase search, ranked search, wildcard search, synonym search, etc., with the ability to group, rank, and sort the results
- APIs and protocols: C++, Perl, Java JNI, Python, PHP, TCL, C#, and Ruby APIs; a CGI interface with XML/CSV output
- Language support: no full morphology, but stemming for a number of languages (including Russian) and spell checking of search queries
- Additional fields: none
- Document types: text, HTML, PHP, PDF, PostScript, OpenOffice/StarOffice, OpenDocument, Microsoft Word/Excel/PowerPoint/Works, WordPerfect, AbiWord, RTF, DVI; SQL database indexing via Perl DBI
- Index size and search speed: working installations with a 1.5 TB index and hundreds of millions of documents are known.
- License: open source, GPL
- URL : http://xapian.org
So far this is the only contender able to compete with the dominance of Lucene and Sphinx, and it compares favorably thanks to its "live" index, which does not need rebuilding when documents are added, a very powerful query language with built-in stemming and even spell checking, and synonym support. This library will be the best choice if you have a Perl system, or if you need advanced search-building capabilities with a very frequently updated index where new documents must become searchable immediately. However, I found no information about the ability to add arbitrary additional fields to documents and retrieve them with the search results, so tying the search system to your own may be difficult. The package includes Omega, an add-on library ready to be used as a standalone search engine, which is responsible for indexing various document types and for the CGI interface.
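A sketch using Xapian's Python bindings shows the "live" index in action: a document added to a WritableDatabase becomes searchable without any rebuild step. The database path and language choice here are assumptions.

```python
import xapian

db = xapian.WritableDatabase('/tmp/xapian.db', xapian.DB_CREATE_OR_OPEN)
stemmer = xapian.Stem('russian')   # Snowball stemmer; 'english' etc. also work

# Index a document: terms go into the index, set_data() stores a payload
doc = xapian.Document()
doc.set_data('stored payload, e.g. a title or your own record id')
term_gen = xapian.TermGenerator()
term_gen.set_stemmer(stemmer)
term_gen.set_document(doc)
term_gen.index_text('text of the document to index')
db.add_document(doc)   # immediately searchable, no rebuild needed

# Search with stemming and the rich query parser (AND/OR/NOT, phrases, etc.)
parser = xapian.QueryParser()
parser.set_stemmer(stemmer)
parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
parser.set_database(db)   # enables spelling-correction lookups
query = parser.parse_query('document AND text')
enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.docid, match.document.get_data())
```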
Perhaps this is where our review can end. There are still many search engines out there, but some of them are ports of or add-ons over those already reviewed. For example, ezFind, the enterprise-level search server for eZ Publish, the company's own CMS, is not really a separate search engine but an interface to the standard Java Lucene, which it includes in its distribution. The same applies to the Search component from the eZ Components package: it provides a unified interface for accessing external search servers and, in particular, interacts with a Lucene server. Even such interesting and powerful solutions as Carrot and SearchBox are seriously modified versions of the same Lucene, significantly extended and complemented with new features. There are not many independent open-source search solutions on the market that fully implement indexing and search with their own algorithms. Which one to choose depends on you and on the often far-from-obvious particulars of your project.
Conclusions
Although the final decision on whether a particular search engine suits your project can be made only by you, and often only after detailed research and testing, some conclusions can be drawn right now.
Sphinx is right for you if you need to index large amounts of data from a MySQL database and you care about indexing and search speed, but you do not need specific search capabilities like fuzzy search and you are willing to allocate a separate server, or even a cluster, for it.
If you need to integrate a search module into your application, the best option is a ready-made port of the Lucene library to your language: they exist for all common languages, though not all features of the original are implemented. If you are developing an application in Java, then Lucene is definitely the best choice. However, keep in mind the rather slow indexing and the need for frequent index optimization (and its demands on CPU and disk speed). For PHP it seems to be the only acceptable option for a fully featured search without additional modules and extensions.
Xapian is a good, high-quality product, but less common and less flexible than the rest. For C++ applications with demands on the breadth of the query language it will be the best choice, but it requires manual work and modifications to embed it in your own code or to use it as a separate search server.
Related Links